diff mbox series

[v3,15/17] KVM: x86/tdp_mmu: Propagate building mirror page tables

Message ID 20240619223614.290657-16-rick.p.edgecombe@intel.com (mailing list archive)
State New, archived
Headers show
Series TDX MMU prep series part 1 | expand

Commit Message

Rick Edgecombe June 19, 2024, 10:36 p.m. UTC
From: Isaku Yamahata <isaku.yamahata@intel.com>

Integrate hooks for mirroring page table operations for cases where TDX
will set PTEs or link page tables.

Like other Coco technologies, TDX has the concept of private and shared
memory. For TDX the private and shared mappings are managed on separate
EPT roots. The private half is managed indirectly though calls into a
protected runtime environment called the TDX module, where the shared half
is managed within KVM in normal page tables.

Since calls into the TDX module are relatively slow, walking private page
tables by making calls into the TDX module would not be efficient. Because
of this, previous changes have taught the TDP MMU to keep a mirror root,
which is separate, unmapped TDP root that private operations can be
directed to. Currently this root is disconnected from any actual guest
mapping. Now add plumbing to propagate changes to the "external" page
tables being mirrored. Just create the x86_ops for now, leave plumbing the
operations into the TDX module for future patches.

Add two operations for setting up external page tables, one for linking
new page tables and one for setting leaf PTEs. Don't add any op for
configuring the root PFN, as TDX handles this itself. Don't provide a
way to set permissions on the PTEs also, as TDX doesn't support it.

This results is MMU "mirroring" support that is very targeted towards TDX.
Since it is likely there will be no other user, the main benefit of making
the support generic is to keep TDX specific *looking* code outside of the
MMU. As a generic feature it will make enough sense from TDX's
perspective. For developers unfamiliar with TDX arch it can express the
general concepts such that they can continue to work in the code.

TDX MMU support will exclude certain MMU operations, so only plug in the
mirroring x86 ops where they will be needed. For setting/linking, only
hook tdp_mmu_set_spte_atomic() which is use used for mapping and linking
PTs. Don't bother hooking tdp_mmu_iter_set_spte() as it is only used for
setting PTEs in operations unsupported by TDX: splitting huge pages and
write protecting. Sprinkle a KVM_BUG_ON()s to document as code that these
paths are not supported for mirrored page tables. For zapping operations,
leave those for near future changes.

Many operations in the TDP MMU depend on atomicity of the PTE update.
While the mirror PTE on KVM's side can be updated atomically, the update
that happens inside the external operations (S-EPT updates via TDX module
call) can't happen atomically with the mirror update. The following race
could result during two vCPU's populating private memory:

* vcpu 1: atomically update 2M level mirror EPT entry to be present
* vcpu 2: read 2M level EPT entry that is present
* vcpu 2: walk down into 4K level EPT
* vcpu 2: atomically update 4K level mirror EPT entry to be present
* vcpu 2: set_exterma;_spte() to update 4K secure EPT entry => error
          because 2M secure EPT entry is not populated yet
* vcpu 1: link_external_spt() to update 2M secure EPT entry

Prevent this by setting the mirror PTE to FROZEN_SPTE while the reflect
operations are performed. Only write the actual mirror PTE value once the
reflect operations has completed. When trying to set a PTE to present and
encountering a removed SPTE, retry the fault.

By doing this the race is prevented as follows:
* vcpu 1: atomically update 2M level EPT entry to be FROZEN_SPTE
* vcpu 2: read 2M level EPT entry that is FROZEN_SPTE
* vcpu 2: find that the EPT entry is frozen
          abandon page table walk to resume guest execution
* vcpu 1: link_external_spt() to update 2M secure EPT entry
* vcpu 1: atomically update 2M level EPT entry to be present (unfreeze)
* vcpu 2: resume guest execution
          Depending on vcpu 1 state, vcpu 2 may result in EPT violation
          again or make progress on guest execution

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU Prep v3:
 - Rename mirrored->external (Paolo)
 - Better comment on logic that bugs if doing tdp_mmu_set_spte() on
   present PTE. (Paolo)
 - Move zapping KVM_BUG_ON() to proper patch
 - Use spte_to_child_sp() (Paolo)
 - Drop unnessary comment in __tdp_mmu_set_spte_atomic() (Paolo)
 - Rename pfn->pfn_for_gfn to match remove_external_pte in next patch.
 - Rename REMOVED_SPTE to FROZEN_SPTE (Paolo)

TDX MMU Prep v2:
 - Split from "KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU"
 - Rename x86_ops from "private" to "reflect"
 - In response to "sp->mirrored_spt" rename helpers to "mirrored"
 - Drop unused old_pfn and new_pfn in handle_changed_spte()
 - Drop redundant is_shadow_present_pte() check in __tdp_mmu_set_spte_atomic
 - Adjust some warnings and KVM_BUG_ONs
---
 arch/x86/include/asm/kvm-x86-ops.h |  2 +
 arch/x86/include/asm/kvm_host.h    |  7 +++
 arch/x86/kvm/mmu/tdp_mmu.c         | 98 ++++++++++++++++++++++++++----
 3 files changed, 94 insertions(+), 13 deletions(-)

Comments

Binbin Wu June 20, 2024, 5:15 a.m. UTC | #1
On 6/20/2024 6:36 AM, Rick Edgecombe wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
>
> Integrate hooks for mirroring page table operations for cases where TDX
> will set PTEs or link page tables.
>
> Like other Coco technologies, TDX has the concept of private and shared
> memory. For TDX the private and shared mappings are managed on separate
> EPT roots. The private half is managed indirectly though calls into a
> protected runtime environment called the TDX module, where the shared half
> is managed within KVM in normal page tables.
>
> Since calls into the TDX module are relatively slow, walking private page
> tables by making calls into the TDX module would not be efficient. Because
> of this, previous changes have taught the TDP MMU to keep a mirror root,
> which is separate, unmapped TDP root that private operations can be
> directed to. Currently this root is disconnected from any actual guest
> mapping. Now add plumbing to propagate changes to the "external" page
> tables being mirrored. Just create the x86_ops for now, leave plumbing the
> operations into the TDX module for future patches.
>
> Add two operations for setting up external page tables, one for linking
> new page tables and one for setting leaf PTEs. Don't add any op for
> configuring the root PFN, as TDX handles this itself. Don't provide a
> way to set permissions on the PTEs also, as TDX doesn't support it.
>
> This results is MMU "mirroring" support that is very targeted towards TDX.
                ^
                extra "is"?
> Since it is likely there will be no other user, the main benefit of making
> the support generic is to keep TDX specific *looking* code outside of the
> MMU. As a generic feature it will make enough sense from TDX's
> perspective. For developers unfamiliar with TDX arch it can express the
> general concepts such that they can continue to work in the code.
>
> TDX MMU support will exclude certain MMU operations, so only plug in the
> mirroring x86 ops where they will be needed. For setting/linking, only
> hook tdp_mmu_set_spte_atomic() which is use used for mapping and linking
> PTs. Don't bother hooking tdp_mmu_iter_set_spte() as it is only used for
> setting PTEs in operations unsupported by TDX: splitting huge pages and
> write protecting. Sprinkle a KVM_BUG_ON()s to document as code that these
                             ^
                             extra "a"
> paths are not supported for mirrored page tables. For zapping operations,
> leave those for near future changes.
>
> Many operations in the TDP MMU depend on atomicity of the PTE update.
> While the mirror PTE on KVM's side can be updated atomically, the update
> that happens inside the external operations (S-EPT updates via TDX module
> call) can't happen atomically with the mirror update. The following race
> could result during two vCPU's populating private memory:
>
> * vcpu 1: atomically update 2M level mirror EPT entry to be present
> * vcpu 2: read 2M level EPT entry that is present
> * vcpu 2: walk down into 4K level EPT
> * vcpu 2: atomically update 4K level mirror EPT entry to be present
> * vcpu 2: set_exterma;_spte() to update 4K secure EPT entry => error
>            because 2M secure EPT entry is not populated yet
> * vcpu 1: link_external_spt() to update 2M secure EPT entry
>
> Prevent this by setting the mirror PTE to FROZEN_SPTE while the reflect
> operations are performed. Only write the actual mirror PTE value once the
             ^
             maybe "are being"
> reflect operations has completed. When trying to set a PTE to present and
> encountering a removed SPTE, retry the fault.
                 ^
                 frozen
>
> By doing this the race is prevented as follows:
> * vcpu 1: atomically update 2M level EPT entry to be FROZEN_SPTE
> * vcpu 2: read 2M level EPT entry that is FROZEN_SPTE
> * vcpu 2: find that the EPT entry is frozen
>            abandon page table walk to resume guest execution
> * vcpu 1: link_external_spt() to update 2M secure EPT entry
> * vcpu 1: atomically update 2M level EPT entry to be present (unfreeze)
> * vcpu 2: resume guest execution
>            Depending on vcpu 1 state, vcpu 2 may result in EPT violation
>            again or make progress on guest execution
>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Co-developed-by: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> TDX MMU Prep v3:
>   - Rename mirrored->external (Paolo)
>   - Better comment on logic that bugs if doing tdp_mmu_set_spte() on
>     present PTE. (Paolo)
>   - Move zapping KVM_BUG_ON() to proper patch
>   - Use spte_to_child_sp() (Paolo)
>   - Drop unnessary comment in __tdp_mmu_set_spte_atomic() (Paolo)
>   - Rename pfn->pfn_for_gfn to match remove_external_pte in next patch.
>   - Rename REMOVED_SPTE to FROZEN_SPTE (Paolo)
>
> TDX MMU Prep v2:
>   - Split from "KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU"
>   - Rename x86_ops from "private" to "reflect"
>   - In response to "sp->mirrored_spt" rename helpers to "mirrored"
>   - Drop unused old_pfn and new_pfn in handle_changed_spte()
>   - Drop redundant is_shadow_present_pte() check in __tdp_mmu_set_spte_atomic
>   - Adjust some warnings and KVM_BUG_ONs
> ---
>   arch/x86/include/asm/kvm-x86-ops.h |  2 +
>   arch/x86/include/asm/kvm_host.h    |  7 +++
>   arch/x86/kvm/mmu/tdp_mmu.c         | 98 ++++++++++++++++++++++++++----
>   3 files changed, 94 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 566d19b02483..3ef19fcb5e42 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -95,6 +95,8 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
>   KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
>   KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
>   KVM_X86_OP(load_mmu_pgd)
> +KVM_X86_OP_OPTIONAL(link_external_spt)
> +KVM_X86_OP_OPTIONAL(set_external_spte)
>   KVM_X86_OP(has_wbinvd_exit)
>   KVM_X86_OP(get_l2_tsc_offset)
>   KVM_X86_OP(get_l2_tsc_multiplier)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index d67e88a69fc4..12ff04135a0e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1738,6 +1738,13 @@ struct kvm_x86_ops {
>   	void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
>   			     int root_level);
>   
> +	/* Update external mapping with page table link */
Nit: Add "." at the end of the sentence.

> +	int (*link_external_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> +				void *mirrored_spt);
> +	/* Update the external page table from spte getting set */
Ditto.

> +	int (*set_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> +				 kvm_pfn_t pfn_for_gfn);
> +
>   	bool (*has_wbinvd_exit)(void);
>   
>   	u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 582e5a045bb7..bc1ad127046d 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -244,7 +244,7 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool mirror)
>   	struct kvm_mmu_page *root;
>   
>   	if (mirror)
> -		role.is_mirror = 1;
> +		role.is_mirror = true;
If this is needed, it's better to be done in "KVM: x86/tdp_mmu: Support 
mirror root for TDP MMU".

>   
>   	/*
>   	 * Check for an existing root before acquiring the pages lock to avoid
> @@ -440,6 +440,59 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
>   	call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
>   }
>   
[...]
Rick Edgecombe June 24, 2024, 11:52 p.m. UTC | #2
On Thu, 2024-06-20 at 13:15 +0800, Binbin Wu wrote:
> > -               role.is_mirror = 1;
> > +               role.is_mirror = true;
> If this is needed, it's better to be done in "KVM: x86/tdp_mmu: Support 
> mirror root for TDP MMU".

Whoops, and agreed on the textual issues. Thanks.
diff mbox series

Patch

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 566d19b02483..3ef19fcb5e42 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -95,6 +95,8 @@  KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
 KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
 KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
 KVM_X86_OP(load_mmu_pgd)
+KVM_X86_OP_OPTIONAL(link_external_spt)
+KVM_X86_OP_OPTIONAL(set_external_spte)
 KVM_X86_OP(has_wbinvd_exit)
 KVM_X86_OP(get_l2_tsc_offset)
 KVM_X86_OP(get_l2_tsc_multiplier)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d67e88a69fc4..12ff04135a0e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1738,6 +1738,13 @@  struct kvm_x86_ops {
 	void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 			     int root_level);
 
+	/* Update external mapping with page table link */
+	int (*link_external_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				void *mirrored_spt);
+	/* Update the external page table from spte getting set */
+	int (*set_external_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				 kvm_pfn_t pfn_for_gfn);
+
 	bool (*has_wbinvd_exit)(void);
 
 	u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 582e5a045bb7..bc1ad127046d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -244,7 +244,7 @@  void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool mirror)
 	struct kvm_mmu_page *root;
 
 	if (mirror)
-		role.is_mirror = 1;
+		role.is_mirror = true;
 
 	/*
 	 * Check for an existing root before acquiring the pages lock to avoid
@@ -440,6 +440,59 @@  static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
 	call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
 }
 
+static void *get_external_spt(gfn_t gfn, u64 new_spte, int level)
+{
+	if (is_shadow_present_pte(new_spte) && !is_last_spte(new_spte, level)) {
+		struct kvm_mmu_page *sp = spte_to_child_sp(new_spte);
+
+		WARN_ON_ONCE(sp->role.level + 1 != level);
+		WARN_ON_ONCE(sp->gfn != gfn);
+		return sp->external_spt;
+	}
+
+	return NULL;
+}
+
+static int __must_check set_external_spte_present(struct kvm *kvm, tdp_ptep_t sptep,
+						 gfn_t gfn, u64 old_spte,
+						 u64 new_spte, int level)
+{
+	bool was_present = is_shadow_present_pte(old_spte);
+	bool is_present = is_shadow_present_pte(new_spte);
+	bool is_leaf = is_present && is_last_spte(new_spte, level);
+	kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
+	int ret = 0;
+
+	KVM_BUG_ON(was_present, kvm);
+
+	lockdep_assert_held(&kvm->mmu_lock);
+	/*
+	 * We need to lock out other updates to the SPTE until the external
+	 * page table has been modified. Use FROZEN_SPTE similar to
+	 * the zapping case.
+	 */
+	if (!try_cmpxchg64(sptep, &old_spte, FROZEN_SPTE))
+		return -EBUSY;
+
+	/*
+	 * Use different call to either set up middle level
+	 * external page table, or leaf.
+	 */
+	if (is_leaf) {
+		ret = static_call(kvm_x86_set_external_spte)(kvm, gfn, level, new_pfn);
+	} else {
+		void *mirrored_spt = get_external_spt(gfn, new_spte, level);
+
+		KVM_BUG_ON(!mirrored_spt, kvm);
+		ret = static_call(kvm_x86_link_external_spt)(kvm, gfn, level, mirrored_spt);
+	}
+	if (ret)
+		__kvm_tdp_mmu_write_spte(sptep, old_spte);
+	else
+		__kvm_tdp_mmu_write_spte(sptep, new_spte);
+	return ret;
+}
+
 /**
  * handle_changed_spte - handle bookkeeping associated with an SPTE change
  * @kvm: kvm instance
@@ -548,7 +601,7 @@  static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
 }
 
-static inline int __tdp_mmu_set_spte_atomic(struct tdp_iter *iter, u64 new_spte)
+static inline int __tdp_mmu_set_spte_atomic(struct kvm *kvm, struct tdp_iter *iter, u64 new_spte)
 {
 	u64 *sptep = rcu_dereference(iter->sptep);
 
@@ -560,15 +613,25 @@  static inline int __tdp_mmu_set_spte_atomic(struct tdp_iter *iter, u64 new_spte)
 	 */
 	WARN_ON_ONCE(iter->yielded || is_frozen_spte(iter->old_spte));
 
-	/*
-	 * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
-	 * does not hold the mmu_lock.  On failure, i.e. if a different logical
-	 * CPU modified the SPTE, try_cmpxchg64() updates iter->old_spte with
-	 * the current value, so the caller operates on fresh data, e.g. if it
-	 * retries tdp_mmu_set_spte_atomic()
-	 */
-	if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
-		return -EBUSY;
+	if (is_mirror_sptep(iter->sptep) && !is_frozen_spte(new_spte)) {
+		int ret;
+
+		ret = set_external_spte_present(kvm, iter->sptep, iter->gfn,
+						iter->old_spte, new_spte, iter->level);
+		if (ret)
+			return ret;
+	} else {
+		/*
+		 * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs
+		 * and does not hold the mmu_lock.  On failure, i.e. if a
+		 * different logical CPU modified the SPTE, try_cmpxchg64()
+		 * updates iter->old_spte with the current value, so the caller
+		 * operates on fresh data, e.g. if it retries
+		 * tdp_mmu_set_spte_atomic()
+		 */
+		if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
+			return -EBUSY;
+	}
 
 	return 0;
 }
@@ -598,7 +661,7 @@  static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,
 
 	lockdep_assert_held_read(&kvm->mmu_lock);
 
-	ret = __tdp_mmu_set_spte_atomic(iter, new_spte);
+	ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
 	if (ret)
 		return ret;
 
@@ -623,7 +686,8 @@  static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
 	 * Delay processing of the zapped SPTE until after TLBs are flushed and
 	 * the FROZEN_SPTE is replaced (see below).
 	 */
-	ret = __tdp_mmu_set_spte_atomic(iter, FROZEN_SPTE);
+	ret = __tdp_mmu_set_spte_atomic(kvm, iter, FROZEN_SPTE);
+
 	if (ret)
 		return ret;
 
@@ -680,6 +744,14 @@  static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
 	old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
 
 	handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
+
+	/*
+	 * Users that do non-atomic setting of PTEs don't operate on mirror
+	 * roots, so don't handle it and bug the VM if it's seen.
+	 */
+	if (is_mirror_sptep(sptep))
+		KVM_BUG_ON(is_shadow_present_pte(new_spte), kvm);
+
 	return old_spte;
 }