[10/16] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU

Message ID 20240515005952.3410568-11-rick.p.edgecombe@intel.com (mailing list archive)
State New, archived
Series TDX MMU prep series part 1

Commit Message

Rick Edgecombe May 15, 2024, 12:59 a.m. UTC
From: Isaku Yamahata <isaku.yamahata@intel.com>

Allocate a mirrored page table for the private page table and implement MMU
hooks to operate on the private page table.

To handle a page fault on a private GPA, KVM walks the mirrored page table
in unencrypted memory and then uses MMU hooks in kvm_x86_ops to propagate
changes from the mirrored page table to the private page table.

  private KVM page fault   |
      |                    |
      V                    |
 private GPA               |     CPU protected EPTP
      |                    |           |
      V                    |           V
 mirrored PT root          |     private PT root
      |                    |           |
      V                    |           V
   mirrored PT --hook to propagate-->private PT
      |                    |           |
      \--------------------+------\    |
                           |      |    |
                           |      V    V
                           |    private guest page
                           |
                           |
     non-encrypted memory  |    encrypted memory
                           |

PT:          page table
Private PT:  the CPU uses it, but it is invisible to KVM. The TDX module
             manages this table to map private guest pages.
Mirrored PT: it is visible to KVM, but the CPU doesn't use it. KVM uses it
             to propagate PT changes to the actual private PT.

SPTEs in the mirrored page table (referred to as mirrored SPTEs hereafter)
can be modified atomically with mmu_lock held for read. However, the MMU
hooks to the private page table are not atomic operations.

To address this, a special REMOVED_SPTE is introduced and the sequence
below is used when mirrored SPTEs are updated atomically.

1. Mirrored SPTE is first atomically written to REMOVED_SPTE.
2. The successful updater of the mirrored SPTE in step 1 proceeds with the
   following steps.
3. Invoke MMU hooks to modify private page table with the target value.
4. (a) On hook success, update mirrored SPTE to target value.
   (b) On hook failure, restore mirrored SPTE to original value.

KVM TDP MMU ensures other threads will not overwrite REMOVED_SPTE.
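
As a rough illustration only, the freeze-then-propagate sequence above can be
modeled in user space with C11 atomics. Everything prefixed sketch_ or SKETCH_
is hypothetical and not taken from the patch: the REMOVED_SPTE encoding is a
placeholder, and the hook is a stand-in for the kvm_x86_ops callbacks.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for REMOVED_SPTE; the real encoding differs. */
#define SKETCH_REMOVED_SPTE UINT64_C(0x5a0)

/* Hypothetical hook type: returns 0 on success, nonzero on failure. */
typedef int (*sketch_hook_t)(uint64_t new_spte);

static int sketch_hook_ok(uint64_t new_spte)   { (void)new_spte; return 0; }
static int sketch_hook_fail(uint64_t new_spte) { (void)new_spte; return -1; }

/* Returns true if the mirrored SPTE ends up holding new_spte. */
static bool sketch_set_mirrored_spte(_Atomic uint64_t *sptep, uint64_t old_spte,
				     uint64_t new_spte, sketch_hook_t hook)
{
	uint64_t expected = old_spte;

	assert(hook);

	/* Never try to replace an SPTE that another thread has frozen. */
	if (old_spte == SKETCH_REMOVED_SPTE)
		return false;

	/* Steps 1-2: freeze the SPTE; only the cmpxchg winner continues. */
	if (!atomic_compare_exchange_strong(sptep, &expected,
					    SKETCH_REMOVED_SPTE))
		return false;

	/* Step 3: the non-atomic propagation to the private page table. */
	if (hook(new_spte)) {
		/* Step 4(b): hook failed, restore the original value. */
		atomic_store(sptep, old_spte);
		return false;
	}

	/* Step 4(a): hook succeeded, publish the target value. */
	atomic_store(sptep, new_spte);
	return true;
}
```

In this model only the thread that wins the compare-and-exchange reaches the
hook, so the non-atomic hook is serialized per SPTE even though no exclusive
lock is held.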

This sequence also applies when SPTEs are atomically updated from
non-present to present, in order to prevent potential conflicts when
multiple vCPUs attempt to set private SPTEs to a different page size
simultaneously, though only the 4K page size is currently supported for
the private page table.

2M page support can be done in future patches.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU Part 1:
 - Remove unnecessary gfn, access twist in
   tdp_mmu_map_handle_target_level(). (Chao Gao)
 - Open code call to kvm_mmu_alloc_private_spt() instead of doing it in
   tdp_mmu_alloc_sp()
 - Update comment in set_private_spte_present() (Yan)
 - Open code call to kvm_mmu_init_private_spt() (Yan)
 - Add comments on TDX MMU hooks (Yan)
 - Fix various whitespace alignment (Yan)
 - Remove pointless warnings and conditionals in
   handle_removed_private_spte() (Yan)
 - Remove redundant lockdep assert in tdp_mmu_set_spte() (Yan)
 - Remove incorrect comment in handle_changed_spte() (Yan)
 - Remove unneeded kvm_pfn_to_refcounted_page() and
   is_error_noslot_pfn() check in kvm_tdp_mmu_map() (Yan)
 - Do kvm_gfn_for_root() branchless (Rick)
 - Update kvm_tdp_mmu_alloc_root() callers to not check error code (Rick)
 - Add comment for stripping shared bit for fault.gfn (Chao)

v19:
- drop CONFIG_KVM_MMU_PRIVATE

v18:
- Rename freezed => frozen

v14 -> v15:
- Refined is_private condition check in kvm_tdp_mmu_map().
  Add kvm_gfn_shared_mask() check.
- catch up for struct kvm_range change
---
 arch/x86/include/asm/kvm-x86-ops.h |   5 +
 arch/x86/include/asm/kvm_host.h    |  25 +++
 arch/x86/kvm/mmu/mmu.c             |  13 +-
 arch/x86/kvm/mmu/mmu_internal.h    |  19 +-
 arch/x86/kvm/mmu/tdp_iter.h        |   2 +-
 arch/x86/kvm/mmu/tdp_mmu.c         | 269 +++++++++++++++++++++++++----
 arch/x86/kvm/mmu/tdp_mmu.h         |   2 +-
 7 files changed, 293 insertions(+), 42 deletions(-)

Comments

Isaku Yamahata May 15, 2024, 5:35 p.m. UTC | #1
On Tue, May 14, 2024 at 05:59:46PM -0700,
Rick Edgecombe <rick.p.edgecombe@intel.com> wrote:

...snip...

> @@ -619,6 +776,8 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
>  	 */
>  	__kvm_tdp_mmu_write_spte(iter->sptep, SHADOW_NONPRESENT_VALUE);
>  
> +
> +	role = sptep_to_sp(iter->sptep)->role;
>  	/*
>  	 * Process the zapped SPTE after flushing TLBs, and after replacing
>  	 * REMOVED_SPTE with 0. This minimizes the amount of time vCPUs are
> @@ -626,7 +785,7 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
>  	 * SPTEs.
>  	 */
>  	handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> -			    0, iter->level, true);
> +			    SHADOW_NONPRESENT_VALUE, role, true);
>  
>  	return 0;
>  }

This SHADOW_NONPRESENT_VALUE change should go to another patch at [1].
I replied to [1].

[1] https://lore.kernel.org/kvm/20240507154459.3950778-3-pbonzini@redhat.com/

Rick Edgecombe May 15, 2024, 6 p.m. UTC | #2
On Wed, 2024-05-15 at 10:35 -0700, Isaku Yamahata wrote:
> 
> ...snip...
> 
> > @@ -619,6 +776,8 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm
> > *kvm,
> >          */
> >         __kvm_tdp_mmu_write_spte(iter->sptep, SHADOW_NONPRESENT_VALUE);
> >   
> > +
> > +       role = sptep_to_sp(iter->sptep)->role;
> >         /*
> >          * Process the zapped SPTE after flushing TLBs, and after replacing
> >          * REMOVED_SPTE with 0. This minimizes the amount of time vCPUs are
> > @@ -626,7 +785,7 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm
> > *kvm,
> >          * SPTEs.
> >          */
> >         handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> > -                           0, iter->level, true);
> > +                           SHADOW_NONPRESENT_VALUE, role, true);
> >   
> >         return 0;
> >   }
> 
> This SHADOW_NONPRESENT_VALUE change should go to another patch at [1]
> I replied to [1].

Thanks. This call site got added in an upstream patch recently, so you didn't
miss it.

Kai Huang May 16, 2024, 12:52 a.m. UTC | #3
On 15/05/2024 12:59 pm, Rick Edgecombe wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Allocate mirrored page table for the private page table and implement MMU
> hooks to operate on the private page table.
> 
> To handle page fault to a private GPA, KVM walks the mirrored page table in
> unencrypted memory and then uses MMU hooks in kvm_x86_ops to propagate
> changes from the mirrored page table to private page table.
> 
>    private KVM page fault   |
>        |                    |
>        V                    |
>   private GPA               |     CPU protected EPTP
>        |                    |           |
>        V                    |           V
>   mirrored PT root          |     private PT root
>        |                    |           |
>        V                    |           V
>     mirrored PT --hook to propagate-->private PT
>        |                    |           |
>        \--------------------+------\    |
>                             |      |    |
>                             |      V    V
>                             |    private guest page
>                             |
>                             |
>       non-encrypted memory  |    encrypted memory
>                             |
> 
> PT:         page table
> Private PT: the CPU uses it, but it is invisible to KVM. TDX module manages
>              this table to map private guest pages.
> Mirrored PT:It is visible to KVM, but the CPU doesn't use it. KVM uses it
>              to propagate PT change to the actual private PT.
> 
> SPTEs in mirrored page table (refer to them as mirrored SPTEs hereafter)
> can be modified atomically with mmu_lock held for read, however, the MMU
> hooks to private page table are not atomic operations.
> 
> To address it, a special REMOVED_SPTE is introduced and below sequence is
> used when mirrored SPTEs are updated atomically.
> 
> 1. Mirrored SPTE is first atomically written to REMOVED_SPTE.
> 2. The successful updater of the mirrored SPTE in step 1 proceeds with the
>     following steps.
> 3. Invoke MMU hooks to modify private page table with the target value.
> 4. (a) On hook succeeds, update mirrored SPTE to target value.
>     (b) On hook failure, restore mirrored SPTE to original value.
> 
> KVM TDP MMU ensures other threads will not overwrite REMOVED_SPTE.
> 
> This sequence also applies when SPTEs are atomically updated from
> non-present to present in order to prevent potential conflicts when
> multiple vCPUs attempt to set private SPTEs to a different page size
> simultaneously, though 4K page size is only supported for private page
> table currently.
> 
> 2M page support can be done in future patches.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Co-developed-by: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> TDX MMU Part 1:
>   - Remove unnecessary gfn, access twist in
>     tdp_mmu_map_handle_target_level(). (Chao Gao)
>   - Open code call to kvm_mmu_alloc_private_spt() instead of doing it in
>     tdp_mmu_alloc_sp()
>   - Update comment in set_private_spte_present() (Yan)
>   - Open code call to kvm_mmu_init_private_spt() (Yan)
>   - Add comments on TDX MMU hooks (Yan)
>   - Fix various whitespace alignment (Yan)
>   - Remove pointless warnings and conditionals in
>     handle_removed_private_spte() (Yan)
>   - Remove redundant lockdep assert in tdp_mmu_set_spte() (Yan)
>   - Remove incorrect comment in handle_changed_spte() (Yan)
>   - Remove unneeded kvm_pfn_to_refcounted_page() and
>     is_error_noslot_pfn() check in kvm_tdp_mmu_map() (Yan)
>   - Do kvm_gfn_for_root() branchless (Rick)
>   - Update kvm_tdp_mmu_alloc_root() callers to not check error code (Rick)
>   - Add comment for stripping shared bit for fault.gfn (Chao)
> 
> v19:
> - drop CONFIG_KVM_MMU_PRIVATE
> 
> v18:
> - Rename freezed => frozen
> 
> v14 -> v15:
> - Refined is_private condition check in kvm_tdp_mmu_map().
>    Add kvm_gfn_shared_mask() check.
> - catch up for struct kvm_range change
> ---
>   arch/x86/include/asm/kvm-x86-ops.h |   5 +
>   arch/x86/include/asm/kvm_host.h    |  25 +++
>   arch/x86/kvm/mmu/mmu.c             |  13 +-
>   arch/x86/kvm/mmu/mmu_internal.h    |  19 +-
>   arch/x86/kvm/mmu/tdp_iter.h        |   2 +-
>   arch/x86/kvm/mmu/tdp_mmu.c         | 269 +++++++++++++++++++++++++----
>   arch/x86/kvm/mmu/tdp_mmu.h         |   2 +-
>   7 files changed, 293 insertions(+), 42 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 566d19b02483..d13cb4b8fce6 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -95,6 +95,11 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
>   KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
>   KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
>   KVM_X86_OP(load_mmu_pgd)
> +KVM_X86_OP_OPTIONAL(link_private_spt)
> +KVM_X86_OP_OPTIONAL(free_private_spt)
> +KVM_X86_OP_OPTIONAL(set_private_spte)
> +KVM_X86_OP_OPTIONAL(remove_private_spte)
> +KVM_X86_OP_OPTIONAL(zap_private_spte)
>   KVM_X86_OP(has_wbinvd_exit)
>   KVM_X86_OP(get_l2_tsc_offset)
>   KVM_X86_OP(get_l2_tsc_multiplier)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index d010ca5c7f44..20fa8fa58692 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -470,6 +470,7 @@ struct kvm_mmu {
>   	int (*sync_spte)(struct kvm_vcpu *vcpu,
>   			 struct kvm_mmu_page *sp, int i);
>   	struct kvm_mmu_root_info root;
> +	hpa_t private_root_hpa;

Should we have

	struct kvm_mmu_root_info private_root;

instead?

>   	union kvm_cpu_role cpu_role;
>   	union kvm_mmu_page_role root_role;
>   
> @@ -1747,6 +1748,30 @@ struct kvm_x86_ops {
>   	void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
>   			     int root_level);
>   
> +	/* Add a page as page table page into private page table */
> +	int (*link_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> +				void *private_spt);
> +	/*
> +	 * Free a page table page of private page table.
> +	 * Only expected to be called when guest is not active, specifically
> +	 * during VM destruction phase.
> +	 */
> +	int (*free_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> +				void *private_spt);
> +
> +	/* Add a guest private page into private page table */
> +	int (*set_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> +				kvm_pfn_t pfn);
> +
> +	/* Remove a guest private page from private page table*/
> +	int (*remove_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> +				   kvm_pfn_t pfn);
> +	/*
> +	 * Keep a guest private page mapped in private page table, but clear its
> +	 * present bit
> +	 */
> +	int (*zap_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level);
> +
>   	bool (*has_wbinvd_exit)(void);
>   
>   	u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 76f92cb37a96..2506d6277818 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3701,7 +3701,9 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
>   	int r;
>   
>   	if (tdp_mmu_enabled) {
> -		kvm_tdp_mmu_alloc_root(vcpu);
> +		if (kvm_gfn_shared_mask(vcpu->kvm))
> +			kvm_tdp_mmu_alloc_root(vcpu, true);

As mentioned in replies to other patches, I kinda prefer

	kvm->arch.has_mirrored_pt (or has_mirrored_private_pt)

Or we have a helper

	kvm_has_mirrored_pt() / kvm_has_mirrored_private_pt()

> +		kvm_tdp_mmu_alloc_root(vcpu, false);
>   		return 0;
>   	}
>   
> @@ -4685,7 +4687,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>   	if (kvm_mmu_honors_guest_mtrrs(vcpu->kvm)) {
>   		for ( ; fault->max_level > PG_LEVEL_4K; --fault->max_level) {
>   			int page_num = KVM_PAGES_PER_HPAGE(fault->max_level);
> -			gfn_t base = gfn_round_for_level(fault->gfn,
> +			gfn_t base = gfn_round_for_level(gpa_to_gfn(fault->addr),
>   							 fault->max_level);

I thought by reaching here the shared bit has already been stripped away 
by the caller?

It doesn't make a lot of sense to still have it here, given we have a
universal KVM-defined PFERR_PRIVATE_ACCESS flag:

https://lore.kernel.org/kvm/20240507155817.3951344-2-pbonzini@redhat.com/T/#mb30987f31b431771b42dfa64dcaa2efbc10ada5e

IMHO we should just strip the shared bit in the TDX variant of
handle_ept_violation(), and pass PFERR_PRIVATE_ACCESS (when the GPA
doesn't have the shared bit) to the common fault handler so it can correctly
set fault->is_private to true.
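
A minimal sketch of that suggestion, as a self-contained user-space model
rather than kernel code. All sketch_/SKETCH_ names are hypothetical, the
shared bit is assumed to be GPA bit 47 purely for illustration, and the bit
chosen for the private-access flag is a placeholder, not the value from the
referenced patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gpa_t;

/* Assumption for illustration: the shared bit is GPA bit 47. */
#define SKETCH_GPA_SHARED_BIT		(UINT64_C(1) << 47)
/* Placeholder bit position for the KVM-defined private-access flag. */
#define SKETCH_PFERR_PRIVATE_ACCESS	(UINT64_C(1) << 49)

/* Hypothetical subset of what the common fault handler would receive. */
struct sketch_fault {
	gpa_t addr;		/* GPA with the shared bit already stripped */
	uint64_t error_code;
	bool is_private;
};

/*
 * Model of a TDX-specific EPT violation handler: strip the shared bit once,
 * and express "private" through the error code instead of through the GPA.
 */
static struct sketch_fault sketch_tdx_prepare_fault(gpa_t gpa,
						    uint64_t error_code)
{
	struct sketch_fault fault;

	/* The flag is software-defined, so the input must not carry it. */
	assert(!(error_code & SKETCH_PFERR_PRIVATE_ACCESS));

	if (!(gpa & SKETCH_GPA_SHARED_BIT))
		error_code |= SKETCH_PFERR_PRIVATE_ACCESS;

	fault.addr = gpa & ~SKETCH_GPA_SHARED_BIT;
	fault.error_code = error_code;
	/* The common handler derives is_private from the flag alone. */
	fault.is_private = error_code & SKETCH_PFERR_PRIVATE_ACCESS;
	return fault;
}
```

With this shape, code after the vendor handler never needs to know where the
shared bit lives in the GPA.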


>   
>   			if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num))
> @@ -6245,6 +6247,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
>   
>   	mmu->root.hpa = INVALID_PAGE;
>   	mmu->root.pgd = 0;
> +	mmu->private_root_hpa = INVALID_PAGE;
>   	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
>   		mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
>   
> @@ -7263,6 +7266,12 @@ int kvm_mmu_vendor_module_init(void)
>   void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
>   {
>   	kvm_mmu_unload(vcpu);
> +	if (tdp_mmu_enabled) {
> +		read_lock(&vcpu->kvm->mmu_lock);
> +		mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu->private_root_hpa,
> +				   NULL);
> +		read_unlock(&vcpu->kvm->mmu_lock);
> +	}

Hmm.. I don't quite like this, but sorry I kinda forgot why we need to
do this here.

Could you elaborate?

Anyway, from common code's perspective, we need to have some
clarification on why we chose to do it here.

>   	free_mmu_pages(&vcpu->arch.root_mmu);
>   	free_mmu_pages(&vcpu->arch.guest_mmu);
>   	mmu_free_memory_caches(vcpu);
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 0f1a9d733d9e..3a7fe9261e23 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -6,6 +6,8 @@
>   #include <linux/kvm_host.h>
>   #include <asm/kvm_host.h>
>   
> +#include "mmu.h"
> +
>   #ifdef CONFIG_KVM_PROVE_MMU
>   #define KVM_MMU_WARN_ON(x) WARN_ON_ONCE(x)
>   #else
> @@ -178,6 +180,16 @@ static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, struct kvm_m
>   	sp->private_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_private_spt_cache);
>   }
>   
> +static inline gfn_t kvm_gfn_for_root(struct kvm *kvm, struct kvm_mmu_page *root,
> +				     gfn_t gfn)
> +{
> +	gfn_t gfn_for_root = kvm_gfn_to_private(kvm, gfn);
> +
> +	/* Set shared bit if not private */
> +	gfn_for_root |= -(gfn_t)!is_private_sp(root) & kvm_gfn_shared_mask(kvm);
> +	return gfn_for_root;
> +}
> +
>   static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
>   {
>   	/*
> @@ -348,7 +360,12 @@ static inline int __kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gp
>   	int r;
>   
>   	if (vcpu->arch.mmu->root_role.direct) {
> -		fault.gfn = fault.addr >> PAGE_SHIFT;
> +		/*
> +		 * Things like memslots don't understand the concept of a shared
> +		 * bit. Strip it so that the GFN can be used like normal, and the
> +		 * fault.addr can be used when the shared bit is needed.
> +		 */
> +		fault.gfn = gpa_to_gfn(fault.addr) & ~kvm_gfn_shared_mask(vcpu->kvm);
>   		fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);

Again, I don't think it's necessary for fault.gfn to still have the shared
bit here?

This kind of usage is pretty much the reason I want to get rid of
kvm_gfn_shared_mask().

>   	}
>   
> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> index fae559559a80..8a64bcef9deb 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.h
> +++ b/arch/x86/kvm/mmu/tdp_iter.h
> @@ -91,7 +91,7 @@ struct tdp_iter {
>   	tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
>   	/* A pointer to the current SPTE */
>   	tdp_ptep_t sptep;
> -	/* The lowest GFN mapped by the current SPTE */
> +	/* The lowest GFN (shared bits included) mapped by the current SPTE */
>   	gfn_t gfn;

IMHO we need more clarification of this design.

We at least need to call out that the TDX hardware uses the 'GPA + shared
bit' when it walks the page table for shared mappings, so we must set up
the mapping at the GPA with the shared bit.

E.g., because TDX hardware uses separate roots for shared/private
mappings, I think it would be a reasonable option for the TDX hardware to
just use the actual GPA w/o the shared bit when it walks the shared page
table, and still report the EPT violation with the shared bit set in the GPA.

Such a HW implementation detail is completely hidden from software, and thus
should be clarified in the changelog/comments.


>   	/* The level of the root page given to the iterator */
>   	int root_level;

[...]

>   	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
> @@ -1029,8 +1209,8 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
>   		new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
>   	else
>   		wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
> -					 fault->pfn, iter->old_spte, fault->prefetch, true,
> -					 fault->map_writable, &new_spte);
> +					fault->pfn, iter->old_spte, fault->prefetch, true,
> +					fault->map_writable, &new_spte);
>   
>   	if (new_spte == iter->old_spte)
>   		ret = RET_PF_SPURIOUS;
> @@ -1108,6 +1288,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>   	struct kvm *kvm = vcpu->kvm;
>   	struct tdp_iter iter;
>   	struct kvm_mmu_page *sp;
> +	gfn_t raw_gfn;
> +	bool is_private = fault->is_private && kvm_gfn_shared_mask(kvm);

Ditto.  I wish we could have 'has_mirrored_private_pt'.

Rick Edgecombe May 16, 2024, 1:27 a.m. UTC | #4
On Thu, 2024-05-16 at 12:52 +1200, Huang, Kai wrote:
> 
> 
> On 15/05/2024 12:59 pm, Rick Edgecombe wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > 
> > Allocate mirrored page table for the private page table and implement MMU
> > hooks to operate on the private page table.
> > 
> > To handle page fault to a private GPA, KVM walks the mirrored page table in
> > unencrypted memory and then uses MMU hooks in kvm_x86_ops to propagate
> > changes from the mirrored page table to private page table.
> > 
> >    private KVM page fault   |
> >        |                    |
> >        V                    |
> >   private GPA               |     CPU protected EPTP
> >        |                    |           |
> >        V                    |           V
> >   mirrored PT root          |     private PT root
> >        |                    |           |
> >        V                    |           V
> >     mirrored PT --hook to propagate-->private PT
> >        |                    |           |
> >        \--------------------+------\    |
> >                             |      |    |
> >                             |      V    V
> >                             |    private guest page
> >                             |
> >                             |
> >       non-encrypted memory  |    encrypted memory
> >                             |
> > 
> > PT:         page table
> > Private PT: the CPU uses it, but it is invisible to KVM. TDX module manages
> >              this table to map private guest pages.
> > Mirrored PT:It is visible to KVM, but the CPU doesn't use it. KVM uses it
> >              to propagate PT change to the actual private PT.
> > 
> > SPTEs in mirrored page table (refer to them as mirrored SPTEs hereafter)
> > can be modified atomically with mmu_lock held for read, however, the MMU
> > hooks to private page table are not atomic operations.
> > 
> > To address it, a special REMOVED_SPTE is introduced and below sequence is
> > used when mirrored SPTEs are updated atomically.
> > 
> > 1. Mirrored SPTE is first atomically written to REMOVED_SPTE.
> > 2. The successful updater of the mirrored SPTE in step 1 proceeds with the
> >     following steps.
> > 3. Invoke MMU hooks to modify private page table with the target value.
> > 4. (a) On hook succeeds, update mirrored SPTE to target value.
> >     (b) On hook failure, restore mirrored SPTE to original value.
> > 
> > KVM TDP MMU ensures other threads will not overwrite REMOVED_SPTE.
> > 
> > This sequence also applies when SPTEs are atomically updated from
> > non-present to present in order to prevent potential conflicts when
> > multiple vCPUs attempt to set private SPTEs to a different page size
> > simultaneously, though 4K page size is only supported for private page
> > table currently.
> > 
> > 2M page support can be done in future patches.
> > 
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > Co-developed-by: Kai Huang <kai.huang@intel.com>
> > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > ---
> > TDX MMU Part 1:
> >   - Remove unnecessary gfn, access twist in
> >     tdp_mmu_map_handle_target_level(). (Chao Gao)
> >   - Open code call to kvm_mmu_alloc_private_spt() instead of doing it in
> >     tdp_mmu_alloc_sp()
> >   - Update comment in set_private_spte_present() (Yan)
> >   - Open code call to kvm_mmu_init_private_spt() (Yan)
> >   - Add comments on TDX MMU hooks (Yan)
> >   - Fix various whitespace alignment (Yan)
> >   - Remove pointless warnings and conditionals in
> >     handle_removed_private_spte() (Yan)
> >   - Remove redundant lockdep assert in tdp_mmu_set_spte() (Yan)
> >   - Remove incorrect comment in handle_changed_spte() (Yan)
> >   - Remove unneeded kvm_pfn_to_refcounted_page() and
> >     is_error_noslot_pfn() check in kvm_tdp_mmu_map() (Yan)
> >   - Do kvm_gfn_for_root() branchless (Rick)
> >   - Update kvm_tdp_mmu_alloc_root() callers to not check error code (Rick)
> >   - Add comment for stripping shared bit for fault.gfn (Chao)
> > 
> > v19:
> > - drop CONFIG_KVM_MMU_PRIVATE
> > 
> > v18:
> > - Rename freezed => frozen
> > 
> > v14 -> v15:
> > - Refined is_private condition check in kvm_tdp_mmu_map().
> >    Add kvm_gfn_shared_mask() check.
> > - catch up for struct kvm_range change
> > ---
> >   arch/x86/include/asm/kvm-x86-ops.h |   5 +
> >   arch/x86/include/asm/kvm_host.h    |  25 +++
> >   arch/x86/kvm/mmu/mmu.c             |  13 +-
> >   arch/x86/kvm/mmu/mmu_internal.h    |  19 +-
> >   arch/x86/kvm/mmu/tdp_iter.h        |   2 +-
> >   arch/x86/kvm/mmu/tdp_mmu.c         | 269 +++++++++++++++++++++++++----
> >   arch/x86/kvm/mmu/tdp_mmu.h         |   2 +-
> >   7 files changed, 293 insertions(+), 42 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-
> > x86-ops.h
> > index 566d19b02483..d13cb4b8fce6 100644
> > --- a/arch/x86/include/asm/kvm-x86-ops.h
> > +++ b/arch/x86/include/asm/kvm-x86-ops.h
> > @@ -95,6 +95,11 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
> >   KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
> >   KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
> >   KVM_X86_OP(load_mmu_pgd)
> > +KVM_X86_OP_OPTIONAL(link_private_spt)
> > +KVM_X86_OP_OPTIONAL(free_private_spt)
> > +KVM_X86_OP_OPTIONAL(set_private_spte)
> > +KVM_X86_OP_OPTIONAL(remove_private_spte)
> > +KVM_X86_OP_OPTIONAL(zap_private_spte)
> >   KVM_X86_OP(has_wbinvd_exit)
> >   KVM_X86_OP(get_l2_tsc_offset)
> >   KVM_X86_OP(get_l2_tsc_multiplier)
> > diff --git a/arch/x86/include/asm/kvm_host.h
> > b/arch/x86/include/asm/kvm_host.h
> > index d010ca5c7f44..20fa8fa58692 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -470,6 +470,7 @@ struct kvm_mmu {
> >         int (*sync_spte)(struct kvm_vcpu *vcpu,
> >                          struct kvm_mmu_page *sp, int i);
> >         struct kvm_mmu_root_info root;
> > +       hpa_t private_root_hpa;
> 
> Should we have
> 
>         struct kvm_mmu_root_info private_root;
> 
> instead?

This corresponds to:
mmu->root.hpa

We don't need the other fields, so I think it's better to not take up the
space. It does look asymmetric though...

> 
> >         union kvm_cpu_role cpu_role;
> >         union kvm_mmu_page_role root_role;
> >   
> > @@ -1747,6 +1748,30 @@ struct kvm_x86_ops {
> >         void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> >                              int root_level);
> >   
> > +       /* Add a page as page table page into private page table */
> > +       int (*link_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level
> > level,
> > +                               void *private_spt);
> > +       /*
> > +        * Free a page table page of private page table.
> > +        * Only expected to be called when guest is not active, specifically
> > +        * during VM destruction phase.
> > +        */
> > +       int (*free_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level
> > level,
> > +                               void *private_spt);
> > +
> > +       /* Add a guest private page into private page table */
> > +       int (*set_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level
> > level,
> > +                               kvm_pfn_t pfn);
> > +
> > +       /* Remove a guest private page from private page table*/
> > +       int (*remove_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level
> > level,
> > +                                  kvm_pfn_t pfn);
> > +       /*
> > +        * Keep a guest private page mapped in private page table, but clear
> > its
> > +        * present bit
> > +        */
> > +       int (*zap_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level
> > level);
> > +
> >         bool (*has_wbinvd_exit)(void);
> >   
> >         u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 76f92cb37a96..2506d6277818 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3701,7 +3701,9 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu
> > *vcpu)
> >         int r;
> >   
> >         if (tdp_mmu_enabled) {
> > -               kvm_tdp_mmu_alloc_root(vcpu);
> > +               if (kvm_gfn_shared_mask(vcpu->kvm))
> > +                       kvm_tdp_mmu_alloc_root(vcpu, true);
> 
> As mentioned in replies to other patches, I kinda prefer
> 
>         kvm->arch.has_mirrored_pt (or has_mirrored_private_pt)
> 
> Or we have a helper
> 
>         kvm_has_mirrored_pt() / kvm_has_mirrored_private_pt()

Yep I think everyone is on board with not doing kvm_gfn_shared_mask() for these
checks at this point.

> 
> > +               kvm_tdp_mmu_alloc_root(vcpu, false);
> >                 return 0;
> >         }
> >   
> > @@ -4685,7 +4687,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct
> > kvm_page_fault *fault)
> >         if (kvm_mmu_honors_guest_mtrrs(vcpu->kvm)) {
> >                 for ( ; fault->max_level > PG_LEVEL_4K; --fault->max_level)
> > {
> >                         int page_num = KVM_PAGES_PER_HPAGE(fault-
> > >max_level);
> > -                       gfn_t base = gfn_round_for_level(fault->gfn,
> > +                       gfn_t base = gfn_round_for_level(gpa_to_gfn(fault-
> > >addr),
> >                                                          fault->max_level);
> 
> I thought by reaching here the shared bit has already been stripped away 
> by the caller?

We don't support MTRRs, so this code won't be executed for TDX, but it's not
clear what you are asking.
fault->addr has the shared bit (if present).
fault->gfn has it stripped.
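
To make the addr/gfn distinction concrete, here is a small user-space sketch
of the arithmetic, including the branchless OR used by kvm_gfn_for_root() in
the patch. The sketch_/SKETCH_ names are hypothetical, and placing the shared
bit at GPA bit 47 is an assumption made only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gpa_t;
typedef uint64_t gfn_t;

#define SKETCH_PAGE_SHIFT	12
/* Assumption for illustration: the shared bit is GPA bit 47. */
#define SKETCH_GPA_SHARED_BIT	(UINT64_C(1) << 47)
#define SKETCH_GFN_SHARED_MASK	(SKETCH_GPA_SHARED_BIT >> SKETCH_PAGE_SHIFT)

static gfn_t sketch_gpa_to_gfn(gpa_t gpa)
{
	return gpa >> SKETCH_PAGE_SHIFT;
}

/* Models fault.gfn: shared bit stripped so memslot lookups work. */
static gfn_t sketch_fault_gfn(gpa_t addr)
{
	return sketch_gpa_to_gfn(addr) & ~SKETCH_GFN_SHARED_MASK;
}

/* Models kvm_gfn_for_root(): re-add the shared bit for non-private roots. */
static gfn_t sketch_gfn_for_root(bool root_is_private, gfn_t gfn)
{
	gfn_t gfn_for_root = gfn & ~SKETCH_GFN_SHARED_MASK;

	/*
	 * -(gfn_t)1 is all ones and -(gfn_t)0 is zero, so the mask is ORed
	 * in only when the root is not private, without a branch.
	 */
	gfn_for_root |= -(gfn_t)!root_is_private & SKETCH_GFN_SHARED_MASK;
	return gfn_for_root;
}

/* Walk-through for one fault address that has the shared bit set. */
static void sketch_demo(void)
{
	gpa_t addr = SKETCH_GPA_SHARED_BIT | 0x1234000;
	gfn_t gfn = sketch_fault_gfn(addr);

	assert(gfn == 0x1234);
	assert(sketch_gfn_for_root(true, gfn) == 0x1234);
	assert(sketch_gfn_for_root(false, gfn) == sketch_gpa_to_gfn(addr));
}
```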

> 
> It doesn't make a lot of sense to still have it here, given we have a
> universal KVM-defined PFERR_PRIVATE_ACCESS flag:
> 
> https://lore.kernel.org/kvm/20240507155817.3951344-2-pbonzini@redhat.com/T/#mb30987f31b431771b42dfa64dcaa2efbc10ada5e
> 
> IMHO we should just strip the shared bit in the TDX variant of 
> handle_ept_violation(), and pass the PFERR_PRIVATE_ACCESS (when GPA 
> doesn't have shared bit) to the common fault handler so it can correctly
> set fault->is_private to true.

I'm not sure what you are seeing here, could you elaborate?

> 
> 
> >   
> >                         if (kvm_mtrr_check_gfn_range_consistency(vcpu, base,
> > page_num))
> > @@ -6245,6 +6247,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu,
> > struct kvm_mmu *mmu)
> >   
> >         mmu->root.hpa = INVALID_PAGE;
> >         mmu->root.pgd = 0;
> > +       mmu->private_root_hpa = INVALID_PAGE;
> >         for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
> >                 mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
> >   
> > @@ -7263,6 +7266,12 @@ int kvm_mmu_vendor_module_init(void)
> >   void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
> >   {
> >         kvm_mmu_unload(vcpu);
> > +       if (tdp_mmu_enabled) {
> > +               read_lock(&vcpu->kvm->mmu_lock);
> > +               mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu-
> > >private_root_hpa,
> > +                                  NULL);
> > +               read_unlock(&vcpu->kvm->mmu_lock);
> > +       }
> 
> Hmm.. I don't quite like this, but sorry I kinda forgot why we need to 
> do this here.
> 
> Could you elaborate?

I was confused by this too, see the conversation here:
https://lore.kernel.org/kvm/7b76900a42b4044cbbcb0c42922c935562993d1e.camel@intel.com/

> 
> Anyway, from common code's perspective, we need to have some 
> clarification why we design to do it here.
> 
> >         free_mmu_pages(&vcpu->arch.root_mmu);
> >         free_mmu_pages(&vcpu->arch.guest_mmu);
> >         mmu_free_memory_caches(vcpu);
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h
> > b/arch/x86/kvm/mmu/mmu_internal.h
> > index 0f1a9d733d9e..3a7fe9261e23 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -6,6 +6,8 @@
> >   #include <linux/kvm_host.h>
> >   #include <asm/kvm_host.h>
> >   
> > +#include "mmu.h"
> > +
> >   #ifdef CONFIG_KVM_PROVE_MMU
> >   #define KVM_MMU_WARN_ON(x) WARN_ON_ONCE(x)
> >   #else
> > @@ -178,6 +180,16 @@ static inline void kvm_mmu_alloc_private_spt(struct
> > kvm_vcpu *vcpu, struct kvm_m
> >         sp->private_spt = kvm_mmu_memory_cache_alloc(&vcpu-
> > >arch.mmu_private_spt_cache);
> >   }
> >   
> > +static inline gfn_t kvm_gfn_for_root(struct kvm *kvm, struct kvm_mmu_page
> > *root,
> > +                                    gfn_t gfn)
> > +{
> > +       gfn_t gfn_for_root = kvm_gfn_to_private(kvm, gfn);
> > +
> > +       /* Set shared bit if not private */
> > +       gfn_for_root |= -(gfn_t)!is_private_sp(root) &
> > kvm_gfn_shared_mask(kvm);
> > +       return gfn_for_root;
> > +}
> > +
> >   static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page
> > *sp)
> >   {
> >         /*
> > @@ -348,7 +360,12 @@ static inline int __kvm_mmu_do_page_fault(struct
> > kvm_vcpu *vcpu, gpa_t cr2_or_gp
> >         int r;
> >   
> >         if (vcpu->arch.mmu->root_role.direct) {
> > -               fault.gfn = fault.addr >> PAGE_SHIFT;
> > +               /*
> > +                * Things like memslots don't understand the concept of a
> > shared
> > +                * bit. Strip it so that the GFN can be used like normal,
> > and the
> > +                * fault.addr can be used when the shared bit is needed.
> > +                */
> > +               fault.gfn = gpa_to_gfn(fault.addr) &
> > ~kvm_gfn_shared_mask(vcpu->kvm);
> >                 fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
> 
> Again, I don't think it's necessary for fault.gfn to still have the shared
> bit here?

It's getting stripped as it's set for the first time... What do you mean by
"still have it"?

> 
> This kinda usage is pretty much the reason I want to get rid of 
> kvm_gfn_shared_mask().

I think you want to move it to an x86_op, right? Not get rid of the concept of a
shared bit? I think KVM will have a hard time doing TDX without knowing about
the shared bit location.

Or maybe you are saying you think it should be stripped earlier and live as a PF
error code?

> 
> >         }
> >   
> > diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> > index fae559559a80..8a64bcef9deb 100644
> > --- a/arch/x86/kvm/mmu/tdp_iter.h
> > +++ b/arch/x86/kvm/mmu/tdp_iter.h
> > @@ -91,7 +91,7 @@ struct tdp_iter {
> >         tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
> >         /* A pointer to the current SPTE */
> >         tdp_ptep_t sptep;
> > -       /* The lowest GFN mapped by the current SPTE */
> > +       /* The lowest GFN (shared bits included) mapped by the current SPTE
> > */
> >         gfn_t gfn;
> 
> IMHO we need more clarification of this design.

Have you seen the documentation patch? Where do you think it should be? You mean
in the tdp_iter struct?

> 
> We at least need to call out that the TDX hardware uses the 'GPA + shared
> bit' when it walks the page table for shared mappings, so we must set up 
> the mapping at the GPA with the shared bit.
> 
> E.g, because TDX hardware uses separate root for shared/private 
> mappings, I think it's a reasonable option for the TDX hardware to just
> use the actual GPA w/o shared bit when it walks the shared page table, 
> and still report EPT violation with GPA with shared bit set.
> 
> Such HW implementation is completely hidden from software, thus should 
> be clarified in the changelog/comments.
> 
> 
> >         /* The level of the root page given to the iterator */
> >         int root_level;
> 
> [...]
> 
> >         for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
> > @@ -1029,8 +1209,8 @@ static int tdp_mmu_map_handle_target_level(struct
> > kvm_vcpu *vcpu,
> >                 new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
> >         else
> >                 wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter-
> > >gfn,
> > -                                        fault->pfn, iter->old_spte, fault-
> > >prefetch, true,
> > -                                        fault->map_writable, &new_spte);
> > +                                       fault->pfn, iter->old_spte, fault-
> > >prefetch, true,
> > +                                       fault->map_writable, &new_spte);
> >   
> >         if (new_spte == iter->old_spte)
> >                 ret = RET_PF_SPURIOUS;
> > @@ -1108,6 +1288,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct
> > kvm_page_fault *fault)
> >         struct kvm *kvm = vcpu->kvm;
> >         struct tdp_iter iter;
> >         struct kvm_mmu_page *sp;
> > +       gfn_t raw_gfn;
> > +       bool is_private = fault->is_private && kvm_gfn_shared_mask(kvm);
> 
> Ditto.  I wish we can have 'has_mirrored_private_pt'.
>
Isaku Yamahata May 16, 2024, 1:48 a.m. UTC | #5
On Thu, May 16, 2024 at 12:52:32PM +1200,
"Huang, Kai" <kai.huang@intel.com> wrote:

> On 15/05/2024 12:59 pm, Rick Edgecombe wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > 
> > Allocate mirrored page table for the private page table and implement MMU
> > hooks to operate on the private page table.
> > 
> > To handle page fault to a private GPA, KVM walks the mirrored page table in
> > unencrypted memory and then uses MMU hooks in kvm_x86_ops to propagate
> > changes from the mirrored page table to private page table.
> > 
> >    private KVM page fault   |
> >        |                    |
> >        V                    |
> >   private GPA               |     CPU protected EPTP
> >        |                    |           |
> >        V                    |           V
> >   mirrored PT root          |     private PT root
> >        |                    |           |
> >        V                    |           V
> >     mirrored PT --hook to propagate-->private PT
> >        |                    |           |
> >        \--------------------+------\    |
> >                             |      |    |
> >                             |      V    V
> >                             |    private guest page
> >                             |
> >                             |
> >       non-encrypted memory  |    encrypted memory
> >                             |
> > 
> > PT:         page table
> > Private PT: the CPU uses it, but it is invisible to KVM. TDX module manages
> >              this table to map private guest pages.
> > Mirrored PT:It is visible to KVM, but the CPU doesn't use it. KVM uses it
> >              to propagate PT change to the actual private PT.
> > 
> > SPTEs in mirrored page table (refer to them as mirrored SPTEs hereafter)
> > can be modified atomically with mmu_lock held for read, however, the MMU
> > hooks to private page table are not atomical operations.
> > 
> > To address it, a special REMOVED_SPTE is introduced and below sequence is
> > used when mirrored SPTEs are updated atomically.
> > 
> > 1. Mirrored SPTE is first atomically written to REMOVED_SPTE.
> > 2. The successful updater of the mirrored SPTE in step 1 proceeds with the
> >     following steps.
> > 3. Invoke MMU hooks to modify private page table with the target value.
> > 4. (a) On hook succeeds, update mirrored SPTE to target value.
> >     (b) On hook failure, restore mirrored SPTE to original value.
> > 
> > KVM TDP MMU ensures other threads will not overwrite REMOVED_SPTE.
> > 
> > This sequence also applies when SPTEs are atomically updated from
> > non-present to present in order to prevent potential conflicts when
> > multiple vCPUs attempt to set private SPTEs to a different page size
> > simultaneously, though 4K page size is only supported for private page
> > table currently.
> > 
> > 2M page support can be done in future patches.
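
The four-step sequence above can be sketched in userspace with C11 atomics.
This is only an illustration under stated assumptions, not the TDP MMU code:
EXAMPLE_REMOVED_SPTE, example_hook_t and example_set_mirrored_spte() are
hypothetical names, and the hook takes a single argument instead of the
(kvm, gfn, level, pfn) arguments of the real kvm_x86_ops hooks.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical sentinel standing in for REMOVED_SPTE. */
#define EXAMPLE_REMOVED_SPTE UINT64_MAX

/* Hypothetical hook: returns 0 on success, non-zero on failure. */
typedef int (*example_hook_t)(uint64_t new_spte);

static int example_hook_ok(uint64_t new_spte)   { (void)new_spte; return 0; }
static int example_hook_fail(uint64_t new_spte) { (void)new_spte; return -EIO; }

static int example_set_mirrored_spte(_Atomic uint64_t *sptep, uint64_t old_spte,
				     uint64_t new_spte, example_hook_t hook)
{
	uint64_t expected = old_spte;
	int ret;

	/* Other threads must never overwrite a frozen entry. */
	if (old_spte == EXAMPLE_REMOVED_SPTE)
		return -EBUSY;

	/* Step 1: atomically freeze the mirrored SPTE. */
	if (!atomic_compare_exchange_strong(sptep, &expected,
					    EXAMPLE_REMOVED_SPTE))
		return -EBUSY;	/* Step 2: only the winner proceeds. */

	/* Step 3: propagate the change to the private page table. */
	ret = hook(new_spte);

	/* Step 4: publish the target value, or restore the original. */
	atomic_store(sptep, ret ? old_spte : new_spte);
	return ret;
}
```

A caller that loses the compare-and-exchange in step 1 sees -EBUSY and is
expected to retry, which is how the frozen value keeps the non-atomic hook
from racing with other updaters.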
> > 
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > Co-developed-by: Kai Huang <kai.huang@intel.com>
> > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > ---
> > TDX MMU Part 1:
> >   - Remove unnecessary gfn, access twist in
> >     tdp_mmu_map_handle_target_level(). (Chao Gao)
> >   - Open code call to kvm_mmu_alloc_private_spt() instead of doing it in
> >     tdp_mmu_alloc_sp()
> >   - Update comment in set_private_spte_present() (Yan)
> >   - Open code call to kvm_mmu_init_private_spt() (Yan)
> >   - Add comments on TDX MMU hooks (Yan)
> >   - Fix various whitespace alignment (Yan)
> >   - Remove pointless warnings and conditionals in
> >     handle_removed_private_spte() (Yan)
> >   - Remove redundant lockdep assert in tdp_mmu_set_spte() (Yan)
> >   - Remove incorrect comment in handle_changed_spte() (Yan)
> >   - Remove unneeded kvm_pfn_to_refcounted_page() and
> >     is_error_noslot_pfn() check in kvm_tdp_mmu_map() (Yan)
> >   - Do kvm_gfn_for_root() branchless (Rick)
> >   - Update kvm_tdp_mmu_alloc_root() callers to not check error code (Rick)
> >   - Add comment for stripping shared bit for fault.gfn (Chao)
> > 
> > v19:
> > - drop CONFIG_KVM_MMU_PRIVATE
> > 
> > v18:
> > - Rename freezed => frozen
> > 
> > v14 -> v15:
> > - Refined is_private condition check in kvm_tdp_mmu_map().
> >    Add kvm_gfn_shared_mask() check.
> > - catch up for struct kvm_range change
> > ---
> >   arch/x86/include/asm/kvm-x86-ops.h |   5 +
> >   arch/x86/include/asm/kvm_host.h    |  25 +++
> >   arch/x86/kvm/mmu/mmu.c             |  13 +-
> >   arch/x86/kvm/mmu/mmu_internal.h    |  19 +-
> >   arch/x86/kvm/mmu/tdp_iter.h        |   2 +-
> >   arch/x86/kvm/mmu/tdp_mmu.c         | 269 +++++++++++++++++++++++++----
> >   arch/x86/kvm/mmu/tdp_mmu.h         |   2 +-
> >   7 files changed, 293 insertions(+), 42 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> > index 566d19b02483..d13cb4b8fce6 100644
> > --- a/arch/x86/include/asm/kvm-x86-ops.h
> > +++ b/arch/x86/include/asm/kvm-x86-ops.h
> > @@ -95,6 +95,11 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
> >   KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
> >   KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
> >   KVM_X86_OP(load_mmu_pgd)
> > +KVM_X86_OP_OPTIONAL(link_private_spt)
> > +KVM_X86_OP_OPTIONAL(free_private_spt)
> > +KVM_X86_OP_OPTIONAL(set_private_spte)
> > +KVM_X86_OP_OPTIONAL(remove_private_spte)
> > +KVM_X86_OP_OPTIONAL(zap_private_spte)
> >   KVM_X86_OP(has_wbinvd_exit)
> >   KVM_X86_OP(get_l2_tsc_offset)
> >   KVM_X86_OP(get_l2_tsc_multiplier)
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index d010ca5c7f44..20fa8fa58692 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -470,6 +470,7 @@ struct kvm_mmu {
> >   	int (*sync_spte)(struct kvm_vcpu *vcpu,
> >   			 struct kvm_mmu_page *sp, int i);
> >   	struct kvm_mmu_root_info root;
> > +	hpa_t private_root_hpa;
> 
> Should we have
> 
> 	struct kvm_mmu_root_info private_root;
> 
> instead?

Yes. And the private root allocation can be pushed down into TDP MMU.


> >   	union kvm_cpu_role cpu_role;
> >   	union kvm_mmu_page_role root_role;
> > @@ -1747,6 +1748,30 @@ struct kvm_x86_ops {
> >   	void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> >   			     int root_level);
> > +	/* Add a page as page table page into private page table */
> > +	int (*link_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> > +				void *private_spt);
> > +	/*
> > +	 * Free a page table page of private page table.
> > +	 * Only expected to be called when guest is not active, specifically
> > +	 * during VM destruction phase.
> > +	 */
> > +	int (*free_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> > +				void *private_spt);
> > +
> > +	/* Add a guest private page into private page table */
> > +	int (*set_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> > +				kvm_pfn_t pfn);
> > +
> > +	/* Remove a guest private page from private page table*/
> > +	int (*remove_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> > +				   kvm_pfn_t pfn);
> > +	/*
> > +	 * Keep a guest private page mapped in private page table, but clear its
> > +	 * present bit
> > +	 */
> > +	int (*zap_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level);
> > +
> >   	bool (*has_wbinvd_exit)(void);
> >   	u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 76f92cb37a96..2506d6277818 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3701,7 +3701,9 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
> >   	int r;
> >   	if (tdp_mmu_enabled) {
> > -		kvm_tdp_mmu_alloc_root(vcpu);
> > +		if (kvm_gfn_shared_mask(vcpu->kvm))
> > +			kvm_tdp_mmu_alloc_root(vcpu, true);
> 
> As mentioned in replies to other patches, I kinda prefer
> 
> 	kvm->arch.has_mirrored_pt (or has_mirrored_private_pt)
> 
> Or we have a helper
> 
> 	kvm_has_mirrored_pt() / kvm_has_mirrored_private_pt()
> 
> > +		kvm_tdp_mmu_alloc_root(vcpu, false);
> >   		return 0;
> >   	}
> > @@ -4685,7 +4687,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >   	if (kvm_mmu_honors_guest_mtrrs(vcpu->kvm)) {
> >   		for ( ; fault->max_level > PG_LEVEL_4K; --fault->max_level) {
> >   			int page_num = KVM_PAGES_PER_HPAGE(fault->max_level);
> > -			gfn_t base = gfn_round_for_level(fault->gfn,
> > +			gfn_t base = gfn_round_for_level(gpa_to_gfn(fault->addr),
> >   							 fault->max_level);
> 
> I thought by reaching here the shared bit has already been stripped away by
> the caller?
> 
> It doesn't make a lot of sense to still have it here, given we have a universal
> KVM-defined PFERR_PRIVATE_ACCESS flag:
> 
> https://lore.kernel.org/kvm/20240507155817.3951344-2-pbonzini@redhat.com/T/#mb30987f31b431771b42dfa64dcaa2efbc10ada5e
> 
> IMHO we should just strip the shared bit in the TDX variant of
> handle_ept_violation(), and pass the PFERR_PRIVATE_ACCESS (when GPA doesn't
> have shared bit) to the common fault handler so it can correctly set
> fault->is_private to true.

Yes, this part should be dropped.  Because we will have vCPUID.MTRR=0 for TDX in
the long term, we can make kvm_mmu_honors_guest_mtrrs() always false.  Maybe
kvm->arch.disabled_mtrr or guest_cpuid_has(vcpu, X86_FEATURE_MTRR) = false.  We
will enforce vcpu.CPUID.MTRR=false.

Guest MTRR=0 support can be independently addressed.


> >   			if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num))
> > @@ -6245,6 +6247,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
> >   	mmu->root.hpa = INVALID_PAGE;
> >   	mmu->root.pgd = 0;
> > +	mmu->private_root_hpa = INVALID_PAGE;
> >   	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
> >   		mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
> > @@ -7263,6 +7266,12 @@ int kvm_mmu_vendor_module_init(void)
> >   void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
> >   {
> >   	kvm_mmu_unload(vcpu);
> > +	if (tdp_mmu_enabled) {
> > +		read_lock(&vcpu->kvm->mmu_lock);
> > +		mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu->private_root_hpa,
> > +				   NULL);
> > +		read_unlock(&vcpu->kvm->mmu_lock);
> > +	}
> 
> Hmm.. I don't quite like this, but sorry I kinda forgot why we need to do
> this here.
> 
> Could you elaborate?
> 
> Anyway, from common code's perspective, we need to have some clarification
> why we design to do it here.

This should be cleaned up.  It can be pushed down into kvm_tdp_mmu_alloc_root().

void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
  allocate shared root
  if (has_mirrored_pt)
    allocate private root


> >   	free_mmu_pages(&vcpu->arch.root_mmu);
> >   	free_mmu_pages(&vcpu->arch.guest_mmu);
> >   	mmu_free_memory_caches(vcpu);
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index 0f1a9d733d9e..3a7fe9261e23 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -6,6 +6,8 @@
> >   #include <linux/kvm_host.h>
> >   #include <asm/kvm_host.h>
> > +#include "mmu.h"
> > +
> >   #ifdef CONFIG_KVM_PROVE_MMU
> >   #define KVM_MMU_WARN_ON(x) WARN_ON_ONCE(x)
> >   #else
> > @@ -178,6 +180,16 @@ static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, struct kvm_m
> >   	sp->private_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_private_spt_cache);
> >   }
> > +static inline gfn_t kvm_gfn_for_root(struct kvm *kvm, struct kvm_mmu_page *root,
> > +				     gfn_t gfn)
> > +{
> > +	gfn_t gfn_for_root = kvm_gfn_to_private(kvm, gfn);
> > +
> > +	/* Set shared bit if not private */
> > +	gfn_for_root |= -(gfn_t)!is_private_sp(root) & kvm_gfn_shared_mask(kvm);
> > +	return gfn_for_root;
> > +}
> > +
> >   static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
> >   {
> >   	/*
> > @@ -348,7 +360,12 @@ static inline int __kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gp
> >   	int r;
> >   	if (vcpu->arch.mmu->root_role.direct) {
> > -		fault.gfn = fault.addr >> PAGE_SHIFT;
> > +		/*
> > +		 * Things like memslots don't understand the concept of a shared
> > +		 * bit. Strip it so that the GFN can be used like normal, and the
> > +		 * fault.addr can be used when the shared bit is needed.
> > +		 */
> > +		fault.gfn = gpa_to_gfn(fault.addr) & ~kvm_gfn_shared_mask(vcpu->kvm);
> >   		fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
> 
> Again, I don't think it's necessary for fault.gfn to still have the shared bit
> here?
> 
> This kinda usage is pretty much the reason I want to get rid of
> kvm_gfn_shared_mask().

We are going to use flags like has_mirrored_pt, and we have a root page table
iterator with types specified.  I'll investigate how we can reduce (or eliminate)
those helper functions.


> >   	}
> > diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> > index fae559559a80..8a64bcef9deb 100644
> > --- a/arch/x86/kvm/mmu/tdp_iter.h
> > +++ b/arch/x86/kvm/mmu/tdp_iter.h
> > @@ -91,7 +91,7 @@ struct tdp_iter {
> >   	tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
> >   	/* A pointer to the current SPTE */
> >   	tdp_ptep_t sptep;
> > -	/* The lowest GFN mapped by the current SPTE */
> > +	/* The lowest GFN (shared bits included) mapped by the current SPTE */
> >   	gfn_t gfn;
> 
> IMHO we need more clarification of this design.
> 
> We at least need to call out that the TDX hardware uses the 'GPA + shared bit'
> when it walks the page table for shared mappings, so we must set up the
> mapping at the GPA with the shared bit.
> 
> E.g, because TDX hardware uses separate root for shared/private mappings, I
> think it's a reasonable option for the TDX hardware to just use the actual GPA
> w/o shared bit when it walks the shared page table, and still report EPT
> violation with GPA with shared bit set.
> 
> Such HW implementation is completely hidden from software, thus should be
> clarified in the changelog/comments.

Totally agree that it deserves documentation.  I would update the design
document of TDX TDP MMU to include it.  This patch series doesn't include it,
though.
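
In the meantime, a rough standalone sketch of the rule in question: a private
(mirrored) root is walked with the shared bit cleared, while a shared root is
walked with it set. It assumes kvm_gfn_to_private() clears the shared bit and
uses a made-up mask position; example_gfn_for_root() and
EXAMPLE_GFN_SHARED_MASK are hypothetical stand-ins, not the helpers from this
patch:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Assumption for illustration only: shared bit at GFN bit 39. */
#define EXAMPLE_GFN_SHARED_MASK ((gfn_t)1 << 39)

static gfn_t example_gfn_for_root(bool root_is_private, gfn_t gfn)
{
	/* Start from the private alias, i.e. shared bit cleared. */
	gfn_t gfn_for_root = gfn & ~EXAMPLE_GFN_SHARED_MASK;

	/*
	 * -(gfn_t)1 is all ones and -(gfn_t)0 is zero, so the shared mask
	 * is OR-ed in only when the root is not private. No branch needed.
	 */
	gfn_for_root |= -(gfn_t)!root_is_private & EXAMPLE_GFN_SHARED_MASK;
	return gfn_for_root;
}
```

The result is the same as an if/else on the root type. The negation trick just
turns the boolean into an all-ones or all-zeroes mask.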


> >   	/* The level of the root page given to the iterator */
> >   	int root_level;
> 
> [...]
> 
> >   	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
> > @@ -1029,8 +1209,8 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
> >   		new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
> >   	else
> >   		wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
> > -					 fault->pfn, iter->old_spte, fault->prefetch, true,
> > -					 fault->map_writable, &new_spte);
> > +					fault->pfn, iter->old_spte, fault->prefetch, true,
> > +					fault->map_writable, &new_spte);
> >   	if (new_spte == iter->old_spte)
> >   		ret = RET_PF_SPURIOUS;
> > @@ -1108,6 +1288,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >   	struct kvm *kvm = vcpu->kvm;
> >   	struct tdp_iter iter;
> >   	struct kvm_mmu_page *sp;
> > +	gfn_t raw_gfn;
> > +	bool is_private = fault->is_private && kvm_gfn_shared_mask(kvm);
> 
> Ditto.  I wish we can have 'has_mirrored_private_pt'.

Which name do you prefer? has_mirrored_pt or has_mirrored_private_pt?
Rick Edgecombe May 16, 2024, 2 a.m. UTC | #6
On Wed, 2024-05-15 at 18:48 -0700, Isaku Yamahata wrote:
> On Thu, May 16, 2024 at 12:52:32PM +1200,
> "Huang, Kai" <kai.huang@intel.com> wrote:
> 
> > On 15/05/2024 12:59 pm, Rick Edgecombe wrote:
> > > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > > 
> > > Allocate mirrored page table for the private page table and implement MMU
> > > hooks to operate on the private page table.
> > > 
> > > To handle page fault to a private GPA, KVM walks the mirrored page table
> > > in
> > > unencrypted memory and then uses MMU hooks in kvm_x86_ops to propagate
> > > changes from the mirrored page table to private page table.
> > > 
> > >    private KVM page fault   |
> > >        |                    |
> > >        V                    |
> > >   private GPA               |     CPU protected EPTP
> > >        |                    |           |
> > >        V                    |           V
> > >   mirrored PT root          |     private PT root
> > >        |                    |           |
> > >        V                    |           V
> > >     mirrored PT --hook to propagate-->private PT
> > >        |                    |           |
> > >        \--------------------+------\    |
> > >                             |      |    |
> > >                             |      V    V
> > >                             |    private guest page
> > >                             |
> > >                             |
> > >       non-encrypted memory  |    encrypted memory
> > >                             |
> > > 
> > > PT:         page table
> > > Private PT: the CPU uses it, but it is invisible to KVM. TDX module
> > > manages
> > >              this table to map private guest pages.
> > > Mirrored PT:It is visible to KVM, but the CPU doesn't use it. KVM uses it
> > >              to propagate PT change to the actual private PT.
> > > 
> > > SPTEs in mirrored page table (refer to them as mirrored SPTEs hereafter)
> > > can be modified atomically with mmu_lock held for read, however, the MMU
> > > hooks to private page table are not atomical operations.
> > > 
> > > To address it, a special REMOVED_SPTE is introduced and below sequence is
> > > used when mirrored SPTEs are updated atomically.
> > > 
> > > 1. Mirrored SPTE is first atomically written to REMOVED_SPTE.
> > > 2. The successful updater of the mirrored SPTE in step 1 proceeds with the
> > >     following steps.
> > > 3. Invoke MMU hooks to modify private page table with the target value.
> > > 4. (a) On hook succeeds, update mirrored SPTE to target value.
> > >     (b) On hook failure, restore mirrored SPTE to original value.
> > > 
> > > KVM TDP MMU ensures other threads will not overwrite REMOVED_SPTE.
> > > 
> > > This sequence also applies when SPTEs are atomically updated from
> > > non-present to present in order to prevent potential conflicts when
> > > multiple vCPUs attempt to set private SPTEs to a different page size
> > > simultaneously, though 4K page size is only supported for private page
> > > table currently.
> > > 
> > > 2M page support can be done in future patches.
> > > 
> > > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > > Co-developed-by: Kai Huang <kai.huang@intel.com>
> > > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > > Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> > > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > > Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > ---
> > > TDX MMU Part 1:
> > >   - Remove unnecessary gfn, access twist in
> > >     tdp_mmu_map_handle_target_level(). (Chao Gao)
> > >   - Open code call to kvm_mmu_alloc_private_spt() instead of doing it in
> > >     tdp_mmu_alloc_sp()
> > >   - Update comment in set_private_spte_present() (Yan)
> > >   - Open code call to kvm_mmu_init_private_spt() (Yan)
> > >   - Add comments on TDX MMU hooks (Yan)
> > >   - Fix various whitespace alignment (Yan)
> > >   - Remove pointless warnings and conditionals in
> > >     handle_removed_private_spte() (Yan)
> > >   - Remove redundant lockdep assert in tdp_mmu_set_spte() (Yan)
> > >   - Remove incorrect comment in handle_changed_spte() (Yan)
> > >   - Remove unneeded kvm_pfn_to_refcounted_page() and
> > >     is_error_noslot_pfn() check in kvm_tdp_mmu_map() (Yan)
> > >   - Do kvm_gfn_for_root() branchless (Rick)
> > >   - Update kvm_tdp_mmu_alloc_root() callers to not check error code (Rick)
> > >   - Add comment for stripping shared bit for fault.gfn (Chao)
> > > 
> > > v19:
> > > - drop CONFIG_KVM_MMU_PRIVATE
> > > 
> > > v18:
> > > - Rename freezed => frozen
> > > 
> > > v14 -> v15:
> > > - Refined is_private condition check in kvm_tdp_mmu_map().
> > >    Add kvm_gfn_shared_mask() check.
> > > - catch up for struct kvm_range change
> > > ---
> > >   arch/x86/include/asm/kvm-x86-ops.h |   5 +
> > >   arch/x86/include/asm/kvm_host.h    |  25 +++
> > >   arch/x86/kvm/mmu/mmu.c             |  13 +-
> > >   arch/x86/kvm/mmu/mmu_internal.h    |  19 +-
> > >   arch/x86/kvm/mmu/tdp_iter.h        |   2 +-
> > >   arch/x86/kvm/mmu/tdp_mmu.c         | 269 +++++++++++++++++++++++++----
> > >   arch/x86/kvm/mmu/tdp_mmu.h         |   2 +-
> > >   7 files changed, 293 insertions(+), 42 deletions(-)
> > > 
> > > diff --git a/arch/x86/include/asm/kvm-x86-ops.h
> > > b/arch/x86/include/asm/kvm-x86-ops.h
> > > index 566d19b02483..d13cb4b8fce6 100644
> > > --- a/arch/x86/include/asm/kvm-x86-ops.h
> > > +++ b/arch/x86/include/asm/kvm-x86-ops.h
> > > @@ -95,6 +95,11 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
> > >   KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
> > >   KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
> > >   KVM_X86_OP(load_mmu_pgd)
> > > +KVM_X86_OP_OPTIONAL(link_private_spt)
> > > +KVM_X86_OP_OPTIONAL(free_private_spt)
> > > +KVM_X86_OP_OPTIONAL(set_private_spte)
> > > +KVM_X86_OP_OPTIONAL(remove_private_spte)
> > > +KVM_X86_OP_OPTIONAL(zap_private_spte)
> > >   KVM_X86_OP(has_wbinvd_exit)
> > >   KVM_X86_OP(get_l2_tsc_offset)
> > >   KVM_X86_OP(get_l2_tsc_multiplier)
> > > diff --git a/arch/x86/include/asm/kvm_host.h
> > > b/arch/x86/include/asm/kvm_host.h
> > > index d010ca5c7f44..20fa8fa58692 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -470,6 +470,7 @@ struct kvm_mmu {
> > >         int (*sync_spte)(struct kvm_vcpu *vcpu,
> > >                          struct kvm_mmu_page *sp, int i);
> > >         struct kvm_mmu_root_info root;
> > > +       hpa_t private_root_hpa;
> > 
> > Should we have
> > 
> >         struct kvm_mmu_root_info private_root;
> > 
> > instead?
> 
> Yes. And the private root allocation can be pushed down into TDP MMU.

Why?

> 
[snip]
> > > @@ -7263,6 +7266,12 @@ int kvm_mmu_vendor_module_init(void)
> > >   void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
> > >   {
> > >         kvm_mmu_unload(vcpu);
> > > +       if (tdp_mmu_enabled) {
> > > +               read_lock(&vcpu->kvm->mmu_lock);
> > > +               mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu-
> > > >private_root_hpa,
> > > +                                  NULL);
> > > +               read_unlock(&vcpu->kvm->mmu_lock);
> > > +       }
> > 
> > Hmm.. I don't quite like this, but sorry I kinda forgot why we need to do
> > this here.
> > 
> > Could you elaborate?
> > 
> > Anyway, from common code's perspective, we need to have some clarification
> > why we design to do it here.
> 
> This should be cleaned up.  It can be pushed down into
> kvm_tdp_mmu_alloc_root().
> 
> void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
>   allocate shared root
>   if (has_mirrored_pt)
>     allocate private root
> 

Huh? This is kvm_mmu_destroy()...

> 
> > >         free_mmu_pages(&vcpu->arch.root_mmu);
> > >         free_mmu_pages(&vcpu->arch.guest_mmu);
> > >         mmu_free_memory_caches(vcpu);
> > > diff --git a/arch/x86/kvm/mmu/mmu_internal.h
> > > b/arch/x86/kvm/mmu/mmu_internal.h
> > > index 0f1a9d733d9e..3a7fe9261e23 100644
> > > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > > @@ -6,6 +6,8 @@
> > >   #include <linux/kvm_host.h>
> > >   #include <asm/kvm_host.h>
> > > +#include "mmu.h"
> > > +
> > >   #ifdef CONFIG_KVM_PROVE_MMU
> > >   #define KVM_MMU_WARN_ON(x) WARN_ON_ONCE(x)
> > >   #else
> > > @@ -178,6 +180,16 @@ static inline void kvm_mmu_alloc_private_spt(struct
> > > kvm_vcpu *vcpu, struct kvm_m
> > >         sp->private_spt = kvm_mmu_memory_cache_alloc(&vcpu-
> > > >arch.mmu_private_spt_cache);
> > >   }
> > > +static inline gfn_t kvm_gfn_for_root(struct kvm *kvm, struct kvm_mmu_page
> > > *root,
> > > +                                    gfn_t gfn)
> > > +{
> > > +       gfn_t gfn_for_root = kvm_gfn_to_private(kvm, gfn);
> > > +
> > > +       /* Set shared bit if not private */
> > > +       gfn_for_root |= -(gfn_t)!is_private_sp(root) &
> > > kvm_gfn_shared_mask(kvm);
> > > +       return gfn_for_root;
> > > +}
> > > +
> > >   static inline bool kvm_mmu_page_ad_need_write_protect(struct
> > > kvm_mmu_page *sp)
> > >   {
> > >         /*
> > > @@ -348,7 +360,12 @@ static inline int __kvm_mmu_do_page_fault(struct
> > > kvm_vcpu *vcpu, gpa_t cr2_or_gp
> > >         int r;
> > >         if (vcpu->arch.mmu->root_role.direct) {
> > > -               fault.gfn = fault.addr >> PAGE_SHIFT;
> > > +               /*
> > > +                * Things like memslots don't understand the concept of a
> > > shared
> > > +                * bit. Strip it so that the GFN can be used like normal,
> > > and the
> > > +                * fault.addr can be used when the shared bit is needed.
> > > +                */
> > > +               fault.gfn = gpa_to_gfn(fault.addr) &
> > > ~kvm_gfn_shared_mask(vcpu->kvm);
> > >                 fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
> > 
> > Again, I don't think it's necessary for fault.gfn to still have the shared bit
> > here?
> > 
> > This kinda usage is pretty much the reason I want to get rid of
> > kvm_gfn_shared_mask().
> 
> We are going to flags like has_mirrored_pt and we have root page table
> iterator
> with types specified.  I'll investigate how we can reduce (or eliminate)
> those helper functions.

Let's transition the abusers off and see what's left. I'm still waiting for an
explanation of why they are bad when used properly.


[snip]
> 
> > >         /* The level of the root page given to the iterator */
> > >         int root_level;
> > 
> > [...]
> > 
> > >         for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
> > > @@ -1029,8 +1209,8 @@ static int tdp_mmu_map_handle_target_level(struct
> > > kvm_vcpu *vcpu,
> > >                 new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
> > >         else
> > >                 wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter-
> > > >gfn,
> > > -                                        fault->pfn, iter->old_spte,
> > > fault->prefetch, true,
> > > -                                        fault->map_writable, &new_spte);
> > > +                                       fault->pfn, iter->old_spte, fault-
> > > >prefetch, true,
> > > +                                       fault->map_writable, &new_spte);
> > >         if (new_spte == iter->old_spte)
> > >                 ret = RET_PF_SPURIOUS;
> > > @@ -1108,6 +1288,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct
> > > kvm_page_fault *fault)
> > >         struct kvm *kvm = vcpu->kvm;
> > >         struct tdp_iter iter;
> > >         struct kvm_mmu_page *sp;
> > > +       gfn_t raw_gfn;
> > > +       bool is_private = fault->is_private && kvm_gfn_shared_mask(kvm);
> > 
> > Ditto.  I wish we can have 'has_mirrored_private_pt'.
> 
> Which name do you prefer? has_mirrored_pt or has_mirrored_private_pt?

Why not helpers that wrap vm_type like:
https://lore.kernel.org/kvm/d4c96caffd2633a70a140861d91794cdb54c7655.camel@intel.com/
Kai Huang May 16, 2024, 2:07 a.m. UTC | #7
>>> @@ -470,6 +470,7 @@ struct kvm_mmu {
>>>          int (*sync_spte)(struct kvm_vcpu *vcpu,
>>>                           struct kvm_mmu_page *sp, int i);
>>>          struct kvm_mmu_root_info root;
>>> +       hpa_t private_root_hpa;
>>
>> Should we have
>>
>>          struct kvm_mmu_root_info private_root;
>>
>> instead?
> 
> This corresponds to:
> mmu->root.hpa
> 
> We don't need the other fields, so I think better to not take space. It does
> look asymmetric though...

Being symmetric is why I asked.  Anyway no strong opinion.

[...]

>>>    
>>> @@ -4685,7 +4687,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct
>>> kvm_page_fault *fault)
>>>          if (kvm_mmu_honors_guest_mtrrs(vcpu->kvm)) {
>>>                  for ( ; fault->max_level > PG_LEVEL_4K; --fault->max_level)
>>> {
>>>                          int page_num = KVM_PAGES_PER_HPAGE(fault-
>>>> max_level);
>>> -                       gfn_t base = gfn_round_for_level(fault->gfn,
>>> +                       gfn_t base = gfn_round_for_level(gpa_to_gfn(fault-
>>>> addr),
>>>                                                           fault->max_level);
>>
>> I thought by reaching here the shared bit has already been stripped away
>> by the caller?
> 
> We don't support MTRRs so this code won't be executed for TDX, but not clear what
> you are asking.
> fault->addr has the shared bit (if present)
> fault->gfn has it stripped.

When I was looking at the code, I thought fault->gfn still had the
shared bit, and that gpa_to_gfn() internally stripped the shared bit away,
but sorry, that is not true.

My question is why do we even need this change?  Shouldn't we pass the
actual GFN (which doesn't have the shared bit) to
kvm_mtrr_check_gfn_range_consistency()?

If so, it looks like we should use fault->gfn to get the base?
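
To illustrate the distinction, here is a minimal user-space sketch (not the
kernel code). The shared bit position (GPA bit 51) and strip_shared() are
assumptions for illustration only; gpa_to_gfn() and gfn_round_for_level() are
simplified stand-ins for the kernel helpers of the same name:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t gpa_t;
typedef uint64_t gfn_t;

#define PAGE_SHIFT 12

/* Assumption for illustration: the shared bit is GPA bit 51. */
#define GFN_SHARED_MASK ((gfn_t)1 << (51 - PAGE_SHIFT))

/* Raw GFN: still carries the shared bit if the GPA had it. */
static inline gfn_t gpa_to_gfn(gpa_t gpa)
{
	return gpa >> PAGE_SHIFT;
}

/* Hypothetical helper: the "actual" GFN, as memslots expect it. */
static inline gfn_t strip_shared(gfn_t raw_gfn)
{
	return raw_gfn & ~GFN_SHARED_MASK;
}

/* Round a GFN down to the first GFN of its 4K/2M/1G page (level 1/2/3). */
static inline gfn_t gfn_round_for_level(gfn_t gfn, int level)
{
	assert(level >= 1 && level <= 3);
	return gfn & ~(((gfn_t)1 << ((level - 1) * 9)) - 1);
}
```

With these definitions, rounding the raw GFN keeps the shared bit in the
base, while rounding the stripped GFN gives the base that a GFN-based check
such as the MTRR one would presumably expect.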

> 
>>
>> It doesn't make a lot of sense to still have it here, given we have a
>> universal KVM-defined PFERR_PRIVATE_ACCESS flag:
>>
>> https://lore.kernel.org/kvm/20240507155817.3951344-2-pbonzini@redhat.com/T/#mb30987f31b431771b42dfa64dcaa2efbc10ada5e
>>
>> IMHO we should just strip the shared bit in the TDX variant of
>> handle_ept_violation(), and pass the PFERR_PRIVATE_ACCESS (when GPA
>> doesn't have shared bit) to the common fault handler so it can correctly
>> set fault->is_private to true.
> 
> I'm not sure what you are seeing here, could you elaborate?
See reply below.

[...]

>>
>> Anyway, from common code's perspective, we need to have some
>> clarification why we design to do it here.
>>
>>>          free_mmu_pages(&vcpu->arch.root_mmu);
>>>          free_mmu_pages(&vcpu->arch.guest_mmu);
>>>          mmu_free_memory_caches(vcpu);
>>> diff --git a/arch/x86/kvm/mmu/mmu_internal.h
>>> b/arch/x86/kvm/mmu/mmu_internal.h
>>> index 0f1a9d733d9e..3a7fe9261e23 100644
>>> --- a/arch/x86/kvm/mmu/mmu_internal.h
>>> +++ b/arch/x86/kvm/mmu/mmu_internal.h
>>> @@ -6,6 +6,8 @@
>>>    #include <linux/kvm_host.h>
>>>    #include <asm/kvm_host.h>
>>>    
>>> +#include "mmu.h"
>>> +
>>>    #ifdef CONFIG_KVM_PROVE_MMU
>>>    #define KVM_MMU_WARN_ON(x) WARN_ON_ONCE(x)
>>>    #else
>>> @@ -178,6 +180,16 @@ static inline void kvm_mmu_alloc_private_spt(struct
>>> kvm_vcpu *vcpu, struct kvm_m
>>>          sp->private_spt = kvm_mmu_memory_cache_alloc(&vcpu-
>>>> arch.mmu_private_spt_cache);
>>>    }
>>>    
>>> +static inline gfn_t kvm_gfn_for_root(struct kvm *kvm, struct kvm_mmu_page
>>> *root,
>>> +                                    gfn_t gfn)
>>> +{
>>> +       gfn_t gfn_for_root = kvm_gfn_to_private(kvm, gfn);
>>> +
>>> +       /* Set shared bit if not private */
>>> +       gfn_for_root |= -(gfn_t)!is_private_sp(root) &
>>> kvm_gfn_shared_mask(kvm);
>>> +       return gfn_for_root;
>>> +}
>>> +
>>>    static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page
>>> *sp)
>>>    {
>>>          /*
>>> @@ -348,7 +360,12 @@ static inline int __kvm_mmu_do_page_fault(struct
>>> kvm_vcpu *vcpu, gpa_t cr2_or_gp
>>>          int r;
>>>    
>>>          if (vcpu->arch.mmu->root_role.direct) {
>>> -               fault.gfn = fault.addr >> PAGE_SHIFT;
>>> +               /*
>>> +                * Things like memslots don't understand the concept of a
>>> shared
>>> +                * bit. Strip it so that the GFN can be used like normal,
>>> and the
>>> +                * fault.addr can be used when the shared bit is needed.
>>> +                */
>>> +               fault.gfn = gpa_to_gfn(fault.addr) &
>>> ~kvm_gfn_shared_mask(vcpu->kvm);
>>>                  fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
>>
>> Again, I don't think it's necessary for fault.gfn to still have the shared
>> bit here?
> 
> It's getting stripped as it's set for the first time... What do you mean still
> have it?

Sorry, I meant fault->addr.

> 
>>
>> This kinda usage is pretty much the reason I want to get rid of
>> kvm_gfn_shared_mask().
> 
> I think you want to move it to an x86_op right? Not get rid of the concept of a
> shared bit? I think KVM will have a hard time doing TDX without knowing about
> the shared bit location.
> 
> Or maybe you are saying you think it should be stripped earlier and live as a PF
> error code?

I meant it seems we should just strip the shared bit away from the GPA in
handle_ept_violation() and pass it as 'cr2_or_gpa' here, so fault->addr
won't have the shared bit.

Do you see any problem with doing so?

> 
>>
>>>          }
>>>    
>>> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
>>> index fae559559a80..8a64bcef9deb 100644
>>> --- a/arch/x86/kvm/mmu/tdp_iter.h
>>> +++ b/arch/x86/kvm/mmu/tdp_iter.h
>>> @@ -91,7 +91,7 @@ struct tdp_iter {
>>>          tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
>>>          /* A pointer to the current SPTE */
>>>          tdp_ptep_t sptep;
>>> -       /* The lowest GFN mapped by the current SPTE */
>>> +       /* The lowest GFN (shared bits included) mapped by the current SPTE
>>> */
>>>          gfn_t gfn;
>>
>> IMHO we need more clarification of this design.
> 
> Have you seen the documentation patch? Where do you think it should be? You mean
> in the tdp_iter struct?

My thinking:

The changelog should clarify why the shared bit is included in 'gfn' in
tdp_iter.

And here around the 'gfn' we can have a simple sentence to explain
why the shared bit is included.
Kai Huang May 16, 2024, 2:10 a.m. UTC | #8
>>>> +       gfn_t raw_gfn;
>>>> +       bool is_private = fault->is_private && kvm_gfn_shared_mask(kvm);
>>>
>>> Ditto.  I wish we can have 'has_mirrored_private_pt'.
>>
>> Which name do you prefer? has_mirrored_pt or has_mirrored_private_pt?
> 
> Why not helpers that wrap vm_type like:
> https://lore.kernel.org/kvm/d4c96caffd2633a70a140861d91794cdb54c7655.camel@intel.com/

I am fine with any of them -- boolean (with either name) or helper.
Rick Edgecombe May 16, 2024, 2:57 a.m. UTC | #9
On Thu, 2024-05-16 at 14:07 +1200, Huang, Kai wrote:
> 
> I meant it seems we should just strip shared bit away from the GPA in 
> handle_ept_violation() and pass it as 'cr2_or_gpa' here, so fault->addr 
> won't have the shared bit.
> 
> Do you see any problem of doing so?

We would need to add it back in "raw_gfn" in kvm_tdp_mmu_map().

In the past I did something like the private/shared split, but for execute-only
aliases and a few other wacky things.

It also had a synthetic error code. For a while I had it so GPA had alias bits
(i.e. shared bit) not stripped, like TDX has today, but there was always some
code that got surprised by the extra bits in the GPA. I want to say it was the
emulation of PAE or something like that (execute-only had to support all the
normal VM stuff).

So in the later revisions I actually had a helper to take a GFN and PF error
code and put the alias bits back in. Then alias bits got stripped immediately
and at the same time the synthetic error code was set. Something similar could
probably work to recreate "raw_gfn" from a fault.

IIRC (and I could easily be wrong), when I discussed this with Sean he said TDX
didn't need to support whatever issue I was working around, and the original
solution was slightly better for TDX.

In any case, I doubt Sean is wedded to a remark he may or may not have made long
ago. But looking at the TDX code today, it doesn't feel that confusing to me.

So I'm not against adding the shared bits back in later, but it doesn't seem
that big of a gain to me. It also has kind of been tried before a long time ago.

> 
> > 
> > > 
> > > >           }
> > > >     
> > > > diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> > > > index fae559559a80..8a64bcef9deb 100644
> > > > --- a/arch/x86/kvm/mmu/tdp_iter.h
> > > > +++ b/arch/x86/kvm/mmu/tdp_iter.h
> > > > @@ -91,7 +91,7 @@ struct tdp_iter {
> > > >           tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
> > > >           /* A pointer to the current SPTE */
> > > >           tdp_ptep_t sptep;
> > > > -       /* The lowest GFN mapped by the current SPTE */
> > > > +       /* The lowest GFN (shared bits included) mapped by the current
> > > > SPTE
> > > > */
> > > >           gfn_t gfn;
> > > 
> > > IMHO we need more clarification of this design.
> > 
> > Have you seen the documentation patch? Where do you think it should be? You
> > mean
> > in the tdp_iter struct?
> 
> My thinking:
> 
> Changelog should clarify why include shared bit to 'gfn' in tdp_iter.
> 
> And here around the 'gfn' we can have some simple sentence to explain 
> why to include the shared bit.

Doesn't seem unreasonable.
Kai Huang May 16, 2024, 1:04 p.m. UTC | #10
On Thu, 2024-05-16 at 02:57 +0000, Edgecombe, Rick P wrote:
> On Thu, 2024-05-16 at 14:07 +1200, Huang, Kai wrote:
> > 
> > I meant it seems we should just strip shared bit away from the GPA in 
> > handle_ept_violation() and pass it as 'cr2_or_gpa' here, so fault->addr 
> > won't have the shared bit.
> > 
> > Do you see any problem of doing so?
> 
> We would need to add it back in "raw_gfn" in kvm_tdp_mmu_map().

I don't see any big difference?

Now in this patch the raw_gfn is directly from fault->addr:

	raw_gfn = gpa_to_gfn(fault->addr);

	tdp_mmu_for_each_pte(iter, mmu, is_private, raw_gfn, raw_gfn+1) {
		...
	}

But there's nothing wrong with getting the raw_gfn from fault->gfn.  In
fact, the zapping code does just this:

        /*
         * start and end doesn't have GFN shared bit.  This function zaps
         * a region including alias.  Adjust shared bit of [start, end) if
         * the root is shared.
         */
        start = kvm_gfn_for_root(kvm, root, start);
        end = kvm_gfn_for_root(kvm, root, end);

So there's nothing wrong with just doing the same thing in both functions.

The point is fault->gfn has shared bit stripped away at the beginning, and
AFAICT there's no useful reason to keep shared bit in fault->addr.  The
entire @fault is a temporary structure on the stack during fault handling
anyway.

> 
> In the past I did something like the private/shared split, but for execute-only
> aliases and a few other wacky things.
> 
> It also had a synthetic error code. For awhile I had it so GPA had alias bits
> (i.e. shared bit) not stripped, like TDX has today, but there was always some
> code that got surprised by the extra bits in the GPA. I want to say it was the
> emulation of PAE or something like that (execute-only had to support all the
> normal VM stuff).
> 
> So in the later revisions I actually had a helper to take a GFN and PF error
> code and put the alias bits back in. Then alias bits got stripped immediately
> and at the same time the synthetic error code was set. Something similar could
> probably work to recreate "raw_gfn" from a fault.
> 
> IIRC (and I could easily be wrong), when I discussed this with Sean he said TDX
> didn't need to support whatever issue I was working around, and the original
> solution was slightly better for TDX.
> 
> In any case, I doubt Sean is wedded to a remark he may or may not have made long
> ago. But looking at the TDX code today, it doesn't feel that confusing to me.

[...]

> 
> So I'm not against adding the shared bits back in later, but it doesn't seem
> that big of a gain to me. It also has kind of been tried before a long time ago.

As mentioned above, we are already doing that anyway in the zapping code
path.

> 
> > 
> > > 
> > > > 
> > > > >           }
> > > > >     
> > > > > diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> > > > > index fae559559a80..8a64bcef9deb 100644
> > > > > --- a/arch/x86/kvm/mmu/tdp_iter.h
> > > > > +++ b/arch/x86/kvm/mmu/tdp_iter.h
> > > > > @@ -91,7 +91,7 @@ struct tdp_iter {
> > > > >           tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
> > > > >           /* A pointer to the current SPTE */
> > > > >           tdp_ptep_t sptep;
> > > > > -       /* The lowest GFN mapped by the current SPTE */
> > > > > +       /* The lowest GFN (shared bits included) mapped by the current
> > > > > SPTE
> > > > > */
> > > > >           gfn_t gfn;
> > > > 
> > > > IMHO we need more clarification of this design.
> > > 

Btw, another thing after second thought:

Regardless of how this is implemented in KVM, IIUC TDX hardware requires the
two operations below to have the shared bit set in the GPA for shared mappings:

  1) Setup/teardown shared page table mapping
  2) GPA range in TLB flush for shared mapping

(I kinda forgot the TLB flush part so better double check, but I guess I
am >90% sure about it.)

So in the fault handler path, we actually need to be careful of the GFN
passed to relevant functions, because for other operations like finding
memslot based on GFN, we must pass the GFN w/o shared bit.

Now the tricky thing is that, due to 1), 'tdp_iter->gfn' is set to the
"raw_gfn" with the shared bit in order to find the correct SPTE in the fault
handler path.  As a result, the current implementation sets sp->gfn to the
"raw_gfn" too.

	sp = tdp_mmu_alloc_sp(vcpu);
	...
        tdp_mmu_init_child_sp(sp, &iter);

The problem is in current KVM implementation, iter->gfn and sp->gfn are
used in both cases: 1) page table walk and TLB flush; 2) others like
memslot lookup.

So the result is we need to be very careful whether we should strip the
shared bit away when using them.

E.g., looking at the current dev branch, if I am reading the code correctly,
it seems we have a bug around here:

static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
                                          struct kvm_page_fault *fault,
                                          struct tdp_iter *iter)
{                   
	...

        if (unlikely(!fault->slot))
                new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
        else
                wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, 
				iter->gfn, fault->pfn, iter->old_spte, 
				fault->prefetch, true, 
				fault->map_writable, &new_spte);
	...
}

Note that @iter->gfn (which is the "raw_gfn" AFAICT) is passed to both
make_mmio_spte() and make_spte().  But AFAICT both functions treat the
GFN as the actual GFN.  E.g.,

bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
               const struct kvm_memory_slot *slot,
               unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
               u64 old_spte, bool prefetch, bool can_unsync,
               bool host_writable, u64 *new_spte)
{
	...

	if (shadow_memtype_mask)
                spte |= static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
				kvm_is_mmio_pfn(pfn));
	...

	if ((spte & PT_WRITABLE_MASK) &&
			kvm_slot_dirty_track_enabled(slot)) {
                /* Enforced by kvm_mmu_hugepage_adjust. */
                WARN_ON_ONCE(level > PG_LEVEL_4K);
                mark_page_dirty_in_slot(vcpu->kvm, slot, gfn);
        }
	...
}

AFAICT the @gfn passed to both kvm_x86_get_mt_mask() and
mark_page_dirty_in_slot() needs to be the actual GFN.  This may not be a
concern for TDX now, but I think it's logically wrong to use the raw GFN.

This kind of issue is hard to find when writing and reviewing code.  I am
wondering whether we should have a clearer way to avoid such issues.

The idea is to add a new 'raw_gfn' to @tdp_iter and 'kvm_mmu_page'.  When
we walk the GFN range using iter, we always use the "actual GFN" w/o
shared bit.  Like:

	tdp_mmu_for_each_pte(kvm, iter, mmu, is_private, gfn, gfn + 1) {
		...
	}

But in the tdp_iter_*() functions, we internally calculate the "raw_gfn"
using the "actual GFN" + the 'kvm', and we use the "raw_gfn" to walk the
page table to find the correct SPTE.

So the end code will: 1) explicitly use iter->raw_gfn for the page table
walk and TLB flush; 2) use iter->gfn for everything else, like memslot
lookup.

(sp->gfn and sp->raw_gfn can be used similarly, e.g., sp->raw_gfn is used
for TLB flush, and for others like memslot lookup we use sp->gfn.)

I think the code will be clearer this way?
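
A rough user-space sketch of the idea, to make the split concrete. This is
not the actual tdp_iter; struct iter_sketch, iter_sketch_make() and the
shared bit position (GFN bit 39) are all hypothetical names and values picked
for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Assumption for illustration: the shared bit is GFN bit 39 (GPA bit 51). */
#define GFN_SHARED_MASK ((gfn_t)1 << 39)

/* Hypothetical stand-in for the two GFN fields proposed for tdp_iter. */
struct iter_sketch {
	gfn_t gfn;	/* actual GFN: memslot lookup, dirty logging, ... */
	gfn_t raw_gfn;	/* GFN plus shared bit: page table walk, TLB flush */
};

/*
 * Callers always pass the actual GFN; the raw GFN is derived internally
 * from whether the walk targets the private (mirrored) or shared root.
 */
static inline struct iter_sketch iter_sketch_make(gfn_t gfn, bool is_private)
{
	struct iter_sketch it;

	/* The caller must not pass a GFN that already carries the shared bit. */
	assert((gfn & GFN_SHARED_MASK) == 0);

	it.gfn = gfn;
	it.raw_gfn = is_private ? gfn : (gfn | GFN_SHARED_MASK);
	return it;
}
```

The point of the sketch is only that each consumer picks the field matching
its purpose, instead of deciding case by case whether to strip the bit.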
Rick Edgecombe May 16, 2024, 4:36 p.m. UTC | #11
On Thu, 2024-05-16 at 13:04 +0000, Huang, Kai wrote:
> On Thu, 2024-05-16 at 02:57 +0000, Edgecombe, Rick P wrote:
> > On Thu, 2024-05-16 at 14:07 +1200, Huang, Kai wrote:
> > > 
> > > I meant it seems we should just strip shared bit away from the GPA in 
> > > handle_ept_violation() and pass it as 'cr2_or_gpa' here, so fault->addr 
> > > won't have the shared bit.
> > > 
> > > Do you see any problem of doing so?
> > 
> > We would need to add it back in "raw_gfn" in kvm_tdp_mmu_map().
> 
> I don't see any big difference?
> 
> Now in this patch the raw_gfn is directly from fault->addr:
> 
>         raw_gfn = gpa_to_gfn(fault->addr);
> 
>         tdp_mmu_for_each_pte(iter, mmu, is_private, raw_gfn, raw_gfn+1) {
>                 ...
>         }
> 
> But there's nothing wrong to get the raw_gfn from the fault->gfn.  In
> fact, the zapping code just does this:
> 
>         /*
>          * start and end doesn't have GFN shared bit.  This function zaps
>          * a region including alias.  Adjust shared bit of [start, end) if
>          * the root is shared.
>          */
>         start = kvm_gfn_for_root(kvm, root, start);
>         end = kvm_gfn_for_root(kvm, root, end);
> 
> So there's nothing wrong to just do the same thing in both functions.
> 
> The point is fault->gfn has shared bit stripped away at the beginning, and
> AFAICT there's no useful reason to keep shared bit in fault->addr.  The
> entire @fault is a temporary structure on the stack during fault handling
> anyway.

I would like to avoid code churn at this point if there is not a real clear
benefit.

One small benefit of keeping the shared bit in the fault->addr is that it is
sort of consistent with how that field is used in other scenarios in KVM. In
shadow paging it's not even the GPA. So it is simply the "fault address" and has
to be interpreted in different ways in the fault handler. For TDX the fault
address *does* include the shared bit. And the EPT needs to be faulted in at
that address.

If we strip the shared bit when setting fault->addr we have to reconstruct it
when we do the actual shared mapping. There is no way around that. Which helper
does it isn't important, I think. Doing the reconstruction inside
tdp_mmu_for_each_pte() could be neat, except that it doesn't know about the
shared bit position.

The zapping code's use of kvm_gfn_for_root() is different because the gfn comes
without the shared bit. It's not stripped and then added back. Those are
operations that target GFNs really.
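
For reference, the branchless adjustment done by kvm_gfn_for_root() can be
sketched in user space roughly as below. This is a sketch under assumptions,
not the kernel helper: the shared bit position (GFN bit 39) is made up,
kvm_gfn_to_private() is assumed to simply clear the shared bit, and
gfn_for_root_sketch() and zap_bound_for_root_sketch() are hypothetical names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Assumption for illustration: the shared bit is GFN bit 39 (GPA bit 51). */
#define GFN_SHARED_MASK ((gfn_t)1 << 39)

static inline gfn_t gfn_for_root_sketch(gfn_t gfn, bool root_is_private)
{
	/* Assumed behavior of kvm_gfn_to_private(): clear the shared bit. */
	gfn_t gfn_for_root = gfn & ~GFN_SHARED_MASK;

	/*
	 * -(gfn_t)1 is all ones and -(gfn_t)0 is zero, so the mask is OR-ed
	 * in only when the root is not private, without a branch.
	 */
	gfn_for_root |= -(gfn_t)!root_is_private & GFN_SHARED_MASK;
	return gfn_for_root;
}

/* Zap-style use: the range bounds arrive without the shared bit. */
static inline gfn_t zap_bound_for_root_sketch(gfn_t bound, bool root_is_private)
{
	assert((bound & GFN_SHARED_MASK) == 0);
	return gfn_for_root_sketch(bound, root_is_private);
}
```

In the zap case the input is a plain GFN, so the helper only ever adds the
bit; in the fault case the same helper would be re-adding a bit that was
stripped a few calls earlier.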

I think the real problem is that we are gleaning whether the fault is to private
or shared memory from different things. Sometimes from fault->is_private,
sometimes the presence of the shared bits, and sometimes the role bit. I think
this is confusing, doubly so because we are using some of these things to infer
unrelated things (mirrored vs private).

My guess is that you have noticed this and somehow zeroed in on the shared_mask.
I think we should straighten out the mirrored/private semantics and see what the
results look like. How does that sound to you?
Isaku Yamahata May 16, 2024, 5:10 p.m. UTC | #12
On Thu, May 16, 2024 at 02:00:32AM +0000,
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:

> On Wed, 2024-05-15 at 18:48 -0700, Isaku Yamahata wrote:
> > On Thu, May 16, 2024 at 12:52:32PM +1200,
> > "Huang, Kai" <kai.huang@intel.com> wrote:
> > 
> > > On 15/05/2024 12:59 pm, Rick Edgecombe wrote:
> > > > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > > > 
> > > > Allocate mirrored page table for the private page table and implement MMU
> > > > hooks to operate on the private page table.
> > > > 
> > > > To handle page fault to a private GPA, KVM walks the mirrored page table
> > > > in
> > > > unencrypted memory and then uses MMU hooks in kvm_x86_ops to propagate
> > > > changes from the mirrored page table to private page table.
> > > > 
> > > >    private KVM page fault   |
> > > >        |                    |
> > > >        V                    |
> > > >   private GPA               |     CPU protected EPTP
> > > >        |                    |           |
> > > >        V                    |           V
> > > >   mirrored PT root          |     private PT root
> > > >        |                    |           |
> > > >        V                    |           V
> > > >     mirrored PT --hook to propagate-->private PT
> > > >        |                    |           |
> > > >        \--------------------+------\    |
> > > >                             |      |    |
> > > >                             |      V    V
> > > >                             |    private guest page
> > > >                             |
> > > >                             |
> > > >       non-encrypted memory  |    encrypted memory
> > > >                             |
> > > > 
> > > > PT:         page table
> > > > Private PT: the CPU uses it, but it is invisible to KVM. TDX module
> > > > manages
> > > >              this table to map private guest pages.
> > > > Mirrored PT:It is visible to KVM, but the CPU doesn't use it. KVM uses it
> > > >              to propagate PT change to the actual private PT.
> > > > 
> > > > SPTEs in mirrored page table (refer to them as mirrored SPTEs hereafter)
> > > > can be modified atomically with mmu_lock held for read, however, the MMU
> > > > hooks to private page table are not atomical operations.
> > > > 
> > > > To address it, a special REMOVED_SPTE is introduced and below sequence is
> > > > used when mirrored SPTEs are updated atomically.
> > > > 
> > > > 1. Mirrored SPTE is first atomically written to REMOVED_SPTE.
> > > > 2. The successful updater of the mirrored SPTE in step 1 proceeds with the
> > > >     following steps.
> > > > 3. Invoke MMU hooks to modify private page table with the target value.
> > > > 4. (a) On hook succeeds, update mirrored SPTE to target value.
> > > >     (b) On hook failure, restore mirrored SPTE to original value.
> > > > 
> > > > > KVM TDP MMU ensures other threads will not overwrite REMOVED_SPTE.
> > > > 
> > > > > This sequence also applies when SPTEs are atomically updated from
> > > > non-present to present in order to prevent potential conflicts when
> > > > multiple vCPUs attempt to set private SPTEs to a different page size
> > > > simultaneously, though 4K page size is only supported for private page
> > > > table currently.
> > > > 
> > > > 2M page support can be done in future patches.
> > > > 
> > > > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > > > Co-developed-by: Kai Huang <kai.huang@intel.com>
> > > > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > > > Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> > > > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > > > Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > > ---
> > > > TDX MMU Part 1:
> > > >   - Remove unnecessary gfn, access twist in
> > > >     tdp_mmu_map_handle_target_level(). (Chao Gao)
> > > > >   - Open code call to kvm_mmu_alloc_private_spt() instead of doing it in
> > > >     tdp_mmu_alloc_sp()
> > > >   - Update comment in set_private_spte_present() (Yan)
> > > >   - Open code call to kvm_mmu_init_private_spt() (Yan)
> > > >   - Add comments on TDX MMU hooks (Yan)
> > > >   - Fix various whitespace alignment (Yan)
> > > >   - Remove pointless warnings and conditionals in
> > > >     handle_removed_private_spte() (Yan)
> > > >   - Remove redundant lockdep assert in tdp_mmu_set_spte() (Yan)
> > > >   - Remove incorrect comment in handle_changed_spte() (Yan)
> > > >   - Remove unneeded kvm_pfn_to_refcounted_page() and
> > > >     is_error_noslot_pfn() check in kvm_tdp_mmu_map() (Yan)
> > > >   - Do kvm_gfn_for_root() branchless (Rick)
> > > >   - Update kvm_tdp_mmu_alloc_root() callers to not check error code (Rick)
> > > >   - Add comment for stripping shared bit for fault.gfn (Chao)
> > > > 
> > > > v19:
> > > > - drop CONFIG_KVM_MMU_PRIVATE
> > > > 
> > > > v18:
> > > > - Rename freezed => frozen
> > > > 
> > > > v14 -> v15:
> > > > - Refined is_private condition check in kvm_tdp_mmu_map().
> > > >    Add kvm_gfn_shared_mask() check.
> > > > - catch up for struct kvm_range change
> > > > ---
> > > >   arch/x86/include/asm/kvm-x86-ops.h |   5 +
> > > >   arch/x86/include/asm/kvm_host.h    |  25 +++
> > > >   arch/x86/kvm/mmu/mmu.c             |  13 +-
> > > >   arch/x86/kvm/mmu/mmu_internal.h    |  19 +-
> > > >   arch/x86/kvm/mmu/tdp_iter.h        |   2 +-
> > > >   arch/x86/kvm/mmu/tdp_mmu.c         | 269 +++++++++++++++++++++++++----
> > > >   arch/x86/kvm/mmu/tdp_mmu.h         |   2 +-
> > > >   7 files changed, 293 insertions(+), 42 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/include/asm/kvm-x86-ops.h
> > > > b/arch/x86/include/asm/kvm-x86-ops.h
> > > > index 566d19b02483..d13cb4b8fce6 100644
> > > > --- a/arch/x86/include/asm/kvm-x86-ops.h
> > > > +++ b/arch/x86/include/asm/kvm-x86-ops.h
> > > > @@ -95,6 +95,11 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
> > > >   KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
> > > >   KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
> > > >   KVM_X86_OP(load_mmu_pgd)
> > > > +KVM_X86_OP_OPTIONAL(link_private_spt)
> > > > +KVM_X86_OP_OPTIONAL(free_private_spt)
> > > > +KVM_X86_OP_OPTIONAL(set_private_spte)
> > > > +KVM_X86_OP_OPTIONAL(remove_private_spte)
> > > > +KVM_X86_OP_OPTIONAL(zap_private_spte)
> > > >   KVM_X86_OP(has_wbinvd_exit)
> > > >   KVM_X86_OP(get_l2_tsc_offset)
> > > >   KVM_X86_OP(get_l2_tsc_multiplier)
> > > > diff --git a/arch/x86/include/asm/kvm_host.h
> > > > b/arch/x86/include/asm/kvm_host.h
> > > > index d010ca5c7f44..20fa8fa58692 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -470,6 +470,7 @@ struct kvm_mmu {
> > > >         int (*sync_spte)(struct kvm_vcpu *vcpu,
> > > >                          struct kvm_mmu_page *sp, int i);
> > > >         struct kvm_mmu_root_info root;
> > > > +       hpa_t private_root_hpa;
> > > 
> > > Should we have
> > > 
> > >         struct kvm_mmu_root_info private_root;
> > > 
> > > instead?
> > 
> > Yes. And the private root allocation can be pushed down into TDP MMU.
> 
> Why?

Because only the TDP MMU supports mirrored PT, and the change to the root PT
allocation will be contained in the TDP MMU.  Also it will be symmetric to
kvm_mmu_destroy() and kvm_tdp_mmu_destroy().


> [snip]
> > > > @@ -7263,6 +7266,12 @@ int kvm_mmu_vendor_module_init(void)
> > > >   void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
> > > >   {
> > > >         kvm_mmu_unload(vcpu);
> > > > +       if (tdp_mmu_enabled) {
> > > > +               read_lock(&vcpu->kvm->mmu_lock);
> > > > +               mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu-
> > > > >private_root_hpa,
> > > > +                                  NULL);
> > > > +               read_unlock(&vcpu->kvm->mmu_lock);
> > > > +       }
> > > 
> > > > Hmm.. I don't quite like this, but sorry I kinda forgot why we need to do
> > > this here.
> > > 
> > > Could you elaborate?
> > > 
> > > Anyway, from common code's perspective, we need to have some clarification
> > > why we design to do it here.
> > 
> > This should be cleaned up.  It can be pushed down into
> > kvm_tdp_mmu_alloc_root().
> > 
> > void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
> >   allocate shared root
> >   if (has_mirrored_pt)
> >     allocate private root
> > 
> 
> Huh? This is kvm_mmu_destroy()...




> > > >         free_mmu_pages(&vcpu->arch.root_mmu);
> > > >         free_mmu_pages(&vcpu->arch.guest_mmu);
> > > >         mmu_free_memory_caches(vcpu);
> > > > diff --git a/arch/x86/kvm/mmu/mmu_internal.h
> > > > b/arch/x86/kvm/mmu/mmu_internal.h
> > > > index 0f1a9d733d9e..3a7fe9261e23 100644
> > > > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > > > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > > > @@ -6,6 +6,8 @@
> > > >   #include <linux/kvm_host.h>
> > > >   #include <asm/kvm_host.h>
> > > > +#include "mmu.h"
> > > > +
> > > >   #ifdef CONFIG_KVM_PROVE_MMU
> > > >   #define KVM_MMU_WARN_ON(x) WARN_ON_ONCE(x)
> > > >   #else
> > > > @@ -178,6 +180,16 @@ static inline void kvm_mmu_alloc_private_spt(struct
> > > > kvm_vcpu *vcpu, struct kvm_m
> > > >         sp->private_spt = kvm_mmu_memory_cache_alloc(&vcpu-
> > > > >arch.mmu_private_spt_cache);
> > > >   }
> > > > +static inline gfn_t kvm_gfn_for_root(struct kvm *kvm, struct kvm_mmu_page
> > > > *root,
> > > > +                                    gfn_t gfn)
> > > > +{
> > > > +       gfn_t gfn_for_root = kvm_gfn_to_private(kvm, gfn);
> > > > +
> > > > +       /* Set shared bit if not private */
> > > > +       gfn_for_root |= -(gfn_t)!is_private_sp(root) &
> > > > kvm_gfn_shared_mask(kvm);
> > > > +       return gfn_for_root;
> > > > +}
> > > > +
> > > >   static inline bool kvm_mmu_page_ad_need_write_protect(struct
> > > > kvm_mmu_page *sp)
> > > >   {
> > > >         /*
> > > > @@ -348,7 +360,12 @@ static inline int __kvm_mmu_do_page_fault(struct
> > > > kvm_vcpu *vcpu, gpa_t cr2_or_gp
> > > >         int r;
> > > >         if (vcpu->arch.mmu->root_role.direct) {
> > > > -               fault.gfn = fault.addr >> PAGE_SHIFT;
> > > > +               /*
> > > > +                * Things like memslots don't understand the concept of a
> > > > shared
> > > > +                * bit. Strip it so that the GFN can be used like normal,
> > > > and the
> > > > +                * fault.addr can be used when the shared bit is needed.
> > > > +                */
> > > > +               fault.gfn = gpa_to_gfn(fault.addr) &
> > > > ~kvm_gfn_shared_mask(vcpu->kvm);
> > > >                 fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
> > > 
> > > > Again, I don't think it's necessary for fault.gfn to still have the shared bit
> > > here?
> > > 
> > > This kinda usage is pretty much the reason I want to get rid of
> > > kvm_gfn_shared_mask().
> > 
> > We are going to flags like has_mirrored_pt and we have root page table
> > iterator
> > with types specified.  I'll investigate how we can reduce (or eliminate)
> > those helper functions.
> 
> Let's transition the abusers off and see what's left. I'm still waiting for an
> explanation of why they are bad when used properly.

Sure. Let's untangle things one by one.


> [snip]
> > 
> > > >         /* The level of the root page given to the iterator */
> > > >         int root_level;
> > > 
> > > [...]
> > > 
> > > >         for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
> > > > @@ -1029,8 +1209,8 @@ static int tdp_mmu_map_handle_target_level(struct
> > > > kvm_vcpu *vcpu,
> > > >                 new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
> > > >         else
> > > >                 wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter-
> > > > >gfn,
> > > > -                                        fault->pfn, iter->old_spte,
> > > > fault->prefetch, true,
> > > > -                                        fault->map_writable, &new_spte);
> > > > +                                       fault->pfn, iter->old_spte, fault-
> > > > >prefetch, true,
> > > > +                                       fault->map_writable, &new_spte);
> > > >         if (new_spte == iter->old_spte)
> > > >                 ret = RET_PF_SPURIOUS;
> > > > @@ -1108,6 +1288,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct
> > > > kvm_page_fault *fault)
> > > >         struct kvm *kvm = vcpu->kvm;
> > > >         struct tdp_iter iter;
> > > >         struct kvm_mmu_page *sp;
> > > > +       gfn_t raw_gfn;
> > > > +       bool is_private = fault->is_private && kvm_gfn_shared_mask(kvm);
> > > 
> > > Ditto.  I wish we can have 'has_mirrored_private_pt'.
> > 
> > Which name do you prefer? has_mirrored_pt or has_mirrored_private_pt?
> 
> Why not helpers that wrap vm_type like:
> https://lore.kernel.org/kvm/d4c96caffd2633a70a140861d91794cdb54c7655.camel@intel.com/

I followed the existing way.  Anyway I'm fine with either way.
Isaku Yamahata May 16, 2024, 7:42 p.m. UTC | #13
On Thu, May 16, 2024 at 04:36:48PM +0000,
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:

> On Thu, 2024-05-16 at 13:04 +0000, Huang, Kai wrote:
> > On Thu, 2024-05-16 at 02:57 +0000, Edgecombe, Rick P wrote:
> > > On Thu, 2024-05-16 at 14:07 +1200, Huang, Kai wrote:
> > > > 
> > > > I meant it seems we should just strip shared bit away from the GPA in 
> > > > handle_ept_violation() and pass it as 'cr2_or_gpa' here, so fault->addr 
> > > > won't have the shared bit.
> > > > 
> > > > Do you see any problem of doing so?
> > > 
> > > We would need to add it back in "raw_gfn" in kvm_tdp_mmu_map().
> > 
> > I don't see any big difference?
> > 
> > Now in this patch the raw_gfn is directly from fault->addr:
> > 
> >         raw_gfn = gpa_to_gfn(fault->addr);
> > 
> >         tdp_mmu_for_each_pte(iter, mmu, is_private, raw_gfn, raw_gfn+1) {
> >                 ...
> >         }
> > 
> > But there's nothing wrong to get the raw_gfn from the fault->gfn.  In
> > fact, the zapping code just does this:
> > 
> >         /*
> >          * start and end doesn't have GFN shared bit.  This function zaps
> >          * a region including alias.  Adjust shared bit of [start, end) if
> >          * the root is shared.
> >          */
> >         start = kvm_gfn_for_root(kvm, root, start);
> >         end = kvm_gfn_for_root(kvm, root, end);
> > 
> > So there's nothing wrong to just do the same thing in both functions.
> > 
> > The point is fault->gfn has shared bit stripped away at the beginning, and
> > AFAICT there's no useful reason to keep shared bit in fault->addr.  The
> > entire @fault is a temporary structure on the stack during fault handling
> > anyway.
> 
> I would like to avoid code churn at this point if there is not a real clear
> benefit.
> 
> One small benefit of keeping the shared bit in the fault->addr is that it is
> sort of consistent with how that field is used in other scenarios in KVM. In
> shadow paging it's not even the GPA. So it is simply the "fault address" and has
> to be interpreted in different ways in the fault handler. For TDX the fault
> address *does* include the shared bit. And the EPT needs to be faulted in at
> that address.
> 
> If we strip the shared bit when setting fault->addr we have to reconstruct it
> when we do the actual shared mapping. There is no way around that. Which helper
> does it, isn't important I think. Doing the reconstruction inside
> tdp_mmu_for_each_pte() could be neat, except that it doesn't know about the
> shared bit position.
> 
> The zapping code's use of kvm_gfn_for_root() is different because the gfn comes
> without the shared bit. It's not stripped and then added back. Those are
> operations that target GFNs really.
> 
> I think the real problem is that we are gleaning whether the fault is to private
> or shared memory from different things. Sometimes from fault->is_private,
> sometimes the presence of the shared bits, and sometimes the role bit. I think
> this is confusing, doubly so because we are using some of these things to infer
> unrelated things (mirrored vs private).

It's confusing that we don't check it in a uniform way.


> My guess is that you have noticed this and somehow zeroed in on the shared_mask.
> I think we should straighten out the mirrored/private semantics and see what the
> results look like. How does that sound to you?

I had a closer look at the related code.  I think we can (mostly) uniformly use
gpa/gfn without the shared mask.  Here is the proposal.  We need a real patch to
see what the outcome looks like anyway.  I think this is like what Kai is
thinking about.


- rename role.is_private => role.is_mirrored_pt

- sp->gfn: gfn without shared bit.

- fault->addr: without gfn_shared_mask
  Actually it doesn't matter much.  We can use gpa with gfn_shared_mask.

- Update struct tdp_iter
  struct tdp_iter
    gfn: gfn without shared bit

    /* Add new members */

    /* Indicates which PT to walk. */
    bool mirrored_pt;

    // This is used by tdp_iter_refresh_sptep()
    // shared gfn_mask if mirrored_pt
    // 0 if !mirrored_pt
    gfn_shared_mask

- Pass mirrored_pt and gfn_shared_mask to
  tdp_iter_start(..., mirrored_pt, gfn_shared_mask)

  and update tdp_iter_refresh_sptep()
  static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
        ...
        iter->sptep = iter->pt_path[iter->level - 1] +
                SPTE_INDEX((iter->gfn << PAGE_SHIFT) | iter->gfn_shared_mask, iter->level);

  Change for_each_tdp_pte_min_level() accordingly.
  Also update the iterators that call this.
   
  #define for_each_tdp_pte_min_level(kvm, iter, root, min_level, start, end)      \
          for (tdp_iter_start(&iter, root, min_level, start,                      \
               mirrored_root, mirrored_root ? kvm_gfn_shared_mask(kvm) : 0);      \
               iter.valid && iter.gfn < kvm_gfn_for_root(kvm, root, end);         \
               tdp_iter_next(&iter))

- trace point: update to include mirrored_pt, or leave it as is for now.

- pr_err() that logs gfn in handle_changed_spte():
  update to include mirrored_pt, or leave it as is for now.

- Update spte handler (handle_changed_spte(), handle_removed_pt()...),
  use iter->mirrored_pt or pass down mirrored_pt.
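
For illustration only, the tdp_iter_refresh_sptep() idea above can be sketched
in plain userspace C. This is not kernel code: struct demo_iter,
demo_spte_index() and the DEMO_* constants are hypothetical stand-ins, and the
sketch assumes 4K pages, 9 index bits per level, and that the mask is OR'ed in
at GFN granularity before shifting to an address (an assumption about the
intent of the proposal, which writes the OR after the shift).

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t gfn_t;

#define DEMO_PAGE_SHIFT  12  /* assumed: 4K pages */
#define DEMO_LEVEL_BITS  9   /* assumed: 512 entries per table */

/* Hypothetical cut-down iterator; the real struct tdp_iter has more fields. */
struct demo_iter {
	gfn_t gfn;             /* GFN without the shared bit */
	gfn_t gfn_shared_mask; /* 0, or the VM's shared mask for this walk */
	int level;             /* 1 = 4K leaf level */
};

/* Index into the page table at iter->level, with the mask folded back in. */
static unsigned int demo_spte_index(const struct demo_iter *iter)
{
	uint64_t addr;

	assert(iter->level >= 1);
	addr = (iter->gfn | iter->gfn_shared_mask) << DEMO_PAGE_SHIFT;
	return (addr >> (DEMO_PAGE_SHIFT + (iter->level - 1) * DEMO_LEVEL_BITS)) &
	       ((1u << DEMO_LEVEL_BITS) - 1);
}
```

With this shape, iter->gfn stays free of the shared bit for everything except
the index computation, which is the only place the walk needs it.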
Rick Edgecombe May 17, 2024, 2:35 a.m. UTC | #14
Here is a diff of an attempt to merge all the feedback so far. It's on top of
the dev branch from this series.

On Thu, 2024-05-16 at 12:42 -0700, Isaku Yamahata wrote:
> - rename role.is_private => role.is_mirrored_pt

Agreed.

> 
> - sp->gfn: gfn without shared bit.
> 
> - fault->address: without gfn_shared_mask
>   Actually it doesn't matter much.  We can use gpa with gfn_shared_mask.

I left fault->addr with shared bits. It's not used anymore for TDX except in the
tracepoint which I think makes sense.

> 
> - Update struct tdp_iter
>   struct tdp_iter
>     gfn: gfn without shared bit
> 
>     /* Add new members */
> 
>     /* Indicates which PT to walk. */
>     bool mirrored_pt;
> 
>     // This is used tdp_iter_refresh_sptep()
>     // shared gfn_mask if mirrored_pt
>     // 0 if !mirrored_pt
>     gfn_shared_mask
> 
> - Pass mirrored_pt and gfn_shared_mask to
>   tdp_iter_start(..., mirrored_pt, gfn_shared_mask)
> 
>   and update tdp_iter_refresh_sptep()
>   static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
>         ...
>         iter->sptep = iter->pt_path[iter->level - 1] +
>                 SPTE_INDEX((iter->gfn << PAGE_SHIFT) | iter->gfn_shared_mask, iter->level);

I tried something else. The iterators still have gfn's with shared bits, but the
addition of the shared bit is wrapped in tdp_mmu_for_each_pte(), so
kvm_tdp_mmu_map() and similar don't have to handle the shared bits. They just
pass in a root, and tdp_mmu_for_each_pte() knows how to adjust the GFN. Like:

#define tdp_mmu_for_each_pte(_iter, _kvm, _root, _start, _end)	\
	for_each_tdp_pte(_iter, _root,	\
			 kvm_gfn_for_root(_kvm, _root, _start), \
			 kvm_gfn_for_root(_kvm, _root, _end))

I also changed the callers to use the new enum to specify roots. This way they
can pass something with a nice name instead of true/false for bool private.

Keeping a gfn_shared_mask inside the iterator didn't seem any clearer to me, and
it was a bit more cumbersome. But please compare it.
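
To make the GFN adjustment concrete, here is a small userspace sketch of the
branchless pattern kvm_gfn_for_root() uses. It is not the kernel helper itself:
struct demo_vm, struct demo_root and demo_gfn_for_root() are hypothetical
stand-ins for struct kvm, struct kvm_mmu_page and the real function, and the
mask value used in any caller is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Hypothetical stand-ins for struct kvm and struct kvm_mmu_page. */
struct demo_vm {
	gfn_t gfn_shared_mask; /* 0 for VMs without a shared bit */
};

struct demo_root {
	bool mirrored; /* true for the mirror root, false for the shared root */
};

static gfn_t demo_gfn_for_root(const struct demo_vm *vm,
			       const struct demo_root *root, gfn_t gfn)
{
	/* Always start from the GFN with the shared bit stripped. */
	gfn_t gfn_for_root = gfn & ~vm->gfn_shared_mask;

	/*
	 * -(gfn_t)1 is all ones and -(gfn_t)0 is zero, so the mask is OR'ed
	 * in only when the root is not mirrored, without a branch.
	 */
	gfn_for_root |= -(gfn_t)!root->mirrored & vm->gfn_shared_mask;
	return gfn_for_root;
}
```

A caller passes the same plain GFN for either root and gets back the GFN that
root is indexed by; for a VM whose mask is 0 the function is an identity.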

> 
>   Change for_each_tdp_pte_min_level() accordingly.
>   Also update the iterators that call this.
>
>   #define for_each_tdp_pte_min_level(kvm, iter, root, min_level, start, end)      \
>           for (tdp_iter_start(&iter, root, min_level, start,                      \
>                mirrored_root, mirrored_root ? kvm_gfn_shared_mask(kvm) : 0);      \
>                iter.valid && iter.gfn < kvm_gfn_for_root(kvm, root, end);         \
>                tdp_iter_next(&iter))

I liked it a lot because the callers don't need to manually call
kvm_gfn_for_root() anymore. But I tried it and it required adding kvm to a lot
of the iterator call sites. I ended up removing it, but I'm not sure.

> 
> - trace point: update to include mirrored_pt. Or Leave it as is for now.
> 
> - pr_err() that log gfn in handle_changed_spte()
>   Update to include mirrored_pt. Or Leave it as is for now.

I left it, as fault->addr still has shared bit.

> 
> - Update spte handler (handle_changed_spte(), handle_removed_pt()...),
>   use iter->mirror_pt or pass down mirror_pt.

You mean just rename it, or something else?


Anyway below is a first cut based on the discussion.

A few other things:
1. kvm_is_private_gpa() is moved into Intel code. kvm_gfn_shared_mask() remains
for only two operations in common code:
 - kvm_gfn_for_root() <- required for zapping/mapping
 - Stripping the bit when setting fault.gfn <- possible to remove if we strip
cr2_or_gpa
2. I also played with changing KVM_PRIVATE_ROOTS to KVM_MIRROR_ROOTS.
Unfortunately there is still some confusion between private and mirrored. For
example you walk a mirror root (what is actually happening), but you have to
allocate private page tables as you do, as well as call out to x86_ops named
private. So those concepts are effectively linked and used a bit
interchangeably.
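
As a rough illustration of the root-type selection and matching used in the
diff below, here is a userspace sketch. The demo_* names are hypothetical, and
the bit values for the shared and mirror types are assumptions (the diff only
asserts that the mirror value is the shared value shifted left by one and that
the valid flag is BIT(2)).

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed bit layout; stand-ins for enum kvm_tdp_mmu_root_types. */
enum demo_root_types {
	DEMO_SHARED_ROOTS = 1 << 0,
	DEMO_MIRROR_ROOTS = 1 << 1,
	DEMO_VALID_ROOTS  = 1 << 2,
	DEMO_ANY_ROOTS    = DEMO_SHARED_ROOTS | DEMO_MIRROR_ROOTS,
};

/* Hypothetical stand-in for the role bits of a root struct kvm_mmu_page. */
struct demo_root {
	bool mirrored;
	bool invalid;
};

/* Private faults walk the mirror root only if the VM has one. */
static enum demo_root_types demo_get_root_type(bool fault_is_private,
					       bool vm_has_mirrored_tdp)
{
	if (fault_is_private && vm_has_mirrored_tdp)
		return DEMO_MIRROR_ROOTS;
	return DEMO_SHARED_ROOTS;
}

/* Does this root belong to the set of roots selected by @types? */
static bool demo_root_match(const struct demo_root *root, unsigned int types)
{
	/* At least one of shared/mirror must be requested. */
	assert(types & DEMO_ANY_ROOTS);

	if ((types & DEMO_VALID_ROOTS) && root->invalid)
		return false;
	if ((types & DEMO_SHARED_ROOTS) && !root->mirrored)
		return true;
	if ((types & DEMO_MIRROR_ROOTS) && root->mirrored)
		return true;
	return false;
}
```

The point of the enum is that call sites name the root they want instead of
passing true/false for a bool private argument.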

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e35a446baaad..64af6fd7cf85 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -351,7 +351,7 @@ union kvm_mmu_page_role {
                unsigned ad_disabled:1;
                unsigned guest_mode:1;
                unsigned passthrough:1;
-               unsigned is_private:1;
+               unsigned mirrored_pt:1;
                unsigned :4;
 
                /*
@@ -364,14 +364,14 @@ union kvm_mmu_page_role {
        };
 };
 
-static inline bool kvm_mmu_page_role_is_private(union kvm_mmu_page_role role)
+static inline bool kvm_mmu_page_role_is_mirrored(union kvm_mmu_page_role role)
 {
-       return !!role.is_private;
+       return !!role.mirrored_pt;
 }
 
-static inline void kvm_mmu_page_role_set_private(union kvm_mmu_page_role *role)
+static inline void kvm_mmu_page_role_set_mirrored(union kvm_mmu_page_role *role)
 {
-       role->is_private = 1;
+       role->mirrored_pt = 1;
 }
 
 /*
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index a578ea09dfb3..0c08b4f9093c 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -338,21 +338,26 @@ static inline gfn_t kvm_gfn_shared_mask(const struct kvm
*kvm)
        return kvm->arch.gfn_shared_mask;
 }
 
-static inline gfn_t kvm_gfn_to_shared(const struct kvm *kvm, gfn_t gfn)
-{
-       return gfn | kvm_gfn_shared_mask(kvm);
-}
-
 static inline gfn_t kvm_gfn_to_private(const struct kvm *kvm, gfn_t gfn)
 {
        return gfn & ~kvm_gfn_shared_mask(kvm);
 }
 
-static inline bool kvm_is_private_gpa(const struct kvm *kvm, gpa_t gpa)
-{
-       gfn_t mask = kvm_gfn_shared_mask(kvm);
 
-       return mask && !(gpa_to_gfn(gpa) & mask);
+/* The VM keeps a mirrored copy of the private memory */
+static inline bool kvm_has_mirrored_tdp(const struct kvm *kvm)
+{
+       return kvm->arch.vm_type == KVM_X86_TDX_VM;
+}
+
+static inline bool kvm_has_private_root(const struct kvm *kvm)
+{
+       return kvm->arch.vm_type == KVM_X86_TDX_VM;
+}
+
+static inline bool kvm_zap_leafs_only(const struct kvm *kvm)
+{
+       return kvm->arch.vm_type == KVM_X86_TDX_VM;
 }
 
 #endif
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3d291c5d2d50..c6a0af5aefce 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -686,7 +686,7 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu,
bool maybe_indirect)
                                       1 + PT64_ROOT_MAX_LEVEL +
PTE_PREFETCH_NUM);
        if (r)
                return r;
-       if (kvm_gfn_shared_mask(vcpu->kvm)) {
+       if (kvm_has_mirrored_tdp(vcpu->kvm)) {
                r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_private_spt_cache,
                                               PT64_ROOT_MAX_LEVEL);
                if (r)
@@ -3702,7 +3702,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
        int r;
 
        if (tdp_mmu_enabled) {
-               if (kvm_gfn_shared_mask(vcpu->kvm))
+               if (kvm_has_private_root(vcpu->kvm))
                        kvm_tdp_mmu_alloc_root(vcpu, true);
                kvm_tdp_mmu_alloc_root(vcpu, false);
                return 0;
@@ -6539,17 +6539,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start,
gfn_t gfn_end)
 
        flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
 
-       if (tdp_mmu_enabled) {
-               /*
-                * kvm_zap_gfn_range() is used when MTRR or PAT memory
-                * type was changed.  TDX can't handle zapping the private
-                * mapping, but it's ok because KVM doesn't support either of
-                * those features for TDX. In case a new caller appears, BUG
-                * the VM if it's called for solutions with private aliases.
-                */
-               KVM_BUG_ON(kvm_gfn_shared_mask(kvm), kvm);
+       if (tdp_mmu_enabled)
                flush = kvm_tdp_mmu_zap_leafs(kvm, gfn_start, gfn_end, flush);
-       }
 
        if (flush)
                kvm_flush_remote_tlbs_range(kvm, gfn_start, gfn_end -
gfn_start);
@@ -6996,10 +6987,38 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm)
        kvm_mmu_zap_all(kvm);
 }
 
+static void kvm_mmu_zap_memslot_leafs(struct kvm *kvm, struct kvm_memory_slot
*slot)
+{
+       if (KVM_BUG_ON(!tdp_mmu_enabled, kvm))
+               return;
+
+       write_lock(&kvm->mmu_lock);
+
+       /*
+        * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
+        * case scenario we'll have unused shadow pages lying around until they
+        * are recycled due to age or when the VM is destroyed.
+        */
+       struct kvm_gfn_range range = {
+               .slot = slot,
+               .start = slot->base_gfn,
+               .end = slot->base_gfn + slot->npages,
+               .may_block = true,
+       };
+
+       if (kvm_tdp_mmu_unmap_gfn_range(kvm, &range, false))
+               kvm_flush_remote_tlbs(kvm);
+
+       write_unlock(&kvm->mmu_lock);
+}
+
 void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
                                   struct kvm_memory_slot *slot)
 {
-       kvm_mmu_zap_all_fast(kvm);
+       if (kvm_zap_leafs_only(kvm))
+               kvm_mmu_zap_memslot_leafs(kvm, slot);
+       else
+               kvm_mmu_zap_all_fast(kvm);
 }
 
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 3a7fe9261e23..2b1b2a980b03 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -159,9 +159,9 @@ static inline int kvm_mmu_page_as_id(struct kvm_mmu_page
*sp)
        return kvm_mmu_role_as_id(sp->role);
 }
 
-static inline bool is_private_sp(const struct kvm_mmu_page *sp)
+static inline bool is_mirrored_sp(const struct kvm_mmu_page *sp)
 {
-       return kvm_mmu_page_role_is_private(sp->role);
+       return kvm_mmu_page_role_is_mirrored(sp->role);
 }
 
 static inline void *kvm_mmu_private_spt(struct kvm_mmu_page *sp)
@@ -186,7 +186,7 @@ static inline gfn_t kvm_gfn_for_root(struct kvm *kvm, struct
kvm_mmu_page *root,
        gfn_t gfn_for_root = kvm_gfn_to_private(kvm, gfn);
 
        /* Set shared bit if not private */
-       gfn_for_root |= -(gfn_t)!is_private_sp(root) & kvm_gfn_shared_mask(kvm);
+       gfn_for_root |= -(gfn_t)!is_mirrored_sp(root) & kvm_gfn_shared_mask(kvm);
        return gfn_for_root;
 }
 
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 5eae8eac2da0..d0d13a4317e8 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -74,9 +74,6 @@ u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned
int access)
        u64 spte = generation_mmio_spte_mask(gen);
        u64 gpa = gfn << PAGE_SHIFT;
 
-       WARN_ON_ONCE(!vcpu->kvm->arch.shadow_mmio_value &&
-                    !kvm_gfn_shared_mask(vcpu->kvm));
-
        access &= shadow_mmio_access_mask;
        spte |= vcpu->kvm->arch.shadow_mmio_value | access;
        spte |= gpa | shadow_nonpresent_or_rsvd_mask;
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index d0df691ced5c..17d3f1593a24 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -265,9 +265,9 @@ static inline struct kvm_mmu_page *root_to_sp(hpa_t root)
        return spte_to_child_sp(root);
 }
 
-static inline bool is_private_sptep(u64 *sptep)
+static inline bool is_mirrored_sptep(u64 *sptep)
 {
-       return is_private_sp(sptep_to_sp(sptep));
+       return is_mirrored_sp(sptep_to_sp(sptep));
 }
 
 static inline bool is_mmio_spte(struct kvm *kvm, u64 spte)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 42ccafc7deff..7f13016e210b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -97,15 +97,15 @@ static bool tdp_mmu_root_match(struct kvm_mmu_page *root,
 {
        if (WARN_ON_ONCE(types == BUGGY_KVM_ROOTS))
                return false;
-       if (WARN_ON_ONCE(!(types & (KVM_SHARED_ROOTS | KVM_PRIVATE_ROOTS))))
+       if (WARN_ON_ONCE(!(types & (KVM_SHARED_ROOTS | KVM_MIRROR_ROOTS))))
                return false;
 
        if ((types & KVM_VALID_ROOTS) && root->role.invalid)
                return false;
 
-       if ((types & KVM_SHARED_ROOTS) && !is_private_sp(root))
+       if ((types & KVM_SHARED_ROOTS) && !is_mirrored_sp(root))
                return true;
-       if ((types & KVM_PRIVATE_ROOTS) && is_private_sp(root))
+       if ((types & KVM_MIRROR_ROOTS) && is_mirrored_sp(root))
                return true;
 
        return false;
@@ -252,7 +252,7 @@ void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool
private)
        struct kvm_mmu_page *root;
 
        if (private)
-               kvm_mmu_page_role_set_private(&role);
+               kvm_mmu_page_role_set_mirrored(&role);
 
        /*
         * Check for an existing root before acquiring the pages lock to avoid
@@ -446,7 +446,7 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t
pt, bool shared)
                                    shared);
        }
 
-       if (is_private_sp(sp) &&
+       if (is_mirrored_sp(sp) &&
            WARN_ON(static_call(kvm_x86_free_private_spt)(kvm, sp->gfn, sp->role.level,
                                                          kvm_mmu_private_spt(sp)))) {
                /*
@@ -580,7 +580,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id,
gfn_t gfn,
                                u64 old_spte, u64 new_spte,
                                union kvm_mmu_page_role role, bool shared)
 {
-       bool is_private = kvm_mmu_page_role_is_private(role);
+       bool is_mirrored = kvm_mmu_page_role_is_mirrored(role);
        int level = role.level;
        bool was_present = is_shadow_present_pte(old_spte);
        bool is_present = is_shadow_present_pte(new_spte);
@@ -665,12 +665,12 @@ static void handle_changed_spte(struct kvm *kvm, int
as_id, gfn_t gfn,
         */
        if (was_present && !was_leaf &&
            (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
-               KVM_BUG_ON(is_private != is_private_sptep(spte_to_child_pt(old_spte, level)),
+               KVM_BUG_ON(is_mirrored != is_mirrored_sptep(spte_to_child_pt(old_spte, level)),
                           kvm);
                handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
        }
 
-       if (is_private && !is_present)
+       if (is_mirrored && !is_present)
                handle_removed_private_spte(kvm, gfn, old_spte, new_spte,
role.level);
 
        if (was_leaf && is_accessed_spte(old_spte) &&
@@ -690,7 +690,7 @@ static inline int __tdp_mmu_set_spte_atomic(struct kvm *kvm,
struct tdp_iter *it
         */
        WARN_ON_ONCE(iter->yielded || is_removed_spte(iter->old_spte));
 
-       if (is_private_sptep(iter->sptep) && !is_removed_spte(new_spte)) {
+       if (is_mirrored_sptep(iter->sptep) && !is_removed_spte(new_spte)) {
                int ret;
 
                if (is_shadow_present_pte(new_spte)) {
@@ -840,7 +840,7 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id,
tdp_ptep_t sptep,
        WARN_ON_ONCE(is_removed_spte(old_spte) || is_removed_spte(new_spte));
 
        old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
-       if (is_private_sptep(sptep) && !is_removed_spte(new_spte) &&
+       if (is_mirrored_sptep(sptep) && !is_removed_spte(new_spte) &&
            is_shadow_present_pte(new_spte)) {
                /* Because write spin lock is held, no race.  It should success.
*/
                KVM_BUG_ON(__set_private_spte_present(kvm, sptep, gfn, old_spte,
@@ -872,11 +872,10 @@ static inline void tdp_mmu_iter_set_spte(struct kvm *kvm,
struct tdp_iter *iter,
                        continue;                                       \
                else
 
-#define tdp_mmu_for_each_pte(_iter, _mmu, _private, _start, _end)      \
-       for_each_tdp_pte(_iter,                                         \
-                root_to_sp((_private) ? _mmu->private_root_hpa :       \
-                               _mmu->root.hpa),                        \
-               _start, _end)
+#define tdp_mmu_for_each_pte(_iter, _kvm, _root, _start, _end) \
+       for_each_tdp_pte(_iter, _root,  \
+                        kvm_gfn_for_root(_kvm, _root, _start), \
+                        kvm_gfn_for_root(_kvm, _root, _end))
 
 /*
  * Yield if the MMU lock is contended or this thread needs to return control
@@ -1307,12 +1306,11 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm,
struct tdp_iter *iter,
  */
 int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
-       struct kvm_mmu *mmu = vcpu->arch.mmu;
        struct kvm *kvm = vcpu->kvm;
+       enum kvm_tdp_mmu_root_types root_type = tdp_mmu_get_root_type(kvm, fault);
+       struct kvm_mmu_page *root;
        struct tdp_iter iter;
        struct kvm_mmu_page *sp;
-       gfn_t raw_gfn;
-       bool is_private = fault->is_private && kvm_gfn_shared_mask(kvm);
        int ret = RET_PF_RETRY;
 
        kvm_mmu_hugepage_adjust(vcpu, fault);
@@ -1321,9 +1319,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct
kvm_page_fault *fault)
 
        rcu_read_lock();
 
-       raw_gfn = gpa_to_gfn(fault->addr);
-
-       tdp_mmu_for_each_pte(iter, mmu, is_private, raw_gfn, raw_gfn + 1) {
+       root = tdp_mmu_get_root(vcpu, root_type);
+       tdp_mmu_for_each_pte(iter, kvm, root, fault->gfn, fault->gfn + 1) {
                int r;
 
                if (fault->nx_huge_page_workaround_enabled)
@@ -1349,7 +1346,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct
kvm_page_fault *fault)
                 * needs to be split.
                 */
                sp = tdp_mmu_alloc_sp(vcpu);
-               if (kvm_is_private_gpa(kvm, raw_gfn << PAGE_SHIFT))
+               if (root_type == KVM_MIRROR_ROOTS)
                        kvm_mmu_alloc_private_spt(vcpu, sp);
                tdp_mmu_init_child_sp(sp, &iter);
 
@@ -1360,7 +1357,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct
kvm_page_fault *fault)
                         * TODO: large page support.
                         * Doesn't support large page for TDX now
                         */
-                       KVM_BUG_ON(is_private_sptep(iter.sptep), vcpu->kvm);
+                       KVM_BUG_ON(is_mirrored_sptep(iter.sptep), vcpu->kvm);
                        r = tdp_mmu_split_huge_page(kvm, &iter, sp, true);
                } else {
                        r = tdp_mmu_link_sp(kvm, &iter, sp, true);
@@ -1405,7 +1402,7 @@ static enum kvm_tdp_mmu_root_types
kvm_process_to_root_types(struct kvm *kvm,
        WARN_ON_ONCE(process == BUGGY_KVM_INVALIDATION);
 
        /* Always process shared for cases where private is not on a separate
root */
-       if (!kvm_gfn_shared_mask(kvm)) {
+       if (!kvm_has_private_root(kvm)) {
                process |= KVM_PROCESS_SHARED;
                process &= ~KVM_PROCESS_PRIVATE;
        }
@@ -2022,14 +2019,14 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
  * Must be called between kvm_tdp_mmu_walk_lockless_{begin,end}.
  */
 static int __kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
-                                 bool is_private)
+                                 enum kvm_tdp_mmu_root_types root_type)
 {
+       struct kvm_mmu_page *root = tdp_mmu_get_root(vcpu, root_type);
        struct tdp_iter iter;
-       struct kvm_mmu *mmu = vcpu->arch.mmu;
        gfn_t gfn = addr >> PAGE_SHIFT;
        int leaf = -1;
 
-       tdp_mmu_for_each_pte(iter, mmu, is_private, gfn, gfn + 1) {
+       tdp_mmu_for_each_pte(iter, vcpu->kvm, root, gfn, gfn + 1) {
                leaf = iter.level;
                sptes[leaf] = iter.old_spte;
        }
@@ -2042,7 +2039,7 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr,
u64 *sptes,
 {
        *root_level = vcpu->arch.mmu->root_role.level;
 
-       return __kvm_tdp_mmu_get_walk(vcpu, addr, sptes, false);
+       return __kvm_tdp_mmu_get_walk(vcpu, addr, sptes, KVM_SHARED_ROOTS);
 }
 
 int kvm_tdp_mmu_get_walk_private_pfn(struct kvm_vcpu *vcpu, u64 gpa,
@@ -2054,7 +2051,7 @@ int kvm_tdp_mmu_get_walk_private_pfn(struct kvm_vcpu
*vcpu, u64 gpa,
        lockdep_assert_held(&vcpu->kvm->mmu_lock);
 
        rcu_read_lock();
-       leaf = __kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, true);
+       leaf = __kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, KVM_MIRROR_ROOTS);
        rcu_read_unlock();
        if (leaf < 0)
                return -ENOENT;
@@ -2082,15 +2079,12 @@ EXPORT_SYMBOL_GPL(kvm_tdp_mmu_get_walk_private_pfn);
 u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, u64 addr,
                                        u64 *spte)
 {
+       struct kvm_mmu_page *root = tdp_mmu_get_root(vcpu, KVM_SHARED_ROOTS);
        struct tdp_iter iter;
-       struct kvm_mmu *mmu = vcpu->arch.mmu;
        gfn_t gfn = addr >> PAGE_SHIFT;
        tdp_ptep_t sptep = NULL;
 
-       /* fast page fault for private GPA isn't supported. */
-       WARN_ON_ONCE(kvm_is_private_gpa(vcpu->kvm, addr));
-
-       tdp_mmu_for_each_pte(iter, mmu, false, gfn, gfn + 1) {
+       tdp_mmu_for_each_pte(iter, vcpu->kvm, root, gfn, gfn + 1) {
                *spte = iter.old_spte;
                sptep = iter.sptep;
        }
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index b8a967426fac..40f5f9753131 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -22,15 +22,30 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct
kvm_mmu_page *root);
 enum kvm_tdp_mmu_root_types {
        BUGGY_KVM_ROOTS = BUGGY_KVM_INVALIDATION,
        KVM_SHARED_ROOTS = KVM_PROCESS_SHARED,
-       KVM_PRIVATE_ROOTS = KVM_PROCESS_PRIVATE,
+       KVM_MIRROR_ROOTS = KVM_PROCESS_PRIVATE,
        KVM_VALID_ROOTS = BIT(2),
-       KVM_ANY_VALID_ROOTS = KVM_SHARED_ROOTS | KVM_PRIVATE_ROOTS | KVM_VALID_ROOTS,
-       KVM_ANY_ROOTS = KVM_SHARED_ROOTS | KVM_PRIVATE_ROOTS,
+       KVM_ANY_VALID_ROOTS = KVM_SHARED_ROOTS | KVM_MIRROR_ROOTS | KVM_VALID_ROOTS,
+       KVM_ANY_ROOTS = KVM_SHARED_ROOTS | KVM_MIRROR_ROOTS,
 };
 
 static_assert(!(KVM_SHARED_ROOTS & KVM_VALID_ROOTS));
-static_assert(!(KVM_PRIVATE_ROOTS & KVM_VALID_ROOTS));
-static_assert(KVM_PRIVATE_ROOTS == (KVM_SHARED_ROOTS << 1));
+static_assert(!(KVM_MIRROR_ROOTS & KVM_VALID_ROOTS));
+static_assert(KVM_MIRROR_ROOTS == (KVM_SHARED_ROOTS << 1));
+
+static inline enum kvm_tdp_mmu_root_types tdp_mmu_get_root_type(struct kvm *kvm,
+                                                               struct kvm_page_fault *fault)
+{
+       if (fault->is_private && kvm_has_mirrored_tdp(kvm))
+               return KVM_MIRROR_ROOTS;
+       return KVM_SHARED_ROOTS;
+}
+
+static inline struct kvm_mmu_page *tdp_mmu_get_root(struct kvm_vcpu *vcpu, enum kvm_tdp_mmu_root_types type)
+{
+       if (type == KVM_MIRROR_ROOTS)
+               return root_to_sp(vcpu->arch.mmu->private_root_hpa);
+       return root_to_sp(vcpu->arch.mmu->root.hpa);
+}
 
 bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool
flush);
 bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index 7fdc67835e06..b4e324fe55c5 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -69,6 +69,14 @@ static inline void
vmx_handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
        vcpu->arch.at_instruction_boundary = true;
 }
 
+
+static inline bool gpa_on_private_root(const struct kvm *kvm, gpa_t gpa)
+{
+       gfn_t mask = kvm_gfn_shared_mask(kvm);
+
+       return kvm_has_private_root(kvm) && !(gpa_to_gfn(gpa) & mask);
+}
+
 static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
                                             unsigned long exit_qualification)
 {
@@ -90,7 +98,7 @@ static inline int __vmx_handle_ept_violation(struct kvm_vcpu
*vcpu, gpa_t gpa,
        error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) != 0 ?
               PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
 
-       if (kvm_is_private_gpa(vcpu->kvm, gpa))
+       if (gpa_on_private_root(vcpu->kvm, gpa))
                error_code |= PFERR_PRIVATE_ACCESS;
 
        return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index bfb939826276..d7626f80b7f7 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1772,7 +1772,7 @@ static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
 {
        unsigned long exit_qual;
 
-       if (kvm_is_private_gpa(vcpu->kvm, tdexit_gpa(vcpu))) {
+       if (gpa_on_private_root(vcpu->kvm, tdexit_gpa(vcpu))) {
                /*
                 * Always treat SEPT violations as write faults.  Ignore the
                 * EXIT_QUALIFICATION reported by TDX-SEAM for SEPT violations.
@@ -2967,8 +2967,8 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu,
struct kvm_tdx_cmd *c
        if (!PAGE_ALIGNED(region.source_addr) || !PAGE_ALIGNED(region.gpa) ||
            !region.nr_pages ||
            region.gpa + (region.nr_pages << PAGE_SHIFT) <= region.gpa ||
-           !kvm_is_private_gpa(kvm, region.gpa) ||
-           !kvm_is_private_gpa(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT)))
+           !gpa_on_private_root(kvm, region.gpa) ||
+           !gpa_on_private_root(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT)))
                return -EINVAL;
 
        mutex_lock(&kvm->slots_lock);
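
For readers following the rename, the check can be sketched in isolation. This is only an illustration, not the kernel code: `struct kvm_sketch`, its fields, and the `sketch_` helpers are names made up for the example. Only the shape of the test (the VM has a private root, and the shared bit is clear in the GFN) comes from the hunk above.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gpa_t;
typedef uint64_t gfn_t;

#define SKETCH_PAGE_SHIFT 12

/* Hypothetical stand-in for struct kvm; the field names are illustrative. */
struct kvm_sketch {
	bool has_private_root;	/* assumed: true for a TDX-like VM */
	gfn_t gfn_shared_mask;	/* assumed: one bit set, 0 for a normal VM */
};

static inline gfn_t sketch_gpa_to_gfn(gpa_t gpa)
{
	return gpa >> SKETCH_PAGE_SHIFT;
}

/* Same logic as gpa_on_private_root() in the hunk above. */
static inline bool sketch_gpa_on_private_root(const struct kvm_sketch *kvm,
					      gpa_t gpa)
{
	return kvm->has_private_root &&
	       !(sketch_gpa_to_gfn(gpa) & kvm->gfn_shared_mask);
}
```

With the shared bit assumed at GPA bit 47, an address with that bit set is not on the private root, and a VM without a private root never reports a private GPA.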
Kai Huang May 17, 2024, 2:36 a.m. UTC | #15
On 17/05/2024 7:42 am, Isaku Yamahata wrote:
> On Thu, May 16, 2024 at 04:36:48PM +0000,
> "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:
> 
>> On Thu, 2024-05-16 at 13:04 +0000, Huang, Kai wrote:
>>> On Thu, 2024-05-16 at 02:57 +0000, Edgecombe, Rick P wrote:
>>>> On Thu, 2024-05-16 at 14:07 +1200, Huang, Kai wrote:
>>>>>
>>>>> I meant it seems we should just strip shared bit away from the GPA in
>>>>> handle_ept_violation() and pass it as 'cr2_or_gpa' here, so fault->addr
>>>>> won't have the shared bit.
>>>>>
>>>>> Do you see any problem of doing so?
>>>>
>>>> We would need to add it back in "raw_gfn" in kvm_tdp_mmu_map().
>>>
>>> I don't see any big difference?
>>>
>>> Now in this patch the raw_gfn is directly from fault->addr:
>>>
>>>          raw_gfn = gpa_to_gfn(fault->addr);
>>>
>>>          tdp_mmu_for_each_pte(iter, mmu, is_private, raw_gfn, raw_gfn+1) {
>>>                  ...
>>>          }
>>>
>>> But there's nothing wrong to get the raw_gfn from the fault->gfn.  In
>>> fact, the zapping code just does this:
>>>
>>>          /*
>>>           * start and end doesn't have GFN shared bit.  This function zaps
>>>           * a region including alias.  Adjust shared bit of [start, end) if
>>>           * the root is shared.
>>>           */
>>>          start = kvm_gfn_for_root(kvm, root, start);
>>>          end = kvm_gfn_for_root(kvm, root, end);
>>>
>>> So there's nothing wrong to just do the same thing in both functions.
>>>
>>> The point is fault->gfn has shared bit stripped away at the beginning, and
>>> AFAICT there's no useful reason to keep shared bit in fault->addr.  The
>>> entire @fault is a temporary structure on the stack during fault handling
>>> anyway.
>>
>> I would like to avoid code churn at this point if there is not a real clear
>> benefit.
>>
>> One small benefit of keeping the shared bit in the fault->addr is that it is
>> sort of consistent with how that field is used in other scenarios in KVM. In
>> shadow paging it's not even the GPA. So it is simply the "fault address" and has
>> to be interpreted in different ways in the fault handler. For TDX the fault
>> address *does* include the shared bit. And the EPT needs to be faulted in at
>> that address.

It's about how we want to define the semantic of fault->addr (forget 
about shadow MMU because for it fault->addr has different meaning from TDP):

1) It represents the faulting address which points to the actual guest 
memory (has no shared bit).

2) It represents the faulting address which is truly used as the 
hardware page table walk.

The fault->gfn always represents the location of actual guest memory 
(w/o shared bit) in both cases.

I originally thought 2) isn't consistent for both SNP and TDX, but 
thinking more, I now think actually both the two semantics are 
consistent for SNP and TDX, which is important in order to avoid confusion.

Anyway it's trivial because fault->addr is only used in the fault 
handling path.

So yes fine to me we choose to use 2) here.  But IMHO we should 
explicitly add a comment to 'struct kvm_page_fault' that the @addr 
represents 2).
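
A rough sketch of what such a comment could look like, assuming semantic 2). The names `struct fault_sketch`, `fault_sketch_init()` and the `shared_mask` parameter are invented for illustration and do not reflect the real 'struct kvm_page_fault' layout:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gpa_t;
typedef uint64_t gfn_t;

#define SKETCH_PAGE_SHIFT 12

/* Hypothetical, trimmed-down stand-in for 'struct kvm_page_fault'. */
struct fault_sketch {
	/*
	 * The faulting address as used for the hardware page table walk.
	 * For TDX this includes the shared bit (semantic 2 above).
	 */
	gpa_t addr;
	/* Location of the actual guest memory: shared bit always stripped. */
	gfn_t gfn;
	/* Assumption for the sketch: a clear shared bit means private. */
	bool is_private;
};

static inline void fault_sketch_init(struct fault_sketch *fault, gpa_t gpa,
				     gfn_t shared_mask)
{
	gfn_t raw_gfn = gpa >> SKETCH_PAGE_SHIFT;

	fault->addr = gpa;			/* kept as reported */
	fault->gfn = raw_gfn & ~shared_mask;	/* shared bit removed */
	fault->is_private = shared_mask && !(raw_gfn & shared_mask);
}
```

The point of the sketch is only that @addr and @gfn can legitimately differ by the shared bit, which is what the comment would need to spell out.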

And I think more important thing is how we handle this "gfn" and 
"raw_gfn" in tdp_iter and 'struct kvm_mmu_page'.  See below.

>>
>> If we strip the shared bit when setting fault->addr we have to reconstruct it
>> when we do the actual shared mapping. There is no way around that. Which helper
>> does it, isn't important I think. Doing the reconstruction inside
>> tdp_mmu_for_each_pte() could be neat, except that it doesn't know about the
>> shared bit position.
>>
>> The zapping code's use of kvm_gfn_for_root() is different because the gfn comes
>> without the shared bit. It's not stripped and then added back. Those are
>> operations that target GFNs really.
>>
>> I think the real problem is that we are gleaning whether the fault is to private
>> or shared memory from different things. Sometimes from fault->is_private,
>> sometimes the presence of the shared bits, and sometimes the role bit. I think
>> this is confusing, doubly so because we are using some of these things to infer
>> unrelated things (mirrored vs private).
> 
> It's confusing we don't check it in uniform way.
> 
> 
>> My guess is that you have noticed this and somehow zeroed in on the shared_mask.
>> I think we should straighten out the mirrored/private semantics and see what the
>> results look like. How does that sound to you?
> 
> I had closer look of the related code.  I think we can (mostly) uniformly use
> gpa/gfn without shared mask.  Here is the proposal.  We need a real patch to see
> how the outcome looks like anyway.  I think this is like what Kai is thinking
> about.
> 
> 
> - rename role.is_private => role.is_mirrored_pt
> 
> - sp->gfn: gfn without shared bit.

I think we can treat 'tdp_iter' and 'struct kvm_mmu_page' in the same 
way, because conceptually they both reflect the page table.

So I think both of them can have "gfn" or "raw_gfn", and "shared_gfn_mask".

Or have both "raw_gfn" and "gfn" but w/o "shared_gfn_mask". This may be 
more straightforward because we can just use them when needed w/o 
needing to play with gfn_shared_mask.

> 
> - fault->address: without gfn_shared_mask
>    Actually it doesn't matter much.  We can use gpa with gfn_shared_mask.

See above.  I think it makes sense to include the shared bit here.  It's 
trivial anyway though.

> 
> - Update struct tdp_iter
>    struct tdp_iter
>      gfn: gfn without shared bit

Or "raw_gfn"?

Which may be more straightforward because it can be just from:

	raw_gfn = gpa_to_gfn(fault->addr);
	gfn = fault->gfn;

	tdp_mmu_for_each_pte(..., raw_gfn, raw_gfn + 1, gfn)

Which is the reason to make fault->addr include the shared bit AFAICT.

> 
>      /* Add new members */
> 
>      /* Indicates which PT to walk. */
>      bool mirrored_pt;

I don't think you need this?  It's only used to select the root for page 
table walk.  Once it's done, we already have the @sptep to operate on.

And I think you can just get @mirrored_pt from the sptep:

	mirrored_pt = sptep_to_sp(sptep)->role.is_mirrored_pt;

Instead, I think we should keep the @is_private to indicate whether the 
GFN is private or not, which should be distinguished with 'mirrored_pt', 
which the root page table (and the @sptep) already reflects.

Of course if the @root/@sptep is mirrored_pt, the is_private should be 
always true, like:

	WARN_ON_ONCE(sptep_to_sp(sptep)->role.is_mirrored_pt
			&& !is_private);

Am I missing anything?

> 
>      // This is used tdp_iter_refresh_sptep()
>      // shared gfn_mask if mirrored_pt
>      // 0 if !mirrored_pt
>      gfn_shared_mask
>
> - Pass mirrored_pt and gfn_shared_mask to
>    tdp_iter_start(..., mirrored_pt, gfn_shared_mask)

As mentioned above, I am not sure whether we need @mirrored_pt, because 
we already have the @root.  Instead we should pass is_private, which 
indicates the GFN is private.

> 
>    and update tdp_iter_refresh_sptep()
>    static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
>          ...
>          iter->sptep = iter->pt_path[iter->level - 1] +
>                  SPTE_INDEX((iter->gfn << PAGE_SHIFT) | iter->gfn_shared_mask, iter->level);
> 
>    Change for_each_tdp_pte_min_level() accordingly.
>    Also the iterator to call this.
>     
>    #define for_each_tdp_pte_min_level(kvm, iter, root, min_level, start, end)      \
>            for (tdp_iter_start(&iter, root, min_level, start,                      \
>                 mirrored_root, mirrored_root ? kvm_gfn_shared_mask(kvm) : 0);      \
>                 iter.valid && iter.gfn < kvm_gfn_for_root(kvm, root, end);         \
>                 tdp_iter_next(&iter))

See above.

> 
> - trace point: update to include mirrored_pt. Or leave it as is for now.
> 
> - pr_err() that log gfn in handle_changed_spte()
>    Update to include mirrored_pt. Or Leave it as is for now.
> 
> - Update spte handler (handle_changed_spte(), handle_removed_pt()...),
>    use iter->mirror_pt or pass down mirror_pt.
> 

IIUC only sp->role.is_mirrored_pt is needed, tdp_iter->is_mirrored_pt 
isn't necessary.  But when the @sp is created, we need to initialize 
whether it is mirrored_pt.

Am I missing anything?
Isaku Yamahata May 17, 2024, 8:14 a.m. UTC | #16
On Fri, May 17, 2024 at 02:36:43PM +1200,
"Huang, Kai" <kai.huang@intel.com> wrote:

> On 17/05/2024 7:42 am, Isaku Yamahata wrote:
> > On Thu, May 16, 2024 at 04:36:48PM +0000,
> > "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:
> > 
> > > On Thu, 2024-05-16 at 13:04 +0000, Huang, Kai wrote:
> > > > On Thu, 2024-05-16 at 02:57 +0000, Edgecombe, Rick P wrote:
> > > > > On Thu, 2024-05-16 at 14:07 +1200, Huang, Kai wrote:
> > > > > > 
> > > > > > I meant it seems we should just strip shared bit away from the GPA in
> > > > > > handle_ept_violation() and pass it as 'cr2_or_gpa' here, so fault->addr
> > > > > > won't have the shared bit.
> > > > > > 
> > > > > > Do you see any problem of doing so?
> > > > > 
> > > > > We would need to add it back in "raw_gfn" in kvm_tdp_mmu_map().
> > > > 
> > > > I don't see any big difference?
> > > > 
> > > > Now in this patch the raw_gfn is directly from fault->addr:
> > > > 
> > > >          raw_gfn = gpa_to_gfn(fault->addr);
> > > > 
> > > >          tdp_mmu_for_each_pte(iter, mmu, is_private, raw_gfn, raw_gfn+1) {
> > > >                  ...
> > > >          }
> > > > 
> > > > But there's nothing wrong to get the raw_gfn from the fault->gfn.  In
> > > > fact, the zapping code just does this:
> > > > 
> > > >          /*
> > > >           * start and end doesn't have GFN shared bit.  This function zaps
> > > >           * a region including alias.  Adjust shared bit of [start, end) if
> > > >           * the root is shared.
> > > >           */
> > > >          start = kvm_gfn_for_root(kvm, root, start);
> > > >          end = kvm_gfn_for_root(kvm, root, end);
> > > > 
> > > > So there's nothing wrong to just do the same thing in both functions.
> > > > 
> > > > The point is fault->gfn has shared bit stripped away at the beginning, and
> > > > AFAICT there's no useful reason to keep shared bit in fault->addr.  The
> > > > entire @fault is a temporary structure on the stack during fault handling
> > > > anyway.
> > > 
> > > I would like to avoid code churn at this point if there is not a real clear
> > > benefit.
> > > 
> > > One small benefit of keeping the shared bit in the fault->addr is that it is
> > > sort of consistent with how that field is used in other scenarios in KVM. In
> > > shadow paging it's not even the GPA. So it is simply the "fault address" and has
> > > to be interpreted in different ways in the fault handler. For TDX the fault
> > > address *does* include the shared bit. And the EPT needs to be faulted in at
> > > that address.
> 
> It's about how we want to define the semantic of fault->addr (forget about
> shadow MMU because for it fault->addr has different meaning from TDP):
> 
> 1) It represents the faulting address which points to the actual guest
> memory (has no shared bit).
> 
> 2) It represents the faulting address which is truly used as the hardware
> page table walk.
> 
> The fault->gfn always represents the location of actual guest memory (w/o
> shared bit) in both cases.
> 
> I originally thought 2) isn't consistent for both SNP and TDX, but thinking
> more, I now think actually both the two semantics are consistent for SNP and
> TDX, which is important in order to avoid confusion.
> 
> Anyway it's trivial because fault->addr is only used in the fault handling
> path.
> 
> So yes fine to me we choose to use 2) here.  But IMHO we should explicitly
> add a comment to 'struct kvm_page_fault' that the @addr represents 2).

Ok. I'm fine with 2).


> And I think more important thing is how we handle this "gfn" and "raw_gfn"
> in tdp_iter and 'struct kvm_mmu_page'.  See below.
> 
> > > 
> > > If we strip the shared bit when setting fault->addr we have to reconstruct it
> > > when we do the actual shared mapping. There is no way around that. Which helper
> > > does it, isn't important I think. Doing the reconstruction inside
> > > tdp_mmu_for_each_pte() could be neat, except that it doesn't know about the
> > > shared bit position.
> > > 
> > > The zapping code's use of kvm_gfn_for_root() is different because the gfn comes
> > > without the shared bit. It's not stripped and then added back. Those are
> > > operations that target GFNs really.
> > > 
> > > I think the real problem is that we are gleaning whether the fault is to private
> > > or shared memory from different things. Sometimes from fault->is_private,
> > > sometimes the presence of the shared bits, and sometimes the role bit. I think
> > > this is confusing, doubly so because we are using some of these things to infer
> > > unrelated things (mirrored vs private).
> > 
> > It's confusing we don't check it in uniform way.
> > 
> > 
> > > My guess is that you have noticed this and somehow zeroed in on the shared_mask.
> > > I think we should straighten out the mirrored/private semantics and see what the
> > > results look like. How does that sound to you?
> > 
> > I had closer look of the related code.  I think we can (mostly) uniformly use
> > gpa/gfn without shared mask.  Here is the proposal.  We need a real patch to see
> > how the outcome looks like anyway.  I think this is like what Kai is thinking
> > about.
> > 
> > 
> > - rename role.is_private => role.is_mirrored_pt
> > 
> > - sp->gfn: gfn without shared bit.
> 
> I think we can treat 'tdp_iter' and 'struct kvm_mmu_page' in the same way,
> because conceptually they both reflect the page table.

Agreed that iter->gfn and sp->gfn should be treated the same way.


> So I think both of them can have "gfn" or "raw_gfn", and "shared_gfn_mask".
> 
> Or have both "raw_gfn" and "gfn" but w/o "shared_gfn_mask". This may be more
> straightforward because we can just use them when needed w/o needing to play
> with gfn_shared_mask.
> 
> > 
> > - fault->address: without gfn_shared_mask
> >    Actually it doesn't matter much.  We can use gpa with gfn_shared_mask.
> 
> See above.  I think it makes sense to include the shared bit here.  It's
> trivial anyway though.

Ok, let's make fault->addr include shared mask.


> > - Update struct tdp_iter
> >    struct tdp_iter
> >      gfn: gfn without shared bit
> 
> Or "raw_gfn"?
> 
> Which may be more straightforward because it can be just from:
> 
> 	raw_gfn = gpa_to_gfn(fault->addr);
> 	gfn = fault->gfn;
> 
> 	tdp_mmu_for_each_pte(..., raw_gfn, raw_gfn + 1, gfn)
> 
> Which is the reason to make fault->addr include the shared bit AFAICT.

If we can eliminate raw_gfn and kvm_gfn_for_root(), it's better.


> > 
> >      /* Add new members */
> > 
> >      /* Indicates which PT to walk. */
> >      bool mirrored_pt;
> 
> I don't think you need this?  It's only used to select the root for page
> table walk.  Once it's done, we already have the @sptep to operate on.
> 
> And I think you can just get @mirrored_pt from the sptep:
> 
> 	mirrored_pt = sptep_to_sp(sptep)->role.mirrored_pt;
> 
> Instead, I think we should keep the @is_private to indicate whether the GFN
> is private or not, which should be distinguished with 'mirrored_pt', which
> the root page table (and the @sptep) already reflects.
> 
> Of course if the @root/@sptep is mirrored_pt, the is_private should be
> always true, like:
> 
> 	WARN_ON_ONCE(sptep_to_sp(sptep)->role.is_mirrored_pt
> 			&& !is_private);
> 
> Am I missing anything?

You said it's not correct to use role, so I tried to find a way to pass down
is_mirrored and avoid using role.

Did you change your mind? Or are you fine with the new name is_mirrored?

https://lore.kernel.org/kvm/4ba18e4e-5971-4683-82eb-63c985e98e6b@intel.com/
  > I don't think using kvm_mmu_page.role is correct.



> > 
> >      // This is used tdp_iter_refresh_sptep()
> >      // shared gfn_mask if mirrored_pt
> >      // 0 if !mirrored_pt
> >      gfn_shared_mask
> > 
> > - Pass mirrored_pt and gfn_shared_mask to
> >    tdp_iter_start(..., mirrored_pt, gfn_shared_mask)
> 
> As mentioned above, I am not sure whether we need @mirrored_pt, because we
> already have the @root.  Instead we should pass is_private, which indicates
> the GFN is private.

If we can use role, iter.mirrored_pt isn't needed.


> > - trace point: update to include mirrored_pt. Or leave it as is for now.
> > 
> > - pr_err() that log gfn in handle_changed_spte()
> >    Update to include mirrored_pt. Or Leave it as is for now.
> > 
> > - Update spte handler (handle_changed_spte(), handle_removed_pt()...),
> >    use iter->mirror_pt or pass down mirror_pt.
> > 
> 
> IIUC only sp->role.is_mirrored_pt is needed, tdp_iter->is_mirrored_pt isn't
> necessary.  But when the @sp is created, we need to initialize whether it is
> mirrored_pt.
> 
> Am I missing anything?

Because you didn't want to use role, I tried to find another way.
Isaku Yamahata May 17, 2024, 9:03 a.m. UTC | #17
On Fri, May 17, 2024 at 02:35:46AM +0000,
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:

> Here is a diff of an attempt to merge all the feedback so far. It's on top of
> the dev branch from this series.
> 
> On Thu, 2024-05-16 at 12:42 -0700, Isaku Yamahata wrote:
> > - rename role.is_private => role.is_mirrored_pt
> 
> Agreed.
> 
> > 
> > - sp->gfn: gfn without shared bit.
> > 
> > - fault->address: without gfn_shared_mask
> >   Actually it doesn't matter much.  We can use gpa with gfn_shared_mask.
> 
> I left fault->addr with shared bits. It's not used anymore for TDX except in the
> tracepoint which I think makes sense.

As discussed with Kai [1], make fault->addr represent the real fault address.

[1] https://lore.kernel.org/kvm/20240517081440.GM168153@ls.amr.corp.intel.com/

> 
> > 
> > - Update struct tdp_iter
> >   struct tdp_iter
> >     gfn: gfn without shared bit
> > 
> >     /* Add new members */
> > 
> >     /* Indicates which PT to walk. */
> >     bool mirrored_pt;
> > 
> >     // This is used tdp_iter_refresh_sptep()
> >     // shared gfn_mask if mirrored_pt
> >     // 0 if !mirrored_pt
> >     gfn_shared_mask
> > 
> > - Pass mirrored_pt and gfn_shared_mask to
> >   tdp_iter_start(..., mirrored_pt, gfn_shared_mask)
> > 
> >   and update tdp_iter_refresh_sptep()
> >   static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
> >         ...
> >         iter->sptep = iter->pt_path[iter->level - 1] +
> >                 SPTE_INDEX((iter->gfn << PAGE_SHIFT) | iter->gfn_shared_mask,
> > iter->level);
> 
> I tried something else. The iterators still have gfn's with shared bits, but the
> addition of the shared bit is wrapped in tdp_mmu_for_each_pte(), so
> kvm_tdp_mmu_map() and similar don't have to handle the shared bits. They just
> pass in a root, and tdp_mmu_for_each_pte() knows how to adjust the GFN. Like:
> 
> #define tdp_mmu_for_each_pte(_iter, _kvm, _root, _start, _end)	\
> 	for_each_tdp_pte(_iter, _root,	\
> 			 kvm_gfn_for_root(_kvm, _root, _start), \
> 			 kvm_gfn_for_root(_kvm, _root, _end))

I'm wondering whether we can remove kvm_gfn_for_root() entirely.
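
For reference, a stand-alone sketch of what the helper computes, modelled on the definition of kvm_gfn_for_root() that appears further down in this thread. The parameters `root_is_mirrored` and `shared_mask` replace the kvm/root lookups, and clearing the mask first is an assumption about what kvm_gfn_to_private() does:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/*
 * Sketch of kvm_gfn_for_root(): return the GFN as it must be walked in the
 * given root.  Mirrored root: shared bit clear.  Shared root: shared bit set.
 */
static inline gfn_t sketch_gfn_for_root(bool root_is_mirrored,
					gfn_t shared_mask, gfn_t gfn)
{
	/* Assumed behaviour of kvm_gfn_to_private(): strip the shared bit. */
	gfn_t gfn_for_root = gfn & ~shared_mask;

	/* -(gfn_t)1 is all ones and -(gfn_t)0 is zero: a branchless select. */
	gfn_for_root |= -(gfn_t)!root_is_mirrored & shared_mask;
	return gfn_for_root;
}
```

Every caller that walks a shared root has to remember to apply this adjustment to both ends of its range, which is the burden that removing the helper would take away.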


> I also changed the callers to use the new enum to specify roots. This way they
> can pass something with a nice name instead of true/false for bool private.

This is nice.


> Keeping a gfn_shared_mask inside the iterator didn't seem more clear to me, and
> bit more cumbersome. But please compare it.
> 
> > 
> >   Change for_each_tdp_pte_min_level() accordingly.
> >   Also the iterator to call this.
> >    
> >   #define for_each_tdp_pte_min_level(kvm, iter, root, min_level, start,
> > end)      \
> >           for (tdp_iter_start(&iter, root, min_level,
> > start,                      \
> >                mirrored_root, mirrored_root ? kvm_gfn_shared_mask(kvm) :
> > 0);      \
> >                iter.valid && iter.gfn < kvm_gfn_for_root(kvm, root,
> > end);         \
> >                tdp_iter_next(&iter))
> 
> I liked it a lot because the callers don't need to manually call
> kvm_gfn_for_root() anymore. But I tried it and it required a lot of additions of
> kvm to the iterators call sites. I ended up removing it, but I'm not sure.

...

> > - Update spte handler (handle_changed_spte(), handle_removed_pt()...),
> >   use iter->mirror_pt or pass down mirror_pt.
> 
> You mean just rename it, or something else?

Scratch this. I thought Kai didn't want to use role [2],
but now it seems to be okay [3].

[2] https://lore.kernel.org/kvm/4ba18e4e-5971-4683-82eb-63c985e98e6b@intel.com/
  > I don't think using kvm_mmu_page.role is correct.

[3] https://lore.kernel.org/kvm/20240517081440.GM168153@ls.amr.corp.intel.com/
  > I think you can just get @mirrored_pt from the sptep:
  >  mirrored_pt = sptep_to_sp(sptep)->role.mirrored_pt;


> Anyway below is a first cut based on the discussion.
> 
> A few other things:
> 1. kvm_is_private_gpa() is moved into Intel code. kvm_gfn_shared_mask() remains
> for only two operations in common code:
>  - kvm_gfn_for_root() <- required for zapping/mapping
>  - Stripping the bit when setting fault.gfn <- possible to remove if we strip
> cr2_or_gpa
> 2. I also played with changing KVM_PRIVATE_ROOTS to KVM_MIRROR_ROOTS.
> Unfortunately there is still some confusion between private and mirrored. For
> example you walk a mirror root (what is actually happening), but you have to
> allocate private page tables as you do, as well as call out to x86_ops named
> private. So those concepts are effectively linked and used a bit
> interchangeably.

On top of your patch, I created the following patch to remove kvm_gfn_for_root().
Although I haven't tested it yet, I think the following shows my idea.

- Add gfn_shared_mask to struct tdp_iter.
- Use iter.gfn_shared_mask to determine the starting sptep in the root.
- Remove kvm_gfn_for_root()
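
The effect of the second item can be sketched outside the kernel. The helpers below are illustrative only: the `sketch_` names are invented, 9 index bits per level and a 4K page shift are assumed, and the shared bit is assumed to sit at GFN bit 35 (GPA bit 47) for the usage that follows.

```c
#include <stdint.h>

typedef uint64_t gfn_t;

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_LEVEL_BITS 9

/* Index into the page table at 'level' (1 = 4K level) for an address. */
static inline unsigned int sketch_spte_index(uint64_t addr, int level)
{
	int shift = SKETCH_PAGE_SHIFT + (level - 1) * SKETCH_LEVEL_BITS;

	return (unsigned int)((addr >> shift) &
			      ((1u << SKETCH_LEVEL_BITS) - 1));
}

/*
 * What tdp_iter_refresh_sptep() would index with after the change:
 * iter->gfn carries no shared bit, the mask is ORed in only here.
 */
static inline unsigned int sketch_iter_index(gfn_t gfn, gfn_t gfn_shared_mask,
					     int level)
{
	return sketch_spte_index((gfn | gfn_shared_mask) << SKETCH_PAGE_SHIFT,
				 level);
}
```

Under those assumptions, GFN 0x1234 lands in root entry 0 with a zero mask and in root entry 256 with the shared mask, while the 4K-level index (0x34) is the same in both cases. Only the level whose index range contains the shared bit is affected.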

---
 arch/x86/kvm/mmu/mmu_internal.h | 10 -------
 arch/x86/kvm/mmu/tdp_iter.c     |  5 ++--
 arch/x86/kvm/mmu/tdp_iter.h     | 16 ++++++-----
 arch/x86/kvm/mmu/tdp_mmu.c      | 48 ++++++++++-----------------------
 4 files changed, 26 insertions(+), 53 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 2b1b2a980b03..9676af0cb133 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -180,16 +180,6 @@ static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, struct kvm_m
 	sp->private_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_private_spt_cache);
 }
 
-static inline gfn_t kvm_gfn_for_root(struct kvm *kvm, struct kvm_mmu_page *root,
-				     gfn_t gfn)
-{
-	gfn_t gfn_for_root = kvm_gfn_to_private(kvm, gfn);
-
-	/* Set shared bit if not private */
-	gfn_for_root |= -(gfn_t)!is_mirrored_sp(root) & kvm_gfn_shared_mask(kvm);
-	return gfn_for_root;
-}
-
 static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
 {
 	/*
diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
index 04c247bfe318..c5f2ca1ceede 100644
--- a/arch/x86/kvm/mmu/tdp_iter.c
+++ b/arch/x86/kvm/mmu/tdp_iter.c
@@ -12,7 +12,7 @@
 static void tdp_iter_refresh_sptep(struct tdp_iter *iter)
 {
 	iter->sptep = iter->pt_path[iter->level - 1] +
-		SPTE_INDEX(iter->gfn << PAGE_SHIFT, iter->level);
+		SPTE_INDEX((iter->gfn | iter->gfn_shared_mask) << PAGE_SHIFT, iter->level);
 	iter->old_spte = kvm_tdp_mmu_read_spte(iter->sptep);
 }
 
@@ -37,7 +37,7 @@ void tdp_iter_restart(struct tdp_iter *iter)
  * rooted at root_pt, starting with the walk to translate next_last_level_gfn.
  */
 void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
-		    int min_level, gfn_t next_last_level_gfn)
+		    int min_level, gfn_t next_last_level_gfn, gfn_t gfn_shared_mask)
 {
 	if (WARN_ON_ONCE(!root || (root->role.level < 1) ||
 			 (root->role.level > PT64_ROOT_MAX_LEVEL))) {
@@ -46,6 +46,7 @@ void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
 	}
 
 	iter->next_last_level_gfn = next_last_level_gfn;
+	iter->gfn_shared_mask = gfn_shared_mask;
 	iter->root_level = root->role.level;
 	iter->min_level = min_level;
 	iter->pt_path[iter->root_level - 1] = (tdp_ptep_t)root->spt;
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index 8a64bcef9deb..274b42707f0a 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -91,8 +91,9 @@ struct tdp_iter {
 	tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
 	/* A pointer to the current SPTE */
 	tdp_ptep_t sptep;
-	/* The lowest GFN (shared bits included) mapped by the current SPTE */
+	/* The lowest GFN (shared bits excluded) mapped by the current SPTE */
 	gfn_t gfn;
+	gfn_t gfn_shared_mask;
 	/* The level of the root page given to the iterator */
 	int root_level;
 	/* The lowest level the iterator should traverse to */
@@ -120,18 +121,19 @@ struct tdp_iter {
  * Iterates over every SPTE mapping the GFN range [start, end) in a
  * preorder traversal.
  */
-#define for_each_tdp_pte_min_level(iter, root, min_level, start, end) \
-	for (tdp_iter_start(&iter, root, min_level, start); \
-	     iter.valid && iter.gfn < end;		     \
+#define for_each_tdp_pte_min_level(iter, kvm, root, min_level, start, end) \
+	for (tdp_iter_start(&iter, root, min_level, start,			\
+			    is_mirrored_sp(root) ? 0: kvm_gfn_shared_mask(kvm)); \
+	     iter.valid && iter.gfn < end;					\
 	     tdp_iter_next(&iter))
 
-#define for_each_tdp_pte(iter, root, start, end) \
-	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end)
+#define for_each_tdp_pte(iter, kvm, root, start, end)				\
+	for_each_tdp_pte_min_level(iter, kvm, root, PG_LEVEL_4K, start, end)
 
 tdp_ptep_t spte_to_child_pt(u64 pte, int level);
 
 void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
-		    int min_level, gfn_t next_last_level_gfn);
+		    int min_level, gfn_t next_last_level_gfn, gfn_t gfn_shared_mask);
 void tdp_iter_next(struct tdp_iter *iter);
 void tdp_iter_restart(struct tdp_iter *iter);
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7f13016e210b..bf7aa87eb593 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -862,20 +862,18 @@ static inline void tdp_mmu_iter_set_spte(struct kvm *kvm, struct tdp_iter *iter,
 					  iter->gfn, iter->level);
 }
 
-#define tdp_root_for_each_pte(_iter, _root, _start, _end) \
-	for_each_tdp_pte(_iter, _root, _start, _end)
+#define tdp_root_for_each_pte(_iter, _kvm, _root, _start, _end)	\
+	for_each_tdp_pte(_iter, _kvm, _root, _start, _end)
 
-#define tdp_root_for_each_leaf_pte(_iter, _root, _start, _end)	\
-	tdp_root_for_each_pte(_iter, _root, _start, _end)		\
+#define tdp_root_for_each_leaf_pte(_iter, _kvm, _root, _start, _end)	\
+	tdp_root_for_each_pte(_iter, _kvm, _root, _start, _end)		\
 		if (!is_shadow_present_pte(_iter.old_spte) ||		\
 		    !is_last_spte(_iter.old_spte, _iter.level))		\
 			continue;					\
 		else
 
 #define tdp_mmu_for_each_pte(_iter, _kvm, _root, _start, _end)	\
-	for_each_tdp_pte(_iter, _root,	\
-			 kvm_gfn_for_root(_kvm, _root, _start), \
-			 kvm_gfn_for_root(_kvm, _root, _end))
+	for_each_tdp_pte(_iter, _kvm, _root, _start, _end)
 
 /*
  * Yield if the MMU lock is contended or this thread needs to return control
@@ -941,7 +939,7 @@ static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
 	gfn_t end = tdp_mmu_max_gfn_exclusive();
 	gfn_t start = 0;
 
-	for_each_tdp_pte_min_level(iter, root, zap_level, start, end) {
+	for_each_tdp_pte_min_level(iter, kvm, root, zap_level, start, end) {
 retry:
 		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
 			continue;
@@ -1043,17 +1041,9 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
 
-	/*
-	 * start and end doesn't have GFN shared bit.  This function zaps
-	 * a region including alias.  Adjust shared bit of [start, end) if the
-	 * root is shared.
-	 */
-	start = kvm_gfn_for_root(kvm, root, start);
-	end = kvm_gfn_for_root(kvm, root, end);
-
 	rcu_read_lock();
 
-	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
+	for_each_tdp_pte_min_level(iter, kvm, root, PG_LEVEL_4K, start, end) {
 		if (can_yield &&
 		    tdp_mmu_iter_cond_resched(kvm, &iter, flush, false)) {
 			flush = false;
@@ -1448,19 +1438,9 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
 	 * into this helper allow blocking; it'd be dead, wasteful code.
 	 */
 	__for_each_tdp_mmu_root(kvm, root, range->slot->as_id, types) {
-		gfn_t start, end;
-
-		/*
-		 * For TDX shared mapping, set GFN shared bit to the range,
-		 * so the handler() doesn't need to set it, to avoid duplicated
-		 * code in multiple handler()s.
-		 */
-		start = kvm_gfn_for_root(kvm, root, range->start);
-		end = kvm_gfn_for_root(kvm, root, range->end);
-
 		rcu_read_lock();
 
-		tdp_root_for_each_leaf_pte(iter, root, start, end)
+		tdp_root_for_each_leaf_pte(iter, kvm, root, range->start, range->end)
 			ret |= handler(kvm, &iter, range);
 
 		rcu_read_unlock();
@@ -1543,7 +1523,7 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 
 	BUG_ON(min_level > KVM_MAX_HUGEPAGE_LEVEL);
 
-	for_each_tdp_pte_min_level(iter, root, min_level, start, end) {
+	for_each_tdp_pte_min_level(iter, kvm, root, min_level, start, end) {
 retry:
 		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
 			continue;
@@ -1706,7 +1686,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 	 * level above the target level (e.g. splitting a 1GB to 512 2MB pages,
 	 * and then splitting each of those to 512 4KB pages).
 	 */
-	for_each_tdp_pte_min_level(iter, root, target_level + 1, start, end) {
+	for_each_tdp_pte_min_level(iter, kvm, root, target_level + 1, start, end) {
 retry:
 		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
 			continue;
@@ -1791,7 +1771,7 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 
 	rcu_read_lock();
 
-	tdp_root_for_each_pte(iter, root, start, end) {
+	tdp_root_for_each_pte(iter, kvm, root, start, end) {
 retry:
 		if (!is_shadow_present_pte(iter.old_spte) ||
 		    !is_last_spte(iter.old_spte, iter.level))
@@ -1846,7 +1826,7 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
 
 	rcu_read_lock();
 
-	tdp_root_for_each_leaf_pte(iter, root, gfn + __ffs(mask),
+	tdp_root_for_each_leaf_pte(iter, kvm, root, gfn + __ffs(mask),
 				    gfn + BITS_PER_LONG) {
 		if (!mask)
 			break;
@@ -1903,7 +1883,7 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 
 	rcu_read_lock();
 
-	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_2M, start, end) {
+	for_each_tdp_pte_min_level(iter, kvm, root, PG_LEVEL_2M, start, end) {
 retry:
 		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
 			continue;
@@ -1973,7 +1953,7 @@ static bool write_protect_gfn(struct kvm *kvm, struct kvm_mmu_page *root,
 
 	rcu_read_lock();
 
-	for_each_tdp_pte_min_level(iter, root, min_level, gfn, gfn + 1) {
+	for_each_tdp_pte_min_level(iter, kvm, root, min_level, gfn, gfn + 1) {
 		if (!is_shadow_present_pte(iter.old_spte) ||
 		    !is_last_spte(iter.old_spte, iter.level))
 			continue;
Rick Edgecombe May 17, 2024, 6:16 p.m. UTC | #18
On Fri, 2024-05-17 at 02:03 -0700, Isaku Yamahata wrote:
> 
> On top of your patch, I created the following patch to remove
> kvm_gfn_for_root().
> Although I haven't tested it yet, I think the following shows my idea.
> 
> - Add gfn_shared_mask to struct tdp_iter.
> - Use iter.gfn_shared_mask to determine the starting sptep in the root.
> - Remove kvm_gfn_for_root()

I investigated it.

After this, gfn_t's never have shared bit. It's a simple rule. The MMU mostly
thinks it's operating on a shared root that is mapped at the normal GFN. Only
the iterator knows that the shared PTEs are actually in a different location.

There are some negative side effects:
1. The struct kvm_mmu_page's gfn doesn't match its actual mapping anymore.
2. As a result of above, the code that flushes TLBs for a specific GFN will be
confused. It won't functionally matter for TDX, just look buggy to see flushing
code called with the wrong gfn.
3. A lot of tracepoints no longer have the "real" gfn
4. mmio spte doesn't have the shared bit, as previous (no effect)
5. Some zapping code (__tdp_mmu_zap_root(), tdp_mmu_zap_leafs()) intends to
actually operate on the raw_gfn. It wants to iterate the whole EPT, so it goes
from 0 to tdp_mmu_max_gfn_exclusive(). So now for mirrored it does, but for
shared it only covers the shared range. Basically kvm_mmu_max_gfn() is wrong if
we pretend shared GFNs are just strangely mapped normal GFNs. Maybe we could
just fix this up to report based on GPAW for TDX? Feels wrong.

On the positive effects side:
1. There is code that passes sp->gfn into things that it shouldn't (if it has
shared bits) like memslot lookups.
2. Also code that passes iter.gfn into things it shouldn't like
kvm_mmu_max_mapping_level().

These places are not called by TDX, but if you know that gfn's might include
shared bits, then that code looks buggy.

I think the solution in the diff is more elegant than before, because it hides
what is really going on with the shared root. That is both good and bad. Can we
accept the downsides?
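
To make the rule above concrete, here is a minimal stand-alone sketch. It is
not the posted diff: struct sketch_iter, sketch_iter_raw_gfn() and
sketch_spte_index() are hypothetical names, and the 9-bits-per-level layout is
assumed. The caller-visible gfn never carries the shared bit, and only the
iterator ORs the mask back in when it picks the entry to walk.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Hypothetical, simplified stand-in for struct tdp_iter. */
struct sketch_iter {
	gfn_t gfn;		/* what the MMU sees: never has the shared bit */
	gfn_t gfn_shared_mask;	/* 0 when walking the mirrored (private) root */
};

/* GFN as it is really mapped in the page table being walked. */
static inline gfn_t sketch_iter_raw_gfn(const struct sketch_iter *iter)
{
	/* The rule: callers never pass a GFN that includes the shared bit. */
	assert(!(iter->gfn & iter->gfn_shared_mask));
	return iter->gfn | iter->gfn_shared_mask;
}

/* Index of the entry for raw_gfn in a table at @level (1 = 4K leaf level). */
static inline unsigned int sketch_spte_index(gfn_t raw_gfn, int level)
{
	assert(level >= 1 && level <= 5);
	return (raw_gfn >> ((level - 1) * 9)) & 0x1ff;
}
```

With the shared bit at GPA bit 47 (gfn bit 35), a shared walk starts in the
upper half of a 4-level root while the rest of the MMU still sees the plain
gfn. That is also why sp->gfn and the tracepoints stop showing the "real" gfn.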
Isaku Yamahata May 17, 2024, 7:16 p.m. UTC | #19
On Fri, May 17, 2024 at 06:16:26PM +0000,
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:

> On Fri, 2024-05-17 at 02:03 -0700, Isaku Yamahata wrote:
> > 
> > On top of your patch, I created the following patch to remove
> > kvm_gfn_for_root().
> > Although I haven't tested it yet, I think the following shows my idea.
> > 
> > - Add gfn_shared_mask to struct tdp_iter.
> > - Use iter.gfn_shared_mask to determine the starting sptep in the root.
> > - Remove kvm_gfn_for_root()
> 
> I investigated it.

Thanks for looking at it.


> After this, gfn_t's never have shared bit. It's a simple rule. The MMU mostly
> thinks it's operating on a shared root that is mapped at the normal GFN. Only
> the iterator knows that the shared PTEs are actually in a different location.
> 
> There are some negative side effects:
> 1. The struct kvm_mmu_page's gfn doesn't match it's actual mapping anymore.
> 2. As a result of above, the code that flushes TLBs for a specific GFN will be
> confused. It won't functionally matter for TDX, just look buggy to see flushing
> code called with the wrong gfn.

flush_remote_tlbs_range() is only for Hyper-V optimization.  In other cases,
x86_op.flush_remote_tlbs_range = NULL or the member isn't defined at compile
time.  So the remote tlb flush falls back to flushing whole range.  I don't
expect TDX in hyper-V guest.  I have to admit that the code looks superficially
broken and confusing.


> 3. A lot of tracepoints no longer have the "real" gfn

Anyway we'd like to sort out trace points and pr_err() eventually because we
already added new pferr flags.


> 4. mmio spte doesn't have the shared bit, as previous (no effect)
> 5. Some zapping code (__tdp_mmu_zap_root(), tdp_mmu_zap_leafs()) intends to
> actually operating on the raw_gfn. It wants to iterate the whole EPT, so it goes
> from 0 to tdp_mmu_max_gfn_exclusive(). So now for mirrored it does, but for
> shared it only covers the shared range. Basically kvm_mmu_max_gfn() is wrong if
> we pretend shared GFNs are just strangely mapped normal GFNs. Maybe we could
> just fix this up to report based on GPAW for TDX? Feels wrong.

Yes, it's broken with kvm_mmu_max_gfn().


> On the positive effects side:
> 1. There is code that passes sp->gfn into things that it shouldn't (if it has
> shared bits) like memslot lookups.
> 2. Also code that passes iter.gfn into things it shouldn't like
> kvm_mmu_max_mapping_level().
> 
> These places are not called by TDX, but if you know that gfn's might include
> shared bits, then that code looks buggy.
> 
> I think the solution in the diff is more elegant then before, because it hides
> what is really going on with the shared root. That is both good and bad. Can we
> accept the downsides?

Kai, do you have any thoughts?
Kai Huang May 18, 2024, 5:42 a.m. UTC | #20
> 
> > > 
> > >      /* Add new members */
> > > 
> > >      /* Indicates which PT to walk. */
> > >      bool mirrored_pt;
> > 
> > I don't think you need this?  It's only used to select the root for page
> > table walk.  Once it's done, we already have the @sptep to operate on.
> > 
> > And I think you can just get @mirrored_pt from the sptep:
> > 
> > 	mirrored_pt = sptep_to_sp(sptep)->role.mirrored_pt;
> > 
> > Instead, I think we should keep the @is_private to indicate whether the GFN
> > is private or not, which should be distinguished with 'mirrored_pt', which
> > the root page table (and the @sptep) already reflects.
> > 
> > Of course if the @root/@sptep is mirrored_pt, the is_private should be
> > always true, like:
> > 
> > 	WARN_ON_ONCE(sptep_to_sp(sptep)->role.is_mirrored_pt
> > 			&& !is_private);
> > 
> > Am I missing anything?
> 
> You said it not correct to use role. So I tried to find a way to pass down
> is_mirrored and avoid to use role.
> 
> Did you change your mind? or you're fine with new name is_mirrored?
> 
> https://lore.kernel.org/kvm/4ba18e4e-5971-4683-82eb-63c985e98e6b@intel.com/
>   > I don't think using kvm_mmu_page.role is correct.
> 
> 

No.  I meant "using kvm_mmu_page.role.mirrored_pt to determine whether to
invoke kvm_x86_ops::xx_private_spt()" is not correct.  Instead, we should
use fault->is_private to determine:

	if (fault->is_private && kvm_x86_ops::xx_private_spt())
		kvm_x86_ops::xx_private_spte();
	else
		// normal TDP MMU operation

The reason is this pattern works not just for TDX, but also for SNP (and
SW_PROTECTED_VM) if they ever need specific page table ops.

Whether we are operating on the mirrored page table or not doesn't matter,
because we have already selected the root page table at the beginning of
kvm_tdp_mmu_map() based on whether the VM needs to use mirrored pt for
private mapping:


	bool mirrored_pt = fault->is_private && kvm_use_mirrored_pt(kvm);

	tdp_mmu_for_each_pte(iter, mmu, mirrored_pt, raw_gfn, raw_gfn + 1) {
		...
	}

#define tdp_mmu_for_each_pte(_iter, _mmu, _mirrored_pt, _start, _end)   \
        for_each_tdp_pte(_iter,                                         \
                 root_to_sp((_mirrored_pt) ? _mmu->private_root_hpa :   \
                                _mmu->root.hpa),                        \
                _start, _end)

If you somehow need the mirrored_pt later when handling the page
fault, you don't need another "mirrored_pt" in tdp_iter, because you can
easily get it from the sptep (or just get it from the root):

	mirrored_pt = sptep_to_sp(sptep)->role.mirrored_pt;

What we really need to pass in is the fault->is_private, because we are
not able to get whether a GPN is private based on kvm_shared_gfn_mask()
for SNP and SW_PROTECTED_VM.

Since the current KVM code mainly passes only @kvm and @iter to many
TDP MMU functions like tdp_mmu_set_spte_atomic(), the easiest way to
convey fault->is_private is to add a new 'is_private' (or even
better, 'is_private_gpa' to be more precise) to tdp_iter.

Otherwise, we either need to explicitly pass the entire @fault (which
might not be a, or @is_private_gpa.

Or perhaps I am missing anything?
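
A stand-alone sketch of one reading of this proposal, so the two decisions are
visible side by side. All names here (struct sketch_vm, struct sketch_iter,
sketch_iter_start(), sketch_set_spte()) are hypothetical and nothing is taken
from the actual patch: the root is chosen from is_private plus whether the VM
uses a mirrored PT, while the vendor hook is chosen from is_private_gpa alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum sketch_root { SKETCH_ROOT_SHARED, SKETCH_ROOT_MIRRORED };

struct sketch_vm {
	bool use_mirrored_pt;		/* assumed true only for TDX-like VMs */
	int (*set_private_spte)(void);	/* vendor hook, may be NULL */
};

struct sketch_iter {
	enum sketch_root root;
	bool is_private_gpa;		/* copied from fault->is_private */
};

/* Counts hook invocations so the behaviour can be observed. */
static int sketch_hook_calls;
static int sketch_count_hook(void) { return ++sketch_hook_calls; }

/* Root selection happens once, at the start of the walk. */
static void sketch_iter_start(struct sketch_iter *iter,
			      const struct sketch_vm *vm, bool fault_is_private)
{
	iter->root = (fault_is_private && vm->use_mirrored_pt) ?
		     SKETCH_ROOT_MIRRORED : SKETCH_ROOT_SHARED;
	iter->is_private_gpa = fault_is_private;
}

/* Returns true if the vendor hook was invoked for this update. */
static bool sketch_set_spte(const struct sketch_vm *vm,
			    const struct sketch_iter *iter)
{
	/* A mirrored root is only ever walked for a private GPA. */
	assert(iter->root != SKETCH_ROOT_MIRRORED || iter->is_private_gpa);

	if (iter->is_private_gpa && vm->set_private_spte) {
		vm->set_private_spte();
		return true;
	}
	return false;	/* normal TDP MMU operation only */
}
```

In this model an SNP-like VM (no mirrored PT, hook installed) still gets the
hook for a private GPA, which is the flexibility argued for above. It does not
model the non-fault paths, where no fault->is_private exists.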
Rick Edgecombe May 18, 2024, 3:41 p.m. UTC | #21
On Sat, 2024-05-18 at 05:42 +0000, Huang, Kai wrote:
> 
> No.  I meant "using kvm_mmu_page.role.mirrored_pt to determine whether to
> invoke kvm_x86_ops::xx_private_spt()" is not correct.

I agree this looks wrong.

>   Instead, we should
> use fault->is_private to determine:
> 
>         if (fault->is_private && kvm_x86_ops::xx_private_spt())
>                 kvm_x86_ops::xx_private_spte();
>         else
>                 // normal TDP MMU operation
> 
> The reason is this pattern works not just for TDX, but also for SNP (and
> SW_PROTECTED_VM) if they ever need specific page table ops.

I think the problem is there are a lot of things that are more on the mirrored
concept side:
 - Allocating the "real" PTE pages (i.e. sp->private_spt)
 - Setting the PTE when the mirror changes
 - Zapping the real PTE when the mirror is zapped (and there is no fault)
 - etc

And on the private side there is just knowing that private faults should operate
on the mirror root.

The xx_private_spte() operations are actually just updating the real PTE for the
mirror. In some ways it doesn't have to be about "private". It could be a mirror
of something else and still need the updates. For SNP and others they don't need
to do anything like that. (AFAIU)

So based on that, I tried to change the naming of xx_private_spt() to reflect
that. Like:
if (role.mirrored)
  update_mirrored_pte()

The TDX code could encapsulate that mirrored updates need to update private EPT.
Then I had a helper that answered the question of whether to handle private
faults on the mirrored root.

The FREEZE stuff actually made a bit more sense too, because it was clear it
wasn't a special TDX private memory thing, but just about the atomicity.

The problem was I couldn't get rid of all special things that are private (can't
remember what now).

I wonder if I should give it a more proper try. What do you think?

At this point, I was just going to change the "mirrored" name to
"private_mirrored". Then code that does either mirrored things or private things
both looks correct. Basically making it clear that the MMU only supports
mirroring private memory.

> 
> Whether we are operating on the mirrored page table or not doesn't matter,
> because we have already selected the root page table at the beginning of
> kvm_tdp_mmu_map() based on whether the VM needs to use mirrored pt for
> private mapping:

I think it does matter, especially for the other operations (not faults). Did
you look at the other things checking the role?

> 
> 
>         bool mirrored_pt = fault->is_private && kvm_use_mirrored_pt(kvm);
> 
>         tdp_mmu_for_each_pte(iter, mmu, mirrored_pt, raw_gfn, raw_gfn +
> 1) 
>         {
>                 ...
>         }
> 
> #define tdp_mmu_for_each_pte(_iter, _mmu, _mirrored_pt, _start, _end)   \
>         for_each_tdp_pte(_iter,                                         \
>                  root_to_sp((_mirrored_pt) ? _mmu->private_root_hpa :   \
>                                 _mmu->root.hpa),                        \
>                 _start, _end)
> 
> If you somehow needs the mirrored_pt in later time when handling the page
> fault, you don't need another "mirrored_pt" in tdp_iter, because you can
> easily get it from the sptep (or just get from the root):
> 
>         mirrored_pt = sptep_to_sp(sptep)->role.mirrored_pt;
> 
> What we really need to pass in is the fault->is_private, because we are
> not able to get whether a GPN is private based on kvm_shared_gfn_mask()
> for SNP and SW_PROTECTED_VM.

SNP and SW_PROTECTED_VM (today) don't need to do anything special here, right?

> 
> Since the current KVM code only mainly passes the @kvm and the @iter for
> many TDP MMU functions like tdp_mmu_set_spte_atomic(), the easiest way to
> convery the fault->is_private is to add a new 'is_private' (or even
> better, 'is_private_gpa' to be more precisely) to tdp_iter.
> 
> Otherwise, we either need to explicitly pass the entire @fault (which
> might not be a, or @is_private_gpa.
> 
> Or perhaps I am missing anything?

I think two things:
 - fault->is_private is only for faults, and we have other cases where we call
out to kvm_x86_ops.xx_private() things.
 - Calling out to update something else is really more about the "mirrored"
concept than about private.
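
A rough stand-alone model of the "mirrored" view, only to illustrate the
second point. struct sketch_sp, sketch_reflect() and sketch_update_spte() are
made-up names, not code from the series. The update path looks at the page's
role, so a zap with no fault in flight is reflected exactly like a fault-time
set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct kvm_mmu_page, reduced to one entry. */
struct sketch_sp {
	bool mirrored;		/* stand-in for role.mirrored */
	uint64_t spte;		/* the mirror entry KVM can see */
	uint64_t real_pte;	/* stand-in for the "real" PTE kept elsewhere */
};

/* Propagate the mirror entry to the real page table. */
static void sketch_reflect(struct sketch_sp *sp)
{
	assert(sp->mirrored);
	sp->real_pte = sp->spte;
}

/* Used for faults and for zapping alike; no fault information is needed. */
static void sketch_update_spte(struct sketch_sp *sp, uint64_t new_spte)
{
	sp->spte = new_spte;
	if (sp->mirrored)
		sketch_reflect(sp);
}
```

Nothing here says "private": the only thing the private side would add is the
decision to send private faults to the mirrored root.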
Kai Huang May 20, 2024, 10:38 a.m. UTC | #22
On Sat, 2024-05-18 at 15:41 +0000, Edgecombe, Rick P wrote:
> On Sat, 2024-05-18 at 05:42 +0000, Huang, Kai wrote:
> > 
> > No.  I meant "using kvm_mmu_page.role.mirrored_pt to determine whether to
> > invoke kvm_x86_ops::xx_private_spt()" is not correct.
> 
> I agree this looks wrong.
> 
> >   Instead, we should
> > use fault->is_private to determine:
> > 
> >         if (fault->is_private && kvm_x86_ops::xx_private_spt())
> >                 kvm_x86_ops::xx_private_spte();
> >         else
> >                 // normal TDP MMU operation
> > 
> > The reason is this pattern works not just for TDX, but also for SNP (and
> > SW_PROTECTED_VM) if they ever need specific page table ops.
> 
> I think the problem is there are a lot of things that are more on the mirrored
> concept side:
>  - Allocating the "real" PTE pages (i.e. sp->private_spt)
>  - Setting the PTE when the mirror changes
>  - Zapping the real PTE when the mirror is zapped (and there is no fault)
>  - etc
> 
> And on the private side there is just knowing that private faults should operate
> on the mirror root.

... and issue SEAMCALL to operate the real private page table?

> 
> The xx_private_spte() operations are actually just updating the real PTE for the
> mirror. In some ways it doesn't have to be about "private". It could be a mirror
> of something else and still need the updates. For SNP and others they don't need
> to do anything like that. (AFAIU)

AFAICT xx_private_spte() should issue SEAMCALL to operate the real private
page table?

> 
> So based on that, I tried to change the naming of xx_private_spt() to reflect
> that. Like:
> if (role.mirrored)
>   update_mirrored_pte()
> 
> The TDX code could encapsulate that mirrored updates need to update private EPT.
> Then I had a helper that answered the question of whether to handle private
> faults on the mirrored root.

I am fine with this too, but I am also fine with the existing pattern:

That we update the mirrored_pt using normal TDP MMU operation, and then
invoke the xx_private_spte() for private GPA.

My only true comment is, to me it seems more reasonable to invoke
xx_private_spte() based on fault->is_private, but not on
'use_mirrored_pt'.

See my reply to your question whether SNP needs special handling below.

> 
> The FREEZE stuff actually made a bit more sense too, because it was clear it
> wasn't a special TDX private memory thing, but just about the atomicity.
> 
> The problem was I couldn't get rid of all special things that are private (can't
> remember what now).
> 
> I wonder if I should give it a more proper try. What do you think?
> 
> At this point, I was just going to change the "mirrored" name to
> "private_mirrored". Then code that does either mirrored things or private things
> both looks correct. Basically making it clear that the MMU only supports
> mirroring private memory.

I don't have preference on name.  "mirrored_private" also works for me.

> 
> > 
> > Whether we are operating on the mirrored page table or not doesn't matter,
> > because we have already selected the root page table at the beginning of
> > kvm_tdp_mmu_map() based on whether the VM needs to use mirrored pt for
> > private mapping:
> 
> I think it does matter, especially for the other operations (not faults). Did
> you look at the other things checking the role?

Yeah I shouldn't say "doesn't matter".  I meant we can get this from the
iter->sptep or the root.

> 
> > 
> > 
> >         bool mirrored_pt = fault->is_private && kvm_use_mirrored_pt(kvm);
> > 
> >         tdp_mmu_for_each_pte(iter, mmu, mirrored_pt, raw_gfn, raw_gfn +
> > 1) 
> >         {
> >                 ...
> >         }
> > 
> > #define tdp_mmu_for_each_pte(_iter, _mmu, _mirrored_pt, _start, _end)   \
> >         for_each_tdp_pte(_iter,                                         \
> >                  root_to_sp((_mirrored_pt) ? _mmu->private_root_hpa :   \
> >                                 _mmu->root.hpa),                        \
> >                 _start, _end)
> > 
> > If you somehow needs the mirrored_pt in later time when handling the page
> > fault, you don't need another "mirrored_pt" in tdp_iter, because you can
> > easily get it from the sptep (or just get from the root):
> > 
> >         mirrored_pt = sptep_to_sp(sptep)->role.mirrored_pt;
> > 
> > What we really need to pass in is the fault->is_private, because we are
> > not able to get whether a GPN is private based on kvm_shared_gfn_mask()
> > for SNP and SW_PROTECTED_VM.
> 
> SNP and SW_PROTECTED_VM (today) don't need do anything special here, right?

Conceptually, I think SNP also needs to at least issue some command(s) to
update the RMP table to reflect the GFN<->PFN relationship.  From this
point, I do see a fit.

I briefly looked into SNP patchset, and I also raised the discussion there
(with you and Isaku copied):

https://lore.kernel.org/lkml/20240501085210.2213060-1-michael.roth@amd.com/T/#m8ca554a6d4bad7fa94dedefcf5914df19c9b8051

I could be wrong, though.
Isaku Yamahata May 20, 2024, 6:58 p.m. UTC | #23
On Mon, May 20, 2024 at 10:38:58AM +0000,
"Huang, Kai" <kai.huang@intel.com> wrote:

> On Sat, 2024-05-18 at 15:41 +0000, Edgecombe, Rick P wrote:
> > On Sat, 2024-05-18 at 05:42 +0000, Huang, Kai wrote:
> > > 
> > > No.  I meant "using kvm_mmu_page.role.mirrored_pt to determine whether to
> > > invoke kvm_x86_ops::xx_private_spt()" is not correct.
> > 
> > I agree this looks wrong.
> > 
> > >   Instead, we should
> > > use fault->is_private to determine:
> > > 
> > >         if (fault->is_private && kvm_x86_ops::xx_private_spt())
> > >                 kvm_x86_ops::xx_private_spte();
> > >         else
> > >                 // normal TDP MMU operation
> > > 
> > > The reason is this pattern works not just for TDX, but also for SNP (and
> > > SW_PROTECTED_VM) if they ever need specific page table ops.

Do you want to decouple invoking hooks from the mirrored PT, and
allow invoking hooks even for shared PT (probably without a
mirrored PT)?  So far I tied the mirrored PT to invoking the hooks as
those hooks are to reflect the changes on mirrored PT to private PT.

Is there any use case to allow hook for shared PT?

- SEV_SNP
  Although I can't speak for SNP folks, I guess they don't need hooks.
  I guess they want to stay away from directly modifying the TDP MMU
  (to add TDP MMU hooks).  Instead, they added hooks to guest_memfd.
  RMP (Reverse mapping table) doesn't have to be consistent with NPT.

  Anyway, I'll reply to
  https://lore.kernel.org/lkml/20240501085210.2213060-1-michael.roth@amd.com/T/#m8ca554a6d4bad7fa94dedefcf5914df19c9b8051
 
- TDX
  I don't see an immediate need to allow hooks for shared PT.

- SW_PROTECTED (today)
  It uses only shared PT and doesn't need hooks.

- SW_PROTECTED (with mirrored PT and shared mask, in theory in the future)
  This would be similar to TDX; we wouldn't need hooks for shared PT.

- SW_PROTECTED (shared PT only, without mirrored PT, in theory in the future)
  I don't see the necessity of hooks for shared PT.
  (Or I don't see the value of this SW_PROTECTED case.)


> > I think the problem is there are a lot of things that are more on the mirrored
> > concept side:
> >  - Allocating the "real" PTE pages (i.e. sp->private_spt)
> >  - Setting the PTE when the mirror changes
> >  - Zapping the real PTE when the mirror is zapped (and there is no fault)
> >  - etc
> > 
> > And on the private side there is just knowing that private faults should operate
> > on the mirror root.
> 
> ... and issue SEAMCALL to operate the real private page table?

For zapping case,
- SEV-SNP
  They use the hook for guest_memfd.
- SW_PROTECTED (with mirrored pt in future in theory)
  This would be similar to TDX.


> > The xx_private_spte() operations are actually just updating the real PTE for the
> > mirror. In some ways it doesn't have to be about "private". It could be a mirror
> > of something else and still need the updates. For SNP and others they don't need
> > to do anything like that. (AFAIU)
> 
> AFAICT xx_private_spte() should issue SEAMCALL to operate the real private
> page table?
> 
> > 
> > So based on that, I tried to change the naming of xx_private_spt() to reflect
> > that. Like:
> > if (role.mirrored)
> >   update_mirrored_pte()
> > 
> > The TDX code could encapsulate that mirrored updates need to update private EPT.
> > Then I had a helper that answered the question of whether to handle private
> > faults on the mirrored root.
> 
> I am fine with this too, but I am also fine with the existing pattern:
> 
> That we update the mirrored_pt using normal TDP MMU operation, and then
> invoke the xx_private_spte() for private GPA.
> 
> My only true comment is, to me it seems more reasonable to invoke
> xx_private_spte() based on fault->is_private, but not on
> 'use_mirrored_pt'.
> 
> See my reply to your question whether SNP needs special handling below.
> 
> > 
> > The FREEZE stuff actually made a bit more sense too, because it was clear it
> > wasn't a special TDX private memory thing, but just about the atomicity.
> > 
> > The problem was I couldn't get rid of all special things that are private (can't
> > remember what now).
> > 
> > I wonder if I should give it a more proper try. What do you think?
> > 
> > At this point, I was just going to change the "mirrored" name to
> > "private_mirrored". Then code that does either mirrored things or private things
> > both looks correct. Basically making it clear that the MMU only supports
> > mirroring private memory.
> 
> I don't have preference on name.  "mirrored_private" also works for me.

For hook names, we can use mirrored_private or reflect or handle?
(or whatever better name)

The current hook names
  {link, free}_private_spt(),
  {set, remove, zap}_private_spte()

=>
  # use mirrored_private
  {link, free}_mirrored_private_spt(),
  {set, remove, zap}_mirrored_private_spte()

  or 
  # use reflect (update or handle?) mirrored to private
  reflect_{linked, freeed}_mirrored_spt(),
  reflect_{set, removed, zapped}_mirrored_spte()

  or 
  # Don't add anything.  I think this would be confusing. 
  {link, free}_spt(),
  {set, remove, zap}_spte()


I think we should also rename the internal functions in TDP MMU.
- handle_removed_private_spte()
- set_private_spte_present()
handle and set are inconsistent. They should have consistent names.

=>
handle_{removed, set}_mirrored_private_spte()
or 
reflect_{removed, set}_mirrored_spte()


> > >         bool mirrored_pt = fault->is_private && kvm_use_mirrored_pt(kvm);
> > > 
> > >         tdp_mmu_for_each_pte(iter, mmu, mirrored_pt, raw_gfn, raw_gfn +
> > > 1) 
> > >         {
> > >                 ...
> > >         }
> > > 
> > > #define tdp_mmu_for_each_pte(_iter, _mmu, _mirrored_pt, _start, _end)   \
> > >         for_each_tdp_pte(_iter,                                         \
> > >                  root_to_sp((_mirrored_pt) ? _mmu->private_root_hpa :   \
> > >                                 _mmu->root.hpa),                        \
> > >                 _start, _end)
> > > 
> > > If you somehow needs the mirrored_pt in later time when handling the page
> > > fault, you don't need another "mirrored_pt" in tdp_iter, because you can
> > > easily get it from the sptep (or just get from the root):
> > > 
> > >         mirrored_pt = sptep_to_sp(sptep)->role.mirrored_pt;
> > > 
> > > What we really need to pass in is the fault->is_private, because we are
> > > not able to get whether a GPN is private based on kvm_shared_gfn_mask()
> > > for SNP and SW_PROTECTED_VM.
> > 
> > SNP and SW_PROTECTED_VM (today) don't need do anything special here, right?
> 
> Conceptually, I think SNP also needs to at least issue some command(s) to
> update the RMP table to reflect the GFN<->PFN relationship.  From this
> point, I do see a fit.
> 
> I briefly looked into SNP patchset, and I also raised the discussion there
> (with you and Isaku copied):
> 
> https://lore.kernel.org/lkml/20240501085210.2213060-1-michael.roth@amd.com/T/#m8ca554a6d4bad7fa94dedefcf5914df19c9b8051
> 
> I could be wrong, though.

I'll reply to it.
Rick Edgecombe May 20, 2024, 7:02 p.m. UTC | #24
On Mon, 2024-05-20 at 11:58 -0700, Isaku Yamahata wrote:
> For hook names, we can use mirrored_private or reflect or handle?
> (or whatever better name)
> 
> The current hook names
>   {link, free}_private_spt(),
>   {set, remove, zap}_private_spte()
> 
> =>
>   # use mirrored_private
>   {link, free}_mirrored_private_spt(),
>   {set, remove, zap}_mirrored_private_spte()
> 
>   or 
>   # use reflect (update or handle?) mirrored to private
>   reflect_{linked, freeed}_mirrored_spt(),
>   reflect_{set, removed, zapped}_mirrored_spte()

reflect is a nice name. I'm trying this path right now. I'll share a branch.

> 
>   or 
>   # Don't add anything.  I think this would be confusing. 
>   {link, free}_spt(),
>   {set, remove, zap}_spte()
Kai Huang May 20, 2024, 10:34 p.m. UTC | #25
On 21/05/2024 6:58 am, Isaku Yamahata wrote:
> On Mon, May 20, 2024 at 10:38:58AM +0000,
> "Huang, Kai" <kai.huang@intel.com> wrote:
> 
>> On Sat, 2024-05-18 at 15:41 +0000, Edgecombe, Rick P wrote:
>>> On Sat, 2024-05-18 at 05:42 +0000, Huang, Kai wrote:
>>>>
>>>> No.  I meant "using kvm_mmu_page.role.mirrored_pt to determine whether to
>>>> invoke kvm_x86_ops::xx_private_spt()" is not correct.
>>>
>>> I agree this looks wrong.
>>>
>>>>    Instead, we should
>>>> use fault->is_private to determine:
>>>>
>>>>          if (fault->is_private && kvm_x86_ops::xx_private_spt())
>>>>                  kvm_x86_ops::xx_private_spte();
>>>>          else
>>>>                  // normal TDP MMU operation
>>>>
>>>> The reason is this pattern works not just for TDX, but also for SNP (and
>>>> SW_PROTECTED_VM) if they ever need specific page table ops.
> 
> Do you want to split the concept from invoking hooks from mirrored PT
> and to allow invoking hooks even for shared PT (probably without
> mirrored PT)?  So far I tied the mirrored PT to invoking the hooks as
> those hooks are to reflect the changes on mirrored PT to private PT.
> 
> Is there any use case to allow hook for shared PT?

To be clear, my intention is to allow hook, if available, for "private 
GPA".  The point here is for "private GPA", but not "shared PT".

> 
> - SEV_SNP
>    Although I can't speak for SNP folks, I guess they don't need hooks.
>    I guess they want to stay away from directly modifying the TDP MMU
>    (to add TDP MMU hooks).  Instead, They added hooks to guest_memfd.
>    RMP (Reverse mapping table) doesn't have to be consistent with NPT.
> 
>    Anyway, I'll reply to
>    https://lore.kernel.org/lkml/20240501085210.2213060-1-michael.roth@amd.com/T/#m8ca554a6d4bad7fa94dedefcf5914df19c9b8051

For SNP _ONLY_ I completely understand.  The point is, TDX needs to 
modify the TDP MMU anyway.  So if SNP can use the hooks added for TDX, and
if in that case we can avoid guest_memfd hooks, then I think it's better?

But I can certainly be, and probably am, wrong, because those 
guest_memfd hooks have been there for a long time.

>   
> TDX
>    I don't see immediate need to allow hooks for shared PT.
>
> SW_PROTECTED (today)
>    It uses only shared PT and don't need hooks.
> 
> SW_PROTECTED (with mirrored pt with shared mask in future in theory)
>    This would be similar to TDX, we wouldn't need hooks for shared PT.
> 
> SW_PROTECTED (shared PT only without mirrored pt in future in theory)
>    I don't see necessity hooks for shared PT.
>    (Or I don't see value of this SW_PROTECTED case.)
> 

I don't think SW_PROTECTED VM will ever need to have any TDP MMU hook, 
because there's no hardware feature behind it.

My intention is for SNP.  Even if SNP doesn't need any TDP MMU hook 
today, I think invoking hook depending on "private GPA", but not 
"private page table" provides more flexibility.  And this also works for 
TDX, regardless whether SNP wants to implement any TDP MMU hook.

So conceptually speaking, I don't see any disadvantage of my proposal, 
regardless whether SNP chooses to use any TDP MMU hook or not.  On the 
other hand, if we choose to "invoke hooks depending on page table type", 
then this code will indeed be only for TDX.
Isaku Yamahata May 20, 2024, 11:32 p.m. UTC | #26
On Fri, May 17, 2024 at 12:16:30PM -0700,
Isaku Yamahata <isaku.yamahata@intel.com> wrote:

> > 4. mmio spte doesn't have the shared bit, as previous (no effect)
> > 5. Some zapping code (__tdp_mmu_zap_root(), tdp_mmu_zap_leafs()) intends to
> > actually operating on the raw_gfn. It wants to iterate the whole EPT, so it goes
> > from 0 to tdp_mmu_max_gfn_exclusive(). So now for mirrored it does, but for
> > shared it only covers the shared range. Basically kvm_mmu_max_gfn() is wrong if
> > we pretend shared GFNs are just strangely mapped normal GFNs. Maybe we could
> > just fix this up to report based on GPAW for TDX? Feels wrong.
> 
> Yes, it's broken with kvm_mmu_max_gfn().

I looked into this one.  I think we need to adjust the value even for the VMX
case.  I have something at the bottom.  What do you think?  I have only
compile-tested it so far.  This is to show the idea.


Based on "Intel Trust Domain CPU Architectural Extensions"
There are four cases to consider.
- TDX Shared-EPT with 5-level EPT with host max_pa > 47
  mmu_max_gfn should be host max gfn - (TDX key bits)

- TDX Shared-EPT with 4-level EPT with host max_pa > 47
  The host allows 5-level.  The guest doesn't need it. So use 4-level.
  mmu_max_gfn should be 47 = min(47, host max gfn - (TDX key bits)).

- TDX Shared-EPT with 4-level EPT with host max_pa < 48
  mmu_max_gfn should be min(47, host max gfn - (TDX key bits))

- The value for Shared-EPT works for TDX Secure-EPT.

- For VMX case (with TDX CPU extension enabled)
  mmu_max_gfn should be host max gfn - (TDX key bits)
  For VMX only with TDX disabled, TDX key bits == 0.

So kvm_mmu_max_gfn() needs to be a per-VM value.  And now gfn_shared_mask() is
outside of the guest max PA.
(Maybe we'd like to check whether guest cpuid[0x8000:0008] matches those.)
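
The cases above can be sketched as a small stand-alone helper. This is an
illustration, not the patch below: sketch_mmu_max_gfn() and its parameters are
hypothetical, and it assumes that "host max gfn - (TDX key bits)" and "47" are
both meant as bit widths that are then turned into a gfn the same way
__kvm_mmu_max_gfn() does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_t;

#define SKETCH_PAGE_SHIFT 12

/*
 * Hypothetical per-VM max gfn.
 * @host_maxpa_bits: host MAXPHYADDR
 * @keyid_bits:      bits reserved for TDX private KeyIDs (0 if TDX disabled)
 * @is_td:           true for a TDX guest, false for a plain VMX guest
 * @ept_level:       4 or 5, only consulted for a TDX guest
 */
static inline gfn_t sketch_mmu_max_gfn(int host_maxpa_bits, int keyid_bits,
				       bool is_td, int ept_level)
{
	int max_gpa_bits;

	assert(ept_level == 4 || ept_level == 5);
	assert(keyid_bits >= 0);
	assert(host_maxpa_bits - keyid_bits > SKETCH_PAGE_SHIFT);

	/* Every case starts from host MAXPHYADDR minus the KeyID bits. */
	max_gpa_bits = host_maxpa_bits - keyid_bits;

	/* TDX with 4-level EPT: min(47, host max - key bits). */
	if (is_td && ept_level == 4 && max_gpa_bits > 47)
		max_gpa_bits = 47;

	return (1ULL << (max_gpa_bits - SKETCH_PAGE_SHIFT)) - 1;
}
```

For example, under these assumptions a host with MAXPHYADDR 52 and 4 KeyID
bits gives a 4-level TD a 47-bit limit, while a plain VMX guest on the same
host gets 48 bits.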

Citation from "Intel Trust Domain CPU Architectural Extensions" for those
interested in the related sentences:

1.4.2 Guest Physical Address Translation
  Transition to SEAM VMX non-root operation is formatted to require Extended
  Page Tables (EPT) to be enabled. In SEAM VMX non-root operation, there should
  be two EPTs active: the private EPT specified using the EPTP field of the VMCS
  and a shared EPT specified using the Shared-EPTP field of the VMCS.
  When translating a GPA using the shared EPT, an EPT misconfiguration can occur
  if the entry is present and the physical address bits in the range
  (MAXPHYADDR-1) to (MAXPHYADDR-TDX_RESERVED_KEYID_BITS) are set, i.e., if
  configured with a TDX private KeyID.
  If the CPU's maximum physical-address width (MAXPA) is 52 and the guest
  physical address width is configured to be 48, accesses with GPA bits 51:48
  not all being 0 can cause an EPT-violation, where such EPT-violations are not
  mutated to #VE, even if the “EPT-violations #VE” execution control is 1.
  If the CPU's physical-address width (MAXPA) is less than 48 and the SHARED bit
  is configured to be in bit position 47, GPA bit 47 would be reserved, and GPA
  bits 46:MAXPA would be reserved. On such CPUs, setting bits 51:48 or bits
  46:MAXPA in any paging structure can cause a reserved bit page fault on
  access.

1.5 OPERATION OUTSIDE SEAM
  The physical address bits reserved for encoding TDX private KeyID are meant to
  be treated as reserved bits when not in SEAM operation.
  When translating a linear address outside SEAM, if any paging structure entry
  has bits reserved for TDX private KeyID encoding in the physical address set,
  then the processor helps generate a reserved bit page fault exception.  When
  translating a guest physical address outside SEAM, if any EPT structure entry
  has bits reserved for TDX private KeyID encoding in the physical address set,
  then the processor helps generate an EPT misconfiguration


diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e3df14142db0..4ea6ad407a3d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1559,6 +1559,7 @@ struct kvm_arch {
 #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
 	struct kvm_mmu_memory_cache split_desc_cache;
 
+	gfn_t mmu_max_gfn;
 	gfn_t gfn_shared_mask;
 };
 
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index bab9b0c4f0a9..fcb7197f7487 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -64,7 +64,7 @@ static __always_inline u64 rsvd_bits(int s, int e)
  */
 extern u8 __read_mostly shadow_phys_bits;
 
-static inline gfn_t kvm_mmu_max_gfn(void)
+static inline gfn_t __kvm_mmu_max_gfn(void)
 {
 	/*
 	 * Note that this uses the host MAXPHYADDR, not the guest's.
@@ -82,6 +82,11 @@ static inline gfn_t kvm_mmu_max_gfn(void)
 	return (1ULL << (max_gpa_bits - PAGE_SHIFT)) - 1;
 }
 
+static inline gfn_t kvm_mmu_max_gfn(struct kvm *kvm)
+{
+	return kvm->arch.mmu_max_gfn;
+}
+
 static inline u8 kvm_get_shadow_phys_bits(void)
 {
 	/*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1fb6055b1565..25da520e81d6 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3333,7 +3333,7 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
 	 * only if L1's MAXPHYADDR is inaccurate with respect to the
 	 * hardware's).
 	 */
-	if (unlikely(fault->gfn > kvm_mmu_max_gfn()))
+	if (unlikely(fault->gfn > kvm_mmu_max_gfn(vcpu->kvm)))
 		return RET_PF_EMULATE;
 
 	return RET_PF_CONTINUE;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 630acf2b17f7..04b3c83f21a0 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -952,7 +952,7 @@ static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
 	return iter->yielded;
 }
 
-static inline gfn_t tdp_mmu_max_gfn_exclusive(void)
+static inline gfn_t tdp_mmu_max_gfn_exclusive(struct kvm *kvm)
 {
 	/*
 	 * Bound TDP MMU walks at host.MAXPHYADDR.  KVM disallows memslots with
@@ -960,7 +960,7 @@ static inline gfn_t tdp_mmu_max_gfn_exclusive(void)
 	 * MMIO SPTEs for "impossible" gfns, instead sending such accesses down
 	 * the slow emulation path every time.
 	 */
-	return kvm_mmu_max_gfn() + 1;
+	return kvm_mmu_max_gfn(kvm) + 1;
 }
 
 static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
@@ -968,7 +968,7 @@ static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
 {
 	struct tdp_iter iter;
 
-	gfn_t end = tdp_mmu_max_gfn_exclusive();
+	gfn_t end = tdp_mmu_max_gfn_exclusive(kvm);
 	gfn_t start = 0;
 
 	for_each_tdp_pte_min_level(kvm, iter, root, zap_level, start, end) {
@@ -1069,7 +1069,7 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
 {
 	struct tdp_iter iter;
 
-	end = min(end, tdp_mmu_max_gfn_exclusive());
+	end = min(end, tdp_mmu_max_gfn_exclusive(kvm));
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
 
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index a3c39bd783d6..025d51a55505 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -12,6 +12,8 @@
 static bool enable_tdx __ro_after_init;
 module_param_named(tdx, enable_tdx, bool, 0444);
 
+static gfn_t __ro_after_init mmu_max_gfn;
+
 #if IS_ENABLED(CONFIG_HYPERV) || IS_ENABLED(CONFIG_INTEL_TDX_HOST)
 static int vt_flush_remote_tlbs(struct kvm *kvm);
 #endif
@@ -24,6 +26,27 @@ static void vt_hardware_disable(void)
 	vmx_hardware_disable();
 }
 
+#define MSR_IA32_TME_ACTIVATE	0x982
+#define MKTME_UNINITIALIZED	2
+#define TME_ACTIVATE_LOCKED	BIT_ULL(0)
+#define TME_ACTIVATE_ENABLED	BIT_ULL(1)
+#define TDX_RESERVED_KEYID_BITS(tme_activate)	\
+	(((tme_activate) & GENMASK_ULL(39, 36)) >> 36)
+
+static void vt_adjust_max_pa(void)
+{
+	u64 tme_activate;
+
+	mmu_max_gfn = __kvm_mmu_max_gfn();
+
+	rdmsrl(MSR_IA32_TME_ACTIVATE, tme_activate);
+	if (!(tme_activate & TME_ACTIVATE_LOCKED) ||
+	    !(tme_activate & TME_ACTIVATE_ENABLED))
+		return;
+
+	mmu_max_gfn -= (gfn_t)TDX_RESERVED_KEYID_BITS(tme_activate);
+}
+
 static __init int vt_hardware_setup(void)
 {
 	int ret;
@@ -69,6 +92,8 @@ static __init int vt_hardware_setup(void)
 		vt_x86_ops.flush_remote_tlbs = vt_flush_remote_tlbs;
 #endif
 
+	vt_adjust_max_pa();
+
 	return 0;
 }
 
@@ -89,6 +114,8 @@ static int vt_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
 
 static int vt_vm_init(struct kvm *kvm)
 {
+	kvm->arch.mmu_max_gfn = mmu_max_gfn;
+
 	if (is_td(kvm))
 		return tdx_vm_init(kvm);
 
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 3be4b8ff7cb6..206ad053cbad 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2610,8 +2610,11 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 
 	if (td_params->exec_controls & TDX_EXEC_CONTROL_MAX_GPAW)
 		kvm->arch.gfn_shared_mask = gpa_to_gfn(BIT_ULL(51));
-	else
+	else {
 		kvm->arch.gfn_shared_mask = gpa_to_gfn(BIT_ULL(47));
+		kvm->arch.mmu_max_gfn = min(kvm->arch.mmu_max_gfn,
+					    gpa_to_gfn(BIT_ULL(47)));
+	}
 
 out:
 	/* kfree() accepts NULL. */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7f89405c8bc4..c519bb9c9559 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12693,6 +12693,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	if (ret)
 		goto out;
 
+	kvm->arch.mmu_max_gfn = __kvm_mmu_max_gfn();
 	kvm_mmu_init_vm(kvm);
 
 	ret = static_call(kvm_x86_vm_init)(kvm);
@@ -13030,7 +13031,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 		return -EINVAL;
 
 	if (change == KVM_MR_CREATE || change == KVM_MR_MOVE) {
-		if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn())
+		if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn(kvm))
 			return -EINVAL;
 
 #if 0
Rick Edgecombe May 20, 2024, 11:39 p.m. UTC | #27
On Mon, 2024-05-20 at 12:02 -0700, Rick Edgecombe wrote:
> 
> reflect is a nice name. I'm trying this path right now. I'll share a branch.

Here is the branch:
https://github.com/rpedgeco/linux/commit/674cd68b6ba626e48fe2446797d067e38dca80e3

TODO:
 - kvm_mmu_max_gfn() updates from iterator changes
 - kvm_flush_remote_tlbs_gfn() updates from iterator changes

The historically controversial mmu.h helpers:
static inline gfn_t kvm_gfn_direct_mask(const struct kvm *kvm)
{
	/* Only TDX sets this and it's the shared mask */
	return kvm->arch.gfn_shared_mask;
}

/* The VM keeps a mirrored copy of the private memory */
static inline bool kvm_has_mirrored_tdp(const struct kvm *kvm)
{
	return kvm->arch.vm_type == KVM_X86_TDX_VM;
}

static inline bool kvm_on_mirror(const struct kvm *kvm, enum kvm_process
process)
{
	if (!kvm_has_mirrored_tdp(kvm))
		return false;

	return process & KVM_PROCESS_PRIVATE;
}

static inline bool kvm_on_direct(const struct kvm *kvm, enum kvm_process
process)
{
	if (!kvm_has_mirrored_tdp(kvm))
		return true;

	return process & KVM_PROCESS_SHARED;
}

static inline bool kvm_zap_leafs_only(const struct kvm *kvm)
{
	return kvm->arch.vm_type == KVM_X86_TDX_VM;
}


In this solution, the tdp_mmu.c doesn't have a concept of private vs shared EPT
or GPA aliases. It just knows KVM_PROCESS_PRIVATE/SHARED, and fault->is_private.

Based on the PROCESS enums or fault->is_private, helpers in mmu.h encapsulate
whether to operate on the normal "direct" roots or the mirrored roots. When
!TDX, it always operates on direct.

The code that does PTE setting/zapping, etc. calls out to the mirrored "reflect"
helper and does the extra atomicity handling when it sees the mirrored role bit.

In Isaku's code to make gfn's never have shared bits, there was still the
concept of "shared" in the TDP MMU. But now since the TDP MMU focuses on
mirrored vs direct instead, an abstraction is introduced to just ask for the
mask for the root. For TDX the direct root is for shared memory, so instead the
kvm_gfn_direct_mask() gets applied when operating on the direct root.
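
As a rough standalone model of the selection logic described above (not code
from the branch): the struct, the enum values and the root_gfn() helper are
stand-ins invented for this sketch, and "applying" the direct mask is assumed
here to mean OR-ing it into the GFN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Stand-ins for the kernel definitions; the values are assumptions. */
enum vm_type { VM_DEFAULT, VM_TDX };
enum kvm_process {
	KVM_PROCESS_SHARED  = 1 << 0,
	KVM_PROCESS_PRIVATE = 1 << 1,
};

/* Only the two fields this sketch needs. */
struct kvm_sketch {
	enum vm_type vm_type;
	gfn_t gfn_shared_mask;	/* 0 unless the VM is a TD */
};

static bool has_mirrored_tdp(const struct kvm_sketch *kvm)
{
	return kvm->vm_type == VM_TDX;
}

/* Mirrored roots are only ever touched for TDs, and only for private work. */
static bool on_mirror(const struct kvm_sketch *kvm, unsigned int process)
{
	return has_mirrored_tdp(kvm) && (process & KVM_PROCESS_PRIVATE);
}

/* Without mirrored roots everything goes through the direct roots. */
static bool on_direct(const struct kvm_sketch *kvm, unsigned int process)
{
	return !has_mirrored_tdp(kvm) || (process & KVM_PROCESS_SHARED);
}

/* GFN as seen by the chosen root: the direct root gets the mask applied. */
static gfn_t root_gfn(const struct kvm_sketch *kvm, bool mirror_root, gfn_t gfn)
{
	return mirror_root ? gfn : (gfn | kvm->gfn_shared_mask);
}
```

For a non-TDX VM the mask is zero, so root_gfn() is the identity and every
request lands on the direct root, which matches the "When !TDX, it always
operates on direct" behavior.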

I think there are still some things to be polished in the branch, but overall it
does a good job of cleaning up the confusion about the connection between
private and mirrored. Together with the previous changes, it also reduces the
littering of the generic MMU code with private/shared alias concepts.

At the same time, I think the abstractions have a small cost in clarity if you
are looking at the code from TDX's perspective. It probably won't raise any
eyebrows for people used to tracing nested EPT violations through paging_tmpl.h.
But compared to naming everything mirrored_private, there is more obfuscation of
the bits twiddled.
Isaku Yamahata May 21, 2024, 2:25 a.m. UTC | #28
On Mon, May 20, 2024 at 11:39:06PM +0000,
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:

> On Mon, 2024-05-20 at 12:02 -0700, Rick Edgecombe wrote:
> > 
> > reflect is a nice name. I'm trying this path right now. I'll share a branch.
> 
> Here is the branch:
> https://github.com/rpedgeco/linux/commit/674cd68b6ba626e48fe2446797d067e38dca80e3

Thank you for sharing it. It makes it easy to create further patches on top of
it.

...

> In this solution, the tdp_mmu.c doesn't have a concept of private vs shared EPT
> or GPA aliases. It just knows KVM_PROCESS_PRIVATE/SHARED, and fault->is_private.
> 
> Based on the PROCESS enums or fault->is_private, helpers in mmu.h encapsulate
> whether to operate on the normal "direct" roots or the mirrored roots. When
> !TDX, it always operates on direct.
> 
> The code that does PTE setting/zapping etc, calls out the mirrored "reflect"
> helper and does the extra atomicity stuff when it sees the mirrored role bit.
> 
> In Isaku's code to make gfn's never have shared bits, there was still the
> concept of "shared" in the TDP MMU. But now since the TDP MMU focuses on
> mirrored vs direct instead, an abstraction is introduced to just ask for the
> mask for the root. For TDX the direct root is for shared memory, so instead the
> kvm_gfn_direct_mask() gets applied when operating on the direct root.

"direct" is better than "shared".  It might be confused with the existing
role.direct, but I can't think of a better name.

I resorted to passing kvm around to the iterator for gfn_direct_mask.  An
alternative is to stash it in the root's struct kvm_mmu_page somehow.  Then we
can strip kvm from the iterator and the related macros.


> I think there are still some things to be polished in the branch, but overall it
> does a good job of cleaning up the confusion about the connection between
> private and mirrored. And also between this and the previous changes, improves
> littering the generic MMU code with private/shared alias concepts.
> 
> At the same time, I think the abstractions have a small cost in clarity if you
> are looking at the code from TDX's perspective. It probably wont raise any
> eyebrows for people used to tracing nested EPT violations through paging_tmpl.h.
> But compared to naming everything mirrored_private, there is more obfuscation of
> the bits twiddled.

The rename makes the code much less confusing.  I noticed that mirror and
mirrored are mixed. I'm not sure whether it's intentional or accidental.
Rick Edgecombe May 21, 2024, 2:57 a.m. UTC | #29
On Mon, 2024-05-20 at 19:25 -0700, Isaku Yamahata wrote:
> On Mon, May 20, 2024 at 11:39:06PM +0000,
> "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:
> 
> > On Mon, 2024-05-20 at 12:02 -0700, Rick Edgecombe wrote:
> > In this solution, the tdp_mmu.c doesn't have a concept of private vs shared
> > EPT
> > or GPA aliases. It just knows KVM_PROCESS_PRIVATE/SHARED, and fault-
> > >is_private.
> > 
> > Based on the PROCESS enums or fault->is_private, helpers in mmu.h
> > encapsulate
> > whether to operate on the normal "direct" roots or the mirrored roots. When
> > !TDX, it always operates on direct.
> > 
> > The code that does PTE setting/zapping etc, calls out the mirrored "reflect"
> > helper and does the extra atomicity stuff when it sees the mirrored role
> > bit.
> > 
> > In Isaku's code to make gfn's never have shared bits, there was still the
> > concept of "shared" in the TDP MMU. But now since the TDP MMU focuses on
> > mirrored vs direct instead, an abstraction is introduced to just ask for the
> > mask for the root. For TDX the direct root is for shared memory, so instead
> > the
> > kvm_gfn_direct_mask() gets applied when operating on the direct root.
> 
> "direct" is better than "shared".  It might be confusing with the existing
> role.direct, but I don't think of better other name.

Yea, direct is kind of overloaded. But it actually is "direct" in the
role.direct sense at least.

> 
> I resorted to pass around kvm for gfn_direct_mask to the iterator. 
> Alternative
> way is to stash it in struct kvm_mmu_page of root somehow.  Then, we can strip
> kvm from the iterator and the related macros.

It seems like it would use too much memory. Looking up the mask once per
iteration doesn't seem too terrible to me.

> 
> 
> > I think there are still some things to be polished in the branch, but
> > overall it
> > does a good job of cleaning up the confusion about the connection between
> > private and mirrored. And also between this and the previous changes,
> > improves
> > littering the generic MMU code with private/shared alias concepts.
> > 
> > At the same time, I think the abstractions have a small cost in clarity if
> > you
> > are looking at the code from TDX's perspective. It probably wont raise any
> > eyebrows for people used to tracing nested EPT violations through
> > paging_tmpl.h.
> > But compared to naming everything mirrored_private, there is more
> > obfuscation of
> > the bits twiddled.
> 
> The rename makes the code much less confusing.  I noticed that mirror and
> mirrored are mixed. I'm not sure whether it's intentional or accidental.

We need a better name for sp->mirrored_spt and related functions. It is not the
mirror page table, it's the actual page table that is getting mirrored.

It would be nice to have a good generic name (not private) for what the mirrored
page tables are mirroring. Mirror vs mirrored is too close, but I couldn't think
of anything. Reflect only seems to fit as a verb.


Another nice thing about this separation is that I think we can break the big
patch apart a bit. I think maybe I'll start re-arranging things into patches,
unless there is any objection to the whole direction. Kai?
Rick Edgecombe May 21, 2024, 3:07 p.m. UTC | #30
On Mon, 2024-05-20 at 16:32 -0700, Isaku Yamahata wrote:
> I looked into this one.  I think we need to adjust the value even for VMX
> case.
> I have something at the bottom.  What do you think?  I compiled it only at the
> moment. This is to show the idea.
> 
> 
> Based on "Intel Trust Domain CPU Architectural Extensions"
> There are four cases to consider.
> - TDX Shared-EPT with 5-level EPT with host max_pa > 47
>   mmu_max_gfn should be host max gfn - (TDX key bits)
> 
> - TDX Shared-EPT with 4-level EPT with host max_pa > 47
>   The host allows 5-level.  The guest doesn't need it. So use 4-level.
>   mmu_max_gfn should be 47 = min(47, host max gfn - (TDX key bits))).
> 
> - TDX Shared-EPT with 4-level EPT with host max_pa < 48
>   mmu_max_gfn should be min(47, host max gfn - (TDX key bits)))
> 
> - The value for Shared-EPT works for TDX Secure-EPT.
> 
> - For VMX case (with TDX CPU extension enabled)
>   mmu_max_gfn should be host max gfn - (TDX key bits)
>   For VMX only with TDX disabled, TDX key bits == 0.
> 
> So kvm_mmu_max_gfn() need to be per-VM value.  And now gfn_shared_mask() is
> out side of guest max PA.  
> (Maybe we'd like to check if guest cpuid[0x8000:0008] matches with those.)
> 
> Citation from "Intel Trust Domain CPU Architectural Extensions" for those
> interested in the related sentences:
> 
> 1.4.2 Guest Physical Address Translation
>   Transition to SEAM VMX non-root operation is formatted to require Extended
>   Page Tables (EPT) to be enabled. In SEAM VMX non-root operation, there
> should
>   be two EPTs active: the private EPT specified using the EPTP field of the
> VMCS
>   and a shared EPT specified using the Shared-EPTP field of the VMCS.
>   When translating a GPA using the shared EPT, an EPT misconfiguration can
> occur
>   if the entry is present and the physical address bits in the range
>   (MAXPHYADDR-1) to (MAXPHYADDR-TDX_RESERVED_KEYID_BITS) are set, i.e., if
>   configured with a TDX private KeyID.
>   If the CPU's maximum physical-address width (MAXPA) is 52 and the guest
>   physical address width is configured to be 48, accesses with GPA bits 51:48
>   not all being 0 can cause an EPT-violation, where such EPT-violations are
> not
>   mutated to #VE, even if the “EPT-violations #VE” execution control is 1.
>   If the CPU's physical-address width (MAXPA) is less than 48 and the SHARED
> bit
>   is configured to be in bit position 47, GPA bit 47 would be reserved, and
> GPA
>   bits 46:MAXPA would be reserved. On such CPUs, setting bits 51:48 or bits
>   46:MAXPA in any paging structure can cause a reserved bit page fault on
>   access.

In "if the entry is present and the physical address bits in the range
(MAXPHYADDR-1) to (MAXPHYADDR-TDX_RESERVED_KEYID_BITS) are set", it's not clear
to me whether "physical address bits" refers to the GPA or the "entry" (meaning
the host pfn). The "entry" would be my guess.

It is also confusing when it talks about "guest physical address". It must mean
4 vs 5 level paging? How else is the shared EPT walker supposed to know the
guest maxpa? In which case it would be consistent with normal EPT behavior. But
the assertions around reserved bit page faults are surprising.

Based on those guesses, I'm not sure the below code is correct. We wouldn't need
to remove keyid bits from the GFN.

Maybe we should clarify the spec? Or are you confident reading it the other way?

> 
> 1.5 OPERATION OUTSIDE SEAM
>   The physical address bits reserved for encoding TDX private KeyID are meant
> to
>   be treated as reserved bits when not in SEAM operation.
>   When translating a linear address outside SEAM, if any paging structure
> entry
>   has bits reserved for TDX private KeyID encoding in the physical address
> set,
>   then the processor helps generate a reserved bit page fault exception.  When
>   translating a guest physical address outside SEAM, if any EPT structure
> entry
>   has bits reserved for TDX private KeyID encoding in the physical address
> set,
>   then the processor helps generate an EPT misconfiguration

This is more specific regarding which bits should not have key id bits: "if any
paging structure entry has bits reserved for TDX private KeyID encoding in the
physical address set". It is bits in the PTE, not the GPA.

> 
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index e3df14142db0..4ea6ad407a3d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1559,6 +1559,7 @@ struct kvm_arch {
>  #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
>         struct kvm_mmu_memory_cache split_desc_cache;
>  
> +       gfn_t mmu_max_gfn;
>         gfn_t gfn_shared_mask;
>  };
>  
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index bab9b0c4f0a9..fcb7197f7487 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -64,7 +64,7 @@ static __always_inline u64 rsvd_bits(int s, int e)
>   */
>  extern u8 __read_mostly shadow_phys_bits;
>  
> -static inline gfn_t kvm_mmu_max_gfn(void)
> +static inline gfn_t __kvm_mmu_max_gfn(void)
>  {
>         /*
>          * Note that this uses the host MAXPHYADDR, not the guest's.
> @@ -82,6 +82,11 @@ static inline gfn_t kvm_mmu_max_gfn(void)
>         return (1ULL << (max_gpa_bits - PAGE_SHIFT)) - 1;
>  }
>  
> +static inline gfn_t kvm_mmu_max_gfn(struct kvm *kvm)
> +{
> +       return kvm->arch.mmu_max_gfn;
> +}
> +
>  static inline u8 kvm_get_shadow_phys_bits(void)
>  {
>         /*
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 1fb6055b1565..25da520e81d6 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3333,7 +3333,7 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu
> *vcpu,
>          * only if L1's MAXPHYADDR is inaccurate with respect to the
>          * hardware's).
>          */
> -       if (unlikely(fault->gfn > kvm_mmu_max_gfn()))
> +       if (unlikely(fault->gfn > kvm_mmu_max_gfn(vcpu->kvm)))
>                 return RET_PF_EMULATE;
>  
>         return RET_PF_CONTINUE;
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 630acf2b17f7..04b3c83f21a0 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -952,7 +952,7 @@ static inline bool __must_check
> tdp_mmu_iter_cond_resched(struct kvm *kvm,
>         return iter->yielded;
>  }
>  
> -static inline gfn_t tdp_mmu_max_gfn_exclusive(void)
> +static inline gfn_t tdp_mmu_max_gfn_exclusive(struct kvm *kvm)
>  {
>         /*
>          * Bound TDP MMU walks at host.MAXPHYADDR.  KVM disallows memslots
> with
> @@ -960,7 +960,7 @@ static inline gfn_t tdp_mmu_max_gfn_exclusive(void)
>          * MMIO SPTEs for "impossible" gfns, instead sending such accesses
> down
>          * the slow emulation path every time.
>          */
> -       return kvm_mmu_max_gfn() + 1;
> +       return kvm_mmu_max_gfn(kvm) + 1;
>  }
>  
>  static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> @@ -968,7 +968,7 @@ static void __tdp_mmu_zap_root(struct kvm *kvm, struct
> kvm_mmu_page *root,
>  {
>         struct tdp_iter iter;
>  
> -       gfn_t end = tdp_mmu_max_gfn_exclusive();
> +       gfn_t end = tdp_mmu_max_gfn_exclusive(kvm);
>         gfn_t start = 0;
>  
>         for_each_tdp_pte_min_level(kvm, iter, root, zap_level, start, end) {
> @@ -1069,7 +1069,7 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct
> kvm_mmu_page *root,
>  {
>         struct tdp_iter iter;
>  
> -       end = min(end, tdp_mmu_max_gfn_exclusive());
> +       end = min(end, tdp_mmu_max_gfn_exclusive(kvm));
>  
>         lockdep_assert_held_write(&kvm->mmu_lock);
>  
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index a3c39bd783d6..025d51a55505 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -12,6 +12,8 @@
>  static bool enable_tdx __ro_after_init;
>  module_param_named(tdx, enable_tdx, bool, 0444);
>  
> +static gfn_t __ro_after_init mmu_max_gfn;
> +
>  #if IS_ENABLED(CONFIG_HYPERV) || IS_ENABLED(CONFIG_INTEL_TDX_HOST)
>  static int vt_flush_remote_tlbs(struct kvm *kvm);
>  #endif
> @@ -24,6 +26,27 @@ static void vt_hardware_disable(void)
>         vmx_hardware_disable();
>  }
>  
> +#define MSR_IA32_TME_ACTIVATE  0x982
> +#define MKTME_UNINITIALIZED    2
> +#define TME_ACTIVATE_LOCKED    BIT_ULL(0)
> +#define TME_ACTIVATE_ENABLED   BIT_ULL(1)
> +#define TDX_RESERVED_KEYID_BITS(tme_activate)  \
> +       (((tme_activate) & GENMASK_ULL(39, 36)) >> 36)
> +
> +static void vt_adjust_max_pa(void)
> +{
> +       u64 tme_activate;
> +
> +       mmu_max_gfn = __kvm_mmu_max_gfn();
> +
> +       rdmsrl(MSR_IA32_TME_ACTIVATE, tme_activate);
> +       if (!(tme_activate & TME_ACTIVATE_LOCKED) ||
> +           !(tme_activate & TME_ACTIVATE_ENABLED))
> +               return;
> +
> +       mmu_max_gfn -= (gfn_t)TDX_RESERVED_KEYID_BITS(tme_activate);
> +}

As above, I'm not sure this is right. I guess you read the above as bits in the
GPA?
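
Whichever reading of the spec turns out to be right, the arithmetic itself may
be worth a second look: the helper returns a frame number, while
TDX_RESERVED_KEYID_BITS() yields a bit count. A standalone sketch of the
difference follows. The helper names are made up, PAGE_SHIFT is assumed to be
12, and "narrowing the width" is only one possible intent (under the
"bits in the PTE" reading no adjustment would be needed at all).

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t gfn_t;

#define PAGE_SHIFT 12	/* assumed 4 KiB pages */

/* Same shape as the max-gfn helper: highest GFN for a given address width. */
static gfn_t max_gfn_for_width(unsigned int pa_bits)
{
	assert(pa_bits > PAGE_SHIFT && pa_bits <= 52);
	return (UINT64_C(1) << (pa_bits - PAGE_SHIFT)) - 1;
}

/* What "max_gfn -= keyid_bits" computes: a few frames fewer, same width. */
static gfn_t max_gfn_minus_count(unsigned int pa_bits, unsigned int keyid_bits)
{
	return max_gfn_for_width(pa_bits) - keyid_bits;
}

/* Removing keyid_bits from the address width instead. */
static gfn_t max_gfn_narrowed(unsigned int pa_bits, unsigned int keyid_bits)
{
	assert(keyid_bits < pa_bits - PAGE_SHIFT);
	return max_gfn_for_width(pa_bits - keyid_bits);
}
```

With a 52-bit width and 6 KeyID bits, the first variant lowers the limit by
six frames, while the second drops it from a 40-bit GFN space to a 34-bit one.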

> +
>  static __init int vt_hardware_setup(void)
>  {
>         int ret;
> @@ -69,6 +92,8 @@ static __init int vt_hardware_setup(void)
>                 vt_x86_ops.flush_remote_tlbs = vt_flush_remote_tlbs;
>  #endif
>  
> +       vt_adjust_max_pa();
> +
>         return 0;
>  }
>  
> @@ -89,6 +114,8 @@ static int vt_vm_enable_cap(struct kvm *kvm, struct
> kvm_enable_cap *cap)
>  
>  static int vt_vm_init(struct kvm *kvm)
>  {
> +       kvm->arch.mmu_max_gfn = mmu_max_gfn;
> +
>         if (is_td(kvm))
>                 return tdx_vm_init(kvm);
>  
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 3be4b8ff7cb6..206ad053cbad 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -2610,8 +2610,11 @@ static int tdx_td_init(struct kvm *kvm, struct
> kvm_tdx_cmd *cmd)
>  
>         if (td_params->exec_controls & TDX_EXEC_CONTROL_MAX_GPAW)
>                 kvm->arch.gfn_shared_mask = gpa_to_gfn(BIT_ULL(51));
> -       else
> +       else {
>                 kvm->arch.gfn_shared_mask = gpa_to_gfn(BIT_ULL(47));
> +               kvm->arch.mmu_max_gfn = min(kvm->arch.mmu_max_gfn,
> +                                           gpa_to_gfn(BIT_ULL(47)));
> +       }
>  
>  out:
>         /* kfree() accepts NULL. */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 7f89405c8bc4..c519bb9c9559 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12693,6 +12693,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long
> type)
>         if (ret)
>                 goto out;
>  
> +       kvm->arch.mmu_max_gfn = __kvm_mmu_max_gfn();
>         kvm_mmu_init_vm(kvm);
>  
>         ret = static_call(kvm_x86_vm_init)(kvm);
> @@ -13030,7 +13031,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>                 return -EINVAL;
>  
>         if (change == KVM_MR_CREATE || change == KVM_MR_MOVE) {
> -               if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn())
> +               if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn(kvm))
>                         return -EINVAL;
>  
>  #if 0
Isaku Yamahata May 21, 2024, 4:15 p.m. UTC | #31
On Tue, May 21, 2024 at 03:07:50PM +0000,
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:

> > 1.4.2 Guest Physical Address Translation
> >   Transition to SEAM VMX non-root operation is formatted to require Extended
> >   Page Tables (EPT) to be enabled. In SEAM VMX non-root operation, there
> > should
> >   be two EPTs active: the private EPT specified using the EPTP field of the
> > VMCS
> >   and a shared EPT specified using the Shared-EPTP field of the VMCS.
> >   When translating a GPA using the shared EPT, an EPT misconfiguration can
> > occur
> >   if the entry is present and the physical address bits in the range
> >   (MAXPHYADDR-1) to (MAXPHYADDR-TDX_RESERVED_KEYID_BITS) are set, i.e., if
> >   configured with a TDX private KeyID.
> >   If the CPU's maximum physical-address width (MAXPA) is 52 and the guest
> >   physical address width is configured to be 48, accesses with GPA bits 51:48
> >   not all being 0 can cause an EPT-violation, where such EPT-violations are
> > not
> >   mutated to #VE, even if the “EPT-violations #VE” execution control is 1.
> >   If the CPU's physical-address width (MAXPA) is less than 48 and the SHARED
> > bit
> >   is configured to be in bit position 47, GPA bit 47 would be reserved, and
> > GPA
> >   bits 46:MAXPA would be reserved. On such CPUs, setting bits 51:48 or bits
> >   46:MAXPA in any paging structure can cause a reserved bit page fault on
> >   access.
> 
> In "if the entry is present and the physical address bits in the range
> (MAXPHYADDR-1) to (MAXPHYADDR-TDX_RESERVED_KEYID_BITS) are set", it's not clear
> to be if "physical address bits" is referring to the GPA or the "entry" (meaning
> the host pfn). The "entry" would be my guess.
> 
> It is also confusing when it talks about "guest physical address". It must mean
> 4 vs 5 level paging? How else is the shared EPT walker supposed to know the
> guest maxpa. In which case it would be consistent with normal EPT behavior. But
> the assertions around reserved bit page faults are surprising.
> 
> Based on those guesses, I'm not sure the below code is correct. We wouldn't need
> to remove keyid bits from the GFN.
> 
> Maybe we should clarify the spec? Or are you confident reading it the other way?

I'll read them more closely. At least the following patch is broken.
Isaku Yamahata May 22, 2024, 10:34 p.m. UTC | #32
On Tue, May 21, 2024 at 09:15:20AM -0700,
Isaku Yamahata <isaku.yamahata@intel.com> wrote:

> On Tue, May 21, 2024 at 03:07:50PM +0000,
> "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:
> 
> > > 1.4.2 Guest Physical Address Translation
> > >   Transition to SEAM VMX non-root operation is formatted to require Extended
> > >   Page Tables (EPT) to be enabled. In SEAM VMX non-root operation, there
> > > should
> > >   be two EPTs active: the private EPT specified using the EPTP field of the
> > > VMCS
> > >   and a shared EPT specified using the Shared-EPTP field of the VMCS.
> > >   When translating a GPA using the shared EPT, an EPT misconfiguration can
> > > occur
> > >   if the entry is present and the physical address bits in the range
> > >   (MAXPHYADDR-1) to (MAXPHYADDR-TDX_RESERVED_KEYID_BITS) are set, i.e., if
> > >   configured with a TDX private KeyID.
> > >   If the CPU's maximum physical-address width (MAXPA) is 52 and the guest
> > >   physical address width is configured to be 48, accesses with GPA bits 51:48
> > >   not all being 0 can cause an EPT-violation, where such EPT-violations are
> > > not
> > >   mutated to #VE, even if the “EPT-violations #VE” execution control is 1.
> > >   If the CPU's physical-address width (MAXPA) is less than 48 and the SHARED
> > > bit
> > >   is configured to be in bit position 47, GPA bit 47 would be reserved, and
> > > GPA
> > >   bits 46:MAXPA would be reserved. On such CPUs, setting bits 51:48 or bits
> > >   46:MAXPA in any paging structure can cause a reserved bit page fault on
> > >   access.
> > 
> > In "if the entry is present and the physical address bits in the range
> > (MAXPHYADDR-1) to (MAXPHYADDR-TDX_RESERVED_KEYID_BITS) are set", it's not clear
> > to be if "physical address bits" is referring to the GPA or the "entry" (meaning
> > the host pfn). The "entry" would be my guess.
> > 
> > It is also confusing when it talks about "guest physical address". It must mean
> > 4 vs 5 level paging? How else is the shared EPT walker supposed to know the
> > guest maxpa. In which case it would be consistent with normal EPT behavior. But
> > the assertions around reserved bit page faults are surprising.
> > 
> > Based on those guesses, I'm not sure the below code is correct. We wouldn't need
> > to remove keyid bits from the GFN.
> > 
> > Maybe we should clarify the spec? Or are you confident reading it the other way?
> 
> I'll read them more closely. At least the following patch is broken.

I confused the guest (virtual) maxphyaddr with the host maxphyaddr. Here is the
outcome.  There are 5 potentially problematic points related to mmu max pfn.

Related operations
==================
- memslot creation or kvm_arch_prepare_memory_region()
  A slot can be created beyond the virtual maxphyaddr without any change.
  Although that's weird, it does no immediate harm.  If we prevent it, some
  potentially problematic cases won't happen.

- TDP MMU iterator (including memslot deletion)
  It works fine without any change because it uses only the necessary bits of
  the GPA.  It ignores the upper bits of the GFN given as start.  It ends the
  SPTE traversal if GPA > virtual maxphyaddr.

  For secure-EPT
  It may go beyond the shared bit if a slot is huge enough to cross the
  private-vs-shared boundary.  Because the TDP MMU fault handler doesn't (or
  can be made not to) populate such entries, this essentially results in a NOP.

- population EPT violation
  Because the TDX EPT violation handler can filter out EPT violations with GPA >
  virtual maxphyaddr, we can assume the GPA passed to the fault handler is <
  virtual maxphyaddr.

- zapping (including memslot deletion)
  Zapping a non-populated GFN is a NOP, so zapping the specified GFN works fine.

- pre_fault_memory
  KVM_PRE_FAULT_MEMORY calls the fault handler without checking virtual
  maxphyaddr.  An additional check is needed to prevent GPA > virtual
  maxphyaddr if virtual maxphyaddr < 47 or 52.
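
The kind of extra range check meant for the pre-fault path could look roughly
like the following standalone sketch. The function name and parameters are
hypothetical (this is not an existing KVM helper), and the caller is assumed
to pass in whatever width the VM's virtual maxphyaddr implies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gpa_t;

/*
 * Return true only if [gpa, gpa + size) lies entirely below 1 << gpa_bits.
 * Written to avoid overflow in gpa + size.  Hypothetical helper.
 */
static bool gpa_range_in_bounds(gpa_t gpa, uint64_t size, unsigned int gpa_bits)
{
	assert(gpa_bits > 0 && gpa_bits < 64);

	gpa_t limit = UINT64_C(1) << gpa_bits;

	if (size == 0 || gpa >= limit)
		return false;

	return size <= limit - gpa;
}
```

A pre-fault request that fails this check would be rejected before the fault
handler is called, so the handler can keep assuming GPA < virtual maxphyaddr.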


I can think of the following options.

options
=======
option 1. Allow per-VM kvm_mmu_max_gfn()
Pro: Conceptually easy to understand and it's straightforward to disallow
     memslot creation > virtual maxphyaddr
Con: overkill for the corner case? The diff is attached.  This matters only when
     user space creates a memslot > virtual maxphyaddr and the guest accesses
     GPA > virtual maxphyaddr.

option 2. Keep kvm_mmu_max_gfn() and add an ad hoc address check.
Pro: Minimal change?
     Modify kvm_handle_noslot_fault() or kvm_faultin_pfn() to reject GPA >
     virtual maxphyaddr.
Con: Conceptually confusing, as it allows operations on GFN > virtual maxphyaddr.
     The change might be unnatural or ad hoc because it allows creating a memslot
     with GPA > virtual maxphyaddr.
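
To make the arithmetic behind both options concrete, here is a minimal
user-space sketch (not kernel code).  It assumes 4KiB pages, and
max_gfn_for_bits(), tdx_max_gfn() and memslot_fits() are hypothetical names used
only for illustration.  It shows how a per-VM limit could be clamped below the
shared bit and how a memslot range would be checked against it.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumption: 4KiB pages */

typedef uint64_t gfn_t;

/* Highest GFN addressable with the given number of physical address bits. */
static gfn_t max_gfn_for_bits(unsigned int bits)
{
	return (1ULL << (bits - PAGE_SHIFT)) - 1;
}

/*
 * Option 1 idea: clamp the per-VM limit below the shared bit.
 * shared_bit is the GPA bit position of the shared bit (47 or 51).
 */
static gfn_t tdx_max_gfn(gfn_t host_max_gfn, unsigned int shared_bit)
{
	gfn_t shared_mask = 1ULL << (shared_bit - PAGE_SHIFT);
	gfn_t limit = shared_mask - 1;

	return host_max_gfn < limit ? host_max_gfn : limit;
}

/* True if the last GFN of the slot is within the (inclusive) limit. */
static bool memslot_fits(gfn_t base_gfn, uint64_t npages, gfn_t max_gfn)
{
	return base_gfn + npages - 1 <= max_gfn;
}
```

With a 52-bit host and the shared bit at GPA bit 47, the per-VM limit in this
model becomes GFN 2^35 - 1, so a slot covering GFNs [0, 2^35) fits and one
shifted up by a single page does not.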


The following is an experimental change for option 1.

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 406effc613e5..dbc371071cb5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1558,6 +1558,7 @@ struct kvm_arch {
 #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
 	struct kvm_mmu_memory_cache split_desc_cache;
 
+	gfn_t mmu_max_gfn;
 	gfn_t gfn_shared_mask;
 };
 
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 9cd83448e39f..7b7ecaf1c607 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -64,7 +64,7 @@ static __always_inline u64 rsvd_bits(int s, int e)
  */
 extern u8 __read_mostly shadow_phys_bits;
 
-static inline gfn_t kvm_mmu_max_gfn(void)
+static inline gfn_t __kvm_mmu_max_gfn(void)
 {
 	/*
 	 * Note that this uses the host MAXPHYADDR, not the guest's.
@@ -82,6 +82,11 @@ static inline gfn_t kvm_mmu_max_gfn(void)
 	return (1ULL << (max_gpa_bits - PAGE_SHIFT)) - 1;
 }
 
+static inline gfn_t kvm_mmu_max_gfn(struct kvm *kvm)
+{
+	return kvm->arch.mmu_max_gfn;
+}
+
 static inline u8 kvm_get_shadow_phys_bits(void)
 {
 	/*
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 295c27dc593b..515edc6ae867 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3333,7 +3333,7 @@ static int kvm_handle_noslot_fault(struct kvm_vcpu *vcpu,
 	 * only if L1's MAXPHYADDR is inaccurate with respect to the
 	 * hardware's).
 	 */
-	if (unlikely(fault->gfn > kvm_mmu_max_gfn()))
+	if (unlikely(fault->gfn > kvm_mmu_max_gfn(vcpu->kvm)))
 		return RET_PF_EMULATE;
 
 	return RET_PF_CONTINUE;
@@ -6509,6 +6509,7 @@ static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
 
 void kvm_mmu_init_vm(struct kvm *kvm)
 {
+	kvm->arch.mmu_max_gfn = __kvm_mmu_max_gfn();
 	kvm->arch.shadow_mmio_value = shadow_mmio_value;
 	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
 	INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 79c9b22ceef6..ee3456b2096d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -945,7 +945,7 @@ static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
 	return iter->yielded;
 }
 
-static inline gfn_t tdp_mmu_max_gfn_exclusive(void)
+static inline gfn_t tdp_mmu_max_gfn_exclusive(struct kvm *kvm)
 {
 	/*
 	 * Bound TDP MMU walks at host.MAXPHYADDR.  KVM disallows memslots with
@@ -953,7 +953,7 @@ static inline gfn_t tdp_mmu_max_gfn_exclusive(void)
 	 * MMIO SPTEs for "impossible" gfns, instead sending such accesses down
 	 * the slow emulation path every time.
 	 */
-	return kvm_mmu_max_gfn() + 1;
+	return kvm_mmu_max_gfn(kvm) + 1;
 }
 
 static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
@@ -961,7 +961,7 @@ static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
 {
 	struct tdp_iter iter;
 
-	gfn_t end = tdp_mmu_max_gfn_exclusive();
+	gfn_t end = tdp_mmu_max_gfn_exclusive(kvm);
 	gfn_t start = 0;
 
 	for_each_tdp_pte_min_level(iter, kvm, root, zap_level, start, end) {
@@ -1062,7 +1062,7 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
 {
 	struct tdp_iter iter;
 
-	end = min(end, tdp_mmu_max_gfn_exclusive());
+	end = min(end, tdp_mmu_max_gfn_exclusive(kvm));
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
 
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 61715424629b..5c2afca59386 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2549,7 +2549,9 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 	struct kvm_tdx_init_vm *init_vm = NULL;
 	struct td_params *td_params = NULL;
-	int ret;
+	struct kvm_memory_slot *slot;
+	struct kvm_memslots *slots;
+	int ret, idx, i, bkt;
 
 	BUILD_BUG_ON(sizeof(*init_vm) != 8 * 1024);
 	BUILD_BUG_ON(sizeof(struct td_params) != 1024);
@@ -2611,6 +2613,25 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 		kvm->arch.gfn_shared_mask = gpa_to_gfn(BIT_ULL(51));
 	else
 		kvm->arch.gfn_shared_mask = gpa_to_gfn(BIT_ULL(47));
+	kvm->arch.mmu_max_gfn = min(kvm->arch.mmu_max_gfn,
+				    kvm->arch.gfn_shared_mask - 1);
+	/*
+	 * As memslot can be created before KVM_TDX_INIT_VM, check whether the
+	 * existing memslot is equal or lower than mmu_max_gfn.
+	 */
+	idx = srcu_read_lock(&kvm->srcu);
+	write_lock(&kvm->mmu_lock);
+	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
+		slots = __kvm_memslots(kvm, i);
+		kvm_for_each_memslot(slot, bkt, slots) {
+			if (slot->base_gfn + slot->npages - 1 > kvm->arch.mmu_max_gfn) {
+				ret = -ERANGE;
+				break;
+			}
+		}
+	}
+	write_unlock(&kvm->mmu_lock);
+	srcu_read_unlock(&kvm->srcu, idx);
 
 out:
 	/* kfree() accepts NULL. */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c5812cd1a4bc..9461cd4f540b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13029,7 +13029,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 		return -EINVAL;
 
 	if (change == KVM_MR_CREATE || change == KVM_MR_MOVE) {
-		if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn())
+		if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn(kvm))
 			return -EINVAL;
 
 		return kvm_alloc_memslot_metadata(kvm, new);
Rick Edgecombe May 22, 2024, 11:09 p.m. UTC | #33
On Wed, 2024-05-22 at 15:34 -0700, Isaku Yamahata wrote:
> option 1. Allow per-VM kvm_mmu_max_gfn()
> Pro: Conceptually easy to understand and it's straightforward to disallow
>      memslot creation > virtual maxphyaddr
> Con: overkill for the corner case? The diff is attached.  This is only when
> user
>      space creates memlost > virtual maxphyaddr and the guest accesses GPA >
>      virtual maxphyaddr)

It breaks the promise that GFNs don't have the shared bit, which is the benefit of
hiding the shared bit in the TDP MMU iterator.

> 
> option 2. Keep kvm_mmu_max_gfn() and add ad hock address check.
> Pro: Minimal change?
>      Modify kvm_handel_noslot_fault() or kvm_faultin_pfn() to reject GPA >
>      virtual maxphyaddr.
> Con: Conceptually confusing with allowing operation on GFN > virtual
> maxphyaddr.
>      The change might be unnatural or ad-hoc because it allow to create
> memslot
>      with GPA > virtual maxphyaddr.

I can't find any actual functional problem with just ignoring it. It is just some
extra work to go over ranges that aren't covered by the root.

How about we leave option 1 as a separate patch and note it is not functionally
required? Then we can shed it if needed. At the least it can serve as a
conversation piece in the meantime.
Isaku Yamahata May 22, 2024, 11:47 p.m. UTC | #34
On Wed, May 22, 2024 at 11:09:54PM +0000,
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:

> On Wed, 2024-05-22 at 15:34 -0700, Isaku Yamahata wrote:
> > option 1. Allow per-VM kvm_mmu_max_gfn()
> > Pro: Conceptually easy to understand and it's straightforward to disallow
> >      memslot creation > virtual maxphyaddr
> > Con: overkill for the corner case? The diff is attached.  This is only when
> > user
> >      space creates memlost > virtual maxphyaddr and the guest accesses GPA >
> >      virtual maxphyaddr)
> 
> It breaks the promise that gfn's don't have the share bit which is the pro for
> hiding the shared bit in the tdp mmu iterator.
> 
> > 
> > option 2. Keep kvm_mmu_max_gfn() and add ad hock address check.
> > Pro: Minimal change?
> >      Modify kvm_handel_noslot_fault() or kvm_faultin_pfn() to reject GPA >
> >      virtual maxphyaddr.
> > Con: Conceptually confusing with allowing operation on GFN > virtual
> > maxphyaddr.
> >      The change might be unnatural or ad-hoc because it allow to create
> > memslot
> >      with GPA > virtual maxphyaddr.
> 
> I can't find any actual functional problem to just ignoring it. Just some extra
> work to go over ranges that aren't covered by the root.
> 
> How about we leave option 1 as a separate patch and note it is not functionally
> required? Then we can shed it if needed. At the least it can serve as a
> conversation piece in the meantime.

Ok. We understand the situation correctly. I think it's okay to do nothing for
now, with some notes somewhere as a record, because it doesn't much affect the
usual case.
Rick Edgecombe May 22, 2024, 11:50 p.m. UTC | #35
On Wed, 2024-05-22 at 16:47 -0700, Isaku Yamahata wrote:
> > How about we leave option 1 as a separate patch and note it is not
> > functionally
> > required? Then we can shed it if needed. At the least it can serve as a
> > conversation piece in the meantime.
> 
> Ok. We understand the situation correctly. I think it's okay to do nothing for
> now with some notes somewhere as record because it doesn't affect much for
> usual
> case.

I meant we include your proposed option 1 as a separate patch in the next
series. I'm currently writing a log for the iterator changes, and
I'll note it as an issue. And then we include this later in the same series. No?
Isaku Yamahata May 23, 2024, 12:01 a.m. UTC | #36
On Wed, May 22, 2024 at 11:50:58PM +0000,
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:

> On Wed, 2024-05-22 at 16:47 -0700, Isaku Yamahata wrote:
> > > How about we leave option 1 as a separate patch and note it is not
> > > functionally
> > > required? Then we can shed it if needed. At the least it can serve as a
> > > conversation piece in the meantime.
> > 
> > Ok. We understand the situation correctly. I think it's okay to do nothing for
> > now with some notes somewhere as record because it doesn't affect much for
> > usual
> > case.
> 
> I meant we include your proposed option 1 as a separate patch in the next
> series. I'm writing am currently writing a log for the iterator changes, and
> I'll note it as an issue. And then we include this later in the same series. No?

Ok, let's include the patch.
Rick Edgecombe May 23, 2024, 6:27 p.m. UTC | #37
On Wed, 2024-05-22 at 17:01 -0700, Isaku Yamahata wrote:
> Ok, Let's include the patch.

We were discussing offline that the existing behavior of kvm_mmu_max_gfn() can
actually be improved for normal VMs. It would be more proper to base it on the
GFN range supported by the EPT level than on the host MAXPA.

Today I was thinking that fixing this would need something like an
x86_ops.max_gfn(), so it could get at VMX stuff (usage of 4/5 level EPT). If that
exists we might as well just call it directly in kvm_mmu_max_gfn().

Then for TDX we could just provide a TDX implementation, rather than stash the
GFN on the kvm struct? Instead it could use gpaw stashed on struct kvm_tdx. The
op would still need to take a struct kvm.

What do you think of that alternative?
Rick Edgecombe May 23, 2024, 11:14 p.m. UTC | #38
On Tue, 2024-05-14 at 17:59 -0700, Rick Edgecombe wrote:
> +static void handle_removed_private_spte(struct kvm *kvm, gfn_t gfn,
> +                                       u64 old_spte, u64 new_spte,
> +                                       int level)
> +{
> +       bool was_present = is_shadow_present_pte(old_spte);
> +       bool was_leaf = was_present && is_last_spte(old_spte, level);
> +       kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
> +       int ret;
> +
> +       /*
> +        * Allow only leaf page to be zapped. Reclaim non-leaf page tables
> page
> +        * at destroying VM.
> +        */
> +       if (!was_leaf)
> +               return;
> +
> +       /* Zapping leaf spte is allowed only when write lock is held. */
> +       lockdep_assert_held_write(&kvm->mmu_lock);
> +       ret = static_call(kvm_x86_zap_private_spte)(kvm, gfn, level);
> +       /* Because write lock is held, operation should success. */
> +       if (KVM_BUG_ON(ret, kvm))
> +               return;
> +
> +       ret = static_call(kvm_x86_remove_private_spte)(kvm, gfn, level,
> old_pfn);

I don't see why these (zap_private_spte and remove_private_spte) can't be a
single op. Was it to prepare for huge pages support or something? In the base
series they are both only called once.

> +       KVM_BUG_ON(ret, kvm);
> +}
> +
Isaku Yamahata May 24, 2024, 7:55 a.m. UTC | #39
On Thu, May 23, 2024 at 06:27:49PM +0000,
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:

> On Wed, 2024-05-22 at 17:01 -0700, Isaku Yamahata wrote:
> > Ok, Let's include the patch.
> 
> We were discussing offline, that actually the existing behavior of
> kvm_mmu_max_gfn() can be improved for normal VMs. It would be more proper to
> trigger it off of the GFN range supported by EPT level, than the host MAXPA. 
> 
> Today I was thinking, to fix this would need somthing like an x86_ops.max_gfn(),
> so it could get at VMX stuff (usage of 4/5 level EPT). If that exists we might
> as well just call it directly in kvm_mmu_max_gfn().
> 
> Then for TDX we could just provide a TDX implementation, rather than stash the
> GFN on the kvm struct? Instead it could use gpaw stashed on struct kvm_tdx. The
> op would still need to be take a struct kvm.
> 
> What do you think of that alternative?

I don't see the benefit of x86_ops.max_gfn() compared to kvm->arch.max_gfn.
But I don't have a strong preference. Either way will work.

The max_gfn for the guest is rather static once the guest is created and
initialized.  Also, the existing code that uses max_gfn expects that the value
doesn't change.  So we can use x86_ops.vm_init() to determine the value for VMX
and TDX.  If we introduced x86_ops.max_gfn(), the implementation would simply
return kvm_vmx->max_gfn or kvm_tdx->max_gfn. (We would have something similar
for SVM and SEV.)  So I don't see the benefit of x86_ops.max_gfn() over
kvm->arch.max_gfn.
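
For comparison, a user-space sketch of the two shapes being discussed.  All
names here (struct vm, struct vm_ops, vm_init(), max_gfn_via_op(),
max_gfn_cached()) are hypothetical and only model the idea; they are not KVM's
types.

```c
#include <stdint.h>

typedef uint64_t gfn_t;

struct vm;

/* Alternative A: a vendor callback, in the style of x86_ops.max_gfn(). */
struct vm_ops {
	gfn_t (*max_gfn)(const struct vm *vm);
};

struct vm {
	const struct vm_ops *ops;
	gfn_t vendor_max_gfn;	/* e.g. derived from EPT level or gpaw */
	gfn_t max_gfn;		/* Alternative B: value cached on the VM */
};

/* What a vendor implementation of the op would boil down to. */
static gfn_t vendor_max_gfn_op(const struct vm *vm)
{
	return vm->vendor_max_gfn;
}

/* Alternative B: compute once at VM init, then read a plain field. */
static void vm_init(struct vm *vm, const struct vm_ops *ops, gfn_t vendor_max)
{
	vm->ops = ops;
	vm->vendor_max_gfn = vendor_max;
	vm->max_gfn = ops->max_gfn(vm);
}

static gfn_t max_gfn_via_op(const struct vm *vm) { return vm->ops->max_gfn(vm); }
static gfn_t max_gfn_cached(const struct vm *vm) { return vm->max_gfn; }
```

As long as the value never changes after init, both accessors return the same
thing; the cached form just avoids the indirect call at each use.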
Isaku Yamahata May 24, 2024, 8:20 a.m. UTC | #40
On Thu, May 23, 2024 at 11:14:07PM +0000,
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:

> On Tue, 2024-05-14 at 17:59 -0700, Rick Edgecombe wrote:
> > +static void handle_removed_private_spte(struct kvm *kvm, gfn_t gfn,
> > +                                       u64 old_spte, u64 new_spte,
> > +                                       int level)
> > +{
> > +       bool was_present = is_shadow_present_pte(old_spte);
> > +       bool was_leaf = was_present && is_last_spte(old_spte, level);
> > +       kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
> > +       int ret;
> > +
> > +       /*
> > +        * Allow only leaf page to be zapped. Reclaim non-leaf page tables
> > page
> > +        * at destroying VM.
> > +        */
> > +       if (!was_leaf)
> > +               return;
> > +
> > +       /* Zapping leaf spte is allowed only when write lock is held. */
> > +       lockdep_assert_held_write(&kvm->mmu_lock);
> > +       ret = static_call(kvm_x86_zap_private_spte)(kvm, gfn, level);
> > +       /* Because write lock is held, operation should success. */
> > +       if (KVM_BUG_ON(ret, kvm))
> > +               return;
> > +
> > +       ret = static_call(kvm_x86_remove_private_spte)(kvm, gfn, level,
> > old_pfn);
> 
> I don't see why these (zap_private_spte and remove_private_spte) can't be a
> single op. Was it to prepare for huge pages support or something? In the base
> series they are both only called once.

That is for large page support. The steps to merge or split a large page are:
1. zap_private_spte()
2. TLB shootdown
3. merge/split_private_spte()
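
The ordering can be sketched as follows.  This is a user-space model only: the
three callbacks stand in for zap_private_spte(), the TLB shootdown and a split
op, and struct private_ops, split_private_page() and the logging stubs are
hypothetical names, not the API of this series.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Hypothetical stand-ins for the hooks discussed above. */
struct private_ops {
	int (*zap)(gfn_t gfn, int level);	/* step 1: block the mapping */
	void (*tlb_shootdown)(void);		/* step 2: flush stale translations */
	int (*split)(gfn_t gfn, int level);	/* step 3: change the page size */
};

/* Run the three steps in order, stopping at the first failure. */
static int split_private_page(const struct private_ops *ops, gfn_t gfn, int level)
{
	int ret;

	ret = ops->zap(gfn, level);
	if (ret)
		return ret;

	/* The flush sits between the zap and the split, as listed above. */
	ops->tlb_shootdown();

	return ops->split(gfn, level);
}

/* Trivial recording stubs so the order can be observed. */
static char step_log[4];
static size_t step_count;

static int log_zap(gfn_t gfn, int level)
{
	(void)gfn; (void)level;
	step_log[step_count++] = 'Z';
	return 0;
}

static void log_flush(void)
{
	step_log[step_count++] = 'F';
}

static int log_split(gfn_t gfn, int level)
{
	(void)gfn; (void)level;
	step_log[step_count++] = 'S';
	return 0;
}
```

Because the shootdown has to run between the other two steps, folding zap and
remove into one op would leave no place to put it in the large page case.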
Rick Edgecombe May 28, 2024, 4:27 p.m. UTC | #41
On Fri, 2024-05-24 at 00:55 -0700, Isaku Yamahata wrote:
> On Thu, May 23, 2024 at 06:27:49PM +0000,
> "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:
> 
> > On Wed, 2024-05-22 at 17:01 -0700, Isaku Yamahata wrote:
> > > Ok, Let's include the patch.
> > 
> > We were discussing offline, that actually the existing behavior of
> > kvm_mmu_max_gfn() can be improved for normal VMs. It would be more proper to
> > trigger it off of the GFN range supported by EPT level, than the host
> > MAXPA. 
> > 
> > Today I was thinking, to fix this would need somthing like an
> > x86_ops.max_gfn(),
> > so it could get at VMX stuff (usage of 4/5 level EPT). If that exists we
> > might
> > as well just call it directly in kvm_mmu_max_gfn().
> > 
> > Then for TDX we could just provide a TDX implementation, rather than stash
> > the
> > GFN on the kvm struct? Instead it could use gpaw stashed on struct kvm_tdx.
> > The
> > op would still need to be take a struct kvm.
> > 
> > What do you think of that alternative?
> 
> I don't see benefit of x86_ops.max_gfn() compared to kvm->arch.max_gfn.
> But I don't have strong preference. Either way will work.

Non-TDX VMs won't need per-VM data, right? So it's just unneeded extra per-VM
state.

> 
> The max_gfn for the guest is rather static once the guest is created and
> initialized.  Also the existing codes that use max_gfn expect that the value
> doesn't change.  So we can use x86_ops.vm_init() to determine the value for
> VMX
> and TDX.  If we introduced x86_ops.max_gfn(), the implementation will be
> simply
> return kvm_vmx->max_gfn or return kvm_tdx->max_gfn. (We would have similar for
> SVM and SEV.)  So I don't see benefit of x86_ops.max_gfn() than
> kvm->arch.max_gfn.

For TDX it will be based on the shared bit, so we actually already have the per-
vm data we need. So we don't even need both gfn_shared_mask and max_gfn for TDX.
Paolo Bonzini May 28, 2024, 4:59 p.m. UTC | #42
On Thu, May 16, 2024 at 4:11 AM Huang, Kai <kai.huang@intel.com> wrote:
>
>
> >>>> +       gfn_t raw_gfn;
> >>>> +       bool is_private = fault->is_private && kvm_gfn_shared_mask(kvm);
> >>>
> >>> Ditto.  I wish we can have 'has_mirrored_private_pt'.
> >>
> >> Which name do you prefer? has_mirrored_pt or has_mirrored_private_pt?
> >
> > Why not helpers that wrap vm_type like:
> > https://lore.kernel.org/kvm/d4c96caffd2633a70a140861d91794cdb54c7655.camel@intel.com/
>
> I am fine with any of them -- boolean (with either name) or helper.

Helpers are fine.

Paolo
Paolo Bonzini May 28, 2024, 5:16 p.m. UTC | #43
On Fri, May 17, 2024 at 9:16 PM Isaku Yamahata <isaku.yamahata@intel.com> wrote:
>
> On Fri, May 17, 2024 at 06:16:26PM +0000,
> "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:
>
> > On Fri, 2024-05-17 at 02:03 -0700, Isaku Yamahata wrote:
> > >
> > > On top of your patch, I created the following patch to remove
> > > kvm_gfn_for_root().
> > > Although I haven't tested it yet, I think the following shows my idea.
> > >
> > > - Add gfn_shared_mask to struct tdp_iter.
> > > - Use iter.gfn_shared_mask to determine the starting sptep in the root.
> > > - Remove kvm_gfn_for_root()
> >
> > I investigated it.
>
> Thanks for looking at it.
>
> > After this, gfn_t's never have shared bit. It's a simple rule. The MMU mostly
> > thinks it's operating on a shared root that is mapped at the normal GFN. Only
> > the iterator knows that the shared PTEs are actually in a different location.
> >
> > There are some negative side effects:
> > 1. The struct kvm_mmu_page's gfn doesn't match it's actual mapping anymore.
> > 2. As a result of above, the code that flushes TLBs for a specific GFN will be
> > confused. It won't functionally matter for TDX, just look buggy to see flushing
> > code called with the wrong gfn.
>
> flush_remote_tlbs_range() is only for Hyper-V optimization.  In other cases,
> x86_op.flush_remote_tlbs_range = NULL or the member isn't defined at compile
> time.  So the remote tlb flush falls back to flushing whole range.  I don't
> expect TDX in hyper-V guest.  I have to admit that the code looks superficially
> broken and confusing.

You could add an "&& kvm_has_private_root(kvm)" to
kvm_available_flush_remote_tlbs_range(), since
kvm_has_private_root(kvm) is sort of equivalent to "there is no 1:1
correspondence between gfn and PTE to be flushed".

I am conflicted myself, but the upsides below are pretty substantial.

Paolo

> > On the positive effects side:
> > 1. There is code that passes sp->gfn into things that it shouldn't (if it has
> > shared bits) like memslot lookups.
> > 2. Also code that passes iter.gfn into things it shouldn't like
> > kvm_mmu_max_mapping_level().
> >
> > These places are not called by TDX, but if you know that gfn's might include
> > shared bits, then that code looks buggy.
> >
> > I think the solution in the diff is more elegant then before, because it hides
> > what is really going on with the shared root. That is both good and bad. Can we
> > accept the downsides?
>
> Kai, do you have any thoughts?
> --
> Isaku Yamahata <isaku.yamahata@intel.com>
>
Paolo Bonzini May 28, 2024, 5:43 p.m. UTC | #44
On Tue, May 21, 2024 at 1:32 AM Isaku Yamahata <isaku.yamahata@intel.com> wrote:
> +static void vt_adjust_max_pa(void)
> +{
> +       u64 tme_activate;
> +
> +       mmu_max_gfn = __kvm_mmu_max_gfn();
> +       rdmsrl(MSR_IA32_TME_ACTIVATE, tme_activate);
> +       if (!(tme_activate & TME_ACTIVATE_LOCKED) ||
> +           !(tme_activate & TME_ACTIVATE_ENABLED))
> +               return;
> +
> +       mmu_max_gfn -= (gfn_t)TDX_RESERVED_KEYID_BITS(tme_activate);

This would be >>=, not "-=". But I think this should not look at
TME MSRs directly, instead it can use boot_cpu_data.x86_phys_bits. You
can use it instead of shadow_phys_bits in __kvm_mmu_max_gfn() and then
VMX does not need any adjustment.

That said, this is not a bugfix, it's just an optimization.
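
A small worked example of why the shift is the right operator, as a user-space
sketch.  It assumes 4KiB pages and a max GFN of the form 2^n - 1;
max_gfn_for_bits() and strip_keyid_bits() are hypothetical helpers, not existing
KVM functions.

```c
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumption: 4KiB pages */

typedef uint64_t gfn_t;

/* Highest GFN addressable with the given number of physical address bits. */
static gfn_t max_gfn_for_bits(unsigned int bits)
{
	return (1ULL << (bits - PAGE_SHIFT)) - 1;
}

/* Removing N high address bits from an all-ones GFN limit is a right shift. */
static gfn_t strip_keyid_bits(gfn_t max_gfn, unsigned int keyid_bits)
{
	return max_gfn >> keyid_bits;
}
```

With 52 physical address bits and 6 KeyID bits, the shift yields the max GFN
for 46 bits, whereas subtracting 6 would only lower the limit by six frames.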

Paolo

> +       }
>
>  out:
>         /* kfree() accepts NULL. */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 7f89405c8bc4..c519bb9c9559 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12693,6 +12693,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>         if (ret)
>                 goto out;
>
> +       kvm->arch.mmu_max_gfn = __kvm_mmu_max_gfn();
>         kvm_mmu_init_vm(kvm);
>
>         ret = static_call(kvm_x86_vm_init)(kvm);
> @@ -13030,7 +13031,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>                 return -EINVAL;
>
>         if (change == KVM_MR_CREATE || change == KVM_MR_MOVE) {
> -               if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn())
> +               if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn(kvm))
>                         return -EINVAL;
>
>  #if 0
>
> --
> Isaku Yamahata <isaku.yamahata@intel.com>
>
Paolo Bonzini May 28, 2024, 5:47 p.m. UTC | #45
On Tue, May 28, 2024 at 6:27 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
> > I don't see benefit of x86_ops.max_gfn() compared to kvm->arch.max_gfn.
> > But I don't have strong preference. Either way will work.
>
> The non-TDX VM's won't need per-VM data, right? So it's just unneeded extra
> state per-vm.

It's just a cached value like there are many in the MMU. It's easier
for me to read code without the mental overhead of a function call.

> For TDX it will be based on the shared bit, so we actually already have the per-
> vm data we need. So we don't even need both gfn_shared_mask and max_gfn for TDX.

But they are independent, for example AMD placed the encryption bit
highest, then the reduced physical address space bits, then finally
the rest of the gfn. I think it's consistent with the kvm_has_*
approach, to not assume much and just store separate data.

Paolo
Rick Edgecombe May 28, 2024, 6:29 p.m. UTC | #46
On Tue, 2024-05-28 at 19:16 +0200, Paolo Bonzini wrote:
> > > After this, gfn_t's never have shared bit. It's a simple rule. The MMU
> > > mostly
> > > thinks it's operating on a shared root that is mapped at the normal GFN.
> > > Only
> > > the iterator knows that the shared PTEs are actually in a different
> > > location.
> > > 
> > > There are some negative side effects:
> > > 1. The struct kvm_mmu_page's gfn doesn't match it's actual mapping
> > > anymore.
> > > 2. As a result of above, the code that flushes TLBs for a specific GFN
> > > will be
> > > confused. It won't functionally matter for TDX, just look buggy to see
> > > flushing
> > > code called with the wrong gfn.
> > 
> > flush_remote_tlbs_range() is only for Hyper-V optimization.  In other cases,
> > x86_op.flush_remote_tlbs_range = NULL or the member isn't defined at compile
> > time.  So the remote tlb flush falls back to flushing whole range.  I don't
> > expect TDX in hyper-V guest.  I have to admit that the code looks
> > superficially
> > broken and confusing.
> 
> You could add an "&& kvm_has_private_root(kvm)" to
> kvm_available_flush_remote_tlbs_range(), since
> kvm_has_private_root(kvm) is sort of equivalent to "there is no 1:1
> correspondence between gfn and PTE to be flushed".
> 
> I am conflicted myself, but the upsides below are pretty substantial.

It looks like kvm_available_flush_remote_tlbs_range() is not checked in many of
the paths that get to x86_ops.flush_remote_tlbs_range().

So maybe something like:
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 65bbda95acbb..e09bb6c50a0b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1959,14 +1959,7 @@ static inline int kvm_arch_flush_remote_tlbs(struct kvm *kvm)
 
 #if IS_ENABLED(CONFIG_HYPERV)
 #define __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS_RANGE
-static inline int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn,
-                                                  u64 nr_pages)
-{
-       if (!kvm_x86_ops.flush_remote_tlbs_range)
-               return -EOPNOTSUPP;
-
-       return static_call(kvm_x86_flush_remote_tlbs_range)(kvm, gfn, nr_pages);
-}
+int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, u64 nr_pages);
 #endif /* CONFIG_HYPERV */
 
 enum kvm_intr_type {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 43d70f4c433d..9dc1b3db286d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -14048,6 +14048,14 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size,
 }
 EXPORT_SYMBOL_GPL(kvm_sev_es_string_io);
 
+int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, u64 nr_pages)
+{
+       if (!kvm_x86_ops.flush_remote_tlbs_range || kvm_gfn_direct_mask(kvm))
+               return -EOPNOTSUPP;
+
+       return static_call(kvm_x86_flush_remote_tlbs_range)(kvm, gfn, nr_pages);
+}
+
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_mmio);


Regarding the kvm_gfn_direct_mask() usage, in the current WIP code we have
renamed things around the concepts of "mirrored roots" and "direct masks". The
mirrored root just means "also go off and update something else" (S-EPT). The
direct mask just means that, when on the direct root, the actual page table
mapping is shifted using the mask (shared memory). Kai raised that all the TDX
special stuff in the x86 MMU around handling private memory is confusing from
the SEV perspective, so we were trying to rename those things to something
related but generic, instead of "private".

So the TLB flush confusion is more about the direct GFNs being shifted by
something (i.e. kvm_gfn_direct_mask() returns non-zero).
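
As a rough illustration of the "direct mask" idea, here is a user-space sketch.
struct vm_model, direct_root_gfn() and flush_range_supported() are hypothetical
names, and OR-ing the mask into the GFN is an assumption about how the shift is
applied.  Only the rule "refuse the ranged flush when the mask is non-zero"
comes from the diff above.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_t;

struct vm_model {
	gfn_t gfn_direct_mask;		/* 0 for a normal VM, the shared bit for TDX */
	bool has_range_flush_op;	/* models x86_ops.flush_remote_tlbs_range != NULL */
};

/* Where a GFN actually lands in the direct root's page tables (assumed OR). */
static gfn_t direct_root_gfn(const struct vm_model *vm, gfn_t gfn)
{
	return gfn | vm->gfn_direct_mask;
}

/*
 * A ranged flush keyed on the unshifted GFN is only meaningful when the
 * GFN and the mapped location coincide, i.e. when the mask is zero.
 */
static bool flush_range_supported(const struct vm_model *vm)
{
	return vm->has_range_flush_op && vm->gfn_direct_mask == 0;
}
```

In this model a normal VM keeps the ranged flush, while a VM with a non-zero
direct mask always falls back to the full flush.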
Rick Edgecombe May 28, 2024, 8:54 p.m. UTC | #47
On Tue, 2024-05-14 at 17:59 -0700, Rick Edgecombe wrote:
> +static inline int __tdp_mmu_set_spte_atomic(struct kvm *kvm, struct tdp_iter
> *iter, u64 new_spte)
>  {
>         u64 *sptep = rcu_dereference(iter->sptep);
>  
> @@ -542,15 +671,42 @@ static inline int __tdp_mmu_set_spte_atomic(struct
> tdp_iter *iter, u64 new_spte)
>          */
>         WARN_ON_ONCE(iter->yielded || is_removed_spte(iter->old_spte));
>  
> -       /*
> -        * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
> -        * does not hold the mmu_lock.  On failure, i.e. if a different
> logical
> -        * CPU modified the SPTE, try_cmpxchg64() updates iter->old_spte with
> -        * the current value, so the caller operates on fresh data, e.g. if it
> -        * retries tdp_mmu_set_spte_atomic()
> -        */
> -       if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
> -               return -EBUSY;
> +       if (is_private_sptep(iter->sptep) && !is_removed_spte(new_spte)) {
> +               int ret;
> +
> +               if (is_shadow_present_pte(new_spte)) {
> +                       /*
> +                        * Populating case.
> +                        * - set_private_spte_present() implements
> +                        *   1) Freeze SPTE
> +                        *   2) call hooks to update private page table,
> +                        *   3) update SPTE to new_spte
> +                        * - handle_changed_spte() only updates stats.
> +                        */
> +                       ret = set_private_spte_present(kvm, iter->sptep, iter-
> >gfn,
> +                                                      iter->old_spte,
> new_spte, iter->level);
> +                       if (ret)
> +                               return ret;
> +               } else {
> +                       /*
> +                        * Zapping case.
> +                        * Zap is only allowed when write lock is held
> +                        */
> +                       if (WARN_ON_ONCE(!is_shadow_present_pte(new_spte)))

This is inside an else block for (is_shadow_present_pte(new_spte)), so the
condition will always be true if it gets here. But it can't get here because TDX
doesn't do any atomic zapping.

We can remove the conditional, but in regards to the WARN, any recollection of
what might have been going on here originally?

> +                               return -EBUSY;
> +               }
Rick Edgecombe May 28, 2024, 9:48 p.m. UTC | #48
On Fri, 2024-05-24 at 01:20 -0700, Isaku Yamahata wrote:
> > 
> > I don't see why these (zap_private_spte and remove_private_spte) can't be a
> > single op. Was it to prepare for huge pages support or something? In the
> > base
> > series they are both only called once.
> 
> That is for large page support. The step to merge or split large page is
> 1. zap_private_spte()
> 2. tlb shoot down
> 3. merge/split_private_spte()

I think we can simplify it for now. Otherwise we can't justify it without
getting into the huge page support.

Looking at how to create some more explainable code here, I'm also wondering
about the tdx_track() call in tdx_sept_remove_private_spte(). I didn't realize
it will send IPIs to each vcpu for *each* page getting zapped. Another one in
the "to optimize later" bucket I guess. And I guess it won't happen very often.
Rick Edgecombe May 28, 2024, 11:06 p.m. UTC | #49
On Tue, 2024-05-14 at 17:59 -0700, Rick Edgecombe wrote:
>  static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> -                               u64 old_spte, u64 new_spte, int level,
> -                               bool shared)
> +                               u64 old_spte, u64 new_spte,
> +                               union kvm_mmu_page_role role, bool shared)
>  {
> +       bool is_private = kvm_mmu_page_role_is_private(role);
> +       int level = role.level;
>         bool was_present = is_shadow_present_pte(old_spte);
>         bool is_present = is_shadow_present_pte(new_spte);
>         bool was_leaf = was_present && is_last_spte(old_spte, level);
>         bool is_leaf = is_present && is_last_spte(new_spte, level);
> -       bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
> +       kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
> +       kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
> +       bool pfn_changed = old_pfn != new_pfn;
>  
>         WARN_ON_ONCE(level > PT64_ROOT_MAX_LEVEL);
>         WARN_ON_ONCE(level < PG_LEVEL_4K);
> @@ -513,7 +636,7 @@ static void handle_changed_spte(struct kvm *kvm, int
> as_id, gfn_t gfn,
>  
>         if (was_leaf && is_dirty_spte(old_spte) &&
>             (!is_present || !is_dirty_spte(new_spte) || pfn_changed))
> -               kvm_set_pfn_dirty(spte_to_pfn(old_spte));
> +               kvm_set_pfn_dirty(old_pfn);
>  
>         /*
>          * Recursively handle child PTs if the change removed a subtree from
> @@ -522,15 +645,21 @@ static void handle_changed_spte(struct kvm *kvm, int
> as_id, gfn_t gfn,
>          * pages are kernel allocations and should never be migrated.
>          */
>         if (was_present && !was_leaf &&
> -           (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
> +           (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
> +               KVM_BUG_ON(is_private !=
> is_private_sptep(spte_to_child_pt(old_spte, level)),
> +                          kvm);
>                 handle_removed_pt(kvm, spte_to_child_pt(old_spte, level),
> shared);
> +       }
> +
> +       if (is_private && !is_present)
> +               handle_removed_private_spte(kvm, gfn, old_spte, new_spte,
> role.level);

I'm a little bothered by the asymmetry of where the mirrored hooks get called
between setting and zapping PTEs. Tracing through the code, the relevant
operations that are needed for TDX are:
1. tdp_mmu_iter_set_spte() from tdp_mmu_zap_leafs() and __tdp_mmu_zap_root()
2. tdp_mmu_set_spte_atomic() is used for mapping, linking

(1) is the simple case because the mmu_lock is held for write. It updates the
mirror root like normal, then has extra logic to call out to update the S-EPT.

(2) on the other hand just has the read lock, so it has to do the whole
operation in a special way. First set REMOVED_SPTE, then update the private
copy, then write to the mirror page tables. It can't get stuffed into
handle_changed_spte() because it has to write REMOVED_SPTE first.
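
As a rough illustration of that ordering, here is a userspace model using C11
atomics. It is only a sketch under stated assumptions: FROZEN_SPTE,
propagate_to_private() and set_mirror_spte_present() are hypothetical stand-ins
rather than the KVM functions, the private table is modeled as a single
variable, and restoring the old value on hook failure is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for REMOVED_SPTE: a value no real entry uses. */
#define FROZEN_SPTE UINT64_MAX

/* Models the private (S-EPT) entry that only the hook may touch. */
static uint64_t private_spte;

/* Hypothetical hook: propagate the new value to the private table. */
static int propagate_to_private(uint64_t new_spte)
{
	private_spte = new_spte;
	return 0;
}

/*
 * Read-lock path: 1) freeze the mirror entry, 2) call the hook,
 * 3) publish the new value, or restore the old one if the hook failed.
 */
static int set_mirror_spte_present(_Atomic uint64_t *sptep,
				   uint64_t old_spte, uint64_t new_spte)
{
	uint64_t expected = old_spte;
	int ret;

	/* Step 1: lose the race cleanly if another CPU changed the entry. */
	if (!atomic_compare_exchange_strong(sptep, &expected, FROZEN_SPTE))
		return -EBUSY;

	/* Step 2: the entry is frozen, so the non-atomic hook runs alone. */
	ret = propagate_to_private(new_spte);

	/* Step 3: unfreeze with the new value on success, old on failure. */
	atomic_store(sptep, ret ? old_spte : new_spte);
	return ret;
}
```

The point of the model is that the freeze has to happen before the hook runs,
which is why the sequence cannot be driven from a function that is only called
after the final value has been written.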

In some ways it makes sense to update the S-EPT from handle_changed_spte().
Despite the comment claiming "handle_changed_spte() only updates stats.", it
does some updating of other PTEs based on the current PTE change, which is
pretty similar to what the mirrored PTEs are doing. But we can't really do the
setting of present PTEs there because of the REMOVED_SPTE stuff.

So we could only make it more symmetrical by moving the S-EPT ops out of
handle_changed_spte() and manually call it in the two places relevant for TDX,
like the below.

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index e966986bb9f2..c9ddb1c2a550 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -438,6 +438,9 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t
pt, bool shared)
                         */
                        old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte,
                                                          REMOVED_SPTE, level);
+
+                       if (is_mirror_sp(sp))
+                               reflect_removed_spte(kvm, gfn, old_spte,
REMOVED_SPTE, level);
                }
                handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
                                    old_spte, REMOVED_SPTE, sp->role, shared);
@@ -667,9 +670,6 @@ static void handle_changed_spte(struct kvm *kvm, int as_id,
gfn_t gfn,
                handle_removed_pt(kvm, spte_to_child_pt(old_spte, level),
shared);
        }
 
-       if (is_mirror && !is_present)
-               reflect_removed_spte(kvm, gfn, old_spte, new_spte, role.level);
-
        if (was_leaf && is_accessed_spte(old_spte) &&
            (!is_present || !is_accessed_spte(new_spte) || pfn_changed))
                kvm_set_pfn_accessed(spte_to_pfn(old_spte));
@@ -839,6 +839,9 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id,
tdp_ptep_t sptep,
                                                      new_spte, level), kvm);
        }
 
+       if (is_mirror_sptep(sptep))
+               reflect_removed_spte(kvm, gfn, old_spte, REMOVED_SPTE, level);
+
        role = sptep_to_sp(sptep)->role;
        role.level = level;
        handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, role, false);


Otherwise, we could move the "set present" mirroring operations into
handle_changed_spte(), and have some earlier conditional logic do the
REMOVED_SPTE parts. It starts to become more scattered.

Anyway, it's just a code clarity thing arising from having a hard time
explaining the design in the log. Any opinions?

A separate but related comment is below.

>  
>         if (was_leaf && is_accessed_spte(old_spte) &&
>             (!is_present || !is_accessed_spte(new_spte) || pfn_changed))
>                 kvm_set_pfn_accessed(spte_to_pfn(old_spte));
>  }
>  
> @@ -648,6 +807,8 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
>  static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
>                             u64 old_spte, u64 new_spte, gfn_t gfn, int level)
>  {
> +       union kvm_mmu_page_role role;
> +
>         lockdep_assert_held_write(&kvm->mmu_lock);
>  
>         /*
> @@ -660,8 +821,16 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id,
> tdp_ptep_t sptep,
>         WARN_ON_ONCE(is_removed_spte(old_spte) || is_removed_spte(new_spte));
>  
>         old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
> +       if (is_private_sptep(sptep) && !is_removed_spte(new_spte) &&
> +           is_shadow_present_pte(new_spte)) {
> +               /* Because write spin lock is held, no race.  It should
> success. */
> +               KVM_BUG_ON(__set_private_spte_present(kvm, sptep, gfn,
> old_spte,
> +                                                     new_spte, level), kvm);
> +       }

Based on the above enumeration, I don't see how this hunk gets used.

>  
> -       handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level,
> false);
> +       role = sptep_to_sp(sptep)->role;
> +       role.level = level;
> +       handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, role, false);
>         return old_spte;
>  }
>
Isaku Yamahata May 29, 2024, 1:06 a.m. UTC | #50
On Tue, May 28, 2024 at 06:29:59PM +0000,
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:

> On Tue, 2024-05-28 at 19:16 +0200, Paolo Bonzini wrote:
> > > > After this, gfn_t's never have shared bit. It's a simple rule. The MMU
> > > > mostly
> > > > thinks it's operating on a shared root that is mapped at the normal GFN.
> > > > Only
> > > > the iterator knows that the shared PTEs are actually in a different
> > > > location.
> > > > 
> > > > There are some negative side effects:
> > > > 1. The struct kvm_mmu_page's gfn doesn't match it's actual mapping
> > > > anymore.
> > > > 2. As a result of above, the code that flushes TLBs for a specific GFN
> > > > will be
> > > > confused. It won't functionally matter for TDX, just look buggy to see
> > > > flushing
> > > > code called with the wrong gfn.
> > > 
> > > flush_remote_tlbs_range() is only for Hyper-V optimization.  In other cases,
> > > x86_op.flush_remote_tlbs_range = NULL or the member isn't defined at compile
> > > time.  So the remote tlb flush falls back to flushing whole range.  I don't
> > > expect TDX in hyper-V guest.  I have to admit that the code looks
> > > superficially
> > > broken and confusing.
> > 
> > You could add an "&& kvm_has_private_root(kvm)" to
> > kvm_available_flush_remote_tlbs_range(), since
> > kvm_has_private_root(kvm) is sort of equivalent to "there is no 1:1
> > correspondence between gfn and PTE to be flushed".
> > 
> > I am conflicted myself, but the upsides below are pretty substantial.
> 
> It looks like kvm_available_flush_remote_tlbs_range() is not checked in many of
> the paths that get to x86_ops.flush_remote_tlbs_range().
> 
> So maybe something like:
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 65bbda95acbb..e09bb6c50a0b 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1959,14 +1959,7 @@ static inline int kvm_arch_flush_remote_tlbs(struct kvm
> *kvm)
>  
>  #if IS_ENABLED(CONFIG_HYPERV)
>  #define __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS_RANGE
> -static inline int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn,
> -                                                  u64 nr_pages)
> -{
> -       if (!kvm_x86_ops.flush_remote_tlbs_range)
> -               return -EOPNOTSUPP;
> -
> -       return static_call(kvm_x86_flush_remote_tlbs_range)(kvm, gfn, nr_pages);
> -}
> +int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, u64 nr_pages);
>  #endif /* CONFIG_HYPERV */
>  
>  enum kvm_intr_type {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 43d70f4c433d..9dc1b3db286d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -14048,6 +14048,14 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu,
> unsigned int size,
>  }
>  EXPORT_SYMBOL_GPL(kvm_sev_es_string_io);
>  
> +int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn, u64 nr_pages)
> +{
> +       if (!kvm_x86_ops.flush_remote_tlbs_range || kvm_gfn_direct_mask(kvm))
> +               return -EOPNOTSUPP;
> +
> +       return static_call(kvm_x86_flush_remote_tlbs_range)(kvm, gfn, nr_pages);
> +}
> +
>  EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry);
>  EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
>  EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_mmio);

kvm_x86_ops.flush_remote_tlbs_range() is defined only when CONFIG_HYPERV=y.
We need #ifdef __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS_RANGE  ... #endif around the
function.
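
Concretely, the guard could look like the following self-contained sketch. The
struct kvm, kvm_x86_ops and kvm_gfn_direct_mask() definitions here are
simplified stand-ins so that the fragment compiles on its own, and stub_flush()
is a hypothetical backend; only the #ifdef placement is the point.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Simplified stand-in for the kernel type, for illustration only. */
struct kvm { gfn_t gfn_direct_mask; };

/* Always defined in this model; in the kernel it depends on CONFIG_HYPERV. */
#define __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS_RANGE

#ifdef __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS_RANGE
/* Stand-in for the ops table: the member only exists inside the guard. */
static struct {
	int (*flush_remote_tlbs_range)(struct kvm *kvm, gfn_t gfn,
				       uint64_t nr_pages);
} kvm_x86_ops;

/* Stand-in helper: non-zero when GFNs are not mapped 1:1 to flushed PTEs. */
static gfn_t kvm_gfn_direct_mask(const struct kvm *kvm)
{
	return kvm->gfn_direct_mask;
}

/* Hypothetical backend that always succeeds. */
static int stub_flush(struct kvm *kvm, gfn_t gfn, uint64_t nr_pages)
{
	(void)kvm; (void)gfn; (void)nr_pages;
	return 0;
}

int kvm_arch_flush_remote_tlbs_range(struct kvm *kvm, gfn_t gfn,
				     uint64_t nr_pages)
{
	/* Fall back to a full flush when no ranged op exists or GFNs are aliased. */
	if (!kvm_x86_ops.flush_remote_tlbs_range || kvm_gfn_direct_mask(kvm))
		return -EOPNOTSUPP;

	return kvm_x86_ops.flush_remote_tlbs_range(kvm, gfn, nr_pages);
}
#endif /* __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS_RANGE */
```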
Isaku Yamahata May 29, 2024, 1:16 a.m. UTC | #51
On Tue, May 28, 2024 at 09:48:45PM +0000,
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:

> On Fri, 2024-05-24 at 01:20 -0700, Isaku Yamahata wrote:
> > > 
> > > I don't see why these (zap_private_spte and remove_private_spte) can't be a
> > > single op. Was it to prepare for huge pages support or something? In the
> > > base
> > > series they are both only called once.
> > 
> > That is for large page support. The step to merge or split large page is
> > 1. zap_private_spte()
> > 2. tlb shoot down
> > 3. merge/split_private_spte()
> 
> I think we can simplify it for now. Otherwise we can't justify it without
> getting into the huge page support.

Ok. Since we don't care about large page support for now, we can combine those
hooks into a single hook.


> Looking at how to create some more explainable code here, I'm also wondering
> about the tdx_track() call in tdx_sept_remove_private_spte(). I didn't realize
> it will send IPIs to each vcpu for *each* page getting zapped. Another one in
> the "to optimize later" bucket I guess. And I guess it won't happen very often.

We need it. Without tracking (or TLB shootdown), we'll hit
TDX_TLB_TRACKING_NOT_DONE.  The TDX module has to guarantee that there are no
remaining TLB entries for pages freed by TDH.MEM.PAGE.REMOVE().
Isaku Yamahata May 29, 2024, 1:24 a.m. UTC | #52
On Tue, May 28, 2024 at 08:54:31PM +0000,
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:

> On Tue, 2024-05-14 at 17:59 -0700, Rick Edgecombe wrote:
> > +static inline int __tdp_mmu_set_spte_atomic(struct kvm *kvm, struct tdp_iter
> > *iter, u64 new_spte)
> >  {
> >         u64 *sptep = rcu_dereference(iter->sptep);
> >  
> > @@ -542,15 +671,42 @@ static inline int __tdp_mmu_set_spte_atomic(struct
> > tdp_iter *iter, u64 new_spte)
> >          */
> >         WARN_ON_ONCE(iter->yielded || is_removed_spte(iter->old_spte));
> >  
> > -       /*
> > -        * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
> > -        * does not hold the mmu_lock.  On failure, i.e. if a different
> > logical
> > -        * CPU modified the SPTE, try_cmpxchg64() updates iter->old_spte with
> > -        * the current value, so the caller operates on fresh data, e.g. if it
> > -        * retries tdp_mmu_set_spte_atomic()
> > -        */
> > -       if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
> > -               return -EBUSY;
> > +       if (is_private_sptep(iter->sptep) && !is_removed_spte(new_spte)) {
> > +               int ret;
> > +
> > +               if (is_shadow_present_pte(new_spte)) {
> > +                       /*
> > +                        * Populating case.
> > +                        * - set_private_spte_present() implements
> > +                        *   1) Freeze SPTE
> > +                        *   2) call hooks to update private page table,
> > +                        *   3) update SPTE to new_spte
> > +                        * - handle_changed_spte() only updates stats.
> > +                        */
> > +                       ret = set_private_spte_present(kvm, iter->sptep, iter-
> > >gfn,
> > +                                                      iter->old_spte,
> > new_spte, iter->level);
> > +                       if (ret)
> > +                               return ret;
> > +               } else {
> > +                       /*
> > +                        * Zapping case.
> > +                        * Zap is only allowed when write lock is held
> > +                        */
> > +                       if (WARN_ON_ONCE(!is_shadow_present_pte(new_spte)))
> 
> This is inside an else block for (is_shadow_present_pte(new_spte)), so it will
> always be true if it gets here. But it can't because TDX doesn't do any atomic
> zapping.
> 
> We can remove the conditional, but in regards to the WARN, any recollection of
> what might have been going on here originally?

We had an optimization that introduced other states in addition to present and
non-present.  When I dropped it, I should've dropped the else clause too.
Rick Edgecombe May 29, 2024, 1:50 a.m. UTC | #53
On Tue, 2024-05-28 at 18:16 -0700, Isaku Yamahata wrote:
> > Looking at how to create some more explainable code here, I'm also wondering
> > about the tdx_track() call in tdx_sept_remove_private_spte(). I didn't
> > realize
> > it will send IPIs to each vcpu for *each* page getting zapped. Another one
> > in
> > the "to optimize later" bucket I guess. And I guess it won't happen very
> > often.
> 
> We need it. Without tracking (or TLB shoot down), we'll hit
> TDX_TLB_TRACKING_NOT_DONE.  The TDX module has to guarantee that there is no
> remaining TLB entries for pages freed by TDH.MEM.PAGE.REMOVE().

It can't be removed without other changes, but the TDX module doesn't enforce
that you have to zap and shootdown a page at a time, right? Like it could be
batched.
Rick Edgecombe May 29, 2024, 1:51 a.m. UTC | #54
On Tue, 2024-05-28 at 18:06 -0700, Isaku Yamahata wrote:
> 
> kvm_x86_ops.flush_remote_tlbs_range() is defined only when CONFIG_HYPERV=y.
> We need #ifdef __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS_RANGE  ... #endif around the
> function.

Oh, right. Thanks.
Isaku Yamahata May 29, 2024, 1:57 a.m. UTC | #55
On Tue, May 28, 2024 at 11:06:45PM +0000,
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:

> On Tue, 2024-05-14 at 17:59 -0700, Rick Edgecombe wrote:
> >  static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> > -                               u64 old_spte, u64 new_spte, int level,
> > -                               bool shared)
> > +                               u64 old_spte, u64 new_spte,
> > +                               union kvm_mmu_page_role role, bool shared)
> >  {
> > +       bool is_private = kvm_mmu_page_role_is_private(role);
> > +       int level = role.level;
> >         bool was_present = is_shadow_present_pte(old_spte);
> >         bool is_present = is_shadow_present_pte(new_spte);
> >         bool was_leaf = was_present && is_last_spte(old_spte, level);
> >         bool is_leaf = is_present && is_last_spte(new_spte, level);
> > -       bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
> > +       kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
> > +       kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
> > +       bool pfn_changed = old_pfn != new_pfn;
> >  
> >         WARN_ON_ONCE(level > PT64_ROOT_MAX_LEVEL);
> >         WARN_ON_ONCE(level < PG_LEVEL_4K);
> > @@ -513,7 +636,7 @@ static void handle_changed_spte(struct kvm *kvm, int
> > as_id, gfn_t gfn,
> >  
> >         if (was_leaf && is_dirty_spte(old_spte) &&
> >             (!is_present || !is_dirty_spte(new_spte) || pfn_changed))
> > -               kvm_set_pfn_dirty(spte_to_pfn(old_spte));
> > +               kvm_set_pfn_dirty(old_pfn);
> >  
> >         /*
> >          * Recursively handle child PTs if the change removed a subtree from
> > @@ -522,15 +645,21 @@ static void handle_changed_spte(struct kvm *kvm, int
> > as_id, gfn_t gfn,
> >          * pages are kernel allocations and should never be migrated.
> >          */
> >         if (was_present && !was_leaf &&
> > -           (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
> > +           (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
> > +               KVM_BUG_ON(is_private !=
> > is_private_sptep(spte_to_child_pt(old_spte, level)),
> > +                          kvm);
> >                 handle_removed_pt(kvm, spte_to_child_pt(old_spte, level),
> > shared);
> > +       }
> > +
> > +       if (is_private && !is_present)
> > +               handle_removed_private_spte(kvm, gfn, old_spte, new_spte,
> > role.level);
> 
> I'm a little bothered by the asymmetry of where the mirrored hooks get called
> between setting and zapping PTEs. Tracing through the code, the relevant
> operations that are needed for TDX are:
> 1. tdp_mmu_iter_set_spte() from tdp_mmu_zap_leafs() and __tdp_mmu_zap_root()
> 2. tdp_mmu_set_spte_atomic() is used for mapping, linking
> 
> (1) is a simple case because the mmu_lock is held for writes. It updates the
> mirror root like normal, then has extra logic to call out to update the S-EPT.
> 
> (2) on the other hand just has the read lock, so it has to do the whole
> operation in a special way. First set REMOVED_SPTE, then update the private
> copy, then write to the mirror page tables. It can't get stuffed into
> handle_changed_spte() because it has to write REMOVED_SPTE first.
> 
> In some ways it makes sense to update the S-EPT. Despite claiming
> "handle_changed_spte() only updates stats.", it does some updating of other PTEs
> based on the current PTE change. Which is pretty similar to what the mirrored
> PTEs are doing. But we can't really do the setting of present PTEs because of
> the REMOVED_SPTE stuff.
> 
> So we could only make it more symmetrical by moving the S-EPT ops out of
> handle_changed_spte() and manually call it in the two places relevant for TDX,
> like the below.
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index e966986bb9f2..c9ddb1c2a550 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -438,6 +438,9 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t
> pt, bool shared)
>                          */
>                         old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte,
>                                                           REMOVED_SPTE, level);
> +
> +                       if (is_mirror_sp(sp))
> +                               reflect_removed_spte(kvm, gfn, old_spte,
> REMOVED_SPTE, level);

Calling the callback before handling the lower levels will result in an error.


>                 }
>                 handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
>                                     old_spte, REMOVED_SPTE, sp->role, shared);


We should call it here, after processing the lower levels.



> @@ -667,9 +670,6 @@ static void handle_changed_spte(struct kvm *kvm, int as_id,
> gfn_t gfn,
>                 handle_removed_pt(kvm, spte_to_child_pt(old_spte, level),
> shared);
>         }
>  
> -       if (is_mirror && !is_present)
> -               reflect_removed_spte(kvm, gfn, old_spte, new_spte, role.level);
> -
>         if (was_leaf && is_accessed_spte(old_spte) &&
>             (!is_present || !is_accessed_spte(new_spte) || pfn_changed))
>                 kvm_set_pfn_accessed(spte_to_pfn(old_spte));
> @@ -839,6 +839,9 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id,
> tdp_ptep_t sptep,
>                                                       new_spte, level), kvm);
>         }
>  
> +       if (is_mirror_sptep(sptep))
> +               reflect_removed_spte(kvm, gfn, old_spte, REMOVED_SPTE, level);
> +

Ditto.


>         role = sptep_to_sp(sptep)->role;
>         role.level = level;
>         handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, role, false);

The callback should be here.  It should be after handling the lower level.



> Otherwise, we could move the "set present" mirroring operations into
> handle_changed_spte(), and have some earlier conditional logic do the
> REMOVED_SPTE parts. It starts to become more scattered.
> Anyway, it's just a code clarity thing arising from having hard time explaining
> the design in the log. Any opinions?

Originally I tried to consolidate the callbacks in handle_changed_spte(),
following how the TDP MMU does it.  Anyway, we can pick between the two
approaches based on which is easier to understand and maintain.


> A separate but related comment is below.
> 
> >  
> >         if (was_leaf && is_accessed_spte(old_spte) &&
> >             (!is_present || !is_accessed_spte(new_spte) || pfn_changed))
> >                 kvm_set_pfn_accessed(spte_to_pfn(old_spte));
> >  }
> >  
> > @@ -648,6 +807,8 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
> >  static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
> >                             u64 old_spte, u64 new_spte, gfn_t gfn, int level)
> >  {
> > +       union kvm_mmu_page_role role;
> > +
> >         lockdep_assert_held_write(&kvm->mmu_lock);
> >  
> >         /*
> > @@ -660,8 +821,16 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id,
> > tdp_ptep_t sptep,
> >         WARN_ON_ONCE(is_removed_spte(old_spte) || is_removed_spte(new_spte));
> >  
> >         old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
> > +       if (is_private_sptep(sptep) && !is_removed_spte(new_spte) &&
> > +           is_shadow_present_pte(new_spte)) {
> > +               /* Because write spin lock is held, no race.  It should
> > success. */
> > +               KVM_BUG_ON(__set_private_spte_present(kvm, sptep, gfn,
> > old_spte,
> > +                                                     new_spte, level), kvm);
> > +       }
> 
> Based on the above enumeration, I don't see how this hunk gets used.

I should've removed it.  This is leftover from the old patches.
Rick Edgecombe May 29, 2024, 2:13 a.m. UTC | #56
On Tue, 2024-05-28 at 18:57 -0700, Isaku Yamahata wrote:
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -438,6 +438,9 @@ static void handle_removed_pt(struct kvm *kvm,
> > tdp_ptep_t
> > pt, bool shared)
> >                           */
> >                          old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte,
> >                                                            REMOVED_SPTE,
> > level);
> > +
> > +                       if (is_mirror_sp(sp))
> > +                               reflect_removed_spte(kvm, gfn, old_spte,
> > REMOVED_SPTE, level);
> 
> The callback before handling lower level will result in error.

Hmm, yea the order is changed. It didn't result in an error for some reason
though. Can you elaborate?

> 
> 
> >                  }
> >                  handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
> >                                      old_spte, REMOVED_SPTE, sp->role,
> > shared);
> 
> 
> We should call it here after processing lower level.
> 
> 
> 
> > @@ -667,9 +670,6 @@ static void handle_changed_spte(struct kvm *kvm, int
> > as_id,
> > gfn_t gfn,
> >                  handle_removed_pt(kvm, spte_to_child_pt(old_spte, level),
> > shared);
> >          }
> >   
> > -       if (is_mirror && !is_present)
> > -               reflect_removed_spte(kvm, gfn, old_spte, new_spte,
> > role.level);
> > -
> >          if (was_leaf && is_accessed_spte(old_spte) &&
> >              (!is_present || !is_accessed_spte(new_spte) || pfn_changed))
> >                  kvm_set_pfn_accessed(spte_to_pfn(old_spte));
> > @@ -839,6 +839,9 @@ static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id,
> > tdp_ptep_t sptep,
> >                                                        new_spte, level),
> > kvm);
> >          }
> >   
> > +       if (is_mirror_sptep(sptep))
> > +               reflect_removed_spte(kvm, gfn, old_spte, REMOVED_SPTE,
> > level);
> > +
> 
> Ditto.
> 
> 
> >          role = sptep_to_sp(sptep)->role;
> >          role.level = level;
> >          handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, role,
> > false);
> 
> The callback should be here.  It should be after handling the lower level.

Ok, let me try.

> 
> 
> 
> > Otherwise, we could move the "set present" mirroring operations into
> > handle_changed_spte(), and have some earlier conditional logic do the
> > REMOVED_SPTE parts. It starts to become more scattered.
> > Anyway, it's just a code clarity thing arising from having hard time
> > explaining
> > the design in the log. Any opinions?
> 
> Originally I tried to consolidate the callbacks by following TDP MMU using
> handle_changed_spte().

How did it handle the REMOVED_SPTE part of the set_present() path?

>   Anyway we can pick from two outcomes based on which is
> easy to understand/maintain.

I guess I can try to generate a diff of the other one and we can compare. It's a
matter of opinion, but I think splitting it between the two methods is the most
confusing.
Rick Edgecombe May 29, 2024, 2:13 a.m. UTC | #57
On Tue, 2024-05-28 at 19:47 +0200, Paolo Bonzini wrote:
> On Tue, May 28, 2024 at 6:27 PM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> > > I don't see benefit of x86_ops.max_gfn() compared to kvm->arch.max_gfn.
> > > But I don't have strong preference. Either way will work.
> > 
> > The non-TDX VM's won't need per-VM data, right? So it's just unneeded extra
> > state per-vm.
> 
> It's just a cached value like there are many in the MMU. It's easier
> for me to read code without the mental overhead of a function call.

Ok. Since this has (optimization) utility beyond TDX, maybe it's worth splitting
it off as a separate patch? I think maybe we'll pursue this path unless there is
objection.

> 
> > For TDX it will be based on the shared bit, so we actually already have the
> > per-
> > vm data we need. So we don't even need both gfn_shared_mask and max_gfn for
> > TDX.
> 
> But they are independent, for example AMD placed the encryption bit
> highest, then the reduced physical address space bits, then finally
> the rest of the gfn. I think it's consistent with the kvm_has_*
> approach, to not assume much and just store separate data.

I meant for a TDX specific x86_ops implementation we already have the data
needed to compute it (gfn_shared_mask - 1). I didn't realize SEV would benefit
from this too.
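
As a worked example of that computation: the sketch below assumes 4 KiB pages
and, purely for illustration, a shared bit at GPA bit 51 (the bit position is
an assumption, not something stated in this thread). Both helper names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t gfn_t;

#define PAGE_SHIFT 12

/* Hypothetical helper: GFN-space mask for a shared bit at the given GPA bit. */
static gfn_t gfn_shared_mask_for(unsigned int gpa_bit)
{
	return (gfn_t)1 << (gpa_bit - PAGE_SHIFT);
}

/* Largest GFN below the shared bit, i.e. gfn_shared_mask - 1. */
static gfn_t max_gfn_from_shared_mask(gfn_t gfn_shared_mask)
{
	return gfn_shared_mask - 1;
}
```

With GPA bit 51 the mask is GFN bit 39, so every GFN below the shared bit fits
in the low 39 bits.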
Isaku Yamahata May 29, 2024, 2:20 a.m. UTC | #58
On Wed, May 29, 2024 at 01:50:05AM +0000,
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:

> On Tue, 2024-05-28 at 18:16 -0700, Isaku Yamahata wrote:
> > > Looking at how to create some more explainable code here, I'm also wondering
> > > about the tdx_track() call in tdx_sept_remove_private_spte(). I didn't
> > > realize
> > > it will send IPIs to each vcpu for *each* page getting zapped. Another one
> > > in
> > > the "to optimize later" bucket I guess. And I guess it won't happen very
> > > often.
> > 
> > We need it. Without tracking (or TLB shoot down), we'll hit
> > TDX_TLB_TRACKING_NOT_DONE.  The TDX module has to guarantee that there is no
> > remaining TLB entries for pages freed by TDH.MEM.PAGE.REMOVE().
> 
> It can't be removed without other changes, but the TDX module doesn't enforce
> that you have to zap and shootdown a page at a time, right? Like it could be
> batched.

Right. The TDX module doesn't enforce it.  If we want to batch zapping, we need
to track the SPTE state: zapped, TLB shootdown not yet done, and not yet
removed.  It's simpler to issue a TLB shootdown per page for now.  Batching
would be a future optimization.

At runtime, zapping happens on memory conversion (private -> shared) or
memslot deletion.  Because that's not often, we don't have to care.
For VM destruction, it's simpler to skip the TLB shootdown by deleting the HKID
first than to track SPTE state for batching TLB shootdowns.
Rick Edgecombe May 29, 2024, 2:29 a.m. UTC | #59
On Tue, 2024-05-28 at 19:20 -0700, Isaku Yamahata wrote:
> Right. TDX module doesn't enforce it.  If we want to batch zapping, it
> requires
> to track the SPTE state, zapped, not TLB shoot down yet, and not removed yet.
> It's simpler to issue TLB shoot per page for now. It would be future
> optimization.

Totally agree we should not change it now. It's just in the list of not
optimized things.

> 
> At runtime, the zapping happens when memory conversion(private -> shared) or
> memslot deletion.  Because it's not often, we don't have to care.

Not sure I agree on this part. But in any case we can discuss it when we are in
the happy situation of upstream TDX users existing and complaining about things.

A great thing about it though - it's obviously correct.

> For vm destruction, it's simpler to skip tlb shoot down by deleting HKID first
> than to track SPTE state for batching TLB shoot down.
Paolo Bonzini May 29, 2024, 7:25 a.m. UTC | #60
On Wed, May 29, 2024 at 4:14 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2024-05-28 at 19:47 +0200, Paolo Bonzini wrote:
> > On Tue, May 28, 2024 at 6:27 PM Edgecombe, Rick P
> > <rick.p.edgecombe@intel.com> wrote:
> > > > I don't see benefit of x86_ops.max_gfn() compared to kvm->arch.max_gfn.
> > > > But I don't have strong preference. Either way will work.
> > >
> > > The non-TDX VM's won't need per-VM data, right? So it's just unneeded extra
> > > state per-vm.
> >
> > It's just a cached value like there are many in the MMU. It's easier
> > for me to read code without the mental overhead of a function call.
>
> Ok. Since this has (optimization) utility beyond TDX, maybe it's worth splitting
> it off as a separate patch? I think maybe we'll pursue this path unless there is
> objection.

Yes, absolutely.

Paolo
Isaku Yamahata May 29, 2024, 4:55 p.m. UTC | #61
On Wed, May 29, 2024 at 02:13:24AM +0000,
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:

> On Tue, 2024-05-28 at 18:57 -0700, Isaku Yamahata wrote:
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -438,6 +438,9 @@ static void handle_removed_pt(struct kvm *kvm,
> > > tdp_ptep_t
> > > pt, bool shared)
> > >                           */
> > >                          old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte,
> > >                                                            REMOVED_SPTE,
> > > level);
> > > +
> > > +                       if (is_mirror_sp(sp))
> > > +                               reflect_removed_spte(kvm, gfn, old_spte,
> > > REMOVED_SPTE, level);
> > 
> > The callback before handling lower level will result in error.
> 
> Hmm, yea the order is changed. It didn't result in an error for some reason
> though. Can you elaborate?

TDH.MEM.{PAGE, SEPT}.REMOVE() needs to be issued starting from the leaf.  I
guess zapping is done only at the leaf level by tdp_mmu_zap_leafs(), so the
subtree zapping case wasn't exercised.
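
A toy model of that ordering constraint, assuming (hypothetically) that a
removal hook fails when the table it targets still has present children.
struct pt_node, remove_hook() and zap_subtree() are illustrative names, not KVM
or TDX module interfaces.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define FANOUT 2

/* Illustrative page-table node: a leaf has no children. */
struct pt_node {
	struct pt_node *child[FANOUT];
	int present;
};

/* Hypothetical removal hook: refuses a node that still has present children. */
static int remove_hook(struct pt_node *n)
{
	for (size_t i = 0; i < FANOUT; i++)
		if (n->child[i] && n->child[i]->present)
			return -EBUSY;

	n->present = 0;
	return 0;
}

/* Post-order walk: handle the lower level first, then call the hook. */
static int zap_subtree(struct pt_node *n)
{
	for (size_t i = 0; i < FANOUT; i++) {
		if (n->child[i]) {
			int ret = zap_subtree(n->child[i]);

			if (ret)
				return ret;
		}
	}
	return remove_hook(n);
}
```

Calling the hook on a parent before its children fails in this model, while a
walk that only ever touches leaves never hits the problem, which would be
consistent with the reordered diff not producing an error in testing.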


> > > Otherwise, we could move the "set present" mirroring operations into
> > > handle_changed_spte(), and have some earlier conditional logic do the
> > > REMOVED_SPTE parts. It starts to become more scattered.
> > > Anyway, it's just a code clarity thing arising from having hard time
> > > explaining
> > > the design in the log. Any opinions?
> > 
> > Originally I tried to consolidate the callbacks by following TDP MMU using
> > handle_changed_spte().
> 
> How did it handle the REMOVED_SPTE part of the set_present() path?

is_removed_pt() was used. It was ugly.
Isaku Yamahata May 31, 2024, 2:11 p.m. UTC | #62
On Wed, May 29, 2024 at 09:25:46AM +0200,
Paolo Bonzini <pbonzini@redhat.com> wrote:

> On Wed, May 29, 2024 at 4:14 AM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> >
> > On Tue, 2024-05-28 at 19:47 +0200, Paolo Bonzini wrote:
> > > On Tue, May 28, 2024 at 6:27 PM Edgecombe, Rick P
> > > <rick.p.edgecombe@intel.com> wrote:
> > > > > I don't see benefit of x86_ops.max_gfn() compared to kvm->arch.max_gfn.
> > > > > But I don't have strong preference. Either way will work.
> > > >
> > > > The non-TDX VM's won't need per-VM data, right? So it's just unneeded extra
> > > > state per-vm.
> > >
> > > It's just a cached value like there are many in the MMU. It's easier
> > > for me to read code without the mental overhead of a function call.
> >
> > Ok. Since this has (optimization) utility beyond TDX, maybe it's worth splitting
> > it off as a separate patch? I think maybe we'll pursue this path unless there is
> > objection.
> 
> Yes, absolutely.

Ok, let me cook an independent patch series for kvm-coco-queue.

Patch

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 566d19b02483..d13cb4b8fce6 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -95,6 +95,11 @@  KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
 KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
 KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
 KVM_X86_OP(load_mmu_pgd)
+KVM_X86_OP_OPTIONAL(link_private_spt)
+KVM_X86_OP_OPTIONAL(free_private_spt)
+KVM_X86_OP_OPTIONAL(set_private_spte)
+KVM_X86_OP_OPTIONAL(remove_private_spte)
+KVM_X86_OP_OPTIONAL(zap_private_spte)
 KVM_X86_OP(has_wbinvd_exit)
 KVM_X86_OP(get_l2_tsc_offset)
 KVM_X86_OP(get_l2_tsc_multiplier)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d010ca5c7f44..20fa8fa58692 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -470,6 +470,7 @@  struct kvm_mmu {
 	int (*sync_spte)(struct kvm_vcpu *vcpu,
 			 struct kvm_mmu_page *sp, int i);
 	struct kvm_mmu_root_info root;
+	hpa_t private_root_hpa;
 	union kvm_cpu_role cpu_role;
 	union kvm_mmu_page_role root_role;
 
@@ -1747,6 +1748,30 @@  struct kvm_x86_ops {
 	void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 			     int root_level);
 
+	/* Add a page as page table page into private page table */
+	int (*link_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				void *private_spt);
+	/*
+	 * Free a page table page of private page table.
+	 * Only expected to be called when guest is not active, specifically
+	 * during VM destruction phase.
+	 */
+	int (*free_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				void *private_spt);
+
+	/* Add a guest private page into private page table */
+	int (*set_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				kvm_pfn_t pfn);
+
+	/* Remove a guest private page from private page table */
+	int (*remove_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				   kvm_pfn_t pfn);
+	/*
+	 * Keep a guest private page mapped in private page table, but clear its
+	 * present bit
+	 */
+	int (*zap_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level);
+
 	bool (*has_wbinvd_exit)(void);
 
 	u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 76f92cb37a96..2506d6277818 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3701,7 +3701,9 @@  static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 	int r;
 
 	if (tdp_mmu_enabled) {
-		kvm_tdp_mmu_alloc_root(vcpu);
+		if (kvm_gfn_shared_mask(vcpu->kvm))
+			kvm_tdp_mmu_alloc_root(vcpu, true);
+		kvm_tdp_mmu_alloc_root(vcpu, false);
 		return 0;
 	}
 
@@ -4685,7 +4687,7 @@  int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	if (kvm_mmu_honors_guest_mtrrs(vcpu->kvm)) {
 		for ( ; fault->max_level > PG_LEVEL_4K; --fault->max_level) {
 			int page_num = KVM_PAGES_PER_HPAGE(fault->max_level);
-			gfn_t base = gfn_round_for_level(fault->gfn,
+			gfn_t base = gfn_round_for_level(gpa_to_gfn(fault->addr),
 							 fault->max_level);
 
 			if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num))
@@ -6245,6 +6247,7 @@  static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
 
 	mmu->root.hpa = INVALID_PAGE;
 	mmu->root.pgd = 0;
+	mmu->private_root_hpa = INVALID_PAGE;
 	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
 		mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
 
@@ -7263,6 +7266,12 @@  int kvm_mmu_vendor_module_init(void)
 void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
 {
 	kvm_mmu_unload(vcpu);
+	if (tdp_mmu_enabled) {
+		read_lock(&vcpu->kvm->mmu_lock);
+		mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu->private_root_hpa,
+				   NULL);
+		read_unlock(&vcpu->kvm->mmu_lock);
+	}
 	free_mmu_pages(&vcpu->arch.root_mmu);
 	free_mmu_pages(&vcpu->arch.guest_mmu);
 	mmu_free_memory_caches(vcpu);
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 0f1a9d733d9e..3a7fe9261e23 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -6,6 +6,8 @@ 
 #include <linux/kvm_host.h>
 #include <asm/kvm_host.h>
 
+#include "mmu.h"
+
 #ifdef CONFIG_KVM_PROVE_MMU
 #define KVM_MMU_WARN_ON(x) WARN_ON_ONCE(x)
 #else
@@ -178,6 +180,16 @@  static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, struct kvm_m
 	sp->private_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_private_spt_cache);
 }
 
+static inline gfn_t kvm_gfn_for_root(struct kvm *kvm, struct kvm_mmu_page *root,
+				     gfn_t gfn)
+{
+	gfn_t gfn_for_root = kvm_gfn_to_private(kvm, gfn);
+
+	/* Set shared bit if not private */
+	gfn_for_root |= -(gfn_t)!is_private_sp(root) & kvm_gfn_shared_mask(kvm);
+	return gfn_for_root;
+}
+
 static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
 {
 	/*
@@ -348,7 +360,12 @@  static inline int __kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gp
 	int r;
 
 	if (vcpu->arch.mmu->root_role.direct) {
-		fault.gfn = fault.addr >> PAGE_SHIFT;
+		/*
+		 * Things like memslots don't understand the concept of a shared
+		 * bit. Strip it so that the GFN can be used like normal, and the
+		 * fault.addr can be used when the shared bit is needed.
+		 */
+		fault.gfn = gpa_to_gfn(fault.addr) & ~kvm_gfn_shared_mask(vcpu->kvm);
 		fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
 	}
 
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index fae559559a80..8a64bcef9deb 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -91,7 +91,7 @@  struct tdp_iter {
 	tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
 	/* A pointer to the current SPTE */
 	tdp_ptep_t sptep;
-	/* The lowest GFN mapped by the current SPTE */
+	/* The lowest GFN (shared bits included) mapped by the current SPTE */
 	gfn_t gfn;
 	/* The level of the root page given to the iterator */
 	int root_level;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 0d6d96d86703..810d552e9bf6 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -224,7 +224,7 @@  static void tdp_mmu_init_child_sp(struct kvm_mmu_page *child_sp,
 	tdp_mmu_init_sp(child_sp, iter->sptep, iter->gfn, role);
 }
 
-void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
+void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool private)
 {
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
 	union kvm_mmu_page_role role = mmu->root_role;
@@ -232,6 +232,9 @@  void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
 	struct kvm *kvm = vcpu->kvm;
 	struct kvm_mmu_page *root;
 
+	if (private)
+		kvm_mmu_page_role_set_private(&role);
+
 	/*
 	 * Check for an existing root before acquiring the pages lock to avoid
 	 * unnecessary serialization if multiple vCPUs are loading a new root.
@@ -283,13 +286,17 @@  void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu)
 	 * and actually consuming the root if it's invalidated after dropping
 	 * mmu_lock, and the root can't be freed as this vCPU holds a reference.
 	 */
-	mmu->root.hpa = __pa(root->spt);
-	mmu->root.pgd = 0;
+	if (private) {
+		mmu->private_root_hpa = __pa(root->spt);
+	} else {
+		mmu->root.hpa = __pa(root->spt);
+		mmu->root.pgd = 0;
+	}
 }
 
 static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
-				u64 old_spte, u64 new_spte, int level,
-				bool shared);
+				u64 old_spte, u64 new_spte,
+				union kvm_mmu_page_role role, bool shared);
 
 static void tdp_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
@@ -416,12 +423,124 @@  static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
 							  REMOVED_SPTE, level);
 		}
 		handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
-				    old_spte, REMOVED_SPTE, level, shared);
+				    old_spte, REMOVED_SPTE, sp->role,
+				    shared);
+	}
+
+	if (is_private_sp(sp) &&
+	    WARN_ON(static_call(kvm_x86_free_private_spt)(kvm, sp->gfn, sp->role.level,
+							  kvm_mmu_private_spt(sp)))) {
+		/*
+		 * Failed to free page table page in private page table and
+		 * there is nothing to do further.
+		 * Intentionally leak the page to prevent the kernel from
+		 * accessing the encrypted page.
+		 */
+		sp->private_spt = NULL;
 	}
 
 	call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
 }
 
+static void *get_private_spt(gfn_t gfn, u64 new_spte, int level)
+{
+	if (is_shadow_present_pte(new_spte) && !is_last_spte(new_spte, level)) {
+		struct kvm_mmu_page *sp = to_shadow_page(pfn_to_hpa(spte_to_pfn(new_spte)));
+		void *private_spt = kvm_mmu_private_spt(sp);
+
+		WARN_ON_ONCE(!private_spt);
+		WARN_ON_ONCE(sp->role.level + 1 != level);
+		WARN_ON_ONCE(sp->gfn != gfn);
+		return private_spt;
+	}
+
+	return NULL;
+}
+
+static void handle_removed_private_spte(struct kvm *kvm, gfn_t gfn,
+					u64 old_spte, u64 new_spte,
+					int level)
+{
+	bool was_present = is_shadow_present_pte(old_spte);
+	bool was_leaf = was_present && is_last_spte(old_spte, level);
+	kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
+	int ret;
+
+	/*
+	 * Allow only leaf pages to be zapped. Non-leaf page table pages are
+	 * reclaimed when the VM is destroyed.
+	 */
+	if (!was_leaf)
+		return;
+
+	/* Zapping leaf spte is allowed only when write lock is held. */
+	lockdep_assert_held_write(&kvm->mmu_lock);
+	ret = static_call(kvm_x86_zap_private_spte)(kvm, gfn, level);
+	/* Because write lock is held, operation should succeed. */
+	if (KVM_BUG_ON(ret, kvm))
+		return;
+
+	ret = static_call(kvm_x86_remove_private_spte)(kvm, gfn, level, old_pfn);
+	KVM_BUG_ON(ret, kvm);
+}
+
+static int __must_check __set_private_spte_present(struct kvm *kvm, tdp_ptep_t sptep,
+						   gfn_t gfn, u64 old_spte,
+						   u64 new_spte, int level)
+{
+	bool was_present = is_shadow_present_pte(old_spte);
+	bool is_present = is_shadow_present_pte(new_spte);
+	bool is_leaf = is_present && is_last_spte(new_spte, level);
+	kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
+	int ret = 0;
+
+	lockdep_assert_held(&kvm->mmu_lock);
+	/* TDP MMU doesn't change present -> present */
+	KVM_BUG_ON(was_present, kvm);
+
+	/*
+	 * Use different call to either set up middle level
+	 * private page table, or leaf.
+	 */
+	if (is_leaf) {
+		ret = static_call(kvm_x86_set_private_spte)(kvm, gfn, level, new_pfn);
+	} else {
+		void *private_spt = get_private_spt(gfn, new_spte, level);
+
+		KVM_BUG_ON(!private_spt, kvm);
+		ret = static_call(kvm_x86_link_private_spt)(kvm, gfn, level, private_spt);
+	}
+
+	return ret;
+}
+
+static int __must_check set_private_spte_present(struct kvm *kvm, tdp_ptep_t sptep,
+						 gfn_t gfn, u64 old_spte,
+						 u64 new_spte, int level)
+{
+	int ret;
+
+	/*
+	 * For private page table, callbacks are needed to propagate SPTE
+	 * change into the private page table. In order to atomically update
+	 * both the SPTE and the private page tables with callbacks, utilize
+	 * freezing SPTE.
+	 * - Freeze the SPTE. Set entry to REMOVED_SPTE.
+	 * - Trigger callbacks for private page tables.
+	 * - Unfreeze the SPTE.  Set the entry to new_spte.
+	 */
+	lockdep_assert_held(&kvm->mmu_lock);
+	if (!try_cmpxchg64(sptep, &old_spte, REMOVED_SPTE))
+		return -EBUSY;
+
+	ret = __set_private_spte_present(kvm, sptep, gfn, old_spte, new_spte, level);
+	if (ret)
+		__kvm_tdp_mmu_write_spte(sptep, old_spte);
+	else
+		__kvm_tdp_mmu_write_spte(sptep, new_spte);
+	return ret;
+}
+
 /**
  * handle_changed_spte - handle bookkeeping associated with an SPTE change
  * @kvm: kvm instance
@@ -429,7 +548,7 @@  static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
  * @gfn: the base GFN that was mapped by the SPTE
  * @old_spte: The value of the SPTE before the change
  * @new_spte: The value of the SPTE after the change
- * @level: the level of the PT the SPTE is part of in the paging structure
+ * @role: the role of the PT the SPTE is part of in the paging structure
  * @shared: This operation may not be running under the exclusive use of
  *	    the MMU lock and the operation must synchronize with other
  *	    threads that might be modifying SPTEs.
@@ -439,14 +558,18 @@  static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
  * and fast_pf_fix_direct_spte()).
  */
 static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
-				u64 old_spte, u64 new_spte, int level,
-				bool shared)
+				u64 old_spte, u64 new_spte,
+				union kvm_mmu_page_role role, bool shared)
 {
+	bool is_private = kvm_mmu_page_role_is_private(role);
+	int level = role.level;
 	bool was_present = is_shadow_present_pte(old_spte);
 	bool is_present = is_shadow_present_pte(new_spte);
 	bool was_leaf = was_present && is_last_spte(old_spte, level);
 	bool is_leaf = is_present && is_last_spte(new_spte, level);
-	bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
+	kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
+	kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
+	bool pfn_changed = old_pfn != new_pfn;
 
 	WARN_ON_ONCE(level > PT64_ROOT_MAX_LEVEL);
 	WARN_ON_ONCE(level < PG_LEVEL_4K);
@@ -513,7 +636,7 @@  static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 
 	if (was_leaf && is_dirty_spte(old_spte) &&
 	    (!is_present || !is_dirty_spte(new_spte) || pfn_changed))
-		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+		kvm_set_pfn_dirty(old_pfn);
 
 	/*
 	 * Recursively handle child PTs if the change removed a subtree from
@@ -522,15 +645,21 @@  static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	 * pages are kernel allocations and should never be migrated.
 	 */
 	if (was_present && !was_leaf &&
-	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
+	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
+		KVM_BUG_ON(is_private != is_private_sptep(spte_to_child_pt(old_spte, level)),
+			   kvm);
 		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
+	}
+
+	if (is_private && !is_present)
+		handle_removed_private_spte(kvm, gfn, old_spte, new_spte, role.level);
 
 	if (was_leaf && is_accessed_spte(old_spte) &&
 	    (!is_present || !is_accessed_spte(new_spte) || pfn_changed))
 		kvm_set_pfn_accessed(spte_to_pfn(old_spte));
 }
 
-static inline int __tdp_mmu_set_spte_atomic(struct tdp_iter *iter, u64 new_spte)
+static inline int __tdp_mmu_set_spte_atomic(struct kvm *kvm, struct tdp_iter *iter, u64 new_spte)
 {
 	u64 *sptep = rcu_dereference(iter->sptep);
 
@@ -542,15 +671,42 @@  static inline int __tdp_mmu_set_spte_atomic(struct tdp_iter *iter, u64 new_spte)
 	 */
 	WARN_ON_ONCE(iter->yielded || is_removed_spte(iter->old_spte));
 
-	/*
-	 * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
-	 * does not hold the mmu_lock.  On failure, i.e. if a different logical
-	 * CPU modified the SPTE, try_cmpxchg64() updates iter->old_spte with
-	 * the current value, so the caller operates on fresh data, e.g. if it
-	 * retries tdp_mmu_set_spte_atomic()
-	 */
-	if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
-		return -EBUSY;
+	if (is_private_sptep(iter->sptep) && !is_removed_spte(new_spte)) {
+		int ret;
+
+		if (is_shadow_present_pte(new_spte)) {
+			/*
+			 * Populating case.
+			 * - set_private_spte_present() implements
+			 *   1) Freeze SPTE
+			 *   2) call hooks to update private page table,
+			 *   3) update SPTE to new_spte
+			 * - handle_changed_spte() only updates stats.
+			 */
+			ret = set_private_spte_present(kvm, iter->sptep, iter->gfn,
+						       iter->old_spte, new_spte, iter->level);
+			if (ret)
+				return ret;
+		} else {
+			/*
+			 * Zapping case.
+			 * Zap is only allowed when write lock is held
+			 */
+			if (WARN_ON_ONCE(!is_shadow_present_pte(new_spte)))
+				return -EBUSY;
+		}
+	} else {
+		/*
+		 * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs
+		 * and does not hold the mmu_lock.  On failure, i.e. if a
+		 * different logical CPU modified the SPTE, try_cmpxchg64()
+		 * updates iter->old_spte with the current value, so the caller
+		 * operates on fresh data, e.g. if it retries
+		 * tdp_mmu_set_spte_atomic()
+		 */
+		if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
+			return -EBUSY;
+	}
 
 	return 0;
 }
@@ -576,23 +732,24 @@  static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,
 					  struct tdp_iter *iter,
 					  u64 new_spte)
 {
+	u64 *sptep = rcu_dereference(iter->sptep);
 	int ret;
 
 	lockdep_assert_held_read(&kvm->mmu_lock);
 
-	ret = __tdp_mmu_set_spte_atomic(iter, new_spte);
+	ret = __tdp_mmu_set_spte_atomic(kvm, iter, new_spte);
 	if (ret)
 		return ret;
 
 	handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
-			    new_spte, iter->level, true);
-
+			    new_spte, sptep_to_sp(sptep)->role, true);
 	return 0;
 }
 
 static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
 					  struct tdp_iter *iter)
 {
+	union kvm_mmu_page_role role;
 	int ret;
 
 	lockdep_assert_held_read(&kvm->mmu_lock);
@@ -605,7 +762,7 @@  static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
 	 * Delay processing of the zapped SPTE until after TLBs are flushed and
 	 * the REMOVED_SPTE is replaced (see below).
 	 */
-	ret = __tdp_mmu_set_spte_atomic(iter, REMOVED_SPTE);
+	ret = __tdp_mmu_set_spte_atomic(kvm, iter, REMOVED_SPTE);
 	if (ret)
 		return ret;
 
@@ -619,6 +776,8 @@  static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
 	 */
 	__kvm_tdp_mmu_write_spte(iter->sptep, SHADOW_NONPRESENT_VALUE);
 
+
+	role = sptep_to_sp(iter->sptep)->role;
 	/*
 	 * Process the zapped SPTE after flushing TLBs, and after replacing
 	 * REMOVED_SPTE with 0. This minimizes the amount of time vCPUs are
@@ -626,7 +785,7 @@  static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
 	 * SPTEs.
 	 */
 	handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
-			    0, iter->level, true);
+			    SHADOW_NONPRESENT_VALUE, role, true);
 
 	return 0;
 }
@@ -648,6 +807,8 @@  static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
 static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
 			    u64 old_spte, u64 new_spte, gfn_t gfn, int level)
 {
+	union kvm_mmu_page_role role;
+
 	lockdep_assert_held_write(&kvm->mmu_lock);
 
 	/*
@@ -660,8 +821,16 @@  static u64 tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
 	WARN_ON_ONCE(is_removed_spte(old_spte) || is_removed_spte(new_spte));
 
 	old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
+	if (is_private_sptep(sptep) && !is_removed_spte(new_spte) &&
+	    is_shadow_present_pte(new_spte)) {
+		/* Because write spin lock is held, no race.  It should succeed. */
+		KVM_BUG_ON(__set_private_spte_present(kvm, sptep, gfn, old_spte,
+						      new_spte, level), kvm);
+	}
 
-	handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
+	role = sptep_to_sp(sptep)->role;
+	role.level = level;
+	handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, role, false);
 	return old_spte;
 }
 
@@ -684,8 +853,11 @@  static inline void tdp_mmu_iter_set_spte(struct kvm *kvm, struct tdp_iter *iter,
 			continue;					\
 		else
 
-#define tdp_mmu_for_each_pte(_iter, _mmu, _start, _end)		\
-	for_each_tdp_pte(_iter, root_to_sp(_mmu->root.hpa), _start, _end)
+#define tdp_mmu_for_each_pte(_iter, _mmu, _private, _start, _end)	\
+	for_each_tdp_pte(_iter,						\
+		 root_to_sp((_private) ? _mmu->private_root_hpa :	\
+				_mmu->root.hpa),			\
+		_start, _end)
 
 /*
  * Yield if the MMU lock is contended or this thread needs to return control
@@ -853,6 +1025,14 @@  static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
 
+	/*
+	 * start and end don't have the GFN shared bit.  This function zaps
+	 * a region including alias.  Adjust the shared bit of [start, end) if
+	 * the root is shared.
+	 */
+	start = kvm_gfn_for_root(kvm, root, start);
+	end = kvm_gfn_for_root(kvm, root, end);
+
 	rcu_read_lock();
 
 	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
@@ -1029,8 +1209,8 @@  static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 		new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
 	else
 		wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
-					 fault->pfn, iter->old_spte, fault->prefetch, true,
-					 fault->map_writable, &new_spte);
+					fault->pfn, iter->old_spte, fault->prefetch, true,
+					fault->map_writable, &new_spte);
 
 	if (new_spte == iter->old_spte)
 		ret = RET_PF_SPURIOUS;
@@ -1108,6 +1288,8 @@  int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	struct kvm *kvm = vcpu->kvm;
 	struct tdp_iter iter;
 	struct kvm_mmu_page *sp;
+	gfn_t raw_gfn;
+	bool is_private = fault->is_private && kvm_gfn_shared_mask(kvm);
 	int ret = RET_PF_RETRY;
 
 	kvm_mmu_hugepage_adjust(vcpu, fault);
@@ -1116,7 +1298,9 @@  int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 
 	rcu_read_lock();
 
-	tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
+	raw_gfn = gpa_to_gfn(fault->addr);
+
+	tdp_mmu_for_each_pte(iter, mmu, is_private, raw_gfn, raw_gfn + 1) {
 		int r;
 
 		if (fault->nx_huge_page_workaround_enabled)
@@ -1142,14 +1326,22 @@  int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		 * needs to be split.
 		 */
 		sp = tdp_mmu_alloc_sp(vcpu);
+		if (kvm_is_private_gpa(kvm, raw_gfn << PAGE_SHIFT))
+			kvm_mmu_alloc_private_spt(vcpu, sp);
 		tdp_mmu_init_child_sp(sp, &iter);
 
 		sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
 
-		if (is_shadow_present_pte(iter.old_spte))
+		if (is_shadow_present_pte(iter.old_spte)) {
+			/*
+			 * TODO: large page support.
+			 * Doesn't support large page for TDX now
+			 */
+			KVM_BUG_ON(is_private_sptep(iter.sptep), vcpu->kvm);
 			r = tdp_mmu_split_huge_page(kvm, &iter, sp, true);
-		else
+		} else {
 			r = tdp_mmu_link_sp(kvm, &iter, sp, true);
+		}
 
 		/*
 		 * Force the guest to retry if installing an upper level SPTE
@@ -1780,7 +1972,7 @@  static int __kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
 	gfn_t gfn = addr >> PAGE_SHIFT;
 	int leaf = -1;
 
-	tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
+	tdp_mmu_for_each_pte(iter, mmu, is_private, gfn, gfn + 1) {
 		leaf = iter.level;
 		sptes[leaf] = iter.old_spte;
 	}
@@ -1838,7 +2030,10 @@  u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, u64 addr,
 	gfn_t gfn = addr >> PAGE_SHIFT;
 	tdp_ptep_t sptep = NULL;
 
-	tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
+	/* fast page fault for private GPA isn't supported. */
+	WARN_ON_ONCE(kvm_is_private_gpa(vcpu->kvm, addr));
+
+	tdp_mmu_for_each_pte(iter, mmu, false, gfn, gfn + 1) {
 		*spte = iter.old_spte;
 		sptep = iter.sptep;
 	}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 437ddd4937a9..ac350c51bc18 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -10,7 +10,7 @@ 
 void kvm_mmu_init_tdp_mmu(struct kvm *kvm);
 void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm);
 
-void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu);
+void kvm_tdp_mmu_alloc_root(struct kvm_vcpu *vcpu, bool private);
 
 __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
 {