
[v2,15/15] KVM: x86/tdp_mmu: Add a helper function to walk down the TDP MMU

Message ID 20240530210714.364118-16-rick.p.edgecombe@intel.com (mailing list archive)
State New, archived
Series TDX MMU prep series part 1

Commit Message

Rick Edgecombe May 30, 2024, 9:07 p.m. UTC
From: Isaku Yamahata <isaku.yamahata@intel.com>

Export a function to walk down the TDP MMU without modifying it.

Future changes will support pre-populating TDX private memory. In order to
implement this, KVM will need to check whether a given GFN is already
pre-populated in the mirrored EPT and verify that the populated private
memory PFN matches the current one.[1]

There is already a TDP MMU walker, kvm_tdp_mmu_get_walk(), for use within
the KVM MMU that almost does what is required. However, making sense of its
results requires MMU-internal PTE helpers. Refactor the code to provide a
helper that can be used outside of the KVM MMU code.

Refactoring the KVM page fault handler to support this lookup usage was
also considered, but it was an awkward fit.

Link: https://lore.kernel.org/kvm/ZfBkle1eZFfjPI8l@google.com/ [1]
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
This helper will be used in the future change that implements
KVM_TDX_INIT_MEM_REGION. Please refer to the following commit for the
usage:
https://github.com/intel/tdx/commit/9594fb9a0ab566e0ad556bb19770665319d800fd
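
For illustration only, the expected caller in that change would use the new
helper roughly as in the sketch below (based on the referenced commit; the
tdx_* function name and the exact error codes are placeholders, and the
caller is assumed to already hold kvm->mmu_lock for read):

/*
 * Hypothetical caller sketch for KVM_TDX_INIT_MEM_REGION: with mmu_lock
 * held for read, check that the GFN being added is mapped at 4K in the
 * mirrored EPT and that the mapped PFN matches the page obtained from
 * guest_memfd.
 */
static int tdx_check_premapped_pfn(struct kvm_vcpu *vcpu, gpa_t gpa, kvm_pfn_t pfn)
{
        kvm_pfn_t mmu_pfn;
        int ret;

        ret = kvm_tdp_mmu_get_walk_mirror_pfn(vcpu, gpa, &mmu_pfn);
        if (ret < 0)
                return ret;             /* GFN not mapped in the mirror root */
        if (ret > PG_LEVEL_4K)
                return -EINVAL;         /* only a 4K mapping is expected here */
        if (mmu_pfn != pfn)
                return -EAGAIN;         /* the mapping changed underneath us */

        return 0;
}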

TDX MMU Prep v2:
 - Rename function with "mirror" and use root enum

TDX MMU Prep:
 - New patch
---
 arch/x86/kvm/mmu.h         |  2 ++
 arch/x86/kvm/mmu/tdp_mmu.c | 39 +++++++++++++++++++++++++++++++++-----
 2 files changed, 36 insertions(+), 5 deletions(-)

Comments

Paolo Bonzini June 7, 2024, 9:31 a.m. UTC | #1
> -int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> -                        int *root_level)
> +static int __kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> +                                 enum kvm_tdp_mmu_root_types root_type)
>  {
> -       struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
> +       struct kvm_mmu_page *root = tdp_mmu_get_root(vcpu, root_type);

I think this function should take the struct kvm_mmu_page * directly.

> +{
> +       *root_level = vcpu->arch.mmu->root_role.level;
> +
> +       return __kvm_tdp_mmu_get_walk(vcpu, addr, sptes, KVM_DIRECT_ROOTS);

Here you pass root_to_sp(vcpu->arch.mmu->root.hpa);

> +int kvm_tdp_mmu_get_walk_mirror_pfn(struct kvm_vcpu *vcpu, u64 gpa,
> +                                    kvm_pfn_t *pfn)
> +{
> +       u64 sptes[PT64_ROOT_MAX_LEVEL + 1], spte;
> +       int leaf;
> +
> +       lockdep_assert_held(&vcpu->kvm->mmu_lock);
> +
> +       rcu_read_lock();
> +       leaf = __kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, KVM_MIRROR_ROOTS);

and likewise here.

You might also consider using a kvm_mmu_root_info for the mirror root,
even though the pgd field is not used.

Then __kvm_tdp_mmu_get_walk() can take a struct kvm_mmu_root_info * instead.

kvm_tdp_mmu_get_walk_mirror_pfn() doesn't belong in this series, but
introducing __kvm_tdp_mmu_get_walk() can stay here.
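
Put another way, a minimal sketch of that refactor (passing the root page
directly per the first suggestion, with the walk body from this patch left
unchanged) might look like:

static int __kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
                                  struct kvm_mmu_page *root)
{
        struct tdp_iter iter;
        gfn_t gfn = addr >> PAGE_SHIFT;
        int leaf = -1;

        tdp_mmu_for_each_pte(iter, vcpu->kvm, root, gfn, gfn + 1) {
                leaf = iter.level;
                sptes[leaf] = iter.old_spte;
        }

        return leaf;
}

int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
                         int *root_level)
{
        *root_level = vcpu->arch.mmu->root_role.level;

        return __kvm_tdp_mmu_get_walk(vcpu, addr, sptes,
                                      root_to_sp(vcpu->arch.mmu->root.hpa));
}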

Looking at the later patch, which uses
kvm_tdp_mmu_get_walk_mirror_pfn(), I think this function is a bit
overkill. I'll do a pre-review of the init_mem_region function,
especially the usage of kvm_gmem_populate:

+    slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
+    if (!kvm_slot_can_be_private(slot) || !kvm_mem_is_private(kvm, gfn)) {
+        ret = -EFAULT;
+        goto out_put_page;
+    }

The slots_lock is taken, so checking kvm_slot_can_be_private is unnecessary.

Checking kvm_mem_is_private perhaps should also be done in
kvm_gmem_populate() itself. I'll send a patch.

+    read_lock(&kvm->mmu_lock);
+
+    ret = kvm_tdp_mmu_get_walk_mirror_pfn(vcpu, gpa, &mmu_pfn);
+    if (ret < 0)
+        goto out;
+    if (ret > PG_LEVEL_4K) {
+        ret = -EINVAL;
+        goto out;
+    }
+    if (mmu_pfn != pfn) {
+        ret = -EAGAIN;
+        goto out;
+    }

If you require pre-faulting, you don't need to return mmu_pfn and
things would be seriously wrong if the two didn't match, wouldn't
they? You are executing with the filemap_invalidate_lock() taken, and
therefore cannot race with kvm_gmem_punch_hole(). (Soapbox mode on:
this is the advantage of putting the locking/looping logic in common
code, kvm_gmem_populate() in this case).

Therefore, I think kvm_tdp_mmu_get_walk_mirror_pfn() can be replaced just with

int kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa)
{
  struct kvm *kvm = vcpu->kvm
  bool is_direct = !kvm_has_mirrored_tdp(...) || (gpa & kvm->arch.direct_mask);
  hpa_t root = is_direct ? ... : ...;

  lockdep_assert_held(&vcpu->kvm->mmu_lock);
  rcu_read_lock();
  leaf = __kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, root_to_sp(root));
  rcu_read_unlock();
  if (leaf < 0)
    return false;

  spte = sptes[leaf];
  return is_shadow_present_pte(spte) && is_last_spte(spte, leaf);
}
EXPORT_SYMBOL_GPL(kvm_tdp_mmu_gpa_is_mapped);

+    while (region.nr_pages) {
+        if (signal_pending(current)) {
+            ret = -EINTR;
+            break;
+        }

Checking signal_pending() should be done in kvm_gmem_populate() -
again, I'll take care of that. The rest of the loop is not necessary -
just call kvm_gmem_populate() with the whole range and enjoy. You get
a nice API that is consistent with the intended KVM_PREFAULT_MEMORY
ioctl, because kvm_gmem_populate() returns the number of pages it has
processed and you can use that to massage and copy back the struct
kvm_tdx_init_mem_region.

+        arg = (struct tdx_gmem_post_populate_arg) {
+            .vcpu = vcpu,
+            .flags = cmd->flags,
+        };
+        gmem_ret = kvm_gmem_populate(kvm, gpa_to_gfn(region.gpa),
+                         u64_to_user_ptr(region.source_addr),
+                         1, tdx_gmem_post_populate, &arg);
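
For illustration, with the loop dropped the per-page call above could collapse
to roughly the following sketch (region.* and cmd->flags are taken from the
quoted TDX patch; cmd->data and the final return convention are assumptions):

        /*
         * Sketch: let kvm_gmem_populate() loop over the whole region and
         * report how many pages it populated, then reflect that back into
         * the user-visible struct kvm_tdx_init_mem_region.
         */
        arg = (struct tdx_gmem_post_populate_arg) {
                .vcpu = vcpu,
                .flags = cmd->flags,
        };
        gmem_ret = kvm_gmem_populate(kvm, gpa_to_gfn(region.gpa),
                                     u64_to_user_ptr(region.source_addr),
                                     region.nr_pages, tdx_gmem_post_populate,
                                     &arg);
        if (gmem_ret < 0)
                return gmem_ret;

        /* Advance by however many pages were actually populated. */
        region.gpa += gmem_ret << PAGE_SHIFT;
        region.source_addr += gmem_ret << PAGE_SHIFT;
        region.nr_pages -= gmem_ret;

        if (copy_to_user(u64_to_user_ptr(cmd->data), &region, sizeof(region)))
                return -EFAULT;

        return region.nr_pages ? -EAGAIN : 0;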

Paolo
Rick Edgecombe June 7, 2024, 11:39 p.m. UTC | #2
On Fri, 2024-06-07 at 11:31 +0200, Paolo Bonzini wrote:
> > -int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> > -                        int *root_level)
> > +static int __kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64
> > *sptes,
> > +                                 enum kvm_tdp_mmu_root_types root_type)
> >   {
> > -       struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
> > +       struct kvm_mmu_page *root = tdp_mmu_get_root(vcpu, root_type);
> 
> I think this function should take the struct kvm_mmu_page * directly.
> 
> > +{
> > +       *root_level = vcpu->arch.mmu->root_role.level;
> > +
> > +       return __kvm_tdp_mmu_get_walk(vcpu, addr, sptes, KVM_DIRECT_ROOTS);
> 
> Here you pass root_to_sp(vcpu->arch.mmu->root.hpa);

I see. It is another case of more indirection to try to send the decision making
through the helpers. We can try to open code things more.

> 
> > +int kvm_tdp_mmu_get_walk_mirror_pfn(struct kvm_vcpu *vcpu, u64 gpa,
> > +                                    kvm_pfn_t *pfn)
> > +{
> > +       u64 sptes[PT64_ROOT_MAX_LEVEL + 1], spte;
> > +       int leaf;
> > +
> > +       lockdep_assert_held(&vcpu->kvm->mmu_lock);
> > +
> > +       rcu_read_lock();
> > +       leaf = __kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, KVM_MIRROR_ROOTS);
> 
> and likewise here.
> 
> You might also consider using a kvm_mmu_root_info for the mirror root,
> even though the pgd field is not used.

This came up on the last version actually. The reason against it was that it
used that tiny bit of extra memory for the pgd. It does look more symmetrical
though.

> 
> Then __kvm_tdp_mmu_get_walk() can take a struct kvm_mmu_root_info * instead.

Ahh, I see. Yes, that's a good reason.

> 
> kvm_tdp_mmu_get_walk_mirror_pfn() doesn't belong in this series, but
> introducing __kvm_tdp_mmu_get_walk() can stay here.

Ok, we can split it.

> 
> Looking at the later patch, which uses
> kvm_tdp_mmu_get_walk_mirror_pfn(), I think this function is a bit
> overkill. I'll do a pre-review of the init_mem_region function,
> especially the usage of kvm_gmem_populate:
> 
> +    slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> +    if (!kvm_slot_can_be_private(slot) || !kvm_mem_is_private(kvm, gfn)) {
> +        ret = -EFAULT;
> +        goto out_put_page;
> +    }
> 
> The slots_lock is taken, so checking kvm_slot_can_be_private is unnecessary.
> 
> Checking kvm_mem_is_private perhaps should also be done in
> kvm_gmem_populate() itself. I'll send a patch.
> 
> +    read_lock(&kvm->mmu_lock);
> +
> +    ret = kvm_tdp_mmu_get_walk_mirror_pfn(vcpu, gpa, &mmu_pfn);
> +    if (ret < 0)
> +        goto out;
> +    if (ret > PG_LEVEL_4K) {
> +        ret = -EINVAL;
> +        goto out;
> +    }
> +    if (mmu_pfn != pfn) {
> +        ret = -EAGAIN;
> +        goto out;
> +    }
> 
> If you require pre-faulting, you don't need to return mmu_pfn and
> things would be seriously wrong if the two didn't match, wouldn't
> they?

Yea, I'm not sure why it would be a normal condition. Maybe Isaku can comment on
the thinking?

>  You are executing with the filemap_invalidate_lock() taken, and
> therefore cannot race with kvm_gmem_punch_hole(). (Soapbox mode on:
> this is the advantage of putting the locking/looping logic in common
> code, kvm_gmem_populate() in this case).
> 
> Therefore, I think kvm_tdp_mmu_get_walk_mirror_pfn() can be replaced just with
> 
> int kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa)
> {
>   struct kvm *kvm = vcpu->kvm
>   bool is_direct = !kvm_has_mirrored_tdp(...) || (gpa & kvm->arch.direct_mask);
>   hpa_t root = is_direct ? ... : ...;
> 
>   lockdep_assert_held(&vcpu->kvm->mmu_lock);
>   rcu_read_lock();
>   leaf = __kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, root_to_sp(root));
>   rcu_read_unlock();
>   if (leaf < 0)
>     return false;
> 
>   spte = sptes[leaf];
>   return is_shadow_present_pte(spte) && is_last_spte(spte, leaf);
> }
> EXPORT_SYMBOL_GPL(kvm_tdp_mmu_gpa_is_mapped);
> 
> +    while (region.nr_pages) {
> +        if (signal_pending(current)) {
> +            ret = -EINTR;
> +            break;
> +        }
> 
> Checking signal_pending() should be done in kvm_gmem_populate() -
> again, I'll take care of that. The rest of the loop is not necessary -
> just call kvm_gmem_populate() with the whole range and enjoy. You get
> a nice API that is consistent with the intended KVM_PREFAULT_MEMORY
> ioctl, because kvm_gmem_populate() returns the number of pages it has
> processed and you can use that to massage and copy back the struct
> kvm_tdx_init_mem_region.
> 
> +        arg = (struct tdx_gmem_post_populate_arg) {
> +            .vcpu = vcpu,
> +            .flags = cmd->flags,
> +        };
> +        gmem_ret = kvm_gmem_populate(kvm, gpa_to_gfn(region.gpa),
> +                         u64_to_user_ptr(region.source_addr),
> +                         1, tdx_gmem_post_populate, &arg);

Ok thanks for the early comments. We can also drop those pieces as they move
into gmem code.

We were discussing starting to do some early public work on the rest of the MMU
series (that includes this patch) and the user API around VM and vCPU creation.
As in, not have the patches fully ready, but to just work on it in public. This
would probably follow finishing this series up.

It's all tentative, but just to give some idea of where we're at with the rest
of the series.
Paolo Bonzini June 8, 2024, 9:17 a.m. UTC | #3
On Sat, Jun 8, 2024 at 1:39 AM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
> We were discussing starting to do some early public work on the rest of the MMU
> series (that includes this patch) and the user API around VM and vCPU creation.
> As in, not have the patches fully ready, but to just work on it in public. This
> would probably follow finishing this series up.
>
> It's all tentative, but just to give some idea of where we're at with the rest
> of the series.

Yes, I think we're at the point where we can start pipelining.

Paolo
Isaku Yamahata June 12, 2024, 6:56 p.m. UTC | #4
On Fri, Jun 07, 2024 at 11:39:14PM +0000,
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:

> On Fri, 2024-06-07 at 11:31 +0200, Paolo Bonzini wrote:
> > > -int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> > > -                        int *root_level)
> > > +static int __kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64
> > > *sptes,
> > > +                                 enum kvm_tdp_mmu_root_types root_type)
> > >   {
> > > -       struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
> > > +       struct kvm_mmu_page *root = tdp_mmu_get_root(vcpu, root_type);
> > 
> > I think this function should take the struct kvm_mmu_page * directly.
> > 
> > > +{
> > > +       *root_level = vcpu->arch.mmu->root_role.level;
> > > +
> > > +       return __kvm_tdp_mmu_get_walk(vcpu, addr, sptes, KVM_DIRECT_ROOTS);
> > 
> > Here you pass root_to_sp(vcpu->arch.mmu->root.hpa);
> 
> I see. It is another case of more indirection to try to send the decision making
> through the helpers. We can try to open code things more.
> 
> > 
> > > +int kvm_tdp_mmu_get_walk_mirror_pfn(struct kvm_vcpu *vcpu, u64 gpa,
> > > +                                    kvm_pfn_t *pfn)
> > > +{
> > > +       u64 sptes[PT64_ROOT_MAX_LEVEL + 1], spte;
> > > +       int leaf;
> > > +
> > > +       lockdep_assert_held(&vcpu->kvm->mmu_lock);
> > > +
> > > +       rcu_read_lock();
> > > +       leaf = __kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, KVM_MIRROR_ROOTS);
> > 
> > and likewise here.
> > 
> > You might also consider using a kvm_mmu_root_info for the mirror root,
> > even though the pgd field is not used.
> 
> This came up on the last version actually. The reason against it was that it
> used that tiny bit of extra memory for the pgd. It does look more symmetrical
> though.
> 
> > 
> > Then __kvm_tdp_mmu_get_walk() can take a struct kvm_mmu_root_info * instead.
> 
> Ahh, I see. Yes, that's a good reason.
> 
> > 
> > kvm_tdp_mmu_get_walk_mirror_pfn() doesn't belong in this series, but
> > introducing __kvm_tdp_mmu_get_walk() can stay here.
> 
> Ok, we can split it.
> 
> > 
> > Looking at the later patch, which uses
> > kvm_tdp_mmu_get_walk_mirror_pfn(), I think this function is a bit
> > overkill. I'll do a pre-review of the init_mem_region function,
> > especially the usage of kvm_gmem_populate:
> > 
> > +    slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> > +    if (!kvm_slot_can_be_private(slot) || !kvm_mem_is_private(kvm, gfn)) {
> > +        ret = -EFAULT;
> > +        goto out_put_page;
> > +    }
> > 
> > The slots_lock is taken, so checking kvm_slot_can_be_private is unnecessary.
> > 
> > Checking kvm_mem_is_private perhaps should also be done in
> > kvm_gmem_populate() itself. I'll send a patch.
> > 
> > +    read_lock(&kvm->mmu_lock);
> > +
> > +    ret = kvm_tdp_mmu_get_walk_mirror_pfn(vcpu, gpa, &mmu_pfn);
> > +    if (ret < 0)
> > +        goto out;
> > +    if (ret > PG_LEVEL_4K) {
> > +        ret = -EINVAL;
> > +        goto out;
> > +    }
> > +    if (mmu_pfn != pfn) {
> > +        ret = -EAGAIN;
> > +        goto out;
> > +    }
> > 
> > If you require pre-faulting, you don't need to return mmu_pfn and
> > things would be seriously wrong if the two didn't match, wouldn't
> > they?
> 
> Yea, I'm not sure why it would be a normal condition. Maybe Isaku can comment on
> the thinking?

Sean suggested that KVM_TDX_INIT_MEM_REGION check whether the two PFNs, from the
TDP MMU and from guest_memfd, match.  As pointed out, the two PFNs should match
while the appropriate lock is held (or something is heavily broken).  Personally
I'm fine with removing such a check and not returning the pfn.

https://lore.kernel.org/kvm/Ze-TJh0BBOWm9spT@google.com/

  Then KVM provides a dedicated TDX ioctl(), i.e. what is/was KVM_TDX_INIT_MEM_REGION,
  to do PAGE.ADD.  KVM_TDX_INIT_MEM_REGION wouldn't need to map anything, it would
  simply need to verify that the pfn from guest_memfd() is the same as what's in
  the TDP MMU.
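
With the PFN check dropped as discussed, the caller side would reduce to
roughly this sketch (using the kvm_tdp_mmu_gpa_is_mapped() helper Paolo
proposed above; the -EIO error code is a placeholder):

        read_lock(&kvm->mmu_lock);
        if (!kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa)) {
                /* The GFN must already be pre-faulted into the mirror root. */
                ret = -EIO;
                goto out;
        }

        /* ... do PAGE.ADD for this gpa with the pfn from guest_memfd ... */
out:
        read_unlock(&kvm->mmu_lock);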

Patch

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 006f06463a27..625a57515d48 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -275,6 +275,8 @@  extern bool tdp_mmu_enabled;
 #define tdp_mmu_enabled false
 #endif
 
+int kvm_tdp_mmu_get_walk_mirror_pfn(struct kvm_vcpu *vcpu, u64 gpa, kvm_pfn_t *pfn);
+
 static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
 {
 	return !tdp_mmu_enabled || kvm_shadow_root_allocated(kvm);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 0caa1029b6bd..897d8aa1f544 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1946,16 +1946,14 @@  bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
  *
  * Must be called between kvm_tdp_mmu_walk_lockless_{begin,end}.
  */
-int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
-			 int *root_level)
+static int __kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
+				  enum kvm_tdp_mmu_root_types root_type)
 {
-	struct kvm_mmu_page *root = root_to_sp(vcpu->arch.mmu->root.hpa);
+	struct kvm_mmu_page *root = tdp_mmu_get_root(vcpu, root_type);
 	struct tdp_iter iter;
 	gfn_t gfn = addr >> PAGE_SHIFT;
 	int leaf = -1;
 
-	*root_level = vcpu->arch.mmu->root_role.level;
-
 	tdp_mmu_for_each_pte(iter, vcpu->kvm, root, gfn, gfn + 1) {
 		leaf = iter.level;
 		sptes[leaf] = iter.old_spte;
@@ -1964,6 +1962,37 @@  int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
 	return leaf;
 }
 
+int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
+			 int *root_level)
+{
+	*root_level = vcpu->arch.mmu->root_role.level;
+
+	return __kvm_tdp_mmu_get_walk(vcpu, addr, sptes, KVM_DIRECT_ROOTS);
+}
+
+int kvm_tdp_mmu_get_walk_mirror_pfn(struct kvm_vcpu *vcpu, u64 gpa,
+				     kvm_pfn_t *pfn)
+{
+	u64 sptes[PT64_ROOT_MAX_LEVEL + 1], spte;
+	int leaf;
+
+	lockdep_assert_held(&vcpu->kvm->mmu_lock);
+
+	rcu_read_lock();
+	leaf = __kvm_tdp_mmu_get_walk(vcpu, gpa, sptes, KVM_MIRROR_ROOTS);
+	rcu_read_unlock();
+	if (leaf < 0)
+		return -ENOENT;
+
+	spte = sptes[leaf];
+	if (!(is_shadow_present_pte(spte) && is_last_spte(spte, leaf)))
+		return -ENOENT;
+
+	*pfn = spte_to_pfn(spte);
+	return leaf;
+}
+EXPORT_SYMBOL_GPL(kvm_tdp_mmu_get_walk_mirror_pfn);
+
 /*
  * Returns the last level spte pointer of the shadow page walk for the given
  * gpa, and sets *spte to the spte value. This spte may be non-preset. If no