[v3,17/17] KVM: x86/tdp_mmu: Take root types for kvm_tdp_mmu_invalidate_all_roots()

Message ID 20240619223614.290657-18-rick.p.edgecombe@intel.com (mailing list archive)
State New, archived
Series TDX MMU prep series part 1

Commit Message

Rick Edgecombe June 19, 2024, 10:36 p.m. UTC
From: Isaku Yamahata <isaku.yamahata@intel.com>

Rename kvm_tdp_mmu_invalidate_all_roots() to
kvm_tdp_mmu_invalidate_roots(), and have it take an
enum kvm_tdp_mmu_root_types argument.

kvm_tdp_mmu_invalidate_roots() is called with different root types. For
kvm_mmu_zap_all_fast() it only operates on shared roots. But when tearing
down a VM it needs to invalidate all roots. Have the callers only
invalidate the required roots instead of all roots.

Within kvm_tdp_mmu_invalidate_roots(), respect the root types passed in by
checking each root's type while iterating over the roots.
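
For reference, the root-type enum this series introduces in an earlier patch
looks roughly like the sketch below (paraphrased for readers of this thread;
the exact definition and values in the posted patches may differ):

enum kvm_tdp_mmu_root_types {
	/* Paraphrased sketch; not the literal patch text. */
	KVM_INVALID_ROOTS = BIT(0),
	KVM_DIRECT_ROOTS  = BIT(1),
	KVM_MIRROR_ROOTS  = BIT(2),
	KVM_VALID_ROOTS   = KVM_DIRECT_ROOTS | KVM_MIRROR_ROOTS,
	KVM_ALL_ROOTS     = KVM_VALID_ROOTS | KVM_INVALID_ROOTS,
};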

Suggested-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU Prep v3:
 - Use root enum instead of process enum (Paolo)
 - Squash with "KVM: x86/tdp_mmu: Invalidate correct roots" (Paolo)
 - Update comment in kvm_mmu_zap_all_fast() (Paolo)
 - Add warning for attempting to invalidate invalid roots (Paolo)

TDX MMU Prep v2:
 - Use process enum instead of root

TDX MMU Prep:
 - New patch
---
 arch/x86/kvm/mmu/mmu.c     |  9 +++++++--
 arch/x86/kvm/mmu/tdp_mmu.c | 15 +++++++++++++--
 arch/x86/kvm/mmu/tdp_mmu.h |  3 ++-
 3 files changed, 22 insertions(+), 5 deletions(-)

Comments

Yan Zhao June 21, 2024, 7:10 a.m. UTC | #1
On Wed, Jun 19, 2024 at 03:36:14PM -0700, Rick Edgecombe wrote:
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 630e6b6d4bf2..a1ab67a4f41f 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -37,7 +37,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
>  	 * for zapping and thus puts the TDP MMU's reference to each root, i.e.
>  	 * ultimately frees all roots.
>  	 */
> -	kvm_tdp_mmu_invalidate_all_roots(kvm);
> +	kvm_tdp_mmu_invalidate_roots(kvm, KVM_VALID_ROOTS);
all roots (mirror + direct) are invalidated here.

>  	kvm_tdp_mmu_zap_invalidated_roots(kvm);
kvm_tdp_mmu_zap_invalidated_roots() will zap the invalidated mirror roots with
mmu_lock held for read, which should trigger the KVM_BUG_ON() in
__tdp_mmu_set_spte_atomic(), which assumes "atomic zapping doesn't operate on
mirror roots".

But up to now, the KVM_BUG_ON() is not triggered because
kvm_mmu_notifier_release() is called earlier than kvm_destroy_vm() (as in the
call trace below), and kvm_arch_flush_shadow_all() in kvm_mmu_notifier_release()
has zapped all mirror SPTEs before kvm_mmu_uninit_vm() is called in
kvm_destroy_vm().


kvm_mmu_notifier_release
  kvm_flush_shadow_all
    kvm_arch_flush_shadow_all
      static_call_cond(kvm_x86_flush_shadow_all_private)(kvm);
      kvm_mmu_zap_all  ==>hold mmu_lock for write
        kvm_tdp_mmu_zap_all ==>zap KVM_ALL_ROOTS with mmu_lock held for write

kvm_destroy_vm
  kvm_arch_destroy_vm
    kvm_mmu_uninit_vm
      kvm_mmu_uninit_tdp_mmu
        kvm_tdp_mmu_invalidate_roots ==>invalidate all KVM_VALID_ROOTS
        kvm_tdp_mmu_zap_invalidated_roots ==> zap all roots with mmu_lock held for read


A question: since kvm_mmu_notifier_release() is a callback of the primary MMU
notifier, why does it zap the mirrored TDP when all other callbacks use
KVM_FILTER_SHARED?

Could we just zap all KVM_DIRECT_ROOTS (valid | invalid) in
kvm_mmu_notifier_release() and move the mirrored TDP related work from
kvm_arch_flush_shadow_all() to kvm_mmu_uninit_tdp_mmu(), ensuring mmu_lock is
held for write?

>  
>  	WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
Rick Edgecombe June 21, 2024, 7:08 p.m. UTC | #2
On Fri, 2024-06-21 at 15:10 +0800, Yan Zhao wrote:
> On Wed, Jun 19, 2024 at 03:36:14PM -0700, Rick Edgecombe wrote:
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 630e6b6d4bf2..a1ab67a4f41f 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -37,7 +37,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
> >          * for zapping and thus puts the TDP MMU's reference to each root,
> > i.e.
> >          * ultimately frees all roots.
> >          */
> > -       kvm_tdp_mmu_invalidate_all_roots(kvm);
> > +       kvm_tdp_mmu_invalidate_roots(kvm, KVM_VALID_ROOTS);
> all roots (mirror + direct) are invalidated here.

Right.
> 
> >         kvm_tdp_mmu_zap_invalidated_roots(kvm);
> kvm_tdp_mmu_zap_invalidated_roots() will zap invalidated mirror root with
> mmu_lock held for read, which should trigger KVM_BUG_ON() in
> __tdp_mmu_set_spte_atomic(), which assumes "atomic zapping don't operate on
> mirror roots".
> 
> But up to now, the KVM_BUG_ON() is not triggered because
> kvm_mmu_notifier_release() is called earlier than kvm_destroy_vm() (as in
> below
> call trace), and kvm_arch_flush_shadow_all() in kvm_mmu_notifier_release() has
> zapped all mirror SPTEs before kvm_mmu_uninit_vm() called in kvm_destroy_vm().
> 
> 
> kvm_mmu_notifier_release
>   kvm_flush_shadow_all
>     kvm_arch_flush_shadow_all
>       static_call_cond(kvm_x86_flush_shadow_all_private)(kvm);
>       kvm_mmu_zap_all  ==>hold mmu_lock for write
>         kvm_tdp_mmu_zap_all ==>zap KVM_ALL_ROOTS with mmu_lock held for write
> 
> kvm_destroy_vm
>   kvm_arch_destroy_vm
>     kvm_mmu_uninit_vm
>       kvm_mmu_uninit_tdp_mmu
>         kvm_tdp_mmu_invalidate_roots ==>invalid all KVM_VALID_ROOTS
>         kvm_tdp_mmu_zap_invalidated_roots ==> zap all roots with mmu_lock held
> for read
> 
> 
> A question is that kvm_mmu_notifier_release(), as a callback of primary MMU
> notifier, why does it zap mirrored tdp when all other callbacks are with
> KVM_FILTER_SHARED?
> 
> Could we just zap all KVM_DIRECT_ROOTS (valid | invalid) in
> kvm_mmu_notifier_release() and move mirrord tdp related stuffs from 
> kvm_arch_flush_shadow_all() to kvm_mmu_uninit_tdp_mmu(), ensuring mmu_lock is
> held for write?

Sigh, thanks for flagging this. I agree it seems weird to free private memory
from an MMU notifier callback. I also found this old thread where Sean NAKed the
current approach (free hkid in mmu release):
https://lore.kernel.org/kvm/ZN+1QHGa6ltpQxZn@google.com/#t

One challenge is that flush_shadow_all_private() needs to be done before
kvm_destroy_vcpus(), where it gets into tdx_vcpu_free(). So kvm_mmu_uninit_vm()
is too late. Perhaps this is why it was shoved into mmu notifier release (which
happens long before as you noted). Isaku, do you recall any other reasons?

But static_call_cond(kvm_x86_vm_destroy) happens before kvm_destroy_vcpus, so we
could maybe actually just do the tdx_mmu_release_hkid() part there. Then drop
the flush_shadow_all_private x86 op. See the (not thoroughly checked) diff at
the bottom of this mail.

But most of what is being discussed is in future patches where it starts to get
into the TDX module interaction. So I wonder if we should drop this patch 17
from "part 1" and include it with the next series so it can all be considered
together.

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 2adf36b74910..3927731aa947 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -23,7 +23,6 @@ KVM_X86_OP(has_emulated_msr)
 KVM_X86_OP(vcpu_after_set_cpuid)
 KVM_X86_OP_OPTIONAL(vm_enable_cap)
 KVM_X86_OP(vm_init)
-KVM_X86_OP_OPTIONAL(flush_shadow_all_private)
 KVM_X86_OP_OPTIONAL(vm_destroy)
 KVM_X86_OP_OPTIONAL(vm_free)
 KVM_X86_OP_OPTIONAL_RET0(vcpu_precreate)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8a72e5873808..8b2b79b39d0f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1647,7 +1647,6 @@ struct kvm_x86_ops {
        unsigned int vm_size;
        int (*vm_enable_cap)(struct kvm *kvm, struct kvm_enable_cap *cap);
        int (*vm_init)(struct kvm *kvm);
-       void (*flush_shadow_all_private)(struct kvm *kvm);
        void (*vm_destroy)(struct kvm *kvm);
        void (*vm_free)(struct kvm *kvm);
 
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e1299eb03e63..4deeeac14324 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6446,7 +6446,7 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
         * lead to use-after-free.
         */
        if (tdp_mmu_enabled)
-               kvm_tdp_mmu_zap_invalidated_roots(kvm);
+               kvm_tdp_mmu_zap_invalidated_roots(kvm, true);
 }
 
 static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
@@ -6977,13 +6977,6 @@ static void kvm_mmu_zap_all(struct kvm *kvm)
 
 void kvm_arch_flush_shadow_all(struct kvm *kvm)
 {
-       /*
-        * kvm_mmu_zap_all() zaps both private and shared page tables.  Before
-        * tearing down private page tables, TDX requires some TD resources to
-        * be destroyed (i.e. keyID must have been reclaimed, etc).  Invoke
-        * kvm_x86_flush_shadow_all_private() for this.
-        */
-       static_call_cond(kvm_x86_flush_shadow_all_private)(kvm);
        kvm_mmu_zap_all(kvm);
 }
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 68dfcdb46ab7..9e8b012aa8cc 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -38,7 +38,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
         * ultimately frees all roots.
         */
        kvm_tdp_mmu_invalidate_roots(kvm, KVM_VALID_ROOTS);
-       kvm_tdp_mmu_zap_invalidated_roots(kvm);
+       kvm_tdp_mmu_zap_invalidated_roots(kvm, false);
 
        WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
        WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
@@ -1057,7 +1057,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
         * KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
         */
        lockdep_assert_held_write(&kvm->mmu_lock);
-       for_each_tdp_mmu_root_yield_safe(kvm, root)
+       __for_each_tdp_mmu_root_yield_safe(kvm, root, -1, KVM_DIRECT_ROOTS)
                tdp_mmu_zap_root(kvm, root, false);
 }
 
@@ -1065,11 +1065,14 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
  * Zap all invalidated roots to ensure all SPTEs are dropped before the "fast
  * zap" completes.
  */
-void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
+void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm, bool shared)
 {
        struct kvm_mmu_page *root;
 
-       read_lock(&kvm->mmu_lock);
+       if (shared)
+               read_lock(&kvm->mmu_lock);
+       else
+               write_lock(&kvm->mmu_lock);
 
        for_each_tdp_mmu_root_yield_safe(kvm, root) {
                if (!root->tdp_mmu_scheduled_root_to_zap)
@@ -1087,7 +1090,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
                 * that may be zapped, as such entries are associated with the
                 * ASID on both VMX and SVM.
                 */
-               tdp_mmu_zap_root(kvm, root, true);
+               tdp_mmu_zap_root(kvm, root, shared);
 
                /*
                 * The referenced needs to be put *after* zapping the root, as
@@ -1097,7 +1100,10 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
                kvm_tdp_mmu_put_root(kvm, root);
        }
 
-       read_unlock(&kvm->mmu_lock);
+       if (shared)
+               read_unlock(&kvm->mmu_lock);
+       else
+               write_unlock(&kvm->mmu_lock);
 }
 
 /*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 56741d31048a..7927fa4a96e0 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -68,7 +68,7 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
 void kvm_tdp_mmu_zap_all(struct kvm *kvm);
 void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
                                  enum kvm_tdp_mmu_root_types root_types);
-void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);
+void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm, bool shared);
 
 int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
 
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index b6828e35eb17..3f9bfcd3e152 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -98,16 +98,12 @@ static int vt_vm_init(struct kvm *kvm)
        return vmx_vm_init(kvm);
 }
 
-static void vt_flush_shadow_all_private(struct kvm *kvm)
-{
-       if (is_td(kvm))
-               tdx_mmu_release_hkid(kvm);
-}
-
 static void vt_vm_destroy(struct kvm *kvm)
 {
-       if (is_td(kvm))
+       if (is_td(kvm)) {
+               tdx_mmu_release_hkid(kvm);
                return;
+       }
 
        vmx_vm_destroy(kvm);
 }
@@ -980,7 +976,6 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
        .vm_size = sizeof(struct kvm_vmx),
        .vm_enable_cap = vt_vm_enable_cap,
        .vm_init = vt_vm_init,
-       .flush_shadow_all_private = vt_flush_shadow_all_private,
        .vm_destroy = vt_vm_destroy,
        .vm_free = vt_vm_free,
Yan Zhao June 24, 2024, 8:29 a.m. UTC | #3
On Sat, Jun 22, 2024 at 03:08:22AM +0800, Edgecombe, Rick P wrote:
> On Fri, 2024-06-21 at 15:10 +0800, Yan Zhao wrote:
> > On Wed, Jun 19, 2024 at 03:36:14PM -0700, Rick Edgecombe wrote:
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index 630e6b6d4bf2..a1ab67a4f41f 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -37,7 +37,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
> > >          * for zapping and thus puts the TDP MMU's reference to each root,
> > > i.e.
> > >          * ultimately frees all roots.
> > >          */
> > > -       kvm_tdp_mmu_invalidate_all_roots(kvm);
> > > +       kvm_tdp_mmu_invalidate_roots(kvm, KVM_VALID_ROOTS);
> > all roots (mirror + direct) are invalidated here.
> 
> Right.
> > 
> > >         kvm_tdp_mmu_zap_invalidated_roots(kvm);
> > kvm_tdp_mmu_zap_invalidated_roots() will zap invalidated mirror root with
> > mmu_lock held for read, which should trigger KVM_BUG_ON() in
> > __tdp_mmu_set_spte_atomic(), which assumes "atomic zapping don't operate on
> > mirror roots".
> > 
> > But up to now, the KVM_BUG_ON() is not triggered because
> > kvm_mmu_notifier_release() is called earlier than kvm_destroy_vm() (as in
> > below
> > call trace), and kvm_arch_flush_shadow_all() in kvm_mmu_notifier_release() has
> > zapped all mirror SPTEs before kvm_mmu_uninit_vm() called in kvm_destroy_vm().
> > 
> > 
> > kvm_mmu_notifier_release
> >   kvm_flush_shadow_all
> >     kvm_arch_flush_shadow_all
> >       static_call_cond(kvm_x86_flush_shadow_all_private)(kvm);
> >       kvm_mmu_zap_all  ==>hold mmu_lock for write
> >         kvm_tdp_mmu_zap_all ==>zap KVM_ALL_ROOTS with mmu_lock held for write
> > 
> > kvm_destroy_vm
> >   kvm_arch_destroy_vm
> >     kvm_mmu_uninit_vm
> >       kvm_mmu_uninit_tdp_mmu
> >         kvm_tdp_mmu_invalidate_roots ==>invalid all KVM_VALID_ROOTS
> >         kvm_tdp_mmu_zap_invalidated_roots ==> zap all roots with mmu_lock held
> > for read
> > 
> > 
> > A question is that kvm_mmu_notifier_release(), as a callback of primary MMU
> > notifier, why does it zap mirrored tdp when all other callbacks are with
> > KVM_FILTER_SHARED?
> > 
> > Could we just zap all KVM_DIRECT_ROOTS (valid | invalid) in
> > kvm_mmu_notifier_release() and move mirrord tdp related stuffs from 
> > kvm_arch_flush_shadow_all() to kvm_mmu_uninit_tdp_mmu(), ensuring mmu_lock is
> > held for write?
> 
> Sigh, thanks for flagging this. I agree it seems weird to free private memory
> from an MMU notifier callback. I also found this old thread where Sean NAKed the
> current approach (free hkid in mmu release):
> https://lore.kernel.org/kvm/ZN+1QHGa6ltpQxZn@google.com/#t
> 
> One challenge is that flush_shadow_all_private() needs to be done before
> kvm_destroy_vcpus(), where it gets into tdx_vcpu_free(). So kvm_mmu_uninit_vm()
> is too late. Perhaps this is why it was shoved into mmu notifier release (which
> happens long before as you noted). Isaku, do you recall any other reasons?
> 
> But static_call_cond(kvm_x86_vm_destroy) happens before kvm_destroy_vcpus, so we
> could maybe actually just do the tdx_mmu_release_hkid() part there. Then drop
> the flush_shadow_all_private x86 op. See the (not thoroughly checked) diff at
> the bottom of this mail.
It looks good to me.

> 
> But most of what is being discussed is in future patches where it starts to get
> into the TDX module interaction. So I wonder if we should drop this patch 17
> from "part 1" and include it with the next series so it can all be considered
> together.
> 
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-
> ops.h
> index 2adf36b74910..3927731aa947 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -23,7 +23,6 @@ KVM_X86_OP(has_emulated_msr)
>  KVM_X86_OP(vcpu_after_set_cpuid)
>  KVM_X86_OP_OPTIONAL(vm_enable_cap)
>  KVM_X86_OP(vm_init)
> -KVM_X86_OP_OPTIONAL(flush_shadow_all_private)
>  KVM_X86_OP_OPTIONAL(vm_destroy)
>  KVM_X86_OP_OPTIONAL(vm_free)
>  KVM_X86_OP_OPTIONAL_RET0(vcpu_precreate)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 8a72e5873808..8b2b79b39d0f 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1647,7 +1647,6 @@ struct kvm_x86_ops {
>         unsigned int vm_size;
>         int (*vm_enable_cap)(struct kvm *kvm, struct kvm_enable_cap *cap);
>         int (*vm_init)(struct kvm *kvm);
> -       void (*flush_shadow_all_private)(struct kvm *kvm);
>         void (*vm_destroy)(struct kvm *kvm);
>         void (*vm_free)(struct kvm *kvm);
>  
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e1299eb03e63..4deeeac14324 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6446,7 +6446,7 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
>          * lead to use-after-free.
>          */
>         if (tdp_mmu_enabled)
> -               kvm_tdp_mmu_zap_invalidated_roots(kvm);
> +               kvm_tdp_mmu_zap_invalidated_roots(kvm, true);
>  }
>  
>  static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
> @@ -6977,13 +6977,6 @@ static void kvm_mmu_zap_all(struct kvm *kvm)
>  
>  void kvm_arch_flush_shadow_all(struct kvm *kvm)
>  {
> -       /*
> -        * kvm_mmu_zap_all() zaps both private and shared page tables.  Before
> -        * tearing down private page tables, TDX requires some TD resources to
> -        * be destroyed (i.e. keyID must have been reclaimed, etc).  Invoke
> -        * kvm_x86_flush_shadow_all_private() for this.
> -        */
> -       static_call_cond(kvm_x86_flush_shadow_all_private)(kvm);
>         kvm_mmu_zap_all(kvm);
>  }
>  
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 68dfcdb46ab7..9e8b012aa8cc 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -38,7 +38,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
>          * ultimately frees all roots.
>          */
>         kvm_tdp_mmu_invalidate_roots(kvm, KVM_VALID_ROOTS);
> -       kvm_tdp_mmu_zap_invalidated_roots(kvm);
> +       kvm_tdp_mmu_zap_invalidated_roots(kvm, false);
>  
>         WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
>         WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
> @@ -1057,7 +1057,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
>          * KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
>          */
>         lockdep_assert_held_write(&kvm->mmu_lock);
> -       for_each_tdp_mmu_root_yield_safe(kvm, root)
> +       __for_each_tdp_mmu_root_yield_safe(kvm, root, -1, KVM_DIRECT_ROOTS)
nit: update the comment of kvm_tdp_mmu_zap_all() and explain why it's
KVM_DIRECT_ROOTS, not KVM_DIRECT_ROOTS | KVM_INVALID_ROOTS.

>                 tdp_mmu_zap_root(kvm, root, false);
>  }
>  
> @@ -1065,11 +1065,14 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
>   * Zap all invalidated roots to ensure all SPTEs are dropped before the "fast
>   * zap" completes.
>   */
> -void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
> +void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm, bool shared)
>  {
>         struct kvm_mmu_page *root;
>  
> -       read_lock(&kvm->mmu_lock);
> +       if (shared)
> +               read_lock(&kvm->mmu_lock);
> +       else
> +               write_lock(&kvm->mmu_lock);
>  
>         for_each_tdp_mmu_root_yield_safe(kvm, root) {
>                 if (!root->tdp_mmu_scheduled_root_to_zap)
> @@ -1087,7 +1090,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
>                  * that may be zapped, as such entries are associated with the
>                  * ASID on both VMX and SVM.
>                  */
> -               tdp_mmu_zap_root(kvm, root, true);
> +               tdp_mmu_zap_root(kvm, root, shared);
>  
>                 /*
>                  * The referenced needs to be put *after* zapping the root, as
> @@ -1097,7 +1100,10 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
>                 kvm_tdp_mmu_put_root(kvm, root);
>         }
>  
> -       read_unlock(&kvm->mmu_lock);
> +       if (shared)
> +               read_unlock(&kvm->mmu_lock);
> +       else
> +               write_unlock(&kvm->mmu_lock);
>  }
>  
>  /*
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 56741d31048a..7927fa4a96e0 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -68,7 +68,7 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page
> *sp);
>  void kvm_tdp_mmu_zap_all(struct kvm *kvm);
>  void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
>                                   enum kvm_tdp_mmu_root_types root_types);
> -void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);
> +void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm, bool shared);
>  
>  int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>  
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index b6828e35eb17..3f9bfcd3e152 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -98,16 +98,12 @@ static int vt_vm_init(struct kvm *kvm)
>         return vmx_vm_init(kvm);
>  }
>  
> -static void vt_flush_shadow_all_private(struct kvm *kvm)
> -{
> -       if (is_td(kvm))
> -               tdx_mmu_release_hkid(kvm);
> -}
> -
>  static void vt_vm_destroy(struct kvm *kvm)
>  {
> -       if (is_td(kvm))
> +       if (is_td(kvm)) {
> +               tdx_mmu_release_hkid(kvm);
>                 return;
> +       }
>  
>         vmx_vm_destroy(kvm);
>  }
> @@ -980,7 +976,6 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>         .vm_size = sizeof(struct kvm_vmx),
>         .vm_enable_cap = vt_vm_enable_cap,
>         .vm_init = vt_vm_init,
> -       .flush_shadow_all_private = vt_flush_shadow_all_private,
>         .vm_destroy = vt_vm_destroy,
>         .vm_free = vt_vm_free,
>  
>
Rick Edgecombe June 24, 2024, 11:15 p.m. UTC | #4
On Mon, 2024-06-24 at 16:29 +0800, Yan Zhao wrote:
> > @@ -1057,7 +1057,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
> >           * KVM_RUN is unreachable, i.e. no vCPUs will ever service the
> > request.
> >           */
> >          lockdep_assert_held_write(&kvm->mmu_lock);
> > -       for_each_tdp_mmu_root_yield_safe(kvm, root)
> > +       __for_each_tdp_mmu_root_yield_safe(kvm, root, -1, KVM_DIRECT_ROOTS)
> nit: update the comment of kvm_tdp_mmu_zap_all() 

Yea, good idea. It definitely needs some explanation, considering the function
is called "zap_all". A bit unfortunate actually.

> and explain why it's
> KVM_DIRECT_ROOTS, not KVM_DIRECT_ROOTS | KVM_INVALID_ROOTS.

Explain why not to zap invalid mirrored roots?
Yan Zhao June 25, 2024, 6:14 a.m. UTC | #5
On Tue, Jun 25, 2024 at 07:15:09AM +0800, Edgecombe, Rick P wrote:
> On Mon, 2024-06-24 at 16:29 +0800, Yan Zhao wrote:
> > > @@ -1057,7 +1057,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
> > >           * KVM_RUN is unreachable, i.e. no vCPUs will ever service the
> > > request.
> > >           */
> > >          lockdep_assert_held_write(&kvm->mmu_lock);
> > > -       for_each_tdp_mmu_root_yield_safe(kvm, root)
> > > +       __for_each_tdp_mmu_root_yield_safe(kvm, root, -1, KVM_DIRECT_ROOTS)
> > nit: update the comment of kvm_tdp_mmu_zap_all() 
> 
> Yea, good idea. It's definitely needs some explanation, considering the function
> is called "zap_all". A bit unfortunate actually.
> 
> > and explain why it's
> > KVM_DIRECT_ROOTS, not KVM_DIRECT_ROOTS | KVM_INVALID_ROOTS.
> 
> Explain why not to zap invalid mirrored roots?
No. Explain why not to zap invalid direct roots.
Just passing KVM_DIRECT_ROOTS will zap only valid direct roots, right?
The original kvm_tdp_mmu_zap_all() would "Zap all roots, including invalid roots".
It might be better to explain why not to zap invalid direct roots here.
Rick Edgecombe June 25, 2024, 8:56 p.m. UTC | #6
On Tue, 2024-06-25 at 14:14 +0800, Yan Zhao wrote:
> > Explain why not to zap invalid mirrored roots?
> No. Explain why not to zap invalid direct roots.
> Just passing KVM_DIRECT_ROOTS will zap only valid direct roots, right?
> The original kvm_tdp_mmu_zap_all() "Zap all roots, including invalid roots".
> It might be better to explain why not to zap invalid direct roots here.

Hmm, right. We don't actually have an enum value to iterate all direct roots (valid
and invalid). We could change tdp_mmu_root_match() to:
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 8320b093fef9..bcd771a8835f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -98,8 +98,8 @@ static bool tdp_mmu_root_match(struct kvm_mmu_page *root,
        if (WARN_ON_ONCE(!(types & KVM_VALID_ROOTS)))
                return false;
 
-       if (root->role.invalid)
-               return types & KVM_INVALID_ROOTS;
+       if (root->role.invalid && !(types & KVM_INVALID_ROOTS))
+               return false;
        if (likely(!is_mirror_sp(root)))
                return types & KVM_DIRECT_ROOTS;
 
Maybe better than a comment...? Need to put the whole thing together and test it
still.
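
For context, with that change applied the whole helper would read roughly like
the sketch below (assuming the helper's shape from the earlier patch that added
it; the final mirror-root branch is filled in as an assumption rather than
quoted from the diff above):

static bool tdp_mmu_root_match(struct kvm_mmu_page *root,
			       enum kvm_tdp_mmu_root_types types)
{
	if (WARN_ON_ONCE(!(types & KVM_VALID_ROOTS)))
		return false;

	/* Proposed change: filter invalid roots, then fall through to the type check. */
	if (root->role.invalid && !(types & KVM_INVALID_ROOTS))
		return false;

	if (likely(!is_mirror_sp(root)))
		return types & KVM_DIRECT_ROOTS;

	/* Assumed remainder of the helper: mirror roots match KVM_MIRROR_ROOTS. */
	return types & KVM_MIRROR_ROOTS;
}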
Yan Zhao June 26, 2024, 2:25 a.m. UTC | #7
On Wed, Jun 26, 2024 at 04:56:33AM +0800, Edgecombe, Rick P wrote:
> On Tue, 2024-06-25 at 14:14 +0800, Yan Zhao wrote:
> > > Explain why not to zap invalid mirrored roots?
> > No. Explain why not to zap invalid direct roots.
> > Just passing KVM_DIRECT_ROOTS will zap only valid direct roots, right?
> > The original kvm_tdp_mmu_zap_all() "Zap all roots, including invalid roots".
> > It might be better to explain why not to zap invalid direct roots here.
> 
> Hmm, right. We don't actually have enum value to iterate all direct roots (valid
> and invalid). We could change tdp_mmu_root_match() to:
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 8320b093fef9..bcd771a8835f 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -98,8 +98,8 @@ static bool tdp_mmu_root_match(struct kvm_mmu_page *root,
>         if (WARN_ON_ONCE(!(types & KVM_VALID_ROOTS)))
>                 return false;
>  
> -       if (root->role.invalid)
> -               return types & KVM_INVALID_ROOTS;
> +       if (root->role.invalid && !(types & KVM_INVALID_ROOTS))
> +               return false;
>         if (likely(!is_mirror_sp(root)))
>                 return types & KVM_DIRECT_ROOTS;
>  
> Maybe better than a comment...? Need to put the whole thing together and test it
> still.
Hmm, I think passing KVM_DIRECT_ROOTS only in kvm_tdp_mmu_zap_all() is ok,
because kvm_mmu_uninit_tdp_mmu() will zap all invalidated roots eventually,
though KVM_DIRECT_ROOTS | KVM_INVALID_ROOTS would be a better match for the
function name zap_all.

I'm not convinced that the change in tdp_mmu_root_match() above is needed.
Once a root is marked invalid, it won't be restored later. Distinguishing
between invalid direct roots and invalid mirror roots will only result in more
unused roots lingering unnecessarily.
Rick Edgecombe July 3, 2024, 8 p.m. UTC | #8
On Wed, 2024-06-26 at 10:25 +0800, Yan Zhao wrote:
> > Maybe better than a comment...? Need to put the whole thing together and
> > test it
> > still.
> Hmm, I think passing KVM_DIRECT_ROOTS only in kvm_tdp_mmu_zap_all() is ok,
> because kvm_mmu_uninit_tdp_mmu() will zap all invalidated roots eventually,
> though KVM_DIRECT_ROOTS | KVM_INVALID_ROOTS matches more to the function name
> zap_all.
> 
> I'm not convinced that the change in tdp_mmu_root_match() above is needed.
> Once a root is marked invalid, it won't be restored later. Distinguishing
> between invalid direct roots and invalid mirror roots will only result in more
> unused roots lingering unnecessarily.

The logic for direct roots is for normal VMs as well. In any case, we should not
mix generic KVM MMU changes in with mirrored memory additions. So let's keep the
existing direct root behavior the same.
Yan Zhao July 5, 2024, 1:16 a.m. UTC | #9
On Thu, Jul 04, 2024 at 04:00:18AM +0800, Edgecombe, Rick P wrote:
> On Wed, 2024-06-26 at 10:25 +0800, Yan Zhao wrote:
> > > Maybe better than a comment...? Need to put the whole thing together and
> > > test it
> > > still.
> > Hmm, I think passing KVM_DIRECT_ROOTS only in kvm_tdp_mmu_zap_all() is ok,
> > because kvm_mmu_uninit_tdp_mmu() will zap all invalidated roots eventually,
> > though KVM_DIRECT_ROOTS | KVM_INVALID_ROOTS matches more to the function name
> > zap_all.
> > 
> > I'm not convinced that the change in tdp_mmu_root_match() above is needed.
> > Once a root is marked invalid, it won't be restored later. Distinguishing
> > between invalid direct roots and invalid mirror roots will only result in more
> > unused roots lingering unnecessarily.
> 
> The logic for direct roots is for normal VMs as well. In any case, we should not
> mix generic KVM MMU changes in with mirrored memory additions. So let's keep the
> existing direct root behavior the same.
To keep the existing direct root behavior the same, I think specifying
KVM_DIRECT_ROOTS | KVM_INVALID_ROOTS in kvm_tdp_mmu_zap_all() is enough.

No need to modify tdp_mmu_root_match() to distinguish between invalid direct
roots and invalid mirror roots. As long as a root is invalid, the guest is no
longer affected by it and KVM will not use it any more. The only remaining
operation on an invalid root is zapping it.

Distinguishing between invalid direct roots and invalid mirror roots would
make the code to zap invalid roots unnecessarily complex, e.g.

kvm_tdp_mmu_zap_invalidated_roots() is called both in kvm_mmu_uninit_tdp_mmu()
and kvm_mmu_zap_all_fast().

- When called from the former, both invalid direct and invalid mirror roots
  need to be zapped;
- when called from the latter, only invalid direct roots need to be zapped.
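
To make the asymmetry concrete: if the zap helper also took a root-type filter,
the two call sites would diverge roughly as in this hypothetical sketch (not
code from the series, just an illustration of the extra complexity):

	/* Hypothetical: kvm_mmu_uninit_tdp_mmu() must zap every invalidated root. */
	kvm_tdp_mmu_zap_invalidated_roots(kvm, KVM_DIRECT_ROOTS | KVM_MIRROR_ROOTS);

	/* Hypothetical: kvm_mmu_zap_all_fast() never invalidates mirror roots. */
	kvm_tdp_mmu_zap_invalidated_roots(kvm, KVM_DIRECT_ROOTS);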
Rick Edgecombe July 9, 2024, 10:52 p.m. UTC | #10
On Fri, 2024-07-05 at 09:16 +0800, Yan Zhao wrote:
> To keep the existing direct root behavior the same, I think specifying
> KVM_DIRECT_ROOTS | KVM_INVALID_ROOTS in kvm_tdp_mmu_zap_all() is enough.

Right.

> 
> No need to modify tdp_mmu_root_match() do distinguish between invalid direct
> roots and invalid mirror roots. As long as a root is invalid, guest is no
> longer
> affected by it and KVM will not use it any more. The last remaining operation
> to the invalid root is only zapping.
> 
> Distinguishing between invalid direct roots and invalid mirror roots would
> make the code to zap invalid roots unnecessarily complex, e.g.

I'm not sure that it is more complicated. One requires a big comment to explain,
and the other is self-explanatory...

> 
> kvm_tdp_mmu_zap_invalidated_roots() is called both in kvm_mmu_uninit_tdp_mmu()
> and kvm_mmu_zap_all_fast().
> 
> - When called in the former, both invalid direct and invalid mirror roots are
>   required to zap;
> - when called in the latter, only invalid direct roots are required to zap.

It might help to put together a full fixup at this point. We have a couple of
diffs in this thread, and I'm not 100% sure which base patch we are talking
about anymore.
Isaku Yamahata July 18, 2024, 3:28 p.m. UTC | #11
On Fri, Jun 21, 2024 at 07:08:22PM +0000,
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:

> On Fri, 2024-06-21 at 15:10 +0800, Yan Zhao wrote:
> > On Wed, Jun 19, 2024 at 03:36:14PM -0700, Rick Edgecombe wrote:
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index 630e6b6d4bf2..a1ab67a4f41f 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -37,7 +37,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
> > >          * for zapping and thus puts the TDP MMU's reference to each root,
> > > i.e.
> > >          * ultimately frees all roots.
> > >          */
> > > -       kvm_tdp_mmu_invalidate_all_roots(kvm);
> > > +       kvm_tdp_mmu_invalidate_roots(kvm, KVM_VALID_ROOTS);
> > all roots (mirror + direct) are invalidated here.
> 
> Right.
> > 
> > >         kvm_tdp_mmu_zap_invalidated_roots(kvm);
> > kvm_tdp_mmu_zap_invalidated_roots() will zap invalidated mirror root with
> > mmu_lock held for read, which should trigger KVM_BUG_ON() in
> > __tdp_mmu_set_spte_atomic(), which assumes "atomic zapping don't operate on
> > mirror roots".
> > 
> > But up to now, the KVM_BUG_ON() is not triggered because
> > kvm_mmu_notifier_release() is called earlier than kvm_destroy_vm() (as in
> > below
> > call trace), and kvm_arch_flush_shadow_all() in kvm_mmu_notifier_release() has
> > zapped all mirror SPTEs before kvm_mmu_uninit_vm() called in kvm_destroy_vm().
> > 
> > 
> > kvm_mmu_notifier_release
> >   kvm_flush_shadow_all
> >     kvm_arch_flush_shadow_all
> >       static_call_cond(kvm_x86_flush_shadow_all_private)(kvm);
> >       kvm_mmu_zap_all  ==>hold mmu_lock for write
> >         kvm_tdp_mmu_zap_all ==>zap KVM_ALL_ROOTS with mmu_lock held for write
> > 
> > kvm_destroy_vm
> >   kvm_arch_destroy_vm
> >     kvm_mmu_uninit_vm
> >       kvm_mmu_uninit_tdp_mmu
> >         kvm_tdp_mmu_invalidate_roots ==>invalid all KVM_VALID_ROOTS
> >         kvm_tdp_mmu_zap_invalidated_roots ==> zap all roots with mmu_lock held
> > for read
> > 
> > 
> > A question is that kvm_mmu_notifier_release(), as a callback of primary MMU
> > notifier, why does it zap mirrored tdp when all other callbacks are with
> > KVM_FILTER_SHARED?
> > 
> > Could we just zap all KVM_DIRECT_ROOTS (valid | invalid) in
> > kvm_mmu_notifier_release() and move mirrord tdp related stuffs from 
> > kvm_arch_flush_shadow_all() to kvm_mmu_uninit_tdp_mmu(), ensuring mmu_lock is
> > held for write?
> 
> Sigh, thanks for flagging this. I agree it seems weird to free private memory
> from an MMU notifier callback. I also found this old thread where Sean NAKed the
> current approach (free hkid in mmu release):
> https://lore.kernel.org/kvm/ZN+1QHGa6ltpQxZn@google.com/#t
> 
> One challenge is that flush_shadow_all_private() needs to be done before
> kvm_destroy_vcpus(), where it gets into tdx_vcpu_free(). So kvm_mmu_uninit_vm()
> is too late. Perhaps this is why it was shoved into mmu notifier release (which
> happens long before as you noted). Isaku, do you recall any other reasons?

Although it's a bit late, this is for the record.

It's to optimize the destruction of the Secure-EPT.  Free the HKID early and
destroy the Secure-EPT with TDH.PHYMEM.PAGE.RECLAIM().  QEMU doesn't close any
KVM file descriptors on exit. (The gmem fd references the KVM VM fd, so VM
destruction happens after all gmem fds are closed.  Closing a gmem fd causes
Secure-EPT zapping before the HKID is released.)

Because we're ignoring that optimization for now, we can simply defer releasing
the HKID, following Sean's call.


> But static_call_cond(kvm_x86_vm_destroy) happens before kvm_destroy_vcpus, so we
> could maybe actually just do the tdx_mmu_release_hkid() part there. Then drop
> the flush_shadow_all_private x86 op. See the (not thoroughly checked) diff at
> the bottom of this mail.

Yep, we can release the HKID at VM destruction, at the cost of potentially slow
zapping of the Secure-EPT.  The following change basically looks good to me.
(The callback for Secure-EPT can be simplified.)

Thanks,

> But most of what is being discussed is in future patches where it starts to get
> into the TDX module interaction. So I wonder if we should drop this patch 17
> from "part 1" and include it with the next series so it can all be considered
> together.
> 
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-
> ops.h
> index 2adf36b74910..3927731aa947 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -23,7 +23,6 @@ KVM_X86_OP(has_emulated_msr)
>  KVM_X86_OP(vcpu_after_set_cpuid)
>  KVM_X86_OP_OPTIONAL(vm_enable_cap)
>  KVM_X86_OP(vm_init)
> -KVM_X86_OP_OPTIONAL(flush_shadow_all_private)
>  KVM_X86_OP_OPTIONAL(vm_destroy)
>  KVM_X86_OP_OPTIONAL(vm_free)
>  KVM_X86_OP_OPTIONAL_RET0(vcpu_precreate)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 8a72e5873808..8b2b79b39d0f 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1647,7 +1647,6 @@ struct kvm_x86_ops {
>         unsigned int vm_size;
>         int (*vm_enable_cap)(struct kvm *kvm, struct kvm_enable_cap *cap);
>         int (*vm_init)(struct kvm *kvm);
> -       void (*flush_shadow_all_private)(struct kvm *kvm);
>         void (*vm_destroy)(struct kvm *kvm);
>         void (*vm_free)(struct kvm *kvm);
>  
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e1299eb03e63..4deeeac14324 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6446,7 +6446,7 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
>          * lead to use-after-free.
>          */
>         if (tdp_mmu_enabled)
> -               kvm_tdp_mmu_zap_invalidated_roots(kvm);
> +               kvm_tdp_mmu_zap_invalidated_roots(kvm, true);
>  }
>  
>  static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
> @@ -6977,13 +6977,6 @@ static void kvm_mmu_zap_all(struct kvm *kvm)
>  
>  void kvm_arch_flush_shadow_all(struct kvm *kvm)
>  {
> -       /*
> -        * kvm_mmu_zap_all() zaps both private and shared page tables.  Before
> -        * tearing down private page tables, TDX requires some TD resources to
> -        * be destroyed (i.e. keyID must have been reclaimed, etc).  Invoke
> -        * kvm_x86_flush_shadow_all_private() for this.
> -        */
> -       static_call_cond(kvm_x86_flush_shadow_all_private)(kvm);
>         kvm_mmu_zap_all(kvm);
>  }
>  
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 68dfcdb46ab7..9e8b012aa8cc 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -38,7 +38,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
>          * ultimately frees all roots.
>          */
>         kvm_tdp_mmu_invalidate_roots(kvm, KVM_VALID_ROOTS);
> -       kvm_tdp_mmu_zap_invalidated_roots(kvm);
> +       kvm_tdp_mmu_zap_invalidated_roots(kvm, false);
>  
>         WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
>         WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
> @@ -1057,7 +1057,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
>          * KVM_RUN is unreachable, i.e. no vCPUs will ever service the request.
>          */
>         lockdep_assert_held_write(&kvm->mmu_lock);
> -       for_each_tdp_mmu_root_yield_safe(kvm, root)
> +       __for_each_tdp_mmu_root_yield_safe(kvm, root, -1, KVM_DIRECT_ROOTS)
>                 tdp_mmu_zap_root(kvm, root, false);
>  }
>  
> @@ -1065,11 +1065,14 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
>   * Zap all invalidated roots to ensure all SPTEs are dropped before the "fast
>   * zap" completes.
>   */
> -void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
> +void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm, bool shared)
>  {
>         struct kvm_mmu_page *root;
>  
> -       read_lock(&kvm->mmu_lock);
> +       if (shared)
> +               read_lock(&kvm->mmu_lock);
> +       else
> +               write_lock(&kvm->mmu_lock);
>  
>         for_each_tdp_mmu_root_yield_safe(kvm, root) {
>                 if (!root->tdp_mmu_scheduled_root_to_zap)
> @@ -1087,7 +1090,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
>                  * that may be zapped, as such entries are associated with the
>                  * ASID on both VMX and SVM.
>                  */
> -               tdp_mmu_zap_root(kvm, root, true);
> +               tdp_mmu_zap_root(kvm, root, shared);
>  
>                 /*
>                  * The referenced needs to be put *after* zapping the root, as
> @@ -1097,7 +1100,10 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
>                 kvm_tdp_mmu_put_root(kvm, root);
>         }
>  
> -       read_unlock(&kvm->mmu_lock);
> +       if (shared)
> +               read_unlock(&kvm->mmu_lock);
> +       else
> +               write_unlock(&kvm->mmu_lock);
>  }
>  
>  /*
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 56741d31048a..7927fa4a96e0 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -68,7 +68,7 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page
> *sp);
>  void kvm_tdp_mmu_zap_all(struct kvm *kvm);
>  void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
>                                   enum kvm_tdp_mmu_root_types root_types);
> -void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);
> +void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm, bool shared);
>  
>  int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>  
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index b6828e35eb17..3f9bfcd3e152 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -98,16 +98,12 @@ static int vt_vm_init(struct kvm *kvm)
>         return vmx_vm_init(kvm);
>  }
>  
> -static void vt_flush_shadow_all_private(struct kvm *kvm)
> -{
> -       if (is_td(kvm))
> -               tdx_mmu_release_hkid(kvm);
> -}
> -
>  static void vt_vm_destroy(struct kvm *kvm)
>  {
> -       if (is_td(kvm))
> +       if (is_td(kvm)) {
> +               tdx_mmu_release_hkid(kvm);
>                 return;
> +       }
>  
>         vmx_vm_destroy(kvm);
>  }
> @@ -980,7 +976,6 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>         .vm_size = sizeof(struct kvm_vmx),
>         .vm_enable_cap = vt_vm_enable_cap,
>         .vm_init = vt_vm_init,
> -       .flush_shadow_all_private = vt_flush_shadow_all_private,
>         .vm_destroy = vt_vm_destroy,
>         .vm_free = vt_vm_free,
>  
>
Rick Edgecombe July 18, 2024, 3:55 p.m. UTC | #12
On Thu, 2024-07-18 at 08:28 -0700, Isaku Yamahata wrote:
> Although it's a bit late, it's for record.
> 
> It's to optimize the destruction Secure-EPT.  Free HKID early and destruct
> Secure-EPT by TDH.PHYMEM.PAGE.RECLAIM().  QEMU doesn't close any KVM file
> descriptors on exit. (gmem fd references KVM VM fd. so vm destruction happens
> after all gmem fds are closed.  Closing gmem fd causes secure-EPT zapping
> befure
> releasing HKID.)
> 
> Because we're ignoring such optimization for now, we can simply defer
> releasing
> HKID following Seans's call.

Thanks for the background.

> 
> 
> > But static_call_cond(kvm_x86_vm_destroy) happens before kvm_destroy_vcpus,
> > so we
> > could maybe actually just do the tdx_mmu_release_hkid() part there. Then
> > drop
> > the flush_shadow_all_private x86 op. See the (not thoroughly checked) diff
> > at
> > the bottom of this mail.
> 
> Yep, we can release HKID at vm destruction with potential too slow zapping of
> Secure-EPT.  The following change basically looks good to me.
> (The callback for Secure-EPT can be simplified.)

Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 287dcc2685e4..c5255aa62146 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6414,8 +6414,13 @@  static void kvm_mmu_zap_all_fast(struct kvm *kvm)
 	 * write and in the same critical section as making the reload request,
 	 * e.g. before kvm_zap_obsolete_pages() could drop mmu_lock and yield.
 	 */
-	if (tdp_mmu_enabled)
-		kvm_tdp_mmu_invalidate_all_roots(kvm);
+	if (tdp_mmu_enabled) {
+		/*
+		 * External page tables don't support fast zapping, therefore
+		 * their mirrors must be invalidated separately by the caller.
+		 */
+		kvm_tdp_mmu_invalidate_roots(kvm, KVM_DIRECT_ROOTS);
+	}
 
 	/*
 	 * Notify all vcpus to reload its shadow page table and flush TLB.
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 630e6b6d4bf2..a1ab67a4f41f 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -37,7 +37,7 @@  void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
 	 * for zapping and thus puts the TDP MMU's reference to each root, i.e.
 	 * ultimately frees all roots.
 	 */
-	kvm_tdp_mmu_invalidate_all_roots(kvm);
+	kvm_tdp_mmu_invalidate_roots(kvm, KVM_VALID_ROOTS);
 	kvm_tdp_mmu_zap_invalidated_roots(kvm);
 
 	WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
@@ -1110,10 +1110,18 @@  void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
  * Note, kvm_tdp_mmu_zap_invalidated_roots() is gifted the TDP MMU's reference.
  * See kvm_tdp_mmu_alloc_root().
  */
-void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
+void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
+				  enum kvm_tdp_mmu_root_types root_types)
 {
 	struct kvm_mmu_page *root;
 
+	/*
+	 * Invalidating invalid roots doesn't make sense, prevent developers from
+	 * having to think about it.
+	 */
+	if (WARN_ON_ONCE(root_types & KVM_INVALID_ROOTS))
+		root_types &= ~KVM_INVALID_ROOTS;
+
 	/*
 	 * mmu_lock must be held for write to ensure that a root doesn't become
 	 * invalid while there are active readers (invalidating a root while
@@ -1135,6 +1143,9 @@  void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
 	 * or get/put references to roots.
 	 */
 	list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
+		if (!tdp_mmu_root_match(root, root_types))
+			continue;
+
 		/*
 		 * Note, invalid roots can outlive a memslot update!  Invalid
 		 * roots must be *zapped* before the memslot update completes,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 2903f03a34be..56741d31048a 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -66,7 +66,8 @@  static inline struct kvm_mmu_page *tdp_mmu_get_root(struct kvm_vcpu *vcpu,
 bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, gfn_t start, gfn_t end, bool flush);
 bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
 void kvm_tdp_mmu_zap_all(struct kvm *kvm);
-void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
+void kvm_tdp_mmu_invalidate_roots(struct kvm *kvm,
+				  enum kvm_tdp_mmu_root_types root_types);
 void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);
 
 int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);