diff mbox series

[v19,058/130] KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page

Message ID 9d86b5a2787d20ffb5a58f86e43601a660521f16.1708933498.git.isaku.yamahata@intel.com (mailing list archive)
State New, archived
Headers show
Series [v19,001/130] x86/virt/tdx: Rename _offset to _member for TD_SYSINFO_MAP() macro | expand

Commit Message

Isaku Yamahata Feb. 26, 2024, 8:26 a.m. UTC
From: Isaku Yamahata <isaku.yamahata@intel.com>

For private GPA, CPU refers a private page table whose contents are
encrypted.  The dedicated APIs to operate on it (e.g. updating/reading its
PTE entry) are used and their cost is expensive.

When KVM resolves KVM page fault, it walks the page tables.  To reuse the
existing KVM MMU code and mitigate the heavy cost to directly walk private
page table, allocate one more page to copy the dummy page table for KVM MMU
code to directly walk.  Resolve KVM page fault with the existing code, and
do additional operations necessary for the private page table.  To
distinguish such cases, the existing KVM page table is called a shared page
table (i.e. not associated with private page table), and the page table
with private page table is called a private page table.  The relationship
is depicted below.

Add a private pointer to struct kvm_mmu_page for private page table and
add helper functions to allocate/initialize/free a private page table
page.

              KVM page fault                     |
                     |                           |
                     V                           |
        -------------+----------                 |
        |                      |                 |
        V                      V                 |
     shared GPA           private GPA            |
        |                      |                 |
        V                      V                 |
    shared PT root      dummy PT root            |    private PT root
        |                      |                 |           |
        V                      V                 |           V
     shared PT            dummy PT ----propagate---->   private PT
        |                      |                 |           |
        |                      \-----------------+------\    |
        |                                        |      |    |
        V                                        |      V    V
  shared guest page                              |    private guest page
                                                 |
                           non-encrypted memory  |    encrypted memory
                                                 |
PT: page table
- Shared PT is visible to KVM and it is used by CPU.
- Private PT is used by CPU but it is invisible to KVM.
- Dummy PT is visible to KVM but not used by CPU.  It is used to
  propagate PT change to the actual private PT which is used by CPU.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
---
v19:
- typo in the comment in kvm_mmu_alloc_private_spt()
- drop CONFIG_KVM_MMU_PRIVATE
---
 arch/x86/include/asm/kvm_host.h |  5 +++
 arch/x86/kvm/mmu/mmu.c          |  7 ++++
 arch/x86/kvm/mmu/mmu_internal.h | 63 ++++++++++++++++++++++++++++++---
 arch/x86/kvm/mmu/tdp_mmu.c      |  1 +
 4 files changed, 72 insertions(+), 4 deletions(-)

Comments

Edgecombe, Rick P March 13, 2024, 8:51 p.m. UTC | #1
On Mon, 2024-02-26 at 00:26 -0800, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> For private GPA, CPU refers a private page table whose contents are
> encrypted.  The dedicated APIs to operate on it (e.g.
> updating/reading its
> PTE entry) are used and their cost is expensive.
> 
> When KVM resolves KVM page fault, it walks the page tables.  To reuse
> the
> existing KVM MMU code and mitigate the heavy cost to directly walk
> private
> page table, allocate one more page to copy the dummy page table for
> KVM MMU
> code to directly walk.  Resolve KVM page fault with the existing
> code, and
> do additional operations necessary for the private page table. 

>  To
> distinguish such cases, the existing KVM page table is called a
> shared page
> table (i.e. not associated with private page table), and the page
> table
> with private page table is called a private page table.

This makes it sound like the dummy page table for the private alias is
also called a shared page table, but in the drawing below it looks like
only the shared alias is called "shared PT".

>   The relationship
> is depicted below.
> 
> Add a private pointer to struct kvm_mmu_page for private page table
> and
> add helper functions to allocate/initialize/free a private page table
> page.
> 
>               KVM page fault                     |
>                      |                           |
>                      V                           |
>         -------------+----------                 |
>         |                      |                 |
>         V                      V                 |
>      shared GPA           private GPA            |
>         |                      |                 |
>         V                      V                 |
>     shared PT root      dummy PT root            |    private PT root
>         |                      |                 |           |
>         V                      V                 |           V
>      shared PT            dummy PT ----propagate---->   private PT
>         |                      |                 |           |
>         |                      \-----------------+------\    |
>         |                                        |      |    |
>         V                                        |      V    V
>   shared guest page                              |    private guest
> page
>                                                  |
>                            non-encrypted memory  |    encrypted
> memory
>                                                  |
> PT: page table
> - Shared PT is visible to KVM and it is used by CPU.
> - Private PT is used by CPU but it is invisible to KVM.
> - Dummy PT is visible to KVM but not used by CPU.  It is used to
>   propagate PT change to the actual private PT which is used by CPU.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
> ---
> v19:
> - typo in the comment in kvm_mmu_alloc_private_spt()
> - drop CONFIG_KVM_MMU_PRIVATE
> ---
>  arch/x86/include/asm/kvm_host.h |  5 +++
>  arch/x86/kvm/mmu/mmu.c          |  7 ++++
>  arch/x86/kvm/mmu/mmu_internal.h | 63 ++++++++++++++++++++++++++++++-
> --
>  arch/x86/kvm/mmu/tdp_mmu.c      |  1 +
>  4 files changed, 72 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h
> b/arch/x86/include/asm/kvm_host.h
> index dcc6f7c38a83..efd3fda1c177 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -825,6 +825,11 @@ struct kvm_vcpu_arch {
>         struct kvm_mmu_memory_cache mmu_shadow_page_cache;
>         struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
>         struct kvm_mmu_memory_cache mmu_page_header_cache;
> +       /*
> +        * This cache is to allocate private page table. E.g. 
> Secure-EPT used
> +        * by the TDX module.
> +        */
> +       struct kvm_mmu_memory_cache mmu_private_spt_cache;
>  
>         /*
>          * QEMU userspace and the guest each have their own FPU
> state.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index eeebbc67e42b..0d6d4506ec97 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -685,6 +685,12 @@ static int mmu_topup_memory_caches(struct
> kvm_vcpu *vcpu, bool maybe_indirect)
>                                        1 + PT64_ROOT_MAX_LEVEL +
> PTE_PREFETCH_NUM);
>         if (r)
>                 return r;
> +       if (kvm_gfn_shared_mask(vcpu->kvm)) {
> +               r = kvm_mmu_topup_memory_cache(&vcpu-
> >arch.mmu_private_spt_cache,
> +                                              PT64_ROOT_MAX_LEVEL);
> +               if (r)
> +                       return r;
> +       }
>         r = kvm_mmu_topup_memory_cache(&vcpu-
> >arch.mmu_shadow_page_cache,
>                                        PT64_ROOT_MAX_LEVEL);
>         if (r)
> @@ -704,6 +710,7 @@ static void mmu_free_memory_caches(struct
> kvm_vcpu *vcpu)
>         kvm_mmu_free_memory_cache(&vcpu-
> >arch.mmu_pte_list_desc_cache);
>         kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
>         kvm_mmu_free_memory_cache(&vcpu-
> >arch.mmu_shadowed_info_cache);
> +       kvm_mmu_free_memory_cache(&vcpu->arch.mmu_private_spt_cache);
>         kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
>  }
>  
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h
> b/arch/x86/kvm/mmu/mmu_internal.h
> index e3f54701f98d..002f3f80bf3b 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -101,7 +101,21 @@ struct kvm_mmu_page {
>                 int root_count;
>                 refcount_t tdp_mmu_root_count;
>         };
> -       unsigned int unsync_children;
> +       union {
> +               struct {
> +                       unsigned int unsync_children;
> +                       /*
> +                        * Number of writes since the last time
> traversal
> +                        * visited this page.
> +                        */
> +                       atomic_t write_flooding_count;
> +               };

I think the point of putting these in a union is that they only apply
to shadow paging and so can't be used with TDX. I think you are putting
more than the sizeof(void *) in there as there are multiple in the same
category. But there seems to be a new one added, *shadowed_translation.
Should it go in there too? Is the union because there wasn't room
before, or just to be tidy?

I think the commit log should have more discussion of this union and
maybe a comment in the struct to explain the purpose of the
organization. Can you explain the reasoning now for the sake of
discussion?

> +               /*
> +                * Associated private shadow page table, e.g. Secure-
> EPT page
> +                * passed to the TDX module.
> +                */
> +               void *private_spt;
> +       };
>         union {
>                 struct kvm_rmap_head parent_ptes; /* rmap pointers to
> parent sptes */
>                 tdp_ptep_t ptep;
> @@ -124,9 +138,6 @@ struct kvm_mmu_page {
>         int clear_spte_count;
>  #endif
>  
> -       /* Number of writes since the last time traversal visited
> this page.  */
> -       atomic_t write_flooding_count;
> -
>  #ifdef CONFIG_X86_64
>         /* Used for freeing the page asynchronously if it is a TDP
> MMU page. */
>         struct rcu_head rcu_head;
> @@ -150,6 +161,50 @@ static inline bool is_private_sp(const struct
> kvm_mmu_page *sp)
>         return kvm_mmu_page_role_is_private(sp->role);
>  }
>  
> +static inline void *kvm_mmu_private_spt(struct kvm_mmu_page *sp)
> +{
> +       return sp->private_spt;
> +}
> +
> +static inline void kvm_mmu_init_private_spt(struct kvm_mmu_page *sp,
> void *private_spt)
> +{
> +       sp->private_spt = private_spt;
> +}
> +
> +static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu,
> struct kvm_mmu_page *sp)
> +{
> +       bool is_root = vcpu->arch.root_mmu.root_role.level == sp-
> >role.level;
> +
> +       KVM_BUG_ON(!kvm_mmu_page_role_is_private(sp->role), vcpu-
> >kvm);
> +       if (is_root)
> +               /*
> +                * Because TDX module assigns root Secure-EPT page
> and set it to
> +                * Secure-EPTP when TD vcpu is created, secure page
> table for
> +                * root isn't needed.
> +                */
> +               sp->private_spt = NULL;
> +       else {
> +               /*
> +                * Because the TDX module doesn't trust VMM and
> initializes
> +                * the pages itself, KVM doesn't initialize them. 
> Allocate
> +                * pages with garbage and give them to the TDX
> module.
> +                */
> +               sp->private_spt = kvm_mmu_memory_cache_alloc(&vcpu-
> >arch.mmu_private_spt_cache);
> +               /*
> +                * Because mmu_private_spt_cache is topped up before
> starting
> +                * kvm page fault resolving, the allocation above
> shouldn't
> +                * fail.
> +                */
> +               WARN_ON_ONCE(!sp->private_spt);

There is already a BUG_ON() for the allocation failure in
kvm_mmu_memory_cache_alloc().

> +       }
> +}
> +
> +static inline void kvm_mmu_free_private_spt(struct kvm_mmu_page *sp)
> +{
> +       if (sp->private_spt)

free_page() can accept NULL, so the above check is unneeded.

> +               free_page((unsigned long)sp->private_spt);
> +}
> +
>  static inline bool kvm_mmu_page_ad_need_write_protect(struct
> kvm_mmu_page *sp)
>  {
>         /*
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 87233b3ceaef..d47f0daf1b03 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -53,6 +53,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
>  
>  static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
>  {
> +       kvm_mmu_free_private_spt(sp);

This particular memcache zeros the allocations so it is safe to free
this without regard to whether sp->private_spt has been set and that
the allocation caller is not in place yet. It would be nice to add this
detail in the log.

>         free_page((unsigned long)sp->spt);
>         kmem_cache_free(mmu_page_header_cache, sp);
>  }
Isaku Yamahata March 14, 2024, 6:10 p.m. UTC | #2
On Wed, Mar 13, 2024 at 08:51:53PM +0000,
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:

> On Mon, 2024-02-26 at 00:26 -0800, isaku.yamahata@intel.com wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > 
> > For private GPA, CPU refers a private page table whose contents are
> > encrypted.  The dedicated APIs to operate on it (e.g.
> > updating/reading its
> > PTE entry) are used and their cost is expensive.
> > 
> > When KVM resolves KVM page fault, it walks the page tables.  To reuse
> > the
> > existing KVM MMU code and mitigate the heavy cost to directly walk
> > private
> > page table, allocate one more page to copy the dummy page table for
> > KVM MMU
> > code to directly walk.  Resolve KVM page fault with the existing
> > code, and
> > do additional operations necessary for the private page table. 
> 
> >  To
> > distinguish such cases, the existing KVM page table is called a
> > shared page
> > table (i.e. not associated with private page table), and the page
> > table
> > with private page table is called a private page table.
> 
> This makes it sound like the dummy page table for the private alias is
> also called a shared page table, but in the drawing below it looks like
> only the shared alias is called "shared PT".

How about this,
Call the existing KVM page table associated with shared GPA as shared page table.
Call the KVM page table associate with private GPA as private page table.

> >   The relationship
> > is depicted below.
> > 
> > Add a private pointer to struct kvm_mmu_page for private page table
> > and
> > add helper functions to allocate/initialize/free a private page table
> > page.
> > 
> >               KVM page fault                     |
> >                      |                           |
> >                      V                           |
> >         -------------+----------                 |
> >         |                      |                 |
> >         V                      V                 |
> >      shared GPA           private GPA            |
> >         |                      |                 |
> >         V                      V                 |
> >     shared PT root      dummy PT root            |    private PT root
> >         |                      |                 |           |
> >         V                      V                 |           V
> >      shared PT            dummy PT ----propagate---->   private PT
> >         |                      |                 |           |
> >         |                      \-----------------+------\    |
> >         |                                        |      |    |
> >         V                                        |      V    V
> >   shared guest page                              |    private guest
> > page
> >                                                  |
> >                            non-encrypted memory  |    encrypted
> > memory
> >                                                  |
> > PT: page table
> > - Shared PT is visible to KVM and it is used by CPU.
> > - Private PT is used by CPU but it is invisible to KVM.
> > - Dummy PT is visible to KVM but not used by CPU.  It is used to
> >   propagate PT change to the actual private PT which is used by CPU.
> > 
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
> > ---

...snip...

> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h
> > b/arch/x86/kvm/mmu/mmu_internal.h
> > index e3f54701f98d..002f3f80bf3b 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -101,7 +101,21 @@ struct kvm_mmu_page {
> >                 int root_count;
> >                 refcount_t tdp_mmu_root_count;
> >         };
> > -       unsigned int unsync_children;
> > +       union {
> > +               struct {
> > +                       unsigned int unsync_children;
> > +                       /*
> > +                        * Number of writes since the last time
> > traversal
> > +                        * visited this page.
> > +                        */
> > +                       atomic_t write_flooding_count;
> > +               };
> 
> I think the point of putting these in a union is that they only apply
> to shadow paging and so can't be used with TDX. I think you are putting
> more than the sizeof(void *) in there as there are multiple in the same
> category.

I'm not sure if I'm following you.
On x86_64, sizeof(unsigned int) = 4, sizeof(atomic_t) = 4, sizeof(void *) = 8.
I moved write_flooding_count to have 8 bytes.


> But there seems to be a new one added, *shadowed_translation.
> Should it go in there too? Is the union because there wasn't room
> before, or just to be tidy?

Originally TDX MMU support was implemented for legacy tdp mmu.  It used
shadowed_translation.  It was not an option at that time.  Later we switched to
(new) TDP MMU.  Now we have choice to which member to overlay.


> I think the commit log should have more discussion of this union and
> maybe a comment in the struct to explain the purpose of the
> organization. Can you explain the reasoning now for the sake of
> discussion?

Sure.  We'd like to add void * pointer to struct kvm_mmu_page.  Given some
members are used only for legacy KVM MMUs and not used for TDP MMU, we can save
memory overhead with union.  We have options.
- u64 *shadowed_translation
  This was not chosen for the old implementation. Now this is option.
- pack unsync_children and write_flooding_count for 8 bytes
  This patch chosen this for historical reason. Other two option is possible.
- unsync_child_bitmap
  Historically it was unioned with other members. But now it's not.

I don't have strong preference for TDX support as long as we can have void *.
Huang, Kai March 14, 2024, 9:23 p.m. UTC | #3
On 15/03/2024 7:10 am, Isaku Yamahata wrote:
> On Wed, Mar 13, 2024 at 08:51:53PM +0000,
> "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:
> 
>> On Mon, 2024-02-26 at 00:26 -0800, isaku.yamahata@intel.com wrote:
>>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>>>
>>> For private GPA, CPU refers a private page table whose contents are
>>> encrypted.  The dedicated APIs to operate on it (e.g.
>>> updating/reading its
>>> PTE entry) are used and their cost is expensive.
>>>
>>> When KVM resolves KVM page fault, it walks the page tables.  To reuse
>>> the
>>> existing KVM MMU code and mitigate the heavy cost to directly walk
>>> private
>>> page table, allocate one more page to copy the dummy page table for
>>> KVM MMU
>>> code to directly walk.  Resolve KVM page fault with the existing
>>> code, and
>>> do additional operations necessary for the private page table.
>>
>>>   To
>>> distinguish such cases, the existing KVM page table is called a
>>> shared page
>>> table (i.e. not associated with private page table), and the page
>>> table
>>> with private page table is called a private page table.
>>
>> This makes it sound like the dummy page table for the private alias is
>> also called a shared page table, but in the drawing below it looks like
>> only the shared alias is called "shared PT".
> 
> How about this,
> Call the existing KVM page table associated with shared GPA as shared page table. > Call the KVM page table associate with private GPA as private page table.
> 

For the second one, are you talking about the *true* secure/private EPT 
page table used by hardware, or the one visible to KVM but not used by 
hardware?

We have 3 page tables as you mentioned:

PT: page table
- Shared PT is visible to KVM and it is used by CPU.
- Private PT is used by CPU but it is invisible to KVM.
- Dummy PT is visible to KVM but not used by CPU.  It is used to
   propagate PT change to the actual private PT which is used by CPU.

If I recall correctly, we used to call the last one "mirrored (private) 
page table".

I lost the tracking when we changed to use "dummy page table", but it 
seems to me "mirrored" is better than "dummy" because the latter means 
it is useless but in fact it is used to propagate changes to the real 
private page table used by hardware.

Btw, one nit, perhaps:

"Shared PT is visible to KVM and it is used by CPU." -> "Shared PT is 
visible to KVM and it is used by CPU for shared mappings".

To make it more clearer it is used for "shared mappings".

But this may be unnecessary to others, so up to you.
Edgecombe, Rick P March 14, 2024, 9:39 p.m. UTC | #4
On Fri, 2024-03-15 at 10:23 +1300, Huang, Kai wrote:
> We have 3 page tables as you mentioned:
> 
> PT: page table
> - Shared PT is visible to KVM and it is used by CPU.
> - Private PT is used by CPU but it is invisible to KVM.
> - Dummy PT is visible to KVM but not used by CPU.  It is used to
>    propagate PT change to the actual private PT which is used by CPU.
> 
> If I recall correctly, we used to call the last one "mirrored
> (private) 
> page table".
> 
> I lost the tracking when we changed to use "dummy page table", but it
> seems to me "mirrored" is better than "dummy" because the latter
> means 
> it is useless but in fact it is used to propagate changes to the real
> private page table used by hardware.

Mirrored makes sense to me. So like:

Private - Table actually mapping private alias, in TDX module
Shared - Shared alias table, visible in KVM
Mirror - Mirroring private, visible in KVM

> 
> Btw, one nit, perhaps:
> 
> "Shared PT is visible to KVM and it is used by CPU." -> "Shared PT is
> visible to KVM and it is used by CPU for shared mappings".
> 
> To make it more clearer it is used for "shared mappings".
> 
> But this may be unnecessary to others, so up to you.

Yep, this seems clearer.
Edgecombe, Rick P March 14, 2024, 9:52 p.m. UTC | #5
On Thu, 2024-03-14 at 11:10 -0700, Isaku Yamahata wrote:
> > I think the point of putting these in a union is that they only
> > apply
> > to shadow paging and so can't be used with TDX. I think you are
> > putting
> > more than the sizeof(void *) in there as there are multiple in the
> > same
> > category.
> 
> I'm not sure if I'm following you.
> On x86_64, sizeof(unsigned int) = 4, sizeof(atomic_t) = 4,
> sizeof(void *) = 8.
> I moved write_flooding_count to have 8 bytes.

Ah, I see. Yes you are write about it summing to 8. Ok, what do you
think about putting a comment that these will always be unused with
TDX?

> 
> 
> > But there seems to be a new one added, *shadowed_translation.
> > Should it go in there too? Is the union because there wasn't room
> > before, or just to be tidy?
> 
> Originally TDX MMU support was implemented for legacy tdp mmu.  It
> used
> shadowed_translation.  It was not an option at that time.  Later we
> switched to
> (new) TDP MMU.  Now we have choice to which member to overlay.
> 
> 
> > I think the commit log should have more discussion of this union
> > and
> > maybe a comment in the struct to explain the purpose of the
> > organization. Can you explain the reasoning now for the sake of
> > discussion?
> 
> Sure.  We'd like to add void * pointer to struct kvm_mmu_page.  Given
> some
> members are used only for legacy KVM MMUs and not used for TDP MMU,
> we can save
> memory overhead with union.  We have options.
> - u64 *shadowed_translation
>   This was not chosen for the old implementation. Now this is option.

This seems a little more straighforward, but I'm on the fence if it's
worth changing.

> - pack unsync_children and write_flooding_count for 8 bytes
>   This patch chosen this for historical reason. Other two option is
> possible.
> - unsync_child_bitmap
>   Historically it was unioned with other members. But now it's not.
> 
> I don't have strong preference for TDX support as long as we can have
> void *.
Isaku Yamahata March 15, 2024, 12:24 a.m. UTC | #6
On Thu, Mar 14, 2024 at 09:52:58PM +0000,
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:

> On Thu, 2024-03-14 at 11:10 -0700, Isaku Yamahata wrote:
> > > I think the point of putting these in a union is that they only
> > > apply
> > > to shadow paging and so can't be used with TDX. I think you are
> > > putting
> > > more than the sizeof(void *) in there as there are multiple in the
> > > same
> > > category.
> > 
> > I'm not sure if I'm following you.
> > On x86_64, sizeof(unsigned int) = 4, sizeof(atomic_t) = 4,
> > sizeof(void *) = 8.
> > I moved write_flooding_count to have 8 bytes.
> 
> Ah, I see. Yes you are write about it summing to 8. Ok, what do you
> think about putting a comment that these will always be unused with
> TDX?

Ok, will add a comment. Also add some to the commit message.
Isaku Yamahata March 15, 2024, 1:09 a.m. UTC | #7
On Thu, Mar 14, 2024 at 09:39:34PM +0000,
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:

> On Fri, 2024-03-15 at 10:23 +1300, Huang, Kai wrote:
> > We have 3 page tables as you mentioned:
> > 
> > PT: page table
> > - Shared PT is visible to KVM and it is used by CPU.
> > - Private PT is used by CPU but it is invisible to KVM.
> > - Dummy PT is visible to KVM but not used by CPU.  It is used to
> >    propagate PT change to the actual private PT which is used by CPU.
> > 
> > If I recall correctly, we used to call the last one "mirrored
> > (private) 
> > page table".
> > 
> > I lost the tracking when we changed to use "dummy page table", but it
> > seems to me "mirrored" is better than "dummy" because the latter
> > means 
> > it is useless but in fact it is used to propagate changes to the real
> > private page table used by hardware.
> 
> Mirrored makes sense to me. So like:
> 
> Private - Table actually mapping private alias, in TDX module
> Shared - Shared alias table, visible in KVM
> Mirror - Mirroring private, visible in KVM
> 
> > 
> > Btw, one nit, perhaps:
> > 
> > "Shared PT is visible to KVM and it is used by CPU." -> "Shared PT is
> > visible to KVM and it is used by CPU for shared mappings".
> > 
> > To make it more clearer it is used for "shared mappings".
> > 
> > But this may be unnecessary to others, so up to you.
> 
> Yep, this seems clearer.

Here is the updated one. Renamed dummy -> mirroed.

When KVM resolves the KVM page fault, it walks the page tables.  To reuse
the existing KVM MMU code and mitigate the heavy cost of directly walking
the private page table, allocate one more page to copy the mirrored page
table for the KVM MMU code to directly walk.  Resolve the KVM page fault
with the existing code, and do additional operations necessary for the
private page table.  To distinguish such cases, the existing KVM page table
is called a shared page table (i.e., not associated with a private page
table), and the page table with a private page table is called a mirrored
page table.  The relationship is depicted below.


              KVM page fault                     |
                     |                           |
                     V                           |
        -------------+----------                 |
        |                      |                 |
        V                      V                 |
     shared GPA           private GPA            |
        |                      |                 |
        V                      V                 |
    shared PT root      mirrored PT root         |    private PT root
        |                      |                 |           |
        V                      V                 |           V
     shared PT           mirrored PT ----propagate---->  private PT
        |                      |                 |           |
        |                      \-----------------+------\    |
        |                                        |      |    |
        V                                        |      V    V
  shared guest page                              |    private guest page
                                                 |
                           non-encrypted memory  |    encrypted memory
                                                 |
PT: Page table
Shared PT: visible to KVM, and the CPU uses it for shared mappings.
Private PT: the CPU uses it, but it is invisible to KVM.  TDX module
            updates this table to map private guest pages.
Mirrored PT: It is visible to KVM, but the CPU doesn't use it.  KVM uses it
             to propagate PT change to the actual private PT.
Binbin Wu March 27, 2024, 1:49 p.m. UTC | #8
On 3/15/2024 9:09 AM, Isaku Yamahata wrote:
> Here is the updated one. Renamed dummy -> mirroed.
>
> When KVM resolves the KVM page fault, it walks the page tables.  To reuse
> the existing KVM MMU code and mitigate the heavy cost of directly walking
> the private page table, allocate one more page to copy the mirrored page

Here "copy" is a bit confusing for me.
The mirrored page table is maintained by KVM, not copied from anywhere.

> table for the KVM MMU code to directly walk.  Resolve the KVM page fault
> with the existing code, and do additional operations necessary for the
> private page table.  To distinguish such cases, the existing KVM page table
> is called a shared page table (i.e., not associated with a private page
> table), and the page table with a private page table is called a mirrored
> page table.  The relationship is depicted below.
>
>
>                KVM page fault                     |
>                       |                           |
>                       V                           |
>          -------------+----------                 |
>          |                      |                 |
>          V                      V                 |
>       shared GPA           private GPA            |
>          |                      |                 |
>          V                      V                 |
>      shared PT root      mirrored PT root         |    private PT root
>          |                      |                 |           |
>          V                      V                 |           V
>       shared PT           mirrored PT ----propagate---->  private PT
>          |                      |                 |           |
>          |                      \-----------------+------\    |
>          |                                        |      |    |
>          V                                        |      V    V
>    shared guest page                              |    private guest page
>                                                   |
>                             non-encrypted memory  |    encrypted memory
>                                                   |
> PT: Page table
> Shared PT: visible to KVM, and the CPU uses it for shared mappings.
> Private PT: the CPU uses it, but it is invisible to KVM.  TDX module
>              updates this table to map private guest pages.
> Mirrored PT: It is visible to KVM, but the CPU doesn't use it.  KVM uses it
>               to propagate PT change to the actual private PT.
>
Isaku Yamahata March 28, 2024, 12:02 a.m. UTC | #9
On Wed, Mar 27, 2024 at 09:49:14PM +0800,
Binbin Wu <binbin.wu@linux.intel.com> wrote:

> 
> 
> On 3/15/2024 9:09 AM, Isaku Yamahata wrote:
> > Here is the updated one. Renamed dummy -> mirroed.
> > 
> > When KVM resolves the KVM page fault, it walks the page tables.  To reuse
> > the existing KVM MMU code and mitigate the heavy cost of directly walking
> > the private page table, allocate one more page to copy the mirrored page
> 
> Here "copy" is a bit confusing for me.
> The mirrored page table is maintained by KVM, not copied from anywhere.

How about, "maintain" or "keep"?

> 
> > table for the KVM MMU code to directly walk.  Resolve the KVM page fault
> > with the existing code, and do additional operations necessary for the
> > private page table.  To distinguish such cases, the existing KVM page table
> > is called a shared page table (i.e., not associated with a private page
> > table), and the page table with a private page table is called a mirrored
> > page table.  The relationship is depicted below.
> > 
> > 
> >                KVM page fault                     |
> >                       |                           |
> >                       V                           |
> >          -------------+----------                 |
> >          |                      |                 |
> >          V                      V                 |
> >       shared GPA           private GPA            |
> >          |                      |                 |
> >          V                      V                 |
> >      shared PT root      mirrored PT root         |    private PT root
> >          |                      |                 |           |
> >          V                      V                 |           V
> >       shared PT           mirrored PT ----propagate---->  private PT
> >          |                      |                 |           |
> >          |                      \-----------------+------\    |
> >          |                                        |      |    |
> >          V                                        |      V    V
> >    shared guest page                              |    private guest page
> >                                                   |
> >                             non-encrypted memory  |    encrypted memory
> >                                                   |
> > PT: Page table
> > Shared PT: visible to KVM, and the CPU uses it for shared mappings.
> > Private PT: the CPU uses it, but it is invisible to KVM.  TDX module
> >              updates this table to map private guest pages.
> > Mirrored PT: It is visible to KVM, but the CPU doesn't use it.  KVM uses it
> >               to propagate PT change to the actual private PT.
> > 
> 
>
Binbin Wu March 28, 2024, 3:17 a.m. UTC | #10
On 3/28/2024 8:02 AM, Isaku Yamahata wrote:
> On Wed, Mar 27, 2024 at 09:49:14PM +0800,
> Binbin Wu <binbin.wu@linux.intel.com> wrote:
>
>>
>> On 3/15/2024 9:09 AM, Isaku Yamahata wrote:
>>> Here is the updated one. Renamed dummy -> mirroed.
>>>
>>> When KVM resolves the KVM page fault, it walks the page tables.  To reuse
>>> the existing KVM MMU code and mitigate the heavy cost of directly walking
>>> the private page table, allocate one more page to copy the mirrored page
>> Here "copy" is a bit confusing for me.
>> The mirrored page table is maintained by KVM, not copied from anywhere.
> How about, "maintain" or "keep"?

Or just use "for"?

i.e, allocate one more page for the mirrored page table ...



>
>>> table for the KVM MMU code to directly walk.  Resolve the KVM page fault
>>> with the existing code, and do additional operations necessary for the
>>> private page table.  To distinguish such cases, the existing KVM page table
>>> is called a shared page table (i.e., not associated with a private page
>>> table), and the page table with a private page table is called a mirrored
>>> page table.  The relationship is depicted below.
>>>
>>>
>>>                 KVM page fault                     |
>>>                        |                           |
>>>                        V                           |
>>>           -------------+----------                 |
>>>           |                      |                 |
>>>           V                      V                 |
>>>        shared GPA           private GPA            |
>>>           |                      |                 |
>>>           V                      V                 |
>>>       shared PT root      mirrored PT root         |    private PT root
>>>           |                      |                 |           |
>>>           V                      V                 |           V
>>>        shared PT           mirrored PT ----propagate---->  private PT
>>>           |                      |                 |           |
>>>           |                      \-----------------+------\    |
>>>           |                                        |      |    |
>>>           V                                        |      V    V
>>>     shared guest page                              |    private guest page
>>>                                                    |
>>>                              non-encrypted memory  |    encrypted memory
>>>                                                    |
>>> PT: Page table
>>> Shared PT: visible to KVM, and the CPU uses it for shared mappings.
>>> Private PT: the CPU uses it, but it is invisible to KVM.  TDX module
>>>               updates this table to map private guest pages.
>>> Mirrored PT: It is visible to KVM, but the CPU doesn't use it.  KVM uses it
>>>                to propagate PT change to the actual private PT.
>>>
>>
Yan Zhao April 22, 2024, 3:34 a.m. UTC | #11
On Mon, Feb 26, 2024 at 12:26:00AM -0800, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> +static inline void *kvm_mmu_private_spt(struct kvm_mmu_page *sp)
> +{
> +	return sp->private_spt;
> +}
> +
> +static inline void kvm_mmu_init_private_spt(struct kvm_mmu_page *sp, void *private_spt)
> +{
> +	sp->private_spt = private_spt;
> +}
This function is actually not used for initialization.
Instead, it's only called after failure of free_private_spt() in order to
intentionally leak the page to prevent kernel from accessing the encrypted page.

So to avoid confusion, how about renaming it to kvm_mmu_leak_private_spt() and
always resetting the pointer to NULL?

static inline void kvm_mmu_leak_private_spt(struct kvm_mmu_page *sp)
{
	sp->private_spt = NULL;
}

> +
> +static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> +{
> +	bool is_root = vcpu->arch.root_mmu.root_role.level == sp->role.level;
> +
> +	KVM_BUG_ON(!kvm_mmu_page_role_is_private(sp->role), vcpu->kvm);
> +	if (is_root)
> +		/*
> +		 * Because TDX module assigns root Secure-EPT page and set it to
> +		 * Secure-EPTP when TD vcpu is created, secure page table for
> +		 * root isn't needed.
> +		 */
> +		sp->private_spt = NULL;
> +	else {
> +		/*
> +		 * Because the TDX module doesn't trust VMM and initializes
> +		 * the pages itself, KVM doesn't initialize them.  Allocate
> +		 * pages with garbage and give them to the TDX module.
> +		 */
> +		sp->private_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_private_spt_cache);
> +		/*
> +		 * Because mmu_private_spt_cache is topped up before starting
> +		 * kvm page fault resolving, the allocation above shouldn't
> +		 * fail.
> +		 */
> +		WARN_ON_ONCE(!sp->private_spt);
> +	}
> +}
Isaku Yamahata April 22, 2024, 5:30 p.m. UTC | #12
On Mon, Apr 22, 2024 at 11:34:18AM +0800,
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Mon, Feb 26, 2024 at 12:26:00AM -0800, isaku.yamahata@intel.com wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > +static inline void *kvm_mmu_private_spt(struct kvm_mmu_page *sp)
> > +{
> > +	return sp->private_spt;
> > +}
> > +
> > +static inline void kvm_mmu_init_private_spt(struct kvm_mmu_page *sp, void *private_spt)
> > +{
> > +	sp->private_spt = private_spt;
> > +}
> This function is actually not used for initialization.
> Instead, it's only called after failure of free_private_spt() in order to
> intentionally leak the page to prevent kernel from accessing the encrypted page.
> 
> So to avoid confusion, how about renaming it to kvm_mmu_leak_private_spt() and
> always resetting the pointer to NULL?
> 
> static inline void kvm_mmu_leak_private_spt(struct kvm_mmu_page *sp)
> {
> 	sp->private_spt = NULL;
> }

The older version had a config to disable TDX TDP MMU at a compile time.  Now
we dropped the config so that we don't necessarily need wrapper function with
#ifdef. Now we have only single caller, I'll eliminate this wrapper function
(and related wrapper functions) by open code.
diff mbox series

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index dcc6f7c38a83..efd3fda1c177 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -825,6 +825,11 @@  struct kvm_vcpu_arch {
 	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
 	struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
 	struct kvm_mmu_memory_cache mmu_page_header_cache;
+	/*
+	 * This cache is to allocate private page table. E.g.  Secure-EPT used
+	 * by the TDX module.
+	 */
+	struct kvm_mmu_memory_cache mmu_private_spt_cache;
 
 	/*
 	 * QEMU userspace and the guest each have their own FPU state.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index eeebbc67e42b..0d6d4506ec97 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -685,6 +685,12 @@  static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
 				       1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
 	if (r)
 		return r;
+	if (kvm_gfn_shared_mask(vcpu->kvm)) {
+		r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_private_spt_cache,
+					       PT64_ROOT_MAX_LEVEL);
+		if (r)
+			return r;
+	}
 	r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
 				       PT64_ROOT_MAX_LEVEL);
 	if (r)
@@ -704,6 +710,7 @@  static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
+	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_private_spt_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
 }
 
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index e3f54701f98d..002f3f80bf3b 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -101,7 +101,21 @@  struct kvm_mmu_page {
 		int root_count;
 		refcount_t tdp_mmu_root_count;
 	};
-	unsigned int unsync_children;
+	union {
+		struct {
+			unsigned int unsync_children;
+			/*
+			 * Number of writes since the last time traversal
+			 * visited this page.
+			 */
+			atomic_t write_flooding_count;
+		};
+		/*
+		 * Associated private shadow page table, e.g. Secure-EPT page
+		 * passed to the TDX module.
+		 */
+		void *private_spt;
+	};
 	union {
 		struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
 		tdp_ptep_t ptep;
@@ -124,9 +138,6 @@  struct kvm_mmu_page {
 	int clear_spte_count;
 #endif
 
-	/* Number of writes since the last time traversal visited this page.  */
-	atomic_t write_flooding_count;
-
 #ifdef CONFIG_X86_64
 	/* Used for freeing the page asynchronously if it is a TDP MMU page. */
 	struct rcu_head rcu_head;
@@ -150,6 +161,50 @@  static inline bool is_private_sp(const struct kvm_mmu_page *sp)
 	return kvm_mmu_page_role_is_private(sp->role);
 }
 
+static inline void *kvm_mmu_private_spt(struct kvm_mmu_page *sp)
+{
+	return sp->private_spt;
+}
+
+static inline void kvm_mmu_init_private_spt(struct kvm_mmu_page *sp, void *private_spt)
+{
+	sp->private_spt = private_spt;
+}
+
+static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+{
+	bool is_root = vcpu->arch.root_mmu.root_role.level == sp->role.level;
+
+	KVM_BUG_ON(!kvm_mmu_page_role_is_private(sp->role), vcpu->kvm);
+	if (is_root)
+		/*
+		 * Because TDX module assigns root Secure-EPT page and set it to
+		 * Secure-EPTP when TD vcpu is created, secure page table for
+		 * root isn't needed.
+		 */
+		sp->private_spt = NULL;
+	else {
+		/*
+		 * Because the TDX module doesn't trust VMM and initializes
+		 * the pages itself, KVM doesn't initialize them.  Allocate
+		 * pages with garbage and give them to the TDX module.
+		 */
+		sp->private_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_private_spt_cache);
+		/*
+		 * Because mmu_private_spt_cache is topped up before starting
+		 * kvm page fault resolving, the allocation above shouldn't
+		 * fail.
+		 */
+		WARN_ON_ONCE(!sp->private_spt);
+	}
+}
+
+static inline void kvm_mmu_free_private_spt(struct kvm_mmu_page *sp)
+{
+	if (sp->private_spt)
+		free_page((unsigned long)sp->private_spt);
+}
+
 static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
 {
 	/*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 87233b3ceaef..d47f0daf1b03 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -53,6 +53,7 @@  void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
 
 static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
 {
+	kvm_mmu_free_private_spt(sp);
 	free_page((unsigned long)sp->spt);
 	kmem_cache_free(mmu_page_header_cache, sp);
 }