Message ID | 9d86b5a2787d20ffb5a58f86e43601a660521f16.1708933498.git.isaku.yamahata@intel.com (mailing list archive)
State | New, archived
Series | [v19,001/130] x86/virt/tdx: Rename _offset to _member for TD_SYSINFO_MAP() macro
On Mon, 2024-02-26 at 00:26 -0800, isaku.yamahata@intel.com wrote: > From: Isaku Yamahata <isaku.yamahata@intel.com> > > For private GPA, CPU refers a private page table whose contents are > encrypted. The dedicated APIs to operate on it (e.g. > updating/reading its > PTE entry) are used and their cost is expensive. > > When KVM resolves KVM page fault, it walks the page tables. To reuse > the > existing KVM MMU code and mitigate the heavy cost to directly walk > private > page table, allocate one more page to copy the dummy page table for > KVM MMU > code to directly walk. Resolve KVM page fault with the existing > code, and > do additional operations necessary for the private page table. > To > distinguish such cases, the existing KVM page table is called a > shared page > table (i.e. not associated with private page table), and the page > table > with private page table is called a private page table. This makes it sound like the dummy page table for the private alias is also called a shared page table, but in the drawing below it looks like only the shared alias is called "shared PT". > The relationship > is depicted below. > > Add a private pointer to struct kvm_mmu_page for private page table > and > add helper functions to allocate/initialize/free a private page table > page. > > KVM page fault | > | | > V | > -------------+---------- | > | | | > V V | > shared GPA private GPA | > | | | > V V | > shared PT root dummy PT root | private PT root > | | | | > V V | V > shared PT dummy PT ----propagate----> private PT > | | | | > | \-----------------+------\ | > | | | | > V | V V > shared guest page | private guest > page > | > non-encrypted memory | encrypted > memory > | > PT: page table > - Shared PT is visible to KVM and it is used by CPU. > - Private PT is used by CPU but it is invisible to KVM. > - Dummy PT is visible to KVM but not used by CPU. It is used to > propagate PT change to the actual private PT which is used by CPU. 
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> > Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> > --- > v19: > - typo in the comment in kvm_mmu_alloc_private_spt() > - drop CONFIG_KVM_MMU_PRIVATE > --- > arch/x86/include/asm/kvm_host.h | 5 +++ > arch/x86/kvm/mmu/mmu.c | 7 ++++ > arch/x86/kvm/mmu/mmu_internal.h | 63 ++++++++++++++++++++++++++++++- > -- > arch/x86/kvm/mmu/tdp_mmu.c | 1 + > 4 files changed, 72 insertions(+), 4 deletions(-) > > diff --git a/arch/x86/include/asm/kvm_host.h > b/arch/x86/include/asm/kvm_host.h > index dcc6f7c38a83..efd3fda1c177 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -825,6 +825,11 @@ struct kvm_vcpu_arch { > struct kvm_mmu_memory_cache mmu_shadow_page_cache; > struct kvm_mmu_memory_cache mmu_shadowed_info_cache; > struct kvm_mmu_memory_cache mmu_page_header_cache; > + /* > + * This cache is to allocate private page table. E.g. > Secure-EPT used > + * by the TDX module. > + */ > + struct kvm_mmu_memory_cache mmu_private_spt_cache; > > /* > * QEMU userspace and the guest each have their own FPU > state. 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index eeebbc67e42b..0d6d4506ec97 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -685,6 +685,12 @@ static int mmu_topup_memory_caches(struct > kvm_vcpu *vcpu, bool maybe_indirect) > 1 + PT64_ROOT_MAX_LEVEL + > PTE_PREFETCH_NUM); > if (r) > return r; > + if (kvm_gfn_shared_mask(vcpu->kvm)) { > + r = kvm_mmu_topup_memory_cache(&vcpu- > >arch.mmu_private_spt_cache, > + PT64_ROOT_MAX_LEVEL); > + if (r) > + return r; > + } > r = kvm_mmu_topup_memory_cache(&vcpu- > >arch.mmu_shadow_page_cache, > PT64_ROOT_MAX_LEVEL); > if (r) > @@ -704,6 +710,7 @@ static void mmu_free_memory_caches(struct > kvm_vcpu *vcpu) > kvm_mmu_free_memory_cache(&vcpu- > >arch.mmu_pte_list_desc_cache); > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache); > kvm_mmu_free_memory_cache(&vcpu- > >arch.mmu_shadowed_info_cache); > + kvm_mmu_free_memory_cache(&vcpu->arch.mmu_private_spt_cache); > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache); > } > > diff --git a/arch/x86/kvm/mmu/mmu_internal.h > b/arch/x86/kvm/mmu/mmu_internal.h > index e3f54701f98d..002f3f80bf3b 100644 > --- a/arch/x86/kvm/mmu/mmu_internal.h > +++ b/arch/x86/kvm/mmu/mmu_internal.h > @@ -101,7 +101,21 @@ struct kvm_mmu_page { > int root_count; > refcount_t tdp_mmu_root_count; > }; > - unsigned int unsync_children; > + union { > + struct { > + unsigned int unsync_children; > + /* > + * Number of writes since the last time > traversal > + * visited this page. > + */ > + atomic_t write_flooding_count; > + }; I think the point of putting these in a union is that they only apply to shadow paging and so can't be used with TDX. I think you are putting more than the sizeof(void *) in there as there are multiple in the same category. But there seems to be a new one added, *shadowed_translation. Should it go in there too? Is the union because there wasn't room before, or just to be tidy? 
I think the commit log should have more discussion of this union and maybe a comment in the struct to explain the purpose of the organization. Can you explain the reasoning now for the sake of discussion? > + /* > + * Associated private shadow page table, e.g. Secure- > EPT page > + * passed to the TDX module. > + */ > + void *private_spt; > + }; > union { > struct kvm_rmap_head parent_ptes; /* rmap pointers to > parent sptes */ > tdp_ptep_t ptep; > @@ -124,9 +138,6 @@ struct kvm_mmu_page { > int clear_spte_count; > #endif > > - /* Number of writes since the last time traversal visited > this page. */ > - atomic_t write_flooding_count; > - > #ifdef CONFIG_X86_64 > /* Used for freeing the page asynchronously if it is a TDP > MMU page. */ > struct rcu_head rcu_head; > @@ -150,6 +161,50 @@ static inline bool is_private_sp(const struct > kvm_mmu_page *sp) > return kvm_mmu_page_role_is_private(sp->role); > } > > +static inline void *kvm_mmu_private_spt(struct kvm_mmu_page *sp) > +{ > + return sp->private_spt; > +} > + > +static inline void kvm_mmu_init_private_spt(struct kvm_mmu_page *sp, > void *private_spt) > +{ > + sp->private_spt = private_spt; > +} > + > +static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, > struct kvm_mmu_page *sp) > +{ > + bool is_root = vcpu->arch.root_mmu.root_role.level == sp- > >role.level; > + > + KVM_BUG_ON(!kvm_mmu_page_role_is_private(sp->role), vcpu- > >kvm); > + if (is_root) > + /* > + * Because TDX module assigns root Secure-EPT page > and set it to > + * Secure-EPTP when TD vcpu is created, secure page > table for > + * root isn't needed. > + */ > + sp->private_spt = NULL; > + else { > + /* > + * Because the TDX module doesn't trust VMM and > initializes > + * the pages itself, KVM doesn't initialize them. > Allocate > + * pages with garbage and give them to the TDX > module. 
> + */ > + sp->private_spt = kvm_mmu_memory_cache_alloc(&vcpu- > >arch.mmu_private_spt_cache); > + /* > + * Because mmu_private_spt_cache is topped up before > starting > + * kvm page fault resolving, the allocation above > shouldn't > + * fail. > + */ > + WARN_ON_ONCE(!sp->private_spt); There is already a BUG_ON() for the allocation failure in kvm_mmu_memory_cache_alloc(). > + } > +} > + > +static inline void kvm_mmu_free_private_spt(struct kvm_mmu_page *sp) > +{ > + if (sp->private_spt) free_page() can accept NULL, so the above check is unneeded. > + free_page((unsigned long)sp->private_spt); > +} > + > static inline bool kvm_mmu_page_ad_need_write_protect(struct > kvm_mmu_page *sp) > { > /* > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c > index 87233b3ceaef..d47f0daf1b03 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.c > +++ b/arch/x86/kvm/mmu/tdp_mmu.c > @@ -53,6 +53,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm) > > static void tdp_mmu_free_sp(struct kvm_mmu_page *sp) > { > + kvm_mmu_free_private_spt(sp); This particular memcache zeros the allocations so it is safe to free this without regard to whether sp->private_spt has been set and that the allocation caller is not in place yet. It would be nice to add this detail in the log. > free_page((unsigned long)sp->spt); > kmem_cache_free(mmu_page_header_cache, sp); > }
On Wed, Mar 13, 2024 at 08:51:53PM +0000, "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote: > On Mon, 2024-02-26 at 00:26 -0800, isaku.yamahata@intel.com wrote: > > From: Isaku Yamahata <isaku.yamahata@intel.com> > > > > For private GPA, CPU refers a private page table whose contents are > > encrypted. The dedicated APIs to operate on it (e.g. > > updating/reading its > > PTE entry) are used and their cost is expensive. > > > > When KVM resolves KVM page fault, it walks the page tables. To reuse > > the > > existing KVM MMU code and mitigate the heavy cost to directly walk > > private > > page table, allocate one more page to copy the dummy page table for > > KVM MMU > > code to directly walk. Resolve KVM page fault with the existing > > code, and > > do additional operations necessary for the private page table. > > > To > > distinguish such cases, the existing KVM page table is called a > > shared page > > table (i.e. not associated with private page table), and the page > > table > > with private page table is called a private page table. > > This makes it sound like the dummy page table for the private alias is > also called a shared page table, but in the drawing below it looks like > only the shared alias is called "shared PT". How about this, Call the existing KVM page table associated with shared GPA as shared page table. Call the KVM page table associate with private GPA as private page table. > > The relationship > > is depicted below. > > > > Add a private pointer to struct kvm_mmu_page for private page table > > and > > add helper functions to allocate/initialize/free a private page table > > page. 
> > > > KVM page fault | > > | | > > V | > > -------------+---------- | > > | | | > > V V | > > shared GPA private GPA | > > | | | > > V V | > > shared PT root dummy PT root | private PT root > > | | | | > > V V | V > > shared PT dummy PT ----propagate----> private PT > > | | | | > > | \-----------------+------\ | > > | | | | > > V | V V > > shared guest page | private guest > > page > > | > > non-encrypted memory | encrypted > > memory > > | > > PT: page table > > - Shared PT is visible to KVM and it is used by CPU. > > - Private PT is used by CPU but it is invisible to KVM. > > - Dummy PT is visible to KVM but not used by CPU. It is used to > > propagate PT change to the actual private PT which is used by CPU. > > > > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> > > Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> > > --- ...snip... > > diff --git a/arch/x86/kvm/mmu/mmu_internal.h > > b/arch/x86/kvm/mmu/mmu_internal.h > > index e3f54701f98d..002f3f80bf3b 100644 > > --- a/arch/x86/kvm/mmu/mmu_internal.h > > +++ b/arch/x86/kvm/mmu/mmu_internal.h > > @@ -101,7 +101,21 @@ struct kvm_mmu_page { > > int root_count; > > refcount_t tdp_mmu_root_count; > > }; > > - unsigned int unsync_children; > > + union { > > + struct { > > + unsigned int unsync_children; > > + /* > > + * Number of writes since the last time > > traversal > > + * visited this page. > > + */ > > + atomic_t write_flooding_count; > > + }; > > I think the point of putting these in a union is that they only apply > to shadow paging and so can't be used with TDX. I think you are putting > more than the sizeof(void *) in there as there are multiple in the same > category. I'm not sure if I'm following you. On x86_64, sizeof(unsigned int) = 4, sizeof(atomic_t) = 4, sizeof(void *) = 8. I moved write_flooding_count to have 8 bytes. > But there seems to be a new one added, *shadowed_translation. > Should it go in there too? Is the union because there wasn't room > before, or just to be tidy? 
Originally TDX MMU support was implemented for the legacy TDP MMU. It used shadowed_translation, so that member was not an option at the time. Later we switched to the (new) TDP MMU. Now we have a choice of which member to overlay. > I think the commit log should have more discussion of this union and > maybe a comment in the struct to explain the purpose of the > organization. Can you explain the reasoning now for the sake of > discussion? Sure. We'd like to add a void * pointer to struct kvm_mmu_page. Given that some members are used only for legacy KVM MMUs and not used for the TDP MMU, we can save memory overhead with a union. We have options. - u64 *shadowed_translation: this was not chosen for the old implementation, but now it is an option. - pack unsync_children and write_flooding_count into 8 bytes: this patch chose this for historical reasons; the other two options are also possible. - unsync_child_bitmap: historically it was unioned with other members, but now it's not. I don't have a strong preference for TDX support as long as we can have the void *.
On 15/03/2024 7:10 am, Isaku Yamahata wrote: > On Wed, Mar 13, 2024 at 08:51:53PM +0000, > "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote: > >> On Mon, 2024-02-26 at 00:26 -0800, isaku.yamahata@intel.com wrote: >>> From: Isaku Yamahata <isaku.yamahata@intel.com> >>> >>> For private GPA, CPU refers a private page table whose contents are >>> encrypted. The dedicated APIs to operate on it (e.g. >>> updating/reading its >>> PTE entry) are used and their cost is expensive. >>> >>> When KVM resolves KVM page fault, it walks the page tables. To reuse >>> the >>> existing KVM MMU code and mitigate the heavy cost to directly walk >>> private >>> page table, allocate one more page to copy the dummy page table for >>> KVM MMU >>> code to directly walk. Resolve KVM page fault with the existing >>> code, and >>> do additional operations necessary for the private page table. >> >>> To >>> distinguish such cases, the existing KVM page table is called a >>> shared page >>> table (i.e. not associated with private page table), and the page >>> table >>> with private page table is called a private page table. >> >> This makes it sound like the dummy page table for the private alias is >> also called a shared page table, but in the drawing below it looks like >> only the shared alias is called "shared PT". > > How about this, > Call the existing KVM page table associated with shared GPA as shared page table. > Call the KVM page table associate with private GPA as private page table. > For the second one, are you talking about the *true* secure/private EPT page table used by hardware, or the one visible to KVM but not used by hardware? We have 3 page tables as you mentioned: PT: page table - Shared PT is visible to KVM and it is used by CPU. - Private PT is used by CPU but it is invisible to KVM. - Dummy PT is visible to KVM but not used by CPU. It is used to propagate PT change to the actual private PT which is used by CPU. 
If I recall correctly, we used to call the last one "mirrored (private) page table". I lost track of when we changed to "dummy page table", but it seems to me "mirrored" is better than "dummy" because the latter implies it is useless, when in fact it is used to propagate changes to the real private page table used by hardware. Btw, one nit, perhaps: "Shared PT is visible to KVM and it is used by CPU." -> "Shared PT is visible to KVM and it is used by CPU for shared mappings". To make it clearer that it is used for "shared mappings". But this may be unnecessary to others, so up to you.
On Fri, 2024-03-15 at 10:23 +1300, Huang, Kai wrote: > We have 3 page tables as you mentioned: > > PT: page table > - Shared PT is visible to KVM and it is used by CPU. > - Private PT is used by CPU but it is invisible to KVM. > - Dummy PT is visible to KVM but not used by CPU. It is used to > propagate PT change to the actual private PT which is used by CPU. > > If I recall correctly, we used to call the last one "mirrored > (private) > page table". > > I lost the tracking when we changed to use "dummy page table", but it > seems to me "mirrored" is better than "dummy" because the latter > means > it is useless but in fact it is used to propagate changes to the real > private page table used by hardware. Mirrored makes sense to me. So like: Private - Table actually mapping private alias, in TDX module Shared - Shared alias table, visible in KVM Mirror - Mirroring private, visible in KVM > > Btw, one nit, perhaps: > > "Shared PT is visible to KVM and it is used by CPU." -> "Shared PT is > visible to KVM and it is used by CPU for shared mappings". > > To make it more clearer it is used for "shared mappings". > > But this may be unnecessary to others, so up to you. Yep, this seems clearer.
On Thu, 2024-03-14 at 11:10 -0700, Isaku Yamahata wrote: > > I think the point of putting these in a union is that they only > > apply > > to shadow paging and so can't be used with TDX. I think you are > > putting > > more than the sizeof(void *) in there as there are multiple in the > > same > > category. > > I'm not sure if I'm following you. > On x86_64, sizeof(unsigned int) = 4, sizeof(atomic_t) = 4, > sizeof(void *) = 8. > I moved write_flooding_count to have 8 bytes. Ah, I see. Yes, you are right about it summing to 8. Ok, what do you think about putting a comment that these will always be unused with TDX? > > > > But there seems to be a new one added, *shadowed_translation. > > Should it go in there too? Is the union because there wasn't room > > before, or just to be tidy? > > Originally TDX MMU support was implemented for legacy tdp mmu. It > used > shadowed_translation. It was not an option at that time. Later we > switched to > (new) TDP MMU. Now we have choice to which member to overlay. > > > > I think the commit log should have more discussion of this union > > and > > maybe a comment in the struct to explain the purpose of the > > organization. Can you explain the reasoning now for the sake of > > discussion? > > Sure. We'd like to add void * pointer to struct kvm_mmu_page. Given > some > members are used only for legacy KVM MMUs and not used for TDP MMU, > we can save > memory overhead with union. We have options. > - u64 *shadowed_translation > This was not chosen for the old implementation. Now this is option. This seems a little more straightforward, but I'm on the fence if it's worth changing. > - pack unsync_children and write_flooding_count for 8 bytes > This patch chosen this for historical reason. Other two option is > possible. > - unsync_child_bitmap > Historically it was unioned with other members. But now it's not. > > I don't have strong preference for TDX support as long as we can have > void *.
On Thu, Mar 14, 2024 at 09:52:58PM +0000, "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote: > On Thu, 2024-03-14 at 11:10 -0700, Isaku Yamahata wrote: > > > I think the point of putting these in a union is that they only > > > apply > > > to shadow paging and so can't be used with TDX. I think you are > > > putting > > > more than the sizeof(void *) in there as there are multiple in the > > > same > > > category. > > > > I'm not sure if I'm following you. > > On x86_64, sizeof(unsigned int) = 4, sizeof(atomic_t) = 4, > > sizeof(void *) = 8. > > I moved write_flooding_count to have 8 bytes. > > Ah, I see. Yes you are write about it summing to 8. Ok, what do you > think about putting a comment that these will always be unused with > TDX? Ok, will add a comment. Also add some to the commit message.
On Thu, Mar 14, 2024 at 09:39:34PM +0000, "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote: > On Fri, 2024-03-15 at 10:23 +1300, Huang, Kai wrote: > > We have 3 page tables as you mentioned: > > > > PT: page table > > - Shared PT is visible to KVM and it is used by CPU. > > - Private PT is used by CPU but it is invisible to KVM. > > - Dummy PT is visible to KVM but not used by CPU. It is used to > > propagate PT change to the actual private PT which is used by CPU. > > > > If I recall correctly, we used to call the last one "mirrored > > (private) > > page table". > > > > I lost the tracking when we changed to use "dummy page table", but it > > seems to me "mirrored" is better than "dummy" because the latter > > means > > it is useless but in fact it is used to propagate changes to the real > > private page table used by hardware. > > Mirrored makes sense to me. So like: > > Private - Table actually mapping private alias, in TDX module > Shared - Shared alias table, visible in KVM > Mirror - Mirroring private, visible in KVM > > > > > Btw, one nit, perhaps: > > > > "Shared PT is visible to KVM and it is used by CPU." -> "Shared PT is > > visible to KVM and it is used by CPU for shared mappings". > > > > To make it more clearer it is used for "shared mappings". > > > > But this may be unnecessary to others, so up to you. > > Yep, this seems clearer. Here is the updated one. Renamed dummy -> mirrored. When KVM resolves the KVM page fault, it walks the page tables. To reuse the existing KVM MMU code and mitigate the heavy cost of directly walking the private page table, allocate one more page to copy the mirrored page table for the KVM MMU code to directly walk. Resolve the KVM page fault with the existing code, and do additional operations necessary for the private page table.
To distinguish such cases, the existing KVM page table is called a shared page table (i.e., not associated with a private page table), and the page table with a private page table is called a mirrored page table. The relationship is depicted below. KVM page fault | | | V | -------------+---------- | | | | V V | shared GPA private GPA | | | | V V | shared PT root mirrored PT root | private PT root | | | | V V | V shared PT mirrored PT ----propagate----> private PT | | | | | \-----------------+------\ | | | | | V | V V shared guest page | private guest page | non-encrypted memory | encrypted memory | PT: Page table Shared PT: visible to KVM, and the CPU uses it for shared mappings. Private PT: the CPU uses it, but it is invisible to KVM. TDX module updates this table to map private guest pages. Mirrored PT: It is visible to KVM, but the CPU doesn't use it. KVM uses it to propagate PT change to the actual private PT.
On 3/15/2024 9:09 AM, Isaku Yamahata wrote: > Here is the updated one. Renamed dummy -> mirroed. > > When KVM resolves the KVM page fault, it walks the page tables. To reuse > the existing KVM MMU code and mitigate the heavy cost of directly walking > the private page table, allocate one more page to copy the mirrored page Here "copy" is a bit confusing for me. The mirrored page table is maintained by KVM, not copied from anywhere. > table for the KVM MMU code to directly walk. Resolve the KVM page fault > with the existing code, and do additional operations necessary for the > private page table. To distinguish such cases, the existing KVM page table > is called a shared page table (i.e., not associated with a private page > table), and the page table with a private page table is called a mirrored > page table. The relationship is depicted below. > > > KVM page fault | > | | > V | > -------------+---------- | > | | | > V V | > shared GPA private GPA | > | | | > V V | > shared PT root mirrored PT root | private PT root > | | | | > V V | V > shared PT mirrored PT ----propagate----> private PT > | | | | > | \-----------------+------\ | > | | | | > V | V V > shared guest page | private guest page > | > non-encrypted memory | encrypted memory > | > PT: Page table > Shared PT: visible to KVM, and the CPU uses it for shared mappings. > Private PT: the CPU uses it, but it is invisible to KVM. TDX module > updates this table to map private guest pages. > Mirrored PT: It is visible to KVM, but the CPU doesn't use it. KVM uses it > to propagate PT change to the actual private PT. >
On Wed, Mar 27, 2024 at 09:49:14PM +0800, Binbin Wu <binbin.wu@linux.intel.com> wrote: > > > On 3/15/2024 9:09 AM, Isaku Yamahata wrote: > > Here is the updated one. Renamed dummy -> mirroed. > > > > When KVM resolves the KVM page fault, it walks the page tables. To reuse > > the existing KVM MMU code and mitigate the heavy cost of directly walking > > the private page table, allocate one more page to copy the mirrored page > > Here "copy" is a bit confusing for me. > The mirrored page table is maintained by KVM, not copied from anywhere. How about, "maintain" or "keep"? > > > table for the KVM MMU code to directly walk. Resolve the KVM page fault > > with the existing code, and do additional operations necessary for the > > private page table. To distinguish such cases, the existing KVM page table > > is called a shared page table (i.e., not associated with a private page > > table), and the page table with a private page table is called a mirrored > > page table. The relationship is depicted below. > > > > > > KVM page fault | > > | | > > V | > > -------------+---------- | > > | | | > > V V | > > shared GPA private GPA | > > | | | > > V V | > > shared PT root mirrored PT root | private PT root > > | | | | > > V V | V > > shared PT mirrored PT ----propagate----> private PT > > | | | | > > | \-----------------+------\ | > > | | | | > > V | V V > > shared guest page | private guest page > > | > > non-encrypted memory | encrypted memory > > | > > PT: Page table > > Shared PT: visible to KVM, and the CPU uses it for shared mappings. > > Private PT: the CPU uses it, but it is invisible to KVM. TDX module > > updates this table to map private guest pages. > > Mirrored PT: It is visible to KVM, but the CPU doesn't use it. KVM uses it > > to propagate PT change to the actual private PT. > > > >
On 3/28/2024 8:02 AM, Isaku Yamahata wrote: > On Wed, Mar 27, 2024 at 09:49:14PM +0800, > Binbin Wu <binbin.wu@linux.intel.com> wrote: > >> >> On 3/15/2024 9:09 AM, Isaku Yamahata wrote: >>> Here is the updated one. Renamed dummy -> mirroed. >>> >>> When KVM resolves the KVM page fault, it walks the page tables. To reuse >>> the existing KVM MMU code and mitigate the heavy cost of directly walking >>> the private page table, allocate one more page to copy the mirrored page >> Here "copy" is a bit confusing for me. >> The mirrored page table is maintained by KVM, not copied from anywhere. > How about, "maintain" or "keep"? Or just use "for"? i.e, allocate one more page for the mirrored page table ... > >>> table for the KVM MMU code to directly walk. Resolve the KVM page fault >>> with the existing code, and do additional operations necessary for the >>> private page table. To distinguish such cases, the existing KVM page table >>> is called a shared page table (i.e., not associated with a private page >>> table), and the page table with a private page table is called a mirrored >>> page table. The relationship is depicted below. >>> >>> >>> KVM page fault | >>> | | >>> V | >>> -------------+---------- | >>> | | | >>> V V | >>> shared GPA private GPA | >>> | | | >>> V V | >>> shared PT root mirrored PT root | private PT root >>> | | | | >>> V V | V >>> shared PT mirrored PT ----propagate----> private PT >>> | | | | >>> | \-----------------+------\ | >>> | | | | >>> V | V V >>> shared guest page | private guest page >>> | >>> non-encrypted memory | encrypted memory >>> | >>> PT: Page table >>> Shared PT: visible to KVM, and the CPU uses it for shared mappings. >>> Private PT: the CPU uses it, but it is invisible to KVM. TDX module >>> updates this table to map private guest pages. >>> Mirrored PT: It is visible to KVM, but the CPU doesn't use it. KVM uses it >>> to propagate PT change to the actual private PT. >>> >>
On Mon, Feb 26, 2024 at 12:26:00AM -0800, isaku.yamahata@intel.com wrote: > From: Isaku Yamahata <isaku.yamahata@intel.com> > +static inline void *kvm_mmu_private_spt(struct kvm_mmu_page *sp) > +{ > + return sp->private_spt; > +} > + > +static inline void kvm_mmu_init_private_spt(struct kvm_mmu_page *sp, void *private_spt) > +{ > + sp->private_spt = private_spt; > +} This function is actually not used for initialization. Instead, it's only called after failure of free_private_spt() in order to intentionally leak the page to prevent kernel from accessing the encrypted page. So to avoid confusion, how about renaming it to kvm_mmu_leak_private_spt() and always resetting the pointer to NULL? static inline void kvm_mmu_leak_private_spt(struct kvm_mmu_page *sp) { sp->private_spt = NULL; } > + > +static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp) > +{ > + bool is_root = vcpu->arch.root_mmu.root_role.level == sp->role.level; > + > + KVM_BUG_ON(!kvm_mmu_page_role_is_private(sp->role), vcpu->kvm); > + if (is_root) > + /* > + * Because TDX module assigns root Secure-EPT page and set it to > + * Secure-EPTP when TD vcpu is created, secure page table for > + * root isn't needed. > + */ > + sp->private_spt = NULL; > + else { > + /* > + * Because the TDX module doesn't trust VMM and initializes > + * the pages itself, KVM doesn't initialize them. Allocate > + * pages with garbage and give them to the TDX module. > + */ > + sp->private_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_private_spt_cache); > + /* > + * Because mmu_private_spt_cache is topped up before starting > + * kvm page fault resolving, the allocation above shouldn't > + * fail. > + */ > + WARN_ON_ONCE(!sp->private_spt); > + } > +}
On Mon, Apr 22, 2024 at 11:34:18AM +0800, Yan Zhao <yan.y.zhao@intel.com> wrote: > On Mon, Feb 26, 2024 at 12:26:00AM -0800, isaku.yamahata@intel.com wrote: > > From: Isaku Yamahata <isaku.yamahata@intel.com> > > +static inline void *kvm_mmu_private_spt(struct kvm_mmu_page *sp) > > +{ > > + return sp->private_spt; > > +} > > + > > +static inline void kvm_mmu_init_private_spt(struct kvm_mmu_page *sp, void *private_spt) > > +{ > > + sp->private_spt = private_spt; > > +} > This function is actually not used for initialization. > Instead, it's only called after failure of free_private_spt() in order to > intentionally leak the page to prevent kernel from accessing the encrypted page. > > So to avoid confusion, how about renaming it to kvm_mmu_leak_private_spt() and > always resetting the pointer to NULL? > > static inline void kvm_mmu_leak_private_spt(struct kvm_mmu_page *sp) > { > sp->private_spt = NULL; > } The older version had a config to disable the TDX TDP MMU at compile time. Now that we have dropped the config, we no longer need wrapper functions guarded by #ifdef. Since there is now only a single caller, I'll eliminate this wrapper function (and the related wrapper functions) by open-coding them.