diff mbox series

[19/21] KVM: TDX: Add an ioctl to create initial guest memory

Message ID 20240904030751.117579-20-rick.p.edgecombe@intel.com (mailing list archive)
State New, archived
Series TDX MMU Part 2

Commit Message

Rick Edgecombe Sept. 4, 2024, 3:07 a.m. UTC
From: Isaku Yamahata <isaku.yamahata@intel.com>

Add a new ioctl for the user space VMM to initialize guest memory with the
specified memory contents.

Because TDX protects the guest's memory, creating the initial guest
memory requires a dedicated TDX module API, TDH.MEM.PAGE.ADD(), instead
of directly copying the memory contents into the guest's memory as is
done for the default VM type.

Define a new subcommand, KVM_TDX_INIT_MEM_REGION, of vCPU-scoped
KVM_MEMORY_ENCRYPT_OP.  Check if the GFN is already pre-allocated, assign
the guest page in Secure-EPT, copy the initial memory contents into the
guest memory, and encrypt the guest memory.  Optionally, extend the memory
measurement of the TDX guest.

Discussion history:
- Originally, KVM_TDX_INIT_MEM_REGION reused the TDP MMU callbacks of the
  KVM page fault handler, issuing the TDX SEAMCALL deep in the call stack
  with the ioctl passing down the necessary parameters.  This was rejected
  in [2]; [3] suggests that the call to the TDX module should instead be
  invoked from a shallow call stack.

- Instead, introduce guest memory pre-population [1] that doesn't update
  the vendor-specific part (Secure-EPT in the TDX case), while the
  vendor-specific code (KVM_TDX_INIT_MEM_REGION) updates only the
  vendor-specific parts without modifying the KVM TDP MMU, as suggested
  at [4]:

    Crazy idea.  For TDX S-EPT, what if KVM_MAP_MEMORY does all of the
    SEPT.ADD stuff, which doesn't affect the measurement, and even fills in
    KVM's copy of the leaf EPTE, but tdx_sept_set_private_spte() doesn't do
    anything if the TD isn't finalized?

    Then KVM provides a dedicated TDX ioctl(), i.e. what is/was
    KVM_TDX_INIT_MEM_REGION, to do PAGE.ADD.  KVM_TDX_INIT_MEM_REGION
    wouldn't need to map anything, it would simply need to verify that the
    pfn from guest_memfd() is the same as what's in the TDP MMU.

- Use the common guest_memfd population function, kvm_gmem_populate()
  instead of a custom function.  It should check whether the PFN
  from TDP MMU is the same as the one from guest_memfd. [1]

- Instead of forcing userspace to do two passes, pre-map the guest
  initial memory in tdx_gmem_post_populate. [5]

Link: https://lore.kernel.org/kvm/20240419085927.3648704-1-pbonzini@redhat.com/ [1]
Link: https://lore.kernel.org/kvm/Zbrj5WKVgMsUFDtb@google.com/ [2]
Link: https://lore.kernel.org/kvm/Zh8DHbb8FzoVErgX@google.com/ [3]
Link: https://lore.kernel.org/kvm/Ze-TJh0BBOWm9spT@google.com/ [4]
Link: https://lore.kernel.org/kvm/CABgObfa=a3cKcKJHQRrCs-3Ty8ppSRou=dhi6Q+KdZnom0Zegw@mail.gmail.com/ [5]
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU part 2 v1:
 - Update the code according to latest gmem update.
   https://lore.kernel.org/kvm/CABgObfa=a3cKcKJHQRrCs-3Ty8ppSRou=dhi6Q+KdZnom0Zegw@mail.gmail.com/
 - Fix an alignment bug reported by Binbin.
 - Rename KVM_MEMORY_MAPPING => KVM_MAP_MEMORY (Sean)
 - Drop issuing TDH.MEM.PAGE.ADD() on KVM_MAP_MEMORY(), defer it to
   KVM_TDX_INIT_MEM_REGION. (Sean)
 - Added nr_premapped to track the number of premapped pages
 - Drop tdx_post_mmu_map_page().
 - Drop kvm_slot_can_be_private() check (Paolo)
 - Use kvm_tdp_mmu_gpa_is_mapped() (Paolo)

v19:
 - Switched to use KVM_MEMORY_MAPPING
 - Dropped measurement extension
 - Updated commit message. private_page_add() => set_private_spte()
---
 arch/x86/include/uapi/asm/kvm.h |   9 ++
 arch/x86/kvm/vmx/tdx.c          | 150 ++++++++++++++++++++++++++++++++
 virt/kvm/kvm_main.c             |   1 +
 3 files changed, 160 insertions(+)

Comments

Yan Zhao Sept. 4, 2024, 4:53 a.m. UTC | #1
On Tue, Sep 03, 2024 at 08:07:49PM -0700, Rick Edgecombe wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Add a new ioctl for the user space VMM to initialize guest memory with the
> specified memory contents.
> 
> Because TDX protects the guest's memory, the creation of the initial guest
> memory requires a dedicated TDX module API, TDH.MEM.PAGE.ADD(), instead of
> directly copying the memory contents into the guest's memory in the case of
> the default VM type.
> 
> Define a new subcommand, KVM_TDX_INIT_MEM_REGION, of vCPU-scoped
> KVM_MEMORY_ENCRYPT_OP.  Check if the GFN is already pre-allocated, assign
> the guest page in Secure-EPT, copy the initial memory contents into the
> guest memory, and encrypt the guest memory.  Optionally, extend the memory
> measurement of the TDX guest.
> 
> Discussion history:
> - Originally, KVM_TDX_INIT_MEM_REGION used the callback of the TDP MMU of
>   the KVM page fault handler.  It issues TDX SEAMCALL deep in the call
>   stack, and the ioctl passes down the necessary parameters.  [2] rejected
>   it.  [3] suggests that the call to the TDX module should be invoked in a
>   shallow call stack.
> 
> - Instead, introduce guest memory pre-population [1] that doesn't update
>   vendor-specific part (Secure-EPT in TDX case) and the vendor-specific
>   code (KVM_TDX_INIT_MEM_REGION) updates only vendor-specific parts without
>   modifying the KVM TDP MMU suggested at [4]
> 
>     Crazy idea.  For TDX S-EPT, what if KVM_MAP_MEMORY does all of the
>     SEPT.ADD stuff, which doesn't affect the measurement, and even fills in
>     KVM's copy of the leaf EPTE, but tdx_sept_set_private_spte() doesn't do
>     anything if the TD isn't finalized?
> 
>     Then KVM provides a dedicated TDX ioctl(), i.e. what is/was
>     KVM_TDX_INIT_MEM_REGION, to do PAGE.ADD.  KVM_TDX_INIT_MEM_REGION
>     wouldn't need to map anything, it would simply need to verify that the
>     pfn from guest_memfd() is the same as what's in the TDP MMU.
> 
> - Use the common guest_memfd population function, kvm_gmem_populate()
>   instead of a custom function.  It should check whether the PFN
>   from TDP MMU is the same as the one from guest_memfd. [1]
> 
> - Instead of forcing userspace to do two passes, pre-map the guest
>   initial memory in tdx_gmem_post_populate. [5]
> 
> Link: https://lore.kernel.org/kvm/20240419085927.3648704-1-pbonzini@redhat.com/ [1]
> Link: https://lore.kernel.org/kvm/Zbrj5WKVgMsUFDtb@google.com/ [2]
> Link: https://lore.kernel.org/kvm/Zh8DHbb8FzoVErgX@google.com/ [3]
> Link: https://lore.kernel.org/kvm/Ze-TJh0BBOWm9spT@google.com/ [4]
> Link: https://lore.kernel.org/kvm/CABgObfa=a3cKcKJHQRrCs-3Ty8ppSRou=dhi6Q+KdZnom0Zegw@mail.gmail.com/ [5]
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> TDX MMU part 2 v1:
>  - Update the code according to latest gmem update.
>    https://lore.kernel.org/kvm/CABgObfa=a3cKcKJHQRrCs-3Ty8ppSRou=dhi6Q+KdZnom0Zegw@mail.gmail.com/
>  - Fix an alignment bug reported by Binbin.
>  - Rename KVM_MEMORY_MAPPING => KVM_MAP_MEMORY (Sean)
>  - Drop issuing TDH.MEM.PAGE.ADD() on KVM_MAP_MEMORY(), defer it to
>    KVM_TDX_INIT_MEM_REGION. (Sean)
>  - Added nr_premapped to track the number of premapped pages
>  - Drop tdx_post_mmu_map_page().
>  - Drop kvm_slot_can_be_private() check (Paolo)
>  - Use kvm_tdp_mmu_gpa_is_mapped() (Paolo)
> 
> v19:
>  - Switched to use KVM_MEMORY_MAPPING
>  - Dropped measurement extension
>  - updated commit message. private_page_add() => set_private_spte()
> ---
>  arch/x86/include/uapi/asm/kvm.h |   9 ++
>  arch/x86/kvm/vmx/tdx.c          | 150 ++++++++++++++++++++++++++++++++
>  virt/kvm/kvm_main.c             |   1 +
>  3 files changed, 160 insertions(+)
> 
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 39636be5c891..789d1d821b4f 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -931,6 +931,7 @@ enum kvm_tdx_cmd_id {
>  	KVM_TDX_CAPABILITIES = 0,
>  	KVM_TDX_INIT_VM,
>  	KVM_TDX_INIT_VCPU,
> +	KVM_TDX_INIT_MEM_REGION,
>  	KVM_TDX_GET_CPUID,
>  
>  	KVM_TDX_CMD_NR_MAX,
> @@ -996,4 +997,12 @@ struct kvm_tdx_init_vm {
>  	struct kvm_cpuid2 cpuid;
>  };
>  
> +#define KVM_TDX_MEASURE_MEMORY_REGION   _BITULL(0)
> +
> +struct kvm_tdx_init_mem_region {
> +	__u64 source_addr;
> +	__u64 gpa;
> +	__u64 nr_pages;
> +};
> +
>  #endif /* _ASM_X86_KVM_H */
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 50ce24905062..796d1a495a66 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -8,6 +8,7 @@
>  #include "tdx_ops.h"
>  #include "vmx.h"
>  #include "mmu/spte.h"
> +#include "common.h"
>  
>  #undef pr_fmt
>  #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> @@ -1586,6 +1587,152 @@ static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
>  	return 0;
>  }
>  
> +struct tdx_gmem_post_populate_arg {
> +	struct kvm_vcpu *vcpu;
> +	__u32 flags;
> +};
> +
> +static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> +				  void __user *src, int order, void *_arg)
> +{
> +	u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS;
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	struct tdx_gmem_post_populate_arg *arg = _arg;
> +	struct kvm_vcpu *vcpu = arg->vcpu;
> +	gpa_t gpa = gfn_to_gpa(gfn);
> +	u8 level = PG_LEVEL_4K;
> +	struct page *page;
> +	int ret, i;
> +	u64 err, entry, level_state;
> +
> +	/*
> +	 * Get the source page if it has been faulted in. Return failure if the
> +	 * source page has been swapped out or unmapped in primary memory.
> +	 */
> +	ret = get_user_pages_fast((unsigned long)src, 1, 0, &page);
> +	if (ret < 0)
> +		return ret;
> +	if (ret != 1)
> +		return -ENOMEM;
> +
> +	if (!kvm_mem_is_private(kvm, gfn)) {
> +		ret = -EFAULT;
> +		goto out_put_page;
> +	}
> +
> +	ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
> +	if (ret < 0)
> +		goto out_put_page;
> +
> +	read_lock(&kvm->mmu_lock);
Although mirrored root can't be zapped with shared lock currently, is it
better to hold write_lock() here?

It should bring no extra overhead in a normal condition when the
tdx_gmem_post_populate() is called.

> +
> +	if (!kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa)) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +
> +	ret = 0;
> +	do {
> +		err = tdh_mem_page_add(kvm_tdx, gpa, pfn_to_hpa(pfn),
> +				       pfn_to_hpa(page_to_pfn(page)),
> +				       &entry, &level_state);
> +	} while (err == TDX_ERROR_SEPT_BUSY);
> +	if (err) {
> +		ret = -EIO;
> +		goto out;
> +	}
> +
> +	WARN_ON_ONCE(!atomic64_read(&kvm_tdx->nr_premapped));
> +	atomic64_dec(&kvm_tdx->nr_premapped);
> +
> +	if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) {
> +		for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
> +			err = tdh_mr_extend(kvm_tdx, gpa + i, &entry,
> +					&level_state);
> +			if (err) {
> +				ret = -EIO;
> +				break;
> +			}
> +		}
> +	}
> +
> +out:
> +	read_unlock(&kvm->mmu_lock);
> +out_put_page:
> +	put_page(page);
> +	return ret;
> +}
> +
> +static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
> +{
> +	struct kvm *kvm = vcpu->kvm;
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	struct kvm_tdx_init_mem_region region;
> +	struct tdx_gmem_post_populate_arg arg;
> +	long gmem_ret;
> +	int ret;
> +
> +	if (!to_tdx(vcpu)->initialized)
> +		return -EINVAL;
> +
> +	/* Once TD is finalized, the initial guest memory is fixed. */
> +	if (is_td_finalized(kvm_tdx))
> +		return -EINVAL;
> +
> +	if (cmd->flags & ~KVM_TDX_MEASURE_MEMORY_REGION)
> +		return -EINVAL;
> +
> +	if (copy_from_user(&region, u64_to_user_ptr(cmd->data), sizeof(region)))
> +		return -EFAULT;
> +
> +	if (!PAGE_ALIGNED(region.source_addr) || !PAGE_ALIGNED(region.gpa) ||
> +	    !region.nr_pages ||
> +	    region.gpa + (region.nr_pages << PAGE_SHIFT) <= region.gpa ||
> +	    !kvm_is_private_gpa(kvm, region.gpa) ||
> +	    !kvm_is_private_gpa(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT) - 1))
> +		return -EINVAL;
> +
> +	mutex_lock(&kvm->slots_lock);
> +
> +	kvm_mmu_reload(vcpu);
> +	ret = 0;
> +	while (region.nr_pages) {
> +		if (signal_pending(current)) {
> +			ret = -EINTR;
> +			break;
> +		}
> +
> +		arg = (struct tdx_gmem_post_populate_arg) {
> +			.vcpu = vcpu,
> +			.flags = cmd->flags,
> +		};
> +		gmem_ret = kvm_gmem_populate(kvm, gpa_to_gfn(region.gpa),
> +					     u64_to_user_ptr(region.source_addr),
> +					     1, tdx_gmem_post_populate, &arg);
> +		if (gmem_ret < 0) {
> +			ret = gmem_ret;
> +			break;
> +		}
> +
> +		if (gmem_ret != 1) {
> +			ret = -EIO;
> +			break;
> +		}
> +
> +		region.source_addr += PAGE_SIZE;
> +		region.gpa += PAGE_SIZE;
> +		region.nr_pages--;
> +
> +		cond_resched();
> +	}
> +
> +	mutex_unlock(&kvm->slots_lock);
> +
> +	if (copy_to_user(u64_to_user_ptr(cmd->data), &region, sizeof(region)))
> +		ret = -EFAULT;
> +	return ret;
> +}
> +
>  int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
>  {
>  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> @@ -1605,6 +1752,9 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
>  	case KVM_TDX_INIT_VCPU:
>  		ret = tdx_vcpu_init(vcpu, &cmd);
>  		break;
> +	case KVM_TDX_INIT_MEM_REGION:
> +		ret = tdx_vcpu_init_mem_region(vcpu, &cmd);
> +		break;
>  	case KVM_TDX_GET_CPUID:
>  		ret = tdx_vcpu_get_cpuid(vcpu, &cmd);
>  		break;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 73fc3334721d..0822db480719 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2639,6 +2639,7 @@ struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn
>  
>  	return NULL;
>  }
> +EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_memslot);
>  
>  bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
>  {
> -- 
> 2.34.1
>
Rick Edgecombe Sept. 4, 2024, 1:56 p.m. UTC | #2
On Tue, 2024-09-03 at 20:07 -0700, Rick Edgecombe wrote:
> +static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> +                                 void __user *src, int order, void *_arg)
> +{
> +       u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS;
> +       struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +       struct tdx_gmem_post_populate_arg *arg = _arg;
> +       struct kvm_vcpu *vcpu = arg->vcpu;
> +       gpa_t gpa = gfn_to_gpa(gfn);
> +       u8 level = PG_LEVEL_4K;
> +       struct page *page;
> +       int ret, i;
> +       u64 err, entry, level_state;
> +
> +       /*
> +        * Get the source page if it has been faulted in. Return failure if
> the
> +        * source page has been swapped out or unmapped in primary memory.
> +        */
> +       ret = get_user_pages_fast((unsigned long)src, 1, 0, &page);
> +       if (ret < 0)
> +               return ret;
> +       if (ret != 1)
> +               return -ENOMEM;
> +
> +       if (!kvm_mem_is_private(kvm, gfn)) {
> +               ret = -EFAULT;
> +               goto out_put_page;
> +       }

Paolo had said he was going to add this check in the gmem code. I thought it
was not added, but it actually is. So we can drop this check.
Rick Edgecombe Sept. 4, 2024, 2:01 p.m. UTC | #3
On Wed, 2024-09-04 at 12:53 +0800, Yan Zhao wrote:
> > +       if (!kvm_mem_is_private(kvm, gfn)) {
> > +               ret = -EFAULT;
> > +               goto out_put_page;
> > +       }
> > +
> > +       ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
> > +       if (ret < 0)
> > +               goto out_put_page;
> > +
> > +       read_lock(&kvm->mmu_lock);
> Although mirrored root can't be zapped with shared lock currently, is it
> better to hold write_lock() here?
> 
> It should bring no extra overhead in a normal condition when the
> tdx_gmem_post_populate() is called.

I think we should hold the weakest lock we can. Otherwise someday someone could
run into it and think the write_lock() is required. It will add confusion.

What was the benefit of a write lock? Just in case we got it wrong?
Rick Edgecombe Sept. 6, 2024, 4:30 p.m. UTC | #4
On Wed, 2024-09-04 at 07:01 -0700, Rick Edgecombe wrote:
> On Wed, 2024-09-04 at 12:53 +0800, Yan Zhao wrote:
> > > +       if (!kvm_mem_is_private(kvm, gfn)) {
> > > +               ret = -EFAULT;
> > > +               goto out_put_page;
> > > +       }
> > > +
> > > +       ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
> > > +       if (ret < 0)
> > > +               goto out_put_page;
> > > +
> > > +       read_lock(&kvm->mmu_lock);
> > Although mirrored root can't be zapped with shared lock currently, is it
> > better to hold write_lock() here?
> > 
> > It should bring no extra overhead in a normal condition when the
> > tdx_gmem_post_populate() is called.
> 
> I think we should hold the weakest lock we can. Otherwise someday someone
> could
> run into it and think the write_lock() is required. It will add confusion.
> 
> What was the benefit of a write lock? Just in case we got it wrong?

I just tried to draft a comment to make it look less weird, but I think actually
even the mmu_read lock is technically unnecessary because we hold both
filemap_invalidate_lock() and slots_lock. The cases we care about:
 memslot deletion - slots_lock protects
 gmem hole punch - filemap_invalidate_lock() protects
 set attributes - slots_lock protects
 others?

So I guess all the mirror zapping cases that could execute concurrently are
already covered by other locks. If we skipped grabbing the mmu lock completely
it would trigger the assertion in kvm_tdp_mmu_gpa_is_mapped(). Removing the
assert would probably make kvm_tdp_mmu_gpa_is_mapped() a bit dangerous. Hmm. 

Maybe a comment like this:
/*
 * The case to care about here is a PTE getting zapped concurrently and 
 * this function erroneously thinking a page is mapped in the mirror EPT.
 * The private mem zapping paths are already covered by other locks held
 * here, but grab an mmu read_lock to not trigger the assert in
 * kvm_tdp_mmu_gpa_is_mapped().
 */

Yan, do you think it is sufficient?
Yan Zhao Sept. 9, 2024, 1:29 a.m. UTC | #5
On Sat, Sep 07, 2024 at 12:30:00AM +0800, Edgecombe, Rick P wrote:
> On Wed, 2024-09-04 at 07:01 -0700, Rick Edgecombe wrote:
> > On Wed, 2024-09-04 at 12:53 +0800, Yan Zhao wrote:
> > > > +       if (!kvm_mem_is_private(kvm, gfn)) {
> > > > +               ret = -EFAULT;
> > > > +               goto out_put_page;
> > > > +       }
> > > > +
> > > > +       ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
> > > > +       if (ret < 0)
> > > > +               goto out_put_page;
> > > > +
> > > > +       read_lock(&kvm->mmu_lock);
> > > Although mirrored root can't be zapped with shared lock currently, is it
> > > better to hold write_lock() here?
> > > 
> > > It should bring no extra overhead in a normal condition when the
> > > tdx_gmem_post_populate() is called.
> > 
> > I think we should hold the weakest lock we can. Otherwise someday someone
> > could
> > run into it and think the write_lock() is required. It will add confusion.
> > 
> > What was the benefit of a write lock? Just in case we got it wrong?
> 
> I just tried to draft a comment to make it look less weird, but I think actually
> even the mmu_read lock is technically unnecessary because we hold both
> filemap_invalidate_lock() and slots_lock. The cases we care about:
>  memslot deletion - slots_lock protects
>  gmem hole punch - filemap_invalidate_lock() protects
>  set attributes - slots_lock protects
>  others?
> 
> So I guess all the mirror zapping cases that could execute concurrently are
> already covered by other locks. If we skipped grabbing the mmu lock completely
> it would trigger the assertion in kvm_tdp_mmu_gpa_is_mapped(). Removing the
> assert would probably make kvm_tdp_mmu_gpa_is_mapped() a bit dangerous. Hmm. 
> 
> Maybe a comment like this:
> /*
>  * The case to care about here is a PTE getting zapped concurrently and 
>  * this function erroneously thinking a page is mapped in the mirror EPT.
>  * The private mem zapping paths are already covered by other locks held
>  * here, but grab an mmu read_lock to not trigger the assert in
>  * kvm_tdp_mmu_gpa_is_mapped().
>  */
> 
> Yan, do you think it is sufficient?
Yes, with current code base, I think it's sufficient. Thanks!

I asked that question just to confirm whether we need to guard against the
potential removal of an SPTE under the shared lock, given that the change is
small and KVM_TDX_INIT_MEM_REGION is not on a performance-critical path.
Paolo Bonzini Sept. 10, 2024, 10:13 a.m. UTC | #6
On 9/6/24 18:30, Edgecombe, Rick P wrote:
> /*
>   * The case to care about here is a PTE getting zapped concurrently and
>   * this function erroneously thinking a page is mapped in the mirror EPT.
>   * The private mem zapping paths are already covered by other locks held
>   * here, but grab an mmu read_lock to not trigger the assert in
>   * kvm_tdp_mmu_gpa_is_mapped().
>   */
> 
> Yan, do you think it is sufficient?

If you're actually requiring that the other locks are sufficient, then 
there can be no ENOENT.

Maybe:

	/*
	 * The private mem cannot be zapped after kvm_tdp_map_page()
	 * because all paths are covered by slots_lock and the
	 * filemap invalidate lock.  Check that they are indeed enough.
	 */
	if (IS_ENABLED(CONFIG_KVM_PROVE_MMU)) {
		scoped_guard(read_lock, &kvm->mmu_lock) {
		if (KVM_BUG_ON(kvm,
			       !kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa))) {
				ret = -EIO;
				goto out;
			}
		}
	}

Paolo
Paolo Bonzini Sept. 10, 2024, 10:16 a.m. UTC | #7
On 9/4/24 05:07, Rick Edgecombe wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Add a new ioctl for the user space VMM to initialize guest memory with the
> specified memory contents.
> 
> Because TDX protects the guest's memory, the creation of the initial guest
> memory requires a dedicated TDX module API, TDH.MEM.PAGE.ADD(), instead of
> directly copying the memory contents into the guest's memory in the case of
> the default VM type.
> 
> Define a new subcommand, KVM_TDX_INIT_MEM_REGION, of vCPU-scoped
> KVM_MEMORY_ENCRYPT_OP.  Check if the GFN is already pre-allocated, assign
> the guest page in Secure-EPT, copy the initial memory contents into the
> guest memory, and encrypt the guest memory.  Optionally, extend the memory
> measurement of the TDX guest.
> 
> Discussion history:

While useful for the reviewers, in the end this is the simplest possible 
userspace API (the one that we started with) and the objections just 
went away because it reuses the infrastructure that was introduced for 
pre-faulting memory.

So I'd replace everything with:

---
The ioctl uses the vCPU file descriptor because of the TDX module's 
requirement that the memory is added to the S-EPT (via TDH.MEM.SEPT.ADD) 
prior to initialization (TDH.MEM.PAGE.ADD).  Accessing the MMU in turn 
requires a vCPU file descriptor, just like for KVM_PRE_FAULT_MEMORY.  In 
fact, the post-populate callback is able to reuse the same logic used by 
KVM_PRE_FAULT_MEMORY, so that userspace can do everything with a single 
ioctl.

Note that this is the only way to invoke TDH.MEM.SEPT.ADD before the TD 
is finalized, as userspace cannot use KVM_PRE_FAULT_MEMORY at that 
point.  This ensures that there cannot be pages in the S-EPT awaiting 
TDH.MEM.PAGE.ADD, which would be treated incorrectly as spurious by 
tdp_mmu_map_handle_target_level() (KVM would see the SPTE as PRESENT, 
but the corresponding S-EPT entry will be !PRESENT).
---

Part of the second paragraph comes from your link [4], 
https://lore.kernel.org/kvm/Ze-TJh0BBOWm9spT@google.com/, but updated 
for recent changes to KVM_PRE_FAULT_MEMORY.

This drops the historical information that is not particularly relevant 
for the future; it updates what's relevant to mention changes done for 
SEV-SNP, and it also preserves most of the other information:

* why the vCPU file descriptor

* the desirability of a single ioctl for userspace

* the relationship between KVM_TDX_INIT_MEM_REGION and KVM_PRE_FAULT_MEMORY

Paolo
Rick Edgecombe Sept. 11, 2024, 12:11 a.m. UTC | #8
On Tue, 2024-09-10 at 12:13 +0200, Paolo Bonzini wrote:
> > Yan, do you think it is sufficient?
> 
> If you're actually requiring that the other locks are sufficient, then 
> there can be no ENOENT.
> 
> Maybe:
> 
>         /*
>          * The private mem cannot be zapped after kvm_tdp_map_page()
>          * because all paths are covered by slots_lock and the
>          * filemap invalidate lock.  Check that they are indeed enough.
>          */
>         if (IS_ENABLED(CONFIG_KVM_PROVE_MMU)) {
>                 scoped_guard(read_lock, &kvm->mmu_lock) {
>                         if (KVM_BUG_ON(kvm,
>                                 !kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa)) {
>                                 ret = -EIO;
>                                 goto out;
>                         }
>                 }
>         }

True. We can put it behind CONFIG_KVM_PROVE_MMU.
Rick Edgecombe Sept. 11, 2024, 12:12 a.m. UTC | #9
On Tue, 2024-09-10 at 12:16 +0200, Paolo Bonzini wrote:
> While useful for the reviewers, in the end this is the simplest possible 
> userspace API (the one that we started with) and the objections just 
> went away because it reuses the infrastructure that was introduced for 
> pre-faulting memory.
> 
> So I'd replace everything with:

Sure, thanks.

Patch

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 39636be5c891..789d1d821b4f 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -931,6 +931,7 @@  enum kvm_tdx_cmd_id {
 	KVM_TDX_CAPABILITIES = 0,
 	KVM_TDX_INIT_VM,
 	KVM_TDX_INIT_VCPU,
+	KVM_TDX_INIT_MEM_REGION,
 	KVM_TDX_GET_CPUID,
 
 	KVM_TDX_CMD_NR_MAX,
@@ -996,4 +997,12 @@  struct kvm_tdx_init_vm {
 	struct kvm_cpuid2 cpuid;
 };
 
+#define KVM_TDX_MEASURE_MEMORY_REGION   _BITULL(0)
+
+struct kvm_tdx_init_mem_region {
+	__u64 source_addr;
+	__u64 gpa;
+	__u64 nr_pages;
+};
+
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 50ce24905062..796d1a495a66 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -8,6 +8,7 @@ 
 #include "tdx_ops.h"
 #include "vmx.h"
 #include "mmu/spte.h"
+#include "common.h"
 
 #undef pr_fmt
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
@@ -1586,6 +1587,152 @@  static int tdx_vcpu_init(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
 	return 0;
 }
 
+struct tdx_gmem_post_populate_arg {
+	struct kvm_vcpu *vcpu;
+	__u32 flags;
+};
+
+static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
+				  void __user *src, int order, void *_arg)
+{
+	u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS;
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	struct tdx_gmem_post_populate_arg *arg = _arg;
+	struct kvm_vcpu *vcpu = arg->vcpu;
+	gpa_t gpa = gfn_to_gpa(gfn);
+	u8 level = PG_LEVEL_4K;
+	struct page *page;
+	int ret, i;
+	u64 err, entry, level_state;
+
+	/*
+	 * Get the source page if it has been faulted in. Return failure if the
+	 * source page has been swapped out or unmapped in primary memory.
+	 */
+	ret = get_user_pages_fast((unsigned long)src, 1, 0, &page);
+	if (ret < 0)
+		return ret;
+	if (ret != 1)
+		return -ENOMEM;
+
+	if (!kvm_mem_is_private(kvm, gfn)) {
+		ret = -EFAULT;
+		goto out_put_page;
+	}
+
+	ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
+	if (ret < 0)
+		goto out_put_page;
+
+	read_lock(&kvm->mmu_lock);
+
+	if (!kvm_tdp_mmu_gpa_is_mapped(vcpu, gpa)) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	ret = 0;
+	do {
+		err = tdh_mem_page_add(kvm_tdx, gpa, pfn_to_hpa(pfn),
+				       pfn_to_hpa(page_to_pfn(page)),
+				       &entry, &level_state);
+	} while (err == TDX_ERROR_SEPT_BUSY);
+	if (err) {
+		ret = -EIO;
+		goto out;
+	}
+
+	WARN_ON_ONCE(!atomic64_read(&kvm_tdx->nr_premapped));
+	atomic64_dec(&kvm_tdx->nr_premapped);
+
+	if (arg->flags & KVM_TDX_MEASURE_MEMORY_REGION) {
+		for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
+			err = tdh_mr_extend(kvm_tdx, gpa + i, &entry,
+					&level_state);
+			if (err) {
+				ret = -EIO;
+				break;
+			}
+		}
+	}
+
+out:
+	read_unlock(&kvm->mmu_lock);
+out_put_page:
+	put_page(page);
+	return ret;
+}
+
+static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu, struct kvm_tdx_cmd *cmd)
+{
+	struct kvm *kvm = vcpu->kvm;
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	struct kvm_tdx_init_mem_region region;
+	struct tdx_gmem_post_populate_arg arg;
+	long gmem_ret;
+	int ret;
+
+	if (!to_tdx(vcpu)->initialized)
+		return -EINVAL;
+
+	/* Once TD is finalized, the initial guest memory is fixed. */
+	if (is_td_finalized(kvm_tdx))
+		return -EINVAL;
+
+	if (cmd->flags & ~KVM_TDX_MEASURE_MEMORY_REGION)
+		return -EINVAL;
+
+	if (copy_from_user(&region, u64_to_user_ptr(cmd->data), sizeof(region)))
+		return -EFAULT;
+
+	if (!PAGE_ALIGNED(region.source_addr) || !PAGE_ALIGNED(region.gpa) ||
+	    !region.nr_pages ||
+	    region.gpa + (region.nr_pages << PAGE_SHIFT) <= region.gpa ||
+	    !kvm_is_private_gpa(kvm, region.gpa) ||
+	    !kvm_is_private_gpa(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT) - 1))
+		return -EINVAL;
+
+	mutex_lock(&kvm->slots_lock);
+
+	kvm_mmu_reload(vcpu);
+	ret = 0;
+	while (region.nr_pages) {
+		if (signal_pending(current)) {
+			ret = -EINTR;
+			break;
+		}
+
+		arg = (struct tdx_gmem_post_populate_arg) {
+			.vcpu = vcpu,
+			.flags = cmd->flags,
+		};
+		gmem_ret = kvm_gmem_populate(kvm, gpa_to_gfn(region.gpa),
+					     u64_to_user_ptr(region.source_addr),
+					     1, tdx_gmem_post_populate, &arg);
+		if (gmem_ret < 0) {
+			ret = gmem_ret;
+			break;
+		}
+
+		if (gmem_ret != 1) {
+			ret = -EIO;
+			break;
+		}
+
+		region.source_addr += PAGE_SIZE;
+		region.gpa += PAGE_SIZE;
+		region.nr_pages--;
+
+		cond_resched();
+	}
+
+	mutex_unlock(&kvm->slots_lock);
+
+	if (copy_to_user(u64_to_user_ptr(cmd->data), &region, sizeof(region)))
+		ret = -EFAULT;
+	return ret;
+}
+
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
@@ -1605,6 +1752,9 @@  int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
 	case KVM_TDX_INIT_VCPU:
 		ret = tdx_vcpu_init(vcpu, &cmd);
 		break;
+	case KVM_TDX_INIT_MEM_REGION:
+		ret = tdx_vcpu_init_mem_region(vcpu, &cmd);
+		break;
 	case KVM_TDX_GET_CPUID:
 		ret = tdx_vcpu_get_cpuid(vcpu, &cmd);
 		break;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 73fc3334721d..0822db480719 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2639,6 +2639,7 @@  struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn
 
 	return NULL;
 }
+EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_memslot);
 
 bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
 {