
[5/7] KVM: x86/mmu: Introduce kvm_tdp_map_page() to populate guest memory

Message ID 20240417153450.3608097-6-pbonzini@redhat.com (mailing list archive)
State New, archived
Series KVM: Guest Memory Pre-Population API

Commit Message

Paolo Bonzini April 17, 2024, 3:34 p.m. UTC
From: Isaku Yamahata <isaku.yamahata@intel.com>

Introduce a helper function to call the KVM fault handler.  It allows a new
ioctl to invoke the KVM fault handler to populate guest memory without
exposing RET_PF_* enums or other KVM MMU internal definitions, because
RET_PF_* are internal to the x86 KVM MMU.  The implementation is restricted
to two-dimensional paging for simplicity: shadow paging uses GVAs for
faulting instead of L1 GPAs, which would make the API difficult to use.
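
For illustration, a caller in the pre-population ioctl that this series adds
could drive the helper roughly as sketched below.  The function name, the
error code bits, and the size-reporting convention are assumptions made for
the sketch, not part of this patch:

	/* Illustrative sketch of a pre-population caller; names are assumptions. */
	static long prefault_one_range(struct kvm_vcpu *vcpu, gpa_t gpa, u64 size)
	{
		u8 level = PG_LEVEL_4K;
		u64 end;
		int r;

		/* Access type is an assumption; request a write fault on the GPA. */
		r = kvm_tdp_map_page(vcpu, gpa, PFERR_WRITE_MASK, &level);
		if (r < 0)
			return r;

		/* The mapping may be a huge page that covers more than @size bytes. */
		end = (gpa & KVM_HPAGE_MASK(level)) + KVM_HPAGE_SIZE(level);
		return min(size, end - gpa);
	}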

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-ID: <9b866a0ae7147f96571c439e75429a03dcb659b6.1712785629.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu.h     |  3 +++
 arch/x86/kvm/mmu/mmu.c | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+)

Comments

Sean Christopherson April 17, 2024, 9:24 p.m. UTC | #1
On Wed, Apr 17, 2024, Paolo Bonzini wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Introduce a helper function to call the KVM fault handler.  It allows a new
> ioctl to invoke the KVM fault handler to populate without seeing RET_PF_*
> enums or other KVM MMU internal definitions because RET_PF_* are internal
> to x86 KVM MMU.  The implementation is restricted to two-dimensional paging
> for simplicity.  The shadow paging uses GVA for faulting instead of L1 GPA.
> It makes the API difficult to use.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Message-ID: <9b866a0ae7147f96571c439e75429a03dcb659b6.1712785629.git.isaku.yamahata@intel.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/kvm/mmu.h     |  3 +++
>  arch/x86/kvm/mmu/mmu.c | 32 ++++++++++++++++++++++++++++++++
>  2 files changed, 35 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index e8b620a85627..51ff4f67e115 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -183,6 +183,9 @@ static inline void kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
>  	__kvm_mmu_refresh_passthrough_bits(vcpu, mmu);
>  }
>  
> +int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> +		     u8 *level);
> +
>  /*
>   * Check if a given access (described through the I/D, W/R and U/S bits of a
>   * page fault error code pfec) causes a permission fault with the given PTE
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 7fbcfc97edcc..fb2149d16f8d 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4646,6 +4646,38 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  	return direct_page_fault(vcpu, fault);
>  }
>  
> +int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> +		     u8 *level)

If the return is an overloaded "long", then there's no need for @level, i.e. do
the level=>size conversion in this helper.

> +{
> +	int r;
> +
> +	/* Restrict to TDP page fault. */

Do we want to restrict this to the TDP MMU?  Not for any particular reason, mostly
just to keep moving towards officially deprecating/removing TDP support from the
shadow MMU.
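
If the restriction were tightened to the TDP MMU specifically, the check could
take roughly this shape (gating on tdp_mmu_enabled is one possible way to
express it, sketched here rather than decided in this thread):

	if (!tdp_mmu_enabled || vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
		return -EOPNOTSUPP;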

> +	if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
> +		return -EOPNOTSUPP;
> +
> +	r = __kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
> +	if (r < 0)
> +		return r;
> +
> +	switch (r) {
> +	case RET_PF_RETRY:
> +		return -EAGAIN;
> +
> +	case RET_PF_FIXED:
> +	case RET_PF_SPURIOUS:
> +		return 0;

Going with the "long" idea, this becomes:

		end = (gpa & KVM_HPAGE_MASK(level)) + KVM_HPAGE_SIZE(level);
		return min(size, end - gpa);

though I would vote for a:

		break;

so that the happy path is nicely isolated at the end of the function.
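
Putting the "long" return and the break together, the reworked helper might
look roughly like the sketch below; the @size parameter that replaces @level
is an assumption about how the level-to-size conversion would be pulled into
the helper:

	/* Sketch of the suggested shape; @size replacing @level is assumed. */
	long kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
			      u64 size)
	{
		u8 level = PG_LEVEL_4K;
		u64 end;
		int r;

		/* Restrict to TDP page fault. */
		if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
			return -EOPNOTSUPP;

		r = __kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL,
					    &level);
		if (r < 0)
			return r;

		switch (r) {
		case RET_PF_RETRY:
			return -EAGAIN;
		case RET_PF_FIXED:
		case RET_PF_SPURIOUS:
			break;
		case RET_PF_EMULATE:
			return -EINVAL;
		case RET_PF_CONTINUE:
		case RET_PF_INVALID:
		default:
			WARN_ON_ONCE(r);
			return -EIO;
		}

		/* Happy path: report how much of [gpa, gpa + size) is now mapped. */
		end = (gpa & KVM_HPAGE_MASK(level)) + KVM_HPAGE_SIZE(level);
		return min(size, end - gpa);
	}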

> +
> +	case RET_PF_EMULATE:
> +		return -EINVAL;
> +
> +	case RET_PF_CONTINUE:
> +	case RET_PF_INVALID:
> +	default:
> +		WARN_ON_ONCE(r);
> +		return -EIO;
> +	}
> +}
> +
>  static void nonpaging_init_context(struct kvm_mmu *context)
>  {
>  	context->page_fault = nonpaging_page_fault;
> -- 
> 2.43.0
> 
>
Paolo Bonzini April 17, 2024, 9:31 p.m. UTC | #2
On Wed, Apr 17, 2024 at 11:24 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Apr 17, 2024, Paolo Bonzini wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> >
> > Introduce a helper function to call the KVM fault handler.  It allows a new
> > ioctl to invoke the KVM fault handler to populate without seeing RET_PF_*
> > enums or other KVM MMU internal definitions because RET_PF_* are internal
> > to x86 KVM MMU.  The implementation is restricted to two-dimensional paging
> > for simplicity.  The shadow paging uses GVA for faulting instead of L1 GPA.
> > It makes the API difficult to use.
> >
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > Message-ID: <9b866a0ae7147f96571c439e75429a03dcb659b6.1712785629.git.isaku.yamahata@intel.com>
> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > ---
> >  arch/x86/kvm/mmu.h     |  3 +++
> >  arch/x86/kvm/mmu/mmu.c | 32 ++++++++++++++++++++++++++++++++
> >  2 files changed, 35 insertions(+)
> >
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index e8b620a85627..51ff4f67e115 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -183,6 +183,9 @@ static inline void kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
> >       __kvm_mmu_refresh_passthrough_bits(vcpu, mmu);
> >  }
> >
> > +int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> > +                  u8 *level);
> > +
> >  /*
> >   * Check if a given access (described through the I/D, W/R and U/S bits of a
> >   * page fault error code pfec) causes a permission fault with the given PTE
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 7fbcfc97edcc..fb2149d16f8d 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4646,6 +4646,38 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >       return direct_page_fault(vcpu, fault);
> >  }
> >
> > +int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> > +                  u8 *level)
>
> If the return is an overloaded "long", then there's no need for @level, i.e. do
> the level=>size conversion in this helper.
>
> > +{
> > +     int r;
> > +
> > +     /* Restrict to TDP page fault. */
>
> Do we want to restrict this to the TDP MMU?  Not for any particular reason, mostly
> just to keep moving towards officially deprecating/removing TDP support from the
> shadow MMU.

Heh, yet another thing I briefly thought about while reviewing Isaku's
work. In the end I decided that, with the implementation being just a
regular prefault, there's not much to save from keeping the shadow MMU
away from this.

The real ugly part is that if the memslots are zapped the
pre-population effect basically goes away (damn
kvm_arch_flush_shadow_memslot). This is the reason why I initially
thought of KVM_CHECK_EXTENSION for the VM file descriptor, to only
allow this for TDX VMs.

The real solution for this is to not "graduate" this ioctl too soon to
kvm/next. Let's keep it in kvm-coco-queue until TDX is ready and then
make a final decision.

Paolo

> > +     if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
> > +             return -EOPNOTSUPP;
> > +
> > +     r = __kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
> > +     if (r < 0)
> > +             return r;
> > +
> > +     switch (r) {
> > +     case RET_PF_RETRY:
> > +             return -EAGAIN;
> > +
> > +     case RET_PF_FIXED:
> > +     case RET_PF_SPURIOUS:
> > +             return 0;
>
> Going with the "long" idea, this becomes:
>
>                 end = (gpa & KVM_HPAGE_MASK(level)) + KVM_HPAGE_SIZE(level);
>                 return min(size, end - gpa);
>
> though I would vote for a:
>
>                 break;
>
> so that the happy path is nicely isolated at the end of the function.
>
> > +
> > +     case RET_PF_EMULATE:
> > +             return -EINVAL;
> > +
> > +     case RET_PF_CONTINUE:
> > +     case RET_PF_INVALID:
> > +     default:
> > +             WARN_ON_ONCE(r);
> > +             return -EIO;
> > +     }
> > +}
> > +
> >  static void nonpaging_init_context(struct kvm_mmu *context)
> >  {
> >       context->page_fault = nonpaging_page_fault;
> > --
> > 2.43.0
> >
> >
>
Sean Christopherson April 17, 2024, 9:34 p.m. UTC | #3
On Wed, Apr 17, 2024, Sean Christopherson wrote:
> On Wed, Apr 17, 2024, Paolo Bonzini wrote:
> > +	case RET_PF_EMULATE:
> > +		return -EINVAL;

Almost forgot.  EINVAL on emulation is weird.  I don't know that any return code
is going to be "good", but I think just about anything is better than EINVAL,
e.g. arguably this could be -EBUSY since retrying after creating a memslot would
succeed.
Paolo Bonzini April 17, 2024, 9:47 p.m. UTC | #4
On Wed, Apr 17, 2024 at 11:34 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Apr 17, 2024, Sean Christopherson wrote:
> > On Wed, Apr 17, 2024, Paolo Bonzini wrote:
> > > +   case RET_PF_EMULATE:
> > > +           return -EINVAL;
>
> Almost forgot.  EINVAL on emulation is weird.  I don't know that any return code
> is going to be "good", but I think just about anything is better than EINVAL,
> e.g. arguably this could be -EBUSY since retrying after creating a memslot would
> succeed.

Then I guess -ENOENT?

Paolo
Sean Christopherson April 17, 2024, 10:26 p.m. UTC | #5
On Wed, Apr 17, 2024, Paolo Bonzini wrote:
> On Wed, Apr 17, 2024 at 11:24 PM Sean Christopherson <seanjc@google.com> wrote:
> > Do we want to restrict this to the TDP MMU?  Not for any particular reason,
> > mostly just to keep moving towards officially deprecating/removing TDP
> > support from the shadow MMU.
> 
> Heh, yet another thing I briefly thought about while reviewing Isaku's
> work. In the end I decided that, with the implementation being just a
> regular prefault, there's not much to save from keeping the shadow MMU
> away from this.

Yeah.

> The real ugly part is that if the memslots are zapped the
> pre-population effect basically goes away (damn
> kvm_arch_flush_shadow_memslot). 

Ah, the eternal thorn in my side. 

> This is the reason why I initially thought of KVM_CHECK_EXTENSION for the VM
> file descriptor, to only allow this for TDX VMs.

I'm fairly certain memslot deletion is mostly a QEMU specific problem.  Allegedly
(I haven't verified), our userspace+firmware doesn't delete any memslots during
boot.

And it might even be solvable for QEMU, at least for some configurations.  E.g.
during boot, my QEMU+OVMF setup creates and deletes the SMRAM memslot (despite my
KVM build not supporting SMM), and deletes the lower RAM memslot when relocating
BIOS.  The SMRAM is definitely solvable, and the BIOS relocation stuff seems like
it's solvable too.

Patch

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index e8b620a85627..51ff4f67e115 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -183,6 +183,9 @@  static inline void kvm_mmu_refresh_passthrough_bits(struct kvm_vcpu *vcpu,
 	__kvm_mmu_refresh_passthrough_bits(vcpu, mmu);
 }
 
+int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
+		     u8 *level);
+
 /*
  * Check if a given access (described through the I/D, W/R and U/S bits of a
  * page fault error code pfec) causes a permission fault with the given PTE
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 7fbcfc97edcc..fb2149d16f8d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4646,6 +4646,38 @@  int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	return direct_page_fault(vcpu, fault);
 }
 
+int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
+		     u8 *level)
+{
+	int r;
+
+	/* Restrict to TDP page fault. */
+	if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
+		return -EOPNOTSUPP;
+
+	r = __kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
+	if (r < 0)
+		return r;
+
+	switch (r) {
+	case RET_PF_RETRY:
+		return -EAGAIN;
+
+	case RET_PF_FIXED:
+	case RET_PF_SPURIOUS:
+		return 0;
+
+	case RET_PF_EMULATE:
+		return -EINVAL;
+
+	case RET_PF_CONTINUE:
+	case RET_PF_INVALID:
+	default:
+		WARN_ON_ONCE(r);
+		return -EIO;
+	}
+}
+
 static void nonpaging_init_context(struct kvm_mmu *context)
 {
 	context->page_fault = nonpaging_page_fault;