diff mbox series

[v2,07/10] KVM: x86: Always populate L1 GPA for KVM_MAP_MEMORY

Message ID 2f1de1b7b6512280fae4ac05e77ced80a585971b.1712785629.git.isaku.yamahata@intel.com (mailing list archive)
State New, archived
Series KVM: Guest Memory Pre-Population API

Commit Message

Isaku Yamahata April 10, 2024, 10:07 p.m. UTC
From: Isaku Yamahata <isaku.yamahata@intel.com>

Forcibly switch the vCPU out of guest mode and SMM mode before calling the
KVM page fault handler for KVM_MAP_MEMORY.

KVM_MAP_MEMORY populates guest memory by guest physical address (GPA).
If the vCPU is in guest mode, it populates with L2 GPA.  If the vCPU is in
SMM mode, it populates the SMM address space.  The API would be difficult
to use as such.  Switch the vCPU MMU mode around populating the guest
memory so that it always populates with L1 GPA.

There are several options for populating L1 GPA irrespective of vCPU mode.
- Switch vCPU MMU only: This patch.
  Pros: Concise implementation.
  Cons: Heavily dependent on the KVM MMU implementation.
- Use kvm_x86_nested_ops.get/set_state() to switch to/from guest mode.
  Use __get/set_sregs2() to switch to/from SMM mode.
  Pros: straightforward.
  Cons: This may cause unintended side effects.
- Refactor KVM page fault handler not to pass vCPU. Pass around necessary
  parameters and struct kvm.
  Pros: The end result will clearly have no side effects.
  Cons: This will require big refactoring.
- Return error on guest mode or SMM mode:  Without this patch.
  Pros: No additional patch.
  Cons: Difficult to use.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
v2:
- Newly added.
---
 arch/x86/kvm/x86.c | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

Comments

Rick Edgecombe April 15, 2024, 7:12 p.m. UTC | #1
I wouldn't call myself much of an expert on nested, but...

On Wed, 2024-04-10 at 15:07 -0700, isaku.yamahata@intel.com wrote:
> There are several options for populating L1 GPA irrespective of vCPU mode.
> - Switch vCPU MMU only: This patch.
>   Pros: Concise implementation.
>   Cons: Heavily dependent on the KVM MMU implementation.

Is switching just the MMU enough here? Won't the MTRRs and other vcpu bits be
wrong?

> - Use kvm_x86_nested_ops.get/set_state() to switch to/from guest mode.
>   Use __get/set_sregs2() to switch to/from SMM mode.
>   Pros: straightforward.
>   Cons: This may cause unintended side effects.

Cons make sense.

> - Refactor KVM page fault handler not to pass vCPU. Pass around necessary
>   parameters and struct kvm.
>   Pros: The end result will clearly have no side effects.
>   Cons: This will require big refactoring.

But doesn't the fault handler need the vCPU state?

> - Return error on guest mode or SMM mode:  Without this patch.
>   Pros: No additional patch.
>   Cons: Difficult to use.

Hmm... For the non-TDX use cases this is just an optimization, right? For TDX
there shouldn't be an issue. If so, maybe this last one is not so horrible.
Rick Edgecombe April 15, 2024, 7:37 p.m. UTC | #2
On Wed, 2024-04-10 at 15:07 -0700, isaku.yamahata@intel.com wrote:
> @@ -5882,18 +5884,40 @@ int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
>         if (!tdp_enabled)
>                 return -EOPNOTSUPP;
>  
> +       /* Force to use L1 GPA despite of vcpu MMU mode. */
> +       is_smm = !!(vcpu->arch.hflags & HF_SMM_MASK);
> +       if (is_smm ||
> +           vcpu->arch.mmu != &vcpu->arch.root_mmu ||
> +           vcpu->arch.walk_mmu != &vcpu->arch.root_mmu) {
> +               vcpu->arch.hflags &= ~HF_SMM_MASK;
> +               mmu = vcpu->arch.mmu;
> +               walk_mmu = vcpu->arch.walk_mmu;
> +               vcpu->arch.mmu = &vcpu->arch.root_mmu;
> +               vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> +               kvm_mmu_reset_context(vcpu);
> +       }
> +
>         /* reload is optimized for repeated call. */

After the kvm_mmu_reset_context(), what benefit is there to the operation? And
does it really need to happen on every call of kvm_arch_vcpu_map_memory()?

>         kvm_mmu_reload(vcpu);
Sean Christopherson April 15, 2024, 9:17 p.m. UTC | #3
On Mon, Apr 15, 2024, Rick P Edgecombe wrote:
> I wouldn't call myself much of an expert on nested, but...
> 
> On Wed, 2024-04-10 at 15:07 -0700, isaku.yamahata@intel.com wrote:
> > There are several options for populating L1 GPA irrespective of vCPU mode.
> > - Switch vCPU MMU only: This patch.
> >   Pros: Concise implementation.
> >   Cons: Heavily dependent on the KVM MMU implementation.

Con: Makes it impossible to support other MMUs/modes without extending the uAPI.

> Is switching just the MMU enough here? Won't the MTRRs and other vcpu bits be
> wrong?
> 
> > - Use kvm_x86_nested_ops.get/set_state() to switch to/from guest mode.
> >   Use __get/set_sregs2() to switch to/from SMM mode.
> >   Pros: straightforward.
> >   Cons: This may cause unintended side effects.
> 
> Cons make sense.
> 
> > - Refactor KVM page fault handler not to pass vCPU. Pass around necessary
> >   parameters and struct kvm.
> >   Pros: The end result will clearly have no side effects.
> >   Cons: This will require big refactoring.
> 
> But doesn't the fault handler need the vCPU state?

Ignoring guest MTRRs, which will hopefully soon be a non-issue, no.  There are
only six possible roots if TDP is enabled:

      1. 4-level !SMM !guest_mode
      2. 4-level  SMM !guest_mode
      3. 5-level !SMM !guest_mode
      4. 5-level  SMM !guest_mode
      5. 4-level !SMM guest_mode
      6. 5-level !SMM guest_mode

4-level vs. 5-level is a guest MAXPHYADDR thing, and swapping the MMU eliminates
the SMM and guest_mode issues.  If there is per-vCPU state that makes its way into
the TDP page tables, then we have problems, because it means that there is per-vCPU
state in per-VM structures that isn't accounted for.

There are a few edge cases where KVM treads carefully, e.g. if the fault is to
the vCPU's APIC-access page, but KVM manually handles those to avoid consuming
per-vCPU state.

That said, I think this option is effectively 1b, because dropping the SMM vs.
guest_mode state has the same uAPI problems as forcibly swapping the MMU, it's
just a different way of doing so.

The first question to answer is, do we want to return an error or "silently"
install mappings for !SMM, !guest_mode.  And so this option becomes relevant only
_if_ we want to unconditionally install mappings for the 'base' mode.

> > - Return error on guest mode or SMM mode:  Without this patch.
> >   Pros: No additional patch.
> >   Cons: Difficult to use.
> 
> Hmm... For the non-TDX use cases this is just an optimization, right? For TDX
> there shouldn't be an issue. If so, maybe this last one is not so horrible.

And the fact that there are so many variables to control (MAXPHYADDR, SMM, and guest_mode)
basically invalidates the argument that returning an error makes the ioctl() hard
to use.  I can imagine it might be hard to squeeze this ioctl() into QEMU's
existing code, but I don't buy that the ioctl() itself is hard to use.

Literally the only thing userspace needs to do is set CPUID to implicitly select
between 4-level and 5-level paging.  If userspace wants to pre-map memory during
live migration, or when jump-starting the guest with pre-defined state, simply
pre-map memory before stuffing guest state.  In and of itself, that doesn't seem
difficult, e.g. at a quick glance, QEMU could add a hook somewhere in
kvm_vcpu_thread_fn() without too much trouble (though that comes with a huge
disclaimer that I only know enough about how QEMU manages vCPUs to be dangerous).

I would describe the overall cons for this patch versus returning an error
differently.  Switching MMU state puts the complexity in the kernel.  Returning
an error punts any complexity to userspace.  Specifically, anything that KVM can
do regarding vCPU state to get the right MMU, userspace can do too.
 
Add on that silently doing things that effectively ignore guest state usually
ends badly, and I don't see a good argument for this patch (or any variant
thereof).
Rick Edgecombe April 15, 2024, 9:36 p.m. UTC | #4
On Mon, 2024-04-15 at 14:17 -0700, Sean Christopherson wrote:
> > But doesn't the fault handler need the vCPU state?
> 
> Ignoring guest MTRRs, which will hopefully soon be a non-issue, no.  There are
> only six possible roots if TDP is enabled:
> 
>       1. 4-level !SMM !guest_mode
>       2. 4-level  SMM !guest_mode
>       3. 5-level !SMM !guest_mode
>       4. 5-level  SMM !guest_mode
>       5. 4-level !SMM guest_mode
>       6. 5-level !SMM guest_mode
> 
> 4-level vs. 5-level is a guest MAXPHYADDR thing, and swapping the MMU
> eliminates
> the SMM and guest_mode issues.  If there is per-vCPU state that makes its way
> into
> the TDP page tables, then we have problems, because it means that there is
> per-vCPU
> state in per-VM structures that isn't accounted for.
> 
> There are a few edge cases where KVM treads carefully, e.g. if the fault is to
> the vCPU's APIC-access page, but KVM manually handles those to avoid consuming
> per-vCPU state.
> 
> That said, I think this option is effectively 1b, because dropping the SMM vs.
> guest_mode state has the same uAPI problems as forcibly swapping the MMU, it's
> just a different way of doing so.
> 
> The first question to answer is, do we want to return an error or "silently"
> install mappings for !SMM, !guest_mode.  And so this option becomes relevant
> only
> _if_ we want to unconditionally install mappings for the 'base' mode.

Ah, I thought there was some logic around CR0.CD.

> 
> > > - Return error on guest mode or SMM mode:  Without this patch.
> > >   Pros: No additional patch.
> > >   Cons: Difficult to use.
> > 
> > Hmm... For the non-TDX use cases this is just an optimization, right? For
> > TDX
> > there shouldn't be an issue. If so, maybe this last one is not so horrible.
> 
> And the fact that there are so many variables to control (MAXPHYADDR, SMM, and
> guest_mode)
> basically invalidates the argument that returning an error makes the ioctl()
> hard
> to use.  I can imagine it might be hard to squeeze this ioctl() into QEMU's
> existing code, but I don't buy that the ioctl() itself is hard to use.
> 
> Literally the only thing userspace needs to do is set CPUID to implicitly
> select
> between 4-level and 5-level paging.  If userspace wants to pre-map memory
> during
> live migration, or when jump-starting the guest with pre-defined state, simply
> pre-map memory before stuffing guest state.  In and of itself, that doesn't
> seem
> difficult, e.g. at a quick glance, QEMU could add a hook somewhere in
> kvm_vcpu_thread_fn() without too much trouble (though that comes with a huge
> disclaimer that I only know enough about how QEMU manages vCPUs to be
> dangerous).
> 
> I would describe the overall cons for this patch versus returning an error
> differently.  Switching MMU state puts the complexity in the kernel. 
> Returning
> an error punts any complexity to userspace.  Specifically, anything that KVM
> can
> do regarding vCPU state to get the right MMU, userspace can do too.
>  
> Add on that silently doing things that effectively ignore guest state usually
> ends badly, and I don't see a good argument for this patch (or any variant
> thereof).

Great.
Sean Christopherson April 15, 2024, 10:59 p.m. UTC | #5
On Mon, Apr 15, 2024, Rick P Edgecombe wrote:
> On Mon, 2024-04-15 at 14:17 -0700, Sean Christopherson wrote:
> > > But doesn't the fault handler need the vCPU state?
> > 
> > Ignoring guest MTRRs, which will hopefully soon be a non-issue, no.  There are
> > only six possible roots if TDP is enabled:
> > 
> >       1. 4-level !SMM !guest_mode
> >       2. 4-level  SMM !guest_mode
> >       3. 5-level !SMM !guest_mode
> >       4. 5-level  SMM !guest_mode
> >       5. 4-level !SMM guest_mode
> >       6. 5-level !SMM guest_mode
> > 
> > 4-level vs. 5-level is a guest MAXPHYADDR thing, and swapping the MMU
> > eliminates the SMM and guest_mode issues.  If there is per-vCPU state that
> > makes its way into the TDP page tables, then we have problems, because it
> > means that there is per-vCPU state in per-VM structures that isn't
> > accounted for.
> > 
> > There are a few edge cases where KVM treads carefully, e.g. if the fault is
> > to the vCPU's APIC-access page, but KVM manually handles those to avoid
> > consuming per-vCPU state.
> > 
> > That said, I think this option is effectively 1b, because dropping the SMM
> > vs.  guest_mode state has the same uAPI problems as forcibly swapping the
> > MMU, it's just a different way of doing so.
> > 
> > The first question to answer is, do we want to return an error or
> > "silently" install mappings for !SMM, !guest_mode.  And so this option
> > becomes relevant only _if_ we want to unconditionally install mappings for
> > the 'base' mode.
> 
> Ah, I thought there was some logic around CR0.CD.

There is, but it's hopefully going the way of the dodo, along with full MTRR
virtualization:

https://lore.kernel.org/all/20240309010929.1403984-1-seanjc@google.com
Isaku Yamahata April 16, 2024, 1:49 a.m. UTC | #6
On Mon, Apr 15, 2024 at 02:17:02PM -0700,
Sean Christopherson <seanjc@google.com> wrote:

> > > - Return error on guest mode or SMM mode:  Without this patch.
> > >   Pros: No additional patch.
> > >   Cons: Difficult to use.
> > 
> > Hmm... For the non-TDX use cases this is just an optimization, right? For TDX
> > there shouldn't be an issue. If so, maybe this last one is not so horrible.
> 
> And the fact that there are so many variables to control (MAXPHYADDR, SMM, and guest_mode)
> basically invalidates the argument that returning an error makes the ioctl() hard
> to use.  I can imagine it might be hard to squeeze this ioctl() into QEMU's
> existing code, but I don't buy that the ioctl() itself is hard to use.
> 
> Literally the only thing userspace needs to do is set CPUID to implicitly select
> between 4-level and 5-level paging.  If userspace wants to pre-map memory during
> live migration, or when jump-starting the guest with pre-defined state, simply
> pre-map memory before stuffing guest state.  In and of itself, that doesn't seem
> difficult, e.g. at a quick glance, QEMU could add a hook somewhere in
> kvm_vcpu_thread_fn() without too much trouble (though that comes with a huge
> disclaimer that I only know enough about how QEMU manages vCPUs to be dangerous).
> 
> I would describe the overall cons for this patch versus returning an error
> differently.  Switching MMU state puts the complexity in the kernel.  Returning
> an error punts any complexity to userspace.  Specifically, anything that KVM can
> do regarding vCPU state to get the right MMU, userspace can do too.
>  
> Add on that silently doing things that effectively ignore guest state usually
> ends badly, and I don't see a good argument for this patch (or any variant
> thereof).

Ok, here is an experimental patch on top of 7/10 that returns an error.  Is this
the right direction, or do we want to invoke the KVM page fault handler without any check?

I can see the following options.

- Error if vCPU is in SMM mode or guest mode: This patch
  Defer the decision until the use cases come up.  We can utilize
  KVM_CAP_MAP_MEMORY and struct kvm_map_memory.flags for future
  enhancement.
  Pro: Keeps room for future enhancement by deferring the decision on
       unclear use cases.
  Con: The userspace VMM has to check/switch the vCPU mode.

- Don't check the vCPU mode and proceed
  Pro: It works.
  Con: Unclear what the uAPI should look like without concrete use cases.

- Always populate with L1 GPA:
  This is a bad idea.

---
 arch/x86/kvm/x86.c | 32 +++++++++-----------------------
 1 file changed, 9 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8ba9c1720ac9..2f3ceda5c225 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5871,10 +5871,8 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
 			     struct kvm_memory_mapping *mapping)
 {
-	struct kvm_mmu *mmu = NULL, *walk_mmu = NULL;
 	u64 end, error_code = 0;
 	u8 level = PG_LEVEL_4K;
-	bool is_smm;
 	int r;
 
 	/*
@@ -5884,25 +5882,21 @@ int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
 	if (!tdp_enabled)
 		return -EOPNOTSUPP;
 
-	/* Force to use L1 GPA despite of vcpu MMU mode. */
-	is_smm = !!(vcpu->arch.hflags & HF_SMM_MASK);
-	if (is_smm ||
-	    vcpu->arch.mmu != &vcpu->arch.root_mmu ||
-	    vcpu->arch.walk_mmu != &vcpu->arch.root_mmu) {
-		vcpu->arch.hflags &= ~HF_SMM_MASK;
-		mmu = vcpu->arch.mmu;
-		walk_mmu = vcpu->arch.walk_mmu;
-		vcpu->arch.mmu = &vcpu->arch.root_mmu;
-		vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
-		kvm_mmu_reset_context(vcpu);
-	}
+	/*
+	 * SMM mode results in populating SMM memory space with memslots id = 1.
+	 * guest mode results in populating with L2 GPA.
+	 * Don't support those cases for now and punt them for the future
+	 * discussion.
+	 */
+	if (is_smm(vcpu) || is_guest_mode(vcpu))
+		return -EOPNOTSUPP;
 
 	/* reload is optimized for repeated call. */
 	kvm_mmu_reload(vcpu);
 
 	r = kvm_tdp_map_page(vcpu, mapping->base_address, error_code, &level);
 	if (r)
-		goto out;
+		return r;
 
 	/* mapping->base_address is not necessarily aligned to level-hugepage. */
 	end = (mapping->base_address & KVM_HPAGE_MASK(level)) +
@@ -5910,14 +5904,6 @@ int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
 	mapping->size -= end - mapping->base_address;
 	mapping->base_address = end;
 
-out:
-	/* Restore MMU state. */
-	if (is_smm || mmu) {
-		vcpu->arch.hflags |= is_smm ? HF_SMM_MASK : 0;
-		vcpu->arch.mmu = mmu;
-		vcpu->arch.walk_mmu = walk_mmu;
-		kvm_mmu_reset_context(vcpu);
-	}
 	return r;
 }
Sean Christopherson April 16, 2024, 2:22 p.m. UTC | #7
On Mon, Apr 15, 2024, Isaku Yamahata wrote:
> On Mon, Apr 15, 2024 at 02:17:02PM -0700,
> Sean Christopherson <seanjc@google.com> wrote:
> 
> > > > - Return error on guest mode or SMM mode:  Without this patch.
> > > >   Pros: No additional patch.
> > > >   Cons: Difficult to use.
> > > 
> > > Hmm... For the non-TDX use cases this is just an optimization, right? For TDX
> > > there shouldn't be an issue. If so, maybe this last one is not so horrible.
> > 
> > And the fact that there are so many variables to control (MAXPHYADDR, SMM, and guest_mode)
> > basically invalidates the argument that returning an error makes the ioctl() hard
> > to use.  I can imagine it might be hard to squeeze this ioctl() into QEMU's
> > existing code, but I don't buy that the ioctl() itself is hard to use.
> > 
> > Literally the only thing userspace needs to do is set CPUID to implicitly select
> > between 4-level and 5-level paging.  If userspace wants to pre-map memory during
> > live migration, or when jump-starting the guest with pre-defined state, simply
> > pre-map memory before stuffing guest state.  In and of itself, that doesn't seem
> > difficult, e.g. at a quick glance, QEMU could add a hook somewhere in
> > kvm_vcpu_thread_fn() without too much trouble (though that comes with a huge
> > disclaimer that I only know enough about how QEMU manages vCPUs to be dangerous).
> > 
> > I would describe the overall cons for this patch versus returning an error
> > differently.  Switching MMU state puts the complexity in the kernel.  Returning
> > an error punts any complexity to userspace.  Specifically, anything that KVM can
> > do regarding vCPU state to get the right MMU, userspace can do too.
> >  
> > Add on that silently doing things that effectively ignore guest state usually
> > ends badly, and I don't see a good argument for this patch (or any variant
> > thereof).
> 
> Ok, here is an experimental patch on top of 7/10 that returns an error.  Is this
> the right direction, or do we want to invoke the KVM page fault handler without any check?
> 
> I can see the following options.
> 
> - Error if vCPU is in SMM mode or guest mode: This patch
>   Defer the decision until the use cases come up.  We can utilize
>   KVM_CAP_MAP_MEMORY and struct kvm_map_memory.flags for future
>   enhancement.
>   Pro: Keeps room for future enhancement by deferring the decision on
>        unclear use cases.
>   Con: The userspace VMM has to check/switch the vCPU mode.
> 
> - Don't check the vCPU mode and proceed
>   Pro: It works.
>   Con: Unclear what the uAPI should look like without concrete use cases.
> 
> - Always populate with L1 GPA:
>   This is a bad idea.

Not always.  If L1 is using shadow paging, then L1 and L2 GPAs are in the same
domain from KVM's perspective.

As I said in v1 (I think it was v1), whether or not mapping an L1 GPA is supported
should be a property of the mmu, not an explicit check.  As far as the TDP MMU is
concerned, there's nothing special about SMM, nor is there anything special about
guest_mode, except for the fact that they use different roots than "normal" mode.
But that part Just Works.

And if L1 is using TDP, i.e. KVM is shadowing L1's TDP page tables, and L2 is
active, then vcpu->arch.mmu will point at a variation of the shadow MMU, e.g.
the PTTYPE_EPT MMU on Intel CPUs.  The shadow MMU won't support pre-mapping GPAs
because it's non-sensical (legacy shadow paging needs a GVA, nested TDP needs an
L2 GPA), and so the ioctl() fails because mmu->map_gpa or whatever is NULL.

In other words, unless I'm forgetting something, explicit checks for "unsupported"
modes should be unnecessary, because unsupported MMU flavors will fail naturally.
Rick Edgecombe April 16, 2024, 5:11 p.m. UTC | #8
On Wed, 2024-04-10 at 15:07 -0700, isaku.yamahata@intel.com wrote:
> +       /* Force to use L1 GPA despite of vcpu MMU mode. */
> +       is_smm = !!(vcpu->arch.hflags & HF_SMM_MASK);
> +       if (is_smm ||
> +           vcpu->arch.mmu != &vcpu->arch.root_mmu ||
> +           vcpu->arch.walk_mmu != &vcpu->arch.root_mmu) {
> +               vcpu->arch.hflags &= ~HF_SMM_MASK;

0-day informs me that the definition for HF_SMM_MASK depends on CONFIG_KVM_SMM.

> +               mmu = vcpu->arch.mmu;
> +               walk_mmu = vcpu->arch.walk_mmu;
> +               vcpu->arch.mmu = &vcpu->arch.root_mmu;
> +               vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> +               kvm_mmu_reset_context(vcpu);
> +       }
> +
Paolo Bonzini April 16, 2024, 9:41 p.m. UTC | #9
On Mon, Apr 15, 2024 at 11:17 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Apr 15, 2024, Rick P Edgecombe wrote:
> > I wouldn't call myself much of an expert on nested, but...
> >
> > On Wed, 2024-04-10 at 15:07 -0700, isaku.yamahata@intel.com wrote:
> > > There are several options for populating L1 GPA irrespective of vCPU mode.
> > > - Switch vCPU MMU only: This patch.
> > >   Pros: Concise implementation.
> > >   Cons: Heavily dependent on the KVM MMU implementation.
>
> Con: Makes it impossible to support other MMUs/modes without extending the uAPI.

+1.

> The first question to answer is, do we want to return an error or "silently"
> install mappings for !SMM, !guest_mode.  And so this option becomes relevant only
> _if_ we want to unconditionally install mappings for the 'base' mode.
>
> > > - Return error on guest mode or SMM mode:  Without this patch.
> > >   Pros: No additional patch.
> > >   Cons: Difficult to use.
> >
> > Hmm... For the non-TDX use cases this is just an optimization, right? For TDX
> > there shouldn't be an issue. If so, maybe this last one is not so horrible.

It doesn't even have to be ABI that it gives an error. As you say,
this ioctl can just be advisory only for !confidential machines. Even
if it were implemented, the shadow MMU can drop roots at any moment
and/or kill the mapping via the shrinker.

That said, I can't fully shake the feeling that this ioctl should be
an error for !TDX and that TDX_INIT_MEM_REGION wasn't that bad. The
implementation was ugly but the API was fine. Sorry about this;
patches 3-5 can still be included in kvm-coco-queue sooner rather than
later.

> And the fact that there are so many variables to control (MAXPHYADDR, SMM, and
> basically invalidates the argument that returning an error makes the ioctl() hard
> to use.  I can imagine it might be hard to squeeze this ioctl() into QEMU's
> existing code, but I don't buy that the ioctl() itself is hard to use.

Nah, I don't think so. With TDX it's just MAXPHYADDR; just invoke it
after KVM_SET_CPUID2 or TDX_INIT_VCPU which is very early.

> Literally the only thing userspace needs to do is set CPUID to implicitly select
> between 4-level and 5-level paging.  If userspace wants to pre-map memory during
> live migration, or when jump-starting the guest with pre-defined state, simply
> pre-map memory before stuffing guest state.  In and of itself, that doesn't seem
> difficult, e.g. at a quick glance, QEMU could add a hook somewhere in
> kvm_vcpu_thread_fn() without too much trouble (though that comes with a huge
> disclaimer that I only know enough about how QEMU manages vCPUs to be dangerous).

Hehe :) the machine_done_notifier is probably a better place. /me
checks... yes it's exactly where Xiaoyao did it (tdx_finalize_vm is
the notifier, it calls KVM_TDX_INIT_VCPU, from tdx_post_init_vcpus and
then KVM_MEMORY_MAPPING).

> I would describe the overall cons for this patch versus returning an error
> differently.  Switching MMU state puts the complexity in the kernel.  Returning
> an error punts any complexity to userspace.  Specifically, anything that KVM can
> do regarding vCPU state to get the right MMU, userspace can do too.
>
> Add on that silently doing things that effectively ignore guest state usually
> ends badly, and I don't see a good argument for this patch (or any variant
> thereof).

Agreed.

Paolo
Sean Christopherson April 16, 2024, 11 p.m. UTC | #10
On Tue, Apr 16, 2024, Paolo Bonzini wrote:
> On Mon, Apr 15, 2024 at 11:17 PM Sean Christopherson <seanjc@google.com> wrote:
> > The first question to answer is, do we want to return an error or "silently"
> > install mappings for !SMM, !guest_mode.  And so this option becomes relevant only
> > _if_ we want to unconditionally install mappings for the 'base' mode.
> >
> > > > - Return error on guest mode or SMM mode:  Without this patch.
> > > >   Pros: No additional patch.
> > > >   Cons: Difficult to use.
> > >
> > > Hmm... For the non-TDX use cases this is just an optimization, right? For TDX
> > > there shouldn't be an issue. If so, maybe this last one is not so horrible.
> 
> It doesn't even have to be ABI that it gives an error. As you say,
> this ioctl can just be advisory only for !confidential machines. Even
> if it were implemented, the shadow MMU can drop roots at any moment

Sure, but there's a difference between KVM _potentially_ dropping roots and
guaranteed failure because userspace is trying to do something that's unsupported.
But I think this is a non-issue, because it should really just be as simple as:

	if (!mmu->pre_map_memory)
		return -EOPNOTSUPP;

Hmm, or probably this to avoid adding an MMU hook for a single MMU flavor:

	if (!tdp_mmu_enabled || !mmu->root_role.direct)
		return -EOPNOTSUPP;

> and/or kill the mapping via the shrinker.

Ugh, we really need to kill that code.

> That said, I can't fully shake the feeling that this ioctl should be
> an error for !TDX and that TDX_INIT_MEM_REGION wasn't that bad. The
> implementation was ugly but the API was fine. 

Hmm, but IMO the implementation was ugly in no small part because of the constraints
put on KVM by the API.  Mapping S-EPT *and* doing TDH.MEM.PAGE.ADD in the same
ioctl() forced KVM to operate on vcpu0, and necessitated shoving temporary data
into a per-VM structure in order to get the source contents into TDH.MEM.PAGE.ADD.

We could eliminate the vcpu0 grossness, but it would require a massive refactor,
which is also not a problem per se, but it's obviously not free.  Eliminating
kvm_tdx.source_page is also doable, but it's not clear to me that the end result would
be a net positive.

If userspace pre-maps the S-EPT entries ahead of time, then KVM should have a
straight shot to PAGE.ADD, i.e. doesn't need to "pass" the source page via a
scratch field in kvm_tdx, and I think/hope would avoid the need to grab vcpu0
in order to get at an MMU to build the S-EPT.

And stating the obvious, TDX_INIT_MEM_REGION also doesn't allow pre-mapping memory,
which is generally useful, and can be especially beneficial for confidential VMs
(and TDX in particular) due to the added cost of a page fault VM-Exit.

I'm not dead set on this generic ioctl(), but unless it ends up being a train wreck
for userspace, I think it will allow for cleaner and more reusable code in KVM.
Paolo Bonzini April 17, 2024, 10:28 a.m. UTC | #11
On Wed, Apr 17, 2024 at 1:00 AM Sean Christopherson <seanjc@google.com> wrote:
> > > > Hmm... For the non-TDX use cases this is just an optimization, right? For TDX
> > > > there shouldn't be an issue. If so, maybe this last one is not so horrible.
> >
> > It doesn't even have to be ABI that it gives an error. As you say,
> > this ioctl can just be advisory only for !confidential machines. Even
> > if it were implemented, the shadow MMU can drop roots at any moment
>
> Sure, but there's a difference between KVM _potentially_ dropping roots and
> guaranteed failure because userspace is trying to do something that's unsupported.
> But I think this is a non-issue, because it should really just be as simple as:
>
>         if (!mmu->pre_map_memory)
>                 return -EOPNOTSUPP;
>
> Hmm, or probably this to avoid adding an MMU hook for a single MMU flavor:
>
>         if (!tdp_mmu_enabled || !mmu->root_role.direct)
>                 return -EOPNOTSUPP;
>
> > and/or kill the mapping via the shrinker.
>
> Ugh, we really need to kill that code.

Ok, so let's add a KVM_CHECK_EXTENSION so that people can check if
it's supported.

> > That said, I can't fully shake the feeling that this ioctl should be
> > an error for !TDX and that TDX_INIT_MEM_REGION wasn't that bad. The
> > implementation was ugly but the API was fine.
>
> Hmm, but IMO the implementation was ugly in no small part because of the constraints
> put on KVM by the API.  Mapping S-EPT *and* doing TDH.MEM.PAGE.ADD in the same
> ioctl() forced KVM to operate on vcpu0, and necessitated shoving temporary data
> into a per-VM structure in order to get the source contents into TDH.MEM.PAGE.ADD.

That's because it was trying to do two things with a single loop. It's
not needed - and in fact KVM_CAP_MEMORY_MAPPING forces userspace to do
it in two passes.

> And stating the obvious, TDX_INIT_MEM_REGION also doesn't allow pre-mapping memory,
> which is generally useful, and can be especially beneficial for confidential VMs
> (and TDX in particular) due to the added cost of a page fault VM-Exit.
>
> I'm not dead set on this generic ioctl(), but unless it ends up being a train wreck
> for userspace, I think it will allow for cleaner and more reusable code in KVM.

Yes, this ioctl() can stay. Forcing it before adding memory to TDX is
ugly, but it's not a blocker. I'll look at it closely and see how far
it is from being committable to kvm-coco-queue.

Paolo

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2c765de3531e..8ba9c1720ac9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5871,8 +5871,10 @@  static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
 			     struct kvm_memory_mapping *mapping)
 {
+	struct kvm_mmu *mmu = NULL, *walk_mmu = NULL;
 	u64 end, error_code = 0;
 	u8 level = PG_LEVEL_4K;
+	bool is_smm;
 	int r;
 
 	/*
@@ -5882,18 +5884,40 @@  int kvm_arch_vcpu_map_memory(struct kvm_vcpu *vcpu,
 	if (!tdp_enabled)
 		return -EOPNOTSUPP;
 
+	/* Force to use L1 GPA despite of vcpu MMU mode. */
+	is_smm = !!(vcpu->arch.hflags & HF_SMM_MASK);
+	if (is_smm ||
+	    vcpu->arch.mmu != &vcpu->arch.root_mmu ||
+	    vcpu->arch.walk_mmu != &vcpu->arch.root_mmu) {
+		vcpu->arch.hflags &= ~HF_SMM_MASK;
+		mmu = vcpu->arch.mmu;
+		walk_mmu = vcpu->arch.walk_mmu;
+		vcpu->arch.mmu = &vcpu->arch.root_mmu;
+		vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
+		kvm_mmu_reset_context(vcpu);
+	}
+
 	/* reload is optimized for repeated call. */
 	kvm_mmu_reload(vcpu);
 
 	r = kvm_tdp_map_page(vcpu, mapping->base_address, error_code, &level);
 	if (r)
-		return r;
+		goto out;
 
 	/* mapping->base_address is not necessarily aligned to level-hugepage. */
 	end = (mapping->base_address & KVM_HPAGE_MASK(level)) +
 		KVM_HPAGE_SIZE(level);
 	mapping->size -= end - mapping->base_address;
 	mapping->base_address = end;
+
+out:
+	/* Restore MMU state. */
+	if (is_smm || mmu) {
+		vcpu->arch.hflags |= is_smm ? HF_SMM_MASK : 0;
+		vcpu->arch.mmu = mmu;
+		vcpu->arch.walk_mmu = walk_mmu;
+		kvm_mmu_reset_context(vcpu);
+	}
 	return r;
 }