
KVM: nVMX: Set msr bitmap correctly for MSR_FS_BASE in vmcs02

Message ID 1556762959-31705-1-git-send-email-jintack@cs.columbia.edu (mailing list archive)
State New, archived
Series KVM: nVMX: Set msr bitmap correctly for MSR_FS_BASE in vmcs02

Commit Message

Jintack Lim May 2, 2019, 2:09 a.m. UTC
Even when neither L0 nor L1 is configured to trap MSR_FS_BASE writes from
its own VMs, the current KVM L0 always traps MSR_FS_BASE writes from L2.
Check whether both L0 and L1 have disabled the MSR_FS_BASE write trap for
their respective VMs, and if so, let L2 write MSR_FS_BASE without
trapping.

Signed-off-by: Jintack Lim <jintack@cs.columbia.edu>
---
 arch/x86/kvm/vmx/nested.c | 7 +++++++
 1 file changed, 7 insertions(+)
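Conceptually, the vmcs02 bitmap built by nested_vmx_prepare_msr_bitmap must intercept an MSR access whenever either L0 or L1 wants it intercepted (a set bit means "intercept"). A minimal standalone sketch of that merge semantics, with hypothetical helper names rather than KVM's actual code:

```c
#include <stdint.h>
#include <stddef.h>

#define VMX_MSR_BITMAP_BYTES 4096  /* the VMX MSR bitmap is one 4-KiB page */

/*
 * Illustrative merge: the bitmap used while running L2 (vmcs02) is the
 * bitwise OR of L0's and L1's bitmaps, so an access traps if either
 * level asked for the intercept.
 */
static void merge_msr_bitmaps(const uint8_t *l0, const uint8_t *l1,
                              uint8_t *l02)
{
	for (size_t i = 0; i < VMX_MSR_BITMAP_BYTES; i++)
		l02[i] = l0[i] | l1[i];
}
```

The real function is more selective than a full-page OR, but the patch's condition follows from these semantics: L2 may write MSR_FS_BASE without trapping only when the bit is clear in both source bitmaps.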

Comments

Sean Christopherson May 2, 2019, 3:03 p.m. UTC | #1
+Cc Jim

On Wed, May 01, 2019 at 10:09:19PM -0400, Jintack Lim wrote:
> Even when neither L0 nor L1 configured to trap MSR_FS_BASE writes from
> its own VMs, the current KVM L0 always traps MSR_FS_BASE writes from L2.
> Let's check if both L0 and L1 disabled trap for MSR_FS_BASE for its VMs
> respectively, and let L2 write to MSR_FS_BASE without trap if that's the
> case.
> 
> Signed-off-by: Jintack Lim <jintack@cs.columbia.edu>
> ---
>  arch/x86/kvm/vmx/nested.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 0c601d0..ab85aea 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -537,6 +537,7 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
>  	 */
>  	bool pred_cmd = !msr_write_intercepted_l01(vcpu, MSR_IA32_PRED_CMD);
>  	bool spec_ctrl = !msr_write_intercepted_l01(vcpu, MSR_IA32_SPEC_CTRL);
> +	bool fs_base = !msr_write_intercepted_l01(vcpu, MSR_FS_BASE);

This isn't sufficient as we only fall into this code if L2 is in x2APIC
mode or has accessed the speculation MSRs.  The quick fix is to check if
we want to pass through MSR_FS_BASE, but if we're going to open up the
floodgates then we should pass through as many MSRs as possible, e.g.
GS_BASE, KERNEL_GS_BASE, TSC, SYSENTER_*, etc..., and do so using a
generic mechanism.

That being said, I think there are other reasons why KVM doesn't pass
through MSRs to L2.  Unfortunately, I'm struggling to recall what those
reasons are.

Jim, I'm pretty sure you've looked at this code a lot, do you happen to
know off hand?  Is it purely a performance thing to avoid merging bitmaps
on every nested entry, is there a subtle bug/security hole, or is it
simply that no one has ever gotten around to writing the code?
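A generic mechanism along the lines Sean suggests could walk a table of candidate pass-through MSRs and clear each write-intercept bit that L1 also leaves clear. The sketch below is standalone and hypothetical (the helper names are not KVM's); the byte/bit arithmetic follows the architectural VMX MSR-bitmap layout: read-low at offset 0x000, read-high at 0x400, write-low at 0x800, write-high at 0xC00, covering MSRs 0x0-0x1FFF and 0xC0000000-0xC0001FFF.

```c
#include <stdint.h>

/* Architectural MSR numbers. */
#define MSR_FS_BASE        0xc0000100u
#define MSR_GS_BASE        0xc0000101u
#define MSR_KERNEL_GS_BASE 0xc0000102u

/*
 * Byte offset of the write-intercept bit for `msr` inside the 4-KiB VMX
 * MSR bitmap, with the bit index within that byte stored in *bit.
 * Returns -1 for MSRs the bitmap does not cover (those always exit).
 */
static int msr_write_intercept_pos(uint32_t msr, int *bit)
{
	uint32_t off, base;

	if (msr <= 0x1fffu) {                       /* "low" MSRs */
		off  = msr;
		base = 0x800;                       /* write-low area */
	} else if (msr - 0xc0000000u <= 0x1fffu) {  /* "high" MSRs */
		off  = msr - 0xc0000000u;
		base = 0xc00;                       /* write-high area */
	} else {
		return -1;
	}
	*bit = off & 7;
	return (int)(base + off / 8);
}

/*
 * Hypothetical generic pass-through: for every listed MSR whose write
 * L1 does not intercept, clear the write-intercept bit in the vmcs02
 * bitmap as well.
 */
static void passthrough_writes(uint8_t *bitmap_l02, const uint8_t *bitmap_l1,
			       const uint32_t *msrs, int n)
{
	for (int i = 0; i < n; i++) {
		int bit;
		int byte = msr_write_intercept_pos(msrs[i], &bit);

		if (byte >= 0 && !(bitmap_l1[byte] & (1u << bit)))
			bitmap_l02[byte] &= ~(1u << bit);
	}
}
```

Extending the table to GS_BASE, KERNEL_GS_BASE, TSC, and the SYSENTER MSRs then costs one table entry each instead of one hand-written `if` block per MSR.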

>  
>  	/* Nothing to do if the MSR bitmap is not in use.  */
>  	if (!cpu_has_vmx_msr_bitmap() ||
> @@ -592,6 +593,12 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
>  		}
>  	}
>  
> +	if (fs_base)
> +		nested_vmx_disable_intercept_for_msr(
> +					msr_bitmap_l1, msr_bitmap_l0,
> +					MSR_FS_BASE,
> +					MSR_TYPE_W);

This should be MSR_TYPE_RW.

> +
>  	if (spec_ctrl)
>  		nested_vmx_disable_intercept_for_msr(
>  					msr_bitmap_l1, msr_bitmap_l0,
> -- 
> 1.9.1
> 
>
Jintack Lim May 2, 2019, 3:39 p.m. UTC | #2
On Thu, May 2, 2019 at 11:03 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> +Cc Jim
>
> On Wed, May 01, 2019 at 10:09:19PM -0400, Jintack Lim wrote:
> > Even when neither L0 nor L1 configured to trap MSR_FS_BASE writes from
> > its own VMs, the current KVM L0 always traps MSR_FS_BASE writes from L2.
> > Let's check if both L0 and L1 disabled trap for MSR_FS_BASE for its VMs
> > respectively, and let L2 write to MSR_FS_BASE without trap if that's the
> > case.
> >
> > Signed-off-by: Jintack Lim <jintack@cs.columbia.edu>
> > ---
> >  arch/x86/kvm/vmx/nested.c | 7 +++++++
> >  1 file changed, 7 insertions(+)
> >
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index 0c601d0..ab85aea 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -537,6 +537,7 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
> >        */
> >       bool pred_cmd = !msr_write_intercepted_l01(vcpu, MSR_IA32_PRED_CMD);
> >       bool spec_ctrl = !msr_write_intercepted_l01(vcpu, MSR_IA32_SPEC_CTRL);
> > +     bool fs_base = !msr_write_intercepted_l01(vcpu, MSR_FS_BASE);
>
> This isn't sufficient as we only fall into this code if L2 is in x2APIC
> mode or has accessed the speculation MSRs.  The quick fix is to check if
> we want to pass through MSR_FS_BASE, but if we're going to open up the
> floodgates then we should pass through as many MSRs as possible, e.g.
> GS_BASE, KERNEL_GS_BASE, TSC, SYSENTER_*, etc..., and do so using a
> generic mechanism.

Thanks Sean for the review.

I agree that the fix should be passing through as many MSRs as possible.

>
> That being said, I think there are other reasons why KVM doesn't pass
> through MSRs to L2.  Unfortunately, I'm struggling to recall what those
> reasons are.

I'm also curious if there are other reasons!

>
> Jim, I'm pretty sure you've looked at this code a lot, do you happen to
> know off hand?  Is it purely a performance thing to avoid merging bitmaps
> on every nested entry, is there a subtle bug/security hole, or is it
> simply that no one has ever gotten around to writing the code?
>
> >
> >       /* Nothing to do if the MSR bitmap is not in use.  */
> >       if (!cpu_has_vmx_msr_bitmap() ||
> > @@ -592,6 +593,12 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
> >               }
> >       }
> >
> > +     if (fs_base)
> > +             nested_vmx_disable_intercept_for_msr(
> > +                                     msr_bitmap_l1, msr_bitmap_l0,
> > +                                     MSR_FS_BASE,
> > +                                     MSR_TYPE_W);
>
> This should be MSR_TYPE_RW.
>
> > +
> >       if (spec_ctrl)
> >               nested_vmx_disable_intercept_for_msr(
> >                                       msr_bitmap_l1, msr_bitmap_l0,
> > --
> > 1.9.1
> >
> >
>
Jim Mattson May 2, 2019, 5:59 p.m. UTC | #3
On Thu, May 2, 2019 at 8:03 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:

> That being said, I think there are other reasons why KVM doesn't pass
> through MSRs to L2.  Unfortunately, I'm struggling to recall what those
> reasons are.
>
> Jim, I'm pretty sure you've looked at this code a lot, do you happen to
> know off hand?  Is it purely a performance thing to avoid merging bitmaps
> on every nested entry, is there a subtle bug/security hole, or is it
> simply that no one has ever gotten around to writing the code?

I'm not aware of any subtle bugs or security holes. If L1 changes the
VMCS12 MSR permission bitmaps while L2 is running, behavior is
unlikely to match hardware, but this is clearly in "undefined
behavior" territory anyway. IIRC, the posted interrupt structures are
the only thing hanging off of the VMCS that can legally be modified
while a logical processor with that VMCS active is in VMX non-root
operation.

I agree that FS_BASE, GS_BASE, and KERNEL_GS_BASE, at the very least,
are worthy of special treatment. Fortunately, their permission bits
are all in the same quadword. Some of the others, like the SYSENTER
and SYSCALL MSRs are rarely modified by a typical (non-hypervisor) OS.
For nested performance at levels deeper than L2, they might still
prove interesting.
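Jim's "same quadword" observation can be checked directly: FS_BASE (0xC0000100), GS_BASE (0xC0000101), and KERNEL_GS_BASE (0xC0000102) are all "high" MSRs, and their bit indices relative to the high area (0x100-0x102) fall within one 64-bit word. A small illustrative check:

```c
#include <stdint.h>

/* Bit index of a "high" MSR (0xC0000000-based) within the high half of
 * the VMX MSR bitmap. */
static uint32_t high_msr_bit(uint32_t msr)
{
	return msr - 0xc0000000u;
}

/* Index of the 64-bit word holding that bit. */
static uint32_t high_msr_qword(uint32_t msr)
{
	return high_msr_bit(msr) / 64;
}
```

Because all three bits live in quadword 4 of the write-high area, a single aligned 64-bit read-modify-write could update the permissions for all three MSRs at once.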

Basically, I think no one has gotten around to writing the code.
Sean Christopherson May 2, 2019, 9:06 p.m. UTC | #4
On Thu, May 02, 2019 at 10:59:16AM -0700, Jim Mattson wrote:
> On Thu, May 2, 2019 at 8:03 AM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> 
> > That being said, I think there are other reasons why KVM doesn't pass
> > through MSRs to L2.  Unfortunately, I'm struggling to recall what those
> > reasons are.
> >
> > Jim, I'm pretty sure you've looked at this code a lot, do you happen to
> > know off hand?  Is it purely a performance thing to avoid merging bitmaps
> > on every nested entry, is there a subtle bug/security hole, or is it
> > simply that no one has ever gotten around to writing the code?
> 
> I'm not aware of any subtle bugs or security holes. If L1 changes the
> VMCS12 MSR permission bitmaps while L2 is running, behavior is
> unlikely to match hardware, but this is clearly in "undefined
> behavior" territory anyway. IIRC, the posted interrupt structures are
> the only thing hanging off of the VMCS that can legally be modified
> while a logical processor with that VMCS active is in VMX non-root
> operation.

Cool, thanks!

> I agree that FS_BASE, GS_BASE, and KERNEL_GS_BASE, at the very least,
> are worthy of special treatment. Fortunately, their permission bits
> are all in the same quadword. Some of the others, like the SYSENTER
> and SYSCALL MSRs are rarely modified by a typical (non-hypervisor) OS.
> For nested performance at levels deeper than L2, they might still
> prove interesting.

Agreed on the *_BASE MSRs.  Rarely written MSRs should be intercepted
so that the nested VM-Exit path doesn't need to read them from vmcs02
on every exit (WIP).  I'm playing with the nested code right now and
one of the things I'm realizing is that KVM spends an absurd amount of
time copying data to/from VMCSes for fields that are almost never
accessed by L0 or L1.

> Basically, I think no one has gotten around to writing the code.
Jintack Lim May 3, 2019, 12:02 a.m. UTC | #5
On Thu, May 2, 2019 at 5:06 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Thu, May 02, 2019 at 10:59:16AM -0700, Jim Mattson wrote:
> > On Thu, May 2, 2019 at 8:03 AM Sean Christopherson
> > <sean.j.christopherson@intel.com> wrote:
> >
> > > That being said, I think there are other reasons why KVM doesn't pass
> > > through MSRs to L2.  Unfortunately, I'm struggling to recall what those
> > > reasons are.
> > >
> > > Jim, I'm pretty sure you've looked at this code a lot, do you happen to
> > > know off hand?  Is it purely a performance thing to avoid merging bitmaps
> > > on every nested entry, is there a subtle bug/security hole, or is it
> > > simply that no one has ever gotten around to writing the code?
> >
> > I'm not aware of any subtle bugs or security holes. If L1 changes the
> > VMCS12 MSR permission bitmaps while L2 is running, behavior is
> > unlikely to match hardware, but this is clearly in "undefined
> > behavior" territory anyway. IIRC, the posted interrupt structures are
> > the only thing hanging off of the VMCS that can legally be modified
> > while a logical processor with that VMCS active is in VMX non-root
> > operation.
>
> Cool, thanks!
>
> > I agree that FS_BASE, GS_BASE, and KERNEL_GS_BASE, at the very least,
> > are worthy of special treatment. Fortunately, their permission bits
> > are all in the same quadword. Some of the others, like the SYSENTER
> > and SYSCALL MSRs are rarely modified by a typical (non-hypervisor) OS.
> > For nested performance at levels deeper than L2, they might still
> > prove interesting.
>
> Agreed on the *_BASE MSRs.  Rarely written MSRs should be intercepted
> so that the nested VM-Exit path doesn't need to read them from vmcs02
> on every exit (WIP).

Ok, let me handle those three *_BASE MSRs in v2.

> I'm playing with the nested code right now and
> one of the things I'm realizing is that KVM spends an absurd amount of
> time copying data to/from VMCSes for fields that are almost never
> accessed by L0 or L1.
>
> > Basically, I think no one has gotten around to writing the code.
>
Jintack Lim May 3, 2019, 12:21 a.m. UTC | #6
On Thu, May 2, 2019 at 11:03 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> +Cc Jim
>
> On Wed, May 01, 2019 at 10:09:19PM -0400, Jintack Lim wrote:
> > Even when neither L0 nor L1 configured to trap MSR_FS_BASE writes from
> > its own VMs, the current KVM L0 always traps MSR_FS_BASE writes from L2.
> > Let's check if both L0 and L1 disabled trap for MSR_FS_BASE for its VMs
> > respectively, and let L2 write to MSR_FS_BASE without trap if that's the
> > case.
> >
> > Signed-off-by: Jintack Lim <jintack@cs.columbia.edu>
> > ---
> >  arch/x86/kvm/vmx/nested.c | 7 +++++++
> >  1 file changed, 7 insertions(+)
> >
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index 0c601d0..ab85aea 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -537,6 +537,7 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
> >        */
> >       bool pred_cmd = !msr_write_intercepted_l01(vcpu, MSR_IA32_PRED_CMD);
> >       bool spec_ctrl = !msr_write_intercepted_l01(vcpu, MSR_IA32_SPEC_CTRL);
> > +     bool fs_base = !msr_write_intercepted_l01(vcpu, MSR_FS_BASE);
>
> This isn't sufficient as we only fall into this code if L2 is in x2APIC
> mode or has accessed the speculation MSRs.  The quick fix is to check if
> we want to pass through MSR_FS_BASE, but if we're going to open up the
> floodgates then we should pass through as many MSRs as possible, e.g.
> GS_BASE, KERNEL_GS_BASE, TSC, SYSENTER_*, etc..., and do so using a
> generic mechanism.
>
> That being said, I think there are other reasons why KVM doesn't pass
> through MSRs to L2.  Unfortunately, I'm struggling to recall what those
> reasons are.
>
> Jim, I'm pretty sure you've looked at this code a lot, do you happen to
> know off hand?  Is it purely a performance thing to avoid merging bitmaps
> on every nested entry, is there a subtle bug/security hole, or is it
> simply that no one has ever gotten around to writing the code?
>
> >
> >       /* Nothing to do if the MSR bitmap is not in use.  */
> >       if (!cpu_has_vmx_msr_bitmap() ||
> > @@ -592,6 +593,12 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
> >               }
> >       }
> >
> > +     if (fs_base)
> > +             nested_vmx_disable_intercept_for_msr(
> > +                                     msr_bitmap_l1, msr_bitmap_l0,
> > +                                     MSR_FS_BASE,
> > +                                     MSR_TYPE_W);
>
> This should be MSR_TYPE_RW.

Should I explicitly check if L1 disabled intercept for read operations
along with write operations?

It seems like the current code only checks write operations for
spec_ctrl while it disables intercept for RW. This is not the case for
pred_cmd, though. I might be missing something here. Could you explain
it?

>
> > +
> >       if (spec_ctrl)
> >               nested_vmx_disable_intercept_for_msr(
> >                                       msr_bitmap_l1, msr_bitmap_l0,
> > --
> > 1.9.1
> >
> >
>

Patch

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 0c601d0..ab85aea 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -537,6 +537,7 @@  static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
 	 */
 	bool pred_cmd = !msr_write_intercepted_l01(vcpu, MSR_IA32_PRED_CMD);
 	bool spec_ctrl = !msr_write_intercepted_l01(vcpu, MSR_IA32_SPEC_CTRL);
+	bool fs_base = !msr_write_intercepted_l01(vcpu, MSR_FS_BASE);
 
 	/* Nothing to do if the MSR bitmap is not in use.  */
 	if (!cpu_has_vmx_msr_bitmap() ||
@@ -592,6 +593,12 @@  static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
 		}
 	}
 
+	if (fs_base)
+		nested_vmx_disable_intercept_for_msr(
+					msr_bitmap_l1, msr_bitmap_l0,
+					MSR_FS_BASE,
+					MSR_TYPE_W);
+
 	if (spec_ctrl)
 		nested_vmx_disable_intercept_for_msr(
 					msr_bitmap_l1, msr_bitmap_l0,