Message ID: 1556762959-31705-1-git-send-email-jintack@cs.columbia.edu (mailing list archive)
State: New, archived
Series: KVM: nVMX: Set msr bitmap correctly for MSR_FS_BASE in vmcs02
+Cc Jim

On Wed, May 01, 2019 at 10:09:19PM -0400, Jintack Lim wrote:
> Even when neither L0 nor L1 configured to trap MSR_FS_BASE writes from
> its own VMs, the current KVM L0 always traps MSR_FS_BASE writes from L2.
> Let's check if both L0 and L1 disabled trap for MSR_FS_BASE for its VMs
> respectively, and let L2 write to MSR_FS_BASE without trap if that's the
> case.
>
> Signed-off-by: Jintack Lim <jintack@cs.columbia.edu>
> ---
>  arch/x86/kvm/vmx/nested.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 0c601d0..ab85aea 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -537,6 +537,7 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
>  	 */
>  	bool pred_cmd = !msr_write_intercepted_l01(vcpu, MSR_IA32_PRED_CMD);
>  	bool spec_ctrl = !msr_write_intercepted_l01(vcpu, MSR_IA32_SPEC_CTRL);
> +	bool fs_base = !msr_write_intercepted_l01(vcpu, MSR_FS_BASE);

This isn't sufficient as we only fall into this code if L2 is in x2APIC
mode or has accessed the speculation MSRs.  The quick fix is to check if
we want to pass through MSR_FS_BASE, but if we're going to open up the
floodgates then we should pass through as many MSRs as possible, e.g.
GS_BASE, KERNEL_GS_BASE, TSC, SYSENTER_*, etc..., and do so using a
generic mechanism.

That being said, I think there are other reasons why KVM doesn't pass
through MSRs to L2.  Unfortunately, I'm struggling to recall what those
reasons are.

Jim, I'm pretty sure you've looked at this code a lot, do you happen to
know off hand?  Is it purely a performance thing to avoid merging bitmaps
on every nested entry, is there a subtle bug/security hole, or is it
simply that no one has ever gotten around to writing the code?

>
> 	/* Nothing to do if the MSR bitmap is not in use. */
> 	if (!cpu_has_vmx_msr_bitmap() ||
> @@ -592,6 +593,12 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
> 		}
> 	}
>
> +	if (fs_base)
> +		nested_vmx_disable_intercept_for_msr(
> +			msr_bitmap_l1, msr_bitmap_l0,
> +			MSR_FS_BASE,
> +			MSR_TYPE_W);

This should be MSR_TYPE_RW.

> +
> 	if (spec_ctrl)
> 		nested_vmx_disable_intercept_for_msr(
> 			msr_bitmap_l1, msr_bitmap_l0,
> --
> 1.9.1
On Thu, May 2, 2019 at 11:03 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> +Cc Jim
>
> On Wed, May 01, 2019 at 10:09:19PM -0400, Jintack Lim wrote:
> > Even when neither L0 nor L1 configured to trap MSR_FS_BASE writes from
> > its own VMs, the current KVM L0 always traps MSR_FS_BASE writes from L2.
> > Let's check if both L0 and L1 disabled trap for MSR_FS_BASE for its VMs
> > respectively, and let L2 write to MSR_FS_BASE without trap if that's the
> > case.
> >
> > Signed-off-by: Jintack Lim <jintack@cs.columbia.edu>
> > ---
> >  arch/x86/kvm/vmx/nested.c | 7 +++++++
> >  1 file changed, 7 insertions(+)
> >
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index 0c601d0..ab85aea 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -537,6 +537,7 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
> >  	 */
> >  	bool pred_cmd = !msr_write_intercepted_l01(vcpu, MSR_IA32_PRED_CMD);
> >  	bool spec_ctrl = !msr_write_intercepted_l01(vcpu, MSR_IA32_SPEC_CTRL);
> > +	bool fs_base = !msr_write_intercepted_l01(vcpu, MSR_FS_BASE);
>
> This isn't sufficient as we only fall into this code if L2 is in x2APIC
> mode or has accessed the speculation MSRs.  The quick fix is to check if
> we want to pass through MSR_FS_BASE, but if we're going to open up the
> floodgates then we should pass through as many MSRs as possible, e.g.
> GS_BASE, KERNEL_GS_BASE, TSC, SYSENTER_*, etc..., and do so using a
> generic mechanism.

Thanks Sean for the review. I agree that the fix should be passing
through as many MSRs as possible.

> That being said, I think there are other reasons why KVM doesn't pass
> through MSRs to L2.  Unfortunately, I'm struggling to recall what those
> reasons are.

I'm also curious if there are other reasons!

> Jim, I'm pretty sure you've looked at this code a lot, do you happen to
> know off hand?  Is it purely a performance thing to avoid merging bitmaps
> on every nested entry, is there a subtle bug/security hole, or is it
> simply that no one has ever gotten around to writing the code?
>
> >
> > 	/* Nothing to do if the MSR bitmap is not in use. */
> > 	if (!cpu_has_vmx_msr_bitmap() ||
> > @@ -592,6 +593,12 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
> > 		}
> > 	}
> >
> > +	if (fs_base)
> > +		nested_vmx_disable_intercept_for_msr(
> > +			msr_bitmap_l1, msr_bitmap_l0,
> > +			MSR_FS_BASE,
> > +			MSR_TYPE_W);
>
> This should be MSR_TYPE_RW.
>
> > +
> > 	if (spec_ctrl)
> > 		nested_vmx_disable_intercept_for_msr(
> > 			msr_bitmap_l1, msr_bitmap_l0,
> > --
> > 1.9.1
On Thu, May 2, 2019 at 8:03 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:

> That being said, I think there are other reasons why KVM doesn't pass
> through MSRs to L2.  Unfortunately, I'm struggling to recall what those
> reasons are.
>
> Jim, I'm pretty sure you've looked at this code a lot, do you happen to
> know off hand?  Is it purely a performance thing to avoid merging bitmaps
> on every nested entry, is there a subtle bug/security hole, or is it
> simply that no one has ever gotten around to writing the code?

I'm not aware of any subtle bugs or security holes. If L1 changes the
VMCS12 MSR permission bitmaps while L2 is running, behavior is unlikely
to match hardware, but this is clearly in "undefined behavior" territory
anyway. IIRC, the posted interrupt structures are the only thing hanging
off of the VMCS that can legally be modified while a logical processor
with that VMCS active is in VMX non-root operation.

I agree that FS_BASE, GS_BASE, and KERNEL_GS_BASE, at the very least,
are worthy of special treatment. Fortunately, their permission bits are
all in the same quadword. Some of the others, like the SYSENTER and
SYSCALL MSRs, are rarely modified by a typical (non-hypervisor) OS. For
nested performance at levels deeper than L2, they might still prove
interesting.

Basically, I think no one has gotten around to writing the code.
On Thu, May 02, 2019 at 10:59:16AM -0700, Jim Mattson wrote:
> On Thu, May 2, 2019 at 8:03 AM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
>
> > That being said, I think there are other reasons why KVM doesn't pass
> > through MSRs to L2.  Unfortunately, I'm struggling to recall what those
> > reasons are.
> >
> > Jim, I'm pretty sure you've looked at this code a lot, do you happen to
> > know off hand?  Is it purely a performance thing to avoid merging bitmaps
> > on every nested entry, is there a subtle bug/security hole, or is it
> > simply that no one has ever gotten around to writing the code?
>
> I'm not aware of any subtle bugs or security holes.  If L1 changes the
> VMCS12 MSR permission bitmaps while L2 is running, behavior is
> unlikely to match hardware, but this is clearly in "undefined
> behavior" territory anyway.  IIRC, the posted interrupt structures are
> the only thing hanging off of the VMCS that can legally be modified
> while a logical processor with that VMCS active is in VMX non-root
> operation.

Cool, thanks!

> I agree that FS_BASE, GS_BASE, and KERNEL_GS_BASE, at the very least,
> are worthy of special treatment.  Fortunately, their permission bits
> are all in the same quadword.  Some of the others, like the SYSENTER
> and SYSCALL MSRs are rarely modified by a typical (non-hypervisor) OS.
> For nested performance at levels deeper than L2, they might still
> prove interesting.

Agreed on the *_BASE MSRs.  Rarely written MSRs should be intercepted
so that the nested VM-Exit path doesn't need to read them from vmcs02
on every exit (WIP).  I'm playing with the nested code right now and
one of the things I'm realizing is that KVM spends an absurd amount of
time copying data to/from VMCSes for fields that are almost never
accessed by L0 or L1.

> Basically, I think no one has gotten around to writing the code.
On Thu, May 2, 2019 at 5:06 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Thu, May 02, 2019 at 10:59:16AM -0700, Jim Mattson wrote:
> > On Thu, May 2, 2019 at 8:03 AM Sean Christopherson
> > <sean.j.christopherson@intel.com> wrote:
> >
> > > That being said, I think there are other reasons why KVM doesn't pass
> > > through MSRs to L2.  Unfortunately, I'm struggling to recall what those
> > > reasons are.
> > >
> > > Jim, I'm pretty sure you've looked at this code a lot, do you happen to
> > > know off hand?  Is it purely a performance thing to avoid merging bitmaps
> > > on every nested entry, is there a subtle bug/security hole, or is it
> > > simply that no one has ever gotten around to writing the code?
> >
> > I'm not aware of any subtle bugs or security holes.  If L1 changes the
> > VMCS12 MSR permission bitmaps while L2 is running, behavior is
> > unlikely to match hardware, but this is clearly in "undefined
> > behavior" territory anyway.  IIRC, the posted interrupt structures are
> > the only thing hanging off of the VMCS that can legally be modified
> > while a logical processor with that VMCS active is in VMX non-root
> > operation.
>
> Cool, thanks!
>
> > I agree that FS_BASE, GS_BASE, and KERNEL_GS_BASE, at the very least,
> > are worthy of special treatment.  Fortunately, their permission bits
> > are all in the same quadword.  Some of the others, like the SYSENTER
> > and SYSCALL MSRs are rarely modified by a typical (non-hypervisor) OS.
> > For nested performance at levels deeper than L2, they might still
> > prove interesting.
>
> Agreed on the *_BASE MSRs.  Rarely written MSRs should be intercepted
> so that the nested VM-Exit path doesn't need to read them from vmcs02
> on every exit (WIP).

Ok, let me handle those three *_BASE MSRs in v2.

> I'm playing with the nested code right now and
> one of the things I'm realizing is that KVM spends an absurd amount of
> time copying data to/from VMCSes for fields that are almost never
> accessed by L0 or L1.
>
> > Basically, I think no one has gotten around to writing the code.
On Thu, May 2, 2019 at 11:03 AM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> +Cc Jim
>
> On Wed, May 01, 2019 at 10:09:19PM -0400, Jintack Lim wrote:
> > Even when neither L0 nor L1 configured to trap MSR_FS_BASE writes from
> > its own VMs, the current KVM L0 always traps MSR_FS_BASE writes from L2.
> > Let's check if both L0 and L1 disabled trap for MSR_FS_BASE for its VMs
> > respectively, and let L2 write to MSR_FS_BASE without trap if that's the
> > case.
> >
> > Signed-off-by: Jintack Lim <jintack@cs.columbia.edu>
> > ---
> >  arch/x86/kvm/vmx/nested.c | 7 +++++++
> >  1 file changed, 7 insertions(+)
> >
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index 0c601d0..ab85aea 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -537,6 +537,7 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
> >  	 */
> >  	bool pred_cmd = !msr_write_intercepted_l01(vcpu, MSR_IA32_PRED_CMD);
> >  	bool spec_ctrl = !msr_write_intercepted_l01(vcpu, MSR_IA32_SPEC_CTRL);
> > +	bool fs_base = !msr_write_intercepted_l01(vcpu, MSR_FS_BASE);
>
> This isn't sufficient as we only fall into this code if L2 is in x2APIC
> mode or has accessed the speculation MSRs.  The quick fix is to check if
> we want to pass through MSR_FS_BASE, but if we're going to open up the
> floodgates then we should pass through as many MSRs as possible, e.g.
> GS_BASE, KERNEL_GS_BASE, TSC, SYSENTER_*, etc..., and do so using a
> generic mechanism.
>
> That being said, I think there are other reasons why KVM doesn't pass
> through MSRs to L2.  Unfortunately, I'm struggling to recall what those
> reasons are.
>
> Jim, I'm pretty sure you've looked at this code a lot, do you happen to
> know off hand?  Is it purely a performance thing to avoid merging bitmaps
> on every nested entry, is there a subtle bug/security hole, or is it
> simply that no one has ever gotten around to writing the code?
>
> >
> > 	/* Nothing to do if the MSR bitmap is not in use. */
> > 	if (!cpu_has_vmx_msr_bitmap() ||
> > @@ -592,6 +593,12 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
> > 		}
> > 	}
> >
> > +	if (fs_base)
> > +		nested_vmx_disable_intercept_for_msr(
> > +			msr_bitmap_l1, msr_bitmap_l0,
> > +			MSR_FS_BASE,
> > +			MSR_TYPE_W);
>
> This should be MSR_TYPE_RW.

Should I also explicitly check whether L1 disabled the read intercept,
in addition to the write intercept?  It seems like the current code only
checks write operations for spec_ctrl while it disables the intercept
for RW; this is not the case for pred_cmd, though.  I might be missing
something here.  Could you explain it?

>
> > +
> > 	if (spec_ctrl)
> > 		nested_vmx_disable_intercept_for_msr(
> > 			msr_bitmap_l1, msr_bitmap_l0,
> > --
> > 1.9.1
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 0c601d0..ab85aea 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -537,6 +537,7 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
 	 */
 	bool pred_cmd = !msr_write_intercepted_l01(vcpu, MSR_IA32_PRED_CMD);
 	bool spec_ctrl = !msr_write_intercepted_l01(vcpu, MSR_IA32_SPEC_CTRL);
+	bool fs_base = !msr_write_intercepted_l01(vcpu, MSR_FS_BASE);
 
 	/* Nothing to do if the MSR bitmap is not in use. */
 	if (!cpu_has_vmx_msr_bitmap() ||
@@ -592,6 +593,12 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
 		}
 	}
 
+	if (fs_base)
+		nested_vmx_disable_intercept_for_msr(
+			msr_bitmap_l1, msr_bitmap_l0,
+			MSR_FS_BASE,
+			MSR_TYPE_W);
+
 	if (spec_ctrl)
 		nested_vmx_disable_intercept_for_msr(
 			msr_bitmap_l1, msr_bitmap_l0,
Even when neither L0 nor L1 configured to trap MSR_FS_BASE writes from
its own VMs, the current KVM L0 always traps MSR_FS_BASE writes from L2.
Let's check if both L0 and L1 disabled trap for MSR_FS_BASE for its VMs
respectively, and let L2 write to MSR_FS_BASE without trap if that's the
case.

Signed-off-by: Jintack Lim <jintack@cs.columbia.edu>
---
 arch/x86/kvm/vmx/nested.c | 7 +++++++
 1 file changed, 7 insertions(+)