Message ID: 20220422162103.32736-1-jon@nutanix.com (mailing list archive)
State: New, archived
Series: [v3] x86/speculation, KVM: only IBPB for switch_mm_always_ibpb on vCPU load
> On Apr 22, 2022, at 12:21 PM, Jon Kohler <jon@nutanix.com> wrote: > > On vmx_vcpu_load_vmcs and svm_vcpu_load, respect user controlled > configuration for conditional IBPB and only attempt IBPB MSR when > switching between different guest vCPUs IFF switch_mm_always_ibpb, > which fixes a situation where the kernel will issue IBPB > unconditionally even when conditional IBPB is enabled. > > If a user has spectre_v2_user mitigation enabled, in any > configuration, and the underlying processor supports X86_FEATURE_IBPB, > X86_FEATURE_USE_IBPB is set and any calls to > indirect_branch_prediction_barrier() will issue IBPB MSR. > > Depending on the spectre_v2_user configuration, either > switch_mm_always_ibpb key or switch_mm_cond_ibpb key will be set. > > Both switch_mm_always_ibpb and switch_mm_cond_ibpb are handled by > switch_mm() -> cond_mitigation(), which works well in cases where > switching vCPUs (i.e. switching tasks) also switches mm_struct; > however, this misses a paranoid case where user space may be running > multiple guests in a single process (i.e. single mm_struct). This > presents two issues: > > Issue 1: > This paranoid case is already covered by vmx_vcpu_load_vmcs and > svm_vcpu_load; however, this is done by calling > indirect_branch_prediction_barrier() and thus the kernel > unconditionally issues IBPB if X86_FEATURE_USE_IBPB is set. > > Issue 2: > For a conditional configuration, this paranoid case is nonsensical. > If userspace runs multiple VMs in the same process, enables cond_ipbp, > _and_ sets TIF_SPEC_IB, then isn't getting full protection in any case, > e.g. if userspace is handling an exit-to-userspace condition for two > vCPUs from different VMs, then the kernel could switch between those > two vCPUs' tasks without bouncing through KVM and thus without doing > KVM's IBPB. > > Fix both by using intermediary call to x86_virt_guest_switch_ibpb(), > which gates IBPB MSR IFF switch_mm_always_ibpb is true. > > switch_mm_cond_ibpb is intentionally ignored from the KVM code side > as it really is nonsensical given the common case is already well > covered by switch_mm(), so issuing an additional IBPB from KVM is > just pure overhead. > > Note: switch_mm_always_ibpb key is user controlled via spectre_v2_user > and will be true for the following configurations: > spectre_v2_user=on > spectre_v2_user=prctl,ibpb > spectre_v2_user=seccomp,ibpb > > Signed-off-by: Jon Kohler <jon@nutanix.com> > Cc: Sean Christopherson <seanjc@google.com> > Cc: Andrea Arcangeli <aarcange@redhat.com> > Cc: Kees Cook <keescook@chromium.org> > Cc: Josh Poimboeuf <jpoimboe@redhat.com> > Cc: Waiman Long <longman@redhat.com> > --- > v1 -> v2: > - Addressed comments on approach from Sean. > v2 -> v3: > - Updated spec-ctrl.h comments and commit msg to incorporate > additional feedback from Sean. 
> Gentle ping on this one, thanks, Jon > arch/x86/include/asm/spec-ctrl.h | 14 ++++++++++++++ > arch/x86/kernel/cpu/bugs.c | 6 +++++- > arch/x86/kvm/svm/svm.c | 2 +- > arch/x86/kvm/vmx/vmx.c | 2 +- > 4 files changed, 21 insertions(+), 3 deletions(-) > > diff --git a/arch/x86/include/asm/spec-ctrl.h b/arch/x86/include/asm/spec-ctrl.h > index 5393babc0598..99d3341d2e21 100644 > --- a/arch/x86/include/asm/spec-ctrl.h > +++ b/arch/x86/include/asm/spec-ctrl.h > @@ -85,4 +85,18 @@ static inline void speculative_store_bypass_ht_init(void) { } > extern void speculation_ctrl_update(unsigned long tif); > extern void speculation_ctrl_update_current(void); > > +/* > + * Issue IBPB when switching guest vCPUs IFF switch_mm_always_ibpb. > + * For the more common case of running VMs in their own dedicated process, > + * switching vCPUs that belong to different VMs, i.e. switching tasks, > + * will also switch mm_structs and thus do IPBP via cond_mitigation(); > + * however, in the always_ibpb case, take a paranoid approach and issue > + * IBPB on both switch_mm() and vCPU switch. > + */ > +static inline void x86_virt_guest_switch_ibpb(void) > +{ > + if (static_branch_unlikely(&switch_mm_always_ibpb)) > + indirect_branch_prediction_barrier(); > +} > + > #endif > diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c > index 6296e1ebed1d..6aafb0279cbc 100644 > --- a/arch/x86/kernel/cpu/bugs.c > +++ b/arch/x86/kernel/cpu/bugs.c > @@ -68,8 +68,12 @@ u64 __ro_after_init x86_amd_ls_cfg_ssbd_mask; > DEFINE_STATIC_KEY_FALSE(switch_to_cond_stibp); > /* Control conditional IBPB in switch_mm() */ > DEFINE_STATIC_KEY_FALSE(switch_mm_cond_ibpb); > -/* Control unconditional IBPB in switch_mm() */ > +/* Control unconditional IBPB in both switch_mm() and > + * x86_virt_guest_switch_ibpb(). > + * See notes on x86_virt_guest_switch_ibpb() for KVM use case details. > + */ > DEFINE_STATIC_KEY_FALSE(switch_mm_always_ibpb); > +EXPORT_SYMBOL_GPL(switch_mm_always_ibpb); > > /* Control MDS CPU buffer clear before returning to user space */ > DEFINE_STATIC_KEY_FALSE(mds_user_clear); > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c > index bd4c64b362d2..fc08c94df888 100644 > --- a/arch/x86/kvm/svm/svm.c > +++ b/arch/x86/kvm/svm/svm.c > @@ -1302,7 +1302,7 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu) > > if (sd->current_vmcb != svm->vmcb) { > sd->current_vmcb = svm->vmcb; > - indirect_branch_prediction_barrier(); > + x86_virt_guest_switch_ibpb(); > } > if (kvm_vcpu_apicv_active(vcpu)) > __avic_vcpu_load(vcpu, cpu); > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > index 04d170c4b61e..a8eed9b8221b 100644 > --- a/arch/x86/kvm/vmx/vmx.c > +++ b/arch/x86/kvm/vmx/vmx.c > @@ -1270,7 +1270,7 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu, > * The L1 VMM can protect itself with retpolines, IBPB or IBRS. > */ > if (!buddy || WARN_ON_ONCE(buddy->vmcs != prev)) > - indirect_branch_prediction_barrier(); > + x86_virt_guest_switch_ibpb(); > } > > if (!already_loaded) { > -- > 2.30.1 (Apple Git-130) >
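For context, the cond_mitigation() hook that the changelog leans on runs from switch_mm_irqs_off() in arch/x86/mm/tlb.c. A simplified sketch of its IBPB handling follows; helper and field names are paraphrased from the ~v5.17 code and the STIBP/L1D-flush parts are omitted, so treat it as illustrative rather than exact:

static void cond_mitigation(struct task_struct *next)
{
	unsigned long prev_mm, next_mm;

	/* Encode the incoming mm pointer together with its TIF_SPEC_* state. */
	next_mm = mm_mangle_tif_spec_bits(next);
	prev_mm = this_cpu_read(cpu_tlbstate.last_user_mm_spec);

	/*
	 * Conditional IBPB: barrier only when the mm changes and either the
	 * previous or the next task opted in via TIF_SPEC_IB (prctl/seccomp).
	 */
	if (static_branch_likely(&switch_mm_cond_ibpb)) {
		if (next_mm != prev_mm &&
		    (next_mm | prev_mm) & LAST_USER_MM_IBPB)
			indirect_branch_prediction_barrier();
	}

	/* Unconditional IBPB: barrier on every switch to a different mm. */
	if (static_branch_unlikely(&switch_mm_always_ibpb)) {
		if ((prev_mm & ~LAST_USER_MM_SPEC_MASK) !=
		    (unsigned long)next->mm)
			indirect_branch_prediction_barrier();
	}

	this_cpu_write(cpu_tlbstate.last_user_mm_spec, next_mm);
}

Either way the barrier is keyed to an mm change, which is why the patch argues that switching between vCPUs of different VMs in different processes is already covered, leaving only the shared-mm case for KVM to worry about.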
On Fri, Apr 22, 2022 at 12:21:01PM -0400, Jon Kohler wrote: > Both switch_mm_always_ibpb and switch_mm_cond_ibpb are handled by > switch_mm() -> cond_mitigation(), which works well in cases where > switching vCPUs (i.e. switching tasks) also switches mm_struct; > however, this misses a paranoid case where user space may be running > multiple guests in a single process (i.e. single mm_struct). You lost me here. I admit I'm no virt guy so you'll have to explain in more detail what use case that is that you want to support. What guests share mm_struct? What is the paranoid aspect here? You want to protect a single guest from all the others sharing a mm_struct? > +/* > + * Issue IBPB when switching guest vCPUs IFF switch_mm_always_ibpb. > + * For the more common case of running VMs in their own dedicated process, > + * switching vCPUs that belong to different VMs, i.e. switching tasks, > + * will also switch mm_structs and thus do IPBP via cond_mitigation(); > + * however, in the always_ibpb case, take a paranoid approach and issue > + * IBPB on both switch_mm() and vCPU switch. > + */ > +static inline void x86_virt_guest_switch_ibpb(void) > +{ > + if (static_branch_unlikely(&switch_mm_always_ibpb)) > + indirect_branch_prediction_barrier(); If this switch is going to be conditional, then make it so: static void x86_do_cond_ibpb(void) { if (static_branch_likely(&switch_mm_cond_ibpb)) indirect_branch_prediction_barrier(); } and there's nothing "virt" about it - might as well call the function what it does. And I'd put that function in bugs.c... > diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c > index 6296e1ebed1d..6aafb0279cbc 100644 > --- a/arch/x86/kernel/cpu/bugs.c > +++ b/arch/x86/kernel/cpu/bugs.c > @@ -68,8 +68,12 @@ u64 __ro_after_init x86_amd_ls_cfg_ssbd_mask; > DEFINE_STATIC_KEY_FALSE(switch_to_cond_stibp); > /* Control conditional IBPB in switch_mm() */ > DEFINE_STATIC_KEY_FALSE(switch_mm_cond_ibpb); > -/* Control unconditional IBPB in switch_mm() */ > +/* Control unconditional IBPB in both switch_mm() and > + * x86_virt_guest_switch_ibpb(). > + * See notes on x86_virt_guest_switch_ibpb() for KVM use case details. > + */ > DEFINE_STATIC_KEY_FALSE(switch_mm_always_ibpb); > +EXPORT_SYMBOL_GPL(switch_mm_always_ibpb); ... so that I don't export that key to modules. I'd like to have the big picture clarified first, why we need this, etc. Thx.
> On Apr 29, 2022, at 12:59 PM, Borislav Petkov <bp@alien8.de> wrote: > > On Fri, Apr 22, 2022 at 12:21:01PM -0400, Jon Kohler wrote: >> Both switch_mm_always_ibpb and switch_mm_cond_ibpb are handled by >> switch_mm() -> cond_mitigation(), which works well in cases where >> switching vCPUs (i.e. switching tasks) also switches mm_struct; >> however, this misses a paranoid case where user space may be running >> multiple guests in a single process (i.e. single mm_struct). > > You lost me here. I admit I'm no virt guy so you'll have to explain in > more detail what use case that is that you want to support. > > What guests share mm_struct? Selftests IIUC, but there may be other VMMs that do funny stuff. Said another way, I don’t think we actively restrict user space from doing this as far as I know. > > What is the paranoid aspect here? You want to protect a single guest > from all the others sharing a mm_struct? The paranoid aspect here is KVM is issuing an *additional* IBPB on top of what already happens in switch_mm(). IMHO KVM side IBPB for most use cases isn’t really necessarily but the general concept is that you want to protect vCPU from guest A from guest B, so you issue a prediction barrier on vCPU switch. *however* that protection already happens in switch_mm(), because guest A and B are likely to use different mm_struct, so the only point of having this support in KVM seems to be to “kill it with fire” for paranoid users who might be doing some tomfoolery that would somehow bypass switch_mm() protection (such as somehow sharing a struct). One argument could just say, let’s KISS principle and rip this out of KVM entirely, and users who do this shared mm_struct stuff can just know that their security profile is possibly less than it could be. That would certainly simplify the heck out of all of this, but you could probably set your watch to the next email saying that we broke someones use case and they’d be all grumpy. That’s why we’ve proposed keeping this to the always_ibpb path only, so that if you intentionally configure always_ibpb, you’re going to get barriers in both KVM and in switch_mm. Sean feel free to chime in if I missed the mark with this explanation, you’ve got a way with words for these things :) > >> +/* >> + * Issue IBPB when switching guest vCPUs IFF switch_mm_always_ibpb. >> + * For the more common case of running VMs in their own dedicated process, >> + * switching vCPUs that belong to different VMs, i.e. switching tasks, >> + * will also switch mm_structs and thus do IPBP via cond_mitigation(); >> + * however, in the always_ibpb case, take a paranoid approach and issue >> + * IBPB on both switch_mm() and vCPU switch. >> + */ >> +static inline void x86_virt_guest_switch_ibpb(void) >> +{ >> + if (static_branch_unlikely(&switch_mm_always_ibpb)) >> + indirect_branch_prediction_barrier(); > > If this switch is going to be conditional, then make it so: > > static void x86_do_cond_ibpb(void) > { > if (static_branch_likely(&switch_mm_cond_ibpb)) > indirect_branch_prediction_barrier(); > } In the context of this change, we only want to do this barrier on the always_ibpb path, so we wouldn’t want to do cont_ibpb here, but you bring up a good point, we could change it to x86_do_always_ibpb() and move it to bugs.c, that would simplify things. Thanks. > > and there's nothing "virt" about it - might as well call the function > what it does. And I'd put that function in bugs.c... 
> >> diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c >> index 6296e1ebed1d..6aafb0279cbc 100644 >> --- a/arch/x86/kernel/cpu/bugs.c >> +++ b/arch/x86/kernel/cpu/bugs.c >> @@ -68,8 +68,12 @@ u64 __ro_after_init x86_amd_ls_cfg_ssbd_mask; >> DEFINE_STATIC_KEY_FALSE(switch_to_cond_stibp); >> /* Control conditional IBPB in switch_mm() */ >> DEFINE_STATIC_KEY_FALSE(switch_mm_cond_ibpb); >> -/* Control unconditional IBPB in switch_mm() */ >> +/* Control unconditional IBPB in both switch_mm() and >> + * x86_virt_guest_switch_ibpb(). >> + * See notes on x86_virt_guest_switch_ibpb() for KVM use case details. >> + */ >> DEFINE_STATIC_KEY_FALSE(switch_mm_always_ibpb); >> +EXPORT_SYMBOL_GPL(switch_mm_always_ibpb); > > ... so that I don't export that key to modules. > > I'd like to have the big picture clarified first, why we need this, etc. No problem, and I appreciate the help and review! Let me know if the above makes sense and I’m happy to issue a v4 patch for this. Cheers, Jon > Thx. > > -- > Regards/Gruss, > Boris. > > https://urldefense.proofpoint.com/v2/url?u=https-3A__people.kernel.org_tglx_notes-2Dabout-2Dnetiquette&d=DwIBaQ&c=s883GpUCOChKOHiocYtGcg&r=NGPRGGo37mQiSXgHKm5rCQ&m=ZqhgyVXM5xkqhxRyZoFn5ed8_yDhRAKNqjt4Jthv1UJx7NCnlZAkK5_jJs4dN0WA&s=h_0nhBch2znONU8N13GxQyyvSuSSq2Kr7YFpjjR9tko&e=
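If the helper does end up in bugs.c as discussed above, one possible shape of a v4 (purely illustrative; the name follows the suggestion in this thread and no such patch had been posted at this point) would be to export the function rather than the static key:

/* arch/x86/kernel/cpu/bugs.c */
void x86_do_always_ibpb(void)
{
	if (static_branch_unlikely(&switch_mm_always_ibpb))
		indirect_branch_prediction_barrier();
}
EXPORT_SYMBOL_GPL(x86_do_always_ibpb);

/* arch/x86/include/asm/spec-ctrl.h */
extern void x86_do_always_ibpb(void);

That keeps switch_mm_always_ibpb private to the core kernel and addresses the objection to EXPORT_SYMBOL_GPL(switch_mm_always_ibpb) raised in the review.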
On Fri, Apr 29, 2022 at 05:31:16PM +0000, Jon Kohler wrote: > Selftests IIUC, but there may be other VMMs that do funny stuff. Said > another way, I don’t think we actively restrict user space from doing > this as far as I know. "selftests", "there may be"?! This doesn't sound like a real-life use case to me and we don't do changes just because. Sorry. > The paranoid aspect here is KVM is issuing an *additional* IBPB on > top of what already happens in switch_mm(). Yeah, I know how that works. > IMHO KVM side IBPB for most use cases isn’t really necessarily but > the general concept is that you want to protect vCPU from guest A > from guest B, so you issue a prediction barrier on vCPU switch. > > *however* that protection already happens in switch_mm(), because > guest A and B are likely to use different mm_struct, so the only point > of having this support in KVM seems to be to “kill it with fire” for > paranoid users who might be doing some tomfoolery that would > somehow bypass switch_mm() protection (such as somehow > sharing a struct). Yeah, no, this all sounds like something highly hypothetical or there's a use case of which you don't want to talk about publicly. Either way, from what I'm reading I'm not in the least convinced that this is needed.
> On Apr 29, 2022, at 3:32 PM, Borislav Petkov <bp@alien8.de> wrote: > > On Fri, Apr 29, 2022 at 05:31:16PM +0000, Jon Kohler wrote: >> Selftests IIUC, but there may be other VMMs that do funny stuff. Said >> another way, I don’t think we actively restrict user space from doing >> this as far as I know. > > "selftests", "there may be"?! > > This doesn't sound like a real-life use case to me and we don't do > changes just because. Sorry. I appreciate your direct feedback, thank you for helping here. Let’s separate the discussion into two parts. 1: Bug Fixing -> an IBPB is being unconditionally issued even when the user selects conditional. That needs to be fixed and this patch fixes that. 2: Design -> do we even want to have this IBPB here in KVM at all? In previous discussions (v1/v2 patch) with Sean, we talked about this not making a whole lot of sense in general; however, we landed on trying to not regress users who might, for whatever reason, care about this IBPB. I’ve shared a bit more detail on our use case below. I’m fine with nuking this IBPB entirely, just want to be mindful of use cases from the rest of the community that we might not normally cross in our day to day. I’m happy to take feedback on this and integrate it into a v4 patch for both of these parts, in terms of both code and correctness in the change log. > >> The paranoid aspect here is KVM is issuing an *additional* IBPB on >> top of what already happens in switch_mm(). > > Yeah, I know how that works. > >> IMHO KVM side IBPB for most use cases isn’t really necessarily but >> the general concept is that you want to protect vCPU from guest A >> from guest B, so you issue a prediction barrier on vCPU switch. >> >> *however* that protection already happens in switch_mm(), because >> guest A and B are likely to use different mm_struct, so the only point >> of having this support in KVM seems to be to “kill it with fire” for >> paranoid users who might be doing some tomfoolery that would >> somehow bypass switch_mm() protection (such as somehow >> sharing a struct). > > Yeah, no, this all sounds like something highly hypothetical or there's > a use case of which you don't want to talk about publicly. We’re an open book here, so I’m happy to share what we’re up to publicly. Our use case is 100% qemu-kvm, which is all separate processes/structs and is covered a-ok by the switch_mm() path. We noticed this bug in one of our scalability tests, which oversubscribes the host with many 2 vCPU VMs and runs a load runner that increases load one machine at a time, so that we can see the trend of response time of an app as host load increases. Given that the host is heavily oversubscribed, there is a lot of vCPU switching, and in particular switching in between vCPUs belonging to different guests, so that we hit this particular barrier in vcpu_load constantly. Taking this fix helped smoothed out that response time curve a bit. Happy to share more specific data if you’d like. > > Either way, from what I'm reading I'm not in the least convinced that > this is needed. > > -- > Regards/Gruss, > Boris. > > https://urldefense.proofpoint.com/v2/url?u=https-3A__people.kernel.org_tglx_notes-2Dabout-2Dnetiquette&d=DwIDaQ&c=s883GpUCOChKOHiocYtGcg&r=NGPRGGo37mQiSXgHKm5rCQ&m=tctUY3zgYEwUcPZ8E8v-EiXloK4PwYvUVBlR-amoRBEVZym6a2SuqyRYbNGF1_aZ&s=lQjy9h3G6eOqr2qEuAVvtX3LuDxW1kVJHdlkezCy3sU&e=
On Fri, Apr 29, 2022, Borislav Petkov wrote: > On Fri, Apr 29, 2022 at 05:31:16PM +0000, Jon Kohler wrote: > > Selftests IIUC, but there may be other VMMs that do funny stuff. Said > > another way, I don’t think we actively restrict user space from doing > > this as far as I know. > > "selftests", "there may be"?! > > This doesn't sound like a real-life use case to me and we don't do > changes just because. Sorry. > > > The paranoid aspect here is KVM is issuing an *additional* IBPB on > > top of what already happens in switch_mm(). > > Yeah, I know how that works. > > > IMHO KVM side IBPB for most use cases isn’t really necessarily but > > the general concept is that you want to protect vCPU from guest A > > from guest B, so you issue a prediction barrier on vCPU switch. > > > > *however* that protection already happens in switch_mm(), because > > guest A and B are likely to use different mm_struct, so the only point > > of having this support in KVM seems to be to “kill it with fire” for > > paranoid users who might be doing some tomfoolery that would > > somehow bypass switch_mm() protection (such as somehow > > sharing a struct). > > Yeah, no, this all sounds like something highly hypothetical or there's > a use case of which you don't want to talk about publicly. What Jon is trying to do is eliminate IBPB that already exists in KVM. The catch is that, in theory, someone not-Jon could be running multiple VMs in a single address space, e.g. VM-based containers. So if we simply delete the IBPB, then we could theoretically and silently break a user. That's why there's a bunch of hand-waving. > Either way, from what I'm reading I'm not in the least convinced that > this is needed. Can you clarify what "this" is? Does "this" mean "this patch", or does it mean "this IBPB when switching vCPUs"? Because if it means the latter, then I think you're in violent agreement; the IBPB when switching vCPUs is pointless and unnecessary.
On Fri, Apr 29, 2022 at 08:29:30PM +0000, Sean Christopherson wrote: > That's why there's a bunch of hand-waving. Well, I'm still not sure what this patch is trying to fix but both your latest replies do sound clearer... > Can you clarify what "this" is? Does "this" mean "this patch", or does it mean This patch. > "this IBPB when switching vCPUs"? Because if it means the latter, then I think > you're in violent agreement; the IBPB when switching vCPUs is pointless and > unnecessary. Ok, let's concentrate on the bug first - whether a second IBPB - so to speak - is needed. Doing some git archeology points to: 15d45071523d ("KVM/x86: Add IBPB support") which - and I'm surprised - goes to great lengths to explain what those IBPB calls in KVM protect against. From that commit message, for example: " * Mitigate attacks from guest/ring3->host/ring3. These would require a IBPB during context switch in host, or after VMEXIT." so with my very limited virt understanding, when you vmexit, you don't do switch_mm(), right? If so, you need to do a barrier. Regardless of conditional IBPB or not as you want to protect the host from a malicious guest. In general, the whole mitigation strategies are enumerated in Documentation/admin-guide/hw-vuln/spectre.rst There's also a "3. VM mitigation" section. And so on... Bottomline is this: at the time, we went to great lengths to document what the attacks are and how we are protecting against them. So now, if you want to change all that, I'd like to see - the attack scenario is this - we don't think it is relevant because... - therefore, we don't need to protect against it anymore and all that should be properly documented. Otherwise, there's no surviving this mess. Thx!
On Fri, Apr 29, 2022, Borislav Petkov wrote: > On Fri, Apr 29, 2022 at 08:29:30PM +0000, Sean Christopherson wrote: > > That's why there's a bunch of hand-waving. > > Well, I'm still not sure what this patch is trying to fix but both your > latest replies do sound clearer... > > > Can you clarify what "this" is? Does "this" mean "this patch", or does it mean > > This patch. > > > "this IBPB when switching vCPUs"? Because if it means the latter, then I think > > you're in violent agreement; the IBPB when switching vCPUs is pointless and > > unnecessary. > > Ok, let's concentrate on the bug first - whether a second IBPB - so to > speak - is needed. Doing some git archeology points to: > > 15d45071523d ("KVM/x86: Add IBPB support") > > which - and I'm surprised - goes to great lengths to explain what > those IBPB calls in KVM protect against. From that commit message, for > example: > > " * Mitigate attacks from guest/ring3->host/ring3. > These would require a IBPB during context switch in host, or after > VMEXIT." Except that snippet changelog doesn't actually state what KVM does, it states what a hypervsior _could_ do to protect the host from the guest via IBPB. > so with my very limited virt understanding, when you vmexit, you don't > do switch_mm(), right? Correct, but KVM also doesn't do IBPB on VM-Exit (or VM-Entry), nor does KVM do IBPB before exiting to userspace. The IBPB we want to whack is issued only when KVM is switching vCPUs. > If so, you need to do a barrier. Regardless of conditional IBPB or not > as you want to protect the host from a malicious guest. > > In general, the whole mitigation strategies are enumerated in > > Documentation/admin-guide/hw-vuln/spectre.rst > > There's also a "3. VM mitigation" section. > > And so on... > > Bottomline is this: at the time, we went to great lengths to document > what the attacks are and how we are protecting against them. Except that _none_ of that documentation explains why the hell KVM does IBPB when switching betwen vCPUs. The only item is this snippet from the changelog: * Mitigate guests from being attacked by other guests. - This is addressed by issing IBPB when we do a guest switch. And that's the one that I pointed out in v1 as being flawed/wrong, and how Jon ended up with this patch. : But stepping back, why does KVM do its own IBPB in the first place? The goal is : to prevent one vCPU from attacking the next vCPU run on the same pCPU. But unless : userspace is running multiple VMs in the same process/mm_struct, switching vCPUs, : i.e. switching tasks, will also switch mm_structs and thus do IPBP via cond_mitigation. : : If userspace runs multiple VMs in the same process, enables cond_ipbp, _and_ sets : TIF_SPEC_IB, then it's being stupid and isn't getting full protection in any case, : e.g. if userspace is handling an exit-to-userspace condition for two vCPUs from : different VMs, then the kernel could switch between those two vCPUs' tasks without : bouncing through KVM and thus without doing KVM's IBPB. : : I can kinda see doing this for always_ibpb, e.g. if userspace is unaware of spectre : and is naively running multiple VMs in the same process. : : What am I missing?
On Fri, Apr 29, 2022 at 09:59:52PM +0000, Sean Christopherson wrote: > Correct, but KVM also doesn't do IBPB on VM-Exit (or VM-Entry), Why doesn't it do that? Not needed? > nor does KVM do IBPB before exiting to userspace. Same question. > The IBPB we want to whack is issued only when KVM is switching vCPUs. Then please document it properly as I've already requested. > Except that _none_ of that documentation explains why the hell KVM > does IBPB when switching betwen vCPUs. Probably because the folks involved in those patches weren't the hell mainly virt people. Although I see a bunch of virt people on CC on that patch. > : But stepping back, why does KVM do its own IBPB in the first place? The goal is > : to prevent one vCPU from attacking the next vCPU run on the same pCPU. But unless > : userspace is running multiple VMs in the same process/mm_struct, switching vCPUs, > : i.e. switching tasks, will also switch mm_structs and thus do IPBP via cond_mitigation. > : > : If userspace runs multiple VMs in the same process, This keeps popping up. Who does that? Can I get a real-life example to such VM-based containers or what the hell that is, pls? > enables cond_ipbp, _and_ sets > : TIF_SPEC_IB, then it's being stupid and isn't getting full protection in any case, > : e.g. if userspace is handling an exit-to-userspace condition for two vCPUs from > : different VMs, then the kernel could switch between those two vCPUs' tasks without > : bouncing through KVM and thus without doing KVM's IBPB. > : > : I can kinda see doing this for always_ibpb, e.g. if userspace is unaware of spectre > : and is naively running multiple VMs in the same process. So this needs a clearer definition: what protection are we even talking about when the address spaces of processes are shared? My naïve thinking would be: none. They're sharing address space - branch pred. poisoning between the two is the least of their worries. So to cut to the chase: it sounds to me like you don't want to do IBPB at all on vCPU switch. And the process switch case is taken care of by switch_mm().
On Sat, Apr 30, 2022, Borislav Petkov wrote: > On Fri, Apr 29, 2022 at 09:59:52PM +0000, Sean Christopherson wrote: > > Correct, but KVM also doesn't do IBPB on VM-Exit (or VM-Entry), > > Why doesn't it do that? Not needed? The host kernel is protected via RETPOLINE and by flushing the RSB immediately after VM-Exit. > > nor does KVM do IBPB before exiting to userspace. > > Same question. I don't know definitively. My guess is that IBPB is far too costly to do on every exit, and so the onus was put on userspace to recompile with RETPOLINE. What I don't know is why it wasn't implemented as an opt-out feature. > > The IBPB we want to whack is issued only when KVM is switching vCPUs. > > Then please document it properly as I've already requested. I'll write up the bits I have my head wrapped around. > > Except that _none_ of that documentation explains why the hell KVM > > does IBPB when switching betwen vCPUs. > > Probably because the folks involved in those patches weren't the hell > mainly virt people. Although I see a bunch of virt people on CC on that > patch. > > > : But stepping back, why does KVM do its own IBPB in the first place? The goal is > > : to prevent one vCPU from attacking the next vCPU run on the same pCPU. But unless > > : userspace is running multiple VMs in the same process/mm_struct, switching vCPUs, > > : i.e. switching tasks, will also switch mm_structs and thus do IPBP via cond_mitigation. > > : > > : If userspace runs multiple VMs in the same process, > > This keeps popping up. Who does that? Can I get a real-life example to > such VM-based containers or what the hell that is, pls? I don't know of any actual examples. But, it's trivially easy to create multiple VMs in a single process, and so proving the negative that no one runs multiple VMs in a single address space is essentially impossible. The container thing is just one scenario I can think of where userspace might actually benefit from sharing an address space, e.g. it would allow backing the image for large number of VMs with a single set of read-only VMAs. > > enables cond_ipbp, _and_ sets > > : TIF_SPEC_IB, then it's being stupid and isn't getting full protection in any case, > > : e.g. if userspace is handling an exit-to-userspace condition for two vCPUs from > > : different VMs, then the kernel could switch between those two vCPUs' tasks without > > : bouncing through KVM and thus without doing KVM's IBPB. > > : > > : I can kinda see doing this for always_ibpb, e.g. if userspace is unaware of spectre > > : and is naively running multiple VMs in the same process. > > So this needs a clearer definition: what protection are we even talking > about when the address spaces of processes are shared? My naïve > thinking would be: none. They're sharing address space - branch pred. > poisoning between the two is the least of their worries. I truly have no idea, which is part of the reason I brought it up in the first place. I'd have happily just whacked KVM's IBPB entirely, but it seemed prudent to preserve the existing behavior if someone went out of their way to enable switch_mm_always_ibpb. > So to cut to the chase: it sounds to me like you don't want to do IBPB > at all on vCPU switch. Yes, or do it iff switch_mm_always_ibpb is enabled to maintain "compability". > And the process switch case is taken care of by switch_mm(). Yep.
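For concreteness, the "multiple VMs in a single process" case is trivial to construct with the plain KVM ioctl API. The sketch below is a minimal, hypothetical example (error handling omitted) in which two VMs, and therefore vCPUs from different VMs, share one mm_struct:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
	int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);

	/* Two VMs created by the same process: both live in the same mm_struct. */
	int vm_a = ioctl(kvm, KVM_CREATE_VM, 0);
	int vm_b = ioctl(kvm, KVM_CREATE_VM, 0);

	/*
	 * One vCPU each; issuing KVM_RUN on these from this process switches
	 * between vCPUs of different VMs without ever going through switch_mm().
	 */
	int vcpu_a = ioctl(vm_a, KVM_CREATE_VCPU, 0);
	int vcpu_b = ioctl(vm_b, KVM_CREATE_VCPU, 0);

	(void)vcpu_a;
	(void)vcpu_b;
	return 0;
}

Whether anyone actually does this with VMs in different security domains is exactly the open question; the API permits it, which is why simply deleting the IBPB could in theory silently weaken such a setup.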
On Fri, Apr 29, 2022 at 11:23:32PM +0000, Sean Christopherson wrote: > The host kernel is protected via RETPOLINE and by flushing the RSB immediately > after VM-Exit. Ah, right. > I don't know definitively. My guess is that IBPB is far too costly to do on every > exit, and so the onus was put on userspace to recompile with RETPOLINE. What I > don't know is why it wasn't implemented as an opt-out feature. Or, we could add the same logic on the exit path as in cond_mitigation() and test for LAST_USER_MM_IBPB when the host has selected switch_mm_cond_ibpb and thus allows for certain guests to be protected... Although, that use case sounds kinda meh: AFAIU, the attack vector here would be, protecting the guest from a malicious kernel. I guess this might matter for encrypted guests tho. > I'll write up the bits I have my head wrapped around. That's nice, thanks! > I don't know of any actual examples. But, it's trivially easy to create multiple > VMs in a single process, and so proving the negative that no one runs multiple VMs > in a single address space is essentially impossible. > > The container thing is just one scenario I can think of where userspace might > actually benefit from sharing an address space, e.g. it would allow backing the > image for large number of VMs with a single set of read-only VMAs. Why I keep harping about this: so let's say we eventually add something and then months, years from now we cannot find out anymore why that thing was added. We will likely remove it or start wasting time figuring out why that thing was added in the first place. This very questioning keeps popping up almost on a daily basis during refactoring so I'd like for us to be better at documenting *why* we're doing a certain solution or function or whatever. And this is doubly important when it comes to the hw mitigations because if you look at the problem space and all the possible ifs and whens and but(t)s, your head will spin in no time. So I'm really sceptical when there's not even a single actual use case to a proposed change. So Jon said something about oversubscription and a lot of vCPU switching. That should be there in the docs as the use case and explaining why dropping IBPB during vCPU switches is redundant. > I truly have no idea, which is part of the reason I brought it up in the first > place. I'd have happily just whacked KVM's IBPB entirely, but it seemed prudent > to preserve the existing behavior if someone went out of their way to enable > switch_mm_always_ibpb. So let me try to understand this use case: you have a guest and a bunch of vCPUs which belong to it. And that guest gets switched between those vCPUs and KVM does IBPB flushes between those vCPUs. So either I'm missing something - which is possible - but if not, that "protection" doesn't make any sense - it is all within the same guest! So that existing behavior was silly to begin with so we might just as well kill it. > Yes, or do it iff switch_mm_always_ibpb is enabled to maintain "compability". Yap, and I'm questioning the even smallest shred of reasoning for having that IBPB flush there *at* *all*. And here's the thing with documenting all that: we will document and say, IBPB between vCPU flushes is non-sensical. 
Then, when someone comes later and says, "but but, it makes sense because of X" and we hadn't thought about X at the time, we will change it and document it again and this way you'll have everything explicit there, how we arrived at the current situation and be able to always go, "ah, ok, that's why we did it." I hope I'm making some sense...
> On Apr 30, 2022, at 5:50 AM, Borislav Petkov <bp@alien8.de> wrote: > > On Fri, Apr 29, 2022 at 11:23:32PM +0000, Sean Christopherson wrote: >> The host kernel is protected via RETPOLINE and by flushing the RSB immediately >> after VM-Exit. > > Ah, right. > >> I don't know definitively. My guess is that IBPB is far too costly to do on every >> exit, and so the onus was put on userspace to recompile with RETPOLINE. What I >> don't know is why it wasn't implemented as an opt-out feature. > > Or, we could add the same logic on the exit path as in cond_mitigation() > and test for LAST_USER_MM_IBPB when the host has selected > switch_mm_cond_ibpb and thus allows for certain guests to be > protected... This is roughly what the v1 of this patch did. *if* we keep it here, to fix this bug we’d have to bring this logic here for sure. > > Although, that use case sounds kinda meh: AFAIU, the attack vector here > would be, protecting the guest from a malicious kernel. I guess this > might matter for encrypted guests tho. > >> I'll write up the bits I have my head wrapped around. > > That's nice, thanks! Agreed, thx Sean, I really appreciate it. >> I don't know of any actual examples. But, it's trivially easy to create multiple >> VMs in a single process, and so proving the negative that no one runs multiple VMs >> in a single address space is essentially impossible. >> >> The container thing is just one scenario I can think of where userspace might >> actually benefit from sharing an address space, e.g. it would allow backing the >> image for large number of VMs with a single set of read-only VMAs. > > Why I keep harping about this: so let's say we eventually add something > and then months, years from now we cannot find out anymore why that > thing was added. We will likely remove it or start wasting time figuring > out why that thing was added in the first place. > > This very questioning keeps popping up almost on a daily basis during > refactoring so I'd like for us to be better at documenting *why* we're > doing a certain solution or function or whatever. This is 100% a fair ask, I appreciate the diligence, as we’ve all been there on the ‘other side’ of changes to complex areas and spend hours digging on git history, LKML threads, SDM/APM, and other sources trying to derive why the heck something is the way it is. I’m 100% game to make sure this is a good change, so truly thank you for helping hold a high quality bar. > And this is doubly important when it comes to the hw mitigations because > if you look at the problem space and all the possible ifs and whens and > but(t)s, your head will spin in no time. > > So I'm really sceptical when there's not even a single actual use case > to a proposed change. > > So Jon said something about oversubscription and a lot of vCPU > switching. That should be there in the docs as the use case and > explaining why dropping IBPB during vCPU switches is redundant. > >> I truly have no idea, which is part of the reason I brought it up in the first >> place. I'd have happily just whacked KVM's IBPB entirely, but it seemed prudent >> to preserve the existing behavior if someone went out of their way to enable >> switch_mm_always_ibpb. > > So let me try to understand this use case: you have a guest and a bunch > of vCPUs which belong to it. And that guest gets switched between those > vCPUs and KVM does IBPB flushes between those vCPUs. 
> > So either I'm missing something - which is possible - but if not, that > "protection" doesn't make any sense - it is all within the same guest! > So that existing behavior was silly to begin with so we might just as > well kill it. Close, its not 1 guest with a bunch of vCPU, its a bunch of guests with a small amount of vCPUs, thats the small nuance here, which is one of the reasons why this was hard to see from the beginning. AFAIK, the KVM IBPB is avoided when switching in between vCPUs belonging to the same vmcs/vmcb (i.e. the same guest), e.g. you could have one VM highly oversubscribed to the host and you wouldn’t see either the KVM IBPB or the switch_mm IBPB. All good. Reference vmx_vcpu_load_vmcs() and svm_vcpu_load() and the conditionals prior to the barrier. However, the pain ramps up when you have a bunch of separate guests, especially with a small amount of vCPUs per guest, so the switching is more likely to be in between completely separate guests. Think small servers or virtual desktops. Thats what the scalability test I described in my previous note does, which effectively gets us to the point where each and every load is to a different guest, so we’re firing KVM IBPB each time. But even then, we’re in agreement that its silly because the switch_mm takes care of it. > >> Yes, or do it iff switch_mm_always_ibpb is enabled to maintain "compability". > > Yap, and I'm questioning the even smallest shred of reasoning for having > that IBPB flush there *at* *all*. > > And here's the thing with documenting all that: we will document and > say, IBPB between vCPU flushes is non-sensical. Then, when someone comes > later and says, "but but, it makes sense because of X" and we hadn't > thought about X at the time, we will change it and document it again and > this way you'll have everything explicit there, how we arrived at the > current situation and be able to always go, "ah, ok, that's why we did > it." > > I hope I'm making some sense... Makes sense to me. Let’s wait for Sean’s writeup to clarify and keep drilling down on this. This “but but” was exactly why we wanted to leave at least a “shred” behind :) but agreed, we need to double down on documentation > -- > Regards/Gruss, > Boris. > > https://urldefense.proofpoint.com/v2/url?u=https-3A__people.kernel.org_tglx_notes-2Dabout-2Dnetiquette&d=DwIBaQ&c=s883GpUCOChKOHiocYtGcg&r=NGPRGGo37mQiSXgHKm5rCQ&m=gz30B4PoWGK0UCpgwq_jMKL5LHzM6w-430LAsSEcVYgm5iUvCuCKcbv8amDHDAUu&s=MOviOxGMpK3YkgYmDtVKwgJQ7RcSFTsEAMWUD8W-SoI&e=
On Sat, Apr 30, 2022 at 02:50:35PM +0000, Jon Kohler wrote: > This is 100% a fair ask, I appreciate the diligence, as we’ve all been there > on the ‘other side’ of changes to complex areas and spend hours digging on > git history, LKML threads, SDM/APM, and other sources trying to derive > why the heck something is the way it is. Yap, that's basically proving my point and why I want stuff to be properly documented so that the question "why was it done this way" can always be answered satisfactorily. > AFAIK, the KVM IBPB is avoided when switching in between vCPUs > belonging to the same vmcs/vmcb (i.e. the same guest), e.g. you could > have one VM highly oversubscribed to the host and you wouldn’t see > either the KVM IBPB or the switch_mm IBPB. All good. > > Reference vmx_vcpu_load_vmcs() and svm_vcpu_load() and the > conditionals prior to the barrier. So this is where something's still missing. > However, the pain ramps up when you have a bunch of separate guests, > especially with a small amount of vCPUs per guest, so the switching is more > likely to be in between completely separate guests. If the guests are completely separate, then it should fall into the switch_mm() case. Unless it has something to do with, as I looked at the SVM side of things, the VMCBs: if (sd->current_vmcb != svm->vmcb) { So it is not only different guests but also within the same guest and when the VMCB of the vCPU is not the current one. But then if VMCB of the vCPU is not the current, per-CPU VMCB, then that CPU ran another guest so in order for that other guest to attack the current guest, then its branch pred should be flushed. But I'm likely missing a virt aspect here so I'd let Sean explain what the rules are... Thx.
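For reference, both vendor paths key the barrier off a per-CPU "what was loaded here last" pointer rather than off guest identity. Abridged from the svm.c/vmx.c code being patched (unrelated lines dropped, so this is a sketch, not the exact source):

	/* SVM: sd is this CPU's svm_cpu_data. */
	if (sd->current_vmcb != svm->vmcb) {
		sd->current_vmcb = svm->vmcb;
		indirect_branch_prediction_barrier();
	}

	/* VMX: per_cpu(current_vmcs, cpu) tracks the last VMCS loaded on this CPU. */
	prev = per_cpu(current_vmcs, cpu);
	if (prev != vmx->loaded_vmcs->vmcs) {
		per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
		vmcs_load(vmx->loaded_vmcs->vmcs);
		/* 'buddy' is non-NULL only for nested vmcs01 <-> vmcs02 switches. */
		if (!buddy || WARN_ON_ONCE(buddy->vmcs != prev))
			indirect_branch_prediction_barrier();
	}

Because every vCPU has its own VMCB/VMCS, the condition fires whenever a different vCPU last ran on the pCPU, regardless of whether that vCPU belonged to a different VM, which is the point Sean clarifies below.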
> On Apr 30, 2022, at 12:08 PM, Borislav Petkov <bp@alien8.de> wrote: > > On Sat, Apr 30, 2022 at 02:50:35PM +0000, Jon Kohler wrote: >> This is 100% a fair ask, I appreciate the diligence, as we’ve all been there >> on the ‘other side’ of changes to complex areas and spend hours digging on >> git history, LKML threads, SDM/APM, and other sources trying to derive >> why the heck something is the way it is. > > Yap, that's basically proving my point and why I want stuff to be > properly documented so that the question "why was it done this way" can > always be answered satisfactorily. > >> AFAIK, the KVM IBPB is avoided when switching in between vCPUs >> belonging to the same vmcs/vmcb (i.e. the same guest), e.g. you could >> have one VM highly oversubscribed to the host and you wouldn’t see >> either the KVM IBPB or the switch_mm IBPB. All good. >> >> Reference vmx_vcpu_load_vmcs() and svm_vcpu_load() and the >> conditionals prior to the barrier. > > So this is where something's still missing. > >> However, the pain ramps up when you have a bunch of separate guests, >> especially with a small amount of vCPUs per guest, so the switching is more >> likely to be in between completely separate guests. > > If the guests are completely separate, then it should fall into the > switch_mm() case. > > Unless it has something to do with, as I looked at the SVM side of > things, the VMCBs: > > if (sd->current_vmcb != svm->vmcb) { > > So it is not only different guests but also within the same guest and > when the VMCB of the vCPU is not the current one. > > But then if VMCB of the vCPU is not the current, per-CPU VMCB, then that > CPU ran another guest so in order for that other guest to attack the > current guest, then its branch pred should be flushed. > > But I'm likely missing a virt aspect here so I'd let Sean explain what > the rules are... Hey Sean - touching back on this one to see if were able to get some thoughts together on this one? Thanks - Jon > > Thx. > > -- > Regards/Gruss, > Boris. > > https://urldefense.proofpoint.com/v2/url?u=https-3A__people.kernel.org_tglx_notes-2Dabout-2Dnetiquette&d=DwIDaQ&c=s883GpUCOChKOHiocYtGcg&r=NGPRGGo37mQiSXgHKm5rCQ&m=g8L6xyV5FDaMT1tJZ0GSAFqRAfzycwb-KqVGLeF9tuj2dl8To57JSPptw9_QKhn2&s=DU4mnTrLXbmOItsB0lTJNN4bgP2YC1n1Y2WeysN8PhM&e=
On Sat, Apr 30, 2022, Jon Kohler wrote: > > > On Apr 30, 2022, at 5:50 AM, Borislav Petkov <bp@alien8.de> wrote: > > So let me try to understand this use case: you have a guest and a bunch > > of vCPUs which belong to it. And that guest gets switched between those > > vCPUs and KVM does IBPB flushes between those vCPUs. > > > > So either I'm missing something - which is possible - but if not, that > > "protection" doesn't make any sense - it is all within the same guest! > > So that existing behavior was silly to begin with so we might just as > > well kill it. > > Close, its not 1 guest with a bunch of vCPU, its a bunch of guests with > a small amount of vCPUs, thats the small nuance here, which is one of > the reasons why this was hard to see from the beginning. > > AFAIK, the KVM IBPB is avoided when switching in between vCPUs > belonging to the same vmcs/vmcb (i.e. the same guest), e.g. you could > have one VM highly oversubscribed to the host and you wouldn’t see > either the KVM IBPB or the switch_mm IBPB. All good. No, KVM does not avoid IBPB when switching between vCPUs in a single VM. Every vCPU has a separate VMCS/VMCB, and so the scenario described above where a single VM has a bunch of vCPUs running on a limited set of logical CPUs will emit IBPB on every single switch.
On Sat, Apr 30, 2022, Borislav Petkov wrote: > On Sat, Apr 30, 2022 at 02:50:35PM +0000, Jon Kohler wrote: > > This is 100% a fair ask, I appreciate the diligence, as we’ve all been there > > on the ‘other side’ of changes to complex areas and spend hours digging on > > git history, LKML threads, SDM/APM, and other sources trying to derive > > why the heck something is the way it is. > > Yap, that's basically proving my point and why I want stuff to be > properly documented so that the question "why was it done this way" can > always be answered satisfactorily. > > > AFAIK, the KVM IBPB is avoided when switching in between vCPUs > > belonging to the same vmcs/vmcb (i.e. the same guest), e.g. you could > > have one VM highly oversubscribed to the host and you wouldn’t see > > either the KVM IBPB or the switch_mm IBPB. All good. > > > > Reference vmx_vcpu_load_vmcs() and svm_vcpu_load() and the > > conditionals prior to the barrier. > > So this is where something's still missing. > > > However, the pain ramps up when you have a bunch of separate guests, > > especially with a small amount of vCPUs per guest, so the switching is more > > likely to be in between completely separate guests. > > If the guests are completely separate, then it should fall into the > switch_mm() case. > > Unless it has something to do with, as I looked at the SVM side of > things, the VMCBs: > > if (sd->current_vmcb != svm->vmcb) { > > So it is not only different guests but also within the same guest and > when the VMCB of the vCPU is not the current one. Yep. > But then if VMCB of the vCPU is not the current, per-CPU VMCB, then that > CPU ran another guest so in order for that other guest to attack the > current guest, then its branch pred should be flushed. That CPU ran a different _vCPU_, whether or not it ran a different guest, i.e. a different security domain, is unknown. > But I'm likely missing a virt aspect here so I'd let Sean explain what > the rules are... I don't think you're missing anything. I think the original 15d45071523d ("KVM/x86: Add IBPB support") was simply wrong. As I see it: 1. If the vCPUs belong to the same VM, they are in the same security domain and do not need an IPBP. 2. If the vCPUs belong to different VMs, and each VM is in its own mm_struct, defer to switch_mm_irqs_off() to handle IBPB as an mm switch is guaranteed to occur prior to loading a vCPU belonging to a different VMs. 3. If the vCPUs belong to different VMs, but multiple VMs share an mm_struct, then the security benefits of an IBPB when switching vCPUs are dubious, at best. If we only consider #1 and #2, then KVM doesn't need an IBPB, period. #3 is the only one that's a grey area. I have no objection to omitting IBPB entirely even in that case, because none of us can identify any tangible value in doing so.
> On May 10, 2022, at 10:22 AM, Sean Christopherson <seanjc@google.com> wrote: > > On Sat, Apr 30, 2022, Jon Kohler wrote: >> >>> On Apr 30, 2022, at 5:50 AM, Borislav Petkov <bp@alien8.de> wrote: >>> So let me try to understand this use case: you have a guest and a bunch >>> of vCPUs which belong to it. And that guest gets switched between those >>> vCPUs and KVM does IBPB flushes between those vCPUs. >>> >>> So either I'm missing something - which is possible - but if not, that >>> "protection" doesn't make any sense - it is all within the same guest! >>> So that existing behavior was silly to begin with so we might just as >>> well kill it. >> >> Close, its not 1 guest with a bunch of vCPU, its a bunch of guests with >> a small amount of vCPUs, thats the small nuance here, which is one of >> the reasons why this was hard to see from the beginning. >> >> AFAIK, the KVM IBPB is avoided when switching in between vCPUs >> belonging to the same vmcs/vmcb (i.e. the same guest), e.g. you could >> have one VM highly oversubscribed to the host and you wouldn’t see >> either the KVM IBPB or the switch_mm IBPB. All good. > > No, KVM does not avoid IBPB when switching between vCPUs in a single VM. Every > vCPU has a separate VMCS/VMCB, and so the scenario described above where a single > VM has a bunch of vCPUs running on a limited set of logical CPUs will emit IBPB > on every single switch. Ah! Right, ok thanks for helping me get my wires uncrossed there, I was getting confused from the nested optimization made on 5c911beff KVM: nVMX: Skip IBPB when switching between vmcs01 and vmcs02 So the only time we’d *not* issue IBPB is if the current per-vcpu vmcs/vmcb is still loaded in the non-nested case, or between guests in the nested case. Walking through my thoughts again here with this fresh in my mind: In that example, say guest A has vCPU0 and vCPU1 and has to switch in between the two on the same pCPU, it isn’t doing a switch_mm() because its the same mm_struct; however, I’d wager to say that if you had an attacker on the guest VM, executing an attack on vCPU0 with the intent of attacking vCPU1 (which is up to run next), you have far bigger problems, as that would imply the guest is completely compromised, so why would they even waste time on a complex prediction attack when they have that level of system access in the first place? Going back to the original commit documentation that Boris called out, that specifically says: * Mitigate guests from being attacked by other guests. - This is addressed by issing IBPB when we do a guest switch. Since you need to go through switch_mm() to change mm_struct from Guest A to Guest B, it makes no sense to issue the barrier in KVM, as the kernel is already giving that “for free” (from KVM’s perspective) as the guest-to-guest transition is already covered by cond_mitigation(). That would apply equally for switches within both nested and non-nested cases, because switch_mm needs to be called between guests.
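The nested optimization Jon mentions (5c911beff, "KVM: nVMX: Skip IBPB when switching between vmcs01 and vmcs02") works by passing the previously loaded VMCS as the 'buddy' argument when flipping between vmcs01 and vmcs02 for the same vCPU, which suppresses the barrier in vmx_vcpu_load_vmcs(). Roughly, abridged from the nested.c code of that era and reconstructed from memory:

static void vmx_switch_vmcs(struct kvm_vcpu *vcpu, struct loaded_vmcs *vmcs)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);
	struct loaded_vmcs *prev;
	int cpu;

	if (WARN_ON_ONCE(vmx->loaded_vmcs == vmcs))
		return;

	cpu = get_cpu();
	prev = vmx->loaded_vmcs;
	vmx->loaded_vmcs = vmcs;
	/* Same vCPU, same guest: passing 'prev' as the buddy skips the IBPB. */
	vmx_vcpu_load_vmcs(vcpu, cpu, prev);
	put_cpu();
}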
> On May 10, 2022, at 10:44 AM, Sean Christopherson <seanjc@google.com> wrote: > > On Sat, Apr 30, 2022, Borislav Petkov wrote: >> On Sat, Apr 30, 2022 at 02:50:35PM +0000, Jon Kohler wrote: >>> This is 100% a fair ask, I appreciate the diligence, as we’ve all been there >>> on the ‘other side’ of changes to complex areas and spend hours digging on >>> git history, LKML threads, SDM/APM, and other sources trying to derive >>> why the heck something is the way it is. >> >> Yap, that's basically proving my point and why I want stuff to be >> properly documented so that the question "why was it done this way" can >> always be answered satisfactorily. >> >>> AFAIK, the KVM IBPB is avoided when switching in between vCPUs >>> belonging to the same vmcs/vmcb (i.e. the same guest), e.g. you could >>> have one VM highly oversubscribed to the host and you wouldn’t see >>> either the KVM IBPB or the switch_mm IBPB. All good. >>> >>> Reference vmx_vcpu_load_vmcs() and svm_vcpu_load() and the >>> conditionals prior to the barrier. >> >> So this is where something's still missing. >> >>> However, the pain ramps up when you have a bunch of separate guests, >>> especially with a small amount of vCPUs per guest, so the switching is more >>> likely to be in between completely separate guests. >> >> If the guests are completely separate, then it should fall into the >> switch_mm() case. >> >> Unless it has something to do with, as I looked at the SVM side of >> things, the VMCBs: >> >> if (sd->current_vmcb != svm->vmcb) { >> >> So it is not only different guests but also within the same guest and >> when the VMCB of the vCPU is not the current one. > > Yep. > >> But then if VMCB of the vCPU is not the current, per-CPU VMCB, then that >> CPU ran another guest so in order for that other guest to attack the >> current guest, then its branch pred should be flushed. > > That CPU ran a different _vCPU_, whether or not it ran a different guest, i.e. a > different security domain, is unknown. > >> But I'm likely missing a virt aspect here so I'd let Sean explain what >> the rules are... > > I don't think you're missing anything. I think the original 15d45071523d ("KVM/x86: > Add IBPB support") was simply wrong. > > As I see it: > > 1. If the vCPUs belong to the same VM, they are in the same security domain and > do not need an IPBP. > > 2. If the vCPUs belong to different VMs, and each VM is in its own mm_struct, > defer to switch_mm_irqs_off() to handle IBPB as an mm switch is guaranteed to > occur prior to loading a vCPU belonging to a different VMs. > > 3. If the vCPUs belong to different VMs, but multiple VMs share an mm_struct, > then the security benefits of an IBPB when switching vCPUs are dubious, at best. > > If we only consider #1 and #2, then KVM doesn't need an IBPB, period. > > #3 is the only one that's a grey area. I have no objection to omitting IBPB entirely > even in that case, because none of us can identify any tangible value in doing so. Thanks, Sean. Our messages crossed in flight, I sent a reply to your earlier message just a bit ago. This is super helpful to frame this up. What would you think framing the patch like this: x86/speculation, KVM: remove IBPB on vCPU load Remove IBPB that is done on KVM vCPU load, as the guest-to-guest attack surface is already covered by switch_mm_irqs_off() -> cond_mitigation(). The original 15d45071523d ("KVM/x86: Add IBPB support") was simply wrong in its guest-to-guest design intention. There are three scenarios at play here: 1. 
If the vCPUs belong to the same VM, they are in the same security
domain and do not need an IBPB.
2. If the vCPUs belong to different VMs, and each VM is in its own mm_struct,
switch_mm_irqs_off() will handle IBPB as an mm switch is guaranteed to
occur prior to loading a vCPU belonging to a different VM.
3. If the vCPUs belong to different VMs, but multiple VMs share an mm_struct,
then the security benefits of an IBPB when switching vCPUs are dubious,
at best.

Issuing IBPB from KVM vCPU load would only cover #3, but there are no
real world tangible use cases for such a configuration. If multiple VMs
are sharing an mm_struct, prediction attacks are the least of their
security worries.

Fixes: 15d45071523d ("KVM/x86: Add IBPB support")
(Reviewed-by / Signed-off-by from people here)
(Code change simply whacks IBPB in KVM vmx/svm and that's it)
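Assuming the thread settles on removal, the code change would presumably amount to little more than dropping the two barrier calls that v3 touched, along these lines (illustrative only; the actual v4 was not posted in this thread):

--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	if (sd->current_vmcb != svm->vmcb) {
 		sd->current_vmcb = svm->vmcb;
-		indirect_branch_prediction_barrier();
 	}
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
 	 * The L1 VMM can protect itself with retpolines, IBPB or IBRS.
 	 */
-	if (!buddy || WARN_ON_ONCE(buddy->vmcs != prev))
-		indirect_branch_prediction_barrier();
 }

A follow-up cleanup could then drop the now-unused 'buddy' parameter and the nested-specific comment, but that is beyond the minimal fix described here.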
On Tue, May 10, 2022, Jon Kohler wrote: > > > On May 10, 2022, at 10:44 AM, Sean Christopherson <seanjc@google.com> wrote: > > > > On Sat, Apr 30, 2022, Borislav Petkov wrote: > >> But I'm likely missing a virt aspect here so I'd let Sean explain what > >> the rules are... > > > > I don't think you're missing anything. I think the original 15d45071523d ("KVM/x86: > > Add IBPB support") was simply wrong. > > > > As I see it: > > > > 1. If the vCPUs belong to the same VM, they are in the same security domain and > > do not need an IPBP. > > > > 2. If the vCPUs belong to different VMs, and each VM is in its own mm_struct, > > defer to switch_mm_irqs_off() to handle IBPB as an mm switch is guaranteed to > > occur prior to loading a vCPU belonging to a different VMs. > > > > 3. If the vCPUs belong to different VMs, but multiple VMs share an mm_struct, > > then the security benefits of an IBPB when switching vCPUs are dubious, at best. > > > > If we only consider #1 and #2, then KVM doesn't need an IBPB, period. > > > > #3 is the only one that's a grey area. I have no objection to omitting IBPB entirely > > even in that case, because none of us can identify any tangible value in doing so. > > Thanks, Sean. Our messages crossed in flight, I sent a reply to your earlier message > just a bit ago. This is super helpful to frame this up. > > What would you think framing the patch like this: > > x86/speculation, KVM: remove IBPB on vCPU load > > Remove IBPB that is done on KVM vCPU load, as the guest-to-guest > attack surface is already covered by switch_mm_irqs_off() -> > cond_mitigation(). > > The original 15d45071523d ("KVM/x86: Add IBPB support") was simply wrong in > its guest-to-guest design intention. There are three scenarios at play > here: > > 1. If the vCPUs belong to the same VM, they are in the same security > domain and do not need an IPBP. > 2. If the vCPUs belong to different VMs, and each VM is in its own mm_struct, > switch_mm_irqs_off() will handle IBPB as an mm switch is guaranteed to > occur prior to loading a vCPU belonging to a different VMs. > 3. If the vCPUs belong to different VMs, but multiple VMs share an mm_struct, > then the security benefits of an IBPB when switching vCPUs are dubious, > at best. > > Issuing IBPB from KVM vCPU load would only cover #3, but there are no Just to hedge, there are no _known_ use cases. > real world tangible use cases for such a configuration. and I would further qualify this with: but there are no known real world, tangible use cases for running multiple VMs belonging to different security domains in a shared address space. Running multiple VMs in a single address space is plausible and sane, _if_ they are all in the same security domain or security is not a concern. That way the statement isn't invalidated if someone pops up with a use case for running multiple VMs but has no security story. Other than that, LGTM. > If multiple VMs > are sharing an mm_structs, prediction attacks are the least of their > security worries. > > Fixes: 15d45071523d ("KVM/x86: Add IBPB support") > (Reviewedby/signed of by people here) > (Code change simply whacks IBPB in KVM vmx/svm and thats it) > > >
On Tue, May 10, 2022 at 03:50:31PM +0000, Sean Christopherson wrote:
> > x86/speculation, KVM: remove IBPB on vCPU load
> >
> > Remove the IBPB that is done on KVM vCPU load, as the guest-to-guest
> > attack surface is already covered by switch_mm_irqs_off() ->
> > cond_mitigation().
> >
> > The original 15d45071523d ("KVM/x86: Add IBPB support") was simply wrong in
> > its guest-to-guest design intention. There are three scenarios at play
> > here:
> >
> > 1. If the vCPUs belong to the same VM, they are in the same security
> >    domain and do not need an IBPB.
> > 2. If the vCPUs belong to different VMs, and each VM is in its own mm_struct,
> >    switch_mm_irqs_off() will handle IBPB as an mm switch is guaranteed to
> >    occur prior to loading a vCPU belonging to a different VM.
> > 3. If the vCPUs belong to different VMs, but multiple VMs share an mm_struct,
> >    then the security benefits of an IBPB when switching vCPUs are dubious,
> >    at best.
> >
> > Issuing IBPB from KVM vCPU load would only cover #3, but there are no
>
> Just to hedge, there are no _known_ use cases.
>
> > real world tangible use cases for such a configuration.
>
> and I would further qualify this with:
>
>   but there are no known real world, tangible use cases for running multiple
>   VMs belonging to different security domains in a shared address space.
>
> Running multiple VMs in a single address space is plausible and sane, _if_ they
> are all in the same security domain or security is not a concern. That way the
> statement isn't invalidated if someone pops up with a use case for running multiple
> VMs but has no security story.
>
> Other than that, LGTM.
>
> > If multiple VMs are sharing an mm_struct, prediction attacks are the least
> > of their security worries.
> >
> > Fixes: 15d45071523d ("KVM/x86: Add IBPB support")
> > (Reviewed-by/Signed-off-by people here)
> > (Code change simply whacks IBPB in KVM vmx/svm and that's it)

I agree with all that I've read so far - the only thing that's missing is:

(Documentation in Documentation/admin-guide/hw-vuln/spectre.rst about what the use
cases are and what we're protecting against and what we're *not* protecting
against because <raisins>).

Thx.
> On May 12, 2022, at 9:44 AM, Borislav Petkov <bp@alien8.de> wrote:
>
> On Tue, May 10, 2022 at 03:50:31PM +0000, Sean Christopherson wrote:
>>> x86/speculation, KVM: remove IBPB on vCPU load
>>>
>>> Remove the IBPB that is done on KVM vCPU load, as the guest-to-guest
>>> attack surface is already covered by switch_mm_irqs_off() ->
>>> cond_mitigation().
>>>
>>> The original 15d45071523d ("KVM/x86: Add IBPB support") was simply wrong in
>>> its guest-to-guest design intention. There are three scenarios at play
>>> here:
>>>
>>> 1. If the vCPUs belong to the same VM, they are in the same security
>>>    domain and do not need an IBPB.
>>> 2. If the vCPUs belong to different VMs, and each VM is in its own mm_struct,
>>>    switch_mm_irqs_off() will handle IBPB as an mm switch is guaranteed to
>>>    occur prior to loading a vCPU belonging to a different VM.
>>> 3. If the vCPUs belong to different VMs, but multiple VMs share an mm_struct,
>>>    then the security benefits of an IBPB when switching vCPUs are dubious,
>>>    at best.
>>>
>>> Issuing IBPB from KVM vCPU load would only cover #3, but there are no
>>
>> Just to hedge, there are no _known_ use cases.
>>
>>> real world tangible use cases for such a configuration.
>>
>> and I would further qualify this with:
>>
>>   but there are no known real world, tangible use cases for running multiple
>>   VMs belonging to different security domains in a shared address space.
>>
>> Running multiple VMs in a single address space is plausible and sane, _if_ they
>> are all in the same security domain or security is not a concern. That way the
>> statement isn't invalidated if someone pops up with a use case for running multiple
>> VMs but has no security story.
>>
>> Other than that, LGTM.
>>
>>> If multiple VMs are sharing an mm_struct, prediction attacks are the least
>>> of their security worries.
>>>
>>> Fixes: 15d45071523d ("KVM/x86: Add IBPB support")
>>> (Reviewed-by/Signed-off-by people here)
>>> (Code change simply whacks IBPB in KVM vmx/svm and that's it)
>
> I agree with all that I've read so far - the only thing that's missing is:
>
> (Documentation in Documentation/admin-guide/hw-vuln/spectre.rst about what the use
> cases are and what we're protecting against and what we're *not* protecting
> against because <raisins>).
>
> Thx.

Ok, thanks, Boris. I'll review that doc, make modifications on v4, and make sure
that you are cc'd.

Thanks again,
Jon

> --
> Regards/Gruss,
>    Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
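[Editorial note: for the prctl/seccomp configurations mentioned in the thread (spectre_v2_user=prctl,ibpb and spectre_v2_user=seccomp,ibpb), the conditional IBPB path only fires for tasks that have opted in. Below is a minimal userspace sketch of how a VMM process could opt in via PR_SET_SPECULATION_CTRL; it illustrates the existing prctl interface documented in Documentation/userspace-api/spec_ctrl.rst and is not part of the patch under discussion.]

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>

/* Fallback definitions in case an older libc header lacks these
 * (values from include/uapi/linux/prctl.h). */
#ifndef PR_SET_SPECULATION_CTRL
#define PR_SET_SPECULATION_CTRL 53
#endif
#ifndef PR_SPEC_INDIRECT_BRANCH
#define PR_SPEC_INDIRECT_BRANCH 1
#endif
#ifndef PR_SPEC_DISABLE
#define PR_SPEC_DISABLE (1UL << 2)
#endif

int main(void)
{
        /*
         * Ask the kernel to restrict indirect branch speculation for this
         * task.  Under spectre_v2_user=prctl,ibpb this is what sets
         * TIF_SPEC_IB, so cond_mitigation() will issue IBPB on mm switches
         * involving this task's mm.
         */
        if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH,
                  PR_SPEC_DISABLE, 0, 0)) {
                fprintf(stderr, "PR_SET_SPECULATION_CTRL failed: %s\n",
                        strerror(errno));
                return 1;
        }
        printf("indirect branch speculation restricted for this task\n");
        return 0;
}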