Message ID: 20220512184514.15742-1-jon@nutanix.com (mailing list archive)
State: New, archived
Series: [v4] x86/speculation, KVM: remove IBPB on vCPU load
On Thu, May 12, 2022, Jon Kohler wrote:
> Remove IBPB that is done on KVM vCPU load, as the guest-to-guest
> attack surface is already covered by switch_mm_irqs_off() ->
> cond_mitigation().
>
> The original commit 15d45071523d ("KVM/x86: Add IBPB support") was simply
> wrong in its guest-to-guest design intention. There are three scenarios
> at play here:

Jim pointed out offline that there's a case we didn't consider. When switching
between vCPUs in the same VM, an IBPB may be warranted as the tasks in the VM
may be in different security domains. E.g. the guest will not get a
notification that vCPU0 is being swapped out for vCPU1 on a single pCPU.

So, sadly, after all that, I think the IBPB needs to stay. But the
documentation most definitely needs to be updated.

A per-VM capability to skip the IBPB may be warranted, e.g. for container-like
use cases where a single VM is running a single workload.
On Thu, May 12, 2022, Sean Christopherson wrote:
> On Thu, May 12, 2022, Jon Kohler wrote:
> > Remove IBPB that is done on KVM vCPU load, as the guest-to-guest
> > attack surface is already covered by switch_mm_irqs_off() ->
> > cond_mitigation().
> > [...]
>
> Jim pointed out offline that there's a case we didn't consider. When switching
> between vCPUs in the same VM, an IBPB may be warranted as the tasks in the VM
> may be in different security domains. E.g. the guest will not get a
> notification that vCPU0 is being swapped out for vCPU1 on a single pCPU.
>
> So, sadly, after all that, I think the IBPB needs to stay. But the
> documentation most definitely needs to be updated.
>
> A per-VM capability to skip the IBPB may be warranted, e.g. for container-like
> use cases where a single VM is running a single workload.

Ah, actually, the IBPB can be skipped if the vCPUs have different mm_structs,
because then the IBPB is fully redundant with respect to any IBPB performed by
switch_mm_irqs_off(). Hrm, though it might need a KVM or per-VM knob, e.g. just
because the VMM doesn't want IBPB doesn't mean the guest doesn't want IBPB.

That would also sidestep the largely theoretical question of whether vCPUs from
different VMs but the same address space are in the same security domain. It
doesn't matter, because even if they are in the same domain, KVM still needs to
do IBPB.
> On May 12, 2022, at 3:35 PM, Sean Christopherson <seanjc@google.com> wrote:
>
> [...]
>
> Ah, actually, the IBPB can be skipped if the vCPUs have different mm_structs,
> because then the IBPB is fully redundant with respect to any IBPB performed by
> switch_mm_irqs_off(). Hrm, though it might need a KVM or per-VM knob, e.g. just
> because the VMM doesn't want IBPB doesn't mean the guest doesn't want IBPB.
>
> That would also sidestep the largely theoretical question of whether vCPUs from
> different VMs but the same address space are in the same security domain. It
> doesn't matter, because even if they are in the same domain, KVM still needs to
> do IBPB.

So should we go back to the earlier approach where we have it be IBPB only
on always_ibpb? Or what?

At minimum, we need to fix the unilateral-ness of all of this :) since we're
IBPB'ing even when the user did not explicitly tell us to.

That said, since I just re-read the documentation today, it does specifically
suggest that if the guest wants to protect *itself* it should turn on IBPB or
STIBP (or other mitigations galore), so I think we end up having to think
about what our "contract" is with users who host their workloads on
KVM - are they expecting us to protect them in any/all cases?

Said another way, the internal guest areas of concern aren't something
the kernel would always be able to A) identify far in advance and B)
always solve on the user's behalf. There is an argument to be made
that the guest needs to deal with its own house, yea?
On Thu, May 12, 2022 at 12:51 PM Jon Kohler <jon@nutanix.com> wrote:
>
> [...]
>
> Said another way, the internal guest areas of concern aren't something
> the kernel would always be able to A) identify far in advance and B)
> always solve on the user's behalf. There is an argument to be made
> that the guest needs to deal with its own house, yea?

To the extent that the guest has control over its own house, yes. Say the
guest obviates the need for internal IBPB by statically partitioning virtual
cores into different security domains. If the hypervisor breaks core isolation
on the physical platform, it is responsible for providing the necessary
mitigations.
On Thu, May 12, 2022, Jon Kohler wrote:
> [...]
>
> So should we go back to the earlier approach where we have it be only
> IBPB on always_ibpb? Or what?
>
> At minimum, we need to fix the unilateral-ness of all of this :) since we're
> IBPB'ing even when the user did not explicitly tell us to.

I think we need separate controls for the guest. E.g. if the userspace VMM is
sufficiently hardened then it can run without the "do IBPB" flag, but that
doesn't mean that the entire guest it's running is sufficiently hardened.

> That said, since I just re-read the documentation today, it does specifically
> suggest that if the guest wants to protect *itself* it should turn on IBPB or
> STIBP (or other mitigations galore), so I think we end up having to think
> about what our "contract" is with users who host their workloads on
> KVM - are they expecting us to protect them in any/all cases?
>
> Said another way, the internal guest areas of concern aren't something
> the kernel would always be able to A) identify far in advance and B)
> always solve on the user's behalf. There is an argument to be made
> that the guest needs to deal with its own house, yea?

The issue is that the guest won't get a notification if vCPU0 is replaced with
vCPU1 on the same physical CPU, thus the guest doesn't get an opportunity to
emit IBPB. Since the host doesn't know whether or not the guest wants IBPB,
unless the owner of the host is also the owner of the guest workload, the safe
approach is to assume the guest is vulnerable.
On Thu, May 12, 2022 at 1:07 PM Sean Christopherson <seanjc@google.com> wrote:
>
> [...]
>
> The issue is that the guest won't get a notification if vCPU0 is replaced with
> vCPU1 on the same physical CPU, thus the guest doesn't get an opportunity to
> emit IBPB. Since the host doesn't know whether or not the guest wants IBPB,
> unless the owner of the host is also the owner of the guest workload, the safe
> approach is to assume the guest is vulnerable.

Exactly. And if the guest has used taskset as its mitigation strategy,
how is the host to know?
> On May 12, 2022, at 4:07 PM, Sean Christopherson <seanjc@google.com> wrote:
>
> [...]
>
> I think we need separate controls for the guest. E.g. if the userspace VMM is
> sufficiently hardened then it can run without the "do IBPB" flag, but that
> doesn't mean that the entire guest it's running is sufficiently hardened.

What if we keyed off the MSR bitmap, such that if a guest *ever* issued IBPB,
KVM can do IBPB on switch? We already disable interception today, so we have
the data, just like we do for SPEC_CTRL.

	if (prev != vmx->loaded_vmcs->vmcs) {
		per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
		vmcs_load(vmx->loaded_vmcs->vmcs);

		/*
		 * No indirect branch prediction barrier needed when switching
		 * the active VMCS within a guest, e.g. on nested VM-Enter.
		 * The L1 VMM can protect itself with retpolines, IBPB or IBRS.
		 * We'll only issue this IBPB if the guest itself has ever issued
		 * an IBPB, which would indicate they care about prediction
		 * barriers on one or more task(s) within the guest. This guards
		 * against the scenario where the guest has separate security
		 * domains on separate vCPUs, and the kernel switches vCPU-x out
		 * for vCPU-y on the same pCPU, before the guest has the chance
		 * to issue its own barrier. In this scenario, the switch_mm() ->
		 * cond_mitigation() would not issue its own barrier, because
		 * the vCPUs are sharing an mm_struct.
		 */
		if ((!buddy || WARN_ON_ONCE(buddy->vmcs != prev)) &&
		    !msr_write_intercepted(vmx, MSR_IA32_PRED_CMD))
			indirect_branch_prediction_barrier();
	}

If the guest isn't ever issuing IBPB, then one could say that they do not
care about vCPU-to-vCPU attack surface.

Thoughts?
> On May 12, 2022, at 4:27 PM, Jim Mattson <jmattson@google.com> wrote:
>
> [...]
>
> Exactly. And if the guest has used taskset as its mitigation strategy,
> how is the host to know?

Yea, that's fair enough. I posed a solution on Sean's response just as this
email came in, would love to know your thoughts (keying off the MSR bitmap).
On Thu, May 12, 2022 at 1:34 PM Jon Kohler <jon@nutanix.com> wrote:
>
> [...]
>
> Yea, that's fair enough. I posed a solution on Sean's response just as this
> email came in, would love to know your thoughts (keying off the MSR bitmap).

I don't believe this works. The IBPBs in cond_mitigation() (static in
arch/x86/mm/tlb.c) won't be triggered if the guest has given its
sensitive tasks exclusive use of their cores. And, if performance is a
concern, that is exactly what someone would do.
> On May 12, 2022, at 7:57 PM, Jim Mattson <jmattson@google.com> wrote: > > On Thu, May 12, 2022 at 1:34 PM Jon Kohler <jon@nutanix.com> wrote: >> >> >> >>> On May 12, 2022, at 4:27 PM, Jim Mattson <jmattson@google.com> wrote: >>> >>> On Thu, May 12, 2022 at 1:07 PM Sean Christopherson <seanjc@google.com> wrote: >>>> >>>> On Thu, May 12, 2022, Jon Kohler wrote: >>>>> >>>>> >>>>>> On May 12, 2022, at 3:35 PM, Sean Christopherson <seanjc@google.com> wrote: >>>>>> >>>>>> On Thu, May 12, 2022, Sean Christopherson wrote: >>>>>>> On Thu, May 12, 2022, Jon Kohler wrote: >>>>>>>> Remove IBPB that is done on KVM vCPU load, as the guest-to-guest >>>>>>>> attack surface is already covered by switch_mm_irqs_off() -> >>>>>>>> cond_mitigation(). >>>>>>>> >>>>>>>> The original commit 15d45071523d ("KVM/x86: Add IBPB support") was simply >>>>>>>> wrong in its guest-to-guest design intention. There are three scenarios >>>>>>>> at play here: >>>>>>> >>>>>>> Jim pointed offline that there's a case we didn't consider. When switching between >>>>>>> vCPUs in the same VM, an IBPB may be warranted as the tasks in the VM may be in >>>>>>> different security domains. E.g. the guest will not get a notification that vCPU0 is >>>>>>> being swapped out for vCPU1 on a single pCPU. >>>>>>> >>>>>>> So, sadly, after all that, I think the IBPB needs to stay. But the documentation >>>>>>> most definitely needs to be updated. >>>>>>> >>>>>>> A per-VM capability to skip the IBPB may be warranted, e.g. for container-like >>>>>>> use cases where a single VM is running a single workload. >>>>>> >>>>>> Ah, actually, the IBPB can be skipped if the vCPUs have different mm_structs, >>>>>> because then the IBPB is fully redundant with respect to any IBPB performed by >>>>>> switch_mm_irqs_off(). Hrm, though it might need a KVM or per-VM knob, e.g. just >>>>>> because the VMM doesn't want IBPB doesn't mean the guest doesn't want IBPB. 
>>>>>>
>>>>>> That would also sidestep the largely theoretical question of whether vCPUs from
>>>>>> different VMs but the same address space are in the same security domain. It doesn't
>>>>>> matter, because even if they are in the same domain, KVM still needs to do IBPB.
>>>>>
>>>>> So should we go back to the earlier approach where we have it be only
>>>>> IBPB on always_ibpb? Or what?
>>>>>
>>>>> At minimum, we need to fix the unilateral-ness of all of this :) since we’re
>>>>> IBPB’ing even when the user did not explicitly tell us to.
>>>>
>>>> I think we need separate controls for the guest. E.g. if the userspace VMM is
>>>> sufficiently hardened then it can run without "do IBPB" flag, but that doesn't
>>>> mean that the entire guest it's running is sufficiently hardened.
>>>>
>>>>> That said, since I just re-read the documentation today, it does specifically
>>>>> suggest that if the guest wants to protect *itself* it should turn on IBPB or
>>>>> STIBP (or other mitigations galore), so I think we end up having to think
>>>>> about what our “contract” is with users who host their workloads on
>>>>> KVM - are they expecting us to protect them in any/all cases?
>>>>>
>>>>> Said another way, the internal guest areas of concern aren’t something
>>>>> the kernel would always be able to A) identify far in advance and B)
>>>>> always solve on the users behalf. There is an argument to be made
>>>>> that the guest needs to deal with its own house, yea?
>>>>
>>>> The issue is that the guest won't get a notification if vCPU0 is replaced with
>>>> vCPU1 on the same physical CPU, thus the guest doesn't get an opportunity to emit
>>>> IBPB. Since the host doesn't know whether or not the guest wants IBPB, unless the
>>>> owner of the host is also the owner of the guest workload, the safe approach is to
>>>> assume the guest is vulnerable.
>>>
>>> Exactly. And if the guest has used taskset as its mitigation strategy,
>>> how is the host to know?
>>
>> Yea, that's fair enough. I posted a solution on Sean’s response just as this email
>> came in, would love to know your thoughts (keying off MSR bitmap).
>>
>
> I don't believe this works. The IBPBs in cond_mitigation (static in
> arch/x86/mm/tlb.c) won't be triggered if the guest has given its
> sensitive tasks exclusive use of their cores. And, if performance is a
> concern, that is exactly what someone would do.

I’m talking about within the guest itself, not the host-level cond_mitigation.
The proposed idea here would be to look at the MSR bitmap that is populated from
the guest writing IBPB to that vCPU at least once in its lifetime, on the theory
that a security-minded workload would indeed configure IBPB. Even with taskset,
one would think that a security-minded user would also set up IBPB to protect
itself within the guest, which is exactly what the Linux admin guide suggests
that they do (in spectre.rst).

Taking a step back and going back to the ground floor: what would be (or should
be) the expectation of the guest's own security configuration in this example?
I.e. they are using taskset to assign security domain 0 to vCPU 0 and security
domain 1 to vCPU 1. Would we expect them to always set up cond_ibpb (and
prctl/seccomp) to protect against other casual user space threads that just
might so happen to get scheduled onto vCPU 0/1? Or are we expecting them to
configure always_ibpb?

In such a security-minded scenario, what would be the expectation of host
configuration? If nothing else, I’d want to make sure we get the docs correct :)

You mentioned if someone was concerned about performance, are you
saying they also critically care about performance, such that they are
willing to *not* use IBPB at all, and instead just use taskset and hope
nothing ever gets scheduled on there, and then hope that the hypervisor
does the job for them?

Would this be the expectation of just KVM? Or all hypervisors on the
market?

Again, not trying to be a hard head, just trying to wrap my own head around all
of this. I appreciate the time from both you and Sean!

Cheers,
Jon
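The MSR-bitmap idea floated above can be modeled as a small predicate. The sketch below is a userspace toy, not KVM code: the struct, field, and helper names are all hypothetical stand-ins for KVM's real WRMSR-intercept plumbing, and only the decision logic ("the guest opted in to IBPB at least once, so barrier on its behalf") is illustrated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MSR_IA32_PRED_CMD 0x49
#define PRED_CMD_IBPB (1ULL << 0)

/* Hypothetical per-vCPU state: has the guest ever issued IBPB itself? */
struct vcpu_state {
	bool guest_has_used_ibpb;
};

/* Toy model of the proposed WRMSR intercept: remember the opt-in. */
static void handle_pred_cmd_wrmsr(struct vcpu_state *v, uint32_t msr,
				  uint64_t data)
{
	if (msr == MSR_IA32_PRED_CMD && (data & PRED_CMD_IBPB))
		v->guest_has_used_ibpb = true;
}

/* Proposed vCPU-load policy: only barrier for guests that use IBPB. */
static bool ibpb_needed_on_vcpu_load(const struct vcpu_state *v)
{
	return v->guest_has_used_ibpb;
}
```

In real KVM the opt-in bit would have to be tracked in the vCPU's MSR handling, and, as Jim points out in the reply, this heuristic misses guests that rely purely on scheduling for isolation.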
On Thu, May 12, 2022 at 5:50 PM Jon Kohler <jon@nutanix.com> wrote:

> You mentioned if someone was concerned about performance, are you
> saying they also critically care about performance, such that they are
> willing to *not* use IBPB at all, and instead just use taskset and hope
> nothing ever gets scheduled on there, and then hope that the hypervisor
> does the job for them?

I am saying that IBPB is not the only viable mitigation for
cross-process indirect branch steering. Proper scheduling can also
solve the problem, without the overhead of IBPB. Say that you have two
security domains: trusted and untrusted. If you have a two-socket
system, and you always run trusted workloads on socket#0 and untrusted
workloads on socket#1, IBPB is completely superfluous. However, if the
hypervisor chooses to schedule a vCPU thread from virtual socket#0
after a vCPU thread from virtual socket#1 on the same logical
processor, then it *must* execute an IBPB between those two vCPU
threads. Otherwise, it has introduced a non-architectural
vulnerability that the guest can't possibly be aware of.

If you can't trust your OS to schedule tasks where you tell it to
schedule them, can you really trust it to provide you with any kind of
inter-process security?

> Would this be the expectation of just KVM? Or all hypervisors on the
> market?

Any hypervisor that doesn't do this is broken, but that won't keep it
off the market. :-)
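Jim's invariant reduces to one line of logic. The sketch below uses made-up names and treats the virtual socket as a proxy for the guest-visible security domain: the host must barrier whenever it stacks vCPU threads from different virtual sockets on the same logical processor, because the guest's placement-based mitigation cannot see host scheduling decisions.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the scheduling argument above (hypothetical helper, not
 * KVM code). The guest pins trusted work to virtual socket 0 and
 * untrusted work to virtual socket 1; if the host runs vCPUs from both
 * virtual sockets back-to-back on one logical CPU, it must IBPB between
 * them or it silently defeats the guest's mitigation.
 */
static bool host_must_ibpb(int prev_virt_socket, int next_virt_socket,
			   bool same_logical_cpu)
{
	return same_logical_cpu && prev_virt_socket != next_virt_socket;
}
```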
> On May 12, 2022, at 11:06 PM, Jim Mattson <jmattson@google.com> wrote:
>
> On Thu, May 12, 2022 at 5:50 PM Jon Kohler <jon@nutanix.com> wrote:
>
>> You mentioned if someone was concerned about performance, are you
>> saying they also critically care about performance, such that they are
>> willing to *not* use IBPB at all, and instead just use taskset and hope
>> nothing ever gets scheduled on there, and then hope that the hypervisor
>> does the job for them?
>
> I am saying that IBPB is not the only viable mitigation for
> cross-process indirect branch steering. Proper scheduling can also
> solve the problem, without the overhead of IBPB. Say that you have two
> security domains: trusted and untrusted. If you have a two-socket
> system, and you always run trusted workloads on socket#0 and untrusted
> workloads on socket#1, IBPB is completely superfluous. However, if the
> hypervisor chooses to schedule a vCPU thread from virtual socket#0
> after a vCPU thread from virtual socket#1 on the same logical
> processor, then it *must* execute an IBPB between those two vCPU
> threads. Otherwise, it has introduced a non-architectural
> vulnerability that the guest can't possibly be aware of.
>
> If you can't trust your OS to schedule tasks where you tell it to
> schedule them, can you really trust it to provide you with any kind of
> inter-process security?

Fair enough, so going forward:
Should this be mandatory in all cases? How this whole effort came
about was that a user could configure their KVM host with conditional
IBPB, but this particular mitigation is now always on no matter what.

In our previous patch review threads, Sean and I mostly settled on making
this particular avenue active only when a user configures always_ibpb, such
that cases like the one you describe (and others like it that come up in
the future) can be covered easily, but for cond_ibpb, we can document
that it doesn’t cover this case.

Would that be acceptable here?

>
>> Would this be the expectation of just KVM? Or all hypervisors on the
>> market?
>
> Any hypervisor that doesn't do this is broken, but that won't keep it
> off the market. :-)

Very true :)
On Thu, May 12, 2022 at 8:19 PM Jon Kohler <jon@nutanix.com> wrote:
>
>
>
> > On May 12, 2022, at 11:06 PM, Jim Mattson <jmattson@google.com> wrote:
> >
> > On Thu, May 12, 2022 at 5:50 PM Jon Kohler <jon@nutanix.com> wrote:
> >
> >> You mentioned if someone was concerned about performance, are you
> >> saying they also critically care about performance, such that they are
> >> willing to *not* use IBPB at all, and instead just use taskset and hope
> >> nothing ever gets scheduled on there, and then hope that the hypervisor
> >> does the job for them?
> >
> > I am saying that IBPB is not the only viable mitigation for
> > cross-process indirect branch steering. Proper scheduling can also
> > solve the problem, without the overhead of IBPB. Say that you have two
> > security domains: trusted and untrusted. If you have a two-socket
> > system, and you always run trusted workloads on socket#0 and untrusted
> > workloads on socket#1, IBPB is completely superfluous. However, if the
> > hypervisor chooses to schedule a vCPU thread from virtual socket#0
> > after a vCPU thread from virtual socket#1 on the same logical
> > processor, then it *must* execute an IBPB between those two vCPU
> > threads. Otherwise, it has introduced a non-architectural
> > vulnerability that the guest can't possibly be aware of.
> >
> > If you can't trust your OS to schedule tasks where you tell it to
> > schedule them, can you really trust it to provide you with any kind of
> > inter-process security?
>
> Fair enough, so going forward:
> Should this be mandatory in all cases? How this whole effort came
> about was that a user could configure their KVM host with conditional
> IBPB, but this particular mitigation is now always on no matter what.
>
> In our previous patch review threads, Sean and I mostly settled on making
> this particular avenue active only when a user configures always_ibpb, such
> that cases like the one you describe (and others like it that come up in
> the future) can be covered easily, but for cond_ibpb, we can document
> that it doesn’t cover this case.
>
> Would that be acceptable here?

That would make me unhappy. We use cond_ibpb, and I don't want to
switch to always_ibpb, yet I do want this barrier.

> >
> >> Would this be the expectation of just KVM? Or all hypervisors on the
> >> market?
> >
> > Any hypervisor that doesn't do this is broken, but that won't keep it
> > off the market. :-)
>
> Very true :)
>
> On May 12, 2022, at 11:50 PM, Jim Mattson <jmattson@google.com> wrote:
>
> On Thu, May 12, 2022 at 8:19 PM Jon Kohler <jon@nutanix.com> wrote:
>>
>>
>>
>>> On May 12, 2022, at 11:06 PM, Jim Mattson <jmattson@google.com> wrote:
>>>
>>> On Thu, May 12, 2022 at 5:50 PM Jon Kohler <jon@nutanix.com> wrote:
>>>
>>>> You mentioned if someone was concerned about performance, are you
>>>> saying they also critically care about performance, such that they are
>>>> willing to *not* use IBPB at all, and instead just use taskset and hope
>>>> nothing ever gets scheduled on there, and then hope that the hypervisor
>>>> does the job for them?
>>>
>>> I am saying that IBPB is not the only viable mitigation for
>>> cross-process indirect branch steering. Proper scheduling can also
>>> solve the problem, without the overhead of IBPB. Say that you have two
>>> security domains: trusted and untrusted. If you have a two-socket
>>> system, and you always run trusted workloads on socket#0 and untrusted
>>> workloads on socket#1, IBPB is completely superfluous. However, if the
>>> hypervisor chooses to schedule a vCPU thread from virtual socket#0
>>> after a vCPU thread from virtual socket#1 on the same logical
>>> processor, then it *must* execute an IBPB between those two vCPU
>>> threads. Otherwise, it has introduced a non-architectural
>>> vulnerability that the guest can't possibly be aware of.
>>>
>>> If you can't trust your OS to schedule tasks where you tell it to
>>> schedule them, can you really trust it to provide you with any kind of
>>> inter-process security?
>>
>> Fair enough, so going forward:
>> Should this be mandatory in all cases? How this whole effort came
>> about was that a user could configure their KVM host with conditional
>> IBPB, but this particular mitigation is now always on no matter what.
>>
>> In our previous patch review threads, Sean and I mostly settled on making
>> this particular avenue active only when a user configures always_ibpb, such
>> that cases like the one you describe (and others like it that come up in
>> the future) can be covered easily, but for cond_ibpb, we can document
>> that it doesn’t cover this case.
>>
>> Would that be acceptable here?
>
> That would make me unhappy. We use cond_ibpb, and I don't want to
> switch to always_ibpb, yet I do want this barrier.

Ok gotcha, which I think is a good point for cloud providers, since the
workload(s) are especially opaque.

How about this: I could work up a v5 patch here where this was at minimum
a system level knob (similar to other mitigation knobs) and documented
in more detail. That way folks who might want more control here have the
basic ability to do that without recompiling the kernel. Such a “knob” would
be on by default, such that there is no functional regression here.

Would that be ok with you as a middle ground?

Thanks again,
Jon

>
>>>
>>>> Would this be the expectation of just KVM? Or all hypervisors on the
>>>> market?
>>>
>>> Any hypervisor that doesn't do this is broken, but that won't keep it
>>> off the market. :-)
>>
>> Very true :)
>>
On Fri, May 13, 2022 at 8:21 AM Jon Kohler <jon@nutanix.com> wrote:
>
>
>
> > On May 12, 2022, at 11:50 PM, Jim Mattson <jmattson@google.com> wrote:
> >
> > On Thu, May 12, 2022 at 8:19 PM Jon Kohler <jon@nutanix.com> wrote:
> >>
> >>
> >>
> >>> On May 12, 2022, at 11:06 PM, Jim Mattson <jmattson@google.com> wrote:
> >>>
> >>> On Thu, May 12, 2022 at 5:50 PM Jon Kohler <jon@nutanix.com> wrote:
> >>>
> >>>> You mentioned if someone was concerned about performance, are you
> >>>> saying they also critically care about performance, such that they are
> >>>> willing to *not* use IBPB at all, and instead just use taskset and hope
> >>>> nothing ever gets scheduled on there, and then hope that the hypervisor
> >>>> does the job for them?
> >>>
> >>> I am saying that IBPB is not the only viable mitigation for
> >>> cross-process indirect branch steering. Proper scheduling can also
> >>> solve the problem, without the overhead of IBPB. Say that you have two
> >>> security domains: trusted and untrusted. If you have a two-socket
> >>> system, and you always run trusted workloads on socket#0 and untrusted
> >>> workloads on socket#1, IBPB is completely superfluous. However, if the
> >>> hypervisor chooses to schedule a vCPU thread from virtual socket#0
> >>> after a vCPU thread from virtual socket#1 on the same logical
> >>> processor, then it *must* execute an IBPB between those two vCPU
> >>> threads. Otherwise, it has introduced a non-architectural
> >>> vulnerability that the guest can't possibly be aware of.
> >>>
> >>> If you can't trust your OS to schedule tasks where you tell it to
> >>> schedule them, can you really trust it to provide you with any kind of
> >>> inter-process security?
> >>
> >> Fair enough, so going forward:
> >> Should this be mandatory in all cases? How this whole effort came
> >> about was that a user could configure their KVM host with conditional
> >> IBPB, but this particular mitigation is now always on no matter what.
> >>
> >> In our previous patch review threads, Sean and I mostly settled on making
> >> this particular avenue active only when a user configures always_ibpb, such
> >> that cases like the one you describe (and others like it that come up in
> >> the future) can be covered easily, but for cond_ibpb, we can document
> >> that it doesn’t cover this case.
> >>
> >> Would that be acceptable here?
> >
> > That would make me unhappy. We use cond_ibpb, and I don't want to
> > switch to always_ibpb, yet I do want this barrier.
>
> Ok gotcha, which I think is a good point for cloud providers, since the
> workload(s) are especially opaque.
>
> How about this: I could work up a v5 patch here where this was at minimum
> a system level knob (similar to other mitigation knobs) and documented
> in more detail. That way folks who might want more control here have the
> basic ability to do that without recompiling the kernel. Such a “knob” would
> be on by default, such that there is no functional regression here.
>
> Would that be ok with you as a middle ground?

That would be great. Module parameter or sysctl is fine with me. Thanks!

> Thanks again,
> Jon
>
> >
> >>>
> >>>> Would this be the expectation of just KVM? Or all hypervisors on the
> >>>> market?
> >>>
> >>> Any hypervisor that doesn't do this is broken, but that won't keep it
> >>> off the market. :-)
> >>
> >> Very true :)
> >>
>
diff --git a/Documentation/admin-guide/hw-vuln/spectre.rst b/Documentation/admin-guide/hw-vuln/spectre.rst
index 9e9556826450..1705f7a0b15d 100644
--- a/Documentation/admin-guide/hw-vuln/spectre.rst
+++ b/Documentation/admin-guide/hw-vuln/spectre.rst
@@ -314,7 +314,23 @@ Spectre variant 2
    Linux kernel mitigates attacks to other guests running in the same
    CPU hardware thread by flushing the return stack buffer on VM exit,
-   and clearing the branch target buffer before switching to a new guest.
+   and clearing the branch target buffer before switching to a new guest
+   using IBPB.
+
+   When using IBPB to clear branch target buffer, there are three
+   scenarios at play for VM switching, as follows:
+   If switching between vCPUs belonging to the same guest VM on the same CPU
+   hardware thread, they are considered to be in the same security domain and
+   do not need an IBPB.
+   If switching between vCPUs that belong to different VMs, and each VM has
+   its own memory address space, then IBPB will clear during the context
+   switch.
+   If the Virtual Machine Monitor (VMM) configures multiple VMs to share
+   a single address space, they are all considered to be one single security
+   domain, such that switching in between vCPUs that belong to different VMs
+   in this address space will not use IBPB. Sharing an address space between
+   multiple VMs is uncommon and should be discouraged for security minded
+   workloads.
 
    If SMT is used, Spectre variant 2 attacks from an untrusted guest
    in the sibling hyperthread can be mitigated by the administrator,
@@ -534,7 +550,9 @@ Spectre variant 2
    To mitigate guest-to-guest attacks in the same CPU hardware thread,
    the branch target buffer is sanitized by flushing before switching
-   to a new guest on a CPU.
+   to a new guest on a CPU. Note that this will not occur if guests are
+   configured to use a single shared memory address space, as that is
+   considered a single security domain.
 
    The above mitigations are turned on by default on vulnerable CPUs.
@@ -660,6 +678,17 @@ Mitigation selection guide
       against variant 2 attacks originating from programs running on
       sibling threads.
 
+      Note that in a virtualized environment using KVM, this flush will be
+      done when switching in between VMs. Each VM is considered its own
+      security domain and IBPB will not flush when switching in between
+      vCPUs from a single guest running on the same pCPU. Virtual machine
+      monitors (VMMs), such as qemu-kvm, traditionally configure each VM to
+      be a separate process with separate memory address space; however,
+      it should be explicitly noted that IBPB will not flush between vCPU
+      changes if a VMM configures multiple VMs in a shared address space.
+      While such a configuration is plausible, it is not practical from a
+      security perspective and should be avoided for security minded
+      workloads.
+
       Alternatively, STIBP can be used only when running programs
       whose indirect branch speculation is explicitly disabled,
       while IBPB is still used all the time when switching to a new
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 7e45d03cd018..051955145c29 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1302,7 +1302,6 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 
 	if (sd->current_vmcb != svm->vmcb) {
 		sd->current_vmcb = svm->vmcb;
-		indirect_branch_prediction_barrier();
 	}
 	if (kvm_vcpu_apicv_active(vcpu))
 		__avic_vcpu_load(vcpu, cpu);
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 856c87563883..e2271d935d5c 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -266,7 +266,7 @@ static void vmx_switch_vmcs(struct kvm_vcpu *vcpu, struct loaded_vmcs *vmcs)
 	cpu = get_cpu();
 	prev = vmx->loaded_vmcs;
 	vmx->loaded_vmcs = vmcs;
-	vmx_vcpu_load_vmcs(vcpu, cpu, prev);
+	vmx_vcpu_load_vmcs(vcpu, cpu);
 	vmx_sync_vmcs_host_state(vmx, prev);
 	put_cpu();
 
@@ -4097,12 +4097,12 @@ static void copy_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
 	cpu = get_cpu();
 	vmx->loaded_vmcs = &vmx->nested.vmcs02;
-	vmx_vcpu_load_vmcs(vcpu, cpu, &vmx->vmcs01);
+	vmx_vcpu_load_vmcs(vcpu, cpu);
 
 	sync_vmcs02_to_vmcs12_rare(vcpu, vmcs12);
 
 	vmx->loaded_vmcs = &vmx->vmcs01;
-	vmx_vcpu_load_vmcs(vcpu, cpu, &vmx->nested.vmcs02);
+	vmx_vcpu_load_vmcs(vcpu, cpu);
 	put_cpu();
 }
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 610355b9ccce..13f720686ad1 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1235,8 +1235,7 @@ static void vmx_write_guest_kernel_gs_base(struct vcpu_vmx *vmx, u64 data)
 }
 #endif
 
-void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
-			struct loaded_vmcs *buddy)
+void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	bool already_loaded = vmx->loaded_vmcs->cpu == cpu;
@@ -1263,14 +1262,6 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu)
 	if (prev != vmx->loaded_vmcs->vmcs) {
 		per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
 		vmcs_load(vmx->loaded_vmcs->vmcs);
-
-		/*
-		 * No indirect branch prediction barrier needed when switching
-		 * the active VMCS within a guest, e.g. on nested VM-Enter.
-		 * The L1 VMM can protect itself with retpolines, IBPB or IBRS.
-		 */
-		if (!buddy || WARN_ON_ONCE(buddy->vmcs != prev))
-			indirect_branch_prediction_barrier();
 	}
 
 	if (!already_loaded) {
@@ -1308,7 +1299,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
-	vmx_vcpu_load_vmcs(vcpu, cpu, NULL);
+	vmx_vcpu_load_vmcs(vcpu, cpu);
 
 	vmx_vcpu_pi_load(vcpu, cpu);
 
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index b98c7e96697a..d5f6d8d0b7ca 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -369,8 +369,7 @@ struct kvm_vmx {
 };
 
 bool nested_vmx_allowed(struct kvm_vcpu *vcpu);
-void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
-			struct loaded_vmcs *buddy);
+void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu);
 int allocate_vpid(void);
 void free_vpid(int vpid);
 void vmx_set_constant_host_state(struct vcpu_vmx *vmx);
Remove IBPB that is done on KVM vCPU load, as the guest-to-guest
attack surface is already covered by switch_mm_irqs_off() ->
cond_mitigation().

The original commit 15d45071523d ("KVM/x86: Add IBPB support") was simply
wrong in its guest-to-guest design intention. There are three scenarios
at play here:

1. If the vCPUs belong to the same VM, they are in the same security
   domain and do not need an IBPB.
2. If the vCPUs belong to different VMs, and each VM is in its own
   mm_struct, switch_mm_irqs_off() will handle IBPB as an mm switch is
   guaranteed to occur prior to loading a vCPU belonging to a different
   VM.
3. If the vCPUs belong to different VMs, but multiple VMs share an
   mm_struct, then the security benefits of an IBPB when switching vCPUs
   are dubious, at best.

Issuing IBPB from KVM vCPU load would only cover #3, but there are no
known real world, tangible use cases for running multiple VMs belonging
to different security domains in a shared address space. Running multiple
VMs in a single address space is plausible and sane, _if_ they are all in
the same security domain or security is not a concern. If multiple VMs
are sharing an mm_struct, prediction attacks are the least of their
security worries.

Update documentation to reflect what we're protecting against and what
we're not, and also remove "buddy" from vmx_vcpu_load_vmcs() as it is
now unused.

Fixes: 15d45071523d ("KVM/x86: Add IBPB support")
Signed-off-by: Jon Kohler <jon@nutanix.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Waiman Long <longman@redhat.com>
---
v1 - https://lore.kernel.org/all/20220411164636.74866-1-jon@nutanix.com/
v1 -> v2:
 - https://lore.kernel.org/all/20220419020011.65995-1-jon@nutanix.com/
 - Addressed comments on approach from Sean.
v2 -> v3:
 - https://lore.kernel.org/all/20220422162103.32736-1-jon@nutanix.com/
 - Updated spec-ctrl.h comments and commit msg to incorporate additional
   feedback from Sean.
v3 -> v4:
 - Addressed comments from Boris and Sean.
 - Narrowed change to removing KVM's IBPB only + docs update.
 - Removed "buddy" on vmx_vcpu_load_vmcs() as it is now unused.

 Documentation/admin-guide/hw-vuln/spectre.rst | 33 +++++++++++++++++--
 arch/x86/kvm/svm/svm.c                        |  1 -
 arch/x86/kvm/vmx/nested.c                     |  6 ++--
 arch/x86/kvm/vmx/vmx.c                        | 13 ++------
 arch/x86/kvm/vmx/vmx.h                        |  3 +-
 5 files changed, 37 insertions(+), 19 deletions(-)

--
2.30.1 (Apple Git-130)