Message ID | 20240509090146.146153-1-leitao@debian.org (mailing list archive) |
---|---|
State | New, archived |
Series | KVM: Addressing a possible race in kvm_vcpu_on_spin: |
On Thu, May 09, 2024, Breno Leitao wrote:
> There are two workflow paths that access the same address
> simultaneously, creating a potential data race in kvm_vcpu_on_spin. This
> occurs when one workflow reads kvm->last_boosted_vcpu while another
> parallel path writes to it.
>
> KCSAN produces the following output when enabled.
>
> BUG: KCSAN: data-race in kvm_vcpu_on_spin [kvm] / kvm_vcpu_on_spin [kvm]
>
> write to 0xffffc90025a92344 of 4 bytes by task 4340 on cpu 16:
> kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4112) kvm
> handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel
> vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:? arch/x86/kvm/vmx/vmx.c:6606) kvm_intel
> vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm
> kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm
> kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm
> __se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890)
> __x64_sys_ioctl (fs/ioctl.c:890)
> x64_sys_call (arch/x86/entry/syscall_64.c:33)
> do_syscall_64 (arch/x86/entry/common.c:?)
> entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
>
> read to 0xffffc90025a92344 of 4 bytes by task 4342 on cpu 4:
> kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4069) kvm
> handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel
> vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:? arch/x86/kvm/vmx/vmx.c:6606) kvm_intel
> vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm
> kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm
> kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm
> __se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890)
> __x64_sys_ioctl (fs/ioctl.c:890)
> x64_sys_call (arch/x86/entry/syscall_64.c:33)
> do_syscall_64 (arch/x86/entry/common.c:?)
> entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
>
> value changed: 0x00000012 -> 0x00000000
>
> Given that both operations occur simultaneously without any locking
> mechanisms in place, let's ensure atomicity to prevent possible data
> corruption. We'll achieve this by employing READ_ONCE() for the reading
> operation and WRITE_ONCE() for the writing operation.

Please state changelogs as commands, e.g.

  Use {READ,WRITE}_ONCE() to access kvm->last_boosted_vcpu to ensure ...

And I think it's worth calling out that corruption is _extremely_ unlikely to
happen in practice. It would require the compiler to generate truly awful
code, and it would require a VM with >256 vCPUs.

That said, I do think this should be sent to stable kernels, as it's (very,
very) theoretically possible to generate an out-of-bounds access, and this
seems like a super safe fix.

How about this?

---
KVM: Fix a data race on last_boosted_vcpu in kvm_vcpu_on_spin()

Use {READ,WRITE}_ONCE() to access kvm->last_boosted_vcpu to ensure the
loads and stores are atomic. In the extremely unlikely scenario the
compiler tears the stores, it's theoretically possible for KVM to attempt
to get a vCPU using an out-of-bounds index, e.g.
if the write is split into multiple 8-bit stores, and is paired with a
32-bit load on a VM with 257 vCPUs:

  CPU0                              CPU1
  last_boosted_vcpu = 0xff;

                                    (last_boosted_vcpu = 0x100)
                                    last_boosted_vcpu[15:8] = 0x01;
  i = (last_boosted_vcpu = 0x1ff)

                                    last_boosted_vcpu[7:0] = 0x00;
  vcpu = kvm->vcpu_array[0x1ff];

As detected by KCSAN:

  BUG: KCSAN: data-race in kvm_vcpu_on_spin [kvm] / kvm_vcpu_on_spin [kvm]

  write to 0xffffc90025a92344 of 4 bytes by task 4340 on cpu 16:
  kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4112) kvm
  handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel
  vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:? arch/x86/kvm/vmx/vmx.c:6606) kvm_intel
  vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm
  kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm
  kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm
  __se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890)
  __x64_sys_ioctl (fs/ioctl.c:890)
  x64_sys_call (arch/x86/entry/syscall_64.c:33)
  do_syscall_64 (arch/x86/entry/common.c:?)
  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

  read to 0xffffc90025a92344 of 4 bytes by task 4342 on cpu 4:
  kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4069) kvm
  handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel
  vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:? arch/x86/kvm/vmx/vmx.c:6606) kvm_intel
  vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm
  kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm
  kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm
  __se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890)
  __x64_sys_ioctl (fs/ioctl.c:890)
  x64_sys_call (arch/x86/entry/syscall_64.c:33)
  do_syscall_64 (arch/x86/entry/common.c:?)
  entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

  value changed: 0x00000012 -> 0x00000000

Fixes: 217ece6129f2 ("KVM: use yield_to instead of sleep in kvm_vcpu_on_spin")
Cc: stable@vger.kernel.org
---

> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
>  virt/kvm/kvm_main.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ff0a20565f90..9768307d5e6c 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4066,12 +4066,13 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
>  {
>  	struct kvm *kvm = me->kvm;
>  	struct kvm_vcpu *vcpu;
> -	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
> +	int last_boosted_vcpu;
>  	unsigned long i;
>  	int yielded = 0;
>  	int try = 3;
>  	int pass;
>
> +	last_boosted_vcpu = READ_ONCE(me->kvm->last_boosted_vcpu);

Nit, this could opportunistically use "kvm" without the silly me->kvm.

>  	kvm_vcpu_set_in_spin_loop(me, true);
>  	/*
>  	 * We boost the priority of a VCPU that is runnable but not
> @@ -4109,7 +4110,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
>
>  			yielded = kvm_vcpu_yield_to(vcpu);
>  			if (yielded > 0) {
> -				kvm->last_boosted_vcpu = i;
> +				WRITE_ONCE(kvm->last_boosted_vcpu, i);
>  				break;
>  			} else if (yielded < 0) {
>  				try--;

Side topic #1: am I the only one that finds these loops unnecessarily hard to
read?

Unless I'm misreading the code, it's really just an indirect way of looping
over all vCPUs, starting at last_boosted_vcpu+1 and then wrapping.

IMO, reworking it to be like this is more straightforward:

	int nr_vcpus, start, i, idx, yielded;
	struct kvm *kvm = me->kvm;
	struct kvm_vcpu *vcpu;
	int try = 3;

	nr_vcpus = atomic_read(&kvm->online_vcpus);
	if (nr_vcpus < 2)
		return;

	/* Pairs with the smp_wmb() in kvm_vm_ioctl_create_vcpu(). */
	smp_rmb();

	kvm_vcpu_set_in_spin_loop(me, true);

	start = READ_ONCE(kvm->last_boosted_vcpu) + 1;
	for (i = 0; i < nr_vcpus; i++) {
		idx = (start + i) % nr_vcpus;
		if (idx == me->vcpu_idx)
			continue;

		vcpu = xa_load(&kvm->vcpu_array, idx);
		if (!READ_ONCE(vcpu->ready))
			continue;
		if (kvm_vcpu_is_blocking(vcpu) && !vcpu_dy_runnable(vcpu))
			continue;

		/*
		 * Treat the target vCPU as being in-kernel if it has a pending
		 * interrupt, as the vCPU trying to yield may be spinning
		 * waiting on IPI delivery, i.e. the target vCPU is in-kernel
		 * for the purposes of directed yield.
		 */
		if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
		    !kvm_arch_dy_has_pending_interrupt(vcpu) &&
		    !kvm_arch_vcpu_preempted_in_kernel(vcpu))
			continue;

		if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
			continue;

		yielded = kvm_vcpu_yield_to(vcpu);
		if (yielded > 0) {
			WRITE_ONCE(kvm->last_boosted_vcpu, idx);
			break;
		} else if (yielded < 0 && !--try) {
			break;
		}
	}

	kvm_vcpu_set_in_spin_loop(me, false);

	/* Ensure vcpu is not eligible during next spinloop */
	kvm_vcpu_set_dy_eligible(me, false);

Side topic #2: intercepting PAUSE on x86 when there's only a single vCPU in
the VM is silly. I don't know if it's worth the complexity, but we could
defer enabling PLE exiting until a second vCPU is created, e.g. via a new
request.

Hmm, but x86 at least already has KVM_X86_DISABLE_EXITS_PAUSE, so this could
be more easily handled in userspace, e.g. by disabling PAUSE exiting if
userspace knows it's creating a single-vCPU VM.
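For illustration, here is a minimal userspace sketch (not part of the thread) of the approach alluded to above: a VMM that knows it is creating a single-vCPU VM disables PAUSE exiting via the existing KVM_CAP_X86_DISABLE_EXITS capability before creating any vCPU. Error handling is deliberately terse and the VM is otherwise left empty.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
	int kvm_fd = open("/dev/kvm", O_RDWR | O_CLOEXEC);
	if (kvm_fd < 0)
		return perror("open /dev/kvm"), 1;

	int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);
	if (vm_fd < 0)
		return perror("KVM_CREATE_VM"), 1;

	/* The capability check returns a bitmask of exits that may be disabled. */
	int supported = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_X86_DISABLE_EXITS);
	if (!(supported & KVM_X86_DISABLE_EXITS_PAUSE)) {
		fprintf(stderr, "PAUSE exiting cannot be disabled on this host\n");
		return 1;
	}

	/* Must be enabled before any vCPU is created. */
	struct kvm_enable_cap cap;
	memset(&cap, 0, sizeof(cap));
	cap.cap = KVM_CAP_X86_DISABLE_EXITS;
	cap.args[0] = KVM_X86_DISABLE_EXITS_PAUSE;
	if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap))
		return perror("KVM_ENABLE_CAP"), 1;

	/* ... create the single vCPU and run the guest as usual ... */
	return 0;
}
```

With PAUSE exiting disabled, the guest's spin loops never reach kvm_vcpu_on_spin() in the first place, which is why this is attractive for single-vCPU VMs where directed yield has no one to yield to.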
Hello Sean,

On Thu, May 09, 2024 at 09:42:48AM -0700, Sean Christopherson wrote:
> On Thu, May 09, 2024, Breno Leitao wrote:
> > 	kvm_vcpu_set_in_spin_loop(me, true);
> > 	/*
> > 	 * We boost the priority of a VCPU that is runnable but not
> > @@ -4109,7 +4110,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
> >
> > 			yielded = kvm_vcpu_yield_to(vcpu);
> > 			if (yielded > 0) {
> > -				kvm->last_boosted_vcpu = i;
> > +				WRITE_ONCE(kvm->last_boosted_vcpu, i);
> > 				break;
> > 			} else if (yielded < 0) {
> > 				try--;
>
> Side topic #1: am I the only one that finds these loops unnecessarily hard to
> read?

No. :-)

In fact, when I skimmed over the code, I thought that maybe the code was not
covering the vCPUs before last_boosted_vcpu in the array. Now that I am
looking at it carefully, the code is using `pass` to track if the vCPU passed
last_boosted_vcpu in the index.

> Unless I'm misreading the code, it's really just an indirect way of looping
> over all vCPUs, starting at last_boosted_vcpu+1 and then wrapping.
>
> IMO, reworking it to be like this is more straightforward:
>
> 	int nr_vcpus, start, i, idx, yielded;
> 	struct kvm *kvm = me->kvm;
> 	struct kvm_vcpu *vcpu;
> 	int try = 3;
>
> 	nr_vcpus = atomic_read(&kvm->online_vcpus);
> 	if (nr_vcpus < 2)
> 		return;
>
> 	/* Pairs with the smp_wmb() in kvm_vm_ioctl_create_vcpu(). */
> 	smp_rmb();

Why do you need this now? Isn't the RCU read lock in xa_load() enough?

> 	kvm_vcpu_set_in_spin_loop(me, true);
>
> 	start = READ_ONCE(kvm->last_boosted_vcpu) + 1;
> 	for (i = 0; i < nr_vcpus; i++) {

Why do you need to start at the last boosted vcpu? I.e., why not start at 0
and skip me->vcpu_idx and kvm->last_boosted_vcpu?

> 		idx = (start + i) % nr_vcpus;
> 		if (idx == me->vcpu_idx)
> 			continue;
>
> 		vcpu = xa_load(&kvm->vcpu_array, idx);
> 		if (!READ_ONCE(vcpu->ready))
> 			continue;
> 		if (kvm_vcpu_is_blocking(vcpu) && !vcpu_dy_runnable(vcpu))
> 			continue;
>
> 		/*
> 		 * Treat the target vCPU as being in-kernel if it has a pending
> 		 * interrupt, as the vCPU trying to yield may be spinning
> 		 * waiting on IPI delivery, i.e. the target vCPU is in-kernel
> 		 * for the purposes of directed yield.
> 		 */
> 		if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
> 		    !kvm_arch_dy_has_pending_interrupt(vcpu) &&
> 		    !kvm_arch_vcpu_preempted_in_kernel(vcpu))
> 			continue;
>
> 		if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
> 			continue;
>
> 		yielded = kvm_vcpu_yield_to(vcpu);
> 		if (yielded > 0) {
> 			WRITE_ONCE(kvm->last_boosted_vcpu, idx);
> 			break;
> 		} else if (yielded < 0 && !--try) {
> 			break;
> 		}
> 	}
>
> 	kvm_vcpu_set_in_spin_loop(me, false);
>
> 	/* Ensure vcpu is not eligible during next spinloop */
> 	kvm_vcpu_set_dy_eligible(me, false);

I didn't test it, but I reviewed it, and it seems sane and way easier to
read. I agree this code is easier to read, coming from someone with little
KVM background.
On Fri, May 10, 2024, Breno Leitao wrote:
> > IMO, reworking it to be like this is more straightforward:
> >
> > 	int nr_vcpus, start, i, idx, yielded;
> > 	struct kvm *kvm = me->kvm;
> > 	struct kvm_vcpu *vcpu;
> > 	int try = 3;
> >
> > 	nr_vcpus = atomic_read(&kvm->online_vcpus);
> > 	if (nr_vcpus < 2)
> > 		return;
> >
> > 	/* Pairs with the smp_wmb() in kvm_vm_ioctl_create_vcpu(). */
> > 	smp_rmb();
>
> Why do you need this now? Isn't the RCU read lock in xa_load() enough?

No. RCU read lock doesn't suffice, because on kernels without PREEMPT_COUNT
rcu_read_lock() may be a literal nop. There may be a _compiler_ barrier, but
smp_rmb() requires more than a compiler barrier on many architectures.

And just as importantly, KVM shouldn't be relying on the inner details of
other code without a hard guarantee of that behavior. E.g. KVM does rely on
srcu_read_unlock() to provide a full memory barrier, but KVM does so through
an "official" API, smp_mb__after_srcu_read_unlock().

> > 	kvm_vcpu_set_in_spin_loop(me, true);
> >
> > 	start = READ_ONCE(kvm->last_boosted_vcpu) + 1;
> > 	for (i = 0; i < nr_vcpus; i++) {
>
> Why do you need to start at the last boosted vcpu? I.e., why not start at 0
> and skip me->vcpu_idx and kvm->last_boosted_vcpu?

To provide round-robin style yielding in order to (hopefully) yield to the vCPU
that is holding a spinlock (or some other asset that is causing a vCPU to spin
in kernel mode).

E.g. if there are 4 vCPUs all running on a single CPU, vCPU3 gets preempted while
holding a spinlock, and all vCPUs are contending for said spinlock, then starting
at vCPU0 every time would result in vCPU1 yielding to vCPU0, and vCPU0 yielding
back to vCPU1, indefinitely.

Starting at the last boosted vCPU instead results in vCPU0 yielding to vCPU1,
vCPU1 yielding to vCPU2, and vCPU2 yielding to vCPU3, thus getting back to the
vCPU that holds the spinlock soon-ish.

I'm sure more sophisticated/performant approaches are possible, but they would
likely be more complex, require persistent state (a.k.a. memory usage), and/or
need knowledge of the workload being run.
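To make the smp_wmb()/smp_rmb() pairing discussed above concrete, here is a hedged, illustrative userspace analogue (not kernel code, and not from the thread) using C11 fences: a "creator" thread publishes an array entry and only then bumps the online count, mirroring kvm_vm_ioctl_create_vcpu(); a "reader" thread loads the count and issues an acquire fence before dereferencing the entries it was told exist. The names and structure are invented for the demo.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_VCPUS 8

struct fake_vcpu { int id; };

static struct fake_vcpu *vcpu_array[MAX_VCPUS];
static atomic_int online_vcpus;

static void *creator(void *arg)
{
	(void)arg;
	for (int i = 0; i < MAX_VCPUS; i++) {
		struct fake_vcpu *v = malloc(sizeof(*v));
		v->id = i;
		vcpu_array[i] = v;                              /* publish the entry... */
		atomic_thread_fence(memory_order_release);      /* ~ smp_wmb() */
		atomic_store_explicit(&online_vcpus, i + 1,
				      memory_order_relaxed);    /* ...then the count */
	}
	return NULL;
}

static void *reader(void *arg)
{
	(void)arg;
	int seen = 0;

	while (seen < MAX_VCPUS) {
		int n = atomic_load_explicit(&online_vcpus, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);      /* ~ smp_rmb() */
		for (int i = seen; i < n; i++)
			printf("saw vcpu %d\n", vcpu_array[i]->id);
		seen = n;
	}
	return NULL;
}

int main(void)
{
	pthread_t c, r;

	pthread_create(&c, NULL, creator, NULL);
	pthread_create(&r, NULL, reader, NULL);
	pthread_join(c, NULL);
	pthread_join(r, NULL);
	return 0;
}
```

The point of the pairing is exactly what the reply states: the acquire-side fence must be ordered against the count load by something stronger than a compiler barrier, which an RCU read-side critical section does not guarantee.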
On Fri, May 10, 2024 at 07:39:14AM -0700, Sean Christopherson wrote:
> On Fri, May 10, 2024, Breno Leitao wrote:
> > > IMO, reworking it to be like this is more straightforward:
> > >
> > > 	int nr_vcpus, start, i, idx, yielded;
> > > 	struct kvm *kvm = me->kvm;
> > > 	struct kvm_vcpu *vcpu;
> > > 	int try = 3;
> > >
> > > 	nr_vcpus = atomic_read(&kvm->online_vcpus);
> > > 	if (nr_vcpus < 2)
> > > 		return;
> > >
> > > 	/* Pairs with the smp_wmb() in kvm_vm_ioctl_create_vcpu(). */
> > > 	smp_rmb();
> >
> > Why do you need this now? Isn't the RCU read lock in xa_load() enough?
>
> No. RCU read lock doesn't suffice, because on kernels without PREEMPT_COUNT
> rcu_read_lock() may be a literal nop. There may be a _compiler_ barrier, but
> smp_rmb() requires more than a compiler barrier on many architectures.

Makes sense. In fact, it makes sense to have an explicit barrier in-between
the xarray modify operations and reading/storing online_vcpus.

> > > 	kvm_vcpu_set_in_spin_loop(me, true);
> > >
> > > 	start = READ_ONCE(kvm->last_boosted_vcpu) + 1;
> > > 	for (i = 0; i < nr_vcpus; i++) {
> >
> > Why do you need to start at the last boosted vcpu? I.e., why not start at 0
> > and skip me->vcpu_idx and kvm->last_boosted_vcpu?
>
> To provide round-robin style yielding in order to (hopefully) yield to the vCPU
> that is holding a spinlock (or some other asset that is causing a vCPU to spin
> in kernel mode).
>
> E.g. if there are 4 vCPUs all running on a single CPU, vCPU3 gets preempted while
> holding a spinlock, and all vCPUs are contending for said spinlock, then starting
> at vCPU0 every time would result in vCPU1 yielding to vCPU0, and vCPU0 yielding
> back to vCPU1, indefinitely.

Makes sense, this would always privilege vCPU0 over the later vCPUs. 100%
clear.

Thanks!
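A tiny standalone demo (not from the thread) of the visit order produced by the "start at last_boosted_vcpu + 1 and wrap" indexing in the reworked loop: with 4 vCPUs, last_boosted_vcpu == 1, and vCPU0 yielding, the candidates are considered in the order vCPU2, vCPU3, vCPU1, i.e. round-robin rather than always favoring the lowest-numbered vCPU.

```c
#include <stdio.h>

int main(void)
{
	int nr_vcpus = 4;
	int last_boosted_vcpu = 1;
	int self = 0;			/* index of the yielding vCPU */

	int start = last_boosted_vcpu + 1;
	for (int i = 0; i < nr_vcpus; i++) {
		int idx = (start + i) % nr_vcpus;
		if (idx == self)	/* never yield to ourselves */
			continue;
		printf("candidate vCPU%d\n", idx);
	}
	return 0;
}
```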
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ff0a20565f90..9768307d5e6c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4066,12 +4066,13 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 {
 	struct kvm *kvm = me->kvm;
 	struct kvm_vcpu *vcpu;
-	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
+	int last_boosted_vcpu;
 	unsigned long i;
 	int yielded = 0;
 	int try = 3;
 	int pass;
 
+	last_boosted_vcpu = READ_ONCE(me->kvm->last_boosted_vcpu);
 	kvm_vcpu_set_in_spin_loop(me, true);
 	/*
 	 * We boost the priority of a VCPU that is runnable but not
@@ -4109,7 +4110,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 
 			yielded = kvm_vcpu_yield_to(vcpu);
 			if (yielded > 0) {
-				kvm->last_boosted_vcpu = i;
+				WRITE_ONCE(kvm->last_boosted_vcpu, i);
 				break;
 			} else if (yielded < 0) {
 				try--;
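As background for the patch above, here is a hedged, simplified sketch (not the kernel's exact implementation) of what READ_ONCE()/WRITE_ONCE() boil down to for naturally aligned, word-sized scalars: volatile accesses, which in practice prevent the compiler from tearing, fusing, or re-reading the access. The MY_ prefix is used to make clear these are stand-in macros for illustration.

```c
#include <stdio.h>

#define MY_READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define MY_WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

static int last_boosted_vcpu;

int main(void)
{
	/* Writer side: emitted as a single store the compiler may not split. */
	MY_WRITE_ONCE(last_boosted_vcpu, 0x100);

	/* Reader side: a single load, never re-fetched mid-computation. */
	int snapshot = MY_READ_ONCE(last_boosted_vcpu);

	printf("snapshot = 0x%x\n", snapshot);
	return 0;
}
```

In kvm_vcpu_on_spin() this matters because the reader snapshots last_boosted_vcpu once and then uses it as a starting index; a torn or re-fetched value is what could produce an out-of-bounds index in the theoretical scenario described earlier in the thread.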
There are two workflow paths that access the same address simultaneously,
creating a potential data race in kvm_vcpu_on_spin. This occurs when one
workflow reads kvm->last_boosted_vcpu while another parallel path writes to
it.

KCSAN produces the following output when enabled.

BUG: KCSAN: data-race in kvm_vcpu_on_spin [kvm] / kvm_vcpu_on_spin [kvm]

write to 0xffffc90025a92344 of 4 bytes by task 4340 on cpu 16:
kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4112) kvm
handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel
vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:? arch/x86/kvm/vmx/vmx.c:6606) kvm_intel
vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm
kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm
kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm
__se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890)
__x64_sys_ioctl (fs/ioctl.c:890)
x64_sys_call (arch/x86/entry/syscall_64.c:33)
do_syscall_64 (arch/x86/entry/common.c:?)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

read to 0xffffc90025a92344 of 4 bytes by task 4342 on cpu 4:
kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4069) kvm
handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel
vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:? arch/x86/kvm/vmx/vmx.c:6606) kvm_intel
vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm
kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm
kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm
__se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890)
__x64_sys_ioctl (fs/ioctl.c:890)
x64_sys_call (arch/x86/entry/syscall_64.c:33)
do_syscall_64 (arch/x86/entry/common.c:?)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

value changed: 0x00000012 -> 0x00000000

Given that both operations occur simultaneously without any locking
mechanisms in place, let's ensure atomicity to prevent possible data
corruption. We'll achieve this by employing READ_ONCE() for the reading
operation and WRITE_ONCE() for the writing operation.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 virt/kvm/kvm_main.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
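For completeness, a hedged, single-threaded simulation (not from the patch, and assuming a little-endian host such as x86) of the store-tearing scenario that makes this race theoretically dangerous: if the compiler split the 32-bit store of 0x100 into two 8-bit stores, a reader running "in between" could observe the stale low byte together with the new high byte and compute an out-of-bounds index of 0x1ff.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t last_boosted_vcpu = 0xff;	/* previous value */
	uint8_t *bytes = (uint8_t *)&last_boosted_vcpu;

	/* Torn store of 0x100: the high byte lands first... */
	bytes[1] = 0x01;

	/* ...so a reader at this point sees a value that was never written:
	 * 0x1ff, i.e. vCPU index 511 on a VM with only 257 vCPUs. */
	printf("observed index: 0x%x\n", last_boosted_vcpu);

	/* ...before the low byte completes the intended store of 0x100. */
	bytes[0] = 0x00;
	printf("final value:    0x%x\n", last_boosted_vcpu);
	return 0;
}
```

As the review notes, a real compiler is extremely unlikely to generate such code for an aligned 32-bit store, which is why the fix is safe and cheap but still worth sending to stable.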