Message ID: 20211210163625.2886-1-dwmw2@infradead.org (mailing list archive)
Series: x86/xen: Add in-kernel Xen event channel delivery
On 12/10/21 17:36, David Woodhouse wrote:
> Introduce the basic concept of 2 level event channels for kernel delivery,
> which is just a simple matter of a few test_and_set_bit calls on a mapped
> shared info page.
>
> This can be used for routing MSI of passthrough devices to PIRQ event
> channels in a Xen guest, and we can build on it for delivering IPIs and
> timers directly from the kernel too.

Queued patches 1-5, thanks.

Paolo

> v1: Use kvm_map_gfn() although I didn't quite see how it works.
>
> v2: Avoid kvm_map_gfn() and implement a safe mapping with invalidation
>     support for myself.
>
> v3: Reinvent gfn_to_pfn_cache with sane invalidation semantics, for my
>     use case as well as nesting.
>
> v4: Rework dirty handling, as it became apparent that we need an active
>     vCPU context to mark pages dirty, so it can't be done from the MMU
>     notifier during the invalidation; it has to happen on unmap.
>
> v5: Fix sparse warnings reported by kernel test robot <lkp@intel.com>.
>
>     Fix revalidation when memslots change but the resulting HVA stays
>     the same. We can use the same kernel mapping in that case, if the
>     HVA → PFN translation was valid before. So that probably means we
>     shouldn't unmap the "old_hva". Augment the test case to exercise
>     that one too.
>
>     Include the fix for the dirty ring vs. Xen shinfo oops reported
>     by butt3rflyh4ck <butterflyhuangxx@gmail.com>.
>
> v6: Paolo's review feedback, rebase onto kvm/next dropping the patches
>     which are already merged.
>
> Again, the *last* patch in the series (this time #6) is for illustration
> and is not intended to be merged as-is.
>
> David Woodhouse (6):
>   KVM: Warn if mark_page_dirty() is called without an active vCPU
>   KVM: Reinstate gfn_to_pfn_cache with invalidation support
>   KVM: x86/xen: Maintain valid mapping of Xen shared_info page
>   KVM: x86/xen: Add KVM_IRQ_ROUTING_XEN_EVTCHN and event channel delivery
>   KVM: x86: Fix wall clock writes in Xen shared_info not to mark page dirty
>   KVM: x86: First attempt at converting nested virtual APIC page to gpc
>
>  Documentation/virt/kvm/api.rst                     |  33 ++
>  arch/x86/include/asm/kvm_host.h                    |   4 +-
>  arch/x86/kvm/Kconfig                               |   1 +
>  arch/x86/kvm/irq_comm.c                            |  12 +
>  arch/x86/kvm/vmx/nested.c                          |  50 ++-
>  arch/x86/kvm/vmx/vmx.c                             |  12 +-
>  arch/x86/kvm/vmx/vmx.h                             |   2 +-
>  arch/x86/kvm/x86.c                                 |  15 +-
>  arch/x86/kvm/x86.h                                 |   1 -
>  arch/x86/kvm/xen.c                                 | 341 +++++++++++++++++++--
>  arch/x86/kvm/xen.h                                 |   9 +
>  include/linux/kvm_dirty_ring.h                     |   6 -
>  include/linux/kvm_host.h                           | 110 +++++++
>  include/linux/kvm_types.h                          |  18 ++
>  include/uapi/linux/kvm.h                           |  11 +
>  .../testing/selftests/kvm/x86_64/xen_shinfo_test.c | 184 ++++++++++-
>  virt/kvm/Kconfig                                   |   3 +
>  virt/kvm/Makefile.kvm                              |   1 +
>  virt/kvm/dirty_ring.c                              |  11 +-
>  virt/kvm/kvm_main.c                                |  19 +-
>  virt/kvm/kvm_mm.h                                  |  44 +++
>  virt/kvm/mmu_lock.h                                |  23 --
>  virt/kvm/pfncache.c                                | 337 ++++++++++++++++++++
>  23 files changed, 1161 insertions(+), 86 deletions(-)
On Sat, 2021-12-11 at 02:09 +0100, Paolo Bonzini wrote:
> On 12/10/21 17:36, David Woodhouse wrote:
> > Introduce the basic concept of 2 level event channels for kernel delivery,
> > which is just a simple matter of a few test_and_set_bit calls on a mapped
> > shared info page.
> >
> > This can be used for routing MSI of passthrough devices to PIRQ event
> > channels in a Xen guest, and we can build on it for delivering IPIs and
> > timers directly from the kernel too.
>
> Queued patches 1-5, thanks.
>
> Paolo

Any word on when these will make it out to kvm.git? Did you find something else I need to fix?
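For readers following the thread, the "few test_and_set_bit calls on a mapped shared info page" can be sketched roughly as below. This is an illustrative toy model only: the structure names, sizes, and layout are simplified stand-ins, not the actual Xen ABI or the KVM code in the series.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define MAX_EVTCHNS   4096

/* Simplified stand-in for the Xen shared info page. */
struct shared_info {
	unsigned long evtchn_pending[MAX_EVTCHNS / BITS_PER_LONG];
	unsigned long evtchn_mask[MAX_EVTCHNS / BITS_PER_LONG];
};

/* Simplified stand-in for the per-vCPU info. */
struct vcpu_info {
	unsigned long evtchn_pending_sel;	/* second-level selector word */
	bool evtchn_upcall_pending;
};

/* Stand-in for the kernel's test_and_set_bit(): atomically set and report old value. */
static bool test_and_set_bit(unsigned int nr, unsigned long *addr)
{
	unsigned long mask = 1UL << (nr % BITS_PER_LONG);
	unsigned long old = __atomic_fetch_or(&addr[nr / BITS_PER_LONG],
					      mask, __ATOMIC_SEQ_CST);
	return old & mask;
}

static bool test_bit(unsigned int nr, const unsigned long *addr)
{
	return addr[nr / BITS_PER_LONG] & (1UL << (nr % BITS_PER_LONG));
}

/* Deliver a 2-level event channel; returns true if an upcall should be injected. */
static bool deliver_evtchn(struct shared_info *shinfo,
			   struct vcpu_info *vi, unsigned int port)
{
	if (test_and_set_bit(port, shinfo->evtchn_pending))
		return false;			/* already pending */
	if (test_bit(port, shinfo->evtchn_mask))
		return false;			/* masked: pending bit set, no upcall */
	/* Set the first-level selector bit covering this port's word. */
	test_and_set_bit(port / BITS_PER_LONG, &vi->evtchn_pending_sel);
	vi->evtchn_upcall_pending = true;
	return true;
}
```

The point of the sketch is that delivery really is just a handful of atomic bit operations against guest-visible memory, which is why the mapped-shinfo page (and its dirty tracking) is the interesting part of the series.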
On 12/22/21 16:18, David Woodhouse wrote:
> > > n a mapped
> > > shared info page.
> > >
> > > This can be used for routing MSI of passthrough devices to PIRQ event
> > > channels in a Xen guest, and we can build on it for delivering IPIs and
> > > timers directly from the kernel too.
> > Queued patches 1-5, thanks.
> >
> > Paolo
> Any word on when these will make it out to kvm.git? Did you find
> something else I need to fix?

I got a WARN when testing the queue branch, now I'm bisecting.

Paolo
On Thu, Dec 23, 2021, Paolo Bonzini wrote:
> On 12/22/21 16:18, David Woodhouse wrote:
> > > > n a mapped
> > > > shared info page.
> > > >
> > > > This can be used for routing MSI of passthrough devices to PIRQ event
> > > > channels in a Xen guest, and we can build on it for delivering IPIs and
> > > > timers directly from the kernel too.
> > > Queued patches 1-5, thanks.
> > >
> > > Paolo
> > Any word on when these will make it out to kvm.git? Did you find
> > something else I need to fix?
>
> I got a WARN when testing the queue branch, now I'm bisecting.

Was it perhaps this WARN?

  WARNING: CPU: 5 PID: 2180 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:3163 mark_page_dirty_in_slot+0x6b/0x80 [kvm]
  Modules linked in: kvm_intel kvm irqbypass
  CPU: 5 PID: 2180 Comm: hyperv_clock Not tainted 5.16.0-rc4+ #81
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:mark_page_dirty_in_slot+0x6b/0x80 [kvm]
   kvm_write_guest+0x117/0x120 [kvm]
   kvm_hv_invalidate_tsc_page+0x99/0xf0 [kvm]
   kvm_arch_vm_ioctl+0x20f/0xb60 [kvm]
   kvm_vm_ioctl+0x711/0xd50 [kvm]
   __x64_sys_ioctl+0x7f/0xb0
   do_syscall_64+0x38/0xc0
   entry_SYSCALL_64_after_hwframe+0x44/0xae

If so, it was clearly introduced by commit 03c0304a86bc ("KVM: Warn if mark_page_dirty()
is called without an active vCPU"). But that change is 100% correct, as it's trivial
to crash KVM without its protection. E.g. running hyperv_clock with these mods

diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 2dd78c1db4b6..ae0d0490580a 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -371,6 +371,8 @@ struct kvm_vm *vm_create_with_vcpus(enum vm_guest_mode mode, uint32_t nr_vcpus,
 	vm_create_irqchip(vm);
 #endif
 
+	vm_enable_dirty_ring(vm, 0x10000);
+
 	for (i = 0; i < nr_vcpus; ++i) {
 		uint32_t vcpuid = vcpuids ? vcpuids[i] : i;
 
diff --git a/tools/testing/selftests/kvm/x86_64/hyperv_clock.c b/tools/testing/selftests/kvm/x86_64/hyperv_clock.c
index e0b2bb1339b1..1032695e3901 100644
--- a/tools/testing/selftests/kvm/x86_64/hyperv_clock.c
+++ b/tools/testing/selftests/kvm/x86_64/hyperv_clock.c
@@ -212,6 +212,8 @@ int main(void)
 	vm = vm_create_default(VCPU_ID, 0, guest_main);
 	run = vcpu_state(vm, VCPU_ID);
 
+	vm_mem_region_set_flags(vm, 0, KVM_MEM_LOG_DIRTY_PAGES);
+
 	vcpu_set_hv_cpuid(vm, VCPU_ID);
 
 	tsc_page_gva = vm_vaddr_alloc_page(vm);

Triggers a NULL pointer deref.

  BUG: kernel NULL pointer dereference, address: 0000000000000000
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 12959a067 P4D 12959a067 PUD 129594067 PMD 0
  Oops: 0000 [#1] SMP
  CPU: 5 PID: 1784 Comm: hyperv_clock Not tainted 5.16.0-rc4+ #82
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:kvm_dirty_ring_get+0xe/0x30 [kvm]
   mark_page_dirty_in_slot.part.0+0x30/0x60 [kvm]
   kvm_write_guest+0x117/0x130 [kvm]
   kvm_hv_invalidate_tsc_page+0x99/0xf0 [kvm]
   kvm_arch_vm_ioctl+0x20f/0xb60 [kvm]
   kvm_vm_ioctl+0x711/0xd50 [kvm]
   __x64_sys_ioctl+0x7f/0xb0
   do_syscall_64+0x38/0xc0
   entry_SYSCALL_64_after_hwframe+0x44/0xae

Commit e880c6ea55b9 ("KVM: x86: hyper-v: Prevent using not-yet-updated TSC page
by secondary CPUs") is squarely to blame as it was added after dirty ring, though
in Vitaly's defense, David put it best: "That's a fairly awful bear trap".

Assuming there's nothing special about vcpu0's clock, I believe the below fixes
all issues. If not, then the correct fix is to have vcpu0 block all other vCPUs
until the update completes.
---
 arch/x86/include/asm/kvm_host.h |  5 +++--
 arch/x86/kvm/hyperv.c           | 39 ++++++++-------------------
 arch/x86/kvm/hyperv.h           |  2 +-
 arch/x86/kvm/x86.c              |  7 +++---
 4 files changed, 16 insertions(+), 37 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5d97f4adc1cb..f85df83e3083 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -971,10 +971,11 @@ enum hv_tsc_page_status {
 	HV_TSC_PAGE_HOST_CHANGED,
 	/* TSC page was properly set up and is currently active */
 	HV_TSC_PAGE_SET,
-	/* TSC page is currently being updated and therefore is inactive */
-	HV_TSC_PAGE_UPDATING,
 	/* TSC page was set up with an inaccessible GPA */
 	HV_TSC_PAGE_BROKEN,
+
+	/* TSC page needs to be updated to handle a guest or host change */
+	HV_TSC_PAGE_UPDATE_REQUIRED = BIT(31),
 };
 
 /* Hyper-V emulation context */
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 6e38a7d22e97..84b063bda4d8 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -1122,14 +1122,14 @@ void kvm_hv_setup_tsc_page(struct kvm *kvm,
 	BUILD_BUG_ON(sizeof(tsc_seq) != sizeof(hv->tsc_ref.tsc_sequence));
 	BUILD_BUG_ON(offsetof(struct ms_hyperv_tsc_page, tsc_sequence) != 0);
 
-	if (hv->hv_tsc_page_status == HV_TSC_PAGE_BROKEN ||
-	    hv->hv_tsc_page_status == HV_TSC_PAGE_UNSET)
-		return;
-
 	mutex_lock(&hv->hv_lock);
+
 	if (!(hv->hv_tsc_page & HV_X64_MSR_TSC_REFERENCE_ENABLE))
 		goto out_unlock;
 
+	if (!(hv->hv_tsc_page_status & HV_TSC_PAGE_UPDATE_REQUIRED))
+		goto out_unlock;
+
 	gfn = hv->hv_tsc_page >> HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT;
 	/*
 	 * Because the TSC parameters only vary when there is a
@@ -1188,45 +1188,24 @@ void kvm_hv_setup_tsc_page(struct kvm *kvm,
 	mutex_unlock(&hv->hv_lock);
 }
 
-void kvm_hv_invalidate_tsc_page(struct kvm *kvm)
+void kvm_hv_request_tsc_page_update(struct kvm *kvm)
 {
 	struct kvm_hv *hv = to_kvm_hv(kvm);
-	u64 gfn;
-	int idx;
+
+	mutex_lock(&hv->hv_lock);
 
 	if (hv->hv_tsc_page_status == HV_TSC_PAGE_BROKEN ||
 	    hv->hv_tsc_page_status == HV_TSC_PAGE_UNSET ||
 	    tsc_page_update_unsafe(hv))
-		return;
-
-	mutex_lock(&hv->hv_lock);
-
-	if (!(hv->hv_tsc_page & HV_X64_MSR_TSC_REFERENCE_ENABLE))
 		goto out_unlock;
 
-	/* Preserve HV_TSC_PAGE_GUEST_CHANGED/HV_TSC_PAGE_HOST_CHANGED states */
-	if (hv->hv_tsc_page_status == HV_TSC_PAGE_SET)
-		hv->hv_tsc_page_status = HV_TSC_PAGE_UPDATING;
-
-	gfn = hv->hv_tsc_page >> HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT;
-
-	hv->tsc_ref.tsc_sequence = 0;
-
-	/*
-	 * Take the srcu lock as memslots will be accessed to check the gfn
-	 * cache generation against the memslots generation.
-	 */
-	idx = srcu_read_lock(&kvm->srcu);
-	if (kvm_write_guest(kvm, gfn_to_gpa(gfn),
-			    &hv->tsc_ref, sizeof(hv->tsc_ref.tsc_sequence)))
-		hv->hv_tsc_page_status = HV_TSC_PAGE_BROKEN;
-	srcu_read_unlock(&kvm->srcu, idx);
+	if (hv->hv_tsc_page & HV_X64_MSR_TSC_REFERENCE_ENABLE)
+		hv->hv_tsc_page_status |= HV_TSC_PAGE_UPDATE_REQUIRED;
 
 out_unlock:
 	mutex_unlock(&hv->hv_lock);
 }
 
-
 static bool hv_check_msr_access(struct kvm_vcpu_hv *hv_vcpu, u32 msr)
 {
 	if (!hv_vcpu->enforce_cpuid)
diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h
index ed1c4e546d04..3e79b4a9ed4e 100644
--- a/arch/x86/kvm/hyperv.h
+++ b/arch/x86/kvm/hyperv.h
@@ -133,7 +133,7 @@ void kvm_hv_process_stimers(struct kvm_vcpu *vcpu);
 
 void kvm_hv_setup_tsc_page(struct kvm *kvm,
 			   struct pvclock_vcpu_time_info *hv_clock);
-void kvm_hv_invalidate_tsc_page(struct kvm *kvm);
+void kvm_hv_request_tsc_page_update(struct kvm *kvm);
 
 void kvm_hv_init_vm(struct kvm *kvm);
 void kvm_hv_destroy_vm(struct kvm *kvm);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c194a8cbd25f..c1adc9efea28 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2848,7 +2848,7 @@ static void kvm_end_pvclock_update(struct kvm *kvm)
 
 static void kvm_update_masterclock(struct kvm *kvm)
 {
-	kvm_hv_invalidate_tsc_page(kvm);
+	kvm_hv_request_tsc_page_update(kvm);
 	kvm_start_pvclock_update(kvm);
 	pvclock_update_vm_gtod_copy(kvm);
 	kvm_end_pvclock_update(kvm);
@@ -3060,8 +3060,7 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 					offsetof(struct compat_vcpu_info, time));
 	if (vcpu->xen.vcpu_time_info_set)
 		kvm_setup_pvclock_page(v, &vcpu->xen.vcpu_time_info_cache, 0);
-	if (!v->vcpu_idx)
-		kvm_hv_setup_tsc_page(v->kvm, &vcpu->hv_clock);
+	kvm_hv_setup_tsc_page(v->kvm, &vcpu->hv_clock);
 	return 0;
 }
 
@@ -6039,7 +6038,7 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
 	if (data.flags & ~KVM_CLOCK_VALID_FLAGS)
 		return -EINVAL;
 
-	kvm_hv_invalidate_tsc_page(kvm);
+	kvm_hv_request_tsc_page_update(kvm);
 	kvm_start_pvclock_update(kvm);
 	pvclock_update_vm_gtod_copy(kvm);

base-commit: 9f73307be079749233b47e623a67201975c575bc
--
On 12/23/21 22:24, Sean Christopherson wrote:
> > > Any word on when these will make it out to kvm.git? Did you find
> > > something else I need to fix?
> > I got a WARN when testing the queue branch, now I'm bisecting.
> Was it perhaps this WARN?

No, it was PEBKAC. I was trying to be clever and only rebuilding the module instead of the whole kernel, but kvm_handle_signal_exit is inlined in kernel/entry/kvm.c and thus relies on the offsets that were changed by removing pre_pcpu and blocked_vcpu_list.

So the hashes in kvm/queue are still not officially stable, but the pull request for the merge window should be exactly that branch plus the AMX stuff. The SEV selftests can still make it into 5.17, but probably more towards the end of the merge window.

Paolo
On Sat, 25 Dec 2021 at 09:19, Sean Christopherson <seanjc@google.com> wrote:
> On Thu, Dec 23, 2021, Paolo Bonzini wrote:
> > On 12/22/21 16:18, David Woodhouse wrote:
> > > Any word on when these will make it out to kvm.git? Did you find
> > > something else I need to fix?
> >
> > I got a WARN when testing the queue branch, now I'm bisecting.
>
> Was it perhaps this WARN?
>
> WARNING: CPU: 5 PID: 2180 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:3163 mark_page_dirty_in_slot+0x6b/0x80 [kvm]
> [...]
>
> If so, it was clearly introduced by commit 03c0304a86bc ("KVM: Warn if mark_page_dirty()
> is called without an active vCPU"). But that change is 100% correct, as it's trivial
> to crash KVM without its protection. E.g. running hyperv_clock with these mods
> [...]
> Triggers a NULL pointer deref.
> [...]

I saw this warning by running vanilla hyperv_clock, without the modifications, against kvm/queue.

> Commit e880c6ea55b9 ("KVM: x86: hyper-v: Prevent using not-yet-updated TSC page
> by secondary CPUs") is squarely to blame as it was added after dirty ring, though
> in Vitaly's defense, David put it best: "That's a fairly awful bear trap".
>
> Assuming there's nothing special about vcpu0's clock, I believe the below fixes
> all issues. If not, then the correct fix is to have vcpu0 block all other vCPUs
> until the update completes.

Could we invalidate hv->tsc_ref.tsc_sequence directly, without marking the page dirty, since after migration it is stale until the first successful kvm_hv_setup_tsc_page() call?

Wanpeng
On Thu, 2021-12-23 at 21:24 +0000, Sean Christopherson wrote:
> Commit e880c6ea55b9 ("KVM: x86: hyper-v: Prevent using not-yet-updated TSC page
> by secondary CPUs") is squarely to blame as it was added after dirty ring, though
> in Vitaly's defense, David put it best: "That's a fairly awful bear trap".

Even with the WARN to keep us honest, this is still awful.

We have kvm_vcpu_write_guest()... but the vcpu we pass it is ignored and only vcpu->kvm is used. But you have to be running on a physical CPU which currently owns *a* vCPU of that KVM, or you might crash.

There is also kvm_write_guest() which doesn't take a vCPU as an argument, and looks for all the world like it was designed for you not to need one... but which still needs there to be a vCPU or it might crash.

I think I want to kill the latter, make the former use the vCPU it's given, add a spinlock to the dirty ring which will be uncontended anyway in the common case so it shouldn't hurt (?), and then let people use kvm->vcpu[0] when they really need to, with a documented caveat that when there are *no* vCPUs in a KVM, dirty tracking might not work. Which is fine, as migration of a KVM that hasn't been fully set up yet is silly.
+Peter

On Wed, Jan 05, 2022, David Woodhouse wrote:
> On Thu, 2021-12-23 at 21:24 +0000, Sean Christopherson wrote:
> > Commit e880c6ea55b9 ("KVM: x86: hyper-v: Prevent using not-yet-updated TSC page
> > by secondary CPUs") is squarely to blame as it was added after dirty ring, though
> > in Vitaly's defense, David put it best: "That's a fairly awful bear trap".
>
> Even with the WARN to keep us honest, this is still awful.
>
> We have kvm_vcpu_write_guest()... but the vcpu we pass it is ignored
> and only vcpu->kvm is used. But you have to be running on a physical
> CPU which currently owns *a* vCPU of that KVM, or you might crash.
>
> There is also kvm_write_guest() which doesn't take a vCPU as an
> argument, and looks for all the world like it was designed for you not
> to need one... but which still needs there to be a vCPU or it might
> crash.
>
> I think I want to kill the latter, make the former use the vCPU it's
> given, add a spinlock to the dirty ring which will be uncontended
> anyway in the common case so it shouldn't hurt (?),

IIRC, Peter did a fair amount of performance testing and analysis that led to the current behavior.

> and then let people use kvm->vcpu[0] when they really need to, with a
> documented caveat that when there are *no* vCPUs in a KVM, dirty tracking
> might not work. Which is fine, as migration of a KVM that hasn't been fully
> set up yet is silly.

"when they really need to" can be a slippery slope, and using vcpu[0] is also quite gross. Though I 100% agree that always using kvm_get_running_vcpu() is awful.
On Sat, Jan 08, 2022 at 12:26:32AM +0000, Sean Christopherson wrote:
> On Wed, Jan 05, 2022, David Woodhouse wrote:
> > We have kvm_vcpu_write_guest()... but the vcpu we pass it is ignored
> > and only vcpu->kvm is used. But you have to be running on a physical
> > CPU which currently owns *a* vCPU of that KVM, or you might crash.
> >
> > There is also kvm_write_guest() which doesn't take a vCPU as an
> > argument, and looks for all the world like it was designed for you not
> > to need one... but which still needs there to be a vCPU or it might
> > crash.

Right, I agree those are slightly confusing.

> > I think I want to kill the latter, make the former use the vCPU it's
> > given, add a spinlock to the dirty ring which will be uncontended
> > anyway in the common case so it shouldn't hurt (?),
>
> IIRC, Peter did a fair amount of performance testing and analysis that led to
> the current behavior.

Yes, that came out of the discussion when working on the dirty ring. I can't remember all the details either, but IIRC what we figured is that 99% of cases where the guest dirties pages are within a vCPU context.

The only outlier at that time (at least for x86) was kvm-gt, which uses a direct kvm context, but it turned out that was an illegal use of the KVM API and kvm-gt fixed itself instead.

Then we came to the current kvm_get_running_vcpu() approach because all the dirty-tracking contexts are within an existing vCPU context. The dirty ring could be simplified too, because we don't need the default vcpu0 trick, nor do we need an extra ring just to record no-vcpu contexts (in very early versions of the dirty ring we did have N+1 rings, where N is the number of vCPUs and the extra one is for this).

kvm_get_running_vcpu() also has a slight benefit in that it guarantees no spinlock is needed.

> > and then let people use kvm->vcpu[0] when they really need to, with a
> > documented caveat that when there are *no* vCPUs in a KVM, dirty tracking
> > might not work. Which is fine, as migration of a KVM that hasn't been fully
> > set up yet is silly.
>
> "when they really need to" can be a slippery slope, using vcpu[0] is also quite
> gross. Though I 100% agree that always using kvm_get_running_vcpu() is awful.

I agree none of them looks like an extremely clean approach.

So far I'd hope we can fix the problem with what Sean suggested, postponing the page update until we have a vCPU context, so we keep the assumption "having a vCPU context" true at least for x86; that'll be the simplest, AFAIU.

Or do we have other explicit requirements that need to dirty guest pages without a vCPU context at all?

Thanks,
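Peter's point about the per-vCPU rings needing no producer-side lock can be illustrated with a toy model. The names, sizes, and fields below are made up for illustration and are not the actual KVM dirty ring structures; the key property is that each vCPU only ever pushes onto its *own* ring, so the producer index has a single writer.

```c
#include <assert.h>
#include <stdint.h>

#define DIRTY_RING_SIZE 16	/* toy size; must be a power of two */

/* One dirtied guest page: which memslot, and the page offset within it. */
struct dirty_gfn {
	uint32_t slot;
	uint64_t offset;
};

struct dirty_ring {
	uint32_t dirty_index;	/* producer: advanced only by the owning vCPU */
	uint32_t reset_index;	/* consumer: advanced when userspace harvests */
	struct dirty_gfn gfns[DIRTY_RING_SIZE];
};

/*
 * Push a dirtied gfn onto the ring.  Returns 0 on success, -1 if the
 * ring is full (in real KVM the vCPU would then exit to userspace so
 * the ring can be harvested).  No lock: single producer per ring.
 */
static int dirty_ring_push(struct dirty_ring *ring, uint32_t slot, uint64_t offset)
{
	if (ring->dirty_index - ring->reset_index == DIRTY_RING_SIZE)
		return -1;	/* full */

	struct dirty_gfn *e = &ring->gfns[ring->dirty_index % DIRTY_RING_SIZE];
	e->slot = slot;
	e->offset = offset;
	ring->dirty_index++;	/* plain increment suffices: one writer */
	return 0;
}
```

This is why marking a page dirty from a no-vCPU context is the problem case in the thread: there is no "own ring" to push to, and the single-producer assumption would be violated.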
On Wed, 2022-01-12 at 11:10 +0800, Peter Xu wrote:
> So far I'd hope we can fix the problem with what Sean suggested on postponing
> the page update until when we have the vcpu context, so we keep the assumption
> still true on "having a vcpu context" at least for x86 and that'll be the
> simplest, afaiu.
>
> Or do we have explicit other requirement that needs to dirty guest pages
> without vcpu context at all?

Delivering interrupts may want to do so. That's the one we hit for S390, and I only avoided it for Xen event channel delivery on x86 by declaring that the Xen shared info page is exempt from dirty tracking and should *always* be considered dirty.
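The "always considered dirty" exemption David describes could look roughly like the sketch below when harvesting a dirty-log word: OR in the bit for the shared info gfn regardless of what was tracked. This is a hypothetical illustration with made-up helper names, not the actual KVM implementation.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch: given one 64-gfn word of a tracked dirty bitmap, force the
 * bit for the Xen shared info gfn to be reported dirty if it falls in
 * this word.  'shinfo_gfn' stands in for the guest frame hosting the
 * shared info page.
 */
static uint64_t get_dirty_bitmap_word(uint64_t tracked_word,
				      uint64_t gfn_base,
				      uint64_t shinfo_gfn)
{
	if (shinfo_gfn >= gfn_base && shinfo_gfn < gfn_base + 64)
		tracked_word |= 1ULL << (shinfo_gfn - gfn_base);
	return tracked_word;
}
```

The design trade-off: the shinfo page is written so frequently (event channel bits, vcpu_info) that tracking it precisely buys nothing, so always migrating it is cheaper than marking it dirty from contexts that have no vCPU.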
On 1/19/22 09:14, David Woodhouse wrote:
> > Or do we have explicit other requirement that needs to dirty guest pages
> > without vcpu context at all?
>
> Delivering interrupts may want to do so. That's the one we hit for
> S390, and I only avoided it for Xen event channel delivery on x86 by
> declaring that the Xen shared info page is exempt from dirty tracking
> and should *always* be considered dirty.

We also have one that I just found out about in kvm_hv_invalidate_tsc_page, called from KVM_SET_CLOCK. :/

So either we have another special case to document for the dirty ring buffer (and retroactively so, even), or we're in bad need of a solution.

Paolo
On Wed, 2022-01-19 at 18:36 +0100, Paolo Bonzini wrote:
> On 1/19/22 09:14, David Woodhouse wrote:
> > Delivering interrupts may want to do so. That's the one we hit for
> > S390, and I only avoided it for Xen event channel delivery on x86 by
> > declaring that the Xen shared info page is exempt from dirty tracking
> > and should *always* be considered dirty.
>
> We also have one that I just found out about in
> kvm_hv_invalidate_tsc_page, called from KVM_SET_CLOCK. :/
>
> So either we have another special case to document for the dirty ring
> buffer (and retroactively so, even), or we're in bad need of a solution.

Seems like adding that warning is having precisely the desired effect :)
On Wed, Jan 19, 2022, David Woodhouse wrote:
> On Wed, 2022-01-19 at 18:36 +0100, Paolo Bonzini wrote:
> > We also have one that I just found out about in
> > kvm_hv_invalidate_tsc_page, called from KVM_SET_CLOCK. :/

I think we can fix that usage though:

https://lore.kernel.org/all/YcTpJ369cRBN4W93@google.com

> > So either we have another special case to document for the dirty ring
> > buffer (and retroactively so, even), or we're in bad need of a solution.
>
> Seems like adding that warning is having precisely the desired effect :)

The WARN is certainly useful. Part of me actually likes the restriction of needing to have a valid vCPU, at least for x86, as there really aren't many legitimate cases where KVM should be marking memory dirty without a vCPU.
On 1/19/22 18:44, Sean Christopherson wrote:
> I think we can fix that usage though:
>
> https://lore.kernel.org/all/YcTpJ369cRBN4W93@google.com

Ok, will review next week.

> > So either we have another special case to document for the dirty ring
> > buffer (and retroactively so, even), or we're in bad need of a solution.
> Seems like adding that warning is having precisely the desired effect :)
>
> The WARN is certainly useful. Part of me actually likes the restriction of needing
> to have a valid vCPU, at least for x86, as there really aren't many legitimate cases
> where KVM should be marking memory dirty without a vCPU.

Yes, I agree that every case of it has needed, at the very least, further scrutiny.

Paolo