[v6,0/6] x86/xen: Add in-kernel Xen event channel delivery

Message ID 20211210163625.2886-1-dwmw2@infradead.org (mailing list archive)

David Woodhouse Dec. 10, 2021, 4:36 p.m. UTC
Introduce the basic concept of 2 level event channels for kernel delivery,
which is just a simple matter of a few test_and_set_bit calls on a mapped
shared info page.

This can be used for routing MSI of passthrough devices to PIRQ event
channels in a Xen guest, and we can build on it for delivering IPIs and
timers directly from the kernel too.
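As a rough illustration of why 2-level delivery reduces to a few test_and_set_bit calls, here is a self-contained model in portable C. The layout and names below are simplified stand-ins, not the kernel's actual structures or the real Xen shared_info ABI:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the Xen shared_info bitmaps (not the real layout). */
#define BITS_PER_WORD 64
static uint64_t evtchn_pending[4];   /* per-guest pending bitmap        */
static uint64_t evtchn_mask[4];      /* per-guest mask bitmap           */
static uint64_t vcpu_pending_sel;    /* per-vCPU second-level selector  */
static bool     vcpu_upcall_pending;

/* Portable, single-threaded model of test_and_set_bit(). */
static bool test_and_set_bit(unsigned int nr, uint64_t *addr)
{
    uint64_t old = addr[nr / BITS_PER_WORD];

    addr[nr / BITS_PER_WORD] |= 1ULL << (nr % BITS_PER_WORD);
    return old & (1ULL << (nr % BITS_PER_WORD));
}

/*
 * Deliver a 2-level event channel: set the port's pending bit, and if the
 * port is not masked, set the selector bit and latch the vCPU upcall.
 */
static void deliver_evtchn(unsigned int port)
{
    if (test_and_set_bit(port, evtchn_pending))
        return;        /* already pending, nothing more to do */

    if (!(evtchn_mask[port / BITS_PER_WORD] & (1ULL << (port % BITS_PER_WORD)))) {
        test_and_set_bit(port / BITS_PER_WORD, &vcpu_pending_sel);
        vcpu_upcall_pending = true;
    }
}
```

In-kernel delivery amounts to doing these bit operations on a kernel mapping of the guest's shared info page, which is why a safe, invalidation-aware mapping is the bulk of the series.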

v1: Use kvm_map_gfn() although I didn't quite see how it works.

v2: Avoid kvm_map_gfn() and implement a safe mapping with invalidation
    support for myself.

v3: Reinvent gfn_to_pfn_cache with sane invalidation semantics, for my
    use case as well as nesting.

v4: Rework dirty handling, as it became apparent that we need an active
    vCPU context to mark pages dirty, so it can't be done from the MMU
    notifier during the invalidation; it has to happen on unmap.

v5: Fix sparse warnings reported by kernel test robot <lkp@intel.com>.

    Fix revalidation when memslots change but the resulting HVA stays
    the same. We can use the same kernel mapping in that case, if the
    HVA → PFN translation was valid before. So that probably means we
    shouldn't unmap the "old_hva". Augment the test case to exercise
    that one too.

    Include the fix for the dirty ring vs. Xen shinfo oops reported
    by butt3rflyh4ck <butterflyhuangxx@gmail.com>.

v6: Address Paolo's review feedback; rebase onto kvm/next, dropping the
    patches which are already merged.

Again, the *last* patch in the series (this time #6) is for illustration
and is not intended to be merged as-is.
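The v5 revalidation rule above (reuse the existing kernel mapping when a memslot change leaves the HVA unchanged) can be sketched as follows. This is a simplified model with hypothetical names, not the actual pfncache.c logic:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified cache state (the real struct lives elsewhere). */
struct gpc_model {
    uint64_t gpa;
    uint64_t hva;          /* userspace address the gpa maps to        */
    void    *khva;         /* kernel mapping of the backing page       */
    bool     valid;        /* translation currently valid?             */
    unsigned memslot_gen;  /* memslots generation the hva came from    */
};

/* Stand-in for a GPA-to-HVA lookup against the current memslots. */
typedef uint64_t (*gpa_to_hva_fn)(uint64_t gpa);

/*
 * Refresh after a memslot change: if the new HVA matches the old one and
 * the old translation was valid, the existing kernel mapping can be kept;
 * only when the HVA differs must the old mapping be torn down and remade.
 * Returns true if the caller needs to remap.
 */
static bool gpc_refresh_needs_remap(struct gpc_model *gpc, unsigned cur_gen,
                                    gpa_to_hva_fn lookup)
{
    uint64_t new_hva = lookup(gpc->gpa);

    if (gpc->valid && new_hva == gpc->hva) {
        gpc->memslot_gen = cur_gen;  /* same HVA: keep mapping, bump gen */
        return false;
    }

    gpc->hva = new_hva;
    gpc->memslot_gen = cur_gen;
    gpc->valid = true;               /* caller performs the actual remap */
    return true;
}
```

The "don't unmap the old_hva" point in v5 corresponds to the early-return path here: the kernel mapping stays live across the memslot generation bump.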

David Woodhouse (6):
      KVM: Warn if mark_page_dirty() is called without an active vCPU
      KVM: Reinstate gfn_to_pfn_cache with invalidation support
      KVM: x86/xen: Maintain valid mapping of Xen shared_info page
      KVM: x86/xen: Add KVM_IRQ_ROUTING_XEN_EVTCHN and event channel delivery
      KVM: x86: Fix wall clock writes in Xen shared_info not to mark page dirty
      KVM: x86: First attempt at converting nested virtual APIC page to gpc

 Documentation/virt/kvm/api.rst                     |  33 ++
 arch/x86/include/asm/kvm_host.h                    |   4 +-
 arch/x86/kvm/Kconfig                               |   1 +
 arch/x86/kvm/irq_comm.c                            |  12 +
 arch/x86/kvm/vmx/nested.c                          |  50 ++-
 arch/x86/kvm/vmx/vmx.c                             |  12 +-
 arch/x86/kvm/vmx/vmx.h                             |   2 +-
 arch/x86/kvm/x86.c                                 |  15 +-
 arch/x86/kvm/x86.h                                 |   1 -
 arch/x86/kvm/xen.c                                 | 341 +++++++++++++++++++--
 arch/x86/kvm/xen.h                                 |   9 +
 include/linux/kvm_dirty_ring.h                     |   6 -
 include/linux/kvm_host.h                           | 110 +++++++
 include/linux/kvm_types.h                          |  18 ++
 include/uapi/linux/kvm.h                           |  11 +
 .../testing/selftests/kvm/x86_64/xen_shinfo_test.c | 184 ++++++++++-
 virt/kvm/Kconfig                                   |   3 +
 virt/kvm/Makefile.kvm                              |   1 +
 virt/kvm/dirty_ring.c                              |  11 +-
 virt/kvm/kvm_main.c                                |  19 +-
 virt/kvm/kvm_mm.h                                  |  44 +++
 virt/kvm/mmu_lock.h                                |  23 --
 virt/kvm/pfncache.c                                | 337 ++++++++++++++++++++
 23 files changed, 1161 insertions(+), 86 deletions(-)

Comments

Paolo Bonzini Dec. 11, 2021, 1:09 a.m. UTC | #1
On 12/10/21 17:36, David Woodhouse wrote:
> Introduce the basic concept of 2 level event channels for kernel delivery,
> which is just a simple matter of a few test_and_set_bit calls on a mapped
> shared info page.
> 
> This can be used for routing MSI of passthrough devices to PIRQ event
> channels in a Xen guest, and we can build on it for delivering IPIs and
> timers directly from the kernel too.

Queued patches 1-5, thanks.

Paolo

David Woodhouse Dec. 22, 2021, 3:18 p.m. UTC | #2
On Sat, 2021-12-11 at 02:09 +0100, Paolo Bonzini wrote:
> On 12/10/21 17:36, David Woodhouse wrote:
> > Introduce the basic concept of 2 level event channels for kernel delivery,
> > which is just a simple matter of a few test_and_set_bit calls on a mapped
> > shared info page.
> > 
> > This can be used for routing MSI of passthrough devices to PIRQ event
> > channels in a Xen guest, and we can build on it for delivering IPIs and
> > timers directly from the kernel too.
> 
> Queued patches 1-5, thanks.
> 
> Paolo

Any word on when these will make it out to kvm.git? Did you find
something else I need to fix?
Paolo Bonzini Dec. 23, 2021, 12:26 a.m. UTC | #3
On 12/22/21 16:18, David Woodhouse wrote:
>>> n a mapped
>>> shared info page.
>>>
>>> This can be used for routing MSI of passthrough devices to PIRQ event
>>> channels in a Xen guest, and we can build on it for delivering IPIs and
>>> timers directly from the kernel too.
>> Queued patches 1-5, thanks.
>>
>> Paolo
> Any word on when these will make it out to kvm.git? Did you find
> something else I need to fix?

I got a WARN when testing the queue branch; now I'm bisecting.

Paolo
Sean Christopherson Dec. 23, 2021, 9:24 p.m. UTC | #4
On Thu, Dec 23, 2021, Paolo Bonzini wrote:
> On 12/22/21 16:18, David Woodhouse wrote:
> > > > n a mapped
> > > > shared info page.
> > > > 
> > > > This can be used for routing MSI of passthrough devices to PIRQ event
> > > > channels in a Xen guest, and we can build on it for delivering IPIs and
> > > > timers directly from the kernel too.
> > > Queued patches 1-5, thanks.
> > > 
> > > Paolo
> > Any word on when these will make it out to kvm.git? Did you find
> > something else I need to fix?
> 
> I got a WARN when testing the queue branch, now I'm bisecting.

Was it perhaps this WARN?

  WARNING: CPU: 5 PID: 2180 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:3163 mark_page_dirty_in_slot+0x6b/0x80 [kvm]
  Modules linked in: kvm_intel kvm irqbypass
  CPU: 5 PID: 2180 Comm: hyperv_clock Not tainted 5.16.0-rc4+ #81
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:mark_page_dirty_in_slot+0x6b/0x80 [kvm]
   kvm_write_guest+0x117/0x120 [kvm]
   kvm_hv_invalidate_tsc_page+0x99/0xf0 [kvm]
   kvm_arch_vm_ioctl+0x20f/0xb60 [kvm]
   kvm_vm_ioctl+0x711/0xd50 [kvm]
   __x64_sys_ioctl+0x7f/0xb0
   do_syscall_64+0x38/0xc0
   entry_SYSCALL_64_after_hwframe+0x44/0xae

If so, it was clearly introduced by commit 03c0304a86bc ("KVM: Warn if mark_page_dirty()
is called without an active vCPU").  But that change is 100% correct as it's trivial
to crash KVM without its protection.  E.g. running hyperv_clock with these mods

diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 2dd78c1db4b6..ae0d0490580a 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -371,6 +371,8 @@ struct kvm_vm *vm_create_with_vcpus(enum vm_guest_mode mode, uint32_t nr_vcpus,
        vm_create_irqchip(vm);
 #endif

+       vm_enable_dirty_ring(vm, 0x10000);
+
        for (i = 0; i < nr_vcpus; ++i) {
                uint32_t vcpuid = vcpuids ? vcpuids[i] : i;

diff --git a/tools/testing/selftests/kvm/x86_64/hyperv_clock.c b/tools/testing/selftests/kvm/x86_64/hyperv_clock.c
index e0b2bb1339b1..1032695e3901 100644
--- a/tools/testing/selftests/kvm/x86_64/hyperv_clock.c
+++ b/tools/testing/selftests/kvm/x86_64/hyperv_clock.c
@@ -212,6 +212,8 @@ int main(void)
        vm = vm_create_default(VCPU_ID, 0, guest_main);
        run = vcpu_state(vm, VCPU_ID);

+       vm_mem_region_set_flags(vm, 0, KVM_MEM_LOG_DIRTY_PAGES);
+
        vcpu_set_hv_cpuid(vm, VCPU_ID);

        tsc_page_gva = vm_vaddr_alloc_page(vm);

Triggers a NULL pointer deref.

  BUG: kernel NULL pointer dereference, address: 0000000000000000
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 12959a067 P4D 12959a067 PUD 129594067 PMD 0
  Oops: 0000 [#1] SMP
  CPU: 5 PID: 1784 Comm: hyperv_clock Not tainted 5.16.0-rc4+ #82
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:kvm_dirty_ring_get+0xe/0x30 [kvm]
   mark_page_dirty_in_slot.part.0+0x30/0x60 [kvm]
   kvm_write_guest+0x117/0x130 [kvm]
   kvm_hv_invalidate_tsc_page+0x99/0xf0 [kvm]
   kvm_arch_vm_ioctl+0x20f/0xb60 [kvm]
   kvm_vm_ioctl+0x711/0xd50 [kvm]
   __x64_sys_ioctl+0x7f/0xb0
   do_syscall_64+0x38/0xc0
   entry_SYSCALL_64_after_hwframe+0x44/0xae

Commit e880c6ea55b9 ("KVM: x86: hyper-v: Prevent using not-yet-updated TSC page
by secondary CPUs") is squarely to blame as it was added after dirty ring, though
in Vitaly's defense, David put it best: "That's a fairly awful bear trap".

Assuming there's nothing special about vcpu0's clock, I believe the below fixes
all issues.  If not, then the correct fix is to have vcpu0 block all other vCPUs
until the update completes.

---
 arch/x86/include/asm/kvm_host.h |  5 +++--
 arch/x86/kvm/hyperv.c           | 39 ++++++++-------------------------
 arch/x86/kvm/hyperv.h           |  2 +-
 arch/x86/kvm/x86.c              |  7 +++---
 4 files changed, 16 insertions(+), 37 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5d97f4adc1cb..f85df83e3083 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -971,10 +971,11 @@ enum hv_tsc_page_status {
 	HV_TSC_PAGE_HOST_CHANGED,
 	/* TSC page was properly set up and is currently active  */
 	HV_TSC_PAGE_SET,
-	/* TSC page is currently being updated and therefore is inactive */
-	HV_TSC_PAGE_UPDATING,
 	/* TSC page was set up with an inaccessible GPA */
 	HV_TSC_PAGE_BROKEN,
+
+	/* TSC page needs to be updated to handle a guest or host change */
+	HV_TSC_PAGE_UPDATE_REQUIRED = BIT(31),
 };

 /* Hyper-V emulation context */
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 6e38a7d22e97..84b063bda4d8 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -1122,14 +1122,14 @@ void kvm_hv_setup_tsc_page(struct kvm *kvm,
 	BUILD_BUG_ON(sizeof(tsc_seq) != sizeof(hv->tsc_ref.tsc_sequence));
 	BUILD_BUG_ON(offsetof(struct ms_hyperv_tsc_page, tsc_sequence) != 0);

-	if (hv->hv_tsc_page_status == HV_TSC_PAGE_BROKEN ||
-	    hv->hv_tsc_page_status == HV_TSC_PAGE_UNSET)
-		return;
-
 	mutex_lock(&hv->hv_lock);
+
 	if (!(hv->hv_tsc_page & HV_X64_MSR_TSC_REFERENCE_ENABLE))
 		goto out_unlock;

+	if (!(hv->hv_tsc_page_status & HV_TSC_PAGE_UPDATE_REQUIRED))
+		goto out_unlock;
+
 	gfn = hv->hv_tsc_page >> HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT;
 	/*
 	 * Because the TSC parameters only vary when there is a
@@ -1188,45 +1188,24 @@ void kvm_hv_setup_tsc_page(struct kvm *kvm,
 	mutex_unlock(&hv->hv_lock);
 }

-void kvm_hv_invalidate_tsc_page(struct kvm *kvm)
+void kvm_hv_request_tsc_page_update(struct kvm *kvm)
 {
 	struct kvm_hv *hv = to_kvm_hv(kvm);
-	u64 gfn;
-	int idx;
+
+	mutex_lock(&hv->hv_lock);

 	if (hv->hv_tsc_page_status == HV_TSC_PAGE_BROKEN ||
 	    hv->hv_tsc_page_status == HV_TSC_PAGE_UNSET ||
 	    tsc_page_update_unsafe(hv))
-		return;
-
-	mutex_lock(&hv->hv_lock);
-
-	if (!(hv->hv_tsc_page & HV_X64_MSR_TSC_REFERENCE_ENABLE))
 		goto out_unlock;

-	/* Preserve HV_TSC_PAGE_GUEST_CHANGED/HV_TSC_PAGE_HOST_CHANGED states */
-	if (hv->hv_tsc_page_status == HV_TSC_PAGE_SET)
-		hv->hv_tsc_page_status = HV_TSC_PAGE_UPDATING;
-
-	gfn = hv->hv_tsc_page >> HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT;
-
-	hv->tsc_ref.tsc_sequence = 0;
-
-	/*
-	 * Take the srcu lock as memslots will be accessed to check the gfn
-	 * cache generation against the memslots generation.
-	 */
-	idx = srcu_read_lock(&kvm->srcu);
-	if (kvm_write_guest(kvm, gfn_to_gpa(gfn),
-			    &hv->tsc_ref, sizeof(hv->tsc_ref.tsc_sequence)))
-		hv->hv_tsc_page_status = HV_TSC_PAGE_BROKEN;
-	srcu_read_unlock(&kvm->srcu, idx);
+	if (hv->hv_tsc_page & HV_X64_MSR_TSC_REFERENCE_ENABLE)
+		hv->hv_tsc_page_status |= HV_TSC_PAGE_UPDATE_REQUIRED;

 out_unlock:
 	mutex_unlock(&hv->hv_lock);
 }

-
 static bool hv_check_msr_access(struct kvm_vcpu_hv *hv_vcpu, u32 msr)
 {
 	if (!hv_vcpu->enforce_cpuid)
diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h
index ed1c4e546d04..3e79b4a9ed4e 100644
--- a/arch/x86/kvm/hyperv.h
+++ b/arch/x86/kvm/hyperv.h
@@ -133,7 +133,7 @@ void kvm_hv_process_stimers(struct kvm_vcpu *vcpu);

 void kvm_hv_setup_tsc_page(struct kvm *kvm,
 			   struct pvclock_vcpu_time_info *hv_clock);
-void kvm_hv_invalidate_tsc_page(struct kvm *kvm);
+void kvm_hv_request_tsc_page_update(struct kvm *kvm);

 void kvm_hv_init_vm(struct kvm *kvm);
 void kvm_hv_destroy_vm(struct kvm *kvm);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c194a8cbd25f..c1adc9efea28 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2848,7 +2848,7 @@ static void kvm_end_pvclock_update(struct kvm *kvm)

 static void kvm_update_masterclock(struct kvm *kvm)
 {
-	kvm_hv_invalidate_tsc_page(kvm);
+	kvm_hv_request_tsc_page_update(kvm);
 	kvm_start_pvclock_update(kvm);
 	pvclock_update_vm_gtod_copy(kvm);
 	kvm_end_pvclock_update(kvm);
@@ -3060,8 +3060,7 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 				       offsetof(struct compat_vcpu_info, time));
 	if (vcpu->xen.vcpu_time_info_set)
 		kvm_setup_pvclock_page(v, &vcpu->xen.vcpu_time_info_cache, 0);
-	if (!v->vcpu_idx)
-		kvm_hv_setup_tsc_page(v->kvm, &vcpu->hv_clock);
+	kvm_hv_setup_tsc_page(v->kvm, &vcpu->hv_clock);
 	return 0;
 }

@@ -6039,7 +6038,7 @@ static int kvm_vm_ioctl_set_clock(struct kvm *kvm, void __user *argp)
 	if (data.flags & ~KVM_CLOCK_VALID_FLAGS)
 		return -EINVAL;

-	kvm_hv_invalidate_tsc_page(kvm);
+	kvm_hv_request_tsc_page_update(kvm);
 	kvm_start_pvclock_update(kvm);
 	pvclock_update_vm_gtod_copy(kvm);


base-commit: 9f73307be079749233b47e623a67201975c575bc
--
Paolo Bonzini Dec. 23, 2021, 10:21 p.m. UTC | #5
On 12/23/21 22:24, Sean Christopherson wrote:
>>> Any word on when these will make it out to kvm.git? Did you find
>>> something else I need to fix?
>> I got a WARN when testing the queue branch, now I'm bisecting.
> Was it perhaps this WARN?

No, it was PEBKAC.  I was trying to be clever and only rebuilding the 
module instead of the whole kernel, but kvm_handle_signal_exit is 
inlined in kernel/entry/kvm.c and thus relies on the offsets that were 
changed by removing pre_pcpu and blocked_vcpu_list.

So now the hashes in kvm/queue are still not officially stable, but the
pull request for the merge window should be exactly that branch + the
AMX stuff.  The SEV selftests can still make it into 5.17, but probably
more towards the end of the merge window.

Paolo
Wanpeng Li Jan. 4, 2022, 7:48 a.m. UTC | #6
On Sat, 25 Dec 2021 at 09:19, Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Dec 23, 2021, Paolo Bonzini wrote:
> > On 12/22/21 16:18, David Woodhouse wrote:
> > > > > n a mapped
> > > > > shared info page.
> > > > >
> > > > > This can be used for routing MSI of passthrough devices to PIRQ event
> > > > > channels in a Xen guest, and we can build on it for delivering IPIs and
> > > > > timers directly from the kernel too.
> > > > Queued patches 1-5, thanks.
> > > >
> > > > Paolo
> > > Any word on when these will make it out to kvm.git? Did you find
> > > something else I need to fix?
> >
> > I got a WARN when testing the queue branch, now I'm bisecting.
>
> Was it perhaps this WARN?
>
>   WARNING: CPU: 5 PID: 2180 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:3163 mark_page_dirty_in_slot+0x6b/0x80 [kvm]
>   Modules linked in: kvm_intel kvm irqbypass
>   CPU: 5 PID: 2180 Comm: hyperv_clock Not tainted 5.16.0-rc4+ #81
>   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>   RIP: 0010:mark_page_dirty_in_slot+0x6b/0x80 [kvm]
>    kvm_write_guest+0x117/0x120 [kvm]
>    kvm_hv_invalidate_tsc_page+0x99/0xf0 [kvm]
>    kvm_arch_vm_ioctl+0x20f/0xb60 [kvm]
>    kvm_vm_ioctl+0x711/0xd50 [kvm]
>    __x64_sys_ioctl+0x7f/0xb0
>    do_syscall_64+0x38/0xc0
>    entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> If so, it was clearly introduced by commit 03c0304a86bc ("KVM: Warn if mark_page_dirty()
> is called without an active vCPU").  But that change is 100% correct as it's trivial
> to crash KVM without its protection.
>
> [...]

I saw this warning by running the vanilla hyperv_clock test, without any
modifications, against kvm/queue.

>
> Commit e880c6ea55b9 ("KVM: x86: hyper-v: Prevent using not-yet-updated TSC page
> by secondary CPUs") is squarely to blame as it was added after dirty ring, though
> in Vitaly's defense, David put it best: "That's a fairly awful bear trap".
>
> Assuming there's nothing special about vcpu0's clock, I belive the below fixes
> all issues.  If not, then the correct fix is to have vcpu0 block all other vCPUs
> until the update completes.

Could we invalidate hv->tsc_ref.tsc_sequence directly, without marking the
page dirty, since after migration it is stale until the first successful
kvm_hv_setup_tsc_page() call?

    Wanpeng
David Woodhouse Jan. 5, 2022, 5:58 p.m. UTC | #7
On Thu, 2021-12-23 at 21:24 +0000, Sean Christopherson wrote:
> Commit e880c6ea55b9 ("KVM: x86: hyper-v: Prevent using not-yet-updated TSC page
> by secondary CPUs") is squarely to blame as it was added after dirty ring, though
> in Vitaly's defense, David put it best: "That's a fairly awful bear trap".

Even with the WARN to keep us honest, this is still awful.

We have kvm_vcpu_write_guest()... but the vcpu we pass it is ignored
and only vcpu->kvm is used. But you have to be running on a physical
CPU which currently owns *a* vCPU of that KVM, or you might crash.

There is also kvm_write_guest() which doesn't take a vCPU as an
argument, and looks for all the world like it was designed for you not
to need one... but which still needs there to be a vCPU or it might
crash.

I think I want to kill the latter, make the former use the vCPU it's
given, add a spinlock to the dirty ring which will be uncontended
anyway in the common case so it shouldn't hurt (?), and then let people
use kvm->vcpu[0] when they really need to, with a documented caveat
that when there are *no* vCPUs in a KVM, dirty tracking might not work.
Which is fine, as migration of a KVM that hasn't been fully set up yet
is silly.
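That proposal can be sketched as a self-contained model. All names and structures here are hypothetical stand-ins, not KVM's actual code; the point is only that a lock on the ring lets any context holding a vCPU pointer record a dirty page, rather than relying on the running-vCPU context:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical model of the proposal above: give the dirty ring a lock so
 * that any context holding a vCPU pointer can push entries to it, instead
 * of only the context returned by kvm_get_running_vcpu().
 */
struct dirty_ring {
    atomic_flag lock;   /* uncontended in the common case */
    int entries;
};

struct vcpu {
    struct dirty_ring ring;
};

static void dirty_ring_push(struct dirty_ring *ring)
{
    while (atomic_flag_test_and_set(&ring->lock))
        ;                       /* spin: cheap when uncontended */
    ring->entries++;            /* record the dirtied page     */
    atomic_flag_clear(&ring->lock);
}

/* A kvm_vcpu_write_guest() model that actually honors the vCPU it is given. */
static void vcpu_write_guest(struct vcpu *v)
{
    dirty_ring_push(&v->ring);
}
```

In the common case the lock is taken and released by the same vCPU thread with no contention, which is why the cost is expected to be negligible.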
Sean Christopherson Jan. 8, 2022, 12:26 a.m. UTC | #8
+Peter

On Wed, Jan 05, 2022, David Woodhouse wrote:
> On Thu, 2021-12-23 at 21:24 +0000, Sean Christopherson wrote:
> > Commit e880c6ea55b9 ("KVM: x86: hyper-v: Prevent using not-yet-updated TSC page
> > by secondary CPUs") is squarely to blame as it was added after dirty ring, though
> > in Vitaly's defense, David put it best: "That's a fairly awful bear trap".
> 
> Even with the WARN to keep us honest, this is still awful.
> 
> We have kvm_vcpu_write_guest()... but the vcpu we pass it is ignored
> and only vcpu->kvm is used. But you have to be running on a physical
> CPU which currently owns *a* vCPU of that KVM, or you might crash.
> 
> There is also kvm_write_guest() which doesn't take a vCPU as an
> argument, and looks for all the world like it was designed for you not
> to need one... but which still needs there to be a vCPU or it might
> crash.
> 
> I think I want to kill the latter, make the former use the vCPU it's
> given, add a spinlock to the dirty ring which will be uncontended
> anyway in the common case so it shouldn't hurt (?),

IIRC, Peter did a fair amount of performance testing and analysis that led to
the current behavior.

> and then let people use kvm->vcpu[0] when they really need to, with a
> documented caveat that when there are *no* vCPUs in a KVM, dirty tracking
> might not work.  Which is fine, as migration of a KVM that hasn't been fully
> set up yet is silly.

"when they really need to" can be a slippery slope, using vcpu[0] is also quite
gross.  Though I 100% agree that always using kvm_get_running_vcpu() is awful.
Peter Xu Jan. 12, 2022, 3:10 a.m. UTC | #9
On Sat, Jan 08, 2022 at 12:26:32AM +0000, Sean Christopherson wrote:
> +Peter
> 
> On Wed, Jan 05, 2022, David Woodhouse wrote:
> > On Thu, 2021-12-23 at 21:24 +0000, Sean Christopherson wrote:
> > > Commit e880c6ea55b9 ("KVM: x86: hyper-v: Prevent using not-yet-updated TSC page
> > > by secondary CPUs") is squarely to blame as it was added after dirty ring, though
> > > in Vitaly's defense, David put it best: "That's a fairly awful bear trap".
> > 
> > Even with the WARN to keep us honest, this is still awful.
> > 
> > We have kvm_vcpu_write_guest()... but the vcpu we pass it is ignored
> > and only vcpu->kvm is used. But you have to be running on a physical
> > CPU which currently owns *a* vCPU of that KVM, or you might crash.
> > 
> > There is also kvm_write_guest() which doesn't take a vCPU as an
> > argument, and looks for all the world like it was designed for you not
> > to need one... but which still needs there to be a vCPU or it might
> > crash.

Right, I agree too those are slightly confusing.

> > 
> > I think I want to kill the latter, make the former use the vCPU it's
> > given, add a spinlock to the dirty ring which will be uncontended
> > anyway in the common case so it shouldn't hurt (?),
> 
> IIRC, Peter did a fair amount of performance testing and analysis that led to
> the current behavior.

Yes, that came out of the discussion when working on dirty ring.  I can't
remember the details either, but IIRC what we figured is that 99% of cases of
guests dirtying pages are within a vcpu context.

The only outlier at that time (at least for x86) was kvm-gt, which used a direct
kvm context, but it then turned out that was an illegal use of the kvm API and
kvm-gt fixed itself instead.

Then we came to the current kvm_get_running_vcpu() approach because all the
dirty tracking contexts are within an existing vcpu context.  Dirty ring could
be simplified too, because we don't need the default-vcpu0 trick, nor do we need
an extra ring just to record no-vcpu contexts (very early versions of dirty ring
did have N+1 rings, where N is the vcpu count and the extra one was for this).

kvm_get_running_vcpu() also has a slight benefit that it guarantees no spinlock
needed.

> 
> > and then let people use kvm->vcpu[0] when they really need to, with a
> > documented caveat that when there are *no* vCPUs in a KVM, dirty tracking
> > might not work.  Which is fine, as migration of a KVM that hasn't been fully
> > set up yet is silly.
> 
> "when they really need to" can be a slippery slope, using vcpu[0] is also quite
> gross.  Though I 100% agree that always using kvm_get_running_vcpu() is awful.

I agree none of them looks like an extremely clean approach.

So far I'd hope we can fix the problem with what Sean suggested: postponing the
page update until we have a vcpu context, so the assumption of "having a vcpu
context" stays true, at least for x86.  That'll be the simplest, afaiu.

Or do we have explicit other requirement that needs to dirty guest pages
without vcpu context at all?

Thanks,
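The convention described here, where pages are only ever dirtied from a running-vCPU context and a missing context is a bug, can be modeled in a self-contained sketch (all names hypothetical):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical single-CPU model of the running-vCPU convention. */
struct vcpu {
    int id;
    int dirty_ring_entries;
};

static struct vcpu *running_vcpu;   /* models the per-CPU kvm_running_vcpu */
static bool warned;

static void vcpu_load(struct vcpu *v) { running_vcpu = v; }
static void vcpu_put(void)            { running_vcpu = NULL; }

/*
 * Model of mark_page_dirty() under the dirty-ring rules: the dirtied page
 * is pushed onto the *running* vCPU's ring, so with no vCPU context there
 * is nowhere to record it -- exactly the case the new WARN catches.
 */
static void mark_page_dirty(void)
{
    struct vcpu *v = running_vcpu;

    if (!v) {                  /* WARN_ON_ONCE(!vcpu) in the real code */
        warned = true;
        return;
    }
    v->dirty_ring_entries++;   /* no lock: only this context touches it */
}
```

Because only the context that loaded the vCPU ever pushes to its ring, no spinlock is needed, which is the slight benefit mentioned above.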
David Woodhouse Jan. 19, 2022, 8:14 a.m. UTC | #10
On Wed, 2022-01-12 at 11:10 +0800, Peter Xu wrote:
> [...]
> So far I'd hope we can fix the problem with what Sean suggested on postponing
> the page update until when we have the vcpu context, so we keep the assumption
> still true on "having a vcpu context" at least for x86 and that'll be the
> simplest, afaiu.
> 
> Or do we have explicit other requirement that needs to dirty guest pages
> without vcpu context at all?

Delivering interrupts may want to do so. That's the one we hit for
S390, and I only avoided it for Xen event channel delivery on x86 by
declaring that the Xen shared info page is exempt from dirty tracking
and should *always* be considered dirty.
Paolo Bonzini Jan. 19, 2022, 5:36 p.m. UTC | #11
On 1/19/22 09:14, David Woodhouse wrote:
>> Or do we have explicit other requirement that needs to dirty guest pages
>> without vcpu context at all?
>
> Delivering interrupts may want to do so. That's the one we hit for
> S390, and I only avoided it for Xen event channel delivery on x86 by
> declaring that the Xen shared info page is exempt from dirty tracking
> and should *always* be considered dirty.

We also have one that I just found out about in 
kvm_hv_invalidate_tsc_page, called from KVM_SET_CLOCK. :/

So either we have another special case to document for the dirty ring
buffer (and retroactively so, even), or we're in bad need of a solution.

Paolo
David Woodhouse Jan. 19, 2022, 5:38 p.m. UTC | #12
On Wed, 2022-01-19 at 18:36 +0100, Paolo Bonzini wrote:
> On 1/19/22 09:14, David Woodhouse wrote:
> > > Or do we have explicit other requirement that needs to dirty guest pages
> > > without vcpu context at all?
> > 
> > Delivering interrupts may want to do so. That's the one we hit for
> > S390, and I only avoided it for Xen event channel delivery on x86 by
> > declaring that the Xen shared info page is exempt from dirty tracking
> > and should *always* be considered dirty.
> 
> We also have one that I just found out about in 
> kvm_hv_invalidate_tsc_page, called from KVM_SET_CLOCK. :/
> 
> So either we have another special case to document for the dirty ring 
> buffer (and retroactively so, even), or we're in bad need for a solution.

Seems like adding that warning is having precisely the desired effect :)
Sean Christopherson Jan. 19, 2022, 5:44 p.m. UTC | #13
On Wed, Jan 19, 2022, David Woodhouse wrote:
> On Wed, 2022-01-19 at 18:36 +0100, Paolo Bonzini wrote:
> > On 1/19/22 09:14, David Woodhouse wrote:
> > > > Or do we have explicit other requirement that needs to dirty guest pages
> > > > without vcpu context at all?
> > > 
> > > Delivering interrupts may want to do so. That's the one we hit for
> > > S390, and I only avoided it for Xen event channel delivery on x86 by
> > > declaring that the Xen shared info page is exempt from dirty tracking
> > > and should *always* be considered dirty.
> > 
> > We also have one that I just found out about in 
> > kvm_hv_invalidate_tsc_page, called from KVM_SET_CLOCK. :/

I think we can fix that usage though:

https://lore.kernel.org/all/YcTpJ369cRBN4W93@google.com

> > So either we have another special case to document for the dirty ring 
> > buffer (and retroactively so, even), or we're in bad need for a solution.
> 
> Seems like adding that warning is having precisely the desired effect :)

The WARN is certainly useful.  Part of me actually likes the restriction of needing
to have a valid vCPU, at least for x86, as there really aren't many legitimate cases
where KVM should be marking memory dirty without a vCPU.
Paolo Bonzini Jan. 21, 2022, 6:15 p.m. UTC | #14
On 1/19/22 18:44, Sean Christopherson wrote:
> I think we can fix that usage though:
> 
> https://lore.kernel.org/all/YcTpJ369cRBN4W93@google.com

Ok, will review next week.

>>> So either we have another special case to document for the dirty ring
>>> buffer (and retroactively so, even), or we're in bad need for a solution.
>> Seems like adding that warning is having precisely the desired effect :)
>
> The WARN is certainly useful.  Part of me actually likes the restriction of needing
> to have a valid vCPU, at least for x86, as there really aren't many legitimate cases
> where KVM should be marking memory dirty without a vCPU.

Yes, I agree that every case of it has needed, at the very least, 
further scrutiny.

Paolo