Message ID | 20231215172830.2540987-9-eperezma@redhat.com (mailing list archive)
State      | New, archived
Series     | Map memory at destination .load_setup in vDPA-net migration
On Sat, Dec 16, 2023 at 1:28 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> Callers can use this function to set up the incoming migration thread.
>
> This thread is able to map the guest memory while the migration is ongoing, without blocking QMP or other important tasks. While this allows the destination QEMU not to block, it expands the mapping time during migration instead of making it pre-migration.

If it's just QMP, can we simply use a bh with a quota here?

Btw, have you measured the hotspot that causes such slowness? Is it pinning or vendor-specific mapping that slows down the progress? Does VFIO have a similar issue?

> This thread joins at vdpa backend device start, so it could happen that the guest memory is so large that we still have guest memory to map before this time.

So we would still hit the QMP stall in this case?

> This can be improved in later iterations, when the destination device can inform QEMU that it is not ready to complete the migration.
>
> If the device is not started, the cleanup of the mapped memory is done at .load_cleanup. This is far from ideal, as the destination machine has mapped all the guest RAM for nothing, and now it needs to unmap it. However, we don't have information about the state of the device, so it's the best we can do. Once iterative migration is supported, this will be improved, as we will know the virtio state of the device.
>
> If the VM migrates before finishing all the maps, the source will stop, but the destination is still not ready to continue, and it will wait until all guest RAM is mapped. It is still an improvement over doing all the mapping when the migration finishes, but the next patches use the switchover_ack method to prevent the source from stopping until all the memory is mapped at the destination.
>
> The memory unmapping when the device is not started is weird too, as ideally nothing would be mapped. This can be fixed when we migrate the device state iteratively and we know for sure whether the device is started or not. At this moment we don't have such information, so there is no better alternative.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>
> ---

Thanks
On Wed, Dec 20, 2023 at 6:22 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Sat, Dec 16, 2023 at 1:28 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> >
> > Callers can use this function to set up the incoming migration thread.
> >
> > This thread is able to map the guest memory while the migration is ongoing, without blocking QMP or other important tasks. While this allows the destination QEMU not to block, it expands the mapping time during migration instead of making it pre-migration.
>
> If it's just QMP, can we simply use a bh with a quota here?

Because QEMU cannot guarantee the quota at write(fd, VHOST_IOTLB_UPDATE, ...). Also, synchronization with vhost_vdpa_dev_start would get more complicated, as it would need to be re-scheduled too.

As a half-baked idea, we could split the mappings into chunks of manageable sizes, but I don't like that idea a lot.

> Btw, have you measured the hotspot that causes such slowness? Is it pinning or vendor-specific mapping that slows down the progress? Does VFIO have a similar issue?

Si-Wei did the actual profiling, as he is the one with the 128G guests, but most of the time was spent in the memory pinning. Si-Wei, please correct me if I'm wrong.

I didn't check VFIO, but I think it just maps at realize phase with vfio_realize -> vfio_attach_device -> vfio_connect_container(). In previous testing, this delayed the VM initialization by a lot, as we're moving those 20s of blocking to every VM start.

Investigating a way to do it only in the case of being the destination of a live migration, I think the right place is the .load_setup migration handler. But I'm ok with moving it, for sure.

> > This thread joins at vdpa backend device start, so it could happen that the guest memory is so large that we still have guest memory to map before this time.
>
> So we would still hit the QMP stall in this case?

This paragraph is kind of outdated, sorry. I can only cause this if I don't enable the switchover_ack migration capability and if I artificially make memory pinning in the kernel slow. But I didn't check QMP, to be honest, so I can try to test it, yes.

If QMP is not responsive, that means QMP is not responsive in QEMU master in that period too, actually. So we're only improving anyway.

Thanks!

> > This can be improved in later iterations, when the destination device can inform QEMU that it is not ready to complete the migration.
> >
> > If the device is not started, the cleanup of the mapped memory is done at .load_cleanup. This is far from ideal, as the destination machine has mapped all the guest RAM for nothing, and now it needs to unmap it. However, we don't have information about the state of the device, so it's the best we can do. Once iterative migration is supported, this will be improved, as we will know the virtio state of the device.
> >
> > If the VM migrates before finishing all the maps, the source will stop, but the destination is still not ready to continue, and it will wait until all guest RAM is mapped. It is still an improvement over doing all the mapping when the migration finishes, but the next patches use the switchover_ack method to prevent the source from stopping until all the memory is mapped at the destination.
> >
> > The memory unmapping when the device is not started is weird too, as ideally nothing would be mapped. This can be fixed when we migrate the device state iteratively and we know for sure whether the device is started or not. At this moment we don't have such information, so there is no better alternative.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >
> > ---
>
> Thanks
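To make the tradeoff concrete, here is a minimal sketch of the bh-with-quota idea (not part of the series; the map_queue and map_bh fields, the helper name, and the quota value are all hypothetical). Even with a per-iteration quota, each individual write() can still block the main loop for as long as the kernel needs to pin that region, which is the limitation described above:

    /* Hypothetical bottom half draining at most MAP_QUOTA_PER_BH messages. */
    #define MAP_QUOTA_PER_BH 8

    static void vdpa_map_bh(void *opaque)
    {
        VhostVDPAShared *s = opaque;

        for (int i = 0; i < MAP_QUOTA_PER_BH; i++) {
            g_autofree struct vhost_msg_v2 *msg =
                g_async_queue_try_pop(s->map_queue);

            if (!msg) {
                return; /* queue drained; nothing left to map */
            }
            /*
             * The quota caps how many messages are sent per main-loop
             * iteration, but this single write() can still block for a long
             * time while the kernel pins the region's memory.
             */
            if (write(s->device_fd, msg, sizeof(*msg)) != sizeof(*msg)) {
                error_report("iotlb write failed: %s", strerror(errno));
                return;
            }
        }

        qemu_bh_schedule(s->map_bh); /* quota used up; resume next iteration */
    }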
On Wed, Dec 20, 2023 at 3:07 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> On Wed, Dec 20, 2023 at 6:22 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Sat, Dec 16, 2023 at 1:28 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> > >
> > > Callers can use this function to set up the incoming migration thread.
> > >
> > > This thread is able to map the guest memory while the migration is ongoing, without blocking QMP or other important tasks. While this allows the destination QEMU not to block, it expands the mapping time during migration instead of making it pre-migration.
> >
> > If it's just QMP, can we simply use a bh with a quota here?
>
> Because QEMU cannot guarantee the quota at write(fd, VHOST_IOTLB_UPDATE, ...).

So you mean the delay may be caused by a single syscall?

> Also, synchronization with vhost_vdpa_dev_start would get more complicated, as it would need to be re-scheduled too.

Just a flush of the bh, or not?

But another question: how do we synchronize with the memory API in this case? Currently the updating (without vIOMMU) is done under the listener callback. Usually after the commit, QEMU may assume the memory topology has been updated. If it is done asynchronously, would we have any problem?

> As a half-baked idea, we could split the mappings into chunks of manageable sizes, but I don't like that idea a lot.
>
> > Btw, have you measured the hotspot that causes such slowness? Is it pinning or vendor-specific mapping that slows down the progress? Does VFIO have a similar issue?
>
> Si-Wei did the actual profiling, as he is the one with the 128G guests, but most of the time was spent in the memory pinning. Si-Wei, please correct me if I'm wrong.
>
> I didn't check VFIO, but I think it just maps at realize phase with vfio_realize -> vfio_attach_device -> vfio_connect_container(). In previous testing, this delayed the VM initialization by a lot, as we're moving those 20s of blocking to every VM start.
>
> Investigating a way to do it only in the case of being the destination of a live migration, I think the right place is the .load_setup migration handler. But I'm ok with moving it, for sure.

Adding Peter for more ideas.

> > > This thread joins at vdpa backend device start, so it could happen that the guest memory is so large that we still have guest memory to map before this time.
> >
> > So we would still hit the QMP stall in this case?
>
> This paragraph is kind of outdated, sorry. I can only cause this if I don't enable the switchover_ack migration capability and if I artificially make memory pinning in the kernel slow. But I didn't check QMP, to be honest, so I can try to test it, yes.
>
> If QMP is not responsive, that means QMP is not responsive in QEMU master in that period too, actually. So we're only improving anyway.
>
> Thanks!

Thanks
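For reference, the ordering in question looks roughly like this (illustrative skeleton, not code from the patch): without vIOMMU, vhost-vdpa issues its maps from the listener's region_add callback, and once commit has returned, memory API callers assume the new topology is in effect.

    static void example_region_add(MemoryListener *listener,
                                   MemoryRegionSection *section)
    {
        /*
         * vhost-vdpa sends VHOST_IOTLB_UPDATE from here today. If the map is
         * deferred to a thread, it may not be visible to the device yet when
         * commit below returns.
         */
    }

    static void example_commit(MemoryListener *listener)
    {
        /* After this returns, callers assume the topology change is done. */
    }

    static MemoryListener example_listener = {
        .name = "iotlb-ordering-example",
        .region_add = example_region_add,
        .commit = example_commit,
    };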
On Thu, Dec 21, 2023 at 3:17 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, Dec 20, 2023 at 3:07 PM Eugenio Perez Martin <eperezma@redhat.com> wrote:
> >
> > On Wed, Dec 20, 2023 at 6:22 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Sat, Dec 16, 2023 at 1:28 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > >
> > > > Callers can use this function to set up the incoming migration thread.
> > > >
> > > > This thread is able to map the guest memory while the migration is ongoing, without blocking QMP or other important tasks. While this allows the destination QEMU not to block, it expands the mapping time during migration instead of making it pre-migration.
> > >
> > > If it's just QMP, can we simply use a bh with a quota here?
> >
> > Because QEMU cannot guarantee the quota at write(fd, VHOST_IOTLB_UPDATE, ...).
>
> So you mean the delay may be caused by a single syscall?

Mostly yes, the iotlb write() that maps all of the guest memory.

> > Also, synchronization with vhost_vdpa_dev_start would get more complicated, as it would need to be re-scheduled too.
>
> Just a flush of the bh, or not?

Let me put it differently: to map the guest memory, vhost_vdpa_dma_map is called because the guest starts the device with a PCI write to the device status:

#0  vhost_vdpa_dma_map (s=0x5555570e0e60, asid=0, iova=0, size=786432, vaddr=0x7fff40000000, readonly=false) at ../hw/virtio/vhost-vdpa.c:93
#1  0x0000555555979451 in vhost_vdpa_listener_region_add (listener=0x5555570e0e68, section=0x7fffee5bc0d0) at ../hw/virtio/vhost-vdpa.c:415
#2  0x0000555555b3c543 in listener_add_address_space (listener=0x5555570e0e68, as=0x555556db72e0 <address_space_memory>) at ../system/memory.c:3011
#3  0x0000555555b3c996 in memory_listener_register (listener=0x5555570e0e68, as=0x555556db72e0 <address_space_memory>) at ../system/memory.c:3081
#4  0x000055555597be03 in vhost_vdpa_dev_start (dev=0x5555570e1310, started=true) at ../hw/virtio/vhost-vdpa.c:1460
#5  0x00005555559734c2 in vhost_dev_start (hdev=0x5555570e1310, vdev=0x5555584b2c80, vrings=false) at ../hw/virtio/vhost.c:2058
#6  0x0000555555854ec8 in vhost_net_start_one (net=0x5555570e1310, dev=0x5555584b2c80) at ../hw/net/vhost_net.c:274
#7  0x00005555558554ca in vhost_net_start (dev=0x5555584b2c80, ncs=0x5555584c8278, data_queue_pairs=1, cvq=1) at ../hw/net/vhost_net.c:415
#8  0x0000555555ace7a5 in virtio_net_vhost_status (n=0x5555584b2c80, status=15 '\017') at ../hw/net/virtio-net.c:310
#9  0x0000555555acea50 in virtio_net_set_status (vdev=0x5555584b2c80, status=15 '\017') at ../hw/net/virtio-net.c:391
#10 0x0000555555b06fee in virtio_set_status (vdev=0x5555584b2c80, val=15 '\017') at ../hw/virtio/virtio.c:2048
#11 0x000055555595d667 in virtio_pci_common_write (opaque=0x5555584aa8b0, addr=20, val=15, size=1) at ../hw/virtio/virtio-pci.c:1580
#12 0x0000555555b351c1 in memory_region_write_accessor (mr=0x5555584ab3f0, addr=20, value=0x7fffee5bc4c8, size=1, shift=0, mask=255, attrs=...) at ../system/memory.c:497
#13 0x0000555555b354c5 in access_with_adjusted_size (addr=20, value=0x7fffee5bc4c8, size=1, access_size_min=1, access_size_max=4, access_fn=0x555555b350cf <memory_region_write_accessor>, mr=0x5555584ab3f0, attrs=...) at ../system/memory.c:573
#14 0x0000555555b3856f in memory_region_dispatch_write (mr=0x5555584ab3f0, addr=20, data=15, op=MO_8, attrs=...) at ../system/memory.c:1521
#15 0x0000555555b45885 in flatview_write_continue (fv=0x7fffd8122b80, addr=4227858452, attrs=..., ptr=0x7ffff7ff0028, len=1, addr1=20, l=1, mr=0x5555584ab3f0) at ../system/physmem.c:2714
#16 0x0000555555b459e8 in flatview_write (fv=0x7fffd8122b80, addr=4227858452, attrs=..., buf=0x7ffff7ff0028, len=1) at ../system/physmem.c:2756
#17 0x0000555555b45d9a in address_space_write (as=0x555556db72e0 <address_space_memory>, addr=4227858452, attrs=..., buf=0x7ffff7ff0028, len=1) at ../system/physmem.c:2863
#18 0x0000555555b45e07 in address_space_rw (as=0x555556db72e0 <address_space_memory>, addr=4227858452, attrs=..., buf=0x7ffff7ff0028, len=1, is_write=true) at ../system/physmem.c:2873
#19 0x0000555555b5eb30 in kvm_cpu_exec (cpu=0x5555571258f0) at ../accel/kvm/kvm-all.c:2915
#20 0x0000555555b61798 in kvm_vcpu_thread_fn (arg=0x5555571258f0) at ../accel/kvm/kvm-accel-ops.c:51
#21 0x0000555555d384b7 in qemu_thread_start (args=0x55555712c390) at ../util/qemu-thread-posix.c:541
#22 0x00007ffff580814a in start_thread () from /lib64/libpthread.so.0
#23 0x00007ffff54fcf23 in clone () from /lib64/libc.so.6

Can we reschedule that map to a bh without returning control to the vCPU?

> But another question: how do we synchronize with the memory API in this case? Currently the updating (without vIOMMU) is done under the listener callback. Usually after the commit, QEMU may assume the memory topology has been updated. If it is done asynchronously, would we have any problem?

The function vhost_vdpa_process_iotlb_msg in the kernel has its own lock, so two QEMU threads can map memory independently and they get serialized. For the write() caller it just looks like the call takes more time, but there are no deadlocks or anything similar.

> > As a half-baked idea, we could split the mappings into chunks of manageable sizes, but I don't like that idea a lot.
> >
> > > Btw, have you measured the hotspot that causes such slowness? Is it pinning or vendor-specific mapping that slows down the progress? Does VFIO have a similar issue?
> >
> > Si-Wei did the actual profiling, as he is the one with the 128G guests, but most of the time was spent in the memory pinning. Si-Wei, please correct me if I'm wrong.
> >
> > I didn't check VFIO, but I think it just maps at realize phase with vfio_realize -> vfio_attach_device -> vfio_connect_container(). In previous testing, this delayed the VM initialization by a lot, as we're moving those 20s of blocking to every VM start.
> >
> > Investigating a way to do it only in the case of being the destination of a live migration, I think the right place is the .load_setup migration handler. But I'm ok with moving it, for sure.
>
> Adding Peter for more ideas.

Appreciated :).

Thanks!

> > > > This thread joins at vdpa backend device start, so it could happen that the guest memory is so large that we still have guest memory to map before this time.
> > >
> > > So we would still hit the QMP stall in this case?
> >
> > This paragraph is kind of outdated, sorry. I can only cause this if I don't enable the switchover_ack migration capability and if I artificially make memory pinning in the kernel slow. But I didn't check QMP, to be honest, so I can try to test it, yes.
> >
> > If QMP is not responsive, that means QMP is not responsive in QEMU master in that period too, actually. So we're only improving anyway.
> >
> > Thanks!
>
> Thanks
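The serialization mentioned above works roughly like this on the kernel side (paraphrased sketch of drivers/vhost/vdpa.c, not a verbatim quote): every IOTLB message written by userspace is funneled through vhost_vdpa_process_iotlb_msg(), which takes the device mutex, so concurrent writers simply wait on the lock:

    /* Paraphrased kernel-side sketch; details elided. */
    static int vhost_vdpa_process_iotlb_msg(struct vhost_dev *dev, u32 asid,
                                            struct vhost_iotlb_msg *msg)
    {
        int r;

        mutex_lock(&dev->mutex);    /* serializes concurrent QEMU writers */
        r = vhost_dev_check_owner(dev);
        if (r)
            goto unlock;

        /* ... handle VHOST_IOTLB_UPDATE / INVALIDATE / BATCH_* ... */

    unlock:
        mutex_unlock(&dev->mutex);
        return r;
    }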
Jason, Eugenio,

Apologies for a late reply; just back from the long holiday.

On Thu, Dec 21, 2023 at 09:20:40AM +0100, Eugenio Perez Martin wrote:
> Si-Wei did the actual profiling, as he is the one with the 128G guests, but most of the time was spent in the memory pinning. Si-Wei, please correct me if I'm wrong.

IIUC we're talking about the no-vIOMMU use case. The pinning should indeed take a lot of time if it's similar to what VFIO does.

> I didn't check VFIO, but I think it just maps at realize phase with vfio_realize -> vfio_attach_device -> vfio_connect_container(). In previous testing, this delayed the VM initialization by a lot, as we're moving those 20s of blocking to every VM start.
>
> Investigating a way to do it only in the case of being the destination of a live migration, I think the right place is the .load_setup migration handler. But I'm ok with moving it, for sure.

If it's destined to map the 128G, it does sound sensible to me to do it when the VM starts, rather than anytime afterwards.

Could anyone help to explain what's the problem if vDPA maps 128G at VM init, just like what VFIO does?

Thanks,
On Tue, Jan 2, 2024 at 6:33 AM Peter Xu <peterx@redhat.com> wrote:
>
> Jason, Eugenio,
>
> Apologies for a late reply; just back from the long holiday.
>
> On Thu, Dec 21, 2023 at 09:20:40AM +0100, Eugenio Perez Martin wrote:
> > Si-Wei did the actual profiling, as he is the one with the 128G guests, but most of the time was spent in the memory pinning. Si-Wei, please correct me if I'm wrong.
>
> IIUC we're talking about the no-vIOMMU use case. The pinning should indeed take a lot of time if it's similar to what VFIO does.
>
> > I didn't check VFIO, but I think it just maps at realize phase with vfio_realize -> vfio_attach_device -> vfio_connect_container(). In previous testing, this delayed the VM initialization by a lot, as we're moving those 20s of blocking to every VM start.
> >
> > Investigating a way to do it only in the case of being the destination of a live migration, I think the right place is the .load_setup migration handler. But I'm ok with moving it, for sure.
>
> If it's destined to map the 128G, it does sound sensible to me to do it when the VM starts, rather than anytime afterwards.

Just for completeness: it is not 100% sure that the driver will start the device, but it is very likely.

> Could anyone help to explain what's the problem if vDPA maps 128G at VM init, just like what VFIO does?

The main problem was the delay of VM start. In the master branch, the pinning is done when the driver starts the device. While it takes the BQL, the rest of the vCPUs can move work forward while the host is pinning, so its impact is not so evident.

Moving it to initialization time made it very noticeable. To make things worse, QEMU did not respond to QMP commands and the like. That's why it was done only if the VM was the destination of a live migration.

However, we've added the memory map thread in this version, so this might not be a problem anymore. We could move the spawn of the thread to initialization time.

But how do we undo this pinning in case the guest does not start the device? In this series, this is done at the destination with vhost_vdpa_load_cleanup. Or is it ok to just keep the memory mapped as long as QEMU has the vDPA device?

Thanks!
On Tue, Jan 02, 2024 at 12:28:48PM +0100, Eugenio Perez Martin wrote:
> On Tue, Jan 2, 2024 at 6:33 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > Jason, Eugenio,
> >
> > Apologies for a late reply; just back from the long holiday.
> >
> > On Thu, Dec 21, 2023 at 09:20:40AM +0100, Eugenio Perez Martin wrote:
> > > Si-Wei did the actual profiling, as he is the one with the 128G guests, but most of the time was spent in the memory pinning. Si-Wei, please correct me if I'm wrong.
> >
> > IIUC we're talking about the no-vIOMMU use case. The pinning should indeed take a lot of time if it's similar to what VFIO does.
> >
> > > I didn't check VFIO, but I think it just maps at realize phase with vfio_realize -> vfio_attach_device -> vfio_connect_container(). In previous testing, this delayed the VM initialization by a lot, as we're moving those 20s of blocking to every VM start.
> > >
> > > Investigating a way to do it only in the case of being the destination of a live migration, I think the right place is the .load_setup migration handler. But I'm ok with moving it, for sure.
> >
> > If it's destined to map the 128G, it does sound sensible to me to do it when the VM starts, rather than anytime afterwards.
>
> Just for completeness: it is not 100% sure that the driver will start the device, but it is very likely.

My understanding is that vDPA is still quite a special device, assumed to target only advanced users, and it should not appear in a default config for anyone. It means the user should hopefully remove the device if the guest is not using it, instead of worrying about a slow boot.

> > Could anyone help to explain what's the problem if vDPA maps 128G at VM init, just like what VFIO does?
>
> The main problem was the delay of VM start. In the master branch, the pinning is done when the driver starts the device. While it takes the BQL, the rest of the vCPUs can move work forward while the host is pinning, so its impact is not so evident.
>
> Moving it to initialization time made it very noticeable. To make things worse, QEMU did not respond to QMP commands and the like. That's why it was done only if the VM was the destination of a live migration.

Is that a major issue for us? IIUC then VFIO shares the same condition. If it's a real problem, do we want to have a solution that works for both (or, is it possible)?

> However, we've added the memory map thread in this version, so this might not be a problem anymore. We could move the spawn of the thread to initialization time.
>
> But how do we undo this pinning in case the guest does not start the device? In this series, this is done at the destination with vhost_vdpa_load_cleanup. Or is it ok to just keep the memory mapped as long as QEMU has the vDPA device?

I think even if vDPA decides to use a thread, we should keep the same behavior before/after the migration. Having asymmetric behavior over DMA from the assigned HWs might have unpredictable implications.

What I worry about is that we may over-optimize / over-engineer the case where the user specifies the vDPA device but does not use it, as I mentioned above.

For the long term, maybe there's a chance to optimize DMA pinning for both vdpa/vfio use cases; then we can always pin them during VM start? Assuming that issue only exists for large VMs, they should normally be good candidates for huge pages already. Then, it means maybe one folio/page can cover a large range (e.g. 1G on x86_64) in one pin, and physical continuity also provides the possibility of IOMMU large page mappings. I didn't check at which stage we are for VFIO on this; Alex may know better. I'm copying Alex anyway, since the problem seems to be a common one already, so maybe he has some thoughts.

Thanks,
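To put rough numbers on the huge-page point (back-of-the-envelope arithmetic, not a measurement from this thread): a 128 GiB guest backed by 4 KiB base pages is 128 * 262144 = 33,554,432 pages to pin one by one, each needing its own struct page reference, while 1 GiB mappings would cut that to 128 pin operations and correspondingly fewer IOMMU entries.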
On Wed, Jan 3, 2024 at 7:16 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Jan 02, 2024 at 12:28:48PM +0100, Eugenio Perez Martin wrote:
> > On Tue, Jan 2, 2024 at 6:33 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > Jason, Eugenio,
> > >
> > > Apologies for a late reply; just back from the long holiday.
> > >
> > > On Thu, Dec 21, 2023 at 09:20:40AM +0100, Eugenio Perez Martin wrote:
> > > > Si-Wei did the actual profiling, as he is the one with the 128G guests, but most of the time was spent in the memory pinning. Si-Wei, please correct me if I'm wrong.
> > >
> > > IIUC we're talking about the no-vIOMMU use case. The pinning should indeed take a lot of time if it's similar to what VFIO does.
> > >
> > > > I didn't check VFIO, but I think it just maps at realize phase with vfio_realize -> vfio_attach_device -> vfio_connect_container(). In previous testing, this delayed the VM initialization by a lot, as we're moving those 20s of blocking to every VM start.
> > > >
> > > > Investigating a way to do it only in the case of being the destination of a live migration, I think the right place is the .load_setup migration handler. But I'm ok with moving it, for sure.
> > >
> > > If it's destined to map the 128G, it does sound sensible to me to do it when the VM starts, rather than anytime afterwards.
> >
> > Just for completeness: it is not 100% sure that the driver will start the device, but it is very likely.
>
> My understanding is that vDPA is still quite a special device, assumed to target only advanced users, and it should not appear in a default config for anyone. It means the user should hopefully remove the device if the guest is not using it, instead of worrying about a slow boot.
>
> > > Could anyone help to explain what's the problem if vDPA maps 128G at VM init, just like what VFIO does?
> >
> > The main problem was the delay of VM start. In the master branch, the pinning is done when the driver starts the device. While it takes the BQL, the rest of the vCPUs can move work forward while the host is pinning, so its impact is not so evident.
> >
> > Moving it to initialization time made it very noticeable. To make things worse, QEMU did not respond to QMP commands and the like. That's why it was done only if the VM was the destination of a live migration.
>
> Is that a major issue for us?

To me it is a regression, but I'm ok with it, for sure.

> IIUC then VFIO shares the same condition. If it's a real problem, do we want to have a solution that works for both (or, is it possible)?

I would not consider it a regression for VFIO, since I think it has behaved that way from the beginning. But yes, I'm all in to find a common solution.

> > However, we've added the memory map thread in this version, so this might not be a problem anymore. We could move the spawn of the thread to initialization time.
> >
> > But how do we undo this pinning in case the guest does not start the device? In this series, this is done at the destination with vhost_vdpa_load_cleanup. Or is it ok to just keep the memory mapped as long as QEMU has the vDPA device?
>
> I think even if vDPA decides to use a thread, we should keep the same behavior before/after the migration. Having asymmetric behavior over DMA from the assigned HWs might have unpredictable implications.
>
> What I worry about is that we may over-optimize / over-engineer the case where the user specifies the vDPA device but does not use it, as I mentioned above.

I agree with all of the above. If it is ok to keep memory mapped while the guest has not started, I think we can move the spawn of the thread, or even just the map write itself, to vdpa init.

> For the long term, maybe there's a chance to optimize DMA pinning for both vdpa/vfio use cases; then we can always pin them during VM start? Assuming that issue only exists for large VMs, they should normally be good candidates for huge pages already. Then, it means maybe one folio/page can cover a large range (e.g. 1G on x86_64) in one pin, and physical continuity also provides the possibility of IOMMU large page mappings. I didn't check at which stage we are for VFIO on this; Alex may know better.

Sounds interesting, and I think it should be implemented. Thanks for the pointer!

> I'm copying Alex anyway, since the problem seems to be a common one already, so maybe he has some thoughts.

Appreciated :).

Thanks!
On Wed, Jan 03, 2024 at 12:11:19PM +0100, Eugenio Perez Martin wrote:
> On Wed, Jan 3, 2024 at 7:16 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Tue, Jan 02, 2024 at 12:28:48PM +0100, Eugenio Perez Martin wrote:
> > > On Tue, Jan 2, 2024 at 6:33 AM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > Jason, Eugenio,
> > > >
> > > > Apologies for a late reply; just back from the long holiday.
> > > >
> > > > On Thu, Dec 21, 2023 at 09:20:40AM +0100, Eugenio Perez Martin wrote:
> > > > > Si-Wei did the actual profiling, as he is the one with the 128G guests, but most of the time was spent in the memory pinning. Si-Wei, please correct me if I'm wrong.
> > > >
> > > > IIUC we're talking about the no-vIOMMU use case. The pinning should indeed take a lot of time if it's similar to what VFIO does.
> > > >
> > > > > I didn't check VFIO, but I think it just maps at realize phase with vfio_realize -> vfio_attach_device -> vfio_connect_container(). In previous testing, this delayed the VM initialization by a lot, as we're moving those 20s of blocking to every VM start.
> > > > >
> > > > > Investigating a way to do it only in the case of being the destination of a live migration, I think the right place is the .load_setup migration handler. But I'm ok with moving it, for sure.
> > > >
> > > > If it's destined to map the 128G, it does sound sensible to me to do it when the VM starts, rather than anytime afterwards.
> > >
> > > Just for completeness: it is not 100% sure that the driver will start the device, but it is very likely.
> >
> > My understanding is that vDPA is still quite a special device, assumed to target only advanced users, and it should not appear in a default config for anyone. It means the user should hopefully remove the device if the guest is not using it, instead of worrying about a slow boot.
> >
> > > > Could anyone help to explain what's the problem if vDPA maps 128G at VM init, just like what VFIO does?
> > >
> > > The main problem was the delay of VM start. In the master branch, the pinning is done when the driver starts the device. While it takes the BQL, the rest of the vCPUs can move work forward while the host is pinning, so its impact is not so evident.
> > >
> > > Moving it to initialization time made it very noticeable. To make things worse, QEMU did not respond to QMP commands and the like. That's why it was done only if the VM was the destination of a live migration.
> >
> > Is that a major issue for us?
>
> To me it is a regression, but I'm ok with it, for sure.
>
> > IIUC then VFIO shares the same condition. If it's a real problem, do we want to have a solution that works for both (or, is it possible)?
>
> I would not consider it a regression for VFIO, since I think it has behaved that way from the beginning. But yes, I'm all in to find a common solution.
>
> > > However, we've added the memory map thread in this version, so this might not be a problem anymore. We could move the spawn of the thread to initialization time.
> > >
> > > But how do we undo this pinning in case the guest does not start the device? In this series, this is done at the destination with vhost_vdpa_load_cleanup. Or is it ok to just keep the memory mapped as long as QEMU has the vDPA device?
> >
> > I think even if vDPA decides to use a thread, we should keep the same behavior before/after the migration. Having asymmetric behavior over DMA from the assigned HWs might have unpredictable implications.
> >
> > What I worry about is that we may over-optimize / over-engineer the case where the user specifies the vDPA device but does not use it, as I mentioned above.
>
> I agree with all of the above. If it is ok to keep memory mapped while the guest has not started, I think we can move the spawn of the thread, or even just the map write itself, to vdpa init.
>
> > For the long term, maybe there's a chance to optimize DMA pinning for both vdpa/vfio use cases; then we can always pin them during VM start? Assuming that issue only exists for large VMs, they should normally be good candidates for huge pages already. Then, it means maybe one folio/page can cover a large range (e.g. 1G on x86_64) in one pin, and physical continuity also provides the possibility of IOMMU large page mappings. I didn't check at which stage we are for VFIO on this; Alex may know better.
>
> Sounds interesting, and I think it should be implemented. Thanks for the pointer!

I didn't have an exact pointer previously, but to provide one now, I think it can be something like this:

  physr discussion - Jason Gunthorpe
  https://www.youtube.com/watch?v=QftOTtks-pI&list=PLbzoR-pLrL6rlmdpJ3-oMgU_zxc1wAhjS&index=36

Since I have zero knowledge on the vDPA side, I can only provide the example from VFIO, and even that may not be fully accurate. Basically, afaiu, VFIO is currently already smart enough to recognize contiguous ranges (via vfio_pin_pages_remote()):

    /* Pin a contiguous chunk of memory */
    npage = vfio_pin_pages_remote(dma, vaddr + dma->size,
                                  size >> PAGE_SHIFT, &pfn, limit,
                                  &batch);

But the pin can still be slow, due to e.g. needing a page* for each PAGE_SIZE range. I think something like physr can improve it. That should correspond to the 1st slide of the video, the "performance" entry on pin_user_pages().

Thanks,

> > I'm copying Alex anyway, since the problem seems to be a common one already, so maybe he has some thoughts.
>
> Appreciated :).
>
> Thanks!
diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 8f54e5edd4..b49286b327 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -17,6 +17,7 @@
 #include "hw/virtio/vhost-iova-tree.h"
 #include "hw/virtio/vhost-shadow-virtqueue.h"
 #include "hw/virtio/virtio.h"
+#include "qemu/thread.h"
 #include "standard-headers/linux/vhost_types.h"
 
 /*
@@ -30,6 +31,12 @@ typedef struct VhostVDPAHostNotifier {
     void *addr;
 } VhostVDPAHostNotifier;
 
+typedef struct VhostVDPAMapThread {
+    QemuThread thread;
+    GAsyncQueue *queue;
+    bool map_thread_enabled;
+} VhostVDPAMapThread;
+
 /* Info shared by all vhost_vdpa device models */
 typedef struct vhost_vdpa_shared {
     int device_fd;
@@ -43,8 +50,27 @@
     /* Copy of backend features */
     uint64_t backend_cap;
 
+    /*
+     * Thread to map memory in QEMU incoming migration.
+     *
+     * Incoming migration calls devices ->load_setup in the main thread, but
+     * map operations can take a long time with the BQL acquired. This
+     * forbids QEMU to serve other requests like QMP.
+     *
+     * To solve it, offload the first listener operations until the first
+     * listener commit from the main thread.  Once these are served, join the
+     * map thread.
+     */
+    VhostVDPAMapThread *map_thread;
+
     bool iotlb_batch_begin_sent;
 
+    /*
+     * The memory listener has been registered, so DMA maps have been sent to
+     * the device.
+     */
+    bool listener_registered;
+
     /* Vdpa must send shadow addresses as IOTLB key for data queues, not GPA */
     bool shadow_data;
 } VhostVDPAShared;
@@ -73,6 +99,8 @@ int vhost_vdpa_dma_map(VhostVDPAShared *s, uint32_t asid, hwaddr iova,
                        hwaddr size, void *vaddr, bool readonly);
 int vhost_vdpa_dma_unmap(VhostVDPAShared *s, uint32_t asid, hwaddr iova,
                          hwaddr size);
+int vhost_vdpa_load_setup(VhostVDPAShared *s, AddressSpace *dma_as);
+int vhost_vdpa_load_cleanup(VhostVDPAShared *s, bool vhost_will_start);
 
 typedef struct vdpa_iommu {
     VhostVDPAShared *dev_shared;
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 43f7c382b1..339e11c58a 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -101,6 +101,16 @@ int vhost_vdpa_dma_map(VhostVDPAShared *s, uint32_t asid, hwaddr iova,
     msg.iotlb.perm = readonly ? VHOST_ACCESS_RO : VHOST_ACCESS_RW;
     msg.iotlb.type = VHOST_IOTLB_UPDATE;
 
+    if (s->map_thread && s->map_thread->map_thread_enabled &&
+        !qemu_thread_is_self(&s->map_thread->thread)) {
+        struct vhost_msg_v2 *new_msg = g_new(struct vhost_msg_v2, 1);
+
+        *new_msg = msg;
+        g_async_queue_push(s->map_thread->queue, new_msg);
+
+        return 0;
+    }
+
     trace_vhost_vdpa_dma_map(s, fd, msg.type, msg.asid, msg.iotlb.iova,
                              msg.iotlb.size, msg.iotlb.uaddr, msg.iotlb.perm,
                              msg.iotlb.type);
@@ -131,6 +141,16 @@ int vhost_vdpa_dma_unmap(VhostVDPAShared *s, uint32_t asid, hwaddr iova,
     msg.iotlb.size = size;
     msg.iotlb.type = VHOST_IOTLB_INVALIDATE;
 
+    if (s->map_thread && s->map_thread->map_thread_enabled &&
+        !qemu_thread_is_self(&s->map_thread->thread)) {
+        struct vhost_msg_v2 *new_msg = g_new(struct vhost_msg_v2, 1);
+
+        *new_msg = msg;
+        g_async_queue_push(s->map_thread->queue, new_msg);
+
+        return 0;
+    }
+
     trace_vhost_vdpa_dma_unmap(s, fd, msg.type, msg.asid, msg.iotlb.iova,
                                msg.iotlb.size, msg.iotlb.type);
 
@@ -156,6 +176,16 @@ static void vhost_vdpa_iotlb_batch_begin_once(VhostVDPAShared *s)
         return;
     }
 
+    if (s->map_thread && s->map_thread->map_thread_enabled &&
+        !qemu_thread_is_self(&s->map_thread->thread)) {
+        struct vhost_msg_v2 *new_msg = g_new(struct vhost_msg_v2, 1);
+
+        *new_msg = msg;
+        g_async_queue_push(s->map_thread->queue, new_msg);
+
+        return;
+    }
+
     trace_vhost_vdpa_listener_begin_batch(s, fd, msg.type, msg.iotlb.type);
     if (write(fd, &msg, sizeof(msg)) != sizeof(msg)) {
         error_report("failed to write, fd=%d, errno=%d (%s)",
@@ -180,6 +210,17 @@ static void vhost_vdpa_dma_end_batch(VhostVDPAShared *s)
     msg.type = VHOST_IOTLB_MSG_V2;
     msg.iotlb.type = VHOST_IOTLB_BATCH_END;
 
+    if (s->map_thread && s->map_thread->map_thread_enabled &&
+        !qemu_thread_is_self(&s->map_thread->thread)) {
+        struct vhost_msg_v2 *new_msg = g_new(struct vhost_msg_v2, 1);
+
+        *new_msg = msg;
+        g_async_queue_push(s->map_thread->queue, new_msg);
+        s->map_thread->map_thread_enabled = false;
+
+        return;
+    }
+
     trace_vhost_vdpa_listener_commit(s, fd, msg.type, msg.iotlb.type);
     if (write(fd, &msg, sizeof(msg)) != sizeof(msg)) {
         error_report("failed to write, fd=%d, errno=%d (%s)",
@@ -1288,6 +1329,88 @@ static void vhost_vdpa_suspend(struct vhost_dev *dev)
     vhost_vdpa_reset_device(dev);
 }
 
+static void *vhost_vdpa_load_map_worker(void *opaque)
+{
+    VhostVDPAShared *shared = opaque;
+    GPtrArray *ret = NULL;
+
+    while (true) {
+        g_autofree struct vhost_msg_v2 *msg = NULL;
+        int r = 0;
+
+        msg = g_async_queue_pop(shared->map_thread->queue);
+        switch (msg->iotlb.type) {
+        case VHOST_IOTLB_UPDATE:
+            r = vhost_vdpa_dma_map(shared, msg->asid, msg->iotlb.iova,
+                                   msg->iotlb.size,
+                                   (void *)(uintptr_t)msg->iotlb.uaddr,
+                                   msg->iotlb.perm == VHOST_ACCESS_RO);
+            break;
+        case VHOST_IOTLB_INVALIDATE:
+            r = vhost_vdpa_dma_unmap(shared, msg->asid, msg->iotlb.iova,
+                                     msg->iotlb.size);
+            break;
+        case VHOST_IOTLB_BATCH_BEGIN:
+            vhost_vdpa_iotlb_batch_begin_once(shared);
+            break;
+        case VHOST_IOTLB_BATCH_END:
+            vhost_vdpa_dma_end_batch(shared);
+            goto end;
+        default:
+            error_report("Invalid IOTLB msg type %d", msg->iotlb.type);
+            break;
+        };
+
+        if (unlikely(r != 0)) {
+            /* Add to failed iotlbs so we can remove it from iova_tree */
+            if (ret == NULL) {
+                ret = g_ptr_array_new_full(1, g_free);
+            }
+
+            g_ptr_array_add(ret, g_steal_pointer(&msg));
+        }
+    }
+
+end:
+    return ret;
+}
+
+static void vhost_vdpa_spawn_maps_thread(VhostVDPAShared *shared)
+{
+    shared->map_thread = g_new0(VhostVDPAMapThread, 1);
+    shared->map_thread->queue = g_async_queue_new();
+    qemu_thread_create(&shared->map_thread->thread, "vdpa map thread",
+                       vhost_vdpa_load_map_worker, shared,
+                       QEMU_THREAD_JOINABLE);
+    shared->map_thread->map_thread_enabled = true;
+}
+
+static bool vhost_vdpa_join_maps_thread(VhostVDPAShared *shared)
+{
+    g_autoptr(GPtrArray) failed_iova = NULL;
+
+    failed_iova = qemu_thread_join(&shared->map_thread->thread);
+    g_async_queue_unref(shared->map_thread->queue);
+    g_clear_pointer(&shared->map_thread, g_free);
+
+    if (likely(!failed_iova)) {
+        return true;
+    }
+
+    /* If it is a failed IOVA, abort starting */
+    for (size_t i = 0; i < failed_iova->len; ++i) {
+        struct vhost_msg_v2 *msg = g_ptr_array_index(failed_iova, i);
+        DMAMap mem_region = {
+            .iova = msg->iotlb.iova,
+            .size = msg->iotlb.size - 1, /* Inclusive */
+        };
+
+        vhost_iova_tree_remove(shared->iova_tree, mem_region);
+    }
+
+    return false;
+}
+
 static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
 {
     struct vhost_vdpa *v = dev->opaque;
@@ -1315,7 +1438,15 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
                          "IOMMU and try again");
             return -1;
         }
-        memory_listener_register(&v->shared->listener, dev->vdev->dma_as);
+        if (!v->shared->listener_registered) {
+            memory_listener_register(&v->shared->listener, dev->vdev->dma_as);
+            v->shared->listener_registered = true;
+        } else {
+            ok = vhost_vdpa_join_maps_thread(v->shared);
+            if (unlikely(!ok)) {
+                goto out_stop;
+            }
+        }
 
         return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
     }
@@ -1340,6 +1471,8 @@ static void vhost_vdpa_reset_status(struct vhost_dev *dev)
     vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
                                VIRTIO_CONFIG_S_DRIVER);
     memory_listener_unregister(&v->shared->listener);
+    v->shared->listener_registered = false;
+
 }
 
 static int vhost_vdpa_set_log_base(struct vhost_dev *dev, uint64_t base,
@@ -1522,3 +1655,34 @@ const VhostOps vdpa_ops = {
     .vhost_set_config_call = vhost_vdpa_set_config_call,
     .vhost_reset_status = vhost_vdpa_reset_status,
 };
+
+int vhost_vdpa_load_setup(VhostVDPAShared *shared, AddressSpace *dma_as)
+{
+    uint8_t s = VIRTIO_CONFIG_S_ACKNOWLEDGE | VIRTIO_CONFIG_S_DRIVER;
+    int r = ioctl(shared->device_fd, VHOST_VDPA_SET_STATUS, &s);
+    if (unlikely(r < 0)) {
+        return r;
+    }
+
+    vhost_vdpa_spawn_maps_thread(shared);
+    memory_listener_register(&shared->listener, dma_as);
+    shared->listener_registered = true;
+    return 0;
+}
+
+int vhost_vdpa_load_cleanup(VhostVDPAShared *shared, bool vhost_will_start)
+{
+    if (shared->map_thread && !shared->map_thread->map_thread_enabled) {
+        return 0;
+    }
+
+    if (vhost_will_start) {
+        /*
+         * Delegate the join of map thread to vhost_vdpa_dev_start, as it runs
+         * out of main qemu lock.
+         */
+        return 0;
+    }
+
+    return vhost_vdpa_join_maps_thread(shared) ? 0 : -1;
+}
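For context, a hypothetical sketch of how the net backend could drive these two helpers from the incoming-migration handlers (the actual wiring is done by other patches in this series; the handler names and the SaveVMHandlers hookup below are illustrative only):

    /* Hypothetical .load_setup / .load_cleanup wiring; names are made up. */
    static int vhost_vdpa_net_load_setup(QEMUFile *f, void *opaque)
    {
        VhostVDPAShared *shared = opaque;

        /*
         * Spawns the map thread and registers the memory listener, so DMA
         * maps (and the costly pinning) start while the migration stream is
         * still being received.
         */
        return vhost_vdpa_load_setup(shared, &address_space_memory);
    }

    static int vhost_vdpa_net_load_cleanup(void *opaque)
    {
        VhostVDPAShared *shared = opaque;

        /*
         * If the device never started, undo the speculative maps; if it did
         * start, vhost_vdpa_dev_start already joined the thread and this
         * returns early.
         */
        return vhost_vdpa_load_cleanup(shared, false);
    }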