diff mbox series

[for,9.0,08/12] vdpa: add vhost_vdpa_load_setup

Message ID 20231215172830.2540987-9-eperezma@redhat.com (mailing list archive)
State New, archived
Headers show
Series Map memory at destination .load_setup in vDPA-net migration | expand

Commit Message

Eugenio Perez Martin Dec. 15, 2023, 5:28 p.m. UTC
Callers can use this function to setup the incoming migration thread.

This thread is able to map the guest memory while the migration is
ongoing, without blocking QMP or other important tasks. While this
allows the destination QEMU not to block, it expands the mapping time
during migration instead of making it pre-migration.

This thread joins at vdpa backend device start, so it could happen that
the guest memory is so large that we still have guest memory to map
before this time.  This can be improved in later iterations, when the
destination device can inform QEMU that it is not ready to complete the
migration.

If the device is not started, the clean of the mapped memory is done at
.load_cleanup.  This is far from ideal, as the destination machine has
mapped all the guest ram for nothing, and now it needs to unmap it.
However, we don't have information about the state of the device so its
the best we can do.  Once iterative migration is supported, this will be
improved as we know the virtio state of the device.

If the VM migrates before finishing all the maps, the source will stop
but the destination is still not ready to continue, and it will wait
until all guest RAM is mapped.  It is still an improvement over doing
all the map when the migration finish, but next patches use the
switchover_ack method to prevent source to stop until all the memory is
mapped at the destination.

The memory unmapping if the device is not started is weird
too, as ideally nothing would be mapped.  This can be fixed when we
migrate the device state iteratively, and we know for sure if the device
is started or not.  At this moment we don't have such information so
there is no better alternative.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>

---
v1:
* Moved all the map thread members to another struct than
  VhostVDPAShared, so its memory and usage can be free from the moment
  the thread joins.

RFC v2:
* Use a dedicated thread for map instead of doing all in .load_setup,
  blocking QMP etc.

Other options considered:
* ThreadPool: It is hard to keep the order of the requests in this form.
  To send the worker thread to threadpool is also nice as the number of
  qemu threads can be controlled that way, but then vhost_vdpa_dev_map
  and similar cannot know if they are in the worker thread by checking
  with pthread_self. We can split the dma_map and similar in two
  functions, one checking if we can offload to thread map and other
  without it.
  Synchronization with qemu_thread_join is also lost. A semaphore,
  QemuEvent, or plain Mutex + Cond would be needed to synchronize list
  access, complicating the solution. Maps are heavy enough.
  Using coroutines have the same result.
* QIOTask: This is very nice as it does even have the task cleaner to
  run in the main thread, and I have a model to follow in
  dns-resolver.c. However, I still need to check if I'm in the main
  thread or not.

To be able to get rid of qemu_thread_is_self, we could create another
set of .dma_map_sync, .dma_unmap_sync, etc, and call the
(potentially) async versions from the listener or the 100% sure sync
version from the map thread. This makes the code more complex in my
opinion.

Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost-vdpa.h |  28 ++++++
 hw/virtio/vhost-vdpa.c         | 166 ++++++++++++++++++++++++++++++++-
 2 files changed, 193 insertions(+), 1 deletion(-)

Comments

Jason Wang Dec. 20, 2023, 5:21 a.m. UTC | #1
On Sat, Dec 16, 2023 at 1:28 AM Eugenio Pérez <eperezma@redhat.com> wrote:
>
> Callers can use this function to setup the incoming migration thread.
>
> This thread is able to map the guest memory while the migration is
> ongoing, without blocking QMP or other important tasks. While this
> allows the destination QEMU not to block, it expands the mapping time
> during migration instead of making it pre-migration.

If it's just QMP, can we simply use bh with a quota here?

Btw, have you measured the hotspot that causes such slowness? Is it
pinning or vendor specific mapping that slows down the progress? Or if
VFIO has a similar issue?

>
> This thread joins at vdpa backend device start, so it could happen that
> the guest memory is so large that we still have guest memory to map
> before this time.

So we would still hit the QMP stall in this case?

> This can be improved in later iterations, when the
> destination device can inform QEMU that it is not ready to complete the
> migration.
>
> If the device is not started, the clean of the mapped memory is done at
> .load_cleanup.  This is far from ideal, as the destination machine has
> mapped all the guest ram for nothing, and now it needs to unmap it.
> However, we don't have information about the state of the device so its
> the best we can do.  Once iterative migration is supported, this will be
> improved as we know the virtio state of the device.
>
> If the VM migrates before finishing all the maps, the source will stop
> but the destination is still not ready to continue, and it will wait
> until all guest RAM is mapped.  It is still an improvement over doing
> all the map when the migration finish, but next patches use the
> switchover_ack method to prevent source to stop until all the memory is
> mapped at the destination.
>
> The memory unmapping if the device is not started is weird
> too, as ideally nothing would be mapped.  This can be fixed when we
> migrate the device state iteratively, and we know for sure if the device
> is started or not.  At this moment we don't have such information so
> there is no better alternative.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>
> ---

Thanks
Eugenio Perez Martin Dec. 20, 2023, 7:06 a.m. UTC | #2
On Wed, Dec 20, 2023 at 6:22 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Sat, Dec 16, 2023 at 1:28 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> >
> > Callers can use this function to setup the incoming migration thread.
> >
> > This thread is able to map the guest memory while the migration is
> > ongoing, without blocking QMP or other important tasks. While this
> > allows the destination QEMU not to block, it expands the mapping time
> > during migration instead of making it pre-migration.
>
> If it's just QMP, can we simply use bh with a quota here?
>

Because QEMU cannot guarantee the quota at write(fd,
VHOST_IOTLB_UPDATE, ...). Also, synchronization with
vhost_vdpa_dev_start would complicate as it would need to be
re-scheduled too.

As a half-baked idea, we can split the mapping chunks in manageable
sizes, but I don't like that idea a lot.

> Btw, have you measured the hotspot that causes such slowness? Is it
> pinning or vendor specific mapping that slows down the progress? Or if
> VFIO has a similar issue?
>

Si-Wei did the actual profiling as he is the one with the 128G guests,
but most of the time was spent in the memory pinning. Si-Wei, please
correct me if I'm wrong.

I didn't check VFIO, but I think it just maps at realize phase with
vfio_realize -> vfio_attach_device -> vfio_connect_container(). In
previous testings, this delayed the VM initialization by a lot, as
we're moving that 20s of blocking to every VM start.

Investigating a way to do it only in the case of being the destination
of a live migration, I think the right place is .load_setup migration
handler. But I'm ok to move it for sure.

> >
> > This thread joins at vdpa backend device start, so it could happen that
> > the guest memory is so large that we still have guest memory to map
> > before this time.
>
> So we would still hit the QMP stall in this case?
>

This paragraph is kind of outdated, sorry. I can only cause this if I
don't enable switchover_ack migration capability and if I artificially
make memory pinning in the kernel artificially slow. But I didn't
check QMP to be honest, so I can try to test it, yes.

If QMP is not responsive, that means QMP is not responsive in QEMU
master in that period actually. So we're only improving anyway.

Thanks!

> > This can be improved in later iterations, when the
> > destination device can inform QEMU that it is not ready to complete the
> > migration.
> >
> > If the device is not started, the clean of the mapped memory is done at
> > .load_cleanup.  This is far from ideal, as the destination machine has
> > mapped all the guest ram for nothing, and now it needs to unmap it.
> > However, we don't have information about the state of the device so its
> > the best we can do.  Once iterative migration is supported, this will be
> > improved as we know the virtio state of the device.
> >
> > If the VM migrates before finishing all the maps, the source will stop
> > but the destination is still not ready to continue, and it will wait
> > until all guest RAM is mapped.  It is still an improvement over doing
> > all the map when the migration finish, but next patches use the
> > switchover_ack method to prevent source to stop until all the memory is
> > mapped at the destination.
> >
> > The memory unmapping if the device is not started is weird
> > too, as ideally nothing would be mapped.  This can be fixed when we
> > migrate the device state iteratively, and we know for sure if the device
> > is started or not.  At this moment we don't have such information so
> > there is no better alternative.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >
> > ---
>
> Thanks
>
Jason Wang Dec. 21, 2023, 2:17 a.m. UTC | #3
On Wed, Dec 20, 2023 at 3:07 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Wed, Dec 20, 2023 at 6:22 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Sat, Dec 16, 2023 at 1:28 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> > >
> > > Callers can use this function to setup the incoming migration thread.
> > >
> > > This thread is able to map the guest memory while the migration is
> > > ongoing, without blocking QMP or other important tasks. While this
> > > allows the destination QEMU not to block, it expands the mapping time
> > > during migration instead of making it pre-migration.
> >
> > If it's just QMP, can we simply use bh with a quota here?
> >
>
> Because QEMU cannot guarantee the quota at write(fd,
> VHOST_IOTLB_UPDATE, ...).

So you mean the delay may be caused by a single syscall?

> Also, synchronization with
> vhost_vdpa_dev_start would complicate as it would need to be
> re-scheduled too.

Just a flush of the bh, or not?

But another question. How to synchronize with the memory API in this
case. Currently the updating (without vIOMMU) is done under the
listener callback.

Usually after the commit, Qemu may think the memory topology has been
updated. If it is done asynchronously, would we have any problem?

>
> As a half-baked idea, we can split the mapping chunks in manageable
> sizes, but I don't like that idea a lot.
>
> > Btw, have you measured the hotspot that causes such slowness? Is it
> > pinning or vendor specific mapping that slows down the progress? Or if
> > VFIO has a similar issue?
> >
>
> Si-Wei did the actual profiling as he is the one with the 128G guests,
> but most of the time was spent in the memory pinning. Si-Wei, please
> correct me if I'm wrong.
>
> I didn't check VFIO, but I think it just maps at realize phase with
> vfio_realize -> vfio_attach_device -> vfio_connect_container(). In
> previous testings, this delayed the VM initialization by a lot, as
> we're moving that 20s of blocking to every VM start.
>
> Investigating a way to do it only in the case of being the destination
> of a live migration, I think the right place is .load_setup migration
> handler. But I'm ok to move it for sure.

Adding Peter for more ideas.

>
> > >
> > > This thread joins at vdpa backend device start, so it could happen that
> > > the guest memory is so large that we still have guest memory to map
> > > before this time.
> >
> > So we would still hit the QMP stall in this case?
> >
>
> This paragraph is kind of outdated, sorry. I can only cause this if I
> don't enable switchover_ack migration capability and if I artificially
> make memory pinning in the kernel artificially slow. But I didn't
> check QMP to be honest, so I can try to test it, yes.
>
> If QMP is not responsive, that means QMP is not responsive in QEMU
> master in that period actually. So we're only improving anyway.
>
> Thanks!
>

Thanks
Eugenio Perez Martin Dec. 21, 2023, 8:20 a.m. UTC | #4
On Thu, Dec 21, 2023 at 3:17 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, Dec 20, 2023 at 3:07 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Wed, Dec 20, 2023 at 6:22 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Sat, Dec 16, 2023 at 1:28 AM Eugenio Pérez <eperezma@redhat.com> wrote:
> > > >
> > > > Callers can use this function to setup the incoming migration thread.
> > > >
> > > > This thread is able to map the guest memory while the migration is
> > > > ongoing, without blocking QMP or other important tasks. While this
> > > > allows the destination QEMU not to block, it expands the mapping time
> > > > during migration instead of making it pre-migration.
> > >
> > > If it's just QMP, can we simply use bh with a quota here?
> > >
> >
> > Because QEMU cannot guarantee the quota at write(fd,
> > VHOST_IOTLB_UPDATE, ...).
>
> So you mean the delay may be caused by a single syscall?
>

Mostly yes, the iotlb write() that maps of all the guest memory.

> > Also, synchronization with
> > vhost_vdpa_dev_start would complicate as it would need to be
> > re-scheduled too.
>
> Just a flush of the bh, or not?
>

Let me put it differently: to map the guest memory, vhost_vdpa_dma_map
is called because the guest starts the device by a PCI write to the
device status:
#0  vhost_vdpa_dma_map (s=0x5555570e0e60, asid=0, iova=0, size=786432,
vaddr=0x7fff40000000, readonly=false)
    at ../hw/virtio/vhost-vdpa.c:93
#1  0x0000555555979451 in vhost_vdpa_listener_region_add
(listener=0x5555570e0e68, section=0x7fffee5bc0d0) at
../hw/virtio/vhost-vdpa.c:415
#2  0x0000555555b3c543 in listener_add_address_space
(listener=0x5555570e0e68, as=0x555556db72e0 <address_space_memory>)
    at ../system/memory.c:3011
#3  0x0000555555b3c996 in memory_listener_register
(listener=0x5555570e0e68, as=0x555556db72e0 <address_space_memory>)
    at ../system/memory.c:3081
#4  0x000055555597be03 in vhost_vdpa_dev_start (dev=0x5555570e1310,
started=true) at ../hw/virtio/vhost-vdpa.c:1460
#5  0x00005555559734c2 in vhost_dev_start (hdev=0x5555570e1310,
vdev=0x5555584b2c80, vrings=false) at ../hw/virtio/vhost.c:2058
#6  0x0000555555854ec8 in vhost_net_start_one (net=0x5555570e1310,
dev=0x5555584b2c80) at ../hw/net/vhost_net.c:274
#7  0x00005555558554ca in vhost_net_start (dev=0x5555584b2c80,
ncs=0x5555584c8278, data_queue_pairs=1, cvq=1) at
../hw/net/vhost_net.c:415
#8  0x0000555555ace7a5 in virtio_net_vhost_status (n=0x5555584b2c80,
status=15 '\017') at ../hw/net/virtio-net.c:310
#9  0x0000555555acea50 in virtio_net_set_status (vdev=0x5555584b2c80,
status=15 '\017') at ../hw/net/virtio-net.c:391
#10 0x0000555555b06fee in virtio_set_status (vdev=0x5555584b2c80,
val=15 '\017') at ../hw/virtio/virtio.c:2048
#11 0x000055555595d667 in virtio_pci_common_write
(opaque=0x5555584aa8b0, addr=20, val=15, size=1) at
../hw/virtio/virtio-pci.c:1580
#12 0x0000555555b351c1 in memory_region_write_accessor
(mr=0x5555584ab3f0, addr=20, value=0x7fffee5bc4c8, size=1, shift=0,
mask=255,
    attrs=...) at ../system/memory.c:497
#13 0x0000555555b354c5 in access_with_adjusted_size (addr=20,
value=0x7fffee5bc4c8, size=1, access_size_min=1, access_size_max=4,
    access_fn=0x555555b350cf <memory_region_write_accessor>,
mr=0x5555584ab3f0, attrs=...) at ../system/memory.c:573
#14 0x0000555555b3856f in memory_region_dispatch_write
(mr=0x5555584ab3f0, addr=20, data=15, op=MO_8, attrs=...) at
../system/memory.c:1521
#15 0x0000555555b45885 in flatview_write_continue (fv=0x7fffd8122b80,
addr=4227858452, attrs=..., ptr=0x7ffff7ff0028, len=1, addr1=20,
    l=1, mr=0x5555584ab3f0) at ../system/physmem.c:2714
#16 0x0000555555b459e8 in flatview_write (fv=0x7fffd8122b80,
addr=4227858452, attrs=..., buf=0x7ffff7ff0028, len=1)
    at ../system/physmem.c:2756
#17 0x0000555555b45d9a in address_space_write (as=0x555556db72e0
<address_space_memory>, addr=4227858452, attrs=...,
buf=0x7ffff7ff0028,
    len=1) at ../system/physmem.c:2863
#18 0x0000555555b45e07 in address_space_rw (as=0x555556db72e0
<address_space_memory>, addr=4227858452, attrs=...,
buf=0x7ffff7ff0028,
    len=1, is_write=true) at ../system/physmem.c:2873
#19 0x0000555555b5eb30 in kvm_cpu_exec (cpu=0x5555571258f0) at
../accel/kvm/kvm-all.c:2915
#20 0x0000555555b61798 in kvm_vcpu_thread_fn (arg=0x5555571258f0) at
../accel/kvm/kvm-accel-ops.c:51
#21 0x0000555555d384b7 in qemu_thread_start (args=0x55555712c390) at
../util/qemu-thread-posix.c:541
#22 0x00007ffff580814a in start_thread () from /lib64/libpthread.so.0
#23 0x00007ffff54fcf23 in clone () from /lib64/libc.so.6

Can we reschedule that map to a bh without returning the control to the vCPU?

> But another question. How to synchronize with the memory API in this
> case. Currently the updating (without vIOMMU) is done under the
> listener callback.
>
> Usually after the commit, Qemu may think the memory topology has been
> updated. If it is done asynchronously, would we have any problem?
>

The function vhost_vdpa_process_iotlb_msg in the kernel has its own
lock. So two QEMU threads can map memory independently and they get
serialized.

For the write() caller, it is like the call takes more time, but there
are no deadlocks or similar.

> >
> > As a half-baked idea, we can split the mapping chunks in manageable
> > sizes, but I don't like that idea a lot.
> >
> > > Btw, have you measured the hotspot that causes such slowness? Is it
> > > pinning or vendor specific mapping that slows down the progress? Or if
> > > VFIO has a similar issue?
> > >
> >
> > Si-Wei did the actual profiling as he is the one with the 128G guests,
> > but most of the time was spent in the memory pinning. Si-Wei, please
> > correct me if I'm wrong.
> >
> > I didn't check VFIO, but I think it just maps at realize phase with
> > vfio_realize -> vfio_attach_device -> vfio_connect_container(). In
> > previous testings, this delayed the VM initialization by a lot, as
> > we're moving that 20s of blocking to every VM start.
> >
> > Investigating a way to do it only in the case of being the destination
> > of a live migration, I think the right place is .load_setup migration
> > handler. But I'm ok to move it for sure.
>
> Adding Peter for more ideas.
>

Appreciated :).

Thanks!

> >
> > > >
> > > > This thread joins at vdpa backend device start, so it could happen that
> > > > the guest memory is so large that we still have guest memory to map
> > > > before this time.
> > >
> > > So we would still hit the QMP stall in this case?
> > >
> >
> > This paragraph is kind of outdated, sorry. I can only cause this if I
> > don't enable switchover_ack migration capability and if I artificially
> > make memory pinning in the kernel artificially slow. But I didn't
> > check QMP to be honest, so I can try to test it, yes.
> >
> > If QMP is not responsive, that means QMP is not responsive in QEMU
> > master in that period actually. So we're only improving anyway.
> >
> > Thanks!
> >
>
> Thanks
>
Peter Xu Jan. 2, 2024, 5:33 a.m. UTC | #5
Jason, Eugenio,

Apologies for a late reply; just back from the long holiday.

On Thu, Dec 21, 2023 at 09:20:40AM +0100, Eugenio Perez Martin wrote:
> Si-Wei did the actual profiling as he is the one with the 128G guests,
> but most of the time was spent in the memory pinning. Si-Wei, please
> correct me if I'm wrong.

IIUC we're talking about no-vIOMMU use case.  The pinning should indeed
take a lot of time if it's similar to what VFIO does.

>
> I didn't check VFIO, but I think it just maps at realize phase with
> vfio_realize -> vfio_attach_device -> vfio_connect_container(). In
> previous testings, this delayed the VM initialization by a lot, as
> we're moving that 20s of blocking to every VM start.
>
> Investigating a way to do it only in the case of being the destination
> of a live migration, I think the right place is .load_setup migration
> handler. But I'm ok to move it for sure.

If it's destined to map the 128G, it does sound sensible to me to do it
when VM starts, rather than anytime afterwards.

Could anyone help to explain what's the problem if vDPA maps 128G at VM
init just like what VFIO does?

Thanks,
Eugenio Perez Martin Jan. 2, 2024, 11:28 a.m. UTC | #6
On Tue, Jan 2, 2024 at 6:33 AM Peter Xu <peterx@redhat.com> wrote:
>
> Jason, Eugenio,
>
> Apologies for a late reply; just back from the long holiday.
>
> On Thu, Dec 21, 2023 at 09:20:40AM +0100, Eugenio Perez Martin wrote:
> > Si-Wei did the actual profiling as he is the one with the 128G guests,
> > but most of the time was spent in the memory pinning. Si-Wei, please
> > correct me if I'm wrong.
>
> IIUC we're talking about no-vIOMMU use case.  The pinning should indeed
> take a lot of time if it's similar to what VFIO does.
>
> >
> > I didn't check VFIO, but I think it just maps at realize phase with
> > vfio_realize -> vfio_attach_device -> vfio_connect_container(). In
> > previous testings, this delayed the VM initialization by a lot, as
> > we're moving that 20s of blocking to every VM start.
> >
> > Investigating a way to do it only in the case of being the destination
> > of a live migration, I think the right place is .load_setup migration
> > handler. But I'm ok to move it for sure.
>
> If it's destined to map the 128G, it does sound sensible to me to do it
> when VM starts, rather than anytime afterwards.
>

Just for completion, it is not 100% sure the driver will start the
device. But it is likely for sure.

> Could anyone help to explain what's the problem if vDPA maps 128G at VM
> init just like what VFIO does?
>

The main problem was the delay of VM start. In the master branch, the
pinning is done when the driver starts the device. While it takes the
BQL, the rest of the vCPUs can move work forward while the host is
pinning. So the impact of it is not so evident.

To move it to initialization time made it very noticeable. To make
things worse, QEMU did not respond to QMP commands and similar. That's
why it was done only if the VM was the destination of a LM.

However, we've added the memory map thread in this version, so this
might not be a problem anymore. We could move the spawn of the thread
to initialization time.

But how to undo this pinning in the case the guest does not start the
device? In this series, this is done at the destination with
vhost_vdpa_load_cleanup. Or is it ok to just keep the memory mapped as
long as QEMU has the vDPA device?

Thanks!
Peter Xu Jan. 3, 2024, 6:16 a.m. UTC | #7
On Tue, Jan 02, 2024 at 12:28:48PM +0100, Eugenio Perez Martin wrote:
> On Tue, Jan 2, 2024 at 6:33 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > Jason, Eugenio,
> >
> > Apologies for a late reply; just back from the long holiday.
> >
> > On Thu, Dec 21, 2023 at 09:20:40AM +0100, Eugenio Perez Martin wrote:
> > > Si-Wei did the actual profiling as he is the one with the 128G guests,
> > > but most of the time was spent in the memory pinning. Si-Wei, please
> > > correct me if I'm wrong.
> >
> > IIUC we're talking about no-vIOMMU use case.  The pinning should indeed
> > take a lot of time if it's similar to what VFIO does.
> >
> > >
> > > I didn't check VFIO, but I think it just maps at realize phase with
> > > vfio_realize -> vfio_attach_device -> vfio_connect_container(). In
> > > previous testings, this delayed the VM initialization by a lot, as
> > > we're moving that 20s of blocking to every VM start.
> > >
> > > Investigating a way to do it only in the case of being the destination
> > > of a live migration, I think the right place is .load_setup migration
> > > handler. But I'm ok to move it for sure.
> >
> > If it's destined to map the 128G, it does sound sensible to me to do it
> > when VM starts, rather than anytime afterwards.
> >
> 
> Just for completion, it is not 100% sure the driver will start the
> device. But it is likely for sure.

My understanding is that vDPA is still a quite special device, assuming
only targeting advanced users, and should not appear in a default config
for anyone.  It means the user should hopefully remove the device if the
guest is not using it, instead of worrying on a slow boot.

> 
> > Could anyone help to explain what's the problem if vDPA maps 128G at VM
> > init just like what VFIO does?
> >
> 
> The main problem was the delay of VM start. In the master branch, the
> pinning is done when the driver starts the device. While it takes the
> BQL, the rest of the vCPUs can move work forward while the host is
> pinning. So the impact of it is not so evident.
> 
> To move it to initialization time made it very noticeable. To make
> things worse, QEMU did not respond to QMP commands and similar. That's
> why it was done only if the VM was the destination of a LM.

Is that a major issue for us?  IIUC then VFIO shares the same condition.
If it's a real problem, do we want to have a solution that works for both
(or, is it possible)?

> 
> However, we've added the memory map thread in this version, so this
> might not be a problem anymore. We could move the spawn of the thread
> to initialization time.
> 
> But how to undo this pinning in the case the guest does not start the
> device? In this series, this is done at the destination with
> vhost_vdpa_load_cleanup. Or is it ok to just keep the memory mapped as
> long as QEMU has the vDPA device?

I think even if vDPA decides to use a thread, we should keep the same
behavior before/after the migration.  Having assymetric behavior over DMA
from the assigned HWs might have unpredictable implications.

What I worry is we may over-optimize / over-engineer the case where the
user will specify the vDPA device but not use it, as I mentioned above.

For the long term, maybe there's chance to optimize DMA pinning for both
vdpa/vfio use cases, then we can always pin them during VM starts? Assuming
that issue only exists for large VMs, while they should normally be good
candidates for huge pages already.  Then, it means maybe one folio/page can
cover a large range (e.g. 1G on x86_64) in one pin, and physical continuity
also provides possibility of IOMMU large page mappings.  I didn't check at
which stage we are for VFIO on this, Alex may know better. I'm copying Alex
anyway since the problem seems to be a common one already, so maybe he has
some thoughts.

Thanks,
Eugenio Perez Martin Jan. 3, 2024, 11:11 a.m. UTC | #8
On Wed, Jan 3, 2024 at 7:16 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Jan 02, 2024 at 12:28:48PM +0100, Eugenio Perez Martin wrote:
> > On Tue, Jan 2, 2024 at 6:33 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > Jason, Eugenio,
> > >
> > > Apologies for a late reply; just back from the long holiday.
> > >
> > > On Thu, Dec 21, 2023 at 09:20:40AM +0100, Eugenio Perez Martin wrote:
> > > > Si-Wei did the actual profiling as he is the one with the 128G guests,
> > > > but most of the time was spent in the memory pinning. Si-Wei, please
> > > > correct me if I'm wrong.
> > >
> > > IIUC we're talking about no-vIOMMU use case.  The pinning should indeed
> > > take a lot of time if it's similar to what VFIO does.
> > >
> > > >
> > > > I didn't check VFIO, but I think it just maps at realize phase with
> > > > vfio_realize -> vfio_attach_device -> vfio_connect_container(). In
> > > > previous testings, this delayed the VM initialization by a lot, as
> > > > we're moving that 20s of blocking to every VM start.
> > > >
> > > > Investigating a way to do it only in the case of being the destination
> > > > of a live migration, I think the right place is .load_setup migration
> > > > handler. But I'm ok to move it for sure.
> > >
> > > If it's destined to map the 128G, it does sound sensible to me to do it
> > > when VM starts, rather than anytime afterwards.
> > >
> >
> > Just for completion, it is not 100% sure the driver will start the
> > device. But it is likely for sure.
>
> My understanding is that vDPA is still a quite special device, assuming
> only targeting advanced users, and should not appear in a default config
> for anyone.  It means the user should hopefully remove the device if the
> guest is not using it, instead of worrying on a slow boot.
>
> >
> > > Could anyone help to explain what's the problem if vDPA maps 128G at VM
> > > init just like what VFIO does?
> > >
> >
> > The main problem was the delay of VM start. In the master branch, the
> > pinning is done when the driver starts the device. While it takes the
> > BQL, the rest of the vCPUs can move work forward while the host is
> > pinning. So the impact of it is not so evident.
> >
> > To move it to initialization time made it very noticeable. To make
> > things worse, QEMU did not respond to QMP commands and similar. That's
> > why it was done only if the VM was the destination of a LM.
>
> Is that a major issue for us?

To me it is a regression but I'm ok with it for sure.

>  IIUC then VFIO shares the same condition.
> If it's a real problem, do we want to have a solution that works for both
> (or, is it possible)?
>

I would not consider a regression for VFIO since I think it has
behaved that way from the beginning. But yes, I'm all in to find a
common solution.

> >
> > However, we've added the memory map thread in this version, so this
> > might not be a problem anymore. We could move the spawn of the thread
> > to initialization time.
> >
> > But how to undo this pinning in the case the guest does not start the
> > device? In this series, this is done at the destination with
> > vhost_vdpa_load_cleanup. Or is it ok to just keep the memory mapped as
> > long as QEMU has the vDPA device?
>
> I think even if vDPA decides to use a thread, we should keep the same
> behavior before/after the migration.  Having assymetric behavior over DMA
> from the assigned HWs might have unpredictable implications.
>
> What I worry is we may over-optimize / over-engineer the case where the
> user will specify the vDPA device but not use it, as I mentioned above.
>

I agree with all of the above. If it is ok to keep memory mapped while
the guest has not started I think we can move the spawn of the thread,
or even just the map write itself, to the vdpa init.

> For the long term, maybe there's chance to optimize DMA pinning for both
> vdpa/vfio use cases, then we can always pin them during VM starts? Assuming
> that issue only exists for large VMs, while they should normally be good
> candidates for huge pages already.  Then, it means maybe one folio/page can
> cover a large range (e.g. 1G on x86_64) in one pin, and physical continuity
> also provides possibility of IOMMU large page mappings.  I didn't check at
> which stage we are for VFIO on this, Alex may know better.

Sounds interesting, and I think it should be implemented. Thanks for
the pointer!

> I'm copying Alex
> anyway since the problem seems to be a common one already, so maybe he has
> some thoughts.
>

Appreciated :).

Thanks!
Peter Xu Jan. 4, 2024, 6:46 a.m. UTC | #9
On Wed, Jan 03, 2024 at 12:11:19PM +0100, Eugenio Perez Martin wrote:
> On Wed, Jan 3, 2024 at 7:16 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Tue, Jan 02, 2024 at 12:28:48PM +0100, Eugenio Perez Martin wrote:
> > > On Tue, Jan 2, 2024 at 6:33 AM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > Jason, Eugenio,
> > > >
> > > > Apologies for a late reply; just back from the long holiday.
> > > >
> > > > On Thu, Dec 21, 2023 at 09:20:40AM +0100, Eugenio Perez Martin wrote:
> > > > > Si-Wei did the actual profiling as he is the one with the 128G guests,
> > > > > but most of the time was spent in the memory pinning. Si-Wei, please
> > > > > correct me if I'm wrong.
> > > >
> > > > IIUC we're talking about no-vIOMMU use case.  The pinning should indeed
> > > > take a lot of time if it's similar to what VFIO does.
> > > >
> > > > >
> > > > > I didn't check VFIO, but I think it just maps at realize phase with
> > > > > vfio_realize -> vfio_attach_device -> vfio_connect_container(). In
> > > > > previous testings, this delayed the VM initialization by a lot, as
> > > > > we're moving that 20s of blocking to every VM start.
> > > > >
> > > > > Investigating a way to do it only in the case of being the destination
> > > > > of a live migration, I think the right place is .load_setup migration
> > > > > handler. But I'm ok to move it for sure.
> > > >
> > > > If it's destined to map the 128G, it does sound sensible to me to do it
> > > > when VM starts, rather than anytime afterwards.
> > > >
> > >
> > > Just for completion, it is not 100% sure the driver will start the
> > > device. But it is likely for sure.
> >
> > My understanding is that vDPA is still a quite special device, assuming
> > only targeting advanced users, and should not appear in a default config
> > for anyone.  It means the user should hopefully remove the device if the
> > guest is not using it, instead of worrying on a slow boot.
> >
> > >
> > > > Could anyone help to explain what's the problem if vDPA maps 128G at VM
> > > > init just like what VFIO does?
> > > >
> > >
> > > The main problem was the delay of VM start. In the master branch, the
> > > pinning is done when the driver starts the device. While it takes the
> > > BQL, the rest of the vCPUs can move work forward while the host is
> > > pinning. So the impact of it is not so evident.
> > >
> > > To move it to initialization time made it very noticeable. To make
> > > things worse, QEMU did not respond to QMP commands and similar. That's
> > > why it was done only if the VM was the destination of a LM.
> >
> > Is that a major issue for us?
> 
> To me it is a regression but I'm ok with it for sure.
> 
> >  IIUC then VFIO shares the same condition.
> > If it's a real problem, do we want to have a solution that works for both
> > (or, is it possible)?
> >
> 
> I would not consider a regression for VFIO since I think it has
> behaved that way from the beginning. But yes, I'm all in to find a
> common solution.
> 
> > >
> > > However, we've added the memory map thread in this version, so this
> > > might not be a problem anymore. We could move the spawn of the thread
> > > to initialization time.
> > >
> > > But how to undo this pinning in the case the guest does not start the
> > > device? In this series, this is done at the destination with
> > > vhost_vdpa_load_cleanup. Or is it ok to just keep the memory mapped as
> > > long as QEMU has the vDPA device?
> >
> > I think even if vDPA decides to use a thread, we should keep the same
> > behavior before/after the migration.  Having assymetric behavior over DMA
> > from the assigned HWs might have unpredictable implications.
> >
> > What I worry is we may over-optimize / over-engineer the case where the
> > user will specify the vDPA device but not use it, as I mentioned above.
> >
> 
> I agree with all of the above. If it is ok to keep memory mapped while
> the guest has not started I think we can move the spawn of the thread,
> or even just the map write itself, to the vdpa init.
> 
> > For the long term, maybe there's chance to optimize DMA pinning for both
> > vdpa/vfio use cases, then we can always pin them during VM starts? Assuming
> > that issue only exists for large VMs, while they should normally be good
> > candidates for huge pages already.  Then, it means maybe one folio/page can
> > cover a large range (e.g. 1G on x86_64) in one pin, and physical continuity
> > also provides possibility of IOMMU large page mappings.  I didn't check at
> > which stage we are for VFIO on this, Alex may know better.
> 
> Sounds interesting, and I think it should be implemented. Thanks for
> the pointer!

I didn't have an exact pointer previously, but to provide a pointer, I
think it can be something like this:

  physr discussion - Jason Gunthorpe
  https://www.youtube.com/watch?v=QftOTtks-pI&list=PLbzoR-pLrL6rlmdpJ3-oMgU_zxc1wAhjS&index=36

Since I have zero knowledge on vDPA side, I can only provide the exmaple
from VFIO and even if so that may not be fully accurate.  Basically afaiu
currently vfio is already smart enough to recognize continuous ranges (via
vfio_pin_pages_remote()):

		/* Pin a contiguous chunk of memory */
		npage = vfio_pin_pages_remote(dma, vaddr + dma->size,
					      size >> PAGE_SHIFT, &pfn, limit,
					      &batch);

But the pin can still be slow, due to e.g. we need a page* for each
PAGE_SIZE range.

I think something like physr can improve it.  That should correspond to the
1st slide of the video of the "performance" entry on pin_user_pages().

Thanks,

> 
> > I'm copying Alex
> > anyway since the problem seems to be a common one already, so maybe he has
> > some thoughts.
> >
> 
> Appreciated :).
> 
> Thanks!
>
diff mbox series

Patch

diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 8f54e5edd4..b49286b327 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -17,6 +17,7 @@ 
 #include "hw/virtio/vhost-iova-tree.h"
 #include "hw/virtio/vhost-shadow-virtqueue.h"
 #include "hw/virtio/virtio.h"
+#include "qemu/thread.h"
 #include "standard-headers/linux/vhost_types.h"
 
 /*
@@ -30,6 +31,12 @@  typedef struct VhostVDPAHostNotifier {
     void *addr;
 } VhostVDPAHostNotifier;
 
+typedef struct VhostVDPAMapThread {
+    QemuThread thread;
+    GAsyncQueue *queue;
+    bool map_thread_enabled;
+} VhostVDPAMapThread;
+
 /* Info shared by all vhost_vdpa device models */
 typedef struct vhost_vdpa_shared {
     int device_fd;
@@ -43,8 +50,27 @@  typedef struct vhost_vdpa_shared {
     /* Copy of backend features */
     uint64_t backend_cap;
 
+    /*
+     * Thread to map memory in QEMU incoming migration.
+     *
+     * Incoming migration calls devices ->load_setup in the main thread, but
+     * map operations can take a long time with BLQ acquired.  This forbids
+     * QEMU to serve other requests like QMP.
+     *
+     * To solve it, offload the first listener operations until the first
+     * listener commit from the main thread.  Once these are served, join the
+     * map thread.
+     */
+    VhostVDPAMapThread *map_thread;
+
     bool iotlb_batch_begin_sent;
 
+    /*
+     * The memory listener has been registered, so DMA maps have been sent to
+     * the device.
+     */
+    bool listener_registered;
+
     /* Vdpa must send shadow addresses as IOTLB key for data queues, not GPA */
     bool shadow_data;
 } VhostVDPAShared;
@@ -73,6 +99,8 @@  int vhost_vdpa_dma_map(VhostVDPAShared *s, uint32_t asid, hwaddr iova,
                        hwaddr size, void *vaddr, bool readonly);
 int vhost_vdpa_dma_unmap(VhostVDPAShared *s, uint32_t asid, hwaddr iova,
                          hwaddr size);
+int vhost_vdpa_load_setup(VhostVDPAShared *s, AddressSpace *dma_as);
+int vhost_vdpa_load_cleanup(VhostVDPAShared *s, bool vhost_will_start);
 
 typedef struct vdpa_iommu {
     VhostVDPAShared *dev_shared;
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 43f7c382b1..339e11c58a 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -101,6 +101,16 @@  int vhost_vdpa_dma_map(VhostVDPAShared *s, uint32_t asid, hwaddr iova,
     msg.iotlb.perm = readonly ? VHOST_ACCESS_RO : VHOST_ACCESS_RW;
     msg.iotlb.type = VHOST_IOTLB_UPDATE;
 
+    if (s->map_thread && s->map_thread->map_thread_enabled &&
+        !qemu_thread_is_self(&s->map_thread->thread)) {
+        struct vhost_msg_v2 *new_msg = g_new(struct vhost_msg_v2, 1);
+
+        *new_msg = msg;
+        g_async_queue_push(s->map_thread->queue, new_msg);
+
+        return 0;
+    }
+
     trace_vhost_vdpa_dma_map(s, fd, msg.type, msg.asid, msg.iotlb.iova,
                              msg.iotlb.size, msg.iotlb.uaddr, msg.iotlb.perm,
                              msg.iotlb.type);
@@ -131,6 +141,16 @@  int vhost_vdpa_dma_unmap(VhostVDPAShared *s, uint32_t asid, hwaddr iova,
     msg.iotlb.size = size;
     msg.iotlb.type = VHOST_IOTLB_INVALIDATE;
 
+    if (s->map_thread && s->map_thread->map_thread_enabled &&
+        !qemu_thread_is_self(&s->map_thread->thread)) {
+        struct vhost_msg_v2 *new_msg = g_new(struct vhost_msg_v2, 1);
+
+        *new_msg = msg;
+        g_async_queue_push(s->map_thread->queue, new_msg);
+
+        return 0;
+    }
+
     trace_vhost_vdpa_dma_unmap(s, fd, msg.type, msg.asid, msg.iotlb.iova,
                                msg.iotlb.size, msg.iotlb.type);
 
@@ -156,6 +176,16 @@  static void vhost_vdpa_iotlb_batch_begin_once(VhostVDPAShared *s)
         return;
     }
 
+    if (s->map_thread && s->map_thread->map_thread_enabled &&
+        !qemu_thread_is_self(&s->map_thread->thread)) {
+        struct vhost_msg_v2 *new_msg = g_new(struct vhost_msg_v2, 1);
+
+        *new_msg = msg;
+        g_async_queue_push(s->map_thread->queue, new_msg);
+
+        return;
+    }
+
     trace_vhost_vdpa_listener_begin_batch(s, fd, msg.type, msg.iotlb.type);
     if (write(fd, &msg, sizeof(msg)) != sizeof(msg)) {
         error_report("failed to write, fd=%d, errno=%d (%s)",
@@ -180,6 +210,17 @@  static void vhost_vdpa_dma_end_batch(VhostVDPAShared *s)
     msg.type = VHOST_IOTLB_MSG_V2;
     msg.iotlb.type = VHOST_IOTLB_BATCH_END;
 
+    if (s->map_thread && s->map_thread->map_thread_enabled &&
+        !qemu_thread_is_self(&s->map_thread->thread)) {
+        struct vhost_msg_v2 *new_msg = g_new(struct vhost_msg_v2, 1);
+
+        *new_msg = msg;
+        g_async_queue_push(s->map_thread->queue, new_msg);
+        s->map_thread->map_thread_enabled = false;
+
+        return;
+    }
+
     trace_vhost_vdpa_listener_commit(s, fd, msg.type, msg.iotlb.type);
     if (write(fd, &msg, sizeof(msg)) != sizeof(msg)) {
         error_report("failed to write, fd=%d, errno=%d (%s)",
@@ -1288,6 +1329,88 @@  static void vhost_vdpa_suspend(struct vhost_dev *dev)
     vhost_vdpa_reset_device(dev);
 }
 
+static void *vhost_vdpa_load_map_worker(void *opaque)
+{
+    VhostVDPAShared *shared = opaque;
+    GPtrArray *ret = NULL;
+
+    while (true) {
+        g_autofree struct vhost_msg_v2 *msg = NULL;
+        int r = 0;
+
+        msg = g_async_queue_pop(shared->map_thread->queue);
+        switch (msg->iotlb.type) {
+        case VHOST_IOTLB_UPDATE:
+            r = vhost_vdpa_dma_map(shared, msg->asid, msg->iotlb.iova,
+                                   msg->iotlb.size,
+                                   (void *)(uintptr_t)msg->iotlb.uaddr,
+                                   msg->iotlb.perm == VHOST_ACCESS_RO);
+            break;
+        case VHOST_IOTLB_INVALIDATE:
+            r = vhost_vdpa_dma_unmap(shared, msg->asid, msg->iotlb.iova,
+                                     msg->iotlb.size);
+            break;
+        case VHOST_IOTLB_BATCH_BEGIN:
+            vhost_vdpa_iotlb_batch_begin_once(shared);
+            break;
+        case VHOST_IOTLB_BATCH_END:
+            vhost_vdpa_dma_end_batch(shared);
+            goto end;
+        default:
+            error_report("Invalid IOTLB msg type %d", msg->iotlb.type);
+            break;
+        };
+
+        if (unlikely(r != 0)) {
+            /* Add to failed iotlbs so we can remove it from iova_tree */
+            if (ret == NULL) {
+                ret = g_ptr_array_new_full(1, g_free);
+            }
+
+            g_ptr_array_add(ret, g_steal_pointer(&msg));
+        }
+    }
+
+end:
+    return ret;
+}
+
+static void vhost_vdpa_spawn_maps_thread(VhostVDPAShared *shared)
+{
+    shared->map_thread = g_new0(VhostVDPAMapThread, 1);
+    shared->map_thread->queue = g_async_queue_new();
+    qemu_thread_create(&shared->map_thread->thread, "vdpa map thread",
+                       vhost_vdpa_load_map_worker, shared,
+                       QEMU_THREAD_JOINABLE);
+    shared->map_thread->map_thread_enabled = true;
+}
+
+static bool vhost_vdpa_join_maps_thread(VhostVDPAShared *shared)
+{
+    g_autoptr(GPtrArray) failed_iova = NULL;
+
+    failed_iova = qemu_thread_join(&shared->map_thread->thread);
+    g_async_queue_unref(shared->map_thread->queue);
+    g_clear_pointer(&shared->map_thread, g_free);
+
+    if (likely(!failed_iova)) {
+        return true;
+    }
+
+    /* If it is a failed IOVA, abort starting */
+    for (size_t i = 0; failed_iova->len; ++i) {
+        struct vhost_msg_v2 *msg = g_ptr_array_index(failed_iova, i);
+        DMAMap mem_region = {
+            .iova = msg->iotlb.iova,
+            .size = msg->iotlb.size - 1, /* Inclusive */
+        };
+
+        vhost_iova_tree_remove(shared->iova_tree, mem_region);
+    }
+
+    return false;
+}
+
 static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
 {
     struct vhost_vdpa *v = dev->opaque;
@@ -1315,7 +1438,15 @@  static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
                          "IOMMU and try again");
             return -1;
         }
-        memory_listener_register(&v->shared->listener, dev->vdev->dma_as);
+        if (!v->shared->listener_registered) {
+            memory_listener_register(&v->shared->listener, dev->vdev->dma_as);
+            v->shared->listener_registered = true;
+        } else {
+            ok = vhost_vdpa_join_maps_thread(v->shared);
+            if (unlikely(!ok)) {
+                goto out_stop;
+            }
+        }
 
         return vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
     }
@@ -1340,6 +1471,8 @@  static void vhost_vdpa_reset_status(struct vhost_dev *dev)
     vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
                                VIRTIO_CONFIG_S_DRIVER);
     memory_listener_unregister(&v->shared->listener);
+    v->shared->listener_registered = false;
+
 }
 
 static int vhost_vdpa_set_log_base(struct vhost_dev *dev, uint64_t base,
@@ -1522,3 +1655,34 @@  const VhostOps vdpa_ops = {
         .vhost_set_config_call = vhost_vdpa_set_config_call,
         .vhost_reset_status = vhost_vdpa_reset_status,
 };
+
+int vhost_vdpa_load_setup(VhostVDPAShared *shared, AddressSpace *dma_as)
+{
+    uint8_t s = VIRTIO_CONFIG_S_ACKNOWLEDGE | VIRTIO_CONFIG_S_DRIVER;
+    int r = ioctl(shared->device_fd, VHOST_VDPA_SET_STATUS, &s);
+    if (unlikely(r < 0)) {
+        return r;
+    }
+
+    vhost_vdpa_spawn_maps_thread(shared);
+    memory_listener_register(&shared->listener, dma_as);
+    shared->listener_registered = true;
+    return 0;
+}
+
+int vhost_vdpa_load_cleanup(VhostVDPAShared *shared, bool vhost_will_start)
+{
+    if (shared->map_thread && !shared->map_thread->map_thread_enabled) {
+        return 0;
+    }
+
+    if (vhost_will_start) {
+        /*
+         * Delegate the join of map thread to vhost_vdpa_dev_start, as it runs
+         * out of main qemu lock.
+         */
+        return 0;
+    }
+
+    return vhost_vdpa_join_maps_thread(shared) ? 0 : -1;
+}