
[RFC,v3,00/29] vDPA software assisted live migration

Message ID 20210519162903.1172366-1-eperezma@redhat.com (mailing list archive)
Series vDPA software assisted live migration

Message

Eugenio Perez Martin May 19, 2021, 4:28 p.m. UTC
This series enables shadow virtqueue for vhost-vdpa devices. This is a
new method of vhost device migration: instead of relying on the vhost
device's dirty logging capability, SW assisted LM intercepts the
dataplane, forwarding the descriptors between the VM and the device. It
is intended for vDPA devices with no dirty memory tracking capabilities.

In this migration mode, qemu offers a new vring to the device to
read from and write into, and disables the vhost notifiers, processing
guest and vhost notifications in qemu. When relaying used buffers, qemu
marks the dirty memory just as it does for plain virtio-net devices.
This way, devices do not need to have dirty page logging capability.
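To make that dirty marking concrete: it boils down to setting bits in a
page-granularity bitmap for every guest address a relayed used buffer
touches. The following is a standalone sketch of just that idea (plain
C; the sizes, names and the bitmap itself are made up for illustration
and are not the actual qemu dirty tracking API):

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12                     /* 4 KiB pages */
    #define GUEST_MEM  (1ULL << 30)           /* pretend 1 GiB of guest RAM */

    /* Toy dirty bitmap: one bit per guest page. */
    static uint8_t dirty_bitmap[GUEST_MEM >> (PAGE_SHIFT + 3)];

    /* Mark every page touched by a guest-physical range as dirty. */
    static void mark_dirty(uint64_t gpa, uint64_t len)
    {
        uint64_t first = gpa >> PAGE_SHIFT;
        uint64_t last = (gpa + len - 1) >> PAGE_SHIFT;

        for (uint64_t page = first; page <= last; page++) {
            dirty_bitmap[page / 8] |= 1u << (page % 8);
        }
    }

    int main(void)
    {
        /* A 16-byte used buffer written at gpa 0x10ffc crosses a page
         * boundary, so two pages must be sent again by migration. */
        mark_dirty(0x10ffc, 16);
        printf("page 0x10 dirty: %d\n", !!(dirty_bitmap[2] & (1 << 0)));
        printf("page 0x11 dirty: %d\n", !!(dirty_bitmap[2] & (1 << 1)));
        return 0;
    }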

This series is a POC doing SW LM for vhost-net and vhost-vdpa devices.
The former already has dirty page logging capabilities, but it is both
easier to test and exercises different code paths in qemu.

For qemu to use shadow virtqueues, the vhost-net devices need to be
instantiated:
* With IOMMU (iommu_platform=on,ats=on)
* Without event_idx (event_idx=off)

And the shadow virtqueue needs to be enabled for them with a QMP
command like:

{ "execute": "x-vhost-enable-shadow-vq",
      "arguments": { "name": "dev0", "enable": true } }

The series includes some commits that will be deleted in the final
version. One of them adds vhost_kernel_vring_pause to vhost kernel
devices. It is only intended to work with vhost-net devices, as a way
to test the solution, so don't use any other vhost kernel device in the
same test.

The vhost-vdpa devices should work the same way. However, vp_vdpa is
not working properly with intel iommu unmapping, so this series adds
two extra commits to allow testing the solution: enabling SVQ mode from
device start and forbidding any other vhost-vdpa memory mapping. The
cause of this is still being debugged.

For testing vhost-vdpa devices, the vp_vdpa device has been used with
nested virtualization, using a qemu virtio-net device in L0. To be able
to stop and reset status, features that are still in RFC state have
been implemented in commits 5 and 6. After that, the virtio-net driver
in the L0 guest is replaced by the vp_vdpa driver, and a nested qemu
instance is launched using it.

This vp_vdpa driver also needs to be modified to support the RFCs,
mainly removing the _S_STOPPED status flag and implementing actual
vp_vdpa_set_vq_state and vp_vdpa_get_vq_state callbacks.

Notification forwarding alone (with no descriptor relay) can be
achieved with patches 7 and 8 and by starting SVQ. The previous commits
are cleanups and the declaration of the QMP command.

Commit 17 introduces the buffer forwarding. The previous ones are again
preparations, and the later ones enable some obvious optimizations.
However, it needs the vdpa device to be able to map the whole IOVA
space, and some vDPA devices are not able to do so. A check for this is
added in previous commits.

Later commits allow vhost and the shadow virtqueue to track and
translate between qemu virtual addresses and a restricted iommu range.
At the moment it is not able to delete old translations or limit the
maximum range it can translate, nor can vhost add new memory regions
once SVQ is enabled, but adding these should be fairly straightforward.
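The core of that tracking is a set of (qemu VA, size) -> IOVA ranges
that SVQ consults when it rewrites descriptor addresses. A
self-contained sketch of the lookup is below; the names and the linear
search are only meant to illustrate what VhostIOVATree provides, not to
mirror the actual implementation:

    #include <inttypes.h>
    #include <stdbool.h>
    #include <stdio.h>

    struct iova_map_entry {
        uint64_t hva;       /* qemu virtual address of the region start */
        uint64_t iova;      /* address the device will use instead */
        uint64_t size;
    };

    /* Illustrative static table; the real code keeps these in a tree so
     * lookups stay cheap with many regions. */
    static const struct iova_map_entry map[] = {
        { .hva = 0x7f0000000000ULL, .iova = 0x1000,   .size = 0x200000 },
        { .hva = 0x7f0080000000ULL, .iova = 0x201000, .size = 0x100000 },
    };

    /* Translate a qemu VA into the restricted IOVA range the device accepts. */
    static bool hva_to_iova(uint64_t hva, uint64_t *iova)
    {
        for (size_t i = 0; i < sizeof(map) / sizeof(map[0]); i++) {
            if (hva >= map[i].hva && hva < map[i].hva + map[i].size) {
                *iova = map[i].iova + (hva - map[i].hva);
                return true;
            }
        }
        return false;   /* not mapped: SVQ cannot forward this buffer */
    }

    int main(void)
    {
        uint64_t iova;

        if (hva_to_iova(0x7f0000001234ULL, &iova)) {
            printf("iova = 0x%" PRIx64 "\n", iova);   /* prints 0x2234 */
        }
        return 0;
    }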

This is a big series, so the idea is to send it in logical chunks once
all comments have been collected. An SVQ mode with no possibility of
going back to regular mode would already cover a first complete
usecase, and this RFC has all the ingredients for it except the
internal memory tracking.

It is based on the ideas of DPDK SW assisted LM, in DPDK's series
https://patchwork.dpdk.org/cover/48370/ . However, this series does not
map the shadow vq in the guest's VA, but in qemu's.

Comments are welcome!

TODO:
* Event, indirect, packed, and other features of virtio - waiting for
  confirmation of the big picture.
* vDPA devices: grow the IOVA tree to track new or deleted memory. Cap
  the IOVA limit in the tree so it cannot grow forever.
* Separate buffer forwarding into its own AIO context, so we can throw
  more threads at that task and don't need to stop the main event loop.
* IOMMU optimizations, so batching and bigger chunks of IOVA can be
  sent to the device.
* Automatic kick-in on live migration.
* Proper documentation.

Thanks!

Changes from v2 RFC:
  * Added vhost-vdpa device support
  * Fixed some memory leaks pointed out in different comments

Changes from v1 RFC:
  * Use QMP instead of migration to start SVQ mode.
  * Only accepting IOMMU devices, for closer behavior to the target
    devices (vDPA)
  * Fix invalid masking/unmasking of vhost call fd.
  * Use of proper methods for synchronization.
  * No need to modify VirtIO device code, all of the changes are
    contained in vhost code.
  * Delete superfluous code.
  * An intermediate RFC was sent with only the notifications forwarding
    changes. It can be seen in
    https://patchew.org/QEMU/20210129205415.876290-1-eperezma@redhat.com/
  * v1 at
    https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg05372.html

Eugenio Pérez (29):
  virtio: Add virtio_queue_is_host_notifier_enabled
  vhost: Save masked_notifier state
  vhost: Add VhostShadowVirtqueue
  vhost: Add x-vhost-enable-shadow-vq qmp
  virtio: Add VIRTIO_F_QUEUE_STATE
  virtio-net: Honor VIRTIO_CONFIG_S_DEVICE_STOPPED
  vhost: Route guest->host notification through shadow virtqueue
  vhost: Route host->guest notification through shadow virtqueue
  vhost: Avoid re-set masked notifier in shadow vq
  virtio: Add vhost_shadow_vq_get_vring_addr
  vhost: Add vhost_vring_pause operation
  vhost: add vhost_kernel_vring_pause
  vhost: Add vhost_get_iova_range operation
  vhost: add vhost_has_limited_iova_range
  vhost: Add enable_custom_iommu to VhostOps
  vhost-vdpa: Add vhost_vdpa_enable_custom_iommu
  vhost: Shadow virtqueue buffers forwarding
  vhost: Use vhost_enable_custom_iommu to unmap everything if available
  vhost: Check for device VRING_USED_F_NO_NOTIFY at shadow virtqueue
    kick
  vhost: Use VRING_AVAIL_F_NO_INTERRUPT at device call on shadow
    virtqueue
  vhost: Add VhostIOVATree
  vhost: Add iova_rev_maps_find_iova to IOVAReverseMaps
  vhost: Use a tree to store memory mappings
  vhost: Add iova_rev_maps_alloc
  vhost: Add custom IOTLB translations to SVQ
  vhost: Map in vdpa-dev
  vhost-vdpa: Implement vhost_vdpa_vring_pause operation
  vhost-vdpa: never map with vDPA listener
  vhost: Start vhost-vdpa SVQ directly

 qapi/net.json                                 |  22 +
 hw/virtio/vhost-iova-tree.h                   |  61 ++
 hw/virtio/vhost-shadow-virtqueue.h            |  38 ++
 hw/virtio/virtio-pci.h                        |   1 +
 include/hw/virtio/vhost-backend.h             |  16 +
 include/hw/virtio/vhost-vdpa.h                |   2 +-
 include/hw/virtio/vhost.h                     |  14 +
 include/hw/virtio/virtio.h                    |   5 +-
 .../standard-headers/linux/virtio_config.h    |   5 +
 include/standard-headers/linux/virtio_pci.h   |   2 +
 hw/net/virtio-net.c                           |   4 +-
 hw/virtio/vhost-backend.c                     |  42 ++
 hw/virtio/vhost-iova-tree.c                   | 283 ++++++++
 hw/virtio/vhost-shadow-virtqueue.c            | 643 ++++++++++++++++++
 hw/virtio/vhost-vdpa.c                        |  73 +-
 hw/virtio/vhost.c                             | 459 ++++++++++++-
 hw/virtio/virtio-pci.c                        |   9 +
 hw/virtio/virtio.c                            |   5 +
 hw/virtio/meson.build                         |   2 +-
 hw/virtio/trace-events                        |   1 +
 20 files changed, 1663 insertions(+), 24 deletions(-)
 create mode 100644 hw/virtio/vhost-iova-tree.h
 create mode 100644 hw/virtio/vhost-shadow-virtqueue.h
 create mode 100644 hw/virtio/vhost-iova-tree.c
 create mode 100644 hw/virtio/vhost-shadow-virtqueue.c

Comments

Michael S. Tsirkin May 24, 2021, 9:38 a.m. UTC | #1
On Wed, May 19, 2021 at 06:28:34PM +0200, Eugenio Pérez wrote:
> Commit 17 introduces the buffer forwarding. Previous one are for
> preparations again, and laters are for enabling some obvious
> optimizations. However, it needs the vdpa device to be able to map
> every IOVA space, and some vDPA devices are not able to do so. Checking
> of this is added in previous commits.

That might become a significant limitation. And it worries me that
this is such a big patchset which might yet take a while to get
finalized.

I have an idea: how about as a first step we implement a transparent
switch from vdpa to a software virtio in QEMU or a software vhost in
kernel?

This will give us live migration quickly with performance comparable
to failover but without dependence on guest cooperation.

Next step could be driving vdpa from userspace while still copying
packets to a pre-registered buffer.

Finally your approach will be a performance optimization for devices
that support arbitrary IOVA.

Thoughts?
Eugenio Perez Martin May 24, 2021, 10:37 a.m. UTC | #2
On Mon, May 24, 2021 at 11:38 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Wed, May 19, 2021 at 06:28:34PM +0200, Eugenio Pérez wrote:
> > Commit 17 introduces the buffer forwarding. Previous one are for
> > preparations again, and laters are for enabling some obvious
> > optimizations. However, it needs the vdpa device to be able to map
> > every IOVA space, and some vDPA devices are not able to do so. Checking
> > of this is added in previous commits.
>
> That might become a significant limitation. And it worries me that
> this is such a big patchset which might yet take a while to get
> finalized.
>

Sorry, maybe I've been unclear here: later commits in this series
address this limitation. It is still not perfect: for example, it does
not support adding or removing guest memory at the moment, but this
should be easy to implement on top.

The main issue I'm observing is from the kernel, if I'm not wrong: if I
unmap every address, I cannot re-map them again. But the code in this
patchset is mostly final, except for the comments that may arise on the
mailing list, of course.

> I have an idea: how about as a first step we implement a transparent
> switch from vdpa to a software virtio in QEMU or a software vhost in
> kernel?
>
> This will give us live migration quickly with performance comparable
> to failover but without dependance on guest cooperation.
>

I think it should be doable. I'm not sure about the effort that needs
to be done in qemu to hide these "hypervisor-failover devices" from the
guest's view, but it should be comparable to failover, as you say.

Networking should be ok by its nature, although it could require care
on the host hardware setup. But I'm not sure how other types of
vhost/vdpa devices may work that way. How would a disk/scsi device
switch modes? Can the kernel take control of the vdpa device through
vhost, and just start reporting with a dirty bitmap?

Thanks!

> Next step could be driving vdpa from userspace while still copying
> packets to a pre-registered buffer.
>
> Finally your approach will be a performance optimization for devices
> that support arbitrary IOVA.
>
> Thoughts?
>
> --
> MST
>
Michael S. Tsirkin May 24, 2021, 11:29 a.m. UTC | #3
On Mon, May 24, 2021 at 12:37:48PM +0200, Eugenio Perez Martin wrote:
> On Mon, May 24, 2021 at 11:38 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Wed, May 19, 2021 at 06:28:34PM +0200, Eugenio Pérez wrote:
> > > Commit 17 introduces the buffer forwarding. Previous one are for
> > > preparations again, and laters are for enabling some obvious
> > > optimizations. However, it needs the vdpa device to be able to map
> > > every IOVA space, and some vDPA devices are not able to do so. Checking
> > > of this is added in previous commits.
> >
> > That might become a significant limitation. And it worries me that
> > this is such a big patchset which might yet take a while to get
> > finalized.
> >
> 
> Sorry, maybe I've been unclear here: Latter commits in this series
> address this limitation. Still not perfect: for example, it does not
> support adding or removing guest's memory at the moment, but this
> should be easy to implement on top.
> 
> The main issue I'm observing is from the kernel if I'm not wrong: If I
> unmap every address, I cannot re-map them again. But code in this
> patchset is mostly final, except for the comments it may arise in the
> mail list of course.
> 
> > I have an idea: how about as a first step we implement a transparent
> > switch from vdpa to a software virtio in QEMU or a software vhost in
> > kernel?
> >
> > This will give us live migration quickly with performance comparable
> > to failover but without dependance on guest cooperation.
> >
> 
> I think it should be doable. I'm not sure about the effort that needs
> to be done in qemu to hide these "hypervisor-failover devices" from
> guest's view but it should be comparable to failover, as you say.
> 
> Networking should be ok by its nature, although it could require care
> on the host hardware setup. But I'm not sure how other types of
> vhost/vdpa devices may work that way. How would a disk/scsi device
> switch modes? Can the kernel take control of the vdpa device through
> vhost, and just start reporting with a dirty bitmap?
> 
> Thanks!

It depends of course, e.g. blk is mostly reads/writes so there is not
a lot of state. Just don't reorder or drop requests.

> > Next step could be driving vdpa from userspace while still copying
> > packets to a pre-registered buffer.
> >
> > Finally your approach will be a performance optimization for devices
> > that support arbitrary IOVA.
> >
> > Thoughts?
> >
> > --
> > MST
> >
Jason Wang May 25, 2021, 12:09 a.m. UTC | #4
On 2021/5/24 6:37 PM, Eugenio Perez Martin wrote:
> On Mon, May 24, 2021 at 11:38 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>> On Wed, May 19, 2021 at 06:28:34PM +0200, Eugenio Pérez wrote:
>>> Commit 17 introduces the buffer forwarding. Previous one are for
>>> preparations again, and laters are for enabling some obvious
>>> optimizations. However, it needs the vdpa device to be able to map
>>> every IOVA space, and some vDPA devices are not able to do so. Checking
>>> of this is added in previous commits.
>> That might become a significant limitation. And it worries me that
>> this is such a big patchset which might yet take a while to get
>> finalized.
>>
> Sorry, maybe I've been unclear here: Latter commits in this series
> address this limitation. Still not perfect: for example, it does not
> support adding or removing guest's memory at the moment, but this
> should be easy to implement on top.
>
> The main issue I'm observing is from the kernel if I'm not wrong: If I
> unmap every address, I cannot re-map them again.


This looks like a bug.

Does this happen only on some specific device (e.g. vp_vdpa) or is it a
general issue of vhost-vdpa?


>   But code in this
> patchset is mostly final, except for the comments it may arise in the
> mail list of course.
>
>> I have an idea: how about as a first step we implement a transparent
>> switch from vdpa to a software virtio in QEMU or a software vhost in
>> kernel?
>>
>> This will give us live migration quickly with performance comparable
>> to failover but without dependance on guest cooperation.
>>
> I think it should be doable. I'm not sure about the effort that needs
> to be done in qemu to hide these "hypervisor-failover devices" from
> guest's view but it should be comparable to failover, as you say.


Yes, if we want to switch, I'd go with a fallback to the vhost-vdpa
network backend instead.

Thanks


>
> Networking should be ok by its nature, although it could require care
> on the host hardware setup. But I'm not sure how other types of
> vhost/vdpa devices may work that way. How would a disk/scsi device
> switch modes? Can the kernel take control of the vdpa device through
> vhost, and just start reporting with a dirty bitmap?
>
> Thanks!
>
>> Next step could be driving vdpa from userspace while still copying
>> packets to a pre-registered buffer.
>>
>> Finally your approach will be a performance optimization for devices
>> that support arbitrary IOVA.
>>
>> Thoughts?
>>
>> --
>> MST
>>
Jason Wang June 2, 2021, 9:59 a.m. UTC | #5
On 2021/5/20 12:28 AM, Eugenio Pérez wrote:
> This series enable shadow virtqueue for vhost-vdpa devices. This is a
> new method of vhost devices migration: Instead of relay on vhost
> device's dirty logging capability, SW assisted LM intercepts dataplane,
> forwarding the descriptors between VM and device. Is intended for vDPA
> devices with no dirty memory tracking capabilities.
>
> In this migration mode, qemu offers a new vring to the device to
> read and write into, and disable vhost notifiers, processing guest and
> vhost notifications in qemu. On used buffer relay, qemu will mark the
> dirty memory as with plain virtio-net devices. This way, devices does
> not need to have dirty page logging capability.
>
> This series is a POC doing SW LM for vhost-net and vhost-vdpa devices.
> The former already have dirty page logging capabilities, but it is both
> easier to test and uses different code paths in qemu.
>
> For qemu to use shadow virtqueues the vhost-net devices need to be
> instantiated:
> * With IOMMU (iommu_platform=on,ats=on)
> * Without event_idx (event_idx=off)
>
> And shadow virtqueue needs to be enabled for them with QMP command
> like:
>
> { "execute": "x-vhost-enable-shadow-vq",
>        "arguments": { "name": "dev0", "enable": true } }
>
> The series includes some commits to delete in the final version. One
> of them is the one that adds vhost_kernel_vring_pause to vhost kernel
> devices. This is only intended to work with vhost-net devices, as a way
> to test the solution, so don't use any other vhost kernel device in the
> same test.
>
> The vhost-vdpa devices should work the same way. However, vp_vdpa is
> not working properly with intel iommu unmapping, so this series add two
> extra commits to allow testing the solution enable SVQ mode from the
> device start and forbidding any other vhost-vdpa memory mapping. The
> causes of this are still under debugging.
>
> For testing vhost-vdpa devices vp_vdpa device has been used with nested
> virtualization, using a qemu virtio-net device in L0. To be able to
> stop and reset status, features in RFC status has been implemented in
> commits 5 and 6. After that, virtio-net driver in L0 guest is replaced
> by vp_vdpa driver, and a nested qemu instance is launched using it.
>
> This vp_vdpa driver needs to be also modified to support the RFCs,
> mainly allowing it to removing the _S_STOPPED status flag and
> implementing actual vp_vdpa_set_vq_state and vp_vdpa_get_vq_state
> callbacks.
>
> Just the notification forwarding (with no descriptor relay) can be
> achieved with patches 7 and 8, and starting SVQ. Previous commits
> are cleanup ones and declaration of QMP command.
>
> Commit 17 introduces the buffer forwarding. Previous one are for
> preparations again, and laters are for enabling some obvious
> optimizations. However, it needs the vdpa device to be able to map
> every IOVA space, and some vDPA devices are not able to do so. Checking
> of this is added in previous commits.
>
> Later commits allow vhost and shadow virtqueue to track and translate
> between qemu virtual addresses and a restricted iommu range. At the
> moment is not able to delete old translations, limit maximum range
> it can translate, nor vhost add new memory regions from the moment
> SVQ is enabled, but is somehow straightforward to add these.
>
> This is a big series, so the idea is to send it in logical chunks once
> all comments have been collected. As a first complete usecase, a SVQ
> mode with no possibility of going back to regular mode would cover a
> first usecase, and this RFC already have all the ingredients but
> internal memory tracking.
>
> It is based on the ideas of DPDK SW assisted LM, in the series of
> DPDK's https://patchwork.dpdk.org/cover/48370/ . However, these does
> not map the shadow vq in guest's VA, but in qemu's.
>
> Comments are welcome!


Thanks a lot for working on this.

I feel like we need to start from something simple and get some minimal
functionality working first.

There are two major issues behind the current complexity:

1) the current code tries to work at the general virtio level (all
kinds of vhost backends)
2) two kinds of translations are used (qemu HVA and qemu IOVA)

For 1), I'd suggest starting from vhost-vdpa, and it's even better to
hide all the svq stuff from the vhost API first. That is to say, do it
totally inside vhost-vDPA and leave the rest of Qemu untouched. This
makes things easier, and after this part is merged we can start to
think about how to generalize it to other vhost backends (though it's
still questionable whether that's worth doing). I believe most of the
code could be re-used.

For 2), I think we'd better always go with qemu IOVA (an IOVA
allocator). This should work for all cases and may simplify the code a
lot. In the future, if we find qemu HVA useful, we can implement some
dedicated allocator for vhost-net to get a 1:1 mapping.
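A trivial sketch of what such an allocator could look like (the names,
the bump-pointer strategy and the fixed window are assumptions for
illustration; in practice the usable window would come from the device,
e.g. via VHOST_VDPA_GET_IOVA_RANGE):

    #include <inttypes.h>
    #include <stdio.h>

    /* Made-up usable IOVA window for the example. */
    #define IOVA_FIRST 0x1000ULL
    #define IOVA_LAST  0xFFFFFFFFULL

    static uint64_t iova_next = IOVA_FIRST;

    /* Bump allocator: align must be a power of two; returns 0 on failure
     * (0 is never handed out as a valid IOVA here). */
    static uint64_t iova_alloc(uint64_t size, uint64_t align)
    {
        uint64_t base = (iova_next + align - 1) & ~(align - 1);

        if (size == 0 || base + size - 1 > IOVA_LAST) {
            return 0;
        }
        iova_next = base + size;
        return base;
    }

    int main(void)
    {
        /* e.g. one range for a shadow vring, one for a guest memory chunk */
        printf("vring  iova: 0x%" PRIx64 "\n", iova_alloc(0x4000, 0x1000));
        printf("region iova: 0x%" PRIx64 "\n", iova_alloc(0x200000, 0x1000));
        return 0;
    }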

Thoughts?

Thanks


>
> TODO:
> * Event, indirect, packed, and others features of virtio - Waiting for
>    confirmation of the big picture.
> * vDPA devices: Grow IOVA tree to track new or deleted memory. Cap
>    IOVA limit in tree so it cannot grow forever.
> * To sepparate buffers forwarding in its own AIO context, so we can
>    throw more threads to that task and we don't need to stop the main
>    event loop.
> * IOMMU optimizations, so bacthing and bigger chunks of IOVA can be
>    sent to device.
> * Automatic kick-in on live-migration.
> * Proper documentation.
>
> Thanks!
>
> Changes from v2 RFC:
>    * Adding vhost-vdpa devices support
>    * Fixed some memory leaks pointed by different comments
>
> Changes from v1 RFC:
>    * Use QMP instead of migration to start SVQ mode.
>    * Only accepting IOMMU devices, closer behavior with target devices
>      (vDPA)
>    * Fix invalid masking/unmasking of vhost call fd.
>    * Use of proper methods for synchronization.
>    * No need to modify VirtIO device code, all of the changes are
>      contained in vhost code.
>    * Delete superfluous code.
>    * An intermediate RFC was sent with only the notifications forwarding
>      changes. It can be seen in
>      https://patchew.org/QEMU/20210129205415.876290-1-eperezma@redhat.com/
>    * v1 at
>      https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg05372.html
>
> Eugenio Pérez (29):
>    virtio: Add virtio_queue_is_host_notifier_enabled
>    vhost: Save masked_notifier state
>    vhost: Add VhostShadowVirtqueue
>    vhost: Add x-vhost-enable-shadow-vq qmp
>    virtio: Add VIRTIO_F_QUEUE_STATE
>    virtio-net: Honor VIRTIO_CONFIG_S_DEVICE_STOPPED
>    vhost: Route guest->host notification through shadow virtqueue
>    vhost: Route host->guest notification through shadow virtqueue
>    vhost: Avoid re-set masked notifier in shadow vq
>    virtio: Add vhost_shadow_vq_get_vring_addr
>    vhost: Add vhost_vring_pause operation
>    vhost: add vhost_kernel_vring_pause
>    vhost: Add vhost_get_iova_range operation
>    vhost: add vhost_has_limited_iova_range
>    vhost: Add enable_custom_iommu to VhostOps
>    vhost-vdpa: Add vhost_vdpa_enable_custom_iommu
>    vhost: Shadow virtqueue buffers forwarding
>    vhost: Use vhost_enable_custom_iommu to unmap everything if available
>    vhost: Check for device VRING_USED_F_NO_NOTIFY at shadow virtqueue
>      kick
>    vhost: Use VRING_AVAIL_F_NO_INTERRUPT at device call on shadow
>      virtqueue
>    vhost: Add VhostIOVATree
>    vhost: Add iova_rev_maps_find_iova to IOVAReverseMaps
>    vhost: Use a tree to store memory mappings
>    vhost: Add iova_rev_maps_alloc
>    vhost: Add custom IOTLB translations to SVQ
>    vhost: Map in vdpa-dev
>    vhost-vdpa: Implement vhost_vdpa_vring_pause operation
>    vhost-vdpa: never map with vDPA listener
>    vhost: Start vhost-vdpa SVQ directly
>
>   qapi/net.json                                 |  22 +
>   hw/virtio/vhost-iova-tree.h                   |  61 ++
>   hw/virtio/vhost-shadow-virtqueue.h            |  38 ++
>   hw/virtio/virtio-pci.h                        |   1 +
>   include/hw/virtio/vhost-backend.h             |  16 +
>   include/hw/virtio/vhost-vdpa.h                |   2 +-
>   include/hw/virtio/vhost.h                     |  14 +
>   include/hw/virtio/virtio.h                    |   5 +-
>   .../standard-headers/linux/virtio_config.h    |   5 +
>   include/standard-headers/linux/virtio_pci.h   |   2 +
>   hw/net/virtio-net.c                           |   4 +-
>   hw/virtio/vhost-backend.c                     |  42 ++
>   hw/virtio/vhost-iova-tree.c                   | 283 ++++++++
>   hw/virtio/vhost-shadow-virtqueue.c            | 643 ++++++++++++++++++
>   hw/virtio/vhost-vdpa.c                        |  73 +-
>   hw/virtio/vhost.c                             | 459 ++++++++++++-
>   hw/virtio/virtio-pci.c                        |   9 +
>   hw/virtio/virtio.c                            |   5 +
>   hw/virtio/meson.build                         |   2 +-
>   hw/virtio/trace-events                        |   1 +
>   20 files changed, 1663 insertions(+), 24 deletions(-)
>   create mode 100644 hw/virtio/vhost-iova-tree.h
>   create mode 100644 hw/virtio/vhost-shadow-virtqueue.h
>   create mode 100644 hw/virtio/vhost-iova-tree.c
>   create mode 100644 hw/virtio/vhost-shadow-virtqueue.c
>
Stefan Hajnoczi July 19, 2021, 2:13 p.m. UTC | #6
On Mon, May 24, 2021 at 07:29:06AM -0400, Michael S. Tsirkin wrote:
> On Mon, May 24, 2021 at 12:37:48PM +0200, Eugenio Perez Martin wrote:
> > On Mon, May 24, 2021 at 11:38 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Wed, May 19, 2021 at 06:28:34PM +0200, Eugenio Pérez wrote:
> > > > Commit 17 introduces the buffer forwarding. Previous one are for
> > > > preparations again, and laters are for enabling some obvious
> > > > optimizations. However, it needs the vdpa device to be able to map
> > > > every IOVA space, and some vDPA devices are not able to do so. Checking
> > > > of this is added in previous commits.
> > >
> > > That might become a significant limitation. And it worries me that
> > > this is such a big patchset which might yet take a while to get
> > > finalized.
> > >
> > 
> > Sorry, maybe I've been unclear here: Latter commits in this series
> > address this limitation. Still not perfect: for example, it does not
> > support adding or removing guest's memory at the moment, but this
> > should be easy to implement on top.
> > 
> > The main issue I'm observing is from the kernel if I'm not wrong: If I
> > unmap every address, I cannot re-map them again. But code in this
> > patchset is mostly final, except for the comments it may arise in the
> > mail list of course.
> > 
> > > I have an idea: how about as a first step we implement a transparent
> > > switch from vdpa to a software virtio in QEMU or a software vhost in
> > > kernel?
> > >
> > > This will give us live migration quickly with performance comparable
> > > to failover but without dependance on guest cooperation.
> > >
> > 
> > I think it should be doable. I'm not sure about the effort that needs
> > to be done in qemu to hide these "hypervisor-failover devices" from
> > guest's view but it should be comparable to failover, as you say.
> > 
> > Networking should be ok by its nature, although it could require care
> > on the host hardware setup. But I'm not sure how other types of
> > vhost/vdpa devices may work that way. How would a disk/scsi device
> > switch modes? Can the kernel take control of the vdpa device through
> > vhost, and just start reporting with a dirty bitmap?
> > 
> > Thanks!
> 
> It depends of course, e.g. blk is mostly reads/writes so
> not a lot of state. just don't reorder or drop requests.

QEMU's virtio-blk does not attempt to change states (e.g. quiesce the
device or switch between vhost kernel/QEMU, etc) while there are
in-flight requests. Instead all currently active requests must complete
(in some cases they can be cancelled to stop them early). Note that
failed requests can be kept in a list across the switch and then
resubmitted later.

The underlying storage never has requests in flight while the device is
switched. The reason QEMU does this is because there's no way to hand
over an in-flight preadv(2), Linux AIO, or other host kernel block layer
request to another process.

Stefan