Message ID | 20230112172434.760850-12-eperezma@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Dinamycally switch to vhost shadow virtqueues at vdpa net migration | expand |
On Fri, Jan 13, 2023 at 1:25 AM Eugenio Pérez <eperezma@redhat.com> wrote: > > This allows net to restart the device backend to configure SVQ on it. > > Ideally, these changes should not be net specific. However, the vdpa net > backend is the one with enough knowledge to configure everything because > of some reasons: > * Queues might need to be shadowed or not depending on its kind (control > vs data). > * Queues need to share the same map translations (iova tree). > > Because of that it is cleaner to restart the whole net backend and > configure again as expected, similar to how vhost-kernel moves between > userspace and passthrough. > > If more kinds of devices need dynamic switching to SVQ we can create a > callback struct like VhostOps and move most of the code there. > VhostOps cannot be reused since all vdpa backend share them, and to > personalize just for networking would be too heavy. > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com> > --- > net/vhost-vdpa.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 84 insertions(+) > > diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c > index 5d7ad6e4d7..f38532b1df 100644 > --- a/net/vhost-vdpa.c > +++ b/net/vhost-vdpa.c > @@ -26,6 +26,8 @@ > #include <err.h> > #include "standard-headers/linux/virtio_net.h" > #include "monitor/monitor.h" > +#include "migration/migration.h" > +#include "migration/misc.h" > #include "migration/blocker.h" > #include "hw/virtio/vhost.h" > > @@ -33,6 +35,7 @@ > typedef struct VhostVDPAState { > NetClientState nc; > struct vhost_vdpa vhost_vdpa; > + Notifier migration_state; > Error *migration_blocker; > VHostNetState *vhost_net; > > @@ -243,10 +246,86 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s) > return DO_UPCAST(VhostVDPAState, nc, nc0); > } > > +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable) > +{ > + struct vhost_vdpa *v = &s->vhost_vdpa; > + VirtIONet *n; > + VirtIODevice *vdev; > + int data_queue_pairs, cvq, r; > + NetClientState *peer; > + > + /* We are only called on the first data vqs and only if x-svq is not set */ > + if (s->vhost_vdpa.shadow_vqs_enabled == enable) { > + return; > + } > + > + vdev = v->dev->vdev; > + n = VIRTIO_NET(vdev); > + if (!n->vhost_started) { > + return; > + } > + > + if (enable) { > + ioctl(v->device_fd, VHOST_VDPA_SUSPEND); Do we need to check if the device is started or not here? > + } I'm not sure I understand the reason for vhost_net_stop() after a VHOST_VDPA_SUSPEND. It looks to me those functions are duplicated. > + data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1; > + cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ? > + n->max_ncs - n->max_queue_pairs : 0; > + vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq); > + > + peer = s->nc.peer; > + for (int i = 0; i < data_queue_pairs + cvq; i++) { > + VhostVDPAState *vdpa_state; > + NetClientState *nc; > + > + if (i < data_queue_pairs) { > + nc = qemu_get_peer(peer, i); > + } else { > + nc = qemu_get_peer(peer, n->max_queue_pairs); > + } > + > + vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc); > + vdpa_state->vhost_vdpa.shadow_data = enable; > + > + if (i < data_queue_pairs) { > + /* Do not override CVQ shadow_vqs_enabled */ > + vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable; > + } > + } > + > + r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq); > + if (unlikely(r < 0)) { > + error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r); > + } > +} > + > +static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data) > +{ > + MigrationState *migration = data; > + VhostVDPAState *s = container_of(notifier, VhostVDPAState, > + migration_state); > + > + switch (migration->state) { > + case MIGRATION_STATUS_SETUP: > + vhost_vdpa_net_log_global_enable(s, true); > + return; > + > + case MIGRATION_STATUS_CANCELLING: > + case MIGRATION_STATUS_CANCELLED: > + case MIGRATION_STATUS_FAILED: > + vhost_vdpa_net_log_global_enable(s, false); Do we need to recover here? Thanks > + return; > + }; > +} > + > static void vhost_vdpa_net_data_start_first(VhostVDPAState *s) > { > struct vhost_vdpa *v = &s->vhost_vdpa; > > + if (v->feature_log) { > + add_migration_state_change_notifier(&s->migration_state); > + } > + > if (v->shadow_vqs_enabled) { > v->iova_tree = vhost_iova_tree_new(v->iova_range.first, > v->iova_range.last); > @@ -280,6 +359,10 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc) > > assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA); > > + if (s->vhost_vdpa.index == 0 && s->vhost_vdpa.feature_log) { > + remove_migration_state_change_notifier(&s->migration_state); > + } > + > dev = s->vhost_vdpa.dev; > if (dev->vq_index + dev->nvqs == dev->vq_index_end) { > g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete); > @@ -767,6 +850,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer, > s->vhost_vdpa.device_fd = vdpa_device_fd; > s->vhost_vdpa.index = queue_pair_index; > s->always_svq = svq; > + s->migration_state.notify = vdpa_net_migration_state_notifier; > s->vhost_vdpa.shadow_vqs_enabled = svq; > s->vhost_vdpa.iova_range = iova_range; > s->vhost_vdpa.shadow_data = svq; > -- > 2.31.1 >
On Fri, Jan 13, 2023 at 5:55 AM Jason Wang <jasowang@redhat.com> wrote: > > On Fri, Jan 13, 2023 at 1:25 AM Eugenio Pérez <eperezma@redhat.com> wrote: > > > > This allows net to restart the device backend to configure SVQ on it. > > > > Ideally, these changes should not be net specific. However, the vdpa net > > backend is the one with enough knowledge to configure everything because > > of some reasons: > > * Queues might need to be shadowed or not depending on its kind (control > > vs data). > > * Queues need to share the same map translations (iova tree). > > > > Because of that it is cleaner to restart the whole net backend and > > configure again as expected, similar to how vhost-kernel moves between > > userspace and passthrough. > > > > If more kinds of devices need dynamic switching to SVQ we can create a > > callback struct like VhostOps and move most of the code there. > > VhostOps cannot be reused since all vdpa backend share them, and to > > personalize just for networking would be too heavy. > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com> > > --- > > net/vhost-vdpa.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++ > > 1 file changed, 84 insertions(+) > > > > diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c > > index 5d7ad6e4d7..f38532b1df 100644 > > --- a/net/vhost-vdpa.c > > +++ b/net/vhost-vdpa.c > > @@ -26,6 +26,8 @@ > > #include <err.h> > > #include "standard-headers/linux/virtio_net.h" > > #include "monitor/monitor.h" > > +#include "migration/migration.h" > > +#include "migration/misc.h" > > #include "migration/blocker.h" > > #include "hw/virtio/vhost.h" > > > > @@ -33,6 +35,7 @@ > > typedef struct VhostVDPAState { > > NetClientState nc; > > struct vhost_vdpa vhost_vdpa; > > + Notifier migration_state; > > Error *migration_blocker; > > VHostNetState *vhost_net; > > > > @@ -243,10 +246,86 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s) > > return DO_UPCAST(VhostVDPAState, nc, nc0); > > } > > > > +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable) > > +{ > > + struct vhost_vdpa *v = &s->vhost_vdpa; > > + VirtIONet *n; > > + VirtIODevice *vdev; > > + int data_queue_pairs, cvq, r; > > + NetClientState *peer; > > + > > + /* We are only called on the first data vqs and only if x-svq is not set */ > > + if (s->vhost_vdpa.shadow_vqs_enabled == enable) { > > + return; > > + } > > + > > + vdev = v->dev->vdev; > > + n = VIRTIO_NET(vdev); > > + if (!n->vhost_started) { > > + return; > > + } > > + > > + if (enable) { > > + ioctl(v->device_fd, VHOST_VDPA_SUSPEND); > > Do we need to check if the device is started or not here? > v->vhost_started is checked right above, right? > > + } > > I'm not sure I understand the reason for vhost_net_stop() after a > VHOST_VDPA_SUSPEND. It looks to me those functions are duplicated. > I think this is really worth exploring, and it would have been clearer if I didn't squash the vhost_reset_status commit by mistake :). Looking at qemu master vhost.c:vhost_dev_stop: if (hdev->vhost_ops->vhost_dev_start) { hdev->vhost_ops->vhost_dev_start(hdev, false); } if (vrings) { vhost_dev_set_vring_enable(hdev, false); } for (i = 0; i < hdev->nvqs; ++i) { vhost_virtqueue_stop(hdev, vdev, hdev->vqs + i, hdev->vq_index + i); } Both vhost-used and vhost-vdpa set_status(0) at ->vhost_dev_start(hdev, false). It cleans virtqueue state in vdpa so they are not recoverable at vhost_virtqueue_stop->get_vring_base, and I think it is too late for vdpa devices to change it. I guess vhost-user devices do not lose the state there, but I did not test. I call VHOST_VDPA_SUSPEND here so vhost_vdpa_dev_start looks more similar to vhost_user_dev_start. We can make vhost_vdpa_dev_start(false) to suspend the device instead. But then we need to reset it after getting the indexes. That's why I added vhost_vdpa_reset_status, but I admit it is neither the cleanest approach nor the best name to it. Adding Maxime, RFC here so we can make -vdpa and -user not to divert too much. > > + data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1; > > + cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ? > > + n->max_ncs - n->max_queue_pairs : 0; > > + vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq); > > + > > + peer = s->nc.peer; > > + for (int i = 0; i < data_queue_pairs + cvq; i++) { > > + VhostVDPAState *vdpa_state; > > + NetClientState *nc; > > + > > + if (i < data_queue_pairs) { > > + nc = qemu_get_peer(peer, i); > > + } else { > > + nc = qemu_get_peer(peer, n->max_queue_pairs); > > + } > > + > > + vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc); > > + vdpa_state->vhost_vdpa.shadow_data = enable; > > + > > + if (i < data_queue_pairs) { > > + /* Do not override CVQ shadow_vqs_enabled */ > > + vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable; > > + } > > + } > > + > > + r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq); > > + if (unlikely(r < 0)) { > > + error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r); > > + } > > +} > > + > > +static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data) > > +{ > > + MigrationState *migration = data; > > + VhostVDPAState *s = container_of(notifier, VhostVDPAState, > > + migration_state); > > + > > + switch (migration->state) { > > + case MIGRATION_STATUS_SETUP: > > + vhost_vdpa_net_log_global_enable(s, true); > > + return; > > + > > + case MIGRATION_STATUS_CANCELLING: > > + case MIGRATION_STATUS_CANCELLED: > > + case MIGRATION_STATUS_FAILED: > > + vhost_vdpa_net_log_global_enable(s, false); > > Do we need to recover here? > I may be missing something, but the device is fully reset and restored in these cases. CCing Juan and D. Gilbert, a review would be appreciated to check if this covers all the cases. Thanks! > Thanks > > > + return; > > + }; > > +} > > + > > static void vhost_vdpa_net_data_start_first(VhostVDPAState *s) > > { > > struct vhost_vdpa *v = &s->vhost_vdpa; > > > > + if (v->feature_log) { > > + add_migration_state_change_notifier(&s->migration_state); > > + } > > + > > if (v->shadow_vqs_enabled) { > > v->iova_tree = vhost_iova_tree_new(v->iova_range.first, > > v->iova_range.last); > > @@ -280,6 +359,10 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc) > > > > assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA); > > > > + if (s->vhost_vdpa.index == 0 && s->vhost_vdpa.feature_log) { > > + remove_migration_state_change_notifier(&s->migration_state); > > + } > > + > > dev = s->vhost_vdpa.dev; > > if (dev->vq_index + dev->nvqs == dev->vq_index_end) { > > g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete); > > @@ -767,6 +850,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer, > > s->vhost_vdpa.device_fd = vdpa_device_fd; > > s->vhost_vdpa.index = queue_pair_index; > > s->always_svq = svq; > > + s->migration_state.notify = vdpa_net_migration_state_notifier; > > s->vhost_vdpa.shadow_vqs_enabled = svq; > > s->vhost_vdpa.iova_range = iova_range; > > s->vhost_vdpa.shadow_data = svq; > > -- > > 2.31.1 > > > >
在 2023/1/13 17:00, Eugenio Perez Martin 写道: > On Fri, Jan 13, 2023 at 5:55 AM Jason Wang <jasowang@redhat.com> wrote: >> On Fri, Jan 13, 2023 at 1:25 AM Eugenio Pérez <eperezma@redhat.com> wrote: >>> This allows net to restart the device backend to configure SVQ on it. >>> >>> Ideally, these changes should not be net specific. However, the vdpa net >>> backend is the one with enough knowledge to configure everything because >>> of some reasons: >>> * Queues might need to be shadowed or not depending on its kind (control >>> vs data). >>> * Queues need to share the same map translations (iova tree). >>> >>> Because of that it is cleaner to restart the whole net backend and >>> configure again as expected, similar to how vhost-kernel moves between >>> userspace and passthrough. >>> >>> If more kinds of devices need dynamic switching to SVQ we can create a >>> callback struct like VhostOps and move most of the code there. >>> VhostOps cannot be reused since all vdpa backend share them, and to >>> personalize just for networking would be too heavy. >>> >>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com> >>> --- >>> net/vhost-vdpa.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++ >>> 1 file changed, 84 insertions(+) >>> >>> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c >>> index 5d7ad6e4d7..f38532b1df 100644 >>> --- a/net/vhost-vdpa.c >>> +++ b/net/vhost-vdpa.c >>> @@ -26,6 +26,8 @@ >>> #include <err.h> >>> #include "standard-headers/linux/virtio_net.h" >>> #include "monitor/monitor.h" >>> +#include "migration/migration.h" >>> +#include "migration/misc.h" >>> #include "migration/blocker.h" >>> #include "hw/virtio/vhost.h" >>> >>> @@ -33,6 +35,7 @@ >>> typedef struct VhostVDPAState { >>> NetClientState nc; >>> struct vhost_vdpa vhost_vdpa; >>> + Notifier migration_state; >>> Error *migration_blocker; >>> VHostNetState *vhost_net; >>> >>> @@ -243,10 +246,86 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s) >>> return DO_UPCAST(VhostVDPAState, nc, nc0); >>> } >>> >>> +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable) >>> +{ >>> + struct vhost_vdpa *v = &s->vhost_vdpa; >>> + VirtIONet *n; >>> + VirtIODevice *vdev; >>> + int data_queue_pairs, cvq, r; >>> + NetClientState *peer; >>> + >>> + /* We are only called on the first data vqs and only if x-svq is not set */ >>> + if (s->vhost_vdpa.shadow_vqs_enabled == enable) { >>> + return; >>> + } >>> + >>> + vdev = v->dev->vdev; >>> + n = VIRTIO_NET(vdev); >>> + if (!n->vhost_started) { >>> + return; >>> + } >>> + >>> + if (enable) { >>> + ioctl(v->device_fd, VHOST_VDPA_SUSPEND); >> Do we need to check if the device is started or not here? >> > v->vhost_started is checked right above, right? Right, I miss that. > >>> + } >> I'm not sure I understand the reason for vhost_net_stop() after a >> VHOST_VDPA_SUSPEND. It looks to me those functions are duplicated. >> > I think this is really worth exploring, and it would have been clearer > if I didn't squash the vhost_reset_status commit by mistake :). > > Looking at qemu master vhost.c:vhost_dev_stop: > if (hdev->vhost_ops->vhost_dev_start) { > hdev->vhost_ops->vhost_dev_start(hdev, false); > } > if (vrings) { > vhost_dev_set_vring_enable(hdev, false); > } > for (i = 0; i < hdev->nvqs; ++i) { > vhost_virtqueue_stop(hdev, > vdev, > hdev->vqs + i, > hdev->vq_index + i); > } > > Both vhost-used and vhost-vdpa set_status(0) at > ->vhost_dev_start(hdev, false). It cleans virtqueue state in vdpa so > they are not recoverable at vhost_virtqueue_stop->get_vring_base, and > I think it is too late for vdpa devices to change it. I guess > vhost-user devices do not lose the state there, but I did not test. > > I call VHOST_VDPA_SUSPEND here so vhost_vdpa_dev_start looks more > similar to vhost_user_dev_start. We can make > vhost_vdpa_dev_start(false) to suspend the device instead. But then we > need to reset it after getting the indexes. That's why I added > vhost_vdpa_reset_status, but I admit it is neither the cleanest > approach nor the best name to it. I wonder if we can simply suspend in vhost_net_stop() if we know the parent can stop? Thanks > > Adding Maxime, RFC here so we can make -vdpa and -user not to divert too much. > >>> + data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1; >>> + cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ? >>> + n->max_ncs - n->max_queue_pairs : 0; >>> + vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq); >>> + >>> + peer = s->nc.peer; >>> + for (int i = 0; i < data_queue_pairs + cvq; i++) { >>> + VhostVDPAState *vdpa_state; >>> + NetClientState *nc; >>> + >>> + if (i < data_queue_pairs) { >>> + nc = qemu_get_peer(peer, i); >>> + } else { >>> + nc = qemu_get_peer(peer, n->max_queue_pairs); >>> + } >>> + >>> + vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc); >>> + vdpa_state->vhost_vdpa.shadow_data = enable; >>> + >>> + if (i < data_queue_pairs) { >>> + /* Do not override CVQ shadow_vqs_enabled */ >>> + vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable; >>> + } >>> + } >>> + >>> + r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq); >>> + if (unlikely(r < 0)) { >>> + error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r); >>> + } >>> +} >>> + >>> +static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data) >>> +{ >>> + MigrationState *migration = data; >>> + VhostVDPAState *s = container_of(notifier, VhostVDPAState, >>> + migration_state); >>> + >>> + switch (migration->state) { >>> + case MIGRATION_STATUS_SETUP: >>> + vhost_vdpa_net_log_global_enable(s, true); >>> + return; >>> + >>> + case MIGRATION_STATUS_CANCELLING: >>> + case MIGRATION_STATUS_CANCELLED: >>> + case MIGRATION_STATUS_FAILED: >>> + vhost_vdpa_net_log_global_enable(s, false); >> Do we need to recover here? >> > I may be missing something, but the device is fully reset and restored > in these cases. > > CCing Juan and D. Gilbert, a review would be appreciated to check if > this covers all the cases. > > Thanks! > > >> Thanks >> >>> + return; >>> + }; >>> +} >>> + >>> static void vhost_vdpa_net_data_start_first(VhostVDPAState *s) >>> { >>> struct vhost_vdpa *v = &s->vhost_vdpa; >>> >>> + if (v->feature_log) { >>> + add_migration_state_change_notifier(&s->migration_state); >>> + } >>> + >>> if (v->shadow_vqs_enabled) { >>> v->iova_tree = vhost_iova_tree_new(v->iova_range.first, >>> v->iova_range.last); >>> @@ -280,6 +359,10 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc) >>> >>> assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA); >>> >>> + if (s->vhost_vdpa.index == 0 && s->vhost_vdpa.feature_log) { >>> + remove_migration_state_change_notifier(&s->migration_state); >>> + } >>> + >>> dev = s->vhost_vdpa.dev; >>> if (dev->vq_index + dev->nvqs == dev->vq_index_end) { >>> g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete); >>> @@ -767,6 +850,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer, >>> s->vhost_vdpa.device_fd = vdpa_device_fd; >>> s->vhost_vdpa.index = queue_pair_index; >>> s->always_svq = svq; >>> + s->migration_state.notify = vdpa_net_migration_state_notifier; >>> s->vhost_vdpa.shadow_vqs_enabled = svq; >>> s->vhost_vdpa.iova_range = iova_range; >>> s->vhost_vdpa.shadow_data = svq; >>> -- >>> 2.31.1 >>> >>
On Mon, Jan 16, 2023 at 7:51 AM Jason Wang <jasowang@redhat.com> wrote: > > > 在 2023/1/13 17:00, Eugenio Perez Martin 写道: > > On Fri, Jan 13, 2023 at 5:55 AM Jason Wang <jasowang@redhat.com> wrote: > >> On Fri, Jan 13, 2023 at 1:25 AM Eugenio Pérez <eperezma@redhat.com> wrote: > >>> This allows net to restart the device backend to configure SVQ on it. > >>> > >>> Ideally, these changes should not be net specific. However, the vdpa net > >>> backend is the one with enough knowledge to configure everything because > >>> of some reasons: > >>> * Queues might need to be shadowed or not depending on its kind (control > >>> vs data). > >>> * Queues need to share the same map translations (iova tree). > >>> > >>> Because of that it is cleaner to restart the whole net backend and > >>> configure again as expected, similar to how vhost-kernel moves between > >>> userspace and passthrough. > >>> > >>> If more kinds of devices need dynamic switching to SVQ we can create a > >>> callback struct like VhostOps and move most of the code there. > >>> VhostOps cannot be reused since all vdpa backend share them, and to > >>> personalize just for networking would be too heavy. > >>> > >>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com> > >>> --- > >>> net/vhost-vdpa.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++ > >>> 1 file changed, 84 insertions(+) > >>> > >>> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c > >>> index 5d7ad6e4d7..f38532b1df 100644 > >>> --- a/net/vhost-vdpa.c > >>> +++ b/net/vhost-vdpa.c > >>> @@ -26,6 +26,8 @@ > >>> #include <err.h> > >>> #include "standard-headers/linux/virtio_net.h" > >>> #include "monitor/monitor.h" > >>> +#include "migration/migration.h" > >>> +#include "migration/misc.h" > >>> #include "migration/blocker.h" > >>> #include "hw/virtio/vhost.h" > >>> > >>> @@ -33,6 +35,7 @@ > >>> typedef struct VhostVDPAState { > >>> NetClientState nc; > >>> struct vhost_vdpa vhost_vdpa; > >>> + Notifier migration_state; > >>> Error *migration_blocker; > >>> VHostNetState *vhost_net; > >>> > >>> @@ -243,10 +246,86 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s) > >>> return DO_UPCAST(VhostVDPAState, nc, nc0); > >>> } > >>> > >>> +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable) > >>> +{ > >>> + struct vhost_vdpa *v = &s->vhost_vdpa; > >>> + VirtIONet *n; > >>> + VirtIODevice *vdev; > >>> + int data_queue_pairs, cvq, r; > >>> + NetClientState *peer; > >>> + > >>> + /* We are only called on the first data vqs and only if x-svq is not set */ > >>> + if (s->vhost_vdpa.shadow_vqs_enabled == enable) { > >>> + return; > >>> + } > >>> + > >>> + vdev = v->dev->vdev; > >>> + n = VIRTIO_NET(vdev); > >>> + if (!n->vhost_started) { > >>> + return; > >>> + } > >>> + > >>> + if (enable) { > >>> + ioctl(v->device_fd, VHOST_VDPA_SUSPEND); > >> Do we need to check if the device is started or not here? > >> > > v->vhost_started is checked right above, right? > > > Right, I miss that. > > > > > >>> + } > >> I'm not sure I understand the reason for vhost_net_stop() after a > >> VHOST_VDPA_SUSPEND. It looks to me those functions are duplicated. > >> > > I think this is really worth exploring, and it would have been clearer > > if I didn't squash the vhost_reset_status commit by mistake :). > > > > Looking at qemu master vhost.c:vhost_dev_stop: > > if (hdev->vhost_ops->vhost_dev_start) { > > hdev->vhost_ops->vhost_dev_start(hdev, false); > > } > > if (vrings) { > > vhost_dev_set_vring_enable(hdev, false); > > } > > for (i = 0; i < hdev->nvqs; ++i) { > > vhost_virtqueue_stop(hdev, > > vdev, > > hdev->vqs + i, > > hdev->vq_index + i); > > } > > > > Both vhost-used and vhost-vdpa set_status(0) at > > ->vhost_dev_start(hdev, false). It cleans virtqueue state in vdpa so > > they are not recoverable at vhost_virtqueue_stop->get_vring_base, and > > I think it is too late for vdpa devices to change it. I guess > > vhost-user devices do not lose the state there, but I did not test. > > > > I call VHOST_VDPA_SUSPEND here so vhost_vdpa_dev_start looks more > > similar to vhost_user_dev_start. We can make > > vhost_vdpa_dev_start(false) to suspend the device instead. But then we > > need to reset it after getting the indexes. That's why I added > > vhost_vdpa_reset_status, but I admit it is neither the cleanest > > approach nor the best name to it. > > > I wonder if we can simply suspend in vhost_net_stop() if we know the > parent can stop? > Sure, that's possible, I'll move that code to vhost_net_stop. Thanks! > Thanks > > > > > > Adding Maxime, RFC here so we can make -vdpa and -user not to divert too much. > > > >>> + data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1; > >>> + cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ? > >>> + n->max_ncs - n->max_queue_pairs : 0; > >>> + vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq); > >>> + > >>> + peer = s->nc.peer; > >>> + for (int i = 0; i < data_queue_pairs + cvq; i++) { > >>> + VhostVDPAState *vdpa_state; > >>> + NetClientState *nc; > >>> + > >>> + if (i < data_queue_pairs) { > >>> + nc = qemu_get_peer(peer, i); > >>> + } else { > >>> + nc = qemu_get_peer(peer, n->max_queue_pairs); > >>> + } > >>> + > >>> + vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc); > >>> + vdpa_state->vhost_vdpa.shadow_data = enable; > >>> + > >>> + if (i < data_queue_pairs) { > >>> + /* Do not override CVQ shadow_vqs_enabled */ > >>> + vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable; > >>> + } > >>> + } > >>> + > >>> + r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq); > >>> + if (unlikely(r < 0)) { > >>> + error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r); > >>> + } > >>> +} > >>> + > >>> +static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data) > >>> +{ > >>> + MigrationState *migration = data; > >>> + VhostVDPAState *s = container_of(notifier, VhostVDPAState, > >>> + migration_state); > >>> + > >>> + switch (migration->state) { > >>> + case MIGRATION_STATUS_SETUP: > >>> + vhost_vdpa_net_log_global_enable(s, true); > >>> + return; > >>> + > >>> + case MIGRATION_STATUS_CANCELLING: > >>> + case MIGRATION_STATUS_CANCELLED: > >>> + case MIGRATION_STATUS_FAILED: > >>> + vhost_vdpa_net_log_global_enable(s, false); > >> Do we need to recover here? > >> > > I may be missing something, but the device is fully reset and restored > > in these cases. > > > > CCing Juan and D. Gilbert, a review would be appreciated to check if > > this covers all the cases. > > > > Thanks! > > > > > >> Thanks > >> > >>> + return; > >>> + }; > >>> +} > >>> + > >>> static void vhost_vdpa_net_data_start_first(VhostVDPAState *s) > >>> { > >>> struct vhost_vdpa *v = &s->vhost_vdpa; > >>> > >>> + if (v->feature_log) { > >>> + add_migration_state_change_notifier(&s->migration_state); > >>> + } > >>> + > >>> if (v->shadow_vqs_enabled) { > >>> v->iova_tree = vhost_iova_tree_new(v->iova_range.first, > >>> v->iova_range.last); > >>> @@ -280,6 +359,10 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc) > >>> > >>> assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA); > >>> > >>> + if (s->vhost_vdpa.index == 0 && s->vhost_vdpa.feature_log) { > >>> + remove_migration_state_change_notifier(&s->migration_state); > >>> + } > >>> + > >>> dev = s->vhost_vdpa.dev; > >>> if (dev->vq_index + dev->nvqs == dev->vq_index_end) { > >>> g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete); > >>> @@ -767,6 +850,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer, > >>> s->vhost_vdpa.device_fd = vdpa_device_fd; > >>> s->vhost_vdpa.index = queue_pair_index; > >>> s->always_svq = svq; > >>> + s->migration_state.notify = vdpa_net_migration_state_notifier; > >>> s->vhost_vdpa.shadow_vqs_enabled = svq; > >>> s->vhost_vdpa.iova_range = iova_range; > >>> s->vhost_vdpa.shadow_data = svq; > >>> -- > >>> 2.31.1 > >>> > >> >
* Eugenio Perez Martin (eperezma@redhat.com) wrote: > On Fri, Jan 13, 2023 at 5:55 AM Jason Wang <jasowang@redhat.com> wrote: > > > > On Fri, Jan 13, 2023 at 1:25 AM Eugenio Pérez <eperezma@redhat.com> wrote: > > > > > > This allows net to restart the device backend to configure SVQ on it. > > > > > > Ideally, these changes should not be net specific. However, the vdpa net > > > backend is the one with enough knowledge to configure everything because > > > of some reasons: > > > * Queues might need to be shadowed or not depending on its kind (control > > > vs data). > > > * Queues need to share the same map translations (iova tree). > > > > > > Because of that it is cleaner to restart the whole net backend and > > > configure again as expected, similar to how vhost-kernel moves between > > > userspace and passthrough. > > > > > > If more kinds of devices need dynamic switching to SVQ we can create a > > > callback struct like VhostOps and move most of the code there. > > > VhostOps cannot be reused since all vdpa backend share them, and to > > > personalize just for networking would be too heavy. > > > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com> > > > --- > > > net/vhost-vdpa.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++ > > > 1 file changed, 84 insertions(+) > > > > > > diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c > > > index 5d7ad6e4d7..f38532b1df 100644 > > > --- a/net/vhost-vdpa.c > > > +++ b/net/vhost-vdpa.c > > > @@ -26,6 +26,8 @@ > > > #include <err.h> > > > #include "standard-headers/linux/virtio_net.h" > > > #include "monitor/monitor.h" > > > +#include "migration/migration.h" > > > +#include "migration/misc.h" > > > #include "migration/blocker.h" > > > #include "hw/virtio/vhost.h" > > > > > > @@ -33,6 +35,7 @@ > > > typedef struct VhostVDPAState { > > > NetClientState nc; > > > struct vhost_vdpa vhost_vdpa; > > > + Notifier migration_state; > > > Error *migration_blocker; > > > VHostNetState *vhost_net; > > > > > > @@ -243,10 +246,86 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s) > > > return DO_UPCAST(VhostVDPAState, nc, nc0); > > > } > > > > > > +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable) > > > +{ > > > + struct vhost_vdpa *v = &s->vhost_vdpa; > > > + VirtIONet *n; > > > + VirtIODevice *vdev; > > > + int data_queue_pairs, cvq, r; > > > + NetClientState *peer; > > > + > > > + /* We are only called on the first data vqs and only if x-svq is not set */ > > > + if (s->vhost_vdpa.shadow_vqs_enabled == enable) { > > > + return; > > > + } > > > + > > > + vdev = v->dev->vdev; > > > + n = VIRTIO_NET(vdev); > > > + if (!n->vhost_started) { > > > + return; > > > + } > > > + > > > + if (enable) { > > > + ioctl(v->device_fd, VHOST_VDPA_SUSPEND); > > > > Do we need to check if the device is started or not here? > > > > v->vhost_started is checked right above, right? > > > > + } > > > > I'm not sure I understand the reason for vhost_net_stop() after a > > VHOST_VDPA_SUSPEND. It looks to me those functions are duplicated. > > > > I think this is really worth exploring, and it would have been clearer > if I didn't squash the vhost_reset_status commit by mistake :). > > Looking at qemu master vhost.c:vhost_dev_stop: > if (hdev->vhost_ops->vhost_dev_start) { > hdev->vhost_ops->vhost_dev_start(hdev, false); > } > if (vrings) { > vhost_dev_set_vring_enable(hdev, false); > } > for (i = 0; i < hdev->nvqs; ++i) { > vhost_virtqueue_stop(hdev, > vdev, > hdev->vqs + i, > hdev->vq_index + i); > } > > Both vhost-used and vhost-vdpa set_status(0) at > ->vhost_dev_start(hdev, false). It cleans virtqueue state in vdpa so > they are not recoverable at vhost_virtqueue_stop->get_vring_base, and > I think it is too late for vdpa devices to change it. I guess > vhost-user devices do not lose the state there, but I did not test. > > I call VHOST_VDPA_SUSPEND here so vhost_vdpa_dev_start looks more > similar to vhost_user_dev_start. We can make > vhost_vdpa_dev_start(false) to suspend the device instead. But then we > need to reset it after getting the indexes. That's why I added > vhost_vdpa_reset_status, but I admit it is neither the cleanest > approach nor the best name to it. > > Adding Maxime, RFC here so we can make -vdpa and -user not to divert too much. > > > > + data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1; > > > + cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ? > > > + n->max_ncs - n->max_queue_pairs : 0; > > > + vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq); > > > + > > > + peer = s->nc.peer; > > > + for (int i = 0; i < data_queue_pairs + cvq; i++) { > > > + VhostVDPAState *vdpa_state; > > > + NetClientState *nc; > > > + > > > + if (i < data_queue_pairs) { > > > + nc = qemu_get_peer(peer, i); > > > + } else { > > > + nc = qemu_get_peer(peer, n->max_queue_pairs); > > > + } > > > + > > > + vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc); > > > + vdpa_state->vhost_vdpa.shadow_data = enable; > > > + > > > + if (i < data_queue_pairs) { > > > + /* Do not override CVQ shadow_vqs_enabled */ > > > + vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable; > > > + } > > > + } > > > + > > > + r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq); > > > + if (unlikely(r < 0)) { > > > + error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r); > > > + } > > > +} > > > + > > > +static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data) > > > +{ > > > + MigrationState *migration = data; > > > + VhostVDPAState *s = container_of(notifier, VhostVDPAState, > > > + migration_state); > > > + > > > + switch (migration->state) { > > > + case MIGRATION_STATUS_SETUP: > > > + vhost_vdpa_net_log_global_enable(s, true); > > > + return; > > > + > > > + case MIGRATION_STATUS_CANCELLING: > > > + case MIGRATION_STATUS_CANCELLED: > > > + case MIGRATION_STATUS_FAILED: > > > + vhost_vdpa_net_log_global_enable(s, false); > > > > Do we need to recover here? > > > > I may be missing something, but the device is fully reset and restored > in these cases. > > CCing Juan and D. Gilbert, a review would be appreciated to check if > this covers all the cases. I'm surprised I'm not seeing an entry for MIGRATION_STATUS_COMPLETED there. You might consider: if (migration_in_setup(s)) { vhost_vdpa_net_log_global_enable(s, true); } else if (migration_has_finished(s) || migration_has_failed(s)) { vhost_vdpa_net_log_global_enable(s, false); } I'm not too sure what will happen in your world with postcopy; it's worth testing, just remember on the source you don't want to be changing guest memory when you're in the postcopy phase. Dave > Thanks! > > > > Thanks > > > > > + return; > > > + }; > > > +} > > > + > > > static void vhost_vdpa_net_data_start_first(VhostVDPAState *s) > > > { > > > struct vhost_vdpa *v = &s->vhost_vdpa; > > > > > > + if (v->feature_log) { > > > + add_migration_state_change_notifier(&s->migration_state); > > > + } > > > + > > > if (v->shadow_vqs_enabled) { > > > v->iova_tree = vhost_iova_tree_new(v->iova_range.first, > > > v->iova_range.last); > > > @@ -280,6 +359,10 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc) > > > > > > assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA); > > > > > > + if (s->vhost_vdpa.index == 0 && s->vhost_vdpa.feature_log) { > > > + remove_migration_state_change_notifier(&s->migration_state); > > > + } > > > + > > > dev = s->vhost_vdpa.dev; > > > if (dev->vq_index + dev->nvqs == dev->vq_index_end) { > > > g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete); > > > @@ -767,6 +850,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer, > > > s->vhost_vdpa.device_fd = vdpa_device_fd; > > > s->vhost_vdpa.index = queue_pair_index; > > > s->always_svq = svq; > > > + s->migration_state.notify = vdpa_net_migration_state_notifier; > > > s->vhost_vdpa.shadow_vqs_enabled = svq; > > > s->vhost_vdpa.iova_range = iova_range; > > > s->vhost_vdpa.shadow_data = svq; > > > -- > > > 2.31.1 > > > > > > > >
On Tue, Jan 17, 2023 at 10:58 AM Dr. David Alan Gilbert <dgilbert@redhat.com> wrote: > > * Eugenio Perez Martin (eperezma@redhat.com) wrote: > > On Fri, Jan 13, 2023 at 5:55 AM Jason Wang <jasowang@redhat.com> wrote: > > > > > > On Fri, Jan 13, 2023 at 1:25 AM Eugenio Pérez <eperezma@redhat.com> wrote: > > > > > > > > This allows net to restart the device backend to configure SVQ on it. > > > > > > > > Ideally, these changes should not be net specific. However, the vdpa net > > > > backend is the one with enough knowledge to configure everything because > > > > of some reasons: > > > > * Queues might need to be shadowed or not depending on its kind (control > > > > vs data). > > > > * Queues need to share the same map translations (iova tree). > > > > > > > > Because of that it is cleaner to restart the whole net backend and > > > > configure again as expected, similar to how vhost-kernel moves between > > > > userspace and passthrough. > > > > > > > > If more kinds of devices need dynamic switching to SVQ we can create a > > > > callback struct like VhostOps and move most of the code there. > > > > VhostOps cannot be reused since all vdpa backend share them, and to > > > > personalize just for networking would be too heavy. > > > > > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com> > > > > --- > > > > net/vhost-vdpa.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++ > > > > 1 file changed, 84 insertions(+) > > > > > > > > diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c > > > > index 5d7ad6e4d7..f38532b1df 100644 > > > > --- a/net/vhost-vdpa.c > > > > +++ b/net/vhost-vdpa.c > > > > @@ -26,6 +26,8 @@ > > > > #include <err.h> > > > > #include "standard-headers/linux/virtio_net.h" > > > > #include "monitor/monitor.h" > > > > +#include "migration/migration.h" > > > > +#include "migration/misc.h" > > > > #include "migration/blocker.h" > > > > #include "hw/virtio/vhost.h" > > > > > > > > @@ -33,6 +35,7 @@ > > > > typedef struct VhostVDPAState { > > > > NetClientState nc; > > > > struct vhost_vdpa vhost_vdpa; > > > > + Notifier migration_state; > > > > Error *migration_blocker; > > > > VHostNetState *vhost_net; > > > > > > > > @@ -243,10 +246,86 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s) > > > > return DO_UPCAST(VhostVDPAState, nc, nc0); > > > > } > > > > > > > > +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable) > > > > +{ > > > > + struct vhost_vdpa *v = &s->vhost_vdpa; > > > > + VirtIONet *n; > > > > + VirtIODevice *vdev; > > > > + int data_queue_pairs, cvq, r; > > > > + NetClientState *peer; > > > > + > > > > + /* We are only called on the first data vqs and only if x-svq is not set */ > > > > + if (s->vhost_vdpa.shadow_vqs_enabled == enable) { > > > > + return; > > > > + } > > > > + > > > > + vdev = v->dev->vdev; > > > > + n = VIRTIO_NET(vdev); > > > > + if (!n->vhost_started) { > > > > + return; > > > > + } > > > > + > > > > + if (enable) { > > > > + ioctl(v->device_fd, VHOST_VDPA_SUSPEND); > > > > > > Do we need to check if the device is started or not here? > > > > > > > v->vhost_started is checked right above, right? > > > > > > + } > > > > > > I'm not sure I understand the reason for vhost_net_stop() after a > > > VHOST_VDPA_SUSPEND. It looks to me those functions are duplicated. > > > > > > > I think this is really worth exploring, and it would have been clearer > > if I didn't squash the vhost_reset_status commit by mistake :). > > > > Looking at qemu master vhost.c:vhost_dev_stop: > > if (hdev->vhost_ops->vhost_dev_start) { > > hdev->vhost_ops->vhost_dev_start(hdev, false); > > } > > if (vrings) { > > vhost_dev_set_vring_enable(hdev, false); > > } > > for (i = 0; i < hdev->nvqs; ++i) { > > vhost_virtqueue_stop(hdev, > > vdev, > > hdev->vqs + i, > > hdev->vq_index + i); > > } > > > > Both vhost-used and vhost-vdpa set_status(0) at > > ->vhost_dev_start(hdev, false). It cleans virtqueue state in vdpa so > > they are not recoverable at vhost_virtqueue_stop->get_vring_base, and > > I think it is too late for vdpa devices to change it. I guess > > vhost-user devices do not lose the state there, but I did not test. > > > > I call VHOST_VDPA_SUSPEND here so vhost_vdpa_dev_start looks more > > similar to vhost_user_dev_start. We can make > > vhost_vdpa_dev_start(false) to suspend the device instead. But then we > > need to reset it after getting the indexes. That's why I added > > vhost_vdpa_reset_status, but I admit it is neither the cleanest > > approach nor the best name to it. > > > > Adding Maxime, RFC here so we can make -vdpa and -user not to divert too much. > > > > > > + data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1; > > > > + cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ? > > > > + n->max_ncs - n->max_queue_pairs : 0; > > > > + vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq); > > > > + > > > > + peer = s->nc.peer; > > > > + for (int i = 0; i < data_queue_pairs + cvq; i++) { > > > > + VhostVDPAState *vdpa_state; > > > > + NetClientState *nc; > > > > + > > > > + if (i < data_queue_pairs) { > > > > + nc = qemu_get_peer(peer, i); > > > > + } else { > > > > + nc = qemu_get_peer(peer, n->max_queue_pairs); > > > > + } > > > > + > > > > + vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc); > > > > + vdpa_state->vhost_vdpa.shadow_data = enable; > > > > + > > > > + if (i < data_queue_pairs) { > > > > + /* Do not override CVQ shadow_vqs_enabled */ > > > > + vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable; > > > > + } > > > > + } > > > > + > > > > + r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq); > > > > + if (unlikely(r < 0)) { > > > > + error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r); > > > > + } > > > > +} > > > > + > > > > +static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data) > > > > +{ > > > > + MigrationState *migration = data; > > > > + VhostVDPAState *s = container_of(notifier, VhostVDPAState, > > > > + migration_state); > > > > + > > > > + switch (migration->state) { > > > > + case MIGRATION_STATUS_SETUP: > > > > + vhost_vdpa_net_log_global_enable(s, true); > > > > + return; > > > > + > > > > + case MIGRATION_STATUS_CANCELLING: > > > > + case MIGRATION_STATUS_CANCELLED: > > > > + case MIGRATION_STATUS_FAILED: > > > > + vhost_vdpa_net_log_global_enable(s, false); > > > > > > Do we need to recover here? > > > > > > > I may be missing something, but the device is fully reset and restored > > in these cases. > > > > CCing Juan and D. Gilbert, a review would be appreciated to check if > > this covers all the cases. > > I'm surprised I'm not seeing an entry for MIGRATION_STATUS_COMPLETED > there. > > You might consider: > if (migration_in_setup(s)) { > vhost_vdpa_net_log_global_enable(s, true); > } else if (migration_has_finished(s) || migration_has_failed(s)) { > vhost_vdpa_net_log_global_enable(s, false); > } > Thank you very much for the input, I see this is definitely cleaner than my proposal. Just for completion here I need to handle differently has_finished vs has_failed because of recovery. This is easily achievable from your snippet so thank you very much. > I'm not too sure what will happen in your world with postcopy; it's > worth testing, just remember on the source you don't want to be changing > guest memory when you're in the postcopy phase. > If I'm not wrong postcopy is forbidden as long as there exists a vdpa device but I can check it for sure. Thanks! > Dave > > > Thanks! > > > > > > > Thanks > > > > > > > + return; > > > > + }; > > > > +} > > > > + > > > > static void vhost_vdpa_net_data_start_first(VhostVDPAState *s) > > > > { > > > > struct vhost_vdpa *v = &s->vhost_vdpa; > > > > > > > > + if (v->feature_log) { > > > > + add_migration_state_change_notifier(&s->migration_state); > > > > + } > > > > + > > > > if (v->shadow_vqs_enabled) { > > > > v->iova_tree = vhost_iova_tree_new(v->iova_range.first, > > > > v->iova_range.last); > > > > @@ -280,6 +359,10 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc) > > > > > > > > assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA); > > > > > > > > + if (s->vhost_vdpa.index == 0 && s->vhost_vdpa.feature_log) { > > > > + remove_migration_state_change_notifier(&s->migration_state); > > > > + } > > > > + > > > > dev = s->vhost_vdpa.dev; > > > > if (dev->vq_index + dev->nvqs == dev->vq_index_end) { > > > > g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete); > > > > @@ -767,6 +850,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer, > > > > s->vhost_vdpa.device_fd = vdpa_device_fd; > > > > s->vhost_vdpa.index = queue_pair_index; > > > > s->always_svq = svq; > > > > + s->migration_state.notify = vdpa_net_migration_state_notifier; > > > > s->vhost_vdpa.shadow_vqs_enabled = svq; > > > > s->vhost_vdpa.iova_range = iova_range; > > > > s->vhost_vdpa.shadow_data = svq; > > > > -- > > > > 2.31.1 > > > > > > > > > > > > > -- > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK >
* Eugenio Perez Martin (eperezma@redhat.com) wrote: > On Tue, Jan 17, 2023 at 10:58 AM Dr. David Alan Gilbert > <dgilbert@redhat.com> wrote: > > > > * Eugenio Perez Martin (eperezma@redhat.com) wrote: > > > On Fri, Jan 13, 2023 at 5:55 AM Jason Wang <jasowang@redhat.com> wrote: > > > > > > > > On Fri, Jan 13, 2023 at 1:25 AM Eugenio Pérez <eperezma@redhat.com> wrote: > > > > > > > > > > This allows net to restart the device backend to configure SVQ on it. > > > > > > > > > > Ideally, these changes should not be net specific. However, the vdpa net > > > > > backend is the one with enough knowledge to configure everything because > > > > > of some reasons: > > > > > * Queues might need to be shadowed or not depending on its kind (control > > > > > vs data). > > > > > * Queues need to share the same map translations (iova tree). > > > > > > > > > > Because of that it is cleaner to restart the whole net backend and > > > > > configure again as expected, similar to how vhost-kernel moves between > > > > > userspace and passthrough. > > > > > > > > > > If more kinds of devices need dynamic switching to SVQ we can create a > > > > > callback struct like VhostOps and move most of the code there. > > > > > VhostOps cannot be reused since all vdpa backend share them, and to > > > > > personalize just for networking would be too heavy. > > > > > > > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com> > > > > > --- > > > > > net/vhost-vdpa.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++ > > > > > 1 file changed, 84 insertions(+) > > > > > > > > > > diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c > > > > > index 5d7ad6e4d7..f38532b1df 100644 > > > > > --- a/net/vhost-vdpa.c > > > > > +++ b/net/vhost-vdpa.c > > > > > @@ -26,6 +26,8 @@ > > > > > #include <err.h> > > > > > #include "standard-headers/linux/virtio_net.h" > > > > > #include "monitor/monitor.h" > > > > > +#include "migration/migration.h" > > > > > +#include "migration/misc.h" > > > > > #include "migration/blocker.h" > > > > > #include "hw/virtio/vhost.h" > > > > > > > > > > @@ -33,6 +35,7 @@ > > > > > typedef struct VhostVDPAState { > > > > > NetClientState nc; > > > > > struct vhost_vdpa vhost_vdpa; > > > > > + Notifier migration_state; > > > > > Error *migration_blocker; > > > > > VHostNetState *vhost_net; > > > > > > > > > > @@ -243,10 +246,86 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s) > > > > > return DO_UPCAST(VhostVDPAState, nc, nc0); > > > > > } > > > > > > > > > > +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable) > > > > > +{ > > > > > + struct vhost_vdpa *v = &s->vhost_vdpa; > > > > > + VirtIONet *n; > > > > > + VirtIODevice *vdev; > > > > > + int data_queue_pairs, cvq, r; > > > > > + NetClientState *peer; > > > > > + > > > > > + /* We are only called on the first data vqs and only if x-svq is not set */ > > > > > + if (s->vhost_vdpa.shadow_vqs_enabled == enable) { > > > > > + return; > > > > > + } > > > > > + > > > > > + vdev = v->dev->vdev; > > > > > + n = VIRTIO_NET(vdev); > > > > > + if (!n->vhost_started) { > > > > > + return; > > > > > + } > > > > > + > > > > > + if (enable) { > > > > > + ioctl(v->device_fd, VHOST_VDPA_SUSPEND); > > > > > > > > Do we need to check if the device is started or not here? > > > > > > > > > > v->vhost_started is checked right above, right? > > > > > > > > + } > > > > > > > > I'm not sure I understand the reason for vhost_net_stop() after a > > > > VHOST_VDPA_SUSPEND. It looks to me those functions are duplicated. > > > > > > > > > > I think this is really worth exploring, and it would have been clearer > > > if I didn't squash the vhost_reset_status commit by mistake :). > > > > > > Looking at qemu master vhost.c:vhost_dev_stop: > > > if (hdev->vhost_ops->vhost_dev_start) { > > > hdev->vhost_ops->vhost_dev_start(hdev, false); > > > } > > > if (vrings) { > > > vhost_dev_set_vring_enable(hdev, false); > > > } > > > for (i = 0; i < hdev->nvqs; ++i) { > > > vhost_virtqueue_stop(hdev, > > > vdev, > > > hdev->vqs + i, > > > hdev->vq_index + i); > > > } > > > > > > Both vhost-used and vhost-vdpa set_status(0) at > > > ->vhost_dev_start(hdev, false). It cleans virtqueue state in vdpa so > > > they are not recoverable at vhost_virtqueue_stop->get_vring_base, and > > > I think it is too late for vdpa devices to change it. I guess > > > vhost-user devices do not lose the state there, but I did not test. > > > > > > I call VHOST_VDPA_SUSPEND here so vhost_vdpa_dev_start looks more > > > similar to vhost_user_dev_start. We can make > > > vhost_vdpa_dev_start(false) to suspend the device instead. But then we > > > need to reset it after getting the indexes. That's why I added > > > vhost_vdpa_reset_status, but I admit it is neither the cleanest > > > approach nor the best name to it. > > > > > > Adding Maxime, RFC here so we can make -vdpa and -user not to divert too much. > > > > > > > > + data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1; > > > > > + cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ? > > > > > + n->max_ncs - n->max_queue_pairs : 0; > > > > > + vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq); > > > > > + > > > > > + peer = s->nc.peer; > > > > > + for (int i = 0; i < data_queue_pairs + cvq; i++) { > > > > > + VhostVDPAState *vdpa_state; > > > > > + NetClientState *nc; > > > > > + > > > > > + if (i < data_queue_pairs) { > > > > > + nc = qemu_get_peer(peer, i); > > > > > + } else { > > > > > + nc = qemu_get_peer(peer, n->max_queue_pairs); > > > > > + } > > > > > + > > > > > + vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc); > > > > > + vdpa_state->vhost_vdpa.shadow_data = enable; > > > > > + > > > > > + if (i < data_queue_pairs) { > > > > > + /* Do not override CVQ shadow_vqs_enabled */ > > > > > + vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable; > > > > > + } > > > > > + } > > > > > + > > > > > + r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq); > > > > > + if (unlikely(r < 0)) { > > > > > + error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r); > > > > > + } > > > > > +} > > > > > + > > > > > +static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data) > > > > > +{ > > > > > + MigrationState *migration = data; > > > > > + VhostVDPAState *s = container_of(notifier, VhostVDPAState, > > > > > + migration_state); > > > > > + > > > > > + switch (migration->state) { > > > > > + case MIGRATION_STATUS_SETUP: > > > > > + vhost_vdpa_net_log_global_enable(s, true); > > > > > + return; > > > > > + > > > > > + case MIGRATION_STATUS_CANCELLING: > > > > > + case MIGRATION_STATUS_CANCELLED: > > > > > + case MIGRATION_STATUS_FAILED: > > > > > + vhost_vdpa_net_log_global_enable(s, false); > > > > > > > > Do we need to recover here? > > > > > > > > > > I may be missing something, but the device is fully reset and restored > > > in these cases. > > > > > > CCing Juan and D. Gilbert, a review would be appreciated to check if > > > this covers all the cases. > > > > I'm surprised I'm not seeing an entry for MIGRATION_STATUS_COMPLETED > > there. > > > > You might consider: > > if (migration_in_setup(s)) { > > vhost_vdpa_net_log_global_enable(s, true); > > } else if (migration_has_finished(s) || migration_has_failed(s)) { > > vhost_vdpa_net_log_global_enable(s, false); > > } > > > > Thank you very much for the input, I see this is definitely cleaner > than my proposal. > > Just for completion here I need to handle differently has_finished vs > has_failed because of recovery. This is easily achievable from your > snippet so thank you very much. > > > I'm not too sure what will happen in your world with postcopy; it's > > worth testing, just remember on the source you don't want to be changing > > guest memory when you're in the postcopy phase. > > > > If I'm not wrong postcopy is forbidden as long as there exists a vdpa > device but I can check it for sure. Ah yes, we don't want the vdpa writing into the destination RAM during the postcopy phase; I can imagine with shadow-queues you might be able to come up with a solution to that - but that's a complication for another time. Dave > Thanks! > > > > Dave > > > > > Thanks! > > > > > > > > > > Thanks > > > > > > > > > + return; > > > > > + }; > > > > > +} > > > > > + > > > > > static void vhost_vdpa_net_data_start_first(VhostVDPAState *s) > > > > > { > > > > > struct vhost_vdpa *v = &s->vhost_vdpa; > > > > > > > > > > + if (v->feature_log) { > > > > > + add_migration_state_change_notifier(&s->migration_state); > > > > > + } > > > > > + > > > > > if (v->shadow_vqs_enabled) { > > > > > v->iova_tree = vhost_iova_tree_new(v->iova_range.first, > > > > > v->iova_range.last); > > > > > @@ -280,6 +359,10 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc) > > > > > > > > > > assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA); > > > > > > > > > > + if (s->vhost_vdpa.index == 0 && s->vhost_vdpa.feature_log) { > > > > > + remove_migration_state_change_notifier(&s->migration_state); > > > > > + } > > > > > + > > > > > dev = s->vhost_vdpa.dev; > > > > > if (dev->vq_index + dev->nvqs == dev->vq_index_end) { > > > > > g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete); > > > > > @@ -767,6 +850,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer, > > > > > s->vhost_vdpa.device_fd = vdpa_device_fd; > > > > > s->vhost_vdpa.index = queue_pair_index; > > > > > s->always_svq = svq; > > > > > + s->migration_state.notify = vdpa_net_migration_state_notifier; > > > > > s->vhost_vdpa.shadow_vqs_enabled = svq; > > > > > s->vhost_vdpa.iova_range = iova_range; > > > > > s->vhost_vdpa.shadow_data = svq; > > > > > -- > > > > > 2.31.1 > > > > > > > > > > > > > > > > > > -- > > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK > > >
On 1/12/2023 9:24 AM, Eugenio Pérez wrote: > This allows net to restart the device backend to configure SVQ on it. > > Ideally, these changes should not be net specific. However, the vdpa net > backend is the one with enough knowledge to configure everything because > of some reasons: > * Queues might need to be shadowed or not depending on its kind (control > vs data). > * Queues need to share the same map translations (iova tree). > > Because of that it is cleaner to restart the whole net backend and > configure again as expected, similar to how vhost-kernel moves between > userspace and passthrough. > > If more kinds of devices need dynamic switching to SVQ we can create a > callback struct like VhostOps and move most of the code there. > VhostOps cannot be reused since all vdpa backend share them, and to > personalize just for networking would be too heavy. > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com> > --- > net/vhost-vdpa.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 84 insertions(+) > > diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c > index 5d7ad6e4d7..f38532b1df 100644 > --- a/net/vhost-vdpa.c > +++ b/net/vhost-vdpa.c > @@ -26,6 +26,8 @@ > #include <err.h> > #include "standard-headers/linux/virtio_net.h" > #include "monitor/monitor.h" > +#include "migration/migration.h" > +#include "migration/misc.h" > #include "migration/blocker.h" > #include "hw/virtio/vhost.h" > > @@ -33,6 +35,7 @@ > typedef struct VhostVDPAState { > NetClientState nc; > struct vhost_vdpa vhost_vdpa; > + Notifier migration_state; > Error *migration_blocker; > VHostNetState *vhost_net; > > @@ -243,10 +246,86 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s) > return DO_UPCAST(VhostVDPAState, nc, nc0); > } > > +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable) > +{ > + struct vhost_vdpa *v = &s->vhost_vdpa; > + VirtIONet *n; > + VirtIODevice *vdev; > + int data_queue_pairs, cvq, r; > + NetClientState *peer; > + > + /* We are only called on the first data vqs and only if x-svq is not set */ > + if (s->vhost_vdpa.shadow_vqs_enabled == enable) { > + return; > + } > + > + vdev = v->dev->vdev; > + n = VIRTIO_NET(vdev); > + if (!n->vhost_started) { > + return; > + } > + > + if (enable) { > + ioctl(v->device_fd, VHOST_VDPA_SUSPEND); > + } > + data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1; > + cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ? > + n->max_ncs - n->max_queue_pairs : 0; > + vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq); > + > + peer = s->nc.peer; > + for (int i = 0; i < data_queue_pairs + cvq; i++) { > + VhostVDPAState *vdpa_state; > + NetClientState *nc; > + > + if (i < data_queue_pairs) { > + nc = qemu_get_peer(peer, i); > + } else { > + nc = qemu_get_peer(peer, n->max_queue_pairs); > + } > + > + vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc); > + vdpa_state->vhost_vdpa.shadow_data = enable; > + > + if (i < data_queue_pairs) { > + /* Do not override CVQ shadow_vqs_enabled */ > + vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable; > + } > + } > + > + r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq); As the first revision, this method (vhost_net_stop followed by vhost_net_start) should be fine for software vhost-vdpa backend for e.g. vp_vdpa and vdpa_sim_net. However, I would like to get your attention that this method implies substantial blackout time for mode switching on real hardware - get a full cycle of device reset of getting memory mappings torn down, unpin & repin same set of pages, and set up new mapping would take very significant amount of time, especially for a large VM. Maybe we can do: 1) replace reset with the RESUME feature that was just added to the vhost-vdpa ioctls in kernel 2) add new vdpa ioctls to allow iova range rebound to new virtual address for QEMU's shadow vq or back to device's vq 3) use a light-weighted sequence of suspend+rebind+resume to switch mode on the fly instead of getting through the whole reset+restart cycle I suspect the same idea could even be used to address high live migration downtime seen on hardware vdpa device. What do you think? Thanks, -Siwei > + if (unlikely(r < 0)) { > + error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r); > + } > +} > + > +static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data) > +{ > + MigrationState *migration = data; > + VhostVDPAState *s = container_of(notifier, VhostVDPAState, > + migration_state); > + > + switch (migration->state) { > + case MIGRATION_STATUS_SETUP: > + vhost_vdpa_net_log_global_enable(s, true); > + return; > + > + case MIGRATION_STATUS_CANCELLING: > + case MIGRATION_STATUS_CANCELLED: > + case MIGRATION_STATUS_FAILED: > + vhost_vdpa_net_log_global_enable(s, false); > + return; > + }; > +} > + > static void vhost_vdpa_net_data_start_first(VhostVDPAState *s) > { > struct vhost_vdpa *v = &s->vhost_vdpa; > > + if (v->feature_log) { > + add_migration_state_change_notifier(&s->migration_state); > + } > + > if (v->shadow_vqs_enabled) { > v->iova_tree = vhost_iova_tree_new(v->iova_range.first, > v->iova_range.last); > @@ -280,6 +359,10 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc) > > assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA); > > + if (s->vhost_vdpa.index == 0 && s->vhost_vdpa.feature_log) { > + remove_migration_state_change_notifier(&s->migration_state); > + } > + > dev = s->vhost_vdpa.dev; > if (dev->vq_index + dev->nvqs == dev->vq_index_end) { > g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete); > @@ -767,6 +850,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer, > s->vhost_vdpa.device_fd = vdpa_device_fd; > s->vhost_vdpa.index = queue_pair_index; > s->always_svq = svq; > + s->migration_state.notify = vdpa_net_migration_state_notifier; > s->vhost_vdpa.shadow_vqs_enabled = svq; > s->vhost_vdpa.iova_range = iova_range; > s->vhost_vdpa.shadow_data = svq;
On Thu, Feb 2, 2023 at 2:53 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote: > > > > On 1/12/2023 9:24 AM, Eugenio Pérez wrote: > > This allows net to restart the device backend to configure SVQ on it. > > > > Ideally, these changes should not be net specific. However, the vdpa net > > backend is the one with enough knowledge to configure everything because > > of some reasons: > > * Queues might need to be shadowed or not depending on its kind (control > > vs data). > > * Queues need to share the same map translations (iova tree). > > > > Because of that it is cleaner to restart the whole net backend and > > configure again as expected, similar to how vhost-kernel moves between > > userspace and passthrough. > > > > If more kinds of devices need dynamic switching to SVQ we can create a > > callback struct like VhostOps and move most of the code there. > > VhostOps cannot be reused since all vdpa backend share them, and to > > personalize just for networking would be too heavy. > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com> > > --- > > net/vhost-vdpa.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++ > > 1 file changed, 84 insertions(+) > > > > diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c > > index 5d7ad6e4d7..f38532b1df 100644 > > --- a/net/vhost-vdpa.c > > +++ b/net/vhost-vdpa.c > > @@ -26,6 +26,8 @@ > > #include <err.h> > > #include "standard-headers/linux/virtio_net.h" > > #include "monitor/monitor.h" > > +#include "migration/migration.h" > > +#include "migration/misc.h" > > #include "migration/blocker.h" > > #include "hw/virtio/vhost.h" > > > > @@ -33,6 +35,7 @@ > > typedef struct VhostVDPAState { > > NetClientState nc; > > struct vhost_vdpa vhost_vdpa; > > + Notifier migration_state; > > Error *migration_blocker; > > VHostNetState *vhost_net; > > > > @@ -243,10 +246,86 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s) > > return DO_UPCAST(VhostVDPAState, nc, nc0); > > } > > > > +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable) > > +{ > > + struct vhost_vdpa *v = &s->vhost_vdpa; > > + VirtIONet *n; > > + VirtIODevice *vdev; > > + int data_queue_pairs, cvq, r; > > + NetClientState *peer; > > + > > + /* We are only called on the first data vqs and only if x-svq is not set */ > > + if (s->vhost_vdpa.shadow_vqs_enabled == enable) { > > + return; > > + } > > + > > + vdev = v->dev->vdev; > > + n = VIRTIO_NET(vdev); > > + if (!n->vhost_started) { > > + return; > > + } > > + > > + if (enable) { > > + ioctl(v->device_fd, VHOST_VDPA_SUSPEND); > > + } > > + data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1; > > + cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ? > > + n->max_ncs - n->max_queue_pairs : 0; > > + vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq); > > + > > + peer = s->nc.peer; > > + for (int i = 0; i < data_queue_pairs + cvq; i++) { > > + VhostVDPAState *vdpa_state; > > + NetClientState *nc; > > + > > + if (i < data_queue_pairs) { > > + nc = qemu_get_peer(peer, i); > > + } else { > > + nc = qemu_get_peer(peer, n->max_queue_pairs); > > + } > > + > > + vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc); > > + vdpa_state->vhost_vdpa.shadow_data = enable; > > + > > + if (i < data_queue_pairs) { > > + /* Do not override CVQ shadow_vqs_enabled */ > > + vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable; > > + } > > + } > > + > > + r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq); > As the first revision, this method (vhost_net_stop followed by > vhost_net_start) should be fine for software vhost-vdpa backend for e.g. > vp_vdpa and vdpa_sim_net. However, I would like to get your attention > that this method implies substantial blackout time for mode switching on > real hardware - get a full cycle of device reset of getting memory > mappings torn down, unpin & repin same set of pages, and set up new > mapping would take very significant amount of time, especially for a > large VM. Maybe we can do: > Right, I think this is something that deserves optimization in the future. Note that we must replace the mappings anyway, with all passthrough queues stopped. This is because SVQ vrings are not in the guest space. The pin can be skipped though, I think that's a low hand fruit here. If any, we can track guest's IOVA and add SVQ vrings in a hole. If guest's IOVA layout changes, we can translate it then to a new location. That way we only need one map operation in the worst case. I'm omitting the lookup time here, but I still should be worth it. But as you mention I think it is not worth complicating this series and we can think about it on top. We can start building it on top of your suggestions for sure. > 1) replace reset with the RESUME feature that was just added to the > vhost-vdpa ioctls in kernel We cannot change vring addresses just with a SUSPEND / RESUME. We could do it with the VIRTIO_F_RING_RESET feature though. Would it be advantageous to the device? > 2) add new vdpa ioctls to allow iova range rebound to new virtual > address for QEMU's shadow vq or back to device's vq Actually, if the device supports ASID we can allocate ASID 1 to that purpose. At this moment only CVQ vrings and control buffers are there when the device is passthrough. But this doesn't solve the problem if we need to send all SVQ translation to the device on-chip IOMMU, doesn't it? We must clear all of it and send the new one to the device anyway. > 3) use a light-weighted sequence of suspend+rebind+resume to switch mode > on the fly instead of getting through the whole reset+restart cycle > I think this is the same as 1, isn't it? > I suspect the same idea could even be used to address high live > migration downtime seen on hardware vdpa device. What do you think? > I think this is a great start for sure! Some questions: a) Is the time on reprogramming on-chip IOMMU comparable to program regular IOMMU? If it is the case it should be easier to find vdpa devices with support for _F_RESET soon. b) Not to merge on master, but it is possible to add an artificial delay on vdpa_sim that simulates the properties of the delay of IOMMU? In that line, have you observed if it is linear with the size of the memory, with the number of maps, other factors..? Thanks! > Thanks, > -Siwei > > > + if (unlikely(r < 0)) { > > + error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r); > > + } > > +} > > + > > +static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data) > > +{ > > + MigrationState *migration = data; > > + VhostVDPAState *s = container_of(notifier, VhostVDPAState, > > + migration_state); > > + > > + switch (migration->state) { > > + case MIGRATION_STATUS_SETUP: > > + vhost_vdpa_net_log_global_enable(s, true); > > + return; > > + > > + case MIGRATION_STATUS_CANCELLING: > > + case MIGRATION_STATUS_CANCELLED: > > + case MIGRATION_STATUS_FAILED: > > + vhost_vdpa_net_log_global_enable(s, false); > > + return; > > + }; > > +} > > + > > static void vhost_vdpa_net_data_start_first(VhostVDPAState *s) > > { > > struct vhost_vdpa *v = &s->vhost_vdpa; > > > > + if (v->feature_log) { > > + add_migration_state_change_notifier(&s->migration_state); > > + } > > + > > if (v->shadow_vqs_enabled) { > > v->iova_tree = vhost_iova_tree_new(v->iova_range.first, > > v->iova_range.last); > > @@ -280,6 +359,10 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc) > > > > assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA); > > > > + if (s->vhost_vdpa.index == 0 && s->vhost_vdpa.feature_log) { > > + remove_migration_state_change_notifier(&s->migration_state); > > + } > > + > > dev = s->vhost_vdpa.dev; > > if (dev->vq_index + dev->nvqs == dev->vq_index_end) { > > g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete); > > @@ -767,6 +850,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer, > > s->vhost_vdpa.device_fd = vdpa_device_fd; > > s->vhost_vdpa.index = queue_pair_index; > > s->always_svq = svq; > > + s->migration_state.notify = vdpa_net_migration_state_notifier; > > s->vhost_vdpa.shadow_vqs_enabled = svq; > > s->vhost_vdpa.iova_range = iova_range; > > s->vhost_vdpa.shadow_data = svq; >
On 2/2/2023 7:28 AM, Eugenio Perez Martin wrote: > On Thu, Feb 2, 2023 at 2:53 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote: >> >> >> On 1/12/2023 9:24 AM, Eugenio Pérez wrote: >>> This allows net to restart the device backend to configure SVQ on it. >>> >>> Ideally, these changes should not be net specific. However, the vdpa net >>> backend is the one with enough knowledge to configure everything because >>> of some reasons: >>> * Queues might need to be shadowed or not depending on its kind (control >>> vs data). >>> * Queues need to share the same map translations (iova tree). >>> >>> Because of that it is cleaner to restart the whole net backend and >>> configure again as expected, similar to how vhost-kernel moves between >>> userspace and passthrough. >>> >>> If more kinds of devices need dynamic switching to SVQ we can create a >>> callback struct like VhostOps and move most of the code there. >>> VhostOps cannot be reused since all vdpa backend share them, and to >>> personalize just for networking would be too heavy. >>> >>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com> >>> --- >>> net/vhost-vdpa.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++ >>> 1 file changed, 84 insertions(+) >>> >>> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c >>> index 5d7ad6e4d7..f38532b1df 100644 >>> --- a/net/vhost-vdpa.c >>> +++ b/net/vhost-vdpa.c >>> @@ -26,6 +26,8 @@ >>> #include <err.h> >>> #include "standard-headers/linux/virtio_net.h" >>> #include "monitor/monitor.h" >>> +#include "migration/migration.h" >>> +#include "migration/misc.h" >>> #include "migration/blocker.h" >>> #include "hw/virtio/vhost.h" >>> >>> @@ -33,6 +35,7 @@ >>> typedef struct VhostVDPAState { >>> NetClientState nc; >>> struct vhost_vdpa vhost_vdpa; >>> + Notifier migration_state; >>> Error *migration_blocker; >>> VHostNetState *vhost_net; >>> >>> @@ -243,10 +246,86 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s) >>> return DO_UPCAST(VhostVDPAState, nc, nc0); >>> } >>> >>> +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable) >>> +{ >>> + struct vhost_vdpa *v = &s->vhost_vdpa; >>> + VirtIONet *n; >>> + VirtIODevice *vdev; >>> + int data_queue_pairs, cvq, r; >>> + NetClientState *peer; >>> + >>> + /* We are only called on the first data vqs and only if x-svq is not set */ >>> + if (s->vhost_vdpa.shadow_vqs_enabled == enable) { >>> + return; >>> + } >>> + >>> + vdev = v->dev->vdev; >>> + n = VIRTIO_NET(vdev); >>> + if (!n->vhost_started) { >>> + return; >>> + } >>> + >>> + if (enable) { >>> + ioctl(v->device_fd, VHOST_VDPA_SUSPEND); >>> + } >>> + data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1; >>> + cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ? >>> + n->max_ncs - n->max_queue_pairs : 0; >>> + vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq); >>> + >>> + peer = s->nc.peer; >>> + for (int i = 0; i < data_queue_pairs + cvq; i++) { >>> + VhostVDPAState *vdpa_state; >>> + NetClientState *nc; >>> + >>> + if (i < data_queue_pairs) { >>> + nc = qemu_get_peer(peer, i); >>> + } else { >>> + nc = qemu_get_peer(peer, n->max_queue_pairs); >>> + } >>> + >>> + vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc); >>> + vdpa_state->vhost_vdpa.shadow_data = enable; >>> + >>> + if (i < data_queue_pairs) { >>> + /* Do not override CVQ shadow_vqs_enabled */ >>> + vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable; >>> + } >>> + } >>> + >>> + r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq); >> As the first revision, this method (vhost_net_stop followed by >> vhost_net_start) should be fine for software vhost-vdpa backend for e.g. >> vp_vdpa and vdpa_sim_net. However, I would like to get your attention >> that this method implies substantial blackout time for mode switching on >> real hardware - get a full cycle of device reset of getting memory >> mappings torn down, unpin & repin same set of pages, and set up new >> mapping would take very significant amount of time, especially for a >> large VM. Maybe we can do: >> > Right, I think this is something that deserves optimization in the future. > > Note that we must replace the mappings anyway, with all passthrough > queues stopped. Yes, unmap and remap is needed indeed. I haven't checked, does shadow vq keep mapping to the same GPA where passthrough data virtqueues were associated with across switch (so that the mode switch is transparent to the guest)? For platform IOMMU the mapping and remapping cost is inevitable, though I wonder for the on-chip IOMMU case could it take some fast path to just replace IOVA in place without destroying and setting up all mapping entries, if the same GPA is going to be used for the data rings (copy Eli for his input). > This is because SVQ vrings are not in the guest space. > The pin can be skipped though, I think that's a low hand fruit here. Yes, that's right. For a large VM pining overhead usually overweighs the mapping cost. It would be a great amount of time saving if pin can be skipped. > > If any, we can track guest's IOVA and add SVQ vrings in a hole. If > guest's IOVA layout changes, we can translate it then to a new > location. That way we only need one map operation in the worst case. > I'm omitting the lookup time here, but I still should be worth it. > > But as you mention I think it is not worth complicating this series > and we can think about it on top. Yes, agreed. I'll just let you aware of the need of this optimization for real hardware device. > We can start building it on top of > your suggestions for sure. > >> 1) replace reset with the RESUME feature that was just added to the >> vhost-vdpa ioctls in kernel > We cannot change vring addresses just with a SUSPEND / RESUME. I wonder if we can make SUSPEND (via some flag or new backend feature is fine) accept updating internal state like the vring addresses, while defer applying it to the device until RESUME? That way we don't lose a lot of other states that otherwise would need to re-instantiate at large with _F_RING_RESET or device reset. > > We could do it with the VIRTIO_F_RING_RESET feature though. Would it > be advantageous to the device? > >> 2) add new vdpa ioctls to allow iova range rebound to new virtual >> address for QEMU's shadow vq or back to device's vq > Actually, if the device supports ASID we can allocate ASID 1 to that > purpose. At this moment only CVQ vrings and control buffers are there > when the device is passthrough. Yep, we can get SVQ mapping pre-cooked in another ASID before dismantle the mapping for the passthrough VQ. This will help the general IOMMU case. > > But this doesn't solve the problem if we need to send all SVQ > translation to the device on-chip IOMMU, doesn't it? We must clear all > of it and send the new one to the device anyway. > >> 3) use a light-weighted sequence of suspend+rebind+resume to switch mode >> on the fly instead of getting through the whole reset+restart cycle >> > I think this is the same as 1, isn't it? I mean do all three together: 1,2 in kernel and 3 in QEMU. > >> I suspect the same idea could even be used to address high live >> migration downtime seen on hardware vdpa device. What do you think? >> > I think this is a great start for sure! Some questions: > a) Is the time on reprogramming on-chip IOMMU comparable to program > regular IOMMU? I would think this largely depends on the hardware implementation of on-chip IOMMU, the performance characteristics of which is very device specific. Some times driver software implementation and API for on-chip MMU also matters. Which would require vendor specific work to optimize based on the specific use case. > If it is the case it should be easier to find vdpa > devices with support for _F_RESET soon. > b) Not to merge on master, but it is possible to add an artificial > delay on vdpa_sim that simulates the properties of the delay of IOMMU? > In that line, have you observed if it is linear with the size of the > memory, with the number of maps, other factors..? As I said this is very device specific and hard to quantify, but I agree it's a good idea to simulate the delay and measure the effect. For the on-chip MMU device I'm looking, large proportion of the time was spent on software side in allocating a range of memory for hosting mapping entries (don't know how to quantify this part but the allocation time is not a constant nor linear to the size of memory), walking all iotlb entries passed down from vdpa layer and building corresponding memory key objects for a range of pages. For each iotlb entry the time to build memory mapping looks grow linearly with the size of memory. Not sure if there's room to improve, I'll let the owner to clarify. Thanks, -Siwei > > Thanks! > >> Thanks, >> -Siwei >> >>> + if (unlikely(r < 0)) { >>> + error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r); >>> + } >>> +} >>> + >>> +static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data) >>> +{ >>> + MigrationState *migration = data; >>> + VhostVDPAState *s = container_of(notifier, VhostVDPAState, >>> + migration_state); >>> + >>> + switch (migration->state) { >>> + case MIGRATION_STATUS_SETUP: >>> + vhost_vdpa_net_log_global_enable(s, true); >>> + return; >>> + >>> + case MIGRATION_STATUS_CANCELLING: >>> + case MIGRATION_STATUS_CANCELLED: >>> + case MIGRATION_STATUS_FAILED: >>> + vhost_vdpa_net_log_global_enable(s, false); >>> + return; >>> + }; >>> +} >>> + >>> static void vhost_vdpa_net_data_start_first(VhostVDPAState *s) >>> { >>> struct vhost_vdpa *v = &s->vhost_vdpa; >>> >>> + if (v->feature_log) { >>> + add_migration_state_change_notifier(&s->migration_state); >>> + } >>> + >>> if (v->shadow_vqs_enabled) { >>> v->iova_tree = vhost_iova_tree_new(v->iova_range.first, >>> v->iova_range.last); >>> @@ -280,6 +359,10 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc) >>> >>> assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA); >>> >>> + if (s->vhost_vdpa.index == 0 && s->vhost_vdpa.feature_log) { >>> + remove_migration_state_change_notifier(&s->migration_state); >>> + } >>> + >>> dev = s->vhost_vdpa.dev; >>> if (dev->vq_index + dev->nvqs == dev->vq_index_end) { >>> g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete); >>> @@ -767,6 +850,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer, >>> s->vhost_vdpa.device_fd = vdpa_device_fd; >>> s->vhost_vdpa.index = queue_pair_index; >>> s->always_svq = svq; >>> + s->migration_state.notify = vdpa_net_migration_state_notifier; >>> s->vhost_vdpa.shadow_vqs_enabled = svq; >>> s->vhost_vdpa.iova_range = iova_range; >>> s->vhost_vdpa.shadow_data = svq;
> -----Original Message----- > From: Si-Wei Liu <si-wei.liu@oracle.com> > Sent: Thursday, 2 February 2023 3:53 > To: Eugenio Pérez <eperezma@redhat.com>; qemu-devel@nongnu.org > Cc: Liuxiangdong <liuxiangdong5@huawei.com>; Zhu Lingshan > <lingshan.zhu@intel.com>; Gonglei (Arei) <arei.gonglei@huawei.com>; > alvaro.karsz@solid-run.com; Shannon Nelson <snelson@pensando.io>; > Laurent Vivier <lvivier@redhat.com>; Harpreet Singh Anand > <hanand@xilinx.com>; Gautam Dawar <gdawar@xilinx.com>; Stefano > Garzarella <sgarzare@redhat.com>; Cornelia Huck <cohuck@redhat.com>; > Cindy Lu <lulu@redhat.com>; Eli Cohen <eli@mellanox.com>; Paolo Bonzini > <pbonzini@redhat.com>; Michael S. Tsirkin <mst@redhat.com>; Jason Wang > <jasowang@redhat.com>; Stefan Hajnoczi <stefanha@redhat.com>; Parav > Pandit <parav@mellanox.com> > Subject: Re: [RFC v2 11/13] vdpa: add vdpa net migration state notifier > > > > On 1/12/2023 9:24 AM, Eugenio Pérez wrote: > > This allows net to restart the device backend to configure SVQ on it. > > > > Ideally, these changes should not be net specific. However, the vdpa net > > backend is the one with enough knowledge to configure everything because > > of some reasons: > > * Queues might need to be shadowed or not depending on its kind (control > > vs data). > > * Queues need to share the same map translations (iova tree). > > > > Because of that it is cleaner to restart the whole net backend and > > configure again as expected, similar to how vhost-kernel moves between > > userspace and passthrough. > > > > If more kinds of devices need dynamic switching to SVQ we can create a > > callback struct like VhostOps and move most of the code there. > > VhostOps cannot be reused since all vdpa backend share them, and to > > personalize just for networking would be too heavy. > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com> > > --- > > net/vhost-vdpa.c | 84 > ++++++++++++++++++++++++++++++++++++++++++++++++ > > 1 file changed, 84 insertions(+) > > > > diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c > > index 5d7ad6e4d7..f38532b1df 100644 > > --- a/net/vhost-vdpa.c > > +++ b/net/vhost-vdpa.c > > @@ -26,6 +26,8 @@ > > #include <err.h> > > #include "standard-headers/linux/virtio_net.h" > > #include "monitor/monitor.h" > > +#include "migration/migration.h" > > +#include "migration/misc.h" > > #include "migration/blocker.h" > > #include "hw/virtio/vhost.h" > > > > @@ -33,6 +35,7 @@ > > typedef struct VhostVDPAState { > > NetClientState nc; > > struct vhost_vdpa vhost_vdpa; > > + Notifier migration_state; > > Error *migration_blocker; > > VHostNetState *vhost_net; > > > > @@ -243,10 +246,86 @@ static VhostVDPAState > *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s) > > return DO_UPCAST(VhostVDPAState, nc, nc0); > > } > > > > +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool > enable) > > +{ > > + struct vhost_vdpa *v = &s->vhost_vdpa; > > + VirtIONet *n; > > + VirtIODevice *vdev; > > + int data_queue_pairs, cvq, r; > > + NetClientState *peer; > > + > > + /* We are only called on the first data vqs and only if x-svq is not set */ > > + if (s->vhost_vdpa.shadow_vqs_enabled == enable) { > > + return; > > + } > > + > > + vdev = v->dev->vdev; > > + n = VIRTIO_NET(vdev); > > + if (!n->vhost_started) { > > + return; > > + } > > + > > + if (enable) { > > + ioctl(v->device_fd, VHOST_VDPA_SUSPEND); > > + } > > + data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1; > > + cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ? > > + n->max_ncs - n->max_queue_pairs : 0; > > + vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq); > > + > > + peer = s->nc.peer; > > + for (int i = 0; i < data_queue_pairs + cvq; i++) { > > + VhostVDPAState *vdpa_state; > > + NetClientState *nc; > > + > > + if (i < data_queue_pairs) { > > + nc = qemu_get_peer(peer, i); > > + } else { > > + nc = qemu_get_peer(peer, n->max_queue_pairs); > > + } > > + > > + vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc); > > + vdpa_state->vhost_vdpa.shadow_data = enable; > > + > > + if (i < data_queue_pairs) { > > + /* Do not override CVQ shadow_vqs_enabled */ > > + vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable; > > + } > > + } > > + > > + r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq); > As the first revision, this method (vhost_net_stop followed by > vhost_net_start) should be fine for software vhost-vdpa backend for e.g. > vp_vdpa and vdpa_sim_net. However, I would like to get your attention > that this method implies substantial blackout time for mode switching on > real hardware - get a full cycle of device reset of getting memory > mappings torn down, unpin & repin same set of pages, and set up new > mapping would take very significant amount of time, especially for a > large VM. Maybe we can do: > > 1) replace reset with the RESUME feature that was just added to the > vhost-vdpa ioctls in kernel > 2) add new vdpa ioctls to allow iova range rebound to new virtual > address for QEMU's shadow vq or back to device's vq Every time you change the iova range, mlx5_vdpa needs to destroy the old MR and build a new one, based on the new data provided by the new iova. That implies destroying the VQs and creating them again with reference to the new MR. If the new iova provided is narrower, it will cause the memory registration operation to take less time. In any case, I don't see how adding a new call makes a difference relative to using the current set_map() call. If all we need is extend the current iova range to include the shadow virtqueue, we could introduce a call that extends the current iova. In this case, I could provide a much faster modification on the MR. > 3) use a light-weighted sequence of suspend+rebind+resume to switch mode > on the fly instead of getting through the whole reset+restart cycle > > I suspect the same idea could even be used to address high live > migration downtime seen on hardware vdpa device. What do you think? > > Thanks, > -Siwei > > > + if (unlikely(r < 0)) { > > + error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r); > > + } > > +} > > + > > +static void vdpa_net_migration_state_notifier(Notifier *notifier, void > *data) > > +{ > > + MigrationState *migration = data; > > + VhostVDPAState *s = container_of(notifier, VhostVDPAState, > > + migration_state); > > + > > + switch (migration->state) { > > + case MIGRATION_STATUS_SETUP: > > + vhost_vdpa_net_log_global_enable(s, true); > > + return; > > + > > + case MIGRATION_STATUS_CANCELLING: > > + case MIGRATION_STATUS_CANCELLED: > > + case MIGRATION_STATUS_FAILED: > > + vhost_vdpa_net_log_global_enable(s, false); > > + return; > > + }; > > +} > > + > > static void vhost_vdpa_net_data_start_first(VhostVDPAState *s) > > { > > struct vhost_vdpa *v = &s->vhost_vdpa; > > > > + if (v->feature_log) { > > + add_migration_state_change_notifier(&s->migration_state); > > + } > > + > > if (v->shadow_vqs_enabled) { > > v->iova_tree = vhost_iova_tree_new(v->iova_range.first, > > v->iova_range.last); > > @@ -280,6 +359,10 @@ static void > vhost_vdpa_net_client_stop(NetClientState *nc) > > > > assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA); > > > > + if (s->vhost_vdpa.index == 0 && s->vhost_vdpa.feature_log) { > > + remove_migration_state_change_notifier(&s->migration_state); > > + } > > + > > dev = s->vhost_vdpa.dev; > > if (dev->vq_index + dev->nvqs == dev->vq_index_end) { > > g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete); > > @@ -767,6 +850,7 @@ static NetClientState > *net_vhost_vdpa_init(NetClientState *peer, > > s->vhost_vdpa.device_fd = vdpa_device_fd; > > s->vhost_vdpa.index = queue_pair_index; > > s->always_svq = svq; > > + s->migration_state.notify = vdpa_net_migration_state_notifier; > > s->vhost_vdpa.shadow_vqs_enabled = svq; > > s->vhost_vdpa.iova_range = iova_range; > > s->vhost_vdpa.shadow_data = svq;
On Sat, Feb 4, 2023 at 3:04 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote: > > > > On 2/2/2023 7:28 AM, Eugenio Perez Martin wrote: > > On Thu, Feb 2, 2023 at 2:53 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote: > >> > >> > >> On 1/12/2023 9:24 AM, Eugenio Pérez wrote: > >>> This allows net to restart the device backend to configure SVQ on it. > >>> > >>> Ideally, these changes should not be net specific. However, the vdpa net > >>> backend is the one with enough knowledge to configure everything because > >>> of some reasons: > >>> * Queues might need to be shadowed or not depending on its kind (control > >>> vs data). > >>> * Queues need to share the same map translations (iova tree). > >>> > >>> Because of that it is cleaner to restart the whole net backend and > >>> configure again as expected, similar to how vhost-kernel moves between > >>> userspace and passthrough. > >>> > >>> If more kinds of devices need dynamic switching to SVQ we can create a > >>> callback struct like VhostOps and move most of the code there. > >>> VhostOps cannot be reused since all vdpa backend share them, and to > >>> personalize just for networking would be too heavy. > >>> > >>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com> > >>> --- > >>> net/vhost-vdpa.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++ > >>> 1 file changed, 84 insertions(+) > >>> > >>> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c > >>> index 5d7ad6e4d7..f38532b1df 100644 > >>> --- a/net/vhost-vdpa.c > >>> +++ b/net/vhost-vdpa.c > >>> @@ -26,6 +26,8 @@ > >>> #include <err.h> > >>> #include "standard-headers/linux/virtio_net.h" > >>> #include "monitor/monitor.h" > >>> +#include "migration/migration.h" > >>> +#include "migration/misc.h" > >>> #include "migration/blocker.h" > >>> #include "hw/virtio/vhost.h" > >>> > >>> @@ -33,6 +35,7 @@ > >>> typedef struct VhostVDPAState { > >>> NetClientState nc; > >>> struct vhost_vdpa vhost_vdpa; > >>> + Notifier migration_state; > >>> Error *migration_blocker; > >>> VHostNetState *vhost_net; > >>> > >>> @@ -243,10 +246,86 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s) > >>> return DO_UPCAST(VhostVDPAState, nc, nc0); > >>> } > >>> > >>> +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable) > >>> +{ > >>> + struct vhost_vdpa *v = &s->vhost_vdpa; > >>> + VirtIONet *n; > >>> + VirtIODevice *vdev; > >>> + int data_queue_pairs, cvq, r; > >>> + NetClientState *peer; > >>> + > >>> + /* We are only called on the first data vqs and only if x-svq is not set */ > >>> + if (s->vhost_vdpa.shadow_vqs_enabled == enable) { > >>> + return; > >>> + } > >>> + > >>> + vdev = v->dev->vdev; > >>> + n = VIRTIO_NET(vdev); > >>> + if (!n->vhost_started) { > >>> + return; > >>> + } > >>> + > >>> + if (enable) { > >>> + ioctl(v->device_fd, VHOST_VDPA_SUSPEND); > >>> + } > >>> + data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1; > >>> + cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ? > >>> + n->max_ncs - n->max_queue_pairs : 0; > >>> + vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq); > >>> + > >>> + peer = s->nc.peer; > >>> + for (int i = 0; i < data_queue_pairs + cvq; i++) { > >>> + VhostVDPAState *vdpa_state; > >>> + NetClientState *nc; > >>> + > >>> + if (i < data_queue_pairs) { > >>> + nc = qemu_get_peer(peer, i); > >>> + } else { > >>> + nc = qemu_get_peer(peer, n->max_queue_pairs); > >>> + } > >>> + > >>> + vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc); > >>> + vdpa_state->vhost_vdpa.shadow_data = enable; > >>> + > >>> + if (i < data_queue_pairs) { > >>> + /* Do not override CVQ shadow_vqs_enabled */ > >>> + vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable; > >>> + } > >>> + } > >>> + > >>> + r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq); > >> As the first revision, this method (vhost_net_stop followed by > >> vhost_net_start) should be fine for software vhost-vdpa backend for e.g. > >> vp_vdpa and vdpa_sim_net. However, I would like to get your attention > >> that this method implies substantial blackout time for mode switching on > >> real hardware - get a full cycle of device reset of getting memory > >> mappings torn down, unpin & repin same set of pages, and set up new > >> mapping would take very significant amount of time, especially for a > >> large VM. Maybe we can do: > >> > > Right, I think this is something that deserves optimization in the future. > > > > Note that we must replace the mappings anyway, with all passthrough > > queues stopped. > Yes, unmap and remap is needed indeed. I haven't checked, does shadow vq > keep mapping to the same GPA where passthrough data virtqueues were > associated with across switch (so that the mode switch is transparent to > the guest)? I don't get this question, SVQ switching is already transparent to the guest. > For platform IOMMU the mapping and remapping cost is > inevitable, though I wonder for the on-chip IOMMU case could it take > some fast path to just replace IOVA in place without destroying and > setting up all mapping entries, if the same GPA is going to be used for > the data rings (copy Eli for his input). > > > This is because SVQ vrings are not in the guest space. > > The pin can be skipped though, I think that's a low hand fruit here. > Yes, that's right. For a large VM pining overhead usually overweighs the > mapping cost. It would be a great amount of time saving if pin can be > skipped. > That is doable using dma_map/unmap apis instead of set_map (or comparing in set_map) and allocation GPA translations in advance. > > > > If any, we can track guest's IOVA and add SVQ vrings in a hole. If > > guest's IOVA layout changes, we can translate it then to a new > > location. That way we only need one map operation in the worst case. > > I'm omitting the lookup time here, but I still should be worth it. > > > > But as you mention I think it is not worth complicating this series > > and we can think about it on top. > Yes, agreed. I'll just let you aware of the need of this optimization > for real hardware device. > > > We can start building it on top of > > your suggestions for sure. > > > >> 1) replace reset with the RESUME feature that was just added to the > >> vhost-vdpa ioctls in kernel > > We cannot change vring addresses just with a SUSPEND / RESUME. > I wonder if we can make SUSPEND (via some flag or new backend feature is > fine) accept updating internal state like the vring addresses, while > defer applying it to the device until RESUME? That way we don't lose a > lot of other states that otherwise would need to re-instantiate at large > with _F_RING_RESET or device reset. > If that helps, that can be done for sure. As another idea, we could do the reverse and allow _F_RING_RESET to not to forget the parameters unless they're explicitly overriden. I think I prefer your idea in SUSPEND / RESUME cycle, but just wanted to put that possibility on the table if that makes more sense. > > > > We could do it with the VIRTIO_F_RING_RESET feature though. Would it > > be advantageous to the device? > > > >> 2) add new vdpa ioctls to allow iova range rebound to new virtual > >> address for QEMU's shadow vq or back to device's vq > > Actually, if the device supports ASID we can allocate ASID 1 to that > > purpose. At this moment only CVQ vrings and control buffers are there > > when the device is passthrough. > Yep, we can get SVQ mapping pre-cooked in another ASID before dismantle > the mapping for the passthrough VQ. This will help the general IOMMU case. > > > > > But this doesn't solve the problem if we need to send all SVQ > > translation to the device on-chip IOMMU, doesn't it? We must clear all > > of it and send the new one to the device anyway. > > > >> 3) use a light-weighted sequence of suspend+rebind+resume to switch mode > >> on the fly instead of getting through the whole reset+restart cycle > >> > > I think this is the same as 1, isn't it? > I mean do all three together: 1,2 in kernel and 3 in QEMU. > Ok I missed that in my first read, thanks! But I feel 2 should be easier to do in qemu. I don't really know how this helps in the general IOMMU case, I'm assuming the IOMMU does not support PASID or similar tricks. Is that because of the vhost_iotlb population or is there anything else I'm missing? > > > >> I suspect the same idea could even be used to address high live > >> migration downtime seen on hardware vdpa device. What do you think? > >> > > I think this is a great start for sure! Some questions: > > a) Is the time on reprogramming on-chip IOMMU comparable to program > > regular IOMMU? > I would think this largely depends on the hardware implementation of > on-chip IOMMU, the performance characteristics of which is very device > specific. Some times driver software implementation and API for on-chip > MMU also matters. Which would require vendor specific work to optimize > based on the specific use case. > Got it. > > If it is the case it should be easier to find vdpa > > devices with support for _F_RESET soon. > > b) Not to merge on master, but it is possible to add an artificial > > delay on vdpa_sim that simulates the properties of the delay of IOMMU? > > In that line, have you observed if it is linear with the size of the > > memory, with the number of maps, other factors..? > As I said this is very device specific and hard to quantify, but I agree > it's a good idea to simulate the delay and measure the effect. For the > on-chip MMU device I'm looking, large proportion of the time was spent > on software side in allocating a range of memory for hosting mapping > entries (don't know how to quantify this part but the allocation time is > not a constant nor linear to the size of memory), walking all iotlb > entries passed down from vdpa layer and building corresponding memory > key objects for a range of pages. For each iotlb entry the time to build > memory mapping looks grow linearly with the size of memory. Not sure if > there's room to improve, I'll let the owner to clarify. > So I think all of these are great ideas. If we state the pin & unpin huts latency in the switching I think the easiest way to start is: * To start with qemu and send all the map / unmap in a batch * Avoid the pin / unpin in the kernel using a smarter algorithm for that, not unpinning regions that it is going to pin again. What do you think? Thanks! > Thanks, > -Siwei > > > > > > > > > Thanks! > > > >> Thanks, > >> -Siwei > >> > >>> + if (unlikely(r < 0)) { > >>> + error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r); > >>> + } > >>> +} > >>> + > >>> +static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data) > >>> +{ > >>> + MigrationState *migration = data; > >>> + VhostVDPAState *s = container_of(notifier, VhostVDPAState, > >>> + migration_state); > >>> + > >>> + switch (migration->state) { > >>> + case MIGRATION_STATUS_SETUP: > >>> + vhost_vdpa_net_log_global_enable(s, true); > >>> + return; > >>> + > >>> + case MIGRATION_STATUS_CANCELLING: > >>> + case MIGRATION_STATUS_CANCELLED: > >>> + case MIGRATION_STATUS_FAILED: > >>> + vhost_vdpa_net_log_global_enable(s, false); > >>> + return; > >>> + }; > >>> +} > >>> + > >>> static void vhost_vdpa_net_data_start_first(VhostVDPAState *s) > >>> { > >>> struct vhost_vdpa *v = &s->vhost_vdpa; > >>> > >>> + if (v->feature_log) { > >>> + add_migration_state_change_notifier(&s->migration_state); > >>> + } > >>> + > >>> if (v->shadow_vqs_enabled) { > >>> v->iova_tree = vhost_iova_tree_new(v->iova_range.first, > >>> v->iova_range.last); > >>> @@ -280,6 +359,10 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc) > >>> > >>> assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA); > >>> > >>> + if (s->vhost_vdpa.index == 0 && s->vhost_vdpa.feature_log) { > >>> + remove_migration_state_change_notifier(&s->migration_state); > >>> + } > >>> + > >>> dev = s->vhost_vdpa.dev; > >>> if (dev->vq_index + dev->nvqs == dev->vq_index_end) { > >>> g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete); > >>> @@ -767,6 +850,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer, > >>> s->vhost_vdpa.device_fd = vdpa_device_fd; > >>> s->vhost_vdpa.index = queue_pair_index; > >>> s->always_svq = svq; > >>> + s->migration_state.notify = vdpa_net_migration_state_notifier; > >>> s->vhost_vdpa.shadow_vqs_enabled = svq; > >>> s->vhost_vdpa.iova_range = iova_range; > >>> s->vhost_vdpa.shadow_data = svq; >
On 2/13/2023 1:47 AM, Eugenio Perez Martin wrote: > On Sat, Feb 4, 2023 at 3:04 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote: >> >> >> On 2/2/2023 7:28 AM, Eugenio Perez Martin wrote: >>> On Thu, Feb 2, 2023 at 2:53 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote: >>>> >>>> On 1/12/2023 9:24 AM, Eugenio Pérez wrote: >>>>> This allows net to restart the device backend to configure SVQ on it. >>>>> >>>>> Ideally, these changes should not be net specific. However, the vdpa net >>>>> backend is the one with enough knowledge to configure everything because >>>>> of some reasons: >>>>> * Queues might need to be shadowed or not depending on its kind (control >>>>> vs data). >>>>> * Queues need to share the same map translations (iova tree). >>>>> >>>>> Because of that it is cleaner to restart the whole net backend and >>>>> configure again as expected, similar to how vhost-kernel moves between >>>>> userspace and passthrough. >>>>> >>>>> If more kinds of devices need dynamic switching to SVQ we can create a >>>>> callback struct like VhostOps and move most of the code there. >>>>> VhostOps cannot be reused since all vdpa backend share them, and to >>>>> personalize just for networking would be too heavy. >>>>> >>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com> >>>>> --- >>>>> net/vhost-vdpa.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++ >>>>> 1 file changed, 84 insertions(+) >>>>> >>>>> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c >>>>> index 5d7ad6e4d7..f38532b1df 100644 >>>>> --- a/net/vhost-vdpa.c >>>>> +++ b/net/vhost-vdpa.c >>>>> @@ -26,6 +26,8 @@ >>>>> #include <err.h> >>>>> #include "standard-headers/linux/virtio_net.h" >>>>> #include "monitor/monitor.h" >>>>> +#include "migration/migration.h" >>>>> +#include "migration/misc.h" >>>>> #include "migration/blocker.h" >>>>> #include "hw/virtio/vhost.h" >>>>> >>>>> @@ -33,6 +35,7 @@ >>>>> typedef struct VhostVDPAState { >>>>> NetClientState nc; >>>>> struct vhost_vdpa vhost_vdpa; >>>>> + Notifier migration_state; >>>>> Error *migration_blocker; >>>>> VHostNetState *vhost_net; >>>>> >>>>> @@ -243,10 +246,86 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s) >>>>> return DO_UPCAST(VhostVDPAState, nc, nc0); >>>>> } >>>>> >>>>> +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable) >>>>> +{ >>>>> + struct vhost_vdpa *v = &s->vhost_vdpa; >>>>> + VirtIONet *n; >>>>> + VirtIODevice *vdev; >>>>> + int data_queue_pairs, cvq, r; >>>>> + NetClientState *peer; >>>>> + >>>>> + /* We are only called on the first data vqs and only if x-svq is not set */ >>>>> + if (s->vhost_vdpa.shadow_vqs_enabled == enable) { >>>>> + return; >>>>> + } >>>>> + >>>>> + vdev = v->dev->vdev; >>>>> + n = VIRTIO_NET(vdev); >>>>> + if (!n->vhost_started) { >>>>> + return; >>>>> + } >>>>> + >>>>> + if (enable) { >>>>> + ioctl(v->device_fd, VHOST_VDPA_SUSPEND); >>>>> + } >>>>> + data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1; >>>>> + cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ? >>>>> + n->max_ncs - n->max_queue_pairs : 0; >>>>> + vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq); >>>>> + >>>>> + peer = s->nc.peer; >>>>> + for (int i = 0; i < data_queue_pairs + cvq; i++) { >>>>> + VhostVDPAState *vdpa_state; >>>>> + NetClientState *nc; >>>>> + >>>>> + if (i < data_queue_pairs) { >>>>> + nc = qemu_get_peer(peer, i); >>>>> + } else { >>>>> + nc = qemu_get_peer(peer, n->max_queue_pairs); >>>>> + } >>>>> + >>>>> + vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc); >>>>> + vdpa_state->vhost_vdpa.shadow_data = enable; >>>>> + >>>>> + if (i < data_queue_pairs) { >>>>> + /* Do not override CVQ shadow_vqs_enabled */ >>>>> + vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable; >>>>> + } >>>>> + } >>>>> + >>>>> + r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq); >>>> As the first revision, this method (vhost_net_stop followed by >>>> vhost_net_start) should be fine for software vhost-vdpa backend for e.g. >>>> vp_vdpa and vdpa_sim_net. However, I would like to get your attention >>>> that this method implies substantial blackout time for mode switching on >>>> real hardware - get a full cycle of device reset of getting memory >>>> mappings torn down, unpin & repin same set of pages, and set up new >>>> mapping would take very significant amount of time, especially for a >>>> large VM. Maybe we can do: >>>> >>> Right, I think this is something that deserves optimization in the future. >>> >>> Note that we must replace the mappings anyway, with all passthrough >>> queues stopped. >> Yes, unmap and remap is needed indeed. I haven't checked, does shadow vq >> keep mapping to the same GPA where passthrough data virtqueues were >> associated with across switch (so that the mode switch is transparent to >> the guest)? > I don't get this question, SVQ switching is already transparent to the guest. Never mind, you seem to have answered the question in the reply here and below. I was thinking of possibility to do incremental in-place update for a given IOVA range with one single call (for the on-chip IOMMU case), instead of separate unmap() and map() calls. Things like .set_map_replace(vdpa, asid, iova_start, size, iotlb_new_maps) as I ever mentioned. > >> For platform IOMMU the mapping and remapping cost is >> inevitable, though I wonder for the on-chip IOMMU case could it take >> some fast path to just replace IOVA in place without destroying and >> setting up all mapping entries, if the same GPA is going to be used for >> the data rings (copy Eli for his input). >> >>> This is because SVQ vrings are not in the guest space. >>> The pin can be skipped though, I think that's a low hand fruit here. >> Yes, that's right. For a large VM pining overhead usually overweighs the >> mapping cost. It would be a great amount of time saving if pin can be >> skipped. >> > That is doable using dma_map/unmap apis instead of set_map (or > comparing in set_map) and allocation GPA translations in advance. Is there a way for a driver to use both dma_map()/unmap() and set_map() APIs at the same time? Seems not possible for the moment. And batching is currently unsupported on dma_map()/unmap(). Not sure how mapping could be decoupled from pinning as the current uAPI (VHOST_IOTLB_UPDATE and VHOST_IOTLB_INVALIDATE) have both, i.e. it's not easy to tear them apart. If we agree pinning is not needed, perhaps we could add a new uAPI to just remap the IOVA ranges for data VQ addresses, and get around any code path involving page pinning. Under the hood at the driver API level, in case of general platform IOMMU, iommu_unmap() and iommu_map() can be used; in case of on-chip IOMMU, vdpa kernel would just call the new driver API .set_map_replace() to update the relevant IOVA mappings in place, without having to rebuild the entire iova tree. > >>> If any, we can track guest's IOVA and add SVQ vrings in a hole. If >>> guest's IOVA layout changes, we can translate it then to a new >>> location. That way we only need one map operation in the worst case. >>> I'm omitting the lookup time here, but I still should be worth it. >>> >>> But as you mention I think it is not worth complicating this series >>> and we can think about it on top. >> Yes, agreed. I'll just let you aware of the need of this optimization >> for real hardware device. >> >>> We can start building it on top of >>> your suggestions for sure. >>> >>>> 1) replace reset with the RESUME feature that was just added to the >>>> vhost-vdpa ioctls in kernel >>> We cannot change vring addresses just with a SUSPEND / RESUME. >> I wonder if we can make SUSPEND (via some flag or new backend feature is >> fine) accept updating internal state like the vring addresses, while >> defer applying it to the device until RESUME? That way we don't lose a >> lot of other states that otherwise would need to re-instantiate at large >> with _F_RING_RESET or device reset. >> > If that helps, that can be done for sure. > > As another idea, we could do the reverse and allow _F_RING_RESET to > not to forget the parameters unless they're explicitly overriden. Hmmm, this might need spec extension as that's not the current expectation for _F_RING_RESET so far as I understand. Once ring is reset, all parameters associated with the ring are forgotten. > I think I prefer your idea in SUSPEND / RESUME cycle, but just wanted > to put that possibility on the table if that makes more sense. Yea may be via a new per-vq suspend feature: _F_RING_STOP. > >>> We could do it with the VIRTIO_F_RING_RESET feature though. Would it >>> be advantageous to the device? >>> >>>> 2) add new vdpa ioctls to allow iova range rebound to new virtual >>>> address for QEMU's shadow vq or back to device's vq >>> Actually, if the device supports ASID we can allocate ASID 1 to that >>> purpose. At this moment only CVQ vrings and control buffers are there >>> when the device is passthrough. >> Yep, we can get SVQ mapping pre-cooked in another ASID before dismantle >> the mapping for the passthrough VQ. This will help the general IOMMU case. >> >>> But this doesn't solve the problem if we need to send all SVQ >>> translation to the device on-chip IOMMU, doesn't it? We must clear all >>> of it and send the new one to the device anyway. >>> >>>> 3) use a light-weighted sequence of suspend+rebind+resume to switch mode >>>> on the fly instead of getting through the whole reset+restart cycle >>>> >>> I think this is the same as 1, isn't it? >> I mean do all three together: 1,2 in kernel and 3 in QEMU. >> > Ok I missed that in my first read, thanks! > > But I feel 2 should be easier to do in qemu. > > I don't really know how this helps in the general IOMMU case, I'm > assuming the IOMMU does not support PASID or similar tricks. Is that > because of the vhost_iotlb population or is there anything else I'm > missing? A new uAPI (more precisely, iotlb message) is needed to get around of page pinning at least. Or if not specifically tied to onchip IOMMU, we can make it two separate uAPIs for UNMAP and MAP, respectively. > >>>> I suspect the same idea could even be used to address high live >>>> migration downtime seen on hardware vdpa device. What do you think? >>>> >>> I think this is a great start for sure! Some questions: >>> a) Is the time on reprogramming on-chip IOMMU comparable to program >>> regular IOMMU? >> I would think this largely depends on the hardware implementation of >> on-chip IOMMU, the performance characteristics of which is very device >> specific. Some times driver software implementation and API for on-chip >> MMU also matters. Which would require vendor specific work to optimize >> based on the specific use case. >> > Got it. > >>> If it is the case it should be easier to find vdpa >>> devices with support for _F_RESET soon. >>> b) Not to merge on master, but it is possible to add an artificial >>> delay on vdpa_sim that simulates the properties of the delay of IOMMU? >>> In that line, have you observed if it is linear with the size of the >>> memory, with the number of maps, other factors..? >> As I said this is very device specific and hard to quantify, but I agree >> it's a good idea to simulate the delay and measure the effect. For the >> on-chip MMU device I'm looking, large proportion of the time was spent >> on software side in allocating a range of memory for hosting mapping >> entries (don't know how to quantify this part but the allocation time is >> not a constant nor linear to the size of memory), walking all iotlb >> entries passed down from vdpa layer and building corresponding memory >> key objects for a range of pages. For each iotlb entry the time to build >> memory mapping looks grow linearly with the size of memory. Not sure if >> there's room to improve, I'll let the owner to clarify. >> > So I think all of these are great ideas. > > If we state the pin & unpin huts latency in the switching I think the > easiest way to start is: > * To start with qemu and send all the map / unmap in a batch By map / unmap, you are referring to the uAPIs (VHOST_IOTLB_UPDATE and VHOST_IOTLB_INVALIDATE), not the driver level .dma_map/unmap() kernel APIs, right? yes it's always good to commit all map / unmap transactions at once in a batch. > * Avoid the pin / unpin in the kernel using a smarter algorithm for > that, not unpinning regions that it is going to pin again. This seems to change the uAPI behavior underneath. Maybe cleaner to get it done with new uAPI. Regards, -Siwei > > What do you think? > > Thanks! > >> Thanks, >> -Siwei >> >> >> >> >> >>> Thanks! >>> >>>> Thanks, >>>> -Siwei >>>> >>>>> + if (unlikely(r < 0)) { >>>>> + error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r); >>>>> + } >>>>> +} >>>>> + >>>>> +static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data) >>>>> +{ >>>>> + MigrationState *migration = data; >>>>> + VhostVDPAState *s = container_of(notifier, VhostVDPAState, >>>>> + migration_state); >>>>> + >>>>> + switch (migration->state) { >>>>> + case MIGRATION_STATUS_SETUP: >>>>> + vhost_vdpa_net_log_global_enable(s, true); >>>>> + return; >>>>> + >>>>> + case MIGRATION_STATUS_CANCELLING: >>>>> + case MIGRATION_STATUS_CANCELLED: >>>>> + case MIGRATION_STATUS_FAILED: >>>>> + vhost_vdpa_net_log_global_enable(s, false); >>>>> + return; >>>>> + }; >>>>> +} >>>>> + >>>>> static void vhost_vdpa_net_data_start_first(VhostVDPAState *s) >>>>> { >>>>> struct vhost_vdpa *v = &s->vhost_vdpa; >>>>> >>>>> + if (v->feature_log) { >>>>> + add_migration_state_change_notifier(&s->migration_state); >>>>> + } >>>>> + >>>>> if (v->shadow_vqs_enabled) { >>>>> v->iova_tree = vhost_iova_tree_new(v->iova_range.first, >>>>> v->iova_range.last); >>>>> @@ -280,6 +359,10 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc) >>>>> >>>>> assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA); >>>>> >>>>> + if (s->vhost_vdpa.index == 0 && s->vhost_vdpa.feature_log) { >>>>> + remove_migration_state_change_notifier(&s->migration_state); >>>>> + } >>>>> + >>>>> dev = s->vhost_vdpa.dev; >>>>> if (dev->vq_index + dev->nvqs == dev->vq_index_end) { >>>>> g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete); >>>>> @@ -767,6 +850,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer, >>>>> s->vhost_vdpa.device_fd = vdpa_device_fd; >>>>> s->vhost_vdpa.index = queue_pair_index; >>>>> s->always_svq = svq; >>>>> + s->migration_state.notify = vdpa_net_migration_state_notifier; >>>>> s->vhost_vdpa.shadow_vqs_enabled = svq; >>>>> s->vhost_vdpa.iova_range = iova_range; >>>>> s->vhost_vdpa.shadow_data = svq;
On Mon, Feb 13, 2023 at 11:37 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote: > > > > On 2/13/2023 1:47 AM, Eugenio Perez Martin wrote: > > On Sat, Feb 4, 2023 at 3:04 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote: > >> > >> > >> On 2/2/2023 7:28 AM, Eugenio Perez Martin wrote: > >>> On Thu, Feb 2, 2023 at 2:53 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote: > >>>> > >>>> On 1/12/2023 9:24 AM, Eugenio Pérez wrote: > >>>>> This allows net to restart the device backend to configure SVQ on it. > >>>>> > >>>>> Ideally, these changes should not be net specific. However, the vdpa net > >>>>> backend is the one with enough knowledge to configure everything because > >>>>> of some reasons: > >>>>> * Queues might need to be shadowed or not depending on its kind (control > >>>>> vs data). > >>>>> * Queues need to share the same map translations (iova tree). > >>>>> > >>>>> Because of that it is cleaner to restart the whole net backend and > >>>>> configure again as expected, similar to how vhost-kernel moves between > >>>>> userspace and passthrough. > >>>>> > >>>>> If more kinds of devices need dynamic switching to SVQ we can create a > >>>>> callback struct like VhostOps and move most of the code there. > >>>>> VhostOps cannot be reused since all vdpa backend share them, and to > >>>>> personalize just for networking would be too heavy. > >>>>> > >>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com> > >>>>> --- > >>>>> net/vhost-vdpa.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++ > >>>>> 1 file changed, 84 insertions(+) > >>>>> > >>>>> diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c > >>>>> index 5d7ad6e4d7..f38532b1df 100644 > >>>>> --- a/net/vhost-vdpa.c > >>>>> +++ b/net/vhost-vdpa.c > >>>>> @@ -26,6 +26,8 @@ > >>>>> #include <err.h> > >>>>> #include "standard-headers/linux/virtio_net.h" > >>>>> #include "monitor/monitor.h" > >>>>> +#include "migration/migration.h" > >>>>> +#include "migration/misc.h" > >>>>> #include "migration/blocker.h" > >>>>> #include "hw/virtio/vhost.h" > >>>>> > >>>>> @@ -33,6 +35,7 @@ > >>>>> typedef struct VhostVDPAState { > >>>>> NetClientState nc; > >>>>> struct vhost_vdpa vhost_vdpa; > >>>>> + Notifier migration_state; > >>>>> Error *migration_blocker; > >>>>> VHostNetState *vhost_net; > >>>>> > >>>>> @@ -243,10 +246,86 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s) > >>>>> return DO_UPCAST(VhostVDPAState, nc, nc0); > >>>>> } > >>>>> > >>>>> +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable) > >>>>> +{ > >>>>> + struct vhost_vdpa *v = &s->vhost_vdpa; > >>>>> + VirtIONet *n; > >>>>> + VirtIODevice *vdev; > >>>>> + int data_queue_pairs, cvq, r; > >>>>> + NetClientState *peer; > >>>>> + > >>>>> + /* We are only called on the first data vqs and only if x-svq is not set */ > >>>>> + if (s->vhost_vdpa.shadow_vqs_enabled == enable) { > >>>>> + return; > >>>>> + } > >>>>> + > >>>>> + vdev = v->dev->vdev; > >>>>> + n = VIRTIO_NET(vdev); > >>>>> + if (!n->vhost_started) { > >>>>> + return; > >>>>> + } > >>>>> + > >>>>> + if (enable) { > >>>>> + ioctl(v->device_fd, VHOST_VDPA_SUSPEND); > >>>>> + } > >>>>> + data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1; > >>>>> + cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ? > >>>>> + n->max_ncs - n->max_queue_pairs : 0; > >>>>> + vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq); > >>>>> + > >>>>> + peer = s->nc.peer; > >>>>> + for (int i = 0; i < data_queue_pairs + cvq; i++) { > >>>>> + VhostVDPAState *vdpa_state; > >>>>> + NetClientState *nc; > >>>>> + > >>>>> + if (i < data_queue_pairs) { > >>>>> + nc = qemu_get_peer(peer, i); > >>>>> + } else { > >>>>> + nc = qemu_get_peer(peer, n->max_queue_pairs); > >>>>> + } > >>>>> + > >>>>> + vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc); > >>>>> + vdpa_state->vhost_vdpa.shadow_data = enable; > >>>>> + > >>>>> + if (i < data_queue_pairs) { > >>>>> + /* Do not override CVQ shadow_vqs_enabled */ > >>>>> + vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable; > >>>>> + } > >>>>> + } > >>>>> + > >>>>> + r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq); > >>>> As the first revision, this method (vhost_net_stop followed by > >>>> vhost_net_start) should be fine for software vhost-vdpa backend for e.g. > >>>> vp_vdpa and vdpa_sim_net. However, I would like to get your attention > >>>> that this method implies substantial blackout time for mode switching on > >>>> real hardware - get a full cycle of device reset of getting memory > >>>> mappings torn down, unpin & repin same set of pages, and set up new > >>>> mapping would take very significant amount of time, especially for a > >>>> large VM. Maybe we can do: > >>>> > >>> Right, I think this is something that deserves optimization in the future. > >>> > >>> Note that we must replace the mappings anyway, with all passthrough > >>> queues stopped. > >> Yes, unmap and remap is needed indeed. I haven't checked, does shadow vq > >> keep mapping to the same GPA where passthrough data virtqueues were > >> associated with across switch (so that the mode switch is transparent to > >> the guest)? > > I don't get this question, SVQ switching is already transparent to the guest. > Never mind, you seem to have answered the question in the reply here and > below. I was thinking of possibility to do incremental in-place update > for a given IOVA range with one single call (for the on-chip IOMMU > case), instead of separate unmap() and map() calls. Things like > .set_map_replace(vdpa, asid, iova_start, size, iotlb_new_maps) as I ever > mentioned. > > > > >> For platform IOMMU the mapping and remapping cost is > >> inevitable, though I wonder for the on-chip IOMMU case could it take > >> some fast path to just replace IOVA in place without destroying and > >> setting up all mapping entries, if the same GPA is going to be used for > >> the data rings (copy Eli for his input). > >> > >>> This is because SVQ vrings are not in the guest space. > >>> The pin can be skipped though, I think that's a low hand fruit here. > >> Yes, that's right. For a large VM pining overhead usually overweighs the > >> mapping cost. It would be a great amount of time saving if pin can be > >> skipped. > >> > > That is doable using dma_map/unmap apis instead of set_map (or > > comparing in set_map) and allocation GPA translations in advance. > Is there a way for a driver to use both dma_map()/unmap() and set_map() > APIs at the same time? Seems not possible for the moment. And batching > is currently unsupported on dma_map()/unmap(). > I meant not ignoring the batch calls, yes. > Not sure how mapping could be decoupled from pinning as the current uAPI > (VHOST_IOTLB_UPDATE and VHOST_IOTLB_INVALIDATE) have both, i.e. it's not > easy to tear them apart. If we add a reverse tree, I'd say it should be possible to transverse the new and the old IOVA -> iotlb tree and only map / unmap the differences. All the guest memory will stay pinned this way, only SVQ will be pinned and unpinned. I'm not sure if this is cheap or comparable to the pin / unpin operations but maybe we can even build that tree at set_map? Does the pin operation get cheaper when using hugepages and similar? > If we agree pinning is not needed, perhaps we > could add a new uAPI to just remap the IOVA ranges for data VQ > addresses, and get around any code path involving page pinning. Under > the hood at the driver API level, in case of general platform IOMMU, > iommu_unmap() and iommu_map() can be used; in case of on-chip IOMMU, > vdpa kernel would just call the new driver API .set_map_replace() to > update the relevant IOVA mappings in place, without having to rebuild > the entire iova tree. > That's a more efficient way to do it for sure, although it requires additions to uAPI. > > > >>> If any, we can track guest's IOVA and add SVQ vrings in a hole. If > >>> guest's IOVA layout changes, we can translate it then to a new > >>> location. That way we only need one map operation in the worst case. > >>> I'm omitting the lookup time here, but I still should be worth it. > >>> > >>> But as you mention I think it is not worth complicating this series > >>> and we can think about it on top. > >> Yes, agreed. I'll just let you aware of the need of this optimization > >> for real hardware device. > >> > >>> We can start building it on top of > >>> your suggestions for sure. > >>> > >>>> 1) replace reset with the RESUME feature that was just added to the > >>>> vhost-vdpa ioctls in kernel > >>> We cannot change vring addresses just with a SUSPEND / RESUME. > >> I wonder if we can make SUSPEND (via some flag or new backend feature is > >> fine) accept updating internal state like the vring addresses, while > >> defer applying it to the device until RESUME? That way we don't lose a > >> lot of other states that otherwise would need to re-instantiate at large > >> with _F_RING_RESET or device reset. > >> > > If that helps, that can be done for sure. > > > > As another idea, we could do the reverse and allow _F_RING_RESET to > > not to forget the parameters unless they're explicitly overriden. > Hmmm, this might need spec extension as that's not the current > expectation for _F_RING_RESET so far as I understand. Once ring is > reset, all parameters associated with the ring are forgotten. > > > I think I prefer your idea in SUSPEND / RESUME cycle, but just wanted > > to put that possibility on the table if that makes more sense. > Yea may be via a new per-vq suspend feature: _F_RING_STOP. > > > > >>> We could do it with the VIRTIO_F_RING_RESET feature though. Would it > >>> be advantageous to the device? > >>> > >>>> 2) add new vdpa ioctls to allow iova range rebound to new virtual > >>>> address for QEMU's shadow vq or back to device's vq > >>> Actually, if the device supports ASID we can allocate ASID 1 to that > >>> purpose. At this moment only CVQ vrings and control buffers are there > >>> when the device is passthrough. > >> Yep, we can get SVQ mapping pre-cooked in another ASID before dismantle > >> the mapping for the passthrough VQ. This will help the general IOMMU case. > >> > >>> But this doesn't solve the problem if we need to send all SVQ > >>> translation to the device on-chip IOMMU, doesn't it? We must clear all > >>> of it and send the new one to the device anyway. > >>> > >>>> 3) use a light-weighted sequence of suspend+rebind+resume to switch mode > >>>> on the fly instead of getting through the whole reset+restart cycle > >>>> > >>> I think this is the same as 1, isn't it? > >> I mean do all three together: 1,2 in kernel and 3 in QEMU. > >> > > Ok I missed that in my first read, thanks! > > > > But I feel 2 should be easier to do in qemu. > > > > I don't really know how this helps in the general IOMMU case, I'm > > assuming the IOMMU does not support PASID or similar tricks. Is that > > because of the vhost_iotlb population or is there anything else I'm > > missing? > A new uAPI (more precisely, iotlb message) is needed to get around of > page pinning at least. Or if not specifically tied to onchip IOMMU, we > can make it two separate uAPIs for UNMAP and MAP, respectively. > I'd say the right call is just a "replace", or we will just replicate map / unmap, isn't it? > > > >>>> I suspect the same idea could even be used to address high live > >>>> migration downtime seen on hardware vdpa device. What do you think? > >>>> > >>> I think this is a great start for sure! Some questions: > >>> a) Is the time on reprogramming on-chip IOMMU comparable to program > >>> regular IOMMU? > >> I would think this largely depends on the hardware implementation of > >> on-chip IOMMU, the performance characteristics of which is very device > >> specific. Some times driver software implementation and API for on-chip > >> MMU also matters. Which would require vendor specific work to optimize > >> based on the specific use case. > >> > > Got it. > > > >>> If it is the case it should be easier to find vdpa > >>> devices with support for _F_RESET soon. > >>> b) Not to merge on master, but it is possible to add an artificial > >>> delay on vdpa_sim that simulates the properties of the delay of IOMMU? > >>> In that line, have you observed if it is linear with the size of the > >>> memory, with the number of maps, other factors..? > >> As I said this is very device specific and hard to quantify, but I agree > >> it's a good idea to simulate the delay and measure the effect. For the > >> on-chip MMU device I'm looking, large proportion of the time was spent > >> on software side in allocating a range of memory for hosting mapping > >> entries (don't know how to quantify this part but the allocation time is > >> not a constant nor linear to the size of memory), walking all iotlb > >> entries passed down from vdpa layer and building corresponding memory > >> key objects for a range of pages. For each iotlb entry the time to build > >> memory mapping looks grow linearly with the size of memory. Not sure if > >> there's room to improve, I'll let the owner to clarify. > >> > > So I think all of these are great ideas. > > > > If we state the pin & unpin huts latency in the switching I think the > > easiest way to start is: > > * To start with qemu and send all the map / unmap in a batch > By map / unmap, you are referring to the uAPIs (VHOST_IOTLB_UPDATE and > VHOST_IOTLB_INVALIDATE), not the driver level .dma_map/unmap() kernel > APIs, right? yes it's always good to commit all map / unmap transactions > at once in a batch. > Right, sorry for not being specific enough. > > * Avoid the pin / unpin in the kernel using a smarter algorithm for > > that, not unpinning regions that it is going to pin again. > This seems to change the uAPI behavior underneath. Maybe cleaner to get > it done with new uAPI. > I think there is no visible change from userspace, or do we expect an effective unpin + pin for some reason? Thanks! > Regards, > -Siwei > > > > > What do you think? > > > > Thanks! > > > >> Thanks, > >> -Siwei > >> > >> > >> > >> > >> > >>> Thanks! > >>> > >>>> Thanks, > >>>> -Siwei > >>>> > >>>>> + if (unlikely(r < 0)) { > >>>>> + error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r); > >>>>> + } > >>>>> +} > >>>>> + > >>>>> +static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data) > >>>>> +{ > >>>>> + MigrationState *migration = data; > >>>>> + VhostVDPAState *s = container_of(notifier, VhostVDPAState, > >>>>> + migration_state); > >>>>> + > >>>>> + switch (migration->state) { > >>>>> + case MIGRATION_STATUS_SETUP: > >>>>> + vhost_vdpa_net_log_global_enable(s, true); > >>>>> + return; > >>>>> + > >>>>> + case MIGRATION_STATUS_CANCELLING: > >>>>> + case MIGRATION_STATUS_CANCELLED: > >>>>> + case MIGRATION_STATUS_FAILED: > >>>>> + vhost_vdpa_net_log_global_enable(s, false); > >>>>> + return; > >>>>> + }; > >>>>> +} > >>>>> + > >>>>> static void vhost_vdpa_net_data_start_first(VhostVDPAState *s) > >>>>> { > >>>>> struct vhost_vdpa *v = &s->vhost_vdpa; > >>>>> > >>>>> + if (v->feature_log) { > >>>>> + add_migration_state_change_notifier(&s->migration_state); > >>>>> + } > >>>>> + > >>>>> if (v->shadow_vqs_enabled) { > >>>>> v->iova_tree = vhost_iova_tree_new(v->iova_range.first, > >>>>> v->iova_range.last); > >>>>> @@ -280,6 +359,10 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc) > >>>>> > >>>>> assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA); > >>>>> > >>>>> + if (s->vhost_vdpa.index == 0 && s->vhost_vdpa.feature_log) { > >>>>> + remove_migration_state_change_notifier(&s->migration_state); > >>>>> + } > >>>>> + > >>>>> dev = s->vhost_vdpa.dev; > >>>>> if (dev->vq_index + dev->nvqs == dev->vq_index_end) { > >>>>> g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete); > >>>>> @@ -767,6 +850,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer, > >>>>> s->vhost_vdpa.device_fd = vdpa_device_fd; > >>>>> s->vhost_vdpa.index = queue_pair_index; > >>>>> s->always_svq = svq; > >>>>> + s->migration_state.notify = vdpa_net_migration_state_notifier; > >>>>> s->vhost_vdpa.shadow_vqs_enabled = svq; > >>>>> s->vhost_vdpa.iova_range = iova_range; > >>>>> s->vhost_vdpa.shadow_data = svq; >
diff --git a/net/vhost-vdpa.c b/net/vhost-vdpa.c index 5d7ad6e4d7..f38532b1df 100644 --- a/net/vhost-vdpa.c +++ b/net/vhost-vdpa.c @@ -26,6 +26,8 @@ #include <err.h> #include "standard-headers/linux/virtio_net.h" #include "monitor/monitor.h" +#include "migration/migration.h" +#include "migration/misc.h" #include "migration/blocker.h" #include "hw/virtio/vhost.h" @@ -33,6 +35,7 @@ typedef struct VhostVDPAState { NetClientState nc; struct vhost_vdpa vhost_vdpa; + Notifier migration_state; Error *migration_blocker; VHostNetState *vhost_net; @@ -243,10 +246,86 @@ static VhostVDPAState *vhost_vdpa_net_first_nc_vdpa(VhostVDPAState *s) return DO_UPCAST(VhostVDPAState, nc, nc0); } +static void vhost_vdpa_net_log_global_enable(VhostVDPAState *s, bool enable) +{ + struct vhost_vdpa *v = &s->vhost_vdpa; + VirtIONet *n; + VirtIODevice *vdev; + int data_queue_pairs, cvq, r; + NetClientState *peer; + + /* We are only called on the first data vqs and only if x-svq is not set */ + if (s->vhost_vdpa.shadow_vqs_enabled == enable) { + return; + } + + vdev = v->dev->vdev; + n = VIRTIO_NET(vdev); + if (!n->vhost_started) { + return; + } + + if (enable) { + ioctl(v->device_fd, VHOST_VDPA_SUSPEND); + } + data_queue_pairs = n->multiqueue ? n->max_queue_pairs : 1; + cvq = virtio_vdev_has_feature(vdev, VIRTIO_NET_F_CTRL_VQ) ? + n->max_ncs - n->max_queue_pairs : 0; + vhost_net_stop(vdev, n->nic->ncs, data_queue_pairs, cvq); + + peer = s->nc.peer; + for (int i = 0; i < data_queue_pairs + cvq; i++) { + VhostVDPAState *vdpa_state; + NetClientState *nc; + + if (i < data_queue_pairs) { + nc = qemu_get_peer(peer, i); + } else { + nc = qemu_get_peer(peer, n->max_queue_pairs); + } + + vdpa_state = DO_UPCAST(VhostVDPAState, nc, nc); + vdpa_state->vhost_vdpa.shadow_data = enable; + + if (i < data_queue_pairs) { + /* Do not override CVQ shadow_vqs_enabled */ + vdpa_state->vhost_vdpa.shadow_vqs_enabled = enable; + } + } + + r = vhost_net_start(vdev, n->nic->ncs, data_queue_pairs, cvq); + if (unlikely(r < 0)) { + error_report("unable to start vhost net: %s(%d)", g_strerror(-r), -r); + } +} + +static void vdpa_net_migration_state_notifier(Notifier *notifier, void *data) +{ + MigrationState *migration = data; + VhostVDPAState *s = container_of(notifier, VhostVDPAState, + migration_state); + + switch (migration->state) { + case MIGRATION_STATUS_SETUP: + vhost_vdpa_net_log_global_enable(s, true); + return; + + case MIGRATION_STATUS_CANCELLING: + case MIGRATION_STATUS_CANCELLED: + case MIGRATION_STATUS_FAILED: + vhost_vdpa_net_log_global_enable(s, false); + return; + }; +} + static void vhost_vdpa_net_data_start_first(VhostVDPAState *s) { struct vhost_vdpa *v = &s->vhost_vdpa; + if (v->feature_log) { + add_migration_state_change_notifier(&s->migration_state); + } + if (v->shadow_vqs_enabled) { v->iova_tree = vhost_iova_tree_new(v->iova_range.first, v->iova_range.last); @@ -280,6 +359,10 @@ static void vhost_vdpa_net_client_stop(NetClientState *nc) assert(nc->info->type == NET_CLIENT_DRIVER_VHOST_VDPA); + if (s->vhost_vdpa.index == 0 && s->vhost_vdpa.feature_log) { + remove_migration_state_change_notifier(&s->migration_state); + } + dev = s->vhost_vdpa.dev; if (dev->vq_index + dev->nvqs == dev->vq_index_end) { g_clear_pointer(&s->vhost_vdpa.iova_tree, vhost_iova_tree_delete); @@ -767,6 +850,7 @@ static NetClientState *net_vhost_vdpa_init(NetClientState *peer, s->vhost_vdpa.device_fd = vdpa_device_fd; s->vhost_vdpa.index = queue_pair_index; s->always_svq = svq; + s->migration_state.notify = vdpa_net_migration_state_notifier; s->vhost_vdpa.shadow_vqs_enabled = svq; s->vhost_vdpa.iova_range = iova_range; s->vhost_vdpa.shadow_data = svq;
This allows net to restart the device backend to configure SVQ on it. Ideally, these changes should not be net specific. However, the vdpa net backend is the one with enough knowledge to configure everything because of some reasons: * Queues might need to be shadowed or not depending on its kind (control vs data). * Queues need to share the same map translations (iova tree). Because of that it is cleaner to restart the whole net backend and configure again as expected, similar to how vhost-kernel moves between userspace and passthrough. If more kinds of devices need dynamic switching to SVQ we can create a callback struct like VhostOps and move most of the code there. VhostOps cannot be reused since all vdpa backend share them, and to personalize just for networking would be too heavy. Signed-off-by: Eugenio Pérez <eperezma@redhat.com> --- net/vhost-vdpa.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 84 insertions(+)