Message ID: 20230411150515.14020-3-hreitz@redhat.com (mailing list archive)
State: New, archived
Series: vhost-user-fs: Internal migration
On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> So-called "internal" virtio-fs migration refers to transporting the
> back-end's (virtiofsd's) state through qemu's migration stream. To do
> this, we need to be able to transfer virtiofsd's internal state to and
> from virtiofsd.
>
> Because virtiofsd's internal state will not be too large, we believe it
> is best to transfer it as a single binary blob after the streaming
> phase. Because this method should be useful to other vhost-user
> implementations, too, it is introduced as a general-purpose addition to
> the protocol, not limited to vhost-user-fs.
>
> These are the additions to the protocol:
> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
>   This feature signals support for transferring state, and is added so
>   that migration can fail early when the back-end has no support.
>
> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
>   over which to transfer the state. The front-end sends an FD to the
>   back-end into/from which it can write/read its state, and the back-end
>   can decide to either use it, or reply with a different FD for the
>   front-end to override the front-end's choice.
>   The front-end creates a simple pipe to transfer the state, but maybe
>   the back-end already has an FD into/from which it has to write/read
>   its state, in which case it will want to override the simple pipe.
>   Conversely, maybe in the future we find a way to have the front-end
>   get an immediate FD for the migration stream (in some cases), in which
>   case we will want to send this to the back-end instead of creating a
>   pipe.
>   Hence the negotiation: If one side has a better idea than a plain
>   pipe, we will want to use that.
>
> - CHECK_DEVICE_STATE: After the state has been transferred through the
>   pipe (the end indicated by EOF), the front-end invokes this function
>   to verify success.
>   There is no in-band way (through the pipe) to
>   indicate failure, so we need to check explicitly.
>
> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> (which includes establishing the direction of transfer and migration
> phase), the sending side writes its data into the pipe, and the reading
> side reads it until it sees an EOF. Then, the front-end will check for
> success via CHECK_DEVICE_STATE, which on the destination side includes
> checking for integrity (i.e. errors during deserialization).
>
> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  include/hw/virtio/vhost-backend.h |  24 +++++
>  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
>  hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
>  hw/virtio/vhost.c                 |  37 ++++++++
>  4 files changed, 287 insertions(+)
>
> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> index ec3fbae58d..5935b32fe3 100644
> --- a/include/hw/virtio/vhost-backend.h
> +++ b/include/hw/virtio/vhost-backend.h
> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
>      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
>  } VhostSetConfigType;
>
> +typedef enum VhostDeviceStateDirection {
> +    /* Transfer state from back-end (device) to front-end */
> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> +    /* Transfer state from front-end to back-end (device) */
> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> +} VhostDeviceStateDirection;
> +
> +typedef enum VhostDeviceStatePhase {
> +    /* The device (and all its vrings) is stopped */
> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> +} VhostDeviceStatePhase;

vDPA has:

/* Suspend a device so it does not process virtqueue requests anymore
 *
 * After the return of ioctl the device must preserve all the necessary state
 * (the virtqueue vring base plus the possible device specific states) that is
 * required for restoring in the future. The device must not change its
 * configuration after that point.
 */
#define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D)

/* Resume a device so it can resume processing virtqueue requests
 *
 * After the return of this ioctl the device will have restored all the
 * necessary states and it is fully operational to continue processing the
 * virtqueue descriptors.
 */
#define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E)

I wonder if it makes sense to import these into vhost-user so that the
difference between kernel vhost and vhost-user is minimized. It's okay if
one of them is ahead of the other, but it would be nice to avoid
overlapping/duplicated functionality.

(And I hope vDPA will import the device state vhost-user messages
introduced in this series.)

> +
>  struct vhost_inflight;
>  struct vhost_dev;
>  struct vhost_log;
> @@ -133,6 +145,15 @@ typedef int (*vhost_set_config_call_op)(struct vhost_dev *dev,
>
>  typedef void (*vhost_reset_status_op)(struct vhost_dev *dev);
>
> +typedef bool (*vhost_supports_migratory_state_op)(struct vhost_dev *dev);
> +typedef int (*vhost_set_device_state_fd_op)(struct vhost_dev *dev,
> +                                            VhostDeviceStateDirection direction,
> +                                            VhostDeviceStatePhase phase,
> +                                            int fd,
> +                                            int *reply_fd,
> +                                            Error **errp);
> +typedef int (*vhost_check_device_state_op)(struct vhost_dev *dev, Error **errp);
> +
>  typedef struct VhostOps {
>      VhostBackendType backend_type;
>      vhost_backend_init vhost_backend_init;
> @@ -181,6 +202,9 @@ typedef struct VhostOps {
>      vhost_force_iommu_op vhost_force_iommu;
>      vhost_set_config_call_op vhost_set_config_call;
>      vhost_reset_status_op vhost_reset_status;
> +    vhost_supports_migratory_state_op vhost_supports_migratory_state;
> +    vhost_set_device_state_fd_op vhost_set_device_state_fd;
> +    vhost_check_device_state_op vhost_check_device_state;
>  } VhostOps;
>
>  int vhost_backend_update_device_iotlb(struct vhost_dev *dev,
> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> index 2fe02ed5d4..29449e0fe2 100644
> --- a/include/hw/virtio/vhost.h
> +++ b/include/hw/virtio/vhost.h
@@ -346,4 +346,83 @@ int vhost_dev_set_inflight(struct vhost_dev *dev, > struct vhost_inflight *inflight); > int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size, > struct vhost_inflight *inflight); > + > +/** > + * vhost_supports_migratory_state(): Checks whether the back-end > + * supports transferring internal state for the purpose of migration. > + * Support for this feature is required for vhost_set_device_state_fd() > + * and vhost_check_device_state(). > + * > + * @dev: The vhost device > + * > + * Returns true if the device supports these commands, and false if it > + * does not. > + */ > +bool vhost_supports_migratory_state(struct vhost_dev *dev); > + > +/** > + * vhost_set_device_state_fd(): Begin transfer of internal state from/to > + * the back-end for the purpose of migration. Data is to be transferred > + * over a pipe according to @direction and @phase. The sending end must > + * only write to the pipe, and the receiving end must only read from it. > + * Once the sending end is done, it closes its FD. The receiving end > + * must take this as the end-of-transfer signal and close its FD, too. > + * > + * @fd is the back-end's end of the pipe: The write FD for SAVE, and the > + * read FD for LOAD. This function transfers ownership of @fd to the > + * back-end, i.e. closes it in the front-end. > + * > + * The back-end may optionally reply with an FD of its own, if this > + * improves efficiency on its end. In this case, the returned FD is > + * stored in *reply_fd. The back-end will discard the FD sent to it, > + * and the front-end must use *reply_fd for transferring state to/from > + * the back-end. > + * > + * @dev: The vhost device > + * @direction: The direction in which the state is to be transferred. > + * For outgoing migrations, this is SAVE, and data is read > + * from the back-end and stored by the front-end in the > + * migration stream. 
> + * For incoming migrations, this is LOAD, and data is read > + * by the front-end from the migration stream and sent to > + * the back-end to restore the saved state. > + * @phase: Which migration phase we are in. Currently, there is only > + * STOPPED (device and all vrings are stopped), in the future, > + * more phases such as PRE_COPY or POST_COPY may be added. > + * @fd: Back-end's end of the pipe through which to transfer state; note > + * that ownership is transferred to the back-end, so this function > + * closes @fd in the front-end. > + * @reply_fd: If the back-end wishes to use a different pipe for state > + * transfer, this will contain an FD for the front-end to > + * use. Otherwise, -1 is stored here. > + * @errp: Potential error description > + * > + * Returns 0 on success, and -errno on failure. > + */ > +int vhost_set_device_state_fd(struct vhost_dev *dev, > + VhostDeviceStateDirection direction, > + VhostDeviceStatePhase phase, > + int fd, > + int *reply_fd, > + Error **errp); > + > +/** > + * vhost_set_device_state_fd(): After transferring state from/to the > + * back-end via vhost_set_device_state_fd(), i.e. once the sending end > + * has closed the pipe, inquire the back-end to report any potential > + * errors that have occurred on its side. This allows to sense errors > + * like: > + * - During outgoing migration, when the source side had already started > + * to produce its state, something went wrong and it failed to finish > + * - During incoming migration, when the received state is somehow > + * invalid and cannot be processed by the back-end > + * > + * @dev: The vhost device > + * @errp: Potential error description > + * > + * Returns 0 when the back-end reports successful state transfer and > + * processing, and -errno when an error occurred somewhere. 
> + */ > +int vhost_check_device_state(struct vhost_dev *dev, Error **errp); > + > #endif > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c > index e5285df4ba..93d8f2494a 100644 > --- a/hw/virtio/vhost-user.c > +++ b/hw/virtio/vhost-user.c > @@ -83,6 +83,7 @@ enum VhostUserProtocolFeature { > /* Feature 14 reserved for VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS. */ > VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS = 15, > VHOST_USER_PROTOCOL_F_STATUS = 16, > + VHOST_USER_PROTOCOL_F_MIGRATORY_STATE = 17, > VHOST_USER_PROTOCOL_F_MAX > }; > > @@ -130,6 +131,8 @@ typedef enum VhostUserRequest { > VHOST_USER_REM_MEM_REG = 38, > VHOST_USER_SET_STATUS = 39, > VHOST_USER_GET_STATUS = 40, > + VHOST_USER_SET_DEVICE_STATE_FD = 41, > + VHOST_USER_CHECK_DEVICE_STATE = 42, > VHOST_USER_MAX > } VhostUserRequest; > > @@ -210,6 +213,12 @@ typedef struct { > uint32_t size; /* the following payload size */ > } QEMU_PACKED VhostUserHeader; > > +/* Request payload of VHOST_USER_SET_DEVICE_STATE_FD */ > +typedef struct VhostUserTransferDeviceState { > + uint32_t direction; > + uint32_t phase; > +} VhostUserTransferDeviceState; > + > typedef union { > #define VHOST_USER_VRING_IDX_MASK (0xff) > #define VHOST_USER_VRING_NOFD_MASK (0x1 << 8) > @@ -224,6 +233,7 @@ typedef union { > VhostUserCryptoSession session; > VhostUserVringArea area; > VhostUserInflight inflight; > + VhostUserTransferDeviceState transfer_state; > } VhostUserPayload; > > typedef struct VhostUserMsg { > @@ -2681,6 +2691,140 @@ static int vhost_user_dev_start(struct vhost_dev *dev, bool started) > } > } > > +static bool vhost_user_supports_migratory_state(struct vhost_dev *dev) > +{ > + return virtio_has_feature(dev->protocol_features, > + VHOST_USER_PROTOCOL_F_MIGRATORY_STATE); > +} > + > +static int vhost_user_set_device_state_fd(struct vhost_dev *dev, > + VhostDeviceStateDirection direction, > + VhostDeviceStatePhase phase, > + int fd, > + int *reply_fd, > + Error **errp) > +{ > + int ret; > + struct vhost_user 
*vu = dev->opaque; > + VhostUserMsg msg = { > + .hdr = { > + .request = VHOST_USER_SET_DEVICE_STATE_FD, > + .flags = VHOST_USER_VERSION, > + .size = sizeof(msg.payload.transfer_state), > + }, > + .payload.transfer_state = { > + .direction = direction, > + .phase = phase, > + }, > + }; > + > + *reply_fd = -1; > + > + if (!vhost_user_supports_migratory_state(dev)) { > + close(fd); > + error_setg(errp, "Back-end does not support migration state transfer"); > + return -ENOTSUP; > + } > + > + ret = vhost_user_write(dev, &msg, &fd, 1); > + close(fd); > + if (ret < 0) { > + error_setg_errno(errp, -ret, > + "Failed to send SET_DEVICE_STATE_FD message"); > + return ret; > + } > + > + ret = vhost_user_read(dev, &msg); > + if (ret < 0) { > + error_setg_errno(errp, -ret, > + "Failed to receive SET_DEVICE_STATE_FD reply"); > + return ret; > + } > + > + if (msg.hdr.request != VHOST_USER_SET_DEVICE_STATE_FD) { > + error_setg(errp, > + "Received unexpected message type, expected %d, received %d", > + VHOST_USER_SET_DEVICE_STATE_FD, msg.hdr.request); > + return -EPROTO; > + } > + > + if (msg.hdr.size != sizeof(msg.payload.u64)) { > + error_setg(errp, > + "Received bad message size, expected %zu, received %" PRIu32, > + sizeof(msg.payload.u64), msg.hdr.size); > + return -EPROTO; > + } > + > + if ((msg.payload.u64 & 0xff) != 0) { > + error_setg(errp, "Back-end did not accept migration state transfer"); > + return -EIO; > + } > + > + if (!(msg.payload.u64 & VHOST_USER_VRING_NOFD_MASK)) { > + *reply_fd = qemu_chr_fe_get_msgfd(vu->user->chr); > + if (*reply_fd < 0) { > + error_setg(errp, > + "Failed to get back-end-provided transfer pipe FD"); > + *reply_fd = -1; > + return -EIO; > + } > + } > + > + return 0; > +} > + > +static int vhost_user_check_device_state(struct vhost_dev *dev, Error **errp) > +{ > + int ret; > + VhostUserMsg msg = { > + .hdr = { > + .request = VHOST_USER_CHECK_DEVICE_STATE, > + .flags = VHOST_USER_VERSION, > + .size = 0, > + }, > + }; > + > + if 
(!vhost_user_supports_migratory_state(dev)) { > + error_setg(errp, "Back-end does not support migration state transfer"); > + return -ENOTSUP; > + } > + > + ret = vhost_user_write(dev, &msg, NULL, 0); > + if (ret < 0) { > + error_setg_errno(errp, -ret, > + "Failed to send CHECK_DEVICE_STATE message"); > + return ret; > + } > + > + ret = vhost_user_read(dev, &msg); > + if (ret < 0) { > + error_setg_errno(errp, -ret, > + "Failed to receive CHECK_DEVICE_STATE reply"); > + return ret; > + } > + > + if (msg.hdr.request != VHOST_USER_CHECK_DEVICE_STATE) { > + error_setg(errp, > + "Received unexpected message type, expected %d, received %d", > + VHOST_USER_CHECK_DEVICE_STATE, msg.hdr.request); > + return -EPROTO; > + } > + > + if (msg.hdr.size != sizeof(msg.payload.u64)) { > + error_setg(errp, > + "Received bad message size, expected %zu, received %" PRIu32, > + sizeof(msg.payload.u64), msg.hdr.size); > + return -EPROTO; > + } > + > + if (msg.payload.u64 != 0) { > + error_setg(errp, "Back-end failed to process its internal state"); > + return -EIO; > + } > + > + return 0; > +} > + > const VhostOps user_ops = { > .backend_type = VHOST_BACKEND_TYPE_USER, > .vhost_backend_init = vhost_user_backend_init, > @@ -2716,4 +2860,7 @@ const VhostOps user_ops = { > .vhost_get_inflight_fd = vhost_user_get_inflight_fd, > .vhost_set_inflight_fd = vhost_user_set_inflight_fd, > .vhost_dev_start = vhost_user_dev_start, > + .vhost_supports_migratory_state = vhost_user_supports_migratory_state, > + .vhost_set_device_state_fd = vhost_user_set_device_state_fd, > + .vhost_check_device_state = vhost_user_check_device_state, > }; > diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c > index cbff589efa..90099d8f6a 100644 > --- a/hw/virtio/vhost.c > +++ b/hw/virtio/vhost.c > @@ -2088,3 +2088,40 @@ int vhost_net_set_backend(struct vhost_dev *hdev, > > return -ENOSYS; > } > + > +bool vhost_supports_migratory_state(struct vhost_dev *dev) > +{ > + if (dev->vhost_ops->vhost_supports_migratory_state) { > 
+ return dev->vhost_ops->vhost_supports_migratory_state(dev); > + } > + > + return false; > +} > + > +int vhost_set_device_state_fd(struct vhost_dev *dev, > + VhostDeviceStateDirection direction, > + VhostDeviceStatePhase phase, > + int fd, > + int *reply_fd, > + Error **errp) > +{ > + if (dev->vhost_ops->vhost_set_device_state_fd) { > + return dev->vhost_ops->vhost_set_device_state_fd(dev, direction, phase, > + fd, reply_fd, errp); > + } > + > + error_setg(errp, > + "vhost transport does not support migration state transfer"); > + return -ENOSYS; > +} > + > +int vhost_check_device_state(struct vhost_dev *dev, Error **errp) > +{ > + if (dev->vhost_ops->vhost_check_device_state) { > + return dev->vhost_ops->vhost_check_device_state(dev, errp); > + } > + > + error_setg(errp, > + "vhost transport does not support migration state transfer"); > + return -ENOSYS; > +} > -- > 2.39.1 >
On Tue, Apr 11, 2023 at 5:33 PM Hanna Czenczek <hreitz@redhat.com> wrote:
> [...]
> [...]
> +/**
> + * vhost_set_device_state_fd(): After transferring state from/to the

Nitpick: This function doc is for vhost_check_device_state, not
vhost_set_device_state_fd.

Thanks!

> [...]
> --
> 2.39.1
>
On 12.04.23 23:06, Stefan Hajnoczi wrote: > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: >> So-called "internal" virtio-fs migration refers to transporting the >> back-end's (virtiofsd's) state through qemu's migration stream. To do >> this, we need to be able to transfer virtiofsd's internal state to and >> from virtiofsd. >> >> Because virtiofsd's internal state will not be too large, we believe it >> is best to transfer it as a single binary blob after the streaming >> phase. Because this method should be useful to other vhost-user >> implementations, too, it is introduced as a general-purpose addition to >> the protocol, not limited to vhost-user-fs. >> >> These are the additions to the protocol: >> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: >> This feature signals support for transferring state, and is added so >> that migration can fail early when the back-end has no support. >> >> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe >> over which to transfer the state. The front-end sends an FD to the >> back-end into/from which it can write/read its state, and the back-end >> can decide to either use it, or reply with a different FD for the >> front-end to override the front-end's choice. >> The front-end creates a simple pipe to transfer the state, but maybe >> the back-end already has an FD into/from which it has to write/read >> its state, in which case it will want to override the simple pipe. >> Conversely, maybe in the future we find a way to have the front-end >> get an immediate FD for the migration stream (in some cases), in which >> case we will want to send this to the back-end instead of creating a >> pipe. >> Hence the negotiation: If one side has a better idea than a plain >> pipe, we will want to use that. >> >> - CHECK_DEVICE_STATE: After the state has been transferred through the >> pipe (the end indicated by EOF), the front-end invokes this function >> to verify success. 
There is no in-band way (through the pipe) to >> indicate failure, so we need to check explicitly. >> >> Once the transfer pipe has been established via SET_DEVICE_STATE_FD >> (which includes establishing the direction of transfer and migration >> phase), the sending side writes its data into the pipe, and the reading >> side reads it until it sees an EOF. Then, the front-end will check for >> success via CHECK_DEVICE_STATE, which on the destination side includes >> checking for integrity (i.e. errors during deserialization). >> >> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> >> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> >> --- >> include/hw/virtio/vhost-backend.h | 24 +++++ >> include/hw/virtio/vhost.h | 79 ++++++++++++++++ >> hw/virtio/vhost-user.c | 147 ++++++++++++++++++++++++++++++ >> hw/virtio/vhost.c | 37 ++++++++ >> 4 files changed, 287 insertions(+) >> >> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h >> index ec3fbae58d..5935b32fe3 100644 >> --- a/include/hw/virtio/vhost-backend.h >> +++ b/include/hw/virtio/vhost-backend.h >> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { >> VHOST_SET_CONFIG_TYPE_MIGRATION = 1, >> } VhostSetConfigType; >> >> +typedef enum VhostDeviceStateDirection { >> + /* Transfer state from back-end (device) to front-end */ >> + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, >> + /* Transfer state from front-end to back-end (device) */ >> + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, >> +} VhostDeviceStateDirection; >> + >> +typedef enum VhostDeviceStatePhase { >> + /* The device (and all its vrings) is stopped */ >> + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, >> +} VhostDeviceStatePhase; > vDPA has: > > /* Suspend a device so it does not process virtqueue requests anymore > * > * After the return of ioctl the device must preserve all the necessary state > * (the virtqueue vring base plus the possible device specific states) that is > * required for restoring in the future. 
The device must not change its > * configuration after that point. > */ > #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) > > /* Resume a device so it can resume processing virtqueue requests > * > * After the return of this ioctl the device will have restored all the > * necessary states and it is fully operational to continue processing the > * virtqueue descriptors. > */ > #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) > > I wonder if it makes sense to import these into vhost-user so that the > difference between kernel vhost and vhost-user is minimized. It's okay > if one of them is ahead of the other, but it would be nice to avoid > overlapping/duplicated functionality. > > (And I hope vDPA will import the device state vhost-user messages > introduced in this series.) I don’t understand your suggestion. (Like, I very simply don’t understand :)) These are vhost messages, right? What purpose do you have in mind for them in vhost-user for internal migration? They’re different from the state transfer messages, because they don’t transfer state to/from the front-end. Also, the state transfer stuff is supposed to be distinct from starting/stopping the device; right now, it just requires the device to be stopped beforehand (or started only afterwards). And in the future, new VhostDeviceStatePhase values may allow the messages to be used on devices that aren’t stopped. So they seem to serve very different purposes. I can imagine using the VDPA_{SUSPEND,RESUME} messages for external migration (what Anton is working on), but they don’t really help with internal migration implemented here. If I were to add them, they’d just be sent in addition to the new messages added in this patch here, i.e. SUSPEND on the source before SET_DEVICE_STATE_FD, and RESUME on the destination after CHECK_DEVICE_STATE (we could use RESUME in place of CHECK_DEVICE_STATE on the destination, but we can’t do that on the source, so we still need CHECK_DEVICE_STATE). Hanna
On 13.04.23 10:50, Eugenio Perez Martin wrote: > On Tue, Apr 11, 2023 at 5:33 PM Hanna Czenczek <hreitz@redhat.com> wrote: >> So-called "internal" virtio-fs migration refers to transporting the >> back-end's (virtiofsd's) state through qemu's migration stream. To do >> this, we need to be able to transfer virtiofsd's internal state to and >> from virtiofsd. >> >> Because virtiofsd's internal state will not be too large, we believe it >> is best to transfer it as a single binary blob after the streaming >> phase. Because this method should be useful to other vhost-user >> implementations, too, it is introduced as a general-purpose addition to >> the protocol, not limited to vhost-user-fs. >> >> These are the additions to the protocol: >> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: >> This feature signals support for transferring state, and is added so >> that migration can fail early when the back-end has no support. >> >> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe >> over which to transfer the state. The front-end sends an FD to the >> back-end into/from which it can write/read its state, and the back-end >> can decide to either use it, or reply with a different FD for the >> front-end to override the front-end's choice. >> The front-end creates a simple pipe to transfer the state, but maybe >> the back-end already has an FD into/from which it has to write/read >> its state, in which case it will want to override the simple pipe. >> Conversely, maybe in the future we find a way to have the front-end >> get an immediate FD for the migration stream (in some cases), in which >> case we will want to send this to the back-end instead of creating a >> pipe. >> Hence the negotiation: If one side has a better idea than a plain >> pipe, we will want to use that. 
>> >> - CHECK_DEVICE_STATE: After the state has been transferred through the >> pipe (the end indicated by EOF), the front-end invokes this function >> to verify success. There is no in-band way (through the pipe) to >> indicate failure, so we need to check explicitly. >> >> Once the transfer pipe has been established via SET_DEVICE_STATE_FD >> (which includes establishing the direction of transfer and migration >> phase), the sending side writes its data into the pipe, and the reading >> side reads it until it sees an EOF. Then, the front-end will check for >> success via CHECK_DEVICE_STATE, which on the destination side includes >> checking for integrity (i.e. errors during deserialization). >> >> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> >> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> >> --- >> include/hw/virtio/vhost-backend.h | 24 +++++ >> include/hw/virtio/vhost.h | 79 ++++++++++++++++ >> hw/virtio/vhost-user.c | 147 ++++++++++++++++++++++++++++++ >> hw/virtio/vhost.c | 37 ++++++++ >> 4 files changed, 287 insertions(+) [...] >> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h >> index 2fe02ed5d4..29449e0fe2 100644 >> --- a/include/hw/virtio/vhost.h >> +++ b/include/hw/virtio/vhost.h >> @@ -346,4 +346,83 @@ int vhost_dev_set_inflight(struct vhost_dev *dev, [...] >> +/** >> + * vhost_set_device_state_fd(): After transferring state from/to the > Nitpick: This function doc is for vhost_check_device_state not > vhost_set_device_state_fd. > > Thanks! Oops, right, thanks! Hanna >> + * back-end via vhost_set_device_state_fd(), i.e. once the sending end >> + * has closed the pipe, inquire the back-end to report any potential >> + * errors that have occurred on its side. 
This allows to sense errors >> + * like: >> + * - During outgoing migration, when the source side had already started >> + * to produce its state, something went wrong and it failed to finish >> + * - During incoming migration, when the received state is somehow >> + * invalid and cannot be processed by the back-end >> + * >> + * @dev: The vhost device >> + * @errp: Potential error description >> + * >> + * Returns 0 when the back-end reports successful state transfer and >> + * processing, and -errno when an error occurred somewhere. >> + */ >> +int vhost_check_device_state(struct vhost_dev *dev, Error **errp); >> +
On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: > > So-called "internal" virtio-fs migration refers to transporting the > > back-end's (virtiofsd's) state through qemu's migration stream. To do > > this, we need to be able to transfer virtiofsd's internal state to and > > from virtiofsd. > > > > Because virtiofsd's internal state will not be too large, we believe it > > is best to transfer it as a single binary blob after the streaming > > phase. Because this method should be useful to other vhost-user > > implementations, too, it is introduced as a general-purpose addition to > > the protocol, not limited to vhost-user-fs. > > > > These are the additions to the protocol: > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: > > This feature signals support for transferring state, and is added so > > that migration can fail early when the back-end has no support. > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe > > over which to transfer the state. The front-end sends an FD to the > > back-end into/from which it can write/read its state, and the back-end > > can decide to either use it, or reply with a different FD for the > > front-end to override the front-end's choice. > > The front-end creates a simple pipe to transfer the state, but maybe > > the back-end already has an FD into/from which it has to write/read > > its state, in which case it will want to override the simple pipe. > > Conversely, maybe in the future we find a way to have the front-end > > get an immediate FD for the migration stream (in some cases), in which > > case we will want to send this to the back-end instead of creating a > > pipe. > > Hence the negotiation: If one side has a better idea than a plain > > pipe, we will want to use that. 
> > > > - CHECK_DEVICE_STATE: After the state has been transferred through the > > pipe (the end indicated by EOF), the front-end invokes this function > > to verify success. There is no in-band way (through the pipe) to > > indicate failure, so we need to check explicitly. > > > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD > > (which includes establishing the direction of transfer and migration > > phase), the sending side writes its data into the pipe, and the reading > > side reads it until it sees an EOF. Then, the front-end will check for > > success via CHECK_DEVICE_STATE, which on the destination side includes > > checking for integrity (i.e. errors during deserialization). > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > > --- > > include/hw/virtio/vhost-backend.h | 24 +++++ > > include/hw/virtio/vhost.h | 79 ++++++++++++++++ > > hw/virtio/vhost-user.c | 147 ++++++++++++++++++++++++++++++ > > hw/virtio/vhost.c | 37 ++++++++ > > 4 files changed, 287 insertions(+) > > > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h > > index ec3fbae58d..5935b32fe3 100644 > > --- a/include/hw/virtio/vhost-backend.h > > +++ b/include/hw/virtio/vhost-backend.h > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { > > VHOST_SET_CONFIG_TYPE_MIGRATION = 1, > > } VhostSetConfigType; > > > > +typedef enum VhostDeviceStateDirection { > > + /* Transfer state from back-end (device) to front-end */ > > + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, > > + /* Transfer state from front-end to back-end (device) */ > > + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, > > +} VhostDeviceStateDirection; > > + > > +typedef enum VhostDeviceStatePhase { > > + /* The device (and all its vrings) is stopped */ > > + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, > > +} VhostDeviceStatePhase; > > vDPA has: > > /* Suspend a device so it does not process virtqueue requests anymore > * > * 
After the return of ioctl the device must preserve all the necessary state > * (the virtqueue vring base plus the possible device specific states) that is > * required for restoring in the future. The device must not change its > * configuration after that point. > */ > #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) > > /* Resume a device so it can resume processing virtqueue requests > * > * After the return of this ioctl the device will have restored all the > * necessary states and it is fully operational to continue processing the > * virtqueue descriptors. > */ > #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) > > I wonder if it makes sense to import these into vhost-user so that the > difference between kernel vhost and vhost-user is minimized. It's okay > if one of them is ahead of the other, but it would be nice to avoid > overlapping/duplicated functionality. > That's what I had in mind in the first versions. I proposed VHOST_STOP instead of VHOST_VDPA_STOP for this very reason. Later it changed to SUSPEND. Generally, in my opinion, it is better to make the interface less parametrized and to trust the messages and their semantics. In other words, instead of vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), send the equivalent of the VHOST_VDPA_SUSPEND vhost-user command individually. Another way to apply this is with the "direction" parameter. Maybe it is better to split it into "set_state_fd" and "get_state_fd"? In that case, reusing the ioctls as vhost-user messages would be ok. But that puts this proposal further from the VFIO code, which uses "migration_set_state(state)", and maybe that is better when the number of states is high. BTW, is there any use for *reply_fd from the back-end at this moment? > (And I hope vDPA will import the device state vhost-user messages > introduced in this series.) > I guess they will be needed for vdpa-fs devices? Is there any emulated virtio-fs in qemu? Thanks! 
> > + > > struct vhost_inflight; > > struct vhost_dev; > > struct vhost_log; > > @@ -133,6 +145,15 @@ typedef int (*vhost_set_config_call_op)(struct vhost_dev *dev, > > > > typedef void (*vhost_reset_status_op)(struct vhost_dev *dev); > > > > +typedef bool (*vhost_supports_migratory_state_op)(struct vhost_dev *dev); > > +typedef int (*vhost_set_device_state_fd_op)(struct vhost_dev *dev, > > + VhostDeviceStateDirection direction, > > + VhostDeviceStatePhase phase, > > + int fd, > > + int *reply_fd, > > + Error **errp); > > +typedef int (*vhost_check_device_state_op)(struct vhost_dev *dev, Error **errp); > > + > > typedef struct VhostOps { > > VhostBackendType backend_type; > > vhost_backend_init vhost_backend_init; > > @@ -181,6 +202,9 @@ typedef struct VhostOps { > > vhost_force_iommu_op vhost_force_iommu; > > vhost_set_config_call_op vhost_set_config_call; > > vhost_reset_status_op vhost_reset_status; > > + vhost_supports_migratory_state_op vhost_supports_migratory_state; > > + vhost_set_device_state_fd_op vhost_set_device_state_fd; > > + vhost_check_device_state_op vhost_check_device_state; > > } VhostOps; > > > > int vhost_backend_update_device_iotlb(struct vhost_dev *dev, > > diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h > > index 2fe02ed5d4..29449e0fe2 100644 > > --- a/include/hw/virtio/vhost.h > > +++ b/include/hw/virtio/vhost.h > > @@ -346,4 +346,83 @@ int vhost_dev_set_inflight(struct vhost_dev *dev, > > struct vhost_inflight *inflight); > > int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size, > > struct vhost_inflight *inflight); > > + > > +/** > > + * vhost_supports_migratory_state(): Checks whether the back-end > > + * supports transferring internal state for the purpose of migration. > > + * Support for this feature is required for vhost_set_device_state_fd() > > + * and vhost_check_device_state(). 
> > + * > > + * @dev: The vhost device > > + * > > + * Returns true if the device supports these commands, and false if it > > + * does not. > > + */ > > +bool vhost_supports_migratory_state(struct vhost_dev *dev); > > + > > +/** > > + * vhost_set_device_state_fd(): Begin transfer of internal state from/to > > + * the back-end for the purpose of migration. Data is to be transferred > > + * over a pipe according to @direction and @phase. The sending end must > > + * only write to the pipe, and the receiving end must only read from it. > > + * Once the sending end is done, it closes its FD. The receiving end > > + * must take this as the end-of-transfer signal and close its FD, too. > > + * > > + * @fd is the back-end's end of the pipe: The write FD for SAVE, and the > > + * read FD for LOAD. This function transfers ownership of @fd to the > > + * back-end, i.e. closes it in the front-end. > > + * > > + * The back-end may optionally reply with an FD of its own, if this > > + * improves efficiency on its end. In this case, the returned FD is > > + * stored in *reply_fd. The back-end will discard the FD sent to it, > > + * and the front-end must use *reply_fd for transferring state to/from > > + * the back-end. > > + * > > + * @dev: The vhost device > > + * @direction: The direction in which the state is to be transferred. > > + * For outgoing migrations, this is SAVE, and data is read > > + * from the back-end and stored by the front-end in the > > + * migration stream. > > + * For incoming migrations, this is LOAD, and data is read > > + * by the front-end from the migration stream and sent to > > + * the back-end to restore the saved state. > > + * @phase: Which migration phase we are in. Currently, there is only > > + * STOPPED (device and all vrings are stopped), in the future, > > + * more phases such as PRE_COPY or POST_COPY may be added. 
> > + * @fd: Back-end's end of the pipe through which to transfer state; note > > + * that ownership is transferred to the back-end, so this function > > + * closes @fd in the front-end. > > + * @reply_fd: If the back-end wishes to use a different pipe for state > > + * transfer, this will contain an FD for the front-end to > > + * use. Otherwise, -1 is stored here. > > + * @errp: Potential error description > > + * > > + * Returns 0 on success, and -errno on failure. > > + */ > > +int vhost_set_device_state_fd(struct vhost_dev *dev, > > + VhostDeviceStateDirection direction, > > + VhostDeviceStatePhase phase, > > + int fd, > > + int *reply_fd, > > + Error **errp); > > + > > +/** > > + * vhost_set_device_state_fd(): After transferring state from/to the > > + * back-end via vhost_set_device_state_fd(), i.e. once the sending end > > + * has closed the pipe, inquire the back-end to report any potential > > + * errors that have occurred on its side. This allows to sense errors > > + * like: > > + * - During outgoing migration, when the source side had already started > > + * to produce its state, something went wrong and it failed to finish > > + * - During incoming migration, when the received state is somehow > > + * invalid and cannot be processed by the back-end > > + * > > + * @dev: The vhost device > > + * @errp: Potential error description > > + * > > + * Returns 0 when the back-end reports successful state transfer and > > + * processing, and -errno when an error occurred somewhere. > > + */ > > +int vhost_check_device_state(struct vhost_dev *dev, Error **errp); > > + > > #endif > > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c > > index e5285df4ba..93d8f2494a 100644 > > --- a/hw/virtio/vhost-user.c > > +++ b/hw/virtio/vhost-user.c > > @@ -83,6 +83,7 @@ enum VhostUserProtocolFeature { > > /* Feature 14 reserved for VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS. 
*/ > > VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS = 15, > > VHOST_USER_PROTOCOL_F_STATUS = 16, > > + VHOST_USER_PROTOCOL_F_MIGRATORY_STATE = 17, > > VHOST_USER_PROTOCOL_F_MAX > > }; > > > > @@ -130,6 +131,8 @@ typedef enum VhostUserRequest { > > VHOST_USER_REM_MEM_REG = 38, > > VHOST_USER_SET_STATUS = 39, > > VHOST_USER_GET_STATUS = 40, > > + VHOST_USER_SET_DEVICE_STATE_FD = 41, > > + VHOST_USER_CHECK_DEVICE_STATE = 42, > > VHOST_USER_MAX > > } VhostUserRequest; > > > > @@ -210,6 +213,12 @@ typedef struct { > > uint32_t size; /* the following payload size */ > > } QEMU_PACKED VhostUserHeader; > > > > +/* Request payload of VHOST_USER_SET_DEVICE_STATE_FD */ > > +typedef struct VhostUserTransferDeviceState { > > + uint32_t direction; > > + uint32_t phase; > > +} VhostUserTransferDeviceState; > > + > > typedef union { > > #define VHOST_USER_VRING_IDX_MASK (0xff) > > #define VHOST_USER_VRING_NOFD_MASK (0x1 << 8) > > @@ -224,6 +233,7 @@ typedef union { > > VhostUserCryptoSession session; > > VhostUserVringArea area; > > VhostUserInflight inflight; > > + VhostUserTransferDeviceState transfer_state; > > } VhostUserPayload; > > > > typedef struct VhostUserMsg { > > @@ -2681,6 +2691,140 @@ static int vhost_user_dev_start(struct vhost_dev *dev, bool started) > > } > > } > > > > +static bool vhost_user_supports_migratory_state(struct vhost_dev *dev) > > +{ > > + return virtio_has_feature(dev->protocol_features, > > + VHOST_USER_PROTOCOL_F_MIGRATORY_STATE); > > +} > > + > > +static int vhost_user_set_device_state_fd(struct vhost_dev *dev, > > + VhostDeviceStateDirection direction, > > + VhostDeviceStatePhase phase, > > + int fd, > > + int *reply_fd, > > + Error **errp) > > +{ > > + int ret; > > + struct vhost_user *vu = dev->opaque; > > + VhostUserMsg msg = { > > + .hdr = { > > + .request = VHOST_USER_SET_DEVICE_STATE_FD, > > + .flags = VHOST_USER_VERSION, > > + .size = sizeof(msg.payload.transfer_state), > > + }, > > + .payload.transfer_state = { > > + .direction = direction, 
> > + .phase = phase, > > + }, > > + }; > > + > > + *reply_fd = -1; > > + > > + if (!vhost_user_supports_migratory_state(dev)) { > > + close(fd); > > + error_setg(errp, "Back-end does not support migration state transfer"); > > + return -ENOTSUP; > > + } > > + > > + ret = vhost_user_write(dev, &msg, &fd, 1); > > + close(fd); > > + if (ret < 0) { > > + error_setg_errno(errp, -ret, > > + "Failed to send SET_DEVICE_STATE_FD message"); > > + return ret; > > + } > > + > > + ret = vhost_user_read(dev, &msg); > > + if (ret < 0) { > > + error_setg_errno(errp, -ret, > > + "Failed to receive SET_DEVICE_STATE_FD reply"); > > + return ret; > > + } > > + > > + if (msg.hdr.request != VHOST_USER_SET_DEVICE_STATE_FD) { > > + error_setg(errp, > > + "Received unexpected message type, expected %d, received %d", > > + VHOST_USER_SET_DEVICE_STATE_FD, msg.hdr.request); > > + return -EPROTO; > > + } > > + > > + if (msg.hdr.size != sizeof(msg.payload.u64)) { > > + error_setg(errp, > > + "Received bad message size, expected %zu, received %" PRIu32, > > + sizeof(msg.payload.u64), msg.hdr.size); > > + return -EPROTO; > > + } > > + > > + if ((msg.payload.u64 & 0xff) != 0) { > > + error_setg(errp, "Back-end did not accept migration state transfer"); > > + return -EIO; > > + } > > + > > + if (!(msg.payload.u64 & VHOST_USER_VRING_NOFD_MASK)) { > > + *reply_fd = qemu_chr_fe_get_msgfd(vu->user->chr); > > + if (*reply_fd < 0) { > > + error_setg(errp, > > + "Failed to get back-end-provided transfer pipe FD"); > > + *reply_fd = -1; > > + return -EIO; > > + } > > + } > > + > > + return 0; > > +} > > + > > +static int vhost_user_check_device_state(struct vhost_dev *dev, Error **errp) > > +{ > > + int ret; > > + VhostUserMsg msg = { > > + .hdr = { > > + .request = VHOST_USER_CHECK_DEVICE_STATE, > > + .flags = VHOST_USER_VERSION, > > + .size = 0, > > + }, > > + }; > > + > > + if (!vhost_user_supports_migratory_state(dev)) { > > + error_setg(errp, "Back-end does not support migration state transfer"); > > 
+ return -ENOTSUP; > > + } > > + > > + ret = vhost_user_write(dev, &msg, NULL, 0); > > + if (ret < 0) { > > + error_setg_errno(errp, -ret, > > + "Failed to send CHECK_DEVICE_STATE message"); > > + return ret; > > + } > > + > > + ret = vhost_user_read(dev, &msg); > > + if (ret < 0) { > > + error_setg_errno(errp, -ret, > > + "Failed to receive CHECK_DEVICE_STATE reply"); > > + return ret; > > + } > > + > > + if (msg.hdr.request != VHOST_USER_CHECK_DEVICE_STATE) { > > + error_setg(errp, > > + "Received unexpected message type, expected %d, received %d", > > + VHOST_USER_CHECK_DEVICE_STATE, msg.hdr.request); > > + return -EPROTO; > > + } > > + > > + if (msg.hdr.size != sizeof(msg.payload.u64)) { > > + error_setg(errp, > > + "Received bad message size, expected %zu, received %" PRIu32, > > + sizeof(msg.payload.u64), msg.hdr.size); > > + return -EPROTO; > > + } > > + > > + if (msg.payload.u64 != 0) { > > + error_setg(errp, "Back-end failed to process its internal state"); > > + return -EIO; > > + } > > + > > + return 0; > > +} > > + > > const VhostOps user_ops = { > > .backend_type = VHOST_BACKEND_TYPE_USER, > > .vhost_backend_init = vhost_user_backend_init, > > @@ -2716,4 +2860,7 @@ const VhostOps user_ops = { > > .vhost_get_inflight_fd = vhost_user_get_inflight_fd, > > .vhost_set_inflight_fd = vhost_user_set_inflight_fd, > > .vhost_dev_start = vhost_user_dev_start, > > + .vhost_supports_migratory_state = vhost_user_supports_migratory_state, > > + .vhost_set_device_state_fd = vhost_user_set_device_state_fd, > > + .vhost_check_device_state = vhost_user_check_device_state, > > }; > > diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c > > index cbff589efa..90099d8f6a 100644 > > --- a/hw/virtio/vhost.c > > +++ b/hw/virtio/vhost.c > > @@ -2088,3 +2088,40 @@ int vhost_net_set_backend(struct vhost_dev *hdev, > > > > return -ENOSYS; > > } > > + > > +bool vhost_supports_migratory_state(struct vhost_dev *dev) > > +{ > > + if (dev->vhost_ops->vhost_supports_migratory_state) { > > 
+ return dev->vhost_ops->vhost_supports_migratory_state(dev); > > + } > > + > > + return false; > > +} > > + > > +int vhost_set_device_state_fd(struct vhost_dev *dev, > > + VhostDeviceStateDirection direction, > > + VhostDeviceStatePhase phase, > > + int fd, > > + int *reply_fd, > > + Error **errp) > > +{ > > + if (dev->vhost_ops->vhost_set_device_state_fd) { > > + return dev->vhost_ops->vhost_set_device_state_fd(dev, direction, phase, > > + fd, reply_fd, errp); > > + } > > + > > + error_setg(errp, > > + "vhost transport does not support migration state transfer"); > > + return -ENOSYS; > > +} > > + > > +int vhost_check_device_state(struct vhost_dev *dev, Error **errp) > > +{ > > + if (dev->vhost_ops->vhost_check_device_state) { > > + return dev->vhost_ops->vhost_check_device_state(dev, errp); > > + } > > + > > + error_setg(errp, > > + "vhost transport does not support migration state transfer"); > > + return -ENOSYS; > > +} > > -- > > 2.39.1 > >
On Thu, 13 Apr 2023 at 06:15, Eugenio Perez Martin <eperezma@redhat.com> wrote: > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: > > (And I hope vDPA will import the device state vhost-user messages > > introduced in this series.) > > > > I guess they will be needed for vdpa-fs devices? Is there any emulated > virtio-fs in qemu? Maybe also virtio-gpu or virtio-crypto, if someone decides to create hardware or in-kernel implementations. virtiofs is not built into QEMU; there are only vhost-user implementations. Stefan
On Thu, 13 Apr 2023 at 05:24, Hanna Czenczek <hreitz@redhat.com> wrote: > > On 12.04.23 23:06, Stefan Hajnoczi wrote: > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: > >> So-called "internal" virtio-fs migration refers to transporting the > >> back-end's (virtiofsd's) state through qemu's migration stream. To do > >> this, we need to be able to transfer virtiofsd's internal state to and > >> from virtiofsd. > >> > >> Because virtiofsd's internal state will not be too large, we believe it > >> is best to transfer it as a single binary blob after the streaming > >> phase. Because this method should be useful to other vhost-user > >> implementations, too, it is introduced as a general-purpose addition to > >> the protocol, not limited to vhost-user-fs. > >> > >> These are the additions to the protocol: > >> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: > >> This feature signals support for transferring state, and is added so > >> that migration can fail early when the back-end has no support. > >> > >> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe > >> over which to transfer the state. The front-end sends an FD to the > >> back-end into/from which it can write/read its state, and the back-end > >> can decide to either use it, or reply with a different FD for the > >> front-end to override the front-end's choice. > >> The front-end creates a simple pipe to transfer the state, but maybe > >> the back-end already has an FD into/from which it has to write/read > >> its state, in which case it will want to override the simple pipe. > >> Conversely, maybe in the future we find a way to have the front-end > >> get an immediate FD for the migration stream (in some cases), in which > >> case we will want to send this to the back-end instead of creating a > >> pipe. > >> Hence the negotiation: If one side has a better idea than a plain > >> pipe, we will want to use that. 
> >> > >> - CHECK_DEVICE_STATE: After the state has been transferred through the > >> pipe (the end indicated by EOF), the front-end invokes this function > >> to verify success. There is no in-band way (through the pipe) to > >> indicate failure, so we need to check explicitly. > >> > >> Once the transfer pipe has been established via SET_DEVICE_STATE_FD > >> (which includes establishing the direction of transfer and migration > >> phase), the sending side writes its data into the pipe, and the reading > >> side reads it until it sees an EOF. Then, the front-end will check for > >> success via CHECK_DEVICE_STATE, which on the destination side includes > >> checking for integrity (i.e. errors during deserialization). > >> > >> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > >> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > >> --- > >> include/hw/virtio/vhost-backend.h | 24 +++++ > >> include/hw/virtio/vhost.h | 79 ++++++++++++++++ > >> hw/virtio/vhost-user.c | 147 ++++++++++++++++++++++++++++++ > >> hw/virtio/vhost.c | 37 ++++++++ > >> 4 files changed, 287 insertions(+) > >> > >> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h > >> index ec3fbae58d..5935b32fe3 100644 > >> --- a/include/hw/virtio/vhost-backend.h > >> +++ b/include/hw/virtio/vhost-backend.h > >> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { > >> VHOST_SET_CONFIG_TYPE_MIGRATION = 1, > >> } VhostSetConfigType; > >> > >> +typedef enum VhostDeviceStateDirection { > >> + /* Transfer state from back-end (device) to front-end */ > >> + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, > >> + /* Transfer state from front-end to back-end (device) */ > >> + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, > >> +} VhostDeviceStateDirection; > >> + > >> +typedef enum VhostDeviceStatePhase { > >> + /* The device (and all its vrings) is stopped */ > >> + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, > >> +} VhostDeviceStatePhase; > > vDPA has: > > > > /* Suspend a device so it does 
not process virtqueue requests anymore > > * > > * After the return of ioctl the device must preserve all the necessary state > > * (the virtqueue vring base plus the possible device specific states) that is > > * required for restoring in the future. The device must not change its > > * configuration after that point. > > */ > > #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) > > > > /* Resume a device so it can resume processing virtqueue requests > > * > > * After the return of this ioctl the device will have restored all the > > * necessary states and it is fully operational to continue processing the > > * virtqueue descriptors. > > */ > > #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) > > > > I wonder if it makes sense to import these into vhost-user so that the > > difference between kernel vhost and vhost-user is minimized. It's okay > > if one of them is ahead of the other, but it would be nice to avoid > > overlapping/duplicated functionality. > > > > (And I hope vDPA will import the device state vhost-user messages > > introduced in this series.) > > I don’t understand your suggestion. (Like, I very simply don’t > understand :)) > > These are vhost messages, right? What purpose do you have in mind for > them in vhost-user for internal migration? They’re different from the > state transfer messages, because they don’t transfer state to/from the > front-end. Also, the state transfer stuff is supposed to be distinct > from starting/stopping the device; right now, it just requires the > device to be stopped beforehand (or started only afterwards). And in > the future, new VhostDeviceStatePhase values may allow the messages to > be used on devices that aren’t stopped. > > So they seem to serve very different purposes. I can imagine using the > VDPA_{SUSPEND,RESUME} messages for external migration (what Anton is > working on), but they don’t really help with internal migration > implemented here. 
If I were to add them, they’d just be sent in > addition to the new messages added in this patch here, i.e. SUSPEND on > the source before SET_DEVICE_STATE_FD, and RESUME on the destination > after CHECK_DEVICE_STATE (we could use RESUME in place of > CHECK_DEVICE_STATE on the destination, but we can’t do that on the > source, so we still need CHECK_DEVICE_STATE). Yes, they are complementary to the device state fd message. I want to make sure pre-conditions about the device's state (running vs stopped) already take into account the vDPA SUSPEND/RESUME model. vDPA will need device state save/load in the future, for virtiofs devices, for example. This is why I think we should plan for vDPA and vhost-user to share the same interface. Also, I think the code path you're relying on (vhost_dev_stop()) doesn't work for backends that implement VHOST_USER_PROTOCOL_F_STATUS, because stopping the backend resets the device and throws away its state. SUSPEND/RESUME solve this. This looks like a more general problem, since vhost_dev_stop() is called any time the VM is paused. Maybe it needs to use SUSPEND/RESUME whenever possible. Stefan
On 13.04.23 12:14, Eugenio Perez Martin wrote: > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: >> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: >>> So-called "internal" virtio-fs migration refers to transporting the >>> back-end's (virtiofsd's) state through qemu's migration stream. To do >>> this, we need to be able to transfer virtiofsd's internal state to and >>> from virtiofsd. >>> >>> Because virtiofsd's internal state will not be too large, we believe it >>> is best to transfer it as a single binary blob after the streaming >>> phase. Because this method should be useful to other vhost-user >>> implementations, too, it is introduced as a general-purpose addition to >>> the protocol, not limited to vhost-user-fs. >>> >>> These are the additions to the protocol: >>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: >>> This feature signals support for transferring state, and is added so >>> that migration can fail early when the back-end has no support. >>> >>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe >>> over which to transfer the state. The front-end sends an FD to the >>> back-end into/from which it can write/read its state, and the back-end >>> can decide to either use it, or reply with a different FD for the >>> front-end to override the front-end's choice. >>> The front-end creates a simple pipe to transfer the state, but maybe >>> the back-end already has an FD into/from which it has to write/read >>> its state, in which case it will want to override the simple pipe. >>> Conversely, maybe in the future we find a way to have the front-end >>> get an immediate FD for the migration stream (in some cases), in which >>> case we will want to send this to the back-end instead of creating a >>> pipe. >>> Hence the negotiation: If one side has a better idea than a plain >>> pipe, we will want to use that. 
>>> >>> - CHECK_DEVICE_STATE: After the state has been transferred through the >>> pipe (the end indicated by EOF), the front-end invokes this function >>> to verify success. There is no in-band way (through the pipe) to >>> indicate failure, so we need to check explicitly. >>> >>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD >>> (which includes establishing the direction of transfer and migration >>> phase), the sending side writes its data into the pipe, and the reading >>> side reads it until it sees an EOF. Then, the front-end will check for >>> success via CHECK_DEVICE_STATE, which on the destination side includes >>> checking for integrity (i.e. errors during deserialization). >>> >>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> >>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> >>> --- >>> include/hw/virtio/vhost-backend.h | 24 +++++ >>> include/hw/virtio/vhost.h | 79 ++++++++++++++++ >>> hw/virtio/vhost-user.c | 147 ++++++++++++++++++++++++++++++ >>> hw/virtio/vhost.c | 37 ++++++++ >>> 4 files changed, 287 insertions(+) >>> >>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h >>> index ec3fbae58d..5935b32fe3 100644 >>> --- a/include/hw/virtio/vhost-backend.h >>> +++ b/include/hw/virtio/vhost-backend.h >>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { >>> VHOST_SET_CONFIG_TYPE_MIGRATION = 1, >>> } VhostSetConfigType; >>> >>> +typedef enum VhostDeviceStateDirection { >>> + /* Transfer state from back-end (device) to front-end */ >>> + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, >>> + /* Transfer state from front-end to back-end (device) */ >>> + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, >>> +} VhostDeviceStateDirection; >>> + >>> +typedef enum VhostDeviceStatePhase { >>> + /* The device (and all its vrings) is stopped */ >>> + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, >>> +} VhostDeviceStatePhase; >> vDPA has: >> >> /* Suspend a device so it does not process virtqueue requests anymore >> * >> 
* After the return of ioctl the device must preserve all the necessary state >> * (the virtqueue vring base plus the possible device specific states) that is >> * required for restoring in the future. The device must not change its >> * configuration after that point. >> */ >> #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) >> >> /* Resume a device so it can resume processing virtqueue requests >> * >> * After the return of this ioctl the device will have restored all the >> * necessary states and it is fully operational to continue processing the >> * virtqueue descriptors. >> */ >> #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) >> >> I wonder if it makes sense to import these into vhost-user so that the >> difference between kernel vhost and vhost-user is minimized. It's okay >> if one of them is ahead of the other, but it would be nice to avoid >> overlapping/duplicated functionality. >> > That's what I had in mind in the first versions. I proposed VHOST_STOP > instead of VHOST_VDPA_STOP for this very reason. Later it did change > to SUSPEND. > > Generally, in my opinion, it is better to make the interface less parametrized and > to trust the messages and their semantics. In other > words, instead of > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), send > individually the equivalent of a VHOST_VDPA_SUSPEND vhost-user command. I.e. you mean that this should simply be stateful instead of re-affirming the current state with a parameter? The problem I see is that transferring state in different phases of migration will require specialized implementations. So running SET_DEVICE_STATE_FD in a different phase will require support from the back-end. The same goes for the front-end: the exact protocol, and thus the implementation, will (probably; it is difficult to say at this point) depend on the migration phase. I would therefore prefer to have an explicit distinction in the command itself that affirms the phase we’re targeting.
On the other hand, I don’t see the parameter complicating anything. The front-end must supply it, but it will know the phase anyway, so this is easy. The back-end can just choose to ignore it, if it doesn’t feel the need to verify that the phase is what it thinks it is. > Another way to apply this is with the "direction" parameter. Maybe it > is better to split it into "set_state_fd" and "get_state_fd"? Well, it would rather be `set_state_send_fd` and `set_state_receive_fd`. We always negotiate a pipe between front-end and back-end, the question is just whether the back-end gets the receiving (load) or the sending (save) end. Technically, one can make it fully stateful and say that if the device hasn’t been started already, it’s always a LOAD, and otherwise always a SAVE. But as above, I’d prefer to keep the parameter because the implementations are different, so I’d prefer there to be a re-affirmation that front-end and back-end are in sync about what should be done. Personally, I don’t really see the advantage of having two functions instead of one function with an enum with two values. The thing about SET_DEVICE_STATE_FD is that it itself won’t differ much regardless of whether we’re loading or saving, it just negotiates the pipe – the difference is what happens after the pipe has been negotiated. So if we split the function into two, both implementations will share most of their code anyway, which makes me think it should be a single function. > In that case, reusing the ioctls as vhost-user messages would be ok. > But that puts this proposal further from the VFIO code, which uses > "migration_set_state(state)", and maybe it is better when the number > of states is high. I’m not sure what you mean (because I don’t know the VFIO code, I assume). Are you saying that using a more finely grained migration_set_state() model would conflict with the rather coarse suspend/resume? > BTW, is there any usage for *reply_fd at this moment from the backend? 
No, virtiofsd doesn’t plan to make use of it. Hanna
On 13.04.23 13:38, Stefan Hajnoczi wrote: > On Thu, 13 Apr 2023 at 05:24, Hanna Czenczek <hreitz@redhat.com> wrote: >> On 12.04.23 23:06, Stefan Hajnoczi wrote: >>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: >>>> So-called "internal" virtio-fs migration refers to transporting the >>>> back-end's (virtiofsd's) state through qemu's migration stream. To do >>>> this, we need to be able to transfer virtiofsd's internal state to and >>>> from virtiofsd. >>>> >>>> Because virtiofsd's internal state will not be too large, we believe it >>>> is best to transfer it as a single binary blob after the streaming >>>> phase. Because this method should be useful to other vhost-user >>>> implementations, too, it is introduced as a general-purpose addition to >>>> the protocol, not limited to vhost-user-fs. >>>> >>>> These are the additions to the protocol: >>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: >>>> This feature signals support for transferring state, and is added so >>>> that migration can fail early when the back-end has no support. >>>> >>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe >>>> over which to transfer the state. The front-end sends an FD to the >>>> back-end into/from which it can write/read its state, and the back-end >>>> can decide to either use it, or reply with a different FD for the >>>> front-end to override the front-end's choice. >>>> The front-end creates a simple pipe to transfer the state, but maybe >>>> the back-end already has an FD into/from which it has to write/read >>>> its state, in which case it will want to override the simple pipe. >>>> Conversely, maybe in the future we find a way to have the front-end >>>> get an immediate FD for the migration stream (in some cases), in which >>>> case we will want to send this to the back-end instead of creating a >>>> pipe. 
>>>> Hence the negotiation: If one side has a better idea than a plain >>>> pipe, we will want to use that. >>>> >>>> - CHECK_DEVICE_STATE: After the state has been transferred through the >>>> pipe (the end indicated by EOF), the front-end invokes this function >>>> to verify success. There is no in-band way (through the pipe) to >>>> indicate failure, so we need to check explicitly. >>>> >>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD >>>> (which includes establishing the direction of transfer and migration >>>> phase), the sending side writes its data into the pipe, and the reading >>>> side reads it until it sees an EOF. Then, the front-end will check for >>>> success via CHECK_DEVICE_STATE, which on the destination side includes >>>> checking for integrity (i.e. errors during deserialization). >>>> >>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> >>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> >>>> --- >>>> include/hw/virtio/vhost-backend.h | 24 +++++ >>>> include/hw/virtio/vhost.h | 79 ++++++++++++++++ >>>> hw/virtio/vhost-user.c | 147 ++++++++++++++++++++++++++++++ >>>> hw/virtio/vhost.c | 37 ++++++++ >>>> 4 files changed, 287 insertions(+) >>>> >>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h >>>> index ec3fbae58d..5935b32fe3 100644 >>>> --- a/include/hw/virtio/vhost-backend.h >>>> +++ b/include/hw/virtio/vhost-backend.h >>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { >>>> VHOST_SET_CONFIG_TYPE_MIGRATION = 1, >>>> } VhostSetConfigType; >>>> >>>> +typedef enum VhostDeviceStateDirection { >>>> + /* Transfer state from back-end (device) to front-end */ >>>> + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, >>>> + /* Transfer state from front-end to back-end (device) */ >>>> + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, >>>> +} VhostDeviceStateDirection; >>>> + >>>> +typedef enum VhostDeviceStatePhase { >>>> + /* The device (and all its vrings) is stopped */ >>>> + 
VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, >>>> +} VhostDeviceStatePhase; >>> vDPA has: >>> >>> /* Suspend a device so it does not process virtqueue requests anymore >>> * >>> * After the return of ioctl the device must preserve all the necessary state >>> * (the virtqueue vring base plus the possible device specific states) that is >>> * required for restoring in the future. The device must not change its >>> * configuration after that point. >>> */ >>> #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) >>> >>> /* Resume a device so it can resume processing virtqueue requests >>> * >>> * After the return of this ioctl the device will have restored all the >>> * necessary states and it is fully operational to continue processing the >>> * virtqueue descriptors. >>> */ >>> #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) >>> >>> I wonder if it makes sense to import these into vhost-user so that the >>> difference between kernel vhost and vhost-user is minimized. It's okay >>> if one of them is ahead of the other, but it would be nice to avoid >>> overlapping/duplicated functionality. >>> >>> (And I hope vDPA will import the device state vhost-user messages >>> introduced in this series.) >> I don’t understand your suggestion. (Like, I very simply don’t >> understand :)) >> >> These are vhost messages, right? What purpose do you have in mind for >> them in vhost-user for internal migration? They’re different from the >> state transfer messages, because they don’t transfer state to/from the >> front-end. Also, the state transfer stuff is supposed to be distinct >> from starting/stopping the device; right now, it just requires the >> device to be stopped beforehand (or started only afterwards). And in >> the future, new VhostDeviceStatePhase values may allow the messages to >> be used on devices that aren’t stopped. >> >> So they seem to serve very different purposes. 
I can imagine using the >> VDPA_{SUSPEND,RESUME} messages for external migration (what Anton is >> working on), but they don’t really help with internal migration >> implemented here. If I were to add them, they’d just be sent in >> addition to the new messages added in this patch here, i.e. SUSPEND on >> the source before SET_DEVICE_STATE_FD, and RESUME on the destination >> after CHECK_DEVICE_STATE (we could use RESUME in place of >> CHECK_DEVICE_STATE on the destination, but we can’t do that on the >> source, so we still need CHECK_DEVICE_STATE). > Yes, they are complementary to the device state fd message. I want to > make sure pre-conditions about the device's state (running vs stopped) > already take into account the vDPA SUSPEND/RESUME model. > > vDPA will need device state save/load in the future. For virtiofs > devices, for example. This is why I think we should plan for vDPA and > vhost-user to share the same interface. While the paragraph below is more important, I don’t feel like this would be important right now. It’s clear that SUSPEND must come before transferring any state, and that RESUME must come after transferring state. I don’t think we need to clarify this now, it’d be obvious when implementing SUSPEND/RESUME. > Also, I think the code path you're relying on (vhost_dev_stop()) on > doesn't work for backends that implement VHOST_USER_PROTOCOL_F_STATUS > because stopping the backend resets the device and throws away its > state. SUSPEND/RESUME solve this. This looks like a more general > problem since vhost_dev_stop() is called any time the VM is paused. > Maybe it needs to use SUSPEND/RESUME whenever possible. That’s a problem. Quite a problem, to be honest, because this sounds rather complicated with honestly absolutely no practical benefit right now. Would you require SUSPEND/RESUME for state transfer even if the back-end does not implement GET/SET_STATUS? Because then this would also lead to more complexity in virtiofsd. 
Basically, what I’m hearing is that I need to implement a different feature that has no practical impact right now, and also fix bugs around it along the way... (Not that I have any better suggestion.) Hanna
On Thu, 13 Apr 2023 at 13:55, Hanna Czenczek <hreitz@redhat.com> wrote: > > On 13.04.23 13:38, Stefan Hajnoczi wrote: > > On Thu, 13 Apr 2023 at 05:24, Hanna Czenczek <hreitz@redhat.com> wrote: > >> On 12.04.23 23:06, Stefan Hajnoczi wrote: > >>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: > >>>> So-called "internal" virtio-fs migration refers to transporting the > >>>> back-end's (virtiofsd's) state through qemu's migration stream. To do > >>>> this, we need to be able to transfer virtiofsd's internal state to and > >>>> from virtiofsd. > >>>> > >>>> Because virtiofsd's internal state will not be too large, we believe it > >>>> is best to transfer it as a single binary blob after the streaming > >>>> phase. Because this method should be useful to other vhost-user > >>>> implementations, too, it is introduced as a general-purpose addition to > >>>> the protocol, not limited to vhost-user-fs. > >>>> > >>>> These are the additions to the protocol: > >>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: > >>>> This feature signals support for transferring state, and is added so > >>>> that migration can fail early when the back-end has no support. > >>>> > >>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe > >>>> over which to transfer the state. The front-end sends an FD to the > >>>> back-end into/from which it can write/read its state, and the back-end > >>>> can decide to either use it, or reply with a different FD for the > >>>> front-end to override the front-end's choice. > >>>> The front-end creates a simple pipe to transfer the state, but maybe > >>>> the back-end already has an FD into/from which it has to write/read > >>>> its state, in which case it will want to override the simple pipe. 
> >>>> Conversely, maybe in the future we find a way to have the front-end > >>>> get an immediate FD for the migration stream (in some cases), in which > >>>> case we will want to send this to the back-end instead of creating a > >>>> pipe. > >>>> Hence the negotiation: If one side has a better idea than a plain > >>>> pipe, we will want to use that. > >>>> > >>>> - CHECK_DEVICE_STATE: After the state has been transferred through the > >>>> pipe (the end indicated by EOF), the front-end invokes this function > >>>> to verify success. There is no in-band way (through the pipe) to > >>>> indicate failure, so we need to check explicitly. > >>>> > >>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD > >>>> (which includes establishing the direction of transfer and migration > >>>> phase), the sending side writes its data into the pipe, and the reading > >>>> side reads it until it sees an EOF. Then, the front-end will check for > >>>> success via CHECK_DEVICE_STATE, which on the destination side includes > >>>> checking for integrity (i.e. errors during deserialization). 
> >>>> > >>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > >>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > >>>> --- > >>>> include/hw/virtio/vhost-backend.h | 24 +++++ > >>>> include/hw/virtio/vhost.h | 79 ++++++++++++++++ > >>>> hw/virtio/vhost-user.c | 147 ++++++++++++++++++++++++++++++ > >>>> hw/virtio/vhost.c | 37 ++++++++ > >>>> 4 files changed, 287 insertions(+) > >>>> > >>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h > >>>> index ec3fbae58d..5935b32fe3 100644 > >>>> --- a/include/hw/virtio/vhost-backend.h > >>>> +++ b/include/hw/virtio/vhost-backend.h > >>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { > >>>> VHOST_SET_CONFIG_TYPE_MIGRATION = 1, > >>>> } VhostSetConfigType; > >>>> > >>>> +typedef enum VhostDeviceStateDirection { > >>>> + /* Transfer state from back-end (device) to front-end */ > >>>> + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, > >>>> + /* Transfer state from front-end to back-end (device) */ > >>>> + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, > >>>> +} VhostDeviceStateDirection; > >>>> + > >>>> +typedef enum VhostDeviceStatePhase { > >>>> + /* The device (and all its vrings) is stopped */ > >>>> + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, > >>>> +} VhostDeviceStatePhase; > >>> vDPA has: > >>> > >>> /* Suspend a device so it does not process virtqueue requests anymore > >>> * > >>> * After the return of ioctl the device must preserve all the necessary state > >>> * (the virtqueue vring base plus the possible device specific states) that is > >>> * required for restoring in the future. The device must not change its > >>> * configuration after that point. 
> >>> */ > >>> #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) > >>> > >>> /* Resume a device so it can resume processing virtqueue requests > >>> * > >>> * After the return of this ioctl the device will have restored all the > >>> * necessary states and it is fully operational to continue processing the > >>> * virtqueue descriptors. > >>> */ > >>> #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) > >>> > >>> I wonder if it makes sense to import these into vhost-user so that the > >>> difference between kernel vhost and vhost-user is minimized. It's okay > >>> if one of them is ahead of the other, but it would be nice to avoid > >>> overlapping/duplicated functionality. > >>> > >>> (And I hope vDPA will import the device state vhost-user messages > >>> introduced in this series.) > >> I don’t understand your suggestion. (Like, I very simply don’t > >> understand :)) > >> > >> These are vhost messages, right? What purpose do you have in mind for > >> them in vhost-user for internal migration? They’re different from the > >> state transfer messages, because they don’t transfer state to/from the > >> front-end. Also, the state transfer stuff is supposed to be distinct > >> from starting/stopping the device; right now, it just requires the > >> device to be stopped beforehand (or started only afterwards). And in > >> the future, new VhostDeviceStatePhase values may allow the messages to > >> be used on devices that aren’t stopped. > >> > >> So they seem to serve very different purposes. I can imagine using the > >> VDPA_{SUSPEND,RESUME} messages for external migration (what Anton is > >> working on), but they don’t really help with internal migration > >> implemented here. If I were to add them, they’d just be sent in > >> addition to the new messages added in this patch here, i.e. 
SUSPEND on > >> the source before SET_DEVICE_STATE_FD, and RESUME on the destination > >> after CHECK_DEVICE_STATE (we could use RESUME in place of > >> CHECK_DEVICE_STATE on the destination, but we can’t do that on the > >> source, so we still need CHECK_DEVICE_STATE). > > Yes, they are complementary to the device state fd message. I want to > > make sure pre-conditions about the device's state (running vs stopped) > > already take into account the vDPA SUSPEND/RESUME model. > > > > vDPA will need device state save/load in the future. For virtiofs > > devices, for example. This is why I think we should plan for vDPA and > > vhost-user to share the same interface. > > While the paragraph below is more important, I don’t feel like this > would be important right now. It’s clear that SUSPEND must come before > transferring any state, and that RESUME must come after transferring > state. I don’t think we need to clarify this now, it’d be obvious when > implementing SUSPEND/RESUME. > > > Also, I think the code path you're relying on (vhost_dev_stop()) on > > doesn't work for backends that implement VHOST_USER_PROTOCOL_F_STATUS > > because stopping the backend resets the device and throws away its > > state. SUSPEND/RESUME solve this. This looks like a more general > > problem since vhost_dev_stop() is called any time the VM is paused. > > Maybe it needs to use SUSPEND/RESUME whenever possible. > > That’s a problem. Quite a problem, to be honest, because this sounds > rather complicated with honestly absolutely no practical benefit right > now. > > Would you require SUSPEND/RESUME for state transfer even if the back-end > does not implement GET/SET_STATUS? Because then this would also lead to > more complexity in virtiofsd. > > Basically, what I’m hearing is that I need to implement a different > feature that has no practical impact right now, and also fix bugs around > it along the way... Eugenio's input regarding the design of the vhost-user messages is important. 
That way we know it can be ported to vDPA later. There is some extra discussion and work here, but only on the design of the interface. You shouldn't need to implement extra unused stuff. Whoever needs it can do that later based on a design that left room to eventually do iterative migration for vhost-user and vDPA (comparable to VFIO's migration interface). Since both vDPA (vhost kernel) and vhost-user are stable APIs, it will be hard to make significant design changes later without breaking all existing implementations. That's why I think we should think ahead. Stefan
On Thu, Apr 13, 2023 at 7:55 PM Hanna Czenczek <hreitz@redhat.com> wrote: > > On 13.04.23 13:38, Stefan Hajnoczi wrote: > > On Thu, 13 Apr 2023 at 05:24, Hanna Czenczek <hreitz@redhat.com> wrote: > >> On 12.04.23 23:06, Stefan Hajnoczi wrote: > >>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: > >>>> So-called "internal" virtio-fs migration refers to transporting the > >>>> back-end's (virtiofsd's) state through qemu's migration stream. To do > >>>> this, we need to be able to transfer virtiofsd's internal state to and > >>>> from virtiofsd. > >>>> > >>>> Because virtiofsd's internal state will not be too large, we believe it > >>>> is best to transfer it as a single binary blob after the streaming > >>>> phase. Because this method should be useful to other vhost-user > >>>> implementations, too, it is introduced as a general-purpose addition to > >>>> the protocol, not limited to vhost-user-fs. > >>>> > >>>> These are the additions to the protocol: > >>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: > >>>> This feature signals support for transferring state, and is added so > >>>> that migration can fail early when the back-end has no support. > >>>> > >>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe > >>>> over which to transfer the state. The front-end sends an FD to the > >>>> back-end into/from which it can write/read its state, and the back-end > >>>> can decide to either use it, or reply with a different FD for the > >>>> front-end to override the front-end's choice. > >>>> The front-end creates a simple pipe to transfer the state, but maybe > >>>> the back-end already has an FD into/from which it has to write/read > >>>> its state, in which case it will want to override the simple pipe. 
> >>>> Conversely, maybe in the future we find a way to have the front-end > >>>> get an immediate FD for the migration stream (in some cases), in which > >>>> case we will want to send this to the back-end instead of creating a > >>>> pipe. > >>>> Hence the negotiation: If one side has a better idea than a plain > >>>> pipe, we will want to use that. > >>>> > >>>> - CHECK_DEVICE_STATE: After the state has been transferred through the > >>>> pipe (the end indicated by EOF), the front-end invokes this function > >>>> to verify success. There is no in-band way (through the pipe) to > >>>> indicate failure, so we need to check explicitly. > >>>> > >>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD > >>>> (which includes establishing the direction of transfer and migration > >>>> phase), the sending side writes its data into the pipe, and the reading > >>>> side reads it until it sees an EOF. Then, the front-end will check for > >>>> success via CHECK_DEVICE_STATE, which on the destination side includes > >>>> checking for integrity (i.e. errors during deserialization). 
> >>>> > >>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > >>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > >>>> --- > >>>> include/hw/virtio/vhost-backend.h | 24 +++++ > >>>> include/hw/virtio/vhost.h | 79 ++++++++++++++++ > >>>> hw/virtio/vhost-user.c | 147 ++++++++++++++++++++++++++++++ > >>>> hw/virtio/vhost.c | 37 ++++++++ > >>>> 4 files changed, 287 insertions(+) > >>>> > >>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h > >>>> index ec3fbae58d..5935b32fe3 100644 > >>>> --- a/include/hw/virtio/vhost-backend.h > >>>> +++ b/include/hw/virtio/vhost-backend.h > >>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { > >>>> VHOST_SET_CONFIG_TYPE_MIGRATION = 1, > >>>> } VhostSetConfigType; > >>>> > >>>> +typedef enum VhostDeviceStateDirection { > >>>> + /* Transfer state from back-end (device) to front-end */ > >>>> + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, > >>>> + /* Transfer state from front-end to back-end (device) */ > >>>> + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, > >>>> +} VhostDeviceStateDirection; > >>>> + > >>>> +typedef enum VhostDeviceStatePhase { > >>>> + /* The device (and all its vrings) is stopped */ > >>>> + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, > >>>> +} VhostDeviceStatePhase; > >>> vDPA has: > >>> > >>> /* Suspend a device so it does not process virtqueue requests anymore > >>> * > >>> * After the return of ioctl the device must preserve all the necessary state > >>> * (the virtqueue vring base plus the possible device specific states) that is > >>> * required for restoring in the future. The device must not change its > >>> * configuration after that point. 
> >>> */ > >>> #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) > >>> > >>> /* Resume a device so it can resume processing virtqueue requests > >>> * > >>> * After the return of this ioctl the device will have restored all the > >>> * necessary states and it is fully operational to continue processing the > >>> * virtqueue descriptors. > >>> */ > >>> #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) > >>> > >>> I wonder if it makes sense to import these into vhost-user so that the > >>> difference between kernel vhost and vhost-user is minimized. It's okay > >>> if one of them is ahead of the other, but it would be nice to avoid > >>> overlapping/duplicated functionality. > >>> > >>> (And I hope vDPA will import the device state vhost-user messages > >>> introduced in this series.) > >> I don’t understand your suggestion. (Like, I very simply don’t > >> understand :)) > >> > >> These are vhost messages, right? What purpose do you have in mind for > >> them in vhost-user for internal migration? They’re different from the > >> state transfer messages, because they don’t transfer state to/from the > >> front-end. Also, the state transfer stuff is supposed to be distinct > >> from starting/stopping the device; right now, it just requires the > >> device to be stopped beforehand (or started only afterwards). And in > >> the future, new VhostDeviceStatePhase values may allow the messages to > >> be used on devices that aren’t stopped. > >> > >> So they seem to serve very different purposes. I can imagine using the > >> VDPA_{SUSPEND,RESUME} messages for external migration (what Anton is > >> working on), but they don’t really help with internal migration > >> implemented here. If I were to add them, they’d just be sent in > >> addition to the new messages added in this patch here, i.e. 
SUSPEND on > >> the source before SET_DEVICE_STATE_FD, and RESUME on the destination > >> after CHECK_DEVICE_STATE (we could use RESUME in place of > >> CHECK_DEVICE_STATE on the destination, but we can’t do that on the > >> source, so we still need CHECK_DEVICE_STATE). > > Yes, they are complementary to the device state fd message. I want to > > make sure pre-conditions about the device's state (running vs stopped) > > already take into account the vDPA SUSPEND/RESUME model. > > > > vDPA will need device state save/load in the future. For virtiofs > > devices, for example. This is why I think we should plan for vDPA and > > vhost-user to share the same interface. > > While the paragraph below is more important, I don’t feel like this > would be important right now. It’s clear that SUSPEND must come before > transferring any state, and that RESUME must come after transferring > state. I don’t think we need to clarify this now, it’d be obvious when > implementing SUSPEND/RESUME. > > > Also, I think the code path you're relying on (vhost_dev_stop()) > > doesn't work for backends that implement VHOST_USER_PROTOCOL_F_STATUS > > because stopping the backend resets the device and throws away its > > state. SUSPEND/RESUME solve this. This looks like a more general > > problem since vhost_dev_stop() is called any time the VM is paused. > > Maybe it needs to use SUSPEND/RESUME whenever possible. > > That’s a problem. Quite a problem, to be honest, because this sounds > rather complicated with honestly absolutely no practical benefit right > now. > > Would you require SUSPEND/RESUME for state transfer even if the back-end > does not implement GET/SET_STATUS? Because then this would also lead to > more complexity in virtiofsd. > At this moment the vhost-user net in DPDK suspends at VHOST_GET_VRING_BASE. Not the same case though, as only the vq indexes / wrap bits are transferred here. 
Vhost-vdpa implements the suspend call so it does not need to trust VHOST_GET_VRING_BASE to be the last vring call done. Since virtiofsd is using vhost-user, maybe it does not actually need to implement it. > Basically, what I’m hearing is that I need to implement a different > feature that has no practical impact right now, and also fix bugs around > it along the way... > To fix this properly requires iterative device migration in qemu as far as I know, instead of using VMStates [1]. This way the state is requested from virtiofsd before the device reset. What does virtiofsd do when the state is totally sent? Does it keep processing requests and generating new state, or is it only a one-shot operation that will suspend the daemon? If it is the latter, I think it still can be done in one shot at the end, always indicating "no more state" at save_live_pending and sending all the state at save_live_complete_precopy. Does that make sense to you? Thanks! [1] https://qemu.readthedocs.io/en/latest/devel/migration.html#iterative-device-migration
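[Editorial sketch: the one-shot transfer model discussed in this thread — the sending side writes the whole state blob into the pipe and closes it, the receiving side reads until EOF, and success is only reported out-of-band via CHECK_DEVICE_STATE — can be illustrated roughly as below. This is not code from the patch; `send_state()` and `recv_state()` are invented names.]

```c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Sending side: write the whole state blob, then close the fd --
 * the resulting EOF is what tells the reader the transfer is done. */
static int send_state(int fd, const void *state, size_t len)
{
    const char *p = state;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    close(fd);
    return 0;
}

/* Receiving side: read until EOF, growing the buffer as needed.
 * Returns the number of bytes received, or -1 on error. */
static ssize_t recv_state(int fd, char **out)
{
    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);
    if (!buf) {
        return -1;
    }
    for (;;) {
        ssize_t n = read(fd, buf + len, cap - len);
        if (n < 0) {
            free(buf);
            return -1;
        }
        if (n == 0) {   /* EOF: sender closed its end, transfer complete */
            break;
        }
        len += (size_t)n;
        if (len == cap) {
            cap *= 2;
            char *tmp = realloc(buf, cap);
            if (!tmp) {
                free(buf);
                return -1;
            }
            buf = tmp;
        }
    }
    *out = buf;
    return (ssize_t)len;
}
```

Note that there is deliberately no in-band framing or error signalling here: EOF is the only terminator, which is exactly why the explicit CHECK_DEVICE_STATE round-trip is needed afterwards.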
On Thu, Apr 13, 2023 at 07:31:57PM +0200, Hanna Czenczek wrote: > On 13.04.23 12:14, Eugenio Perez Martin wrote: > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: > > > > So-called "internal" virtio-fs migration refers to transporting the > > > > back-end's (virtiofsd's) state through qemu's migration stream. To do > > > > this, we need to be able to transfer virtiofsd's internal state to and > > > > from virtiofsd. > > > > > > > > Because virtiofsd's internal state will not be too large, we believe it > > > > is best to transfer it as a single binary blob after the streaming > > > > phase. Because this method should be useful to other vhost-user > > > > implementations, too, it is introduced as a general-purpose addition to > > > > the protocol, not limited to vhost-user-fs. > > > > > > > > These are the additions to the protocol: > > > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: > > > > This feature signals support for transferring state, and is added so > > > > that migration can fail early when the back-end has no support. > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe > > > > over which to transfer the state. The front-end sends an FD to the > > > > back-end into/from which it can write/read its state, and the back-end > > > > can decide to either use it, or reply with a different FD for the > > > > front-end to override the front-end's choice. > > > > The front-end creates a simple pipe to transfer the state, but maybe > > > > the back-end already has an FD into/from which it has to write/read > > > > its state, in which case it will want to override the simple pipe. 
> > > > Conversely, maybe in the future we find a way to have the front-end > > > > get an immediate FD for the migration stream (in some cases), in which > > > > case we will want to send this to the back-end instead of creating a > > > > pipe. > > > > Hence the negotiation: If one side has a better idea than a plain > > > > pipe, we will want to use that. > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred through the > > > > pipe (the end indicated by EOF), the front-end invokes this function > > > > to verify success. There is no in-band way (through the pipe) to > > > > indicate failure, so we need to check explicitly. > > > > > > > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD > > > > (which includes establishing the direction of transfer and migration > > > > phase), the sending side writes its data into the pipe, and the reading > > > > side reads it until it sees an EOF. Then, the front-end will check for > > > > success via CHECK_DEVICE_STATE, which on the destination side includes > > > > checking for integrity (i.e. errors during deserialization). 
> > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > > > > --- > > > > include/hw/virtio/vhost-backend.h | 24 +++++ > > > > include/hw/virtio/vhost.h | 79 ++++++++++++++++ > > > > hw/virtio/vhost-user.c | 147 ++++++++++++++++++++++++++++++ > > > > hw/virtio/vhost.c | 37 ++++++++ > > > > 4 files changed, 287 insertions(+) > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h > > > > index ec3fbae58d..5935b32fe3 100644 > > > > --- a/include/hw/virtio/vhost-backend.h > > > > +++ b/include/hw/virtio/vhost-backend.h > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { > > > > VHOST_SET_CONFIG_TYPE_MIGRATION = 1, > > > > } VhostSetConfigType; > > > > > > > > +typedef enum VhostDeviceStateDirection { > > > > + /* Transfer state from back-end (device) to front-end */ > > > > + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, > > > > + /* Transfer state from front-end to back-end (device) */ > > > > + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, > > > > +} VhostDeviceStateDirection; > > > > + > > > > +typedef enum VhostDeviceStatePhase { > > > > + /* The device (and all its vrings) is stopped */ > > > > + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, > > > > +} VhostDeviceStatePhase; > > > vDPA has: > > > > > > /* Suspend a device so it does not process virtqueue requests anymore > > > * > > > * After the return of ioctl the device must preserve all the necessary state > > > * (the virtqueue vring base plus the possible device specific states) that is > > > * required for restoring in the future. The device must not change its > > > * configuration after that point. 
> > > */ > > > #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) > > > > > > /* Resume a device so it can resume processing virtqueue requests > > > * > > > * After the return of this ioctl the device will have restored all the > > > * necessary states and it is fully operational to continue processing the > > > * virtqueue descriptors. > > > */ > > > #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) > > > > > > I wonder if it makes sense to import these into vhost-user so that the > > > difference between kernel vhost and vhost-user is minimized. It's okay > > > if one of them is ahead of the other, but it would be nice to avoid > > > overlapping/duplicated functionality. > > > > > That's what I had in mind in the first versions. I proposed VHOST_STOP > > instead of VHOST_VDPA_STOP for this very reason. Later it did change > > to SUSPEND. > > > > Generally it is better if we make the interface less parametrized and > > we trust in the messages and its semantics in my opinion. In other > > words, instead of > > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), send > > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user command. > > I.e. you mean that this should simply be stateful instead of > re-affirming the current state with a parameter? > > The problem I see is that transferring states in different phases of > migration will require specialized implementations. So running > SET_DEVICE_STATE_FD in a different phase will require support from the > back-end. Same in the front-end, the exact protocol and thus > implementation will (probably, difficult to say at this point) depend on > the migration phase. I would therefore prefer to have an explicit > distinction in the command itself that affirms the phase we’re > targeting. > > On the other hand, I don’t see the parameter complicating anything. The > front-end must supply it, but it will know the phase anyway, so this is > easy. 
The back-end can just choose to ignore it, if it doesn’t feel the > need to verify that the phase is what it thinks it is. > > > Another way to apply this is with the "direction" parameter. Maybe it > > is better to split it into "set_state_fd" and "get_state_fd"? > > Well, it would rather be `set_state_send_fd` and `set_state_receive_fd`. > We always negotiate a pipe between front-end and back-end, the question > is just whether the back-end gets the receiving (load) or the sending > (save) end. > > Technically, one can make it fully stateful and say that if the device > hasn’t been started already, it’s always a LOAD, and otherwise always a > SAVE. But as above, I’d prefer to keep the parameter because the > implementations are different, so I’d prefer there to be a > re-affirmation that front-end and back-end are in sync about what should > be done. > > Personally, I don’t really see the advantage of having two functions > instead of one function with an enum with two values. The thing about > SET_DEVICE_STATE_FD is that it itself won’t differ much regardless of > whether we’re loading or saving, it just negotiates the pipe – the > difference is what happens after the pipe has been negotiated. So if we > split the function into two, both implementations will share most of > their code anyway, which makes me think it should be a single function. I also don't really see an advantage to defining separate messages as long as SET_DEVICE_STATE_FD just sets up the pipe. If there are other arguments that differ depending on the state/direction, then it's nicer to have separate messages so that argument type remains simple (not a union). This brings to mind how iterative migration will work. The interface for iterative migration is basically the same as non-iterative migration plus a method to query the number of bytes remaining. When the number of bytes falls below a threshold, the vCPUs are stopped and the remainder of the data is read. 
Some details from VFIO migration: - The VMM must explicitly change the state when transitioning from iterative to non-iterative migration, but the data transfer fd remains the same. - The state of the device (running, stopped, resuming, etc) doesn't change asynchronously, it's always driven by the VMM. However, setting the state can fail and then the new state may be an error state. Mapping this to SET_DEVICE_STATE_FD: - VhostDeviceStatePhase is extended with VHOST_TRANSFER_STATE_PHASE_RUNNING = 1 for iterative migration. The frontend sends SET_DEVICE_STATE_FD again with VHOST_TRANSFER_STATE_PHASE_STOPPED when entering non-iterative migration, and sends the iterative fd from the previous SET_DEVICE_STATE_FD call to the backend. The backend may reply with another fd, if necessary. If the backend changes the fd, then the contents of the previous fd must be fully read and transferred before the contents of the new fd are migrated. (Maybe this is too complex and we should forbid changing the fd when going from RUNNING -> STOPPED.) - CHECK_DEVICE_STATE can be extended to report the number of bytes remaining. The semantics change so that CHECK_DEVICE_STATE can be called while the VMM is still reading from the fd. It becomes: enum CheckDeviceStateResult { Saving(bytes_remaining : usize), Failed(error_code : u64), } > > In that case, reusing the ioctls as vhost-user messages would be ok. > > But that puts this proposal further from the VFIO code, which uses > > "migration_set_state(state)", and maybe it is better when the number > > of states is high. > > I’m not sure what you mean (because I don’t know the VFIO code, I > assume). Are you saying that using a more finely grained > migration_set_state() model would conflict with the rather coarse > suspend/resume? 
I think VFIO is already different because vDPA has SUSPEND/RESUME, whereas VFIO controls the state via VFIO_DEVICE_FEATURE VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE (which is similar but not identical to SET_DEVICE_STATE_FD in this patch series). Stefan
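[Editorial sketch: Stefan's proposed extension of CHECK_DEVICE_STATE could be carried on the wire as a simple tagged pair. This is purely an illustration of the idea with invented names and layout — the actual encoding would have to be defined in the vhost-user specification.]

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical wire encoding for the extended CHECK_DEVICE_STATE reply
 * sketched in the thread (invented layout, not part of any spec). */
typedef enum {
    DEVICE_STATE_RESULT_SAVING = 0, /* value = bytes remaining */
    DEVICE_STATE_RESULT_FAILED = 1, /* value = error code */
} DeviceStateResultTag;

typedef struct {
    uint32_t tag;   /* DeviceStateResultTag */
    uint64_t value; /* meaning depends on tag */
} DeviceStateResult;

/* Under this encoding, the transfer has fully completed once the
 * back-end reports "saving" with zero bytes remaining. */
static bool device_state_transfer_done(DeviceStateResult r)
{
    return r.tag == DEVICE_STATE_RESULT_SAVING && r.value == 0;
}
```

The advantage of a tagged value over separate success/failure messages is that the front-end can poll the same reply both during iterative saving (to read `bytes_remaining`) and at the end (to confirm completion or learn the error code).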
On Fri, Apr 14, 2023 at 05:17:02PM +0200, Eugenio Perez Martin wrote: > On Thu, Apr 13, 2023 at 7:55 PM Hanna Czenczek <hreitz@redhat.com> wrote: > > > > On 13.04.23 13:38, Stefan Hajnoczi wrote: > > > On Thu, 13 Apr 2023 at 05:24, Hanna Czenczek <hreitz@redhat.com> wrote: > > >> On 12.04.23 23:06, Stefan Hajnoczi wrote: > > >>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: > > >>>> So-called "internal" virtio-fs migration refers to transporting the > > >>>> back-end's (virtiofsd's) state through qemu's migration stream. To do > > >>>> this, we need to be able to transfer virtiofsd's internal state to and > > >>>> from virtiofsd. > > >>>> > > >>>> Because virtiofsd's internal state will not be too large, we believe it > > >>>> is best to transfer it as a single binary blob after the streaming > > >>>> phase. Because this method should be useful to other vhost-user > > >>>> implementations, too, it is introduced as a general-purpose addition to > > >>>> the protocol, not limited to vhost-user-fs. > > >>>> > > >>>> These are the additions to the protocol: > > >>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: > > >>>> This feature signals support for transferring state, and is added so > > >>>> that migration can fail early when the back-end has no support. > > >>>> > > >>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe > > >>>> over which to transfer the state. The front-end sends an FD to the > > >>>> back-end into/from which it can write/read its state, and the back-end > > >>>> can decide to either use it, or reply with a different FD for the > > >>>> front-end to override the front-end's choice. > > >>>> The front-end creates a simple pipe to transfer the state, but maybe > > >>>> the back-end already has an FD into/from which it has to write/read > > >>>> its state, in which case it will want to override the simple pipe. 
> > >>>> Conversely, maybe in the future we find a way to have the front-end > > >>>> get an immediate FD for the migration stream (in some cases), in which > > >>>> case we will want to send this to the back-end instead of creating a > > >>>> pipe. > > >>>> Hence the negotiation: If one side has a better idea than a plain > > >>>> pipe, we will want to use that. > > >>>> > > >>>> - CHECK_DEVICE_STATE: After the state has been transferred through the > > >>>> pipe (the end indicated by EOF), the front-end invokes this function > > >>>> to verify success. There is no in-band way (through the pipe) to > > >>>> indicate failure, so we need to check explicitly. > > >>>> > > >>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD > > >>>> (which includes establishing the direction of transfer and migration > > >>>> phase), the sending side writes its data into the pipe, and the reading > > >>>> side reads it until it sees an EOF. Then, the front-end will check for > > >>>> success via CHECK_DEVICE_STATE, which on the destination side includes > > >>>> checking for integrity (i.e. errors during deserialization). 
> > >>>> > > >>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > > >>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > > >>>> --- > > >>>> include/hw/virtio/vhost-backend.h | 24 +++++ > > >>>> include/hw/virtio/vhost.h | 79 ++++++++++++++++ > > >>>> hw/virtio/vhost-user.c | 147 ++++++++++++++++++++++++++++++ > > >>>> hw/virtio/vhost.c | 37 ++++++++ > > >>>> 4 files changed, 287 insertions(+) > > >>>> > > >>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h > > >>>> index ec3fbae58d..5935b32fe3 100644 > > >>>> --- a/include/hw/virtio/vhost-backend.h > > >>>> +++ b/include/hw/virtio/vhost-backend.h > > >>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { > > >>>> VHOST_SET_CONFIG_TYPE_MIGRATION = 1, > > >>>> } VhostSetConfigType; > > >>>> > > >>>> +typedef enum VhostDeviceStateDirection { > > >>>> + /* Transfer state from back-end (device) to front-end */ > > >>>> + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, > > >>>> + /* Transfer state from front-end to back-end (device) */ > > >>>> + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, > > >>>> +} VhostDeviceStateDirection; > > >>>> + > > >>>> +typedef enum VhostDeviceStatePhase { > > >>>> + /* The device (and all its vrings) is stopped */ > > >>>> + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, > > >>>> +} VhostDeviceStatePhase; > > >>> vDPA has: > > >>> > > >>> /* Suspend a device so it does not process virtqueue requests anymore > > >>> * > > >>> * After the return of ioctl the device must preserve all the necessary state > > >>> * (the virtqueue vring base plus the possible device specific states) that is > > >>> * required for restoring in the future. The device must not change its > > >>> * configuration after that point. 
> > >>> */ > > >>> #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) > > >>> > > >>> /* Resume a device so it can resume processing virtqueue requests > > >>> * > > >>> * After the return of this ioctl the device will have restored all the > > >>> * necessary states and it is fully operational to continue processing the > > >>> * virtqueue descriptors. > > >>> */ > > >>> #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) > > >>> > > >>> I wonder if it makes sense to import these into vhost-user so that the > > >>> difference between kernel vhost and vhost-user is minimized. It's okay > > >>> if one of them is ahead of the other, but it would be nice to avoid > > >>> overlapping/duplicated functionality. > > >>> > > >>> (And I hope vDPA will import the device state vhost-user messages > > >>> introduced in this series.) > > >> I don’t understand your suggestion. (Like, I very simply don’t > > >> understand :)) > > >> > > >> These are vhost messages, right? What purpose do you have in mind for > > >> them in vhost-user for internal migration? They’re different from the > > >> state transfer messages, because they don’t transfer state to/from the > > >> front-end. Also, the state transfer stuff is supposed to be distinct > > >> from starting/stopping the device; right now, it just requires the > > >> device to be stopped beforehand (or started only afterwards). And in > > >> the future, new VhostDeviceStatePhase values may allow the messages to > > >> be used on devices that aren’t stopped. > > >> > > >> So they seem to serve very different purposes. I can imagine using the > > >> VDPA_{SUSPEND,RESUME} messages for external migration (what Anton is > > >> working on), but they don’t really help with internal migration > > >> implemented here. If I were to add them, they’d just be sent in > > >> addition to the new messages added in this patch here, i.e. 
SUSPEND on > > >> the source before SET_DEVICE_STATE_FD, and RESUME on the destination > > >> after CHECK_DEVICE_STATE (we could use RESUME in place of > > >> CHECK_DEVICE_STATE on the destination, but we can’t do that on the > > >> source, so we still need CHECK_DEVICE_STATE). > > > Yes, they are complementary to the device state fd message. I want to > > > make sure pre-conditions about the device's state (running vs stopped) > > > already take into account the vDPA SUSPEND/RESUME model. > > > > > > vDPA will need device state save/load in the future. For virtiofs > > > devices, for example. This is why I think we should plan for vDPA and > > > vhost-user to share the same interface. > > > > While the paragraph below is more important, I don’t feel like this > > would be important right now. It’s clear that SUSPEND must come before > > transferring any state, and that RESUME must come after transferring > > state. I don’t think we need to clarify this now, it’d be obvious when > > implementing SUSPEND/RESUME. > > > > > Also, I think the code path you're relying on (vhost_dev_stop()) > > > doesn't work for backends that implement VHOST_USER_PROTOCOL_F_STATUS > > > because stopping the backend resets the device and throws away its > > > state. SUSPEND/RESUME solve this. This looks like a more general > > > problem since vhost_dev_stop() is called any time the VM is paused. > > > Maybe it needs to use SUSPEND/RESUME whenever possible. > > > > That’s a problem. Quite a problem, to be honest, because this sounds > > rather complicated with honestly absolutely no practical benefit right > > now. > > > > Would you require SUSPEND/RESUME for state transfer even if the back-end > > does not implement GET/SET_STATUS? Because then this would also lead to > > more complexity in virtiofsd. > > > > At this moment the vhost-user net in DPDK suspends at > VHOST_GET_VRING_BASE. Not the same case though, as only the vq > indexes / wrap bits are transferred here. 
> > Vhost-vdpa implements the suspend call so it does not need to trust > VHOST_GET_VRING_BASE to be the last vring call done. Since virtiofsd > is using vhost-user maybe it is not needed to implement it actually. Careful, if we deliberately make vhost-user and vDPA diverge, then it will be hard to share the migration interface. > > Basically, what I’m hearing is that I need to implement a different > > feature that has no practical impact right now, and also fix bugs around > > it along the way... > > > > To fix this properly requires iterative device migration in qemu as > far as I know, instead of using VMStates [1]. This way the state is > requested to virtiofsd before the device reset. I don't follow. Many devices are fine with non-iterative migration. They shouldn't be forced to do iterative migration. > What does virtiofsd do when the state is totally sent? Does it keep > processing requests and generating new state or is only a one shot > that will suspend the daemon? If it is the second I think it still can > be done in one shot at the end, always indicating "no more state" at > save_live_pending and sending all the state at > save_live_complete_precopy. > > Does that make sense to you? > > Thanks! > > [1] https://qemu.readthedocs.io/en/latest/devel/migration.html#iterative-device-migration >
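[Editorial sketch: the "one shot at the end" pattern Eugenio describes maps onto QEMU's iterative-migration hooks roughly as below. This is a hypothetical illustration — the real SaveVMHandlers callbacks take QEMUFile and other parameters, and the names of the device struct and its fields are invented here.]

```c
#include <stddef.h>
#include <stdint.h>

/* Invented stand-in for the vhost-user-fs device state. */
typedef struct {
    const void *state;  /* opaque blob fetched from virtiofsd */
    size_t state_len;
} VirtioFsDev;

/* Iterative phase: always report that nothing is pending, so the
 * migration core never asks for device data while vCPUs still run. */
static uint64_t fs_save_live_pending(VirtioFsDev *dev)
{
    (void)dev;
    return 0;
}

/* Completion phase (vCPUs stopped, device suspended): only now is the
 * whole state blob transferred.  In real code this would write
 * dev->state into the migration stream; here we just report its size. */
static size_t fs_save_live_complete_precopy(VirtioFsDev *dev)
{
    return dev->state_len;
}
```

The point of this shape is that a non-iterative back-end still fits the iterative interface: it simply contributes zero bytes until the final stop-and-copy step.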
On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote: > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: > > > So-called "internal" virtio-fs migration refers to transporting the > > > back-end's (virtiofsd's) state through qemu's migration stream. To do > > > this, we need to be able to transfer virtiofsd's internal state to and > > > from virtiofsd. > > > > > > Because virtiofsd's internal state will not be too large, we believe it > > > is best to transfer it as a single binary blob after the streaming > > > phase. Because this method should be useful to other vhost-user > > > implementations, too, it is introduced as a general-purpose addition to > > > the protocol, not limited to vhost-user-fs. > > > > > > These are the additions to the protocol: > > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: > > > This feature signals support for transferring state, and is added so > > > that migration can fail early when the back-end has no support. > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe > > > over which to transfer the state. The front-end sends an FD to the > > > back-end into/from which it can write/read its state, and the back-end > > > can decide to either use it, or reply with a different FD for the > > > front-end to override the front-end's choice. > > > The front-end creates a simple pipe to transfer the state, but maybe > > > the back-end already has an FD into/from which it has to write/read > > > its state, in which case it will want to override the simple pipe. > > > Conversely, maybe in the future we find a way to have the front-end > > > get an immediate FD for the migration stream (in some cases), in which > > > case we will want to send this to the back-end instead of creating a > > > pipe. 
> > > Hence the negotiation: If one side has a better idea than a plain > > > pipe, we will want to use that. > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred through the > > > pipe (the end indicated by EOF), the front-end invokes this function > > > to verify success. There is no in-band way (through the pipe) to > > > indicate failure, so we need to check explicitly. > > > > > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD > > > (which includes establishing the direction of transfer and migration > > > phase), the sending side writes its data into the pipe, and the reading > > > side reads it until it sees an EOF. Then, the front-end will check for > > > success via CHECK_DEVICE_STATE, which on the destination side includes > > > checking for integrity (i.e. errors during deserialization). > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > > > --- > > > include/hw/virtio/vhost-backend.h | 24 +++++ > > > include/hw/virtio/vhost.h | 79 ++++++++++++++++ > > > hw/virtio/vhost-user.c | 147 ++++++++++++++++++++++++++++++ > > > hw/virtio/vhost.c | 37 ++++++++ > > > 4 files changed, 287 insertions(+) > > > > > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h > > > index ec3fbae58d..5935b32fe3 100644 > > > --- a/include/hw/virtio/vhost-backend.h > > > +++ b/include/hw/virtio/vhost-backend.h > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { > > > VHOST_SET_CONFIG_TYPE_MIGRATION = 1, > > > } VhostSetConfigType; > > > > > > +typedef enum VhostDeviceStateDirection { > > > + /* Transfer state from back-end (device) to front-end */ > > > + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, > > > + /* Transfer state from front-end to back-end (device) */ > > > + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, > > > +} VhostDeviceStateDirection; > > > + > > > +typedef enum VhostDeviceStatePhase { > > > + /* The device (and all its 
vrings) is stopped */ > > > + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, > > > +} VhostDeviceStatePhase; > > > > vDPA has: > > > > /* Suspend a device so it does not process virtqueue requests anymore > > * > > * After the return of ioctl the device must preserve all the necessary state > > * (the virtqueue vring base plus the possible device specific states) that is > > * required for restoring in the future. The device must not change its > > * configuration after that point. > > */ > > #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) > > > > /* Resume a device so it can resume processing virtqueue requests > > * > > * After the return of this ioctl the device will have restored all the > > * necessary states and it is fully operational to continue processing the > > * virtqueue descriptors. > > */ > > #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) > > > > I wonder if it makes sense to import these into vhost-user so that the > > difference between kernel vhost and vhost-user is minimized. It's okay > > if one of them is ahead of the other, but it would be nice to avoid > > overlapping/duplicated functionality. > > > > That's what I had in mind in the first versions. I proposed VHOST_STOP > instead of VHOST_VDPA_STOP for this very reason. Later it did change > to SUSPEND. I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not ioctl(VHOST_VDPA_RESUME). The doc comments in <linux/vdpa.h> don't explain how the device can leave the suspended state. Can you clarify this? Stefan
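[Editorial sketch: for reference, driving the vDPA ioctls quoted above looks roughly like this. VHOST_VIRTIO is 0xAF in the Linux uapi headers; the fallback defines below are only so the sketch compiles against older headers, and the `/dev/vhost-vdpa-N` fd handling is assumed, not shown.]

```c
#include <sys/ioctl.h>

#ifndef VHOST_VIRTIO
#define VHOST_VIRTIO 0xAF
#endif
#ifndef VHOST_VDPA_SUSPEND
#define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D)
#endif
#ifndef VHOST_VDPA_RESUME
#define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E)
#endif

/* Suspend the device (an open /dev/vhost-vdpa-N fd) so it stops
 * processing virtqueue requests but preserves all state needed for
 * restore.  Returns 0 on success, -1 with errno set on failure. */
static int vdpa_suspend(int dev_fd)
{
    return ioctl(dev_fd, VHOST_VDPA_SUSPEND);
}

/* Resume a previously suspended device so it continues processing
 * virtqueue descriptors. */
static int vdpa_resume(int dev_fd)
{
    return ioctl(dev_fd, VHOST_VDPA_RESUME);
}
```

A failing ioctl (e.g. on a device that does not advertise suspend support) leaves the caller to decide whether to fall back to a full stop/reset, which is exactly the divergence from vhost-user being debated in this thread.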
Hi Eugenio,
Another question about vDPA suspend/resume:

/* Host notifiers must be enabled at this point. */
void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings)
{
    int i;

    /* should only be called after backend is connected */
    assert(hdev->vhost_ops);
    event_notifier_test_and_clear(
        &hdev->vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier);
    event_notifier_test_and_clear(&vdev->config_notifier);

    trace_vhost_dev_stop(hdev, vdev->name, vrings);

    if (hdev->vhost_ops->vhost_dev_start) {
        hdev->vhost_ops->vhost_dev_start(hdev, false);
        ^^^ SUSPEND ^^^
    }
    if (vrings) {
        vhost_dev_set_vring_enable(hdev, false);
    }
    for (i = 0; i < hdev->nvqs; ++i) {
        vhost_virtqueue_stop(hdev,
                             vdev,
                             hdev->vqs + i,
                             hdev->vq_index + i);
        ^^^ fetch virtqueue state from kernel ^^^
    }
    if (hdev->vhost_ops->vhost_reset_status) {
        hdev->vhost_ops->vhost_reset_status(hdev);
        ^^^ reset device^^^

I noticed the QEMU vDPA code resets the device in vhost_dev_stop() ->
vhost_reset_status(). The device's migration code runs after
vhost_dev_stop() and the state will have been lost.

It looks like vDPA changes are necessary in order to support stateful
devices even though QEMU already uses SUSPEND. Is my understanding
correct?

Stefan
On Thu, Apr 13, 2023 at 7:32 PM Hanna Czenczek <hreitz@redhat.com> wrote: > > On 13.04.23 12:14, Eugenio Perez Martin wrote: > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > >> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: > >>> So-called "internal" virtio-fs migration refers to transporting the > >>> back-end's (virtiofsd's) state through qemu's migration stream. To do > >>> this, we need to be able to transfer virtiofsd's internal state to and > >>> from virtiofsd. > >>> > >>> Because virtiofsd's internal state will not be too large, we believe it > >>> is best to transfer it as a single binary blob after the streaming > >>> phase. Because this method should be useful to other vhost-user > >>> implementations, too, it is introduced as a general-purpose addition to > >>> the protocol, not limited to vhost-user-fs. > >>> > >>> These are the additions to the protocol: > >>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: > >>> This feature signals support for transferring state, and is added so > >>> that migration can fail early when the back-end has no support. > >>> > >>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe > >>> over which to transfer the state. The front-end sends an FD to the > >>> back-end into/from which it can write/read its state, and the back-end > >>> can decide to either use it, or reply with a different FD for the > >>> front-end to override the front-end's choice. > >>> The front-end creates a simple pipe to transfer the state, but maybe > >>> the back-end already has an FD into/from which it has to write/read > >>> its state, in which case it will want to override the simple pipe. > >>> Conversely, maybe in the future we find a way to have the front-end > >>> get an immediate FD for the migration stream (in some cases), in which > >>> case we will want to send this to the back-end instead of creating a > >>> pipe. 
> >>> Hence the negotiation: If one side has a better idea than a plain > >>> pipe, we will want to use that. > >>> > >>> - CHECK_DEVICE_STATE: After the state has been transferred through the > >>> pipe (the end indicated by EOF), the front-end invokes this function > >>> to verify success. There is no in-band way (through the pipe) to > >>> indicate failure, so we need to check explicitly. > >>> > >>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD > >>> (which includes establishing the direction of transfer and migration > >>> phase), the sending side writes its data into the pipe, and the reading > >>> side reads it until it sees an EOF. Then, the front-end will check for > >>> success via CHECK_DEVICE_STATE, which on the destination side includes > >>> checking for integrity (i.e. errors during deserialization). > >>> > >>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > >>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > >>> --- > >>> include/hw/virtio/vhost-backend.h | 24 +++++ > >>> include/hw/virtio/vhost.h | 79 ++++++++++++++++ > >>> hw/virtio/vhost-user.c | 147 ++++++++++++++++++++++++++++++ > >>> hw/virtio/vhost.c | 37 ++++++++ > >>> 4 files changed, 287 insertions(+) > >>> > >>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h > >>> index ec3fbae58d..5935b32fe3 100644 > >>> --- a/include/hw/virtio/vhost-backend.h > >>> +++ b/include/hw/virtio/vhost-backend.h > >>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { > >>> VHOST_SET_CONFIG_TYPE_MIGRATION = 1, > >>> } VhostSetConfigType; > >>> > >>> +typedef enum VhostDeviceStateDirection { > >>> + /* Transfer state from back-end (device) to front-end */ > >>> + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, > >>> + /* Transfer state from front-end to back-end (device) */ > >>> + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, > >>> +} VhostDeviceStateDirection; > >>> + > >>> +typedef enum VhostDeviceStatePhase { > >>> + /* The device (and all its 
vrings) is stopped */ > >>> + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, > >>> +} VhostDeviceStatePhase; > >> vDPA has: > >> > >> /* Suspend a device so it does not process virtqueue requests anymore > >> * > >> * After the return of ioctl the device must preserve all the necessary state > >> * (the virtqueue vring base plus the possible device specific states) that is > >> * required for restoring in the future. The device must not change its > >> * configuration after that point. > >> */ > >> #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) > >> > >> /* Resume a device so it can resume processing virtqueue requests > >> * > >> * After the return of this ioctl the device will have restored all the > >> * necessary states and it is fully operational to continue processing the > >> * virtqueue descriptors. > >> */ > >> #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) > >> > >> I wonder if it makes sense to import these into vhost-user so that the > >> difference between kernel vhost and vhost-user is minimized. It's okay > >> if one of them is ahead of the other, but it would be nice to avoid > >> overlapping/duplicated functionality. > >> > > That's what I had in mind in the first versions. I proposed VHOST_STOP > > instead of VHOST_VDPA_STOP for this very reason. Later it did change > > to SUSPEND. > > > > Generally it is better if we make the interface less parametrized and > > we trust in the messages and its semantics in my opinion. In other > > words, instead of > > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), send > > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user command. > > I.e. you mean that this should simply be stateful instead of > re-affirming the current state with a parameter? > > The problem I see is that transferring states in different phases of > migration will require specialized implementations. So running > SET_DEVICE_STATE_FD in a different phase will require support from the > back-end. 
Same in the front-end, the exact protocol and thus > implementation will (probably, difficult to say at this point) depend on > the migration phase. I would therefore prefer to have an explicit > distinction in the command itself that affirms the phase we’re > targeting. > I think we will have this same problem when more phases are added, as the fd and direction arguments are always passed whatever phase you set. Future phases may not require it, or require different arguments. > On the other hand, I don’t see the parameter complicating anything. The > front-end must supply it, but it will know the phase anyway, so this is > easy. The back-end can just choose to ignore it, if it doesn’t feel the > need to verify that the phase is what it thinks it is. > > > Another way to apply this is with the "direction" parameter. Maybe it > > is better to split it into "set_state_fd" and "get_state_fd"? > > Well, it would rather be `set_state_send_fd` and `set_state_receive_fd`. Right, thanks for the correction. > We always negotiate a pipe between front-end and back-end, the question > is just whether the back-end gets the receiving (load) or the sending > (save) end. > > Technically, one can make it fully stateful and say that if the device > hasn’t been started already, it’s always a LOAD, and otherwise always a > SAVE. But as above, I’d prefer to keep the parameter because the > implementations are different, so I’d prefer there to be a > re-affirmation that front-end and back-end are in sync about what should > be done. > > Personally, I don’t really see the advantage of having two functions > instead of one function with an enum with two values. The thing about > SET_DEVICE_STATE_FD is that it itself won’t differ much regardless of > whether we’re loading or saving, it just negotiates the pipe – the > difference is what happens after the pipe has been negotiated. 
So if we > split the function into two, both implementations will share most of > their code anyway, which makes me think it should be a single function. > Yes, all of that makes sense. My proposal was in the line of following other commands like VHOST_USER_SET_VRING_BASE / VHOST_USER_GET_VRING_BASE or VHOST_USER_SET_INFLIGHT_FD and VHOST_USER_GET_INFLIGHT_FD. If that has been considered and it is more convenient to use the arguments I'm totally fine. > > In that case, reusing the ioctls as vhost-user messages would be ok. > > But that puts this proposal further from the VFIO code, which uses > > "migration_set_state(state)", and maybe it is better when the number > > of states is high. > > I’m not sure what you mean (because I don’t know the VFIO code, I > assume). Are you saying that using a more finely grained > migration_set_state() model would conflict with the rather coarse > suspend/resume? > I don't think it exactly conflicts, as we should be able to map both to a given set_state. They may overlap if vhost-user decides to use them. Or if vdpa decides to use SET_DEVICE_STATE_FD. This already happens with the different vhost backends, each one has a different way to suspend the device in the case of a migration anyway. Thanks! > > BTW, is there any usage for *reply_fd at this moment from the backend? > > No, virtiofsd doesn’t plan to make use of it. > > Hanna >
On Mon, Apr 17, 2023 at 5:18 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > On Fri, Apr 14, 2023 at 05:17:02PM +0200, Eugenio Perez Martin wrote: > > On Thu, Apr 13, 2023 at 7:55 PM Hanna Czenczek <hreitz@redhat.com> wrote: > > > > > > On 13.04.23 13:38, Stefan Hajnoczi wrote: > > > > On Thu, 13 Apr 2023 at 05:24, Hanna Czenczek <hreitz@redhat.com> wrote: > > > >> On 12.04.23 23:06, Stefan Hajnoczi wrote: > > > >>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: > > > >>>> So-called "internal" virtio-fs migration refers to transporting the > > > >>>> back-end's (virtiofsd's) state through qemu's migration stream. To do > > > >>>> this, we need to be able to transfer virtiofsd's internal state to and > > > >>>> from virtiofsd. > > > >>>> > > > >>>> Because virtiofsd's internal state will not be too large, we believe it > > > >>>> is best to transfer it as a single binary blob after the streaming > > > >>>> phase. Because this method should be useful to other vhost-user > > > >>>> implementations, too, it is introduced as a general-purpose addition to > > > >>>> the protocol, not limited to vhost-user-fs. > > > >>>> > > > >>>> These are the additions to the protocol: > > > >>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: > > > >>>> This feature signals support for transferring state, and is added so > > > >>>> that migration can fail early when the back-end has no support. > > > >>>> > > > >>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe > > > >>>> over which to transfer the state. The front-end sends an FD to the > > > >>>> back-end into/from which it can write/read its state, and the back-end > > > >>>> can decide to either use it, or reply with a different FD for the > > > >>>> front-end to override the front-end's choice. 
> > > >>>> The front-end creates a simple pipe to transfer the state, but maybe > > > >>>> the back-end already has an FD into/from which it has to write/read > > > >>>> its state, in which case it will want to override the simple pipe. > > > >>>> Conversely, maybe in the future we find a way to have the front-end > > > >>>> get an immediate FD for the migration stream (in some cases), in which > > > >>>> case we will want to send this to the back-end instead of creating a > > > >>>> pipe. > > > >>>> Hence the negotiation: If one side has a better idea than a plain > > > >>>> pipe, we will want to use that. > > > >>>> > > > >>>> - CHECK_DEVICE_STATE: After the state has been transferred through the > > > >>>> pipe (the end indicated by EOF), the front-end invokes this function > > > >>>> to verify success. There is no in-band way (through the pipe) to > > > >>>> indicate failure, so we need to check explicitly. > > > >>>> > > > >>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD > > > >>>> (which includes establishing the direction of transfer and migration > > > >>>> phase), the sending side writes its data into the pipe, and the reading > > > >>>> side reads it until it sees an EOF. Then, the front-end will check for > > > >>>> success via CHECK_DEVICE_STATE, which on the destination side includes > > > >>>> checking for integrity (i.e. errors during deserialization). 
> > > >>>> > > > >>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > > > >>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > > > >>>> --- > > > >>>> include/hw/virtio/vhost-backend.h | 24 +++++ > > > >>>> include/hw/virtio/vhost.h | 79 ++++++++++++++++ > > > >>>> hw/virtio/vhost-user.c | 147 ++++++++++++++++++++++++++++++ > > > >>>> hw/virtio/vhost.c | 37 ++++++++ > > > >>>> 4 files changed, 287 insertions(+) > > > >>>> > > > >>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h > > > >>>> index ec3fbae58d..5935b32fe3 100644 > > > >>>> --- a/include/hw/virtio/vhost-backend.h > > > >>>> +++ b/include/hw/virtio/vhost-backend.h > > > >>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { > > > >>>> VHOST_SET_CONFIG_TYPE_MIGRATION = 1, > > > >>>> } VhostSetConfigType; > > > >>>> > > > >>>> +typedef enum VhostDeviceStateDirection { > > > >>>> + /* Transfer state from back-end (device) to front-end */ > > > >>>> + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, > > > >>>> + /* Transfer state from front-end to back-end (device) */ > > > >>>> + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, > > > >>>> +} VhostDeviceStateDirection; > > > >>>> + > > > >>>> +typedef enum VhostDeviceStatePhase { > > > >>>> + /* The device (and all its vrings) is stopped */ > > > >>>> + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, > > > >>>> +} VhostDeviceStatePhase; > > > >>> vDPA has: > > > >>> > > > >>> /* Suspend a device so it does not process virtqueue requests anymore > > > >>> * > > > >>> * After the return of ioctl the device must preserve all the necessary state > > > >>> * (the virtqueue vring base plus the possible device specific states) that is > > > >>> * required for restoring in the future. The device must not change its > > > >>> * configuration after that point. 
> > > >>> */ > > > >>> #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) > > > >>> > > > >>> /* Resume a device so it can resume processing virtqueue requests > > > >>> * > > > >>> * After the return of this ioctl the device will have restored all the > > > >>> * necessary states and it is fully operational to continue processing the > > > >>> * virtqueue descriptors. > > > >>> */ > > > >>> #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) > > > >>> > > > >>> I wonder if it makes sense to import these into vhost-user so that the > > > >>> difference between kernel vhost and vhost-user is minimized. It's okay > > > >>> if one of them is ahead of the other, but it would be nice to avoid > > > >>> overlapping/duplicated functionality. > > > >>> > > > >>> (And I hope vDPA will import the device state vhost-user messages > > > >>> introduced in this series.) > > > >> I don’t understand your suggestion. (Like, I very simply don’t > > > >> understand :)) > > > >> > > > >> These are vhost messages, right? What purpose do you have in mind for > > > >> them in vhost-user for internal migration? They’re different from the > > > >> state transfer messages, because they don’t transfer state to/from the > > > >> front-end. Also, the state transfer stuff is supposed to be distinct > > > >> from starting/stopping the device; right now, it just requires the > > > >> device to be stopped beforehand (or started only afterwards). And in > > > >> the future, new VhostDeviceStatePhase values may allow the messages to > > > >> be used on devices that aren’t stopped. > > > >> > > > >> So they seem to serve very different purposes. I can imagine using the > > > >> VDPA_{SUSPEND,RESUME} messages for external migration (what Anton is > > > >> working on), but they don’t really help with internal migration > > > >> implemented here. If I were to add them, they’d just be sent in > > > >> addition to the new messages added in this patch here, i.e. 
SUSPEND on > > > >> the source before SET_DEVICE_STATE_FD, and RESUME on the destination > > > >> after CHECK_DEVICE_STATE (we could use RESUME in place of > > > >> CHECK_DEVICE_STATE on the destination, but we can’t do that on the > > > >> source, so we still need CHECK_DEVICE_STATE). > > > > Yes, they are complementary to the device state fd message. I want to > > > > make sure pre-conditions about the device's state (running vs stopped) > > > > already take into account the vDPA SUSPEND/RESUME model. > > > > > > > > vDPA will need device state save/load in the future. For virtiofs > > > > devices, for example. This is why I think we should plan for vDPA and > > > > vhost-user to share the same interface. > > > > > > While the paragraph below is more important, I don’t feel like this > > > would be important right now. It’s clear that SUSPEND must come before > > > transferring any state, and that RESUME must come after transferring > > > state. I don’t think we need to clarify this now, it’d be obvious when > > > implementing SUSPEND/RESUME. > > > > > > > Also, I think the code path you're relying on (vhost_dev_stop()) on > > > > doesn't work for backends that implement VHOST_USER_PROTOCOL_F_STATUS > > > > because stopping the backend resets the device and throws away its > > > > state. SUSPEND/RESUME solve this. This looks like a more general > > > > problem since vhost_dev_stop() is called any time the VM is paused. > > > > Maybe it needs to use SUSPEND/RESUME whenever possible. > > > > > > That’s a problem. Quite a problem, to be honest, because this sounds > > > rather complicated with honestly absolutely no practical benefit right > > > now. > > > > > > Would you require SUSPEND/RESUME for state transfer even if the back-end > > > does not implement GET/SET_STATUS? Because then this would also lead to > > > more complexity in virtiofsd. > > > > > > > At this moment the vhost-user net in DPDK suspends at > > VHOST_GET_VRING_BASE. 
Not the same case though, as here only the vq > > indexes / wrap bits are transferred. > > > > Vhost-vdpa implements the suspend call so it does not need to trust > > VHOST_GET_VRING_BASE to be the last vring call done. Since virtiofsd > > is using vhost-user maybe it is not needed to implement it actually. > > Careful, if we deliberately make vhost-user and vDPA diverge, then it > will be hard to share the migration interface. > I don't recall the exact reasons for not following with the VRING_GET_BASE == suspend for vDPA, IIRC it was the lack of a proper definition back then. But vhost-kernel and vhost-user already diverged in that regard, for example. vhost-kernel set a tap backend of -1 to suspend the device. > > > Basically, what I’m hearing is that I need to implement a different > > > feature that has no practical impact right now, and also fix bugs around > > > it along the way... > > > > > > > To fix this properly requires iterative device migration in qemu as > > far as I know, instead of using VMStates [1]. This way the state is > > requested to virtiofsd before the device reset. > > I don't follow. Many devices are fine with non-iterative migration. They > shouldn't be forced to do iterative migration. > Sorry I think I didn't express myself well. I didn't mean to force virtiofsd to support the iterative migration, but to use the device iterative migration API in QEMU to send the needed commands before vhost_dev_stop. In that regard, the device or the vhost-user commands would not require changes. I think it is convenient in the long run for virtiofsd, as if the state grows so much that it's not feasible to fetch it in one shot there is no need to make changes in the qemu migration protocol. I think it is not unlikely in virtiofs, but maybe I'm missing something obvious and its state will never grow. > > What does virtiofsd do when the state is totally sent? 
Does it keep > > processing requests and generating new state or is only a one shot > > that will suspend the daemon? If it is the second I think it still can > > be done in one shot at the end, always indicating "no more state" at > > save_live_pending and sending all the state at > > save_live_complete_precopy. > > > > Does that make sense to you? > > > > Thanks! > > > > [1] https://qemu.readthedocs.io/en/latest/devel/migration.html#iterative-device-migration > >
On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote: > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: > > > > So-called "internal" virtio-fs migration refers to transporting the > > > > back-end's (virtiofsd's) state through qemu's migration stream. To do > > > > this, we need to be able to transfer virtiofsd's internal state to and > > > > from virtiofsd. > > > > > > > > Because virtiofsd's internal state will not be too large, we believe it > > > > is best to transfer it as a single binary blob after the streaming > > > > phase. Because this method should be useful to other vhost-user > > > > implementations, too, it is introduced as a general-purpose addition to > > > > the protocol, not limited to vhost-user-fs. > > > > > > > > These are the additions to the protocol: > > > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: > > > > This feature signals support for transferring state, and is added so > > > > that migration can fail early when the back-end has no support. > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe > > > > over which to transfer the state. The front-end sends an FD to the > > > > back-end into/from which it can write/read its state, and the back-end > > > > can decide to either use it, or reply with a different FD for the > > > > front-end to override the front-end's choice. > > > > The front-end creates a simple pipe to transfer the state, but maybe > > > > the back-end already has an FD into/from which it has to write/read > > > > its state, in which case it will want to override the simple pipe. 
> > > > Conversely, maybe in the future we find a way to have the front-end > > > > get an immediate FD for the migration stream (in some cases), in which > > > > case we will want to send this to the back-end instead of creating a > > > > pipe. > > > > Hence the negotiation: If one side has a better idea than a plain > > > > pipe, we will want to use that. > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred through the > > > > pipe (the end indicated by EOF), the front-end invokes this function > > > > to verify success. There is no in-band way (through the pipe) to > > > > indicate failure, so we need to check explicitly. > > > > > > > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD > > > > (which includes establishing the direction of transfer and migration > > > > phase), the sending side writes its data into the pipe, and the reading > > > > side reads it until it sees an EOF. Then, the front-end will check for > > > > success via CHECK_DEVICE_STATE, which on the destination side includes > > > > checking for integrity (i.e. errors during deserialization). 
> > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > > > > --- > > > > include/hw/virtio/vhost-backend.h | 24 +++++ > > > > include/hw/virtio/vhost.h | 79 ++++++++++++++++ > > > > hw/virtio/vhost-user.c | 147 ++++++++++++++++++++++++++++++ > > > > hw/virtio/vhost.c | 37 ++++++++ > > > > 4 files changed, 287 insertions(+) > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h > > > > index ec3fbae58d..5935b32fe3 100644 > > > > --- a/include/hw/virtio/vhost-backend.h > > > > +++ b/include/hw/virtio/vhost-backend.h > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { > > > > VHOST_SET_CONFIG_TYPE_MIGRATION = 1, > > > > } VhostSetConfigType; > > > > > > > > +typedef enum VhostDeviceStateDirection { > > > > + /* Transfer state from back-end (device) to front-end */ > > > > + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, > > > > + /* Transfer state from front-end to back-end (device) */ > > > > + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, > > > > +} VhostDeviceStateDirection; > > > > + > > > > +typedef enum VhostDeviceStatePhase { > > > > + /* The device (and all its vrings) is stopped */ > > > > + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, > > > > +} VhostDeviceStatePhase; > > > > > > vDPA has: > > > > > > /* Suspend a device so it does not process virtqueue requests anymore > > > * > > > * After the return of ioctl the device must preserve all the necessary state > > > * (the virtqueue vring base plus the possible device specific states) that is > > > * required for restoring in the future. The device must not change its > > > * configuration after that point. 
> > > */ > > > #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) > > > > > > /* Resume a device so it can resume processing virtqueue requests > > > * > > > * After the return of this ioctl the device will have restored all the > > > * necessary states and it is fully operational to continue processing the > > > * virtqueue descriptors. > > > */ > > > #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) > > > > > > I wonder if it makes sense to import these into vhost-user so that the > > > difference between kernel vhost and vhost-user is minimized. It's okay > > > if one of them is ahead of the other, but it would be nice to avoid > > > overlapping/duplicated functionality. > > > > > > > That's what I had in mind in the first versions. I proposed VHOST_STOP > > instead of VHOST_VDPA_STOP for this very reason. Later it did change > > to SUSPEND. > > > > Generally it is better if we make the interface less parametrized and > > we trust in the messages and its semantics in my opinion. In other > > words, instead of > > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), send > > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user command. > > > > Another way to apply this is with the "direction" parameter. Maybe it > > is better to split it into "set_state_fd" and "get_state_fd"? > > > > In that case, reusing the ioctls as vhost-user messages would be ok. > > But that puts this proposal further from the VFIO code, which uses > > "migration_set_state(state)", and maybe it is better when the number > > of states is high. > > Hi Eugenio, > Another question about vDPA suspend/resume: > > /* Host notifiers must be enabled at this point. 
*/ > void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings) > { > int i; > > /* should only be called after backend is connected */ > assert(hdev->vhost_ops); > event_notifier_test_and_clear( > &hdev->vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier); > event_notifier_test_and_clear(&vdev->config_notifier); > > trace_vhost_dev_stop(hdev, vdev->name, vrings); > > if (hdev->vhost_ops->vhost_dev_start) { > hdev->vhost_ops->vhost_dev_start(hdev, false); > ^^^ SUSPEND ^^^ > } > if (vrings) { > vhost_dev_set_vring_enable(hdev, false); > } > for (i = 0; i < hdev->nvqs; ++i) { > vhost_virtqueue_stop(hdev, > vdev, > hdev->vqs + i, > hdev->vq_index + i); > ^^^ fetch virtqueue state from kernel ^^^ > } > if (hdev->vhost_ops->vhost_reset_status) { > hdev->vhost_ops->vhost_reset_status(hdev); > ^^^ reset device ^^^ > > I noticed the QEMU vDPA code resets the device in vhost_dev_stop() -> > vhost_reset_status(). The device's migration code runs after > vhost_dev_stop() and the state will have been lost. > vhost_virtqueue_stop saves the vq state (indexes, vring base) in the qemu VirtIONet device model. This is for all vhost backends. Regarding state like the mac or mq configuration, SVQ runs on the CVQ for the whole VM run, so it can track all of that status in the device model too. When a migration effectively occurs, all the frontend state is migrated as a regular emulated device. Routing all of the state through qemu in a normalized way is what leaves open the possibility of cross-backend migrations, etc. Does that answer your question? > It looks like vDPA changes are necessary in order to support stateful > devices even though QEMU already uses SUSPEND. Is my understanding > correct? > Changes are required elsewhere, as the code to restore the state properly in the destination has not been merged. Thanks!
On Mon, 17 Apr 2023 at 14:56, Eugenio Perez Martin <eperezma@redhat.com> wrote: > > On Mon, Apr 17, 2023 at 5:18 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > On Fri, Apr 14, 2023 at 05:17:02PM +0200, Eugenio Perez Martin wrote: > > > On Thu, Apr 13, 2023 at 7:55 PM Hanna Czenczek <hreitz@redhat.com> wrote: > > > > > > > > On 13.04.23 13:38, Stefan Hajnoczi wrote: > > > > > On Thu, 13 Apr 2023 at 05:24, Hanna Czenczek <hreitz@redhat.com> wrote: > > > > >> On 12.04.23 23:06, Stefan Hajnoczi wrote: > > > > >>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: > > > > Basically, what I’m hearing is that I need to implement a different > > > > feature that has no practical impact right now, and also fix bugs around > > > > it along the way... > > > > > > > > > > To fix this properly requires iterative device migration in qemu as > > > far as I know, instead of using VMStates [1]. This way the state is > > > requested to virtiofsd before the device reset. > > > > I don't follow. Many devices are fine with non-iterative migration. They > > shouldn't be forced to do iterative migration. > > > > Sorry I think I didn't express myself well. I didn't mean to force > virtiofsd to support the iterative migration, but to use the device > iterative migration API in QEMU to send the needed commands before > vhost_dev_stop. In that regard, the device or the vhost-user commands > would not require changes. > > I think it is convenient in the long run for virtiofsd, as if the > state grows so much that it's not feasible to fetch it in one shot > there is no need to make changes in the qemu migration protocol. I > think it is not unlikely in virtiofs, but maybe I'm missing something > obvious and it's state will never grow. I don't understand. vCPUs are still running at that point and the device state could change. It's not safe to save the full device state until vCPUs have stopped (after vhost_dev_stop). 
If you're suggesting somehow doing non-iterative migration during the iterative phase, then I don't think that's possible? Stefan
On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote: > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: > > > > So-called "internal" virtio-fs migration refers to transporting the > > > > back-end's (virtiofsd's) state through qemu's migration stream. To do > > > > this, we need to be able to transfer virtiofsd's internal state to and > > > > from virtiofsd. > > > > > > > > Because virtiofsd's internal state will not be too large, we believe it > > > > is best to transfer it as a single binary blob after the streaming > > > > phase. Because this method should be useful to other vhost-user > > > > implementations, too, it is introduced as a general-purpose addition to > > > > the protocol, not limited to vhost-user-fs. > > > > > > > > These are the additions to the protocol: > > > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: > > > > This feature signals support for transferring state, and is added so > > > > that migration can fail early when the back-end has no support. > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe > > > > over which to transfer the state. The front-end sends an FD to the > > > > back-end into/from which it can write/read its state, and the back-end > > > > can decide to either use it, or reply with a different FD for the > > > > front-end to override the front-end's choice. > > > > The front-end creates a simple pipe to transfer the state, but maybe > > > > the back-end already has an FD into/from which it has to write/read > > > > its state, in which case it will want to override the simple pipe. 
> > > > Conversely, maybe in the future we find a way to have the front-end > > > > get an immediate FD for the migration stream (in some cases), in which > > > > case we will want to send this to the back-end instead of creating a > > > > pipe. > > > > Hence the negotiation: If one side has a better idea than a plain > > > > pipe, we will want to use that. > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred through the > > > > pipe (the end indicated by EOF), the front-end invokes this function > > > > to verify success. There is no in-band way (through the pipe) to > > > > indicate failure, so we need to check explicitly. > > > > > > > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD > > > > (which includes establishing the direction of transfer and migration > > > > phase), the sending side writes its data into the pipe, and the reading > > > > side reads it until it sees an EOF. Then, the front-end will check for > > > > success via CHECK_DEVICE_STATE, which on the destination side includes > > > > checking for integrity (i.e. errors during deserialization). 
> > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > > > > --- > > > > include/hw/virtio/vhost-backend.h | 24 +++++ > > > > include/hw/virtio/vhost.h | 79 ++++++++++++++++ > > > > hw/virtio/vhost-user.c | 147 ++++++++++++++++++++++++++++++ > > > > hw/virtio/vhost.c | 37 ++++++++ > > > > 4 files changed, 287 insertions(+) > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h > > > > index ec3fbae58d..5935b32fe3 100644 > > > > --- a/include/hw/virtio/vhost-backend.h > > > > +++ b/include/hw/virtio/vhost-backend.h > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { > > > > VHOST_SET_CONFIG_TYPE_MIGRATION = 1, > > > > } VhostSetConfigType; > > > > > > > > +typedef enum VhostDeviceStateDirection { > > > > + /* Transfer state from back-end (device) to front-end */ > > > > + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, > > > > + /* Transfer state from front-end to back-end (device) */ > > > > + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, > > > > +} VhostDeviceStateDirection; > > > > + > > > > +typedef enum VhostDeviceStatePhase { > > > > + /* The device (and all its vrings) is stopped */ > > > > + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, > > > > +} VhostDeviceStatePhase; > > > > > > vDPA has: > > > > > > /* Suspend a device so it does not process virtqueue requests anymore > > > * > > > * After the return of ioctl the device must preserve all the necessary state > > > * (the virtqueue vring base plus the possible device specific states) that is > > > * required for restoring in the future. The device must not change its > > > * configuration after that point. 
> > > */ > > > #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) > > > > > > /* Resume a device so it can resume processing virtqueue requests > > > * > > > * After the return of this ioctl the device will have restored all the > > > * necessary states and it is fully operational to continue processing the > > > * virtqueue descriptors. > > > */ > > > #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) > > > > > > I wonder if it makes sense to import these into vhost-user so that the > > > difference between kernel vhost and vhost-user is minimized. It's okay > > > if one of them is ahead of the other, but it would be nice to avoid > > > overlapping/duplicated functionality. > > > > > > > That's what I had in mind in the first versions. I proposed VHOST_STOP > > instead of VHOST_VDPA_STOP for this very reason. Later it did change > > to SUSPEND. > > I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not > ioctl(VHOST_VDPA_RESUME). > > The doc comments in <linux/vdpa.h> don't explain how the device can > leave the suspended state. Can you clarify this? > Do you mean in what situations or regarding the semantics of _RESUME? To me resume is an operation mainly to resume the device in the event of a VM suspension, not a migration. It can be used as a fallback code in some cases of migration failure though, but it is not currently used in qemu. Thanks!
On Mon, Apr 17, 2023 at 9:08 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > On Mon, 17 Apr 2023 at 14:56, Eugenio Perez Martin <eperezma@redhat.com> wrote: > > > > On Mon, Apr 17, 2023 at 5:18 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > > > On Fri, Apr 14, 2023 at 05:17:02PM +0200, Eugenio Perez Martin wrote: > > > > On Thu, Apr 13, 2023 at 7:55 PM Hanna Czenczek <hreitz@redhat.com> wrote: > > > > > > > > > > On 13.04.23 13:38, Stefan Hajnoczi wrote: > > > > > > On Thu, 13 Apr 2023 at 05:24, Hanna Czenczek <hreitz@redhat.com> wrote: > > > > > >> On 12.04.23 23:06, Stefan Hajnoczi wrote: > > > > > >>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: > > > > > Basically, what I’m hearing is that I need to implement a different > > > > > feature that has no practical impact right now, and also fix bugs around > > > > > it along the way... > > > > > > > > > > > > > To fix this properly requires iterative device migration in qemu as > > > > far as I know, instead of using VMStates [1]. This way the state is > > > > requested to virtiofsd before the device reset. > > > > > > I don't follow. Many devices are fine with non-iterative migration. They > > > shouldn't be forced to do iterative migration. > > > > > > > Sorry I think I didn't express myself well. I didn't mean to force > > virtiofsd to support the iterative migration, but to use the device > > iterative migration API in QEMU to send the needed commands before > > vhost_dev_stop. In that regard, the device or the vhost-user commands > > would not require changes. > > > > I think it is convenient in the long run for virtiofsd, as if the > > state grows so much that it's not feasible to fetch it in one shot > > there is no need to make changes in the qemu migration protocol. I > > think it is not unlikely in virtiofs, but maybe I'm missing something > > obvious and it's state will never grow. > > I don't understand. 
vCPUs are still running at that point and the > device state could change. It's not safe to save the full device state > until vCPUs have stopped (after vhost_dev_stop). > I think the vCPU is already stopped at the save_live_complete_precopy callback. Maybe my understanding is wrong? Thanks! > If you're suggesting somehow doing non-iterative migration during > the iterative phase, then I don't think that's possible? > > Stefan >
On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com> wrote: > > On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote: > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: > > > > > So-called "internal" virtio-fs migration refers to transporting the > > > > > back-end's (virtiofsd's) state through qemu's migration stream. To do > > > > > this, we need to be able to transfer virtiofsd's internal state to and > > > > > from virtiofsd. > > > > > > > > > > Because virtiofsd's internal state will not be too large, we believe it > > > > > is best to transfer it as a single binary blob after the streaming > > > > > phase. Because this method should be useful to other vhost-user > > > > > implementations, too, it is introduced as a general-purpose addition to > > > > > the protocol, not limited to vhost-user-fs. > > > > > > > > > > These are the additions to the protocol: > > > > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: > > > > > This feature signals support for transferring state, and is added so > > > > > that migration can fail early when the back-end has no support. > > > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe > > > > > over which to transfer the state. The front-end sends an FD to the > > > > > back-end into/from which it can write/read its state, and the back-end > > > > > can decide to either use it, or reply with a different FD for the > > > > > front-end to override the front-end's choice. > > > > > The front-end creates a simple pipe to transfer the state, but maybe > > > > > the back-end already has an FD into/from which it has to write/read > > > > > its state, in which case it will want to override the simple pipe. 
> > > > > Conversely, maybe in the future we find a way to have the front-end > > > > > get an immediate FD for the migration stream (in some cases), in which > > > > > case we will want to send this to the back-end instead of creating a > > > > > pipe. > > > > > Hence the negotiation: If one side has a better idea than a plain > > > > > pipe, we will want to use that. > > > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred through the > > > > > pipe (the end indicated by EOF), the front-end invokes this function > > > > > to verify success. There is no in-band way (through the pipe) to > > > > > indicate failure, so we need to check explicitly. > > > > > > > > > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD > > > > > (which includes establishing the direction of transfer and migration > > > > > phase), the sending side writes its data into the pipe, and the reading > > > > > side reads it until it sees an EOF. Then, the front-end will check for > > > > > success via CHECK_DEVICE_STATE, which on the destination side includes > > > > > checking for integrity (i.e. errors during deserialization). 
> > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > > > > > --- > > > > > include/hw/virtio/vhost-backend.h | 24 +++++ > > > > > include/hw/virtio/vhost.h | 79 ++++++++++++++++ > > > > > hw/virtio/vhost-user.c | 147 ++++++++++++++++++++++++++++++ > > > > > hw/virtio/vhost.c | 37 ++++++++ > > > > > 4 files changed, 287 insertions(+) > > > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h > > > > > index ec3fbae58d..5935b32fe3 100644 > > > > > --- a/include/hw/virtio/vhost-backend.h > > > > > +++ b/include/hw/virtio/vhost-backend.h > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { > > > > > VHOST_SET_CONFIG_TYPE_MIGRATION = 1, > > > > > } VhostSetConfigType; > > > > > > > > > > +typedef enum VhostDeviceStateDirection { > > > > > + /* Transfer state from back-end (device) to front-end */ > > > > > + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, > > > > > + /* Transfer state from front-end to back-end (device) */ > > > > > + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, > > > > > +} VhostDeviceStateDirection; > > > > > + > > > > > +typedef enum VhostDeviceStatePhase { > > > > > + /* The device (and all its vrings) is stopped */ > > > > > + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, > > > > > +} VhostDeviceStatePhase; > > > > > > > > vDPA has: > > > > > > > > /* Suspend a device so it does not process virtqueue requests anymore > > > > * > > > > * After the return of ioctl the device must preserve all the necessary state > > > > * (the virtqueue vring base plus the possible device specific states) that is > > > > * required for restoring in the future. The device must not change its > > > > * configuration after that point. 
> > > > */ > > > > #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) > > > > > > > > /* Resume a device so it can resume processing virtqueue requests > > > > * > > > > * After the return of this ioctl the device will have restored all the > > > > * necessary states and it is fully operational to continue processing the > > > > * virtqueue descriptors. > > > > */ > > > > #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) > > > > > > > > I wonder if it makes sense to import these into vhost-user so that the > > > > difference between kernel vhost and vhost-user is minimized. It's okay > > > > if one of them is ahead of the other, but it would be nice to avoid > > > > overlapping/duplicated functionality. > > > > > > > > > > That's what I had in mind in the first versions. I proposed VHOST_STOP > > > instead of VHOST_VDPA_STOP for this very reason. Later it did change > > > to SUSPEND. > > > > > > Generally it is better if we make the interface less parametrized and > > > we trust in the messages and its semantics in my opinion. In other > > > words, instead of > > > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), send > > > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user command. > > > > > > Another way to apply this is with the "direction" parameter. Maybe it > > > is better to split it into "set_state_fd" and "get_state_fd"? > > > > > > In that case, reusing the ioctls as vhost-user messages would be ok. > > > But that puts this proposal further from the VFIO code, which uses > > > "migration_set_state(state)", and maybe it is better when the number > > > of states is high. > > > > Hi Eugenio, > > Another question about vDPA suspend/resume: > > > > /* Host notifiers must be enabled at this point. 
*/ > > void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings) > > { > > int i; > > > > /* should only be called after backend is connected */ > > assert(hdev->vhost_ops); > > event_notifier_test_and_clear( > > &hdev->vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier); > > event_notifier_test_and_clear(&vdev->config_notifier); > > > > trace_vhost_dev_stop(hdev, vdev->name, vrings); > > > > if (hdev->vhost_ops->vhost_dev_start) { > > hdev->vhost_ops->vhost_dev_start(hdev, false); > > ^^^ SUSPEND ^^^ > > } > > if (vrings) { > > vhost_dev_set_vring_enable(hdev, false); > > } > > for (i = 0; i < hdev->nvqs; ++i) { > > vhost_virtqueue_stop(hdev, > > vdev, > > hdev->vqs + i, > > hdev->vq_index + i); > > ^^^ fetch virtqueue state from kernel ^^^ > > } > > if (hdev->vhost_ops->vhost_reset_status) { > > hdev->vhost_ops->vhost_reset_status(hdev); > > ^^^ reset device^^^ > > > > I noticed the QEMU vDPA code resets the device in vhost_dev_stop() -> > > vhost_reset_status(). The device's migration code runs after > > vhost_dev_stop() and the state will have been lost. > > > > vhost_virtqueue_stop saves the vq state (indexes, vring base) in the > qemu VirtIONet device model. This is for all vhost backends. > > Regarding the state like mac or mq configuration, SVQ runs for all the > VM run in the CVQ. So it can track all of that status in the device > model too. > > When a migration effectively occurs, all the frontend state is > migrated as a regular emulated device. To route all of the state in a > normalized way for qemu is what leaves open the possibility to do > cross-backends migrations, etc. > > Does that answer your question? I think you're confirming that changes would be necessary in order for vDPA to support the save/load operation that Hanna is introducing. > > It looks like vDPA changes are necessary in order to support stateful > > devices even though QEMU already uses SUSPEND. Is my understanding > > correct? 
> > > > Changes are required elsewhere, as the code to restore the state > properly in the destination has not been merged. I'm not sure what you mean by elsewhere? I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and then VHOST_VDPA_SET_STATUS 0. In order to save device state from the vDPA device in the future, it will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that the device state can be saved before the device is reset. Does that sound right? Stefan
On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin <eperezma@redhat.com> wrote: > > On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote: > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: > > > > > So-called "internal" virtio-fs migration refers to transporting the > > > > > back-end's (virtiofsd's) state through qemu's migration stream. To do > > > > > this, we need to be able to transfer virtiofsd's internal state to and > > > > > from virtiofsd. > > > > > > > > > > Because virtiofsd's internal state will not be too large, we believe it > > > > > is best to transfer it as a single binary blob after the streaming > > > > > phase. Because this method should be useful to other vhost-user > > > > > implementations, too, it is introduced as a general-purpose addition to > > > > > the protocol, not limited to vhost-user-fs. > > > > > > > > > > These are the additions to the protocol: > > > > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: > > > > > This feature signals support for transferring state, and is added so > > > > > that migration can fail early when the back-end has no support. > > > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe > > > > > over which to transfer the state. The front-end sends an FD to the > > > > > back-end into/from which it can write/read its state, and the back-end > > > > > can decide to either use it, or reply with a different FD for the > > > > > front-end to override the front-end's choice. > > > > > The front-end creates a simple pipe to transfer the state, but maybe > > > > > the back-end already has an FD into/from which it has to write/read > > > > > its state, in which case it will want to override the simple pipe. 
> > > > > Conversely, maybe in the future we find a way to have the front-end > > > > > get an immediate FD for the migration stream (in some cases), in which > > > > > case we will want to send this to the back-end instead of creating a > > > > > pipe. > > > > > Hence the negotiation: If one side has a better idea than a plain > > > > > pipe, we will want to use that. > > > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred through the > > > > > pipe (the end indicated by EOF), the front-end invokes this function > > > > > to verify success. There is no in-band way (through the pipe) to > > > > > indicate failure, so we need to check explicitly. > > > > > > > > > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD > > > > > (which includes establishing the direction of transfer and migration > > > > > phase), the sending side writes its data into the pipe, and the reading > > > > > side reads it until it sees an EOF. Then, the front-end will check for > > > > > success via CHECK_DEVICE_STATE, which on the destination side includes > > > > > checking for integrity (i.e. errors during deserialization). 
> > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > > > > > --- > > > > > include/hw/virtio/vhost-backend.h | 24 +++++ > > > > > include/hw/virtio/vhost.h | 79 ++++++++++++++++ > > > > > hw/virtio/vhost-user.c | 147 ++++++++++++++++++++++++++++++ > > > > > hw/virtio/vhost.c | 37 ++++++++ > > > > > 4 files changed, 287 insertions(+) > > > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h > > > > > index ec3fbae58d..5935b32fe3 100644 > > > > > --- a/include/hw/virtio/vhost-backend.h > > > > > +++ b/include/hw/virtio/vhost-backend.h > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { > > > > > VHOST_SET_CONFIG_TYPE_MIGRATION = 1, > > > > > } VhostSetConfigType; > > > > > > > > > > +typedef enum VhostDeviceStateDirection { > > > > > + /* Transfer state from back-end (device) to front-end */ > > > > > + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, > > > > > + /* Transfer state from front-end to back-end (device) */ > > > > > + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, > > > > > +} VhostDeviceStateDirection; > > > > > + > > > > > +typedef enum VhostDeviceStatePhase { > > > > > + /* The device (and all its vrings) is stopped */ > > > > > + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, > > > > > +} VhostDeviceStatePhase; > > > > > > > > vDPA has: > > > > > > > > /* Suspend a device so it does not process virtqueue requests anymore > > > > * > > > > * After the return of ioctl the device must preserve all the necessary state > > > > * (the virtqueue vring base plus the possible device specific states) that is > > > > * required for restoring in the future. The device must not change its > > > > * configuration after that point. 
> > > > */ > > > > #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) > > > > > > > > /* Resume a device so it can resume processing virtqueue requests > > > > * > > > > * After the return of this ioctl the device will have restored all the > > > > * necessary states and it is fully operational to continue processing the > > > > * virtqueue descriptors. > > > > */ > > > > #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) > > > > > > > > I wonder if it makes sense to import these into vhost-user so that the > > > > difference between kernel vhost and vhost-user is minimized. It's okay > > > > if one of them is ahead of the other, but it would be nice to avoid > > > > overlapping/duplicated functionality. > > > > > > > > > > That's what I had in mind in the first versions. I proposed VHOST_STOP > > > instead of VHOST_VDPA_STOP for this very reason. Later it did change > > > to SUSPEND. > > > > I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not > > ioctl(VHOST_VDPA_RESUME). > > > > The doc comments in <linux/vdpa.h> don't explain how the device can > > leave the suspended state. Can you clarify this? > > > > Do you mean in what situations or regarding the semantics of _RESUME? > > To me resume is an operation mainly to resume the device in the event > of a VM suspension, not a migration. It can be used as a fallback code > in some cases of migration failure though, but it is not currently > used in qemu. Is a "VM suspension" the QEMU HMP 'stop' command? I guess the reason why QEMU doesn't call RESUME anywhere is that it resets the device in vhost_dev_stop()? Does it make sense to combine SUSPEND and RESUME with Hanna's SET_DEVICE_STATE_FD? For example, non-iterative migration works like this: - Saving the device's state is done by SUSPEND followed by SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g. savevm command or migration failed), then RESUME is called to continue. 
- Loading the device's state is done by SUSPEND followed by SET_DEVICE_STATE_FD, followed by RESUME. Stefan
On Mon, 17 Apr 2023 at 15:12, Eugenio Perez Martin <eperezma@redhat.com> wrote: > > On Mon, Apr 17, 2023 at 9:08 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > On Mon, 17 Apr 2023 at 14:56, Eugenio Perez Martin <eperezma@redhat.com> wrote: > > > > > > On Mon, Apr 17, 2023 at 5:18 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > > > > > On Fri, Apr 14, 2023 at 05:17:02PM +0200, Eugenio Perez Martin wrote: > > > > > On Thu, Apr 13, 2023 at 7:55 PM Hanna Czenczek <hreitz@redhat.com> wrote: > > > > > > > > > > > > On 13.04.23 13:38, Stefan Hajnoczi wrote: > > > > > > > On Thu, 13 Apr 2023 at 05:24, Hanna Czenczek <hreitz@redhat.com> wrote: > > > > > > >> On 12.04.23 23:06, Stefan Hajnoczi wrote: > > > > > > >>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: > > > > > > Basically, what I’m hearing is that I need to implement a different > > > > > > feature that has no practical impact right now, and also fix bugs around > > > > > > it along the way... > > > > > > > > > > > > > > > > To fix this properly requires iterative device migration in qemu as > > > > > far as I know, instead of using VMStates [1]. This way the state is > > > > > requested to virtiofsd before the device reset. > > > > > > > > I don't follow. Many devices are fine with non-iterative migration. They > > > > shouldn't be forced to do iterative migration. > > > > > > > > > > Sorry I think I didn't express myself well. I didn't mean to force > > > virtiofsd to support the iterative migration, but to use the device > > > iterative migration API in QEMU to send the needed commands before > > > vhost_dev_stop. In that regard, the device or the vhost-user commands > > > would not require changes. > > > > > > I think it is convenient in the long run for virtiofsd, as if the > > > state grows so much that it's not feasible to fetch it in one shot > > > there is no need to make changes in the qemu migration protocol. 
I > > > think it is not unlikely in virtiofs, but maybe I'm missing something > > > obvious and it's state will never grow. > > > > I don't understand. vCPUs are still running at that point and the > > device state could change. It's not safe to save the full device state > > until vCPUs have stopped (after vhost_dev_stop). > > > > I think the vCPU is already stopped at save_live_complete_precopy > callback. Maybe my understanding is wrong? Agreed, vCPUs are stopped in save_live_complete_precopy(). However, you wrote "use the device iterative migration API in QEMU to send the needed commands before vhost_dev_stop". save_live_complete_precopy() runs after vhost_dev_stop(), so it doesn't seem to solve the problem. Stefan
On Mon, Apr 17, 2023 at 9:21 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com> wrote: > > > > On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote: > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: > > > > > > So-called "internal" virtio-fs migration refers to transporting the > > > > > > back-end's (virtiofsd's) state through qemu's migration stream. To do > > > > > > this, we need to be able to transfer virtiofsd's internal state to and > > > > > > from virtiofsd. > > > > > > > > > > > > Because virtiofsd's internal state will not be too large, we believe it > > > > > > is best to transfer it as a single binary blob after the streaming > > > > > > phase. Because this method should be useful to other vhost-user > > > > > > implementations, too, it is introduced as a general-purpose addition to > > > > > > the protocol, not limited to vhost-user-fs. > > > > > > > > > > > > These are the additions to the protocol: > > > > > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: > > > > > > This feature signals support for transferring state, and is added so > > > > > > that migration can fail early when the back-end has no support. > > > > > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe > > > > > > over which to transfer the state. The front-end sends an FD to the > > > > > > back-end into/from which it can write/read its state, and the back-end > > > > > > can decide to either use it, or reply with a different FD for the > > > > > > front-end to override the front-end's choice. 
> > > > > > The front-end creates a simple pipe to transfer the state, but maybe > > > > > > the back-end already has an FD into/from which it has to write/read > > > > > > its state, in which case it will want to override the simple pipe. > > > > > > Conversely, maybe in the future we find a way to have the front-end > > > > > > get an immediate FD for the migration stream (in some cases), in which > > > > > > case we will want to send this to the back-end instead of creating a > > > > > > pipe. > > > > > > Hence the negotiation: If one side has a better idea than a plain > > > > > > pipe, we will want to use that. > > > > > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred through the > > > > > > pipe (the end indicated by EOF), the front-end invokes this function > > > > > > to verify success. There is no in-band way (through the pipe) to > > > > > > indicate failure, so we need to check explicitly. > > > > > > > > > > > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD > > > > > > (which includes establishing the direction of transfer and migration > > > > > > phase), the sending side writes its data into the pipe, and the reading > > > > > > side reads it until it sees an EOF. Then, the front-end will check for > > > > > > success via CHECK_DEVICE_STATE, which on the destination side includes > > > > > > checking for integrity (i.e. errors during deserialization). 
> > > > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > > > > > > --- > > > > > > include/hw/virtio/vhost-backend.h | 24 +++++ > > > > > > include/hw/virtio/vhost.h | 79 ++++++++++++++++ > > > > > > hw/virtio/vhost-user.c | 147 ++++++++++++++++++++++++++++++ > > > > > > hw/virtio/vhost.c | 37 ++++++++ > > > > > > 4 files changed, 287 insertions(+) > > > > > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h > > > > > > index ec3fbae58d..5935b32fe3 100644 > > > > > > --- a/include/hw/virtio/vhost-backend.h > > > > > > +++ b/include/hw/virtio/vhost-backend.h > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { > > > > > > VHOST_SET_CONFIG_TYPE_MIGRATION = 1, > > > > > > } VhostSetConfigType; > > > > > > > > > > > > +typedef enum VhostDeviceStateDirection { > > > > > > + /* Transfer state from back-end (device) to front-end */ > > > > > > + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, > > > > > > + /* Transfer state from front-end to back-end (device) */ > > > > > > + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, > > > > > > +} VhostDeviceStateDirection; > > > > > > + > > > > > > +typedef enum VhostDeviceStatePhase { > > > > > > + /* The device (and all its vrings) is stopped */ > > > > > > + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, > > > > > > +} VhostDeviceStatePhase; > > > > > > > > > > vDPA has: > > > > > > > > > > /* Suspend a device so it does not process virtqueue requests anymore > > > > > * > > > > > * After the return of ioctl the device must preserve all the necessary state > > > > > * (the virtqueue vring base plus the possible device specific states) that is > > > > > * required for restoring in the future. The device must not change its > > > > > * configuration after that point. 
> > > > > */ > > > > > #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) > > > > > > > > > > /* Resume a device so it can resume processing virtqueue requests > > > > > * > > > > > * After the return of this ioctl the device will have restored all the > > > > > * necessary states and it is fully operational to continue processing the > > > > > * virtqueue descriptors. > > > > > */ > > > > > #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) > > > > > > > > > > I wonder if it makes sense to import these into vhost-user so that the > > > > > difference between kernel vhost and vhost-user is minimized. It's okay > > > > > if one of them is ahead of the other, but it would be nice to avoid > > > > > overlapping/duplicated functionality. > > > > > > > > > > > > > That's what I had in mind in the first versions. I proposed VHOST_STOP > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did change > > > > to SUSPEND. > > > > > > > > Generally it is better if we make the interface less parametrized and > > > > we trust in the messages and its semantics in my opinion. In other > > > > words, instead of > > > > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), send > > > > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user command. > > > > > > > > Another way to apply this is with the "direction" parameter. Maybe it > > > > is better to split it into "set_state_fd" and "get_state_fd"? > > > > > > > > In that case, reusing the ioctls as vhost-user messages would be ok. > > > > But that puts this proposal further from the VFIO code, which uses > > > > "migration_set_state(state)", and maybe it is better when the number > > > > of states is high. > > > > > > Hi Eugenio, > > > Another question about vDPA suspend/resume: > > > > > > /* Host notifiers must be enabled at this point. 
*/
> > > void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings)
> > > {
> > >     int i;
> > >
> > >     /* should only be called after backend is connected */
> > >     assert(hdev->vhost_ops);
> > >     event_notifier_test_and_clear(
> > >         &hdev->vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier);
> > >     event_notifier_test_and_clear(&vdev->config_notifier);
> > >
> > >     trace_vhost_dev_stop(hdev, vdev->name, vrings);
> > >
> > >     if (hdev->vhost_ops->vhost_dev_start) {
> > >         hdev->vhost_ops->vhost_dev_start(hdev, false);
> > >         ^^^ SUSPEND ^^^
> > >     }
> > >     if (vrings) {
> > >         vhost_dev_set_vring_enable(hdev, false);
> > >     }
> > >     for (i = 0; i < hdev->nvqs; ++i) {
> > >         vhost_virtqueue_stop(hdev,
> > >                              vdev,
> > >                              hdev->vqs + i,
> > >                              hdev->vq_index + i);
> > >         ^^^ fetch virtqueue state from kernel ^^^
> > >     }
> > >     if (hdev->vhost_ops->vhost_reset_status) {
> > >         hdev->vhost_ops->vhost_reset_status(hdev);
> > >         ^^^ reset device ^^^
> > >
> > > I noticed the QEMU vDPA code resets the device in vhost_dev_stop() -> > > > vhost_reset_status(). The device's migration code runs after > > > vhost_dev_stop() and the state will have been lost. > > > > > > > vhost_virtqueue_stop saves the vq state (indexes, vring base) in the > > qemu VirtIONet device model. This is for all vhost backends. > > > > Regarding the state like mac or mq configuration, SVQ runs for all the > > VM run in the CVQ. So it can track all of that status in the device > > model too. > > > > When a migration effectively occurs, all the frontend state is > > migrated as a regular emulated device. To route all of the state in a > > normalized way for qemu is what leaves open the possibility to do > > cross-backends migrations, etc. > > > > Does that answer your question? > > I think you're confirming that changes would be necessary in order for > vDPA to support the save/load operation that Hanna is introducing.
> Yes, this first iteration was centered on net, with an eye on block, where state can be routed through classical emulated devices. This is how vhost-kernel and vhost-user have classically done it. And it allows cross-backend migration, avoids modifying qemu's migration state, etc. Introducing this opaque state to qemu, which must be fetched after the suspend and not before, requires changes in the vhost protocol, as discussed previously. > > > It looks like vDPA changes are necessary in order to support stateful > > > devices even though QEMU already uses SUSPEND. Is my understanding > > > correct? > > > > > > > Changes are required elsewhere, as the code to restore the state > > properly in the destination has not been merged. > > I'm not sure what you mean by elsewhere? > I meant that for vdpa *net* devices the changes are not required in vdpa ioctls, but mostly in qemu. If you meant stateful as "it must have a state blob that is opaque to qemu", then I think the straightforward action is to fetch the state blob at about the same time as the vq indexes. But yes, changes (at least a new ioctl) are needed for that. > I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and > then VHOST_VDPA_SET_STATUS 0. > > In order to save device state from the vDPA device in the future, it > will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that > the device state can be saved before the device is reset. > > Does that sound right? > The split between suspend and reset was added recently for that very reason. In all the virtio devices, the frontend is initialized before the backend, so I don't think it is a good idea to defer the backend cleanup. Especially since we have already established that the state is small enough not to need iterative migration from virtiofsd's point of view. If fetching that state at the same time as the vq indexes is not valid, could it follow the same model as the "in-flight descriptors"? vhost-user tracks them by using a shared memory region where their state is kept [1].
This allows qemu to survive vhost-user SW backend crashes, and it does not forbid cross-backend live migration, as all the information needed to recover them is there. For hw devices this is not convenient, as it occupies PCI bandwidth. So a possibility is to synchronize this memory region at a synchronization point, be it the SUSPEND call or GET_VRING_BASE. HW devices are not going to crash in the software sense, so all use cases remain the same to qemu. And that shared memory information is recoverable after vhost_dev_stop. Does that sound reasonable to virtiofsd? To offer a shared memory region where it dumps the state, maybe only after the set_state(STATE_PHASE_STOPPED)? Thanks! [1] https://qemu.readthedocs.io/en/latest/interop/vhost-user.html#inflight-i-o-tracking
On Mon, Apr 17, 2023 at 9:33 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin <eperezma@redhat.com> wrote: > > > > On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote: > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: > > > > > > So-called "internal" virtio-fs migration refers to transporting the > > > > > > back-end's (virtiofsd's) state through qemu's migration stream. To do > > > > > > this, we need to be able to transfer virtiofsd's internal state to and > > > > > > from virtiofsd. > > > > > > > > > > > > Because virtiofsd's internal state will not be too large, we believe it > > > > > > is best to transfer it as a single binary blob after the streaming > > > > > > phase. Because this method should be useful to other vhost-user > > > > > > implementations, too, it is introduced as a general-purpose addition to > > > > > > the protocol, not limited to vhost-user-fs. > > > > > > > > > > > > These are the additions to the protocol: > > > > > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: > > > > > > This feature signals support for transferring state, and is added so > > > > > > that migration can fail early when the back-end has no support. > > > > > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe > > > > > > over which to transfer the state. The front-end sends an FD to the > > > > > > back-end into/from which it can write/read its state, and the back-end > > > > > > can decide to either use it, or reply with a different FD for the > > > > > > front-end to override the front-end's choice. 
> > > > > > The front-end creates a simple pipe to transfer the state, but maybe > > > > > > the back-end already has an FD into/from which it has to write/read > > > > > > its state, in which case it will want to override the simple pipe. > > > > > > Conversely, maybe in the future we find a way to have the front-end > > > > > > get an immediate FD for the migration stream (in some cases), in which > > > > > > case we will want to send this to the back-end instead of creating a > > > > > > pipe. > > > > > > Hence the negotiation: If one side has a better idea than a plain > > > > > > pipe, we will want to use that. > > > > > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred through the > > > > > > pipe (the end indicated by EOF), the front-end invokes this function > > > > > > to verify success. There is no in-band way (through the pipe) to > > > > > > indicate failure, so we need to check explicitly. > > > > > > > > > > > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD > > > > > > (which includes establishing the direction of transfer and migration > > > > > > phase), the sending side writes its data into the pipe, and the reading > > > > > > side reads it until it sees an EOF. Then, the front-end will check for > > > > > > success via CHECK_DEVICE_STATE, which on the destination side includes > > > > > > checking for integrity (i.e. errors during deserialization). 
> > > > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > > > > > > --- > > > > > > include/hw/virtio/vhost-backend.h | 24 +++++ > > > > > > include/hw/virtio/vhost.h | 79 ++++++++++++++++ > > > > > > hw/virtio/vhost-user.c | 147 ++++++++++++++++++++++++++++++ > > > > > > hw/virtio/vhost.c | 37 ++++++++ > > > > > > 4 files changed, 287 insertions(+) > > > > > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h > > > > > > index ec3fbae58d..5935b32fe3 100644 > > > > > > --- a/include/hw/virtio/vhost-backend.h > > > > > > +++ b/include/hw/virtio/vhost-backend.h > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { > > > > > > VHOST_SET_CONFIG_TYPE_MIGRATION = 1, > > > > > > } VhostSetConfigType; > > > > > > > > > > > > +typedef enum VhostDeviceStateDirection { > > > > > > + /* Transfer state from back-end (device) to front-end */ > > > > > > + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, > > > > > > + /* Transfer state from front-end to back-end (device) */ > > > > > > + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, > > > > > > +} VhostDeviceStateDirection; > > > > > > + > > > > > > +typedef enum VhostDeviceStatePhase { > > > > > > + /* The device (and all its vrings) is stopped */ > > > > > > + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, > > > > > > +} VhostDeviceStatePhase; > > > > > > > > > > vDPA has: > > > > > > > > > > /* Suspend a device so it does not process virtqueue requests anymore > > > > > * > > > > > * After the return of ioctl the device must preserve all the necessary state > > > > > * (the virtqueue vring base plus the possible device specific states) that is > > > > > * required for restoring in the future. The device must not change its > > > > > * configuration after that point. 
> > > > > */ > > > > > #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) > > > > > > > > > > /* Resume a device so it can resume processing virtqueue requests > > > > > * > > > > > * After the return of this ioctl the device will have restored all the > > > > > * necessary states and it is fully operational to continue processing the > > > > > * virtqueue descriptors. > > > > > */ > > > > > #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) > > > > > > > > > > I wonder if it makes sense to import these into vhost-user so that the > > > > > difference between kernel vhost and vhost-user is minimized. It's okay > > > > > if one of them is ahead of the other, but it would be nice to avoid > > > > > overlapping/duplicated functionality. > > > > > > > > > > > > > That's what I had in mind in the first versions. I proposed VHOST_STOP > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did change > > > > to SUSPEND. > > > > > > I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not > > > ioctl(VHOST_VDPA_RESUME). > > > > > > The doc comments in <linux/vdpa.h> don't explain how the device can > > > leave the suspended state. Can you clarify this? > > > > > > > Do you mean in what situations or regarding the semantics of _RESUME? > > > > To me resume is an operation mainly to resume the device in the event > > of a VM suspension, not a migration. It can be used as a fallback code > > in some cases of migration failure though, but it is not currently > > used in qemu. > > Is a "VM suspension" the QEMU HMP 'stop' command? > > I guess the reason why QEMU doesn't call RESUME anywhere is that it > resets the device in vhost_dev_stop()? > The actual reason for not using RESUME is that the ioctl was added after the SUSPEND design in qemu. Same as this proposal, it was not needed at the time.
In the case of vhost-vdpa net, the only usage of suspend is to fetch the vq indexes, and in case of error vhost already fetches them from the guest's used ring way before vDPA, so it has little usage. > Does it make sense to combine SUSPEND and RESUME with Hanna's > SET_DEVICE_STATE_FD? For example, non-iterative migration works like > this: > - Saving the device's state is done by SUSPEND followed by > SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g. > savevm command or migration failed), then RESUME is called to > continue. I think the previous steps make sense at vhost_dev_stop, not in the virtio savevm handlers. Spreading this logic to more places of qemu could bring confusion. > - Loading the device's state is done by SUSPEND followed by > SET_DEVICE_STATE_FD, followed by RESUME. > I think the restore makes more sense after reset and before driver_ok; suspend does not seem like the right call there. SUSPEND implies there may have been other operations before it, so the device may have processed some requests wrongly, as it was not in the right state. Thanks!
On Mon, Apr 17, 2023 at 9:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > On Mon, 17 Apr 2023 at 15:12, Eugenio Perez Martin <eperezma@redhat.com> wrote: > > > > On Mon, Apr 17, 2023 at 9:08 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > On Mon, 17 Apr 2023 at 14:56, Eugenio Perez Martin <eperezma@redhat.com> wrote: > > > > > > > > On Mon, Apr 17, 2023 at 5:18 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > > > > > > > On Fri, Apr 14, 2023 at 05:17:02PM +0200, Eugenio Perez Martin wrote: > > > > > > On Thu, Apr 13, 2023 at 7:55 PM Hanna Czenczek <hreitz@redhat.com> wrote: > > > > > > > > > > > > > > On 13.04.23 13:38, Stefan Hajnoczi wrote: > > > > > > > > On Thu, 13 Apr 2023 at 05:24, Hanna Czenczek <hreitz@redhat.com> wrote: > > > > > > > >> On 12.04.23 23:06, Stefan Hajnoczi wrote: > > > > > > > >>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: > > > > > > > Basically, what I’m hearing is that I need to implement a different > > > > > > > feature that has no practical impact right now, and also fix bugs around > > > > > > > it along the way... > > > > > > > > > > > > > > > > > > > To fix this properly requires iterative device migration in qemu as > > > > > > far as I know, instead of using VMStates [1]. This way the state is > > > > > > requested to virtiofsd before the device reset. > > > > > > > > > > I don't follow. Many devices are fine with non-iterative migration. They > > > > > shouldn't be forced to do iterative migration. > > > > > > > > > > > > > Sorry I think I didn't express myself well. I didn't mean to force > > > > virtiofsd to support the iterative migration, but to use the device > > > > iterative migration API in QEMU to send the needed commands before > > > > vhost_dev_stop. In that regard, the device or the vhost-user commands > > > > would not require changes. 
> > > > > > > > I think it is convenient in the long run for virtiofsd, as if the > > > > state grows so much that it's not feasible to fetch it in one shot > > > > there is no need to make changes in the qemu migration protocol. I > > > > think it is not unlikely in virtiofs, but maybe I'm missing something > > > > obvious and its state will never grow. > > > > > > I don't understand. vCPUs are still running at that point and the > > > device state could change. It's not safe to save the full device state > > > until vCPUs have stopped (after vhost_dev_stop). > > > > > > > I think the vCPU is already stopped at save_live_complete_precopy > > callback. Maybe my understanding is wrong? > > Agreed, vCPUs are stopped in save_live_complete_precopy(). However, > you wrote "use the device iterative migration API in QEMU to send the > needed commands before vhost_dev_stop". save_live_complete_precopy() > runs after vhost_dev_stop() so it doesn't seem to solve the problem. > You're right, and it actually makes the most sense. So I guess this converges with the other thread, let's follow the discussion there. Thanks!
On Tue, Apr 18, 2023 at 10:09:30AM +0200, Eugenio Perez Martin wrote: > On Mon, Apr 17, 2023 at 9:33 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin <eperezma@redhat.com> wrote: > > > > > > On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote: > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > > > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: > > > > > > > So-called "internal" virtio-fs migration refers to transporting the > > > > > > > back-end's (virtiofsd's) state through qemu's migration stream. To do > > > > > > > this, we need to be able to transfer virtiofsd's internal state to and > > > > > > > from virtiofsd. > > > > > > > > > > > > > > Because virtiofsd's internal state will not be too large, we believe it > > > > > > > is best to transfer it as a single binary blob after the streaming > > > > > > > phase. Because this method should be useful to other vhost-user > > > > > > > implementations, too, it is introduced as a general-purpose addition to > > > > > > > the protocol, not limited to vhost-user-fs. > > > > > > > > > > > > > > These are the additions to the protocol: > > > > > > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: > > > > > > > This feature signals support for transferring state, and is added so > > > > > > > that migration can fail early when the back-end has no support. > > > > > > > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe > > > > > > > over which to transfer the state. The front-end sends an FD to the > > > > > > > back-end into/from which it can write/read its state, and the back-end > > > > > > > can decide to either use it, or reply with a different FD for the > > > > > > > front-end to override the front-end's choice. 
> > > > > > > The front-end creates a simple pipe to transfer the state, but maybe > > > > > > > the back-end already has an FD into/from which it has to write/read > > > > > > > its state, in which case it will want to override the simple pipe. > > > > > > > Conversely, maybe in the future we find a way to have the front-end > > > > > > > get an immediate FD for the migration stream (in some cases), in which > > > > > > > case we will want to send this to the back-end instead of creating a > > > > > > > pipe. > > > > > > > Hence the negotiation: If one side has a better idea than a plain > > > > > > > pipe, we will want to use that. > > > > > > > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred through the > > > > > > > pipe (the end indicated by EOF), the front-end invokes this function > > > > > > > to verify success. There is no in-band way (through the pipe) to > > > > > > > indicate failure, so we need to check explicitly. > > > > > > > > > > > > > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD > > > > > > > (which includes establishing the direction of transfer and migration > > > > > > > phase), the sending side writes its data into the pipe, and the reading > > > > > > > side reads it until it sees an EOF. Then, the front-end will check for > > > > > > > success via CHECK_DEVICE_STATE, which on the destination side includes > > > > > > > checking for integrity (i.e. errors during deserialization). 
> > > > > > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > > > > > > > --- > > > > > > > include/hw/virtio/vhost-backend.h | 24 +++++ > > > > > > > include/hw/virtio/vhost.h | 79 ++++++++++++++++ > > > > > > > hw/virtio/vhost-user.c | 147 ++++++++++++++++++++++++++++++ > > > > > > > hw/virtio/vhost.c | 37 ++++++++ > > > > > > > 4 files changed, 287 insertions(+) > > > > > > > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h > > > > > > > index ec3fbae58d..5935b32fe3 100644 > > > > > > > --- a/include/hw/virtio/vhost-backend.h > > > > > > > +++ b/include/hw/virtio/vhost-backend.h > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { > > > > > > > VHOST_SET_CONFIG_TYPE_MIGRATION = 1, > > > > > > > } VhostSetConfigType; > > > > > > > > > > > > > > +typedef enum VhostDeviceStateDirection { > > > > > > > + /* Transfer state from back-end (device) to front-end */ > > > > > > > + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, > > > > > > > + /* Transfer state from front-end to back-end (device) */ > > > > > > > + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, > > > > > > > +} VhostDeviceStateDirection; > > > > > > > + > > > > > > > +typedef enum VhostDeviceStatePhase { > > > > > > > + /* The device (and all its vrings) is stopped */ > > > > > > > + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, > > > > > > > +} VhostDeviceStatePhase; > > > > > > > > > > > > vDPA has: > > > > > > > > > > > > /* Suspend a device so it does not process virtqueue requests anymore > > > > > > * > > > > > > * After the return of ioctl the device must preserve all the necessary state > > > > > > * (the virtqueue vring base plus the possible device specific states) that is > > > > > > * required for restoring in the future. The device must not change its > > > > > > * configuration after that point. 
> > > > > > */ > > > > > > #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) > > > > > > > > > > > > /* Resume a device so it can resume processing virtqueue requests > > > > > > * > > > > > > * After the return of this ioctl the device will have restored all the > > > > > > * necessary states and it is fully operational to continue processing the > > > > > > * virtqueue descriptors. > > > > > > */ > > > > > > #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) > > > > > > > > > > > > I wonder if it makes sense to import these into vhost-user so that the > > > > > > difference between kernel vhost and vhost-user is minimized. It's okay > > > > > > if one of them is ahead of the other, but it would be nice to avoid > > > > > > overlapping/duplicated functionality. > > > > > > > > > > > > > > > > That's what I had in mind in the first versions. I proposed VHOST_STOP > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did change > > > > > to SUSPEND. > > > > > > > > I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not > > > > ioctl(VHOST_VDPA_RESUME). > > > > > > > > The doc comments in <linux/vdpa.h> don't explain how the device can > > > > leave the suspended state. Can you clarify this? > > > > > > > > > > Do you mean in what situations or regarding the semantics of _RESUME? > > > > > > To me resume is an operation mainly to resume the device in the event > > > of a VM suspension, not a migration. It can be used as a fallback code > > > in some cases of migration failure though, but it is not currently > > > used in qemu. > > > > Is a "VM suspension" the QEMU HMP 'stop' command? > > > > I guess the reason why QEMU doesn't call RESUME anywhere is that it > > resets the device in vhost_dev_stop()? > > > > The actual reason for not using RESUME is that the ioctl was added > after the SUSPEND design in qemu. Same as this proposal, it is was not > needed at the time. 
> > In the case of vhost-vdpa net, the only usage of suspend is to fetch > the vq indexes, and in case of error vhost already fetches them from > guest's used ring way before vDPA, so it has little usage. > > > Does it make sense to combine SUSPEND and RESUME with Hanna's > > SET_DEVICE_STATE_FD? For example, non-iterative migration works like > > this: > > - Saving the device's state is done by SUSPEND followed by > > SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g. > > savevm command or migration failed), then RESUME is called to > > continue. I don't think there is a way around extending the QEMU vhost code's model. The current model in QEMU's vhost code is that the backend is reset when the VM stops. This model worked fine for stateless devices, but it doesn't work for stateful devices. Imagine a vdpa-gpu device: you cannot reset the device in vhost_dev_stop() and expect the GPU to continue working when vhost_dev_start() is called again, because all its state has been lost. The guest driver will send requests that reference virtio-gpu resources that no longer exist. One solution is to save the device's state in vhost_dev_stop(). I think this is what you're suggesting. It requires keeping a copy of the state and then loading the state again in vhost_dev_start(). I don't think this approach should be used, because it requires all stateful devices to support live migration (otherwise they break across HMP 'stop'/'cont'). Also, the device state for some devices may be large, and it would also become more complicated when iterative migration is added. Instead, I think the QEMU vhost code needs to be structured so that struct vhost_dev has a suspended state:

      ,---------.
      v         |
   started ------> stopped
      \    ^
       \   |
        -> suspended

The device doesn't lose state when it enters the suspended state.
It can be resumed again. This is why I think SUSPEND/RESUME need to be part of the solution. (It's also an argument for not including the phase argument in SET_DEVICE_STATE_FD because the SUSPEND message is sent during vhost_dev_stop() separately from saving the device's state.) > > - Loading the device's state is done by SUSPEND followed by > > SET_DEVICE_STATE_FD, followed by RESUME. > > > > I think the restore makes more sense after reset and before driver_ok, > suspend does not seem a right call there. SUSPEND implies there may be > other operations before, so the device may have processed some > requests wrong, as it is not in the right state. I find it more elegant to allow SUSPEND -> load -> RESUME if the device state is saved using SUSPEND -> save -> RESUME since the operations are symmetrical, but requiring the device to be reset works too. Here is my understanding of your idea in more detail: The VIRTIO Device Status Field value must be ACKNOWLEDGE | DRIVER | FEATURES_OK, any device initialization configuration space writes must be done, and virtqueues must be configured (Step 7 of 3.1.1 Driver Requirements in VIRTIO 1.2). At that point the device is able to parse the device state and set up its internal state. Doing it any earlier (before feature negotiation or virtqueue configuration) places the device in the awkward situation of having to keep the device state in a buffer and defer loading it until later, which is complex. After device state loading is complete, the DRIVER_OK bit is set to resume device operation. Saving device state is only allowed when the DRIVER_OK bit has been set. Does this sound right? Stefan
On Tue, Apr 18, 2023 at 7:59 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > On Tue, Apr 18, 2023 at 10:09:30AM +0200, Eugenio Perez Martin wrote: > > On Mon, Apr 17, 2023 at 9:33 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin <eperezma@redhat.com> wrote: > > > > > > > > On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > > > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote: > > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > > > > > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: > > > > > > > > So-called "internal" virtio-fs migration refers to transporting the > > > > > > > > back-end's (virtiofsd's) state through qemu's migration stream. To do > > > > > > > > this, we need to be able to transfer virtiofsd's internal state to and > > > > > > > > from virtiofsd. > > > > > > > > > > > > > > > > Because virtiofsd's internal state will not be too large, we believe it > > > > > > > > is best to transfer it as a single binary blob after the streaming > > > > > > > > phase. Because this method should be useful to other vhost-user > > > > > > > > implementations, too, it is introduced as a general-purpose addition to > > > > > > > > the protocol, not limited to vhost-user-fs. > > > > > > > > > > > > > > > > These are the additions to the protocol: > > > > > > > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: > > > > > > > > This feature signals support for transferring state, and is added so > > > > > > > > that migration can fail early when the back-end has no support. > > > > > > > > > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe > > > > > > > > over which to transfer the state. 
The front-end sends an FD to the > > > > > > > > back-end into/from which it can write/read its state, and the back-end > > > > > > > > can decide to either use it, or reply with a different FD for the > > > > > > > > front-end to override the front-end's choice. > > > > > > > > The front-end creates a simple pipe to transfer the state, but maybe > > > > > > > > the back-end already has an FD into/from which it has to write/read > > > > > > > > its state, in which case it will want to override the simple pipe. > > > > > > > > Conversely, maybe in the future we find a way to have the front-end > > > > > > > > get an immediate FD for the migration stream (in some cases), in which > > > > > > > > case we will want to send this to the back-end instead of creating a > > > > > > > > pipe. > > > > > > > > Hence the negotiation: If one side has a better idea than a plain > > > > > > > > pipe, we will want to use that. > > > > > > > > > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred through the > > > > > > > > pipe (the end indicated by EOF), the front-end invokes this function > > > > > > > > to verify success. There is no in-band way (through the pipe) to > > > > > > > > indicate failure, so we need to check explicitly. > > > > > > > > > > > > > > > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD > > > > > > > > (which includes establishing the direction of transfer and migration > > > > > > > > phase), the sending side writes its data into the pipe, and the reading > > > > > > > > side reads it until it sees an EOF. Then, the front-end will check for > > > > > > > > success via CHECK_DEVICE_STATE, which on the destination side includes > > > > > > > > checking for integrity (i.e. errors during deserialization). 
> > > > > > > > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > > > > > > > > --- > > > > > > > > include/hw/virtio/vhost-backend.h | 24 +++++ > > > > > > > > include/hw/virtio/vhost.h | 79 ++++++++++++++++ > > > > > > > > hw/virtio/vhost-user.c | 147 ++++++++++++++++++++++++++++++ > > > > > > > > hw/virtio/vhost.c | 37 ++++++++ > > > > > > > > 4 files changed, 287 insertions(+) > > > > > > > > > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h > > > > > > > > index ec3fbae58d..5935b32fe3 100644 > > > > > > > > --- a/include/hw/virtio/vhost-backend.h > > > > > > > > +++ b/include/hw/virtio/vhost-backend.h > > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { > > > > > > > > VHOST_SET_CONFIG_TYPE_MIGRATION = 1, > > > > > > > > } VhostSetConfigType; > > > > > > > > > > > > > > > > +typedef enum VhostDeviceStateDirection { > > > > > > > > + /* Transfer state from back-end (device) to front-end */ > > > > > > > > + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, > > > > > > > > + /* Transfer state from front-end to back-end (device) */ > > > > > > > > + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, > > > > > > > > +} VhostDeviceStateDirection; > > > > > > > > + > > > > > > > > +typedef enum VhostDeviceStatePhase { > > > > > > > > + /* The device (and all its vrings) is stopped */ > > > > > > > > + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, > > > > > > > > +} VhostDeviceStatePhase; > > > > > > > > > > > > > > vDPA has: > > > > > > > > > > > > > > /* Suspend a device so it does not process virtqueue requests anymore > > > > > > > * > > > > > > > * After the return of ioctl the device must preserve all the necessary state > > > > > > > * (the virtqueue vring base plus the possible device specific states) that is > > > > > > > * required for restoring in the future. 
The device must not change its > > > > > > > * configuration after that point. > > > > > > > */ > > > > > > > #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) > > > > > > > > > > > > > > /* Resume a device so it can resume processing virtqueue requests > > > > > > > * > > > > > > > * After the return of this ioctl the device will have restored all the > > > > > > > * necessary states and it is fully operational to continue processing the > > > > > > > * virtqueue descriptors. > > > > > > > */ > > > > > > > #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) > > > > > > > > > > > > > > I wonder if it makes sense to import these into vhost-user so that the > > > > > > > difference between kernel vhost and vhost-user is minimized. It's okay > > > > > > > if one of them is ahead of the other, but it would be nice to avoid > > > > > > > overlapping/duplicated functionality. > > > > > > > > > > > > > > > > > > > That's what I had in mind in the first versions. I proposed VHOST_STOP > > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did change > > > > > > to SUSPEND. > > > > > > > > > > I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not > > > > > ioctl(VHOST_VDPA_RESUME). > > > > > > > > > > The doc comments in <linux/vdpa.h> don't explain how the device can > > > > > leave the suspended state. Can you clarify this? > > > > > > > > > > > > > Do you mean in what situations or regarding the semantics of _RESUME? > > > > > > > > To me resume is an operation mainly to resume the device in the event > > > > of a VM suspension, not a migration. It can be used as a fallback code > > > > in some cases of migration failure though, but it is not currently > > > > used in qemu. > > > > > > Is a "VM suspension" the QEMU HMP 'stop' command? > > > > > > I guess the reason why QEMU doesn't call RESUME anywhere is that it > > > resets the device in vhost_dev_stop()? 
> > > > > > > The actual reason for not using RESUME is that the ioctl was added > > after the SUSPEND design in qemu. Same as this proposal, it is was not > > needed at the time. > > > > In the case of vhost-vdpa net, the only usage of suspend is to fetch > > the vq indexes, and in case of error vhost already fetches them from > > guest's used ring way before vDPA, so it has little usage. > > > > > Does it make sense to combine SUSPEND and RESUME with Hanna's > > > SET_DEVICE_STATE_FD? For example, non-iterative migration works like > > > this: > > > - Saving the device's state is done by SUSPEND followed by > > > SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g. > > > savevm command or migration failed), then RESUME is called to > > > continue. > > > > I think the previous steps make sense at vhost_dev_stop, not virtio > > savevm handlers. To start spreading this logic to more places of qemu > > can bring confusion. > > I don't think there is a way around extending the QEMU vhost's code > model. The current model in QEMU's vhost code is that the backend is > reset when the VM stops. This model worked fine for stateless devices > but it doesn't work for stateful devices. > > Imagine a vdpa-gpu device: you cannot reset the device in > vhost_dev_stop() and expect the GPU to continue working when > vhost_dev_start() is called again because all its state has been lost. > The guest driver will send requests that references a virtio-gpu > resources that no longer exist. > > One solution is to save the device's state in vhost_dev_stop(). I think > this is what you're suggesting. It requires keeping a copy of the state > and then loading the state again in vhost_dev_start(). I don't think > this approach should be used because it requires all stateful devices to > support live migration (otherwise they break across HMP 'stop'/'cont'). 
> Also, the device state for some devices may be large and it would also > become more complicated when iterative migration is added. > > Instead, I think the QEMU vhost code needs to be structured so that > struct vhost_dev has a suspended state: > > ,---------. > v | > started ------> stopped > \ ^ > \ | > -> suspended > > The device doesn't lose state when it enters the suspended state. It can > be resumed again. > > This is why I think SUSPEND/RESUME need to be part of the solution. I agree with all of this, especially after realizing vhost_dev_stop is called before the last request of the state in the iterative migration. However I think we can move faster with the virtiofsd migration code, as long as we agree on the vhost-user messages it will receive. This is because we already agree that the state will be sent in one shot and not iteratively, so it will be small. I understand this may change in the future, that's why I proposed to start using iterative right now. However it may make little sense if it is not used in the vhost-user device. I also understand that other devices may have a bigger state so it will be needed for them. > (It's also an argument for not including the phase argument in > SET_DEVICE_STATE_FD because the SUSPEND message is sent during > vhost_dev_stop() separately from saving the device's state.) > > > > - Loading the device's state is done by SUSPEND followed by > > > SET_DEVICE_STATE_FD, followed by RESUME. > > > > > > > I think the restore makes more sense after reset and before driver_ok, > > suspend does not seem a right call there. SUSPEND implies there may be > > other operations before, so the device may have processed some > > requests wrong, as it is not in the right state. > > I find it more elegant to allow SUSPEND -> load -> RESUME if the device > state is saved using SUSPEND -> save -> RESUME since the operations are > symmetrical, but requiring the device to be reset works too. 
Here is my > understanding of your idea in more detail: > > The VIRTIO Device Status Field value must be ACKNOWLEDGE | DRIVER | > FEATURES_OK, any device initialization configuration space writes must > be done, and virtqueues must be configured (Step 7 of 3.1.1 Driver > Requirements in VIRTIO 1.2). > > At that point the device is able to parse the device state and set up > its internal state. Doing it any earlier (before feature negotiation or > virtqueue configuration) places the device in the awkward situation of > having to keep the device state in a buffer and defer loading it until > later, which is complex. > > After device state loading is complete, the DRIVER_OK bit is set to > resume device operation. > > Saving device state is only allowed when the DRIVER_OK bit has been set. > > Does this sound right? > Yes, that is accurate. If you agree that SUSPEND only makes sense after DRIVER_OK, then restoring the state while suspended would complicate the state machine considerably. The device spec is simpler with these restrictions, in my opinion. Thanks!
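The restriction discussed here — state load only between FEATURES_OK and DRIVER_OK, SUSPEND only once DRIVER_OK has been set — can be captured in a toy lifecycle model. Python sketch, purely illustrative; the class and method names are mine, not from any spec:

```python
class DeviceLifecycle:
    """Toy model: load state only after FEATURES_OK and before DRIVER_OK;
    SUSPEND and save only once DRIVER_OK has been set."""

    def __init__(self):
        self.features_ok = False
        self.driver_ok = False
        self.suspended = False

    def negotiate_features(self):
        self.features_ok = True

    def load_state(self):
        # Only validates ordering; a real device would parse the blob here.
        if not self.features_ok or self.driver_ok:
            raise RuntimeError("load only after FEATURES_OK, before DRIVER_OK")

    def set_driver_ok(self):
        self.driver_ok = True

    def suspend(self):
        if not self.driver_ok:
            raise RuntimeError("SUSPEND only makes sense after DRIVER_OK")
        self.suspended = True

    def save_state(self):
        if not self.driver_ok:
            raise RuntimeError("save only allowed after DRIVER_OK")
```

Under these rules there is never a "restore while suspended" path, which is exactly what keeps the state machine simple.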
On Tue, 18 Apr 2023 at 14:31, Eugenio Perez Martin <eperezma@redhat.com> wrote: > > On Tue, Apr 18, 2023 at 7:59 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > On Tue, Apr 18, 2023 at 10:09:30AM +0200, Eugenio Perez Martin wrote: > > > On Mon, Apr 17, 2023 at 9:33 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > > > > > > On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin <eperezma@redhat.com> wrote: > > > > > > > > > > On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > > > > > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote: > > > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > > > > > > > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: > > > > > > > > > So-called "internal" virtio-fs migration refers to transporting the > > > > > > > > > back-end's (virtiofsd's) state through qemu's migration stream. To do > > > > > > > > > this, we need to be able to transfer virtiofsd's internal state to and > > > > > > > > > from virtiofsd. > > > > > > > > > > > > > > > > > > Because virtiofsd's internal state will not be too large, we believe it > > > > > > > > > is best to transfer it as a single binary blob after the streaming > > > > > > > > > phase. Because this method should be useful to other vhost-user > > > > > > > > > implementations, too, it is introduced as a general-purpose addition to > > > > > > > > > the protocol, not limited to vhost-user-fs. > > > > > > > > > > > > > > > > > > These are the additions to the protocol: > > > > > > > > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: > > > > > > > > > This feature signals support for transferring state, and is added so > > > > > > > > > that migration can fail early when the back-end has no support. 
> > > > > > > > > > > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe > > > > > > > > > over which to transfer the state. The front-end sends an FD to the > > > > > > > > > back-end into/from which it can write/read its state, and the back-end > > > > > > > > > can decide to either use it, or reply with a different FD for the > > > > > > > > > front-end to override the front-end's choice. > > > > > > > > > The front-end creates a simple pipe to transfer the state, but maybe > > > > > > > > > the back-end already has an FD into/from which it has to write/read > > > > > > > > > its state, in which case it will want to override the simple pipe. > > > > > > > > > Conversely, maybe in the future we find a way to have the front-end > > > > > > > > > get an immediate FD for the migration stream (in some cases), in which > > > > > > > > > case we will want to send this to the back-end instead of creating a > > > > > > > > > pipe. > > > > > > > > > Hence the negotiation: If one side has a better idea than a plain > > > > > > > > > pipe, we will want to use that. > > > > > > > > > > > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred through the > > > > > > > > > pipe (the end indicated by EOF), the front-end invokes this function > > > > > > > > > to verify success. There is no in-band way (through the pipe) to > > > > > > > > > indicate failure, so we need to check explicitly. > > > > > > > > > > > > > > > > > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD > > > > > > > > > (which includes establishing the direction of transfer and migration > > > > > > > > > phase), the sending side writes its data into the pipe, and the reading > > > > > > > > > side reads it until it sees an EOF. Then, the front-end will check for > > > > > > > > > success via CHECK_DEVICE_STATE, which on the destination side includes > > > > > > > > > checking for integrity (i.e. errors during deserialization). 
> > > > > > > > > > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > > > > > > > > > --- > > > > > > > > > include/hw/virtio/vhost-backend.h | 24 +++++ > > > > > > > > > include/hw/virtio/vhost.h | 79 ++++++++++++++++ > > > > > > > > > hw/virtio/vhost-user.c | 147 ++++++++++++++++++++++++++++++ > > > > > > > > > hw/virtio/vhost.c | 37 ++++++++ > > > > > > > > > 4 files changed, 287 insertions(+) > > > > > > > > > > > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h > > > > > > > > > index ec3fbae58d..5935b32fe3 100644 > > > > > > > > > --- a/include/hw/virtio/vhost-backend.h > > > > > > > > > +++ b/include/hw/virtio/vhost-backend.h > > > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { > > > > > > > > > VHOST_SET_CONFIG_TYPE_MIGRATION = 1, > > > > > > > > > } VhostSetConfigType; > > > > > > > > > > > > > > > > > > +typedef enum VhostDeviceStateDirection { > > > > > > > > > + /* Transfer state from back-end (device) to front-end */ > > > > > > > > > + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, > > > > > > > > > + /* Transfer state from front-end to back-end (device) */ > > > > > > > > > + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, > > > > > > > > > +} VhostDeviceStateDirection; > > > > > > > > > + > > > > > > > > > +typedef enum VhostDeviceStatePhase { > > > > > > > > > + /* The device (and all its vrings) is stopped */ > > > > > > > > > + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, > > > > > > > > > +} VhostDeviceStatePhase; > > > > > > > > > > > > > > > > vDPA has: > > > > > > > > > > > > > > > > /* Suspend a device so it does not process virtqueue requests anymore > > > > > > > > * > > > > > > > > * After the return of ioctl the device must preserve all the necessary state > > > > > > > > * (the virtqueue vring base plus the possible device specific states) that is > > > > > > > > * required for restoring in the future. 
The device must not change its > > > > > > > > * configuration after that point. > > > > > > > > */ > > > > > > > > #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) > > > > > > > > > > > > > > > > /* Resume a device so it can resume processing virtqueue requests > > > > > > > > * > > > > > > > > * After the return of this ioctl the device will have restored all the > > > > > > > > * necessary states and it is fully operational to continue processing the > > > > > > > > * virtqueue descriptors. > > > > > > > > */ > > > > > > > > #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) > > > > > > > > > > > > > > > > I wonder if it makes sense to import these into vhost-user so that the > > > > > > > > difference between kernel vhost and vhost-user is minimized. It's okay > > > > > > > > if one of them is ahead of the other, but it would be nice to avoid > > > > > > > > overlapping/duplicated functionality. > > > > > > > > > > > > > > > > > > > > > > That's what I had in mind in the first versions. I proposed VHOST_STOP > > > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did change > > > > > > > to SUSPEND. > > > > > > > > > > > > I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not > > > > > > ioctl(VHOST_VDPA_RESUME). > > > > > > > > > > > > The doc comments in <linux/vdpa.h> don't explain how the device can > > > > > > leave the suspended state. Can you clarify this? > > > > > > > > > > > > > > > > Do you mean in what situations or regarding the semantics of _RESUME? > > > > > > > > > > To me resume is an operation mainly to resume the device in the event > > > > > of a VM suspension, not a migration. It can be used as a fallback code > > > > > in some cases of migration failure though, but it is not currently > > > > > used in qemu. > > > > > > > > Is a "VM suspension" the QEMU HMP 'stop' command? > > > > > > > > I guess the reason why QEMU doesn't call RESUME anywhere is that it > > > > resets the device in vhost_dev_stop()? 
> > > > > > > > > > The actual reason for not using RESUME is that the ioctl was added > > > after the SUSPEND design in qemu. Same as this proposal, it is was not > > > needed at the time. > > > > > > In the case of vhost-vdpa net, the only usage of suspend is to fetch > > > the vq indexes, and in case of error vhost already fetches them from > > > guest's used ring way before vDPA, so it has little usage. > > > > > > > Does it make sense to combine SUSPEND and RESUME with Hanna's > > > > SET_DEVICE_STATE_FD? For example, non-iterative migration works like > > > > this: > > > > - Saving the device's state is done by SUSPEND followed by > > > > SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g. > > > > savevm command or migration failed), then RESUME is called to > > > > continue. > > > > > > I think the previous steps make sense at vhost_dev_stop, not virtio > > > savevm handlers. To start spreading this logic to more places of qemu > > > can bring confusion. > > > > I don't think there is a way around extending the QEMU vhost's code > > model. The current model in QEMU's vhost code is that the backend is > > reset when the VM stops. This model worked fine for stateless devices > > but it doesn't work for stateful devices. > > > > Imagine a vdpa-gpu device: you cannot reset the device in > > vhost_dev_stop() and expect the GPU to continue working when > > vhost_dev_start() is called again because all its state has been lost. > > The guest driver will send requests that references a virtio-gpu > > resources that no longer exist. > > > > One solution is to save the device's state in vhost_dev_stop(). I think > > this is what you're suggesting. It requires keeping a copy of the state > > and then loading the state again in vhost_dev_start(). I don't think > > this approach should be used because it requires all stateful devices to > > support live migration (otherwise they break across HMP 'stop'/'cont'). 
> > Also, the device state for some devices may be large and it would also > > become more complicated when iterative migration is added. > > > > Instead, I think the QEMU vhost code needs to be structured so that > > struct vhost_dev has a suspended state: > > > > ,---------. > > v | > > started ------> stopped > > \ ^ > > \ | > > -> suspended > > > > The device doesn't lose state when it enters the suspended state. It can > > be resumed again. > > > > This is why I think SUSPEND/RESUME need to be part of the solution. > > I agree with all of this, especially after realizing vhost_dev_stop is > called before the last request of the state in the iterative > migration. > > However I think we can move faster with the virtiofsd migration code, > as long as we agree on the vhost-user messages it will receive. This > is because we already agree that the state will be sent in one shot > and not iteratively, so it will be small. > > I understand this may change in the future, that's why I proposed to > start using iterative right now. However it may make little sense if > it is not used in the vhost-user device. I also understand that other > devices may have a bigger state so it will be needed for them. Can you summarize how you'd like save to work today? I'm not sure what you have in mind. Stefan
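The proposed three-state model for struct vhost_dev quoted above can be expressed as a transition table. Python sketch, illustrative only; the edges are read off the ASCII diagram, plus a suspended -> started edge for the RESUME path implied by "It can be resumed again":

```python
# Edges of the started/stopped/suspended diagram, plus the resume edge.
VALID_TRANSITIONS = {
    ("started", "stopped"),    # device reset when the VM stops for good
    ("stopped", "started"),    # (re)start of the device
    ("started", "suspended"),  # SUSPEND: stop processing, keep all state
    ("suspended", "started"),  # RESUME: continue with retained state
    ("suspended", "stopped"),  # reset from the suspended state
}

def transition(state: str, new: str) -> str:
    if (state, new) not in VALID_TRANSITIONS:
        raise ValueError(f"invalid vhost_dev transition: {state} -> {new}")
    return new
```

The key property is that suspended -> started exists without passing through stopped, i.e. without a device reset that would lose state.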
On 14.04.23 17:17, Eugenio Perez Martin wrote: > On Thu, Apr 13, 2023 at 7:55 PM Hanna Czenczek <hreitz@redhat.com> wrote: [...] >> Basically, what I’m hearing is that I need to implement a different >> feature that has no practical impact right now, and also fix bugs around >> it along the way... >> > To fix this properly requires iterative device migration in qemu as > far as I know, instead of using VMStates [1]. This way the state is > requested to virtiofsd before the device reset. > > What does virtiofsd do when the state is totally sent? Does it keep > processing requests and generating new state or is only a one shot > that will suspend the daemon? If it is the second I think it still can > be done in one shot at the end, always indicating "no more state" at > save_live_pending and sending all the state at > save_live_complete_precopy. This sounds to me as if we should reset all devices during migration, and I don’t understand that. virtiofsd will not immediately process requests when the state is sent, because the device is still stopped, but when it is re-enabled (e.g. because of a failed migration), it will have retained its state and continue processing requests as if nothing happened. A reset would break this and other stateful back-ends, as I think Stefan has mentioned somewhere else. It seems to me as if there are devices that need a reset, and so need suspend+resume around it, but I also think there are back-ends that don’t, where this would only unnecessarily complicate the back-end implementation. Hanna
On 17.04.23 17:12, Stefan Hajnoczi wrote: [...] > This brings to mind how iterative migration will work. The interface for > iterative migration is basically the same as non-iterative migration > plus a method to query the number of bytes remaining. When the number of > bytes falls below a threshold, the vCPUs are stopped and the remainder > of the data is read. > > Some details from VFIO migration: > - The VMM must explicitly change the state when transitioning from > iterative and non-iterative migration, but the data transfer fd > remains the same. > - The state of the device (running, stopped, resuming, etc) doesn't > change asynchronously, it's always driven by the VMM. However, setting > the state can fail and then the new state may be an error state. > > Mapping this to SET_DEVICE_STATE_FD: > - VhostDeviceStatePhase is extended with > VHOST_TRANSFER_STATE_PHASE_RUNNING = 1 for iterative migration. The > frontend sends SET_DEVICE_STATE_FD again with > VHOST_TRANSFER_STATE_PHASE_STOPPED when entering non-iterative > migration and the frontend sends the iterative fd from the previous > SET_DEVICE_STATE_FD call to the backend. The backend may reply with > another fd, if necessary. If the backend changes the fd, then the > contents of the previous fd must be fully read and transferred before > the contents of the new fd are migrated. (Maybe this is too complex > and we should forbid changing the fd when going from RUNNING -> > STOPPED.) > - CHECK_DEVICE_STATE can be extended to report the number of bytes > remaining. The semantics change so that CHECK_DEVICE_STATE can be > called while the VMM is still reading from the fd. It becomes: > > enum CheckDeviceStateResult { > Saving(bytes_remaining : usize), > Failed(error_code : u64), > } Sounds good. 
Personally, I’d forbid changing the FD when just changing state, which raises the question of whether there should then be a separate command for just changing the state (like VFIO_DEVICE_FEATURE ..._MIG_DEVICE_STATE?), but that is a question for later. Changing the CHECK_DEVICE_STATE interface sounds good to me. Hanna
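One way the extended CHECK_DEVICE_STATE reply could be encoded on the wire is sketched below. This is a Python illustration of the two-variant result only — the thread does not fix any tag values or field layout, so everything here is a placeholder:

```python
import struct

# Hypothetical tag values; nothing in the thread assigns these.
TAG_SAVING = 0  # payload = bytes_remaining
TAG_FAILED = 1  # payload = error_code

def encode_check_reply(tag: int, payload: int) -> bytes:
    # Two u64 fields: a discriminant tag followed by the payload.
    return struct.pack("<QQ", tag, payload)

def decode_check_reply(buf: bytes):
    tag, payload = struct.unpack("<QQ", buf)
    if tag == TAG_SAVING:
        return ("saving", payload)
    if tag == TAG_FAILED:
        return ("failed", payload)
    raise ValueError(f"unknown CHECK_DEVICE_STATE tag {tag}")
```

With this shape the front-end can poll CHECK_DEVICE_STATE while still reading from the fd and observe bytes_remaining falling toward zero.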
On Wed, 19 Apr 2023 at 06:45, Hanna Czenczek <hreitz@redhat.com> wrote: > > On 14.04.23 17:17, Eugenio Perez Martin wrote: > > On Thu, Apr 13, 2023 at 7:55 PM Hanna Czenczek <hreitz@redhat.com> wrote: > > [...] > > >> Basically, what I’m hearing is that I need to implement a different > >> feature that has no practical impact right now, and also fix bugs around > >> it along the way... > >> > > To fix this properly requires iterative device migration in qemu as > > far as I know, instead of using VMStates [1]. This way the state is > > requested to virtiofsd before the device reset. > > > > What does virtiofsd do when the state is totally sent? Does it keep > > processing requests and generating new state or is only a one shot > > that will suspend the daemon? If it is the second I think it still can > > be done in one shot at the end, always indicating "no more state" at > > save_live_pending and sending all the state at > > save_live_complete_precopy. > > This sounds to me as if we should reset all devices during migration, > and I don’t understand that. virtiofsd will not immediately process > requests when the state is sent, because the device is still stopped, > but when it is re-enabled (e.g. because of a failed migration), it will > have retained its state and continue processing requests as if nothing > happened. A reset would break this and other stateful back-ends, as I > think Stefan has mentioned somewhere else. > > It seems to me as if there are devices that need a reset, and so need > suspend+resume around it, but I also think there are back-ends that > don’t, where this would only unnecessarily complicate the back-end > implementation. Existing vhost-user backends must continue working, so I think having two code paths is (almost) unavoidable. One approach is to add SUSPEND/RESUME to the vhost-user protocol with a corresponding VHOST_USER_PROTOCOL_F_SUSPEND feature bit. 
vhost-user frontends can identify backends that support SUSPEND/RESUME instead of device reset. Old vhost-user backends will continue to use device reset. I said two code paths are almost unavoidable. It may be possible to rely on existing VHOST_USER_GET_VRING_BASE's semantics (it stops a single virtqueue) instead of SUSPEND. RESUME is replaced by VHOST_USER_SET_VRING_* and gets the device going again. However, I'm not 100% sure whether this will work (even for all existing devices). It would require carefully studying both the spec and various implementations to see if it's viable. There's a chance of losing the performance optimization that VHOST_USER_SET_STATUS provided to DPDK if the device is not reset. In my opinion SUSPEND/RESUME is the cleanest way to do this. Stefan
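The front-end side of the feature negotiation described above could look roughly like this. Python sketch; the feature bit number is a pure placeholder — no VHOST_USER_PROTOCOL_F_SUSPEND bit has actually been assigned in the vhost-user spec:

```python
# Placeholder bit number; a real value would only exist if the feature
# were added to the vhost-user specification.
VHOST_USER_PROTOCOL_F_SUSPEND = 42

def pick_stop_method(backend_protocol_features: int) -> str:
    """Use SUSPEND/RESUME when the back-end advertises support for it;
    fall back to the existing device-reset path otherwise."""
    if backend_protocol_features & (1 << VHOST_USER_PROTOCOL_F_SUSPEND):
        return "suspend/resume"
    return "device reset"
```

This is the "two code paths" point in miniature: old back-ends never set the bit and keep the reset behavior unchanged.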
On 18.04.23 19:59, Stefan Hajnoczi wrote: > On Tue, Apr 18, 2023 at 10:09:30AM +0200, Eugenio Perez Martin wrote: >> On Mon, Apr 17, 2023 at 9:33 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: >>> On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin <eperezma@redhat.com> wrote: >>>> On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: >>>>> On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote: >>>>>> On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: >>>>>>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: >>>>>>>> So-called "internal" virtio-fs migration refers to transporting the >>>>>>>> back-end's (virtiofsd's) state through qemu's migration stream. To do >>>>>>>> this, we need to be able to transfer virtiofsd's internal state to and >>>>>>>> from virtiofsd. >>>>>>>> >>>>>>>> Because virtiofsd's internal state will not be too large, we believe it >>>>>>>> is best to transfer it as a single binary blob after the streaming >>>>>>>> phase. Because this method should be useful to other vhost-user >>>>>>>> implementations, too, it is introduced as a general-purpose addition to >>>>>>>> the protocol, not limited to vhost-user-fs. >>>>>>>> >>>>>>>> These are the additions to the protocol: >>>>>>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: >>>>>>>> This feature signals support for transferring state, and is added so >>>>>>>> that migration can fail early when the back-end has no support. >>>>>>>> >>>>>>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe >>>>>>>> over which to transfer the state. The front-end sends an FD to the >>>>>>>> back-end into/from which it can write/read its state, and the back-end >>>>>>>> can decide to either use it, or reply with a different FD for the >>>>>>>> front-end to override the front-end's choice. 
>>>>>>>> The front-end creates a simple pipe to transfer the state, but maybe >>>>>>>> the back-end already has an FD into/from which it has to write/read >>>>>>>> its state, in which case it will want to override the simple pipe. >>>>>>>> Conversely, maybe in the future we find a way to have the front-end >>>>>>>> get an immediate FD for the migration stream (in some cases), in which >>>>>>>> case we will want to send this to the back-end instead of creating a >>>>>>>> pipe. >>>>>>>> Hence the negotiation: If one side has a better idea than a plain >>>>>>>> pipe, we will want to use that. >>>>>>>> >>>>>>>> - CHECK_DEVICE_STATE: After the state has been transferred through the >>>>>>>> pipe (the end indicated by EOF), the front-end invokes this function >>>>>>>> to verify success. There is no in-band way (through the pipe) to >>>>>>>> indicate failure, so we need to check explicitly. >>>>>>>> >>>>>>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD >>>>>>>> (which includes establishing the direction of transfer and migration >>>>>>>> phase), the sending side writes its data into the pipe, and the reading >>>>>>>> side reads it until it sees an EOF. Then, the front-end will check for >>>>>>>> success via CHECK_DEVICE_STATE, which on the destination side includes >>>>>>>> checking for integrity (i.e. errors during deserialization). 
>>>>>>>>
>>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>>>>>>> ---
>>>>>>>>  include/hw/virtio/vhost-backend.h |  24 +++++
>>>>>>>>  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
>>>>>>>>  hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
>>>>>>>>  hw/virtio/vhost.c                 |  37 ++++++++
>>>>>>>>  4 files changed, 287 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
>>>>>>>> index ec3fbae58d..5935b32fe3 100644
>>>>>>>> --- a/include/hw/virtio/vhost-backend.h
>>>>>>>> +++ b/include/hw/virtio/vhost-backend.h
>>>>>>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
>>>>>>>>      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
>>>>>>>>  } VhostSetConfigType;
>>>>>>>>
>>>>>>>> +typedef enum VhostDeviceStateDirection {
>>>>>>>> +    /* Transfer state from back-end (device) to front-end */
>>>>>>>> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
>>>>>>>> +    /* Transfer state from front-end to back-end (device) */
>>>>>>>> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
>>>>>>>> +} VhostDeviceStateDirection;
>>>>>>>> +
>>>>>>>> +typedef enum VhostDeviceStatePhase {
>>>>>>>> +    /* The device (and all its vrings) is stopped */
>>>>>>>> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
>>>>>>>> +} VhostDeviceStatePhase;
>>>>>>> vDPA has:
>>>>>>>
>>>>>>> /* Suspend a device so it does not process virtqueue requests anymore
>>>>>>>  *
>>>>>>>  * After the return of ioctl the device must preserve all the necessary state
>>>>>>>  * (the virtqueue vring base plus the possible device specific states) that is
>>>>>>>  * required for restoring in the future. The device must not change its
>>>>>>>  * configuration after that point.
>>>>>>>  */
>>>>>>> #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D)
>>>>>>>
>>>>>>> /* Resume a device so it can resume processing virtqueue requests
>>>>>>>  *
>>>>>>>  * After the return of this ioctl the device will have restored all the
>>>>>>>  * necessary states and it is fully operational to continue processing the
>>>>>>>  * virtqueue descriptors.
>>>>>>>  */
>>>>>>> #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E)
>>>>>>>
>>>>>>> I wonder if it makes sense to import these into vhost-user so that the
>>>>>>> difference between kernel vhost and vhost-user is minimized. It's okay
>>>>>>> if one of them is ahead of the other, but it would be nice to avoid
>>>>>>> overlapping/duplicated functionality.
>>>>>>>
>>>>>> That's what I had in mind in the first versions. I proposed VHOST_STOP
>>>>>> instead of VHOST_VDPA_STOP for this very reason. Later it did change
>>>>>> to SUSPEND.
>>>>> I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not
>>>>> ioctl(VHOST_VDPA_RESUME).
>>>>>
>>>>> The doc comments in <linux/vdpa.h> don't explain how the device can
>>>>> leave the suspended state. Can you clarify this?
>>>>>
>>>> Do you mean in what situations or regarding the semantics of _RESUME?
>>>>
>>>> To me resume is an operation mainly to resume the device in the event
>>>> of a VM suspension, not a migration. It can be used as a fallback code
>>>> in some cases of migration failure though, but it is not currently
>>>> used in qemu.
>>> Is a "VM suspension" the QEMU HMP 'stop' command?
>>>
>>> I guess the reason why QEMU doesn't call RESUME anywhere is that it
>>> resets the device in vhost_dev_stop()?
>>>
>> The actual reason for not using RESUME is that the ioctl was added
>> after the SUSPEND design in qemu. Same as this proposal, it was not
>> needed at the time.
>>
>> In the case of vhost-vdpa net, the only usage of suspend is to fetch
>> the vq indexes, and in case of error vhost already fetches them from
>> guest's used ring way before vDPA, so it has little usage.
>>
>>> Does it make sense to combine SUSPEND and RESUME with Hanna's
>>> SET_DEVICE_STATE_FD? For example, non-iterative migration works like
>>> this:
>>> - Saving the device's state is done by SUSPEND followed by
>>>   SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g.
>>>   savevm command or migration failed), then RESUME is called to
>>>   continue.
>> I think the previous steps make sense at vhost_dev_stop, not virtio
>> savevm handlers. To start spreading this logic to more places of qemu
>> can bring confusion.
> I don't think there is a way around extending the QEMU vhost's code
> model. The current model in QEMU's vhost code is that the backend is
> reset when the VM stops. This model worked fine for stateless devices
> but it doesn't work for stateful devices.
>
> Imagine a vdpa-gpu device: you cannot reset the device in
> vhost_dev_stop() and expect the GPU to continue working when
> vhost_dev_start() is called again because all its state has been lost.
> The guest driver will send requests that reference virtio-gpu
> resources that no longer exist.
>
> One solution is to save the device's state in vhost_dev_stop(). I think
> this is what you're suggesting. It requires keeping a copy of the state
> and then loading the state again in vhost_dev_start(). I don't think
> this approach should be used because it requires all stateful devices to
> support live migration (otherwise they break across HMP 'stop'/'cont').
> Also, the device state for some devices may be large and it would also
> become more complicated when iterative migration is added.
>
> Instead, I think the QEMU vhost code needs to be structured so that
> struct vhost_dev has a suspended state:
>
>     ,---------.
>     v         |
>  started ------> stopped
>     \              ^
>      \             |
>       -> suspended
>
> The device doesn't lose state when it enters the suspended state. It can
> be resumed again.
>
> This is why I think SUSPEND/RESUME need to be part of the solution.
> (It's also an argument for not including the phase argument in
> SET_DEVICE_STATE_FD because the SUSPEND message is sent during
> vhost_dev_stop() separately from saving the device's state.)

So let me ask if I understand this protocol correctly: Basically,
SUSPEND would ask the device to fully serialize its internal state,
retain it in some buffer, and RESUME would then deserialize the state
from the buffer, right?

While this state needn’t necessarily be immediately migratable, I
suppose (e.g. one could retain file descriptors there, and it doesn’t
need to be a serialized byte buffer, but could still be structured), it
would basically be a live migration implementation already. As far as I
understand, that’s why you suggest not running a SUSPEND+RESUME cycle on
anything but live migration, right?

I wonder how that model would then work with iterative migration,
though. Basically, for non-iterative migration, the back-end would
expect SUSPEND first to flush its state out to a buffer, and then the
state transfer would just copy from that buffer. For iterative
migration, though, there is no SUSPEND first, so the back-end must
implicitly begin to serialize its state and send it over. I find that a
bit strange.

Also, how would this work with currently migratable stateless
back-ends? Do they already implement SUSPEND+RESUME as no-ops? If not,
I think we should detect stateless back-ends and skip the operations in
qemu lest we have to update those back-ends for no real reason.

Hanna
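[Editor's sketch] The started/stopped/suspended diagram discussed in this mail can be made concrete as a tiny transition table. The enum and the check are invented for illustration — QEMU's struct vhost_dev does not carry these names — but they capture the property under discussion: a suspended device keeps its state and can either be resumed or proceed to stopped.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    DEV_STARTED,
    DEV_STOPPED,
    DEV_SUSPENDED,
} DevRunState;

/* Which transitions the proposed model allows */
static bool transition_ok(DevRunState from, DevRunState to)
{
    switch (from) {
    case DEV_STARTED:
        /* stop directly, or SUSPEND first (state preserved) */
        return to == DEV_STOPPED || to == DEV_SUSPENDED;
    case DEV_SUSPENDED:
        /* RESUME back to started, or finish stopping after the
         * state has been saved */
        return to == DEV_STARTED || to == DEV_STOPPED;
    case DEV_STOPPED:
        /* restart; per the current (reset-based) model, device
         * state did not survive the trip through stopped */
        return to == DEV_STARTED;
    }
    return false;
}
```

The point of the extra state is visible in the table: started→suspended→started round-trips without ever passing through stopped, so nothing is reset.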
On Wed, 19 Apr 2023 at 06:57, Hanna Czenczek <hreitz@redhat.com> wrote: > > On 18.04.23 19:59, Stefan Hajnoczi wrote: > > On Tue, Apr 18, 2023 at 10:09:30AM +0200, Eugenio Perez Martin wrote: > >> On Mon, Apr 17, 2023 at 9:33 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > >>> On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin <eperezma@redhat.com> wrote: > >>>> On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > >>>>> On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote: > >>>>>> On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > >>>>>>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: > >>>>>>>> So-called "internal" virtio-fs migration refers to transporting the > >>>>>>>> back-end's (virtiofsd's) state through qemu's migration stream. To do > >>>>>>>> this, we need to be able to transfer virtiofsd's internal state to and > >>>>>>>> from virtiofsd. > >>>>>>>> > >>>>>>>> Because virtiofsd's internal state will not be too large, we believe it > >>>>>>>> is best to transfer it as a single binary blob after the streaming > >>>>>>>> phase. Because this method should be useful to other vhost-user > >>>>>>>> implementations, too, it is introduced as a general-purpose addition to > >>>>>>>> the protocol, not limited to vhost-user-fs. > >>>>>>>> > >>>>>>>> These are the additions to the protocol: > >>>>>>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: > >>>>>>>> This feature signals support for transferring state, and is added so > >>>>>>>> that migration can fail early when the back-end has no support. > >>>>>>>> > >>>>>>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe > >>>>>>>> over which to transfer the state. 
The front-end sends an FD to the > >>>>>>>> back-end into/from which it can write/read its state, and the back-end > >>>>>>>> can decide to either use it, or reply with a different FD for the > >>>>>>>> front-end to override the front-end's choice. > >>>>>>>> The front-end creates a simple pipe to transfer the state, but maybe > >>>>>>>> the back-end already has an FD into/from which it has to write/read > >>>>>>>> its state, in which case it will want to override the simple pipe. > >>>>>>>> Conversely, maybe in the future we find a way to have the front-end > >>>>>>>> get an immediate FD for the migration stream (in some cases), in which > >>>>>>>> case we will want to send this to the back-end instead of creating a > >>>>>>>> pipe. > >>>>>>>> Hence the negotiation: If one side has a better idea than a plain > >>>>>>>> pipe, we will want to use that. > >>>>>>>> > >>>>>>>> - CHECK_DEVICE_STATE: After the state has been transferred through the > >>>>>>>> pipe (the end indicated by EOF), the front-end invokes this function > >>>>>>>> to verify success. There is no in-band way (through the pipe) to > >>>>>>>> indicate failure, so we need to check explicitly. > >>>>>>>> > >>>>>>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD > >>>>>>>> (which includes establishing the direction of transfer and migration > >>>>>>>> phase), the sending side writes its data into the pipe, and the reading > >>>>>>>> side reads it until it sees an EOF. Then, the front-end will check for > >>>>>>>> success via CHECK_DEVICE_STATE, which on the destination side includes > >>>>>>>> checking for integrity (i.e. errors during deserialization). 
> >>>>>>>> > >>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > >>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > >>>>>>>> --- > >>>>>>>> include/hw/virtio/vhost-backend.h | 24 +++++ > >>>>>>>> include/hw/virtio/vhost.h | 79 ++++++++++++++++ > >>>>>>>> hw/virtio/vhost-user.c | 147 ++++++++++++++++++++++++++++++ > >>>>>>>> hw/virtio/vhost.c | 37 ++++++++ > >>>>>>>> 4 files changed, 287 insertions(+) > >>>>>>>> > >>>>>>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h > >>>>>>>> index ec3fbae58d..5935b32fe3 100644 > >>>>>>>> --- a/include/hw/virtio/vhost-backend.h > >>>>>>>> +++ b/include/hw/virtio/vhost-backend.h > >>>>>>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { > >>>>>>>> VHOST_SET_CONFIG_TYPE_MIGRATION = 1, > >>>>>>>> } VhostSetConfigType; > >>>>>>>> > >>>>>>>> +typedef enum VhostDeviceStateDirection { > >>>>>>>> + /* Transfer state from back-end (device) to front-end */ > >>>>>>>> + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, > >>>>>>>> + /* Transfer state from front-end to back-end (device) */ > >>>>>>>> + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, > >>>>>>>> +} VhostDeviceStateDirection; > >>>>>>>> + > >>>>>>>> +typedef enum VhostDeviceStatePhase { > >>>>>>>> + /* The device (and all its vrings) is stopped */ > >>>>>>>> + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, > >>>>>>>> +} VhostDeviceStatePhase; > >>>>>>> vDPA has: > >>>>>>> > >>>>>>> /* Suspend a device so it does not process virtqueue requests anymore > >>>>>>> * > >>>>>>> * After the return of ioctl the device must preserve all the necessary state > >>>>>>> * (the virtqueue vring base plus the possible device specific states) that is > >>>>>>> * required for restoring in the future. The device must not change its > >>>>>>> * configuration after that point. 
> >>>>>>> */ > >>>>>>> #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) > >>>>>>> > >>>>>>> /* Resume a device so it can resume processing virtqueue requests > >>>>>>> * > >>>>>>> * After the return of this ioctl the device will have restored all the > >>>>>>> * necessary states and it is fully operational to continue processing the > >>>>>>> * virtqueue descriptors. > >>>>>>> */ > >>>>>>> #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) > >>>>>>> > >>>>>>> I wonder if it makes sense to import these into vhost-user so that the > >>>>>>> difference between kernel vhost and vhost-user is minimized. It's okay > >>>>>>> if one of them is ahead of the other, but it would be nice to avoid > >>>>>>> overlapping/duplicated functionality. > >>>>>>> > >>>>>> That's what I had in mind in the first versions. I proposed VHOST_STOP > >>>>>> instead of VHOST_VDPA_STOP for this very reason. Later it did change > >>>>>> to SUSPEND. > >>>>> I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not > >>>>> ioctl(VHOST_VDPA_RESUME). > >>>>> > >>>>> The doc comments in <linux/vdpa.h> don't explain how the device can > >>>>> leave the suspended state. Can you clarify this? > >>>>> > >>>> Do you mean in what situations or regarding the semantics of _RESUME? > >>>> > >>>> To me resume is an operation mainly to resume the device in the event > >>>> of a VM suspension, not a migration. It can be used as a fallback code > >>>> in some cases of migration failure though, but it is not currently > >>>> used in qemu. > >>> Is a "VM suspension" the QEMU HMP 'stop' command? > >>> > >>> I guess the reason why QEMU doesn't call RESUME anywhere is that it > >>> resets the device in vhost_dev_stop()? > >>> > >> The actual reason for not using RESUME is that the ioctl was added > >> after the SUSPEND design in qemu. Same as this proposal, it is was not > >> needed at the time. 
> >> > >> In the case of vhost-vdpa net, the only usage of suspend is to fetch > >> the vq indexes, and in case of error vhost already fetches them from > >> guest's used ring way before vDPA, so it has little usage. > >> > >>> Does it make sense to combine SUSPEND and RESUME with Hanna's > >>> SET_DEVICE_STATE_FD? For example, non-iterative migration works like > >>> this: > >>> - Saving the device's state is done by SUSPEND followed by > >>> SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g. > >>> savevm command or migration failed), then RESUME is called to > >>> continue. > >> I think the previous steps make sense at vhost_dev_stop, not virtio > >> savevm handlers. To start spreading this logic to more places of qemu > >> can bring confusion. > > I don't think there is a way around extending the QEMU vhost's code > > model. The current model in QEMU's vhost code is that the backend is > > reset when the VM stops. This model worked fine for stateless devices > > but it doesn't work for stateful devices. > > > > Imagine a vdpa-gpu device: you cannot reset the device in > > vhost_dev_stop() and expect the GPU to continue working when > > vhost_dev_start() is called again because all its state has been lost. > > The guest driver will send requests that references a virtio-gpu > > resources that no longer exist. > > > > One solution is to save the device's state in vhost_dev_stop(). I think > > this is what you're suggesting. It requires keeping a copy of the state > > and then loading the state again in vhost_dev_start(). I don't think > > this approach should be used because it requires all stateful devices to > > support live migration (otherwise they break across HMP 'stop'/'cont'). > > Also, the device state for some devices may be large and it would also > > become more complicated when iterative migration is added. 
> >
> > Instead, I think the QEMU vhost code needs to be structured so that
> > struct vhost_dev has a suspended state:
> >
> >     ,---------.
> >     v         |
> >  started ------> stopped
> >     \              ^
> >      \             |
> >       -> suspended
> >
> > The device doesn't lose state when it enters the suspended state. It can
> > be resumed again.
> >
> > This is why I think SUSPEND/RESUME need to be part of the solution.
> > (It's also an argument for not including the phase argument in
> > SET_DEVICE_STATE_FD because the SUSPEND message is sent during
> > vhost_dev_stop() separately from saving the device's state.)
>
> So let me ask if I understand this protocol correctly: Basically,
> SUSPEND would ask the device to fully serialize its internal state,
> retain it in some buffer, and RESUME would then deserialize the state
> from the buffer, right?

That's not how I understand SUSPEND/RESUME. I was thinking that SUSPEND
pauses device operation so that virtqueues are no longer processed and
no other events occur (e.g. VIRTIO Configuration Change Notifications).
RESUME continues device operation. Neither command is directly related
to device state serialization but SUSPEND freezes the device state,
while RESUME allows the device state to change again.

> While this state needn’t necessarily be immediately migratable, I
> suppose (e.g. one could retain file descriptors there, and it doesn’t
> need to be a serialized byte buffer, but could still be structured), it
> would basically be a live migration implementation already. As far as I
> understand, that’s why you suggest not running a SUSPEND+RESUME cycle on
> anything but live migration, right?

No, SUSPEND/RESUME would also be used across vm_stop()/vm_start(). That
way stateful devices are no longer reset across HMP 'stop'/'cont' (we're
lucky it even works for most existing vhost-user backends today and
that's just because they don't yet implement VHOST_USER_SET_STATUS).

> I wonder how that model would then work with iterative migration,
> though. Basically, for non-iterative migration, the back-end would
> expect SUSPEND first to flush its state out to a buffer, and then the
> state transfer would just copy from that buffer. For iterative
> migration, though, there is no SUSPEND first, so the back-end must
> implicitly begin to serialize its state and send it over. I find that a
> bit strange.

I expected SET_DEVICE_STATE_FD to be sent while the device is still
running for iterative migration. Device state chunks are saved while the
device is still operating.

When the VMM decides to stop the guest, it sends SUSPEND to freeze the
device. The remainder of the device state can then be read from the fd
in the knowledge that the size is now finite.

After migration completes, the device is still suspended on the source.
If migration failed, RESUME is sent to continue running on the source.

> Also, how would this work with currently migratable stateless
> back-ends? Do they already implement SUSPEND+RESUME as no-ops? If not,
> I think we should detect stateless back-ends and skip the operations in
> qemu lest we have to update those back-ends for no real reason.

Yes, I think backwards compatibility is a requirement, too. The
vhost-user frontend checks the SUSPEND vhost-user protocol feature bit.
If the bit is cleared, then it must assume this device is stateless and
use device reset operations. Otherwise it can use SUSPEND/RESUME.

Stefan
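[Editor's sketch] The backwards-compatibility rule in the last paragraph — check a SUSPEND protocol feature bit, fall back to today's reset behaviour when the back-end does not advertise it — amounts to a one-line decision. The bit number 62 below is made up purely for illustration; a real bit would be assigned by the vhost-user specification, and no such bit exists at the time of this thread.

```c
#include <assert.h>
#include <stdint.h>

#define VHOST_USER_PROTOCOL_F_SUSPEND_SKETCH 62 /* hypothetical bit */

typedef enum {
    STOP_BY_RESET,   /* stateless back-end: reset, as done today */
    STOP_BY_SUSPEND, /* stateful back-end: SUSPEND, state preserved */
} StopMethod;

/* Front-end decision when the VM stops, based on the negotiated
 * protocol feature mask */
static StopMethod pick_stop_method(uint64_t protocol_features)
{
    uint64_t bit = 1ULL << VHOST_USER_PROTOCOL_F_SUSPEND_SKETCH;
    return (protocol_features & bit) ? STOP_BY_SUSPEND : STOP_BY_RESET;
}
```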
On 18.04.23 09:54, Eugenio Perez Martin wrote: > On Mon, Apr 17, 2023 at 9:21 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: >> On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com> wrote: >>> On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: >>>> On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote: >>>>> On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: >>>>>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: >>>>>>> So-called "internal" virtio-fs migration refers to transporting the >>>>>>> back-end's (virtiofsd's) state through qemu's migration stream. To do >>>>>>> this, we need to be able to transfer virtiofsd's internal state to and >>>>>>> from virtiofsd. >>>>>>> >>>>>>> Because virtiofsd's internal state will not be too large, we believe it >>>>>>> is best to transfer it as a single binary blob after the streaming >>>>>>> phase. Because this method should be useful to other vhost-user >>>>>>> implementations, too, it is introduced as a general-purpose addition to >>>>>>> the protocol, not limited to vhost-user-fs. >>>>>>> >>>>>>> These are the additions to the protocol: >>>>>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: >>>>>>> This feature signals support for transferring state, and is added so >>>>>>> that migration can fail early when the back-end has no support. >>>>>>> >>>>>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe >>>>>>> over which to transfer the state. The front-end sends an FD to the >>>>>>> back-end into/from which it can write/read its state, and the back-end >>>>>>> can decide to either use it, or reply with a different FD for the >>>>>>> front-end to override the front-end's choice. 
>>>>>>> The front-end creates a simple pipe to transfer the state, but maybe >>>>>>> the back-end already has an FD into/from which it has to write/read >>>>>>> its state, in which case it will want to override the simple pipe. >>>>>>> Conversely, maybe in the future we find a way to have the front-end >>>>>>> get an immediate FD for the migration stream (in some cases), in which >>>>>>> case we will want to send this to the back-end instead of creating a >>>>>>> pipe. >>>>>>> Hence the negotiation: If one side has a better idea than a plain >>>>>>> pipe, we will want to use that. >>>>>>> >>>>>>> - CHECK_DEVICE_STATE: After the state has been transferred through the >>>>>>> pipe (the end indicated by EOF), the front-end invokes this function >>>>>>> to verify success. There is no in-band way (through the pipe) to >>>>>>> indicate failure, so we need to check explicitly. >>>>>>> >>>>>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD >>>>>>> (which includes establishing the direction of transfer and migration >>>>>>> phase), the sending side writes its data into the pipe, and the reading >>>>>>> side reads it until it sees an EOF. Then, the front-end will check for >>>>>>> success via CHECK_DEVICE_STATE, which on the destination side includes >>>>>>> checking for integrity (i.e. errors during deserialization). 
>>>>>>> >>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> >>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> >>>>>>> --- >>>>>>> include/hw/virtio/vhost-backend.h | 24 +++++ >>>>>>> include/hw/virtio/vhost.h | 79 ++++++++++++++++ >>>>>>> hw/virtio/vhost-user.c | 147 ++++++++++++++++++++++++++++++ >>>>>>> hw/virtio/vhost.c | 37 ++++++++ >>>>>>> 4 files changed, 287 insertions(+) >>>>>>> >>>>>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h >>>>>>> index ec3fbae58d..5935b32fe3 100644 >>>>>>> --- a/include/hw/virtio/vhost-backend.h >>>>>>> +++ b/include/hw/virtio/vhost-backend.h >>>>>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { >>>>>>> VHOST_SET_CONFIG_TYPE_MIGRATION = 1, >>>>>>> } VhostSetConfigType; >>>>>>> >>>>>>> +typedef enum VhostDeviceStateDirection { >>>>>>> + /* Transfer state from back-end (device) to front-end */ >>>>>>> + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, >>>>>>> + /* Transfer state from front-end to back-end (device) */ >>>>>>> + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, >>>>>>> +} VhostDeviceStateDirection; >>>>>>> + >>>>>>> +typedef enum VhostDeviceStatePhase { >>>>>>> + /* The device (and all its vrings) is stopped */ >>>>>>> + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, >>>>>>> +} VhostDeviceStatePhase; >>>>>> vDPA has: >>>>>> >>>>>> /* Suspend a device so it does not process virtqueue requests anymore >>>>>> * >>>>>> * After the return of ioctl the device must preserve all the necessary state >>>>>> * (the virtqueue vring base plus the possible device specific states) that is >>>>>> * required for restoring in the future. The device must not change its >>>>>> * configuration after that point. 
>>>>>>  */
>>>>>> #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D)
>>>>>>
>>>>>> /* Resume a device so it can resume processing virtqueue requests
>>>>>>  *
>>>>>>  * After the return of this ioctl the device will have restored all the
>>>>>>  * necessary states and it is fully operational to continue processing the
>>>>>>  * virtqueue descriptors.
>>>>>>  */
>>>>>> #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E)
>>>>>>
>>>>>> I wonder if it makes sense to import these into vhost-user so that the
>>>>>> difference between kernel vhost and vhost-user is minimized. It's okay
>>>>>> if one of them is ahead of the other, but it would be nice to avoid
>>>>>> overlapping/duplicated functionality.
>>>>>>
>>>>> That's what I had in mind in the first versions. I proposed VHOST_STOP
>>>>> instead of VHOST_VDPA_STOP for this very reason. Later it did change
>>>>> to SUSPEND.
>>>>>
>>>>> Generally it is better if we make the interface less parametrized and
>>>>> we trust in the messages and its semantics in my opinion. In other
>>>>> words, instead of
>>>>> vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), send
>>>>> individually the equivalent of VHOST_VDPA_SUSPEND vhost-user command.
>>>>>
>>>>> Another way to apply this is with the "direction" parameter. Maybe it
>>>>> is better to split it into "set_state_fd" and "get_state_fd"?
>>>>>
>>>>> In that case, reusing the ioctls as vhost-user messages would be ok.
>>>>> But that puts this proposal further from the VFIO code, which uses
>>>>> "migration_set_state(state)", and maybe it is better when the number
>>>>> of states is high.
>>>> Hi Eugenio,
>>>> Another question about vDPA suspend/resume:
>>>>
>>>> /* Host notifiers must be enabled at this point. */
>>>> void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings)
>>>> {
>>>>     int i;
>>>>
>>>>     /* should only be called after backend is connected */
>>>>     assert(hdev->vhost_ops);
>>>>     event_notifier_test_and_clear(
>>>>         &hdev->vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier);
>>>>     event_notifier_test_and_clear(&vdev->config_notifier);
>>>>
>>>>     trace_vhost_dev_stop(hdev, vdev->name, vrings);
>>>>
>>>>     if (hdev->vhost_ops->vhost_dev_start) {
>>>>         hdev->vhost_ops->vhost_dev_start(hdev, false);
>>>>         ^^^ SUSPEND ^^^
>>>>     }
>>>>     if (vrings) {
>>>>         vhost_dev_set_vring_enable(hdev, false);
>>>>     }
>>>>     for (i = 0; i < hdev->nvqs; ++i) {
>>>>         vhost_virtqueue_stop(hdev,
>>>>                              vdev,
>>>>                              hdev->vqs + i,
>>>>                              hdev->vq_index + i);
>>>>         ^^^ fetch virtqueue state from kernel ^^^
>>>>     }
>>>>     if (hdev->vhost_ops->vhost_reset_status) {
>>>>         hdev->vhost_ops->vhost_reset_status(hdev);
>>>>         ^^^ reset device^^^
>>>>
>>>> I noticed the QEMU vDPA code resets the device in vhost_dev_stop() ->
>>>> vhost_reset_status(). The device's migration code runs after
>>>> vhost_dev_stop() and the state will have been lost.
>>>>
>>> vhost_virtqueue_stop saves the vq state (indexes, vring base) in the
>>> qemu VirtIONet device model. This is for all vhost backends.
>>>
>>> Regarding the state like mac or mq configuration, SVQ runs for all the
>>> VM run in the CVQ. So it can track all of that status in the device
>>> model too.
>>>
>>> When a migration effectively occurs, all the frontend state is
>>> migrated as a regular emulated device. To route all of the state in a
>>> normalized way for qemu is what leaves open the possibility to do
>>> cross-backends migrations, etc.
>>>
>>> Does that answer your question?
>> I think you're confirming that changes would be necessary in order for
>> vDPA to support the save/load operation that Hanna is introducing.
>>
> Yes, this first iteration was centered on net, with an eye on block,
> where state can be routed through classical emulated devices. This is
> how vhost-kernel and vhost-user do classically. And it allows
> cross-backend, to not modify qemu migration state, etc.
>
> To introduce this opaque state to qemu, that must be fetched after the
> suspend and not before, requires changes in vhost protocol, as
> discussed previously.
>
>>>> It looks like vDPA changes are necessary in order to support stateful
>>>> devices even though QEMU already uses SUSPEND. Is my understanding
>>>> correct?
>>>>
>>> Changes are required elsewhere, as the code to restore the state
>>> properly in the destination has not been merged.
>> I'm not sure what you mean by elsewhere?
>>
> I meant for vdpa *net* devices the changes are not required in vdpa
> ioctls, but mostly in qemu.
>
> If you meant stateful as "it must have a state blob that it must be
> opaque to qemu", then I think the straightforward action is to fetch
> state blob about the same time as vq indexes. But yes, changes (at
> least a new ioctl) is needed for that.
>
>> I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and
>> then VHOST_VDPA_SET_STATUS 0.
>>
>> In order to save device state from the vDPA device in the future, it
>> will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that
>> the device state can be saved before the device is reset.
>>
>> Does that sound right?
>>
> The split between suspend and reset was added recently for that very
> reason. In all the virtio devices, the frontend is initialized before
> the backend, so I don't think it is a good idea to defer the backend
> cleanup. Especially if we have already set the state is small enough
> to not needing iterative migration from virtiofsd point of view.
>
> If fetching that state at the same time as vq indexes is not valid,
> could it follow the same model as the "in-flight descriptors"?
> vhost-user follows them by using a shared memory region where their
> state is tracked [1]. This allows qemu to survive vhost-user SW
> backend crashes, and does not forbid the cross-backends live migration
> as all the information is there to recover them.
>
> For hw devices this is not convenient as it occupies PCI bandwidth. So
> a possibility is to synchronize this memory region after a
> synchronization point, being the SUSPEND call or GET_VRING_BASE. HW
> devices are not going to crash in the software sense, so all use cases
> remain the same to qemu. And that shared memory information is
> recoverable after vhost_dev_stop.
>
> Does that sound reasonable to virtiofsd? To offer a shared memory
> region where it dumps the state, maybe only after the
> set_state(STATE_PHASE_STOPPED)?

I don’t think we need the set_state() call, necessarily, if SUSPEND is
mandatory anyway.

As for the shared memory, the RFC before this series used shared memory,
so it’s possible, yes. But “shared memory region” can mean a lot of
things – it sounds like you’re saying the back-end (virtiofsd) should
provide it to the front-end, is that right? That could work like this:

On the source side:

S1. SUSPEND goes to virtiofsd
S2. virtiofsd maybe double-checks that the device is stopped, then
    serializes its state into a newly allocated shared memory area[1]
S3. virtiofsd responds to SUSPEND
S4. front-end requests shared memory, virtiofsd responds with a handle,
    maybe already closes its reference
S5. front-end saves state, closes its handle, freeing the SHM

[1] Maybe virtiofsd can correctly calculate the serialized state’s size,
then it can immediately allocate this area and serialize directly into
it; maybe it can’t, then we’ll need a bounce buffer. Not really a
fundamental problem, but there are limitations around what you can do
with serde implementations in Rust…

On the destination side:

D1. Optional SUSPEND goes to virtiofsd that hasn’t yet done much;
    virtiofsd would serialize its empty state into an SHM area, and
    respond to SUSPEND
D2. front-end reads state from migration stream into an SHM it has
    allocated
D3. front-end supplies this SHM to virtiofsd, which discards its
    previous area, and now uses this one
D4. RESUME goes to virtiofsd, which deserializes the state from the SHM

Couple of questions:

A. Stefan suggested D1, but it does seem wasteful now. But if SUSPEND
would imply to deserialize a state, and the state is to be transferred
through SHM, this is what would need to be done. So maybe we should
skip SUSPEND on the destination?

B. You described that the back-end should supply the SHM, which works
well on the source. On the destination, only the front-end knows how
big the state is, so I’ve decided above that it should allocate the SHM
(D2) and provide it to the back-end. Is that feasible, or is it
important (e.g. for real hardware) that the back-end supplies the SHM?
(In which case the front-end would need to tell the back-end how big
the state SHM needs to be.)

Hanna
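[Editor's sketch] The shared-memory variant in steps S1-S5 above — the back-end serializes into an FD-backed memory area, hands the FD over, and closing the last handle frees the area — can be sketched on Linux with memfd_create(). The helper names are invented for illustration; nothing here is from the proposed protocol.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Back-end side (S2): allocate an area sized to the serialized state
 * and serialize directly into it */
static int backend_dump_state(const void *state, size_t len)
{
    int fd = memfd_create("device-state", 0);
    if (fd < 0) {
        return -1;
    }
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return -1;
    }
    void *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }
    memcpy(map, state, len);
    munmap(map, len);
    return fd; /* the handle handed to the front-end in S4 */
}

/* Front-end side (S5): map the handle, copy the state out, close the
 * last reference so the SHM area is freed */
static ssize_t frontend_read_state(int fd, void *buf, size_t len)
{
    void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        return -1;
    }
    memcpy(buf, map, len);
    munmap(map, len);
    close(fd);
    return (ssize_t)len;
}
```

Note how this sketch mirrors question B: in this direction the back-end sizes and allocates the area, which is exactly what the front-end cannot rely on during load, where only it knows the incoming state's size.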
On 19.04.23 13:10, Stefan Hajnoczi wrote: > On Wed, 19 Apr 2023 at 06:57, Hanna Czenczek <hreitz@redhat.com> wrote: >> On 18.04.23 19:59, Stefan Hajnoczi wrote: >>> On Tue, Apr 18, 2023 at 10:09:30AM +0200, Eugenio Perez Martin wrote: >>>> On Mon, Apr 17, 2023 at 9:33 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: >>>>> On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin <eperezma@redhat.com> wrote: >>>>>> On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: >>>>>>> On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote: >>>>>>>> On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: >>>>>>>>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: >>>>>>>>>> So-called "internal" virtio-fs migration refers to transporting the >>>>>>>>>> back-end's (virtiofsd's) state through qemu's migration stream. To do >>>>>>>>>> this, we need to be able to transfer virtiofsd's internal state to and >>>>>>>>>> from virtiofsd. >>>>>>>>>> >>>>>>>>>> Because virtiofsd's internal state will not be too large, we believe it >>>>>>>>>> is best to transfer it as a single binary blob after the streaming >>>>>>>>>> phase. Because this method should be useful to other vhost-user >>>>>>>>>> implementations, too, it is introduced as a general-purpose addition to >>>>>>>>>> the protocol, not limited to vhost-user-fs. >>>>>>>>>> >>>>>>>>>> These are the additions to the protocol: >>>>>>>>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: >>>>>>>>>> This feature signals support for transferring state, and is added so >>>>>>>>>> that migration can fail early when the back-end has no support. >>>>>>>>>> >>>>>>>>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe >>>>>>>>>> over which to transfer the state. 
The front-end sends an FD to the >>>>>>>>>> back-end into/from which it can write/read its state, and the back-end >>>>>>>>>> can decide to either use it, or reply with a different FD for the >>>>>>>>>> front-end to override the front-end's choice. >>>>>>>>>> The front-end creates a simple pipe to transfer the state, but maybe >>>>>>>>>> the back-end already has an FD into/from which it has to write/read >>>>>>>>>> its state, in which case it will want to override the simple pipe. >>>>>>>>>> Conversely, maybe in the future we find a way to have the front-end >>>>>>>>>> get an immediate FD for the migration stream (in some cases), in which >>>>>>>>>> case we will want to send this to the back-end instead of creating a >>>>>>>>>> pipe. >>>>>>>>>> Hence the negotiation: If one side has a better idea than a plain >>>>>>>>>> pipe, we will want to use that. >>>>>>>>>> >>>>>>>>>> - CHECK_DEVICE_STATE: After the state has been transferred through the >>>>>>>>>> pipe (the end indicated by EOF), the front-end invokes this function >>>>>>>>>> to verify success. There is no in-band way (through the pipe) to >>>>>>>>>> indicate failure, so we need to check explicitly. >>>>>>>>>> >>>>>>>>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD >>>>>>>>>> (which includes establishing the direction of transfer and migration >>>>>>>>>> phase), the sending side writes its data into the pipe, and the reading >>>>>>>>>> side reads it until it sees an EOF. Then, the front-end will check for >>>>>>>>>> success via CHECK_DEVICE_STATE, which on the destination side includes >>>>>>>>>> checking for integrity (i.e. errors during deserialization). 
>>>>>>>>>> >>>>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> >>>>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> >>>>>>>>>> --- >>>>>>>>>> include/hw/virtio/vhost-backend.h | 24 +++++ >>>>>>>>>> include/hw/virtio/vhost.h | 79 ++++++++++++++++ >>>>>>>>>> hw/virtio/vhost-user.c | 147 ++++++++++++++++++++++++++++++ >>>>>>>>>> hw/virtio/vhost.c | 37 ++++++++ >>>>>>>>>> 4 files changed, 287 insertions(+) >>>>>>>>>> >>>>>>>>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h >>>>>>>>>> index ec3fbae58d..5935b32fe3 100644 >>>>>>>>>> --- a/include/hw/virtio/vhost-backend.h >>>>>>>>>> +++ b/include/hw/virtio/vhost-backend.h >>>>>>>>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { >>>>>>>>>> VHOST_SET_CONFIG_TYPE_MIGRATION = 1, >>>>>>>>>> } VhostSetConfigType; >>>>>>>>>> >>>>>>>>>> +typedef enum VhostDeviceStateDirection { >>>>>>>>>> + /* Transfer state from back-end (device) to front-end */ >>>>>>>>>> + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, >>>>>>>>>> + /* Transfer state from front-end to back-end (device) */ >>>>>>>>>> + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, >>>>>>>>>> +} VhostDeviceStateDirection; >>>>>>>>>> + >>>>>>>>>> +typedef enum VhostDeviceStatePhase { >>>>>>>>>> + /* The device (and all its vrings) is stopped */ >>>>>>>>>> + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, >>>>>>>>>> +} VhostDeviceStatePhase; >>>>>>>>> vDPA has: >>>>>>>>> >>>>>>>>> /* Suspend a device so it does not process virtqueue requests anymore >>>>>>>>> * >>>>>>>>> * After the return of ioctl the device must preserve all the necessary state >>>>>>>>> * (the virtqueue vring base plus the possible device specific states) that is >>>>>>>>> * required for restoring in the future. The device must not change its >>>>>>>>> * configuration after that point. 
>>>>>>>>> */ >>>>>>>>> #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) >>>>>>>>> >>>>>>>>> /* Resume a device so it can resume processing virtqueue requests >>>>>>>>> * >>>>>>>>> * After the return of this ioctl the device will have restored all the >>>>>>>>> * necessary states and it is fully operational to continue processing the >>>>>>>>> * virtqueue descriptors. >>>>>>>>> */ >>>>>>>>> #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) >>>>>>>>> >>>>>>>>> I wonder if it makes sense to import these into vhost-user so that the >>>>>>>>> difference between kernel vhost and vhost-user is minimized. It's okay >>>>>>>>> if one of them is ahead of the other, but it would be nice to avoid >>>>>>>>> overlapping/duplicated functionality. >>>>>>>>> >>>>>>>> That's what I had in mind in the first versions. I proposed VHOST_STOP >>>>>>>> instead of VHOST_VDPA_STOP for this very reason. Later it did change >>>>>>>> to SUSPEND. >>>>>>> I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not >>>>>>> ioctl(VHOST_VDPA_RESUME). >>>>>>> >>>>>>> The doc comments in <linux/vdpa.h> don't explain how the device can >>>>>>> leave the suspended state. Can you clarify this? >>>>>>> >>>>>> Do you mean in what situations or regarding the semantics of _RESUME? >>>>>> >>>>>> To me resume is an operation mainly to resume the device in the event >>>>>> of a VM suspension, not a migration. It can be used as a fallback code >>>>>> in some cases of migration failure though, but it is not currently >>>>>> used in qemu. >>>>> Is a "VM suspension" the QEMU HMP 'stop' command? >>>>> >>>>> I guess the reason why QEMU doesn't call RESUME anywhere is that it >>>>> resets the device in vhost_dev_stop()? >>>>> >>>> The actual reason for not using RESUME is that the ioctl was added >>>> after the SUSPEND design in qemu. Same as this proposal, it is was not >>>> needed at the time. 
>>>> >>>> In the case of vhost-vdpa net, the only usage of suspend is to fetch >>>> the vq indexes, and in case of error vhost already fetches them from >>>> guest's used ring way before vDPA, so it has little usage. >>>> >>>>> Does it make sense to combine SUSPEND and RESUME with Hanna's >>>>> SET_DEVICE_STATE_FD? For example, non-iterative migration works like >>>>> this: >>>>> - Saving the device's state is done by SUSPEND followed by >>>>> SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g. >>>>> savevm command or migration failed), then RESUME is called to >>>>> continue. >>>> I think the previous steps make sense at vhost_dev_stop, not virtio >>>> savevm handlers. To start spreading this logic to more places of qemu >>>> can bring confusion. >>> I don't think there is a way around extending the QEMU vhost's code >>> model. The current model in QEMU's vhost code is that the backend is >>> reset when the VM stops. This model worked fine for stateless devices >>> but it doesn't work for stateful devices. >>> >>> Imagine a vdpa-gpu device: you cannot reset the device in >>> vhost_dev_stop() and expect the GPU to continue working when >>> vhost_dev_start() is called again because all its state has been lost. >>> The guest driver will send requests that references a virtio-gpu >>> resources that no longer exist. >>> >>> One solution is to save the device's state in vhost_dev_stop(). I think >>> this is what you're suggesting. It requires keeping a copy of the state >>> and then loading the state again in vhost_dev_start(). I don't think >>> this approach should be used because it requires all stateful devices to >>> support live migration (otherwise they break across HMP 'stop'/'cont'). >>> Also, the device state for some devices may be large and it would also >>> become more complicated when iterative migration is added. 
>>>
>>> Instead, I think the QEMU vhost code needs to be structured so that
>>> struct vhost_dev has a suspended state:
>>>
>>>        ,---------.
>>>        v         |
>>>    started ------> stopped
>>>        \             ^
>>>         \            |
>>>          -> suspended
>>>
>>> The device doesn't lose state when it enters the suspended state. It can
>>> be resumed again.
>>>
>>> This is why I think SUSPEND/RESUME need to be part of the solution.
>>> (It's also an argument for not including the phase argument in
>>> SET_DEVICE_STATE_FD because the SUSPEND message is sent during
>>> vhost_dev_stop() separately from saving the device's state.)

>> So let me ask if I understand this protocol correctly: Basically,
>> SUSPEND would ask the device to fully serialize its internal state,
>> retain it in some buffer, and RESUME would then deserialize the state
>> from the buffer, right?

> That's not how I understand SUSPEND/RESUME. I was thinking that
> SUSPEND pauses device operation so that virtqueues are no longer
> processed and no other events occur (e.g. VIRTIO Configuration Change
> Notifications). RESUME continues device operation. Neither command is
> directly related to device state serialization but SUSPEND freezes the
> device state, while RESUME allows the device state to change again.

I understood that a reset would basically reset all internal state, which is why SUSPEND+RESUME were required around it, to retain the state.

>> While this state needn’t necessarily be immediately migratable, I
>> suppose (e.g. one could retain file descriptors there, and it doesn’t
>> need to be a serialized byte buffer, but could still be structured), it
>> would basically be a live migration implementation already. As far as I
>> understand, that’s why you suggest not running a SUSPEND+RESUME cycle on
>> anything but live migration, right?

> No, SUSPEND/RESUME would also be used across vm_stop()/vm_start().
> That way stateful devices are no longer reset across HMP 'stop'/'cont'
> (we're lucky it even works for most existing vhost-user backends today
> and that's just because they don't yet implement
> VHOST_USER_SET_STATUS).

So that’s what I seem to misunderstand: If stateful devices are reset, how does SUSPEND+RESUME prevent that?

>> I wonder how that model would then work with iterative migration,
>> though. Basically, for non-iterative migration, the back-end would
>> expect SUSPEND first to flush its state out to a buffer, and then the
>> state transfer would just copy from that buffer. For iterative
>> migration, though, there is no SUSPEND first, so the back-end must
>> implicitly begin to serialize its state and send it over. I find that a
>> bit strange.

> I expected SET_DEVICE_STATE_FD to be sent while the device is still
> running for iterative migration. Device state chunks are saved while
> the device is still operating.
>
> When the VMM decides to stop the guest, it sends SUSPEND to freeze the
> device. The remainder of the device state can then be read from the fd
> in the knowledge that the size is now finite.
>
> After migration completes, the device is still suspended on the
> source. If migration failed, RESUME is sent to continue running on the
> source.

Sure, that makes perfect sense as long as SUSPEND/RESUME are unrelated to device state serialization.

>> Also, how would this work with currently migratable stateless
>> back-ends? Do they already implement SUSPEND+RESUME as no-ops? If not,
>> I think we should detect stateless back-ends and skip the operations in
>> qemu lest we have to update those back-ends for no real reason.

> Yes, I think backwards compatibility is a requirement, too. The
> vhost-user frontend checks the SUSPEND vhost-user protocol feature
> bit. If the bit is cleared, then it must assume this device is
> stateless and use device reset operations. Otherwise it can use
> SUSPEND/RESUME.
Yes, all stateful devices should currently block migration, so we could require them to implement SUSPEND/RESUME, and assume that any that don’t are stateless.

Hanna
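The feature-bit fallback Stefan describes could look something like the following sketch. The bit name and number are placeholders, not part of the current vhost-user specification; a real SUSPEND/RESUME protocol feature would need a bit allocated in the spec.

```python
# Placeholder: a hypothetical protocol feature bit signalling that the
# back-end implements SUSPEND/RESUME. Not a real allocated bit.
VHOST_USER_PROTOCOL_F_SUSPEND = 17


def stop_backend(protocol_features: int) -> str:
    """Decide how the front-end stops the back-end in vhost_dev_stop()."""
    if protocol_features & (1 << VHOST_USER_PROTOCOL_F_SUSPEND):
        # Stateful-capable back-end: freeze it; state is preserved and
        # the device can be RESUMEd later (e.g. after HMP 'stop'/'cont').
        return "SUSPEND"
    # Bit cleared: assume a stateless back-end and keep the existing
    # reset behaviour for backwards compatibility.
    return "RESET"
```

Back-ends that never negotiate the bit keep today's reset path unchanged, which is what makes the scheme backwards compatible.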
On Wed, 19 Apr 2023 at 07:10, Hanna Czenczek <hreitz@redhat.com> wrote: > > On 18.04.23 09:54, Eugenio Perez Martin wrote: > > On Mon, Apr 17, 2023 at 9:21 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > >> On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com> wrote: > >>> On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > >>>> On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote: > >>>>> On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > >>>>>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: > >>>>>>> So-called "internal" virtio-fs migration refers to transporting the > >>>>>>> back-end's (virtiofsd's) state through qemu's migration stream. To do > >>>>>>> this, we need to be able to transfer virtiofsd's internal state to and > >>>>>>> from virtiofsd. > >>>>>>> > >>>>>>> Because virtiofsd's internal state will not be too large, we believe it > >>>>>>> is best to transfer it as a single binary blob after the streaming > >>>>>>> phase. Because this method should be useful to other vhost-user > >>>>>>> implementations, too, it is introduced as a general-purpose addition to > >>>>>>> the protocol, not limited to vhost-user-fs. > >>>>>>> > >>>>>>> These are the additions to the protocol: > >>>>>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: > >>>>>>> This feature signals support for transferring state, and is added so > >>>>>>> that migration can fail early when the back-end has no support. > >>>>>>> > >>>>>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe > >>>>>>> over which to transfer the state. The front-end sends an FD to the > >>>>>>> back-end into/from which it can write/read its state, and the back-end > >>>>>>> can decide to either use it, or reply with a different FD for the > >>>>>>> front-end to override the front-end's choice. 
> >>>>>>> The front-end creates a simple pipe to transfer the state, but maybe > >>>>>>> the back-end already has an FD into/from which it has to write/read > >>>>>>> its state, in which case it will want to override the simple pipe. > >>>>>>> Conversely, maybe in the future we find a way to have the front-end > >>>>>>> get an immediate FD for the migration stream (in some cases), in which > >>>>>>> case we will want to send this to the back-end instead of creating a > >>>>>>> pipe. > >>>>>>> Hence the negotiation: If one side has a better idea than a plain > >>>>>>> pipe, we will want to use that. > >>>>>>> > >>>>>>> - CHECK_DEVICE_STATE: After the state has been transferred through the > >>>>>>> pipe (the end indicated by EOF), the front-end invokes this function > >>>>>>> to verify success. There is no in-band way (through the pipe) to > >>>>>>> indicate failure, so we need to check explicitly. > >>>>>>> > >>>>>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD > >>>>>>> (which includes establishing the direction of transfer and migration > >>>>>>> phase), the sending side writes its data into the pipe, and the reading > >>>>>>> side reads it until it sees an EOF. Then, the front-end will check for > >>>>>>> success via CHECK_DEVICE_STATE, which on the destination side includes > >>>>>>> checking for integrity (i.e. errors during deserialization). 
> >>>>>>> > >>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > >>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > >>>>>>> --- > >>>>>>> include/hw/virtio/vhost-backend.h | 24 +++++ > >>>>>>> include/hw/virtio/vhost.h | 79 ++++++++++++++++ > >>>>>>> hw/virtio/vhost-user.c | 147 ++++++++++++++++++++++++++++++ > >>>>>>> hw/virtio/vhost.c | 37 ++++++++ > >>>>>>> 4 files changed, 287 insertions(+) > >>>>>>> > >>>>>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h > >>>>>>> index ec3fbae58d..5935b32fe3 100644 > >>>>>>> --- a/include/hw/virtio/vhost-backend.h > >>>>>>> +++ b/include/hw/virtio/vhost-backend.h > >>>>>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { > >>>>>>> VHOST_SET_CONFIG_TYPE_MIGRATION = 1, > >>>>>>> } VhostSetConfigType; > >>>>>>> > >>>>>>> +typedef enum VhostDeviceStateDirection { > >>>>>>> + /* Transfer state from back-end (device) to front-end */ > >>>>>>> + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, > >>>>>>> + /* Transfer state from front-end to back-end (device) */ > >>>>>>> + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, > >>>>>>> +} VhostDeviceStateDirection; > >>>>>>> + > >>>>>>> +typedef enum VhostDeviceStatePhase { > >>>>>>> + /* The device (and all its vrings) is stopped */ > >>>>>>> + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, > >>>>>>> +} VhostDeviceStatePhase; > >>>>>> vDPA has: > >>>>>> > >>>>>> /* Suspend a device so it does not process virtqueue requests anymore > >>>>>> * > >>>>>> * After the return of ioctl the device must preserve all the necessary state > >>>>>> * (the virtqueue vring base plus the possible device specific states) that is > >>>>>> * required for restoring in the future. The device must not change its > >>>>>> * configuration after that point. 
> >>>>>> */ > >>>>>> #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) > >>>>>> > >>>>>> /* Resume a device so it can resume processing virtqueue requests > >>>>>> * > >>>>>> * After the return of this ioctl the device will have restored all the > >>>>>> * necessary states and it is fully operational to continue processing the > >>>>>> * virtqueue descriptors. > >>>>>> */ > >>>>>> #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) > >>>>>> > >>>>>> I wonder if it makes sense to import these into vhost-user so that the > >>>>>> difference between kernel vhost and vhost-user is minimized. It's okay > >>>>>> if one of them is ahead of the other, but it would be nice to avoid > >>>>>> overlapping/duplicated functionality. > >>>>>> > >>>>> That's what I had in mind in the first versions. I proposed VHOST_STOP > >>>>> instead of VHOST_VDPA_STOP for this very reason. Later it did change > >>>>> to SUSPEND. > >>>>> > >>>>> Generally it is better if we make the interface less parametrized and > >>>>> we trust in the messages and its semantics in my opinion. In other > >>>>> words, instead of > >>>>> vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), send > >>>>> individually the equivalent of VHOST_VDPA_SUSPEND vhost-user command. > >>>>> > >>>>> Another way to apply this is with the "direction" parameter. Maybe it > >>>>> is better to split it into "set_state_fd" and "get_state_fd"? > >>>>> > >>>>> In that case, reusing the ioctls as vhost-user messages would be ok. > >>>>> But that puts this proposal further from the VFIO code, which uses > >>>>> "migration_set_state(state)", and maybe it is better when the number > >>>>> of states is high. > >>>> Hi Eugenio, > >>>> Another question about vDPA suspend/resume: > >>>> > >>>> /* Host notifiers must be enabled at this point. 
*/ > >>>> void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings) > >>>> { > >>>> int i; > >>>> > >>>> /* should only be called after backend is connected */ > >>>> assert(hdev->vhost_ops); > >>>> event_notifier_test_and_clear( > >>>> &hdev->vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier); > >>>> event_notifier_test_and_clear(&vdev->config_notifier); > >>>> > >>>> trace_vhost_dev_stop(hdev, vdev->name, vrings); > >>>> > >>>> if (hdev->vhost_ops->vhost_dev_start) { > >>>> hdev->vhost_ops->vhost_dev_start(hdev, false); > >>>> ^^^ SUSPEND ^^^ > >>>> } > >>>> if (vrings) { > >>>> vhost_dev_set_vring_enable(hdev, false); > >>>> } > >>>> for (i = 0; i < hdev->nvqs; ++i) { > >>>> vhost_virtqueue_stop(hdev, > >>>> vdev, > >>>> hdev->vqs + i, > >>>> hdev->vq_index + i); > >>>> ^^^ fetch virtqueue state from kernel ^^^ > >>>> } > >>>> if (hdev->vhost_ops->vhost_reset_status) { > >>>> hdev->vhost_ops->vhost_reset_status(hdev); > >>>> ^^^ reset device^^^ > >>>> > >>>> I noticed the QEMU vDPA code resets the device in vhost_dev_stop() -> > >>>> vhost_reset_status(). The device's migration code runs after > >>>> vhost_dev_stop() and the state will have been lost. > >>>> > >>> vhost_virtqueue_stop saves the vq state (indexes, vring base) in the > >>> qemu VirtIONet device model. This is for all vhost backends. > >>> > >>> Regarding the state like mac or mq configuration, SVQ runs for all the > >>> VM run in the CVQ. So it can track all of that status in the device > >>> model too. > >>> > >>> When a migration effectively occurs, all the frontend state is > >>> migrated as a regular emulated device. To route all of the state in a > >>> normalized way for qemu is what leaves open the possibility to do > >>> cross-backends migrations, etc. > >>> > >>> Does that answer your question? > >> I think you're confirming that changes would be necessary in order for > >> vDPA to support the save/load operation that Hanna is introducing. 
> >> > > Yes, this first iteration was centered on net, with an eye on block, > > where state can be routed through classical emulated devices. This is > > how vhost-kernel and vhost-user do classically. And it allows > > cross-backend, to not modify qemu migration state, etc. > > > > To introduce this opaque state to qemu, that must be fetched after the > > suspend and not before, requires changes in vhost protocol, as > > discussed previously. > > > >>>> It looks like vDPA changes are necessary in order to support stateful > >>>> devices even though QEMU already uses SUSPEND. Is my understanding > >>>> correct? > >>>> > >>> Changes are required elsewhere, as the code to restore the state > >>> properly in the destination has not been merged. > >> I'm not sure what you mean by elsewhere? > >> > > I meant for vdpa *net* devices the changes are not required in vdpa > > ioctls, but mostly in qemu. > > > > If you meant stateful as "it must have a state blob that it must be > > opaque to qemu", then I think the straightforward action is to fetch > > state blob about the same time as vq indexes. But yes, changes (at > > least a new ioctl) is needed for that. > > > >> I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and > >> then VHOST_VDPA_SET_STATUS 0. > >> > >> In order to save device state from the vDPA device in the future, it > >> will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that > >> the device state can be saved before the device is reset. > >> > >> Does that sound right? > >> > > The split between suspend and reset was added recently for that very > > reason. In all the virtio devices, the frontend is initialized before > > the backend, so I don't think it is a good idea to defer the backend > > cleanup. Especially if we have already set the state is small enough > > to not needing iterative migration from virtiofsd point of view. 
> > > > If fetching that state at the same time as vq indexes is not valid, > > could it follow the same model as the "in-flight descriptors"? > > vhost-user follows them by using a shared memory region where their > > state is tracked [1]. This allows qemu to survive vhost-user SW > > backend crashes, and does not forbid the cross-backends live migration > > as all the information is there to recover them. > > > > For hw devices this is not convenient as it occupies PCI bandwidth. So > > a possibility is to synchronize this memory region after a > > synchronization point, being the SUSPEND call or GET_VRING_BASE. HW > > devices are not going to crash in the software sense, so all use cases > > remain the same to qemu. And that shared memory information is > > recoverable after vhost_dev_stop. > > > > Does that sound reasonable to virtiofsd? To offer a shared memory > > region where it dumps the state, maybe only after the > > set_state(STATE_PHASE_STOPPED)? > > I don’t think we need the set_state() call, necessarily, if SUSPEND is > mandatory anyway. > > As for the shared memory, the RFC before this series used shared memory, > so it’s possible, yes. But “shared memory region” can mean a lot of > things – it sounds like you’re saying the back-end (virtiofsd) should > provide it to the front-end, is that right? That could work like this: > > On the source side: > > S1. SUSPEND goes to virtiofsd > S2. virtiofsd maybe double-checks that the device is stopped, then > serializes its state into a newly allocated shared memory area[1] > S3. virtiofsd responds to SUSPEND > S4. front-end requests shared memory, virtiofsd responds with a handle, > maybe already closes its reference > S5. front-end saves state, closes its handle, freeing the SHM > > [1] Maybe virtiofsd can correctly size the serialized state’s size, then > it can immediately allocate this area and serialize directly into it; > maybe it can’t, then we’ll need a bounce buffer. 
Not really a > fundamental problem, but there are limitations around what you can do > with serde implementations in Rust… > > On the destination side: > > D1. Optional SUSPEND goes to virtiofsd that hasn’t yet done much; > virtiofsd would serialize its empty state into an SHM area, and respond > to SUSPEND > D2. front-end reads state from migration stream into an SHM it has allocated > D3. front-end supplies this SHM to virtiofsd, which discards its > previous area, and now uses this one > D4. RESUME goes to virtiofsd, which deserializes the state from the SHM > > Couple of questions: > > A. Stefan suggested D1, but it does seem wasteful now. But if SUSPEND > would imply to deserialize a state, and the state is to be transferred > through SHM, this is what would need to be done. So maybe we should > skip SUSPEND on the destination? > B. You described that the back-end should supply the SHM, which works > well on the source. On the destination, only the front-end knows how > big the state is, so I’ve decided above that it should allocate the SHM > (D2) and provide it to the back-end. Is that feasible or is it > important (e.g. for real hardware) that the back-end supplies the SHM? > (In which case the front-end would need to tell the back-end how big the > state SHM needs to be.) How does this work for iterative live migration? Stefan
On 19.04.23 13:21, Stefan Hajnoczi wrote: > On Wed, 19 Apr 2023 at 07:10, Hanna Czenczek <hreitz@redhat.com> wrote: >> On 18.04.23 09:54, Eugenio Perez Martin wrote: >>> On Mon, Apr 17, 2023 at 9:21 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: >>>> On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com> wrote: >>>>> On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: >>>>>> On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote: >>>>>>> On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: >>>>>>>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: >>>>>>>>> So-called "internal" virtio-fs migration refers to transporting the >>>>>>>>> back-end's (virtiofsd's) state through qemu's migration stream. To do >>>>>>>>> this, we need to be able to transfer virtiofsd's internal state to and >>>>>>>>> from virtiofsd. >>>>>>>>> >>>>>>>>> Because virtiofsd's internal state will not be too large, we believe it >>>>>>>>> is best to transfer it as a single binary blob after the streaming >>>>>>>>> phase. Because this method should be useful to other vhost-user >>>>>>>>> implementations, too, it is introduced as a general-purpose addition to >>>>>>>>> the protocol, not limited to vhost-user-fs. >>>>>>>>> >>>>>>>>> These are the additions to the protocol: >>>>>>>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: >>>>>>>>> This feature signals support for transferring state, and is added so >>>>>>>>> that migration can fail early when the back-end has no support. >>>>>>>>> >>>>>>>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe >>>>>>>>> over which to transfer the state. The front-end sends an FD to the >>>>>>>>> back-end into/from which it can write/read its state, and the back-end >>>>>>>>> can decide to either use it, or reply with a different FD for the >>>>>>>>> front-end to override the front-end's choice. 
>>>>>>>>> The front-end creates a simple pipe to transfer the state, but maybe >>>>>>>>> the back-end already has an FD into/from which it has to write/read >>>>>>>>> its state, in which case it will want to override the simple pipe. >>>>>>>>> Conversely, maybe in the future we find a way to have the front-end >>>>>>>>> get an immediate FD for the migration stream (in some cases), in which >>>>>>>>> case we will want to send this to the back-end instead of creating a >>>>>>>>> pipe. >>>>>>>>> Hence the negotiation: If one side has a better idea than a plain >>>>>>>>> pipe, we will want to use that. >>>>>>>>> >>>>>>>>> - CHECK_DEVICE_STATE: After the state has been transferred through the >>>>>>>>> pipe (the end indicated by EOF), the front-end invokes this function >>>>>>>>> to verify success. There is no in-band way (through the pipe) to >>>>>>>>> indicate failure, so we need to check explicitly. >>>>>>>>> >>>>>>>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD >>>>>>>>> (which includes establishing the direction of transfer and migration >>>>>>>>> phase), the sending side writes its data into the pipe, and the reading >>>>>>>>> side reads it until it sees an EOF. Then, the front-end will check for >>>>>>>>> success via CHECK_DEVICE_STATE, which on the destination side includes >>>>>>>>> checking for integrity (i.e. errors during deserialization). 
>>>>>>>>> >>>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> >>>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> >>>>>>>>> --- >>>>>>>>> include/hw/virtio/vhost-backend.h | 24 +++++ >>>>>>>>> include/hw/virtio/vhost.h | 79 ++++++++++++++++ >>>>>>>>> hw/virtio/vhost-user.c | 147 ++++++++++++++++++++++++++++++ >>>>>>>>> hw/virtio/vhost.c | 37 ++++++++ >>>>>>>>> 4 files changed, 287 insertions(+) >>>>>>>>> >>>>>>>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h >>>>>>>>> index ec3fbae58d..5935b32fe3 100644 >>>>>>>>> --- a/include/hw/virtio/vhost-backend.h >>>>>>>>> +++ b/include/hw/virtio/vhost-backend.h >>>>>>>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { >>>>>>>>> VHOST_SET_CONFIG_TYPE_MIGRATION = 1, >>>>>>>>> } VhostSetConfigType; >>>>>>>>> >>>>>>>>> +typedef enum VhostDeviceStateDirection { >>>>>>>>> + /* Transfer state from back-end (device) to front-end */ >>>>>>>>> + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, >>>>>>>>> + /* Transfer state from front-end to back-end (device) */ >>>>>>>>> + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, >>>>>>>>> +} VhostDeviceStateDirection; >>>>>>>>> + >>>>>>>>> +typedef enum VhostDeviceStatePhase { >>>>>>>>> + /* The device (and all its vrings) is stopped */ >>>>>>>>> + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, >>>>>>>>> +} VhostDeviceStatePhase; >>>>>>>> vDPA has: >>>>>>>> >>>>>>>> /* Suspend a device so it does not process virtqueue requests anymore >>>>>>>> * >>>>>>>> * After the return of ioctl the device must preserve all the necessary state >>>>>>>> * (the virtqueue vring base plus the possible device specific states) that is >>>>>>>> * required for restoring in the future. The device must not change its >>>>>>>> * configuration after that point. 
>>>>>>>> */ >>>>>>>> #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) >>>>>>>> >>>>>>>> /* Resume a device so it can resume processing virtqueue requests >>>>>>>> * >>>>>>>> * After the return of this ioctl the device will have restored all the >>>>>>>> * necessary states and it is fully operational to continue processing the >>>>>>>> * virtqueue descriptors. >>>>>>>> */ >>>>>>>> #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) >>>>>>>> >>>>>>>> I wonder if it makes sense to import these into vhost-user so that the >>>>>>>> difference between kernel vhost and vhost-user is minimized. It's okay >>>>>>>> if one of them is ahead of the other, but it would be nice to avoid >>>>>>>> overlapping/duplicated functionality. >>>>>>>> >>>>>>> That's what I had in mind in the first versions. I proposed VHOST_STOP >>>>>>> instead of VHOST_VDPA_STOP for this very reason. Later it did change >>>>>>> to SUSPEND. >>>>>>> >>>>>>> Generally it is better if we make the interface less parametrized and >>>>>>> we trust in the messages and its semantics in my opinion. In other >>>>>>> words, instead of >>>>>>> vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), send >>>>>>> individually the equivalent of VHOST_VDPA_SUSPEND vhost-user command. >>>>>>> >>>>>>> Another way to apply this is with the "direction" parameter. Maybe it >>>>>>> is better to split it into "set_state_fd" and "get_state_fd"? >>>>>>> >>>>>>> In that case, reusing the ioctls as vhost-user messages would be ok. >>>>>>> But that puts this proposal further from the VFIO code, which uses >>>>>>> "migration_set_state(state)", and maybe it is better when the number >>>>>>> of states is high. >>>>>> Hi Eugenio, >>>>>> Another question about vDPA suspend/resume: >>>>>> >>>>>> /* Host notifiers must be enabled at this point. 
*/ >>>>>> void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings) >>>>>> { >>>>>> int i; >>>>>> >>>>>> /* should only be called after backend is connected */ >>>>>> assert(hdev->vhost_ops); >>>>>> event_notifier_test_and_clear( >>>>>> &hdev->vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier); >>>>>> event_notifier_test_and_clear(&vdev->config_notifier); >>>>>> >>>>>> trace_vhost_dev_stop(hdev, vdev->name, vrings); >>>>>> >>>>>> if (hdev->vhost_ops->vhost_dev_start) { >>>>>> hdev->vhost_ops->vhost_dev_start(hdev, false); >>>>>> ^^^ SUSPEND ^^^ >>>>>> } >>>>>> if (vrings) { >>>>>> vhost_dev_set_vring_enable(hdev, false); >>>>>> } >>>>>> for (i = 0; i < hdev->nvqs; ++i) { >>>>>> vhost_virtqueue_stop(hdev, >>>>>> vdev, >>>>>> hdev->vqs + i, >>>>>> hdev->vq_index + i); >>>>>> ^^^ fetch virtqueue state from kernel ^^^ >>>>>> } >>>>>> if (hdev->vhost_ops->vhost_reset_status) { >>>>>> hdev->vhost_ops->vhost_reset_status(hdev); >>>>>> ^^^ reset device^^^ >>>>>> >>>>>> I noticed the QEMU vDPA code resets the device in vhost_dev_stop() -> >>>>>> vhost_reset_status(). The device's migration code runs after >>>>>> vhost_dev_stop() and the state will have been lost. >>>>>> >>>>> vhost_virtqueue_stop saves the vq state (indexes, vring base) in the >>>>> qemu VirtIONet device model. This is for all vhost backends. >>>>> >>>>> Regarding the state like mac or mq configuration, SVQ runs for all the >>>>> VM run in the CVQ. So it can track all of that status in the device >>>>> model too. >>>>> >>>>> When a migration effectively occurs, all the frontend state is >>>>> migrated as a regular emulated device. To route all of the state in a >>>>> normalized way for qemu is what leaves open the possibility to do >>>>> cross-backends migrations, etc. >>>>> >>>>> Does that answer your question? >>>> I think you're confirming that changes would be necessary in order for >>>> vDPA to support the save/load operation that Hanna is introducing. 
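The ordering problem raised above — today's sequence is SUSPEND followed by VHOST_VDPA_SET_STATUS 0, so the reset wipes the device state before it can be fetched — can be modeled with a toy sketch. All names here are invented for illustration; this is not the real QEMU vhost_ops interface:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the ordering constraint: the opaque device state must be
 * fetched after SUSPEND but before the device is reset. */
typedef struct {
    bool suspended;
    bool state_saved;
    bool reset_done;
} ToyDev;

static void toy_suspend(ToyDev *d)
{
    d->suspended = true;
}

static void toy_save_state(ToyDev *d)
{
    /* Saving only works while suspended and before the reset wipes it. */
    if (d->suspended && !d->reset_done) {
        d->state_saved = true;
    }
}

static void toy_reset(ToyDev *d)
{
    d->reset_done = true;
}

/* Proposed order: SUSPEND -> fetch vring bases -> save state -> reset. */
static bool toy_stop_and_save(ToyDev *d)
{
    toy_suspend(d);
    /* ... GET_VRING_BASE for each vq would happen here ... */
    toy_save_state(d);   /* must precede the reset */
    toy_reset(d);        /* the VHOST_VDPA_SET_STATUS 0 equivalent */
    return d->state_saved;
}
```

The sketch only captures the ordering: doing the reset first, as the current code does, leaves `state_saved` false.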
>>>> >>> Yes, this first iteration was centered on net, with an eye on block, >>> where state can be routed through classical emulated devices. This is >>> how vhost-kernel and vhost-user have classically done it. And it allows >>> cross-backend migration, avoids modifying qemu migration state, etc. >>> >>> Introducing this opaque state to qemu, which must be fetched after the >>> suspend and not before, requires changes in the vhost protocol, as >>> discussed previously. >>> >>>>>> It looks like vDPA changes are necessary in order to support stateful >>>>>> devices even though QEMU already uses SUSPEND. Is my understanding >>>>>> correct? >>>>>> >>>>> Changes are required elsewhere, as the code to restore the state >>>>> properly in the destination has not been merged. >>>> I'm not sure what you mean by elsewhere? >>>> >>> I meant that for vdpa *net* devices the changes are not required in vdpa >>> ioctls, but mostly in qemu. >>> >>> If you meant stateful as "it must have a state blob that must be >>> opaque to qemu", then I think the straightforward action is to fetch >>> the state blob at about the same time as the vq indexes. But yes, changes (at >>> least a new ioctl) are needed for that. >>> >>>> I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and >>>> then VHOST_VDPA_SET_STATUS 0. >>>> >>>> In order to save device state from the vDPA device in the future, it >>>> will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that >>>> the device state can be saved before the device is reset. >>>> >>>> Does that sound right? >>>> >>> The split between suspend and reset was added recently for that very >>> reason. In all the virtio devices, the frontend is initialized before >>> the backend, so I don't think it is a good idea to defer the backend >>> cleanup. Especially since we have already established that the state is small >>> enough not to need iterative migration from virtiofsd's point of view. 
>>> >>> If fetching that state at the same time as vq indexes is not valid, >>> could it follow the same model as the "in-flight descriptors"? >>> vhost-user follows them by using a shared memory region where their >>> state is tracked [1]. This allows qemu to survive vhost-user SW >>> backend crashes, and does not forbid the cross-backends live migration >>> as all the information is there to recover them. >>> >>> For hw devices this is not convenient as it occupies PCI bandwidth. So >>> a possibility is to synchronize this memory region after a >>> synchronization point, being the SUSPEND call or GET_VRING_BASE. HW >>> devices are not going to crash in the software sense, so all use cases >>> remain the same to qemu. And that shared memory information is >>> recoverable after vhost_dev_stop. >>> >>> Does that sound reasonable to virtiofsd? To offer a shared memory >>> region where it dumps the state, maybe only after the >>> set_state(STATE_PHASE_STOPPED)? >> I don’t think we need the set_state() call, necessarily, if SUSPEND is >> mandatory anyway. >> >> As for the shared memory, the RFC before this series used shared memory, >> so it’s possible, yes. But “shared memory region” can mean a lot of >> things – it sounds like you’re saying the back-end (virtiofsd) should >> provide it to the front-end, is that right? That could work like this: >> >> On the source side: >> >> S1. SUSPEND goes to virtiofsd >> S2. virtiofsd maybe double-checks that the device is stopped, then >> serializes its state into a newly allocated shared memory area[1] >> S3. virtiofsd responds to SUSPEND >> S4. front-end requests shared memory, virtiofsd responds with a handle, >> maybe already closes its reference >> S5. front-end saves state, closes its handle, freeing the SHM >> >> [1] Maybe virtiofsd can correctly size the serialized state’s size, then >> it can immediately allocate this area and serialize directly into it; >> maybe it can’t, then we’ll need a bounce buffer. 
Not really a >> fundamental problem, but there are limitations around what you can do >> with serde implementations in Rust… >> >> On the destination side: >> >> D1. Optional SUSPEND goes to virtiofsd that hasn’t yet done much; >> virtiofsd would serialize its empty state into an SHM area, and respond >> to SUSPEND >> D2. front-end reads state from migration stream into an SHM it has allocated >> D3. front-end supplies this SHM to virtiofsd, which discards its >> previous area, and now uses this one >> D4. RESUME goes to virtiofsd, which deserializes the state from the SHM >> >> Couple of questions: >> >> A. Stefan suggested D1, but it does seem wasteful now. But if SUSPEND >> would imply to deserialize a state, and the state is to be transferred >> through SHM, this is what would need to be done. So maybe we should >> skip SUSPEND on the destination? >> B. You described that the back-end should supply the SHM, which works >> well on the source. On the destination, only the front-end knows how >> big the state is, so I’ve decided above that it should allocate the SHM >> (D2) and provide it to the back-end. Is that feasible or is it >> important (e.g. for real hardware) that the back-end supplies the SHM? >> (In which case the front-end would need to tell the back-end how big the >> state SHM needs to be.) > How does this work for iterative live migration? Right, probably not at all. :) Hanna
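For contrast with the SHM scheme above, the pipe-based transfer that the series proposes can be sketched as follows. `backend_save_state()` is a stand-in for the back-end on the other end of SET_DEVICE_STATE_FD; the actual vhost-user message plumbing is omitted:

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

/* Stand-in for the back-end serializing its opaque state. */
static void backend_save_state(int wfd)
{
    static const char state[] = "opaque-backend-state";
    ssize_t w = write(wfd, state, sizeof(state));
    (void)w;            /* a real back-end would handle short writes */
    close(wfd);         /* closing produces the EOF that ends the transfer */
}

/* Front-end side: create a plain pipe, hand the write end over, read
 * the state until EOF. Returns the number of bytes received, -1 on error. */
static ssize_t frontend_save_state(char *buf, size_t cap)
{
    int fds[2];
    ssize_t total = 0, n;

    if (pipe(fds) < 0) {
        return -1;
    }
    /* Real code would pass fds[1] in the SET_DEVICE_STATE_FD message
     * (and the back-end could reply with a different fd to override it). */
    backend_save_state(fds[1]);
    while ((n = read(fds[0], buf + total, cap - (size_t)total)) > 0) {
        total += n;
    }
    close(fds[0]);
    /* Real code would now send CHECK_DEVICE_STATE to learn whether the
     * back-end considers the transfer successful. */
    return total;
}
```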
On Wed, 19 Apr 2023 at 07:16, Hanna Czenczek <hreitz@redhat.com> wrote: > > On 19.04.23 13:10, Stefan Hajnoczi wrote: > > On Wed, 19 Apr 2023 at 06:57, Hanna Czenczek <hreitz@redhat.com> wrote: > >> On 18.04.23 19:59, Stefan Hajnoczi wrote: > >>> On Tue, Apr 18, 2023 at 10:09:30AM +0200, Eugenio Perez Martin wrote: > >>>> On Mon, Apr 17, 2023 at 9:33 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > >>>>> On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin <eperezma@redhat.com> wrote: > >>>>>> On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > >>>>>>> On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote: > >>>>>>>> On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > >>>>>>>>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: > >>>>>>>>>> So-called "internal" virtio-fs migration refers to transporting the > >>>>>>>>>> back-end's (virtiofsd's) state through qemu's migration stream. To do > >>>>>>>>>> this, we need to be able to transfer virtiofsd's internal state to and > >>>>>>>>>> from virtiofsd. > >>>>>>>>>> > >>>>>>>>>> Because virtiofsd's internal state will not be too large, we believe it > >>>>>>>>>> is best to transfer it as a single binary blob after the streaming > >>>>>>>>>> phase. Because this method should be useful to other vhost-user > >>>>>>>>>> implementations, too, it is introduced as a general-purpose addition to > >>>>>>>>>> the protocol, not limited to vhost-user-fs. > >>>>>>>>>> > >>>>>>>>>> These are the additions to the protocol: > >>>>>>>>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: > >>>>>>>>>> This feature signals support for transferring state, and is added so > >>>>>>>>>> that migration can fail early when the back-end has no support. > >>>>>>>>>> > >>>>>>>>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe > >>>>>>>>>> over which to transfer the state. 
The front-end sends an FD to the > >>>>>>>>>> back-end into/from which it can write/read its state, and the back-end > >>>>>>>>>> can decide to either use it, or reply with a different FD for the > >>>>>>>>>> front-end to override the front-end's choice. > >>>>>>>>>> The front-end creates a simple pipe to transfer the state, but maybe > >>>>>>>>>> the back-end already has an FD into/from which it has to write/read > >>>>>>>>>> its state, in which case it will want to override the simple pipe. > >>>>>>>>>> Conversely, maybe in the future we find a way to have the front-end > >>>>>>>>>> get an immediate FD for the migration stream (in some cases), in which > >>>>>>>>>> case we will want to send this to the back-end instead of creating a > >>>>>>>>>> pipe. > >>>>>>>>>> Hence the negotiation: If one side has a better idea than a plain > >>>>>>>>>> pipe, we will want to use that. > >>>>>>>>>> > >>>>>>>>>> - CHECK_DEVICE_STATE: After the state has been transferred through the > >>>>>>>>>> pipe (the end indicated by EOF), the front-end invokes this function > >>>>>>>>>> to verify success. There is no in-band way (through the pipe) to > >>>>>>>>>> indicate failure, so we need to check explicitly. > >>>>>>>>>> > >>>>>>>>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD > >>>>>>>>>> (which includes establishing the direction of transfer and migration > >>>>>>>>>> phase), the sending side writes its data into the pipe, and the reading > >>>>>>>>>> side reads it until it sees an EOF. Then, the front-end will check for > >>>>>>>>>> success via CHECK_DEVICE_STATE, which on the destination side includes > >>>>>>>>>> checking for integrity (i.e. errors during deserialization). 
> >>>>>>>>>> > >>>>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > >>>>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > >>>>>>>>>> --- > >>>>>>>>>> include/hw/virtio/vhost-backend.h | 24 +++++ > >>>>>>>>>> include/hw/virtio/vhost.h | 79 ++++++++++++++++ > >>>>>>>>>> hw/virtio/vhost-user.c | 147 ++++++++++++++++++++++++++++++ > >>>>>>>>>> hw/virtio/vhost.c | 37 ++++++++ > >>>>>>>>>> 4 files changed, 287 insertions(+) > >>>>>>>>>> > >>>>>>>>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h > >>>>>>>>>> index ec3fbae58d..5935b32fe3 100644 > >>>>>>>>>> --- a/include/hw/virtio/vhost-backend.h > >>>>>>>>>> +++ b/include/hw/virtio/vhost-backend.h > >>>>>>>>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { > >>>>>>>>>> VHOST_SET_CONFIG_TYPE_MIGRATION = 1, > >>>>>>>>>> } VhostSetConfigType; > >>>>>>>>>> > >>>>>>>>>> +typedef enum VhostDeviceStateDirection { > >>>>>>>>>> + /* Transfer state from back-end (device) to front-end */ > >>>>>>>>>> + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, > >>>>>>>>>> + /* Transfer state from front-end to back-end (device) */ > >>>>>>>>>> + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, > >>>>>>>>>> +} VhostDeviceStateDirection; > >>>>>>>>>> + > >>>>>>>>>> +typedef enum VhostDeviceStatePhase { > >>>>>>>>>> + /* The device (and all its vrings) is stopped */ > >>>>>>>>>> + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, > >>>>>>>>>> +} VhostDeviceStatePhase; > >>>>>>>>> vDPA has: > >>>>>>>>> > >>>>>>>>> /* Suspend a device so it does not process virtqueue requests anymore > >>>>>>>>> * > >>>>>>>>> * After the return of ioctl the device must preserve all the necessary state > >>>>>>>>> * (the virtqueue vring base plus the possible device specific states) that is > >>>>>>>>> * required for restoring in the future. The device must not change its > >>>>>>>>> * configuration after that point. 
> >>>>>>>>> */ > >>>>>>>>> #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) > >>>>>>>>> > >>>>>>>>> /* Resume a device so it can resume processing virtqueue requests > >>>>>>>>> * > >>>>>>>>> * After the return of this ioctl the device will have restored all the > >>>>>>>>> * necessary states and it is fully operational to continue processing the > >>>>>>>>> * virtqueue descriptors. > >>>>>>>>> */ > >>>>>>>>> #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) > >>>>>>>>> > >>>>>>>>> I wonder if it makes sense to import these into vhost-user so that the > >>>>>>>>> difference between kernel vhost and vhost-user is minimized. It's okay > >>>>>>>>> if one of them is ahead of the other, but it would be nice to avoid > >>>>>>>>> overlapping/duplicated functionality. > >>>>>>>>> > >>>>>>>> That's what I had in mind in the first versions. I proposed VHOST_STOP > >>>>>>>> instead of VHOST_VDPA_STOP for this very reason. Later it did change > >>>>>>>> to SUSPEND. > >>>>>>> I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not > >>>>>>> ioctl(VHOST_VDPA_RESUME). > >>>>>>> > >>>>>>> The doc comments in <linux/vdpa.h> don't explain how the device can > >>>>>>> leave the suspended state. Can you clarify this? > >>>>>>> > >>>>>> Do you mean in what situations or regarding the semantics of _RESUME? > >>>>>> > >>>>>> To me resume is an operation mainly to resume the device in the event > >>>>>> of a VM suspension, not a migration. It can be used as fallback code > >>>>>> in some cases of migration failure, though, but it is not currently > >>>>>> used in qemu. > >>>>> Is a "VM suspension" the QEMU HMP 'stop' command? > >>>>> > >>>>> I guess the reason why QEMU doesn't call RESUME anywhere is that it > >>>>> resets the device in vhost_dev_stop()? > >>>>> > >>>> The actual reason for not using RESUME is that the ioctl was added > >>>> after the SUSPEND design in qemu. Same as this proposal, it was not > >>>> needed at the time. 
> >>>> > >>>> In the case of vhost-vdpa net, the only usage of suspend is to fetch > >>>> the vq indexes, and in case of error vhost already fetches them from > >>>> the guest's used ring way before vDPA, so it has little usage. > >>>> > >>>>> Does it make sense to combine SUSPEND and RESUME with Hanna's > >>>>> SET_DEVICE_STATE_FD? For example, non-iterative migration works like > >>>>> this: > >>>>> - Saving the device's state is done by SUSPEND followed by > >>>>> SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g. > >>>>> savevm command or migration failed), then RESUME is called to > >>>>> continue. > >>>> I think the previous steps make sense in vhost_dev_stop, not in the virtio > >>>> savevm handlers. Spreading this logic to more places of qemu > >>>> can bring confusion. > >>> I don't think there is a way around extending the QEMU vhost code > >>> model. The current model in QEMU's vhost code is that the backend is > >>> reset when the VM stops. This model worked fine for stateless devices > >>> but it doesn't work for stateful devices. > >>> > >>> Imagine a vdpa-gpu device: you cannot reset the device in > >>> vhost_dev_stop() and expect the GPU to continue working when > >>> vhost_dev_start() is called again because all its state has been lost. > >>> The guest driver will send requests that reference virtio-gpu > >>> resources that no longer exist. > >>> > >>> One solution is to save the device's state in vhost_dev_stop(). I think > >>> this is what you're suggesting. It requires keeping a copy of the state > >>> and then loading the state again in vhost_dev_start(). I don't think > >>> this approach should be used because it requires all stateful devices to > >>> support live migration (otherwise they break across HMP 'stop'/'cont'). > >>> Also, the device state for some devices may be large and it would also > >>> become more complicated when iterative migration is added. 
> >>> > >>> Instead, I think the QEMU vhost code needs to be structured so that > >>> struct vhost_dev has a suspended state: > >>> > >>> ,---------. > >>> v | > >>> started ------> stopped > >>> \ ^ > >>> \ | > >>> -> suspended > >>> > >>> The device doesn't lose state when it enters the suspended state. It can > >>> be resumed again. > >>> > >>> This is why I think SUSPEND/RESUME need to be part of the solution. > >>> (It's also an argument for not including the phase argument in > >>> SET_DEVICE_STATE_FD because the SUSPEND message is sent during > >>> vhost_dev_stop() separately from saving the device's state.) > >> So let me ask if I understand this protocol correctly: Basically, > >> SUSPEND would ask the device to fully serialize its internal state, > >> retain it in some buffer, and RESUME would then deserialize the state > >> from the buffer, right? > > That's not how I understand SUSPEND/RESUME. I was thinking that > > SUSPEND pauses device operation so that virtqueues are no longer > > processed and no other events occur (e.g. VIRTIO Configuration Change > > Notifications). RESUME continues device operation. Neither command is > > directly related to device state serialization but SUSPEND freezes the > > device state, while RESUME allows the device state to change again. > > I understood that a reset would basically reset all internal state, > which is why SUSPEND+RESUME were required around it, to retain the state. The SUSPEND/RESUME operations I'm thinking of come directly from <linux/vhost.h>: /* Suspend a device so it does not process virtqueue requests anymore * * After the return of ioctl the device must preserve all the necessary state * (the virtqueue vring base plus the possible device specific states) that is * required for restoring in the future. The device must not change its * configuration after that point. 
*/ #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) /* Resume a device so it can resume processing virtqueue requests * * After the return of this ioctl the device will have restored all the * necessary states and it is fully operational to continue processing the * virtqueue descriptors. */ #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) > >> While this state needn’t necessarily be immediately migratable, I > >> suppose (e.g. one could retain file descriptors there, and it doesn’t > >> need to be a serialized byte buffer, but could still be structured), it > >> would basically be a live migration implementation already. As far as I > >> understand, that’s why you suggest not running a SUSPEND+RESUME cycle on > >> anything but live migration, right? > > No, SUSPEND/RESUME would also be used across vm_stop()/vm_start(). > > That way stateful devices are no longer reset across HMP 'stop'/'cont' > > (we're lucky it even works for most existing vhost-user backends today > > and that's just because they don't yet implement > > VHOST_USER_SET_STATUS). > > So that’s what I seem to misunderstand: If stateful devices are reset, > how does SUSPEND+RESUME prevent that? The vhost-user frontend can check the VHOST_USER_PROTOCOL_F_SUSPEND feature bit to determine that the backend supports SUSPEND/RESUME and that mechanism should be used instead of resetting the device. Stefan
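The started/stopped/suspended lifecycle Stefan sketches can be written out as a small transition function (enum and function names invented for illustration, not QEMU code):

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { DEV_STOPPED, DEV_STARTED, DEV_SUSPENDED } DevState;
typedef enum { EV_START, EV_STOP, EV_SUSPEND, EV_RESUME } DevEvent;

/* Returns true and updates *s if the transition is allowed.
 * SUSPEND freezes the device without losing its state; RESUME returns
 * it to normal operation; STOP (reset) is reachable from any state and
 * discards state. */
static bool dev_transition(DevState *s, DevEvent ev)
{
    switch (ev) {
    case EV_START:
        if (*s != DEV_STOPPED)   return false;
        *s = DEV_STARTED;   return true;
    case EV_SUSPEND:
        if (*s != DEV_STARTED)   return false;
        *s = DEV_SUSPENDED; return true;
    case EV_RESUME:
        if (*s != DEV_SUSPENDED) return false;
        *s = DEV_STARTED;   return true;
    case EV_STOP:
        *s = DEV_STOPPED;   return true;
    }
    return false;
}
```

Under this model, saving device state would happen in the suspended state, before any stop/reset event.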
On Wed, 2023-04-19 at 13:10 +0200, Hanna Czenczek wrote: > On 18.04.23 09:54, Eugenio Perez Martin wrote: > > On Mon, Apr 17, 2023 at 9:21 PM Stefan Hajnoczi <stefanha@gmail.com> wrote: > > > On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com> > > > wrote: > > > > On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com> > > > > wrote: > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote: > > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi < > > > > > > stefanha@redhat.com> wrote: > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: > > > > > > > > So-called "internal" virtio-fs migration refers to transporting > > > > > > > > the > > > > > > > > back-end's (virtiofsd's) state through qemu's migration > > > > > > > > stream. To do > > > > > > > > this, we need to be able to transfer virtiofsd's internal state > > > > > > > > to and > > > > > > > > from virtiofsd. > > > > > > > > > > > > > > > > Because virtiofsd's internal state will not be too large, we > > > > > > > > believe it > > > > > > > > is best to transfer it as a single binary blob after the > > > > > > > > streaming > > > > > > > > phase. Because this method should be useful to other vhost-user > > > > > > > > implementations, too, it is introduced as a general-purpose > > > > > > > > addition to > > > > > > > > the protocol, not limited to vhost-user-fs. > > > > > > > > > > > > > > > > These are the additions to the protocol: > > > > > > > > - New vhost-user protocol feature > > > > > > > > VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: > > > > > > > > This feature signals support for transferring state, and is > > > > > > > > added so > > > > > > > > that migration can fail early when the back-end has no > > > > > > > > support. > > > > > > > > > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate > > > > > > > > a pipe > > > > > > > > over which to transfer the state. 
The front-end sends an FD > > > > > > > > to the > > > > > > > > back-end into/from which it can write/read its state, and the > > > > > > > > back-end > > > > > > > > can decide to either use it, or reply with a different FD for > > > > > > > > the > > > > > > > > front-end to override the front-end's choice. > > > > > > > > The front-end creates a simple pipe to transfer the state, > > > > > > > > but maybe > > > > > > > > the back-end already has an FD into/from which it has to > > > > > > > > write/read > > > > > > > > its state, in which case it will want to override the simple > > > > > > > > pipe. > > > > > > > > Conversely, maybe in the future we find a way to have the > > > > > > > > front-end > > > > > > > > get an immediate FD for the migration stream (in some cases), > > > > > > > > in which > > > > > > > > case we will want to send this to the back-end instead of > > > > > > > > creating a > > > > > > > > pipe. > > > > > > > > Hence the negotiation: If one side has a better idea than a > > > > > > > > plain > > > > > > > > pipe, we will want to use that. > > > > > > > > > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred > > > > > > > > through the > > > > > > > > pipe (the end indicated by EOF), the front-end invokes this > > > > > > > > function > > > > > > > > to verify success. There is no in-band way (through the > > > > > > > > pipe) to > > > > > > > > indicate failure, so we need to check explicitly. > > > > > > > > > > > > > > > > Once the transfer pipe has been established via > > > > > > > > SET_DEVICE_STATE_FD > > > > > > > > (which includes establishing the direction of transfer and > > > > > > > > migration > > > > > > > > phase), the sending side writes its data into the pipe, and the > > > > > > > > reading > > > > > > > > side reads it until it sees an EOF. 
Then, the front-end will > > > > > > > > check for > > > > > > > > success via CHECK_DEVICE_STATE, which on the destination side > > > > > > > > includes > > > > > > > > checking for integrity (i.e. errors during deserialization). > > > > > > > > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > > > > > > > > --- > > > > > > > > include/hw/virtio/vhost-backend.h | 24 +++++ > > > > > > > > include/hw/virtio/vhost.h | 79 ++++++++++++++++ > > > > > > > > hw/virtio/vhost-user.c | 147 > > > > > > > > ++++++++++++++++++++++++++++++ > > > > > > > > hw/virtio/vhost.c | 37 ++++++++ > > > > > > > > 4 files changed, 287 insertions(+) > > > > > > > > > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h > > > > > > > > b/include/hw/virtio/vhost-backend.h > > > > > > > > index ec3fbae58d..5935b32fe3 100644 > > > > > > > > --- a/include/hw/virtio/vhost-backend.h > > > > > > > > +++ b/include/hw/virtio/vhost-backend.h > > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { > > > > > > > > VHOST_SET_CONFIG_TYPE_MIGRATION = 1, > > > > > > > > } VhostSetConfigType; > > > > > > > > > > > > > > > > +typedef enum VhostDeviceStateDirection { > > > > > > > > + /* Transfer state from back-end (device) to front-end */ > > > > > > > > + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, > > > > > > > > + /* Transfer state from front-end to back-end (device) */ > > > > > > > > + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, > > > > > > > > +} VhostDeviceStateDirection; > > > > > > > > + > > > > > > > > +typedef enum VhostDeviceStatePhase { > > > > > > > > + /* The device (and all its vrings) is stopped */ > > > > > > > > + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, > > > > > > > > +} VhostDeviceStatePhase; > > > > > > > vDPA has: > > > > > > > > > > > > > > /* Suspend a device so it does not process virtqueue requests > > > > > > > anymore > > > > > > > * > > > > > > > * After the return of ioctl the 
device must preserve all the > > > > > > > necessary state > > > > > > > * (the virtqueue vring base plus the possible device specific > > > > > > > states) that is > > > > > > > * required for restoring in the future. The device must not > > > > > > > change its > > > > > > > * configuration after that point. > > > > > > > */ > > > > > > > #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) > > > > > > > > > > > > > > /* Resume a device so it can resume processing virtqueue > > > > > > > requests > > > > > > > * > > > > > > > * After the return of this ioctl the device will have restored > > > > > > > all the > > > > > > > * necessary states and it is fully operational to continue > > > > > > > processing the > > > > > > > * virtqueue descriptors. > > > > > > > */ > > > > > > > #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) > > > > > > > > > > > > > > I wonder if it makes sense to import these into vhost-user so that > > > > > > > the > > > > > > > difference between kernel vhost and vhost-user is minimized. It's > > > > > > > okay > > > > > > > if one of them is ahead of the other, but it would be nice to > > > > > > > avoid > > > > > > > overlapping/duplicated functionality. > > > > > > > > > > > > > That's what I had in mind in the first versions. I proposed > > > > > > VHOST_STOP > > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did change > > > > > > to SUSPEND. > > > > > > > > > > > > Generally it is better if we make the interface less parametrized > > > > > > and > > > > > > we trust in the messages and its semantics in my opinion. In other > > > > > > words, instead of > > > > > > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), > > > > > > send > > > > > > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user > > > > > > command. > > > > > > > > > > > > Another way to apply this is with the "direction" parameter. Maybe > > > > > > it > > > > > > is better to split it into "set_state_fd" and "get_state_fd"? 
> > > > > > > > > > > > In that case, reusing the ioctls as vhost-user messages would be ok. > > > > > > But that puts this proposal further from the VFIO code, which uses > > > > > > "migration_set_state(state)", and maybe it is better when the number > > > > > > of states is high. > > > > > Hi Eugenio, > > > > > Another question about vDPA suspend/resume: > > > > > > > > > > /* Host notifiers must be enabled at this point. */ > > > > > void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, > > > > > bool vrings) > > > > > { > > > > > int i; > > > > > > > > > > /* should only be called after backend is connected */ > > > > > assert(hdev->vhost_ops); > > > > > event_notifier_test_and_clear( > > > > > &hdev- > > > > > >vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier); > > > > > event_notifier_test_and_clear(&vdev->config_notifier); > > > > > > > > > > trace_vhost_dev_stop(hdev, vdev->name, vrings); > > > > > > > > > > if (hdev->vhost_ops->vhost_dev_start) { > > > > > hdev->vhost_ops->vhost_dev_start(hdev, false); > > > > > ^^^ SUSPEND ^^^ > > > > > } > > > > > if (vrings) { > > > > > vhost_dev_set_vring_enable(hdev, false); > > > > > } > > > > > for (i = 0; i < hdev->nvqs; ++i) { > > > > > vhost_virtqueue_stop(hdev, > > > > > vdev, > > > > > hdev->vqs + i, > > > > > hdev->vq_index + i); > > > > > ^^^ fetch virtqueue state from kernel ^^^ > > > > > } > > > > > if (hdev->vhost_ops->vhost_reset_status) { > > > > > hdev->vhost_ops->vhost_reset_status(hdev); > > > > > ^^^ reset device^^^ > > > > > > > > > > I noticed the QEMU vDPA code resets the device in vhost_dev_stop() -> > > > > > vhost_reset_status(). The device's migration code runs after > > > > > vhost_dev_stop() and the state will have been lost. > > > > > > > > > vhost_virtqueue_stop saves the vq state (indexes, vring base) in the > > > > qemu VirtIONet device model. This is for all vhost backends. 
> > > > > > > > Regarding the state like mac or mq configuration, SVQ runs for all the > > > > VM run in the CVQ. So it can track all of that status in the device > > > > model too. > > > > > > > > When a migration effectively occurs, all the frontend state is > > > > migrated as a regular emulated device. To route all of the state in a > > > > normalized way for qemu is what leaves open the possibility to do > > > > cross-backends migrations, etc. > > > > > > > > Does that answer your question? > > > I think you're confirming that changes would be necessary in order for > > > vDPA to support the save/load operation that Hanna is introducing. > > > > > Yes, this first iteration was centered on net, with an eye on block, > > where state can be routed through classical emulated devices. This is > > how vhost-kernel and vhost-user do classically. And it allows > > cross-backend, to not modify qemu migration state, etc. > > > > To introduce this opaque state to qemu, that must be fetched after the > > suspend and not before, requires changes in vhost protocol, as > > discussed previously. > > > > > > > It looks like vDPA changes are necessary in order to support stateful > > > > > devices even though QEMU already uses SUSPEND. Is my understanding > > > > > correct? > > > > > > > > > Changes are required elsewhere, as the code to restore the state > > > > properly in the destination has not been merged. > > > I'm not sure what you mean by elsewhere? > > > > > I meant for vdpa *net* devices the changes are not required in vdpa > > ioctls, but mostly in qemu. > > > > If you meant stateful as "it must have a state blob that it must be > > opaque to qemu", then I think the straightforward action is to fetch > > state blob about the same time as vq indexes. But yes, changes (at > > least a new ioctl) is needed for that. > > > > > I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and > > > then VHOST_VDPA_SET_STATUS 0. 
> > > > > > In order to save device state from the vDPA device in the future, it > > > will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that > > > the device state can be saved before the device is reset. > > > > > > Does that sound right? > > > > > The split between suspend and reset was added recently for that very > > reason. In all the virtio devices, the frontend is initialized before > > the backend, so I don't think it is a good idea to defer the backend > > cleanup. Especially since we have already established that the state is small > > enough not to need iterative migration from virtiofsd's point of view. > > > > If fetching that state at the same time as the vq indexes is not valid, > > could it follow the same model as the "in-flight descriptors"? > > vhost-user follows them by using a shared memory region where their > > state is tracked [1]. This allows qemu to survive vhost-user SW > > backend crashes, and does not forbid cross-backend live migration, > > as all the information needed to recover them is there. > > > > For hw devices this is not convenient as it occupies PCI bandwidth. So > > a possibility is to synchronize this memory region after a > > synchronization point, be it the SUSPEND call or GET_VRING_BASE. HW > > devices are not going to crash in the software sense, so all use cases > > remain the same to qemu. And that shared memory information is > > recoverable after vhost_dev_stop. > > > > Does that sound reasonable to virtiofsd? To offer a shared memory > > region where it dumps the state, maybe only after the > > set_state(STATE_PHASE_STOPPED)? > > I don’t think we need the set_state() call, necessarily, if SUSPEND is > mandatory anyway. > Right, I was taking them as interchangeable. Note that I just put this on the table because it solves another use case (transfer of stateful devices + virtiofsd crash) with an interface that mimics an already existing one. I don't want to block the pipe proposal at all. 
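The inflight-descriptor model referenced here boils down to an fd-backed shared mapping that survives a back-end crash. A back-end could create such a region for its state roughly like this (minimal Linux-only sketch; the vhost-user message plumbing that would hand the fd to the front-end is omitted):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Allocate an anonymous, fd-backed shared memory region of the given
 * size. On success, returns the mapping and stores the fd (the handle
 * that would be shared with the peer) in *fd_out. */
static void *alloc_state_region(size_t size, int *fd_out)
{
    int fd = memfd_create("device-state", 0);   /* Linux-specific */
    void *p;

    if (fd < 0) {
        return NULL;
    }
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return NULL;
    }
    p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return NULL;
    }
    *fd_out = fd;
    return p;
}
```

Because the state lives in the fd-backed region rather than in the back-end's heap, the front-end keeps a usable copy even if the back-end process dies.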
> As for the shared memory, the RFC before this series used shared memory, > so it’s possible, yes. But “shared memory region” can mean a lot of > things – it sounds like you’re saying the back-end (virtiofsd) should > provide it to the front-end, is that right? That could work like this: > inflight_fd provides both calls: VHOST_USER_SET_INFLIGHT_FD and VHOST_USER_GET_INFLIGHT_FD. > On the source side: > > S1. SUSPEND goes to virtiofsd > S2. virtiofsd maybe double-checks that the device is stopped, then > serializes its state into a newly allocated shared memory area[1] > S3. virtiofsd responds to SUSPEND > S4. front-end requests shared memory, virtiofsd responds with a handle, > maybe already closes its reference > S5. front-end saves state, closes its handle, freeing the SHM > > [1] Maybe virtiofsd can correctly size the serialized state’s size, then > it can immediately allocate this area and serialize directly into it; > maybe it can’t, then we’ll need a bounce buffer. Not really a > fundamental problem, but there are limitations around what you can do > with serde implementations in Rust… > I think shared memory regions can grow and shrink with ftruncate, but it complicates the solution for sure. I was under the impression it will be a fixed amount of state, probably based on some actual limits in the vfs. Now I see it is not. > On the destination side: > > D1. Optional SUSPEND goes to virtiofsd that hasn’t yet done much; > virtiofsd would serialize its empty state into an SHM area, and respond > to SUSPEND > D2. front-end reads state from migration stream into an SHM it has allocated > D3. front-end supplies this SHM to virtiofsd, which discards its > previous area, and now uses this one > D4. RESUME goes to virtiofsd, which deserializes the state from the SHM > > Couple of questions: > > A. Stefan suggested D1, but it does seem wasteful now. 
But if SUSPEND > would imply to deserialize a state, and the state is to be transferred > through SHM, this is what would need to be done. So maybe we should > skip SUSPEND on the destination?

I think skipping SUSPEND is the best call, yes.

> B. You described that the back-end should supply the SHM, which works > well on the source. On the destination, only the front-end knows how > big the state is, so I’ve decided above that it should allocate the SHM > (D2) and provide it to the back-end. Is that feasible or is it > important (e.g. for real hardware) that the back-end supplies the SHM? > (In which case the front-end would need to tell the back-end how big the > state SHM needs to be.)

It is feasible for sure. I think the best scenario is when the data has a fixed size forever, like QueueRegionSplit and QueueRegionPacked. If that is not possible, the best option is to indicate the length of the data so the device can fetch it in strides as large as possible. Maybe other HW people can answer better here, though.

Thanks!
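On the fixed-vs-variable size question, a region backed by a memfd can indeed be grown with ftruncate() while the state is being serialized. A rough sketch (the helper is illustrative only, not proposed protocol code):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical sketch: append `n` bytes to a memfd-backed state
 * region, growing it with ftruncate() as the serialized size is
 * discovered.  `*len` tracks the bytes written so far.
 */
static int shm_append(int fd, size_t *len, const void *buf, size_t n)
{
    if (ftruncate(fd, *len + n) < 0) {  /* grow the region */
        return -1;
    }
    if (pwrite(fd, buf, n, *len) != (ssize_t)n) {
        return -1;
    }
    *len += n;
    return 0;
}
```

The final length would still need to be communicated explicitly (or read back via fstat() on the FD), so that a hardware device can fetch the data in large strides as suggested above.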
On Tue, 2023-04-18 at 16:40 -0400, Stefan Hajnoczi wrote: > On Tue, 18 Apr 2023 at 14:31, Eugenio Perez Martin <eperezma@redhat.com> > wrote: > > On Tue, Apr 18, 2023 at 7:59 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > On Tue, Apr 18, 2023 at 10:09:30AM +0200, Eugenio Perez Martin wrote: > > > > On Mon, Apr 17, 2023 at 9:33 PM Stefan Hajnoczi <stefanha@gmail.com> > > > > wrote: > > > > > On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin < > > > > > eperezma@redhat.com> wrote: > > > > > > On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com > > > > > > > wrote: > > > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin > > > > > > > wrote: > > > > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi < > > > > > > > > stefanha@redhat.com> wrote: > > > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek > > > > > > > > > wrote: > > > > > > > > > > So-called "internal" virtio-fs migration refers to > > > > > > > > > > transporting the > > > > > > > > > > back-end's (virtiofsd's) state through qemu's migration > > > > > > > > > > stream. To do > > > > > > > > > > this, we need to be able to transfer virtiofsd's internal > > > > > > > > > > state to and > > > > > > > > > > from virtiofsd. > > > > > > > > > > > > > > > > > > > > Because virtiofsd's internal state will not be too large, we > > > > > > > > > > believe it > > > > > > > > > > is best to transfer it as a single binary blob after the > > > > > > > > > > streaming > > > > > > > > > > phase. Because this method should be useful to other vhost- > > > > > > > > > > user > > > > > > > > > > implementations, too, it is introduced as a general-purpose > > > > > > > > > > addition to > > > > > > > > > > the protocol, not limited to vhost-user-fs. 
> > > > > > > > > > > > > > > > > > > > These are the additions to the protocol: > > > > > > > > > > - New vhost-user protocol feature > > > > > > > > > > VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: > > > > > > > > > > This feature signals support for transferring state, and > > > > > > > > > > is added so > > > > > > > > > > that migration can fail early when the back-end has no > > > > > > > > > > support. > > > > > > > > > > > > > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end > > > > > > > > > > negotiate a pipe > > > > > > > > > > over which to transfer the state. The front-end sends an > > > > > > > > > > FD to the > > > > > > > > > > back-end into/from which it can write/read its state, and > > > > > > > > > > the back-end > > > > > > > > > > can decide to either use it, or reply with a different FD > > > > > > > > > > for the > > > > > > > > > > front-end to override the front-end's choice. > > > > > > > > > > The front-end creates a simple pipe to transfer the state, > > > > > > > > > > but maybe > > > > > > > > > > the back-end already has an FD into/from which it has to > > > > > > > > > > write/read > > > > > > > > > > its state, in which case it will want to override the > > > > > > > > > > simple pipe. > > > > > > > > > > Conversely, maybe in the future we find a way to have the > > > > > > > > > > front-end > > > > > > > > > > get an immediate FD for the migration stream (in some > > > > > > > > > > cases), in which > > > > > > > > > > case we will want to send this to the back-end instead of > > > > > > > > > > creating a > > > > > > > > > > pipe. > > > > > > > > > > Hence the negotiation: If one side has a better idea than > > > > > > > > > > a plain > > > > > > > > > > pipe, we will want to use that. 
> > > > > > > > > > > > > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred > > > > > > > > > > through the > > > > > > > > > > pipe (the end indicated by EOF), the front-end invokes > > > > > > > > > > this function > > > > > > > > > > to verify success. There is no in-band way (through the > > > > > > > > > > pipe) to > > > > > > > > > > indicate failure, so we need to check explicitly. > > > > > > > > > > > > > > > > > > > > Once the transfer pipe has been established via > > > > > > > > > > SET_DEVICE_STATE_FD > > > > > > > > > > (which includes establishing the direction of transfer and > > > > > > > > > > migration > > > > > > > > > > phase), the sending side writes its data into the pipe, and > > > > > > > > > > the reading > > > > > > > > > > side reads it until it sees an EOF. Then, the front-end > > > > > > > > > > will check for > > > > > > > > > > success via CHECK_DEVICE_STATE, which on the destination > > > > > > > > > > side includes > > > > > > > > > > checking for integrity (i.e. errors during deserialization). 
> > > > > > > > > > > > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > > > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > > > > > > > > > > --- > > > > > > > > > > include/hw/virtio/vhost-backend.h | 24 +++++ > > > > > > > > > > include/hw/virtio/vhost.h | 79 ++++++++++++++++ > > > > > > > > > > hw/virtio/vhost-user.c | 147 > > > > > > > > > > ++++++++++++++++++++++++++++++ > > > > > > > > > > hw/virtio/vhost.c | 37 ++++++++ > > > > > > > > > > 4 files changed, 287 insertions(+) > > > > > > > > > > > > > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h > > > > > > > > > > b/include/hw/virtio/vhost-backend.h > > > > > > > > > > index ec3fbae58d..5935b32fe3 100644 > > > > > > > > > > --- a/include/hw/virtio/vhost-backend.h > > > > > > > > > > +++ b/include/hw/virtio/vhost-backend.h > > > > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { > > > > > > > > > > VHOST_SET_CONFIG_TYPE_MIGRATION = 1, > > > > > > > > > > } VhostSetConfigType; > > > > > > > > > > > > > > > > > > > > +typedef enum VhostDeviceStateDirection { > > > > > > > > > > + /* Transfer state from back-end (device) to front-end > > > > > > > > > > */ > > > > > > > > > > + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, > > > > > > > > > > + /* Transfer state from front-end to back-end (device) > > > > > > > > > > */ > > > > > > > > > > + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, > > > > > > > > > > +} VhostDeviceStateDirection; > > > > > > > > > > + > > > > > > > > > > +typedef enum VhostDeviceStatePhase { > > > > > > > > > > + /* The device (and all its vrings) is stopped */ > > > > > > > > > > + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, > > > > > > > > > > +} VhostDeviceStatePhase; > > > > > > > > > > > > > > > > > > vDPA has: > > > > > > > > > > > > > > > > > > /* Suspend a device so it does not process virtqueue > > > > > > > > > requests anymore > > > > > > > > > * > > > > > > > > > * After the return of ioctl the device must preserve all > 
> > > > > > > > the necessary state > > > > > > > > > * (the virtqueue vring base plus the possible device > > > > > > > > > specific states) that is > > > > > > > > > * required for restoring in the future. The device must not > > > > > > > > > change its > > > > > > > > > * configuration after that point. > > > > > > > > > */ > > > > > > > > > #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) > > > > > > > > > > > > > > > > > > /* Resume a device so it can resume processing virtqueue > > > > > > > > > requests > > > > > > > > > * > > > > > > > > > * After the return of this ioctl the device will have > > > > > > > > > restored all the > > > > > > > > > * necessary states and it is fully operational to continue > > > > > > > > > processing the > > > > > > > > > * virtqueue descriptors. > > > > > > > > > */ > > > > > > > > > #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) > > > > > > > > > > > > > > > > > > I wonder if it makes sense to import these into vhost-user so > > > > > > > > > that the > > > > > > > > > difference between kernel vhost and vhost-user is minimized. > > > > > > > > > It's okay > > > > > > > > > if one of them is ahead of the other, but it would be nice to > > > > > > > > > avoid > > > > > > > > > overlapping/duplicated functionality. > > > > > > > > > > > > > > > > > > > > > > > > > That's what I had in mind in the first versions. I proposed > > > > > > > > VHOST_STOP > > > > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did > > > > > > > > change > > > > > > > > to SUSPEND. > > > > > > > > > > > > > > I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not > > > > > > > ioctl(VHOST_VDPA_RESUME). > > > > > > > > > > > > > > The doc comments in <linux/vdpa.h> don't explain how the device > > > > > > > can > > > > > > > leave the suspended state. Can you clarify this? > > > > > > > > > > > > > > > > > > > Do you mean in what situations or regarding the semantics of > > > > > > _RESUME? 
> > > > > > > > > > > > To me resume is an operation mainly to resume the device in the > > > > > > event > > > > > > of a VM suspension, not a migration. It can be used as a fallback > > > > > > code > > > > > > in some cases of migration failure though, but it is not currently > > > > > > used in qemu. > > > > > > > > > > Is a "VM suspension" the QEMU HMP 'stop' command? > > > > > > > > > > I guess the reason why QEMU doesn't call RESUME anywhere is that it > > > > > resets the device in vhost_dev_stop()? > > > > > > > > > > > > > The actual reason for not using RESUME is that the ioctl was added > > > > after the SUSPEND design in qemu. Same as this proposal, it is was not > > > > needed at the time. > > > > > > > > In the case of vhost-vdpa net, the only usage of suspend is to fetch > > > > the vq indexes, and in case of error vhost already fetches them from > > > > guest's used ring way before vDPA, so it has little usage. > > > > > > > > > Does it make sense to combine SUSPEND and RESUME with Hanna's > > > > > SET_DEVICE_STATE_FD? For example, non-iterative migration works like > > > > > this: > > > > > - Saving the device's state is done by SUSPEND followed by > > > > > SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g. > > > > > savevm command or migration failed), then RESUME is called to > > > > > continue. > > > > > > > > I think the previous steps make sense at vhost_dev_stop, not virtio > > > > savevm handlers. To start spreading this logic to more places of qemu > > > > can bring confusion. > > > > > > I don't think there is a way around extending the QEMU vhost's code > > > model. The current model in QEMU's vhost code is that the backend is > > > reset when the VM stops. This model worked fine for stateless devices > > > but it doesn't work for stateful devices. 
> > > > > > Imagine a vdpa-gpu device: you cannot reset the device in > > > vhost_dev_stop() and expect the GPU to continue working when > > > vhost_dev_start() is called again because all its state has been lost. > > > The guest driver will send requests that references a virtio-gpu > > > resources that no longer exist. > > > > > > One solution is to save the device's state in vhost_dev_stop(). I think > > > this is what you're suggesting. It requires keeping a copy of the state > > > and then loading the state again in vhost_dev_start(). I don't think > > > this approach should be used because it requires all stateful devices to > > > support live migration (otherwise they break across HMP 'stop'/'cont'). > > > Also, the device state for some devices may be large and it would also > > > become more complicated when iterative migration is added. > > > > > > Instead, I think the QEMU vhost code needs to be structured so that > > > struct vhost_dev has a suspended state:
> > >
> > >            ,---------.
> > >            v         |
> > >   started ------> stopped
> > >        \            ^
> > >         \           |
> > >          -> suspended
> > >
> > > The device doesn't lose state when it enters the suspended state. It can > > > be resumed again. > > > > > > This is why I think SUSPEND/RESUME need to be part of the solution.

I just realized that we can add an arrow from suspended to stopped, can't we? Having "started" come first seems to imply the device may process descriptors after suspend.

> > > > I agree with all of this, especially after realizing vhost_dev_stop is > > called before the last request of the state in the iterative > > migration. > > > > However I think we can move faster with the virtiofsd migration code, > > as long as we agree on the vhost-user messages it will receive. This > > is because we already agree that the state will be sent in one shot > > and not iteratively, so it will be small. > > > > I understand this may change in the future, that's why I proposed to > > start using iterative right now.
However it may make little sense if > > it is not used in the vhost-user device. I also understand that other > > devices may have a bigger state so it will be needed for them. > > Can you summarize how you'd like save to work today? I'm not sure what > you have in mind. >

I think we're trying to find a solution that satisfies many constraints. On one hand, we're assuming that the virtiofsd state will be small enough that it will not require iterative migration in the short term. However, we also want to support iterative migration, for the sake of *other* future vhost devices that may need it.

I also think we should prioritize the protocol's stability, in the sense of not adding calls that we will not reuse for iterative live migration; the vhost-user protocol is more important to keep stable than qemu's migration internals. The changes you mention will need to be implemented eventually. But we have already established that virtiofsd's state is small, so we can just fetch it around the same time as we send the VHOST_USER_GET_VRING_BASE message, and transfer it with the proposed non-iterative approach.

If we agree on that, the question now is how to fetch the state from the device. The answers are a little scattered across the mail threads, but I think we agree on:

a) We need to signal that the device must stop processing requests.
b) We need a way for the device to dump its state.

At this moment I think any proposal satisfies a), and the pipe satisfies b) better. With proper back-end feature flags, the device may support starting to write to the pipe before SUSPEND, so we can implement iterative migration on top.

Does that make sense?

Thanks!
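As a concrete picture of point b) under the pipe proposal, the saving side could look roughly like this (a sketch assuming a blocking pipe FD; the helper name is made up, not part of the series):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/*
 * Illustrative sketch: the back-end writes its opaque state blob into
 * the pipe FD negotiated via SET_DEVICE_STATE_FD and then closes it;
 * the resulting EOF is what tells the reader the state is complete.
 */
static int save_state_to_pipe(int pipe_fd, const void *state, size_t len)
{
    const uint8_t *p = state;
    while (len > 0) {
        ssize_t n = write(pipe_fd, p, len);
        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            close(pipe_fd);
            return -1;
        }
        p += n;
        len -= n;
    }
    close(pipe_fd);  /* EOF marks the end of the state */
    return 0;
}
```

An iterative extension would simply mean the device starts writing earlier (before SUSPEND), with the same EOF convention at the end.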
On Wed, 2023-04-19 at 07:21 -0400, Stefan Hajnoczi wrote: > On Wed, 19 Apr 2023 at 07:10, Hanna Czenczek <hreitz@redhat.com> wrote: > > On 18.04.23 09:54, Eugenio Perez Martin wrote: > > > On Mon, Apr 17, 2023 at 9:21 PM Stefan Hajnoczi <stefanha@gmail.com> > > > wrote: > > > > On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com> > > > > wrote: > > > > > On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com> > > > > > wrote: > > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin > > > > > > wrote: > > > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi < > > > > > > > stefanha@redhat.com> wrote: > > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: > > > > > > > > > So-called "internal" virtio-fs migration refers to > > > > > > > > > transporting the > > > > > > > > > back-end's (virtiofsd's) state through qemu's migration > > > > > > > > > stream. To do > > > > > > > > > this, we need to be able to transfer virtiofsd's internal > > > > > > > > > state to and > > > > > > > > > from virtiofsd. > > > > > > > > > > > > > > > > > > Because virtiofsd's internal state will not be too large, we > > > > > > > > > believe it > > > > > > > > > is best to transfer it as a single binary blob after the > > > > > > > > > streaming > > > > > > > > > phase. Because this method should be useful to other vhost- > > > > > > > > > user > > > > > > > > > implementations, too, it is introduced as a general-purpose > > > > > > > > > addition to > > > > > > > > > the protocol, not limited to vhost-user-fs. > > > > > > > > > > > > > > > > > > These are the additions to the protocol: > > > > > > > > > - New vhost-user protocol feature > > > > > > > > > VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: > > > > > > > > > This feature signals support for transferring state, and is > > > > > > > > > added so > > > > > > > > > that migration can fail early when the back-end has no > > > > > > > > > support. 
> > > > > > > > > > > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end > > > > > > > > > negotiate a pipe > > > > > > > > > over which to transfer the state. The front-end sends an > > > > > > > > > FD to the > > > > > > > > > back-end into/from which it can write/read its state, and > > > > > > > > > the back-end > > > > > > > > > can decide to either use it, or reply with a different FD > > > > > > > > > for the > > > > > > > > > front-end to override the front-end's choice. > > > > > > > > > The front-end creates a simple pipe to transfer the state, > > > > > > > > > but maybe > > > > > > > > > the back-end already has an FD into/from which it has to > > > > > > > > > write/read > > > > > > > > > its state, in which case it will want to override the > > > > > > > > > simple pipe. > > > > > > > > > Conversely, maybe in the future we find a way to have the > > > > > > > > > front-end > > > > > > > > > get an immediate FD for the migration stream (in some > > > > > > > > > cases), in which > > > > > > > > > case we will want to send this to the back-end instead of > > > > > > > > > creating a > > > > > > > > > pipe. > > > > > > > > > Hence the negotiation: If one side has a better idea than a > > > > > > > > > plain > > > > > > > > > pipe, we will want to use that. > > > > > > > > > > > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred > > > > > > > > > through the > > > > > > > > > pipe (the end indicated by EOF), the front-end invokes this > > > > > > > > > function > > > > > > > > > to verify success. There is no in-band way (through the > > > > > > > > > pipe) to > > > > > > > > > indicate failure, so we need to check explicitly. 
> > > > > > > > > > > > > > > > > > Once the transfer pipe has been established via > > > > > > > > > SET_DEVICE_STATE_FD > > > > > > > > > (which includes establishing the direction of transfer and > > > > > > > > > migration > > > > > > > > > phase), the sending side writes its data into the pipe, and > > > > > > > > > the reading > > > > > > > > > side reads it until it sees an EOF. Then, the front-end will > > > > > > > > > check for > > > > > > > > > success via CHECK_DEVICE_STATE, which on the destination side > > > > > > > > > includes > > > > > > > > > checking for integrity (i.e. errors during deserialization). > > > > > > > > > > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > > > > > > > > > --- > > > > > > > > > include/hw/virtio/vhost-backend.h | 24 +++++ > > > > > > > > > include/hw/virtio/vhost.h | 79 ++++++++++++++++ > > > > > > > > > hw/virtio/vhost-user.c | 147 > > > > > > > > > ++++++++++++++++++++++++++++++ > > > > > > > > > hw/virtio/vhost.c | 37 ++++++++ > > > > > > > > > 4 files changed, 287 insertions(+) > > > > > > > > > > > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h > > > > > > > > > b/include/hw/virtio/vhost-backend.h > > > > > > > > > index ec3fbae58d..5935b32fe3 100644 > > > > > > > > > --- a/include/hw/virtio/vhost-backend.h > > > > > > > > > +++ b/include/hw/virtio/vhost-backend.h > > > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { > > > > > > > > > VHOST_SET_CONFIG_TYPE_MIGRATION = 1, > > > > > > > > > } VhostSetConfigType; > > > > > > > > > > > > > > > > > > +typedef enum VhostDeviceStateDirection { > > > > > > > > > + /* Transfer state from back-end (device) to front-end */ > > > > > > > > > + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, > > > > > > > > > + /* Transfer state from front-end to back-end (device) */ > > > > > > > > > + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, > > > > > > > > > +} 
VhostDeviceStateDirection; > > > > > > > > > + > > > > > > > > > +typedef enum VhostDeviceStatePhase { > > > > > > > > > + /* The device (and all its vrings) is stopped */ > > > > > > > > > + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, > > > > > > > > > +} VhostDeviceStatePhase; > > > > > > > > vDPA has: > > > > > > > > > > > > > > > > /* Suspend a device so it does not process virtqueue requests > > > > > > > > anymore > > > > > > > > * > > > > > > > > * After the return of ioctl the device must preserve all the > > > > > > > > necessary state > > > > > > > > * (the virtqueue vring base plus the possible device > > > > > > > > specific states) that is > > > > > > > > * required for restoring in the future. The device must not > > > > > > > > change its > > > > > > > > * configuration after that point. > > > > > > > > */ > > > > > > > > #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) > > > > > > > > > > > > > > > > /* Resume a device so it can resume processing virtqueue > > > > > > > > requests > > > > > > > > * > > > > > > > > * After the return of this ioctl the device will have > > > > > > > > restored all the > > > > > > > > * necessary states and it is fully operational to continue > > > > > > > > processing the > > > > > > > > * virtqueue descriptors. > > > > > > > > */ > > > > > > > > #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) > > > > > > > > > > > > > > > > I wonder if it makes sense to import these into vhost-user so > > > > > > > > that the > > > > > > > > difference between kernel vhost and vhost-user is minimized. > > > > > > > > It's okay > > > > > > > > if one of them is ahead of the other, but it would be nice to > > > > > > > > avoid > > > > > > > > overlapping/duplicated functionality. > > > > > > > > > > > > > > > That's what I had in mind in the first versions. I proposed > > > > > > > VHOST_STOP > > > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did > > > > > > > change > > > > > > > to SUSPEND. 
> > > > > > > > > > > > > > Generally it is better if we make the interface less parametrized > > > > > > > and > > > > > > > we trust in the messages and its semantics in my opinion. In other > > > > > > > words, instead of > > > > > > > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), > > > > > > > send > > > > > > > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user > > > > > > > command. > > > > > > > > > > > > > > Another way to apply this is with the "direction" parameter. Maybe > > > > > > > it > > > > > > > is better to split it into "set_state_fd" and "get_state_fd"? > > > > > > > > > > > > > > In that case, reusing the ioctls as vhost-user messages would be > > > > > > > ok. > > > > > > > But that puts this proposal further from the VFIO code, which uses > > > > > > > "migration_set_state(state)", and maybe it is better when the > > > > > > > number > > > > > > > of states is high. > > > > > > Hi Eugenio, > > > > > > Another question about vDPA suspend/resume: > > > > > > > > > > > > /* Host notifiers must be enabled at this point. 
*/
> > > > > > void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev,
> > > > > >                     bool vrings)
> > > > > > {
> > > > > >     int i;
> > > > > >
> > > > > >     /* should only be called after backend is connected */
> > > > > >     assert(hdev->vhost_ops);
> > > > > >     event_notifier_test_and_clear(
> > > > > >         &hdev->vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier);
> > > > > >     event_notifier_test_and_clear(&vdev->config_notifier);
> > > > > >
> > > > > >     trace_vhost_dev_stop(hdev, vdev->name, vrings);
> > > > > >
> > > > > >     if (hdev->vhost_ops->vhost_dev_start) {
> > > > > >         hdev->vhost_ops->vhost_dev_start(hdev, false);
> > > > > >         ^^^ SUSPEND ^^^
> > > > > >     }
> > > > > >     if (vrings) {
> > > > > >         vhost_dev_set_vring_enable(hdev, false);
> > > > > >     }
> > > > > >     for (i = 0; i < hdev->nvqs; ++i) {
> > > > > >         vhost_virtqueue_stop(hdev,
> > > > > >                              vdev,
> > > > > >                              hdev->vqs + i,
> > > > > >                              hdev->vq_index + i);
> > > > > >         ^^^ fetch virtqueue state from kernel ^^^
> > > > > >     }
> > > > > >     if (hdev->vhost_ops->vhost_reset_status) {
> > > > > >         hdev->vhost_ops->vhost_reset_status(hdev);
> > > > > >         ^^^ reset device ^^^
> > > > > >
> > > > > > I noticed the QEMU vDPA code resets the device in vhost_dev_stop() ->
> > > > > > vhost_reset_status(). The device's migration code runs after
> > > > > > vhost_dev_stop() and the state will have been lost.
> > > > > >
> > > > > vhost_virtqueue_stop saves the vq state (indexes, vring base) in the qemu VirtIONet device model. This is for all vhost backends.
> > > > >
> > > > > Regarding the state like mac or mq configuration, SVQ runs for all the VM run in the CVQ. So it can track all of that status in the device model too.
> > > > >
> > > > > When a migration effectively occurs, all the frontend state is migrated as a regular emulated device.
To route all of the state in a > > > > > normalized way for qemu is what leaves open the possibility to do > > > > > cross-backends migrations, etc. > > > > > > > > > > Does that answer your question? > > > > I think you're confirming that changes would be necessary in order for > > > > vDPA to support the save/load operation that Hanna is introducing. > > > > > > > Yes, this first iteration was centered on net, with an eye on block, > > > where state can be routed through classical emulated devices. This is > > > how vhost-kernel and vhost-user do classically. And it allows > > > cross-backend, to not modify qemu migration state, etc. > > > > > > To introduce this opaque state to qemu, that must be fetched after the > > > suspend and not before, requires changes in vhost protocol, as > > > discussed previously. > > > > > > > > > It looks like vDPA changes are necessary in order to support > > > > > > stateful > > > > > > devices even though QEMU already uses SUSPEND. Is my understanding > > > > > > correct? > > > > > > > > > > > Changes are required elsewhere, as the code to restore the state > > > > > properly in the destination has not been merged. > > > > I'm not sure what you mean by elsewhere? > > > > > > > I meant for vdpa *net* devices the changes are not required in vdpa > > > ioctls, but mostly in qemu. > > > > > > If you meant stateful as "it must have a state blob that it must be > > > opaque to qemu", then I think the straightforward action is to fetch > > > state blob about the same time as vq indexes. But yes, changes (at > > > least a new ioctl) is needed for that. > > > > > > > I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and > > > > then VHOST_VDPA_SET_STATUS 0. > > > > > > > > In order to save device state from the vDPA device in the future, it > > > > will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that > > > > the device state can be saved before the device is reset. > > > > > > > > Does that sound right? 
> > > > > > > The split between suspend and reset was added recently for that very > > > reason. In all the virtio devices, the frontend is initialized before > > > the backend, so I don't think it is a good idea to defer the backend > > > cleanup. Especially if we have already set the state is small enough > > > to not needing iterative migration from virtiofsd point of view. > > > > > > If fetching that state at the same time as vq indexes is not valid, > > > could it follow the same model as the "in-flight descriptors"? > > > vhost-user follows them by using a shared memory region where their > > > state is tracked [1]. This allows qemu to survive vhost-user SW > > > backend crashes, and does not forbid the cross-backends live migration > > > as all the information is there to recover them. > > > > > > For hw devices this is not convenient as it occupies PCI bandwidth. So > > > a possibility is to synchronize this memory region after a > > > synchronization point, being the SUSPEND call or GET_VRING_BASE. HW > > > devices are not going to crash in the software sense, so all use cases > > > remain the same to qemu. And that shared memory information is > > > recoverable after vhost_dev_stop. > > > > > > Does that sound reasonable to virtiofsd? To offer a shared memory > > > region where it dumps the state, maybe only after the > > > set_state(STATE_PHASE_STOPPED)? > > > > I don’t think we need the set_state() call, necessarily, if SUSPEND is > > mandatory anyway. > > > > As for the shared memory, the RFC before this series used shared memory, > > so it’s possible, yes. But “shared memory region” can mean a lot of > > things – it sounds like you’re saying the back-end (virtiofsd) should > > provide it to the front-end, is that right? That could work like this: > > > > On the source side: > > > > S1. SUSPEND goes to virtiofsd > > S2. 
virtiofsd maybe double-checks that the device is stopped, then > > serializes its state into a newly allocated shared memory area[1] > > S3. virtiofsd responds to SUSPEND > > S4. front-end requests shared memory, virtiofsd responds with a handle, > > maybe already closes its reference > > S5. front-end saves state, closes its handle, freeing the SHM > > > > [1] Maybe virtiofsd can correctly size the serialized state’s size, then > > it can immediately allocate this area and serialize directly into it; > > maybe it can’t, then we’ll need a bounce buffer. Not really a > > fundamental problem, but there are limitations around what you can do > > with serde implementations in Rust… > > > > On the destination side: > > > > D1. Optional SUSPEND goes to virtiofsd that hasn’t yet done much; > > virtiofsd would serialize its empty state into an SHM area, and respond > > to SUSPEND > > D2. front-end reads state from migration stream into an SHM it has allocated > > D3. front-end supplies this SHM to virtiofsd, which discards its > > previous area, and now uses this one > > D4. RESUME goes to virtiofsd, which deserializes the state from the SHM > > > > Couple of questions: > > > > A. Stefan suggested D1, but it does seem wasteful now. But if SUSPEND > > would imply to deserialize a state, and the state is to be transferred > > through SHM, this is what would need to be done. So maybe we should > > skip SUSPEND on the destination? > > B. You described that the back-end should supply the SHM, which works > > well on the source. On the destination, only the front-end knows how > > big the state is, so I’ve decided above that it should allocate the SHM > > (D2) and provide it to the back-end. Is that feasible or is it > > important (e.g. for real hardware) that the back-end supplies the SHM? > > (In which case the front-end would need to tell the back-end how big the > > state SHM needs to be.) > > How does this work for iterative live migration? 
> A pipe will always fit better for the iterative case from qemu's POV, that's for sure. Especially if we want to keep that opaqueness. But we will need to communicate with the HW device using shared memory sooner or later for big states. If we don't transform it in qemu, we will need to do it in the kernel. Also, the pipe will not survive daemon crashes. Again, I'm just putting this on the table in case it fits better or is more convenient. I missed the previous patch where SHM was proposed too, so maybe I missed some feedback that would be useful here. I think the pipe is a better solution in the long run because of the iterative part. Thanks!
On Thu, Apr 20, 2023 at 03:27:51PM +0200, Eugenio Pérez wrote: > On Tue, 2023-04-18 at 16:40 -0400, Stefan Hajnoczi wrote: > > On Tue, 18 Apr 2023 at 14:31, Eugenio Perez Martin <eperezma@redhat.com> > > wrote: > > > On Tue, Apr 18, 2023 at 7:59 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > On Tue, Apr 18, 2023 at 10:09:30AM +0200, Eugenio Perez Martin wrote: > > > > > On Mon, Apr 17, 2023 at 9:33 PM Stefan Hajnoczi <stefanha@gmail.com> > > > > > wrote: > > > > > > On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin < > > > > > > eperezma@redhat.com> wrote: > > > > > > > On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com > > > > > > > > wrote: > > > > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin > > > > > > > > wrote: > > > > > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi < > > > > > > > > > stefanha@redhat.com> wrote: > > > > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek > > > > > > > > > > wrote: > > > > > > > > > > > So-called "internal" virtio-fs migration refers to > > > > > > > > > > > transporting the > > > > > > > > > > > back-end's (virtiofsd's) state through qemu's migration > > > > > > > > > > > stream. To do > > > > > > > > > > > this, we need to be able to transfer virtiofsd's internal > > > > > > > > > > > state to and > > > > > > > > > > > from virtiofsd. > > > > > > > > > > > > > > > > > > > > > > Because virtiofsd's internal state will not be too large, we > > > > > > > > > > > believe it > > > > > > > > > > > is best to transfer it as a single binary blob after the > > > > > > > > > > > streaming > > > > > > > > > > > phase. Because this method should be useful to other vhost- > > > > > > > > > > > user > > > > > > > > > > > implementations, too, it is introduced as a general-purpose > > > > > > > > > > > addition to > > > > > > > > > > > the protocol, not limited to vhost-user-fs. 
> > > > > > > > > > > > > > > > > > > > > > These are the additions to the protocol: > > > > > > > > > > > - New vhost-user protocol feature > > > > > > > > > > > VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: > > > > > > > > > > > This feature signals support for transferring state, and > > > > > > > > > > > is added so > > > > > > > > > > > that migration can fail early when the back-end has no > > > > > > > > > > > support. > > > > > > > > > > > > > > > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end > > > > > > > > > > > negotiate a pipe > > > > > > > > > > > over which to transfer the state. The front-end sends an > > > > > > > > > > > FD to the > > > > > > > > > > > back-end into/from which it can write/read its state, and > > > > > > > > > > > the back-end > > > > > > > > > > > can decide to either use it, or reply with a different FD > > > > > > > > > > > for the > > > > > > > > > > > front-end to override the front-end's choice. > > > > > > > > > > > The front-end creates a simple pipe to transfer the state, > > > > > > > > > > > but maybe > > > > > > > > > > > the back-end already has an FD into/from which it has to > > > > > > > > > > > write/read > > > > > > > > > > > its state, in which case it will want to override the > > > > > > > > > > > simple pipe. > > > > > > > > > > > Conversely, maybe in the future we find a way to have the > > > > > > > > > > > front-end > > > > > > > > > > > get an immediate FD for the migration stream (in some > > > > > > > > > > > cases), in which > > > > > > > > > > > case we will want to send this to the back-end instead of > > > > > > > > > > > creating a > > > > > > > > > > > pipe. > > > > > > > > > > > Hence the negotiation: If one side has a better idea than > > > > > > > > > > > a plain > > > > > > > > > > > pipe, we will want to use that. 
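The negotiation rule described above — the back-end may override the front-end's plain pipe by replying with its own FD — boils down to one small decision. A sketch, with `choose_transfer_fd()` a hypothetical helper rather than an actual vhost-user function:

```c
#include <assert.h>
#include <unistd.h>

/* Front-end side of SET_DEVICE_STATE_FD: offer the plain pipe FD; if the
 * back-end's reply carries its own FD (reply_fd >= 0), that overrides the
 * front-end's choice and the plain pipe is dropped. */
static int choose_transfer_fd(int offered_fd, int reply_fd)
{
    if (reply_fd >= 0) {
        close(offered_fd);   /* back-end took over; the pipe is unused */
        return reply_fd;
    }
    return offered_fd;       /* back-end accepted the front-end's pipe */
}
```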
> > > > > > > > > > > > > > > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred > > > > > > > > > > > through the > > > > > > > > > > > pipe (the end indicated by EOF), the front-end invokes > > > > > > > > > > > this function > > > > > > > > > > > to verify success. There is no in-band way (through the > > > > > > > > > > > pipe) to > > > > > > > > > > > indicate failure, so we need to check explicitly. > > > > > > > > > > > > > > > > > > > > > > Once the transfer pipe has been established via > > > > > > > > > > > SET_DEVICE_STATE_FD > > > > > > > > > > > (which includes establishing the direction of transfer and > > > > > > > > > > > migration > > > > > > > > > > > phase), the sending side writes its data into the pipe, and > > > > > > > > > > > the reading > > > > > > > > > > > side reads it until it sees an EOF. Then, the front-end > > > > > > > > > > > will check for > > > > > > > > > > > success via CHECK_DEVICE_STATE, which on the destination > > > > > > > > > > > side includes > > > > > > > > > > > checking for integrity (i.e. errors during deserialization). 
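The transfer described above is simply "write, close, read until EOF, then verify via CHECK_DEVICE_STATE". A minimal sketch of the reading side, handling short reads; the helper name is hypothetical and the buffer is assumed large enough for the one-shot blob:

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>

/* Read the device-state blob from the negotiated pipe until EOF.
 * Returns the total number of bytes read, or -1 on a read error.
 * Only after this would the front-end issue CHECK_DEVICE_STATE, since
 * the pipe itself carries no in-band success/failure indication. */
static ssize_t read_until_eof(int fd, char *buf, size_t cap)
{
    size_t total = 0;
    ssize_t n;

    while ((n = read(fd, buf + total, cap - total)) > 0) {
        total += (size_t)n;
    }
    return n < 0 ? -1 : (ssize_t)total;
}
```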
> > > > > > > > > > > > > > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > > > > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > > > > > > > > > > > --- > > > > > > > > > > > include/hw/virtio/vhost-backend.h | 24 +++++ > > > > > > > > > > > include/hw/virtio/vhost.h | 79 ++++++++++++++++ > > > > > > > > > > > hw/virtio/vhost-user.c | 147 > > > > > > > > > > > ++++++++++++++++++++++++++++++ > > > > > > > > > > > hw/virtio/vhost.c | 37 ++++++++ > > > > > > > > > > > 4 files changed, 287 insertions(+) > > > > > > > > > > > > > > > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h > > > > > > > > > > > b/include/hw/virtio/vhost-backend.h > > > > > > > > > > > index ec3fbae58d..5935b32fe3 100644 > > > > > > > > > > > --- a/include/hw/virtio/vhost-backend.h > > > > > > > > > > > +++ b/include/hw/virtio/vhost-backend.h > > > > > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { > > > > > > > > > > > VHOST_SET_CONFIG_TYPE_MIGRATION = 1, > > > > > > > > > > > } VhostSetConfigType; > > > > > > > > > > > > > > > > > > > > > > +typedef enum VhostDeviceStateDirection { > > > > > > > > > > > + /* Transfer state from back-end (device) to front-end > > > > > > > > > > > */ > > > > > > > > > > > + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, > > > > > > > > > > > + /* Transfer state from front-end to back-end (device) > > > > > > > > > > > */ > > > > > > > > > > > + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, > > > > > > > > > > > +} VhostDeviceStateDirection; > > > > > > > > > > > + > > > > > > > > > > > +typedef enum VhostDeviceStatePhase { > > > > > > > > > > > + /* The device (and all its vrings) is stopped */ > > > > > > > > > > > + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, > > > > > > > > > > > +} VhostDeviceStatePhase; > > > > > > > > > > > > > > > > > > > > vDPA has: > > > > > > > > > > > > > > > > > > > > /* Suspend a device so it does not process virtqueue > > > > > > > > > > requests anymore > > > > > > > > > > * 
> > > > > > > > > > * After the return of ioctl the device must preserve all > > > > > > > > > > the necessary state > > > > > > > > > > * (the virtqueue vring base plus the possible device > > > > > > > > > > specific states) that is > > > > > > > > > > * required for restoring in the future. The device must not > > > > > > > > > > change its > > > > > > > > > > * configuration after that point. > > > > > > > > > > */ > > > > > > > > > > #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) > > > > > > > > > > > > > > > > > > > > /* Resume a device so it can resume processing virtqueue > > > > > > > > > > requests > > > > > > > > > > * > > > > > > > > > > * After the return of this ioctl the device will have > > > > > > > > > > restored all the > > > > > > > > > > * necessary states and it is fully operational to continue > > > > > > > > > > processing the > > > > > > > > > > * virtqueue descriptors. > > > > > > > > > > */ > > > > > > > > > > #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) > > > > > > > > > > > > > > > > > > > > I wonder if it makes sense to import these into vhost-user so > > > > > > > > > > that the > > > > > > > > > > difference between kernel vhost and vhost-user is minimized. > > > > > > > > > > It's okay > > > > > > > > > > if one of them is ahead of the other, but it would be nice to > > > > > > > > > > avoid > > > > > > > > > > overlapping/duplicated functionality. > > > > > > > > > > > > > > > > > > > > > > > > > > > > That's what I had in mind in the first versions. I proposed > > > > > > > > > VHOST_STOP > > > > > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did > > > > > > > > > change > > > > > > > > > to SUSPEND. > > > > > > > > > > > > > > > > I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not > > > > > > > > ioctl(VHOST_VDPA_RESUME). > > > > > > > > > > > > > > > > The doc comments in <linux/vdpa.h> don't explain how the device > > > > > > > > can > > > > > > > > leave the suspended state. 
Can you clarify this? > > > > > > > > > > > > > > > Do you mean in what situations or regarding the semantics of > > > > > > > _RESUME? > > > > > > > > To me resume is an operation mainly to resume the device in the > > > > > > > event > > > > > > > of a VM suspension, not a migration. It can be used as fallback > > > > > > > code > > > > > > > in some cases of migration failure though, but it is not currently > > > > > > > used in qemu. > > > > > > > > > > > > Is a "VM suspension" the QEMU HMP 'stop' command? > > > > > > > > > > > > I guess the reason why QEMU doesn't call RESUME anywhere is that it > > > > > > resets the device in vhost_dev_stop()? > > > > > > > > > > > > > > > > The actual reason for not using RESUME is that the ioctl was added > > > > > after the SUSPEND design in qemu. Same as this proposal, it was not > > > > > needed at the time. > > > > > > > > > > In the case of vhost-vdpa net, the only usage of suspend is to fetch > > > > > the vq indexes, and in case of error vhost already fetches them from > > > > > guest's used ring way before vDPA, so it has little usage. > > > > > > > > > > > Does it make sense to combine SUSPEND and RESUME with Hanna's > > > > > > SET_DEVICE_STATE_FD? For example, non-iterative migration works like > > > > > > this: > > > > > > - Saving the device's state is done by SUSPEND followed by > > > > > > SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g. > > > > > > savevm command or migration failed), then RESUME is called to > > > > > > continue. > > > > > > > > > > I think the previous steps make sense at vhost_dev_stop, not virtio > > > > > savevm handlers. Spreading this logic to more places of qemu > > > > > can bring confusion. > > > > I don't think there is a way around extending QEMU's vhost code > > > > model. The current model in QEMU's vhost code is that the backend is > > > > reset when the VM stops. 
This model worked fine for stateless devices > > > > but it doesn't work for stateful devices. > > > > > > > > Imagine a vdpa-gpu device: you cannot reset the device in > > > > vhost_dev_stop() and expect the GPU to continue working when > > > > vhost_dev_start() is called again because all its state has been lost. > > > > The guest driver will send requests that reference virtio-gpu > > > > resources that no longer exist. > > > > > > > > One solution is to save the device's state in vhost_dev_stop(). I think > > > > this is what you're suggesting. It requires keeping a copy of the state > > > > and then loading the state again in vhost_dev_start(). I don't think > > > > this approach should be used because it requires all stateful devices to > > > > support live migration (otherwise they break across HMP 'stop'/'cont'). > > > > Also, the device state for some devices may be large and it would also > > > > become more complicated when iterative migration is added. > > > > > > > > Instead, I think the QEMU vhost code needs to be structured so that > > > > struct vhost_dev has a suspended state: > > > > > > > > ,---------. > > > > v | > > > > started ------> stopped > > > > \ ^ > > > > \ | > > > > -> suspended > > > > > > > > The device doesn't lose state when it enters the suspended state. It can > > > > be resumed again. > > > > > > > > This is why I think SUSPEND/RESUME need to be part of the solution. > > I just realized that we can add an arrow from suspended to stopped, can't we? Yes, it could be used in the case of a successful live migration: [started] -> vhost_dev_suspend() [suspended] -> vhost_dev_stop() [stopped] > "Started" before seems to imply the device may process descriptors after > suspend. 
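The state machine in the diagram above, including the suspended-to-stopped arrow discussed just after it, can be written down as a small validity table. The enum and function names are hypothetical stand-ins, not actual QEMU code:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the vhost_dev lifecycle from the discussion:
 *   started   -> suspended  (SUSPEND; state is preserved)
 *   suspended -> started    (RESUME, e.g. after a failed migration)
 *   suspended -> stopped    (reset after a successful migration)
 *   started   -> stopped    (legacy stateless path: stop implies reset)
 *   stopped   -> started    (restart) */
typedef enum {
    VHOST_DEV_STOPPED,
    VHOST_DEV_STARTED,
    VHOST_DEV_SUSPENDED,
} VhostDevState;

static bool vhost_dev_transition_valid(VhostDevState from, VhostDevState to)
{
    switch (from) {
    case VHOST_DEV_STOPPED:
        return to == VHOST_DEV_STARTED;
    case VHOST_DEV_STARTED:
        return to == VHOST_DEV_SUSPENDED || to == VHOST_DEV_STOPPED;
    case VHOST_DEV_SUSPENDED:
        return to == VHOST_DEV_STARTED || to == VHOST_DEV_STOPPED;
    }
    return false;
}
```

The key property is that only the transitions into VHOST_DEV_STOPPED lose device state; suspension preserves it.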
Yes, in the case of a failed live migration: [started] -> vhost_dev_suspend() [suspended] -> vhost_dev_resume() [started] > > > > > > I agree with all of this, especially after realizing vhost_dev_stop is > > > called before the last request of the state in the iterative > > > migration. > > > > > > However I think we can move faster with the virtiofsd migration code, > > > as long as we agree on the vhost-user messages it will receive. This > > > is because we already agree that the state will be sent in one shot > > > and not iteratively, so it will be small. > > > > > > I understand this may change in the future, that's why I proposed to > > > start using the iterative approach right now. However it may make little sense if > > > it is not used in the vhost-user device. I also understand that other > > > devices may have a bigger state so it will be needed for them. > > > > Can you summarize how you'd like save to work today? I'm not sure what > > you have in mind. > > > > I think we're trying to find a solution that satisfies many things. On one > side, we're assuming that the virtiofsd state will be small enough that we can > assume it will not require iterative migration in the short term. However, > we also want to support iterative migration, for the sake of *other* future > vhost devices that may need it. > > I also think we should prioritize the protocol's stability, in the sense of not > adding calls that we will not reuse for iterative LM. The vhost-user protocol > is more important to maintain than qemu's migration code. > > The changes you mention will need to be implemented in the future. But we have > already established that the virtiofsd state is small, so we can just fetch it around the same > time we send the VHOST_USER_GET_VRING_BASE message and send the state with the > proposed non-iterative approach. VHOST_USER_GET_VRING_BASE itself isn't really enough because it stops a specific virtqueue but not the whole device. 
Unfortunately stopping all virtqueues is not the same as SUSPEND since spontaneous device activity is possible independent of any virtqueue (e.g. virtio-scsi events and maybe virtio-net link status). That's why I think SUSPEND is necessary for a solution that's generic enough to cover all device types. > If we agree on that, now the question is how to fetch the state from the device. The > answers are a little bit scattered in the mail threads, but I think we agree on: > a) We need to signal that the device must stop processing requests. > b) We need a way for the device to dump the state. > > At this moment I think any proposal satisfies a), and the pipe better satisfies b). > With proper backend feature flags, the device may support starting to write to > the pipe before SUSPEND, so we can implement iterative migration on top. > > Does that make sense? Yes, and that sounds like what Hanna is proposing for b) plus our discussion about SUSPEND/RESUME in order to achieve a). Stefan
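The agreed ordering of a) and b) — suspend first, dump the one-shot blob only once the device is quiesced, then have the front-end verify — can be sketched with stub callbacks standing in for the vhost-user messages. All names here are hypothetical:

```c
#include <assert.h>

/* Hypothetical per-backend ops: a) stop processing, b) dump state,
 * then the CHECK_DEVICE_STATE-style verification. */
typedef struct {
    int (*suspend)(void *opaque);                 /* a) SUSPEND            */
    int (*get_state)(void *opaque, int pipe_wfd); /* b) write blob to pipe */
    int (*check_state)(void *opaque);             /* CHECK_DEVICE_STATE    */
} BackendOps;

/* Run the save sequence in the only valid order; abort on first failure. */
static int save_backend_state(const BackendOps *ops, void *opaque, int pipe_wfd)
{
    if (ops->suspend(opaque) < 0) {
        return -1;
    }
    if (ops->get_state(opaque, pipe_wfd) < 0) {
        return -1;
    }
    return ops->check_state(opaque);
}

/* Demo stubs that assert the ordering suspend -> get_state -> check. */
static int demo_step;

static int demo_suspend(void *opaque)
{
    (void)opaque;
    assert(demo_step == 0);   /* nothing may run before SUSPEND */
    demo_step = 1;
    return 0;
}

static int demo_get_state(void *opaque, int pipe_wfd)
{
    (void)opaque;
    (void)pipe_wfd;
    assert(demo_step == 1);   /* state is dumped only after SUSPEND */
    demo_step = 2;
    return 0;
}

static int demo_check(void *opaque)
{
    (void)opaque;
    assert(demo_step == 2);   /* verification comes last */
    demo_step = 3;
    return 0;
}
```

An iterative extension, as suggested above, would relax only the first ordering constraint: with a suitable feature flag, get_state could begin streaming before suspend, with suspend marking the final pass.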
On Thu, Apr 20, 2023 at 03:29:44PM +0200, Eugenio Pérez wrote: > On Wed, 2023-04-19 at 07:21 -0400, Stefan Hajnoczi wrote: > > On Wed, 19 Apr 2023 at 07:10, Hanna Czenczek <hreitz@redhat.com> wrote: > > > On 18.04.23 09:54, Eugenio Perez Martin wrote: > > > > On Mon, Apr 17, 2023 at 9:21 PM Stefan Hajnoczi <stefanha@gmail.com> > > > > wrote: > > > > > On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com> > > > > > wrote: > > > > > > On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com> > > > > > > wrote: > > > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin > > > > > > > wrote: > > > > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi < > > > > > > > > stefanha@redhat.com> wrote: > > > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: > > > > > > > > > > So-called "internal" virtio-fs migration refers to > > > > > > > > > > transporting the > > > > > > > > > > back-end's (virtiofsd's) state through qemu's migration > > > > > > > > > > stream. To do > > > > > > > > > > this, we need to be able to transfer virtiofsd's internal > > > > > > > > > > state to and > > > > > > > > > > from virtiofsd. > > > > > > > > > > > > > > > > > > > > Because virtiofsd's internal state will not be too large, we > > > > > > > > > > believe it > > > > > > > > > > is best to transfer it as a single binary blob after the > > > > > > > > > > streaming > > > > > > > > > > phase. Because this method should be useful to other vhost- > > > > > > > > > > user > > > > > > > > > > implementations, too, it is introduced as a general-purpose > > > > > > > > > > addition to > > > > > > > > > > the protocol, not limited to vhost-user-fs. 
> > > > > > > > > > > > > > > > > > > > These are the additions to the protocol: > > > > > > > > > > - New vhost-user protocol feature > > > > > > > > > > VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: > > > > > > > > > > This feature signals support for transferring state, and is > > > > > > > > > > added so > > > > > > > > > > that migration can fail early when the back-end has no > > > > > > > > > > support. > > > > > > > > > > > > > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end > > > > > > > > > > negotiate a pipe > > > > > > > > > > over which to transfer the state. The front-end sends an > > > > > > > > > > FD to the > > > > > > > > > > back-end into/from which it can write/read its state, and > > > > > > > > > > the back-end > > > > > > > > > > can decide to either use it, or reply with a different FD > > > > > > > > > > for the > > > > > > > > > > front-end to override the front-end's choice. > > > > > > > > > > The front-end creates a simple pipe to transfer the state, > > > > > > > > > > but maybe > > > > > > > > > > the back-end already has an FD into/from which it has to > > > > > > > > > > write/read > > > > > > > > > > its state, in which case it will want to override the > > > > > > > > > > simple pipe. > > > > > > > > > > Conversely, maybe in the future we find a way to have the > > > > > > > > > > front-end > > > > > > > > > > get an immediate FD for the migration stream (in some > > > > > > > > > > cases), in which > > > > > > > > > > case we will want to send this to the back-end instead of > > > > > > > > > > creating a > > > > > > > > > > pipe. > > > > > > > > > > Hence the negotiation: If one side has a better idea than a > > > > > > > > > > plain > > > > > > > > > > pipe, we will want to use that. 
> > > > > > > > > > > > > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred > > > > > > > > > > through the > > > > > > > > > > pipe (the end indicated by EOF), the front-end invokes this > > > > > > > > > > function > > > > > > > > > > to verify success. There is no in-band way (through the > > > > > > > > > > pipe) to > > > > > > > > > > indicate failure, so we need to check explicitly. > > > > > > > > > > > > > > > > > > > > Once the transfer pipe has been established via > > > > > > > > > > SET_DEVICE_STATE_FD > > > > > > > > > > (which includes establishing the direction of transfer and > > > > > > > > > > migration > > > > > > > > > > phase), the sending side writes its data into the pipe, and > > > > > > > > > > the reading > > > > > > > > > > side reads it until it sees an EOF. Then, the front-end will > > > > > > > > > > check for > > > > > > > > > > success via CHECK_DEVICE_STATE, which on the destination side > > > > > > > > > > includes > > > > > > > > > > checking for integrity (i.e. errors during deserialization). 
> > > > > > > > > > > > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > > > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > > > > > > > > > > --- > > > > > > > > > > include/hw/virtio/vhost-backend.h | 24 +++++ > > > > > > > > > > include/hw/virtio/vhost.h | 79 ++++++++++++++++ > > > > > > > > > > hw/virtio/vhost-user.c | 147 > > > > > > > > > > ++++++++++++++++++++++++++++++ > > > > > > > > > > hw/virtio/vhost.c | 37 ++++++++ > > > > > > > > > > 4 files changed, 287 insertions(+) > > > > > > > > > > > > > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h > > > > > > > > > > b/include/hw/virtio/vhost-backend.h > > > > > > > > > > index ec3fbae58d..5935b32fe3 100644 > > > > > > > > > > --- a/include/hw/virtio/vhost-backend.h > > > > > > > > > > +++ b/include/hw/virtio/vhost-backend.h > > > > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { > > > > > > > > > > VHOST_SET_CONFIG_TYPE_MIGRATION = 1, > > > > > > > > > > } VhostSetConfigType; > > > > > > > > > > > > > > > > > > > > +typedef enum VhostDeviceStateDirection { > > > > > > > > > > + /* Transfer state from back-end (device) to front-end */ > > > > > > > > > > + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, > > > > > > > > > > + /* Transfer state from front-end to back-end (device) */ > > > > > > > > > > + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, > > > > > > > > > > +} VhostDeviceStateDirection; > > > > > > > > > > + > > > > > > > > > > +typedef enum VhostDeviceStatePhase { > > > > > > > > > > + /* The device (and all its vrings) is stopped */ > > > > > > > > > > + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, > > > > > > > > > > +} VhostDeviceStatePhase; > > > > > > > > > vDPA has: > > > > > > > > > > > > > > > > > > /* Suspend a device so it does not process virtqueue requests > > > > > > > > > anymore > > > > > > > > > * > > > > > > > > > * After the return of ioctl the device must preserve all the > > > > > > > > > necessary state > > > > > > > > > * 
(the virtqueue vring base plus the possible device > > > > > > > > > specific states) that is > > > > > > > > > * required for restoring in the future. The device must not > > > > > > > > > change its > > > > > > > > > * configuration after that point. > > > > > > > > > */ > > > > > > > > > #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) > > > > > > > > > > > > > > > > > > /* Resume a device so it can resume processing virtqueue > > > > > > > > > requests > > > > > > > > > * > > > > > > > > > * After the return of this ioctl the device will have > > > > > > > > > restored all the > > > > > > > > > * necessary states and it is fully operational to continue > > > > > > > > > processing the > > > > > > > > > * virtqueue descriptors. > > > > > > > > > */ > > > > > > > > > #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) > > > > > > > > > > > > > > > > > > I wonder if it makes sense to import these into vhost-user so > > > > > > > > > that the > > > > > > > > > difference between kernel vhost and vhost-user is minimized. > > > > > > > > > It's okay > > > > > > > > > if one of them is ahead of the other, but it would be nice to > > > > > > > > > avoid > > > > > > > > > overlapping/duplicated functionality. > > > > > > > > > > > > > > > > > That's what I had in mind in the first versions. I proposed > > > > > > > > VHOST_STOP > > > > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did > > > > > > > > change > > > > > > > > to SUSPEND. > > > > > > > > > > > > > > > > Generally it is better if we make the interface less parametrized > > > > > > > > and > > > > > > > > we trust in the messages and its semantics in my opinion. In other > > > > > > > > words, instead of > > > > > > > > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), > > > > > > > > send > > > > > > > > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user > > > > > > > > command. 
> > > > > > > > > > > > > > > > Another way to apply this is with the "direction" parameter. Maybe > > > > > > > > it > > > > > > > > is better to split it into "set_state_fd" and "get_state_fd"? > > > > > > > > > > > > > > > > In that case, reusing the ioctls as vhost-user messages would be > > > > > > > > ok. > > > > > > > > But that puts this proposal further from the VFIO code, which uses > > > > > > > > "migration_set_state(state)", and maybe it is better when the > > > > > > > > number > > > > > > > > of states is high. > > > > > > > Hi Eugenio, > > > > > > > Another question about vDPA suspend/resume: > > > > > > > > > > > > > > /* Host notifiers must be enabled at this point. */ > > > > > > > void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, > > > > > > > bool vrings) > > > > > > > { > > > > > > > int i; > > > > > > > > > > > > > > /* should only be called after backend is connected */ > > > > > > > assert(hdev->vhost_ops); > > > > > > > event_notifier_test_and_clear( > > > > > > > &hdev- > > > > > > > >vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier); > > > > > > > event_notifier_test_and_clear(&vdev->config_notifier); > > > > > > > > > > > > > > trace_vhost_dev_stop(hdev, vdev->name, vrings); > > > > > > > > > > > > > > if (hdev->vhost_ops->vhost_dev_start) { > > > > > > > hdev->vhost_ops->vhost_dev_start(hdev, false); > > > > > > > ^^^ SUSPEND ^^^ > > > > > > > } > > > > > > > if (vrings) { > > > > > > > vhost_dev_set_vring_enable(hdev, false); > > > > > > > } > > > > > > > for (i = 0; i < hdev->nvqs; ++i) { > > > > > > > vhost_virtqueue_stop(hdev, > > > > > > > vdev, > > > > > > > hdev->vqs + i, > > > > > > > hdev->vq_index + i); > > > > > > > ^^^ fetch virtqueue state from kernel ^^^ > > > > > > > } > > > > > > > if (hdev->vhost_ops->vhost_reset_status) { > > > > > > > hdev->vhost_ops->vhost_reset_status(hdev); > > > > > > > ^^^ reset device^^^ > > > > > > > > > > > > > > I noticed the QEMU vDPA code resets the device in 
vhost_dev_stop() > > > > > > > -> > > > > > > > vhost_reset_status(). The device's migration code runs after > > > > > > > vhost_dev_stop() and the state will have been lost. > > > > > > > > > > > > > vhost_virtqueue_stop saves the vq state (indexes, vring base) in the > > > > > > qemu VirtIONet device model. This is for all vhost backends. > > > > > > > > > > > > Regarding state like the mac or mq configuration, SVQ runs on the > > > > > > CVQ for the whole VM run, so it can track all of that state in the device > > > > > > model too. > > > > > > > > > > > > When a migration effectively occurs, all the frontend state is > > > > > > migrated as a regular emulated device. Routing all of the state through > > > > > > qemu in a normalized way is what leaves open the possibility of > > > > > > cross-backend migrations, etc. > > > > > > > > > > > > Does that answer your question? > > > > > I think you're confirming that changes would be necessary in order for > > > > > vDPA to support the save/load operation that Hanna is introducing. > > > > > > > > > Yes, this first iteration was centered on net, with an eye on block, > > > > where state can be routed through classical emulated devices. This is > > > > how vhost-kernel and vhost-user have classically done it. And it allows > > > > cross-backend migration, avoids modifying qemu's migration state, etc. > > > > > > > > Introducing this opaque state to qemu, which must be fetched after the > > > > suspend and not before, requires changes to the vhost protocol, as > > > > discussed previously. > > > > > > > > > > > It looks like vDPA changes are necessary in order to support > > > > > > > stateful > > > > > > > devices even though QEMU already uses SUSPEND. Is my understanding > > > > > > > correct? > > > > > > > > > > > > > Changes are required elsewhere, as the code to restore the state > > > > > > properly in the destination has not been merged. > > > > > I'm not sure what you mean by elsewhere? 
> > > > > > > > > I meant for vdpa *net* devices the changes are not required in vdpa > > > > ioctls, but mostly in qemu. > > > > > > > > If you meant stateful as "it must have a state blob that is > > > > opaque to qemu", then I think the straightforward action is to fetch > > > > the state blob about the same time as the vq indexes. But yes, changes (at > > > > least a new ioctl) are needed for that. > > > > > > > > > I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and > > > > > then VHOST_VDPA_SET_STATUS 0. > > > > > > > > > > In order to save device state from the vDPA device in the future, it > > > > > will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that > > > > > the device state can be saved before the device is reset. > > > > > > > > > > Does that sound right? > > > > > > > > > The split between suspend and reset was added recently for that very > > > > reason. In all the virtio devices, the frontend is initialized before > > > > the backend, so I don't think it is a good idea to defer the backend > > > > cleanup. Especially since we have already agreed that the state is small enough > > > > not to need iterative migration from virtiofsd's point of view. > > > > > > > > If fetching that state at the same time as vq indexes is not valid, > > > > could it follow the same model as the "in-flight descriptors"? > > > > vhost-user follows them by using a shared memory region where their > > > > state is tracked [1]. This allows qemu to survive vhost-user SW > > > > backend crashes, and does not forbid cross-backend live migration > > > > as all the information is there to recover them. > > > > > > > > For hw devices this is not convenient as it occupies PCI bandwidth. So > > > > a possibility is to synchronize this memory region after a > > > > synchronization point, such as the SUSPEND call or GET_VRING_BASE. HW > > > > devices are not going to crash in the software sense, so all use cases > > > > remain the same to qemu. 
And that shared memory information is > > > > recoverable after vhost_dev_stop. > > > > > > > > Does that sound reasonable to virtiofsd? To offer a shared memory > > > > region where it dumps the state, maybe only after the > > > > set_state(STATE_PHASE_STOPPED)? > > > > > > I don’t think we need the set_state() call, necessarily, if SUSPEND is > > > mandatory anyway. > > > > > > As for the shared memory, the RFC before this series used shared memory, > > > so it’s possible, yes. But “shared memory region” can mean a lot of > > > things – it sounds like you’re saying the back-end (virtiofsd) should > > > provide it to the front-end, is that right? That could work like this: > > > > > > On the source side: > > > > > > S1. SUSPEND goes to virtiofsd > > > S2. virtiofsd maybe double-checks that the device is stopped, then > > > serializes its state into a newly allocated shared memory area[1] > > > S3. virtiofsd responds to SUSPEND > > > S4. front-end requests shared memory, virtiofsd responds with a handle, > > > maybe already closes its reference > > > S5. front-end saves state, closes its handle, freeing the SHM > > > > > > [1] Maybe virtiofsd can correctly size the serialized state’s size, then > > > it can immediately allocate this area and serialize directly into it; > > > maybe it can’t, then we’ll need a bounce buffer. Not really a > > > fundamental problem, but there are limitations around what you can do > > > with serde implementations in Rust… > > > > > > On the destination side: > > > > > > D1. Optional SUSPEND goes to virtiofsd that hasn’t yet done much; > > > virtiofsd would serialize its empty state into an SHM area, and respond > > > to SUSPEND > > > D2. front-end reads state from migration stream into an SHM it has allocated > > > D3. front-end supplies this SHM to virtiofsd, which discards its > > > previous area, and now uses this one > > > D4. 
RESUME goes to virtiofsd, which deserializes the state from the SHM > > > > > > Couple of questions: > > > > > > A. Stefan suggested D1, but it does seem wasteful now. But if SUSPEND > > > would imply to deserialize a state, and the state is to be transferred > > > through SHM, this is what would need to be done. So maybe we should > > > skip SUSPEND on the destination? > > > B. You described that the back-end should supply the SHM, which works > > > well on the source. On the destination, only the front-end knows how > > > big the state is, so I’ve decided above that it should allocate the SHM > > > (D2) and provide it to the back-end. Is that feasible or is it > > > important (e.g. for real hardware) that the back-end supplies the SHM? > > > (In which case the front-end would need to tell the back-end how big the > > > state SHM needs to be.) > > > > How does this work for iterative live migration? > > > > A pipe will always fit better for iterative from qemu POV, that's for sure. > Especially if we want to keep that opaqueness. > > But we will need to communicate with the HW device using shared memory sooner > or later for big states. If we don't transform it in qemu, we will need to do > it in the kernel. Also, the pipe will not support daemon crashes. > > Again I'm just putting this on the table, just in case it fits better or it is > convenient. I missed the previous patch where SHM was proposed too, so maybe I > missed some feedback useful here. I think the pipe is a better solution in the > long run because of the iterative part. Pipes and shared memory are conceptually equivalent for building streaming interfaces. It's just more complex to design a shared memory interface and it reinvents what file descriptors already offer. I have no doubt we could design iterative migration over a shared memory interface if we needed to, but I'm not sure why? 
When you mention hardware, are you suggesting defining a standard memory/register layout that hardware implements and mapping it to userspace (QEMU)? Is there a big advantage to exposing memory versus a file descriptor? Stefan
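The S1–S5 flow discussed above maps naturally onto Linux anonymous shared memory. The following is a minimal sketch, assuming a Linux host with `memfd_create`; the function names and the flat `memcpy` "serialization" are invented for illustration and are not part of the proposal:

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* S2: the back-end allocates an SHM area sized for its serialized
 * state and serializes directly into it. */
static int backend_export_state(const void *state, size_t len)
{
    int fd = memfd_create("virtiofsd-state", 0); /* name is illustrative */
    if (fd < 0 || ftruncate(fd, len) < 0) {
        return -1;
    }
    void *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }
    memcpy(map, state, len);   /* stand-in for real serialization */
    munmap(map, len);
    return fd;                 /* S4: handle passed to the front-end */
}

/* S5: the front-end maps the handle, saves the state into the
 * migration stream (here: a plain buffer), and drops the last
 * reference, freeing the SHM. */
static ssize_t frontend_save_state(int fd, void *buf, size_t len)
{
    void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        return -1;
    }
    memcpy(buf, map, len);
    munmap(map, len);
    close(fd);
    return (ssize_t)len;
}
```

On the destination the roles of D2/D3 would invert: the front-end would allocate the area, fill it from the migration stream, and pass the FD to the back-end, which maps and deserializes from it.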
On Mon, May 8, 2023 at 9:12 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > On Thu, Apr 20, 2023 at 03:27:51PM +0200, Eugenio Pérez wrote: > > On Tue, 2023-04-18 at 16:40 -0400, Stefan Hajnoczi wrote: > > > On Tue, 18 Apr 2023 at 14:31, Eugenio Perez Martin <eperezma@redhat.com> > > > wrote: > > > > On Tue, Apr 18, 2023 at 7:59 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > > On Tue, Apr 18, 2023 at 10:09:30AM +0200, Eugenio Perez Martin wrote: > > > > > > On Mon, Apr 17, 2023 at 9:33 PM Stefan Hajnoczi <stefanha@gmail.com> > > > > > > wrote: > > > > > > > On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin < > > > > > > > eperezma@redhat.com> wrote: > > > > > > > > On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com > > > > > > > > > wrote: > > > > > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin > > > > > > > > > wrote: > > > > > > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi < > > > > > > > > > > stefanha@redhat.com> wrote: > > > > > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek > > > > > > > > > > > wrote: > > > > > > > > > > > > So-called "internal" virtio-fs migration refers to > > > > > > > > > > > > transporting the > > > > > > > > > > > > back-end's (virtiofsd's) state through qemu's migration > > > > > > > > > > > > stream. To do > > > > > > > > > > > > this, we need to be able to transfer virtiofsd's internal > > > > > > > > > > > > state to and > > > > > > > > > > > > from virtiofsd. > > > > > > > > > > > > > > > > > > > > > > > > Because virtiofsd's internal state will not be too large, we > > > > > > > > > > > > believe it > > > > > > > > > > > > is best to transfer it as a single binary blob after the > > > > > > > > > > > > streaming > > > > > > > > > > > > phase. 
Because this method should be useful to other vhost- > > > > > > > > > > > > user > > > > > > > > > > > > implementations, too, it is introduced as a general-purpose > > > > > > > > > > > > addition to > > > > > > > > > > > > the protocol, not limited to vhost-user-fs. > > > > > > > > > > > > > > > > > > > > > > > > These are the additions to the protocol: > > > > > > > > > > > > - New vhost-user protocol feature > > > > > > > > > > > > VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: > > > > > > > > > > > > This feature signals support for transferring state, and > > > > > > > > > > > > is added so > > > > > > > > > > > > that migration can fail early when the back-end has no > > > > > > > > > > > > support. > > > > > > > > > > > > > > > > > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end > > > > > > > > > > > > negotiate a pipe > > > > > > > > > > > > over which to transfer the state. The front-end sends an > > > > > > > > > > > > FD to the > > > > > > > > > > > > back-end into/from which it can write/read its state, and > > > > > > > > > > > > the back-end > > > > > > > > > > > > can decide to either use it, or reply with a different FD > > > > > > > > > > > > for the > > > > > > > > > > > > front-end to override the front-end's choice. > > > > > > > > > > > > The front-end creates a simple pipe to transfer the state, > > > > > > > > > > > > but maybe > > > > > > > > > > > > the back-end already has an FD into/from which it has to > > > > > > > > > > > > write/read > > > > > > > > > > > > its state, in which case it will want to override the > > > > > > > > > > > > simple pipe. 
> > > > > > > > > > > > Conversely, maybe in the future we find a way to have the > > > > > > > > > > > > front-end > > > > > > > > > > > > get an immediate FD for the migration stream (in some > > > > > > > > > > > > cases), in which > > > > > > > > > > > > case we will want to send this to the back-end instead of > > > > > > > > > > > > creating a > > > > > > > > > > > > pipe. > > > > > > > > > > > > Hence the negotiation: If one side has a better idea than > > > > > > > > > > > > a plain > > > > > > > > > > > > pipe, we will want to use that. > > > > > > > > > > > > > > > > > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred > > > > > > > > > > > > through the > > > > > > > > > > > > pipe (the end indicated by EOF), the front-end invokes > > > > > > > > > > > > this function > > > > > > > > > > > > to verify success. There is no in-band way (through the > > > > > > > > > > > > pipe) to > > > > > > > > > > > > indicate failure, so we need to check explicitly. > > > > > > > > > > > > > > > > > > > > > > > > Once the transfer pipe has been established via > > > > > > > > > > > > SET_DEVICE_STATE_FD > > > > > > > > > > > > (which includes establishing the direction of transfer and > > > > > > > > > > > > migration > > > > > > > > > > > > phase), the sending side writes its data into the pipe, and > > > > > > > > > > > > the reading > > > > > > > > > > > > side reads it until it sees an EOF. Then, the front-end > > > > > > > > > > > > will check for > > > > > > > > > > > > success via CHECK_DEVICE_STATE, which on the destination > > > > > > > > > > > > side includes > > > > > > > > > > > > checking for integrity (i.e. errors during deserialization). 
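As a rough sketch of the save direction just described, assuming the front-end's plain-pipe choice is not overridden by the back-end; the helper names are invented, and the FD passing over the vhost-user socket plus the final CHECK_DEVICE_STATE round-trip are omitted:

```c
#include <unistd.h>

/* Back-end side: write the state blob into the FD received with
 * SET_DEVICE_STATE_FD, then close it -- EOF marks the end of the
 * state.  There is no in-band error channel. */
static int backend_write_state(int fd, const void *state, size_t len)
{
    const char *p = state;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    close(fd);
    return 0;
}

/* Front-end side: read until EOF; afterwards the caller would issue
 * CHECK_DEVICE_STATE to learn whether the back-end actually
 * succeeded. */
static ssize_t frontend_read_state(int fd, void *buf, size_t cap)
{
    size_t total = 0;
    for (;;) {
        ssize_t n = read(fd, (char *)buf + total, cap - total);
        if (n < 0) {
            return -1;
        }
        if (n == 0) {          /* EOF: transfer complete */
            break;
        }
        total += (size_t)n;
    }
    close(fd);
    return (ssize_t)total;
}
```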
> > > > > > > > > > > > > > > > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > > > > > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > > > > > > > > > > > > --- > > > > > > > > > > > > include/hw/virtio/vhost-backend.h | 24 +++++ > > > > > > > > > > > > include/hw/virtio/vhost.h | 79 ++++++++++++++++ > > > > > > > > > > > > hw/virtio/vhost-user.c | 147 > > > > > > > > > > > > ++++++++++++++++++++++++++++++ > > > > > > > > > > > > hw/virtio/vhost.c | 37 ++++++++ > > > > > > > > > > > > 4 files changed, 287 insertions(+) > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h > > > > > > > > > > > > b/include/hw/virtio/vhost-backend.h > > > > > > > > > > > > index ec3fbae58d..5935b32fe3 100644 > > > > > > > > > > > > --- a/include/hw/virtio/vhost-backend.h > > > > > > > > > > > > +++ b/include/hw/virtio/vhost-backend.h > > > > > > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { > > > > > > > > > > > > VHOST_SET_CONFIG_TYPE_MIGRATION = 1, > > > > > > > > > > > > } VhostSetConfigType; > > > > > > > > > > > > > > > > > > > > > > > > +typedef enum VhostDeviceStateDirection { > > > > > > > > > > > > + /* Transfer state from back-end (device) to front-end > > > > > > > > > > > > */ > > > > > > > > > > > > + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, > > > > > > > > > > > > + /* Transfer state from front-end to back-end (device) > > > > > > > > > > > > */ > > > > > > > > > > > > + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, > > > > > > > > > > > > +} VhostDeviceStateDirection; > > > > > > > > > > > > + > > > > > > > > > > > > +typedef enum VhostDeviceStatePhase { > > > > > > > > > > > > + /* The device (and all its vrings) is stopped */ > > > > > > > > > > > > + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, > > > > > > > > > > > > +} VhostDeviceStatePhase; > > > > > > > > > > > > > > > > > > > > > > vDPA has: > > > > > > > > > > > > > > > > > > > > > > /* Suspend a device so it does not 
process virtqueue > > > > > > > > > > > requests anymore > > > > > > > > > > > * > > > > > > > > > > > * After the return of ioctl the device must preserve all > > > > > > > > > > > the necessary state > > > > > > > > > > > * (the virtqueue vring base plus the possible device > > > > > > > > > > > specific states) that is > > > > > > > > > > > * required for restoring in the future. The device must not > > > > > > > > > > > change its > > > > > > > > > > > * configuration after that point. > > > > > > > > > > > */ > > > > > > > > > > > #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) > > > > > > > > > > > > > > > > > > > > > > /* Resume a device so it can resume processing virtqueue > > > > > > > > > > > requests > > > > > > > > > > > * > > > > > > > > > > > * After the return of this ioctl the device will have > > > > > > > > > > > restored all the > > > > > > > > > > > * necessary states and it is fully operational to continue > > > > > > > > > > > processing the > > > > > > > > > > > * virtqueue descriptors. > > > > > > > > > > > */ > > > > > > > > > > > #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) > > > > > > > > > > > > > > > > > > > > > > I wonder if it makes sense to import these into vhost-user so > > > > > > > > > > > that the > > > > > > > > > > > difference between kernel vhost and vhost-user is minimized. > > > > > > > > > > > It's okay > > > > > > > > > > > if one of them is ahead of the other, but it would be nice to > > > > > > > > > > > avoid > > > > > > > > > > > overlapping/duplicated functionality. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > That's what I had in mind in the first versions. I proposed > > > > > > > > > > VHOST_STOP > > > > > > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did > > > > > > > > > > change > > > > > > > > > > to SUSPEND. > > > > > > > > > > > > > > > > > > I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not > > > > > > > > > ioctl(VHOST_VDPA_RESUME). 
> > > > > > > > > > > > > > > > > > The doc comments in <linux/vdpa.h> don't explain how the device > > > > > > > > > can > > > > > > > > > leave the suspended state. Can you clarify this? > > > > > > > > > > > > > > > > > > > > > > > > > Do you mean in what situations or regarding the semantics of > > > > > > > > _RESUME? > > > > > > > > > > > > > > > > To me resume is an operation mainly to resume the device in the > > > > > > > > event > > > > > > > > of a VM suspension, not a migration. It can be used as a fallback > > > > > > > > code > > > > > > > > in some cases of migration failure though, but it is not currently > > > > > > > > used in qemu. > > > > > > > > > > > > > > Is a "VM suspension" the QEMU HMP 'stop' command? > > > > > > > > > > > > > > I guess the reason why QEMU doesn't call RESUME anywhere is that it > > > > > > > resets the device in vhost_dev_stop()? > > > > > > > > > > > > > > > > > > > The actual reason for not using RESUME is that the ioctl was added > > > > > > after the SUSPEND design in qemu. Same as this proposal, it is was not > > > > > > needed at the time. > > > > > > > > > > > > In the case of vhost-vdpa net, the only usage of suspend is to fetch > > > > > > the vq indexes, and in case of error vhost already fetches them from > > > > > > guest's used ring way before vDPA, so it has little usage. > > > > > > > > > > > > > Does it make sense to combine SUSPEND and RESUME with Hanna's > > > > > > > SET_DEVICE_STATE_FD? For example, non-iterative migration works like > > > > > > > this: > > > > > > > - Saving the device's state is done by SUSPEND followed by > > > > > > > SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g. > > > > > > > savevm command or migration failed), then RESUME is called to > > > > > > > continue. > > > > > > > > > > > > I think the previous steps make sense at vhost_dev_stop, not virtio > > > > > > savevm handlers. 
Spreading this logic to more places of qemu > > > > > > can bring confusion. > > > > > I don't think there is a way around extending the QEMU vhost's code > > > > > model. The current model in QEMU's vhost code is that the backend is > > > > > reset when the VM stops. This model worked fine for stateless devices > > > > > but it doesn't work for stateful devices. > > > > > > > > > > Imagine a vdpa-gpu device: you cannot reset the device in > > > > > vhost_dev_stop() and expect the GPU to continue working when > > > > > vhost_dev_start() is called again because all its state has been lost. > > > > > The guest driver will send requests that reference virtio-gpu > > > > > resources that no longer exist. > > > > > > > > > > One solution is to save the device's state in vhost_dev_stop(). I think > > > > > this is what you're suggesting. It requires keeping a copy of the state > > > > > and then loading the state again in vhost_dev_start(). I don't think > > > > > this approach should be used because it requires all stateful devices to > > > > > support live migration (otherwise they break across HMP 'stop'/'cont'). > > > > > Also, the device state for some devices may be large and it would also > > > > > become more complicated when iterative migration is added. > > > > > > > > > > Instead, I think the QEMU vhost code needs to be structured so that > > > > > struct vhost_dev has a suspended state: > > > > > > > > > > ,---------. > > > > > v | > > > > > started ------> stopped > > > > > \ ^ > > > > > \ | > > > > > -> suspended > > > > > > > > > > The device doesn't lose state when it enters the suspended state. It can > > > > > be resumed again. > > > > > > > > > > This is why I think SUSPEND/RESUME need to be part of the solution. > > > > I just realized that we can add an arrow from suspended to stopped, can't we? 
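The diagram, together with the suspended -> stopped arrow suggested at the end, can be captured as a small transition check; the names are illustrative, not actual QEMU identifiers:

```c
#include <stdbool.h>

/* Sketch of the proposed vhost_dev lifecycle: stopped <-> started,
 * started -> suspended, suspended -> started (resume), and the extra
 * suspended -> stopped edge for a successful migration. */
enum dev_state { DEV_STOPPED, DEV_STARTED, DEV_SUSPENDED };

static bool dev_transition_ok(enum dev_state from, enum dev_state to)
{
    switch (from) {
    case DEV_STOPPED:
        return to == DEV_STARTED;                      /* start */
    case DEV_STARTED:
        return to == DEV_STOPPED || to == DEV_SUSPENDED;
    case DEV_SUSPENDED:
        /* resume, or stop directly after a successful migration */
        return to == DEV_STARTED || to == DEV_STOPPED;
    }
    return false;
}
```

The key property is that no edge out of `DEV_SUSPENDED` loses state: the device keeps its state and either resumes or is finally stopped.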
> > Yes, it could be used in the case of a successful live migration: > [started] -> vhost_dev_suspend() [suspended] -> vhost_dev_stop() [stopped] > > > "Started" before seems to imply the device may process descriptors after > > suspend. > > Yes, in the case of a failed live migration: > [started] -> vhost_dev_suspend() [suspended] -> vhost_dev_resume() [started] > I meant "the device may (is allowed to) process descriptors after suspend and before it is stopped". I think we have the same view here, just trying to specify the semantics as completely as possible :). > > > > > > > > I agree with all of this, especially after realizing vhost_dev_stop is > > > > called before the last request of the state in the iterative > > > > migration. > > > > > > > > However I think we can move faster with the virtiofsd migration code, > > > > as long as we agree on the vhost-user messages it will receive. This > > > > is because we already agree that the state will be sent in one shot > > > > and not iteratively, so it will be small. > > > > > > > > I understand this may change in the future, that's why I proposed to > > > > start using iterative right now. However it may make little sense if > > > > it is not used in the vhost-user device. I also understand that other > > > > devices may have a bigger state so it will be needed for them. > > > > > > Can you summarize how you'd like save to work today? I'm not sure what > > > you have in mind. > > > > > > > I think we're trying to find a solution that satisfies many things. On one > > side, we're assuming that the virtiofsd state will be small enough that > > it will not require iterative migration in the short term. However, > > we also want to support iterative migration, for the sake of *other* future > > vhost devices that may need it. > > > > I also think we should prioritize the protocol's stability, in the sense of not > > adding calls that we will not reuse for iterative LM. 
The vhost-user protocol > > is more important to maintain than the qemu migration code. > > > > Implementing the changes you mention will be needed in the future. But we have > > already established that the virtiofsd state is small, so we can just fetch it at the same > > time as we send the VHOST_USER_GET_VRING_BASE message and send the state with the > > proposed non-iterative approach. > > VHOST_USER_GET_VRING_BASE itself isn't really enough because it stops a > specific virtqueue but not the whole device. Unfortunately stopping all > virtqueues is not the same as SUSPEND since spontaneous device activity > is possible independent of any virtqueue (e.g. virtio-scsi events and > maybe virtio-net link status). > > That's why I think SUSPEND is necessary for a solution that's generic > enough to cover all device types. > I agree. In particular virtiofsd is already resetting the whole device at VHOST_USER_GET_VRING_BASE if I'm not wrong, so that's even more of a reason to implement a suspend call. Thanks! > > If we agree on that, now the question is how to fetch the state from the device. The > > answers are a little bit scattered in the mail threads, but I think we agree on: > > a) We need to signal that the device must stop processing requests. > > b) We need a way for the device to dump the state. > > > > At this moment I think any proposal satisfies a), and a pipe better satisfies b). > > With proper backend feature flags, the device may support starting to write to > > the pipe before SUSPEND so we can implement iterative migration on top. > > > > Does that make sense? > > Yes, and that sounds like what Hanna is proposing for b) plus our > discussion about SUSPEND/RESUME in order to achieve a). > > Stefan
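The "start writing to the pipe before SUSPEND" idea in b) only requires the stream to be self-delimiting per pass. A hedged sketch with an invented framing (32-bit length prefix per chunk, zero-length terminator written after SUSPEND); none of this framing is part of the actual proposal:

```c
#include <stdint.h>
#include <unistd.h>

/* Sender (back-end): emit one length-prefixed state chunk.  Pre-SUSPEND
 * iterative passes and the final post-SUSPEND delta use the same shape;
 * a zero-length chunk terminates the stream. */
static int send_chunk(int fd, const void *data, uint32_t len)
{
    if (write(fd, &len, sizeof(len)) != (ssize_t)sizeof(len)) {
        return -1;
    }
    if (len && write(fd, data, len) != (ssize_t)len) {
        return -1;
    }
    return 0;
}

/* Receiver (front-end): accumulate chunks until the terminator. */
static ssize_t recv_chunks(int fd, void *buf, size_t cap)
{
    size_t total = 0;
    for (;;) {
        uint32_t len;
        if (read(fd, &len, sizeof(len)) != (ssize_t)sizeof(len)) {
            return -1;
        }
        if (len == 0) {                 /* terminator after SUSPEND */
            return (ssize_t)total;
        }
        if (total + len > cap ||
            read(fd, (char *)buf + total, len) != (ssize_t)len) {
            return -1;
        }
        total += len;
    }
}
```

A non-iterative back-end would simply send a single chunk followed by the terminator, so the same wire format covers both cases.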
On Mon, May 8, 2023 at 10:10 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > On Thu, Apr 20, 2023 at 03:29:44PM +0200, Eugenio Pérez wrote: > > On Wed, 2023-04-19 at 07:21 -0400, Stefan Hajnoczi wrote: > > > On Wed, 19 Apr 2023 at 07:10, Hanna Czenczek <hreitz@redhat.com> wrote: > > > > On 18.04.23 09:54, Eugenio Perez Martin wrote: > > > > > On Mon, Apr 17, 2023 at 9:21 PM Stefan Hajnoczi <stefanha@gmail.com> > > > > > wrote: > > > > > > On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com> > > > > > > wrote: > > > > > > > On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com> > > > > > > > wrote: > > > > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin > > > > > > > > wrote: > > > > > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi < > > > > > > > > > stefanha@redhat.com> wrote: > > > > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: > > > > > > > > > > > So-called "internal" virtio-fs migration refers to > > > > > > > > > > > transporting the > > > > > > > > > > > back-end's (virtiofsd's) state through qemu's migration > > > > > > > > > > > stream. To do > > > > > > > > > > > this, we need to be able to transfer virtiofsd's internal > > > > > > > > > > > state to and > > > > > > > > > > > from virtiofsd. > > > > > > > > > > > > > > > > > > > > > > Because virtiofsd's internal state will not be too large, we > > > > > > > > > > > believe it > > > > > > > > > > > is best to transfer it as a single binary blob after the > > > > > > > > > > > streaming > > > > > > > > > > > phase. Because this method should be useful to other vhost- > > > > > > > > > > > user > > > > > > > > > > > implementations, too, it is introduced as a general-purpose > > > > > > > > > > > addition to > > > > > > > > > > > the protocol, not limited to vhost-user-fs. 
> > > > > > > > > > > > > > > > > > > > > > These are the additions to the protocol: > > > > > > > > > > > - New vhost-user protocol feature > > > > > > > > > > > VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: > > > > > > > > > > > This feature signals support for transferring state, and is > > > > > > > > > > > added so > > > > > > > > > > > that migration can fail early when the back-end has no > > > > > > > > > > > support. > > > > > > > > > > > > > > > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end > > > > > > > > > > > negotiate a pipe > > > > > > > > > > > over which to transfer the state. The front-end sends an > > > > > > > > > > > FD to the > > > > > > > > > > > back-end into/from which it can write/read its state, and > > > > > > > > > > > the back-end > > > > > > > > > > > can decide to either use it, or reply with a different FD > > > > > > > > > > > for the > > > > > > > > > > > front-end to override the front-end's choice. > > > > > > > > > > > The front-end creates a simple pipe to transfer the state, > > > > > > > > > > > but maybe > > > > > > > > > > > the back-end already has an FD into/from which it has to > > > > > > > > > > > write/read > > > > > > > > > > > its state, in which case it will want to override the > > > > > > > > > > > simple pipe. > > > > > > > > > > > Conversely, maybe in the future we find a way to have the > > > > > > > > > > > front-end > > > > > > > > > > > get an immediate FD for the migration stream (in some > > > > > > > > > > > cases), in which > > > > > > > > > > > case we will want to send this to the back-end instead of > > > > > > > > > > > creating a > > > > > > > > > > > pipe. > > > > > > > > > > > Hence the negotiation: If one side has a better idea than a > > > > > > > > > > > plain > > > > > > > > > > > pipe, we will want to use that. 
> > > > > > > > > > > > > > > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred > > > > > > > > > > > through the > > > > > > > > > > > pipe (the end indicated by EOF), the front-end invokes this > > > > > > > > > > > function > > > > > > > > > > > to verify success. There is no in-band way (through the > > > > > > > > > > > pipe) to > > > > > > > > > > > indicate failure, so we need to check explicitly. > > > > > > > > > > > > > > > > > > > > > > Once the transfer pipe has been established via > > > > > > > > > > > SET_DEVICE_STATE_FD > > > > > > > > > > > (which includes establishing the direction of transfer and > > > > > > > > > > > migration > > > > > > > > > > > phase), the sending side writes its data into the pipe, and > > > > > > > > > > > the reading > > > > > > > > > > > side reads it until it sees an EOF. Then, the front-end will > > > > > > > > > > > check for > > > > > > > > > > > success via CHECK_DEVICE_STATE, which on the destination side > > > > > > > > > > > includes > > > > > > > > > > > checking for integrity (i.e. errors during deserialization). 
> > > > > > > > > > > > > > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > > > > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > > > > > > > > > > > --- > > > > > > > > > > > include/hw/virtio/vhost-backend.h | 24 +++++ > > > > > > > > > > > include/hw/virtio/vhost.h | 79 ++++++++++++++++ > > > > > > > > > > > hw/virtio/vhost-user.c | 147 > > > > > > > > > > > ++++++++++++++++++++++++++++++ > > > > > > > > > > > hw/virtio/vhost.c | 37 ++++++++ > > > > > > > > > > > 4 files changed, 287 insertions(+) > > > > > > > > > > > > > > > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h > > > > > > > > > > > b/include/hw/virtio/vhost-backend.h > > > > > > > > > > > index ec3fbae58d..5935b32fe3 100644 > > > > > > > > > > > --- a/include/hw/virtio/vhost-backend.h > > > > > > > > > > > +++ b/include/hw/virtio/vhost-backend.h > > > > > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { > > > > > > > > > > > VHOST_SET_CONFIG_TYPE_MIGRATION = 1, > > > > > > > > > > > } VhostSetConfigType; > > > > > > > > > > > > > > > > > > > > > > +typedef enum VhostDeviceStateDirection { > > > > > > > > > > > + /* Transfer state from back-end (device) to front-end */ > > > > > > > > > > > + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, > > > > > > > > > > > + /* Transfer state from front-end to back-end (device) */ > > > > > > > > > > > + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, > > > > > > > > > > > +} VhostDeviceStateDirection; > > > > > > > > > > > + > > > > > > > > > > > +typedef enum VhostDeviceStatePhase { > > > > > > > > > > > + /* The device (and all its vrings) is stopped */ > > > > > > > > > > > + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, > > > > > > > > > > > +} VhostDeviceStatePhase; > > > > > > > > > > vDPA has: > > > > > > > > > > > > > > > > > > > > /* Suspend a device so it does not process virtqueue requests > > > > > > > > > > anymore > > > > > > > > > > * > > > > > > > > > > * After the return of ioctl the device must 
preserve all the > > > > > > > > > > necessary state > > > > > > > > > > * (the virtqueue vring base plus the possible device > > > > > > > > > > specific states) that is > > > > > > > > > > * required for restoring in the future. The device must not > > > > > > > > > > change its > > > > > > > > > > * configuration after that point. > > > > > > > > > > */ > > > > > > > > > > #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) > > > > > > > > > > > > > > > > > > > > /* Resume a device so it can resume processing virtqueue > > > > > > > > > > requests > > > > > > > > > > * > > > > > > > > > > * After the return of this ioctl the device will have > > > > > > > > > > restored all the > > > > > > > > > > * necessary states and it is fully operational to continue > > > > > > > > > > processing the > > > > > > > > > > * virtqueue descriptors. > > > > > > > > > > */ > > > > > > > > > > #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) > > > > > > > > > > > > > > > > > > > > I wonder if it makes sense to import these into vhost-user so > > > > > > > > > > that the > > > > > > > > > > difference between kernel vhost and vhost-user is minimized. > > > > > > > > > > It's okay > > > > > > > > > > if one of them is ahead of the other, but it would be nice to > > > > > > > > > > avoid > > > > > > > > > > overlapping/duplicated functionality. > > > > > > > > > > > > > > > > > > > That's what I had in mind in the first versions. I proposed > > > > > > > > > VHOST_STOP > > > > > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did > > > > > > > > > change > > > > > > > > > to SUSPEND. > > > > > > > > > > > > > > > > > > Generally it is better if we make the interface less parametrized > > > > > > > > > and > > > > > > > > > we trust in the messages and its semantics in my opinion. 
In other > > > > > > > > > words, instead of > > > > > > > > > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), > > > > > > > > > send > > > > > > > > > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user > > > > > > > > > command. > > > > > > > > > > > > > > > > > > Another way to apply this is with the "direction" parameter. Maybe > > > > > > > > > it > > > > > > > > > is better to split it into "set_state_fd" and "get_state_fd"? > > > > > > > > > > > > > > > > > > In that case, reusing the ioctls as vhost-user messages would be > > > > > > > > > ok. > > > > > > > > > But that puts this proposal further from the VFIO code, which uses > > > > > > > > > "migration_set_state(state)", and maybe it is better when the > > > > > > > > > number > > > > > > > > > of states is high. > > > > > > > > Hi Eugenio, > > > > > > > > Another question about vDPA suspend/resume: > > > > > > > > > > > > > > > > /* Host notifiers must be enabled at this point. */ > > > > > > > > void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, > > > > > > > > bool vrings) > > > > > > > > { > > > > > > > > int i; > > > > > > > > > > > > > > > > /* should only be called after backend is connected */ > > > > > > > > assert(hdev->vhost_ops); > > > > > > > > event_notifier_test_and_clear( > > > > > > > > &hdev- > > > > > > > > >vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier); > > > > > > > > event_notifier_test_and_clear(&vdev->config_notifier); > > > > > > > > > > > > > > > > trace_vhost_dev_stop(hdev, vdev->name, vrings); > > > > > > > > > > > > > > > > if (hdev->vhost_ops->vhost_dev_start) { > > > > > > > > hdev->vhost_ops->vhost_dev_start(hdev, false); > > > > > > > > ^^^ SUSPEND ^^^ > > > > > > > > } > > > > > > > > if (vrings) { > > > > > > > > vhost_dev_set_vring_enable(hdev, false); > > > > > > > > } > > > > > > > > for (i = 0; i < hdev->nvqs; ++i) { > > > > > > > > vhost_virtqueue_stop(hdev, > > > > > > > > vdev, > > > > > > > > hdev->vqs + i, > > > 
> > > > > hdev->vq_index + i); > > > > > > > > ^^^ fetch virtqueue state from kernel ^^^ > > > > > > > > } > > > > > > > > if (hdev->vhost_ops->vhost_reset_status) { > > > > > > > > hdev->vhost_ops->vhost_reset_status(hdev); > > > > > > > > ^^^ reset device^^^ > > > > > > > > > > > > > > > > I noticed the QEMU vDPA code resets the device in vhost_dev_stop() > > > > > > > > -> > > > > > > > > vhost_reset_status(). The device's migration code runs after > > > > > > > > vhost_dev_stop() and the state will have been lost. > > > > > > > > > > > > > > > vhost_virtqueue_stop saves the vq state (indexes, vring base) in the > > > > > > > qemu VirtIONet device model. This is for all vhost backends. > > > > > > > > > > > > > > Regarding the state like mac or mq configuration, SVQ runs for all the > > > > > > > VM run in the CVQ. So it can track all of that status in the device > > > > > > > model too. > > > > > > > > > > > > > > When a migration effectively occurs, all the frontend state is > > > > > > > migrated as a regular emulated device. To route all of the state in a > > > > > > > normalized way for qemu is what leaves open the possibility to do > > > > > > > cross-backends migrations, etc. > > > > > > > > > > > > > > Does that answer your question? > > > > > > I think you're confirming that changes would be necessary in order for > > > > > > vDPA to support the save/load operation that Hanna is introducing. > > > > > > > > > > > Yes, this first iteration was centered on net, with an eye on block, > > > > > where state can be routed through classical emulated devices. This is > > > > > how vhost-kernel and vhost-user do classically. And it allows > > > > > cross-backend, to not modify qemu migration state, etc. > > > > > > > > > > To introduce this opaque state to qemu, that must be fetched after the > > > > > suspend and not before, requires changes in vhost protocol, as > > > > > discussed previously. 
> > > > > > > > > > > > > It looks like vDPA changes are necessary in order to support > > > > > > > > stateful > > > > > > > > devices even though QEMU already uses SUSPEND. Is my understanding > > > > > > > > correct? > > > > > > > > > > > > > > > Changes are required elsewhere, as the code to restore the state > > > > > > > properly in the destination has not been merged. > > > > > > I'm not sure what you mean by elsewhere? > > > > > > > > > > > I meant for vdpa *net* devices the changes are not required in vdpa > > > > > ioctls, but mostly in qemu. > > > > > > > > > > If you meant stateful as "it must have a state blob that it must be > > > > > opaque to qemu", then I think the straightforward action is to fetch > > > > > state blob about the same time as vq indexes. But yes, changes (at > > > > > least a new ioctl) is needed for that. > > > > > > > > > > > I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and > > > > > > then VHOST_VDPA_SET_STATUS 0. > > > > > > > > > > > > In order to save device state from the vDPA device in the future, it > > > > > > will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that > > > > > > the device state can be saved before the device is reset. > > > > > > > > > > > > Does that sound right? > > > > > > > > > > > The split between suspend and reset was added recently for that very > > > > > reason. In all the virtio devices, the frontend is initialized before > > > > > the backend, so I don't think it is a good idea to defer the backend > > > > > cleanup. Especially if we have already set the state is small enough > > > > > to not needing iterative migration from virtiofsd point of view. > > > > > > > > > > If fetching that state at the same time as vq indexes is not valid, > > > > > could it follow the same model as the "in-flight descriptors"? > > > > > vhost-user follows them by using a shared memory region where their > > > > > state is tracked [1]. 
This allows qemu to survive vhost-user SW > > > > > backend crashes, and does not forbid the cross-backends live migration > > > > > as all the information is there to recover them. > > > > > > > > > > For hw devices this is not convenient as it occupies PCI bandwidth. So > > > > > a possibility is to synchronize this memory region after a > > > > > synchronization point, being the SUSPEND call or GET_VRING_BASE. HW > > > > > devices are not going to crash in the software sense, so all use cases > > > > > remain the same to qemu. And that shared memory information is > > > > > recoverable after vhost_dev_stop. > > > > > > > > > > Does that sound reasonable to virtiofsd? To offer a shared memory > > > > > region where it dumps the state, maybe only after the > > > > > set_state(STATE_PHASE_STOPPED)? > > > > > > > > I don’t think we need the set_state() call, necessarily, if SUSPEND is > > > > mandatory anyway. > > > > > > > > As for the shared memory, the RFC before this series used shared memory, > > > > so it’s possible, yes. But “shared memory region” can mean a lot of > > > > things – it sounds like you’re saying the back-end (virtiofsd) should > > > > provide it to the front-end, is that right? That could work like this: > > > > > > > > On the source side: > > > > > > > > S1. SUSPEND goes to virtiofsd > > > > S2. virtiofsd maybe double-checks that the device is stopped, then > > > > serializes its state into a newly allocated shared memory area[1] > > > > S3. virtiofsd responds to SUSPEND > > > > S4. front-end requests shared memory, virtiofsd responds with a handle, > > > > maybe already closes its reference > > > > S5. front-end saves state, closes its handle, freeing the SHM > > > > > > > > [1] Maybe virtiofsd can correctly size the serialized state’s size, then > > > > it can immediately allocate this area and serialize directly into it; > > > > maybe it can’t, then we’ll need a bounce buffer. 
Not really a > > > > fundamental problem, but there are limitations around what you can do > > > > with serde implementations in Rust… > > > > > > > > On the destination side: > > > > > > > > D1. Optional SUSPEND goes to virtiofsd that hasn’t yet done much; > > > > virtiofsd would serialize its empty state into an SHM area, and respond > > > > to SUSPEND > > > > D2. front-end reads state from migration stream into an SHM it has allocated > > > > D3. front-end supplies this SHM to virtiofsd, which discards its > > > > previous area, and now uses this one > > > > D4. RESUME goes to virtiofsd, which deserializes the state from the SHM > > > > > > > > Couple of questions: > > > > > > > > A. Stefan suggested D1, but it does seem wasteful now. But if SUSPEND > > > > would imply to deserialize a state, and the state is to be transferred > > > > through SHM, this is what would need to be done. So maybe we should > > > > skip SUSPEND on the destination? > > > > B. You described that the back-end should supply the SHM, which works > > > > well on the source. On the destination, only the front-end knows how > > > > big the state is, so I’ve decided above that it should allocate the SHM > > > > (D2) and provide it to the back-end. Is that feasible or is it > > > > important (e.g. for real hardware) that the back-end supplies the SHM? > > > > (In which case the front-end would need to tell the back-end how big the > > > > state SHM needs to be.) > > > > > > How does this work for iterative live migration? > > > > > > > A pipe will always fit better for iterative from qemu POV, that's for sure. > > Especially if we want to keep that opaqueness. > > > > But we will need to communicate with the HW device using shared memory sooner > > or later for big states. If we don't transform it in qemu, we will need to do > > it in the kernel. Also, the pipe will not support daemon crashes. > > > > Again I'm just putting this on the table, just in case it fits better or it is > > convenient. 
I missed the previous patch where SHM was proposed too, so maybe I > > missed some useful feedback here. I think the pipe is a better solution in the > > long run because of the iterative part. > > Pipes and shared memory are conceptually equivalent for building > streaming interfaces. It's just more complex to design a shared memory > interface and it reinvents what file descriptors already offer. > > I have no doubt we could design iterative migration over a shared memory > interface if we needed to, but I'm not sure why? When you mention > hardware, are you suggesting defining a standard memory/register layout > that hardware implements and mapping it to userspace (QEMU)? Right. > Is there a > big advantage to exposing memory versus a file descriptor? > For hardware, it allows retrieving and setting the device state without involving the kernel, saving context switches. For virtiofsd this may not make a lot of sense, but I'm thinking of devices with big states (virtio gpu, maybe?). For software, it allows the backend to survive a crash, as the old state can be set directly on a fresh backend instance. As I said, I'm not saying we must go with shared memory. We can always add it on top, accepting the cost of maintaining both models. I'm just trying to make sure we evaluate both. Thanks!
On 09.05.23 08:31, Eugenio Perez Martin wrote: > On Mon, May 8, 2023 at 9:12 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: [...] >> VHOST_USER_GET_VRING_BASE itself isn't really enough because it stops a >> specific virtqueue but not the whole device. Unfortunately stopping all >> virtqueues is not the same as SUSPEND since spontaneous device activity >> is possible independent of any virtqueue (e.g. virtio-scsi events and >> maybe virtio-net link status). >> >> That's why I think SUSPEND is necessary for a solution that's generic >> enough to cover all device types. >> > I agree. > > In particular virtiofsd is already resetting the whole device at > VHOST_USER_GET_VRING_BASE if I'm not wrong, so that's even more of a > reason to implement the suspend call. Oh, no, just the vring in question. Not the whole device. In addition, we still need the GET_VRING_BASE call anyway, because, well, we want to restore the vring on the destination via SET_VRING_BASE. Hanna
On Tue, May 09, 2023 at 08:45:33AM +0200, Eugenio Perez Martin wrote: > On Mon, May 8, 2023 at 10:10 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > On Thu, Apr 20, 2023 at 03:29:44PM +0200, Eugenio Pérez wrote: > > > On Wed, 2023-04-19 at 07:21 -0400, Stefan Hajnoczi wrote: > > > > On Wed, 19 Apr 2023 at 07:10, Hanna Czenczek <hreitz@redhat.com> wrote: > > > > > On 18.04.23 09:54, Eugenio Perez Martin wrote: > > > > > > On Mon, Apr 17, 2023 at 9:21 PM Stefan Hajnoczi <stefanha@gmail.com> > > > > > > wrote: > > > > > > > On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com> > > > > > > > wrote: > > > > > > > > On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com> > > > > > > > > wrote: > > > > > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin > > > > > > > > > wrote: > > > > > > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi < > > > > > > > > > > stefanha@redhat.com> wrote: > > > > > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: > > > > > > > > > > > > So-called "internal" virtio-fs migration refers to > > > > > > > > > > > > transporting the > > > > > > > > > > > > back-end's (virtiofsd's) state through qemu's migration > > > > > > > > > > > > stream. To do > > > > > > > > > > > > this, we need to be able to transfer virtiofsd's internal > > > > > > > > > > > > state to and > > > > > > > > > > > > from virtiofsd. > > > > > > > > > > > > > > > > > > > > > > > > Because virtiofsd's internal state will not be too large, we > > > > > > > > > > > > believe it > > > > > > > > > > > > is best to transfer it as a single binary blob after the > > > > > > > > > > > > streaming > > > > > > > > > > > > phase. 
Because this method should be useful to other vhost- > > > > > > > > > > > > user > > > > > > > > > > > > implementations, too, it is introduced as a general-purpose > > > > > > > > > > > > addition to > > > > > > > > > > > > the protocol, not limited to vhost-user-fs. > > > > > > > > > > > > > > > > > > > > > > > > These are the additions to the protocol: > > > > > > > > > > > > - New vhost-user protocol feature > > > > > > > > > > > > VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: > > > > > > > > > > > > This feature signals support for transferring state, and is > > > > > > > > > > > > added so > > > > > > > > > > > > that migration can fail early when the back-end has no > > > > > > > > > > > > support. > > > > > > > > > > > > > > > > > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end > > > > > > > > > > > > negotiate a pipe > > > > > > > > > > > > over which to transfer the state. The front-end sends an > > > > > > > > > > > > FD to the > > > > > > > > > > > > back-end into/from which it can write/read its state, and > > > > > > > > > > > > the back-end > > > > > > > > > > > > can decide to either use it, or reply with a different FD > > > > > > > > > > > > for the > > > > > > > > > > > > front-end to override the front-end's choice. > > > > > > > > > > > > The front-end creates a simple pipe to transfer the state, > > > > > > > > > > > > but maybe > > > > > > > > > > > > the back-end already has an FD into/from which it has to > > > > > > > > > > > > write/read > > > > > > > > > > > > its state, in which case it will want to override the > > > > > > > > > > > > simple pipe. 
> > > > > > > > > > > > Conversely, maybe in the future we find a way to have the > > > > > > > > > > > > front-end > > > > > > > > > > > > get an immediate FD for the migration stream (in some > > > > > > > > > > > > cases), in which > > > > > > > > > > > > case we will want to send this to the back-end instead of > > > > > > > > > > > > creating a > > > > > > > > > > > > pipe. > > > > > > > > > > > > Hence the negotiation: If one side has a better idea than a > > > > > > > > > > > > plain > > > > > > > > > > > > pipe, we will want to use that. > > > > > > > > > > > > > > > > > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred > > > > > > > > > > > > through the > > > > > > > > > > > > pipe (the end indicated by EOF), the front-end invokes this > > > > > > > > > > > > function > > > > > > > > > > > > to verify success. There is no in-band way (through the > > > > > > > > > > > > pipe) to > > > > > > > > > > > > indicate failure, so we need to check explicitly. > > > > > > > > > > > > > > > > > > > > > > > > Once the transfer pipe has been established via > > > > > > > > > > > > SET_DEVICE_STATE_FD > > > > > > > > > > > > (which includes establishing the direction of transfer and > > > > > > > > > > > > migration > > > > > > > > > > > > phase), the sending side writes its data into the pipe, and > > > > > > > > > > > > the reading > > > > > > > > > > > > side reads it until it sees an EOF. Then, the front-end will > > > > > > > > > > > > check for > > > > > > > > > > > > success via CHECK_DEVICE_STATE, which on the destination side > > > > > > > > > > > > includes > > > > > > > > > > > > checking for integrity (i.e. errors during deserialization). 
> > > > > > > > > > > > > > > > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > > > > > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > > > > > > > > > > > > --- > > > > > > > > > > > > include/hw/virtio/vhost-backend.h | 24 +++++ > > > > > > > > > > > > include/hw/virtio/vhost.h | 79 ++++++++++++++++ > > > > > > > > > > > > hw/virtio/vhost-user.c | 147 > > > > > > > > > > > > ++++++++++++++++++++++++++++++ > > > > > > > > > > > > hw/virtio/vhost.c | 37 ++++++++ > > > > > > > > > > > > 4 files changed, 287 insertions(+) > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h > > > > > > > > > > > > b/include/hw/virtio/vhost-backend.h > > > > > > > > > > > > index ec3fbae58d..5935b32fe3 100644 > > > > > > > > > > > > --- a/include/hw/virtio/vhost-backend.h > > > > > > > > > > > > +++ b/include/hw/virtio/vhost-backend.h > > > > > > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { > > > > > > > > > > > > VHOST_SET_CONFIG_TYPE_MIGRATION = 1, > > > > > > > > > > > > } VhostSetConfigType; > > > > > > > > > > > > > > > > > > > > > > > > +typedef enum VhostDeviceStateDirection { > > > > > > > > > > > > + /* Transfer state from back-end (device) to front-end */ > > > > > > > > > > > > + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, > > > > > > > > > > > > + /* Transfer state from front-end to back-end (device) */ > > > > > > > > > > > > + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, > > > > > > > > > > > > +} VhostDeviceStateDirection; > > > > > > > > > > > > + > > > > > > > > > > > > +typedef enum VhostDeviceStatePhase { > > > > > > > > > > > > + /* The device (and all its vrings) is stopped */ > > > > > > > > > > > > + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, > > > > > > > > > > > > +} VhostDeviceStatePhase; > > > > > > > > > > > vDPA has: > > > > > > > > > > > > > > > > > > > > > > /* Suspend a device so it does not process virtqueue requests > > > > > > > > > > > anymore > > > > > > > > 
> > > * > > > > > > > > > > > * After the return of ioctl the device must preserve all the > > > > > > > > > > > necessary state > > > > > > > > > > > * (the virtqueue vring base plus the possible device > > > > > > > > > > > specific states) that is > > > > > > > > > > > * required for restoring in the future. The device must not > > > > > > > > > > > change its > > > > > > > > > > > * configuration after that point. > > > > > > > > > > > */ > > > > > > > > > > > #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) > > > > > > > > > > > > > > > > > > > > > > /* Resume a device so it can resume processing virtqueue > > > > > > > > > > > requests > > > > > > > > > > > * > > > > > > > > > > > * After the return of this ioctl the device will have > > > > > > > > > > > restored all the > > > > > > > > > > > * necessary states and it is fully operational to continue > > > > > > > > > > > processing the > > > > > > > > > > > * virtqueue descriptors. > > > > > > > > > > > */ > > > > > > > > > > > #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) > > > > > > > > > > > > > > > > > > > > > > I wonder if it makes sense to import these into vhost-user so > > > > > > > > > > > that the > > > > > > > > > > > difference between kernel vhost and vhost-user is minimized. > > > > > > > > > > > It's okay > > > > > > > > > > > if one of them is ahead of the other, but it would be nice to > > > > > > > > > > > avoid > > > > > > > > > > > overlapping/duplicated functionality. > > > > > > > > > > > > > > > > > > > > > That's what I had in mind in the first versions. I proposed > > > > > > > > > > VHOST_STOP > > > > > > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did > > > > > > > > > > change > > > > > > > > > > to SUSPEND. > > > > > > > > > > > > > > > > > > > > Generally it is better if we make the interface less parametrized > > > > > > > > > > and > > > > > > > > > > we trust in the messages and its semantics in my opinion. 
In other > > > > > > > > > > words, instead of > > > > > > > > > > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), > > > > > > > > > > send > > > > > > > > > > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user > > > > > > > > > > command. > > > > > > > > > > > > > > > > > > > > Another way to apply this is with the "direction" parameter. Maybe > > > > > > > > > > it > > > > > > > > > > is better to split it into "set_state_fd" and "get_state_fd"? > > > > > > > > > > > > > > > > > > > > In that case, reusing the ioctls as vhost-user messages would be > > > > > > > > > > ok. > > > > > > > > > > But that puts this proposal further from the VFIO code, which uses > > > > > > > > > > "migration_set_state(state)", and maybe it is better when the > > > > > > > > > > number > > > > > > > > > > of states is high. > > > > > > > > > Hi Eugenio, > > > > > > > > > Another question about vDPA suspend/resume: > > > > > > > > > > > > > > > > > > /* Host notifiers must be enabled at this point. 
*/ > > > > > > > > > void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, > > > > > > > > > bool vrings) > > > > > > > > > { > > > > > > > > > int i; > > > > > > > > > > > > > > > > > > /* should only be called after backend is connected */ > > > > > > > > > assert(hdev->vhost_ops); > > > > > > > > > event_notifier_test_and_clear( > > > > > > > > > &hdev- > > > > > > > > > >vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier); > > > > > > > > > event_notifier_test_and_clear(&vdev->config_notifier); > > > > > > > > > > > > > > > > > > trace_vhost_dev_stop(hdev, vdev->name, vrings); > > > > > > > > > > > > > > > > > > if (hdev->vhost_ops->vhost_dev_start) { > > > > > > > > > hdev->vhost_ops->vhost_dev_start(hdev, false); > > > > > > > > > ^^^ SUSPEND ^^^ > > > > > > > > > } > > > > > > > > > if (vrings) { > > > > > > > > > vhost_dev_set_vring_enable(hdev, false); > > > > > > > > > } > > > > > > > > > for (i = 0; i < hdev->nvqs; ++i) { > > > > > > > > > vhost_virtqueue_stop(hdev, > > > > > > > > > vdev, > > > > > > > > > hdev->vqs + i, > > > > > > > > > hdev->vq_index + i); > > > > > > > > > ^^^ fetch virtqueue state from kernel ^^^ > > > > > > > > > } > > > > > > > > > if (hdev->vhost_ops->vhost_reset_status) { > > > > > > > > > hdev->vhost_ops->vhost_reset_status(hdev); > > > > > > > > > ^^^ reset device^^^ > > > > > > > > > > > > > > > > > > I noticed the QEMU vDPA code resets the device in vhost_dev_stop() > > > > > > > > > -> > > > > > > > > > vhost_reset_status(). The device's migration code runs after > > > > > > > > > vhost_dev_stop() and the state will have been lost. > > > > > > > > > > > > > > > > > vhost_virtqueue_stop saves the vq state (indexes, vring base) in the > > > > > > > > qemu VirtIONet device model. This is for all vhost backends. > > > > > > > > > > > > > > > > Regarding the state like mac or mq configuration, SVQ runs for all the > > > > > > > > VM run in the CVQ. 
So it can track all of that status in the device > > > > > > > > model too. > > > > > > > > > > > > > > > > When a migration effectively occurs, all the frontend state is > > > > > > > > migrated as a regular emulated device. To route all of the state in a > > > > > > > > normalized way for qemu is what leaves open the possibility to do > > > > > > > > cross-backends migrations, etc. > > > > > > > > > > > > > > > > Does that answer your question? > > > > > > > I think you're confirming that changes would be necessary in order for > > > > > > > vDPA to support the save/load operation that Hanna is introducing. > > > > > > > > > > > > > Yes, this first iteration was centered on net, with an eye on block, > > > > > > where state can be routed through classical emulated devices. This is > > > > > > how vhost-kernel and vhost-user do classically. And it allows > > > > > > cross-backend, to not modify qemu migration state, etc. > > > > > > > > > > > > To introduce this opaque state to qemu, that must be fetched after the > > > > > > suspend and not before, requires changes in vhost protocol, as > > > > > > discussed previously. > > > > > > > > > > > > > > > It looks like vDPA changes are necessary in order to support > > > > > > > > > stateful > > > > > > > > > devices even though QEMU already uses SUSPEND. Is my understanding > > > > > > > > > correct? > > > > > > > > > > > > > > > > > Changes are required elsewhere, as the code to restore the state > > > > > > > > properly in the destination has not been merged. > > > > > > > I'm not sure what you mean by elsewhere? > > > > > > > > > > > > > I meant for vdpa *net* devices the changes are not required in vdpa > > > > > > ioctls, but mostly in qemu. > > > > > > > > > > > > If you meant stateful as "it must have a state blob that it must be > > > > > > opaque to qemu", then I think the straightforward action is to fetch > > > > > > state blob about the same time as vq indexes. 
But yes, changes (at > > > > > > least a new ioctl) is needed for that. > > > > > > > > > > > > > I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and > > > > > > > then VHOST_VDPA_SET_STATUS 0. > > > > > > > > > > > > > > In order to save device state from the vDPA device in the future, it > > > > > > > will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that > > > > > > > the device state can be saved before the device is reset. > > > > > > > > > > > > > > Does that sound right? > > > > > > > > > > > > > The split between suspend and reset was added recently for that very > > > > > > reason. In all the virtio devices, the frontend is initialized before > > > > > > the backend, so I don't think it is a good idea to defer the backend > > > > > > cleanup. Especially if we have already set the state is small enough > > > > > > to not needing iterative migration from virtiofsd point of view. > > > > > > > > > > > > If fetching that state at the same time as vq indexes is not valid, > > > > > > could it follow the same model as the "in-flight descriptors"? > > > > > > vhost-user follows them by using a shared memory region where their > > > > > > state is tracked [1]. This allows qemu to survive vhost-user SW > > > > > > backend crashes, and does not forbid the cross-backends live migration > > > > > > as all the information is there to recover them. > > > > > > > > > > > > For hw devices this is not convenient as it occupies PCI bandwidth. So > > > > > > a possibility is to synchronize this memory region after a > > > > > > synchronization point, being the SUSPEND call or GET_VRING_BASE. HW > > > > > > devices are not going to crash in the software sense, so all use cases > > > > > > remain the same to qemu. And that shared memory information is > > > > > > recoverable after vhost_dev_stop. > > > > > > > > > > > > Does that sound reasonable to virtiofsd? 
To offer a shared memory > > > > > > region where it dumps the state, maybe only after the > > > > > > set_state(STATE_PHASE_STOPPED)? > > > > > > > > > > I don’t think we need the set_state() call, necessarily, if SUSPEND is > > > > > mandatory anyway. > > > > > > > > > > As for the shared memory, the RFC before this series used shared memory, > > > > > so it’s possible, yes. But “shared memory region” can mean a lot of > > > > > things – it sounds like you’re saying the back-end (virtiofsd) should > > > > > provide it to the front-end, is that right? That could work like this: > > > > > > > > > > On the source side: > > > > > > > > > > S1. SUSPEND goes to virtiofsd > > > > > S2. virtiofsd maybe double-checks that the device is stopped, then > > > > > serializes its state into a newly allocated shared memory area[1] > > > > > S3. virtiofsd responds to SUSPEND > > > > > S4. front-end requests shared memory, virtiofsd responds with a handle, > > > > > maybe already closes its reference > > > > > S5. front-end saves state, closes its handle, freeing the SHM > > > > > > > > > > [1] Maybe virtiofsd can correctly size the serialized state’s size, then > > > > > it can immediately allocate this area and serialize directly into it; > > > > > maybe it can’t, then we’ll need a bounce buffer. Not really a > > > > > fundamental problem, but there are limitations around what you can do > > > > > with serde implementations in Rust… > > > > > > > > > > On the destination side: > > > > > > > > > > D1. Optional SUSPEND goes to virtiofsd that hasn’t yet done much; > > > > > virtiofsd would serialize its empty state into an SHM area, and respond > > > > > to SUSPEND > > > > > D2. front-end reads state from migration stream into an SHM it has allocated > > > > > D3. front-end supplies this SHM to virtiofsd, which discards its > > > > > previous area, and now uses this one > > > > > D4. 
RESUME goes to virtiofsd, which deserializes the state from the SHM > > > > > > > > > > Couple of questions: > > > > > > > > > > A. Stefan suggested D1, but it does seem wasteful now. But if SUSPEND > > > > > would imply to deserialize a state, and the state is to be transferred > > > > > through SHM, this is what would need to be done. So maybe we should > > > > > skip SUSPEND on the destination? > > > > > B. You described that the back-end should supply the SHM, which works > > > > > well on the source. On the destination, only the front-end knows how > > > > > big the state is, so I’ve decided above that it should allocate the SHM > > > > > (D2) and provide it to the back-end. Is that feasible or is it > > > > > important (e.g. for real hardware) that the back-end supplies the SHM? > > > > > (In which case the front-end would need to tell the back-end how big the > > > > > state SHM needs to be.) > > > > > > > > How does this work for iterative live migration? > > > > > > > > > > A pipe will always fit better for iterative from qemu POV, that's for sure. > > > Especially if we want to keep that opaqueness. > > > > > > But we will need to communicate with the HW device using shared memory sooner > > > or later for big states. If we don't transform it in qemu, we will need to do > > > it in the kernel. Also, the pipe will not support daemon crashes. > > > > > > Again I'm just putting this on the table, just in case it fits better or it is > > > convenient. I missed the previous patch where SHM was proposed too, so maybe I > > > missed some feedback useful here. I think the pipe is a better solution in the > > > long run because of the iterative part. > > > > Pipes and shared memory are conceptually equivalent for building > > streaming interfaces. It's just more complex to design a shared memory > > interface and it reinvents what file descriptors already offer. 
> > > > I have no doubt we could design iterative migration over a shared memory > > interface if we needed to, but I'm not sure why? When you mention > > hardware, are you suggesting defining a standard memory/register layout > > that hardware implements and mapping it to userspace (QEMU)? > > Right. > > > Is there a > > big advantage to exposing memory versus a file descriptor? > > > > For hardware it allows to retrieve and set the device state without > intervention of the kernel, saving context switches. For virtiofsd > this may not make a lot of sense, but I'm thinking on devices with big > states (virtio gpu, maybe?). A streaming interface implemented using shared memory involves consuming chunks of bytes. Each time data has been read, an action must be performed to notify the device and receive a notification when more data becomes available. That notification involves the kernel (e.g. an eventfd that is triggered by a hardware interrupt) and a read(2) syscall to reset the eventfd. Unless userspace disables notifications and polls (busy waits) the hardware registers, there is still going to be kernel involvement and a context switch. For this reason, I think that shared memory vs pipes will not be significantly different. > For software it allows the backend to survive a crash, as the old > state can be set directly to a fresh backend instance. Can you explain by describing the steps involved? Are you sure it can only be done with shared memory and not pipes? Stefan
On Tue, May 9, 2023 at 11:01 AM Hanna Czenczek <hreitz@redhat.com> wrote: > > On 09.05.23 08:31, Eugenio Perez Martin wrote: > > On Mon, May 8, 2023 at 9:12 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > [...] > > >> VHOST_USER_GET_VRING_BASE itself isn't really enough because it stops a > >> specific virtqueue but not the whole device. Unfortunately stopping all > >> virtqueues is not the same as SUSPEND since spontaneous device activity > >> is possible independent of any virtqueue (e.g. virtio-scsi events and > >> maybe virtio-net link status). > >> > >> That's why I think SUSPEND is necessary for a solution that's generic > >> enough to cover all device types. > >> > > I agree. > > > > In particular virtiofsd is already resetting the whole device at > > VHOST_USER_GET_VRING_BASE if I'm not wrong, so that's even more of a > > reason to implement the suspend call. > > Oh, no, just the vring in question. Not the whole device. > > In addition, we still need the GET_VRING_BASE call anyway, because, > well, we want to restore the vring on the destination via SET_VRING_BASE. > Ok, that makes sense, sorry for the confusion! Thanks!
On Tue, May 9, 2023 at 5:09 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > On Tue, May 09, 2023 at 08:45:33AM +0200, Eugenio Perez Martin wrote: > > On Mon, May 8, 2023 at 10:10 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > > > On Thu, Apr 20, 2023 at 03:29:44PM +0200, Eugenio Pérez wrote: > > > > On Wed, 2023-04-19 at 07:21 -0400, Stefan Hajnoczi wrote: > > > > > On Wed, 19 Apr 2023 at 07:10, Hanna Czenczek <hreitz@redhat.com> wrote: > > > > > > On 18.04.23 09:54, Eugenio Perez Martin wrote: > > > > > > > On Mon, Apr 17, 2023 at 9:21 PM Stefan Hajnoczi <stefanha@gmail.com> > > > > > > > wrote: > > > > > > > > On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com> > > > > > > > > wrote: > > > > > > > > > On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com> > > > > > > > > > wrote: > > > > > > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin > > > > > > > > > > wrote: > > > > > > > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi < > > > > > > > > > > > stefanha@redhat.com> wrote: > > > > > > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: > > > > > > > > > > > > > So-called "internal" virtio-fs migration refers to > > > > > > > > > > > > > transporting the > > > > > > > > > > > > > back-end's (virtiofsd's) state through qemu's migration > > > > > > > > > > > > > stream. To do > > > > > > > > > > > > > this, we need to be able to transfer virtiofsd's internal > > > > > > > > > > > > > state to and > > > > > > > > > > > > > from virtiofsd. > > > > > > > > > > > > > > > > > > > > > > > > > > Because virtiofsd's internal state will not be too large, we > > > > > > > > > > > > > believe it > > > > > > > > > > > > > is best to transfer it as a single binary blob after the > > > > > > > > > > > > > streaming > > > > > > > > > > > > > phase. 
Because this method should be useful to other vhost- > > > > > > > > > > > > > user > > > > > > > > > > > > > implementations, too, it is introduced as a general-purpose > > > > > > > > > > > > > addition to > > > > > > > > > > > > > the protocol, not limited to vhost-user-fs. > > > > > > > > > > > > > > > > > > > > > > > > > > These are the additions to the protocol: > > > > > > > > > > > > > - New vhost-user protocol feature > > > > > > > > > > > > > VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: > > > > > > > > > > > > > This feature signals support for transferring state, and is > > > > > > > > > > > > > added so > > > > > > > > > > > > > that migration can fail early when the back-end has no > > > > > > > > > > > > > support. > > > > > > > > > > > > > > > > > > > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end > > > > > > > > > > > > > negotiate a pipe > > > > > > > > > > > > > over which to transfer the state. The front-end sends an > > > > > > > > > > > > > FD to the > > > > > > > > > > > > > back-end into/from which it can write/read its state, and > > > > > > > > > > > > > the back-end > > > > > > > > > > > > > can decide to either use it, or reply with a different FD > > > > > > > > > > > > > for the > > > > > > > > > > > > > front-end to override the front-end's choice. > > > > > > > > > > > > > The front-end creates a simple pipe to transfer the state, > > > > > > > > > > > > > but maybe > > > > > > > > > > > > > the back-end already has an FD into/from which it has to > > > > > > > > > > > > > write/read > > > > > > > > > > > > > its state, in which case it will want to override the > > > > > > > > > > > > > simple pipe. 
> > > > > > > > > > > > > Conversely, maybe in the future we find a way to have the > > > > > > > > > > > > > front-end > > > > > > > > > > > > > get an immediate FD for the migration stream (in some > > > > > > > > > > > > > cases), in which > > > > > > > > > > > > > case we will want to send this to the back-end instead of > > > > > > > > > > > > > creating a > > > > > > > > > > > > > pipe. > > > > > > > > > > > > > Hence the negotiation: If one side has a better idea than a > > > > > > > > > > > > > plain > > > > > > > > > > > > > pipe, we will want to use that. > > > > > > > > > > > > > > > > > > > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred > > > > > > > > > > > > > through the > > > > > > > > > > > > > pipe (the end indicated by EOF), the front-end invokes this > > > > > > > > > > > > > function > > > > > > > > > > > > > to verify success. There is no in-band way (through the > > > > > > > > > > > > > pipe) to > > > > > > > > > > > > > indicate failure, so we need to check explicitly. > > > > > > > > > > > > > > > > > > > > > > > > > > Once the transfer pipe has been established via > > > > > > > > > > > > > SET_DEVICE_STATE_FD > > > > > > > > > > > > > (which includes establishing the direction of transfer and > > > > > > > > > > > > > migration > > > > > > > > > > > > > phase), the sending side writes its data into the pipe, and > > > > > > > > > > > > > the reading > > > > > > > > > > > > > side reads it until it sees an EOF. Then, the front-end will > > > > > > > > > > > > > check for > > > > > > > > > > > > > success via CHECK_DEVICE_STATE, which on the destination side > > > > > > > > > > > > > includes > > > > > > > > > > > > > checking for integrity (i.e. errors during deserialization). 
> > > > > > > > > > > > > > > > > > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > > > > > > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > > > > > > > > > > > > > --- > > > > > > > > > > > > > include/hw/virtio/vhost-backend.h | 24 +++++ > > > > > > > > > > > > > include/hw/virtio/vhost.h | 79 ++++++++++++++++ > > > > > > > > > > > > > hw/virtio/vhost-user.c | 147 > > > > > > > > > > > > > ++++++++++++++++++++++++++++++ > > > > > > > > > > > > > hw/virtio/vhost.c | 37 ++++++++ > > > > > > > > > > > > > 4 files changed, 287 insertions(+) > > > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h > > > > > > > > > > > > > b/include/hw/virtio/vhost-backend.h > > > > > > > > > > > > > index ec3fbae58d..5935b32fe3 100644 > > > > > > > > > > > > > --- a/include/hw/virtio/vhost-backend.h > > > > > > > > > > > > > +++ b/include/hw/virtio/vhost-backend.h > > > > > > > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { > > > > > > > > > > > > > VHOST_SET_CONFIG_TYPE_MIGRATION = 1, > > > > > > > > > > > > > } VhostSetConfigType; > > > > > > > > > > > > > > > > > > > > > > > > > > +typedef enum VhostDeviceStateDirection { > > > > > > > > > > > > > + /* Transfer state from back-end (device) to front-end */ > > > > > > > > > > > > > + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, > > > > > > > > > > > > > + /* Transfer state from front-end to back-end (device) */ > > > > > > > > > > > > > + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, > > > > > > > > > > > > > +} VhostDeviceStateDirection; > > > > > > > > > > > > > + > > > > > > > > > > > > > +typedef enum VhostDeviceStatePhase { > > > > > > > > > > > > > + /* The device (and all its vrings) is stopped */ > > > > > > > > > > > > > + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, > > > > > > > > > > > > > +} VhostDeviceStatePhase; > > > > > > > > > > > > vDPA has: > > > > > > > > > > > > > > > > > > > > > > > > /* Suspend a device so it does not 
process virtqueue requests > > > > > > > > > > > > anymore > > > > > > > > > > > > * > > > > > > > > > > > > * After the return of ioctl the device must preserve all the > > > > > > > > > > > > necessary state > > > > > > > > > > > > * (the virtqueue vring base plus the possible device > > > > > > > > > > > > specific states) that is > > > > > > > > > > > > * required for restoring in the future. The device must not > > > > > > > > > > > > change its > > > > > > > > > > > > * configuration after that point. > > > > > > > > > > > > */ > > > > > > > > > > > > #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) > > > > > > > > > > > > > > > > > > > > > > > > /* Resume a device so it can resume processing virtqueue > > > > > > > > > > > > requests > > > > > > > > > > > > * > > > > > > > > > > > > * After the return of this ioctl the device will have > > > > > > > > > > > > restored all the > > > > > > > > > > > > * necessary states and it is fully operational to continue > > > > > > > > > > > > processing the > > > > > > > > > > > > * virtqueue descriptors. > > > > > > > > > > > > */ > > > > > > > > > > > > #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) > > > > > > > > > > > > > > > > > > > > > > > > I wonder if it makes sense to import these into vhost-user so > > > > > > > > > > > > that the > > > > > > > > > > > > difference between kernel vhost and vhost-user is minimized. > > > > > > > > > > > > It's okay > > > > > > > > > > > > if one of them is ahead of the other, but it would be nice to > > > > > > > > > > > > avoid > > > > > > > > > > > > overlapping/duplicated functionality. > > > > > > > > > > > > > > > > > > > > > > > That's what I had in mind in the first versions. I proposed > > > > > > > > > > > VHOST_STOP > > > > > > > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did > > > > > > > > > > > change > > > > > > > > > > > to SUSPEND. 
> > > > > > > > > > > > > > > > > > > > > > Generally it is better if we make the interface less parametrized > > > > > > > > > > > and > > > > > > > > > > > we trust in the messages and its semantics in my opinion. In other > > > > > > > > > > > words, instead of > > > > > > > > > > > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), > > > > > > > > > > > send > > > > > > > > > > > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user > > > > > > > > > > > command. > > > > > > > > > > > > > > > > > > > > > > Another way to apply this is with the "direction" parameter. Maybe > > > > > > > > > > > it > > > > > > > > > > > is better to split it into "set_state_fd" and "get_state_fd"? > > > > > > > > > > > > > > > > > > > > > > In that case, reusing the ioctls as vhost-user messages would be > > > > > > > > > > > ok. > > > > > > > > > > > But that puts this proposal further from the VFIO code, which uses > > > > > > > > > > > "migration_set_state(state)", and maybe it is better when the > > > > > > > > > > > number > > > > > > > > > > > of states is high. > > > > > > > > > > Hi Eugenio, > > > > > > > > > > Another question about vDPA suspend/resume: > > > > > > > > > > > > > > > > > > > > /* Host notifiers must be enabled at this point. 
*/ > > > > > > > > > > void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, > > > > > > > > > > bool vrings) > > > > > > > > > > { > > > > > > > > > > int i; > > > > > > > > > > > > > > > > > > > > /* should only be called after backend is connected */ > > > > > > > > > > assert(hdev->vhost_ops); > > > > > > > > > > event_notifier_test_and_clear( > > > > > > > > > > &hdev- > > > > > > > > > > >vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier); > > > > > > > > > > event_notifier_test_and_clear(&vdev->config_notifier); > > > > > > > > > > > > > > > > > > > > trace_vhost_dev_stop(hdev, vdev->name, vrings); > > > > > > > > > > > > > > > > > > > > if (hdev->vhost_ops->vhost_dev_start) { > > > > > > > > > > hdev->vhost_ops->vhost_dev_start(hdev, false); > > > > > > > > > > ^^^ SUSPEND ^^^ > > > > > > > > > > } > > > > > > > > > > if (vrings) { > > > > > > > > > > vhost_dev_set_vring_enable(hdev, false); > > > > > > > > > > } > > > > > > > > > > for (i = 0; i < hdev->nvqs; ++i) { > > > > > > > > > > vhost_virtqueue_stop(hdev, > > > > > > > > > > vdev, > > > > > > > > > > hdev->vqs + i, > > > > > > > > > > hdev->vq_index + i); > > > > > > > > > > ^^^ fetch virtqueue state from kernel ^^^ > > > > > > > > > > } > > > > > > > > > > if (hdev->vhost_ops->vhost_reset_status) { > > > > > > > > > > hdev->vhost_ops->vhost_reset_status(hdev); > > > > > > > > > > ^^^ reset device^^^ > > > > > > > > > > > > > > > > > > > > I noticed the QEMU vDPA code resets the device in vhost_dev_stop() > > > > > > > > > > -> > > > > > > > > > > vhost_reset_status(). The device's migration code runs after > > > > > > > > > > vhost_dev_stop() and the state will have been lost. > > > > > > > > > > > > > > > > > > > vhost_virtqueue_stop saves the vq state (indexes, vring base) in the > > > > > > > > > qemu VirtIONet device model. This is for all vhost backends. 
> > > > > > > > > > > > > > > > > > Regarding the state like mac or mq configuration, SVQ runs for all the > > > > > > > > > VM run in the CVQ. So it can track all of that status in the device > > > > > > > > > model too. > > > > > > > > > > > > > > > > > > When a migration effectively occurs, all the frontend state is > > > > > > > > > migrated as a regular emulated device. To route all of the state in a > > > > > > > > > normalized way for qemu is what leaves open the possibility to do > > > > > > > > > cross-backends migrations, etc. > > > > > > > > > > > > > > > > > > Does that answer your question? > > > > > > > > I think you're confirming that changes would be necessary in order for > > > > > > > > vDPA to support the save/load operation that Hanna is introducing. > > > > > > > > > > > > > > > Yes, this first iteration was centered on net, with an eye on block, > > > > > > > where state can be routed through classical emulated devices. This is > > > > > > > how vhost-kernel and vhost-user do classically. And it allows > > > > > > > cross-backend, to not modify qemu migration state, etc. > > > > > > > > > > > > > > To introduce this opaque state to qemu, that must be fetched after the > > > > > > > suspend and not before, requires changes in vhost protocol, as > > > > > > > discussed previously. > > > > > > > > > > > > > > > > > It looks like vDPA changes are necessary in order to support > > > > > > > > > > stateful > > > > > > > > > > devices even though QEMU already uses SUSPEND. Is my understanding > > > > > > > > > > correct? > > > > > > > > > > > > > > > > > > > Changes are required elsewhere, as the code to restore the state > > > > > > > > > properly in the destination has not been merged. > > > > > > > > I'm not sure what you mean by elsewhere? > > > > > > > > > > > > > > > I meant for vdpa *net* devices the changes are not required in vdpa > > > > > > > ioctls, but mostly in qemu. 
> > > > > > > > > > > > > > If you meant stateful as "it must have a state blob that it must be > > > > > > > opaque to qemu", then I think the straightforward action is to fetch > > > > > > > state blob about the same time as vq indexes. But yes, changes (at > > > > > > > least a new ioctl) is needed for that. > > > > > > > > > > > > > > > I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and > > > > > > > > then VHOST_VDPA_SET_STATUS 0. > > > > > > > > > > > > > > > > In order to save device state from the vDPA device in the future, it > > > > > > > > will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that > > > > > > > > the device state can be saved before the device is reset. > > > > > > > > > > > > > > > > Does that sound right? > > > > > > > > > > > > > > > The split between suspend and reset was added recently for that very > > > > > > > reason. In all the virtio devices, the frontend is initialized before > > > > > > > the backend, so I don't think it is a good idea to defer the backend > > > > > > > cleanup. Especially if we have already set the state is small enough > > > > > > > to not needing iterative migration from virtiofsd point of view. > > > > > > > > > > > > > > If fetching that state at the same time as vq indexes is not valid, > > > > > > > could it follow the same model as the "in-flight descriptors"? > > > > > > > vhost-user follows them by using a shared memory region where their > > > > > > > state is tracked [1]. This allows qemu to survive vhost-user SW > > > > > > > backend crashes, and does not forbid the cross-backends live migration > > > > > > > as all the information is there to recover them. > > > > > > > > > > > > > > For hw devices this is not convenient as it occupies PCI bandwidth. So > > > > > > > a possibility is to synchronize this memory region after a > > > > > > > synchronization point, being the SUSPEND call or GET_VRING_BASE. 
HW > > > > > > > devices are not going to crash in the software sense, so all use cases > > > > > > > remain the same to qemu. And that shared memory information is > > > > > > > recoverable after vhost_dev_stop. > > > > > > > > > > > > > > Does that sound reasonable to virtiofsd? To offer a shared memory > > > > > > > region where it dumps the state, maybe only after the > > > > > > > set_state(STATE_PHASE_STOPPED)? > > > > > > > > > > > > I don’t think we need the set_state() call, necessarily, if SUSPEND is > > > > > > mandatory anyway. > > > > > > > > > > > > As for the shared memory, the RFC before this series used shared memory, > > > > > > so it’s possible, yes. But “shared memory region” can mean a lot of > > > > > > things – it sounds like you’re saying the back-end (virtiofsd) should > > > > > > provide it to the front-end, is that right? That could work like this: > > > > > > > > > > > > On the source side: > > > > > > > > > > > > S1. SUSPEND goes to virtiofsd > > > > > > S2. virtiofsd maybe double-checks that the device is stopped, then > > > > > > serializes its state into a newly allocated shared memory area[1] > > > > > > S3. virtiofsd responds to SUSPEND > > > > > > S4. front-end requests shared memory, virtiofsd responds with a handle, > > > > > > maybe already closes its reference > > > > > > S5. front-end saves state, closes its handle, freeing the SHM > > > > > > > > > > > > [1] Maybe virtiofsd can correctly size the serialized state’s size, then > > > > > > it can immediately allocate this area and serialize directly into it; > > > > > > maybe it can’t, then we’ll need a bounce buffer. Not really a > > > > > > fundamental problem, but there are limitations around what you can do > > > > > > with serde implementations in Rust… > > > > > > > > > > > > On the destination side: > > > > > > > > > > > > D1. 
Optional SUSPEND goes to virtiofsd that hasn’t yet done much; > > > > > > virtiofsd would serialize its empty state into an SHM area, and respond > > > > > > to SUSPEND > > > > > > D2. front-end reads state from migration stream into an SHM it has allocated > > > > > > D3. front-end supplies this SHM to virtiofsd, which discards its > > > > > > previous area, and now uses this one > > > > > > D4. RESUME goes to virtiofsd, which deserializes the state from the SHM > > > > > > > > > > > > Couple of questions: > > > > > > > > > > > > A. Stefan suggested D1, but it does seem wasteful now. But if SUSPEND > > > > > > would imply to deserialize a state, and the state is to be transferred > > > > > > through SHM, this is what would need to be done. So maybe we should > > > > > > skip SUSPEND on the destination? > > > > > > B. You described that the back-end should supply the SHM, which works > > > > > > well on the source. On the destination, only the front-end knows how > > > > > > big the state is, so I’ve decided above that it should allocate the SHM > > > > > > (D2) and provide it to the back-end. Is that feasible or is it > > > > > > important (e.g. for real hardware) that the back-end supplies the SHM? > > > > > > (In which case the front-end would need to tell the back-end how big the > > > > > > state SHM needs to be.) > > > > > > > > > > How does this work for iterative live migration? > > > > > > > > > > > > > A pipe will always fit better for iterative from qemu POV, that's for sure. > > > > Especially if we want to keep that opaqueness. > > > > > > > > But we will need to communicate with the HW device using shared memory sooner > > > > or later for big states. If we don't transform it in qemu, we will need to do > > > > it in the kernel. Also, the pipe will not support daemon crashes. > > > > > > > > Again I'm just putting this on the table, just in case it fits better or it is > > > > convenient. 
I missed the previous patch where SHM was proposed too, so maybe I > > > > missed some feedback useful here. I think the pipe is a better solution in the > > > > long run because of the iterative part. > > > > > > Pipes and shared memory are conceptually equivalent for building > > > streaming interfaces. It's just more complex to design a shared memory > > > interface and it reinvents what file descriptors already offer. > > > > > > I have no doubt we could design iterative migration over a shared memory > > > interface if we needed to, but I'm not sure why? When you mention > > > hardware, are you suggesting defining a standard memory/register layout > > > that hardware implements and mapping it to userspace (QEMU)? > > > > Right. > > > > > Is there a > > > big advantage to exposing memory versus a file descriptor? > > > > > > > For hardware it allows to retrieve and set the device state without > > intervention of the kernel, saving context switches. For virtiofsd > > this may not make a lot of sense, but I'm thinking on devices with big > > states (virtio gpu, maybe?). > > A streaming interface implemented using shared memory involves consuming > chunks of bytes. Each time data has been read, an action must be > performed to notify the device and receive a notification when more data > becomes available. > > That notification involves the kernel (e.g. an eventfd that is triggered > by a hardware interrupt) and a read(2) syscall to reset the eventfd. > > Unless userspace disables notifications and polls (busy waits) the > hardware registers, there is still going to be kernel involvement and a > context switch. For this reason, I think that shared memory vs pipes > will not be significantly different. > Yes, for big states that's right. I was thinking of not-so-big states, where all of it can be asked in one shot, but it may be problematic with iterative migration for sure. In that regard pipes are way better. 
> > For software it allows the backend to survive a crash, as the old > > state can be set directly to a fresh backend instance. > > Can you explain by describing the steps involved? It's how vhost-user inflight I/O tracking works [1]: QEMU and the backend share a memory region into which the backend continuously dumps its state. In the event of a crash, that state can be handed directly to a new vhost-user backend instance. > Are you sure it can only be done with shared memory and not pipes? > Sorry for the confusion, but I never intended to say that :). [1] https://qemu.readthedocs.io/en/latest/interop/vhost-user.html#inflight-i-o-tracking
On Tue, 9 May 2023 at 11:35, Eugenio Perez Martin <eperezma@redhat.com> wrote: > > On Tue, May 9, 2023 at 5:09 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > On Tue, May 09, 2023 at 08:45:33AM +0200, Eugenio Perez Martin wrote: > > > On Mon, May 8, 2023 at 10:10 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > > > > > On Thu, Apr 20, 2023 at 03:29:44PM +0200, Eugenio Pérez wrote: > > > > > On Wed, 2023-04-19 at 07:21 -0400, Stefan Hajnoczi wrote: > > > > > > On Wed, 19 Apr 2023 at 07:10, Hanna Czenczek <hreitz@redhat.com> wrote: > > > > > > > On 18.04.23 09:54, Eugenio Perez Martin wrote: > > > > > > > > On Mon, Apr 17, 2023 at 9:21 PM Stefan Hajnoczi <stefanha@gmail.com> > > > > > > > > wrote: > > > > > > > > > On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com> > > > > > > > > > wrote: > > > > > > > > > > On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com> > > > > > > > > > > wrote: > > > > > > > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin > > > > > > > > > > > wrote: > > > > > > > > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi < > > > > > > > > > > > > stefanha@redhat.com> wrote: > > > > > > > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote: > > > > > > > > > > > > > > So-called "internal" virtio-fs migration refers to > > > > > > > > > > > > > > transporting the > > > > > > > > > > > > > > back-end's (virtiofsd's) state through qemu's migration > > > > > > > > > > > > > > stream. To do > > > > > > > > > > > > > > this, we need to be able to transfer virtiofsd's internal > > > > > > > > > > > > > > state to and > > > > > > > > > > > > > > from virtiofsd. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Because virtiofsd's internal state will not be too large, we > > > > > > > > > > > > > > believe it > > > > > > > > > > > > > > is best to transfer it as a single binary blob after the > > > > > > > > > > > > > > streaming > > > > > > > > > > > > > > phase. Because this method should be useful to other vhost- > > > > > > > > > > > > > > user > > > > > > > > > > > > > > implementations, too, it is introduced as a general-purpose > > > > > > > > > > > > > > addition to > > > > > > > > > > > > > > the protocol, not limited to vhost-user-fs. > > > > > > > > > > > > > > > > > > > > > > > > > > > > These are the additions to the protocol: > > > > > > > > > > > > > > - New vhost-user protocol feature > > > > > > > > > > > > > > VHOST_USER_PROTOCOL_F_MIGRATORY_STATE: > > > > > > > > > > > > > > This feature signals support for transferring state, and is > > > > > > > > > > > > > > added so > > > > > > > > > > > > > > that migration can fail early when the back-end has no > > > > > > > > > > > > > > support. > > > > > > > > > > > > > > > > > > > > > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end > > > > > > > > > > > > > > negotiate a pipe > > > > > > > > > > > > > > over which to transfer the state. The front-end sends an > > > > > > > > > > > > > > FD to the > > > > > > > > > > > > > > back-end into/from which it can write/read its state, and > > > > > > > > > > > > > > the back-end > > > > > > > > > > > > > > can decide to either use it, or reply with a different FD > > > > > > > > > > > > > > for the > > > > > > > > > > > > > > front-end to override the front-end's choice. 
> > > > > > > > > > > > > > The front-end creates a simple pipe to transfer the state, > > > > > > > > > > > > > > but maybe > > > > > > > > > > > > > > the back-end already has an FD into/from which it has to > > > > > > > > > > > > > > write/read > > > > > > > > > > > > > > its state, in which case it will want to override the > > > > > > > > > > > > > > simple pipe. > > > > > > > > > > > > > > Conversely, maybe in the future we find a way to have the > > > > > > > > > > > > > > front-end > > > > > > > > > > > > > > get an immediate FD for the migration stream (in some > > > > > > > > > > > > > > cases), in which > > > > > > > > > > > > > > case we will want to send this to the back-end instead of > > > > > > > > > > > > > > creating a > > > > > > > > > > > > > > pipe. > > > > > > > > > > > > > > Hence the negotiation: If one side has a better idea than a > > > > > > > > > > > > > > plain > > > > > > > > > > > > > > pipe, we will want to use that. > > > > > > > > > > > > > > > > > > > > > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred > > > > > > > > > > > > > > through the > > > > > > > > > > > > > > pipe (the end indicated by EOF), the front-end invokes this > > > > > > > > > > > > > > function > > > > > > > > > > > > > > to verify success. There is no in-band way (through the > > > > > > > > > > > > > > pipe) to > > > > > > > > > > > > > > indicate failure, so we need to check explicitly. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Once the transfer pipe has been established via > > > > > > > > > > > > > > SET_DEVICE_STATE_FD > > > > > > > > > > > > > > (which includes establishing the direction of transfer and > > > > > > > > > > > > > > migration > > > > > > > > > > > > > > phase), the sending side writes its data into the pipe, and > > > > > > > > > > > > > > the reading > > > > > > > > > > > > > > side reads it until it sees an EOF. 
Then, the front-end will > > > > > > > > > > > > > > check for > > > > > > > > > > > > > > success via CHECK_DEVICE_STATE, which on the destination side > > > > > > > > > > > > > > includes > > > > > > > > > > > > > > checking for integrity (i.e. errors during deserialization). > > > > > > > > > > > > > > > > > > > > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com> > > > > > > > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > > > > > > > > > > > > > > --- > > > > > > > > > > > > > > include/hw/virtio/vhost-backend.h | 24 +++++ > > > > > > > > > > > > > > include/hw/virtio/vhost.h | 79 ++++++++++++++++ > > > > > > > > > > > > > > hw/virtio/vhost-user.c | 147 > > > > > > > > > > > > > > ++++++++++++++++++++++++++++++ > > > > > > > > > > > > > > hw/virtio/vhost.c | 37 ++++++++ > > > > > > > > > > > > > > 4 files changed, 287 insertions(+) > > > > > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h > > > > > > > > > > > > > > b/include/hw/virtio/vhost-backend.h > > > > > > > > > > > > > > index ec3fbae58d..5935b32fe3 100644 > > > > > > > > > > > > > > --- a/include/hw/virtio/vhost-backend.h > > > > > > > > > > > > > > +++ b/include/hw/virtio/vhost-backend.h > > > > > > > > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType { > > > > > > > > > > > > > > VHOST_SET_CONFIG_TYPE_MIGRATION = 1, > > > > > > > > > > > > > > } VhostSetConfigType; > > > > > > > > > > > > > > > > > > > > > > > > > > > > +typedef enum VhostDeviceStateDirection { > > > > > > > > > > > > > > + /* Transfer state from back-end (device) to front-end */ > > > > > > > > > > > > > > + VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, > > > > > > > > > > > > > > + /* Transfer state from front-end to back-end (device) */ > > > > > > > > > > > > > > + VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, > > > > > > > > > > > > > > +} VhostDeviceStateDirection; > > > > > > > > > > > > > > + > > > > > > > > > > > > > > +typedef enum 
VhostDeviceStatePhase { > > > > > > > > > > > > > > + /* The device (and all its vrings) is stopped */ > > > > > > > > > > > > > > + VHOST_TRANSFER_STATE_PHASE_STOPPED = 0, > > > > > > > > > > > > > > +} VhostDeviceStatePhase; > > > > > > > > > > > > > vDPA has: > > > > > > > > > > > > > > > > > > > > > > > > > > /* Suspend a device so it does not process virtqueue requests > > > > > > > > > > > > > anymore > > > > > > > > > > > > > * > > > > > > > > > > > > > * After the return of ioctl the device must preserve all the > > > > > > > > > > > > > necessary state > > > > > > > > > > > > > * (the virtqueue vring base plus the possible device > > > > > > > > > > > > > specific states) that is > > > > > > > > > > > > > * required for restoring in the future. The device must not > > > > > > > > > > > > > change its > > > > > > > > > > > > > * configuration after that point. > > > > > > > > > > > > > */ > > > > > > > > > > > > > #define VHOST_VDPA_SUSPEND _IO(VHOST_VIRTIO, 0x7D) > > > > > > > > > > > > > > > > > > > > > > > > > > /* Resume a device so it can resume processing virtqueue > > > > > > > > > > > > > requests > > > > > > > > > > > > > * > > > > > > > > > > > > > * After the return of this ioctl the device will have > > > > > > > > > > > > > restored all the > > > > > > > > > > > > > * necessary states and it is fully operational to continue > > > > > > > > > > > > > processing the > > > > > > > > > > > > > * virtqueue descriptors. > > > > > > > > > > > > > */ > > > > > > > > > > > > > #define VHOST_VDPA_RESUME _IO(VHOST_VIRTIO, 0x7E) > > > > > > > > > > > > > > > > > > > > > > > > > > I wonder if it makes sense to import these into vhost-user so > > > > > > > > > > > > > that the > > > > > > > > > > > > > difference between kernel vhost and vhost-user is minimized. 
> > > > > > > > > > > > > It's okay > > > > > > > > > > > > > if one of them is ahead of the other, but it would be nice to > > > > > > > > > > > > > avoid > > > > > > > > > > > > > overlapping/duplicated functionality. > > > > > > > > > > > > > > > > > > > > > > > > > That's what I had in mind in the first versions. I proposed > > > > > > > > > > > > VHOST_STOP > > > > > > > > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did > > > > > > > > > > > > change > > > > > > > > > > > > to SUSPEND. > > > > > > > > > > > > > > > > > > > > > > > > Generally it is better if we make the interface less parametrized > > > > > > > > > > > > and > > > > > > > > > > > > we trust in the messages and its semantics in my opinion. In other > > > > > > > > > > > > words, instead of > > > > > > > > > > > > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), > > > > > > > > > > > > send > > > > > > > > > > > > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user > > > > > > > > > > > > command. > > > > > > > > > > > > > > > > > > > > > > > > Another way to apply this is with the "direction" parameter. Maybe > > > > > > > > > > > > it > > > > > > > > > > > > is better to split it into "set_state_fd" and "get_state_fd"? > > > > > > > > > > > > > > > > > > > > > > > > In that case, reusing the ioctls as vhost-user messages would be > > > > > > > > > > > > ok. > > > > > > > > > > > > But that puts this proposal further from the VFIO code, which uses > > > > > > > > > > > > "migration_set_state(state)", and maybe it is better when the > > > > > > > > > > > > number > > > > > > > > > > > > of states is high. > > > > > > > > > > > Hi Eugenio, > > > > > > > > > > > Another question about vDPA suspend/resume: > > > > > > > > > > > > > > > > > > > > > > /* Host notifiers must be enabled at this point. 
*/ > > > > > > > > > > > void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, > > > > > > > > > > > bool vrings) > > > > > > > > > > > { > > > > > > > > > > > int i; > > > > > > > > > > > > > > > > > > > > > > /* should only be called after backend is connected */ > > > > > > > > > > > assert(hdev->vhost_ops); > > > > > > > > > > > event_notifier_test_and_clear( > > > > > > > > > > > &hdev- > > > > > > > > > > > >vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier); > > > > > > > > > > > event_notifier_test_and_clear(&vdev->config_notifier); > > > > > > > > > > > > > > > > > > > > > > trace_vhost_dev_stop(hdev, vdev->name, vrings); > > > > > > > > > > > > > > > > > > > > > > if (hdev->vhost_ops->vhost_dev_start) { > > > > > > > > > > > hdev->vhost_ops->vhost_dev_start(hdev, false); > > > > > > > > > > > ^^^ SUSPEND ^^^ > > > > > > > > > > > } > > > > > > > > > > > if (vrings) { > > > > > > > > > > > vhost_dev_set_vring_enable(hdev, false); > > > > > > > > > > > } > > > > > > > > > > > for (i = 0; i < hdev->nvqs; ++i) { > > > > > > > > > > > vhost_virtqueue_stop(hdev, > > > > > > > > > > > vdev, > > > > > > > > > > > hdev->vqs + i, > > > > > > > > > > > hdev->vq_index + i); > > > > > > > > > > > ^^^ fetch virtqueue state from kernel ^^^ > > > > > > > > > > > } > > > > > > > > > > > if (hdev->vhost_ops->vhost_reset_status) { > > > > > > > > > > > hdev->vhost_ops->vhost_reset_status(hdev); > > > > > > > > > > > ^^^ reset device^^^ > > > > > > > > > > > > > > > > > > > > > > I noticed the QEMU vDPA code resets the device in vhost_dev_stop() > > > > > > > > > > > -> > > > > > > > > > > > vhost_reset_status(). The device's migration code runs after > > > > > > > > > > > vhost_dev_stop() and the state will have been lost. > > > > > > > > > > > > > > > > > > > > > vhost_virtqueue_stop saves the vq state (indexes, vring base) in the > > > > > > > > > > qemu VirtIONet device model. This is for all vhost backends. 
> > > > > > > > > > > > > > > > > > > > Regarding the state like mac or mq configuration, SVQ runs for all the > > > > > > > > > > VM run in the CVQ. So it can track all of that status in the device > > > > > > > > > > model too. > > > > > > > > > > > > > > > > > > > > When a migration effectively occurs, all the frontend state is > > > > > > > > > > migrated as a regular emulated device. To route all of the state in a > > > > > > > > > > normalized way for qemu is what leaves open the possibility to do > > > > > > > > > > cross-backends migrations, etc. > > > > > > > > > > > > > > > > > > > > Does that answer your question? > > > > > > > > > I think you're confirming that changes would be necessary in order for > > > > > > > > > vDPA to support the save/load operation that Hanna is introducing. > > > > > > > > > > > > > > > > > Yes, this first iteration was centered on net, with an eye on block, > > > > > > > > where state can be routed through classical emulated devices. This is > > > > > > > > how vhost-kernel and vhost-user do classically. And it allows > > > > > > > > cross-backend, to not modify qemu migration state, etc. > > > > > > > > > > > > > > > > To introduce this opaque state to qemu, that must be fetched after the > > > > > > > > suspend and not before, requires changes in vhost protocol, as > > > > > > > > discussed previously. > > > > > > > > > > > > > > > > > > > It looks like vDPA changes are necessary in order to support > > > > > > > > > > > stateful > > > > > > > > > > > devices even though QEMU already uses SUSPEND. Is my understanding > > > > > > > > > > > correct? > > > > > > > > > > > > > > > > > > > > > Changes are required elsewhere, as the code to restore the state > > > > > > > > > > properly in the destination has not been merged. > > > > > > > > > I'm not sure what you mean by elsewhere? > > > > > > > > > > > > > > > > > I meant for vdpa *net* devices the changes are not required in vdpa > > > > > > > > ioctls, but mostly in qemu. 
> > > > > > > > 
> > > > > > > > If you meant stateful as "it must have a state blob that it must be
> > > > > > > > opaque to qemu", then I think the straightforward action is to fetch
> > > > > > > > state blob about the same time as vq indexes. But yes, changes (at
> > > > > > > > least a new ioctl) is needed for that.
> > > > > > > > 
> > > > > > > > > I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and
> > > > > > > > > then VHOST_VDPA_SET_STATUS 0.
> > > > > > > > > 
> > > > > > > > > In order to save device state from the vDPA device in the future, it
> > > > > > > > > will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that
> > > > > > > > > the device state can be saved before the device is reset.
> > > > > > > > > 
> > > > > > > > > Does that sound right?
> > > > > > > > 
> > > > > > > > The split between suspend and reset was added recently for that very
> > > > > > > > reason. In all the virtio devices, the frontend is initialized before
> > > > > > > > the backend, so I don't think it is a good idea to defer the backend
> > > > > > > > cleanup. Especially if we have already set the state is small enough
> > > > > > > > to not needing iterative migration from virtiofsd point of view.
> > > > > > > > 
> > > > > > > > If fetching that state at the same time as vq indexes is not valid,
> > > > > > > > could it follow the same model as the "in-flight descriptors"?
> > > > > > > > vhost-user follows them by using a shared memory region where their
> > > > > > > > state is tracked [1]. This allows qemu to survive vhost-user SW
> > > > > > > > backend crashes, and does not forbid the cross-backends live migration
> > > > > > > > as all the information is there to recover them.
> > > > > > > > 
> > > > > > > > For hw devices this is not convenient as it occupies PCI bandwidth. So
> > > > > > > > a possibility is to synchronize this memory region after a
> > > > > > > > synchronization point, being the SUSPEND call or GET_VRING_BASE.
> > > > > > > > HW devices are not going to crash in the software sense, so all use
> > > > > > > > cases remain the same to qemu. And that shared memory information is
> > > > > > > > recoverable after vhost_dev_stop.
> > > > > > > > 
> > > > > > > > Does that sound reasonable to virtiofsd? To offer a shared memory
> > > > > > > > region where it dumps the state, maybe only after the
> > > > > > > > set_state(STATE_PHASE_STOPPED)?
> > > > > > > 
> > > > > > > I don’t think we need the set_state() call, necessarily, if SUSPEND is
> > > > > > > mandatory anyway.
> > > > > > > 
> > > > > > > As for the shared memory, the RFC before this series used shared memory,
> > > > > > > so it’s possible, yes.  But “shared memory region” can mean a lot of
> > > > > > > things – it sounds like you’re saying the back-end (virtiofsd) should
> > > > > > > provide it to the front-end, is that right?  That could work like this:
> > > > > > > 
> > > > > > > On the source side:
> > > > > > > 
> > > > > > > S1. SUSPEND goes to virtiofsd
> > > > > > > S2. virtiofsd maybe double-checks that the device is stopped, then
> > > > > > > serializes its state into a newly allocated shared memory area[1]
> > > > > > > S3. virtiofsd responds to SUSPEND
> > > > > > > S4. front-end requests shared memory, virtiofsd responds with a handle,
> > > > > > > maybe already closes its reference
> > > > > > > S5. front-end saves state, closes its handle, freeing the SHM
> > > > > > > 
> > > > > > > [1] Maybe virtiofsd can correctly size the serialized state’s size, then
> > > > > > > it can immediately allocate this area and serialize directly into it;
> > > > > > > maybe it can’t, then we’ll need a bounce buffer.  Not really a
> > > > > > > fundamental problem, but there are limitations around what you can do
> > > > > > > with serde implementations in Rust…
> > > > > > > 
> > > > > > > On the destination side:
> > > > > > > 
> > > > > > > D1.
> > > > > > > Optional SUSPEND goes to virtiofsd that hasn’t yet done much;
> > > > > > > virtiofsd would serialize its empty state into an SHM area, and respond
> > > > > > > to SUSPEND
> > > > > > > D2. front-end reads state from migration stream into an SHM it has
> > > > > > > allocated
> > > > > > > D3. front-end supplies this SHM to virtiofsd, which discards its
> > > > > > > previous area, and now uses this one
> > > > > > > D4. RESUME goes to virtiofsd, which deserializes the state from the SHM
> > > > > > > 
> > > > > > > Couple of questions:
> > > > > > > 
> > > > > > > A. Stefan suggested D1, but it does seem wasteful now.  But if SUSPEND
> > > > > > > would imply to deserialize a state, and the state is to be transferred
> > > > > > > through SHM, this is what would need to be done.  So maybe we should
> > > > > > > skip SUSPEND on the destination?
> > > > > > > B. You described that the back-end should supply the SHM, which works
> > > > > > > well on the source.  On the destination, only the front-end knows how
> > > > > > > big the state is, so I’ve decided above that it should allocate the SHM
> > > > > > > (D2) and provide it to the back-end.  Is that feasible or is it
> > > > > > > important (e.g. for real hardware) that the back-end supplies the SHM?
> > > > > > > (In which case the front-end would need to tell the back-end how big the
> > > > > > > state SHM needs to be.)
> > > > > > 
> > > > > > How does this work for iterative live migration?
> > > > > 
> > > > > A pipe will always fit better for iterative from qemu POV, that's for sure.
> > > > > Especially if we want to keep that opaqueness.
> > > > > 
> > > > > But we will need to communicate with the HW device using shared memory sooner
> > > > > or later for big states. If we don't transform it in qemu, we will need to do
> > > > > it in the kernel. Also, the pipe will not support daemon crashes.
> > > > > 
> > > > > Again I'm just putting this on the table, just in case it fits better or it is
> > > > > convenient. I missed the previous patch where SHM was proposed too, so maybe I
> > > > > missed some feedback useful here. I think the pipe is a better solution in the
> > > > > long run because of the iterative part.
> > > > 
> > > > Pipes and shared memory are conceptually equivalent for building
> > > > streaming interfaces. It's just more complex to design a shared memory
> > > > interface and it reinvents what file descriptors already offer.
> > > > 
> > > > I have no doubt we could design iterative migration over a shared memory
> > > > interface if we needed to, but I'm not sure why? When you mention
> > > > hardware, are you suggesting defining a standard memory/register layout
> > > > that hardware implements and mapping it to userspace (QEMU)?
> > > 
> > > Right.
> > > 
> > > > Is there a big advantage to exposing memory versus a file descriptor?
> > > 
> > > For hardware it allows to retrieve and set the device state without
> > > intervention of the kernel, saving context switches. For virtiofsd
> > > this may not make a lot of sense, but I'm thinking on devices with big
> > > states (virtio gpu, maybe?).
> > 
> > A streaming interface implemented using shared memory involves consuming
> > chunks of bytes. Each time data has been read, an action must be
> > performed to notify the device and receive a notification when more data
> > becomes available.
> > 
> > That notification involves the kernel (e.g. an eventfd that is triggered
> > by a hardware interrupt) and a read(2) syscall to reset the eventfd.
> > 
> > Unless userspace disables notifications and polls (busy waits) the
> > hardware registers, there is still going to be kernel involvement and a
> > context switch. For this reason, I think that shared memory vs pipes
> > will not be significantly different.
> 
> Yes, for big states that's right.
> I was thinking of not-so-big states,
> where all of it can be asked in one shot, but it may be problematic
> with iterative migration for sure. In that regard pipes are way
> better.
> 
> > > For software it allows the backend to survive a crash, as the old
> > > state can be set directly to a fresh backend instance.
> > 
> > Can you explain by describing the steps involved?
> 
> It's how vhost-user inflight I/O tracking works [1]: QEMU and the
> backend shares a memory region where the backend dump states
> continuously. In the event of a crash, this state can be dumped
> directly to a new vhost-user backend.

Neither shared memory nor INFLIGHT_FD are required for crash recovery
because the backend can stash state elsewhere, like tmpfs or systemd's
FDSTORE=1
(https://www.freedesktop.org/software/systemd/man/sd_pid_notify_with_fds.html).
INFLIGHT_FD is just a mechanism to stash an fd (only the backend
interprets the contents of the fd and the frontend doesn't even know
whether the fd is shared memory or another type of file).

I think crash recovery is orthogonal to this discussion because we're
talking about a streaming interface. A streaming interface breaks when a
crash occurs (regardless of whether it's implemented via shared memory
or pipes) as it involves two entities coordinating with each other. If
an entity goes away then the stream is incomplete and cannot be used for
crash recovery.

I guess you're thinking of an fd that contains the full state of the
device. That fd could be handed to the backend after reconnection for
crash recovery, but a streaming interface doesn't support that.

I guess you're bringing up the idea of having the full device state
always up-to-date for crash recovery purposes? I think crash recovery
should be optional since it's complex and hard to test while many
(most?) backends don't implement it.
It is likely that using the crash recovery state for live migration is
going to be even trickier because live migration has additional
requirements (e.g. compatibility).

My feeling is that it's too hard to satisfy both live migration and
crash recovery requirements for all vhost-user device types, but if you
have concrete ideas then let's discuss them.

Stefan
diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
index ec3fbae58d..5935b32fe3 100644
--- a/include/hw/virtio/vhost-backend.h
+++ b/include/hw/virtio/vhost-backend.h
@@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
     VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
 } VhostSetConfigType;
 
+typedef enum VhostDeviceStateDirection {
+    /* Transfer state from back-end (device) to front-end */
+    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
+    /* Transfer state from front-end to back-end (device) */
+    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
+} VhostDeviceStateDirection;
+
+typedef enum VhostDeviceStatePhase {
+    /* The device (and all its vrings) is stopped */
+    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
+} VhostDeviceStatePhase;
+
 struct vhost_inflight;
 struct vhost_dev;
 struct vhost_log;
@@ -133,6 +145,15 @@ typedef int (*vhost_set_config_call_op)(struct vhost_dev *dev,
 
 typedef void (*vhost_reset_status_op)(struct vhost_dev *dev);
 
+typedef bool (*vhost_supports_migratory_state_op)(struct vhost_dev *dev);
+typedef int (*vhost_set_device_state_fd_op)(struct vhost_dev *dev,
+                                            VhostDeviceStateDirection direction,
+                                            VhostDeviceStatePhase phase,
+                                            int fd,
+                                            int *reply_fd,
+                                            Error **errp);
+typedef int (*vhost_check_device_state_op)(struct vhost_dev *dev, Error **errp);
+
 typedef struct VhostOps {
     VhostBackendType backend_type;
     vhost_backend_init vhost_backend_init;
@@ -181,6 +202,9 @@ typedef struct VhostOps {
     vhost_force_iommu_op vhost_force_iommu;
     vhost_set_config_call_op vhost_set_config_call;
     vhost_reset_status_op vhost_reset_status;
+    vhost_supports_migratory_state_op vhost_supports_migratory_state;
+    vhost_set_device_state_fd_op vhost_set_device_state_fd;
+    vhost_check_device_state_op vhost_check_device_state;
 } VhostOps;
 
 int vhost_backend_update_device_iotlb(struct vhost_dev *dev,
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index 2fe02ed5d4..29449e0fe2 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -346,4 +346,83 @@ int vhost_dev_set_inflight(struct vhost_dev *dev,
                            struct vhost_inflight *inflight);
 int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size,
                            struct vhost_inflight *inflight);
+
+/**
+ * vhost_supports_migratory_state(): Checks whether the back-end
+ * supports transferring internal state for the purpose of migration.
+ * Support for this feature is required for vhost_set_device_state_fd()
+ * and vhost_check_device_state().
+ *
+ * @dev: The vhost device
+ *
+ * Returns true if the device supports these commands, and false if it
+ * does not.
+ */
+bool vhost_supports_migratory_state(struct vhost_dev *dev);
+
+/**
+ * vhost_set_device_state_fd(): Begin transfer of internal state from/to
+ * the back-end for the purpose of migration.  Data is to be transferred
+ * over a pipe according to @direction and @phase.  The sending end must
+ * only write to the pipe, and the receiving end must only read from it.
+ * Once the sending end is done, it closes its FD.  The receiving end
+ * must take this as the end-of-transfer signal and close its FD, too.
+ *
+ * @fd is the back-end's end of the pipe: The write FD for SAVE, and the
+ * read FD for LOAD.  This function transfers ownership of @fd to the
+ * back-end, i.e. closes it in the front-end.
+ *
+ * The back-end may optionally reply with an FD of its own, if this
+ * improves efficiency on its end.  In this case, the returned FD is
+ * stored in *reply_fd.  The back-end will discard the FD sent to it,
+ * and the front-end must use *reply_fd for transferring state to/from
+ * the back-end.
+ *
+ * @dev: The vhost device
+ * @direction: The direction in which the state is to be transferred.
+ *             For outgoing migrations, this is SAVE, and data is read
+ *             from the back-end and stored by the front-end in the
+ *             migration stream.
+ *             For incoming migrations, this is LOAD, and data is read
+ *             by the front-end from the migration stream and sent to
+ *             the back-end to restore the saved state.
+ * @phase: Which migration phase we are in.  Currently, there is only
+ *         STOPPED (device and all vrings are stopped), in the future,
+ *         more phases such as PRE_COPY or POST_COPY may be added.
+ * @fd: Back-end's end of the pipe through which to transfer state; note
+ *      that ownership is transferred to the back-end, so this function
+ *      closes @fd in the front-end.
+ * @reply_fd: If the back-end wishes to use a different pipe for state
+ *            transfer, this will contain an FD for the front-end to
+ *            use.  Otherwise, -1 is stored here.
+ * @errp: Potential error description
+ *
+ * Returns 0 on success, and -errno on failure.
+ */
+int vhost_set_device_state_fd(struct vhost_dev *dev,
+                              VhostDeviceStateDirection direction,
+                              VhostDeviceStatePhase phase,
+                              int fd,
+                              int *reply_fd,
+                              Error **errp);
+
+/**
+ * vhost_check_device_state(): After transferring state from/to the
+ * back-end via vhost_set_device_state_fd(), i.e. once the sending end
+ * has closed the pipe, inquire the back-end to report any potential
+ * errors that have occurred on its side.  This allows to sense errors
+ * like:
+ * - During outgoing migration, when the source side had already started
+ *   to produce its state, something went wrong and it failed to finish
+ * - During incoming migration, when the received state is somehow
+ *   invalid and cannot be processed by the back-end
+ *
+ * @dev: The vhost device
+ * @errp: Potential error description
+ *
+ * Returns 0 when the back-end reports successful state transfer and
+ * processing, and -errno when an error occurred somewhere.
+ */
+int vhost_check_device_state(struct vhost_dev *dev, Error **errp);
+
 #endif
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index e5285df4ba..93d8f2494a 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -83,6 +83,7 @@ enum VhostUserProtocolFeature {
     /* Feature 14 reserved for VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS. */
     VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS = 15,
     VHOST_USER_PROTOCOL_F_STATUS = 16,
+    VHOST_USER_PROTOCOL_F_MIGRATORY_STATE = 17,
     VHOST_USER_PROTOCOL_F_MAX
 };
 
@@ -130,6 +131,8 @@ typedef enum VhostUserRequest {
     VHOST_USER_REM_MEM_REG = 38,
     VHOST_USER_SET_STATUS = 39,
     VHOST_USER_GET_STATUS = 40,
+    VHOST_USER_SET_DEVICE_STATE_FD = 41,
+    VHOST_USER_CHECK_DEVICE_STATE = 42,
     VHOST_USER_MAX
 } VhostUserRequest;
 
@@ -210,6 +213,12 @@ typedef struct {
     uint32_t size; /* the following payload size */
 } QEMU_PACKED VhostUserHeader;
 
+/* Request payload of VHOST_USER_SET_DEVICE_STATE_FD */
+typedef struct VhostUserTransferDeviceState {
+    uint32_t direction;
+    uint32_t phase;
+} VhostUserTransferDeviceState;
+
 typedef union {
 #define VHOST_USER_VRING_IDX_MASK   (0xff)
 #define VHOST_USER_VRING_NOFD_MASK  (0x1 << 8)
@@ -224,6 +233,7 @@ typedef union {
     VhostUserCryptoSession session;
     VhostUserVringArea area;
     VhostUserInflight inflight;
+    VhostUserTransferDeviceState transfer_state;
 } VhostUserPayload;
 
 typedef struct VhostUserMsg {
@@ -2681,6 +2691,140 @@ static int vhost_user_dev_start(struct vhost_dev *dev, bool started)
     }
 }
 
+static bool vhost_user_supports_migratory_state(struct vhost_dev *dev)
+{
+    return virtio_has_feature(dev->protocol_features,
+                              VHOST_USER_PROTOCOL_F_MIGRATORY_STATE);
+}
+
+static int vhost_user_set_device_state_fd(struct vhost_dev *dev,
+                                          VhostDeviceStateDirection direction,
+                                          VhostDeviceStatePhase phase,
+                                          int fd,
+                                          int *reply_fd,
+                                          Error **errp)
+{
+    int ret;
+    struct vhost_user *vu = dev->opaque;
+    VhostUserMsg msg = {
+        .hdr = {
+            .request = VHOST_USER_SET_DEVICE_STATE_FD,
+            .flags = VHOST_USER_VERSION,
+            .size = sizeof(msg.payload.transfer_state),
+        },
+        .payload.transfer_state = {
+            .direction = direction,
+            .phase = phase,
+        },
+    };
+
+    *reply_fd = -1;
+
+    if (!vhost_user_supports_migratory_state(dev)) {
+        close(fd);
+        error_setg(errp, "Back-end does not support migration state transfer");
+        return -ENOTSUP;
+    }
+
+    ret = vhost_user_write(dev, &msg, &fd, 1);
+    close(fd);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret,
+                         "Failed to send SET_DEVICE_STATE_FD message");
+        return ret;
+    }
+
+    ret = vhost_user_read(dev, &msg);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret,
+                         "Failed to receive SET_DEVICE_STATE_FD reply");
+        return ret;
+    }
+
+    if (msg.hdr.request != VHOST_USER_SET_DEVICE_STATE_FD) {
+        error_setg(errp,
+                   "Received unexpected message type, expected %d, received %d",
+                   VHOST_USER_SET_DEVICE_STATE_FD, msg.hdr.request);
+        return -EPROTO;
+    }
+
+    if (msg.hdr.size != sizeof(msg.payload.u64)) {
+        error_setg(errp,
+                   "Received bad message size, expected %zu, received %" PRIu32,
+                   sizeof(msg.payload.u64), msg.hdr.size);
+        return -EPROTO;
+    }
+
+    if ((msg.payload.u64 & 0xff) != 0) {
+        error_setg(errp, "Back-end did not accept migration state transfer");
+        return -EIO;
+    }
+
+    if (!(msg.payload.u64 & VHOST_USER_VRING_NOFD_MASK)) {
+        *reply_fd = qemu_chr_fe_get_msgfd(vu->user->chr);
+        if (*reply_fd < 0) {
+            error_setg(errp,
+                       "Failed to get back-end-provided transfer pipe FD");
+            *reply_fd = -1;
+            return -EIO;
+        }
+    }
+
+    return 0;
+}
+
+static int vhost_user_check_device_state(struct vhost_dev *dev, Error **errp)
+{
+    int ret;
+    VhostUserMsg msg = {
+        .hdr = {
+            .request = VHOST_USER_CHECK_DEVICE_STATE,
+            .flags = VHOST_USER_VERSION,
+            .size = 0,
+        },
+    };
+
+    if (!vhost_user_supports_migratory_state(dev)) {
+        error_setg(errp, "Back-end does not support migration state transfer");
+        return -ENOTSUP;
+    }
+
+    ret = vhost_user_write(dev, &msg, NULL, 0);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret,
+                         "Failed to send CHECK_DEVICE_STATE message");
+        return ret;
+    }
+
+    ret = vhost_user_read(dev, &msg);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret,
+                         "Failed to receive CHECK_DEVICE_STATE reply");
+        return ret;
+    }
+
+    if (msg.hdr.request != VHOST_USER_CHECK_DEVICE_STATE) {
+        error_setg(errp,
+                   "Received unexpected message type, expected %d, received %d",
+                   VHOST_USER_CHECK_DEVICE_STATE, msg.hdr.request);
+        return -EPROTO;
+    }
+
+    if (msg.hdr.size != sizeof(msg.payload.u64)) {
+        error_setg(errp,
+                   "Received bad message size, expected %zu, received %" PRIu32,
+                   sizeof(msg.payload.u64), msg.hdr.size);
+        return -EPROTO;
+    }
+
+    if (msg.payload.u64 != 0) {
+        error_setg(errp, "Back-end failed to process its internal state");
+        return -EIO;
+    }
+
+    return 0;
+}
+
 const VhostOps user_ops = {
         .backend_type = VHOST_BACKEND_TYPE_USER,
         .vhost_backend_init = vhost_user_backend_init,
@@ -2716,4 +2860,7 @@ const VhostOps user_ops = {
         .vhost_get_inflight_fd = vhost_user_get_inflight_fd,
         .vhost_set_inflight_fd = vhost_user_set_inflight_fd,
         .vhost_dev_start = vhost_user_dev_start,
+        .vhost_supports_migratory_state = vhost_user_supports_migratory_state,
+        .vhost_set_device_state_fd = vhost_user_set_device_state_fd,
+        .vhost_check_device_state = vhost_user_check_device_state,
 };
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index cbff589efa..90099d8f6a 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -2088,3 +2088,40 @@ int vhost_net_set_backend(struct vhost_dev *hdev,
 
     return -ENOSYS;
 }
+
+bool vhost_supports_migratory_state(struct vhost_dev *dev)
+{
+    if (dev->vhost_ops->vhost_supports_migratory_state) {
+        return dev->vhost_ops->vhost_supports_migratory_state(dev);
+    }
+
+    return false;
+}
+
+int vhost_set_device_state_fd(struct vhost_dev *dev,
+                              VhostDeviceStateDirection direction,
+                              VhostDeviceStatePhase phase,
+                              int fd,
+                              int *reply_fd,
+                              Error **errp)
+{
+    if (dev->vhost_ops->vhost_set_device_state_fd) {
+        return dev->vhost_ops->vhost_set_device_state_fd(dev, direction, phase,
+                                                         fd, reply_fd, errp);
+    }
+
+    error_setg(errp,
+               "vhost transport does not support migration state transfer");
+    return -ENOSYS;
+}
+
+int vhost_check_device_state(struct vhost_dev *dev, Error **errp)
+{
+    if (dev->vhost_ops->vhost_check_device_state) {
+        return dev->vhost_ops->vhost_check_device_state(dev, errp);
+    }
+
+    error_setg(errp,
+               "vhost transport does not support migration state transfer");
+    return -ENOSYS;
+}
So-called "internal" virtio-fs migration refers to transporting the
back-end's (virtiofsd's) state through qemu's migration stream.  To do
this, we need to be able to transfer virtiofsd's internal state to and
from virtiofsd.

Because virtiofsd's internal state will not be too large, we believe it
is best to transfer it as a single binary blob after the streaming
phase.  Because this method should be useful to other vhost-user
implementations, too, it is introduced as a general-purpose addition to
the protocol, not limited to vhost-user-fs.

These are the additions to the protocol:
- New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
  This feature signals support for transferring state, and is added so
  that migration can fail early when the back-end has no support.

- SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
  over which to transfer the state.  The front-end sends an FD to the
  back-end into/from which it can write/read its state, and the back-end
  can decide to either use it, or reply with a different FD for the
  front-end to override the front-end's choice.
  The front-end creates a simple pipe to transfer the state, but maybe
  the back-end already has an FD into/from which it has to write/read
  its state, in which case it will want to override the simple pipe.
  Conversely, maybe in the future we find a way to have the front-end
  get an immediate FD for the migration stream (in some cases), in which
  case we will want to send this to the back-end instead of creating a
  pipe.
  Hence the negotiation: If one side has a better idea than a plain
  pipe, we will want to use that.

- CHECK_DEVICE_STATE: After the state has been transferred through the
  pipe (the end indicated by EOF), the front-end invokes this function
  to verify success.  There is no in-band way (through the pipe) to
  indicate failure, so we need to check explicitly.
Once the transfer pipe has been established via SET_DEVICE_STATE_FD
(which includes establishing the direction of transfer and migration
phase), the sending side writes its data into the pipe, and the reading
side reads it until it sees an EOF.  Then, the front-end will check for
success via CHECK_DEVICE_STATE, which on the destination side includes
checking for integrity (i.e. errors during deserialization).

Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 include/hw/virtio/vhost-backend.h |  24 +++++
 include/hw/virtio/vhost.h         |  79 ++++++++++++++++
 hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
 hw/virtio/vhost.c                 |  37 ++++++++
 4 files changed, 287 insertions(+)