diff mbox series

[2/4] vhost-user: Interface for migration state transfer

Message ID 20230411150515.14020-3-hreitz@redhat.com (mailing list archive)
State New, archived
Headers show
Series vhost-user-fs: Internal migration | expand

Commit Message

Hanna Czenczek April 11, 2023, 3:05 p.m. UTC
So-called "internal" virtio-fs migration refers to transporting the
back-end's (virtiofsd's) state through qemu's migration stream.  To do
this, we need to be able to transfer virtiofsd's internal state to and
from virtiofsd.

Because virtiofsd's internal state will not be too large, we believe it
is best to transfer it as a single binary blob after the streaming
phase.  Because this method should be useful to other vhost-user
implementations, too, it is introduced as a general-purpose addition to
the protocol, not limited to vhost-user-fs.

These are the additions to the protocol:
- New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
  This feature signals support for transferring state, and is added so
  that migration can fail early when the back-end has no support.

- SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
  over which to transfer the state.  The front-end sends an FD to the
  back-end into/from which it can write/read its state, and the back-end
  can decide to either use it, or reply with a different FD for the
  front-end to override the front-end's choice.
  The front-end creates a simple pipe to transfer the state, but maybe
  the back-end already has an FD into/from which it has to write/read
  its state, in which case it will want to override the simple pipe.
  Conversely, maybe in the future we find a way to have the front-end
  get an immediate FD for the migration stream (in some cases), in which
  case we will want to send this to the back-end instead of creating a
  pipe.
  Hence the negotiation: If one side has a better idea than a plain
  pipe, we will want to use that.

- CHECK_DEVICE_STATE: After the state has been transferred through the
  pipe (the end indicated by EOF), the front-end invokes this function
  to verify success.  There is no in-band way (through the pipe) to
  indicate failure, so we need to check explicitly.

Once the transfer pipe has been established via SET_DEVICE_STATE_FD
(which includes establishing the direction of transfer and migration
phase), the sending side writes its data into the pipe, and the reading
side reads it until it sees an EOF.  Then, the front-end will check for
success via CHECK_DEVICE_STATE, which on the destination side includes
checking for integrity (i.e. errors during deserialization).

Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 include/hw/virtio/vhost-backend.h |  24 +++++
 include/hw/virtio/vhost.h         |  79 ++++++++++++++++
 hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
 hw/virtio/vhost.c                 |  37 ++++++++
 4 files changed, 287 insertions(+)

Comments

Stefan Hajnoczi April 12, 2023, 9:06 p.m. UTC | #1
On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> So-called "internal" virtio-fs migration refers to transporting the
> back-end's (virtiofsd's) state through qemu's migration stream.  To do
> this, we need to be able to transfer virtiofsd's internal state to and
> from virtiofsd.
> 
> Because virtiofsd's internal state will not be too large, we believe it
> is best to transfer it as a single binary blob after the streaming
> phase.  Because this method should be useful to other vhost-user
> implementations, too, it is introduced as a general-purpose addition to
> the protocol, not limited to vhost-user-fs.
> 
> These are the additions to the protocol:
> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
>   This feature signals support for transferring state, and is added so
>   that migration can fail early when the back-end has no support.
> 
> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
>   over which to transfer the state.  The front-end sends an FD to the
>   back-end into/from which it can write/read its state, and the back-end
>   can decide to either use it, or reply with a different FD for the
>   front-end to override the front-end's choice.
>   The front-end creates a simple pipe to transfer the state, but maybe
>   the back-end already has an FD into/from which it has to write/read
>   its state, in which case it will want to override the simple pipe.
>   Conversely, maybe in the future we find a way to have the front-end
>   get an immediate FD for the migration stream (in some cases), in which
>   case we will want to send this to the back-end instead of creating a
>   pipe.
>   Hence the negotiation: If one side has a better idea than a plain
>   pipe, we will want to use that.
> 
> - CHECK_DEVICE_STATE: After the state has been transferred through the
>   pipe (the end indicated by EOF), the front-end invokes this function
>   to verify success.  There is no in-band way (through the pipe) to
>   indicate failure, so we need to check explicitly.
> 
> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> (which includes establishing the direction of transfer and migration
> phase), the sending side writes its data into the pipe, and the reading
> side reads it until it sees an EOF.  Then, the front-end will check for
> success via CHECK_DEVICE_STATE, which on the destination side includes
> checking for integrity (i.e. errors during deserialization).
> 
> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  include/hw/virtio/vhost-backend.h |  24 +++++
>  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
>  hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
>  hw/virtio/vhost.c                 |  37 ++++++++
>  4 files changed, 287 insertions(+)
> 
> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> index ec3fbae58d..5935b32fe3 100644
> --- a/include/hw/virtio/vhost-backend.h
> +++ b/include/hw/virtio/vhost-backend.h
> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
>      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
>  } VhostSetConfigType;
>  
> +typedef enum VhostDeviceStateDirection {
> +    /* Transfer state from back-end (device) to front-end */
> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> +    /* Transfer state from front-end to back-end (device) */
> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> +} VhostDeviceStateDirection;
> +
> +typedef enum VhostDeviceStatePhase {
> +    /* The device (and all its vrings) is stopped */
> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> +} VhostDeviceStatePhase;

vDPA has:

  /* Suspend a device so it does not process virtqueue requests anymore
   *
   * After the return of ioctl the device must preserve all the necessary state
   * (the virtqueue vring base plus the possible device specific states) that is
   * required for restoring in the future. The device must not change its
   * configuration after that point.
   */
  #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)

  /* Resume a device so it can resume processing virtqueue requests
   *
   * After the return of this ioctl the device will have restored all the
   * necessary states and it is fully operational to continue processing the
   * virtqueue descriptors.
   */
  #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)

I wonder if it makes sense to import these into vhost-user so that the
difference between kernel vhost and vhost-user is minimized. It's okay
if one of them is ahead of the other, but it would be nice to avoid
overlapping/duplicated functionality.

(And I hope vDPA will import the device state vhost-user messages
introduced in this series.)

> +
>  struct vhost_inflight;
>  struct vhost_dev;
>  struct vhost_log;
> @@ -133,6 +145,15 @@ typedef int (*vhost_set_config_call_op)(struct vhost_dev *dev,
>  
>  typedef void (*vhost_reset_status_op)(struct vhost_dev *dev);
>  
> +typedef bool (*vhost_supports_migratory_state_op)(struct vhost_dev *dev);
> +typedef int (*vhost_set_device_state_fd_op)(struct vhost_dev *dev,
> +                                            VhostDeviceStateDirection direction,
> +                                            VhostDeviceStatePhase phase,
> +                                            int fd,
> +                                            int *reply_fd,
> +                                            Error **errp);
> +typedef int (*vhost_check_device_state_op)(struct vhost_dev *dev, Error **errp);
> +
>  typedef struct VhostOps {
>      VhostBackendType backend_type;
>      vhost_backend_init vhost_backend_init;
> @@ -181,6 +202,9 @@ typedef struct VhostOps {
>      vhost_force_iommu_op vhost_force_iommu;
>      vhost_set_config_call_op vhost_set_config_call;
>      vhost_reset_status_op vhost_reset_status;
> +    vhost_supports_migratory_state_op vhost_supports_migratory_state;
> +    vhost_set_device_state_fd_op vhost_set_device_state_fd;
> +    vhost_check_device_state_op vhost_check_device_state;
>  } VhostOps;
>  
>  int vhost_backend_update_device_iotlb(struct vhost_dev *dev,
> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> index 2fe02ed5d4..29449e0fe2 100644
> --- a/include/hw/virtio/vhost.h
> +++ b/include/hw/virtio/vhost.h
> @@ -346,4 +346,83 @@ int vhost_dev_set_inflight(struct vhost_dev *dev,
>                             struct vhost_inflight *inflight);
>  int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size,
>                             struct vhost_inflight *inflight);
> +
> +/**
> + * vhost_supports_migratory_state(): Checks whether the back-end
> + * supports transferring internal state for the purpose of migration.
> + * Support for this feature is required for vhost_set_device_state_fd()
> + * and vhost_check_device_state().
> + *
> + * @dev: The vhost device
> + *
> + * Returns true if the device supports these commands, and false if it
> + * does not.
> + */
> +bool vhost_supports_migratory_state(struct vhost_dev *dev);
> +
> +/**
> + * vhost_set_device_state_fd(): Begin transfer of internal state from/to
> + * the back-end for the purpose of migration.  Data is to be transferred
> + * over a pipe according to @direction and @phase.  The sending end must
> + * only write to the pipe, and the receiving end must only read from it.
> + * Once the sending end is done, it closes its FD.  The receiving end
> + * must take this as the end-of-transfer signal and close its FD, too.
> + *
> + * @fd is the back-end's end of the pipe: The write FD for SAVE, and the
> + * read FD for LOAD.  This function transfers ownership of @fd to the
> + * back-end, i.e. closes it in the front-end.
> + *
> + * The back-end may optionally reply with an FD of its own, if this
> + * improves efficiency on its end.  In this case, the returned FD is
> + * stored in *reply_fd.  The back-end will discard the FD sent to it,
> + * and the front-end must use *reply_fd for transferring state to/from
> + * the back-end.
> + *
> + * @dev: The vhost device
> + * @direction: The direction in which the state is to be transferred.
> + *             For outgoing migrations, this is SAVE, and data is read
> + *             from the back-end and stored by the front-end in the
> + *             migration stream.
> + *             For incoming migrations, this is LOAD, and data is read
> + *             by the front-end from the migration stream and sent to
> + *             the back-end to restore the saved state.
> + * @phase: Which migration phase we are in.  Currently, there is only
> + *         STOPPED (device and all vrings are stopped), in the future,
> + *         more phases such as PRE_COPY or POST_COPY may be added.
> + * @fd: Back-end's end of the pipe through which to transfer state; note
> + *      that ownership is transferred to the back-end, so this function
> + *      closes @fd in the front-end.
> + * @reply_fd: If the back-end wishes to use a different pipe for state
> + *            transfer, this will contain an FD for the front-end to
> + *            use.  Otherwise, -1 is stored here.
> + * @errp: Potential error description
> + *
> + * Returns 0 on success, and -errno on failure.
> + */
> +int vhost_set_device_state_fd(struct vhost_dev *dev,
> +                              VhostDeviceStateDirection direction,
> +                              VhostDeviceStatePhase phase,
> +                              int fd,
> +                              int *reply_fd,
> +                              Error **errp);
> +
> +/**
> + * vhost_set_device_state_fd(): After transferring state from/to the
> + * back-end via vhost_set_device_state_fd(), i.e. once the sending end
> + * has closed the pipe, inquire the back-end to report any potential
> + * errors that have occurred on its side.  This allows to sense errors
> + * like:
> + * - During outgoing migration, when the source side had already started
> + *   to produce its state, something went wrong and it failed to finish
> + * - During incoming migration, when the received state is somehow
> + *   invalid and cannot be processed by the back-end
> + *
> + * @dev: The vhost device
> + * @errp: Potential error description
> + *
> + * Returns 0 when the back-end reports successful state transfer and
> + * processing, and -errno when an error occurred somewhere.
> + */
> +int vhost_check_device_state(struct vhost_dev *dev, Error **errp);
> +
>  #endif
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index e5285df4ba..93d8f2494a 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -83,6 +83,7 @@ enum VhostUserProtocolFeature {
>      /* Feature 14 reserved for VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS. */
>      VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS = 15,
>      VHOST_USER_PROTOCOL_F_STATUS = 16,
> +    VHOST_USER_PROTOCOL_F_MIGRATORY_STATE = 17,
>      VHOST_USER_PROTOCOL_F_MAX
>  };
>  
> @@ -130,6 +131,8 @@ typedef enum VhostUserRequest {
>      VHOST_USER_REM_MEM_REG = 38,
>      VHOST_USER_SET_STATUS = 39,
>      VHOST_USER_GET_STATUS = 40,
> +    VHOST_USER_SET_DEVICE_STATE_FD = 41,
> +    VHOST_USER_CHECK_DEVICE_STATE = 42,
>      VHOST_USER_MAX
>  } VhostUserRequest;
>  
> @@ -210,6 +213,12 @@ typedef struct {
>      uint32_t size; /* the following payload size */
>  } QEMU_PACKED VhostUserHeader;
>  
> +/* Request payload of VHOST_USER_SET_DEVICE_STATE_FD */
> +typedef struct VhostUserTransferDeviceState {
> +    uint32_t direction;
> +    uint32_t phase;
> +} VhostUserTransferDeviceState;
> +
>  typedef union {
>  #define VHOST_USER_VRING_IDX_MASK   (0xff)
>  #define VHOST_USER_VRING_NOFD_MASK  (0x1 << 8)
> @@ -224,6 +233,7 @@ typedef union {
>          VhostUserCryptoSession session;
>          VhostUserVringArea area;
>          VhostUserInflight inflight;
> +        VhostUserTransferDeviceState transfer_state;
>  } VhostUserPayload;
>  
>  typedef struct VhostUserMsg {
> @@ -2681,6 +2691,140 @@ static int vhost_user_dev_start(struct vhost_dev *dev, bool started)
>      }
>  }
>  
> +static bool vhost_user_supports_migratory_state(struct vhost_dev *dev)
> +{
> +    return virtio_has_feature(dev->protocol_features,
> +                              VHOST_USER_PROTOCOL_F_MIGRATORY_STATE);
> +}
> +
> +static int vhost_user_set_device_state_fd(struct vhost_dev *dev,
> +                                          VhostDeviceStateDirection direction,
> +                                          VhostDeviceStatePhase phase,
> +                                          int fd,
> +                                          int *reply_fd,
> +                                          Error **errp)
> +{
> +    int ret;
> +    struct vhost_user *vu = dev->opaque;
> +    VhostUserMsg msg = {
> +        .hdr = {
> +            .request = VHOST_USER_SET_DEVICE_STATE_FD,
> +            .flags = VHOST_USER_VERSION,
> +            .size = sizeof(msg.payload.transfer_state),
> +        },
> +        .payload.transfer_state = {
> +            .direction = direction,
> +            .phase = phase,
> +        },
> +    };
> +
> +    *reply_fd = -1;
> +
> +    if (!vhost_user_supports_migratory_state(dev)) {
> +        close(fd);
> +        error_setg(errp, "Back-end does not support migration state transfer");
> +        return -ENOTSUP;
> +    }
> +
> +    ret = vhost_user_write(dev, &msg, &fd, 1);
> +    close(fd);
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret,
> +                         "Failed to send SET_DEVICE_STATE_FD message");
> +        return ret;
> +    }
> +
> +    ret = vhost_user_read(dev, &msg);
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret,
> +                         "Failed to receive SET_DEVICE_STATE_FD reply");
> +        return ret;
> +    }
> +
> +    if (msg.hdr.request != VHOST_USER_SET_DEVICE_STATE_FD) {
> +        error_setg(errp,
> +                   "Received unexpected message type, expected %d, received %d",
> +                   VHOST_USER_SET_DEVICE_STATE_FD, msg.hdr.request);
> +        return -EPROTO;
> +    }
> +
> +    if (msg.hdr.size != sizeof(msg.payload.u64)) {
> +        error_setg(errp,
> +                   "Received bad message size, expected %zu, received %" PRIu32,
> +                   sizeof(msg.payload.u64), msg.hdr.size);
> +        return -EPROTO;
> +    }
> +
> +    if ((msg.payload.u64 & 0xff) != 0) {
> +        error_setg(errp, "Back-end did not accept migration state transfer");
> +        return -EIO;
> +    }
> +
> +    if (!(msg.payload.u64 & VHOST_USER_VRING_NOFD_MASK)) {
> +        *reply_fd = qemu_chr_fe_get_msgfd(vu->user->chr);
> +        if (*reply_fd < 0) {
> +            error_setg(errp,
> +                       "Failed to get back-end-provided transfer pipe FD");
> +            *reply_fd = -1;
> +            return -EIO;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +static int vhost_user_check_device_state(struct vhost_dev *dev, Error **errp)
> +{
> +    int ret;
> +    VhostUserMsg msg = {
> +        .hdr = {
> +            .request = VHOST_USER_CHECK_DEVICE_STATE,
> +            .flags = VHOST_USER_VERSION,
> +            .size = 0,
> +        },
> +    };
> +
> +    if (!vhost_user_supports_migratory_state(dev)) {
> +        error_setg(errp, "Back-end does not support migration state transfer");
> +        return -ENOTSUP;
> +    }
> +
> +    ret = vhost_user_write(dev, &msg, NULL, 0);
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret,
> +                         "Failed to send CHECK_DEVICE_STATE message");
> +        return ret;
> +    }
> +
> +    ret = vhost_user_read(dev, &msg);
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret,
> +                         "Failed to receive CHECK_DEVICE_STATE reply");
> +        return ret;
> +    }
> +
> +    if (msg.hdr.request != VHOST_USER_CHECK_DEVICE_STATE) {
> +        error_setg(errp,
> +                   "Received unexpected message type, expected %d, received %d",
> +                   VHOST_USER_CHECK_DEVICE_STATE, msg.hdr.request);
> +        return -EPROTO;
> +    }
> +
> +    if (msg.hdr.size != sizeof(msg.payload.u64)) {
> +        error_setg(errp,
> +                   "Received bad message size, expected %zu, received %" PRIu32,
> +                   sizeof(msg.payload.u64), msg.hdr.size);
> +        return -EPROTO;
> +    }
> +
> +    if (msg.payload.u64 != 0) {
> +        error_setg(errp, "Back-end failed to process its internal state");
> +        return -EIO;
> +    }
> +
> +    return 0;
> +}
> +
>  const VhostOps user_ops = {
>          .backend_type = VHOST_BACKEND_TYPE_USER,
>          .vhost_backend_init = vhost_user_backend_init,
> @@ -2716,4 +2860,7 @@ const VhostOps user_ops = {
>          .vhost_get_inflight_fd = vhost_user_get_inflight_fd,
>          .vhost_set_inflight_fd = vhost_user_set_inflight_fd,
>          .vhost_dev_start = vhost_user_dev_start,
> +        .vhost_supports_migratory_state = vhost_user_supports_migratory_state,
> +        .vhost_set_device_state_fd = vhost_user_set_device_state_fd,
> +        .vhost_check_device_state = vhost_user_check_device_state,
>  };
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index cbff589efa..90099d8f6a 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -2088,3 +2088,40 @@ int vhost_net_set_backend(struct vhost_dev *hdev,
>  
>      return -ENOSYS;
>  }
> +
> +bool vhost_supports_migratory_state(struct vhost_dev *dev)
> +{
> +    if (dev->vhost_ops->vhost_supports_migratory_state) {
> +        return dev->vhost_ops->vhost_supports_migratory_state(dev);
> +    }
> +
> +    return false;
> +}
> +
> +int vhost_set_device_state_fd(struct vhost_dev *dev,
> +                              VhostDeviceStateDirection direction,
> +                              VhostDeviceStatePhase phase,
> +                              int fd,
> +                              int *reply_fd,
> +                              Error **errp)
> +{
> +    if (dev->vhost_ops->vhost_set_device_state_fd) {
> +        return dev->vhost_ops->vhost_set_device_state_fd(dev, direction, phase,
> +                                                         fd, reply_fd, errp);
> +    }
> +
> +    error_setg(errp,
> +               "vhost transport does not support migration state transfer");
> +    return -ENOSYS;
> +}
> +
> +int vhost_check_device_state(struct vhost_dev *dev, Error **errp)
> +{
> +    if (dev->vhost_ops->vhost_check_device_state) {
> +        return dev->vhost_ops->vhost_check_device_state(dev, errp);
> +    }
> +
> +    error_setg(errp,
> +               "vhost transport does not support migration state transfer");
> +    return -ENOSYS;
> +}
> -- 
> 2.39.1
>
Eugenio Perez Martin April 13, 2023, 8:50 a.m. UTC | #2
On Tue, Apr 11, 2023 at 5:33 PM Hanna Czenczek <hreitz@redhat.com> wrote:
>
> So-called "internal" virtio-fs migration refers to transporting the
> back-end's (virtiofsd's) state through qemu's migration stream.  To do
> this, we need to be able to transfer virtiofsd's internal state to and
> from virtiofsd.
>
> Because virtiofsd's internal state will not be too large, we believe it
> is best to transfer it as a single binary blob after the streaming
> phase.  Because this method should be useful to other vhost-user
> implementations, too, it is introduced as a general-purpose addition to
> the protocol, not limited to vhost-user-fs.
>
> These are the additions to the protocol:
> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
>   This feature signals support for transferring state, and is added so
>   that migration can fail early when the back-end has no support.
>
> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
>   over which to transfer the state.  The front-end sends an FD to the
>   back-end into/from which it can write/read its state, and the back-end
>   can decide to either use it, or reply with a different FD for the
>   front-end to override the front-end's choice.
>   The front-end creates a simple pipe to transfer the state, but maybe
>   the back-end already has an FD into/from which it has to write/read
>   its state, in which case it will want to override the simple pipe.
>   Conversely, maybe in the future we find a way to have the front-end
>   get an immediate FD for the migration stream (in some cases), in which
>   case we will want to send this to the back-end instead of creating a
>   pipe.
>   Hence the negotiation: If one side has a better idea than a plain
>   pipe, we will want to use that.
>
> - CHECK_DEVICE_STATE: After the state has been transferred through the
>   pipe (the end indicated by EOF), the front-end invokes this function
>   to verify success.  There is no in-band way (through the pipe) to
>   indicate failure, so we need to check explicitly.
>
> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> (which includes establishing the direction of transfer and migration
> phase), the sending side writes its data into the pipe, and the reading
> side reads it until it sees an EOF.  Then, the front-end will check for
> success via CHECK_DEVICE_STATE, which on the destination side includes
> checking for integrity (i.e. errors during deserialization).
>
> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> ---
>  include/hw/virtio/vhost-backend.h |  24 +++++
>  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
>  hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
>  hw/virtio/vhost.c                 |  37 ++++++++
>  4 files changed, 287 insertions(+)
>
> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> index ec3fbae58d..5935b32fe3 100644
> --- a/include/hw/virtio/vhost-backend.h
> +++ b/include/hw/virtio/vhost-backend.h
> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
>      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
>  } VhostSetConfigType;
>
> +typedef enum VhostDeviceStateDirection {
> +    /* Transfer state from back-end (device) to front-end */
> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> +    /* Transfer state from front-end to back-end (device) */
> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> +} VhostDeviceStateDirection;
> +
> +typedef enum VhostDeviceStatePhase {
> +    /* The device (and all its vrings) is stopped */
> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> +} VhostDeviceStatePhase;
> +
>  struct vhost_inflight;
>  struct vhost_dev;
>  struct vhost_log;
> @@ -133,6 +145,15 @@ typedef int (*vhost_set_config_call_op)(struct vhost_dev *dev,
>
>  typedef void (*vhost_reset_status_op)(struct vhost_dev *dev);
>
> +typedef bool (*vhost_supports_migratory_state_op)(struct vhost_dev *dev);
> +typedef int (*vhost_set_device_state_fd_op)(struct vhost_dev *dev,
> +                                            VhostDeviceStateDirection direction,
> +                                            VhostDeviceStatePhase phase,
> +                                            int fd,
> +                                            int *reply_fd,
> +                                            Error **errp);
> +typedef int (*vhost_check_device_state_op)(struct vhost_dev *dev, Error **errp);
> +
>  typedef struct VhostOps {
>      VhostBackendType backend_type;
>      vhost_backend_init vhost_backend_init;
> @@ -181,6 +202,9 @@ typedef struct VhostOps {
>      vhost_force_iommu_op vhost_force_iommu;
>      vhost_set_config_call_op vhost_set_config_call;
>      vhost_reset_status_op vhost_reset_status;
> +    vhost_supports_migratory_state_op vhost_supports_migratory_state;
> +    vhost_set_device_state_fd_op vhost_set_device_state_fd;
> +    vhost_check_device_state_op vhost_check_device_state;
>  } VhostOps;
>
>  int vhost_backend_update_device_iotlb(struct vhost_dev *dev,
> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> index 2fe02ed5d4..29449e0fe2 100644
> --- a/include/hw/virtio/vhost.h
> +++ b/include/hw/virtio/vhost.h
> @@ -346,4 +346,83 @@ int vhost_dev_set_inflight(struct vhost_dev *dev,
>                             struct vhost_inflight *inflight);
>  int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size,
>                             struct vhost_inflight *inflight);
> +
> +/**
> + * vhost_supports_migratory_state(): Checks whether the back-end
> + * supports transferring internal state for the purpose of migration.
> + * Support for this feature is required for vhost_set_device_state_fd()
> + * and vhost_check_device_state().
> + *
> + * @dev: The vhost device
> + *
> + * Returns true if the device supports these commands, and false if it
> + * does not.
> + */
> +bool vhost_supports_migratory_state(struct vhost_dev *dev);
> +
> +/**
> + * vhost_set_device_state_fd(): Begin transfer of internal state from/to
> + * the back-end for the purpose of migration.  Data is to be transferred
> + * over a pipe according to @direction and @phase.  The sending end must
> + * only write to the pipe, and the receiving end must only read from it.
> + * Once the sending end is done, it closes its FD.  The receiving end
> + * must take this as the end-of-transfer signal and close its FD, too.
> + *
> + * @fd is the back-end's end of the pipe: The write FD for SAVE, and the
> + * read FD for LOAD.  This function transfers ownership of @fd to the
> + * back-end, i.e. closes it in the front-end.
> + *
> + * The back-end may optionally reply with an FD of its own, if this
> + * improves efficiency on its end.  In this case, the returned FD is
> + * stored in *reply_fd.  The back-end will discard the FD sent to it,
> + * and the front-end must use *reply_fd for transferring state to/from
> + * the back-end.
> + *
> + * @dev: The vhost device
> + * @direction: The direction in which the state is to be transferred.
> + *             For outgoing migrations, this is SAVE, and data is read
> + *             from the back-end and stored by the front-end in the
> + *             migration stream.
> + *             For incoming migrations, this is LOAD, and data is read
> + *             by the front-end from the migration stream and sent to
> + *             the back-end to restore the saved state.
> + * @phase: Which migration phase we are in.  Currently, there is only
> + *         STOPPED (device and all vrings are stopped), in the future,
> + *         more phases such as PRE_COPY or POST_COPY may be added.
> + * @fd: Back-end's end of the pipe through which to transfer state; note
> + *      that ownership is transferred to the back-end, so this function
> + *      closes @fd in the front-end.
> + * @reply_fd: If the back-end wishes to use a different pipe for state
> + *            transfer, this will contain an FD for the front-end to
> + *            use.  Otherwise, -1 is stored here.
> + * @errp: Potential error description
> + *
> + * Returns 0 on success, and -errno on failure.
> + */
> +int vhost_set_device_state_fd(struct vhost_dev *dev,
> +                              VhostDeviceStateDirection direction,
> +                              VhostDeviceStatePhase phase,
> +                              int fd,
> +                              int *reply_fd,
> +                              Error **errp);
> +
> +/**
> + * vhost_set_device_state_fd(): After transferring state from/to the

Nitpick: This function doc is for vhost_check_device_state not
vhost_set_device_state_fd.

Thanks!


> + * back-end via vhost_set_device_state_fd(), i.e. once the sending end
> + * has closed the pipe, inquire the back-end to report any potential
> + * errors that have occurred on its side.  This allows to sense errors
> + * like:
> + * - During outgoing migration, when the source side had already started
> + *   to produce its state, something went wrong and it failed to finish
> + * - During incoming migration, when the received state is somehow
> + *   invalid and cannot be processed by the back-end
> + *
> + * @dev: The vhost device
> + * @errp: Potential error description
> + *
> + * Returns 0 when the back-end reports successful state transfer and
> + * processing, and -errno when an error occurred somewhere.
> + */
> +int vhost_check_device_state(struct vhost_dev *dev, Error **errp);
> +
>  #endif
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index e5285df4ba..93d8f2494a 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -83,6 +83,7 @@ enum VhostUserProtocolFeature {
>      /* Feature 14 reserved for VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS. */
>      VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS = 15,
>      VHOST_USER_PROTOCOL_F_STATUS = 16,
> +    VHOST_USER_PROTOCOL_F_MIGRATORY_STATE = 17,
>      VHOST_USER_PROTOCOL_F_MAX
>  };
>
> @@ -130,6 +131,8 @@ typedef enum VhostUserRequest {
>      VHOST_USER_REM_MEM_REG = 38,
>      VHOST_USER_SET_STATUS = 39,
>      VHOST_USER_GET_STATUS = 40,
> +    VHOST_USER_SET_DEVICE_STATE_FD = 41,
> +    VHOST_USER_CHECK_DEVICE_STATE = 42,
>      VHOST_USER_MAX
>  } VhostUserRequest;
>
> @@ -210,6 +213,12 @@ typedef struct {
>      uint32_t size; /* the following payload size */
>  } QEMU_PACKED VhostUserHeader;
>
> +/* Request payload of VHOST_USER_SET_DEVICE_STATE_FD */
> +typedef struct VhostUserTransferDeviceState {
> +    uint32_t direction;
> +    uint32_t phase;
> +} VhostUserTransferDeviceState;
> +
>  typedef union {
>  #define VHOST_USER_VRING_IDX_MASK   (0xff)
>  #define VHOST_USER_VRING_NOFD_MASK  (0x1 << 8)
> @@ -224,6 +233,7 @@ typedef union {
>          VhostUserCryptoSession session;
>          VhostUserVringArea area;
>          VhostUserInflight inflight;
> +        VhostUserTransferDeviceState transfer_state;
>  } VhostUserPayload;
>
>  typedef struct VhostUserMsg {
> @@ -2681,6 +2691,140 @@ static int vhost_user_dev_start(struct vhost_dev *dev, bool started)
>      }
>  }
>
> +static bool vhost_user_supports_migratory_state(struct vhost_dev *dev)
> +{
> +    return virtio_has_feature(dev->protocol_features,
> +                              VHOST_USER_PROTOCOL_F_MIGRATORY_STATE);
> +}
> +
> +static int vhost_user_set_device_state_fd(struct vhost_dev *dev,
> +                                          VhostDeviceStateDirection direction,
> +                                          VhostDeviceStatePhase phase,
> +                                          int fd,
> +                                          int *reply_fd,
> +                                          Error **errp)
> +{
> +    int ret;
> +    struct vhost_user *vu = dev->opaque;
> +    VhostUserMsg msg = {
> +        .hdr = {
> +            .request = VHOST_USER_SET_DEVICE_STATE_FD,
> +            .flags = VHOST_USER_VERSION,
> +            .size = sizeof(msg.payload.transfer_state),
> +        },
> +        .payload.transfer_state = {
> +            .direction = direction,
> +            .phase = phase,
> +        },
> +    };
> +
> +    *reply_fd = -1;
> +
> +    if (!vhost_user_supports_migratory_state(dev)) {
> +        close(fd);
> +        error_setg(errp, "Back-end does not support migration state transfer");
> +        return -ENOTSUP;
> +    }
> +
> +    ret = vhost_user_write(dev, &msg, &fd, 1);
> +    close(fd);
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret,
> +                         "Failed to send SET_DEVICE_STATE_FD message");
> +        return ret;
> +    }
> +
> +    ret = vhost_user_read(dev, &msg);
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret,
> +                         "Failed to receive SET_DEVICE_STATE_FD reply");
> +        return ret;
> +    }
> +
> +    if (msg.hdr.request != VHOST_USER_SET_DEVICE_STATE_FD) {
> +        error_setg(errp,
> +                   "Received unexpected message type, expected %d, received %d",
> +                   VHOST_USER_SET_DEVICE_STATE_FD, msg.hdr.request);
> +        return -EPROTO;
> +    }
> +
> +    if (msg.hdr.size != sizeof(msg.payload.u64)) {
> +        error_setg(errp,
> +                   "Received bad message size, expected %zu, received %" PRIu32,
> +                   sizeof(msg.payload.u64), msg.hdr.size);
> +        return -EPROTO;
> +    }
> +
> +    if ((msg.payload.u64 & 0xff) != 0) {
> +        error_setg(errp, "Back-end did not accept migration state transfer");
> +        return -EIO;
> +    }
> +
> +    if (!(msg.payload.u64 & VHOST_USER_VRING_NOFD_MASK)) {
> +        *reply_fd = qemu_chr_fe_get_msgfd(vu->user->chr);
> +        if (*reply_fd < 0) {
> +            error_setg(errp,
> +                       "Failed to get back-end-provided transfer pipe FD");
> +            *reply_fd = -1;
> +            return -EIO;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +static int vhost_user_check_device_state(struct vhost_dev *dev, Error **errp)
> +{
> +    int ret;
> +    VhostUserMsg msg = {
> +        .hdr = {
> +            .request = VHOST_USER_CHECK_DEVICE_STATE,
> +            .flags = VHOST_USER_VERSION,
> +            .size = 0,
> +        },
> +    };
> +
> +    if (!vhost_user_supports_migratory_state(dev)) {
> +        error_setg(errp, "Back-end does not support migration state transfer");
> +        return -ENOTSUP;
> +    }
> +
> +    ret = vhost_user_write(dev, &msg, NULL, 0);
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret,
> +                         "Failed to send CHECK_DEVICE_STATE message");
> +        return ret;
> +    }
> +
> +    ret = vhost_user_read(dev, &msg);
> +    if (ret < 0) {
> +        error_setg_errno(errp, -ret,
> +                         "Failed to receive CHECK_DEVICE_STATE reply");
> +        return ret;
> +    }
> +
> +    if (msg.hdr.request != VHOST_USER_CHECK_DEVICE_STATE) {
> +        error_setg(errp,
> +                   "Received unexpected message type, expected %d, received %d",
> +                   VHOST_USER_CHECK_DEVICE_STATE, msg.hdr.request);
> +        return -EPROTO;
> +    }
> +
> +    if (msg.hdr.size != sizeof(msg.payload.u64)) {
> +        error_setg(errp,
> +                   "Received bad message size, expected %zu, received %" PRIu32,
> +                   sizeof(msg.payload.u64), msg.hdr.size);
> +        return -EPROTO;
> +    }
> +
> +    if (msg.payload.u64 != 0) {
> +        error_setg(errp, "Back-end failed to process its internal state");
> +        return -EIO;
> +    }
> +
> +    return 0;
> +}
> +
>  const VhostOps user_ops = {
>          .backend_type = VHOST_BACKEND_TYPE_USER,
>          .vhost_backend_init = vhost_user_backend_init,
> @@ -2716,4 +2860,7 @@ const VhostOps user_ops = {
>          .vhost_get_inflight_fd = vhost_user_get_inflight_fd,
>          .vhost_set_inflight_fd = vhost_user_set_inflight_fd,
>          .vhost_dev_start = vhost_user_dev_start,
> +        .vhost_supports_migratory_state = vhost_user_supports_migratory_state,
> +        .vhost_set_device_state_fd = vhost_user_set_device_state_fd,
> +        .vhost_check_device_state = vhost_user_check_device_state,
>  };
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index cbff589efa..90099d8f6a 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -2088,3 +2088,40 @@ int vhost_net_set_backend(struct vhost_dev *hdev,
>
>      return -ENOSYS;
>  }
> +
> +bool vhost_supports_migratory_state(struct vhost_dev *dev)
> +{
> +    if (dev->vhost_ops->vhost_supports_migratory_state) {
> +        return dev->vhost_ops->vhost_supports_migratory_state(dev);
> +    }
> +
> +    return false;
> +}
> +
> +int vhost_set_device_state_fd(struct vhost_dev *dev,
> +                              VhostDeviceStateDirection direction,
> +                              VhostDeviceStatePhase phase,
> +                              int fd,
> +                              int *reply_fd,
> +                              Error **errp)
> +{
> +    if (dev->vhost_ops->vhost_set_device_state_fd) {
> +        return dev->vhost_ops->vhost_set_device_state_fd(dev, direction, phase,
> +                                                         fd, reply_fd, errp);
> +    }
> +
> +    error_setg(errp,
> +               "vhost transport does not support migration state transfer");
> +    return -ENOSYS;
> +}
> +
> +int vhost_check_device_state(struct vhost_dev *dev, Error **errp)
> +{
> +    if (dev->vhost_ops->vhost_check_device_state) {
> +        return dev->vhost_ops->vhost_check_device_state(dev, errp);
> +    }
> +
> +    error_setg(errp,
> +               "vhost transport does not support migration state transfer");
> +    return -ENOSYS;
> +}
> --
> 2.39.1
>
>
Hanna Czenczek April 13, 2023, 9:24 a.m. UTC | #3
On 12.04.23 23:06, Stefan Hajnoczi wrote:
> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
>> So-called "internal" virtio-fs migration refers to transporting the
>> back-end's (virtiofsd's) state through qemu's migration stream.  To do
>> this, we need to be able to transfer virtiofsd's internal state to and
>> from virtiofsd.
>>
>> Because virtiofsd's internal state will not be too large, we believe it
>> is best to transfer it as a single binary blob after the streaming
>> phase.  Because this method should be useful to other vhost-user
>> implementations, too, it is introduced as a general-purpose addition to
>> the protocol, not limited to vhost-user-fs.
>>
>> These are the additions to the protocol:
>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
>>    This feature signals support for transferring state, and is added so
>>    that migration can fail early when the back-end has no support.
>>
>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
>>    over which to transfer the state.  The front-end sends an FD to the
>>    back-end into/from which it can write/read its state, and the back-end
>>    can decide to either use it, or reply with a different FD for the
>>    front-end to override the front-end's choice.
>>    The front-end creates a simple pipe to transfer the state, but maybe
>>    the back-end already has an FD into/from which it has to write/read
>>    its state, in which case it will want to override the simple pipe.
>>    Conversely, maybe in the future we find a way to have the front-end
>>    get an immediate FD for the migration stream (in some cases), in which
>>    case we will want to send this to the back-end instead of creating a
>>    pipe.
>>    Hence the negotiation: If one side has a better idea than a plain
>>    pipe, we will want to use that.
>>
>> - CHECK_DEVICE_STATE: After the state has been transferred through the
>>    pipe (the end indicated by EOF), the front-end invokes this function
>>    to verify success.  There is no in-band way (through the pipe) to
>>    indicate failure, so we need to check explicitly.
>>
>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
>> (which includes establishing the direction of transfer and migration
>> phase), the sending side writes its data into the pipe, and the reading
>> side reads it until it sees an EOF.  Then, the front-end will check for
>> success via CHECK_DEVICE_STATE, which on the destination side includes
>> checking for integrity (i.e. errors during deserialization).
>>
>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>> ---
>>   include/hw/virtio/vhost-backend.h |  24 +++++
>>   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
>>   hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
>>   hw/virtio/vhost.c                 |  37 ++++++++
>>   4 files changed, 287 insertions(+)
>>
>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
>> index ec3fbae58d..5935b32fe3 100644
>> --- a/include/hw/virtio/vhost-backend.h
>> +++ b/include/hw/virtio/vhost-backend.h
>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
>>       VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
>>   } VhostSetConfigType;
>>   
>> +typedef enum VhostDeviceStateDirection {
>> +    /* Transfer state from back-end (device) to front-end */
>> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
>> +    /* Transfer state from front-end to back-end (device) */
>> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
>> +} VhostDeviceStateDirection;
>> +
>> +typedef enum VhostDeviceStatePhase {
>> +    /* The device (and all its vrings) is stopped */
>> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
>> +} VhostDeviceStatePhase;
> vDPA has:
>
>    /* Suspend a device so it does not process virtqueue requests anymore
>     *
>     * After the return of ioctl the device must preserve all the necessary state
>     * (the virtqueue vring base plus the possible device specific states) that is
>     * required for restoring in the future. The device must not change its
>     * configuration after that point.
>     */
>    #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
>
>    /* Resume a device so it can resume processing virtqueue requests
>     *
>     * After the return of this ioctl the device will have restored all the
>     * necessary states and it is fully operational to continue processing the
>     * virtqueue descriptors.
>     */
>    #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
>
> I wonder if it makes sense to import these into vhost-user so that the
> difference between kernel vhost and vhost-user is minimized. It's okay
> if one of them is ahead of the other, but it would be nice to avoid
> overlapping/duplicated functionality.
>
> (And I hope vDPA will import the device state vhost-user messages
> introduced in this series.)

I don’t understand your suggestion.  (Like, I very simply don’t 
understand :))

These are vhost messages, right?  What purpose do you have in mind for 
them in vhost-user for internal migration?  They’re different from the 
state transfer messages, because they don’t transfer state to/from the 
front-end.  Also, the state transfer stuff is supposed to be distinct 
from starting/stopping the device; right now, it just requires the 
device to be stopped beforehand (or started only afterwards).  And in 
the future, new VhostDeviceStatePhase values may allow the messages to 
be used on devices that aren’t stopped.

So they seem to serve very different purposes.  I can imagine using the 
VDPA_{SUSPEND,RESUME} messages for external migration (what Anton is 
working on), but they don’t really help with internal migration 
implemented here.  If I were to add them, they’d just be sent in 
addition to the new messages added in this patch here, i.e. SUSPEND on 
the source before SET_DEVICE_STATE_FD, and RESUME on the destination 
after CHECK_DEVICE_STATE (we could use RESUME in place of 
CHECK_DEVICE_STATE on the destination, but we can’t do that on the 
source, so we still need CHECK_DEVICE_STATE).

Hanna
Hanna Czenczek April 13, 2023, 9:25 a.m. UTC | #4
On 13.04.23 10:50, Eugenio Perez Martin wrote:
> On Tue, Apr 11, 2023 at 5:33 PM Hanna Czenczek <hreitz@redhat.com> wrote:
>> So-called "internal" virtio-fs migration refers to transporting the
>> back-end's (virtiofsd's) state through qemu's migration stream.  To do
>> this, we need to be able to transfer virtiofsd's internal state to and
>> from virtiofsd.
>>
>> Because virtiofsd's internal state will not be too large, we believe it
>> is best to transfer it as a single binary blob after the streaming
>> phase.  Because this method should be useful to other vhost-user
>> implementations, too, it is introduced as a general-purpose addition to
>> the protocol, not limited to vhost-user-fs.
>>
>> These are the additions to the protocol:
>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
>>    This feature signals support for transferring state, and is added so
>>    that migration can fail early when the back-end has no support.
>>
>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
>>    over which to transfer the state.  The front-end sends an FD to the
>>    back-end into/from which it can write/read its state, and the back-end
>>    can decide to either use it, or reply with a different FD for the
>>    front-end to override the front-end's choice.
>>    The front-end creates a simple pipe to transfer the state, but maybe
>>    the back-end already has an FD into/from which it has to write/read
>>    its state, in which case it will want to override the simple pipe.
>>    Conversely, maybe in the future we find a way to have the front-end
>>    get an immediate FD for the migration stream (in some cases), in which
>>    case we will want to send this to the back-end instead of creating a
>>    pipe.
>>    Hence the negotiation: If one side has a better idea than a plain
>>    pipe, we will want to use that.
>>
>> - CHECK_DEVICE_STATE: After the state has been transferred through the
>>    pipe (the end indicated by EOF), the front-end invokes this function
>>    to verify success.  There is no in-band way (through the pipe) to
>>    indicate failure, so we need to check explicitly.
>>
>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
>> (which includes establishing the direction of transfer and migration
>> phase), the sending side writes its data into the pipe, and the reading
>> side reads it until it sees an EOF.  Then, the front-end will check for
>> success via CHECK_DEVICE_STATE, which on the destination side includes
>> checking for integrity (i.e. errors during deserialization).
>>
>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>> ---
>>   include/hw/virtio/vhost-backend.h |  24 +++++
>>   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
>>   hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
>>   hw/virtio/vhost.c                 |  37 ++++++++
>>   4 files changed, 287 insertions(+)

[...]

>> diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
>> index 2fe02ed5d4..29449e0fe2 100644
>> --- a/include/hw/virtio/vhost.h
>> +++ b/include/hw/virtio/vhost.h
>> @@ -346,4 +346,83 @@ int vhost_dev_set_inflight(struct vhost_dev *dev,

[...]

>> +/**
>> + * vhost_set_device_state_fd(): After transferring state from/to the
> Nitpick: This function doc is for vhost_check_device_state not
> vhost_set_device_state_fd.
>
> Thanks!

Oops, right, thanks!

Hanna

>> + * back-end via vhost_set_device_state_fd(), i.e. once the sending end
>> + * has closed the pipe, inquire the back-end to report any potential
>> + * errors that have occurred on its side.  This allows to sense errors
>> + * like:
>> + * - During outgoing migration, when the source side had already started
>> + *   to produce its state, something went wrong and it failed to finish
>> + * - During incoming migration, when the received state is somehow
>> + *   invalid and cannot be processed by the back-end
>> + *
>> + * @dev: The vhost device
>> + * @errp: Potential error description
>> + *
>> + * Returns 0 when the back-end reports successful state transfer and
>> + * processing, and -errno when an error occurred somewhere.
>> + */
>> +int vhost_check_device_state(struct vhost_dev *dev, Error **errp);
>> +
Eugenio Perez Martin April 13, 2023, 10:14 a.m. UTC | #5
On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > So-called "internal" virtio-fs migration refers to transporting the
> > back-end's (virtiofsd's) state through qemu's migration stream.  To do
> > this, we need to be able to transfer virtiofsd's internal state to and
> > from virtiofsd.
> >
> > Because virtiofsd's internal state will not be too large, we believe it
> > is best to transfer it as a single binary blob after the streaming
> > phase.  Because this method should be useful to other vhost-user
> > implementations, too, it is introduced as a general-purpose addition to
> > the protocol, not limited to vhost-user-fs.
> >
> > These are the additions to the protocol:
> > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> >   This feature signals support for transferring state, and is added so
> >   that migration can fail early when the back-end has no support.
> >
> > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> >   over which to transfer the state.  The front-end sends an FD to the
> >   back-end into/from which it can write/read its state, and the back-end
> >   can decide to either use it, or reply with a different FD for the
> >   front-end to override the front-end's choice.
> >   The front-end creates a simple pipe to transfer the state, but maybe
> >   the back-end already has an FD into/from which it has to write/read
> >   its state, in which case it will want to override the simple pipe.
> >   Conversely, maybe in the future we find a way to have the front-end
> >   get an immediate FD for the migration stream (in some cases), in which
> >   case we will want to send this to the back-end instead of creating a
> >   pipe.
> >   Hence the negotiation: If one side has a better idea than a plain
> >   pipe, we will want to use that.
> >
> > - CHECK_DEVICE_STATE: After the state has been transferred through the
> >   pipe (the end indicated by EOF), the front-end invokes this function
> >   to verify success.  There is no in-band way (through the pipe) to
> >   indicate failure, so we need to check explicitly.
> >
> > Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> > (which includes establishing the direction of transfer and migration
> > phase), the sending side writes its data into the pipe, and the reading
> > side reads it until it sees an EOF.  Then, the front-end will check for
> > success via CHECK_DEVICE_STATE, which on the destination side includes
> > checking for integrity (i.e. errors during deserialization).
> >
> > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > ---
> >  include/hw/virtio/vhost-backend.h |  24 +++++
> >  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> >  hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> >  hw/virtio/vhost.c                 |  37 ++++++++
> >  4 files changed, 287 insertions(+)
> >
> > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > index ec3fbae58d..5935b32fe3 100644
> > --- a/include/hw/virtio/vhost-backend.h
> > +++ b/include/hw/virtio/vhost-backend.h
> > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> >      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> >  } VhostSetConfigType;
> >
> > +typedef enum VhostDeviceStateDirection {
> > +    /* Transfer state from back-end (device) to front-end */
> > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > +    /* Transfer state from front-end to back-end (device) */
> > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > +} VhostDeviceStateDirection;
> > +
> > +typedef enum VhostDeviceStatePhase {
> > +    /* The device (and all its vrings) is stopped */
> > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > +} VhostDeviceStatePhase;
>
> vDPA has:
>
>   /* Suspend a device so it does not process virtqueue requests anymore
>    *
>    * After the return of ioctl the device must preserve all the necessary state
>    * (the virtqueue vring base plus the possible device specific states) that is
>    * required for restoring in the future. The device must not change its
>    * configuration after that point.
>    */
>   #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
>
>   /* Resume a device so it can resume processing virtqueue requests
>    *
>    * After the return of this ioctl the device will have restored all the
>    * necessary states and it is fully operational to continue processing the
>    * virtqueue descriptors.
>    */
>   #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
>
> I wonder if it makes sense to import these into vhost-user so that the
> difference between kernel vhost and vhost-user is minimized. It's okay
> if one of them is ahead of the other, but it would be nice to avoid
> overlapping/duplicated functionality.
>

That's what I had in mind in the first versions. I proposed VHOST_STOP
instead of VHOST_VDPA_STOP for this very reason. Later it did change
to SUSPEND.

Generally it is better if we make the interface less parametrized and
we trust in the messages and its semantics in my opinion. In other
words, instead of
vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), send
individually the equivalent of VHOST_VDPA_SUSPEND vhost-user command.

Another way to apply this is with the "direction" parameter. Maybe it
is better to split it into "set_state_fd" and "get_state_fd"?

In that case, reusing the ioctls as vhost-user messages would be ok.
But that puts this proposal further from the VFIO code, which uses
"migration_set_state(state)", and maybe it is better when the number
of states is high.

BTW, is there any usage for *reply_fd at this moment from the backend?

> (And I hope vDPA will import the device state vhost-user messages
> introduced in this series.)
>

I guess they will be needed for vdpa-fs devices? Is there any emulated
virtio-fs in qemu?

Thanks!

> > +
> >  struct vhost_inflight;
> >  struct vhost_dev;
> >  struct vhost_log;
> > @@ -133,6 +145,15 @@ typedef int (*vhost_set_config_call_op)(struct vhost_dev *dev,
> >
> >  typedef void (*vhost_reset_status_op)(struct vhost_dev *dev);
> >
> > +typedef bool (*vhost_supports_migratory_state_op)(struct vhost_dev *dev);
> > +typedef int (*vhost_set_device_state_fd_op)(struct vhost_dev *dev,
> > +                                            VhostDeviceStateDirection direction,
> > +                                            VhostDeviceStatePhase phase,
> > +                                            int fd,
> > +                                            int *reply_fd,
> > +                                            Error **errp);
> > +typedef int (*vhost_check_device_state_op)(struct vhost_dev *dev, Error **errp);
> > +
> >  typedef struct VhostOps {
> >      VhostBackendType backend_type;
> >      vhost_backend_init vhost_backend_init;
> > @@ -181,6 +202,9 @@ typedef struct VhostOps {
> >      vhost_force_iommu_op vhost_force_iommu;
> >      vhost_set_config_call_op vhost_set_config_call;
> >      vhost_reset_status_op vhost_reset_status;
> > +    vhost_supports_migratory_state_op vhost_supports_migratory_state;
> > +    vhost_set_device_state_fd_op vhost_set_device_state_fd;
> > +    vhost_check_device_state_op vhost_check_device_state;
> >  } VhostOps;
> >
> >  int vhost_backend_update_device_iotlb(struct vhost_dev *dev,
> > diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
> > index 2fe02ed5d4..29449e0fe2 100644
> > --- a/include/hw/virtio/vhost.h
> > +++ b/include/hw/virtio/vhost.h
> > @@ -346,4 +346,83 @@ int vhost_dev_set_inflight(struct vhost_dev *dev,
> >                             struct vhost_inflight *inflight);
> >  int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size,
> >                             struct vhost_inflight *inflight);
> > +
> > +/**
> > + * vhost_supports_migratory_state(): Checks whether the back-end
> > + * supports transferring internal state for the purpose of migration.
> > + * Support for this feature is required for vhost_set_device_state_fd()
> > + * and vhost_check_device_state().
> > + *
> > + * @dev: The vhost device
> > + *
> > + * Returns true if the device supports these commands, and false if it
> > + * does not.
> > + */
> > +bool vhost_supports_migratory_state(struct vhost_dev *dev);
> > +
> > +/**
> > + * vhost_set_device_state_fd(): Begin transfer of internal state from/to
> > + * the back-end for the purpose of migration.  Data is to be transferred
> > + * over a pipe according to @direction and @phase.  The sending end must
> > + * only write to the pipe, and the receiving end must only read from it.
> > + * Once the sending end is done, it closes its FD.  The receiving end
> > + * must take this as the end-of-transfer signal and close its FD, too.
> > + *
> > + * @fd is the back-end's end of the pipe: The write FD for SAVE, and the
> > + * read FD for LOAD.  This function transfers ownership of @fd to the
> > + * back-end, i.e. closes it in the front-end.
> > + *
> > + * The back-end may optionally reply with an FD of its own, if this
> > + * improves efficiency on its end.  In this case, the returned FD is
> > + * stored in *reply_fd.  The back-end will discard the FD sent to it,
> > + * and the front-end must use *reply_fd for transferring state to/from
> > + * the back-end.
> > + *
> > + * @dev: The vhost device
> > + * @direction: The direction in which the state is to be transferred.
> > + *             For outgoing migrations, this is SAVE, and data is read
> > + *             from the back-end and stored by the front-end in the
> > + *             migration stream.
> > + *             For incoming migrations, this is LOAD, and data is read
> > + *             by the front-end from the migration stream and sent to
> > + *             the back-end to restore the saved state.
> > + * @phase: Which migration phase we are in.  Currently, there is only
> > + *         STOPPED (device and all vrings are stopped), in the future,
> > + *         more phases such as PRE_COPY or POST_COPY may be added.
> > + * @fd: Back-end's end of the pipe through which to transfer state; note
> > + *      that ownership is transferred to the back-end, so this function
> > + *      closes @fd in the front-end.
> > + * @reply_fd: If the back-end wishes to use a different pipe for state
> > + *            transfer, this will contain an FD for the front-end to
> > + *            use.  Otherwise, -1 is stored here.
> > + * @errp: Potential error description
> > + *
> > + * Returns 0 on success, and -errno on failure.
> > + */
> > +int vhost_set_device_state_fd(struct vhost_dev *dev,
> > +                              VhostDeviceStateDirection direction,
> > +                              VhostDeviceStatePhase phase,
> > +                              int fd,
> > +                              int *reply_fd,
> > +                              Error **errp);
> > +
> > +/**
> > + * vhost_set_device_state_fd(): After transferring state from/to the
> > + * back-end via vhost_set_device_state_fd(), i.e. once the sending end
> > + * has closed the pipe, inquire the back-end to report any potential
> > + * errors that have occurred on its side.  This allows to sense errors
> > + * like:
> > + * - During outgoing migration, when the source side had already started
> > + *   to produce its state, something went wrong and it failed to finish
> > + * - During incoming migration, when the received state is somehow
> > + *   invalid and cannot be processed by the back-end
> > + *
> > + * @dev: The vhost device
> > + * @errp: Potential error description
> > + *
> > + * Returns 0 when the back-end reports successful state transfer and
> > + * processing, and -errno when an error occurred somewhere.
> > + */
> > +int vhost_check_device_state(struct vhost_dev *dev, Error **errp);
> > +
> >  #endif
> > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > index e5285df4ba..93d8f2494a 100644
> > --- a/hw/virtio/vhost-user.c
> > +++ b/hw/virtio/vhost-user.c
> > @@ -83,6 +83,7 @@ enum VhostUserProtocolFeature {
> >      /* Feature 14 reserved for VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS. */
> >      VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS = 15,
> >      VHOST_USER_PROTOCOL_F_STATUS = 16,
> > +    VHOST_USER_PROTOCOL_F_MIGRATORY_STATE = 17,
> >      VHOST_USER_PROTOCOL_F_MAX
> >  };
> >
> > @@ -130,6 +131,8 @@ typedef enum VhostUserRequest {
> >      VHOST_USER_REM_MEM_REG = 38,
> >      VHOST_USER_SET_STATUS = 39,
> >      VHOST_USER_GET_STATUS = 40,
> > +    VHOST_USER_SET_DEVICE_STATE_FD = 41,
> > +    VHOST_USER_CHECK_DEVICE_STATE = 42,
> >      VHOST_USER_MAX
> >  } VhostUserRequest;
> >
> > @@ -210,6 +213,12 @@ typedef struct {
> >      uint32_t size; /* the following payload size */
> >  } QEMU_PACKED VhostUserHeader;
> >
> > +/* Request payload of VHOST_USER_SET_DEVICE_STATE_FD */
> > +typedef struct VhostUserTransferDeviceState {
> > +    uint32_t direction;
> > +    uint32_t phase;
> > +} VhostUserTransferDeviceState;
> > +
> >  typedef union {
> >  #define VHOST_USER_VRING_IDX_MASK   (0xff)
> >  #define VHOST_USER_VRING_NOFD_MASK  (0x1 << 8)
> > @@ -224,6 +233,7 @@ typedef union {
> >          VhostUserCryptoSession session;
> >          VhostUserVringArea area;
> >          VhostUserInflight inflight;
> > +        VhostUserTransferDeviceState transfer_state;
> >  } VhostUserPayload;
> >
> >  typedef struct VhostUserMsg {
> > @@ -2681,6 +2691,140 @@ static int vhost_user_dev_start(struct vhost_dev *dev, bool started)
> >      }
> >  }
> >
> > +static bool vhost_user_supports_migratory_state(struct vhost_dev *dev)
> > +{
> > +    return virtio_has_feature(dev->protocol_features,
> > +                              VHOST_USER_PROTOCOL_F_MIGRATORY_STATE);
> > +}
> > +
> > +static int vhost_user_set_device_state_fd(struct vhost_dev *dev,
> > +                                          VhostDeviceStateDirection direction,
> > +                                          VhostDeviceStatePhase phase,
> > +                                          int fd,
> > +                                          int *reply_fd,
> > +                                          Error **errp)
> > +{
> > +    int ret;
> > +    struct vhost_user *vu = dev->opaque;
> > +    VhostUserMsg msg = {
> > +        .hdr = {
> > +            .request = VHOST_USER_SET_DEVICE_STATE_FD,
> > +            .flags = VHOST_USER_VERSION,
> > +            .size = sizeof(msg.payload.transfer_state),
> > +        },
> > +        .payload.transfer_state = {
> > +            .direction = direction,
> > +            .phase = phase,
> > +        },
> > +    };
> > +
> > +    *reply_fd = -1;
> > +
> > +    if (!vhost_user_supports_migratory_state(dev)) {
> > +        close(fd);
> > +        error_setg(errp, "Back-end does not support migration state transfer");
> > +        return -ENOTSUP;
> > +    }
> > +
> > +    ret = vhost_user_write(dev, &msg, &fd, 1);
> > +    close(fd);
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret,
> > +                         "Failed to send SET_DEVICE_STATE_FD message");
> > +        return ret;
> > +    }
> > +
> > +    ret = vhost_user_read(dev, &msg);
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret,
> > +                         "Failed to receive SET_DEVICE_STATE_FD reply");
> > +        return ret;
> > +    }
> > +
> > +    if (msg.hdr.request != VHOST_USER_SET_DEVICE_STATE_FD) {
> > +        error_setg(errp,
> > +                   "Received unexpected message type, expected %d, received %d",
> > +                   VHOST_USER_SET_DEVICE_STATE_FD, msg.hdr.request);
> > +        return -EPROTO;
> > +    }
> > +
> > +    if (msg.hdr.size != sizeof(msg.payload.u64)) {
> > +        error_setg(errp,
> > +                   "Received bad message size, expected %zu, received %" PRIu32,
> > +                   sizeof(msg.payload.u64), msg.hdr.size);
> > +        return -EPROTO;
> > +    }
> > +
> > +    if ((msg.payload.u64 & 0xff) != 0) {
> > +        error_setg(errp, "Back-end did not accept migration state transfer");
> > +        return -EIO;
> > +    }
> > +
> > +    if (!(msg.payload.u64 & VHOST_USER_VRING_NOFD_MASK)) {
> > +        *reply_fd = qemu_chr_fe_get_msgfd(vu->user->chr);
> > +        if (*reply_fd < 0) {
> > +            error_setg(errp,
> > +                       "Failed to get back-end-provided transfer pipe FD");
> > +            *reply_fd = -1;
> > +            return -EIO;
> > +        }
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +static int vhost_user_check_device_state(struct vhost_dev *dev, Error **errp)
> > +{
> > +    int ret;
> > +    VhostUserMsg msg = {
> > +        .hdr = {
> > +            .request = VHOST_USER_CHECK_DEVICE_STATE,
> > +            .flags = VHOST_USER_VERSION,
> > +            .size = 0,
> > +        },
> > +    };
> > +
> > +    if (!vhost_user_supports_migratory_state(dev)) {
> > +        error_setg(errp, "Back-end does not support migration state transfer");
> > +        return -ENOTSUP;
> > +    }
> > +
> > +    ret = vhost_user_write(dev, &msg, NULL, 0);
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret,
> > +                         "Failed to send CHECK_DEVICE_STATE message");
> > +        return ret;
> > +    }
> > +
> > +    ret = vhost_user_read(dev, &msg);
> > +    if (ret < 0) {
> > +        error_setg_errno(errp, -ret,
> > +                         "Failed to receive CHECK_DEVICE_STATE reply");
> > +        return ret;
> > +    }
> > +
> > +    if (msg.hdr.request != VHOST_USER_CHECK_DEVICE_STATE) {
> > +        error_setg(errp,
> > +                   "Received unexpected message type, expected %d, received %d",
> > +                   VHOST_USER_CHECK_DEVICE_STATE, msg.hdr.request);
> > +        return -EPROTO;
> > +    }
> > +
> > +    if (msg.hdr.size != sizeof(msg.payload.u64)) {
> > +        error_setg(errp,
> > +                   "Received bad message size, expected %zu, received %" PRIu32,
> > +                   sizeof(msg.payload.u64), msg.hdr.size);
> > +        return -EPROTO;
> > +    }
> > +
> > +    if (msg.payload.u64 != 0) {
> > +        error_setg(errp, "Back-end failed to process its internal state");
> > +        return -EIO;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> >  const VhostOps user_ops = {
> >          .backend_type = VHOST_BACKEND_TYPE_USER,
> >          .vhost_backend_init = vhost_user_backend_init,
> > @@ -2716,4 +2860,7 @@ const VhostOps user_ops = {
> >          .vhost_get_inflight_fd = vhost_user_get_inflight_fd,
> >          .vhost_set_inflight_fd = vhost_user_set_inflight_fd,
> >          .vhost_dev_start = vhost_user_dev_start,
> > +        .vhost_supports_migratory_state = vhost_user_supports_migratory_state,
> > +        .vhost_set_device_state_fd = vhost_user_set_device_state_fd,
> > +        .vhost_check_device_state = vhost_user_check_device_state,
> >  };
> > diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> > index cbff589efa..90099d8f6a 100644
> > --- a/hw/virtio/vhost.c
> > +++ b/hw/virtio/vhost.c
> > @@ -2088,3 +2088,40 @@ int vhost_net_set_backend(struct vhost_dev *hdev,
> >
> >      return -ENOSYS;
> >  }
> > +
> > +bool vhost_supports_migratory_state(struct vhost_dev *dev)
> > +{
> > +    if (dev->vhost_ops->vhost_supports_migratory_state) {
> > +        return dev->vhost_ops->vhost_supports_migratory_state(dev);
> > +    }
> > +
> > +    return false;
> > +}
> > +
> > +int vhost_set_device_state_fd(struct vhost_dev *dev,
> > +                              VhostDeviceStateDirection direction,
> > +                              VhostDeviceStatePhase phase,
> > +                              int fd,
> > +                              int *reply_fd,
> > +                              Error **errp)
> > +{
> > +    if (dev->vhost_ops->vhost_set_device_state_fd) {
> > +        return dev->vhost_ops->vhost_set_device_state_fd(dev, direction, phase,
> > +                                                         fd, reply_fd, errp);
> > +    }
> > +
> > +    error_setg(errp,
> > +               "vhost transport does not support migration state transfer");
> > +    return -ENOSYS;
> > +}
> > +
> > +int vhost_check_device_state(struct vhost_dev *dev, Error **errp)
> > +{
> > +    if (dev->vhost_ops->vhost_check_device_state) {
> > +        return dev->vhost_ops->vhost_check_device_state(dev, errp);
> > +    }
> > +
> > +    error_setg(errp,
> > +               "vhost transport does not support migration state transfer");
> > +    return -ENOSYS;
> > +}
> > --
> > 2.39.1
> >
Stefan Hajnoczi April 13, 2023, 11:07 a.m. UTC | #6
On Thu, 13 Apr 2023 at 06:15, Eugenio Perez Martin <eperezma@redhat.com> wrote:
> On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > (And I hope vDPA will import the device state vhost-user messages
> > introduced in this series.)
> >
>
> I guess they will be needed for vdpa-fs devices? Is there any emulated
> virtio-fs in qemu?

Maybe also virtio-gpu or virtio-crypto, if someone decides to create
hardware or in-kernel implementations.

virtiofs is not built into QEMU, there are only vhost-user implementations.

Stefan
Stefan Hajnoczi April 13, 2023, 11:38 a.m. UTC | #7
On Thu, 13 Apr 2023 at 05:24, Hanna Czenczek <hreitz@redhat.com> wrote:
>
> On 12.04.23 23:06, Stefan Hajnoczi wrote:
> > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> >> So-called "internal" virtio-fs migration refers to transporting the
> >> back-end's (virtiofsd's) state through qemu's migration stream.  To do
> >> this, we need to be able to transfer virtiofsd's internal state to and
> >> from virtiofsd.
> >>
> >> Because virtiofsd's internal state will not be too large, we believe it
> >> is best to transfer it as a single binary blob after the streaming
> >> phase.  Because this method should be useful to other vhost-user
> >> implementations, too, it is introduced as a general-purpose addition to
> >> the protocol, not limited to vhost-user-fs.
> >>
> >> These are the additions to the protocol:
> >> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> >>    This feature signals support for transferring state, and is added so
> >>    that migration can fail early when the back-end has no support.
> >>
> >> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> >>    over which to transfer the state.  The front-end sends an FD to the
> >>    back-end into/from which it can write/read its state, and the back-end
> >>    can decide to either use it, or reply with a different FD for the
> >>    front-end to override the front-end's choice.
> >>    The front-end creates a simple pipe to transfer the state, but maybe
> >>    the back-end already has an FD into/from which it has to write/read
> >>    its state, in which case it will want to override the simple pipe.
> >>    Conversely, maybe in the future we find a way to have the front-end
> >>    get an immediate FD for the migration stream (in some cases), in which
> >>    case we will want to send this to the back-end instead of creating a
> >>    pipe.
> >>    Hence the negotiation: If one side has a better idea than a plain
> >>    pipe, we will want to use that.
> >>
> >> - CHECK_DEVICE_STATE: After the state has been transferred through the
> >>    pipe (the end indicated by EOF), the front-end invokes this function
> >>    to verify success.  There is no in-band way (through the pipe) to
> >>    indicate failure, so we need to check explicitly.
> >>
> >> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> >> (which includes establishing the direction of transfer and migration
> >> phase), the sending side writes its data into the pipe, and the reading
> >> side reads it until it sees an EOF.  Then, the front-end will check for
> >> success via CHECK_DEVICE_STATE, which on the destination side includes
> >> checking for integrity (i.e. errors during deserialization).
> >>
> >> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> >> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> >> ---
> >>   include/hw/virtio/vhost-backend.h |  24 +++++
> >>   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> >>   hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> >>   hw/virtio/vhost.c                 |  37 ++++++++
> >>   4 files changed, 287 insertions(+)
> >>
> >> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> >> index ec3fbae58d..5935b32fe3 100644
> >> --- a/include/hw/virtio/vhost-backend.h
> >> +++ b/include/hw/virtio/vhost-backend.h
> >> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> >>       VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> >>   } VhostSetConfigType;
> >>
> >> +typedef enum VhostDeviceStateDirection {
> >> +    /* Transfer state from back-end (device) to front-end */
> >> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> >> +    /* Transfer state from front-end to back-end (device) */
> >> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> >> +} VhostDeviceStateDirection;
> >> +
> >> +typedef enum VhostDeviceStatePhase {
> >> +    /* The device (and all its vrings) is stopped */
> >> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> >> +} VhostDeviceStatePhase;
> > vDPA has:
> >
> >    /* Suspend a device so it does not process virtqueue requests anymore
> >     *
> >     * After the return of ioctl the device must preserve all the necessary state
> >     * (the virtqueue vring base plus the possible device specific states) that is
> >     * required for restoring in the future. The device must not change its
> >     * configuration after that point.
> >     */
> >    #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> >
> >    /* Resume a device so it can resume processing virtqueue requests
> >     *
> >     * After the return of this ioctl the device will have restored all the
> >     * necessary states and it is fully operational to continue processing the
> >     * virtqueue descriptors.
> >     */
> >    #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> >
> > I wonder if it makes sense to import these into vhost-user so that the
> > difference between kernel vhost and vhost-user is minimized. It's okay
> > if one of them is ahead of the other, but it would be nice to avoid
> > overlapping/duplicated functionality.
> >
> > (And I hope vDPA will import the device state vhost-user messages
> > introduced in this series.)
>
> I don’t understand your suggestion.  (Like, I very simply don’t
> understand :))
>
> These are vhost messages, right?  What purpose do you have in mind for
> them in vhost-user for internal migration?  They’re different from the
> state transfer messages, because they don’t transfer state to/from the
> front-end.  Also, the state transfer stuff is supposed to be distinct
> from starting/stopping the device; right now, it just requires the
> device to be stopped beforehand (or started only afterwards).  And in
> the future, new VhostDeviceStatePhase values may allow the messages to
> be used on devices that aren’t stopped.
>
> So they seem to serve very different purposes.  I can imagine using the
> VDPA_{SUSPEND,RESUME} messages for external migration (what Anton is
> working on), but they don’t really help with internal migration
> implemented here.  If I were to add them, they’d just be sent in
> addition to the new messages added in this patch here, i.e. SUSPEND on
> the source before SET_DEVICE_STATE_FD, and RESUME on the destination
> after CHECK_DEVICE_STATE (we could use RESUME in place of
> CHECK_DEVICE_STATE on the destination, but we can’t do that on the
> source, so we still need CHECK_DEVICE_STATE).

Yes, they are complementary to the device state fd message. I want to
make sure pre-conditions about the device's state (running vs stopped)
already take into account the vDPA SUSPEND/RESUME model.

vDPA will need device state save/load in the future. For virtiofs
devices, for example. This is why I think we should plan for vDPA and
vhost-user to share the same interface.

Also, I think the code path you're relying on (vhost_dev_stop()) on
doesn't work for backends that implement VHOST_USER_PROTOCOL_F_STATUS
because stopping the backend resets the device and throws away its
state. SUSPEND/RESUME solve this. This looks like a more general
problem since vhost_dev_stop() is called any time the VM is paused.
Maybe it needs to use SUSPEND/RESUME whenever possible.

Stefan
Hanna Czenczek April 13, 2023, 5:31 p.m. UTC | #8
On 13.04.23 12:14, Eugenio Perez Martin wrote:
> On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
>>> So-called "internal" virtio-fs migration refers to transporting the
>>> back-end's (virtiofsd's) state through qemu's migration stream.  To do
>>> this, we need to be able to transfer virtiofsd's internal state to and
>>> from virtiofsd.
>>>
>>> Because virtiofsd's internal state will not be too large, we believe it
>>> is best to transfer it as a single binary blob after the streaming
>>> phase.  Because this method should be useful to other vhost-user
>>> implementations, too, it is introduced as a general-purpose addition to
>>> the protocol, not limited to vhost-user-fs.
>>>
>>> These are the additions to the protocol:
>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
>>>    This feature signals support for transferring state, and is added so
>>>    that migration can fail early when the back-end has no support.
>>>
>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
>>>    over which to transfer the state.  The front-end sends an FD to the
>>>    back-end into/from which it can write/read its state, and the back-end
>>>    can decide to either use it, or reply with a different FD for the
>>>    front-end to override the front-end's choice.
>>>    The front-end creates a simple pipe to transfer the state, but maybe
>>>    the back-end already has an FD into/from which it has to write/read
>>>    its state, in which case it will want to override the simple pipe.
>>>    Conversely, maybe in the future we find a way to have the front-end
>>>    get an immediate FD for the migration stream (in some cases), in which
>>>    case we will want to send this to the back-end instead of creating a
>>>    pipe.
>>>    Hence the negotiation: If one side has a better idea than a plain
>>>    pipe, we will want to use that.
>>>
>>> - CHECK_DEVICE_STATE: After the state has been transferred through the
>>>    pipe (the end indicated by EOF), the front-end invokes this function
>>>    to verify success.  There is no in-band way (through the pipe) to
>>>    indicate failure, so we need to check explicitly.
>>>
>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
>>> (which includes establishing the direction of transfer and migration
>>> phase), the sending side writes its data into the pipe, and the reading
>>> side reads it until it sees an EOF.  Then, the front-end will check for
>>> success via CHECK_DEVICE_STATE, which on the destination side includes
>>> checking for integrity (i.e. errors during deserialization).
>>>
>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>> ---
>>>   include/hw/virtio/vhost-backend.h |  24 +++++
>>>   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
>>>   hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
>>>   hw/virtio/vhost.c                 |  37 ++++++++
>>>   4 files changed, 287 insertions(+)
>>>
>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
>>> index ec3fbae58d..5935b32fe3 100644
>>> --- a/include/hw/virtio/vhost-backend.h
>>> +++ b/include/hw/virtio/vhost-backend.h
>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
>>>       VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
>>>   } VhostSetConfigType;
>>>
>>> +typedef enum VhostDeviceStateDirection {
>>> +    /* Transfer state from back-end (device) to front-end */
>>> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
>>> +    /* Transfer state from front-end to back-end (device) */
>>> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
>>> +} VhostDeviceStateDirection;
>>> +
>>> +typedef enum VhostDeviceStatePhase {
>>> +    /* The device (and all its vrings) is stopped */
>>> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
>>> +} VhostDeviceStatePhase;
>> vDPA has:
>>
>>    /* Suspend a device so it does not process virtqueue requests anymore
>>     *
>>     * After the return of ioctl the device must preserve all the necessary state
>>     * (the virtqueue vring base plus the possible device specific states) that is
>>     * required for restoring in the future. The device must not change its
>>     * configuration after that point.
>>     */
>>    #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
>>
>>    /* Resume a device so it can resume processing virtqueue requests
>>     *
>>     * After the return of this ioctl the device will have restored all the
>>     * necessary states and it is fully operational to continue processing the
>>     * virtqueue descriptors.
>>     */
>>    #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
>>
>> I wonder if it makes sense to import these into vhost-user so that the
>> difference between kernel vhost and vhost-user is minimized. It's okay
>> if one of them is ahead of the other, but it would be nice to avoid
>> overlapping/duplicated functionality.
>>
> That's what I had in mind in the first versions. I proposed VHOST_STOP
> instead of VHOST_VDPA_STOP for this very reason. Later it did change
> to SUSPEND.
>
> Generally it is better if we make the interface less parametrized and
> we trust in the messages and its semantics in my opinion. In other
> words, instead of
> vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), send
> individually the equivalent of VHOST_VDPA_SUSPEND vhost-user command.

I.e. you mean that this should simply be stateful instead of
re-affirming the current state with a parameter?

The problem I see is that transferring states in different phases of
migration will require specialized implementations.  So running
SET_DEVICE_STATE_FD in a different phase will require support from the
back-end.  Same in the front-end, the exact protocol and thus
implementation will (probably, difficult to say at this point) depend on
the migration phase.  I would therefore prefer to have an explicit
distinction in the command itself that affirms the phase we’re
targeting.

On the other hand, I don’t see the parameter complicating anything. The
front-end must supply it, but it will know the phase anyway, so this is
easy.  The back-end can just choose to ignore it, if it doesn’t feel the
need to verify that the phase is what it thinks it is.

> Another way to apply this is with the "direction" parameter. Maybe it
> is better to split it into "set_state_fd" and "get_state_fd"?

Well, it would rather be `set_state_send_fd` and `set_state_receive_fd`.
We always negotiate a pipe between front-end and back-end, the question
is just whether the back-end gets the receiving (load) or the sending
(save) end.

Technically, one can make it fully stateful and say that if the device
hasn’t been started already, it’s always a LOAD, and otherwise always a
SAVE.  But as above, I’d prefer to keep the parameter because the
implementations are different, so I’d prefer there to be a
re-affirmation that front-end and back-end are in sync about what should
be done.

Personally, I don’t really see the advantage of having two functions
instead of one function with an enum with two values.  The thing about
SET_DEVICE_STATE_FD is that it itself won’t differ much regardless of
whether we’re loading or saving, it just negotiates the pipe – the
difference is what happens after the pipe has been negotiated.  So if we
split the function into two, both implementations will share most of
their code anyway, which makes me think it should be a single function.

> In that case, reusing the ioctls as vhost-user messages would be ok.
> But that puts this proposal further from the VFIO code, which uses
> "migration_set_state(state)", and maybe it is better when the number
> of states is high.

I’m not sure what you mean (because I don’t know the VFIO code, I
assume).  Are you saying that using a more finely grained
migration_set_state() model would conflict with the rather coarse
suspend/resume?

> BTW, is there any usage for *reply_fd at this moment from the backend?

No, virtiofsd doesn’t plan to make use of it.

Hanna
Hanna Czenczek April 13, 2023, 5:55 p.m. UTC | #9
On 13.04.23 13:38, Stefan Hajnoczi wrote:
> On Thu, 13 Apr 2023 at 05:24, Hanna Czenczek <hreitz@redhat.com> wrote:
>> On 12.04.23 23:06, Stefan Hajnoczi wrote:
>>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
>>>> So-called "internal" virtio-fs migration refers to transporting the
>>>> back-end's (virtiofsd's) state through qemu's migration stream.  To do
>>>> this, we need to be able to transfer virtiofsd's internal state to and
>>>> from virtiofsd.
>>>>
>>>> Because virtiofsd's internal state will not be too large, we believe it
>>>> is best to transfer it as a single binary blob after the streaming
>>>> phase.  Because this method should be useful to other vhost-user
>>>> implementations, too, it is introduced as a general-purpose addition to
>>>> the protocol, not limited to vhost-user-fs.
>>>>
>>>> These are the additions to the protocol:
>>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
>>>>     This feature signals support for transferring state, and is added so
>>>>     that migration can fail early when the back-end has no support.
>>>>
>>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
>>>>     over which to transfer the state.  The front-end sends an FD to the
>>>>     back-end into/from which it can write/read its state, and the back-end
>>>>     can decide to either use it, or reply with a different FD for the
>>>>     front-end to override the front-end's choice.
>>>>     The front-end creates a simple pipe to transfer the state, but maybe
>>>>     the back-end already has an FD into/from which it has to write/read
>>>>     its state, in which case it will want to override the simple pipe.
>>>>     Conversely, maybe in the future we find a way to have the front-end
>>>>     get an immediate FD for the migration stream (in some cases), in which
>>>>     case we will want to send this to the back-end instead of creating a
>>>>     pipe.
>>>>     Hence the negotiation: If one side has a better idea than a plain
>>>>     pipe, we will want to use that.
>>>>
>>>> - CHECK_DEVICE_STATE: After the state has been transferred through the
>>>>     pipe (the end indicated by EOF), the front-end invokes this function
>>>>     to verify success.  There is no in-band way (through the pipe) to
>>>>     indicate failure, so we need to check explicitly.
>>>>
>>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
>>>> (which includes establishing the direction of transfer and migration
>>>> phase), the sending side writes its data into the pipe, and the reading
>>>> side reads it until it sees an EOF.  Then, the front-end will check for
>>>> success via CHECK_DEVICE_STATE, which on the destination side includes
>>>> checking for integrity (i.e. errors during deserialization).
>>>>
>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>>> ---
>>>>    include/hw/virtio/vhost-backend.h |  24 +++++
>>>>    include/hw/virtio/vhost.h         |  79 ++++++++++++++++
>>>>    hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
>>>>    hw/virtio/vhost.c                 |  37 ++++++++
>>>>    4 files changed, 287 insertions(+)
>>>>
>>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
>>>> index ec3fbae58d..5935b32fe3 100644
>>>> --- a/include/hw/virtio/vhost-backend.h
>>>> +++ b/include/hw/virtio/vhost-backend.h
>>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
>>>>        VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
>>>>    } VhostSetConfigType;
>>>>
>>>> +typedef enum VhostDeviceStateDirection {
>>>> +    /* Transfer state from back-end (device) to front-end */
>>>> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
>>>> +    /* Transfer state from front-end to back-end (device) */
>>>> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
>>>> +} VhostDeviceStateDirection;
>>>> +
>>>> +typedef enum VhostDeviceStatePhase {
>>>> +    /* The device (and all its vrings) is stopped */
>>>> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
>>>> +} VhostDeviceStatePhase;
>>> vDPA has:
>>>
>>>     /* Suspend a device so it does not process virtqueue requests anymore
>>>      *
>>>      * After the return of ioctl the device must preserve all the necessary state
>>>      * (the virtqueue vring base plus the possible device specific states) that is
>>>      * required for restoring in the future. The device must not change its
>>>      * configuration after that point.
>>>      */
>>>     #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
>>>
>>>     /* Resume a device so it can resume processing virtqueue requests
>>>      *
>>>      * After the return of this ioctl the device will have restored all the
>>>      * necessary states and it is fully operational to continue processing the
>>>      * virtqueue descriptors.
>>>      */
>>>     #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
>>>
>>> I wonder if it makes sense to import these into vhost-user so that the
>>> difference between kernel vhost and vhost-user is minimized. It's okay
>>> if one of them is ahead of the other, but it would be nice to avoid
>>> overlapping/duplicated functionality.
>>>
>>> (And I hope vDPA will import the device state vhost-user messages
>>> introduced in this series.)
>> I don’t understand your suggestion.  (Like, I very simply don’t
>> understand :))
>>
>> These are vhost messages, right?  What purpose do you have in mind for
>> them in vhost-user for internal migration?  They’re different from the
>> state transfer messages, because they don’t transfer state to/from the
>> front-end.  Also, the state transfer stuff is supposed to be distinct
>> from starting/stopping the device; right now, it just requires the
>> device to be stopped beforehand (or started only afterwards).  And in
>> the future, new VhostDeviceStatePhase values may allow the messages to
>> be used on devices that aren’t stopped.
>>
>> So they seem to serve very different purposes.  I can imagine using the
>> VDPA_{SUSPEND,RESUME} messages for external migration (what Anton is
>> working on), but they don’t really help with internal migration
>> implemented here.  If I were to add them, they’d just be sent in
>> addition to the new messages added in this patch here, i.e. SUSPEND on
>> the source before SET_DEVICE_STATE_FD, and RESUME on the destination
>> after CHECK_DEVICE_STATE (we could use RESUME in place of
>> CHECK_DEVICE_STATE on the destination, but we can’t do that on the
>> source, so we still need CHECK_DEVICE_STATE).
> Yes, they are complementary to the device state fd message. I want to
> make sure pre-conditions about the device's state (running vs stopped)
> already take into account the vDPA SUSPEND/RESUME model.
>
> vDPA will need device state save/load in the future. For virtiofs
> devices, for example. This is why I think we should plan for vDPA and
> vhost-user to share the same interface.

While the paragraph below is more important, I don’t feel like this
would be important right now.  It’s clear that SUSPEND must come before
transferring any state, and that RESUME must come after transferring
state.  I don’t think we need to clarify this now, it’d be obvious when
implementing SUSPEND/RESUME.

> Also, I think the code path you're relying on (vhost_dev_stop()) on
> doesn't work for backends that implement VHOST_USER_PROTOCOL_F_STATUS
> because stopping the backend resets the device and throws away its
> state. SUSPEND/RESUME solve this. This looks like a more general
> problem since vhost_dev_stop() is called any time the VM is paused.
> Maybe it needs to use SUSPEND/RESUME whenever possible.

That’s a problem.  Quite a problem, to be honest, because this sounds
rather complicated with honestly absolutely no practical benefit right
now.

Would you require SUSPEND/RESUME for state transfer even if the back-end
does not implement GET/SET_STATUS?  Because then this would also lead to
more complexity in virtiofsd.

Basically, what I’m hearing is that I need to implement a different
feature that has no practical impact right now, and also fix bugs around
it along the way...

(Not that I have any better suggestion.)

Hanna
Stefan Hajnoczi April 13, 2023, 8:42 p.m. UTC | #10
On Thu, 13 Apr 2023 at 13:55, Hanna Czenczek <hreitz@redhat.com> wrote:
>
> On 13.04.23 13:38, Stefan Hajnoczi wrote:
> > On Thu, 13 Apr 2023 at 05:24, Hanna Czenczek <hreitz@redhat.com> wrote:
> >> On 12.04.23 23:06, Stefan Hajnoczi wrote:
> >>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> >>>> So-called "internal" virtio-fs migration refers to transporting the
> >>>> back-end's (virtiofsd's) state through qemu's migration stream.  To do
> >>>> this, we need to be able to transfer virtiofsd's internal state to and
> >>>> from virtiofsd.
> >>>>
> >>>> Because virtiofsd's internal state will not be too large, we believe it
> >>>> is best to transfer it as a single binary blob after the streaming
> >>>> phase.  Because this method should be useful to other vhost-user
> >>>> implementations, too, it is introduced as a general-purpose addition to
> >>>> the protocol, not limited to vhost-user-fs.
> >>>>
> >>>> These are the additions to the protocol:
> >>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> >>>>     This feature signals support for transferring state, and is added so
> >>>>     that migration can fail early when the back-end has no support.
> >>>>
> >>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> >>>>     over which to transfer the state.  The front-end sends an FD to the
> >>>>     back-end into/from which it can write/read its state, and the back-end
> >>>>     can decide to either use it, or reply with a different FD for the
> >>>>     front-end to override the front-end's choice.
> >>>>     The front-end creates a simple pipe to transfer the state, but maybe
> >>>>     the back-end already has an FD into/from which it has to write/read
> >>>>     its state, in which case it will want to override the simple pipe.
> >>>>     Conversely, maybe in the future we find a way to have the front-end
> >>>>     get an immediate FD for the migration stream (in some cases), in which
> >>>>     case we will want to send this to the back-end instead of creating a
> >>>>     pipe.
> >>>>     Hence the negotiation: If one side has a better idea than a plain
> >>>>     pipe, we will want to use that.
> >>>>
> >>>> - CHECK_DEVICE_STATE: After the state has been transferred through the
> >>>>     pipe (the end indicated by EOF), the front-end invokes this function
> >>>>     to verify success.  There is no in-band way (through the pipe) to
> >>>>     indicate failure, so we need to check explicitly.
> >>>>
> >>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> >>>> (which includes establishing the direction of transfer and migration
> >>>> phase), the sending side writes its data into the pipe, and the reading
> >>>> side reads it until it sees an EOF.  Then, the front-end will check for
> >>>> success via CHECK_DEVICE_STATE, which on the destination side includes
> >>>> checking for integrity (i.e. errors during deserialization).
> >>>>
> >>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> >>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> >>>> ---
> >>>>    include/hw/virtio/vhost-backend.h |  24 +++++
> >>>>    include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> >>>>    hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> >>>>    hw/virtio/vhost.c                 |  37 ++++++++
> >>>>    4 files changed, 287 insertions(+)
> >>>>
> >>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> >>>> index ec3fbae58d..5935b32fe3 100644
> >>>> --- a/include/hw/virtio/vhost-backend.h
> >>>> +++ b/include/hw/virtio/vhost-backend.h
> >>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> >>>>        VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> >>>>    } VhostSetConfigType;
> >>>>
> >>>> +typedef enum VhostDeviceStateDirection {
> >>>> +    /* Transfer state from back-end (device) to front-end */
> >>>> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> >>>> +    /* Transfer state from front-end to back-end (device) */
> >>>> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> >>>> +} VhostDeviceStateDirection;
> >>>> +
> >>>> +typedef enum VhostDeviceStatePhase {
> >>>> +    /* The device (and all its vrings) is stopped */
> >>>> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> >>>> +} VhostDeviceStatePhase;
> >>> vDPA has:
> >>>
> >>>     /* Suspend a device so it does not process virtqueue requests anymore
> >>>      *
> >>>      * After the return of ioctl the device must preserve all the necessary state
> >>>      * (the virtqueue vring base plus the possible device specific states) that is
> >>>      * required for restoring in the future. The device must not change its
> >>>      * configuration after that point.
> >>>      */
> >>>     #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> >>>
> >>>     /* Resume a device so it can resume processing virtqueue requests
> >>>      *
> >>>      * After the return of this ioctl the device will have restored all the
> >>>      * necessary states and it is fully operational to continue processing the
> >>>      * virtqueue descriptors.
> >>>      */
> >>>     #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> >>>
> >>> I wonder if it makes sense to import these into vhost-user so that the
> >>> difference between kernel vhost and vhost-user is minimized. It's okay
> >>> if one of them is ahead of the other, but it would be nice to avoid
> >>> overlapping/duplicated functionality.
> >>>
> >>> (And I hope vDPA will import the device state vhost-user messages
> >>> introduced in this series.)
> >> I don’t understand your suggestion.  (Like, I very simply don’t
> >> understand :))
> >>
> >> These are vhost messages, right?  What purpose do you have in mind for
> >> them in vhost-user for internal migration?  They’re different from the
> >> state transfer messages, because they don’t transfer state to/from the
> >> front-end.  Also, the state transfer stuff is supposed to be distinct
> >> from starting/stopping the device; right now, it just requires the
> >> device to be stopped beforehand (or started only afterwards).  And in
> >> the future, new VhostDeviceStatePhase values may allow the messages to
> >> be used on devices that aren’t stopped.
> >>
> >> So they seem to serve very different purposes.  I can imagine using the
> >> VDPA_{SUSPEND,RESUME} messages for external migration (what Anton is
> >> working on), but they don’t really help with internal migration
> >> implemented here.  If I were to add them, they’d just be sent in
> >> addition to the new messages added in this patch here, i.e. SUSPEND on
> >> the source before SET_DEVICE_STATE_FD, and RESUME on the destination
> >> after CHECK_DEVICE_STATE (we could use RESUME in place of
> >> CHECK_DEVICE_STATE on the destination, but we can’t do that on the
> >> source, so we still need CHECK_DEVICE_STATE).
> > Yes, they are complementary to the device state fd message. I want to
> > make sure pre-conditions about the device's state (running vs stopped)
> > already take into account the vDPA SUSPEND/RESUME model.
> >
> > vDPA will need device state save/load in the future. For virtiofs
> > devices, for example. This is why I think we should plan for vDPA and
> > vhost-user to share the same interface.
>
> While the paragraph below is more important, I don’t feel like this
> would be important right now.  It’s clear that SUSPEND must come before
> transferring any state, and that RESUME must come after transferring
> state.  I don’t think we need to clarify this now, it’d be obvious when
> implementing SUSPEND/RESUME.
>
> > Also, I think the code path you're relying on (vhost_dev_stop()) on
> > doesn't work for backends that implement VHOST_USER_PROTOCOL_F_STATUS
> > because stopping the backend resets the device and throws away its
> > state. SUSPEND/RESUME solve this. This looks like a more general
> > problem since vhost_dev_stop() is called any time the VM is paused.
> > Maybe it needs to use SUSPEND/RESUME whenever possible.
>
> That’s a problem.  Quite a problem, to be honest, because this sounds
> rather complicated with honestly absolutely no practical benefit right
> now.
>
> Would you require SUSPEND/RESUME for state transfer even if the back-end
> does not implement GET/SET_STATUS?  Because then this would also lead to
> more complexity in virtiofsd.
>
> Basically, what I’m hearing is that I need to implement a different
> feature that has no practical impact right now, and also fix bugs around
> it along the way...

Eugenio's input regarding the design of the vhost-user messages is
important. That way we know it can be ported to vDPA later.

There is some extra discussion and work here, but only on the design
of the interface. You shouldn't need to implement extra unused stuff.
Whoever needs it can do that later based on a design that left room to
eventually do iterative migration for vhost-user and vDPA (comparable
to VFIO's migration interface).

Since both vDPA (vhost kernel) and vhost-user are stable APIs, it will
be hard to make significant design changes later without breaking all
existing implementations. That's why I think we should think ahead.

Stefan
Eugenio Perez Martin April 14, 2023, 3:17 p.m. UTC | #11
On Thu, Apr 13, 2023 at 7:55 PM Hanna Czenczek <hreitz@redhat.com> wrote:
>
> On 13.04.23 13:38, Stefan Hajnoczi wrote:
> > On Thu, 13 Apr 2023 at 05:24, Hanna Czenczek <hreitz@redhat.com> wrote:
> >> On 12.04.23 23:06, Stefan Hajnoczi wrote:
> >>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> >>>> So-called "internal" virtio-fs migration refers to transporting the
> >>>> back-end's (virtiofsd's) state through qemu's migration stream.  To do
> >>>> this, we need to be able to transfer virtiofsd's internal state to and
> >>>> from virtiofsd.
> >>>>
> >>>> Because virtiofsd's internal state will not be too large, we believe it
> >>>> is best to transfer it as a single binary blob after the streaming
> >>>> phase.  Because this method should be useful to other vhost-user
> >>>> implementations, too, it is introduced as a general-purpose addition to
> >>>> the protocol, not limited to vhost-user-fs.
> >>>>
> >>>> These are the additions to the protocol:
> >>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> >>>>     This feature signals support for transferring state, and is added so
> >>>>     that migration can fail early when the back-end has no support.
> >>>>
> >>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> >>>>     over which to transfer the state.  The front-end sends an FD to the
> >>>>     back-end into/from which it can write/read its state, and the back-end
> >>>>     can decide to either use it, or reply with a different FD for the
> >>>>     front-end to override the front-end's choice.
> >>>>     The front-end creates a simple pipe to transfer the state, but maybe
> >>>>     the back-end already has an FD into/from which it has to write/read
> >>>>     its state, in which case it will want to override the simple pipe.
> >>>>     Conversely, maybe in the future we find a way to have the front-end
> >>>>     get an immediate FD for the migration stream (in some cases), in which
> >>>>     case we will want to send this to the back-end instead of creating a
> >>>>     pipe.
> >>>>     Hence the negotiation: If one side has a better idea than a plain
> >>>>     pipe, we will want to use that.
> >>>>
> >>>> - CHECK_DEVICE_STATE: After the state has been transferred through the
> >>>>     pipe (the end indicated by EOF), the front-end invokes this function
> >>>>     to verify success.  There is no in-band way (through the pipe) to
> >>>>     indicate failure, so we need to check explicitly.
> >>>>
> >>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> >>>> (which includes establishing the direction of transfer and migration
> >>>> phase), the sending side writes its data into the pipe, and the reading
> >>>> side reads it until it sees an EOF.  Then, the front-end will check for
> >>>> success via CHECK_DEVICE_STATE, which on the destination side includes
> >>>> checking for integrity (i.e. errors during deserialization).
> >>>>
> >>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> >>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> >>>> ---
> >>>>    include/hw/virtio/vhost-backend.h |  24 +++++
> >>>>    include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> >>>>    hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> >>>>    hw/virtio/vhost.c                 |  37 ++++++++
> >>>>    4 files changed, 287 insertions(+)
> >>>>
> >>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> >>>> index ec3fbae58d..5935b32fe3 100644
> >>>> --- a/include/hw/virtio/vhost-backend.h
> >>>> +++ b/include/hw/virtio/vhost-backend.h
> >>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> >>>>        VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> >>>>    } VhostSetConfigType;
> >>>>
> >>>> +typedef enum VhostDeviceStateDirection {
> >>>> +    /* Transfer state from back-end (device) to front-end */
> >>>> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> >>>> +    /* Transfer state from front-end to back-end (device) */
> >>>> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> >>>> +} VhostDeviceStateDirection;
> >>>> +
> >>>> +typedef enum VhostDeviceStatePhase {
> >>>> +    /* The device (and all its vrings) is stopped */
> >>>> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> >>>> +} VhostDeviceStatePhase;
> >>> vDPA has:
> >>>
> >>>     /* Suspend a device so it does not process virtqueue requests anymore
> >>>      *
> >>>      * After the return of ioctl the device must preserve all the necessary state
> >>>      * (the virtqueue vring base plus the possible device specific states) that is
> >>>      * required for restoring in the future. The device must not change its
> >>>      * configuration after that point.
> >>>      */
> >>>     #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> >>>
> >>>     /* Resume a device so it can resume processing virtqueue requests
> >>>      *
> >>>      * After the return of this ioctl the device will have restored all the
> >>>      * necessary states and it is fully operational to continue processing the
> >>>      * virtqueue descriptors.
> >>>      */
> >>>     #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> >>>
> >>> I wonder if it makes sense to import these into vhost-user so that the
> >>> difference between kernel vhost and vhost-user is minimized. It's okay
> >>> if one of them is ahead of the other, but it would be nice to avoid
> >>> overlapping/duplicated functionality.
> >>>
> >>> (And I hope vDPA will import the device state vhost-user messages
> >>> introduced in this series.)
> >> I don’t understand your suggestion.  (Like, I very simply don’t
> >> understand :))
> >>
> >> These are vhost messages, right?  What purpose do you have in mind for
> >> them in vhost-user for internal migration?  They’re different from the
> >> state transfer messages, because they don’t transfer state to/from the
> >> front-end.  Also, the state transfer stuff is supposed to be distinct
> >> from starting/stopping the device; right now, it just requires the
> >> device to be stopped beforehand (or started only afterwards).  And in
> >> the future, new VhostDeviceStatePhase values may allow the messages to
> >> be used on devices that aren’t stopped.
> >>
> >> So they seem to serve very different purposes.  I can imagine using the
> >> VDPA_{SUSPEND,RESUME} messages for external migration (what Anton is
> >> working on), but they don’t really help with internal migration
> >> implemented here.  If I were to add them, they’d just be sent in
> >> addition to the new messages added in this patch here, i.e. SUSPEND on
> >> the source before SET_DEVICE_STATE_FD, and RESUME on the destination
> >> after CHECK_DEVICE_STATE (we could use RESUME in place of
> >> CHECK_DEVICE_STATE on the destination, but we can’t do that on the
> >> source, so we still need CHECK_DEVICE_STATE).
> > Yes, they are complementary to the device state fd message. I want to
> > make sure pre-conditions about the device's state (running vs stopped)
> > already take into account the vDPA SUSPEND/RESUME model.
> >
> > vDPA will need device state save/load in the future. For virtiofs
> > devices, for example. This is why I think we should plan for vDPA and
> > vhost-user to share the same interface.
>
> While the paragraph below is more important, I don’t feel like this
> would be important right now.  It’s clear that SUSPEND must come before
> transferring any state, and that RESUME must come after transferring
> state.  I don’t think we need to clarify this now, it’d be obvious when
> implementing SUSPEND/RESUME.
>
> > Also, I think the code path you're relying on (vhost_dev_stop()) on
> > doesn't work for backends that implement VHOST_USER_PROTOCOL_F_STATUS
> > because stopping the backend resets the device and throws away its
> > state. SUSPEND/RESUME solve this. This looks like a more general
> > problem since vhost_dev_stop() is called any time the VM is paused.
> > Maybe it needs to use SUSPEND/RESUME whenever possible.
>
> That’s a problem.  Quite a problem, to be honest, because this sounds
> rather complicated with honestly absolutely no practical benefit right
> now.
>
> Would you require SUSPEND/RESUME for state transfer even if the back-end
> does not implement GET/SET_STATUS?  Because then this would also lead to
> more complexity in virtiofsd.
>

At this moment the vhost-user net in DPDK suspends at
VHOST_GET_VRING_BASE. Not the same case though, as here only the vq
indexes / wrap bits are transferred here.

Vhost-vdpa implements the suspend call so it does not need to trust
VHOST_GET_VRING_BASE to be the last vring call done. Since virtiofsd
is using vhost-user maybe it is not needed to implement it actually.

> Basically, what I’m hearing is that I need to implement a different
> feature that has no practical impact right now, and also fix bugs around
> it along the way...
>

To fix this properly requires iterative device migration in qemu as
far as I know, instead of using VMStates [1]. This way the state is
requested to virtiofsd before the device reset.

What does virtiofsd do when the state is totally sent? Does it keep
processing requests and generating new state or is only a one shot
that will suspend the daemon? If it is the second I think it still can
be done in one shot at the end, always indicating "no more state" at
save_live_pending and sending all the state at
save_live_complete_precopy.

Does that make sense to you?

Thanks!

[1] https://qemu.readthedocs.io/en/latest/devel/migration.html#iterative-device-migration
Stefan Hajnoczi April 17, 2023, 3:12 p.m. UTC | #12
On Thu, Apr 13, 2023 at 07:31:57PM +0200, Hanna Czenczek wrote:
> On 13.04.23 12:14, Eugenio Perez Martin wrote:
> > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > So-called "internal" virtio-fs migration refers to transporting the
> > > > back-end's (virtiofsd's) state through qemu's migration stream.  To do
> > > > this, we need to be able to transfer virtiofsd's internal state to and
> > > > from virtiofsd.
> > > > 
> > > > Because virtiofsd's internal state will not be too large, we believe it
> > > > is best to transfer it as a single binary blob after the streaming
> > > > phase.  Because this method should be useful to other vhost-user
> > > > implementations, too, it is introduced as a general-purpose addition to
> > > > the protocol, not limited to vhost-user-fs.
> > > > 
> > > > These are the additions to the protocol:
> > > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > >    This feature signals support for transferring state, and is added so
> > > >    that migration can fail early when the back-end has no support.
> > > > 
> > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> > > >    over which to transfer the state.  The front-end sends an FD to the
> > > >    back-end into/from which it can write/read its state, and the back-end
> > > >    can decide to either use it, or reply with a different FD for the
> > > >    front-end to override the front-end's choice.
> > > >    The front-end creates a simple pipe to transfer the state, but maybe
> > > >    the back-end already has an FD into/from which it has to write/read
> > > >    its state, in which case it will want to override the simple pipe.
> > > >    Conversely, maybe in the future we find a way to have the front-end
> > > >    get an immediate FD for the migration stream (in some cases), in which
> > > >    case we will want to send this to the back-end instead of creating a
> > > >    pipe.
> > > >    Hence the negotiation: If one side has a better idea than a plain
> > > >    pipe, we will want to use that.
> > > > 
> > > > - CHECK_DEVICE_STATE: After the state has been transferred through the
> > > >    pipe (the end indicated by EOF), the front-end invokes this function
> > > >    to verify success.  There is no in-band way (through the pipe) to
> > > >    indicate failure, so we need to check explicitly.
> > > > 
> > > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> > > > (which includes establishing the direction of transfer and migration
> > > > phase), the sending side writes its data into the pipe, and the reading
> > > > side reads it until it sees an EOF.  Then, the front-end will check for
> > > > success via CHECK_DEVICE_STATE, which on the destination side includes
> > > > checking for integrity (i.e. errors during deserialization).
> > > > 
> > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > ---
> > > >   include/hw/virtio/vhost-backend.h |  24 +++++
> > > >   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > >   hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> > > >   hw/virtio/vhost.c                 |  37 ++++++++
> > > >   4 files changed, 287 insertions(+)
> > > > 
> > > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > > > index ec3fbae58d..5935b32fe3 100644
> > > > --- a/include/hw/virtio/vhost-backend.h
> > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > >       VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > >   } VhostSetConfigType;
> > > > 
> > > > +typedef enum VhostDeviceStateDirection {
> > > > +    /* Transfer state from back-end (device) to front-end */
> > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > +    /* Transfer state from front-end to back-end (device) */
> > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > +} VhostDeviceStateDirection;
> > > > +
> > > > +typedef enum VhostDeviceStatePhase {
> > > > +    /* The device (and all its vrings) is stopped */
> > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > +} VhostDeviceStatePhase;
> > > vDPA has:
> > > 
> > >    /* Suspend a device so it does not process virtqueue requests anymore
> > >     *
> > >     * After the return of ioctl the device must preserve all the necessary state
> > >     * (the virtqueue vring base plus the possible device specific states) that is
> > >     * required for restoring in the future. The device must not change its
> > >     * configuration after that point.
> > >     */
> > >    #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > 
> > >    /* Resume a device so it can resume processing virtqueue requests
> > >     *
> > >     * After the return of this ioctl the device will have restored all the
> > >     * necessary states and it is fully operational to continue processing the
> > >     * virtqueue descriptors.
> > >     */
> > >    #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > 
> > > I wonder if it makes sense to import these into vhost-user so that the
> > > difference between kernel vhost and vhost-user is minimized. It's okay
> > > if one of them is ahead of the other, but it would be nice to avoid
> > > overlapping/duplicated functionality.
> > > 
> > That's what I had in mind in the first versions. I proposed VHOST_STOP
> > instead of VHOST_VDPA_STOP for this very reason. Later it did change
> > to SUSPEND.
> > 
> > Generally it is better if we make the interface less parametrized and
> > we trust in the messages and its semantics in my opinion. In other
> > words, instead of
> > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), send
> > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user command.
> 
> I.e. you mean that this should simply be stateful instead of
> re-affirming the current state with a parameter?
> 
> The problem I see is that transferring states in different phases of
> migration will require specialized implementations.  So running
> SET_DEVICE_STATE_FD in a different phase will require support from the
> back-end.  Same in the front-end, the exact protocol and thus
> implementation will (probably, difficult to say at this point) depend on
> the migration phase.  I would therefore prefer to have an explicit
> distinction in the command itself that affirms the phase we’re
> targeting.
> 
> On the other hand, I don’t see the parameter complicating anything. The
> front-end must supply it, but it will know the phase anyway, so this is
> easy.  The back-end can just choose to ignore it, if it doesn’t feel the
> need to verify that the phase is what it thinks it is.
> 
> > Another way to apply this is with the "direction" parameter. Maybe it
> > is better to split it into "set_state_fd" and "get_state_fd"?
> 
> Well, it would rather be `set_state_send_fd` and `set_state_receive_fd`.
> We always negotiate a pipe between front-end and back-end, the question
> is just whether the back-end gets the receiving (load) or the sending
> (save) end.
> 
> Technically, one can make it fully stateful and say that if the device
> hasn’t been started already, it’s always a LOAD, and otherwise always a
> SAVE.  But as above, I’d prefer to keep the parameter because the
> implementations are different, so I’d prefer there to be a
> re-affirmation that front-end and back-end are in sync about what should
> be done.
> 
> Personally, I don’t really see the advantage of having two functions
> instead of one function with an enum with two values.  The thing about
> SET_DEVICE_STATE_FD is that it itself won’t differ much regardless of
> whether we’re loading or saving, it just negotiates the pipe – the
> difference is what happens after the pipe has been negotiated.  So if we
> split the function into two, both implementations will share most of
> their code anyway, which makes me think it should be a single function.

I also don't really see an advantage to defining separate messages as
long as SET_DEVICE_STATE_FD just sets up the pipe. If there are other
arguments that differ depending on the state/direction, then it's nicer
to have separate messages so that argument type remains simple (not a
union).

This brings to mind how iterative migration will work. The interface for
iterative migration is basically the same as non-iterative migration
plus a method to query the number of bytes remaining. When the number of
bytes falls below a threshold, the vCPUs are stopped and the remainder
of the data is read.

Some details from VFIO migration:
- The VMM must explicitly change the state when transitioning from
  iterative and non-iterative migration, but the data transfer fd
  remains the same.
- The state of the device (running, stopped, resuming, etc) doesn't
  change asynchronously, it's always driven by the VMM. However, setting
  the state can fail and then the new state may be an error state.

Mapping this to SET_DEVICE_STATE_FD:
- VhostDeviceStatePhase is extended with
  VHOST_TRANSFER_STATE_PHASE_RUNNING = 1 for iterative migration. The
  frontend sends SET_DEVICE_STATE_FD again with
  VHOST_TRANSFER_STATE_PHASE_STOPPED when entering non-iterative
  migration and the frontend sends the iterative fd from the previous
  SET_DEVICE_STATE_FD call to the backend. The backend may reply with
  another fd, if necessary. If the backend changes the fd, then the
  contents of the previous fd must be fully read and transferred before
  the contents of the new fd are migrated. (Maybe this is too complex
  and we should forbid changing the fd when going from RUNNING ->
  STOPPED.)
- CHECK_DEVICE_STATE can be extended to report the number of bytes
  remaining. The semantics change so that CHECK_DEVICE_STATE can be
  called while the VMM is still reading from the fd. It becomes:

    enum CheckDeviceStateResult {
        Saving(bytes_remaining : usize),
	Failed(error_code : u64),
    }

> > In that case, reusing the ioctls as vhost-user messages would be ok.
> > But that puts this proposal further from the VFIO code, which uses
> > "migration_set_state(state)", and maybe it is better when the number
> > of states is high.
> 
> I’m not sure what you mean (because I don’t know the VFIO code, I
> assume).  Are you saying that using a more finely grained
> migration_set_state() model would conflict with the rather coarse
> suspend/resume?

I think VFIO is already different because vDPA has SUSPEND/RESUME,
whereas VFIO controls the state via VFIO_DEVICE_FEATURE
VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE (which is similar but not identical
to SET_DEVICE_STATE_FD in this patch series).

Stefan
Stefan Hajnoczi April 17, 2023, 3:18 p.m. UTC | #13
On Fri, Apr 14, 2023 at 05:17:02PM +0200, Eugenio Perez Martin wrote:
> On Thu, Apr 13, 2023 at 7:55 PM Hanna Czenczek <hreitz@redhat.com> wrote:
> >
> > On 13.04.23 13:38, Stefan Hajnoczi wrote:
> > > On Thu, 13 Apr 2023 at 05:24, Hanna Czenczek <hreitz@redhat.com> wrote:
> > >> On 12.04.23 23:06, Stefan Hajnoczi wrote:
> > >>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > >>>> So-called "internal" virtio-fs migration refers to transporting the
> > >>>> back-end's (virtiofsd's) state through qemu's migration stream.  To do
> > >>>> this, we need to be able to transfer virtiofsd's internal state to and
> > >>>> from virtiofsd.
> > >>>>
> > >>>> Because virtiofsd's internal state will not be too large, we believe it
> > >>>> is best to transfer it as a single binary blob after the streaming
> > >>>> phase.  Because this method should be useful to other vhost-user
> > >>>> implementations, too, it is introduced as a general-purpose addition to
> > >>>> the protocol, not limited to vhost-user-fs.
> > >>>>
> > >>>> These are the additions to the protocol:
> > >>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > >>>>     This feature signals support for transferring state, and is added so
> > >>>>     that migration can fail early when the back-end has no support.
> > >>>>
> > >>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> > >>>>     over which to transfer the state.  The front-end sends an FD to the
> > >>>>     back-end into/from which it can write/read its state, and the back-end
> > >>>>     can decide to either use it, or reply with a different FD for the
> > >>>>     front-end to override the front-end's choice.
> > >>>>     The front-end creates a simple pipe to transfer the state, but maybe
> > >>>>     the back-end already has an FD into/from which it has to write/read
> > >>>>     its state, in which case it will want to override the simple pipe.
> > >>>>     Conversely, maybe in the future we find a way to have the front-end
> > >>>>     get an immediate FD for the migration stream (in some cases), in which
> > >>>>     case we will want to send this to the back-end instead of creating a
> > >>>>     pipe.
> > >>>>     Hence the negotiation: If one side has a better idea than a plain
> > >>>>     pipe, we will want to use that.
> > >>>>
> > >>>> - CHECK_DEVICE_STATE: After the state has been transferred through the
> > >>>>     pipe (the end indicated by EOF), the front-end invokes this function
> > >>>>     to verify success.  There is no in-band way (through the pipe) to
> > >>>>     indicate failure, so we need to check explicitly.
> > >>>>
> > >>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> > >>>> (which includes establishing the direction of transfer and migration
> > >>>> phase), the sending side writes its data into the pipe, and the reading
> > >>>> side reads it until it sees an EOF.  Then, the front-end will check for
> > >>>> success via CHECK_DEVICE_STATE, which on the destination side includes
> > >>>> checking for integrity (i.e. errors during deserialization).
> > >>>>
> > >>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > >>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > >>>> ---
> > >>>>    include/hw/virtio/vhost-backend.h |  24 +++++
> > >>>>    include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > >>>>    hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> > >>>>    hw/virtio/vhost.c                 |  37 ++++++++
> > >>>>    4 files changed, 287 insertions(+)
> > >>>>
> > >>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > >>>> index ec3fbae58d..5935b32fe3 100644
> > >>>> --- a/include/hw/virtio/vhost-backend.h
> > >>>> +++ b/include/hw/virtio/vhost-backend.h
> > >>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > >>>>        VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > >>>>    } VhostSetConfigType;
> > >>>>
> > >>>> +typedef enum VhostDeviceStateDirection {
> > >>>> +    /* Transfer state from back-end (device) to front-end */
> > >>>> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > >>>> +    /* Transfer state from front-end to back-end (device) */
> > >>>> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > >>>> +} VhostDeviceStateDirection;
> > >>>> +
> > >>>> +typedef enum VhostDeviceStatePhase {
> > >>>> +    /* The device (and all its vrings) is stopped */
> > >>>> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > >>>> +} VhostDeviceStatePhase;
> > >>> vDPA has:
> > >>>
> > >>>     /* Suspend a device so it does not process virtqueue requests anymore
> > >>>      *
> > >>>      * After the return of ioctl the device must preserve all the necessary state
> > >>>      * (the virtqueue vring base plus the possible device specific states) that is
> > >>>      * required for restoring in the future. The device must not change its
> > >>>      * configuration after that point.
> > >>>      */
> > >>>     #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > >>>
> > >>>     /* Resume a device so it can resume processing virtqueue requests
> > >>>      *
> > >>>      * After the return of this ioctl the device will have restored all the
> > >>>      * necessary states and it is fully operational to continue processing the
> > >>>      * virtqueue descriptors.
> > >>>      */
> > >>>     #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > >>>
> > >>> I wonder if it makes sense to import these into vhost-user so that the
> > >>> difference between kernel vhost and vhost-user is minimized. It's okay
> > >>> if one of them is ahead of the other, but it would be nice to avoid
> > >>> overlapping/duplicated functionality.
> > >>>
> > >>> (And I hope vDPA will import the device state vhost-user messages
> > >>> introduced in this series.)
> > >> I don’t understand your suggestion.  (Like, I very simply don’t
> > >> understand :))
> > >>
> > >> These are vhost messages, right?  What purpose do you have in mind for
> > >> them in vhost-user for internal migration?  They’re different from the
> > >> state transfer messages, because they don’t transfer state to/from the
> > >> front-end.  Also, the state transfer stuff is supposed to be distinct
> > >> from starting/stopping the device; right now, it just requires the
> > >> device to be stopped beforehand (or started only afterwards).  And in
> > >> the future, new VhostDeviceStatePhase values may allow the messages to
> > >> be used on devices that aren’t stopped.
> > >>
> > >> So they seem to serve very different purposes.  I can imagine using the
> > >> VDPA_{SUSPEND,RESUME} messages for external migration (what Anton is
> > >> working on), but they don’t really help with internal migration
> > >> implemented here.  If I were to add them, they’d just be sent in
> > >> addition to the new messages added in this patch here, i.e. SUSPEND on
> > >> the source before SET_DEVICE_STATE_FD, and RESUME on the destination
> > >> after CHECK_DEVICE_STATE (we could use RESUME in place of
> > >> CHECK_DEVICE_STATE on the destination, but we can’t do that on the
> > >> source, so we still need CHECK_DEVICE_STATE).
> > > Yes, they are complementary to the device state fd message. I want to
> > > make sure pre-conditions about the device's state (running vs stopped)
> > > already take into account the vDPA SUSPEND/RESUME model.
> > >
> > > vDPA will need device state save/load in the future. For virtiofs
> > > devices, for example. This is why I think we should plan for vDPA and
> > > vhost-user to share the same interface.
> >
> > While the paragraph below is more important, I don’t feel like this
> > would be important right now.  It’s clear that SUSPEND must come before
> > transferring any state, and that RESUME must come after transferring
> > state.  I don’t think we need to clarify this now, it’d be obvious when
> > implementing SUSPEND/RESUME.
> >
> > > Also, I think the code path you're relying on (vhost_dev_stop()) on
> > > doesn't work for backends that implement VHOST_USER_PROTOCOL_F_STATUS
> > > because stopping the backend resets the device and throws away its
> > > state. SUSPEND/RESUME solve this. This looks like a more general
> > > problem since vhost_dev_stop() is called any time the VM is paused.
> > > Maybe it needs to use SUSPEND/RESUME whenever possible.
> >
> > That’s a problem.  Quite a problem, to be honest, because this sounds
> > rather complicated with honestly absolutely no practical benefit right
> > now.
> >
> > Would you require SUSPEND/RESUME for state transfer even if the back-end
> > does not implement GET/SET_STATUS?  Because then this would also lead to
> > more complexity in virtiofsd.
> >
> 
> At this moment the vhost-user net in DPDK suspends at
> VHOST_GET_VRING_BASE. Not the same case though, as here only the vq
> indexes / wrap bits are transferred here.
> 
> Vhost-vdpa implements the suspend call so it does not need to trust
> VHOST_GET_VRING_BASE to be the last vring call done. Since virtiofsd
> is using vhost-user maybe it is not needed to implement it actually.

Careful, if we deliberately make vhost-user and vDPA diverge, then it
will be hard to share the migration interface.

> > Basically, what I’m hearing is that I need to implement a different
> > feature that has no practical impact right now, and also fix bugs around
> > it along the way...
> >
> 
> To fix this properly requires iterative device migration in qemu as
> far as I know, instead of using VMStates [1]. This way the state is
> requested to virtiofsd before the device reset.

I don't follow. Many devices are fine with non-iterative migration. They
shouldn't be forced to do iterative migration.

> What does virtiofsd do when the state is totally sent? Does it keep
> processing requests and generating new state or is only a one shot
> that will suspend the daemon? If it is the second I think it still can
> be done in one shot at the end, always indicating "no more state" at
> save_live_pending and sending all the state at
> save_live_complete_precopy.
> 
> Does that make sense to you?
> 
> Thanks!
> 
> [1] https://qemu.readthedocs.io/en/latest/devel/migration.html#iterative-device-migration
>
Stefan Hajnoczi April 17, 2023, 3:38 p.m. UTC | #14
On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
> On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > So-called "internal" virtio-fs migration refers to transporting the
> > > back-end's (virtiofsd's) state through qemu's migration stream.  To do
> > > this, we need to be able to transfer virtiofsd's internal state to and
> > > from virtiofsd.
> > >
> > > Because virtiofsd's internal state will not be too large, we believe it
> > > is best to transfer it as a single binary blob after the streaming
> > > phase.  Because this method should be useful to other vhost-user
> > > implementations, too, it is introduced as a general-purpose addition to
> > > the protocol, not limited to vhost-user-fs.
> > >
> > > These are the additions to the protocol:
> > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > >   This feature signals support for transferring state, and is added so
> > >   that migration can fail early when the back-end has no support.
> > >
> > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> > >   over which to transfer the state.  The front-end sends an FD to the
> > >   back-end into/from which it can write/read its state, and the back-end
> > >   can decide to either use it, or reply with a different FD for the
> > >   front-end to override the front-end's choice.
> > >   The front-end creates a simple pipe to transfer the state, but maybe
> > >   the back-end already has an FD into/from which it has to write/read
> > >   its state, in which case it will want to override the simple pipe.
> > >   Conversely, maybe in the future we find a way to have the front-end
> > >   get an immediate FD for the migration stream (in some cases), in which
> > >   case we will want to send this to the back-end instead of creating a
> > >   pipe.
> > >   Hence the negotiation: If one side has a better idea than a plain
> > >   pipe, we will want to use that.
> > >
> > > - CHECK_DEVICE_STATE: After the state has been transferred through the
> > >   pipe (the end indicated by EOF), the front-end invokes this function
> > >   to verify success.  There is no in-band way (through the pipe) to
> > >   indicate failure, so we need to check explicitly.
> > >
> > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> > > (which includes establishing the direction of transfer and migration
> > > phase), the sending side writes its data into the pipe, and the reading
> > > side reads it until it sees an EOF.  Then, the front-end will check for
> > > success via CHECK_DEVICE_STATE, which on the destination side includes
> > > checking for integrity (i.e. errors during deserialization).
> > >
> > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > ---
> > >  include/hw/virtio/vhost-backend.h |  24 +++++
> > >  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > >  hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> > >  hw/virtio/vhost.c                 |  37 ++++++++
> > >  4 files changed, 287 insertions(+)
> > >
> > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > > index ec3fbae58d..5935b32fe3 100644
> > > --- a/include/hw/virtio/vhost-backend.h
> > > +++ b/include/hw/virtio/vhost-backend.h
> > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > >      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > >  } VhostSetConfigType;
> > >
> > > +typedef enum VhostDeviceStateDirection {
> > > +    /* Transfer state from back-end (device) to front-end */
> > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > +    /* Transfer state from front-end to back-end (device) */
> > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > +} VhostDeviceStateDirection;
> > > +
> > > +typedef enum VhostDeviceStatePhase {
> > > +    /* The device (and all its vrings) is stopped */
> > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > +} VhostDeviceStatePhase;
> >
> > vDPA has:
> >
> >   /* Suspend a device so it does not process virtqueue requests anymore
> >    *
> >    * After the return of ioctl the device must preserve all the necessary state
> >    * (the virtqueue vring base plus the possible device specific states) that is
> >    * required for restoring in the future. The device must not change its
> >    * configuration after that point.
> >    */
> >   #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> >
> >   /* Resume a device so it can resume processing virtqueue requests
> >    *
> >    * After the return of this ioctl the device will have restored all the
> >    * necessary states and it is fully operational to continue processing the
> >    * virtqueue descriptors.
> >    */
> >   #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> >
> > I wonder if it makes sense to import these into vhost-user so that the
> > difference between kernel vhost and vhost-user is minimized. It's okay
> > if one of them is ahead of the other, but it would be nice to avoid
> > overlapping/duplicated functionality.
> >
> 
> That's what I had in mind in the first versions. I proposed VHOST_STOP
> instead of VHOST_VDPA_STOP for this very reason. Later it did change
> to SUSPEND.

I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not
ioctl(VHOST_VDPA_RESUME).

The doc comments in <linux/vdpa.h> don't explain how the device can
leave the suspended state. Can you clarify this?

Stefan
Stefan Hajnoczi April 17, 2023, 5:14 p.m. UTC | #15
On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
> On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > So-called "internal" virtio-fs migration refers to transporting the
> > > back-end's (virtiofsd's) state through qemu's migration stream.  To do
> > > this, we need to be able to transfer virtiofsd's internal state to and
> > > from virtiofsd.
> > >
> > > Because virtiofsd's internal state will not be too large, we believe it
> > > is best to transfer it as a single binary blob after the streaming
> > > phase.  Because this method should be useful to other vhost-user
> > > implementations, too, it is introduced as a general-purpose addition to
> > > the protocol, not limited to vhost-user-fs.
> > >
> > > These are the additions to the protocol:
> > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > >   This feature signals support for transferring state, and is added so
> > >   that migration can fail early when the back-end has no support.
> > >
> > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> > >   over which to transfer the state.  The front-end sends an FD to the
> > >   back-end into/from which it can write/read its state, and the back-end
> > >   can decide to either use it, or reply with a different FD for the
> > >   front-end to override the front-end's choice.
> > >   The front-end creates a simple pipe to transfer the state, but maybe
> > >   the back-end already has an FD into/from which it has to write/read
> > >   its state, in which case it will want to override the simple pipe.
> > >   Conversely, maybe in the future we find a way to have the front-end
> > >   get an immediate FD for the migration stream (in some cases), in which
> > >   case we will want to send this to the back-end instead of creating a
> > >   pipe.
> > >   Hence the negotiation: If one side has a better idea than a plain
> > >   pipe, we will want to use that.
> > >
> > > - CHECK_DEVICE_STATE: After the state has been transferred through the
> > >   pipe (the end indicated by EOF), the front-end invokes this function
> > >   to verify success.  There is no in-band way (through the pipe) to
> > >   indicate failure, so we need to check explicitly.
> > >
> > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> > > (which includes establishing the direction of transfer and migration
> > > phase), the sending side writes its data into the pipe, and the reading
> > > side reads it until it sees an EOF.  Then, the front-end will check for
> > > success via CHECK_DEVICE_STATE, which on the destination side includes
> > > checking for integrity (i.e. errors during deserialization).
> > >
> > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > ---
> > >  include/hw/virtio/vhost-backend.h |  24 +++++
> > >  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > >  hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> > >  hw/virtio/vhost.c                 |  37 ++++++++
> > >  4 files changed, 287 insertions(+)
> > >
> > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > > index ec3fbae58d..5935b32fe3 100644
> > > --- a/include/hw/virtio/vhost-backend.h
> > > +++ b/include/hw/virtio/vhost-backend.h
> > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > >      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > >  } VhostSetConfigType;
> > >
> > > +typedef enum VhostDeviceStateDirection {
> > > +    /* Transfer state from back-end (device) to front-end */
> > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > +    /* Transfer state from front-end to back-end (device) */
> > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > +} VhostDeviceStateDirection;
> > > +
> > > +typedef enum VhostDeviceStatePhase {
> > > +    /* The device (and all its vrings) is stopped */
> > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > +} VhostDeviceStatePhase;
> >
> > vDPA has:
> >
> >   /* Suspend a device so it does not process virtqueue requests anymore
> >    *
> >    * After the return of ioctl the device must preserve all the necessary state
> >    * (the virtqueue vring base plus the possible device specific states) that is
> >    * required for restoring in the future. The device must not change its
> >    * configuration after that point.
> >    */
> >   #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> >
> >   /* Resume a device so it can resume processing virtqueue requests
> >    *
> >    * After the return of this ioctl the device will have restored all the
> >    * necessary states and it is fully operational to continue processing the
> >    * virtqueue descriptors.
> >    */
> >   #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> >
> > I wonder if it makes sense to import these into vhost-user so that the
> > difference between kernel vhost and vhost-user is minimized. It's okay
> > if one of them is ahead of the other, but it would be nice to avoid
> > overlapping/duplicated functionality.
> >
> 
> That's what I had in mind in the first versions. I proposed VHOST_STOP
> instead of VHOST_VDPA_STOP for this very reason. Later it did change
> to SUSPEND.
> 
> Generally it is better if we make the interface less parametrized and
> we trust in the messages and its semantics in my opinion. In other
> words, instead of
> vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), send
> individually the equivalent of VHOST_VDPA_SUSPEND vhost-user command.
> 
> Another way to apply this is with the "direction" parameter. Maybe it
> is better to split it into "set_state_fd" and "get_state_fd"?
> 
> In that case, reusing the ioctls as vhost-user messages would be ok.
> But that puts this proposal further from the VFIO code, which uses
> "migration_set_state(state)", and maybe it is better when the number
> of states is high.

Hi Eugenio,
Another question about vDPA suspend/resume:

  /* Host notifiers must be enabled at this point. */
  void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings)
  {
      int i;
  
      /* should only be called after backend is connected */
      assert(hdev->vhost_ops);
      event_notifier_test_and_clear(
          &hdev->vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier);
      event_notifier_test_and_clear(&vdev->config_notifier);
  
      trace_vhost_dev_stop(hdev, vdev->name, vrings);
  
      if (hdev->vhost_ops->vhost_dev_start) {
          hdev->vhost_ops->vhost_dev_start(hdev, false);
          ^^^ SUSPEND ^^^
      }
      if (vrings) {
          vhost_dev_set_vring_enable(hdev, false);
      }
      for (i = 0; i < hdev->nvqs; ++i) {
          vhost_virtqueue_stop(hdev,
                               vdev,
                               hdev->vqs + i,
                               hdev->vq_index + i);
	^^^ fetch virtqueue state from kernel ^^^
      }
      if (hdev->vhost_ops->vhost_reset_status) {
          hdev->vhost_ops->vhost_reset_status(hdev);
	  ^^^ reset device^^^

I noticed the QEMU vDPA code resets the device in vhost_dev_stop() ->
vhost_reset_status(). The device's migration code runs after
vhost_dev_stop() and the state will have been lost.

It looks like vDPA changes are necessary in order to support stateful
devices even though QEMU already uses SUSPEND. Is my understanding
correct?

Stefan
Eugenio Perez Martin April 17, 2023, 6:37 p.m. UTC | #16
On Thu, Apr 13, 2023 at 7:32 PM Hanna Czenczek <hreitz@redhat.com> wrote:
>
> On 13.04.23 12:14, Eugenio Perez Martin wrote:
> > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> >>> So-called "internal" virtio-fs migration refers to transporting the
> >>> back-end's (virtiofsd's) state through qemu's migration stream.  To do
> >>> this, we need to be able to transfer virtiofsd's internal state to and
> >>> from virtiofsd.
> >>>
> >>> Because virtiofsd's internal state will not be too large, we believe it
> >>> is best to transfer it as a single binary blob after the streaming
> >>> phase.  Because this method should be useful to other vhost-user
> >>> implementations, too, it is introduced as a general-purpose addition to
> >>> the protocol, not limited to vhost-user-fs.
> >>>
> >>> These are the additions to the protocol:
> >>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> >>>    This feature signals support for transferring state, and is added so
> >>>    that migration can fail early when the back-end has no support.
> >>>
> >>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> >>>    over which to transfer the state.  The front-end sends an FD to the
> >>>    back-end into/from which it can write/read its state, and the back-end
> >>>    can decide to either use it, or reply with a different FD for the
> >>>    front-end to override the front-end's choice.
> >>>    The front-end creates a simple pipe to transfer the state, but maybe
> >>>    the back-end already has an FD into/from which it has to write/read
> >>>    its state, in which case it will want to override the simple pipe.
> >>>    Conversely, maybe in the future we find a way to have the front-end
> >>>    get an immediate FD for the migration stream (in some cases), in which
> >>>    case we will want to send this to the back-end instead of creating a
> >>>    pipe.
> >>>    Hence the negotiation: If one side has a better idea than a plain
> >>>    pipe, we will want to use that.
> >>>
> >>> - CHECK_DEVICE_STATE: After the state has been transferred through the
> >>>    pipe (the end indicated by EOF), the front-end invokes this function
> >>>    to verify success.  There is no in-band way (through the pipe) to
> >>>    indicate failure, so we need to check explicitly.
> >>>
> >>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> >>> (which includes establishing the direction of transfer and migration
> >>> phase), the sending side writes its data into the pipe, and the reading
> >>> side reads it until it sees an EOF.  Then, the front-end will check for
> >>> success via CHECK_DEVICE_STATE, which on the destination side includes
> >>> checking for integrity (i.e. errors during deserialization).
> >>>
> >>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> >>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> >>> ---
> >>>   include/hw/virtio/vhost-backend.h |  24 +++++
> >>>   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> >>>   hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> >>>   hw/virtio/vhost.c                 |  37 ++++++++
> >>>   4 files changed, 287 insertions(+)
> >>>
> >>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> >>> index ec3fbae58d..5935b32fe3 100644
> >>> --- a/include/hw/virtio/vhost-backend.h
> >>> +++ b/include/hw/virtio/vhost-backend.h
> >>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> >>>       VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> >>>   } VhostSetConfigType;
> >>>
> >>> +typedef enum VhostDeviceStateDirection {
> >>> +    /* Transfer state from back-end (device) to front-end */
> >>> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> >>> +    /* Transfer state from front-end to back-end (device) */
> >>> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> >>> +} VhostDeviceStateDirection;
> >>> +
> >>> +typedef enum VhostDeviceStatePhase {
> >>> +    /* The device (and all its vrings) is stopped */
> >>> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> >>> +} VhostDeviceStatePhase;
> >> vDPA has:
> >>
> >>    /* Suspend a device so it does not process virtqueue requests anymore
> >>     *
> >>     * After the return of ioctl the device must preserve all the necessary state
> >>     * (the virtqueue vring base plus the possible device specific states) that is
> >>     * required for restoring in the future. The device must not change its
> >>     * configuration after that point.
> >>     */
> >>    #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> >>
> >>    /* Resume a device so it can resume processing virtqueue requests
> >>     *
> >>     * After the return of this ioctl the device will have restored all the
> >>     * necessary states and it is fully operational to continue processing the
> >>     * virtqueue descriptors.
> >>     */
> >>    #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> >>
> >> I wonder if it makes sense to import these into vhost-user so that the
> >> difference between kernel vhost and vhost-user is minimized. It's okay
> >> if one of them is ahead of the other, but it would be nice to avoid
> >> overlapping/duplicated functionality.
> >>
> > That's what I had in mind in the first versions. I proposed VHOST_STOP
> > instead of VHOST_VDPA_STOP for this very reason. Later it did change
> > to SUSPEND.
> >
> > Generally it is better if we make the interface less parametrized and
> > we trust in the messages and its semantics in my opinion. In other
> > words, instead of
> > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), send
> > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user command.
>
> I.e. you mean that this should simply be stateful instead of
> re-affirming the current state with a parameter?
>
> The problem I see is that transferring states in different phases of
> migration will require specialized implementations.  So running
> SET_DEVICE_STATE_FD in a different phase will require support from the
> back-end.  Same in the front-end, the exact protocol and thus
> implementation will (probably, difficult to say at this point) depend on
> the migration phase.  I would therefore prefer to have an explicit
> distinction in the command itself that affirms the phase we’re
> targeting.
>

I think we will have this same problem when more phases are added, as
the fd and direction arguments are always  passed whatever phase you
set. Future phases may not require it, or require different arguments.

> On the other hand, I don’t see the parameter complicating anything. The
> front-end must supply it, but it will know the phase anyway, so this is
> easy.  The back-end can just choose to ignore it, if it doesn’t feel the
> need to verify that the phase is what it thinks it is.
>
> > Another way to apply this is with the "direction" parameter. Maybe it
> > is better to split it into "set_state_fd" and "get_state_fd"?
>
> Well, it would rather be `set_state_send_fd` and `set_state_receive_fd`.

Right, thanks for the correction.

> We always negotiate a pipe between front-end and back-end, the question
> is just whether the back-end gets the receiving (load) or the sending
> (save) end.
>
> Technically, one can make it fully stateful and say that if the device
> hasn’t been started already, it’s always a LOAD, and otherwise always a
> SAVE.  But as above, I’d prefer to keep the parameter because the
> implementations are different, so I’d prefer there to be a
> re-affirmation that front-end and back-end are in sync about what should
> be done.
>
> Personally, I don’t really see the advantage of having two functions
> instead of one function with an enum with two values.  The thing about
> SET_DEVICE_STATE_FD is that it itself won’t differ much regardless of
> whether we’re loading or saving, it just negotiates the pipe – the
> difference is what happens after the pipe has been negotiated.  So if we
> split the function into two, both implementations will share most of
> their code anyway, which makes me think it should be a single function.
>

Yes, all of that makes sense.

My proposal was in the line of following other commands like
VHOST_USER_SET_VRING_BASE / VHOST_USER_GET_VRING_BASE or
VHOST_USER_SET_INFLIGHT_FD and VHOST_USER_GET_INFLIGHT_FD. If that has
been considered and it is more convenient to use the arguments I'm
totally fine.

> > In that case, reusing the ioctls as vhost-user messages would be ok.
> > But that puts this proposal further from the VFIO code, which uses
> > "migration_set_state(state)", and maybe it is better when the number
> > of states is high.
>
> I’m not sure what you mean (because I don’t know the VFIO code, I
> assume).  Are you saying that using a more finely grained
> migration_set_state() model would conflict with the rather coarse
> suspend/resume?
>

I don't think it exactly conflicts, as we should be able to map both
to a given set_state. They may overlap if vhost-user decides to use
them. Or if vdpa decides to use SET_DEVICE_STATE_FD.

This already happens with the different vhost backends, each one has a
different way to suspend the device in the case of a migration anyway.

Thanks!

> > BTW, is there any usage for *reply_fd at this moment from the backend?
>
> No, virtiofsd doesn’t plan to make use of it.
>
> Hanna
>
Eugenio Perez Martin April 17, 2023, 6:55 p.m. UTC | #17
On Mon, Apr 17, 2023 at 5:18 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Fri, Apr 14, 2023 at 05:17:02PM +0200, Eugenio Perez Martin wrote:
> > On Thu, Apr 13, 2023 at 7:55 PM Hanna Czenczek <hreitz@redhat.com> wrote:
> > >
> > > On 13.04.23 13:38, Stefan Hajnoczi wrote:
> > > > On Thu, 13 Apr 2023 at 05:24, Hanna Czenczek <hreitz@redhat.com> wrote:
> > > >> On 12.04.23 23:06, Stefan Hajnoczi wrote:
> > > >>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > >>>> So-called "internal" virtio-fs migration refers to transporting the
> > > >>>> back-end's (virtiofsd's) state through qemu's migration stream.  To do
> > > >>>> this, we need to be able to transfer virtiofsd's internal state to and
> > > >>>> from virtiofsd.
> > > >>>>
> > > >>>> Because virtiofsd's internal state will not be too large, we believe it
> > > >>>> is best to transfer it as a single binary blob after the streaming
> > > >>>> phase.  Because this method should be useful to other vhost-user
> > > >>>> implementations, too, it is introduced as a general-purpose addition to
> > > >>>> the protocol, not limited to vhost-user-fs.
> > > >>>>
> > > >>>> These are the additions to the protocol:
> > > >>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > >>>>     This feature signals support for transferring state, and is added so
> > > >>>>     that migration can fail early when the back-end has no support.
> > > >>>>
> > > >>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> > > >>>>     over which to transfer the state.  The front-end sends an FD to the
> > > >>>>     back-end into/from which it can write/read its state, and the back-end
> > > >>>>     can decide to either use it, or reply with a different FD for the
> > > >>>>     front-end to override the front-end's choice.
> > > >>>>     The front-end creates a simple pipe to transfer the state, but maybe
> > > >>>>     the back-end already has an FD into/from which it has to write/read
> > > >>>>     its state, in which case it will want to override the simple pipe.
> > > >>>>     Conversely, maybe in the future we find a way to have the front-end
> > > >>>>     get an immediate FD for the migration stream (in some cases), in which
> > > >>>>     case we will want to send this to the back-end instead of creating a
> > > >>>>     pipe.
> > > >>>>     Hence the negotiation: If one side has a better idea than a plain
> > > >>>>     pipe, we will want to use that.
> > > >>>>
> > > >>>> - CHECK_DEVICE_STATE: After the state has been transferred through the
> > > >>>>     pipe (the end indicated by EOF), the front-end invokes this function
> > > >>>>     to verify success.  There is no in-band way (through the pipe) to
> > > >>>>     indicate failure, so we need to check explicitly.
> > > >>>>
> > > >>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> > > >>>> (which includes establishing the direction of transfer and migration
> > > >>>> phase), the sending side writes its data into the pipe, and the reading
> > > >>>> side reads it until it sees an EOF.  Then, the front-end will check for
> > > >>>> success via CHECK_DEVICE_STATE, which on the destination side includes
> > > >>>> checking for integrity (i.e. errors during deserialization).
> > > >>>>
> > > >>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > >>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > >>>> ---
> > > >>>>    include/hw/virtio/vhost-backend.h |  24 +++++
> > > >>>>    include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > >>>>    hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> > > >>>>    hw/virtio/vhost.c                 |  37 ++++++++
> > > >>>>    4 files changed, 287 insertions(+)
> > > >>>>
> > > >>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > > >>>> index ec3fbae58d..5935b32fe3 100644
> > > >>>> --- a/include/hw/virtio/vhost-backend.h
> > > >>>> +++ b/include/hw/virtio/vhost-backend.h
> > > >>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > >>>>        VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > >>>>    } VhostSetConfigType;
> > > >>>>
> > > >>>> +typedef enum VhostDeviceStateDirection {
> > > >>>> +    /* Transfer state from back-end (device) to front-end */
> > > >>>> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > >>>> +    /* Transfer state from front-end to back-end (device) */
> > > >>>> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > >>>> +} VhostDeviceStateDirection;
> > > >>>> +
> > > >>>> +typedef enum VhostDeviceStatePhase {
> > > >>>> +    /* The device (and all its vrings) is stopped */
> > > >>>> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > >>>> +} VhostDeviceStatePhase;
> > > >>> vDPA has:
> > > >>>
> > > >>>     /* Suspend a device so it does not process virtqueue requests anymore
> > > >>>      *
> > > >>>      * After the return of ioctl the device must preserve all the necessary state
> > > >>>      * (the virtqueue vring base plus the possible device specific states) that is
> > > >>>      * required for restoring in the future. The device must not change its
> > > >>>      * configuration after that point.
> > > >>>      */
> > > >>>     #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > >>>
> > > >>>     /* Resume a device so it can resume processing virtqueue requests
> > > >>>      *
> > > >>>      * After the return of this ioctl the device will have restored all the
> > > >>>      * necessary states and it is fully operational to continue processing the
> > > >>>      * virtqueue descriptors.
> > > >>>      */
> > > >>>     #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > >>>
> > > >>> I wonder if it makes sense to import these into vhost-user so that the
> > > >>> difference between kernel vhost and vhost-user is minimized. It's okay
> > > >>> if one of them is ahead of the other, but it would be nice to avoid
> > > >>> overlapping/duplicated functionality.
> > > >>>
> > > >>> (And I hope vDPA will import the device state vhost-user messages
> > > >>> introduced in this series.)
> > > >> I don’t understand your suggestion.  (Like, I very simply don’t
> > > >> understand :))
> > > >>
> > > >> These are vhost messages, right?  What purpose do you have in mind for
> > > >> them in vhost-user for internal migration?  They’re different from the
> > > >> state transfer messages, because they don’t transfer state to/from the
> > > >> front-end.  Also, the state transfer stuff is supposed to be distinct
> > > >> from starting/stopping the device; right now, it just requires the
> > > >> device to be stopped beforehand (or started only afterwards).  And in
> > > >> the future, new VhostDeviceStatePhase values may allow the messages to
> > > >> be used on devices that aren’t stopped.
> > > >>
> > > >> So they seem to serve very different purposes.  I can imagine using the
> > > >> VDPA_{SUSPEND,RESUME} messages for external migration (what Anton is
> > > >> working on), but they don’t really help with internal migration
> > > >> implemented here.  If I were to add them, they’d just be sent in
> > > >> addition to the new messages added in this patch here, i.e. SUSPEND on
> > > >> the source before SET_DEVICE_STATE_FD, and RESUME on the destination
> > > >> after CHECK_DEVICE_STATE (we could use RESUME in place of
> > > >> CHECK_DEVICE_STATE on the destination, but we can’t do that on the
> > > >> source, so we still need CHECK_DEVICE_STATE).
> > > > Yes, they are complementary to the device state fd message. I want to
> > > > make sure pre-conditions about the device's state (running vs stopped)
> > > > already take into account the vDPA SUSPEND/RESUME model.
> > > >
> > > > vDPA will need device state save/load in the future. For virtiofs
> > > > devices, for example. This is why I think we should plan for vDPA and
> > > > vhost-user to share the same interface.
> > >
> > > While the paragraph below is more important, I don’t feel like this
> > > would be important right now.  It’s clear that SUSPEND must come before
> > > transferring any state, and that RESUME must come after transferring
> > > state.  I don’t think we need to clarify this now, it’d be obvious when
> > > implementing SUSPEND/RESUME.
> > >
> > > > Also, I think the code path you're relying on (vhost_dev_stop()) on
> > > > doesn't work for backends that implement VHOST_USER_PROTOCOL_F_STATUS
> > > > because stopping the backend resets the device and throws away its
> > > > state. SUSPEND/RESUME solve this. This looks like a more general
> > > > problem since vhost_dev_stop() is called any time the VM is paused.
> > > > Maybe it needs to use SUSPEND/RESUME whenever possible.
> > >
> > > That’s a problem.  Quite a problem, to be honest, because this sounds
> > > rather complicated with honestly absolutely no practical benefit right
> > > now.
> > >
> > > Would you require SUSPEND/RESUME for state transfer even if the back-end
> > > does not implement GET/SET_STATUS?  Because then this would also lead to
> > > more complexity in virtiofsd.
> > >
> >
> > At this moment the vhost-user net in DPDK suspends at
> > VHOST_GET_VRING_BASE. Not the same case though, as here only the vq
> > indexes / wrap bits are transferred here.
> >
> > Vhost-vdpa implements the suspend call so it does not need to trust
> > VHOST_GET_VRING_BASE to be the last vring call done. Since virtiofsd
> > is using vhost-user maybe it is not needed to implement it actually.
>
> Careful, if we deliberately make vhost-user and vDPA diverge, then it
> will be hard to share the migration interface.
>

I don't recall the exact reasons for not following with the
VRING_GET_BASE == suspend for vDPA, IIRC was the lack of a proper
definition back then. But vhost-kernel and vhost-user already diverged
in that regard, for example. vhost-kernel set a tap backend of -1 to
suspend the device.

> > > Basically, what I’m hearing is that I need to implement a different
> > > feature that has no practical impact right now, and also fix bugs around
> > > it along the way...
> > >
> >
> > To fix this properly requires iterative device migration in qemu as
> > far as I know, instead of using VMStates [1]. This way the state is
> > requested to virtiofsd before the device reset.
>
> I don't follow. Many devices are fine with non-iterative migration. They
> shouldn't be forced to do iterative migration.
>

Sorry I think I didn't express myself well. I didn't mean to force
virtiofsd to support the iterative migration, but to use the device
iterative migration API in QEMU to send the needed commands before
vhost_dev_stop. In that regard, the device or the vhost-user commands
would not require changes.

I think it is convenient in the long run for virtiofsd, as if the
state grows so much that it's not feasible to fetch it in one shot
there is no need to make changes in the qemu migration protocol. I
think it is not unlikely in virtiofs, but maybe I'm missing something
obvious and it's state will never grow.

> > What does virtiofsd do when the state is totally sent? Does it keep
> > processing requests and generating new state or is only a one shot
> > that will suspend the daemon? If it is the second I think it still can
> > be done in one shot at the end, always indicating "no more state" at
> > save_live_pending and sending all the state at
> > save_live_complete_precopy.
> >
> > Does that make sense to you?
> >
> > Thanks!
> >
> > [1] https://qemu.readthedocs.io/en/latest/devel/migration.html#iterative-device-migration
> >
Eugenio Perez Martin April 17, 2023, 7:06 p.m. UTC | #18
On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
> > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > So-called "internal" virtio-fs migration refers to transporting the
> > > > back-end's (virtiofsd's) state through qemu's migration stream.  To do
> > > > this, we need to be able to transfer virtiofsd's internal state to and
> > > > from virtiofsd.
> > > >
> > > > Because virtiofsd's internal state will not be too large, we believe it
> > > > is best to transfer it as a single binary blob after the streaming
> > > > phase.  Because this method should be useful to other vhost-user
> > > > implementations, too, it is introduced as a general-purpose addition to
> > > > the protocol, not limited to vhost-user-fs.
> > > >
> > > > These are the additions to the protocol:
> > > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > >   This feature signals support for transferring state, and is added so
> > > >   that migration can fail early when the back-end has no support.
> > > >
> > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> > > >   over which to transfer the state.  The front-end sends an FD to the
> > > >   back-end into/from which it can write/read its state, and the back-end
> > > >   can decide to either use it, or reply with a different FD for the
> > > >   front-end to override the front-end's choice.
> > > >   The front-end creates a simple pipe to transfer the state, but maybe
> > > >   the back-end already has an FD into/from which it has to write/read
> > > >   its state, in which case it will want to override the simple pipe.
> > > >   Conversely, maybe in the future we find a way to have the front-end
> > > >   get an immediate FD for the migration stream (in some cases), in which
> > > >   case we will want to send this to the back-end instead of creating a
> > > >   pipe.
> > > >   Hence the negotiation: If one side has a better idea than a plain
> > > >   pipe, we will want to use that.
> > > >
> > > > - CHECK_DEVICE_STATE: After the state has been transferred through the
> > > >   pipe (the end indicated by EOF), the front-end invokes this function
> > > >   to verify success.  There is no in-band way (through the pipe) to
> > > >   indicate failure, so we need to check explicitly.
> > > >
> > > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> > > > (which includes establishing the direction of transfer and migration
> > > > phase), the sending side writes its data into the pipe, and the reading
> > > > side reads it until it sees an EOF.  Then, the front-end will check for
> > > > success via CHECK_DEVICE_STATE, which on the destination side includes
> > > > checking for integrity (i.e. errors during deserialization).
> > > >
> > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > ---
> > > >  include/hw/virtio/vhost-backend.h |  24 +++++
> > > >  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > >  hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> > > >  hw/virtio/vhost.c                 |  37 ++++++++
> > > >  4 files changed, 287 insertions(+)
> > > >
> > > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > > > index ec3fbae58d..5935b32fe3 100644
> > > > --- a/include/hw/virtio/vhost-backend.h
> > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > >      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > >  } VhostSetConfigType;
> > > >
> > > > +typedef enum VhostDeviceStateDirection {
> > > > +    /* Transfer state from back-end (device) to front-end */
> > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > +    /* Transfer state from front-end to back-end (device) */
> > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > +} VhostDeviceStateDirection;
> > > > +
> > > > +typedef enum VhostDeviceStatePhase {
> > > > +    /* The device (and all its vrings) is stopped */
> > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > +} VhostDeviceStatePhase;
> > >
> > > vDPA has:
> > >
> > >   /* Suspend a device so it does not process virtqueue requests anymore
> > >    *
> > >    * After the return of ioctl the device must preserve all the necessary state
> > >    * (the virtqueue vring base plus the possible device specific states) that is
> > >    * required for restoring in the future. The device must not change its
> > >    * configuration after that point.
> > >    */
> > >   #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > >
> > >   /* Resume a device so it can resume processing virtqueue requests
> > >    *
> > >    * After the return of this ioctl the device will have restored all the
> > >    * necessary states and it is fully operational to continue processing the
> > >    * virtqueue descriptors.
> > >    */
> > >   #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > >
> > > I wonder if it makes sense to import these into vhost-user so that the
> > > difference between kernel vhost and vhost-user is minimized. It's okay
> > > if one of them is ahead of the other, but it would be nice to avoid
> > > overlapping/duplicated functionality.
> > >
> >
> > That's what I had in mind in the first versions. I proposed VHOST_STOP
> > instead of VHOST_VDPA_STOP for this very reason. Later it did change
> > to SUSPEND.
> >
> > Generally it is better if we make the interface less parametrized and
> > we trust in the messages and its semantics in my opinion. In other
> > words, instead of
> > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), send
> > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user command.
> >
> > Another way to apply this is with the "direction" parameter. Maybe it
> > is better to split it into "set_state_fd" and "get_state_fd"?
> >
> > In that case, reusing the ioctls as vhost-user messages would be ok.
> > But that puts this proposal further from the VFIO code, which uses
> > "migration_set_state(state)", and maybe it is better when the number
> > of states is high.
>
> Hi Eugenio,
> Another question about vDPA suspend/resume:
>
>   /* Host notifiers must be enabled at this point. */
>   void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings)
>   {
>       int i;
>
>       /* should only be called after backend is connected */
>       assert(hdev->vhost_ops);
>       event_notifier_test_and_clear(
>           &hdev->vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier);
>       event_notifier_test_and_clear(&vdev->config_notifier);
>
>       trace_vhost_dev_stop(hdev, vdev->name, vrings);
>
>       if (hdev->vhost_ops->vhost_dev_start) {
>           hdev->vhost_ops->vhost_dev_start(hdev, false);
>           ^^^ SUSPEND ^^^
>       }
>       if (vrings) {
>           vhost_dev_set_vring_enable(hdev, false);
>       }
>       for (i = 0; i < hdev->nvqs; ++i) {
>           vhost_virtqueue_stop(hdev,
>                                vdev,
>                                hdev->vqs + i,
>                                hdev->vq_index + i);
>         ^^^ fetch virtqueue state from kernel ^^^
>       }
>       if (hdev->vhost_ops->vhost_reset_status) {
>           hdev->vhost_ops->vhost_reset_status(hdev);
>           ^^^ reset device^^^
>
> I noticed the QEMU vDPA code resets the device in vhost_dev_stop() ->
> vhost_reset_status(). The device's migration code runs after
> vhost_dev_stop() and the state will have been lost.
>

vhost_virtqueue_stop saves the vq state (indexes, vring base) in the
qemu VirtIONet device model. This is for all vhost backends.

Regarding the state like mac or mq configuration, SVQ runs for all the
VM run in the CVQ. So it can track all of that status in the device
model too.

When a migration effectively occurs, all the frontend state is
migrated as a regular emulated device. To route all of the state in a
normalized way for qemu is what leaves open the possibility to do
cross-backends migrations, etc.

Does that answer your question?

> It looks like vDPA changes are necessary in order to support stateful
> devices even though QEMU already uses SUSPEND. Is my understanding
> correct?
>

Changes are required elsewhere, as the code to restore the state
properly in the destination has not been merged.

Thanks!
Stefan Hajnoczi April 17, 2023, 7:08 p.m. UTC | #19
On Mon, 17 Apr 2023 at 14:56, Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> On Mon, Apr 17, 2023 at 5:18 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Fri, Apr 14, 2023 at 05:17:02PM +0200, Eugenio Perez Martin wrote:
> > > On Thu, Apr 13, 2023 at 7:55 PM Hanna Czenczek <hreitz@redhat.com> wrote:
> > > >
> > > > On 13.04.23 13:38, Stefan Hajnoczi wrote:
> > > > > On Thu, 13 Apr 2023 at 05:24, Hanna Czenczek <hreitz@redhat.com> wrote:
> > > > >> On 12.04.23 23:06, Stefan Hajnoczi wrote:
> > > > >>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > Basically, what I’m hearing is that I need to implement a different
> > > > feature that has no practical impact right now, and also fix bugs around
> > > > it along the way...
> > > >
> > >
> > > To fix this properly requires iterative device migration in qemu as
> > > far as I know, instead of using VMStates [1]. This way the state is
> > > requested to virtiofsd before the device reset.
> >
> > I don't follow. Many devices are fine with non-iterative migration. They
> > shouldn't be forced to do iterative migration.
> >
>
> Sorry I think I didn't express myself well. I didn't mean to force
> virtiofsd to support the iterative migration, but to use the device
> iterative migration API in QEMU to send the needed commands before
> vhost_dev_stop. In that regard, the device or the vhost-user commands
> would not require changes.
>
> I think it is convenient in the long run for virtiofsd, as if the
> state grows so much that it's not feasible to fetch it in one shot
> there is no need to make changes in the qemu migration protocol. I
> think it is not unlikely in virtiofs, but maybe I'm missing something
> obvious and it's state will never grow.

I don't understand. vCPUs are still running at that point and the
device state could change. It's not safe to save the full device state
until vCPUs have stopped (after vhost_dev_stop).

If you're suggestion somehow doing non-iterative migration but during
the iterative phase, then I don't think that's possible?

Stefan
Eugenio Perez Martin April 17, 2023, 7:09 p.m. UTC | #20
On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
> > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > So-called "internal" virtio-fs migration refers to transporting the
> > > > back-end's (virtiofsd's) state through qemu's migration stream.  To do
> > > > this, we need to be able to transfer virtiofsd's internal state to and
> > > > from virtiofsd.
> > > >
> > > > Because virtiofsd's internal state will not be too large, we believe it
> > > > is best to transfer it as a single binary blob after the streaming
> > > > phase.  Because this method should be useful to other vhost-user
> > > > implementations, too, it is introduced as a general-purpose addition to
> > > > the protocol, not limited to vhost-user-fs.
> > > >
> > > > These are the additions to the protocol:
> > > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > >   This feature signals support for transferring state, and is added so
> > > >   that migration can fail early when the back-end has no support.
> > > >
> > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> > > >   over which to transfer the state.  The front-end sends an FD to the
> > > >   back-end into/from which it can write/read its state, and the back-end
> > > >   can decide to either use it, or reply with a different FD for the
> > > >   front-end to override the front-end's choice.
> > > >   The front-end creates a simple pipe to transfer the state, but maybe
> > > >   the back-end already has an FD into/from which it has to write/read
> > > >   its state, in which case it will want to override the simple pipe.
> > > >   Conversely, maybe in the future we find a way to have the front-end
> > > >   get an immediate FD for the migration stream (in some cases), in which
> > > >   case we will want to send this to the back-end instead of creating a
> > > >   pipe.
> > > >   Hence the negotiation: If one side has a better idea than a plain
> > > >   pipe, we will want to use that.
> > > >
> > > > - CHECK_DEVICE_STATE: After the state has been transferred through the
> > > >   pipe (the end indicated by EOF), the front-end invokes this function
> > > >   to verify success.  There is no in-band way (through the pipe) to
> > > >   indicate failure, so we need to check explicitly.
> > > >
> > > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> > > > (which includes establishing the direction of transfer and migration
> > > > phase), the sending side writes its data into the pipe, and the reading
> > > > side reads it until it sees an EOF.  Then, the front-end will check for
> > > > success via CHECK_DEVICE_STATE, which on the destination side includes
> > > > checking for integrity (i.e. errors during deserialization).
> > > >
> > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > ---
> > > >  include/hw/virtio/vhost-backend.h |  24 +++++
> > > >  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > >  hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> > > >  hw/virtio/vhost.c                 |  37 ++++++++
> > > >  4 files changed, 287 insertions(+)
> > > >
> > > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > > > index ec3fbae58d..5935b32fe3 100644
> > > > --- a/include/hw/virtio/vhost-backend.h
> > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > >      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > >  } VhostSetConfigType;
> > > >
> > > > +typedef enum VhostDeviceStateDirection {
> > > > +    /* Transfer state from back-end (device) to front-end */
> > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > +    /* Transfer state from front-end to back-end (device) */
> > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > +} VhostDeviceStateDirection;
> > > > +
> > > > +typedef enum VhostDeviceStatePhase {
> > > > +    /* The device (and all its vrings) is stopped */
> > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > +} VhostDeviceStatePhase;
> > >
> > > vDPA has:
> > >
> > >   /* Suspend a device so it does not process virtqueue requests anymore
> > >    *
> > >    * After the return of ioctl the device must preserve all the necessary state
> > >    * (the virtqueue vring base plus the possible device specific states) that is
> > >    * required for restoring in the future. The device must not change its
> > >    * configuration after that point.
> > >    */
> > >   #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > >
> > >   /* Resume a device so it can resume processing virtqueue requests
> > >    *
> > >    * After the return of this ioctl the device will have restored all the
> > >    * necessary states and it is fully operational to continue processing the
> > >    * virtqueue descriptors.
> > >    */
> > >   #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > >
> > > I wonder if it makes sense to import these into vhost-user so that the
> > > difference between kernel vhost and vhost-user is minimized. It's okay
> > > if one of them is ahead of the other, but it would be nice to avoid
> > > overlapping/duplicated functionality.
> > >
> >
> > That's what I had in mind in the first versions. I proposed VHOST_STOP
> > instead of VHOST_VDPA_STOP for this very reason. Later it did change
> > to SUSPEND.
>
> I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not
> ioctl(VHOST_VDPA_RESUME).
>
> The doc comments in <linux/vdpa.h> don't explain how the device can
> leave the suspended state. Can you clarify this?
>

Do you mean in what situations or regarding the semantics of _RESUME?

To me resume is an operation mainly to resume the device in the event
of a VM suspension, not a migration. It can be used as a fallback code
in some cases of migration failure though, but it is not currently
used in qemu.

Thanks!
Eugenio Perez Martin April 17, 2023, 7:11 p.m. UTC | #21
On Mon, Apr 17, 2023 at 9:08 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Mon, 17 Apr 2023 at 14:56, Eugenio Perez Martin <eperezma@redhat.com> wrote:
> >
> > On Mon, Apr 17, 2023 at 5:18 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Fri, Apr 14, 2023 at 05:17:02PM +0200, Eugenio Perez Martin wrote:
> > > > On Thu, Apr 13, 2023 at 7:55 PM Hanna Czenczek <hreitz@redhat.com> wrote:
> > > > >
> > > > > On 13.04.23 13:38, Stefan Hajnoczi wrote:
> > > > > > On Thu, 13 Apr 2023 at 05:24, Hanna Czenczek <hreitz@redhat.com> wrote:
> > > > > >> On 12.04.23 23:06, Stefan Hajnoczi wrote:
> > > > > >>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > Basically, what I’m hearing is that I need to implement a different
> > > > > feature that has no practical impact right now, and also fix bugs around
> > > > > it along the way...
> > > > >
> > > >
> > > > To fix this properly requires iterative device migration in qemu as
> > > > far as I know, instead of using VMStates [1]. This way the state is
> > > > requested to virtiofsd before the device reset.
> > >
> > > I don't follow. Many devices are fine with non-iterative migration. They
> > > shouldn't be forced to do iterative migration.
> > >
> >
> > Sorry I think I didn't express myself well. I didn't mean to force
> > virtiofsd to support the iterative migration, but to use the device
> > iterative migration API in QEMU to send the needed commands before
> > vhost_dev_stop. In that regard, the device or the vhost-user commands
> > would not require changes.
> >
> > I think it is convenient in the long run for virtiofsd, as if the
> > state grows so much that it's not feasible to fetch it in one shot
> > there is no need to make changes in the qemu migration protocol. I
> > think it is not unlikely in virtiofs, but maybe I'm missing something
> > obvious and it's state will never grow.
>
> I don't understand. vCPUs are still running at that point and the
> device state could change. It's not safe to save the full device state
> until vCPUs have stopped (after vhost_dev_stop).
>

I think the vCPU is already stopped at save_live_complete_precopy
callback. Maybe my understanding is wrong?

Thanks!

> If you're suggestion somehow doing non-iterative migration but during
> the iterative phase, then I don't think that's possible?
>
> Stefan
>
Stefan Hajnoczi April 17, 2023, 7:20 p.m. UTC | #22
On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
> > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > >
> > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > So-called "internal" virtio-fs migration refers to transporting the
> > > > > back-end's (virtiofsd's) state through qemu's migration stream.  To do
> > > > > this, we need to be able to transfer virtiofsd's internal state to and
> > > > > from virtiofsd.
> > > > >
> > > > > Because virtiofsd's internal state will not be too large, we believe it
> > > > > is best to transfer it as a single binary blob after the streaming
> > > > > phase.  Because this method should be useful to other vhost-user
> > > > > implementations, too, it is introduced as a general-purpose addition to
> > > > > the protocol, not limited to vhost-user-fs.
> > > > >
> > > > > These are the additions to the protocol:
> > > > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > >   This feature signals support for transferring state, and is added so
> > > > >   that migration can fail early when the back-end has no support.
> > > > >
> > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> > > > >   over which to transfer the state.  The front-end sends an FD to the
> > > > >   back-end into/from which it can write/read its state, and the back-end
> > > > >   can decide to either use it, or reply with a different FD for the
> > > > >   front-end to override the front-end's choice.
> > > > >   The front-end creates a simple pipe to transfer the state, but maybe
> > > > >   the back-end already has an FD into/from which it has to write/read
> > > > >   its state, in which case it will want to override the simple pipe.
> > > > >   Conversely, maybe in the future we find a way to have the front-end
> > > > >   get an immediate FD for the migration stream (in some cases), in which
> > > > >   case we will want to send this to the back-end instead of creating a
> > > > >   pipe.
> > > > >   Hence the negotiation: If one side has a better idea than a plain
> > > > >   pipe, we will want to use that.
> > > > >
> > > > > - CHECK_DEVICE_STATE: After the state has been transferred through the
> > > > >   pipe (the end indicated by EOF), the front-end invokes this function
> > > > >   to verify success.  There is no in-band way (through the pipe) to
> > > > >   indicate failure, so we need to check explicitly.
> > > > >
> > > > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> > > > > (which includes establishing the direction of transfer and migration
> > > > > phase), the sending side writes its data into the pipe, and the reading
> > > > > side reads it until it sees an EOF.  Then, the front-end will check for
> > > > > success via CHECK_DEVICE_STATE, which on the destination side includes
> > > > > checking for integrity (i.e. errors during deserialization).
> > > > >
> > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > ---
> > > > >  include/hw/virtio/vhost-backend.h |  24 +++++
> > > > >  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > >  hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> > > > >  hw/virtio/vhost.c                 |  37 ++++++++
> > > > >  4 files changed, 287 insertions(+)
> > > > >
> > > > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > >      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > >  } VhostSetConfigType;
> > > > >
> > > > > +typedef enum VhostDeviceStateDirection {
> > > > > +    /* Transfer state from back-end (device) to front-end */
> > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > +    /* Transfer state from front-end to back-end (device) */
> > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > +} VhostDeviceStateDirection;
> > > > > +
> > > > > +typedef enum VhostDeviceStatePhase {
> > > > > +    /* The device (and all its vrings) is stopped */
> > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > +} VhostDeviceStatePhase;
> > > >
> > > > vDPA has:
> > > >
> > > >   /* Suspend a device so it does not process virtqueue requests anymore
> > > >    *
> > > >    * After the return of ioctl the device must preserve all the necessary state
> > > >    * (the virtqueue vring base plus the possible device specific states) that is
> > > >    * required for restoring in the future. The device must not change its
> > > >    * configuration after that point.
> > > >    */
> > > >   #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > >
> > > >   /* Resume a device so it can resume processing virtqueue requests
> > > >    *
> > > >    * After the return of this ioctl the device will have restored all the
> > > >    * necessary states and it is fully operational to continue processing the
> > > >    * virtqueue descriptors.
> > > >    */
> > > >   #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > >
> > > > I wonder if it makes sense to import these into vhost-user so that the
> > > > difference between kernel vhost and vhost-user is minimized. It's okay
> > > > if one of them is ahead of the other, but it would be nice to avoid
> > > > overlapping/duplicated functionality.
> > > >
> > >
> > > That's what I had in mind in the first versions. I proposed VHOST_STOP
> > > instead of VHOST_VDPA_STOP for this very reason. Later it did change
> > > to SUSPEND.
> > >
> > > Generally it is better if we make the interface less parametrized and
> > > we trust in the messages and its semantics in my opinion. In other
> > > words, instead of
> > > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), send
> > > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user command.
> > >
> > > Another way to apply this is with the "direction" parameter. Maybe it
> > > is better to split it into "set_state_fd" and "get_state_fd"?
> > >
> > > In that case, reusing the ioctls as vhost-user messages would be ok.
> > > But that puts this proposal further from the VFIO code, which uses
> > > "migration_set_state(state)", and maybe it is better when the number
> > > of states is high.
> >
> > Hi Eugenio,
> > Another question about vDPA suspend/resume:
> >
> >   /* Host notifiers must be enabled at this point. */
> >   void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings)
> >   {
> >       int i;
> >
> >       /* should only be called after backend is connected */
> >       assert(hdev->vhost_ops);
> >       event_notifier_test_and_clear(
> >           &hdev->vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier);
> >       event_notifier_test_and_clear(&vdev->config_notifier);
> >
> >       trace_vhost_dev_stop(hdev, vdev->name, vrings);
> >
> >       if (hdev->vhost_ops->vhost_dev_start) {
> >           hdev->vhost_ops->vhost_dev_start(hdev, false);
> >           ^^^ SUSPEND ^^^
> >       }
> >       if (vrings) {
> >           vhost_dev_set_vring_enable(hdev, false);
> >       }
> >       for (i = 0; i < hdev->nvqs; ++i) {
> >           vhost_virtqueue_stop(hdev,
> >                                vdev,
> >                                hdev->vqs + i,
> >                                hdev->vq_index + i);
> >         ^^^ fetch virtqueue state from kernel ^^^
> >       }
> >       if (hdev->vhost_ops->vhost_reset_status) {
> >           hdev->vhost_ops->vhost_reset_status(hdev);
> >           ^^^ reset device^^^
> >
> > I noticed the QEMU vDPA code resets the device in vhost_dev_stop() ->
> > vhost_reset_status(). The device's migration code runs after
> > vhost_dev_stop() and the state will have been lost.
> >
>
> vhost_virtqueue_stop saves the vq state (indexes, vring base) in the
> qemu VirtIONet device model. This is for all vhost backends.
>
> Regarding the state like mac or mq configuration, SVQ runs for all the
> VM run in the CVQ. So it can track all of that status in the device
> model too.
>
> When a migration effectively occurs, all the frontend state is
> migrated as a regular emulated device. To route all of the state in a
> normalized way for qemu is what leaves open the possibility to do
> cross-backends migrations, etc.
>
> Does that answer your question?

I think you're confirming that changes would be necessary in order for
vDPA to support the save/load operation that Hanna is introducing.

> > It looks like vDPA changes are necessary in order to support stateful
> > devices even though QEMU already uses SUSPEND. Is my understanding
> > correct?
> >
>
> Changes are required elsewhere, as the code to restore the state
> properly in the destination has not been merged.

I'm not sure what you mean by elsewhere?

I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and
then VHOST_VDPA_SET_STATUS 0.

In order to save device state from the vDPA device in the future, it
will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that
the device state can be saved before the device is reset.

Does that sound right?

Stefan
Stefan Hajnoczi April 17, 2023, 7:33 p.m. UTC | #23
On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
> > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > >
> > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > So-called "internal" virtio-fs migration refers to transporting the
> > > > > back-end's (virtiofsd's) state through qemu's migration stream.  To do
> > > > > this, we need to be able to transfer virtiofsd's internal state to and
> > > > > from virtiofsd.
> > > > >
> > > > > Because virtiofsd's internal state will not be too large, we believe it
> > > > > is best to transfer it as a single binary blob after the streaming
> > > > > phase.  Because this method should be useful to other vhost-user
> > > > > implementations, too, it is introduced as a general-purpose addition to
> > > > > the protocol, not limited to vhost-user-fs.
> > > > >
> > > > > These are the additions to the protocol:
> > > > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > >   This feature signals support for transferring state, and is added so
> > > > >   that migration can fail early when the back-end has no support.
> > > > >
> > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> > > > >   over which to transfer the state.  The front-end sends an FD to the
> > > > >   back-end into/from which it can write/read its state, and the back-end
> > > > >   can decide to either use it, or reply with a different FD for the
> > > > >   front-end to override the front-end's choice.
> > > > >   The front-end creates a simple pipe to transfer the state, but maybe
> > > > >   the back-end already has an FD into/from which it has to write/read
> > > > >   its state, in which case it will want to override the simple pipe.
> > > > >   Conversely, maybe in the future we find a way to have the front-end
> > > > >   get an immediate FD for the migration stream (in some cases), in which
> > > > >   case we will want to send this to the back-end instead of creating a
> > > > >   pipe.
> > > > >   Hence the negotiation: If one side has a better idea than a plain
> > > > >   pipe, we will want to use that.
> > > > >
> > > > > - CHECK_DEVICE_STATE: After the state has been transferred through the
> > > > >   pipe (the end indicated by EOF), the front-end invokes this function
> > > > >   to verify success.  There is no in-band way (through the pipe) to
> > > > >   indicate failure, so we need to check explicitly.
> > > > >
> > > > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> > > > > (which includes establishing the direction of transfer and migration
> > > > > phase), the sending side writes its data into the pipe, and the reading
> > > > > side reads it until it sees an EOF.  Then, the front-end will check for
> > > > > success via CHECK_DEVICE_STATE, which on the destination side includes
> > > > > checking for integrity (i.e. errors during deserialization).
> > > > >
> > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > ---
> > > > >  include/hw/virtio/vhost-backend.h |  24 +++++
> > > > >  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > >  hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> > > > >  hw/virtio/vhost.c                 |  37 ++++++++
> > > > >  4 files changed, 287 insertions(+)
> > > > >
> > > > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > >      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > >  } VhostSetConfigType;
> > > > >
> > > > > +typedef enum VhostDeviceStateDirection {
> > > > > +    /* Transfer state from back-end (device) to front-end */
> > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > +    /* Transfer state from front-end to back-end (device) */
> > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > +} VhostDeviceStateDirection;
> > > > > +
> > > > > +typedef enum VhostDeviceStatePhase {
> > > > > +    /* The device (and all its vrings) is stopped */
> > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > +} VhostDeviceStatePhase;
> > > >
> > > > vDPA has:
> > > >
> > > >   /* Suspend a device so it does not process virtqueue requests anymore
> > > >    *
> > > >    * After the return of ioctl the device must preserve all the necessary state
> > > >    * (the virtqueue vring base plus the possible device specific states) that is
> > > >    * required for restoring in the future. The device must not change its
> > > >    * configuration after that point.
> > > >    */
> > > >   #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > >
> > > >   /* Resume a device so it can resume processing virtqueue requests
> > > >    *
> > > >    * After the return of this ioctl the device will have restored all the
> > > >    * necessary states and it is fully operational to continue processing the
> > > >    * virtqueue descriptors.
> > > >    */
> > > >   #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > >
> > > > I wonder if it makes sense to import these into vhost-user so that the
> > > > difference between kernel vhost and vhost-user is minimized. It's okay
> > > > if one of them is ahead of the other, but it would be nice to avoid
> > > > overlapping/duplicated functionality.
> > > >
> > >
> > > That's what I had in mind in the first versions. I proposed VHOST_STOP
> > > instead of VHOST_VDPA_STOP for this very reason. Later it did change
> > > to SUSPEND.
> >
> > I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not
> > ioctl(VHOST_VDPA_RESUME).
> >
> > The doc comments in <linux/vdpa.h> don't explain how the device can
> > leave the suspended state. Can you clarify this?
> >
>
> Do you mean in what situations or regarding the semantics of _RESUME?
>
> To me resume is an operation mainly to resume the device in the event
> of a VM suspension, not a migration. It can be used as a fallback code
> in some cases of migration failure though, but it is not currently
> used in qemu.

Is a "VM suspension" the QEMU HMP 'stop' command?

I guess the reason why QEMU doesn't call RESUME anywhere is that it
resets the device in vhost_dev_stop()?

Does it make sense to combine SUSPEND and RESUME with Hanna's
SET_DEVICE_STATE_FD? For example, non-iterative migration works like
this:
- Saving the device's state is done by SUSPEND followed by
SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g.
savevm command or migration failed), then RESUME is called to
continue.
- Loading the device's state is done by SUSPEND followed by
SET_DEVICE_STATE_FD, followed by RESUME.

Stefan
Stefan Hajnoczi April 17, 2023, 7:46 p.m. UTC | #24
On Mon, 17 Apr 2023 at 15:12, Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> On Mon, Apr 17, 2023 at 9:08 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >
> > On Mon, 17 Apr 2023 at 14:56, Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > >
> > > On Mon, Apr 17, 2023 at 5:18 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > >
> > > > On Fri, Apr 14, 2023 at 05:17:02PM +0200, Eugenio Perez Martin wrote:
> > > > > On Thu, Apr 13, 2023 at 7:55 PM Hanna Czenczek <hreitz@redhat.com> wrote:
> > > > > >
> > > > > > On 13.04.23 13:38, Stefan Hajnoczi wrote:
> > > > > > > On Thu, 13 Apr 2023 at 05:24, Hanna Czenczek <hreitz@redhat.com> wrote:
> > > > > > >> On 12.04.23 23:06, Stefan Hajnoczi wrote:
> > > > > > >>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > > Basically, what I’m hearing is that I need to implement a different
> > > > > > feature that has no practical impact right now, and also fix bugs around
> > > > > > it along the way...
> > > > > >
> > > > >
> > > > > To fix this properly requires iterative device migration in qemu as
> > > > > far as I know, instead of using VMStates [1]. This way the state is
> > > > > requested to virtiofsd before the device reset.
> > > >
> > > > I don't follow. Many devices are fine with non-iterative migration. They
> > > > shouldn't be forced to do iterative migration.
> > > >
> > >
> > > Sorry I think I didn't express myself well. I didn't mean to force
> > > virtiofsd to support the iterative migration, but to use the device
> > > iterative migration API in QEMU to send the needed commands before
> > > vhost_dev_stop. In that regard, the device or the vhost-user commands
> > > would not require changes.
> > >
> > > I think it is convenient in the long run for virtiofsd, as if the
> > > state grows so much that it's not feasible to fetch it in one shot
> > > there is no need to make changes in the qemu migration protocol. I
> > > think it is not unlikely in virtiofs, but maybe I'm missing something
> > > obvious and it's state will never grow.
> >
> > I don't understand. vCPUs are still running at that point and the
> > device state could change. It's not safe to save the full device state
> > until vCPUs have stopped (after vhost_dev_stop).
> >
>
> I think the vCPU is already stopped at save_live_complete_precopy
> callback. Maybe my understanding is wrong?

Agreed, vCPUs are stopped in save_live_complete_precopy(). However,
you wrote "use the device iterative migration API in QEMU to send the
needed commands before vhost_dev_stop". save_live_complete_precopy()
runs after vhost_dev_stop() so it doesn't seem to solve the problem.

Stefan
Eugenio Perez Martin April 18, 2023, 7:54 a.m. UTC | #25
On Mon, Apr 17, 2023 at 9:21 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com> wrote:
> >
> > On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
> > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > >
> > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > > So-called "internal" virtio-fs migration refers to transporting the
> > > > > > back-end's (virtiofsd's) state through qemu's migration stream.  To do
> > > > > > this, we need to be able to transfer virtiofsd's internal state to and
> > > > > > from virtiofsd.
> > > > > >
> > > > > > Because virtiofsd's internal state will not be too large, we believe it
> > > > > > is best to transfer it as a single binary blob after the streaming
> > > > > > phase.  Because this method should be useful to other vhost-user
> > > > > > implementations, too, it is introduced as a general-purpose addition to
> > > > > > the protocol, not limited to vhost-user-fs.
> > > > > >
> > > > > > These are the additions to the protocol:
> > > > > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > > >   This feature signals support for transferring state, and is added so
> > > > > >   that migration can fail early when the back-end has no support.
> > > > > >
> > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> > > > > >   over which to transfer the state.  The front-end sends an FD to the
> > > > > >   back-end into/from which it can write/read its state, and the back-end
> > > > > >   can decide to either use it, or reply with a different FD for the
> > > > > >   front-end to override the front-end's choice.
> > > > > >   The front-end creates a simple pipe to transfer the state, but maybe
> > > > > >   the back-end already has an FD into/from which it has to write/read
> > > > > >   its state, in which case it will want to override the simple pipe.
> > > > > >   Conversely, maybe in the future we find a way to have the front-end
> > > > > >   get an immediate FD for the migration stream (in some cases), in which
> > > > > >   case we will want to send this to the back-end instead of creating a
> > > > > >   pipe.
> > > > > >   Hence the negotiation: If one side has a better idea than a plain
> > > > > >   pipe, we will want to use that.
> > > > > >
> > > > > > - CHECK_DEVICE_STATE: After the state has been transferred through the
> > > > > >   pipe (the end indicated by EOF), the front-end invokes this function
> > > > > >   to verify success.  There is no in-band way (through the pipe) to
> > > > > >   indicate failure, so we need to check explicitly.
> > > > > >
> > > > > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> > > > > > (which includes establishing the direction of transfer and migration
> > > > > > phase), the sending side writes its data into the pipe, and the reading
> > > > > > side reads it until it sees an EOF.  Then, the front-end will check for
> > > > > > success via CHECK_DEVICE_STATE, which on the destination side includes
> > > > > > checking for integrity (i.e. errors during deserialization).
> > > > > >
> > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > ---
> > > > > >  include/hw/virtio/vhost-backend.h |  24 +++++
> > > > > >  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > > >  hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> > > > > >  hw/virtio/vhost.c                 |  37 ++++++++
> > > > > >  4 files changed, 287 insertions(+)
> > > > > >
> > > > > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > > >      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > > >  } VhostSetConfigType;
> > > > > >
> > > > > > +typedef enum VhostDeviceStateDirection {
> > > > > > +    /* Transfer state from back-end (device) to front-end */
> > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > > +    /* Transfer state from front-end to back-end (device) */
> > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > > +} VhostDeviceStateDirection;
> > > > > > +
> > > > > > +typedef enum VhostDeviceStatePhase {
> > > > > > +    /* The device (and all its vrings) is stopped */
> > > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > > +} VhostDeviceStatePhase;
> > > > >
> > > > > vDPA has:
> > > > >
> > > > >   /* Suspend a device so it does not process virtqueue requests anymore
> > > > >    *
> > > > >    * After the return of ioctl the device must preserve all the necessary state
> > > > >    * (the virtqueue vring base plus the possible device specific states) that is
> > > > >    * required for restoring in the future. The device must not change its
> > > > >    * configuration after that point.
> > > > >    */
> > > > >   #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > > >
> > > > >   /* Resume a device so it can resume processing virtqueue requests
> > > > >    *
> > > > >    * After the return of this ioctl the device will have restored all the
> > > > >    * necessary states and it is fully operational to continue processing the
> > > > >    * virtqueue descriptors.
> > > > >    */
> > > > >   #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > > >
> > > > > I wonder if it makes sense to import these into vhost-user so that the
> > > > > difference between kernel vhost and vhost-user is minimized. It's okay
> > > > > if one of them is ahead of the other, but it would be nice to avoid
> > > > > overlapping/duplicated functionality.
> > > > >
> > > >
> > > > That's what I had in mind in the first versions. I proposed VHOST_STOP
> > > > instead of VHOST_VDPA_STOP for this very reason. Later it did change
> > > > to SUSPEND.
> > > >
> > > > Generally it is better if we make the interface less parametrized and
> > > > we trust in the messages and its semantics in my opinion. In other
> > > > words, instead of
> > > > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), send
> > > > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user command.
> > > >
> > > > Another way to apply this is with the "direction" parameter. Maybe it
> > > > is better to split it into "set_state_fd" and "get_state_fd"?
> > > >
> > > > In that case, reusing the ioctls as vhost-user messages would be ok.
> > > > But that puts this proposal further from the VFIO code, which uses
> > > > "migration_set_state(state)", and maybe it is better when the number
> > > > of states is high.
> > >
> > > Hi Eugenio,
> > > Another question about vDPA suspend/resume:
> > >
> > >   /* Host notifiers must be enabled at this point. */
> > >   void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings)
> > >   {
> > >       int i;
> > >
> > >       /* should only be called after backend is connected */
> > >       assert(hdev->vhost_ops);
> > >       event_notifier_test_and_clear(
> > >           &hdev->vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier);
> > >       event_notifier_test_and_clear(&vdev->config_notifier);
> > >
> > >       trace_vhost_dev_stop(hdev, vdev->name, vrings);
> > >
> > >       if (hdev->vhost_ops->vhost_dev_start) {
> > >           hdev->vhost_ops->vhost_dev_start(hdev, false);
> > >           ^^^ SUSPEND ^^^
> > >       }
> > >       if (vrings) {
> > >           vhost_dev_set_vring_enable(hdev, false);
> > >       }
> > >       for (i = 0; i < hdev->nvqs; ++i) {
> > >           vhost_virtqueue_stop(hdev,
> > >                                vdev,
> > >                                hdev->vqs + i,
> > >                                hdev->vq_index + i);
> > >         ^^^ fetch virtqueue state from kernel ^^^
> > >       }
> > >       if (hdev->vhost_ops->vhost_reset_status) {
> > >           hdev->vhost_ops->vhost_reset_status(hdev);
> > >           ^^^ reset device^^^
> > >
> > > I noticed the QEMU vDPA code resets the device in vhost_dev_stop() ->
> > > vhost_reset_status(). The device's migration code runs after
> > > vhost_dev_stop() and the state will have been lost.
> > >
> >
> > vhost_virtqueue_stop saves the vq state (indexes, vring base) in the
> > qemu VirtIONet device model. This is for all vhost backends.
> >
> > Regarding the state like mac or mq configuration, SVQ runs for all the
> > VM run in the CVQ. So it can track all of that status in the device
> > model too.
> >
> > When a migration effectively occurs, all the frontend state is
> > migrated as a regular emulated device. To route all of the state in a
> > normalized way for qemu is what leaves open the possibility to do
> > cross-backends migrations, etc.
> >
> > Does that answer your question?
>
> I think you're confirming that changes would be necessary in order for
> vDPA to support the save/load operation that Hanna is introducing.
>

Yes, this first iteration was centered on net, with an eye on block,
where state can be routed through classical emulated devices. This is
how vhost-kernel and vhost-user do classically. And it allows
cross-backend, to not modify qemu migration state, etc.

To introduce this opaque state to qemu, that must be fetched after the
suspend and not before, requires changes in vhost protocol, as
discussed previously.

> > > It looks like vDPA changes are necessary in order to support stateful
> > > devices even though QEMU already uses SUSPEND. Is my understanding
> > > correct?
> > >
> >
> > Changes are required elsewhere, as the code to restore the state
> > properly in the destination has not been merged.
>
> I'm not sure what you mean by elsewhere?
>

I meant for vdpa *net* devices the changes are not required in vdpa
ioctls, but mostly in qemu.

If you meant stateful as "it must have a state blob that it must be
opaque to qemu", then I think the straightforward action is to fetch
state blob about the same time as vq indexes. But yes, changes (at
least a new ioctl) is needed for that.

> I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and
> then VHOST_VDPA_SET_STATUS 0.
>
> In order to save device state from the vDPA device in the future, it
> will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that
> the device state can be saved before the device is reset.
>
> Does that sound right?
>

The split between suspend and reset was added recently for that very
reason. In all the virtio devices, the frontend is initialized before
the backend, so I don't think it is a good idea to defer the backend
cleanup. Especially if we have already set the state is small enough
to not needing iterative migration from virtiofsd point of view.

If fetching that state at the same time as vq indexes is not valid,
could it follow the same model as the "in-flight descriptors"?
vhost-user follows them by using a shared memory region where their
state is tracked [1]. This allows qemu to survive vhost-user SW
backend crashes, and does not forbid the cross-backends live migration
as all the information is there to recover them.

For hw devices this is not convenient as it occupies PCI bandwidth. So
a possibility is to synchronize this memory region after a
synchronization point, being the SUSPEND call or GET_VRING_BASE. HW
devices are not going to crash in the software sense, so all use cases
remain the same to qemu. And that shared memory information is
recoverable after vhost_dev_stop.

Does that sound reasonable to virtiofsd? To offer a shared memory
region where it dumps the state, maybe only after the
set_state(STATE_PHASE_STOPPED)?

Thanks!

[1] https://qemu.readthedocs.io/en/latest/interop/vhost-user.html#inflight-i-o-tracking
Eugenio Perez Martin April 18, 2023, 8:09 a.m. UTC | #26
On Mon, Apr 17, 2023 at 9:33 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin <eperezma@redhat.com> wrote:
> >
> > On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
> > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > >
> > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > > So-called "internal" virtio-fs migration refers to transporting the
> > > > > > back-end's (virtiofsd's) state through qemu's migration stream.  To do
> > > > > > this, we need to be able to transfer virtiofsd's internal state to and
> > > > > > from virtiofsd.
> > > > > >
> > > > > > Because virtiofsd's internal state will not be too large, we believe it
> > > > > > is best to transfer it as a single binary blob after the streaming
> > > > > > phase.  Because this method should be useful to other vhost-user
> > > > > > implementations, too, it is introduced as a general-purpose addition to
> > > > > > the protocol, not limited to vhost-user-fs.
> > > > > >
> > > > > > These are the additions to the protocol:
> > > > > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > > >   This feature signals support for transferring state, and is added so
> > > > > >   that migration can fail early when the back-end has no support.
> > > > > >
> > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> > > > > >   over which to transfer the state.  The front-end sends an FD to the
> > > > > >   back-end into/from which it can write/read its state, and the back-end
> > > > > >   can decide to either use it, or reply with a different FD for the
> > > > > >   front-end to override the front-end's choice.
> > > > > >   The front-end creates a simple pipe to transfer the state, but maybe
> > > > > >   the back-end already has an FD into/from which it has to write/read
> > > > > >   its state, in which case it will want to override the simple pipe.
> > > > > >   Conversely, maybe in the future we find a way to have the front-end
> > > > > >   get an immediate FD for the migration stream (in some cases), in which
> > > > > >   case we will want to send this to the back-end instead of creating a
> > > > > >   pipe.
> > > > > >   Hence the negotiation: If one side has a better idea than a plain
> > > > > >   pipe, we will want to use that.
> > > > > >
> > > > > > - CHECK_DEVICE_STATE: After the state has been transferred through the
> > > > > >   pipe (the end indicated by EOF), the front-end invokes this function
> > > > > >   to verify success.  There is no in-band way (through the pipe) to
> > > > > >   indicate failure, so we need to check explicitly.
> > > > > >
> > > > > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> > > > > > (which includes establishing the direction of transfer and migration
> > > > > > phase), the sending side writes its data into the pipe, and the reading
> > > > > > side reads it until it sees an EOF.  Then, the front-end will check for
> > > > > > success via CHECK_DEVICE_STATE, which on the destination side includes
> > > > > > checking for integrity (i.e. errors during deserialization).
> > > > > >
> > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > ---
> > > > > >  include/hw/virtio/vhost-backend.h |  24 +++++
> > > > > >  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > > >  hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> > > > > >  hw/virtio/vhost.c                 |  37 ++++++++
> > > > > >  4 files changed, 287 insertions(+)
> > > > > >
> > > > > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > > >      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > > >  } VhostSetConfigType;
> > > > > >
> > > > > > +typedef enum VhostDeviceStateDirection {
> > > > > > +    /* Transfer state from back-end (device) to front-end */
> > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > > +    /* Transfer state from front-end to back-end (device) */
> > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > > +} VhostDeviceStateDirection;
> > > > > > +
> > > > > > +typedef enum VhostDeviceStatePhase {
> > > > > > +    /* The device (and all its vrings) is stopped */
> > > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > > +} VhostDeviceStatePhase;
> > > > >
> > > > > vDPA has:
> > > > >
> > > > >   /* Suspend a device so it does not process virtqueue requests anymore
> > > > >    *
> > > > >    * After the return of ioctl the device must preserve all the necessary state
> > > > >    * (the virtqueue vring base plus the possible device specific states) that is
> > > > >    * required for restoring in the future. The device must not change its
> > > > >    * configuration after that point.
> > > > >    */
> > > > >   #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > > >
> > > > >   /* Resume a device so it can resume processing virtqueue requests
> > > > >    *
> > > > >    * After the return of this ioctl the device will have restored all the
> > > > >    * necessary states and it is fully operational to continue processing the
> > > > >    * virtqueue descriptors.
> > > > >    */
> > > > >   #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > > >
> > > > > I wonder if it makes sense to import these into vhost-user so that the
> > > > > difference between kernel vhost and vhost-user is minimized. It's okay
> > > > > if one of them is ahead of the other, but it would be nice to avoid
> > > > > overlapping/duplicated functionality.
> > > > >
> > > >
> > > > That's what I had in mind in the first versions. I proposed VHOST_STOP
> > > > instead of VHOST_VDPA_STOP for this very reason. Later it did change
> > > > to SUSPEND.
> > >
> > > I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not
> > > ioctl(VHOST_VDPA_RESUME).
> > >
> > > The doc comments in <linux/vdpa.h> don't explain how the device can
> > > leave the suspended state. Can you clarify this?
> > >
> >
> > Do you mean in what situations or regarding the semantics of _RESUME?
> >
> > To me resume is an operation mainly to resume the device in the event
> > of a VM suspension, not a migration. It can be used as a fallback code
> > in some cases of migration failure though, but it is not currently
> > used in qemu.
>
> Is a "VM suspension" the QEMU HMP 'stop' command?
>
> I guess the reason why QEMU doesn't call RESUME anywhere is that it
> resets the device in vhost_dev_stop()?
>

The actual reason for not using RESUME is that the ioctl was added
after the SUSPEND design in qemu. Same as this proposal, it is was not
needed at the time.

In the case of vhost-vdpa net, the only usage of suspend is to fetch
the vq indexes, and in case of error vhost already fetches them from
guest's used ring way before vDPA, so it has little usage.

> Does it make sense to combine SUSPEND and RESUME with Hanna's
> SET_DEVICE_STATE_FD? For example, non-iterative migration works like
> this:
> - Saving the device's state is done by SUSPEND followed by
> SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g.
> savevm command or migration failed), then RESUME is called to
> continue.

I think the previous steps make sense at vhost_dev_stop, not virtio
savevm handlers. To start spreading this logic to more places of qemu
can bring confusion.

> - Loading the device's state is done by SUSPEND followed by
> SET_DEVICE_STATE_FD, followed by RESUME.
>

I think the restore makes more sense after reset and before driver_ok,
suspend does not seem a right call there. SUSPEND implies there may be
other operations before, so the device may have processed some
requests wrong, as it is not in the right state.

Thanks!
Eugenio Perez Martin April 18, 2023, 10:09 a.m. UTC | #27
On Mon, Apr 17, 2023 at 9:46 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Mon, 17 Apr 2023 at 15:12, Eugenio Perez Martin <eperezma@redhat.com> wrote:
> >
> > On Mon, Apr 17, 2023 at 9:08 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > >
> > > On Mon, 17 Apr 2023 at 14:56, Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > >
> > > > On Mon, Apr 17, 2023 at 5:18 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > >
> > > > > On Fri, Apr 14, 2023 at 05:17:02PM +0200, Eugenio Perez Martin wrote:
> > > > > > On Thu, Apr 13, 2023 at 7:55 PM Hanna Czenczek <hreitz@redhat.com> wrote:
> > > > > > >
> > > > > > > On 13.04.23 13:38, Stefan Hajnoczi wrote:
> > > > > > > > On Thu, 13 Apr 2023 at 05:24, Hanna Czenczek <hreitz@redhat.com> wrote:
> > > > > > > >> On 12.04.23 23:06, Stefan Hajnoczi wrote:
> > > > > > > >>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > > > Basically, what I’m hearing is that I need to implement a different
> > > > > > > feature that has no practical impact right now, and also fix bugs around
> > > > > > > it along the way...
> > > > > > >
> > > > > >
> > > > > > To fix this properly requires iterative device migration in qemu as
> > > > > > far as I know, instead of using VMStates [1]. This way the state is
> > > > > > requested to virtiofsd before the device reset.
> > > > >
> > > > > I don't follow. Many devices are fine with non-iterative migration. They
> > > > > shouldn't be forced to do iterative migration.
> > > > >
> > > >
> > > > Sorry I think I didn't express myself well. I didn't mean to force
> > > > virtiofsd to support the iterative migration, but to use the device
> > > > iterative migration API in QEMU to send the needed commands before
> > > > vhost_dev_stop. In that regard, the device or the vhost-user commands
> > > > would not require changes.
> > > >
> > > > I think it is convenient in the long run for virtiofsd, as if the
> > > > state grows so much that it's not feasible to fetch it in one shot
> > > > there is no need to make changes in the qemu migration protocol. I
> > > > think it is not unlikely in virtiofs, but maybe I'm missing something
> > > > obvious and it's state will never grow.
> > >
> > > I don't understand. vCPUs are still running at that point and the
> > > device state could change. It's not safe to save the full device state
> > > until vCPUs have stopped (after vhost_dev_stop).
> > >
> >
> > I think the vCPU is already stopped at save_live_complete_precopy
> > callback. Maybe my understanding is wrong?
>
> Agreed, vCPUs are stopped in save_live_complete_precopy(). However,
> you wrote "use the device iterative migration API in QEMU to send the
> needed commands before vhost_dev_stop". save_live_complete_precopy()
> runs after vhost_dev_stop() so it doesn't seem to solve the problem.
>

You're right, and it actually makes the most sense.

So I guess this converges with the other thread, let's follow the
discussion there.

Thanks!
Stefan Hajnoczi April 18, 2023, 5:59 p.m. UTC | #28
On Tue, Apr 18, 2023 at 10:09:30AM +0200, Eugenio Perez Martin wrote:
> On Mon, Apr 17, 2023 at 9:33 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >
> > On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > >
> > > On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > >
> > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
> > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > >
> > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > > > So-called "internal" virtio-fs migration refers to transporting the
> > > > > > > back-end's (virtiofsd's) state through qemu's migration stream.  To do
> > > > > > > this, we need to be able to transfer virtiofsd's internal state to and
> > > > > > > from virtiofsd.
> > > > > > >
> > > > > > > Because virtiofsd's internal state will not be too large, we believe it
> > > > > > > is best to transfer it as a single binary blob after the streaming
> > > > > > > phase.  Because this method should be useful to other vhost-user
> > > > > > > implementations, too, it is introduced as a general-purpose addition to
> > > > > > > the protocol, not limited to vhost-user-fs.
> > > > > > >
> > > > > > > These are the additions to the protocol:
> > > > > > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > > > >   This feature signals support for transferring state, and is added so
> > > > > > >   that migration can fail early when the back-end has no support.
> > > > > > >
> > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> > > > > > >   over which to transfer the state.  The front-end sends an FD to the
> > > > > > >   back-end into/from which it can write/read its state, and the back-end
> > > > > > >   can decide to either use it, or reply with a different FD for the
> > > > > > >   front-end to override the front-end's choice.
> > > > > > >   The front-end creates a simple pipe to transfer the state, but maybe
> > > > > > >   the back-end already has an FD into/from which it has to write/read
> > > > > > >   its state, in which case it will want to override the simple pipe.
> > > > > > >   Conversely, maybe in the future we find a way to have the front-end
> > > > > > >   get an immediate FD for the migration stream (in some cases), in which
> > > > > > >   case we will want to send this to the back-end instead of creating a
> > > > > > >   pipe.
> > > > > > >   Hence the negotiation: If one side has a better idea than a plain
> > > > > > >   pipe, we will want to use that.
> > > > > > >
> > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred through the
> > > > > > >   pipe (the end indicated by EOF), the front-end invokes this function
> > > > > > >   to verify success.  There is no in-band way (through the pipe) to
> > > > > > >   indicate failure, so we need to check explicitly.
> > > > > > >
> > > > > > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> > > > > > > (which includes establishing the direction of transfer and migration
> > > > > > > phase), the sending side writes its data into the pipe, and the reading
> > > > > > > side reads it until it sees an EOF.  Then, the front-end will check for
> > > > > > > success via CHECK_DEVICE_STATE, which on the destination side includes
> > > > > > > checking for integrity (i.e. errors during deserialization).
> > > > > > >
> > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > > ---
> > > > > > >  include/hw/virtio/vhost-backend.h |  24 +++++
> > > > > > >  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > > > >  hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> > > > > > >  hw/virtio/vhost.c                 |  37 ++++++++
> > > > > > >  4 files changed, 287 insertions(+)
> > > > > > >
> > > > > > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > > > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > > > >      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > > > >  } VhostSetConfigType;
> > > > > > >
> > > > > > > +typedef enum VhostDeviceStateDirection {
> > > > > > > +    /* Transfer state from back-end (device) to front-end */
> > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > > > +    /* Transfer state from front-end to back-end (device) */
> > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > > > +} VhostDeviceStateDirection;
> > > > > > > +
> > > > > > > +typedef enum VhostDeviceStatePhase {
> > > > > > > +    /* The device (and all its vrings) is stopped */
> > > > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > > > +} VhostDeviceStatePhase;
> > > > > >
> > > > > > vDPA has:
> > > > > >
> > > > > >   /* Suspend a device so it does not process virtqueue requests anymore
> > > > > >    *
> > > > > >    * After the return of ioctl the device must preserve all the necessary state
> > > > > >    * (the virtqueue vring base plus the possible device specific states) that is
> > > > > >    * required for restoring in the future. The device must not change its
> > > > > >    * configuration after that point.
> > > > > >    */
> > > > > >   #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > > > >
> > > > > >   /* Resume a device so it can resume processing virtqueue requests
> > > > > >    *
> > > > > >    * After the return of this ioctl the device will have restored all the
> > > > > >    * necessary states and it is fully operational to continue processing the
> > > > > >    * virtqueue descriptors.
> > > > > >    */
> > > > > >   #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > > > >
> > > > > > I wonder if it makes sense to import these into vhost-user so that the
> > > > > > difference between kernel vhost and vhost-user is minimized. It's okay
> > > > > > if one of them is ahead of the other, but it would be nice to avoid
> > > > > > overlapping/duplicated functionality.
> > > > > >
> > > > >
> > > > > That's what I had in mind in the first versions. I proposed VHOST_STOP
> > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did change
> > > > > to SUSPEND.
> > > >
> > > > I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not
> > > > ioctl(VHOST_VDPA_RESUME).
> > > >
> > > > The doc comments in <linux/vdpa.h> don't explain how the device can
> > > > leave the suspended state. Can you clarify this?
> > > >
> > >
> > > Do you mean in what situations or regarding the semantics of _RESUME?
> > >
> > > To me resume is an operation mainly to resume the device in the event
> > > of a VM suspension, not a migration. It can be used as a fallback code
> > > in some cases of migration failure though, but it is not currently
> > > used in qemu.
> >
> > Is a "VM suspension" the QEMU HMP 'stop' command?
> >
> > I guess the reason why QEMU doesn't call RESUME anywhere is that it
> > resets the device in vhost_dev_stop()?
> >
> 
> The actual reason for not using RESUME is that the ioctl was added
> after the SUSPEND design in qemu. Same as this proposal, it is was not
> needed at the time.
> 
> In the case of vhost-vdpa net, the only usage of suspend is to fetch
> the vq indexes, and in case of error vhost already fetches them from
> guest's used ring way before vDPA, so it has little usage.
> 
> > Does it make sense to combine SUSPEND and RESUME with Hanna's
> > SET_DEVICE_STATE_FD? For example, non-iterative migration works like
> > this:
> > - Saving the device's state is done by SUSPEND followed by
> > SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g.
> > savevm command or migration failed), then RESUME is called to
> > continue.
> 
> I think the previous steps make sense at vhost_dev_stop, not virtio
> savevm handlers. To start spreading this logic to more places of qemu
> can bring confusion.

I don't think there is a way around extending the QEMU vhost's code
model. The current model in QEMU's vhost code is that the backend is
reset when the VM stops. This model worked fine for stateless devices
but it doesn't work for stateful devices.

Imagine a vdpa-gpu device: you cannot reset the device in
vhost_dev_stop() and expect the GPU to continue working when
vhost_dev_start() is called again because all its state has been lost.
The guest driver will send requests that references a virtio-gpu
resources that no longer exist.

One solution is to save the device's state in vhost_dev_stop(). I think
this is what you're suggesting. It requires keeping a copy of the state
and then loading the state again in vhost_dev_start(). I don't think
this approach should be used because it requires all stateful devices to
support live migration (otherwise they break across HMP 'stop'/'cont').
Also, the device state for some devices may be large and it would also
become more complicated when iterative migration is added.

Instead, I think the QEMU vhost code needs to be structured so that
struct vhost_dev has a suspended state:

        ,---------.
	v         |
  started ------> stopped
    \   ^
     \  |
      -> suspended

The device doesn't lose state when it enters the suspended state. It can
be resumed again.

This is why I think SUSPEND/RESUME need to be part of the solution.
(It's also an argument for not including the phase argument in
SET_DEVICE_STATE_FD because the SUSPEND message is sent during
vhost_dev_stop() separately from saving the device's state.)

> > - Loading the device's state is done by SUSPEND followed by
> > SET_DEVICE_STATE_FD, followed by RESUME.
> >
> 
> I think the restore makes more sense after reset and before driver_ok,
> suspend does not seem a right call there. SUSPEND implies there may be
> other operations before, so the device may have processed some
> requests wrong, as it is not in the right state.

I find it more elegant to allow SUSPEND -> load -> RESUME if the device
state is saved using SUSPEND -> save -> RESUME since the operations are
symmetrical, but requiring the device to be reset works too. Here is my
understanding of your idea in more detail:

The VIRTIO Device Status Field value must be ACKNOWLEDGE | DRIVER |
FEATURES_OK, any device initialization configuration space writes must
be done, and virtqueues must be configured (Step 7 of 3.1.1 Driver
Requirements in VIRTIO 1.2).

At that point the device is able to parse the device state and set up
its internal state. Doing it any earlier (before feature negotiation or
virtqueue configuration) places the device in the awkward situation of
having to keep the device state in a buffer and defer loading it until
later, which is complex.

After device state loading is complete, the DRIVER_OK bit is set to
resume device operation.

Saving device state is only allowed when the DRIVER_OK bit has been set.

Does this sound right?

Stefan
Eugenio Perez Martin April 18, 2023, 6:31 p.m. UTC | #29
On Tue, Apr 18, 2023 at 7:59 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Tue, Apr 18, 2023 at 10:09:30AM +0200, Eugenio Perez Martin wrote:
> > On Mon, Apr 17, 2023 at 9:33 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > >
> > > On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > >
> > > > On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > >
> > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
> > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > >
> > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > > > > So-called "internal" virtio-fs migration refers to transporting the
> > > > > > > > back-end's (virtiofsd's) state through qemu's migration stream.  To do
> > > > > > > > this, we need to be able to transfer virtiofsd's internal state to and
> > > > > > > > from virtiofsd.
> > > > > > > >
> > > > > > > > Because virtiofsd's internal state will not be too large, we believe it
> > > > > > > > is best to transfer it as a single binary blob after the streaming
> > > > > > > > phase.  Because this method should be useful to other vhost-user
> > > > > > > > implementations, too, it is introduced as a general-purpose addition to
> > > > > > > > the protocol, not limited to vhost-user-fs.
> > > > > > > >
> > > > > > > > These are the additions to the protocol:
> > > > > > > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > > > > >   This feature signals support for transferring state, and is added so
> > > > > > > >   that migration can fail early when the back-end has no support.
> > > > > > > >
> > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> > > > > > > >   over which to transfer the state.  The front-end sends an FD to the
> > > > > > > >   back-end into/from which it can write/read its state, and the back-end
> > > > > > > >   can decide to either use it, or reply with a different FD for the
> > > > > > > >   front-end to override the front-end's choice.
> > > > > > > >   The front-end creates a simple pipe to transfer the state, but maybe
> > > > > > > >   the back-end already has an FD into/from which it has to write/read
> > > > > > > >   its state, in which case it will want to override the simple pipe.
> > > > > > > >   Conversely, maybe in the future we find a way to have the front-end
> > > > > > > >   get an immediate FD for the migration stream (in some cases), in which
> > > > > > > >   case we will want to send this to the back-end instead of creating a
> > > > > > > >   pipe.
> > > > > > > >   Hence the negotiation: If one side has a better idea than a plain
> > > > > > > >   pipe, we will want to use that.
> > > > > > > >
> > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred through the
> > > > > > > >   pipe (the end indicated by EOF), the front-end invokes this function
> > > > > > > >   to verify success.  There is no in-band way (through the pipe) to
> > > > > > > >   indicate failure, so we need to check explicitly.
> > > > > > > >
> > > > > > > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> > > > > > > > (which includes establishing the direction of transfer and migration
> > > > > > > > phase), the sending side writes its data into the pipe, and the reading
> > > > > > > > side reads it until it sees an EOF.  Then, the front-end will check for
> > > > > > > > success via CHECK_DEVICE_STATE, which on the destination side includes
> > > > > > > > checking for integrity (i.e. errors during deserialization).
> > > > > > > >
> > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > > > ---
> > > > > > > >  include/hw/virtio/vhost-backend.h |  24 +++++
> > > > > > > >  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > > > > >  hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> > > > > > > >  hw/virtio/vhost.c                 |  37 ++++++++
> > > > > > > >  4 files changed, 287 insertions(+)
> > > > > > > >
> > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > > > > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > > > > >      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > > > > >  } VhostSetConfigType;
> > > > > > > >
> > > > > > > > +typedef enum VhostDeviceStateDirection {
> > > > > > > > +    /* Transfer state from back-end (device) to front-end */
> > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > > > > +    /* Transfer state from front-end to back-end (device) */
> > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > > > > +} VhostDeviceStateDirection;
> > > > > > > > +
> > > > > > > > +typedef enum VhostDeviceStatePhase {
> > > > > > > > +    /* The device (and all its vrings) is stopped */
> > > > > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > > > > +} VhostDeviceStatePhase;
> > > > > > >
> > > > > > > vDPA has:
> > > > > > >
> > > > > > >   /* Suspend a device so it does not process virtqueue requests anymore
> > > > > > >    *
> > > > > > >    * After the return of ioctl the device must preserve all the necessary state
> > > > > > >    * (the virtqueue vring base plus the possible device specific states) that is
> > > > > > >    * required for restoring in the future. The device must not change its
> > > > > > >    * configuration after that point.
> > > > > > >    */
> > > > > > >   #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > > > > >
> > > > > > >   /* Resume a device so it can resume processing virtqueue requests
> > > > > > >    *
> > > > > > >    * After the return of this ioctl the device will have restored all the
> > > > > > >    * necessary states and it is fully operational to continue processing the
> > > > > > >    * virtqueue descriptors.
> > > > > > >    */
> > > > > > >   #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > > > > >
> > > > > > > I wonder if it makes sense to import these into vhost-user so that the
> > > > > > > difference between kernel vhost and vhost-user is minimized. It's okay
> > > > > > > if one of them is ahead of the other, but it would be nice to avoid
> > > > > > > overlapping/duplicated functionality.
> > > > > > >
> > > > > >
> > > > > > That's what I had in mind in the first versions. I proposed VHOST_STOP
> > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did change
> > > > > > to SUSPEND.
> > > > >
> > > > > I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not
> > > > > ioctl(VHOST_VDPA_RESUME).
> > > > >
> > > > > The doc comments in <linux/vdpa.h> don't explain how the device can
> > > > > leave the suspended state. Can you clarify this?
> > > > >
> > > >
> > > > Do you mean in what situations or regarding the semantics of _RESUME?
> > > >
> > > > To me resume is an operation mainly to resume the device in the event
> > > > of a VM suspension, not a migration. It can be used as a fallback code
> > > > in some cases of migration failure though, but it is not currently
> > > > used in qemu.
> > >
> > > Is a "VM suspension" the QEMU HMP 'stop' command?
> > >
> > > I guess the reason why QEMU doesn't call RESUME anywhere is that it
> > > resets the device in vhost_dev_stop()?
> > >
> >
> > The actual reason for not using RESUME is that the ioctl was added
> > after the SUSPEND design in qemu. Same as this proposal, it is was not
> > needed at the time.
> >
> > In the case of vhost-vdpa net, the only usage of suspend is to fetch
> > the vq indexes, and in case of error vhost already fetches them from
> > guest's used ring way before vDPA, so it has little usage.
> >
> > > Does it make sense to combine SUSPEND and RESUME with Hanna's
> > > SET_DEVICE_STATE_FD? For example, non-iterative migration works like
> > > this:
> > > - Saving the device's state is done by SUSPEND followed by
> > > SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g.
> > > savevm command or migration failed), then RESUME is called to
> > > continue.
> >
> > I think the previous steps make sense at vhost_dev_stop, not virtio
> > savevm handlers. To start spreading this logic to more places of qemu
> > can bring confusion.
>
> I don't think there is a way around extending the QEMU vhost's code
> model. The current model in QEMU's vhost code is that the backend is
> reset when the VM stops. This model worked fine for stateless devices
> but it doesn't work for stateful devices.
>
> Imagine a vdpa-gpu device: you cannot reset the device in
> vhost_dev_stop() and expect the GPU to continue working when
> vhost_dev_start() is called again because all its state has been lost.
> The guest driver will send requests that references a virtio-gpu
> resources that no longer exist.
>
> One solution is to save the device's state in vhost_dev_stop(). I think
> this is what you're suggesting. It requires keeping a copy of the state
> and then loading the state again in vhost_dev_start(). I don't think
> this approach should be used because it requires all stateful devices to
> support live migration (otherwise they break across HMP 'stop'/'cont').
> Also, the device state for some devices may be large and it would also
> become more complicated when iterative migration is added.
>
> Instead, I think the QEMU vhost code needs to be structured so that
> struct vhost_dev has a suspended state:
>
>         ,---------.
>         v         |
>   started ------> stopped
>     \   ^
>      \  |
>       -> suspended
>
> The device doesn't lose state when it enters the suspended state. It can
> be resumed again.
>
> This is why I think SUSPEND/RESUME need to be part of the solution.

I agree with all of this, especially after realizing vhost_dev_stop is
called before the last request of the state in the iterative
migration.

However I think we can move faster with the virtiofsd migration code,
as long as we agree on the vhost-user messages it will receive. This
is because we already agree that the state will be sent in one shot
and not iteratively, so it will be small.

I understand this may change in the future, that's why I proposed to
start using iterative right now. However it may make little sense if
it is not used in the vhost-user device. I also understand that other
devices may have a bigger state so it will be needed for them.

> (It's also an argument for not including the phase argument in
> SET_DEVICE_STATE_FD because the SUSPEND message is sent during
> vhost_dev_stop() separately from saving the device's state.)
>
> > > - Loading the device's state is done by SUSPEND followed by
> > > SET_DEVICE_STATE_FD, followed by RESUME.
> > >
> >
> > I think the restore makes more sense after reset and before driver_ok,
> > suspend does not seem a right call there. SUSPEND implies there may be
> > other operations before, so the device may have processed some
> > requests wrong, as it is not in the right state.
>
> I find it more elegant to allow SUSPEND -> load -> RESUME if the device
> state is saved using SUSPEND -> save -> RESUME since the operations are
> symmetrical, but requiring the device to be reset works too. Here is my
> understanding of your idea in more detail:
>
> The VIRTIO Device Status Field value must be ACKNOWLEDGE | DRIVER |
> FEATURES_OK, any device initialization configuration space writes must
> be done, and virtqueues must be configured (Step 7 of 3.1.1 Driver
> Requirements in VIRTIO 1.2).
>
> At that point the device is able to parse the device state and set up
> its internal state. Doing it any earlier (before feature negotiation or
> virtqueue configuration) places the device in the awkward situation of
> having to keep the device state in a buffer and defer loading it until
> later, which is complex.
>
> After device state loading is complete, the DRIVER_OK bit is set to
> resume device operation.
>
> Saving device state is only allowed when the DRIVER_OK bit has been set.
>
> Does this sound right?
>

Yes, I see it is accurate. If you agree that SUSPEND only makes sense
after DRIVER_OK, to restore the state while suspended complicates the
state machine by a lot. The device spec is simpler with these
restrictions in my opinion.

Thanks!
Stefan Hajnoczi April 18, 2023, 8:40 p.m. UTC | #30
On Tue, 18 Apr 2023 at 14:31, Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> On Tue, Apr 18, 2023 at 7:59 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Tue, Apr 18, 2023 at 10:09:30AM +0200, Eugenio Perez Martin wrote:
> > > On Mon, Apr 17, 2023 at 9:33 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > >
> > > > On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin <eperezma@redhat.com> wrote:
> > > > >
> > > > > On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > >
> > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
> > > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > > > > > So-called "internal" virtio-fs migration refers to transporting the
> > > > > > > > > back-end's (virtiofsd's) state through qemu's migration stream.  To do
> > > > > > > > > this, we need to be able to transfer virtiofsd's internal state to and
> > > > > > > > > from virtiofsd.
> > > > > > > > >
> > > > > > > > > Because virtiofsd's internal state will not be too large, we believe it
> > > > > > > > > is best to transfer it as a single binary blob after the streaming
> > > > > > > > > phase.  Because this method should be useful to other vhost-user
> > > > > > > > > implementations, too, it is introduced as a general-purpose addition to
> > > > > > > > > the protocol, not limited to vhost-user-fs.
> > > > > > > > >
> > > > > > > > > These are the additions to the protocol:
> > > > > > > > > - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > > > > > >   This feature signals support for transferring state, and is added so
> > > > > > > > >   that migration can fail early when the back-end has no support.
> > > > > > > > >
> > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> > > > > > > > >   over which to transfer the state.  The front-end sends an FD to the
> > > > > > > > >   back-end into/from which it can write/read its state, and the back-end
> > > > > > > > >   can decide to either use it, or reply with a different FD for the
> > > > > > > > >   front-end to override the front-end's choice.
> > > > > > > > >   The front-end creates a simple pipe to transfer the state, but maybe
> > > > > > > > >   the back-end already has an FD into/from which it has to write/read
> > > > > > > > >   its state, in which case it will want to override the simple pipe.
> > > > > > > > >   Conversely, maybe in the future we find a way to have the front-end
> > > > > > > > >   get an immediate FD for the migration stream (in some cases), in which
> > > > > > > > >   case we will want to send this to the back-end instead of creating a
> > > > > > > > >   pipe.
> > > > > > > > >   Hence the negotiation: If one side has a better idea than a plain
> > > > > > > > >   pipe, we will want to use that.
> > > > > > > > >
> > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred through the
> > > > > > > > >   pipe (the end indicated by EOF), the front-end invokes this function
> > > > > > > > >   to verify success.  There is no in-band way (through the pipe) to
> > > > > > > > >   indicate failure, so we need to check explicitly.
> > > > > > > > >
> > > > > > > > > Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> > > > > > > > > (which includes establishing the direction of transfer and migration
> > > > > > > > > phase), the sending side writes its data into the pipe, and the reading
> > > > > > > > > side reads it until it sees an EOF.  Then, the front-end will check for
> > > > > > > > > success via CHECK_DEVICE_STATE, which on the destination side includes
> > > > > > > > > checking for integrity (i.e. errors during deserialization).
> > > > > > > > >
> > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > > > > ---
> > > > > > > > >  include/hw/virtio/vhost-backend.h |  24 +++++
> > > > > > > > >  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > > > > > >  hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> > > > > > > > >  hw/virtio/vhost.c                 |  37 ++++++++
> > > > > > > > >  4 files changed, 287 insertions(+)
> > > > > > > > >
> > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> > > > > > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > > > > > >      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > > > > > >  } VhostSetConfigType;
> > > > > > > > >
> > > > > > > > > +typedef enum VhostDeviceStateDirection {
> > > > > > > > > +    /* Transfer state from back-end (device) to front-end */
> > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > > > > > +    /* Transfer state from front-end to back-end (device) */
> > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > > > > > +} VhostDeviceStateDirection;
> > > > > > > > > +
> > > > > > > > > +typedef enum VhostDeviceStatePhase {
> > > > > > > > > +    /* The device (and all its vrings) is stopped */
> > > > > > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > > > > > +} VhostDeviceStatePhase;
> > > > > > > >
> > > > > > > > vDPA has:
> > > > > > > >
> > > > > > > >   /* Suspend a device so it does not process virtqueue requests anymore
> > > > > > > >    *
> > > > > > > >    * After the return of ioctl the device must preserve all the necessary state
> > > > > > > >    * (the virtqueue vring base plus the possible device specific states) that is
> > > > > > > >    * required for restoring in the future. The device must not change its
> > > > > > > >    * configuration after that point.
> > > > > > > >    */
> > > > > > > >   #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > > > > > >
> > > > > > > >   /* Resume a device so it can resume processing virtqueue requests
> > > > > > > >    *
> > > > > > > >    * After the return of this ioctl the device will have restored all the
> > > > > > > >    * necessary states and it is fully operational to continue processing the
> > > > > > > >    * virtqueue descriptors.
> > > > > > > >    */
> > > > > > > >   #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > > > > > >
> > > > > > > > I wonder if it makes sense to import these into vhost-user so that the
> > > > > > > > difference between kernel vhost and vhost-user is minimized. It's okay
> > > > > > > > if one of them is ahead of the other, but it would be nice to avoid
> > > > > > > > overlapping/duplicated functionality.
> > > > > > > >
> > > > > > >
> > > > > > > That's what I had in mind in the first versions. I proposed VHOST_STOP
> > > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did change
> > > > > > > to SUSPEND.
> > > > > >
> > > > > > I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not
> > > > > > ioctl(VHOST_VDPA_RESUME).
> > > > > >
> > > > > > The doc comments in <linux/vdpa.h> don't explain how the device can
> > > > > > leave the suspended state. Can you clarify this?
> > > > > >
> > > > >
> > > > > Do you mean in what situations or regarding the semantics of _RESUME?
> > > > >
> > > > > To me resume is an operation mainly to resume the device in the event
> > > > > of a VM suspension, not a migration. It can be used as a fallback code
> > > > > in some cases of migration failure though, but it is not currently
> > > > > used in qemu.
> > > >
> > > > Is a "VM suspension" the QEMU HMP 'stop' command?
> > > >
> > > > I guess the reason why QEMU doesn't call RESUME anywhere is that it
> > > > resets the device in vhost_dev_stop()?
> > > >
> > >
> > > The actual reason for not using RESUME is that the ioctl was added
> > > after the SUSPEND design in qemu. Same as this proposal, it is was not
> > > needed at the time.
> > >
> > > In the case of vhost-vdpa net, the only usage of suspend is to fetch
> > > the vq indexes, and in case of error vhost already fetches them from
> > > guest's used ring way before vDPA, so it has little usage.
> > >
> > > > Does it make sense to combine SUSPEND and RESUME with Hanna's
> > > > SET_DEVICE_STATE_FD? For example, non-iterative migration works like
> > > > this:
> > > > - Saving the device's state is done by SUSPEND followed by
> > > > SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g.
> > > > savevm command or migration failed), then RESUME is called to
> > > > continue.
> > >
> > > I think the previous steps make sense at vhost_dev_stop, not virtio
> > > savevm handlers. To start spreading this logic to more places of qemu
> > > can bring confusion.
> >
> > I don't think there is a way around extending the QEMU vhost's code
> > model. The current model in QEMU's vhost code is that the backend is
> > reset when the VM stops. This model worked fine for stateless devices
> > but it doesn't work for stateful devices.
> >
> > Imagine a vdpa-gpu device: you cannot reset the device in
> > vhost_dev_stop() and expect the GPU to continue working when
> > vhost_dev_start() is called again because all its state has been lost.
> > The guest driver will send requests that references a virtio-gpu
> > resources that no longer exist.
> >
> > One solution is to save the device's state in vhost_dev_stop(). I think
> > this is what you're suggesting. It requires keeping a copy of the state
> > and then loading the state again in vhost_dev_start(). I don't think
> > this approach should be used because it requires all stateful devices to
> > support live migration (otherwise they break across HMP 'stop'/'cont').
> > Also, the device state for some devices may be large and it would also
> > become more complicated when iterative migration is added.
> >
> > Instead, I think the QEMU vhost code needs to be structured so that
> > struct vhost_dev has a suspended state:
> >
> >         ,---------.
> >         v         |
> >   started ------> stopped
> >     \   ^
> >      \  |
> >       -> suspended
> >
> > The device doesn't lose state when it enters the suspended state. It can
> > be resumed again.
> >
> > This is why I think SUSPEND/RESUME need to be part of the solution.
>
> I agree with all of this, especially after realizing vhost_dev_stop is
> called before the last request of the state in the iterative
> migration.
>
> However I think we can move faster with the virtiofsd migration code,
> as long as we agree on the vhost-user messages it will receive. This
> is because we already agree that the state will be sent in one shot
> and not iteratively, so it will be small.
>
> I understand this may change in the future, that's why I proposed to
> start using iterative right now. However it may make little sense if
> it is not used in the vhost-user device. I also understand that other
> devices may have a bigger state so it will be needed for them.

Can you summarize how you'd like save to work today? I'm not sure what
you have in mind.

Stefan
Hanna Czenczek April 19, 2023, 10:45 a.m. UTC | #31
On 14.04.23 17:17, Eugenio Perez Martin wrote:
> On Thu, Apr 13, 2023 at 7:55 PM Hanna Czenczek <hreitz@redhat.com> wrote:

[...]

>> Basically, what I’m hearing is that I need to implement a different
>> feature that has no practical impact right now, and also fix bugs around
>> it along the way...
>>
> To fix this properly requires iterative device migration in qemu as
> far as I know, instead of using VMStates [1]. This way the state is
> requested to virtiofsd before the device reset.
>
> What does virtiofsd do when the state is totally sent? Does it keep
> processing requests and generating new state or is only a one shot
> that will suspend the daemon? If it is the second I think it still can
> be done in one shot at the end, always indicating "no more state" at
> save_live_pending and sending all the state at
> save_live_complete_precopy.

This sounds to me as if we should reset all devices during migration, 
and I don’t understand that.  virtiofsd will not immediately process 
requests when the state is sent, because the device is still stopped, 
but when it is re-enabled (e.g. because of a failed migration), it will 
have retained its state and continue processing requests as if nothing 
happened.  A reset would break this and other stateful back-ends, as I 
think Stefan has mentioned somewhere else.

It seems to me as if there are devices that need a reset, and so need 
suspend+resume around it, but I also think there are back-ends that 
don’t, where this would only unnecessarily complicate the back-end 
implementation.

Hanna
Hanna Czenczek April 19, 2023, 10:47 a.m. UTC | #32
On 17.04.23 17:12, Stefan Hajnoczi wrote:

[...]

> This brings to mind how iterative migration will work. The interface for
> iterative migration is basically the same as non-iterative migration
> plus a method to query the number of bytes remaining. When the number of
> bytes falls below a threshold, the vCPUs are stopped and the remainder
> of the data is read.
>
> Some details from VFIO migration:
> - The VMM must explicitly change the state when transitioning from
>    iterative and non-iterative migration, but the data transfer fd
>    remains the same.
> - The state of the device (running, stopped, resuming, etc) doesn't
>    change asynchronously, it's always driven by the VMM. However, setting
>    the state can fail and then the new state may be an error state.
>
> Mapping this to SET_DEVICE_STATE_FD:
> - VhostDeviceStatePhase is extended with
>    VHOST_TRANSFER_STATE_PHASE_RUNNING = 1 for iterative migration. The
>    frontend sends SET_DEVICE_STATE_FD again with
>    VHOST_TRANSFER_STATE_PHASE_STOPPED when entering non-iterative
>    migration and the frontend sends the iterative fd from the previous
>    SET_DEVICE_STATE_FD call to the backend. The backend may reply with
>    another fd, if necessary. If the backend changes the fd, then the
>    contents of the previous fd must be fully read and transferred before
>    the contents of the new fd are migrated. (Maybe this is too complex
>    and we should forbid changing the fd when going from RUNNING ->
>    STOPPED.)
> - CHECK_DEVICE_STATE can be extended to report the number of bytes
>    remaining. The semantics change so that CHECK_DEVICE_STATE can be
>    called while the VMM is still reading from the fd. It becomes:
>
>      enum CheckDeviceStateResult {
>          Saving(bytes_remaining : usize),
> 	Failed(error_code : u64),
>      }

Sounds good.  Personally, I’d forbid changing the FD when just changing 
state, which raises the question of whether there should then be a 
separate command for just changing the state (like VFIO_DEVICE_FEATURE 
..._MIG_DEVICE_STATE?), but that would be a question for then.

Changing the CHECK_DEVICE_STATE interface sounds good to me.

Hanna
Stefan Hajnoczi April 19, 2023, 10:57 a.m. UTC | #33
On Wed, 19 Apr 2023 at 06:45, Hanna Czenczek <hreitz@redhat.com> wrote:
>
> On 14.04.23 17:17, Eugenio Perez Martin wrote:
> > On Thu, Apr 13, 2023 at 7:55 PM Hanna Czenczek <hreitz@redhat.com> wrote:
>
> [...]
>
> >> Basically, what I’m hearing is that I need to implement a different
> >> feature that has no practical impact right now, and also fix bugs around
> >> it along the way...
> >>
> > To fix this properly requires iterative device migration in qemu as
> > far as I know, instead of using VMStates [1]. This way the state is
> > requested to virtiofsd before the device reset.
> >
> > What does virtiofsd do when the state is totally sent? Does it keep
> > processing requests and generating new state or is only a one shot
> > that will suspend the daemon? If it is the second I think it still can
> > be done in one shot at the end, always indicating "no more state" at
> > save_live_pending and sending all the state at
> > save_live_complete_precopy.
>
> This sounds to me as if we should reset all devices during migration,
> and I don’t understand that.  virtiofsd will not immediately process
> requests when the state is sent, because the device is still stopped,
> but when it is re-enabled (e.g. because of a failed migration), it will
> have retained its state and continue processing requests as if nothing
> happened.  A reset would break this and other stateful back-ends, as I
> think Stefan has mentioned somewhere else.
>
> It seems to me as if there are devices that need a reset, and so need
> suspend+resume around it, but I also think there are back-ends that
> don’t, where this would only unnecessarily complicate the back-end
> implementation.

Existing vhost-user backends must continue working, so I think having
two code paths is (almost) unavoidable.

One approach is to add SUSPEND/RESUME to the vhost-user protocol with
a corresponding VHOST_USER_PROTOCOL_F_SUSPEND feature bit. vhost-user
frontends can identify backends that support SUSPEND/RESUME instead of
device reset. Old vhost-user backends will continue to use device
reset.

I said avoiding two code paths is almost unavoidable. It may be
possible to rely on existing VHOST_USER_GET_VRING_BASE's semantics (it
stops a single virtqueue) instead of SUSPEND. RESUME is replaced by
VHOST_USER_SET_VRING_* and gets the device going again. However, I'm
not 100% sure if this will work (even for all existing devices). It
would require carefully studying both the spec and various
implementations to see if it's viable. There's a chance of losing the
performance optimization that VHOST_USER_SET_STATUS provided to DPDK
if the device is not reset.

In my opinion SUSPEND/RESUME is the cleanest way to do this.

Stefan
Hanna Czenczek April 19, 2023, 10:57 a.m. UTC | #34
On 18.04.23 19:59, Stefan Hajnoczi wrote:
> On Tue, Apr 18, 2023 at 10:09:30AM +0200, Eugenio Perez Martin wrote:
>> On Mon, Apr 17, 2023 at 9:33 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>> On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin <eperezma@redhat.com> wrote:
>>>> On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>> On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
>>>>>> On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>>>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
>>>>>>>> So-called "internal" virtio-fs migration refers to transporting the
>>>>>>>> back-end's (virtiofsd's) state through qemu's migration stream.  To do
>>>>>>>> this, we need to be able to transfer virtiofsd's internal state to and
>>>>>>>> from virtiofsd.
>>>>>>>>
>>>>>>>> Because virtiofsd's internal state will not be too large, we believe it
>>>>>>>> is best to transfer it as a single binary blob after the streaming
>>>>>>>> phase.  Because this method should be useful to other vhost-user
>>>>>>>> implementations, too, it is introduced as a general-purpose addition to
>>>>>>>> the protocol, not limited to vhost-user-fs.
>>>>>>>>
>>>>>>>> These are the additions to the protocol:
>>>>>>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
>>>>>>>>    This feature signals support for transferring state, and is added so
>>>>>>>>    that migration can fail early when the back-end has no support.
>>>>>>>>
>>>>>>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
>>>>>>>>    over which to transfer the state.  The front-end sends an FD to the
>>>>>>>>    back-end into/from which it can write/read its state, and the back-end
>>>>>>>>    can decide to either use it, or reply with a different FD for the
>>>>>>>>    front-end to override the front-end's choice.
>>>>>>>>    The front-end creates a simple pipe to transfer the state, but maybe
>>>>>>>>    the back-end already has an FD into/from which it has to write/read
>>>>>>>>    its state, in which case it will want to override the simple pipe.
>>>>>>>>    Conversely, maybe in the future we find a way to have the front-end
>>>>>>>>    get an immediate FD for the migration stream (in some cases), in which
>>>>>>>>    case we will want to send this to the back-end instead of creating a
>>>>>>>>    pipe.
>>>>>>>>    Hence the negotiation: If one side has a better idea than a plain
>>>>>>>>    pipe, we will want to use that.
>>>>>>>>
>>>>>>>> - CHECK_DEVICE_STATE: After the state has been transferred through the
>>>>>>>>    pipe (the end indicated by EOF), the front-end invokes this function
>>>>>>>>    to verify success.  There is no in-band way (through the pipe) to
>>>>>>>>    indicate failure, so we need to check explicitly.
>>>>>>>>
>>>>>>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
>>>>>>>> (which includes establishing the direction of transfer and migration
>>>>>>>> phase), the sending side writes its data into the pipe, and the reading
>>>>>>>> side reads it until it sees an EOF.  Then, the front-end will check for
>>>>>>>> success via CHECK_DEVICE_STATE, which on the destination side includes
>>>>>>>> checking for integrity (i.e. errors during deserialization).
>>>>>>>>
>>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>>>>>>> ---
>>>>>>>>   include/hw/virtio/vhost-backend.h |  24 +++++
>>>>>>>>   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
>>>>>>>>   hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
>>>>>>>>   hw/virtio/vhost.c                 |  37 ++++++++
>>>>>>>>   4 files changed, 287 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
>>>>>>>> index ec3fbae58d..5935b32fe3 100644
>>>>>>>> --- a/include/hw/virtio/vhost-backend.h
>>>>>>>> +++ b/include/hw/virtio/vhost-backend.h
>>>>>>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
>>>>>>>>       VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
>>>>>>>>   } VhostSetConfigType;
>>>>>>>>
>>>>>>>> +typedef enum VhostDeviceStateDirection {
>>>>>>>> +    /* Transfer state from back-end (device) to front-end */
>>>>>>>> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
>>>>>>>> +    /* Transfer state from front-end to back-end (device) */
>>>>>>>> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
>>>>>>>> +} VhostDeviceStateDirection;
>>>>>>>> +
>>>>>>>> +typedef enum VhostDeviceStatePhase {
>>>>>>>> +    /* The device (and all its vrings) is stopped */
>>>>>>>> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
>>>>>>>> +} VhostDeviceStatePhase;
>>>>>>> vDPA has:
>>>>>>>
>>>>>>>    /* Suspend a device so it does not process virtqueue requests anymore
>>>>>>>     *
>>>>>>>     * After the return of ioctl the device must preserve all the necessary state
>>>>>>>     * (the virtqueue vring base plus the possible device specific states) that is
>>>>>>>     * required for restoring in the future. The device must not change its
>>>>>>>     * configuration after that point.
>>>>>>>     */
>>>>>>>    #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
>>>>>>>
>>>>>>>    /* Resume a device so it can resume processing virtqueue requests
>>>>>>>     *
>>>>>>>     * After the return of this ioctl the device will have restored all the
>>>>>>>     * necessary states and it is fully operational to continue processing the
>>>>>>>     * virtqueue descriptors.
>>>>>>>     */
>>>>>>>    #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
>>>>>>>
>>>>>>> I wonder if it makes sense to import these into vhost-user so that the
>>>>>>> difference between kernel vhost and vhost-user is minimized. It's okay
>>>>>>> if one of them is ahead of the other, but it would be nice to avoid
>>>>>>> overlapping/duplicated functionality.
>>>>>>>
>>>>>> That's what I had in mind in the first versions. I proposed VHOST_STOP
>>>>>> instead of VHOST_VDPA_STOP for this very reason. Later it did change
>>>>>> to SUSPEND.
>>>>> I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not
>>>>> ioctl(VHOST_VDPA_RESUME).
>>>>>
>>>>> The doc comments in <linux/vdpa.h> don't explain how the device can
>>>>> leave the suspended state. Can you clarify this?
>>>>>
>>>> Do you mean in what situations or regarding the semantics of _RESUME?
>>>>
>>>> To me resume is an operation mainly to resume the device in the event
>>>> of a VM suspension, not a migration. It can be used as a fallback code
>>>> in some cases of migration failure though, but it is not currently
>>>> used in qemu.
>>> Is a "VM suspension" the QEMU HMP 'stop' command?
>>>
>>> I guess the reason why QEMU doesn't call RESUME anywhere is that it
>>> resets the device in vhost_dev_stop()?
>>>
>> The actual reason for not using RESUME is that the ioctl was added
>> after the SUSPEND design in qemu. Same as this proposal, it is was not
>> needed at the time.
>>
>> In the case of vhost-vdpa net, the only usage of suspend is to fetch
>> the vq indexes, and in case of error vhost already fetches them from
>> guest's used ring way before vDPA, so it has little usage.
>>
>>> Does it make sense to combine SUSPEND and RESUME with Hanna's
>>> SET_DEVICE_STATE_FD? For example, non-iterative migration works like
>>> this:
>>> - Saving the device's state is done by SUSPEND followed by
>>> SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g.
>>> savevm command or migration failed), then RESUME is called to
>>> continue.
>> I think the previous steps make sense at vhost_dev_stop, not virtio
>> savevm handlers. To start spreading this logic to more places of qemu
>> can bring confusion.
> I don't think there is a way around extending the QEMU vhost's code
> model. The current model in QEMU's vhost code is that the backend is
> reset when the VM stops. This model worked fine for stateless devices
> but it doesn't work for stateful devices.
>
> Imagine a vdpa-gpu device: you cannot reset the device in
> vhost_dev_stop() and expect the GPU to continue working when
> vhost_dev_start() is called again because all its state has been lost.
> The guest driver will send requests that references a virtio-gpu
> resources that no longer exist.
>
> One solution is to save the device's state in vhost_dev_stop(). I think
> this is what you're suggesting. It requires keeping a copy of the state
> and then loading the state again in vhost_dev_start(). I don't think
> this approach should be used because it requires all stateful devices to
> support live migration (otherwise they break across HMP 'stop'/'cont').
> Also, the device state for some devices may be large and it would also
> become more complicated when iterative migration is added.
>
> Instead, I think the QEMU vhost code needs to be structured so that
> struct vhost_dev has a suspended state:
>
>          ,---------.
> 	v         |
>    started ------> stopped
>      \   ^
>       \  |
>        -> suspended
>
> The device doesn't lose state when it enters the suspended state. It can
> be resumed again.
>
> This is why I think SUSPEND/RESUME need to be part of the solution.
> (It's also an argument for not including the phase argument in
> SET_DEVICE_STATE_FD because the SUSPEND message is sent during
> vhost_dev_stop() separately from saving the device's state.)

So let me ask if I understand this protocol correctly: Basically, 
SUSPEND would ask the device to fully serialize its internal state, 
retain it in some buffer, and RESUME would then deserialize the state 
from the buffer, right?

While this state needn’t necessarily be immediately migratable, I 
suppose (e.g. one could retain file descriptors there, and it doesn’t 
need to be a serialized byte buffer, but could still be structured), it 
would basically be a live migration implementation already.  As far as I 
understand, that’s why you suggest not running a SUSPEND+RESUME cycle on 
anything but live migration, right?

I wonder how that model would then work with iterative migration, 
though.  Basically, for non-iterative migration, the back-end would 
expect SUSPEND first to flush its state out to a buffer, and then the 
state transfer would just copy from that buffer.  For iterative 
migration, though, there is no SUSPEND first, so the back-end must 
implicitly begin to serialize its state and send it over.  I find that a 
bit strange.

Also, how would this work with currently migratable stateless 
back-ends?  Do they already implement SUSPEND+RESUME as no-ops?  If not, 
I think we should detect stateless back-ends and skip the operations in 
qemu lest we have to update those back-ends for no real reason.

Hanna
Stefan Hajnoczi April 19, 2023, 11:10 a.m. UTC | #35
On Wed, 19 Apr 2023 at 06:57, Hanna Czenczek <hreitz@redhat.com> wrote:
>
> On 18.04.23 19:59, Stefan Hajnoczi wrote:
> > On Tue, Apr 18, 2023 at 10:09:30AM +0200, Eugenio Perez Martin wrote:
> >> On Mon, Apr 17, 2023 at 9:33 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >>> On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin <eperezma@redhat.com> wrote:
> >>>> On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>>>> On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
> >>>>>> On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>>>>>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> >>>>>>>> So-called "internal" virtio-fs migration refers to transporting the
> >>>>>>>> back-end's (virtiofsd's) state through qemu's migration stream.  To do
> >>>>>>>> this, we need to be able to transfer virtiofsd's internal state to and
> >>>>>>>> from virtiofsd.
> >>>>>>>>
> >>>>>>>> Because virtiofsd's internal state will not be too large, we believe it
> >>>>>>>> is best to transfer it as a single binary blob after the streaming
> >>>>>>>> phase.  Because this method should be useful to other vhost-user
> >>>>>>>> implementations, too, it is introduced as a general-purpose addition to
> >>>>>>>> the protocol, not limited to vhost-user-fs.
> >>>>>>>>
> >>>>>>>> These are the additions to the protocol:
> >>>>>>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> >>>>>>>>    This feature signals support for transferring state, and is added so
> >>>>>>>>    that migration can fail early when the back-end has no support.
> >>>>>>>>
> >>>>>>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> >>>>>>>>    over which to transfer the state.  The front-end sends an FD to the
> >>>>>>>>    back-end into/from which it can write/read its state, and the back-end
> >>>>>>>>    can decide to either use it, or reply with a different FD for the
> >>>>>>>>    front-end to override the front-end's choice.
> >>>>>>>>    The front-end creates a simple pipe to transfer the state, but maybe
> >>>>>>>>    the back-end already has an FD into/from which it has to write/read
> >>>>>>>>    its state, in which case it will want to override the simple pipe.
> >>>>>>>>    Conversely, maybe in the future we find a way to have the front-end
> >>>>>>>>    get an immediate FD for the migration stream (in some cases), in which
> >>>>>>>>    case we will want to send this to the back-end instead of creating a
> >>>>>>>>    pipe.
> >>>>>>>>    Hence the negotiation: If one side has a better idea than a plain
> >>>>>>>>    pipe, we will want to use that.
> >>>>>>>>
> >>>>>>>> - CHECK_DEVICE_STATE: After the state has been transferred through the
> >>>>>>>>    pipe (the end indicated by EOF), the front-end invokes this function
> >>>>>>>>    to verify success.  There is no in-band way (through the pipe) to
> >>>>>>>>    indicate failure, so we need to check explicitly.
> >>>>>>>>
> >>>>>>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> >>>>>>>> (which includes establishing the direction of transfer and migration
> >>>>>>>> phase), the sending side writes its data into the pipe, and the reading
> >>>>>>>> side reads it until it sees an EOF.  Then, the front-end will check for
> >>>>>>>> success via CHECK_DEVICE_STATE, which on the destination side includes
> >>>>>>>> checking for integrity (i.e. errors during deserialization).
> >>>>>>>>
> >>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> >>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> >>>>>>>> ---
> >>>>>>>>   include/hw/virtio/vhost-backend.h |  24 +++++
> >>>>>>>>   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> >>>>>>>>   hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> >>>>>>>>   hw/virtio/vhost.c                 |  37 ++++++++
> >>>>>>>>   4 files changed, 287 insertions(+)
> >>>>>>>>
> >>>>>>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> >>>>>>>> index ec3fbae58d..5935b32fe3 100644
> >>>>>>>> --- a/include/hw/virtio/vhost-backend.h
> >>>>>>>> +++ b/include/hw/virtio/vhost-backend.h
> >>>>>>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> >>>>>>>>       VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> >>>>>>>>   } VhostSetConfigType;
> >>>>>>>>
> >>>>>>>> +typedef enum VhostDeviceStateDirection {
> >>>>>>>> +    /* Transfer state from back-end (device) to front-end */
> >>>>>>>> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> >>>>>>>> +    /* Transfer state from front-end to back-end (device) */
> >>>>>>>> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> >>>>>>>> +} VhostDeviceStateDirection;
> >>>>>>>> +
> >>>>>>>> +typedef enum VhostDeviceStatePhase {
> >>>>>>>> +    /* The device (and all its vrings) is stopped */
> >>>>>>>> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> >>>>>>>> +} VhostDeviceStatePhase;
> >>>>>>> vDPA has:
> >>>>>>>
> >>>>>>>    /* Suspend a device so it does not process virtqueue requests anymore
> >>>>>>>     *
> >>>>>>>     * After the return of ioctl the device must preserve all the necessary state
> >>>>>>>     * (the virtqueue vring base plus the possible device specific states) that is
> >>>>>>>     * required for restoring in the future. The device must not change its
> >>>>>>>     * configuration after that point.
> >>>>>>>     */
> >>>>>>>    #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> >>>>>>>
> >>>>>>>    /* Resume a device so it can resume processing virtqueue requests
> >>>>>>>     *
> >>>>>>>     * After the return of this ioctl the device will have restored all the
> >>>>>>>     * necessary states and it is fully operational to continue processing the
> >>>>>>>     * virtqueue descriptors.
> >>>>>>>     */
> >>>>>>>    #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> >>>>>>>
> >>>>>>> I wonder if it makes sense to import these into vhost-user so that the
> >>>>>>> difference between kernel vhost and vhost-user is minimized. It's okay
> >>>>>>> if one of them is ahead of the other, but it would be nice to avoid
> >>>>>>> overlapping/duplicated functionality.
> >>>>>>>
> >>>>>> That's what I had in mind in the first versions. I proposed VHOST_STOP
> >>>>>> instead of VHOST_VDPA_STOP for this very reason. Later it did change
> >>>>>> to SUSPEND.
> >>>>> I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not
> >>>>> ioctl(VHOST_VDPA_RESUME).
> >>>>>
> >>>>> The doc comments in <linux/vdpa.h> don't explain how the device can
> >>>>> leave the suspended state. Can you clarify this?
> >>>>>
> >>>> Do you mean in what situations or regarding the semantics of _RESUME?
> >>>>
> >>>> To me resume is an operation mainly to resume the device in the event
> >>>> of a VM suspension, not a migration. It can be used as a fallback code
> >>>> in some cases of migration failure though, but it is not currently
> >>>> used in qemu.
> >>> Is a "VM suspension" the QEMU HMP 'stop' command?
> >>>
> >>> I guess the reason why QEMU doesn't call RESUME anywhere is that it
> >>> resets the device in vhost_dev_stop()?
> >>>
> >> The actual reason for not using RESUME is that the ioctl was added
> >> after the SUSPEND design in qemu. Same as this proposal, it is was not
> >> needed at the time.
> >>
> >> In the case of vhost-vdpa net, the only usage of suspend is to fetch
> >> the vq indexes, and in case of error vhost already fetches them from
> >> guest's used ring way before vDPA, so it has little usage.
> >>
> >>> Does it make sense to combine SUSPEND and RESUME with Hanna's
> >>> SET_DEVICE_STATE_FD? For example, non-iterative migration works like
> >>> this:
> >>> - Saving the device's state is done by SUSPEND followed by
> >>> SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g.
> >>> savevm command or migration failed), then RESUME is called to
> >>> continue.
> >> I think the previous steps make sense at vhost_dev_stop, not virtio
> >> savevm handlers. To start spreading this logic to more places of qemu
> >> can bring confusion.
> > I don't think there is a way around extending the QEMU vhost's code
> > model. The current model in QEMU's vhost code is that the backend is
> > reset when the VM stops. This model worked fine for stateless devices
> > but it doesn't work for stateful devices.
> >
> > Imagine a vdpa-gpu device: you cannot reset the device in
> > vhost_dev_stop() and expect the GPU to continue working when
> > vhost_dev_start() is called again because all its state has been lost.
> > The guest driver will send requests that references a virtio-gpu
> > resources that no longer exist.
> >
> > One solution is to save the device's state in vhost_dev_stop(). I think
> > this is what you're suggesting. It requires keeping a copy of the state
> > and then loading the state again in vhost_dev_start(). I don't think
> > this approach should be used because it requires all stateful devices to
> > support live migration (otherwise they break across HMP 'stop'/'cont').
> > Also, the device state for some devices may be large and it would also
> > become more complicated when iterative migration is added.
> >
> > Instead, I think the QEMU vhost code needs to be structured so that
> > struct vhost_dev has a suspended state:
> >
> >          ,---------.
> >       v         |
> >    started ------> stopped
> >      \   ^
> >       \  |
> >        -> suspended
> >
> > The device doesn't lose state when it enters the suspended state. It can
> > be resumed again.
> >
> > This is why I think SUSPEND/RESUME need to be part of the solution.
> > (It's also an argument for not including the phase argument in
> > SET_DEVICE_STATE_FD because the SUSPEND message is sent during
> > vhost_dev_stop() separately from saving the device's state.)
>
> So let me ask if I understand this protocol correctly: Basically,
> SUSPEND would ask the device to fully serialize its internal state,
> retain it in some buffer, and RESUME would then deserialize the state
> from the buffer, right?

That's not how I understand SUSPEND/RESUME. I was thinking that
SUSPEND pauses device operation so that virtqueues are no longer
processed and no other events occur (e.g. VIRTIO Configuration Change
Notifications). RESUME continues device operation. Neither command is
directly related to device state serialization but SUSPEND freezes the
device state, while RESUME allows the device state to change again.

> While this state needn’t necessarily be immediately migratable, I
> suppose (e.g. one could retain file descriptors there, and it doesn’t
> need to be a serialized byte buffer, but could still be structured), it
> would basically be a live migration implementation already.  As far as I
> understand, that’s why you suggest not running a SUSPEND+RESUME cycle on
> anything but live migration, right?

No, SUSPEND/RESUME would also be used across vm_stop()/vm_start().
That way stateful devices are no longer reset across HMP 'stop'/'cont'
(we're lucky it even works for most existing vhost-user backends today
and that's just because they don't yet implement
VHOST_USER_SET_STATUS).

> I wonder how that model would then work with iterative migration,
> though.  Basically, for non-iterative migration, the back-end would
> expect SUSPEND first to flush its state out to a buffer, and then the
> state transfer would just copy from that buffer.  For iterative
> migration, though, there is no SUSPEND first, so the back-end must
> implicitly begin to serialize its state and send it over.  I find that a
> bit strange.

I expected SET_DEVICE_STATE_FD to be sent while the device is still
running for iterative migration. Device state chunks are saved while
the device is still operating.

When the VMM decides to stop the guest, it sends SUSPEND to freeze the
device. The remainder of the device state can then be read from the fd
in the knowledge that the size is now finite.

After migration completes, the device is still suspended on the
source. If migration failed, RESUME is sent to continue running on the
source.

> Also, how would this work with currently migratable stateless
> back-ends?  Do they already implement SUSPEND+RESUME as no-ops?  If not,
> I think we should detect stateless back-ends and skip the operations in
> qemu lest we have to update those back-ends for no real reason.

Yes, I think backwards compatibility is a requirement, too. The
vhost-user frontend checks the SUSPEND vhost-user protocol feature
bit. If the bit is cleared, then it must assume this device is
stateless and use device reset operations. Otherwise it can use
SUSPEND/RESUME.

Stefan
Hanna Czenczek April 19, 2023, 11:10 a.m. UTC | #36
On 18.04.23 09:54, Eugenio Perez Martin wrote:
> On Mon, Apr 17, 2023 at 9:21 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>> On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com> wrote:
>>> On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>> On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
>>>>> On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
>>>>>>> So-called "internal" virtio-fs migration refers to transporting the
>>>>>>> back-end's (virtiofsd's) state through qemu's migration stream.  To do
>>>>>>> this, we need to be able to transfer virtiofsd's internal state to and
>>>>>>> from virtiofsd.
>>>>>>>
>>>>>>> Because virtiofsd's internal state will not be too large, we believe it
>>>>>>> is best to transfer it as a single binary blob after the streaming
>>>>>>> phase.  Because this method should be useful to other vhost-user
>>>>>>> implementations, too, it is introduced as a general-purpose addition to
>>>>>>> the protocol, not limited to vhost-user-fs.
>>>>>>>
>>>>>>> These are the additions to the protocol:
>>>>>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
>>>>>>>    This feature signals support for transferring state, and is added so
>>>>>>>    that migration can fail early when the back-end has no support.
>>>>>>>
>>>>>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
>>>>>>>    over which to transfer the state.  The front-end sends an FD to the
>>>>>>>    back-end into/from which it can write/read its state, and the back-end
>>>>>>>    can decide to either use it, or reply with a different FD for the
>>>>>>>    front-end to override the front-end's choice.
>>>>>>>    The front-end creates a simple pipe to transfer the state, but maybe
>>>>>>>    the back-end already has an FD into/from which it has to write/read
>>>>>>>    its state, in which case it will want to override the simple pipe.
>>>>>>>    Conversely, maybe in the future we find a way to have the front-end
>>>>>>>    get an immediate FD for the migration stream (in some cases), in which
>>>>>>>    case we will want to send this to the back-end instead of creating a
>>>>>>>    pipe.
>>>>>>>    Hence the negotiation: If one side has a better idea than a plain
>>>>>>>    pipe, we will want to use that.
>>>>>>>
>>>>>>> - CHECK_DEVICE_STATE: After the state has been transferred through the
>>>>>>>    pipe (the end indicated by EOF), the front-end invokes this function
>>>>>>>    to verify success.  There is no in-band way (through the pipe) to
>>>>>>>    indicate failure, so we need to check explicitly.
>>>>>>>
>>>>>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
>>>>>>> (which includes establishing the direction of transfer and migration
>>>>>>> phase), the sending side writes its data into the pipe, and the reading
>>>>>>> side reads it until it sees an EOF.  Then, the front-end will check for
>>>>>>> success via CHECK_DEVICE_STATE, which on the destination side includes
>>>>>>> checking for integrity (i.e. errors during deserialization).
>>>>>>>
>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>>>>>> ---
>>>>>>>   include/hw/virtio/vhost-backend.h |  24 +++++
>>>>>>>   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
>>>>>>>   hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
>>>>>>>   hw/virtio/vhost.c                 |  37 ++++++++
>>>>>>>   4 files changed, 287 insertions(+)
>>>>>>>
>>>>>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
>>>>>>> index ec3fbae58d..5935b32fe3 100644
>>>>>>> --- a/include/hw/virtio/vhost-backend.h
>>>>>>> +++ b/include/hw/virtio/vhost-backend.h
>>>>>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
>>>>>>>       VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
>>>>>>>   } VhostSetConfigType;
>>>>>>>
>>>>>>> +typedef enum VhostDeviceStateDirection {
>>>>>>> +    /* Transfer state from back-end (device) to front-end */
>>>>>>> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
>>>>>>> +    /* Transfer state from front-end to back-end (device) */
>>>>>>> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
>>>>>>> +} VhostDeviceStateDirection;
>>>>>>> +
>>>>>>> +typedef enum VhostDeviceStatePhase {
>>>>>>> +    /* The device (and all its vrings) is stopped */
>>>>>>> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
>>>>>>> +} VhostDeviceStatePhase;
>>>>>> vDPA has:
>>>>>>
>>>>>>    /* Suspend a device so it does not process virtqueue requests anymore
>>>>>>     *
>>>>>>     * After the return of ioctl the device must preserve all the necessary state
>>>>>>     * (the virtqueue vring base plus the possible device specific states) that is
>>>>>>     * required for restoring in the future. The device must not change its
>>>>>>     * configuration after that point.
>>>>>>     */
>>>>>>    #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
>>>>>>
>>>>>>    /* Resume a device so it can resume processing virtqueue requests
>>>>>>     *
>>>>>>     * After the return of this ioctl the device will have restored all the
>>>>>>     * necessary states and it is fully operational to continue processing the
>>>>>>     * virtqueue descriptors.
>>>>>>     */
>>>>>>    #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
>>>>>>
>>>>>> I wonder if it makes sense to import these into vhost-user so that the
>>>>>> difference between kernel vhost and vhost-user is minimized. It's okay
>>>>>> if one of them is ahead of the other, but it would be nice to avoid
>>>>>> overlapping/duplicated functionality.
>>>>>>
>>>>> That's what I had in mind in the first versions. I proposed VHOST_STOP
>>>>> instead of VHOST_VDPA_STOP for this very reason. Later it did change
>>>>> to SUSPEND.
>>>>>
>>>>> Generally it is better if we make the interface less parametrized and
>>>>> we trust in the messages and its semantics in my opinion. In other
>>>>> words, instead of
>>>>> vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), send
>>>>> individually the equivalent of VHOST_VDPA_SUSPEND vhost-user command.
>>>>>
>>>>> Another way to apply this is with the "direction" parameter. Maybe it
>>>>> is better to split it into "set_state_fd" and "get_state_fd"?
>>>>>
>>>>> In that case, reusing the ioctls as vhost-user messages would be ok.
>>>>> But that puts this proposal further from the VFIO code, which uses
>>>>> "migration_set_state(state)", and maybe it is better when the number
>>>>> of states is high.
>>>> Hi Eugenio,
>>>> Another question about vDPA suspend/resume:
>>>>
>>>>    /* Host notifiers must be enabled at this point. */
>>>>    void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings)
>>>>    {
>>>>        int i;
>>>>
>>>>        /* should only be called after backend is connected */
>>>>        assert(hdev->vhost_ops);
>>>>        event_notifier_test_and_clear(
>>>>            &hdev->vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier);
>>>>        event_notifier_test_and_clear(&vdev->config_notifier);
>>>>
>>>>        trace_vhost_dev_stop(hdev, vdev->name, vrings);
>>>>
>>>>        if (hdev->vhost_ops->vhost_dev_start) {
>>>>            hdev->vhost_ops->vhost_dev_start(hdev, false);
>>>>            ^^^ SUSPEND ^^^
>>>>        }
>>>>        if (vrings) {
>>>>            vhost_dev_set_vring_enable(hdev, false);
>>>>        }
>>>>        for (i = 0; i < hdev->nvqs; ++i) {
>>>>            vhost_virtqueue_stop(hdev,
>>>>                                 vdev,
>>>>                                 hdev->vqs + i,
>>>>                                 hdev->vq_index + i);
>>>>          ^^^ fetch virtqueue state from kernel ^^^
>>>>        }
>>>>        if (hdev->vhost_ops->vhost_reset_status) {
>>>>            hdev->vhost_ops->vhost_reset_status(hdev);
>>>>            ^^^ reset device^^^
>>>>
>>>> I noticed the QEMU vDPA code resets the device in vhost_dev_stop() ->
>>>> vhost_reset_status(). The device's migration code runs after
>>>> vhost_dev_stop() and the state will have been lost.
>>>>
>>> vhost_virtqueue_stop saves the vq state (indexes, vring base) in the
>>> qemu VirtIONet device model. This is for all vhost backends.
>>>
>>> Regarding the state like mac or mq configuration, SVQ runs for all the
>>> VM run in the CVQ. So it can track all of that status in the device
>>> model too.
>>>
>>> When a migration effectively occurs, all the frontend state is
>>> migrated as a regular emulated device. To route all of the state in a
>>> normalized way for qemu is what leaves open the possibility to do
>>> cross-backends migrations, etc.
>>>
>>> Does that answer your question?
>> I think you're confirming that changes would be necessary in order for
>> vDPA to support the save/load operation that Hanna is introducing.
>>
> Yes, this first iteration was centered on net, with an eye on block,
> where state can be routed through classical emulated devices. This is
> how vhost-kernel and vhost-user do classically. And it allows
> cross-backend, to not modify qemu migration state, etc.
>
> To introduce this opaque state to qemu, that must be fetched after the
> suspend and not before, requires changes in vhost protocol, as
> discussed previously.
>
>>>> It looks like vDPA changes are necessary in order to support stateful
>>>> devices even though QEMU already uses SUSPEND. Is my understanding
>>>> correct?
>>>>
>>> Changes are required elsewhere, as the code to restore the state
>>> properly in the destination has not been merged.
>> I'm not sure what you mean by elsewhere?
>>
> I meant for vdpa *net* devices the changes are not required in vdpa
> ioctls, but mostly in qemu.
>
> If you meant stateful as "it must have a state blob that it must be
> opaque to qemu", then I think the straightforward action is to fetch
> state blob about the same time as vq indexes. But yes, changes (at
> least a new ioctl) is needed for that.
>
>> I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and
>> then VHOST_VDPA_SET_STATUS 0.
>>
>> In order to save device state from the vDPA device in the future, it
>> will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that
>> the device state can be saved before the device is reset.
>>
>> Does that sound right?
>>
> The split between suspend and reset was added recently for that very
> reason. In all the virtio devices, the frontend is initialized before
> the backend, so I don't think it is a good idea to defer the backend
> cleanup. Especially if we have already set the state is small enough
> to not needing iterative migration from virtiofsd point of view.
>
> If fetching that state at the same time as vq indexes is not valid,
> could it follow the same model as the "in-flight descriptors"?
> vhost-user follows them by using a shared memory region where their
> state is tracked [1]. This allows qemu to survive vhost-user SW
> backend crashes, and does not forbid the cross-backends live migration
> as all the information is there to recover them.
>
> For hw devices this is not convenient as it occupies PCI bandwidth. So
> a possibility is to synchronize this memory region after a
> synchronization point, being the SUSPEND call or GET_VRING_BASE. HW
> devices are not going to crash in the software sense, so all use cases
> remain the same to qemu. And that shared memory information is
> recoverable after vhost_dev_stop.
>
> Does that sound reasonable to virtiofsd? To offer a shared memory
> region where it dumps the state, maybe only after the
> set_state(STATE_PHASE_STOPPED)?

I don’t think we need the set_state() call, necessarily, if SUSPEND is 
mandatory anyway.

As for the shared memory, the RFC before this series used shared memory, 
so it’s possible, yes.  But “shared memory region” can mean a lot of 
things – it sounds like you’re saying the back-end (virtiofsd) should 
provide it to the front-end, is that right?  That could work like this:

On the source side:

S1. SUSPEND goes to virtiofsd
S2. virtiofsd maybe double-checks that the device is stopped, then 
serializes its state into a newly allocated shared memory area[1]
S3. virtiofsd responds to SUSPEND
S4. front-end requests shared memory, virtiofsd responds with a handle, 
maybe already closes its reference
S5. front-end saves state, closes its handle, freeing the SHM

[1] Maybe virtiofsd can correctly size the serialized state’s size, then 
it can immediately allocate this area and serialize directly into it; 
maybe it can’t, then we’ll need a bounce buffer.  Not really a 
fundamental problem, but there are limitations around what you can do 
with serde implementations in Rust…

On the destination side:

D1. Optional SUSPEND goes to virtiofsd that hasn’t yet done much; 
virtiofsd would serialize its empty state into an SHM area, and respond 
to SUSPEND
D2. front-end reads state from migration stream into an SHM it has allocated
D3. front-end supplies this SHM to virtiofsd, which discards its 
previous area, and now uses this one
D4. RESUME goes to virtiofsd, which deserializes the state from the SHM

Couple of questions:

A. Stefan suggested D1, but it does seem wasteful now.  But if SUSPEND 
would imply to deserialize a state, and the state is to be transferred 
through SHM, this is what would need to be done.  So maybe we should 
skip SUSPEND on the destination?
B. You described that the back-end should supply the SHM, which works 
well on the source.  On the destination, only the front-end knows how 
big the state is, so I’ve decided above that it should allocate the SHM 
(D2) and provide it to the back-end.  Is that feasible or is it 
important (e.g. for real hardware) that the back-end supplies the SHM?  
(In which case the front-end would need to tell the back-end how big the 
state SHM needs to be.)

Hanna
Hanna Czenczek April 19, 2023, 11:15 a.m. UTC | #37
On 19.04.23 13:10, Stefan Hajnoczi wrote:
> On Wed, 19 Apr 2023 at 06:57, Hanna Czenczek <hreitz@redhat.com> wrote:
>> On 18.04.23 19:59, Stefan Hajnoczi wrote:
>>> On Tue, Apr 18, 2023 at 10:09:30AM +0200, Eugenio Perez Martin wrote:
>>>> On Mon, Apr 17, 2023 at 9:33 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>>> On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin <eperezma@redhat.com> wrote:
>>>>>> On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>>>> On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
>>>>>>>> On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>>>>>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
>>>>>>>>>> So-called "internal" virtio-fs migration refers to transporting the
>>>>>>>>>> back-end's (virtiofsd's) state through qemu's migration stream.  To do
>>>>>>>>>> this, we need to be able to transfer virtiofsd's internal state to and
>>>>>>>>>> from virtiofsd.
>>>>>>>>>>
>>>>>>>>>> Because virtiofsd's internal state will not be too large, we believe it
>>>>>>>>>> is best to transfer it as a single binary blob after the streaming
>>>>>>>>>> phase.  Because this method should be useful to other vhost-user
>>>>>>>>>> implementations, too, it is introduced as a general-purpose addition to
>>>>>>>>>> the protocol, not limited to vhost-user-fs.
>>>>>>>>>>
>>>>>>>>>> These are the additions to the protocol:
>>>>>>>>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
>>>>>>>>>>     This feature signals support for transferring state, and is added so
>>>>>>>>>>     that migration can fail early when the back-end has no support.
>>>>>>>>>>
>>>>>>>>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
>>>>>>>>>>     over which to transfer the state.  The front-end sends an FD to the
>>>>>>>>>>     back-end into/from which it can write/read its state, and the back-end
>>>>>>>>>>     can decide to either use it, or reply with a different FD for the
>>>>>>>>>>     front-end to override the front-end's choice.
>>>>>>>>>>     The front-end creates a simple pipe to transfer the state, but maybe
>>>>>>>>>>     the back-end already has an FD into/from which it has to write/read
>>>>>>>>>>     its state, in which case it will want to override the simple pipe.
>>>>>>>>>>     Conversely, maybe in the future we find a way to have the front-end
>>>>>>>>>>     get an immediate FD for the migration stream (in some cases), in which
>>>>>>>>>>     case we will want to send this to the back-end instead of creating a
>>>>>>>>>>     pipe.
>>>>>>>>>>     Hence the negotiation: If one side has a better idea than a plain
>>>>>>>>>>     pipe, we will want to use that.
>>>>>>>>>>
>>>>>>>>>> - CHECK_DEVICE_STATE: After the state has been transferred through the
>>>>>>>>>>     pipe (the end indicated by EOF), the front-end invokes this function
>>>>>>>>>>     to verify success.  There is no in-band way (through the pipe) to
>>>>>>>>>>     indicate failure, so we need to check explicitly.
>>>>>>>>>>
>>>>>>>>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
>>>>>>>>>> (which includes establishing the direction of transfer and migration
>>>>>>>>>> phase), the sending side writes its data into the pipe, and the reading
>>>>>>>>>> side reads it until it sees an EOF.  Then, the front-end will check for
>>>>>>>>>> success via CHECK_DEVICE_STATE, which on the destination side includes
>>>>>>>>>> checking for integrity (i.e. errors during deserialization).
>>>>>>>>>>
>>>>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>>>>>>>>> ---
>>>>>>>>>>    include/hw/virtio/vhost-backend.h |  24 +++++
>>>>>>>>>>    include/hw/virtio/vhost.h         |  79 ++++++++++++++++
>>>>>>>>>>    hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
>>>>>>>>>>    hw/virtio/vhost.c                 |  37 ++++++++
>>>>>>>>>>    4 files changed, 287 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
>>>>>>>>>> index ec3fbae58d..5935b32fe3 100644
>>>>>>>>>> --- a/include/hw/virtio/vhost-backend.h
>>>>>>>>>> +++ b/include/hw/virtio/vhost-backend.h
>>>>>>>>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
>>>>>>>>>>        VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
>>>>>>>>>>    } VhostSetConfigType;
>>>>>>>>>>
>>>>>>>>>> +typedef enum VhostDeviceStateDirection {
>>>>>>>>>> +    /* Transfer state from back-end (device) to front-end */
>>>>>>>>>> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
>>>>>>>>>> +    /* Transfer state from front-end to back-end (device) */
>>>>>>>>>> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
>>>>>>>>>> +} VhostDeviceStateDirection;
>>>>>>>>>> +
>>>>>>>>>> +typedef enum VhostDeviceStatePhase {
>>>>>>>>>> +    /* The device (and all its vrings) is stopped */
>>>>>>>>>> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
>>>>>>>>>> +} VhostDeviceStatePhase;
>>>>>>>>> vDPA has:
>>>>>>>>>
>>>>>>>>>     /* Suspend a device so it does not process virtqueue requests anymore
>>>>>>>>>      *
>>>>>>>>>      * After the return of ioctl the device must preserve all the necessary state
>>>>>>>>>      * (the virtqueue vring base plus the possible device specific states) that is
>>>>>>>>>      * required for restoring in the future. The device must not change its
>>>>>>>>>      * configuration after that point.
>>>>>>>>>      */
>>>>>>>>>     #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
>>>>>>>>>
>>>>>>>>>     /* Resume a device so it can resume processing virtqueue requests
>>>>>>>>>      *
>>>>>>>>>      * After the return of this ioctl the device will have restored all the
>>>>>>>>>      * necessary states and it is fully operational to continue processing the
>>>>>>>>>      * virtqueue descriptors.
>>>>>>>>>      */
>>>>>>>>>     #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
>>>>>>>>>
>>>>>>>>> I wonder if it makes sense to import these into vhost-user so that the
>>>>>>>>> difference between kernel vhost and vhost-user is minimized. It's okay
>>>>>>>>> if one of them is ahead of the other, but it would be nice to avoid
>>>>>>>>> overlapping/duplicated functionality.
>>>>>>>>>
>>>>>>>> That's what I had in mind in the first versions. I proposed VHOST_STOP
>>>>>>>> instead of VHOST_VDPA_STOP for this very reason. Later it did change
>>>>>>>> to SUSPEND.
>>>>>>> I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not
>>>>>>> ioctl(VHOST_VDPA_RESUME).
>>>>>>>
>>>>>>> The doc comments in <linux/vdpa.h> don't explain how the device can
>>>>>>> leave the suspended state. Can you clarify this?
>>>>>>>
>>>>>> Do you mean in what situations or regarding the semantics of _RESUME?
>>>>>>
>>>>>> To me resume is an operation mainly to resume the device in the event
>>>>>> of a VM suspension, not a migration. It can be used as a fallback code
>>>>>> in some cases of migration failure though, but it is not currently
>>>>>> used in qemu.
>>>>> Is a "VM suspension" the QEMU HMP 'stop' command?
>>>>>
>>>>> I guess the reason why QEMU doesn't call RESUME anywhere is that it
>>>>> resets the device in vhost_dev_stop()?
>>>>>
>>>> The actual reason for not using RESUME is that the ioctl was added
>>>> after the SUSPEND design in qemu. Same as this proposal, it is was not
>>>> needed at the time.
>>>>
>>>> In the case of vhost-vdpa net, the only usage of suspend is to fetch
>>>> the vq indexes, and in case of error vhost already fetches them from
>>>> guest's used ring way before vDPA, so it has little usage.
>>>>
>>>>> Does it make sense to combine SUSPEND and RESUME with Hanna's
>>>>> SET_DEVICE_STATE_FD? For example, non-iterative migration works like
>>>>> this:
>>>>> - Saving the device's state is done by SUSPEND followed by
>>>>> SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g.
>>>>> savevm command or migration failed), then RESUME is called to
>>>>> continue.
>>>> I think the previous steps make sense at vhost_dev_stop, not virtio
>>>> savevm handlers. To start spreading this logic to more places of qemu
>>>> can bring confusion.
>>> I don't think there is a way around extending the QEMU vhost's code
>>> model. The current model in QEMU's vhost code is that the backend is
>>> reset when the VM stops. This model worked fine for stateless devices
>>> but it doesn't work for stateful devices.
>>>
>>> Imagine a vdpa-gpu device: you cannot reset the device in
>>> vhost_dev_stop() and expect the GPU to continue working when
>>> vhost_dev_start() is called again because all its state has been lost.
>>> The guest driver will send requests that references a virtio-gpu
>>> resources that no longer exist.
>>>
>>> One solution is to save the device's state in vhost_dev_stop(). I think
>>> this is what you're suggesting. It requires keeping a copy of the state
>>> and then loading the state again in vhost_dev_start(). I don't think
>>> this approach should be used because it requires all stateful devices to
>>> support live migration (otherwise they break across HMP 'stop'/'cont').
>>> Also, the device state for some devices may be large and it would also
>>> become more complicated when iterative migration is added.
>>>
>>> Instead, I think the QEMU vhost code needs to be structured so that
>>> struct vhost_dev has a suspended state:
>>>
>>>           ,---------.
>>>        v         |
>>>     started ------> stopped
>>>       \   ^
>>>        \  |
>>>         -> suspended
>>>
>>> The device doesn't lose state when it enters the suspended state. It can
>>> be resumed again.
>>>
>>> This is why I think SUSPEND/RESUME need to be part of the solution.
>>> (It's also an argument for not including the phase argument in
>>> SET_DEVICE_STATE_FD because the SUSPEND message is sent during
>>> vhost_dev_stop() separately from saving the device's state.)
>> So let me ask if I understand this protocol correctly: Basically,
>> SUSPEND would ask the device to fully serialize its internal state,
>> retain it in some buffer, and RESUME would then deserialize the state
>> from the buffer, right?
> That's not how I understand SUSPEND/RESUME. I was thinking that
> SUSPEND pauses device operation so that virtqueues are no longer
> processed and no other events occur (e.g. VIRTIO Configuration Change
> Notifications). RESUME continues device operation. Neither command is
> directly related to device state serialization but SUSPEND freezes the
> device state, while RESUME allows the device state to change again.

I understood that a reset would basically reset all internal state, 
which is why SUSPEND+RESUME were required around it, to retain the state.

>> While this state needn’t necessarily be immediately migratable, I
>> suppose (e.g. one could retain file descriptors there, and it doesn’t
>> need to be a serialized byte buffer, but could still be structured), it
>> would basically be a live migration implementation already.  As far as I
>> understand, that’s why you suggest not running a SUSPEND+RESUME cycle on
>> anything but live migration, right?
> No, SUSPEND/RESUME would also be used across vm_stop()/vm_start().
> That way stateful devices are no longer reset across HMP 'stop'/'cont'
> (we're lucky it even works for most existing vhost-user backends today
> and that's just because they don't yet implement
> VHOST_USER_SET_STATUS).

So that’s what I seem to misunderstand: If stateful devices are reset, 
how does SUSPEND+RESUME prevent that?

>> I wonder how that model would then work with iterative migration,
>> though.  Basically, for non-iterative migration, the back-end would
>> expect SUSPEND first to flush its state out to a buffer, and then the
>> state transfer would just copy from that buffer.  For iterative
>> migration, though, there is no SUSPEND first, so the back-end must
>> implicitly begin to serialize its state and send it over.  I find that a
>> bit strange.
> I expected SET_DEVICE_STATE_FD to be sent while the device is still
> running for iterative migration. Device state chunks are saved while
> the device is still operating.
>
> When the VMM decides to stop the guest, it sends SUSPEND to freeze the
> device. The remainder of the device state can then be read from the fd
> in the knowledge that the size is now finite.
>
> After migration completes, the device is still suspended on the
> source. If migration failed, RESUME is sent to continue running on the
> source.

Sure, that makes perfect sense as long as SUSPEND/RESUME are unrelated 
to device state serialization,

>> Also, how would this work with currently migratable stateless
>> back-ends?  Do they already implement SUSPEND+RESUME as no-ops?  If not,
>> I think we should detect stateless back-ends and skip the operations in
>> qemu lest we have to update those back-ends for no real reason.
> Yes, I think backwards compatibility is a requirement, too. The
> vhost-user frontend checks the SUSPEND vhost-user protocol feature
> bit. If the bit is cleared, then it must assume this device is
> stateless and use device reset operations. Otherwise it can use
> SUSPEND/RESUME.

Yes, all stateful devices should currently block migration, so we could 
require them to implement SUSPEND/RESUME, and assume that any that don’t 
are stateless.

Hanna
Stefan Hajnoczi April 19, 2023, 11:21 a.m. UTC | #38
On Wed, 19 Apr 2023 at 07:10, Hanna Czenczek <hreitz@redhat.com> wrote:
>
> On 18.04.23 09:54, Eugenio Perez Martin wrote:
> > On Mon, Apr 17, 2023 at 9:21 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >> On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com> wrote:
> >>> On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>>> On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
> >>>>> On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>>>>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> >>>>>>> So-called "internal" virtio-fs migration refers to transporting the
> >>>>>>> back-end's (virtiofsd's) state through qemu's migration stream.  To do
> >>>>>>> this, we need to be able to transfer virtiofsd's internal state to and
> >>>>>>> from virtiofsd.
> >>>>>>>
> >>>>>>> Because virtiofsd's internal state will not be too large, we believe it
> >>>>>>> is best to transfer it as a single binary blob after the streaming
> >>>>>>> phase.  Because this method should be useful to other vhost-user
> >>>>>>> implementations, too, it is introduced as a general-purpose addition to
> >>>>>>> the protocol, not limited to vhost-user-fs.
> >>>>>>>
> >>>>>>> These are the additions to the protocol:
> >>>>>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> >>>>>>>    This feature signals support for transferring state, and is added so
> >>>>>>>    that migration can fail early when the back-end has no support.
> >>>>>>>
> >>>>>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> >>>>>>>    over which to transfer the state.  The front-end sends an FD to the
> >>>>>>>    back-end into/from which it can write/read its state, and the back-end
> >>>>>>>    can decide to either use it, or reply with a different FD for the
> >>>>>>>    front-end to override the front-end's choice.
> >>>>>>>    The front-end creates a simple pipe to transfer the state, but maybe
> >>>>>>>    the back-end already has an FD into/from which it has to write/read
> >>>>>>>    its state, in which case it will want to override the simple pipe.
> >>>>>>>    Conversely, maybe in the future we find a way to have the front-end
> >>>>>>>    get an immediate FD for the migration stream (in some cases), in which
> >>>>>>>    case we will want to send this to the back-end instead of creating a
> >>>>>>>    pipe.
> >>>>>>>    Hence the negotiation: If one side has a better idea than a plain
> >>>>>>>    pipe, we will want to use that.
> >>>>>>>
> >>>>>>> - CHECK_DEVICE_STATE: After the state has been transferred through the
> >>>>>>>    pipe (the end indicated by EOF), the front-end invokes this function
> >>>>>>>    to verify success.  There is no in-band way (through the pipe) to
> >>>>>>>    indicate failure, so we need to check explicitly.
> >>>>>>>
> >>>>>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> >>>>>>> (which includes establishing the direction of transfer and migration
> >>>>>>> phase), the sending side writes its data into the pipe, and the reading
> >>>>>>> side reads it until it sees an EOF.  Then, the front-end will check for
> >>>>>>> success via CHECK_DEVICE_STATE, which on the destination side includes
> >>>>>>> checking for integrity (i.e. errors during deserialization).
> >>>>>>>
> >>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> >>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> >>>>>>> ---
> >>>>>>>   include/hw/virtio/vhost-backend.h |  24 +++++
> >>>>>>>   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> >>>>>>>   hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> >>>>>>>   hw/virtio/vhost.c                 |  37 ++++++++
> >>>>>>>   4 files changed, 287 insertions(+)
> >>>>>>>
> >>>>>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> >>>>>>> index ec3fbae58d..5935b32fe3 100644
> >>>>>>> --- a/include/hw/virtio/vhost-backend.h
> >>>>>>> +++ b/include/hw/virtio/vhost-backend.h
> >>>>>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> >>>>>>>       VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> >>>>>>>   } VhostSetConfigType;
> >>>>>>>
> >>>>>>> +typedef enum VhostDeviceStateDirection {
> >>>>>>> +    /* Transfer state from back-end (device) to front-end */
> >>>>>>> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> >>>>>>> +    /* Transfer state from front-end to back-end (device) */
> >>>>>>> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> >>>>>>> +} VhostDeviceStateDirection;
> >>>>>>> +
> >>>>>>> +typedef enum VhostDeviceStatePhase {
> >>>>>>> +    /* The device (and all its vrings) is stopped */
> >>>>>>> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> >>>>>>> +} VhostDeviceStatePhase;
> >>>>>> vDPA has:
> >>>>>>
> >>>>>>    /* Suspend a device so it does not process virtqueue requests anymore
> >>>>>>     *
> >>>>>>     * After the return of ioctl the device must preserve all the necessary state
> >>>>>>     * (the virtqueue vring base plus the possible device specific states) that is
> >>>>>>     * required for restoring in the future. The device must not change its
> >>>>>>     * configuration after that point.
> >>>>>>     */
> >>>>>>    #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> >>>>>>
> >>>>>>    /* Resume a device so it can resume processing virtqueue requests
> >>>>>>     *
> >>>>>>     * After the return of this ioctl the device will have restored all the
> >>>>>>     * necessary states and it is fully operational to continue processing the
> >>>>>>     * virtqueue descriptors.
> >>>>>>     */
> >>>>>>    #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> >>>>>>
> >>>>>> I wonder if it makes sense to import these into vhost-user so that the
> >>>>>> difference between kernel vhost and vhost-user is minimized. It's okay
> >>>>>> if one of them is ahead of the other, but it would be nice to avoid
> >>>>>> overlapping/duplicated functionality.
> >>>>>>
> >>>>> That's what I had in mind in the first versions. I proposed VHOST_STOP
> >>>>> instead of VHOST_VDPA_STOP for this very reason. Later it did change
> >>>>> to SUSPEND.
> >>>>>
> >>>>> Generally it is better if we make the interface less parametrized and
> >>>>> we trust in the messages and its semantics in my opinion. In other
> >>>>> words, instead of
> >>>>> vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), send
> >>>>> individually the equivalent of VHOST_VDPA_SUSPEND vhost-user command.
> >>>>>
> >>>>> Another way to apply this is with the "direction" parameter. Maybe it
> >>>>> is better to split it into "set_state_fd" and "get_state_fd"?
> >>>>>
> >>>>> In that case, reusing the ioctls as vhost-user messages would be ok.
> >>>>> But that puts this proposal further from the VFIO code, which uses
> >>>>> "migration_set_state(state)", and maybe it is better when the number
> >>>>> of states is high.
> >>>> Hi Eugenio,
> >>>> Another question about vDPA suspend/resume:
> >>>>
> >>>>    /* Host notifiers must be enabled at this point. */
> >>>>    void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings)
> >>>>    {
> >>>>        int i;
> >>>>
> >>>>        /* should only be called after backend is connected */
> >>>>        assert(hdev->vhost_ops);
> >>>>        event_notifier_test_and_clear(
> >>>>            &hdev->vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier);
> >>>>        event_notifier_test_and_clear(&vdev->config_notifier);
> >>>>
> >>>>        trace_vhost_dev_stop(hdev, vdev->name, vrings);
> >>>>
> >>>>        if (hdev->vhost_ops->vhost_dev_start) {
> >>>>            hdev->vhost_ops->vhost_dev_start(hdev, false);
> >>>>            ^^^ SUSPEND ^^^
> >>>>        }
> >>>>        if (vrings) {
> >>>>            vhost_dev_set_vring_enable(hdev, false);
> >>>>        }
> >>>>        for (i = 0; i < hdev->nvqs; ++i) {
> >>>>            vhost_virtqueue_stop(hdev,
> >>>>                                 vdev,
> >>>>                                 hdev->vqs + i,
> >>>>                                 hdev->vq_index + i);
> >>>>          ^^^ fetch virtqueue state from kernel ^^^
> >>>>        }
> >>>>        if (hdev->vhost_ops->vhost_reset_status) {
> >>>>            hdev->vhost_ops->vhost_reset_status(hdev);
> >>>>            ^^^ reset device^^^
> >>>>
> >>>> I noticed the QEMU vDPA code resets the device in vhost_dev_stop() ->
> >>>> vhost_reset_status(). The device's migration code runs after
> >>>> vhost_dev_stop() and the state will have been lost.
> >>>>
> >>> vhost_virtqueue_stop saves the vq state (indexes, vring base) in the
> >>> qemu VirtIONet device model. This is for all vhost backends.
> >>>
> >>> Regarding the state like mac or mq configuration, SVQ runs for all the
> >>> VM run in the CVQ. So it can track all of that status in the device
> >>> model too.
> >>>
> >>> When a migration effectively occurs, all the frontend state is
> >>> migrated as a regular emulated device. To route all of the state in a
> >>> normalized way for qemu is what leaves open the possibility to do
> >>> cross-backends migrations, etc.
> >>>
> >>> Does that answer your question?
> >> I think you're confirming that changes would be necessary in order for
> >> vDPA to support the save/load operation that Hanna is introducing.
> >>
> > Yes, this first iteration was centered on net, with an eye on block,
> > where state can be routed through classical emulated devices. This is
> > how vhost-kernel and vhost-user do classically. And it allows
> > cross-backend, to not modify qemu migration state, etc.
> >
> > To introduce this opaque state to qemu, that must be fetched after the
> > suspend and not before, requires changes in vhost protocol, as
> > discussed previously.
> >
> >>>> It looks like vDPA changes are necessary in order to support stateful
> >>>> devices even though QEMU already uses SUSPEND. Is my understanding
> >>>> correct?
> >>>>
> >>> Changes are required elsewhere, as the code to restore the state
> >>> properly in the destination has not been merged.
> >> I'm not sure what you mean by elsewhere?
> >>
> > I meant for vdpa *net* devices the changes are not required in vdpa
> > ioctls, but mostly in qemu.
> >
> > If you meant stateful as "it must have a state blob that it must be
> > opaque to qemu", then I think the straightforward action is to fetch
> > state blob about the same time as vq indexes. But yes, changes (at
> > least a new ioctl) is needed for that.
> >
> >> I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and
> >> then VHOST_VDPA_SET_STATUS 0.
> >>
> >> In order to save device state from the vDPA device in the future, it
> >> will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that
> >> the device state can be saved before the device is reset.
> >>
> >> Does that sound right?
> >>
> > The split between suspend and reset was added recently for that very
> > reason. In all the virtio devices, the frontend is initialized before
> > the backend, so I don't think it is a good idea to defer the backend
> > cleanup. Especially if we have already set the state is small enough
> > to not needing iterative migration from virtiofsd point of view.
> >
> > If fetching that state at the same time as vq indexes is not valid,
> > could it follow the same model as the "in-flight descriptors"?
> > vhost-user follows them by using a shared memory region where their
> > state is tracked [1]. This allows qemu to survive vhost-user SW
> > backend crashes, and does not forbid the cross-backends live migration
> > as all the information is there to recover them.
> >
> > For hw devices this is not convenient as it occupies PCI bandwidth. So
> > a possibility is to synchronize this memory region after a
> > synchronization point, being the SUSPEND call or GET_VRING_BASE. HW
> > devices are not going to crash in the software sense, so all use cases
> > remain the same to qemu. And that shared memory information is
> > recoverable after vhost_dev_stop.
> >
> > Does that sound reasonable to virtiofsd? To offer a shared memory
> > region where it dumps the state, maybe only after the
> > set_state(STATE_PHASE_STOPPED)?
>
> I don’t think we need the set_state() call, necessarily, if SUSPEND is
> mandatory anyway.
>
> As for the shared memory, the RFC before this series used shared memory,
> so it’s possible, yes.  But “shared memory region” can mean a lot of
> things – it sounds like you’re saying the back-end (virtiofsd) should
> provide it to the front-end, is that right?  That could work like this:
>
> On the source side:
>
> S1. SUSPEND goes to virtiofsd
> S2. virtiofsd maybe double-checks that the device is stopped, then
> serializes its state into a newly allocated shared memory area[1]
> S3. virtiofsd responds to SUSPEND
> S4. front-end requests shared memory, virtiofsd responds with a handle,
> maybe already closes its reference
> S5. front-end saves state, closes its handle, freeing the SHM
>
> [1] Maybe virtiofsd can correctly size the serialized state’s size, then
> it can immediately allocate this area and serialize directly into it;
> maybe it can’t, then we’ll need a bounce buffer.  Not really a
> fundamental problem, but there are limitations around what you can do
> with serde implementations in Rust…
>
> On the destination side:
>
> D1. Optional SUSPEND goes to virtiofsd that hasn’t yet done much;
> virtiofsd would serialize its empty state into an SHM area, and respond
> to SUSPEND
> D2. front-end reads state from migration stream into an SHM it has allocated
> D3. front-end supplies this SHM to virtiofsd, which discards its
> previous area, and now uses this one
> D4. RESUME goes to virtiofsd, which deserializes the state from the SHM
>
> Couple of questions:
>
> A. Stefan suggested D1, but it does seem wasteful now.  But if SUSPEND
> would imply to deserialize a state, and the state is to be transferred
> through SHM, this is what would need to be done.  So maybe we should
> skip SUSPEND on the destination?
> B. You described that the back-end should supply the SHM, which works
> well on the source.  On the destination, only the front-end knows how
> big the state is, so I’ve decided above that it should allocate the SHM
> (D2) and provide it to the back-end.  Is that feasible or is it
> important (e.g. for real hardware) that the back-end supplies the SHM?
> (In which case the front-end would need to tell the back-end how big the
> state SHM needs to be.)

How does this work for iterative live migration?

Stefan
Hanna Czenczek April 19, 2023, 11:24 a.m. UTC | #39
On 19.04.23 13:21, Stefan Hajnoczi wrote:
> On Wed, 19 Apr 2023 at 07:10, Hanna Czenczek <hreitz@redhat.com> wrote:
>> On 18.04.23 09:54, Eugenio Perez Martin wrote:
>>> On Mon, Apr 17, 2023 at 9:21 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>> On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com> wrote:
>>>>> On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>>> On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
>>>>>>> On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>>>>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
>>>>>>>>> So-called "internal" virtio-fs migration refers to transporting the
>>>>>>>>> back-end's (virtiofsd's) state through qemu's migration stream.  To do
>>>>>>>>> this, we need to be able to transfer virtiofsd's internal state to and
>>>>>>>>> from virtiofsd.
>>>>>>>>>
>>>>>>>>> Because virtiofsd's internal state will not be too large, we believe it
>>>>>>>>> is best to transfer it as a single binary blob after the streaming
>>>>>>>>> phase.  Because this method should be useful to other vhost-user
>>>>>>>>> implementations, too, it is introduced as a general-purpose addition to
>>>>>>>>> the protocol, not limited to vhost-user-fs.
>>>>>>>>>
>>>>>>>>> These are the additions to the protocol:
>>>>>>>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
>>>>>>>>>     This feature signals support for transferring state, and is added so
>>>>>>>>>     that migration can fail early when the back-end has no support.
>>>>>>>>>
>>>>>>>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
>>>>>>>>>     over which to transfer the state.  The front-end sends an FD to the
>>>>>>>>>     back-end into/from which it can write/read its state, and the back-end
>>>>>>>>>     can decide to either use it, or reply with a different FD for the
>>>>>>>>>     front-end to override the front-end's choice.
>>>>>>>>>     The front-end creates a simple pipe to transfer the state, but maybe
>>>>>>>>>     the back-end already has an FD into/from which it has to write/read
>>>>>>>>>     its state, in which case it will want to override the simple pipe.
>>>>>>>>>     Conversely, maybe in the future we find a way to have the front-end
>>>>>>>>>     get an immediate FD for the migration stream (in some cases), in which
>>>>>>>>>     case we will want to send this to the back-end instead of creating a
>>>>>>>>>     pipe.
>>>>>>>>>     Hence the negotiation: If one side has a better idea than a plain
>>>>>>>>>     pipe, we will want to use that.
>>>>>>>>>
>>>>>>>>> - CHECK_DEVICE_STATE: After the state has been transferred through the
>>>>>>>>>     pipe (the end indicated by EOF), the front-end invokes this function
>>>>>>>>>     to verify success.  There is no in-band way (through the pipe) to
>>>>>>>>>     indicate failure, so we need to check explicitly.
>>>>>>>>>
>>>>>>>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
>>>>>>>>> (which includes establishing the direction of transfer and migration
>>>>>>>>> phase), the sending side writes its data into the pipe, and the reading
>>>>>>>>> side reads it until it sees an EOF.  Then, the front-end will check for
>>>>>>>>> success via CHECK_DEVICE_STATE, which on the destination side includes
>>>>>>>>> checking for integrity (i.e. errors during deserialization).
>>>>>>>>>
>>>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
>>>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>>>>>>>>> ---
>>>>>>>>>    include/hw/virtio/vhost-backend.h |  24 +++++
>>>>>>>>>    include/hw/virtio/vhost.h         |  79 ++++++++++++++++
>>>>>>>>>    hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
>>>>>>>>>    hw/virtio/vhost.c                 |  37 ++++++++
>>>>>>>>>    4 files changed, 287 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
>>>>>>>>> index ec3fbae58d..5935b32fe3 100644
>>>>>>>>> --- a/include/hw/virtio/vhost-backend.h
>>>>>>>>> +++ b/include/hw/virtio/vhost-backend.h
>>>>>>>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
>>>>>>>>>        VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
>>>>>>>>>    } VhostSetConfigType;
>>>>>>>>>
>>>>>>>>> +typedef enum VhostDeviceStateDirection {
>>>>>>>>> +    /* Transfer state from back-end (device) to front-end */
>>>>>>>>> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
>>>>>>>>> +    /* Transfer state from front-end to back-end (device) */
>>>>>>>>> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
>>>>>>>>> +} VhostDeviceStateDirection;
>>>>>>>>> +
>>>>>>>>> +typedef enum VhostDeviceStatePhase {
>>>>>>>>> +    /* The device (and all its vrings) is stopped */
>>>>>>>>> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
>>>>>>>>> +} VhostDeviceStatePhase;
>>>>>>>> vDPA has:
>>>>>>>>
>>>>>>>>     /* Suspend a device so it does not process virtqueue requests anymore
>>>>>>>>      *
>>>>>>>>      * After the return of ioctl the device must preserve all the necessary state
>>>>>>>>      * (the virtqueue vring base plus the possible device specific states) that is
>>>>>>>>      * required for restoring in the future. The device must not change its
>>>>>>>>      * configuration after that point.
>>>>>>>>      */
>>>>>>>>     #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
>>>>>>>>
>>>>>>>>     /* Resume a device so it can resume processing virtqueue requests
>>>>>>>>      *
>>>>>>>>      * After the return of this ioctl the device will have restored all the
>>>>>>>>      * necessary states and it is fully operational to continue processing the
>>>>>>>>      * virtqueue descriptors.
>>>>>>>>      */
>>>>>>>>     #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
>>>>>>>>
>>>>>>>> I wonder if it makes sense to import these into vhost-user so that the
>>>>>>>> difference between kernel vhost and vhost-user is minimized. It's okay
>>>>>>>> if one of them is ahead of the other, but it would be nice to avoid
>>>>>>>> overlapping/duplicated functionality.
>>>>>>>>
>>>>>>> That's what I had in mind in the first versions. I proposed VHOST_STOP
>>>>>>> instead of VHOST_VDPA_STOP for this very reason. Later it did change
>>>>>>> to SUSPEND.
>>>>>>>
>>>>>>> Generally it is better if we make the interface less parametrized and
>>>>>>> we trust in the messages and its semantics in my opinion. In other
>>>>>>> words, instead of
>>>>>>> vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED), send
>>>>>>> individually the equivalent of VHOST_VDPA_SUSPEND vhost-user command.
>>>>>>>
>>>>>>> Another way to apply this is with the "direction" parameter. Maybe it
>>>>>>> is better to split it into "set_state_fd" and "get_state_fd"?
>>>>>>>
>>>>>>> In that case, reusing the ioctls as vhost-user messages would be ok.
>>>>>>> But that puts this proposal further from the VFIO code, which uses
>>>>>>> "migration_set_state(state)", and maybe it is better when the number
>>>>>>> of states is high.
>>>>>> Hi Eugenio,
>>>>>> Another question about vDPA suspend/resume:
>>>>>>
>>>>>>     /* Host notifiers must be enabled at this point. */
>>>>>>     void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev, bool vrings)
>>>>>>     {
>>>>>>         int i;
>>>>>>
>>>>>>         /* should only be called after backend is connected */
>>>>>>         assert(hdev->vhost_ops);
>>>>>>         event_notifier_test_and_clear(
>>>>>>             &hdev->vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier);
>>>>>>         event_notifier_test_and_clear(&vdev->config_notifier);
>>>>>>
>>>>>>         trace_vhost_dev_stop(hdev, vdev->name, vrings);
>>>>>>
>>>>>>         if (hdev->vhost_ops->vhost_dev_start) {
>>>>>>             hdev->vhost_ops->vhost_dev_start(hdev, false);
>>>>>>             ^^^ SUSPEND ^^^
>>>>>>         }
>>>>>>         if (vrings) {
>>>>>>             vhost_dev_set_vring_enable(hdev, false);
>>>>>>         }
>>>>>>         for (i = 0; i < hdev->nvqs; ++i) {
>>>>>>             vhost_virtqueue_stop(hdev,
>>>>>>                                  vdev,
>>>>>>                                  hdev->vqs + i,
>>>>>>                                  hdev->vq_index + i);
>>>>>>           ^^^ fetch virtqueue state from kernel ^^^
>>>>>>         }
>>>>>>         if (hdev->vhost_ops->vhost_reset_status) {
>>>>>>             hdev->vhost_ops->vhost_reset_status(hdev);
>>>>>>             ^^^ reset device^^^
>>>>>>
>>>>>> I noticed the QEMU vDPA code resets the device in vhost_dev_stop() ->
>>>>>> vhost_reset_status(). The device's migration code runs after
>>>>>> vhost_dev_stop() and the state will have been lost.
>>>>>>
>>>>> vhost_virtqueue_stop saves the vq state (indexes, vring base) in the
>>>>> qemu VirtIONet device model. This is for all vhost backends.
>>>>>
>>>>> Regarding the state like mac or mq configuration, SVQ runs for all the
>>>>> VM run in the CVQ. So it can track all of that status in the device
>>>>> model too.
>>>>>
>>>>> When a migration effectively occurs, all the frontend state is
>>>>> migrated as a regular emulated device. To route all of the state in a
>>>>> normalized way for qemu is what leaves open the possibility to do
>>>>> cross-backends migrations, etc.
>>>>>
>>>>> Does that answer your question?
>>>> I think you're confirming that changes would be necessary in order for
>>>> vDPA to support the save/load operation that Hanna is introducing.
>>>>
>>> Yes, this first iteration was centered on net, with an eye on block,
>>> where state can be routed through classical emulated devices. This is
>>> how vhost-kernel and vhost-user do classically. And it allows
>>> cross-backend, to not modify qemu migration state, etc.
>>>
>>> To introduce this opaque state to qemu, that must be fetched after the
>>> suspend and not before, requires changes in vhost protocol, as
>>> discussed previously.
>>>
>>>>>> It looks like vDPA changes are necessary in order to support stateful
>>>>>> devices even though QEMU already uses SUSPEND. Is my understanding
>>>>>> correct?
>>>>>>
>>>>> Changes are required elsewhere, as the code to restore the state
>>>>> properly in the destination has not been merged.
>>>> I'm not sure what you mean by elsewhere?
>>>>
>>> I meant for vdpa *net* devices the changes are not required in vdpa
>>> ioctls, but mostly in qemu.
>>>
>>> If you meant stateful as "it must have a state blob that it must be
>>> opaque to qemu", then I think the straightforward action is to fetch
>>> state blob about the same time as vq indexes. But yes, changes (at
>>> least a new ioctl) is needed for that.
>>>
>>>> I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and
>>>> then VHOST_VDPA_SET_STATUS 0.
>>>>
>>>> In order to save device state from the vDPA device in the future, it
>>>> will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that
>>>> the device state can be saved before the device is reset.
>>>>
>>>> Does that sound right?
>>>>
>>> The split between suspend and reset was added recently for that very
>>> reason. In all the virtio devices, the frontend is initialized before
>>> the backend, so I don't think it is a good idea to defer the backend
>>> cleanup. Especially if we have already set the state is small enough
>>> to not needing iterative migration from virtiofsd point of view.
>>>
>>> If fetching that state at the same time as vq indexes is not valid,
>>> could it follow the same model as the "in-flight descriptors"?
>>> vhost-user follows them by using a shared memory region where their
>>> state is tracked [1]. This allows qemu to survive vhost-user SW
>>> backend crashes, and does not forbid the cross-backends live migration
>>> as all the information is there to recover them.
>>>
>>> For hw devices this is not convenient as it occupies PCI bandwidth. So
>>> a possibility is to synchronize this memory region after a
>>> synchronization point, being the SUSPEND call or GET_VRING_BASE. HW
>>> devices are not going to crash in the software sense, so all use cases
>>> remain the same to qemu. And that shared memory information is
>>> recoverable after vhost_dev_stop.
>>>
>>> Does that sound reasonable to virtiofsd? To offer a shared memory
>>> region where it dumps the state, maybe only after the
>>> set_state(STATE_PHASE_STOPPED)?
>> I don’t think we need the set_state() call, necessarily, if SUSPEND is
>> mandatory anyway.
>>
>> As for the shared memory, the RFC before this series used shared memory,
>> so it’s possible, yes.  But “shared memory region” can mean a lot of
>> things – it sounds like you’re saying the back-end (virtiofsd) should
>> provide it to the front-end, is that right?  That could work like this:
>>
>> On the source side:
>>
>> S1. SUSPEND goes to virtiofsd
>> S2. virtiofsd maybe double-checks that the device is stopped, then
>> serializes its state into a newly allocated shared memory area[1]
>> S3. virtiofsd responds to SUSPEND
>> S4. front-end requests shared memory, virtiofsd responds with a handle,
>> maybe already closes its reference
>> S5. front-end saves state, closes its handle, freeing the SHM
>>
>> [1] Maybe virtiofsd can correctly size the serialized state’s size, then
>> it can immediately allocate this area and serialize directly into it;
>> maybe it can’t, then we’ll need a bounce buffer.  Not really a
>> fundamental problem, but there are limitations around what you can do
>> with serde implementations in Rust…
>>
>> On the destination side:
>>
>> D1. Optional SUSPEND goes to virtiofsd that hasn’t yet done much;
>> virtiofsd would serialize its empty state into an SHM area, and respond
>> to SUSPEND
>> D2. front-end reads state from migration stream into an SHM it has allocated
>> D3. front-end supplies this SHM to virtiofsd, which discards its
>> previous area, and now uses this one
>> D4. RESUME goes to virtiofsd, which deserializes the state from the SHM
>>
>> Couple of questions:
>>
>> A. Stefan suggested D1, but it does seem wasteful now.  But if SUSPEND
>> would imply to deserialize a state, and the state is to be transferred
>> through SHM, this is what would need to be done.  So maybe we should
>> skip SUSPEND on the destination?
>> B. You described that the back-end should supply the SHM, which works
>> well on the source.  On the destination, only the front-end knows how
>> big the state is, so I’ve decided above that it should allocate the SHM
>> (D2) and provide it to the back-end.  Is that feasible or is it
>> important (e.g. for real hardware) that the back-end supplies the SHM?
>> (In which case the front-end would need to tell the back-end how big the
>> state SHM needs to be.)
> How does this work for iterative live migration?

Right, probably not at all. :)

Hanna
Stefan Hajnoczi April 19, 2023, 11:24 a.m. UTC | #40
On Wed, 19 Apr 2023 at 07:16, Hanna Czenczek <hreitz@redhat.com> wrote:
>
> On 19.04.23 13:10, Stefan Hajnoczi wrote:
> > On Wed, 19 Apr 2023 at 06:57, Hanna Czenczek <hreitz@redhat.com> wrote:
> >> On 18.04.23 19:59, Stefan Hajnoczi wrote:
> >>> On Tue, Apr 18, 2023 at 10:09:30AM +0200, Eugenio Perez Martin wrote:
> >>>> On Mon, Apr 17, 2023 at 9:33 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >>>>> On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin <eperezma@redhat.com> wrote:
> >>>>>> On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>>>>>> On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
> >>>>>>>> On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>>>>>>>> On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> >>>>>>>>>> So-called "internal" virtio-fs migration refers to transporting the
> >>>>>>>>>> back-end's (virtiofsd's) state through qemu's migration stream.  To do
> >>>>>>>>>> this, we need to be able to transfer virtiofsd's internal state to and
> >>>>>>>>>> from virtiofsd.
> >>>>>>>>>>
> >>>>>>>>>> Because virtiofsd's internal state will not be too large, we believe it
> >>>>>>>>>> is best to transfer it as a single binary blob after the streaming
> >>>>>>>>>> phase.  Because this method should be useful to other vhost-user
> >>>>>>>>>> implementations, too, it is introduced as a general-purpose addition to
> >>>>>>>>>> the protocol, not limited to vhost-user-fs.
> >>>>>>>>>>
> >>>>>>>>>> These are the additions to the protocol:
> >>>>>>>>>> - New vhost-user protocol feature VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> >>>>>>>>>>     This feature signals support for transferring state, and is added so
> >>>>>>>>>>     that migration can fail early when the back-end has no support.
> >>>>>>>>>>
> >>>>>>>>>> - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate a pipe
> >>>>>>>>>>     over which to transfer the state.  The front-end sends an FD to the
> >>>>>>>>>>     back-end into/from which it can write/read its state, and the back-end
> >>>>>>>>>>     can decide to either use it, or reply with a different FD for the
> >>>>>>>>>>     front-end to override the front-end's choice.
> >>>>>>>>>>     The front-end creates a simple pipe to transfer the state, but maybe
> >>>>>>>>>>     the back-end already has an FD into/from which it has to write/read
> >>>>>>>>>>     its state, in which case it will want to override the simple pipe.
> >>>>>>>>>>     Conversely, maybe in the future we find a way to have the front-end
> >>>>>>>>>>     get an immediate FD for the migration stream (in some cases), in which
> >>>>>>>>>>     case we will want to send this to the back-end instead of creating a
> >>>>>>>>>>     pipe.
> >>>>>>>>>>     Hence the negotiation: If one side has a better idea than a plain
> >>>>>>>>>>     pipe, we will want to use that.
> >>>>>>>>>>
> >>>>>>>>>> - CHECK_DEVICE_STATE: After the state has been transferred through the
> >>>>>>>>>>     pipe (the end indicated by EOF), the front-end invokes this function
> >>>>>>>>>>     to verify success.  There is no in-band way (through the pipe) to
> >>>>>>>>>>     indicate failure, so we need to check explicitly.
> >>>>>>>>>>
> >>>>>>>>>> Once the transfer pipe has been established via SET_DEVICE_STATE_FD
> >>>>>>>>>> (which includes establishing the direction of transfer and migration
> >>>>>>>>>> phase), the sending side writes its data into the pipe, and the reading
> >>>>>>>>>> side reads it until it sees an EOF.  Then, the front-end will check for
> >>>>>>>>>> success via CHECK_DEVICE_STATE, which on the destination side includes
> >>>>>>>>>> checking for integrity (i.e. errors during deserialization).
> >>>>>>>>>>
> >>>>>>>>>> Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> >>>>>>>>>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> >>>>>>>>>> ---
> >>>>>>>>>>    include/hw/virtio/vhost-backend.h |  24 +++++
> >>>>>>>>>>    include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> >>>>>>>>>>    hw/virtio/vhost-user.c            | 147 ++++++++++++++++++++++++++++++
> >>>>>>>>>>    hw/virtio/vhost.c                 |  37 ++++++++
> >>>>>>>>>>    4 files changed, 287 insertions(+)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> >>>>>>>>>> index ec3fbae58d..5935b32fe3 100644
> >>>>>>>>>> --- a/include/hw/virtio/vhost-backend.h
> >>>>>>>>>> +++ b/include/hw/virtio/vhost-backend.h
> >>>>>>>>>> @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> >>>>>>>>>>        VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> >>>>>>>>>>    } VhostSetConfigType;
> >>>>>>>>>>
> >>>>>>>>>> +typedef enum VhostDeviceStateDirection {
> >>>>>>>>>> +    /* Transfer state from back-end (device) to front-end */
> >>>>>>>>>> +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> >>>>>>>>>> +    /* Transfer state from front-end to back-end (device) */
> >>>>>>>>>> +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> >>>>>>>>>> +} VhostDeviceStateDirection;
> >>>>>>>>>> +
> >>>>>>>>>> +typedef enum VhostDeviceStatePhase {
> >>>>>>>>>> +    /* The device (and all its vrings) is stopped */
> >>>>>>>>>> +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> >>>>>>>>>> +} VhostDeviceStatePhase;
> >>>>>>>>> vDPA has:
> >>>>>>>>>
> >>>>>>>>>     /* Suspend a device so it does not process virtqueue requests anymore
> >>>>>>>>>      *
> >>>>>>>>>      * After the return of ioctl the device must preserve all the necessary state
> >>>>>>>>>      * (the virtqueue vring base plus the possible device specific states) that is
> >>>>>>>>>      * required for restoring in the future. The device must not change its
> >>>>>>>>>      * configuration after that point.
> >>>>>>>>>      */
> >>>>>>>>>     #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> >>>>>>>>>
> >>>>>>>>>     /* Resume a device so it can resume processing virtqueue requests
> >>>>>>>>>      *
> >>>>>>>>>      * After the return of this ioctl the device will have restored all the
> >>>>>>>>>      * necessary states and it is fully operational to continue processing the
> >>>>>>>>>      * virtqueue descriptors.
> >>>>>>>>>      */
> >>>>>>>>>     #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> >>>>>>>>>
> >>>>>>>>> I wonder if it makes sense to import these into vhost-user so that the
> >>>>>>>>> difference between kernel vhost and vhost-user is minimized. It's okay
> >>>>>>>>> if one of them is ahead of the other, but it would be nice to avoid
> >>>>>>>>> overlapping/duplicated functionality.
> >>>>>>>>>
> >>>>>>>> That's what I had in mind in the first versions. I proposed VHOST_STOP
> >>>>>>>> instead of VHOST_VDPA_STOP for this very reason. Later it did change
> >>>>>>>> to SUSPEND.
> >>>>>>> I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not
> >>>>>>> ioctl(VHOST_VDPA_RESUME).
> >>>>>>>
> >>>>>>> The doc comments in <linux/vdpa.h> don't explain how the device can
> >>>>>>> leave the suspended state. Can you clarify this?
> >>>>>>>
> >>>>>> Do you mean in what situations or regarding the semantics of _RESUME?
> >>>>>>
> >>>>>> To me resume is an operation mainly to resume the device in the event
> >>>>>> of a VM suspension, not a migration. It can be used as a fallback code
> >>>>>> in some cases of migration failure though, but it is not currently
> >>>>>> used in qemu.
> >>>>> Is a "VM suspension" the QEMU HMP 'stop' command?
> >>>>>
> >>>>> I guess the reason why QEMU doesn't call RESUME anywhere is that it
> >>>>> resets the device in vhost_dev_stop()?
> >>>>>
> >>>> The actual reason for not using RESUME is that the ioctl was added
> >>>> after the SUSPEND design in qemu. Same as this proposal, it is was not
> >>>> needed at the time.
> >>>>
> >>>> In the case of vhost-vdpa net, the only usage of suspend is to fetch
> >>>> the vq indexes, and in case of error vhost already fetches them from
> >>>> guest's used ring way before vDPA, so it has little usage.
> >>>>
> >>>>> Does it make sense to combine SUSPEND and RESUME with Hanna's
> >>>>> SET_DEVICE_STATE_FD? For example, non-iterative migration works like
> >>>>> this:
> >>>>> - Saving the device's state is done by SUSPEND followed by
> >>>>> SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g.
> >>>>> savevm command or migration failed), then RESUME is called to
> >>>>> continue.
> >>>> I think the previous steps make sense at vhost_dev_stop, not virtio
> >>>> savevm handlers. To start spreading this logic to more places of qemu
> >>>> can bring confusion.
> >>> I don't think there is a way around extending the QEMU vhost's code
> >>> model. The current model in QEMU's vhost code is that the backend is
> >>> reset when the VM stops. This model worked fine for stateless devices
> >>> but it doesn't work for stateful devices.
> >>>
> >>> Imagine a vdpa-gpu device: you cannot reset the device in
> >>> vhost_dev_stop() and expect the GPU to continue working when
> >>> vhost_dev_start() is called again because all its state has been lost.
> >>> The guest driver will send requests that references a virtio-gpu
> >>> resources that no longer exist.
> >>>
> >>> One solution is to save the device's state in vhost_dev_stop(). I think
> >>> this is what you're suggesting. It requires keeping a copy of the state
> >>> and then loading the state again in vhost_dev_start(). I don't think
> >>> this approach should be used because it requires all stateful devices to
> >>> support live migration (otherwise they break across HMP 'stop'/'cont').
> >>> Also, the device state for some devices may be large and it would also
> >>> become more complicated when iterative migration is added.
> >>>
> >>> Instead, I think the QEMU vhost code needs to be structured so that
> >>> struct vhost_dev has a suspended state:
> >>>
> >>>           ,---------.
> >>>        v         |
> >>>     started ------> stopped
> >>>       \   ^
> >>>        \  |
> >>>         -> suspended
> >>>
> >>> The device doesn't lose state when it enters the suspended state. It can
> >>> be resumed again.
> >>>
> >>> This is why I think SUSPEND/RESUME need to be part of the solution.
> >>> (It's also an argument for not including the phase argument in
> >>> SET_DEVICE_STATE_FD because the SUSPEND message is sent during
> >>> vhost_dev_stop() separately from saving the device's state.)
> >> So let me ask if I understand this protocol correctly: Basically,
> >> SUSPEND would ask the device to fully serialize its internal state,
> >> retain it in some buffer, and RESUME would then deserialize the state
> >> from the buffer, right?
> > That's not how I understand SUSPEND/RESUME. I was thinking that
> > SUSPEND pauses device operation so that virtqueues are no longer
> > processed and no other events occur (e.g. VIRTIO Configuration Change
> > Notifications). RESUME continues device operation. Neither command is
> > directly related to device state serialization but SUSPEND freezes the
> > device state, while RESUME allows the device state to change again.
>
> I understood that a reset would basically reset all internal state,
> which is why SUSPEND+RESUME were required around it, to retain the state.

The SUSPEND/RESUME operations I'm thinking of come directly from
<linux/vhost.h>:

/* Suspend a device so it does not process virtqueue requests anymore
 *
 * After the return of ioctl the device must preserve all the necessary state
 * (the virtqueue vring base plus the possible device specific states) that is
 * required for restoring in the future. The device must not change its
 * configuration after that point.
 */
#define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)

/* Resume a device so it can resume processing virtqueue requests
 *
 * After the return of this ioctl the device will have restored all the
 * necessary states and it is fully operational to continue processing the
 * virtqueue descriptors.
 */
#define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)

> >> While this state needn’t necessarily be immediately migratable, I
> >> suppose (e.g. one could retain file descriptors there, and it doesn’t
> >> need to be a serialized byte buffer, but could still be structured), it
> >> would basically be a live migration implementation already.  As far as I
> >> understand, that’s why you suggest not running a SUSPEND+RESUME cycle on
> >> anything but live migration, right?
> > No, SUSPEND/RESUME would also be used across vm_stop()/vm_start().
> > That way stateful devices are no longer reset across HMP 'stop'/'cont'
> > (we're lucky it even works for most existing vhost-user backends today
> > and that's just because they don't yet implement
> > VHOST_USER_SET_STATUS).
>
> So that’s what I seem to misunderstand: If stateful devices are reset,
> how does SUSPEND+RESUME prevent that?

The vhost-user frontend can check the VHOST_USER_PROTOCOL_F_SUSPEND
feature bit to determine that the backend supports SUSPEND/RESUME and
that mechanism should be used instead of resetting the device.

Stefan
Eugenio Perez Martin April 20, 2023, 10:44 a.m. UTC | #41
On Wed, 2023-04-19 at 13:10 +0200, Hanna Czenczek wrote:
> On 18.04.23 09:54, Eugenio Perez Martin wrote:
> > On Mon, Apr 17, 2023 at 9:21 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > > On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com>
> > > wrote:
> > > > On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com>
> > > > wrote:
> > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin wrote:
> > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <
> > > > > > stefanha@redhat.com> wrote:
> > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > > > > So-called "internal" virtio-fs migration refers to transporting
> > > > > > > > the
> > > > > > > > back-end's (virtiofsd's) state through qemu's migration
> > > > > > > > stream.  To do
> > > > > > > > this, we need to be able to transfer virtiofsd's internal state
> > > > > > > > to and
> > > > > > > > from virtiofsd.
> > > > > > > > 
> > > > > > > > Because virtiofsd's internal state will not be too large, we
> > > > > > > > believe it
> > > > > > > > is best to transfer it as a single binary blob after the
> > > > > > > > streaming
> > > > > > > > phase.  Because this method should be useful to other vhost-user
> > > > > > > > implementations, too, it is introduced as a general-purpose
> > > > > > > > addition to
> > > > > > > > the protocol, not limited to vhost-user-fs.
> > > > > > > > 
> > > > > > > > These are the additions to the protocol:
> > > > > > > > - New vhost-user protocol feature
> > > > > > > > VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > > > > >    This feature signals support for transferring state, and is
> > > > > > > > added so
> > > > > > > >    that migration can fail early when the back-end has no
> > > > > > > > support.
> > > > > > > > 
> > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end negotiate
> > > > > > > > a pipe
> > > > > > > >    over which to transfer the state.  The front-end sends an FD
> > > > > > > > to the
> > > > > > > >    back-end into/from which it can write/read its state, and the
> > > > > > > > back-end
> > > > > > > >    can decide to either use it, or reply with a different FD for
> > > > > > > > the
> > > > > > > >    front-end to override the front-end's choice.
> > > > > > > >    The front-end creates a simple pipe to transfer the state,
> > > > > > > > but maybe
> > > > > > > >    the back-end already has an FD into/from which it has to
> > > > > > > > write/read
> > > > > > > >    its state, in which case it will want to override the simple
> > > > > > > > pipe.
> > > > > > > >    Conversely, maybe in the future we find a way to have the
> > > > > > > > front-end
> > > > > > > >    get an immediate FD for the migration stream (in some cases),
> > > > > > > > in which
> > > > > > > >    case we will want to send this to the back-end instead of
> > > > > > > > creating a
> > > > > > > >    pipe.
> > > > > > > >    Hence the negotiation: If one side has a better idea than a
> > > > > > > > plain
> > > > > > > >    pipe, we will want to use that.
> > > > > > > > 
> > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred
> > > > > > > > through the
> > > > > > > >    pipe (the end indicated by EOF), the front-end invokes this
> > > > > > > > function
> > > > > > > >    to verify success.  There is no in-band way (through the
> > > > > > > > pipe) to
> > > > > > > >    indicate failure, so we need to check explicitly.
> > > > > > > > 
> > > > > > > > Once the transfer pipe has been established via
> > > > > > > > SET_DEVICE_STATE_FD
> > > > > > > > (which includes establishing the direction of transfer and
> > > > > > > > migration
> > > > > > > > phase), the sending side writes its data into the pipe, and the
> > > > > > > > reading
> > > > > > > > side reads it until it sees an EOF.  Then, the front-end will
> > > > > > > > check for
> > > > > > > > success via CHECK_DEVICE_STATE, which on the destination side
> > > > > > > > includes
> > > > > > > > checking for integrity (i.e. errors during deserialization).
> > > > > > > > 
> > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > > > ---
> > > > > > > >   include/hw/virtio/vhost-backend.h |  24 +++++
> > > > > > > >   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > > > > >   hw/virtio/vhost-user.c            | 147
> > > > > > > > ++++++++++++++++++++++++++++++
> > > > > > > >   hw/virtio/vhost.c                 |  37 ++++++++
> > > > > > > >   4 files changed, 287 insertions(+)
> > > > > > > > 
> > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h
> > > > > > > > b/include/hw/virtio/vhost-backend.h
> > > > > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > > > > >       VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > > > > >   } VhostSetConfigType;
> > > > > > > > 
> > > > > > > > +typedef enum VhostDeviceStateDirection {
> > > > > > > > +    /* Transfer state from back-end (device) to front-end */
> > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > > > > +    /* Transfer state from front-end to back-end (device) */
> > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > > > > +} VhostDeviceStateDirection;
> > > > > > > > +
> > > > > > > > +typedef enum VhostDeviceStatePhase {
> > > > > > > > +    /* The device (and all its vrings) is stopped */
> > > > > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > > > > +} VhostDeviceStatePhase;
> > > > > > > vDPA has:
> > > > > > > 
> > > > > > >    /* Suspend a device so it does not process virtqueue requests
> > > > > > > anymore
> > > > > > >     *
> > > > > > >     * After the return of ioctl the device must preserve all the
> > > > > > > necessary state
> > > > > > >     * (the virtqueue vring base plus the possible device specific
> > > > > > > states) that is
> > > > > > >     * required for restoring in the future. The device must not
> > > > > > > change its
> > > > > > >     * configuration after that point.
> > > > > > >     */
> > > > > > >    #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > > > > > 
> > > > > > >    /* Resume a device so it can resume processing virtqueue
> > > > > > > requests
> > > > > > >     *
> > > > > > >     * After the return of this ioctl the device will have restored
> > > > > > > all the
> > > > > > >     * necessary states and it is fully operational to continue
> > > > > > > processing the
> > > > > > >     * virtqueue descriptors.
> > > > > > >     */
> > > > > > >    #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > > > > > 
> > > > > > > I wonder if it makes sense to import these into vhost-user so that
> > > > > > > the
> > > > > > > difference between kernel vhost and vhost-user is minimized. It's
> > > > > > > okay
> > > > > > > if one of them is ahead of the other, but it would be nice to
> > > > > > > avoid
> > > > > > > overlapping/duplicated functionality.
> > > > > > > 
> > > > > > That's what I had in mind in the first versions. I proposed
> > > > > > VHOST_STOP
> > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did change
> > > > > > to SUSPEND.
> > > > > > 
> > > > > > Generally it is better if we make the interface less parametrized
> > > > > > and
> > > > > > we trust in the messages and its semantics in my opinion. In other
> > > > > > words, instead of
> > > > > > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED),
> > > > > > send
> > > > > > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user
> > > > > > command.
> > > > > > 
> > > > > > Another way to apply this is with the "direction" parameter. Maybe
> > > > > > it
> > > > > > is better to split it into "set_state_fd" and "get_state_fd"?
> > > > > > 
> > > > > > In that case, reusing the ioctls as vhost-user messages would be ok.
> > > > > > But that puts this proposal further from the VFIO code, which uses
> > > > > > "migration_set_state(state)", and maybe it is better when the number
> > > > > > of states is high.
> > > > > Hi Eugenio,
> > > > > Another question about vDPA suspend/resume:
> > > > > 
> > > > >    /* Host notifiers must be enabled at this point. */
> > > > >    void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev,
> > > > > bool vrings)
> > > > >    {
> > > > >        int i;
> > > > > 
> > > > >        /* should only be called after backend is connected */
> > > > >        assert(hdev->vhost_ops);
> > > > >        event_notifier_test_and_clear(
> > > > >            &hdev-
> > > > > >vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier);
> > > > >        event_notifier_test_and_clear(&vdev->config_notifier);
> > > > > 
> > > > >        trace_vhost_dev_stop(hdev, vdev->name, vrings);
> > > > > 
> > > > >        if (hdev->vhost_ops->vhost_dev_start) {
> > > > >            hdev->vhost_ops->vhost_dev_start(hdev, false);
> > > > >            ^^^ SUSPEND ^^^
> > > > >        }
> > > > >        if (vrings) {
> > > > >            vhost_dev_set_vring_enable(hdev, false);
> > > > >        }
> > > > >        for (i = 0; i < hdev->nvqs; ++i) {
> > > > >            vhost_virtqueue_stop(hdev,
> > > > >                                 vdev,
> > > > >                                 hdev->vqs + i,
> > > > >                                 hdev->vq_index + i);
> > > > >          ^^^ fetch virtqueue state from kernel ^^^
> > > > >        }
> > > > >        if (hdev->vhost_ops->vhost_reset_status) {
> > > > >            hdev->vhost_ops->vhost_reset_status(hdev);
> > > > >            ^^^ reset device^^^
> > > > > 
> > > > > I noticed the QEMU vDPA code resets the device in vhost_dev_stop() ->
> > > > > vhost_reset_status(). The device's migration code runs after
> > > > > vhost_dev_stop() and the state will have been lost.
> > > > > 
> > > > vhost_virtqueue_stop saves the vq state (indexes, vring base) in the
> > > > qemu VirtIONet device model. This is for all vhost backends.
> > > > 
> > > > Regarding the state like mac or mq configuration, SVQ runs for all the
> > > > VM run in the CVQ. So it can track all of that status in the device
> > > > model too.
> > > > 
> > > > When a migration effectively occurs, all the frontend state is
> > > > migrated as a regular emulated device. To route all of the state in a
> > > > normalized way for qemu is what leaves open the possibility to do
> > > > cross-backends migrations, etc.
> > > > 
> > > > Does that answer your question?
> > > I think you're confirming that changes would be necessary in order for
> > > vDPA to support the save/load operation that Hanna is introducing.
> > > 
> > Yes, this first iteration was centered on net, with an eye on block,
> > where state can be routed through classical emulated devices. This is
> > how vhost-kernel and vhost-user do classically. And it allows
> > cross-backend, to not modify qemu migration state, etc.
> > 
> > To introduce this opaque state to qemu, that must be fetched after the
> > suspend and not before, requires changes in vhost protocol, as
> > discussed previously.
> > 
> > > > > It looks like vDPA changes are necessary in order to support stateful
> > > > > devices even though QEMU already uses SUSPEND. Is my understanding
> > > > > correct?
> > > > > 
> > > > Changes are required elsewhere, as the code to restore the state
> > > > properly in the destination has not been merged.
> > > I'm not sure what you mean by elsewhere?
> > > 
> > I meant for vdpa *net* devices the changes are not required in vdpa
> > ioctls, but mostly in qemu.
> > 
> > If you meant stateful as "it must have a state blob that it must be
> > opaque to qemu", then I think the straightforward action is to fetch
> > state blob about the same time as vq indexes. But yes, changes (at
> > least a new ioctl) is needed for that.
> > 
> > > I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and
> > > then VHOST_VDPA_SET_STATUS 0.
> > > 
> > > In order to save device state from the vDPA device in the future, it
> > > will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that
> > > the device state can be saved before the device is reset.
> > > 
> > > Does that sound right?
> > > 
> > The split between suspend and reset was added recently for that very
> > reason. In all the virtio devices, the frontend is initialized before
> > the backend, so I don't think it is a good idea to defer the backend
> > cleanup. Especially if we have already set the state is small enough
> > to not needing iterative migration from virtiofsd point of view.
> > 
> > If fetching that state at the same time as vq indexes is not valid,
> > could it follow the same model as the "in-flight descriptors"?
> > vhost-user follows them by using a shared memory region where their
> > state is tracked [1]. This allows qemu to survive vhost-user SW
> > backend crashes, and does not forbid the cross-backends live migration
> > as all the information is there to recover them.
> > 
> > For hw devices this is not convenient as it occupies PCI bandwidth. So
> > a possibility is to synchronize this memory region after a
> > synchronization point, being the SUSPEND call or GET_VRING_BASE. HW
> > devices are not going to crash in the software sense, so all use cases
> > remain the same to qemu. And that shared memory information is
> > recoverable after vhost_dev_stop.
> > 
> > Does that sound reasonable to virtiofsd? To offer a shared memory
> > region where it dumps the state, maybe only after the
> > set_state(STATE_PHASE_STOPPED)?
> 
> I don’t think we need the set_state() call, necessarily, if SUSPEND is 
> mandatory anyway.
> 

Right, I was taking them as interchangeable.

Note that I just put this on the table because it solves another use case
(transfer stateful devices + virtiofsd crash) with an interface that mimics
another one already existing.  I don't want to block the pipe proposal at all.

> As for the shared memory, the RFC before this series used shared memory, 
> so it’s possible, yes.  But “shared memory region” can mean a lot of 
> things – it sounds like you’re saying the back-end (virtiofsd) should 
> provide it to the front-end, is that right?  That could work like this:
> 

inflight_fd provides both calls: VHOST_USER_SET_INFLIGHT_FD and
VHOST_USER_GET_INFLIGHT_FD.

> On the source side:
> 
> S1. SUSPEND goes to virtiofsd
> S2. virtiofsd maybe double-checks that the device is stopped, then 
> serializes its state into a newly allocated shared memory area[1]
> S3. virtiofsd responds to SUSPEND
> S4. front-end requests shared memory, virtiofsd responds with a handle, 
> maybe already closes its reference
> S5. front-end saves state, closes its handle, freeing the SHM
> 
> [1] Maybe virtiofsd can correctly size the serialized state’s size, then 
> it can immediately allocate this area and serialize directly into it; 
> maybe it can’t, then we’ll need a bounce buffer.  Not really a 
> fundamental problem, but there are limitations around what you can do 
> with serde implementations in Rust…
> 

I think shared memory regions can grow and shrink with ftruncate, but it
complicates the solution for sure.  I was under the impression it will be a
fixed amount of state, probably based on some actual limits in the vfs.  Now I
see it is not.

> On the destination side:
> 
> D1. Optional SUSPEND goes to virtiofsd that hasn’t yet done much; 
> virtiofsd would serialize its empty state into an SHM area, and respond 
> to SUSPEND
> D2. front-end reads state from migration stream into an SHM it has allocated
> D3. front-end supplies this SHM to virtiofsd, which discards its 
> previous area, and now uses this one
> D4. RESUME goes to virtiofsd, which deserializes the state from the SHM
> 
> Couple of questions:
> 
> A. Stefan suggested D1, but it does seem wasteful now.  But if SUSPEND 
> would imply to deserialize a state, and the state is to be transferred 
> through SHM, this is what would need to be done.  So maybe we should 
> skip SUSPEND on the destination?

I think to skip suspend is the best call, yes.

> B. You described that the back-end should supply the SHM, which works 
> well on the source.  On the destination, only the front-end knows how 
> big the state is, so I’ve decided above that it should allocate the SHM 
> (D2) and provide it to the back-end.  Is that feasible or is it 
> important (e.g. for real hardware) that the back-end supplies the SHM?  
> (In which case the front-end would need to tell the back-end how big the 
> state SHM needs to be.)
> 

It is feasible for sure.  I think that the best scenario is when the data has a
fixed size forever, like QueueRegionSplit and QueueRegionPacked.  If that is not
possible, I think the best is to indicate the length of the data so the device
can fetch in strides as large as possible.  Maybe other HW guys can answer
better here though.

Thanks!
Eugenio Perez Martin April 20, 2023, 1:27 p.m. UTC | #42
On Tue, 2023-04-18 at 16:40 -0400, Stefan Hajnoczi wrote:
> On Tue, 18 Apr 2023 at 14:31, Eugenio Perez Martin <eperezma@redhat.com>
> wrote:
> > On Tue, Apr 18, 2023 at 7:59 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > On Tue, Apr 18, 2023 at 10:09:30AM +0200, Eugenio Perez Martin wrote:
> > > > On Mon, Apr 17, 2023 at 9:33 PM Stefan Hajnoczi <stefanha@gmail.com>
> > > > wrote:
> > > > > On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin <
> > > > > eperezma@redhat.com> wrote:
> > > > > > On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com
> > > > > > > wrote:
> > > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin
> > > > > > > wrote:
> > > > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <
> > > > > > > > stefanha@redhat.com> wrote:
> > > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek
> > > > > > > > > wrote:
> > > > > > > > > > So-called "internal" virtio-fs migration refers to
> > > > > > > > > > transporting the
> > > > > > > > > > back-end's (virtiofsd's) state through qemu's migration
> > > > > > > > > > stream.  To do
> > > > > > > > > > this, we need to be able to transfer virtiofsd's internal
> > > > > > > > > > state to and
> > > > > > > > > > from virtiofsd.
> > > > > > > > > > 
> > > > > > > > > > Because virtiofsd's internal state will not be too large, we
> > > > > > > > > > believe it
> > > > > > > > > > is best to transfer it as a single binary blob after the
> > > > > > > > > > streaming
> > > > > > > > > > phase.  Because this method should be useful to other vhost-
> > > > > > > > > > user
> > > > > > > > > > implementations, too, it is introduced as a general-purpose
> > > > > > > > > > addition to
> > > > > > > > > > the protocol, not limited to vhost-user-fs.
> > > > > > > > > > 
> > > > > > > > > > These are the additions to the protocol:
> > > > > > > > > > - New vhost-user protocol feature
> > > > > > > > > > VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > > > > > > >   This feature signals support for transferring state, and
> > > > > > > > > > is added so
> > > > > > > > > >   that migration can fail early when the back-end has no
> > > > > > > > > > support.
> > > > > > > > > > 
> > > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end
> > > > > > > > > > negotiate a pipe
> > > > > > > > > >   over which to transfer the state.  The front-end sends an
> > > > > > > > > > FD to the
> > > > > > > > > >   back-end into/from which it can write/read its state, and
> > > > > > > > > > the back-end
> > > > > > > > > >   can decide to either use it, or reply with a different FD
> > > > > > > > > > for the
> > > > > > > > > >   front-end to override the front-end's choice.
> > > > > > > > > >   The front-end creates a simple pipe to transfer the state,
> > > > > > > > > > but maybe
> > > > > > > > > >   the back-end already has an FD into/from which it has to
> > > > > > > > > > write/read
> > > > > > > > > >   its state, in which case it will want to override the
> > > > > > > > > > simple pipe.
> > > > > > > > > >   Conversely, maybe in the future we find a way to have the
> > > > > > > > > > front-end
> > > > > > > > > >   get an immediate FD for the migration stream (in some
> > > > > > > > > > cases), in which
> > > > > > > > > >   case we will want to send this to the back-end instead of
> > > > > > > > > > creating a
> > > > > > > > > >   pipe.
> > > > > > > > > >   Hence the negotiation: If one side has a better idea than
> > > > > > > > > > a plain
> > > > > > > > > >   pipe, we will want to use that.
> > > > > > > > > > 
> > > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred
> > > > > > > > > > through the
> > > > > > > > > >   pipe (the end indicated by EOF), the front-end invokes
> > > > > > > > > > this function
> > > > > > > > > >   to verify success.  There is no in-band way (through the
> > > > > > > > > > pipe) to
> > > > > > > > > >   indicate failure, so we need to check explicitly.
> > > > > > > > > > 
> > > > > > > > > > Once the transfer pipe has been established via
> > > > > > > > > > SET_DEVICE_STATE_FD
> > > > > > > > > > (which includes establishing the direction of transfer and
> > > > > > > > > > migration
> > > > > > > > > > phase), the sending side writes its data into the pipe, and
> > > > > > > > > > the reading
> > > > > > > > > > side reads it until it sees an EOF.  Then, the front-end
> > > > > > > > > > will check for
> > > > > > > > > > success via CHECK_DEVICE_STATE, which on the destination
> > > > > > > > > > side includes
> > > > > > > > > > checking for integrity (i.e. errors during deserialization).
> > > > > > > > > > 
> > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > > > > > ---
> > > > > > > > > >  include/hw/virtio/vhost-backend.h |  24 +++++
> > > > > > > > > >  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > > > > > > >  hw/virtio/vhost-user.c            | 147
> > > > > > > > > > ++++++++++++++++++++++++++++++
> > > > > > > > > >  hw/virtio/vhost.c                 |  37 ++++++++
> > > > > > > > > >  4 files changed, 287 insertions(+)
> > > > > > > > > > 
> > > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h
> > > > > > > > > > b/include/hw/virtio/vhost-backend.h
> > > > > > > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > > > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > > > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > > > > > > >      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > > > > > > >  } VhostSetConfigType;
> > > > > > > > > > 
> > > > > > > > > > +typedef enum VhostDeviceStateDirection {
> > > > > > > > > > +    /* Transfer state from back-end (device) to front-end
> > > > > > > > > > */
> > > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > > > > > > +    /* Transfer state from front-end to back-end (device)
> > > > > > > > > > */
> > > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > > > > > > +} VhostDeviceStateDirection;
> > > > > > > > > > +
> > > > > > > > > > +typedef enum VhostDeviceStatePhase {
> > > > > > > > > > +    /* The device (and all its vrings) is stopped */
> > > > > > > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > > > > > > +} VhostDeviceStatePhase;
> > > > > > > > > 
> > > > > > > > > vDPA has:
> > > > > > > > > 
> > > > > > > > >   /* Suspend a device so it does not process virtqueue
> > > > > > > > > requests anymore
> > > > > > > > >    *
> > > > > > > > >    * After the return of ioctl the device must preserve all
> > > > > > > > > the necessary state
> > > > > > > > >    * (the virtqueue vring base plus the possible device
> > > > > > > > > specific states) that is
> > > > > > > > >    * required for restoring in the future. The device must not
> > > > > > > > > change its
> > > > > > > > >    * configuration after that point.
> > > > > > > > >    */
> > > > > > > > >   #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > > > > > > > 
> > > > > > > > >   /* Resume a device so it can resume processing virtqueue
> > > > > > > > > requests
> > > > > > > > >    *
> > > > > > > > >    * After the return of this ioctl the device will have
> > > > > > > > > restored all the
> > > > > > > > >    * necessary states and it is fully operational to continue
> > > > > > > > > processing the
> > > > > > > > >    * virtqueue descriptors.
> > > > > > > > >    */
> > > > > > > > >   #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > > > > > > > 
> > > > > > > > > I wonder if it makes sense to import these into vhost-user so
> > > > > > > > > that the
> > > > > > > > > difference between kernel vhost and vhost-user is minimized.
> > > > > > > > > It's okay
> > > > > > > > > if one of them is ahead of the other, but it would be nice to
> > > > > > > > > avoid
> > > > > > > > > overlapping/duplicated functionality.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > That's what I had in mind in the first versions. I proposed
> > > > > > > > VHOST_STOP
> > > > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did
> > > > > > > > change
> > > > > > > > to SUSPEND.
> > > > > > > 
> > > > > > > I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not
> > > > > > > ioctl(VHOST_VDPA_RESUME).
> > > > > > > 
> > > > > > > The doc comments in <linux/vdpa.h> don't explain how the device
> > > > > > > can
> > > > > > > leave the suspended state. Can you clarify this?
> > > > > > > 
> > > > > > 
> > > > > > Do you mean in what situations or regarding the semantics of
> > > > > > _RESUME?
> > > > > > 
> > > > > > To me resume is an operation mainly to resume the device in the
> > > > > > event
> > > > > > of a VM suspension, not a migration. It can be used as a fallback
> > > > > > code
> > > > > > in some cases of migration failure though, but it is not currently
> > > > > > used in qemu.
> > > > > 
> > > > > Is a "VM suspension" the QEMU HMP 'stop' command?
> > > > > 
> > > > > I guess the reason why QEMU doesn't call RESUME anywhere is that it
> > > > > resets the device in vhost_dev_stop()?
> > > > > 
> > > > 
> > > > The actual reason for not using RESUME is that the ioctl was added
> > > > after the SUSPEND design in qemu. Same as this proposal, it is was not
> > > > needed at the time.
> > > > 
> > > > In the case of vhost-vdpa net, the only usage of suspend is to fetch
> > > > the vq indexes, and in case of error vhost already fetches them from
> > > > guest's used ring way before vDPA, so it has little usage.
> > > > 
> > > > > Does it make sense to combine SUSPEND and RESUME with Hanna's
> > > > > SET_DEVICE_STATE_FD? For example, non-iterative migration works like
> > > > > this:
> > > > > - Saving the device's state is done by SUSPEND followed by
> > > > > SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g.
> > > > > savevm command or migration failed), then RESUME is called to
> > > > > continue.
> > > > 
> > > > I think the previous steps make sense at vhost_dev_stop, not virtio
> > > > savevm handlers. To start spreading this logic to more places of qemu
> > > > can bring confusion.
> > > 
> > > I don't think there is a way around extending the QEMU vhost's code
> > > model. The current model in QEMU's vhost code is that the backend is
> > > reset when the VM stops. This model worked fine for stateless devices
> > > but it doesn't work for stateful devices.
> > > 
> > > Imagine a vdpa-gpu device: you cannot reset the device in
> > > vhost_dev_stop() and expect the GPU to continue working when
> > > vhost_dev_start() is called again because all its state has been lost.
> > > The guest driver will send requests that references a virtio-gpu
> > > resources that no longer exist.
> > > 
> > > One solution is to save the device's state in vhost_dev_stop(). I think
> > > this is what you're suggesting. It requires keeping a copy of the state
> > > and then loading the state again in vhost_dev_start(). I don't think
> > > this approach should be used because it requires all stateful devices to
> > > support live migration (otherwise they break across HMP 'stop'/'cont').
> > > Also, the device state for some devices may be large and it would also
> > > become more complicated when iterative migration is added.
> > > 
> > > Instead, I think the QEMU vhost code needs to be structured so that
> > > struct vhost_dev has a suspended state:
> > > 
> > >         ,---------.
> > >         v         |
> > >   started ------> stopped
> > >     \   ^
> > >      \  |
> > >       -> suspended
> > > 
> > > The device doesn't lose state when it enters the suspended state. It can
> > > be resumed again.
> > > 
> > > This is why I think SUSPEND/RESUME need to be part of the solution.

I just realize that we can add an arrow from suspended to stopped, isn't it?
"Started" before seems to imply the device may process descriptors after
suspend.

> > 
> > I agree with all of this, especially after realizing vhost_dev_stop is
> > called before the last request of the state in the iterative
> > migration.
> > 
> > However I think we can move faster with the virtiofsd migration code,
> > as long as we agree on the vhost-user messages it will receive. This
> > is because we already agree that the state will be sent in one shot
> > and not iteratively, so it will be small.
> > 
> > I understand this may change in the future, that's why I proposed to
> > start using iterative right now. However it may make little sense if
> > it is not used in the vhost-user device. I also understand that other
> > devices may have a bigger state so it will be needed for them.
> 
> Can you summarize how you'd like save to work today? I'm not sure what
> you have in mind.
> 

I think we're trying to find a solution that satisfies many things.  On one
side, we're assuming that the virtiofsd state will be small enough to be
assumable it will not require iterative migration in the short term.  However,
we also want to support iterative migration, for the shake of *other* future
vhost devices that may need it.

I also think we should prioritize the protocols stability, in the sense of not
adding calls that we will not reuse for iterative LM.  Being vhost-user protocol
more important to maintain than the qemu migration.

To implement the changes you mention will be needed in the future.  But we have
already set that the virtiofsd is small, so we can just fetch them by the same
time than we send VHOST_USER_GET_VRING_BASE message and send the status with the
proposed non-iterative approach.

If we agree on that, now the question is how to fetch them from the device.  The
answers are a little bit scattered in the mail threads, but I think we agree on:
a) We need to signal that the device must stop processing requests.
b) We need a way for the device to dump the state.

At this moment I think any proposal satisfies a), and pipe satisfies better b). 
With proper backend feature flags, the device may support to start writing to
the pipe before SUSPEND so we can implement iterative migration on top.

Does that makes sense?

Thanks!
Eugenio Perez Martin April 20, 2023, 1:29 p.m. UTC | #43
On Wed, 2023-04-19 at 07:21 -0400, Stefan Hajnoczi wrote:
> On Wed, 19 Apr 2023 at 07:10, Hanna Czenczek <hreitz@redhat.com> wrote:
> > On 18.04.23 09:54, Eugenio Perez Martin wrote:
> > > On Mon, Apr 17, 2023 at 9:21 PM Stefan Hajnoczi <stefanha@gmail.com>
> > > wrote:
> > > > On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com>
> > > > wrote:
> > > > > On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com>
> > > > > wrote:
> > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin
> > > > > > wrote:
> > > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <
> > > > > > > stefanha@redhat.com> wrote:
> > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > > > > > So-called "internal" virtio-fs migration refers to
> > > > > > > > > transporting the
> > > > > > > > > back-end's (virtiofsd's) state through qemu's migration
> > > > > > > > > stream.  To do
> > > > > > > > > this, we need to be able to transfer virtiofsd's internal
> > > > > > > > > state to and
> > > > > > > > > from virtiofsd.
> > > > > > > > > 
> > > > > > > > > Because virtiofsd's internal state will not be too large, we
> > > > > > > > > believe it
> > > > > > > > > is best to transfer it as a single binary blob after the
> > > > > > > > > streaming
> > > > > > > > > phase.  Because this method should be useful to other vhost-
> > > > > > > > > user
> > > > > > > > > implementations, too, it is introduced as a general-purpose
> > > > > > > > > addition to
> > > > > > > > > the protocol, not limited to vhost-user-fs.
> > > > > > > > > 
> > > > > > > > > These are the additions to the protocol:
> > > > > > > > > - New vhost-user protocol feature
> > > > > > > > > VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > > > > > >    This feature signals support for transferring state, and is
> > > > > > > > > added so
> > > > > > > > >    that migration can fail early when the back-end has no
> > > > > > > > > support.
> > > > > > > > > 
> > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end
> > > > > > > > > negotiate a pipe
> > > > > > > > >    over which to transfer the state.  The front-end sends an
> > > > > > > > > FD to the
> > > > > > > > >    back-end into/from which it can write/read its state, and
> > > > > > > > > the back-end
> > > > > > > > >    can decide to either use it, or reply with a different FD
> > > > > > > > > for the
> > > > > > > > >    front-end to override the front-end's choice.
> > > > > > > > >    The front-end creates a simple pipe to transfer the state,
> > > > > > > > > but maybe
> > > > > > > > >    the back-end already has an FD into/from which it has to
> > > > > > > > > write/read
> > > > > > > > >    its state, in which case it will want to override the
> > > > > > > > > simple pipe.
> > > > > > > > >    Conversely, maybe in the future we find a way to have the
> > > > > > > > > front-end
> > > > > > > > >    get an immediate FD for the migration stream (in some
> > > > > > > > > cases), in which
> > > > > > > > >    case we will want to send this to the back-end instead of
> > > > > > > > > creating a
> > > > > > > > >    pipe.
> > > > > > > > >    Hence the negotiation: If one side has a better idea than a
> > > > > > > > > plain
> > > > > > > > >    pipe, we will want to use that.
> > > > > > > > > 
> > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred
> > > > > > > > > through the
> > > > > > > > >    pipe (the end indicated by EOF), the front-end invokes this
> > > > > > > > > function
> > > > > > > > >    to verify success.  There is no in-band way (through the
> > > > > > > > > pipe) to
> > > > > > > > >    indicate failure, so we need to check explicitly.
> > > > > > > > > 
> > > > > > > > > Once the transfer pipe has been established via
> > > > > > > > > SET_DEVICE_STATE_FD
> > > > > > > > > (which includes establishing the direction of transfer and
> > > > > > > > > migration
> > > > > > > > > phase), the sending side writes its data into the pipe, and
> > > > > > > > > the reading
> > > > > > > > > side reads it until it sees an EOF.  Then, the front-end will
> > > > > > > > > check for
> > > > > > > > > success via CHECK_DEVICE_STATE, which on the destination side
> > > > > > > > > includes
> > > > > > > > > checking for integrity (i.e. errors during deserialization).
> > > > > > > > > 
> > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > > > > ---
> > > > > > > > >   include/hw/virtio/vhost-backend.h |  24 +++++
> > > > > > > > >   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > > > > > >   hw/virtio/vhost-user.c            | 147
> > > > > > > > > ++++++++++++++++++++++++++++++
> > > > > > > > >   hw/virtio/vhost.c                 |  37 ++++++++
> > > > > > > > >   4 files changed, 287 insertions(+)
> > > > > > > > > 
> > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h
> > > > > > > > > b/include/hw/virtio/vhost-backend.h
> > > > > > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > > > > > >       VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > > > > > >   } VhostSetConfigType;
> > > > > > > > > 
> > > > > > > > > +typedef enum VhostDeviceStateDirection {
> > > > > > > > > +    /* Transfer state from back-end (device) to front-end */
> > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > > > > > +    /* Transfer state from front-end to back-end (device) */
> > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > > > > > +} VhostDeviceStateDirection;
> > > > > > > > > +
> > > > > > > > > +typedef enum VhostDeviceStatePhase {
> > > > > > > > > +    /* The device (and all its vrings) is stopped */
> > > > > > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > > > > > +} VhostDeviceStatePhase;
> > > > > > > > vDPA has:
> > > > > > > > 
> > > > > > > >    /* Suspend a device so it does not process virtqueue requests
> > > > > > > > anymore
> > > > > > > >     *
> > > > > > > >     * After the return of ioctl the device must preserve all the
> > > > > > > > necessary state
> > > > > > > >     * (the virtqueue vring base plus the possible device
> > > > > > > > specific states) that is
> > > > > > > >     * required for restoring in the future. The device must not
> > > > > > > > change its
> > > > > > > >     * configuration after that point.
> > > > > > > >     */
> > > > > > > >    #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > > > > > > 
> > > > > > > >    /* Resume a device so it can resume processing virtqueue
> > > > > > > > requests
> > > > > > > >     *
> > > > > > > >     * After the return of this ioctl the device will have
> > > > > > > > restored all the
> > > > > > > >     * necessary states and it is fully operational to continue
> > > > > > > > processing the
> > > > > > > >     * virtqueue descriptors.
> > > > > > > >     */
> > > > > > > >    #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > > > > > > 
> > > > > > > > I wonder if it makes sense to import these into vhost-user so
> > > > > > > > that the
> > > > > > > > difference between kernel vhost and vhost-user is minimized.
> > > > > > > > It's okay
> > > > > > > > if one of them is ahead of the other, but it would be nice to
> > > > > > > > avoid
> > > > > > > > overlapping/duplicated functionality.
> > > > > > > > 
> > > > > > > That's what I had in mind in the first versions. I proposed
> > > > > > > VHOST_STOP
> > > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did
> > > > > > > change
> > > > > > > to SUSPEND.
> > > > > > > 
> > > > > > > Generally it is better if we make the interface less parametrized
> > > > > > > and
> > > > > > > we trust in the messages and its semantics in my opinion. In other
> > > > > > > words, instead of
> > > > > > > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED),
> > > > > > > send
> > > > > > > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user
> > > > > > > command.
> > > > > > > 
> > > > > > > Another way to apply this is with the "direction" parameter. Maybe
> > > > > > > it
> > > > > > > is better to split it into "set_state_fd" and "get_state_fd"?
> > > > > > > 
> > > > > > > In that case, reusing the ioctls as vhost-user messages would be
> > > > > > > ok.
> > > > > > > But that puts this proposal further from the VFIO code, which uses
> > > > > > > "migration_set_state(state)", and maybe it is better when the
> > > > > > > number
> > > > > > > of states is high.
> > > > > > Hi Eugenio,
> > > > > > Another question about vDPA suspend/resume:
> > > > > > 
> > > > > >    /* Host notifiers must be enabled at this point. */
> > > > > >    void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev,
> > > > > > bool vrings)
> > > > > >    {
> > > > > >        int i;
> > > > > > 
> > > > > >        /* should only be called after backend is connected */
> > > > > >        assert(hdev->vhost_ops);
> > > > > >        event_notifier_test_and_clear(
> > > > > >            &hdev-
> > > > > > >vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier);
> > > > > >        event_notifier_test_and_clear(&vdev->config_notifier);
> > > > > > 
> > > > > >        trace_vhost_dev_stop(hdev, vdev->name, vrings);
> > > > > > 
> > > > > >        if (hdev->vhost_ops->vhost_dev_start) {
> > > > > >            hdev->vhost_ops->vhost_dev_start(hdev, false);
> > > > > >            ^^^ SUSPEND ^^^
> > > > > >        }
> > > > > >        if (vrings) {
> > > > > >            vhost_dev_set_vring_enable(hdev, false);
> > > > > >        }
> > > > > >        for (i = 0; i < hdev->nvqs; ++i) {
> > > > > >            vhost_virtqueue_stop(hdev,
> > > > > >                                 vdev,
> > > > > >                                 hdev->vqs + i,
> > > > > >                                 hdev->vq_index + i);
> > > > > >          ^^^ fetch virtqueue state from kernel ^^^
> > > > > >        }
> > > > > >        if (hdev->vhost_ops->vhost_reset_status) {
> > > > > >            hdev->vhost_ops->vhost_reset_status(hdev);
> > > > > >            ^^^ reset device^^^
> > > > > > 
> > > > > > I noticed the QEMU vDPA code resets the device in vhost_dev_stop()
> > > > > > ->
> > > > > > vhost_reset_status(). The device's migration code runs after
> > > > > > vhost_dev_stop() and the state will have been lost.
> > > > > > 
> > > > > vhost_virtqueue_stop saves the vq state (indexes, vring base) in the
> > > > > qemu VirtIONet device model. This is for all vhost backends.
> > > > > 
> > > > > Regarding the state like mac or mq configuration, SVQ runs for all the
> > > > > VM run in the CVQ. So it can track all of that status in the device
> > > > > model too.
> > > > > 
> > > > > When a migration effectively occurs, all the frontend state is
> > > > > migrated as a regular emulated device. To route all of the state in a
> > > > > normalized way for qemu is what leaves open the possibility to do
> > > > > cross-backends migrations, etc.
> > > > > 
> > > > > Does that answer your question?
> > > > I think you're confirming that changes would be necessary in order for
> > > > vDPA to support the save/load operation that Hanna is introducing.
> > > > 
> > > Yes, this first iteration was centered on net, with an eye on block,
> > > where state can be routed through classical emulated devices. This is
> > > how vhost-kernel and vhost-user do classically. And it allows
> > > cross-backend, to not modify qemu migration state, etc.
> > > 
> > > To introduce this opaque state to qemu, that must be fetched after the
> > > suspend and not before, requires changes in vhost protocol, as
> > > discussed previously.
> > > 
> > > > > > It looks like vDPA changes are necessary in order to support
> > > > > > stateful
> > > > > > devices even though QEMU already uses SUSPEND. Is my understanding
> > > > > > correct?
> > > > > > 
> > > > > Changes are required elsewhere, as the code to restore the state
> > > > > properly in the destination has not been merged.
> > > > I'm not sure what you mean by elsewhere?
> > > > 
> > > I meant for vdpa *net* devices the changes are not required in vdpa
> > > ioctls, but mostly in qemu.
> > > 
> > > If you meant stateful as "it must have a state blob that it must be
> > > opaque to qemu", then I think the straightforward action is to fetch
> > > state blob about the same time as vq indexes. But yes, changes (at
> > > least a new ioctl) is needed for that.
> > > 
> > > > I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and
> > > > then VHOST_VDPA_SET_STATUS 0.
> > > > 
> > > > In order to save device state from the vDPA device in the future, it
> > > > will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that
> > > > the device state can be saved before the device is reset.
> > > > 
> > > > Does that sound right?
> > > > 
> > > The split between suspend and reset was added recently for that very
> > > reason. In all the virtio devices, the frontend is initialized before
> > > the backend, so I don't think it is a good idea to defer the backend
> > > cleanup. Especially if we have already set the state is small enough
> > > to not needing iterative migration from virtiofsd point of view.
> > > 
> > > If fetching that state at the same time as vq indexes is not valid,
> > > could it follow the same model as the "in-flight descriptors"?
> > > vhost-user follows them by using a shared memory region where their
> > > state is tracked [1]. This allows qemu to survive vhost-user SW
> > > backend crashes, and does not forbid the cross-backends live migration
> > > as all the information is there to recover them.
> > > 
> > > For hw devices this is not convenient as it occupies PCI bandwidth. So
> > > a possibility is to synchronize this memory region after a
> > > synchronization point, being the SUSPEND call or GET_VRING_BASE. HW
> > > devices are not going to crash in the software sense, so all use cases
> > > remain the same to qemu. And that shared memory information is
> > > recoverable after vhost_dev_stop.
> > > 
> > > Does that sound reasonable to virtiofsd? To offer a shared memory
> > > region where it dumps the state, maybe only after the
> > > set_state(STATE_PHASE_STOPPED)?
> > 
> > I don’t think we need the set_state() call, necessarily, if SUSPEND is
> > mandatory anyway.
> > 
> > As for the shared memory, the RFC before this series used shared memory,
> > so it’s possible, yes.  But “shared memory region” can mean a lot of
> > things – it sounds like you’re saying the back-end (virtiofsd) should
> > provide it to the front-end, is that right?  That could work like this:
> > 
> > On the source side:
> > 
> > S1. SUSPEND goes to virtiofsd
> > S2. virtiofsd maybe double-checks that the device is stopped, then
> > serializes its state into a newly allocated shared memory area[1]
> > S3. virtiofsd responds to SUSPEND
> > S4. front-end requests shared memory, virtiofsd responds with a handle,
> > maybe already closes its reference
> > S5. front-end saves state, closes its handle, freeing the SHM
> > 
> > [1] Maybe virtiofsd can correctly size the serialized state’s size, then
> > it can immediately allocate this area and serialize directly into it;
> > maybe it can’t, then we’ll need a bounce buffer.  Not really a
> > fundamental problem, but there are limitations around what you can do
> > with serde implementations in Rust…
> > 
> > On the destination side:
> > 
> > D1. Optional SUSPEND goes to virtiofsd that hasn’t yet done much;
> > virtiofsd would serialize its empty state into an SHM area, and respond
> > to SUSPEND
> > D2. front-end reads state from migration stream into an SHM it has allocated
> > D3. front-end supplies this SHM to virtiofsd, which discards its
> > previous area, and now uses this one
> > D4. RESUME goes to virtiofsd, which deserializes the state from the SHM
> > 
> > Couple of questions:
> > 
> > A. Stefan suggested D1, but it does seem wasteful now.  But if SUSPEND
> > would imply to deserialize a state, and the state is to be transferred
> > through SHM, this is what would need to be done.  So maybe we should
> > skip SUSPEND on the destination?
> > B. You described that the back-end should supply the SHM, which works
> > well on the source.  On the destination, only the front-end knows how
> > big the state is, so I’ve decided above that it should allocate the SHM
> > (D2) and provide it to the back-end.  Is that feasible or is it
> > important (e.g. for real hardware) that the back-end supplies the SHM?
> > (In which case the front-end would need to tell the back-end how big the
> > state SHM needs to be.)
> 
> How does this work for iterative live migration?
> 

A pipe will always fit better for iterative from qemu POV, that's for sure. 
Especially if we want to keep that opaqueness.

But  we will need to communicate with the HW device using shared memory sooner
or later for big states.  If we don't transform it in qemu, we will need to do
it in the kernel.  Also, the pipe will not support daemon crashes.

Again I'm just putting this on the table, just in case it fits better or it is
convenient.  I missed the previous patch where SHM was proposed too, so maybe I
missed some feedback useful here.  I think the pipe is a better solution in the
long run because of the iterative part.

Thanks!
Stefan Hajnoczi May 8, 2023, 7:12 p.m. UTC | #44
On Thu, Apr 20, 2023 at 03:27:51PM +0200, Eugenio Pérez wrote:
> On Tue, 2023-04-18 at 16:40 -0400, Stefan Hajnoczi wrote:
> > On Tue, 18 Apr 2023 at 14:31, Eugenio Perez Martin <eperezma@redhat.com>
> > wrote:
> > > On Tue, Apr 18, 2023 at 7:59 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > On Tue, Apr 18, 2023 at 10:09:30AM +0200, Eugenio Perez Martin wrote:
> > > > > On Mon, Apr 17, 2023 at 9:33 PM Stefan Hajnoczi <stefanha@gmail.com>
> > > > > wrote:
> > > > > > On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin <
> > > > > > eperezma@redhat.com> wrote:
> > > > > > > On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com
> > > > > > > > wrote:
> > > > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin
> > > > > > > > wrote:
> > > > > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <
> > > > > > > > > stefanha@redhat.com> wrote:
> > > > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek
> > > > > > > > > > wrote:
> > > > > > > > > > > So-called "internal" virtio-fs migration refers to
> > > > > > > > > > > transporting the
> > > > > > > > > > > back-end's (virtiofsd's) state through qemu's migration
> > > > > > > > > > > stream.  To do
> > > > > > > > > > > this, we need to be able to transfer virtiofsd's internal
> > > > > > > > > > > state to and
> > > > > > > > > > > from virtiofsd.
> > > > > > > > > > > 
> > > > > > > > > > > Because virtiofsd's internal state will not be too large, we
> > > > > > > > > > > believe it
> > > > > > > > > > > is best to transfer it as a single binary blob after the
> > > > > > > > > > > streaming
> > > > > > > > > > > phase.  Because this method should be useful to other vhost-
> > > > > > > > > > > user
> > > > > > > > > > > implementations, too, it is introduced as a general-purpose
> > > > > > > > > > > addition to
> > > > > > > > > > > the protocol, not limited to vhost-user-fs.
> > > > > > > > > > > 
> > > > > > > > > > > These are the additions to the protocol:
> > > > > > > > > > > - New vhost-user protocol feature
> > > > > > > > > > > VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > > > > > > > >   This feature signals support for transferring state, and
> > > > > > > > > > > is added so
> > > > > > > > > > >   that migration can fail early when the back-end has no
> > > > > > > > > > > support.
> > > > > > > > > > > 
> > > > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end
> > > > > > > > > > > negotiate a pipe
> > > > > > > > > > >   over which to transfer the state.  The front-end sends an
> > > > > > > > > > > FD to the
> > > > > > > > > > >   back-end into/from which it can write/read its state, and
> > > > > > > > > > > the back-end
> > > > > > > > > > >   can decide to either use it, or reply with a different FD
> > > > > > > > > > > for the
> > > > > > > > > > >   front-end to override the front-end's choice.
> > > > > > > > > > >   The front-end creates a simple pipe to transfer the state,
> > > > > > > > > > > but maybe
> > > > > > > > > > >   the back-end already has an FD into/from which it has to
> > > > > > > > > > > write/read
> > > > > > > > > > >   its state, in which case it will want to override the
> > > > > > > > > > > simple pipe.
> > > > > > > > > > >   Conversely, maybe in the future we find a way to have the
> > > > > > > > > > > front-end
> > > > > > > > > > >   get an immediate FD for the migration stream (in some
> > > > > > > > > > > cases), in which
> > > > > > > > > > >   case we will want to send this to the back-end instead of
> > > > > > > > > > > creating a
> > > > > > > > > > >   pipe.
> > > > > > > > > > >   Hence the negotiation: If one side has a better idea than
> > > > > > > > > > > a plain
> > > > > > > > > > >   pipe, we will want to use that.
> > > > > > > > > > > 
> > > > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred
> > > > > > > > > > > through the
> > > > > > > > > > >   pipe (the end indicated by EOF), the front-end invokes
> > > > > > > > > > > this function
> > > > > > > > > > >   to verify success.  There is no in-band way (through the
> > > > > > > > > > > pipe) to
> > > > > > > > > > >   indicate failure, so we need to check explicitly.
> > > > > > > > > > > 
> > > > > > > > > > > Once the transfer pipe has been established via
> > > > > > > > > > > SET_DEVICE_STATE_FD
> > > > > > > > > > > (which includes establishing the direction of transfer and
> > > > > > > > > > > migration
> > > > > > > > > > > phase), the sending side writes its data into the pipe, and
> > > > > > > > > > > the reading
> > > > > > > > > > > side reads it until it sees an EOF.  Then, the front-end
> > > > > > > > > > > will check for
> > > > > > > > > > > success via CHECK_DEVICE_STATE, which on the destination
> > > > > > > > > > > side includes
> > > > > > > > > > > checking for integrity (i.e. errors during deserialization).
> > > > > > > > > > > 
> > > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > > > > > > ---
> > > > > > > > > > >  include/hw/virtio/vhost-backend.h |  24 +++++
> > > > > > > > > > >  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > > > > > > > >  hw/virtio/vhost-user.c            | 147
> > > > > > > > > > > ++++++++++++++++++++++++++++++
> > > > > > > > > > >  hw/virtio/vhost.c                 |  37 ++++++++
> > > > > > > > > > >  4 files changed, 287 insertions(+)
> > > > > > > > > > > 
> > > > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > b/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > > > > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > > > > > > > >      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > > > > > > > >  } VhostSetConfigType;
> > > > > > > > > > > 
> > > > > > > > > > > +typedef enum VhostDeviceStateDirection {
> > > > > > > > > > > +    /* Transfer state from back-end (device) to front-end
> > > > > > > > > > > */
> > > > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > > > > > > > +    /* Transfer state from front-end to back-end (device)
> > > > > > > > > > > */
> > > > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > > > > > > > +} VhostDeviceStateDirection;
> > > > > > > > > > > +
> > > > > > > > > > > +typedef enum VhostDeviceStatePhase {
> > > > > > > > > > > +    /* The device (and all its vrings) is stopped */
> > > > > > > > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > > > > > > > +} VhostDeviceStatePhase;
> > > > > > > > > > 
> > > > > > > > > > vDPA has:
> > > > > > > > > > 
> > > > > > > > > >   /* Suspend a device so it does not process virtqueue
> > > > > > > > > > requests anymore
> > > > > > > > > >    *
> > > > > > > > > >    * After the return of ioctl the device must preserve all
> > > > > > > > > > the necessary state
> > > > > > > > > >    * (the virtqueue vring base plus the possible device
> > > > > > > > > > specific states) that is
> > > > > > > > > >    * required for restoring in the future. The device must not
> > > > > > > > > > change its
> > > > > > > > > >    * configuration after that point.
> > > > > > > > > >    */
> > > > > > > > > >   #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > > > > > > > > 
> > > > > > > > > >   /* Resume a device so it can resume processing virtqueue
> > > > > > > > > > requests
> > > > > > > > > >    *
> > > > > > > > > >    * After the return of this ioctl the device will have
> > > > > > > > > > restored all the
> > > > > > > > > >    * necessary states and it is fully operational to continue
> > > > > > > > > > processing the
> > > > > > > > > >    * virtqueue descriptors.
> > > > > > > > > >    */
> > > > > > > > > >   #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > > > > > > > > 
> > > > > > > > > > I wonder if it makes sense to import these into vhost-user so
> > > > > > > > > > that the
> > > > > > > > > > difference between kernel vhost and vhost-user is minimized.
> > > > > > > > > > It's okay
> > > > > > > > > > if one of them is ahead of the other, but it would be nice to
> > > > > > > > > > avoid
> > > > > > > > > > overlapping/duplicated functionality.
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > That's what I had in mind in the first versions. I proposed
> > > > > > > > > VHOST_STOP
> > > > > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did
> > > > > > > > > change
> > > > > > > > > to SUSPEND.
> > > > > > > > 
> > > > > > > > I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not
> > > > > > > > ioctl(VHOST_VDPA_RESUME).
> > > > > > > > 
> > > > > > > > The doc comments in <linux/vdpa.h> don't explain how the device
> > > > > > > > can
> > > > > > > > leave the suspended state. Can you clarify this?
> > > > > > > > 
> > > > > > > 
> > > > > > > Do you mean in what situations or regarding the semantics of
> > > > > > > _RESUME?
> > > > > > > 
> > > > > > > To me resume is an operation mainly to resume the device in the
> > > > > > > event
> > > > > > > of a VM suspension, not a migration. It can be used as a fallback
> > > > > > > code
> > > > > > > in some cases of migration failure though, but it is not currently
> > > > > > > used in qemu.
> > > > > > 
> > > > > > Is a "VM suspension" the QEMU HMP 'stop' command?
> > > > > > 
> > > > > > I guess the reason why QEMU doesn't call RESUME anywhere is that it
> > > > > > resets the device in vhost_dev_stop()?
> > > > > > 
> > > > > 
> > > > > The actual reason for not using RESUME is that the ioctl was added
> > > > > after the SUSPEND design in qemu. Same as this proposal, it is was not
> > > > > needed at the time.
> > > > > 
> > > > > In the case of vhost-vdpa net, the only usage of suspend is to fetch
> > > > > the vq indexes, and in case of error vhost already fetches them from
> > > > > guest's used ring way before vDPA, so it has little usage.
> > > > > 
> > > > > > Does it make sense to combine SUSPEND and RESUME with Hanna's
> > > > > > SET_DEVICE_STATE_FD? For example, non-iterative migration works like
> > > > > > this:
> > > > > > - Saving the device's state is done by SUSPEND followed by
> > > > > > SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g.
> > > > > > savevm command or migration failed), then RESUME is called to
> > > > > > continue.
> > > > > 
> > > > > I think the previous steps make sense at vhost_dev_stop, not virtio
> > > > > savevm handlers. To start spreading this logic to more places of qemu
> > > > > can bring confusion.
> > > > 
> > > > I don't think there is a way around extending the QEMU vhost's code
> > > > model. The current model in QEMU's vhost code is that the backend is
> > > > reset when the VM stops. This model worked fine for stateless devices
> > > > but it doesn't work for stateful devices.
> > > > 
> > > > Imagine a vdpa-gpu device: you cannot reset the device in
> > > > vhost_dev_stop() and expect the GPU to continue working when
> > > > vhost_dev_start() is called again because all its state has been lost.
> > > > The guest driver will send requests that references a virtio-gpu
> > > > resources that no longer exist.
> > > > 
> > > > One solution is to save the device's state in vhost_dev_stop(). I think
> > > > this is what you're suggesting. It requires keeping a copy of the state
> > > > and then loading the state again in vhost_dev_start(). I don't think
> > > > this approach should be used because it requires all stateful devices to
> > > > support live migration (otherwise they break across HMP 'stop'/'cont').
> > > > Also, the device state for some devices may be large and it would also
> > > > become more complicated when iterative migration is added.
> > > > 
> > > > Instead, I think the QEMU vhost code needs to be structured so that
> > > > struct vhost_dev has a suspended state:
> > > > 
> > > >         ,---------.
> > > >         v         |
> > > >   started ------> stopped
> > > >     \   ^
> > > >      \  |
> > > >       -> suspended
> > > > 
> > > > The device doesn't lose state when it enters the suspended state. It can
> > > > be resumed again.
> > > > 
> > > > This is why I think SUSPEND/RESUME need to be part of the solution.
> 
> I just realize that we can add an arrow from suspended to stopped, isn't it?

Yes, it could be used in the case of a successful live migration:
[started] -> vhost_dev_suspend() [suspended] -> vhost_dev_stop() [stopped]

> "Started" before seems to imply the device may process descriptors after
> suspend.

Yes, in the case of a failed live migration:
[started] -> vhost_dev_suspend() [suspended] -> vhost_dev_resume() [started]

> > > 
> > > I agree with all of this, especially after realizing vhost_dev_stop is
> > > called before the last request of the state in the iterative
> > > migration.
> > > 
> > > However I think we can move faster with the virtiofsd migration code,
> > > as long as we agree on the vhost-user messages it will receive. This
> > > is because we already agree that the state will be sent in one shot
> > > and not iteratively, so it will be small.
> > > 
> > > I understand this may change in the future, that's why I proposed to
> > > start using iterative right now. However it may make little sense if
> > > it is not used in the vhost-user device. I also understand that other
> > > devices may have a bigger state so it will be needed for them.
> > 
> > Can you summarize how you'd like save to work today? I'm not sure what
> > you have in mind.
> > 
> 
> I think we're trying to find a solution that satisfies many things.  On one
> side, we're assuming that the virtiofsd state will be small enough to be
> assumable it will not require iterative migration in the short term.  However,
> we also want to support iterative migration, for the shake of *other* future
> vhost devices that may need it.
> 
> I also think we should prioritize the protocols stability, in the sense of not
> adding calls that we will not reuse for iterative LM.  Being vhost-user protocol
> more important to maintain than the qemu migration.
> 
> To implement the changes you mention will be needed in the future.  But we have
> already set that the virtiofsd is small, so we can just fetch them by the same
> time than we send VHOST_USER_GET_VRING_BASE message and send the status with the
> proposed non-iterative approach.

VHOST_USER_GET_VRING_BASE itself isn't really enough because it stops a
specific virtqueue but not the whole device. Unfortunately stopping all
virtqueues is not the same as SUSPEND since spontaneous device activity
is possible independent of any virtqueue (e.g. virtio-scsi events and
maybe virtio-net link status).

That's why I think SUSPEND is necessary for a solution that's generic
enough to cover all device types.

> If we agree on that, now the question is how to fetch them from the device.  The
> answers are a little bit scattered in the mail threads, but I think we agree on:
> a) We need to signal that the device must stop processing requests.
> b) We need a way for the device to dump the state.
> 
> At this moment I think any proposal satisfies a), and pipe satisfies better b). 
> With proper backend feature flags, the device may support to start writing to
> the pipe before SUSPEND so we can implement iterative migration on top.
> 
> Does that makes sense?

Yes, and that sounds like what Hanna is proposing for b) plus our
discussion about SUSPEND/RESUME in order to achieve a).

Stefan
Stefan Hajnoczi May 8, 2023, 8:10 p.m. UTC | #45
On Thu, Apr 20, 2023 at 03:29:44PM +0200, Eugenio Pérez wrote:
> On Wed, 2023-04-19 at 07:21 -0400, Stefan Hajnoczi wrote:
> > On Wed, 19 Apr 2023 at 07:10, Hanna Czenczek <hreitz@redhat.com> wrote:
> > > On 18.04.23 09:54, Eugenio Perez Martin wrote:
> > > > On Mon, Apr 17, 2023 at 9:21 PM Stefan Hajnoczi <stefanha@gmail.com>
> > > > wrote:
> > > > > On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com>
> > > > > wrote:
> > > > > > On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > wrote:
> > > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin
> > > > > > > wrote:
> > > > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <
> > > > > > > > stefanha@redhat.com> wrote:
> > > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > > > > > > So-called "internal" virtio-fs migration refers to
> > > > > > > > > > transporting the
> > > > > > > > > > back-end's (virtiofsd's) state through qemu's migration
> > > > > > > > > > stream.  To do
> > > > > > > > > > this, we need to be able to transfer virtiofsd's internal
> > > > > > > > > > state to and
> > > > > > > > > > from virtiofsd.
> > > > > > > > > > 
> > > > > > > > > > Because virtiofsd's internal state will not be too large, we
> > > > > > > > > > believe it
> > > > > > > > > > is best to transfer it as a single binary blob after the
> > > > > > > > > > streaming
> > > > > > > > > > phase.  Because this method should be useful to other vhost-
> > > > > > > > > > user
> > > > > > > > > > implementations, too, it is introduced as a general-purpose
> > > > > > > > > > addition to
> > > > > > > > > > the protocol, not limited to vhost-user-fs.
> > > > > > > > > > 
> > > > > > > > > > These are the additions to the protocol:
> > > > > > > > > > - New vhost-user protocol feature
> > > > > > > > > > VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > > > > > > >    This feature signals support for transferring state, and is
> > > > > > > > > > added so
> > > > > > > > > >    that migration can fail early when the back-end has no
> > > > > > > > > > support.
> > > > > > > > > > 
> > > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end
> > > > > > > > > > negotiate a pipe
> > > > > > > > > >    over which to transfer the state.  The front-end sends an
> > > > > > > > > > FD to the
> > > > > > > > > >    back-end into/from which it can write/read its state, and
> > > > > > > > > > the back-end
> > > > > > > > > >    can decide to either use it, or reply with a different FD
> > > > > > > > > > for the
> > > > > > > > > >    front-end to override the front-end's choice.
> > > > > > > > > >    The front-end creates a simple pipe to transfer the state,
> > > > > > > > > > but maybe
> > > > > > > > > >    the back-end already has an FD into/from which it has to
> > > > > > > > > > write/read
> > > > > > > > > >    its state, in which case it will want to override the
> > > > > > > > > > simple pipe.
> > > > > > > > > >    Conversely, maybe in the future we find a way to have the
> > > > > > > > > > front-end
> > > > > > > > > >    get an immediate FD for the migration stream (in some
> > > > > > > > > > cases), in which
> > > > > > > > > >    case we will want to send this to the back-end instead of
> > > > > > > > > > creating a
> > > > > > > > > >    pipe.
> > > > > > > > > >    Hence the negotiation: If one side has a better idea than a
> > > > > > > > > > plain
> > > > > > > > > >    pipe, we will want to use that.
> > > > > > > > > > 
> > > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred
> > > > > > > > > > through the
> > > > > > > > > >    pipe (the end indicated by EOF), the front-end invokes this
> > > > > > > > > > function
> > > > > > > > > >    to verify success.  There is no in-band way (through the
> > > > > > > > > > pipe) to
> > > > > > > > > >    indicate failure, so we need to check explicitly.
> > > > > > > > > > 
> > > > > > > > > > Once the transfer pipe has been established via
> > > > > > > > > > SET_DEVICE_STATE_FD
> > > > > > > > > > (which includes establishing the direction of transfer and
> > > > > > > > > > migration
> > > > > > > > > > phase), the sending side writes its data into the pipe, and
> > > > > > > > > > the reading
> > > > > > > > > > side reads it until it sees an EOF.  Then, the front-end will
> > > > > > > > > > check for
> > > > > > > > > > success via CHECK_DEVICE_STATE, which on the destination side
> > > > > > > > > > includes
> > > > > > > > > > checking for integrity (i.e. errors during deserialization).
> > > > > > > > > > 
> > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > > > > > ---
> > > > > > > > > >   include/hw/virtio/vhost-backend.h |  24 +++++
> > > > > > > > > >   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > > > > > > >   hw/virtio/vhost-user.c            | 147
> > > > > > > > > > ++++++++++++++++++++++++++++++
> > > > > > > > > >   hw/virtio/vhost.c                 |  37 ++++++++
> > > > > > > > > >   4 files changed, 287 insertions(+)
> > > > > > > > > > 
> > > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h
> > > > > > > > > > b/include/hw/virtio/vhost-backend.h
> > > > > > > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > > > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > > > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > > > > > > >       VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > > > > > > >   } VhostSetConfigType;
> > > > > > > > > > 
> > > > > > > > > > +typedef enum VhostDeviceStateDirection {
> > > > > > > > > > +    /* Transfer state from back-end (device) to front-end */
> > > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > > > > > > +    /* Transfer state from front-end to back-end (device) */
> > > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > > > > > > +} VhostDeviceStateDirection;
> > > > > > > > > > +
> > > > > > > > > > +typedef enum VhostDeviceStatePhase {
> > > > > > > > > > +    /* The device (and all its vrings) is stopped */
> > > > > > > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > > > > > > +} VhostDeviceStatePhase;
> > > > > > > > > vDPA has:
> > > > > > > > > 
> > > > > > > > >    /* Suspend a device so it does not process virtqueue requests
> > > > > > > > > anymore
> > > > > > > > >     *
> > > > > > > > >     * After the return of ioctl the device must preserve all the
> > > > > > > > > necessary state
> > > > > > > > >     * (the virtqueue vring base plus the possible device
> > > > > > > > > specific states) that is
> > > > > > > > >     * required for restoring in the future. The device must not
> > > > > > > > > change its
> > > > > > > > >     * configuration after that point.
> > > > > > > > >     */
> > > > > > > > >    #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > > > > > > > 
> > > > > > > > >    /* Resume a device so it can resume processing virtqueue
> > > > > > > > > requests
> > > > > > > > >     *
> > > > > > > > >     * After the return of this ioctl the device will have
> > > > > > > > > restored all the
> > > > > > > > >     * necessary states and it is fully operational to continue
> > > > > > > > > processing the
> > > > > > > > >     * virtqueue descriptors.
> > > > > > > > >     */
> > > > > > > > >    #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > > > > > > > 
> > > > > > > > > I wonder if it makes sense to import these into vhost-user so
> > > > > > > > > that the
> > > > > > > > > difference between kernel vhost and vhost-user is minimized.
> > > > > > > > > It's okay
> > > > > > > > > if one of them is ahead of the other, but it would be nice to
> > > > > > > > > avoid
> > > > > > > > > overlapping/duplicated functionality.
> > > > > > > > > 
> > > > > > > > That's what I had in mind in the first versions. I proposed
> > > > > > > > VHOST_STOP
> > > > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did
> > > > > > > > change
> > > > > > > > to SUSPEND.
> > > > > > > > 
> > > > > > > > Generally it is better if we make the interface less parametrized
> > > > > > > > and
> > > > > > > > we trust in the messages and its semantics in my opinion. In other
> > > > > > > > words, instead of
> > > > > > > > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED),
> > > > > > > > send
> > > > > > > > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user
> > > > > > > > command.
> > > > > > > > 
> > > > > > > > Another way to apply this is with the "direction" parameter. Maybe
> > > > > > > > it
> > > > > > > > is better to split it into "set_state_fd" and "get_state_fd"?
> > > > > > > > 
> > > > > > > > In that case, reusing the ioctls as vhost-user messages would be
> > > > > > > > ok.
> > > > > > > > But that puts this proposal further from the VFIO code, which uses
> > > > > > > > "migration_set_state(state)", and maybe it is better when the
> > > > > > > > number
> > > > > > > > of states is high.
> > > > > > > Hi Eugenio,
> > > > > > > Another question about vDPA suspend/resume:
> > > > > > > 
> > > > > > >    /* Host notifiers must be enabled at this point. */
> > > > > > >    void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev,
> > > > > > > bool vrings)
> > > > > > >    {
> > > > > > >        int i;
> > > > > > > 
> > > > > > >        /* should only be called after backend is connected */
> > > > > > >        assert(hdev->vhost_ops);
> > > > > > >        event_notifier_test_and_clear(
> > > > > > >            &hdev-
> > > > > > > >vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier);
> > > > > > >        event_notifier_test_and_clear(&vdev->config_notifier);
> > > > > > > 
> > > > > > >        trace_vhost_dev_stop(hdev, vdev->name, vrings);
> > > > > > > 
> > > > > > >        if (hdev->vhost_ops->vhost_dev_start) {
> > > > > > >            hdev->vhost_ops->vhost_dev_start(hdev, false);
> > > > > > >            ^^^ SUSPEND ^^^
> > > > > > >        }
> > > > > > >        if (vrings) {
> > > > > > >            vhost_dev_set_vring_enable(hdev, false);
> > > > > > >        }
> > > > > > >        for (i = 0; i < hdev->nvqs; ++i) {
> > > > > > >            vhost_virtqueue_stop(hdev,
> > > > > > >                                 vdev,
> > > > > > >                                 hdev->vqs + i,
> > > > > > >                                 hdev->vq_index + i);
> > > > > > >          ^^^ fetch virtqueue state from kernel ^^^
> > > > > > >        }
> > > > > > >        if (hdev->vhost_ops->vhost_reset_status) {
> > > > > > >            hdev->vhost_ops->vhost_reset_status(hdev);
> > > > > > >            ^^^ reset device^^^
> > > > > > > 
> > > > > > > I noticed the QEMU vDPA code resets the device in vhost_dev_stop()
> > > > > > > ->
> > > > > > > vhost_reset_status(). The device's migration code runs after
> > > > > > > vhost_dev_stop() and the state will have been lost.
> > > > > > > 
> > > > > > vhost_virtqueue_stop saves the vq state (indexes, vring base) in the
> > > > > > qemu VirtIONet device model. This is for all vhost backends.
> > > > > > 
> > > > > > Regarding the state like mac or mq configuration, SVQ runs for all the
> > > > > > VM run in the CVQ. So it can track all of that status in the device
> > > > > > model too.
> > > > > > 
> > > > > > When a migration effectively occurs, all the frontend state is
> > > > > > migrated as a regular emulated device. To route all of the state in a
> > > > > > normalized way for qemu is what leaves open the possibility to do
> > > > > > cross-backends migrations, etc.
> > > > > > 
> > > > > > Does that answer your question?
> > > > > I think you're confirming that changes would be necessary in order for
> > > > > vDPA to support the save/load operation that Hanna is introducing.
> > > > > 
> > > > Yes, this first iteration was centered on net, with an eye on block,
> > > > where state can be routed through classical emulated devices. This is
> > > > how vhost-kernel and vhost-user do classically. And it allows
> > > > cross-backend, to not modify qemu migration state, etc.
> > > > 
> > > > To introduce this opaque state to qemu, that must be fetched after the
> > > > suspend and not before, requires changes in vhost protocol, as
> > > > discussed previously.
> > > > 
> > > > > > > It looks like vDPA changes are necessary in order to support
> > > > > > > stateful
> > > > > > > devices even though QEMU already uses SUSPEND. Is my understanding
> > > > > > > correct?
> > > > > > > 
> > > > > > Changes are required elsewhere, as the code to restore the state
> > > > > > properly in the destination has not been merged.
> > > > > I'm not sure what you mean by elsewhere?
> > > > > 
> > > > I meant for vdpa *net* devices the changes are not required in vdpa
> > > > ioctls, but mostly in qemu.
> > > > 
> > > > If you meant stateful as "it must have a state blob that it must be
> > > > opaque to qemu", then I think the straightforward action is to fetch
> > > > state blob about the same time as vq indexes. But yes, changes (at
> > > > least a new ioctl) is needed for that.
> > > > 
> > > > > I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and
> > > > > then VHOST_VDPA_SET_STATUS 0.
> > > > > 
> > > > > In order to save device state from the vDPA device in the future, it
> > > > > will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that
> > > > > the device state can be saved before the device is reset.
> > > > > 
> > > > > Does that sound right?
> > > > > 
> > > > The split between suspend and reset was added recently for that very
> > > > reason. In all the virtio devices, the frontend is initialized before
> > > > the backend, so I don't think it is a good idea to defer the backend
> > > > cleanup. Especially if we have already set the state is small enough
> > > > to not needing iterative migration from virtiofsd point of view.
> > > > 
> > > > If fetching that state at the same time as vq indexes is not valid,
> > > > could it follow the same model as the "in-flight descriptors"?
> > > > vhost-user follows them by using a shared memory region where their
> > > > state is tracked [1]. This allows qemu to survive vhost-user SW
> > > > backend crashes, and does not forbid the cross-backends live migration
> > > > as all the information is there to recover them.
> > > > 
> > > > For hw devices this is not convenient as it occupies PCI bandwidth. So
> > > > a possibility is to synchronize this memory region after a
> > > > synchronization point, being the SUSPEND call or GET_VRING_BASE. HW
> > > > devices are not going to crash in the software sense, so all use cases
> > > > remain the same to qemu. And that shared memory information is
> > > > recoverable after vhost_dev_stop.
> > > > 
> > > > Does that sound reasonable to virtiofsd? To offer a shared memory
> > > > region where it dumps the state, maybe only after the
> > > > set_state(STATE_PHASE_STOPPED)?
> > > 
> > > I don’t think we need the set_state() call, necessarily, if SUSPEND is
> > > mandatory anyway.
> > > 
> > > As for the shared memory, the RFC before this series used shared memory,
> > > so it’s possible, yes.  But “shared memory region” can mean a lot of
> > > things – it sounds like you’re saying the back-end (virtiofsd) should
> > > provide it to the front-end, is that right?  That could work like this:
> > > 
> > > On the source side:
> > > 
> > > S1. SUSPEND goes to virtiofsd
> > > S2. virtiofsd maybe double-checks that the device is stopped, then
> > > serializes its state into a newly allocated shared memory area[1]
> > > S3. virtiofsd responds to SUSPEND
> > > S4. front-end requests shared memory, virtiofsd responds with a handle,
> > > maybe already closes its reference
> > > S5. front-end saves state, closes its handle, freeing the SHM
> > > 
> > > [1] Maybe virtiofsd can correctly size the serialized state’s size, then
> > > it can immediately allocate this area and serialize directly into it;
> > > maybe it can’t, then we’ll need a bounce buffer.  Not really a
> > > fundamental problem, but there are limitations around what you can do
> > > with serde implementations in Rust…
> > > 
> > > On the destination side:
> > > 
> > > D1. Optional SUSPEND goes to virtiofsd that hasn’t yet done much;
> > > virtiofsd would serialize its empty state into an SHM area, and respond
> > > to SUSPEND
> > > D2. front-end reads state from migration stream into an SHM it has allocated
> > > D3. front-end supplies this SHM to virtiofsd, which discards its
> > > previous area, and now uses this one
> > > D4. RESUME goes to virtiofsd, which deserializes the state from the SHM
> > > 
> > > Couple of questions:
> > > 
> > > A. Stefan suggested D1, but it does seem wasteful now.  But if SUSPEND
> > > would imply to deserialize a state, and the state is to be transferred
> > > through SHM, this is what would need to be done.  So maybe we should
> > > skip SUSPEND on the destination?
> > > B. You described that the back-end should supply the SHM, which works
> > > well on the source.  On the destination, only the front-end knows how
> > > big the state is, so I’ve decided above that it should allocate the SHM
> > > (D2) and provide it to the back-end.  Is that feasible or is it
> > > important (e.g. for real hardware) that the back-end supplies the SHM?
> > > (In which case the front-end would need to tell the back-end how big the
> > > state SHM needs to be.)
> > 
> > How does this work for iterative live migration?
> > 
> 
> A pipe will always fit better for iterative from qemu POV, that's for sure. 
> Especially if we want to keep that opaqueness.
> 
> But  we will need to communicate with the HW device using shared memory sooner
> or later for big states.  If we don't transform it in qemu, we will need to do
> it in the kernel.  Also, the pipe will not support daemon crashes.
>
> Again I'm just putting this on the table, just in case it fits better or it is
> convenient.  I missed the previous patch where SHM was proposed too, so maybe I
> missed some feedback useful here.  I think the pipe is a better solution in the
> long run because of the iterative part.

Pipes and shared memory are conceptually equivalent for building
streaming interfaces. It's just more complex to design a shared memory
interface and it reinvents what file descriptors already offer.

I have no doubt we could design iterative migration over a shared memory
interface if we needed to, but I'm not sure why? When you mention
hardware, are you suggesting defining a standard memory/register layout
that hardware implements and mapping it to userspace (QEMU)? Is there a
big advantage to exposing memory versus a file descriptor?

Stefan
Eugenio Perez Martin May 9, 2023, 6:31 a.m. UTC | #46
On Mon, May 8, 2023 at 9:12 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Apr 20, 2023 at 03:27:51PM +0200, Eugenio Pérez wrote:
> > On Tue, 2023-04-18 at 16:40 -0400, Stefan Hajnoczi wrote:
> > > On Tue, 18 Apr 2023 at 14:31, Eugenio Perez Martin <eperezma@redhat.com>
> > > wrote:
> > > > On Tue, Apr 18, 2023 at 7:59 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > On Tue, Apr 18, 2023 at 10:09:30AM +0200, Eugenio Perez Martin wrote:
> > > > > > On Mon, Apr 17, 2023 at 9:33 PM Stefan Hajnoczi <stefanha@gmail.com>
> > > > > > wrote:
> > > > > > > On Mon, 17 Apr 2023 at 15:10, Eugenio Perez Martin <
> > > > > > > eperezma@redhat.com> wrote:
> > > > > > > > On Mon, Apr 17, 2023 at 5:38 PM Stefan Hajnoczi <stefanha@redhat.com
> > > > > > > > > wrote:
> > > > > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin
> > > > > > > > > wrote:
> > > > > > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <
> > > > > > > > > > stefanha@redhat.com> wrote:
> > > > > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek
> > > > > > > > > > > wrote:
> > > > > > > > > > > > So-called "internal" virtio-fs migration refers to
> > > > > > > > > > > > transporting the
> > > > > > > > > > > > back-end's (virtiofsd's) state through qemu's migration
> > > > > > > > > > > > stream.  To do
> > > > > > > > > > > > this, we need to be able to transfer virtiofsd's internal
> > > > > > > > > > > > state to and
> > > > > > > > > > > > from virtiofsd.
> > > > > > > > > > > >
> > > > > > > > > > > > Because virtiofsd's internal state will not be too large, we
> > > > > > > > > > > > believe it
> > > > > > > > > > > > is best to transfer it as a single binary blob after the
> > > > > > > > > > > > streaming
> > > > > > > > > > > > phase.  Because this method should be useful to other vhost-
> > > > > > > > > > > > user
> > > > > > > > > > > > implementations, too, it is introduced as a general-purpose
> > > > > > > > > > > > addition to
> > > > > > > > > > > > the protocol, not limited to vhost-user-fs.
> > > > > > > > > > > >
> > > > > > > > > > > > These are the additions to the protocol:
> > > > > > > > > > > > - New vhost-user protocol feature
> > > > > > > > > > > > VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > > > > > > > > >   This feature signals support for transferring state, and
> > > > > > > > > > > > is added so
> > > > > > > > > > > >   that migration can fail early when the back-end has no
> > > > > > > > > > > > support.
> > > > > > > > > > > >
> > > > > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end
> > > > > > > > > > > > negotiate a pipe
> > > > > > > > > > > >   over which to transfer the state.  The front-end sends an
> > > > > > > > > > > > FD to the
> > > > > > > > > > > >   back-end into/from which it can write/read its state, and
> > > > > > > > > > > > the back-end
> > > > > > > > > > > >   can decide to either use it, or reply with a different FD
> > > > > > > > > > > > for the
> > > > > > > > > > > >   front-end to override the front-end's choice.
> > > > > > > > > > > >   The front-end creates a simple pipe to transfer the state,
> > > > > > > > > > > > but maybe
> > > > > > > > > > > >   the back-end already has an FD into/from which it has to
> > > > > > > > > > > > write/read
> > > > > > > > > > > >   its state, in which case it will want to override the
> > > > > > > > > > > > simple pipe.
> > > > > > > > > > > >   Conversely, maybe in the future we find a way to have the
> > > > > > > > > > > > front-end
> > > > > > > > > > > >   get an immediate FD for the migration stream (in some
> > > > > > > > > > > > cases), in which
> > > > > > > > > > > >   case we will want to send this to the back-end instead of
> > > > > > > > > > > > creating a
> > > > > > > > > > > >   pipe.
> > > > > > > > > > > >   Hence the negotiation: If one side has a better idea than
> > > > > > > > > > > > a plain
> > > > > > > > > > > >   pipe, we will want to use that.
> > > > > > > > > > > >
> > > > > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred
> > > > > > > > > > > > through the
> > > > > > > > > > > >   pipe (the end indicated by EOF), the front-end invokes
> > > > > > > > > > > > this function
> > > > > > > > > > > >   to verify success.  There is no in-band way (through the
> > > > > > > > > > > > pipe) to
> > > > > > > > > > > >   indicate failure, so we need to check explicitly.
> > > > > > > > > > > >
> > > > > > > > > > > > Once the transfer pipe has been established via
> > > > > > > > > > > > SET_DEVICE_STATE_FD
> > > > > > > > > > > > (which includes establishing the direction of transfer and
> > > > > > > > > > > > migration
> > > > > > > > > > > > phase), the sending side writes its data into the pipe, and
> > > > > > > > > > > > the reading
> > > > > > > > > > > > side reads it until it sees an EOF.  Then, the front-end
> > > > > > > > > > > > will check for
> > > > > > > > > > > > success via CHECK_DEVICE_STATE, which on the destination
> > > > > > > > > > > > side includes
> > > > > > > > > > > > checking for integrity (i.e. errors during deserialization).
> > > > > > > > > > > >
> > > > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > > > > > > > ---
> > > > > > > > > > > >  include/hw/virtio/vhost-backend.h |  24 +++++
> > > > > > > > > > > >  include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > > > > > > > > >  hw/virtio/vhost-user.c            | 147
> > > > > > > > > > > > ++++++++++++++++++++++++++++++
> > > > > > > > > > > >  hw/virtio/vhost.c                 |  37 ++++++++
> > > > > > > > > > > >  4 files changed, 287 insertions(+)
> > > > > > > > > > > >
> > > > > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > > b/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > > > > > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > > > > > > > > >      VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > > > > > > > > >  } VhostSetConfigType;
> > > > > > > > > > > >
> > > > > > > > > > > > +typedef enum VhostDeviceStateDirection {
> > > > > > > > > > > > +    /* Transfer state from back-end (device) to front-end
> > > > > > > > > > > > */
> > > > > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > > > > > > > > +    /* Transfer state from front-end to back-end (device)
> > > > > > > > > > > > */
> > > > > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > > > > > > > > +} VhostDeviceStateDirection;
> > > > > > > > > > > > +
> > > > > > > > > > > > +typedef enum VhostDeviceStatePhase {
> > > > > > > > > > > > +    /* The device (and all its vrings) is stopped */
> > > > > > > > > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > > > > > > > > +} VhostDeviceStatePhase;
> > > > > > > > > > >
> > > > > > > > > > > vDPA has:
> > > > > > > > > > >
> > > > > > > > > > >   /* Suspend a device so it does not process virtqueue
> > > > > > > > > > > requests anymore
> > > > > > > > > > >    *
> > > > > > > > > > >    * After the return of ioctl the device must preserve all
> > > > > > > > > > > the necessary state
> > > > > > > > > > >    * (the virtqueue vring base plus the possible device
> > > > > > > > > > > specific states) that is
> > > > > > > > > > >    * required for restoring in the future. The device must not
> > > > > > > > > > > change its
> > > > > > > > > > >    * configuration after that point.
> > > > > > > > > > >    */
> > > > > > > > > > >   #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > > > > > > > > >
> > > > > > > > > > >   /* Resume a device so it can resume processing virtqueue
> > > > > > > > > > > requests
> > > > > > > > > > >    *
> > > > > > > > > > >    * After the return of this ioctl the device will have
> > > > > > > > > > > restored all the
> > > > > > > > > > >    * necessary states and it is fully operational to continue
> > > > > > > > > > > processing the
> > > > > > > > > > >    * virtqueue descriptors.
> > > > > > > > > > >    */
> > > > > > > > > > >   #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > > > > > > > > >
> > > > > > > > > > > I wonder if it makes sense to import these into vhost-user so
> > > > > > > > > > > that the
> > > > > > > > > > > difference between kernel vhost and vhost-user is minimized.
> > > > > > > > > > > It's okay
> > > > > > > > > > > if one of them is ahead of the other, but it would be nice to
> > > > > > > > > > > avoid
> > > > > > > > > > > overlapping/duplicated functionality.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > That's what I had in mind in the first versions. I proposed
> > > > > > > > > > VHOST_STOP
> > > > > > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did
> > > > > > > > > > change
> > > > > > > > > > to SUSPEND.
> > > > > > > > >
> > > > > > > > > I noticed QEMU only calls ioctl(VHOST_VDPA_SUSPEND) and not
> > > > > > > > > ioctl(VHOST_VDPA_RESUME).
> > > > > > > > >
> > > > > > > > > The doc comments in <linux/vdpa.h> don't explain how the device
> > > > > > > > > can
> > > > > > > > > leave the suspended state. Can you clarify this?
> > > > > > > > >
> > > > > > > >
> > > > > > > > Do you mean in what situations or regarding the semantics of
> > > > > > > > _RESUME?
> > > > > > > >
> > > > > > > > To me resume is an operation mainly to resume the device in the
> > > > > > > > event
> > > > > > > > of a VM suspension, not a migration. It can be used as a fallback
> > > > > > > > code
> > > > > > > > in some cases of migration failure though, but it is not currently
> > > > > > > > used in qemu.
> > > > > > >
> > > > > > > Is a "VM suspension" the QEMU HMP 'stop' command?
> > > > > > >
> > > > > > > I guess the reason why QEMU doesn't call RESUME anywhere is that it
> > > > > > > resets the device in vhost_dev_stop()?
> > > > > > >
> > > > > >
> > > > > > The actual reason for not using RESUME is that the ioctl was added
> > > > > > after the SUSPEND design in qemu. Same as this proposal, it is was not
> > > > > > needed at the time.
> > > > > >
> > > > > > In the case of vhost-vdpa net, the only usage of suspend is to fetch
> > > > > > the vq indexes, and in case of error vhost already fetches them from
> > > > > > guest's used ring way before vDPA, so it has little usage.
> > > > > >
> > > > > > > Does it make sense to combine SUSPEND and RESUME with Hanna's
> > > > > > > SET_DEVICE_STATE_FD? For example, non-iterative migration works like
> > > > > > > this:
> > > > > > > - Saving the device's state is done by SUSPEND followed by
> > > > > > > SET_DEVICE_STATE_FD. If the guest needs to continue executing (e.g.
> > > > > > > savevm command or migration failed), then RESUME is called to
> > > > > > > continue.
> > > > > >
> > > > > > I think the previous steps make sense at vhost_dev_stop, not virtio
> > > > > > savevm handlers. To start spreading this logic to more places of qemu
> > > > > > can bring confusion.
> > > > >
> > > > > I don't think there is a way around extending the QEMU vhost's code
> > > > > model. The current model in QEMU's vhost code is that the backend is
> > > > > reset when the VM stops. This model worked fine for stateless devices
> > > > > but it doesn't work for stateful devices.
> > > > >
> > > > > Imagine a vdpa-gpu device: you cannot reset the device in
> > > > > vhost_dev_stop() and expect the GPU to continue working when
> > > > > vhost_dev_start() is called again because all its state has been lost.
> > > > > The guest driver will send requests that references a virtio-gpu
> > > > > resources that no longer exist.
> > > > >
> > > > > One solution is to save the device's state in vhost_dev_stop(). I think
> > > > > this is what you're suggesting. It requires keeping a copy of the state
> > > > > and then loading the state again in vhost_dev_start(). I don't think
> > > > > this approach should be used because it requires all stateful devices to
> > > > > support live migration (otherwise they break across HMP 'stop'/'cont').
> > > > > Also, the device state for some devices may be large and it would also
> > > > > become more complicated when iterative migration is added.
> > > > >
> > > > > Instead, I think the QEMU vhost code needs to be structured so that
> > > > > struct vhost_dev has a suspended state:
> > > > >
> > > > >         ,---------.
> > > > >         v         |
> > > > >   started ------> stopped
> > > > >     \   ^
> > > > >      \  |
> > > > >       -> suspended
> > > > >
> > > > > The device doesn't lose state when it enters the suspended state. It can
> > > > > be resumed again.
> > > > >
> > > > > This is why I think SUSPEND/RESUME need to be part of the solution.
> >
> > I just realize that we can add an arrow from suspended to stopped, isn't it?
>
> Yes, it could be used in the case of a successful live migration:
> [started] -> vhost_dev_suspend() [suspended] -> vhost_dev_stop() [stopped]
>
> > "Started" before seems to imply the device may process descriptors after
> > suspend.
>
> Yes, in the case of a failed live migration:
> [started] -> vhost_dev_suspend() [suspended] -> vhost_dev_resume() [started]
>

I meant "the device may (is allowed) to process descriptors after
suspend and before stopped". I think we have the same view here, just
trying to specify the semantics here as completely as possible :).

> > > >
> > > > I agree with all of this, especially after realizing vhost_dev_stop is
> > > > called before the last request of the state in the iterative
> > > > migration.
> > > >
> > > > However I think we can move faster with the virtiofsd migration code,
> > > > as long as we agree on the vhost-user messages it will receive. This
> > > > is because we already agree that the state will be sent in one shot
> > > > and not iteratively, so it will be small.
> > > >
> > > > I understand this may change in the future, that's why I proposed to
> > > > start using iterative right now. However it may make little sense if
> > > > it is not used in the vhost-user device. I also understand that other
> > > > devices may have a bigger state so it will be needed for them.
> > >
> > > Can you summarize how you'd like save to work today? I'm not sure what
> > > you have in mind.
> > >
> >
> > I think we're trying to find a solution that satisfies many things.  On one
> > side, we're assuming that the virtiofsd state will be small enough to be
> > assumable it will not require iterative migration in the short term.  However,
> > we also want to support iterative migration, for the shake of *other* future
> > vhost devices that may need it.
> >
> > I also think we should prioritize the protocols stability, in the sense of not
> > adding calls that we will not reuse for iterative LM.  Being vhost-user protocol
> > more important to maintain than the qemu migration.
> >
> > To implement the changes you mention will be needed in the future.  But we have
> > already set that the virtiofsd is small, so we can just fetch them by the same
> > time than we send VHOST_USER_GET_VRING_BASE message and send the status with the
> > proposed non-iterative approach.
>
> VHOST_USER_GET_VRING_BASE itself isn't really enough because it stops a
> specific virtqueue but not the whole device. Unfortunately stopping all
> virtqueues is not the same as SUSPEND since spontaneous device activity
> is possible independent of any virtqueue (e.g. virtio-scsi events and
> maybe virtio-net link status).
>
> That's why I think SUSPEND is necessary for a solution that's generic
> enough to cover all device types.
>

I agree.

In particular virtiofsd is already resetting all the device at
VHOST_USER_GET_VRING_BASE if I'm not wrong, so that's even more of a
reason to implement suspend call.

Thanks!

> > If we agree on that, now the question is how to fetch them from the device.  The
> > answers are a little bit scattered in the mail threads, but I think we agree on:
> > a) We need to signal that the device must stop processing requests.
> > b) We need a way for the device to dump the state.
> >
> > At this moment I think any proposal satisfies a), and pipe satisfies better b).
> > With proper backend feature flags, the device may support to start writing to
> > the pipe before SUSPEND so we can implement iterative migration on top.
> >
> > Does that makes sense?
>
> Yes, and that sounds like what Hanna is proposing for b) plus our
> discussion about SUSPEND/RESUME in order to achieve a).
>
> Stefan
Eugenio Perez Martin May 9, 2023, 6:45 a.m. UTC | #47
On Mon, May 8, 2023 at 10:10 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Apr 20, 2023 at 03:29:44PM +0200, Eugenio Pérez wrote:
> > On Wed, 2023-04-19 at 07:21 -0400, Stefan Hajnoczi wrote:
> > > On Wed, 19 Apr 2023 at 07:10, Hanna Czenczek <hreitz@redhat.com> wrote:
> > > > On 18.04.23 09:54, Eugenio Perez Martin wrote:
> > > > > On Mon, Apr 17, 2023 at 9:21 PM Stefan Hajnoczi <stefanha@gmail.com>
> > > > > wrote:
> > > > > > On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com>
> > > > > > wrote:
> > > > > > > On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > wrote:
> > > > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin
> > > > > > > > wrote:
> > > > > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <
> > > > > > > > > stefanha@redhat.com> wrote:
> > > > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > > > > > > > So-called "internal" virtio-fs migration refers to
> > > > > > > > > > > transporting the
> > > > > > > > > > > back-end's (virtiofsd's) state through qemu's migration
> > > > > > > > > > > stream.  To do
> > > > > > > > > > > this, we need to be able to transfer virtiofsd's internal
> > > > > > > > > > > state to and
> > > > > > > > > > > from virtiofsd.
> > > > > > > > > > >
> > > > > > > > > > > Because virtiofsd's internal state will not be too large, we
> > > > > > > > > > > believe it
> > > > > > > > > > > is best to transfer it as a single binary blob after the
> > > > > > > > > > > streaming
> > > > > > > > > > > phase.  Because this method should be useful to other vhost-
> > > > > > > > > > > user
> > > > > > > > > > > implementations, too, it is introduced as a general-purpose
> > > > > > > > > > > addition to
> > > > > > > > > > > the protocol, not limited to vhost-user-fs.
> > > > > > > > > > >
> > > > > > > > > > > These are the additions to the protocol:
> > > > > > > > > > > - New vhost-user protocol feature
> > > > > > > > > > > VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > > > > > > > >    This feature signals support for transferring state, and is
> > > > > > > > > > > added so
> > > > > > > > > > >    that migration can fail early when the back-end has no
> > > > > > > > > > > support.
> > > > > > > > > > >
> > > > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end
> > > > > > > > > > > negotiate a pipe
> > > > > > > > > > >    over which to transfer the state.  The front-end sends an
> > > > > > > > > > > FD to the
> > > > > > > > > > >    back-end into/from which it can write/read its state, and
> > > > > > > > > > > the back-end
> > > > > > > > > > >    can decide to either use it, or reply with a different FD
> > > > > > > > > > > for the
> > > > > > > > > > >    front-end to override the front-end's choice.
> > > > > > > > > > >    The front-end creates a simple pipe to transfer the state,
> > > > > > > > > > > but maybe
> > > > > > > > > > >    the back-end already has an FD into/from which it has to
> > > > > > > > > > > write/read
> > > > > > > > > > >    its state, in which case it will want to override the
> > > > > > > > > > > simple pipe.
> > > > > > > > > > >    Conversely, maybe in the future we find a way to have the
> > > > > > > > > > > front-end
> > > > > > > > > > >    get an immediate FD for the migration stream (in some
> > > > > > > > > > > cases), in which
> > > > > > > > > > >    case we will want to send this to the back-end instead of
> > > > > > > > > > > creating a
> > > > > > > > > > >    pipe.
> > > > > > > > > > >    Hence the negotiation: If one side has a better idea than a
> > > > > > > > > > > plain
> > > > > > > > > > >    pipe, we will want to use that.
> > > > > > > > > > >
> > > > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred
> > > > > > > > > > > through the
> > > > > > > > > > >    pipe (the end indicated by EOF), the front-end invokes this
> > > > > > > > > > > function
> > > > > > > > > > >    to verify success.  There is no in-band way (through the
> > > > > > > > > > > pipe) to
> > > > > > > > > > >    indicate failure, so we need to check explicitly.
> > > > > > > > > > >
> > > > > > > > > > > Once the transfer pipe has been established via
> > > > > > > > > > > SET_DEVICE_STATE_FD
> > > > > > > > > > > (which includes establishing the direction of transfer and
> > > > > > > > > > > migration
> > > > > > > > > > > phase), the sending side writes its data into the pipe, and
> > > > > > > > > > > the reading
> > > > > > > > > > > side reads it until it sees an EOF.  Then, the front-end will
> > > > > > > > > > > check for
> > > > > > > > > > > success via CHECK_DEVICE_STATE, which on the destination side
> > > > > > > > > > > includes
> > > > > > > > > > > checking for integrity (i.e. errors during deserialization).
> > > > > > > > > > >
> > > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > > > > > > ---
> > > > > > > > > > >   include/hw/virtio/vhost-backend.h |  24 +++++
> > > > > > > > > > >   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > > > > > > > >   hw/virtio/vhost-user.c            | 147
> > > > > > > > > > > ++++++++++++++++++++++++++++++
> > > > > > > > > > >   hw/virtio/vhost.c                 |  37 ++++++++
> > > > > > > > > > >   4 files changed, 287 insertions(+)
> > > > > > > > > > >
> > > > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > b/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > > > > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > > > > > > > >       VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > > > > > > > >   } VhostSetConfigType;
> > > > > > > > > > >
> > > > > > > > > > > +typedef enum VhostDeviceStateDirection {
> > > > > > > > > > > +    /* Transfer state from back-end (device) to front-end */
> > > > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > > > > > > > +    /* Transfer state from front-end to back-end (device) */
> > > > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > > > > > > > +} VhostDeviceStateDirection;
> > > > > > > > > > > +
> > > > > > > > > > > +typedef enum VhostDeviceStatePhase {
> > > > > > > > > > > +    /* The device (and all its vrings) is stopped */
> > > > > > > > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > > > > > > > +} VhostDeviceStatePhase;
> > > > > > > > > > vDPA has:
> > > > > > > > > >
> > > > > > > > > >    /* Suspend a device so it does not process virtqueue requests
> > > > > > > > > > anymore
> > > > > > > > > >     *
> > > > > > > > > >     * After the return of ioctl the device must preserve all the
> > > > > > > > > > necessary state
> > > > > > > > > >     * (the virtqueue vring base plus the possible device
> > > > > > > > > > specific states) that is
> > > > > > > > > >     * required for restoring in the future. The device must not
> > > > > > > > > > change its
> > > > > > > > > >     * configuration after that point.
> > > > > > > > > >     */
> > > > > > > > > >    #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > > > > > > > >
> > > > > > > > > >    /* Resume a device so it can resume processing virtqueue
> > > > > > > > > > requests
> > > > > > > > > >     *
> > > > > > > > > >     * After the return of this ioctl the device will have
> > > > > > > > > > restored all the
> > > > > > > > > >     * necessary states and it is fully operational to continue
> > > > > > > > > > processing the
> > > > > > > > > >     * virtqueue descriptors.
> > > > > > > > > >     */
> > > > > > > > > >    #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > > > > > > > >
> > > > > > > > > > I wonder if it makes sense to import these into vhost-user so
> > > > > > > > > > that the
> > > > > > > > > > difference between kernel vhost and vhost-user is minimized.
> > > > > > > > > > It's okay
> > > > > > > > > > if one of them is ahead of the other, but it would be nice to
> > > > > > > > > > avoid
> > > > > > > > > > overlapping/duplicated functionality.
> > > > > > > > > >
> > > > > > > > > That's what I had in mind in the first versions. I proposed
> > > > > > > > > VHOST_STOP
> > > > > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did
> > > > > > > > > change
> > > > > > > > > to SUSPEND.
> > > > > > > > >
> > > > > > > > > Generally it is better if we make the interface less parametrized
> > > > > > > > > and
> > > > > > > > > we trust in the messages and its semantics in my opinion. In other
> > > > > > > > > words, instead of
> > > > > > > > > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED),
> > > > > > > > > send
> > > > > > > > > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user
> > > > > > > > > command.
> > > > > > > > >
> > > > > > > > > Another way to apply this is with the "direction" parameter. Maybe
> > > > > > > > > it
> > > > > > > > > is better to split it into "set_state_fd" and "get_state_fd"?
> > > > > > > > >
> > > > > > > > > In that case, reusing the ioctls as vhost-user messages would be
> > > > > > > > > ok.
> > > > > > > > > But that puts this proposal further from the VFIO code, which uses
> > > > > > > > > "migration_set_state(state)", and maybe it is better when the
> > > > > > > > > number
> > > > > > > > > of states is high.
> > > > > > > > Hi Eugenio,
> > > > > > > > Another question about vDPA suspend/resume:
> > > > > > > >
> > > > > > > >    /* Host notifiers must be enabled at this point. */
> > > > > > > >    void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev,
> > > > > > > > bool vrings)
> > > > > > > >    {
> > > > > > > >        int i;
> > > > > > > >
> > > > > > > >        /* should only be called after backend is connected */
> > > > > > > >        assert(hdev->vhost_ops);
> > > > > > > >        event_notifier_test_and_clear(
> > > > > > > >            &hdev-
> > > > > > > > >vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier);
> > > > > > > >        event_notifier_test_and_clear(&vdev->config_notifier);
> > > > > > > >
> > > > > > > >        trace_vhost_dev_stop(hdev, vdev->name, vrings);
> > > > > > > >
> > > > > > > >        if (hdev->vhost_ops->vhost_dev_start) {
> > > > > > > >            hdev->vhost_ops->vhost_dev_start(hdev, false);
> > > > > > > >            ^^^ SUSPEND ^^^
> > > > > > > >        }
> > > > > > > >        if (vrings) {
> > > > > > > >            vhost_dev_set_vring_enable(hdev, false);
> > > > > > > >        }
> > > > > > > >        for (i = 0; i < hdev->nvqs; ++i) {
> > > > > > > >            vhost_virtqueue_stop(hdev,
> > > > > > > >                                 vdev,
> > > > > > > >                                 hdev->vqs + i,
> > > > > > > >                                 hdev->vq_index + i);
> > > > > > > >          ^^^ fetch virtqueue state from kernel ^^^
> > > > > > > >        }
> > > > > > > >        if (hdev->vhost_ops->vhost_reset_status) {
> > > > > > > >            hdev->vhost_ops->vhost_reset_status(hdev);
> > > > > > > >            ^^^ reset device^^^
> > > > > > > >
> > > > > > > > I noticed the QEMU vDPA code resets the device in vhost_dev_stop()
> > > > > > > > ->
> > > > > > > > vhost_reset_status(). The device's migration code runs after
> > > > > > > > vhost_dev_stop() and the state will have been lost.
> > > > > > > >
> > > > > > > vhost_virtqueue_stop saves the vq state (indexes, vring base) in the
> > > > > > > qemu VirtIONet device model. This is for all vhost backends.
> > > > > > >
> > > > > > > Regarding the state like mac or mq configuration, SVQ runs for all the
> > > > > > > VM run in the CVQ. So it can track all of that status in the device
> > > > > > > model too.
> > > > > > >
> > > > > > > When a migration effectively occurs, all the frontend state is
> > > > > > > migrated as a regular emulated device. To route all of the state in a
> > > > > > > normalized way for qemu is what leaves open the possibility to do
> > > > > > > cross-backends migrations, etc.
> > > > > > >
> > > > > > > Does that answer your question?
> > > > > > I think you're confirming that changes would be necessary in order for
> > > > > > vDPA to support the save/load operation that Hanna is introducing.
> > > > > >
> > > > > Yes, this first iteration was centered on net, with an eye on block,
> > > > > where state can be routed through classical emulated devices. This is
> > > > > how vhost-kernel and vhost-user do classically. And it allows
> > > > > cross-backend, to not modify qemu migration state, etc.
> > > > >
> > > > > To introduce this opaque state to qemu, that must be fetched after the
> > > > > suspend and not before, requires changes in vhost protocol, as
> > > > > discussed previously.
> > > > >
> > > > > > > > It looks like vDPA changes are necessary in order to support
> > > > > > > > stateful
> > > > > > > > devices even though QEMU already uses SUSPEND. Is my understanding
> > > > > > > > correct?
> > > > > > > >
> > > > > > > Changes are required elsewhere, as the code to restore the state
> > > > > > > properly in the destination has not been merged.
> > > > > > I'm not sure what you mean by elsewhere?
> > > > > >
> > > > > I meant for vdpa *net* devices the changes are not required in vdpa
> > > > > ioctls, but mostly in qemu.
> > > > >
> > > > > If you meant stateful as "it must have a state blob that it must be
> > > > > opaque to qemu", then I think the straightforward action is to fetch
> > > > > state blob about the same time as vq indexes. But yes, changes (at
> > > > > least a new ioctl) is needed for that.
> > > > >
> > > > > > I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and
> > > > > > then VHOST_VDPA_SET_STATUS 0.
> > > > > >
> > > > > > In order to save device state from the vDPA device in the future, it
> > > > > > will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that
> > > > > > the device state can be saved before the device is reset.
> > > > > >
> > > > > > Does that sound right?
> > > > > >
> > > > > The split between suspend and reset was added recently for that very
> > > > > reason. In all the virtio devices, the frontend is initialized before
> > > > > the backend, so I don't think it is a good idea to defer the backend
> > > > > cleanup. Especially if we have already set the state is small enough
> > > > > to not needing iterative migration from virtiofsd point of view.
> > > > >
> > > > > If fetching that state at the same time as vq indexes is not valid,
> > > > > could it follow the same model as the "in-flight descriptors"?
> > > > > vhost-user follows them by using a shared memory region where their
> > > > > state is tracked [1]. This allows qemu to survive vhost-user SW
> > > > > backend crashes, and does not forbid the cross-backends live migration
> > > > > as all the information is there to recover them.
> > > > >
> > > > > For hw devices this is not convenient as it occupies PCI bandwidth. So
> > > > > a possibility is to synchronize this memory region after a
> > > > > synchronization point, being the SUSPEND call or GET_VRING_BASE. HW
> > > > > devices are not going to crash in the software sense, so all use cases
> > > > > remain the same to qemu. And that shared memory information is
> > > > > recoverable after vhost_dev_stop.
> > > > >
> > > > > Does that sound reasonable to virtiofsd? To offer a shared memory
> > > > > region where it dumps the state, maybe only after the
> > > > > set_state(STATE_PHASE_STOPPED)?
> > > >
> > > > I don’t think we need the set_state() call, necessarily, if SUSPEND is
> > > > mandatory anyway.
> > > >
> > > > As for the shared memory, the RFC before this series used shared memory,
> > > > so it’s possible, yes.  But “shared memory region” can mean a lot of
> > > > things – it sounds like you’re saying the back-end (virtiofsd) should
> > > > provide it to the front-end, is that right?  That could work like this:
> > > >
> > > > On the source side:
> > > >
> > > > S1. SUSPEND goes to virtiofsd
> > > > S2. virtiofsd maybe double-checks that the device is stopped, then
> > > > serializes its state into a newly allocated shared memory area[1]
> > > > S3. virtiofsd responds to SUSPEND
> > > > S4. front-end requests shared memory, virtiofsd responds with a handle,
> > > > maybe already closes its reference
> > > > S5. front-end saves state, closes its handle, freeing the SHM
> > > >
> > > > [1] Maybe virtiofsd can correctly size the serialized state’s size, then
> > > > it can immediately allocate this area and serialize directly into it;
> > > > maybe it can’t, then we’ll need a bounce buffer.  Not really a
> > > > fundamental problem, but there are limitations around what you can do
> > > > with serde implementations in Rust…
> > > >
> > > > On the destination side:
> > > >
> > > > D1. Optional SUSPEND goes to virtiofsd that hasn’t yet done much;
> > > > virtiofsd would serialize its empty state into an SHM area, and respond
> > > > to SUSPEND
> > > > D2. front-end reads state from migration stream into an SHM it has allocated
> > > > D3. front-end supplies this SHM to virtiofsd, which discards its
> > > > previous area, and now uses this one
> > > > D4. RESUME goes to virtiofsd, which deserializes the state from the SHM
> > > >
> > > > Couple of questions:
> > > >
> > > > A. Stefan suggested D1, but it does seem wasteful now.  But if SUSPEND
> > > > would imply to deserialize a state, and the state is to be transferred
> > > > through SHM, this is what would need to be done.  So maybe we should
> > > > skip SUSPEND on the destination?
> > > > B. You described that the back-end should supply the SHM, which works
> > > > well on the source.  On the destination, only the front-end knows how
> > > > big the state is, so I’ve decided above that it should allocate the SHM
> > > > (D2) and provide it to the back-end.  Is that feasible or is it
> > > > important (e.g. for real hardware) that the back-end supplies the SHM?
> > > > (In which case the front-end would need to tell the back-end how big the
> > > > state SHM needs to be.)
> > >
> > > How does this work for iterative live migration?
> > >
> >
> > A pipe will always fit better for iterative from qemu POV, that's for sure.
> > Especially if we want to keep that opaqueness.
> >
> > But  we will need to communicate with the HW device using shared memory sooner
> > or later for big states.  If we don't transform it in qemu, we will need to do
> > it in the kernel.  Also, the pipe will not support daemon crashes.
> >
> > Again I'm just putting this on the table, just in case it fits better or it is
> > convenient.  I missed the previous patch where SHM was proposed too, so maybe I
> > missed some feedback useful here.  I think the pipe is a better solution in the
> > long run because of the iterative part.
>
> Pipes and shared memory are conceptually equivalent for building
> streaming interfaces. It's just more complex to design a shared memory
> interface and it reinvents what file descriptors already offer.
>
> I have no doubt we could design iterative migration over a shared memory
> interface if we needed to, but I'm not sure why? When you mention
> hardware, are you suggesting defining a standard memory/register layout
> that hardware implements and mapping it to userspace (QEMU)?

Right.

> Is there a
> big advantage to exposing memory versus a file descriptor?
>

For hardware it allows to retrieve and set the device state without
intervention of the kernel, saving context switches. For virtiofsd
this may not make a lot of sense, but I'm thinking on devices with big
states (virtio gpu, maybe?).

For software it allows the backend to survive a crash, as the old
state can be set directly to a fresh backend instance.

As I said, I'm not saying we must go with shared memory. We can always
add it on top, assuming the cost of maintaining both models. I'm just
trying to make sure we evaluate both.

Thanks!
Hanna Czenczek May 9, 2023, 9:01 a.m. UTC | #48
On 09.05.23 08:31, Eugenio Perez Martin wrote:
> On Mon, May 8, 2023 at 9:12 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:

[...]

>> VHOST_USER_GET_VRING_BASE itself isn't really enough because it stops a
>> specific virtqueue but not the whole device. Unfortunately stopping all
>> virtqueues is not the same as SUSPEND since spontaneous device activity
>> is possible independent of any virtqueue (e.g. virtio-scsi events and
>> maybe virtio-net link status).
>>
>> That's why I think SUSPEND is necessary for a solution that's generic
>> enough to cover all device types.
>>
> I agree.
>
> In particular virtiofsd is already resetting all the device at
> VHOST_USER_GET_VRING_BASE if I'm not wrong, so that's even more of a
> reason to implement suspend call.

Oh, no, just the vring in question.  Not the whole device.

In addition, we still need the GET_VRING_BASE call anyway, because, 
well, we want to restore the vring on the destination via SET_VRING_BASE.

Hanna
Stefan Hajnoczi May 9, 2023, 3:09 p.m. UTC | #49
On Tue, May 09, 2023 at 08:45:33AM +0200, Eugenio Perez Martin wrote:
> On Mon, May 8, 2023 at 10:10 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Thu, Apr 20, 2023 at 03:29:44PM +0200, Eugenio Pérez wrote:
> > > On Wed, 2023-04-19 at 07:21 -0400, Stefan Hajnoczi wrote:
> > > > On Wed, 19 Apr 2023 at 07:10, Hanna Czenczek <hreitz@redhat.com> wrote:
> > > > > On 18.04.23 09:54, Eugenio Perez Martin wrote:
> > > > > > On Mon, Apr 17, 2023 at 9:21 PM Stefan Hajnoczi <stefanha@gmail.com>
> > > > > > wrote:
> > > > > > > On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com>
> > > > > > > wrote:
> > > > > > > > On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > wrote:
> > > > > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin
> > > > > > > > > wrote:
> > > > > > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <
> > > > > > > > > > stefanha@redhat.com> wrote:
> > > > > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > > > > > > > > So-called "internal" virtio-fs migration refers to
> > > > > > > > > > > > transporting the
> > > > > > > > > > > > back-end's (virtiofsd's) state through qemu's migration
> > > > > > > > > > > > stream.  To do
> > > > > > > > > > > > this, we need to be able to transfer virtiofsd's internal
> > > > > > > > > > > > state to and
> > > > > > > > > > > > from virtiofsd.
> > > > > > > > > > > >
> > > > > > > > > > > > Because virtiofsd's internal state will not be too large, we
> > > > > > > > > > > > believe it
> > > > > > > > > > > > is best to transfer it as a single binary blob after the
> > > > > > > > > > > > streaming
> > > > > > > > > > > > phase.  Because this method should be useful to other vhost-
> > > > > > > > > > > > user
> > > > > > > > > > > > implementations, too, it is introduced as a general-purpose
> > > > > > > > > > > > addition to
> > > > > > > > > > > > the protocol, not limited to vhost-user-fs.
> > > > > > > > > > > >
> > > > > > > > > > > > These are the additions to the protocol:
> > > > > > > > > > > > - New vhost-user protocol feature
> > > > > > > > > > > > VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > > > > > > > > >    This feature signals support for transferring state, and is
> > > > > > > > > > > > added so
> > > > > > > > > > > >    that migration can fail early when the back-end has no
> > > > > > > > > > > > support.
> > > > > > > > > > > >
> > > > > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end
> > > > > > > > > > > > negotiate a pipe
> > > > > > > > > > > >    over which to transfer the state.  The front-end sends an
> > > > > > > > > > > > FD to the
> > > > > > > > > > > >    back-end into/from which it can write/read its state, and
> > > > > > > > > > > > the back-end
> > > > > > > > > > > >    can decide to either use it, or reply with a different FD
> > > > > > > > > > > > for the
> > > > > > > > > > > >    front-end to override the front-end's choice.
> > > > > > > > > > > >    The front-end creates a simple pipe to transfer the state,
> > > > > > > > > > > > but maybe
> > > > > > > > > > > >    the back-end already has an FD into/from which it has to
> > > > > > > > > > > > write/read
> > > > > > > > > > > >    its state, in which case it will want to override the
> > > > > > > > > > > > simple pipe.
> > > > > > > > > > > >    Conversely, maybe in the future we find a way to have the
> > > > > > > > > > > > front-end
> > > > > > > > > > > >    get an immediate FD for the migration stream (in some
> > > > > > > > > > > > cases), in which
> > > > > > > > > > > >    case we will want to send this to the back-end instead of
> > > > > > > > > > > > creating a
> > > > > > > > > > > >    pipe.
> > > > > > > > > > > >    Hence the negotiation: If one side has a better idea than a
> > > > > > > > > > > > plain
> > > > > > > > > > > >    pipe, we will want to use that.
> > > > > > > > > > > >
> > > > > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred
> > > > > > > > > > > > through the
> > > > > > > > > > > >    pipe (the end indicated by EOF), the front-end invokes this
> > > > > > > > > > > > function
> > > > > > > > > > > >    to verify success.  There is no in-band way (through the
> > > > > > > > > > > > pipe) to
> > > > > > > > > > > >    indicate failure, so we need to check explicitly.
> > > > > > > > > > > >
> > > > > > > > > > > > Once the transfer pipe has been established via
> > > > > > > > > > > > SET_DEVICE_STATE_FD
> > > > > > > > > > > > (which includes establishing the direction of transfer and
> > > > > > > > > > > > migration
> > > > > > > > > > > > phase), the sending side writes its data into the pipe, and
> > > > > > > > > > > > the reading
> > > > > > > > > > > > side reads it until it sees an EOF.  Then, the front-end will
> > > > > > > > > > > > check for
> > > > > > > > > > > > success via CHECK_DEVICE_STATE, which on the destination side
> > > > > > > > > > > > includes
> > > > > > > > > > > > checking for integrity (i.e. errors during deserialization).
> > > > > > > > > > > >
> > > > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > > > > > > > ---
> > > > > > > > > > > >   include/hw/virtio/vhost-backend.h |  24 +++++
> > > > > > > > > > > >   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > > > > > > > > >   hw/virtio/vhost-user.c            | 147
> > > > > > > > > > > > ++++++++++++++++++++++++++++++
> > > > > > > > > > > >   hw/virtio/vhost.c                 |  37 ++++++++
> > > > > > > > > > > >   4 files changed, 287 insertions(+)
> > > > > > > > > > > >
> > > > > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > > b/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > > > > > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > > > > > > > > >       VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > > > > > > > > >   } VhostSetConfigType;
> > > > > > > > > > > >
> > > > > > > > > > > > +typedef enum VhostDeviceStateDirection {
> > > > > > > > > > > > +    /* Transfer state from back-end (device) to front-end */
> > > > > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > > > > > > > > +    /* Transfer state from front-end to back-end (device) */
> > > > > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > > > > > > > > +} VhostDeviceStateDirection;
> > > > > > > > > > > > +
> > > > > > > > > > > > +typedef enum VhostDeviceStatePhase {
> > > > > > > > > > > > +    /* The device (and all its vrings) is stopped */
> > > > > > > > > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > > > > > > > > +} VhostDeviceStatePhase;
> > > > > > > > > > > vDPA has:
> > > > > > > > > > >
> > > > > > > > > > >    /* Suspend a device so it does not process virtqueue requests
> > > > > > > > > > > anymore
> > > > > > > > > > >     *
> > > > > > > > > > >     * After the return of ioctl the device must preserve all the
> > > > > > > > > > > necessary state
> > > > > > > > > > >     * (the virtqueue vring base plus the possible device
> > > > > > > > > > > specific states) that is
> > > > > > > > > > >     * required for restoring in the future. The device must not
> > > > > > > > > > > change its
> > > > > > > > > > >     * configuration after that point.
> > > > > > > > > > >     */
> > > > > > > > > > >    #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > > > > > > > > >
> > > > > > > > > > >    /* Resume a device so it can resume processing virtqueue
> > > > > > > > > > > requests
> > > > > > > > > > >     *
> > > > > > > > > > >     * After the return of this ioctl the device will have
> > > > > > > > > > > restored all the
> > > > > > > > > > >     * necessary states and it is fully operational to continue
> > > > > > > > > > > processing the
> > > > > > > > > > >     * virtqueue descriptors.
> > > > > > > > > > >     */
> > > > > > > > > > >    #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > > > > > > > > >
> > > > > > > > > > > I wonder if it makes sense to import these into vhost-user so
> > > > > > > > > > > that the
> > > > > > > > > > > difference between kernel vhost and vhost-user is minimized.
> > > > > > > > > > > It's okay
> > > > > > > > > > > if one of them is ahead of the other, but it would be nice to
> > > > > > > > > > > avoid
> > > > > > > > > > > overlapping/duplicated functionality.
> > > > > > > > > > >
> > > > > > > > > > That's what I had in mind in the first versions. I proposed
> > > > > > > > > > VHOST_STOP
> > > > > > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did
> > > > > > > > > > change
> > > > > > > > > > to SUSPEND.
> > > > > > > > > >
> > > > > > > > > > Generally it is better if we make the interface less parametrized
> > > > > > > > > > and
> > > > > > > > > > we trust in the messages and its semantics in my opinion. In other
> > > > > > > > > > words, instead of
> > > > > > > > > > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED),
> > > > > > > > > > send
> > > > > > > > > > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user
> > > > > > > > > > command.
> > > > > > > > > >
> > > > > > > > > > Another way to apply this is with the "direction" parameter. Maybe
> > > > > > > > > > it
> > > > > > > > > > is better to split it into "set_state_fd" and "get_state_fd"?
> > > > > > > > > >
> > > > > > > > > > In that case, reusing the ioctls as vhost-user messages would be
> > > > > > > > > > ok.
> > > > > > > > > > But that puts this proposal further from the VFIO code, which uses
> > > > > > > > > > "migration_set_state(state)", and maybe it is better when the
> > > > > > > > > > number
> > > > > > > > > > of states is high.
> > > > > > > > > Hi Eugenio,
> > > > > > > > > Another question about vDPA suspend/resume:
> > > > > > > > >
> > > > > > > > >    /* Host notifiers must be enabled at this point. */
> > > > > > > > >    void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev,
> > > > > > > > > bool vrings)
> > > > > > > > >    {
> > > > > > > > >        int i;
> > > > > > > > >
> > > > > > > > >        /* should only be called after backend is connected */
> > > > > > > > >        assert(hdev->vhost_ops);
> > > > > > > > >        event_notifier_test_and_clear(
> > > > > > > > >            &hdev-
> > > > > > > > > >vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier);
> > > > > > > > >        event_notifier_test_and_clear(&vdev->config_notifier);
> > > > > > > > >
> > > > > > > > >        trace_vhost_dev_stop(hdev, vdev->name, vrings);
> > > > > > > > >
> > > > > > > > >        if (hdev->vhost_ops->vhost_dev_start) {
> > > > > > > > >            hdev->vhost_ops->vhost_dev_start(hdev, false);
> > > > > > > > >            ^^^ SUSPEND ^^^
> > > > > > > > >        }
> > > > > > > > >        if (vrings) {
> > > > > > > > >            vhost_dev_set_vring_enable(hdev, false);
> > > > > > > > >        }
> > > > > > > > >        for (i = 0; i < hdev->nvqs; ++i) {
> > > > > > > > >            vhost_virtqueue_stop(hdev,
> > > > > > > > >                                 vdev,
> > > > > > > > >                                 hdev->vqs + i,
> > > > > > > > >                                 hdev->vq_index + i);
> > > > > > > > >          ^^^ fetch virtqueue state from kernel ^^^
> > > > > > > > >        }
> > > > > > > > >        if (hdev->vhost_ops->vhost_reset_status) {
> > > > > > > > >            hdev->vhost_ops->vhost_reset_status(hdev);
> > > > > > > > >            ^^^ reset device^^^
> > > > > > > > >
> > > > > > > > > I noticed the QEMU vDPA code resets the device in vhost_dev_stop()
> > > > > > > > > ->
> > > > > > > > > vhost_reset_status(). The device's migration code runs after
> > > > > > > > > vhost_dev_stop() and the state will have been lost.
> > > > > > > > >
> > > > > > > > vhost_virtqueue_stop saves the vq state (indexes, vring base) in the
> > > > > > > > qemu VirtIONet device model. This is for all vhost backends.
> > > > > > > >
> > > > > > > > Regarding the state like mac or mq configuration, SVQ runs for all the
> > > > > > > > VM run in the CVQ. So it can track all of that status in the device
> > > > > > > > model too.
> > > > > > > >
> > > > > > > > When a migration effectively occurs, all the frontend state is
> > > > > > > > migrated as a regular emulated device. To route all of the state in a
> > > > > > > > normalized way for qemu is what leaves open the possibility to do
> > > > > > > > cross-backends migrations, etc.
> > > > > > > >
> > > > > > > > Does that answer your question?
> > > > > > > I think you're confirming that changes would be necessary in order for
> > > > > > > vDPA to support the save/load operation that Hanna is introducing.
> > > > > > >
> > > > > > Yes, this first iteration was centered on net, with an eye on block,
> > > > > > where state can be routed through classical emulated devices. This is
> > > > > > how vhost-kernel and vhost-user do classically. And it allows
> > > > > > cross-backend, to not modify qemu migration state, etc.
> > > > > >
> > > > > > To introduce this opaque state to qemu, that must be fetched after the
> > > > > > suspend and not before, requires changes in vhost protocol, as
> > > > > > discussed previously.
> > > > > >
> > > > > > > > > It looks like vDPA changes are necessary in order to support
> > > > > > > > > stateful
> > > > > > > > > devices even though QEMU already uses SUSPEND. Is my understanding
> > > > > > > > > correct?
> > > > > > > > >
> > > > > > > > Changes are required elsewhere, as the code to restore the state
> > > > > > > > properly in the destination has not been merged.
> > > > > > > I'm not sure what you mean by elsewhere?
> > > > > > >
> > > > > > I meant for vdpa *net* devices the changes are not required in vdpa
> > > > > > ioctls, but mostly in qemu.
> > > > > >
> > > > > > If you meant stateful as "it must have a state blob that it must be
> > > > > > opaque to qemu", then I think the straightforward action is to fetch
> > > > > > state blob about the same time as vq indexes. But yes, changes (at
> > > > > > least a new ioctl) is needed for that.
> > > > > >
> > > > > > > I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and
> > > > > > > then VHOST_VDPA_SET_STATUS 0.
> > > > > > >
> > > > > > > In order to save device state from the vDPA device in the future, it
> > > > > > > will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that
> > > > > > > the device state can be saved before the device is reset.
> > > > > > >
> > > > > > > Does that sound right?
> > > > > > >
> > > > > > The split between suspend and reset was added recently for that very
> > > > > > reason. In all the virtio devices, the frontend is initialized before
> > > > > > the backend, so I don't think it is a good idea to defer the backend
> > > > > > cleanup. Especially if we have already set the state is small enough
> > > > > > to not needing iterative migration from virtiofsd point of view.
> > > > > >
> > > > > > If fetching that state at the same time as vq indexes is not valid,
> > > > > > could it follow the same model as the "in-flight descriptors"?
> > > > > > vhost-user follows them by using a shared memory region where their
> > > > > > state is tracked [1]. This allows qemu to survive vhost-user SW
> > > > > > backend crashes, and does not forbid the cross-backends live migration
> > > > > > as all the information is there to recover them.
> > > > > >
> > > > > > For hw devices this is not convenient as it occupies PCI bandwidth. So
> > > > > > a possibility is to synchronize this memory region after a
> > > > > > synchronization point, being the SUSPEND call or GET_VRING_BASE. HW
> > > > > > devices are not going to crash in the software sense, so all use cases
> > > > > > remain the same to qemu. And that shared memory information is
> > > > > > recoverable after vhost_dev_stop.
> > > > > >
> > > > > > Does that sound reasonable to virtiofsd? To offer a shared memory
> > > > > > region where it dumps the state, maybe only after the
> > > > > > set_state(STATE_PHASE_STOPPED)?
> > > > >
> > > > > I don’t think we need the set_state() call, necessarily, if SUSPEND is
> > > > > mandatory anyway.
> > > > >
> > > > > As for the shared memory, the RFC before this series used shared memory,
> > > > > so it’s possible, yes.  But “shared memory region” can mean a lot of
> > > > > things – it sounds like you’re saying the back-end (virtiofsd) should
> > > > > provide it to the front-end, is that right?  That could work like this:
> > > > >
> > > > > On the source side:
> > > > >
> > > > > S1. SUSPEND goes to virtiofsd
> > > > > S2. virtiofsd maybe double-checks that the device is stopped, then
> > > > > serializes its state into a newly allocated shared memory area[1]
> > > > > S3. virtiofsd responds to SUSPEND
> > > > > S4. front-end requests shared memory, virtiofsd responds with a handle,
> > > > > maybe already closes its reference
> > > > > S5. front-end saves state, closes its handle, freeing the SHM
> > > > >
> > > > > [1] Maybe virtiofsd can correctly size the serialized state’s size, then
> > > > > it can immediately allocate this area and serialize directly into it;
> > > > > maybe it can’t, then we’ll need a bounce buffer.  Not really a
> > > > > fundamental problem, but there are limitations around what you can do
> > > > > with serde implementations in Rust…
> > > > >
> > > > > On the destination side:
> > > > >
> > > > > D1. Optional SUSPEND goes to virtiofsd that hasn’t yet done much;
> > > > > virtiofsd would serialize its empty state into an SHM area, and respond
> > > > > to SUSPEND
> > > > > D2. front-end reads state from migration stream into an SHM it has allocated
> > > > > D3. front-end supplies this SHM to virtiofsd, which discards its
> > > > > previous area, and now uses this one
> > > > > D4. RESUME goes to virtiofsd, which deserializes the state from the SHM
> > > > >
> > > > > Couple of questions:
> > > > >
> > > > > A. Stefan suggested D1, but it does seem wasteful now.  But if SUSPEND
> > > > > would imply to deserialize a state, and the state is to be transferred
> > > > > through SHM, this is what would need to be done.  So maybe we should
> > > > > skip SUSPEND on the destination?
> > > > > B. You described that the back-end should supply the SHM, which works
> > > > > well on the source.  On the destination, only the front-end knows how
> > > > > big the state is, so I’ve decided above that it should allocate the SHM
> > > > > (D2) and provide it to the back-end.  Is that feasible or is it
> > > > > important (e.g. for real hardware) that the back-end supplies the SHM?
> > > > > (In which case the front-end would need to tell the back-end how big the
> > > > > state SHM needs to be.)
> > > >
> > > > How does this work for iterative live migration?
> > > >
> > >
> > > A pipe will always fit better for iterative from qemu POV, that's for sure.
> > > Especially if we want to keep that opaqueness.
> > >
> > > But  we will need to communicate with the HW device using shared memory sooner
> > > or later for big states.  If we don't transform it in qemu, we will need to do
> > > it in the kernel.  Also, the pipe will not support daemon crashes.
> > >
> > > Again I'm just putting this on the table, just in case it fits better or it is
> > > convenient.  I missed the previous patch where SHM was proposed too, so maybe I
> > > missed some feedback useful here.  I think the pipe is a better solution in the
> > > long run because of the iterative part.
> >
> > Pipes and shared memory are conceptually equivalent for building
> > streaming interfaces. It's just more complex to design a shared memory
> > interface and it reinvents what file descriptors already offer.
> >
> > I have no doubt we could design iterative migration over a shared memory
> > interface if we needed to, but I'm not sure why? When you mention
> > hardware, are you suggesting defining a standard memory/register layout
> > that hardware implements and mapping it to userspace (QEMU)?
> 
> Right.
> 
> > Is there a
> > big advantage to exposing memory versus a file descriptor?
> >
> 
> For hardware it allows to retrieve and set the device state without
> intervention of the kernel, saving context switches. For virtiofsd
> this may not make a lot of sense, but I'm thinking on devices with big
> states (virtio gpu, maybe?).

A streaming interface implemented using shared memory involves consuming
chunks of bytes. Each time data has been read, an action must be
performed to notify the device and receive a notification when more data
becomes available.

That notification involves the kernel (e.g. an eventfd that is triggered
by a hardware interrupt) and a read(2) syscall to reset the eventfd.

Unless userspace disables notifications and polls (busy waits) the
hardware registers, there is still going to be kernel involvement and a
context switch. For this reason, I think that shared memory vs pipes
will not be significantly different.

> For software it allows the backend to survive a crash, as the old
> state can be set directly to a fresh backend instance.

Can you explain by describing the steps involved? Are you sure it can
only be done with shared memory and not pipes?

Stefan
Eugenio Perez Martin May 9, 2023, 3:26 p.m. UTC | #50
On Tue, May 9, 2023 at 11:01 AM Hanna Czenczek <hreitz@redhat.com> wrote:
>
> On 09.05.23 08:31, Eugenio Perez Martin wrote:
> > On Mon, May 8, 2023 at 9:12 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> [...]
>
> >> VHOST_USER_GET_VRING_BASE itself isn't really enough because it stops a
> >> specific virtqueue but not the whole device. Unfortunately stopping all
> >> virtqueues is not the same as SUSPEND since spontaneous device activity
> >> is possible independent of any virtqueue (e.g. virtio-scsi events and
> >> maybe virtio-net link status).
> >>
> >> That's why I think SUSPEND is necessary for a solution that's generic
> >> enough to cover all device types.
> >>
> > I agree.
> >
> > In particular virtiofsd is already resetting all the device at
> > VHOST_USER_GET_VRING_BASE if I'm not wrong, so that's even more of a
> > reason to implement suspend call.
>
> Oh, no, just the vring in question.  Not the whole device.
>
> In addition, we still need the GET_VRING_BASE call anyway, because,
> well, we want to restore the vring on the destination via SET_VRING_BASE.
>

Ok, that makes sense, sorry for the confusion!

Thanks!
Eugenio Perez Martin May 9, 2023, 3:35 p.m. UTC | #51
On Tue, May 9, 2023 at 5:09 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Tue, May 09, 2023 at 08:45:33AM +0200, Eugenio Perez Martin wrote:
> > On Mon, May 8, 2023 at 10:10 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Thu, Apr 20, 2023 at 03:29:44PM +0200, Eugenio Pérez wrote:
> > > > On Wed, 2023-04-19 at 07:21 -0400, Stefan Hajnoczi wrote:
> > > > > On Wed, 19 Apr 2023 at 07:10, Hanna Czenczek <hreitz@redhat.com> wrote:
> > > > > > On 18.04.23 09:54, Eugenio Perez Martin wrote:
> > > > > > > On Mon, Apr 17, 2023 at 9:21 PM Stefan Hajnoczi <stefanha@gmail.com>
> > > > > > > wrote:
> > > > > > > > On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com>
> > > > > > > > wrote:
> > > > > > > > > On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > > wrote:
> > > > > > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin
> > > > > > > > > > wrote:
> > > > > > > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <
> > > > > > > > > > > stefanha@redhat.com> wrote:
> > > > > > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > > > > > > > > > So-called "internal" virtio-fs migration refers to
> > > > > > > > > > > > > transporting the
> > > > > > > > > > > > > back-end's (virtiofsd's) state through qemu's migration
> > > > > > > > > > > > > stream.  To do
> > > > > > > > > > > > > this, we need to be able to transfer virtiofsd's internal
> > > > > > > > > > > > > state to and
> > > > > > > > > > > > > from virtiofsd.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Because virtiofsd's internal state will not be too large, we
> > > > > > > > > > > > > believe it
> > > > > > > > > > > > > is best to transfer it as a single binary blob after the
> > > > > > > > > > > > > streaming
> > > > > > > > > > > > > phase.  Because this method should be useful to other vhost-
> > > > > > > > > > > > > user
> > > > > > > > > > > > > implementations, too, it is introduced as a general-purpose
> > > > > > > > > > > > > addition to
> > > > > > > > > > > > > the protocol, not limited to vhost-user-fs.
> > > > > > > > > > > > >
> > > > > > > > > > > > > These are the additions to the protocol:
> > > > > > > > > > > > > - New vhost-user protocol feature
> > > > > > > > > > > > > VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > > > > > > > > > >    This feature signals support for transferring state, and is
> > > > > > > > > > > > > added so
> > > > > > > > > > > > >    that migration can fail early when the back-end has no
> > > > > > > > > > > > > support.
> > > > > > > > > > > > >
> > > > > > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end
> > > > > > > > > > > > > negotiate a pipe
> > > > > > > > > > > > >    over which to transfer the state.  The front-end sends an
> > > > > > > > > > > > > FD to the
> > > > > > > > > > > > >    back-end into/from which it can write/read its state, and
> > > > > > > > > > > > > the back-end
> > > > > > > > > > > > >    can decide to either use it, or reply with a different FD
> > > > > > > > > > > > > for the
> > > > > > > > > > > > >    front-end to override the front-end's choice.
> > > > > > > > > > > > >    The front-end creates a simple pipe to transfer the state,
> > > > > > > > > > > > > but maybe
> > > > > > > > > > > > >    the back-end already has an FD into/from which it has to
> > > > > > > > > > > > > write/read
> > > > > > > > > > > > >    its state, in which case it will want to override the
> > > > > > > > > > > > > simple pipe.
> > > > > > > > > > > > >    Conversely, maybe in the future we find a way to have the
> > > > > > > > > > > > > front-end
> > > > > > > > > > > > >    get an immediate FD for the migration stream (in some
> > > > > > > > > > > > > cases), in which
> > > > > > > > > > > > >    case we will want to send this to the back-end instead of
> > > > > > > > > > > > > creating a
> > > > > > > > > > > > >    pipe.
> > > > > > > > > > > > >    Hence the negotiation: If one side has a better idea than a
> > > > > > > > > > > > > plain
> > > > > > > > > > > > >    pipe, we will want to use that.
> > > > > > > > > > > > >
> > > > > > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred
> > > > > > > > > > > > > through the
> > > > > > > > > > > > >    pipe (the end indicated by EOF), the front-end invokes this
> > > > > > > > > > > > > function
> > > > > > > > > > > > >    to verify success.  There is no in-band way (through the
> > > > > > > > > > > > > pipe) to
> > > > > > > > > > > > >    indicate failure, so we need to check explicitly.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Once the transfer pipe has been established via
> > > > > > > > > > > > > SET_DEVICE_STATE_FD
> > > > > > > > > > > > > (which includes establishing the direction of transfer and
> > > > > > > > > > > > > migration
> > > > > > > > > > > > > phase), the sending side writes its data into the pipe, and
> > > > > > > > > > > > > the reading
> > > > > > > > > > > > > side reads it until it sees an EOF.  Then, the front-end will
> > > > > > > > > > > > > check for
> > > > > > > > > > > > > success via CHECK_DEVICE_STATE, which on the destination side
> > > > > > > > > > > > > includes
> > > > > > > > > > > > > checking for integrity (i.e. errors during deserialization).
> > > > > > > > > > > > >
> > > > > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > > > > > > > > ---
> > > > > > > > > > > > >   include/hw/virtio/vhost-backend.h |  24 +++++
> > > > > > > > > > > > >   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > > > > > > > > > >   hw/virtio/vhost-user.c            | 147
> > > > > > > > > > > > > ++++++++++++++++++++++++++++++
> > > > > > > > > > > > >   hw/virtio/vhost.c                 |  37 ++++++++
> > > > > > > > > > > > >   4 files changed, 287 insertions(+)
> > > > > > > > > > > > >
> > > > > > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > > > b/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > > > > > > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > > > > > > > > > >       VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > > > > > > > > > >   } VhostSetConfigType;
> > > > > > > > > > > > >
> > > > > > > > > > > > > +typedef enum VhostDeviceStateDirection {
> > > > > > > > > > > > > +    /* Transfer state from back-end (device) to front-end */
> > > > > > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > > > > > > > > > +    /* Transfer state from front-end to back-end (device) */
> > > > > > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > > > > > > > > > +} VhostDeviceStateDirection;
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +typedef enum VhostDeviceStatePhase {
> > > > > > > > > > > > > +    /* The device (and all its vrings) is stopped */
> > > > > > > > > > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > > > > > > > > > +} VhostDeviceStatePhase;
> > > > > > > > > > > > vDPA has:
> > > > > > > > > > > >
> > > > > > > > > > > >    /* Suspend a device so it does not process virtqueue requests
> > > > > > > > > > > > anymore
> > > > > > > > > > > >     *
> > > > > > > > > > > >     * After the return of ioctl the device must preserve all the
> > > > > > > > > > > > necessary state
> > > > > > > > > > > >     * (the virtqueue vring base plus the possible device
> > > > > > > > > > > > specific states) that is
> > > > > > > > > > > >     * required for restoring in the future. The device must not
> > > > > > > > > > > > change its
> > > > > > > > > > > >     * configuration after that point.
> > > > > > > > > > > >     */
> > > > > > > > > > > >    #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > > > > > > > > > >
> > > > > > > > > > > >    /* Resume a device so it can resume processing virtqueue
> > > > > > > > > > > > requests
> > > > > > > > > > > >     *
> > > > > > > > > > > >     * After the return of this ioctl the device will have
> > > > > > > > > > > > restored all the
> > > > > > > > > > > >     * necessary states and it is fully operational to continue
> > > > > > > > > > > > processing the
> > > > > > > > > > > >     * virtqueue descriptors.
> > > > > > > > > > > >     */
> > > > > > > > > > > >    #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > > > > > > > > > >
> > > > > > > > > > > > I wonder if it makes sense to import these into vhost-user so
> > > > > > > > > > > > that the
> > > > > > > > > > > > difference between kernel vhost and vhost-user is minimized.
> > > > > > > > > > > > It's okay
> > > > > > > > > > > > if one of them is ahead of the other, but it would be nice to
> > > > > > > > > > > > avoid
> > > > > > > > > > > > overlapping/duplicated functionality.
> > > > > > > > > > > >
> > > > > > > > > > > That's what I had in mind in the first versions. I proposed
> > > > > > > > > > > VHOST_STOP
> > > > > > > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did
> > > > > > > > > > > change
> > > > > > > > > > > to SUSPEND.
> > > > > > > > > > >
> > > > > > > > > > > Generally it is better if we make the interface less parametrized
> > > > > > > > > > > and
> > > > > > > > > > > we trust in the messages and its semantics in my opinion. In other
> > > > > > > > > > > words, instead of
> > > > > > > > > > > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED),
> > > > > > > > > > > send
> > > > > > > > > > > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user
> > > > > > > > > > > command.
> > > > > > > > > > >
> > > > > > > > > > > Another way to apply this is with the "direction" parameter. Maybe
> > > > > > > > > > > it
> > > > > > > > > > > is better to split it into "set_state_fd" and "get_state_fd"?
> > > > > > > > > > >
> > > > > > > > > > > In that case, reusing the ioctls as vhost-user messages would be
> > > > > > > > > > > ok.
> > > > > > > > > > > But that puts this proposal further from the VFIO code, which uses
> > > > > > > > > > > "migration_set_state(state)", and maybe it is better when the
> > > > > > > > > > > number
> > > > > > > > > > > of states is high.
> > > > > > > > > > Hi Eugenio,
> > > > > > > > > > Another question about vDPA suspend/resume:
> > > > > > > > > >
> > > > > > > > > >    /* Host notifiers must be enabled at this point. */
> > > > > > > > > >    void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev,
> > > > > > > > > > bool vrings)
> > > > > > > > > >    {
> > > > > > > > > >        int i;
> > > > > > > > > >
> > > > > > > > > >        /* should only be called after backend is connected */
> > > > > > > > > >        assert(hdev->vhost_ops);
> > > > > > > > > >        event_notifier_test_and_clear(
> > > > > > > > > >            &hdev-
> > > > > > > > > > >vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier);
> > > > > > > > > >        event_notifier_test_and_clear(&vdev->config_notifier);
> > > > > > > > > >
> > > > > > > > > >        trace_vhost_dev_stop(hdev, vdev->name, vrings);
> > > > > > > > > >
> > > > > > > > > >        if (hdev->vhost_ops->vhost_dev_start) {
> > > > > > > > > >            hdev->vhost_ops->vhost_dev_start(hdev, false);
> > > > > > > > > >            ^^^ SUSPEND ^^^
> > > > > > > > > >        }
> > > > > > > > > >        if (vrings) {
> > > > > > > > > >            vhost_dev_set_vring_enable(hdev, false);
> > > > > > > > > >        }
> > > > > > > > > >        for (i = 0; i < hdev->nvqs; ++i) {
> > > > > > > > > >            vhost_virtqueue_stop(hdev,
> > > > > > > > > >                                 vdev,
> > > > > > > > > >                                 hdev->vqs + i,
> > > > > > > > > >                                 hdev->vq_index + i);
> > > > > > > > > >          ^^^ fetch virtqueue state from kernel ^^^
> > > > > > > > > >        }
> > > > > > > > > >        if (hdev->vhost_ops->vhost_reset_status) {
> > > > > > > > > >            hdev->vhost_ops->vhost_reset_status(hdev);
> > > > > > > > > >            ^^^ reset device^^^
> > > > > > > > > >
> > > > > > > > > > I noticed the QEMU vDPA code resets the device in vhost_dev_stop()
> > > > > > > > > > ->
> > > > > > > > > > vhost_reset_status(). The device's migration code runs after
> > > > > > > > > > vhost_dev_stop() and the state will have been lost.
> > > > > > > > > >
> > > > > > > > > vhost_virtqueue_stop saves the vq state (indexes, vring base) in the
> > > > > > > > > qemu VirtIONet device model. This is for all vhost backends.
> > > > > > > > >
> > > > > > > > > Regarding the state like mac or mq configuration, SVQ runs for all the
> > > > > > > > > VM run in the CVQ. So it can track all of that status in the device
> > > > > > > > > model too.
> > > > > > > > >
> > > > > > > > > When a migration effectively occurs, all the frontend state is
> > > > > > > > > migrated as a regular emulated device. To route all of the state in a
> > > > > > > > > normalized way for qemu is what leaves open the possibility to do
> > > > > > > > > cross-backends migrations, etc.
> > > > > > > > >
> > > > > > > > > Does that answer your question?
> > > > > > > > I think you're confirming that changes would be necessary in order for
> > > > > > > > vDPA to support the save/load operation that Hanna is introducing.
> > > > > > > >
> > > > > > > Yes, this first iteration was centered on net, with an eye on block,
> > > > > > > where state can be routed through classical emulated devices. This is
> > > > > > > how vhost-kernel and vhost-user do classically. And it allows
> > > > > > > cross-backend, to not modify qemu migration state, etc.
> > > > > > >
> > > > > > > To introduce this opaque state to qemu, that must be fetched after the
> > > > > > > suspend and not before, requires changes in vhost protocol, as
> > > > > > > discussed previously.
> > > > > > >
> > > > > > > > > > It looks like vDPA changes are necessary in order to support
> > > > > > > > > > stateful
> > > > > > > > > > devices even though QEMU already uses SUSPEND. Is my understanding
> > > > > > > > > > correct?
> > > > > > > > > >
> > > > > > > > > Changes are required elsewhere, as the code to restore the state
> > > > > > > > > properly in the destination has not been merged.
> > > > > > > > I'm not sure what you mean by elsewhere?
> > > > > > > >
> > > > > > > I meant for vdpa *net* devices the changes are not required in vdpa
> > > > > > > ioctls, but mostly in qemu.
> > > > > > >
> > > > > > > If you meant stateful as "it must have a state blob that it must be
> > > > > > > opaque to qemu", then I think the straightforward action is to fetch
> > > > > > > state blob about the same time as vq indexes. But yes, changes (at
> > > > > > > least a new ioctl) is needed for that.
> > > > > > >
> > > > > > > > I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and
> > > > > > > > then VHOST_VDPA_SET_STATUS 0.
> > > > > > > >
> > > > > > > > In order to save device state from the vDPA device in the future, it
> > > > > > > > will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that
> > > > > > > > the device state can be saved before the device is reset.
> > > > > > > >
> > > > > > > > Does that sound right?
> > > > > > > >
> > > > > > > The split between suspend and reset was added recently for that very
> > > > > > > reason. In all the virtio devices, the frontend is initialized before
> > > > > > > the backend, so I don't think it is a good idea to defer the backend
> > > > > > > cleanup. Especially if we have already set the state is small enough
> > > > > > > to not needing iterative migration from virtiofsd point of view.
> > > > > > >
> > > > > > > If fetching that state at the same time as vq indexes is not valid,
> > > > > > > could it follow the same model as the "in-flight descriptors"?
> > > > > > > vhost-user follows them by using a shared memory region where their
> > > > > > > state is tracked [1]. This allows qemu to survive vhost-user SW
> > > > > > > backend crashes, and does not forbid the cross-backends live migration
> > > > > > > as all the information is there to recover them.
> > > > > > >
> > > > > > > For hw devices this is not convenient as it occupies PCI bandwidth. So
> > > > > > > a possibility is to synchronize this memory region after a
> > > > > > > synchronization point, being the SUSPEND call or GET_VRING_BASE. HW
> > > > > > > devices are not going to crash in the software sense, so all use cases
> > > > > > > remain the same to qemu. And that shared memory information is
> > > > > > > recoverable after vhost_dev_stop.
> > > > > > >
> > > > > > > Does that sound reasonable to virtiofsd? To offer a shared memory
> > > > > > > region where it dumps the state, maybe only after the
> > > > > > > set_state(STATE_PHASE_STOPPED)?
> > > > > >
> > > > > > I don’t think we need the set_state() call, necessarily, if SUSPEND is
> > > > > > mandatory anyway.
> > > > > >
> > > > > > As for the shared memory, the RFC before this series used shared memory,
> > > > > > so it’s possible, yes.  But “shared memory region” can mean a lot of
> > > > > > things – it sounds like you’re saying the back-end (virtiofsd) should
> > > > > > provide it to the front-end, is that right?  That could work like this:
> > > > > >
> > > > > > On the source side:
> > > > > >
> > > > > > S1. SUSPEND goes to virtiofsd
> > > > > > S2. virtiofsd maybe double-checks that the device is stopped, then
> > > > > > serializes its state into a newly allocated shared memory area[1]
> > > > > > S3. virtiofsd responds to SUSPEND
> > > > > > S4. front-end requests shared memory, virtiofsd responds with a handle,
> > > > > > maybe already closes its reference
> > > > > > S5. front-end saves state, closes its handle, freeing the SHM
> > > > > >
> > > > > > [1] Maybe virtiofsd can correctly size the serialized state’s size, then
> > > > > > it can immediately allocate this area and serialize directly into it;
> > > > > > maybe it can’t, then we’ll need a bounce buffer.  Not really a
> > > > > > fundamental problem, but there are limitations around what you can do
> > > > > > with serde implementations in Rust…
> > > > > >
> > > > > > On the destination side:
> > > > > >
> > > > > > D1. Optional SUSPEND goes to virtiofsd that hasn’t yet done much;
> > > > > > virtiofsd would serialize its empty state into an SHM area, and respond
> > > > > > to SUSPEND
> > > > > > D2. front-end reads state from migration stream into an SHM it has allocated
> > > > > > D3. front-end supplies this SHM to virtiofsd, which discards its
> > > > > > previous area, and now uses this one
> > > > > > D4. RESUME goes to virtiofsd, which deserializes the state from the SHM
> > > > > >
> > > > > > Couple of questions:
> > > > > >
> > > > > > A. Stefan suggested D1, but it does seem wasteful now.  But if SUSPEND
> > > > > > would imply to deserialize a state, and the state is to be transferred
> > > > > > through SHM, this is what would need to be done.  So maybe we should
> > > > > > skip SUSPEND on the destination?
> > > > > > B. You described that the back-end should supply the SHM, which works
> > > > > > well on the source.  On the destination, only the front-end knows how
> > > > > > big the state is, so I’ve decided above that it should allocate the SHM
> > > > > > (D2) and provide it to the back-end.  Is that feasible or is it
> > > > > > important (e.g. for real hardware) that the back-end supplies the SHM?
> > > > > > (In which case the front-end would need to tell the back-end how big the
> > > > > > state SHM needs to be.)
> > > > >
> > > > > How does this work for iterative live migration?
> > > > >
> > > >
> > > > A pipe will always fit better for iterative from qemu POV, that's for sure.
> > > > Especially if we want to keep that opaqueness.
> > > >
> > > > But  we will need to communicate with the HW device using shared memory sooner
> > > > or later for big states.  If we don't transform it in qemu, we will need to do
> > > > it in the kernel.  Also, the pipe will not support daemon crashes.
> > > >
> > > > Again I'm just putting this on the table, just in case it fits better or it is
> > > > convenient.  I missed the previous patch where SHM was proposed too, so maybe I
> > > > missed some feedback useful here.  I think the pipe is a better solution in the
> > > > long run because of the iterative part.
> > >
> > > Pipes and shared memory are conceptually equivalent for building
> > > streaming interfaces. It's just more complex to design a shared memory
> > > interface and it reinvents what file descriptors already offer.
> > >
> > > I have no doubt we could design iterative migration over a shared memory
> > > interface if we needed to, but I'm not sure why? When you mention
> > > hardware, are you suggesting defining a standard memory/register layout
> > > that hardware implements and mapping it to userspace (QEMU)?
> >
> > Right.
> >
> > > Is there a
> > > big advantage to exposing memory versus a file descriptor?
> > >
> >
> > For hardware it allows to retrieve and set the device state without
> > intervention of the kernel, saving context switches. For virtiofsd
> > this may not make a lot of sense, but I'm thinking on devices with big
> > states (virtio gpu, maybe?).
>
> A streaming interface implemented using shared memory involves consuming
> chunks of bytes. Each time data has been read, an action must be
> performed to notify the device and receive a notification when more data
> becomes available.
>
> That notification involves the kernel (e.g. an eventfd that is triggered
> by a hardware interrupt) and a read(2) syscall to reset the eventfd.
>
> Unless userspace disables notifications and polls (busy waits) the
> hardware registers, there is still going to be kernel involvement and a
> context switch. For this reason, I think that shared memory vs pipes
> will not be significantly different.
>

Yes, for big states that's right. I was thinking of not-so-big states,
where all of it can be asked in one shot, but it may be problematic
with iterative migration for sure. In that regard pipes are way
better.

> > For software it allows the backend to survive a crash, as the old
> > state can be set directly to a fresh backend instance.
>
> Can you explain by describing the steps involved?

It's how vhost-user inflight I/O tracking works [1]: QEMU and the
backend shares a memory region where the backend dump states
continuously. In the event of a crash, this state can be dumped
directly to a new vhost-user backend.

> Are you sure it can only be done with shared memory and not pipes?
>

Sorry for the confusion, but I never intended to say that :).

[1] https://qemu.readthedocs.io/en/latest/interop/vhost-user.html#inflight-i-o-tracking
Stefan Hajnoczi May 9, 2023, 5:33 p.m. UTC | #52
On Tue, 9 May 2023 at 11:35, Eugenio Perez Martin <eperezma@redhat.com> wrote:
>
> On Tue, May 9, 2023 at 5:09 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Tue, May 09, 2023 at 08:45:33AM +0200, Eugenio Perez Martin wrote:
> > > On Mon, May 8, 2023 at 10:10 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > >
> > > > On Thu, Apr 20, 2023 at 03:29:44PM +0200, Eugenio Pérez wrote:
> > > > > On Wed, 2023-04-19 at 07:21 -0400, Stefan Hajnoczi wrote:
> > > > > > On Wed, 19 Apr 2023 at 07:10, Hanna Czenczek <hreitz@redhat.com> wrote:
> > > > > > > On 18.04.23 09:54, Eugenio Perez Martin wrote:
> > > > > > > > On Mon, Apr 17, 2023 at 9:21 PM Stefan Hajnoczi <stefanha@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > On Mon, 17 Apr 2023 at 15:08, Eugenio Perez Martin <eperezma@redhat.com>
> > > > > > > > > wrote:
> > > > > > > > > > On Mon, Apr 17, 2023 at 7:14 PM Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > > > wrote:
> > > > > > > > > > > On Thu, Apr 13, 2023 at 12:14:24PM +0200, Eugenio Perez Martin
> > > > > > > > > > > wrote:
> > > > > > > > > > > > On Wed, Apr 12, 2023 at 11:06 PM Stefan Hajnoczi <
> > > > > > > > > > > > stefanha@redhat.com> wrote:
> > > > > > > > > > > > > On Tue, Apr 11, 2023 at 05:05:13PM +0200, Hanna Czenczek wrote:
> > > > > > > > > > > > > > So-called "internal" virtio-fs migration refers to
> > > > > > > > > > > > > > transporting the
> > > > > > > > > > > > > > back-end's (virtiofsd's) state through qemu's migration
> > > > > > > > > > > > > > stream.  To do
> > > > > > > > > > > > > > this, we need to be able to transfer virtiofsd's internal
> > > > > > > > > > > > > > state to and
> > > > > > > > > > > > > > from virtiofsd.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Because virtiofsd's internal state will not be too large, we
> > > > > > > > > > > > > > believe it
> > > > > > > > > > > > > > is best to transfer it as a single binary blob after the
> > > > > > > > > > > > > > streaming
> > > > > > > > > > > > > > phase.  Because this method should be useful to other vhost-
> > > > > > > > > > > > > > user
> > > > > > > > > > > > > > implementations, too, it is introduced as a general-purpose
> > > > > > > > > > > > > > addition to
> > > > > > > > > > > > > > the protocol, not limited to vhost-user-fs.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > These are the additions to the protocol:
> > > > > > > > > > > > > > - New vhost-user protocol feature
> > > > > > > > > > > > > > VHOST_USER_PROTOCOL_F_MIGRATORY_STATE:
> > > > > > > > > > > > > >    This feature signals support for transferring state, and is
> > > > > > > > > > > > > > added so
> > > > > > > > > > > > > >    that migration can fail early when the back-end has no
> > > > > > > > > > > > > > support.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > - SET_DEVICE_STATE_FD function: Front-end and back-end
> > > > > > > > > > > > > > negotiate a pipe
> > > > > > > > > > > > > >    over which to transfer the state.  The front-end sends an
> > > > > > > > > > > > > > FD to the
> > > > > > > > > > > > > >    back-end into/from which it can write/read its state, and
> > > > > > > > > > > > > > the back-end
> > > > > > > > > > > > > >    can decide to either use it, or reply with a different FD
> > > > > > > > > > > > > > for the
> > > > > > > > > > > > > >    front-end to override the front-end's choice.
> > > > > > > > > > > > > >    The front-end creates a simple pipe to transfer the state,
> > > > > > > > > > > > > > but maybe
> > > > > > > > > > > > > >    the back-end already has an FD into/from which it has to
> > > > > > > > > > > > > > write/read
> > > > > > > > > > > > > >    its state, in which case it will want to override the
> > > > > > > > > > > > > > simple pipe.
> > > > > > > > > > > > > >    Conversely, maybe in the future we find a way to have the
> > > > > > > > > > > > > > front-end
> > > > > > > > > > > > > >    get an immediate FD for the migration stream (in some
> > > > > > > > > > > > > > cases), in which
> > > > > > > > > > > > > >    case we will want to send this to the back-end instead of
> > > > > > > > > > > > > > creating a
> > > > > > > > > > > > > >    pipe.
> > > > > > > > > > > > > >    Hence the negotiation: If one side has a better idea than a
> > > > > > > > > > > > > > plain
> > > > > > > > > > > > > >    pipe, we will want to use that.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > - CHECK_DEVICE_STATE: After the state has been transferred
> > > > > > > > > > > > > > through the
> > > > > > > > > > > > > >    pipe (the end indicated by EOF), the front-end invokes this
> > > > > > > > > > > > > > function
> > > > > > > > > > > > > >    to verify success.  There is no in-band way (through the
> > > > > > > > > > > > > > pipe) to
> > > > > > > > > > > > > >    indicate failure, so we need to check explicitly.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Once the transfer pipe has been established via
> > > > > > > > > > > > > > SET_DEVICE_STATE_FD
> > > > > > > > > > > > > > (which includes establishing the direction of transfer and
> > > > > > > > > > > > > > migration
> > > > > > > > > > > > > > phase), the sending side writes its data into the pipe, and
> > > > > > > > > > > > > > the reading
> > > > > > > > > > > > > > side reads it until it sees an EOF.  Then, the front-end will
> > > > > > > > > > > > > > check for
> > > > > > > > > > > > > > success via CHECK_DEVICE_STATE, which on the destination side
> > > > > > > > > > > > > > includes
> > > > > > > > > > > > > > checking for integrity (i.e. errors during deserialization).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > > > > > > > > > > > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
> > > > > > > > > > > > > > ---
> > > > > > > > > > > > > >   include/hw/virtio/vhost-backend.h |  24 +++++
> > > > > > > > > > > > > >   include/hw/virtio/vhost.h         |  79 ++++++++++++++++
> > > > > > > > > > > > > >   hw/virtio/vhost-user.c            | 147
> > > > > > > > > > > > > > ++++++++++++++++++++++++++++++
> > > > > > > > > > > > > >   hw/virtio/vhost.c                 |  37 ++++++++
> > > > > > > > > > > > > >   4 files changed, 287 insertions(+)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > diff --git a/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > > > > b/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > > > > index ec3fbae58d..5935b32fe3 100644
> > > > > > > > > > > > > > --- a/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > > > > +++ b/include/hw/virtio/vhost-backend.h
> > > > > > > > > > > > > > @@ -26,6 +26,18 @@ typedef enum VhostSetConfigType {
> > > > > > > > > > > > > >       VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
> > > > > > > > > > > > > >   } VhostSetConfigType;
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > +typedef enum VhostDeviceStateDirection {
> > > > > > > > > > > > > > +    /* Transfer state from back-end (device) to front-end */
> > > > > > > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
> > > > > > > > > > > > > > +    /* Transfer state from front-end to back-end (device) */
> > > > > > > > > > > > > > +    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
> > > > > > > > > > > > > > +} VhostDeviceStateDirection;
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > +typedef enum VhostDeviceStatePhase {
> > > > > > > > > > > > > > +    /* The device (and all its vrings) is stopped */
> > > > > > > > > > > > > > +    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
> > > > > > > > > > > > > > +} VhostDeviceStatePhase;
> > > > > > > > > > > > > vDPA has:
> > > > > > > > > > > > >
> > > > > > > > > > > > >    /* Suspend a device so it does not process virtqueue requests
> > > > > > > > > > > > > anymore
> > > > > > > > > > > > >     *
> > > > > > > > > > > > >     * After the return of ioctl the device must preserve all the
> > > > > > > > > > > > > necessary state
> > > > > > > > > > > > >     * (the virtqueue vring base plus the possible device
> > > > > > > > > > > > > specific states) that is
> > > > > > > > > > > > >     * required for restoring in the future. The device must not
> > > > > > > > > > > > > change its
> > > > > > > > > > > > >     * configuration after that point.
> > > > > > > > > > > > >     */
> > > > > > > > > > > > >    #define VHOST_VDPA_SUSPEND      _IO(VHOST_VIRTIO, 0x7D)
> > > > > > > > > > > > >
> > > > > > > > > > > > >    /* Resume a device so it can resume processing virtqueue
> > > > > > > > > > > > > requests
> > > > > > > > > > > > >     *
> > > > > > > > > > > > >     * After the return of this ioctl the device will have
> > > > > > > > > > > > > restored all the
> > > > > > > > > > > > >     * necessary states and it is fully operational to continue
> > > > > > > > > > > > > processing the
> > > > > > > > > > > > >     * virtqueue descriptors.
> > > > > > > > > > > > >     */
> > > > > > > > > > > > >    #define VHOST_VDPA_RESUME       _IO(VHOST_VIRTIO, 0x7E)
> > > > > > > > > > > > >
> > > > > > > > > > > > > I wonder if it makes sense to import these into vhost-user so
> > > > > > > > > > > > > that the
> > > > > > > > > > > > > difference between kernel vhost and vhost-user is minimized.
> > > > > > > > > > > > > It's okay
> > > > > > > > > > > > > if one of them is ahead of the other, but it would be nice to
> > > > > > > > > > > > > avoid
> > > > > > > > > > > > > overlapping/duplicated functionality.
> > > > > > > > > > > > >
> > > > > > > > > > > > That's what I had in mind in the first versions. I proposed
> > > > > > > > > > > > VHOST_STOP
> > > > > > > > > > > > instead of VHOST_VDPA_STOP for this very reason. Later it did
> > > > > > > > > > > > change
> > > > > > > > > > > > to SUSPEND.
> > > > > > > > > > > >
> > > > > > > > > > > > Generally it is better if we make the interface less parametrized
> > > > > > > > > > > > and
> > > > > > > > > > > > we trust in the messages and its semantics in my opinion. In other
> > > > > > > > > > > > words, instead of
> > > > > > > > > > > > vhost_set_device_state_fd_op(VHOST_TRANSFER_STATE_PHASE_STOPPED),
> > > > > > > > > > > > send
> > > > > > > > > > > > individually the equivalent of VHOST_VDPA_SUSPEND vhost-user
> > > > > > > > > > > > command.
> > > > > > > > > > > >
> > > > > > > > > > > > Another way to apply this is with the "direction" parameter. Maybe
> > > > > > > > > > > > it
> > > > > > > > > > > > is better to split it into "set_state_fd" and "get_state_fd"?
> > > > > > > > > > > >
> > > > > > > > > > > > In that case, reusing the ioctls as vhost-user messages would be
> > > > > > > > > > > > ok.
> > > > > > > > > > > > But that puts this proposal further from the VFIO code, which uses
> > > > > > > > > > > > "migration_set_state(state)", and maybe it is better when the
> > > > > > > > > > > > number
> > > > > > > > > > > > of states is high.
> > > > > > > > > > > Hi Eugenio,
> > > > > > > > > > > Another question about vDPA suspend/resume:
> > > > > > > > > > >
> > > > > > > > > > >    /* Host notifiers must be enabled at this point. */
> > > > > > > > > > >    void vhost_dev_stop(struct vhost_dev *hdev, VirtIODevice *vdev,
> > > > > > > > > > > bool vrings)
> > > > > > > > > > >    {
> > > > > > > > > > >        int i;
> > > > > > > > > > >
> > > > > > > > > > >        /* should only be called after backend is connected */
> > > > > > > > > > >        assert(hdev->vhost_ops);
> > > > > > > > > > >        event_notifier_test_and_clear(
> > > > > > > > > > >            &hdev-
> > > > > > > > > > > >vqs[VHOST_QUEUE_NUM_CONFIG_INR].masked_config_notifier);
> > > > > > > > > > >        event_notifier_test_and_clear(&vdev->config_notifier);
> > > > > > > > > > >
> > > > > > > > > > >        trace_vhost_dev_stop(hdev, vdev->name, vrings);
> > > > > > > > > > >
> > > > > > > > > > >        if (hdev->vhost_ops->vhost_dev_start) {
> > > > > > > > > > >            hdev->vhost_ops->vhost_dev_start(hdev, false);
> > > > > > > > > > >            ^^^ SUSPEND ^^^
> > > > > > > > > > >        }
> > > > > > > > > > >        if (vrings) {
> > > > > > > > > > >            vhost_dev_set_vring_enable(hdev, false);
> > > > > > > > > > >        }
> > > > > > > > > > >        for (i = 0; i < hdev->nvqs; ++i) {
> > > > > > > > > > >            vhost_virtqueue_stop(hdev,
> > > > > > > > > > >                                 vdev,
> > > > > > > > > > >                                 hdev->vqs + i,
> > > > > > > > > > >                                 hdev->vq_index + i);
> > > > > > > > > > >          ^^^ fetch virtqueue state from kernel ^^^
> > > > > > > > > > >        }
> > > > > > > > > > >        if (hdev->vhost_ops->vhost_reset_status) {
> > > > > > > > > > >            hdev->vhost_ops->vhost_reset_status(hdev);
> > > > > > > > > > >            ^^^ reset device^^^
> > > > > > > > > > >
> > > > > > > > > > > I noticed the QEMU vDPA code resets the device in vhost_dev_stop()
> > > > > > > > > > > ->
> > > > > > > > > > > vhost_reset_status(). The device's migration code runs after
> > > > > > > > > > > vhost_dev_stop() and the state will have been lost.
> > > > > > > > > > >
> > > > > > > > > > vhost_virtqueue_stop saves the vq state (indexes, vring base) in the
> > > > > > > > > > qemu VirtIONet device model. This is for all vhost backends.
> > > > > > > > > >
> > > > > > > > > > Regarding the state like mac or mq configuration, SVQ runs for all the
> > > > > > > > > > VM run in the CVQ. So it can track all of that status in the device
> > > > > > > > > > model too.
> > > > > > > > > >
> > > > > > > > > > When a migration effectively occurs, all the frontend state is
> > > > > > > > > > migrated as a regular emulated device. To route all of the state in a
> > > > > > > > > > normalized way for qemu is what leaves open the possibility to do
> > > > > > > > > > cross-backends migrations, etc.
> > > > > > > > > >
> > > > > > > > > > Does that answer your question?
> > > > > > > > > I think you're confirming that changes would be necessary in order for
> > > > > > > > > vDPA to support the save/load operation that Hanna is introducing.
> > > > > > > > >
> > > > > > > > Yes, this first iteration was centered on net, with an eye on block,
> > > > > > > > where state can be routed through classical emulated devices. This is
> > > > > > > > how vhost-kernel and vhost-user do classically. And it allows
> > > > > > > > cross-backend, to not modify qemu migration state, etc.
> > > > > > > >
> > > > > > > > To introduce this opaque state to qemu, that must be fetched after the
> > > > > > > > suspend and not before, requires changes in vhost protocol, as
> > > > > > > > discussed previously.
> > > > > > > >
> > > > > > > > > > > It looks like vDPA changes are necessary in order to support
> > > > > > > > > > > stateful
> > > > > > > > > > > devices even though QEMU already uses SUSPEND. Is my understanding
> > > > > > > > > > > correct?
> > > > > > > > > > >
> > > > > > > > > > Changes are required elsewhere, as the code to restore the state
> > > > > > > > > > properly in the destination has not been merged.
> > > > > > > > > I'm not sure what you mean by elsewhere?
> > > > > > > > >
> > > > > > > > I meant for vdpa *net* devices the changes are not required in vdpa
> > > > > > > > ioctls, but mostly in qemu.
> > > > > > > >
> > > > > > > > If you meant stateful as "it must have a state blob that it must be
> > > > > > > > opaque to qemu", then I think the straightforward action is to fetch
> > > > > > > > state blob about the same time as vq indexes. But yes, changes (at
> > > > > > > > least a new ioctl) is needed for that.
> > > > > > > >
> > > > > > > > > I'm asking about vDPA ioctls. Right now the sequence is SUSPEND and
> > > > > > > > > then VHOST_VDPA_SET_STATUS 0.
> > > > > > > > >
> > > > > > > > > In order to save device state from the vDPA device in the future, it
> > > > > > > > > will be necessary to defer the VHOST_VDPA_SET_STATUS 0 call so that
> > > > > > > > > the device state can be saved before the device is reset.
> > > > > > > > >
> > > > > > > > > Does that sound right?
> > > > > > > > >
> > > > > > > > The split between suspend and reset was added recently for that very
> > > > > > > > reason. In all the virtio devices, the frontend is initialized before
> > > > > > > > the backend, so I don't think it is a good idea to defer the backend
> > > > > > > > cleanup. Especially if we have already set the state is small enough
> > > > > > > > to not needing iterative migration from virtiofsd point of view.
> > > > > > > >
> > > > > > > > If fetching that state at the same time as vq indexes is not valid,
> > > > > > > > could it follow the same model as the "in-flight descriptors"?
> > > > > > > > vhost-user follows them by using a shared memory region where their
> > > > > > > > state is tracked [1]. This allows qemu to survive vhost-user SW
> > > > > > > > backend crashes, and does not forbid the cross-backends live migration
> > > > > > > > as all the information is there to recover them.
> > > > > > > >
> > > > > > > > For hw devices this is not convenient as it occupies PCI bandwidth. So
> > > > > > > > a possibility is to synchronize this memory region after a
> > > > > > > > synchronization point, being the SUSPEND call or GET_VRING_BASE. HW
> > > > > > > > devices are not going to crash in the software sense, so all use cases
> > > > > > > > remain the same to qemu. And that shared memory information is
> > > > > > > > recoverable after vhost_dev_stop.
> > > > > > > >
> > > > > > > > Does that sound reasonable to virtiofsd? To offer a shared memory
> > > > > > > > region where it dumps the state, maybe only after the
> > > > > > > > set_state(STATE_PHASE_STOPPED)?
> > > > > > >
> > > > > > > I don’t think we need the set_state() call, necessarily, if SUSPEND is
> > > > > > > mandatory anyway.
> > > > > > >
> > > > > > > As for the shared memory, the RFC before this series used shared memory,
> > > > > > > so it’s possible, yes.  But “shared memory region” can mean a lot of
> > > > > > > things – it sounds like you’re saying the back-end (virtiofsd) should
> > > > > > > provide it to the front-end, is that right?  That could work like this:
> > > > > > >
> > > > > > > On the source side:
> > > > > > >
> > > > > > > S1. SUSPEND goes to virtiofsd
> > > > > > > S2. virtiofsd maybe double-checks that the device is stopped, then
> > > > > > > serializes its state into a newly allocated shared memory area[1]
> > > > > > > S3. virtiofsd responds to SUSPEND
> > > > > > > S4. front-end requests shared memory, virtiofsd responds with a handle,
> > > > > > > maybe already closes its reference
> > > > > > > S5. front-end saves state, closes its handle, freeing the SHM
> > > > > > >
> > > > > > > [1] Maybe virtiofsd can correctly size the serialized state’s size, then
> > > > > > > it can immediately allocate this area and serialize directly into it;
> > > > > > > maybe it can’t, then we’ll need a bounce buffer.  Not really a
> > > > > > > fundamental problem, but there are limitations around what you can do
> > > > > > > with serde implementations in Rust…
> > > > > > >
> > > > > > > On the destination side:
> > > > > > >
> > > > > > > D1. Optional SUSPEND goes to virtiofsd that hasn’t yet done much;
> > > > > > > virtiofsd would serialize its empty state into an SHM area, and respond
> > > > > > > to SUSPEND
> > > > > > > D2. front-end reads state from migration stream into an SHM it has allocated
> > > > > > > D3. front-end supplies this SHM to virtiofsd, which discards its
> > > > > > > previous area, and now uses this one
> > > > > > > D4. RESUME goes to virtiofsd, which deserializes the state from the SHM
> > > > > > >
> > > > > > > Couple of questions:
> > > > > > >
> > > > > > > A. Stefan suggested D1, but it does seem wasteful now.  But if SUSPEND
> > > > > > > would imply to deserialize a state, and the state is to be transferred
> > > > > > > through SHM, this is what would need to be done.  So maybe we should
> > > > > > > skip SUSPEND on the destination?
> > > > > > > B. You described that the back-end should supply the SHM, which works
> > > > > > > well on the source.  On the destination, only the front-end knows how
> > > > > > > big the state is, so I’ve decided above that it should allocate the SHM
> > > > > > > (D2) and provide it to the back-end.  Is that feasible or is it
> > > > > > > important (e.g. for real hardware) that the back-end supplies the SHM?
> > > > > > > (In which case the front-end would need to tell the back-end how big the
> > > > > > > state SHM needs to be.)
> > > > > >
> > > > > > How does this work for iterative live migration?
> > > > > >
> > > > >
> > > > > A pipe will always fit better for iterative from qemu POV, that's for sure.
> > > > > Especially if we want to keep that opaqueness.
> > > > >
> > > > > But  we will need to communicate with the HW device using shared memory sooner
> > > > > or later for big states.  If we don't transform it in qemu, we will need to do
> > > > > it in the kernel.  Also, the pipe will not support daemon crashes.
> > > > >
> > > > > Again I'm just putting this on the table, just in case it fits better or it is
> > > > > convenient.  I missed the previous patch where SHM was proposed too, so maybe I
> > > > > missed some feedback useful here.  I think the pipe is a better solution in the
> > > > > long run because of the iterative part.
> > > >
> > > > Pipes and shared memory are conceptually equivalent for building
> > > > streaming interfaces. It's just more complex to design a shared memory
> > > > interface and it reinvents what file descriptors already offer.
> > > >
> > > > I have no doubt we could design iterative migration over a shared memory
> > > > interface if we needed to, but I'm not sure why? When you mention
> > > > hardware, are you suggesting defining a standard memory/register layout
> > > > that hardware implements and mapping it to userspace (QEMU)?
> > >
> > > Right.
> > >
> > > > Is there a
> > > > big advantage to exposing memory versus a file descriptor?
> > > >
> > >
> > > For hardware it allows to retrieve and set the device state without
> > > intervention of the kernel, saving context switches. For virtiofsd
> > > this may not make a lot of sense, but I'm thinking on devices with big
> > > states (virtio gpu, maybe?).
> >
> > A streaming interface implemented using shared memory involves consuming
> > chunks of bytes. Each time data has been read, an action must be
> > performed to notify the device and receive a notification when more data
> > becomes available.
> >
> > That notification involves the kernel (e.g. an eventfd that is triggered
> > by a hardware interrupt) and a read(2) syscall to reset the eventfd.
> >
> > Unless userspace disables notifications and polls (busy waits) the
> > hardware registers, there is still going to be kernel involvement and a
> > context switch. For this reason, I think that shared memory vs pipes
> > will not be significantly different.
> >
>
> Yes, for big states that's right. I was thinking of not-so-big states,
> where all of it can be asked in one shot, but it may be problematic
> with iterative migration for sure. In that regard pipes are way
> better.
>
> > > For software it allows the backend to survive a crash, as the old
> > > state can be set directly to a fresh backend instance.
> >
> > Can you explain by describing the steps involved?
>
> It's how vhost-user inflight I/O tracking works [1]: QEMU and the
> backend shares a memory region where the backend dump states
> continuously. In the event of a crash, this state can be dumped
> directly to a new vhost-user backend.

Neither shared memory nor INFLIGHT_FD are required for crash recovery
because the backend can stash state elsewhere, like tmpfs or systemd's
FDSTORE=1 (https://www.freedesktop.org/software/systemd/man/sd_pid_notify_with_fds.html).
INFLIGHT_FD is just a mechanism to stash an fd (only the backend
interprets the contents of the fd and the frontend doesn't even know
whether the fd is shared memory or another type of file).

I think crash recovery is orthogonal to this discussion because we're
talking about a streaming interface. A streaming interface breaks when
a crash occurs (regardless of whether it's implemented via shared
memory or pipes) as it involves two entities coordinating with each
other. If an entity goes away then the stream is incomplete and cannot
be used for crash recovery. I guess you're thinking of an fd that
contains the full state of the device. That fd could be handed to the
backend after reconnection for crash recovery, but a streaming
interface doesn't support that.

I guess your bringing up the idea of having the full device state
always up-to-date for crash recovery purposes? I think crash recovery
should be optional since it's complex and hard to test while many
(most?) backends don't implement it. It is likely that using the crash
recovery state for live migration is going to be even trickier because
live migration has additional requirements (e.g. compatibility). My
feeling is that it's too hard to satisfy both live migration and crash
recovery requirements for all vhost-user device types, but if you have
concrete ideas then let's discuss them.

Stefan
diff mbox series

Patch

diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
index ec3fbae58d..5935b32fe3 100644
--- a/include/hw/virtio/vhost-backend.h
+++ b/include/hw/virtio/vhost-backend.h
@@ -26,6 +26,18 @@  typedef enum VhostSetConfigType {
     VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
 } VhostSetConfigType;
 
+typedef enum VhostDeviceStateDirection {
+    /* Transfer state from back-end (device) to front-end */
+    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0,
+    /* Transfer state from front-end to back-end (device) */
+    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1,
+} VhostDeviceStateDirection;
+
+typedef enum VhostDeviceStatePhase {
+    /* The device (and all its vrings) is stopped */
+    VHOST_TRANSFER_STATE_PHASE_STOPPED = 0,
+} VhostDeviceStatePhase;
+
 struct vhost_inflight;
 struct vhost_dev;
 struct vhost_log;
@@ -133,6 +145,15 @@  typedef int (*vhost_set_config_call_op)(struct vhost_dev *dev,
 
 typedef void (*vhost_reset_status_op)(struct vhost_dev *dev);
 
+typedef bool (*vhost_supports_migratory_state_op)(struct vhost_dev *dev);
+typedef int (*vhost_set_device_state_fd_op)(struct vhost_dev *dev,
+                                            VhostDeviceStateDirection direction,
+                                            VhostDeviceStatePhase phase,
+                                            int fd,
+                                            int *reply_fd,
+                                            Error **errp);
+typedef int (*vhost_check_device_state_op)(struct vhost_dev *dev, Error **errp);
+
 typedef struct VhostOps {
     VhostBackendType backend_type;
     vhost_backend_init vhost_backend_init;
@@ -181,6 +202,9 @@  typedef struct VhostOps {
     vhost_force_iommu_op vhost_force_iommu;
     vhost_set_config_call_op vhost_set_config_call;
     vhost_reset_status_op vhost_reset_status;
+    vhost_supports_migratory_state_op vhost_supports_migratory_state;
+    vhost_set_device_state_fd_op vhost_set_device_state_fd;
+    vhost_check_device_state_op vhost_check_device_state;
 } VhostOps;
 
 int vhost_backend_update_device_iotlb(struct vhost_dev *dev,
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index 2fe02ed5d4..29449e0fe2 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -346,4 +346,83 @@  int vhost_dev_set_inflight(struct vhost_dev *dev,
                            struct vhost_inflight *inflight);
 int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size,
                            struct vhost_inflight *inflight);
+
+/**
+ * vhost_supports_migratory_state(): Checks whether the back-end
+ * supports transferring internal state for the purpose of migration.
+ * Support for this feature is required for vhost_set_device_state_fd()
+ * and vhost_check_device_state().
+ *
+ * @dev: The vhost device
+ *
+ * Returns true if the device supports these commands, and false if it
+ * does not.
+ */
+bool vhost_supports_migratory_state(struct vhost_dev *dev);
+
+/**
+ * vhost_set_device_state_fd(): Begin transfer of internal state from/to
+ * the back-end for the purpose of migration.  Data is to be transferred
+ * over a pipe according to @direction and @phase.  The sending end must
+ * only write to the pipe, and the receiving end must only read from it.
+ * Once the sending end is done, it closes its FD.  The receiving end
+ * must take this as the end-of-transfer signal and close its FD, too.
+ *
+ * @fd is the back-end's end of the pipe: The write FD for SAVE, and the
+ * read FD for LOAD.  This function transfers ownership of @fd to the
+ * back-end, i.e. closes it in the front-end.
+ *
+ * The back-end may optionally reply with an FD of its own, if this
+ * improves efficiency on its end.  In this case, the returned FD is
+ * stored in *reply_fd.  The back-end will discard the FD sent to it,
+ * and the front-end must use *reply_fd for transferring state to/from
+ * the back-end.
+ *
+ * @dev: The vhost device
+ * @direction: The direction in which the state is to be transferred.
+ *             For outgoing migrations, this is SAVE, and data is read
+ *             from the back-end and stored by the front-end in the
+ *             migration stream.
+ *             For incoming migrations, this is LOAD, and data is read
+ *             by the front-end from the migration stream and sent to
+ *             the back-end to restore the saved state.
+ * @phase: Which migration phase we are in.  Currently, there is only
+ *         STOPPED (device and all vrings are stopped), in the future,
+ *         more phases such as PRE_COPY or POST_COPY may be added.
+ * @fd: Back-end's end of the pipe through which to transfer state; note
+ *      that ownership is transferred to the back-end, so this function
+ *      closes @fd in the front-end.
+ * @reply_fd: If the back-end wishes to use a different pipe for state
+ *            transfer, this will contain an FD for the front-end to
+ *            use.  Otherwise, -1 is stored here.
+ * @errp: Potential error description
+ *
+ * Returns 0 on success, and -errno on failure.
+ */
+int vhost_set_device_state_fd(struct vhost_dev *dev,
+                              VhostDeviceStateDirection direction,
+                              VhostDeviceStatePhase phase,
+                              int fd,
+                              int *reply_fd,
+                              Error **errp);
+
+/**
+ * vhost_set_device_state_fd(): After transferring state from/to the
+ * back-end via vhost_set_device_state_fd(), i.e. once the sending end
+ * has closed the pipe, inquire the back-end to report any potential
+ * errors that have occurred on its side.  This allows to sense errors
+ * like:
+ * - During outgoing migration, when the source side had already started
+ *   to produce its state, something went wrong and it failed to finish
+ * - During incoming migration, when the received state is somehow
+ *   invalid and cannot be processed by the back-end
+ *
+ * @dev: The vhost device
+ * @errp: Potential error description
+ *
+ * Returns 0 when the back-end reports successful state transfer and
+ * processing, and -errno when an error occurred somewhere.
+ */
+int vhost_check_device_state(struct vhost_dev *dev, Error **errp);
+
 #endif
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index e5285df4ba..93d8f2494a 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -83,6 +83,7 @@  enum VhostUserProtocolFeature {
     /* Feature 14 reserved for VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS. */
     VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS = 15,
     VHOST_USER_PROTOCOL_F_STATUS = 16,
+    VHOST_USER_PROTOCOL_F_MIGRATORY_STATE = 17,
     VHOST_USER_PROTOCOL_F_MAX
 };
 
@@ -130,6 +131,8 @@  typedef enum VhostUserRequest {
     VHOST_USER_REM_MEM_REG = 38,
     VHOST_USER_SET_STATUS = 39,
     VHOST_USER_GET_STATUS = 40,
+    VHOST_USER_SET_DEVICE_STATE_FD = 41,
+    VHOST_USER_CHECK_DEVICE_STATE = 42,
     VHOST_USER_MAX
 } VhostUserRequest;
 
@@ -210,6 +213,12 @@  typedef struct {
     uint32_t size; /* the following payload size */
 } QEMU_PACKED VhostUserHeader;
 
+/* Request payload of VHOST_USER_SET_DEVICE_STATE_FD */
+typedef struct VhostUserTransferDeviceState {
+    uint32_t direction;
+    uint32_t phase;
+} VhostUserTransferDeviceState;
+
 typedef union {
 #define VHOST_USER_VRING_IDX_MASK   (0xff)
 #define VHOST_USER_VRING_NOFD_MASK  (0x1 << 8)
@@ -224,6 +233,7 @@  typedef union {
         VhostUserCryptoSession session;
         VhostUserVringArea area;
         VhostUserInflight inflight;
+        VhostUserTransferDeviceState transfer_state;
 } VhostUserPayload;
 
 typedef struct VhostUserMsg {
@@ -2681,6 +2691,140 @@  static int vhost_user_dev_start(struct vhost_dev *dev, bool started)
     }
 }
 
+static bool vhost_user_supports_migratory_state(struct vhost_dev *dev)
+{
+    return virtio_has_feature(dev->protocol_features,
+                              VHOST_USER_PROTOCOL_F_MIGRATORY_STATE);
+}
+
+static int vhost_user_set_device_state_fd(struct vhost_dev *dev,
+                                          VhostDeviceStateDirection direction,
+                                          VhostDeviceStatePhase phase,
+                                          int fd,
+                                          int *reply_fd,
+                                          Error **errp)
+{
+    int ret;
+    struct vhost_user *vu = dev->opaque;
+    VhostUserMsg msg = {
+        .hdr = {
+            .request = VHOST_USER_SET_DEVICE_STATE_FD,
+            .flags = VHOST_USER_VERSION,
+            .size = sizeof(msg.payload.transfer_state),
+        },
+        .payload.transfer_state = {
+            .direction = direction,
+            .phase = phase,
+        },
+    };
+
+    *reply_fd = -1;
+
+    if (!vhost_user_supports_migratory_state(dev)) {
+        close(fd);
+        error_setg(errp, "Back-end does not support migration state transfer");
+        return -ENOTSUP;
+    }
+
+    ret = vhost_user_write(dev, &msg, &fd, 1);
+    close(fd);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret,
+                         "Failed to send SET_DEVICE_STATE_FD message");
+        return ret;
+    }
+
+    ret = vhost_user_read(dev, &msg);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret,
+                         "Failed to receive SET_DEVICE_STATE_FD reply");
+        return ret;
+    }
+
+    if (msg.hdr.request != VHOST_USER_SET_DEVICE_STATE_FD) {
+        error_setg(errp,
+                   "Received unexpected message type, expected %d, received %d",
+                   VHOST_USER_SET_DEVICE_STATE_FD, msg.hdr.request);
+        return -EPROTO;
+    }
+
+    if (msg.hdr.size != sizeof(msg.payload.u64)) {
+        error_setg(errp,
+                   "Received bad message size, expected %zu, received %" PRIu32,
+                   sizeof(msg.payload.u64), msg.hdr.size);
+        return -EPROTO;
+    }
+
+    if ((msg.payload.u64 & 0xff) != 0) {
+        error_setg(errp, "Back-end did not accept migration state transfer");
+        return -EIO;
+    }
+
+    if (!(msg.payload.u64 & VHOST_USER_VRING_NOFD_MASK)) {
+        *reply_fd = qemu_chr_fe_get_msgfd(vu->user->chr);
+        if (*reply_fd < 0) {
+            error_setg(errp,
+                       "Failed to get back-end-provided transfer pipe FD");
+            *reply_fd = -1;
+            return -EIO;
+        }
+    }
+
+    return 0;
+}
+
+static int vhost_user_check_device_state(struct vhost_dev *dev, Error **errp)
+{
+    int ret;
+    VhostUserMsg msg = {
+        .hdr = {
+            .request = VHOST_USER_CHECK_DEVICE_STATE,
+            .flags = VHOST_USER_VERSION,
+            .size = 0,
+        },
+    };
+
+    if (!vhost_user_supports_migratory_state(dev)) {
+        error_setg(errp, "Back-end does not support migration state transfer");
+        return -ENOTSUP;
+    }
+
+    ret = vhost_user_write(dev, &msg, NULL, 0);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret,
+                         "Failed to send CHECK_DEVICE_STATE message");
+        return ret;
+    }
+
+    ret = vhost_user_read(dev, &msg);
+    if (ret < 0) {
+        error_setg_errno(errp, -ret,
+                         "Failed to receive CHECK_DEVICE_STATE reply");
+        return ret;
+    }
+
+    if (msg.hdr.request != VHOST_USER_CHECK_DEVICE_STATE) {
+        error_setg(errp,
+                   "Received unexpected message type, expected %d, received %d",
+                   VHOST_USER_CHECK_DEVICE_STATE, msg.hdr.request);
+        return -EPROTO;
+    }
+
+    if (msg.hdr.size != sizeof(msg.payload.u64)) {
+        error_setg(errp,
+                   "Received bad message size, expected %zu, received %" PRIu32,
+                   sizeof(msg.payload.u64), msg.hdr.size);
+        return -EPROTO;
+    }
+
+    if (msg.payload.u64 != 0) {
+        error_setg(errp, "Back-end failed to process its internal state");
+        return -EIO;
+    }
+
+    return 0;
+}
+
 const VhostOps user_ops = {
         .backend_type = VHOST_BACKEND_TYPE_USER,
         .vhost_backend_init = vhost_user_backend_init,
@@ -2716,4 +2860,7 @@  const VhostOps user_ops = {
         .vhost_get_inflight_fd = vhost_user_get_inflight_fd,
         .vhost_set_inflight_fd = vhost_user_set_inflight_fd,
         .vhost_dev_start = vhost_user_dev_start,
+        .vhost_supports_migratory_state = vhost_user_supports_migratory_state,
+        .vhost_set_device_state_fd = vhost_user_set_device_state_fd,
+        .vhost_check_device_state = vhost_user_check_device_state,
 };
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index cbff589efa..90099d8f6a 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -2088,3 +2088,40 @@  int vhost_net_set_backend(struct vhost_dev *hdev,
 
     return -ENOSYS;
 }
+
+bool vhost_supports_migratory_state(struct vhost_dev *dev)
+{
+    if (dev->vhost_ops->vhost_supports_migratory_state) {
+        return dev->vhost_ops->vhost_supports_migratory_state(dev);
+    }
+
+    return false;
+}
+
+int vhost_set_device_state_fd(struct vhost_dev *dev,
+                              VhostDeviceStateDirection direction,
+                              VhostDeviceStatePhase phase,
+                              int fd,
+                              int *reply_fd,
+                              Error **errp)
+{
+    if (dev->vhost_ops->vhost_set_device_state_fd) {
+        return dev->vhost_ops->vhost_set_device_state_fd(dev, direction, phase,
+                                                         fd, reply_fd, errp);
+    }
+
+    error_setg(errp,
+               "vhost transport does not support migration state transfer");
+    return -ENOSYS;
+}
+
+int vhost_check_device_state(struct vhost_dev *dev, Error **errp)
+{
+    if (dev->vhost_ops->vhost_check_device_state) {
+        return dev->vhost_ops->vhost_check_device_state(dev, errp);
+    }
+
+    error_setg(errp,
+               "vhost transport does not support migration state transfer");
+    return -ENOSYS;
+}