Message ID | 20230313174833.28790-2-hreitz@redhat.com (mailing list archive)
---|---
State | New, archived
Series | vhost-user-fs: Stateful migration
On Mon, Mar 13, 2023 at 06:48:32PM +0100, Hanna Czenczek wrote:
> Add a virtio-fs-specific vhost-user interface to facilitate migrating
> back-end-internal state.  We plan to migrate the internal state simply

Luckily the interface does not need to be virtiofs-specific since it
only transfers opaque data. Any stateful device can use this for
migration. Please make it generic both at the vhost-user protocol
message level and at the QEMU vhost API level.

> as a binary blob after the streaming phase, so all we need is a way to
> transfer such a blob from and to the back-end.  We do so by using a
> dedicated area of shared memory through which the blob is transferred in
> chunks.

Keeping the migration data transfer separate from the vhost-user UNIX
domain socket is a good idea since the amount of data could be large and
may congest the UNIX domain socket. The shared memory interface solves
this.

Where I get lost is why it needs to be shared memory instead of simply
an fd? On the source, the front-end could read the fd until EOF and
transfer the opaque data. On the destination, the front-end could write
to the fd and then close it. I think that would be simpler than the
shared memory interface and could potentially support zero-copy via
splice(2) (QEMU doesn't need to look at the data being transferred!).

Here is an outline of an fd-based interface:

- SET_DEVICE_STATE_FD: The front-end passes a file descriptor for
  transferring device state.

  The @direction argument:
  - SAVE: the back-end transfers an outgoing device state over the fd.
  - LOAD: the back-end transfers an incoming device state over the fd.

  The @phase argument:
  - STOPPED: the device is stopped.
  - PRE_COPY: reserved for future use.
  - POST_COPY: reserved for future use.

The back-end transfers data over the fd according to @direction and
@phase upon receiving the SET_DEVICE_STATE_FD message.

There are loose ends like how the message interacts with the virtqueue
enabled state, what happens if multiple SET_DEVICE_STATE_FD messages are
sent, etc. I have ignored them for now.

What I wanted to mention about the fd-based interface is:

- It's just one message. The I/O activity happens via the fd and does
  not involve GET_STATE/SET_STATE messages over the vhost-user domain
  socket.

- Buffer management is up to the front-end and back-end implementations
  and a bit simpler than the shared memory interface.

Did you choose the shared memory approach because it has certain
advantages?

>
> This patch adds the following vhost operations (and implements them for
> vhost-user):
>
> - FS_SET_STATE_FD: The front-end passes a dedicated shared memory area
>   to the back-end.  This area will be used to transfer state via the
>   other two operations.
>   (After the transfer FS_SET_STATE_FD detaches the shared memory area
>   again.)
>
> - FS_GET_STATE: The front-end asks the back-end to place a chunk of
>   internal state into the shared memory area.
>
> - FS_SET_STATE: The front-end puts a chunk of internal state into the
>   shared memory area, and asks the back-end to fetch it.
>
> On the source side, the back-end is expected to serialize its internal
> state either when FS_SET_STATE_FD is invoked, or when FS_GET_STATE is
> invoked the first time.  On subsequent FS_GET_STATE calls, it memcpy()s
> parts of that serialized state into the shared memory area.
>
> On the destination side, the back-end is expected to collect the state
> blob over all FS_SET_STATE calls, and then deserialize and apply it once
> FS_SET_STATE_FD detaches the shared memory area.
What is the rationale for waiting to receive the entire incoming state before parsing it rather than parsing it in a streaming fashion? Can this be left as an implementation detail of the vhost-user back-end so that there's freedom in choosing either approach? > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > --- > include/hw/virtio/vhost-backend.h | 9 ++ > include/hw/virtio/vhost.h | 68 +++++++++++++++ > hw/virtio/vhost-user.c | 138 ++++++++++++++++++++++++++++++ > hw/virtio/vhost.c | 29 +++++++ > 4 files changed, 244 insertions(+) > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h > index ec3fbae58d..fa3bd19386 100644 > --- a/include/hw/virtio/vhost-backend.h > +++ b/include/hw/virtio/vhost-backend.h > @@ -42,6 +42,12 @@ typedef int (*vhost_backend_init)(struct vhost_dev *dev, void *opaque, > typedef int (*vhost_backend_cleanup)(struct vhost_dev *dev); > typedef int (*vhost_backend_memslots_limit)(struct vhost_dev *dev); > > +typedef ssize_t (*vhost_fs_get_state_op)(struct vhost_dev *dev, > + uint64_t state_offset, size_t size); > +typedef int (*vhost_fs_set_state_op)(struct vhost_dev *dev, > + uint64_t state_offset, size_t size); > +typedef int (*vhost_fs_set_state_fd_op)(struct vhost_dev *dev, int memfd, > + size_t size); > typedef int (*vhost_net_set_backend_op)(struct vhost_dev *dev, > struct vhost_vring_file *file); > typedef int (*vhost_net_set_mtu_op)(struct vhost_dev *dev, uint16_t mtu); > @@ -138,6 +144,9 @@ typedef struct VhostOps { > vhost_backend_init vhost_backend_init; > vhost_backend_cleanup vhost_backend_cleanup; > vhost_backend_memslots_limit vhost_backend_memslots_limit; > + vhost_fs_get_state_op vhost_fs_get_state; > + vhost_fs_set_state_op vhost_fs_set_state; > + vhost_fs_set_state_fd_op vhost_fs_set_state_fd; > vhost_net_set_backend_op vhost_net_set_backend; > vhost_net_set_mtu_op vhost_net_set_mtu; > vhost_scsi_set_endpoint_op vhost_scsi_set_endpoint; > diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h > index a52f273347..b1ad9785dd 100644 > --- a/include/hw/virtio/vhost.h > +++ b/include/hw/virtio/vhost.h > @@ -336,4 +336,72 @@ int vhost_dev_set_inflight(struct vhost_dev *dev, > struct vhost_inflight *inflight); > int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size, > struct vhost_inflight *inflight); > + > +/** > + * vhost_fs_set_state_fd(): Share memory with a virtio-fs vhost > + * back-end for transferring internal state for the purpose of > + * migration. Calling this function again will have the back-end > + * unregister (free) the previously shared memory area. > + * > + * @dev: The vhost device > + * @memfd: File descriptor associated with the shared memory to share. > + * If negative, no memory area is shared, only releasing the > + * previously shared area, and announcing the end of transfer > + * (which, on the destination side, should lead to the > + * back-end deserializing and applying the received state). > + * @size: Size of the shared memory area > + * > + * Returns 0 on success, and -errno on failure. > + */ > +int vhost_fs_set_state_fd(struct vhost_dev *dev, int memfd, size_t size); > + > +/** > + * vhost_fs_get_state(): Request the virtio-fs vhost back-end to place > + * a chunk of migration state into the shared memory area negotiated > + * through vhost_fs_set_state_fd(). May only be used for migration, > + * and only by the source side. 
> + * > + * The back-end-internal migration state is treated as a binary blob, > + * which is transferred in chunks to fit into the shared memory area. > + * > + * @dev: The vhost device > + * @state_offset: Offset into the state blob of the first byte to be > + * transferred > + * @size: Number of bytes to transfer at most; must not exceed the > + * size of the shared memory area > + * > + * On success, returns the number of bytes that remain in the full > + * state blob from the beginning of this chunk (i.e. the full size of > + * the blob, minus @state_offset). When transferring the final chunk, > + * this may be less than @size. The shared memory will contain the > + * requested data, starting at offset 0 into the SHM, and counting > + * `MIN(@size, returned value)` bytes. > + * > + * On failure, returns -errno. > + */ > +ssize_t vhost_fs_get_state(struct vhost_dev *dev, uint64_t state_offset, > + uint64_t size); > + > +/** > + * vhost_fs_set_state(): Request the virtio-fs vhost back-end to fetch > + * a chunk of migration state from the shared memory area negotiated > + * through vhost_fs_set_state_fd(). May only be used for migration, > + * and only by the destination side. > + * > + * The back-end-internal migration state is treated as a binary blob, > + * which is transferred in chunks to fit into the shared memory area. > + * > + * The front-end (i.e. the caller) must transfer the whole state to > + * the back-end, without holes. > + * > + * @vdev: the VirtIODevice structure > + * @state_offset: Offset into the state blob of the first byte to be > + * transferred > + * @size: Length of the chunk to transfer; must not exceed the size of > + * the shared memory area > + * > + * Returns 0 on success, and -errno on failure. > + */ > +int vhost_fs_set_state(struct vhost_dev *dev, uint64_t state_offset, > + uint64_t size); > #endif > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c > index e5285df4ba..7fd1fb1ed3 100644 > --- a/hw/virtio/vhost-user.c > +++ b/hw/virtio/vhost-user.c > @@ -130,6 +130,9 @@ typedef enum VhostUserRequest { > VHOST_USER_REM_MEM_REG = 38, > VHOST_USER_SET_STATUS = 39, > VHOST_USER_GET_STATUS = 40, > + VHOST_USER_FS_SET_STATE_FD = 41, > + VHOST_USER_FS_GET_STATE = 42, > + VHOST_USER_FS_SET_STATE = 43, > VHOST_USER_MAX > } VhostUserRequest; > > @@ -210,6 +213,15 @@ typedef struct { > uint32_t size; /* the following payload size */ > } QEMU_PACKED VhostUserHeader; > > +/* > + * Request and reply payloads of VHOST_USER_FS_GET_STATE, and request > + * payload of VHOST_USER_FS_SET_STATE. 
> + */ > +typedef struct VhostUserFsState { > + uint64_t state_offset; > + uint64_t size; > +} VhostUserFsState; > + > typedef union { > #define VHOST_USER_VRING_IDX_MASK (0xff) > #define VHOST_USER_VRING_NOFD_MASK (0x1 << 8) > @@ -224,6 +236,7 @@ typedef union { > VhostUserCryptoSession session; > VhostUserVringArea area; > VhostUserInflight inflight; > + VhostUserFsState fs_state; > } VhostUserPayload; > > typedef struct VhostUserMsg { > @@ -2240,6 +2253,128 @@ static int vhost_user_net_set_mtu(struct vhost_dev *dev, uint16_t mtu) > return 0; > } > > +static int vhost_user_fs_set_state_fd(struct vhost_dev *dev, int memfd, > + size_t size) > +{ > + int ret; > + bool reply_supported = virtio_has_feature(dev->protocol_features, > + VHOST_USER_PROTOCOL_F_REPLY_ACK); > + VhostUserMsg msg = { > + .hdr = { > + .request = VHOST_USER_FS_SET_STATE_FD, > + .flags = VHOST_USER_VERSION, > + .size = sizeof(msg.payload.u64), > + }, > + .payload.u64 = size, > + }; > + > + if (reply_supported) { > + msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK; > + } > + > + if (memfd < 0) { > + assert(size == 0); > + ret = vhost_user_write(dev, &msg, NULL, 0); > + } else { > + ret = vhost_user_write(dev, &msg, &memfd, 1); > + } > + if (ret < 0) { > + return ret; > + } > + > + if (reply_supported) { > + return process_message_reply(dev, &msg); > + } > + > + return 0; > +} > + > +static ssize_t vhost_user_fs_get_state(struct vhost_dev *dev, > + uint64_t state_offset, > + size_t size) > +{ > + int ret; > + VhostUserMsg msg = { > + .hdr = { > + .request = VHOST_USER_FS_GET_STATE, > + .flags = VHOST_USER_VERSION, > + .size = sizeof(msg.payload.fs_state), > + }, > + .payload.fs_state = { > + .state_offset = state_offset, > + .size = size, > + }, > + }; > + > + ret = vhost_user_write(dev, &msg, NULL, 0); > + if (ret < 0) { > + return ret; > + } > + > + ret = vhost_user_read(dev, &msg); > + if (ret < 0) { > + return ret; > + } > + > + if (msg.hdr.request != VHOST_USER_FS_GET_STATE) { > + error_report("Received unexpected message type: " > + "Expected %d, received %d", > + VHOST_USER_FS_GET_STATE, msg.hdr.request); > + return -EPROTO; > + } > + > + if (msg.hdr.size != sizeof(VhostUserFsState)) { > + error_report("Received unexpected message length: " > + "Expected %" PRIu32 ", received %zu", > + msg.hdr.size, sizeof(VhostUserFsState)); > + return -EPROTO; > + } > + > + if (msg.payload.fs_state.size > SSIZE_MAX) { > + error_report("Remaining state size returned by back end is too high: " > + "Expected up to %zd, reported %" PRIu64, > + SSIZE_MAX, msg.payload.fs_state.size); > + return -EPROTO; > + } > + > + return msg.payload.fs_state.size; > +} > + > +static int vhost_user_fs_set_state(struct vhost_dev *dev, > + uint64_t state_offset, > + size_t size) > +{ > + int ret; > + bool reply_supported = virtio_has_feature(dev->protocol_features, > + VHOST_USER_PROTOCOL_F_REPLY_ACK); > + VhostUserMsg msg = { > + .hdr = { > + .request = VHOST_USER_FS_SET_STATE, > + .flags = VHOST_USER_VERSION, > + .size = sizeof(msg.payload.fs_state), > + }, > + .payload.fs_state = { > + .state_offset = state_offset, > + .size = size, > + }, > + }; > + > + if (reply_supported) { > + msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK; > + } > + > + ret = vhost_user_write(dev, &msg, NULL, 0); > + if (ret < 0) { > + return ret; > + } > + > + if (reply_supported) { > + return process_message_reply(dev, &msg); > + } > + > + return 0; > +} > + > static int vhost_user_send_device_iotlb_msg(struct vhost_dev *dev, > struct vhost_iotlb_msg *imsg) > { > @@ -2716,4 +2851,7 
@@ const VhostOps user_ops = { > .vhost_get_inflight_fd = vhost_user_get_inflight_fd, > .vhost_set_inflight_fd = vhost_user_set_inflight_fd, > .vhost_dev_start = vhost_user_dev_start, > + .vhost_fs_set_state_fd = vhost_user_fs_set_state_fd, > + .vhost_fs_get_state = vhost_user_fs_get_state, > + .vhost_fs_set_state = vhost_user_fs_set_state, > }; > diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c > index a266396576..ef8252c90e 100644 > --- a/hw/virtio/vhost.c > +++ b/hw/virtio/vhost.c > @@ -2075,3 +2075,32 @@ int vhost_net_set_backend(struct vhost_dev *hdev, > > return -ENOSYS; > } > + > +int vhost_fs_set_state_fd(struct vhost_dev *dev, int memfd, size_t size) > +{ > + if (dev->vhost_ops->vhost_fs_set_state_fd) { > + return dev->vhost_ops->vhost_fs_set_state_fd(dev, memfd, size); > + } > + > + return -ENOSYS; > +} > + > +ssize_t vhost_fs_get_state(struct vhost_dev *dev, uint64_t state_offset, > + uint64_t size) > +{ > + if (dev->vhost_ops->vhost_fs_get_state) { > + return dev->vhost_ops->vhost_fs_get_state(dev, state_offset, size); > + } > + > + return -ENOSYS; > +} > + > +int vhost_fs_set_state(struct vhost_dev *dev, uint64_t state_offset, > + uint64_t size) > +{ > + if (dev->vhost_ops->vhost_fs_set_state) { > + return dev->vhost_ops->vhost_fs_set_state(dev, state_offset, size); > + } > + > + return -ENOSYS; > +} > -- > 2.39.1 >
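To make the fd-based outline above concrete, here is one way it could be expressed as vhost-user definitions. This is only an illustrative sketch of the proposal; none of these names or values (the message, the enums, the payload struct) exist in QEMU or the vhost-user specification at this point.

/*
 * Hypothetical sketch of the fd-based interface outlined above.
 */
typedef enum VhostDeviceStateDirection {
    VHOST_TRANSFER_STATE_DIRECTION_SAVE = 0, /* back-end writes its state to the fd */
    VHOST_TRANSFER_STATE_DIRECTION_LOAD = 1, /* back-end reads its state from the fd */
} VhostDeviceStateDirection;

typedef enum VhostDeviceStatePhase {
    VHOST_TRANSFER_STATE_PHASE_STOPPED   = 0, /* device (vrings) stopped */
    VHOST_TRANSFER_STATE_PHASE_PRE_COPY  = 1, /* reserved for future use */
    VHOST_TRANSFER_STATE_PHASE_POST_COPY = 2, /* reserved for future use */
} VhostDeviceStatePhase;

/*
 * Payload of a hypothetical VHOST_USER_SET_DEVICE_STATE_FD message; the
 * file descriptor itself would travel as ancillary data on the vhost-user
 * socket, just like the memfd does in FS_SET_STATE_FD below.
 */
typedef struct VhostUserTransferDeviceState {
    uint32_t direction; /* VhostDeviceStateDirection */
    uint32_t phase;     /* VhostDeviceStatePhase */
} VhostUserTransferDeviceState;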
On 15.03.23 14:58, Stefan Hajnoczi wrote:
> On Mon, Mar 13, 2023 at 06:48:32PM +0100, Hanna Czenczek wrote:
>> Add a virtio-fs-specific vhost-user interface to facilitate migrating
>> back-end-internal state.  We plan to migrate the internal state simply
> Luckily the interface does not need to be virtiofs-specific since it
> only transfers opaque data. Any stateful device can use this for
> migration. Please make it generic both at the vhost-user protocol
> message level and at the QEMU vhost API level.

OK, sure.

>> as a binary blob after the streaming phase, so all we need is a way to
>> transfer such a blob from and to the back-end.  We do so by using a
>> dedicated area of shared memory through which the blob is transferred in
>> chunks.
> Keeping the migration data transfer separate from the vhost-user UNIX
> domain socket is a good idea since the amount of data could be large and
> may congest the UNIX domain socket. The shared memory interface solves
> this.
>
> Where I get lost is why it needs to be shared memory instead of simply
> an fd? On the source, the front-end could read the fd until EOF and
> transfer the opaque data. On the destination, the front-end could write
> to the fd and then close it. I think that would be simpler than the
> shared memory interface and could potentially support zero-copy via
> splice(2) (QEMU doesn't need to look at the data being transferred!).
>
> Here is an outline of an fd-based interface:
>
> - SET_DEVICE_STATE_FD: The front-end passes a file descriptor for
>   transferring device state.
>
>   The @direction argument:
>   - SAVE: the back-end transfers an outgoing device state over the fd.
>   - LOAD: the back-end transfers an incoming device state over the fd.
>
>   The @phase argument:
>   - STOPPED: the device is stopped.
>   - PRE_COPY: reserved for future use.
>   - POST_COPY: reserved for future use.
>
> The back-end transfers data over the fd according to @direction and
> @phase upon receiving the SET_DEVICE_STATE_FD message.
>
> There are loose ends like how the message interacts with the virtqueue
> enabled state, what happens if multiple SET_DEVICE_STATE_FD messages are
> sent, etc. I have ignored them for now.
>
> What I wanted to mention about the fd-based interface is:
>
> - It's just one message. The I/O activity happens via the fd and does
>   not involve GET_STATE/SET_STATE messages over the vhost-user domain
>   socket.
>
> - Buffer management is up to the front-end and back-end implementations
>   and a bit simpler than the shared memory interface.
>
> Did you choose the shared memory approach because it has certain
> advantages?

I simply chose it because I didn’t think of anything else. :)

Using just an FD for a pipe-like interface sounds perfect to me.  I
expect that to make the code simpler and, as you point out, it’s just
better in general.  Thanks!

>> This patch adds the following vhost operations (and implements them for
>> vhost-user):
>>
>> - FS_SET_STATE_FD: The front-end passes a dedicated shared memory area
>>   to the back-end.  This area will be used to transfer state via the
>>   other two operations.
>>   (After the transfer FS_SET_STATE_FD detaches the shared memory area
>>   again.)
>>
>> - FS_GET_STATE: The front-end asks the back-end to place a chunk of
>>   internal state into the shared memory area.
>>
>> - FS_SET_STATE: The front-end puts a chunk of internal state into the
>>   shared memory area, and asks the back-end to fetch it.
>>
>> On the source side, the back-end is expected to serialize its internal
>> state either when FS_SET_STATE_FD is invoked, or when FS_GET_STATE is
>> invoked the first time.  On subsequent FS_GET_STATE calls, it memcpy()s
>> parts of that serialized state into the shared memory area.
>>
>> On the destination side, the back-end is expected to collect the state
>> blob over all FS_SET_STATE calls, and then deserialize and apply it once
>> FS_SET_STATE_FD detaches the shared memory area.
> What is the rationale for waiting to receive the entire incoming state
> before parsing it rather than parsing it in a streaming fashion? Can
> this be left as an implementation detail of the vhost-user back-end so
> that there's freedom in choosing either approach?

The rationale was that when using the shared memory approach, you need
to specify the offset into the state of the chunk that you’re currently
transferring.  So to allow streaming, you’d need to make the front-end
transfer the chunks in a streaming fashion, so that these offsets are
continuously increasing.  Definitely possible, and reasonable, I just
thought it’d be easier not to define it at this point and just state
that decoding at the end is always safe.

When using a pipe/splicing, however, that won’t be a concern anymore, so
yes, then we can definitely allow the back-end to decode its state while
it’s still being received.

Hanna
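For reference, the continuously increasing offsets discussed here would come from a source-side loop like the following. This is only a sketch of what the chunked interface implies, not code from the series; `save_fs_state`, `shm` and `put_chunk` are made-up names, while vhost_fs_get_state() and vhost_fs_set_state_fd() are the calls added by the patch.

/*
 * Source-side front-end loop: `shm` is the front-end's mapping of the
 * memfd registered via vhost_fs_set_state_fd(), and put_chunk() stands
 * in for however the caller feeds bytes into the migration stream.
 */
static int save_fs_state(struct vhost_dev *dev, void *shm, size_t shm_size,
                         void (*put_chunk)(const void *buf, size_t len))
{
    uint64_t offset = 0;

    for (;;) {
        /* Ask the back-end to place the next chunk at offset 0 of the SHM */
        ssize_t remaining = vhost_fs_get_state(dev, offset, shm_size);
        size_t chunk;

        if (remaining < 0) {
            return remaining; /* -errno reported by the back-end */
        }

        chunk = (uint64_t)remaining < shm_size ? (size_t)remaining : shm_size;
        put_chunk(shm, chunk);
        offset += chunk;

        if ((uint64_t)remaining <= shm_size) {
            break; /* that was the final chunk */
        }
    }

    /* Detach the shared memory area, announcing the end of the transfer */
    return vhost_fs_set_state_fd(dev, -1, 0);
}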
On Wed, 15 Mar 2023 at 11:56, Hanna Czenczek <hreitz@redhat.com> wrote: > > On 15.03.23 14:58, Stefan Hajnoczi wrote: > > On Mon, Mar 13, 2023 at 06:48:32PM +0100, Hanna Czenczek wrote: > >> Add a virtio-fs-specific vhost-user interface to facilitate migrating > >> back-end-internal state. We plan to migrate the internal state simply > > Luckily the interface does not need to be virtiofs-specific since it > > only transfers opaque data. Any stateful device can use this for > > migration. Please make it generic both at the vhost-user protocol > > message level and at the QEMU vhost API level. > > OK, sure. > > >> as a binary blob after the streaming phase, so all we need is a way to > >> transfer such a blob from and to the back-end. We do so by using a > >> dedicated area of shared memory through which the blob is transferred in > >> chunks. > > Keeping the migration data transfer separate from the vhost-user UNIX > > domain socket is a good idea since the amount of data could be large and > > may congest the UNIX domain socket. The shared memory interface solves > > this. > > > > Where I get lost is why it needs to be shared memory instead of simply > > an fd? On the source, the front-end could read the fd until EOF and > > transfer the opaque data. On the destination, the front-end could write > > to the fd and then close it. I think that would be simpler than the > > shared memory interface and could potentially support zero-copy via > > splice(2) (QEMU doesn't need to look at the data being transferred!). > > > > Here is an outline of an fd-based interface: > > > > - SET_DEVICE_STATE_FD: The front-end passes a file descriptor for > > transferring device state. > > > > The @direction argument: > > - SAVE: the back-end transfers an outgoing device state over the fd. > > - LOAD: the back-end transfers an incoming device state over the fd. > > > > The @phase argument: > > - STOPPED: the device is stopped. > > - PRE_COPY: reserved for future use. > > - POST_COPY: reserved for future use. > > > > The back-end transfers data over the fd according to @direction and > > @phase upon receiving the SET_DEVICE_STATE_FD message. > > > > There are loose ends like how the message interacts with the virtqueue > > enabled state, what happens if multiple SET_DEVICE_STATE_FD messages are > > sent, etc. I have ignored them for now. > > > > What I wanted to mention about the fd-based interface is: > > > > - It's just one message. The I/O activity happens via the fd and does > > not involve GET_STATE/SET_STATE messages over the vhost-user domain > > socket. > > > > - Buffer management is up to the front-end and back-end implementations > > and a bit simpler than the shared memory interface. > > > > Did you choose the shared memory approach because it has certain > > advantages? > > I simply chose it because I didn’t think of anything else. :) > > Using just an FD for a pipe-like interface sounds perfect to me. I > expect that to make the code simpler and, as you point out, it’s just > better in general. Thanks! The Linux VFIO Migration v2 API could be interesting to look at too: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/vfio.h#n814 It has a state machine that puts the device into pre-copy/saving/loading/etc states. > > What is the rationale for waiting to receive the entire incoming state > > before parsing it rather than parsing it in a streaming fashion? 
Can > > this be left as an implementation detail of the vhost-user back-end so > > that there's freedom in choosing either approach? > > The rationale was that when using the shared memory approach, you need > to specify the offset into the state of the chunk that you’re currently > transferring. So to allow streaming, you’d need to make the front-end > transfer the chunks in a streaming fashion, so that these offsets are > continuously increasing. Definitely possible, and reasonable, I just > thought it’d be easier not to define it at this point and just state > that decoding at the end is always safe. > > When using a pipe/splicing, however, that won’t be a concern anymore, so > yes, then we can definitely allow the back-end to decode its state while > it’s still being received. I see. Stefan
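As a rough illustration of the read-until-EOF flow proposed above (again only a sketch, nothing here exists in QEMU; `save_via_fd` and `put_chunk` are invented names): after a hypothetical SET_DEVICE_STATE_FD with direction=SAVE, the source front-end would simply drain the fd, and splice(2) could avoid the copy entirely when the destination of the data is itself an fd.

#include <errno.h>
#include <unistd.h>

static int save_via_fd(int state_fd,
                       void (*put_chunk)(const void *buf, size_t len))
{
    char buf[64 * 1024];

    for (;;) {
        ssize_t n = read(state_fd, buf, sizeof(buf));

        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            return -errno;
        }
        if (n == 0) {
            return 0; /* EOF: the back-end has written all of its state */
        }
        put_chunk(buf, (size_t)n);
    }
}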
On 13/03/2023 19:48, Hanna Czenczek wrote: > Add a virtio-fs-specific vhost-user interface to facilitate migrating > back-end-internal state. We plan to migrate the internal state simply > as a binary blob after the streaming phase, so all we need is a way to > transfer such a blob from and to the back-end. We do so by using a > dedicated area of shared memory through which the blob is transferred in > chunks. > > This patch adds the following vhost operations (and implements them for > vhost-user): > > - FS_SET_STATE_FD: The front-end passes a dedicated shared memory area > to the back-end. This area will be used to transfer state via the > other two operations. > (After the transfer FS_SET_STATE_FD detaches the shared memory area > again.) > > - FS_GET_STATE: The front-end asks the back-end to place a chunk of > internal state into the shared memory area. > > - FS_SET_STATE: The front-end puts a chunk of internal state into the > shared memory area, and asks the back-end to fetch it. > > On the source side, the back-end is expected to serialize its internal > state either when FS_SET_STATE_FD is invoked, or when FS_GET_STATE is > invoked the first time. On subsequent FS_GET_STATE calls, it memcpy()s > parts of that serialized state into the shared memory area. > > On the destination side, the back-end is expected to collect the state > blob over all FS_SET_STATE calls, and then deserialize and apply it once > FS_SET_STATE_FD detaches the shared memory area. > > Signed-off-by: Hanna Czenczek <hreitz@redhat.com> > --- > include/hw/virtio/vhost-backend.h | 9 ++ > include/hw/virtio/vhost.h | 68 +++++++++++++++ > hw/virtio/vhost-user.c | 138 ++++++++++++++++++++++++++++++ > hw/virtio/vhost.c | 29 +++++++ > 4 files changed, 244 insertions(+) > > diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h > index ec3fbae58d..fa3bd19386 100644 > --- a/include/hw/virtio/vhost-backend.h > +++ b/include/hw/virtio/vhost-backend.h > @@ -42,6 +42,12 @@ typedef int (*vhost_backend_init)(struct vhost_dev *dev, void *opaque, > typedef int (*vhost_backend_cleanup)(struct vhost_dev *dev); > typedef int (*vhost_backend_memslots_limit)(struct vhost_dev *dev); > > +typedef ssize_t (*vhost_fs_get_state_op)(struct vhost_dev *dev, > + uint64_t state_offset, size_t size); > +typedef int (*vhost_fs_set_state_op)(struct vhost_dev *dev, > + uint64_t state_offset, size_t size); > +typedef int (*vhost_fs_set_state_fd_op)(struct vhost_dev *dev, int memfd, > + size_t size); > typedef int (*vhost_net_set_backend_op)(struct vhost_dev *dev, > struct vhost_vring_file *file); > typedef int (*vhost_net_set_mtu_op)(struct vhost_dev *dev, uint16_t mtu); > @@ -138,6 +144,9 @@ typedef struct VhostOps { > vhost_backend_init vhost_backend_init; > vhost_backend_cleanup vhost_backend_cleanup; > vhost_backend_memslots_limit vhost_backend_memslots_limit; > + vhost_fs_get_state_op vhost_fs_get_state; > + vhost_fs_set_state_op vhost_fs_set_state; > + vhost_fs_set_state_fd_op vhost_fs_set_state_fd; > vhost_net_set_backend_op vhost_net_set_backend; > vhost_net_set_mtu_op vhost_net_set_mtu; > vhost_scsi_set_endpoint_op vhost_scsi_set_endpoint; > diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h > index a52f273347..b1ad9785dd 100644 > --- a/include/hw/virtio/vhost.h > +++ b/include/hw/virtio/vhost.h > @@ -336,4 +336,72 @@ int vhost_dev_set_inflight(struct vhost_dev *dev, > struct vhost_inflight *inflight); > int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size, > struct vhost_inflight 
*inflight); > + > +/** > + * vhost_fs_set_state_fd(): Share memory with a virtio-fs vhost > + * back-end for transferring internal state for the purpose of > + * migration. Calling this function again will have the back-end > + * unregister (free) the previously shared memory area. > + * > + * @dev: The vhost device > + * @memfd: File descriptor associated with the shared memory to share. > + * If negative, no memory area is shared, only releasing the > + * previously shared area, and announcing the end of transfer > + * (which, on the destination side, should lead to the > + * back-end deserializing and applying the received state). > + * @size: Size of the shared memory area > + * > + * Returns 0 on success, and -errno on failure. > + */ > +int vhost_fs_set_state_fd(struct vhost_dev *dev, int memfd, size_t size); > + > +/** > + * vhost_fs_get_state(): Request the virtio-fs vhost back-end to place > + * a chunk of migration state into the shared memory area negotiated > + * through vhost_fs_set_state_fd(). May only be used for migration, > + * and only by the source side. > + * > + * The back-end-internal migration state is treated as a binary blob, > + * which is transferred in chunks to fit into the shared memory area. > + * > + * @dev: The vhost device > + * @state_offset: Offset into the state blob of the first byte to be > + * transferred > + * @size: Number of bytes to transfer at most; must not exceed the > + * size of the shared memory area > + * > + * On success, returns the number of bytes that remain in the full > + * state blob from the beginning of this chunk (i.e. the full size of > + * the blob, minus @state_offset). When transferring the final chunk, > + * this may be less than @size. The shared memory will contain the > + * requested data, starting at offset 0 into the SHM, and counting > + * `MIN(@size, returned value)` bytes. > + * > + * On failure, returns -errno. > + */ > +ssize_t vhost_fs_get_state(struct vhost_dev *dev, uint64_t state_offset, > + uint64_t size); > + > +/** > + * vhost_fs_set_state(): Request the virtio-fs vhost back-end to fetch > + * a chunk of migration state from the shared memory area negotiated > + * through vhost_fs_set_state_fd(). May only be used for migration, > + * and only by the destination side. > + * > + * The back-end-internal migration state is treated as a binary blob, > + * which is transferred in chunks to fit into the shared memory area. > + * > + * The front-end (i.e. the caller) must transfer the whole state to > + * the back-end, without holes. > + * > + * @vdev: the VirtIODevice structure > + * @state_offset: Offset into the state blob of the first byte to be > + * transferred > + * @size: Length of the chunk to transfer; must not exceed the size of > + * the shared memory area > + * > + * Returns 0 on success, and -errno on failure. 
> + */ > +int vhost_fs_set_state(struct vhost_dev *dev, uint64_t state_offset, > + uint64_t size); > #endif > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c > index e5285df4ba..7fd1fb1ed3 100644 > --- a/hw/virtio/vhost-user.c > +++ b/hw/virtio/vhost-user.c > @@ -130,6 +130,9 @@ typedef enum VhostUserRequest { > VHOST_USER_REM_MEM_REG = 38, > VHOST_USER_SET_STATUS = 39, > VHOST_USER_GET_STATUS = 40, > + VHOST_USER_FS_SET_STATE_FD = 41, > + VHOST_USER_FS_GET_STATE = 42, > + VHOST_USER_FS_SET_STATE = 43, > VHOST_USER_MAX > } VhostUserRequest; > > @@ -210,6 +213,15 @@ typedef struct { > uint32_t size; /* the following payload size */ > } QEMU_PACKED VhostUserHeader; > > +/* > + * Request and reply payloads of VHOST_USER_FS_GET_STATE, and request > + * payload of VHOST_USER_FS_SET_STATE. > + */ > +typedef struct VhostUserFsState { > + uint64_t state_offset; > + uint64_t size; > +} VhostUserFsState; > + > typedef union { > #define VHOST_USER_VRING_IDX_MASK (0xff) > #define VHOST_USER_VRING_NOFD_MASK (0x1 << 8) > @@ -224,6 +236,7 @@ typedef union { > VhostUserCryptoSession session; > VhostUserVringArea area; > VhostUserInflight inflight; > + VhostUserFsState fs_state; > } VhostUserPayload; > > typedef struct VhostUserMsg { > @@ -2240,6 +2253,128 @@ static int vhost_user_net_set_mtu(struct vhost_dev *dev, uint16_t mtu) > return 0; > } > > +static int vhost_user_fs_set_state_fd(struct vhost_dev *dev, int memfd, > + size_t size) > +{ > + int ret; > + bool reply_supported = virtio_has_feature(dev->protocol_features, > + VHOST_USER_PROTOCOL_F_REPLY_ACK); > + VhostUserMsg msg = { > + .hdr = { > + .request = VHOST_USER_FS_SET_STATE_FD, > + .flags = VHOST_USER_VERSION, > + .size = sizeof(msg.payload.u64), > + }, > + .payload.u64 = size, > + }; > + > + if (reply_supported) { > + msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK; > + } > + > + if (memfd < 0) { > + assert(size == 0); > + ret = vhost_user_write(dev, &msg, NULL, 0); > + } else { > + ret = vhost_user_write(dev, &msg, &memfd, 1); > + } > + if (ret < 0) { > + return ret; > + } > + > + if (reply_supported) { > + return process_message_reply(dev, &msg); > + } > + > + return 0; > +} > + > +static ssize_t vhost_user_fs_get_state(struct vhost_dev *dev, > + uint64_t state_offset, > + size_t size) > +{ Am I right to assume that vrings are supposed to be stopped to get/set state? If I am shouldn't we check it before sending message or do you believe it is better to return error from backend in these cases? 
> + int ret; > + VhostUserMsg msg = { > + .hdr = { > + .request = VHOST_USER_FS_GET_STATE, > + .flags = VHOST_USER_VERSION, > + .size = sizeof(msg.payload.fs_state), > + }, > + .payload.fs_state = { > + .state_offset = state_offset, > + .size = size, > + }, > + }; > + > + ret = vhost_user_write(dev, &msg, NULL, 0); > + if (ret < 0) { > + return ret; > + } > + > + ret = vhost_user_read(dev, &msg); > + if (ret < 0) { > + return ret; > + } > + > + if (msg.hdr.request != VHOST_USER_FS_GET_STATE) { > + error_report("Received unexpected message type: " > + "Expected %d, received %d", > + VHOST_USER_FS_GET_STATE, msg.hdr.request); > + return -EPROTO; > + } > + > + if (msg.hdr.size != sizeof(VhostUserFsState)) { > + error_report("Received unexpected message length: " > + "Expected %" PRIu32 ", received %zu", > + msg.hdr.size, sizeof(VhostUserFsState)); > + return -EPROTO; > + } > + > + if (msg.payload.fs_state.size > SSIZE_MAX) { > + error_report("Remaining state size returned by back end is too high: " > + "Expected up to %zd, reported %" PRIu64, > + SSIZE_MAX, msg.payload.fs_state.size); > + return -EPROTO; > + } > + > + return msg.payload.fs_state.size; > +} > + > +static int vhost_user_fs_set_state(struct vhost_dev *dev, > + uint64_t state_offset, > + size_t size) > +{ > + int ret; > + bool reply_supported = virtio_has_feature(dev->protocol_features, > + VHOST_USER_PROTOCOL_F_REPLY_ACK); > + VhostUserMsg msg = { > + .hdr = { > + .request = VHOST_USER_FS_SET_STATE, > + .flags = VHOST_USER_VERSION, > + .size = sizeof(msg.payload.fs_state), > + }, > + .payload.fs_state = { > + .state_offset = state_offset, > + .size = size, > + }, > + }; > + > + if (reply_supported) { > + msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK; > + } > + > + ret = vhost_user_write(dev, &msg, NULL, 0); > + if (ret < 0) { > + return ret; > + } > + > + if (reply_supported) { > + return process_message_reply(dev, &msg); > + } > + > + return 0; > +} > + > static int vhost_user_send_device_iotlb_msg(struct vhost_dev *dev, > struct vhost_iotlb_msg *imsg) > { > @@ -2716,4 +2851,7 @@ const VhostOps user_ops = { > .vhost_get_inflight_fd = vhost_user_get_inflight_fd, > .vhost_set_inflight_fd = vhost_user_set_inflight_fd, > .vhost_dev_start = vhost_user_dev_start, > + .vhost_fs_set_state_fd = vhost_user_fs_set_state_fd, > + .vhost_fs_get_state = vhost_user_fs_get_state, > + .vhost_fs_set_state = vhost_user_fs_set_state, > }; > diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c > index a266396576..ef8252c90e 100644 > --- a/hw/virtio/vhost.c > +++ b/hw/virtio/vhost.c > @@ -2075,3 +2075,32 @@ int vhost_net_set_backend(struct vhost_dev *hdev, > > return -ENOSYS; > } > + > +int vhost_fs_set_state_fd(struct vhost_dev *dev, int memfd, size_t size) > +{ > + if (dev->vhost_ops->vhost_fs_set_state_fd) { > + return dev->vhost_ops->vhost_fs_set_state_fd(dev, memfd, size); > + } > + > + return -ENOSYS; > +} > + > +ssize_t vhost_fs_get_state(struct vhost_dev *dev, uint64_t state_offset, > + uint64_t size) > +{ > + if (dev->vhost_ops->vhost_fs_get_state) { > + return dev->vhost_ops->vhost_fs_get_state(dev, state_offset, size); > + } > + > + return -ENOSYS; > +} > + > +int vhost_fs_set_state(struct vhost_dev *dev, uint64_t state_offset, > + uint64_t size) > +{ > + if (dev->vhost_ops->vhost_fs_set_state) { > + return dev->vhost_ops->vhost_fs_set_state(dev, state_offset, size); > + } > + > + return -ENOSYS; > +}
On 17.03.23 18:39, Anton Kuchin wrote:
> On 13/03/2023 19:48, Hanna Czenczek wrote:
>> Add a virtio-fs-specific vhost-user interface to facilitate migrating
>> back-end-internal state.  We plan to migrate the internal state simply
>> as a binary blob after the streaming phase, so all we need is a way to
>> transfer such a blob from and to the back-end.  We do so by using a
>> dedicated area of shared memory through which the blob is transferred in
>> chunks.
>>
>> This patch adds the following vhost operations (and implements them for
>> vhost-user):
>>
>> - FS_SET_STATE_FD: The front-end passes a dedicated shared memory area
>>   to the back-end.  This area will be used to transfer state via the
>>   other two operations.
>>   (After the transfer FS_SET_STATE_FD detaches the shared memory area
>>   again.)
>>
>> - FS_GET_STATE: The front-end asks the back-end to place a chunk of
>>   internal state into the shared memory area.
>>
>> - FS_SET_STATE: The front-end puts a chunk of internal state into the
>>   shared memory area, and asks the back-end to fetch it.
>>
>> On the source side, the back-end is expected to serialize its internal
>> state either when FS_SET_STATE_FD is invoked, or when FS_GET_STATE is
>> invoked the first time.  On subsequent FS_GET_STATE calls, it memcpy()s
>> parts of that serialized state into the shared memory area.
>>
>> On the destination side, the back-end is expected to collect the state
>> blob over all FS_SET_STATE calls, and then deserialize and apply it once
>> FS_SET_STATE_FD detaches the shared memory area.
>>
>> Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
>> ---
>>   include/hw/virtio/vhost-backend.h |   9 ++
>>   include/hw/virtio/vhost.h         |  68 +++++++++++++++
>>   hw/virtio/vhost-user.c            | 138 ++++++++++++++++++++++++++++++
>>   hw/virtio/vhost.c                 |  29 +++++++
>>   4 files changed, 244 insertions(+)
>> [...]
>> +static ssize_t vhost_user_fs_get_state(struct vhost_dev *dev,
>> +                                       uint64_t state_offset,
>> +                                       size_t size)
>> +{
>
> Am I right to assume that vrings are supposed to be stopped to get/set
> state?
> If I am, shouldn't we check it before sending the message, or do you
> believe it is better to return an error from the backend in these cases?

Good point!  Yes, the vrings should be stopped.  I don’t verify this
currently, but it makes sense to do so both on the qemu and the
virtiofsd side.

Hanna
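A minimal sketch of what such a front-end-side check could look like in the vhost_fs_get_state() wrapper, assuming the `started` flag of struct vhost_dev is the right condition to test (this guard is not part of the patch):

ssize_t vhost_fs_get_state(struct vhost_dev *dev, uint64_t state_offset,
                           uint64_t size)
{
    if (dev->started) {
        /* State may only be read while the vrings are stopped */
        return -EBUSY;
    }

    if (dev->vhost_ops->vhost_fs_get_state) {
        return dev->vhost_ops->vhost_fs_get_state(dev, state_offset, size);
    }

    return -ENOSYS;
}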
diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h index ec3fbae58d..fa3bd19386 100644 --- a/include/hw/virtio/vhost-backend.h +++ b/include/hw/virtio/vhost-backend.h @@ -42,6 +42,12 @@ typedef int (*vhost_backend_init)(struct vhost_dev *dev, void *opaque, typedef int (*vhost_backend_cleanup)(struct vhost_dev *dev); typedef int (*vhost_backend_memslots_limit)(struct vhost_dev *dev); +typedef ssize_t (*vhost_fs_get_state_op)(struct vhost_dev *dev, + uint64_t state_offset, size_t size); +typedef int (*vhost_fs_set_state_op)(struct vhost_dev *dev, + uint64_t state_offset, size_t size); +typedef int (*vhost_fs_set_state_fd_op)(struct vhost_dev *dev, int memfd, + size_t size); typedef int (*vhost_net_set_backend_op)(struct vhost_dev *dev, struct vhost_vring_file *file); typedef int (*vhost_net_set_mtu_op)(struct vhost_dev *dev, uint16_t mtu); @@ -138,6 +144,9 @@ typedef struct VhostOps { vhost_backend_init vhost_backend_init; vhost_backend_cleanup vhost_backend_cleanup; vhost_backend_memslots_limit vhost_backend_memslots_limit; + vhost_fs_get_state_op vhost_fs_get_state; + vhost_fs_set_state_op vhost_fs_set_state; + vhost_fs_set_state_fd_op vhost_fs_set_state_fd; vhost_net_set_backend_op vhost_net_set_backend; vhost_net_set_mtu_op vhost_net_set_mtu; vhost_scsi_set_endpoint_op vhost_scsi_set_endpoint; diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h index a52f273347..b1ad9785dd 100644 --- a/include/hw/virtio/vhost.h +++ b/include/hw/virtio/vhost.h @@ -336,4 +336,72 @@ int vhost_dev_set_inflight(struct vhost_dev *dev, struct vhost_inflight *inflight); int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size, struct vhost_inflight *inflight); + +/** + * vhost_fs_set_state_fd(): Share memory with a virtio-fs vhost + * back-end for transferring internal state for the purpose of + * migration. Calling this function again will have the back-end + * unregister (free) the previously shared memory area. + * + * @dev: The vhost device + * @memfd: File descriptor associated with the shared memory to share. + * If negative, no memory area is shared, only releasing the + * previously shared area, and announcing the end of transfer + * (which, on the destination side, should lead to the + * back-end deserializing and applying the received state). + * @size: Size of the shared memory area + * + * Returns 0 on success, and -errno on failure. + */ +int vhost_fs_set_state_fd(struct vhost_dev *dev, int memfd, size_t size); + +/** + * vhost_fs_get_state(): Request the virtio-fs vhost back-end to place + * a chunk of migration state into the shared memory area negotiated + * through vhost_fs_set_state_fd(). May only be used for migration, + * and only by the source side. + * + * The back-end-internal migration state is treated as a binary blob, + * which is transferred in chunks to fit into the shared memory area. + * + * @dev: The vhost device + * @state_offset: Offset into the state blob of the first byte to be + * transferred + * @size: Number of bytes to transfer at most; must not exceed the + * size of the shared memory area + * + * On success, returns the number of bytes that remain in the full + * state blob from the beginning of this chunk (i.e. the full size of + * the blob, minus @state_offset). When transferring the final chunk, + * this may be less than @size. The shared memory will contain the + * requested data, starting at offset 0 into the SHM, and counting + * `MIN(@size, returned value)` bytes. + * + * On failure, returns -errno. 
+ */ +ssize_t vhost_fs_get_state(struct vhost_dev *dev, uint64_t state_offset, + uint64_t size); + +/** + * vhost_fs_set_state(): Request the virtio-fs vhost back-end to fetch + * a chunk of migration state from the shared memory area negotiated + * through vhost_fs_set_state_fd(). May only be used for migration, + * and only by the destination side. + * + * The back-end-internal migration state is treated as a binary blob, + * which is transferred in chunks to fit into the shared memory area. + * + * The front-end (i.e. the caller) must transfer the whole state to + * the back-end, without holes. + * + * @vdev: the VirtIODevice structure + * @state_offset: Offset into the state blob of the first byte to be + * transferred + * @size: Length of the chunk to transfer; must not exceed the size of + * the shared memory area + * + * Returns 0 on success, and -errno on failure. + */ +int vhost_fs_set_state(struct vhost_dev *dev, uint64_t state_offset, + uint64_t size); #endif diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c index e5285df4ba..7fd1fb1ed3 100644 --- a/hw/virtio/vhost-user.c +++ b/hw/virtio/vhost-user.c @@ -130,6 +130,9 @@ typedef enum VhostUserRequest { VHOST_USER_REM_MEM_REG = 38, VHOST_USER_SET_STATUS = 39, VHOST_USER_GET_STATUS = 40, + VHOST_USER_FS_SET_STATE_FD = 41, + VHOST_USER_FS_GET_STATE = 42, + VHOST_USER_FS_SET_STATE = 43, VHOST_USER_MAX } VhostUserRequest; @@ -210,6 +213,15 @@ typedef struct { uint32_t size; /* the following payload size */ } QEMU_PACKED VhostUserHeader; +/* + * Request and reply payloads of VHOST_USER_FS_GET_STATE, and request + * payload of VHOST_USER_FS_SET_STATE. + */ +typedef struct VhostUserFsState { + uint64_t state_offset; + uint64_t size; +} VhostUserFsState; + typedef union { #define VHOST_USER_VRING_IDX_MASK (0xff) #define VHOST_USER_VRING_NOFD_MASK (0x1 << 8) @@ -224,6 +236,7 @@ typedef union { VhostUserCryptoSession session; VhostUserVringArea area; VhostUserInflight inflight; + VhostUserFsState fs_state; } VhostUserPayload; typedef struct VhostUserMsg { @@ -2240,6 +2253,128 @@ static int vhost_user_net_set_mtu(struct vhost_dev *dev, uint16_t mtu) return 0; } +static int vhost_user_fs_set_state_fd(struct vhost_dev *dev, int memfd, + size_t size) +{ + int ret; + bool reply_supported = virtio_has_feature(dev->protocol_features, + VHOST_USER_PROTOCOL_F_REPLY_ACK); + VhostUserMsg msg = { + .hdr = { + .request = VHOST_USER_FS_SET_STATE_FD, + .flags = VHOST_USER_VERSION, + .size = sizeof(msg.payload.u64), + }, + .payload.u64 = size, + }; + + if (reply_supported) { + msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK; + } + + if (memfd < 0) { + assert(size == 0); + ret = vhost_user_write(dev, &msg, NULL, 0); + } else { + ret = vhost_user_write(dev, &msg, &memfd, 1); + } + if (ret < 0) { + return ret; + } + + if (reply_supported) { + return process_message_reply(dev, &msg); + } + + return 0; +} + +static ssize_t vhost_user_fs_get_state(struct vhost_dev *dev, + uint64_t state_offset, + size_t size) +{ + int ret; + VhostUserMsg msg = { + .hdr = { + .request = VHOST_USER_FS_GET_STATE, + .flags = VHOST_USER_VERSION, + .size = sizeof(msg.payload.fs_state), + }, + .payload.fs_state = { + .state_offset = state_offset, + .size = size, + }, + }; + + ret = vhost_user_write(dev, &msg, NULL, 0); + if (ret < 0) { + return ret; + } + + ret = vhost_user_read(dev, &msg); + if (ret < 0) { + return ret; + } + + if (msg.hdr.request != VHOST_USER_FS_GET_STATE) { + error_report("Received unexpected message type: " + "Expected %d, received %d", + 
VHOST_USER_FS_GET_STATE, msg.hdr.request); + return -EPROTO; + } + + if (msg.hdr.size != sizeof(VhostUserFsState)) { + error_report("Received unexpected message length: " + "Expected %" PRIu32 ", received %zu", + msg.hdr.size, sizeof(VhostUserFsState)); + return -EPROTO; + } + + if (msg.payload.fs_state.size > SSIZE_MAX) { + error_report("Remaining state size returned by back end is too high: " + "Expected up to %zd, reported %" PRIu64, + SSIZE_MAX, msg.payload.fs_state.size); + return -EPROTO; + } + + return msg.payload.fs_state.size; +} + +static int vhost_user_fs_set_state(struct vhost_dev *dev, + uint64_t state_offset, + size_t size) +{ + int ret; + bool reply_supported = virtio_has_feature(dev->protocol_features, + VHOST_USER_PROTOCOL_F_REPLY_ACK); + VhostUserMsg msg = { + .hdr = { + .request = VHOST_USER_FS_SET_STATE, + .flags = VHOST_USER_VERSION, + .size = sizeof(msg.payload.fs_state), + }, + .payload.fs_state = { + .state_offset = state_offset, + .size = size, + }, + }; + + if (reply_supported) { + msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK; + } + + ret = vhost_user_write(dev, &msg, NULL, 0); + if (ret < 0) { + return ret; + } + + if (reply_supported) { + return process_message_reply(dev, &msg); + } + + return 0; +} + static int vhost_user_send_device_iotlb_msg(struct vhost_dev *dev, struct vhost_iotlb_msg *imsg) { @@ -2716,4 +2851,7 @@ const VhostOps user_ops = { .vhost_get_inflight_fd = vhost_user_get_inflight_fd, .vhost_set_inflight_fd = vhost_user_set_inflight_fd, .vhost_dev_start = vhost_user_dev_start, + .vhost_fs_set_state_fd = vhost_user_fs_set_state_fd, + .vhost_fs_get_state = vhost_user_fs_get_state, + .vhost_fs_set_state = vhost_user_fs_set_state, }; diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c index a266396576..ef8252c90e 100644 --- a/hw/virtio/vhost.c +++ b/hw/virtio/vhost.c @@ -2075,3 +2075,32 @@ int vhost_net_set_backend(struct vhost_dev *hdev, return -ENOSYS; } + +int vhost_fs_set_state_fd(struct vhost_dev *dev, int memfd, size_t size) +{ + if (dev->vhost_ops->vhost_fs_set_state_fd) { + return dev->vhost_ops->vhost_fs_set_state_fd(dev, memfd, size); + } + + return -ENOSYS; +} + +ssize_t vhost_fs_get_state(struct vhost_dev *dev, uint64_t state_offset, + uint64_t size) +{ + if (dev->vhost_ops->vhost_fs_get_state) { + return dev->vhost_ops->vhost_fs_get_state(dev, state_offset, size); + } + + return -ENOSYS; +} + +int vhost_fs_set_state(struct vhost_dev *dev, uint64_t state_offset, + uint64_t size) +{ + if (dev->vhost_ops->vhost_fs_set_state) { + return dev->vhost_ops->vhost_fs_set_state(dev, state_offset, size); + } + + return -ENOSYS; +}
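For context, this is roughly what the back-end side of FS_GET_STATE could look like if it follows the contract in the commit message (serialize once, then memcpy() windows of the blob into the shared area). The struct and helper names here are invented for illustration and are not taken from virtiofsd.

#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical per-transfer context set up when FS_SET_STATE_FD arrives */
struct state_xfer {
    void *shm;        /* mapping of the memfd received with FS_SET_STATE_FD */
    size_t shm_size;
    void *blob;       /* state serialized on the first FS_GET_STATE */
    size_t blob_size;
};

/* Returns the number of bytes remaining from state_offset, or -errno */
static ssize_t handle_fs_get_state(struct state_xfer *x,
                                   uint64_t state_offset, uint64_t size)
{
    size_t remaining, chunk;

    if (!x->shm || size > x->shm_size || state_offset > x->blob_size) {
        return -EINVAL;
    }

    remaining = x->blob_size - state_offset;
    chunk = remaining < size ? remaining : (size_t)size;

    /* Copy the requested window of the blob to offset 0 of the SHM area */
    memcpy(x->shm, (const char *)x->blob + state_offset, chunk);

    return (ssize_t)remaining;
}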
Add a virtio-fs-specific vhost-user interface to facilitate migrating
back-end-internal state.  We plan to migrate the internal state simply
as a binary blob after the streaming phase, so all we need is a way to
transfer such a blob from and to the back-end.  We do so by using a
dedicated area of shared memory through which the blob is transferred in
chunks.

This patch adds the following vhost operations (and implements them for
vhost-user):

- FS_SET_STATE_FD: The front-end passes a dedicated shared memory area
  to the back-end.  This area will be used to transfer state via the
  other two operations.
  (After the transfer FS_SET_STATE_FD detaches the shared memory area
  again.)

- FS_GET_STATE: The front-end asks the back-end to place a chunk of
  internal state into the shared memory area.

- FS_SET_STATE: The front-end puts a chunk of internal state into the
  shared memory area, and asks the back-end to fetch it.

On the source side, the back-end is expected to serialize its internal
state either when FS_SET_STATE_FD is invoked, or when FS_GET_STATE is
invoked the first time.  On subsequent FS_GET_STATE calls, it memcpy()s
parts of that serialized state into the shared memory area.

On the destination side, the back-end is expected to collect the state
blob over all FS_SET_STATE calls, and then deserialize and apply it once
FS_SET_STATE_FD detaches the shared memory area.

Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
---
 include/hw/virtio/vhost-backend.h |   9 ++
 include/hw/virtio/vhost.h         |  68 +++++++++++++++
 hw/virtio/vhost-user.c            | 138 ++++++++++++++++++++++++++++++
 hw/virtio/vhost.c                 |  29 +++++++
 4 files changed, 244 insertions(+)
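To round out the picture, here is a sketch of the destination-side front-end flow that the documentation above implies. Only vhost_fs_set_state() and vhost_fs_set_state_fd() come from the patch; `load_fs_state`, `shm` and `get_chunk` are placeholders for how the caller obtains the incoming blob from the migration stream.

static int load_fs_state(struct vhost_dev *dev, void *shm, size_t shm_size,
                         ssize_t (*get_chunk)(void *buf, size_t len))
{
    uint64_t offset = 0;
    int ret;

    for (;;) {
        /* Pull the next chunk of the blob from the migration stream */
        ssize_t len = get_chunk(shm, shm_size);

        if (len < 0) {
            return len;
        }
        if (len == 0) {
            break; /* entire blob transferred */
        }

        /* Ask the back-end to fetch it from the shared memory area */
        ret = vhost_fs_set_state(dev, offset, len);
        if (ret < 0) {
            return ret;
        }
        offset += len;
    }

    /*
     * Detach the shared memory area; per the FS_SET_STATE_FD contract this
     * is what tells the back-end to deserialize and apply the collected
     * state.
     */
    return vhost_fs_set_state_fd(dev, -1, 0);
}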