diff mbox series

[v5,1/6] vhost-user: Support transferring inflight buffer between qemu and backend

Message ID 20190122083152.10705-2-xieyongji@baidu.com (mailing list archive)
State New, archived
Headers show
Series vhost-user-blk: Add support for backend reconnecting | expand

Commit Message

Yongji Xie Jan. 22, 2019, 8:31 a.m. UTC
From: Xie Yongji <xieyongji@baidu.com>

This patch introduces two new messages VHOST_USER_GET_INFLIGHT_FD
and VHOST_USER_SET_INFLIGHT_FD to support transferring a shared
buffer between qemu and backend.

Firstly, qemu uses VHOST_USER_GET_INFLIGHT_FD to get the
shared buffer from backend. Then qemu should send it back
through VHOST_USER_SET_INFLIGHT_FD each time we start vhost-user.

This shared buffer is used to track inflight I/O by backend.
Qemu should clear it when vm reset.

Signed-off-by: Xie Yongji <xieyongji@baidu.com>
Signed-off-by: Chai Wen <chaiwen@baidu.com>
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
---
 docs/interop/vhost-user.txt       | 101 +++++++++++++++++++++++++++
 hw/virtio/vhost-user.c            | 110 ++++++++++++++++++++++++++++++
 hw/virtio/vhost.c                 | 105 ++++++++++++++++++++++++++++
 include/hw/virtio/vhost-backend.h |  10 +++
 include/hw/virtio/vhost.h         |  19 ++++++
 5 files changed, 345 insertions(+)

Comments

Stefan Hajnoczi Jan. 29, 2019, 4:11 a.m. UTC | #1
On Tue, Jan 22, 2019 at 04:31:47PM +0800, elohimes@gmail.com wrote:
> +typedef struct DescState {
> +    uint8_t inuse;
> +    uint8_t version;
> +    uint16_t used_idx;
> +    uint16_t avail_idx;
> +    uint16_t reserved;
> +} DescState;
> +
> +typedef struct QueueRegion {
> +    uint8_t valid;
> +    uint16_t desc_num;
> +    DescState desc[0];
> +} QueueRegion;
> +
> +The struct DescState is used to describe one head-descriptor's state. The
> +fields have following meanings:
> +
> +    inuse: Indicate whether the descriptor is inuse or not.
> +
> +    version: Indicate whether we have an atomic update to used ring and
> +    inflight buffer when slave crash at that point. This field should be
> +    increased by one before and after this two updates. An odd version
> +    indicates an in-progress update.
> +
> +    used_idx: Store old index of used ring before we update used ring and
> +    inflight buffer so that slave can know whether an odd version inflight
> +    head-descriptor in inflight buffer is processed or not.
> +
> +    avail_idx: Used to preserve the descriptor's order in avail ring so that
> +    slave can resubmit descriptors in order.

Will a completely new "packed vring" inflight shm layout be necessary to
support the packed vring layout in VIRTIO 1.1?

https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html#x1-610007
Michael S. Tsirkin Jan. 29, 2019, 4:26 a.m. UTC | #2
On Tue, Jan 29, 2019 at 12:11:55PM +0800, Stefan Hajnoczi wrote:
> On Tue, Jan 22, 2019 at 04:31:47PM +0800, elohimes@gmail.com wrote:
> > +typedef struct DescState {
> > +    uint8_t inuse;
> > +    uint8_t version;
> > +    uint16_t used_idx;
> > +    uint16_t avail_idx;
> > +    uint16_t reserved;
> > +} DescState;
> > +
> > +typedef struct QueueRegion {
> > +    uint8_t valid;

what's this?

> > +    uint16_t desc_num;

there's padding before this field. Pls make it explicit.

> > +    DescState desc[0];
> > +} QueueRegion;
> > +
> > +The struct DescState is used to describe one head-descriptor's state. The
> > +fields have following meanings:
> > +
> > +    inuse: Indicate whether the descriptor is inuse or not.

inuse by what?

> > +
> > +    version: Indicate whether we have an atomic update to used ring and
> > +    inflight buffer when slave crash at that point. This field should be
> > +    increased by one before and after this two updates. An odd version
> > +    indicates an in-progress update.

I'm not sure I understand what does the above say. Also does this
require two atomics? Seems pretty expensive. And why is it called
version?

> > +
> > +    used_idx: Store old index of used ring before we update used ring and
> > +    inflight buffer so that slave can know whether an odd version inflight
> > +    head-descriptor in inflight buffer is processed or not.

Here too.

> > +
> > +    avail_idx: Used to preserve the descriptor's order in avail ring so that
> > +    slave can resubmit descriptors in order.

Why would that be necessary?

> 
> Will a completely new "packed vring" inflight shm layout be necessary to
> support the packed vring layout in VIRTIO 1.1?
> 
> https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html#x1-610007

Probably.
Yongji Xie Jan. 29, 2019, 6:15 a.m. UTC | #3
On Tue, 29 Jan 2019 at 12:26, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Tue, Jan 29, 2019 at 12:11:55PM +0800, Stefan Hajnoczi wrote:
> > On Tue, Jan 22, 2019 at 04:31:47PM +0800, elohimes@gmail.com wrote:
> > > +typedef struct DescState {
> > > +    uint8_t inuse;
> > > +    uint8_t version;
> > > +    uint16_t used_idx;
> > > +    uint16_t avail_idx;
> > > +    uint16_t reserved;
> > > +} DescState;
> > > +
> > > +typedef struct QueueRegion {
> > > +    uint8_t valid;
>
> what's this?
>

We can use this to check whether this buffer is reset by qemu.

> > > +    uint16_t desc_num;
>
> there's padding before this field. Pls make it explicit.
>

Will do it.

> > > +    DescState desc[0];
> > > +} QueueRegion;
> > > +
> > > +The struct DescState is used to describe one head-descriptor's state. The
> > > +fields have following meanings:
> > > +
> > > +    inuse: Indicate whether the descriptor is inuse or not.
>
> inuse by what?
>

Maybe inflight is better?

> > > +
> > > +    version: Indicate whether we have an atomic update to used ring and
> > > +    inflight buffer when slave crash at that point. This field should be
> > > +    increased by one before and after this two updates. An odd version
> > > +    indicates an in-progress update.
>
> I'm not sure I understand what does the above say. Also does this
> require two atomics? Seems pretty expensive. And why is it called
> version?
>
> > > +
> > > +    used_idx: Store old index of used ring before we update used ring and
> > > +    inflight buffer so that slave can know whether an odd version inflight
> > > +    head-descriptor in inflight buffer is processed or not.
>
> Here too.
>

Sorry, the above description may be not clear. This two fields are
used to indicate whether we have an in-progress update to used ring
and inflight buffer. If slave crash before the update to used_ring and
after the update to inflight buffer, the version should be odd and
used_idx should be equal to used_ring.idx. Then we need to roll back
the update to inflight buffer. As for the name of the version filed,
actually I didn't find a good one, so I just copy it from struct
kvm_steal_time...

> > > +
> > > +    avail_idx: Used to preserve the descriptor's order in avail ring so that
> > > +    slave can resubmit descriptors in order.
>
> Why would that be necessary?
>

Maybe some devices will be able to use it to preserve order after
reconnecting in future?

> >
> > Will a completely new "packed vring" inflight shm layout be necessary to
> > support the packed vring layout in VIRTIO 1.1?
> >
> > https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html#x1-610007
>
> Probably.
>

How about supporting packed virtqueue in guest driver?

Thanks,
Yongji
Michael S. Tsirkin Jan. 29, 2019, 2:15 p.m. UTC | #4
On Tue, Jan 29, 2019 at 02:15:35PM +0800, Yongji Xie wrote:
> On Tue, 29 Jan 2019 at 12:26, Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Tue, Jan 29, 2019 at 12:11:55PM +0800, Stefan Hajnoczi wrote:
> > > On Tue, Jan 22, 2019 at 04:31:47PM +0800, elohimes@gmail.com wrote:
> > > > +typedef struct DescState {
> > > > +    uint8_t inuse;
> > > > +    uint8_t version;
> > > > +    uint16_t used_idx;
> > > > +    uint16_t avail_idx;
> > > > +    uint16_t reserved;
> > > > +} DescState;
> > > > +
> > > > +typedef struct QueueRegion {
> > > > +    uint8_t valid;
> >
> > what's this?
> >
> 
> We can use this to check whether this buffer is reset by qemu.


I'd use version here. Document that it's 1 currently.
Or do we want feature flags so we can support downgrades too?



> > > > +    uint16_t desc_num;
> >
> > there's padding before this field. Pls make it explicit.
> >
> 
> Will do it.
> 
> > > > +    DescState desc[0];
> > > > +} QueueRegion;
> > > > +
> > > > +The struct DescState is used to describe one head-descriptor's state. The
> > > > +fields have following meanings:
> > > > +
> > > > +    inuse: Indicate whether the descriptor is inuse or not.
> >
> > inuse by what?
> >
> 
> Maybe inflight is better?
> 
> > > > +
> > > > +    version: Indicate whether we have an atomic update to used ring and
> > > > +    inflight buffer when slave crash at that point. This field should be
> > > > +    increased by one before and after this two updates. An odd version
> > > > +    indicates an in-progress update.
> >
> > I'm not sure I understand what does the above say. Also does this
> > require two atomics? Seems pretty expensive. And why is it called
> > version?
> >
> > > > +
> > > > +    used_idx: Store old index of used ring before we update used ring and
> > > > +    inflight buffer so that slave can know whether an odd version inflight
> > > > +    head-descriptor in inflight buffer is processed or not.
> >
> > Here too.
> >
> 
> Sorry, the above description may be not clear. This two fields are
> used to indicate whether we have an in-progress update to used ring
> and inflight buffer. If slave crash before the update to used_ring and
> after the update to inflight buffer, the version should be odd and
> used_idx should be equal to used_ring.idx. Then we need to roll back
> the update to inflight buffer. As for the name of the version filed,
> actually I didn't find a good one, so I just copy it from struct
> kvm_steal_time...
> 
> > > > +
> > > > +    avail_idx: Used to preserve the descriptor's order in avail ring so that
> > > > +    slave can resubmit descriptors in order.
> >
> > Why would that be necessary?
> >
> 
> Maybe some devices will be able to use it to preserve order after
> reconnecting in future?

If buffers are used in order then old entries in the ring
are not overwritten so inflight tracking is not
necessary. This is exactly the case with vhost user net today.

> > >
> > > Will a completely new "packed vring" inflight shm layout be necessary to
> > > support the packed vring layout in VIRTIO 1.1?
> > >
> > > https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html#x1-610007
> >
> > Probably.
> >
> 
> How about supporting packed virtqueue in guest driver?
> 
> Thanks,
> Yongji

Depends on the guest right? Linux has it:


commit f959a128fe83090981add69aadc87a4e496e9369
Author: Tiwei Bie <tiwei.bie@intel.com>
Date:   Wed Nov 21 18:03:30 2018 +0800

    virtio_ring: advertize packed ring layout
Yongji Xie Jan. 30, 2019, 2:07 a.m. UTC | #5
On Tue, 29 Jan 2019 at 22:15, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Tue, Jan 29, 2019 at 02:15:35PM +0800, Yongji Xie wrote:
> > On Tue, 29 Jan 2019 at 12:26, Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Tue, Jan 29, 2019 at 12:11:55PM +0800, Stefan Hajnoczi wrote:
> > > > On Tue, Jan 22, 2019 at 04:31:47PM +0800, elohimes@gmail.com wrote:
> > > > > +typedef struct DescState {
> > > > > +    uint8_t inuse;
> > > > > +    uint8_t version;
> > > > > +    uint16_t used_idx;
> > > > > +    uint16_t avail_idx;
> > > > > +    uint16_t reserved;
> > > > > +} DescState;
> > > > > +
> > > > > +typedef struct QueueRegion {
> > > > > +    uint8_t valid;
> > >
> > > what's this?
> > >
> >
> > We can use this to check whether this buffer is reset by qemu.
>
>
> I'd use version here. Document that it's 1 currently.

If we put version into the shared buffer, QEMU will reset it when vm
reset. Then if backend restart at the same time, the version of this
buffer will be lost. So I still let QEMU maintain it and send it back
through message's payload.

> Or do we want feature flags so we can support downgrades too?
>

We faced the same problem that the feature flags will be lost if QEMU
do not maintain it. So maybe we should put this into message's
payload?

>
> > > > > +    uint16_t desc_num;
> > >
> > > there's padding before this field. Pls make it explicit.
> > >
> >
> > Will do it.
> >
> > > > > +    DescState desc[0];
> > > > > +} QueueRegion;
> > > > > +
> > > > > +The struct DescState is used to describe one head-descriptor's state. The
> > > > > +fields have following meanings:
> > > > > +
> > > > > +    inuse: Indicate whether the descriptor is inuse or not.
> > >
> > > inuse by what?
> > >
> >
> > Maybe inflight is better?
> >
> > > > > +
> > > > > +    version: Indicate whether we have an atomic update to used ring and
> > > > > +    inflight buffer when slave crash at that point. This field should be
> > > > > +    increased by one before and after this two updates. An odd version
> > > > > +    indicates an in-progress update.
> > >
> > > I'm not sure I understand what does the above say. Also does this
> > > require two atomics? Seems pretty expensive. And why is it called
> > > version?
> > >
> > > > > +
> > > > > +    used_idx: Store old index of used ring before we update used ring and
> > > > > +    inflight buffer so that slave can know whether an odd version inflight
> > > > > +    head-descriptor in inflight buffer is processed or not.
> > >
> > > Here too.
> > >
> >
> > Sorry, the above description may be not clear. This two fields are
> > used to indicate whether we have an in-progress update to used ring
> > and inflight buffer. If slave crash before the update to used_ring and
> > after the update to inflight buffer, the version should be odd and
> > used_idx should be equal to used_ring.idx. Then we need to roll back
> > the update to inflight buffer. As for the name of the version filed,
> > actually I didn't find a good one, so I just copy it from struct
> > kvm_steal_time...
> >
> > > > > +
> > > > > +    avail_idx: Used to preserve the descriptor's order in avail ring so that
> > > > > +    slave can resubmit descriptors in order.
> > >
> > > Why would that be necessary?
> > >
> >
> > Maybe some devices will be able to use it to preserve order after
> > reconnecting in future?
>
> If buffers are used in order then old entries in the ring
> are not overwritten so inflight tracking is not
> necessary. This is exactly the case with vhost user net today.
>

OK, looks reasonable to me.

> > > >
> > > > Will a completely new "packed vring" inflight shm layout be necessary to
> > > > support the packed vring layout in VIRTIO 1.1?
> > > >
> > > > https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html#x1-610007
> > >
> > > Probably.
> > >
> >
> > How about supporting packed virtqueue in guest driver?
> >
> > Thanks,
> > Yongji
>
> Depends on the guest right? Linux has it:
>

Sorry, actually I mean I prefer to support inflight tracking for
packed virtqueue in guest driver. This feature is only used by legacy
virtio 1.0 or virtio 0.9 device. What do you think about it?

Thanks,
Yongji
Michael S. Tsirkin Jan. 30, 2019, 2:30 a.m. UTC | #6
On Wed, Jan 30, 2019 at 10:07:28AM +0800, Yongji Xie wrote:
> On Tue, 29 Jan 2019 at 22:15, Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Tue, Jan 29, 2019 at 02:15:35PM +0800, Yongji Xie wrote:
> > > On Tue, 29 Jan 2019 at 12:26, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > >
> > > > On Tue, Jan 29, 2019 at 12:11:55PM +0800, Stefan Hajnoczi wrote:
> > > > > On Tue, Jan 22, 2019 at 04:31:47PM +0800, elohimes@gmail.com wrote:
> > > > > > +typedef struct DescState {
> > > > > > +    uint8_t inuse;
> > > > > > +    uint8_t version;
> > > > > > +    uint16_t used_idx;
> > > > > > +    uint16_t avail_idx;
> > > > > > +    uint16_t reserved;
> > > > > > +} DescState;
> > > > > > +
> > > > > > +typedef struct QueueRegion {
> > > > > > +    uint8_t valid;
> > > >
> > > > what's this?
> > > >
> > >
> > > We can use this to check whether this buffer is reset by qemu.
> >
> >
> > I'd use version here. Document that it's 1 currently.
> 
> If we put version into the shared buffer, QEMU will reset it when vm
> reset. Then if backend restart at the same time, the version of this
> buffer will be lost. So I still let QEMU maintain it and send it back
> through message's payload.

I don't get it. If it's reset there's no contents.
If there's no contents then who cares what the version is?


> > Or do we want feature flags so we can support downgrades too?
> >
> 
> We faced the same problem that the feature flags will be lost if QEMU
> do not maintain it. So maybe we should put this into message's
> payload?

I don't understand why we care. We only maintain inflight so we
don't need to reset the device. if device is reset we don't
need to maintain them at all.

> >
> > > > > > +    uint16_t desc_num;
> > > >
> > > > there's padding before this field. Pls make it explicit.
> > > >
> > >
> > > Will do it.
> > >
> > > > > > +    DescState desc[0];
> > > > > > +} QueueRegion;
> > > > > > +
> > > > > > +The struct DescState is used to describe one head-descriptor's state. The
> > > > > > +fields have following meanings:
> > > > > > +
> > > > > > +    inuse: Indicate whether the descriptor is inuse or not.
> > > >
> > > > inuse by what?
> > > >
> > >
> > > Maybe inflight is better?
> > >
> > > > > > +
> > > > > > +    version: Indicate whether we have an atomic update to used ring and
> > > > > > +    inflight buffer when slave crash at that point. This field should be
> > > > > > +    increased by one before and after this two updates. An odd version
> > > > > > +    indicates an in-progress update.
> > > >
> > > > I'm not sure I understand what does the above say. Also does this
> > > > require two atomics? Seems pretty expensive. And why is it called
> > > > version?
> > > >
> > > > > > +
> > > > > > +    used_idx: Store old index of used ring before we update used ring and
> > > > > > +    inflight buffer so that slave can know whether an odd version inflight
> > > > > > +    head-descriptor in inflight buffer is processed or not.
> > > >
> > > > Here too.
> > > >
> > >
> > > Sorry, the above description may be not clear. This two fields are
> > > used to indicate whether we have an in-progress update to used ring
> > > and inflight buffer. If slave crash before the update to used_ring and
> > > after the update to inflight buffer, the version should be odd and
> > > used_idx should be equal to used_ring.idx. Then we need to roll back
> > > the update to inflight buffer. As for the name of the version filed,
> > > actually I didn't find a good one, so I just copy it from struct
> > > kvm_steal_time...
> > >
> > > > > > +
> > > > > > +    avail_idx: Used to preserve the descriptor's order in avail ring so that
> > > > > > +    slave can resubmit descriptors in order.
> > > >
> > > > Why would that be necessary?
> > > >
> > >
> > > Maybe some devices will be able to use it to preserve order after
> > > reconnecting in future?
> >
> > If buffers are used in order then old entries in the ring
> > are not overwritten so inflight tracking is not
> > necessary. This is exactly the case with vhost user net today.
> >
> 
> OK, looks reasonable to me.
> 
> > > > >
> > > > > Will a completely new "packed vring" inflight shm layout be necessary to
> > > > > support the packed vring layout in VIRTIO 1.1?
> > > > >
> > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html#x1-610007
> > > >
> > > > Probably.
> > > >
> > >
> > > How about supporting packed virtqueue in guest driver?
> > >
> > > Thanks,
> > > Yongji
> >
> > Depends on the guest right? Linux has it:
> >
> 
> Sorry, actually I mean I prefer to support inflight tracking for
> packed virtqueue in guest driver. This feature is only used by legacy
> virtio 1.0 or virtio 0.9 device. What do you think about it?
> 
> Thanks,
> Yongji

I don't see what does one have to do with the other.  Either we do or we
don't want to do it without downtime and retries. If we don't mind
resets then let's do it by making guest changes.
Yongji Xie Jan. 30, 2019, 3:49 a.m. UTC | #7
On Wed, 30 Jan 2019 at 10:30, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Wed, Jan 30, 2019 at 10:07:28AM +0800, Yongji Xie wrote:
> > On Tue, 29 Jan 2019 at 22:15, Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Tue, Jan 29, 2019 at 02:15:35PM +0800, Yongji Xie wrote:
> > > > On Tue, 29 Jan 2019 at 12:26, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > >
> > > > > On Tue, Jan 29, 2019 at 12:11:55PM +0800, Stefan Hajnoczi wrote:
> > > > > > On Tue, Jan 22, 2019 at 04:31:47PM +0800, elohimes@gmail.com wrote:
> > > > > > > +typedef struct DescState {
> > > > > > > +    uint8_t inuse;
> > > > > > > +    uint8_t version;
> > > > > > > +    uint16_t used_idx;
> > > > > > > +    uint16_t avail_idx;
> > > > > > > +    uint16_t reserved;
> > > > > > > +} DescState;
> > > > > > > +
> > > > > > > +typedef struct QueueRegion {
> > > > > > > +    uint8_t valid;
> > > > >
> > > > > what's this?
> > > > >
> > > >
> > > > We can use this to check whether this buffer is reset by qemu.
> > >
> > >
> > > I'd use version here. Document that it's 1 currently.
> >
> > If we put version into the shared buffer, QEMU will reset it when vm
> > reset. Then if backend restart at the same time, the version of this
> > buffer will be lost. So I still let QEMU maintain it and send it back
> > through message's payload.
>
> I don't get it. If it's reset there's no contents.
> If there's no contents then who cares what the version is?
>

What I thought before is that we should not update the buffer when vm
reset. But now seems like it's unnecessary. I will update this patch
as you said. Thank you!

> > > Or do we want feature flags so we can support downgrades too?
> > >
> >
> > We faced the same problem that the feature flags will be lost if QEMU
> > do not maintain it. So maybe we should put this into message's
> > payload?
>
> I don't understand why we care. We only maintain inflight so we
> don't need to reset the device. if device is reset we don't
> need to maintain them at all.
>
> > >
> > > > > > > +    uint16_t desc_num;
> > > > >
> > > > > there's padding before this field. Pls make it explicit.
> > > > >
> > > >
> > > > Will do it.
> > > >
> > > > > > > +    DescState desc[0];
> > > > > > > +} QueueRegion;
> > > > > > > +
> > > > > > > +The struct DescState is used to describe one head-descriptor's state. The
> > > > > > > +fields have following meanings:
> > > > > > > +
> > > > > > > +    inuse: Indicate whether the descriptor is inuse or not.
> > > > >
> > > > > inuse by what?
> > > > >
> > > >
> > > > Maybe inflight is better?
> > > >
> > > > > > > +
> > > > > > > +    version: Indicate whether we have an atomic update to used ring and
> > > > > > > +    inflight buffer when slave crash at that point. This field should be
> > > > > > > +    increased by one before and after this two updates. An odd version
> > > > > > > +    indicates an in-progress update.
> > > > >
> > > > > I'm not sure I understand what does the above say. Also does this
> > > > > require two atomics? Seems pretty expensive. And why is it called
> > > > > version?
> > > > >
> > > > > > > +
> > > > > > > +    used_idx: Store old index of used ring before we update used ring and
> > > > > > > +    inflight buffer so that slave can know whether an odd version inflight
> > > > > > > +    head-descriptor in inflight buffer is processed or not.
> > > > >
> > > > > Here too.
> > > > >
> > > >
> > > > Sorry, the above description may be not clear. This two fields are
> > > > used to indicate whether we have an in-progress update to used ring
> > > > and inflight buffer. If slave crash before the update to used_ring and
> > > > after the update to inflight buffer, the version should be odd and
> > > > used_idx should be equal to used_ring.idx. Then we need to roll back
> > > > the update to inflight buffer. As for the name of the version filed,
> > > > actually I didn't find a good one, so I just copy it from struct
> > > > kvm_steal_time...
> > > >
> > > > > > > +
> > > > > > > +    avail_idx: Used to preserve the descriptor's order in avail ring so that
> > > > > > > +    slave can resubmit descriptors in order.
> > > > >
> > > > > Why would that be necessary?
> > > > >
> > > >
> > > > Maybe some devices will be able to use it to preserve order after
> > > > reconnecting in future?
> > >
> > > If buffers are used in order then old entries in the ring
> > > are not overwritten so inflight tracking is not
> > > necessary. This is exactly the case with vhost user net today.
> > >
> >
> > OK, looks reasonable to me.
> >
> > > > > >
> > > > > > Will a completely new "packed vring" inflight shm layout be necessary to
> > > > > > support the packed vring layout in VIRTIO 1.1?
> > > > > >
> > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html#x1-610007
> > > > >
> > > > > Probably.
> > > > >
> > > >
> > > > How about supporting packed virtqueue in guest driver?
> > > >
> > > > Thanks,
> > > > Yongji
> > >
> > > Depends on the guest right? Linux has it:
> > >
> >
> > Sorry, actually I mean I prefer to support inflight tracking for
> > packed virtqueue in guest driver. This feature is only used by legacy
> > virtio 1.0 or virtio 0.9 device. What do you think about it?
> >
> > Thanks,
> > Yongji
>
> I don't see what does one have to do with the other.  Either we do or we
> don't want to do it without downtime and retries. If we don't mind
> resets then let's do it by making guest changes.
>

OK, I see. But seems like now we don't support packed virtqueue in
qemu. Is it better to do that in a splited patchset? Add packed
virtqueue implenment in contrib/libvhost-user.c, then document the
inflight I/O tracking behavior.

Thanks,
Yongji
Michael S. Tsirkin Jan. 30, 2019, 4:07 a.m. UTC | #8
On Wed, Jan 30, 2019 at 11:49:56AM +0800, Yongji Xie wrote:
> On Wed, 30 Jan 2019 at 10:30, Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Wed, Jan 30, 2019 at 10:07:28AM +0800, Yongji Xie wrote:
> > > On Tue, 29 Jan 2019 at 22:15, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > >
> > > > On Tue, Jan 29, 2019 at 02:15:35PM +0800, Yongji Xie wrote:
> > > > > On Tue, 29 Jan 2019 at 12:26, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > >
> > > > > > On Tue, Jan 29, 2019 at 12:11:55PM +0800, Stefan Hajnoczi wrote:
> > > > > > > On Tue, Jan 22, 2019 at 04:31:47PM +0800, elohimes@gmail.com wrote:
> > > > > > > > +typedef struct DescState {
> > > > > > > > +    uint8_t inuse;
> > > > > > > > +    uint8_t version;
> > > > > > > > +    uint16_t used_idx;
> > > > > > > > +    uint16_t avail_idx;
> > > > > > > > +    uint16_t reserved;
> > > > > > > > +} DescState;
> > > > > > > > +
> > > > > > > > +typedef struct QueueRegion {
> > > > > > > > +    uint8_t valid;
> > > > > >
> > > > > > what's this?
> > > > > >
> > > > >
> > > > > We can use this to check whether this buffer is reset by qemu.
> > > >
> > > >
> > > > I'd use version here. Document that it's 1 currently.
> > >
> > > If we put version into the shared buffer, QEMU will reset it when vm
> > > reset. Then if backend restart at the same time, the version of this
> > > buffer will be lost. So I still let QEMU maintain it and send it back
> > > through message's payload.
> >
> > I don't get it. If it's reset there's no contents.
> > If there's no contents then who cares what the version is?
> >
> 
> What I thought before is that we should not update the buffer when vm
> reset. But now seems like it's unnecessary. I will update this patch
> as you said. Thank you!
> 
> > > > Or do we want feature flags so we can support downgrades too?
> > > >
> > >
> > > We faced the same problem that the feature flags will be lost if QEMU
> > > do not maintain it. So maybe we should put this into message's
> > > payload?
> >
> > I don't understand why we care. We only maintain inflight so we
> > don't need to reset the device. if device is reset we don't
> > need to maintain them at all.
> >
> > > >
> > > > > > > > +    uint16_t desc_num;
> > > > > >
> > > > > > there's padding before this field. Pls make it explicit.
> > > > > >
> > > > >
> > > > > Will do it.
> > > > >
> > > > > > > > +    DescState desc[0];
> > > > > > > > +} QueueRegion;
> > > > > > > > +
> > > > > > > > +The struct DescState is used to describe one head-descriptor's state. The
> > > > > > > > +fields have following meanings:
> > > > > > > > +
> > > > > > > > +    inuse: Indicate whether the descriptor is inuse or not.
> > > > > >
> > > > > > inuse by what?
> > > > > >
> > > > >
> > > > > Maybe inflight is better?
> > > > >
> > > > > > > > +
> > > > > > > > +    version: Indicate whether we have an atomic update to used ring and
> > > > > > > > +    inflight buffer when slave crash at that point. This field should be
> > > > > > > > +    increased by one before and after this two updates. An odd version
> > > > > > > > +    indicates an in-progress update.
> > > > > >
> > > > > > I'm not sure I understand what does the above say. Also does this
> > > > > > require two atomics? Seems pretty expensive. And why is it called
> > > > > > version?
> > > > > >
> > > > > > > > +
> > > > > > > > +    used_idx: Store old index of used ring before we update used ring and
> > > > > > > > +    inflight buffer so that slave can know whether an odd version inflight
> > > > > > > > +    head-descriptor in inflight buffer is processed or not.
> > > > > >
> > > > > > Here too.
> > > > > >
> > > > >
> > > > > Sorry, the above description may be not clear. This two fields are
> > > > > used to indicate whether we have an in-progress update to used ring
> > > > > and inflight buffer. If slave crash before the update to used_ring and
> > > > > after the update to inflight buffer, the version should be odd and
> > > > > used_idx should be equal to used_ring.idx. Then we need to roll back
> > > > > the update to inflight buffer. As for the name of the version filed,
> > > > > actually I didn't find a good one, so I just copy it from struct
> > > > > kvm_steal_time...
> > > > >
> > > > > > > > +
> > > > > > > > +    avail_idx: Used to preserve the descriptor's order in avail ring so that
> > > > > > > > +    slave can resubmit descriptors in order.
> > > > > >
> > > > > > Why would that be necessary?
> > > > > >
> > > > >
> > > > > Maybe some devices will be able to use it to preserve order after
> > > > > reconnecting in future?
> > > >
> > > > If buffers are used in order then old entries in the ring
> > > > are not overwritten so inflight tracking is not
> > > > necessary. This is exactly the case with vhost user net today.
> > > >
> > >
> > > OK, looks reasonable to me.
> > >
> > > > > > >
> > > > > > > Will a completely new "packed vring" inflight shm layout be necessary to
> > > > > > > support the packed vring layout in VIRTIO 1.1?
> > > > > > >
> > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html#x1-610007
> > > > > >
> > > > > > Probably.
> > > > > >
> > > > >
> > > > > How about supporting packed virtqueue in guest driver?
> > > > >
> > > > > Thanks,
> > > > > Yongji
> > > >
> > > > Depends on the guest right? Linux has it:
> > > >
> > >
> > > Sorry, actually I mean I prefer to support inflight tracking for
> > > packed virtqueue in guest driver. This feature is only used by legacy
> > > virtio 1.0 or virtio 0.9 device. What do you think about it?
> > >
> > > Thanks,
> > > Yongji
> >
> > I don't see what does one have to do with the other.  Either we do or we
> > don't want to do it without downtime and retries. If we don't mind
> > resets then let's do it by making guest changes.
> >
> 
> OK, I see. But seems like now we don't support packed virtqueue in
> qemu. Is it better to do that in a splited patchset? Add packed
> virtqueue implenment in contrib/libvhost-user.c, then document the
> inflight I/O tracking behavior.
> 
> Thanks,
> Yongji

it's just documentation so it does not have to depend
on actual code.
Yongji Xie Jan. 30, 2019, 4:11 a.m. UTC | #9
On Wed, 30 Jan 2019 at 12:07, Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Wed, Jan 30, 2019 at 11:49:56AM +0800, Yongji Xie wrote:
> > On Wed, 30 Jan 2019 at 10:30, Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Wed, Jan 30, 2019 at 10:07:28AM +0800, Yongji Xie wrote:
> > > > On Tue, 29 Jan 2019 at 22:15, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > >
> > > > > On Tue, Jan 29, 2019 at 02:15:35PM +0800, Yongji Xie wrote:
> > > > > > On Tue, 29 Jan 2019 at 12:26, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > > >
> > > > > > > On Tue, Jan 29, 2019 at 12:11:55PM +0800, Stefan Hajnoczi wrote:
> > > > > > > > On Tue, Jan 22, 2019 at 04:31:47PM +0800, elohimes@gmail.com wrote:
> > > > > > > > > +typedef struct DescState {
> > > > > > > > > +    uint8_t inuse;
> > > > > > > > > +    uint8_t version;
> > > > > > > > > +    uint16_t used_idx;
> > > > > > > > > +    uint16_t avail_idx;
> > > > > > > > > +    uint16_t reserved;
> > > > > > > > > +} DescState;
> > > > > > > > > +
> > > > > > > > > +typedef struct QueueRegion {
> > > > > > > > > +    uint8_t valid;
> > > > > > >
> > > > > > > what's this?
> > > > > > >
> > > > > >
> > > > > > We can use this to check whether this buffer is reset by qemu.
> > > > >
> > > > >
> > > > > I'd use version here. Document that it's 1 currently.
> > > >
> > > > If we put version into the shared buffer, QEMU will reset it when vm
> > > > reset. Then if backend restart at the same time, the version of this
> > > > buffer will be lost. So I still let QEMU maintain it and send it back
> > > > through message's payload.
> > >
> > > I don't get it. If it's reset there's no contents.
> > > If there's no contents then who cares what the version is?
> > >
> >
> > What I thought before is that we should not update the buffer when vm
> > reset. But now seems like it's unnecessary. I will update this patch
> > as you said. Thank you!
> >
> > > > > Or do we want feature flags so we can support downgrades too?
> > > > >
> > > >
> > > > We faced the same problem that the feature flags will be lost if QEMU
> > > > do not maintain it. So maybe we should put this into message's
> > > > payload?
> > >
> > > I don't understand why we care. We only maintain inflight so we
> > > don't need to reset the device. if device is reset we don't
> > > need to maintain them at all.
> > >
> > > > >
> > > > > > > > > +    uint16_t desc_num;
> > > > > > >
> > > > > > > there's padding before this field. Pls make it explicit.
> > > > > > >
> > > > > >
> > > > > > Will do it.
> > > > > >
> > > > > > > > > +    DescState desc[0];
> > > > > > > > > +} QueueRegion;
> > > > > > > > > +
> > > > > > > > > +The struct DescState is used to describe one head-descriptor's state. The
> > > > > > > > > +fields have following meanings:
> > > > > > > > > +
> > > > > > > > > +    inuse: Indicate whether the descriptor is inuse or not.
> > > > > > >
> > > > > > > inuse by what?
> > > > > > >
> > > > > >
> > > > > > Maybe inflight is better?
> > > > > >
> > > > > > > > > +
> > > > > > > > > +    version: Indicate whether we have an atomic update to used ring and
> > > > > > > > > +    inflight buffer when slave crash at that point. This field should be
> > > > > > > > > +    increased by one before and after this two updates. An odd version
> > > > > > > > > +    indicates an in-progress update.
> > > > > > >
> > > > > > > I'm not sure I understand what does the above say. Also does this
> > > > > > > require two atomics? Seems pretty expensive. And why is it called
> > > > > > > version?
> > > > > > >
> > > > > > > > > +
> > > > > > > > > +    used_idx: Store old index of used ring before we update used ring and
> > > > > > > > > +    inflight buffer so that slave can know whether an odd version inflight
> > > > > > > > > +    head-descriptor in inflight buffer is processed or not.
> > > > > > >
> > > > > > > Here too.
> > > > > > >
> > > > > >
> > > > > > Sorry, the above description may be not clear. This two fields are
> > > > > > used to indicate whether we have an in-progress update to used ring
> > > > > > and inflight buffer. If slave crash before the update to used_ring and
> > > > > > after the update to inflight buffer, the version should be odd and
> > > > > > used_idx should be equal to used_ring.idx. Then we need to roll back
> > > > > > the update to inflight buffer. As for the name of the version filed,
> > > > > > actually I didn't find a good one, so I just copy it from struct
> > > > > > kvm_steal_time...
> > > > > >
> > > > > > > > > +
> > > > > > > > > +    avail_idx: Used to preserve the descriptor's order in avail ring so that
> > > > > > > > > +    slave can resubmit descriptors in order.
> > > > > > >
> > > > > > > Why would that be necessary?
> > > > > > >
> > > > > >
> > > > > > Maybe some devices will be able to use it to preserve order after
> > > > > > reconnecting in future?
> > > > >
> > > > > If buffers are used in order then old entries in the ring
> > > > > are not overwritten so inflight tracking is not
> > > > > necessary. This is exactly the case with vhost user net today.
> > > > >
> > > >
> > > > OK, looks reasonable to me.
> > > >
> > > > > > > >
> > > > > > > > Will a completely new "packed vring" inflight shm layout be necessary to
> > > > > > > > support the packed vring layout in VIRTIO 1.1?
> > > > > > > >
> > > > > > > > https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html#x1-610007
> > > > > > >
> > > > > > > Probably.
> > > > > > >
> > > > > >
> > > > > > How about supporting packed virtqueue in guest driver?
> > > > > >
> > > > > > Thanks,
> > > > > > Yongji
> > > > >
> > > > > Depends on the guest right? Linux has it:
> > > > >
> > > >
> > > > Sorry, actually I mean I prefer to support inflight tracking for
> > > > packed virtqueue in guest driver. This feature is only used by legacy
> > > > virtio 1.0 or virtio 0.9 device. What do you think about it?
> > > >
> > > > Thanks,
> > > > Yongji
> > >
> > > I don't see what does one have to do with the other.  Either we do or we
> > > don't want to do it without downtime and retries. If we don't mind
> > > resets then let's do it by making guest changes.
> > >
> >
> > OK, I see. But seems like now we don't support packed virtqueue in
> > qemu. Is it better to do that in a splited patchset? Add packed
> > virtqueue implenment in contrib/libvhost-user.c, then document the
> > inflight I/O tracking behavior.
> >
> > Thanks,
> > Yongji
>
> it's just documentation so it does not have to depend
> on actual code.

I get it.

Thanks,
Yongji
diff mbox series

Patch

diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
index c2194711d9..fa470a8fe2 100644
--- a/docs/interop/vhost-user.txt
+++ b/docs/interop/vhost-user.txt
@@ -142,6 +142,18 @@  Depending on the request type, payload can be:
    Offset: a 64-bit offset of this area from the start of the
        supplied file descriptor
 
+ * Inflight description
+   ---------------------------------------------------------------
+   | mmap size | mmap offset | version | num queues | queue size |
+   ---------------------------------------------------------------
+
+   mmap size: a 64-bit size of area to track inflight I/O
+   mmap offset: a 64-bit offset of this area from the start
+                of the supplied file descriptor
+   version: a 16-bit version of this area
+   num queues: a 16-bit number of virtqueues
+   queue size: a 16-bit size of virtqueues
+
 In QEMU the vhost-user message is implemented with the following struct:
 
 typedef struct VhostUserMsg {
@@ -157,6 +169,7 @@  typedef struct VhostUserMsg {
         struct vhost_iotlb_msg iotlb;
         VhostUserConfig config;
         VhostUserVringArea area;
+        VhostUserInflight inflight;
     };
 } QEMU_PACKED VhostUserMsg;
 
@@ -175,6 +188,7 @@  the ones that do:
  * VHOST_USER_GET_PROTOCOL_FEATURES
  * VHOST_USER_GET_VRING_BASE
  * VHOST_USER_SET_LOG_BASE (if VHOST_USER_PROTOCOL_F_LOG_SHMFD)
+ * VHOST_USER_GET_INFLIGHT_FD (if VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)
 
 [ Also see the section on REPLY_ACK protocol extension. ]
 
@@ -188,6 +202,7 @@  in the ancillary data:
  * VHOST_USER_SET_VRING_CALL
  * VHOST_USER_SET_VRING_ERR
  * VHOST_USER_SET_SLAVE_REQ_FD
+ * VHOST_USER_SET_INFLIGHT_FD (if VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)
 
 If Master is unable to send the full message or receives a wrong reply it will
 close the connection. An optional reconnection mechanism can be implemented.
@@ -382,6 +397,71 @@  If VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD protocol feature is negotiated,
 slave can send file descriptors (at most 8 descriptors in each message)
 to master via ancillary data using this fd communication channel.
 
+Inflight I/O tracking
+---------------------
+
+To support reconnecting after restart or crash, slave may need to get
+and resubmit inflight I/O. If virtqueue is processed in order, we can
+easily achieve that by getting the inflight head-descriptor from avail ring.
+However, it can't work when we process descriptors out-of-order because
+some entries which store inflight head-descriptor in avail ring might be
+overrided by new entries. To solve this problem, slave need to allocate a
+buffer to track inflight head-descriptor and share it with master for
+persistent. VHOST_USER_GET_INFLIGHT_FD and VHOST_USER_SET_INFLIGHT_FD are
+used to transfer this buffer between master and slave. And the format of
+this buffer is described below:
+
+-------------------------------------------------------
+| queue0 region | queue1 region | ... | queueN region |
+-------------------------------------------------------
+
+N is the number of available virtqueues. Slave could get it from num queues
+field of VhostUserInflight.
+
+Queue region can be implemented as:
+
+typedef struct DescState {
+    uint8_t inuse;
+    uint8_t version;
+    uint16_t used_idx;
+    uint16_t avail_idx;
+    uint16_t reserved;
+} DescState;
+
+typedef struct QueueRegion {
+    uint8_t valid;
+    uint16_t desc_num;
+    DescState desc[0];
+} QueueRegion;
+
+The struct DescState is used to describe one head-descriptor's state. The
+fields have following meanings:
+
+    inuse: Indicate whether the descriptor is inuse or not.
+
+    version: Indicate whether we have an atomic update to used ring and
+    inflight buffer when slave crash at that point. This field should be
+    increased by one before and after this two updates. An odd version
+    indicates an in-progress update.
+
+    used_idx: Store old index of used ring before we update used ring and
+    inflight buffer so that slave can know whether an odd version inflight
+    head-descriptor in inflight buffer is processed or not.
+
+    avail_idx: Used to preserve the descriptor's order in avail ring so that
+    slave can resubmit descriptors in order.
+
+    reserved: Reserved for future use.
+
+An array of struct DescState is maintained to track all head-descriptors' state
+in queue region. The desc_num is the number of those head-descriptors which is
+also equal to the queue size of the virtqueue. Slave could get it from queue size
+field of VhostUserInflight. The valid field is used to specify whether the region
+is valid or not. Slave can use it to detect whether vm is reset because qemu will
+clear the buffer when the reset happens.
+
+Note that slave should align each region to cache line size.
+
 Protocol features
 -----------------
 
@@ -397,6 +477,7 @@  Protocol features
 #define VHOST_USER_PROTOCOL_F_CONFIG         9
 #define VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD  10
 #define VHOST_USER_PROTOCOL_F_HOST_NOTIFIER  11
+#define VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD 12
 
 Master message types
 --------------------
@@ -761,6 +842,26 @@  Master message types
       was previously sent.
       The value returned is an error indication; 0 is success.
 
+ * VHOST_USER_GET_INFLIGHT_FD
+      Id: 31
+      Equivalent ioctl: N/A
+      Master payload: inflight description
+
+      When VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD protocol feature has been
+      successfully negotiated, this message is submitted by master to get
+      a shared buffer from slave. The shared buffer will be used to track
+      inflight I/O by slave. QEMU should clear it when vm reset.
+
+ * VHOST_USER_SET_INFLIGHT_FD
+      Id: 32
+      Equivalent ioctl: N/A
+      Master payload: inflight description
+
+      When VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD protocol feature has been
+      successfully negotiated, this message is submitted by master to send
+      the shared inflight buffer back to slave so that slave could get
+      inflight I/O after a crash or restart.
+
 Slave message types
 -------------------
 
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 564a31d12c..9aa6204c91 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -52,6 +52,7 @@  enum VhostUserProtocolFeature {
     VHOST_USER_PROTOCOL_F_CONFIG = 9,
     VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD = 10,
     VHOST_USER_PROTOCOL_F_HOST_NOTIFIER = 11,
+    VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD = 12,
     VHOST_USER_PROTOCOL_F_MAX
 };
 
@@ -89,6 +90,8 @@  typedef enum VhostUserRequest {
     VHOST_USER_POSTCOPY_ADVISE  = 28,
     VHOST_USER_POSTCOPY_LISTEN  = 29,
     VHOST_USER_POSTCOPY_END     = 30,
+    VHOST_USER_GET_INFLIGHT_FD = 31,
+    VHOST_USER_SET_INFLIGHT_FD = 32,
     VHOST_USER_MAX
 } VhostUserRequest;
 
@@ -147,6 +150,14 @@  typedef struct VhostUserVringArea {
     uint64_t offset;
 } VhostUserVringArea;
 
+typedef struct VhostUserInflight {
+    uint64_t mmap_size;
+    uint64_t mmap_offset;
+    uint16_t version;
+    uint16_t num_queues;
+    uint16_t queue_size;
+} VhostUserInflight;
+
 typedef struct {
     VhostUserRequest request;
 
@@ -169,6 +180,7 @@  typedef union {
         VhostUserConfig config;
         VhostUserCryptoSession session;
         VhostUserVringArea area;
+        VhostUserInflight inflight;
 } VhostUserPayload;
 
 typedef struct VhostUserMsg {
@@ -1739,6 +1751,102 @@  static bool vhost_user_mem_section_filter(struct vhost_dev *dev,
     return result;
 }
 
+static int vhost_user_get_inflight_fd(struct vhost_dev *dev,
+                                      uint16_t queue_size,
+                                      struct vhost_inflight *inflight)
+{
+    void *addr;
+    int fd;
+    struct vhost_user *u = dev->opaque;
+    CharBackend *chr = u->user->chr;
+    VhostUserMsg msg = {
+        .hdr.request = VHOST_USER_GET_INFLIGHT_FD,
+        .hdr.flags = VHOST_USER_VERSION,
+        .payload.inflight.num_queues = dev->nvqs,
+        .payload.inflight.queue_size = queue_size,
+        .hdr.size = sizeof(msg.payload.inflight),
+    };
+
+    if (!virtio_has_feature(dev->protocol_features,
+                            VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)) {
+        return 0;
+    }
+
+    if (vhost_user_write(dev, &msg, NULL, 0) < 0) {
+        return -1;
+    }
+
+    if (vhost_user_read(dev, &msg) < 0) {
+        return -1;
+    }
+
+    if (msg.hdr.request != VHOST_USER_GET_INFLIGHT_FD) {
+        error_report("Received unexpected msg type. "
+                     "Expected %d received %d",
+                     VHOST_USER_GET_INFLIGHT_FD, msg.hdr.request);
+        return -1;
+    }
+
+    if (msg.hdr.size != sizeof(msg.payload.inflight)) {
+        error_report("Received bad msg size.");
+        return -1;
+    }
+
+    if (!msg.payload.inflight.mmap_size) {
+        return 0;
+    }
+
+    fd = qemu_chr_fe_get_msgfd(chr);
+    if (fd < 0) {
+        error_report("Failed to get mem fd");
+        return -1;
+    }
+
+    addr = mmap(0, msg.payload.inflight.mmap_size, PROT_READ | PROT_WRITE,
+                MAP_SHARED, fd, msg.payload.inflight.mmap_offset);
+
+    if (addr == MAP_FAILED) {
+        error_report("Failed to mmap mem fd");
+        close(fd);
+        return -1;
+    }
+
+    inflight->addr = addr;
+    inflight->fd = fd;
+    inflight->size = msg.payload.inflight.mmap_size;
+    inflight->offset = msg.payload.inflight.mmap_offset;
+    inflight->version = msg.payload.inflight.version;
+    inflight->queue_size = queue_size;
+
+    return 0;
+}
+
+static int vhost_user_set_inflight_fd(struct vhost_dev *dev,
+                                      struct vhost_inflight *inflight)
+{
+    VhostUserMsg msg = {
+        .hdr.request = VHOST_USER_SET_INFLIGHT_FD,
+        .hdr.flags = VHOST_USER_VERSION,
+        .payload.inflight.mmap_size = inflight->size,
+        .payload.inflight.mmap_offset = inflight->offset,
+        .payload.inflight.version = inflight->version,
+        .payload.inflight.num_queues = dev->nvqs,
+        .payload.inflight.queue_size = inflight->queue_size,
+        .hdr.size = sizeof(msg.payload.inflight),
+    };
+
+    if (!virtio_has_feature(dev->protocol_features,
+                            VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)) {
+        return 0;
+    }
+
+    if (vhost_user_write(dev, &msg, &inflight->fd, 1) < 0) {
+        return -1;
+    }
+
+    return 0;
+}
+
 VhostUserState *vhost_user_init(void)
 {
     VhostUserState *user = g_new0(struct VhostUserState, 1);
@@ -1790,4 +1898,6 @@  const VhostOps user_ops = {
         .vhost_crypto_create_session = vhost_user_crypto_create_session,
         .vhost_crypto_close_session = vhost_user_crypto_close_session,
         .vhost_backend_mem_section_filter = vhost_user_mem_section_filter,
+        .vhost_get_inflight_fd = vhost_user_get_inflight_fd,
+        .vhost_set_inflight_fd = vhost_user_set_inflight_fd,
 };
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 569c4053ea..0a4de6b91b 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1481,6 +1481,111 @@  void vhost_dev_set_config_notifier(struct vhost_dev *hdev,
     hdev->config_ops = ops;
 }
 
+void vhost_dev_reset_inflight(struct vhost_inflight *inflight)
+{
+    if (inflight->addr) {
+        memset(inflight->addr, 0, inflight->size);
+    }
+}
+
+void vhost_dev_free_inflight(struct vhost_inflight *inflight)
+{
+    if (inflight->addr) {
+        qemu_memfd_free(inflight->addr, inflight->size, inflight->fd);
+        inflight->addr = NULL;
+        inflight->fd = -1;
+    }
+}
+
+static int vhost_dev_resize_inflight(struct vhost_inflight *inflight,
+                                     uint64_t new_size)
+{
+    Error *err = NULL;
+    int fd = -1;
+    void *addr = qemu_memfd_alloc("vhost-inflight", new_size,
+                                  F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL,
+                                  &fd, &err);
+
+    if (err) {
+        error_report_err(err);
+        return -1;
+    }
+
+    vhost_dev_free_inflight(inflight);
+    inflight->offset = 0;
+    inflight->addr = addr;
+    inflight->fd = fd;
+    inflight->size = new_size;
+
+    return 0;
+}
+
+void vhost_dev_save_inflight(struct vhost_inflight *inflight, QEMUFile *f)
+{
+    if (inflight->addr) {
+        qemu_put_be64(f, inflight->size);
+        qemu_put_be16(f, inflight->version);
+        qemu_put_be16(f, inflight->queue_size);
+        qemu_put_buffer(f, inflight->addr, inflight->size);
+    } else {
+        qemu_put_be64(f, 0);
+    }
+}
+
+int vhost_dev_load_inflight(struct vhost_inflight *inflight, QEMUFile *f)
+{
+    uint64_t size;
+
+    size = qemu_get_be64(f);
+    if (!size) {
+        return 0;
+    }
+
+    if (inflight->size != size) {
+        if (vhost_dev_resize_inflight(inflight, size)) {
+            return -1;
+        }
+    }
+    inflight->version = qemu_get_be16(f);
+    inflight->queue_size = qemu_get_be16(f);
+
+    qemu_get_buffer(f, inflight->addr, size);
+
+    return 0;
+}
+
+int vhost_dev_set_inflight(struct vhost_dev *dev,
+                           struct vhost_inflight *inflight)
+{
+    int r;
+
+    if (dev->vhost_ops->vhost_set_inflight_fd && inflight->addr) {
+        r = dev->vhost_ops->vhost_set_inflight_fd(dev, inflight);
+        if (r) {
+            VHOST_OPS_DEBUG("vhost_set_inflight_fd failed");
+            return -errno;
+        }
+    }
+
+    return 0;
+}
+
+int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size,
+                           struct vhost_inflight *inflight)
+{
+    int r;
+
+    if (dev->vhost_ops->vhost_get_inflight_fd) {
+        r = dev->vhost_ops->vhost_get_inflight_fd(dev, queue_size, inflight);
+        if (r) {
+            VHOST_OPS_DEBUG("vhost_get_inflight_fd failed");
+            return -errno;
+        }
+    }
+
+    return 0;
+}
+
 /* Host notifiers must be enabled at this point. */
 int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
 {
diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
index 81283ec50f..d6632a18e6 100644
--- a/include/hw/virtio/vhost-backend.h
+++ b/include/hw/virtio/vhost-backend.h
@@ -25,6 +25,7 @@  typedef enum VhostSetConfigType {
     VHOST_SET_CONFIG_TYPE_MIGRATION = 1,
 } VhostSetConfigType;
 
+struct vhost_inflight;
 struct vhost_dev;
 struct vhost_log;
 struct vhost_memory;
@@ -104,6 +105,13 @@  typedef int (*vhost_crypto_close_session_op)(struct vhost_dev *dev,
 typedef bool (*vhost_backend_mem_section_filter_op)(struct vhost_dev *dev,
                                                 MemoryRegionSection *section);
 
+typedef int (*vhost_get_inflight_fd_op)(struct vhost_dev *dev,
+                                        uint16_t queue_size,
+                                        struct vhost_inflight *inflight);
+
+typedef int (*vhost_set_inflight_fd_op)(struct vhost_dev *dev,
+                                        struct vhost_inflight *inflight);
+
 typedef struct VhostOps {
     VhostBackendType backend_type;
     vhost_backend_init vhost_backend_init;
@@ -142,6 +150,8 @@  typedef struct VhostOps {
     vhost_crypto_create_session_op vhost_crypto_create_session;
     vhost_crypto_close_session_op vhost_crypto_close_session;
     vhost_backend_mem_section_filter_op vhost_backend_mem_section_filter;
+    vhost_get_inflight_fd_op vhost_get_inflight_fd;
+    vhost_set_inflight_fd_op vhost_set_inflight_fd;
 } VhostOps;
 
 extern const VhostOps user_ops;
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index a7f449fa87..f44f93f54a 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -7,6 +7,16 @@ 
 #include "exec/memory.h"
 
 /* Generic structures common for any vhost based device. */
+
+struct vhost_inflight {
+    int fd;
+    void *addr;
+    uint64_t size;
+    uint64_t offset;
+    uint16_t version;
+    uint16_t queue_size;
+};
+
 struct vhost_virtqueue {
     int kick;
     int call;
@@ -120,4 +130,13 @@  int vhost_dev_set_config(struct vhost_dev *dev, const uint8_t *data,
  */
 void vhost_dev_set_config_notifier(struct vhost_dev *dev,
                                    const VhostDevConfigOps *ops);
+
+void vhost_dev_reset_inflight(struct vhost_inflight *inflight);
+void vhost_dev_free_inflight(struct vhost_inflight *inflight);
+void vhost_dev_save_inflight(struct vhost_inflight *inflight, QEMUFile *f);
+int vhost_dev_load_inflight(struct vhost_inflight *inflight, QEMUFile *f);
+int vhost_dev_set_inflight(struct vhost_dev *dev,
+                           struct vhost_inflight *inflight);
+int vhost_dev_get_inflight(struct vhost_dev *dev, uint16_t queue_size,
+                           struct vhost_inflight *inflight);
 #endif