[0/3] Introduce vhost-vdpa block device

Message ID	20210408101252.552-1-xieyongji@bytedance.com (mailing list archive)
Headers	show Return-Path: <SRS0=9Ez2=JF=nongnu.org=qemu-devel-bounces+qemu-devel=archiver.kernel.org@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org E33E461166 From: Xie Yongji <xieyongji@bytedance.com> To: mst@redhat.com, stefanha@redhat.com, sgarzare@redhat.com, raphael.norwitz@nutanix.com, lingshan.zhu@intel.com, changpeng.liu@intel.com, jasowang@redhat.com, kwolf@redhat.com, mreitz@redhat.com Subject: [PATCH 0/3] Introduce vhost-vdpa block device Date: Thu, 8 Apr 2021 18:12:49 +0800 Message-Id: <20210408101252.552-1-xieyongji@bytedance.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Received-SPF: pass client-ip=2607:f8b0:4864:20::1034; envelope-from=xieyongji@bytedance.com; helo=mail-pj1-x1034.google.com X-Spam_score_int: -18 X-Spam_score: -1.9 X-Spam_bar: - X-Spam_report: (-1.9 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action Precedence: list Cc: qemu-devel@nongnu.org Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" <qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org>
Series	Introduce vhost-vdpa block device \| expand [0/3] Introduce vhost-vdpa block device [1/3] vhost-vdpa: Remove redundant declaration of address_space_memory [2/3] vhost-blk: Add vhost-blk-common abstraction [3/3] vhost-vdpa-blk: Introduce vhost-vdpa-blk host device

Yongji Xie April 8, 2021, 10:12 a.m. UTC

Since we already have some ways to emulate vDPA block device
in kernel[1] or userspace[2]. This series tries to introduce a
new vhost-vdpa block device for that. To use it, we can add
something like:

qemu-system-x86_64 \
    -device vhost-vdpa-blk-pci,vdpa-dev=/dev/vhost-vdpa-0

You can also get the code and kernel changes from below repositories:

https://github.com/bytedance/qemu/tree/vhost-vdpa-blk
https://github.com/bytedance/linux/tree/vhost-vdpa-blk

Thank you!

[1] https://lore.kernel.org/kvm/20210315163450.254396-1-sgarzare@redhat.com/
[2] https://lore.kernel.org/kvm/20210331080519.172-1-xieyongji@bytedance.com/

Xie Yongji (3):
  Remove redundant declaration of address_space_memory
  vhost-blk: add vhost-blk-common abstraction
  vhost-vdpa-blk: Introduce vhost-vdpa-blk host device

 hw/block/Kconfig                     |   5 +
 hw/block/meson.build                 |   3 +-
 hw/block/vhost-blk-common.c          | 291 +++++++++++++++++++++++++
 hw/block/vhost-user-blk.c            | 306 +++++----------------------
 hw/block/vhost-vdpa-blk.c            | 227 ++++++++++++++++++++
 hw/virtio/meson.build                |   1 +
 hw/virtio/vhost-user-blk-pci.c       |   7 +-
 hw/virtio/vhost-vdpa-blk-pci.c       | 101 +++++++++
 hw/virtio/vhost-vdpa.c               |   1 +
 include/hw/virtio/vhost-blk-common.h |  50 +++++
 include/hw/virtio/vhost-user-blk.h   |  20 +-
 include/hw/virtio/vhost-vdpa-blk.h   |  30 +++
 include/hw/virtio/vhost-vdpa.h       |   1 -
 13 files changed, 762 insertions(+), 281 deletions(-)
 create mode 100644 hw/block/vhost-blk-common.c
 create mode 100644 hw/block/vhost-vdpa-blk.c
 create mode 100644 hw/virtio/vhost-vdpa-blk-pci.c
 create mode 100644 include/hw/virtio/vhost-blk-common.h
 create mode 100644 include/hw/virtio/vhost-vdpa-blk.h

Stefan Hajnoczi April 26, 2021, 3:33 p.m. UTC | #1

On Thu, Apr 08, 2021 at 06:12:49PM +0800, Xie Yongji wrote:
> Since we already have some ways to emulate vDPA block device
> in kernel[1] or userspace[2]. This series tries to introduce a
> new vhost-vdpa block device for that. To use it, we can add
> something like:
> 
> qemu-system-x86_64 \
>     -device vhost-vdpa-blk-pci,vdpa-dev=/dev/vhost-vdpa-0

This device is similar to vhost-user-blk. QEMU does not see it as a
block device so storage migration, I/O throttling, image formats, etc
are not supported. Stefano Garzarella and I discussed how vdpa-blk
devices could integrate more deeply with QEMU's block layer. The QEMU
block layer could be enabled only when necessary and otherwise bypassed
for maximum performance.

This alternative approach is similar to how vhost-net is implemented in
QEMU. A BlockDriver would handle the vdpa-blk device and the regular
virtio-blk-pci device would still be present. The virtqueues could be
delegated to the vdpa-blk device in order to bypass the QEMU block
layer.

I wanted to mention this since it's likely that this kind of vdpa-blk
device implementation will be posted in the future and you might be
interested. It makes live migration with non-shared storage possible,
for example.

An issue with vhost-user-blk is that the ownership of qdev properties
and VIRTIO Configuration Space fields was not clearly defined. Some
things are handled by QEMU's vhost-user-blk code, some things are
handled by the vhost-user device backend, and some things are negotiated
between both entities. This patch series follows the existing
vhost-user-blk approach, which I think will show serious issues as the
device is more widely used and whenever virtio-blk or the implementation
is extended with new features. It is very hard to provide backwards
compatibility with the current approach where the ownership of qdev
properties and VIRTIO Configuration Space fields is ad-hoc and largely
undefined.

Since vDPA has VIRTIO Configuration Space APIs, I suggest that the
vhost-vDPA device controls the entire configuration space. QEMU should
simply forward accesses between the guest and vhost-vdpa.

Regarding qdev properties, it's a little trickier because QEMU needs to
do the emulated VIRTIO device setup (allocating virtqueues, setting
their sizes, etc). Can QEMU query this information from the vDPA device?
If not, which qdev properties are read-only and must match the
configuration of the vDPA device and which are read-write and can
control the vDPA device?

Stefan

Yongji Xie April 27, 2021, 10:24 a.m. UTC | #2

On Mon, Apr 26, 2021 at 11:34 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Apr 08, 2021 at 06:12:49PM +0800, Xie Yongji wrote:
> > Since we already have some ways to emulate vDPA block device
> > in kernel[1] or userspace[2]. This series tries to introduce a
> > new vhost-vdpa block device for that. To use it, we can add
> > something like:
> >
> > qemu-system-x86_64 \
> >     -device vhost-vdpa-blk-pci,vdpa-dev=/dev/vhost-vdpa-0
>
> This device is similar to vhost-user-blk. QEMU does not see it as a
> block device so storage migration, I/O throttling, image formats, etc
> are not supported. Stefano Garzarella and I discussed how vdpa-blk
> devices could integrate more deeply with QEMU's block layer. The QEMU
> block layer could be enabled only when necessary and otherwise bypassed
> for maximum performance.
>

Do you mean we can make use of the shadow virtqueue for the slow path
(I/O will go through the block layer) and add a fast path (like what
we do now) to bypass the block layer?

> This alternative approach is similar to how vhost-net is implemented in
> QEMU. A BlockDriver would handle the vdpa-blk device and the regular
> virtio-blk-pci device would still be present. The virtqueues could be
> delegated to the vdpa-blk device in order to bypass the QEMU block
> layer.
>
> I wanted to mention this since it's likely that this kind of vdpa-blk
> device implementation will be posted in the future and you might be
> interested. It makes live migration with non-shared storage possible,
> for example.
>

That would be nice, I'm looking forward to it!

So do you think whether it's necessary to continue this approach?
Looks like we don't need a vhost-vdpa-blk device any more in the new
approach.

> An issue with vhost-user-blk is that the ownership of qdev properties
> and VIRTIO Configuration Space fields was not clearly defined. Some
> things are handled by QEMU's vhost-user-blk code, some things are
> handled by the vhost-user device backend, and some things are negotiated
> between both entities. This patch series follows the existing
> vhost-user-blk approach, which I think will show serious issues as the
> device is more widely used and whenever virtio-blk or the implementation
> is extended with new features. It is very hard to provide backwards
> compatibility with the current approach where the ownership of qdev
> properties and VIRTIO Configuration Space fields is ad-hoc and largely
> undefined.
>
> Since vDPA has VIRTIO Configuration Space APIs, I suggest that the
> vhost-vDPA device controls the entire configuration space. QEMU should
> simply forward accesses between the guest and vhost-vdpa.
>

Does this already achieved by vhost_ops->vhost_get_config？ And I want
to know how to handle the endianness issue if qemu just simply does
forwarding and doesn't care about the content of config space.

> Regarding qdev properties, it's a little trickier because QEMU needs to
> do the emulated VIRTIO device setup (allocating virtqueues, setting
> their sizes, etc). Can QEMU query this information from the vDPA device?
> If not, which qdev properties are read-only and must match the
> configuration of the vDPA device and which are read-write and can
> control the vDPA device?
>

Yes, that's an issue. We have to make sure the number of virtqueues
and their size set by qemu is not greater than hardware limitation.
Now I think we can query the max queue size, but looks like we don't
have an interface to query the max number of virtqueues.

Thanks,
Yongji

Stefan Hajnoczi April 27, 2021, 2:28 p.m. UTC | #3

On Tue, Apr 27, 2021 at 06:24:55PM +0800, Yongji Xie wrote:
> On Mon, Apr 26, 2021 at 11:34 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Thu, Apr 08, 2021 at 06:12:49PM +0800, Xie Yongji wrote:
> > > Since we already have some ways to emulate vDPA block device
> > > in kernel[1] or userspace[2]. This series tries to introduce a
> > > new vhost-vdpa block device for that. To use it, we can add
> > > something like:
> > >
> > > qemu-system-x86_64 \
> > >     -device vhost-vdpa-blk-pci,vdpa-dev=/dev/vhost-vdpa-0
> >
> > This device is similar to vhost-user-blk. QEMU does not see it as a
> > block device so storage migration, I/O throttling, image formats, etc
> > are not supported. Stefano Garzarella and I discussed how vdpa-blk
> > devices could integrate more deeply with QEMU's block layer. The QEMU
> > block layer could be enabled only when necessary and otherwise bypassed
> > for maximum performance.
> >
> 
> Do you mean we can make use of the shadow virtqueue for the slow path
> (I/O will go through the block layer) and add a fast path (like what
> we do now) to bypass the block layer?

Yes.

> > This alternative approach is similar to how vhost-net is implemented in
> > QEMU. A BlockDriver would handle the vdpa-blk device and the regular
> > virtio-blk-pci device would still be present. The virtqueues could be
> > delegated to the vdpa-blk device in order to bypass the QEMU block
> > layer.
> >
> > I wanted to mention this since it's likely that this kind of vdpa-blk
> > device implementation will be posted in the future and you might be
> > interested. It makes live migration with non-shared storage possible,
> > for example.
> >
> 
> That would be nice, I'm looking forward to it!
> 
> So do you think whether it's necessary to continue this approach?
> Looks like we don't need a vhost-vdpa-blk device any more in the new
> approach.

There is no code for the vdpa-blk BlockDriver yet and the implementation
will be more complex than this patch series (more risk the feature could
be delayed).

If you need vdpa-blk as soon as possible then this patch series may be a
pragmatic solution. As long as it doesn't limit the vdpa-blk interface
in a way that prevents the BlockDriver implementation then I think QEMU
could support both.

In the long term the vdpa-blk BlockDriver could replace -device
vdpa-blk-pci since the performance should be identical and it supports
more use cases.

It's up to you. You're welcome to wait for the BlockDriver, work
together with Stefano on the BlockDriver, or continue with your patch
series.

> > An issue with vhost-user-blk is that the ownership of qdev properties
> > and VIRTIO Configuration Space fields was not clearly defined. Some
> > things are handled by QEMU's vhost-user-blk code, some things are
> > handled by the vhost-user device backend, and some things are negotiated
> > between both entities. This patch series follows the existing
> > vhost-user-blk approach, which I think will show serious issues as the
> > device is more widely used and whenever virtio-blk or the implementation
> > is extended with new features. It is very hard to provide backwards
> > compatibility with the current approach where the ownership of qdev
> > properties and VIRTIO Configuration Space fields is ad-hoc and largely
> > undefined.
> >
> > Since vDPA has VIRTIO Configuration Space APIs, I suggest that the
> > vhost-vDPA device controls the entire configuration space. QEMU should
> > simply forward accesses between the guest and vhost-vdpa.
> >
> 
> Does this already achieved by vhost_ops->vhost_get_config？ And I want
> to know how to handle the endianness issue if qemu just simply does
> forwarding and doesn't care about the content of config space.

QEMU just copies bytes betwen the driver and the device via
vdpa_config_ops->get/set_config(). I don't think it needs to worry about
endianness in VIRTIO Configuration Space access code or did I miss
something?

My understanding is that vDPA currently supports same-endian Legacy and
little-endian Modern devices. Cross-endian Legacy devices are currently
not supported because there is no API to set endianness.

If such an API is added in the future, then QEMU can use it to tell the
vDPA device about guest endianness. QEMU still won't need to modify
Configuration Space data itself.

> > Regarding qdev properties, it's a little trickier because QEMU needs to
> > do the emulated VIRTIO device setup (allocating virtqueues, setting
> > their sizes, etc). Can QEMU query this information from the vDPA device?
> > If not, which qdev properties are read-only and must match the
> > configuration of the vDPA device and which are read-write and can
> > control the vDPA device?
> >
> 
> Yes, that's an issue. We have to make sure the number of virtqueues
> and their size set by qemu is not greater than hardware limitation.
> Now I think we can query the max queue size, but looks like we don't
> have an interface to query the max number of virtqueues.

Okay, this is something that Jason and Stefano can discuss more. Ideally
the QEMU command-line does not need to duplicate information about the
vDPA device. The vdpa management tool
(https://github.com/shemminger/iproute2/tree/main/vdpa) and VDUSE
virtio-blk device will probably be where the main device configuration
happens.

As a starting point, can you describe how your VDUSE virtio-blk device
is configured? Does it have a command-line/config file/RPC API to set
the number of virtqueues, block size, etc?

Stefan

Yongji Xie April 28, 2021, 11:27 a.m. UTC | #4

On Tue, Apr 27, 2021 at 10:28 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Tue, Apr 27, 2021 at 06:24:55PM +0800, Yongji Xie wrote:
> > On Mon, Apr 26, 2021 at 11:34 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > >
> > > On Thu, Apr 08, 2021 at 06:12:49PM +0800, Xie Yongji wrote:
> > > > Since we already have some ways to emulate vDPA block device
> > > > in kernel[1] or userspace[2]. This series tries to introduce a
> > > > new vhost-vdpa block device for that. To use it, we can add
> > > > something like:
> > > >
> > > > qemu-system-x86_64 \
> > > >     -device vhost-vdpa-blk-pci,vdpa-dev=/dev/vhost-vdpa-0
> > >
> > > This device is similar to vhost-user-blk. QEMU does not see it as a
> > > block device so storage migration, I/O throttling, image formats, etc
> > > are not supported. Stefano Garzarella and I discussed how vdpa-blk
> > > devices could integrate more deeply with QEMU's block layer. The QEMU
> > > block layer could be enabled only when necessary and otherwise bypassed
> > > for maximum performance.
> > >
> >
> > Do you mean we can make use of the shadow virtqueue for the slow path
> > (I/O will go through the block layer) and add a fast path (like what
> > we do now) to bypass the block layer?
>
> Yes.
>
> > > This alternative approach is similar to how vhost-net is implemented in
> > > QEMU. A BlockDriver would handle the vdpa-blk device and the regular
> > > virtio-blk-pci device would still be present. The virtqueues could be
> > > delegated to the vdpa-blk device in order to bypass the QEMU block
> > > layer.
> > >
> > > I wanted to mention this since it's likely that this kind of vdpa-blk
> > > device implementation will be posted in the future and you might be
> > > interested. It makes live migration with non-shared storage possible,
> > > for example.
> > >
> >
> > That would be nice, I'm looking forward to it!
> >
> > So do you think whether it's necessary to continue this approach?
> > Looks like we don't need a vhost-vdpa-blk device any more in the new
> > approach.
>
> There is no code for the vdpa-blk BlockDriver yet and the implementation
> will be more complex than this patch series (more risk the feature could
> be delayed).
>
> If you need vdpa-blk as soon as possible then this patch series may be a
> pragmatic solution. As long as it doesn't limit the vdpa-blk interface
> in a way that prevents the BlockDriver implementation then I think QEMU
> could support both.
>
> In the long term the vdpa-blk BlockDriver could replace -device
> vdpa-blk-pci since the performance should be identical and it supports
> more use cases.
>
> It's up to you. You're welcome to wait for the BlockDriver, work
> together with Stefano on the BlockDriver, or continue with your patch
> series.
>

I like the new idea! Let me wait for it.

> > > An issue with vhost-user-blk is that the ownership of qdev properties
> > > and VIRTIO Configuration Space fields was not clearly defined. Some
> > > things are handled by QEMU's vhost-user-blk code, some things are
> > > handled by the vhost-user device backend, and some things are negotiated
> > > between both entities. This patch series follows the existing
> > > vhost-user-blk approach, which I think will show serious issues as the
> > > device is more widely used and whenever virtio-blk or the implementation
> > > is extended with new features. It is very hard to provide backwards
> > > compatibility with the current approach where the ownership of qdev
> > > properties and VIRTIO Configuration Space fields is ad-hoc and largely
> > > undefined.
> > >
> > > Since vDPA has VIRTIO Configuration Space APIs, I suggest that the
> > > vhost-vDPA device controls the entire configuration space. QEMU should
> > > simply forward accesses between the guest and vhost-vdpa.
> > >
> >
> > Does this already achieved by vhost_ops->vhost_get_config？ And I want
> > to know how to handle the endianness issue if qemu just simply does
> > forwarding and doesn't care about the content of config space.
>
> QEMU just copies bytes betwen the driver and the device via
> vdpa_config_ops->get/set_config(). I don't think it needs to worry about
> endianness in VIRTIO Configuration Space access code or did I miss
> something?
>
> My understanding is that vDPA currently supports same-endian Legacy and
> little-endian Modern devices. Cross-endian Legacy devices are currently
> not supported because there is no API to set endianness.
>

OK, I get you.

> If such an API is added in the future, then QEMU can use it to tell the
> vDPA device about guest endianness. QEMU still won't need to modify
> Configuration Space data itself.
>
> > > Regarding qdev properties, it's a little trickier because QEMU needs to
> > > do the emulated VIRTIO device setup (allocating virtqueues, setting
> > > their sizes, etc). Can QEMU query this information from the vDPA device?
> > > If not, which qdev properties are read-only and must match the
> > > configuration of the vDPA device and which are read-write and can
> > > control the vDPA device?
> > >
> >
> > Yes, that's an issue. We have to make sure the number of virtqueues
> > and their size set by qemu is not greater than hardware limitation.
> > Now I think we can query the max queue size, but looks like we don't
> > have an interface to query the max number of virtqueues.
>
> Okay, this is something that Jason and Stefano can discuss more. Ideally
> the QEMU command-line does not need to duplicate information about the
> vDPA device. The vdpa management tool
> (https://github.com/shemminger/iproute2/tree/main/vdpa) and VDUSE
> virtio-blk device will probably be where the main device configuration
> happens.
>
> As a starting point, can you describe how your VDUSE virtio-blk device
> is configured? Does it have a command-line/config file/RPC API to set
> the number of virtqueues, block size, etc?
>

Yes, we have a command-line to set those properties, something like:

qemu-storage-daemon \
      --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server,nowait \
      --monitor chardev=charmonitor \
      --blockdev
driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0
\
      --export type=vduse-blk,id=test,node-name=disk0,writable=on,name=vduse-null,num-queues=16,queue-size=128

Thanks,
Yongji

Stefano Garzarella April 28, 2021, 12:21 p.m. UTC | #5

On Wed, Apr 28, 2021 at 07:27:03PM +0800, Yongji Xie wrote:
>On Tue, Apr 27, 2021 at 10:28 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>
>> On Tue, Apr 27, 2021 at 06:24:55PM +0800, Yongji Xie wrote:
>> > On Mon, Apr 26, 2021 at 11:34 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>> > >
>> > > On Thu, Apr 08, 2021 at 06:12:49PM +0800, Xie Yongji wrote:
>> > > > Since we already have some ways to emulate vDPA block device
>> > > > in kernel[1] or userspace[2]. This series tries to introduce a
>> > > > new vhost-vdpa block device for that. To use it, we can add
>> > > > something like:
>> > > >
>> > > > qemu-system-x86_64 \
>> > > >     -device vhost-vdpa-blk-pci,vdpa-dev=/dev/vhost-vdpa-0
>> > >
>> > > This device is similar to vhost-user-blk. QEMU does not see it as a
>> > > block device so storage migration, I/O throttling, image formats, etc
>> > > are not supported. Stefano Garzarella and I discussed how vdpa-blk
>> > > devices could integrate more deeply with QEMU's block layer. The QEMU
>> > > block layer could be enabled only when necessary and otherwise bypassed
>> > > for maximum performance.
>> > >
>> >
>> > Do you mean we can make use of the shadow virtqueue for the slow path
>> > (I/O will go through the block layer) and add a fast path (like what
>> > we do now) to bypass the block layer?
>>
>> Yes.
>>
>> > > This alternative approach is similar to how vhost-net is implemented in
>> > > QEMU. A BlockDriver would handle the vdpa-blk device and the regular
>> > > virtio-blk-pci device would still be present. The virtqueues could be
>> > > delegated to the vdpa-blk device in order to bypass the QEMU block
>> > > layer.
>> > >
>> > > I wanted to mention this since it's likely that this kind of vdpa-blk
>> > > device implementation will be posted in the future and you might be
>> > > interested. It makes live migration with non-shared storage possible,
>> > > for example.
>> > >
>> >
>> > That would be nice, I'm looking forward to it!
>> >
>> > So do you think whether it's necessary to continue this approach?
>> > Looks like we don't need a vhost-vdpa-blk device any more in the new
>> > approach.
>>
>> There is no code for the vdpa-blk BlockDriver yet and the implementation
>> will be more complex than this patch series (more risk the feature could
>> be delayed).
>>
>> If you need vdpa-blk as soon as possible then this patch series may be a
>> pragmatic solution. As long as it doesn't limit the vdpa-blk interface
>> in a way that prevents the BlockDriver implementation then I think QEMU
>> could support both.
>>
>> In the long term the vdpa-blk BlockDriver could replace -device
>> vdpa-blk-pci since the performance should be identical and it supports
>> more use cases.
>>
>> It's up to you. You're welcome to wait for the BlockDriver, work
>> together with Stefano on the BlockDriver, or continue with your patch
>> series.
>>
>
>I like the new idea! Let me wait for it.
>

Thanks for this great discussion!

I'll keep you updated and I'll CC you on the RFC when I have something 
ready.

Cheers,
Stefano

[0/3] Introduce vhost-vdpa block device

Message

Comments