Message ID | 20190502114801.23116-1-mlevitsk@redhat.com
---|---
Series | RFC: NVME MDEV
I simply don't get the point of this series. MDEV is an interface for exposing parts of a device to a userspace program / VM. But what this series appears to do is to expose a purely software-defined nvme controller to userspace. Which in principle is a good idea, but we have a much better framework for that, which is called vhost.
On Fri, 2019-05-03 at 14:18 +0200, Christoph Hellwig wrote:
> I simply don't get the point of this series.
>
> MDEV is an interface for exposing parts of a device to a userspace
> program / VM. But what this series appears to do is to expose a
> purely software-defined nvme controller to userspace. Which in
> principle is a good idea, but we have a much better framework for that,
> which is called vhost.

Let me explain the reasons for choosing the IO interfaces as I did:

1. Frontend interface (the interface that faces the guest/userspace/etc):

VFIO/mdev is just a way to expose a (partially) software-defined PCIe device to a guest.

Vhost, on the other hand, is an interface that is hardcoded and optimized for virtio. It can be extended to be PCI-generic, but why do so if we already have VFIO?

So the biggest advantage of using VFIO _currently_ is that I don't add any new API/ABI to the kernel, and userspace (qemu) doesn't need to learn a new API either.

It is also worth noting that VFIO supports nesting out of the box, so I don't need to worry about it (vhost has to deal with that on the protocol level using its IOTLB facility).

On top of that, it is expected that newer hardware will support PASID-based device subdivision, which will allow us to _directly_ pass through the submission queues of the device and _force_ us to use the NVMe protocol for the frontend.

2. Backend interface (the connection to the real nvme device):

Currently the backend interface _doesn't have_ to allocate a dedicated queue and bypass the block layer. It can use the block layer's submit_bio/blk_poll path, as I demonstrate in the last patch in the series. It's 2x slower, though.

However, similar to (1), when the driver supports devices with hardware-based passthrough, it will have to dedicate a bunch of queues to the guest, configure them with the appropriate PASID, and then let the guest use these queues directly.

Best regards,
	Maxim Levitsky
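To make the backend point concrete, here is a minimal sketch of what a submit_bio/blk_poll based read path could look like. It assumes roughly the 5.x block layer API (submit_bio returning a blk_qc_t cookie, blk_poll taking a spin flag, REQ_HIPRI for polled completions); the function names such as mdev_backend_read_page() are illustrative and not taken from the series.

```c
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/completion.h>

struct polled_ctx {
	struct completion done;
	blk_status_t status;
};

static void mdev_backend_end_io(struct bio *bio)
{
	struct polled_ctx *ctx = bio->bi_private;

	ctx->status = bio->bi_status;
	complete(&ctx->done);
	bio_put(bio);
}

/* Illustrative: read one page through the block layer, polling for completion. */
static int mdev_backend_read_page(struct block_device *bdev,
				  struct page *page, sector_t sector)
{
	struct polled_ctx ctx;
	struct bio *bio;
	blk_qc_t cookie;

	init_completion(&ctx.done);

	bio = bio_alloc(GFP_KERNEL, 1);
	if (!bio)
		return -ENOMEM;

	bio_set_dev(bio, bdev);
	bio->bi_iter.bi_sector = sector;
	bio->bi_opf = REQ_OP_READ | REQ_HIPRI;	/* ask for a polled completion */
	bio->bi_end_io = mdev_backend_end_io;
	bio->bi_private = &ctx;
	bio_add_page(bio, page, PAGE_SIZE, 0);

	cookie = submit_bio(bio);

	/* Poll the hardware queue instead of waiting for an interrupt. */
	while (!try_wait_for_completion(&ctx.done))
		blk_poll(bdev_get_queue(bdev), cookie, true);

	return blk_status_to_errno(ctx.status);
}
```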
On Mon, May 06, 2019 at 12:04:06PM +0300, Maxim Levitsky wrote:
> 1. Frontend interface (the interface that faces the guest/userspace/etc):
>
> VFIO/mdev is just a way to expose a (partially) software-defined PCIe
> device to a guest.
>
> Vhost, on the other hand, is an interface that is hardcoded and optimized
> for virtio. It can be extended to be PCI-generic, but why do so if we
> already have VFIO?

I wouldn't say vhost is virtio specific. At least Hanne's vhost-nvme doesn't get impacted by that a whole lot.

> 2. Backend interface (the connection to the real nvme device):
>
> Currently the backend interface _doesn't have_ to allocate a dedicated
> queue and bypass the block layer. It can use the block layer's
> submit_bio/blk_poll path, as I demonstrate in the last patch in the
> series. It's 2x slower, though.
>
> However, similar to (1), when the driver supports devices with
> hardware-based passthrough, it will have to dedicate a bunch of queues to
> the guest, configure them with the appropriate PASID, and then let the
> guest use these queues directly.

We will not let you abuse the nvme queues for anything else. We had that discussion with the mellanox offload, and it is not only unsafe but also adds way too much crap to the core nvme code for corner cases.

Or to put it another way: unless your paravirt interface requires zero specific changes to the core nvme code, it is not acceptable at all.
On Mon, May 06, 2019 at 05:57:52AM -0700, Christoph Hellwig wrote:
> > However, similar to (1), when the driver supports devices with
> > hardware-based passthrough, it will have to dedicate a bunch of queues
> > to the guest, configure them with the appropriate PASID, and then let
> > the guest use these queues directly.
>
> We will not let you abuse the nvme queues for anything else. We had
> that discussion with the mellanox offload, and it is not only unsafe but
> also adds way too much crap to the core nvme code for corner cases.
>
> Or to put it another way: unless your paravirt interface requires
> zero specific changes to the core nvme code, it is not acceptable at all.

I agree we shouldn't specialize generic queues for this, but I think it is worth revisiting driver support for assignable hardware resources iff the specification defines it. Until then, you can always steer processes to different queues by assigning them to different CPUs.
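As a small userspace illustration of that last point: assuming the nvme driver's usual one-submission-queue-per-CPU mapping, pinning a process to a CPU effectively steers its I/O onto that CPU's queue. A minimal sketch (not from the series):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/*
 * Pin the calling process to a single CPU so its block I/O is submitted
 * on that CPU's nvme queue (assuming the usual per-CPU queue mapping).
 */
static int pin_to_cpu(int cpu)
{
	cpu_set_t mask;

	CPU_ZERO(&mask);
	CPU_SET(cpu, &mask);

	/* pid 0 means the calling process */
	if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
		perror("sched_setaffinity");
		return -1;
	}
	return 0;
}
```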
On 06/05/19 07:57, Christoph Hellwig wrote:
>
> Or to put it another way: unless your paravirt interface requires
> zero specific changes to the core nvme code, it is not acceptable at all.

I'm not sure it's possible to attain that goal; however, I agree that putting the control plane in the kernel is probably not a good idea, so the vhost model is better than mdev for this use case.

In addition, unless it is possible for the driver to pass the queue directly to the guests, there probably isn't much advantage in putting the driver in the kernel at all.

Maxim, do you have numbers for:

1) QEMU with aio
2) QEMU with the VFIO-based userspace nvme driver
3) nvme-mdev?

Paolo
On Mon, May 06, 2019 at 12:04:06PM +0300, Maxim Levitsky wrote:
> On top of that, it is expected that newer hardware will support
> PASID-based device subdivision, which will allow us to _directly_ pass
> through the submission queues of the device and _force_ us to use the
> NVMe protocol for the frontend.

I don't understand the PASID argument. The data path will be 100% passthrough and this driver won't be necessary.

In the meantime there is already SPDK for users who want polling.

This driver's main feature is that the host can still access the device at the same time as VMs, but I'm not sure that's useful in performance-critical use cases, and for non-performance use cases this driver isn't necessary.

Stefan
On Thu, May 09, 2019 at 02:12:55AM -0700, Stefan Hajnoczi wrote:
> On Mon, May 06, 2019 at 12:04:06PM +0300, Maxim Levitsky wrote:
> > On top of that, it is expected that newer hardware will support
> > PASID-based device subdivision, which will allow us to _directly_ pass
> > through the submission queues of the device and _force_ us to use the
> > NVMe protocol for the frontend.
>
> I don't understand the PASID argument. The data path will be 100%
> passthrough and this driver won't be necessary.

We still need a non-passthrough component to handle the slow path: the non-doorbell controller registers and the admin queue. That doesn't necessarily need to be a kernel driver, though.
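In other words, even with PASID-assigned I/O queues, something still has to emulate the rest of the register file. A purely hypothetical sketch of that split (all struct and helper names are made up for illustration, none of this is code from the series):

```c
#include <stdint.h>
#include <stdbool.h>

#define NVME_REG_SQ0TDBL 0x1000		/* doorbell registers start here per the NVMe spec */

struct vctrl;				/* emulated controller state (hypothetical) */

/* hypothetical helpers provided by the fast path / emulation layers */
void vctrl_emulate_reg_write(struct vctrl *c, uint32_t off, uint64_t val);
void hw_queue_ring_doorbell(struct vctrl *c, uint32_t db_off, uint32_t val);
bool doorbell_is_assigned_to_guest(struct vctrl *c, uint32_t db_off);

/* Trap handler for guest MMIO writes to the emulated NVMe BAR0. */
void vctrl_mmio_write(struct vctrl *c, uint32_t off, uint64_t val)
{
	if (off >= NVME_REG_SQ0TDBL && doorbell_is_assigned_to_guest(c, off)) {
		/*
		 * Fast path: doorbells of queues assigned to the guest could
		 * be mapped directly into the guest; forwarding them here is
		 * only to make the register split explicit.
		 */
		hw_queue_ring_doorbell(c, off, (uint32_t)val);
		return;
	}

	/* Slow path: CC, CSTS, AQA, ASQ, ACQ, admin doorbells, etc. */
	vctrl_emulate_reg_write(c, off, val);
}
```

The slow-path component holding vctrl_emulate_reg_write() is exactly the piece that, as noted above, does not necessarily have to live in the kernel.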