Message ID | 0-v1-9e6e1739ed95+5fa-vfio_dma_buf_jgg@nvidia.com (mailing list archive)
Series     | Allow MMIO regions to be exported through dma-buf
Am 17.08.22 um 18:11 schrieb Jason Gunthorpe:
> dma-buf has become a way to safely acquire a handle to non-struct page
> memory that can still have lifetime controlled by the exporter. Notably
> RDMA can now import dma-buf FDs and build them into MRs which allows for
> PCI P2P operations. Extend this to allow vfio-pci to export MMIO memory
> from PCI device BARs.
>
> This series supports a use case for SPDK where a NVMe device will be owned
> by SPDK through VFIO but interacting with a RDMA device. The RDMA device
> may directly access the NVMe CMB or directly manipulate the NVMe device's
> doorbell using PCI P2P.
>
> However, as a general mechanism, it can support many other scenarios with
> VFIO. I imagine this dmabuf approach to be usable by iommufd as well for
> generic and safe P2P mappings.

In general looks good to me, but we really need to get away from using
sg_tables for this here.

The only thing I'm not 100% convinced of is dma_buf_try_get(), I've seen
this incorrectly used so many times that I can't count them any more.

Would that be somehow avoidable? Or could you at least explain the use case
a bit better.

Thanks,
Christian.

> This series goes after the "Break up ioctl dispatch functions to one
> function per ioctl" series.
>
> This is on github: https://github.com/jgunthorpe/linux/commits/vfio_dma_buf
>
> Jason Gunthorpe (4):
>   dma-buf: Add dma_buf_try_get()
>   vfio: Add vfio_device_get()
>   vfio_pci: Do not open code pci_try_reset_function()
>   vfio/pci: Allow MMIO regions to be exported through dma-buf
>
>  drivers/vfio/pci/Makefile           |   1 +
>  drivers/vfio/pci/vfio_pci_config.c  |  22 ++-
>  drivers/vfio/pci/vfio_pci_core.c    |  33 +++-
>  drivers/vfio/pci/vfio_pci_dma_buf.c | 265 ++++++++++++++++++++++++++++
>  drivers/vfio/pci/vfio_pci_priv.h    |  24 +++
>  drivers/vfio/vfio_main.c            |   3 +-
>  include/linux/dma-buf.h             |  13 ++
>  include/linux/vfio.h                |   6 +
>  include/linux/vfio_pci_core.h       |   1 +
>  include/uapi/linux/vfio.h           |  18 ++
>  10 files changed, 364 insertions(+), 22 deletions(-)
>  create mode 100644 drivers/vfio/pci/vfio_pci_dma_buf.c
>
> base-commit: 385f0411fcd2780b5273992832cdc8edcd5b8ea9
On Thu, Aug 18, 2022 at 01:07:16PM +0200, Christian König wrote:
> Am 17.08.22 um 18:11 schrieb Jason Gunthorpe:
> > dma-buf has become a way to safely acquire a handle to non-struct page
> > memory that can still have lifetime controlled by the exporter. Notably
> > RDMA can now import dma-buf FDs and build them into MRs which allows for
> > PCI P2P operations. Extend this to allow vfio-pci to export MMIO memory
> > from PCI device BARs.
> >
> > This series supports a use case for SPDK where a NVMe device will be owned
> > by SPDK through VFIO but interacting with a RDMA device. The RDMA device
> > may directly access the NVMe CMB or directly manipulate the NVMe device's
> > doorbell using PCI P2P.
> >
> > However, as a general mechanism, it can support many other scenarios with
> > VFIO. I imagine this dmabuf approach to be usable by iommufd as well for
> > generic and safe P2P mappings.
>
> In general looks good to me, but we really need to get away from using
> sg_tables for this here.
>
> The only thing I'm not 100% convinced of is dma_buf_try_get(), I've seen
> this incorrectly used so many times that I can't count them any more.
>
> Would that be somehow avoidable? Or could you at least explain the use case
> a bit better.

I didn't see a way, maybe you know of one.

VFIO needs to maintain a list of dmabuf FDs that have been created by
the user attached to each vfio_device:

int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
				  struct vfio_device_feature_dma_buf __user *arg,
				  size_t argsz)
{
	down_write(&vdev->memory_lock);
	list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
	up_write(&vdev->memory_lock);

And dmabuf FDs are removed from the list when the user closes the FD:

static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
{
	down_write(&priv->vdev->memory_lock);
	list_del_init(&priv->dmabufs_elm);
	up_write(&priv->vdev->memory_lock);

Which then poses the problem: how do you iterate over only dmabufs
that are still alive to execute move?

This seems necessary as parts of the dma_buf have already been
destroyed by the time the user's release function is called.

Which I solved like this:

	down_write(&vdev->memory_lock);
	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
		if (!dma_buf_try_get(priv->dmabuf))
			continue;

So the scenarios resolve as:
 - Concurrent release is not in progress: dma_buf_try_get() succeeds
   and prevents concurrent release from starting
 - Release has started but not reached its memory_lock:
   dma_buf_try_get() fails
 - Release has started but passed its memory_lock: dmabuf is not on
   the list so dma_buf_try_get() is not called.

Jason
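The race resolution Jason lays out can be modeled outside the kernel. The sketch below is a hypothetical user-space analogue, not kernel API: `obj_try_get`/`obj_put` are illustrative names, and the refcount is modeled with plain C11 atomics rather than the kernel's refcount_t. It shows the core property the scheme relies on: upgrading a weak reference to a strong one succeeds only while release has not yet begun (refcount still above zero).

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of an object with try-get semantics.  A refcount
 * of 0 means release has started and the object must not be revived. */
struct obj {
	atomic_int refcount;
};

/* Analogue of dma_buf_try_get(): an inc-not-zero on the refcount.
 * Returns true only if the object was still alive, in which case the
 * caller now holds a strong reference. */
static bool obj_try_get(struct obj *o)
{
	int old = atomic_load(&o->refcount);

	while (old > 0)
		if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
			return true;	/* weak ref upgraded to strong */
	return false;			/* release in progress: skip it */
}

/* Drop a strong reference; returns true when this was the final put,
 * i.e. the point where release would begin. */
static bool obj_put(struct obj *o)
{
	return atomic_fetch_sub(&o->refcount, 1) == 1;
}
```

A list walk would call `obj_try_get()` on each entry and simply `continue` past the ones that fail, exactly as the `list_for_each_entry_safe` loop above does.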
On Wed, Aug 17, 2022 at 01:11:38PM -0300, Jason Gunthorpe wrote:
> dma-buf has become a way to safely acquire a handle to non-struct page
> memory that can still have lifetime controlled by the exporter. Notably
> RDMA can now import dma-buf FDs and build them into MRs which allows for
> PCI P2P operations. Extend this to allow vfio-pci to export MMIO memory
> from PCI device BARs.
>
> This series supports a use case for SPDK where a NVMe device will be owned
> by SPDK through VFIO but interacting with a RDMA device. The RDMA device
> may directly access the NVMe CMB or directly manipulate the NVMe device's
> doorbell using PCI P2P.
>
> However, as a general mechanism, it can support many other scenarios with
> VFIO. I imagine this dmabuf approach to be usable by iommufd as well for
> generic and safe P2P mappings.
>
> This series goes after the "Break up ioctl dispatch functions to one
> function per ioctl" series.
>
> This is on github: https://github.com/jgunthorpe/linux/commits/vfio_dma_buf
>
> Jason Gunthorpe (4):
>   dma-buf: Add dma_buf_try_get()
>   vfio: Add vfio_device_get()
>   vfio_pci: Do not open code pci_try_reset_function()
>   vfio/pci: Allow MMIO regions to be exported through dma-buf
>
>  drivers/vfio/pci/Makefile           |   1 +
>  drivers/vfio/pci/vfio_pci_config.c  |  22 ++-
>  drivers/vfio/pci/vfio_pci_core.c    |  33 +++-
>  drivers/vfio/pci/vfio_pci_dma_buf.c | 265 ++++++++++++++++++++++++++++

I forgot about this..

Alex, do you want to start doing as Linus discussed and I will rename
this new file to "dma_buf.c"?

Or keep this directory as having the vfio_pci_* prefix for consistency?

Jason
Am 18.08.22 um 14:03 schrieb Jason Gunthorpe:
> On Thu, Aug 18, 2022 at 01:07:16PM +0200, Christian König wrote:
>> Am 17.08.22 um 18:11 schrieb Jason Gunthorpe:
>>> dma-buf has become a way to safely acquire a handle to non-struct page
>>> memory that can still have lifetime controlled by the exporter. Notably
>>> RDMA can now import dma-buf FDs and build them into MRs which allows for
>>> PCI P2P operations. Extend this to allow vfio-pci to export MMIO memory
>>> from PCI device BARs.
>>>
>>> This series supports a use case for SPDK where a NVMe device will be owned
>>> by SPDK through VFIO but interacting with a RDMA device. The RDMA device
>>> may directly access the NVMe CMB or directly manipulate the NVMe device's
>>> doorbell using PCI P2P.
>>>
>>> However, as a general mechanism, it can support many other scenarios with
>>> VFIO. I imagine this dmabuf approach to be usable by iommufd as well for
>>> generic and safe P2P mappings.
>> In general looks good to me, but we really need to get away from using
>> sg_tables for this here.
>>
>> The only thing I'm not 100% convinced of is dma_buf_try_get(), I've seen
>> this incorrectly used so many times that I can't count them any more.
>>
>> Would that be somehow avoidable? Or could you at least explain the use case
>> a bit better.
> I didn't see a way, maybe you know of one

For GEM objects we usually don't use the reference count of the DMA-buf,
but rather that of the GEM object for this. But that's not an ideal
solution either.

>
> VFIO needs to maintain a list of dmabuf FDs that have been created by
> the user attached to each vfio_device:
>
> int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
> 				  struct vfio_device_feature_dma_buf __user *arg,
> 				  size_t argsz)
> {
> 	down_write(&vdev->memory_lock);
> 	list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
> 	up_write(&vdev->memory_lock);
>
> And dmabuf FD's are removed from the list when the user closes the FD:
>
> static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
> {
> 	down_write(&priv->vdev->memory_lock);
> 	list_del_init(&priv->dmabufs_elm);
> 	up_write(&priv->vdev->memory_lock);
>
> Which then poses the problem: How do you iterate over only dma_buf's
> that are still alive to execute move?
>
> This seems necessary as parts of the dma_buf have already been
> destroyed by the time the user's release function is called.
>
> Which I solved like this:
>
> 	down_write(&vdev->memory_lock);
> 	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
> 		if (!dma_buf_try_get(priv->dmabuf))
> 			continue;

What would happen if you don't skip destroyed dma-bufs here? In other
words, why do you maintain that list in the first place?

Regards,
Christian.

> So the scenarios resolve as:
>  - Concurrent release is not in progress: dma_buf_try_get() succeeds
>    and prevents concurrent release from starting
>  - Release has started but not reached its memory_lock:
>    dma_buf_try_get() fails
>  - Release has started but passed its memory_lock: dmabuf is not on
>    the list so dma_buf_try_get() is not called.
>
> Jason
On Thu, Aug 18, 2022 at 02:58:10PM +0200, Christian König wrote:
> > > The only thing I'm not 100% convinced of is dma_buf_try_get(), I've seen
> > > this incorrectly used so many times that I can't count them any more.
> > >
> > > Would that be somehow avoidable? Or could you at least explain the use case
> > > a bit better.
> > I didn't see a way, maybe you know of one
>
> For GEM objects we usually don't use the reference count of the DMA-buf, but
> rather that of the GEM object for this. But that's not an ideal solution
> either.

You can't really ignore the dmabuf refcount. At some point you have to
deal with the dmabuf being asynchronously released by userspace.

> > down_write(&vdev->memory_lock);
> > list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
> > 	if (!dma_buf_try_get(priv->dmabuf))
> > 		continue;
>
> What would happen if you don't skip destroyed dma-bufs here? In other words
> why do you maintain that list in the first place?

The list is to keep track of the dmabufs that were created, it is not
optional.

The only question is what do you do about invoking
dma_buf_move_notify() on a dmabuf that is already undergoing
destruction.

For instance undergoing destruction means the dmabuf core has already
done this:

	mutex_lock(&db_list.lock);
	list_del(&dmabuf->list_node);
	mutex_unlock(&db_list.lock);
	dma_buf_stats_teardown(dmabuf);

So it seems non-ideal to continue to use it.

However, dma_buf_move_notify() specifically has no issue with that level
of destruction since it only does

	list_for_each_entry(attach, &dmabuf->attachments, node)

And attachments must be empty if the file refcount is zero.

So we could delete the try_buf and just rely on move being safe on
partially destroyed dma_buf's as part of the API design.

Jason
Am 18.08.22 um 15:16 schrieb Jason Gunthorpe:
> On Thu, Aug 18, 2022 at 02:58:10PM +0200, Christian König wrote:
>
>>>> The only thing I'm not 100% convinced of is dma_buf_try_get(), I've seen
>>>> this incorrectly used so many times that I can't count them any more.
>>>>
>>>> Would that be somehow avoidable? Or could you at least explain the use case
>>>> a bit better.
>>> I didn't see a way, maybe you know of one
>> For GEM objects we usually don't use the reference count of the DMA-buf, but
>> rather that of the GEM object for this. But that's not an ideal solution
>> either.
> You can't really ignore the dmabuf refcount. At some point you have to
> deal with the dmabuf being asynchronously released by userspace.

Yeah, but in this case the dma-buf is just a reference to the
real/private object which holds the backing store.

When the dma-buf is released you drop the real object reference, and
from your driver internals you try_get only the real object.

The advantage is that only your driver can use the try_get function and
not some importing driver which doesn't know about the internals of the
exporter.

We just had too many cases where developers weren't sure if a pointer is
still valid and by using try_get it just "magically" got working (well I
have to admit it made the crashing less likely....).

>>> down_write(&vdev->memory_lock);
>>> list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
>>> 	if (!dma_buf_try_get(priv->dmabuf))
>>> 		continue;
>> What would happen if you don't skip destroyed dma-bufs here? In other words
>> why do you maintain that list in the first place?
> The list is to keep track of the dmabufs that were created, it is not
> optional.
>
> The only question is what do you do about invoking
> dma_buf_move_notify() on a dmabuf that is already undergoing
> destruction.

Ah, yes. Really good point.

> For instance undergoing destruction means the dmabuf core has already
> done this:
>
> 	mutex_lock(&db_list.lock);
> 	list_del(&dmabuf->list_node);
> 	mutex_unlock(&db_list.lock);
> 	dma_buf_stats_teardown(dmabuf);
>
> So it seems non-ideal to continue to use it.
>
> However, dma_buf_move_notify() specifically has no issue with that level of
> destruction since it only does
>
> 	list_for_each_entry(attach, &dmabuf->attachments, node)
>
> And attachments must be empty if the file refcount is zero.
>
> So we could delete the try_buf and just rely on move being safe on
> partially destroyed dma_buf's as part of the API design.

I think that might be the more defensive approach. A comment on the
dma_buf_move_notify() function should probably be a good idea.

Thanks,
Christian.

> Jason
On Thu, Aug 18, 2022 at 03:37:01PM +0200, Christian König wrote:
> Am 18.08.22 um 15:16 schrieb Jason Gunthorpe:
> > On Thu, Aug 18, 2022 at 02:58:10PM +0200, Christian König wrote:
> >
> > > > > The only thing I'm not 100% convinced of is dma_buf_try_get(), I've seen
> > > > > this incorrectly used so many times that I can't count them any more.
> > > > >
> > > > > Would that be somehow avoidable? Or could you at least explain the use case
> > > > > a bit better.
> > > > I didn't see a way, maybe you know of one
> > > For GEM objects we usually don't use the reference count of the DMA-buf, but
> > > rather that of the GEM object for this. But that's not an ideal solution
> > > either.
> > You can't really ignore the dmabuf refcount. At some point you have to
> > deal with the dmabuf being asynchronously released by userspace.
>
> Yeah, but in this case the dma-buf is just a reference to the real/private
> object which holds the backing store.

The gem approach is backwards to what I did here.

GEM holds a singleton pointer to the dmabuf and holds a reference on
it as long as it has the pointer. This means the dmabuf can not be
freed until the GEM object is freed.

For this I held a "weak reference" on the dmabuf in a list, and we
convert the weak reference to a strong reference in the usual way
using a try_get.

The reason it is different is because the VFIO interface allows
creating a DMABUF with unique parameters on every user request. Eg the
user can select a BAR index and a slice of the MMIO space unique to
each request and this results in a unique DMABUF.

Due to this we have to store a list of DMABUFs and we need the
DMABUFs to clean up their memory when the user closes the file.

> > So we could delete the try_buf and just rely on move being safe on
> > partially destroyed dma_buf's as part of the API design.
>
> I think that might be the more defensive approach. A comment on the
> dma_buf_move_notify() function should probably be a good idea.

IMHO, it is an anti-pattern. The caller should hold a strong reference
on an object before invoking any API surface. Upgrading a weak
reference to a strong reference requires the standard "try get" API.

But if you feel strongly I don't mind dropping the try_get around move.

Jason
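The weak-reference list Jason contrasts with GEM's singleton pointer can be sketched in plain C. Everything below is illustrative, not kernel code: `struct buf` stands in for the per-export private data, and `notify_all()` stands in for the move-notification walk over the device's dmabuf list, silently skipping entries whose release has already begun.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-export object: one per DMABUF the user created. */
struct buf {
	atomic_int refcount;	/* 0 once userspace closed the FD */
	struct buf *next;	/* device's list of exported bufs */
	int moved;		/* counts move notifications received */
};

/* Weak-to-strong upgrade: succeeds only while refcount > 0. */
static bool buf_try_get(struct buf *b)
{
	int old = atomic_load(&b->refcount);

	while (old > 0)
		if (atomic_compare_exchange_weak(&b->refcount, &old, old + 1))
			return true;
	return false;
}

static void buf_put(struct buf *b)
{
	atomic_fetch_sub(&b->refcount, 1);
}

/* Analogue of walking vdev->dmabufs to send move notifications:
 * only entries still alive are notified; dying ones are skipped. */
static int notify_all(struct buf *head)
{
	int notified = 0;

	for (struct buf *b = head; b; b = b->next) {
		if (!buf_try_get(b))
			continue;	/* release in flight: skip */
		b->moved++;		/* stands in for dma_buf_move_notify() */
		notified++;
		buf_put(b);
	}
	return notified;
}
```

The design point is that the list holds only weak references, so an entry on the list never keeps its buffer alive; the walk either upgrades the reference or proves the buffer is already being torn down.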
Am 19.08.22 um 15:11 schrieb Jason Gunthorpe:
> On Thu, Aug 18, 2022 at 03:37:01PM +0200, Christian König wrote:
>> Am 18.08.22 um 15:16 schrieb Jason Gunthorpe:
>>> On Thu, Aug 18, 2022 at 02:58:10PM +0200, Christian König wrote:
>>>
>>>>>> The only thing I'm not 100% convinced of is dma_buf_try_get(), I've seen
>>>>>> this incorrectly used so many times that I can't count them any more.
>>>>>>
>>>>>> Would that be somehow avoidable? Or could you at least explain the use case
>>>>>> a bit better.
>>>>> I didn't see a way, maybe you know of one
>>>> For GEM objects we usually don't use the reference count of the DMA-buf, but
>>>> rather that of the GEM object for this. But that's not an ideal solution
>>>> either.
>>> You can't really ignore the dmabuf refcount. At some point you have to
>>> deal with the dmabuf being asynchronously released by userspace.
>> Yeah, but in this case the dma-buf is just a reference to the real/private
>> object which holds the backing store.
> The gem approach is backwards to what I did here.

As I said, what GEM does is not necessarily the best approach either.

> GEM holds a singleton pointer to the dmabuf and holds a reference on
> it as long as it has the pointer. This means the dmabuf can not be
> freed until the GEM object is freed.
>
> For this I held a "weak reference" on the dmabuf in a list, and we
> convert the weak reference to a strong reference in the usual way
> using a try_get.
>
> The reason it is different is because the VFIO interface allows
> creating a DMABUF with unique parameters on every user request. Eg the
> user can select a BAR index and a slice of the MMIO space unique to
> each request and this results in a unique DMABUF.
>
> Due to this we have to store a list of DMABUFs and we need the
> DMABUFs to clean up their memory when the user closes the file.

Yeah, that makes sense.

>>> So we could delete the try_buf and just rely on move being safe on
>>> partially destroyed dma_buf's as part of the API design.
>> I think that might be the more defensive approach. A comment on the
>> dma_buf_move_notify() function should probably be a good idea.
> IMHO, it is an anti-pattern. The caller should hold a strong reference
> on an object before invoking any API surface. Upgrading a weak
> reference to a strong reference requires the standard "try get" API.
>
> But if you feel strongly I don't mind dropping the try_get around move.

Well I see it as well that both approaches are not ideal, but my gut
feeling tells me that just documenting that dma_buf_move_notify() can
still be called as long as the release callback wasn't called yet is
probably the better approach.

On the other hand this is really just a gut feeling without strong
arguments backing it. So if somebody has an argument which makes try_get
necessary I'm happy to hear it.

Regards,
Christian.

> Jason
On Fri, Aug 19, 2022 at 03:33:04PM +0200, Christian König wrote:
> > > > So we could delete the try_buf and just rely on move being safe on
> > > > partially destroyed dma_buf's as part of the API design.
> > > I think that might be the more defensive approach. A comment on the
> > > dma_buf_move_notify() function should probably be a good idea.
> > IMHO, it is an anti-pattern. The caller should hold a strong reference
> > on an object before invoking any API surface. Upgrading a weak
> > reference to a strong reference requires the standard "try get" API.
> >
> > But if you feel strongly I don't mind dropping the try_get around move.
>
> Well I see it as well that both approaches are not ideal, but my gut feeling
> tells me that just documenting that dma_buf_move_notify() can still be
> called as long as the release callback wasn't called yet is probably the
> better approach.

The comment would say something like:

 "dma_resv_lock(), dma_buf_move_notify(), dma_resv_unlock() may be
  called with a 0 refcount so long as ops->release() hasn't returned"

Which is a really abnormal API design, IMHO.

Jason
Am 19.08.22 um 15:39 schrieb Jason Gunthorpe:
> On Fri, Aug 19, 2022 at 03:33:04PM +0200, Christian König wrote:
>
>>>>> So we could delete the try_buf and just rely on move being safe on
>>>>> partially destroyed dma_buf's as part of the API design.
>>>> I think that might be the more defensive approach. A comment on the
>>>> dma_buf_move_notify() function should probably be a good idea.
>>> IMHO, it is an anti-pattern. The caller should hold a strong reference
>>> on an object before invoking any API surface. Upgrading a weak
>>> reference to a strong reference requires the standard "try get" API.
>>>
>>> But if you feel strongly I don't mind dropping the try_get around move.
>> Well I see it as well that both approaches are not ideal, but my gut feeling
>> tells me that just documenting that dma_buf_move_notify() can still be
>> called as long as the release callback wasn't called yet is probably the
>> better approach.
> The comment would say something like:
>
>  "dma_resv_lock(), dma_buf_move_notify(), dma_resv_unlock() may be
>   called with a 0 refcount so long as ops->release() hasn't returned"
>
> Which is a really abnormal API design, IMHO.

Mhm, Daniel or others, do you have any opinion on that as well?

Thanks,
Christian.

> Jason
On Thu, 18 Aug 2022 09:05:24 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Aug 17, 2022 at 01:11:38PM -0300, Jason Gunthorpe wrote:
> > dma-buf has become a way to safely acquire a handle to non-struct page
> > memory that can still have lifetime controlled by the exporter. Notably
> > RDMA can now import dma-buf FDs and build them into MRs which allows for
> > PCI P2P operations. Extend this to allow vfio-pci to export MMIO memory
> > from PCI device BARs.
> >
> > This series supports a use case for SPDK where a NVMe device will be owned
> > by SPDK through VFIO but interacting with a RDMA device. The RDMA device
> > may directly access the NVMe CMB or directly manipulate the NVMe device's
> > doorbell using PCI P2P.
> >
> > However, as a general mechanism, it can support many other scenarios with
> > VFIO. I imagine this dmabuf approach to be usable by iommufd as well for
> > generic and safe P2P mappings.
> >
> > This series goes after the "Break up ioctl dispatch functions to one
> > function per ioctl" series.
> >
> > This is on github: https://github.com/jgunthorpe/linux/commits/vfio_dma_buf
> >
> > Jason Gunthorpe (4):
> >   dma-buf: Add dma_buf_try_get()
> >   vfio: Add vfio_device_get()
> >   vfio_pci: Do not open code pci_try_reset_function()
> >   vfio/pci: Allow MMIO regions to be exported through dma-buf
> >
> >  drivers/vfio/pci/Makefile           |   1 +
> >  drivers/vfio/pci/vfio_pci_config.c  |  22 ++-
> >  drivers/vfio/pci/vfio_pci_core.c    |  33 +++-
> >  drivers/vfio/pci/vfio_pci_dma_buf.c | 265 ++++++++++++++++++++++++++++
>
> I forgot about this..
>
> Alex, do you want to start doing as Linus discussed and I will rename
> this new file to "dma_buf.c"?
>
> Or keep this directory as having the vfio_pci_* prefix for
> consistency?

I have a hard time generating a strong opinion over file name redundancy
relative to directory structure. By my count, over 17% of files in
drivers/ have some file name redundancy to their parent directory
structure (up to two levels).

I see we already have a couple of $FOO_dma_buf.c files in the tree,
virtio and amdgpu among these. In the virtio case this is somewhat
justified, to me at least, as the virtio_dma_buf.h file exists in a
shared include namespace. However, this justification only accounts for
about 1% of such files; for many others the redundancy exists in the
include path as well.

I guess if we don't have a reason other than naming consistency and
accept an end goal to incrementally remove file name vs directory
structure redundancy where it makes sense, sure, name it dma_buf.c. Ugh.

Thanks,
Alex