[0/4] cover-letter: Allow MMIO regions to be exported through dmabuf

Message ID 20241216095429.210792-1-wguay@fb.com (mailing list archive)

Message

Wei Lin Guay Dec. 16, 2024, 9:54 a.m. UTC
From: Wei Lin Guay <wguay@meta.com>

This is another attempt to revive the patches posted by Jason
Gunthorpe and Vivek Kasireddy, at
https://patchwork.kernel.org/project/linux-media/cover/0-v2-472615b3877e+28f7-vfio_dma_buf_jgg@nvidia.com/
https://lwn.net/Articles/970751/

In addition to the initial proposal by Jason, another promising
application is exposing memory from an AI accelerator (bound to VFIO)
to an RDMA device. This would allow the RDMA device to directly access
the accelerator's memory, thereby facilitating direct data
transactions between the RDMA device and the accelerator.
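As a rough sketch of that flow (pseudocode only: the VFIO helper below stands in for whatever the series' final uapi looks like, while ibv_reg_dmabuf_mr() is the existing libibverbs entry point for dma-buf-backed MRs):

```
/* Pseudocode -- error handling omitted, VFIO ioctl shape illustrative only. */

/* 1. Exporter side: ask vfio-pci for a dmabuf FD covering a BAR range. */
device_fd = open_vfio_device();   /* accelerator bound to VFIO */
dmabuf_fd = vfio_get_dmabuf(device_fd, bar_index, offset, length);

/* 2. Importer side: build an RDMA MR directly on the exported MMIO. */
pd = ibv_alloc_pd(context);
mr = ibv_reg_dmabuf_mr(pd, 0 /* offset */, length, iova, dmabuf_fd,
                       IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);

/* The RDMA HW can now DMA peer-to-peer into the accelerator's memory,
 * and the exporter retains lifetime control: it can revoke access
 * (e.g. on device reset or when the BAR is disabled). */
```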

Below is the motivation text from the original cover letter.

dma-buf has become a way to safely acquire a handle to non-struct page
memory that can still have lifetime controlled by the exporter. Notably
RDMA can now import dma-buf FDs and build them into MRs which allows for
PCI P2P operations. Extend this to allow vfio-pci to export MMIO memory
from PCI device BARs.

This series supports a use case for SPDK where an NVMe device will be owned
by SPDK through VFIO while interacting with an RDMA device. The RDMA device
may directly access the NVMe CMB or directly manipulate the NVMe device's
doorbell using PCI P2P.

However, as a general mechanism, it can support many other scenarios with
VFIO. I imagine this dmabuf approach to be usable by iommufd as well for
generic and safe P2P mappings.

This series goes after the "Break up ioctl dispatch functions to one
function per ioctl" series.

v2:
 - Name the new file dma_buf.c
 - Restore orig_nents before freeing
 - Fix reversed logic around priv->revoked
 - Set priv->index
 - Rebased on v2 "Break up ioctl dispatch functions"
v1: https://lore.kernel.org/r/0-v1-9e6e1739ed95+5fa-vfio_dma_buf_jgg@nvidia.com
Cc: linux-rdma@vger.kernel.org
Cc: Oded Gabbay <ogabbay@kernel.org>
Cc: Christian König <christian.koenig@amd.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Maor Gottlieb <maorg@nvidia.com>
Cc: dri-devel@lists.freedesktop.org
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Jason Gunthorpe (3):
  vfio: Add vfio_device_get()
  dma-buf: Add dma_buf_try_get()
  vfio/pci: Allow MMIO regions to be exported through dma-buf

Wei Lin Guay (1):
  vfio/pci: Allow export dmabuf without move_notify from importer

 drivers/vfio/pci/Makefile          |   1 +
 drivers/vfio/pci/dma_buf.c         | 291 +++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_config.c |   8 +-
 drivers/vfio/pci/vfio_pci_core.c   |  44 ++++-
 drivers/vfio/pci/vfio_pci_priv.h   |  30 +++
 drivers/vfio/vfio_main.c           |   1 +
 include/linux/dma-buf.h            |  13 ++
 include/linux/vfio.h               |   6 +
 include/linux/vfio_pci_core.h      |   1 +
 include/uapi/linux/vfio.h          |  18 ++
 10 files changed, 405 insertions(+), 8 deletions(-)
 create mode 100644 drivers/vfio/pci/dma_buf.c

--
2.43.5

Comments

Wei Lin Guay Dec. 17, 2024, 11:06 a.m. UTC | #1
Hi Christian, 

Thanks again for your prompt response/review.

> On 17 Dec 2024, at 10:53, Christian König <christian.koenig@amd.com> wrote:
> 
> Am 16.12.24 um 17:54 schrieb Keith Busch:
>> On Mon, Dec 16, 2024 at 11:21:39AM +0100, Christian König wrote:
>>> Am 16.12.24 um 10:54 schrieb Wei Lin Guay:
>>>> From: Wei Lin Guay <wguay@meta.com>
>>>> However, as a general mechanism, it can support many other scenarios with
>>>> VFIO. I imagine this dmabuf approach to be usable by iommufd as well for
>>>> generic and safe P2P mappings.
>>>> 
>>>> This series goes after the "Break up ioctl dispatch functions to one
>>>> function per ioctl" series.
>>> Yeah that sounds like it should work.
>>> 
>>> But where is the rest of the series, I only see the cover letter?
>> Should be here:
>> 
>>   https://lore.kernel.org/linux-rdma/20241216095429.210792-2-wguay@fb.com/T/#u
> 
> Please send that out once more with me on explicit CC.

I will resend the patch series. I was experiencing issues with my email client, which inadvertently split the series into two separate emails.

> 
> Apart from that I have to reject the adding of dma_buf_try_get(), that is clearly not something we should do.


Understood. It appears that Vivek has confirmed that his v2 has resolved the issue. I will follow up with him to determine if he plans to resume his patch, and if so, I will apply my last patch on top of his updated patch series.

Thanks,
Wei Lin
> 
> Thanks,
> Christian.
Christian König Dec. 17, 2024, 12:47 p.m. UTC | #2
Am 17.12.24 um 12:06 schrieb Wei Lin Guay:
> [SNIP]
>> Please send that out once more with me on explicit CC.
> I will resend the patch series. I was experiencing issues with my email client, which inadvertently split the series into two separate emails.

Alternatively I can also copy the code from the list archive and explain 
why that doesn't work:

+void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
+{
+    struct vfio_pci_dma_buf *priv;
+    struct vfio_pci_dma_buf *tmp;
+
+    lockdep_assert_held_write(&vdev->memory_lock);

This makes sure that the caller is holding vdev->memory_lock.

+
+    list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
+        if (!dma_buf_try_get(priv->dmabuf))

This here only works because vfio_pci_dma_buf_release() also grabs 
vdev->memory_lock and so we are protected against concurrent releases.

+            continue;
+        if (priv->revoked != revoked) {
+            dma_resv_lock(priv->dmabuf->resv, NULL);
+            priv->revoked = revoked;
+            dma_buf_move_notify(priv->dmabuf);
+            dma_resv_unlock(priv->dmabuf->resv);
+        }
+        dma_buf_put(priv->dmabuf);

The problem is now that this here might drop the last reference which in 
turn calls vfio_pci_dma_buf_release() which also tries to grab 
vdev->memory_lock and so results in a deadlock.

+    }
+}

This pattern was suggested multiple times and I had to reject it every 
time because the whole idea is just fundamentally broken.

It's really astonishing how people always come up with the same broken 
pattern.

Regards,
Christian.

Kasireddy, Vivek Dec. 18, 2024, 6:16 a.m. UTC | #3
Hi Christian,

> Subject: Re: [PATCH 0/4] cover-letter: Allow MMIO regions to be exported
> through dmabuf
> 
> 
> [SNIP]
> 
> The problem is now that this here might drop the last reference which in turn
> calls vfio_pci_dma_buf_release() which also tries to grab vdev-
> >memory_lock and so results in a deadlock.
AFAICS, vfio_pci_dma_buf_release() would not be called synchronously after the
last reference is dropped by dma_buf_put(). This is because fput(), which is called
by dma_buf_put() triggers f_op->release() asynchronously; therefore, a deadlock
is unlikely to occur in this scenario, unless I am overlooking something.

Thanks,
Vivek

Christian König Dec. 18, 2024, 10:01 a.m. UTC | #4
Am 18.12.24 um 07:16 schrieb Kasireddy, Vivek:
> Hi Christian,
>
>> [SNIP]
>> The problem is now that this here might drop the last reference which in turn
>> calls vfio_pci_dma_buf_release() which also tries to grab vdev-
>>> memory_lock and so results in a deadlock.
> AFAICS, vfio_pci_dma_buf_release() would not be called synchronously after the
> last reference is dropped by dma_buf_put(). This is because fput(), which is called
> by dma_buf_put() triggers f_op->release() asynchronously; therefore, a deadlock
> is unlikely to occur in this scenario, unless I am overlooking something.

My recollection is that the f_op->release handler is only called 
asynchronously if fput() was issued in interrupt context.

But could be that this information is outdated.

Regards,
Christian.
