[0/4] cover-letter: Allow MMIO regions to be exported through dmabuf

Message ID	20241216095429.210792-1-wguay@fb.com (mailing list archive)
Headers	show Return-Path: <dri-devel-bounces@lists.freedesktop.org> From: Wei Lin Guay <wguay@fb.com> To: <alex.williamson@redhat.com>, <dri-devel@lists.freedesktop.org>, <kvm@vger.kernel.org>, <linux-rdma@vger.kernel.org> CC: <jgg@nvidia.com>, <vivek.kasireddy@intel.com>, <dagmoxnes@meta.com>, <kbusch@kernel.org>, <nviljoen@meta.com>, Wei Lin Guay <wguay@meta.com>, Oded Gabbay <ogabbay@kernel.org>, =?utf-8?q?Christian_K=C3=B6nig?= <christian.koenig@amd.com>, Daniel Vetter <daniel.vetter@ffwll.ch>, Leon Romanovsky <leon@kernel.org>, Maor Gottlieb <maorg@nvidia.com> Subject: [PATCH 0/4] cover-letter: Allow MMIO regions to be exported through dmabuf Date: Mon, 16 Dec 2024 01:54:14 -0800 Message-ID: <20241216095429.210792-1-wguay@fb.com> MIME-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Precedence: list Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>
Series	cover-letter: Allow MMIO regions to be exported through dmabuf \| expand [0/4] cover-letter: Allow MMIO regions to be exported through dmabuf [1/4] vfio: Add vfio_device_get() [2/4] dma-buf: Add dma_buf_try_get() [3/4] vfio/pci: Allow MMIO regions to be exported through dma-buf

Message ID

20241216095429.210792-1-wguay@fb.com (mailing list archive)

Headers

From: Wei Lin Guay <wguay@fb.com>
To: <alex.williamson@redhat.com>, <dri-devel@lists.freedesktop.org>,
 <kvm@vger.kernel.org>, <linux-rdma@vger.kernel.org>
CC: <jgg@nvidia.com>, <vivek.kasireddy@intel.com>, <dagmoxnes@meta.com>,
 <kbusch@kernel.org>, <nviljoen@meta.com>, Wei Lin Guay <wguay@meta.com>,
 Oded Gabbay <ogabbay@kernel.org>,
 =?utf-8?q?Christian_K=C3=B6nig?= <christian.koenig@amd.com>,
 Daniel Vetter <daniel.vetter@ffwll.ch>, Leon Romanovsky <leon@kernel.org>,
 Maor Gottlieb <maorg@nvidia.com>
Subject: [PATCH 0/4] cover-letter: Allow MMIO regions to be exported through
 dmabuf
Date: Mon, 16 Dec 2024 01:54:14 -0800
Message-ID: <20241216095429.210792-1-wguay@fb.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Precedence: list
Errors-To: dri-devel-bounces@lists.freedesktop.org
Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>

Series

cover-letter: Allow MMIO regions to be exported through dmabuf | expand

Message

Wei Lin Guay Dec. 16, 2024, 9:54 a.m. UTC

From: Wei Lin Guay <wguay@meta.com>

This is another attempt to revive the patches posted by Jason
Gunthorpe and Vivek Kasireddy, at
https://patchwork.kernel.org/project/linux-media/cover/0-v2-472615b3877e+28f7-vfio_dma_buf_jgg@nvidia.com/
https://lwn.net/Articles/970751/

In addition to the initial proposal by Jason, another promising
application is exposing memory from an AI accelerator (bound to VFIO)
to an RDMA device. This would allow the RDMA device to directly access
the accelerator's memory, thereby facilitating direct data
transactions between the RDMA device and the accelerator.

Below is from the text/motivation from the orginal cover letter.

dma-buf has become a way to safely acquire a handle to non-struct page
memory that can still have lifetime controlled by the exporter. Notably
RDMA can now import dma-buf FDs and build them into MRs which allows for
PCI P2P operations. Extend this to allow vfio-pci to export MMIO memory
from PCI device BARs.

This series supports a use case for SPDK where a NVMe device will be owned
by SPDK through VFIO but interacting with a RDMA device. The RDMA device
may directly access the NVMe CMB or directly manipulate the NVMe device's
doorbell using PCI P2P.

However, as a general mechanism, it can support many other scenarios with
VFIO. I imagine this dmabuf approach to be usable by iommufd as well for
generic and safe P2P mappings.

This series goes after the "Break up ioctl dispatch functions to one
function per ioctl" series.

v2:
 - Name the new file dma_buf.c
 - Restore orig_nents before freeing
 - Fix reversed logic around priv->revoked
 - Set priv->index
 - Rebased on v2 "Break up ioctl dispatch functions"
v1: https://lore.kernel.org/r/0-v1-9e6e1739ed95+5fa-vfio_dma_buf_jgg@nvidia.com
Cc: linux-rdma@vger.kernel.org
Cc: Oded Gabbay <ogabbay@kernel.org>
Cc: Christian König <christian.koenig@amd.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Maor Gottlieb <maorg@nvidia.com>
Cc: dri-devel@lists.freedesktop.org
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

Jason Gunthorpe (3):
  vfio: Add vfio_device_get()
  dma-buf: Add dma_buf_try_get()
  vfio/pci: Allow MMIO regions to be exported through dma-buf

Wei Lin Guay (1):
  vfio/pci: Allow export dmabuf without move_notify from importer

 drivers/vfio/pci/Makefile          |   1 +
 drivers/vfio/pci/dma_buf.c         | 291 +++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_config.c |   8 +-
 drivers/vfio/pci/vfio_pci_core.c   |  44 ++++-
 drivers/vfio/pci/vfio_pci_priv.h   |  30 +++
 drivers/vfio/vfio_main.c           |   1 +
 include/linux/dma-buf.h            |  13 ++
 include/linux/vfio.h               |   6 +
 include/linux/vfio_pci_core.h      |   1 +
 include/uapi/linux/vfio.h          |  18 ++
 10 files changed, 405 insertions(+), 8 deletions(-)
 create mode 100644 drivers/vfio/pci/dma_buf.c

--
2.43.5

Comments

Christian König Dec. 17, 2024, 12:47 p.m. UTC | #1

Am 17.12.24 um 12:06 schrieb Wei Lin Guay:
> [SNIP]
>> Please send that out once more with me on explicit CC.
> I will resend the patch series. I was experiencing issues with my email client, which inadvertently split the series into two separate emails.

Alternatively I can also copy the code from the list archive and explain 
why that doesn't work:

+void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
+{
+    struct vfio_pci_dma_buf *priv;
+    struct vfio_pci_dma_buf *tmp;
+
+    lockdep_assert_held_write(&vdev->memory_lock);

This makes sure that the caller is holding vdev->memory_lock.

+
+    list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
+        if (!dma_buf_try_get(priv->dmabuf))

This here only works because vfio_pci_dma_buf_release() also grabs 
vdev->memory_lock and so we are protected against concurrent releases.

+            continue;
+        if (priv->revoked != revoked) {
+            dma_resv_lock(priv->dmabuf->resv, NULL);
+            priv->revoked = revoked;
+            dma_buf_move_notify(priv->dmabuf);
+            dma_resv_unlock(priv->dmabuf->resv);
+        }
+        dma_buf_put(priv->dmabuf);

The problem is now that this here might drop the last reference which in 
turn calls vfio_pci_dma_buf_release() which also tries to grab 
vdev->memory_lock and so results in a deadlock.

+    }
+}

This pattern was suggested multiple times and I had to rejected it every 
time because the whole idea is just fundamentally broken.

It's really astonishing how people always come up with the same broken 
pattern.

Regards,
Christian.

>
>> Apart from that I have to reject the adding of dma_buf_try_get(), that is clearly not something we should do.
>
> Understood. It appears that Vivek has confirmed that his v2 has resolved the issue. I will follow up with him to determine if he plans to resume his patch, and if so, I will apply my last patch on top of his updated patch series
>
> Thanks,
> Wei Lin
>> Thanks,
>> Christian.
>

Kasireddy, Vivek Dec. 18, 2024, 6:16 a.m. UTC | #2

Hi Christian,

> Subject: Re: [PATCH 0/4] cover-letter: Allow MMIO regions to be exported
> through dmabuf
> 
> 
>> 	I will resend the patch series. I was experiencing issues with my email
>> client, which inadvertently split the series into two separate emails.
> 
> 
> Alternatively I can also copy the code from the list archive and explain why
> that doesn't work:
> 
> +void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool
> +revoked) {
> +    struct vfio_pci_dma_buf *priv;
> +    struct vfio_pci_dma_buf *tmp;
> +
> +    lockdep_assert_held_write(&vdev->memory_lock);
> 
> This makes sure that the caller is holding vdev->memory_lock.
> 
> +
> +    list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
> +        if (!dma_buf_try_get(priv->dmabuf))
> 
> This here only works because vfio_pci_dma_buf_release() also grabs vdev-
> >memory_lock and so we are protected against concurrent releases.
> 
> +            continue;
> +        if (priv->revoked != revoked) {
> +            dma_resv_lock(priv->dmabuf->resv, NULL);
> +            priv->revoked = revoked;
> +            dma_buf_move_notify(priv->dmabuf);
> +            dma_resv_unlock(priv->dmabuf->resv);
> +        }
> +        dma_buf_put(priv->dmabuf);
> 
> The problem is now that this here might drop the last reference which in turn
> calls vfio_pci_dma_buf_release() which also tries to grab vdev-
> >memory_lock and so results in a deadlock.
AFAICS, vfio_pci_dma_buf_release() would not be called synchronously after the
last reference is dropped by dma_buf_put(). This is because fput(), which is called
by dma_buf_put() triggers f_op->release() asynchronously; therefore, a deadlock
is unlikely to occur in this scenario, unless I am overlooking something.

Thanks,
Vivek

> 
> +    }
> +}
> 
> This pattern was suggested multiple times and I had to rejected it every time
> because the whole idea is just fundamentally broken.
> 
> It's really astonishing how people always come up with the same broken
> pattern.
> 
> Regards,
> Christian.
> 
> 
> 
> 
> 
> 
> 
> 		Apart from that I have to reject the adding of
> dma_buf_try_get(), that is clearly not something we should do.
> 
> 
> 
> 	Understood. It appears that Vivek has confirmed that his v2 has
> resolved the issue. I will follow up with him to determine if he plans to
> resume his patch, and if so, I will apply my last patch on top of his updated
> patch series
> 
> 	Thanks,
> 	Wei Lin
> 
> 
> 		Thanks,
> 		Christian.
> 
> 
> 
>