
[v5,0/5] drm/virtio: Import scanout buffers from other devices

Message ID 20241126031643.3490496-1-vivek.kasireddy@intel.com

Message

Kasireddy, Vivek Nov. 26, 2024, 3:13 a.m. UTC
Having virtio-gpu import scanout buffers (via prime) from other
devices means that we'd be adding a head to headless GPUs assigned
to a Guest VM or additional heads to regular GPU devices that are
passthrough'd to the Guest. In these cases, the Guest compositor
can render into the scanout buffer using a primary GPU and has the
secondary GPU (virtio-gpu) import it for display purposes.

The main advantage with this is that the imported scanout buffer can
either be displayed locally on the Host (e.g., using Qemu + GTK UI)
or encoded and streamed to a remote client (e.g., Qemu + Spice UI).
Note that since Qemu uses the udmabuf driver, no copies of the
scanout buffer are made as it is displayed. This should be possible
even when the buffer resides in device memory such as VRAM.

The specific use-case that can be supported with this series is when
running Weston or other Guest compositors with the "additional-devices"
feature (./weston --drm-device=card1 --additional-devices=card0).
More info about this feature can be found at:
https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/736

In the above scenario, card1 could be a dGPU or an iGPU and card0
would be virtio-gpu in KMS only mode. However, the case where this
patch series could be particularly useful is when card1 is a GPU VF
that needs to share its scanout buffer (in a zero-copy way) with the
GPU PF on the Host. Or, it can also be useful when the scanout buffer
needs to be shared between any two GPU devices (assuming one of them
is assigned to a Guest VM) as long as they are P2P DMA compatible.

As part of the import, the virtio-gpu driver shares the DMA
addresses and lengths with Qemu, which then determines whether the
memory region they belong to is owned by a VFIO device or is part
of the Guest's system RAM. If it is the former, Qemu can use the
VFIO_DEVICE_FEATURE_DMA_BUF feature flag while invoking the ioctl
against the VFIO device fd and get a dmabuf fd in return. In the
latter case, Qemu obtains the dmabuf fd using the udmabuf driver.
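
For the Guest system RAM case, the udmabuf path boils down to roughly
the following userspace sketch (illustrative only, not code from this
series or from Qemu; guest_ram_to_dmabuf() is a hypothetical helper and
error handling is trimmed). The VFIO_DEVICE_FEATURE_DMA_BUF path is not
shown here:

/*
 * Hedged sketch: how a VMM could turn a page-aligned range of guest
 * RAM -- backed by a memfd, as with -object memory-backend-memfd --
 * into a dmabuf fd via the udmabuf driver.
 */
#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/udmabuf.h>

static int guest_ram_to_dmabuf(int memfd, uint64_t offset, uint64_t size)
{
	struct udmabuf_create create = {
		.memfd  = memfd,                /* fd of the memory backend     */
		.flags  = UDMABUF_FLAGS_CLOEXEC,
		.offset = offset,               /* page-aligned offset in memfd */
		.size   = size,                 /* page-aligned length          */
	};
	int devfd, buffd;

	/* The memfd must carry F_SEAL_SHRINK for udmabuf to accept it. */
	devfd = open("/dev/udmabuf", O_RDWR);
	if (devfd < 0)
		return -1;

	buffd = ioctl(devfd, UDMABUF_CREATE, &create);
	close(devfd);
	return buffd;                           /* dmabuf fd on success */
}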

Note that the virtio-gpu driver registers a move_notify() callback
to track location changes associated with the scanout buffer and
sends attach/detach backing cmds to Qemu when appropriate.
Synchronization (that is, ensuring that the Guest and Host are not
using the scanout buffer at the same time) is handled by pinning/
unpinning the dmabuf as part of prepare/cleanup fb and by using a
fence in the resource_flush cmd.
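
For reference, the importer pattern described above looks roughly like
the sketch below (a sketch only, with illustrative names such as
example_attach_ops and example_prepare_fb that do not necessarily match
the functions in these patches):

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>

static void example_move_notify(struct dma_buf_attachment *attach)
{
	/* Exporter is about to move the buffer: this is where the driver
	 * would send a fenced DETACH_BACKING cmd and drop the stale sgt. */
}

static const struct dma_buf_attach_ops example_attach_ops = {
	.allow_peer2peer = true,
	.move_notify     = example_move_notify,
};

/* Import: a dynamic attachment so the exporter invokes move_notify(). */
static struct dma_buf_attachment *example_import(struct dma_buf *dmabuf,
						 struct device *dev,
						 void *importer_priv)
{
	return dma_buf_dynamic_attach(dmabuf, dev, &example_attach_ops,
				      importer_priv);
}

/* prepare_fb: pin so the buffer stays resident while it is on screen. */
static int example_prepare_fb(struct dma_buf_attachment *attach)
{
	int ret;

	dma_resv_lock(attach->dmabuf->resv, NULL);
	ret = dma_buf_pin(attach);
	dma_resv_unlock(attach->dmabuf->resv);

	return ret;
}

/* cleanup_fb: unpin so the exporter may move or evict the buffer again. */
static void example_cleanup_fb(struct dma_buf_attachment *attach)
{
	dma_resv_lock(attach->dmabuf->resv, NULL);
	dma_buf_unpin(attach);
	dma_resv_unlock(attach->dmabuf->resv);
}

The allow_peer2peer flag is what lets an exporter that supports P2P DMA
keep the buffer in device memory instead of migrating it to system RAM.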

Changelog:

v4 -> v5 (changes suggested by Dmitry):
- Replace the variable detached with attached and use it in
  virtio_gpu_object_attach/detach to track a BO's backing
- Use the unlocked version of dma_buf_unmap_attachment() to avoid
  having to hold dma resv lock while freeing the object

v3 -> v4 (changes suggested by Dmitry):
- Change the return type of virtgpu_dma_buf_import_sgt() from long
  to int
- Add missing virtio_gpu_detach_object_fenced() while trying to free
  the obj in virtgpu_dma_buf_free_obj()
- Remove the extra newline added at the end of the file in patch 4

v2 -> v3:
- Rebase on 6.12

v1 -> v2:
- Use a fenced version of VIRTIO_GPU_CMD_RESOURCE_DETACH_BACKING cmd
  (Dmitry)

RFC -> v1:
- Use virtio_gpu_cleanup_object() to cleanup the imported obj
- Do pin/unpin as part of prepare and cleanup fb for the imported
  dmabuf obj instead of doing it as part of plane update
- Tested with gnome-shell/mutter (wayland backend)

Patchset overview:

Patch 1:   Implement VIRTIO_GPU_CMD_RESOURCE_DETACH_BACKING cmd
Patch 2-3: Helpers to initialize, import, and free the imported object
Patch 4-5: Import and use buffers from other devices for scanout

This series is tested using the following method:
- Run Qemu with the following relevant options:
  qemu-system-x86_64 -m 4096m ....
  -device vfio-pci,host=0000:03:00.0
  -device virtio-vga,max_outputs=1,blob=true,xres=1920,yres=1080
  -display gtk,gl=on
  -object memory-backend-memfd,id=mem1,size=4096M
  -machine memory-backend=mem1 ...
- Run upstream Weston with the following options in the Guest VM:
  ./weston --drm-device=card1 --additional-devices=card0
- Or run Gnome-shell/Mutter (wayland backend) with this additional patch:
  https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/3745
  XDG_SESSION_TYPE=wayland dbus-run-session -- /usr/bin/gnome-shell --wayland --no-x11

where card1 is a DG2 dGPU (passthrough'd and using i915 driver in Guest VM),
card0 is virtio-gpu and the Host is using a RPL iGPU.

Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Dongwon Kim <dongwon.kim@intel.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Gurchetan Singh <gurchetansingh@chromium.org>
Cc: Chia-I Wu <olvaffe@gmail.com>

Vivek Kasireddy (5):
  drm/virtio: Implement VIRTIO_GPU_CMD_RESOURCE_DETACH_BACKING cmd
  drm/virtio: Add a helper to map and note the dma addrs and lengths
  drm/virtio: Add helpers to initialize and free the imported object
  drm/virtio: Import prime buffers from other devices as guest blobs
  drm/virtio: Add prepare and cleanup routines for imported dmabuf obj

 drivers/gpu/drm/virtio/virtgpu_drv.h    |  10 ++
 drivers/gpu/drm/virtio/virtgpu_object.c |  24 ++++
 drivers/gpu/drm/virtio/virtgpu_plane.c  |  65 ++++++++-
 drivers/gpu/drm/virtio/virtgpu_prime.c  | 173 +++++++++++++++++++++++-
 drivers/gpu/drm/virtio/virtgpu_vq.c     |  35 +++++
 5 files changed, 305 insertions(+), 2 deletions(-)

Comments

Dmitry Osipenko Nov. 26, 2024, 11:02 a.m. UTC | #1
Hello, Vivek

All patches applied to misc-next with a small modification, thanks!

Note: While verifying move_notify(), I noticed that the AMD/TTM driver
moves the same shared FB GEMs after each framebuffer update when it
renders into the FB, despite the 32GB BAR. This seems rather
inefficient; I'd expect the dma-buf to stay static if there is no need
to evict it. Worth checking how this works with DG2.

Fix: I made this change to the "Import prime buffers" patch after
spotting the possibility of a race condition between move_notify()
and freeing the GEM:

diff --git a/drivers/gpu/drm/virtio/virtgpu_prime.c b/drivers/gpu/drm/virtio/virtgpu_prime.c
index 8644b87d473d..688810d1b611 100644
--- a/drivers/gpu/drm/virtio/virtgpu_prime.c
+++ b/drivers/gpu/drm/virtio/virtgpu_prime.c
@@ -189,13 +189,18 @@ static void virtgpu_dma_buf_free_obj(struct drm_gem_object *obj)
 	struct virtio_gpu_object *bo = gem_to_virtio_gpu_obj(obj);
 	struct virtio_gpu_device *vgdev = obj->dev->dev_private;
 	struct dma_buf_attachment *attach = obj->import_attach;
+	struct dma_resv *resv = attach->dmabuf->resv;

 	if (attach) {
+		dma_resv_lock(resv, NULL);
+
 		virtio_gpu_detach_object_fenced(bo);

 		if (bo->sgt)
-			dma_buf_unmap_attachment_unlocked(attach, bo->sgt,
-							  DMA_BIDIRECTIONAL);
+			dma_buf_unmap_attachment(attach, bo->sgt,
+						 DMA_BIDIRECTIONAL);
+
+		dma_resv_unlock(resv);

 		dma_buf_detach(attach->dmabuf, attach);
 		dma_buf_put(attach->dmabuf);
@@ -268,7 +273,7 @@ static void virtgpu_dma_buf_move_notify(struct dma_buf_attachment *attach)
 	struct drm_gem_object *obj = attach->importer_priv;
 	struct virtio_gpu_object *bo = gem_to_virtio_gpu_obj(obj);

-	if (bo->created) {
+	if (bo->created && kref_read(&obj->refcount)) {
 		virtio_gpu_detach_object_fenced(bo);

 		if (bo->sgt)
Kasireddy, Vivek Nov. 27, 2024, 6:30 a.m. UTC | #2
Hi Dmitry,

> Subject: Re: [PATCH v5 0/5] drm/virtio: Import scanout buffers from other
> devices
> 
> Hello, Vivek
> 
> All patches applied to misc-next with a small modification, thanks!
Thank you so much for taking the time to test, review and merge this series!!

> 
> Note: While verifying move_notify(), I noticed that AMD/TTM driver moves
> same shared FB GEMs after each framebuffer update when it renders into
> FB, despite of the 32GB BAR. This should be rather inefficient. I'd
> expect dma-buf staying static if there is no need to evict it. Something
> to check how it works with DG2.
IIUC, I think the exporting driver (AMD GPU driver) in the Guest VM migrates the
FB to System RAM because during export it determines that virtio-gpu is not P2P
compatible. This behavior is expected and seen with other dGPU drivers (i915/Xe)
as well. However, I am trying to fix this in Xe driver for specific use-cases (SRIOV)
where we know for sure that the importer on the Host is another dGPU:
https://patchwork.kernel.org/project/dri-devel/cover/20241012024524.1377836-1-vivek.kasireddy@intel.com/
https://patchwork.kernel.org/project/dri-devel/cover/20240624065552.1572580-1-vivek.kasireddy@intel.com/

> 
> Fix: I made this change to the "Import prime buffers" patch after
> spotting possibility of having race condition between move_notify() and
> freeing GEM:
This fix LGTM.

Thanks,
Vivek

> 
> diff --git a/drivers/gpu/drm/virtio/virtgpu_prime.c b/drivers/gpu/drm/virtio/virtgpu_prime.c
> index 8644b87d473d..688810d1b611 100644
> --- a/drivers/gpu/drm/virtio/virtgpu_prime.c
> +++ b/drivers/gpu/drm/virtio/virtgpu_prime.c
> @@ -189,13 +189,18 @@ static void virtgpu_dma_buf_free_obj(struct drm_gem_object *obj)
>  	struct virtio_gpu_object *bo = gem_to_virtio_gpu_obj(obj);
>  	struct virtio_gpu_device *vgdev = obj->dev->dev_private;
>  	struct dma_buf_attachment *attach = obj->import_attach;
> +	struct dma_resv *resv = attach->dmabuf->resv;
> 
>  	if (attach) {
> +		dma_resv_lock(resv, NULL);
> +
>  		virtio_gpu_detach_object_fenced(bo);
> 
>  		if (bo->sgt)
> -			dma_buf_unmap_attachment_unlocked(attach, bo->sgt,
> -							  DMA_BIDIRECTIONAL);
> +			dma_buf_unmap_attachment(attach, bo->sgt,
> +						 DMA_BIDIRECTIONAL);
> +
> +		dma_resv_unlock(resv);
> 
>  		dma_buf_detach(attach->dmabuf, attach);
>  		dma_buf_put(attach->dmabuf);
> @@ -268,7 +273,7 @@ static void virtgpu_dma_buf_move_notify(struct dma_buf_attachment *attach)
>  	struct drm_gem_object *obj = attach->importer_priv;
>  	struct virtio_gpu_object *bo = gem_to_virtio_gpu_obj(obj);
> 
> -	if (bo->created) {
> +	if (bo->created && kref_read(&obj->refcount)) {
>  		virtio_gpu_detach_object_fenced(bo);
> 
>  		if (bo->sgt)
> 
> 
> --
> Best regards,
> Dmitry
Dmitry Osipenko Nov. 27, 2024, 11:56 a.m. UTC | #3
On 11/27/24 09:30, Kasireddy, Vivek wrote:
>> Note: While verifying move_notify(), I noticed that AMD/TTM driver moves
>> same shared FB GEMs after each framebuffer update when it renders into
>> FB, despite of the 32GB BAR. This should be rather inefficient. I'd
>> expect dma-buf staying static if there is no need to evict it. Something
>> to check how it works with DG2.
> IIUC, I think the exporting driver (AMD GPU driver) in the Guest VM migrates the
> FB to System RAM because during export it determines that virtio-gpu is not P2P
> compatible. This behavior is expected and seen with other dGPU drivers (i915/Xe)
> as well. However, I am trying to fix this in Xe driver for specific use-cases (SRIOV)
> where we know for sure that the importer on the Host is another dGPU:
> https://patchwork.kernel.org/project/dri-devel/cover/20241012024524.1377836-1-vivek.kasireddy@intel.com/
> https://patchwork.kernel.org/project/dri-devel/cover/20240624065552.1572580-1-vivek.kasireddy@intel.com/

Are the dmabufs static with MOVE_NOTIFY disabled, or does TTM use a
bounce buffer in this case?