[RFC,0/7] drm/virtio: Import scanout buffers from other devices

Message ID: 20240328083615.2662516-1-vivek.kasireddy@intel.com

Message

Vivek Kasireddy March 28, 2024, 8:32 a.m. UTC
Having virtio-gpu import scanout buffers (via prime) from other
devices means that we'd be adding a head to headless GPUs assigned
to a Guest VM or additional heads to regular GPU devices that are
passthrough'd to the Guest. In these cases, the Guest compositor
can render into the scanout buffer using a primary GPU and have the
secondary GPU (virtio-gpu) import it for display purposes.

The main advantage with this is that the imported scanout buffer can
either be displayed locally on the Host (e.g., using Qemu + GTK UI)
or encoded and streamed to a remote client (e.g., Qemu + Spice UI).
Note that since Qemu uses the udmabuf driver, no copies of the
scanout buffer are made as it is displayed. This should be possible
even when it resides in device memory such as VRAM.

The specific use-case that can be supported with this series is when
running Weston or other guest compositors with "additional-devices"
feature (./weston --drm-device=card1 --additional-devices=card0).
More info about this feature can be found at:
https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/736

In the above scenario, card1 could be a dGPU or an iGPU and card0
would be virtio-gpu in KMS only mode. However, the case where this
patch series could be particularly useful is when card1 is a GPU VF
that needs to share its scanout buffer (in a zero-copy way) with the
GPU PF on the Host. Or, it can also be useful when the scanout buffer
needs to be shared between any two GPU devices (assuming one of them
is assigned to a Guest VM) as long as they are P2P DMA compatible.

As part of the import, the virtio-gpu driver shares the DMA
addresses and lengths with Qemu, which then determines whether the
memory region they belong to is owned by a PCI device or whether it
is part of the Guest's system RAM. If it is the former, it identifies
the devid (or BDF) and BAR and provides this info (along with offsets
and sizes) to the udmabuf driver. In the latter case, it provides the
memfd instead of the devid and BAR. The udmabuf driver then creates
a dmabuf using this info, which Qemu shares with Spice for encoding
via GStreamer.
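
For reference, the Guest-system-RAM case maps onto the existing udmabuf
UAPI; a minimal userspace sketch (illustrative only, not Qemu's actual
code) is shown below. The PCI-BAR variant (UDMABUF_CREATE_LIST_FOR_PCIDEV)
added by patches 6-7 is deliberately not shown here, since its exact
struct layout is defined in those patches.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/udmabuf.h>

static int create_udmabuf_from_memfd(size_t size)
{
	struct udmabuf_create create = { 0 };
	int memfd, devfd, dmabuf_fd;

	/* udmabuf requires a sealable memfd with F_SEAL_SHRINK set. */
	memfd = memfd_create("guest-fb", MFD_ALLOW_SEALING);
	if (memfd < 0 || ftruncate(memfd, size) < 0)
		return -1;
	if (fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK) < 0)
		return -1;

	devfd = open("/dev/udmabuf", O_RDWR);
	if (devfd < 0)
		return -1;

	create.memfd  = memfd;
	create.flags  = UDMABUF_FLAGS_CLOEXEC;
	create.offset = 0;	/* page-aligned offset into the memfd */
	create.size   = size;	/* page-aligned size */

	/* Returns a dmabuf fd backed by the memfd pages, so no copies are made. */
	dmabuf_fd = ioctl(devfd, UDMABUF_CREATE, &create);
	close(devfd);
	return dmabuf_fd;
}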

Note that the virtio-gpu driver registers a move_notify() callback
to track location changes associated with the scanout buffer and
sends attach/detach backing cmds to Qemu when appropriate.
Synchronization (that is, ensuring that the Guest and Host are not
using the scanout buffer at the same time) is handled by pinning/
unpinning the dmabuf as part of plane updates and by using a fence
in the resource_flush cmd.
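
The importer side of this follows the generic dynamic dma-buf pattern
(dma_buf_dynamic_attach(), dma_buf_pin()/dma_buf_unpin(), move_notify).
A rough sketch is below; the virtgpu_* names and the detach-backing
helper are placeholders for illustration and not the actual functions
added in this series.

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/dma-resv.h>
#include <linux/err.h>
#include <linux/scatterlist.h>

/* Placeholder for the driver-side state of an imported object. */
struct virtgpu_import_ctx {
	struct dma_buf_attachment *attach;
	struct sg_table *sgt;
};

static void virtgpu_move_notify(struct dma_buf_attachment *attach)
{
	struct virtgpu_import_ctx *ctx = attach->importer_priv;

	/*
	 * The exporter is about to move the backing store: unmap it here.
	 * In this series, this is also where a detach-backing cmd would be
	 * sent to the host (VIRTIO_GPU_CMD_RESOURCE_DETACH_BACKING).
	 */
	if (ctx->sgt) {
		dma_buf_unmap_attachment(attach, ctx->sgt, DMA_BIDIRECTIONAL);
		ctx->sgt = NULL;
	}
}

static const struct dma_buf_attach_ops virtgpu_attach_ops = {
	.allow_peer2peer = true,	/* needed for P2P DMA (e.g. VRAM) imports */
	.move_notify = virtgpu_move_notify,
};

static int virtgpu_import_pin_and_map(struct dma_buf *dmabuf,
				      struct device *dev,
				      struct virtgpu_import_ctx *ctx)
{
	int ret;

	ctx->attach = dma_buf_dynamic_attach(dmabuf, dev,
					     &virtgpu_attach_ops, ctx);
	if (IS_ERR(ctx->attach))
		return PTR_ERR(ctx->attach);

	ret = dma_resv_lock(dmabuf->resv, NULL);
	if (ret)
		return ret;

	/* Pin around plane updates so the backing cannot move mid-scanout. */
	ret = dma_buf_pin(ctx->attach);
	if (!ret) {
		ctx->sgt = dma_buf_map_attachment(ctx->attach,
						  DMA_BIDIRECTIONAL);
		if (IS_ERR(ctx->sgt)) {
			ret = PTR_ERR(ctx->sgt);
			ctx->sgt = NULL;
			dma_buf_unpin(ctx->attach);
		}
	}
	dma_resv_unlock(dmabuf->resv);
	return ret;
}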

This series is available at:
https://gitlab.freedesktop.org/Vivek/drm-tip/-/commits/virtgpu_import_rfc

along with additional patches for Qemu and Spice here:
https://gitlab.freedesktop.org/Vivek/qemu/-/commits/virtgpu_dmabuf_pcidev
https://gitlab.freedesktop.org/Vivek/spice/-/commits/encode_dmabuf_v4 

Patchset overview:

Patch 1:   Implement VIRTIO_GPU_CMD_RESOURCE_DETACH_BACKING cmd
Patch 2-3: Helpers to initialize, import, free imported object
Patch 4-5: Import and use buffers from other devices for scanout
Patch 6-7: Have udmabuf driver create dmabuf from PCI bars for P2P DMA

This series is tested using the following method:
- Run Qemu with the following relevant options:
  qemu-system-x86_64 -m 4096m ....
  -device vfio-pci,host=0000:03:00.0
  -device virtio-vga,max_outputs=1,blob=true,xres=1920,yres=1080
  -spice port=3001,gl=on,disable-ticketing=on,preferred-codec=gstreamer:h264
  -object memory-backend-memfd,id=mem1,size=4096M
  -machine memory-backend=mem1 ...
- Run upstream Weston with the following options in the Guest VM:
  ./weston --drm-device=card1 --additional-devices=card0

where card1 is a DG2 dGPU (passthrough'd and using xe driver in Guest VM),
card0 is virtio-gpu and the Host is using a RPL iGPU.

Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Dongwon Kim <dongwon.kim@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Cc: Rob Clark <robdclark@chromium.org>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Oded Gabbay <ogabbay@kernel.org>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Michael Tretter <m.tretter@pengutronix.de>

Vivek Kasireddy (7):
  drm/virtio: Implement VIRTIO_GPU_CMD_RESOURCE_DETACH_BACKING cmd
  drm/virtio: Add a helper to map and note the dma addrs and lengths
  drm/virtio: Add helpers to initialize and free the imported object
  drm/virtio: Import prime buffers from other devices as guest blobs
  drm/virtio: Ensure that bo's backing store is valid while updating
    plane
  udmabuf/uapi: Add new ioctl to create a dmabuf from PCI bar regions
  udmabuf: Implement UDMABUF_CREATE_LIST_FOR_PCIDEV ioctl

 drivers/dma-buf/udmabuf.c              | 122 ++++++++++++++++--
 drivers/gpu/drm/virtio/virtgpu_drv.h   |   8 ++
 drivers/gpu/drm/virtio/virtgpu_plane.c |  56 ++++++++-
 drivers/gpu/drm/virtio/virtgpu_prime.c | 167 ++++++++++++++++++++++++-
 drivers/gpu/drm/virtio/virtgpu_vq.c    |  15 +++
 include/uapi/linux/udmabuf.h           |  11 +-
 6 files changed, 368 insertions(+), 11 deletions(-)

Comments

Gurchetan Singh May 23, 2024, 9:33 p.m. UTC | #1
On Thu, Mar 28, 2024 at 2:01 AM Vivek Kasireddy <vivek.kasireddy@intel.com>
wrote:

> Having virtio-gpu import scanout buffers (via prime) from other
> devices means that we'd be adding a head to headless GPUs assigned
> to a Guest VM or additional heads to regular GPU devices that are
> passthrough'd to the Guest. In these cases, the Guest compositor
> can render into the scanout buffer using a primary GPU and has the
> secondary GPU (virtio-gpu) import it for display purposes.
>
> The main advantage with this is that the imported scanout buffer can
> either be displayed locally on the Host (e.g, using Qemu + GTK UI)
> or encoded and streamed to a remote client (e.g, Qemu + Spice UI).
> Note that since Qemu uses udmabuf driver, there would be no copies
> made of the scanout buffer as it is displayed. This should be
> possible even when it might reside in device memory such has VRAM.
>
> The specific use-case that can be supported with this series is when
> running Weston or other guest compositors with "additional-devices"
> feature (./weston --drm-device=card1 --additional-devices=card0).
> More info about this feature can be found at:
> https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/736
>
> In the above scenario, card1 could be a dGPU or an iGPU and card0
> would be virtio-gpu in KMS only mode. However, the case where this
> patch series could be particularly useful is when card1 is a GPU VF
> that needs to share its scanout buffer (in a zero-copy way) with the
> GPU PF on the Host. Or, it can also be useful when the scanout buffer
> needs to be shared between any two GPU devices (assuming one of them
> is assigned to a Guest VM) as long as they are P2P DMA compatible.
>

Is passthrough iGPU-only or passthrough dGPU-only something you intend to
use?

If it's a dGPU + iGPU setup, then the way other people seem to do it is a
"virtualized" iGPU (via virgl/gfxstream/take your pick) and pass-through
the dGPU.

For example, AMD seems to use virgl to allocate and import into the dGPU.

https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23896
https://lore.kernel.org/all/20231221100016.4022353-1-julia.zhang@amd.com/

ChromeOS also uses that method (see crrev.com/c/3764931) [cc: dGPU
architect +Dominik Behr <dbehr@google.com>]

So if iGPU + dGPU is the primary use case, you should be able to use these
methods as well.  The model would be "virtualized iGPU" + passthrough dGPU,
not split SoCs.


> As part of the import, the virtio-gpu driver shares the dma
> addresses and lengths with Qemu which then determines whether the
> memory region they belong to is owned by a PCI device or whether it
> is part of the Guest's system ram. If it is the former, it identifies
> the devid (or bdf) and bar and provides this info (along with offsets
> and sizes) to the udmabuf driver. In the latter case, instead of the
> the devid and bar it provides the memfd. The udmabuf driver then
> creates a dmabuf using this info that Qemu shares with Spice for
> encode via Gstreamer.
>
> Note that the virtio-gpu driver registers a move_notify() callback
> to track location changes associated with the scanout buffer and
> sends attach/detach backing cmds to Qemu when appropriate. And,
> synchronization (that is, ensuring that Guest and Host are not
> using the scanout buffer at the same time) is ensured by pinning/
> unpinning the dmabuf as part of plane update and using a fence
> in resource_flush cmd.


I'm not sure how QEMU's display paths work, but with crosvm if you share
the guest-created dmabuf with the display, and the guest moves the backing
pages, the only recourse is to destroy the surface and show a black screen
to the user: not the best experience.

Only amdgpu calls dma_buf_move_notify(), and you're probably testing on
Intel only, so you may not be hitting that code path anyways.  I forgot the
exact reason, but apparently udmabuf may not work with amdgpu displays and
it seems the virtualized iGPU + dGPU is the way to go for amdgpu anyways.
So I recommend just pinning the buffer for the lifetime of the import for
simplicity and correctness.


> This series is available at:
> https://gitlab.freedesktop.org/Vivek/drm-tip/-/commits/virtgpu_import_rfc
>
> along with additional patches for Qemu and Spice here:
> https://gitlab.freedesktop.org/Vivek/qemu/-/commits/virtgpu_dmabuf_pcidev
> https://gitlab.freedesktop.org/Vivek/spice/-/commits/encode_dmabuf_v4
>
> Patchset overview:
>
> Patch 1:   Implement VIRTIO_GPU_CMD_RESOURCE_DETACH_BACKING cmd
> Patch 2-3: Helpers to initalize, import, free imported object
> Patch 4-5: Import and use buffers from other devices for scanout
> Patch 6-7: Have udmabuf driver create dmabuf from PCI bars for P2P DMA
>
> This series is tested using the following method:
> - Run Qemu with the following relevant options:
>   qemu-system-x86_64 -m 4096m ....
>   -device vfio-pci,host=0000:03:00.0
>   -device virtio-vga,max_outputs=1,blob=true,xres=1920,yres=1080
>   -spice
> port=3001,gl=on,disable-ticketing=on,preferred-codec=gstreamer:h264
>   -object memory-backend-memfd,id=mem1,size=4096M
>   -machine memory-backend=mem1 ...
> - Run upstream Weston with the following options in the Guest VM:
>   ./weston --drm-device=card1 --additional-devices=card0
>
> where card1 is a DG2 dGPU (passthrough'd and using xe driver in Guest VM),
> card0 is virtio-gpu and the Host is using a RPL iGPU.
>
> Cc: Gerd Hoffmann <kraxel@redhat.com>
> Cc: Dongwon Kim <dongwon.kim@intel.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: Christian Koenig <christian.koenig@amd.com>
> Cc: Dmitry Osipenko <dmitry.osipenko@collabora.com>
> Cc: Rob Clark <robdclark@chromium.org>
> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> Cc: Oded Gabbay <ogabbay@kernel.org>
> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> Cc: Michael Tretter <m.tretter@pengutronix.de>
>
> Vivek Kasireddy (7):
>   drm/virtio: Implement VIRTIO_GPU_CMD_RESOURCE_DETACH_BACKING cmd
>   drm/virtio: Add a helper to map and note the dma addrs and lengths
>   drm/virtio: Add helpers to initialize and free the imported object
>   drm/virtio: Import prime buffers from other devices as guest blobs
>   drm/virtio: Ensure that bo's backing store is valid while updating
>     plane
>   udmabuf/uapi: Add new ioctl to create a dmabuf from PCI bar regions
>   udmabuf: Implement UDMABUF_CREATE_LIST_FOR_PCIDEV ioctl
>
>  drivers/dma-buf/udmabuf.c              | 122 ++++++++++++++++--
>  drivers/gpu/drm/virtio/virtgpu_drv.h   |   8 ++
>  drivers/gpu/drm/virtio/virtgpu_plane.c |  56 ++++++++-
>  drivers/gpu/drm/virtio/virtgpu_prime.c | 167 ++++++++++++++++++++++++-
>  drivers/gpu/drm/virtio/virtgpu_vq.c    |  15 +++
>  include/uapi/linux/udmabuf.h           |  11 +-
>  6 files changed, 368 insertions(+), 11 deletions(-)
>
> --
> 2.43.0
>
>
Vivek Kasireddy May 24, 2024, 6:56 a.m. UTC | #2
Hi Gurchetan,

Thank you for taking a look at this patch series!



On Thu, Mar 28, 2024 at 2:01 AM Vivek Kasireddy <vivek.kasireddy@intel.com> wrote:
Having virtio-gpu import scanout buffers (via prime) from other
devices means that we'd be adding a head to headless GPUs assigned
to a Guest VM or additional heads to regular GPU devices that are
passthrough'd to the Guest. In these cases, the Guest compositor
can render into the scanout buffer using a primary GPU and has the
secondary GPU (virtio-gpu) import it for display purposes.

The main advantage with this is that the imported scanout buffer can
either be displayed locally on the Host (e.g, using Qemu + GTK UI)
or encoded and streamed to a remote client (e.g, Qemu + Spice UI).
Note that since Qemu uses udmabuf driver, there would be no copies
made of the scanout buffer as it is displayed. This should be
possible even when it might reside in device memory such has VRAM.

The specific use-case that can be supported with this series is when
running Weston or other guest compositors with "additional-devices"
feature (./weston --drm-device=card1 --additional-devices=card0).
More info about this feature can be found at:
https://gitlab.freedesktop.org/wayland/weston/-/merge_requests/736

In the above scenario, card1 could be a dGPU or an iGPU and card0
would be virtio-gpu in KMS only mode. However, the case where this
patch series could be particularly useful is when card1 is a GPU VF
that needs to share its scanout buffer (in a zero-copy way) with the
GPU PF on the Host. Or, it can also be useful when the scanout buffer
needs to be shared between any two GPU devices (assuming one of them
is assigned to a Guest VM) as long as they are P2P DMA compatible.

Is passthrough iGPU-only or passthrough dGPU-only something you intend to use?
Our main use-case involves passing through a headless dGPU VF device and sharing
the Guest compositor's scanout buffer with the dGPU PF device on the Host. The same
goal applies to the headless iGPU VF and iGPU PF case as well.

However, using a combination of iGPU and dGPU where either of them can be passthrough’d
to the Guest is something I think can be supported with this patch series as well.

If it's a dGPU + iGPU setup, then the way other people seem to do it is a "virtualized" iGPU (via virgl/gfxstream/take your pick) and pass-through the dGPU.

For example, AMD seems to use virgl to allocate and import into the dGPU.

https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23896
https://lore.kernel.org/all/20231221100016.4022353-1-julia.zhang@amd.com/

ChromeOS also uses that method (see crrev.com/c/3764931) [cc: dGPU architect +Dominik Behr <dbehr@google.com>]

So if iGPU + dGPU is the primary use case, you should be able to use these methods as well.  The model would "virtualized iGPU" + passthrough dGPU, not split SoCs.
In our use-case, the goal is to have only one primary GPU (passthrough’d iGPU/dGPU)
do all the rendering (using native DRI drivers) for clients/compositor and all the outputs
and share the scanout buffers with the secondary GPU (virtio-gpu). Since this is mostly
how Mutter (and also Weston) work in a multi-GPU setup, I am not sure if virgl is needed.

As part of the import, the virtio-gpu driver shares the dma
addresses and lengths with Qemu which then determines whether the
memory region they belong to is owned by a PCI device or whether it
is part of the Guest's system ram. If it is the former, it identifies
the devid (or bdf) and bar and provides this info (along with offsets
and sizes) to the udmabuf driver. In the latter case, instead of the
the devid and bar it provides the memfd. The udmabuf driver then
creates a dmabuf using this info that Qemu shares with Spice for
encode via Gstreamer.

Note that the virtio-gpu driver registers a move_notify() callback
to track location changes associated with the scanout buffer and
sends attach/detach backing cmds to Qemu when appropriate. And,
synchronization (that is, ensuring that Guest and Host are not
using the scanout buffer at the same time) is ensured by pinning/
unpinning the dmabuf as part of plane update and using a fence
in resource_flush cmd.

I'm not sure how QEMU's display paths work, but with crosvm if you share the guest-created dmabuf with the display, and the guest moves the backing pages, the only recourse is the destroy the surface and show a black screen to the user: not the best thing experience wise.
Since Qemu GTK UI uses EGL, there is a blit done from the guest's scanout buffer onto an
EGL-backed buffer on the Host. So, this problem would not happen as of now.

Only amdgpu calls dma_buf_move_notfiy(..), and you're probably testing on Intel only, so you may not be hitting that code path anyways.
I have tested with the Xe driver in the Guest, which also calls dma_buf_move_notify(). However,
note that for dGPUs, both Xe and amdgpu migrate the scanout buffer from VRAM to system
memory as part of export, because virtio-gpu is not P2P compatible.

I forgot the exact reason, but apparently udmabuf may not work with amdgpu displays and it seems the virtualized iGPU + dGPU is the way to go for amdgpu anyways.
I am curious why udmabuf would not work with amdgpu?

So I recommend just pinning the buffer for the lifetime of the import for simplicity and correctness.
Yeah, in this patch series, the dmabuf is indeed pinned but only for a short duration in the Guest –
just until the Host is done using it (blit or encode).

Thanks,
Vivek

This series is available at:
https://gitlab.freedesktop.org/Vivek/drm-tip/-/commits/virtgpu_import_rfc

along with additional patches for Qemu and Spice here:
https://gitlab.freedesktop.org/Vivek/qemu/-/commits/virtgpu_dmabuf_pcidev
https://gitlab.freedesktop.org/Vivek/spice/-/commits/encode_dmabuf_v4

Patchset overview:

Patch 1:   Implement VIRTIO_GPU_CMD_RESOURCE_DETACH_BACKING cmd
Patch 2-3: Helpers to initalize, import, free imported object
Patch 4-5: Import and use buffers from other devices for scanout
Patch 6-7: Have udmabuf driver create dmabuf from PCI bars for P2P DMA

This series is tested using the following method:
- Run Qemu with the following relevant options:
  qemu-system-x86_64 -m 4096m ....
  -device vfio-pci,host=0000:03:00.0
  -device virtio-vga,max_outputs=1,blob=true,xres=1920,yres=1080
  -spice port=3001,gl=on,disable-ticketing=on,preferred-codec=gstreamer:h264
  -object memory-backend-memfd,id=mem1,size=4096M
  -machine memory-backend=mem1 ...
- Run upstream Weston with the following options in the Guest VM:
  ./weston --drm-device=card1 --additional-devices=card0

where card1 is a DG2 dGPU (passthrough'd and using xe driver in Guest VM),
card0 is virtio-gpu and the Host is using a RPL iGPU.

Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Dongwon Kim <dongwon.kim@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Cc: Rob Clark <robdclark@chromium.org>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Oded Gabbay <ogabbay@kernel.org>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Michael Tretter <m.tretter@pengutronix.de>

Vivek Kasireddy (7):
  drm/virtio: Implement VIRTIO_GPU_CMD_RESOURCE_DETACH_BACKING cmd
  drm/virtio: Add a helper to map and note the dma addrs and lengths
  drm/virtio: Add helpers to initialize and free the imported object
  drm/virtio: Import prime buffers from other devices as guest blobs
  drm/virtio: Ensure that bo's backing store is valid while updating
    plane
  udmabuf/uapi: Add new ioctl to create a dmabuf from PCI bar regions
  udmabuf: Implement UDMABUF_CREATE_LIST_FOR_PCIDEV ioctl

 drivers/dma-buf/udmabuf.c              | 122 ++++++++++++++++--
 drivers/gpu/drm/virtio/virtgpu_drv.h   |   8 ++
 drivers/gpu/drm/virtio/virtgpu_plane.c |  56 ++++++++-
 drivers/gpu/drm/virtio/virtgpu_prime.c | 167 ++++++++++++++++++++++++-
 drivers/gpu/drm/virtio/virtgpu_vq.c    |  15 +++
 include/uapi/linux/udmabuf.h           |  11 +-
 6 files changed, 368 insertions(+), 11 deletions(-)

--
2.43.0
Vivek Kasireddy May 24, 2024, 6:33 p.m. UTC | #3
Hi,

Sorry, my previous reply got messed up as a result of HTML formatting. This is
a plain text version of the same reply.

> 
> 
> 	Having virtio-gpu import scanout buffers (via prime) from other
> 	devices means that we'd be adding a head to headless GPUs assigned
> 	to a Guest VM or additional heads to regular GPU devices that are
> 	passthrough'd to the Guest. In these cases, the Guest compositor
> 	can render into the scanout buffer using a primary GPU and has the
> 	secondary GPU (virtio-gpu) import it for display purposes.
> 
> 	The main advantage with this is that the imported scanout buffer can
> 	either be displayed locally on the Host (e.g, using Qemu + GTK UI)
> 	or encoded and streamed to a remote client (e.g, Qemu + Spice UI).
> 	Note that since Qemu uses udmabuf driver, there would be no
> copies
> 	made of the scanout buffer as it is displayed. This should be
> 	possible even when it might reside in device memory such has
> VRAM.
> 
> 	The specific use-case that can be supported with this series is when
> 	running Weston or other guest compositors with "additional-devices"
> 	feature (./weston --drm-device=card1 --additional-devices=card0).
> 	More info about this feature can be found at:
> 	https://gitlab.freedesktop.org/wayland/weston/-
> /merge_requests/736
> 
> 	In the above scenario, card1 could be a dGPU or an iGPU and card0
> 	would be virtio-gpu in KMS only mode. However, the case where this
> 	patch series could be particularly useful is when card1 is a GPU VF
> 	that needs to share its scanout buffer (in a zero-copy way) with the
> 	GPU PF on the Host. Or, it can also be useful when the scanout buffer
> 	needs to be shared between any two GPU devices (assuming one of
> them
> 	is assigned to a Guest VM) as long as they are P2P DMA compatible.
> 
> 
> 
> Is passthrough iGPU-only or passthrough dGPU-only something you intend to
> use?
Our main use-case involves passing through a headless dGPU VF device and sharing
the Guest compositor's scanout buffer with the dGPU PF device on the Host. The same
goal applies to the headless iGPU VF and iGPU PF case as well.

However, using a combination of iGPU and dGPU where either of them can be passthrough’d
to the Guest is something I think can be supported with this patch series as well.

> 
> If it's a dGPU + iGPU setup, then the way other people seem to do it is a
> "virtualized" iGPU (via virgl/gfxstream/take your pick) and pass-through the
> dGPU.
> 
> For example, AMD seems to use virgl to allocate and import into the dGPU.
> 
> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23896
> 
> https://lore.kernel.org/all/20231221100016.4022353-1-
> julia.zhang@amd.com/
> 
> 
> ChromeOS also uses that method (see crrev.com/c/3764931
> <http://crrev.com/c/3764931> ) [cc: dGPU architect +Dominik Behr
> <mailto:dbehr@google.com> ]
> 
> So if iGPU + dGPU is the primary use case, you should be able to use these
> methods as well.  The model would "virtualized iGPU" + passthrough dGPU,
> not split SoCs.
In our use-case, the goal is to have only one primary GPU (passthrough’d iGPU/dGPU)
do all the rendering (using native DRI drivers) for clients/compositor and all the outputs
and share the scanout buffers with the secondary GPU (virtio-gpu). Since this is mostly
how Mutter (and also Weston) work in a multi-GPU setup, I am not sure if virgl is needed.

And, doing it this way means that no other userspace components need to be modified
on either the Guest or the Host.

> 
> 
> 
> 	As part of the import, the virtio-gpu driver shares the dma
> 	addresses and lengths with Qemu which then determines whether
> the
> 	memory region they belong to is owned by a PCI device or whether it
> 	is part of the Guest's system ram. If it is the former, it identifies
> 	the devid (or bdf) and bar and provides this info (along with offsets
> 	and sizes) to the udmabuf driver. In the latter case, instead of the
> 	the devid and bar it provides the memfd. The udmabuf driver then
> 	creates a dmabuf using this info that Qemu shares with Spice for
> 	encode via Gstreamer.
> 
> 	Note that the virtio-gpu driver registers a move_notify() callback
> 	to track location changes associated with the scanout buffer and
> 	sends attach/detach backing cmds to Qemu when appropriate. And,
> 	synchronization (that is, ensuring that Guest and Host are not
> 	using the scanout buffer at the same time) is ensured by pinning/
> 	unpinning the dmabuf as part of plane update and using a fence
> 	in resource_flush cmd.
> 
> 
> I'm not sure how QEMU's display paths work, but with crosvm if you share
> the guest-created dmabuf with the display, and the guest moves the backing
> pages, the only recourse is the destroy the surface and show a black screen
> to the user: not the best thing experience wise.
Since Qemu GTK UI uses EGL, there is a blit done from the guest's scanout buffer onto an
EGL-backed buffer on the Host. So, this problem would not happen as of now.
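
For reference, the host-side import into EGL follows the standard
EGL_EXT_image_dma_buf_import path; a minimal sketch (via libepoxy, not
Qemu's actual egl-helpers code) for a linear, single-plane buffer could
look like the following, where fd/fourcc/stride are assumed to come from
the guest's blob resource:

#include <epoxy/egl.h>
#include <epoxy/gl.h>
#include <stdint.h>

static GLuint import_dmabuf_texture(EGLDisplay dpy, int dmabuf_fd,
				    int width, int height,
				    uint32_t fourcc, uint32_t stride)
{
	const EGLint attrs[] = {
		EGL_WIDTH, width,
		EGL_HEIGHT, height,
		EGL_LINUX_DRM_FOURCC_EXT, (EGLint)fourcc,
		EGL_DMA_BUF_PLANE0_FD_EXT, dmabuf_fd,
		EGL_DMA_BUF_PLANE0_OFFSET_EXT, 0,
		EGL_DMA_BUF_PLANE0_PITCH_EXT, (EGLint)stride,
		EGL_NONE
	};
	EGLImageKHR image;
	GLuint tex = 0;

	image = eglCreateImageKHR(dpy, EGL_NO_CONTEXT, EGL_LINUX_DMA_BUF_EXT,
				  (EGLClientBuffer)NULL, attrs);
	if (image == EGL_NO_IMAGE_KHR)
		return 0;

	glGenTextures(1, &tex);
	glBindTexture(GL_TEXTURE_2D, tex);
	/* Bind the dmabuf-backed image; the host blit then samples from 'tex'. */
	glEGLImageTargetTexture2DOES(GL_TEXTURE_2D, image);

	/* The texture keeps a reference to the underlying buffer. */
	eglDestroyImageKHR(dpy, image);
	return tex;
}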

> 
> Only amdgpu calls dma_buf_move_notfiy(..), and you're probably testing on
> Intel only, so you may not be hitting that code path anyways.
I have tested with the Xe driver in the Guest, which also calls dma_buf_move_notify(). But
note that for dGPUs, both Xe and amdgpu migrate the scanout buffer from VRAM to system
memory as part of export, because virtio-gpu is not P2P compatible. However, I am hoping
to relax this (the P2P check against virtio-gpu) in the Xe driver if it detects that it is
running in VF mode once the following patch series is merged:
https://lore.kernel.org/dri-devel/20240422063602.3690124-1-vivek.kasireddy@intel.com/

> I forgot the
> exact reason, but apparently udmabuf may not work with amdgpu displays
> and it seems the virtualized iGPU + dGPU is the way to go for amdgpu
> anyways. 
I would really like to know why udmabuf would not work with amdgpu?

> So I recommend just pinning the buffer for the lifetime of the
> import for simplicity and correctness.
Yeah, in this patch series, the dmabuf is indeed pinned, but only for a short duration in the Guest –
just until the Host is done using it (blit or encode).

Thanks,
Vivek

> 
> 
> 	This series is available at:
> 	https://gitlab.freedesktop.org/Vivek/drm-tip/-
> /commits/virtgpu_import_rfc
> 
> 	along with additional patches for Qemu and Spice here:
> 	https://gitlab.freedesktop.org/Vivek/qemu/-
> /commits/virtgpu_dmabuf_pcidev
> 	https://gitlab.freedesktop.org/Vivek/spice/-
> /commits/encode_dmabuf_v4
> 
> 	Patchset overview:
> 
> 	Patch 1:   Implement
> VIRTIO_GPU_CMD_RESOURCE_DETACH_BACKING cmd
> 	Patch 2-3: Helpers to initalize, import, free imported object
> 	Patch 4-5: Import and use buffers from other devices for scanout
> 	Patch 6-7: Have udmabuf driver create dmabuf from PCI bars for P2P
> DMA
> 
> 	This series is tested using the following method:
> 	- Run Qemu with the following relevant options:
> 	  qemu-system-x86_64 -m 4096m ....
> 	  -device vfio-pci,host=0000:03:00.0
> 	  -device virtio-vga,max_outputs=1,blob=true,xres=1920,yres=1080
> 	  -spice port=3001,gl=on,disable-ticketing=on,preferred-
> codec=gstreamer:h264
> 	  -object memory-backend-memfd,id=mem1,size=4096M
> 	  -machine memory-backend=mem1 ...
> 	- Run upstream Weston with the following options in the Guest VM:
> 	  ./weston --drm-device=card1 --additional-devices=card0
> 
> 	where card1 is a DG2 dGPU (passthrough'd and using xe driver in
> Guest VM),
> 	card0 is virtio-gpu and the Host is using a RPL iGPU.
> 
> 	Cc: Gerd Hoffmann <kraxel@redhat.com
> <mailto:kraxel@redhat.com> >
> 	Cc: Dongwon Kim <dongwon.kim@intel.com
> <mailto:dongwon.kim@intel.com> >
> 	Cc: Daniel Vetter <daniel.vetter@ffwll.ch
> <mailto:daniel.vetter@ffwll.ch> >
> 	Cc: Christian Koenig <christian.koenig@amd.com
> <mailto:christian.koenig@amd.com> >
> 	Cc: Dmitry Osipenko <dmitry.osipenko@collabora.com
> <mailto:dmitry.osipenko@collabora.com> >
> 	Cc: Rob Clark <robdclark@chromium.org
> <mailto:robdclark@chromium.org> >
> 	Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com
> <mailto:thomas.hellstrom@linux.intel.com> >
> 	Cc: Oded Gabbay <ogabbay@kernel.org
> <mailto:ogabbay@kernel.org> >
> 	Cc: Michal Wajdeczko <michal.wajdeczko@intel.com
> <mailto:michal.wajdeczko@intel.com> >
> 	Cc: Michael Tretter <m.tretter@pengutronix.de
> <mailto:m.tretter@pengutronix.de> >
> 
> 	Vivek Kasireddy (7):
> 	  drm/virtio: Implement
> VIRTIO_GPU_CMD_RESOURCE_DETACH_BACKING cmd
> 	  drm/virtio: Add a helper to map and note the dma addrs and
> lengths
> 	  drm/virtio: Add helpers to initialize and free the imported object
> 	  drm/virtio: Import prime buffers from other devices as guest blobs
> 	  drm/virtio: Ensure that bo's backing store is valid while updating
> 	    plane
> 	  udmabuf/uapi: Add new ioctl to create a dmabuf from PCI bar
> regions
> 	  udmabuf: Implement UDMABUF_CREATE_LIST_FOR_PCIDEV ioctl
> 
> 	 drivers/dma-buf/udmabuf.c              | 122 ++++++++++++++++--
> 	 drivers/gpu/drm/virtio/virtgpu_drv.h   |   8 ++
> 	 drivers/gpu/drm/virtio/virtgpu_plane.c |  56 ++++++++-
> 	 drivers/gpu/drm/virtio/virtgpu_prime.c | 167
> ++++++++++++++++++++++++-
> 	 drivers/gpu/drm/virtio/virtgpu_vq.c    |  15 +++
> 	 include/uapi/linux/udmabuf.h           |  11 +-
> 	 6 files changed, 368 insertions(+), 11 deletions(-)
> 
> 	--
> 	2.43.0
> 
>
Gurchetan Singh May 30, 2024, 1:49 a.m. UTC | #4
On Fri, May 24, 2024 at 11:33 AM Kasireddy, Vivek <vivek.kasireddy@intel.com>
wrote:

> Hi,
>
> Sorry, my previous reply got messed up as a result of HTML formatting.
> This is
> a plain text version of the same reply.
>
> >
> >
> >       Having virtio-gpu import scanout buffers (via prime) from other
> >       devices means that we'd be adding a head to headless GPUs assigned
> >       to a Guest VM or additional heads to regular GPU devices that are
> >       passthrough'd to the Guest. In these cases, the Guest compositor
> >       can render into the scanout buffer using a primary GPU and has the
> >       secondary GPU (virtio-gpu) import it for display purposes.
> >
> >       The main advantage with this is that the imported scanout buffer
> can
> >       either be displayed locally on the Host (e.g, using Qemu + GTK UI)
> >       or encoded and streamed to a remote client (e.g, Qemu + Spice UI).
> >       Note that since Qemu uses udmabuf driver, there would be no
> > copies
> >       made of the scanout buffer as it is displayed. This should be
> >       possible even when it might reside in device memory such has
> > VRAM.
> >
> >       The specific use-case that can be supported with this series is
> when
> >       running Weston or other guest compositors with "additional-devices"
> >       feature (./weston --drm-device=card1 --additional-devices=card0).
> >       More info about this feature can be found at:
> >       https://gitlab.freedesktop.org/wayland/weston/-
> > /merge_requests/736
> >
> >       In the above scenario, card1 could be a dGPU or an iGPU and card0
> >       would be virtio-gpu in KMS only mode. However, the case where this
> >       patch series could be particularly useful is when card1 is a GPU VF
> >       that needs to share its scanout buffer (in a zero-copy way) with
> the
> >       GPU PF on the Host. Or, it can also be useful when the scanout
> buffer
> >       needs to be shared between any two GPU devices (assuming one of
> > them
> >       is assigned to a Guest VM) as long as they are P2P DMA compatible.
> >
> >
> >
> > Is passthrough iGPU-only or passthrough dGPU-only something you intend to
> > use?
> Our main use-case involves passthrough’g a headless dGPU VF device and
> sharing
> the Guest compositor’s scanout buffer with dGPU PF device on the Host.
> Same goal for
> headless iGPU VF to iGPU PF device as well.
>

Just to check my understanding: the same physical {i, d}GPU is partitioned
into the VF and PF, but the PF handles host-side display integration and
rendering?


> However, using a combination of iGPU and dGPU where either of them can be
> passthrough’d
> to the Guest is something I think can be supported with this patch series
> as well.
>
> >
> > If it's a dGPU + iGPU setup, then the way other people seem to do it is a
> > "virtualized" iGPU (via virgl/gfxstream/take your pick) and pass-through
> the
> > dGPU.
> >
> > For example, AMD seems to use virgl to allocate and import into the dGPU.
> >
> > https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/23896
> >
> > https://lore.kernel.org/all/20231221100016.4022353-1-
> > julia.zhang@amd.com/
> >
> >
> > ChromeOS also uses that method (see crrev.com/c/3764931
> > <http://crrev.com/c/3764931> ) [cc: dGPU architect +Dominik Behr
> > <mailto:dbehr@google.com> ]
> >
> > So if iGPU + dGPU is the primary use case, you should be able to use
> these
> > methods as well.  The model would "virtualized iGPU" + passthrough dGPU,
> > not split SoCs.
> In our use-case, the goal is to have only one primary GPU (passthrough’d
> iGPU/dGPU)
> do all the rendering (using native DRI drivers) for clients/compositor and
> all the outputs
> and share the scanout buffers with the secondary GPU (virtio-gpu). Since
> this is mostly
> how Mutter (and also Weston) work in a multi-GPU setup, I am not sure if
> virgl is needed.
>

I think you can probably use virgl with the PF and others probably will,
but supporting multiple methods in Linux is not unheard of.

Does your patchset need the Mesa kmsro patchset to function correctly?

https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9592

If so, I would try to get that reviewed first to meet DRM requirements (
https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#open-source-userspace-requirements).
You might explicitly call out the design decision you're making: ("We can
probably use virgl as the virtualized iGPU via PF, but that adds
unnecessary complexity b/c ______").

> And, doing it this way means that no other userspace components need to be
> modified
> on both the Guest and the Host.
>
> >
> >
> >
> >       As part of the import, the virtio-gpu driver shares the dma
> >       addresses and lengths with Qemu which then determines whether
> > the
> >       memory region they belong to is owned by a PCI device or whether it
> >       is part of the Guest's system ram. If it is the former, it
> identifies
> >       the devid (or bdf) and bar and provides this info (along with
> offsets
> >       and sizes) to the udmabuf driver. In the latter case, instead of
> the
> >       the devid and bar it provides the memfd. The udmabuf driver then
> >       creates a dmabuf using this info that Qemu shares with Spice for
> >       encode via Gstreamer.
> >
> >       Note that the virtio-gpu driver registers a move_notify() callback
> >       to track location changes associated with the scanout buffer and
> >       sends attach/detach backing cmds to Qemu when appropriate. And,
> >       synchronization (that is, ensuring that Guest and Host are not
> >       using the scanout buffer at the same time) is ensured by pinning/
> >       unpinning the dmabuf as part of plane update and using a fence
> >       in resource_flush cmd.
> >
> >
> > I'm not sure how QEMU's display paths work, but with crosvm if you share
> > the guest-created dmabuf with the display, and the guest moves the
> backing
> > pages, the only recourse is the destroy the surface and show a black
> screen
> > to the user: not the best thing experience wise.
> Since Qemu GTK UI uses EGL, there is a blit done from the guest’s scanout
> buffer onto an EGL
> backed buffer on the Host. So, this problem would not happen as of now.
>

The guest kernel doesn't know you're using the QEMU GTK UI + EGL
host-side.

If somebody wants to use the virtio-gpu import mechanism with lower-level
Wayland-based display integration, then the problem would occur.

Perhaps do that just to be safe, unless you have performance concerns.

>
> > Only amdgpu calls dma_buf_move_notfiy(..), and you're probably testing on
> > Intel only, so you may not be hitting that code path anyways.
> I have tested with the Xe driver in the Guest which also calls
> dma_buf_move_notfiy(). But
> note that for dGPUs, both Xe and amdgpu migrate the scanout buffer from
> vram to system
> memory as part of export, because virtio-gpu is not P2P compatible.
> However, I am hoping
> to relax this (p2p check against virtio-gpu) in Xe driver if it detects
> that it is running in
> VF mode once the following patch series is merged:
>
> https://lore.kernel.org/dri-devel/20240422063602.3690124-1-vivek.kasireddy@intel.com/
>
> > I forgot the
> > exact reason, but apparently udmabuf may not work with amdgpu displays
> > and it seems the virtualized iGPU + dGPU is the way to go for amdgpu
> > anyways.
> I would really like to know why udmabuf would not work with amdgpu?
>

It's just a rumor I heard, but the idea is udmabuf would be imported into
AMDGPU_GEM_DOMAIN_CPU only.

https://cgit.freedesktop.org/drm/drm-misc/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c#n333

"AMDGPU_GEM_DOMAIN_CPU: System memory that is not GPU accessible. Memory in
this pool could be swapped out to disk if there is pressure."

https://dri.freedesktop.org/docs/drm/gpu/amdgpu.html

Perhaps that limitation is artificial and unnecessary, and it may indeed
work.  I don't think anybody has tried...


>
> > So I recommend just pinning the buffer for the lifetime of the
> > import for simplicity and correctness.
> Yeah, in this patch series, the dmabuf is indeed pinned, but only for a
> short duration in the Guest –
> just until the Host is done using it (blit or encode).
>
> Thanks,
> Vivek
>
> >
> >
> >       This series is available at:
> >       https://gitlab.freedesktop.org/Vivek/drm-tip/-
> > /commits/virtgpu_import_rfc
> >
> >       along with additional patches for Qemu and Spice here:
> >       https://gitlab.freedesktop.org/Vivek/qemu/-
> > /commits/virtgpu_dmabuf_pcidev
> >       https://gitlab.freedesktop.org/Vivek/spice/-
> > /commits/encode_dmabuf_v4
> >
> >       Patchset overview:
> >
> >       Patch 1:   Implement
> > VIRTIO_GPU_CMD_RESOURCE_DETACH_BACKING cmd
> >       Patch 2-3: Helpers to initalize, import, free imported object
> >       Patch 4-5: Import and use buffers from other devices for scanout
> >       Patch 6-7: Have udmabuf driver create dmabuf from PCI bars for P2P
> > DMA
> >
> >       This series is tested using the following method:
> >       - Run Qemu with the following relevant options:
> >         qemu-system-x86_64 -m 4096m ....
> >         -device vfio-pci,host=0000:03:00.0
> >         -device virtio-vga,max_outputs=1,blob=true,xres=1920,yres=1080
> >         -spice port=3001,gl=on,disable-ticketing=on,preferred-
> > codec=gstreamer:h264
> >         -object memory-backend-memfd,id=mem1,size=4096M
> >         -machine memory-backend=mem1 ...
> >       - Run upstream Weston with the following options in the Guest VM:
> >         ./weston --drm-device=card1 --additional-devices=card0
> >
> >       where card1 is a DG2 dGPU (passthrough'd and using xe driver in
> > Guest VM),
> >       card0 is virtio-gpu and the Host is using a RPL iGPU.
> >
> >       Cc: Gerd Hoffmann <kraxel@redhat.com
> > <mailto:kraxel@redhat.com> >
> >       Cc: Dongwon Kim <dongwon.kim@intel.com
> > <mailto:dongwon.kim@intel.com> >
> >       Cc: Daniel Vetter <daniel.vetter@ffwll.ch
> > <mailto:daniel.vetter@ffwll.ch> >
> >       Cc: Christian Koenig <christian.koenig@amd.com
> > <mailto:christian.koenig@amd.com> >
> >       Cc: Dmitry Osipenko <dmitry.osipenko@collabora.com
> > <mailto:dmitry.osipenko@collabora.com> >
> >       Cc: Rob Clark <robdclark@chromium.org
> > <mailto:robdclark@chromium.org> >
> >       Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com
> > <mailto:thomas.hellstrom@linux.intel.com> >
> >       Cc: Oded Gabbay <ogabbay@kernel.org
> > <mailto:ogabbay@kernel.org> >
> >       Cc: Michal Wajdeczko <michal.wajdeczko@intel.com
> > <mailto:michal.wajdeczko@intel.com> >
> >       Cc: Michael Tretter <m.tretter@pengutronix.de
> > <mailto:m.tretter@pengutronix.de> >
> >
> >       Vivek Kasireddy (7):
> >         drm/virtio: Implement
> > VIRTIO_GPU_CMD_RESOURCE_DETACH_BACKING cmd
> >         drm/virtio: Add a helper to map and note the dma addrs and
> > lengths
> >         drm/virtio: Add helpers to initialize and free the imported
> object
> >         drm/virtio: Import prime buffers from other devices as guest
> blobs
> >         drm/virtio: Ensure that bo's backing store is valid while
> updating
> >           plane
> >         udmabuf/uapi: Add new ioctl to create a dmabuf from PCI bar
> > regions
> >         udmabuf: Implement UDMABUF_CREATE_LIST_FOR_PCIDEV ioctl
> >
> >        drivers/dma-buf/udmabuf.c              | 122 ++++++++++++++++--
> >        drivers/gpu/drm/virtio/virtgpu_drv.h   |   8 ++
> >        drivers/gpu/drm/virtio/virtgpu_plane.c |  56 ++++++++-
> >        drivers/gpu/drm/virtio/virtgpu_prime.c | 167
> > ++++++++++++++++++++++++-
> >        drivers/gpu/drm/virtio/virtgpu_vq.c    |  15 +++
> >        include/uapi/linux/udmabuf.h           |  11 +-
> >        6 files changed, 368 insertions(+), 11 deletions(-)
> >
> >       --
> >       2.43.0
> >
> >
>
>
Vivek Kasireddy May 30, 2024, 7:21 a.m. UTC | #5
Hi Gurchetan,

> 
> On Fri, May 24, 2024 at 11:33 AM Kasireddy, Vivek
> <vivek.kasireddy@intel.com <mailto:vivek.kasireddy@intel.com> > wrote:
> 
> 
> 	Hi,
> 
> 	Sorry, my previous reply got messed up as a result of HTML
> formatting. This is
> 	a plain text version of the same reply.
> 
> 	>
> 	>
> 	>       Having virtio-gpu import scanout buffers (via prime) from other
> 	>       devices means that we'd be adding a head to headless GPUs
> assigned
> 	>       to a Guest VM or additional heads to regular GPU devices that
> are
> 	>       passthrough'd to the Guest. In these cases, the Guest
> compositor
> 	>       can render into the scanout buffer using a primary GPU and has
> the
> 	>       secondary GPU (virtio-gpu) import it for display purposes.
> 	>
> 	>       The main advantage with this is that the imported scanout
> buffer can
> 	>       either be displayed locally on the Host (e.g, using Qemu + GTK
> UI)
> 	>       or encoded and streamed to a remote client (e.g, Qemu + Spice
> UI).
> 	>       Note that since Qemu uses udmabuf driver, there would be no
> 	> copies
> 	>       made of the scanout buffer as it is displayed. This should be
> 	>       possible even when it might reside in device memory such has
> 	> VRAM.
> 	>
> 	>       The specific use-case that can be supported with this series is
> when
> 	>       running Weston or other guest compositors with "additional-
> devices"
> 	>       feature (./weston --drm-device=card1 --additional-
> devices=card0).
> 	>       More info about this feature can be found at:
> 	>       https://gitlab.freedesktop.org/wayland/weston/-
> 	> /merge_requests/736
> 	>
> 	>       In the above scenario, card1 could be a dGPU or an iGPU and
> card0
> 	>       would be virtio-gpu in KMS only mode. However, the case
> where this
> 	>       patch series could be particularly useful is when card1 is a GPU
> VF
> 	>       that needs to share its scanout buffer (in a zero-copy way) with
> the
> 	>       GPU PF on the Host. Or, it can also be useful when the scanout
> buffer
> 	>       needs to be shared between any two GPU devices (assuming
> one of
> 	> them
> 	>       is assigned to a Guest VM) as long as they are P2P DMA
> compatible.
> 	>
> 	>
> 	>
> 	> Is passthrough iGPU-only or passthrough dGPU-only something you
> intend to
> 	> use?
> 	Our main use-case involves passthrough’g a headless dGPU VF device
> and sharing
> 	the Guest compositor’s scanout buffer with dGPU PF device on the
> Host. Same goal for
> 	headless iGPU VF to iGPU PF device as well.
> 
> 
> 
> Just to check my understanding: the same physical {i, d}GPU is partitioned
> into the VF and PF, but the PF handles host-side display integration and
> rendering?
Yes, that is mostly right. In a nutshell, the same physical GPU is partitioned
into one PF device and multiple VF devices. Only the PF device has access to
the display hardware and can do KMS (on the Host). The VF devices are
headless with no access to display hardware (cannot do KMS but can do render/
encode/decode) and are generally assigned (or passthrough'd) to the Guest VMs.
Some more details about this model can be found here:
https://lore.kernel.org/dri-devel/20231110182231.1730-1-michal.wajdeczko@intel.com/

> 
> 
> 	However, using a combination of iGPU and dGPU where either of
> them can be passthrough’d
> 	to the Guest is something I think can be supported with this patch
> series as well.
> 
> 	>
> 	> If it's a dGPU + iGPU setup, then the way other people seem to do it
> is a
> 	> "virtualized" iGPU (via virgl/gfxstream/take your pick) and pass-
> through the
> 	> dGPU.
> 	>
> 	> For example, AMD seems to use virgl to allocate and import into
> the dGPU.
> 	>
> 	> https://gitlab.freedesktop.org/mesa/mesa/-
> /merge_requests/23896
> 	>
> 	> https://lore.kernel.org/all/20231221100016.4022353-1-
> 	> julia.zhang@amd.com/ <http://julia.zhang@amd.com/>
> 	>
> 	>
> 	> ChromeOS also uses that method (see crrev.com/c/3764931
> <http://crrev.com/c/3764931>
> 	> <http://crrev.com/c/3764931> ) [cc: dGPU architect +Dominik Behr
> 	> <mailto:dbehr@google.com <mailto:dbehr@google.com> > ]
> 	>
> 	> So if iGPU + dGPU is the primary use case, you should be able to
> use these
> 	> methods as well.  The model would "virtualized iGPU" +
> passthrough dGPU,
> 	> not split SoCs.
> 	In our use-case, the goal is to have only one primary GPU
> (passthrough’d iGPU/dGPU)
> 	do all the rendering (using native DRI drivers) for clients/compositor
> and all the outputs
> 	and share the scanout buffers with the secondary GPU (virtio-gpu).
> Since this is mostly
> 	how Mutter (and also Weston) work in a multi-GPU setup, I am not
> sure if virgl is needed.
> 
> 
> 
> I think you can probably use virgl with the PF and others probably will, but
> supporting multiple methods in Linux is not unheard of.
In our case, we have an alternative SR-IOV based GPU virtualization/partitioning
model (as described above) where a Guest VM will have access to a hardware-accelerated
GPU VF device for its rendering/encode/decode needs. So, in this situation, using
virgl will become redundant and unnecessary.

And, in this model, we intend to use virtio-gpu for KMS in the Guest VM (since the
GPU VF device cannot do KMS) with the addition of this patchset. However, note that,
since not all GPU SKUs/versions have the SRIOV capability, we plan on using virgl in
those cases where it becomes necessary.

> 
> Does your patchset need the Mesa kmsro patchset to function correctly?
> 
> https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9592
This patchset is an alternative proposal. So, KMSRO would not be needed.
AFAICS, the above MR is mainly stalled because KMSRO uses dumb buffers
which are not suitable for hardware-based rendering in all cases. And, KMSRO
is not really helpful performance-wise with dGPUs, as it forces most buffers to
be allocated from system memory.

> 
> 
> If so, I would try to get that reviewed first to meet DRM requirements
> (https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#open-source-
> userspace-requirements).  You might explicitly call out the design decision
> you're making: ("We can probably use virgl as the virtualized iGPU via PF, but
> that adds unnecessary complexity b/c ______").
As I described above, what we have is an alternative GPU virtualization scheme
where virgl is not necessary if SRIOV capability is available. And, as mentioned
earlier, I have tested this series with Mutter/Gnome-shell (upstream master)
(plus one small patch: https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/3745)
and no other changes to any other userspace components on Host and Guest.

> 
> 
> 	And, doing it this way means that no other userspace components
> need to be modified
> 	on both the Guest and the Host.
> 
> 	>
> 	>
> 	>
> 	>       As part of the import, the virtio-gpu driver shares the dma
> 	>       addresses and lengths with Qemu which then determines
> whether
> 	> the
> 	>       memory region they belong to is owned by a PCI device or
> whether it
> 	>       is part of the Guest's system ram. If it is the former, it identifies
> 	>       the devid (or bdf) and bar and provides this info (along with
> offsets
> 	>       and sizes) to the udmabuf driver. In the latter case, instead of
> the
> 	>       the devid and bar it provides the memfd. The udmabuf driver
> then
> 	>       creates a dmabuf using this info that Qemu shares with Spice
> for
> 	>       encode via Gstreamer.
> 	>
> 	>       Note that the virtio-gpu driver registers a move_notify() callback
> 	>       to track location changes associated with the scanout buffer and
> 	>       sends attach/detach backing cmds to Qemu when appropriate.
> And,
> 	>       synchronization (that is, ensuring that Guest and Host are not
> 	>       using the scanout buffer at the same time) is ensured by
> pinning/
> 	>       unpinning the dmabuf as part of plane update and using a fence
> 	>       in resource_flush cmd.
> 	>
> 	>
> 	> I'm not sure how QEMU's display paths work, but with crosvm if
> you share
> 	> the guest-created dmabuf with the display, and the guest moves
> the backing
> 	> pages, the only recourse is the destroy the surface and show a
> black screen
> 	> to the user: not the best thing experience wise.
> 	Since Qemu GTK UI uses EGL, there is a blit done from the guest’s
> scanout buffer onto an EGL
> 	backed buffer on the Host. So, this problem would not happen as of
> now.
> 
> 
> 
> The guest kernel doesn't know you're using the QEMU GTK UI + EGL host-
> side.
So, with blob=true, there is a dma fence in resource_flush() that gets associated
with the Blit/Encode on the Host. This guest dma fence should eventually be signalled
only when the Host is done using the guest's scanout buffer.
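
The per-flush fence lifecycle being relied on here follows the usual
dma_fence pattern; a rough, self-contained sketch is below. The names are
placeholders for illustration, not virtio-gpu's actual fence code.

#include <linux/atomic.h>
#include <linux/dma-fence.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(flush_fence_lock);
static u64 flush_fence_ctx;		/* from dma_fence_context_alloc(1) at init */
static atomic64_t flush_fence_seq;

static const char *flush_fence_name(struct dma_fence *f)
{
	return "virtgpu-flush-sketch";
}

static const struct dma_fence_ops flush_fence_ops = {
	.get_driver_name   = flush_fence_name,
	.get_timeline_name = flush_fence_name,
};

/* Created by the guest and emitted along with the resource_flush cmd. */
static struct dma_fence *emit_flush_fence(void)
{
	struct dma_fence *f = kzalloc(sizeof(*f), GFP_KERNEL);

	if (!f)
		return NULL;
	dma_fence_init(f, &flush_fence_ops, &flush_fence_lock,
		       flush_fence_ctx, atomic64_inc_return(&flush_fence_seq));
	return f;
}

/* Called from the virtqueue response path once the host's blit/encode is done. */
static void flush_done(struct dma_fence *f)
{
	dma_fence_signal(f);
	dma_fence_put(f);
}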

> 
> If somebody wants to use the virtio-gpu import mechanism with lower-level
> Wayland-based display integration, then the problem would occur.
Right, one way to address this issue is to prevent the Guest compositor from
reusing the scanout buffer (until the Host is done) and force it to pick a new
buffer (since Mesa GBM allows 4 backbuffers).
I have tried this experiment with KMSRO and Wayland-based Qemu UI previously
on iGPUs (and Weston) and noticed that the Guest FPS was getting halved:
https://lore.kernel.org/qemu-devel/20210913222036.3193732-1-vivek.kasireddy@intel.com/

and also discussed and proposed a solution which did not go anywhere:
https://lore.kernel.org/dri-devel/20210913233529.3194401-1-vivek.kasireddy@intel.com/

> 
> Perhaps, do that just to be safe unless you have performance concerns.
If you meant pinning the imported scanout buffer in the Guest, then yes,
that is something I am already doing in this patchset.

> 
> 
> 	>
> 	> Only amdgpu calls dma_buf_move_notfiy(..), and you're probably
> testing on
> 	> Intel only, so you may not be hitting that code path anyways.
> 	I have tested with the Xe driver in the Guest which also calls
> dma_buf_move_notfiy(). But
> 	note that for dGPUs, both Xe and amdgpu migrate the scanout buffer
> from vram to system
> 	memory as part of export, because virtio-gpu is not P2P compatible.
> However, I am hoping
> 	to relax this (p2p check against virtio-gpu) in Xe driver if it detects
> that it is running in
> 	VF mode once the following patch series is merged:
> 	https://lore.kernel.org/dri-devel/20240422063602.3690124-1-
> vivek.kasireddy@intel.com/
> 
> 	> I forgot the
> 	> exact reason, but apparently udmabuf may not work with amdgpu
> displays
> 	> and it seems the virtualized iGPU + dGPU is the way to go for
> amdgpu
> 	> anyways.
> 	I would really like to know why udmabuf would not work with
> amdgpu?
> 
> 
> 
> It's just a rumor I heard, but the idea is udmabuf would be imported into
> AMDGPU_GEM_DOMAIN_CPU only.
> 
> https://cgit.freedesktop.org/drm/drm-
> misc/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c#n333
> 
> "AMDGPU_GEM_DOMAIN_CPU: System memory that is not GPU accessible.
> Memory in this pool could be swapped out to disk if there is pressure."
> 
> https://dri.freedesktop.org/docs/drm/gpu/amdgpu.html
> 
> 
> Perhaps that limitation is artificial and unnecessary, and it may indeed work.
> I don't think anybody has tried...
Since the udmabuf driver properly pins the backing pages (from the memfd) for DMA,
I don't see any reason why amdgpu would not be able to import it.

Thanks,
Vivek

> 
> 
> 
> 
> 	> So I recommend just pinning the buffer for the lifetime of the
> 	> import for simplicity and correctness.
> 	Yeah, in this patch series, the dmabuf is indeed pinned, but only for a
> short duration in the Guest –
> 	just until the Host is done using it (blit or encode).
> 
> 	Thanks,
> 	Vivek
> 
> 	>
> 	>
> 	>       This series is available at:
> 	>       https://gitlab.freedesktop.org/Vivek/drm-tip/-
> 	> /commits/virtgpu_import_rfc
> 	>
> 	>       along with additional patches for Qemu and Spice here:
> 	>       https://gitlab.freedesktop.org/Vivek/qemu/-
> 	> /commits/virtgpu_dmabuf_pcidev
> 	>       https://gitlab.freedesktop.org/Vivek/spice/-
> 	> /commits/encode_dmabuf_v4
> 	>
> 	>       Patchset overview:
> 	>
> 	>       Patch 1:   Implement
> 	> VIRTIO_GPU_CMD_RESOURCE_DETACH_BACKING cmd
> 	>       Patch 2-3: Helpers to initalize, import, free imported object
> 	>       Patch 4-5: Import and use buffers from other devices for
> scanout
> 	>       Patch 6-7: Have udmabuf driver create dmabuf from PCI bars
> for P2P
> 	> DMA
> 	>
> 	>       This series is tested using the following method:
> 	>       - Run Qemu with the following relevant options:
> 	>         qemu-system-x86_64 -m 4096m ....
> 	>         -device vfio-pci,host=0000:03:00.0
> 	>         -device virtio-
> vga,max_outputs=1,blob=true,xres=1920,yres=1080
> 	>         -spice port=3001,gl=on,disable-ticketing=on,preferred-
> 	> codec=gstreamer:h264
> 	>         -object memory-backend-memfd,id=mem1,size=4096M
> 	>         -machine memory-backend=mem1 ...
> 	>       - Run upstream Weston with the following options in the Guest
> VM:
> 	>         ./weston --drm-device=card1 --additional-devices=card0
> 	>
> 	>       where card1 is a DG2 dGPU (passthrough'd and using xe driver
> in
> 	> Guest VM),
> 	>       card0 is virtio-gpu and the Host is using a RPL iGPU.
> 	>
> 	>       Cc: Gerd Hoffmann <kraxel@redhat.com>
> 	>       Cc: Dongwon Kim <dongwon.kim@intel.com>
> 	>       Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> 	>       Cc: Christian Koenig <christian.koenig@amd.com>
> 	>       Cc: Dmitry Osipenko <dmitry.osipenko@collabora.com>
> 	>       Cc: Rob Clark <robdclark@chromium.org>
> 	>       Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> 	>       Cc: Oded Gabbay <ogabbay@kernel.org>
> 	>       Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> 	>       Cc: Michael Tretter <m.tretter@pengutronix.de>
> 	>
> 	>       Vivek Kasireddy (7):
> 	>         drm/virtio: Implement
> 	> VIRTIO_GPU_CMD_RESOURCE_DETACH_BACKING cmd
> 	>         drm/virtio: Add a helper to map and note the dma addrs and
> 	> lengths
> 	>         drm/virtio: Add helpers to initialize and free the imported
> object
> 	>         drm/virtio: Import prime buffers from other devices as guest
> blobs
> 	>         drm/virtio: Ensure that bo's backing store is valid while
> updating
> 	>           plane
> 	>         udmabuf/uapi: Add new ioctl to create a dmabuf from PCI bar
> 	> regions
> 	>         udmabuf: Implement UDMABUF_CREATE_LIST_FOR_PCIDEV
> ioctl
> 	>
> 	>        drivers/dma-buf/udmabuf.c              | 122 ++++++++++++++++--
> 	>        drivers/gpu/drm/virtio/virtgpu_drv.h   |   8 ++
> 	>        drivers/gpu/drm/virtio/virtgpu_plane.c |  56 ++++++++-
> 	>        drivers/gpu/drm/virtio/virtgpu_prime.c | 167
> 	> ++++++++++++++++++++++++-
> 	>        drivers/gpu/drm/virtio/virtgpu_vq.c    |  15 +++
> 	>        include/uapi/linux/udmabuf.h           |  11 +-
> 	>        6 files changed, 368 insertions(+), 11 deletions(-)
> 	>
> 	>       --
> 	>       2.43.0
> 	>
> 	>
> 
>
Gurchetan Singh June 15, 2024, 12:19 a.m. UTC | #6
On Thu, May 30, 2024 at 12:21 AM Kasireddy, Vivek <vivek.kasireddy@intel.com>
wrote:

> Hi Gurchetan,
>
> >
> > On Fri, May 24, 2024 at 11:33 AM Kasireddy, Vivek
> > <vivek.kasireddy@intel.com> wrote:
> >
> >
> >       Hi,
> >
> >       Sorry, my previous reply got messed up as a result of HTML
> > formatting. This is
> >       a plain text version of the same reply.
> >
> >       >
> >       >
> >       >       Having virtio-gpu import scanout buffers (via prime) from
> other
> >       >       devices means that we'd be adding a head to headless GPUs
> > assigned
> >       >       to a Guest VM or additional heads to regular GPU devices
> that
> > are
> >       >       passthrough'd to the Guest. In these cases, the Guest
> > compositor
> >       >       can render into the scanout buffer using a primary GPU and
> has
> > the
> >       >       secondary GPU (virtio-gpu) import it for display purposes.
> >       >
> >       >       The main advantage with this is that the imported scanout
> > buffer can
> >       >       either be displayed locally on the Host (e.g, using Qemu +
> GTK
> > UI)
> >       >       or encoded and streamed to a remote client (e.g, Qemu +
> Spice
> > UI).
> >       >       Note that since Qemu uses udmabuf driver, there would be no
> >       > copies
> >       >       made of the scanout buffer as it is displayed. This should
> be
> >       >       possible even when it might reside in device memory such
> has
> >       > VRAM.
> >       >
> >       >       The specific use-case that can be supported with this
> series is
> > when
> >       >       running Weston or other guest compositors with "additional-
> > devices"
> >       >       feature (./weston --drm-device=card1 --additional-
> > devices=card0).
> >       >       More info about this feature can be found at:
> >       >       https://gitlab.freedesktop.org/wayland/weston/-
> >       > /merge_requests/736
> >       >
> >       >       In the above scenario, card1 could be a dGPU or an iGPU and
> > card0
> >       >       would be virtio-gpu in KMS only mode. However, the case
> > where this
> >       >       patch series could be particularly useful is when card1 is
> a GPU
> > VF
> >       >       that needs to share its scanout buffer (in a zero-copy
> way) with
> > the
> >       >       GPU PF on the Host. Or, it can also be useful when the
> scanout
> > buffer
> >       >       needs to be shared between any two GPU devices (assuming
> > one of
> >       > them
> >       >       is assigned to a Guest VM) as long as they are P2P DMA
> > compatible.
> >       >
> >       >
> >       >
> >       > Is passthrough iGPU-only or passthrough dGPU-only something you
> > intend to
> >       > use?
> >       Our main use-case involves passthrough’g a headless dGPU VF device
> > and sharing
> >       the Guest compositor’s scanout buffer with dGPU PF device on the
> > Host. Same goal for
> >       headless iGPU VF to iGPU PF device as well.
> >
> >
> >
> > Just to check my understanding: the same physical {i, d}GPU is
> partitioned
> > into the VF and PF, but the PF handles host-side display integration and
> > rendering?
> Yes, that is mostly right. In a nutshell, the same physical GPU is
> partitioned
> into one PF device and multiple VF devices. Only the PF device has access
> to
> the display hardware and can do KMS (on the Host). The VF devices are
> headless with no access to display hardware (cannot do KMS but can do
> render/
> encode/decode) and are generally assigned (or passthrough'd) to the Guest
> VMs.
> Some more details about this model can be found here:
>
> https://lore.kernel.org/dri-devel/20231110182231.1730-1-michal.wajdeczko@intel.com/
>
> >
> >
> >       However, using a combination of iGPU and dGPU where either of
> > them can be passthrough’d
> >       to the Guest is something I think can be supported with this patch
> > series as well.
> >
> >       >
> >       > If it's a dGPU + iGPU setup, then the way other people seem to
> do it
> > is a
> >       > "virtualized" iGPU (via virgl/gfxstream/take your pick) and pass-
> > through the
> >       > dGPU.
> >       >
> >       > For example, AMD seems to use virgl to allocate and import into
> > the dGPU.
> >       >
> >       > https://gitlab.freedesktop.org/mesa/mesa/-
> > /merge_requests/23896
> >       >
> >       > https://lore.kernel.org/all/20231221100016.4022353-1-
> >       > julia.zhang@amd.com/
> >       >
> >       >
> >       > ChromeOS also uses that method (see crrev.com/c/3764931)
> >       > [cc: dGPU architect +Dominik Behr <dbehr@google.com>]
> >       >
> >       > So if iGPU + dGPU is the primary use case, you should be able to
> > use these
> >       > methods as well.  The model would "virtualized iGPU" +
> > passthrough dGPU,
> >       > not split SoCs.
> >       In our use-case, the goal is to have only one primary GPU
> > (passthrough’d iGPU/dGPU)
> >       do all the rendering (using native DRI drivers) for
> clients/compositor
> > and all the outputs
> >       and share the scanout buffers with the secondary GPU (virtio-gpu).
> > Since this is mostly
> >       how Mutter (and also Weston) work in a multi-GPU setup, I am not
> > sure if virgl is needed.
> >
> >
> >
> > I think you can probably use virgl with the PF and others probably will,
> but
> > supporting multiple methods in Linux is not unheard of.
> In our case, we have an alternative SR-IOV based GPU
> virtualization/partitioning
> model (as described above) where a Guest VM will have access to a
> hardware-accelerated
> GPU VF device for its rendering/encode/decode needs. So, in this
> situation, using
> virgl will become redundant and unnecessary.
>
> And, in this model, we intend to use virtio-gpu for KMS in the Guest VM
> (since the
> GPU VF device cannot do KMS) with the addition of this patchset. However,
> note that,
> since not all GPU SKUs/versions have the SRIOV capability, we plan on
> using virgl in
> those cases where it becomes necessary.
>
> >
> > Does your patchset need the Mesa kmsro patchset to function correctly?
> >
> > https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9592
> This patchset is an alternative proposal. So, KMSRO would not be needed.
> AFAICS, the above MR is mainly stalled because KMSRO uses dumb buffers
> which are not suitable for hardware-based rendering in all cases. And,
> KMSRO
> is not really helpful performance-wise with dGPUs, as it forces most
> buffers to
> be allocated from system memory.
>

Previously, it was recommended when exploring VDMABUF: "the /important/
thing is that the driver which exports the dma-buf (and thus handles the
mappings) must be aware of the virtualization so it can properly coordinate
things with the host side."

https://patchwork.kernel.org/project/linux-media/patch/20210203073517.1908882-3-vivek.kasireddy@intel.com/#23975915


So that's why the KMSRO approach was tried  (virtio-gpu dumb allocations,
not i915).  But as you point out, nobody uses dumb buffers for
hardware-based rendering.

So, if you are going with i915 allocates + virtio-gpu imports, it should be
fine as long as you have fixed all the issues seen with i915 allocates +
VDMABUF imports. It seems your fixes add complexity in VFIO and other places,
but having virtio-gpu 3D + virgl allocate adds complexity to the Mesa-based
allocation paths (i915, amdgpu would all have to open the virtio-gpu render
node, pick a context type, etc.).

I would just do virtio-gpu allocates, since it only requires user-space
patches and no extra ioctls, but that reflects my preferences.  If the
mm/VFIO/QEMU people are fine with your approach, I see nothing wrong with
merging it.

The one caveat is that if someone uses a non-GTK/EGL host path, we'll have to
pin memory for the lifetime of the import, since knowing that RESOURCE_FLUSH is
done is not sufficient. But if you're the only one using it, it shouldn't be an
issue right now.
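
For concreteness, pinning only around the plane update (the model this series
follows) rather than for the lifetime of the import could look roughly like
the sketch below; the helper names are assumed and this is not the actual
patch. It presumes the dmabuf was attached dynamically (with importer ops).

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>

/* Sketch: pin an imported (dynamically attached) dmabuf for a plane update. */
static int pin_imported_bo(struct dma_buf_attachment *attach)
{
        int ret;

        dma_resv_lock(attach->dmabuf->resv, NULL);
        /* Prevents the exporter from moving the buffer while the host scans it out. */
        ret = dma_buf_pin(attach);
        dma_resv_unlock(attach->dmabuf->resv);
        return ret;
}

/* Sketch: drop the pin once the host is done with the buffer. */
static void unpin_imported_bo(struct dma_buf_attachment *attach)
{
        dma_resv_lock(attach->dmabuf->resv, NULL);
        dma_buf_unpin(attach);
        dma_resv_unlock(attach->dmabuf->resv);
}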


>
> >
> >
> > If so, I would try to get that reviewed first to meet DRM requirements
> > (https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#open-source-
> > userspace-requirements).  You might explicitly call out the design
> decision
> > you're making: ("We can probably use virgl as the virtualized iGPU via
> PF, but
> > that adds unnecessary complexity b/c ______").
> As I described above, what we have is an alternative GPU virtualization
> scheme
> where virgl is not necessary if SRIOV capability is available. And, as
> mentioned
> earlier, I have tested this series with Mutter/Gnome-shell (upstream
> master)
> (plus one small patch:
> https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/3745)
> and no other changes to any other userspace components on Host and Guest.
>
> >
> >
> >       And, doing it this way means that no other userspace components
> > need to be modified
> >       on both the Guest and the Host.
> >
> >       >
> >       >
> >       >
> >       >       As part of the import, the virtio-gpu driver shares the dma
> >       >       addresses and lengths with Qemu which then determines
> > whether
> >       > the
> >       >       memory region they belong to is owned by a PCI device or
> > whether it
> >       >       is part of the Guest's system ram. If it is the former, it
> identifies
> >       >       the devid (or bdf) and bar and provides this info (along
> with
> > offsets
> >       >       and sizes) to the udmabuf driver. In the latter case,
> instead of
> > the
> >       >       the devid and bar it provides the memfd. The udmabuf driver
> > then
> >       >       creates a dmabuf using this info that Qemu shares with
> Spice
> > for
> >       >       encode via Gstreamer.
> >       >
> >       >       Note that the virtio-gpu driver registers a move_notify()
> callback
> >       >       to track location changes associated with the scanout
> buffer and
> >       >       sends attach/detach backing cmds to Qemu when appropriate.
> > And,
> >       >       synchronization (that is, ensuring that Guest and Host are
> not
> >       >       using the scanout buffer at the same time) is ensured by
> > pinning/
> >       >       unpinning the dmabuf as part of plane update and using a
> fence
> >       >       in resource_flush cmd.
> >       >
> >       >
> >       > I'm not sure how QEMU's display paths work, but with crosvm if
> > you share
> >       > the guest-created dmabuf with the display, and the guest moves
> > the backing
> >       > pages, the only recourse is the destroy the surface and show a
> > black screen
> >       > to the user: not the best thing experience wise.
> >       Since Qemu GTK UI uses EGL, there is a blit done from the guest’s
> > scanout buffer onto an EGL
> >       backed buffer on the Host. So, this problem would not happen as of
> > now.
> >
> >
> >
> > The guest kernel doesn't know you're using the QEMU GTK UI + EGL host-
> > side.
> So, with blob=true, there is a dma fence in resource_flush() that gets
> associated
> with the Blit/Encode on the Host. This guest dma fence should eventually
> be signalled
> only when the Host is done using guest's scanout buffer.
>
> >
> > If somebody wants to use the virtio-gpu import mechanism with lower-level
> > Wayland-based display integration, then the problem would occur.
> Right, one way to address this issue is to prevent the Guest compositor
> from
> reusing the scanout buffer (until the Host is done) and forcing it to pick
> a new
> buffer (since Mesa GBM allows 4 backbuffers).
> I have tried this experiment with KMSRO and Wayland-based Qemu UI
> previously
> on iGPUs (and Weston) and noticed that the Guest FPS was getting halved:
>
> https://lore.kernel.org/qemu-devel/20210913222036.3193732-1-vivek.kasireddy@intel.com/
>
> and also discussed and proposed a solution which did not go anywhere:
>
> https://lore.kernel.org/dri-devel/20210913233529.3194401-1-vivek.kasireddy@intel.com/
>
> >
> > Perhaps, do that just to be safe unless you have performance concerns.
> If you meant pinning the imported scanout buffer in the Guest, then yes,
> that is something I am already doing in this patchset.
>
> >
> >
> >       >
> >       > Only amdgpu calls dma_buf_move_notfiy(..), and you're probably
> > testing on
> >       > Intel only, so you may not be hitting that code path anyways.
> >       I have tested with the Xe driver in the Guest which also calls
> > dma_buf_move_notfiy(). But
> >       note that for dGPUs, both Xe and amdgpu migrate the scanout buffer
> > from vram to system
> >       memory as part of export, because virtio-gpu is not P2P compatible.
> > However, I am hoping
> >       to relax this (p2p check against virtio-gpu) in Xe driver if it
> detects
> > that it is running in
> >       VF mode once the following patch series is merged:
> >       https://lore.kernel.org/dri-devel/20240422063602.3690124-1-
> > vivek.kasireddy@intel.com/
> >
> >       > I forgot the
> >       > exact reason, but apparently udmabuf may not work with amdgpu
> > displays
> >       > and it seems the virtualized iGPU + dGPU is the way to go for
> > amdgpu
> >       > anyways.
> >       I would really like to know why udmabuf would not work with
> > amdgpu?
> >
> >
> >
> > It's just a rumor I heard, but the idea is udmabuf would be imported into
> > AMDGPU_GEM_DOMAIN_CPU only.
> >
> > https://cgit.freedesktop.org/drm/drm-
> > misc/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c#n333
> >
> > "AMDGPU_GEM_DOMAIN_CPU: System memory that is not GPU accessible.
> > Memory in this pool could be swapped out to disk if there is pressure."
> >
> > https://dri.freedesktop.org/docs/drm/gpu/amdgpu.html
> >
> >
> > Perhaps that limitation is artificial and unnecessary, and it may indeed
> work.
> > I don't think anybody has tried...
> Since udmabuf driver properly pins the backing pages (from memfd) for DMA,
> I don't see any reason why amdgpu would not be able to import.
>
> Thanks,
> Vivek
>
> >
> >
> >
> >
> >       > So I recommend just pinning the buffer for the lifetime of the
> >       > import for simplicity and correctness.
> >       Yeah, in this patch series, the dmabuf is indeed pinned, but only
> for a
> > short duration in the Guest –
> >       just until the Host is done using it (blit or encode).
> >
> >       Thanks,
> >       Vivek
> >
> >       >
> >       >
> >       >       This series is available at:
> >       >       https://gitlab.freedesktop.org/Vivek/drm-tip/-
> >       > /commits/virtgpu_import_rfc
> >       >
> >       >       along with additional patches for Qemu and Spice here:
> >       >       https://gitlab.freedesktop.org/Vivek/qemu/-
> >       > /commits/virtgpu_dmabuf_pcidev
> >       >       https://gitlab.freedesktop.org/Vivek/spice/-
> >       > /commits/encode_dmabuf_v4
> >       >
> >       >       Patchset overview:
> >       >
> >       >       Patch 1:   Implement
> >       > VIRTIO_GPU_CMD_RESOURCE_DETACH_BACKING cmd
> >       >       Patch 2-3: Helpers to initalize, import, free imported
> object
> >       >       Patch 4-5: Import and use buffers from other devices for
> > scanout
> >       >       Patch 6-7: Have udmabuf driver create dmabuf from PCI bars
> > for P2P
> >       > DMA
> >       >
> >       >       This series is tested using the following method:
> >       >       - Run Qemu with the following relevant options:
> >       >         qemu-system-x86_64 -m 4096m ....
> >       >         -device vfio-pci,host=0000:03:00.0
> >       >         -device virtio-
> > vga,max_outputs=1,blob=true,xres=1920,yres=1080
> >       >         -spice port=3001,gl=on,disable-ticketing=on,preferred-
> >       > codec=gstreamer:h264
> >       >         -object memory-backend-memfd,id=mem1,size=4096M
> >       >         -machine memory-backend=mem1 ...
> >       >       - Run upstream Weston with the following options in the
> Guest
> > VM:
> >       >         ./weston --drm-device=card1 --additional-devices=card0
> >       >
> >       >       where card1 is a DG2 dGPU (passthrough'd and using xe
> driver
> > in
> >       > Guest VM),
> >       >       card0 is virtio-gpu and the Host is using a RPL iGPU.
> >       >
> >       >       Cc: Gerd Hoffmann <kraxel@redhat.com>
> >       >       Cc: Dongwon Kim <dongwon.kim@intel.com>
> >       >       Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> >       >       Cc: Christian Koenig <christian.koenig@amd.com>
> >       >       Cc: Dmitry Osipenko <dmitry.osipenko@collabora.com>
> >       >       Cc: Rob Clark <robdclark@chromium.org>
> >       >       Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> >       >       Cc: Oded Gabbay <ogabbay@kernel.org>
> >       >       Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> >       >       Cc: Michael Tretter <m.tretter@pengutronix.de>
> >       >
> >       >       Vivek Kasireddy (7):
> >       >         drm/virtio: Implement
> >       > VIRTIO_GPU_CMD_RESOURCE_DETACH_BACKING cmd
> >       >         drm/virtio: Add a helper to map and note the dma addrs
> and
> >       > lengths
> >       >         drm/virtio: Add helpers to initialize and free the
> imported
> > object
> >       >         drm/virtio: Import prime buffers from other devices as
> guest
> > blobs
> >       >         drm/virtio: Ensure that bo's backing store is valid while
> > updating
> >       >           plane
> >       >         udmabuf/uapi: Add new ioctl to create a dmabuf from PCI
> bar
> >       > regions
> >       >         udmabuf: Implement UDMABUF_CREATE_LIST_FOR_PCIDEV
> > ioctl
> >       >
> >       >        drivers/dma-buf/udmabuf.c              | 122
> ++++++++++++++++--
> >       >        drivers/gpu/drm/virtio/virtgpu_drv.h   |   8 ++
> >       >        drivers/gpu/drm/virtio/virtgpu_plane.c |  56 ++++++++-
> >       >        drivers/gpu/drm/virtio/virtgpu_prime.c | 167
> >       > ++++++++++++++++++++++++-
> >       >        drivers/gpu/drm/virtio/virtgpu_vq.c    |  15 +++
> >       >        include/uapi/linux/udmabuf.h           |  11 +-
> >       >        6 files changed, 368 insertions(+), 11 deletions(-)
> >       >
> >       >       --
> >       >       2.43.0
> >       >
> >       >
> >
> >
>
>
Vivek Kasireddy June 18, 2024, 7:49 a.m. UTC | #7
Hi Gurchetan,

> 
> On Thu, May 30, 2024 at 12:21 AM Kasireddy, Vivek
> <vivek.kasireddy@intel.com> wrote:
> 
> 
> 	Hi Gurchetan,
> 
> 	>
> 	> On Fri, May 24, 2024 at 11:33 AM Kasireddy, Vivek
> 	> <vivek.kasireddy@intel.com> wrote:
> 	>
> 	>
> 	>       Hi,
> 	>
> 	>       Sorry, my previous reply got messed up as a result of HTML
> 	> formatting. This is
> 	>       a plain text version of the same reply.
> 	>
> 	>       >
> 	>       >
> 	>       >       Having virtio-gpu import scanout buffers (via prime) from
> other
> 	>       >       devices means that we'd be adding a head to headless
> GPUs
> 	> assigned
> 	>       >       to a Guest VM or additional heads to regular GPU devices
> that
> 	> are
> 	>       >       passthrough'd to the Guest. In these cases, the Guest
> 	> compositor
> 	>       >       can render into the scanout buffer using a primary GPU
> and has
> 	> the
> 	>       >       secondary GPU (virtio-gpu) import it for display purposes.
> 	>       >
> 	>       >       The main advantage with this is that the imported scanout
> 	> buffer can
> 	>       >       either be displayed locally on the Host (e.g, using Qemu +
> GTK
> 	> UI)
> 	>       >       or encoded and streamed to a remote client (e.g, Qemu +
> Spice
> 	> UI).
> 	>       >       Note that since Qemu uses udmabuf driver, there would
> be no
> 	>       > copies
> 	>       >       made of the scanout buffer as it is displayed. This should
> be
> 	>       >       possible even when it might reside in device memory such
> has
> 	>       > VRAM.
> 	>       >
> 	>       >       The specific use-case that can be supported with this series
> is
> 	> when
> 	>       >       running Weston or other guest compositors with
> "additional-
> 	> devices"
> 	>       >       feature (./weston --drm-device=card1 --additional-
> 	> devices=card0).
> 	>       >       More info about this feature can be found at:
> 	>       >       https://gitlab.freedesktop.org/wayland/weston/-
> 	>       > /merge_requests/736
> 	>       >
> 	>       >       In the above scenario, card1 could be a dGPU or an iGPU
> and
> 	> card0
> 	>       >       would be virtio-gpu in KMS only mode. However, the case
> 	> where this
> 	>       >       patch series could be particularly useful is when card1 is a
> GPU
> 	> VF
> 	>       >       that needs to share its scanout buffer (in a zero-copy way)
> with
> 	> the
> 	>       >       GPU PF on the Host. Or, it can also be useful when the
> scanout
> 	> buffer
> 	>       >       needs to be shared between any two GPU devices
> (assuming
> 	> one of
> 	>       > them
> 	>       >       is assigned to a Guest VM) as long as they are P2P DMA
> 	> compatible.
> 	>       >
> 	>       >
> 	>       >
> 	>       > Is passthrough iGPU-only or passthrough dGPU-only
> something you
> > intend to
> >       > use?
> >       Our main use-case involves passthrough’g a headless dGPU VF
> device
> > and sharing
> >       the Guest compositor’s scanout buffer with dGPU PF device on
> the
> > Host. Same goal for
> >       headless iGPU VF to iGPU PF device as well.
> >
> >
> >
> > Just to check my understanding: the same physical {i, d}GPU is
> partitioned
> > into the VF and PF, but the PF handles host-side display integration
> and
> > rendering?
> Yes, that is mostly right. In a nutshell, the same physical GPU is
> partitioned
> into one PF device and multiple VF devices. Only the PF device has
> access to
> the display hardware and can do KMS (on the Host). The VF devices
> are
> headless with no access to display hardware (cannot do KMS but can
> do render/
> encode/decode) and are generally assigned (or passthrough'd) to the
> Guest VMs.
> Some more details about this model can be found here:
> https://lore.kernel.org/dri-devel/20231110182231.1730-1-
> michal.wajdeczko@intel.com/
> 
> >
> >
> >       However, using a combination of iGPU and dGPU where either
> of
> > them can be passthrough’d
> >       to the Guest is something I think can be supported with this
> patch
> > series as well.
> >
> > >
> > > If it's a dGPU + iGPU setup, then the way other people seem to
> do it
> > is a
> > > "virtualized" iGPU (via virgl/gfxstream/take your pick) and
> pass-
> > through the
> > > dGPU.
> > >
> > > For example, AMD seems to use virgl to allocate and import
> into
> > the dGPU.
> > >
> > > https://gitlab.freedesktop.org/mesa/mesa/-
> > /merge_requests/23896
> > >
> > > https://lore.kernel.org/all/20231221100016.4022353-1-
> > > julia.zhang@amd.com/
> > >
> > >
> > > ChromeOS also uses that method (see crrev.com/c/3764931)
> > > [cc: dGPU architect +Dominik Behr <dbehr@google.com>]
> > >
> > > So if iGPU + dGPU is the primary use case, you should be able
> to
> > use these
> > > methods as well.  The model would "virtualized iGPU" +
> > passthrough dGPU,
> > > not split SoCs.
> >       In our use-case, the goal is to have only one primary GPU
> > (passthrough’d iGPU/dGPU)
> >       do all the rendering (using native DRI drivers) for
> clients/compositor
> > and all the outputs
> > and share the scanout buffers with the secondary GPU (virtio-
> gpu).
> > Since this is mostly
> >       how Mutter (and also Weston) work in a multi-GPU setup, I am
> not
> > sure if virgl is needed.
> >
> >
> >
> > I think you can probably use virgl with the PF and others probably
> will, but
> > supporting multiple methods in Linux is not unheard of.
> In our case, we have an alternative SR-IOV based GPU
> virtualization/partitioning
> model (as described above) where a Guest VM will have access to a
> hardware-accelerated
> GPU VF device for its rendering/encode/decode needs. So, in this
> situation, using
> virgl will become redundant and unnecessary.
> 
> And, in this model, we intend to use virtio-gpu for KMS in the Guest
> VM (since the
> GPU VF device cannot do KMS) with the addition of this patchset.
> However, note that,
> since not all GPU SKUs/versions have the SRIOV capability, we plan
> on using virgl in
> those cases where it becomes necessary.
> 
> >
> > Does your patchset need the Mesa kmsro patchset to function
> correctly?
> >
> > https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9592
> > This patchset is an alternative proposal. So, KMSRO would not be needed.
> > AFAICS, the above MR is mainly stalled because KMSRO uses dumb buffers
> > which are not suitable for hardware-based rendering in all cases. And, KMSRO
> > is not really helpful performance-wise with dGPUs, as it forces most buffers
> > to be allocated from system memory.
> 
> Previously, it was recommended when exploring VDMABUF: "the
> /important/ thing is that the driver which exports the dma-buf (and thus
> handles the mappings) must be aware of the virtualization so it can properly
> coordinate things with the host side."
> 
> https://patchwork.kernel.org/project/linux-media/patch/20210203073517.1908882-3-
> vivek.kasireddy@intel.com/#23975915
Yeah, I remember that. It would be nice to have Gerd weigh in again because
this design is slightly different: it uses DMA addresses (and not pages) to make
it possible to share buffers allocated by the Guest in both system memory and VRAM.
And, as mentioned earlier, with the Qemu GTK UI (since it uses EGL), we blit the
dmabuf to a Host-allocated buffer managed by EGL.
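
For illustration, the DMA address/length collection could look roughly like the
sketch below (hypothetical names, not the exact helper added in patch 2):

#include <linux/dma-buf.h>
#include <linux/dma-direction.h>
#include <linux/err.h>
#include <linux/scatterlist.h>

struct dma_entry {
        dma_addr_t addr;
        unsigned int len;
};

/* Sketch: map the imported dmabuf and record dma addr/len pairs for the host. */
static int note_dma_addrs(struct dma_buf_attachment *attach,
                          struct dma_entry *ents, int max_ents)
{
        struct sg_table *sgt;
        struct scatterlist *sg;
        int i, n = 0;

        sgt = dma_buf_map_attachment(attach, DMA_TO_DEVICE);
        if (IS_ERR(sgt))
                return PTR_ERR(sgt);

        /* Walk the DMA-mapped entries, not CPU pages: this also covers VRAM. */
        for_each_sgtable_dma_sg(sgt, sg, i) {
                if (n == max_ents)
                        break;
                ents[n].addr = sg_dma_address(sg);
                ents[n].len = sg_dma_len(sg);
                n++;
        }
        return n;
}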

But I do recognize that if we were to share the dmabuf directly with the Host 
compositor (like you do with CrosVM + Wayland UI), then yes, the Guest
compositor (or probably the exporter) needs to be aware of this virtualization.

> 
> So that's why the KMSRO approach was tried  (virtio-gpu dumb allocations,
> not i915).  But as you point out, nobody uses dumb buffers for hardware-
> based rendering.
Right, we have been using KMSRO + dumb buffers all this while, but as we started
working more with dGPUs we realized that they are not great performance-wise.
So, one of my goals with this series (together with the VFIO patch series) is to
ensure that (with one dGPU VF assigned to the Guest) the Guest-allocated framebuffer
stays in VRAM (and does not get migrated to the Guest's system memory) as it is
shared with the dGPU PF on the Host.
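
The importer-side piece that lets the exporter keep the buffer in VRAM is the
dynamic attachment advertising P2P support; a rough sketch (assumed function
names, not the actual patch) could look like this:

#include <linux/device.h>
#include <linux/dma-buf.h>

/*
 * Sketch: the exporter calls this before moving the buffer; a real
 * implementation would send DETACH_BACKING and re-attach after the move.
 */
static void imported_bo_move_notify(struct dma_buf_attachment *attach)
{
}

static const struct dma_buf_attach_ops virtgpu_attach_ops = {
        .allow_peer2peer = true,        /* buffer may stay in VRAM (PCI BAR) */
        .move_notify = imported_bo_move_notify,
};

/* Sketch: dynamic, P2P-capable attach of the imported dmabuf. */
static struct dma_buf_attachment *import_attach(struct dma_buf *dmabuf,
                                                struct device *dev, void *priv)
{
        return dma_buf_dynamic_attach(dmabuf, dev, &virtgpu_attach_ops, priv);
}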

> 
> So, if you are going with i915 allocates + virtio-gpu imports,  it should be fine
> if you fixed all the issues with i915 allocates + VDMABUF imports.  It seems
> your fixes add complexity in VFIO and other places,
For iGPUs, only this series is needed; the VFIO patch series is needed only if
we are doing dGPU to dGPU buffer sharing (P2P DMA). And, it appears the VFIO
patches are needed for RDMA/SPDK use-cases anyway.

> but having virtio-gpu 3d +
> virgl allocate adds complexity to Mesa-based allocation paths (i915, amdgpu
> would all have to open virtio-gpu render node, and pick a context type etc.).
> 
> I would just do virtio-gpu allocates, since it only requires user-space patches
> and no extra ioctls, but that reflects my preferences. 
This patch series also does not add any extra ioctls, and no uapi changes are needed.
The only big change that needs to happen is on the Guest compositor side, but both
Mutter and Weston already have the concept of primary and secondary GPUs when
dealing with multi-GPU scenarios, so they are mostly ok.

> If the mm/VFIO/QEMU
> people are fine with your approach, I see nothing wrong with merging it.
Sima has said they are OK with the design, but they specifically asked people working
on virtio-gpu to weigh in. 

> 
> The one caveat is if someone uses a non-GTK/EGL host path, we'll have to pin
> memory for the lifetime of the import, since knowing RESOURCE_FLUSH is
Yeah, as discussed earlier, non-EGL host paths would probably need an additional
signaling mechanism to achieve full frame-rate when sharing the dmabuf directly
with the Host compositor.

> done is not sufficient.  But if you're only using it, it shouldn't be an issue right
> now.
To ensure that I am not regressing virgl or other use-cases, I am limiting this
feature to cases where blob=true and virgl=false for now. I'll send a v2 soon
with some fixes related to the cleanup of imported BOs.
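
In other words, the new path would only be taken under a gate roughly like the
following (hypothetical helper; has_resource_blob and has_virgl_3d are the
existing feature flags in virtgpu_drv.h):

#include "virtgpu_drv.h"

/* Sketch: only import scanout buffers when blob=true and virgl=false. */
static bool virtgpu_can_import_scanout(struct virtio_gpu_device *vgdev)
{
        return vgdev->has_resource_blob && !vgdev->has_virgl_3d;
}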

Thanks,
Vivek

> 
> 
> 
> 	>
> 	>
> 	> If so, I would try to get that reviewed first to meet DRM
> requirements
> 	> (https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#open-
> source-
> 	> userspace-requirements).  You might explicitly call out the design
> decision
> 	> you're making: ("We can probably use virgl as the virtualized iGPU
> via PF, but
> 	> that adds unnecessary complexity b/c ______").
> 	As I described above, what we have is an alternative GPU
> virtualization scheme
> 	where virgl is not necessary if SRIOV capability is available. And, as
> mentioned
> 	earlier, I have tested this series with Mutter/Gnome-shell (upstream
> master)
> 	(plus one small patch: https://gitlab.gnome.org/GNOME/mutter/-
> /merge_requests/3745)
> 	and no other changes to any other userspace components on Host
> and Guest.
> 
> 	>
> 	>
> 	>       And, doing it this way means that no other userspace
> components
> 	> need to be modified
> 	>       on both the Guest and the Host.
> 	>
> 	>       >
> 	>       >
> 	>       >
> 	>       >       As part of the import, the virtio-gpu driver shares the dma
> 	>       >       addresses and lengths with Qemu which then determines
> 	> whether
> 	>       > the
> 	>       >       memory region they belong to is owned by a PCI device or
> 	> whether it
> 	>       >       is part of the Guest's system ram. If it is the former, it
> identifies
> 	>       >       the devid (or bdf) and bar and provides this info (along
> with
> 	> offsets
> 	>       >       and sizes) to the udmabuf driver. In the latter case, instead
> of
> 	> the
> 	>       >       the devid and bar it provides the memfd. The udmabuf
> driver
> 	> then
> 	>       >       creates a dmabuf using this info that Qemu shares with
> Spice
> 	> for
> 	>       >       encode via Gstreamer.
> 	>       >
> 	>       >       Note that the virtio-gpu driver registers a move_notify()
> callback
> 	>       >       to track location changes associated with the scanout
> buffer and
> 	>       >       sends attach/detach backing cmds to Qemu when
> appropriate.
> 	> And,
> 	>       >       synchronization (that is, ensuring that Guest and Host are
> not
> 	>       >       using the scanout buffer at the same time) is ensured by
> 	> pinning/
> 	>       >       unpinning the dmabuf as part of plane update and using a
> fence
> 	>       >       in resource_flush cmd.
> 	>       >
> 	>       >
> 	>       > I'm not sure how QEMU's display paths work, but with crosvm
> if
> 	> you share
> 	>       > the guest-created dmabuf with the display, and the guest
> moves
> 	> the backing
> 	>       > pages, the only recourse is the destroy the surface and show a
> 	> black screen
> 	>       > to the user: not the best thing experience wise.
> 	>       Since Qemu GTK UI uses EGL, there is a blit done from the
> guest’s
> 	> scanout buffer onto an EGL
> 	>       backed buffer on the Host. So, this problem would not happen
> as of
> 	> now.
> 	>
> 	>
> 	>
> 	> The guest kernel doesn't know you're using the QEMU GTK UI + EGL
> host-
> 	> side.
> 	So, with blob=true, there is a dma fence in resource_flush() that gets
> associated
> 	with the Blit/Encode on the Host. This guest dma fence should
> eventually be signalled
> 	only when the Host is done using guest's scanout buffer.
> 
> 	>
> 	> If somebody wants to use the virtio-gpu import mechanism with
> lower-level
> 	> Wayland-based display integration, then the problem would occur.
> 	Right, one way to address this issue is to prevent the Guest
> compositor from
> 	reusing the scanout buffer (until the Host is done) and forcing it to
> pick a new
> 	buffer (since Mesa GBM allows 4 backbuffers).
> 	I have tried this experiment with KMSRO and Wayland-based Qemu
> UI previously
> 	on iGPUs (and Weston) and noticed that the Guest FPS was getting
> halved:
> 	https://lore.kernel.org/qemu-devel/20210913222036.3193732-1-
> vivek.kasireddy@intel.com/
> 
> 	and also discussed and proposed a solution which did not go
> anywhere:
> 	https://lore.kernel.org/dri-devel/20210913233529.3194401-1-
> vivek.kasireddy@intel.com/
> 
> 	>
> 	> Perhaps, do that just to be safe unless you have performance
> concerns.
> 	If you meant pinning the imported scanout buffer in the Guest, then
> yes,
> 	that is something I am already doing in this patchset.
> 
> 	>
> 	>
> 	>       >
> 	>       > Only amdgpu calls dma_buf_move_notfiy(..), and you're
> probably
> 	> testing on
> 	>       > Intel only, so you may not be hitting that code path anyways.
> 	>       I have tested with the Xe driver in the Guest which also calls
> 	> dma_buf_move_notfiy(). But
> 	>       note that for dGPUs, both Xe and amdgpu migrate the scanout
> buffer
> 	> from vram to system
> 	>       memory as part of export, because virtio-gpu is not P2P
> compatible.
> 	> However, I am hoping
> 	>       to relax this (p2p check against virtio-gpu) in Xe driver if it
> detects
> 	> that it is running in
> 	>       VF mode once the following patch series is merged:
> 	>       https://lore.kernel.org/dri-devel/20240422063602.3690124-1-
> 	> vivek.kasireddy@intel.com/
> 	>
> 	>       > I forgot the
> 	>       > exact reason, but apparently udmabuf may not work with
> amdgpu
> 	> displays
> 	>       > and it seems the virtualized iGPU + dGPU is the way to go for
> 	> amdgpu
> 	>       > anyways.
> 	>       I would really like to know why udmabuf would not work with
> 	> amdgpu?
> 	>
> 	>
> 	>
> 	> It's just a rumor I heard, but the idea is udmabuf would be
> imported into
> 	> AMDGPU_GEM_DOMAIN_CPU only.
> 	>
> 	> https://cgit.freedesktop.org/drm/drm-
> 	>
> misc/tree/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c#n333
> 	>
> 	> "AMDGPU_GEM_DOMAIN_CPU: System memory that is not GPU
> accessible.
> 	> Memory in this pool could be swapped out to disk if there is
> pressure."
> 	>
> 	> https://dri.freedesktop.org/docs/drm/gpu/amdgpu.html
> 	>
> 	>
> 	> Perhaps that limitation is artificial and unnecessary, and it may
> indeed work.
> 	> I don't think anybody has tried...
> 	Since udmabuf driver properly pins the backing pages (from memfd)
> for DMA,
> 	I don't see any reason why amdgpu would not be able to import.
> 
> 	Thanks,
> 	Vivek
> 
> 	>
> 	>
> 	>
> 	>
> 	>       > So I recommend just pinning the buffer for the lifetime of the
> 	>       > import for simplicity and correctness.
> 	>       Yeah, in this patch series, the dmabuf is indeed pinned, but only
> for a
> 	> short duration in the Guest –
> 	>       just until the Host is done using it (blit or encode).
> 	>
> 	>       Thanks,
> 	>       Vivek
> 	>
> 	>       >
> 	>       >
> 	>       >       This series is available at:
> 	>       >       https://gitlab.freedesktop.org/Vivek/drm-tip/-
> 	>       > /commits/virtgpu_import_rfc
> 	>       >
> 	>       >       along with additional patches for Qemu and Spice here:
> 	>       >       https://gitlab.freedesktop.org/Vivek/qemu/-
> 	>       > /commits/virtgpu_dmabuf_pcidev
> 	>       >       https://gitlab.freedesktop.org/Vivek/spice/-
> 	>       > /commits/encode_dmabuf_v4
> 	>       >
> 	>       >       Patchset overview:
> 	>       >
> 	>       >       Patch 1:   Implement
> 	>       > VIRTIO_GPU_CMD_RESOURCE_DETACH_BACKING cmd
> 	>       >       Patch 2-3: Helpers to initalize, import, free imported object
> 	>       >       Patch 4-5: Import and use buffers from other devices for
> 	> scanout
> 	>       >       Patch 6-7: Have udmabuf driver create dmabuf from PCI
> bars
> 	> for P2P
> 	>       > DMA
> 	>       >
> 	>       >       This series is tested using the following method:
> 	>       >       - Run Qemu with the following relevant options:
> 	>       >         qemu-system-x86_64 -m 4096m ....
> 	>       >         -device vfio-pci,host=0000:03:00.0
> 	>       >         -device virtio-
> 	> vga,max_outputs=1,blob=true,xres=1920,yres=1080
> 	>       >         -spice port=3001,gl=on,disable-ticketing=on,preferred-
> 	>       > codec=gstreamer:h264
> 	>       >         -object memory-backend-memfd,id=mem1,size=4096M
> 	>       >         -machine memory-backend=mem1 ...
> 	>       >       - Run upstream Weston with the following options in the
> Guest
> 	> VM:
> 	>       >         ./weston --drm-device=card1 --additional-devices=card0
> 	>       >
> 	>       >       where card1 is a DG2 dGPU (passthrough'd and using xe
> driver
> 	> in
> 	>       > Guest VM),
> 	>       >       card0 is virtio-gpu and the Host is using a RPL iGPU.
> 	>       >
> 	>       >       Cc: Gerd Hoffmann <kraxel@redhat.com>
> 	>       >       Cc: Dongwon Kim <dongwon.kim@intel.com>
> 	>       >       Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> 	>       >       Cc: Christian Koenig <christian.koenig@amd.com>
> 	>       >       Cc: Dmitry Osipenko <dmitry.osipenko@collabora.com>
> 	>       >       Cc: Rob Clark <robdclark@chromium.org>
> 	>       >       Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> 	>       >       Cc: Oded Gabbay <ogabbay@kernel.org>
> 	>       >       Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> 	>       >       Cc: Michael Tretter <m.tretter@pengutronix.de>
> 	>       >
> 	>       >       Vivek Kasireddy (7):
> 	>       >         drm/virtio: Implement
> 	>       > VIRTIO_GPU_CMD_RESOURCE_DETACH_BACKING cmd
> 	>       >         drm/virtio: Add a helper to map and note the dma addrs
> and
> 	>       > lengths
> 	>       >         drm/virtio: Add helpers to initialize and free the imported
> 	> object
> 	>       >         drm/virtio: Import prime buffers from other devices as
> guest
> 	> blobs
> 	>       >         drm/virtio: Ensure that bo's backing store is valid while
> 	> updating
> 	>       >           plane
> 	>       >         udmabuf/uapi: Add new ioctl to create a dmabuf from PCI
> bar
> 	>       > regions
> 	>       >         udmabuf: Implement
> UDMABUF_CREATE_LIST_FOR_PCIDEV
> 	> ioctl
> 	>       >
> 	>       >        drivers/dma-buf/udmabuf.c              | 122
> ++++++++++++++++--
> 	>       >        drivers/gpu/drm/virtio/virtgpu_drv.h   |   8 ++
> 	>       >        drivers/gpu/drm/virtio/virtgpu_plane.c |  56 ++++++++-
> 	>       >        drivers/gpu/drm/virtio/virtgpu_prime.c | 167
> 	>       > ++++++++++++++++++++++++-
> 	>       >        drivers/gpu/drm/virtio/virtgpu_vq.c    |  15 +++
> 	>       >        include/uapi/linux/udmabuf.h           |  11 +-
> 	>       >        6 files changed, 368 insertions(+), 11 deletions(-)
> 	>       >
> 	>       >       --
> 	>       >       2.43.0
> 	>       >
> 	>       >
> 	>
> 	>
> 
>