
[RFC,2/2] vc4: introduce DMA-BUF heap

Message ID 20231109074545.148149-2-contact@emersion.fr (mailing list archive)
State New, archived
Series [RFC,1/2] dma-buf/dma-heap: export dma_heap_add and dma_heap_get_drvdata

Commit Message

Simon Ser Nov. 9, 2023, 7:45 a.m. UTC
User-space sometimes needs to allocate scanout-capable memory for
GPU rendering purposes. On a vc4/v3d split render/display SoC, this
is achieved via DRM dumb buffers: the v3d user-space driver opens
the primary vc4 node, allocates a DRM dumb buffer there, exports it
as a DMA-BUF, imports it into the v3d render node, and renders to it.

However, DRM dumb buffers are only meant for CPU rendering; they are
not intended to be used for GPU rendering. Primary nodes should only
be used for mode-setting purposes; other programs should not attempt
to open them. Moreover, opening the primary node is already broken on
some setups: systemd grants permission to open primary nodes to
physically logged-in users, but this breaks when the user is not
physically logged in (e.g. headless setup) and when the distribution
is using a different init (e.g. Alpine Linux uses openrc).

We need an alternate way for v3d to allocate scanout-capable memory.
Leverage DMA heaps for this purpose: expose a CMA heap to user-space.
Preliminary testing has been done with wlroots [1].
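
For illustration, allocation through the new heap would then look roughly
like this from user-space (error handling omitted; the /dev/dma_heap/vc4
path simply follows the heap name registered by this patch):

#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/dma-heap.h>

/* Proposed: allocate scanout-capable (CMA) memory from the heap exposed
 * by this patch, without ever opening the vc4 primary node. */
static int alloc_scanout_via_heap(size_t size)
{
	int heap = open("/dev/dma_heap/vc4", O_RDWR | O_CLOEXEC);

	struct dma_heap_allocation_data alloc = {
		.len = size,
		.fd_flags = O_RDWR | O_CLOEXEC,
	};
	ioctl(heap, DMA_HEAP_IOCTL_ALLOC, &alloc);
	close(heap);

	return alloc.fd; /* DMA-BUF fd, importable into the v3d render node */
}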

This is still an RFC. Open questions:

- Does this approach make sense to y'all in general?
- What would be a good name for the heap? "vc4" is maybe a bit naive and
  not precise enough. Something with "cma"? Do we need to plan a naming
  scheme to accommodate multiple vc4 devices?
- Right now root privileges are necessary to open the heap. Should we
  allow everybody to open the heap by default (after all, user-space can
  already allocate arbitrary amounts of GPU memory)? Should we leave it
  up to user-space to solve this issue (via logind/seatd or a Wayland
  protocol or something else)?

TODO:

- Need to add !vc5 support.
- Need to extend the DMA heaps API to allow vc4 to unregister the heap
  on unload.

[1]: https://gitlab.freedesktop.org/wlroots/wlroots/-/merge_requests/4414

Signed-off-by: Simon Ser <contact@emersion.fr>
Cc: Maxime Ripard <mripard@kernel.org>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Daniel Stone <daniel@fooishbar.org>
Cc: Erico Nunes <nunes.erico@gmail.com>
Cc: Iago Toral Quiroga <itoral@igalia.com>
Cc: Maíra Canal <mcanal@igalia.com>
Cc: Thomas Zimmermann <tzimmermann@suse.de>
---
 drivers/gpu/drm/vc4/Makefile       |  2 ++
 drivers/gpu/drm/vc4/vc4_dma_heap.c | 51 ++++++++++++++++++++++++++++++
 drivers/gpu/drm/vc4/vc4_drv.c      |  6 ++++
 drivers/gpu/drm/vc4/vc4_drv.h      |  5 +++
 4 files changed, 64 insertions(+)
 create mode 100644 drivers/gpu/drm/vc4/vc4_dma_heap.c

Comments

Maxime Ripard Nov. 9, 2023, 9:11 a.m. UTC | #1
Hi Simon,

On Thu, Nov 09, 2023 at 07:45:58AM +0000, Simon Ser wrote:
> User-space sometimes needs to allocate scanout-capable memory for
> GPU rendering purposes. On a vc4/v3d split render/display SoC, this
> is achieved via DRM dumb buffers: the v3d user-space driver opens
> the primary vc4 node, allocates a DRM dumb buffer there, exports it
> as a DMA-BUF, imports it into the v3d render node, and renders to it.
> 
> However, DRM dumb buffers are only meant for CPU rendering, they are
> not intended to be used for GPU rendering. Primary nodes should only
> be used for mode-setting purposes, other programs should not attempt
> to open it. Moreover, opening the primary node is already broken on
> some setups: systemd grants permission to open primary nodes to
> physically logged in users, but this breaks when the user is not
> physically logged in (e.g. headless setup) and when the distribution
> is using a different init (e.g. Alpine Linux uses openrc).
> 
> We need an alternate way for v3d to allocate scanout-capable memory.
> Leverage DMA heaps for this purpose: expose a CMA heap to user-space.
> Preliminary testing has been done with wlroots [1].
> 
> This is still an RFC. Open questions:
> 
> - Does this approach make sense to y'all in general?

Makes sense to me :)

> - What would be a good name for the heap? "vc4" is maybe a bit naive and
>   not precise enough. Something with "cma"? Do we need to plan a naming
>   scheme to accomodate for multiple vc4 devices?

That's a general issue though that happens with pretty much all devices
with a separate node for modesetting and rendering, so I don't think
addressing it only for vc4 makes sense; we should make it generic.

So maybe something like "scanout"?

One thing we need to consider too is that the Pi5 will have multiple
display nodes (4(?) iirc) with different capabilities, vc4 being only
one of them. This will impact that solution too.

> - Right now root privileges are necessary to open the heap. Should we
>   allow everybody to open the heap by default (after all, user-space can
>   already allocate arbitrary amounts of GPU memory)? Should we leave it
>   up to user-space to solve this issue (via logind/seatd or a Wayland
>   protocol or something else)?

I would have expected a udev rule to handle that?
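
(Something along these lines, untested -- the "dma_heap" subsystem match
and the "vc4" heap name below are assumptions on my side:)

# /etc/udev/rules.d/70-dma-heap-vc4.rules
SUBSYSTEM=="dma_heap", KERNEL=="vc4", TAG+="uaccess"
# or, without logind, grant access to a group instead:
SUBSYSTEM=="dma_heap", KERNEL=="vc4", GROUP="video", MODE="0660"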

> TODO:
> 
> - Need to add !vc5 support.

If by !vc5 you mean RPi0-3, then it's probably not needed there at all
since it has a single node for both modesetting and rendering?

Maxime
Iago Toral Nov. 9, 2023, 9:25 a.m. UTC | #2
Hi Simon,

thanks for looking into this! A few comments below, inline...

On Thu, 09-11-2023 at 07:45 +0000, Simon Ser wrote:
> User-space sometimes needs to allocate scanout-capable memory for
> GPU rendering purposes. On a vc4/v3d split render/display SoC, this
> is achieved via DRM dumb buffers: the v3d user-space driver opens
> the primary vc4 node, allocates a DRM dumb buffer there, exports it
> as a DMA-BUF, imports it into the v3d render node, and renders to it.
> 
> However, DRM dumb buffers are only meant for CPU rendering, they are
> not intended to be used for GPU rendering. Primary nodes should only
> be used for mode-setting purposes, other programs should not attempt
> to open it. Moreover, opening the primary node is already broken on
> some setups: systemd grants permission to open primary nodes to
> physically logged in users, but this breaks when the user is not
> physically logged in (e.g. headless setup) and when the distribution
> is using a different init (e.g. Alpine Linux uses openrc).
> 
> We need an alternate way for v3d to allocate scanout-capable memory.
> Leverage DMA heaps for this purpose: expose a CMA heap to user-space.
> Preliminary testing has been done with wlroots [1].
> 
> This is still an RFC. Open questions:

I think Maxime is the right person to answer about vc4 but I'll answer
a few of these questions from my perspective:

> 
> - Does this approach make sense to y'all in general?

Yeah, this sounds sensible to me.

> - What would be a good name for the heap? "vc4" is maybe a bit naive
> and
>   not precise enough. Something with "cma"? Do we need to plan a
> naming
>   scheme to accomodate for multiple vc4 devices?
> - Right now root privileges are necessary to open the heap. Should we
>   allow everybody to open the heap by default (after all, user-space
> can
>   already allocate arbitrary amounts of GPU memory)? Should we leave
> it
>   up to user-space to solve this issue (via logind/seatd or a Wayland
>   protocol or something else)?

I think so, yes. After all, the main user of this will be the Mesa
driver, so root-only access would not be great, I think.

> 
> TODO:
> 
> - Need to add !vc5 support.

I believe that would be models before Raspberry Pi 4, which don't have
v3d, so maybe we don't need this there?

Iago

> - Need to the extend DMA heaps API to allow vc4 to unregister the
> heap
>   on unload.
> 
> [1]:
> https://gitlab.freedesktop.org/wlroots/wlroots/-/merge_requests/4414
> 
> Signed-off-by: Simon Ser <contact@emersion.fr>
> Cc: Maxime Ripard <mripard@kernel.org>
> Cc: Daniel Vetter <daniel@ffwll.ch>
> Cc: Daniel Stone <daniel@fooishbar.org>
> Cc: Erico Nunes <nunes.erico@gmail.com>
> Cc: Iago Toral Quiroga <itoral@igalia.com>
> Cc: Maíra Canal <mcanal@igalia.com>
> Cc: Thomas Zimmermann <tzimmermann@suse.de>
> ---
>  drivers/gpu/drm/vc4/Makefile       |  2 ++
>  drivers/gpu/drm/vc4/vc4_dma_heap.c | 51
> ++++++++++++++++++++++++++++++
>  drivers/gpu/drm/vc4/vc4_drv.c      |  6 ++++
>  drivers/gpu/drm/vc4/vc4_drv.h      |  5 +++
>  4 files changed, 64 insertions(+)
>  create mode 100644 drivers/gpu/drm/vc4/vc4_dma_heap.c
> 
> diff --git a/drivers/gpu/drm/vc4/Makefile
> b/drivers/gpu/drm/vc4/Makefile
> index c41f89a15a55..e4048870cec7 100644
> --- a/drivers/gpu/drm/vc4/Makefile
> +++ b/drivers/gpu/drm/vc4/Makefile
> @@ -34,4 +34,6 @@ vc4-$(CONFIG_DRM_VC4_KUNIT_TEST) += \
>  
>  vc4-$(CONFIG_DEBUG_FS) += vc4_debugfs.o
>  
> +vc4-$(CONFIG_DMABUF_HEAPS) += vc4_dma_heap.o
> +
>  obj-$(CONFIG_DRM_VC4)  += vc4.o
> diff --git a/drivers/gpu/drm/vc4/vc4_dma_heap.c
> b/drivers/gpu/drm/vc4/vc4_dma_heap.c
> new file mode 100644
> index 000000000000..010d0a88f3fa
> --- /dev/null
> +++ b/drivers/gpu/drm/vc4/vc4_dma_heap.c
> @@ -0,0 +1,51 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + *  Copyright © 2023 Simon Ser
> + */
> +
> +#include <linux/dma-buf.h>
> +#include <linux/dma-heap.h>
> +
> +#include "vc4_drv.h"
> +
> +static struct dma_buf *vc4_dma_heap_allocate(struct dma_heap *heap,
> +                                            unsigned long size,
> +                                            unsigned long fd_flags,
> +                                            unsigned long
> heap_flags)
> +{
> +       struct vc4_dev *vc4 = dma_heap_get_drvdata(heap);
> +       //DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
> +       struct drm_gem_dma_object *dma_obj;
> +       struct dma_buf *dmabuf;
> +
> +       if (WARN_ON_ONCE(!vc4->is_vc5))
> +               return ERR_PTR(-ENODEV);
> +
> +       dma_obj = drm_gem_dma_create(&vc4->base, size);
> +       if (IS_ERR(dma_obj))
> +               return ERR_CAST(dma_obj);
> +
> +       dmabuf = drm_gem_prime_export(&dma_obj->base, fd_flags);
> +       drm_gem_object_put(&dma_obj->base);
> +       return dmabuf;
> +}
> +
> +static const struct dma_heap_ops vc4_dma_heap_ops = {
> +       .allocate = vc4_dma_heap_allocate,
> +};
> +
> +int vc4_dma_heap_create(struct vc4_dev *vc4)
> +{
> +       struct dma_heap_export_info exp_info;
> +       struct dma_heap *heap;
> +
> +       exp_info.name = "vc4"; /* TODO: allow multiple? */
> +       exp_info.ops = &vc4_dma_heap_ops;
> +       exp_info.priv = vc4; /* TODO: unregister when unloading */
> +
> +       heap = dma_heap_add(&exp_info);
> +       if (IS_ERR(heap))
> +               return PTR_ERR(heap);
> +
> +       return 0;
> +}
> diff --git a/drivers/gpu/drm/vc4/vc4_drv.c
> b/drivers/gpu/drm/vc4/vc4_drv.c
> index c133e96b8aca..c7297dd7d9d5 100644
> --- a/drivers/gpu/drm/vc4/vc4_drv.c
> +++ b/drivers/gpu/drm/vc4/vc4_drv.c
> @@ -391,6 +391,12 @@ static int vc4_drm_bind(struct device *dev)
>  
>         drm_fbdev_dma_setup(drm, 16);
>  
> +#if IS_ENABLED(CONFIG_DMABUF_HEAPS)
> +       ret = vc4_dma_heap_create(vc4);
> +       if (ret)
> +               goto err;
> +#endif
> +
>         return 0;
>  
>  err:
> diff --git a/drivers/gpu/drm/vc4/vc4_drv.h
> b/drivers/gpu/drm/vc4/vc4_drv.h
> index ab61e96e7e14..d5c5dd18815c 100644
> --- a/drivers/gpu/drm/vc4/vc4_drv.h
> +++ b/drivers/gpu/drm/vc4/vc4_drv.h
> @@ -1078,4 +1078,9 @@ int vc4_perfmon_destroy_ioctl(struct drm_device
> *dev, void *data,
>  int vc4_perfmon_get_values_ioctl(struct drm_device *dev, void *data,
>                                  struct drm_file *file_priv);
>  
> +/* vc4_dma_heap.c */
> +#if IS_ENABLED(CONFIG_DMABUF_HEAPS)
> +int vc4_dma_heap_create(struct vc4_dev *vc4);
> +#endif
> +
>  #endif /* _VC4_DRV_H_ */
Maíra Canal Nov. 9, 2023, 12:14 p.m. UTC | #3
Hi Simon,

Thanks for working on this feature!

On 11/9/23 04:45, Simon Ser wrote:
> User-space sometimes needs to allocate scanout-capable memory for
> GPU rendering purposes. On a vc4/v3d split render/display SoC, this
> is achieved via DRM dumb buffers: the v3d user-space driver opens
> the primary vc4 node, allocates a DRM dumb buffer there, exports it
> as a DMA-BUF, imports it into the v3d render node, and renders to it.
> 
> However, DRM dumb buffers are only meant for CPU rendering, they are
> not intended to be used for GPU rendering. Primary nodes should only
> be used for mode-setting purposes, other programs should not attempt
> to open it. Moreover, opening the primary node is already broken on
> some setups: systemd grants permission to open primary nodes to
> physically logged in users, but this breaks when the user is not
> physically logged in (e.g. headless setup) and when the distribution
> is using a different init (e.g. Alpine Linux uses openrc).
> 
> We need an alternate way for v3d to allocate scanout-capable memory.

For me, it is a bit unclear how we import the vc4 DMA-BUF heap into v3d.
Should we create an IOCTL for it on v3d?

Also, if you need some help testing with the RPi 5, you can ping me on IRC
and I can try to help by testing.

Best Regards,
- Maíra

> Leverage DMA heaps for this purpose: expose a CMA heap to user-space.
> Preliminary testing has been done with wlroots [1].
> 
> This is still an RFC. Open questions:
> 
> - Does this approach make sense to y'all in general?
> - What would be a good name for the heap? "vc4" is maybe a bit naive and
>    not precise enough. Something with "cma"? Do we need to plan a naming
>    scheme to accomodate for multiple vc4 devices?
> - Right now root privileges are necessary to open the heap. Should we
>    allow everybody to open the heap by default (after all, user-space can
>    already allocate arbitrary amounts of GPU memory)? Should we leave it
>    up to user-space to solve this issue (via logind/seatd or a Wayland
>    protocol or something else)?
> 
> TODO:
> 
> - Need to add !vc5 support.
> - Need to the extend DMA heaps API to allow vc4 to unregister the heap
>    on unload.
> 
> [1]: https://gitlab.freedesktop.org/wlroots/wlroots/-/merge_requests/4414
> 
> Signed-off-by: Simon Ser <contact@emersion.fr>
> Cc: Maxime Ripard <mripard@kernel.org>
> Cc: Daniel Vetter <daniel@ffwll.ch>
> Cc: Daniel Stone <daniel@fooishbar.org>
> Cc: Erico Nunes <nunes.erico@gmail.com>
> Cc: Iago Toral Quiroga <itoral@igalia.com>
> Cc: Maíra Canal <mcanal@igalia.com>
> Cc: Thomas Zimmermann <tzimmermann@suse.de>
> ---
>   drivers/gpu/drm/vc4/Makefile       |  2 ++
>   drivers/gpu/drm/vc4/vc4_dma_heap.c | 51 ++++++++++++++++++++++++++++++
>   drivers/gpu/drm/vc4/vc4_drv.c      |  6 ++++
>   drivers/gpu/drm/vc4/vc4_drv.h      |  5 +++
>   4 files changed, 64 insertions(+)
>   create mode 100644 drivers/gpu/drm/vc4/vc4_dma_heap.c
> 
> diff --git a/drivers/gpu/drm/vc4/Makefile b/drivers/gpu/drm/vc4/Makefile
> index c41f89a15a55..e4048870cec7 100644
> --- a/drivers/gpu/drm/vc4/Makefile
> +++ b/drivers/gpu/drm/vc4/Makefile
> @@ -34,4 +34,6 @@ vc4-$(CONFIG_DRM_VC4_KUNIT_TEST) += \
>   
>   vc4-$(CONFIG_DEBUG_FS) += vc4_debugfs.o
>   
> +vc4-$(CONFIG_DMABUF_HEAPS) += vc4_dma_heap.o
> +
>   obj-$(CONFIG_DRM_VC4)  += vc4.o
> diff --git a/drivers/gpu/drm/vc4/vc4_dma_heap.c b/drivers/gpu/drm/vc4/vc4_dma_heap.c
> new file mode 100644
> index 000000000000..010d0a88f3fa
> --- /dev/null
> +++ b/drivers/gpu/drm/vc4/vc4_dma_heap.c
> @@ -0,0 +1,51 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + *  Copyright © 2023 Simon Ser
> + */
> +
> +#include <linux/dma-buf.h>
> +#include <linux/dma-heap.h>
> +
> +#include "vc4_drv.h"
> +
> +static struct dma_buf *vc4_dma_heap_allocate(struct dma_heap *heap,
> +					     unsigned long size,
> +					     unsigned long fd_flags,
> +					     unsigned long heap_flags)
> +{
> +	struct vc4_dev *vc4 = dma_heap_get_drvdata(heap);
> +	//DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
> +	struct drm_gem_dma_object *dma_obj;
> +	struct dma_buf *dmabuf;
> +
> +	if (WARN_ON_ONCE(!vc4->is_vc5))
> +		return ERR_PTR(-ENODEV);
> +
> +	dma_obj = drm_gem_dma_create(&vc4->base, size);
> +	if (IS_ERR(dma_obj))
> +		return ERR_CAST(dma_obj);
> +
> +	dmabuf = drm_gem_prime_export(&dma_obj->base, fd_flags);
> +	drm_gem_object_put(&dma_obj->base);
> +	return dmabuf;
> +}
> +
> +static const struct dma_heap_ops vc4_dma_heap_ops = {
> +	.allocate = vc4_dma_heap_allocate,
> +};
> +
> +int vc4_dma_heap_create(struct vc4_dev *vc4)
> +{
> +	struct dma_heap_export_info exp_info;
> +	struct dma_heap *heap;
> +
> +	exp_info.name = "vc4"; /* TODO: allow multiple? */
> +	exp_info.ops = &vc4_dma_heap_ops;
> +	exp_info.priv = vc4; /* TODO: unregister when unloading */
> +
> +	heap = dma_heap_add(&exp_info);
> +	if (IS_ERR(heap))
> +		return PTR_ERR(heap);
> +
> +	return 0;
> +}
> diff --git a/drivers/gpu/drm/vc4/vc4_drv.c b/drivers/gpu/drm/vc4/vc4_drv.c
> index c133e96b8aca..c7297dd7d9d5 100644
> --- a/drivers/gpu/drm/vc4/vc4_drv.c
> +++ b/drivers/gpu/drm/vc4/vc4_drv.c
> @@ -391,6 +391,12 @@ static int vc4_drm_bind(struct device *dev)
>   
>   	drm_fbdev_dma_setup(drm, 16);
>   
> +#if IS_ENABLED(CONFIG_DMABUF_HEAPS)
> +	ret = vc4_dma_heap_create(vc4);
> +	if (ret)
> +		goto err;
> +#endif
> +
>   	return 0;
>   
>   err:
> diff --git a/drivers/gpu/drm/vc4/vc4_drv.h b/drivers/gpu/drm/vc4/vc4_drv.h
> index ab61e96e7e14..d5c5dd18815c 100644
> --- a/drivers/gpu/drm/vc4/vc4_drv.h
> +++ b/drivers/gpu/drm/vc4/vc4_drv.h
> @@ -1078,4 +1078,9 @@ int vc4_perfmon_destroy_ioctl(struct drm_device *dev, void *data,
>   int vc4_perfmon_get_values_ioctl(struct drm_device *dev, void *data,
>   				 struct drm_file *file_priv);
>   
> +/* vc4_dma_heap.c */
> +#if IS_ENABLED(CONFIG_DMABUF_HEAPS)
> +int vc4_dma_heap_create(struct vc4_dev *vc4);
> +#endif
> +
>   #endif /* _VC4_DRV_H_ */
Maxime Ripard Nov. 9, 2023, 12:35 p.m. UTC | #4
Hi Maira,

On Thu, Nov 09, 2023 at 09:14:29AM -0300, Maira Canal wrote:
> On 11/9/23 04:45, Simon Ser wrote:
> > User-space sometimes needs to allocate scanout-capable memory for
> > GPU rendering purposes. On a vc4/v3d split render/display SoC, this
> > is achieved via DRM dumb buffers: the v3d user-space driver opens
> > the primary vc4 node, allocates a DRM dumb buffer there, exports it
> > as a DMA-BUF, imports it into the v3d render node, and renders to it.
> > 
> > However, DRM dumb buffers are only meant for CPU rendering, they are
> > not intended to be used for GPU rendering. Primary nodes should only
> > be used for mode-setting purposes, other programs should not attempt
> > to open it. Moreover, opening the primary node is already broken on
> > some setups: systemd grants permission to open primary nodes to
> > physically logged in users, but this breaks when the user is not
> > physically logged in (e.g. headless setup) and when the distribution
> > is using a different init (e.g. Alpine Linux uses openrc).
> > 
> > We need an alternate way for v3d to allocate scanout-capable memory.
> 
> For me, it is a bit unclear how we import the vc4 DMA-BUF heap into v3d.
> Should we create an IOCTL for it on v3d?

dma-heaps are separate device files (under /dev/dma_heap) that have
their own uAPI to allocate buffers:
https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-heap.c#L126

You then get a dma-buf handle you can import into whatever device you
want.
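
A minimal sketch of the import side into the v3d render node (assuming
libdrm; error handling omitted; in practice this typically happens through
GBM or EGL dma-buf imports):

#include <stdint.h>
#include <xf86drm.h>

/* Turn the DMA-BUF fd returned by DMA_HEAP_IOCTL_ALLOC into a GEM handle
 * on the v3d render node, so it can be used as a render target there. */
static uint32_t import_into_v3d(int v3d_fd, int dmabuf_fd)
{
	uint32_t handle;
	drmPrimeFDToHandle(v3d_fd, dmabuf_fd, &handle);
	return handle;
}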

Maxime
T.J. Mercier Nov. 9, 2023, 3:14 p.m. UTC | #5
On Wed, Nov 8, 2023 at 11:46 PM Simon Ser <contact@emersion.fr> wrote:
>
> +int vc4_dma_heap_create(struct vc4_dev *vc4)
> +{
> +       struct dma_heap_export_info exp_info;
> +       struct dma_heap *heap;
> +
> +       exp_info.name = "vc4"; /* TODO: allow multiple? */
> +       exp_info.ops = &vc4_dma_heap_ops;
> +       exp_info.priv = vc4; /* TODO: unregister when unloading */
> +

So unregistering a heap isn't currently possible, but we're trying to
enable that here:
https://lore.kernel.org/all/20231106120423.23364-7-yunfei.dong@mediatek.com/
Simon Ser Nov. 9, 2023, 3:31 p.m. UTC | #6
On Thursday, November 9th, 2023 at 10:11, Maxime Ripard <mripard@kernel.org> wrote:

> > - Does this approach make sense to y'all in general?
> 
> Makes sense to me :)

Great to hear!

> > - What would be a good name for the heap? "vc4" is maybe a bit naive and
> > not precise enough. Something with "cma"? Do we need to plan a naming
> > scheme to accomodate for multiple vc4 devices?
> 
> That's a general issue though that happens with pretty much all devices
> with a separate node for modesetting and rendering, so I don't think
> addressing it only for vc4 make sense, we should make it generic.
> 
> So maybe something like "scanout"?
> 
> One thing we need to consider too is that the Pi5 will have multiple
> display nodes (4(?) iirc) with different capabilities, vc4 being only
> one of them. This will impact that solution too.

I'm not sure trying to find a unique generic name for all split render/display
SoCs is a good idea:

- For the purposes of replacing DRM dumb buffer usage from v3d, we don't
  actually need a generic name; it's perfectly fine to hardcode a name here
  since vc4 is pretty much hardcoded there already.
- As you said, "scanout" may be ill-defined depending on the system. What if
  a PCIe or USB device is plugged in? What if vkms is loaded? What if there are
  multiple platform display devices? What does "scanout" mean then?
- A generic solution to advertise what DMA heaps a DRM display device is
  compatible with is probably better addressed by populating sysfs with new
  files. We've talked with Sima at XDC about adding a symlink pointing to the
  DMA heap and extra metadata files describing priorities and such. However we
  don't actually need that part for the purposes of v3d -- I think I'd rather
  defer that until more DMA heaps are plumbed across the DRM drivers.

So I'd be in favor of using something descriptive, platform-specific for the
heap name, something that user-space can't generically try to interpret or
parse.

> > - Right now root privileges are necessary to open the heap. Should we
> > allow everybody to open the heap by default (after all, user-space can
> > already allocate arbitrary amounts of GPU memory)? Should we leave it
> > up to user-space to solve this issue (via logind/seatd or a Wayland
> > protocol or something else)?
> 
> I would have expected a udev rule to handle that?

Right. That sounds fine to me. (It's not the end of the world if v3d can't
allocate scanout-capable memory because the heap is not accessible -- it just
means we will go through an extra copy in the compositor. And we can always
keep the legacy DRM primary node stuff as a fallback if there really are
concerns.)

> > TODO:
> > 
> > - Need to add !vc5 support.
> 
> If by !vc5 you mean RPi0-3, then it's probably not needed there at all
> since it has a single node for both modesetting and rendering?

Ah that's a good point. So we can just not expose any heap for !vc5.
Simon Ser Nov. 9, 2023, 3:33 p.m. UTC | #7
Thanks for the feedback Iago! I've replied to Maxime and I believe that
covers your questions as well.
Simon Ser Nov. 9, 2023, 3:38 p.m. UTC | #8
On Thursday, November 9th, 2023 at 13:14, Maira Canal <mcanal@igalia.com> wrote:

> On 11/9/23 04:45, Simon Ser wrote:
> > User-space sometimes needs to allocate scanout-capable memory for
> > GPU rendering purposes. On a vc4/v3d split render/display SoC, this
> > is achieved via DRM dumb buffers: the v3d user-space driver opens
> > the primary vc4 node, allocates a DRM dumb buffer there, exports it
> > as a DMA-BUF, imports it into the v3d render node, and renders to it.
> >
> > However, DRM dumb buffers are only meant for CPU rendering, they are
> > not intended to be used for GPU rendering. Primary nodes should only
> > be used for mode-setting purposes, other programs should not attempt
> > to open it. Moreover, opening the primary node is already broken on
> > some setups: systemd grants permission to open primary nodes to
> > physically logged in users, but this breaks when the user is not
> > physically logged in (e.g. headless setup) and when the distribution
> > is using a different init (e.g. Alpine Linux uses openrc).
> >
> > We need an alternate way for v3d to allocate scanout-capable memory.
> 
> For me, it is a bit unclear how we import the vc4 DMA-BUF heap into v3d.
> Should we create an IOCTL for it on v3d?

Hm, but v3d is the render driver, so it shouldn't have any knowledge of what
vc4 -- the display driver -- actually needs in order to scan out a piece of
memory. It's true that x86 drivers like amdgpu and i915 just have their own
IOCTL to allocate scanout-capable memory; however, the split render/display
SoC design makes things a bit more hairy for vc4/v3d.

You can think of the vc4 DMA-BUF heaps as a vc4 alloc IOCTL, except it's using
a more standard framework instead of a custom IOCTL.

Does this make sense?

> Also, if you need some help testing with the RPi 5, you can ping on IRC
> and I can try to help by testing.

Thank you, appreciated!
Simon Ser Nov. 9, 2023, 3:41 p.m. UTC | #9
On Thursday, November 9th, 2023 at 16:14, T.J. Mercier <tjmercier@google.com> wrote:

> > + exp_info.priv = vc4; /* TODO: unregister when unloading */
> > +
> 
> So unregistering a heap isn't currently possible, but we're trying to
> enable that here:
> https://lore.kernel.org/all/20231106120423.23364-7-yunfei.dong@mediatek.com/

Great!

Note that I wouldn't actually need any refcounting; a simpler function to
always unregister would be enough for my purposes, I think. But if you have
an actual use-case for the refcounting, that's completely fine with me as
well.
Dave Stevenson Nov. 9, 2023, 3:42 p.m. UTC | #10
Hi Simon and Maxime

On Thu, 9 Nov 2023 at 09:12, Maxime Ripard <mripard@kernel.org> wrote:
>
> Hi Simon,
>
> On Thu, Nov 09, 2023 at 07:45:58AM +0000, Simon Ser wrote:
> > User-space sometimes needs to allocate scanout-capable memory for
> > GPU rendering purposes. On a vc4/v3d split render/display SoC, this
> > is achieved via DRM dumb buffers: the v3d user-space driver opens
> > the primary vc4 node, allocates a DRM dumb buffer there, exports it
> > as a DMA-BUF, imports it into the v3d render node, and renders to it.
> >
> > However, DRM dumb buffers are only meant for CPU rendering, they are
> > not intended to be used for GPU rendering. Primary nodes should only
> > be used for mode-setting purposes, other programs should not attempt
> > to open it. Moreover, opening the primary node is already broken on
> > some setups: systemd grants permission to open primary nodes to
> > physically logged in users, but this breaks when the user is not
> > physically logged in (e.g. headless setup) and when the distribution
> > is using a different init (e.g. Alpine Linux uses openrc).
> >
> > We need an alternate way for v3d to allocate scanout-capable memory.
> > Leverage DMA heaps for this purpose: expose a CMA heap to user-space.
> > Preliminary testing has been done with wlroots [1].
> >
> > This is still an RFC. Open questions:
> >
> > - Does this approach make sense to y'all in general?
>
> Makes sense to me :)
>
> > - What would be a good name for the heap? "vc4" is maybe a bit naive and
> >   not precise enough. Something with "cma"? Do we need to plan a naming
> >   scheme to accomodate for multiple vc4 devices?
>
> That's a general issue though that happens with pretty much all devices
> with a separate node for modesetting and rendering, so I don't think
> addressing it only for vc4 make sense, we should make it generic.
>
> So maybe something like "scanout"?
>
> One thing we need to consider too is that the Pi5 will have multiple
> display nodes (4(?) iirc) with different capabilities, vc4 being only
> one of them. This will impact that solution too.

It does need to scale.

Pi5 adds 4 additional DRM devices (2xDSI, 1xDPI, and 1xComposite
Video), and just this last week I've been running Wayfire with TinyDRM
drivers for SPI displays and UDL (DisplayLink) outputs as well.
Presumably all of those drivers need to have appropriate hooks added
so they each expose a dma-heap to enable scanout buffers to be
allocated.

Can we add another function pointer to the struct drm_driver (or
similar) to do the allocation, and move the actual dmabuf handling
code into the core?

> > - Right now root privileges are necessary to open the heap. Should we
> >   allow everybody to open the heap by default (after all, user-space can
> >   already allocate arbitrary amounts of GPU memory)? Should we leave it
> >   up to user-space to solve this issue (via logind/seatd or a Wayland
> >   protocol or something else)?
>
> I would have expected a udev rule to handle that?
>
> > TODO:
> >
> > - Need to add !vc5 support.
>
> If by !vc5 you mean RPi0-3, then it's probably not needed there at all
> since it has a single node for both modesetting and rendering?

That is true, but potentially vc4 could be rendering for scanout via UDL or SPI.
Is it easier to always have the dma-heap allocator for every DRM card
rather than making userspace mix and match depending on whether it is
all-in-one vs. split?

  Dave
Simon Ser Nov. 9, 2023, 5:45 p.m. UTC | #11
On Thursday, November 9th, 2023 at 16:42, Dave Stevenson <dave.stevenson@raspberrypi.com> wrote:

> > > - What would be a good name for the heap? "vc4" is maybe a bit naive and
> > >   not precise enough. Something with "cma"? Do we need to plan a naming
> > >   scheme to accomodate for multiple vc4 devices?
> >
> > That's a general issue though that happens with pretty much all devices
> > with a separate node for modesetting and rendering, so I don't think
> > addressing it only for vc4 make sense, we should make it generic.
> >
> > So maybe something like "scanout"?
> >
> > One thing we need to consider too is that the Pi5 will have multiple
> > display nodes (4(?) iirc) with different capabilities, vc4 being only
> > one of them. This will impact that solution too.
> 
> It does need to scale.
> 
> Pi5 adds 4 additional DRM devices (2xDSI, 1xDPI, and 1xComposite
> Video), and just this last week I've been running Wayfire with TinyDRM
> drivers for SPI displays and UDL (DisplayLink) outputs as well.
> Presumably all of those drivers need to have appropriate hooks added
> so they each expose a dma-heap to enable scanout buffers to be
> allocated.

I'm not sure this makes sense necessarily for all of these devices. For vc4 and
the 4 additional RPi 5 DRM devices, probably. For the rest, e.g. UDL, I'm not
sure it makes sense to expose scanout memory allocation via DMA heaps: AFAIK
UDL needs CPU access to the buffers and can't really "scanout", so directly
rendering via v3d into a "scanout-capable" buffer doesn't make sense. There will
be a copy (CPU download) either way, and allocating via v3d wouldn't make a
difference.

> Can we add another function pointer to the struct drm_driver (or
> similar) to do the allocation, and move the actual dmabuf handling
> code into the core?

Do you mean that the new logic introduced in this patch should be in DRM core
to allow other drivers to more easily re-use it? Or do you mean something else?

Indeed, there is nothing vc4-specific in this patch; the only requirement is
that the driver uses drm_gem_dma_helper. So this code could be moved into (or
alongside) that helper in DRM core. However, maybe it would be best to wait
until we have a second user for this infrastructure before we move it into core.

> > > - Need to add !vc5 support.
> >
> > If by !vc5 you mean RPi0-3, then it's probably not needed there at all
> > since it has a single node for both modesetting and rendering?
> 
> That is true, but potentially vc4 could be rendering for scanout via UDL or SPI.
> Is it easier to always have the dma-heap allocator for every DRM card
> rather than making userspace mix and match depending on whether it is
> all in one vs split?

I don't think it's realistic to try to always make DMA heaps available for each
and every driver which might need it from day one. It's too big of a task. And
it's an even bigger task to try to design a fully generic heap compatibility
uAPI from day one. I'd much rather add the heaps one by one, if and when we
figure that it makes sense, and incrementally work our way through.
Dave Stevenson Nov. 9, 2023, 6:38 p.m. UTC | #12
Hi Simon

On Thu, 9 Nov 2023 at 17:46, Simon Ser <contact@emersion.fr> wrote:
>
> On Thursday, November 9th, 2023 at 16:42, Dave Stevenson <dave.stevenson@raspberrypi.com> wrote:
>
> > > > - What would be a good name for the heap? "vc4" is maybe a bit naive and
> > > >   not precise enough. Something with "cma"? Do we need to plan a naming
> > > >   scheme to accomodate for multiple vc4 devices?
> > >
> > > That's a general issue though that happens with pretty much all devices
> > > with a separate node for modesetting and rendering, so I don't think
> > > addressing it only for vc4 make sense, we should make it generic.
> > >
> > > So maybe something like "scanout"?
> > >
> > > One thing we need to consider too is that the Pi5 will have multiple
> > > display nodes (4(?) iirc) with different capabilities, vc4 being only
> > > one of them. This will impact that solution too.
> >
> > It does need to scale.
> >
> > Pi5 adds 4 additional DRM devices (2xDSI, 1xDPI, and 1xComposite
> > Video), and just this last week I've been running Wayfire with TinyDRM
> > drivers for SPI displays and UDL (DisplayLink) outputs as well.
> > Presumably all of those drivers need to have appropriate hooks added
> > so they each expose a dma-heap to enable scanout buffers to be
> > allocated.
>
> I'm not sure this makes sense necessarily for all of these devices. For vc4 and
> the 4 additional RPi 5 DRM devices, probably. For the rest, e.g. UDL, I'm not
> sure it makes sense to expose scanout memory allocation via DMA heaps: AFAIK
> UDL needs CPU access to the buffers, it can't "scanout", and thus directly
> rendering via v3d to a "scanout-capable" buffer doesn't make sense. There will
> be a copy (CPU download) either way, and allocating via v3d wouldn't make a
> difference.

You as a developer may know that UDL is going to need CPU access, but
how does a generic userspace app know? Is it a case of falling back to
allocating via the renderer if there is no suitably named heap?

> > Can we add another function pointer to the struct drm_driver (or
> > similar) to do the allocation, and move the actual dmabuf handling
> > code into the core?
>
> Do you mean that the new logic introduced in this patch should be in DRM core
> to allow other drivers to more easily re-use it? Or do you mean something else?

Yes, make it easy to reuse between drivers.

> Indeed, there is nothing vc4-specific in this patch, the only requirement is
> that the driver uses drm_gem_dma_helper. So this code could be moved into (or
> alongside) that helper in DRM core. However, maybe it would be best to wait to
> have a second user for this infrastructure before we move into core.

Upstreaming of the DSI / DPI / composite drivers for Pi5 should only
be a few months away, and they can all scan out directly.

I expect the Rockchip platforms to also fall into the same category as
the Pi, with Mali as the 3D IP, and the VOP block as the scanout
engine. Have I missed some detail that means that they aren't a second
user for this?

> > > > - Need to add !vc5 support.
> > >
> > > If by !vc5 you mean RPi0-3, then it's probably not needed there at all
> > > since it has a single node for both modesetting and rendering?
> >
> > That is true, but potentially vc4 could be rendering for scanout via UDL or SPI.
> > Is it easier to always have the dma-heap allocator for every DRM card
> > rather than making userspace mix and match depending on whether it is
> > all in one vs split?
>
> I don't think it's realistic to try to always make DMA heaps available for each
> and every driver which might need it from day one. It's too big of a task. And
> it's an even bigger task to try to design a fully generic heap compatibility
> uAPI from day one. I'd much rather add the heaps one by one, if and when we
> figure that it makes sense, and incrementally work our way through.

Is it really that massive a task? We have the dma heap UAPI for
handling the allocations, so what new UAPI is required?

If it's a new function pointer in struct drm_driver, then the heap is
only created by the core, using the driver name, if that function is
defined. The function returns a struct dma_buf *.
Any driver using drm_gem_dma_helper can use the new helper function
that is basically your vc4_dma_heap_allocate. The "if
(WARN_ON_ONCE(!vc4->is_vc5))" could be handled by not setting the
function pointer on those platforms.

Sorry, I feel I must be missing some critical piece of information here.

  Dave
Maxime Ripard Nov. 9, 2023, 7:09 p.m. UTC | #13
On Thu, Nov 09, 2023 at 03:31:44PM +0000, Simon Ser wrote:
> > > - What would be a good name for the heap? "vc4" is maybe a bit naive and
> > > not precise enough. Something with "cma"? Do we need to plan a naming
> > > scheme to accomodate for multiple vc4 devices?
> > 
> > That's a general issue though that happens with pretty much all devices
> > with a separate node for modesetting and rendering, so I don't think
> > addressing it only for vc4 make sense, we should make it generic.
> > 
> > So maybe something like "scanout"?
> > 
> > One thing we need to consider too is that the Pi5 will have multiple
> > display nodes (4(?) iirc) with different capabilities, vc4 being only
> > one of them. This will impact that solution too.
> 
> I'm not sure trying to find a unique generic name for all split render/display
> SoC is a good idea:
> 
> - For the purposes of replacing DRM dumb buffers usage from v3d, we don't
>   actually need a generic name, it's perfectly fine to hardcode a name here
>   since vc4 is pretty much hardcoded there already.

Right. I'm wondering how that will work with drivers like panfrost or
powervr that will need to interact with different display drivers. We
will have the same issue for those, but we won't be able to hardcode the
driver name.

Similarly, if you have multiple display drivers, what "scanout-capable"
means might differ from one device to the other. Some have constraints
on the memory they can access, for example, so you cannot just assume
that a scanout-capable buffer allocated by vc4 will also be
scanout-capable for one of the RP1 DSI devices.

> - As you said, "scanout" may be ill-defined depending on the system. What if
>   a PCIe or USB device is plugged in? What if vkms is loaded? What if there are
>   multiple platform display devices? What does "scanout" mean then?
> - A generic solution to advertise what DMA heaps a DRM display device is
>   compatible with is probably better addressed by populating sysfs with new
>   files.

That would be a good idea indeed

>   We've talked with Sima at XDC about adding a symlink pointing to the
>   DMA heap and extra metadata files describing priorities and such.
>   However we don't actually need that part for the purposes of v3d --
>   I think I'd rather defer that until more DMA heaps are plumbed
>   across the DRM drivers.

Honestly, I don't think we can afford to only consider vc4/v3d here. The
issue you described seems to affect any SoC with a split scanout/GPU setup,
which is pretty much all of them? And if they are all affected, we
should design something that fixes it once and for all.

Maxime
Maxime Ripard Nov. 9, 2023, 7:17 p.m. UTC | #14
On Thu, Nov 09, 2023 at 03:42:38PM +0000, Dave Stevenson wrote:
> Hi Simon and Maxime
> 
> On Thu, 9 Nov 2023 at 09:12, Maxime Ripard <mripard@kernel.org> wrote:
> >
> > Hi Simon,
> >
> > On Thu, Nov 09, 2023 at 07:45:58AM +0000, Simon Ser wrote:
> > > User-space sometimes needs to allocate scanout-capable memory for
> > > GPU rendering purposes. On a vc4/v3d split render/display SoC, this
> > > is achieved via DRM dumb buffers: the v3d user-space driver opens
> > > the primary vc4 node, allocates a DRM dumb buffer there, exports it
> > > as a DMA-BUF, imports it into the v3d render node, and renders to it.
> > >
> > > However, DRM dumb buffers are only meant for CPU rendering, they are
> > > not intended to be used for GPU rendering. Primary nodes should only
> > > be used for mode-setting purposes, other programs should not attempt
> > > to open it. Moreover, opening the primary node is already broken on
> > > some setups: systemd grants permission to open primary nodes to
> > > physically logged in users, but this breaks when the user is not
> > > physically logged in (e.g. headless setup) and when the distribution
> > > is using a different init (e.g. Alpine Linux uses openrc).
> > >
> > > We need an alternate way for v3d to allocate scanout-capable memory.
> > > Leverage DMA heaps for this purpose: expose a CMA heap to user-space.
> > > Preliminary testing has been done with wlroots [1].
> > >
> > > This is still an RFC. Open questions:
> > >
> > > - Does this approach make sense to y'all in general?
> >
> > Makes sense to me :)
> >
> > > - What would be a good name for the heap? "vc4" is maybe a bit naive and
> > >   not precise enough. Something with "cma"? Do we need to plan a naming
> > >   scheme to accomodate for multiple vc4 devices?
> >
> > That's a general issue though that happens with pretty much all devices
> > with a separate node for modesetting and rendering, so I don't think
> > addressing it only for vc4 make sense, we should make it generic.
> >
> > So maybe something like "scanout"?
> >
> > One thing we need to consider too is that the Pi5 will have multiple
> > display nodes (4(?) iirc) with different capabilities, vc4 being only
> > one of them. This will impact that solution too.
> 
> It does need to scale.
> 
> Pi5 adds 4 additional DRM devices (2xDSI, 1xDPI, and 1xComposite
> Video), and just this last week I've been running Wayfire with TinyDRM
> drivers for SPI displays and UDL (DisplayLink) outputs as well.
> Presumably all of those drivers need to have appropriate hooks added
> so they each expose a dma-heap to enable scanout buffers to be
> allocated.
> 
> Can we add another function pointer to the struct drm_driver (or
> similar) to do the allocation, and move the actual dmabuf handling
> code into the core?

Yeah, I agree here, it just seems easier to provide a global hook and a
somewhat sane default to cover all drivers in one go (potentially with
additional restrictions, like only for modeset-only drivers or
something).

Maxime
Sebastian Wick Nov. 10, 2023, 9:31 a.m. UTC | #15
On Thu, Nov 09, 2023 at 06:38:20PM +0000, Dave Stevenson wrote:
> Hi Simon
> 
> On Thu, 9 Nov 2023 at 17:46, Simon Ser <contact@emersion.fr> wrote:
> >
> > On Thursday, November 9th, 2023 at 16:42, Dave Stevenson <dave.stevenson@raspberrypi.com> wrote:
> >
> > > > > - What would be a good name for the heap? "vc4" is maybe a bit naive and
> > > > >   not precise enough. Something with "cma"? Do we need to plan a naming
> > > > >   scheme to accomodate for multiple vc4 devices?
> > > >
> > > > That's a general issue though that happens with pretty much all devices
> > > > with a separate node for modesetting and rendering, so I don't think
> > > > addressing it only for vc4 make sense, we should make it generic.
> > > >
> > > > So maybe something like "scanout"?
> > > >
> > > > One thing we need to consider too is that the Pi5 will have multiple
> > > > display nodes (4(?) iirc) with different capabilities, vc4 being only
> > > > one of them. This will impact that solution too.
> > >
> > > It does need to scale.
> > >
> > > Pi5 adds 4 additional DRM devices (2xDSI, 1xDPI, and 1xComposite
> > > Video), and just this last week I've been running Wayfire with TinyDRM
> > > drivers for SPI displays and UDL (DisplayLink) outputs as well.
> > > Presumably all of those drivers need to have appropriate hooks added
> > > so they each expose a dma-heap to enable scanout buffers to be
> > > allocated.
> >
> > I'm not sure this makes sense necessarily for all of these devices. For vc4 and
> > the 4 additional RPi 5 DRM devices, probably. For the rest, e.g. UDL, I'm not
> > sure it makes sense to expose scanout memory allocation via DMA heaps: AFAIK
> > UDL needs CPU access to the buffers, it can't "scanout", and thus directly
> > rendering via v3d to a "scanout-capable" buffer doesn't make sense. There will
> > be a copy (CPU download) either way, and allocating via v3d wouldn't make a
> > difference.
> 
> You as a developer may know that UDL is going to need CPU access, but
> how does a generic userspace app know? Is it a case of falling back to
> allocating via the renderer if there is no suitably named heap?

If we're going with the idea of using sysfs to link a device to a heap
(which I think we should), then those devices all should be linked to
the system memory heap, no?

> > > Can we add another function pointer to the struct drm_driver (or
> > > similar) to do the allocation, and move the actual dmabuf handling
> > > code into the core?
> >
> > Do you mean that the new logic introduced in this patch should be in DRM core
> > to allow other drivers to more easily re-use it? Or do you mean something else?
> 
> Yes, make it easy to reuse between drivers.
> 
> > Indeed, there is nothing vc4-specific in this patch, the only requirement is
> > that the driver uses drm_gem_dma_helper. So this code could be moved into (or
> > alongside) that helper in DRM core. However, maybe it would be best to wait to
> > have a second user for this infrastructure before we move into core.
> 
> Upstreaming of the DSI / DPI / composite drivers for Pi5 should only
> be a few months away, and they can all directly scanout.
> 
> I expect the Rockchip platforms to also fall into the same category as
> the Pi, with Mali as the 3D IP, and the VOP block as the scanout
> engine. Have I missed some detail that means that they aren't a second
> user for this?
> 
> > > > > - Need to add !vc5 support.
> > > >
> > > > If by !vc5 you mean RPi0-3, then it's probably not needed there at all
> > > > since it has a single node for both modesetting and rendering?
> > >
> > > That is true, but potentially vc4 could be rendering for scanout via UDL or SPI.
> > > Is it easier to always have the dma-heap allocator for every DRM card
> > > rather than making userspace mix and match depending on whether it is
> > > all in one vs split?
> >
> > I don't think it's realistic to try to always make DMA heaps available for each
> > and every driver which might need it from day one. It's too big of a task. And
> > it's an even bigger task to try to design a fully generic heap compatibility
> > uAPI from day one. I'd much rather add the heaps one by one, if and when we
> > figure that it makes sense, and incrementally work our way through.
> 
> Is it really that massive a task? We have the dma heap UAPI for
> handling the allocations, so what new UAPI is required?
> 
> If it's a new function pointer in struct drm_driver, then the heap is
> only created by the core if that function is defined using the driver
> name. The function returns a struct dma_buf *.
> Any driver using drm_gem_dma_helper can use the new helper function
> that is basically your vc4_dma_heap_allocate. The "if
> (WARN_ON_ONCE(!vc4->is_vc5))" could be handled by not setting the
> function pointer on those platforms.
> 
> Sorry, I feel I must be missing some critical piece of information here.
> 
>   Dave
>
Simon Ser Nov. 10, 2023, 11:01 a.m. UTC | #16
On Thursday, November 9th, 2023 at 19:38, Dave Stevenson <dave.stevenson@raspberrypi.com> wrote:

> Hi Simon
> 
> On Thu, 9 Nov 2023 at 17:46, Simon Ser <contact@emersion.fr> wrote:
> >
> > On Thursday, November 9th, 2023 at 16:42, Dave Stevenson <dave.stevenson@raspberrypi.com> wrote:
> >
> > > > > - What would be a good name for the heap? "vc4" is maybe a bit naive and
> > > > >   not precise enough. Something with "cma"? Do we need to plan a naming
> > > > >   scheme to accomodate for multiple vc4 devices?
> > > >
> > > > That's a general issue though that happens with pretty much all devices
> > > > with a separate node for modesetting and rendering, so I don't think
> > > > addressing it only for vc4 make sense, we should make it generic.
> > > >
> > > > So maybe something like "scanout"?
> > > >
> > > > One thing we need to consider too is that the Pi5 will have multiple
> > > > display nodes (4(?) iirc) with different capabilities, vc4 being only
> > > > one of them. This will impact that solution too.
> > >
> > > It does need to scale.
> > >
> > > Pi5 adds 4 additional DRM devices (2xDSI, 1xDPI, and 1xComposite
> > > Video), and just this last week I've been running Wayfire with TinyDRM
> > > drivers for SPI displays and UDL (DisplayLink) outputs as well.
> > > Presumably all of those drivers need to have appropriate hooks added
> > > so they each expose a dma-heap to enable scanout buffers to be
> > > allocated.
> >
> > I'm not sure this makes sense necessarily for all of these devices. For vc4 and
> > the 4 additional RPi 5 DRM devices, probably. For the rest, e.g. UDL, I'm not
> > sure it makes sense to expose scanout memory allocation via DMA heaps: AFAIK
> > UDL needs CPU access to the buffers, it can't "scanout", and thus directly
> > rendering via v3d to a "scanout-capable" buffer doesn't make sense. There will
> > be a copy (CPU download) either way, and allocating via v3d wouldn't make a
> > difference.
> 
> You as a developer may know that UDL is going to need CPU access, but
> how does a generic userspace app know? Is it a case of falling back to
> allocating via the renderer if there is no suitably named heap?

Yeah, just like how things work today. Then try to do direct scanout
with the buffer allocated on the render node, and if that doesn't work, blit
into a DRM dumb buffer.

> > Indeed, there is nothing vc4-specific in this patch, the only requirement is
> > that the driver uses drm_gem_dma_helper. So this code could be moved into (or
> > alongside) that helper in DRM core. However, maybe it would be best to wait to
> > have a second user for this infrastructure before we move into core.
> 
> Upstreaming of the DSI / DPI / composite drivers for Pi5 should only
> be a few months away, and they can all directly scanout.
> 
> I expect the Rockchip platforms to also fall into the same category as
> the Pi, with Mali as the 3D IP, and the VOP block as the scanout
> engine. Have I missed some detail that means that they aren't a second
> user for this?
> 
> > > > > - Need to add !vc5 support.
> > > >
> > > > If by !vc5 you mean RPi0-3, then it's probably not needed there at all
> > > > since it has a single node for both modesetting and rendering?
> > >
> > > That is true, but potentially vc4 could be rendering for scanout via UDL or SPI.
> > > Is it easier to always have the dma-heap allocator for every DRM card
> > > rather than making userspace mix and match depending on whether it is
> > > all in one vs split?
> >
> > I don't think it's realistic to try to always make DMA heaps available for each
> > and every driver which might need it from day one. It's too big of a task. And
> > it's an even bigger task to try to design a fully generic heap compatibility
> > uAPI from day one. I'd much rather add the heaps one by one, if and when we
> > figure that it makes sense, and incrementally work our way through.
> 
> Is it really that massive a task? We have the dma heap UAPI for
> handling the allocations, so what new UAPI is required?

I'm only focused on fixing v3d for now. Some split render/display drivers do
similar things, some don't. I don't have the hardware to test, and I'm not
willing to commit a lot of time to writing a patchset I'm not even sure will
make it. I don't really understand the hurry to do all of the things at once
here. If you do happen to have the time, feel free to write the patches or
even take over this series.

I'm worried about blindly adding heaps all over the place because, as said
above, I don't believe it makes sense for all drivers. I think it's probably a
safe bet to add heaps in cases where we use kmsro with dumb buffers on the Mesa
side. But anything else really needs to be discussed on a case-by-case
basis.

> If it's a new function pointer in struct drm_driver, then the heap is
> only created by the core if that function is defined using the driver
> name. The function returns a struct dma_buf *.
> Any driver using drm_gem_dma_helper can use the new helper function
> that is basically your vc4_dma_heap_allocate. The "if
> (WARN_ON_ONCE(!vc4->is_vc5))" could be handled by not setting the
> function pointer on those platforms.

I would advise against a function pointer, because that kind of API design
makes DRM core more of a midlayer than a helper library, and this is something
we prefer to avoid [1]. I would instead suggest introducing a DRM core function
which creates the heap and can be called from the driver. One line vs. three
lines of driver code doesn't really make such a big difference.

[1]: https://blog.ffwll.ch/2016/12/midlayers-once-more-with-feeling.html
Simon Ser Nov. 10, 2023, 11:08 a.m. UTC | #17
On Thursday, November 9th, 2023 at 20:09, Maxime Ripard <mripard@kernel.org> wrote:

> On Thu, Nov 09, 2023 at 03:31:44PM +0000, Simon Ser wrote:
> > > > - What would be a good name for the heap? "vc4" is maybe a bit naive and
> > > > not precise enough. Something with "cma"? Do we need to plan a naming
> > > > scheme to accomodate for multiple vc4 devices?
> > > 
> > > That's a general issue though that happens with pretty much all devices
> > > with a separate node for modesetting and rendering, so I don't think
> > > addressing it only for vc4 make sense, we should make it generic.
> > > 
> > > So maybe something like "scanout"?
> > > 
> > > One thing we need to consider too is that the Pi5 will have multiple
> > > display nodes (4(?) iirc) with different capabilities, vc4 being only
> > > one of them. This will impact that solution too.
> > 
> > I'm not sure trying to find a unique generic name for all split render/display
> > SoC is a good idea:
> > 
> > - For the purposes of replacing DRM dumb buffers usage from v3d, we don't
> >   actually need a generic name, it's perfectly fine to hardcode a name here
> >   since vc4 is pretty much hardcoded there already.
> 
> Right. I'm wondering how that will work with drivers like panfrost or
> powervr that will need to interact with different display drivers. We
> will have the same issue for those, but we won't be able to hardcode the
> driver name.

We will be able to hardcode the driver name. In fact, the driver name is
already hardcoded in kmsro today (Mesa creates one kmsro .so per supported
display driver in /usr/lib/dri).

It's just a matter of passing through the actual display device to panfrost in
Mesa and adding a very simple mapping of driver name -> heap name.
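
Something as simple as this would do on the Mesa side (hypothetical sketch;
every entry other than vc4 is made up for illustration):

/* Map a kmsro display driver name to the DMA heap name it exposes. */
static const struct {
	const char *display_driver;
	const char *heap_name;
} scanout_heap_map[] = {
	{ "vc4", "vc4" },
	/* future entries would go here, e.g. { "rockchip", "..." } */
};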

> Similarly, if you have multiple display drivers, what "scanout-capable"
> will mean might differ from one device to the other. Some have
> constraints on the memory they can access for example, so you cannot
> just assume that a scanout-capable buffer allocated by vc4 might not be
> scanout-capable for one of the RP1 DSI device.
> 
> > - As you said, "scanout" may be ill-defined depending on the system. What if
> >   a PCIe or USB device is plugged in? What if vkms is loaded? What if there are
> >   multiple platform display devices? What does "scanout" mean then?
> > - A generic solution to advertise what DMA heaps a DRM display device is
> >   compatible with is probably better addressed by populating sysfs with new
> >   files.
> 
> That would be a good idea indeed
> 
> >   We've talked with Sima at XDC about adding a symlink pointing to the
> >   DMA heap and extra metadata files describing priorities and such.
> >   However we don't actually need that part for the purposes of v3d --
> >   I think I'd rather defer that until more DMA heaps are plumbed
> >   across the DRM drivers.
> 
> Honestly, I don't think we can afford to only consider vc4/v3d here. The
> issue you described seem to affect any SoC with a split scanout/GPU,
> which is pretty much all of them? And if they are all affected, we
> should design something that fixes it once and for all.

We don't need any sysfs stuff to fix the primary node and DRM dumb buffer
issues in Mesa's kmsro/renderonly. The sysfs stuff is only required for a fully
generic buffer placement constraint/compatibility uAPI, which would be super
useful in compositors -- but let's take it one step at a time.
Simon Ser Nov. 10, 2023, 11:21 a.m. UTC | #18
On Thursday, November 9th, 2023 at 20:17, Maxime Ripard <mripard@kernel.org> wrote:

> > Can we add another function pointer to the struct drm_driver (or
> > similar) to do the allocation, and move the actual dmabuf handling
> > code into the core?
> 
> Yeah, I agree here, it just seems easier to provide a global hook and a
> somewhat sane default to cover all drivers in one go (potentially with
> additional restrictions, like only for modeset-only drivers or
> something).

First off not all drivers are using the GEM DMA helpers (e.g. vc4 with
!vc5 does not). The heap code in this patch only works with drivers
leveraging GEM DMA helpers.

Then maybe it's somewhat simple to cover typical display devices found
on split render/display SoCs, however for the rest it's going to be much
more complicated. For x86 typically there are multiple buffer placements
supported by the GPU and we need to have one heap per possible placement.
And then figuring out all of the rules, priority and compatibility stuff
is a whole other task in and of itself.

In short: I think it's unreasonable to expose a heap without good
knowledge of the specific hardware. In other words, unreasonable to
expose a heap for all drivers at once.
Maxime Ripard Nov. 10, 2023, 2:01 p.m. UTC | #19
On Fri, Nov 10, 2023 at 11:21:15AM +0000, Simon Ser wrote:
> On Thursday, November 9th, 2023 at 20:17, Maxime Ripard <mripard@kernel.org> wrote:
> 
> > > Can we add another function pointer to the struct drm_driver (or
> > > similar) to do the allocation, and move the actual dmabuf handling
> > > code into the core?
> > 
> > Yeah, I agree here, it just seems easier to provide a global hook and a
> > somewhat sane default to cover all drivers in one go (potentially with
> > additional restrictions, like only for modeset-only drivers or
> > something).
> 
> First off not all drivers are using the GEM DMA helpers (e.g. vc4 with
> !vc5 does not).

Right. vc4 pre-RPi4 is the exception though, so it's kind of what I
meant by providing sane defaults: the vast majority of drivers are using
GEM DMA helpers, and we should totally let drivers override that if
relevant.

Basically, we already have 2.5 citizen classes, I'd really like to avoid
having 3 officially, even more so if it's not super difficult to do.

> The heap code in this patch only works with drivers leveraging GEM DMA
> helpers.

We could add a new hook to drm_driver to allocate heaps, link to a
default implementation in DRM_GEM_DMA_DRIVER_OPS_WITH_DUMB_CREATE(), and
then use that new hook in your new heap. It would already cover 40
drivers at the moment, instead of just one, with all of them (but
atmel-hlcdc maybe?) being in the same situation as RPi4-vc4 is.
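
Something like this, roughly (not compile-tested, just to show the shape
of it; the hook name is made up):

/* Sketch only: a possible new per-driver hook plus a default for GEM DMA
 * drivers, so the dma-buf heap plumbing itself can live in core DRM. */
#include <drm/drm_device.h>
#include <drm/drm_gem_dma_helper.h>
#include <drm/drm_prime.h>
#include <linux/dma-buf.h>
#include <linux/err.h>

/* New member that would be added to struct drm_driver:
 *
 *	struct dma_buf *(*dma_heap_allocate)(struct drm_device *dev,
 *					     unsigned long size,
 *					     unsigned long fd_flags);
 */

/* Default implementation for drivers built on the GEM DMA helpers, which
 * DRM_GEM_DMA_DRIVER_OPS_WITH_DUMB_CREATE() could point to. */
static struct dma_buf *drm_gem_dma_heap_allocate(struct drm_device *dev,
						 unsigned long size,
						 unsigned long fd_flags)
{
	struct drm_gem_dma_object *dma_obj;
	struct dma_buf *dmabuf;

	dma_obj = drm_gem_dma_create(dev, size);
	if (IS_ERR(dma_obj))
		return ERR_CAST(dma_obj);

	dmabuf = drm_gem_prime_export(&dma_obj->base, fd_flags);
	drm_gem_object_put(&dma_obj->base);
	return dmabuf;
}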

> Then maybe it's somewhat simple to cover typical display devices found
> on split render/display SoCs, however for the rest it's going to be much
> more complicated. For x86 typically there are multiple buffer placements
> supported by the GPU and we need to have one heap per possible placement.
> And then figuring out all of the rules, priority and compatibility stuff
> is a whole other task in and of itself.

But x86 typically doesn't have a split render/display setup, right?

Maxime
Maxime Ripard Nov. 10, 2023, 2:13 p.m. UTC | #20
On Fri, Nov 10, 2023 at 11:08:06AM +0000, Simon Ser wrote:
> > On Thu, Nov 09, 2023 at 03:31:44PM +0000, Simon Ser wrote:
> > > > > - What would be a good name for the heap? "vc4" is maybe a bit naive and
> > > > > not precise enough. Something with "cma"? Do we need to plan a naming
> > > > > scheme to accomodate for multiple vc4 devices?
> > > > 
> > > > That's a general issue though that happens with pretty much all devices
> > > > with a separate node for modesetting and rendering, so I don't think
> > > > addressing it only for vc4 make sense, we should make it generic.
> > > > 
> > > > So maybe something like "scanout"?
> > > > 
> > > > One thing we need to consider too is that the Pi5 will have multiple
> > > > display nodes (4(?) iirc) with different capabilities, vc4 being only
> > > > one of them. This will impact that solution too.
> > > 
> > > I'm not sure trying to find a unique generic name for all split render/display
> > > SoC is a good idea:
> > > 
> > > - For the purposes of replacing DRM dumb buffers usage from v3d, we don't
> > >   actually need a generic name, it's perfectly fine to hardcode a name here
> > >   since vc4 is pretty much hardcoded there already.
> > 
> > Right. I'm wondering how that will work with drivers like panfrost or
> > powervr that will need to interact with different display drivers. We
> > will have the same issue for those, but we won't be able to hardcode the
> > driver name.
> 
> We will be able to hardcode the driver name. In fact, the driver name is
> already hardcoded in kmsro today (Mesa creates one kmsro .so per supported
> display driver in /usr/lib/dri).
> 
> It's just a matter of passing through the actual display device to panfrost in
> Mesa and adding a very simple mapping of driver name -> heap name.
> 
> > Similarly, if you have multiple display drivers, what "scanout-capable"
> > will mean might differ from one device to the other. Some have
> > constraints on the memory they can access for example, so you cannot
> > just assume that a scanout-capable buffer allocated by vc4 might not be
> > scanout-capable for one of the RP1 DSI device.
> > 
> > > - As you said, "scanout" may be ill-defined depending on the system. What if
> > >   a PCIe or USB device is plugged in? What if vkms is loaded? What if there are
> > >   multiple platform display devices? What does "scanout" mean then?
> > > - A generic solution to advertise what DMA heaps a DRM display device is
> > >   compatible with is probably better addressed by populating sysfs with new
> > >   files.
> > 
> > That would be a good idea indeed
> > 
> > >   We've talked with Sima at XDC about adding a symlink pointing to the
> > >   DMA heap and extra metadata files describing priorities and such.
> > >   However we don't actually need that part for the purposes of v3d --
> > >   I think I'd rather defer that until more DMA heaps are plumbed
> > >   across the DRM drivers.
> > 
> > Honestly, I don't think we can afford to only consider vc4/v3d here. The
> > issue you described seem to affect any SoC with a split scanout/GPU,
> > which is pretty much all of them? And if they are all affected, we
> > should design something that fixes it once and for all.
> 
> We don't need any sysfs stuff to fix the primary node and DRM dumb buffer
> issues in Mesa's kmsro/renderonly. The sysfs stuff is only required for a fully
> generic buffer placement constraint/compatibility uAPI. Which would be super
> useful in compositors, but let's do one step at a time.

I don't think a solution that further fragments the ecosystem is worth
taking, sorry. What you're doing is valuable, we should totally fix the
issue you have, but not at the expense of making vc4 special on one of
the platforms it supports.

Maxime
Simon Ser Nov. 10, 2023, 2:17 p.m. UTC | #21
On Friday, November 10th, 2023 at 15:13, Maxime Ripard <mripard@kernel.org> wrote:

> > > > We've talked with Sima at XDC about adding a symlink pointing to the
> > > > DMA heap and extra metadata files describing priorities and such.
> > > > However we don't actually need that part for the purposes of v3d --
> > > > I think I'd rather defer that until more DMA heaps are plumbed
> > > > across the DRM drivers.
> > > 
> > > Honestly, I don't think we can afford to only consider vc4/v3d here. The
> > > issue you described seem to affect any SoC with a split scanout/GPU,
> > > which is pretty much all of them? And if they are all affected, we
> > > should design something that fixes it once and for all.
> > 
> > We don't need any sysfs stuff to fix the primary node and DRM dumb buffer
> > issues in Mesa's kmsro/renderonly. The sysfs stuff is only required for a fully
> > generic buffer placement constraint/compatibility uAPI. Which would be super
> > useful in compositors, but let's do one step at a time.
> 
> I don't think a solution that further fragments the ecosystem is worth
> taking, sorry. What you're doing is valuable, we should totally fix the
> issue you have, but not at the expense of making vc4 special on one of
> the platforms it supports.

This does not fragment the ecosystem. It moves the ecosystem bit by bit
towards the final solution.
Simon Ser Nov. 10, 2023, 2:23 p.m. UTC | #22
On Friday, November 10th, 2023 at 15:01, Maxime Ripard <mripard@kernel.org> wrote:

> On Fri, Nov 10, 2023 at 11:21:15AM +0000, Simon Ser wrote:
> 
> > On Thursday, November 9th, 2023 at 20:17, Maxime Ripard mripard@kernel.org wrote:
> > 
> > > > Can we add another function pointer to the struct drm_driver (or
> > > > similar) to do the allocation, and move the actual dmabuf handling
> > > > code into the core?
> > > 
> > > Yeah, I agree here, it just seems easier to provide a global hook and a
> > > somewhat sane default to cover all drivers in one go (potentially with
> > > additional restrictions, like only for modeset-only drivers or
> > > something).
> > 
> > First off not all drivers are using the GEM DMA helpers (e.g. vc4 with
> > !vc5 does not).
> 
> Right. vc4 pre-RPi4 is the exception though, so it's kind of what I
> meant by providing sane defaults: the vast majority of drivers are using
> GEM DMA helpers, and we should totally let drivers override that if
> relevant.
> 
> Basically, we already have 2.5 citizen classes, I'd really like to avoid
> having 3 officially, even more so if it's not super difficult to do.
> 
> > The heap code in this patch only works with drivers leveraging GEM DMA
> > helpers.
> 
> We could add a new hook to drm_driver to allocate heaps, link to a
> default implementation in DRM_GEM_DMA_DRIVER_OPS_WITH_DUMB_CREATE(), and
> then use that new hook in your new heap. It would already cover 40
> drivers at the moment, instead of just one, with all of them (but
> atmel-hlcdc maybe?) being in the same situation as RPi4-vc4 is.

As said in another e-mail, I really think the consequences of DMA heaps
need to be thought out on a per-driver basis. Moreover this approach
makes core DRM go in the wrong direction, deeper in midlayer territory.

> > Then maybe it's somewhat simple to cover typical display devices found
> > on split render/display SoCs, however for the rest it's going to be much
> > more complicated. For x86 typically there are multiple buffer placements
> > supported by the GPU and we need to have one heap per possible placement.
> > And then figuring out all of the rules, priority and compatibility stuff
> > is a whole other task in and of itself.
> 
> But x86 typically doesn't have a split render/display setup, right?

So you're saying we should fix everything at once, but why is x86 not
part of "everything" then? x86 also has issues regarding buffer
placement. You're saying you don't want to fragment the ecosystem, but
it seems like there would still be "fragmentation" between split
render/display SoCs and the rest?

I'm having a hard time understanding your goals here.
Maxime Ripard Nov. 13, 2023, 10:52 a.m. UTC | #23
On Fri, Nov 10, 2023 at 02:23:16PM +0000, Simon Ser wrote:
> On Friday, November 10th, 2023 at 15:01, Maxime Ripard <mripard@kernel.org> wrote:
> 
> > On Fri, Nov 10, 2023 at 11:21:15AM +0000, Simon Ser wrote:
> > 
> > > On Thursday, November 9th, 2023 at 20:17, Maxime Ripard mripard@kernel.org wrote:
> > > 
> > > > > Can we add another function pointer to the struct drm_driver (or
> > > > > similar) to do the allocation, and move the actual dmabuf handling
> > > > > code into the core?
> > > > 
> > > > Yeah, I agree here, it just seems easier to provide a global hook and a
> > > > somewhat sane default to cover all drivers in one go (potentially with
> > > > additional restrictions, like only for modeset-only drivers or
> > > > something).
> > > 
> > > First off not all drivers are using the GEM DMA helpers (e.g. vc4 with
> > > !vc5 does not).
> > 
> > Right. vc4 pre-RPi4 is the exception though, so it's kind of what I
> > meant by providing sane defaults: the vast majority of drivers are using
> > GEM DMA helpers, and we should totally let drivers override that if
> > relevant.
> > 
> > Basically, we already have 2.5 citizen classes, I'd really like to avoid
> > having 3 officially, even more so if it's not super difficult to do.
> > 
> > > The heap code in this patch only works with drivers leveraging GEM DMA
> > > helpers.
> > 
> > We could add a new hook to drm_driver to allocate heaps, link to a
> > default implementation in DRM_GEM_DMA_DRIVER_OPS_WITH_DUMB_CREATE(), and
> > then use that new hook in your new heap. It would already cover 40
> > drivers at the moment, instead of just one, with all of them (but
> > atmel-hlcdc maybe?) being in the same situation as RPi4-vc4 is.
> 
> As said in another e-mail, I really think the consequences of DMA heaps
> need to be thought out on a per-driver basis.

The issue you pointed out doesn't show up on a per-driver basis though.

> Moreover this approach makes core DRM go in the wrong direction,
> deeper in midlayer territory.

I have no idea why that makes it more of a midlayer here, but even if it
does, that alone doesn't mean it's bad or something to avoid.
We've been striving for more than a decade now to make drivers easy to
write and easy to extend / deviate from the norm.

AFAIK, what I suggested provides both. Yours creates unnecessary
boilerplate and will leave a lot of drivers out of a needed solution.

> > > Then maybe it's somewhat simple to cover typical display devices found
> > > on split render/display SoCs, however for the rest it's going to be much
> > > more complicated. For x86 typically there are multiple buffer placements
> > > supported by the GPU and we need to have one heap per possible placement.
> > > And then figuring out all of the rules, priority and compatibility stuff
> > > is a whole other task in and of itself.
> > 
> > But x86 typically doesn't have a split render/display setup, right?
> 
> So you're saying we should fix everything at once, but why is x86 not
> part of "everything" then?

"everything" is your word, not mine. I'm saying that the issue you've
mentioned for this series applies to a lot of other drivers, and we
should fix it for those too.

x86 doesn't suffer from the issue you've mentioned, just like the Pi0-3,
and thus we don't have to come up with a solution for them.

> x86 also has issues regarding buffer placement. You're saying you
> don't want to fragment the ecosystem, but it seems like there would
> still be "fragmentation" between split render/display SoCs and the
> rest?

If you want to come up with a generic solution to buffer placement, then
you need to consider both sides. A buffer that can be addressed by the
scanout engine might not be addressable by the codec that will fill that
buffer, for example.

The problem is far broader than what you described here, and the
solution far more involved too.

I don't see why you're bringing that up here; I don't think we need a
solution for that at this point, and yet I think your current patch is a
step in the right direction if we enroll every relevant driver.

Maxime

> I'm having a hard time understanding your goals here.
Maxime Ripard Nov. 13, 2023, 10:53 a.m. UTC | #24
On Fri, Nov 10, 2023 at 02:17:45PM +0000, Simon Ser wrote:
> On Friday, November 10th, 2023 at 15:13, Maxime Ripard <mripard@kernel.org> wrote:
> 
> > > > > We've talked with Sima at XDC about adding a symlink pointing to the
> > > > > DMA heap and extra metadata files describing priorities and such.
> > > > > However we don't actually need that part for the purposes of v3d --
> > > > > I think I'd rather defer that until more DMA heaps are plumbed
> > > > > across the DRM drivers.
> > > > 
> > > > Honestly, I don't think we can afford to only consider vc4/v3d here. The
> > > > issue you described seem to affect any SoC with a split scanout/GPU,
> > > > which is pretty much all of them? And if they are all affected, we
> > > > should design something that fixes it once and for all.
> > > 
> > > We don't need any sysfs stuff to fix the primary node and DRM dumb buffer
> > > issues in Mesa's kmsro/renderonly. The sysfs stuff is only required for a fully
> > > generic buffer placement constraint/compatibility uAPI. Which would be super
> > > useful in compositors, but let's do one step at a time.
> > 
> > I don't think a solution that further fragments the ecosystem is worth
> > taking, sorry. What you're doing is valuable, we should totally fix the
> > issue you have, but not at the expense of making vc4 special on one of
> > the platforms it supports.
> 
> This does not fragment the ecosystem. It moves the ecosystem bit by bit
> towards the final solution.

You can rephrase that any way you want: it moves one driver towards the
final solution, thus making it deviate from the norm and leaving the
rest behind.

Maxime
Simon Ser Nov. 16, 2023, 3:53 p.m. UTC | #25
On Thursday, November 9th, 2023 at 08:45, Simon Ser <contact@emersion.fr> wrote:

> User-space sometimes needs to allocate scanout-capable memory for
> GPU rendering purposes. On a vc4/v3d split render/display SoC, this
> is achieved via DRM dumb buffers: the v3d user-space driver opens
> the primary vc4 node, allocates a DRM dumb buffer there, exports it
> as a DMA-BUF, imports it into the v3d render node, and renders to it.
> 
> However, DRM dumb buffers are only meant for CPU rendering, they are
> not intended to be used for GPU rendering. Primary nodes should only
> be used for mode-setting purposes, other programs should not attempt
> to open it. Moreover, opening the primary node is already broken on
> some setups: systemd grants permission to open primary nodes to
> physically logged in users, but this breaks when the user is not
> physically logged in (e.g. headless setup) and when the distribution
> is using a different init (e.g. Alpine Linux uses openrc).
> 
> We need an alternate way for v3d to allocate scanout-capable memory.
> Leverage DMA heaps for this purpose: expose a CMA heap to user-space.

So we've discussed this patch on IRC [1] [2]. Some random notes:

- We shouldn't create per-DRM-device heaps in general. Instead, we should try
  using centralized heaps like the existing system and cma ones. That way other
  drivers (video, render, etc) can also link to these heaps without depending
  on the display driver.
- We can't generically link to heaps in core DRM, however we probably provide
  a default for shmem and cma helpers.
- We're missing a bunch of heaps, e.g. sometimes there are multiple cma areas
  but only a single cma heap is created right now.
- Some hw needs the memory to be in a specific region for scanout (e.g. lower
  256MB of RAM for Allwinner). We could create one heap per such region (but is
  it fine to have overlapping heaps?).

Also, I tried using the default CMA heap on a Pi 4 for scanout and it works fine.
I'm not entirely sure it's strictly equivalent to allocations done via dumb
buffers though (e.g. whether the mappings are write-combined, etc.).
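
For reference, the user-space side of that test boils down to something
like this (sketch only, error handling omitted; the heap name under
/dev/dma_heap/ depends on the device tree):

/* Sketch: allocate a buffer from the default CMA heap and import it into a
 * DRM render node (e.g. v3d) as a GEM handle. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/dma-heap.h>
#include <xf86drm.h>

static uint32_t alloc_from_cma_heap(int render_fd, uint64_t size)
{
	/* Heap name depends on the platform, e.g. "linux,cma" or "reserved". */
	int heap_fd = open("/dev/dma_heap/linux,cma", O_RDWR | O_CLOEXEC);

	struct dma_heap_allocation_data alloc = {
		.len = size,
		.fd_flags = O_RDWR | O_CLOEXEC,
	};
	ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
	close(heap_fd);

	/* alloc.fd now holds the DMA-BUF; import it into the render node. */
	uint32_t handle;
	drmPrimeFDToHandle(render_fd, alloc.fd, &handle);
	close(alloc.fd);

	return handle;
}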

[1]: https://oftc.irclog.whitequark.org/dri-devel/2023-11-13#1699899003-1699919633;
[2]: https://oftc.irclog.whitequark.org/dri-devel/2023-11-14
Maxime Ripard Nov. 29, 2023, 12:45 p.m. UTC | #26
Hi,

Thanks for writing this down

On Thu, Nov 16, 2023 at 03:53:20PM +0000, Simon Ser wrote:
> On Thursday, November 9th, 2023 at 08:45, Simon Ser <contact@emersion.fr> wrote:
> 
> > User-space sometimes needs to allocate scanout-capable memory for
> > GPU rendering purposes. On a vc4/v3d split render/display SoC, this
> > is achieved via DRM dumb buffers: the v3d user-space driver opens
> > the primary vc4 node, allocates a DRM dumb buffer there, exports it
> > as a DMA-BUF, imports it into the v3d render node, and renders to it.
> > 
> > However, DRM dumb buffers are only meant for CPU rendering, they are
> > not intended to be used for GPU rendering. Primary nodes should only
> > be used for mode-setting purposes, other programs should not attempt
> > to open it. Moreover, opening the primary node is already broken on
> > some setups: systemd grants permission to open primary nodes to
> > physically logged in users, but this breaks when the user is not
> > physically logged in (e.g. headless setup) and when the distribution
> > is using a different init (e.g. Alpine Linux uses openrc).
> > 
> > We need an alternate way for v3d to allocate scanout-capable memory.
> > Leverage DMA heaps for this purpose: expose a CMA heap to user-space.
> 
> So we've discussed this patch on IRC [1] [2]. Some random notes:
> 
> - We shouldn't create per-DRM-device heaps in general. Instead, we should try
>   using centralized heaps like the existing system and cma ones. That way other
>   drivers (video, render, etc) can also link to these heaps without depending
>   on the display driver.
> - We can't generically link to heaps in core DRM, however we probably provide
>   a default for shmem and cma helpers.
> - We're missing a bunch of heaps, e.g. sometimes there are multiple cma areas
>   but only a single cma heap is created right now.
> - Some hw needs the memory to be in a specific region for scanout (e.g. lower
>   256MB of RAM for Allwinner). We could create one heap per such region (but is
>   it fine to have overlapping heaps?).

Just for reference, it's not the scanout itself that has that
requirement on Allwinner SoCs, it's the HW codec. But if you want to
display the decoded frame directly using dma-buf, you'll still need to
either allocate a scanout buffer and hope it'll be in the lower 256MB,
or allocate a buffer from the codec in the lower 256MB and then hope
it's scanout-capable (which it is, so that's what we do, but there's no
guarantee about it).

I think the logicvc is a much better example for this, since it requires
framebuffers to be in a specific area, with each plane having a
dedicated area.

AFAIK that's the most extreme example we have upstream.

Maxime
Simon Ser Dec. 5, 2023, 4:52 p.m. UTC | #27
On Wednesday, November 29th, 2023 at 13:45, Maxime Ripard <mripard@kernel.org> wrote:


> 
> 
> Hi,
> 
> Thanks for writing this down
> 
> On Thu, Nov 16, 2023 at 03:53:20PM +0000, Simon Ser wrote:
> 
> > On Thursday, November 9th, 2023 at 08:45, Simon Ser contact@emersion.fr wrote:
> > 
> > > User-space sometimes needs to allocate scanout-capable memory for
> > > GPU rendering purposes. On a vc4/v3d split render/display SoC, this
> > > is achieved via DRM dumb buffers: the v3d user-space driver opens
> > > the primary vc4 node, allocates a DRM dumb buffer there, exports it
> > > as a DMA-BUF, imports it into the v3d render node, and renders to it.
> > > 
> > > However, DRM dumb buffers are only meant for CPU rendering, they are
> > > not intended to be used for GPU rendering. Primary nodes should only
> > > be used for mode-setting purposes, other programs should not attempt
> > > to open it. Moreover, opening the primary node is already broken on
> > > some setups: systemd grants permission to open primary nodes to
> > > physically logged in users, but this breaks when the user is not
> > > physically logged in (e.g. headless setup) and when the distribution
> > > is using a different init (e.g. Alpine Linux uses openrc).
> > > 
> > > We need an alternate way for v3d to allocate scanout-capable memory.
> > > Leverage DMA heaps for this purpose: expose a CMA heap to user-space.
> > 
> > So we've discussed this patch on IRC [1] [2]. Some random notes:
> > 
> > - We shouldn't create per-DRM-device heaps in general. Instead, we should try
> > using centralized heaps like the existing system and cma ones. That way other
> > drivers (video, render, etc) can also link to these heaps without depending
> > on the display driver.
> > - We can't generically link to heaps in core DRM, however we probably provide
> > a default for shmem and cma helpers.
> > - We're missing a bunch of heaps, e.g. sometimes there are multiple cma areas
> > but only a single cma heap is created right now.
> > - Some hw needs the memory to be in a specific region for scanout (e.g. lower
> > 256MB of RAM for Allwinner). We could create one heap per such region (but is
> > it fine to have overlapping heaps?).
> 
> Just for reference, it's not the scanout itself that has that
> requirement on Allwinner SoCs, it's the HW codec. But if you want to
> display the decoded frame directly using dma-buf, you'll still need to
> either allocate a scanout buffer and hope it'll be in the lower 256MB,
> or allocate a buffer from the codec in the lower 256MB and then hope
> it's scanout-capable (which it is, so that's what we do, but there's no
> guarantee about it).

OK. Yeah, the problem remains.

> I think the logicvc is a much better example for this, since it requires
> framebuffers to be in a specific area, with each plane having a
> dedicated area.
> 
> AFAIK that's the most extreme example we have upstream.

That kind of restriction is not supported by generic user-space. As far
as user-space is concerned, scanout-capable buffers aren't tied to any
plane in particular. Generic user-space allocates via GBM or dumb
buffers, and at allocation time there is no hint about the plane the
buffer will be attached to.
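
For example, the most a generic client can express at allocation time is
something like this (sketch):

/* Sketch: all a generic client can say is "this buffer will be scanned out
 * and rendered to", never which plane it will end up on. */
#include <stdint.h>
#include <gbm.h>

static struct gbm_bo *alloc_scanout_capable_bo(struct gbm_device *gbm,
					       uint32_t width, uint32_t height)
{
	return gbm_bo_create(gbm, width, height, GBM_FORMAT_XRGB8888,
			     GBM_BO_USE_SCANOUT | GBM_BO_USE_RENDERING);
}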

I'm personally not super excited/interested about supporting this kind
of weird setup which doesn't match the KMS uAPI.
Maxime Ripard Dec. 7, 2023, 1:35 p.m. UTC | #28
On Tue, Dec 05, 2023 at 04:52:06PM +0000, Simon Ser wrote:
> > I think the logicvc is a much better example for this, since it requires
> > framebuffers to be in a specific area, with each plane having a
> > dedicated area.
> > 
> > AFAIK that's the most extreme example we have upstream.
> 
> That kind of restriction is not supported by generic user-space. As far
> as user-space is concerned, scanout-capable buffers aren't tied to any
> plane in particular. Generic user-space allocates via GBM or dumb
> buffers, and at allocation time there is no hint about the plane the
> buffer will be attached to.
> 
> I'm personally not super excited/interested about supporting this kind
> of weird setup which doesn't match the KMS uAPI.

Yeah... I think if we ever come to this, we can support it in two ways:

  - Either we handle it like simpledrm and the like by allocating a
    buffer and then memcpy'ing it to the right place;

  - Or if it's not fast enough, we create a single heap for the entire
    device, and then create some dynamic association between the
    framebuffer and the (hardware) plane depending on where the plane is
    in that area. That would be quite difficult to get right but not
    impossible imo.

Maxime
diff mbox series

Patch

diff --git a/drivers/gpu/drm/vc4/Makefile b/drivers/gpu/drm/vc4/Makefile
index c41f89a15a55..e4048870cec7 100644
--- a/drivers/gpu/drm/vc4/Makefile
+++ b/drivers/gpu/drm/vc4/Makefile
@@ -34,4 +34,6 @@  vc4-$(CONFIG_DRM_VC4_KUNIT_TEST) += \
 
 vc4-$(CONFIG_DEBUG_FS) += vc4_debugfs.o
 
+vc4-$(CONFIG_DMABUF_HEAPS) += vc4_dma_heap.o
+
 obj-$(CONFIG_DRM_VC4)  += vc4.o
diff --git a/drivers/gpu/drm/vc4/vc4_dma_heap.c b/drivers/gpu/drm/vc4/vc4_dma_heap.c
new file mode 100644
index 000000000000..010d0a88f3fa
--- /dev/null
+++ b/drivers/gpu/drm/vc4/vc4_dma_heap.c
@@ -0,0 +1,51 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ *  Copyright © 2023 Simon Ser
+ */
+
+#include <linux/dma-buf.h>
+#include <linux/dma-heap.h>
+
+#include "vc4_drv.h"
+
+static struct dma_buf *vc4_dma_heap_allocate(struct dma_heap *heap,
+					     unsigned long size,
+					     unsigned long fd_flags,
+					     unsigned long heap_flags)
+{
+	struct vc4_dev *vc4 = dma_heap_get_drvdata(heap);
+	//DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
+	struct drm_gem_dma_object *dma_obj;
+	struct dma_buf *dmabuf;
+
+	if (WARN_ON_ONCE(!vc4->is_vc5))
+		return ERR_PTR(-ENODEV);
+
+	dma_obj = drm_gem_dma_create(&vc4->base, size);
+	if (IS_ERR(dma_obj))
+		return ERR_CAST(dma_obj);
+
+	dmabuf = drm_gem_prime_export(&dma_obj->base, fd_flags);
+	drm_gem_object_put(&dma_obj->base);
+	return dmabuf;
+}
+
+static const struct dma_heap_ops vc4_dma_heap_ops = {
+	.allocate = vc4_dma_heap_allocate,
+};
+
+int vc4_dma_heap_create(struct vc4_dev *vc4)
+{
+	struct dma_heap_export_info exp_info;
+	struct dma_heap *heap;
+
+	exp_info.name = "vc4"; /* TODO: allow multiple? */
+	exp_info.ops = &vc4_dma_heap_ops;
+	exp_info.priv = vc4; /* TODO: unregister when unloading */
+
+	heap = dma_heap_add(&exp_info);
+	if (IS_ERR(heap))
+		return PTR_ERR(heap);
+
+	return 0;
+}
diff --git a/drivers/gpu/drm/vc4/vc4_drv.c b/drivers/gpu/drm/vc4/vc4_drv.c
index c133e96b8aca..c7297dd7d9d5 100644
--- a/drivers/gpu/drm/vc4/vc4_drv.c
+++ b/drivers/gpu/drm/vc4/vc4_drv.c
@@ -391,6 +391,12 @@  static int vc4_drm_bind(struct device *dev)
 
 	drm_fbdev_dma_setup(drm, 16);
 
+#if IS_ENABLED(CONFIG_DMABUF_HEAPS)
+	ret = vc4_dma_heap_create(vc4);
+	if (ret)
+		goto err;
+#endif
+
 	return 0;
 
 err:
diff --git a/drivers/gpu/drm/vc4/vc4_drv.h b/drivers/gpu/drm/vc4/vc4_drv.h
index ab61e96e7e14..d5c5dd18815c 100644
--- a/drivers/gpu/drm/vc4/vc4_drv.h
+++ b/drivers/gpu/drm/vc4/vc4_drv.h
@@ -1078,4 +1078,9 @@  int vc4_perfmon_destroy_ioctl(struct drm_device *dev, void *data,
 int vc4_perfmon_get_values_ioctl(struct drm_device *dev, void *data,
 				 struct drm_file *file_priv);
 
+/* vc4_dma_heap.c */
+#if IS_ENABLED(CONFIG_DMABUF_HEAPS)
+int vc4_dma_heap_create(struct vc4_dev *vc4);
+#endif
+
 #endif /* _VC4_DRV_H_ */