drm/etnaviv: Create an accel device node if compute-only

Message ID	20240424063753.3740664-1-tomeu@tomeuvizoso.net (mailing list archive)
State	New, archived
Headers	show Return-Path: <dri-devel-bounces@lists.freedesktop.org> From: Tomeu Vizoso <tomeu@tomeuvizoso.net> To: linux-kernel@vger.kernel.org Cc: Tomeu Vizoso <tomeu@tomeuvizoso.net>, Oded Gabbay <ogabbay@kernel.org>, Lucas Stach <l.stach@pengutronix.de>, Russell King <linux+etnaviv@armlinux.org.uk>, Christian Gmeiner <christian.gmeiner@gmail.com>, David Airlie <airlied@gmail.com>, Daniel Vetter <daniel@ffwll.ch>, etnaviv@lists.freedesktop.org, dri-devel@lists.freedesktop.org Subject: [PATCH] drm/etnaviv: Create an accel device node if compute-only Date: Wed, 24 Apr 2024 08:37:51 +0200 Message-ID: <20240424063753.3740664-1-tomeu@tomeuvizoso.net> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: list Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>
Series	drm/etnaviv: Create an accel device node if compute-only \| expand drm/etnaviv: Create an accel device node if compute-only

Tomeu Vizoso April 24, 2024, 6:37 a.m. UTC

If we expose a render node for NPUs without rendering capabilities, the
userspace stack will offer it to compositors and applications for
rendering, which of course won't work.

Userspace is probably right in not questioning whether a render node
might not be capable of supporting rendering, so change it in the kernel
instead by exposing a /dev/accel node.

Before we bring the device up we don't know whether it is capable of
rendering or not (depends on the features of its blocks), so first try
to probe a rendering node, and if we find out that there is no rendering
hardware, abort and retry with an accel node.

Signed-off-by: Tomeu Vizoso <tomeu@tomeuvizoso.net>
Cc: Oded Gabbay <ogabbay@kernel.org>
---
 drivers/gpu/drm/etnaviv/etnaviv_drv.c | 46 ++++++++++++++++++++++-----
 1 file changed, 38 insertions(+), 8 deletions(-)

Christian Gmeiner April 25, 2024, 11:32 a.m. UTC | #1

Hi Tomeu,

>
> If we expose a render node for NPUs without rendering capabilities, the
> userspace stack will offer it to compositors and applications for
> rendering, which of course won't work.
>
> Userspace is probably right in not questioning whether a render node
> might not be capable of supporting rendering, so change it in the kernel
> instead by exposing a /dev/accel node.
>
> Before we bring the device up we don't know whether it is capable of
> rendering or not (depends on the features of its blocks), so first try
> to probe a rendering node, and if we find out that there is no rendering
> hardware, abort and retry with an accel node.
>

I really love this idea of moving away from a render node. What needs to be done
on the userspace side?

> Signed-off-by: Tomeu Vizoso <tomeu@tomeuvizoso.net>
> Cc: Oded Gabbay <ogabbay@kernel.org>

Reviewed-by: Christian Gmeiner <cgmeiner@igalia.com>

> ---
>  drivers/gpu/drm/etnaviv/etnaviv_drv.c | 46 ++++++++++++++++++++++-----
>  1 file changed, 38 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_drv.c b/drivers/gpu/drm/etnaviv/etnaviv_drv.c
> index 6500f3999c5f..8e7dd23115f4 100644
> --- a/drivers/gpu/drm/etnaviv/etnaviv_drv.c
> +++ b/drivers/gpu/drm/etnaviv/etnaviv_drv.c
> @@ -11,6 +11,7 @@
>  #include <linux/platform_device.h>
>  #include <linux/uaccess.h>
>
> +#include <drm/drm_accel.h>
>  #include <drm/drm_debugfs.h>
>  #include <drm/drm_drv.h>
>  #include <drm/drm_file.h>
> @@ -488,10 +489,10 @@ static const struct drm_ioctl_desc etnaviv_ioctls[] = {
>         ETNA_IOCTL(PM_QUERY_SIG, pm_query_sig, DRM_RENDER_ALLOW),
>  };
>
> -DEFINE_DRM_GEM_FOPS(fops);
> +DEFINE_DRM_GEM_FOPS(render_fops);
> +DEFINE_DRM_ACCEL_FOPS(accel_fops);
>
> -static const struct drm_driver etnaviv_drm_driver = {
> -       .driver_features    = DRIVER_GEM | DRIVER_RENDER,
> +static struct drm_driver etnaviv_drm_driver = {
>         .open               = etnaviv_open,
>         .postclose           = etnaviv_postclose,
>         .gem_prime_import_sg_table = etnaviv_gem_prime_import_sg_table,
> @@ -500,7 +501,6 @@ static const struct drm_driver etnaviv_drm_driver = {
>  #endif
>         .ioctls             = etnaviv_ioctls,
>         .num_ioctls         = DRM_ETNAVIV_NUM_IOCTLS,
> -       .fops               = &fops,
>         .name               = "etnaviv",
>         .desc               = "etnaviv DRM",
>         .date               = "20151214",
> @@ -508,15 +508,20 @@ static const struct drm_driver etnaviv_drm_driver = {
>         .minor              = 4,
>  };
>
> -/*
> - * Platform driver:
> - */
> -static int etnaviv_bind(struct device *dev)
> +static int etnaviv_bind_with_type(struct device *dev, u32 type)
>  {
>         struct etnaviv_drm_private *priv;
>         struct drm_device *drm;
> +       bool is_compute_only = true;
>         int ret;
>
> +       etnaviv_drm_driver.driver_features = DRIVER_GEM | type;
> +
> +       if (type == DRIVER_RENDER)
> +               etnaviv_drm_driver.fops = &render_fops;
> +       else
> +               etnaviv_drm_driver.fops = &accel_fops;
> +
>         drm = drm_dev_alloc(&etnaviv_drm_driver, dev);
>         if (IS_ERR(drm))
>                 return PTR_ERR(drm);
> @@ -553,6 +558,18 @@ static int etnaviv_bind(struct device *dev)
>
>         load_gpu(drm);
>
> +       for (unsigned int i = 0; i < ETNA_MAX_PIPES; i++) {
> +               struct etnaviv_gpu *g = priv->gpu[i];
> +
> +               if (g && (g->identity.minor_features8 & chipMinorFeatures8_COMPUTE_ONLY) == 0)
> +                       is_compute_only = false;
> +       }
> +
> +       if (type == DRIVER_RENDER && is_compute_only) {
> +               ret = -EINVAL;
> +               goto out_unbind;
> +       }
> +
>         ret = drm_dev_register(drm, 0);
>         if (ret)
>                 goto out_unbind;
> @@ -571,6 +588,19 @@ static int etnaviv_bind(struct device *dev)
>         return ret;
>  }
>
> +/*
> + * Platform driver:
> + */
> +static int etnaviv_bind(struct device *dev)
> +{
> +       int ret = etnaviv_bind_with_type(dev, DRIVER_RENDER);
> +
> +       if (ret == -EINVAL)
> +               return etnaviv_bind_with_type(dev, DRIVER_COMPUTE_ACCEL);
> +
> +       return ret;
> +}
> +
>  static void etnaviv_unbind(struct device *dev)
>  {
>         struct drm_device *drm = dev_get_drvdata(dev);
> --
> 2.44.0
>

Jeffrey Hugo April 25, 2024, 6:59 p.m. UTC | #2

On 4/24/2024 12:37 AM, Tomeu Vizoso wrote:
> If we expose a render node for NPUs without rendering capabilities, the
> userspace stack will offer it to compositors and applications for
> rendering, which of course won't work.
> 
> Userspace is probably right in not questioning whether a render node
> might not be capable of supporting rendering, so change it in the kernel
> instead by exposing a /dev/accel node.
> 
> Before we bring the device up we don't know whether it is capable of
> rendering or not (depends on the features of its blocks), so first try
> to probe a rendering node, and if we find out that there is no rendering
> hardware, abort and retry with an accel node.
> 
> Signed-off-by: Tomeu Vizoso <tomeu@tomeuvizoso.net>
> Cc: Oded Gabbay <ogabbay@kernel.org>

I hope Oded chimes in as Accel maintainer.  I think Airlie/Vetter had 
also previously mentioned they'd have opinions on what is Accel vs DRM.

This gets a nack from me in its current state.  This is not a strong 
nack, and I don't want to discourage you.  I think there is a path forward.

The Accel subsystem documentation says that accel drivers will reside in 
drivers/accel/ but this does not.

Also, the commit text for "accel: add dedicated minor for accelerator 
devices" mentions -

"for drivers that
declare they handle compute accelerator, using a new driver feature
flag called DRIVER_COMPUTE_ACCEL. It is important to note that this
driver feature is mutually exclusive with DRIVER_RENDER. Devices that
want to expose both graphics and compute device char files should be
handled by two drivers that are connected using the auxiliary bus
framework."

I don't see any of that happening here (two drivers connected by aux 
bus, one in drivers/accel).

I think this is the first case we've had of a combo DRM/Accel usecase, 
and so there isn't an existing example to refer you to on how to 
structure things.  I think you are going to be the first example where 
we figure all of this out.

On a more implementation note, ioctls for Accel devices should not be 
marked DRM_RENDER_ALLOW.  Seems like your attempt to reuse as much of 
the code as possible trips over this.

-Jeff

Tomeu Vizoso April 26, 2024, 5:36 a.m. UTC | #3

On Thu, Apr 25, 2024 at 1:32 PM Christian Gmeiner
<christian.gmeiner@gmail.com> wrote:
>
> Hi Tomeu,
>
> >
> > If we expose a render node for NPUs without rendering capabilities, the
> > userspace stack will offer it to compositors and applications for
> > rendering, which of course won't work.
> >
> > Userspace is probably right in not questioning whether a render node
> > might not be capable of supporting rendering, so change it in the kernel
> > instead by exposing a /dev/accel node.
> >
> > Before we bring the device up we don't know whether it is capable of
> > rendering or not (depends on the features of its blocks), so first try
> > to probe a rendering node, and if we find out that there is no rendering
> > hardware, abort and retry with an accel node.
> >
>
> I really love this idea of moving away from a render node. What needs to be done
> on the userspace side?

Doesn't seem that bad, here is a proof of concept:

https://gitlab.freedesktop.org/tomeu/mesa/-/tree/teflon-accel

Thanks for taking a look.

Tomeu

> > Signed-off-by: Tomeu Vizoso <tomeu@tomeuvizoso.net>
> > Cc: Oded Gabbay <ogabbay@kernel.org>
>
> Reviewed-by: Christian Gmeiner <cgmeiner@igalia.com>
>
> > ---
> >  drivers/gpu/drm/etnaviv/etnaviv_drv.c | 46 ++++++++++++++++++++++-----
> >  1 file changed, 38 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/etnaviv/etnaviv_drv.c b/drivers/gpu/drm/etnaviv/etnaviv_drv.c
> > index 6500f3999c5f..8e7dd23115f4 100644
> > --- a/drivers/gpu/drm/etnaviv/etnaviv_drv.c
> > +++ b/drivers/gpu/drm/etnaviv/etnaviv_drv.c
> > @@ -11,6 +11,7 @@
> >  #include <linux/platform_device.h>
> >  #include <linux/uaccess.h>
> >
> > +#include <drm/drm_accel.h>
> >  #include <drm/drm_debugfs.h>
> >  #include <drm/drm_drv.h>
> >  #include <drm/drm_file.h>
> > @@ -488,10 +489,10 @@ static const struct drm_ioctl_desc etnaviv_ioctls[] = {
> >         ETNA_IOCTL(PM_QUERY_SIG, pm_query_sig, DRM_RENDER_ALLOW),
> >  };
> >
> > -DEFINE_DRM_GEM_FOPS(fops);
> > +DEFINE_DRM_GEM_FOPS(render_fops);
> > +DEFINE_DRM_ACCEL_FOPS(accel_fops);
> >
> > -static const struct drm_driver etnaviv_drm_driver = {
> > -       .driver_features    = DRIVER_GEM | DRIVER_RENDER,
> > +static struct drm_driver etnaviv_drm_driver = {
> >         .open               = etnaviv_open,
> >         .postclose           = etnaviv_postclose,
> >         .gem_prime_import_sg_table = etnaviv_gem_prime_import_sg_table,
> > @@ -500,7 +501,6 @@ static const struct drm_driver etnaviv_drm_driver = {
> >  #endif
> >         .ioctls             = etnaviv_ioctls,
> >         .num_ioctls         = DRM_ETNAVIV_NUM_IOCTLS,
> > -       .fops               = &fops,
> >         .name               = "etnaviv",
> >         .desc               = "etnaviv DRM",
> >         .date               = "20151214",
> > @@ -508,15 +508,20 @@ static const struct drm_driver etnaviv_drm_driver = {
> >         .minor              = 4,
> >  };
> >
> > -/*
> > - * Platform driver:
> > - */
> > -static int etnaviv_bind(struct device *dev)
> > +static int etnaviv_bind_with_type(struct device *dev, u32 type)
> >  {
> >         struct etnaviv_drm_private *priv;
> >         struct drm_device *drm;
> > +       bool is_compute_only = true;
> >         int ret;
> >
> > +       etnaviv_drm_driver.driver_features = DRIVER_GEM | type;
> > +
> > +       if (type == DRIVER_RENDER)
> > +               etnaviv_drm_driver.fops = &render_fops;
> > +       else
> > +               etnaviv_drm_driver.fops = &accel_fops;
> > +
> >         drm = drm_dev_alloc(&etnaviv_drm_driver, dev);
> >         if (IS_ERR(drm))
> >                 return PTR_ERR(drm);
> > @@ -553,6 +558,18 @@ static int etnaviv_bind(struct device *dev)
> >
> >         load_gpu(drm);
> >
> > +       for (unsigned int i = 0; i < ETNA_MAX_PIPES; i++) {
> > +               struct etnaviv_gpu *g = priv->gpu[i];
> > +
> > +               if (g && (g->identity.minor_features8 & chipMinorFeatures8_COMPUTE_ONLY) == 0)
> > +                       is_compute_only = false;
> > +       }
> > +
> > +       if (type == DRIVER_RENDER && is_compute_only) {
> > +               ret = -EINVAL;
> > +               goto out_unbind;
> > +       }
> > +
> >         ret = drm_dev_register(drm, 0);
> >         if (ret)
> >                 goto out_unbind;
> > @@ -571,6 +588,19 @@ static int etnaviv_bind(struct device *dev)
> >         return ret;
> >  }
> >
> > +/*
> > + * Platform driver:
> > + */
> > +static int etnaviv_bind(struct device *dev)
> > +{
> > +       int ret = etnaviv_bind_with_type(dev, DRIVER_RENDER);
> > +
> > +       if (ret == -EINVAL)
> > +               return etnaviv_bind_with_type(dev, DRIVER_COMPUTE_ACCEL);
> > +
> > +       return ret;
> > +}
> > +
> >  static void etnaviv_unbind(struct device *dev)
> >  {
> >         struct drm_device *drm = dev_get_drvdata(dev);
> > --
> > 2.44.0
> >
>
>
> --
> greets
> --
> Christian Gmeiner, MSc
>
> https://christian-gmeiner.info/privacypolicy

Tomeu Vizoso April 26, 2024, 6:10 a.m. UTC | #4

On Thu, Apr 25, 2024 at 8:59 PM Jeffrey Hugo <quic_jhugo@quicinc.com> wrote:
>
> On 4/24/2024 12:37 AM, Tomeu Vizoso wrote:
> > If we expose a render node for NPUs without rendering capabilities, the
> > userspace stack will offer it to compositors and applications for
> > rendering, which of course won't work.
> >
> > Userspace is probably right in not questioning whether a render node
> > might not be capable of supporting rendering, so change it in the kernel
> > instead by exposing a /dev/accel node.
> >
> > Before we bring the device up we don't know whether it is capable of
> > rendering or not (depends on the features of its blocks), so first try
> > to probe a rendering node, and if we find out that there is no rendering
> > hardware, abort and retry with an accel node.
> >
> > Signed-off-by: Tomeu Vizoso <tomeu@tomeuvizoso.net>
> > Cc: Oded Gabbay <ogabbay@kernel.org>
>
> I hope Oded chimes in as Accel maintainer.  I think Airlie/Vetter had
> also previously mentioned they'd have opinions on what is Accel vs DRM.
>
> This gets a nack from me in its current state.  This is not a strong
> nack, and I don't want to discourage you.  I think there is a path forward.
>
> The Accel subsystem documentation says that accel drivers will reside in
> drivers/accel/ but this does not.

Indeed, there is that code organization aspect.

> Also, the commit text for "accel: add dedicated minor for accelerator
> devices" mentions -
>
> "for drivers that
> declare they handle compute accelerator, using a new driver feature
> flag called DRIVER_COMPUTE_ACCEL. It is important to note that this
> driver feature is mutually exclusive with DRIVER_RENDER. Devices that
> want to expose both graphics and compute device char files should be
> handled by two drivers that are connected using the auxiliary bus
> framework."
>
> I don't see any of that happening here (two drivers connected by aux
> bus, one in drivers/accel).

Well, the text refers to devices, not drivers. The case we are talking
about is a driver that wants to sometimes expose an accel node, and
sometimes a render node, depending on the hardware it is dealing with.
So there would either be a device exposing a single render node, or a
device exposing a single accel node.

Though by using the auxiliary bus we could in theory solve the code
organization problem mentioned above, I'm not quite seeing how to do
this in a clean way. The driver in /drivers/gpu/drm would have to be a
DRM driver that doesn't register a DRM device, but registers a device
in the auxiliary bus for the driver in /drivers/accel to bind to? Or
are you seeing some possibility that would fit better in the current
DRM framework?

> I think this is the first case we've had of a combo DRM/Accel usecase,
> and so there isn't an existing example to refer you to on how to
> structure things.  I think you are going to be the first example where
> we figure all of this out.

Yep, I will be grateful for any ideas on how to structure this.

> On a more implementation note, ioctls for Accel devices should not be
> marked DRM_RENDER_ALLOW.  Seems like your attempt to reuse as much of
> the code as possible trips over this.

Indeed, thanks.

Cheers,

Tomeu

> -Jeff

Tomeu Vizoso May 9, 2024, 1:53 p.m. UTC | #5

Oded, Dave,

Do you have an opinion on this?

Thanks,

Tomeu

On Fri, Apr 26, 2024 at 8:10 AM Tomeu Vizoso <tomeu@tomeuvizoso.net> wrote:
>
> On Thu, Apr 25, 2024 at 8:59 PM Jeffrey Hugo <quic_jhugo@quicinc.com> wrote:
> >
> > On 4/24/2024 12:37 AM, Tomeu Vizoso wrote:
> > > If we expose a render node for NPUs without rendering capabilities, the
> > > userspace stack will offer it to compositors and applications for
> > > rendering, which of course won't work.
> > >
> > > Userspace is probably right in not questioning whether a render node
> > > might not be capable of supporting rendering, so change it in the kernel
> > > instead by exposing a /dev/accel node.
> > >
> > > Before we bring the device up we don't know whether it is capable of
> > > rendering or not (depends on the features of its blocks), so first try
> > > to probe a rendering node, and if we find out that there is no rendering
> > > hardware, abort and retry with an accel node.
> > >
> > > Signed-off-by: Tomeu Vizoso <tomeu@tomeuvizoso.net>
> > > Cc: Oded Gabbay <ogabbay@kernel.org>
> >
> > I hope Oded chimes in as Accel maintainer.  I think Airlie/Vetter had
> > also previously mentioned they'd have opinions on what is Accel vs DRM.
> >
> > This gets a nack from me in its current state.  This is not a strong
> > nack, and I don't want to discourage you.  I think there is a path forward.
> >
> > The Accel subsystem documentation says that accel drivers will reside in
> > drivers/accel/ but this does not.
>
> Indeed, there is that code organization aspect.
>
> > Also, the commit text for "accel: add dedicated minor for accelerator
> > devices" mentions -
> >
> > "for drivers that
> > declare they handle compute accelerator, using a new driver feature
> > flag called DRIVER_COMPUTE_ACCEL. It is important to note that this
> > driver feature is mutually exclusive with DRIVER_RENDER. Devices that
> > want to expose both graphics and compute device char files should be
> > handled by two drivers that are connected using the auxiliary bus
> > framework."
> >
> > I don't see any of that happening here (two drivers connected by aux
> > bus, one in drivers/accel).
>
> Well, the text refers to devices, not drivers. The case we are talking
> about is a driver that wants to sometimes expose an accel node, and
> sometimes a render node, depending on the hardware it is dealing with.
> So there would either be a device exposing a single render node, or a
> device exposing a single accel node.
>
> Though by using the auxiliary bus we could in theory solve the code
> organization problem mentioned above, I'm not quite seeing how to do
> this in a clean way. The driver in /drivers/gpu/drm would have to be a
> DRM driver that doesn't register a DRM device, but registers a device
> in the auxiliary bus for the driver in /drivers/accel to bind to? Or
> are you seeing some possibility that would fit better in the current
> DRM framework?
>
> > I think this is the first case we've had of a combo DRM/Accel usecase,
> > and so there isn't an existing example to refer you to on how to
> > structure things.  I think you are going to be the first example where
> > we figure all of this out.
>
> Yep, I will be grateful for any ideas on how to structure this.
>
> > On a more implementation note, ioctls for Accel devices should not be
> > marked DRM_RENDER_ALLOW.  Seems like your attempt to reuse as much of
> > the code as possible trips over this.
>
> Indeed, thanks.
>
> Cheers,
>
> Tomeu
>
> > -Jeff

Oded Gabbay May 9, 2024, 2:41 p.m. UTC | #6

On Thu, May 09, 2024 at 03:53:01PM +0200, Tomeu Vizoso wrote:
> Oded, Dave,
> 
> Do you have an opinion on this?
> 
> Thanks,
> 
> Tomeu
Hi Tomeu,

Sorry for not replying earlier, I was down with Covid (again...).

To your question, I don't have an objection to what you are
suggesting. My personal view of accel is that it is an integral part of 
DRM and therefore, if there is an *existing* drm driver that wants to 
create an accel node, I'm not against it. 

There is the question of why you want to expose an accel node, and
here I would like to hear Dave's and Sima's opinion on your suggested
solution as it may affect the direction of other drm drivers.

Thanks,
Oded.

p.s.
Please only use bottom-posting when replying, thanks :)

> 
> On Fri, Apr 26, 2024 at 8:10 AM Tomeu Vizoso <tomeu@tomeuvizoso.net> wrote:
> >
> > On Thu, Apr 25, 2024 at 8:59 PM Jeffrey Hugo <quic_jhugo@quicinc.com> wrote:
> > >
> > > On 4/24/2024 12:37 AM, Tomeu Vizoso wrote:
> > > > If we expose a render node for NPUs without rendering capabilities, the
> > > > userspace stack will offer it to compositors and applications for
> > > > rendering, which of course won't work.
> > > >
> > > > Userspace is probably right in not questioning whether a render node
> > > > might not be capable of supporting rendering, so change it in the kernel
> > > > instead by exposing a /dev/accel node.
> > > >
> > > > Before we bring the device up we don't know whether it is capable of
> > > > rendering or not (depends on the features of its blocks), so first try
> > > > to probe a rendering node, and if we find out that there is no rendering
> > > > hardware, abort and retry with an accel node.
> > > >
> > > > Signed-off-by: Tomeu Vizoso <tomeu@tomeuvizoso.net>
> > > > Cc: Oded Gabbay <ogabbay@kernel.org>
> > >
> > > I hope Oded chimes in as Accel maintainer.  I think Airlie/Vetter had
> > > also previously mentioned they'd have opinions on what is Accel vs DRM.
> > >
> > > This gets a nack from me in its current state.  This is not a strong
> > > nack, and I don't want to discourage you.  I think there is a path forward.
> > >
> > > The Accel subsystem documentation says that accel drivers will reside in
> > > drivers/accel/ but this does not.
> >
> > Indeed, there is that code organization aspect.
> >
> > > Also, the commit text for "accel: add dedicated minor for accelerator
> > > devices" mentions -
> > >
> > > "for drivers that
> > > declare they handle compute accelerator, using a new driver feature
> > > flag called DRIVER_COMPUTE_ACCEL. It is important to note that this
> > > driver feature is mutually exclusive with DRIVER_RENDER. Devices that
> > > want to expose both graphics and compute device char files should be
> > > handled by two drivers that are connected using the auxiliary bus
> > > framework."
> > >
> > > I don't see any of that happening here (two drivers connected by aux
> > > bus, one in drivers/accel).
> >
> > Well, the text refers to devices, not drivers. The case we are talking
> > about is a driver that wants to sometimes expose an accel node, and
> > sometimes a render node, depending on the hardware it is dealing with.
> > So there would either be a device exposing a single render node, or a
> > device exposing a single accel node.
> >
> > Though by using the auxiliary bus we could in theory solve the code
> > organization problem mentioned above, I'm not quite seeing how to do
> > this in a clean way. The driver in /drivers/gpu/drm would have to be a
> > DRM driver that doesn't register a DRM device, but registers a device
> > in the auxiliary bus for the driver in /drivers/accel to bind to? Or
> > are you seeing some possibility that would fit better in the current
> > DRM framework?
> >
> > > I think this is the first case we've had of a combo DRM/Accel usecase,
> > > and so there isn't an existing example to refer you to on how to
> > > structure things.  I think you are going to be the first example where
> > > we figure all of this out.
> >
> > Yep, I will be grateful for any ideas on how to structure this.
> >
> > > On a more implementation note, ioctls for Accel devices should not be
> > > marked DRM_RENDER_ALLOW.  Seems like your attempt to reuse as much of
> > > the code as possible trips over this.
> >
> > Indeed, thanks.
> >
> > Cheers,
> >
> > Tomeu
> >
> > > -Jeff

Lucas Stach May 10, 2024, 8:34 a.m. UTC | #7

Hi Tomeu,

Am Mittwoch, dem 24.04.2024 um 08:37 +0200 schrieb Tomeu Vizoso:
> If we expose a render node for NPUs without rendering capabilities, the
> userspace stack will offer it to compositors and applications for
> rendering, which of course won't work.
> 
> Userspace is probably right in not questioning whether a render node
> might not be capable of supporting rendering, so change it in the kernel
> instead by exposing a /dev/accel node.
> 
> Before we bring the device up we don't know whether it is capable of
> rendering or not (depends on the features of its blocks), so first try
> to probe a rendering node, and if we find out that there is no rendering
> hardware, abort and retry with an accel node.
> 
I thought about this for a while. My opinion is that this is the wrong
approach. We are adding another path to the kernel driver, potentially
complicating the userspace side, as now the NPU backend needs to look
for both render and accel nodes. While currently accel and drm are
pretty closely related and we can share most of the driver, it might
still be a maintenance hassle in the long run.

On the other hand we already have precedence of compute only DRM
devices exposing a render node: there are AMD GPUs that don't expose a
graphics queue and are thus not able to actually render graphics. Mesa
already handles this in part via the PIPE_CAP_GRAPHICS and I think we
should simply extend this to not offer a EGL display on screens without
that capability.

Regards,
Lucas

> Signed-off-by: Tomeu Vizoso <tomeu@tomeuvizoso.net>
> Cc: Oded Gabbay <ogabbay@kernel.org>
> ---
>  drivers/gpu/drm/etnaviv/etnaviv_drv.c | 46 ++++++++++++++++++++++-----
>  1 file changed, 38 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_drv.c b/drivers/gpu/drm/etnaviv/etnaviv_drv.c
> index 6500f3999c5f..8e7dd23115f4 100644
> --- a/drivers/gpu/drm/etnaviv/etnaviv_drv.c
> +++ b/drivers/gpu/drm/etnaviv/etnaviv_drv.c
> @@ -11,6 +11,7 @@
>  #include <linux/platform_device.h>
>  #include <linux/uaccess.h>
>  
> +#include <drm/drm_accel.h>
>  #include <drm/drm_debugfs.h>
>  #include <drm/drm_drv.h>
>  #include <drm/drm_file.h>
> @@ -488,10 +489,10 @@ static const struct drm_ioctl_desc etnaviv_ioctls[] = {
>  	ETNA_IOCTL(PM_QUERY_SIG, pm_query_sig, DRM_RENDER_ALLOW),
>  };
>  
> -DEFINE_DRM_GEM_FOPS(fops);
> +DEFINE_DRM_GEM_FOPS(render_fops);
> +DEFINE_DRM_ACCEL_FOPS(accel_fops);
>  
> -static const struct drm_driver etnaviv_drm_driver = {
> -	.driver_features    = DRIVER_GEM | DRIVER_RENDER,
> +static struct drm_driver etnaviv_drm_driver = {
>  	.open               = etnaviv_open,
>  	.postclose           = etnaviv_postclose,
>  	.gem_prime_import_sg_table = etnaviv_gem_prime_import_sg_table,
> @@ -500,7 +501,6 @@ static const struct drm_driver etnaviv_drm_driver = {
>  #endif
>  	.ioctls             = etnaviv_ioctls,
>  	.num_ioctls         = DRM_ETNAVIV_NUM_IOCTLS,
> -	.fops               = &fops,
>  	.name               = "etnaviv",
>  	.desc               = "etnaviv DRM",
>  	.date               = "20151214",
> @@ -508,15 +508,20 @@ static const struct drm_driver etnaviv_drm_driver = {
>  	.minor              = 4,
>  };
>  
> -/*
> - * Platform driver:
> - */
> -static int etnaviv_bind(struct device *dev)
> +static int etnaviv_bind_with_type(struct device *dev, u32 type)
>  {
>  	struct etnaviv_drm_private *priv;
>  	struct drm_device *drm;
> +	bool is_compute_only = true;
>  	int ret;
>  
> +	etnaviv_drm_driver.driver_features = DRIVER_GEM | type;
> +
> +	if (type == DRIVER_RENDER)
> +		etnaviv_drm_driver.fops = &render_fops;
> +	else
> +		etnaviv_drm_driver.fops = &accel_fops;
> +
>  	drm = drm_dev_alloc(&etnaviv_drm_driver, dev);
>  	if (IS_ERR(drm))
>  		return PTR_ERR(drm);
> @@ -553,6 +558,18 @@ static int etnaviv_bind(struct device *dev)
>  
>  	load_gpu(drm);
>  
> +	for (unsigned int i = 0; i < ETNA_MAX_PIPES; i++) {
> +		struct etnaviv_gpu *g = priv->gpu[i];
> +
> +		if (g && (g->identity.minor_features8 & chipMinorFeatures8_COMPUTE_ONLY) == 0)
> +			is_compute_only = false;
> +	}
> +
> +	if (type == DRIVER_RENDER && is_compute_only) {
> +		ret = -EINVAL;
> +		goto out_unbind;
> +	}
> +
>  	ret = drm_dev_register(drm, 0);
>  	if (ret)
>  		goto out_unbind;
> @@ -571,6 +588,19 @@ static int etnaviv_bind(struct device *dev)
>  	return ret;
>  }
>  
> +/*
> + * Platform driver:
> + */
> +static int etnaviv_bind(struct device *dev)
> +{
> +	int ret = etnaviv_bind_with_type(dev, DRIVER_RENDER);
> +
> +	if (ret == -EINVAL)
> +		return etnaviv_bind_with_type(dev, DRIVER_COMPUTE_ACCEL);
> +
> +	return ret;
> +}
> +
>  static void etnaviv_unbind(struct device *dev)
>  {
>  	struct drm_device *drm = dev_get_drvdata(dev);

Tomeu Vizoso May 20, 2024, 7:39 a.m. UTC | #8

Hi Lucas,

On Fri, May 10, 2024 at 10:34 AM Lucas Stach <l.stach@pengutronix.de> wrote:
>
> Hi Tomeu,
>
> Am Mittwoch, dem 24.04.2024 um 08:37 +0200 schrieb Tomeu Vizoso:
> > If we expose a render node for NPUs without rendering capabilities, the
> > userspace stack will offer it to compositors and applications for
> > rendering, which of course won't work.
> >
> > Userspace is probably right in not questioning whether a render node
> > might not be capable of supporting rendering, so change it in the kernel
> > instead by exposing a /dev/accel node.
> >
> > Before we bring the device up we don't know whether it is capable of
> > rendering or not (depends on the features of its blocks), so first try
> > to probe a rendering node, and if we find out that there is no rendering
> > hardware, abort and retry with an accel node.
> >
> I thought about this for a while. My opinion is that this is the wrong
> approach. We are adding another path to the kernel driver, potentially
> complicating the userspace side, as now the NPU backend needs to look
> for both render and accel nodes. While currently accel and drm are
> pretty closely related and we can share most of the driver, it might
> still be a maintenance hassle in the long run.
>
> On the other hand we already have precedence of compute only DRM
> devices exposing a render node: there are AMD GPUs that don't expose a
> graphics queue and are thus not able to actually render graphics. Mesa
> already handles this in part via the PIPE_CAP_GRAPHICS and I think we
> should simply extend this to not offer a EGL display on screens without
> that capability.

The problem with this is that the compositors I know don't loop over
/dev/dri files, trying to create EGL screens and moving to the next
one until they find one that works.

They take the first render node (unless a specific one has been
configured), and assumes it will be able to render with it.

To me it seems as if userspace expects that /dev/dri/renderD* devices
can be used for rendering and by breaking this assumption we would be
breaking existing software.

Which is what I understood to be the whole point behind the decision
to create a new device file hierarchy for accelerators. Or am I
missing something?

Adding Daniel Stone to CC in case he wants to give his opinion from
the compositor point of view.

Cheers,

Tomeu

> Regards,
> Lucas
>
> > Signed-off-by: Tomeu Vizoso <tomeu@tomeuvizoso.net>
> > Cc: Oded Gabbay <ogabbay@kernel.org>
> > ---
> >  drivers/gpu/drm/etnaviv/etnaviv_drv.c | 46 ++++++++++++++++++++++-----
> >  1 file changed, 38 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/etnaviv/etnaviv_drv.c b/drivers/gpu/drm/etnaviv/etnaviv_drv.c
> > index 6500f3999c5f..8e7dd23115f4 100644
> > --- a/drivers/gpu/drm/etnaviv/etnaviv_drv.c
> > +++ b/drivers/gpu/drm/etnaviv/etnaviv_drv.c
> > @@ -11,6 +11,7 @@
> >  #include <linux/platform_device.h>
> >  #include <linux/uaccess.h>
> >
> > +#include <drm/drm_accel.h>
> >  #include <drm/drm_debugfs.h>
> >  #include <drm/drm_drv.h>
> >  #include <drm/drm_file.h>
> > @@ -488,10 +489,10 @@ static const struct drm_ioctl_desc etnaviv_ioctls[] = {
> >       ETNA_IOCTL(PM_QUERY_SIG, pm_query_sig, DRM_RENDER_ALLOW),
> >  };
> >
> > -DEFINE_DRM_GEM_FOPS(fops);
> > +DEFINE_DRM_GEM_FOPS(render_fops);
> > +DEFINE_DRM_ACCEL_FOPS(accel_fops);
> >
> > -static const struct drm_driver etnaviv_drm_driver = {
> > -     .driver_features    = DRIVER_GEM | DRIVER_RENDER,
> > +static struct drm_driver etnaviv_drm_driver = {
> >       .open               = etnaviv_open,
> >       .postclose           = etnaviv_postclose,
> >       .gem_prime_import_sg_table = etnaviv_gem_prime_import_sg_table,
> > @@ -500,7 +501,6 @@ static const struct drm_driver etnaviv_drm_driver = {
> >  #endif
> >       .ioctls             = etnaviv_ioctls,
> >       .num_ioctls         = DRM_ETNAVIV_NUM_IOCTLS,
> > -     .fops               = &fops,
> >       .name               = "etnaviv",
> >       .desc               = "etnaviv DRM",
> >       .date               = "20151214",
> > @@ -508,15 +508,20 @@ static const struct drm_driver etnaviv_drm_driver = {
> >       .minor              = 4,
> >  };
> >
> > -/*
> > - * Platform driver:
> > - */
> > -static int etnaviv_bind(struct device *dev)
> > +static int etnaviv_bind_with_type(struct device *dev, u32 type)
> >  {
> >       struct etnaviv_drm_private *priv;
> >       struct drm_device *drm;
> > +     bool is_compute_only = true;
> >       int ret;
> >
> > +     etnaviv_drm_driver.driver_features = DRIVER_GEM | type;
> > +
> > +     if (type == DRIVER_RENDER)
> > +             etnaviv_drm_driver.fops = &render_fops;
> > +     else
> > +             etnaviv_drm_driver.fops = &accel_fops;
> > +
> >       drm = drm_dev_alloc(&etnaviv_drm_driver, dev);
> >       if (IS_ERR(drm))
> >               return PTR_ERR(drm);
> > @@ -553,6 +558,18 @@ static int etnaviv_bind(struct device *dev)
> >
> >       load_gpu(drm);
> >
> > +     for (unsigned int i = 0; i < ETNA_MAX_PIPES; i++) {
> > +             struct etnaviv_gpu *g = priv->gpu[i];
> > +
> > +             if (g && (g->identity.minor_features8 & chipMinorFeatures8_COMPUTE_ONLY) == 0)
> > +                     is_compute_only = false;
> > +     }
> > +
> > +     if (type == DRIVER_RENDER && is_compute_only) {
> > +             ret = -EINVAL;
> > +             goto out_unbind;
> > +     }
> > +
> >       ret = drm_dev_register(drm, 0);
> >       if (ret)
> >               goto out_unbind;
> > @@ -571,6 +588,19 @@ static int etnaviv_bind(struct device *dev)
> >       return ret;
> >  }
> >
> > +/*
> > + * Platform driver:
> > + */
> > +static int etnaviv_bind(struct device *dev)
> > +{
> > +     int ret = etnaviv_bind_with_type(dev, DRIVER_RENDER);
> > +
> > +     if (ret == -EINVAL)
> > +             return etnaviv_bind_with_type(dev, DRIVER_COMPUTE_ACCEL);
> > +
> > +     return ret;
> > +}
> > +
> >  static void etnaviv_unbind(struct device *dev)
> >  {
> >       struct drm_device *drm = dev_get_drvdata(dev);
>

Tomeu Vizoso May 20, 2024, 9:24 a.m. UTC | #9

On Fri, May 10, 2024 at 10:34 AM Lucas Stach <l.stach@pengutronix.de> wrote:
>
> Hi Tomeu,
>
> Am Mittwoch, dem 24.04.2024 um 08:37 +0200 schrieb Tomeu Vizoso:
> > If we expose a render node for NPUs without rendering capabilities, the
> > userspace stack will offer it to compositors and applications for
> > rendering, which of course won't work.
> >
> > Userspace is probably right in not questioning whether a render node
> > might not be capable of supporting rendering, so change it in the kernel
> > instead by exposing a /dev/accel node.
> >
> > Before we bring the device up we don't know whether it is capable of
> > rendering or not (depends on the features of its blocks), so first try
> > to probe a rendering node, and if we find out that there is no rendering
> > hardware, abort and retry with an accel node.
> >
> I thought about this for a while. My opinion is that this is the wrong
> approach. We are adding another path to the kernel driver, potentially
> complicating the userspace side, as now the NPU backend needs to look
> for both render and accel nodes.

Forgot to mention in my earlier reply today that with the proposed
solution no changes are needed in the Gallium drivers, only in the
pipeloader component in Mesa, and in the Gallium frontends.

But those changes are needed anyway to support the upcoming
compute-only NPUs, such as Rockchip's.

These are the changes I needed to make to the userspace to go with
this kernel patch:

https://gitlab.freedesktop.org/tomeu/mesa/-/commit/6b0db4cce406c574d2b7710208df9c8bd1ab6345

Cheers,

Tomeu

> While currently accel and drm are
> pretty closely related and we can share most of the driver, it might
> still be a maintenance hassle in the long run.
>
> On the other hand we already have precedence of compute only DRM
> devices exposing a render node: there are AMD GPUs that don't expose a
> graphics queue and are thus not able to actually render graphics. Mesa
> already handles this in part via the PIPE_CAP_GRAPHICS and I think we
> should simply extend this to not offer a EGL display on screens without
> that capability.
>
> Regards,
> Lucas
>
> > Signed-off-by: Tomeu Vizoso <tomeu@tomeuvizoso.net>
> > Cc: Oded Gabbay <ogabbay@kernel.org>
> > ---
> >  drivers/gpu/drm/etnaviv/etnaviv_drv.c | 46 ++++++++++++++++++++++-----
> >  1 file changed, 38 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/etnaviv/etnaviv_drv.c b/drivers/gpu/drm/etnaviv/etnaviv_drv.c
> > index 6500f3999c5f..8e7dd23115f4 100644
> > --- a/drivers/gpu/drm/etnaviv/etnaviv_drv.c
> > +++ b/drivers/gpu/drm/etnaviv/etnaviv_drv.c
> > @@ -11,6 +11,7 @@
> >  #include <linux/platform_device.h>
> >  #include <linux/uaccess.h>
> >
> > +#include <drm/drm_accel.h>
> >  #include <drm/drm_debugfs.h>
> >  #include <drm/drm_drv.h>
> >  #include <drm/drm_file.h>
> > @@ -488,10 +489,10 @@ static const struct drm_ioctl_desc etnaviv_ioctls[] = {
> >       ETNA_IOCTL(PM_QUERY_SIG, pm_query_sig, DRM_RENDER_ALLOW),
> >  };
> >
> > -DEFINE_DRM_GEM_FOPS(fops);
> > +DEFINE_DRM_GEM_FOPS(render_fops);
> > +DEFINE_DRM_ACCEL_FOPS(accel_fops);
> >
> > -static const struct drm_driver etnaviv_drm_driver = {
> > -     .driver_features    = DRIVER_GEM | DRIVER_RENDER,
> > +static struct drm_driver etnaviv_drm_driver = {
> >       .open               = etnaviv_open,
> >       .postclose           = etnaviv_postclose,
> >       .gem_prime_import_sg_table = etnaviv_gem_prime_import_sg_table,
> > @@ -500,7 +501,6 @@ static const struct drm_driver etnaviv_drm_driver = {
> >  #endif
> >       .ioctls             = etnaviv_ioctls,
> >       .num_ioctls         = DRM_ETNAVIV_NUM_IOCTLS,
> > -     .fops               = &fops,
> >       .name               = "etnaviv",
> >       .desc               = "etnaviv DRM",
> >       .date               = "20151214",
> > @@ -508,15 +508,20 @@ static const struct drm_driver etnaviv_drm_driver = {
> >       .minor              = 4,
> >  };
> >
> > -/*
> > - * Platform driver:
> > - */
> > -static int etnaviv_bind(struct device *dev)
> > +static int etnaviv_bind_with_type(struct device *dev, u32 type)
> >  {
> >       struct etnaviv_drm_private *priv;
> >       struct drm_device *drm;
> > +     bool is_compute_only = true;
> >       int ret;
> >
> > +     etnaviv_drm_driver.driver_features = DRIVER_GEM | type;
> > +
> > +     if (type == DRIVER_RENDER)
> > +             etnaviv_drm_driver.fops = &render_fops;
> > +     else
> > +             etnaviv_drm_driver.fops = &accel_fops;
> > +
> >       drm = drm_dev_alloc(&etnaviv_drm_driver, dev);
> >       if (IS_ERR(drm))
> >               return PTR_ERR(drm);
> > @@ -553,6 +558,18 @@ static int etnaviv_bind(struct device *dev)
> >
> >       load_gpu(drm);
> >
> > +     for (unsigned int i = 0; i < ETNA_MAX_PIPES; i++) {
> > +             struct etnaviv_gpu *g = priv->gpu[i];
> > +
> > +             if (g && (g->identity.minor_features8 & chipMinorFeatures8_COMPUTE_ONLY) == 0)
> > +                     is_compute_only = false;
> > +     }
> > +
> > +     if (type == DRIVER_RENDER && is_compute_only) {
> > +             ret = -EINVAL;
> > +             goto out_unbind;
> > +     }
> > +
> >       ret = drm_dev_register(drm, 0);
> >       if (ret)
> >               goto out_unbind;
> > @@ -571,6 +588,19 @@ static int etnaviv_bind(struct device *dev)
> >       return ret;
> >  }
> >
> > +/*
> > + * Platform driver:
> > + */
> > +static int etnaviv_bind(struct device *dev)
> > +{
> > +     int ret = etnaviv_bind_with_type(dev, DRIVER_RENDER);
> > +
> > +     if (ret == -EINVAL)
> > +             return etnaviv_bind_with_type(dev, DRIVER_COMPUTE_ACCEL);
> > +
> > +     return ret;
> > +}
> > +
> >  static void etnaviv_unbind(struct device *dev)
> >  {
> >       struct drm_device *drm = dev_get_drvdata(dev);
>

Daniel Stone May 20, 2024, 11:19 a.m. UTC | #10

Hi,

On Mon, 20 May 2024 at 08:39, Tomeu Vizoso <tomeu@tomeuvizoso.net> wrote:
> On Fri, May 10, 2024 at 10:34 AM Lucas Stach <l.stach@pengutronix.de> wrote:
> > Am Mittwoch, dem 24.04.2024 um 08:37 +0200 schrieb Tomeu Vizoso:
> > > If we expose a render node for NPUs without rendering capabilities, the
> > > userspace stack will offer it to compositors and applications for
> > > rendering, which of course won't work.
> > >
> > > Userspace is probably right in not questioning whether a render node
> > > might not be capable of supporting rendering, so change it in the kernel
> > > instead by exposing a /dev/accel node.
> > >
> > > Before we bring the device up we don't know whether it is capable of
> > > rendering or not (depends on the features of its blocks), so first try
> > > to probe a rendering node, and if we find out that there is no rendering
> > > hardware, abort and retry with an accel node.
> >
> > On the other hand we already have precedence of compute only DRM
> > devices exposing a render node: there are AMD GPUs that don't expose a
> > graphics queue and are thus not able to actually render graphics. Mesa
> > already handles this in part via the PIPE_CAP_GRAPHICS and I think we
> > should simply extend this to not offer a EGL display on screens without
> > that capability.
>
> The problem with this is that the compositors I know don't loop over
> /dev/dri files, trying to create EGL screens and moving to the next
> one until they find one that works.
>
> They take the first render node (unless a specific one has been
> configured), and assumes it will be able to render with it.
>
> To me it seems as if userspace expects that /dev/dri/renderD* devices
> can be used for rendering and by breaking this assumption we would be
> breaking existing software.

Mm, it's sort of backwards from that. Compositors just take a
non-render DRM node for KMS, then ask GBM+EGL to instantiate a GPU
which can work with that. When run in headless mode, we don't take
render nodes directly, but instead just create an EGLDisplay or
VkPhysicalDevice and work backwards to a render node, rather than
selecting a render node and going from there.

So from that PoV I don't think it's really that harmful. The only
complication is in Mesa, where it would see an etnaviv/amdgpu/...
render node and potentially try to use it as a device. As long as Mesa
can correctly skip, there should be no userspace API implications.

That being said, I'm not entirely sure what the _benefit_ would be of
exposing a render node for a device which can't be used by any
'traditional' DRM consumers, i.e. GL/Vulkan/winsys.

Cheers,
Daniel

Tomeu Vizoso June 12, 2024, 2:26 p.m. UTC | #11

On Mon, May 20, 2024 at 1:19 PM Daniel Stone <daniel@fooishbar.org> wrote:
>
> Hi,
>
> On Mon, 20 May 2024 at 08:39, Tomeu Vizoso <tomeu@tomeuvizoso.net> wrote:
> > On Fri, May 10, 2024 at 10:34 AM Lucas Stach <l.stach@pengutronix.de> wrote:
> > > Am Mittwoch, dem 24.04.2024 um 08:37 +0200 schrieb Tomeu Vizoso:
> > > > If we expose a render node for NPUs without rendering capabilities, the
> > > > userspace stack will offer it to compositors and applications for
> > > > rendering, which of course won't work.
> > > >
> > > > Userspace is probably right in not questioning whether a render node
> > > > might not be capable of supporting rendering, so change it in the kernel
> > > > instead by exposing a /dev/accel node.
> > > >
> > > > Before we bring the device up we don't know whether it is capable of
> > > > rendering or not (depends on the features of its blocks), so first try
> > > > to probe a rendering node, and if we find out that there is no rendering
> > > > hardware, abort and retry with an accel node.
> > >
> > > On the other hand we already have precedence of compute only DRM
> > > devices exposing a render node: there are AMD GPUs that don't expose a
> > > graphics queue and are thus not able to actually render graphics. Mesa
> > > already handles this in part via the PIPE_CAP_GRAPHICS and I think we
> > > should simply extend this to not offer a EGL display on screens without
> > > that capability.
> >
> > The problem with this is that the compositors I know don't loop over
> > /dev/dri files, trying to create EGL screens and moving to the next
> > one until they find one that works.
> >
> > They take the first render node (unless a specific one has been
> > configured), and assumes it will be able to render with it.
> >
> > To me it seems as if userspace expects that /dev/dri/renderD* devices
> > can be used for rendering and by breaking this assumption we would be
> > breaking existing software.
>
> Mm, it's sort of backwards from that. Compositors just take a
> non-render DRM node for KMS, then ask GBM+EGL to instantiate a GPU
> which can work with that. When run in headless mode, we don't take
> render nodes directly, but instead just create an EGLDisplay or
> VkPhysicalDevice and work backwards to a render node, rather than
> selecting a render node and going from there.
>
> So from that PoV I don't think it's really that harmful. The only
> complication is in Mesa, where it would see an etnaviv/amdgpu/...
> render node and potentially try to use it as a device. As long as Mesa
> can correctly skip, there should be no userspace API implications.
>
> That being said, I'm not entirely sure what the _benefit_ would be of
> exposing a render node for a device which can't be used by any
> 'traditional' DRM consumers, i.e. GL/Vulkan/winsys.

What I don't understand yet from Lucas proposal is how this isn't
going to break existing userspace.

I mean, even if we find a good way of having userspace skip
non-rendering render nodes, what about existing userspace that isn't
able to do that? Any updates to newer kernels are going to break them.

Regards,

Tomeu

Tomeu Vizoso June 17, 2024, 5:01 p.m. UTC | #12

Hi Lucas,

Do you have any idea on how not to break userspace if we expose a render node?

Cheers,

Tomeu

On Wed, Jun 12, 2024 at 4:26 PM Tomeu Vizoso <tomeu@tomeuvizoso.net> wrote:
>
> On Mon, May 20, 2024 at 1:19 PM Daniel Stone <daniel@fooishbar.org> wrote:
> >
> > Hi,
> >
> > On Mon, 20 May 2024 at 08:39, Tomeu Vizoso <tomeu@tomeuvizoso.net> wrote:
> > > On Fri, May 10, 2024 at 10:34 AM Lucas Stach <l.stach@pengutronix.de> wrote:
> > > > Am Mittwoch, dem 24.04.2024 um 08:37 +0200 schrieb Tomeu Vizoso:
> > > > > If we expose a render node for NPUs without rendering capabilities, the
> > > > > userspace stack will offer it to compositors and applications for
> > > > > rendering, which of course won't work.
> > > > >
> > > > > Userspace is probably right in not questioning whether a render node
> > > > > might not be capable of supporting rendering, so change it in the kernel
> > > > > instead by exposing a /dev/accel node.
> > > > >
> > > > > Before we bring the device up we don't know whether it is capable of
> > > > > rendering or not (depends on the features of its blocks), so first try
> > > > > to probe a rendering node, and if we find out that there is no rendering
> > > > > hardware, abort and retry with an accel node.
> > > >
> > > > On the other hand we already have precedence of compute only DRM
> > > > devices exposing a render node: there are AMD GPUs that don't expose a
> > > > graphics queue and are thus not able to actually render graphics. Mesa
> > > > already handles this in part via the PIPE_CAP_GRAPHICS and I think we
> > > > should simply extend this to not offer a EGL display on screens without
> > > > that capability.
> > >
> > > The problem with this is that the compositors I know don't loop over
> > > /dev/dri files, trying to create EGL screens and moving to the next
> > > one until they find one that works.
> > >
> > > They take the first render node (unless a specific one has been
> > > configured), and assumes it will be able to render with it.
> > >
> > > To me it seems as if userspace expects that /dev/dri/renderD* devices
> > > can be used for rendering and by breaking this assumption we would be
> > > breaking existing software.
> >
> > Mm, it's sort of backwards from that. Compositors just take a
> > non-render DRM node for KMS, then ask GBM+EGL to instantiate a GPU
> > which can work with that. When run in headless mode, we don't take
> > render nodes directly, but instead just create an EGLDisplay or
> > VkPhysicalDevice and work backwards to a render node, rather than
> > selecting a render node and going from there.
> >
> > So from that PoV I don't think it's really that harmful. The only
> > complication is in Mesa, where it would see an etnaviv/amdgpu/...
> > render node and potentially try to use it as a device. As long as Mesa
> > can correctly skip, there should be no userspace API implications.
> >
> > That being said, I'm not entirely sure what the _benefit_ would be of
> > exposing a render node for a device which can't be used by any
> > 'traditional' DRM consumers, i.e. GL/Vulkan/winsys.
>
> What I don't understand yet from Lucas proposal is how this isn't
> going to break existing userspace.
>
> I mean, even if we find a good way of having userspace skip
> non-rendering render nodes, what about existing userspace that isn't
> able to do that? Any updates to newer kernels are going to break them.
>
> Regards,
>
> Tomeu

Daniel Vetter June 26, 2024, 7:26 a.m. UTC | #13

On Thu, May 09, 2024 at 05:41:18PM +0300, Oded Gabbay wrote:
> On Thu, May 09, 2024 at 03:53:01PM +0200, Tomeu Vizoso wrote:
> > Oded, Dave,
> > 
> > Do you have an opinion on this?
> > 
> > Thanks,
> > 
> > Tomeu
> Hi Tomeu,
> 
> Sorry for not replying earlier, I was down with Covid (again...).
> 
> To your question, I don't have an objection to what you are
> suggesting. My personal view of accel is that it is an integral part of 
> DRM and therefore, if there is an *existing* drm driver that wants to 
> create an accel node, I'm not against it. 

Yeah, there's a continum from "clearly 3d gpu" to "compute AI
accelerator", with everything possible in-between shipping somewhere.
Collaboration is the important part, hair-splitting on where exactly the
driver should be is kinda secondary. I mean beyond "don't put a pure 3d
driver into accel or vice versa" of course :-)

> There is the question of why you want to expose an accel node, and
> here I would like to hear Dave's and Sima's opinion on your suggested
> solution as it may affect the direction of other drm drivers.

So existing userspace that blindly assumes that any render node will give
it useful 3d acceleration, then that's broken already.

- kernel with new driver support but old mesa without that driver already
  gives you that, even for a pure 3d chip.

- intel (and I think also amd) have pure compute chips without 3d, so this
  issue already exists

Same for the other directions, 3d gpus have variable amounts of compute
chips nowadays.

That leaves imo just the pragmatic choice, and if we need to complicate
the init flow of the kernel driver just for a different charnode major,
then I don't really see the point.

And if we do see the point in this, I think the right approach would be if
we split the init flow further into allocating the drm_device, and then in
a 2nd step either allocate the accel or render uapi stuff as needed. The
DRIVER_FOO flags just aren't super flexible for this kinda of stuff and
have a bit a midlayer taste to them.

Cheers, Sima

> 
> Thanks,
> Oded.
> 
> p.s.
> Please only use bottom-posting when replying, thanks :)
> 
> > 
> > On Fri, Apr 26, 2024 at 8:10 AM Tomeu Vizoso <tomeu@tomeuvizoso.net> wrote:
> > >
> > > On Thu, Apr 25, 2024 at 8:59 PM Jeffrey Hugo <quic_jhugo@quicinc.com> wrote:
> > > >
> > > > On 4/24/2024 12:37 AM, Tomeu Vizoso wrote:
> > > > > If we expose a render node for NPUs without rendering capabilities, the
> > > > > userspace stack will offer it to compositors and applications for
> > > > > rendering, which of course won't work.
> > > > >
> > > > > Userspace is probably right in not questioning whether a render node
> > > > > might not be capable of supporting rendering, so change it in the kernel
> > > > > instead by exposing a /dev/accel node.
> > > > >
> > > > > Before we bring the device up we don't know whether it is capable of
> > > > > rendering or not (depends on the features of its blocks), so first try
> > > > > to probe a rendering node, and if we find out that there is no rendering
> > > > > hardware, abort and retry with an accel node.
> > > > >
> > > > > Signed-off-by: Tomeu Vizoso <tomeu@tomeuvizoso.net>
> > > > > Cc: Oded Gabbay <ogabbay@kernel.org>
> > > >
> > > > I hope Oded chimes in as Accel maintainer.  I think Airlie/Vetter had
> > > > also previously mentioned they'd have opinions on what is Accel vs DRM.
> > > >
> > > > This gets a nack from me in its current state.  This is not a strong
> > > > nack, and I don't want to discourage you.  I think there is a path forward.
> > > >
> > > > The Accel subsystem documentation says that accel drivers will reside in
> > > > drivers/accel/ but this does not.
> > >
> > > Indeed, there is that code organization aspect.
> > >
> > > > Also, the commit text for "accel: add dedicated minor for accelerator
> > > > devices" mentions -
> > > >
> > > > "for drivers that
> > > > declare they handle compute accelerator, using a new driver feature
> > > > flag called DRIVER_COMPUTE_ACCEL. It is important to note that this
> > > > driver feature is mutually exclusive with DRIVER_RENDER. Devices that
> > > > want to expose both graphics and compute device char files should be
> > > > handled by two drivers that are connected using the auxiliary bus
> > > > framework."
> > > >
> > > > I don't see any of that happening here (two drivers connected by aux
> > > > bus, one in drivers/accel).
> > >
> > > Well, the text refers to devices, not drivers. The case we are talking
> > > about is a driver that wants to sometimes expose an accel node, and
> > > sometimes a render node, depending on the hardware it is dealing with.
> > > So there would either be a device exposing a single render node, or a
> > > device exposing a single accel node.
> > >
> > > Though by using the auxiliary bus we could in theory solve the code
> > > organization problem mentioned above, I'm not quite seeing how to do
> > > this in a clean way. The driver in /drivers/gpu/drm would have to be a
> > > DRM driver that doesn't register a DRM device, but registers a device
> > > in the auxiliary bus for the driver in /drivers/accel to bind to? Or
> > > are you seeing some possibility that would fit better in the current
> > > DRM framework?
> > >
> > > > I think this is the first case we've had of a combo DRM/Accel usecase,
> > > > and so there isn't an existing example to refer you to on how to
> > > > structure things.  I think you are going to be the first example where
> > > > we figure all of this out.
> > >
> > > Yep, I will be grateful for any ideas on how to structure this.
> > >
> > > > On a more implementation note, ioctls for Accel devices should not be
> > > > marked DRM_RENDER_ALLOW.  Seems like your attempt to reuse as much of
> > > > the code as possible trips over this.
> > >
> > > Indeed, thanks.
> > >
> > > Cheers,
> > >
> > > Tomeu
> > >
> > > > -Jeff

Daniel Vetter June 26, 2024, 7:28 a.m. UTC | #14

On Mon, Jun 17, 2024 at 07:01:05PM +0200, Tomeu Vizoso wrote:
> Hi Lucas,
> 
> Do you have any idea on how not to break userspace if we expose a render node?

So if you get a new chip with an incompatible 3d block, you already have
that issue. And I hope etnaviv userspace can cope.

Worst case you need to publish a fake extremely_fancy_3d_block to make
sure old mesa never binds against an NPU-only instance.

Or mesa just doesn't cope, in which case we need a etnaviv-v2-we_are_sorry
drm driver name, or something like that.
-Sima

> 
> Cheers,
> 
> Tomeu
> 
> On Wed, Jun 12, 2024 at 4:26 PM Tomeu Vizoso <tomeu@tomeuvizoso.net> wrote:
> >
> > On Mon, May 20, 2024 at 1:19 PM Daniel Stone <daniel@fooishbar.org> wrote:
> > >
> > > Hi,
> > >
> > > On Mon, 20 May 2024 at 08:39, Tomeu Vizoso <tomeu@tomeuvizoso.net> wrote:
> > > > On Fri, May 10, 2024 at 10:34 AM Lucas Stach <l.stach@pengutronix.de> wrote:
> > > > > Am Mittwoch, dem 24.04.2024 um 08:37 +0200 schrieb Tomeu Vizoso:
> > > > > > If we expose a render node for NPUs without rendering capabilities, the
> > > > > > userspace stack will offer it to compositors and applications for
> > > > > > rendering, which of course won't work.
> > > > > >
> > > > > > Userspace is probably right in not questioning whether a render node
> > > > > > might not be capable of supporting rendering, so change it in the kernel
> > > > > > instead by exposing a /dev/accel node.
> > > > > >
> > > > > > Before we bring the device up we don't know whether it is capable of
> > > > > > rendering or not (depends on the features of its blocks), so first try
> > > > > > to probe a rendering node, and if we find out that there is no rendering
> > > > > > hardware, abort and retry with an accel node.
> > > > >
> > > > > On the other hand we already have precedence of compute only DRM
> > > > > devices exposing a render node: there are AMD GPUs that don't expose a
> > > > > graphics queue and are thus not able to actually render graphics. Mesa
> > > > > already handles this in part via the PIPE_CAP_GRAPHICS and I think we
> > > > > should simply extend this to not offer a EGL display on screens without
> > > > > that capability.
> > > >
> > > > The problem with this is that the compositors I know don't loop over
> > > > /dev/dri files, trying to create EGL screens and moving to the next
> > > > one until they find one that works.
> > > >
> > > > They take the first render node (unless a specific one has been
> > > > configured), and assumes it will be able to render with it.
> > > >
> > > > To me it seems as if userspace expects that /dev/dri/renderD* devices
> > > > can be used for rendering and by breaking this assumption we would be
> > > > breaking existing software.
> > >
> > > Mm, it's sort of backwards from that. Compositors just take a
> > > non-render DRM node for KMS, then ask GBM+EGL to instantiate a GPU
> > > which can work with that. When run in headless mode, we don't take
> > > render nodes directly, but instead just create an EGLDisplay or
> > > VkPhysicalDevice and work backwards to a render node, rather than
> > > selecting a render node and going from there.
> > >
> > > So from that PoV I don't think it's really that harmful. The only
> > > complication is in Mesa, where it would see an etnaviv/amdgpu/...
> > > render node and potentially try to use it as a device. As long as Mesa
> > > can correctly skip, there should be no userspace API implications.
> > >
> > > That being said, I'm not entirely sure what the _benefit_ would be of
> > > exposing a render node for a device which can't be used by any
> > > 'traditional' DRM consumers, i.e. GL/Vulkan/winsys.
> >
> > What I don't understand yet from Lucas proposal is how this isn't
> > going to break existing userspace.
> >
> > I mean, even if we find a good way of having userspace skip
> > non-rendering render nodes, what about existing userspace that isn't
> > able to do that? Any updates to newer kernels are going to break them.
> >
> > Regards,
> >
> > Tomeu

Lucas Stach June 26, 2024, 8:28 a.m. UTC | #15

Am Mittwoch, dem 26.06.2024 um 09:28 +0200 schrieb Daniel Vetter:
> On Mon, Jun 17, 2024 at 07:01:05PM +0200, Tomeu Vizoso wrote:
> > Hi Lucas,
> > 
> > Do you have any idea on how not to break userspace if we expose a render node?
> 
> So if you get a new chip with an incompatible 3d block, you already have
> that issue. And I hope etnaviv userspace can cope.
> 
> Worst case you need to publish a fake extremely_fancy_3d_block to make
> sure old mesa never binds against an NPU-only instance.
> 
> Or mesa just doesn't cope, in which case we need a etnaviv-v2-we_are_sorry
> drm driver name, or something like that.

Mesa doesn't cope right now. Mostly because of the renderonly thing
where we magically need to match render devices to otherwise render
incapable KMS devices. The way this matching works is that the
renderonly code tries to open a screen on a rendernode and if that
succeeds we treat it as the matching render device.

The core of the issue is that we have no way of specifying which kind
of screen we need at that point, i.e. if the screen should have 3D
render capabilities or if compute-only or even NN-accel-only would be
okay. So we can't fail screen creation if there is no 3D engine, as
this would break the teflon case, which needs a screen for the NN
accel, but once we successfully create a screen reanderonly might treat
the thing as a rendering device.
So we are kind of stuck here between breaking one or the other use-
case. I'm leaning heavily into the direction of just fixing Mesa, so we
can specify the type of screen we need at creation time to avoid the
renderonly issue, porting this change as far back as reasonably
possible and file old userspace into shit-happens.

Regards,
Lucas

> 
> > 
> > Cheers,
> > 
> > Tomeu
> > 
> > On Wed, Jun 12, 2024 at 4:26 PM Tomeu Vizoso <tomeu@tomeuvizoso.net> wrote:
> > > 
> > > On Mon, May 20, 2024 at 1:19 PM Daniel Stone <daniel@fooishbar.org> wrote:
> > > > 
> > > > Hi,
> > > > 
> > > > On Mon, 20 May 2024 at 08:39, Tomeu Vizoso <tomeu@tomeuvizoso.net> wrote:
> > > > > On Fri, May 10, 2024 at 10:34 AM Lucas Stach <l.stach@pengutronix.de> wrote:
> > > > > > Am Mittwoch, dem 24.04.2024 um 08:37 +0200 schrieb Tomeu Vizoso:
> > > > > > > If we expose a render node for NPUs without rendering capabilities, the
> > > > > > > userspace stack will offer it to compositors and applications for
> > > > > > > rendering, which of course won't work.
> > > > > > > 
> > > > > > > Userspace is probably right in not questioning whether a render node
> > > > > > > might not be capable of supporting rendering, so change it in the kernel
> > > > > > > instead by exposing a /dev/accel node.
> > > > > > > 
> > > > > > > Before we bring the device up we don't know whether it is capable of
> > > > > > > rendering or not (depends on the features of its blocks), so first try
> > > > > > > to probe a rendering node, and if we find out that there is no rendering
> > > > > > > hardware, abort and retry with an accel node.
> > > > > > 
> > > > > > On the other hand we already have precedence of compute only DRM
> > > > > > devices exposing a render node: there are AMD GPUs that don't expose a
> > > > > > graphics queue and are thus not able to actually render graphics. Mesa
> > > > > > already handles this in part via the PIPE_CAP_GRAPHICS and I think we
> > > > > > should simply extend this to not offer a EGL display on screens without
> > > > > > that capability.
> > > > > 
> > > > > The problem with this is that the compositors I know don't loop over
> > > > > /dev/dri files, trying to create EGL screens and moving to the next
> > > > > one until they find one that works.
> > > > > 
> > > > > They take the first render node (unless a specific one has been
> > > > > configured), and assumes it will be able to render with it.
> > > > > 
> > > > > To me it seems as if userspace expects that /dev/dri/renderD* devices
> > > > > can be used for rendering and by breaking this assumption we would be
> > > > > breaking existing software.
> > > > 
> > > > Mm, it's sort of backwards from that. Compositors just take a
> > > > non-render DRM node for KMS, then ask GBM+EGL to instantiate a GPU
> > > > which can work with that. When run in headless mode, we don't take
> > > > render nodes directly, but instead just create an EGLDisplay or
> > > > VkPhysicalDevice and work backwards to a render node, rather than
> > > > selecting a render node and going from there.
> > > > 
> > > > So from that PoV I don't think it's really that harmful. The only
> > > > complication is in Mesa, where it would see an etnaviv/amdgpu/...
> > > > render node and potentially try to use it as a device. As long as Mesa
> > > > can correctly skip, there should be no userspace API implications.
> > > > 
> > > > That being said, I'm not entirely sure what the _benefit_ would be of
> > > > exposing a render node for a device which can't be used by any
> > > > 'traditional' DRM consumers, i.e. GL/Vulkan/winsys.
> > > 
> > > What I don't understand yet from Lucas proposal is how this isn't
> > > going to break existing userspace.
> > > 
> > > I mean, even if we find a good way of having userspace skip
> > > non-rendering render nodes, what about existing userspace that isn't
> > > able to do that? Any updates to newer kernels are going to break them.
> > > 
> > > Regards,
> > > 
> > > Tomeu
>

Dmitry Baryshkov June 26, 2024, 8:42 a.m. UTC | #16

On Wed, Jun 26, 2024 at 09:26:40AM GMT, Daniel Vetter wrote:
> On Thu, May 09, 2024 at 05:41:18PM +0300, Oded Gabbay wrote:
> > On Thu, May 09, 2024 at 03:53:01PM +0200, Tomeu Vizoso wrote:
> > > Oded, Dave,
> > > 
> > > Do you have an opinion on this?
> > > 
> > > Thanks,
> > > 
> > > Tomeu
> > Hi Tomeu,
> > 
> > Sorry for not replying earlier, I was down with Covid (again...).
> > 
> > To your question, I don't have an objection to what you are
> > suggesting. My personal view of accel is that it is an integral part of 
> > DRM and therefore, if there is an *existing* drm driver that wants to 
> > create an accel node, I'm not against it. 
> 
> Yeah, there's a continum from "clearly 3d gpu" to "compute AI
> accelerator", with everything possible in-between shipping somewhere.
> Collaboration is the important part, hair-splitting on where exactly the
> driver should be is kinda secondary. I mean beyond "don't put a pure 3d
> driver into accel or vice versa" of course :-)
> 
> > There is the question of why you want to expose an accel node, and
> > here I would like to hear Dave's and Sima's opinion on your suggested
> > solution as it may affect the direction of other drm drivers.
> 
> So existing userspace that blindly assumes that any render node will give
> it useful 3d acceleration, then that's broken already.
> 
> - kernel with new driver support but old mesa without that driver already
>   gives you that, even for a pure 3d chip.
> 
> - intel (and I think also amd) have pure compute chips without 3d, so this
>   issue already exists
> 
> Same for the other directions, 3d gpus have variable amounts of compute
> chips nowadays.
> 
> That leaves imo just the pragmatic choice, and if we need to complicate
> the init flow of the kernel driver just for a different charnode major,
> then I don't really see the point.
> 
> And if we do see the point in this, I think the right approach would be if
> we split the init flow further into allocating the drm_device, and then in
> a 2nd step either allocate the accel or render uapi stuff as needed. The
> DRIVER_FOO flags just aren't super flexible for this kinda of stuff and
> have a bit a midlayer taste to them.

Being able to defer render allocation would be extremely useful for MSM
too as it's not currently possible to mask the driver_features during
drm_dev_init()

Daniel Stone June 26, 2024, 10:39 a.m. UTC | #17

Hi,

On Wed, 26 Jun 2024 at 09:28, Lucas Stach <l.stach@pengutronix.de> wrote:
> Mesa doesn't cope right now. Mostly because of the renderonly thing
> where we magically need to match render devices to otherwise render
> incapable KMS devices. The way this matching works is that the
> renderonly code tries to open a screen on a rendernode and if that
> succeeds we treat it as the matching render device.
>
> The core of the issue is that we have no way of specifying which kind
> of screen we need at that point, i.e. if the screen should have 3D
> render capabilities or if compute-only or even NN-accel-only would be
> okay. So we can't fail screen creation if there is no 3D engine, as
> this would break the teflon case, which needs a screen for the NN
> accel, but once we successfully create a screen reanderonly might treat
> the thing as a rendering device.
> So we are kind of stuck here between breaking one or the other use-
> case. I'm leaning heavily into the direction of just fixing Mesa, so we
> can specify the type of screen we need at creation time to avoid the
> renderonly issue, porting this change as far back as reasonably
> possible and file old userspace into shit-happens.

Yeah, honestly this sounds like the best solution to me too.

Cheers,
Daniel

Daniel Vetter June 26, 2024, 5:49 p.m. UTC | #18

On Wed, Jun 26, 2024 at 11:42:24AM +0300, Dmitry Baryshkov wrote:
> On Wed, Jun 26, 2024 at 09:26:40AM GMT, Daniel Vetter wrote:
> > On Thu, May 09, 2024 at 05:41:18PM +0300, Oded Gabbay wrote:
> > > On Thu, May 09, 2024 at 03:53:01PM +0200, Tomeu Vizoso wrote:
> > > > Oded, Dave,
> > > > 
> > > > Do you have an opinion on this?
> > > > 
> > > > Thanks,
> > > > 
> > > > Tomeu
> > > Hi Tomeu,
> > > 
> > > Sorry for not replying earlier, I was down with Covid (again...).
> > > 
> > > To your question, I don't have an objection to what you are
> > > suggesting. My personal view of accel is that it is an integral part of 
> > > DRM and therefore, if there is an *existing* drm driver that wants to 
> > > create an accel node, I'm not against it. 
> > 
> > Yeah, there's a continum from "clearly 3d gpu" to "compute AI
> > accelerator", with everything possible in-between shipping somewhere.
> > Collaboration is the important part, hair-splitting on where exactly the
> > driver should be is kinda secondary. I mean beyond "don't put a pure 3d
> > driver into accel or vice versa" of course :-)
> > 
> > > There is the question of why you want to expose an accel node, and
> > > here I would like to hear Dave's and Sima's opinion on your suggested
> > > solution as it may affect the direction of other drm drivers.
> > 
> > So existing userspace that blindly assumes that any render node will give
> > it useful 3d acceleration, then that's broken already.
> > 
> > - kernel with new driver support but old mesa without that driver already
> >   gives you that, even for a pure 3d chip.
> > 
> > - intel (and I think also amd) have pure compute chips without 3d, so this
> >   issue already exists
> > 
> > Same for the other directions, 3d gpus have variable amounts of compute
> > chips nowadays.
> > 
> > That leaves imo just the pragmatic choice, and if we need to complicate
> > the init flow of the kernel driver just for a different charnode major,
> > then I don't really see the point.
> > 
> > And if we do see the point in this, I think the right approach would be if
> > we split the init flow further into allocating the drm_device, and then in
> > a 2nd step either allocate the accel or render uapi stuff as needed. The
> > DRIVER_FOO flags just aren't super flexible for this kinda of stuff and
> > have a bit a midlayer taste to them.
> 
> Being able to defer render allocation would be extremely useful for MSM
> too as it's not currently possible to mask the driver_features during
> drm_dev_init()

Eh I think less driver_features and more explicit (like
drm_mode_config_init() instead of also having to set DRIVER_MODESET) stuff
would be better in general. But they keep popping up, because it's an easy
hack to get things going. Over the years I've managed to remove a lot of
them tough.
-Sima

Daniel Vetter June 26, 2024, 5:52 p.m. UTC | #19

On Wed, Jun 26, 2024 at 11:39:01AM +0100, Daniel Stone wrote:
> Hi,
> 
> On Wed, 26 Jun 2024 at 09:28, Lucas Stach <l.stach@pengutronix.de> wrote:
> > Mesa doesn't cope right now. Mostly because of the renderonly thing
> > where we magically need to match render devices to otherwise render
> > incapable KMS devices. The way this matching works is that the
> > renderonly code tries to open a screen on a rendernode and if that
> > succeeds we treat it as the matching render device.
> >
> > The core of the issue is that we have no way of specifying which kind
> > of screen we need at that point, i.e. if the screen should have 3D
> > render capabilities or if compute-only or even NN-accel-only would be
> > okay. So we can't fail screen creation if there is no 3D engine, as
> > this would break the teflon case, which needs a screen for the NN
> > accel, but once we successfully create a screen reanderonly might treat
> > the thing as a rendering device.
> > So we are kind of stuck here between breaking one or the other use-
> > case. I'm leaning heavily into the direction of just fixing Mesa, so we
> > can specify the type of screen we need at creation time to avoid the
> > renderonly issue, porting this change as far back as reasonably
> > possible and file old userspace into shit-happens.
> 
> Yeah, honestly this sounds like the best solution to me too.

Yeah mesa sounds kinda broken here ...

What might work in the kernel is if you publish a fake 3d engine that's
too new for broken mesa, if that's enough to make it fail to bind? And if
mesa still happily binds against that, then yeah it's probably too broken
and we need etnaviv-v2 (as a drm driver uapi name, I think that's what
mesa filters?) for anything new (including the NN-only ones).

I would still try to avoid that, but just in case someone screams about
regressions.
-Sima

Daniel Stone June 26, 2024, 7:26 p.m. UTC | #20

On Wed, 26 Jun 2024 at 18:52, Daniel Vetter <daniel@ffwll.ch> wrote:
> On Wed, Jun 26, 2024 at 11:39:01AM +0100, Daniel Stone wrote:
> > On Wed, 26 Jun 2024 at 09:28, Lucas Stach <l.stach@pengutronix.de> wrote:
> > > So we are kind of stuck here between breaking one or the other use-
> > > case. I'm leaning heavily into the direction of just fixing Mesa, so we
> > > can specify the type of screen we need at creation time to avoid the
> > > renderonly issue, porting this change as far back as reasonably
> > > possible and file old userspace into shit-happens.
> >
> > Yeah, honestly this sounds like the best solution to me too.
>
> Yeah mesa sounds kinda broken here ...
>
> What might work in the kernel is if you publish a fake 3d engine that's
> too new for broken mesa, if that's enough to make it fail to bind? And if
> mesa still happily binds against that, then yeah it's probably too broken
> and we need etnaviv-v2 (as a drm driver uapi name, I think that's what
> mesa filters?) for anything new (including the NN-only ones).
>
> I would still try to avoid that, but just in case someone screams about
> regressions.

It's not just etnaviv, it's literally every Mesa driver which works
with decoupled render/display. So that would be etnaviv-v2,
panfrost-v2, panthor-v2, v3d-v2, powervr-v2, ... albeit those don't
tend to have multiple instances.

Anyway, I'm still leaning towards the answer being: this is not an
etnaviv regression caused by NPU, it's a longstanding generic Mesa
issue for which the answer is to fix the known fragility.

Cheers,
Daniel

Tomeu Vizoso June 28, 2024, 9:43 a.m. UTC | #21

On Wed, Jun 26, 2024 at 9:26 PM Daniel Stone <daniel@fooishbar.org> wrote:
>
> On Wed, 26 Jun 2024 at 18:52, Daniel Vetter <daniel@ffwll.ch> wrote:
> > On Wed, Jun 26, 2024 at 11:39:01AM +0100, Daniel Stone wrote:
> > > On Wed, 26 Jun 2024 at 09:28, Lucas Stach <l.stach@pengutronix.de> wrote:
> > > > So we are kind of stuck here between breaking one or the other use-
> > > > case. I'm leaning heavily into the direction of just fixing Mesa, so we
> > > > can specify the type of screen we need at creation time to avoid the
> > > > renderonly issue, porting this change as far back as reasonably
> > > > possible and file old userspace into shit-happens.
> > >
> > > Yeah, honestly this sounds like the best solution to me too.
> >
> > Yeah mesa sounds kinda broken here ...
> >
> > What might work in the kernel is if you publish a fake 3d engine that's
> > too new for broken mesa, if that's enough to make it fail to bind? And if
> > mesa still happily binds against that, then yeah it's probably too broken
> > and we need etnaviv-v2 (as a drm driver uapi name, I think that's what
> > mesa filters?) for anything new (including the NN-only ones).
> >
> > I would still try to avoid that, but just in case someone screams about
> > regressions.

Thanks everybody for chiming in.

> It's not just etnaviv, it's literally every Mesa driver which works
> with decoupled render/display. So that would be etnaviv-v2,
> panfrost-v2, panthor-v2, v3d-v2, powervr-v2, ... albeit those don't
> tend to have multiple instances.

TBH, I think VeriSilicon is the only IP vendor that has recycled a
render-only IP into a compute-only IP.

That is why I liked the approach of conditionally creating an accel
node, as it neatly reflects that reality.

> Anyway, I'm still leaning towards the answer being: this is not an
> etnaviv regression caused by NPU, it's a longstanding generic Mesa
> issue for which the answer is to fix the known fragility.

My understanding of the consensus so far is that Mesa should be fixed
so that Gallium drivers can fail at screen init if the device doesn't
support some new usage flags that we would be adding.

If for some reason that doesn't work, we would be looking at having
etnaviv use a different kind of driver name, such as etnaviv-npu or
etnaviv-compute.

Did I get it right?

Thanks,

Tomeu

Daniel Stone June 28, 2024, 9:54 a.m. UTC | #22

Hi,

On Fri, 28 Jun 2024 at 10:43, Tomeu Vizoso <tomeu@tomeuvizoso.net> wrote:
> On Wed, Jun 26, 2024 at 9:26 PM Daniel Stone <daniel@fooishbar.org> wrote:
> > It's not just etnaviv, it's literally every Mesa driver which works
> > with decoupled render/display. So that would be etnaviv-v2,
> > panfrost-v2, panthor-v2, v3d-v2, powervr-v2, ... albeit those don't
> > tend to have multiple instances.
>
> TBH, I think VeriSilicon is the only IP vendor that has recycled a
> render-only IP into a compute-only IP.
>
> That is why I liked the approach of conditionally creating an accel
> node, as it neatly reflects that reality.
>
> > Anyway, I'm still leaning towards the answer being: this is not an
> > etnaviv regression caused by NPU, it's a longstanding generic Mesa
> > issue for which the answer is to fix the known fragility.
>
> My understanding of the consensus so far is that Mesa should be fixed
> so that Gallium drivers can fail at screen init if the device doesn't
> support some new usage flags that we would be adding.
>
> If for some reason that doesn't work, we would be looking at having
> etnaviv use a different kind of driver name, such as etnaviv-npu or
> etnaviv-compute.
>
> Did I get it right?

Yep, wfm. :)

Cheers,
Daniel

Daniel Vetter June 28, 2024, 5:47 p.m. UTC | #23

On Wed, Jun 26, 2024 at 08:26:04PM +0100, Daniel Stone wrote:
> On Wed, 26 Jun 2024 at 18:52, Daniel Vetter <daniel@ffwll.ch> wrote:
> > On Wed, Jun 26, 2024 at 11:39:01AM +0100, Daniel Stone wrote:
> > > On Wed, 26 Jun 2024 at 09:28, Lucas Stach <l.stach@pengutronix.de> wrote:
> > > > So we are kind of stuck here between breaking one or the other use-
> > > > case. I'm leaning heavily into the direction of just fixing Mesa, so we
> > > > can specify the type of screen we need at creation time to avoid the
> > > > renderonly issue, porting this change as far back as reasonably
> > > > possible and file old userspace into shit-happens.
> > >
> > > Yeah, honestly this sounds like the best solution to me too.
> >
> > Yeah mesa sounds kinda broken here ...
> >
> > What might work in the kernel is if you publish a fake 3d engine that's
> > too new for broken mesa, if that's enough to make it fail to bind? And if
> > mesa still happily binds against that, then yeah it's probably too broken
> > and we need etnaviv-v2 (as a drm driver uapi name, I think that's what
> > mesa filters?) for anything new (including the NN-only ones).
> >
> > I would still try to avoid that, but just in case someone screams about
> > regressions.
> 
> It's not just etnaviv, it's literally every Mesa driver which works
> with decoupled render/display. So that would be etnaviv-v2,
> panfrost-v2, panthor-v2, v3d-v2, powervr-v2, ... albeit those don't
> tend to have multiple instances.

So essentially mesa just burns&crashes when old mesa runs on a newer
kernel with support for a chip that mesa doesn't know about?

> Anyway, I'm still leaning towards the answer being: this is not an
> etnaviv regression caused by NPU, it's a longstanding generic Mesa
> issue for which the answer is to fix the known fragility.

If the above is correct, then yes I think we should just fix mesa. Feels
like the breakage is too obviously there, and that's all we'll do unless
the screaming gets too loud.
-Sima

drm/etnaviv: Create an accel device node if compute-only

Commit Message

Comments

Patch