diff mbox series

drm/tegra: Remove of_dma_configure() from host1x_device_add()

Message ID 0-v1-c76c50cd425e+15298-host1x_no_iommu_conf_jgg@nvidia.com (mailing list archive)
State New, archived
Headers show
Series drm/tegra: Remove of_dma_configure() from host1x_device_add() | expand

Commit Message

Jason Gunthorpe Jan. 30, 2024, 4:15 p.m. UTC
This was added in commit c95469aa5a18 ("gpu: host1x: Set DMA ops on device
creation") with the note:

    Currently host1x-instanciated devices have their dma_ops left to NULL,
    which makes any DMA operation (like buffer import) on ARM64 fallback
    to the dummy_dma_ops and fail with an error.

Since commit 14891af3799e ("iommu: Move the iommu driver sysfs setup into
iommu_init/deinit_device()") this call now fails because the struct device
is not fully configured enough to setup the sysfs and we now catch that
error.

This failure means the DMA ops are no longer set during this failing call.

It seems this is no longer a problem because
commit 07397df29e57 ("dma-mapping: move dma configuration to bus
infrastructure") added another call to of_dma_configure() inside the
bus_type->dma_configure() callback.

So long as a driver is probed the to the device it will have DMA properly
setup in the ordinary way.

Remove the unnecessary call which also removes the new long print:

[    1.200004] host1x drm: iommu configuration for device failed with -ENOENT

Reported-by: Diogo Ivo <diogo.ivo@tecnico.ulisboa.pt>
Closes: https://lore.kernel.org/all/bbmhcoghrprmbdibnjum6lefix2eoquxrde7wyqeulm4xabmlm@b6jy32saugqh/
Reported-by: Jon Hunter <jonathanh@nvidia.com>
Closes: https://lore.kernel.org/all/b0334c5e-3a6c-4b58-b525-e72bed8899b3@nvidia.com/
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/gpu/host1x/bus.c | 2 --
 1 file changed, 2 deletions(-)


base-commit: 3049f92c481204f142226d3672711660025fbbb5

Comments

Jon Hunter Jan. 30, 2024, 9:55 p.m. UTC | #1
On 30/01/2024 16:15, Jason Gunthorpe wrote:
> This was added in commit c95469aa5a18 ("gpu: host1x: Set DMA ops on device
> creation") with the note:
> 
>      Currently host1x-instanciated devices have their dma_ops left to NULL,
>      which makes any DMA operation (like buffer import) on ARM64 fallback
>      to the dummy_dma_ops and fail with an error.
> 
> Since commit 14891af3799e ("iommu: Move the iommu driver sysfs setup into
> iommu_init/deinit_device()") this call now fails because the struct device
> is not fully configured enough to setup the sysfs and we now catch that
> error.
> 
> This failure means the DMA ops are no longer set during this failing call.
> 
> It seems this is no longer a problem because
> commit 07397df29e57 ("dma-mapping: move dma configuration to bus
> infrastructure") added another call to of_dma_configure() inside the
> bus_type->dma_configure() callback.
> 
> So long as a driver is probed the to the device it will have DMA properly
> setup in the ordinary way.
> 
> Remove the unnecessary call which also removes the new long print:
> 
> [    1.200004] host1x drm: iommu configuration for device failed with -ENOENT
> 
> Reported-by: Diogo Ivo <diogo.ivo@tecnico.ulisboa.pt>
> Closes: https://lore.kernel.org/all/bbmhcoghrprmbdibnjum6lefix2eoquxrde7wyqeulm4xabmlm@b6jy32saugqh/
> Reported-by: Jon Hunter <jonathanh@nvidia.com>
> Closes: https://lore.kernel.org/all/b0334c5e-3a6c-4b58-b525-e72bed8899b3@nvidia.com/
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>   drivers/gpu/host1x/bus.c | 2 --
>   1 file changed, 2 deletions(-)
> 
> diff --git a/drivers/gpu/host1x/bus.c b/drivers/gpu/host1x/bus.c
> index 84d042796d2e66..61214d35cadc34 100644
> --- a/drivers/gpu/host1x/bus.c
> +++ b/drivers/gpu/host1x/bus.c
> @@ -458,8 +458,6 @@ static int host1x_device_add(struct host1x *host1x,
>   	device->dev.bus = &host1x_bus_type;
>   	device->dev.parent = host1x->dev;
>   
> -	of_dma_configure(&device->dev, host1x->dev->of_node, true);
> -
>   	device->dev.dma_parms = &device->dma_parms;
>   	dma_set_max_seg_size(&device->dev, UINT_MAX);


In my case the warning is coming from the of_dma_configure_id() in 
drivers/gpu/host1x/context.c. So with the above change I am still seeing 
the warning.

BTW, the subject prefix should be 'gpu: host1x:' for this change.

Jon
Jason Gunthorpe Jan. 31, 2024, 3:33 p.m. UTC | #2
On Tue, Jan 30, 2024 at 09:55:18PM +0000, Jon Hunter wrote:
> 
> On 30/01/2024 16:15, Jason Gunthorpe wrote:
> > This was added in commit c95469aa5a18 ("gpu: host1x: Set DMA ops on device
> > creation") with the note:
> > 
> >      Currently host1x-instanciated devices have their dma_ops left to NULL,
> >      which makes any DMA operation (like buffer import) on ARM64 fallback
> >      to the dummy_dma_ops and fail with an error.
> > 
> > Since commit 14891af3799e ("iommu: Move the iommu driver sysfs setup into
> > iommu_init/deinit_device()") this call now fails because the struct device
> > is not fully configured enough to setup the sysfs and we now catch that
> > error.
> > 
> > This failure means the DMA ops are no longer set during this failing call.

Looking at it more it seems the arch dma ops are setup still, we
ignore the failure on multiple levels :(

> > It seems this is no longer a problem because
> > commit 07397df29e57 ("dma-mapping: move dma configuration to bus
> > infrastructure") added another call to of_dma_configure() inside the
> > bus_type->dma_configure() callback.
> > 
> > So long as a driver is probed the to the device it will have DMA properly
> > setup in the ordinary way.
> > 
> > Remove the unnecessary call which also removes the new long print:
> > 
> > [    1.200004] host1x drm: iommu configuration for device failed with -ENOENT
> > 
> > Reported-by: Diogo Ivo <diogo.ivo@tecnico.ulisboa.pt>
> > Closes: https://lore.kernel.org/all/bbmhcoghrprmbdibnjum6lefix2eoquxrde7wyqeulm4xabmlm@b6jy32saugqh/
> > Reported-by: Jon Hunter <jonathanh@nvidia.com>
> > Closes: https://lore.kernel.org/all/b0334c5e-3a6c-4b58-b525-e72bed8899b3@nvidia.com/
> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > ---
> >   drivers/gpu/host1x/bus.c | 2 --
> >   1 file changed, 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/host1x/bus.c b/drivers/gpu/host1x/bus.c
> > index 84d042796d2e66..61214d35cadc34 100644
> > --- a/drivers/gpu/host1x/bus.c
> > +++ b/drivers/gpu/host1x/bus.c
> > @@ -458,8 +458,6 @@ static int host1x_device_add(struct host1x *host1x,
> >   	device->dev.bus = &host1x_bus_type;
> >   	device->dev.parent = host1x->dev;
> > -	of_dma_configure(&device->dev, host1x->dev->of_node, true);
> > -
> >   	device->dev.dma_parms = &device->dma_parms;
> >   	dma_set_max_seg_size(&device->dev, UINT_MAX);
> 
> 
> In my case the warning is coming from the of_dma_configure_id() in
> drivers/gpu/host1x/context.c. So with the above change I am still seeing the
> warning.

You mean this sequence?

		err = device_add(&ctx->dev);
		if (err) {
			dev_err(host1x->dev, "could not add context device %d: %d\n", i, err);
			put_device(&ctx->dev);
			goto unreg_devices;
		}

		err = of_dma_configure_id(&ctx->dev, node, true, &i);
		if (err) {
			dev_err(host1x->dev, "IOMMU configuration failed for context device %d: %d\n",
				i, err);
			device_unregister(&ctx->dev);
			goto unreg_devices;
		}

I didn't seem an obvious place that this would get fixed up later?

device_add() was done before so the iommu_device_link() shouldn't be
failing? Are you hitting a duplicate link (ie remove the nowarn from
iommu_device_link())

Jason
Jon Hunter Feb. 1, 2024, 7:35 p.m. UTC | #3
On 31/01/2024 15:33, Jason Gunthorpe wrote:
> On Tue, Jan 30, 2024 at 09:55:18PM +0000, Jon Hunter wrote:
>>
>> On 30/01/2024 16:15, Jason Gunthorpe wrote:
>>> This was added in commit c95469aa5a18 ("gpu: host1x: Set DMA ops on device
>>> creation") with the note:
>>>
>>>       Currently host1x-instanciated devices have their dma_ops left to NULL,
>>>       which makes any DMA operation (like buffer import) on ARM64 fallback
>>>       to the dummy_dma_ops and fail with an error.
>>>
>>> Since commit 14891af3799e ("iommu: Move the iommu driver sysfs setup into
>>> iommu_init/deinit_device()") this call now fails because the struct device
>>> is not fully configured enough to setup the sysfs and we now catch that
>>> error.
>>>
>>> This failure means the DMA ops are no longer set during this failing call.
> 
> Looking at it more it seems the arch dma ops are setup still, we
> ignore the failure on multiple levels :(
> 
>>> It seems this is no longer a problem because
>>> commit 07397df29e57 ("dma-mapping: move dma configuration to bus
>>> infrastructure") added another call to of_dma_configure() inside the
>>> bus_type->dma_configure() callback.
>>>
>>> So long as a driver is probed the to the device it will have DMA properly
>>> setup in the ordinary way.
>>>
>>> Remove the unnecessary call which also removes the new long print:
>>>
>>> [    1.200004] host1x drm: iommu configuration for device failed with -ENOENT
>>>
>>> Reported-by: Diogo Ivo <diogo.ivo@tecnico.ulisboa.pt>
>>> Closes: https://lore.kernel.org/all/bbmhcoghrprmbdibnjum6lefix2eoquxrde7wyqeulm4xabmlm@b6jy32saugqh/
>>> Reported-by: Jon Hunter <jonathanh@nvidia.com>
>>> Closes: https://lore.kernel.org/all/b0334c5e-3a6c-4b58-b525-e72bed8899b3@nvidia.com/
>>> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>>> ---
>>>    drivers/gpu/host1x/bus.c | 2 --
>>>    1 file changed, 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/host1x/bus.c b/drivers/gpu/host1x/bus.c
>>> index 84d042796d2e66..61214d35cadc34 100644
>>> --- a/drivers/gpu/host1x/bus.c
>>> +++ b/drivers/gpu/host1x/bus.c
>>> @@ -458,8 +458,6 @@ static int host1x_device_add(struct host1x *host1x,
>>>    	device->dev.bus = &host1x_bus_type;
>>>    	device->dev.parent = host1x->dev;
>>> -	of_dma_configure(&device->dev, host1x->dev->of_node, true);
>>> -
>>>    	device->dev.dma_parms = &device->dma_parms;
>>>    	dma_set_max_seg_size(&device->dev, UINT_MAX);
>>
>>
>> In my case the warning is coming from the of_dma_configure_id() in
>> drivers/gpu/host1x/context.c. So with the above change I am still seeing the
>> warning.
> 
> You mean this sequence?
> 
> 		err = device_add(&ctx->dev);
> 		if (err) {
> 			dev_err(host1x->dev, "could not add context device %d: %d\n", i, err);
> 			put_device(&ctx->dev);
> 			goto unreg_devices;
> 		}
> 
> 		err = of_dma_configure_id(&ctx->dev, node, true, &i);
> 		if (err) {
> 			dev_err(host1x->dev, "IOMMU configuration failed for context device %d: %d\n",
> 				i, err);
> 			device_unregister(&ctx->dev);
> 			goto unreg_devices;
> 		}

Yes this sequence.

> I didn't seem an obvious place that this would get fixed up later?
> 
> device_add() was done before so the iommu_device_link() shouldn't be
> failing? Are you hitting a duplicate link (ie remove the nowarn from
> iommu_device_link())


Removing the '_nowarn' does appear to fix it, although it is not clear 
to me why?

Thanks
Jon
Jason Gunthorpe Feb. 1, 2024, 8:02 p.m. UTC | #4
On Thu, Feb 01, 2024 at 07:35:24PM +0000, Jon Hunter wrote:
> > You mean this sequence?
> > 
> > 		err = device_add(&ctx->dev);
> > 		if (err) {
> > 			dev_err(host1x->dev, "could not add context device %d: %d\n", i, err);
> > 			put_device(&ctx->dev);
> > 			goto unreg_devices;
> > 		}
> > 
> > 		err = of_dma_configure_id(&ctx->dev, node, true, &i);
> > 		if (err) {
> > 			dev_err(host1x->dev, "IOMMU configuration failed for context device %d: %d\n",
> > 				i, err);
> > 			device_unregister(&ctx->dev);
> > 			goto unreg_devices;
> > 		}
> 
> Yes this sequence.
> 
> > I didn't seem an obvious place that this would get fixed up later?
> > 
> > device_add() was done before so the iommu_device_link() shouldn't be
> > failing? Are you hitting a duplicate link (ie remove the nowarn from
> > iommu_device_link())
> 
> 
> Removing the '_nowarn' does appear to fix it, although it is not clear to me
> why?

What did you do to remove? Just the letters or the whole line?

I was thinking the letters because it triggers a large debug message.

But, what is the actual log output you see, is it -EEXIST?

If it is coming and going is it a race of some kind?

Jason
Jon Hunter Feb. 2, 2024, 10:40 a.m. UTC | #5
On 01/02/2024 20:02, Jason Gunthorpe wrote:
> On Thu, Feb 01, 2024 at 07:35:24PM +0000, Jon Hunter wrote:
>>> You mean this sequence?
>>>
>>> 		err = device_add(&ctx->dev);
>>> 		if (err) {
>>> 			dev_err(host1x->dev, "could not add context device %d: %d\n", i, err);
>>> 			put_device(&ctx->dev);
>>> 			goto unreg_devices;
>>> 		}
>>>
>>> 		err = of_dma_configure_id(&ctx->dev, node, true, &i);
>>> 		if (err) {
>>> 			dev_err(host1x->dev, "IOMMU configuration failed for context device %d: %d\n",
>>> 				i, err);
>>> 			device_unregister(&ctx->dev);
>>> 			goto unreg_devices;
>>> 		}
>>
>> Yes this sequence.
>>
>>> I didn't seem an obvious place that this would get fixed up later?
>>>
>>> device_add() was done before so the iommu_device_link() shouldn't be
>>> failing? Are you hitting a duplicate link (ie remove the nowarn from
>>> iommu_device_link())
>>
>>
>> Removing the '_nowarn' does appear to fix it, although it is not clear to me
>> why?
> 
> What did you do to remove? Just the letters or the whole line?


Yes sorry, to be clear I made the following change ...

diff --git a/drivers/iommu/iommu-sysfs.c b/drivers/iommu/iommu-sysfs.c
index cbe378c34ba3..7bc9c6d2baab 100644
--- a/drivers/iommu/iommu-sysfs.c
+++ b/drivers/iommu/iommu-sysfs.c
@@ -112,7 +112,7 @@ int iommu_device_link(struct iommu_device *iommu, struct device *link)
         if (ret)
                 return ret;
  
-       ret = sysfs_create_link_nowarn(&link->kobj, &iommu->dev->kobj, "iommu");
+       ret = sysfs_create_link(&link->kobj, &iommu->dev->kobj, "iommu");
         if (ret)
  
> I was thinking the letters because it triggers a large debug message.
> 
> But, what is the actual log output you see, is it -EEXIST?

I see ...

  ERR KERN host1x drm: iommu configuration for device failed with -ENOENT
  
> If it is coming and going is it a race of some kind?

It is consistent without the above. However, I did not think that the
above change would change the returning on -ENOENT? I will add more
debug.

Jon
Jason Gunthorpe Feb. 2, 2024, 2:35 p.m. UTC | #6
On Fri, Feb 02, 2024 at 10:40:36AM +0000, Jon Hunter wrote:

> > But, what is the actual log output you see, is it -EEXIST?
> 
> I see ...
> 
>  ERR KERN host1x drm: iommu configuration for device failed with -ENOENT

So that shouldn't happen in you case as far as I can tell, the device
is properly added, link->kobj should be fine and ENOENT shouldn't
happen.

> > If it is coming and going is it a race of some kind?
> 
> It is consistent without the above. However, I did not think that the
> above change would change the returning on -ENOENT? I will add more
> debug.

I do not think it can either

Still wonder if there is some odd race..

Let me know when you figure out what is happening - I think there is
some bug here it is not just a harmless warning.

Jason
Jon Hunter Feb. 2, 2024, 3:56 p.m. UTC | #7
On 02/02/2024 14:35, Jason Gunthorpe wrote:
> On Fri, Feb 02, 2024 at 10:40:36AM +0000, Jon Hunter wrote:
> 
>>> But, what is the actual log output you see, is it -EEXIST?
>>
>> I see ...
>>
>>   ERR KERN host1x drm: iommu configuration for device failed with -ENOENT
> 
> So that shouldn't happen in you case as far as I can tell, the device
> is properly added, link->kobj should be fine and ENOENT shouldn't
> happen.
> 
>>> If it is coming and going is it a race of some kind?
>>
>> It is consistent without the above. However, I did not think that the
>> above change would change the returning on -ENOENT? I will add more
>> debug.
> 
> I do not think it can either
> 
> Still wonder if there is some odd race..
> 
> Let me know when you figure out what is happening - I think there is
> some bug here it is not just a harmless warning.


Yes looks like a race of some sort. Adding a bit of debug also makes the 
issue go away so difficult to see what is happening.

Jon
Jason Gunthorpe Feb. 2, 2024, 4:15 p.m. UTC | #8
On Fri, Feb 02, 2024 at 03:56:52PM +0000, Jon Hunter wrote:
> 
> On 02/02/2024 14:35, Jason Gunthorpe wrote:
> > On Fri, Feb 02, 2024 at 10:40:36AM +0000, Jon Hunter wrote:
> > 
> > > > But, what is the actual log output you see, is it -EEXIST?
> > > 
> > > I see ...
> > > 
> > >   ERR KERN host1x drm: iommu configuration for device failed with -ENOENT
> > 
> > So that shouldn't happen in you case as far as I can tell, the device
> > is properly added, link->kobj should be fine and ENOENT shouldn't
> > happen.
> > 
> > > > If it is coming and going is it a race of some kind?
> > > 
> > > It is consistent without the above. However, I did not think that the
> > > above change would change the returning on -ENOENT? I will add more
> > > debug.
> > 
> > I do not think it can either
> > 
> > Still wonder if there is some odd race..
> > 
> > Let me know when you figure out what is happening - I think there is
> > some bug here it is not just a harmless warning.
> 
> 
> Yes looks like a race of some sort. Adding a bit of debug also makes the
> issue go away so difficult to see what is happening.

I'm wondering if it is racing with iommu driver probing? I looked but
didn't notice anything obviously wrong there that would cause this
though.

Though, it shouldn't be racing with self-removal of the device it just
added, that would be crazy???

Jason
Jason Gunthorpe Feb. 7, 2024, 1:05 p.m. UTC | #9
On Fri, Feb 02, 2024 at 12:15:40PM -0400, Jason Gunthorpe wrote:

> > Yes looks like a race of some sort. Adding a bit of debug also makes the
> > issue go away so difficult to see what is happening.
> 
> I'm wondering if it is racing with iommu driver probing? I looked but
> didn't notice anything obviously wrong there that would cause this
> though.

Oh there is at least one racy thing here..

The of_xlate hackjob is out of order and is racy if you have multiple instances:

struct tegra_smmu *tegra_smmu_probe(struct device *dev,
				    const struct tegra_smmu_soc *soc,
				    struct tegra_mc *mc)
{
	/*
	 * This is a bit of a hack. Ideally we'd want to simply return this
	 * value. However iommu_device_register() will attempt to add
	 * all devices to the IOMMU before we get that far. In order
	 * not to rely on global variables to track the IOMMU instance, we
	 * set it here so that it can be looked up from the .probe_device()
	 * callback via the IOMMU device's .drvdata field.
	 */
	mc->smmu = smmu;
         ^^^^^^^^^^^^^
           After this of_xlate and probe_device will succeed

        [...]
       	err = iommu_device_sysfs_add(&smmu->iommu, dev, NULL, dev_name(dev));               
	err = iommu_device_register(&smmu->iommu, &tegra_smmu_ops, dev);
        ^^^^^^^^^
        But the iommu instance is not fully initialized yet.

So:

static int tegra_smmu_of_xlate(struct device *dev,
			       struct of_phandle_args *args)
{
	struct platform_device *iommu_pdev = of_find_device_by_node(args->np);
	struct tegra_mc *mc = platform_get_drvdata(iommu_pdev);

	dev_iommu_priv_set(dev, mc->smmu);
                            ^^^^^^^^^^^^
                             Gets the partially initialized iommu
			     instance


static struct iommu_device *tegra_smmu_probe_device(struct device *dev)
{
	smmu = dev_iommu_priv_get(dev);
	if (!smmu)
		return ERR_PTR(-ENODEV);
               ^^^^^^^^^^^^^
                  Allows the driver to bind to a partially setup instance

ie if you have multiple instances of tegra-smmu and you manage to do
concurrent probe where the iommu instance is probing concurrently with
the failing device_add flow then you can a situation like you have
described where the sysfs is not fully setup.

Is this making sense to you? Add a sleep after the mc->smmu store and
confirm with printing?

I think this is all an insane design. I fixed this race and removed
all this hackery in my fwspec removal series. There the iommu instance
only ever comes out of the locked list that iommu_device_register()
populates and drivers have a safe and simple flow.

Maybe just moving the store to mc->smmu later would improve it? I
didn't look closely..

Jason
diff mbox series

Patch

diff --git a/drivers/gpu/host1x/bus.c b/drivers/gpu/host1x/bus.c
index 84d042796d2e66..61214d35cadc34 100644
--- a/drivers/gpu/host1x/bus.c
+++ b/drivers/gpu/host1x/bus.c
@@ -458,8 +458,6 @@  static int host1x_device_add(struct host1x *host1x,
 	device->dev.bus = &host1x_bus_type;
 	device->dev.parent = host1x->dev;
 
-	of_dma_configure(&device->dev, host1x->dev->of_node, true);
-
 	device->dev.dma_parms = &device->dma_parms;
 	dma_set_max_seg_size(&device->dev, UINT_MAX);