diff mbox

[1/2] drm/nouveau/bar/gf100: fix hang when calling ->fini() before ->init()

Message ID be610a7596ad8fee7da092161888c532c2eb2908.1512411775.git.guillaume.tucker@collabora.com (mailing list archive)
State New, archived
Headers show

Commit Message

Guillaume Tucker Dec. 4, 2017, 6:37 p.m. UTC
If the firmware fails to load then ->fini() will be called before the
device has been initialised, causing the kernel to hang while trying
to write to a register.  Add a test in ->fini() to avoid this issue.

This fixes a kernel hang on tegra124.

Fixes: b17de35a2ebbe ("drm/nouveau/bar: implement bar1 teardown")
Signed-off-by: Guillaume Tucker <guillaume.tucker@collabora.com>
CC: Ben Skeggs <bskeggs@redhat.com>
---
 drivers/gpu/drm/nouveau/nvkm/subdev/bar/gf100.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Comments

Jon Hunter Dec. 5, 2017, 2:30 p.m. UTC | #1
On 04/12/17 18:37, Guillaume Tucker wrote:
> If the firmware fails to load then ->fini() will be called before the
> device has been initialised, causing the kernel to hang while trying
> to write to a register.  Add a test in ->fini() to avoid this issue.
> 
> This fixes a kernel hang on tegra124.
> 
> Fixes: b17de35a2ebbe ("drm/nouveau/bar: implement bar1 teardown")
> Signed-off-by: Guillaume Tucker <guillaume.tucker@collabora.com>
> CC: Ben Skeggs <bskeggs@redhat.com>
> ---
>  drivers/gpu/drm/nouveau/nvkm/subdev/bar/gf100.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/bar/gf100.c b/drivers/gpu/drm/nouveau/nvkm/subdev/bar/gf100.c
> index a3ba7f50198b..95e2aba64aad 100644
> --- a/drivers/gpu/drm/nouveau/nvkm/subdev/bar/gf100.c
> +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/bar/gf100.c
> @@ -43,9 +43,12 @@ gf100_bar_bar1_wait(struct nvkm_bar *base)
>  }
>  
>  void
> -gf100_bar_bar1_fini(struct nvkm_bar *bar)
> +gf100_bar_bar1_fini(struct nvkm_bar *base)
>  {
> -	nvkm_mask(bar->subdev.device, 0x001704, 0x80000000, 0x00000000);
> +	struct nvkm_device *device = base->subdev.device;
> +
> +	if (base->subdev.oneinit)
> +		nvkm_mask(device, 0x001704, 0x80000000, 0x00000000);
>  }
>  
>  void

I have tested this and it works for me. Thanks for fixing this! Would be
good to get Ben's ACK, but you can have my ...

Tested-by: Jon Hunter <jonathanh@nvidia.com>

Cheers
Jon
Ben Skeggs Dec. 5, 2017, 6:32 p.m. UTC | #2
On Wed, Dec 6, 2017 at 12:30 AM, Jon Hunter <jonathanh@nvidia.com> wrote:

>
> On 04/12/17 18:37, Guillaume Tucker wrote:
> > If the firmware fails to load then ->fini() will be called before the
> > device has been initialised, causing the kernel to hang while trying
> > to write to a register.  Add a test in ->fini() to avoid this issue.
> >
> > This fixes a kernel hang on tegra124.
> >
> > Fixes: b17de35a2ebbe ("drm/nouveau/bar: implement bar1 teardown")
> > Signed-off-by: Guillaume Tucker <guillaume.tucker@collabora.com>
> > CC: Ben Skeggs <bskeggs@redhat.com>
> > ---
> >  drivers/gpu/drm/nouveau/nvkm/subdev/bar/gf100.c | 7 +++++--
> >  1 file changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/bar/gf100.c
> b/drivers/gpu/drm/nouveau/nvkm/subdev/bar/gf100.c
> > index a3ba7f50198b..95e2aba64aad 100644
> > --- a/drivers/gpu/drm/nouveau/nvkm/subdev/bar/gf100.c
> > +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/bar/gf100.c
> > @@ -43,9 +43,12 @@ gf100_bar_bar1_wait(struct nvkm_bar *base)
> >  }
> >
> >  void
> > -gf100_bar_bar1_fini(struct nvkm_bar *bar)
> > +gf100_bar_bar1_fini(struct nvkm_bar *base)
> >  {
> > -     nvkm_mask(bar->subdev.device, 0x001704, 0x80000000, 0x00000000);
> > +     struct nvkm_device *device = base->subdev.device;
> > +
> > +     if (base->subdev.oneinit)
> > +             nvkm_mask(device, 0x001704, 0x80000000, 0x00000000);
> >  }
> >
> >  void
>
> I have tested this and it works for me. Thanks for fixing this! Would be
> good to get Ben's ACK, but you can have my ...
>
I'd love to get a good explanation as to why it hangs without this change,
as, on the surface, it's not immediately obvious as to why it's hanging.

Thanks,
Ben.


>
> Tested-by: Jon Hunter <jonathanh@nvidia.com>
>
> Cheers
> Jon
>
> --
> nvpublic
>
Guillaume Tucker Dec. 6, 2017, 9:22 a.m. UTC | #3
On 05/12/17 18:32, Ben Skeggs wrote:
> On Wed, Dec 6, 2017 at 12:30 AM, Jon Hunter <jonathanh@nvidia.com> wrote:
>
>>
>> On 04/12/17 18:37, Guillaume Tucker wrote:
>>> If the firmware fails to load then ->fini() will be called before the
>>> device has been initialised, causing the kernel to hang while trying
>>> to write to a register.  Add a test in ->fini() to avoid this issue.
>>>
>>> This fixes a kernel hang on tegra124.
>>>
>>> Fixes: b17de35a2ebbe ("drm/nouveau/bar: implement bar1 teardown")
>>> Signed-off-by: Guillaume Tucker <guillaume.tucker@collabora.com>
>>> CC: Ben Skeggs <bskeggs@redhat.com>
>>> ---
>>>  drivers/gpu/drm/nouveau/nvkm/subdev/bar/gf100.c | 7 +++++--
>>>  1 file changed, 5 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/bar/gf100.c
>> b/drivers/gpu/drm/nouveau/nvkm/subdev/bar/gf100.c
>>> index a3ba7f50198b..95e2aba64aad 100644
>>> --- a/drivers/gpu/drm/nouveau/nvkm/subdev/bar/gf100.c
>>> +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/bar/gf100.c
>>> @@ -43,9 +43,12 @@ gf100_bar_bar1_wait(struct nvkm_bar *base)
>>>  }
>>>
>>>  void
>>> -gf100_bar_bar1_fini(struct nvkm_bar *bar)
>>> +gf100_bar_bar1_fini(struct nvkm_bar *base)
>>>  {
>>> -     nvkm_mask(bar->subdev.device, 0x001704, 0x80000000, 0x00000000);
>>> +     struct nvkm_device *device = base->subdev.device;
>>> +
>>> +     if (base->subdev.oneinit)
>>> +             nvkm_mask(device, 0x001704, 0x80000000, 0x00000000);
>>>  }
>>>
>>>  void
>>
>> I have tested this and it works for me. Thanks for fixing this! Would be
>> good to get Ben's ACK, but you can have my ...
>>
> I'd love to get a good explanation as to why it hangs without this change,
> as, on the surface, it's not immediately obvious as to why it's hanging.

To be fair I'm not entirely sure either why this causes a hang, I
haven't read the TRM...  The iomem has been mapped at this point,
so accessing the register should work.  One clue is when you look
at _bar1_init(), the 0x1704 register is initialised with
some (device instance?) memory address.  So it's possible that
the hardware does something special when you set this to 0 as in
_bar1_fini(), which may fail in particular if it was previously
not initialised with a valid address.

This is merely guesswork, would be interested to find out the
real explanation though.

>> Tested-by: Jon Hunter <jonathanh@nvidia.com>

Thanks!

Guillaume
Jon Hunter Dec. 6, 2017, 5:18 p.m. UTC | #4
On 06/12/17 09:22, Guillaume Tucker wrote:
> On 05/12/17 18:32, Ben Skeggs wrote:
>> On Wed, Dec 6, 2017 at 12:30 AM, Jon Hunter <jonathanh@nvidia.com> wrote:
>>
>>>
>>> On 04/12/17 18:37, Guillaume Tucker wrote:
>>>> If the firmware fails to load then ->fini() will be called before the
>>>> device has been initialised, causing the kernel to hang while trying
>>>> to write to a register.  Add a test in ->fini() to avoid this issue.
>>>>
>>>> This fixes a kernel hang on tegra124.
>>>>
>>>> Fixes: b17de35a2ebbe ("drm/nouveau/bar: implement bar1 teardown")
>>>> Signed-off-by: Guillaume Tucker <guillaume.tucker@collabora.com>
>>>> CC: Ben Skeggs <bskeggs@redhat.com>
>>>> ---
>>>>  drivers/gpu/drm/nouveau/nvkm/subdev/bar/gf100.c | 7 +++++--
>>>>  1 file changed, 5 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/bar/gf100.c
>>> b/drivers/gpu/drm/nouveau/nvkm/subdev/bar/gf100.c
>>>> index a3ba7f50198b..95e2aba64aad 100644
>>>> --- a/drivers/gpu/drm/nouveau/nvkm/subdev/bar/gf100.c
>>>> +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/bar/gf100.c
>>>> @@ -43,9 +43,12 @@ gf100_bar_bar1_wait(struct nvkm_bar *base)
>>>>  }
>>>>
>>>>  void
>>>> -gf100_bar_bar1_fini(struct nvkm_bar *bar)
>>>> +gf100_bar_bar1_fini(struct nvkm_bar *base)
>>>>  {
>>>> -     nvkm_mask(bar->subdev.device, 0x001704, 0x80000000, 0x00000000);
>>>> +     struct nvkm_device *device = base->subdev.device;
>>>> +
>>>> +     if (base->subdev.oneinit)
>>>> +             nvkm_mask(device, 0x001704, 0x80000000, 0x00000000);
>>>>  }
>>>>
>>>>  void
>>>
>>> I have tested this and it works for me. Thanks for fixing this! Would be
>>> good to get Ben's ACK, but you can have my ...
>>>
>> I'd love to get a good explanation as to why it hangs without this
>> change,
>> as, on the surface, it's not immediately obvious as to why it's hanging.
> 
> To be fair I'm not entirely sure either why this causes a hang, I
> haven't read the TRM...  The iomem has been mapped at this point,
> so accessing the register should work.  One clue is when you look
> at _bar1_init(), the 0x1704 register is initialised with
> some (device instance?) memory address.  So it's possible that
> the hardware does something special when you set this to 0 as in
> _bar1_fini(), which may fail in particular if it was previously
> not initialised with a valid address.
> 
> This is merely guesswork, would be interested to find out the
> real explanation though.

OK, well that's no good. It's a good pointer, but we need to make sure
we understand the root of this hang. I will see if I have sometime to
dig into this further, maybe next week.

Cheers
Jon
diff mbox

Patch

diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/bar/gf100.c b/drivers/gpu/drm/nouveau/nvkm/subdev/bar/gf100.c
index a3ba7f50198b..95e2aba64aad 100644
--- a/drivers/gpu/drm/nouveau/nvkm/subdev/bar/gf100.c
+++ b/drivers/gpu/drm/nouveau/nvkm/subdev/bar/gf100.c
@@ -43,9 +43,12 @@  gf100_bar_bar1_wait(struct nvkm_bar *base)
 }
 
 void
-gf100_bar_bar1_fini(struct nvkm_bar *bar)
+gf100_bar_bar1_fini(struct nvkm_bar *base)
 {
-	nvkm_mask(bar->subdev.device, 0x001704, 0x80000000, 0x00000000);
+	struct nvkm_device *device = base->subdev.device;
+
+	if (base->subdev.oneinit)
+		nvkm_mask(device, 0x001704, 0x80000000, 0x00000000);
 }
 
 void