mbox series

[0/3] Use implicit kref infra

Message ID 20200902010645.26943-1-luben.tuikov@amd.com (mailing list archive)
Headers show
Series Use implicit kref infra | expand

Message

Luben Tuikov Sept. 2, 2020, 1:06 a.m. UTC
Use the implicit kref infrastructure to free the container
struct amdgpu_device, container of struct drm_device.

First, in drm_dev_register(), do not indiscriminately warn
when a DRM driver hasn't opted for managed.final_kfree,
but instead check if the driver has provided its own
"release" function callback in the DRM driver structure.
If that is the case, no warning.

Remove drmm_add_final_kfree(). We take care of that, in the
kref "release" callback when all refs are down to 0, via
drm_dev_put(), i.e. the free is implicit.

Remove superfluous NULL check, since the DRM device to be
suspended always exists, so long as the underlying PCI and
DRM devices exist.

Luben Tuikov (3):
  drm: No warn for drivers who provide release
  drm/amdgpu: Remove drmm final free
  drm/amdgpu: Remove superfluous NULL check

 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 2 --
 drivers/gpu/drm/drm_drv.c                  | 3 ++-
 3 files changed, 2 insertions(+), 6 deletions(-)

Comments

Pan, Xinhui Sept. 2, 2020, 1:42 a.m. UTC | #1
If you take a look at the below function, you should not use driver's release to free adev. As dev is embedded in adev.

 809 static void drm_dev_release(struct kref *ref)
 810 {
 811         struct drm_device *dev = container_of(ref, struct drm_device, ref);
 812        
 813         if (dev->driver->release)
 814                 dev->driver->release(dev);
 815 
 816         drm_managed_release(dev);
 817 
 818         kfree(dev->managed.final_kfree);
 819 }

You have to make another change something like
diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index 13068fdf4331..2aabd2b4c63b 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -815,7 +815,8 @@ static void drm_dev_release(struct kref *ref)
 
        drm_managed_release(dev);
 
-       kfree(dev->managed.final_kfree);
+       if (dev->driver->final_release)
+               dev->driver->final_release(dev);
 }

And in the final_release callback we free the dev. But that is a little complex now. so I prefer still using final_kfree.
Of course we can do some cleanup work in the driver's release callback. BUT no kfree.

-----原始邮件-----
发件人: "Tuikov, Luben" <Luben.Tuikov@amd.com>
日期: 2020年9月2日 星期三 09:07
收件人: "amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>, "dri-devel@lists.freedesktop.org" <dri-devel@lists.freedesktop.org>
抄送: "Deucher, Alexander" <Alexander.Deucher@amd.com>, Daniel Vetter <daniel@ffwll.ch>, "Pan, Xinhui" <Xinhui.Pan@amd.com>, "Tuikov, Luben" <Luben.Tuikov@amd.com>
主题: [PATCH 0/3] Use implicit kref infra

    Use the implicit kref infrastructure to free the container
    struct amdgpu_device, container of struct drm_device.
    
    First, in drm_dev_register(), do not indiscriminately warn
    when a DRM driver hasn't opted for managed.final_kfree,
    but instead check if the driver has provided its own
    "release" function callback in the DRM driver structure.
    If that is the case, no warning.
    
    Remove drmm_add_final_kfree(). We take care of that, in the
    kref "release" callback when all refs are down to 0, via
    drm_dev_put(), i.e. the free is implicit.
    
    Remove superfluous NULL check, since the DRM device to be
    suspended always exists, so long as the underlying PCI and
    DRM devices exist.
    
    Luben Tuikov (3):
      drm: No warn for drivers who provide release
      drm/amdgpu: Remove drmm final free
      drm/amdgpu: Remove superfluous NULL check
    
     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ---
     drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 2 --
     drivers/gpu/drm/drm_drv.c                  | 3 ++-
     3 files changed, 2 insertions(+), 6 deletions(-)
    
    -- 
    2.28.0.394.ge197136389
Luben Tuikov Sept. 2, 2020, 3:46 a.m. UTC | #2
On 2020-09-01 21:42, Pan, Xinhui wrote:
> If you take a look at the below function, you should not use driver's release to free adev. As dev is embedded in adev.

Do you mean "look at the function below", using "below" as an adverb?
"below" is not an adjective.

I know dev is embedded in adev--I did that patchset.

> 
>  809 static void drm_dev_release(struct kref *ref)
>  810 {
>  811         struct drm_device *dev = container_of(ref, struct drm_device, ref);
>  812        
>  813         if (dev->driver->release)
>  814                 dev->driver->release(dev);
>  815 
>  816         drm_managed_release(dev);
>  817 
>  818         kfree(dev->managed.final_kfree);
>  819 }

That's simple--this comes from change c6603c740e0e3
and it should be reverted. Simple as that.

The version before this change was absolutely correct:

static void drm_dev_release(struct kref *ref)
{
	if (dev->driver->release)
		dev->driver->release(dev);
	else
		drm_dev_fini(dev);
}

Meaning, "the kref is now 0"--> if the driver
has a release, call it, else use our own.
But note that nothing can be assumed after this point,
about the existence of "dev".

It is exactly because struct drm_device is statically
embedded into a container, struct amdgpu_device,
that this change above should be reverted.

This is very similar to how fops has open/release
but no close. That is, the "release" is called
only when the last kref is released, i.e. when
kref goes from non-zero to zero.

This uses the kref infrastructure which has been
around for about 20 years in the Linux kernel.

I suggest reading the comments
in drm_dev.c mostly, "DOC: driver instance overview"
starting at line 240 onwards. This is right above
drm_put_dev(). There is actually an example of a driver
in the comment. Also the comment to drm_dev_init().

Now, take a look at this:

/**
 * drm_dev_put - Drop reference of a DRM device
 * @dev: device to drop reference of or NULL
 *
 * This decreases the ref-count of @dev by one. The device is destroyed if the
 * ref-count drops to zero.
 */
void drm_dev_put(struct drm_device *dev)
{
        if (dev)
                kref_put(&dev->ref, drm_dev_release);
}
EXPORT_SYMBOL(drm_dev_put);

Two things:

1. It is us, who kzalloc the amdgpu device, which contains
the drm_device (you'll see this discussed in the reading
material I pointed to above). We do this because we're
probing the PCI device whether we'll work it it or not.

2. Using the kref infrastructure, when the ref goes to 0,
drm_dev_release is called. And here's the KEY:
Because WE allocated the container, we should free it--after the release
method is called, DRM cannot assume anything about the drm
device or the container. The "release" method is final.

We allocate, we free. And we free only when the ref goes to 0.

DRM can, in due time, "free" itself of the DRM device and stop
having knowledge of it--that's fine, but as long as the ref
is not 0, the amdgpu device and thus the contained DRM device,
cannot be freed.

> 
> You have to make another change something like
> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> index 13068fdf4331..2aabd2b4c63b 100644
> --- a/drivers/gpu/drm/drm_drv.c
> +++ b/drivers/gpu/drm/drm_drv.c
> @@ -815,7 +815,8 @@ static void drm_dev_release(struct kref *ref)
>  
>         drm_managed_release(dev);
>  
> -       kfree(dev->managed.final_kfree);
> +       if (dev->driver->final_release)
> +               dev->driver->final_release(dev);
>  }

No. What's this?
There is no such thing as "final" release, nor is there a "partial" release.
When the kref goes to 0, the device disappears. Simple.
If someone is using it, they should kref-get it, and when they're
done with it, they should kref-put it.

The whole point is that this is done implicitly, via the kref infrastructure.
drm_dev_init() which we call in our PCI probe function, sets the kref to 1--all
as per the documentation I pointed you to above.

Another point is that we can do some other stuff in the release
function, notify someone, write some registers, free memory we use
for that PCI device, etc.

If the "managed resources" infrastructure wants to stay, it should hook
itself into drm_dev_fini() and into drm_dev_init() or drm_dev_register().
It shouldn't have to be so out-of-place like in patch 2/3 of this series,
where the drmm_add_final_kfree() is smack-dab in the middle of our PCI
discovery function, surrounded on top and bottom by drm_dev_init()
and drm_dev_register(). The "managed resources" infra should be non-invasive
and drivers shouldn't have to change to use it--it should be invisible to them.
Then our kref would just work.

> 
> And in the final_release callback we free the dev. But that is a little complex now. so I prefer still using final_kfree.
> Of course we can do some cleanup work in the driver's release callback. BUT no kfree.

No! No final_kfree. It's a hack.

Read the documentation in drm_drv.c I noted above--it lays out how this happens. Reading is required.

Regards,
Luben


> 
> -----原始邮件-----
> 发件人: "Tuikov, Luben" <Luben.Tuikov@amd.com>
> 日期: 2020年9月2日 星期三 09:07
> 收件人: "amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>, "dri-devel@lists.freedesktop.org" <dri-devel@lists.freedesktop.org>
> 抄送: "Deucher, Alexander" <Alexander.Deucher@amd.com>, Daniel Vetter <daniel@ffwll.ch>, "Pan, Xinhui" <Xinhui.Pan@amd.com>, "Tuikov, Luben" <Luben.Tuikov@amd.com>
> 主题: [PATCH 0/3] Use implicit kref infra
> 
>     Use the implicit kref infrastructure to free the container
>     struct amdgpu_device, container of struct drm_device.
>     
>     First, in drm_dev_register(), do not indiscriminately warn
>     when a DRM driver hasn't opted for managed.final_kfree,
>     but instead check if the driver has provided its own
>     "release" function callback in the DRM driver structure.
>     If that is the case, no warning.
>     
>     Remove drmm_add_final_kfree(). We take care of that, in the
>     kref "release" callback when all refs are down to 0, via
>     drm_dev_put(), i.e. the free is implicit.
>     
>     Remove superfluous NULL check, since the DRM device to be
>     suspended always exists, so long as the underlying PCI and
>     DRM devices exist.
>     
>     Luben Tuikov (3):
>       drm: No warn for drivers who provide release
>       drm/amdgpu: Remove drmm final free
>       drm/amdgpu: Remove superfluous NULL check
>     
>      drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ---
>      drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 2 --
>      drivers/gpu/drm/drm_drv.c                  | 3 ++-
>      3 files changed, 2 insertions(+), 6 deletions(-)
>     
>     -- 
>     2.28.0.394.ge197136389
>     
>     
>
Pan, Xinhui Sept. 2, 2020, 4:43 a.m. UTC | #3
> 2020年9月2日 11:46,Tuikov, Luben <Luben.Tuikov@amd.com> 写道:
> 
> On 2020-09-01 21:42, Pan, Xinhui wrote:
>> If you take a look at the below function, you should not use driver's release to free adev. As dev is embedded in adev.
> 
> Do you mean "look at the function below", using "below" as an adverb?
> "below" is not an adjective.
> 
> I know dev is embedded in adev--I did that patchset.
> 
>> 
>> 809 static void drm_dev_release(struct kref *ref)
>> 810 {
>> 811         struct drm_device *dev = container_of(ref, struct drm_device, ref);
>> 812        
>> 813         if (dev->driver->release)
>> 814                 dev->driver->release(dev);
>> 815 
>> 816         drm_managed_release(dev);
>> 817 
>> 818         kfree(dev->managed.final_kfree);
>> 819 }
> 
> That's simple--this comes from change c6603c740e0e3
> and it should be reverted. Simple as that.
> 
> The version before this change was absolutely correct:
> 
> static void drm_dev_release(struct kref *ref)
> {
> 	if (dev->driver->release)
> 		dev->driver->release(dev);
> 	else
> 		drm_dev_fini(dev);
> }
> 
> Meaning, "the kref is now 0"--> if the driver
> has a release, call it, else use our own.
> But note that nothing can be assumed after this point,
> about the existence of "dev".
> 
> It is exactly because struct drm_device is statically
> embedded into a container, struct amdgpu_device,
> that this change above should be reverted.
> 
> This is very similar to how fops has open/release
> but no close. That is, the "release" is called
> only when the last kref is released, i.e. when
> kref goes from non-zero to zero.
> 
> This uses the kref infrastructure which has been
> around for about 20 years in the Linux kernel.
> 
> I suggest reading the comments
> in drm_dev.c mostly, "DOC: driver instance overview"
> starting at line 240 onwards. This is right above
> drm_put_dev(). There is actually an example of a driver
> in the comment. Also the comment to drm_dev_init().
> 
> Now, take a look at this:
> 
> /**
> * drm_dev_put - Drop reference of a DRM device
> * @dev: device to drop reference of or NULL
> *
> * This decreases the ref-count of @dev by one. The device is destroyed if the
> * ref-count drops to zero.
> */
> void drm_dev_put(struct drm_device *dev)
> {
>        if (dev)
>                kref_put(&dev->ref, drm_dev_release);
> }
> EXPORT_SYMBOL(drm_dev_put);
> 
> Two things:
> 
> 1. It is us, who kzalloc the amdgpu device, which contains
> the drm_device (you'll see this discussed in the reading
> material I pointed to above). We do this because we're
> probing the PCI device whether we'll work it it or not.
> 

that is true.
My understanding of the drm core code is like something below.
struct B { 
	strcut A 
}
we initialize A firstly and initialize B in the end. But destroy B firstly and destory A in the end.
But yes, practice is more complex. 
if B has nothing to be destroyed. we can destory A directly, otherwise destroy B firstly.

in this case, we can do something below in our release()
//some cleanup work of B
drm_dev_fini(dev);//destroy A
kfree(adev)

> 2. Using the kref infrastructure, when the ref goes to 0,
> drm_dev_release is called. And here's the KEY:
> Because WE allocated the container, we should free it--after the release
> method is called, DRM cannot assume anything about the drm
> device or the container. The "release" method is final.
> 
> We allocate, we free. And we free only when the ref goes to 0.
> 
> DRM can, in due time, "free" itself of the DRM device and stop
> having knowledge of it--that's fine, but as long as the ref
> is not 0, the amdgpu device and thus the contained DRM device,
> cannot be freed.
> 
>> 
>> You have to make another change something like
>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>> index 13068fdf4331..2aabd2b4c63b 100644
>> --- a/drivers/gpu/drm/drm_drv.c
>> +++ b/drivers/gpu/drm/drm_drv.c
>> @@ -815,7 +815,8 @@ static void drm_dev_release(struct kref *ref)
>> 
>>        drm_managed_release(dev);
>> 
>> -       kfree(dev->managed.final_kfree);
>> +       if (dev->driver->final_release)
>> +               dev->driver->final_release(dev);
>> }
> 
> No. What's this?
> There is no such thing as "final" release, nor is there a "partial" release.
> When the kref goes to 0, the device disappears. Simple.
> If someone is using it, they should kref-get it, and when they're
> done with it, they should kref-put it.

I just take an example here. add another release in the end. then no one could touch us. IOW, final_release.


A destroy B by a callback, then A destroy itself. It assumes B just free its own resource.
but that makes trouble if some resource of A is allocated by B.
Because B must take care of these common resource shared between A and B.

yes, that logical is more complex. So I think we can revert drm_dev_release to its previous version.

> 
> The whole point is that this is done implicitly, via the kref infrastructure.
> drm_dev_init() which we call in our PCI probe function, sets the kref to 1--all
> as per the documentation I pointed you to above.

I am not taking about kref. what we are discussing is all about the release sequence.


> 
> Another point is that we can do some other stuff in the release
> function, notify someone, write some registers, free memory we use
> for that PCI device, etc.
> 
> If the "managed resources" infrastructure wants to stay, it should hook
> itself into drm_dev_fini() and into drm_dev_init() or drm_dev_register().
> It shouldn't have to be so out-of-place like in patch 2/3 of this series,
> where the drmm_add_final_kfree() is smack-dab in the middle of our PCI
> discovery function, surrounded on top and bottom by drm_dev_init()
> and drm_dev_register(). The "managed resources" infra should be non-invasive
> and drivers shouldn't have to change to use it--it should be invisible to them.
> Then our kref would just work.
> 
yep, that make sense. But you need more changes to fix this issue. this patchset is insufficient.


>> 
>> And in the final_release callback we free the dev. But that is a little complex now. so I prefer still using final_kfree.
>> Of course we can do some cleanup work in the driver's release callback. BUT no kfree.
> 
> No! No final_kfree. It's a hack.
> 
> Read the documentation in drm_drv.c I noted above--it lays out how this happens. Reading is required.
> 
> Regards,
> Luben
> 
> 
>> 
>> -----原始邮件-----
>> 发件人: "Tuikov, Luben" <Luben.Tuikov@amd.com>
>> 日期: 2020年9月2日 星期三 09:07
>> 收件人: "amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>, "dri-devel@lists.freedesktop.org" <dri-devel@lists.freedesktop.org>
>> 抄送: "Deucher, Alexander" <Alexander.Deucher@amd.com>, Daniel Vetter <daniel@ffwll.ch>, "Pan, Xinhui" <Xinhui.Pan@amd.com>, "Tuikov, Luben" <Luben.Tuikov@amd.com>
>> 主题: [PATCH 0/3] Use implicit kref infra
>> 
>>    Use the implicit kref infrastructure to free the container
>>    struct amdgpu_device, container of struct drm_device.
>> 
>>    First, in drm_dev_register(), do not indiscriminately warn
>>    when a DRM driver hasn't opted for managed.final_kfree,
>>    but instead check if the driver has provided its own
>>    "release" function callback in the DRM driver structure.
>>    If that is the case, no warning.
>> 
>>    Remove drmm_add_final_kfree(). We take care of that, in the
>>    kref "release" callback when all refs are down to 0, via
>>    drm_dev_put(), i.e. the free is implicit.
>> 
>>    Remove superfluous NULL check, since the DRM device to be
>>    suspended always exists, so long as the underlying PCI and
>>    DRM devices exist.
>> 
>>    Luben Tuikov (3):
>>      drm: No warn for drivers who provide release
>>      drm/amdgpu: Remove drmm final free
>>      drm/amdgpu: Remove superfluous NULL check
>> 
>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ---
>>     drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 2 --
>>     drivers/gpu/drm/drm_drv.c                  | 3 ++-
>>     3 files changed, 2 insertions(+), 6 deletions(-)
>> 
>>    -- 
>>    2.28.0.394.ge197136389
>> 
>> 
>> 
>
Daniel Vetter Sept. 2, 2020, 6:52 a.m. UTC | #4
On Tue, Sep 01, 2020 at 11:46:18PM -0400, Luben Tuikov wrote:
> On 2020-09-01 21:42, Pan, Xinhui wrote:
> > If you take a look at the below function, you should not use driver's release to free adev. As dev is embedded in adev.
> 
> Do you mean "look at the function below", using "below" as an adverb?
> "below" is not an adjective.
> 
> I know dev is embedded in adev--I did that patchset.
> 
> > 
> >  809 static void drm_dev_release(struct kref *ref)
> >  810 {
> >  811         struct drm_device *dev = container_of(ref, struct drm_device, ref);
> >  812        
> >  813         if (dev->driver->release)
> >  814                 dev->driver->release(dev);
> >  815 
> >  816         drm_managed_release(dev);
> >  817 
> >  818         kfree(dev->managed.final_kfree);
> >  819 }
> 
> That's simple--this comes from change c6603c740e0e3
> and it should be reverted. Simple as that.
> 
> The version before this change was absolutely correct:
> 
> static void drm_dev_release(struct kref *ref)
> {
> 	if (dev->driver->release)
> 		dev->driver->release(dev);
> 	else
> 		drm_dev_fini(dev);
> }
> 
> Meaning, "the kref is now 0"--> if the driver
> has a release, call it, else use our own.
> But note that nothing can be assumed after this point,
> about the existence of "dev".
> 
> It is exactly because struct drm_device is statically
> embedded into a container, struct amdgpu_device,
> that this change above should be reverted.
> 
> This is very similar to how fops has open/release
> but no close. That is, the "release" is called
> only when the last kref is released, i.e. when
> kref goes from non-zero to zero.
> 
> This uses the kref infrastructure which has been
> around for about 20 years in the Linux kernel.
> 
> I suggest reading the comments
> in drm_dev.c mostly, "DOC: driver instance overview"
> starting at line 240 onwards. This is right above
> drm_put_dev(). There is actually an example of a driver
> in the comment. Also the comment to drm_dev_init().
> 
> Now, take a look at this:
> 
> /**
>  * drm_dev_put - Drop reference of a DRM device
>  * @dev: device to drop reference of or NULL
>  *
>  * This decreases the ref-count of @dev by one. The device is destroyed if the
>  * ref-count drops to zero.
>  */
> void drm_dev_put(struct drm_device *dev)
> {
>         if (dev)
>                 kref_put(&dev->ref, drm_dev_release);
> }
> EXPORT_SYMBOL(drm_dev_put);
> 
> Two things:
> 
> 1. It is us, who kzalloc the amdgpu device, which contains
> the drm_device (you'll see this discussed in the reading
> material I pointed to above). We do this because we're
> probing the PCI device whether we'll work it it or not.
> 
> 2. Using the kref infrastructure, when the ref goes to 0,
> drm_dev_release is called. And here's the KEY:
> Because WE allocated the container, we should free it--after the release
> method is called, DRM cannot assume anything about the drm
> device or the container. The "release" method is final.
> 
> We allocate, we free. And we free only when the ref goes to 0.
> 
> DRM can, in due time, "free" itself of the DRM device and stop
> having knowledge of it--that's fine, but as long as the ref
> is not 0, the amdgpu device and thus the contained DRM device,
> cannot be freed.
> 
> > 
> > You have to make another change something like
> > diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > index 13068fdf4331..2aabd2b4c63b 100644
> > --- a/drivers/gpu/drm/drm_drv.c
> > +++ b/drivers/gpu/drm/drm_drv.c
> > @@ -815,7 +815,8 @@ static void drm_dev_release(struct kref *ref)
> >  
> >         drm_managed_release(dev);
> >  
> > -       kfree(dev->managed.final_kfree);
> > +       if (dev->driver->final_release)
> > +               dev->driver->final_release(dev);
> >  }
> 
> No. What's this?
> There is no such thing as "final" release, nor is there a "partial" release.
> When the kref goes to 0, the device disappears. Simple.
> If someone is using it, they should kref-get it, and when they're
> done with it, they should kref-put it.
> 
> The whole point is that this is done implicitly, via the kref infrastructure.
> drm_dev_init() which we call in our PCI probe function, sets the kref to 1--all
> as per the documentation I pointed you to above.
> 
> Another point is that we can do some other stuff in the release
> function, notify someone, write some registers, free memory we use
> for that PCI device, etc.
> 
> If the "managed resources" infrastructure wants to stay, it should hook
> itself into drm_dev_fini() and into drm_dev_init() or drm_dev_register().
> It shouldn't have to be so out-of-place like in patch 2/3 of this series,
> where the drmm_add_final_kfree() is smack-dab in the middle of our PCI
> discovery function, surrounded on top and bottom by drm_dev_init()
> and drm_dev_register(). The "managed resources" infra should be non-invasive
> and drivers shouldn't have to change to use it--it should be invisible to them.
> Then our kref would just work.

Unfortunately some part of that drm managed series is stuck still waiting
for review (I guess I need to resubmit), but with that the docs would tell
you to use devm_drm_dev_alloc instead of hand-rolling all this.

Definitely not any of the ideas proposed in this discussion or patches :-)

I'll cc you on that series when I send it out again.

Cheers, Daniel
> 
> > 
> > And in the final_release callback we free the dev. But that is a little complex now. so I prefer still using final_kfree.
> > Of course we can do some cleanup work in the driver's release callback. BUT no kfree.
> 
> No! No final_kfree. It's a hack.
> 
> Read the documentation in drm_drv.c I noted above--it lays out how this happens. Reading is required.
> 
> Regards,
> Luben
> 
> 
> > 
> > -----原始邮件-----
> > 发件人: "Tuikov, Luben" <Luben.Tuikov@amd.com>
> > 日期: 2020年9月2日 星期三 09:07
> > 收件人: "amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>, "dri-devel@lists.freedesktop.org" <dri-devel@lists.freedesktop.org>
> > 抄送: "Deucher, Alexander" <Alexander.Deucher@amd.com>, Daniel Vetter <daniel@ffwll.ch>, "Pan, Xinhui" <Xinhui.Pan@amd.com>, "Tuikov, Luben" <Luben.Tuikov@amd.com>
> > 主题: [PATCH 0/3] Use implicit kref infra
> > 
> >     Use the implicit kref infrastructure to free the container
> >     struct amdgpu_device, container of struct drm_device.
> >     
> >     First, in drm_dev_register(), do not indiscriminately warn
> >     when a DRM driver hasn't opted for managed.final_kfree,
> >     but instead check if the driver has provided its own
> >     "release" function callback in the DRM driver structure.
> >     If that is the case, no warning.
> >     
> >     Remove drmm_add_final_kfree(). We take care of that, in the
> >     kref "release" callback when all refs are down to 0, via
> >     drm_dev_put(), i.e. the free is implicit.
> >     
> >     Remove superfluous NULL check, since the DRM device to be
> >     suspended always exists, so long as the underlying PCI and
> >     DRM devices exist.
> >     
> >     Luben Tuikov (3):
> >       drm: No warn for drivers who provide release
> >       drm/amdgpu: Remove drmm final free
> >       drm/amdgpu: Remove superfluous NULL check
> >     
> >      drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ---
> >      drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 2 --
> >      drivers/gpu/drm/drm_drv.c                  | 3 ++-
> >      3 files changed, 2 insertions(+), 6 deletions(-)
> >     
> >     -- 
> >     2.28.0.394.ge197136389
> >     
> >     
> > 
>
Luben Tuikov Sept. 2, 2020, 2:50 p.m. UTC | #5
On 2020-09-02 00:43, Pan, Xinhui wrote:
> 
> 
>> 2020年9月2日 11:46,Tuikov, Luben <Luben.Tuikov@amd.com> 写道:
>>
>> On 2020-09-01 21:42, Pan, Xinhui wrote:
>>> If you take a look at the below function, you should not use driver's release to free adev. As dev is embedded in adev.
>>
>> Do you mean "look at the function below", using "below" as an adverb?
>> "below" is not an adjective.
>>
>> I know dev is embedded in adev--I did that patchset.
>>
>>>
>>> 809 static void drm_dev_release(struct kref *ref)
>>> 810 {
>>> 811         struct drm_device *dev = container_of(ref, struct drm_device, ref);
>>> 812        
>>> 813         if (dev->driver->release)
>>> 814                 dev->driver->release(dev);
>>> 815 
>>> 816         drm_managed_release(dev);
>>> 817 
>>> 818         kfree(dev->managed.final_kfree);
>>> 819 }
>>
>> That's simple--this comes from change c6603c740e0e3
>> and it should be reverted. Simple as that.
>>
>> The version before this change was absolutely correct:
>>
>> static void drm_dev_release(struct kref *ref)
>> {
>> 	if (dev->driver->release)
>> 		dev->driver->release(dev);
>> 	else
>> 		drm_dev_fini(dev);
>> }
>>
>> Meaning, "the kref is now 0"--> if the driver
>> has a release, call it, else use our own.
>> But note that nothing can be assumed after this point,
>> about the existence of "dev".
>>
>> It is exactly because struct drm_device is statically
>> embedded into a container, struct amdgpu_device,
>> that this change above should be reverted.
>>
>> This is very similar to how fops has open/release
>> but no close. That is, the "release" is called
>> only when the last kref is released, i.e. when
>> kref goes from non-zero to zero.
>>
>> This uses the kref infrastructure which has been
>> around for about 20 years in the Linux kernel.
>>
>> I suggest reading the comments
>> in drm_dev.c mostly, "DOC: driver instance overview"
>> starting at line 240 onwards. This is right above
>> drm_put_dev(). There is actually an example of a driver
>> in the comment. Also the comment to drm_dev_init().
>>
>> Now, take a look at this:
>>
>> /**
>> * drm_dev_put - Drop reference of a DRM device
>> * @dev: device to drop reference of or NULL
>> *
>> * This decreases the ref-count of @dev by one. The device is destroyed if the
>> * ref-count drops to zero.
>> */
>> void drm_dev_put(struct drm_device *dev)
>> {
>>        if (dev)
>>                kref_put(&dev->ref, drm_dev_release);
>> }
>> EXPORT_SYMBOL(drm_dev_put);
>>
>> Two things:
>>
>> 1. It is us, who kzalloc the amdgpu device, which contains
>> the drm_device (you'll see this discussed in the reading
>> material I pointed to above). We do this because we're
>> probing the PCI device whether we'll work it it or not.
>>
> 
> that is true.

Of course it's true--good morning!

> My understanding of the drm core code is like something below.

Let me stop you right there--just read the documentation I pointed
to you at.

> struct B { 
> 	strcut A 
> }
> we initialize A firstly and initialize B in the end. But destroy B firstly and destory A in the end.

No!
B, which is the amdgpu_device struct "exists" before A, which is the DRM struct.
This is why DRM recommends to _embed_ it into the driver's own device struct,
as the documentation I pointed you to at.

A, the DRM struct, is an abstraction, and is "created" last, and
"undone" first, since the DRM layer may finish with a device, but
the device may still exists with the driver and as well as with PCI.
This is very VERY common, with kernels, devices, device abstractions,
device layers: DRM dev <-- amdgpu dev <-- PCI dev.

> But yes, practice is more complex. 
> if B has nothing to be destroyed. we can destory A directly, otherwise destroy B firstly.

I'm sorry, that doesn't make sense. There is no such thing as "destroy directly"
and "otherwise"--this is absolutely not how this works.

A good architecture doesn't have if-then-else--it's just a pure single-branch path.

> 
> in this case, we can do something below in our release()
> //some cleanup work of B
> drm_dev_fini(dev);//destroy A
> kfree(adev)
> 
>> 2. Using the kref infrastructure, when the ref goes to 0,
>> drm_dev_release is called. And here's the KEY:
>> Because WE allocated the container, we should free it--after the release
>> method is called, DRM cannot assume anything about the drm
>> device or the container. The "release" method is final.
>>
>> We allocate, we free. And we free only when the ref goes to 0.
>>
>> DRM can, in due time, "free" itself of the DRM device and stop
>> having knowledge of it--that's fine, but as long as the ref
>> is not 0, the amdgpu device and thus the contained DRM device,
>> cannot be freed.
>>
>>>
>>> You have to make another change something like
>>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>>> index 13068fdf4331..2aabd2b4c63b 100644
>>> --- a/drivers/gpu/drm/drm_drv.c
>>> +++ b/drivers/gpu/drm/drm_drv.c
>>> @@ -815,7 +815,8 @@ static void drm_dev_release(struct kref *ref)
>>>
>>>        drm_managed_release(dev);
>>>
>>> -       kfree(dev->managed.final_kfree);
>>> +       if (dev->driver->final_release)
>>> +               dev->driver->final_release(dev);
>>> }
>>
>> No. What's this?
>> There is no such thing as "final" release, nor is there a "partial" release.
>> When the kref goes to 0, the device disappears. Simple.
>> If someone is using it, they should kref-get it, and when they're
>> done with it, they should kref-put it.
> 
> I just take an example here. add another release in the end. then no one could touch us. IOW, final_release.

No, that's horrible.

> 
> 
> A destroy B by a callback, then A destroy itself. It assumes B just free its own resource.
> but that makes trouble if some resource of A is allocated by B.
> Because B must take care of these common resource shared between A and B.

No, that's horrible.

> 
> yes, that logical is more complex. So I think we can revert drm_dev_release to its previous version.

drm_dev_release() in its original form, was pure:

static void drm_dev_release(struct kref *ref)
{
	if (dev->driver->release)
		dev->driver->release(dev);
	else
		drm_dev_fini(dev);
}

> 
>>
>> The whole point is that this is done implicitly, via the kref infrastructure.
>> drm_dev_init() which we call in our PCI probe function, sets the kref to 1--all
>> as per the documentation I pointed you to above.
> 
> I am not taking about kref. what we are discussing is all about the release sequence.

You need to understand how the kref infrastructure works in the kernel. I've said
it a million times: it's implicit. The "release sequence" as you like to call it,
is implicit in the kref infrastructure.

> 
> 
>>
>> Another point is that we can do some other stuff in the release
>> function, notify someone, write some registers, free memory we use
>> for that PCI device, etc.
>>
>> If the "managed resources" infrastructure wants to stay, it should hook
>> itself into drm_dev_fini() and into drm_dev_init() or drm_dev_register().
>> It shouldn't have to be so out-of-place like in patch 2/3 of this series,
>> where the drmm_add_final_kfree() is smack-dab in the middle of our PCI
>> discovery function, surrounded on top and bottom by drm_dev_init()
>> and drm_dev_register(). The "managed resources" infra should be non-invasive
>> and drivers shouldn't have to change to use it--it should be invisible to them.
>> Then our kref would just work.
>>
> yep, that make sense. But you need more changes to fix this issue. this patchset is insufficient.

Or LESS. Less changes. Less is better. Basically revert and redo all this "managed resources".

Regards,
Luben

> 
> 
>>>
>>> And in the final_release callback we free the dev. But that is a little complex now. so I prefer still using final_kfree.
>>> Of course we can do some cleanup work in the driver's release callback. BUT no kfree.
>>
>> No! No final_kfree. It's a hack.
>>
>> Read the documentation in drm_drv.c I noted above--it lays out how this happens. Reading is required.
>>
>> Regards,
>> Luben
>>
>>
>>>
>>> -----原始邮件-----
>>> 发件人: "Tuikov, Luben" <Luben.Tuikov@amd.com>
>>> 日期: 2020年9月2日 星期三 09:07
>>> 收件人: "amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>, "dri-devel@lists.freedesktop.org" <dri-devel@lists.freedesktop.org>
>>> 抄送: "Deucher, Alexander" <Alexander.Deucher@amd.com>, Daniel Vetter <daniel@ffwll.ch>, "Pan, Xinhui" <Xinhui.Pan@amd.com>, "Tuikov, Luben" <Luben.Tuikov@amd.com>
>>> 主题: [PATCH 0/3] Use implicit kref infra
>>>
>>>    Use the implicit kref infrastructure to free the container
>>>    struct amdgpu_device, container of struct drm_device.
>>>
>>>    First, in drm_dev_register(), do not indiscriminately warn
>>>    when a DRM driver hasn't opted for managed.final_kfree,
>>>    but instead check if the driver has provided its own
>>>    "release" function callback in the DRM driver structure.
>>>    If that is the case, no warning.
>>>
>>>    Remove drmm_add_final_kfree(). We take care of that, in the
>>>    kref "release" callback when all refs are down to 0, via
>>>    drm_dev_put(), i.e. the free is implicit.
>>>
>>>    Remove superfluous NULL check, since the DRM device to be
>>>    suspended always exists, so long as the underlying PCI and
>>>    DRM devices exist.
>>>
>>>    Luben Tuikov (3):
>>>      drm: No warn for drivers who provide release
>>>      drm/amdgpu: Remove drmm final free
>>>      drm/amdgpu: Remove superfluous NULL check
>>>
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ---
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 2 --
>>>     drivers/gpu/drm/drm_drv.c                  | 3 ++-
>>>     3 files changed, 2 insertions(+), 6 deletions(-)
>>>
>>>    -- 
>>>    2.28.0.394.ge197136389
>>>
>>>
>>>
>>
>
Luben Tuikov Sept. 2, 2020, 2:59 p.m. UTC | #6
On 2020-09-02 02:52, Daniel Vetter wrote:
> On Tue, Sep 01, 2020 at 11:46:18PM -0400, Luben Tuikov wrote:
>> On 2020-09-01 21:42, Pan, Xinhui wrote:
>>> If you take a look at the below function, you should not use driver's release to free adev. As dev is embedded in adev.
>>
>> Do you mean "look at the function below", using "below" as an adverb?
>> "below" is not an adjective.
>>
>> I know dev is embedded in adev--I did that patchset.
>>
>>>
>>>  809 static void drm_dev_release(struct kref *ref)
>>>  810 {
>>>  811         struct drm_device *dev = container_of(ref, struct drm_device, ref);
>>>  812        
>>>  813         if (dev->driver->release)
>>>  814                 dev->driver->release(dev);
>>>  815 
>>>  816         drm_managed_release(dev);
>>>  817 
>>>  818         kfree(dev->managed.final_kfree);
>>>  819 }
>>
>> That's simple--this comes from change c6603c740e0e3
>> and it should be reverted. Simple as that.
>>
>> The version before this change was absolutely correct:
>>
>> static void drm_dev_release(struct kref *ref)
>> {
>> 	if (dev->driver->release)
>> 		dev->driver->release(dev);
>> 	else
>> 		drm_dev_fini(dev);
>> }
>>
>> Meaning, "the kref is now 0"--> if the driver
>> has a release, call it, else use our own.
>> But note that nothing can be assumed after this point,
>> about the existence of "dev".
>>
>> It is exactly because struct drm_device is statically
>> embedded into a container, struct amdgpu_device,
>> that this change above should be reverted.
>>
>> This is very similar to how fops has open/release
>> but no close. That is, the "release" is called
>> only when the last kref is released, i.e. when
>> kref goes from non-zero to zero.
>>
>> This uses the kref infrastructure which has been
>> around for about 20 years in the Linux kernel.
>>
>> I suggest reading the comments
>> in drm_dev.c mostly, "DOC: driver instance overview"
>> starting at line 240 onwards. This is right above
>> drm_put_dev(). There is actually an example of a driver
>> in the comment. Also the comment to drm_dev_init().
>>
>> Now, take a look at this:
>>
>> /**
>>  * drm_dev_put - Drop reference of a DRM device
>>  * @dev: device to drop reference of or NULL
>>  *
>>  * This decreases the ref-count of @dev by one. The device is destroyed if the
>>  * ref-count drops to zero.
>>  */
>> void drm_dev_put(struct drm_device *dev)
>> {
>>         if (dev)
>>                 kref_put(&dev->ref, drm_dev_release);
>> }
>> EXPORT_SYMBOL(drm_dev_put);
>>
>> Two things:
>>
>> 1. It is us, who kzalloc the amdgpu device, which contains
>> the drm_device (you'll see this discussed in the reading
>> material I pointed to above). We do this because we're
>> probing the PCI device whether we'll work it it or not.
>>
>> 2. Using the kref infrastructure, when the ref goes to 0,
>> drm_dev_release is called. And here's the KEY:
>> Because WE allocated the container, we should free it--after the release
>> method is called, DRM cannot assume anything about the drm
>> device or the container. The "release" method is final.
>>
>> We allocate, we free. And we free only when the ref goes to 0.
>>
>> DRM can, in due time, "free" itself of the DRM device and stop
>> having knowledge of it--that's fine, but as long as the ref
>> is not 0, the amdgpu device and thus the contained DRM device,
>> cannot be freed.
>>
>>>
>>> You have to make another change something like
>>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>>> index 13068fdf4331..2aabd2b4c63b 100644
>>> --- a/drivers/gpu/drm/drm_drv.c
>>> +++ b/drivers/gpu/drm/drm_drv.c
>>> @@ -815,7 +815,8 @@ static void drm_dev_release(struct kref *ref)
>>>  
>>>         drm_managed_release(dev);
>>>  
>>> -       kfree(dev->managed.final_kfree);
>>> +       if (dev->driver->final_release)
>>> +               dev->driver->final_release(dev);
>>>  }
>>
>> No. What's this?
>> There is no such thing as "final" release, nor is there a "partial" release.
>> When the kref goes to 0, the device disappears. Simple.
>> If someone is using it, they should kref-get it, and when they're
>> done with it, they should kref-put it.
>>
>> The whole point is that this is done implicitly, via the kref infrastructure.
>> drm_dev_init() which we call in our PCI probe function, sets the kref to 1--all
>> as per the documentation I pointed you to above.
>>
>> Another point is that we can do some other stuff in the release
>> function, notify someone, write some registers, free memory we use
>> for that PCI device, etc.
>>
>> If the "managed resources" infrastructure wants to stay, it should hook
>> itself into drm_dev_fini() and into drm_dev_init() or drm_dev_register().
>> It shouldn't have to be so out-of-place like in patch 2/3 of this series,
>> where the drmm_add_final_kfree() is smack-dab in the middle of our PCI
>> discovery function, surrounded on top and bottom by drm_dev_init()
>> and drm_dev_register(). The "managed resources" infra should be non-invasive
>> and drivers shouldn't have to change to use it--it should be invisible to them.
>> Then our kref would just work.
> 
> Unfortunately some part of that drm managed series is stuck still waiting
> for review (I guess I need to resubmit), but with that the docs would tell
> you to use devm_drm_dev_alloc instead of hand-rolling all this.

That's very very misleading--the documentation DOES NOT tell you this.
It tells you exactly how to do it using the "release" method
folded into the kref infrastructure. It is your changes
to the documentation which tell you to use this new alloc.

> 
> Definitely not any of the ideas proposed in this discussion or patches :-)

These are nothing new--it's just standard DRM stuff described in the documentation
in drm_drv.c.

There are NO NEW IDEAS or anything "proposed" in this discussion or patches--it's
just standard stuff from DRM, which had been the case for the last 10 years,
until your "managed resources" came along. It's the managed resources
which are new.

Regards,
Luben

> 
> I'll cc you on that series when I send it out again.
> 
> Cheers, Daniel
>>
>>>
>>> And in the final_release callback we free the dev. But that is a little complex now. so I prefer still using final_kfree.
>>> Of course we can do some cleanup work in the driver's release callback. BUT no kfree.
>>
>> No! No final_kfree. It's a hack.
>>
>> Read the documentation in drm_drv.c I noted above--it lays out how this happens. Reading is required.
>>
>> Regards,
>> Luben
>>
>>
>>>
>>> -----原始邮件-----
>>> 发件人: "Tuikov, Luben" <Luben.Tuikov@amd.com>
>>> 日期: 2020年9月2日 星期三 09:07
>>> 收件人: "amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>, "dri-devel@lists.freedesktop.org" <dri-devel@lists.freedesktop.org>
>>> 抄送: "Deucher, Alexander" <Alexander.Deucher@amd.com>, Daniel Vetter <daniel@ffwll.ch>, "Pan, Xinhui" <Xinhui.Pan@amd.com>, "Tuikov, Luben" <Luben.Tuikov@amd.com>
>>> 主题: [PATCH 0/3] Use implicit kref infra
>>>
>>>     Use the implicit kref infrastructure to free the container
>>>     struct amdgpu_device, container of struct drm_device.
>>>     
>>>     First, in drm_dev_register(), do not indiscriminately warn
>>>     when a DRM driver hasn't opted for managed.final_kfree,
>>>     but instead check if the driver has provided its own
>>>     "release" function callback in the DRM driver structure.
>>>     If that is the case, no warning.
>>>     
>>>     Remove drmm_add_final_kfree(). We take care of that, in the
>>>     kref "release" callback when all refs are down to 0, via
>>>     drm_dev_put(), i.e. the free is implicit.
>>>     
>>>     Remove superfluous NULL check, since the DRM device to be
>>>     suspended always exists, so long as the underlying PCI and
>>>     DRM devices exist.
>>>     
>>>     Luben Tuikov (3):
>>>       drm: No warn for drivers who provide release
>>>       drm/amdgpu: Remove drmm final free
>>>       drm/amdgpu: Remove superfluous NULL check
>>>     
>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ---
>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 2 --
>>>      drivers/gpu/drm/drm_drv.c                  | 3 ++-
>>>      3 files changed, 2 insertions(+), 6 deletions(-)
>>>     
>>>     -- 
>>>     2.28.0.394.ge197136389
>>>     
>>>     
>>>
>>
>
Daniel Stone Sept. 2, 2020, 3 p.m. UTC | #7
Hi Luben,

On Wed, 2 Sep 2020 at 15:51, Luben Tuikov <luben.tuikov@amd.com> wrote:
> Of course it's true--good morning!
>
> Let me stop you right there--just read the documentation I pointed
> to you at.
>
> No!
>
> I'm sorry, that doesn't make sense.
>
> No, that's horrible.
>
> No, that's horrible.
>
> You need to understand how the kref infrastructure works in the kernel. I've said
> it a million times: it's implicit.
>
> Or LESS. Less changes. Less is better. Basically revert and redo all this "managed resources".

There are many better ways to make your point. At the moment it's just
getting lost in shouting.

Cheers,
Daniel
Luben Tuikov Sept. 2, 2020, 3:16 p.m. UTC | #8
On 2020-09-02 11:00, Daniel Stone wrote:
> Hi Luben,
> 
> On Wed, 2 Sep 2020 at 15:51, Luben Tuikov <luben.tuikov@amd.com> wrote:
>> Of course it's true--good morning!
>>
>> Let me stop you right there--just read the documentation I pointed
>> to you at.
>>
>> No!
>>
>> I'm sorry, that doesn't make sense.
>>
>> No, that's horrible.
>>
>> No, that's horrible.
>>
>> You need to understand how the kref infrastructure works in the kernel. I've said
>> it a million times: it's implicit.
>>
>> Or LESS. Less changes. Less is better. Basically revert and redo all this "managed resources".
> 
> There are many better ways to make your point. At the moment it's just
> getting lost in shouting.

Hi Daniel,

Not sure how I can do this when someone doesn't want to read up on
the kref infrastructure. Can you help?

When someone starts off with "My understanding of ..." (as in the OP) you know you're
in trouble and in for a rough times.

Such is the nature of world-wide open-to-everyone mailing lists where
anyone can put forth an argument, regardless of their level of understanding.
The more obfuscated an argument, the more uncertainty.

If one knows the kref infrastructure, it just clicks, no explanation
necessary.

Regards,
Luben

> 
> Cheers,
> Daniel
>
Daniel Stone Sept. 2, 2020, 3:51 p.m. UTC | #9
Hi Luben,

On Wed, 2 Sep 2020 at 16:16, Luben Tuikov <luben.tuikov@amd.com> wrote:
> Not sure how I can do this when someone doesn't want to read up on
> the kref infrastructure. Can you help?
>
> When someone starts off with "My understanding of ..." (as in the OP) you know you're
> in trouble and in for a rough times.
>
> Such is the nature of world-wide open-to-everyone mailing lists where
> anyone can put forth an argument, regardless of their level of understanding.
> The more obfuscated an argument, the more uncertainty.
>
> If one knows the kref infrastructure, it just clicks, no explanation
> necessary.

Evidently there are more points of view than yours. Evidently your
method of persuasion is also not working, because this thread is now
getting quite long and not converging on your point of view (which you
are holding to be absolutely objectively correct).

I think you need to re-evaluate the way in which you speak to people,
considering that it costs nothing to be polite and considerate, and
also takes effort to be rude and dismissive.

Cheers,
Daniel
Luben Tuikov Sept. 2, 2020, 7:04 p.m. UTC | #10
On 2020-09-02 11:51 a.m., Daniel Stone wrote:
> Hi Luben,
> 
> On Wed, 2 Sep 2020 at 16:16, Luben Tuikov <luben.tuikov@amd.com> wrote:
>> Not sure how I can do this when someone doesn't want to read up on
>> the kref infrastructure. Can you help?
>>
>> When someone starts off with "My understanding of ..." (as in the OP) you know you're
>> in trouble and in for a rough times.
>>
>> Such is the nature of world-wide open-to-everyone mailing lists where
>> anyone can put forth an argument, regardless of their level of understanding.
>> The more obfuscated an argument, the more uncertainty.
>>
>> If one knows the kref infrastructure, it just clicks, no explanation
>> necessary.
> 
> Evidently there are more points of view than yours. Evidently your
> method of persuasion is also not working, because this thread is now
> getting quite long and not converging on your point of view (which you
> are holding to be absolutely objectively correct).
> 
> I think you need to re-evaluate the way in which you speak to people,
> considering that it costs nothing to be polite and considerate, and
> also takes effort to be rude and dismissive.

Not sure how to help this:

> My understanding of the drm core code is like something below.
> struct B { 
> 	strcut A 
> }
> we initialize A firstly and initialize B in the end. But destroy B firstly and destory A in the end.

Regards,
Luben
Alex Deucher Sept. 2, 2020, 7:16 p.m. UTC | #11
On Wed, Sep 2, 2020 at 3:04 PM Luben Tuikov <luben.tuikov@amd.com> wrote:
>
> On 2020-09-02 11:51 a.m., Daniel Stone wrote:
> > Hi Luben,
> >
> > On Wed, 2 Sep 2020 at 16:16, Luben Tuikov <luben.tuikov@amd.com> wrote:
> >> Not sure how I can do this when someone doesn't want to read up on
> >> the kref infrastructure. Can you help?
> >>
> >> When someone starts off with "My understanding of ..." (as in the OP) you know you're
> >> in trouble and in for a rough times.
> >>
> >> Such is the nature of world-wide open-to-everyone mailing lists where
> >> anyone can put forth an argument, regardless of their level of understanding.
> >> The more obfuscated an argument, the more uncertainty.
> >>
> >> If one knows the kref infrastructure, it just clicks, no explanation
> >> necessary.
> >
> > Evidently there are more points of view than yours. Evidently your
> > method of persuasion is also not working, because this thread is now
> > getting quite long and not converging on your point of view (which you
> > are holding to be absolutely objectively correct).
> >
> > I think you need to re-evaluate the way in which you speak to people,
> > considering that it costs nothing to be polite and considerate, and
> > also takes effort to be rude and dismissive.
>
> Not sure how to help this:
>
> > My understanding of the drm core code is like something below.
> > struct B {
> >       strcut A
> > }
> > we initialize A firstly and initialize B in the end. But destroy B firstly and destory A in the end.
>

Luben, please tone it down a bit.  You are coming across very harshly.
You do make a good point though.  What is the point of having the drm
release callback if it's ostensibly useless?  We should either use it
as intended to release the structures allocated by the driver or the
drm core should handle it all.  With the managed resources there is an
incongruity between allocation and freeing which leads to confusion.
Even with the proposed updated documentation, it's not clear to me who
should use the managed resources or not.  My understanding was that it
was optional for drivers that wanted it.

Alex
Daniel Vetter Sept. 2, 2020, 7:55 p.m. UTC | #12
On Wed, Sep 2, 2020 at 9:17 PM Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Wed, Sep 2, 2020 at 3:04 PM Luben Tuikov <luben.tuikov@amd.com> wrote:
> >
> > On 2020-09-02 11:51 a.m., Daniel Stone wrote:
> > > Hi Luben,
> > >
> > > On Wed, 2 Sep 2020 at 16:16, Luben Tuikov <luben.tuikov@amd.com> wrote:
> > >> Not sure how I can do this when someone doesn't want to read up on
> > >> the kref infrastructure. Can you help?
> > >>
> > >> When someone starts off with "My understanding of ..." (as in the OP) you know you're
> > >> in trouble and in for a rough times.
> > >>
> > >> Such is the nature of world-wide open-to-everyone mailing lists where
> > >> anyone can put forth an argument, regardless of their level of understanding.
> > >> The more obfuscated an argument, the more uncertainty.
> > >>
> > >> If one knows the kref infrastructure, it just clicks, no explanation
> > >> necessary.
> > >
> > > Evidently there are more points of view than yours. Evidently your
> > > method of persuasion is also not working, because this thread is now
> > > getting quite long and not converging on your point of view (which you
> > > are holding to be absolutely objectively correct).
> > >
> > > I think you need to re-evaluate the way in which you speak to people,
> > > considering that it costs nothing to be polite and considerate, and
> > > also takes effort to be rude and dismissive.
> >
> > Not sure how to help this:
> >
> > > My understanding of the drm core code is like something below.
> > > struct B {
> > >       strcut A
> > > }
> > > we initialize A firstly and initialize B in the end. But destroy B firstly and destory A in the end.
> >
>
> Luben, please tone it down a bit.  You are coming across very harshly.
> You do make a good point though.  What is the point of having the drm
> release callback if it's ostensibly useless?  We should either use it
> as intended to release the structures allocated by the driver or the
> drm core should handle it all.  With the managed resources there is an
> incongruity between allocation and freeing which leads to confusion.
> Even with the proposed updated documentation, it's not clear to me who
> should use the managed resources or not.  My understanding was that it
> was optional for drivers that wanted it.

In an ideal world this would all be perfectly clean. In reality we
have huge existing drivers which, if at all feasible, can only be
converted over step by step.

So with that there's a few ways we can make this happen:
- drmm resources are cleaned up before ->release is called. This means
doing auto-cleanup of the final steps like cleanup up drm_device
resources is gated on all drivers first being converted completely
over to drmm, which is never going to happen. And it's holding up
removing all the fairly simple cleanup code from small driver, which
is where managed resources (whether drmm or devm) have the most
benefit, often they completely eliminate the need for any explicit
teardown code.
- start in the middle. That just oopses because the unwind order isn't
the inverse of the setup order anymore, and generally that's required.
- start at the end. Unfortunately this means that the drm_device
cannot be freed in the driver's ->release callback, therefore for
transition purposes I had to sprinkle drmm_add_final_kfree all over
the place. But if you use devm_drm_dev_alloc (like the updated docs
recommend) that's not needed anymore, so really not an eyesore for
driver developers.

Yes there's mildly tricky code in the core as a result, but given that
you guys wont volunteer to fix up the entire subsystem either we just
have to live with that I think. Also, the commit adding the
drm_managed stuff does explain these implementation details and the
reasons why.

Cheers, Daniel
Daniel Vetter Sept. 2, 2020, 7:59 p.m. UTC | #13
On Wed, Sep 2, 2020 at 9:55 PM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Wed, Sep 2, 2020 at 9:17 PM Alex Deucher <alexdeucher@gmail.com> wrote:
> >
> > On Wed, Sep 2, 2020 at 3:04 PM Luben Tuikov <luben.tuikov@amd.com> wrote:
> > >
> > > On 2020-09-02 11:51 a.m., Daniel Stone wrote:
> > > > Hi Luben,
> > > >
> > > > On Wed, 2 Sep 2020 at 16:16, Luben Tuikov <luben.tuikov@amd.com> wrote:
> > > >> Not sure how I can do this when someone doesn't want to read up on
> > > >> the kref infrastructure. Can you help?
> > > >>
> > > >> When someone starts off with "My understanding of ..." (as in the OP) you know you're
> > > >> in trouble and in for a rough times.
> > > >>
> > > >> Such is the nature of world-wide open-to-everyone mailing lists where
> > > >> anyone can put forth an argument, regardless of their level of understanding.
> > > >> The more obfuscated an argument, the more uncertainty.
> > > >>
> > > >> If one knows the kref infrastructure, it just clicks, no explanation
> > > >> necessary.
> > > >
> > > > Evidently there are more points of view than yours. Evidently your
> > > > method of persuasion is also not working, because this thread is now
> > > > getting quite long and not converging on your point of view (which you
> > > > are holding to be absolutely objectively correct).
> > > >
> > > > I think you need to re-evaluate the way in which you speak to people,
> > > > considering that it costs nothing to be polite and considerate, and
> > > > also takes effort to be rude and dismissive.
> > >
> > > Not sure how to help this:
> > >
> > > > My understanding of the drm core code is like something below.
> > > > struct B {
> > > >       strcut A
> > > > }
> > > > we initialize A firstly and initialize B in the end. But destroy B firstly and destory A in the end.
> > >
> >
> > Luben, please tone it down a bit.  You are coming across very harshly.
> > You do make a good point though.  What is the point of having the drm
> > release callback if it's ostensibly useless?  We should either use it
> > as intended to release the structures allocated by the driver or the
> > drm core should handle it all.  With the managed resources there is an
> > incongruity between allocation and freeing which leads to confusion.
> > Even with the proposed updated documentation, it's not clear to me who
> > should use the managed resources or not.  My understanding was that it
> > was optional for drivers that wanted it.
>
> In an ideal world this would all be perfectly clean. In reality we
> have huge existing drivers which, if at all feasible, can only be
> converted over step by step.
>
> So with that there's a few ways we can make this happen:
> - drmm resources are cleaned up before ->release is called. This means
> doing auto-cleanup of the final steps like cleanup up drm_device
> resources is gated on all drivers first being converted completely
> over to drmm, which is never going to happen. And it's holding up
> removing all the fairly simple cleanup code from small driver, which
> is where managed resources (whether drmm or devm) have the most
> benefit, often they completely eliminate the need for any explicit
> teardown code.
> - start in the middle. That just oopses because the unwind order isn't
> the inverse of the setup order anymore, and generally that's required.
> - start at the end. Unfortunately this means that the drm_device
> cannot be freed in the driver's ->release callback, therefore for
> transition purposes I had to sprinkle drmm_add_final_kfree all over
> the place. But if you use devm_drm_dev_alloc (like the updated docs
> recommend) that's not needed anymore, so really not an eyesore for
> driver developers.
>
> Yes there's mildly tricky code in the core as a result, but given that
> you guys wont volunteer to fix up the entire subsystem either we just
> have to live with that I think. Also, the commit adding the
> drm_managed stuff does explain these implementation details and the
> reasons why.

Also note that tons of stuff in drm doesn't yet provide drmm_
versions, teardown-less drivers really only works for really simple
ones. So completely getting rid of the ->release callback will also
need lots of core work, like the currently in-flight series to add
more drmm_ helpers for kms objects:

https://lore.kernel.org/dri-devel/20200827160545.1146-1-p.zabel@pengutronix.de/

Help obviously very much welcome.

Cheers, Daniel
Pan, Xinhui Sept. 3, 2020, 1:57 a.m. UTC | #14
> 2020年9月2日 22:50,Tuikov, Luben <Luben.Tuikov@amd.com> 写道:
> 
> On 2020-09-02 00:43, Pan, Xinhui wrote:
>> 
>> 
>>> 2020年9月2日 11:46,Tuikov, Luben <Luben.Tuikov@amd.com> 写道:
>>> 
>>> On 2020-09-01 21:42, Pan, Xinhui wrote:
>>>> If you take a look at the below function, you should not use driver's release to free adev. As dev is embedded in adev.
>>> 
>>> Do you mean "look at the function below", using "below" as an adverb?
>>> "below" is not an adjective.
>>> 
>>> I know dev is embedded in adev--I did that patchset.
>>> 
>>>> 
>>>> 809 static void drm_dev_release(struct kref *ref)
>>>> 810 {
>>>> 811         struct drm_device *dev = container_of(ref, struct drm_device, ref);
>>>> 812        
>>>> 813         if (dev->driver->release)
>>>> 814                 dev->driver->release(dev);
>>>> 815 
>>>> 816         drm_managed_release(dev);
>>>> 817 
>>>> 818         kfree(dev->managed.final_kfree);
>>>> 819 }
>>> 
>>> That's simple--this comes from change c6603c740e0e3
>>> and it should be reverted. Simple as that.
>>> 
>>> The version before this change was absolutely correct:
>>> 
>>> static void drm_dev_release(struct kref *ref)
>>> {
>>> 	if (dev->driver->release)
>>> 		dev->driver->release(dev);
>>> 	else
>>> 		drm_dev_fini(dev);
>>> }
>>> 
>>> Meaning, "the kref is now 0"--> if the driver
>>> has a release, call it, else use our own.
>>> But note that nothing can be assumed after this point,
>>> about the existence of "dev".
>>> 
>>> It is exactly because struct drm_device is statically
>>> embedded into a container, struct amdgpu_device,
>>> that this change above should be reverted.
>>> 
>>> This is very similar to how fops has open/release
>>> but no close. That is, the "release" is called
>>> only when the last kref is released, i.e. when
>>> kref goes from non-zero to zero.
>>> 
>>> This uses the kref infrastructure which has been
>>> around for about 20 years in the Linux kernel.
>>> 
>>> I suggest reading the comments
>>> in drm_dev.c mostly, "DOC: driver instance overview"
>>> starting at line 240 onwards. This is right above
>>> drm_put_dev(). There is actually an example of a driver
>>> in the comment. Also the comment to drm_dev_init().
>>> 
>>> Now, take a look at this:
>>> 
>>> /**
>>> * drm_dev_put - Drop reference of a DRM device
>>> * @dev: device to drop reference of or NULL
>>> *
>>> * This decreases the ref-count of @dev by one. The device is destroyed if the
>>> * ref-count drops to zero.
>>> */
>>> void drm_dev_put(struct drm_device *dev)
>>> {
>>>       if (dev)
>>>               kref_put(&dev->ref, drm_dev_release);
>>> }
>>> EXPORT_SYMBOL(drm_dev_put);
>>> 
>>> Two things:
>>> 
>>> 1. It is us, who kzalloc the amdgpu device, which contains
>>> the drm_device (you'll see this discussed in the reading
>>> material I pointed to above). We do this because we're
>>> probing the PCI device whether we'll work it it or not.
>>> 
>> 
>> that is true.
> 
> Of course it's true--good morning!
> 
>> My understanding of the drm core code is like something below.
> 
> Let me stop you right there--just read the documentation I pointed
> to you at.
> 
>> struct B { 
>> 	strcut A 
>> }
>> we initialize A firstly and initialize B in the end. But destroy B firstly and destory A in the end.
> 
> No!
> B, which is the amdgpu_device struct "exists" before A, which is the DRM struct.
> This is why DRM recommends to _embed_ it into the driver's own device struct,
> as the documentation I pointed you to at.
> 
I think you are misleading me here.  A pci dev as you said below can act as many roles, a drm dev, a tty dev, etc.
say, struct B{
struct A;
struct TTY;
struct printer;
...
}
but TTY or other members has nothing to do with our discussion.

B of course exists before A. but the code logic is not that. code below is really rare in drm world.
create_B()
{
	init B members
	return create_A()
}
So usually B have more work to do after it initialize A.
then code should like below
create_B()
{
	init B base members
	create_A()
	init B extended members
}


For release part.
release B extended member
release A
release B base member

a good design should not have the so-called extended and base members existing in the release process.
Now have a look at the drm core code.
it expects driver to do release process like below.
release B
cleanup work of A

as long as the cleanup work of A exists, we can not do a pure release of B.

So if you want to follow the ruls of kref, you have to rework the drm core code first. only after that, we can do a pure release of B.

What I am confused is that, kfer sits in drm dev. why adev must be destroyed too when drm dev is going to be destroyed.
adev is not equal to drm dev.
I think struct below is more fit for the logic.
struct adev {
	struct drm * ddev_p = &adev.ddev
	struct type *odev_p  = &adev.odev
	struct drm ddev
	struct type odev
}

> "undone" first, since the DRM layer may finish with a device, but
> the device may still exists with the driver and as well as with PCI.
> This is very VERY common, with kernels, devices, device abstractions,
> device layers: DRM dev <-- amdgpu dev <-- PCI dev.
> 
>> But yes, practice is more complex. 
>> if B has nothing to be destroyed. we can destory A directly, otherwise destroy B firstly.
> 
> I'm sorry, that doesn't make sense. There is no such thing as "destroy directly"
> and "otherwise"--this is absolutely not how this works.
> 
> A good architecture doesn't have if-then-else--it's just a pure single-branch path.

well, there is code below everywhere.
if (fops->release)
	fops->release
else
	default_release

>> 
>> in this case, we can do something below in our release()
>> //some cleanup work of B
>> drm_dev_fini(dev);//destroy A
>> kfree(adev)
>> 
>>> 2. Using the kref infrastructure, when the ref goes to 0,
>>> drm_dev_release is called. And here's the KEY:
>>> Because WE allocated the container, we should free it--after the release
>>> method is called, DRM cannot assume anything about the drm
>>> device or the container. The "release" method is final.
>>> 
>>> We allocate, we free. And we free only when the ref goes to 0.
>>> 
>>> DRM can, in due time, "free" itself of the DRM device and stop
>>> having knowledge of it--that's fine, but as long as the ref
>>> is not 0, the amdgpu device and thus the contained DRM device,
>>> cannot be freed.
>>> 
>>>> 
>>>> You have to make another change something like
>>>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>>>> index 13068fdf4331..2aabd2b4c63b 100644
>>>> --- a/drivers/gpu/drm/drm_drv.c
>>>> +++ b/drivers/gpu/drm/drm_drv.c
>>>> @@ -815,7 +815,8 @@ static void drm_dev_release(struct kref *ref)
>>>> 
>>>>       drm_managed_release(dev);
>>>> 
>>>> -       kfree(dev->managed.final_kfree);
>>>> +       if (dev->driver->final_release)
>>>> +               dev->driver->final_release(dev);
>>>> }
>>> 
>>> No. What's this?
>>> There is no such thing as "final" release, nor is there a "partial" release.
>>> When the kref goes to 0, the device disappears. Simple.
>>> If someone is using it, they should kref-get it, and when they're
>>> done with it, they should kref-put it.
>> 
>> I just take an example here. add another release in the end. then no one could touch us. IOW, final_release.
> 
> No, that's horrible.
> 
>> 
>> 
>> A destroy B by a callback, then A destroy itself. It assumes B just free its own resource.
>> but that makes trouble if some resource of A is allocated by B.
>> Because B must take care of these common resource shared between A and B.
> 
> No, that's horrible.
> 
>> 
>> yes, that logical is more complex. So I think we can revert drm_dev_release to its previous version.
> 
> drm_dev_release() in its original form, was pure:
> 
> static void drm_dev_release(struct kref *ref)
> {
> 	if (dev->driver->release)
> 		dev->driver->release(dev);
> 	else
> 		drm_dev_fini(dev);
> }
> 
>> 
>>> 
>>> The whole point is that this is done implicitly, via the kref infrastructure.
>>> drm_dev_init() which we call in our PCI probe function, sets the kref to 1--all
>>> as per the documentation I pointed you to above.
>> 
>> I am not taking about kref. what we are discussing is all about the release sequence.
> 
> You need to understand how the kref infrastructure works in the kernel. I've said
> it a million times: it's implicit. The "release sequence" as you like to call it,
> is implicit in the kref infrastructure.
> 
>> 
>> 
>>> 
>>> Another point is that we can do some other stuff in the release
>>> function, notify someone, write some registers, free memory we use
>>> for that PCI device, etc.
>>> 
>>> If the "managed resources" infrastructure wants to stay, it should hook
>>> itself into drm_dev_fini() and into drm_dev_init() or drm_dev_register().
>>> It shouldn't have to be so out-of-place like in patch 2/3 of this series,
>>> where the drmm_add_final_kfree() is smack-dab in the middle of our PCI
>>> discovery function, surrounded on top and bottom by drm_dev_init()
>>> and drm_dev_register(). The "managed resources" infra should be non-invasive
>>> and drivers shouldn't have to change to use it--it should be invisible to them.
>>> Then our kref would just work.
>>> 
>> yep, that make sense. But you need more changes to fix this issue. this patchset is insufficient.
> 
> Or LESS. Less changes. Less is better. Basically revert and redo all this "managed resources".
> 
> Regards,
> Luben
> 
>> 
>> 
>>>> 
>>>> And in the final_release callback we free the dev. But that is a little complex now. so I prefer still using final_kfree.
>>>> Of course we can do some cleanup work in the driver's release callback. BUT no kfree.
>>> 
>>> No! No final_kfree. It's a hack.
>>> 
>>> Read the documentation in drm_drv.c I noted above--it lays out how this happens. Reading is required.
>>> 
>>> Regards,
>>> Luben
>>> 
>>> 
>>>> 
>>>> -----原始邮件-----
>>>> 发件人: "Tuikov, Luben" <Luben.Tuikov@amd.com>
>>>> 日期: 2020年9月2日 星期三 09:07
>>>> 收件人: "amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>, "dri-devel@lists.freedesktop.org" <dri-devel@lists.freedesktop.org>
>>>> 抄送: "Deucher, Alexander" <Alexander.Deucher@amd.com>, Daniel Vetter <daniel@ffwll.ch>, "Pan, Xinhui" <Xinhui.Pan@amd.com>, "Tuikov, Luben" <Luben.Tuikov@amd.com>
>>>> 主题: [PATCH 0/3] Use implicit kref infra
>>>> 
>>>>   Use the implicit kref infrastructure to free the container
>>>>   struct amdgpu_device, container of struct drm_device.
>>>> 
>>>>   First, in drm_dev_register(), do not indiscriminately warn
>>>>   when a DRM driver hasn't opted for managed.final_kfree,
>>>>   but instead check if the driver has provided its own
>>>>   "release" function callback in the DRM driver structure.
>>>>   If that is the case, no warning.
>>>> 
>>>>   Remove drmm_add_final_kfree(). We take care of that, in the
>>>>   kref "release" callback when all refs are down to 0, via
>>>>   drm_dev_put(), i.e. the free is implicit.
>>>> 
>>>>   Remove superfluous NULL check, since the DRM device to be
>>>>   suspended always exists, so long as the underlying PCI and
>>>>   DRM devices exist.
>>>> 
>>>>   Luben Tuikov (3):
>>>>     drm: No warn for drivers who provide release
>>>>     drm/amdgpu: Remove drmm final free
>>>>     drm/amdgpu: Remove superfluous NULL check
>>>> 
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 2 --
>>>>    drivers/gpu/drm/drm_drv.c                  | 3 ++-
>>>>    3 files changed, 2 insertions(+), 6 deletions(-)
>>>> 
>>>>   -- 
>>>>   2.28.0.394.ge197136389
>>>> 
>>>> 
>>>> 
>>> 
>> 
>
Luben Tuikov Sept. 3, 2020, 6:52 p.m. UTC | #15
On 2020-09-02 9:57 p.m., Pan, Xinhui wrote:
> 
> 
>> 2020年9月2日 22:50,Tuikov, Luben <Luben.Tuikov@amd.com> 写道:
>>
>> On 2020-09-02 00:43, Pan, Xinhui wrote:
>>>
>>>
>>>> 2020年9月2日 11:46,Tuikov, Luben <Luben.Tuikov@amd.com> 写道:
>>>>
>>>> On 2020-09-01 21:42, Pan, Xinhui wrote:
>>>>> If you take a look at the below function, you should not use driver's release to free adev. As dev is embedded in adev.
>>>>
>>>> Do you mean "look at the function below", using "below" as an adverb?
>>>> "below" is not an adjective.
>>>>
>>>> I know dev is embedded in adev--I did that patchset.
>>>>
>>>>>
>>>>> 809 static void drm_dev_release(struct kref *ref)
>>>>> 810 {
>>>>> 811         struct drm_device *dev = container_of(ref, struct drm_device, ref);
>>>>> 812        
>>>>> 813         if (dev->driver->release)
>>>>> 814                 dev->driver->release(dev);
>>>>> 815 
>>>>> 816         drm_managed_release(dev);
>>>>> 817 
>>>>> 818         kfree(dev->managed.final_kfree);
>>>>> 819 }
>>>>
>>>> That's simple--this comes from change c6603c740e0e3
>>>> and it should be reverted. Simple as that.
>>>>
>>>> The version before this change was absolutely correct:
>>>>
>>>> static void drm_dev_release(struct kref *ref)
>>>> {
>>>> 	if (dev->driver->release)
>>>> 		dev->driver->release(dev);
>>>> 	else
>>>> 		drm_dev_fini(dev);
>>>> }
>>>>
>>>> Meaning, "the kref is now 0"--> if the driver
>>>> has a release, call it, else use our own.
>>>> But note that nothing can be assumed after this point,
>>>> about the existence of "dev".
>>>>
>>>> It is exactly because struct drm_device is statically
>>>> embedded into a container, struct amdgpu_device,
>>>> that this change above should be reverted.
>>>>
>>>> This is very similar to how fops has open/release
>>>> but no close. That is, the "release" is called
>>>> only when the last kref is released, i.e. when
>>>> kref goes from non-zero to zero.
>>>>
>>>> This uses the kref infrastructure which has been
>>>> around for about 20 years in the Linux kernel.
>>>>
>>>> I suggest reading the comments
>>>> in drm_dev.c mostly, "DOC: driver instance overview"
>>>> starting at line 240 onwards. This is right above
>>>> drm_put_dev(). There is actually an example of a driver
>>>> in the comment. Also the comment to drm_dev_init().
>>>>
>>>> Now, take a look at this:
>>>>
>>>> /**
>>>> * drm_dev_put - Drop reference of a DRM device
>>>> * @dev: device to drop reference of or NULL
>>>> *
>>>> * This decreases the ref-count of @dev by one. The device is destroyed if the
>>>> * ref-count drops to zero.
>>>> */
>>>> void drm_dev_put(struct drm_device *dev)
>>>> {
>>>>       if (dev)
>>>>               kref_put(&dev->ref, drm_dev_release);
>>>> }
>>>> EXPORT_SYMBOL(drm_dev_put);
>>>>
>>>> Two things:
>>>>
>>>> 1. It is us, who kzalloc the amdgpu device, which contains
>>>> the drm_device (you'll see this discussed in the reading
>>>> material I pointed to above). We do this because we're
>>>> probing the PCI device whether we'll work it it or not.
>>>>
>>>
>>> that is true.
>>
>> Of course it's true--good morning!
>>
>>> My understanding of the drm core code is like something below.
>>
>> Let me stop you right there--just read the documentation I pointed
>> to you at.
>>
>>> struct B { 
>>> 	strcut A 
>>> }
>>> we initialize A firstly and initialize B in the end. But destroy B firstly and destory A in the end.
>>
>> No!
>> B, which is the amdgpu_device struct "exists" before A, which is the DRM struct.
>> This is why DRM recommends to _embed_ it into the driver's own device struct,
>> as the documentation I pointed you to at.
>>
> I think you are misleading me here.  A pci dev as you said below can act as many roles, a drm dev, a tty dev, etc.
> say, struct B{
> struct A;
> struct TTY;
> struct printer;
> ...
> }
> but TTY or other members has nothing to do with our discussion.
> 
> B of course exists before A. but the code logic is not that. code below is really rare in drm world.
> create_B()
> {
> 	init B members
> 	return create_A()
> }
> So usually B have more work to do after it initialize A.
> then code should like below
> create_B()
> {
> 	init B base members
> 	create_A()
> 	init B extended members
> }
> 
> 
> For release part.
> release B extended member
> release A
> release B base member
> 
> a good design should not have the so-called extended and base members existing in the release process.
> Now have a look at the drm core code.
> it expects driver to do release process like below.
> release B
> cleanup work of A
> 
> as long as the cleanup work of A exists, we can not do a pure release of B.
> 
> So if you want to follow the ruls of kref, you have to rework the drm core code first. only after that, we can do a pure release of B.
> 
> What I am confused is that, kfer sits in drm dev. why adev must be destroyed too when drm dev is going to be destroyed.
> adev is not equal to drm dev.
> I think struct below is more fit for the logic.
> struct adev {
> 	struct drm * ddev_p = &adev.ddev
> 	struct type *odev_p  = &adev.odev
> 	struct drm ddev
> 	struct type odev
> }

I've not much idea what you said above.

Consider that DRM dev is embedded into amdgpu_device, as opposed to
be a pointer to an allocated DRM dev (which used to be the case before
my patch making the driver device struct a container and DRM a member).
And as such, DRM dev embedded into amdgpu_device, then either both
exist or neither exists, allocated in memory.

DRM layer can finish with a DRM device, fine.
amdgpu driver can finish with an amdgpu device, fine.
They can happen out of order.

But the beauty of kref, is that regardless of how a driver
or layer or a third entity uses and later relinquishes
a device it is working on, only when the kref goes to 0,
is the memory of, in our case both DRM device struct and
amdgpu_device struct both, freed.

And as such, this happens implicitly, without explicit logic
or tracking. All that is required is that if an entity wants
to use a struct, it calls kref-get, and when it is done using
it, it calls kref-put.

The new half-baked managed resources, ported from devres, obviates this
a bit, and smudges this clean separation, because now with it,
the release method isn't terminal (final).

> 
>> "undone" first, since the DRM layer may finish with a device, but
>> the device may still exists with the driver and as well as with PCI.
>> This is very VERY common, with kernels, devices, device abstractions,
>> device layers: DRM dev <-- amdgpu dev <-- PCI dev.
>>
>>> But yes, practice is more complex. 
>>> if B has nothing to be destroyed. we can destory A directly, otherwise destroy B firstly.
>>
>> I'm sorry, that doesn't make sense. There is no such thing as "destroy directly"
>> and "otherwise"--this is absolutely not how this works.
>>
>> A good architecture doesn't have if-then-else--it's just a pure single-branch path.
> 
> well, there is code below everywhere.
> if (fops->release)
> 	fops->release
> else
> 	default_release

Yep, exactly! And this is the 3rd or 4th time I'm pasting
this, lets go for one more:

static void drm_dev_release(struct kref *ref)
{
	if (dev->driver->release)
		dev->driver->release(dev);
	else
		drm_dev_fini(dev);
}

Is how drm_dev_release() used to look like, before
"managed resources" came in. When you look at it,
it is clear what is happening.

You can have managed resources in DRM, but it should
be a bit more elegant and implicit, and possibly
preserve this finality of the release method.

For instance, one could simply decree that drivers
would allocate and free their container structure,
using new functions (as "DRM managed resources"
does right now), but without changing the release
method, thus leaving the finality to the driver.

At this point the driver may call a helper function
in DRM or, DRM can add a function to be called
in drm_dev_release() _before_ calling the driver's
release method.

Right now what we have in amdgpu is non-symmetric
allocation scheme, since with your patch, release
is missing.

Regards,
Luben

> 
>>>
>>> in this case, we can do something below in our release()
>>> //some cleanup work of B
>>> drm_dev_fini(dev);//destroy A
>>> kfree(adev)
>>>
>>>> 2. Using the kref infrastructure, when the ref goes to 0,
>>>> drm_dev_release is called. And here's the KEY:
>>>> Because WE allocated the container, we should free it--after the release
>>>> method is called, DRM cannot assume anything about the drm
>>>> device or the container. The "release" method is final.
>>>>
>>>> We allocate, we free. And we free only when the ref goes to 0.
>>>>
>>>> DRM can, in due time, "free" itself of the DRM device and stop
>>>> having knowledge of it--that's fine, but as long as the ref
>>>> is not 0, the amdgpu device and thus the contained DRM device,
>>>> cannot be freed.
>>>>
>>>>>
>>>>> You have to make another change something like
>>>>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
>>>>> index 13068fdf4331..2aabd2b4c63b 100644
>>>>> --- a/drivers/gpu/drm/drm_drv.c
>>>>> +++ b/drivers/gpu/drm/drm_drv.c
>>>>> @@ -815,7 +815,8 @@ static void drm_dev_release(struct kref *ref)
>>>>>
>>>>>       drm_managed_release(dev);
>>>>>
>>>>> -       kfree(dev->managed.final_kfree);
>>>>> +       if (dev->driver->final_release)
>>>>> +               dev->driver->final_release(dev);
>>>>> }
>>>>
>>>> No. What's this?
>>>> There is no such thing as "final" release, nor is there a "partial" release.
>>>> When the kref goes to 0, the device disappears. Simple.
>>>> If someone is using it, they should kref-get it, and when they're
>>>> done with it, they should kref-put it.
>>>
>>> I just take an example here. add another release in the end. then no one could touch us. IOW, final_release.
>>
>> No, that's horrible.
>>
>>>
>>>
>>> A destroy B by a callback, then A destroy itself. It assumes B just free its own resource.
>>> but that makes trouble if some resource of A is allocated by B.
>>> Because B must take care of these common resource shared between A and B.
>>
>> No, that's horrible.
>>
>>>
>>> yes, that logical is more complex. So I think we can revert drm_dev_release to its previous version.
>>
>> drm_dev_release() in its original form, was pure:
>>
>> static void drm_dev_release(struct kref *ref)
>> {
>> 	if (dev->driver->release)
>> 		dev->driver->release(dev);
>> 	else
>> 		drm_dev_fini(dev);
>> }
>>
>>>
>>>>
>>>> The whole point is that this is done implicitly, via the kref infrastructure.
>>>> drm_dev_init() which we call in our PCI probe function, sets the kref to 1--all
>>>> as per the documentation I pointed you to above.
>>>
>>> I am not taking about kref. what we are discussing is all about the release sequence.
>>
>> You need to understand how the kref infrastructure works in the kernel. I've said
>> it a million times: it's implicit. The "release sequence" as you like to call it,
>> is implicit in the kref infrastructure.
>>
>>>
>>>
>>>>
>>>> Another point is that we can do some other stuff in the release
>>>> function, notify someone, write some registers, free memory we use
>>>> for that PCI device, etc.
>>>>
>>>> If the "managed resources" infrastructure wants to stay, it should hook
>>>> itself into drm_dev_fini() and into drm_dev_init() or drm_dev_register().
>>>> It shouldn't have to be so out-of-place like in patch 2/3 of this series,
>>>> where the drmm_add_final_kfree() is smack-dab in the middle of our PCI
>>>> discovery function, surrounded on top and bottom by drm_dev_init()
>>>> and drm_dev_register(). The "managed resources" infra should be non-invasive
>>>> and drivers shouldn't have to change to use it--it should be invisible to them.
>>>> Then our kref would just work.
>>>>
>>> yep, that make sense. But you need more changes to fix this issue. this patchset is insufficient.
>>
>> Or LESS. Less changes. Less is better. Basically revert and redo all this "managed resources".
>>
>> Regards,
>> Luben
>>
>>>
>>>
>>>>>
>>>>> And in the final_release callback we free the dev. But that is a little complex now. so I prefer still using final_kfree.
>>>>> Of course we can do some cleanup work in the driver's release callback. BUT no kfree.
>>>>
>>>> No! No final_kfree. It's a hack.
>>>>
>>>> Read the documentation in drm_drv.c I noted above--it lays out how this happens. Reading is required.
>>>>
>>>> Regards,
>>>> Luben
>>>>
>>>>
>>>>>
>>>>> -----原始邮件-----
>>>>> 发件人: "Tuikov, Luben" <Luben.Tuikov@amd.com>
>>>>> 日期: 2020年9月2日 星期三 09:07
>>>>> 收件人: "amd-gfx@lists.freedesktop.org" <amd-gfx@lists.freedesktop.org>, "dri-devel@lists.freedesktop.org" <dri-devel@lists.freedesktop.org>
>>>>> 抄送: "Deucher, Alexander" <Alexander.Deucher@amd.com>, Daniel Vetter <daniel@ffwll.ch>, "Pan, Xinhui" <Xinhui.Pan@amd.com>, "Tuikov, Luben" <Luben.Tuikov@amd.com>
>>>>> 主题: [PATCH 0/3] Use implicit kref infra
>>>>>
>>>>>   Use the implicit kref infrastructure to free the container
>>>>>   struct amdgpu_device, container of struct drm_device.
>>>>>
>>>>>   First, in drm_dev_register(), do not indiscriminately warn
>>>>>   when a DRM driver hasn't opted for managed.final_kfree,
>>>>>   but instead check if the driver has provided its own
>>>>>   "release" function callback in the DRM driver structure.
>>>>>   If that is the case, no warning.
>>>>>
>>>>>   Remove drmm_add_final_kfree(). We take care of that, in the
>>>>>   kref "release" callback when all refs are down to 0, via
>>>>>   drm_dev_put(), i.e. the free is implicit.
>>>>>
>>>>>   Remove superfluous NULL check, since the DRM device to be
>>>>>   suspended always exists, so long as the underlying PCI and
>>>>>   DRM devices exist.
>>>>>
>>>>>   Luben Tuikov (3):
>>>>>     drm: No warn for drivers who provide release
>>>>>     drm/amdgpu: Remove drmm final free
>>>>>     drm/amdgpu: Remove superfluous NULL check
>>>>>
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ---
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 2 --
>>>>>    drivers/gpu/drm/drm_drv.c                  | 3 ++-
>>>>>    3 files changed, 2 insertions(+), 6 deletions(-)
>>>>>
>>>>>   -- 
>>>>>   2.28.0.394.ge197136389
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>