diff mbox series

[RESEND] drm/i915: Flush buffer pools on driver remove

Message ID 20210903142320.216705-1-janusz.krzysztofik@linux.intel.com (mailing list archive)
State New, archived
Headers show
Series [RESEND] drm/i915: Flush buffer pools on driver remove | expand

Commit Message

Janusz Krzysztofik Sept. 3, 2021, 2:23 p.m. UTC
In preparation for clean driver release, attempts to drain work queues
and release freed objects are taken at driver remove time.  However, GT
buffer pools are now not flushed before the driver release phase.
Since unused objects may stay there for up to one second, some may
survive until driver release is attempted.  That can potentially
explain sporadic then hardly reproducible issues observed at driver
release time, like non-zero shrink counter or outstanding address space
areas.

Flush buffer pools on GT remove as a fix.  On driver release, don't
flush the pools again, just assert that the flush was called and
nothing added more in between.

Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
---
Resending with Cc: dri-devel@lists.freedesktop.org as requested, and a
typo in commit description fixed.

Thanks,
Janusz

 drivers/gpu/drm/i915/gt/intel_gt.c             | 2 ++
 drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c | 2 --
 2 files changed, 2 insertions(+), 2 deletions(-)

Comments

Matt Roper Sept. 22, 2021, 10:24 p.m. UTC | #1
On Fri, Sep 03, 2021 at 04:23:20PM +0200, Janusz Krzysztofik wrote:
> In preparation for clean driver release, attempts to drain work queues
> and release freed objects are taken at driver remove time.  However, GT
> buffer pools are now not flushed before the driver release phase.
> Since unused objects may stay there for up to one second, some may
> survive until driver release is attempted.  That can potentially
> explain sporadic then hardly reproducible issues observed at driver
> release time, like non-zero shrink counter or outstanding address space

So just to make sure I'm understanding the description here:
 - We currently do an explicit flush of the buffer pools within the call
   path of drm_driver.release(); this removes all buffers, regardless of
   their age.
 - However there may be other code that runs *earlier* within the
   drm_driver.release() call chain that expects buffer pools have
   already been flushed and are already empty.
 - Since buffer pools auto-flush old buffers once per second in a worker
   thread, there's a small window where if we remove the driver while
   there are still buffers with an age of less than one second, the
   assumptions of the other release code may be violated.

So by moving the flush to driver remove (which executes earlier via the
pci_driver.remove() flow) you're ensuring that all buffers are flushed
before _any_ code in drm_driver.release() executes.

I found the wording of the commit message here somewhat confusing since
it's talking about flushes we do in driver release, but mentions
problems that arise during driver release due to lack of flushing.  You
might want to reword the commit message somewhat to help clarify.
Otherwise, the code change itself looks reasonable to me.

BTW, I do notice that drm_driver.release() in general is technically
deprecated at this point (with a suggestion in the drm_drv.h comments to
switch to using drmm_add_action(), drmm_kmalloc(), etc. to manage the
cleanup of resources).  At some point in the future me may want to
rework the i915 cleanup in general according to that guidance.


Matt

> areas.
> 
> Flush buffer pools on GT remove as a fix.  On driver release, don't
> flush the pools again, just assert that the flush was called and
> nothing added more in between.
> 
> Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> ---
> Resending with Cc: dri-devel@lists.freedesktop.org as requested, and a
> typo in commit description fixed.
> 
> Thanks,
> Janusz
> 
>  drivers/gpu/drm/i915/gt/intel_gt.c             | 2 ++
>  drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c | 2 --
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c
> index 62d40c986642..8f322a4ecd87 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt.c
> +++ b/drivers/gpu/drm/i915/gt/intel_gt.c
> @@ -737,6 +737,8 @@ void intel_gt_driver_remove(struct intel_gt *gt)
>  	intel_uc_driver_remove(&gt->uc);
>  
>  	intel_engines_release(gt);
> +
> +	intel_gt_flush_buffer_pool(gt);
>  }
>  
>  void intel_gt_driver_unregister(struct intel_gt *gt)
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c b/drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c
> index aa0a59c5b614..acc49c56a9f3 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c
> @@ -245,8 +245,6 @@ void intel_gt_fini_buffer_pool(struct intel_gt *gt)
>  	struct intel_gt_buffer_pool *pool = &gt->buffer_pool;
>  	int n;
>  
> -	intel_gt_flush_buffer_pool(gt);
> -
>  	for (n = 0; n < ARRAY_SIZE(pool->cache_list); n++)
>  		GEM_BUG_ON(!list_empty(&pool->cache_list[n]));
>  }
> -- 
> 2.25.1
>
Janusz Krzysztofik Sept. 23, 2021, 1:07 p.m. UTC | #2
Hi Matt,

Thanks for review.

On czwartek, 23 września 2021 00:24:29 CEST Matt Roper wrote:
> On Fri, Sep 03, 2021 at 04:23:20PM +0200, Janusz Krzysztofik wrote:
> > In preparation for clean driver release, attempts to drain work queues
> > and release freed objects are taken at driver remove time.  However, GT
> > buffer pools are now not flushed before the driver release phase.
> > Since unused objects may stay there for up to one second, some may
> > survive until driver release is attempted.  That can potentially
> > explain sporadic then hardly reproducible issues observed at driver
> > release time, like non-zero shrink counter or outstanding address space
> 
> So just to make sure I'm understanding the description here:
>  - We currently do an explicit flush of the buffer pools within the call
>    path of drm_driver.release(); this removes all buffers, regardless of
>    their age.

And also triggers release of the buffers' underlying resources (objects, 
address space areas).

>  - However there may be other code that runs *earlier* within the
>    drm_driver.release() call chain 

Yes, within the drm_driver.release() call chain, but not necessarily earlier 
-- that's irrelevant, I believe, ...

>    that expects buffer pools have
>    already been flushed and are already empty.

... since that other code expects not just buffer pools but resource 
categories they consume (objects, address space areas) to be flushed already, 
while some may still be busy with old buffers not auto-flushed yet.

>  - Since buffer pools auto-flush old buffers once per second in a worker
>    thread, there's a small window where if we remove the driver while
>    there are still buffers with an age of less than one second, the
>    assumptions of the other release code may be violated.

Correct.

> So by moving the flush to driver remove (which executes earlier via the
> pci_driver.remove() flow) you're ensuring that all buffers are flushed
> before _any_ code in drm_driver.release() executes.

And also flushed before some other code in pci_driver.remove() flushes those 
other resource categories released on buffer pools flush, then completeness of 
all those flushes is checked in drm_driver.release().

May I copy-paste some of you wording while rephrasing my commit description?

Thanks,
Janusz

> 
> I found the wording of the commit message here somewhat confusing since
> it's talking about flushes we do in driver release, but mentions
> problems that arise during driver release due to lack of flushing.  You
> might want to reword the commit message somewhat to help clarify.
> Otherwise, the code change itself looks reasonable to me.
> 
> BTW, I do notice that drm_driver.release() in general is technically
> deprecated at this point (with a suggestion in the drm_drv.h comments to
> switch to using drmm_add_action(), drmm_kmalloc(), etc. to manage the
> cleanup of resources).  At some point in the future me may want to
> rework the i915 cleanup in general according to that guidance.
> 
> 
> Matt
> 
> > areas.
> > 
> > Flush buffer pools on GT remove as a fix.  On driver release, don't
> > flush the pools again, just assert that the flush was called and
> > nothing added more in between.
> > 
> > Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > ---
> > Resending with Cc: dri-devel@lists.freedesktop.org as requested, and a
> > typo in commit description fixed.
> > 
> > Thanks,
> > Janusz
> > 
> >  drivers/gpu/drm/i915/gt/intel_gt.c             | 2 ++
> >  drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c | 2 --
> >  2 files changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c
> > index 62d40c986642..8f322a4ecd87 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_gt.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_gt.c
> > @@ -737,6 +737,8 @@ void intel_gt_driver_remove(struct intel_gt *gt)
> >  	intel_uc_driver_remove(&gt->uc);
> >  
> >  	intel_engines_release(gt);
> > +
> > +	intel_gt_flush_buffer_pool(gt);
> >  }
> >  
> >  void intel_gt_driver_unregister(struct intel_gt *gt)
> > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c b/drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c
> > index aa0a59c5b614..acc49c56a9f3 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c
> > @@ -245,8 +245,6 @@ void intel_gt_fini_buffer_pool(struct intel_gt *gt)
> >  	struct intel_gt_buffer_pool *pool = &gt->buffer_pool;
> >  	int n;
> >  
> > -	intel_gt_flush_buffer_pool(gt);
> > -
> >  	for (n = 0; n < ARRAY_SIZE(pool->cache_list); n++)
> >  		GEM_BUG_ON(!list_empty(&pool->cache_list[n]));
> >  }
> 
>
Matt Roper Sept. 23, 2021, 1:51 p.m. UTC | #3
On Thu, Sep 23, 2021 at 03:07:06PM +0200, Janusz Krzysztofik wrote:
> Hi Matt,
> 
> Thanks for review.
> 
> On czwartek, 23 września 2021 00:24:29 CEST Matt Roper wrote:
> > On Fri, Sep 03, 2021 at 04:23:20PM +0200, Janusz Krzysztofik wrote:
> > > In preparation for clean driver release, attempts to drain work queues
> > > and release freed objects are taken at driver remove time.  However, GT
> > > buffer pools are now not flushed before the driver release phase.
> > > Since unused objects may stay there for up to one second, some may
> > > survive until driver release is attempted.  That can potentially
> > > explain sporadic then hardly reproducible issues observed at driver
> > > release time, like non-zero shrink counter or outstanding address space
> > 
> > So just to make sure I'm understanding the description here:
> >  - We currently do an explicit flush of the buffer pools within the call
> >    path of drm_driver.release(); this removes all buffers, regardless of
> >    their age.
> 
> And also triggers release of the buffers' underlying resources (objects, 
> address space areas).
> 
> >  - However there may be other code that runs *earlier* within the
> >    drm_driver.release() call chain 
> 
> Yes, within the drm_driver.release() call chain, but not necessarily earlier 
> -- that's irrelevant, I believe, ...
> 
> >    that expects buffer pools have
> >    already been flushed and are already empty.
> 
> ... since that other code expects not just buffer pools but resource 
> categories they consume (objects, address space areas) to be flushed already, 
> while some may still be busy with old buffers not auto-flushed yet.
> 
> >  - Since buffer pools auto-flush old buffers once per second in a worker
> >    thread, there's a small window where if we remove the driver while
> >    there are still buffers with an age of less than one second, the
> >    assumptions of the other release code may be violated.
> 
> Correct.
> 
> > So by moving the flush to driver remove (which executes earlier via the
> > pci_driver.remove() flow) you're ensuring that all buffers are flushed
> > before _any_ code in drm_driver.release() executes.
> 
> And also flushed before some other code in pci_driver.remove() flushes those 
> other resource categories released on buffer pools flush, then completeness of 
> all those flushes is checked in drm_driver.release().
> 
> May I copy-paste some of you wording while rephrasing my commit description?

Sure go ahead.

Thanks.


Matt

> 
> Thanks,
> Janusz
> 
> > 
> > I found the wording of the commit message here somewhat confusing since
> > it's talking about flushes we do in driver release, but mentions
> > problems that arise during driver release due to lack of flushing.  You
> > might want to reword the commit message somewhat to help clarify.
> > Otherwise, the code change itself looks reasonable to me.
> > 
> > BTW, I do notice that drm_driver.release() in general is technically
> > deprecated at this point (with a suggestion in the drm_drv.h comments to
> > switch to using drmm_add_action(), drmm_kmalloc(), etc. to manage the
> > cleanup of resources).  At some point in the future me may want to
> > rework the i915 cleanup in general according to that guidance.
> > 
> > 
> > Matt
> > 
> > > areas.
> > > 
> > > Flush buffer pools on GT remove as a fix.  On driver release, don't
> > > flush the pools again, just assert that the flush was called and
> > > nothing added more in between.
> > > 
> > > Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
> > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > ---
> > > Resending with Cc: dri-devel@lists.freedesktop.org as requested, and a
> > > typo in commit description fixed.
> > > 
> > > Thanks,
> > > Janusz
> > > 
> > >  drivers/gpu/drm/i915/gt/intel_gt.c             | 2 ++
> > >  drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c | 2 --
> > >  2 files changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c
> > > index 62d40c986642..8f322a4ecd87 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_gt.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_gt.c
> > > @@ -737,6 +737,8 @@ void intel_gt_driver_remove(struct intel_gt *gt)
> > >  	intel_uc_driver_remove(&gt->uc);
> > >  
> > >  	intel_engines_release(gt);
> > > +
> > > +	intel_gt_flush_buffer_pool(gt);
> > >  }
> > >  
> > >  void intel_gt_driver_unregister(struct intel_gt *gt)
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c b/drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c
> > > index aa0a59c5b614..acc49c56a9f3 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c
> > > @@ -245,8 +245,6 @@ void intel_gt_fini_buffer_pool(struct intel_gt *gt)
> > >  	struct intel_gt_buffer_pool *pool = &gt->buffer_pool;
> > >  	int n;
> > >  
> > > -	intel_gt_flush_buffer_pool(gt);
> > > -
> > >  	for (n = 0; n < ARRAY_SIZE(pool->cache_list); n++)
> > >  		GEM_BUG_ON(!list_empty(&pool->cache_list[n]));
> > >  }
> > 
> > 
> 
> 
> 
>
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c
index 62d40c986642..8f322a4ecd87 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt.c
@@ -737,6 +737,8 @@  void intel_gt_driver_remove(struct intel_gt *gt)
 	intel_uc_driver_remove(&gt->uc);
 
 	intel_engines_release(gt);
+
+	intel_gt_flush_buffer_pool(gt);
 }
 
 void intel_gt_driver_unregister(struct intel_gt *gt)
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c b/drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c
index aa0a59c5b614..acc49c56a9f3 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c
@@ -245,8 +245,6 @@  void intel_gt_fini_buffer_pool(struct intel_gt *gt)
 	struct intel_gt_buffer_pool *pool = &gt->buffer_pool;
 	int n;
 
-	intel_gt_flush_buffer_pool(gt);
-
 	for (n = 0; n < ARRAY_SIZE(pool->cache_list); n++)
 		GEM_BUG_ON(!list_empty(&pool->cache_list[n]));
 }