Message ID | 20210903142320.216705-1-janusz.krzysztofik@linux.intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RESEND] drm/i915: Flush buffer pools on driver remove | expand |
On Fri, Sep 03, 2021 at 04:23:20PM +0200, Janusz Krzysztofik wrote: > In preparation for clean driver release, attempts to drain work queues > and release freed objects are taken at driver remove time. However, GT > buffer pools are now not flushed before the driver release phase. > Since unused objects may stay there for up to one second, some may > survive until driver release is attempted. That can potentially > explain sporadic then hardly reproducible issues observed at driver > release time, like non-zero shrink counter or outstanding address space So just to make sure I'm understanding the description here: - We currently do an explicit flush of the buffer pools within the call path of drm_driver.release(); this removes all buffers, regardless of their age. - However there may be other code that runs *earlier* within the drm_driver.release() call chain that expects buffer pools have already been flushed and are already empty. - Since buffer pools auto-flush old buffers once per second in a worker thread, there's a small window where if we remove the driver while there are still buffers with an age of less than one second, the assumptions of the other release code may be violated. So by moving the flush to driver remove (which executes earlier via the pci_driver.remove() flow) you're ensuring that all buffers are flushed before _any_ code in drm_driver.release() executes. I found the wording of the commit message here somewhat confusing since it's talking about flushes we do in driver release, but mentions problems that arise during driver release due to lack of flushing. You might want to reword the commit message somewhat to help clarify. Otherwise, the code change itself looks reasonable to me. BTW, I do notice that drm_driver.release() in general is technically deprecated at this point (with a suggestion in the drm_drv.h comments to switch to using drmm_add_action(), drmm_kmalloc(), etc. to manage the cleanup of resources). At some point in the future me may want to rework the i915 cleanup in general according to that guidance. Matt > areas. > > Flush buffer pools on GT remove as a fix. On driver release, don't > flush the pools again, just assert that the flush was called and > nothing added more in between. > > Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> > Cc: Chris Wilson <chris@chris-wilson.co.uk> > --- > Resending with Cc: dri-devel@lists.freedesktop.org as requested, and a > typo in commit description fixed. > > Thanks, > Janusz > > drivers/gpu/drm/i915/gt/intel_gt.c | 2 ++ > drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c | 2 -- > 2 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c > index 62d40c986642..8f322a4ecd87 100644 > --- a/drivers/gpu/drm/i915/gt/intel_gt.c > +++ b/drivers/gpu/drm/i915/gt/intel_gt.c > @@ -737,6 +737,8 @@ void intel_gt_driver_remove(struct intel_gt *gt) > intel_uc_driver_remove(>->uc); > > intel_engines_release(gt); > + > + intel_gt_flush_buffer_pool(gt); > } > > void intel_gt_driver_unregister(struct intel_gt *gt) > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c b/drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c > index aa0a59c5b614..acc49c56a9f3 100644 > --- a/drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c > +++ b/drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c > @@ -245,8 +245,6 @@ void intel_gt_fini_buffer_pool(struct intel_gt *gt) > struct intel_gt_buffer_pool *pool = >->buffer_pool; > int n; > > - intel_gt_flush_buffer_pool(gt); > - > for (n = 0; n < ARRAY_SIZE(pool->cache_list); n++) > GEM_BUG_ON(!list_empty(&pool->cache_list[n])); > } > -- > 2.25.1 >
Hi Matt, Thanks for review. On czwartek, 23 września 2021 00:24:29 CEST Matt Roper wrote: > On Fri, Sep 03, 2021 at 04:23:20PM +0200, Janusz Krzysztofik wrote: > > In preparation for clean driver release, attempts to drain work queues > > and release freed objects are taken at driver remove time. However, GT > > buffer pools are now not flushed before the driver release phase. > > Since unused objects may stay there for up to one second, some may > > survive until driver release is attempted. That can potentially > > explain sporadic then hardly reproducible issues observed at driver > > release time, like non-zero shrink counter or outstanding address space > > So just to make sure I'm understanding the description here: > - We currently do an explicit flush of the buffer pools within the call > path of drm_driver.release(); this removes all buffers, regardless of > their age. And also triggers release of the buffers' underlying resources (objects, address space areas). > - However there may be other code that runs *earlier* within the > drm_driver.release() call chain Yes, within the drm_driver.release() call chain, but not necessarily earlier -- that's irrelevant, I believe, ... > that expects buffer pools have > already been flushed and are already empty. ... since that other code expects not just buffer pools but resource categories they consume (objects, address space areas) to be flushed already, while some may still be busy with old buffers not auto-flushed yet. > - Since buffer pools auto-flush old buffers once per second in a worker > thread, there's a small window where if we remove the driver while > there are still buffers with an age of less than one second, the > assumptions of the other release code may be violated. Correct. > So by moving the flush to driver remove (which executes earlier via the > pci_driver.remove() flow) you're ensuring that all buffers are flushed > before _any_ code in drm_driver.release() executes. And also flushed before some other code in pci_driver.remove() flushes those other resource categories released on buffer pools flush, then completeness of all those flushes is checked in drm_driver.release(). May I copy-paste some of you wording while rephrasing my commit description? Thanks, Janusz > > I found the wording of the commit message here somewhat confusing since > it's talking about flushes we do in driver release, but mentions > problems that arise during driver release due to lack of flushing. You > might want to reword the commit message somewhat to help clarify. > Otherwise, the code change itself looks reasonable to me. > > BTW, I do notice that drm_driver.release() in general is technically > deprecated at this point (with a suggestion in the drm_drv.h comments to > switch to using drmm_add_action(), drmm_kmalloc(), etc. to manage the > cleanup of resources). At some point in the future me may want to > rework the i915 cleanup in general according to that guidance. > > > Matt > > > areas. > > > > Flush buffer pools on GT remove as a fix. On driver release, don't > > flush the pools again, just assert that the flush was called and > > nothing added more in between. > > > > Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> > > Cc: Chris Wilson <chris@chris-wilson.co.uk> > > --- > > Resending with Cc: dri-devel@lists.freedesktop.org as requested, and a > > typo in commit description fixed. > > > > Thanks, > > Janusz > > > > drivers/gpu/drm/i915/gt/intel_gt.c | 2 ++ > > drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c | 2 -- > > 2 files changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c > > index 62d40c986642..8f322a4ecd87 100644 > > --- a/drivers/gpu/drm/i915/gt/intel_gt.c > > +++ b/drivers/gpu/drm/i915/gt/intel_gt.c > > @@ -737,6 +737,8 @@ void intel_gt_driver_remove(struct intel_gt *gt) > > intel_uc_driver_remove(>->uc); > > > > intel_engines_release(gt); > > + > > + intel_gt_flush_buffer_pool(gt); > > } > > > > void intel_gt_driver_unregister(struct intel_gt *gt) > > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c b/drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c > > index aa0a59c5b614..acc49c56a9f3 100644 > > --- a/drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c > > +++ b/drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c > > @@ -245,8 +245,6 @@ void intel_gt_fini_buffer_pool(struct intel_gt *gt) > > struct intel_gt_buffer_pool *pool = >->buffer_pool; > > int n; > > > > - intel_gt_flush_buffer_pool(gt); > > - > > for (n = 0; n < ARRAY_SIZE(pool->cache_list); n++) > > GEM_BUG_ON(!list_empty(&pool->cache_list[n])); > > } > >
On Thu, Sep 23, 2021 at 03:07:06PM +0200, Janusz Krzysztofik wrote: > Hi Matt, > > Thanks for review. > > On czwartek, 23 września 2021 00:24:29 CEST Matt Roper wrote: > > On Fri, Sep 03, 2021 at 04:23:20PM +0200, Janusz Krzysztofik wrote: > > > In preparation for clean driver release, attempts to drain work queues > > > and release freed objects are taken at driver remove time. However, GT > > > buffer pools are now not flushed before the driver release phase. > > > Since unused objects may stay there for up to one second, some may > > > survive until driver release is attempted. That can potentially > > > explain sporadic then hardly reproducible issues observed at driver > > > release time, like non-zero shrink counter or outstanding address space > > > > So just to make sure I'm understanding the description here: > > - We currently do an explicit flush of the buffer pools within the call > > path of drm_driver.release(); this removes all buffers, regardless of > > their age. > > And also triggers release of the buffers' underlying resources (objects, > address space areas). > > > - However there may be other code that runs *earlier* within the > > drm_driver.release() call chain > > Yes, within the drm_driver.release() call chain, but not necessarily earlier > -- that's irrelevant, I believe, ... > > > that expects buffer pools have > > already been flushed and are already empty. > > ... since that other code expects not just buffer pools but resource > categories they consume (objects, address space areas) to be flushed already, > while some may still be busy with old buffers not auto-flushed yet. > > > - Since buffer pools auto-flush old buffers once per second in a worker > > thread, there's a small window where if we remove the driver while > > there are still buffers with an age of less than one second, the > > assumptions of the other release code may be violated. > > Correct. > > > So by moving the flush to driver remove (which executes earlier via the > > pci_driver.remove() flow) you're ensuring that all buffers are flushed > > before _any_ code in drm_driver.release() executes. > > And also flushed before some other code in pci_driver.remove() flushes those > other resource categories released on buffer pools flush, then completeness of > all those flushes is checked in drm_driver.release(). > > May I copy-paste some of you wording while rephrasing my commit description? Sure go ahead. Thanks. Matt > > Thanks, > Janusz > > > > > I found the wording of the commit message here somewhat confusing since > > it's talking about flushes we do in driver release, but mentions > > problems that arise during driver release due to lack of flushing. You > > might want to reword the commit message somewhat to help clarify. > > Otherwise, the code change itself looks reasonable to me. > > > > BTW, I do notice that drm_driver.release() in general is technically > > deprecated at this point (with a suggestion in the drm_drv.h comments to > > switch to using drmm_add_action(), drmm_kmalloc(), etc. to manage the > > cleanup of resources). At some point in the future me may want to > > rework the i915 cleanup in general according to that guidance. > > > > > > Matt > > > > > areas. > > > > > > Flush buffer pools on GT remove as a fix. On driver release, don't > > > flush the pools again, just assert that the flush was called and > > > nothing added more in between. > > > > > > Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> > > > Cc: Chris Wilson <chris@chris-wilson.co.uk> > > > --- > > > Resending with Cc: dri-devel@lists.freedesktop.org as requested, and a > > > typo in commit description fixed. > > > > > > Thanks, > > > Janusz > > > > > > drivers/gpu/drm/i915/gt/intel_gt.c | 2 ++ > > > drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c | 2 -- > > > 2 files changed, 2 insertions(+), 2 deletions(-) > > > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c > > > index 62d40c986642..8f322a4ecd87 100644 > > > --- a/drivers/gpu/drm/i915/gt/intel_gt.c > > > +++ b/drivers/gpu/drm/i915/gt/intel_gt.c > > > @@ -737,6 +737,8 @@ void intel_gt_driver_remove(struct intel_gt *gt) > > > intel_uc_driver_remove(>->uc); > > > > > > intel_engines_release(gt); > > > + > > > + intel_gt_flush_buffer_pool(gt); > > > } > > > > > > void intel_gt_driver_unregister(struct intel_gt *gt) > > > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c b/drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c > > > index aa0a59c5b614..acc49c56a9f3 100644 > > > --- a/drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c > > > +++ b/drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c > > > @@ -245,8 +245,6 @@ void intel_gt_fini_buffer_pool(struct intel_gt *gt) > > > struct intel_gt_buffer_pool *pool = >->buffer_pool; > > > int n; > > > > > > - intel_gt_flush_buffer_pool(gt); > > > - > > > for (n = 0; n < ARRAY_SIZE(pool->cache_list); n++) > > > GEM_BUG_ON(!list_empty(&pool->cache_list[n])); > > > } > > > > > > > >
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c index 62d40c986642..8f322a4ecd87 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt.c +++ b/drivers/gpu/drm/i915/gt/intel_gt.c @@ -737,6 +737,8 @@ void intel_gt_driver_remove(struct intel_gt *gt) intel_uc_driver_remove(>->uc); intel_engines_release(gt); + + intel_gt_flush_buffer_pool(gt); } void intel_gt_driver_unregister(struct intel_gt *gt) diff --git a/drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c b/drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c index aa0a59c5b614..acc49c56a9f3 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c +++ b/drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c @@ -245,8 +245,6 @@ void intel_gt_fini_buffer_pool(struct intel_gt *gt) struct intel_gt_buffer_pool *pool = >->buffer_pool; int n; - intel_gt_flush_buffer_pool(gt); - for (n = 0; n < ARRAY_SIZE(pool->cache_list); n++) GEM_BUG_ON(!list_empty(&pool->cache_list[n])); }
In preparation for clean driver release, attempts to drain work queues and release freed objects are taken at driver remove time. However, GT buffer pools are now not flushed before the driver release phase. Since unused objects may stay there for up to one second, some may survive until driver release is attempted. That can potentially explain sporadic then hardly reproducible issues observed at driver release time, like non-zero shrink counter or outstanding address space areas. Flush buffer pools on GT remove as a fix. On driver release, don't flush the pools again, just assert that the flush was called and nothing added more in between. Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> --- Resending with Cc: dri-devel@lists.freedesktop.org as requested, and a typo in commit description fixed. Thanks, Janusz drivers/gpu/drm/i915/gt/intel_gt.c | 2 ++ drivers/gpu/drm/i915/gt/intel_gt_buffer_pool.c | 2 -- 2 files changed, 2 insertions(+), 2 deletions(-)