mbox series

[v1,0/3] Resolve suspend-resume racing with GuC destroy-context-worker

Message ID 20230802233501.17074-1-alan.previn.teres.alexis@intel.com (mailing list archive)
Headers show
Series Resolve suspend-resume racing with GuC destroy-context-worker | expand

Message

Alan Previn Aug. 2, 2023, 11:34 p.m. UTC
This series is the result of debugging issues root caused to
races between the GuC's destroyed_worker_func being triggered vs
repeating suspend-resume cycles with concurrent delayed
fence signals for engine-freeing.

The reproduction steps require that an app is created right before
the start of the suspend cycle where it creates a new gem
context and submits a tiny workload that would complete in the
middle of the suspend cycle. However this app uses dma-buffer
sharing or dma-fence with non-GPU objects or signals that
eventually triggers a FENCE_FREE via__i915_sw_fence_notify that
connects to engines_notify -> free_engines_rcu ->
intel_context_put -> kref_put(&ce->ref..) that queues the
worker after the GuCs CTB has been disabled (i.e. after
i915-gem's suspend-late flows).

This sequence is a corner-case and required repeating this
app->suspend->resume cycle ~1500 times across 4 identical
systems to see it once. That said, based on above callstack,
it is clear that merely flushing the context destruction worker,
which is obviously missing and needed, isn't sufficient.

Because of that, this series adds additional patches besides
the obvious flushing of the worker during the suspend flows.
It includes (1) closing a race between sending the context-
deregistration H2G vs the CTB getting disabled in the midst of it
(by detecing the failure and unrolling the guc-lrc-unpin flow) and
(2) not infinitely waiting in intel_gt_pm_wait_timeout_for_idle
when in the suspend-flow.

Alan Previn (3):
  drm/i915/guc: Flush context destruction worker at suspend
  drm/i915/guc: Close deregister-context race against CT-loss
  drm/i915/gt: Timeout when waiting for idle in suspending

 drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  2 +-
 drivers/gpu/drm/i915/gt/intel_gt_pm.c         | 13 +++++-
 drivers/gpu/drm/i915/gt/intel_gt_pm.h         |  7 ++-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 43 +++++++++++++++++--
 .../gpu/drm/i915/gt/uc/intel_guc_submission.h |  2 +
 drivers/gpu/drm/i915/gt/uc/intel_uc.c         |  7 +++
 drivers/gpu/drm/i915/gt/uc/intel_uc.h         |  1 +
 drivers/gpu/drm/i915/intel_wakeref.c          | 14 ++++--
 drivers/gpu/drm/i915/intel_wakeref.h          |  5 ++-
 9 files changed, 80 insertions(+), 14 deletions(-)


base-commit: 20610a111d61d6945d578360942dc5c7bd828eb5

Comments

Alan Previn Aug. 2, 2023, 11:52 p.m. UTC | #1
On Wed, 2023-08-02 at 16:34 -0700, Teres Alexis, Alan Previn wrote:
> This series is the result of debugging issues root caused to
> races between the GuC's destroyed_worker_func being triggered vs
> repeating suspend-resume cycles with concurrent delayed
> fence signals for engine-freeing.
> 
> The reproduction steps require that an app is created right before
> the start of the suspend cycle where it creates a new gem
> context and submits a tiny workload that would complete in the
> middle of the suspend cycle. However this app uses dma-buffer
> sharing or dma-fence with non-GPU objects or signals that
> eventually triggers a FENCE_FREE via__i915_sw_fence_notify that
> connects to engines_notify -> free_engines_rcu ->
> intel_context_put -> kref_put(&ce->ref..) that queues the
> worker after the GuCs CTB has been disabled (i.e. after
> i915-gem's suspend-late flows).
> 
As an FYI - in offline conversations with John and Daniele, we have agreed
that at least the first two of the patches in this are necessary improvements
but the last patch may remain open as further offline debug is continuing
to pin down the src of the above fence-signal-flow. For now we are hoping
to proceed with reviewing the first two patches and only look into the 3rd
patch if there are system level fence signalling that truly can trigger
this anomaly or if its just a straddling request somewhere within i915
that has appeared or hung at the wrong time which needs to be fixed.

alan:snip