Message ID | 20240329235306.1559639-1-John.C.Harrison@Intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | drm/i915/guc: Fix the fix for reset lock confusion | expand |
On 3/30/2024 12:53 AM, John.C.Harrison@Intel.com wrote: > From: John Harrison <John.C.Harrison@Intel.com> > > The previous fix for the circlular lock splat about the busyness > worker wasn't quite complete. Even though the reset-in-progress flag > is cleared at the start of intel_uc_reset_finish, the entire function > is still inside the reset mutex lock. Not sure why the patch appeared > to fix the issue both locally and in CI. However, it is now back > again. > > There is a further complication the wedge code path within > intel_gt_reset() jumps around so much it results in nested > reset_prepare/_finish calls. That is, the call sequence is: > intel_gt_reset > | reset_prepare > | __intel_gt_set_wedged > | | reset_prepare > | | reset_finish > | reset_finish > > The nested finish means that even if the clear of the in-progress flag > was moved to the end of _finish, it would still be clear for the > entire second call. Surprisingly, this does not seem to be causing any > other problems at present. > > As an aside, a wedge on fini does not call the finish functions at > all. The reset_in_progress flag is left set (twice). > > So instead of trying to cancel the worker anywhere at all in the reset > path, just add a cancel to intel_guc_submission_fini instead. Note > that it is not a problem if the worker is still active during a reset. > Either it will run before the reset path starts locking things and > will simply block the reset code for a tiny amount of time. Or it will > run after the locks have been acquired and will early exit due to the > try-lock. > > Also, do not use the reset-in-progress flag to decide whether a > synchronous cancel is safe (from a lockdep perspective) or not. > Instead, use the actual reset mutex state (both the genuine one and > the custom rolled BACKOFF one). > > Fixes: 0e00a8814eec ("drm/i915/guc: Avoid circular locking issue on busyness flush") > Signed-off-by: John Harrison <John.C.Harrison@Intel.com> > Cc: Zhanjun Dong <zhanjun.dong@intel.com> > Cc: John Harrison <John.C.Harrison@Intel.com> > Cc: Andi Shyti <andi.shyti@linux.intel.com> > Cc: Daniel Vetter <daniel@ffwll.ch> > Cc: Daniel Vetter <daniel.vetter@ffwll.ch> > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> > Cc: Nirmoy Das <nirmoy.das@intel.com> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> > Cc: Andrzej Hajda <andrzej.hajda@intel.com> > Cc: Matt Roper <matthew.d.roper@intel.com> > Cc: Jonathan Cavitt <jonathan.cavitt@intel.com> > Cc: Prathap Kumar Valsan <prathap.kumar.valsan@intel.com> > Cc: Alan Previn <alan.previn.teres.alexis@intel.com> > Cc: Madhumitha Tolakanahalli Pradeep <madhumitha.tolakanahalli.pradeep@intel.com> > Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> > Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> > Cc: Dnyaneshwar Bhadane <dnyaneshwar.bhadane@intel.com> Thanks for the details, looks good to me: Reviewed-by: Nirmoy Das <nirmoy.das@intel.com> > --- > .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 23 ++++++++----------- > drivers/gpu/drm/i915/gt/uc/intel_uc.c | 4 ++++ > 2 files changed, 13 insertions(+), 14 deletions(-) > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c > index 16640d6dd0589..00757d6333e88 100644 > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c > @@ -1403,14 +1403,17 @@ static void guc_cancel_busyness_worker(struct intel_guc *guc) > * Trying to pass a 'need_sync' or 'in_reset' flag all the way down through > * every possible call stack is unfeasible. It would be too intrusive to many > * areas that really don't care about the GuC backend. However, there is the > - * 'reset_in_progress' flag available, so just use that. > + * I915_RESET_BACKOFF flag and the gt->reset.mutex can be tested for is_locked. > + * So just use those. Note that testing both is required due to the hideously > + * complex nature of the i915 driver's reset code paths. > * > * And note that in the case of a reset occurring during driver unload > - * (wedge_on_fini), skipping the cancel in _prepare (when the reset flag is set > - * is fine because there is another cancel in _finish (when the reset flag is > - * not). > + * (wedged_on_fini), skipping the cancel in reset_prepare/reset_fini (when the > + * reset flag/mutex are set) is fine because there is another explicit cancel in > + * intel_guc_submission_fini (when the reset flag/mutex are not). > */ > - if (guc_to_gt(guc)->uc.reset_in_progress) > + if (mutex_is_locked(&guc_to_gt(guc)->reset.mutex) || > + test_bit(I915_RESET_BACKOFF, &guc_to_gt(guc)->reset.flags)) > cancel_delayed_work(&guc->timestamp.work); > else > cancel_delayed_work_sync(&guc->timestamp.work); > @@ -1424,8 +1427,6 @@ static void __reset_guc_busyness_stats(struct intel_guc *guc) > unsigned long flags; > ktime_t unused; > > - guc_cancel_busyness_worker(guc); > - > spin_lock_irqsave(&guc->timestamp.lock, flags); > > guc_update_pm_timestamp(guc, &unused); > @@ -2004,13 +2005,6 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc) > > void intel_guc_submission_reset_finish(struct intel_guc *guc) > { > - /* > - * Ensure the busyness worker gets cancelled even on a fatal wedge. > - * Note that reset_prepare is not allowed to because it confuses lockdep. > - */ > - if (guc_submission_initialized(guc)) > - guc_cancel_busyness_worker(guc); > - > /* Reset called during driver load or during wedge? */ > if (unlikely(!guc_submission_initialized(guc) || > !intel_guc_is_fw_running(guc) || > @@ -2136,6 +2130,7 @@ void intel_guc_submission_fini(struct intel_guc *guc) > if (!guc->submission_initialized) > return; > > + guc_fini_engine_stats(guc); > guc_flush_destroyed_contexts(guc); > guc_lrc_desc_pool_destroy_v69(guc); > i915_sched_engine_put(guc->sched_engine); > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc.c b/drivers/gpu/drm/i915/gt/uc/intel_uc.c > index b47051ddf17f2..7a63abf8f644c 100644 > --- a/drivers/gpu/drm/i915/gt/uc/intel_uc.c > +++ b/drivers/gpu/drm/i915/gt/uc/intel_uc.c > @@ -633,6 +633,10 @@ void intel_uc_reset_finish(struct intel_uc *uc) > { > struct intel_guc *guc = &uc->guc; > > + /* > + * NB: The wedge code path results in prepare -> prepare -> finish -> finish. > + * So this function is sometimes called with the in-progress flag not set. > + */ > uc->reset_in_progress = false; > > /* Firmware expected to be running when this function is called */
Hi John, On Fri, Mar 29, 2024 at 04:53:05PM -0700, John.C.Harrison@Intel.com wrote: > From: John Harrison <John.C.Harrison@Intel.com> > > The previous fix for the circlular lock splat about the busyness > worker wasn't quite complete. Even though the reset-in-progress flag > is cleared at the start of intel_uc_reset_finish, the entire function > is still inside the reset mutex lock. Not sure why the patch appeared > to fix the issue both locally and in CI. However, it is now back > again. > > There is a further complication the wedge code path within > intel_gt_reset() jumps around so much it results in nested > reset_prepare/_finish calls. That is, the call sequence is: > intel_gt_reset > | reset_prepare > | __intel_gt_set_wedged > | | reset_prepare > | | reset_finish > | reset_finish > > The nested finish means that even if the clear of the in-progress flag > was moved to the end of _finish, it would still be clear for the > entire second call. Surprisingly, this does not seem to be causing any > other problems at present. > > As an aside, a wedge on fini does not call the finish functions at > all. The reset_in_progress flag is left set (twice). > > So instead of trying to cancel the worker anywhere at all in the reset > path, just add a cancel to intel_guc_submission_fini instead. Note > that it is not a problem if the worker is still active during a reset. > Either it will run before the reset path starts locking things and > will simply block the reset code for a tiny amount of time. Or it will > run after the locks have been acquired and will early exit due to the > try-lock. > > Also, do not use the reset-in-progress flag to decide whether a > synchronous cancel is safe (from a lockdep perspective) or not. > Instead, use the actual reset mutex state (both the genuine one and > the custom rolled BACKOFF one). > > Fixes: 0e00a8814eec ("drm/i915/guc: Avoid circular locking issue on busyness flush") > Signed-off-by: John Harrison <John.C.Harrison@Intel.com> > Cc: Zhanjun Dong <zhanjun.dong@intel.com> > Cc: John Harrison <John.C.Harrison@Intel.com> > Cc: Andi Shyti <andi.shyti@linux.intel.com> > Cc: Daniel Vetter <daniel@ffwll.ch> > Cc: Daniel Vetter <daniel.vetter@ffwll.ch> > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> > Cc: Nirmoy Das <nirmoy.das@intel.com> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> > Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> > Cc: Andrzej Hajda <andrzej.hajda@intel.com> > Cc: Matt Roper <matthew.d.roper@intel.com> > Cc: Jonathan Cavitt <jonathan.cavitt@intel.com> > Cc: Prathap Kumar Valsan <prathap.kumar.valsan@intel.com> > Cc: Alan Previn <alan.previn.teres.alexis@intel.com> > Cc: Madhumitha Tolakanahalli Pradeep <madhumitha.tolakanahalli.pradeep@intel.com> > Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> > Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> > Cc: Dnyaneshwar Bhadane <dnyaneshwar.bhadane@intel.com> Looks good: Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com> Thanks, Andi
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index 16640d6dd0589..00757d6333e88 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -1403,14 +1403,17 @@ static void guc_cancel_busyness_worker(struct intel_guc *guc) * Trying to pass a 'need_sync' or 'in_reset' flag all the way down through * every possible call stack is unfeasible. It would be too intrusive to many * areas that really don't care about the GuC backend. However, there is the - * 'reset_in_progress' flag available, so just use that. + * I915_RESET_BACKOFF flag and the gt->reset.mutex can be tested for is_locked. + * So just use those. Note that testing both is required due to the hideously + * complex nature of the i915 driver's reset code paths. * * And note that in the case of a reset occurring during driver unload - * (wedge_on_fini), skipping the cancel in _prepare (when the reset flag is set - * is fine because there is another cancel in _finish (when the reset flag is - * not). + * (wedged_on_fini), skipping the cancel in reset_prepare/reset_fini (when the + * reset flag/mutex are set) is fine because there is another explicit cancel in + * intel_guc_submission_fini (when the reset flag/mutex are not). */ - if (guc_to_gt(guc)->uc.reset_in_progress) + if (mutex_is_locked(&guc_to_gt(guc)->reset.mutex) || + test_bit(I915_RESET_BACKOFF, &guc_to_gt(guc)->reset.flags)) cancel_delayed_work(&guc->timestamp.work); else cancel_delayed_work_sync(&guc->timestamp.work); @@ -1424,8 +1427,6 @@ static void __reset_guc_busyness_stats(struct intel_guc *guc) unsigned long flags; ktime_t unused; - guc_cancel_busyness_worker(guc); - spin_lock_irqsave(&guc->timestamp.lock, flags); guc_update_pm_timestamp(guc, &unused); @@ -2004,13 +2005,6 @@ void intel_guc_submission_cancel_requests(struct intel_guc *guc) void intel_guc_submission_reset_finish(struct intel_guc *guc) { - /* - * Ensure the busyness worker gets cancelled even on a fatal wedge. - * Note that reset_prepare is not allowed to because it confuses lockdep. - */ - if (guc_submission_initialized(guc)) - guc_cancel_busyness_worker(guc); - /* Reset called during driver load or during wedge? */ if (unlikely(!guc_submission_initialized(guc) || !intel_guc_is_fw_running(guc) || @@ -2136,6 +2130,7 @@ void intel_guc_submission_fini(struct intel_guc *guc) if (!guc->submission_initialized) return; + guc_fini_engine_stats(guc); guc_flush_destroyed_contexts(guc); guc_lrc_desc_pool_destroy_v69(guc); i915_sched_engine_put(guc->sched_engine); diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc.c b/drivers/gpu/drm/i915/gt/uc/intel_uc.c index b47051ddf17f2..7a63abf8f644c 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_uc.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_uc.c @@ -633,6 +633,10 @@ void intel_uc_reset_finish(struct intel_uc *uc) { struct intel_guc *guc = &uc->guc; + /* + * NB: The wedge code path results in prepare -> prepare -> finish -> finish. + * So this function is sometimes called with the in-progress flag not set. + */ uc->reset_in_progress = false; /* Firmware expected to be running when this function is called */