Message ID | 20220628191741.28866-1-ashutosh.dixit@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | drm/i915/reset: Handle reset timeouts under unrelated kernel hangs | expand |
On Tue, 28 Jun 2022 12:17:41 -0700, Ashutosh Dixit wrote: > > From: Chris Wilson <chris@chris-wilson.co.uk> > > When resuming after hibernate sometimes we see hangs in unrelated kernel > subsystems. These hangs often result in the following i915 trace: > > i915 0000:00:02.0: [drm] \ > *ERROR* intel_gt_reset_global timed out, cancelling all in-flight rendering. > > implying our reset task has been starved by the hanging kernel subsystem, > causing us to inappropiately declare the system as wedged beyond recovery. > > The trace would be caused by our synchronize_srcu_expedited() taking more > than the allowed 5s due to the unrelated kernel hang. But we neither need > to perform that synchronisation inside the reset watchdog, nor do we need > such a short timeout before declaring the device as unrecoverable. > > Bug: https://gitlab.freedesktop.org/drm/intel/-/issues/3575 > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com> > --- > drivers/gpu/drm/i915/gt/intel_reset.c | 8 ++++---- > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c > index a5338c3fde7a0..e72744f6faedc 100644 > --- a/drivers/gpu/drm/i915/gt/intel_reset.c > +++ b/drivers/gpu/drm/i915/gt/intel_reset.c > @@ -1259,12 +1259,9 @@ static void intel_gt_reset_global(struct intel_gt *gt, > kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event); > > /* Use a watchdog to ensure that our reset completes */ > - intel_wedge_on_timeout(&w, gt, 5 * HZ) { > + intel_wedge_on_timeout(&w, gt, 60 * HZ) { How about we take one step at a time so if we are moving synchronize_srcu_expedited() out of the reset watchdog, we leave the timeout to the previous 5s? With the original timeout restored this patch is: Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com> > intel_display_prepare_reset(gt->i915); > > - /* Flush everyone using a resource about to be clobbered */ > - synchronize_srcu_expedited(>->reset.backoff_srcu); > - > intel_gt_reset(gt, engine_mask, reason); > > intel_display_finish_reset(gt->i915); > @@ -1373,6 +1370,9 @@ void intel_gt_handle_error(struct intel_gt *gt, > } > } > > + /* Flush everyone using a resource about to be clobbered */ > + synchronize_srcu_expedited(>->reset.backoff_srcu); > + > intel_gt_reset_global(gt, engine_mask, msg); > > if (!intel_uc_uses_guc_submission(>->uc)) { > -- > 2.36.1 >
diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c index a5338c3fde7a0..e72744f6faedc 100644 --- a/drivers/gpu/drm/i915/gt/intel_reset.c +++ b/drivers/gpu/drm/i915/gt/intel_reset.c @@ -1259,12 +1259,9 @@ static void intel_gt_reset_global(struct intel_gt *gt, kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event); /* Use a watchdog to ensure that our reset completes */ - intel_wedge_on_timeout(&w, gt, 5 * HZ) { + intel_wedge_on_timeout(&w, gt, 60 * HZ) { intel_display_prepare_reset(gt->i915); - /* Flush everyone using a resource about to be clobbered */ - synchronize_srcu_expedited(>->reset.backoff_srcu); - intel_gt_reset(gt, engine_mask, reason); intel_display_finish_reset(gt->i915); @@ -1373,6 +1370,9 @@ void intel_gt_handle_error(struct intel_gt *gt, } } + /* Flush everyone using a resource about to be clobbered */ + synchronize_srcu_expedited(>->reset.backoff_srcu); + intel_gt_reset_global(gt, engine_mask, msg); if (!intel_uc_uses_guc_submission(>->uc)) {