diff mbox series

drm/i915/gt: prepare reset based on reset domain

Message ID 20211209120134.4057906-1-tejaskumarx.surendrakumar.upadhyay@intel.com (mailing list archive)
State New, archived
Headers show
Series drm/i915/gt: prepare reset based on reset domain | expand

Commit Message

Tejas Upadhyay Dec. 9, 2021, 12:01 p.m. UTC
Most code paths does full reset with preparing all
engines for reset except below two :

1. Single engine reset needs to prepare engines for
reset based on its reset domain. In __intel_engine
_reset_bh is a place needs loop over to do engine
prepare for all engines which are in same reset
domain before triggering reset.

2. enable_error_interrupt() in drivers/gpu/drm/i915/
gt/intel_execlists_submission.c needs similar change.

whenever there is full reset done, engine prepare for
all engines are already being called right now before
actual reset triggered, except above two scenario
seeking single engine reset.

Note: Requirement of this change is occurred recently
because whenever engine does reset, all engines in
same reset domain gets reset and in case engine goes
for reset before stopping CS or applying required W/A,
there are high chances of hang/crash. reset_prepare_
engine takes care of it.

Signed-off-by: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_execlists_submission.c |  9 +++++++++
 drivers/gpu/drm/i915/gt/intel_reset.c                | 12 ++++++++++--
 drivers/gpu/drm/i915/gt/intel_reset.h                |  1 +
 3 files changed, 20 insertions(+), 2 deletions(-)

Comments

Tvrtko Ursulin Dec. 16, 2021, 9:52 a.m. UTC | #1
On 09/12/2021 12:01, Tejas Upadhyay wrote:
> Most code paths does full reset with preparing all
> engines for reset except below two :
> 
> 1. Single engine reset needs to prepare engines for
> reset based on its reset domain. In __intel_engine
> _reset_bh is a place needs loop over to do engine
> prepare for all engines which are in same reset
> domain before triggering reset.
> 
> 2. enable_error_interrupt() in drivers/gpu/drm/i915/
> gt/intel_execlists_submission.c needs similar change.
> 
> whenever there is full reset done, engine prepare for
> all engines are already being called right now before
> actual reset triggered, except above two scenario
> seeking single engine reset.
> 
> Note: Requirement of this change is occurred recently
> because whenever engine does reset, all engines in
> same reset domain gets reset and in case engine goes
> for reset before stopping CS or applying required W/A,
> there are high chances of hang/crash. reset_prepare_
> engine takes care of it.

On the trivial level, please wrap your commit messages to standard 75 
wide. See Documentation/process/submitting-patches.rst.

> 
> Signed-off-by: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_execlists_submission.c |  9 +++++++++
>   drivers/gpu/drm/i915/gt/intel_reset.c                | 12 ++++++++++--
>   drivers/gpu/drm/i915/gt/intel_reset.h                |  1 +
>   3 files changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> index a69df5e9e77a..668e7ba5b254 100644
> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> @@ -2806,6 +2806,15 @@ static void enable_error_interrupt(struct intel_engine_cs *engine)
>   		drm_err(&engine->i915->drm,
>   			"engine '%s' resumed still in error: %08x\n",
>   			engine->name, status);
> +		if (engine->reset_domain) {
> +			struct intel_engine_cs *nengine;
> +			enum intel_engine_id id;
> +
> +			for_each_engine(nengine, engine->gt, id)
> +				if (nengine->reset_domain ==
> +				    engine->reset_domain)
> +					reset_prepare_engine(nengine);
> +		}
>   		__intel_gt_reset(engine->gt, engine->mask);
>   	}
>   
> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
> index 63199f0550e6..454d6ab1d9f4 100644
> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> @@ -705,7 +705,7 @@ int intel_reset_guc(struct intel_gt *gt)
>    * Ensure irq handler finishes, and not run again.
>    * Also return the active request so that we only search for it once.
>    */
> -static void reset_prepare_engine(struct intel_engine_cs *engine)
> +void reset_prepare_engine(struct intel_engine_cs *engine)
>   {
>   	/*
>   	 * During the reset sequence, we must prevent the engine from
> @@ -1167,7 +1167,15 @@ int __intel_engine_reset_bh(struct intel_engine_cs *engine, const char *msg)
>   	if (!intel_engine_pm_get_if_awake(engine))
>   		return 0;
>   
> -	reset_prepare_engine(engine);
> +	if (engine->reset_domain) {

Can it be zero?

> +		struct intel_engine_cs *nengine;
> +		enum intel_engine_id id;
> +
> +		for_each_engine(nengine, gt, id)
> +			if (nengine->reset_domain ==
> +			    engine->reset_domain)
> +				reset_prepare_engine(nengine);

Having glanced over the discussion about the workaround you are trying 
to implement a few times, this is what I think.

First of all I think you are definitely missing the engine resume side 
of things. Consider this entry point:

execlists_submission_tasklet
  execlists_reset
   intel_engine_reset
    __intel_engine_reset_bh

AFAICT you can halt more than one engine but you will not resume them 
all. What I mean here is that __intel_engine_reset, intel_engine_resume, 
reset_finish_engine and even the engine pm management end up being 
called for only one engine. Also absence of locking via 
I915_RESET_ENGINE in gt reset flags.

Secondly, the question on whether it is acceptable to corrupt the state 
for the other engine seems not to have a 100% clear answer. Can this be 
confirmed? Because in the ideal world you would really need to preempt 
to idle the other engines from the engine group and only then would be 
able to proceed.

If that is considered a problem for later, with the first step being 
adding of a simpler workaround to prevent a more serious lockup, then it 
may be acceptable with some tweaks.

I suspect there isn't a proper IGT for this otherwise you would have 
noticed the failure to resume the coupled engines? So I think next step 
is to write that IGT and then make sure code is robust.

Sufficient testing will probably drive the implementation.

You may for instance need to lock all engines for reset starting from 
execlists_reset() and you may end up uncovering locking/ordering 
challenges as you go.

Regards,

Tvrtko

> +	}
>   
>   	if (msg)
>   		drm_notice(&engine->i915->drm,
> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.h b/drivers/gpu/drm/i915/gt/intel_reset.h
> index adc734e67387..7abd5d49f0e5 100644
> --- a/drivers/gpu/drm/i915/gt/intel_reset.h
> +++ b/drivers/gpu/drm/i915/gt/intel_reset.h
> @@ -28,6 +28,7 @@ void intel_gt_handle_error(struct intel_gt *gt,
>   			   const char *fmt, ...);
>   #define I915_ERROR_CAPTURE BIT(0)
>   
> +void reset_prepare_engine(struct intel_engine_cs *engine);
>   void intel_gt_reset(struct intel_gt *gt,
>   		    intel_engine_mask_t stalled_mask,
>   		    const char *reason);
>
Rodrigo Vivi Feb. 11, 2022, 11 p.m. UTC | #2
On Thu, Dec 16, 2021 at 09:52:26AM +0000, Tvrtko Ursulin wrote:
> 
> 
> On 09/12/2021 12:01, Tejas Upadhyay wrote:
> > Most code paths does full reset with preparing all
> > engines for reset except below two :
> > 
> > 1. Single engine reset needs to prepare engines for
> > reset based on its reset domain. In __intel_engine
> > _reset_bh is a place needs loop over to do engine
> > prepare for all engines which are in same reset
> > domain before triggering reset.
> > 
> > 2. enable_error_interrupt() in drivers/gpu/drm/i915/
> > gt/intel_execlists_submission.c needs similar change.
> > 
> > whenever there is full reset done, engine prepare for
> > all engines are already being called right now before
> > actual reset triggered, except above two scenario
> > seeking single engine reset.
> > 
> > Note: Requirement of this change is occurred recently
> > because whenever engine does reset, all engines in
> > same reset domain gets reset and in case engine goes
> > for reset before stopping CS or applying required W/A,
> > there are high chances of hang/crash. reset_prepare_
> > engine takes care of it.
> 
> On the trivial level, please wrap your commit messages to standard 75 wide.
> See Documentation/process/submitting-patches.rst.
> 
> > 
> > Signed-off-by: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_execlists_submission.c |  9 +++++++++
> >   drivers/gpu/drm/i915/gt/intel_reset.c                | 12 ++++++++++--
> >   drivers/gpu/drm/i915/gt/intel_reset.h                |  1 +
> >   3 files changed, 20 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > index a69df5e9e77a..668e7ba5b254 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> > @@ -2806,6 +2806,15 @@ static void enable_error_interrupt(struct intel_engine_cs *engine)
> >   		drm_err(&engine->i915->drm,
> >   			"engine '%s' resumed still in error: %08x\n",
> >   			engine->name, status);
> > +		if (engine->reset_domain) {
> > +			struct intel_engine_cs *nengine;
> > +			enum intel_engine_id id;
> > +
> > +			for_each_engine(nengine, engine->gt, id)
> > +				if (nengine->reset_domain ==
> > +				    engine->reset_domain)
> > +					reset_prepare_engine(nengine);
> > +		}
> >   		__intel_gt_reset(engine->gt, engine->mask);
> >   	}
> > diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
> > index 63199f0550e6..454d6ab1d9f4 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> > @@ -705,7 +705,7 @@ int intel_reset_guc(struct intel_gt *gt)
> >    * Ensure irq handler finishes, and not run again.
> >    * Also return the active request so that we only search for it once.
> >    */
> > -static void reset_prepare_engine(struct intel_engine_cs *engine)
> > +void reset_prepare_engine(struct intel_engine_cs *engine)
> >   {
> >   	/*
> >   	 * During the reset sequence, we must prevent the engine from
> > @@ -1167,7 +1167,15 @@ int __intel_engine_reset_bh(struct intel_engine_cs *engine, const char *msg)
> >   	if (!intel_engine_pm_get_if_awake(engine))
> >   		return 0;
> > -	reset_prepare_engine(engine);
> > +	if (engine->reset_domain) {
> 
> Can it be zero?
> 
> > +		struct intel_engine_cs *nengine;
> > +		enum intel_engine_id id;
> > +
> > +		for_each_engine(nengine, gt, id)
> > +			if (nengine->reset_domain ==
> > +			    engine->reset_domain)
> > +				reset_prepare_engine(nengine);
> 
> Having glanced over the discussion about the workaround you are trying to
> implement a few times, this is what I think.
> 
> First of all I think you are definitely missing the engine resume side of
> things. Consider this entry point:
> 
> execlists_submission_tasklet
>  execlists_reset
>   intel_engine_reset
>    __intel_engine_reset_bh
> 
> AFAICT you can halt more than one engine but you will not resume them all.
> What I mean here is that __intel_engine_reset, intel_engine_resume,
> reset_finish_engine and even the engine pm management end up being called
> for only one engine. Also absence of locking via I915_RESET_ENGINE in gt
> reset flags.
> 
> Secondly, the question on whether it is acceptable to corrupt the state for
> the other engine seems not to have a 100% clear answer. Can this be
> confirmed? Because in the ideal world you would really need to preempt to
> idle the other engines from the engine group and only then would be able to
> proceed.

we shouldn't do the reset without checking if it is idle or preempting to idle.
and handle the FORCE_WAKE case as well...

> 
> If that is considered a problem for later, with the first step being adding
> of a simpler workaround to prevent a more serious lockup, then it may be
> acceptable with some tweaks.
> 
> I suspect there isn't a proper IGT for this otherwise you would have noticed
> the failure to resume the coupled engines? So I think next step is to write
> that IGT and then make sure code is robust.
> 
> Sufficient testing will probably drive the implementation.
> 
> You may for instance need to lock all engines for reset starting from
> execlists_reset() and you may end up uncovering locking/ordering challenges
> as you go.

and everything else that Tvrtko said... we cannot go with this as is without
ensuring that we have a good flow with the scheduler.

> 
> Regards,
> 
> Tvrtko
> 
> > +	}
> >   	if (msg)
> >   		drm_notice(&engine->i915->drm,
> > diff --git a/drivers/gpu/drm/i915/gt/intel_reset.h b/drivers/gpu/drm/i915/gt/intel_reset.h
> > index adc734e67387..7abd5d49f0e5 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_reset.h
> > +++ b/drivers/gpu/drm/i915/gt/intel_reset.h
> > @@ -28,6 +28,7 @@ void intel_gt_handle_error(struct intel_gt *gt,
> >   			   const char *fmt, ...);
> >   #define I915_ERROR_CAPTURE BIT(0)
> > +void reset_prepare_engine(struct intel_engine_cs *engine);
> >   void intel_gt_reset(struct intel_gt *gt,
> >   		    intel_engine_mask_t stalled_mask,
> >   		    const char *reason);
> >
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index a69df5e9e77a..668e7ba5b254 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -2806,6 +2806,15 @@  static void enable_error_interrupt(struct intel_engine_cs *engine)
 		drm_err(&engine->i915->drm,
 			"engine '%s' resumed still in error: %08x\n",
 			engine->name, status);
+		if (engine->reset_domain) {
+			struct intel_engine_cs *nengine;
+			enum intel_engine_id id;
+
+			for_each_engine(nengine, engine->gt, id)
+				if (nengine->reset_domain ==
+				    engine->reset_domain)
+					reset_prepare_engine(nengine);
+		}
 		__intel_gt_reset(engine->gt, engine->mask);
 	}
 
diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
index 63199f0550e6..454d6ab1d9f4 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -705,7 +705,7 @@  int intel_reset_guc(struct intel_gt *gt)
  * Ensure irq handler finishes, and not run again.
  * Also return the active request so that we only search for it once.
  */
-static void reset_prepare_engine(struct intel_engine_cs *engine)
+void reset_prepare_engine(struct intel_engine_cs *engine)
 {
 	/*
 	 * During the reset sequence, we must prevent the engine from
@@ -1167,7 +1167,15 @@  int __intel_engine_reset_bh(struct intel_engine_cs *engine, const char *msg)
 	if (!intel_engine_pm_get_if_awake(engine))
 		return 0;
 
-	reset_prepare_engine(engine);
+	if (engine->reset_domain) {
+		struct intel_engine_cs *nengine;
+		enum intel_engine_id id;
+
+		for_each_engine(nengine, gt, id)
+			if (nengine->reset_domain ==
+			    engine->reset_domain)
+				reset_prepare_engine(nengine);
+	}
 
 	if (msg)
 		drm_notice(&engine->i915->drm,
diff --git a/drivers/gpu/drm/i915/gt/intel_reset.h b/drivers/gpu/drm/i915/gt/intel_reset.h
index adc734e67387..7abd5d49f0e5 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.h
+++ b/drivers/gpu/drm/i915/gt/intel_reset.h
@@ -28,6 +28,7 @@  void intel_gt_handle_error(struct intel_gt *gt,
 			   const char *fmt, ...);
 #define I915_ERROR_CAPTURE BIT(0)
 
+void reset_prepare_engine(struct intel_engine_cs *engine);
 void intel_gt_reset(struct intel_gt *gt,
 		    intel_engine_mask_t stalled_mask,
 		    const char *reason);