diff mbox series

[3/3] drm/i915: Fix gt reset with GuC submission disabled

Message ID 20240418171055.31371-3-nirmoy.das@intel.com (mailing list archive)
State New, archived
Headers show
Series [1/3] drm/i915: Refactor confusing __intel_gt_reset() | expand

Commit Message

Nirmoy Das April 18, 2024, 5:10 p.m. UTC
Currently intel_gt_reset() happens as follows:

reset_prepare() ---> Sends GDRST to GuC, GuC is in GS_MIA_IN_RESET
do_reset()
  intel_gt_reset_all_engines()
    *_engine_reset_prepare() -->RESET_CTL expects running GuC
    *_reset_engines()
intel_gt_init_hw() --> GuC comes out of GS_MIA_IN_RESET with FW loaded.

Fix the issue by sanitizing the GuC only after resetting requested
engines and before intel_gt_init_hw().

Note intel_uc_reset_finish() and intel_uc_reset() are nop when
guc submission is disabled.

Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_reset.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

Comments

John Harrison April 18, 2024, 11:38 p.m. UTC | #1
On 4/18/2024 10:10, Nirmoy Das wrote:
> Currently intel_gt_reset() happens as follows:
>
> reset_prepare() ---> Sends GDRST to GuC, GuC is in GS_MIA_IN_RESET
> do_reset()
>    intel_gt_reset_all_engines()
>      *_engine_reset_prepare() -->RESET_CTL expects running GuC
Not technically correct. There is no direct connection between RESET_CTL 
and GuC.

>      *_reset_engines()
> intel_gt_init_hw() --> GuC comes out of GS_MIA_IN_RESET with FW loaded.
>
> Fix the issue by sanitizing the GuC only after resetting requested
> engines and before intel_gt_init_hw().
You never actually state what the issue is.

The problem is that there is a dedicated CSB FIFO going to GuC (and 
nothing else has access to it). If that FIFO fills up, the hardware will 
block on the next context switch until there is space. If no-one (i.e. 
GuC) is draining it, that means the system is effectively hung. If an 
engine is reset whilst actively executing a context, a CSB entry will be 
sent to say that the context has gone idle. Thus if you reset a very 
busy system and start with killing GuC before killing the engines and 
only then re-enabling GuC, you run the risk of generating more CSB 
entries than will fit in the FIFO and deadlocking. Whereas, if the 
system is idle then you can reset the engines as much as you like while 
GuC is dead and it won't be a problem.

>
> Note intel_uc_reset_finish() and intel_uc_reset() are nop when
> guc submission is disabled.
>
> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_reset.c | 16 ++++++++++++++--
>   1 file changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
> index 6504e8ba9c58..bd166f5aca4b 100644
> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> @@ -907,8 +907,17 @@ static intel_engine_mask_t reset_prepare(struct intel_gt *gt)
>   	intel_engine_mask_t awake = 0;
>   	enum intel_engine_id id;
>   
> -	/* For GuC mode, ensure submission is disabled before stopping ring */
> -	intel_uc_reset_prepare(&gt->uc);
> +	/**
> +	 * For GuC mode with submission enabled, ensure submission
> +	 * is disabled before stopping ring.
> +	 *
> +	 * For GuC mode with submission disabled, ensure that GuC is not
> +	 * sanitized, do that at the end in reset_finish(). reset_prepare()
> +	 * is followed by engine reset which in this mode requires GuC to
> +	 * be functional to process engine reset events.
-> to process any CSB FIFO entries generated by the resets.

John.

> +	 */
> +	if (intel_uc_uses_guc_submission(&gt->uc))
> +		intel_uc_reset_prepare(&gt->uc);
>   
>   	for_each_engine(engine, gt, id) {
>   		if (intel_engine_pm_get_if_awake(engine))
> @@ -1255,6 +1264,9 @@ void intel_gt_reset(struct intel_gt *gt,
>   
>   	intel_overlay_reset(gt->i915);
>   
> +	/* sanitize uC after engine reset */
> +	if (!intel_uc_uses_guc_submission(&gt->uc))
> +		intel_uc_reset_prepare(&gt->uc);
>   	/*
>   	 * Next we need to restore the context, but we don't use those
>   	 * yet either...
Nirmoy Das April 19, 2024, 9:46 a.m. UTC | #2
Hi John,

On 4/19/2024 1:38 AM, John Harrison wrote:
> On 4/18/2024 10:10, Nirmoy Das wrote:
>> Currently intel_gt_reset() happens as follows:
>>
>> reset_prepare() ---> Sends GDRST to GuC, GuC is in GS_MIA_IN_RESET
>> do_reset()
>>    intel_gt_reset_all_engines()
>>      *_engine_reset_prepare() -->RESET_CTL expects running GuC
> Not technically correct. There is no direct connection between 
> RESET_CTL and GuC.
>
>>      *_reset_engines()
>> intel_gt_init_hw() --> GuC comes out of GS_MIA_IN_RESET with FW loaded.
>>
>> Fix the issue by sanitizing the GuC only after resetting requested
>> engines and before intel_gt_init_hw().
> You never actually state what the issue is.
>
> The problem is that there is a dedicated CSB FIFO going to GuC (and 
> nothing else has access to it). If that FIFO fills up, the hardware 
> will block on the next context switch until there is space. If no-one 
> (i.e. GuC) is draining it, that means the system is effectively hung. 
> If an engine is reset whilst actively executing a context, a CSB entry 
> will be sent to say that the context has gone idle. Thus if you reset 
> a very busy system and start with killing GuC before killing the 
> engines and only then re-enabling GuC, you run the risk of generating 
> more CSB entries than will fit in the FIFO and deadlocking. Whereas, 
> if the system is idle then you can reset the engines as much as you 
> like while GuC is dead and it won't be a problem.


I wasn't sure if I could talk about internal details so kept it to 
minimal. I will borrow above explanation and resend :)

>
>>
>> Note intel_uc_reset_finish() and intel_uc_reset() are nop when
>> guc submission is disabled.
>>
>> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
>> ---
>>   drivers/gpu/drm/i915/gt/intel_reset.c | 16 ++++++++++++++--
>>   1 file changed, 14 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c 
>> b/drivers/gpu/drm/i915/gt/intel_reset.c
>> index 6504e8ba9c58..bd166f5aca4b 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
>> @@ -907,8 +907,17 @@ static intel_engine_mask_t reset_prepare(struct 
>> intel_gt *gt)
>>       intel_engine_mask_t awake = 0;
>>       enum intel_engine_id id;
>>   -    /* For GuC mode, ensure submission is disabled before stopping 
>> ring */
>> -    intel_uc_reset_prepare(&gt->uc);
>> +    /**
>> +     * For GuC mode with submission enabled, ensure submission
>> +     * is disabled before stopping ring.
>> +     *
>> +     * For GuC mode with submission disabled, ensure that GuC is not
>> +     * sanitized, do that at the end in reset_finish(). reset_prepare()
>> +     * is followed by engine reset which in this mode requires GuC to
>> +     * be functional to process engine reset events.
> -> to process any CSB FIFO entries generated by the resets.

I will add this.


Thanks,

Nirmoy

>
> John.
>
>> +     */
>> +    if (intel_uc_uses_guc_submission(&gt->uc))
>> +        intel_uc_reset_prepare(&gt->uc);
>>         for_each_engine(engine, gt, id) {
>>           if (intel_engine_pm_get_if_awake(engine))
>> @@ -1255,6 +1264,9 @@ void intel_gt_reset(struct intel_gt *gt,
>>         intel_overlay_reset(gt->i915);
>>   +    /* sanitize uC after engine reset */
>> +    if (!intel_uc_uses_guc_submission(&gt->uc))
>> +        intel_uc_reset_prepare(&gt->uc);
>>       /*
>>        * Next we need to restore the context, but we don't use those
>>        * yet either...
>
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
index 6504e8ba9c58..bd166f5aca4b 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -907,8 +907,17 @@  static intel_engine_mask_t reset_prepare(struct intel_gt *gt)
 	intel_engine_mask_t awake = 0;
 	enum intel_engine_id id;
 
-	/* For GuC mode, ensure submission is disabled before stopping ring */
-	intel_uc_reset_prepare(&gt->uc);
+	/**
+	 * For GuC mode with submission enabled, ensure submission
+	 * is disabled before stopping ring.
+	 *
+	 * For GuC mode with submission disabled, ensure that GuC is not
+	 * sanitized, do that at the end in reset_finish(). reset_prepare()
+	 * is followed by engine reset which in this mode requires GuC to
+	 * be functional to process engine reset events.
+	 */
+	if (intel_uc_uses_guc_submission(&gt->uc))
+		intel_uc_reset_prepare(&gt->uc);
 
 	for_each_engine(engine, gt, id) {
 		if (intel_engine_pm_get_if_awake(engine))
@@ -1255,6 +1264,9 @@  void intel_gt_reset(struct intel_gt *gt,
 
 	intel_overlay_reset(gt->i915);
 
+	/* sanitize uC after engine reset */
+	if (!intel_uc_uses_guc_submission(&gt->uc))
+		intel_uc_reset_prepare(&gt->uc);
 	/*
 	 * Next we need to restore the context, but we don't use those
 	 * yet either...