diff mbox series

[v2,2/2] drm/i915: Fix gt reset with GuC submission is disabled

Message ID 20240422201951.633-2-nirmoy.das@intel.com (mailing list archive)
State New, archived
Headers show
Series [v2,1/2] drm/i915: Refactor confusing __intel_gt_reset() | expand

Commit Message

Nirmoy Das April 22, 2024, 8:19 p.m. UTC
Currently intel_gt_reset() kills the GuC and then resets requested
engines. This is problematic because there is a dedicated CSB FIFO
which only GuC can access and if that FIFO fills up, the hardware
will block on the next context switch until there is space that means
the system is effectively hung. If an engine is reset whilst actively
executing a context, a CSB entry will be sent to say that the context
has gone idle. Thus if reset happens on a very busy system then
killing GuC before killing the engines will lead to deadlock because
of filled up CSB FIFO.

To address this issue, the GuC should be killed only after resetting
the requested engines and before calling intel_gt_init_hw().

v2: Improve commit message(John)

Cc: John Harrison <john.c.harrison@intel.com>
Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_reset.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

Comments

John Harrison April 23, 2024, 12:09 a.m. UTC | #1
On 4/22/2024 13:19, Nirmoy Das wrote:
> Currently intel_gt_reset() kills the GuC and then resets requested
> engines. This is problematic because there is a dedicated CSB FIFO
> which only GuC can access and if that FIFO fills up, the hardware
> will block on the next context switch until there is space that means
> the system is effectively hung. If an engine is reset whilst actively
> executing a context, a CSB entry will be sent to say that the context
> has gone idle. Thus if reset happens on a very busy system then
> killing GuC before killing the engines will lead to deadlock because
> of filled up CSB FIFO.
>
> To address this issue, the GuC should be killed only after resetting
> the requested engines and before calling intel_gt_init_hw().
>
> v2: Improve commit message(John)
>
> Cc: John Harrison <john.c.harrison@intel.com>
> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
Reviewed-by: John Harrison <John.C.Harrison@Intel.com>

> ---
>   drivers/gpu/drm/i915/gt/intel_reset.c | 16 ++++++++++++++--
>   1 file changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
> index b1393863ca9b..6161f7a3ff70 100644
> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> @@ -879,8 +879,17 @@ static intel_engine_mask_t reset_prepare(struct intel_gt *gt)
>   	intel_engine_mask_t awake = 0;
>   	enum intel_engine_id id;
>   
> -	/* For GuC mode, ensure submission is disabled before stopping ring */
> -	intel_uc_reset_prepare(&gt->uc);
> +	/**
> +	 * For GuC mode with submission enabled, ensure submission
> +	 * is disabled before stopping ring.
> +	 *
> +	 * For GuC mode with submission disabled, ensure that GuC is not
> +	 * sanitized, do that after engine reset. reset_prepare()
> +	 * is followed by engine reset which in this mode requires GuC to
> +	 * process any CSB FIFO entries generated by the resets.
> +	 */
> +	if (intel_uc_uses_guc_submission(&gt->uc))
> +		intel_uc_reset_prepare(&gt->uc);
>   
>   	for_each_engine(engine, gt, id) {
>   		if (intel_engine_pm_get_if_awake(engine))
> @@ -1227,6 +1236,9 @@ void intel_gt_reset(struct intel_gt *gt,
>   
>   	intel_overlay_reset(gt->i915);
>   
> +	/* sanitize uC after engine reset */
> +	if (!intel_uc_uses_guc_submission(&gt->uc))
> +		intel_uc_reset_prepare(&gt->uc);
>   	/*
>   	 * Next we need to restore the context, but we don't use those
>   	 * yet either...
Andi Shyti April 23, 2024, 9:32 a.m. UTC | #2
Hi Nirmoy,

On Mon, Apr 22, 2024 at 10:19:51PM +0200, Nirmoy Das wrote:
> Currently intel_gt_reset() kills the GuC and then resets requested
> engines. This is problematic because there is a dedicated CSB FIFO
> which only GuC can access and if that FIFO fills up, the hardware
> will block on the next context switch until there is space that means
> the system is effectively hung. If an engine is reset whilst actively
> executing a context, a CSB entry will be sent to say that the context
> has gone idle. Thus if reset happens on a very busy system then
> killing GuC before killing the engines will lead to deadlock because
> of filled up CSB FIFO.

is this a fix?

> To address this issue, the GuC should be killed only after resetting
> the requested engines and before calling intel_gt_init_hw().
> 
> v2: Improve commit message(John)
> 
> Cc: John Harrison <john.c.harrison@intel.com>
> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_reset.c | 16 ++++++++++++++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
> index b1393863ca9b..6161f7a3ff70 100644
> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> @@ -879,8 +879,17 @@ static intel_engine_mask_t reset_prepare(struct intel_gt *gt)
>  	intel_engine_mask_t awake = 0;
>  	enum intel_engine_id id;
>  
> -	/* For GuC mode, ensure submission is disabled before stopping ring */
> -	intel_uc_reset_prepare(&gt->uc);
> +	/**
> +	 * For GuC mode with submission enabled, ensure submission
> +	 * is disabled before stopping ring.

nit: "stopping *the* ring"

> +	 *
> +	 * For GuC mode with submission disabled, ensure that GuC is not
> +	 * sanitized, do that after engine reset. reset_prepare()
> +	 * is followed by engine reset which in this mode requires GuC to
> +	 * process any CSB FIFO entries generated by the resets.
> +	 */
> +	if (intel_uc_uses_guc_submission(&gt->uc))
> +		intel_uc_reset_prepare(&gt->uc);
>  
>  	for_each_engine(engine, gt, id) {
>  		if (intel_engine_pm_get_if_awake(engine))
> @@ -1227,6 +1236,9 @@ void intel_gt_reset(struct intel_gt *gt,
>  
>  	intel_overlay_reset(gt->i915);
>  
> +	/* sanitize uC after engine reset */
> +	if (!intel_uc_uses_guc_submission(&gt->uc))
> +		intel_uc_reset_prepare(&gt->uc);

Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>

Thanks,
Andi
Nirmoy Das April 23, 2024, 10:45 a.m. UTC | #3
Hi Andi,

On 4/23/2024 11:32 AM, Andi Shyti wrote:
> Hi Nirmoy,
>
> On Mon, Apr 22, 2024 at 10:19:51PM +0200, Nirmoy Das wrote:
>> Currently intel_gt_reset() kills the GuC and then resets requested
>> engines. This is problematic because there is a dedicated CSB FIFO
>> which only GuC can access and if that FIFO fills up, the hardware
>> will block on the next context switch until there is space that means
>> the system is effectively hung. If an engine is reset whilst actively
>> executing a context, a CSB entry will be sent to say that the context
>> has gone idle. Thus if reset happens on a very busy system then
>> killing GuC before killing the engines will lead to deadlock because
>> of filled up CSB FIFO.
> is this a fix?

I went quite far back in the commit logs, and it appears to me that 
we've always been using the current reset flow.

I believe we don't perform a GT reset immediately after sending a number 
of requests, which is what the current failed test is doing.

So, I don't think there will be any visible impact on the user with the 
current flow.

>
>> To address this issue, the GuC should be killed only after resetting
>> the requested engines and before calling intel_gt_init_hw().
>>
>> v2: Improve commit message(John)
>>
>> Cc: John Harrison <john.c.harrison@intel.com>
>> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
>> ---
>>   drivers/gpu/drm/i915/gt/intel_reset.c | 16 ++++++++++++++--
>>   1 file changed, 14 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
>> index b1393863ca9b..6161f7a3ff70 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
>> @@ -879,8 +879,17 @@ static intel_engine_mask_t reset_prepare(struct intel_gt *gt)
>>   	intel_engine_mask_t awake = 0;
>>   	enum intel_engine_id id;
>>   
>> -	/* For GuC mode, ensure submission is disabled before stopping ring */
>> -	intel_uc_reset_prepare(&gt->uc);
>> +	/**
>> +	 * For GuC mode with submission enabled, ensure submission
>> +	 * is disabled before stopping ring.
> nit: "stopping *the* ring"
Will fix it while merging if I don't have to resend this again.
>
>> +	 *
>> +	 * For GuC mode with submission disabled, ensure that GuC is not
>> +	 * sanitized, do that after engine reset. reset_prepare()
>> +	 * is followed by engine reset which in this mode requires GuC to
>> +	 * process any CSB FIFO entries generated by the resets.
>> +	 */
>> +	if (intel_uc_uses_guc_submission(&gt->uc))
>> +		intel_uc_reset_prepare(&gt->uc);
>>   
>>   	for_each_engine(engine, gt, id) {
>>   		if (intel_engine_pm_get_if_awake(engine))
>> @@ -1227,6 +1236,9 @@ void intel_gt_reset(struct intel_gt *gt,
>>   
>>   	intel_overlay_reset(gt->i915);
>>   
>> +	/* sanitize uC after engine reset */
>> +	if (!intel_uc_uses_guc_submission(&gt->uc))
>> +		intel_uc_reset_prepare(&gt->uc);
> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>

Thanks,

Nirmoy

> Thanks,
> Andi
Andi Shyti April 23, 2024, 2:42 p.m. UTC | #4
Hi Nirmoy,

> > > Currently intel_gt_reset() kills the GuC and then resets requested
> > > engines. This is problematic because there is a dedicated CSB FIFO
> > > which only GuC can access and if that FIFO fills up, the hardware
> > > will block on the next context switch until there is space that means
> > > the system is effectively hung. If an engine is reset whilst actively
> > > executing a context, a CSB entry will be sent to say that the context
> > > has gone idle. Thus if reset happens on a very busy system then
> > > killing GuC before killing the engines will lead to deadlock because
> > > of filled up CSB FIFO.
> > is this a fix?
> 
> I went quite far back in the commit logs, and it appears to me that we've
> always been using the current reset flow.
> 
> I believe we don't perform a GT reset immediately after sending a number of
> requests, which is what the current failed test is doing.
> 
> So, I don't think there will be any visible impact on the user with the
> current flow.

Agree... good thinking here... we often abuse on the Fixes tag.

Thanks,
Andi
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
index b1393863ca9b..6161f7a3ff70 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -879,8 +879,17 @@  static intel_engine_mask_t reset_prepare(struct intel_gt *gt)
 	intel_engine_mask_t awake = 0;
 	enum intel_engine_id id;
 
-	/* For GuC mode, ensure submission is disabled before stopping ring */
-	intel_uc_reset_prepare(&gt->uc);
+	/**
+	 * For GuC mode with submission enabled, ensure submission
+	 * is disabled before stopping ring.
+	 *
+	 * For GuC mode with submission disabled, ensure that GuC is not
+	 * sanitized, do that after engine reset. reset_prepare()
+	 * is followed by engine reset which in this mode requires GuC to
+	 * process any CSB FIFO entries generated by the resets.
+	 */
+	if (intel_uc_uses_guc_submission(&gt->uc))
+		intel_uc_reset_prepare(&gt->uc);
 
 	for_each_engine(engine, gt, id) {
 		if (intel_engine_pm_get_if_awake(engine))
@@ -1227,6 +1236,9 @@  void intel_gt_reset(struct intel_gt *gt,
 
 	intel_overlay_reset(gt->i915);
 
+	/* sanitize uC after engine reset */
+	if (!intel_uc_uses_guc_submission(&gt->uc))
+		intel_uc_reset_prepare(&gt->uc);
 	/*
 	 * Next we need to restore the context, but we don't use those
 	 * yet either...