diff mbox series

drm/sched: Check scheduler work queue before calling timeout handling

Message ID 20230510135111.58631-1-vitaly.prosyak@amd.com (mailing list archive)
State New, archived
Headers show
Series drm/sched: Check scheduler work queue before calling timeout handling | expand

Commit Message

Vitaly Prosyak May 10, 2023, 1:51 p.m. UTC
From: Vitaly Prosyak <vitaly.prosyak@amd.com>

During an IGT GPU reset test we see again oops despite of
commit 0c8c901aaaebc9 (drm/sched: Check scheduler ready before calling
timeout handling).

It uses ready condition whether to call drm_sched_fault which unwind
the TDR leads to GPU reset.
However it looks the ready condition is overloaded with other meanings,
for example, for the following stack is related GPU reset :

0  gfx_v9_0_cp_gfx_start
1  gfx_v9_0_cp_gfx_resume
2  gfx_v9_0_cp_resume
3  gfx_v9_0_hw_init
4  gfx_v9_0_resume
5  amdgpu_device_ip_resume_phase2

does the following:
	/* start the ring */
	gfx_v9_0_cp_gfx_start(adev);
	ring->sched.ready = true;

The same approach is for other ASICs as well :
gfx_v8_0_cp_gfx_resume
gfx_v10_0_kiq_resume, etc...

As a result, our GPU reset test causes GPU fault which calls unconditionally gfx_v9_0_fault
and then drm_sched_fault. However now it depends on whether the interrupt service routine
drm_sched_fault is executed after gfx_v9_0_cp_gfx_start is completed which sets the ready
field of the scheduler to true even  for uninitialized schedulers and causes oops vs
no fault or when ISR  drm_sched_fault is completed prior  gfx_v9_0_cp_gfx_start and
NULL pointer dereference does not occur.

Use the field timeout_wq  to prevent oops for uninitialized schedulers.
The field could be initialized by the work queue of resetting the domain.

Fixes: 0c8c901aaaebc9 ("drm/sched: Check scheduler ready before calling timeout handling")

v1: Corrections to commit message (Luben)
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Luben Tuikov May 10, 2023, 2:19 p.m. UTC | #1
On 2023-05-10 09:51, vitaly.prosyak@amd.com wrote:
> From: Vitaly Prosyak <vitaly.prosyak@amd.com>
> 
> During an IGT GPU reset test we see again oops despite of
> commit 0c8c901aaaebc9 (drm/sched: Check scheduler ready before calling
> timeout handling).
> 
> It uses ready condition whether to call drm_sched_fault which unwind
> the TDR leads to GPU reset.
> However it looks the ready condition is overloaded with other meanings,
> for example, for the following stack is related GPU reset :
> 
> 0  gfx_v9_0_cp_gfx_start
> 1  gfx_v9_0_cp_gfx_resume
> 2  gfx_v9_0_cp_resume
> 3  gfx_v9_0_hw_init
> 4  gfx_v9_0_resume
> 5  amdgpu_device_ip_resume_phase2
> 
> does the following:
> 	/* start the ring */
> 	gfx_v9_0_cp_gfx_start(adev);
> 	ring->sched.ready = true;
> 
> The same approach is for other ASICs as well :
> gfx_v8_0_cp_gfx_resume
> gfx_v10_0_kiq_resume, etc...
> 
> As a result, our GPU reset test causes GPU fault which calls unconditionally gfx_v9_0_fault
> and then drm_sched_fault. However now it depends on whether the interrupt service routine
> drm_sched_fault is executed after gfx_v9_0_cp_gfx_start is completed which sets the ready
> field of the scheduler to true even  for uninitialized schedulers and causes oops vs
> no fault or when ISR  drm_sched_fault is completed prior  gfx_v9_0_cp_gfx_start and
> NULL pointer dereference does not occur.
> 
> Use the field timeout_wq  to prevent oops for uninitialized schedulers.
> The field could be initialized by the work queue of resetting the domain.
> 
> Fixes: 0c8c901aaaebc9 ("drm/sched: Check scheduler ready before calling timeout handling")
> 
> v1: Corrections to commit message (Luben)
> Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>

I didn't give my RB to this patch so I'm not sure what it is doing here.

The fixes tag should be before the SOB tag, and the v1 line should be separated
by a line before the Git tags.

Since this is a good patch and I want it in both drm-misc-fixed and amd-staging-drm-next,
I'll submit it to drm-misc-fixed with a Link: and RB/SOB tag there and then cherry-pick
that into amd-staging-drm-next.

Don't push it to amd-staging-drm-next.

I'll fix this and submit to amd-staging-drm-next and to drm-misc-fixed with
a Link: tag.

Regards,
Luben


> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 649fac2e1ccb..670b7997f389 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -308,7 +308,7 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
>   */
>  void drm_sched_fault(struct drm_gpu_scheduler *sched)
>  {
> -	if (sched->ready)
> +	if (sched->timeout_wq)
>  		mod_delayed_work(sched->timeout_wq, &sched->work_tdr, 0);
>  }
>  EXPORT_SYMBOL(drm_sched_fault);
vitaly prosyak May 10, 2023, 2:24 p.m. UTC | #2
On 2023-05-10 10:19, Luben Tuikov wrote:
> On 2023-05-10 09:51, vitaly.prosyak@amd.com wrote:
>> From: Vitaly Prosyak <vitaly.prosyak@amd.com>
>>
>> During an IGT GPU reset test we see again oops despite of
>> commit 0c8c901aaaebc9 (drm/sched: Check scheduler ready before calling
>> timeout handling).
>>
>> It uses ready condition whether to call drm_sched_fault which unwind
>> the TDR leads to GPU reset.
>> However it looks the ready condition is overloaded with other meanings,
>> for example, for the following stack is related GPU reset :
>>
>> 0  gfx_v9_0_cp_gfx_start
>> 1  gfx_v9_0_cp_gfx_resume
>> 2  gfx_v9_0_cp_resume
>> 3  gfx_v9_0_hw_init
>> 4  gfx_v9_0_resume
>> 5  amdgpu_device_ip_resume_phase2
>>
>> does the following:
>> 	/* start the ring */
>> 	gfx_v9_0_cp_gfx_start(adev);
>> 	ring->sched.ready = true;
>>
>> The same approach is for other ASICs as well :
>> gfx_v8_0_cp_gfx_resume
>> gfx_v10_0_kiq_resume, etc...
>>
>> As a result, our GPU reset test causes GPU fault which calls unconditionally gfx_v9_0_fault
>> and then drm_sched_fault. However now it depends on whether the interrupt service routine
>> drm_sched_fault is executed after gfx_v9_0_cp_gfx_start is completed which sets the ready
>> field of the scheduler to true even  for uninitialized schedulers and causes oops vs
>> no fault or when ISR  drm_sched_fault is completed prior  gfx_v9_0_cp_gfx_start and
>> NULL pointer dereference does not occur.
>>
>> Use the field timeout_wq  to prevent oops for uninitialized schedulers.
>> The field could be initialized by the work queue of resetting the domain.
>>
>> Fixes: 0c8c901aaaebc9 ("drm/sched: Check scheduler ready before calling timeout handling")
>>
>> v1: Corrections to commit message (Luben)
>> Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
>> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
> I didn't give my RB to this patch so I'm not sure what it is doing here.
I removed your rb, also if you do not know what is doing here why do you want to push this to amd-staging-drm-next and to  drm-misc-fixed?
>
> The fixes tag should be before the SOB tag, and the v1 line should be separated
> by a line before the Git tags.
>
> Since this is a good patch and I want it in both drm-misc-fixed and amd-staging-drm-next,
> I'll submit it to drm-misc-fixed with a Link: and RB/SOB tag there and then cherry-pick
> that into amd-staging-drm-next.
>
> Don't push it to amd-staging-drm-next.
>
> I'll fix this and submit to amd-staging-drm-next and to drm-misc-fixed with
> a Link: tag.
>
> Regards,
> Luben
>
>
>> ---
>>  drivers/gpu/drm/scheduler/sched_main.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index 649fac2e1ccb..670b7997f389 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -308,7 +308,7 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
>>   */
>>  void drm_sched_fault(struct drm_gpu_scheduler *sched)
>>  {
>> -	if (sched->ready)
>> +	if (sched->timeout_wq)
>>  		mod_delayed_work(sched->timeout_wq, &sched->work_tdr, 0);
>>  }
>>  EXPORT_SYMBOL(drm_sched_fault);
Luben Tuikov May 10, 2023, 2:31 p.m. UTC | #3
On 2023-05-10 10:24, vitaly prosyak wrote:
> 
> On 2023-05-10 10:19, Luben Tuikov wrote:
>> On 2023-05-10 09:51, vitaly.prosyak@amd.com wrote:
>>> From: Vitaly Prosyak <vitaly.prosyak@amd.com>
>>>
>>> During an IGT GPU reset test we see again oops despite of
>>> commit 0c8c901aaaebc9 (drm/sched: Check scheduler ready before calling
>>> timeout handling).
>>>
>>> It uses ready condition whether to call drm_sched_fault which unwind
>>> the TDR leads to GPU reset.
>>> However it looks the ready condition is overloaded with other meanings,
>>> for example, for the following stack is related GPU reset :
>>>
>>> 0  gfx_v9_0_cp_gfx_start
>>> 1  gfx_v9_0_cp_gfx_resume
>>> 2  gfx_v9_0_cp_resume
>>> 3  gfx_v9_0_hw_init
>>> 4  gfx_v9_0_resume
>>> 5  amdgpu_device_ip_resume_phase2
>>>
>>> does the following:
>>> 	/* start the ring */
>>> 	gfx_v9_0_cp_gfx_start(adev);
>>> 	ring->sched.ready = true;
>>>
>>> The same approach is for other ASICs as well :
>>> gfx_v8_0_cp_gfx_resume
>>> gfx_v10_0_kiq_resume, etc...
>>>
>>> As a result, our GPU reset test causes GPU fault which calls unconditionally gfx_v9_0_fault
>>> and then drm_sched_fault. However now it depends on whether the interrupt service routine
>>> drm_sched_fault is executed after gfx_v9_0_cp_gfx_start is completed which sets the ready
>>> field of the scheduler to true even  for uninitialized schedulers and causes oops vs
>>> no fault or when ISR  drm_sched_fault is completed prior  gfx_v9_0_cp_gfx_start and
>>> NULL pointer dereference does not occur.
>>>
>>> Use the field timeout_wq  to prevent oops for uninitialized schedulers.
>>> The field could be initialized by the work queue of resetting the domain.
>>>
>>> Fixes: 0c8c901aaaebc9 ("drm/sched: Check scheduler ready before calling timeout handling")
>>>
>>> v1: Corrections to commit message (Luben)
>>> Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
>>> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
>> I didn't give my RB to this patch so I'm not sure what it is doing here.
> I removed your rb, also if you do not know what is doing here why do you want to push this to amd-staging-drm-next and to  drm-misc-fixed?

I'll add my RB as I push it to those two branches.
I'll also add a Link tag and fix the commit SHA for the Fixes tag to
one which is found in drm-misc-fixes.

Thanks for the patch fixing this long-standing bug.

Regards,
Luben


>>
>> The fixes tag should be before the SOB tag, and the v1 line should be separated
>> by a line before the Git tags.
>>
>> Since this is a good patch and I want it in both drm-misc-fixed and amd-staging-drm-next,
>> I'll submit it to drm-misc-fixed with a Link: and RB/SOB tag there and then cherry-pick
>> that into amd-staging-drm-next.
>>
>> Don't push it to amd-staging-drm-next.
>>
>> I'll fix this and submit to amd-staging-drm-next and to drm-misc-fixed with
>> a Link: tag.
>>
>> Regards,
>> Luben
>>
>>
>>> ---
>>>  drivers/gpu/drm/scheduler/sched_main.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 649fac2e1ccb..670b7997f389 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -308,7 +308,7 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
>>>   */
>>>  void drm_sched_fault(struct drm_gpu_scheduler *sched)
>>>  {
>>> -	if (sched->ready)
>>> +	if (sched->timeout_wq)
>>>  		mod_delayed_work(sched->timeout_wq, &sched->work_tdr, 0);
>>>  }
>>>  EXPORT_SYMBOL(drm_sched_fault);
diff mbox series

Patch

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 649fac2e1ccb..670b7997f389 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -308,7 +308,7 @@  static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
  */
 void drm_sched_fault(struct drm_gpu_scheduler *sched)
 {
-	if (sched->ready)
+	if (sched->timeout_wq)
 		mod_delayed_work(sched->timeout_wq, &sched->work_tdr, 0);
 }
 EXPORT_SYMBOL(drm_sched_fault);