Message ID | 20230510135111.58631-1-vitaly.prosyak@amd.com (mailing list archive) |
---|---|
State | New, archived |
Series | drm/sched: Check scheduler work queue before calling timeout handling |
On 2023-05-10 09:51, vitaly.prosyak@amd.com wrote:
> From: Vitaly Prosyak <vitaly.prosyak@amd.com>
>
> During an IGT GPU reset test we see again oops despite of
> commit 0c8c901aaaebc9 (drm/sched: Check scheduler ready before calling
> timeout handling).
>
> It uses ready condition whether to call drm_sched_fault which unwind
> the TDR leads to GPU reset.
> However it looks the ready condition is overloaded with other meanings,
> for example, for the following stack is related GPU reset :
>
> 0  gfx_v9_0_cp_gfx_start
> 1  gfx_v9_0_cp_gfx_resume
> 2  gfx_v9_0_cp_resume
> 3  gfx_v9_0_hw_init
> 4  gfx_v9_0_resume
> 5  amdgpu_device_ip_resume_phase2
>
> does the following:
>         /* start the ring */
>         gfx_v9_0_cp_gfx_start(adev);
>         ring->sched.ready = true;
>
> The same approach is for other ASICs as well :
> gfx_v8_0_cp_gfx_resume
> gfx_v10_0_kiq_resume, etc...
>
> As a result, our GPU reset test causes GPU fault which calls unconditionally gfx_v9_0_fault
> and then drm_sched_fault. However now it depends on whether the interrupt service routine
> drm_sched_fault is executed after gfx_v9_0_cp_gfx_start is completed which sets the ready
> field of the scheduler to true even for uninitialized schedulers and causes oops vs
> no fault or when ISR drm_sched_fault is completed prior gfx_v9_0_cp_gfx_start and
> NULL pointer dereference does not occur.
>
> Use the field timeout_wq to prevent oops for uninitialized schedulers.
> The field could be initialized by the work queue of resetting the domain.
>
> Fixes: 0c8c901aaaebc9 ("drm/sched: Check scheduler ready before calling timeout handling")
>
> v1: Corrections to commit message (Luben)
> Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>

I didn't give my RB to this patch so I'm not sure what it is doing here.

The fixes tag should be before the SOB tag, and the v1 line should be separated
by a line before the Git tags.

Since this is a good patch and I want it in both drm-misc-fixed and amd-staging-drm-next,
I'll submit it to drm-misc-fixed with a Link: and RB/SOB tag there and then cherry-pick
that into amd-staging-drm-next.

Don't push it to amd-staging-drm-next.

I'll fix this and submit to amd-staging-drm-next and to drm-misc-fixed with
a Link: tag.

Regards,
Luben

> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 649fac2e1ccb..670b7997f389 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -308,7 +308,7 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
>   */
>  void drm_sched_fault(struct drm_gpu_scheduler *sched)
>  {
> -        if (sched->ready)
> +        if (sched->timeout_wq)
>                  mod_delayed_work(sched->timeout_wq, &sched->work_tdr, 0);
>  }
>  EXPORT_SYMBOL(drm_sched_fault);

On 2023-05-10 10:19, Luben Tuikov wrote:
> On 2023-05-10 09:51, vitaly.prosyak@amd.com wrote:
>> From: Vitaly Prosyak <vitaly.prosyak@amd.com>
>>
>> During an IGT GPU reset test we see again oops despite of
>> commit 0c8c901aaaebc9 (drm/sched: Check scheduler ready before calling
>> timeout handling).
>>
>> It uses ready condition whether to call drm_sched_fault which unwind
>> the TDR leads to GPU reset.
>> However it looks the ready condition is overloaded with other meanings,
>> for example, for the following stack is related GPU reset :
>>
>> 0  gfx_v9_0_cp_gfx_start
>> 1  gfx_v9_0_cp_gfx_resume
>> 2  gfx_v9_0_cp_resume
>> 3  gfx_v9_0_hw_init
>> 4  gfx_v9_0_resume
>> 5  amdgpu_device_ip_resume_phase2
>>
>> does the following:
>>         /* start the ring */
>>         gfx_v9_0_cp_gfx_start(adev);
>>         ring->sched.ready = true;
>>
>> The same approach is for other ASICs as well :
>> gfx_v8_0_cp_gfx_resume
>> gfx_v10_0_kiq_resume, etc...
>>
>> As a result, our GPU reset test causes GPU fault which calls unconditionally gfx_v9_0_fault
>> and then drm_sched_fault. However now it depends on whether the interrupt service routine
>> drm_sched_fault is executed after gfx_v9_0_cp_gfx_start is completed which sets the ready
>> field of the scheduler to true even for uninitialized schedulers and causes oops vs
>> no fault or when ISR drm_sched_fault is completed prior gfx_v9_0_cp_gfx_start and
>> NULL pointer dereference does not occur.
>>
>> Use the field timeout_wq to prevent oops for uninitialized schedulers.
>> The field could be initialized by the work queue of resetting the domain.
>>
>> Fixes: 0c8c901aaaebc9 ("drm/sched: Check scheduler ready before calling timeout handling")
>>
>> v1: Corrections to commit message (Luben)
>> Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
>> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
> I didn't give my RB to this patch so I'm not sure what it is doing here.

I removed your rb, also if you do not know what is doing here why do you want to push this to amd-staging-drm-next and to drm-misc-fixed?

>
> The fixes tag should be before the SOB tag, and the v1 line should be separated
> by a line before the Git tags.
>
> Since this is a good patch and I want it in both drm-misc-fixed and amd-staging-drm-next,
> I'll submit it to drm-misc-fixed with a Link: and RB/SOB tag there and then cherry-pick
> that into amd-staging-drm-next.
>
> Don't push it to amd-staging-drm-next.
>
> I'll fix this and submit to amd-staging-drm-next and to drm-misc-fixed with
> a Link: tag.
>
> Regards,
> Luben
>
>
>> ---
>>  drivers/gpu/drm/scheduler/sched_main.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index 649fac2e1ccb..670b7997f389 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -308,7 +308,7 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
>>   */
>>  void drm_sched_fault(struct drm_gpu_scheduler *sched)
>>  {
>> -        if (sched->ready)
>> +        if (sched->timeout_wq)
>>                  mod_delayed_work(sched->timeout_wq, &sched->work_tdr, 0);
>>  }
>>  EXPORT_SYMBOL(drm_sched_fault);

On 2023-05-10 10:24, vitaly prosyak wrote:
>
> On 2023-05-10 10:19, Luben Tuikov wrote:
>> On 2023-05-10 09:51, vitaly.prosyak@amd.com wrote:
>>> From: Vitaly Prosyak <vitaly.prosyak@amd.com>
>>>
>>> During an IGT GPU reset test we see again oops despite of
>>> commit 0c8c901aaaebc9 (drm/sched: Check scheduler ready before calling
>>> timeout handling).
>>>
>>> It uses ready condition whether to call drm_sched_fault which unwind
>>> the TDR leads to GPU reset.
>>> However it looks the ready condition is overloaded with other meanings,
>>> for example, for the following stack is related GPU reset :
>>>
>>> 0  gfx_v9_0_cp_gfx_start
>>> 1  gfx_v9_0_cp_gfx_resume
>>> 2  gfx_v9_0_cp_resume
>>> 3  gfx_v9_0_hw_init
>>> 4  gfx_v9_0_resume
>>> 5  amdgpu_device_ip_resume_phase2
>>>
>>> does the following:
>>>         /* start the ring */
>>>         gfx_v9_0_cp_gfx_start(adev);
>>>         ring->sched.ready = true;
>>>
>>> The same approach is for other ASICs as well :
>>> gfx_v8_0_cp_gfx_resume
>>> gfx_v10_0_kiq_resume, etc...
>>>
>>> As a result, our GPU reset test causes GPU fault which calls unconditionally gfx_v9_0_fault
>>> and then drm_sched_fault. However now it depends on whether the interrupt service routine
>>> drm_sched_fault is executed after gfx_v9_0_cp_gfx_start is completed which sets the ready
>>> field of the scheduler to true even for uninitialized schedulers and causes oops vs
>>> no fault or when ISR drm_sched_fault is completed prior gfx_v9_0_cp_gfx_start and
>>> NULL pointer dereference does not occur.
>>>
>>> Use the field timeout_wq to prevent oops for uninitialized schedulers.
>>> The field could be initialized by the work queue of resetting the domain.
>>>
>>> Fixes: 0c8c901aaaebc9 ("drm/sched: Check scheduler ready before calling timeout handling")
>>>
>>> v1: Corrections to commit message (Luben)
>>> Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
>>> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
>> I didn't give my RB to this patch so I'm not sure what it is doing here.
> I removed your rb, also if you do not know what is doing here why do you want to push this to amd-staging-drm-next and to drm-misc-fixed?

I'll add my RB as I push it to those two branches.

I'll also add a Link tag and fix the commit SHA for the Fixes tag to one which
is found in drm-misc-fixes.

Thanks for the patch fixing this long-standing bug.

Regards,
Luben

>>
>> The fixes tag should be before the SOB tag, and the v1 line should be separated
>> by a line before the Git tags.
>>
>> Since this is a good patch and I want it in both drm-misc-fixed and amd-staging-drm-next,
>> I'll submit it to drm-misc-fixed with a Link: and RB/SOB tag there and then cherry-pick
>> that into amd-staging-drm-next.
>>
>> Don't push it to amd-staging-drm-next.
>>
>> I'll fix this and submit to amd-staging-drm-next and to drm-misc-fixed with
>> a Link: tag.
>>
>> Regards,
>> Luben
>>
>>
>>> ---
>>>  drivers/gpu/drm/scheduler/sched_main.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 649fac2e1ccb..670b7997f389 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -308,7 +308,7 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
>>>   */
>>>  void drm_sched_fault(struct drm_gpu_scheduler *sched)
>>>  {
>>> -        if (sched->ready)
>>> +        if (sched->timeout_wq)
>>>                  mod_delayed_work(sched->timeout_wq, &sched->work_tdr, 0);
>>>  }
>>>  EXPORT_SYMBOL(drm_sched_fault);

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 649fac2e1ccb..670b7997f389 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -308,7 +308,7 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
  */
 void drm_sched_fault(struct drm_gpu_scheduler *sched)
 {
-        if (sched->ready)
+        if (sched->timeout_wq)
                 mod_delayed_work(sched->timeout_wq, &sched->work_tdr, 0);
 }
 EXPORT_SYMBOL(drm_sched_fault);
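
For readers who want to see why the guard changed, here is a minimal user-space sketch of the situation the commit message describes: a resume path has set `ready` on a scheduler whose timeout work queue was never set up. It is not kernel code. `struct mock_sched`, `struct mock_workqueue`, `fault_check_ready()`, `fault_check_wq()` and `demo_uninitialized_scheduler()` are hypothetical stand-ins invented for illustration, and the sketch assumes that `timeout_wq` stays NULL until the scheduler is initialized.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct mock_workqueue {
    int queued;                         /* times TDR work was queued */
};

struct mock_sched {
    bool ready;                         /* set by hw_init/resume paths */
    struct mock_workqueue *timeout_wq;  /* assumed NULL until scheduler init */
};

/*
 * Old guard: trusts `ready`. Instead of crashing, this model reports the
 * case where the real code would pass a NULL work queue onward.
 */
static bool fault_check_ready(struct mock_sched *s, bool *would_oops)
{
    *would_oops = false;
    if (!s->ready)
        return false;
    if (s->timeout_wq == NULL) {
        *would_oops = true;             /* NULL work queue would be used */
        return false;
    }
    s->timeout_wq->queued++;
    return true;
}

/* New guard: tests the pointer that is about to be used. */
static bool fault_check_wq(struct mock_sched *s)
{
    if (s->timeout_wq == NULL)
        return false;
    s->timeout_wq->queued++;
    return true;
}

/* Scenario from the commit message: resume set `ready`, init never ran. */
static void demo_uninitialized_scheduler(void)
{
    struct mock_sched s = { .ready = true, .timeout_wq = NULL };
    bool would_oops;

    assert(!fault_check_ready(&s, &would_oops) && would_oops);
    assert(!fault_check_wq(&s));        /* fault is ignored, no oops */
}
```

Calling `demo_uninitialized_scheduler()` from a test harness exercises both guards on the same scheduler state. In this model only the pointer check is safe regardless of when the fault interrupt arrives relative to the resume path.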