Message ID | 8ab406c8bb2e58969668a806a529d5988b447530.1641750730.git.len.brown@intel.com (mailing list archive) |
---|---|
State | Handled Elsewhere, archived |
Series | [REGRESSION] Revert "drm/amdgpu: stop scheduler when calling hw_fini (v2)" |
[Public]

> -----Original Message-----
> From: Len Brown <lenb417@gmail.com> On Behalf Of Len Brown
> Sent: Sunday, January 9, 2022 1:12 PM
> To: torvalds@linux-foundation.org
> Cc: linux-pm@vger.kernel.org; linux-kernel@vger.kernel.org; Len Brown
> <len.brown@intel.com>; Chen, Guchun <Guchun.Chen@amd.com>;
> Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Koenig, Christian
> <Christian.Koenig@amd.com>; Deucher, Alexander
> <Alexander.Deucher@amd.com>; stable@vger.kernel.org
> Subject: [PATCH REGRESSION] Revert "drm/amdgpu: stop scheduler when
> calling hw_fini (v2)"
>
> From: Len Brown <len.brown@intel.com>
>
> This reverts commit f7d6779df642720e22bffd449e683bb8690bd3bf.
>
> This bisected regression has impacted suspend-resume stability since
> 5.15-rc1. It regressed -stable via 5.14.10.
>
> https://bugzilla.kernel.org/show_bug.cgi?id=215315
>
> Fixes: f7d6779df64 ("drm/amdgpu: stop scheduler when calling hw_fini (v2)")
> Cc: Guchun Chen <guchun.chen@amd.com>
> Cc: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> Cc: Christian Koenig <christian.koenig@amd.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: <stable@vger.kernel.org> # 5.14+
> Signed-off-by: Len Brown <len.brown@intel.com>

@Chen, Guchun, @Grodzovsky, Andrey, @Koenig, Christian

Any ideas? What's the consequence of reverting this patch? Didn't this patch fix another suspend/resume issue?

Alex

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 --------
>  1 file changed, 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 9afd11ca2709..45977a72b5dd 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -547,9 +547,6 @@ void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev)
>  		if (!ring || !ring->fence_drv.initialized)
>  			continue;
>  
> -		if (!ring->no_scheduler)
> -			drm_sched_stop(&ring->sched, NULL);
> -
>  		/* You can't wait for HW to signal if it's gone */
>  		if (!drm_dev_is_unplugged(adev_to_drm(adev)))
>  			r = amdgpu_fence_wait_empty(ring);
> @@ -609,11 +606,6 @@ void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev)
>  		if (!ring || !ring->fence_drv.initialized)
>  			continue;
>  
> -		if (!ring->no_scheduler) {
> -			drm_sched_resubmit_jobs(&ring->sched);
> -			drm_sched_start(&ring->sched, true);
> -		}
> -
>  		/* enable the interrupt */
>  		if (ring->fence_drv.irq_src)
>  			amdgpu_irq_get(adev, ring->fence_drv.irq_src,
> --
> 2.25.1
On 10.01.22 at 17:08, Deucher, Alexander wrote:
> [Public]
>
>> -----Original Message-----
>> From: Len Brown <lenb417@gmail.com> On Behalf Of Len Brown
>> Sent: Sunday, January 9, 2022 1:12 PM
>> Subject: [PATCH REGRESSION] Revert "drm/amdgpu: stop scheduler when
>> calling hw_fini (v2)"
>>
>> [...]
>
> @Chen, Guchun, @Grodzovsky, Andrey, @Koenig, Christian
>
> Any ideas? What's the consequence of reverting this patch? Didn't this
> patch fix another suspend/resume issue?

I think Guchun was just trying to adapt to the fact that we removed the scheduler stop from the fence driver hw fini path.

Not sure if that actually fixed something or was just a precaution.

Regards,
Christian.

>
> Alex
>
>> [...]
[Public]

> -----Original Message-----
> From: Koenig, Christian <Christian.Koenig@amd.com>
> Sent: Monday, January 10, 2022 11:16 AM
> To: Deucher, Alexander <Alexander.Deucher@amd.com>; Len Brown
> <lenb@kernel.org>; torvalds@linux-foundation.org; Chen, Guchun
> <Guchun.Chen@amd.com>; Grodzovsky, Andrey
> <Andrey.Grodzovsky@amd.com>
> Cc: linux-pm@vger.kernel.org; linux-kernel@vger.kernel.org; Len Brown
> <len.brown@intel.com>; stable@vger.kernel.org
> Subject: Re: [PATCH REGRESSION] Revert "drm/amdgpu: stop scheduler
> when calling hw_fini (v2)"
>
> On 10.01.22 at 17:08, Deucher, Alexander wrote:
> > [...]
> > Any ideas? What's the consequence of reverting this patch? Didn't this
> > patch fix another suspend/resume issue?
>
> I think Guchun was just trying to adapt that we removed the scheduler stop
> from the fence driver hw fini path.
>
> Not sure if that actually fixed something or was just a precaution.

Thanks. I'll wait for feedback from Guchun and Andrey and if they are ok with it, I'll apply the revert.

Alex

>
> Regards,
> Christian.
>
> [...]
On Mon, Jan 10, 2022 at 04:25:51PM +0000, Deucher, Alexander wrote:
> Thanks. I'll wait for feedback from Guchun and Andrey and if they are
> ok with it, I'll apply the revert.

Linus already picked it up yesterday, it's in v5.16.
[Public]

Hi Alex/Christian,

This patch puts drm_sched_stop before amdgpu_fence_wait_empty so that the scheduler is stopped first. Otherwise there is a possible race in which the drm scheduler keeps submitting commands to the hardware during suspend, so amdgpu_fence_wait_empty has no chance of ever seeing the ring empty. This is based on an earlier discussion with Andrey.

In Brown's case, without this patch his test runs well for 10 hours. With the patch applied, however, the issue occurs in under an hour. I guess this patch exposes another underlying problem: if the patch were totally faulty, the test would break in the first suspend/resume round rather than fail after several rounds.

https://bugzilla.kernel.org/show_bug.cgi?id=215315

Anyway, we can revert it for now, and I will continue investigating the root cause.

Regards,
Guchun

-----Original Message-----
From: Deucher, Alexander <Alexander.Deucher@amd.com>
Sent: Tuesday, January 11, 2022 12:26 AM
To: Koenig, Christian <Christian.Koenig@amd.com>; Len Brown <lenb@kernel.org>; torvalds@linux-foundation.org; Chen, Guchun <Guchun.Chen@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
Cc: linux-pm@vger.kernel.org; linux-kernel@vger.kernel.org; Len Brown <len.brown@intel.com>; stable@vger.kernel.org
Subject: RE: [PATCH REGRESSION] Revert "drm/amdgpu: stop scheduler when calling hw_fini (v2)"

[Public]

> -----Original Message-----
> From: Koenig, Christian <Christian.Koenig@amd.com>
> Sent: Monday, January 10, 2022 11:16 AM
> Subject: Re: [PATCH REGRESSION] Revert "drm/amdgpu: stop scheduler
> when calling hw_fini (v2)"
>
> [...]
>
> I think Guchun was just trying to adapt that we removed the scheduler
> stop from the fence driver hw fini path.
>
> Not sure if that actually fixed something or was just a precaution.

Thanks. I'll wait for feedback from Guchun and Andrey and if they are ok with it, I'll apply the revert.

Alex

>
> Regards,
> Christian.
>
> [...]
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 9afd11ca2709..45977a72b5dd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -547,9 +547,6 @@ void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev)
 		if (!ring || !ring->fence_drv.initialized)
 			continue;
 
-		if (!ring->no_scheduler)
-			drm_sched_stop(&ring->sched, NULL);
-
 		/* You can't wait for HW to signal if it's gone */
 		if (!drm_dev_is_unplugged(adev_to_drm(adev)))
 			r = amdgpu_fence_wait_empty(ring);
@@ -609,11 +606,6 @@ void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev)
 		if (!ring || !ring->fence_drv.initialized)
 			continue;
 
-		if (!ring->no_scheduler) {
-			drm_sched_resubmit_jobs(&ring->sched);
-			drm_sched_start(&ring->sched, true);
-		}
-
 		/* enable the interrupt */
 		if (ring->fence_drv.irq_src)
 			amdgpu_irq_get(adev, ring->fence_drv.irq_src,
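
The ordering that the reverted commit introduced (stop the scheduler, then wait for the ring to drain) can be illustrated with a small user-space toy model of the race Guchun describes in the thread. This is only a sketch under stated assumptions: it is not amdgpu or drm_sched code, every name in it (`toy_ring`, `toy_tick`, `toy_wait_empty`, `toy_suspend`) is hypothetical, and the rates of one submission and one retirement per tick are assumed purely for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model only. Nothing here is amdgpu or drm_sched code; every
 * name is hypothetical and chosen for illustration.
 */
struct toy_ring {
	int queued;         /* jobs still held by the scheduler */
	int in_flight;      /* jobs on the hardware, fence not yet signaled */
	bool sched_running; /* false once the scheduler has been stopped */
};

/* One time step: a running scheduler submits one job, then the
 * hardware retires one in-flight job. */
static void toy_tick(struct toy_ring *r)
{
	assert(r->queued >= 0 && r->in_flight >= 0);

	if (r->sched_running && r->queued > 0) {
		r->queued--;
		r->in_flight++;
	}
	if (r->in_flight > 0)
		r->in_flight--;
}

/* Bounded wait for the ring to drain; returns true if it became empty. */
static bool toy_wait_empty(struct toy_ring *r, int max_ticks)
{
	for (int t = 0; t < max_ticks; t++) {
		if (r->in_flight == 0)
			return true;
		toy_tick(r);
	}
	return r->in_flight == 0;
}

/* Suspend path: optionally stop the scheduler before waiting. */
static bool toy_suspend(struct toy_ring *r, bool stop_first, int max_ticks)
{
	if (stop_first)
		r->sched_running = false;
	return toy_wait_empty(r, max_ticks);
}
```

In this model, a ring that starts with 100 queued jobs and 2 in flight never drains within a 10-tick budget while the scheduler keeps running, but drains after two ticks once the scheduler is stopped first. The model says nothing about why the real change destabilized suspend/resume in Len's testing; the thread leaves that root cause open.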