Message ID | 20231218155927.368881-1-robdclark@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Series | drm/msm/a6xx: Fix recovery vs runpm race |
On Mon, Dec 18, 2023 at 07:59:24AM -0800, Rob Clark wrote:
>
> From: Rob Clark <robdclark@chromium.org>
>
> a6xx_recover() is relying on the gpu lock to serialize against incoming
> submits doing a runpm get, as it tries to temporarily balance out the
> runpm gets with puts in order to power off the GPU. Unfortunately this
> gets worse when we (in a later patch) will move the runpm get out of the
> scheduler thread/work to move it out of the fence signaling path.
>
> Instead we can just simplify the whole thing by using force_suspend() /
> force_resume() instead of trying to be clever.

At some places, we take a pm_runtime vote and access the gpu
registers assuming it will be powered until we drop the vote. a6xx_get_timestamp()
is an example. If we do a force suspend, it may cause bus errors from
those threads. Now you have to serialize every place we do runtime_get/put with a
mutex. Or is there a better way to handle the 'later patch' you
mentioned?

-Akhil.

>
> Reported-by: David Heidelberg <david.heidelberg@collabora.com>
> Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10272
> Fixes: abe2023b4cea ("drm/msm/gpu: Push gpu lock down past runpm")
> Signed-off-by: Rob Clark <robdclark@chromium.org>
> ---
>  drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 12 ++----------
>  1 file changed, 2 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> index 268737e59131..a5660d63535b 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> @@ -1244,12 +1244,7 @@ static void a6xx_recover(struct msm_gpu *gpu)
>  	dev_pm_genpd_add_notifier(gmu->cxpd, &gmu->pd_nb);
>  	dev_pm_genpd_synced_poweroff(gmu->cxpd);
>
> -	/* Drop the rpm refcount from active submits */
> -	if (active_submits)
> -		pm_runtime_put(&gpu->pdev->dev);
> -
> -	/* And the final one from recover worker */
> -	pm_runtime_put_sync(&gpu->pdev->dev);
> +	pm_runtime_force_suspend(&gpu->pdev->dev);
>
>  	if (!wait_for_completion_timeout(&gmu->pd_gate, msecs_to_jiffies(1000)))
>  		DRM_DEV_ERROR(&gpu->pdev->dev, "cx gdsc didn't collapse\n");
> @@ -1258,10 +1253,7 @@ static void a6xx_recover(struct msm_gpu *gpu)
>
>  	pm_runtime_use_autosuspend(&gpu->pdev->dev);
>
> -	if (active_submits)
> -		pm_runtime_get(&gpu->pdev->dev);
> -
> -	pm_runtime_get_sync(&gpu->pdev->dev);
> +	pm_runtime_force_resume(&gpu->pdev->dev);
>
>  	gpu->active_submits = active_submits;
>  	mutex_unlock(&gpu->active_lock);
> --
> 2.43.0
>
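
The concern raised in this reply can be illustrated with a toy userspace model. This is a minimal sketch, not kernel code: `struct toy_dev` and the `toy_*` helpers are hypothetical stand-ins, and it assumes, as the reply describes, that a forced suspend powers the device down even while another path still holds a vote.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a runtime-PM managed device; not the kernel's struct device. */
struct toy_dev {
	int usage_count;	/* outstanding power votes */
	bool powered;		/* whether register access is safe */
};

/* Model of a synchronous get: take a vote and power up if needed. */
static void toy_get_sync(struct toy_dev *d)
{
	d->usage_count++;
	d->powered = true;
}

/* Model of a synchronous put: drop a vote, power down only on the last one. */
static void toy_put_sync(struct toy_dev *d)
{
	assert(d->usage_count > 0);
	if (--d->usage_count == 0)
		d->powered = false;
}

/* Model of a forced suspend (assumption): power down regardless of votes. */
static void toy_force_suspend(struct toy_dev *d)
{
	d->powered = false;
}

/* Model of a register read: returns false where real hardware could fault. */
static bool toy_read_reg(const struct toy_dev *d)
{
	return d->powered;
}
```

With one vote held, calling `toy_force_suspend()` leaves `usage_count` at 1 while `toy_read_reg()` starts reporting the device as unpowered. That is the window in which a path such as a6xx_get_timestamp() could touch registers on a powered-down GPU.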
On Fri, Dec 22, 2023 at 11:58 AM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>
> On Mon, Dec 18, 2023 at 07:59:24AM -0800, Rob Clark wrote:
> >
> > From: Rob Clark <robdclark@chromium.org>
> >
> > a6xx_recover() is relying on the gpu lock to serialize against incoming
> > submits doing a runpm get, as it tries to temporarily balance out the
> > runpm gets with puts in order to power off the GPU. Unfortunately this
> > gets worse when we (in a later patch) will move the runpm get out of the
> > scheduler thread/work to move it out of the fence signaling path.
> >
> > Instead we can just simplify the whole thing by using force_suspend() /
> > force_resume() instead of trying to be clever.
>
> At some places, we take a pm_runtime vote and access the gpu
> registers assuming it will be powered until we drop the vote. a6xx_get_timestamp()
> is an example. If we do a force suspend, it may cause bus errors from
> those threads. Now you have to serialize every place we do runtime_get/put with a
> mutex. Or is there a better way to handle the 'later patch' you
> mentioned?

So I was running into issues, when I started adding an igt test to
stress test recovery vs multi-threaded submit, with cxpd not always
suspending and getting "cx gdsc did not collapse", which may be
related. I was considering using force_suspend() on the gmu and cxpd
if gpu->hang==true, I'm not sure. I ran out of time to play with this
when I was in the office.

The issue the 'later patch' is trying to deal with is getting memory
allocations out of the "fence signaling path", ie. out from the
drm/sched kthread/worker. One way to do that, without dragging all of
runpm/device-link/etc into it is to do the runpm get in the submit
ioctl before enqueuing the job to the scheduler. But then we can hold
a lock to protect against racing with recovery.

BR,
-R

> -Akhil.
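
The submit-side idea in this reply (take the runpm reference in the submit path before queueing, and hold a lock against recovery) can be sketched as a toy userspace model. Everything here is hypothetical: `struct toy_gpu` and the `toy_*` helpers are not msm driver names, and a C11 `atomic_flag` try-lock stands in for what would presumably be a sleeping mutex in the driver, so that a single-threaded run is deterministic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical toy GPU state; names are illustrative, not msm driver fields. */
struct toy_gpu {
	atomic_flag lock;	/* stands in for a lock shared by submit and recovery */
	int rpm_votes;		/* power votes taken on behalf of queued jobs */
	int queued_jobs;	/* jobs handed to the (imaginary) scheduler */
};

/* Try to take the lock; returns true on success. */
static bool toy_trylock(struct toy_gpu *gpu)
{
	return !atomic_flag_test_and_set(&gpu->lock);
}

static void toy_unlock(struct toy_gpu *gpu)
{
	atomic_flag_clear(&gpu->lock);
}

/* Submit path: take the power vote, then queue the job, all under the lock. */
static bool toy_submit(struct toy_gpu *gpu)
{
	if (!toy_trylock(gpu))
		return false;	/* recovery in progress; a real mutex would sleep */
	gpu->rpm_votes++;
	gpu->queued_jobs++;
	toy_unlock(gpu);
	return true;
}

/* Recovery path: hold the lock across the power cycle so votes cannot change. */
static bool toy_recover_begin(struct toy_gpu *gpu)
{
	return toy_trylock(gpu);
}

static void toy_recover_end(struct toy_gpu *gpu)
{
	assert(gpu->rpm_votes >= 0);
	toy_unlock(gpu);
}
```

Between `toy_recover_begin()` and `toy_recover_end()` a call to `toy_submit()` fails without changing `rpm_votes`, so the vote count that recovery sees stays stable for the whole power cycle.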
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 268737e59131..a5660d63535b 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -1244,12 +1244,7 @@ static void a6xx_recover(struct msm_gpu *gpu)
 	dev_pm_genpd_add_notifier(gmu->cxpd, &gmu->pd_nb);
 	dev_pm_genpd_synced_poweroff(gmu->cxpd);
 
-	/* Drop the rpm refcount from active submits */
-	if (active_submits)
-		pm_runtime_put(&gpu->pdev->dev);
-
-	/* And the final one from recover worker */
-	pm_runtime_put_sync(&gpu->pdev->dev);
+	pm_runtime_force_suspend(&gpu->pdev->dev);
 
 	if (!wait_for_completion_timeout(&gmu->pd_gate, msecs_to_jiffies(1000)))
 		DRM_DEV_ERROR(&gpu->pdev->dev, "cx gdsc didn't collapse\n");
@@ -1258,10 +1253,7 @@ static void a6xx_recover(struct msm_gpu *gpu)
 
 	pm_runtime_use_autosuspend(&gpu->pdev->dev);
 
-	if (active_submits)
-		pm_runtime_get(&gpu->pdev->dev);
-
-	pm_runtime_get_sync(&gpu->pdev->dev);
+	pm_runtime_force_resume(&gpu->pdev->dev);
 
 	gpu->active_submits = active_submits;
 	mutex_unlock(&gpu->active_lock);