Message ID: 1539888261-331-3-git-send-email-andrey.grodzovsky@amd.com (mailing list archive)
State: New, archived
Series: [1/3] drm/sched: Add callback to mark if sched is ready to work.
On 18.10.18 20:44, Andrey Grodzovsky wrote:
> A ring might become unusable after reset. If that is the case,
> drm_sched_entity_select_rq will choose another, working rq
> to run the job if there is one.
> Also, skip recovery of a ring which is not ready.

Well that is not even remotely sufficient.

If we can't bring up a ring any more after a reset we would need to move
all jobs which were previously scheduled on it, and all of its entities,
to a different instance.

What you do here breaks dependencies between jobs and can result in
unforeseen consequences, including random memory writes.

As far as I can see that can't be done correctly with the current
scheduler design.

In addition to that, when we can't restart one instance of a ring we
usually can't restart all the others either. So the whole approach is
rather pointless.

Regards,
Christian.

>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index d11489e..3124ca1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3355,10 +3355,24 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>  	else
>  		r = amdgpu_device_reset(adev);
>  
> +	/*
> +	 * After reboot a ring might fail in which case this will
> +	 * move the job to different rq if possible
> +	 */
> +	if (job) {
> +		drm_sched_entity_select_rq(job->base.entity);
> +		if (job->base.entity->rq) {
> +			job->base.sched = job->base.entity->rq->sched;
> +		} else {
> +			job->base.sched = NULL;
> +			r = -ENOENT;
> +		}
> +	}
> +
>  	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>  		struct amdgpu_ring *ring = adev->rings[i];
>  
> -		if (!ring || !ring->sched.thread)
> +		if (!ring || !ring->ready || !ring->sched.thread)
>  			continue;
>  
>  		/* only need recovery sched of the given job's ring
On 10/19/2018 03:08 AM, Koenig, Christian wrote:
> On 18.10.18 20:44, Andrey Grodzovsky wrote:
>> A ring might become unusable after reset. If that is the case,
>> drm_sched_entity_select_rq will choose another, working rq
>> to run the job if there is one.
>> Also, skip recovery of a ring which is not ready.
> Well that is not even remotely sufficient.
>
> If we can't bring up a ring any more after a reset we would need to move
> all jobs which were previously scheduled on it, and all of its entities,
> to a different instance.

That one should be easy to add inside amdgpu_device_gpu_recover in case
the ring is dead, we just do the same for all the jobs in
sched->ring_mirror_list of the dead ring before doing recovery for them, no?

>
> What you do here breaks dependencies between jobs and can result in
> unforeseen consequences, including random memory writes.

Can you explain this a bit more? AFAIK any job dependent on this job will
still wait for its completion before running, regardless of whether this
job moved to a different ring. What am I missing?

>
> As far as I can see that can't be done correctly with the current
> scheduler design.
>
> In addition to that, when we can't restart one instance of a ring we
> usually can't restart all the others either. So the whole approach is
> rather pointless.

From my testing it looks like we can: compute ring 0 is dead but IB tests
pass on the other compute rings.

Andrey

>
> Regards,
> Christian.
>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++++++++++++++-
>>  1 file changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index d11489e..3124ca1 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3355,10 +3355,24 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>  	else
>>  		r = amdgpu_device_reset(adev);
>>  
>> +	/*
>> +	 * After reboot a ring might fail in which case this will
>> +	 * move the job to different rq if possible
>> +	 */
>> +	if (job) {
>> +		drm_sched_entity_select_rq(job->base.entity);
>> +		if (job->base.entity->rq) {
>> +			job->base.sched = job->base.entity->rq->sched;
>> +		} else {
>> +			job->base.sched = NULL;
>> +			r = -ENOENT;
>> +		}
>> +	}
>> +
>>  	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>  		struct amdgpu_ring *ring = adev->rings[i];
>>  
>> -		if (!ring || !ring->sched.thread)
>> +		if (!ring || !ring->ready || !ring->sched.thread)
>>  			continue;
>>  
>>  		/* only need recovery sched of the given job's ring
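For illustration only, here is a rough sketch of what doing "the same for all the jobs in sched->ring_mirror_list" could look like, following the pattern the patch already uses for the single guilty job. The ring_mirror_list name comes from the thread itself; the job_list_lock spinlock, the "node" list member and the function name are assumptions about the scheduler structures of this era, not a tested change. Christian's reply below explains why this alone would still not be sufficient.

	/*
	 * Hypothetical sketch, not a tested change: re-target every job still
	 * on the dead ring's mirror list the same way the patch re-targets the
	 * single guilty job.  job_list_lock and the "node" list member are
	 * assumed names from the scheduler of this period.
	 */
	static void amdgpu_move_jobs_off_dead_ring(struct amdgpu_ring *dead_ring)
	{
		struct drm_gpu_scheduler *sched = &dead_ring->sched;
		struct drm_sched_job *s_job;

		spin_lock(&sched->job_list_lock);
		list_for_each_entry(s_job, &sched->ring_mirror_list, node) {
			/* Let the entity pick a still-working rq, if any is left. */
			drm_sched_entity_select_rq(s_job->entity);
			if (s_job->entity->rq)
				s_job->sched = s_job->entity->rq->sched;
			/* else: no usable ring remains; the job would have to be
			 * completed with an error instead. */
		}
		spin_unlock(&sched->job_list_lock);
	}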
On 19.10.18 17:06, Grodzovsky, Andrey wrote:
>
> On 10/19/2018 03:08 AM, Koenig, Christian wrote:
>> On 18.10.18 20:44, Andrey Grodzovsky wrote:
>>> A ring might become unusable after reset. If that is the case,
>>> drm_sched_entity_select_rq will choose another, working rq
>>> to run the job if there is one.
>>> Also, skip recovery of a ring which is not ready.
>> Well that is not even remotely sufficient.
>>
>> If we can't bring up a ring any more after a reset we would need to move
>> all jobs which were previously scheduled on it, and all of its entities,
>> to a different instance.
> That one should be easy to add inside amdgpu_device_gpu_recover in case
> the ring is dead, we just do the same for all the jobs in
> sched->ring_mirror_list of the dead ring before doing recovery for them, no?

Correct, you need to execute all jobs from the ring_mirror_list as well as
move all entities to other rqs.

But there are some problems with that, see below.

>> What you do here breaks dependencies between jobs and can result in
>> unforeseen consequences, including random memory writes.
> Can you explain this a bit more? AFAIK any job dependent on this job will
> still wait for its completion before running, regardless of whether this
> job moved to a different ring. What am I missing?

The stuff dependent on the moving jobs should indeed work correctly.

But you don't necessarily have space to execute the ring_mirror_list on
another scheduler. In addition to that, the jobs are prepared to run on a
specific ring, e.g. UVD jobs are patched, VMIDs assigned, etc. So that most
likely won't work correctly.

>> As far as I can see that can't be done correctly with the current
>> scheduler design.
>>
>> In addition to that, when we can't restart one instance of a ring we
>> usually can't restart all the others either. So the whole approach is
>> rather pointless.
> From my testing it looks like we can: compute ring 0 is dead but IB tests
> pass on the other compute rings.

Interesting, but I would rather investigate why compute ring 0 is dead
while the others still work. At least some compute rings should be handled
by the same engine, so if one is dead all the others should be dead as well.

Christian.

>
> Andrey
>
>> Regards,
>> Christian.
>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++++++++++++++-
>>>  1 file changed, 15 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index d11489e..3124ca1 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -3355,10 +3355,24 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>  	else
>>>  		r = amdgpu_device_reset(adev);
>>>  
>>> +	/*
>>> +	 * After reboot a ring might fail in which case this will
>>> +	 * move the job to different rq if possible
>>> +	 */
>>> +	if (job) {
>>> +		drm_sched_entity_select_rq(job->base.entity);
>>> +		if (job->base.entity->rq) {
>>> +			job->base.sched = job->base.entity->rq->sched;
>>> +		} else {
>>> +			job->base.sched = NULL;
>>> +			r = -ENOENT;
>>> +		}
>>> +	}
>>> +
>>>  	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>  		struct amdgpu_ring *ring = adev->rings[i];
>>>  
>>> -		if (!ring || !ring->sched.thread)
>>> +		if (!ring || !ring->ready || !ring->sched.thread)
>>>  			continue;
>>>  
>>>  		/* only need recovery sched of the given job's ring
That's my next step.

Andrey

On 10/19/2018 12:28 PM, Christian König wrote:
>> From my testing it looks like we can: compute ring 0 is dead but IB tests
>> pass on the other compute rings.
>
> Interesting, but I would rather investigate why compute ring 0 is dead
> while the others still work.
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index d11489e..3124ca1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3355,10 +3355,24 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	else
 		r = amdgpu_device_reset(adev);
 
+	/*
+	 * After reboot a ring might fail in which case this will
+	 * move the job to different rq if possible
+	 */
+	if (job) {
+		drm_sched_entity_select_rq(job->base.entity);
+		if (job->base.entity->rq) {
+			job->base.sched = job->base.entity->rq->sched;
+		} else {
+			job->base.sched = NULL;
+			r = -ENOENT;
+		}
+	}
+
 	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
 		struct amdgpu_ring *ring = adev->rings[i];
 
-		if (!ring || !ring->sched.thread)
+		if (!ring || !ring->ready || !ring->sched.thread)
 			continue;
 
 		/* only need recovery sched of the given job's ring
A ring might become unusable after reset. If that is the case,
drm_sched_entity_select_rq will choose another, working rq
to run the job if there is one.

Also, skip recovery of a ring which is not ready.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)