Message ID | 1613495262-22605-1-git-send-email-andrey.grodzovsky@amd.com (mailing list archive)
---|---
State | New, archived
Series | drm/scheduler: Fix hang when sched_entity released
Ping

Andrey

On 2/16/21 12:07 PM, Andrey Grodzovsky wrote:
> Problem: If the scheduler is already stopped by the time sched_entity
> is released and the entity's job_queue is not empty, I encountered
> a hang in drm_sched_entity_flush. This is because drm_sched_entity_is_idle
> never becomes true.
>
> Fix: In drm_sched_fini detach all sched_entities from the
> scheduler's run queues. This will satisfy drm_sched_entity_is_idle.
> Also wake up all those processes stuck in sched_entity flushing,
> as the scheduler main thread which would wake them up is stopped by now.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 31 +++++++++++++++++++++++++++++++
>  1 file changed, 31 insertions(+)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 908b0b5..11abf5d 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -897,9 +897,40 @@ EXPORT_SYMBOL(drm_sched_init);
>   */
>  void drm_sched_fini(struct drm_gpu_scheduler *sched)
>  {
> +	int i;
> +	struct drm_sched_entity *s_entity;
>  	if (sched->thread)
>  		kthread_stop(sched->thread);
>
> +	/* Detach all sched_entites from this scheduler once it's stopped */
> +	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> +		struct drm_sched_rq *rq = &sched->sched_rq[i];
> +
> +		if (!rq)
> +			continue;
> +
> +		/* Loop this way because rq->lock is taken in drm_sched_rq_remove_entity */
> +		spin_lock(&rq->lock);
> +		while ((s_entity = list_first_entry_or_null(&rq->entities,
> +							    struct drm_sched_entity,
> +							    list))) {
> +			spin_unlock(&rq->lock);
> +			drm_sched_rq_remove_entity(rq, s_entity);
> +
> +			/* Mark as stopped to reject adding to any new rq */
> +			spin_lock(&s_entity->rq_lock);
> +			s_entity->stopped = true;
> +			spin_unlock(&s_entity->rq_lock);
> +
> +			spin_lock(&rq->lock);
> +		}
> +		spin_unlock(&rq->lock);
> +
> +	}
> +
> +	/* Wakeup everyone stuck in drm_sched_entity_flush for this scheduler */
> +	wake_up_all(&sched->job_scheduled);
> +
>  	/* Confirm no work left behind accessing device structures */
>  	cancel_delayed_work_sync(&sched->work_tdr);
Am 16.02.21 um 18:07 schrieb Andrey Grodzovsky:
> [SNIP]
> +			spin_unlock(&rq->lock);
> +			drm_sched_rq_remove_entity(rq, s_entity);
> +
> +			/* Mark as stopped to reject adding to any new rq */
> +			spin_lock(&s_entity->rq_lock);
> +			s_entity->stopped = true;

Why not marking it as stopped and then removing it?

Regards,
Christian.
On 2/17/21 4:32 PM, Christian König wrote:
>> [SNIP]
>> +			/* Mark as stopped to reject adding to any new rq */
>> +			spin_lock(&s_entity->rq_lock);
>> +			s_entity->stopped = true;
>
> Why not marking it as stopped and then removing it?
>
> Regards,
> Christian.

You mean just reverse the order of operations here, to prevent a race where someone adds the entity back to an rq before it is marked as stopped?

Andrey
Am 17.02.21 um 22:36 schrieb Andrey Grodzovsky:
> On 2/17/21 4:32 PM, Christian König wrote:
>> Why not marking it as stopped and then removing it?
>
> You mean just reverse the order of operations here to prevent a race
> where someone adding it again to rq before marking it as stopped ?

Exactly that, yeah.

Christian.
Will do.

Andrey

On 2/17/21 4:37 PM, Christian König wrote:
> Am 17.02.21 um 22:36 schrieb Andrey Grodzovsky:
>> You mean just reverse the order of operations here to prevent a race
>> where someone adding it again to rq before marking it as stopped ?
>
> Exactly that, yeah.
>
> Christian.
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 908b0b5..11abf5d 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -897,9 +897,40 @@ EXPORT_SYMBOL(drm_sched_init);
  */
 void drm_sched_fini(struct drm_gpu_scheduler *sched)
 {
+	int i;
+	struct drm_sched_entity *s_entity;
 	if (sched->thread)
 		kthread_stop(sched->thread);
 
+	/* Detach all sched_entites from this scheduler once it's stopped */
+	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
+		struct drm_sched_rq *rq = &sched->sched_rq[i];
+
+		if (!rq)
+			continue;
+
+		/* Loop this way because rq->lock is taken in drm_sched_rq_remove_entity */
+		spin_lock(&rq->lock);
+		while ((s_entity = list_first_entry_or_null(&rq->entities,
+							    struct drm_sched_entity,
+							    list))) {
+			spin_unlock(&rq->lock);
+			drm_sched_rq_remove_entity(rq, s_entity);
+
+			/* Mark as stopped to reject adding to any new rq */
+			spin_lock(&s_entity->rq_lock);
+			s_entity->stopped = true;
+			spin_unlock(&s_entity->rq_lock);
+
+			spin_lock(&rq->lock);
+		}
+		spin_unlock(&rq->lock);
+
+	}
+
+	/* Wakeup everyone stuck in drm_sched_entity_flush for this scheduler */
+	wake_up_all(&sched->job_scheduled);
+
 	/* Confirm no work left behind accessing device structures */
 	cancel_delayed_work_sync(&sched->work_tdr);
Problem: If the scheduler is already stopped by the time sched_entity
is released and the entity's job_queue is not empty, I encountered
a hang in drm_sched_entity_flush. This is because drm_sched_entity_is_idle
never becomes true.

Fix: In drm_sched_fini detach all sched_entities from the
scheduler's run queues. This will satisfy drm_sched_entity_is_idle.
Also wake up all those processes stuck in sched_entity flushing,
as the scheduler main thread which would wake them up is stopped by now.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)