Message ID | 20210225213736.12352-1-andrey.grodzovsky@amd.com (mailing list archive)
---|---
State | New, archived
Series | [v3] drm/scheduler: Fix hang when sched_entity released
On 25.02.21 at 22:37, Andrey Grodzovsky wrote:
> Problem: If scheduler is already stopped by the time sched_entity
> is released and entity's job_queue not empty I encountred
> a hang in drm_sched_entity_flush. This is because drm_sched_entity_is_idle
> never becomes false.
>
> Fix: In drm_sched_fini detach all sched_entities from the
> scheduler's run queues. This will satisfy drm_sched_entity_is_idle.
> Also wakeup all those processes stuck in sched_entity flushing
> as the scheduler main thread which wakes them up is stopped by now.
>
> v2:
> Reverse order of drm_sched_rq_remove_entity and marking
> s_entity as stopped to prevent reinserion back to rq due
> to race.
>
> v3:
> Drop drm_sched_rq_remove_entity, only modify entity->stopped
> and check for it in drm_sched_entity_is_idle
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>  drivers/gpu/drm/scheduler/sched_entity.c |  3 ++-
>  drivers/gpu/drm/scheduler/sched_main.c   | 23 +++++++++++++++++++++++
>  2 files changed, 25 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index 92d965b629c6..68b10813129a 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -116,7 +116,8 @@ static bool drm_sched_entity_is_idle(struct drm_sched_entity *entity)
>  	rmb(); /* for list_empty to work without lock */
>
>  	if (list_empty(&entity->list) ||
> -	    spsc_queue_count(&entity->job_queue) == 0)
> +	    spsc_queue_count(&entity->job_queue) == 0 ||
> +	    entity->stopped)
>  		return true;
>
>  	return false;
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 908b0b56032d..b50fab472734 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -897,9 +897,32 @@ EXPORT_SYMBOL(drm_sched_init);
>   */
>  void drm_sched_fini(struct drm_gpu_scheduler *sched)
>  {
> +	int i;
> +	struct drm_sched_entity *s_entity;

Please declare i last and have an empty line between declaration and code.

With that nit pick fixed the patch is Reviewed-by: Christian König <christian.koenig@amd.com>. Going to push it to drm-misc-next.

Christian.

>  	if (sched->thread)
>  		kthread_stop(sched->thread);
>
> +	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> +		struct drm_sched_rq *rq = &sched->sched_rq[i];
> +
> +		if (!rq)
> +			continue;
> +
> +		spin_lock(&rq->lock);
> +		list_for_each_entry(s_entity, &rq->entities, list)
> +			/*
> +			 * Prevents reinsertion and marks job_queue as idle,
> +			 * it will removed from rq in drm_sched_entity_fini
> +			 * eventually
> +			 */
> +			s_entity->stopped = true;
> +		spin_unlock(&rq->lock);
> +	}
> +
> +	/* Wakeup everyone stuck in drm_sched_entity_flush for this scheduler */
> +	wake_up_all(&sched->job_scheduled);
> +
>  	/* Confirm no work left behind accessing device structures */
>  	cancel_delayed_work_sync(&sched->work_tdr);
On 2021-02-26 3:04 a.m., Christian König wrote:
> On 25.02.21 at 22:37, Andrey Grodzovsky wrote:
>> Problem: If scheduler is already stopped by the time sched_entity
>> is released and entity's job_queue not empty I encountred
>> a hang in drm_sched_entity_flush. This is because
>> drm_sched_entity_is_idle
>> never becomes false.
>>
>> [SNIP]
>>
>>  void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>  {
>> +	int i;
>> +	struct drm_sched_entity *s_entity;
>
> Please declare i last and have an empty line between declaration and
> code.
>
> With that nit pick fixed the patch is Reviewed-by: Christian König
> <christian.koenig@amd.com>. Going to push it to drm-misc-next.
>
> Christian.

Done. Since you are pushing it, I'm attaching the patch here.

Andrey

> [SNIP]
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 92d965b629c6..68b10813129a 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -116,7 +116,8 @@ static bool drm_sched_entity_is_idle(struct drm_sched_entity *entity)
 	rmb(); /* for list_empty to work without lock */
 
 	if (list_empty(&entity->list) ||
-	    spsc_queue_count(&entity->job_queue) == 0)
+	    spsc_queue_count(&entity->job_queue) == 0 ||
+	    entity->stopped)
 		return true;
 
 	return false;
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 908b0b56032d..b50fab472734 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -897,9 +897,32 @@ EXPORT_SYMBOL(drm_sched_init);
  */
 void drm_sched_fini(struct drm_gpu_scheduler *sched)
 {
+	int i;
+	struct drm_sched_entity *s_entity;
 
 	if (sched->thread)
 		kthread_stop(sched->thread);
 
+	for (i = DRM_SCHED_PRIORITY_COUNT - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
+		struct drm_sched_rq *rq = &sched->sched_rq[i];
+
+		if (!rq)
+			continue;
+
+		spin_lock(&rq->lock);
+		list_for_each_entry(s_entity, &rq->entities, list)
+			/*
+			 * Prevents reinsertion and marks job_queue as idle,
+			 * it will removed from rq in drm_sched_entity_fini
+			 * eventually
+			 */
+			s_entity->stopped = true;
+		spin_unlock(&rq->lock);
+	}
+
+	/* Wakeup everyone stuck in drm_sched_entity_flush for this scheduler */
+	wake_up_all(&sched->job_scheduled);
+
 	/* Confirm no work left behind accessing device structures */
 	cancel_delayed_work_sync(&sched->work_tdr);
Problem: If the scheduler is already stopped by the time a sched_entity
is released and the entity's job_queue is not empty, drm_sched_entity_flush
hangs, because drm_sched_entity_is_idle never becomes true.

Fix: In drm_sched_fini detach all sched_entities from the
scheduler's run queues. This will satisfy drm_sched_entity_is_idle.
Also wake up all those processes stuck in sched_entity flushing,
as the scheduler main thread which normally wakes them up is stopped
by now.

v2:
Reverse the order of drm_sched_rq_remove_entity and marking
s_entity as stopped, to prevent reinsertion back into the rq due
to a race.

v3:
Drop drm_sched_rq_remove_entity; only set entity->stopped
and check for it in drm_sched_entity_is_idle.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/scheduler/sched_entity.c |  3 ++-
 drivers/gpu/drm/scheduler/sched_main.c   | 23 +++++++++++++++++++++++
 2 files changed, 25 insertions(+), 1 deletion(-)