| Message ID | 20231107041020.10035-2-ltuikov89@gmail.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | drm/sched: Don't disturb the entity when in RR-mode scheduling |
On Mon, Nov 06, 2023 at 11:10:21PM -0500, Luben Tuikov wrote:
> Don't call drm_sched_select_entity() in drm_sched_run_job_queue(). In fact,
> rename __drm_sched_run_job_queue() to just drm_sched_run_job_queue(), and let
> it do just that, schedule the work item for execution.
>
> [...]
>
> Signed-off-by: Luben Tuikov <ltuikov89@gmail.com>
> Fixes: f7fe64ad0f22ff ("drm/sched: Split free_job into own work item")

Reviewed-by: Matthew Brost <matthew.brost@intel.com>
On 11/7/23 05:10, Luben Tuikov wrote:
> Don't call drm_sched_select_entity() in drm_sched_run_job_queue(). In fact,
> rename __drm_sched_run_job_queue() to just drm_sched_run_job_queue(), and let
> it do just that, schedule the work item for execution.
>
> [...]
>
> Signed-off-by: Luben Tuikov <ltuikov89@gmail.com>
> Fixes: f7fe64ad0f22ff ("drm/sched: Split free_job into own work item")

Reviewed-by: Danilo Krummrich <dakr@redhat.com>
On 11/7/23 05:10, Luben Tuikov wrote:
> [...]
>
> -/**
> - * drm_sched_run_job_queue - enqueue run-job work if there are ready entities
> - * @sched: scheduler instance
> - */
> -static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
> -{
> -	if (drm_sched_select_entity(sched))

Hm, now that I rebase my patch to implement dynamic job-flow control I recognize that
we probably need the peek semantics here. If we do not select an entity here, we also
do not check whether the corresponding job fits on the ring.

Alternatively, we simply can't do this check in drm_sched_wakeup(). The consequence would
be that we don't detect that we need to wait for credits to free up before the run work is
already executing and the run work selects an entity.

- Danilo

> -		__drm_sched_run_job_queue(sched);
> -}
> -
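A minimal, self-contained sketch of the "peek semantics" Danilo raises above, using hypothetical toy_* names rather than the real drm/sched structures: selecting advances the round-robin cursor as a side effect, while a peek would only report whether anything is ready, without consuming that entity's turn, so it could be called safely from a queueing path.

/*
 * Hypothetical sketch of "peek" vs. "select" for a round-robin run-queue.
 * Not the drm/sched implementation; toy_* names are made up for illustration.
 */
#include <stdbool.h>
#include <stddef.h>

#define TOY_NUM_ENTITIES 4

struct toy_entity {
	bool ready;			/* entity has a runnable job */
};

struct toy_rq {
	struct toy_entity entities[TOY_NUM_ENTITIES];
	int cur;			/* round-robin cursor */
};

/* Select: return the next ready entity and advance the cursor (side effect). */
static struct toy_entity *toy_rq_select_entity_rr(struct toy_rq *rq)
{
	for (int i = 1; i <= TOY_NUM_ENTITIES; i++) {
		int idx = (rq->cur + i) % TOY_NUM_ENTITIES;

		if (rq->entities[idx].ready) {
			rq->cur = idx;	/* this entity's turn is now consumed */
			return &rq->entities[idx];
		}
	}
	return NULL;
}

/* Peek: same walk, but leaves the cursor alone, so nothing is consumed. */
static bool toy_rq_peek_ready_rr(const struct toy_rq *rq)
{
	for (int i = 1; i <= TOY_NUM_ENTITIES; i++) {
		if (rq->entities[(rq->cur + i) % TOY_NUM_ENTITIES].ready)
			return true;
	}
	return false;
}

int main(void)
{
	struct toy_rq rq = { .entities = { [0].ready = true, [2].ready = true }, .cur = 0 };

	/* A wakeup path could peek without disturbing the RR order ... */
	if (toy_rq_peek_ready_rr(&rq))
		toy_rq_select_entity_rr(&rq);	/* ... and only the work item selects. */
	return 0;
}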
On 2023-11-07 06:48, Matthew Brost wrote:
> On Mon, Nov 06, 2023 at 11:10:21PM -0500, Luben Tuikov wrote:
>> [...]
>>
>> Signed-off-by: Luben Tuikov <ltuikov89@gmail.com>
>> Fixes: f7fe64ad0f22ff ("drm/sched: Split free_job into own work item")
>
> Reviewed-by: Matthew Brost <matthew.brost@intel.com>

Thank you, sir!
On 2023-11-07 12:53, Danilo Krummrich wrote:
> On 11/7/23 05:10, Luben Tuikov wrote:
>> [...]
>>
>> Signed-off-by: Luben Tuikov <ltuikov89@gmail.com>
>> Fixes: f7fe64ad0f22ff ("drm/sched: Split free_job into own work item")
>
> Reviewed-by: Danilo Krummrich <dakr@redhat.com>

Thank you, sir!
Hi,

On 2023-11-07 19:41, Danilo Krummrich wrote:
> On 11/7/23 05:10, Luben Tuikov wrote:
>> [...]
>
> Hm, now that I rebase my patch to implement dynamic job-flow control I recognize that
> we probably need the peek semantics here. If we do not select an entity here, we also
> do not check whether the corresponding job fits on the ring.
>
> Alternatively, we simply can't do this check in drm_sched_wakeup(). The consequence would
> be that we don't detect that we need to wait for credits to free up before the run work is
> already executing and the run work selects an entity.

So I rebased v5 on top of the latest drm-misc-next, and looked around and found out that
drm_sched_wakeup() is missing drm_sched_entity_is_ready(). It should look like the following:

	void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
			      struct drm_sched_entity *entity)
	{
		if (drm_sched_entity_is_ready(entity))
			if (drm_sched_can_queue(sched, entity))
				drm_sched_run_job_queue(sched);
	}

See the attached patch. (Currently running with base-commit and the attached patch.)
On 11/9/23 07:52, Luben Tuikov wrote:
> Hi,
>
> On 2023-11-07 19:41, Danilo Krummrich wrote:
>> On 11/7/23 05:10, Luben Tuikov wrote:
>>> [...]
>>
>> Hm, now that I rebase my patch to implement dynamic job-flow control I recognize that
>> we probably need the peek semantics here. If we do not select an entity here, we also
>> do not check whether the corresponding job fits on the ring.
>>
>> Alternatively, we simply can't do this check in drm_sched_wakeup(). The consequence would
>> be that we don't detect that we need to wait for credits to free up before the run work is
>> already executing and the run work selects an entity.
>
> So I rebased v5 on top of the latest drm-misc-next, and looked around and found out that
> drm_sched_wakeup() is missing drm_sched_entity_is_ready(). It should look like the following:

Yeah, but that's just the consequence of re-basing it onto Tvrtko's patch.

My point is that by removing drm_sched_select_entity() from drm_sched_run_job_queue() we not only
lose the check whether the selected entity is ready, but also whether we have enough credits to
actually run a new job. This can lead to queuing up work that does nothing but call
drm_sched_select_entity() and return. By peeking the entity we could know this *before* scheduling
work and hence avoid some CPU scheduler overhead.

However, since this patch already landed, and we can fail the same way if the selected entity isn't
ready, I don't consider this to be a blocker for the credit patch, hence I will send out a v6.

> 	void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
> 			      struct drm_sched_entity *entity)
> 	{
> 		if (drm_sched_entity_is_ready(entity))
> 			if (drm_sched_can_queue(sched, entity))
> 				drm_sched_run_job_queue(sched);
> 	}
>
> See the attached patch. (Currently running with base-commit and the attached patch.)
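To make the trade-off described above concrete, here is a small sketch under hypothetical toy_* names (not the kernel code): checking readiness and credits before queuing avoids scheduling a work item whose only effect is to find nothing to do and return.

/*
 * Hypothetical sketch of the two wakeup orderings discussed above.
 * Not the drm/sched code; the fields stand in for entity readiness and
 * ring-buffer credits.
 */
#include <stdbool.h>
#include <stdio.h>

struct toy_sched {
	bool entity_ready;	/* the signalled entity has a runnable job */
	int credits_free;	/* free ring credits */
	int credits_needed;	/* credits the next job would consume */
};

/* The run-job work re-checks everything before submitting. */
static void toy_run_job_work(struct toy_sched *s)
{
	if (!s->entity_ready || s->credits_needed > s->credits_free) {
		puts("work item ran and found nothing to do");	/* wasted wakeup */
		return;
	}
	puts("job submitted");
}

/* Ordering A: queue unconditionally and let the work item sort it out. */
static void toy_wakeup_unchecked(struct toy_sched *s)
{
	toy_run_job_work(s);		/* stands in for queue_work() */
}

/* Ordering B: check readiness and credits first, so a pointless work item
 * is never scheduled at all.
 */
static void toy_wakeup_checked(struct toy_sched *s)
{
	if (s->entity_ready && s->credits_needed <= s->credits_free)
		toy_run_job_work(s);	/* stands in for queue_work() */
}

int main(void)
{
	struct toy_sched s = { .entity_ready = true, .credits_free = 2, .credits_needed = 8 };

	toy_wakeup_unchecked(&s);	/* work runs, does nothing useful */
	toy_wakeup_checked(&s);		/* nothing queued, no overhead */
	return 0;
}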
On 11/9/23 20:24, Danilo Krummrich wrote:
> On 11/9/23 07:52, Luben Tuikov wrote:
>> [...]
>>
>> So I rebased v5 on top of the latest drm-misc-next, and looked around and found out that
>> drm_sched_wakeup() is missing drm_sched_entity_is_ready(). It should look like the following:
>
> Yeah, but that's just the consequence of re-basing it onto Tvrtko's patch.
>
> My point is that by removing drm_sched_select_entity() from drm_sched_run_job_queue() we not only
> lose the check whether the selected entity is ready, but also whether we have enough credits to
> actually run a new job. This can lead to queuing up work that does nothing but call
> drm_sched_select_entity() and return.

Ok, I see it now. We don't need to peek, we know the entity at drm_sched_wakeup().

However, the missing drm_sched_entity_is_ready() check should have been added already when
drm_sched_select_entity() was removed. Gonna send a fix for that as well.

- Danilo

> By peeking the entity we could know this *before* scheduling work and hence avoid some CPU
> scheduler overhead.
>
> However, since this patch already landed, and we can fail the same way if the selected entity
> isn't ready, I don't consider this to be a blocker for the credit patch, hence I will send out
> a v6.
>
>> 	void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
>> 			      struct drm_sched_entity *entity)
>> 	{
>> 		if (drm_sched_entity_is_ready(entity))
>> 			if (drm_sched_can_queue(sched, entity))
>> 				drm_sched_run_job_queue(sched);
>> 	}
>>
>> See the attached patch. (Currently running with base-commit and the attached patch.)
On 2023-11-09 18:41, Danilo Krummrich wrote:
> On 11/9/23 20:24, Danilo Krummrich wrote:
>> On 11/9/23 07:52, Luben Tuikov wrote:
>>> [...]
>>
>> My point is that by removing drm_sched_select_entity() from drm_sched_run_job_queue() we not only
>> lose the check whether the selected entity is ready, but also whether we have enough credits to
>> actually run a new job. This can lead to queuing up work that does nothing but call
>> drm_sched_select_entity() and return.
>
> Ok, I see it now. We don't need to peek, we know the entity at drm_sched_wakeup().
>
> However, the missing drm_sched_entity_is_ready() check should have been added already when
> drm_sched_select_entity() was removed. Gonna send a fix for that as well.

Let me do that, since I added it to your patch. Then you can rebase your credits patch onto mine.
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 27843e37d9b769..cd0dc3f81d05f0 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -256,10 +256,10 @@ drm_sched_rq_select_entity_fifo(struct drm_sched_rq *rq)
 }
 
 /**
- * __drm_sched_run_job_queue - enqueue run-job work
+ * drm_sched_run_job_queue - enqueue run-job work
  * @sched: scheduler instance
  */
-static void __drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
+static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
 {
 	if (!READ_ONCE(sched->pause_submit))
 		queue_work(sched->submit_wq, &sched->work_run_job);
@@ -928,7 +928,7 @@ static bool drm_sched_can_queue(struct drm_gpu_scheduler *sched)
 void drm_sched_wakeup(struct drm_gpu_scheduler *sched)
 {
 	if (drm_sched_can_queue(sched))
-		__drm_sched_run_job_queue(sched);
+		drm_sched_run_job_queue(sched);
 }
 
 /**
@@ -1040,16 +1040,6 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list,
 }
 EXPORT_SYMBOL(drm_sched_pick_best);
 
-/**
- * drm_sched_run_job_queue - enqueue run-job work if there are ready entities
- * @sched: scheduler instance
- */
-static void drm_sched_run_job_queue(struct drm_gpu_scheduler *sched)
-{
-	if (drm_sched_select_entity(sched))
-		__drm_sched_run_job_queue(sched);
-}
-
 /**
  * drm_sched_free_job_work - worker to call free_job
  *
Don't call drm_sched_select_entity() in drm_sched_run_job_queue(). In fact,
rename __drm_sched_run_job_queue() to just drm_sched_run_job_queue(), and let
it do just that, schedule the work item for execution.

The problem is that drm_sched_run_job_queue() calls drm_sched_select_entity()
to determine if the scheduler has an entity ready in one of its run-queues,
and in the case of the Round-Robin (RR) scheduling, the function
drm_sched_rq_select_entity_rr() does just that, selects the _next_ entity
which is ready, sets up the run-queue and completion and returns that
entity. The FIFO scheduling algorithm is unaffected.

Now, since drm_sched_run_job_work() also calls drm_sched_select_entity(), then
in the case of RR scheduling, that would result in drm_sched_select_entity()
having been called twice, which may result in skipping a ready entity if more
than one entity is ready. This commit fixes this by eliminating the call to
drm_sched_select_entity() from drm_sched_run_job_queue(), and leaves it only
in drm_sched_run_job_work().

v2: Rebased on top of Tvrtko's renames series of patches. (Luben)
    Add fixes-tag. (Tvrtko)

Signed-off-by: Luben Tuikov <ltuikov89@gmail.com>
Fixes: f7fe64ad0f22ff ("drm/sched: Split free_job into own work item")
---
 drivers/gpu/drm/scheduler/sched_main.c | 16 +++-------------
 1 file changed, 3 insertions(+), 13 deletions(-)

base-commit: 27d9620e9a9a6bc27a646b464b85860d91e21af3
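As a closing illustration of the failure mode the commit message describes, here is a self-contained sketch with hypothetical toy_* names (not the drm/sched code): because the RR selector advances its cursor as a side effect, selecting once from the queueing path and again from the run-job work consumes two ready entities per wakeup, so one of them can be skipped.

/*
 * Hypothetical, self-contained illustration of the double-selection problem
 * described in the commit message. Not the drm/sched code; toy_* names are
 * made up for illustration.
 */
#include <stdio.h>

#define TOY_NUM_ENTITIES 3

struct toy_rq {
	int ready[TOY_NUM_ENTITIES];	/* 1 = entity has a job queued */
	int cur;			/* round-robin cursor */
};

/* Stand-in for the RR selector: returns the next ready entity and
 * advances the cursor.
 */
static int toy_select_entity_rr(struct toy_rq *rq)
{
	for (int i = 1; i <= TOY_NUM_ENTITIES; i++) {
		int idx = (rq->cur + i) % TOY_NUM_ENTITIES;

		if (rq->ready[idx]) {
			rq->cur = idx;
			return idx;
		}
	}
	return -1;
}

int main(void)
{
	struct toy_rq rq = { .ready = { 1, 1, 1 }, .cur = TOY_NUM_ENTITIES - 1 };

	/* Before the fix: the queueing path selects entity 0 merely to decide
	 * whether to queue the work item ...
	 */
	int first = toy_select_entity_rr(&rq);

	/* ... and the run-job work selects again and gets entity 1, so
	 * entity 0's turn is skipped even though it was ready.
	 */
	int second = toy_select_entity_rr(&rq);

	printf("queue path picked %d, run work picked %d\n", first, second);
	return 0;
}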