Message ID: 20250331201705.60663-1-tvrtko.ursulin@igalia.com (mailing list archive)
Series: Deadline DRM scheduler
Adding Leo since that is especially interesting for our multimedia engines.

@Leo could you spare someone to test and maybe review this?

On 31.03.25 at 22:16, Tvrtko Ursulin wrote:
> This is similar to v2 but I dropped some patches (for now) and added some new
> ones. Most notably, deadline scaling based on queue depth appears to be able to
> add a little bit of fairness with spammy clients (deep submission queue).
>
> As such, on the high level the main advantages of the series are:
>
> 1. Code simplification - no more multiple run queues.
> 2. Scheduling quality - schedules better than FIFO.
> 3. No more RR is even more code simplification, but this one needs to be tested
> and approved by someone who actually uses RR.
>
> In the future further simplifications and improvements should be possible on top
> of this work. But for now I keep it simple.
>
> The first patch adds some unit tests which allow for easy evaluation of scheduling
> behaviour against different client submission patterns. From there onwards it is
> a hopefully natural progression of patches (or close) to the end result, which is
> a slightly more fair scheduler than FIFO.
>
> Regarding the submission patterns tested, it is always two parallel clients
> and they broadly cover these categories:
>
> * Deep queue clients
> * Hogs versus interactive
> * Priority handling

First of all, impressive piece of work.

> Let's look at the results:
>
> 1. Two normal priority deep queue clients.
>
> These ones submit one second's worth of 8ms jobs, as fast as they can, no
> dependencies etc. There is no difference in runtime between FIFO and qddl, but
> the latter allows both clients to progress with work more evenly:
>
> https://people.igalia.com/tursulin/drm-sched-qddl/normal-normal.png
>
> (X axis is time, Y is submitted queue-depth, hence lowering of qd corresponds
> with work progress for both clients, tested with both schedulers separately.)

This was basically the killer argument why we implemented FIFO in the first place.
RR completely sucked on fairness when you have many clients submitting many small jobs.

Looks like the deadline scheduler is even better than FIFO in that regard, but I would also add a test with (for example) 100 clients doing submissions at the same time.

> 2. Same two clients but one is now low priority.
>
> https://people.igalia.com/tursulin/drm-sched-qddl/normal-low.png
>
> Normal priority client is a solid line, low priority dotted. We can see how FIFO
> completely starves the low priority client until the normal priority one is fully
> done. Only then does the low priority client get any GPU time.
>
> In contrast, qddl allows some GPU time to the low priority client.
>
> 3. Same clients but now high versus normal priority.
>
> Similar behaviour as in the previous one, with normal a bit less de-prioritised
> relative to high than low was against normal.
>
> https://people.igalia.com/tursulin/drm-sched-qddl/high-normal.png
>
> 4. Heavy load vs interactive client.
>
> Heavy client emits a 75% GPU load in the format of 3x 2.5ms jobs followed by a
> 2.5ms wait.
>
> Interactive client emits a 10% GPU load in the format of 1x 1ms job followed
> by a 9ms wait.
>
> This simulates an interactive graphical client used on top of a relatively heavy
> background load but no GPU oversubscription.
>
> Graphs show the interactive client only and, from now on, instead of looking at
> the client's queue depth, we look at its "fps".
>
> https://people.igalia.com/tursulin/drm-sched-qddl/heavy-interactive.png
>
> We can see that qddl allows a slightly higher fps for the interactive client,
> which is good.

The most interesting question for this is: what is the maximum frame time?

E.g. how long does the user need to wait for a response from the interactive client at maximum?

Thanks,
Christian.

> 5. Low priority GPU hog versus heavy-interactive.
>
> Low priority client: 3x 2.5ms jobs followed by a 0.5ms wait.
> Interactive client: 1x 0.5ms job followed by a 10ms wait.
>
> https://people.igalia.com/tursulin/drm-sched-qddl/lowhog-interactive.png
>
> No difference between the schedulers.
>
> 6. The last set of test scenarios has three subgroups.
>
> In all cases we have two interactive (synchronous, single job at a time) clients
> with a 50% "duty cycle" GPU time usage.
>
> Client 1: 1.5ms job + 1.5ms wait (aka short bursty)
> Client 2: 2.5ms job + 2.5ms wait (aka long bursty)
>
> a) Both normal priority.
>
> https://people.igalia.com/tursulin/drm-sched-qddl/5050-short.png
> https://people.igalia.com/tursulin/drm-sched-qddl/5050-long.png
>
> Both schedulers favour the higher frequency duty cycle, with qddl giving it a
> little bit more, which should be good for interactivity.
>
> b) Normal vs low priority.
>
> https://people.igalia.com/tursulin/drm-sched-qddl/5050-normal-low-normal.png
> https://people.igalia.com/tursulin/drm-sched-qddl/5050-normal-low-low.png
>
> Qddl gives a bit more to normal than to low.
>
> c) High vs normal priority.
>
> https://people.igalia.com/tursulin/drm-sched-qddl/5050-high-normal-high.png
> https://people.igalia.com/tursulin/drm-sched-qddl/5050-high-normal-normal.png
>
> Again, qddl gives a bit more share to the higher priority client.
>
> Overall, qddl looks like a potential improvement in terms of fairness,
> especially avoiding priority starvation. There do not appear to be any
> regressions with the tested workloads.
>
> As before, I am looking for feedback and ideas for what kind of submission
> scenarios to test. Testers on different GPUs would be very welcome too.
>
> And I should probably test round-robin at some point, to see whether we are
> okay to drop it unconditionally, or whether further work improving qddl would
> be needed.
>
> v2:
> * Fixed many rebase errors.
> * Added some new patches.
> * Dropped single shot dependency handling.
>
> v3:
> * Added scheduling quality unit tests.
> * Refined a tiny bit by adding some fairness.
> * Dropped a few patches for now.
> > Cc: Christian König <christian.koenig@amd.com> > Cc: Danilo Krummrich <dakr@redhat.com> > Cc: Matthew Brost <matthew.brost@intel.com> > Cc: Philipp Stanner <pstanner@redhat.com> > Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> > Cc: Michel Dänzer <michel.daenzer@mailbox.org> > > Tvrtko Ursulin (14): > drm/sched: Add some scheduling quality unit tests > drm/sched: Avoid double re-lock on the job free path > drm/sched: Consolidate drm_sched_job_timedout > drm/sched: Clarify locked section in drm_sched_rq_select_entity_fifo > drm/sched: Consolidate drm_sched_rq_select_entity_rr > drm/sched: Implement RR via FIFO > drm/sched: Consolidate entity run queue management > drm/sched: Move run queue related code into a separate file > drm/sched: Add deadline policy > drm/sched: Remove FIFO and RR and simplify to a single run queue > drm/sched: Queue all free credits in one worker invocation > drm/sched: Embed run queue singleton into the scheduler > drm/sched: De-clutter drm_sched_init > drm/sched: Scale deadlines depending on queue depth > > drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 6 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 27 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_job.h | 5 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 8 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 8 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c | 8 +- > drivers/gpu/drm/scheduler/Makefile | 2 +- > drivers/gpu/drm/scheduler/sched_entity.c | 121 ++-- > drivers/gpu/drm/scheduler/sched_fence.c | 2 +- > drivers/gpu/drm/scheduler/sched_internal.h | 17 +- > drivers/gpu/drm/scheduler/sched_main.c | 581 ++++-------------- > drivers/gpu/drm/scheduler/sched_rq.c | 188 ++++++ > drivers/gpu/drm/scheduler/tests/Makefile | 3 +- > .../gpu/drm/scheduler/tests/tests_scheduler.c | 548 +++++++++++++++++ > include/drm/gpu_scheduler.h | 17 +- > 15 files changed, 962 insertions(+), 579 deletions(-) > create mode 100644 drivers/gpu/drm/scheduler/sched_rq.c > create mode 100644 
drivers/gpu/drm/scheduler/tests/tests_scheduler.c >
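[For readers wanting a feel for what a deadline ("qddl") policy means in practice, here is a toy model. All names are hypothetical and the deadline formula is a guessed illustration, not the actual drm_sched code from the patches: every submission gets a virtual deadline, the scheduler always runs the earliest deadline, and scaling the deadline slack by the current queue depth pushes a spammy client's later jobs back so a light client can interleave.]

```python
import heapq
import itertools

class Entity:
    """A client submission queue with a priority-derived deadline slack
    (hypothetical model, not the drm_sched_entity structure)."""
    def __init__(self, name, slack_us):
        self.name = name
        self.slack_us = slack_us  # higher priority -> smaller slack -> earlier deadlines
        self.queued = 0

counter = itertools.count()  # tie-breaker so equal deadlines pop in submit order

def submit(run_queue, entity, now_us):
    """Assign a virtual deadline on submission; scale by queue depth so a
    deep queue gets progressively later deadlines (the fairness tweak)."""
    entity.queued += 1
    deadline = now_us + entity.slack_us * entity.queued
    heapq.heappush(run_queue, (deadline, next(counter), entity))

def pick_next(run_queue):
    """Pop the job with the earliest virtual deadline."""
    _deadline, _tie, entity = heapq.heappop(run_queue)
    entity.queued -= 1
    return entity

# Two clients submitting at the same time; the spammy one has four queued
# jobs, so its later submissions get pushed back and the clients interleave
# instead of the light client waiting behind the whole deep queue.
rq = []
a, b = Entity("spammy", 1000), Entity("light", 1000)
for _ in range(4):
    submit(rq, a, now_us=0)
submit(rq, b, now_us=0)
order = [pick_next(rq).name for _ in range(5)]
print(order)  # ['spammy', 'light', 'spammy', 'spammy', 'spammy']
```

With strict submit-time FIFO the light client's single job would have had to compete on submission timestamps alone; the depth-scaled deadline is what lets it slot in after only one of the spammy client's jobs.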
On 02/04/2025 07:49, Christian König wrote:
> Adding Leo since that is especially interesting for our multimedia engines.
>
> @Leo could you spare someone to test and maybe review this?
>
> On 31.03.25 at 22:16, Tvrtko Ursulin wrote:
>> This is similar to v2 but I dropped some patches (for now) and added some new
>> ones. Most notably, deadline scaling based on queue depth appears to be able to
>> add a little bit of fairness with spammy clients (deep submission queue).
>>
>> As such, on the high level the main advantages of the series are:
>>
>> 1. Code simplification - no more multiple run queues.
>> 2. Scheduling quality - schedules better than FIFO.
>> 3. No more RR is even more code simplification, but this one needs to be tested
>> and approved by someone who actually uses RR.
>>
>> In the future further simplifications and improvements should be possible on top
>> of this work. But for now I keep it simple.
>>
>> The first patch adds some unit tests which allow for easy evaluation of scheduling
>> behaviour against different client submission patterns. From there onwards it is
>> a hopefully natural progression of patches (or close) to the end result, which is
>> a slightly more fair scheduler than FIFO.
>>
>> Regarding the submission patterns tested, it is always two parallel clients
>> and they broadly cover these categories:
>>
>> * Deep queue clients
>> * Hogs versus interactive
>> * Priority handling
>
> First of all, impressive piece of work.

Thank you!

I am not super happy though, since what would be much better is some sort of a CFS. But to do that would require cracking the entity GPU time tracking problem. That I have tried twice so far, failing to find a generic, elegant and not too intrusive solution.

>> Let's look at the results:
>>
>> 1. Two normal priority deep queue clients.
>>
>> These ones submit one second's worth of 8ms jobs, as fast as they can, no
>> dependencies etc.
There is no difference in runtime between FIFO and qddl but >> the latter allows both clients to progress with work more evenly: >> >> https://people.igalia.com/tursulin/drm-sched-qddl/normal-normal.png >> >> (X axis is time, Y is submitted queue-depth, hence lowering of qd corresponds >> with work progress for both clients, tested with both schedulers separately.) > > This was basically the killer argument why we implemented FIFO in the first place. RR completely sucked on fairness when you have many clients submitting many small jobs. > > Looks like that the deadline scheduler is even better than FIFO in that regard, but I would also add a test with (for example) 100 clients doing submissions at the same time. I can try that. So 100 clients with very deep submission queues? How deep? Fully async? Or some synchronicity and what kind? >> 2. Same two clients but one is now low priority. >> >> https://people.igalia.com/tursulin/drm-sched-qddl/normal-low.png >> >> Normal priority client is a solid line, low priority dotted. We can see how FIFO >> completely starves the low priority client until the normal priority is fully >> done. Only then the low priority client gets any GPU time. >> >> In constrast, qddl allows some GPU time to the low priority client. >> >> 3. Same clients but now high versus normal priority. >> >> Similar behaviour as in the previous one with normal a bit less de-prioritised >> relative to high, than low was against normal. >> >> https://people.igalia.com/tursulin/drm-sched-qddl/high-normal.png >> >> 4. Heavy load vs interactive client. >> >> Heavy client emits a 75% GPU load in the format of 3x 2.5ms jobs followed by a >> 2.5ms wait. >> >> Interactive client emites a 10% GPU load in the format of 1x 1ms job followed >> by a 9ms wait. >> >> This simulates an interactive graphical client used on top of a relatively heavy >> background load but no GPU oversubscription. 
>>
>> Graphs show the interactive client only and, from now on, instead of looking at
>> the client's queue depth, we look at its "fps".
>>
>> https://people.igalia.com/tursulin/drm-sched-qddl/heavy-interactive.png
>>
>> We can see that qddl allows a slightly higher fps for the interactive client,
>> which is good.
>
> The most interesting question for this is: what is the maximum frame time?
>
> E.g. how long does the user need to wait for a response from the interactive client at maximum?

I did a quick measurement of those metrics, for this workload only.

I measured the time from submit of the first job in the group (so frame) to the time the last job in the group finished, and then subtracted the expected job durations to get just the wait plus overheads latency.

Five averaged runs:

        min    avg    max  [ms]
FIFO    2.5  13.14   18.3
qddl    3.2    9.9   16.6

So it is a bit better in max latency, and more so in average latency. The question is how representative this synthetic workload is of the real world.

Regards,

Tvrtko

>> 5. Low priority GPU hog versus heavy-interactive.
>>
>> Low priority client: 3x 2.5ms jobs followed by a 0.5ms wait.
>> Interactive client: 1x 0.5ms job followed by a 10ms wait.
>>
>> https://people.igalia.com/tursulin/drm-sched-qddl/lowhog-interactive.png
>>
>> No difference between the schedulers.
>>
>> 6. The last set of test scenarios has three subgroups.
>>
>> In all cases we have two interactive (synchronous, single job at a time) clients
>> with a 50% "duty cycle" GPU time usage.
>>
>> Client 1: 1.5ms job + 1.5ms wait (aka short bursty)
>> Client 2: 2.5ms job + 2.5ms wait (aka long bursty)
>>
>> a) Both normal priority.
>>
>> https://people.igalia.com/tursulin/drm-sched-qddl/5050-short.png
>> https://people.igalia.com/tursulin/drm-sched-qddl/5050-long.png
>>
>> Both schedulers favour the higher frequency duty cycle, with qddl giving it a
>> little bit more, which should be good for interactivity.
>>
>> b) Normal vs low priority.
>> >> https://people.igalia.com/tursulin/drm-sched-qddl/5050-normal-low-normal.png >> https://people.igalia.com/tursulin/drm-sched-qddl/5050-normal-low-low.png >> >> Qddl gives a bit more to the normal than low. >> >> c) High vs normal priority. >> >> https://people.igalia.com/tursulin/drm-sched-qddl/5050-high-normal-high.png >> https://people.igalia.com/tursulin/drm-sched-qddl/5050-high-normal-normal.png >> >> Again, qddl gives a bit more share to the higher priority client. >> >> On the overall qddl looks like a potential improvement in terms of fairness, >> especially avoiding priority starvation. There do not appear to be any >> regressions with the tested workloads. >> >> As before, I am looking for feedback, ideas for what kind of submission >> scenarios to test. Testers on different GPUs would be very welcome too. >> >> And I should probably test round-robin at some point, to see if we are maybe >> okay to drop unconditionally, it or further work improving qddl would be needed. >> >> v2: >> * Fixed many rebase errors. >> * Added some new patches. >> * Dropped single shot dependecy handling. >> >> v3: >> * Added scheduling quality unit tests. >> * Refined a tiny bit by adding some fairness. >> * Dropped a few patches for now. 
>> >> Cc: Christian König <christian.koenig@amd.com> >> Cc: Danilo Krummrich <dakr@redhat.com> >> Cc: Matthew Brost <matthew.brost@intel.com> >> Cc: Philipp Stanner <pstanner@redhat.com> >> Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> >> Cc: Michel Dänzer <michel.daenzer@mailbox.org> >> >> Tvrtko Ursulin (14): >> drm/sched: Add some scheduling quality unit tests >> drm/sched: Avoid double re-lock on the job free path >> drm/sched: Consolidate drm_sched_job_timedout >> drm/sched: Clarify locked section in drm_sched_rq_select_entity_fifo >> drm/sched: Consolidate drm_sched_rq_select_entity_rr >> drm/sched: Implement RR via FIFO >> drm/sched: Consolidate entity run queue management >> drm/sched: Move run queue related code into a separate file >> drm/sched: Add deadline policy >> drm/sched: Remove FIFO and RR and simplify to a single run queue >> drm/sched: Queue all free credits in one worker invocation >> drm/sched: Embed run queue singleton into the scheduler >> drm/sched: De-clutter drm_sched_init >> drm/sched: Scale deadlines depending on queue depth >> >> drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 6 +- >> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 27 +- >> drivers/gpu/drm/amd/amdgpu/amdgpu_job.h | 5 +- >> drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 8 +- >> drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 8 +- >> drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c | 8 +- >> drivers/gpu/drm/scheduler/Makefile | 2 +- >> drivers/gpu/drm/scheduler/sched_entity.c | 121 ++-- >> drivers/gpu/drm/scheduler/sched_fence.c | 2 +- >> drivers/gpu/drm/scheduler/sched_internal.h | 17 +- >> drivers/gpu/drm/scheduler/sched_main.c | 581 ++++-------------- >> drivers/gpu/drm/scheduler/sched_rq.c | 188 ++++++ >> drivers/gpu/drm/scheduler/tests/Makefile | 3 +- >> .../gpu/drm/scheduler/tests/tests_scheduler.c | 548 +++++++++++++++++ >> include/drm/gpu_scheduler.h | 17 +- >> 15 files changed, 962 insertions(+), 579 deletions(-) >> create mode 100644 
drivers/gpu/drm/scheduler/sched_rq.c >> create mode 100644 drivers/gpu/drm/scheduler/tests/tests_scheduler.c >> >
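[The latency measurement method Tvrtko describes above — first-submit to last-finish per frame group, minus the expected busy time — can be written down as a small helper. The job timestamps below are made-up illustrative values, not data from the test suite.]

```python
# Sketch of the latency metric described in the thread: for each frame group,
# latency = (finish of last job - submit of first job) - expected busy time.

def group_latency_ms(submits_ms, finishes_ms, expected_busy_ms):
    """Wait-plus-overhead latency for one group of jobs (one 'frame')."""
    return (max(finishes_ms) - min(submits_ms)) - expected_busy_ms

# Interactive client from test scenario 4: one 1ms job per frame.
# (submit times, finish times, expected GPU-busy time) -- invented numbers.
frames = [
    ([0.0],  [3.5],  1.0),   # submitted at t=0, finished at t=3.5
    ([10.0], [12.2], 1.0),
    ([20.0], [38.0], 1.0),   # a badly delayed frame
]
lat = [group_latency_ms(*f) for f in frames]
print(min(lat), sum(lat) / len(lat), max(lat))
```

Reporting min/avg/max over such per-frame latencies is exactly the shape of the FIFO-vs-qddl table in the reply above; the max column is the "how long does the user wait at worst" number Christian asked about.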
Alright alright, uff, so this will be a lot to discuss… (btw. my chaotic email filters had me search the cover-letter for a while, which seems to have gone to my RH addr, with the others going to kernel.org – I need a better mail client) On Mon, 2025-03-31 at 21:16 +0100, Tvrtko Ursulin wrote: > This is similar to v2 but I dropped some patches (for now) and added > some new > ones. Most notably deadline scaling based on queue depth appears to > be able to > add a little bit of fairness with spammy clients (deep submission > queue). > > As such, on the high level main advantages of the series: > > 1. Code simplification - no more multiple run queues. > 2. Scheduling quality - schedules better than FIFO. > 3. No more RR is even more code simplification but this one needs to > be tested > and approved by someone who actually uses RR. > > In the future futher simplifactions and improvements should be > possible on top > of this work. But for now I keep it simple. > > First patch adds some unit tests which allow for easy evaluation of > scheduling > behaviour against different client submission patterns. From there > onwards it is > a hopefully natural progression of patches (or close) to the end > result which is > a slightly more fair scheduler than FIFO. > > Regarding the submission patterns tested, it is always two parallel > clients > and they broadly cover these categories: > > * Deep queue clients > * Hogs versus interactive > * Priority handling > > Lets look at the results: > > 1. Two normal priority deep queue clients. > > These ones submit one second worth of 8ms jobs. As fast as they can, > no > dependencies etc. 
There is no difference in runtime between FIFO and > qddl but > the latter allows both clients to progress with work more evenly: > > https://people.igalia.com/tursulin/drm-sched-qddl/normal-normal.png I read qd surprisingly long as dp and was confused :) And the title reads "Normal vs Normal" > > (X axis is time, Y is submitted queue-depth, hence lowering of qd > corresponds > with work progress for both clients, tested with both schedulers > separately.) > > 2. Same two clients but one is now low priority. > > https://people.igalia.com/tursulin/drm-sched-qddl/normal-low.png > > Normal priority client is a solid line, low priority dotted. We can > see how FIFO > completely starves the low priority client until the normal priority > is fully > done. Only then the low priority client gets any GPU time. > > In constrast, qddl allows some GPU time to the low priority client. > > 3. Same clients but now high versus normal priority. > > Similar behaviour as in the previous one with normal a bit less de- > prioritised > relative to high, than low was against normal. > > https://people.igalia.com/tursulin/drm-sched-qddl/high-normal.png > > 4. Heavy load vs interactive client. > > Heavy client emits a 75% GPU load in the format of 3x 2.5ms jobs > followed by a > 2.5ms wait. > > Interactive client emites a 10% GPU load in the format of 1x 1ms job > followed > by a 9ms wait. > > This simulates an interactive graphical client used on top of a > relatively heavy > background load but no GPU oversubscription. > > Graphs show the interactive client only and from now on, instead of > looking at > the client's queue depth, we look at its "fps". > > https://people.igalia.com/tursulin/drm-sched-qddl/heavy-interactive.png > > We can see that qddl allows a slighty higher fps for the interactive > client > which is good. > > 5. Low priority GPU hog versus heavy-interactive. > > Low priority client: 3x 2.5ms jobs client followed by a 0.5ms wait. 
> Interactive client: 1x 0.5ms job followed by a 10ms wait. > > https://people.igalia.com/tursulin/drm-sched-qddl/lowhog-interactive.png > > No difference between the schedulers. > > 6. Last set of test scenarios will have three subgroups. > > In all cases we have two interactive (synchronous, single job at a > time) clients > with a 50% "duty cycle" GPU time usage. > > Client 1: 1.5ms job + 1.5ms wait (aka short bursty) > Client 2: 2.5ms job + 2.5ms wait (aka long bursty) > > a) Both normal priority. > > https://people.igalia.com/tursulin/drm-sched-qddl/5050-short.png > https://people.igalia.com/tursulin/drm-sched-qddl/5050-long.png > > Both schedulers favour the higher frequency duty cycle with qddl > giving it a > little bit more which should be good for interactivity. > > b) Normal vs low priority. > > https://people.igalia.com/tursulin/drm-sched-qddl/5050-normal-low-normal.png > https://people.igalia.com/tursulin/drm-sched-qddl/5050-normal-low-low.png > > Qddl gives a bit more to the normal than low. > > c) High vs normal priority. > > https://people.igalia.com/tursulin/drm-sched-qddl/5050-high-normal-high.png > https://people.igalia.com/tursulin/drm-sched-qddl/5050-high-normal-normal.png > > Again, qddl gives a bit more share to the higher priority client. > > On the overall qddl looks like a potential improvement in terms of > fairness, > especially avoiding priority starvation. There do not appear to be > any > regressions with the tested workloads. OK, so the above seems to be the measurements serving as a justification for the major rework. I suppose this was tested on a workstation with AMD card? > > As before, I am looking for feedback, ideas for what kind of > submission > scenarios to test. Testers on different GPUs would be very welcome > too. > > And I should probably test round-robin at some point, to see if we > are maybe > okay to drop unconditionally, it or further work improving qddl would > be needed. > > v2: > * Fixed many rebase errors. 
> * Added some new patches. > * Dropped single shot dependecy handling. > > v3: > * Added scheduling quality unit tests. > * Refined a tiny bit by adding some fairness. > * Dropped a few patches for now. > > Cc: Christian König <christian.koenig@amd.com> > Cc: Danilo Krummrich <dakr@redhat.com> > Cc: Matthew Brost <matthew.brost@intel.com> > Cc: Philipp Stanner <pstanner@redhat.com> > Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com> > Cc: Michel Dänzer <michel.daenzer@mailbox.org> > > Tvrtko Ursulin (14): There appear to be, again, some candidates which should be branched out since they're independent from the scheduling policy. Suspicious ones are: > drm/sched: Add some scheduling quality unit tests > drm/sched: Avoid double re-lock on the job free path > drm/sched: Consolidate drm_sched_job_timedout this one > drm/sched: Clarify locked section in > drm_sched_rq_select_entity_fifo ^ > drm/sched: Consolidate drm_sched_rq_select_entity_rr > drm/sched: Implement RR via FIFO > drm/sched: Consolidate entity run queue management > drm/sched: Move run queue related code into a separate file > drm/sched: Add deadline policy > drm/sched: Remove FIFO and RR and simplify to a single run queue > drm/sched: Queue all free credits in one worker invocation > drm/sched: Embed run queue singleton into the scheduler > drm/sched: De-clutter drm_sched_init ^ > drm/sched: Scale deadlines depending on queue depth > > drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 6 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 27 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_job.h | 5 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h | 8 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c | 8 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c | 8 +- Alright, so here we have our lovely AMD cards :) Here's a thing: Back in the day, we did a seemingly harmless change of the scheduler's default scheduling policy, from RR to FIFO. 
From all the arguments I have seen in the git log and the archives, this was one of the typical changes where the developer is tempted to think "This *cannot* break anything, it makes everything _better_ and has only advantages". But it turned out that it did break things and resulted in lengthy and interesting discussions here:

https://gitlab.freedesktop.org/drm/amd/-/issues/2516

Now, from time to time it is worth breaking things for progress. So the possibility that things *could* break even with the best improvements is not really an argument.

It becomes an argument, however, when we consider the following: AMD cards are the major beneficiary of this change, gaining X percent of performance and fairness. The reason is that AMD seems to be the most important party that has hardware scheduling. For Nouveau and Xe, that deadline policy won't do anything. And, if I understood correctly, new AMD generations will move to firmware scheduling, too.

So that raises the question: is it worth it?

If I am not mistaken, the firmware scheduler stakeholders largely shouldn't feel an impact when drm_sched changes its scheduling behavior, since the firmware does what it wants anyway. But there are some other drivers that use hardware scheduling, and this (and a remaining gut-level unease about firmware schedulers) makes me nervous about touching the global scheduling behavior.

It appears to me that if AMD cards are the major beneficiary, and it is deemed that the performance gain is really worth it for the older and current generations, then we should at least explore / consider paths toward changing the scheduling behavior for AMD only.

But, like you, I am also interested in hearing from the other drivers. And correct me on any wrong assumptions above.

P.
> drivers/gpu/drm/scheduler/Makefile | 2 +- > drivers/gpu/drm/scheduler/sched_entity.c | 121 ++-- > drivers/gpu/drm/scheduler/sched_fence.c | 2 +- > drivers/gpu/drm/scheduler/sched_internal.h | 17 +- > drivers/gpu/drm/scheduler/sched_main.c | 581 ++++------------ > -- > drivers/gpu/drm/scheduler/sched_rq.c | 188 ++++++ > drivers/gpu/drm/scheduler/tests/Makefile | 3 +- > .../gpu/drm/scheduler/tests/tests_scheduler.c | 548 > +++++++++++++++++ > include/drm/gpu_scheduler.h | 17 +- > 15 files changed, 962 insertions(+), 579 deletions(-) > create mode 100644 drivers/gpu/drm/scheduler/sched_rq.c > create mode 100644 drivers/gpu/drm/scheduler/tests/tests_scheduler.c >
On 02.04.25 at 10:26, Tvrtko Ursulin wrote:
> On 02/04/2025 07:49, Christian König wrote:
>>
>> First of all, impressive piece of work.
>
> Thank you!
>
> I am not super happy though, since what would be much better is some sort of a CFS. But to do that would require cracking the entity GPU time tracking problem. That I have tried twice so far, failing to find a generic, elegant and not too intrusive solution.

If the numbers you posted hold true, I think this solution here is perfectly sufficient.

Keep in mind that GPU submission scheduling is a bit more complicated than classic I/O scheduling. E.g. you have no idea how much work a GPU submission is until it is completed; for an I/O transfer you know beforehand how many bytes are transferred. That makes tracking things far more complicated.

>
>>> Let's look at the results:
>>>
>>> 1. Two normal priority deep queue clients.
>>>
>>> These ones submit one second's worth of 8ms jobs, as fast as they can, no
>>> dependencies etc. There is no difference in runtime between FIFO and qddl, but
>>> the latter allows both clients to progress with work more evenly:
>>>
>>> https://people.igalia.com/tursulin/drm-sched-qddl/normal-normal.png
>>>
>>> (X axis is time, Y is submitted queue-depth, hence lowering of qd corresponds
>>> with work progress for both clients, tested with both schedulers separately.)
>>
>> This was basically the killer argument why we implemented FIFO in the first place. RR completely sucked on fairness when you have many clients submitting many small jobs.
>>
>> Looks like the deadline scheduler is even better than FIFO in that regard, but I would also add a test with (for example) 100 clients doing submissions at the same time.
>
> I can try that. So 100 clients with very deep submission queues? How deep? Fully async? Or some synchronicity, and what kind?

Not deep queues, more like 4-8 jobs maximum for each. Send all submissions roughly at the same time and with the same priority.
When you have 100 entities each submission from each entity should have ~99 other submissions in between them. Record the minimum and maximum of that value and you should have a good indicator how well the algorithm performs. You can then of course start to make it more complicated, e.g. 50 entities who have 8 submissions, each taking 4ms and 50 other entities who have 4 submissions, each taking 8ms. > >>> 2. Same two clients but one is now low priority. >>> >>> https://people.igalia.com/tursulin/drm-sched-qddl/normal-low.png >>> >>> Normal priority client is a solid line, low priority dotted. We can see how FIFO >>> completely starves the low priority client until the normal priority is fully >>> done. Only then the low priority client gets any GPU time. >>> >>> In constrast, qddl allows some GPU time to the low priority client. >>> >>> 3. Same clients but now high versus normal priority. >>> >>> Similar behaviour as in the previous one with normal a bit less de-prioritised >>> relative to high, than low was against normal. >>> >>> https://people.igalia.com/tursulin/drm-sched-qddl/high-normal.png >>> >>> 4. Heavy load vs interactive client. >>> >>> Heavy client emits a 75% GPU load in the format of 3x 2.5ms jobs followed by a >>> 2.5ms wait. >>> >>> Interactive client emites a 10% GPU load in the format of 1x 1ms job followed >>> by a 9ms wait. >>> >>> This simulates an interactive graphical client used on top of a relatively heavy >>> background load but no GPU oversubscription. >>> >>> Graphs show the interactive client only and from now on, instead of looking at >>> the client's queue depth, we look at its "fps". >>> >>> https://people.igalia.com/tursulin/drm-sched-qddl/heavy-interactive.png >>> >>> We can see that qddl allows a slighty higher fps for the interactive client >>> which is good. >> >> The most interesting question for this is what is the maximum frame time? >> >> E.g. 
how long does the user need to wait for a response from the interactive client at maximum?
>
> I did a quick measurement of those metrics, for this workload only.
>
> I measured the time from submit of the first job in the group (so frame) to the time the last job in the group finished, and then subtracted the expected job durations to get just the wait plus overheads latency.
>
> Five averaged runs:
>
>         min    avg    max  [ms]
> FIFO    2.5  13.14   18.3
> qddl    3.2    9.9   16.6
>
> So it is a bit better in max latency, and more so in average latency. The question is how representative this synthetic workload is of the real world.

Well, if I'm not completely mistaken, that is 9.2% better on max and nearly 24.6% better on average; the min time is negligible as far as I can see. That is more than a bit better.

Keep in mind that we usually deal with interactive GUIs and background worker use cases, which benefit a lot from that stuff.

Regards,
Christian.

>
> Regards,
>
> Tvrtko
>
>>> 5. Low priority GPU hog versus heavy-interactive.
>>>
>>> Low priority client: 3x 2.5ms jobs followed by a 0.5ms wait.
>>> Interactive client: 1x 0.5ms job followed by a 10ms wait.
>>>
>>> https://people.igalia.com/tursulin/drm-sched-qddl/lowhog-interactive.png
>>>
>>> No difference between the schedulers.
>>>
>>> 6. The last set of test scenarios has three subgroups.
>>>
>>> In all cases we have two interactive (synchronous, single job at a time) clients
>>> with a 50% "duty cycle" GPU time usage.
>>>
>>> Client 1: 1.5ms job + 1.5ms wait (aka short bursty)
>>> Client 2: 2.5ms job + 2.5ms wait (aka long bursty)
>>>
>>> a) Both normal priority.
>>>
>>> https://people.igalia.com/tursulin/drm-sched-qddl/5050-short.png
>>> https://people.igalia.com/tursulin/drm-sched-qddl/5050-long.png
>>>
>>> Both schedulers favour the higher frequency duty cycle, with qddl giving it a
>>> little bit more, which should be good for interactivity.
>>>
>>> b) Normal vs low priority.
>>>
>>> https://people.igalia.com/tursulin/drm-sched-qddl/5050-normal-low-normal.png
>>> https://people.igalia.com/tursulin/drm-sched-qddl/5050-normal-low-low.png
>>>
>>> Qddl gives a bit more to normal than to low.
>>>
>>> c) High vs normal priority.
>>>
>>> https://people.igalia.com/tursulin/drm-sched-qddl/5050-high-normal-high.png
>>> https://people.igalia.com/tursulin/drm-sched-qddl/5050-high-normal-normal.png
>>>
>>> Again, qddl gives a bit more share to the higher priority client.
>>>
>>> Overall, qddl looks like a potential improvement in terms of fairness,
>>> especially in avoiding priority starvation. There do not appear to be any
>>> regressions with the tested workloads.
>>>
>>> As before, I am looking for feedback and ideas for what kind of submission
>>> scenarios to test. Testers on different GPUs would be very welcome too.
>>>
>>> And I should probably test round-robin at some point, to see whether we are
>>> maybe okay to drop it unconditionally, or whether further work improving
>>> qddl would be needed.
>>>
>>> v2:
>>> * Fixed many rebase errors.
>>> * Added some new patches.
>>> * Dropped single shot dependency handling.
>>>
>>> v3:
>>> * Added scheduling quality unit tests.
>>> * Refined a tiny bit by adding some fairness.
>>> * Dropped a few patches for now.
>>>
>>> Cc: Christian König <christian.koenig@amd.com>
>>> Cc: Danilo Krummrich <dakr@redhat.com>
>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>> Cc: Philipp Stanner <pstanner@redhat.com>
>>> Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
>>> Cc: Michel Dänzer <michel.daenzer@mailbox.org>
>>>
>>> Tvrtko Ursulin (14):
>>>   drm/sched: Add some scheduling quality unit tests
>>>   drm/sched: Avoid double re-lock on the job free path
>>>   drm/sched: Consolidate drm_sched_job_timedout
>>>   drm/sched: Clarify locked section in drm_sched_rq_select_entity_fifo
>>>   drm/sched: Consolidate drm_sched_rq_select_entity_rr
>>>   drm/sched: Implement RR via FIFO
>>>   drm/sched: Consolidate entity run queue management
>>>   drm/sched: Move run queue related code into a separate file
>>>   drm/sched: Add deadline policy
>>>   drm/sched: Remove FIFO and RR and simplify to a single run queue
>>>   drm/sched: Queue all free credits in one worker invocation
>>>   drm/sched: Embed run queue singleton into the scheduler
>>>   drm/sched: De-clutter drm_sched_init
>>>   drm/sched: Scale deadlines depending on queue depth
>>>
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        |   6 +-
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       |  27 +-
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.h       |   5 +-
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h     |   8 +-
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c   |   8 +-
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c       |   8 +-
>>>  drivers/gpu/drm/scheduler/Makefile            |   2 +-
>>>  drivers/gpu/drm/scheduler/sched_entity.c      | 121 ++--
>>>  drivers/gpu/drm/scheduler/sched_fence.c       |   2 +-
>>>  drivers/gpu/drm/scheduler/sched_internal.h    |  17 +-
>>>  drivers/gpu/drm/scheduler/sched_main.c        | 581 ++++--------------
>>>  drivers/gpu/drm/scheduler/sched_rq.c          | 188 ++++++
>>>  drivers/gpu/drm/scheduler/tests/Makefile      |   3 +-
>>>  .../gpu/drm/scheduler/tests/tests_scheduler.c | 548 +++++++++++++++++
>>>  include/drm/gpu_scheduler.h                   |  17 +-
>>>  15 files changed, 962 insertions(+), 579 deletions(-)
>>>  create mode 100644 drivers/gpu/drm/scheduler/sched_rq.c
>>>  create mode 100644 drivers/gpu/drm/scheduler/tests/tests_scheduler.c
>>
>
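The interleaving metric Christian suggests earlier in the thread (with N entities, each entity's consecutive submissions should have roughly N - 1 other submissions between them) can be sketched as follows. This is an illustrative sketch only; the function name and the job-log format are made up, not drm_sched code:

```python
def interleave_min_max(job_log):
    """job_log: ordered list of entity ids, one per scheduled job.

    For every pair of consecutive jobs from the same entity, count how
    many jobs from other entities ran in between, and return the global
    (min, max). With N entities and perfect fairness both tend to N - 1.
    """
    last_seen = {}
    gaps = []
    for pos, entity in enumerate(job_log):
        if entity in last_seen:
            # Jobs from other entities between two of this entity's jobs.
            gaps.append(pos - last_seen[entity] - 1)
        last_seen[entity] = pos
    return (min(gaps), max(gaps)) if gaps else (None, None)

# Perfect round-robin over 4 entities: every gap is exactly 3.
print(interleave_min_max([0, 1, 2, 3] * 5))             # (3, 3)

# One entity scheduled back-to-back drags the minimum down to 0.
print(interleave_min_max([0, 0, 0, 1, 2, 3, 1, 2, 3]))  # (0, 2)
```

The min/max spread then gives a single pair of numbers to compare schedulers with, independently of the exact submission pattern.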
On 02/04/2025 11:37, Philipp Stanner wrote:
> Alright alright, uff, so this will be a lot to discuss…
>
> (btw. my chaotic email filters had me search the cover-letter for a
> while, which seems to have gone to my RH addr, with the others going to
> kernel.org – I need a better mail client)

My bad.

> On Mon, 2025-03-31 at 21:16 +0100, Tvrtko Ursulin wrote:
>> This is similar to v2 but I dropped some patches (for now) and added some
>> new ones. Most notably, deadline scaling based on queue depth appears to
>> be able to add a little bit of fairness with spammy clients (deep
>> submission queue).
>>
>> As such, on a high level the main advantages of the series are:
>>
>> 1. Code simplification - no more multiple run queues.
>> 2. Scheduling quality - schedules better than FIFO.
>> 3. No more RR is even more code simplification, but this one needs to be
>>    tested and approved by someone who actually uses RR.
>>
>> In the future, further simplifications and improvements should be
>> possible on top of this work. But for now I keep it simple.
>>
>> The first patch adds some unit tests which allow for easy evaluation of
>> scheduling behaviour against different client submission patterns. From
>> there onwards it is a hopefully natural progression of patches (or close)
>> to the end result, which is a slightly more fair scheduler than FIFO.
>>
>> Regarding the submission patterns tested, it is always two parallel
>> clients and they broadly cover these categories:
>>
>> * Deep queue clients
>> * Hogs versus interactive
>> * Priority handling
>>
>> Let's look at the results:
>>
>> 1. Two normal priority deep queue clients.
>>
>> These ones submit one second worth of 8ms jobs, as fast as they can, no
>> dependencies etc.
>> There is no difference in runtime between FIFO and qddl but the latter
>> allows both clients to progress with work more evenly:
>>
>> https://people.igalia.com/tursulin/drm-sched-qddl/normal-normal.png
>
> I read qd surprisingly long as dp and was confused :)
>
> And the title reads "Normal vs Normal"

Yes, two identical DRM_SCHED_PRIORITY_NORMAL clients.

>> (X axis is time, Y is submitted queue-depth, hence the lowering of qd
>> corresponds with work progress for both clients, tested with both
>> schedulers separately.)
>>
>> 2. Same two clients but one is now low priority.
>>
>> https://people.igalia.com/tursulin/drm-sched-qddl/normal-low.png
>>
>> The normal priority client is a solid line, the low priority one dotted.
>> We can see how FIFO completely starves the low priority client until the
>> normal priority one is fully done. Only then does the low priority client
>> get any GPU time.
>>
>> In contrast, qddl allows some GPU time to the low priority client.
>>
>> 3. Same clients but now high versus normal priority.
>>
>> Similar behaviour as in the previous one, with normal a bit less
>> de-prioritised relative to high than low was against normal.
>>
>> https://people.igalia.com/tursulin/drm-sched-qddl/high-normal.png
>>
>> 4. Heavy load vs interactive client.
>>
>> The heavy client emits a 75% GPU load in the form of 3x 2.5ms jobs
>> followed by a 2.5ms wait.
>>
>> The interactive client emits a 10% GPU load in the form of 1x 1ms job
>> followed by a 9ms wait.
>>
>> This simulates an interactive graphical client used on top of a
>> relatively heavy background load but no GPU oversubscription.
>>
>> Graphs show the interactive client only and from now on, instead of
>> looking at the client's queue depth, we look at its "fps".
>>
>> https://people.igalia.com/tursulin/drm-sched-qddl/heavy-interactive.png
>>
>> We can see that qddl allows a slightly higher fps for the interactive
>> client, which is good.
>>
>> 5. Low priority GPU hog versus heavy-interactive.
>>
>> Low priority client: 3x 2.5ms jobs followed by a 0.5ms wait.
>> Interactive client: 1x 0.5ms job followed by a 10ms wait.
>>
>> https://people.igalia.com/tursulin/drm-sched-qddl/lowhog-interactive.png
>>
>> No difference between the schedulers.
>>
>> 6. The last set of test scenarios has three subgroups.
>>
>> In all cases we have two interactive (synchronous, single job at a time)
>> clients with a 50% "duty cycle" GPU time usage.
>>
>> Client 1: 1.5ms job + 1.5ms wait (aka short bursty)
>> Client 2: 2.5ms job + 2.5ms wait (aka long bursty)
>>
>> a) Both normal priority.
>>
>> https://people.igalia.com/tursulin/drm-sched-qddl/5050-short.png
>> https://people.igalia.com/tursulin/drm-sched-qddl/5050-long.png
>>
>> Both schedulers favour the higher frequency duty cycle, with qddl giving
>> it a little bit more, which should be good for interactivity.
>>
>> b) Normal vs low priority.
>>
>> https://people.igalia.com/tursulin/drm-sched-qddl/5050-normal-low-normal.png
>> https://people.igalia.com/tursulin/drm-sched-qddl/5050-normal-low-low.png
>>
>> Qddl gives a bit more to normal than to low.
>>
>> c) High vs normal priority.
>>
>> https://people.igalia.com/tursulin/drm-sched-qddl/5050-high-normal-high.png
>> https://people.igalia.com/tursulin/drm-sched-qddl/5050-high-normal-normal.png
>>
>> Again, qddl gives a bit more share to the higher priority client.
>>
>> Overall, qddl looks like a potential improvement in terms of fairness,
>> especially in avoiding priority starvation. There do not appear to be any
>> regressions with the tested workloads.
>
> OK, so the above seems to be the measurements serving as a justification
> for the major rework.
>
> I suppose this was tested on a workstation with an AMD card?

No, using the mock scheduler unit tests. ;)

>> As before, I am looking for feedback and ideas for what kind of
>> submission scenarios to test. Testers on different GPUs would be very
>> welcome too.
>>
>> And I should probably test round-robin at some point, to see whether we
>> are maybe okay to drop it unconditionally, or whether further work
>> improving qddl would be needed.
>>
>> v2:
>> * Fixed many rebase errors.
>> * Added some new patches.
>> * Dropped single shot dependency handling.
>>
>> v3:
>> * Added scheduling quality unit tests.
>> * Refined a tiny bit by adding some fairness.
>> * Dropped a few patches for now.
>>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: Danilo Krummrich <dakr@redhat.com>
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: Philipp Stanner <pstanner@redhat.com>
>> Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-prayer@amd.com>
>> Cc: Michel Dänzer <michel.daenzer@mailbox.org>
>>
>> Tvrtko Ursulin (14):
>
> There appear to be, again, some candidates which should be branched out
> since they're independent from the scheduling policy. Suspicious ones
> are:
>
>> drm/sched: Add some scheduling quality unit tests
>> drm/sched: Avoid double re-lock on the job free path
>> drm/sched: Consolidate drm_sched_job_timedout

Tests are WIP but the other two are indeed independent.

> this one
>
>> drm/sched: Clarify locked section in drm_sched_rq_select_entity_fifo
>
> ^

This one I retracted, it's wrong.

>> drm/sched: Consolidate drm_sched_rq_select_entity_rr
>> drm/sched: Implement RR via FIFO
>> drm/sched: Consolidate entity run queue management
>> drm/sched: Move run queue related code into a separate file
>> drm/sched: Add deadline policy
>> drm/sched: Remove FIFO and RR and simplify to a single run queue
>> drm/sched: Queue all free credits in one worker invocation
>> drm/sched: Embed run queue singleton into the scheduler
>> drm/sched: De-clutter drm_sched_init
>
> ^

Yep. I will pull in this one at some point.
>> drm/sched: Scale deadlines depending on queue depth
>>
>> drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c      |  6 +-
>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c     | 27 +-
>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.h     |  5 +-
>> drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h   |  8 +-
>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c |  8 +-
>> drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c     |  8 +-
>
> Alright, so here we have our lovely AMD cards :)

Those diffs are just a consequence of "drm/sched: Embed run queue singleton
into the scheduler", since deadline does not need per-priority run queues; it
is code simplification.

> Here's a thing:
>
> Back in the day, we did a seemingly harmless change of the scheduler's
> default scheduling policy, from RR to FIFO. From all the arguments I have
> seen in the git log and the archives, this was one of the typical changes
> where the developer is tempted to think "This *cannot* break anything, it
> makes everything _better_ and has only advantages". But it turned out that
> it did break things and resulted in lengthy and interesting discussions
> here:
>
> https://gitlab.freedesktop.org/drm/amd/-/issues/2516
>
> Now, from time to time it is worth breaking things for progress. So the
> possibility that things *could* break even with the best improvements is
> not really an argument.
>
> It becomes an argument, however, when we consider the following: AMD cards
> are the major beneficiary of this change, generating X percent of
> performance and fairness gain.
>
> The reason being that AMD seems to be the most important party that has
> hardware scheduling. For Nouveau and Xe, the deadline policy won't do
> anything.
>
> And, if I understood correctly, new AMD generations will move to firmware
> scheduling, too.
>
> So that raises the question: is it worth it?
>
> If I am not mistaken, the firmware scheduler stakeholders largely
> shouldn't feel an impact when drm_sched changes its scheduling behavior,
> since the firmware does what it wants anyway.
>
> But there are some other drivers that use hardware scheduling, and this
> (and remaining bad gut feelings about firmware schedulers) makes me
> nervous about touching the global scheduling behavior.
>
> It appears to me that if AMD cards are the major beneficiary, and it is
> deemed that the performance gain is really worth it for the older and
> current generations, then we should at least explore / consider paths
> toward changing scheduling behavior for AMD only.
>
> But, like you, I am also interested in hearing from the other drivers.
> And correct me on any wrong assumptions above.

You are correct that there is no benefit in the scheduling algorithm for
drivers with a 1:1 entity to scheduler relationship. But it is not just about
AMD. There are other drivers which are not 1:1, like simple GPUs which
process one job at a time. They would benefit if we come up with scheduling
improvements.

Then there is a secondary benefit that the code is simplified. If you look at
the series diffstat and subtract the unit tests, you will see the code on the
whole shrinks. Partly due to the RR removal, and partly due to the removal of
per-priority run queues.

I would say it could conditionally be worth it. Conditional on building
confidence that there are no regressing workloads. As I say in the cover
letter, I am looking for ideas, suggestions and testing. So far my synthetic
workloads did not find a weakness relative to FIFO. But Pierre-Eric
apparently did find a regression with viewperf+glxgears. I am trying to model
that in the unit tests.

Also, both he and I are experimenting with a CFS-like scheduler in parallel
to this. That would be even better, but it is not easy to do in a nice way.
Regards,

Tvrtko

>> drivers/gpu/drm/scheduler/Makefile            |   2 +-
>> drivers/gpu/drm/scheduler/sched_entity.c      | 121 ++--
>> drivers/gpu/drm/scheduler/sched_fence.c       |   2 +-
>> drivers/gpu/drm/scheduler/sched_internal.h    |  17 +-
>> drivers/gpu/drm/scheduler/sched_main.c        | 581 ++++--------------
>> drivers/gpu/drm/scheduler/sched_rq.c          | 188 ++++++
>> drivers/gpu/drm/scheduler/tests/Makefile      |   3 +-
>> .../gpu/drm/scheduler/tests/tests_scheduler.c | 548 +++++++++++++++++
>> include/drm/gpu_scheduler.h                   |  17 +-
>> 15 files changed, 962 insertions(+), 579 deletions(-)
>> create mode 100644 drivers/gpu/drm/scheduler/sched_rq.c
>> create mode 100644 drivers/gpu/drm/scheduler/tests/tests_scheduler.c
>
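For readers following the thread, the queue-depth deadline ("qddl") idea can be illustrated with a toy model: each submission gets a deadline, the deadline is pushed out the deeper the submitting entity's own queue already is, and a single run queue always picks the earliest deadline. Everything below — the names, the constant and the exact scaling formula — is an assumption for illustration, not the actual drm_sched implementation:

```python
import heapq

BASE_SLICE_US = 1000  # hypothetical per-job deadline budget, in microseconds

class Entity:
    def __init__(self, name, prio_scale=1.0):
        self.name = name
        self.prio_scale = prio_scale  # smaller scale = higher priority
        self.queue_depth = 0          # jobs this client already has queued

def submit(rq, entity, now_us):
    # Deadline pushed out by the entity's own queue depth: a spammy client
    # with a deep queue earns ever later deadlines, so shallow interactive
    # (and even lower priority) clients still get slots in between.
    entity.queue_depth += 1
    deadline = now_us + BASE_SLICE_US * entity.prio_scale * entity.queue_depth
    heapq.heappush(rq, (deadline, id(entity), entity))

def pick_next(rq):
    # Single run queue ordered by deadline - earliest deadline runs next.
    _, _, entity = heapq.heappop(rq)
    entity.queue_depth -= 1
    return entity

rq = []
hog = Entity("hog")
ui = Entity("interactive")
for _ in range(8):            # hog spams a deep queue at t=0
    submit(rq, hog, now_us=0)
submit(rq, ui, now_us=500)    # interactive client submits one job at t=500

# The interactive job's deadline (1500us) beats the hog's second queued job
# (2000us), so it is not stuck behind the whole eight-deep hog queue.
print([pick_next(rq).name for _ in range(3)])  # ['hog', 'interactive', 'hog']
```

A plain FIFO ordered by submission time would have run all eight hog jobs before the interactive one; the depth scaling is what buys the fairness discussed above.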