Message ID | 20230404002211.3611376-1-matthew.brost@intel.com (mailing list archive)
---|---
Series | Xe DRM scheduler and long running workload plans
Hi, thanks for the Cc! On 04/04/2023 09.22, Matthew Brost wrote: > Hello, > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we > have been asked to merge our common DRM scheduler patches first as well > as develop a common solution for long running workloads with the DRM > scheduler. This RFC series is our first attempt at doing this. We > welcome any and all feedback. > > This can we thought of as 4 parts detailed below. > > - DRM scheduler changes for 1 to 1 relationship between scheduler and > entity (patches 1-3) > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the > GuC) which is a new paradigm WRT to the DRM scheduler and presents > severals problems as the DRM was originally designed to schedule jobs on > hardware queues. The main problem being that DRM scheduler expects the > submission order of jobs to be the completion order of jobs even across > multiple entities. This assumption falls apart with a firmware scheduler > as a firmware scheduler has no concept of jobs and jobs can complete out > of order. A novel solution for was originally thought of by Faith during > the initial prototype of Xe, create a 1 to 1 relationship between scheduler > and entity. I believe the AGX driver [3] is using this approach and > Boris may use approach as well for the Mali driver [4]. > > To support a 1 to 1 relationship we move the main execution function > from a kthread to a work queue and add a new scheduling mode which > bypasses code in the DRM which isn't needed in a 1 to 1 relationship. > The new scheduling mode should unify all drivers usage with a 1 to 1 > relationship and can be thought of as using scheduler as a dependency / > infligt job tracker rather than a true scheduler. Yup, we're in the exact same situation with drm/asahi, so this is very welcome! We've been using the existing scheduler as-is, but this should help remove some unneeded complexity in this use case. Do you want me to pull in this series into our tree and make sure this all works out for us? I also have a couple bugfixes for drm/sched I need to send out, but I think the rebase/merge with this series should be trivial. I'll send that out this week. > - Generic messaging interface for DRM scheduler > > Idea is to be able to communicate to the submission backend with in band > (relative to main execution function) messages. Messages are backend > defined and flexable enough for any use case. In Xe we use these > messages to clean up entites, set properties for entites, and suspend / > resume execution of an entity [5]. I suspect other driver can leverage > this messaging concept too as it a convenient way to avoid races in the > backend. We haven't needed this so far (mostly by using fine-grained locking and refcounting all over the place) but I can see it being useful to simplify some of those constructs and maybe avoid potential deadlocks in some places. I'm not sure yet whether we can fully get rid of the main queue refcounting/locking (our completion/error signaling path doesn't map well to DMA fences directly so we still need something there to get from the global GPU completion signaling thread to individual queues) but it might be a step in the right direction at least! ~~ Lina
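To make the 1:1 model discussed above concrete: instead of N entities feeding one scheduler per hardware ring, each user queue gets its own drm_gpu_scheduler with exactly one entity bound to it. The sketch below uses the pre-series drm_sched_init()/drm_sched_entity_init() signatures; how DRM_SCHED_POLICY_SINGLE_ENTITY is actually selected is defined by patches 1-3 of the series, so treat the details as assumptions rather than the final API.

	/*
	 * Illustrative sketch only: one scheduler + one entity per user queue.
	 * The policy selection added by this series is not shown here.
	 */
	#include <linux/jiffies.h>
	#include <drm/gpu_scheduler.h>

	struct my_queue {
		struct drm_gpu_scheduler sched;
		struct drm_sched_entity entity;
	};

	static int my_queue_create(struct my_queue *q, struct device *dev,
				   const struct drm_sched_backend_ops *ops)
	{
		struct drm_gpu_scheduler *sched_list[] = { &q->sched };
		int ret;

		/* hw_submission bounds in-flight jobs and acts as ring flow control */
		ret = drm_sched_init(&q->sched, ops, 4, 0, msecs_to_jiffies(500),
				     NULL, NULL, "my-queue", dev);
		if (ret)
			return ret;

		/* Exactly one entity ever feeds this scheduler. */
		return drm_sched_entity_init(&q->entity, DRM_SCHED_PRIORITY_NORMAL,
					     sched_list, 1, NULL);
	}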
On Tue, Apr 04, 2023 at 10:07:48AM +0900, Asahi Lina wrote: > Hi, thanks for the Cc! > No problem. > On 04/04/2023 09.22, Matthew Brost wrote: > > Hello, > > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we > > have been asked to merge our common DRM scheduler patches first as well > > as develop a common solution for long running workloads with the DRM > > scheduler. This RFC series is our first attempt at doing this. We > > welcome any and all feedback. > > > > This can we thought of as 4 parts detailed below. > > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and > > entity (patches 1-3) > > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the > > GuC) which is a new paradigm WRT to the DRM scheduler and presents > > severals problems as the DRM was originally designed to schedule jobs on > > hardware queues. The main problem being that DRM scheduler expects the > > submission order of jobs to be the completion order of jobs even across > > multiple entities. This assumption falls apart with a firmware scheduler > > as a firmware scheduler has no concept of jobs and jobs can complete out > > of order. A novel solution for was originally thought of by Faith during > > the initial prototype of Xe, create a 1 to 1 relationship between scheduler > > and entity. I believe the AGX driver [3] is using this approach and > > Boris may use approach as well for the Mali driver [4]. > > > > To support a 1 to 1 relationship we move the main execution function > > from a kthread to a work queue and add a new scheduling mode which > > bypasses code in the DRM which isn't needed in a 1 to 1 relationship. > > The new scheduling mode should unify all drivers usage with a 1 to 1 > > relationship and can be thought of as using scheduler as a dependency / > > infligt job tracker rather than a true scheduler. > > Yup, we're in the exact same situation with drm/asahi, so this is very > welcome! We've been using the existing scheduler as-is, but this should help > remove some unneeded complexity in this use case. > That's the idea. > Do you want me to pull in this series into our tree and make sure this all > works out for us? > We tested this in Xe and it definitely works for us but the more testing the better. > I also have a couple bugfixes for drm/sched I need to send out, but I think > the rebase/merge with this series should be trivial. I'll send that out this > week. > > > - Generic messaging interface for DRM scheduler > > > > Idea is to be able to communicate to the submission backend with in band > > (relative to main execution function) messages. Messages are backend > > defined and flexable enough for any use case. In Xe we use these > > messages to clean up entites, set properties for entites, and suspend / > > resume execution of an entity [5]. I suspect other driver can leverage > > this messaging concept too as it a convenient way to avoid races in the > > backend. > > We haven't needed this so far (mostly by using fine-grained locking and > refcounting all over the place) but I can see it being useful to simplify > some of those constructs and maybe avoid potential deadlocks in some places. > I'm not sure yet whether we can fully get rid of the main queue > refcounting/locking (our completion/error signaling path doesn't map well to > DMA fences directly so we still need something there to get from the global > GPU completion signaling thread to individual queues) but it might be a step > in the right direction at least! 
> With this messaging interface we essentially have a lockless submission backend which is really nice compared to what we did in the i915. Matt > ~~ Lina >
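For readers not following the series, the message interface being discussed amounts to something of the following shape, where every backend state change funnels through the same worker that runs jobs. All names below are made up for illustration (my_sched, my_backend_* are driver-defined placeholders); the real interface is defined by "drm/sched: Add generic scheduler message interface".

	#include <linux/list.h>
	#include <linux/spinlock.h>
	#include <linux/workqueue.h>

	struct my_sched {
		spinlock_t msg_lock;
		struct list_head msg_list;
		struct workqueue_struct *wq;
		struct work_struct work;
	};

	struct my_sched_msg {
		struct list_head link;
		unsigned int opcode;	/* e.g. CLEANUP, SET_PROP, SUSPEND, RESUME */
		void *data;
	};

	/* Producer side: any context may post a message and kick the worker. */
	static void my_sched_post_msg(struct my_sched *s, struct my_sched_msg *msg)
	{
		spin_lock(&s->msg_lock);
		list_add_tail(&msg->link, &s->msg_list);
		spin_unlock(&s->msg_lock);
		queue_work(s->wq, &s->work);	/* same worker that runs jobs */
	}

	static struct my_sched_msg *my_sched_pop_msg(struct my_sched *s)
	{
		struct my_sched_msg *msg;

		spin_lock(&s->msg_lock);
		msg = list_first_entry_or_null(&s->msg_list, struct my_sched_msg, link);
		if (msg)
			list_del(&msg->link);
		spin_unlock(&s->msg_lock);
		return msg;
	}

	/*
	 * Consumer side: messages and job submission run in the same work item,
	 * so they are serialized by construction and no lock is needed around
	 * backend / entity state.
	 */
	static void my_sched_work_fn(struct work_struct *w)
	{
		struct my_sched *s = container_of(w, struct my_sched, work);
		struct my_sched_msg *msg;

		while ((msg = my_sched_pop_msg(s)))
			my_backend_process_msg(s, msg);	/* driver-defined */

		my_backend_run_ready_jobs(s);		/* driver-defined */
	}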
Please make sure to CC Luben on scheduler patches. Regards, Christian. Am 04.04.23 um 02:22 schrieb Matthew Brost: > Hello, > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we > have been asked to merge our common DRM scheduler patches first as well > as develop a common solution for long running workloads with the DRM > scheduler. This RFC series is our first attempt at doing this. We > welcome any and all feedback. > > This can we thought of as 4 parts detailed below. > > - DRM scheduler changes for 1 to 1 relationship between scheduler and > entity (patches 1-3) > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the > GuC) which is a new paradigm WRT to the DRM scheduler and presents > severals problems as the DRM was originally designed to schedule jobs on > hardware queues. The main problem being that DRM scheduler expects the > submission order of jobs to be the completion order of jobs even across > multiple entities. This assumption falls apart with a firmware scheduler > as a firmware scheduler has no concept of jobs and jobs can complete out > of order. A novel solution for was originally thought of by Faith during > the initial prototype of Xe, create a 1 to 1 relationship between scheduler > and entity. I believe the AGX driver [3] is using this approach and > Boris may use approach as well for the Mali driver [4]. > > To support a 1 to 1 relationship we move the main execution function > from a kthread to a work queue and add a new scheduling mode which > bypasses code in the DRM which isn't needed in a 1 to 1 relationship. > The new scheduling mode should unify all drivers usage with a 1 to 1 > relationship and can be thought of as using scheduler as a dependency / > infligt job tracker rather than a true scheduler. > > - Generic messaging interface for DRM scheduler > > Idea is to be able to communicate to the submission backend with in band > (relative to main execution function) messages. Messages are backend > defined and flexable enough for any use case. In Xe we use these > messages to clean up entites, set properties for entites, and suspend / > resume execution of an entity [5]. I suspect other driver can leverage > this messaging concept too as it a convenient way to avoid races in the > backend. > > - Support for using TDR for all error paths of a scheduler / entity > > Fix a few races / bugs, add function to dynamically set the TDR timeout. > > - Annotate dma-fences for long running workloads. > > The idea here is to use dma-fences only as sync points within the > scheduler and never export them for long running workloads. By > annotating these fences as long running we ensure that these dma-fences > are never used in a way that breaks the dma-fence rules. A benefit of > thus approach is the scheduler can still safely flow control the > execution ring buffer via the job limit without breaking the dma-fence > rules. > > Again this a first draft and looking forward to feedback. 
> > Enjoy - Matt > > [1] https://gitlab.freedesktop.org/drm/xe/kernel > [2] https://patchwork.freedesktop.org/series/112188/ > [3] https://patchwork.freedesktop.org/series/114772/ > [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1 > [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031 > > Matthew Brost (8): > drm/sched: Convert drm scheduler to use a work queue rather than > kthread > drm/sched: Move schedule policy to scheduler / entity > drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy > drm/sched: Add generic scheduler message interface > drm/sched: Start run wq before TDR in drm_sched_start > drm/sched: Submit job before starting TDR > drm/sched: Add helper to set TDR timeout > drm/syncobj: Warn on long running dma-fences > > Thomas Hellström (2): > dma-buf/dma-fence: Introduce long-running completion fences > drm/sched: Support long-running sched entities > > drivers/dma-buf/dma-fence.c | 142 +++++++--- > drivers/dma-buf/dma-resv.c | 5 + > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 14 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +- > drivers/gpu/drm/drm_syncobj.c | 5 +- > drivers/gpu/drm/etnaviv/etnaviv_sched.c | 5 +- > drivers/gpu/drm/lima/lima_sched.c | 5 +- > drivers/gpu/drm/msm/adreno/adreno_device.c | 6 +- > drivers/gpu/drm/msm/msm_ringbuffer.c | 5 +- > drivers/gpu/drm/panfrost/panfrost_job.c | 5 +- > drivers/gpu/drm/scheduler/sched_entity.c | 127 +++++++-- > drivers/gpu/drm/scheduler/sched_fence.c | 6 +- > drivers/gpu/drm/scheduler/sched_main.c | 278 +++++++++++++++----- > drivers/gpu/drm/v3d/v3d_sched.c | 25 +- > include/drm/gpu_scheduler.h | 130 +++++++-- > include/linux/dma-fence.h | 60 ++++- > 16 files changed, 649 insertions(+), 184 deletions(-) >
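To illustrate the last two bullets of the cover letter quoted above (annotating long-running fences and warning when they leak out through syncobjs), the concept is roughly as follows. The flag name and helpers here are placeholders; the actual ones are introduced by "dma-buf/dma-fence: Introduce long-running completion fences" and the drm_syncobj warning patch.

	#include <linux/dma-fence.h>

	/* Placeholder bit, not the flag the series actually defines. */
	#define MY_DMA_FENCE_FLAG_LONG_RUNNING_BIT (DMA_FENCE_FLAG_USER_BITS + 0)

	static inline bool my_dma_fence_is_long_running(struct dma_fence *fence)
	{
		return test_bit(MY_DMA_FENCE_FLAG_LONG_RUNNING_BIT, &fence->flags);
	}

	/*
	 * Long-running fences are only sync points inside the scheduler;
	 * anything that would export them to userspace (e.g. a drm_syncobj)
	 * should complain.
	 */
	static void my_check_exported_fence(struct dma_fence *fence)
	{
		WARN_ON_ONCE(my_dma_fence_is_long_running(fence));
	}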
Hi, Am 04.04.23 um 02:22 schrieb Matthew Brost: > Hello, > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we > have been asked to merge our common DRM scheduler patches first as well > as develop a common solution for long running workloads with the DRM > scheduler. This RFC series is our first attempt at doing this. We > welcome any and all feedback. > > This can we thought of as 4 parts detailed below. > > - DRM scheduler changes for 1 to 1 relationship between scheduler and > entity (patches 1-3) > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the > GuC) which is a new paradigm WRT to the DRM scheduler and presents > severals problems as the DRM was originally designed to schedule jobs on > hardware queues. The main problem being that DRM scheduler expects the > submission order of jobs to be the completion order of jobs even across > multiple entities. This assumption falls apart with a firmware scheduler > as a firmware scheduler has no concept of jobs and jobs can complete out > of order. A novel solution for was originally thought of by Faith during > the initial prototype of Xe, create a 1 to 1 relationship between scheduler > and entity. I believe the AGX driver [3] is using this approach and > Boris may use approach as well for the Mali driver [4]. > > To support a 1 to 1 relationship we move the main execution function > from a kthread to a work queue and add a new scheduling mode which > bypasses code in the DRM which isn't needed in a 1 to 1 relationship. > The new scheduling mode should unify all drivers usage with a 1 to 1 > relationship and can be thought of as using scheduler as a dependency / > infligt job tracker rather than a true scheduler. > > - Generic messaging interface for DRM scheduler > > Idea is to be able to communicate to the submission backend with in band > (relative to main execution function) messages. Messages are backend > defined and flexable enough for any use case. In Xe we use these > messages to clean up entites, set properties for entites, and suspend / > resume execution of an entity [5]. I suspect other driver can leverage > this messaging concept too as it a convenient way to avoid races in the > backend. Oh, please absolutely *don't* do this. This is basically the design which makes a bunch of stuff so horrible broken on Windows. I can explain it in more detail if necessary, but I strongly recommend to not go down this path. Regards, Christian. > > - Support for using TDR for all error paths of a scheduler / entity > > Fix a few races / bugs, add function to dynamically set the TDR timeout. > > - Annotate dma-fences for long running workloads. > > The idea here is to use dma-fences only as sync points within the > scheduler and never export them for long running workloads. By > annotating these fences as long running we ensure that these dma-fences > are never used in a way that breaks the dma-fence rules. A benefit of > thus approach is the scheduler can still safely flow control the > execution ring buffer via the job limit without breaking the dma-fence > rules. > > Again this a first draft and looking forward to feedback. 
> > Enjoy - Matt > > [1] https://gitlab.freedesktop.org/drm/xe/kernel > [2] https://patchwork.freedesktop.org/series/112188/ > [3] https://patchwork.freedesktop.org/series/114772/ > [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1 > [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031 > > Matthew Brost (8): > drm/sched: Convert drm scheduler to use a work queue rather than > kthread > drm/sched: Move schedule policy to scheduler / entity > drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy > drm/sched: Add generic scheduler message interface > drm/sched: Start run wq before TDR in drm_sched_start > drm/sched: Submit job before starting TDR > drm/sched: Add helper to set TDR timeout > drm/syncobj: Warn on long running dma-fences > > Thomas Hellström (2): > dma-buf/dma-fence: Introduce long-running completion fences > drm/sched: Support long-running sched entities > > drivers/dma-buf/dma-fence.c | 142 +++++++--- > drivers/dma-buf/dma-resv.c | 5 + > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 14 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +- > drivers/gpu/drm/drm_syncobj.c | 5 +- > drivers/gpu/drm/etnaviv/etnaviv_sched.c | 5 +- > drivers/gpu/drm/lima/lima_sched.c | 5 +- > drivers/gpu/drm/msm/adreno/adreno_device.c | 6 +- > drivers/gpu/drm/msm/msm_ringbuffer.c | 5 +- > drivers/gpu/drm/panfrost/panfrost_job.c | 5 +- > drivers/gpu/drm/scheduler/sched_entity.c | 127 +++++++-- > drivers/gpu/drm/scheduler/sched_fence.c | 6 +- > drivers/gpu/drm/scheduler/sched_main.c | 278 +++++++++++++++----- > drivers/gpu/drm/v3d/v3d_sched.c | 25 +- > include/drm/gpu_scheduler.h | 130 +++++++-- > include/linux/dma-fence.h | 60 ++++- > 16 files changed, 649 insertions(+), 184 deletions(-) >
On 04/04/2023 01:22, Matthew Brost wrote: > Hello, > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we > have been asked to merge our common DRM scheduler patches first as well > as develop a common solution for long running workloads with the DRM > scheduler. This RFC series is our first attempt at doing this. We > welcome any and all feedback. > > This can we thought of as 4 parts detailed below. > > - DRM scheduler changes for 1 to 1 relationship between scheduler and > entity (patches 1-3) > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the > GuC) which is a new paradigm WRT to the DRM scheduler and presents > severals problems as the DRM was originally designed to schedule jobs on > hardware queues. The main problem being that DRM scheduler expects the > submission order of jobs to be the completion order of jobs even across > multiple entities. This assumption falls apart with a firmware scheduler > as a firmware scheduler has no concept of jobs and jobs can complete out > of order. A novel solution for was originally thought of by Faith during > the initial prototype of Xe, create a 1 to 1 relationship between scheduler > and entity. I believe the AGX driver [3] is using this approach and > Boris may use approach as well for the Mali driver [4]. > > To support a 1 to 1 relationship we move the main execution function > from a kthread to a work queue and add a new scheduling mode which > bypasses code in the DRM which isn't needed in a 1 to 1 relationship. > The new scheduling mode should unify all drivers usage with a 1 to 1 > relationship and can be thought of as using scheduler as a dependency / > infligt job tracker rather than a true scheduler. Once you add capability for a more proper 1:1 via DRM_SCHED_POLICY_SINGLE_ENTITY, do you still have further need to replace kthreads with a wq? Or in other words, what purpose does the offloading of a job picking code to a separate execution context serve? Could it be done directly in the 1:1 mode and leave kthread setup for N:M? Apart from those design level questions, low level open IMO still is that default fallback of using the system_wq has the potential to affect latency for other drivers. But that's for those driver owners to approve. Regards, Tvrtko > - Generic messaging interface for DRM scheduler > > Idea is to be able to communicate to the submission backend with in band > (relative to main execution function) messages. Messages are backend > defined and flexable enough for any use case. In Xe we use these > messages to clean up entites, set properties for entites, and suspend / > resume execution of an entity [5]. I suspect other driver can leverage > this messaging concept too as it a convenient way to avoid races in the > backend. > > - Support for using TDR for all error paths of a scheduler / entity > > Fix a few races / bugs, add function to dynamically set the TDR timeout. > > - Annotate dma-fences for long running workloads. > > The idea here is to use dma-fences only as sync points within the > scheduler and never export them for long running workloads. By > annotating these fences as long running we ensure that these dma-fences > are never used in a way that breaks the dma-fence rules. A benefit of > thus approach is the scheduler can still safely flow control the > execution ring buffer via the job limit without breaking the dma-fence > rules. > > Again this a first draft and looking forward to feedback. 
> > Enjoy - Matt > > [1] https://gitlab.freedesktop.org/drm/xe/kernel > [2] https://patchwork.freedesktop.org/series/112188/ > [3] https://patchwork.freedesktop.org/series/114772/ > [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1 > [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031 > > Matthew Brost (8): > drm/sched: Convert drm scheduler to use a work queue rather than > kthread > drm/sched: Move schedule policy to scheduler / entity > drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy > drm/sched: Add generic scheduler message interface > drm/sched: Start run wq before TDR in drm_sched_start > drm/sched: Submit job before starting TDR > drm/sched: Add helper to set TDR timeout > drm/syncobj: Warn on long running dma-fences > > Thomas Hellström (2): > dma-buf/dma-fence: Introduce long-running completion fences > drm/sched: Support long-running sched entities > > drivers/dma-buf/dma-fence.c | 142 +++++++--- > drivers/dma-buf/dma-resv.c | 5 + > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 14 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +- > drivers/gpu/drm/drm_syncobj.c | 5 +- > drivers/gpu/drm/etnaviv/etnaviv_sched.c | 5 +- > drivers/gpu/drm/lima/lima_sched.c | 5 +- > drivers/gpu/drm/msm/adreno/adreno_device.c | 6 +- > drivers/gpu/drm/msm/msm_ringbuffer.c | 5 +- > drivers/gpu/drm/panfrost/panfrost_job.c | 5 +- > drivers/gpu/drm/scheduler/sched_entity.c | 127 +++++++-- > drivers/gpu/drm/scheduler/sched_fence.c | 6 +- > drivers/gpu/drm/scheduler/sched_main.c | 278 +++++++++++++++----- > drivers/gpu/drm/v3d/v3d_sched.c | 25 +- > include/drm/gpu_scheduler.h | 130 +++++++-- > include/linux/dma-fence.h | 60 ++++- > 16 files changed, 649 insertions(+), 184 deletions(-) >
Am 04.04.23 um 11:43 schrieb Tvrtko Ursulin: > > On 04/04/2023 01:22, Matthew Brost wrote: >> Hello, >> >> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we >> have been asked to merge our common DRM scheduler patches first as well >> as develop a common solution for long running workloads with the DRM >> scheduler. This RFC series is our first attempt at doing this. We >> welcome any and all feedback. >> >> This can we thought of as 4 parts detailed below. >> >> - DRM scheduler changes for 1 to 1 relationship between scheduler and >> entity (patches 1-3) >> >> In Xe all of the scheduling of jobs is done by a firmware scheduler (the >> GuC) which is a new paradigm WRT to the DRM scheduler and presents >> severals problems as the DRM was originally designed to schedule jobs on >> hardware queues. The main problem being that DRM scheduler expects the >> submission order of jobs to be the completion order of jobs even across >> multiple entities. This assumption falls apart with a firmware scheduler >> as a firmware scheduler has no concept of jobs and jobs can complete out >> of order. A novel solution for was originally thought of by Faith during >> the initial prototype of Xe, create a 1 to 1 relationship between >> scheduler >> and entity. I believe the AGX driver [3] is using this approach and >> Boris may use approach as well for the Mali driver [4]. >> >> To support a 1 to 1 relationship we move the main execution function >> from a kthread to a work queue and add a new scheduling mode which >> bypasses code in the DRM which isn't needed in a 1 to 1 relationship. >> The new scheduling mode should unify all drivers usage with a 1 to 1 >> relationship and can be thought of as using scheduler as a dependency / >> infligt job tracker rather than a true scheduler. > > Once you add capability for a more proper 1:1 via > DRM_SCHED_POLICY_SINGLE_ENTITY, do you still have further need to > replace kthreads with a wq? Yeah, I fail to see the need for that as well. On the other hand it would be really nice to get rid of the rq/priority design in general. > > Or in other words, what purpose does the offloading of a job picking > code to a separate execution context serve? Could it be done directly > in the 1:1 mode and leave kthread setup for N:M? Well, moving from kthread to work item is beneficial on its own, since the latter usually follows the source of its queue. E.g. when this is triggered by an interrupt we run on the CPU of the interrupt and don't have inter-CPU signaling. > > Apart from those design level questions, low level open IMO still is > that default fallback of using the system_wq has the potential to > affect latency for other drivers. But that's for those driver owners > to approve. Oh, yeah that's a good point as well. This needs some high priority queue. Christian. > > Regards, > > Tvrtko > >> - Generic messaging interface for DRM scheduler >> >> Idea is to be able to communicate to the submission backend with in band >> (relative to main execution function) messages. Messages are backend >> defined and flexable enough for any use case. In Xe we use these >> messages to clean up entites, set properties for entites, and suspend / >> resume execution of an entity [5]. I suspect other driver can leverage >> this messaging concept too as it a convenient way to avoid races in the >> backend. >> >> - Support for using TDR for all error paths of a scheduler / entity >> >> Fix a few races / bugs, add function to dynamically set the TDR timeout. 
>> >> - Annotate dma-fences for long running workloads. >> >> The idea here is to use dma-fences only as sync points within the >> scheduler and never export them for long running workloads. By >> annotating these fences as long running we ensure that these dma-fences >> are never used in a way that breaks the dma-fence rules. A benefit of >> thus approach is the scheduler can still safely flow control the >> execution ring buffer via the job limit without breaking the dma-fence >> rules. >> >> Again this a first draft and looking forward to feedback. >> >> Enjoy - Matt >> >> [1] https://gitlab.freedesktop.org/drm/xe/kernel >> [2] https://patchwork.freedesktop.org/series/112188/ >> [3] https://patchwork.freedesktop.org/series/114772/ >> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1 >> [5] >> https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031 >> >> Matthew Brost (8): >> drm/sched: Convert drm scheduler to use a work queue rather than >> kthread >> drm/sched: Move schedule policy to scheduler / entity >> drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy >> drm/sched: Add generic scheduler message interface >> drm/sched: Start run wq before TDR in drm_sched_start >> drm/sched: Submit job before starting TDR >> drm/sched: Add helper to set TDR timeout >> drm/syncobj: Warn on long running dma-fences >> >> Thomas Hellström (2): >> dma-buf/dma-fence: Introduce long-running completion fences >> drm/sched: Support long-running sched entities >> >> drivers/dma-buf/dma-fence.c | 142 +++++++--- >> drivers/dma-buf/dma-resv.c | 5 + >> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 14 +- >> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +- >> drivers/gpu/drm/drm_syncobj.c | 5 +- >> drivers/gpu/drm/etnaviv/etnaviv_sched.c | 5 +- >> drivers/gpu/drm/lima/lima_sched.c | 5 +- >> drivers/gpu/drm/msm/adreno/adreno_device.c | 6 +- >> drivers/gpu/drm/msm/msm_ringbuffer.c | 5 +- >> drivers/gpu/drm/panfrost/panfrost_job.c | 5 +- >> drivers/gpu/drm/scheduler/sched_entity.c | 127 +++++++-- >> drivers/gpu/drm/scheduler/sched_fence.c | 6 +- >> drivers/gpu/drm/scheduler/sched_main.c | 278 +++++++++++++++----- >> drivers/gpu/drm/v3d/v3d_sched.c | 25 +- >> include/drm/gpu_scheduler.h | 130 +++++++-- >> include/linux/dma-fence.h | 60 ++++- >> 16 files changed, 649 insertions(+), 184 deletions(-) >>
On Tue, Apr 04, 2023 at 11:04:54AM +0200, Christian König wrote: > Please make sure to CC Luben on scheduler patches. > Sure, figured I was missing a few people. Matt > Regards, > Christian. > > Am 04.04.23 um 02:22 schrieb Matthew Brost: > > Hello, > > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we > > have been asked to merge our common DRM scheduler patches first as well > > as develop a common solution for long running workloads with the DRM > > scheduler. This RFC series is our first attempt at doing this. We > > welcome any and all feedback. > > > > This can we thought of as 4 parts detailed below. > > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and > > entity (patches 1-3) > > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the > > GuC) which is a new paradigm WRT to the DRM scheduler and presents > > severals problems as the DRM was originally designed to schedule jobs on > > hardware queues. The main problem being that DRM scheduler expects the > > submission order of jobs to be the completion order of jobs even across > > multiple entities. This assumption falls apart with a firmware scheduler > > as a firmware scheduler has no concept of jobs and jobs can complete out > > of order. A novel solution for was originally thought of by Faith during > > the initial prototype of Xe, create a 1 to 1 relationship between scheduler > > and entity. I believe the AGX driver [3] is using this approach and > > Boris may use approach as well for the Mali driver [4]. > > > > To support a 1 to 1 relationship we move the main execution function > > from a kthread to a work queue and add a new scheduling mode which > > bypasses code in the DRM which isn't needed in a 1 to 1 relationship. > > The new scheduling mode should unify all drivers usage with a 1 to 1 > > relationship and can be thought of as using scheduler as a dependency / > > infligt job tracker rather than a true scheduler. > > > > - Generic messaging interface for DRM scheduler > > > > Idea is to be able to communicate to the submission backend with in band > > (relative to main execution function) messages. Messages are backend > > defined and flexable enough for any use case. In Xe we use these > > messages to clean up entites, set properties for entites, and suspend / > > resume execution of an entity [5]. I suspect other driver can leverage > > this messaging concept too as it a convenient way to avoid races in the > > backend. > > > > - Support for using TDR for all error paths of a scheduler / entity > > > > Fix a few races / bugs, add function to dynamically set the TDR timeout. > > > > - Annotate dma-fences for long running workloads. > > > > The idea here is to use dma-fences only as sync points within the > > scheduler and never export them for long running workloads. By > > annotating these fences as long running we ensure that these dma-fences > > are never used in a way that breaks the dma-fence rules. A benefit of > > thus approach is the scheduler can still safely flow control the > > execution ring buffer via the job limit without breaking the dma-fence > > rules. > > > > Again this a first draft and looking forward to feedback. 
> > > > Enjoy - Matt > > > > [1] https://gitlab.freedesktop.org/drm/xe/kernel > > [2] https://patchwork.freedesktop.org/series/112188/ > > [3] https://patchwork.freedesktop.org/series/114772/ > > [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1 > > [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031 > > > > Matthew Brost (8): > > drm/sched: Convert drm scheduler to use a work queue rather than > > kthread > > drm/sched: Move schedule policy to scheduler / entity > > drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy > > drm/sched: Add generic scheduler message interface > > drm/sched: Start run wq before TDR in drm_sched_start > > drm/sched: Submit job before starting TDR > > drm/sched: Add helper to set TDR timeout > > drm/syncobj: Warn on long running dma-fences > > > > Thomas Hellström (2): > > dma-buf/dma-fence: Introduce long-running completion fences > > drm/sched: Support long-running sched entities > > > > drivers/dma-buf/dma-fence.c | 142 +++++++--- > > drivers/dma-buf/dma-resv.c | 5 + > > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 14 +- > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +- > > drivers/gpu/drm/drm_syncobj.c | 5 +- > > drivers/gpu/drm/etnaviv/etnaviv_sched.c | 5 +- > > drivers/gpu/drm/lima/lima_sched.c | 5 +- > > drivers/gpu/drm/msm/adreno/adreno_device.c | 6 +- > > drivers/gpu/drm/msm/msm_ringbuffer.c | 5 +- > > drivers/gpu/drm/panfrost/panfrost_job.c | 5 +- > > drivers/gpu/drm/scheduler/sched_entity.c | 127 +++++++-- > > drivers/gpu/drm/scheduler/sched_fence.c | 6 +- > > drivers/gpu/drm/scheduler/sched_main.c | 278 +++++++++++++++----- > > drivers/gpu/drm/v3d/v3d_sched.c | 25 +- > > include/drm/gpu_scheduler.h | 130 +++++++-- > > include/linux/dma-fence.h | 60 ++++- > > 16 files changed, 649 insertions(+), 184 deletions(-) > > >
On Tue, Apr 04, 2023 at 11:13:28AM +0200, Christian König wrote: > Hi, > > Am 04.04.23 um 02:22 schrieb Matthew Brost: > > Hello, > > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we > > have been asked to merge our common DRM scheduler patches first as well > > as develop a common solution for long running workloads with the DRM > > scheduler. This RFC series is our first attempt at doing this. We > > welcome any and all feedback. > > > > This can we thought of as 4 parts detailed below. > > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and > > entity (patches 1-3) > > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the > > GuC) which is a new paradigm WRT to the DRM scheduler and presents > > severals problems as the DRM was originally designed to schedule jobs on > > hardware queues. The main problem being that DRM scheduler expects the > > submission order of jobs to be the completion order of jobs even across > > multiple entities. This assumption falls apart with a firmware scheduler > > as a firmware scheduler has no concept of jobs and jobs can complete out > > of order. A novel solution for was originally thought of by Faith during > > the initial prototype of Xe, create a 1 to 1 relationship between scheduler > > and entity. I believe the AGX driver [3] is using this approach and > > Boris may use approach as well for the Mali driver [4]. > > > > To support a 1 to 1 relationship we move the main execution function > > from a kthread to a work queue and add a new scheduling mode which > > bypasses code in the DRM which isn't needed in a 1 to 1 relationship. > > The new scheduling mode should unify all drivers usage with a 1 to 1 > > relationship and can be thought of as using scheduler as a dependency / > > infligt job tracker rather than a true scheduler. > > > > - Generic messaging interface for DRM scheduler > > > > Idea is to be able to communicate to the submission backend with in band > > (relative to main execution function) messages. Messages are backend > > defined and flexable enough for any use case. In Xe we use these > > messages to clean up entites, set properties for entites, and suspend / > > resume execution of an entity [5]. I suspect other driver can leverage > > this messaging concept too as it a convenient way to avoid races in the > > backend. > > Oh, please absolutely *don't* do this. > > This is basically the design which makes a bunch of stuff so horrible broken > on Windows. > > I can explain it in more detail if necessary, but I strongly recommend to > not go down this path. > I'm afraid we are going to have to discuss this further. Let me explain my reasoning: basically the idea is to have a single main entry point to the backend - the work queue. This avoids the need for a lock between run_job and any message that changes an entity's state, and it also really helps during the reset flows (either TDR or GT reset) as we can call drm_sched_run_wq_stop and ensure that nothing else is in the backend changing an entity's state. It all works out really nicely actually, our GuC backend is incredibly stable (hasn't really had a bug pop in about a year) and way simpler than what we did in the i915. I think the simplicity is largely due to this design of limiting the entry points. I personally don't see how this is a poor design; limiting entry points absolutely makes sense to me. If it didn't, why not just call cleanup_job bypassing the main execution thread (now worker)? This is the exact same concept. 
FWIW Asahi liked the idea as well and think it could be useful for AGX. Matt > Regards, > Christian. > > > > > - Support for using TDR for all error paths of a scheduler / entity > > > > Fix a few races / bugs, add function to dynamically set the TDR timeout. > > > > - Annotate dma-fences for long running workloads. > > > > The idea here is to use dma-fences only as sync points within the > > scheduler and never export them for long running workloads. By > > annotating these fences as long running we ensure that these dma-fences > > are never used in a way that breaks the dma-fence rules. A benefit of > > thus approach is the scheduler can still safely flow control the > > execution ring buffer via the job limit without breaking the dma-fence > > rules. > > > > Again this a first draft and looking forward to feedback. > > > > Enjoy - Matt > > > > [1] https://gitlab.freedesktop.org/drm/xe/kernel > > [2] https://patchwork.freedesktop.org/series/112188/ > > [3] https://patchwork.freedesktop.org/series/114772/ > > [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1 > > [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031 > > > > Matthew Brost (8): > > drm/sched: Convert drm scheduler to use a work queue rather than > > kthread > > drm/sched: Move schedule policy to scheduler / entity > > drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy > > drm/sched: Add generic scheduler message interface > > drm/sched: Start run wq before TDR in drm_sched_start > > drm/sched: Submit job before starting TDR > > drm/sched: Add helper to set TDR timeout > > drm/syncobj: Warn on long running dma-fences > > > > Thomas Hellström (2): > > dma-buf/dma-fence: Introduce long-running completion fences > > drm/sched: Support long-running sched entities > > > > drivers/dma-buf/dma-fence.c | 142 +++++++--- > > drivers/dma-buf/dma-resv.c | 5 + > > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 14 +- > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +- > > drivers/gpu/drm/drm_syncobj.c | 5 +- > > drivers/gpu/drm/etnaviv/etnaviv_sched.c | 5 +- > > drivers/gpu/drm/lima/lima_sched.c | 5 +- > > drivers/gpu/drm/msm/adreno/adreno_device.c | 6 +- > > drivers/gpu/drm/msm/msm_ringbuffer.c | 5 +- > > drivers/gpu/drm/panfrost/panfrost_job.c | 5 +- > > drivers/gpu/drm/scheduler/sched_entity.c | 127 +++++++-- > > drivers/gpu/drm/scheduler/sched_fence.c | 6 +- > > drivers/gpu/drm/scheduler/sched_main.c | 278 +++++++++++++++----- > > drivers/gpu/drm/v3d/v3d_sched.c | 25 +- > > include/drm/gpu_scheduler.h | 130 +++++++-- > > include/linux/dma-fence.h | 60 ++++- > > 16 files changed, 649 insertions(+), 184 deletions(-) > > >
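The reset-flow argument in the reply above can be made concrete with a small sketch: because the worker is the only path into the backend, pausing it quiesces everything, and entity/firmware state can be rebuilt without additional locks. Apart from the standard workqueue calls, the names here are illustrative; the series provides drm_sched_run_wq_stop()/_start()-style helpers for the real thing.

	/* Illustrative only: why a single entry point simplifies TDR / GT reset. */
	static void my_sched_reset(struct my_sched *s)
	{
		/* Stop the only context that touches backend / entity state. */
		WRITE_ONCE(s->pause, true);
		cancel_work_sync(&s->work);

		/*
		 * Nothing is running in the backend now and nothing can start,
		 * so state can be rebuilt and jobs resubmitted without locks.
		 */
		my_backend_rebuild_and_resubmit(s);	/* driver-defined */

		WRITE_ONCE(s->pause, false);
		queue_work(s->wq, &s->work);
	}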
On Tue, Apr 04, 2023 at 11:48:36AM +0200, Christian König wrote: > Am 04.04.23 um 11:43 schrieb Tvrtko Ursulin: > > > > On 04/04/2023 01:22, Matthew Brost wrote: > > > Hello, > > > > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we > > > have been asked to merge our common DRM scheduler patches first as well > > > as develop a common solution for long running workloads with the DRM > > > scheduler. This RFC series is our first attempt at doing this. We > > > welcome any and all feedback. > > > > > > This can we thought of as 4 parts detailed below. > > > > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and > > > entity (patches 1-3) > > > > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the > > > GuC) which is a new paradigm WRT to the DRM scheduler and presents > > > severals problems as the DRM was originally designed to schedule jobs on > > > hardware queues. The main problem being that DRM scheduler expects the > > > submission order of jobs to be the completion order of jobs even across > > > multiple entities. This assumption falls apart with a firmware scheduler > > > as a firmware scheduler has no concept of jobs and jobs can complete out > > > of order. A novel solution for was originally thought of by Faith during > > > the initial prototype of Xe, create a 1 to 1 relationship between > > > scheduler > > > and entity. I believe the AGX driver [3] is using this approach and > > > Boris may use approach as well for the Mali driver [4]. > > > > > > To support a 1 to 1 relationship we move the main execution function > > > from a kthread to a work queue and add a new scheduling mode which > > > bypasses code in the DRM which isn't needed in a 1 to 1 relationship. > > > The new scheduling mode should unify all drivers usage with a 1 to 1 > > > relationship and can be thought of as using scheduler as a dependency / > > > infligt job tracker rather than a true scheduler. > > > > Once you add capability for a more proper 1:1 via > > DRM_SCHED_POLICY_SINGLE_ENTITY, do you still have further need to > > replace kthreads with a wq? > > Yeah, I fail to see the need for that as well. On the other hand it would be > really nice to get rid of the rq/priority design in general. > WRT replacing the kthread with a worker, I think the idea is that you don't want to tie kthread creation directly to a uAPI, as a user could then create 1000s of kthreads. FWIW, in a private email about a year ago you actually suggested using a work queue, Christian. > > > > Or in other words, what purpose does the offloading of a job picking > > code to a separate execution context serve? Could it be done directly in > > the 1:1 mode and leave kthread setup for N:M? > > Well moving from kthread to work item is beneficial on it's own since the > later usually follows the the source of it's queue. E.g. when this is > triggered by an interrupt we run on the CPU of the interrupt and not have > inter CPU signaling. > > > > Apart from those design level questions, low level open IMO still is > > that default fallback of using the system_wq has the potential to affect > > latency for other drivers. But that's for those driver owners to > > approve. > > Oh, yeah that's a good point as well. This needs some high priority queue. > system_highpri_wq? Matt > Christian. 
> > > > > Regards, > > > > Tvrtko > > > > > - Generic messaging interface for DRM scheduler > > > > > > Idea is to be able to communicate to the submission backend with in band > > > (relative to main execution function) messages. Messages are backend > > > defined and flexable enough for any use case. In Xe we use these > > > messages to clean up entites, set properties for entites, and suspend / > > > resume execution of an entity [5]. I suspect other driver can leverage > > > this messaging concept too as it a convenient way to avoid races in the > > > backend. > > > > > > - Support for using TDR for all error paths of a scheduler / entity > > > > > > Fix a few races / bugs, add function to dynamically set the TDR timeout. > > > > > > - Annotate dma-fences for long running workloads. > > > > > > The idea here is to use dma-fences only as sync points within the > > > scheduler and never export them for long running workloads. By > > > annotating these fences as long running we ensure that these dma-fences > > > are never used in a way that breaks the dma-fence rules. A benefit of > > > thus approach is the scheduler can still safely flow control the > > > execution ring buffer via the job limit without breaking the dma-fence > > > rules. > > > > > > Again this a first draft and looking forward to feedback. > > > > > > Enjoy - Matt > > > > > > [1] https://gitlab.freedesktop.org/drm/xe/kernel > > > [2] https://patchwork.freedesktop.org/series/112188/ > > > [3] https://patchwork.freedesktop.org/series/114772/ > > > [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1 > > > [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031 > > > > > > Matthew Brost (8): > > > drm/sched: Convert drm scheduler to use a work queue rather than > > > kthread > > > drm/sched: Move schedule policy to scheduler / entity > > > drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy > > > drm/sched: Add generic scheduler message interface > > > drm/sched: Start run wq before TDR in drm_sched_start > > > drm/sched: Submit job before starting TDR > > > drm/sched: Add helper to set TDR timeout > > > drm/syncobj: Warn on long running dma-fences > > > > > > Thomas Hellström (2): > > > dma-buf/dma-fence: Introduce long-running completion fences > > > drm/sched: Support long-running sched entities > > > > > > drivers/dma-buf/dma-fence.c | 142 +++++++--- > > > drivers/dma-buf/dma-resv.c | 5 + > > > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 14 +- > > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +- > > > drivers/gpu/drm/drm_syncobj.c | 5 +- > > > drivers/gpu/drm/etnaviv/etnaviv_sched.c | 5 +- > > > drivers/gpu/drm/lima/lima_sched.c | 5 +- > > > drivers/gpu/drm/msm/adreno/adreno_device.c | 6 +- > > > drivers/gpu/drm/msm/msm_ringbuffer.c | 5 +- > > > drivers/gpu/drm/panfrost/panfrost_job.c | 5 +- > > > drivers/gpu/drm/scheduler/sched_entity.c | 127 +++++++-- > > > drivers/gpu/drm/scheduler/sched_fence.c | 6 +- > > > drivers/gpu/drm/scheduler/sched_main.c | 278 +++++++++++++++----- > > > drivers/gpu/drm/v3d/v3d_sched.c | 25 +- > > > include/drm/gpu_scheduler.h | 130 +++++++-- > > > include/linux/dma-fence.h | 60 ++++- > > > 16 files changed, 649 insertions(+), 184 deletions(-) > > > >
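As a reference for the workqueue discussion: a driver that does not want to ride on system_wq (or system_highpri_wq) can allocate its own ordered, high-priority queue per scheduler or per device, roughly as below. Whether WQ_MEM_RECLAIM is warranted depends on whether the work item sits in a dma-fence signalling path; that part is a judgement call, not something this series mandates.

	#include <linux/workqueue.h>

	/*
	 * Sketch: a dedicated ordered workqueue instead of the system_wq
	 * fallback. "ordered" preserves the single-threaded semantics the
	 * scheduler worker relies on.
	 */
	struct workqueue_struct *my_sched_wq_create(const char *name)
	{
		/*
		 * WQ_HIGHPRI addresses the latency concern above; WQ_MEM_RECLAIM
		 * is typically wanted when the worker must make forward progress
		 * under memory pressure as part of fence signalling.
		 */
		return alloc_ordered_workqueue("drm-sched-%s",
					       WQ_HIGHPRI | WQ_MEM_RECLAIM, name);
	}

	void my_sched_wq_destroy(struct workqueue_struct *wq)
	{
		destroy_workqueue(wq);
	}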
On Tue, Apr 04, 2023 at 10:43:03AM +0100, Tvrtko Ursulin wrote: > > On 04/04/2023 01:22, Matthew Brost wrote: > > Hello, > > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we > > have been asked to merge our common DRM scheduler patches first as well > > as develop a common solution for long running workloads with the DRM > > scheduler. This RFC series is our first attempt at doing this. We > > welcome any and all feedback. > > > > This can we thought of as 4 parts detailed below. > > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and > > entity (patches 1-3) > > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the > > GuC) which is a new paradigm WRT to the DRM scheduler and presents > > severals problems as the DRM was originally designed to schedule jobs on > > hardware queues. The main problem being that DRM scheduler expects the > > submission order of jobs to be the completion order of jobs even across > > multiple entities. This assumption falls apart with a firmware scheduler > > as a firmware scheduler has no concept of jobs and jobs can complete out > > of order. A novel solution for was originally thought of by Faith during > > the initial prototype of Xe, create a 1 to 1 relationship between scheduler > > and entity. I believe the AGX driver [3] is using this approach and > > Boris may use approach as well for the Mali driver [4]. > > > > To support a 1 to 1 relationship we move the main execution function > > from a kthread to a work queue and add a new scheduling mode which > > bypasses code in the DRM which isn't needed in a 1 to 1 relationship. > > The new scheduling mode should unify all drivers usage with a 1 to 1 > > relationship and can be thought of as using scheduler as a dependency / > > infligt job tracker rather than a true scheduler. > > Once you add capability for a more proper 1:1 via > DRM_SCHED_POLICY_SINGLE_ENTITY, do you still have further need to replace > kthreads with a wq? > > Or in other words, what purpose does the offloading of a job picking code to > a separate execution context serve? Could it be done directly in the 1:1 > mode and leave kthread setup for N:M? > Addressed the other two on my reply to Christian... For this one basically the concept of a single entity point IMO is a very good concept which I'd like to keep. But most important reason being the main execution thread (now worker) is kicked when a dependency for a job is resolved, dependencies are dma-fences signaled via a callback, and these call backs can be signaled in IRQ contexts. We absolutely do not want to enter the backend in an IRQ context for a variety of reasons. Matt > Apart from those design level questions, low level open IMO still is that > default fallback of using the system_wq has the potential to affect latency > for other drivers. But that's for those driver owners to approve. > > Regards, > > Tvrtko > > > - Generic messaging interface for DRM scheduler > > > > Idea is to be able to communicate to the submission backend with in band > > (relative to main execution function) messages. Messages are backend > > defined and flexable enough for any use case. In Xe we use these > > messages to clean up entites, set properties for entites, and suspend / > > resume execution of an entity [5]. I suspect other driver can leverage > > this messaging concept too as it a convenient way to avoid races in the > > backend. 
> > > > - Support for using TDR for all error paths of a scheduler / entity > > > > Fix a few races / bugs, add function to dynamically set the TDR timeout. > > > > - Annotate dma-fences for long running workloads. > > > > The idea here is to use dma-fences only as sync points within the > > scheduler and never export them for long running workloads. By > > annotating these fences as long running we ensure that these dma-fences > > are never used in a way that breaks the dma-fence rules. A benefit of > > thus approach is the scheduler can still safely flow control the > > execution ring buffer via the job limit without breaking the dma-fence > > rules. > > > > Again this a first draft and looking forward to feedback. > > > > Enjoy - Matt > > > > [1] https://gitlab.freedesktop.org/drm/xe/kernel > > [2] https://patchwork.freedesktop.org/series/112188/ > > [3] https://patchwork.freedesktop.org/series/114772/ > > [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1 > > [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031 > > > > Matthew Brost (8): > > drm/sched: Convert drm scheduler to use a work queue rather than > > kthread > > drm/sched: Move schedule policy to scheduler / entity > > drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy > > drm/sched: Add generic scheduler message interface > > drm/sched: Start run wq before TDR in drm_sched_start > > drm/sched: Submit job before starting TDR > > drm/sched: Add helper to set TDR timeout > > drm/syncobj: Warn on long running dma-fences > > > > Thomas Hellström (2): > > dma-buf/dma-fence: Introduce long-running completion fences > > drm/sched: Support long-running sched entities > > > > drivers/dma-buf/dma-fence.c | 142 +++++++--- > > drivers/dma-buf/dma-resv.c | 5 + > > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 14 +- > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +- > > drivers/gpu/drm/drm_syncobj.c | 5 +- > > drivers/gpu/drm/etnaviv/etnaviv_sched.c | 5 +- > > drivers/gpu/drm/lima/lima_sched.c | 5 +- > > drivers/gpu/drm/msm/adreno/adreno_device.c | 6 +- > > drivers/gpu/drm/msm/msm_ringbuffer.c | 5 +- > > drivers/gpu/drm/panfrost/panfrost_job.c | 5 +- > > drivers/gpu/drm/scheduler/sched_entity.c | 127 +++++++-- > > drivers/gpu/drm/scheduler/sched_fence.c | 6 +- > > drivers/gpu/drm/scheduler/sched_main.c | 278 +++++++++++++++----- > > drivers/gpu/drm/v3d/v3d_sched.c | 25 +- > > include/drm/gpu_scheduler.h | 130 +++++++-- > > include/linux/dma-fence.h | 60 ++++- > > 16 files changed, 649 insertions(+), 184 deletions(-) > >
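A minimal example of the constraint described above: the dependency callback can fire from (hard) IRQ context when the fence signals, so all it may safely do is kick the worker; the backend is only ever entered from process context. Everything except the dma-fence and workqueue APIs is an illustrative name (my_sched here is the same placeholder struct with a wq and work item used in the earlier sketch).

	#include <linux/dma-fence.h>
	#include <linux/workqueue.h>

	struct my_job {
		struct dma_fence_cb dep_cb;
		struct my_sched *sched;
	};

	static void my_dep_signaled(struct dma_fence *fence, struct dma_fence_cb *cb)
	{
		struct my_job *job = container_of(cb, struct my_job, dep_cb);

		/* Possibly hard-IRQ context: just kick the worker, nothing else. */
		queue_work(job->sched->wq, &job->sched->work);
	}

	static int my_job_arm_dependency(struct my_job *job, struct dma_fence *dep)
	{
		int ret = dma_fence_add_callback(dep, &job->dep_cb, my_dep_signaled);

		if (ret == -ENOENT) {
			/* Dependency already signaled, run the worker right away. */
			queue_work(job->sched->wq, &job->sched->work);
			ret = 0;
		}
		return ret;
	}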
On 04/04/2023 14:52, Matthew Brost wrote: > On Tue, Apr 04, 2023 at 10:43:03AM +0100, Tvrtko Ursulin wrote: >> >> On 04/04/2023 01:22, Matthew Brost wrote: >>> Hello, >>> >>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we >>> have been asked to merge our common DRM scheduler patches first as well >>> as develop a common solution for long running workloads with the DRM >>> scheduler. This RFC series is our first attempt at doing this. We >>> welcome any and all feedback. >>> >>> This can we thought of as 4 parts detailed below. >>> >>> - DRM scheduler changes for 1 to 1 relationship between scheduler and >>> entity (patches 1-3) >>> >>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the >>> GuC) which is a new paradigm WRT to the DRM scheduler and presents >>> severals problems as the DRM was originally designed to schedule jobs on >>> hardware queues. The main problem being that DRM scheduler expects the >>> submission order of jobs to be the completion order of jobs even across >>> multiple entities. This assumption falls apart with a firmware scheduler >>> as a firmware scheduler has no concept of jobs and jobs can complete out >>> of order. A novel solution for was originally thought of by Faith during >>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler >>> and entity. I believe the AGX driver [3] is using this approach and >>> Boris may use approach as well for the Mali driver [4]. >>> >>> To support a 1 to 1 relationship we move the main execution function >>> from a kthread to a work queue and add a new scheduling mode which >>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship. >>> The new scheduling mode should unify all drivers usage with a 1 to 1 >>> relationship and can be thought of as using scheduler as a dependency / >>> infligt job tracker rather than a true scheduler. >> >> Once you add capability for a more proper 1:1 via >> DRM_SCHED_POLICY_SINGLE_ENTITY, do you still have further need to replace >> kthreads with a wq? >> >> Or in other words, what purpose does the offloading of a job picking code to >> a separate execution context serve? Could it be done directly in the 1:1 >> mode and leave kthread setup for N:M? >> > > Addressed the other two on my reply to Christian... > > For this one basically the concept of a single entity point IMO is a > very good concept which I'd like to keep. But most important reason > being the main execution thread (now worker) is kicked when a dependency > for a job is resolved, dependencies are dma-fences signaled via a > callback, and these call backs can be signaled in IRQ contexts. We > absolutely do not want to enter the backend in an IRQ context for a > variety of reasons. Sounds like a fair enough requirement but if drivers will not be comfortable with the wq conversion, it is probably possible to introduce some vfuncs for the 1:1 case which would allow scheduler users override the scheduler wakeup and select a special "pick one job" path. That could allow 1:1 users do their thing, leaving rest as is. I mean you already have the special single entity scheduler, you'd just need to add some more specialization on the init, wake up, etc paths. And I will mention once more that I find a wq item with a loop such as: while (!READ_ONCE(sched->pause_run_wq)) { ... A bit dodgy. If you piggy back on any system_wq it smells of system wide starvation so for me any proposal with an option to use a system shared wq is a no go. 
Regards, Tvrtko >> Apart from those design level questions, low level open IMO still is that >> default fallback of using the system_wq has the potential to affect latency >> for other drivers. But that's for those driver owners to approve. >> >> Regards, >> >> Tvrtko >> >>> - Generic messaging interface for DRM scheduler >>> >>> Idea is to be able to communicate to the submission backend with in band >>> (relative to main execution function) messages. Messages are backend >>> defined and flexable enough for any use case. In Xe we use these >>> messages to clean up entites, set properties for entites, and suspend / >>> resume execution of an entity [5]. I suspect other driver can leverage >>> this messaging concept too as it a convenient way to avoid races in the >>> backend. >>> >>> - Support for using TDR for all error paths of a scheduler / entity >>> >>> Fix a few races / bugs, add function to dynamically set the TDR timeout. >>> >>> - Annotate dma-fences for long running workloads. >>> >>> The idea here is to use dma-fences only as sync points within the >>> scheduler and never export them for long running workloads. By >>> annotating these fences as long running we ensure that these dma-fences >>> are never used in a way that breaks the dma-fence rules. A benefit of >>> thus approach is the scheduler can still safely flow control the >>> execution ring buffer via the job limit without breaking the dma-fence >>> rules. >>> >>> Again this a first draft and looking forward to feedback. >>> >>> Enjoy - Matt >>> >>> [1] https://gitlab.freedesktop.org/drm/xe/kernel >>> [2] https://patchwork.freedesktop.org/series/112188/ >>> [3] https://patchwork.freedesktop.org/series/114772/ >>> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1 >>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031 >>> >>> Matthew Brost (8): >>> drm/sched: Convert drm scheduler to use a work queue rather than >>> kthread >>> drm/sched: Move schedule policy to scheduler / entity >>> drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy >>> drm/sched: Add generic scheduler message interface >>> drm/sched: Start run wq before TDR in drm_sched_start >>> drm/sched: Submit job before starting TDR >>> drm/sched: Add helper to set TDR timeout >>> drm/syncobj: Warn on long running dma-fences >>> >>> Thomas Hellström (2): >>> dma-buf/dma-fence: Introduce long-running completion fences >>> drm/sched: Support long-running sched entities >>> >>> drivers/dma-buf/dma-fence.c | 142 +++++++--- >>> drivers/dma-buf/dma-resv.c | 5 + >>> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 14 +- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +- >>> drivers/gpu/drm/drm_syncobj.c | 5 +- >>> drivers/gpu/drm/etnaviv/etnaviv_sched.c | 5 +- >>> drivers/gpu/drm/lima/lima_sched.c | 5 +- >>> drivers/gpu/drm/msm/adreno/adreno_device.c | 6 +- >>> drivers/gpu/drm/msm/msm_ringbuffer.c | 5 +- >>> drivers/gpu/drm/panfrost/panfrost_job.c | 5 +- >>> drivers/gpu/drm/scheduler/sched_entity.c | 127 +++++++-- >>> drivers/gpu/drm/scheduler/sched_fence.c | 6 +- >>> drivers/gpu/drm/scheduler/sched_main.c | 278 +++++++++++++++----- >>> drivers/gpu/drm/v3d/v3d_sched.c | 25 +- >>> include/drm/gpu_scheduler.h | 130 +++++++-- >>> include/linux/dma-fence.h | 60 ++++- >>> 16 files changed, 649 insertions(+), 184 deletions(-) >>>
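For comparison with the loop quoted above, one way to stay friendly to a shared workqueue is to do a bounded amount of work per invocation and re-queue, rather than looping until a pause is requested. This is only a sketch of the alternative pattern under discussion, not what the series implements; the my_sched_* and my_backend_* helpers are placeholders.

	/*
	 * "Process one, re-queue" work function as an alternative to
	 * while (!READ_ONCE(sched->pause_run_wq)) { ... } on a shared wq.
	 */
	static void my_sched_work_fn(struct work_struct *w)
	{
		struct my_sched *s = container_of(w, struct my_sched, work);
		struct my_job *job;

		if (READ_ONCE(s->pause_run_wq))
			return;

		job = my_sched_pick_job(s);		/* driver-defined */
		if (!job)
			return;

		my_backend_run_job(s, job);		/* driver-defined */

		/* Yield back to the workqueue instead of looping. */
		if (my_sched_has_ready_job(s))		/* driver-defined */
			queue_work(s->wq, &s->work);
	}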
Hi Matt, Thomas, Some very bold out of box thinking in this area: 1. so you want to use drm scheduler and dma-fence for long running workload. Why you want to do this in the first place? What is the benefit? Drm scheduler is pretty much a software scheduler. Modern gpu has scheduler built at fw/hw level, as you said below for intel this is Guc. Can xe driver just directly submit job to Guc, bypassing drm scheduler? 2. using dma-fence for long run workload: I am well aware that page fault (and the consequent memory allocation/lock acquiring to fix the fault) can cause deadlock for a dma-fence wait. But I am not convinced that dma-fence can't be used purely because the nature of the workload that it runs very long (indefinite). I did a math: the dma_fence_wait_timeout function's third param is the timeout which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not long enough, can we just change the timeout parameter to signed 64 bits so it is much longer than our life time... So I mainly argue we can't use dma-fence for long-run workload is not because the workload runs very long, rather because of the fact that we use page fault for long-run workload. If we enable page fault for short-run workload, we can't use dma-fence either. Page fault is the key thing here. Now since we use page fault which is *fundamentally* controversial with dma-fence design, why now just introduce a independent concept such as user-fence instead of extending existing dma-fence? I like unified design. If drm scheduler, dma-fence can be extended to work for everything, it is beautiful. But seems we have some fundamental problem here. Thanks, Oak > -----Original Message----- > From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of > Matthew Brost > Sent: April 3, 2023 8:22 PM > To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org > Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com; airlied@linux.ie; > lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew > <matthew.brost@intel.com>; christian.koenig@amd.com; > faith.ekstrand@collabora.com > Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans > > Hello, > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we > have been asked to merge our common DRM scheduler patches first as well > as develop a common solution for long running workloads with the DRM > scheduler. This RFC series is our first attempt at doing this. We > welcome any and all feedback. > > This can we thought of as 4 parts detailed below. > > - DRM scheduler changes for 1 to 1 relationship between scheduler and > entity (patches 1-3) > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the > GuC) which is a new paradigm WRT to the DRM scheduler and presents > severals problems as the DRM was originally designed to schedule jobs on > hardware queues. The main problem being that DRM scheduler expects the > submission order of jobs to be the completion order of jobs even across > multiple entities. This assumption falls apart with a firmware scheduler > as a firmware scheduler has no concept of jobs and jobs can complete out > of order. A novel solution for was originally thought of by Faith during > the initial prototype of Xe, create a 1 to 1 relationship between scheduler > and entity. I believe the AGX driver [3] is using this approach and > Boris may use approach as well for the Mali driver [4]. 
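For reference on the timeout math above: dma_fence_wait_timeout() takes its timeout in jiffies as a signed long, so the ceiling depends on the width of long rather than on widening the parameter:

  LONG_MAX jiffies at HZ = 1000, 32-bit long:  (2^31 - 1) ms ~= 24.8 days
  LONG_MAX jiffies at HZ = 1000, 64-bit long:  (2^63 - 1) ms ~= 292 million years

Either way the duration itself is not the blocker; the page fault interaction described above is.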
On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote: > Hi Matt, Thomas, > > Some very bold out of box thinking in this area: > > 1. so you want to use drm scheduler and dma-fence for long running workload. Why you want to do this in the first place? What is the benefit? Drm scheduler is pretty much a software scheduler. Modern gpu has scheduler built at fw/hw level, as you said below for intel this is Guc. Can xe driver just directly submit job to Guc, bypassing drm scheduler? > If we did that now we have 2 paths for dependency track, flow controling the ring, resets / error handling / backend submission implementations. We don't want this. > 2. using dma-fence for long run workload: I am well aware that page fault (and the consequent memory allocation/lock acquiring to fix the fault) can cause deadlock for a dma-fence wait. But I am not convinced that dma-fence can't be used purely because the nature of the workload that it runs very long (indefinite). I did a math: the dma_fence_wait_timeout function's third param is the timeout which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not long enough, can we just change the timeout parameter to signed 64 bits so it is much longer than our life time... > > So I mainly argue we can't use dma-fence for long-run workload is not because the workload runs very long, rather because of the fact that we use page fault for long-run workload. If we enable page fault for short-run workload, we can't use dma-fence either. Page fault is the key thing here. > > Now since we use page fault which is *fundamentally* controversial with dma-fence design, why now just introduce a independent concept such as user-fence instead of extending existing dma-fence? > > I like unified design. If drm scheduler, dma-fence can be extended to work for everything, it is beautiful. But seems we have some fundamental problem here. > Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use the signal / CB infrastructure) and enforce we don't use use these dma-fences from the scheduler in memory reclaim paths or export these to user space or other drivers. Think of this mode as SW only fence. Matt > Thanks, > Oak > > > -----Original Message----- > > From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of > > Matthew Brost > > Sent: April 3, 2023 8:22 PM > > To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org > > Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com; airlied@linux.ie; > > lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew > > <matthew.brost@intel.com>; christian.koenig@amd.com; > > faith.ekstrand@collabora.com > > Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans > > > > Hello, > > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we > > have been asked to merge our common DRM scheduler patches first as well > > as develop a common solution for long running workloads with the DRM > > scheduler. This RFC series is our first attempt at doing this. We > > welcome any and all feedback. > > > > This can we thought of as 4 parts detailed below. > > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and > > entity (patches 1-3) > > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the > > GuC) which is a new paradigm WRT to the DRM scheduler and presents > > severals problems as the DRM was originally designed to schedule jobs on > > hardware queues. 
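To illustrate the "SW only fence" mode described above: the sketch below is hypothetical (the flag name and helpers are made up, not the actual API from Thomas's patches), but it shows the shape of marking a fence as scheduler-internal and refusing to export it:

  #include <linux/bitops.h>
  #include <linux/dma-fence.h>

  /* Hypothetical driver-private bit; the series defines its own scheme. */
  #define MY_FENCE_FLAG_LR_BIT    (DMA_FENCE_FLAG_USER_BITS + 0)

  /* KMD-internal sync points for long running work carry the bit... */
  static void my_fence_mark_long_running(struct dma_fence *fence)
  {
          set_bit(MY_FENCE_FLAG_LR_BIT, &fence->flags);
  }

  /* ...and export paths (syncobj, dma-resv, sync_file) refuse them. */
  static bool my_fence_exportable(struct dma_fence *fence)
  {
          return !test_bit(MY_FENCE_FLAG_LR_BIT, &fence->flags);
  }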
On Tue, 4 Apr 2023 at 19:29, Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> wrote: > > > On 04/04/2023 14:52, Matthew Brost wrote: > > On Tue, Apr 04, 2023 at 10:43:03AM +0100, Tvrtko Ursulin wrote: > >> > >> On 04/04/2023 01:22, Matthew Brost wrote: > >>> Hello, > >>> > >>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we > >>> have been asked to merge our common DRM scheduler patches first as well > >>> as develop a common solution for long running workloads with the DRM > >>> scheduler. This RFC series is our first attempt at doing this. We > >>> welcome any and all feedback. > >>> > >>> This can we thought of as 4 parts detailed below. > >>> > >>> - DRM scheduler changes for 1 to 1 relationship between scheduler and > >>> entity (patches 1-3) > >>> > >>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the > >>> GuC) which is a new paradigm WRT to the DRM scheduler and presents > >>> severals problems as the DRM was originally designed to schedule jobs on > >>> hardware queues. The main problem being that DRM scheduler expects the > >>> submission order of jobs to be the completion order of jobs even across > >>> multiple entities. This assumption falls apart with a firmware scheduler > >>> as a firmware scheduler has no concept of jobs and jobs can complete out > >>> of order. A novel solution for was originally thought of by Faith during > >>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler > >>> and entity. I believe the AGX driver [3] is using this approach and > >>> Boris may use approach as well for the Mali driver [4]. > >>> > >>> To support a 1 to 1 relationship we move the main execution function > >>> from a kthread to a work queue and add a new scheduling mode which > >>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship. > >>> The new scheduling mode should unify all drivers usage with a 1 to 1 > >>> relationship and can be thought of as using scheduler as a dependency / > >>> infligt job tracker rather than a true scheduler. > >> > >> Once you add capability for a more proper 1:1 via > >> DRM_SCHED_POLICY_SINGLE_ENTITY, do you still have further need to replace > >> kthreads with a wq? > >> > >> Or in other words, what purpose does the offloading of a job picking code to > >> a separate execution context serve? Could it be done directly in the 1:1 > >> mode and leave kthread setup for N:M? > >> > > > > Addressed the other two on my reply to Christian... > > > > For this one basically the concept of a single entity point IMO is a > > very good concept which I'd like to keep. But most important reason > > being the main execution thread (now worker) is kicked when a dependency > > for a job is resolved, dependencies are dma-fences signaled via a > > callback, and these call backs can be signaled in IRQ contexts. We > > absolutely do not want to enter the backend in an IRQ context for a > > variety of reasons. > > Sounds like a fair enough requirement but if drivers will not be > comfortable with the wq conversion, it is probably possible to introduce > some vfuncs for the 1:1 case which would allow scheduler users override > the scheduler wakeup and select a special "pick one job" path. That > could allow 1:1 users do their thing, leaving rest as is. I mean you > already have the special single entity scheduler, you'd just need to add > some more specialization on the init, wake up, etc paths. 
> > And I will mention once more that I find a wq item with a loop such as: > > while (!READ_ONCE(sched->pause_run_wq)) { > ... > > A bit dodgy. If you piggy back on any system_wq it smells of system wide > starvation so for me any proposal with an option to use a system shared > wq is a no go. Yeah I think the argument for wq based scheduler would need a per-drm_scheduler wq, like we currently have a per-scheduler kthread. It might still need some serious work to replace the kthread_stop/start() with a wq-native equivalent (because really this is the tricky stuff we shouldn't hand-roll unless someone is willing to write a few papers on the lockless design that's done), but would look a bunch more reasonable. Having a per-sched workqueue might also help with the big sched_stop/start/fini state transitions, which I really think should still go over all the per-entity schedulers even in the 1:1 case (because otherwise you get some funky code in drivers that do the iterations needed, which probably tosses the fairly nice design the current scheduler has by relying on the kthread_stop/start primitives for this. -Daniel > > Regards, > > Tvrtko > > > >> Apart from those design level questions, low level open IMO still is that > >> default fallback of using the system_wq has the potential to affect latency > >> for other drivers. But that's for those driver owners to approve. > >> > >> Regards, > >> > >> Tvrtko > >> > >>> - Generic messaging interface for DRM scheduler > >>> > >>> Idea is to be able to communicate to the submission backend with in band > >>> (relative to main execution function) messages. Messages are backend > >>> defined and flexable enough for any use case. In Xe we use these > >>> messages to clean up entites, set properties for entites, and suspend / > >>> resume execution of an entity [5]. I suspect other driver can leverage > >>> this messaging concept too as it a convenient way to avoid races in the > >>> backend. > >>> > >>> - Support for using TDR for all error paths of a scheduler / entity > >>> > >>> Fix a few races / bugs, add function to dynamically set the TDR timeout. > >>> > >>> - Annotate dma-fences for long running workloads. > >>> > >>> The idea here is to use dma-fences only as sync points within the > >>> scheduler and never export them for long running workloads. By > >>> annotating these fences as long running we ensure that these dma-fences > >>> are never used in a way that breaks the dma-fence rules. A benefit of > >>> thus approach is the scheduler can still safely flow control the > >>> execution ring buffer via the job limit without breaking the dma-fence > >>> rules. > >>> > >>> Again this a first draft and looking forward to feedback. 
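A wq-native equivalent of the stop/start dance could look roughly like the sketch below; the names are made up (the series has its own drm_sched_run_wq_stop()), and this deliberately ignores the harder TDR/reset ordering questions:

  #include <linux/compiler.h>
  #include <linux/workqueue.h>

  struct my_sched {
          struct workqueue_struct *run_wq;        /* per-scheduler queue */
          struct work_struct run_work;
          bool pause_run_wq;      /* mirrors the flag in the quoted loop */
  };

  static void my_sched_run_wq_stop(struct my_sched *sched)
  {
          WRITE_ONCE(sched->pause_run_wq, true);
          /* cancel a pending run and wait for an in-flight one to finish */
          cancel_work_sync(&sched->run_work);
  }

  static void my_sched_run_wq_start(struct my_sched *sched)
  {
          WRITE_ONCE(sched->pause_run_wq, false);
          queue_work(sched->run_wq, &sched->run_work);
  }

Unlike kthread_park(), cancel_work_sync() does not by itself stop someone from queueing the work again afterwards, which is why the pause flag still has to be checked by both the work function and the wakeup path.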
Am 04.04.23 um 20:08 schrieb Matthew Brost: > On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote: >> Hi Matt, Thomas, >> >> Some very bold out of box thinking in this area: >> >> 1. so you want to use drm scheduler and dma-fence for long running workload. Why you want to do this in the first place? What is the benefit? Drm scheduler is pretty much a software scheduler. Modern gpu has scheduler built at fw/hw level, as you said below for intel this is Guc. Can xe driver just directly submit job to Guc, bypassing drm scheduler? >> > If we did that now we have 2 paths for dependency track, flow controling > the ring, resets / error handling / backend submission implementations. > We don't want this. Well exactly that's the point: Why? As far as I can see that are two completely distinct use cases, so you absolutely do want two completely distinct implementations for this. >> 2. using dma-fence for long run workload: I am well aware that page fault (and the consequent memory allocation/lock acquiring to fix the fault) can cause deadlock for a dma-fence wait. But I am not convinced that dma-fence can't be used purely because the nature of the workload that it runs very long (indefinite). I did a math: the dma_fence_wait_timeout function's third param is the timeout which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not long enough, can we just change the timeout parameter to signed 64 bits so it is much longer than our life time... >> >> So I mainly argue we can't use dma-fence for long-run workload is not because the workload runs very long, rather because of the fact that we use page fault for long-run workload. If we enable page fault for short-run workload, we can't use dma-fence either. Page fault is the key thing here. >> >> Now since we use page fault which is *fundamentally* controversial with dma-fence design, why now just introduce a independent concept such as user-fence instead of extending existing dma-fence? >> >> I like unified design. If drm scheduler, dma-fence can be extended to work for everything, it is beautiful. But seems we have some fundamental problem here. >> > Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use > the signal / CB infrastructure) and enforce we don't use use these > dma-fences from the scheduler in memory reclaim paths or export these to > user space or other drivers. Think of this mode as SW only fence. Yeah and I truly think this is an really bad idea. The signal/CB infrastructure in the dma_fence turned out to be the absolutely nightmare I initially predicted. Sorry to say that, but in this case the "I've told you so" is appropriate in my opinion. If we need infrastructure for long running dependency tracking we should encapsulate that in a new framework and not try to mangle the existing code for something it was never intended for. Christian. 
Am 04.04.23 um 15:37 schrieb Matthew Brost: > On Tue, Apr 04, 2023 at 11:13:28AM +0200, Christian König wrote: >> Hi, >> >> Am 04.04.23 um 02:22 schrieb Matthew Brost: >>> Hello, >>> >>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we >>> have been asked to merge our common DRM scheduler patches first as well >>> as develop a common solution for long running workloads with the DRM >>> scheduler. This RFC series is our first attempt at doing this. We >>> welcome any and all feedback. >>> >>> This can we thought of as 4 parts detailed below. >>> >>> - DRM scheduler changes for 1 to 1 relationship between scheduler and >>> entity (patches 1-3) >>> >>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the >>> GuC) which is a new paradigm WRT to the DRM scheduler and presents >>> severals problems as the DRM was originally designed to schedule jobs on >>> hardware queues. The main problem being that DRM scheduler expects the >>> submission order of jobs to be the completion order of jobs even across >>> multiple entities. This assumption falls apart with a firmware scheduler >>> as a firmware scheduler has no concept of jobs and jobs can complete out >>> of order. A novel solution for was originally thought of by Faith during >>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler >>> and entity. I believe the AGX driver [3] is using this approach and >>> Boris may use approach as well for the Mali driver [4]. >>> >>> To support a 1 to 1 relationship we move the main execution function >>> from a kthread to a work queue and add a new scheduling mode which >>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship. >>> The new scheduling mode should unify all drivers usage with a 1 to 1 >>> relationship and can be thought of as using scheduler as a dependency / >>> infligt job tracker rather than a true scheduler. >>> >>> - Generic messaging interface for DRM scheduler >>> >>> Idea is to be able to communicate to the submission backend with in band >>> (relative to main execution function) messages. Messages are backend >>> defined and flexable enough for any use case. In Xe we use these >>> messages to clean up entites, set properties for entites, and suspend / >>> resume execution of an entity [5]. I suspect other driver can leverage >>> this messaging concept too as it a convenient way to avoid races in the >>> backend. >> Oh, please absolutely *don't* do this. >> >> This is basically the design which makes a bunch of stuff so horrible broken >> on Windows. >> >> I can explain it in more detail if necessary, but I strongly recommend to >> not go down this path. >> > I'm afraid we are going to have to discuss this further. Let me explain > my reasoning, basically the idea is to have a single main entry point to > backend - the work queue. This avoids the need for lock between run_job > and any message that changes an entites state, also it really helps > during the reset flows (either TDR or GT reset) as we can call > drm_sched_run_wq_stop can ensure that nothing else is in the backend > changing an entity state. It all works out really nicely actually, our > GuC backend is incredibly stable (hasn't really had a bug pop in about a > year) and way simpler than what we did in the i915. I think the simplity > to largely due to this design of limiting the entry points. 
> > I personally don't see how this a poor design, limiting entry points > absolutely makes sense to me, if it didn't why not just call cleanup_job > bypassing the main execution thread (now worker), this is the exact same > concept. Well then I strongly suggest to read a few analyses on the failure of the message processing loop on Windows. Have you ever wondered why classic Win32 applications sometimes seems to be stuck and don't do anything? This design pattern combine with timeouts to solve deadlocks is the reason for that. The major problem with this approach is that analyzing tools like lockdep have a hard time grasping the dependencies. What you can do is to offload all your operations which are supposed to be run in the same thread as work items into a work queue. This is something lockdep understands and is able to scream out lout if someone messes up the deadlock dependencies. Regards, Christian. > > FWIW Asahi liked the idea as well and think it could be useful for AGX. > Matt > >> Regards, >> Christian. >> >>> - Support for using TDR for all error paths of a scheduler / entity >>> >>> Fix a few races / bugs, add function to dynamically set the TDR timeout. >>> >>> - Annotate dma-fences for long running workloads. >>> >>> The idea here is to use dma-fences only as sync points within the >>> scheduler and never export them for long running workloads. By >>> annotating these fences as long running we ensure that these dma-fences >>> are never used in a way that breaks the dma-fence rules. A benefit of >>> thus approach is the scheduler can still safely flow control the >>> execution ring buffer via the job limit without breaking the dma-fence >>> rules. >>> >>> Again this a first draft and looking forward to feedback. >>> >>> Enjoy - Matt >>> >>> [1] https://gitlab.freedesktop.org/drm/xe/kernel >>> [2] https://patchwork.freedesktop.org/series/112188/ >>> [3] https://patchwork.freedesktop.org/series/114772/ >>> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1 >>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031 >>> >>> Matthew Brost (8): >>> drm/sched: Convert drm scheduler to use a work queue rather than >>> kthread >>> drm/sched: Move schedule policy to scheduler / entity >>> drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy >>> drm/sched: Add generic scheduler message interface >>> drm/sched: Start run wq before TDR in drm_sched_start >>> drm/sched: Submit job before starting TDR >>> drm/sched: Add helper to set TDR timeout >>> drm/syncobj: Warn on long running dma-fences >>> >>> Thomas Hellström (2): >>> dma-buf/dma-fence: Introduce long-running completion fences >>> drm/sched: Support long-running sched entities >>> >>> drivers/dma-buf/dma-fence.c | 142 +++++++--- >>> drivers/dma-buf/dma-resv.c | 5 + >>> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 14 +- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +- >>> drivers/gpu/drm/drm_syncobj.c | 5 +- >>> drivers/gpu/drm/etnaviv/etnaviv_sched.c | 5 +- >>> drivers/gpu/drm/lima/lima_sched.c | 5 +- >>> drivers/gpu/drm/msm/adreno/adreno_device.c | 6 +- >>> drivers/gpu/drm/msm/msm_ringbuffer.c | 5 +- >>> drivers/gpu/drm/panfrost/panfrost_job.c | 5 +- >>> drivers/gpu/drm/scheduler/sched_entity.c | 127 +++++++-- >>> drivers/gpu/drm/scheduler/sched_fence.c | 6 +- >>> drivers/gpu/drm/scheduler/sched_main.c | 278 +++++++++++++++----- >>> drivers/gpu/drm/v3d/v3d_sched.c | 25 +- >>> include/drm/gpu_scheduler.h | 130 +++++++-- >>> 
include/linux/dma-fence.h | 60 ++++- >>> 16 files changed, 649 insertions(+), 184 deletions(-) >>>
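To make the "offload operations as work items" suggestion above concrete, a sketch with made-up names could look like this; the point is that each backend operation becomes its own work item on the same ordered queue as the run_job work, so ordering comes from the queue and lockdep can see the dependencies:

  #include <linux/errno.h>
  #include <linux/kernel.h>
  #include <linux/slab.h>
  #include <linux/workqueue.h>

  struct my_entity;                               /* opaque for this sketch */
  void my_entity_do_cleanup(struct my_entity *entity);

  struct my_entity_cleanup {
          struct work_struct work;
          struct my_entity *entity;
  };

  static void my_entity_cleanup_work(struct work_struct *w)
  {
          struct my_entity_cleanup *op =
                  container_of(w, struct my_entity_cleanup, work);

          my_entity_do_cleanup(op->entity);
          kfree(op);
  }

  /*
   * Queued on the same ordered wq as the run_job work, so it is naturally
   * serialized with submission without extra locking.
   */
  static int my_entity_queue_cleanup(struct workqueue_struct *wq,
                                     struct my_entity *entity)
  {
          struct my_entity_cleanup *op = kzalloc(sizeof(*op), GFP_KERNEL);

          if (!op)
                  return -ENOMEM;

          op->entity = entity;
          INIT_WORK(&op->work, my_entity_cleanup_work);
          queue_work(wq, &op->work);
          return 0;
  }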
On Wed, Apr 05, 2023 at 09:41:23AM +0200, Christian König wrote: > Am 04.04.23 um 15:37 schrieb Matthew Brost: > > On Tue, Apr 04, 2023 at 11:13:28AM +0200, Christian König wrote: > > > Hi, > > > > > > Am 04.04.23 um 02:22 schrieb Matthew Brost: > > > > Hello, > > > > > > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we > > > > have been asked to merge our common DRM scheduler patches first as well > > > > as develop a common solution for long running workloads with the DRM > > > > scheduler. This RFC series is our first attempt at doing this. We > > > > welcome any and all feedback. > > > > > > > > This can we thought of as 4 parts detailed below. > > > > > > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and > > > > entity (patches 1-3) > > > > > > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the > > > > GuC) which is a new paradigm WRT to the DRM scheduler and presents > > > > severals problems as the DRM was originally designed to schedule jobs on > > > > hardware queues. The main problem being that DRM scheduler expects the > > > > submission order of jobs to be the completion order of jobs even across > > > > multiple entities. This assumption falls apart with a firmware scheduler > > > > as a firmware scheduler has no concept of jobs and jobs can complete out > > > > of order. A novel solution for was originally thought of by Faith during > > > > the initial prototype of Xe, create a 1 to 1 relationship between scheduler > > > > and entity. I believe the AGX driver [3] is using this approach and > > > > Boris may use approach as well for the Mali driver [4]. > > > > > > > > To support a 1 to 1 relationship we move the main execution function > > > > from a kthread to a work queue and add a new scheduling mode which > > > > bypasses code in the DRM which isn't needed in a 1 to 1 relationship. > > > > The new scheduling mode should unify all drivers usage with a 1 to 1 > > > > relationship and can be thought of as using scheduler as a dependency / > > > > infligt job tracker rather than a true scheduler. > > > > > > > > - Generic messaging interface for DRM scheduler > > > > > > > > Idea is to be able to communicate to the submission backend with in band > > > > (relative to main execution function) messages. Messages are backend > > > > defined and flexable enough for any use case. In Xe we use these > > > > messages to clean up entites, set properties for entites, and suspend / > > > > resume execution of an entity [5]. I suspect other driver can leverage > > > > this messaging concept too as it a convenient way to avoid races in the > > > > backend. > > > Oh, please absolutely *don't* do this. > > > > > > This is basically the design which makes a bunch of stuff so horrible broken > > > on Windows. > > > > > > I can explain it in more detail if necessary, but I strongly recommend to > > > not go down this path. > > > > > I'm afraid we are going to have to discuss this further. Let me explain > > my reasoning, basically the idea is to have a single main entry point to > > backend - the work queue. This avoids the need for lock between run_job > > and any message that changes an entites state, also it really helps > > during the reset flows (either TDR or GT reset) as we can call > > drm_sched_run_wq_stop can ensure that nothing else is in the backend > > changing an entity state. 
It all works out really nicely actually, our > > GuC backend is incredibly stable (hasn't really had a bug pop in about a > > year) and way simpler than what we did in the i915. I think the simplity > > to largely due to this design of limiting the entry points. > > > > I personally don't see how this a poor design, limiting entry points > > absolutely makes sense to me, if it didn't why not just call cleanup_job > > bypassing the main execution thread (now worker), this is the exact same > > concept. > > Well then I strongly suggest to read a few analyses on the failure of the > message processing loop on Windows. > > Have you ever wondered why classic Win32 applications sometimes seems to be > stuck and don't do anything? This design pattern combine with timeouts to > solve deadlocks is the reason for that. > > The major problem with this approach is that analyzing tools like lockdep > have a hard time grasping the dependencies. wq is fully annotated and actually splats. Plain kthread doesn't, without adding something like the dma_fence_signalling stuff. But yeah if you block badly in the work items and stall the entire queue, then things go sideways real bad. There's not really any tools we have in the kernel to enforce this, since we still want to allow mutex and sleeping and stuff like that. > What you can do is to offload all your operations which are supposed to be > run in the same thread as work items into a work queue. This is something > lockdep understands and is able to scream out lout if someone messes up the > deadlock dependencies. I thought that's the plan here? Or at least what I thought the plan was, and why I really think we need a per engine worqqueue to make it work well (and also why I suggested the refactoring to split up drm_scheduler into the driver api struct, which stays per-engine, and the internal backend which would be per drm_sched_entity for fw schedulers that round-robin gpu ctx on their own). Also maybe we need to allow drivers to pass in the workqueue like we allow for the tdr handling already, since that simplifies the locking. At least for intel gpu I think this message passing design makes some sense because fundamentally the fw only has a single blocking message queue. And so intel/xe fundamentally needs to deal with the "stupid code might block forward progress for everyone" problem you're describing, not matter whether it's done with the help of drm/sched infra or not. I do agree though that we shouldn't encourage drivers to use this which don't have that kind of fw command queue design. So maybe a huge comment to explain when (and _only_ when) it's ok to use that message passing would be needed, and explaining why in other cases it would make things a lot worse? -Daniel > > Regards, > Christian. > > > > > FWIW Asahi liked the idea as well and think it could be useful for AGX. > > Matt > > > > > Regards, > > > Christian. > > > > > > > - Support for using TDR for all error paths of a scheduler / entity > > > > > > > > Fix a few races / bugs, add function to dynamically set the TDR timeout. > > > > > > > > - Annotate dma-fences for long running workloads. > > > > > > > > The idea here is to use dma-fences only as sync points within the > > > > scheduler and never export them for long running workloads. By > > > > annotating these fences as long running we ensure that these dma-fences > > > > are never used in a way that breaks the dma-fence rules. 
A benefit of > > > > thus approach is the scheduler can still safely flow control the > > > > execution ring buffer via the job limit without breaking the dma-fence > > > > rules. > > > > > > > > Again this a first draft and looking forward to feedback. > > > > > > > > Enjoy - Matt > > > > > > > > [1] https://gitlab.freedesktop.org/drm/xe/kernel > > > > [2] https://patchwork.freedesktop.org/series/112188/ > > > > [3] https://patchwork.freedesktop.org/series/114772/ > > > > [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1 > > > > [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031 > > > > > > > > Matthew Brost (8): > > > > drm/sched: Convert drm scheduler to use a work queue rather than > > > > kthread > > > > drm/sched: Move schedule policy to scheduler / entity > > > > drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy > > > > drm/sched: Add generic scheduler message interface > > > > drm/sched: Start run wq before TDR in drm_sched_start > > > > drm/sched: Submit job before starting TDR > > > > drm/sched: Add helper to set TDR timeout > > > > drm/syncobj: Warn on long running dma-fences > > > > > > > > Thomas Hellström (2): > > > > dma-buf/dma-fence: Introduce long-running completion fences > > > > drm/sched: Support long-running sched entities > > > > > > > > drivers/dma-buf/dma-fence.c | 142 +++++++--- > > > > drivers/dma-buf/dma-resv.c | 5 + > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 14 +- > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +- > > > > drivers/gpu/drm/drm_syncobj.c | 5 +- > > > > drivers/gpu/drm/etnaviv/etnaviv_sched.c | 5 +- > > > > drivers/gpu/drm/lima/lima_sched.c | 5 +- > > > > drivers/gpu/drm/msm/adreno/adreno_device.c | 6 +- > > > > drivers/gpu/drm/msm/msm_ringbuffer.c | 5 +- > > > > drivers/gpu/drm/panfrost/panfrost_job.c | 5 +- > > > > drivers/gpu/drm/scheduler/sched_entity.c | 127 +++++++-- > > > > drivers/gpu/drm/scheduler/sched_fence.c | 6 +- > > > > drivers/gpu/drm/scheduler/sched_main.c | 278 +++++++++++++++----- > > > > drivers/gpu/drm/v3d/v3d_sched.c | 25 +- > > > > include/drm/gpu_scheduler.h | 130 +++++++-- > > > > include/linux/dma-fence.h | 60 ++++- > > > > 16 files changed, 649 insertions(+), 184 deletions(-) > > > > >
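For reference, the dma_fence_signalling annotation mentioned above is the existing dma_fence_begin_signalling()/dma_fence_end_signalling() pair; a hand-rolled main loop would have to wrap its job-picking pass roughly like this (the function name is made up):

  #include <linux/dma-fence.h>

  /*
   * Annotate the job-picking path as fence-signalling critical, so lockdep
   * can catch memory-reclaim or locking inversions on it.
   */
  static void my_sched_main_pass(void)
  {
          bool cookie = dma_fence_begin_signalling();

          /* pick a job, call run_job(), publish the resulting hw fence */

          dma_fence_end_signalling(cookie);
  }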
On Wed, Apr 05, 2023 at 09:30:11AM +0200, Christian König wrote: > Am 04.04.23 um 20:08 schrieb Matthew Brost: > > On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote: > > > Hi Matt, Thomas, > > > > > > Some very bold out of box thinking in this area: > > > > > > 1. so you want to use drm scheduler and dma-fence for long running workload. Why you want to do this in the first place? What is the benefit? Drm scheduler is pretty much a software scheduler. Modern gpu has scheduler built at fw/hw level, as you said below for intel this is Guc. Can xe driver just directly submit job to Guc, bypassing drm scheduler? > > > > > If we did that now we have 2 paths for dependency track, flow controling > > the ring, resets / error handling / backend submission implementations. > > We don't want this. > > Well exactly that's the point: Why? > > As far as I can see that are two completely distinct use cases, so you > absolutely do want two completely distinct implementations for this. > > > > 2. using dma-fence for long run workload: I am well aware that page fault (and the consequent memory allocation/lock acquiring to fix the fault) can cause deadlock for a dma-fence wait. But I am not convinced that dma-fence can't be used purely because the nature of the workload that it runs very long (indefinite). I did a math: the dma_fence_wait_timeout function's third param is the timeout which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not long enough, can we just change the timeout parameter to signed 64 bits so it is much longer than our life time... > > > > > > So I mainly argue we can't use dma-fence for long-run workload is not because the workload runs very long, rather because of the fact that we use page fault for long-run workload. If we enable page fault for short-run workload, we can't use dma-fence either. Page fault is the key thing here. > > > > > > Now since we use page fault which is *fundamentally* controversial with dma-fence design, why now just introduce a independent concept such as user-fence instead of extending existing dma-fence? > > > > > > I like unified design. If drm scheduler, dma-fence can be extended to work for everything, it is beautiful. But seems we have some fundamental problem here. > > > > > Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use > > the signal / CB infrastructure) and enforce we don't use use these > > dma-fences from the scheduler in memory reclaim paths or export these to > > user space or other drivers. Think of this mode as SW only fence. > > Yeah and I truly think this is an really bad idea. > > The signal/CB infrastructure in the dma_fence turned out to be the > absolutely nightmare I initially predicted. Sorry to say that, but in this > case the "I've told you so" is appropriate in my opinion. > > If we need infrastructure for long running dependency tracking we should > encapsulate that in a new framework and not try to mangle the existing code > for something it was never intended for. Concurring hard (already typed that up somewhere else). I'd go one step further and ask whether the kernel really has to handle dependencies for these long-running compute jobs. The entire design with userspace memory fences assumes that this is userspace's job. Also for drm_syncobj we've also pushed a lot of the dependency handling to userspace, with submit threads in mesa. So if there is any blocking to be done (running out of ring space), why can't we sort that out the same way? Meaning: 1. 
superfast direct-to-hw submit path (using doorbells or whatever)

2. submit ioctl which only succeeds if it doesn't have to do any userspace memory fence waits, otherwise you get EWOULDBLOCK

3. userspace sorts out the mess in a submit thread if it gets an EWOULDBLOCK, because fundamentally the kernel cannot guarantee a bottomless queue. If userspace wants bottomless, it needs to handle the allocating and delaying imo

You can even make 3 entirely as-needed, which means for the usual fast-path you'll never see the userspace thread created unless you do hit an EWOULDBLOCK.

If we insist that the kernel handles the long-running dependencies fully then all we end up doing is implementing step 3, but entirely in the kernel instead of userspace. And in the kernel every bug gets you halfway to a CVE, and I just don't think that makes much sense for something which is the fallback of the fallback - once you run out of ring space you're not going to have a great day no matter what.

I'd go as far as to say that if we want step 3 in the kernel, someone needs to supply a real-world benchmark (i.e. a real application running real workloads, not some microbenchmark) to prove it's actually worth the pain. Otherwise on-demand userspace submit thread.

-Daniel
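A minimal sketch of step 2 above, with everything except the errno being hypothetical driver code (not Xe):

  #include <linux/errno.h>
  #include <linux/types.h>

  struct my_submit;                                 /* opaque for this sketch */

  bool my_user_fences_signaled(struct my_submit *s);    /* peek, never block */
  int my_submit_to_hw(struct my_submit *s);             /* ring/doorbell write */

  static int my_submit_step2(struct my_submit *s)
  {
          /*
           * Never block in the kernel on userspace memory fences: if the
           * dependencies aren't ready yet, punt back to the userspace
           * submit thread (step 3).
           */
          if (!my_user_fences_signaled(s))
                  return -EWOULDBLOCK;

          return my_submit_to_hw(s);
  }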
Am 05.04.23 um 10:34 schrieb Daniel Vetter: > On Wed, Apr 05, 2023 at 09:41:23AM +0200, Christian König wrote: >> Am 04.04.23 um 15:37 schrieb Matthew Brost: >>> On Tue, Apr 04, 2023 at 11:13:28AM +0200, Christian König wrote: >>>> Hi, >>>> >>>> Am 04.04.23 um 02:22 schrieb Matthew Brost: >>>>> Hello, >>>>> >>>>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we >>>>> have been asked to merge our common DRM scheduler patches first as well >>>>> as develop a common solution for long running workloads with the DRM >>>>> scheduler. This RFC series is our first attempt at doing this. We >>>>> welcome any and all feedback. >>>>> >>>>> This can we thought of as 4 parts detailed below. >>>>> >>>>> - DRM scheduler changes for 1 to 1 relationship between scheduler and >>>>> entity (patches 1-3) >>>>> >>>>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the >>>>> GuC) which is a new paradigm WRT to the DRM scheduler and presents >>>>> severals problems as the DRM was originally designed to schedule jobs on >>>>> hardware queues. The main problem being that DRM scheduler expects the >>>>> submission order of jobs to be the completion order of jobs even across >>>>> multiple entities. This assumption falls apart with a firmware scheduler >>>>> as a firmware scheduler has no concept of jobs and jobs can complete out >>>>> of order. A novel solution for was originally thought of by Faith during >>>>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler >>>>> and entity. I believe the AGX driver [3] is using this approach and >>>>> Boris may use approach as well for the Mali driver [4]. >>>>> >>>>> To support a 1 to 1 relationship we move the main execution function >>>>> from a kthread to a work queue and add a new scheduling mode which >>>>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship. >>>>> The new scheduling mode should unify all drivers usage with a 1 to 1 >>>>> relationship and can be thought of as using scheduler as a dependency / >>>>> infligt job tracker rather than a true scheduler. >>>>> >>>>> - Generic messaging interface for DRM scheduler >>>>> >>>>> Idea is to be able to communicate to the submission backend with in band >>>>> (relative to main execution function) messages. Messages are backend >>>>> defined and flexable enough for any use case. In Xe we use these >>>>> messages to clean up entites, set properties for entites, and suspend / >>>>> resume execution of an entity [5]. I suspect other driver can leverage >>>>> this messaging concept too as it a convenient way to avoid races in the >>>>> backend. >>>> Oh, please absolutely *don't* do this. >>>> >>>> This is basically the design which makes a bunch of stuff so horrible broken >>>> on Windows. >>>> >>>> I can explain it in more detail if necessary, but I strongly recommend to >>>> not go down this path. >>>> >>> I'm afraid we are going to have to discuss this further. Let me explain >>> my reasoning, basically the idea is to have a single main entry point to >>> backend - the work queue. This avoids the need for lock between run_job >>> and any message that changes an entites state, also it really helps >>> during the reset flows (either TDR or GT reset) as we can call >>> drm_sched_run_wq_stop can ensure that nothing else is in the backend >>> changing an entity state. It all works out really nicely actually, our >>> GuC backend is incredibly stable (hasn't really had a bug pop in about a >>> year) and way simpler than what we did in the i915. 
I think the simplity >>> to largely due to this design of limiting the entry points. >>> >>> I personally don't see how this a poor design, limiting entry points >>> absolutely makes sense to me, if it didn't why not just call cleanup_job >>> bypassing the main execution thread (now worker), this is the exact same >>> concept. >> Well then I strongly suggest to read a few analyses on the failure of the >> message processing loop on Windows. >> >> Have you ever wondered why classic Win32 applications sometimes seems to be >> stuck and don't do anything? This design pattern combine with timeouts to >> solve deadlocks is the reason for that. >> >> The major problem with this approach is that analyzing tools like lockdep >> have a hard time grasping the dependencies. > wq is fully annotated and actually splats. Plain kthread doesn't, without > adding something like the dma_fence_signalling stuff. > > But yeah if you block badly in the work items and stall the entire queue, > then things go sideways real bad. There's not really any tools we have in > the kernel to enforce this, since we still want to allow mutex and > sleeping and stuff like that. > >> What you can do is to offload all your operations which are supposed to be >> run in the same thread as work items into a work queue. This is something >> lockdep understands and is able to scream out lout if someone messes up the >> deadlock dependencies. > I thought that's the plan here? At least from my impression that didn't looked like what was implemented here. > Or at least what I thought the plan was, > and why I really think we need a per engine worqqueue to make it work well > (and also why I suggested the refactoring to split up drm_scheduler into > the driver api struct, which stays per-engine, and the internal backend > which would be per drm_sched_entity for fw schedulers that round-robin gpu > ctx on their own). > > Also maybe we need to allow drivers to pass in the workqueue like we allow > for the tdr handling already, since that simplifies the locking. > > At least for intel gpu I think this message passing design makes some > sense because fundamentally the fw only has a single blocking message > queue. And so intel/xe fundamentally needs to deal with the "stupid code > might block forward progress for everyone" problem you're describing, not > matter whether it's done with the help of drm/sched infra or not. > > I do agree though that we shouldn't encourage drivers to use this which > don't have that kind of fw command queue design. So maybe a huge comment > to explain when (and _only_ when) it's ok to use that message passing > would be needed, and explaining why in other cases it would make things a > lot worse? I would approach it from the complete other side. This component here is a tool to decide what job should run next. How that is then signaled and run should not be part of the scheduler, but another more higher level component. This way you also don't have a problem with not using DMA-fences as dependencies as well as constrains for running more jobs. Christian. > -Daniel > >> Regards, >> Christian. >> >>> FWIW Asahi liked the idea as well and think it could be useful for AGX. >>> Matt >>> >>>> Regards, >>>> Christian. >>>> >>>>> - Support for using TDR for all error paths of a scheduler / entity >>>>> >>>>> Fix a few races / bugs, add function to dynamically set the TDR timeout. >>>>> >>>>> - Annotate dma-fences for long running workloads. 
>>>>> >>>>> The idea here is to use dma-fences only as sync points within the >>>>> scheduler and never export them for long running workloads. By >>>>> annotating these fences as long running we ensure that these dma-fences >>>>> are never used in a way that breaks the dma-fence rules. A benefit of >>>>> thus approach is the scheduler can still safely flow control the >>>>> execution ring buffer via the job limit without breaking the dma-fence >>>>> rules. >>>>> >>>>> Again this a first draft and looking forward to feedback. >>>>> >>>>> Enjoy - Matt >>>>> >>>>> [1] https://gitlab.freedesktop.org/drm/xe/kernel >>>>> [2] https://patchwork.freedesktop.org/series/112188/ >>>>> [3] https://patchwork.freedesktop.org/series/114772/ >>>>> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1 >>>>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031 >>>>> >>>>> Matthew Brost (8): >>>>> drm/sched: Convert drm scheduler to use a work queue rather than >>>>> kthread >>>>> drm/sched: Move schedule policy to scheduler / entity >>>>> drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy >>>>> drm/sched: Add generic scheduler message interface >>>>> drm/sched: Start run wq before TDR in drm_sched_start >>>>> drm/sched: Submit job before starting TDR >>>>> drm/sched: Add helper to set TDR timeout >>>>> drm/syncobj: Warn on long running dma-fences >>>>> >>>>> Thomas Hellström (2): >>>>> dma-buf/dma-fence: Introduce long-running completion fences >>>>> drm/sched: Support long-running sched entities >>>>> >>>>> drivers/dma-buf/dma-fence.c | 142 +++++++--- >>>>> drivers/dma-buf/dma-resv.c | 5 + >>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 14 +- >>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +- >>>>> drivers/gpu/drm/drm_syncobj.c | 5 +- >>>>> drivers/gpu/drm/etnaviv/etnaviv_sched.c | 5 +- >>>>> drivers/gpu/drm/lima/lima_sched.c | 5 +- >>>>> drivers/gpu/drm/msm/adreno/adreno_device.c | 6 +- >>>>> drivers/gpu/drm/msm/msm_ringbuffer.c | 5 +- >>>>> drivers/gpu/drm/panfrost/panfrost_job.c | 5 +- >>>>> drivers/gpu/drm/scheduler/sched_entity.c | 127 +++++++-- >>>>> drivers/gpu/drm/scheduler/sched_fence.c | 6 +- >>>>> drivers/gpu/drm/scheduler/sched_main.c | 278 +++++++++++++++----- >>>>> drivers/gpu/drm/v3d/v3d_sched.c | 25 +- >>>>> include/drm/gpu_scheduler.h | 130 +++++++-- >>>>> include/linux/dma-fence.h | 60 ++++- >>>>> 16 files changed, 649 insertions(+), 184 deletions(-) >>>>>
On Wed, Apr 05, 2023 at 10:53:26AM +0200, Christian König wrote: > Am 05.04.23 um 10:34 schrieb Daniel Vetter: > > On Wed, Apr 05, 2023 at 09:41:23AM +0200, Christian König wrote: > > > Am 04.04.23 um 15:37 schrieb Matthew Brost: > > > > On Tue, Apr 04, 2023 at 11:13:28AM +0200, Christian König wrote: > > > > > Hi, > > > > > > > > > > Am 04.04.23 um 02:22 schrieb Matthew Brost: > > > > > > Hello, > > > > > > > > > > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we > > > > > > have been asked to merge our common DRM scheduler patches first as well > > > > > > as develop a common solution for long running workloads with the DRM > > > > > > scheduler. This RFC series is our first attempt at doing this. We > > > > > > welcome any and all feedback. > > > > > > > > > > > > This can we thought of as 4 parts detailed below. > > > > > > > > > > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and > > > > > > entity (patches 1-3) > > > > > > > > > > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the > > > > > > GuC) which is a new paradigm WRT to the DRM scheduler and presents > > > > > > severals problems as the DRM was originally designed to schedule jobs on > > > > > > hardware queues. The main problem being that DRM scheduler expects the > > > > > > submission order of jobs to be the completion order of jobs even across > > > > > > multiple entities. This assumption falls apart with a firmware scheduler > > > > > > as a firmware scheduler has no concept of jobs and jobs can complete out > > > > > > of order. A novel solution for was originally thought of by Faith during > > > > > > the initial prototype of Xe, create a 1 to 1 relationship between scheduler > > > > > > and entity. I believe the AGX driver [3] is using this approach and > > > > > > Boris may use approach as well for the Mali driver [4]. > > > > > > > > > > > > To support a 1 to 1 relationship we move the main execution function > > > > > > from a kthread to a work queue and add a new scheduling mode which > > > > > > bypasses code in the DRM which isn't needed in a 1 to 1 relationship. > > > > > > The new scheduling mode should unify all drivers usage with a 1 to 1 > > > > > > relationship and can be thought of as using scheduler as a dependency / > > > > > > infligt job tracker rather than a true scheduler. > > > > > > > > > > > > - Generic messaging interface for DRM scheduler > > > > > > > > > > > > Idea is to be able to communicate to the submission backend with in band > > > > > > (relative to main execution function) messages. Messages are backend > > > > > > defined and flexable enough for any use case. In Xe we use these > > > > > > messages to clean up entites, set properties for entites, and suspend / > > > > > > resume execution of an entity [5]. I suspect other driver can leverage > > > > > > this messaging concept too as it a convenient way to avoid races in the > > > > > > backend. > > > > > Oh, please absolutely *don't* do this. > > > > > > > > > > This is basically the design which makes a bunch of stuff so horrible broken > > > > > on Windows. > > > > > > > > > > I can explain it in more detail if necessary, but I strongly recommend to > > > > > not go down this path. > > > > > > > > > I'm afraid we are going to have to discuss this further. Let me explain > > > > my reasoning, basically the idea is to have a single main entry point to > > > > backend - the work queue. 
This avoids the need for lock between run_job > > > > and any message that changes an entites state, also it really helps > > > > during the reset flows (either TDR or GT reset) as we can call > > > > drm_sched_run_wq_stop can ensure that nothing else is in the backend > > > > changing an entity state. It all works out really nicely actually, our > > > > GuC backend is incredibly stable (hasn't really had a bug pop in about a > > > > year) and way simpler than what we did in the i915. I think the simplity > > > > to largely due to this design of limiting the entry points. > > > > > > > > I personally don't see how this a poor design, limiting entry points > > > > absolutely makes sense to me, if it didn't why not just call cleanup_job > > > > bypassing the main execution thread (now worker), this is the exact same > > > > concept. > > > Well then I strongly suggest to read a few analyses on the failure of the > > > message processing loop on Windows. > > > > > > Have you ever wondered why classic Win32 applications sometimes seems to be > > > stuck and don't do anything? This design pattern combine with timeouts to > > > solve deadlocks is the reason for that. > > > > > > The major problem with this approach is that analyzing tools like lockdep > > > have a hard time grasping the dependencies. > > wq is fully annotated and actually splats. Plain kthread doesn't, without > > adding something like the dma_fence_signalling stuff. > > > > But yeah if you block badly in the work items and stall the entire queue, > > then things go sideways real bad. There's not really any tools we have in > > the kernel to enforce this, since we still want to allow mutex and > > sleeping and stuff like that. > > > > > What you can do is to offload all your operations which are supposed to be > > > run in the same thread as work items into a work queue. This is something > > > lockdep understands and is able to scream out lout if someone messes up the > > > deadlock dependencies. > > I thought that's the plan here? > > At least from my impression that didn't looked like what was implemented > here. Yup the current patches aren't what we want I think, at least not in these details. > > > Or at least what I thought the plan was, > > and why I really think we need a per engine worqqueue to make it work well > > (and also why I suggested the refactoring to split up drm_scheduler into > > the driver api struct, which stays per-engine, and the internal backend > > which would be per drm_sched_entity for fw schedulers that round-robin gpu > > ctx on their own). > > > > Also maybe we need to allow drivers to pass in the workqueue like we allow > > for the tdr handling already, since that simplifies the locking. > > > > At least for intel gpu I think this message passing design makes some > > sense because fundamentally the fw only has a single blocking message > > queue. And so intel/xe fundamentally needs to deal with the "stupid code > > might block forward progress for everyone" problem you're describing, not > > matter whether it's done with the help of drm/sched infra or not. > > > > I do agree though that we shouldn't encourage drivers to use this which > > don't have that kind of fw command queue design. So maybe a huge comment > > to explain when (and _only_ when) it's ok to use that message passing > > would be needed, and explaining why in other cases it would make things a > > lot worse? > > I would approach it from the complete other side. This component here is a > tool to decide what job should run next. 
> > How that is then signaled and run should not be part of the scheduler, but > another more higher level component. > > This way you also don't have a problem with not using DMA-fences as > dependencies as well as constrains for running more jobs. I think we're talking about two things here and mixing them up. For the dependencies I agree with you, and imo that higher level tool should probably just be an on-demand submit thread in userspace for the rare case where the kernel would need to sort out a dependency otherwise (due to running out of ringspace in the per-ctx ringbuffer). The other thing is the message passing stuff, and this is what I was talking about above. This has nothing to do with handling dependencies, but with talking to the gpu fw. Here the intel design issue is that the fw only provides a single queue, and it's in-order. Which means it fundamentally has the stalling issue you describe as a point against a message passing design. And fundamentally we need to be able to talk to the fw in the scheduler ->run_job callback. The proposal here for the message passing part is that since it has the stalling issue already anyway, and the scheduler needs to be involved anyway, it makes sense to integrated this (as an optional thing, only for drivers which have this kind of fw interface) into the scheduler. Otherwise you just end up with two layers for no reason and more ping-pong delay because the ->run_job needs to kick off the subordinate driver layer first. Note that for this case the optional message passing support in the drm/scheduler actually makes things better, because it allows you to cut out one layer. Of course if a driver with better fw interface uses this message passing support, then that's bad. Hence the big warning in the kerneldoc. -Daniel > > Christian. > > > -Daniel > > > > > Regards, > > > Christian. > > > > > > > FWIW Asahi liked the idea as well and think it could be useful for AGX. > > > > Matt > > > > > > > > > Regards, > > > > > Christian. > > > > > > > > > > > - Support for using TDR for all error paths of a scheduler / entity > > > > > > > > > > > > Fix a few races / bugs, add function to dynamically set the TDR timeout. > > > > > > > > > > > > - Annotate dma-fences for long running workloads. > > > > > > > > > > > > The idea here is to use dma-fences only as sync points within the > > > > > > scheduler and never export them for long running workloads. By > > > > > > annotating these fences as long running we ensure that these dma-fences > > > > > > are never used in a way that breaks the dma-fence rules. A benefit of > > > > > > thus approach is the scheduler can still safely flow control the > > > > > > execution ring buffer via the job limit without breaking the dma-fence > > > > > > rules. > > > > > > > > > > > > Again this a first draft and looking forward to feedback. 
> > > > > > > > > > > > Enjoy - Matt > > > > > > > > > > > > [1] https://gitlab.freedesktop.org/drm/xe/kernel > > > > > > [2] https://patchwork.freedesktop.org/series/112188/ > > > > > > [3] https://patchwork.freedesktop.org/series/114772/ > > > > > > [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1 > > > > > > [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031 > > > > > > > > > > > > Matthew Brost (8): > > > > > > drm/sched: Convert drm scheduler to use a work queue rather than > > > > > > kthread > > > > > > drm/sched: Move schedule policy to scheduler / entity > > > > > > drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy > > > > > > drm/sched: Add generic scheduler message interface > > > > > > drm/sched: Start run wq before TDR in drm_sched_start > > > > > > drm/sched: Submit job before starting TDR > > > > > > drm/sched: Add helper to set TDR timeout > > > > > > drm/syncobj: Warn on long running dma-fences > > > > > > > > > > > > Thomas Hellström (2): > > > > > > dma-buf/dma-fence: Introduce long-running completion fences > > > > > > drm/sched: Support long-running sched entities > > > > > > > > > > > > drivers/dma-buf/dma-fence.c | 142 +++++++--- > > > > > > drivers/dma-buf/dma-resv.c | 5 + > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 14 +- > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +- > > > > > > drivers/gpu/drm/drm_syncobj.c | 5 +- > > > > > > drivers/gpu/drm/etnaviv/etnaviv_sched.c | 5 +- > > > > > > drivers/gpu/drm/lima/lima_sched.c | 5 +- > > > > > > drivers/gpu/drm/msm/adreno/adreno_device.c | 6 +- > > > > > > drivers/gpu/drm/msm/msm_ringbuffer.c | 5 +- > > > > > > drivers/gpu/drm/panfrost/panfrost_job.c | 5 +- > > > > > > drivers/gpu/drm/scheduler/sched_entity.c | 127 +++++++-- > > > > > > drivers/gpu/drm/scheduler/sched_fence.c | 6 +- > > > > > > drivers/gpu/drm/scheduler/sched_main.c | 278 +++++++++++++++----- > > > > > > drivers/gpu/drm/v3d/v3d_sched.c | 25 +- > > > > > > include/drm/gpu_scheduler.h | 130 +++++++-- > > > > > > include/linux/dma-fence.h | 60 ++++- > > > > > > 16 files changed, 649 insertions(+), 184 deletions(-) > > > > > > >
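To make the message passing idea above a bit more concrete, a rough sketch of the shape such an optional interface could take is below. All names are invented for illustration and are not the interface from the "drm/sched: Add generic scheduler message interface" patch; the only point is that a message is a small, backend-defined payload processed from the same submission context as run_job(), which is what keeps it serialized with job submission.

#include <linux/list.h>
#include <drm/gpu_scheduler.h>

/* Illustrative sketch only; names and layout are guesses. */
struct drm_sched_msg {
        struct list_head link;          /* on the scheduler's message list */
        unsigned int opcode;            /* backend-defined operation */
        void *private_data;             /* backend-defined payload */
};

/* The operations Xe is said to use messages for in the cover letter. */
enum my_msg_opcode {
        MY_MSG_CLEANUP_ENTITY,
        MY_MSG_SET_PROPERTY,
        MY_MSG_SUSPEND_ENTITY,
        MY_MSG_RESUME_ENTITY,
};

/*
 * Hypothetical backend callback, invoked from the scheduler's submission
 * worker and therefore mutually exclusive with run_job()/free_job().
 */
static void my_process_msg(struct drm_sched_msg *msg)
{
        switch (msg->opcode) {
        case MY_MSG_CLEANUP_ENTITY:     /* tear down the fw context */
        case MY_MSG_SET_PROPERTY:       /* e.g. priority, timeslice */
        case MY_MSG_SUSPEND_ENTITY:
        case MY_MSG_RESUME_ENTITY:
                /* all of these end up as writes to the fw's in-order queue */
                break;
        }
}

/* Hypothetical helper: hand a message to the submission worker. */
void drm_sched_add_msg(struct drm_gpu_scheduler *sched,
                       struct drm_sched_msg *msg);

Whether a surface this small lives in drm/sched or stays driver internal is exactly what is being argued about in this sub-thread.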
Am 05.04.23 um 11:07 schrieb Daniel Vetter: > [SNIP] >> I would approach it from the complete other side. This component here is a >> tool to decide what job should run next. >> >> How that is then signaled and run should not be part of the scheduler, but >> another more higher level component. >> >> This way you also don't have a problem with not using DMA-fences as >> dependencies as well as constrains for running more jobs. > I think we're talking about two things here and mixing them up. > > For the dependencies I agree with you, and imo that higher level tool > should probably just be an on-demand submit thread in userspace for the > rare case where the kernel would need to sort out a dependency otherwise > (due to running out of ringspace in the per-ctx ringbuffer). > > The other thing is the message passing stuff, and this is what I was > talking about above. This has nothing to do with handling dependencies, > but with talking to the gpu fw. Here the intel design issue is that the fw > only provides a single queue, and it's in-order. Which means it > fundamentally has the stalling issue you describe as a point against a > message passing design. And fundamentally we need to be able to talk to > the fw in the scheduler ->run_job callback. > > The proposal here for the message passing part is that since it has the > stalling issue already anyway, and the scheduler needs to be involved > anyway, it makes sense to integrated this (as an optional thing, only for > drivers which have this kind of fw interface) into the scheduler. > Otherwise you just end up with two layers for no reason and more ping-pong > delay because the ->run_job needs to kick off the subordinate driver layer > first. Note that for this case the optional message passing support in the > drm/scheduler actually makes things better, because it allows you to cut > out one layer. > > Of course if a driver with better fw interface uses this message passing > support, then that's bad. Hence the big warning in the kerneldoc.

Well, what I wanted to say is that if you design the dependency handling / scheduler properly, you don't need the message passing through it.

For example, if the GPU scheduler component uses a work item to do its handling instead of a kthread, you could also let the driver specify the work queue this work item is executed on.

When you design it like this, the driver specifies the thread context of execution for its jobs. In other words, it can specify a single-threaded firmware work queue as well.

When you then have other messages which need to be passed to the firmware, you can use the same single-threaded workqueue for those as well.

Drivers which have a different firmware interface would just use one of the system work queues instead.

This approach basically decouples the GPU scheduler component from the message passing functionality.

Regards,
Christian.

> -Daniel
>
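Read concretely, that suggestion could look roughly like the sketch below from a driver's point of view. The drm_sched_init() signature here is invented (assume a hypothetical variant that takes the workqueue to run the submission work item on); everything else is the stock workqueue API.

#include <linux/workqueue.h>
#include <drm/gpu_scheduler.h>

static const struct drm_sched_backend_ops my_sched_ops; /* run_job/free_job/... */

struct my_fw_backend {
        struct workqueue_struct *fw_wq; /* the single thread of execution towards the fw */
        struct drm_gpu_scheduler sched;
        struct work_struct msg_work;    /* non-job traffic to the fw */
};

static void my_process_fw_msgs(struct work_struct *w)
{
        /* send any pending non-job messages over the fw queue */
}

static int my_backend_init(struct my_fw_backend *be)
{
        /* One ordered workqueue: everything heading to the fw is serialized. */
        be->fw_wq = alloc_ordered_workqueue("my-fw-wq", 0);
        if (!be->fw_wq)
                return -ENOMEM;

        INIT_WORK(&be->msg_work, my_process_fw_msgs);

        /*
         * Hypothetical signature: the scheduler runs its submission work
         * item on the workqueue the driver hands in, so run_job() and fw
         * messages serialize without any extra locking in the backend.
         */
        return drm_sched_init(&be->sched, &my_sched_ops, be->fw_wq, "my-ring");
}

/* Anything else that must reach the fw goes through the same queue. */
static void my_send_fw_msg(struct my_fw_backend *be)
{
        queue_work(be->fw_wq, &be->msg_work);
}

A driver without a single-queue firmware interface would simply pass a system workqueue (or none at all), which is the decoupling being described.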
On Wed, 5 Apr 2023 at 11:57, Christian König <christian.koenig@amd.com> wrote: > > Am 05.04.23 um 11:07 schrieb Daniel Vetter: > > [SNIP] > >> I would approach it from the complete other side. This component here is a > >> tool to decide what job should run next. > >> > >> How that is then signaled and run should not be part of the scheduler, but > >> another more higher level component. > >> > >> This way you also don't have a problem with not using DMA-fences as > >> dependencies as well as constrains for running more jobs. > > I think we're talking about two things here and mixing them up. > > > > For the dependencies I agree with you, and imo that higher level tool > > should probably just be an on-demand submit thread in userspace for the > > rare case where the kernel would need to sort out a dependency otherwise > > (due to running out of ringspace in the per-ctx ringbuffer). > > > > The other thing is the message passing stuff, and this is what I was > > talking about above. This has nothing to do with handling dependencies, > > but with talking to the gpu fw. Here the intel design issue is that the fw > > only provides a single queue, and it's in-order. Which means it > > fundamentally has the stalling issue you describe as a point against a > > message passing design. And fundamentally we need to be able to talk to > > the fw in the scheduler ->run_job callback. > > > > The proposal here for the message passing part is that since it has the > > stalling issue already anyway, and the scheduler needs to be involved > > anyway, it makes sense to integrated this (as an optional thing, only for > > drivers which have this kind of fw interface) into the scheduler. > > Otherwise you just end up with two layers for no reason and more ping-pong > > delay because the ->run_job needs to kick off the subordinate driver layer > > first. Note that for this case the optional message passing support in the > > drm/scheduler actually makes things better, because it allows you to cut > > out one layer. > > > > Of course if a driver with better fw interface uses this message passing > > support, then that's bad. Hence the big warning in the kerneldoc. > > Well what I wanted to say is that if you design the dependency handling > / scheduler properly you don't need the message passing through it. > > For example if the GPU scheduler component uses a work item to do it's > handling instead of a kthread you could also let the driver specify the > work queue where this work item is executed on. > > When you design it like this the driver specifies the thread context of > execution for it's job. In other words it can specify a single threaded > firmware work queue as well. > > When you then have other messages which needs to be passed to the > firmware you can also use the same single threaded workqueue for this. > > Drivers which have a different firmware interface would just use one of > the system work queues instead. > > This approach basically decouples the GPU scheduler component from the > message passing functionality. Hm I guess we've been talking past each another big time, because that's really what I thought was under discussions? Essentially the current rfc, but implementing with some polish. iow I agree with you (I think at least). -Daniel
Hi, Using dma-fence for completion/dependency tracking for long-run workload(more precisely on-demand paging/page fault enabled workload) can cause deadlock. This seems the significant issue here. Other issues such as the drm scheduler completion order implication etc are minors which can be solve inside the framework of drm scheduler. We need to evaluate below paths: 1) still use drm scheduler for job submission, and use dma-fence for job completion waiting/dependency tracking. This is solution proposed in this series. Annotate dma-fence for long-run workload: user can still wait dma-fence for job completion but can't wait dma-fence while holding any memory management locks. We still use dma-fence for dependency tracking. But it is just very easily run into deadlock when on-demand paging is in the picture. The annotation helps us to detect deadlock but not solve deadlock problems. Seems *not* a complete solution: It is almost impossible to completely avoid dependency deadlock in complex runtime environment 2) Still use drm scheduler but not use dma-fence for completion signaling and dependency tracking. This way we still get some free functions (reset, err handling ring flow control as Matt said)from drm scheduler, but push the dependency/completion tracking completely to user space using techniques such as user space fence. User space doesn't have chance to wait fence while holding a kernel memory management lock, thus the dma-fence deadlock issue is solved. 3) Completely discard drm scheduler and dma-fence for long-run workload. Use user queue/doorbell for super fast submission, directly interact with fw scheduler. Use user fence for completion/dependency tracking. Thanks, Oak > -----Original Message----- > From: Christian König <christian.koenig@amd.com> > Sent: April 5, 2023 3:30 AM > To: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak > <oak.zeng@intel.com> > Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; > robdclark@chromium.org; thomas.hellstrom@linux.intel.com; airlied@linux.ie; > lina@asahilina.net; boris.brezillon@collabora.com; faith.ekstrand@collabora.com > Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload > plans > > Am 04.04.23 um 20:08 schrieb Matthew Brost: > > On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote: > >> Hi Matt, Thomas, > >> > >> Some very bold out of box thinking in this area: > >> > >> 1. so you want to use drm scheduler and dma-fence for long running workload. > Why you want to do this in the first place? What is the benefit? Drm scheduler is > pretty much a software scheduler. Modern gpu has scheduler built at fw/hw > level, as you said below for intel this is Guc. Can xe driver just directly submit job > to Guc, bypassing drm scheduler? > >> > > If we did that now we have 2 paths for dependency track, flow controling > > the ring, resets / error handling / backend submission implementations. > > We don't want this. > > Well exactly that's the point: Why? > > As far as I can see that are two completely distinct use cases, so you > absolutely do want two completely distinct implementations for this. > > >> 2. using dma-fence for long run workload: I am well aware that page fault (and > the consequent memory allocation/lock acquiring to fix the fault) can cause > deadlock for a dma-fence wait. But I am not convinced that dma-fence can't be > used purely because the nature of the workload that it runs very long (indefinite). 
> I did a math: the dma_fence_wait_timeout function's third param is the timeout > which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not long > enough, can we just change the timeout parameter to signed 64 bits so it is much > longer than our life time... > >> > >> So I mainly argue we can't use dma-fence for long-run workload is not > because the workload runs very long, rather because of the fact that we use > page fault for long-run workload. If we enable page fault for short-run workload, > we can't use dma-fence either. Page fault is the key thing here. > >> > >> Now since we use page fault which is *fundamentally* controversial with > dma-fence design, why now just introduce a independent concept such as user- > fence instead of extending existing dma-fence? > >> > >> I like unified design. If drm scheduler, dma-fence can be extended to work for > everything, it is beautiful. But seems we have some fundamental problem here. > >> > > Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use > > the signal / CB infrastructure) and enforce we don't use use these > > dma-fences from the scheduler in memory reclaim paths or export these to > > user space or other drivers. Think of this mode as SW only fence. > > Yeah and I truly think this is an really bad idea. > > The signal/CB infrastructure in the dma_fence turned out to be the > absolutely nightmare I initially predicted. Sorry to say that, but in > this case the "I've told you so" is appropriate in my opinion. > > If we need infrastructure for long running dependency tracking we should > encapsulate that in a new framework and not try to mangle the existing > code for something it was never intended for. > > Christian. > > > > > Matt > > > >> Thanks, > >> Oak > >> > >>> -----Original Message----- > >>> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of > >>> Matthew Brost > >>> Sent: April 3, 2023 8:22 PM > >>> To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org > >>> Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com; > airlied@linux.ie; > >>> lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew > >>> <matthew.brost@intel.com>; christian.koenig@amd.com; > >>> faith.ekstrand@collabora.com > >>> Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload > plans > >>> > >>> Hello, > >>> > >>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we > >>> have been asked to merge our common DRM scheduler patches first as well > >>> as develop a common solution for long running workloads with the DRM > >>> scheduler. This RFC series is our first attempt at doing this. We > >>> welcome any and all feedback. > >>> > >>> This can we thought of as 4 parts detailed below. > >>> > >>> - DRM scheduler changes for 1 to 1 relationship between scheduler and > >>> entity (patches 1-3) > >>> > >>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the > >>> GuC) which is a new paradigm WRT to the DRM scheduler and presents > >>> severals problems as the DRM was originally designed to schedule jobs on > >>> hardware queues. The main problem being that DRM scheduler expects the > >>> submission order of jobs to be the completion order of jobs even across > >>> multiple entities. This assumption falls apart with a firmware scheduler > >>> as a firmware scheduler has no concept of jobs and jobs can complete out > >>> of order. 
A novel solution for was originally thought of by Faith during > >>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler > >>> and entity. I believe the AGX driver [3] is using this approach and > >>> Boris may use approach as well for the Mali driver [4]. > >>> > >>> To support a 1 to 1 relationship we move the main execution function > >>> from a kthread to a work queue and add a new scheduling mode which > >>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship. > >>> The new scheduling mode should unify all drivers usage with a 1 to 1 > >>> relationship and can be thought of as using scheduler as a dependency / > >>> infligt job tracker rather than a true scheduler. > >>> > >>> - Generic messaging interface for DRM scheduler > >>> > >>> Idea is to be able to communicate to the submission backend with in band > >>> (relative to main execution function) messages. Messages are backend > >>> defined and flexable enough for any use case. In Xe we use these > >>> messages to clean up entites, set properties for entites, and suspend / > >>> resume execution of an entity [5]. I suspect other driver can leverage > >>> this messaging concept too as it a convenient way to avoid races in the > >>> backend. > >>> > >>> - Support for using TDR for all error paths of a scheduler / entity > >>> > >>> Fix a few races / bugs, add function to dynamically set the TDR timeout. > >>> > >>> - Annotate dma-fences for long running workloads. > >>> > >>> The idea here is to use dma-fences only as sync points within the > >>> scheduler and never export them for long running workloads. By > >>> annotating these fences as long running we ensure that these dma-fences > >>> are never used in a way that breaks the dma-fence rules. A benefit of > >>> thus approach is the scheduler can still safely flow control the > >>> execution ring buffer via the job limit without breaking the dma-fence > >>> rules. > >>> > >>> Again this a first draft and looking forward to feedback. 
> >>> > >>> Enjoy - Matt > >>> > >>> [1] https://gitlab.freedesktop.org/drm/xe/kernel > >>> [2] https://patchwork.freedesktop.org/series/112188/ > >>> [3] https://patchwork.freedesktop.org/series/114772/ > >>> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1 > >>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe- > >>> next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031 > >>> > >>> Matthew Brost (8): > >>> drm/sched: Convert drm scheduler to use a work queue rather than > >>> kthread > >>> drm/sched: Move schedule policy to scheduler / entity > >>> drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy > >>> drm/sched: Add generic scheduler message interface > >>> drm/sched: Start run wq before TDR in drm_sched_start > >>> drm/sched: Submit job before starting TDR > >>> drm/sched: Add helper to set TDR timeout > >>> drm/syncobj: Warn on long running dma-fences > >>> > >>> Thomas Hellström (2): > >>> dma-buf/dma-fence: Introduce long-running completion fences > >>> drm/sched: Support long-running sched entities > >>> > >>> drivers/dma-buf/dma-fence.c | 142 +++++++--- > >>> drivers/dma-buf/dma-resv.c | 5 + > >>> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 14 +- > >>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +- > >>> drivers/gpu/drm/drm_syncobj.c | 5 +- > >>> drivers/gpu/drm/etnaviv/etnaviv_sched.c | 5 +- > >>> drivers/gpu/drm/lima/lima_sched.c | 5 +- > >>> drivers/gpu/drm/msm/adreno/adreno_device.c | 6 +- > >>> drivers/gpu/drm/msm/msm_ringbuffer.c | 5 +- > >>> drivers/gpu/drm/panfrost/panfrost_job.c | 5 +- > >>> drivers/gpu/drm/scheduler/sched_entity.c | 127 +++++++-- > >>> drivers/gpu/drm/scheduler/sched_fence.c | 6 +- > >>> drivers/gpu/drm/scheduler/sched_main.c | 278 +++++++++++++++----- > >>> drivers/gpu/drm/v3d/v3d_sched.c | 25 +- > >>> include/drm/gpu_scheduler.h | 130 +++++++-- > >>> include/linux/dma-fence.h | 60 ++++- > >>> 16 files changed, 649 insertions(+), 184 deletions(-) > >>> > >>> -- > >>> 2.34.1
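For options 2) and 3) above, a "user space fence" is in essence just a seqno in memory that the firmware writes and that gets polled outside of any kernel memory-management locks. A conceptual sketch, with nothing taken from the series and all names invented:

#include <linux/types.h>
#include <linux/uaccess.h>

/* Signalled once the value at addr reaches value. */
struct my_user_fence {
        u64 __user *addr;       /* mapped page the fw/hw writes completions to */
        u64 value;
};

static bool my_user_fence_signalled(const struct my_user_fence *uf)
{
        u64 cur;

        /*
         * No dma-fence is involved, so none of the dma-fence signalling
         * rules apply; userspace (or the KMD, outside of reclaim paths)
         * simply polls or waits with its own timeout policy.
         */
        if (copy_from_user(&cur, uf->addr, sizeof(cur)))
                return false;

        return cur >= uf->value;
}

Because nothing ever waits on such a fence from under a memory-management lock, the on-demand paging deadlock concern described above does not apply to it.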
On Wed, Apr 05, 2023 at 12:06:53PM -0600, Zeng, Oak wrote: > Hi, > > Using dma-fence for completion/dependency tracking for long-run workload(more precisely on-demand paging/page fault enabled workload) can cause deadlock. This seems the significant issue here. Other issues such as the drm scheduler completion order implication etc are minors which can be solve inside the framework of drm scheduler. We need to evaluate below paths: > > 1) still use drm scheduler for job submission, and use dma-fence for job completion waiting/dependency tracking. This is solution proposed in this series. Annotate dma-fence for long-run workload: user can still wait dma-fence for job completion but can't wait dma-fence while holding any memory management locks. We still use dma-fence for dependency tracking. But it is just very easily run into deadlock when on-demand paging is in the picture. The annotation helps us to detect deadlock but not solve deadlock problems. Seems *not* a complete solution: It is almost impossible to completely avoid dependency deadlock in complex runtime environment > No one can wait on LR fence, so it is impossible to deadlock. The annotations enforce this. Literally this is only for flow controling the ring / hold pending jobs in in the DRM schedule list. > 2) Still use drm scheduler but not use dma-fence for completion signaling and dependency tracking. This way we still get some free functions (reset, err handling ring flow control as Matt said)from drm scheduler, but push the dependency/completion tracking completely to user space using techniques such as user space fence. User space doesn't have chance to wait fence while holding a kernel memory management lock, thus the dma-fence deadlock issue is solved. > We use user space fence for syncs. > 3) Completely discard drm scheduler and dma-fence for long-run workload. Use user queue/doorbell for super fast submission, directly interact with fw scheduler. Use user fence for completion/dependency tracking. > This is a hard no from me, I want 1 submission path in Xe. Either we use the DRM scheduler or we don't. Matt > Thanks, > Oak > > > -----Original Message----- > > From: Christian König <christian.koenig@amd.com> > > Sent: April 5, 2023 3:30 AM > > To: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak > > <oak.zeng@intel.com> > > Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; > > robdclark@chromium.org; thomas.hellstrom@linux.intel.com; airlied@linux.ie; > > lina@asahilina.net; boris.brezillon@collabora.com; faith.ekstrand@collabora.com > > Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload > > plans > > > > Am 04.04.23 um 20:08 schrieb Matthew Brost: > > > On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote: > > >> Hi Matt, Thomas, > > >> > > >> Some very bold out of box thinking in this area: > > >> > > >> 1. so you want to use drm scheduler and dma-fence for long running workload. > > Why you want to do this in the first place? What is the benefit? Drm scheduler is > > pretty much a software scheduler. Modern gpu has scheduler built at fw/hw > > level, as you said below for intel this is Guc. Can xe driver just directly submit job > > to Guc, bypassing drm scheduler? > > >> > > > If we did that now we have 2 paths for dependency track, flow controling > > > the ring, resets / error handling / backend submission implementations. > > > We don't want this. > > > > Well exactly that's the point: Why? 
> > > > As far as I can see that are two completely distinct use cases, so you > > absolutely do want two completely distinct implementations for this. > > > > >> 2. using dma-fence for long run workload: I am well aware that page fault (and > > the consequent memory allocation/lock acquiring to fix the fault) can cause > > deadlock for a dma-fence wait. But I am not convinced that dma-fence can't be > > used purely because the nature of the workload that it runs very long (indefinite). > > I did a math: the dma_fence_wait_timeout function's third param is the timeout > > which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not long > > enough, can we just change the timeout parameter to signed 64 bits so it is much > > longer than our life time... > > >> > > >> So I mainly argue we can't use dma-fence for long-run workload is not > > because the workload runs very long, rather because of the fact that we use > > page fault for long-run workload. If we enable page fault for short-run workload, > > we can't use dma-fence either. Page fault is the key thing here. > > >> > > >> Now since we use page fault which is *fundamentally* controversial with > > dma-fence design, why now just introduce a independent concept such as user- > > fence instead of extending existing dma-fence? > > >> > > >> I like unified design. If drm scheduler, dma-fence can be extended to work for > > everything, it is beautiful. But seems we have some fundamental problem here. > > >> > > > Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use > > > the signal / CB infrastructure) and enforce we don't use use these > > > dma-fences from the scheduler in memory reclaim paths or export these to > > > user space or other drivers. Think of this mode as SW only fence. > > > > Yeah and I truly think this is an really bad idea. > > > > The signal/CB infrastructure in the dma_fence turned out to be the > > absolutely nightmare I initially predicted. Sorry to say that, but in > > this case the "I've told you so" is appropriate in my opinion. > > > > If we need infrastructure for long running dependency tracking we should > > encapsulate that in a new framework and not try to mangle the existing > > code for something it was never intended for. > > > > Christian. > > > > > > > > Matt > > > > > >> Thanks, > > >> Oak > > >> > > >>> -----Original Message----- > > >>> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of > > >>> Matthew Brost > > >>> Sent: April 3, 2023 8:22 PM > > >>> To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org > > >>> Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com; > > airlied@linux.ie; > > >>> lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew > > >>> <matthew.brost@intel.com>; christian.koenig@amd.com; > > >>> faith.ekstrand@collabora.com > > >>> Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload > > plans > > >>> > > >>> Hello, > > >>> > > >>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we > > >>> have been asked to merge our common DRM scheduler patches first as well > > >>> as develop a common solution for long running workloads with the DRM > > >>> scheduler. This RFC series is our first attempt at doing this. We > > >>> welcome any and all feedback. > > >>> > > >>> This can we thought of as 4 parts detailed below. 
> > >>> > > >>> - DRM scheduler changes for 1 to 1 relationship between scheduler and > > >>> entity (patches 1-3) > > >>> > > >>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the > > >>> GuC) which is a new paradigm WRT to the DRM scheduler and presents > > >>> severals problems as the DRM was originally designed to schedule jobs on > > >>> hardware queues. The main problem being that DRM scheduler expects the > > >>> submission order of jobs to be the completion order of jobs even across > > >>> multiple entities. This assumption falls apart with a firmware scheduler > > >>> as a firmware scheduler has no concept of jobs and jobs can complete out > > >>> of order. A novel solution for was originally thought of by Faith during > > >>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler > > >>> and entity. I believe the AGX driver [3] is using this approach and > > >>> Boris may use approach as well for the Mali driver [4]. > > >>> > > >>> To support a 1 to 1 relationship we move the main execution function > > >>> from a kthread to a work queue and add a new scheduling mode which > > >>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship. > > >>> The new scheduling mode should unify all drivers usage with a 1 to 1 > > >>> relationship and can be thought of as using scheduler as a dependency / > > >>> infligt job tracker rather than a true scheduler. > > >>> > > >>> - Generic messaging interface for DRM scheduler > > >>> > > >>> Idea is to be able to communicate to the submission backend with in band > > >>> (relative to main execution function) messages. Messages are backend > > >>> defined and flexable enough for any use case. In Xe we use these > > >>> messages to clean up entites, set properties for entites, and suspend / > > >>> resume execution of an entity [5]. I suspect other driver can leverage > > >>> this messaging concept too as it a convenient way to avoid races in the > > >>> backend. > > >>> > > >>> - Support for using TDR for all error paths of a scheduler / entity > > >>> > > >>> Fix a few races / bugs, add function to dynamically set the TDR timeout. > > >>> > > >>> - Annotate dma-fences for long running workloads. > > >>> > > >>> The idea here is to use dma-fences only as sync points within the > > >>> scheduler and never export them for long running workloads. By > > >>> annotating these fences as long running we ensure that these dma-fences > > >>> are never used in a way that breaks the dma-fence rules. A benefit of > > >>> thus approach is the scheduler can still safely flow control the > > >>> execution ring buffer via the job limit without breaking the dma-fence > > >>> rules. > > >>> > > >>> Again this a first draft and looking forward to feedback. 
> > >>> > > >>> Enjoy - Matt > > >>> > > >>> [1] https://gitlab.freedesktop.org/drm/xe/kernel > > >>> [2] https://patchwork.freedesktop.org/series/112188/ > > >>> [3] https://patchwork.freedesktop.org/series/114772/ > > >>> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1 > > >>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe- > > >>> next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031 > > >>> > > >>> Matthew Brost (8): > > >>> drm/sched: Convert drm scheduler to use a work queue rather than > > >>> kthread > > >>> drm/sched: Move schedule policy to scheduler / entity > > >>> drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy > > >>> drm/sched: Add generic scheduler message interface > > >>> drm/sched: Start run wq before TDR in drm_sched_start > > >>> drm/sched: Submit job before starting TDR > > >>> drm/sched: Add helper to set TDR timeout > > >>> drm/syncobj: Warn on long running dma-fences > > >>> > > >>> Thomas Hellström (2): > > >>> dma-buf/dma-fence: Introduce long-running completion fences > > >>> drm/sched: Support long-running sched entities > > >>> > > >>> drivers/dma-buf/dma-fence.c | 142 +++++++--- > > >>> drivers/dma-buf/dma-resv.c | 5 + > > >>> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 14 +- > > >>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +- > > >>> drivers/gpu/drm/drm_syncobj.c | 5 +- > > >>> drivers/gpu/drm/etnaviv/etnaviv_sched.c | 5 +- > > >>> drivers/gpu/drm/lima/lima_sched.c | 5 +- > > >>> drivers/gpu/drm/msm/adreno/adreno_device.c | 6 +- > > >>> drivers/gpu/drm/msm/msm_ringbuffer.c | 5 +- > > >>> drivers/gpu/drm/panfrost/panfrost_job.c | 5 +- > > >>> drivers/gpu/drm/scheduler/sched_entity.c | 127 +++++++-- > > >>> drivers/gpu/drm/scheduler/sched_fence.c | 6 +- > > >>> drivers/gpu/drm/scheduler/sched_main.c | 278 +++++++++++++++----- > > >>> drivers/gpu/drm/v3d/v3d_sched.c | 25 +- > > >>> include/drm/gpu_scheduler.h | 130 +++++++-- > > >>> include/linux/dma-fence.h | 60 ++++- > > >>> 16 files changed, 649 insertions(+), 184 deletions(-) > > >>> > > >>> -- > > >>> 2.34.1 >
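To illustrate what "no one can wait on LR fence, the annotations enforce this" can boil down to mechanically, a sketch is below; the flag bit and helpers are invented here rather than taken from the "Introduce long-running completion fences" patch.

#include <linux/dma-fence.h>
#include <linux/bitops.h>

/* Invented bit purely for this sketch. */
#define MY_DMA_FENCE_FLAG_LONG_RUNNING_BIT      5

static inline bool my_fence_is_long_running(struct dma_fence *fence)
{
        return test_bit(MY_DMA_FENCE_FLAG_LONG_RUNNING_BIT, &fence->flags);
}

/*
 * Long-running fences stay KMD-internal sync points (ring flow control,
 * holding pending jobs in the scheduler list); they are never exported
 * and never legal to wait on.
 */
static long my_fence_wait(struct dma_fence *fence, bool intr, long timeout)
{
        if (WARN_ON_ONCE(my_fence_is_long_running(fence)))
                return -EINVAL;

        return dma_fence_wait_timeout(fence, intr, timeout);
}

The point of the annotation is exactly this kind of hard refusal: the fence can still be used for flow control inside the scheduler, but any attempt to treat it like a normal, waitable dma-fence is caught loudly.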
On Wed, Apr 05, 2023 at 12:12:27PM +0200, Daniel Vetter wrote: > On Wed, 5 Apr 2023 at 11:57, Christian König <christian.koenig@amd.com> wrote: > > > > Am 05.04.23 um 11:07 schrieb Daniel Vetter: > > > [SNIP] > > >> I would approach it from the complete other side. This component here is a > > >> tool to decide what job should run next. > > >> > > >> How that is then signaled and run should not be part of the scheduler, but > > >> another more higher level component. > > >> > > >> This way you also don't have a problem with not using DMA-fences as > > >> dependencies as well as constrains for running more jobs. > > > I think we're talking about two things here and mixing them up. > > > > > > For the dependencies I agree with you, and imo that higher level tool > > > should probably just be an on-demand submit thread in userspace for the > > > rare case where the kernel would need to sort out a dependency otherwise > > > (due to running out of ringspace in the per-ctx ringbuffer). > > > > > > The other thing is the message passing stuff, and this is what I was > > > talking about above. This has nothing to do with handling dependencies, > > > but with talking to the gpu fw. Here the intel design issue is that the fw > > > only provides a single queue, and it's in-order. Which means it > > > fundamentally has the stalling issue you describe as a point against a > > > message passing design. And fundamentally we need to be able to talk to > > > the fw in the scheduler ->run_job callback. > > > > > > The proposal here for the message passing part is that since it has the > > > stalling issue already anyway, and the scheduler needs to be involved > > > anyway, it makes sense to integrated this (as an optional thing, only for > > > drivers which have this kind of fw interface) into the scheduler. > > > Otherwise you just end up with two layers for no reason and more ping-pong > > > delay because the ->run_job needs to kick off the subordinate driver layer > > > first. Note that for this case the optional message passing support in the > > > drm/scheduler actually makes things better, because it allows you to cut > > > out one layer. > > > > > > Of course if a driver with better fw interface uses this message passing > > > support, then that's bad. Hence the big warning in the kerneldoc. > > > > Well what I wanted to say is that if you design the dependency handling > > / scheduler properly you don't need the message passing through it. > > > > For example if the GPU scheduler component uses a work item to do it's > > handling instead of a kthread you could also let the driver specify the > > work queue where this work item is executed on. > > > > When you design it like this the driver specifies the thread context of > > execution for it's job. In other words it can specify a single threaded > > firmware work queue as well. > > > > When you then have other messages which needs to be passed to the > > firmware you can also use the same single threaded workqueue for this. > > > > Drivers which have a different firmware interface would just use one of > > the system work queues instead. > > > > This approach basically decouples the GPU scheduler component from the > > message passing functionality. > > Hm I guess we've been talking past each another big time, because > that's really what I thought was under discussions? Essentially the > current rfc, but implementing with some polish. > I think Daniel pretty much nailed it here (thanks), to recap: 1. 
I want the messages in the same worker so run_job / free_job / process_msg execution is mutually exclusive, and also so that during reset paths, once the worker is stopped, none of the entry points can be entered.

If this is a NAK, then another worker is fine I guess. A lock between run_job / free_job + process_msg should solve the exclusion issue, and the reset paths can also stop this new worker too. That being said, I'd rather leave this as is but will not fight this point.

2. process_msg is just used to communicate with the firmware using the same queue as submission. Waiting for space in this queue is the only place this function can block (same as submission); well, actually we have the concept of a preempt time slice, but that sleeps for 10 ms by default. Also, preempt is only used in LR entities, so I don't think it is relevant in this case either.

3. Agree this is in the dma-fence signaling path (if process_msg is in the submission worker), so we can't block indefinitely or for an unreasonable period of time (i.e. we must obey the dma-fence rules).

4. Agree the documentation for the usage of the messaging interface needs to be clear.

5. Agree that my code could always use polishing.

Let's close on #1, then can I get a general Ack on this part of the RFC and apply the polish in the full review process?

Matt

> iow I agree with you (I think at least).
> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
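A sketch of the reset-flow argument in point 1, assuming the single-worker design; drm_sched_run_wq_stop() is the helper mentioned earlier in the thread, and its counterpart plus both signatures are guessed here.

/*
 * With run_job(), free_job() and process_msg() all funnelled through one
 * worker, quiescing the backend before a GT/engine reset is just a matter
 * of stopping that worker; afterwards nothing can race with the reset
 * code while it fixes up entity and firmware state.
 */
static void my_reset_prepare(struct drm_gpu_scheduler *sched)
{
        drm_sched_run_wq_stop(sched);

        /* no run_job(), free_job() or process_msg() in flight past this point */
}

static void my_reset_finish(struct drm_gpu_scheduler *sched)
{
        /* name guessed from the "Start run wq before TDR" patch title */
        drm_sched_run_wq_start(sched);
}

With the alternative in point 1 (a separate worker plus a shared lock), the same quiesce would need both taking the lock and stopping the extra worker, which is the added complexity being traded off.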
On Thu, Apr 06, 2023 at 02:08:10AM +0000, Matthew Brost wrote: > On Wed, Apr 05, 2023 at 12:12:27PM +0200, Daniel Vetter wrote: > > On Wed, 5 Apr 2023 at 11:57, Christian König <christian.koenig@amd.com> wrote: > > > > > > Am 05.04.23 um 11:07 schrieb Daniel Vetter: > > > > [SNIP] > > > >> I would approach it from the complete other side. This component here is a > > > >> tool to decide what job should run next. > > > >> > > > >> How that is then signaled and run should not be part of the scheduler, but > > > >> another more higher level component. > > > >> > > > >> This way you also don't have a problem with not using DMA-fences as > > > >> dependencies as well as constrains for running more jobs. > > > > I think we're talking about two things here and mixing them up. > > > > > > > > For the dependencies I agree with you, and imo that higher level tool > > > > should probably just be an on-demand submit thread in userspace for the > > > > rare case where the kernel would need to sort out a dependency otherwise > > > > (due to running out of ringspace in the per-ctx ringbuffer). > > > > > > > > The other thing is the message passing stuff, and this is what I was > > > > talking about above. This has nothing to do with handling dependencies, > > > > but with talking to the gpu fw. Here the intel design issue is that the fw > > > > only provides a single queue, and it's in-order. Which means it > > > > fundamentally has the stalling issue you describe as a point against a > > > > message passing design. And fundamentally we need to be able to talk to > > > > the fw in the scheduler ->run_job callback. > > > > > > > > The proposal here for the message passing part is that since it has the > > > > stalling issue already anyway, and the scheduler needs to be involved > > > > anyway, it makes sense to integrated this (as an optional thing, only for > > > > drivers which have this kind of fw interface) into the scheduler. > > > > Otherwise you just end up with two layers for no reason and more ping-pong > > > > delay because the ->run_job needs to kick off the subordinate driver layer > > > > first. Note that for this case the optional message passing support in the > > > > drm/scheduler actually makes things better, because it allows you to cut > > > > out one layer. > > > > > > > > Of course if a driver with better fw interface uses this message passing > > > > support, then that's bad. Hence the big warning in the kerneldoc. > > > > > > Well what I wanted to say is that if you design the dependency handling > > > / scheduler properly you don't need the message passing through it. > > > > > > For example if the GPU scheduler component uses a work item to do it's > > > handling instead of a kthread you could also let the driver specify the > > > work queue where this work item is executed on. > > > > > > When you design it like this the driver specifies the thread context of > > > execution for it's job. In other words it can specify a single threaded > > > firmware work queue as well. > > > > > > When you then have other messages which needs to be passed to the > > > firmware you can also use the same single threaded workqueue for this. > > > > > > Drivers which have a different firmware interface would just use one of > > > the system work queues instead. > > > > > > This approach basically decouples the GPU scheduler component from the > > > message passing functionality. 
> > > > Hm I guess we've been talking past each another big time, because > > that's really what I thought was under discussions? Essentially the > > current rfc, but implementing with some polish. > > > > I think Daniel pretty much nailed it here (thanks), to recap: > > 1. I want the messages in the same worker so run_job / free_job / > process_msg execution is mutual exclusive and also so during reset paths > if the worker is stopped all the entry points can't be entered. > > If this is a NAK, then another worker is fine I guess. A lock between > run_job / free_job + process_msg should solve the exclusion issue and the > reset paths can also stop this new worker too. That being said I'd > rather leave this as is but will not fight this point. > > 2. process_msg is just used to communicate with the firmware using the > same queue as submission. Waiting for space in this queue is the only > place this function can block (same as submission), well actually we > have the concept a preempt time slice but that sleeps for 10 ms by > default. Also preempt is only used in LR entities so I don't think it is > relavent in this case either. > > 3. Agree this is in the dma-fence signaling path (if process_msg is in > the submission worker) so we can't block indefinitely or an unreasonable > period of time (i.e. we must obey dma-fence rules). Just to hammer this in: Not just process_msg is in the dma_fence signaling path, but the entire fw queue where everything is being funneled through, including whatever the fw is doing to process these. Yes this is terrible and blew up a few times already :-/ But also, probably something that the docs really need to hammer in, to make sure people don't look at this and thinkg "hey this seems to be the recommended way to do this on linux". We don't want hw people to build more of these designs, they're an absolute pain to deal with with Linux' dma_fence signalling and gpu job scheduling rules. It's just that if you're stuck with such fw, then integrating the flow into drm/sched instead of having an extra layer of workers seems the better of two pretty bad solutions. -Daniel > 4. Agree the documentation for thw usage of the messaging interface > needs to be clear. > > 5. Agree that my code could alway use polishing. > > Lets close on #1 then can I get on general Ack on this part of the RFC > and apply the polish in the full review process? > > Matt > > > iow I agree with you (I think at least). > > -Daniel > > -- > > Daniel Vetter > > Software Engineer, Intel Corporation > > http://blog.ffwll.ch
Am 05.04.23 um 20:53 schrieb Matthew Brost: > On Wed, Apr 05, 2023 at 12:06:53PM -0600, Zeng, Oak wrote: >> Hi, >> >> Using dma-fence for completion/dependency tracking for long-run workload(more precisely on-demand paging/page fault enabled workload) can cause deadlock. This seems the significant issue here. Other issues such as the drm scheduler completion order implication etc are minors which can be solve inside the framework of drm scheduler. We need to evaluate below paths: >> >> 1) still use drm scheduler for job submission, and use dma-fence for job completion waiting/dependency tracking. This is solution proposed in this series. Annotate dma-fence for long-run workload: user can still wait dma-fence for job completion but can't wait dma-fence while holding any memory management locks. We still use dma-fence for dependency tracking. But it is just very easily run into deadlock when on-demand paging is in the picture. The annotation helps us to detect deadlock but not solve deadlock problems. Seems *not* a complete solution: It is almost impossible to completely avoid dependency deadlock in complex runtime environment >> > No one can wait on LR fence, so it is impossible to deadlock. The > annotations enforce this. Literally this is only for flow controling the > ring / hold pending jobs in in the DRM schedule list. You can still have someone depend on the LR fence and cause a deadlock without waiting on it. See my attempted solution to this problem. It tries to inherit the LR flag of a fence when something depended on it. For example if you create a fence container and one of the fences inside the container is a LR fence then the container itself would be an LR fence as well. >> 2) Still use drm scheduler but not use dma-fence for completion signaling and dependency tracking. This way we still get some free functions (reset, err handling ring flow control as Matt said)from drm scheduler, but push the dependency/completion tracking completely to user space using techniques such as user space fence. User space doesn't have chance to wait fence while holding a kernel memory management lock, thus the dma-fence deadlock issue is solved. >> > We use user space fence for syncs. > >> 3) Completely discard drm scheduler and dma-fence for long-run workload. Use user queue/doorbell for super fast submission, directly interact with fw scheduler. Use user fence for completion/dependency tracking. >> > This is a hard no from me, I want 1 submission path in Xe. Either we use > the DRM scheduler or we don't. Well I don't think that this will be acceptable. Especially if you not only have long running submission, but things like page faults/HMM in those jobs. Regards, Christian. > > Matt > >> Thanks, >> Oak >> >>> -----Original Message----- >>> From: Christian König <christian.koenig@amd.com> >>> Sent: April 5, 2023 3:30 AM >>> To: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak >>> <oak.zeng@intel.com> >>> Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; >>> robdclark@chromium.org; thomas.hellstrom@linux.intel.com; airlied@linux.ie; >>> lina@asahilina.net; boris.brezillon@collabora.com; faith.ekstrand@collabora.com >>> Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload >>> plans >>> >>> Am 04.04.23 um 20:08 schrieb Matthew Brost: >>>> On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote: >>>>> Hi Matt, Thomas, >>>>> >>>>> Some very bold out of box thinking in this area: >>>>> >>>>> 1. 
so you want to use drm scheduler and dma-fence for long running workload. >>> Why you want to do this in the first place? What is the benefit? Drm scheduler is >>> pretty much a software scheduler. Modern gpu has scheduler built at fw/hw >>> level, as you said below for intel this is Guc. Can xe driver just directly submit job >>> to Guc, bypassing drm scheduler? >>>> If we did that now we have 2 paths for dependency track, flow controling >>>> the ring, resets / error handling / backend submission implementations. >>>> We don't want this. >>> Well exactly that's the point: Why? >>> >>> As far as I can see that are two completely distinct use cases, so you >>> absolutely do want two completely distinct implementations for this. >>> >>>>> 2. using dma-fence for long run workload: I am well aware that page fault (and >>> the consequent memory allocation/lock acquiring to fix the fault) can cause >>> deadlock for a dma-fence wait. But I am not convinced that dma-fence can't be >>> used purely because the nature of the workload that it runs very long (indefinite). >>> I did a math: the dma_fence_wait_timeout function's third param is the timeout >>> which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not long >>> enough, can we just change the timeout parameter to signed 64 bits so it is much >>> longer than our life time... >>>>> So I mainly argue we can't use dma-fence for long-run workload is not >>> because the workload runs very long, rather because of the fact that we use >>> page fault for long-run workload. If we enable page fault for short-run workload, >>> we can't use dma-fence either. Page fault is the key thing here. >>>>> Now since we use page fault which is *fundamentally* controversial with >>> dma-fence design, why now just introduce a independent concept such as user- >>> fence instead of extending existing dma-fence? >>>>> I like unified design. If drm scheduler, dma-fence can be extended to work for >>> everything, it is beautiful. But seems we have some fundamental problem here. >>>> Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use >>>> the signal / CB infrastructure) and enforce we don't use use these >>>> dma-fences from the scheduler in memory reclaim paths or export these to >>>> user space or other drivers. Think of this mode as SW only fence. >>> Yeah and I truly think this is an really bad idea. >>> >>> The signal/CB infrastructure in the dma_fence turned out to be the >>> absolutely nightmare I initially predicted. Sorry to say that, but in >>> this case the "I've told you so" is appropriate in my opinion. >>> >>> If we need infrastructure for long running dependency tracking we should >>> encapsulate that in a new framework and not try to mangle the existing >>> code for something it was never intended for. >>> >>> Christian. 
>>> >>>> Matt >>>> >>>>> Thanks, >>>>> Oak >>>>> >>>>>> -----Original Message----- >>>>>> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of >>>>>> Matthew Brost >>>>>> Sent: April 3, 2023 8:22 PM >>>>>> To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org >>>>>> Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com; >>> airlied@linux.ie; >>>>>> lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew >>>>>> <matthew.brost@intel.com>; christian.koenig@amd.com; >>>>>> faith.ekstrand@collabora.com >>>>>> Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload >>> plans >>>>>> Hello, >>>>>> >>>>>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we >>>>>> have been asked to merge our common DRM scheduler patches first as well >>>>>> as develop a common solution for long running workloads with the DRM >>>>>> scheduler. This RFC series is our first attempt at doing this. We >>>>>> welcome any and all feedback. >>>>>> >>>>>> This can we thought of as 4 parts detailed below. >>>>>> >>>>>> - DRM scheduler changes for 1 to 1 relationship between scheduler and >>>>>> entity (patches 1-3) >>>>>> >>>>>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the >>>>>> GuC) which is a new paradigm WRT to the DRM scheduler and presents >>>>>> severals problems as the DRM was originally designed to schedule jobs on >>>>>> hardware queues. The main problem being that DRM scheduler expects the >>>>>> submission order of jobs to be the completion order of jobs even across >>>>>> multiple entities. This assumption falls apart with a firmware scheduler >>>>>> as a firmware scheduler has no concept of jobs and jobs can complete out >>>>>> of order. A novel solution for was originally thought of by Faith during >>>>>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler >>>>>> and entity. I believe the AGX driver [3] is using this approach and >>>>>> Boris may use approach as well for the Mali driver [4]. >>>>>> >>>>>> To support a 1 to 1 relationship we move the main execution function >>>>>> from a kthread to a work queue and add a new scheduling mode which >>>>>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship. >>>>>> The new scheduling mode should unify all drivers usage with a 1 to 1 >>>>>> relationship and can be thought of as using scheduler as a dependency / >>>>>> infligt job tracker rather than a true scheduler. >>>>>> >>>>>> - Generic messaging interface for DRM scheduler >>>>>> >>>>>> Idea is to be able to communicate to the submission backend with in band >>>>>> (relative to main execution function) messages. Messages are backend >>>>>> defined and flexable enough for any use case. In Xe we use these >>>>>> messages to clean up entites, set properties for entites, and suspend / >>>>>> resume execution of an entity [5]. I suspect other driver can leverage >>>>>> this messaging concept too as it a convenient way to avoid races in the >>>>>> backend. >>>>>> >>>>>> - Support for using TDR for all error paths of a scheduler / entity >>>>>> >>>>>> Fix a few races / bugs, add function to dynamically set the TDR timeout. >>>>>> >>>>>> - Annotate dma-fences for long running workloads. >>>>>> >>>>>> The idea here is to use dma-fences only as sync points within the >>>>>> scheduler and never export them for long running workloads. 
By >>>>>> annotating these fences as long running we ensure that these dma-fences >>>>>> are never used in a way that breaks the dma-fence rules. A benefit of >>>>>> thus approach is the scheduler can still safely flow control the >>>>>> execution ring buffer via the job limit without breaking the dma-fence >>>>>> rules. >>>>>> >>>>>> Again this a first draft and looking forward to feedback. >>>>>> >>>>>> Enjoy - Matt >>>>>> >>>>>> [1] https://gitlab.freedesktop.org/drm/xe/kernel >>>>>> [2] https://patchwork.freedesktop.org/series/112188/ >>>>>> [3] https://patchwork.freedesktop.org/series/114772/ >>>>>> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1 >>>>>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe- >>>>>> next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031 >>>>>> >>>>>> Matthew Brost (8): >>>>>> drm/sched: Convert drm scheduler to use a work queue rather than >>>>>> kthread >>>>>> drm/sched: Move schedule policy to scheduler / entity >>>>>> drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy >>>>>> drm/sched: Add generic scheduler message interface >>>>>> drm/sched: Start run wq before TDR in drm_sched_start >>>>>> drm/sched: Submit job before starting TDR >>>>>> drm/sched: Add helper to set TDR timeout >>>>>> drm/syncobj: Warn on long running dma-fences >>>>>> >>>>>> Thomas Hellström (2): >>>>>> dma-buf/dma-fence: Introduce long-running completion fences >>>>>> drm/sched: Support long-running sched entities >>>>>> >>>>>> drivers/dma-buf/dma-fence.c | 142 +++++++--- >>>>>> drivers/dma-buf/dma-resv.c | 5 + >>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 14 +- >>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +- >>>>>> drivers/gpu/drm/drm_syncobj.c | 5 +- >>>>>> drivers/gpu/drm/etnaviv/etnaviv_sched.c | 5 +- >>>>>> drivers/gpu/drm/lima/lima_sched.c | 5 +- >>>>>> drivers/gpu/drm/msm/adreno/adreno_device.c | 6 +- >>>>>> drivers/gpu/drm/msm/msm_ringbuffer.c | 5 +- >>>>>> drivers/gpu/drm/panfrost/panfrost_job.c | 5 +- >>>>>> drivers/gpu/drm/scheduler/sched_entity.c | 127 +++++++-- >>>>>> drivers/gpu/drm/scheduler/sched_fence.c | 6 +- >>>>>> drivers/gpu/drm/scheduler/sched_main.c | 278 +++++++++++++++----- >>>>>> drivers/gpu/drm/v3d/v3d_sched.c | 25 +- >>>>>> include/drm/gpu_scheduler.h | 130 +++++++-- >>>>>> include/linux/dma-fence.h | 60 ++++- >>>>>> 16 files changed, 649 insertions(+), 184 deletions(-) >>>>>> >>>>>> -- >>>>>> 2.34.1
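The fence-container case Christian mentions above (a container becoming long-running when any member fence is long-running) could look roughly like the sketch below. DMA_FENCE_FLAG_LR_BIT is a made-up flag name used purely for illustration; dma_fence_array_create() is the existing helper:

    #include <linux/dma-fence.h>
    #include <linux/dma-fence-array.h>

    /* Hypothetical long-running flag, stored in the driver-usable flag bits. */
    #define DMA_FENCE_FLAG_LR_BIT   DMA_FENCE_FLAG_USER_BITS

    static struct dma_fence_array *
    lr_aware_fence_array_create(int num_fences, struct dma_fence **fences,
                                u64 context, unsigned int seqno)
    {
            struct dma_fence_array *array;
            int i;

            array = dma_fence_array_create(num_fences, fences, context, seqno,
                                           false);
            if (!array)
                    return NULL;

            /* If any member is long-running, the container inherits the flag. */
            for (i = 0; i < num_fences; i++)
                    if (test_bit(DMA_FENCE_FLAG_LR_BIT, &fences[i]->flags))
                            set_bit(DMA_FENCE_FLAG_LR_BIT, &array->base.flags);

            return array;
    }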
Am 06.04.23 um 08:37 schrieb Daniel Vetter: > On Thu, Apr 06, 2023 at 02:08:10AM +0000, Matthew Brost wrote: >> On Wed, Apr 05, 2023 at 12:12:27PM +0200, Daniel Vetter wrote: >>> On Wed, 5 Apr 2023 at 11:57, Christian König <christian.koenig@amd.com> wrote: >>>> Am 05.04.23 um 11:07 schrieb Daniel Vetter: >>>>> [SNIP] >>>>>> I would approach it from the complete other side. This component here is a >>>>>> tool to decide what job should run next. >>>>>> >>>>>> How that is then signaled and run should not be part of the scheduler, but >>>>>> another more higher level component. >>>>>> >>>>>> This way you also don't have a problem with not using DMA-fences as >>>>>> dependencies as well as constrains for running more jobs. >>>>> I think we're talking about two things here and mixing them up. >>>>> >>>>> For the dependencies I agree with you, and imo that higher level tool >>>>> should probably just be an on-demand submit thread in userspace for the >>>>> rare case where the kernel would need to sort out a dependency otherwise >>>>> (due to running out of ringspace in the per-ctx ringbuffer). >>>>> >>>>> The other thing is the message passing stuff, and this is what I was >>>>> talking about above. This has nothing to do with handling dependencies, >>>>> but with talking to the gpu fw. Here the intel design issue is that the fw >>>>> only provides a single queue, and it's in-order. Which means it >>>>> fundamentally has the stalling issue you describe as a point against a >>>>> message passing design. And fundamentally we need to be able to talk to >>>>> the fw in the scheduler ->run_job callback. >>>>> >>>>> The proposal here for the message passing part is that since it has the >>>>> stalling issue already anyway, and the scheduler needs to be involved >>>>> anyway, it makes sense to integrated this (as an optional thing, only for >>>>> drivers which have this kind of fw interface) into the scheduler. >>>>> Otherwise you just end up with two layers for no reason and more ping-pong >>>>> delay because the ->run_job needs to kick off the subordinate driver layer >>>>> first. Note that for this case the optional message passing support in the >>>>> drm/scheduler actually makes things better, because it allows you to cut >>>>> out one layer. >>>>> >>>>> Of course if a driver with better fw interface uses this message passing >>>>> support, then that's bad. Hence the big warning in the kerneldoc. >>>> Well what I wanted to say is that if you design the dependency handling >>>> / scheduler properly you don't need the message passing through it. >>>> >>>> For example if the GPU scheduler component uses a work item to do it's >>>> handling instead of a kthread you could also let the driver specify the >>>> work queue where this work item is executed on. >>>> >>>> When you design it like this the driver specifies the thread context of >>>> execution for it's job. In other words it can specify a single threaded >>>> firmware work queue as well. >>>> >>>> When you then have other messages which needs to be passed to the >>>> firmware you can also use the same single threaded workqueue for this. >>>> >>>> Drivers which have a different firmware interface would just use one of >>>> the system work queues instead. >>>> >>>> This approach basically decouples the GPU scheduler component from the >>>> message passing functionality. >>> Hm I guess we've been talking past each another big time, because >>> that's really what I thought was under discussions? 
Essentially the >>> current rfc, but implementing with some polish. >>> >> I think Daniel pretty much nailed it here (thanks), to recap: >> >> 1. I want the messages in the same worker so run_job / free_job / >> process_msg execution is mutual exclusive and also so during reset paths >> if the worker is stopped all the entry points can't be entered. >> >> If this is a NAK, then another worker is fine I guess. A lock between >> run_job / free_job + process_msg should solve the exclusion issue and the >> reset paths can also stop this new worker too. That being said I'd >> rather leave this as is but will not fight this point. >> >> 2. process_msg is just used to communicate with the firmware using the >> same queue as submission. Waiting for space in this queue is the only >> place this function can block (same as submission), well actually we >> have the concept a preempt time slice but that sleeps for 10 ms by >> default. Also preempt is only used in LR entities so I don't think it is >> relavent in this case either. >> >> 3. Agree this is in the dma-fence signaling path (if process_msg is in >> the submission worker) so we can't block indefinitely or an unreasonable >> period of time (i.e. we must obey dma-fence rules). > Just to hammer this in: Not just process_msg is in the dma_fence signaling > path, but the entire fw queue where everything is being funneled through, > including whatever the fw is doing to process these. > > Yes this is terrible and blew up a few times already :-/ > > But also, probably something that the docs really need to hammer in, to > make sure people don't look at this and thinkg "hey this seems to be the > recommended way to do this on linux". We don't want hw people to build > more of these designs, they're an absolute pain to deal with with Linux' > dma_fence signalling and gpu job scheduling rules. > > It's just that if you're stuck with such fw, then integrating the flow > into drm/sched instead of having an extra layer of workers seems the > better of two pretty bad solutions. Yeah and if you have such fw limitations, make sure that you use something which is understood by lockdep to feed into it. In other words, either locks or work item/queue and not some message passing functionality through the scheduler. As far as I can see the approach with the work item/queue should fit your needs here. Christian. > -Daniel > > >> 4. Agree the documentation for thw usage of the messaging interface >> needs to be clear. >> >> 5. Agree that my code could alway use polishing. >> >> Lets close on #1 then can I get on general Ack on this part of the RFC >> and apply the polish in the full review process? >> >> Matt >> >>> iow I agree with you (I think at least). >>> -Daniel >>> -- >>> Daniel Vetter >>> Software Engineer, Intel Corporation >>> http://blog.ffwll.ch
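As a concrete illustration of the work item/queue approach Christian describes above, a driver could do something along these lines; the names are hypothetical, and the point is only that ordering and exclusion come from an ordered workqueue rather than from a message hook inside the scheduler:

    #include <linux/workqueue.h>

    struct toy_fw_queue {
            struct workqueue_struct *wq;            /* ordered => strictly serial */
            struct work_struct       submit_work;   /* scheduler's work item */
            struct work_struct       msg_work;      /* driver's fw-message work item */
    };

    static void toy_submit_worker(struct work_struct *w)
    {
            /* run_job() / free_job() handling would live here */
    }

    static void toy_msg_worker(struct work_struct *w)
    {
            /* driver messages to the firmware would be sent from here */
    }

    static int toy_fw_queue_init(struct toy_fw_queue *q)
    {
            /* single threaded: work items on this wq never run concurrently */
            q->wq = alloc_ordered_workqueue("toy-fw-queue", 0);
            if (!q->wq)
                    return -ENOMEM;

            INIT_WORK(&q->submit_work, toy_submit_worker);
            INIT_WORK(&q->msg_work, toy_msg_worker);
            return 0;
    }

    /* Both submission and messages funnel through the same serial context. */
    static void toy_kick(struct toy_fw_queue *q, struct work_struct *work)
    {
            queue_work(q->wq, work);
    }

Drivers with a different firmware interface could simply pass one of the system workqueues instead, which is the decoupling Christian describes.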
On Thu, Apr 06, 2023 at 12:14:36PM +0200, Christian König wrote: > Am 06.04.23 um 08:37 schrieb Daniel Vetter: > > On Thu, Apr 06, 2023 at 02:08:10AM +0000, Matthew Brost wrote: > > > On Wed, Apr 05, 2023 at 12:12:27PM +0200, Daniel Vetter wrote: > > > > On Wed, 5 Apr 2023 at 11:57, Christian König <christian.koenig@amd.com> wrote: > > > > > Am 05.04.23 um 11:07 schrieb Daniel Vetter: > > > > > > [SNIP] > > > > > > > I would approach it from the complete other side. This component here is a > > > > > > > tool to decide what job should run next. > > > > > > > > > > > > > > How that is then signaled and run should not be part of the scheduler, but > > > > > > > another more higher level component. > > > > > > > > > > > > > > This way you also don't have a problem with not using DMA-fences as > > > > > > > dependencies as well as constrains for running more jobs. > > > > > > I think we're talking about two things here and mixing them up. > > > > > > > > > > > > For the dependencies I agree with you, and imo that higher level tool > > > > > > should probably just be an on-demand submit thread in userspace for the > > > > > > rare case where the kernel would need to sort out a dependency otherwise > > > > > > (due to running out of ringspace in the per-ctx ringbuffer). > > > > > > > > > > > > The other thing is the message passing stuff, and this is what I was > > > > > > talking about above. This has nothing to do with handling dependencies, > > > > > > but with talking to the gpu fw. Here the intel design issue is that the fw > > > > > > only provides a single queue, and it's in-order. Which means it > > > > > > fundamentally has the stalling issue you describe as a point against a > > > > > > message passing design. And fundamentally we need to be able to talk to > > > > > > the fw in the scheduler ->run_job callback. > > > > > > > > > > > > The proposal here for the message passing part is that since it has the > > > > > > stalling issue already anyway, and the scheduler needs to be involved > > > > > > anyway, it makes sense to integrated this (as an optional thing, only for > > > > > > drivers which have this kind of fw interface) into the scheduler. > > > > > > Otherwise you just end up with two layers for no reason and more ping-pong > > > > > > delay because the ->run_job needs to kick off the subordinate driver layer > > > > > > first. Note that for this case the optional message passing support in the > > > > > > drm/scheduler actually makes things better, because it allows you to cut > > > > > > out one layer. > > > > > > > > > > > > Of course if a driver with better fw interface uses this message passing > > > > > > support, then that's bad. Hence the big warning in the kerneldoc. > > > > > Well what I wanted to say is that if you design the dependency handling > > > > > / scheduler properly you don't need the message passing through it. > > > > > > > > > > For example if the GPU scheduler component uses a work item to do it's > > > > > handling instead of a kthread you could also let the driver specify the > > > > > work queue where this work item is executed on. > > > > > > > > > > When you design it like this the driver specifies the thread context of > > > > > execution for it's job. In other words it can specify a single threaded > > > > > firmware work queue as well. > > > > > > > > > > When you then have other messages which needs to be passed to the > > > > > firmware you can also use the same single threaded workqueue for this. 
> > > > > > > > > > Drivers which have a different firmware interface would just use one of > > > > > the system work queues instead. > > > > > > > > > > This approach basically decouples the GPU scheduler component from the > > > > > message passing functionality. > > > > Hm I guess we've been talking past each another big time, because > > > > that's really what I thought was under discussions? Essentially the > > > > current rfc, but implementing with some polish. > > > > > > > I think Daniel pretty much nailed it here (thanks), to recap: > > > > > > 1. I want the messages in the same worker so run_job / free_job / > > > process_msg execution is mutual exclusive and also so during reset paths > > > if the worker is stopped all the entry points can't be entered. > > > > > > If this is a NAK, then another worker is fine I guess. A lock between > > > run_job / free_job + process_msg should solve the exclusion issue and the > > > reset paths can also stop this new worker too. That being said I'd > > > rather leave this as is but will not fight this point. > > > > > > 2. process_msg is just used to communicate with the firmware using the > > > same queue as submission. Waiting for space in this queue is the only > > > place this function can block (same as submission), well actually we > > > have the concept a preempt time slice but that sleeps for 10 ms by > > > default. Also preempt is only used in LR entities so I don't think it is > > > relavent in this case either. > > > > > > 3. Agree this is in the dma-fence signaling path (if process_msg is in > > > the submission worker) so we can't block indefinitely or an unreasonable > > > period of time (i.e. we must obey dma-fence rules). > > Just to hammer this in: Not just process_msg is in the dma_fence signaling > > path, but the entire fw queue where everything is being funneled through, > > including whatever the fw is doing to process these. > > > > Yes this is terrible and blew up a few times already :-/ > > > > But also, probably something that the docs really need to hammer in, to > > make sure people don't look at this and thinkg "hey this seems to be the > > recommended way to do this on linux". We don't want hw people to build > > more of these designs, they're an absolute pain to deal with with Linux' > > dma_fence signalling and gpu job scheduling rules. > > > > It's just that if you're stuck with such fw, then integrating the flow > > into drm/sched instead of having an extra layer of workers seems the > > better of two pretty bad solutions. > > Yeah and if you have such fw limitations, make sure that you use something > which is understood by lockdep to feed into it. > > In other words, either locks or work item/queue and not some message passing > functionality through the scheduler. > > As far as I can see the approach with the work item/queue should fit your > needs here. dma_fence signalling annotations would also make the scheduler thread suitable to catch issues with lockdep, just to make double-sure it can catch issues. -Daniel > > Christian. > > > -Daniel > > > > > 4. Agree the documentation for thw usage of the messaging interface > > > needs to be clear. > > > > > > 5. Agree that my code could alway use polishing. > > > > > > Lets close on #1 then can I get on general Ack on this part of the RFC > > > and apply the polish in the full review process? > > > > > > Matt > > > > > > > iow I agree with you (I think at least). 
> > > > -Daniel > > > > -- > > > > Daniel Vetter > > > > Software Engineer, Intel Corporation > > > > http://blog.ffwll.ch >
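A rough sketch of what the dma_fence signalling annotation Daniel suggests could look like in the submission worker, reusing the hypothetical toy_submit_worker() naming from the sketch above; dma_fence_begin_signalling()/dma_fence_end_signalling() are the existing lockdep annotations:

    #include <linux/dma-fence.h>

    static void toy_submit_worker(struct work_struct *w)
    {
            /*
             * Everything between these two calls is treated by lockdep as a
             * dma_fence signalling critical section, so blocking on anything
             * that is illegal under the dma-fence rules (GFP_KERNEL
             * allocations that can hit reclaim, mmap_lock, ...) gets flagged.
             */
            bool cookie = dma_fence_begin_signalling();

            /* run_job() / free_job() / process_msg() handling goes here */

            dma_fence_end_signalling(cookie);
    }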
So this series basically go with option 2. The part that option2 makes me uncomfortable is, dma-fence doesn't work for long running workload, why we generate it in the first place? As long as dma-fence is generated, it will become a source of confusion in the future. It doesn't matter how much you annotate it/document it. So if we decide to go with option2, the bottom line is, don't generate dma-fence for long running workload during job submission. This requires some rework in drm scheduler. The cleanest solution to me is option3. Dma-fence is a very old technology. When it was created, no gpu support page fault. Obviously this is not a good technology for modern gpu with page fault support. I think the best way is to create a new scheduler and dependency tracking mechanism works for both page fault enabled and page fault disabled context. I think this matches what Christian said below. Maybe nobody think this is easy? Thanks, Oak > -----Original Message----- > From: Brost, Matthew <matthew.brost@intel.com> > Sent: April 5, 2023 2:53 PM > To: Zeng, Oak <oak.zeng@intel.com> > Cc: Christian König <christian.koenig@amd.com>; Vetter, Daniel > <daniel.vetter@intel.com>; Thomas Hellström > <thomas.hellstrom@linux.intel.com>; dri-devel@lists.freedesktop.org; intel- > xe@lists.freedesktop.org; robdclark@chromium.org; airlied@linux.ie; > lina@asahilina.net; boris.brezillon@collabora.com; faith.ekstrand@collabora.com > Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload > plans > > On Wed, Apr 05, 2023 at 12:06:53PM -0600, Zeng, Oak wrote: > > Hi, > > > > Using dma-fence for completion/dependency tracking for long-run > workload(more precisely on-demand paging/page fault enabled workload) can > cause deadlock. This seems the significant issue here. Other issues such as the > drm scheduler completion order implication etc are minors which can be solve > inside the framework of drm scheduler. We need to evaluate below paths: > > > > 1) still use drm scheduler for job submission, and use dma-fence for job > completion waiting/dependency tracking. This is solution proposed in this series. > Annotate dma-fence for long-run workload: user can still wait dma-fence for job > completion but can't wait dma-fence while holding any memory management > locks. We still use dma-fence for dependency tracking. But it is just very easily > run into deadlock when on-demand paging is in the picture. The annotation helps > us to detect deadlock but not solve deadlock problems. Seems *not* a complete > solution: It is almost impossible to completely avoid dependency deadlock in > complex runtime environment > > > > No one can wait on LR fence, so it is impossible to deadlock. The > annotations enforce this. Literally this is only for flow controling the > ring / hold pending jobs in in the DRM schedule list. > > > 2) Still use drm scheduler but not use dma-fence for completion signaling > and dependency tracking. This way we still get some free functions (reset, err > handling ring flow control as Matt said)from drm scheduler, but push the > dependency/completion tracking completely to user space using techniques such > as user space fence. User space doesn't have chance to wait fence while holding > a kernel memory management lock, thus the dma-fence deadlock issue is solved. > > > > We use user space fence for syncs. > > > 3) Completely discard drm scheduler and dma-fence for long-run > workload. Use user queue/doorbell for super fast submission, directly interact > with fw scheduler. 
Use user fence for completion/dependency tracking. > > > > This is a hard no from me, I want 1 submission path in Xe. Either we use > the DRM scheduler or we don't. > > Matt > > > Thanks, > > Oak > > > > > -----Original Message----- > > > From: Christian König <christian.koenig@amd.com> > > > Sent: April 5, 2023 3:30 AM > > > To: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak > > > <oak.zeng@intel.com> > > > Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; > > > robdclark@chromium.org; thomas.hellstrom@linux.intel.com; > airlied@linux.ie; > > > lina@asahilina.net; boris.brezillon@collabora.com; > faith.ekstrand@collabora.com > > > Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload > > > plans > > > > > > Am 04.04.23 um 20:08 schrieb Matthew Brost: > > > > On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote: > > > >> Hi Matt, Thomas, > > > >> > > > >> Some very bold out of box thinking in this area: > > > >> > > > >> 1. so you want to use drm scheduler and dma-fence for long running > workload. > > > Why you want to do this in the first place? What is the benefit? Drm scheduler > is > > > pretty much a software scheduler. Modern gpu has scheduler built at fw/hw > > > level, as you said below for intel this is Guc. Can xe driver just directly submit > job > > > to Guc, bypassing drm scheduler? > > > >> > > > > If we did that now we have 2 paths for dependency track, flow controling > > > > the ring, resets / error handling / backend submission implementations. > > > > We don't want this. > > > > > > Well exactly that's the point: Why? > > > > > > As far as I can see that are two completely distinct use cases, so you > > > absolutely do want two completely distinct implementations for this. > > > > > > >> 2. using dma-fence for long run workload: I am well aware that page fault > (and > > > the consequent memory allocation/lock acquiring to fix the fault) can cause > > > deadlock for a dma-fence wait. But I am not convinced that dma-fence can't > be > > > used purely because the nature of the workload that it runs very long > (indefinite). > > > I did a math: the dma_fence_wait_timeout function's third param is the > timeout > > > which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not > long > > > enough, can we just change the timeout parameter to signed 64 bits so it is > much > > > longer than our life time... > > > >> > > > >> So I mainly argue we can't use dma-fence for long-run workload is not > > > because the workload runs very long, rather because of the fact that we use > > > page fault for long-run workload. If we enable page fault for short-run > workload, > > > we can't use dma-fence either. Page fault is the key thing here. > > > >> > > > >> Now since we use page fault which is *fundamentally* controversial with > > > dma-fence design, why now just introduce a independent concept such as > user- > > > fence instead of extending existing dma-fence? > > > >> > > > >> I like unified design. If drm scheduler, dma-fence can be extended to work > for > > > everything, it is beautiful. But seems we have some fundamental problem > here. > > > >> > > > > Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use > > > > the signal / CB infrastructure) and enforce we don't use use these > > > > dma-fences from the scheduler in memory reclaim paths or export these to > > > > user space or other drivers. Think of this mode as SW only fence. 
> > > > > > Yeah and I truly think this is an really bad idea. > > > > > > The signal/CB infrastructure in the dma_fence turned out to be the > > > absolutely nightmare I initially predicted. Sorry to say that, but in > > > this case the "I've told you so" is appropriate in my opinion. > > > > > > If we need infrastructure for long running dependency tracking we should > > > encapsulate that in a new framework and not try to mangle the existing > > > code for something it was never intended for. > > > > > > Christian. > > > > > > > > > > > Matt > > > > > > > >> Thanks, > > > >> Oak > > > >> > > > >>> -----Original Message----- > > > >>> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of > > > >>> Matthew Brost > > > >>> Sent: April 3, 2023 8:22 PM > > > >>> To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org > > > >>> Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com; > > > airlied@linux.ie; > > > >>> lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew > > > >>> <matthew.brost@intel.com>; christian.koenig@amd.com; > > > >>> faith.ekstrand@collabora.com > > > >>> Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload > > > plans > > > >>> > > > >>> Hello, > > > >>> > > > >>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we > > > >>> have been asked to merge our common DRM scheduler patches first as > well > > > >>> as develop a common solution for long running workloads with the DRM > > > >>> scheduler. This RFC series is our first attempt at doing this. We > > > >>> welcome any and all feedback. > > > >>> > > > >>> This can we thought of as 4 parts detailed below. > > > >>> > > > >>> - DRM scheduler changes for 1 to 1 relationship between scheduler and > > > >>> entity (patches 1-3) > > > >>> > > > >>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the > > > >>> GuC) which is a new paradigm WRT to the DRM scheduler and presents > > > >>> severals problems as the DRM was originally designed to schedule jobs > on > > > >>> hardware queues. The main problem being that DRM scheduler expects > the > > > >>> submission order of jobs to be the completion order of jobs even across > > > >>> multiple entities. This assumption falls apart with a firmware scheduler > > > >>> as a firmware scheduler has no concept of jobs and jobs can complete > out > > > >>> of order. A novel solution for was originally thought of by Faith during > > > >>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler > > > >>> and entity. I believe the AGX driver [3] is using this approach and > > > >>> Boris may use approach as well for the Mali driver [4]. > > > >>> > > > >>> To support a 1 to 1 relationship we move the main execution function > > > >>> from a kthread to a work queue and add a new scheduling mode which > > > >>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship. > > > >>> The new scheduling mode should unify all drivers usage with a 1 to 1 > > > >>> relationship and can be thought of as using scheduler as a dependency / > > > >>> infligt job tracker rather than a true scheduler. > > > >>> > > > >>> - Generic messaging interface for DRM scheduler > > > >>> > > > >>> Idea is to be able to communicate to the submission backend with in > band > > > >>> (relative to main execution function) messages. Messages are backend > > > >>> defined and flexable enough for any use case. 
In Xe we use these > > > >>> messages to clean up entites, set properties for entites, and suspend / > > > >>> resume execution of an entity [5]. I suspect other driver can leverage > > > >>> this messaging concept too as it a convenient way to avoid races in the > > > >>> backend. > > > >>> > > > >>> - Support for using TDR for all error paths of a scheduler / entity > > > >>> > > > >>> Fix a few races / bugs, add function to dynamically set the TDR timeout. > > > >>> > > > >>> - Annotate dma-fences for long running workloads. > > > >>> > > > >>> The idea here is to use dma-fences only as sync points within the > > > >>> scheduler and never export them for long running workloads. By > > > >>> annotating these fences as long running we ensure that these dma- > fences > > > >>> are never used in a way that breaks the dma-fence rules. A benefit of > > > >>> thus approach is the scheduler can still safely flow control the > > > >>> execution ring buffer via the job limit without breaking the dma-fence > > > >>> rules. > > > >>> > > > >>> Again this a first draft and looking forward to feedback. > > > >>> > > > >>> Enjoy - Matt > > > >>> > > > >>> [1] https://gitlab.freedesktop.org/drm/xe/kernel > > > >>> [2] https://patchwork.freedesktop.org/series/112188/ > > > >>> [3] https://patchwork.freedesktop.org/series/114772/ > > > >>> [4] > https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1 > > > >>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe- > > > >>> next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031 > > > >>> > > > >>> Matthew Brost (8): > > > >>> drm/sched: Convert drm scheduler to use a work queue rather than > > > >>> kthread > > > >>> drm/sched: Move schedule policy to scheduler / entity > > > >>> drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling > policy > > > >>> drm/sched: Add generic scheduler message interface > > > >>> drm/sched: Start run wq before TDR in drm_sched_start > > > >>> drm/sched: Submit job before starting TDR > > > >>> drm/sched: Add helper to set TDR timeout > > > >>> drm/syncobj: Warn on long running dma-fences > > > >>> > > > >>> Thomas Hellström (2): > > > >>> dma-buf/dma-fence: Introduce long-running completion fences > > > >>> drm/sched: Support long-running sched entities > > > >>> > > > >>> drivers/dma-buf/dma-fence.c | 142 +++++++--- > > > >>> drivers/dma-buf/dma-resv.c | 5 + > > > >>> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 14 +- > > > >>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +- > > > >>> drivers/gpu/drm/drm_syncobj.c | 5 +- > > > >>> drivers/gpu/drm/etnaviv/etnaviv_sched.c | 5 +- > > > >>> drivers/gpu/drm/lima/lima_sched.c | 5 +- > > > >>> drivers/gpu/drm/msm/adreno/adreno_device.c | 6 +- > > > >>> drivers/gpu/drm/msm/msm_ringbuffer.c | 5 +- > > > >>> drivers/gpu/drm/panfrost/panfrost_job.c | 5 +- > > > >>> drivers/gpu/drm/scheduler/sched_entity.c | 127 +++++++-- > > > >>> drivers/gpu/drm/scheduler/sched_fence.c | 6 +- > > > >>> drivers/gpu/drm/scheduler/sched_main.c | 278 +++++++++++++++-- > --- > > > >>> drivers/gpu/drm/v3d/v3d_sched.c | 25 +- > > > >>> include/drm/gpu_scheduler.h | 130 +++++++-- > > > >>> include/linux/dma-fence.h | 60 ++++- > > > >>> 16 files changed, 649 insertions(+), 184 deletions(-) > > > >>> > > > >>> -- > > > >>> 2.34.1 > >
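As a side note on the 23-day figure quoted above: dma_fence_wait_timeout() takes its timeout as a signed long in jiffies, so the rough ceiling on a 32-bit kernel with HZ=1000 is LONG_MAX / HZ = 2147483647 / 1000 seconds, which is about 2.15e6 seconds or roughly 24.8 days (the same back-of-the-envelope estimate as the ~23 days mentioned). On a 64-bit kernel signed long is 64 bits wide and the effective limit is MAX_SCHEDULE_TIMEOUT, i.e. wait forever, so the width of the timeout type was never really the limiting factor; as noted above, the page-fault interaction is.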
On 04/04/2023 10.58, Matthew Brost wrote: > On Tue, Apr 04, 2023 at 10:07:48AM +0900, Asahi Lina wrote: >> Hi, thanks for the Cc! >> > > No problem. > >> On 04/04/2023 09.22, Matthew Brost wrote: >>> Hello, >>> >>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we >>> have been asked to merge our common DRM scheduler patches first as well >>> as develop a common solution for long running workloads with the DRM >>> scheduler. This RFC series is our first attempt at doing this. We >>> welcome any and all feedback. >>> >>> This can we thought of as 4 parts detailed below. >>> >>> - DRM scheduler changes for 1 to 1 relationship between scheduler and >>> entity (patches 1-3) >>> >>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the >>> GuC) which is a new paradigm WRT to the DRM scheduler and presents >>> severals problems as the DRM was originally designed to schedule jobs on >>> hardware queues. The main problem being that DRM scheduler expects the >>> submission order of jobs to be the completion order of jobs even across >>> multiple entities. This assumption falls apart with a firmware scheduler >>> as a firmware scheduler has no concept of jobs and jobs can complete out >>> of order. A novel solution for was originally thought of by Faith during >>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler >>> and entity. I believe the AGX driver [3] is using this approach and >>> Boris may use approach as well for the Mali driver [4]. >>> >>> To support a 1 to 1 relationship we move the main execution function >>> from a kthread to a work queue and add a new scheduling mode which >>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship. >>> The new scheduling mode should unify all drivers usage with a 1 to 1 >>> relationship and can be thought of as using scheduler as a dependency / >>> infligt job tracker rather than a true scheduler. >> >> Yup, we're in the exact same situation with drm/asahi, so this is very >> welcome! We've been using the existing scheduler as-is, but this should help >> remove some unneeded complexity in this use case. >> > > That's the idea. > >> Do you want me to pull in this series into our tree and make sure this all >> works out for us? >> > > We tested this in Xe and it definitely works for us but the more testing > the better. > I haven't gotten around to testing this series yet, but after more debugging of drm_sched issues I want to hear more about how Xe uses the scheduler. From what I can tell, and from what Christian says, drm_sched has the hidden requirement that all job objects outlive the scheduler. I've run into several UAF bugs due to this. Not only that, it also currently has the requirement that all drm_sched fences outlive the scheduler object. These requirements are subtle and only manifest as kernel oopses in rare corner cases, so it wasn't at all obvious to me that this was somehow a fundamental design assumption when I started using it. As far as I can tell, this design is going to work in 99% of cases for global-schedulers-per-GPU models, where those corner cases would have to be hit on top of a GPU removal scenario (and GPU remove is... well, not the most tested/exercised use case). When the scheduler basically lives forever, none of this really matters. But with a one-scheduler-per-queue model, how do you deal with this when the queue goes away? 
So far, without any of the partial bugfixes I have sent so far (which Christian objected to):

- If you try to tear down a scheduler with any jobs currently scheduled at the hardware, drm_sched will oops when those jobs complete and the hw fences signal.

- If you try to tear down an entity (which should cancel all its pending jobs) and then the scheduler it was attached to without actually waiting for all the free_job() callbacks to be called on every job that ever existed for that entity, you can oops (entity cleanup is asynchronous in some cases like killed processes, so it will return before all jobs are freed and then that asynchronous process will crash and burn if the scheduler goes away out from under its feet). Waiting for job completion fences is not enough for this, you have to wait until free_job() has actually been called for all jobs.

- Even if you actually wait for all jobs to be truly gone and then tear down the scheduler, if any scheduler job fences remain alive, that will then oops if you try to call the debug functions on them (like cat /sys/kernel/debug/dma_buf/bufinfo).

I tried to fix these things, but Christian objected implying it was the driver's job to keep a reference from jobs and hw fences to the scheduler. But I find that completely broken, because besides the extra memory/resource usage keeping the scheduler alive when you're trying to free resources as fast as possible when a process goes away, you can't even use normal reference counting for that: if you try to drop the last drm_sched reference from within a free_job() callback, the whole thing deadlocks since that will be running in the scheduler's thread/workqueue context, which can't free itself. So now you both reference count the scheduler from jobs and fences, and on top of that you need to outsource drm_sched freeing to a workqueue in the driver to make sure you don't deadlock.

For job fences this is particularly broken, because those fences can live forever signaled and attached to shared buffers and there is no guarantee that they will be freed in any kind of reasonable time frame. If they have to keep the scheduler that created them alive, that creates a lot of dead object junk we have to drag around just because a signaled fence exists somewhere.

For a Rust abstraction we have to do all that tracking and refcounting in the abstraction itself to make it safe, which is starting to sound like reimplementing half of the job tracking drm_sched itself does just to fix the lifetime issues, which really tells me the existing design is not sound nor easy to use correctly in general.

How does Xe deal with this (does it deal with it at all)? What happens when you kill -9 a process using the GPU? Does freeing all of this wait for all jobs to complete *and be freed* with free_job()? What about exported dma_bufs with fences attached from that scheduler? Do you keep the scheduler alive for those?

Personally, after running into all this, and after seeing Christian's reaction to me trying to improve the state of things, I'm starting to doubt that drm_sched is the right solution at all for firmware-scheduling drivers.

If you want a workload to try to see if you run into any of these things, running and killing lots of things in parallel is a good thing to try (mess with the numbers and let it run for a while to see if you can hit any corner cases):

while true; do for i in $(seq 1 10); do timeout -k 0.01 0.05 glxgears & done; sleep 0.1; done

~~ Lina
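A minimal sketch of the workaround described above, i.e. refcounting the scheduler from jobs/fences and pushing the final teardown to a workqueue so the last reference can be dropped from free_job() context; all driver-side names here are hypothetical:

    #include <linux/kref.h>
    #include <linux/slab.h>
    #include <linux/workqueue.h>
    #include <drm/gpu_scheduler.h>

    struct toy_queue {
            struct kref              ref;   /* held by every job and exported fence */
            struct drm_gpu_scheduler sched;
            struct work_struct       free_work;
    };

    static void toy_queue_free_work(struct work_struct *w)
    {
            struct toy_queue *q = container_of(w, struct toy_queue, free_work);

            drm_sched_fini(&q->sched);
            kfree(q);
    }

    static void toy_queue_release(struct kref *ref)
    {
            struct toy_queue *q = container_of(ref, struct toy_queue, ref);

            /*
             * The final kref_put() may come from free_job(), i.e. from the
             * scheduler's own thread/workqueue context, which cannot tear
             * itself down without deadlocking, so defer the teardown.
             */
            INIT_WORK(&q->free_work, toy_queue_free_work);
            schedule_work(&q->free_work);
    }

    static void toy_queue_put(struct toy_queue *q)
    {
            kref_put(&q->ref, toy_queue_release);
    }

This is more or less the amount of driver-side machinery being objected to above: the refcounting and the deferred free both exist only to work around the lifetime assumptions inside drm_sched.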
The point is that this not only requires some work in the drm_scheduler, but rather it then makes only little sense to use the drm_scheduler in the first place. The whole point of the drm_scheduler is to provide dma_fence implementation for the submitted jobs. We also have dependency handling, but as Daniel and I said this can be easily extracted into a separate object/component. Regards, Christian. Am 07.04.23 um 02:20 schrieb Zeng, Oak: > So this series basically go with option 2. The part that option2 makes me uncomfortable is, dma-fence doesn't work for long running workload, why we generate it in the first place? As long as dma-fence is generated, it will become a source of confusion in the future. It doesn't matter how much you annotate it/document it. So if we decide to go with option2, the bottom line is, don't generate dma-fence for long running workload during job submission. This requires some rework in drm scheduler. > > The cleanest solution to me is option3. Dma-fence is a very old technology. When it was created, no gpu support page fault. Obviously this is not a good technology for modern gpu with page fault support. I think the best way is to create a new scheduler and dependency tracking mechanism works for both page fault enabled and page fault disabled context. I think this matches what Christian said below. Maybe nobody think this is easy? > > Thanks, > Oak > >> -----Original Message----- >> From: Brost, Matthew <matthew.brost@intel.com> >> Sent: April 5, 2023 2:53 PM >> To: Zeng, Oak <oak.zeng@intel.com> >> Cc: Christian König <christian.koenig@amd.com>; Vetter, Daniel >> <daniel.vetter@intel.com>; Thomas Hellström >> <thomas.hellstrom@linux.intel.com>; dri-devel@lists.freedesktop.org; intel- >> xe@lists.freedesktop.org; robdclark@chromium.org; airlied@linux.ie; >> lina@asahilina.net; boris.brezillon@collabora.com; faith.ekstrand@collabora.com >> Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload >> plans >> >> On Wed, Apr 05, 2023 at 12:06:53PM -0600, Zeng, Oak wrote: >>> Hi, >>> >>> Using dma-fence for completion/dependency tracking for long-run >> workload(more precisely on-demand paging/page fault enabled workload) can >> cause deadlock. This seems the significant issue here. Other issues such as the >> drm scheduler completion order implication etc are minors which can be solve >> inside the framework of drm scheduler. We need to evaluate below paths: >>> 1) still use drm scheduler for job submission, and use dma-fence for job >> completion waiting/dependency tracking. This is solution proposed in this series. >> Annotate dma-fence for long-run workload: user can still wait dma-fence for job >> completion but can't wait dma-fence while holding any memory management >> locks. We still use dma-fence for dependency tracking. But it is just very easily >> run into deadlock when on-demand paging is in the picture. The annotation helps >> us to detect deadlock but not solve deadlock problems. Seems *not* a complete >> solution: It is almost impossible to completely avoid dependency deadlock in >> complex runtime environment >> No one can wait on LR fence, so it is impossible to deadlock. The >> annotations enforce this. Literally this is only for flow controling the >> ring / hold pending jobs in in the DRM schedule list. >> >>> 2) Still use drm scheduler but not use dma-fence for completion signaling >> and dependency tracking. 
This way we still get some free functions (reset, err >> handling ring flow control as Matt said)from drm scheduler, but push the >> dependency/completion tracking completely to user space using techniques such >> as user space fence. User space doesn't have chance to wait fence while holding >> a kernel memory management lock, thus the dma-fence deadlock issue is solved. >> We use user space fence for syncs. >> >>> 3) Completely discard drm scheduler and dma-fence for long-run >> workload. Use user queue/doorbell for super fast submission, directly interact >> with fw scheduler. Use user fence for completion/dependency tracking. >> This is a hard no from me, I want 1 submission path in Xe. Either we use >> the DRM scheduler or we don't. >> >> Matt >> >>> Thanks, >>> Oak >>> >>>> -----Original Message----- >>>> From: Christian König <christian.koenig@amd.com> >>>> Sent: April 5, 2023 3:30 AM >>>> To: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak >>>> <oak.zeng@intel.com> >>>> Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; >>>> robdclark@chromium.org; thomas.hellstrom@linux.intel.com; >> airlied@linux.ie; >>>> lina@asahilina.net; boris.brezillon@collabora.com; >> faith.ekstrand@collabora.com >>>> Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload >>>> plans >>>> >>>> Am 04.04.23 um 20:08 schrieb Matthew Brost: >>>>> On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote: >>>>>> Hi Matt, Thomas, >>>>>> >>>>>> Some very bold out of box thinking in this area: >>>>>> >>>>>> 1. so you want to use drm scheduler and dma-fence for long running >> workload. >>>> Why you want to do this in the first place? What is the benefit? Drm scheduler >> is >>>> pretty much a software scheduler. Modern gpu has scheduler built at fw/hw >>>> level, as you said below for intel this is Guc. Can xe driver just directly submit >> job >>>> to Guc, bypassing drm scheduler? >>>>> If we did that now we have 2 paths for dependency track, flow controling >>>>> the ring, resets / error handling / backend submission implementations. >>>>> We don't want this. >>>> Well exactly that's the point: Why? >>>> >>>> As far as I can see that are two completely distinct use cases, so you >>>> absolutely do want two completely distinct implementations for this. >>>> >>>>>> 2. using dma-fence for long run workload: I am well aware that page fault >> (and >>>> the consequent memory allocation/lock acquiring to fix the fault) can cause >>>> deadlock for a dma-fence wait. But I am not convinced that dma-fence can't >> be >>>> used purely because the nature of the workload that it runs very long >> (indefinite). >>>> I did a math: the dma_fence_wait_timeout function's third param is the >> timeout >>>> which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not >> long >>>> enough, can we just change the timeout parameter to signed 64 bits so it is >> much >>>> longer than our life time... >>>>>> So I mainly argue we can't use dma-fence for long-run workload is not >>>> because the workload runs very long, rather because of the fact that we use >>>> page fault for long-run workload. If we enable page fault for short-run >> workload, >>>> we can't use dma-fence either. Page fault is the key thing here. >>>>>> Now since we use page fault which is *fundamentally* controversial with >>>> dma-fence design, why now just introduce a independent concept such as >> user- >>>> fence instead of extending existing dma-fence? >>>>>> I like unified design. 
If drm scheduler, dma-fence can be extended to work >> for >>>> everything, it is beautiful. But seems we have some fundamental problem >> here. >>>>> Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use >>>>> the signal / CB infrastructure) and enforce we don't use use these >>>>> dma-fences from the scheduler in memory reclaim paths or export these to >>>>> user space or other drivers. Think of this mode as SW only fence. >>>> Yeah and I truly think this is an really bad idea. >>>> >>>> The signal/CB infrastructure in the dma_fence turned out to be the >>>> absolutely nightmare I initially predicted. Sorry to say that, but in >>>> this case the "I've told you so" is appropriate in my opinion. >>>> >>>> If we need infrastructure for long running dependency tracking we should >>>> encapsulate that in a new framework and not try to mangle the existing >>>> code for something it was never intended for. >>>> >>>> Christian. >>>> >>>>> Matt >>>>> >>>>>> Thanks, >>>>>> Oak >>>>>> >>>>>>> -----Original Message----- >>>>>>> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of >>>>>>> Matthew Brost >>>>>>> Sent: April 3, 2023 8:22 PM >>>>>>> To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org >>>>>>> Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com; >>>> airlied@linux.ie; >>>>>>> lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew >>>>>>> <matthew.brost@intel.com>; christian.koenig@amd.com; >>>>>>> faith.ekstrand@collabora.com >>>>>>> Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload >>>> plans >>>>>>> Hello, >>>>>>> >>>>>>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we >>>>>>> have been asked to merge our common DRM scheduler patches first as >> well >>>>>>> as develop a common solution for long running workloads with the DRM >>>>>>> scheduler. This RFC series is our first attempt at doing this. We >>>>>>> welcome any and all feedback. >>>>>>> >>>>>>> This can we thought of as 4 parts detailed below. >>>>>>> >>>>>>> - DRM scheduler changes for 1 to 1 relationship between scheduler and >>>>>>> entity (patches 1-3) >>>>>>> >>>>>>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the >>>>>>> GuC) which is a new paradigm WRT to the DRM scheduler and presents >>>>>>> severals problems as the DRM was originally designed to schedule jobs >> on >>>>>>> hardware queues. The main problem being that DRM scheduler expects >> the >>>>>>> submission order of jobs to be the completion order of jobs even across >>>>>>> multiple entities. This assumption falls apart with a firmware scheduler >>>>>>> as a firmware scheduler has no concept of jobs and jobs can complete >> out >>>>>>> of order. A novel solution for was originally thought of by Faith during >>>>>>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler >>>>>>> and entity. I believe the AGX driver [3] is using this approach and >>>>>>> Boris may use approach as well for the Mali driver [4]. >>>>>>> >>>>>>> To support a 1 to 1 relationship we move the main execution function >>>>>>> from a kthread to a work queue and add a new scheduling mode which >>>>>>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship. >>>>>>> The new scheduling mode should unify all drivers usage with a 1 to 1 >>>>>>> relationship and can be thought of as using scheduler as a dependency / >>>>>>> infligt job tracker rather than a true scheduler. 
>>>>>>> >>>>>>> - Generic messaging interface for DRM scheduler >>>>>>> >>>>>>> Idea is to be able to communicate to the submission backend with in >> band >>>>>>> (relative to main execution function) messages. Messages are backend >>>>>>> defined and flexable enough for any use case. In Xe we use these >>>>>>> messages to clean up entites, set properties for entites, and suspend / >>>>>>> resume execution of an entity [5]. I suspect other driver can leverage >>>>>>> this messaging concept too as it a convenient way to avoid races in the >>>>>>> backend. >>>>>>> >>>>>>> - Support for using TDR for all error paths of a scheduler / entity >>>>>>> >>>>>>> Fix a few races / bugs, add function to dynamically set the TDR timeout. >>>>>>> >>>>>>> - Annotate dma-fences for long running workloads. >>>>>>> >>>>>>> The idea here is to use dma-fences only as sync points within the >>>>>>> scheduler and never export them for long running workloads. By >>>>>>> annotating these fences as long running we ensure that these dma- >> fences >>>>>>> are never used in a way that breaks the dma-fence rules. A benefit of >>>>>>> thus approach is the scheduler can still safely flow control the >>>>>>> execution ring buffer via the job limit without breaking the dma-fence >>>>>>> rules. >>>>>>> >>>>>>> Again this a first draft and looking forward to feedback. >>>>>>> >>>>>>> Enjoy - Matt >>>>>>> >>>>>>> [1] https://gitlab.freedesktop.org/drm/xe/kernel >>>>>>> [2] https://patchwork.freedesktop.org/series/112188/ >>>>>>> [3] https://patchwork.freedesktop.org/series/114772/ >>>>>>> [4] >> https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1 >>>>>>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe- >>>>>>> next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031 >>>>>>> >>>>>>> Matthew Brost (8): >>>>>>> drm/sched: Convert drm scheduler to use a work queue rather than >>>>>>> kthread >>>>>>> drm/sched: Move schedule policy to scheduler / entity >>>>>>> drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling >> policy >>>>>>> drm/sched: Add generic scheduler message interface >>>>>>> drm/sched: Start run wq before TDR in drm_sched_start >>>>>>> drm/sched: Submit job before starting TDR >>>>>>> drm/sched: Add helper to set TDR timeout >>>>>>> drm/syncobj: Warn on long running dma-fences >>>>>>> >>>>>>> Thomas Hellström (2): >>>>>>> dma-buf/dma-fence: Introduce long-running completion fences >>>>>>> drm/sched: Support long-running sched entities >>>>>>> >>>>>>> drivers/dma-buf/dma-fence.c | 142 +++++++--- >>>>>>> drivers/dma-buf/dma-resv.c | 5 + >>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 14 +- >>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +- >>>>>>> drivers/gpu/drm/drm_syncobj.c | 5 +- >>>>>>> drivers/gpu/drm/etnaviv/etnaviv_sched.c | 5 +- >>>>>>> drivers/gpu/drm/lima/lima_sched.c | 5 +- >>>>>>> drivers/gpu/drm/msm/adreno/adreno_device.c | 6 +- >>>>>>> drivers/gpu/drm/msm/msm_ringbuffer.c | 5 +- >>>>>>> drivers/gpu/drm/panfrost/panfrost_job.c | 5 +- >>>>>>> drivers/gpu/drm/scheduler/sched_entity.c | 127 +++++++-- >>>>>>> drivers/gpu/drm/scheduler/sched_fence.c | 6 +- >>>>>>> drivers/gpu/drm/scheduler/sched_main.c | 278 +++++++++++++++-- >> --- >>>>>>> drivers/gpu/drm/v3d/v3d_sched.c | 25 +- >>>>>>> include/drm/gpu_scheduler.h | 130 +++++++-- >>>>>>> include/linux/dma-fence.h | 60 ++++- >>>>>>> 16 files changed, 649 insertions(+), 184 deletions(-) >>>>>>> >>>>>>> -- >>>>>>> 2.34.1
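For reference, the "about 23 days" figure in Oak's quoted mail corresponds to a 32-bit long: dma_fence_wait_timeout() takes a signed long timeout in jiffies, so with HZ=1000 the ceiling on a 32-bit kernel is (2^31 - 1) jiffies / 1000 HZ ≈ 2.15e6 s ≈ 24.9 days, while on a 64-bit kernel the same signed long already allows (2^63 - 1) / 1000 s, i.e. on the order of hundreds of millions of years, without any interface change.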
On Sat, Apr 08, 2023 at 04:05:20PM +0900, Asahi Lina wrote: > On 04/04/2023 10.58, Matthew Brost wrote: > > On Tue, Apr 04, 2023 at 10:07:48AM +0900, Asahi Lina wrote: > > > Hi, thanks for the Cc! > > > > > > > No problem. > > > > > On 04/04/2023 09.22, Matthew Brost wrote: > > > > Hello, > > > > > > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we > > > > have been asked to merge our common DRM scheduler patches first as well > > > > as develop a common solution for long running workloads with the DRM > > > > scheduler. This RFC series is our first attempt at doing this. We > > > > welcome any and all feedback. > > > > > > > > This can we thought of as 4 parts detailed below. > > > > > > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and > > > > entity (patches 1-3) > > > > > > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the > > > > GuC) which is a new paradigm WRT to the DRM scheduler and presents > > > > severals problems as the DRM was originally designed to schedule jobs on > > > > hardware queues. The main problem being that DRM scheduler expects the > > > > submission order of jobs to be the completion order of jobs even across > > > > multiple entities. This assumption falls apart with a firmware scheduler > > > > as a firmware scheduler has no concept of jobs and jobs can complete out > > > > of order. A novel solution for was originally thought of by Faith during > > > > the initial prototype of Xe, create a 1 to 1 relationship between scheduler > > > > and entity. I believe the AGX driver [3] is using this approach and > > > > Boris may use approach as well for the Mali driver [4]. > > > > > > > > To support a 1 to 1 relationship we move the main execution function > > > > from a kthread to a work queue and add a new scheduling mode which > > > > bypasses code in the DRM which isn't needed in a 1 to 1 relationship. > > > > The new scheduling mode should unify all drivers usage with a 1 to 1 > > > > relationship and can be thought of as using scheduler as a dependency / > > > > infligt job tracker rather than a true scheduler. > > > > > > Yup, we're in the exact same situation with drm/asahi, so this is very > > > welcome! We've been using the existing scheduler as-is, but this should help > > > remove some unneeded complexity in this use case. > > > > > > > That's the idea. > > > > > Do you want me to pull in this series into our tree and make sure this all > > > works out for us? > > > > > > > We tested this in Xe and it definitely works for us but the more testing > > the better. > > > > I haven't gotten around to testing this series yet, but after more debugging > of drm_sched issues I want to hear more about how Xe uses the scheduler. > > From what I can tell, and from what Christian says, drm_sched has the hidden > requirement that all job objects outlive the scheduler. I've run into > several UAF bugs due to this. Not only that, it also currently has the > requirement that all drm_sched fences outlive the scheduler object. > > These requirements are subtle and only manifest as kernel oopses in rare > corner cases, so it wasn't at all obvious to me that this was somehow a > fundamental design assumption when I started using it. > > As far as I can tell, this design is going to work in 99% of cases for > global-schedulers-per-GPU models, where those corner cases would have to be > hit on top of a GPU removal scenario (and GPU remove is... well, not the > most tested/exercised use case). 
When the scheduler basically lives forever, > none of this really matters. > > But with a one-scheduler-per-queue model, how do you deal with this when the > queue goes away? So far, without any of the partial bugfixes I have sent so > far (which Christian objected to): > > - If you try to tear down a scheduler with any jobs currently scheduled at > the hardware, drm_sched will oops when those jobs complete and the hw fences > signal. > - If you try to tear down an entity (which should cancel all its pending > jobs) and then the scheduler it was attached to without actually waiting for > all the free_job() callbacks to be called on every job that ever existed for > that entity, you can oops (entity cleanup is asynchronous in some cases like > killed processes, so it will return before all jobs are freed and then that > asynchronous process will crash and burn if the scheduler goes away out from > under its feet). Waiting for job completion fences is not enough for this, > you have to wait until free_job() has actually been called for all jobs. > - Even if you actually wait for all jobs to be truly gone and then tear down > the scheduler, if any scheduler job fences remain alive, that will then oops > if you try to call the debug functions on them (like cat > /sys/kernel/debug/dma_buf/bufinfo). > > I tried to fix these things, but Christian objected implying it was the > driver's job to keep a reference from jobs and hw fences to the scheduler. > But I find that completely broken, because besides the extra memory/resource > usage keeping the scheduler alive when you're trying to free resources as > fast as possible when a process goes away, you can't even use normal > reference counting for that: if you try to drop the last drm_sched reference > from within a free_job() callback, the whole thing deadlocks since that will > be running in the scheduler's thread/workqueue context, which can't free > itself. So now you both reference count the scheduler from jobs and fences, > and on top of that you need to outsource drm_sched freeing to a workqueue in > the driver to make sure you don't deadlock. > > For job fences this is particularly broken, because those fences can live > forever signaled and attached to shared buffers and there is no guarantee > that they will be freed in any kind of reasonable time frame. If they have > to keep the scheduler that created them alive, that creates a lot of dead > object junk we have to drag around just because a signaled fence exists > somewhere. > > For a Rust abstraction we have to do all that tracking and refcounting in > the abstraction itself to make it safe, which is starting to sound like > reimplementing half of the job tracking drm_sched itself does just to fix > the lifetime issues, which really tells me the existing design is not sound > nor easy to use correctly in general. > > How does Xe deal with this (does it deal with it at all)? What happens when > you kill -9 a process using the GPU? Does freeing all of this wait for all > jobs to complete *and be freed* with free_job()? What about exported > dma_bufs with fences attached from that scheduler? Do you keep the scheduler > alive for those? > > Personally, after running into all this, and after seeing Christian's > reaction to me trying to improve the state of things, I'm starting to doubt > that drm_sched is the right solution at all for firmware-scheduling drivers. 
Bit a wash-up reply on the more fundamental thing here: For the current scheduler the issues you've found are indeed all driver bugs (or most I think at least). Which is why I think we shouldn't just try to shoehorn fundamentally new semantics without updating the driver interfaces (the drm_sched split into the driver interface part and the internal scheduler part). Once we have that, including kerneldoc update and what the rules are, then all the various uaf you've discovered become real bugs and I don't see any issue merging all the fixes. Without that we do have a chicken/egg problem between: "here's a bunch of hacks to make the problems disappear I've hit in my reuse of drm/sched for fw schedulers" vs. "this makes no sense for the current drm/sched interfaces and how current upstream drivers use it" I don't think there's a lot needed in terms of drm/sched driver api rework, but I think it's also pretty clearly not ever going to get anywhere with just nothing at all. Writing an entire new scheduler lib instead of at least trying what minimal semantic changes (instead of just a pile of hacks without even doc changes for the new rules) does not sound like a good idea to me :-) > If you want a workload to try to see if you run into any of these things, > running and killing lots of things in parallel is a good thing to try (mess > with the numbers and let it run for a while to see if you can hit any corner > cases): > > while true; do for i in $(seq 1 10); do timeout -k 0.01 0.05 glxgears & > done; sleep 0.1; done Maybe xe gets away with this due to synchronously killing everything related to a ctx, but yeah I'd expect this to go boom in fun ways. -Daniel
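To make the lifetime workaround discussed in the quoted mail concrete (refcount the scheduler from each job and defer the final drm_sched_fini() to a workqueue, so that free_job() never ends up freeing the scheduler from its own context), a minimal sketch could look as follows. This is illustrative only and not code from the series; the my_* structures and helpers are hypothetical, while drm_sched_fini(), drm_sched_entity_destroy(), drm_sched_job_cleanup() and the kref/workqueue calls are existing kernel APIs.

#include <linux/kref.h>
#include <linux/slab.h>
#include <linux/workqueue.h>
#include <drm/gpu_scheduler.h>

/* Hypothetical driver-side lifetime glue for a one-scheduler-per-queue model. */
struct my_queue {
	struct drm_gpu_scheduler sched;
	struct drm_sched_entity entity;
	struct kref ref;                /* held by every in-flight job */
	struct work_struct fini_work;   /* final teardown runs here */
};

struct my_job {
	struct drm_sched_job base;
	struct my_queue *queue;         /* counted reference to the queue */
};

static void my_queue_fini_work(struct work_struct *w)
{
	struct my_queue *q = container_of(w, struct my_queue, fini_work);

	/* Safe here: we are not running in the scheduler's own context. */
	drm_sched_entity_destroy(&q->entity);
	drm_sched_fini(&q->sched);
	kfree(q);
}

static void my_queue_release(struct kref *ref)
{
	struct my_queue *q = container_of(ref, struct my_queue, ref);

	/*
	 * Never tear the scheduler down from free_job() / run_job() context,
	 * that would deadlock; punt the real teardown to a workqueue.
	 */
	INIT_WORK(&q->fini_work, my_queue_fini_work);
	schedule_work(&q->fini_work);
}

/* ->free_job() callback: drop the job's reference on the queue last. */
static void my_job_free(struct drm_sched_job *sched_job)
{
	struct my_job *job = container_of(sched_job, struct my_job, base);
	struct my_queue *q = job->queue;

	drm_sched_job_cleanup(sched_job);
	kfree(job);
	kref_put(&q->ref, my_queue_release);
}

Submission takes a kref_get(&q->ref) per job and userspace handles hold their own reference, so whichever side drops the last one triggers the deferred teardown instead of freeing the scheduler inline.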
On Tue, Apr 11, 2023 at 11:02:55AM +0200, Christian König wrote: > The point is that this not only requires some work in the drm_scheduler, but > rather it then makes only little sense to use the drm_scheduler in the first > place. > > The whole point of the drm_scheduler is to provide dma_fence implementation > for the submitted jobs. > > We also have dependency handling, but as Daniel and I said this can be > easily extracted into a separate object/component. Uh that's not what I meant. My take is that minimally patching drm/sched to make the out-fence either optional, or complete it right away, is the simplest way to get at the dependency handling. For me at least the major part of drm/sched is the dep handling and timeout stuff. And the later can be reused with some glue to handle preempt timeouts too and other things, since tdr is a work struct you can just issue any other gpu timeouts on the same workqueue and using the roughly same pattern as the ->timed_out hook and it'll just work. The entire "oh we also make sure your hw fence doesn't leak into public fences and causes lifetime mayhem" seems pretty minor. And maybe also something we want to replicate for the preempt-ctx dma_fence that some long-running context need (but more as part of drm_sched_entity I guess). We can of course bikeshed how much flexibility really should be in the different parts of drm/sched, but imo that's a bikeshed. -Daniel > > Regards, > Christian. > > Am 07.04.23 um 02:20 schrieb Zeng, Oak: > > So this series basically go with option 2. The part that option2 makes me uncomfortable is, dma-fence doesn't work for long running workload, why we generate it in the first place? As long as dma-fence is generated, it will become a source of confusion in the future. It doesn't matter how much you annotate it/document it. So if we decide to go with option2, the bottom line is, don't generate dma-fence for long running workload during job submission. This requires some rework in drm scheduler. > > > > The cleanest solution to me is option3. Dma-fence is a very old technology. When it was created, no gpu support page fault. Obviously this is not a good technology for modern gpu with page fault support. I think the best way is to create a new scheduler and dependency tracking mechanism works for both page fault enabled and page fault disabled context. I think this matches what Christian said below. Maybe nobody think this is easy? > > > > Thanks, > > Oak > > > > > -----Original Message----- > > > From: Brost, Matthew <matthew.brost@intel.com> > > > Sent: April 5, 2023 2:53 PM > > > To: Zeng, Oak <oak.zeng@intel.com> > > > Cc: Christian König <christian.koenig@amd.com>; Vetter, Daniel > > > <daniel.vetter@intel.com>; Thomas Hellström > > > <thomas.hellstrom@linux.intel.com>; dri-devel@lists.freedesktop.org; intel- > > > xe@lists.freedesktop.org; robdclark@chromium.org; airlied@linux.ie; > > > lina@asahilina.net; boris.brezillon@collabora.com; faith.ekstrand@collabora.com > > > Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload > > > plans > > > > > > On Wed, Apr 05, 2023 at 12:06:53PM -0600, Zeng, Oak wrote: > > > > Hi, > > > > > > > > Using dma-fence for completion/dependency tracking for long-run > > > workload(more precisely on-demand paging/page fault enabled workload) can > > > cause deadlock. This seems the significant issue here. Other issues such as the > > > drm scheduler completion order implication etc are minors which can be solve > > > inside the framework of drm scheduler. 
We need to evaluate below paths: > > > > 1) still use drm scheduler for job submission, and use dma-fence for job > > > completion waiting/dependency tracking. This is solution proposed in this series. > > > Annotate dma-fence for long-run workload: user can still wait dma-fence for job > > > completion but can't wait dma-fence while holding any memory management > > > locks. We still use dma-fence for dependency tracking. But it is just very easily > > > run into deadlock when on-demand paging is in the picture. The annotation helps > > > us to detect deadlock but not solve deadlock problems. Seems *not* a complete > > > solution: It is almost impossible to completely avoid dependency deadlock in > > > complex runtime environment > > > No one can wait on LR fence, so it is impossible to deadlock. The > > > annotations enforce this. Literally this is only for flow controling the > > > ring / hold pending jobs in in the DRM schedule list. > > > > > > > 2) Still use drm scheduler but not use dma-fence for completion signaling > > > and dependency tracking. This way we still get some free functions (reset, err > > > handling ring flow control as Matt said)from drm scheduler, but push the > > > dependency/completion tracking completely to user space using techniques such > > > as user space fence. User space doesn't have chance to wait fence while holding > > > a kernel memory management lock, thus the dma-fence deadlock issue is solved. > > > We use user space fence for syncs. > > > > > > > 3) Completely discard drm scheduler and dma-fence for long-run > > > workload. Use user queue/doorbell for super fast submission, directly interact > > > with fw scheduler. Use user fence for completion/dependency tracking. > > > This is a hard no from me, I want 1 submission path in Xe. Either we use > > > the DRM scheduler or we don't. > > > > > > Matt > > > > > > > Thanks, > > > > Oak > > > > > > > > > -----Original Message----- > > > > > From: Christian König <christian.koenig@amd.com> > > > > > Sent: April 5, 2023 3:30 AM > > > > > To: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak > > > > > <oak.zeng@intel.com> > > > > > Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; > > > > > robdclark@chromium.org; thomas.hellstrom@linux.intel.com; > > > airlied@linux.ie; > > > > > lina@asahilina.net; boris.brezillon@collabora.com; > > > faith.ekstrand@collabora.com > > > > > Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload > > > > > plans > > > > > > > > > > Am 04.04.23 um 20:08 schrieb Matthew Brost: > > > > > > On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote: > > > > > > > Hi Matt, Thomas, > > > > > > > > > > > > > > Some very bold out of box thinking in this area: > > > > > > > > > > > > > > 1. so you want to use drm scheduler and dma-fence for long running > > > workload. > > > > > Why you want to do this in the first place? What is the benefit? Drm scheduler > > > is > > > > > pretty much a software scheduler. Modern gpu has scheduler built at fw/hw > > > > > level, as you said below for intel this is Guc. Can xe driver just directly submit > > > job > > > > > to Guc, bypassing drm scheduler? > > > > > > If we did that now we have 2 paths for dependency track, flow controling > > > > > > the ring, resets / error handling / backend submission implementations. > > > > > > We don't want this. > > > > > Well exactly that's the point: Why? 
> > > > > > > > > > As far as I can see that are two completely distinct use cases, so you > > > > > absolutely do want two completely distinct implementations for this. > > > > > > > > > > > > 2. using dma-fence for long run workload: I am well aware that page fault > > > (and > > > > > the consequent memory allocation/lock acquiring to fix the fault) can cause > > > > > deadlock for a dma-fence wait. But I am not convinced that dma-fence can't > > > be > > > > > used purely because the nature of the workload that it runs very long > > > (indefinite). > > > > > I did a math: the dma_fence_wait_timeout function's third param is the > > > timeout > > > > > which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not > > > long > > > > > enough, can we just change the timeout parameter to signed 64 bits so it is > > > much > > > > > longer than our life time... > > > > > > > So I mainly argue we can't use dma-fence for long-run workload is not > > > > > because the workload runs very long, rather because of the fact that we use > > > > > page fault for long-run workload. If we enable page fault for short-run > > > workload, > > > > > we can't use dma-fence either. Page fault is the key thing here. > > > > > > > Now since we use page fault which is *fundamentally* controversial with > > > > > dma-fence design, why now just introduce a independent concept such as > > > user- > > > > > fence instead of extending existing dma-fence? > > > > > > > I like unified design. If drm scheduler, dma-fence can be extended to work > > > for > > > > > everything, it is beautiful. But seems we have some fundamental problem > > > here. > > > > > > Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use > > > > > > the signal / CB infrastructure) and enforce we don't use use these > > > > > > dma-fences from the scheduler in memory reclaim paths or export these to > > > > > > user space or other drivers. Think of this mode as SW only fence. > > > > > Yeah and I truly think this is an really bad idea. > > > > > > > > > > The signal/CB infrastructure in the dma_fence turned out to be the > > > > > absolutely nightmare I initially predicted. Sorry to say that, but in > > > > > this case the "I've told you so" is appropriate in my opinion. > > > > > > > > > > If we need infrastructure for long running dependency tracking we should > > > > > encapsulate that in a new framework and not try to mangle the existing > > > > > code for something it was never intended for. > > > > > > > > > > Christian. 
> > > > > > > > > > > Matt > > > > > > > > > > > > > Thanks, > > > > > > > Oak > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of > > > > > > > > Matthew Brost > > > > > > > > Sent: April 3, 2023 8:22 PM > > > > > > > > To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org > > > > > > > > Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com; > > > > > airlied@linux.ie; > > > > > > > > lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew > > > > > > > > <matthew.brost@intel.com>; christian.koenig@amd.com; > > > > > > > > faith.ekstrand@collabora.com > > > > > > > > Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload > > > > > plans > > > > > > > > Hello, > > > > > > > > > > > > > > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we > > > > > > > > have been asked to merge our common DRM scheduler patches first as > > > well > > > > > > > > as develop a common solution for long running workloads with the DRM > > > > > > > > scheduler. This RFC series is our first attempt at doing this. We > > > > > > > > welcome any and all feedback. > > > > > > > > > > > > > > > > This can we thought of as 4 parts detailed below. > > > > > > > > > > > > > > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and > > > > > > > > entity (patches 1-3) > > > > > > > > > > > > > > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the > > > > > > > > GuC) which is a new paradigm WRT to the DRM scheduler and presents > > > > > > > > severals problems as the DRM was originally designed to schedule jobs > > > on > > > > > > > > hardware queues. The main problem being that DRM scheduler expects > > > the > > > > > > > > submission order of jobs to be the completion order of jobs even across > > > > > > > > multiple entities. This assumption falls apart with a firmware scheduler > > > > > > > > as a firmware scheduler has no concept of jobs and jobs can complete > > > out > > > > > > > > of order. A novel solution for was originally thought of by Faith during > > > > > > > > the initial prototype of Xe, create a 1 to 1 relationship between scheduler > > > > > > > > and entity. I believe the AGX driver [3] is using this approach and > > > > > > > > Boris may use approach as well for the Mali driver [4]. > > > > > > > > > > > > > > > > To support a 1 to 1 relationship we move the main execution function > > > > > > > > from a kthread to a work queue and add a new scheduling mode which > > > > > > > > bypasses code in the DRM which isn't needed in a 1 to 1 relationship. > > > > > > > > The new scheduling mode should unify all drivers usage with a 1 to 1 > > > > > > > > relationship and can be thought of as using scheduler as a dependency / > > > > > > > > infligt job tracker rather than a true scheduler. > > > > > > > > > > > > > > > > - Generic messaging interface for DRM scheduler > > > > > > > > > > > > > > > > Idea is to be able to communicate to the submission backend with in > > > band > > > > > > > > (relative to main execution function) messages. Messages are backend > > > > > > > > defined and flexable enough for any use case. In Xe we use these > > > > > > > > messages to clean up entites, set properties for entites, and suspend / > > > > > > > > resume execution of an entity [5]. 
I suspect other driver can leverage > > > > > > > > this messaging concept too as it a convenient way to avoid races in the > > > > > > > > backend. > > > > > > > > > > > > > > > > - Support for using TDR for all error paths of a scheduler / entity > > > > > > > > > > > > > > > > Fix a few races / bugs, add function to dynamically set the TDR timeout. > > > > > > > > > > > > > > > > - Annotate dma-fences for long running workloads. > > > > > > > > > > > > > > > > The idea here is to use dma-fences only as sync points within the > > > > > > > > scheduler and never export them for long running workloads. By > > > > > > > > annotating these fences as long running we ensure that these dma- > > > fences > > > > > > > > are never used in a way that breaks the dma-fence rules. A benefit of > > > > > > > > thus approach is the scheduler can still safely flow control the > > > > > > > > execution ring buffer via the job limit without breaking the dma-fence > > > > > > > > rules. > > > > > > > > > > > > > > > > Again this a first draft and looking forward to feedback. > > > > > > > > > > > > > > > > Enjoy - Matt > > > > > > > > > > > > > > > > [1] https://gitlab.freedesktop.org/drm/xe/kernel > > > > > > > > [2] https://patchwork.freedesktop.org/series/112188/ > > > > > > > > [3] https://patchwork.freedesktop.org/series/114772/ > > > > > > > > [4] > > > https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1 > > > > > > > > [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe- > > > > > > > > next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031 > > > > > > > > > > > > > > > > Matthew Brost (8): > > > > > > > > drm/sched: Convert drm scheduler to use a work queue rather than > > > > > > > > kthread > > > > > > > > drm/sched: Move schedule policy to scheduler / entity > > > > > > > > drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling > > > policy > > > > > > > > drm/sched: Add generic scheduler message interface > > > > > > > > drm/sched: Start run wq before TDR in drm_sched_start > > > > > > > > drm/sched: Submit job before starting TDR > > > > > > > > drm/sched: Add helper to set TDR timeout > > > > > > > > drm/syncobj: Warn on long running dma-fences > > > > > > > > > > > > > > > > Thomas Hellström (2): > > > > > > > > dma-buf/dma-fence: Introduce long-running completion fences > > > > > > > > drm/sched: Support long-running sched entities > > > > > > > > > > > > > > > > drivers/dma-buf/dma-fence.c | 142 +++++++--- > > > > > > > > drivers/dma-buf/dma-resv.c | 5 + > > > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 14 +- > > > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +- > > > > > > > > drivers/gpu/drm/drm_syncobj.c | 5 +- > > > > > > > > drivers/gpu/drm/etnaviv/etnaviv_sched.c | 5 +- > > > > > > > > drivers/gpu/drm/lima/lima_sched.c | 5 +- > > > > > > > > drivers/gpu/drm/msm/adreno/adreno_device.c | 6 +- > > > > > > > > drivers/gpu/drm/msm/msm_ringbuffer.c | 5 +- > > > > > > > > drivers/gpu/drm/panfrost/panfrost_job.c | 5 +- > > > > > > > > drivers/gpu/drm/scheduler/sched_entity.c | 127 +++++++-- > > > > > > > > drivers/gpu/drm/scheduler/sched_fence.c | 6 +- > > > > > > > > drivers/gpu/drm/scheduler/sched_main.c | 278 +++++++++++++++-- > > > --- > > > > > > > > drivers/gpu/drm/v3d/v3d_sched.c | 25 +- > > > > > > > > include/drm/gpu_scheduler.h | 130 +++++++-- > > > > > > > > include/linux/dma-fence.h | 60 ++++- > > > > > > > > 16 files changed, 649 insertions(+), 184 deletions(-) > > > > > > > > > > > > > > > > -- > > > > > > > > 2.34.1 >
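Daniel's point above about the TDR being just a work item can be illustrated with a small sketch: share one ordered workqueue between drm_sched's timeout handling and a driver-private preempt timeout, and the two handlers are serialized against each other for free. This is a hypothetical example, not code from the series; my_sched_ops and my_force_preemption() are assumed to be defined elsewhere, and the exact drm_sched_init() argument list varies between kernel versions.

#include <linux/workqueue.h>
#include <drm/gpu_scheduler.h>

struct my_gpu {
	struct drm_gpu_scheduler sched;
	struct workqueue_struct *timeout_wq;    /* ordered: one timeout at a time */
	struct delayed_work preempt_work;       /* driver-private preempt timeout */
};

static void my_preempt_timed_out(struct work_struct *w)
{
	struct my_gpu *gpu = container_of(w, struct my_gpu, preempt_work.work);

	/* Serialized against drm_sched's ->timed_out on the same ordered wq. */
	my_force_preemption(gpu);               /* hypothetical hardware poke */
}

static int my_gpu_init(struct my_gpu *gpu, struct device *dev)
{
	gpu->timeout_wq = alloc_ordered_workqueue("my-gpu-timeout", 0);
	if (!gpu->timeout_wq)
		return -ENOMEM;

	INIT_DELAYED_WORK(&gpu->preempt_work, my_preempt_timed_out);

	/* Hand the same ordered wq to drm_sched as its timeout_wq, so the
	 * TDR and our preempt timeout never run concurrently. */
	return drm_sched_init(&gpu->sched, &my_sched_ops, 64, 0,
			      msecs_to_jiffies(500), gpu->timeout_wq,
			      NULL, "my-sched", dev);
}

/* Arm the driver-private timeout when asking the firmware to preempt. */
static void my_request_preemption(struct my_gpu *gpu)
{
	queue_delayed_work(gpu->timeout_wq, &gpu->preempt_work,
			   msecs_to_jiffies(100));
}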
On 11/04/2023 23.07, Daniel Vetter wrote: > On Sat, Apr 08, 2023 at 04:05:20PM +0900, Asahi Lina wrote: >> On 04/04/2023 10.58, Matthew Brost wrote: >>> On Tue, Apr 04, 2023 at 10:07:48AM +0900, Asahi Lina wrote: >>>> Hi, thanks for the Cc! >>>> >>> >>> No problem. >>> >>>> On 04/04/2023 09.22, Matthew Brost wrote: >>>>> Hello, >>>>> >>>>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we >>>>> have been asked to merge our common DRM scheduler patches first as well >>>>> as develop a common solution for long running workloads with the DRM >>>>> scheduler. This RFC series is our first attempt at doing this. We >>>>> welcome any and all feedback. >>>>> >>>>> This can we thought of as 4 parts detailed below. >>>>> >>>>> - DRM scheduler changes for 1 to 1 relationship between scheduler and >>>>> entity (patches 1-3) >>>>> >>>>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the >>>>> GuC) which is a new paradigm WRT to the DRM scheduler and presents >>>>> severals problems as the DRM was originally designed to schedule jobs on >>>>> hardware queues. The main problem being that DRM scheduler expects the >>>>> submission order of jobs to be the completion order of jobs even across >>>>> multiple entities. This assumption falls apart with a firmware scheduler >>>>> as a firmware scheduler has no concept of jobs and jobs can complete out >>>>> of order. A novel solution for was originally thought of by Faith during >>>>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler >>>>> and entity. I believe the AGX driver [3] is using this approach and >>>>> Boris may use approach as well for the Mali driver [4]. >>>>> >>>>> To support a 1 to 1 relationship we move the main execution function >>>>> from a kthread to a work queue and add a new scheduling mode which >>>>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship. >>>>> The new scheduling mode should unify all drivers usage with a 1 to 1 >>>>> relationship and can be thought of as using scheduler as a dependency / >>>>> infligt job tracker rather than a true scheduler. >>>> >>>> Yup, we're in the exact same situation with drm/asahi, so this is very >>>> welcome! We've been using the existing scheduler as-is, but this should help >>>> remove some unneeded complexity in this use case. >>>> >>> >>> That's the idea. >>> >>>> Do you want me to pull in this series into our tree and make sure this all >>>> works out for us? >>>> >>> >>> We tested this in Xe and it definitely works for us but the more testing >>> the better. >>> >> >> I haven't gotten around to testing this series yet, but after more debugging >> of drm_sched issues I want to hear more about how Xe uses the scheduler. >> >> From what I can tell, and from what Christian says, drm_sched has the hidden >> requirement that all job objects outlive the scheduler. I've run into >> several UAF bugs due to this. Not only that, it also currently has the >> requirement that all drm_sched fences outlive the scheduler object. >> >> These requirements are subtle and only manifest as kernel oopses in rare >> corner cases, so it wasn't at all obvious to me that this was somehow a >> fundamental design assumption when I started using it. >> >> As far as I can tell, this design is going to work in 99% of cases for >> global-schedulers-per-GPU models, where those corner cases would have to be >> hit on top of a GPU removal scenario (and GPU remove is... well, not the >> most tested/exercised use case). 
When the scheduler basically lives forever, >> none of this really matters. >> >> But with a one-scheduler-per-queue model, how do you deal with this when the >> queue goes away? So far, without any of the partial bugfixes I have sent so >> far (which Christian objected to): >> >> - If you try to tear down a scheduler with any jobs currently scheduled at >> the hardware, drm_sched will oops when those jobs complete and the hw fences >> signal. >> - If you try to tear down an entity (which should cancel all its pending >> jobs) and then the scheduler it was attached to without actually waiting for >> all the free_job() callbacks to be called on every job that ever existed for >> that entity, you can oops (entity cleanup is asynchronous in some cases like >> killed processes, so it will return before all jobs are freed and then that >> asynchronous process will crash and burn if the scheduler goes away out from >> under its feet). Waiting for job completion fences is not enough for this, >> you have to wait until free_job() has actually been called for all jobs. >> - Even if you actually wait for all jobs to be truly gone and then tear down >> the scheduler, if any scheduler job fences remain alive, that will then oops >> if you try to call the debug functions on them (like cat >> /sys/kernel/debug/dma_buf/bufinfo). >> >> I tried to fix these things, but Christian objected implying it was the >> driver's job to keep a reference from jobs and hw fences to the scheduler. >> But I find that completely broken, because besides the extra memory/resource >> usage keeping the scheduler alive when you're trying to free resources as >> fast as possible when a process goes away, you can't even use normal >> reference counting for that: if you try to drop the last drm_sched reference >> from within a free_job() callback, the whole thing deadlocks since that will >> be running in the scheduler's thread/workqueue context, which can't free >> itself. So now you both reference count the scheduler from jobs and fences, >> and on top of that you need to outsource drm_sched freeing to a workqueue in >> the driver to make sure you don't deadlock. >> >> For job fences this is particularly broken, because those fences can live >> forever signaled and attached to shared buffers and there is no guarantee >> that they will be freed in any kind of reasonable time frame. If they have >> to keep the scheduler that created them alive, that creates a lot of dead >> object junk we have to drag around just because a signaled fence exists >> somewhere. >> >> For a Rust abstraction we have to do all that tracking and refcounting in >> the abstraction itself to make it safe, which is starting to sound like >> reimplementing half of the job tracking drm_sched itself does just to fix >> the lifetime issues, which really tells me the existing design is not sound >> nor easy to use correctly in general. >> >> How does Xe deal with this (does it deal with it at all)? What happens when >> you kill -9 a process using the GPU? Does freeing all of this wait for all >> jobs to complete *and be freed* with free_job()? What about exported >> dma_bufs with fences attached from that scheduler? Do you keep the scheduler >> alive for those? >> >> Personally, after running into all this, and after seeing Christian's >> reaction to me trying to improve the state of things, I'm starting to doubt >> that drm_sched is the right solution at all for firmware-scheduling drivers. 
> > Bit a wash-up reply on the more fundamental thing here: > > For the current scheduler the issues you've found are indeed all driver > bugs (or most I think at least). Even the last one with the fences? I can't see how that could be implemented correctly today by any driver, short of having the driver live until any buffers it has touched and installed a fence into go away, which doesn't sound right, since that would block cleanup (and module unloading) possibly forever, and that itself sounds like a bug... This is why I'm a bit disappointed here, because even that one got me a "you're doing it wrong" response from Christian... but if scheduler fences are supposed to be outlived by the driver and its fences, what is even the point of having separate fences? > > Which is why I think we shouldn't just try to shoehorn fundamentally new > semantics without updating the driver interfaces (the drm_sched split into > the driver interface part and the internal scheduler part). Once we have > that, including kerneldoc update and what the rules are, then all the > various uaf you've discovered become real bugs and I don't see any issue > merging all the fixes. > > Without that we do have a chicken/egg problem between: > > "here's a bunch of hacks to make the problems disappear I've hit in my > reuse of drm/sched for fw schedulers" > > vs. > > "this makes no sense for the current drm/sched interfaces and how current > upstream drivers use it" > > I don't think there's a lot needed in terms of drm/sched driver api > rework, but I think it's also pretty clearly not ever going to get > anywhere with just nothing at all. Writing an entire new scheduler lib > instead of at least trying what minimal semantic changes (instead of just > a pile of hacks without even doc changes for the new rules) does not sound > like a good idea to me :-) I wish I knew what the old rules were, since they're still not documented... It's frustrating following what few rules are written down, running into a bug, writing a patch to fix it, and being told "no, you're just not following the unwritten rules"... several times now. > >> If you want a workload to try to see if you run into any of these things, >> running and killing lots of things in parallel is a good thing to try (mess >> with the numbers and let it run for a while to see if you can hit any corner >> cases): >> >> while true; do for i in $(seq 1 10); do timeout -k 0.01 0.05 glxgears & >> done; sleep 0.1; done > > Maybe xe gets away with this due to synchronously killing everything > related to a ctx, but yeah I'd expect this to go boom in fun ways. It'd have to explicitly refcount all the jobs and block killing the ctx until all jobs are freed (not just signaled) for that not to go boom right now, but even then you'd still have the issue with dangling fences in buffers making `cat /sys/kernel/debug/dma_buf/bufinfo` oops... and you can't synchronously kill those as far as I know. ~~ Lina
On Wed, Apr 12, 2023 at 02:47:52PM +0900, Asahi Lina wrote: > On 11/04/2023 23.07, Daniel Vetter wrote: > > On Sat, Apr 08, 2023 at 04:05:20PM +0900, Asahi Lina wrote: > > > On 04/04/2023 10.58, Matthew Brost wrote: > > > > On Tue, Apr 04, 2023 at 10:07:48AM +0900, Asahi Lina wrote: > > > > > Hi, thanks for the Cc! > > > > > > > > > > > > > No problem. > > > > > > > > > On 04/04/2023 09.22, Matthew Brost wrote: > > > > > > Hello, > > > > > > > > > > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we > > > > > > have been asked to merge our common DRM scheduler patches first as well > > > > > > as develop a common solution for long running workloads with the DRM > > > > > > scheduler. This RFC series is our first attempt at doing this. We > > > > > > welcome any and all feedback. > > > > > > > > > > > > This can we thought of as 4 parts detailed below. > > > > > > > > > > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and > > > > > > entity (patches 1-3) > > > > > > > > > > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the > > > > > > GuC) which is a new paradigm WRT to the DRM scheduler and presents > > > > > > severals problems as the DRM was originally designed to schedule jobs on > > > > > > hardware queues. The main problem being that DRM scheduler expects the > > > > > > submission order of jobs to be the completion order of jobs even across > > > > > > multiple entities. This assumption falls apart with a firmware scheduler > > > > > > as a firmware scheduler has no concept of jobs and jobs can complete out > > > > > > of order. A novel solution for was originally thought of by Faith during > > > > > > the initial prototype of Xe, create a 1 to 1 relationship between scheduler > > > > > > and entity. I believe the AGX driver [3] is using this approach and > > > > > > Boris may use approach as well for the Mali driver [4]. > > > > > > > > > > > > To support a 1 to 1 relationship we move the main execution function > > > > > > from a kthread to a work queue and add a new scheduling mode which > > > > > > bypasses code in the DRM which isn't needed in a 1 to 1 relationship. > > > > > > The new scheduling mode should unify all drivers usage with a 1 to 1 > > > > > > relationship and can be thought of as using scheduler as a dependency / > > > > > > infligt job tracker rather than a true scheduler. > > > > > > > > > > Yup, we're in the exact same situation with drm/asahi, so this is very > > > > > welcome! We've been using the existing scheduler as-is, but this should help > > > > > remove some unneeded complexity in this use case. > > > > > > > > > > > > > That's the idea. > > > > > > > > > Do you want me to pull in this series into our tree and make sure this all > > > > > works out for us? > > > > > > > > > > > > > We tested this in Xe and it definitely works for us but the more testing > > > > the better. > > > > > > > > > > I haven't gotten around to testing this series yet, but after more debugging > > > of drm_sched issues I want to hear more about how Xe uses the scheduler. > > > > > > From what I can tell, and from what Christian says, drm_sched has the hidden > > > requirement that all job objects outlive the scheduler. I've run into > > > several UAF bugs due to this. Not only that, it also currently has the > > > requirement that all drm_sched fences outlive the scheduler object. 
> > > > > > These requirements are subtle and only manifest as kernel oopses in rare > > > corner cases, so it wasn't at all obvious to me that this was somehow a > > > fundamental design assumption when I started using it. > > > > > > As far as I can tell, this design is going to work in 99% of cases for > > > global-schedulers-per-GPU models, where those corner cases would have to be > > > hit on top of a GPU removal scenario (and GPU remove is... well, not the > > > most tested/exercised use case). When the scheduler basically lives forever, > > > none of this really matters. > > > > > > But with a one-scheduler-per-queue model, how do you deal with this when the > > > queue goes away? So far, without any of the partial bugfixes I have sent so > > > far (which Christian objected to): > > > > > > - If you try to tear down a scheduler with any jobs currently scheduled at > > > the hardware, drm_sched will oops when those jobs complete and the hw fences > > > signal. > > > - If you try to tear down an entity (which should cancel all its pending > > > jobs) and then the scheduler it was attached to without actually waiting for > > > all the free_job() callbacks to be called on every job that ever existed for > > > that entity, you can oops (entity cleanup is asynchronous in some cases like > > > killed processes, so it will return before all jobs are freed and then that > > > asynchronous process will crash and burn if the scheduler goes away out from > > > under its feet). Waiting for job completion fences is not enough for this, > > > you have to wait until free_job() has actually been called for all jobs. > > > - Even if you actually wait for all jobs to be truly gone and then tear down > > > the scheduler, if any scheduler job fences remain alive, that will then oops > > > if you try to call the debug functions on them (like cat > > > /sys/kernel/debug/dma_buf/bufinfo). > > > > > > I tried to fix these things, but Christian objected implying it was the > > > driver's job to keep a reference from jobs and hw fences to the scheduler. > > > But I find that completely broken, because besides the extra memory/resource > > > usage keeping the scheduler alive when you're trying to free resources as > > > fast as possible when a process goes away, you can't even use normal > > > reference counting for that: if you try to drop the last drm_sched reference > > > from within a free_job() callback, the whole thing deadlocks since that will > > > be running in the scheduler's thread/workqueue context, which can't free > > > itself. So now you both reference count the scheduler from jobs and fences, > > > and on top of that you need to outsource drm_sched freeing to a workqueue in > > > the driver to make sure you don't deadlock. > > > > > > For job fences this is particularly broken, because those fences can live > > > forever signaled and attached to shared buffers and there is no guarantee > > > that they will be freed in any kind of reasonable time frame. If they have > > > to keep the scheduler that created them alive, that creates a lot of dead > > > object junk we have to drag around just because a signaled fence exists > > > somewhere. 
> > > > > > For a Rust abstraction we have to do all that tracking and refcounting in > > > the abstraction itself to make it safe, which is starting to sound like > > > reimplementing half of the job tracking drm_sched itself does just to fix > > > the lifetime issues, which really tells me the existing design is not sound > > > nor easy to use correctly in general. > > > > > > How does Xe deal with this (does it deal with it at all)? What happens when > > > you kill -9 a process using the GPU? Does freeing all of this wait for all > > > jobs to complete *and be freed* with free_job()? What about exported > > > dma_bufs with fences attached from that scheduler? Do you keep the scheduler > > > alive for those? > > > > > > Personally, after running into all this, and after seeing Christian's > > > reaction to me trying to improve the state of things, I'm starting to doubt > > > that drm_sched is the right solution at all for firmware-scheduling drivers. > > > > Bit a wash-up reply on the more fundamental thing here: > > For the current scheduler the issues you've found are indeed all driver > > bugs (or most I think at least). > > Even the last one with the fences? I can't see how that could be implemented > correctly today by any driver, short of having the driver live until any > buffers it has touched and installed a fence into go away, which doesn't > sound right, since that would block cleanup (and module unloading) possibly > forever, and that itself sounds like a bug... > > This is why I'm a bit disappointed here, because even that one got me a > "you're doing it wrong" response from Christian... but if scheduler fences > are supposed to be outlived by the driver and its fences, what is even the > point of having separate fences? Yeah that one sounds like a bug. Not sure what Christian was thinking, but the point of the split between hw fence and public drm_job fence is that the latter is supposed to be free-standing. Otherwise we'd need to refcount the world and have an enourmous space leak. i915-gem tried that, and it's not even close to pretty because you get to untie all kinds of weak pointers in all kinds of really scary ways. Maybe try resend that, but put a patch in front that fixes the kerneldoc to explain this in really clear terms & why it's needed? Then 2nd patches is more a "we fix the driver api contract here". > > Which is why I think we shouldn't just try to shoehorn fundamentally new > > semantics without updating the driver interfaces (the drm_sched split into > > the driver interface part and the internal scheduler part). Once we have > > that, including kerneldoc update and what the rules are, then all the > > various uaf you've discovered become real bugs and I don't see any issue > > merging all the fixes. > > > > Without that we do have a chicken/egg problem between: > > > > "here's a bunch of hacks to make the problems disappear I've hit in my > > reuse of drm/sched for fw schedulers" > > > > vs. > > > > "this makes no sense for the current drm/sched interfaces and how current > > upstream drivers use it" > > > > I don't think there's a lot needed in terms of drm/sched driver api > > rework, but I think it's also pretty clearly not ever going to get > > anywhere with just nothing at all. 
Writing an entire new scheduler lib > > instead of at least trying what minimal semantic changes (instead of just > > a pile of hacks without even doc changes for the new rules) does not sound > > like a good idea to me :-) > > I wish I knew what the old rules were, since they're still not documented... > > It's frustrating following what few rules are written down, running into a > bug, writing a patch to fix it, and being told "no, you're just not > following the unwritten rules"... several times now. Yeah I get that. And I think I've put down a few places in the threads where things derailed that the more constructive approach (instead of random nak and shit like that) is to ask to improve the docs. Because rn they're really not great for the scheduler :-/ > > > If you want a workload to try to see if you run into any of these things, > > > running and killing lots of things in parallel is a good thing to try (mess > > > with the numbers and let it run for a while to see if you can hit any corner > > > cases): > > > > > > while true; do for i in $(seq 1 10); do timeout -k 0.01 0.05 glxgears & > > > done; sleep 0.1; done > > > > Maybe xe gets away with this due to synchronously killing everything > > related to a ctx, but yeah I'd expect this to go boom in fun ways. > > It'd have to explicitly refcount all the jobs and block killing the ctx > until all jobs are freed (not just signaled) for that not to go boom right > now, but even then you'd still have the issue with dangling fences in > buffers making `cat /sys/kernel/debug/dma_buf/bufinfo` oops... and you can't > synchronously kill those as far as I know. Yeah that sounds like a full on bug. Doc patch to sharpen the semantics we want + fix should get this going - or I'll jump in and make it going :-) For the bigger issue I guess we might just land on "drm/sched needs to refcount a lot more for fw scheduler", which is again why I think some doc patches + minimal driver api rework might be best. Sometimes just submitting code isn't the best way to communicate. -Daniel
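To illustrate the hw fence / scheduler fence split Daniel describes: in a driver's submission path only the free-standing scheduler "finished" fence is handed out to dma_resv objects, syncobjs or userspace, while the raw hardware fence is only ever returned to drm_sched from ->run_job(). A rough sketch, reusing the hypothetical my_queue / my_job structures from the earlier sketch and assuming the dma_resv is already locked with a fence slot reserved; my_hw_submit() is made up, the rest is existing drm_sched / dma_resv API.

#include <linux/err.h>
#include <linux/dma-fence.h>
#include <linux/dma-resv.h>
#include <drm/gpu_scheduler.h>

static struct dma_fence *my_submit(struct my_queue *q, struct my_job *job,
				   struct dma_resv *resv)
{
	struct dma_fence *out_fence;
	int ret;

	ret = drm_sched_job_init(&job->base, &q->entity, q);
	if (ret)
		return ERR_PTR(ret);

	drm_sched_job_arm(&job->base);

	job->queue = q;
	kref_get(&q->ref);      /* dropped in my_job_free() */

	/* Public fence: free-standing, may outlive job, hw fence and queue. */
	out_fence = dma_fence_get(&job->base.s_fence->finished);

	/* Only the scheduler fence ever lands in shared objects. */
	dma_resv_add_fence(resv, out_fence, DMA_RESV_USAGE_WRITE);

	drm_sched_entity_push_job(&job->base);

	return out_fence;
}

/* ->run_job(): the only place the raw hw fence is visible to drm_sched. */
static struct dma_fence *my_run_job(struct drm_sched_job *sched_job)
{
	struct my_job *job = container_of(sched_job, struct my_job, base);

	return my_hw_submit(job);       /* hypothetical, returns the hw fence */
}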
On Sat, Apr 08, 2023 at 04:05:20PM +0900, Asahi Lina wrote: > On 04/04/2023 10.58, Matthew Brost wrote: > > On Tue, Apr 04, 2023 at 10:07:48AM +0900, Asahi Lina wrote: > > > Hi, thanks for the Cc! > > > > > > > No problem. > > > > > On 04/04/2023 09.22, Matthew Brost wrote: > > > > Hello, > > > > > > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we > > > > have been asked to merge our common DRM scheduler patches first as well > > > > as develop a common solution for long running workloads with the DRM > > > > scheduler. This RFC series is our first attempt at doing this. We > > > > welcome any and all feedback. > > > > > > > > This can we thought of as 4 parts detailed below. > > > > > > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and > > > > entity (patches 1-3) > > > > > > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the > > > > GuC) which is a new paradigm WRT to the DRM scheduler and presents > > > > severals problems as the DRM was originally designed to schedule jobs on > > > > hardware queues. The main problem being that DRM scheduler expects the > > > > submission order of jobs to be the completion order of jobs even across > > > > multiple entities. This assumption falls apart with a firmware scheduler > > > > as a firmware scheduler has no concept of jobs and jobs can complete out > > > > of order. A novel solution for was originally thought of by Faith during > > > > the initial prototype of Xe, create a 1 to 1 relationship between scheduler > > > > and entity. I believe the AGX driver [3] is using this approach and > > > > Boris may use approach as well for the Mali driver [4]. > > > > > > > > To support a 1 to 1 relationship we move the main execution function > > > > from a kthread to a work queue and add a new scheduling mode which > > > > bypasses code in the DRM which isn't needed in a 1 to 1 relationship. > > > > The new scheduling mode should unify all drivers usage with a 1 to 1 > > > > relationship and can be thought of as using scheduler as a dependency / > > > > infligt job tracker rather than a true scheduler. > > > > > > Yup, we're in the exact same situation with drm/asahi, so this is very > > > welcome! We've been using the existing scheduler as-is, but this should help > > > remove some unneeded complexity in this use case. > > > > > > > That's the idea. > > > > > Do you want me to pull in this series into our tree and make sure this all > > > works out for us? > > > > > > > We tested this in Xe and it definitely works for us but the more testing > > the better. > > > > I haven't gotten around to testing this series yet, but after more debugging > of drm_sched issues I want to hear more about how Xe uses the scheduler. > > From what I can tell, and from what Christian says, drm_sched has the hidden > requirement that all job objects outlive the scheduler. I've run into > several UAF bugs due to this. Not only that, it also currently has the > requirement that all drm_sched fences outlive the scheduler object. > > These requirements are subtle and only manifest as kernel oopses in rare > corner cases, so it wasn't at all obvious to me that this was somehow a > fundamental design assumption when I started using it. > > As far as I can tell, this design is going to work in 99% of cases for > global-schedulers-per-GPU models, where those corner cases would have to be > hit on top of a GPU removal scenario (and GPU remove is... well, not the > most tested/exercised use case). 
When the scheduler basically lives forever, > none of this really matters. > > But with a one-scheduler-per-queue model, how do you deal with this when the > queue goes away? So far, without any of the partial bugfixes I have sent so > far (which Christian objected to): > > - If you try to tear down a scheduler with any jobs currently scheduled at > the hardware, drm_sched will oops when those jobs complete and the hw fences > signal. > - If you try to tear down an entity (which should cancel all its pending > jobs) and then the scheduler it was attached to without actually waiting for > all the free_job() callbacks to be called on every job that ever existed for > that entity, you can oops (entity cleanup is asynchronous in some cases like > killed processes, so it will return before all jobs are freed and then that > asynchronous process will crash and burn if the scheduler goes away out from > under its feet). Waiting for job completion fences is not enough for this, > you have to wait until free_job() has actually been called for all jobs. > - Even if you actually wait for all jobs to be truly gone and then tear down > the scheduler, if any scheduler job fences remain alive, that will then oops > if you try to call the debug functions on them (like cat > /sys/kernel/debug/dma_buf/bufinfo). > > I tried to fix these things, but Christian objected implying it was the > driver's job to keep a reference from jobs and hw fences to the scheduler. > But I find that completely broken, because besides the extra memory/resource > usage keeping the scheduler alive when you're trying to free resources as > fast as possible when a process goes away, you can't even use normal > reference counting for that: if you try to drop the last drm_sched reference > from within a free_job() callback, the whole thing deadlocks since that will > be running in the scheduler's thread/workqueue context, which can't free > itself. So now you both reference count the scheduler from jobs and fences, > and on top of that you need to outsource drm_sched freeing to a workqueue in > the driver to make sure you don't deadlock. > This is what Xe does: jobs reference the scheduler / entity (xe_engine), and when the reference count of an xe_engine goes to zero we trigger the teardown process (a ping / pong with the firmware) via a CLEANUP message; when teardown is done, the last step of killing the scheduler is indeed done by an async worker, as you suggest. To kill a queue, we just kick the TDR, which in turn kills any outstanding job, resulting in the xe_engine ref count (at least the references held by the jobs) going to zero. If a user holds a ref to the dma-fence of a job, then yes, the scheduler isn't going to be freed (it can be killed before that, as described above). This all seems to work just fine for Xe. > For job fences this is particularly broken, because those fences can live > forever signaled and attached to shared buffers and there is no guarantee > that they will be freed in any kind of reasonable time frame. If they have > to keep the scheduler that created them alive, that creates a lot of dead > object junk we have to drag around just because a signaled fence exists > somewhere. > > For a Rust abstraction we have to do all that tracking and refcounting in > the abstraction itself to make it safe, which is starting to sound like > reimplementing half of the job tracking drm_sched itself does just to fix > the lifetime issues, which really tells me the existing design is not sound > nor easy to use correctly in general.
> > How does Xe deal with this (does it deal with it at all)? What happens when > you kill -9 a process using the GPU? Does freeing all of this wait for all > jobs to complete *and be freed* with free_job()? What about exported > dma_bufs with fences attached from that scheduler? Do you keep the scheduler > alive for those? kill -9 would trigger the killing of the queue described above. Yes, if fences are exported, the scheduler might hold onto some firmware resources for a bit. > > Personally, after running into all this, and after seeing Christian's > reaction to me trying to improve the state of things, I'm starting to doubt > that drm_sched is the right solution at all for firmware-scheduling drivers. > > If you want a workload to try to see if you run into any of these things, > running and killing lots of things in parallel is a good thing to try (mess > with the numbers and let it run for a while to see if you can hit any corner > cases): > > while true; do for i in $(seq 1 10); do timeout -k 0.01 0.05 glxgears & > done; sleep 0.1; done > I tested this and it works in Xe. Feel free to ping me on IRC if you want to chat more about this. Matt > ~~ Lina >
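To make the teardown flow Matt describes above more concrete, here is a rough sketch of the pattern: jobs hold a reference on the per-queue engine object, and the final reference drop only schedules drm_sched_fini()/free on a driver workqueue, so the scheduler is never freed from its own execution context. All names here (sketch_engine, sketch_engine_release, and friends) are invented for illustration under that assumption; this is not the actual Xe or asahi code.

#include <linux/container_of.h>
#include <linux/kref.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

#include <drm/gpu_scheduler.h>

/* One firmware queue: its own scheduler/entity pair plus a refcount. */
struct sketch_engine {
	struct drm_gpu_scheduler sched;
	struct drm_sched_entity entity;
	struct kref refcount;		/* held by every in-flight job */
	struct work_struct destroy_work;
};

static void sketch_engine_destroy_worker(struct work_struct *w)
{
	struct sketch_engine *e = container_of(w, struct sketch_engine,
					       destroy_work);

	/* Runs on a driver workqueue, never in the scheduler's own context. */
	drm_sched_entity_fini(&e->entity);
	drm_sched_fini(&e->sched);
	kfree(e);
}

static void sketch_engine_release(struct kref *ref)
{
	struct sketch_engine *e = container_of(ref, struct sketch_engine,
					       refcount);

	/*
	 * The final kref_put() may happen from free_job() on the scheduler's
	 * own workqueue, so only *schedule* the teardown here.  (The real
	 * driver would have finished its CLEANUP ping/pong with the firmware
	 * before reaching this point.)
	 */
	queue_work(system_unbound_wq, &e->destroy_work);
}

static void sketch_engine_init_teardown(struct sketch_engine *e)
{
	kref_init(&e->refcount);
	INIT_WORK(&e->destroy_work, sketch_engine_destroy_worker);
}

/* Each job takes a reference when it is armed/submitted ... */
static void sketch_job_armed(struct sketch_engine *e)
{
	kref_get(&e->refcount);
}

/* ... and drops it from the driver's free_job() callback. */
static void sketch_job_freed(struct sketch_engine *e)
{
	kref_put(&e->refcount, sketch_engine_release);
}

The key design point is that the last kref_put() can legitimately run from free_job() on the scheduler's own workqueue, so the release callback never calls drm_sched_fini() directly and only queues the work elsewhere.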
Am 11.04.23 um 16:13 schrieb Daniel Vetter: > On Tue, Apr 11, 2023 at 11:02:55AM +0200, Christian König wrote: >> The point is that this not only requires some work in the drm_scheduler, but >> rather it then makes only little sense to use the drm_scheduler in the first >> place. >> >> The whole point of the drm_scheduler is to provide dma_fence implementation >> for the submitted jobs. >> >> We also have dependency handling, but as Daniel and I said this can be >> easily extracted into a separate object/component. > Uh that's not what I meant. My take is that minimally patching drm/sched > to make the out-fence either optional, or complete it right away, is the > simplest way to get at the dependency handling. For me at least the major > part of drm/sched is the dep handling and timeout stuff. And the later can > be reused with some glue to handle preempt timeouts too and other things, > since tdr is a work struct you can just issue any other gpu timeouts on > the same workqueue and using the roughly same pattern as the ->timed_out > hook and it'll just work. Well, that strongly sounds like what I had in mind as well. Whether we move the dependency/timeout functionality into a new component or move the scheduler fence into a new component doesn't seem to matter; the high-level goal is that we have separated the two functionalities, and both approaches will work for that. > The entire "oh we also make sure your hw fence doesn't leak into public > fences and causes lifetime mayhem" seems pretty minor. And maybe also > something we want to replicate for the preempt-ctx dma_fence that some > long-running context need (but more as part of drm_sched_entity I guess). > > We can of course bikeshed how much flexibility really should be in the > different parts of drm/sched, but imo that's a bikeshed. Well, the dependency handling in a separate component would still be interesting to have, since we need something similar for user queues as well. Christian. > -Daniel > > >> Regards, >> Christian. >> >> Am 07.04.23 um 02:20 schrieb Zeng, Oak: >>> So this series basically go with option 2. The part that option2 makes me uncomfortable is, dma-fence doesn't work for long running workload, why we generate it in the first place? As long as dma-fence is generated, it will become a source of confusion in the future. It doesn't matter how much you annotate it/document it. So if we decide to go with option2, the bottom line is, don't generate dma-fence for long running workload during job submission. This requires some rework in drm scheduler. >>> >>> The cleanest solution to me is option3. Dma-fence is a very old technology. When it was created, no gpu support page fault. Obviously this is not a good technology for modern gpu with page fault support. I think the best way is to create a new scheduler and dependency tracking mechanism works for both page fault enabled and page fault disabled context. I think this matches what Christian said below. Maybe nobody think this is easy? 
>>> >>> Thanks, >>> Oak >>> >>>> -----Original Message----- >>>> From: Brost, Matthew <matthew.brost@intel.com> >>>> Sent: April 5, 2023 2:53 PM >>>> To: Zeng, Oak <oak.zeng@intel.com> >>>> Cc: Christian König <christian.koenig@amd.com>; Vetter, Daniel >>>> <daniel.vetter@intel.com>; Thomas Hellström >>>> <thomas.hellstrom@linux.intel.com>; dri-devel@lists.freedesktop.org; intel- >>>> xe@lists.freedesktop.org; robdclark@chromium.org; airlied@linux.ie; >>>> lina@asahilina.net; boris.brezillon@collabora.com; faith.ekstrand@collabora.com >>>> Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload >>>> plans >>>> >>>> On Wed, Apr 05, 2023 at 12:06:53PM -0600, Zeng, Oak wrote: >>>>> Hi, >>>>> >>>>> Using dma-fence for completion/dependency tracking for long-run >>>> workload(more precisely on-demand paging/page fault enabled workload) can >>>> cause deadlock. This seems the significant issue here. Other issues such as the >>>> drm scheduler completion order implication etc are minors which can be solve >>>> inside the framework of drm scheduler. We need to evaluate below paths: >>>>> 1) still use drm scheduler for job submission, and use dma-fence for job >>>> completion waiting/dependency tracking. This is solution proposed in this series. >>>> Annotate dma-fence for long-run workload: user can still wait dma-fence for job >>>> completion but can't wait dma-fence while holding any memory management >>>> locks. We still use dma-fence for dependency tracking. But it is just very easily >>>> run into deadlock when on-demand paging is in the picture. The annotation helps >>>> us to detect deadlock but not solve deadlock problems. Seems *not* a complete >>>> solution: It is almost impossible to completely avoid dependency deadlock in >>>> complex runtime environment >>>> No one can wait on LR fence, so it is impossible to deadlock. The >>>> annotations enforce this. Literally this is only for flow controling the >>>> ring / hold pending jobs in in the DRM schedule list. >>>> >>>>> 2) Still use drm scheduler but not use dma-fence for completion signaling >>>> and dependency tracking. This way we still get some free functions (reset, err >>>> handling ring flow control as Matt said)from drm scheduler, but push the >>>> dependency/completion tracking completely to user space using techniques such >>>> as user space fence. User space doesn't have chance to wait fence while holding >>>> a kernel memory management lock, thus the dma-fence deadlock issue is solved. >>>> We use user space fence for syncs. >>>> >>>>> 3) Completely discard drm scheduler and dma-fence for long-run >>>> workload. Use user queue/doorbell for super fast submission, directly interact >>>> with fw scheduler. Use user fence for completion/dependency tracking. >>>> This is a hard no from me, I want 1 submission path in Xe. Either we use >>>> the DRM scheduler or we don't. 
>>>> >>>> Matt >>>> >>>>> Thanks, >>>>> Oak >>>>> >>>>>> -----Original Message----- >>>>>> From: Christian König <christian.koenig@amd.com> >>>>>> Sent: April 5, 2023 3:30 AM >>>>>> To: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak >>>>>> <oak.zeng@intel.com> >>>>>> Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; >>>>>> robdclark@chromium.org; thomas.hellstrom@linux.intel.com; >>>> airlied@linux.ie; >>>>>> lina@asahilina.net; boris.brezillon@collabora.com; >>>> faith.ekstrand@collabora.com >>>>>> Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload >>>>>> plans >>>>>> >>>>>> Am 04.04.23 um 20:08 schrieb Matthew Brost: >>>>>>> On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote: >>>>>>>> Hi Matt, Thomas, >>>>>>>> >>>>>>>> Some very bold out of box thinking in this area: >>>>>>>> >>>>>>>> 1. so you want to use drm scheduler and dma-fence for long running >>>> workload. >>>>>> Why you want to do this in the first place? What is the benefit? Drm scheduler >>>> is >>>>>> pretty much a software scheduler. Modern gpu has scheduler built at fw/hw >>>>>> level, as you said below for intel this is Guc. Can xe driver just directly submit >>>> job >>>>>> to Guc, bypassing drm scheduler? >>>>>>> If we did that now we have 2 paths for dependency track, flow controling >>>>>>> the ring, resets / error handling / backend submission implementations. >>>>>>> We don't want this. >>>>>> Well exactly that's the point: Why? >>>>>> >>>>>> As far as I can see that are two completely distinct use cases, so you >>>>>> absolutely do want two completely distinct implementations for this. >>>>>> >>>>>>>> 2. using dma-fence for long run workload: I am well aware that page fault >>>> (and >>>>>> the consequent memory allocation/lock acquiring to fix the fault) can cause >>>>>> deadlock for a dma-fence wait. But I am not convinced that dma-fence can't >>>> be >>>>>> used purely because the nature of the workload that it runs very long >>>> (indefinite). >>>>>> I did a math: the dma_fence_wait_timeout function's third param is the >>>> timeout >>>>>> which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not >>>> long >>>>>> enough, can we just change the timeout parameter to signed 64 bits so it is >>>> much >>>>>> longer than our life time... >>>>>>>> So I mainly argue we can't use dma-fence for long-run workload is not >>>>>> because the workload runs very long, rather because of the fact that we use >>>>>> page fault for long-run workload. If we enable page fault for short-run >>>> workload, >>>>>> we can't use dma-fence either. Page fault is the key thing here. >>>>>>>> Now since we use page fault which is *fundamentally* controversial with >>>>>> dma-fence design, why now just introduce a independent concept such as >>>> user- >>>>>> fence instead of extending existing dma-fence? >>>>>>>> I like unified design. If drm scheduler, dma-fence can be extended to work >>>> for >>>>>> everything, it is beautiful. But seems we have some fundamental problem >>>> here. >>>>>>> Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use >>>>>>> the signal / CB infrastructure) and enforce we don't use use these >>>>>>> dma-fences from the scheduler in memory reclaim paths or export these to >>>>>>> user space or other drivers. Think of this mode as SW only fence. >>>>>> Yeah and I truly think this is an really bad idea. 
>>>>>> >>>>>> The signal/CB infrastructure in the dma_fence turned out to be the >>>>>> absolutely nightmare I initially predicted. Sorry to say that, but in >>>>>> this case the "I've told you so" is appropriate in my opinion. >>>>>> >>>>>> If we need infrastructure for long running dependency tracking we should >>>>>> encapsulate that in a new framework and not try to mangle the existing >>>>>> code for something it was never intended for. >>>>>> >>>>>> Christian. >>>>>> >>>>>>> Matt >>>>>>> >>>>>>>> Thanks, >>>>>>>> Oak >>>>>>>> >>>>>>>>> -----Original Message----- >>>>>>>>> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of >>>>>>>>> Matthew Brost >>>>>>>>> Sent: April 3, 2023 8:22 PM >>>>>>>>> To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org >>>>>>>>> Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com; >>>>>> airlied@linux.ie; >>>>>>>>> lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew >>>>>>>>> <matthew.brost@intel.com>; christian.koenig@amd.com; >>>>>>>>> faith.ekstrand@collabora.com >>>>>>>>> Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload >>>>>> plans >>>>>>>>> Hello, >>>>>>>>> >>>>>>>>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we >>>>>>>>> have been asked to merge our common DRM scheduler patches first as >>>> well >>>>>>>>> as develop a common solution for long running workloads with the DRM >>>>>>>>> scheduler. This RFC series is our first attempt at doing this. We >>>>>>>>> welcome any and all feedback. >>>>>>>>> >>>>>>>>> This can we thought of as 4 parts detailed below. >>>>>>>>> >>>>>>>>> - DRM scheduler changes for 1 to 1 relationship between scheduler and >>>>>>>>> entity (patches 1-3) >>>>>>>>> >>>>>>>>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the >>>>>>>>> GuC) which is a new paradigm WRT to the DRM scheduler and presents >>>>>>>>> severals problems as the DRM was originally designed to schedule jobs >>>> on >>>>>>>>> hardware queues. The main problem being that DRM scheduler expects >>>> the >>>>>>>>> submission order of jobs to be the completion order of jobs even across >>>>>>>>> multiple entities. This assumption falls apart with a firmware scheduler >>>>>>>>> as a firmware scheduler has no concept of jobs and jobs can complete >>>> out >>>>>>>>> of order. A novel solution for was originally thought of by Faith during >>>>>>>>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler >>>>>>>>> and entity. I believe the AGX driver [3] is using this approach and >>>>>>>>> Boris may use approach as well for the Mali driver [4]. >>>>>>>>> >>>>>>>>> To support a 1 to 1 relationship we move the main execution function >>>>>>>>> from a kthread to a work queue and add a new scheduling mode which >>>>>>>>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship. >>>>>>>>> The new scheduling mode should unify all drivers usage with a 1 to 1 >>>>>>>>> relationship and can be thought of as using scheduler as a dependency / >>>>>>>>> infligt job tracker rather than a true scheduler. >>>>>>>>> >>>>>>>>> - Generic messaging interface for DRM scheduler >>>>>>>>> >>>>>>>>> Idea is to be able to communicate to the submission backend with in >>>> band >>>>>>>>> (relative to main execution function) messages. Messages are backend >>>>>>>>> defined and flexable enough for any use case. 
In Xe we use these >>>>>>>>> messages to clean up entites, set properties for entites, and suspend / >>>>>>>>> resume execution of an entity [5]. I suspect other driver can leverage >>>>>>>>> this messaging concept too as it a convenient way to avoid races in the >>>>>>>>> backend. >>>>>>>>> >>>>>>>>> - Support for using TDR for all error paths of a scheduler / entity >>>>>>>>> >>>>>>>>> Fix a few races / bugs, add function to dynamically set the TDR timeout. >>>>>>>>> >>>>>>>>> - Annotate dma-fences for long running workloads. >>>>>>>>> >>>>>>>>> The idea here is to use dma-fences only as sync points within the >>>>>>>>> scheduler and never export them for long running workloads. By >>>>>>>>> annotating these fences as long running we ensure that these dma- >>>> fences >>>>>>>>> are never used in a way that breaks the dma-fence rules. A benefit of >>>>>>>>> thus approach is the scheduler can still safely flow control the >>>>>>>>> execution ring buffer via the job limit without breaking the dma-fence >>>>>>>>> rules. >>>>>>>>> >>>>>>>>> Again this a first draft and looking forward to feedback. >>>>>>>>> >>>>>>>>> Enjoy - Matt >>>>>>>>> >>>>>>>>> [1] https://gitlab.freedesktop.org/drm/xe/kernel >>>>>>>>> [2] https://patchwork.freedesktop.org/series/112188/ >>>>>>>>> [3] https://patchwork.freedesktop.org/series/114772/ >>>>>>>>> [4] >>>> https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1 >>>>>>>>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe- >>>>>>>>> next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031 >>>>>>>>> >>>>>>>>> Matthew Brost (8): >>>>>>>>> drm/sched: Convert drm scheduler to use a work queue rather than >>>>>>>>> kthread >>>>>>>>> drm/sched: Move schedule policy to scheduler / entity >>>>>>>>> drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling >>>> policy >>>>>>>>> drm/sched: Add generic scheduler message interface >>>>>>>>> drm/sched: Start run wq before TDR in drm_sched_start >>>>>>>>> drm/sched: Submit job before starting TDR >>>>>>>>> drm/sched: Add helper to set TDR timeout >>>>>>>>> drm/syncobj: Warn on long running dma-fences >>>>>>>>> >>>>>>>>> Thomas Hellström (2): >>>>>>>>> dma-buf/dma-fence: Introduce long-running completion fences >>>>>>>>> drm/sched: Support long-running sched entities >>>>>>>>> >>>>>>>>> drivers/dma-buf/dma-fence.c | 142 +++++++--- >>>>>>>>> drivers/dma-buf/dma-resv.c | 5 + >>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 14 +- >>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +- >>>>>>>>> drivers/gpu/drm/drm_syncobj.c | 5 +- >>>>>>>>> drivers/gpu/drm/etnaviv/etnaviv_sched.c | 5 +- >>>>>>>>> drivers/gpu/drm/lima/lima_sched.c | 5 +- >>>>>>>>> drivers/gpu/drm/msm/adreno/adreno_device.c | 6 +- >>>>>>>>> drivers/gpu/drm/msm/msm_ringbuffer.c | 5 +- >>>>>>>>> drivers/gpu/drm/panfrost/panfrost_job.c | 5 +- >>>>>>>>> drivers/gpu/drm/scheduler/sched_entity.c | 127 +++++++-- >>>>>>>>> drivers/gpu/drm/scheduler/sched_fence.c | 6 +- >>>>>>>>> drivers/gpu/drm/scheduler/sched_main.c | 278 +++++++++++++++-- >>>> --- >>>>>>>>> drivers/gpu/drm/v3d/v3d_sched.c | 25 +- >>>>>>>>> include/drm/gpu_scheduler.h | 130 +++++++-- >>>>>>>>> include/linux/dma-fence.h | 60 ++++- >>>>>>>>> 16 files changed, 649 insertions(+), 184 deletions(-) >>>>>>>>> >>>>>>>>> -- >>>>>>>>> 2.34.1
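As an aside, the "KMD sync point" use of dma-fences mentioned in the quoted thread above (only the in-kernel signal/callback infrastructure is used; the fence is never exported to user space or other drivers) could be sketched roughly as below. Everything here (lr_sync_point, lr_fence_ops, the callback) is a made-up illustration of that idea, not Thomas's actual patches.

#include <linux/dma-fence.h>
#include <linux/spinlock.h>

static const char *lr_fence_driver_name(struct dma_fence *f)
{
	return "lr-sketch";
}

static const char *lr_fence_timeline_name(struct dma_fence *f)
{
	return "lr-sketch-timeline";
}

static const struct dma_fence_ops lr_fence_ops = {
	.get_driver_name = lr_fence_driver_name,
	.get_timeline_name = lr_fence_timeline_name,
};

/* A driver-private sync point: the fence never leaves the kernel driver. */
struct lr_sync_point {
	struct dma_fence fence;
	spinlock_t lock;
	struct dma_fence_cb cb;
};

/* Driver-internal consumer, kicked when the sync point signals. */
static void lr_sync_point_cb(struct dma_fence *f, struct dma_fence_cb *cb)
{
	/* e.g. resume ring-buffer flow control, release held-back jobs, ... */
}

static void lr_sync_point_init(struct lr_sync_point *sp, u64 context, u64 seqno)
{
	spin_lock_init(&sp->lock);
	dma_fence_init(&sp->fence, &lr_fence_ops, &sp->lock, context, seqno);

	/*
	 * Register our own callback and never install the fence in a
	 * dma_resv, syncobj or sync_file, so nothing outside the driver can
	 * wait on it.  If it has already signalled, run the callback directly.
	 */
	if (dma_fence_add_callback(&sp->fence, &sp->cb, lr_sync_point_cb))
		lr_sync_point_cb(&sp->fence, &sp->cb);
}

/* Completion path: signalling runs lr_sync_point_cb(). */
static void lr_sync_point_complete(struct lr_sync_point *sp)
{
	dma_fence_signal(&sp->fence);
}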
On Mon, Apr 17, 2023 at 08:47:19AM +0200, Christian König wrote: > Am 11.04.23 um 16:13 schrieb Daniel Vetter: > > On Tue, Apr 11, 2023 at 11:02:55AM +0200, Christian König wrote: > > > The point is that this not only requires some work in the drm_scheduler, but > > > rather it then makes only little sense to use the drm_scheduler in the first > > > place. > > > > > > The whole point of the drm_scheduler is to provide dma_fence implementation > > > for the submitted jobs. > > > > > > We also have dependency handling, but as Daniel and I said this can be > > > easily extracted into a separate object/component. > > Uh that's not what I meant. My take is that minimally patching drm/sched > > to make the out-fence either optional, or complete it right away, is the > > simplest way to get at the dependency handling. For me at least the major > > part of drm/sched is the dep handling and timeout stuff. And the later can > > be reused with some glue to handle preempt timeouts too and other things, > > since tdr is a work struct you can just issue any other gpu timeouts on > > the same workqueue and using the roughly same pattern as the ->timed_out > > hook and it'll just work. > > Well that strongly sounds like what I had in mind as well. > > If we move the dependency/timeout functionality into a new component or if > we move the scheduler fence into a new component doesn't seem to matter, the > high level goal is that we have separated the two functionalities and both > approach will work for that. Ah ok, I guess I just got confused about your wording then. > > The entire "oh we also make sure your hw fence doesn't leak into public > > fences and causes lifetime mayhem" seems pretty minor. And maybe also > > something we want to replicate for the preempt-ctx dma_fence that some > > long-running context need (but more as part of drm_sched_entity I guess). > > > > We can of course bikeshed how much flexibility really should be in the > > different parts of drm/sched, but imo that's a bikeshed. > > Well the dependency handling in a separate component would still be > interesting to have since we need something similar for user queues as well. Yeah it might be neater to refactor, but I think that part is optional at least near-term. There's always room to polish shared code, and often it's better to do that once you have at least some in-tree users for the new need :-) -Daniel > > Christian. > > > -Daniel > > > > > > > Regards, > > > Christian. > > > > > > Am 07.04.23 um 02:20 schrieb Zeng, Oak: > > > > So this series basically go with option 2. The part that option2 makes me uncomfortable is, dma-fence doesn't work for long running workload, why we generate it in the first place? As long as dma-fence is generated, it will become a source of confusion in the future. It doesn't matter how much you annotate it/document it. So if we decide to go with option2, the bottom line is, don't generate dma-fence for long running workload during job submission. This requires some rework in drm scheduler. > > > > > > > > The cleanest solution to me is option3. Dma-fence is a very old technology. When it was created, no gpu support page fault. Obviously this is not a good technology for modern gpu with page fault support. I think the best way is to create a new scheduler and dependency tracking mechanism works for both page fault enabled and page fault disabled context. I think this matches what Christian said below. Maybe nobody think this is easy? 
> > > > > > > > Thanks, > > > > Oak > > > > > > > > > -----Original Message----- > > > > > From: Brost, Matthew <matthew.brost@intel.com> > > > > > Sent: April 5, 2023 2:53 PM > > > > > To: Zeng, Oak <oak.zeng@intel.com> > > > > > Cc: Christian König <christian.koenig@amd.com>; Vetter, Daniel > > > > > <daniel.vetter@intel.com>; Thomas Hellström > > > > > <thomas.hellstrom@linux.intel.com>; dri-devel@lists.freedesktop.org; intel- > > > > > xe@lists.freedesktop.org; robdclark@chromium.org; airlied@linux.ie; > > > > > lina@asahilina.net; boris.brezillon@collabora.com; faith.ekstrand@collabora.com > > > > > Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload > > > > > plans > > > > > > > > > > On Wed, Apr 05, 2023 at 12:06:53PM -0600, Zeng, Oak wrote: > > > > > > Hi, > > > > > > > > > > > > Using dma-fence for completion/dependency tracking for long-run > > > > > workload(more precisely on-demand paging/page fault enabled workload) can > > > > > cause deadlock. This seems the significant issue here. Other issues such as the > > > > > drm scheduler completion order implication etc are minors which can be solve > > > > > inside the framework of drm scheduler. We need to evaluate below paths: > > > > > > 1) still use drm scheduler for job submission, and use dma-fence for job > > > > > completion waiting/dependency tracking. This is solution proposed in this series. > > > > > Annotate dma-fence for long-run workload: user can still wait dma-fence for job > > > > > completion but can't wait dma-fence while holding any memory management > > > > > locks. We still use dma-fence for dependency tracking. But it is just very easily > > > > > run into deadlock when on-demand paging is in the picture. The annotation helps > > > > > us to detect deadlock but not solve deadlock problems. Seems *not* a complete > > > > > solution: It is almost impossible to completely avoid dependency deadlock in > > > > > complex runtime environment > > > > > No one can wait on LR fence, so it is impossible to deadlock. The > > > > > annotations enforce this. Literally this is only for flow controling the > > > > > ring / hold pending jobs in in the DRM schedule list. > > > > > > > > > > > 2) Still use drm scheduler but not use dma-fence for completion signaling > > > > > and dependency tracking. This way we still get some free functions (reset, err > > > > > handling ring flow control as Matt said)from drm scheduler, but push the > > > > > dependency/completion tracking completely to user space using techniques such > > > > > as user space fence. User space doesn't have chance to wait fence while holding > > > > > a kernel memory management lock, thus the dma-fence deadlock issue is solved. > > > > > We use user space fence for syncs. > > > > > > > > > > > 3) Completely discard drm scheduler and dma-fence for long-run > > > > > workload. Use user queue/doorbell for super fast submission, directly interact > > > > > with fw scheduler. Use user fence for completion/dependency tracking. > > > > > This is a hard no from me, I want 1 submission path in Xe. Either we use > > > > > the DRM scheduler or we don't. 
> > > > > > > > > > Matt > > > > > > > > > > > Thanks, > > > > > > Oak > > > > > > > > > > > > > -----Original Message----- > > > > > > > From: Christian König <christian.koenig@amd.com> > > > > > > > Sent: April 5, 2023 3:30 AM > > > > > > > To: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak > > > > > > > <oak.zeng@intel.com> > > > > > > > Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org; > > > > > > > robdclark@chromium.org; thomas.hellstrom@linux.intel.com; > > > > > airlied@linux.ie; > > > > > > > lina@asahilina.net; boris.brezillon@collabora.com; > > > > > faith.ekstrand@collabora.com > > > > > > > Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload > > > > > > > plans > > > > > > > > > > > > > > Am 04.04.23 um 20:08 schrieb Matthew Brost: > > > > > > > > On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote: > > > > > > > > > Hi Matt, Thomas, > > > > > > > > > > > > > > > > > > Some very bold out of box thinking in this area: > > > > > > > > > > > > > > > > > > 1. so you want to use drm scheduler and dma-fence for long running > > > > > workload. > > > > > > > Why you want to do this in the first place? What is the benefit? Drm scheduler > > > > > is > > > > > > > pretty much a software scheduler. Modern gpu has scheduler built at fw/hw > > > > > > > level, as you said below for intel this is Guc. Can xe driver just directly submit > > > > > job > > > > > > > to Guc, bypassing drm scheduler? > > > > > > > > If we did that now we have 2 paths for dependency track, flow controling > > > > > > > > the ring, resets / error handling / backend submission implementations. > > > > > > > > We don't want this. > > > > > > > Well exactly that's the point: Why? > > > > > > > > > > > > > > As far as I can see that are two completely distinct use cases, so you > > > > > > > absolutely do want two completely distinct implementations for this. > > > > > > > > > > > > > > > > 2. using dma-fence for long run workload: I am well aware that page fault > > > > > (and > > > > > > > the consequent memory allocation/lock acquiring to fix the fault) can cause > > > > > > > deadlock for a dma-fence wait. But I am not convinced that dma-fence can't > > > > > be > > > > > > > used purely because the nature of the workload that it runs very long > > > > > (indefinite). > > > > > > > I did a math: the dma_fence_wait_timeout function's third param is the > > > > > timeout > > > > > > > which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not > > > > > long > > > > > > > enough, can we just change the timeout parameter to signed 64 bits so it is > > > > > much > > > > > > > longer than our life time... > > > > > > > > > So I mainly argue we can't use dma-fence for long-run workload is not > > > > > > > because the workload runs very long, rather because of the fact that we use > > > > > > > page fault for long-run workload. If we enable page fault for short-run > > > > > workload, > > > > > > > we can't use dma-fence either. Page fault is the key thing here. > > > > > > > > > Now since we use page fault which is *fundamentally* controversial with > > > > > > > dma-fence design, why now just introduce a independent concept such as > > > > > user- > > > > > > > fence instead of extending existing dma-fence? > > > > > > > > > I like unified design. If drm scheduler, dma-fence can be extended to work > > > > > for > > > > > > > everything, it is beautiful. But seems we have some fundamental problem > > > > > here. 
> > > > > > > > Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use > > > > > > > > the signal / CB infrastructure) and enforce we don't use use these > > > > > > > > dma-fences from the scheduler in memory reclaim paths or export these to > > > > > > > > user space or other drivers. Think of this mode as SW only fence. > > > > > > > Yeah and I truly think this is an really bad idea. > > > > > > > > > > > > > > The signal/CB infrastructure in the dma_fence turned out to be the > > > > > > > absolutely nightmare I initially predicted. Sorry to say that, but in > > > > > > > this case the "I've told you so" is appropriate in my opinion. > > > > > > > > > > > > > > If we need infrastructure for long running dependency tracking we should > > > > > > > encapsulate that in a new framework and not try to mangle the existing > > > > > > > code for something it was never intended for. > > > > > > > > > > > > > > Christian. > > > > > > > > > > > > > > > Matt > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > Oak > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > > > From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of > > > > > > > > > > Matthew Brost > > > > > > > > > > Sent: April 3, 2023 8:22 PM > > > > > > > > > > To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org > > > > > > > > > > Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com; > > > > > > > airlied@linux.ie; > > > > > > > > > > lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew > > > > > > > > > > <matthew.brost@intel.com>; christian.koenig@amd.com; > > > > > > > > > > faith.ekstrand@collabora.com > > > > > > > > > > Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload > > > > > > > plans > > > > > > > > > > Hello, > > > > > > > > > > > > > > > > > > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we > > > > > > > > > > have been asked to merge our common DRM scheduler patches first as > > > > > well > > > > > > > > > > as develop a common solution for long running workloads with the DRM > > > > > > > > > > scheduler. This RFC series is our first attempt at doing this. We > > > > > > > > > > welcome any and all feedback. > > > > > > > > > > > > > > > > > > > > This can we thought of as 4 parts detailed below. > > > > > > > > > > > > > > > > > > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and > > > > > > > > > > entity (patches 1-3) > > > > > > > > > > > > > > > > > > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the > > > > > > > > > > GuC) which is a new paradigm WRT to the DRM scheduler and presents > > > > > > > > > > severals problems as the DRM was originally designed to schedule jobs > > > > > on > > > > > > > > > > hardware queues. The main problem being that DRM scheduler expects > > > > > the > > > > > > > > > > submission order of jobs to be the completion order of jobs even across > > > > > > > > > > multiple entities. This assumption falls apart with a firmware scheduler > > > > > > > > > > as a firmware scheduler has no concept of jobs and jobs can complete > > > > > out > > > > > > > > > > of order. A novel solution for was originally thought of by Faith during > > > > > > > > > > the initial prototype of Xe, create a 1 to 1 relationship between scheduler > > > > > > > > > > and entity. I believe the AGX driver [3] is using this approach and > > > > > > > > > > Boris may use approach as well for the Mali driver [4]. 
> > > > > > > > > > > > > > > > > > > > To support a 1 to 1 relationship we move the main execution function > > > > > > > > > > from a kthread to a work queue and add a new scheduling mode which > > > > > > > > > > bypasses code in the DRM which isn't needed in a 1 to 1 relationship. > > > > > > > > > > The new scheduling mode should unify all drivers usage with a 1 to 1 > > > > > > > > > > relationship and can be thought of as using scheduler as a dependency / > > > > > > > > > > infligt job tracker rather than a true scheduler. > > > > > > > > > > > > > > > > > > > > - Generic messaging interface for DRM scheduler > > > > > > > > > > > > > > > > > > > > Idea is to be able to communicate to the submission backend with in > > > > > band > > > > > > > > > > (relative to main execution function) messages. Messages are backend > > > > > > > > > > defined and flexable enough for any use case. In Xe we use these > > > > > > > > > > messages to clean up entites, set properties for entites, and suspend / > > > > > > > > > > resume execution of an entity [5]. I suspect other driver can leverage > > > > > > > > > > this messaging concept too as it a convenient way to avoid races in the > > > > > > > > > > backend. > > > > > > > > > > > > > > > > > > > > - Support for using TDR for all error paths of a scheduler / entity > > > > > > > > > > > > > > > > > > > > Fix a few races / bugs, add function to dynamically set the TDR timeout. > > > > > > > > > > > > > > > > > > > > - Annotate dma-fences for long running workloads. > > > > > > > > > > > > > > > > > > > > The idea here is to use dma-fences only as sync points within the > > > > > > > > > > scheduler and never export them for long running workloads. By > > > > > > > > > > annotating these fences as long running we ensure that these dma- > > > > > fences > > > > > > > > > > are never used in a way that breaks the dma-fence rules. A benefit of > > > > > > > > > > thus approach is the scheduler can still safely flow control the > > > > > > > > > > execution ring buffer via the job limit without breaking the dma-fence > > > > > > > > > > rules. > > > > > > > > > > > > > > > > > > > > Again this a first draft and looking forward to feedback. 
> > > > > > > > > > > > > > > > > > > > Enjoy - Matt > > > > > > > > > > > > > > > > > > > > [1] https://gitlab.freedesktop.org/drm/xe/kernel > > > > > > > > > > [2] https://patchwork.freedesktop.org/series/112188/ > > > > > > > > > > [3] https://patchwork.freedesktop.org/series/114772/ > > > > > > > > > > [4] > > > > > https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1 > > > > > > > > > > [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe- > > > > > > > > > > next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031 > > > > > > > > > > > > > > > > > > > > Matthew Brost (8): > > > > > > > > > > drm/sched: Convert drm scheduler to use a work queue rather than > > > > > > > > > > kthread > > > > > > > > > > drm/sched: Move schedule policy to scheduler / entity > > > > > > > > > > drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling > > > > > policy > > > > > > > > > > drm/sched: Add generic scheduler message interface > > > > > > > > > > drm/sched: Start run wq before TDR in drm_sched_start > > > > > > > > > > drm/sched: Submit job before starting TDR > > > > > > > > > > drm/sched: Add helper to set TDR timeout > > > > > > > > > > drm/syncobj: Warn on long running dma-fences > > > > > > > > > > > > > > > > > > > > Thomas Hellström (2): > > > > > > > > > > dma-buf/dma-fence: Introduce long-running completion fences > > > > > > > > > > drm/sched: Support long-running sched entities > > > > > > > > > > > > > > > > > > > > drivers/dma-buf/dma-fence.c | 142 +++++++--- > > > > > > > > > > drivers/dma-buf/dma-resv.c | 5 + > > > > > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 14 +- > > > > > > > > > > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +- > > > > > > > > > > drivers/gpu/drm/drm_syncobj.c | 5 +- > > > > > > > > > > drivers/gpu/drm/etnaviv/etnaviv_sched.c | 5 +- > > > > > > > > > > drivers/gpu/drm/lima/lima_sched.c | 5 +- > > > > > > > > > > drivers/gpu/drm/msm/adreno/adreno_device.c | 6 +- > > > > > > > > > > drivers/gpu/drm/msm/msm_ringbuffer.c | 5 +- > > > > > > > > > > drivers/gpu/drm/panfrost/panfrost_job.c | 5 +- > > > > > > > > > > drivers/gpu/drm/scheduler/sched_entity.c | 127 +++++++-- > > > > > > > > > > drivers/gpu/drm/scheduler/sched_fence.c | 6 +- > > > > > > > > > > drivers/gpu/drm/scheduler/sched_main.c | 278 +++++++++++++++-- > > > > > --- > > > > > > > > > > drivers/gpu/drm/v3d/v3d_sched.c | 25 +- > > > > > > > > > > include/drm/gpu_scheduler.h | 130 +++++++-- > > > > > > > > > > include/linux/dma-fence.h | 60 ++++- > > > > > > > > > > 16 files changed, 649 insertions(+), 184 deletions(-) > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > 2.34.1 >
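Daniel's point above about the TDR being just a work struct can be illustrated with a small, hypothetical sketch: the driver keeps one ordered workqueue for the scheduler's job timeout and queues its own extra GPU timeouts (for example a preempt timeout) on the same queue, so the handlers naturally serialize the same way the ->timed_out hook does. The names below (sketch_queue, sketch_preempt_timed_out, ...) are invented and not an existing drm_sched or Xe interface.

#include <linux/container_of.h>
#include <linux/errno.h>
#include <linux/jiffies.h>
#include <linux/workqueue.h>

struct sketch_queue {
	struct workqueue_struct *timeout_wq;	/* shared with the job TDR */
	struct delayed_work preempt_timeout;
};

static void sketch_preempt_timed_out(struct work_struct *w)
{
	struct sketch_queue *q = container_of(w, struct sketch_queue,
					      preempt_timeout.work);

	/*
	 * Serialized against the scheduler's ->timed_out work because both
	 * run on the same ordered workqueue: ban the context, trigger a
	 * reset, and so on.
	 */
}

static int sketch_queue_init(struct sketch_queue *q)
{
	/* One ordered workqueue for the job TDR and any extra GPU timeouts. */
	q->timeout_wq = alloc_ordered_workqueue("sketch-tdr", 0);
	if (!q->timeout_wq)
		return -ENOMEM;

	INIT_DELAYED_WORK(&q->preempt_timeout, sketch_preempt_timed_out);
	/*
	 * ... the same workqueue would also be handed to the scheduler as its
	 * TDR workqueue when it is initialized ...
	 */
	return 0;
}

/* Arm the extra timeout when a preemption request is sent to the firmware. */
static void sketch_queue_arm_preempt_timeout(struct sketch_queue *q)
{
	queue_delayed_work(q->timeout_wq, &q->preempt_timeout,
			   msecs_to_jiffies(100));
}

Whether such a preempt timeout lives in the driver or eventually in drm_sched_entity, as Daniel suggests, is exactly the kind of flexibility question the thread is debating; the sketch only shows the shared-workqueue pattern itself.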
On Mon, Apr 03, 2023 at 05:22:01PM -0700, Matthew Brost wrote: > Hello, Hello, Jumping a bit late on this thread as I was waiting on some approvals and then holidays kicked in, but I would like to (re)introduce myself and the people I work with and to let you know that we are interested in the changes proposed here and we would like to help if we can. I currently maintain a number of Arm Mali Display drivers but in recent times I have moved to the Mali GPU team and now we've got approval to start making contributions to the upstream driver(s). We're planning to collaborate on Boris' new Mali driver and make it work well on Mali GPUs. One of the first things to look at (besides bringing the driver up on internal dev platforms) is the scheduler changes proposed here. As such, I would like to ask that people start including John Reitan, Ketil Johnsen and me on patches. As soon as we have something working that we can make comments on, we will do so. Best regards, Liviu > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we > have been asked to merge our common DRM scheduler patches first as well > as develop a common solution for long running workloads with the DRM > scheduler. This RFC series is our first attempt at doing this. We > welcome any and all feedback. > > This can we thought of as 4 parts detailed below. > > - DRM scheduler changes for 1 to 1 relationship between scheduler and > entity (patches 1-3) > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the > GuC) which is a new paradigm WRT to the DRM scheduler and presents > severals problems as the DRM was originally designed to schedule jobs on > hardware queues. The main problem being that DRM scheduler expects the > submission order of jobs to be the completion order of jobs even across > multiple entities. This assumption falls apart with a firmware scheduler > as a firmware scheduler has no concept of jobs and jobs can complete out > of order. A novel solution for was originally thought of by Faith during > the initial prototype of Xe, create a 1 to 1 relationship between scheduler > and entity. I believe the AGX driver [3] is using this approach and > Boris may use approach as well for the Mali driver [4]. > > To support a 1 to 1 relationship we move the main execution function > from a kthread to a work queue and add a new scheduling mode which > bypasses code in the DRM which isn't needed in a 1 to 1 relationship. > The new scheduling mode should unify all drivers usage with a 1 to 1 > relationship and can be thought of as using scheduler as a dependency / > infligt job tracker rather than a true scheduler. > > - Generic messaging interface for DRM scheduler > > Idea is to be able to communicate to the submission backend with in band > (relative to main execution function) messages. Messages are backend > defined and flexable enough for any use case. In Xe we use these > messages to clean up entites, set properties for entites, and suspend / > resume execution of an entity [5]. I suspect other driver can leverage > this messaging concept too as it a convenient way to avoid races in the > backend. > > - Support for using TDR for all error paths of a scheduler / entity > > Fix a few races / bugs, add function to dynamically set the TDR timeout. > > - Annotate dma-fences for long running workloads. > > The idea here is to use dma-fences only as sync points within the > scheduler and never export them for long running workloads. 
By > annotating these fences as long running we ensure that these dma-fences > are never used in a way that breaks the dma-fence rules. A benefit of > thus approach is the scheduler can still safely flow control the > execution ring buffer via the job limit without breaking the dma-fence > rules. > > Again this a first draft and looking forward to feedback. > > Enjoy - Matt > > [1] https://gitlab.freedesktop.org/drm/xe/kernel > [2] https://patchwork.freedesktop.org/series/112188/ > [3] https://patchwork.freedesktop.org/series/114772/ > [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1 > [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031 > > Matthew Brost (8): > drm/sched: Convert drm scheduler to use a work queue rather than > kthread > drm/sched: Move schedule policy to scheduler / entity > drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy > drm/sched: Add generic scheduler message interface > drm/sched: Start run wq before TDR in drm_sched_start > drm/sched: Submit job before starting TDR > drm/sched: Add helper to set TDR timeout > drm/syncobj: Warn on long running dma-fences > > Thomas Hellström (2): > dma-buf/dma-fence: Introduce long-running completion fences > drm/sched: Support long-running sched entities > > drivers/dma-buf/dma-fence.c | 142 +++++++--- > drivers/dma-buf/dma-resv.c | 5 + > drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 14 +- > drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 15 +- > drivers/gpu/drm/drm_syncobj.c | 5 +- > drivers/gpu/drm/etnaviv/etnaviv_sched.c | 5 +- > drivers/gpu/drm/lima/lima_sched.c | 5 +- > drivers/gpu/drm/msm/adreno/adreno_device.c | 6 +- > drivers/gpu/drm/msm/msm_ringbuffer.c | 5 +- > drivers/gpu/drm/panfrost/panfrost_job.c | 5 +- > drivers/gpu/drm/scheduler/sched_entity.c | 127 +++++++-- > drivers/gpu/drm/scheduler/sched_fence.c | 6 +- > drivers/gpu/drm/scheduler/sched_main.c | 278 +++++++++++++++----- > drivers/gpu/drm/v3d/v3d_sched.c | 25 +- > include/drm/gpu_scheduler.h | 130 +++++++-- > include/linux/dma-fence.h | 60 ++++- > 16 files changed, 649 insertions(+), 184 deletions(-) > > -- > 2.34.1 >