mbox series

[RFC,00/10] Xe DRM scheduler and long running workload plans

Message ID 20230404002211.3611376-1-matthew.brost@intel.com (mailing list archive)
Headers show
Series Xe DRM scheduler and long running workload plans | expand

Message

Matthew Brost April 4, 2023, 12:22 a.m. UTC
Hello,

As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
have been asked to merge our common DRM scheduler patches first as well
as develop a common solution for long running workloads with the DRM
scheduler. This RFC series is our first attempt at doing this. We
welcome any and all feedback.

This can we thought of as 4 parts detailed below.

- DRM scheduler changes for 1 to 1 relationship between scheduler and
entity (patches 1-3)

In Xe all of the scheduling of jobs is done by a firmware scheduler (the
GuC) which is a new paradigm WRT to the DRM scheduler and presents
severals problems as the DRM was originally designed to schedule jobs on
hardware queues. The main problem being that DRM scheduler expects the
submission order of jobs to be the completion order of jobs even across
multiple entities. This assumption falls apart with a firmware scheduler
as a firmware scheduler has no concept of jobs and jobs can complete out
of order. A novel solution for was originally thought of by Faith during
the initial prototype of Xe, create a 1 to 1 relationship between scheduler
and entity. I believe the AGX driver [3] is using this approach and
Boris may use approach as well for the Mali driver [4].

To support a 1 to 1 relationship we move the main execution function
from a kthread to a work queue and add a new scheduling mode which
bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
The new scheduling mode should unify all drivers usage with a 1 to 1
relationship and can be thought of as using scheduler as a dependency /
infligt job tracker rather than a true scheduler.

- Generic messaging interface for DRM scheduler

Idea is to be able to communicate to the submission backend with in band
(relative to main execution function) messages. Messages are backend
defined and flexable enough for any use case. In Xe we use these
messages to clean up entites, set properties for entites, and suspend /
resume execution of an entity [5]. I suspect other driver can leverage
this messaging concept too as it a convenient way to avoid races in the
backend.

- Support for using TDR for all error paths of a scheduler / entity

Fix a few races / bugs, add function to dynamically set the TDR timeout.

- Annotate dma-fences for long running workloads.

The idea here is to use dma-fences only as sync points within the
scheduler and never export them for long running workloads. By
annotating these fences as long running we ensure that these dma-fences
are never used in a way that breaks the dma-fence rules. A benefit of
thus approach is the scheduler can still safely flow control the
execution ring buffer via the job limit without breaking the dma-fence
rules.

Again this a first draft and looking forward to feedback.

Enjoy - Matt

[1] https://gitlab.freedesktop.org/drm/xe/kernel
[2] https://patchwork.freedesktop.org/series/112188/ 
[3] https://patchwork.freedesktop.org/series/114772/
[4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
[5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031

Matthew Brost (8):
  drm/sched: Convert drm scheduler to use a work queue rather than
    kthread
  drm/sched: Move schedule policy to scheduler / entity
  drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
  drm/sched: Add generic scheduler message interface
  drm/sched: Start run wq before TDR in drm_sched_start
  drm/sched: Submit job before starting TDR
  drm/sched: Add helper to set TDR timeout
  drm/syncobj: Warn on long running dma-fences

Thomas Hellström (2):
  dma-buf/dma-fence: Introduce long-running completion fences
  drm/sched: Support long-running sched entities

 drivers/dma-buf/dma-fence.c                 | 142 +++++++---
 drivers/dma-buf/dma-resv.c                  |   5 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
 drivers/gpu/drm/drm_syncobj.c               |   5 +-
 drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
 drivers/gpu/drm/lima/lima_sched.c           |   5 +-
 drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
 drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
 drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
 drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
 drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
 drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
 drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
 include/drm/gpu_scheduler.h                 | 130 +++++++--
 include/linux/dma-fence.h                   |  60 ++++-
 16 files changed, 649 insertions(+), 184 deletions(-)

Comments

Asahi Lina April 4, 2023, 1:07 a.m. UTC | #1
Hi, thanks for the Cc!

On 04/04/2023 09.22, Matthew Brost wrote:
> Hello,
> 
> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> have been asked to merge our common DRM scheduler patches first as well
> as develop a common solution for long running workloads with the DRM
> scheduler. This RFC series is our first attempt at doing this. We
> welcome any and all feedback.
> 
> This can we thought of as 4 parts detailed below.
> 
> - DRM scheduler changes for 1 to 1 relationship between scheduler and
> entity (patches 1-3)
> 
> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> GuC) which is a new paradigm WRT to the DRM scheduler and presents
> severals problems as the DRM was originally designed to schedule jobs on
> hardware queues. The main problem being that DRM scheduler expects the
> submission order of jobs to be the completion order of jobs even across
> multiple entities. This assumption falls apart with a firmware scheduler
> as a firmware scheduler has no concept of jobs and jobs can complete out
> of order. A novel solution for was originally thought of by Faith during
> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> and entity. I believe the AGX driver [3] is using this approach and
> Boris may use approach as well for the Mali driver [4].
> 
> To support a 1 to 1 relationship we move the main execution function
> from a kthread to a work queue and add a new scheduling mode which
> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> The new scheduling mode should unify all drivers usage with a 1 to 1
> relationship and can be thought of as using scheduler as a dependency /
> infligt job tracker rather than a true scheduler.

Yup, we're in the exact same situation with drm/asahi, so this is very 
welcome! We've been using the existing scheduler as-is, but this should 
help remove some unneeded complexity in this use case.

Do you want me to pull in this series into our tree and make sure this 
all works out for us?

I also have a couple bugfixes for drm/sched I need to send out, but I 
think the rebase/merge with this series should be trivial. I'll send 
that out this week.

> - Generic messaging interface for DRM scheduler
> 
> Idea is to be able to communicate to the submission backend with in band
> (relative to main execution function) messages. Messages are backend
> defined and flexable enough for any use case. In Xe we use these
> messages to clean up entites, set properties for entites, and suspend /
> resume execution of an entity [5]. I suspect other driver can leverage
> this messaging concept too as it a convenient way to avoid races in the
> backend.

We haven't needed this so far (mostly by using fine-grained locking and 
refcounting all over the place) but I can see it being useful to 
simplify some of those constructs and maybe avoid potential deadlocks in 
some places. I'm not sure yet whether we can fully get rid of the main 
queue refcounting/locking (our completion/error signaling path doesn't 
map well to DMA fences directly so we still need something there to get 
from the global GPU completion signaling thread to individual queues) 
but it might be a step in the right direction at least!

~~ Lina
Matthew Brost April 4, 2023, 1:58 a.m. UTC | #2
On Tue, Apr 04, 2023 at 10:07:48AM +0900, Asahi Lina wrote:
> Hi, thanks for the Cc!
> 

No problem.

> On 04/04/2023 09.22, Matthew Brost wrote:
> > Hello,
> > 
> > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > have been asked to merge our common DRM scheduler patches first as well
> > as develop a common solution for long running workloads with the DRM
> > scheduler. This RFC series is our first attempt at doing this. We
> > welcome any and all feedback.
> > 
> > This can we thought of as 4 parts detailed below.
> > 
> > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > entity (patches 1-3)
> > 
> > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > severals problems as the DRM was originally designed to schedule jobs on
> > hardware queues. The main problem being that DRM scheduler expects the
> > submission order of jobs to be the completion order of jobs even across
> > multiple entities. This assumption falls apart with a firmware scheduler
> > as a firmware scheduler has no concept of jobs and jobs can complete out
> > of order. A novel solution for was originally thought of by Faith during
> > the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > and entity. I believe the AGX driver [3] is using this approach and
> > Boris may use approach as well for the Mali driver [4].
> > 
> > To support a 1 to 1 relationship we move the main execution function
> > from a kthread to a work queue and add a new scheduling mode which
> > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > The new scheduling mode should unify all drivers usage with a 1 to 1
> > relationship and can be thought of as using scheduler as a dependency /
> > infligt job tracker rather than a true scheduler.
> 
> Yup, we're in the exact same situation with drm/asahi, so this is very
> welcome! We've been using the existing scheduler as-is, but this should help
> remove some unneeded complexity in this use case.
>

That's the idea.

> Do you want me to pull in this series into our tree and make sure this all
> works out for us?
>

We tested this in Xe and it definitely works for us but the more testing
the better.

> I also have a couple bugfixes for drm/sched I need to send out, but I think
> the rebase/merge with this series should be trivial. I'll send that out this
> week.
> 
> > - Generic messaging interface for DRM scheduler
> > 
> > Idea is to be able to communicate to the submission backend with in band
> > (relative to main execution function) messages. Messages are backend
> > defined and flexable enough for any use case. In Xe we use these
> > messages to clean up entites, set properties for entites, and suspend /
> > resume execution of an entity [5]. I suspect other driver can leverage
> > this messaging concept too as it a convenient way to avoid races in the
> > backend.
> 
> We haven't needed this so far (mostly by using fine-grained locking and
> refcounting all over the place) but I can see it being useful to simplify
> some of those constructs and maybe avoid potential deadlocks in some places.
> I'm not sure yet whether we can fully get rid of the main queue
> refcounting/locking (our completion/error signaling path doesn't map well to
> DMA fences directly so we still need something there to get from the global
> GPU completion signaling thread to individual queues) but it might be a step
> in the right direction at least!
>

With this messaging interface we essentially have a lockless submission
backend which is really nice compared to what we did in the i915.

Matt

> ~~ Lina
>
Christian König April 4, 2023, 9:04 a.m. UTC | #3
Please make sure to CC Luben on scheduler patches.

Regards,
Christian.

Am 04.04.23 um 02:22 schrieb Matthew Brost:
> Hello,
>
> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> have been asked to merge our common DRM scheduler patches first as well
> as develop a common solution for long running workloads with the DRM
> scheduler. This RFC series is our first attempt at doing this. We
> welcome any and all feedback.
>
> This can we thought of as 4 parts detailed below.
>
> - DRM scheduler changes for 1 to 1 relationship between scheduler and
> entity (patches 1-3)
>
> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> GuC) which is a new paradigm WRT to the DRM scheduler and presents
> severals problems as the DRM was originally designed to schedule jobs on
> hardware queues. The main problem being that DRM scheduler expects the
> submission order of jobs to be the completion order of jobs even across
> multiple entities. This assumption falls apart with a firmware scheduler
> as a firmware scheduler has no concept of jobs and jobs can complete out
> of order. A novel solution for was originally thought of by Faith during
> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> and entity. I believe the AGX driver [3] is using this approach and
> Boris may use approach as well for the Mali driver [4].
>
> To support a 1 to 1 relationship we move the main execution function
> from a kthread to a work queue and add a new scheduling mode which
> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> The new scheduling mode should unify all drivers usage with a 1 to 1
> relationship and can be thought of as using scheduler as a dependency /
> infligt job tracker rather than a true scheduler.
>
> - Generic messaging interface for DRM scheduler
>
> Idea is to be able to communicate to the submission backend with in band
> (relative to main execution function) messages. Messages are backend
> defined and flexable enough for any use case. In Xe we use these
> messages to clean up entites, set properties for entites, and suspend /
> resume execution of an entity [5]. I suspect other driver can leverage
> this messaging concept too as it a convenient way to avoid races in the
> backend.
>
> - Support for using TDR for all error paths of a scheduler / entity
>
> Fix a few races / bugs, add function to dynamically set the TDR timeout.
>
> - Annotate dma-fences for long running workloads.
>
> The idea here is to use dma-fences only as sync points within the
> scheduler and never export them for long running workloads. By
> annotating these fences as long running we ensure that these dma-fences
> are never used in a way that breaks the dma-fence rules. A benefit of
> thus approach is the scheduler can still safely flow control the
> execution ring buffer via the job limit without breaking the dma-fence
> rules.
>
> Again this a first draft and looking forward to feedback.
>
> Enjoy - Matt
>
> [1] https://gitlab.freedesktop.org/drm/xe/kernel
> [2] https://patchwork.freedesktop.org/series/112188/
> [3] https://patchwork.freedesktop.org/series/114772/
> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
>
> Matthew Brost (8):
>    drm/sched: Convert drm scheduler to use a work queue rather than
>      kthread
>    drm/sched: Move schedule policy to scheduler / entity
>    drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
>    drm/sched: Add generic scheduler message interface
>    drm/sched: Start run wq before TDR in drm_sched_start
>    drm/sched: Submit job before starting TDR
>    drm/sched: Add helper to set TDR timeout
>    drm/syncobj: Warn on long running dma-fences
>
> Thomas Hellström (2):
>    dma-buf/dma-fence: Introduce long-running completion fences
>    drm/sched: Support long-running sched entities
>
>   drivers/dma-buf/dma-fence.c                 | 142 +++++++---
>   drivers/dma-buf/dma-resv.c                  |   5 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
>   drivers/gpu/drm/drm_syncobj.c               |   5 +-
>   drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
>   drivers/gpu/drm/lima/lima_sched.c           |   5 +-
>   drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
>   drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
>   drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
>   drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
>   drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
>   drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
>   drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
>   include/drm/gpu_scheduler.h                 | 130 +++++++--
>   include/linux/dma-fence.h                   |  60 ++++-
>   16 files changed, 649 insertions(+), 184 deletions(-)
>
Christian König April 4, 2023, 9:13 a.m. UTC | #4
Hi,

Am 04.04.23 um 02:22 schrieb Matthew Brost:
> Hello,
>
> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> have been asked to merge our common DRM scheduler patches first as well
> as develop a common solution for long running workloads with the DRM
> scheduler. This RFC series is our first attempt at doing this. We
> welcome any and all feedback.
>
> This can we thought of as 4 parts detailed below.
>
> - DRM scheduler changes for 1 to 1 relationship between scheduler and
> entity (patches 1-3)
>
> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> GuC) which is a new paradigm WRT to the DRM scheduler and presents
> severals problems as the DRM was originally designed to schedule jobs on
> hardware queues. The main problem being that DRM scheduler expects the
> submission order of jobs to be the completion order of jobs even across
> multiple entities. This assumption falls apart with a firmware scheduler
> as a firmware scheduler has no concept of jobs and jobs can complete out
> of order. A novel solution for was originally thought of by Faith during
> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> and entity. I believe the AGX driver [3] is using this approach and
> Boris may use approach as well for the Mali driver [4].
>
> To support a 1 to 1 relationship we move the main execution function
> from a kthread to a work queue and add a new scheduling mode which
> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> The new scheduling mode should unify all drivers usage with a 1 to 1
> relationship and can be thought of as using scheduler as a dependency /
> infligt job tracker rather than a true scheduler.
>
> - Generic messaging interface for DRM scheduler
>
> Idea is to be able to communicate to the submission backend with in band
> (relative to main execution function) messages. Messages are backend
> defined and flexable enough for any use case. In Xe we use these
> messages to clean up entites, set properties for entites, and suspend /
> resume execution of an entity [5]. I suspect other driver can leverage
> this messaging concept too as it a convenient way to avoid races in the
> backend.

Oh, please absolutely *don't* do this.

This is basically the design which makes a bunch of stuff so horrible 
broken on Windows.

I can explain it in more detail if necessary, but I strongly recommend 
to not go down this path.

Regards,
Christian.

>
> - Support for using TDR for all error paths of a scheduler / entity
>
> Fix a few races / bugs, add function to dynamically set the TDR timeout.
>
> - Annotate dma-fences for long running workloads.
>
> The idea here is to use dma-fences only as sync points within the
> scheduler and never export them for long running workloads. By
> annotating these fences as long running we ensure that these dma-fences
> are never used in a way that breaks the dma-fence rules. A benefit of
> thus approach is the scheduler can still safely flow control the
> execution ring buffer via the job limit without breaking the dma-fence
> rules.
>
> Again this a first draft and looking forward to feedback.
>
> Enjoy - Matt
>
> [1] https://gitlab.freedesktop.org/drm/xe/kernel
> [2] https://patchwork.freedesktop.org/series/112188/
> [3] https://patchwork.freedesktop.org/series/114772/
> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
>
> Matthew Brost (8):
>    drm/sched: Convert drm scheduler to use a work queue rather than
>      kthread
>    drm/sched: Move schedule policy to scheduler / entity
>    drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
>    drm/sched: Add generic scheduler message interface
>    drm/sched: Start run wq before TDR in drm_sched_start
>    drm/sched: Submit job before starting TDR
>    drm/sched: Add helper to set TDR timeout
>    drm/syncobj: Warn on long running dma-fences
>
> Thomas Hellström (2):
>    dma-buf/dma-fence: Introduce long-running completion fences
>    drm/sched: Support long-running sched entities
>
>   drivers/dma-buf/dma-fence.c                 | 142 +++++++---
>   drivers/dma-buf/dma-resv.c                  |   5 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
>   drivers/gpu/drm/drm_syncobj.c               |   5 +-
>   drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
>   drivers/gpu/drm/lima/lima_sched.c           |   5 +-
>   drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
>   drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
>   drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
>   drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
>   drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
>   drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
>   drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
>   include/drm/gpu_scheduler.h                 | 130 +++++++--
>   include/linux/dma-fence.h                   |  60 ++++-
>   16 files changed, 649 insertions(+), 184 deletions(-)
>
Tvrtko Ursulin April 4, 2023, 9:43 a.m. UTC | #5
On 04/04/2023 01:22, Matthew Brost wrote:
> Hello,
> 
> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> have been asked to merge our common DRM scheduler patches first as well
> as develop a common solution for long running workloads with the DRM
> scheduler. This RFC series is our first attempt at doing this. We
> welcome any and all feedback.
> 
> This can we thought of as 4 parts detailed below.
> 
> - DRM scheduler changes for 1 to 1 relationship between scheduler and
> entity (patches 1-3)
> 
> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> GuC) which is a new paradigm WRT to the DRM scheduler and presents
> severals problems as the DRM was originally designed to schedule jobs on
> hardware queues. The main problem being that DRM scheduler expects the
> submission order of jobs to be the completion order of jobs even across
> multiple entities. This assumption falls apart with a firmware scheduler
> as a firmware scheduler has no concept of jobs and jobs can complete out
> of order. A novel solution for was originally thought of by Faith during
> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> and entity. I believe the AGX driver [3] is using this approach and
> Boris may use approach as well for the Mali driver [4].
> 
> To support a 1 to 1 relationship we move the main execution function
> from a kthread to a work queue and add a new scheduling mode which
> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> The new scheduling mode should unify all drivers usage with a 1 to 1
> relationship and can be thought of as using scheduler as a dependency /
> infligt job tracker rather than a true scheduler.

Once you add capability for a more proper 1:1 via 
DRM_SCHED_POLICY_SINGLE_ENTITY, do you still have further need to 
replace kthreads with a wq?

Or in other words, what purpose does the offloading of a job picking 
code to a separate execution context serve? Could it be done directly in 
the 1:1 mode and leave kthread setup for N:M?

Apart from those design level questions, low level open IMO still is 
that default fallback of using the system_wq has the potential to 
affect latency for other drivers. But that's for those driver owners to 
approve.

Regards,

Tvrtko

> - Generic messaging interface for DRM scheduler
> 
> Idea is to be able to communicate to the submission backend with in band
> (relative to main execution function) messages. Messages are backend
> defined and flexable enough for any use case. In Xe we use these
> messages to clean up entites, set properties for entites, and suspend /
> resume execution of an entity [5]. I suspect other driver can leverage
> this messaging concept too as it a convenient way to avoid races in the
> backend.
> 
> - Support for using TDR for all error paths of a scheduler / entity
> 
> Fix a few races / bugs, add function to dynamically set the TDR timeout.
> 
> - Annotate dma-fences for long running workloads.
> 
> The idea here is to use dma-fences only as sync points within the
> scheduler and never export them for long running workloads. By
> annotating these fences as long running we ensure that these dma-fences
> are never used in a way that breaks the dma-fence rules. A benefit of
> thus approach is the scheduler can still safely flow control the
> execution ring buffer via the job limit without breaking the dma-fence
> rules.
> 
> Again this a first draft and looking forward to feedback.
> 
> Enjoy - Matt
> 
> [1] https://gitlab.freedesktop.org/drm/xe/kernel
> [2] https://patchwork.freedesktop.org/series/112188/
> [3] https://patchwork.freedesktop.org/series/114772/
> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> 
> Matthew Brost (8):
>    drm/sched: Convert drm scheduler to use a work queue rather than
>      kthread
>    drm/sched: Move schedule policy to scheduler / entity
>    drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
>    drm/sched: Add generic scheduler message interface
>    drm/sched: Start run wq before TDR in drm_sched_start
>    drm/sched: Submit job before starting TDR
>    drm/sched: Add helper to set TDR timeout
>    drm/syncobj: Warn on long running dma-fences
> 
> Thomas Hellström (2):
>    dma-buf/dma-fence: Introduce long-running completion fences
>    drm/sched: Support long-running sched entities
> 
>   drivers/dma-buf/dma-fence.c                 | 142 +++++++---
>   drivers/dma-buf/dma-resv.c                  |   5 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
>   drivers/gpu/drm/drm_syncobj.c               |   5 +-
>   drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
>   drivers/gpu/drm/lima/lima_sched.c           |   5 +-
>   drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
>   drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
>   drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
>   drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
>   drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
>   drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
>   drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
>   include/drm/gpu_scheduler.h                 | 130 +++++++--
>   include/linux/dma-fence.h                   |  60 ++++-
>   16 files changed, 649 insertions(+), 184 deletions(-)
>
Christian König April 4, 2023, 9:48 a.m. UTC | #6
Am 04.04.23 um 11:43 schrieb Tvrtko Ursulin:
>
> On 04/04/2023 01:22, Matthew Brost wrote:
>> Hello,
>>
>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
>> have been asked to merge our common DRM scheduler patches first as well
>> as develop a common solution for long running workloads with the DRM
>> scheduler. This RFC series is our first attempt at doing this. We
>> welcome any and all feedback.
>>
>> This can we thought of as 4 parts detailed below.
>>
>> - DRM scheduler changes for 1 to 1 relationship between scheduler and
>> entity (patches 1-3)
>>
>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
>> GuC) which is a new paradigm WRT to the DRM scheduler and presents
>> severals problems as the DRM was originally designed to schedule jobs on
>> hardware queues. The main problem being that DRM scheduler expects the
>> submission order of jobs to be the completion order of jobs even across
>> multiple entities. This assumption falls apart with a firmware scheduler
>> as a firmware scheduler has no concept of jobs and jobs can complete out
>> of order. A novel solution for was originally thought of by Faith during
>> the initial prototype of Xe, create a 1 to 1 relationship between 
>> scheduler
>> and entity. I believe the AGX driver [3] is using this approach and
>> Boris may use approach as well for the Mali driver [4].
>>
>> To support a 1 to 1 relationship we move the main execution function
>> from a kthread to a work queue and add a new scheduling mode which
>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
>> The new scheduling mode should unify all drivers usage with a 1 to 1
>> relationship and can be thought of as using scheduler as a dependency /
>> infligt job tracker rather than a true scheduler.
>
> Once you add capability for a more proper 1:1 via 
> DRM_SCHED_POLICY_SINGLE_ENTITY, do you still have further need to 
> replace kthreads with a wq?

Yeah, I fail to see the need for that as well. On the other hand it 
would be really nice to get rid of the rq/priority design in general.

>
> Or in other words, what purpose does the offloading of a job picking 
> code to a separate execution context serve? Could it be done directly 
> in the 1:1 mode and leave kthread setup for N:M?

Well moving from kthread to work item is beneficial on it's own since 
the later usually follows the the source of it's queue. E.g. when this 
is triggered by an interrupt we run on the CPU of the interrupt and not 
have inter CPU signaling.

>
> Apart from those design level questions, low level open IMO still is 
> that default fallback of using the system_wq has the potential to 
> affect latency for other drivers. But that's for those driver owners 
> to approve.

Oh, yeah that's a good point as well. This needs some high priority queue.

Christian.

>
> Regards,
>
> Tvrtko
>
>> - Generic messaging interface for DRM scheduler
>>
>> Idea is to be able to communicate to the submission backend with in band
>> (relative to main execution function) messages. Messages are backend
>> defined and flexable enough for any use case. In Xe we use these
>> messages to clean up entites, set properties for entites, and suspend /
>> resume execution of an entity [5]. I suspect other driver can leverage
>> this messaging concept too as it a convenient way to avoid races in the
>> backend.
>>
>> - Support for using TDR for all error paths of a scheduler / entity
>>
>> Fix a few races / bugs, add function to dynamically set the TDR timeout.
>>
>> - Annotate dma-fences for long running workloads.
>>
>> The idea here is to use dma-fences only as sync points within the
>> scheduler and never export them for long running workloads. By
>> annotating these fences as long running we ensure that these dma-fences
>> are never used in a way that breaks the dma-fence rules. A benefit of
>> thus approach is the scheduler can still safely flow control the
>> execution ring buffer via the job limit without breaking the dma-fence
>> rules.
>>
>> Again this a first draft and looking forward to feedback.
>>
>> Enjoy - Matt
>>
>> [1] https://gitlab.freedesktop.org/drm/xe/kernel
>> [2] https://patchwork.freedesktop.org/series/112188/
>> [3] https://patchwork.freedesktop.org/series/114772/
>> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
>> [5] 
>> https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
>>
>> Matthew Brost (8):
>>    drm/sched: Convert drm scheduler to use a work queue rather than
>>      kthread
>>    drm/sched: Move schedule policy to scheduler / entity
>>    drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
>>    drm/sched: Add generic scheduler message interface
>>    drm/sched: Start run wq before TDR in drm_sched_start
>>    drm/sched: Submit job before starting TDR
>>    drm/sched: Add helper to set TDR timeout
>>    drm/syncobj: Warn on long running dma-fences
>>
>> Thomas Hellström (2):
>>    dma-buf/dma-fence: Introduce long-running completion fences
>>    drm/sched: Support long-running sched entities
>>
>>   drivers/dma-buf/dma-fence.c                 | 142 +++++++---
>>   drivers/dma-buf/dma-resv.c                  |   5 +
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
>>   drivers/gpu/drm/drm_syncobj.c               |   5 +-
>>   drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
>>   drivers/gpu/drm/lima/lima_sched.c           |   5 +-
>>   drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
>>   drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
>>   drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
>>   drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
>>   drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
>>   drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
>>   drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
>>   include/drm/gpu_scheduler.h                 | 130 +++++++--
>>   include/linux/dma-fence.h                   |  60 ++++-
>>   16 files changed, 649 insertions(+), 184 deletions(-)
>>
Matthew Brost April 4, 2023, 1:23 p.m. UTC | #7
On Tue, Apr 04, 2023 at 11:04:54AM +0200, Christian König wrote:
> Please make sure to CC Luben on scheduler patches.
> 

Sure, figured I was missing a few people.

Matt

> Regards,
> Christian.
> 
> Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > Hello,
> > 
> > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > have been asked to merge our common DRM scheduler patches first as well
> > as develop a common solution for long running workloads with the DRM
> > scheduler. This RFC series is our first attempt at doing this. We
> > welcome any and all feedback.
> > 
> > This can we thought of as 4 parts detailed below.
> > 
> > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > entity (patches 1-3)
> > 
> > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > severals problems as the DRM was originally designed to schedule jobs on
> > hardware queues. The main problem being that DRM scheduler expects the
> > submission order of jobs to be the completion order of jobs even across
> > multiple entities. This assumption falls apart with a firmware scheduler
> > as a firmware scheduler has no concept of jobs and jobs can complete out
> > of order. A novel solution for was originally thought of by Faith during
> > the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > and entity. I believe the AGX driver [3] is using this approach and
> > Boris may use approach as well for the Mali driver [4].
> > 
> > To support a 1 to 1 relationship we move the main execution function
> > from a kthread to a work queue and add a new scheduling mode which
> > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > The new scheduling mode should unify all drivers usage with a 1 to 1
> > relationship and can be thought of as using scheduler as a dependency /
> > infligt job tracker rather than a true scheduler.
> > 
> > - Generic messaging interface for DRM scheduler
> > 
> > Idea is to be able to communicate to the submission backend with in band
> > (relative to main execution function) messages. Messages are backend
> > defined and flexable enough for any use case. In Xe we use these
> > messages to clean up entites, set properties for entites, and suspend /
> > resume execution of an entity [5]. I suspect other driver can leverage
> > this messaging concept too as it a convenient way to avoid races in the
> > backend.
> > 
> > - Support for using TDR for all error paths of a scheduler / entity
> > 
> > Fix a few races / bugs, add function to dynamically set the TDR timeout.
> > 
> > - Annotate dma-fences for long running workloads.
> > 
> > The idea here is to use dma-fences only as sync points within the
> > scheduler and never export them for long running workloads. By
> > annotating these fences as long running we ensure that these dma-fences
> > are never used in a way that breaks the dma-fence rules. A benefit of
> > thus approach is the scheduler can still safely flow control the
> > execution ring buffer via the job limit without breaking the dma-fence
> > rules.
> > 
> > Again this a first draft and looking forward to feedback.
> > 
> > Enjoy - Matt
> > 
> > [1] https://gitlab.freedesktop.org/drm/xe/kernel
> > [2] https://patchwork.freedesktop.org/series/112188/
> > [3] https://patchwork.freedesktop.org/series/114772/
> > [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> > [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> > 
> > Matthew Brost (8):
> >    drm/sched: Convert drm scheduler to use a work queue rather than
> >      kthread
> >    drm/sched: Move schedule policy to scheduler / entity
> >    drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
> >    drm/sched: Add generic scheduler message interface
> >    drm/sched: Start run wq before TDR in drm_sched_start
> >    drm/sched: Submit job before starting TDR
> >    drm/sched: Add helper to set TDR timeout
> >    drm/syncobj: Warn on long running dma-fences
> > 
> > Thomas Hellström (2):
> >    dma-buf/dma-fence: Introduce long-running completion fences
> >    drm/sched: Support long-running sched entities
> > 
> >   drivers/dma-buf/dma-fence.c                 | 142 +++++++---
> >   drivers/dma-buf/dma-resv.c                  |   5 +
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
> >   drivers/gpu/drm/drm_syncobj.c               |   5 +-
> >   drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
> >   drivers/gpu/drm/lima/lima_sched.c           |   5 +-
> >   drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
> >   drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
> >   drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
> >   drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
> >   drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
> >   drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
> >   drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
> >   include/drm/gpu_scheduler.h                 | 130 +++++++--
> >   include/linux/dma-fence.h                   |  60 ++++-
> >   16 files changed, 649 insertions(+), 184 deletions(-)
> > 
>
Matthew Brost April 4, 2023, 1:37 p.m. UTC | #8
On Tue, Apr 04, 2023 at 11:13:28AM +0200, Christian König wrote:
> Hi,
> 
> Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > Hello,
> > 
> > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > have been asked to merge our common DRM scheduler patches first as well
> > as develop a common solution for long running workloads with the DRM
> > scheduler. This RFC series is our first attempt at doing this. We
> > welcome any and all feedback.
> > 
> > This can we thought of as 4 parts detailed below.
> > 
> > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > entity (patches 1-3)
> > 
> > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > severals problems as the DRM was originally designed to schedule jobs on
> > hardware queues. The main problem being that DRM scheduler expects the
> > submission order of jobs to be the completion order of jobs even across
> > multiple entities. This assumption falls apart with a firmware scheduler
> > as a firmware scheduler has no concept of jobs and jobs can complete out
> > of order. A novel solution for was originally thought of by Faith during
> > the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > and entity. I believe the AGX driver [3] is using this approach and
> > Boris may use approach as well for the Mali driver [4].
> > 
> > To support a 1 to 1 relationship we move the main execution function
> > from a kthread to a work queue and add a new scheduling mode which
> > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > The new scheduling mode should unify all drivers usage with a 1 to 1
> > relationship and can be thought of as using scheduler as a dependency /
> > infligt job tracker rather than a true scheduler.
> > 
> > - Generic messaging interface for DRM scheduler
> > 
> > Idea is to be able to communicate to the submission backend with in band
> > (relative to main execution function) messages. Messages are backend
> > defined and flexable enough for any use case. In Xe we use these
> > messages to clean up entites, set properties for entites, and suspend /
> > resume execution of an entity [5]. I suspect other driver can leverage
> > this messaging concept too as it a convenient way to avoid races in the
> > backend.
> 
> Oh, please absolutely *don't* do this.
> 
> This is basically the design which makes a bunch of stuff so horrible broken
> on Windows.
> 
> I can explain it in more detail if necessary, but I strongly recommend to
> not go down this path.
> 

I'm afraid we are going to have to discuss this further. Let me explain
my reasoning, basically the idea is to have a single main entry point to
backend - the work queue. This avoids the need for lock between run_job
and any message that changes an entites state, also it really helps
during the reset flows (either TDR or GT reset) as we can call
drm_sched_run_wq_stop can ensure that nothing else is in the backend
changing an entity state. It all works out really nicely actually, our
GuC backend is incredibly stable (hasn't really had a bug pop in about a
year) and way simpler than what we did in the i915. I think the simplity
to largely due to this design of limiting the entry points.

I personally don't see how this a poor design, limiting entry points
absolutely makes sense to me, if it didn't why not just call cleanup_job
bypassing the main execution thread (now worker), this is the exact same
concept.

FWIW Asahi liked the idea as well and think it could be useful for AGX.
Matt

> Regards,
> Christian.
> 
> > 
> > - Support for using TDR for all error paths of a scheduler / entity
> > 
> > Fix a few races / bugs, add function to dynamically set the TDR timeout.
> > 
> > - Annotate dma-fences for long running workloads.
> > 
> > The idea here is to use dma-fences only as sync points within the
> > scheduler and never export them for long running workloads. By
> > annotating these fences as long running we ensure that these dma-fences
> > are never used in a way that breaks the dma-fence rules. A benefit of
> > thus approach is the scheduler can still safely flow control the
> > execution ring buffer via the job limit without breaking the dma-fence
> > rules.
> > 
> > Again this a first draft and looking forward to feedback.
> > 
> > Enjoy - Matt
> > 
> > [1] https://gitlab.freedesktop.org/drm/xe/kernel
> > [2] https://patchwork.freedesktop.org/series/112188/
> > [3] https://patchwork.freedesktop.org/series/114772/
> > [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> > [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> > 
> > Matthew Brost (8):
> >    drm/sched: Convert drm scheduler to use a work queue rather than
> >      kthread
> >    drm/sched: Move schedule policy to scheduler / entity
> >    drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
> >    drm/sched: Add generic scheduler message interface
> >    drm/sched: Start run wq before TDR in drm_sched_start
> >    drm/sched: Submit job before starting TDR
> >    drm/sched: Add helper to set TDR timeout
> >    drm/syncobj: Warn on long running dma-fences
> > 
> > Thomas Hellström (2):
> >    dma-buf/dma-fence: Introduce long-running completion fences
> >    drm/sched: Support long-running sched entities
> > 
> >   drivers/dma-buf/dma-fence.c                 | 142 +++++++---
> >   drivers/dma-buf/dma-resv.c                  |   5 +
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
> >   drivers/gpu/drm/drm_syncobj.c               |   5 +-
> >   drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
> >   drivers/gpu/drm/lima/lima_sched.c           |   5 +-
> >   drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
> >   drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
> >   drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
> >   drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
> >   drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
> >   drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
> >   drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
> >   include/drm/gpu_scheduler.h                 | 130 +++++++--
> >   include/linux/dma-fence.h                   |  60 ++++-
> >   16 files changed, 649 insertions(+), 184 deletions(-)
> > 
>
Matthew Brost April 4, 2023, 1:43 p.m. UTC | #9
On Tue, Apr 04, 2023 at 11:48:36AM +0200, Christian König wrote:
> Am 04.04.23 um 11:43 schrieb Tvrtko Ursulin:
> > 
> > On 04/04/2023 01:22, Matthew Brost wrote:
> > > Hello,
> > > 
> > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > > have been asked to merge our common DRM scheduler patches first as well
> > > as develop a common solution for long running workloads with the DRM
> > > scheduler. This RFC series is our first attempt at doing this. We
> > > welcome any and all feedback.
> > > 
> > > This can we thought of as 4 parts detailed below.
> > > 
> > > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > > entity (patches 1-3)
> > > 
> > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > > severals problems as the DRM was originally designed to schedule jobs on
> > > hardware queues. The main problem being that DRM scheduler expects the
> > > submission order of jobs to be the completion order of jobs even across
> > > multiple entities. This assumption falls apart with a firmware scheduler
> > > as a firmware scheduler has no concept of jobs and jobs can complete out
> > > of order. A novel solution for was originally thought of by Faith during
> > > the initial prototype of Xe, create a 1 to 1 relationship between
> > > scheduler
> > > and entity. I believe the AGX driver [3] is using this approach and
> > > Boris may use approach as well for the Mali driver [4].
> > > 
> > > To support a 1 to 1 relationship we move the main execution function
> > > from a kthread to a work queue and add a new scheduling mode which
> > > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > > The new scheduling mode should unify all drivers usage with a 1 to 1
> > > relationship and can be thought of as using scheduler as a dependency /
> > > infligt job tracker rather than a true scheduler.
> > 
> > Once you add capability for a more proper 1:1 via
> > DRM_SCHED_POLICY_SINGLE_ENTITY, do you still have further need to
> > replace kthreads with a wq?
> 
> Yeah, I fail to see the need for that as well. On the other hand it would be
> really nice to get rid of the rq/priority design in general.
> 

wrt to replacing kthread with a worker I think the idea is you don't
want to tie a kthread creation directly to a uAPI as a user then could
create 1000s of kthreads.

fwiw in a private email about a year yoy actually suggest using a work
queue Christian.

> > 
> > Or in other words, what purpose does the offloading of a job picking
> > code to a separate execution context serve? Could it be done directly in
> > the 1:1 mode and leave kthread setup for N:M?
> 
> Well moving from kthread to work item is beneficial on it's own since the
> later usually follows the the source of it's queue. E.g. when this is
> triggered by an interrupt we run on the CPU of the interrupt and not have
> inter CPU signaling.
> 
> > 
> > Apart from those design level questions, low level open IMO still is
> > that default fallback of using the system_wq has the potential to affect
> > latency for other drivers. But that's for those driver owners to
> > approve.
> 
> Oh, yeah that's a good point as well. This needs some high priority queue.
>

system_highpri_wq?

Matt

> Christian.
> 
> > 
> > Regards,
> > 
> > Tvrtko
> > 
> > > - Generic messaging interface for DRM scheduler
> > > 
> > > Idea is to be able to communicate to the submission backend with in band
> > > (relative to main execution function) messages. Messages are backend
> > > defined and flexable enough for any use case. In Xe we use these
> > > messages to clean up entites, set properties for entites, and suspend /
> > > resume execution of an entity [5]. I suspect other driver can leverage
> > > this messaging concept too as it a convenient way to avoid races in the
> > > backend.
> > > 
> > > - Support for using TDR for all error paths of a scheduler / entity
> > > 
> > > Fix a few races / bugs, add function to dynamically set the TDR timeout.
> > > 
> > > - Annotate dma-fences for long running workloads.
> > > 
> > > The idea here is to use dma-fences only as sync points within the
> > > scheduler and never export them for long running workloads. By
> > > annotating these fences as long running we ensure that these dma-fences
> > > are never used in a way that breaks the dma-fence rules. A benefit of
> > > thus approach is the scheduler can still safely flow control the
> > > execution ring buffer via the job limit without breaking the dma-fence
> > > rules.
> > > 
> > > Again this a first draft and looking forward to feedback.
> > > 
> > > Enjoy - Matt
> > > 
> > > [1] https://gitlab.freedesktop.org/drm/xe/kernel
> > > [2] https://patchwork.freedesktop.org/series/112188/
> > > [3] https://patchwork.freedesktop.org/series/114772/
> > > [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> > > [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> > > 
> > > Matthew Brost (8):
> > >    drm/sched: Convert drm scheduler to use a work queue rather than
> > >      kthread
> > >    drm/sched: Move schedule policy to scheduler / entity
> > >    drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
> > >    drm/sched: Add generic scheduler message interface
> > >    drm/sched: Start run wq before TDR in drm_sched_start
> > >    drm/sched: Submit job before starting TDR
> > >    drm/sched: Add helper to set TDR timeout
> > >    drm/syncobj: Warn on long running dma-fences
> > > 
> > > Thomas Hellström (2):
> > >    dma-buf/dma-fence: Introduce long-running completion fences
> > >    drm/sched: Support long-running sched entities
> > > 
> > >   drivers/dma-buf/dma-fence.c                 | 142 +++++++---
> > >   drivers/dma-buf/dma-resv.c                  |   5 +
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
> > >   drivers/gpu/drm/drm_syncobj.c               |   5 +-
> > >   drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
> > >   drivers/gpu/drm/lima/lima_sched.c           |   5 +-
> > >   drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
> > >   drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
> > >   drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
> > >   drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
> > >   drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
> > >   drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
> > >   drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
> > >   include/drm/gpu_scheduler.h                 | 130 +++++++--
> > >   include/linux/dma-fence.h                   |  60 ++++-
> > >   16 files changed, 649 insertions(+), 184 deletions(-)
> > > 
>
Matthew Brost April 4, 2023, 1:52 p.m. UTC | #10
On Tue, Apr 04, 2023 at 10:43:03AM +0100, Tvrtko Ursulin wrote:
> 
> On 04/04/2023 01:22, Matthew Brost wrote:
> > Hello,
> > 
> > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > have been asked to merge our common DRM scheduler patches first as well
> > as develop a common solution for long running workloads with the DRM
> > scheduler. This RFC series is our first attempt at doing this. We
> > welcome any and all feedback.
> > 
> > This can we thought of as 4 parts detailed below.
> > 
> > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > entity (patches 1-3)
> > 
> > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > severals problems as the DRM was originally designed to schedule jobs on
> > hardware queues. The main problem being that DRM scheduler expects the
> > submission order of jobs to be the completion order of jobs even across
> > multiple entities. This assumption falls apart with a firmware scheduler
> > as a firmware scheduler has no concept of jobs and jobs can complete out
> > of order. A novel solution for was originally thought of by Faith during
> > the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > and entity. I believe the AGX driver [3] is using this approach and
> > Boris may use approach as well for the Mali driver [4].
> > 
> > To support a 1 to 1 relationship we move the main execution function
> > from a kthread to a work queue and add a new scheduling mode which
> > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > The new scheduling mode should unify all drivers usage with a 1 to 1
> > relationship and can be thought of as using scheduler as a dependency /
> > infligt job tracker rather than a true scheduler.
> 
> Once you add capability for a more proper 1:1 via
> DRM_SCHED_POLICY_SINGLE_ENTITY, do you still have further need to replace
> kthreads with a wq?
> 
> Or in other words, what purpose does the offloading of a job picking code to
> a separate execution context serve? Could it be done directly in the 1:1
> mode and leave kthread setup for N:M?
> 

Addressed the other two on my reply to Christian...

For this one basically the concept of a single entity point IMO is a
very good concept which I'd like to keep. But most important reason
being the main execution thread (now worker) is kicked when a dependency
for a job is resolved, dependencies are dma-fences signaled via a
callback, and these call backs can be signaled in IRQ contexts. We
absolutely do not want to enter the backend in an IRQ context for a
variety of reasons.

Matt

> Apart from those design level questions, low level open IMO still is that
> default fallback of using the system_wq has the potential to affect latency
> for other drivers. But that's for those driver owners to approve.
> 
> Regards,
> 
> Tvrtko
> 
> > - Generic messaging interface for DRM scheduler
> > 
> > Idea is to be able to communicate to the submission backend with in band
> > (relative to main execution function) messages. Messages are backend
> > defined and flexable enough for any use case. In Xe we use these
> > messages to clean up entites, set properties for entites, and suspend /
> > resume execution of an entity [5]. I suspect other driver can leverage
> > this messaging concept too as it a convenient way to avoid races in the
> > backend.
> > 
> > - Support for using TDR for all error paths of a scheduler / entity
> > 
> > Fix a few races / bugs, add function to dynamically set the TDR timeout.
> > 
> > - Annotate dma-fences for long running workloads.
> > 
> > The idea here is to use dma-fences only as sync points within the
> > scheduler and never export them for long running workloads. By
> > annotating these fences as long running we ensure that these dma-fences
> > are never used in a way that breaks the dma-fence rules. A benefit of
> > thus approach is the scheduler can still safely flow control the
> > execution ring buffer via the job limit without breaking the dma-fence
> > rules.
> > 
> > Again this a first draft and looking forward to feedback.
> > 
> > Enjoy - Matt
> > 
> > [1] https://gitlab.freedesktop.org/drm/xe/kernel
> > [2] https://patchwork.freedesktop.org/series/112188/
> > [3] https://patchwork.freedesktop.org/series/114772/
> > [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> > [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> > 
> > Matthew Brost (8):
> >    drm/sched: Convert drm scheduler to use a work queue rather than
> >      kthread
> >    drm/sched: Move schedule policy to scheduler / entity
> >    drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
> >    drm/sched: Add generic scheduler message interface
> >    drm/sched: Start run wq before TDR in drm_sched_start
> >    drm/sched: Submit job before starting TDR
> >    drm/sched: Add helper to set TDR timeout
> >    drm/syncobj: Warn on long running dma-fences
> > 
> > Thomas Hellström (2):
> >    dma-buf/dma-fence: Introduce long-running completion fences
> >    drm/sched: Support long-running sched entities
> > 
> >   drivers/dma-buf/dma-fence.c                 | 142 +++++++---
> >   drivers/dma-buf/dma-resv.c                  |   5 +
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
> >   drivers/gpu/drm/drm_syncobj.c               |   5 +-
> >   drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
> >   drivers/gpu/drm/lima/lima_sched.c           |   5 +-
> >   drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
> >   drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
> >   drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
> >   drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
> >   drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
> >   drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
> >   drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
> >   include/drm/gpu_scheduler.h                 | 130 +++++++--
> >   include/linux/dma-fence.h                   |  60 ++++-
> >   16 files changed, 649 insertions(+), 184 deletions(-)
> >
Tvrtko Ursulin April 4, 2023, 5:29 p.m. UTC | #11
On 04/04/2023 14:52, Matthew Brost wrote:
> On Tue, Apr 04, 2023 at 10:43:03AM +0100, Tvrtko Ursulin wrote:
>>
>> On 04/04/2023 01:22, Matthew Brost wrote:
>>> Hello,
>>>
>>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
>>> have been asked to merge our common DRM scheduler patches first as well
>>> as develop a common solution for long running workloads with the DRM
>>> scheduler. This RFC series is our first attempt at doing this. We
>>> welcome any and all feedback.
>>>
>>> This can we thought of as 4 parts detailed below.
>>>
>>> - DRM scheduler changes for 1 to 1 relationship between scheduler and
>>> entity (patches 1-3)
>>>
>>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
>>> GuC) which is a new paradigm WRT to the DRM scheduler and presents
>>> severals problems as the DRM was originally designed to schedule jobs on
>>> hardware queues. The main problem being that DRM scheduler expects the
>>> submission order of jobs to be the completion order of jobs even across
>>> multiple entities. This assumption falls apart with a firmware scheduler
>>> as a firmware scheduler has no concept of jobs and jobs can complete out
>>> of order. A novel solution for was originally thought of by Faith during
>>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
>>> and entity. I believe the AGX driver [3] is using this approach and
>>> Boris may use approach as well for the Mali driver [4].
>>>
>>> To support a 1 to 1 relationship we move the main execution function
>>> from a kthread to a work queue and add a new scheduling mode which
>>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
>>> The new scheduling mode should unify all drivers usage with a 1 to 1
>>> relationship and can be thought of as using scheduler as a dependency /
>>> infligt job tracker rather than a true scheduler.
>>
>> Once you add capability for a more proper 1:1 via
>> DRM_SCHED_POLICY_SINGLE_ENTITY, do you still have further need to replace
>> kthreads with a wq?
>>
>> Or in other words, what purpose does the offloading of a job picking code to
>> a separate execution context serve? Could it be done directly in the 1:1
>> mode and leave kthread setup for N:M?
>>
> 
> Addressed the other two on my reply to Christian...
> 
> For this one basically the concept of a single entity point IMO is a
> very good concept which I'd like to keep. But most important reason
> being the main execution thread (now worker) is kicked when a dependency
> for a job is resolved, dependencies are dma-fences signaled via a
> callback, and these call backs can be signaled in IRQ contexts. We
> absolutely do not want to enter the backend in an IRQ context for a
> variety of reasons.

Sounds like a fair enough requirement but if drivers will not be 
comfortable with the wq conversion, it is probably possible to introduce 
some vfuncs for the 1:1 case which would allow scheduler users override 
the scheduler wakeup and select a special "pick one job" path. That 
could allow 1:1 users do their thing, leaving rest as is. I mean you 
already have the special single entity scheduler, you'd just need to add 
some more specialization on the init, wake up, etc paths.

And I will mention once more that I find a wq item with a loop such as:

	while (!READ_ONCE(sched->pause_run_wq)) {
	...

A bit dodgy. If you piggy back on any system_wq it smells of system wide 
starvation so for me any proposal with an option to use a system shared 
wq is a no go.

Regards,

Tvrtko


>> Apart from those design level questions, low level open IMO still is that
>> default fallback of using the system_wq has the potential to affect latency
>> for other drivers. But that's for those driver owners to approve.
>>
>> Regards,
>>
>> Tvrtko
>>
>>> - Generic messaging interface for DRM scheduler
>>>
>>> Idea is to be able to communicate to the submission backend with in band
>>> (relative to main execution function) messages. Messages are backend
>>> defined and flexable enough for any use case. In Xe we use these
>>> messages to clean up entites, set properties for entites, and suspend /
>>> resume execution of an entity [5]. I suspect other driver can leverage
>>> this messaging concept too as it a convenient way to avoid races in the
>>> backend.
>>>
>>> - Support for using TDR for all error paths of a scheduler / entity
>>>
>>> Fix a few races / bugs, add function to dynamically set the TDR timeout.
>>>
>>> - Annotate dma-fences for long running workloads.
>>>
>>> The idea here is to use dma-fences only as sync points within the
>>> scheduler and never export them for long running workloads. By
>>> annotating these fences as long running we ensure that these dma-fences
>>> are never used in a way that breaks the dma-fence rules. A benefit of
>>> thus approach is the scheduler can still safely flow control the
>>> execution ring buffer via the job limit without breaking the dma-fence
>>> rules.
>>>
>>> Again this a first draft and looking forward to feedback.
>>>
>>> Enjoy - Matt
>>>
>>> [1] https://gitlab.freedesktop.org/drm/xe/kernel
>>> [2] https://patchwork.freedesktop.org/series/112188/
>>> [3] https://patchwork.freedesktop.org/series/114772/
>>> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
>>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
>>>
>>> Matthew Brost (8):
>>>     drm/sched: Convert drm scheduler to use a work queue rather than
>>>       kthread
>>>     drm/sched: Move schedule policy to scheduler / entity
>>>     drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
>>>     drm/sched: Add generic scheduler message interface
>>>     drm/sched: Start run wq before TDR in drm_sched_start
>>>     drm/sched: Submit job before starting TDR
>>>     drm/sched: Add helper to set TDR timeout
>>>     drm/syncobj: Warn on long running dma-fences
>>>
>>> Thomas Hellström (2):
>>>     dma-buf/dma-fence: Introduce long-running completion fences
>>>     drm/sched: Support long-running sched entities
>>>
>>>    drivers/dma-buf/dma-fence.c                 | 142 +++++++---
>>>    drivers/dma-buf/dma-resv.c                  |   5 +
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
>>>    drivers/gpu/drm/drm_syncobj.c               |   5 +-
>>>    drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
>>>    drivers/gpu/drm/lima/lima_sched.c           |   5 +-
>>>    drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
>>>    drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
>>>    drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
>>>    drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
>>>    drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
>>>    drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
>>>    drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
>>>    include/drm/gpu_scheduler.h                 | 130 +++++++--
>>>    include/linux/dma-fence.h                   |  60 ++++-
>>>    16 files changed, 649 insertions(+), 184 deletions(-)
>>>
Zeng, Oak April 4, 2023, 6:02 p.m. UTC | #12
Hi Matt, Thomas,

Some very bold out of box thinking in this area:

1. so you want to use drm scheduler and dma-fence for long running workload. Why you want to do this in the first place? What is the benefit? Drm scheduler is pretty much a software scheduler. Modern gpu has scheduler built at fw/hw level, as you said below for intel this is Guc. Can xe driver just directly submit job to Guc, bypassing drm scheduler? 

2. using dma-fence for long run workload: I am well aware that page fault (and the consequent memory allocation/lock acquiring to fix the fault) can cause deadlock for a dma-fence wait. But I am not convinced that dma-fence can't be used purely because the nature of the workload that it runs very long (indefinite). I did a math: the dma_fence_wait_timeout function's third param is the timeout which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not long enough, can we just change the timeout parameter to signed 64 bits so it is much longer than our life time... 

So I mainly argue we can't use dma-fence for long-run workload is not because the workload runs very long, rather because of the fact that we use page fault for long-run workload. If we enable page fault for short-run workload, we can't use dma-fence either. Page fault is the key thing here.

Now since we use page fault which is *fundamentally* controversial with dma-fence design, why now just introduce a independent concept such as user-fence instead of extending existing dma-fence? 

I like unified design. If drm scheduler, dma-fence can be extended to work for everything, it is beautiful. But seems we have some fundamental problem here.

Thanks,
Oak

> -----Original Message-----
> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
> Matthew Brost
> Sent: April 3, 2023 8:22 PM
> To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org
> Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com; airlied@linux.ie;
> lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew
> <matthew.brost@intel.com>; christian.koenig@amd.com;
> faith.ekstrand@collabora.com
> Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
> 
> Hello,
> 
> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> have been asked to merge our common DRM scheduler patches first as well
> as develop a common solution for long running workloads with the DRM
> scheduler. This RFC series is our first attempt at doing this. We
> welcome any and all feedback.
> 
> This can we thought of as 4 parts detailed below.
> 
> - DRM scheduler changes for 1 to 1 relationship between scheduler and
> entity (patches 1-3)
> 
> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> GuC) which is a new paradigm WRT to the DRM scheduler and presents
> severals problems as the DRM was originally designed to schedule jobs on
> hardware queues. The main problem being that DRM scheduler expects the
> submission order of jobs to be the completion order of jobs even across
> multiple entities. This assumption falls apart with a firmware scheduler
> as a firmware scheduler has no concept of jobs and jobs can complete out
> of order. A novel solution for was originally thought of by Faith during
> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> and entity. I believe the AGX driver [3] is using this approach and
> Boris may use approach as well for the Mali driver [4].
> 
> To support a 1 to 1 relationship we move the main execution function
> from a kthread to a work queue and add a new scheduling mode which
> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> The new scheduling mode should unify all drivers usage with a 1 to 1
> relationship and can be thought of as using scheduler as a dependency /
> infligt job tracker rather than a true scheduler.
> 
> - Generic messaging interface for DRM scheduler
> 
> Idea is to be able to communicate to the submission backend with in band
> (relative to main execution function) messages. Messages are backend
> defined and flexable enough for any use case. In Xe we use these
> messages to clean up entites, set properties for entites, and suspend /
> resume execution of an entity [5]. I suspect other driver can leverage
> this messaging concept too as it a convenient way to avoid races in the
> backend.
> 
> - Support for using TDR for all error paths of a scheduler / entity
> 
> Fix a few races / bugs, add function to dynamically set the TDR timeout.
> 
> - Annotate dma-fences for long running workloads.
> 
> The idea here is to use dma-fences only as sync points within the
> scheduler and never export them for long running workloads. By
> annotating these fences as long running we ensure that these dma-fences
> are never used in a way that breaks the dma-fence rules. A benefit of
> thus approach is the scheduler can still safely flow control the
> execution ring buffer via the job limit without breaking the dma-fence
> rules.
> 
> Again this a first draft and looking forward to feedback.
> 
> Enjoy - Matt
> 
> [1] https://gitlab.freedesktop.org/drm/xe/kernel
> [2] https://patchwork.freedesktop.org/series/112188/
> [3] https://patchwork.freedesktop.org/series/114772/
> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-
> next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> 
> Matthew Brost (8):
>   drm/sched: Convert drm scheduler to use a work queue rather than
>     kthread
>   drm/sched: Move schedule policy to scheduler / entity
>   drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
>   drm/sched: Add generic scheduler message interface
>   drm/sched: Start run wq before TDR in drm_sched_start
>   drm/sched: Submit job before starting TDR
>   drm/sched: Add helper to set TDR timeout
>   drm/syncobj: Warn on long running dma-fences
> 
> Thomas Hellström (2):
>   dma-buf/dma-fence: Introduce long-running completion fences
>   drm/sched: Support long-running sched entities
> 
>  drivers/dma-buf/dma-fence.c                 | 142 +++++++---
>  drivers/dma-buf/dma-resv.c                  |   5 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
>  drivers/gpu/drm/drm_syncobj.c               |   5 +-
>  drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
>  drivers/gpu/drm/lima/lima_sched.c           |   5 +-
>  drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
>  drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
>  drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
>  drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
>  drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
>  drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
>  drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
>  include/drm/gpu_scheduler.h                 | 130 +++++++--
>  include/linux/dma-fence.h                   |  60 ++++-
>  16 files changed, 649 insertions(+), 184 deletions(-)
> 
> --
> 2.34.1
Matthew Brost April 4, 2023, 6:08 p.m. UTC | #13
On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
> Hi Matt, Thomas,
> 
> Some very bold out of box thinking in this area:
> 
> 1. so you want to use drm scheduler and dma-fence for long running workload. Why you want to do this in the first place? What is the benefit? Drm scheduler is pretty much a software scheduler. Modern gpu has scheduler built at fw/hw level, as you said below for intel this is Guc. Can xe driver just directly submit job to Guc, bypassing drm scheduler? 
>

If we did that now we have 2 paths for dependency track, flow controling
the ring, resets / error handling / backend submission implementations.
We don't want this.
 
> 2. using dma-fence for long run workload: I am well aware that page fault (and the consequent memory allocation/lock acquiring to fix the fault) can cause deadlock for a dma-fence wait. But I am not convinced that dma-fence can't be used purely because the nature of the workload that it runs very long (indefinite). I did a math: the dma_fence_wait_timeout function's third param is the timeout which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not long enough, can we just change the timeout parameter to signed 64 bits so it is much longer than our life time... 
> 
> So I mainly argue we can't use dma-fence for long-run workload is not because the workload runs very long, rather because of the fact that we use page fault for long-run workload. If we enable page fault for short-run workload, we can't use dma-fence either. Page fault is the key thing here.
> 
> Now since we use page fault which is *fundamentally* controversial with dma-fence design, why now just introduce a independent concept such as user-fence instead of extending existing dma-fence? 
> 
> I like unified design. If drm scheduler, dma-fence can be extended to work for everything, it is beautiful. But seems we have some fundamental problem here.
>

Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use
the signal / CB infrastructure) and enforce we don't use use these
dma-fences from the scheduler in memory reclaim paths or export these to
user space or other drivers. Think of this mode as SW only fence.

Matt
 
> Thanks,
> Oak
> 
> > -----Original Message-----
> > From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
> > Matthew Brost
> > Sent: April 3, 2023 8:22 PM
> > To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org
> > Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com; airlied@linux.ie;
> > lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew
> > <matthew.brost@intel.com>; christian.koenig@amd.com;
> > faith.ekstrand@collabora.com
> > Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
> > 
> > Hello,
> > 
> > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > have been asked to merge our common DRM scheduler patches first as well
> > as develop a common solution for long running workloads with the DRM
> > scheduler. This RFC series is our first attempt at doing this. We
> > welcome any and all feedback.
> > 
> > This can we thought of as 4 parts detailed below.
> > 
> > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > entity (patches 1-3)
> > 
> > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > severals problems as the DRM was originally designed to schedule jobs on
> > hardware queues. The main problem being that DRM scheduler expects the
> > submission order of jobs to be the completion order of jobs even across
> > multiple entities. This assumption falls apart with a firmware scheduler
> > as a firmware scheduler has no concept of jobs and jobs can complete out
> > of order. A novel solution for was originally thought of by Faith during
> > the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > and entity. I believe the AGX driver [3] is using this approach and
> > Boris may use approach as well for the Mali driver [4].
> > 
> > To support a 1 to 1 relationship we move the main execution function
> > from a kthread to a work queue and add a new scheduling mode which
> > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > The new scheduling mode should unify all drivers usage with a 1 to 1
> > relationship and can be thought of as using scheduler as a dependency /
> > infligt job tracker rather than a true scheduler.
> > 
> > - Generic messaging interface for DRM scheduler
> > 
> > Idea is to be able to communicate to the submission backend with in band
> > (relative to main execution function) messages. Messages are backend
> > defined and flexable enough for any use case. In Xe we use these
> > messages to clean up entites, set properties for entites, and suspend /
> > resume execution of an entity [5]. I suspect other driver can leverage
> > this messaging concept too as it a convenient way to avoid races in the
> > backend.
> > 
> > - Support for using TDR for all error paths of a scheduler / entity
> > 
> > Fix a few races / bugs, add function to dynamically set the TDR timeout.
> > 
> > - Annotate dma-fences for long running workloads.
> > 
> > The idea here is to use dma-fences only as sync points within the
> > scheduler and never export them for long running workloads. By
> > annotating these fences as long running we ensure that these dma-fences
> > are never used in a way that breaks the dma-fence rules. A benefit of
> > thus approach is the scheduler can still safely flow control the
> > execution ring buffer via the job limit without breaking the dma-fence
> > rules.
> > 
> > Again this a first draft and looking forward to feedback.
> > 
> > Enjoy - Matt
> > 
> > [1] https://gitlab.freedesktop.org/drm/xe/kernel
> > [2] https://patchwork.freedesktop.org/series/112188/
> > [3] https://patchwork.freedesktop.org/series/114772/
> > [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> > [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-
> > next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> > 
> > Matthew Brost (8):
> >   drm/sched: Convert drm scheduler to use a work queue rather than
> >     kthread
> >   drm/sched: Move schedule policy to scheduler / entity
> >   drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
> >   drm/sched: Add generic scheduler message interface
> >   drm/sched: Start run wq before TDR in drm_sched_start
> >   drm/sched: Submit job before starting TDR
> >   drm/sched: Add helper to set TDR timeout
> >   drm/syncobj: Warn on long running dma-fences
> > 
> > Thomas Hellström (2):
> >   dma-buf/dma-fence: Introduce long-running completion fences
> >   drm/sched: Support long-running sched entities
> > 
> >  drivers/dma-buf/dma-fence.c                 | 142 +++++++---
> >  drivers/dma-buf/dma-resv.c                  |   5 +
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
> >  drivers/gpu/drm/drm_syncobj.c               |   5 +-
> >  drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
> >  drivers/gpu/drm/lima/lima_sched.c           |   5 +-
> >  drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
> >  drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
> >  drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
> >  drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
> >  drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
> >  drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
> >  drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
> >  include/drm/gpu_scheduler.h                 | 130 +++++++--
> >  include/linux/dma-fence.h                   |  60 ++++-
> >  16 files changed, 649 insertions(+), 184 deletions(-)
> > 
> > --
> > 2.34.1
>
Daniel Vetter April 4, 2023, 7:07 p.m. UTC | #14
On Tue, 4 Apr 2023 at 19:29, Tvrtko Ursulin
<tvrtko.ursulin@linux.intel.com> wrote:
>
>
> On 04/04/2023 14:52, Matthew Brost wrote:
> > On Tue, Apr 04, 2023 at 10:43:03AM +0100, Tvrtko Ursulin wrote:
> >>
> >> On 04/04/2023 01:22, Matthew Brost wrote:
> >>> Hello,
> >>>
> >>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> >>> have been asked to merge our common DRM scheduler patches first as well
> >>> as develop a common solution for long running workloads with the DRM
> >>> scheduler. This RFC series is our first attempt at doing this. We
> >>> welcome any and all feedback.
> >>>
> >>> This can we thought of as 4 parts detailed below.
> >>>
> >>> - DRM scheduler changes for 1 to 1 relationship between scheduler and
> >>> entity (patches 1-3)
> >>>
> >>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> >>> GuC) which is a new paradigm WRT to the DRM scheduler and presents
> >>> severals problems as the DRM was originally designed to schedule jobs on
> >>> hardware queues. The main problem being that DRM scheduler expects the
> >>> submission order of jobs to be the completion order of jobs even across
> >>> multiple entities. This assumption falls apart with a firmware scheduler
> >>> as a firmware scheduler has no concept of jobs and jobs can complete out
> >>> of order. A novel solution for was originally thought of by Faith during
> >>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> >>> and entity. I believe the AGX driver [3] is using this approach and
> >>> Boris may use approach as well for the Mali driver [4].
> >>>
> >>> To support a 1 to 1 relationship we move the main execution function
> >>> from a kthread to a work queue and add a new scheduling mode which
> >>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> >>> The new scheduling mode should unify all drivers usage with a 1 to 1
> >>> relationship and can be thought of as using scheduler as a dependency /
> >>> infligt job tracker rather than a true scheduler.
> >>
> >> Once you add capability for a more proper 1:1 via
> >> DRM_SCHED_POLICY_SINGLE_ENTITY, do you still have further need to replace
> >> kthreads with a wq?
> >>
> >> Or in other words, what purpose does the offloading of a job picking code to
> >> a separate execution context serve? Could it be done directly in the 1:1
> >> mode and leave kthread setup for N:M?
> >>
> >
> > Addressed the other two on my reply to Christian...
> >
> > For this one basically the concept of a single entity point IMO is a
> > very good concept which I'd like to keep. But most important reason
> > being the main execution thread (now worker) is kicked when a dependency
> > for a job is resolved, dependencies are dma-fences signaled via a
> > callback, and these call backs can be signaled in IRQ contexts. We
> > absolutely do not want to enter the backend in an IRQ context for a
> > variety of reasons.
>
> Sounds like a fair enough requirement but if drivers will not be
> comfortable with the wq conversion, it is probably possible to introduce
> some vfuncs for the 1:1 case which would allow scheduler users override
> the scheduler wakeup and select a special "pick one job" path. That
> could allow 1:1 users do their thing, leaving rest as is. I mean you
> already have the special single entity scheduler, you'd just need to add
> some more specialization on the init, wake up, etc paths.
>
> And I will mention once more that I find a wq item with a loop such as:
>
>         while (!READ_ONCE(sched->pause_run_wq)) {
>         ...
>
> A bit dodgy. If you piggy back on any system_wq it smells of system wide
> starvation so for me any proposal with an option to use a system shared
> wq is a no go.

Yeah I think the argument for wq based scheduler would need a
per-drm_scheduler wq, like we currently have a per-scheduler kthread.
It might still need some serious work to replace the
kthread_stop/start() with a wq-native equivalent (because really this
is the tricky stuff we shouldn't hand-roll unless someone is willing
to write a few papers on the lockless design that's done), but would
look a bunch more reasonable. Having a per-sched workqueue might also
help with the big sched_stop/start/fini state transitions, which I
really think should still go over all the per-entity schedulers even
in the 1:1 case (because otherwise you get some funky code in drivers
that do the iterations needed, which probably tosses the fairly nice
design the current scheduler has by relying on the kthread_stop/start
primitives for this.
-Daniel

>
> Regards,
>
> Tvrtko
>
>
> >> Apart from those design level questions, low level open IMO still is that
> >> default fallback of using the system_wq has the potential to affect latency
> >> for other drivers. But that's for those driver owners to approve.
> >>
> >> Regards,
> >>
> >> Tvrtko
> >>
> >>> - Generic messaging interface for DRM scheduler
> >>>
> >>> Idea is to be able to communicate to the submission backend with in band
> >>> (relative to main execution function) messages. Messages are backend
> >>> defined and flexable enough for any use case. In Xe we use these
> >>> messages to clean up entites, set properties for entites, and suspend /
> >>> resume execution of an entity [5]. I suspect other driver can leverage
> >>> this messaging concept too as it a convenient way to avoid races in the
> >>> backend.
> >>>
> >>> - Support for using TDR for all error paths of a scheduler / entity
> >>>
> >>> Fix a few races / bugs, add function to dynamically set the TDR timeout.
> >>>
> >>> - Annotate dma-fences for long running workloads.
> >>>
> >>> The idea here is to use dma-fences only as sync points within the
> >>> scheduler and never export them for long running workloads. By
> >>> annotating these fences as long running we ensure that these dma-fences
> >>> are never used in a way that breaks the dma-fence rules. A benefit of
> >>> thus approach is the scheduler can still safely flow control the
> >>> execution ring buffer via the job limit without breaking the dma-fence
> >>> rules.
> >>>
> >>> Again this a first draft and looking forward to feedback.
> >>>
> >>> Enjoy - Matt
> >>>
> >>> [1] https://gitlab.freedesktop.org/drm/xe/kernel
> >>> [2] https://patchwork.freedesktop.org/series/112188/
> >>> [3] https://patchwork.freedesktop.org/series/114772/
> >>> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> >>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> >>>
> >>> Matthew Brost (8):
> >>>     drm/sched: Convert drm scheduler to use a work queue rather than
> >>>       kthread
> >>>     drm/sched: Move schedule policy to scheduler / entity
> >>>     drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
> >>>     drm/sched: Add generic scheduler message interface
> >>>     drm/sched: Start run wq before TDR in drm_sched_start
> >>>     drm/sched: Submit job before starting TDR
> >>>     drm/sched: Add helper to set TDR timeout
> >>>     drm/syncobj: Warn on long running dma-fences
> >>>
> >>> Thomas Hellström (2):
> >>>     dma-buf/dma-fence: Introduce long-running completion fences
> >>>     drm/sched: Support long-running sched entities
> >>>
> >>>    drivers/dma-buf/dma-fence.c                 | 142 +++++++---
> >>>    drivers/dma-buf/dma-resv.c                  |   5 +
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> >>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
> >>>    drivers/gpu/drm/drm_syncobj.c               |   5 +-
> >>>    drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
> >>>    drivers/gpu/drm/lima/lima_sched.c           |   5 +-
> >>>    drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
> >>>    drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
> >>>    drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
> >>>    drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
> >>>    drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
> >>>    drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
> >>>    drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
> >>>    include/drm/gpu_scheduler.h                 | 130 +++++++--
> >>>    include/linux/dma-fence.h                   |  60 ++++-
> >>>    16 files changed, 649 insertions(+), 184 deletions(-)
> >>>
Christian König April 5, 2023, 7:30 a.m. UTC | #15
Am 04.04.23 um 20:08 schrieb Matthew Brost:
> On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
>> Hi Matt, Thomas,
>>
>> Some very bold out of box thinking in this area:
>>
>> 1. so you want to use drm scheduler and dma-fence for long running workload. Why you want to do this in the first place? What is the benefit? Drm scheduler is pretty much a software scheduler. Modern gpu has scheduler built at fw/hw level, as you said below for intel this is Guc. Can xe driver just directly submit job to Guc, bypassing drm scheduler?
>>
> If we did that now we have 2 paths for dependency track, flow controling
> the ring, resets / error handling / backend submission implementations.
> We don't want this.

Well exactly that's the point: Why?

As far as I can see that are two completely distinct use cases, so you 
absolutely do want two completely distinct implementations for this.

>> 2. using dma-fence for long run workload: I am well aware that page fault (and the consequent memory allocation/lock acquiring to fix the fault) can cause deadlock for a dma-fence wait. But I am not convinced that dma-fence can't be used purely because the nature of the workload that it runs very long (indefinite). I did a math: the dma_fence_wait_timeout function's third param is the timeout which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not long enough, can we just change the timeout parameter to signed 64 bits so it is much longer than our life time...
>>
>> So I mainly argue we can't use dma-fence for long-run workload is not because the workload runs very long, rather because of the fact that we use page fault for long-run workload. If we enable page fault for short-run workload, we can't use dma-fence either. Page fault is the key thing here.
>>
>> Now since we use page fault which is *fundamentally* controversial with dma-fence design, why now just introduce a independent concept such as user-fence instead of extending existing dma-fence?
>>
>> I like unified design. If drm scheduler, dma-fence can be extended to work for everything, it is beautiful. But seems we have some fundamental problem here.
>>
> Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use
> the signal / CB infrastructure) and enforce we don't use use these
> dma-fences from the scheduler in memory reclaim paths or export these to
> user space or other drivers. Think of this mode as SW only fence.

Yeah and I truly think this is an really bad idea.

The signal/CB infrastructure in the dma_fence turned out to be the 
absolutely nightmare I initially predicted. Sorry to say that, but in 
this case the "I've told you so" is appropriate in my opinion.

If we need infrastructure for long running dependency tracking we should 
encapsulate that in a new framework and not try to mangle the existing 
code for something it was never intended for.

Christian.

>
> Matt
>   
>> Thanks,
>> Oak
>>
>>> -----Original Message-----
>>> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
>>> Matthew Brost
>>> Sent: April 3, 2023 8:22 PM
>>> To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org
>>> Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com; airlied@linux.ie;
>>> lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew
>>> <matthew.brost@intel.com>; christian.koenig@amd.com;
>>> faith.ekstrand@collabora.com
>>> Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
>>>
>>> Hello,
>>>
>>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
>>> have been asked to merge our common DRM scheduler patches first as well
>>> as develop a common solution for long running workloads with the DRM
>>> scheduler. This RFC series is our first attempt at doing this. We
>>> welcome any and all feedback.
>>>
>>> This can we thought of as 4 parts detailed below.
>>>
>>> - DRM scheduler changes for 1 to 1 relationship between scheduler and
>>> entity (patches 1-3)
>>>
>>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
>>> GuC) which is a new paradigm WRT to the DRM scheduler and presents
>>> severals problems as the DRM was originally designed to schedule jobs on
>>> hardware queues. The main problem being that DRM scheduler expects the
>>> submission order of jobs to be the completion order of jobs even across
>>> multiple entities. This assumption falls apart with a firmware scheduler
>>> as a firmware scheduler has no concept of jobs and jobs can complete out
>>> of order. A novel solution for was originally thought of by Faith during
>>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
>>> and entity. I believe the AGX driver [3] is using this approach and
>>> Boris may use approach as well for the Mali driver [4].
>>>
>>> To support a 1 to 1 relationship we move the main execution function
>>> from a kthread to a work queue and add a new scheduling mode which
>>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
>>> The new scheduling mode should unify all drivers usage with a 1 to 1
>>> relationship and can be thought of as using scheduler as a dependency /
>>> infligt job tracker rather than a true scheduler.
>>>
>>> - Generic messaging interface for DRM scheduler
>>>
>>> Idea is to be able to communicate to the submission backend with in band
>>> (relative to main execution function) messages. Messages are backend
>>> defined and flexable enough for any use case. In Xe we use these
>>> messages to clean up entites, set properties for entites, and suspend /
>>> resume execution of an entity [5]. I suspect other driver can leverage
>>> this messaging concept too as it a convenient way to avoid races in the
>>> backend.
>>>
>>> - Support for using TDR for all error paths of a scheduler / entity
>>>
>>> Fix a few races / bugs, add function to dynamically set the TDR timeout.
>>>
>>> - Annotate dma-fences for long running workloads.
>>>
>>> The idea here is to use dma-fences only as sync points within the
>>> scheduler and never export them for long running workloads. By
>>> annotating these fences as long running we ensure that these dma-fences
>>> are never used in a way that breaks the dma-fence rules. A benefit of
>>> thus approach is the scheduler can still safely flow control the
>>> execution ring buffer via the job limit without breaking the dma-fence
>>> rules.
>>>
>>> Again this a first draft and looking forward to feedback.
>>>
>>> Enjoy - Matt
>>>
>>> [1] https://gitlab.freedesktop.org/drm/xe/kernel
>>> [2] https://patchwork.freedesktop.org/series/112188/
>>> [3] https://patchwork.freedesktop.org/series/114772/
>>> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
>>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-
>>> next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
>>>
>>> Matthew Brost (8):
>>>    drm/sched: Convert drm scheduler to use a work queue rather than
>>>      kthread
>>>    drm/sched: Move schedule policy to scheduler / entity
>>>    drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
>>>    drm/sched: Add generic scheduler message interface
>>>    drm/sched: Start run wq before TDR in drm_sched_start
>>>    drm/sched: Submit job before starting TDR
>>>    drm/sched: Add helper to set TDR timeout
>>>    drm/syncobj: Warn on long running dma-fences
>>>
>>> Thomas Hellström (2):
>>>    dma-buf/dma-fence: Introduce long-running completion fences
>>>    drm/sched: Support long-running sched entities
>>>
>>>   drivers/dma-buf/dma-fence.c                 | 142 +++++++---
>>>   drivers/dma-buf/dma-resv.c                  |   5 +
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
>>>   drivers/gpu/drm/drm_syncobj.c               |   5 +-
>>>   drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
>>>   drivers/gpu/drm/lima/lima_sched.c           |   5 +-
>>>   drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
>>>   drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
>>>   drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
>>>   drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
>>>   drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
>>>   drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
>>>   drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
>>>   include/drm/gpu_scheduler.h                 | 130 +++++++--
>>>   include/linux/dma-fence.h                   |  60 ++++-
>>>   16 files changed, 649 insertions(+), 184 deletions(-)
>>>
>>> --
>>> 2.34.1
Christian König April 5, 2023, 7:41 a.m. UTC | #16
Am 04.04.23 um 15:37 schrieb Matthew Brost:
> On Tue, Apr 04, 2023 at 11:13:28AM +0200, Christian König wrote:
>> Hi,
>>
>> Am 04.04.23 um 02:22 schrieb Matthew Brost:
>>> Hello,
>>>
>>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
>>> have been asked to merge our common DRM scheduler patches first as well
>>> as develop a common solution for long running workloads with the DRM
>>> scheduler. This RFC series is our first attempt at doing this. We
>>> welcome any and all feedback.
>>>
>>> This can we thought of as 4 parts detailed below.
>>>
>>> - DRM scheduler changes for 1 to 1 relationship between scheduler and
>>> entity (patches 1-3)
>>>
>>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
>>> GuC) which is a new paradigm WRT to the DRM scheduler and presents
>>> severals problems as the DRM was originally designed to schedule jobs on
>>> hardware queues. The main problem being that DRM scheduler expects the
>>> submission order of jobs to be the completion order of jobs even across
>>> multiple entities. This assumption falls apart with a firmware scheduler
>>> as a firmware scheduler has no concept of jobs and jobs can complete out
>>> of order. A novel solution for was originally thought of by Faith during
>>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
>>> and entity. I believe the AGX driver [3] is using this approach and
>>> Boris may use approach as well for the Mali driver [4].
>>>
>>> To support a 1 to 1 relationship we move the main execution function
>>> from a kthread to a work queue and add a new scheduling mode which
>>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
>>> The new scheduling mode should unify all drivers usage with a 1 to 1
>>> relationship and can be thought of as using scheduler as a dependency /
>>> infligt job tracker rather than a true scheduler.
>>>
>>> - Generic messaging interface for DRM scheduler
>>>
>>> Idea is to be able to communicate to the submission backend with in band
>>> (relative to main execution function) messages. Messages are backend
>>> defined and flexable enough for any use case. In Xe we use these
>>> messages to clean up entites, set properties for entites, and suspend /
>>> resume execution of an entity [5]. I suspect other driver can leverage
>>> this messaging concept too as it a convenient way to avoid races in the
>>> backend.
>> Oh, please absolutely *don't* do this.
>>
>> This is basically the design which makes a bunch of stuff so horrible broken
>> on Windows.
>>
>> I can explain it in more detail if necessary, but I strongly recommend to
>> not go down this path.
>>
> I'm afraid we are going to have to discuss this further. Let me explain
> my reasoning, basically the idea is to have a single main entry point to
> backend - the work queue. This avoids the need for lock between run_job
> and any message that changes an entites state, also it really helps
> during the reset flows (either TDR or GT reset) as we can call
> drm_sched_run_wq_stop can ensure that nothing else is in the backend
> changing an entity state. It all works out really nicely actually, our
> GuC backend is incredibly stable (hasn't really had a bug pop in about a
> year) and way simpler than what we did in the i915. I think the simplity
> to largely due to this design of limiting the entry points.
>
> I personally don't see how this a poor design, limiting entry points
> absolutely makes sense to me, if it didn't why not just call cleanup_job
> bypassing the main execution thread (now worker), this is the exact same
> concept.

Well then I strongly suggest to read a few analyses on the failure of 
the message processing loop on Windows.

Have you ever wondered why classic Win32 applications sometimes seems to 
be stuck and don't do anything? This design pattern combine with 
timeouts to solve deadlocks is the reason for that.

The major problem with this approach is that analyzing tools like 
lockdep have a hard time grasping the dependencies.

What you can do is to offload all your operations which are supposed to 
be run in the same thread as work items into a work queue. This is 
something lockdep understands and is able to scream out lout if someone 
messes up the deadlock dependencies.

Regards,
Christian.

>
> FWIW Asahi liked the idea as well and think it could be useful for AGX.
> Matt
>
>> Regards,
>> Christian.
>>
>>> - Support for using TDR for all error paths of a scheduler / entity
>>>
>>> Fix a few races / bugs, add function to dynamically set the TDR timeout.
>>>
>>> - Annotate dma-fences for long running workloads.
>>>
>>> The idea here is to use dma-fences only as sync points within the
>>> scheduler and never export them for long running workloads. By
>>> annotating these fences as long running we ensure that these dma-fences
>>> are never used in a way that breaks the dma-fence rules. A benefit of
>>> thus approach is the scheduler can still safely flow control the
>>> execution ring buffer via the job limit without breaking the dma-fence
>>> rules.
>>>
>>> Again this a first draft and looking forward to feedback.
>>>
>>> Enjoy - Matt
>>>
>>> [1] https://gitlab.freedesktop.org/drm/xe/kernel
>>> [2] https://patchwork.freedesktop.org/series/112188/
>>> [3] https://patchwork.freedesktop.org/series/114772/
>>> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
>>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
>>>
>>> Matthew Brost (8):
>>>     drm/sched: Convert drm scheduler to use a work queue rather than
>>>       kthread
>>>     drm/sched: Move schedule policy to scheduler / entity
>>>     drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
>>>     drm/sched: Add generic scheduler message interface
>>>     drm/sched: Start run wq before TDR in drm_sched_start
>>>     drm/sched: Submit job before starting TDR
>>>     drm/sched: Add helper to set TDR timeout
>>>     drm/syncobj: Warn on long running dma-fences
>>>
>>> Thomas Hellström (2):
>>>     dma-buf/dma-fence: Introduce long-running completion fences
>>>     drm/sched: Support long-running sched entities
>>>
>>>    drivers/dma-buf/dma-fence.c                 | 142 +++++++---
>>>    drivers/dma-buf/dma-resv.c                  |   5 +
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
>>>    drivers/gpu/drm/drm_syncobj.c               |   5 +-
>>>    drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
>>>    drivers/gpu/drm/lima/lima_sched.c           |   5 +-
>>>    drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
>>>    drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
>>>    drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
>>>    drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
>>>    drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
>>>    drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
>>>    drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
>>>    include/drm/gpu_scheduler.h                 | 130 +++++++--
>>>    include/linux/dma-fence.h                   |  60 ++++-
>>>    16 files changed, 649 insertions(+), 184 deletions(-)
>>>
Daniel Vetter April 5, 2023, 8:34 a.m. UTC | #17
On Wed, Apr 05, 2023 at 09:41:23AM +0200, Christian König wrote:
> Am 04.04.23 um 15:37 schrieb Matthew Brost:
> > On Tue, Apr 04, 2023 at 11:13:28AM +0200, Christian König wrote:
> > > Hi,
> > > 
> > > Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > Hello,
> > > > 
> > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > > > have been asked to merge our common DRM scheduler patches first as well
> > > > as develop a common solution for long running workloads with the DRM
> > > > scheduler. This RFC series is our first attempt at doing this. We
> > > > welcome any and all feedback.
> > > > 
> > > > This can we thought of as 4 parts detailed below.
> > > > 
> > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > > > entity (patches 1-3)
> > > > 
> > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > > > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > > > severals problems as the DRM was originally designed to schedule jobs on
> > > > hardware queues. The main problem being that DRM scheduler expects the
> > > > submission order of jobs to be the completion order of jobs even across
> > > > multiple entities. This assumption falls apart with a firmware scheduler
> > > > as a firmware scheduler has no concept of jobs and jobs can complete out
> > > > of order. A novel solution for was originally thought of by Faith during
> > > > the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > > > and entity. I believe the AGX driver [3] is using this approach and
> > > > Boris may use approach as well for the Mali driver [4].
> > > > 
> > > > To support a 1 to 1 relationship we move the main execution function
> > > > from a kthread to a work queue and add a new scheduling mode which
> > > > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > > > The new scheduling mode should unify all drivers usage with a 1 to 1
> > > > relationship and can be thought of as using scheduler as a dependency /
> > > > infligt job tracker rather than a true scheduler.
> > > > 
> > > > - Generic messaging interface for DRM scheduler
> > > > 
> > > > Idea is to be able to communicate to the submission backend with in band
> > > > (relative to main execution function) messages. Messages are backend
> > > > defined and flexable enough for any use case. In Xe we use these
> > > > messages to clean up entites, set properties for entites, and suspend /
> > > > resume execution of an entity [5]. I suspect other driver can leverage
> > > > this messaging concept too as it a convenient way to avoid races in the
> > > > backend.
> > > Oh, please absolutely *don't* do this.
> > > 
> > > This is basically the design which makes a bunch of stuff so horrible broken
> > > on Windows.
> > > 
> > > I can explain it in more detail if necessary, but I strongly recommend to
> > > not go down this path.
> > > 
> > I'm afraid we are going to have to discuss this further. Let me explain
> > my reasoning, basically the idea is to have a single main entry point to
> > backend - the work queue. This avoids the need for lock between run_job
> > and any message that changes an entites state, also it really helps
> > during the reset flows (either TDR or GT reset) as we can call
> > drm_sched_run_wq_stop can ensure that nothing else is in the backend
> > changing an entity state. It all works out really nicely actually, our
> > GuC backend is incredibly stable (hasn't really had a bug pop in about a
> > year) and way simpler than what we did in the i915. I think the simplity
> > to largely due to this design of limiting the entry points.
> > 
> > I personally don't see how this a poor design, limiting entry points
> > absolutely makes sense to me, if it didn't why not just call cleanup_job
> > bypassing the main execution thread (now worker), this is the exact same
> > concept.
> 
> Well then I strongly suggest to read a few analyses on the failure of the
> message processing loop on Windows.
> 
> Have you ever wondered why classic Win32 applications sometimes seems to be
> stuck and don't do anything? This design pattern combine with timeouts to
> solve deadlocks is the reason for that.
> 
> The major problem with this approach is that analyzing tools like lockdep
> have a hard time grasping the dependencies.

wq is fully annotated and actually splats. Plain kthread doesn't, without
adding something like the dma_fence_signalling stuff.

But yeah if you block badly in the work items and stall the entire queue,
then things go sideways real bad. There's not really any tools we have in
the kernel to enforce this, since we still want to allow mutex and
sleeping and stuff like that.

> What you can do is to offload all your operations which are supposed to be
> run in the same thread as work items into a work queue. This is something
> lockdep understands and is able to scream out lout if someone messes up the
> deadlock dependencies.

I thought that's the plan here? Or at least what I thought the plan was,
and why I really think we need a per engine worqqueue to make it work well
(and also why I suggested the refactoring to split up drm_scheduler into
the driver api struct, which stays per-engine, and the internal backend
which would be per drm_sched_entity for fw schedulers that round-robin gpu
ctx on their own).

Also maybe we need to allow drivers to pass in the workqueue like we allow
for the tdr handling already, since that simplifies the locking.

At least for intel gpu I think this message passing design makes some
sense because fundamentally the fw only has a single blocking message
queue. And so intel/xe fundamentally needs to deal with the "stupid code
might block forward progress for everyone" problem you're describing, not
matter whether it's done with the help of drm/sched infra or not.

I do agree though that we shouldn't encourage drivers to use this which
don't have that kind of fw command queue design. So maybe a huge comment
to explain when (and _only_ when) it's ok to use that message passing
would be needed, and explaining why in other cases it would make things a
lot worse?
-Daniel

> 
> Regards,
> Christian.
> 
> > 
> > FWIW Asahi liked the idea as well and think it could be useful for AGX.
> > Matt
> > 
> > > Regards,
> > > Christian.
> > > 
> > > > - Support for using TDR for all error paths of a scheduler / entity
> > > > 
> > > > Fix a few races / bugs, add function to dynamically set the TDR timeout.
> > > > 
> > > > - Annotate dma-fences for long running workloads.
> > > > 
> > > > The idea here is to use dma-fences only as sync points within the
> > > > scheduler and never export them for long running workloads. By
> > > > annotating these fences as long running we ensure that these dma-fences
> > > > are never used in a way that breaks the dma-fence rules. A benefit of
> > > > thus approach is the scheduler can still safely flow control the
> > > > execution ring buffer via the job limit without breaking the dma-fence
> > > > rules.
> > > > 
> > > > Again this a first draft and looking forward to feedback.
> > > > 
> > > > Enjoy - Matt
> > > > 
> > > > [1] https://gitlab.freedesktop.org/drm/xe/kernel
> > > > [2] https://patchwork.freedesktop.org/series/112188/
> > > > [3] https://patchwork.freedesktop.org/series/114772/
> > > > [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> > > > [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> > > > 
> > > > Matthew Brost (8):
> > > >     drm/sched: Convert drm scheduler to use a work queue rather than
> > > >       kthread
> > > >     drm/sched: Move schedule policy to scheduler / entity
> > > >     drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
> > > >     drm/sched: Add generic scheduler message interface
> > > >     drm/sched: Start run wq before TDR in drm_sched_start
> > > >     drm/sched: Submit job before starting TDR
> > > >     drm/sched: Add helper to set TDR timeout
> > > >     drm/syncobj: Warn on long running dma-fences
> > > > 
> > > > Thomas Hellström (2):
> > > >     dma-buf/dma-fence: Introduce long-running completion fences
> > > >     drm/sched: Support long-running sched entities
> > > > 
> > > >    drivers/dma-buf/dma-fence.c                 | 142 +++++++---
> > > >    drivers/dma-buf/dma-resv.c                  |   5 +
> > > >    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> > > >    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
> > > >    drivers/gpu/drm/drm_syncobj.c               |   5 +-
> > > >    drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
> > > >    drivers/gpu/drm/lima/lima_sched.c           |   5 +-
> > > >    drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
> > > >    drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
> > > >    drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
> > > >    drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
> > > >    drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
> > > >    drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
> > > >    drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
> > > >    include/drm/gpu_scheduler.h                 | 130 +++++++--
> > > >    include/linux/dma-fence.h                   |  60 ++++-
> > > >    16 files changed, 649 insertions(+), 184 deletions(-)
> > > > 
>
Daniel Vetter April 5, 2023, 8:42 a.m. UTC | #18
On Wed, Apr 05, 2023 at 09:30:11AM +0200, Christian König wrote:
> Am 04.04.23 um 20:08 schrieb Matthew Brost:
> > On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
> > > Hi Matt, Thomas,
> > > 
> > > Some very bold out of box thinking in this area:
> > > 
> > > 1. so you want to use drm scheduler and dma-fence for long running workload. Why you want to do this in the first place? What is the benefit? Drm scheduler is pretty much a software scheduler. Modern gpu has scheduler built at fw/hw level, as you said below for intel this is Guc. Can xe driver just directly submit job to Guc, bypassing drm scheduler?
> > > 
> > If we did that now we have 2 paths for dependency track, flow controling
> > the ring, resets / error handling / backend submission implementations.
> > We don't want this.
> 
> Well exactly that's the point: Why?
> 
> As far as I can see that are two completely distinct use cases, so you
> absolutely do want two completely distinct implementations for this.
> 
> > > 2. using dma-fence for long run workload: I am well aware that page fault (and the consequent memory allocation/lock acquiring to fix the fault) can cause deadlock for a dma-fence wait. But I am not convinced that dma-fence can't be used purely because the nature of the workload that it runs very long (indefinite). I did a math: the dma_fence_wait_timeout function's third param is the timeout which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not long enough, can we just change the timeout parameter to signed 64 bits so it is much longer than our life time...
> > > 
> > > So I mainly argue we can't use dma-fence for long-run workload is not because the workload runs very long, rather because of the fact that we use page fault for long-run workload. If we enable page fault for short-run workload, we can't use dma-fence either. Page fault is the key thing here.
> > > 
> > > Now since we use page fault which is *fundamentally* controversial with dma-fence design, why now just introduce a independent concept such as user-fence instead of extending existing dma-fence?
> > > 
> > > I like unified design. If drm scheduler, dma-fence can be extended to work for everything, it is beautiful. But seems we have some fundamental problem here.
> > > 
> > Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use
> > the signal / CB infrastructure) and enforce we don't use use these
> > dma-fences from the scheduler in memory reclaim paths or export these to
> > user space or other drivers. Think of this mode as SW only fence.
> 
> Yeah and I truly think this is an really bad idea.
> 
> The signal/CB infrastructure in the dma_fence turned out to be the
> absolutely nightmare I initially predicted. Sorry to say that, but in this
> case the "I've told you so" is appropriate in my opinion.
> 
> If we need infrastructure for long running dependency tracking we should
> encapsulate that in a new framework and not try to mangle the existing code
> for something it was never intended for.

Concurring hard (already typed that up somewhere else). I'd go one step
further and ask whether the kernel really has to handle dependencies for
these long-running compute jobs. The entire design with userspace memory
fences assumes that this is userspace's job.

Also for drm_syncobj we've also pushed a lot of the dependency handling to
userspace, with submit threads in mesa. So if there is any blocking to be
done (running out of ring space), why can't we sort that out the same way?
Meaning:
1. superfast direct-to-hw submit path (using doorbells or whatever)
2. submit ioctl which only succeds if it doesn't have to do any userspace
memory fence waits, otherwise you get EWOULDBLOCK
3. userspace sorts out the mess in a submit thread if it gets an
EWOULDBLOCK, because fundamentally the kernel cannot guarantee a
bottomless queue. If userspace wants bottomless, they need to handle the
allocating and delaying imo

You can even make 3 entirely as-needed, which means for the usual
fast-path you'll never see the userspace thread created unless you do hit
an EWOULDBLOCK.

If we insist that the kernel handles the long-running dependencies fully
then all we end up doing is implementing step 3, but entirely in the
kernel instead of userspace. And in the kernel every bug gets you halfway
to a CVE, and I just don't think that makes much sense for something which
is the fallback of the fallback - once you run out of ring space you're
not going to have a great day not matter what.

I'd go as far and say if we want step 3 in the kernel someone needs to
supply the real-world (i.e. real application running real workloads, not
some microbenchmark) benchmark to proof it's actually worth the pain.
Otherwise on-demand userspace submit thread.
-Daniel

> 
> Christian.
> 
> > 
> > Matt
> > > Thanks,
> > > Oak
> > > 
> > > > -----Original Message-----
> > > > From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
> > > > Matthew Brost
> > > > Sent: April 3, 2023 8:22 PM
> > > > To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org
> > > > Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com; airlied@linux.ie;
> > > > lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew
> > > > <matthew.brost@intel.com>; christian.koenig@amd.com;
> > > > faith.ekstrand@collabora.com
> > > > Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload plans
> > > > 
> > > > Hello,
> > > > 
> > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > > > have been asked to merge our common DRM scheduler patches first as well
> > > > as develop a common solution for long running workloads with the DRM
> > > > scheduler. This RFC series is our first attempt at doing this. We
> > > > welcome any and all feedback.
> > > > 
> > > > This can we thought of as 4 parts detailed below.
> > > > 
> > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > > > entity (patches 1-3)
> > > > 
> > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > > > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > > > severals problems as the DRM was originally designed to schedule jobs on
> > > > hardware queues. The main problem being that DRM scheduler expects the
> > > > submission order of jobs to be the completion order of jobs even across
> > > > multiple entities. This assumption falls apart with a firmware scheduler
> > > > as a firmware scheduler has no concept of jobs and jobs can complete out
> > > > of order. A novel solution for was originally thought of by Faith during
> > > > the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > > > and entity. I believe the AGX driver [3] is using this approach and
> > > > Boris may use approach as well for the Mali driver [4].
> > > > 
> > > > To support a 1 to 1 relationship we move the main execution function
> > > > from a kthread to a work queue and add a new scheduling mode which
> > > > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > > > The new scheduling mode should unify all drivers usage with a 1 to 1
> > > > relationship and can be thought of as using scheduler as a dependency /
> > > > infligt job tracker rather than a true scheduler.
> > > > 
> > > > - Generic messaging interface for DRM scheduler
> > > > 
> > > > Idea is to be able to communicate to the submission backend with in band
> > > > (relative to main execution function) messages. Messages are backend
> > > > defined and flexable enough for any use case. In Xe we use these
> > > > messages to clean up entites, set properties for entites, and suspend /
> > > > resume execution of an entity [5]. I suspect other driver can leverage
> > > > this messaging concept too as it a convenient way to avoid races in the
> > > > backend.
> > > > 
> > > > - Support for using TDR for all error paths of a scheduler / entity
> > > > 
> > > > Fix a few races / bugs, add function to dynamically set the TDR timeout.
> > > > 
> > > > - Annotate dma-fences for long running workloads.
> > > > 
> > > > The idea here is to use dma-fences only as sync points within the
> > > > scheduler and never export them for long running workloads. By
> > > > annotating these fences as long running we ensure that these dma-fences
> > > > are never used in a way that breaks the dma-fence rules. A benefit of
> > > > thus approach is the scheduler can still safely flow control the
> > > > execution ring buffer via the job limit without breaking the dma-fence
> > > > rules.
> > > > 
> > > > Again this a first draft and looking forward to feedback.
> > > > 
> > > > Enjoy - Matt
> > > > 
> > > > [1] https://gitlab.freedesktop.org/drm/xe/kernel
> > > > [2] https://patchwork.freedesktop.org/series/112188/
> > > > [3] https://patchwork.freedesktop.org/series/114772/
> > > > [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> > > > [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-
> > > > next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> > > > 
> > > > Matthew Brost (8):
> > > >    drm/sched: Convert drm scheduler to use a work queue rather than
> > > >      kthread
> > > >    drm/sched: Move schedule policy to scheduler / entity
> > > >    drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
> > > >    drm/sched: Add generic scheduler message interface
> > > >    drm/sched: Start run wq before TDR in drm_sched_start
> > > >    drm/sched: Submit job before starting TDR
> > > >    drm/sched: Add helper to set TDR timeout
> > > >    drm/syncobj: Warn on long running dma-fences
> > > > 
> > > > Thomas Hellström (2):
> > > >    dma-buf/dma-fence: Introduce long-running completion fences
> > > >    drm/sched: Support long-running sched entities
> > > > 
> > > >   drivers/dma-buf/dma-fence.c                 | 142 +++++++---
> > > >   drivers/dma-buf/dma-resv.c                  |   5 +
> > > >   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> > > >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
> > > >   drivers/gpu/drm/drm_syncobj.c               |   5 +-
> > > >   drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
> > > >   drivers/gpu/drm/lima/lima_sched.c           |   5 +-
> > > >   drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
> > > >   drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
> > > >   drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
> > > >   drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
> > > >   drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
> > > >   drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
> > > >   drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
> > > >   include/drm/gpu_scheduler.h                 | 130 +++++++--
> > > >   include/linux/dma-fence.h                   |  60 ++++-
> > > >   16 files changed, 649 insertions(+), 184 deletions(-)
> > > > 
> > > > --
> > > > 2.34.1
>
Christian König April 5, 2023, 8:53 a.m. UTC | #19
Am 05.04.23 um 10:34 schrieb Daniel Vetter:
> On Wed, Apr 05, 2023 at 09:41:23AM +0200, Christian König wrote:
>> Am 04.04.23 um 15:37 schrieb Matthew Brost:
>>> On Tue, Apr 04, 2023 at 11:13:28AM +0200, Christian König wrote:
>>>> Hi,
>>>>
>>>> Am 04.04.23 um 02:22 schrieb Matthew Brost:
>>>>> Hello,
>>>>>
>>>>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
>>>>> have been asked to merge our common DRM scheduler patches first as well
>>>>> as develop a common solution for long running workloads with the DRM
>>>>> scheduler. This RFC series is our first attempt at doing this. We
>>>>> welcome any and all feedback.
>>>>>
>>>>> This can we thought of as 4 parts detailed below.
>>>>>
>>>>> - DRM scheduler changes for 1 to 1 relationship between scheduler and
>>>>> entity (patches 1-3)
>>>>>
>>>>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
>>>>> GuC) which is a new paradigm WRT to the DRM scheduler and presents
>>>>> severals problems as the DRM was originally designed to schedule jobs on
>>>>> hardware queues. The main problem being that DRM scheduler expects the
>>>>> submission order of jobs to be the completion order of jobs even across
>>>>> multiple entities. This assumption falls apart with a firmware scheduler
>>>>> as a firmware scheduler has no concept of jobs and jobs can complete out
>>>>> of order. A novel solution for was originally thought of by Faith during
>>>>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
>>>>> and entity. I believe the AGX driver [3] is using this approach and
>>>>> Boris may use approach as well for the Mali driver [4].
>>>>>
>>>>> To support a 1 to 1 relationship we move the main execution function
>>>>> from a kthread to a work queue and add a new scheduling mode which
>>>>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
>>>>> The new scheduling mode should unify all drivers usage with a 1 to 1
>>>>> relationship and can be thought of as using scheduler as a dependency /
>>>>> infligt job tracker rather than a true scheduler.
>>>>>
>>>>> - Generic messaging interface for DRM scheduler
>>>>>
>>>>> Idea is to be able to communicate to the submission backend with in band
>>>>> (relative to main execution function) messages. Messages are backend
>>>>> defined and flexable enough for any use case. In Xe we use these
>>>>> messages to clean up entites, set properties for entites, and suspend /
>>>>> resume execution of an entity [5]. I suspect other driver can leverage
>>>>> this messaging concept too as it a convenient way to avoid races in the
>>>>> backend.
>>>> Oh, please absolutely *don't* do this.
>>>>
>>>> This is basically the design which makes a bunch of stuff so horrible broken
>>>> on Windows.
>>>>
>>>> I can explain it in more detail if necessary, but I strongly recommend to
>>>> not go down this path.
>>>>
>>> I'm afraid we are going to have to discuss this further. Let me explain
>>> my reasoning, basically the idea is to have a single main entry point to
>>> backend - the work queue. This avoids the need for lock between run_job
>>> and any message that changes an entites state, also it really helps
>>> during the reset flows (either TDR or GT reset) as we can call
>>> drm_sched_run_wq_stop can ensure that nothing else is in the backend
>>> changing an entity state. It all works out really nicely actually, our
>>> GuC backend is incredibly stable (hasn't really had a bug pop in about a
>>> year) and way simpler than what we did in the i915. I think the simplity
>>> to largely due to this design of limiting the entry points.
>>>
>>> I personally don't see how this a poor design, limiting entry points
>>> absolutely makes sense to me, if it didn't why not just call cleanup_job
>>> bypassing the main execution thread (now worker), this is the exact same
>>> concept.
>> Well then I strongly suggest to read a few analyses on the failure of the
>> message processing loop on Windows.
>>
>> Have you ever wondered why classic Win32 applications sometimes seems to be
>> stuck and don't do anything? This design pattern combine with timeouts to
>> solve deadlocks is the reason for that.
>>
>> The major problem with this approach is that analyzing tools like lockdep
>> have a hard time grasping the dependencies.
> wq is fully annotated and actually splats. Plain kthread doesn't, without
> adding something like the dma_fence_signalling stuff.
>
> But yeah if you block badly in the work items and stall the entire queue,
> then things go sideways real bad. There's not really any tools we have in
> the kernel to enforce this, since we still want to allow mutex and
> sleeping and stuff like that.
>
>> What you can do is to offload all your operations which are supposed to be
>> run in the same thread as work items into a work queue. This is something
>> lockdep understands and is able to scream out lout if someone messes up the
>> deadlock dependencies.
> I thought that's the plan here?

At least from my impression that didn't looked like what was implemented 
here.

>   Or at least what I thought the plan was,
> and why I really think we need a per engine worqqueue to make it work well
> (and also why I suggested the refactoring to split up drm_scheduler into
> the driver api struct, which stays per-engine, and the internal backend
> which would be per drm_sched_entity for fw schedulers that round-robin gpu
> ctx on their own).
>
> Also maybe we need to allow drivers to pass in the workqueue like we allow
> for the tdr handling already, since that simplifies the locking.
>
> At least for intel gpu I think this message passing design makes some
> sense because fundamentally the fw only has a single blocking message
> queue. And so intel/xe fundamentally needs to deal with the "stupid code
> might block forward progress for everyone" problem you're describing, not
> matter whether it's done with the help of drm/sched infra or not.
>
> I do agree though that we shouldn't encourage drivers to use this which
> don't have that kind of fw command queue design. So maybe a huge comment
> to explain when (and _only_ when) it's ok to use that message passing
> would be needed, and explaining why in other cases it would make things a
> lot worse?

I would approach it from the complete other side. This component here is 
a tool to decide what job should run next.

How that is then signaled and run should not be part of the scheduler, 
but another more higher level component.

This way you also don't have a problem with not using DMA-fences as 
dependencies as well as constrains for running more jobs.

Christian.

> -Daniel
>
>> Regards,
>> Christian.
>>
>>> FWIW Asahi liked the idea as well and think it could be useful for AGX.
>>> Matt
>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> - Support for using TDR for all error paths of a scheduler / entity
>>>>>
>>>>> Fix a few races / bugs, add function to dynamically set the TDR timeout.
>>>>>
>>>>> - Annotate dma-fences for long running workloads.
>>>>>
>>>>> The idea here is to use dma-fences only as sync points within the
>>>>> scheduler and never export them for long running workloads. By
>>>>> annotating these fences as long running we ensure that these dma-fences
>>>>> are never used in a way that breaks the dma-fence rules. A benefit of
>>>>> thus approach is the scheduler can still safely flow control the
>>>>> execution ring buffer via the job limit without breaking the dma-fence
>>>>> rules.
>>>>>
>>>>> Again this a first draft and looking forward to feedback.
>>>>>
>>>>> Enjoy - Matt
>>>>>
>>>>> [1] https://gitlab.freedesktop.org/drm/xe/kernel
>>>>> [2] https://patchwork.freedesktop.org/series/112188/
>>>>> [3] https://patchwork.freedesktop.org/series/114772/
>>>>> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
>>>>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
>>>>>
>>>>> Matthew Brost (8):
>>>>>      drm/sched: Convert drm scheduler to use a work queue rather than
>>>>>        kthread
>>>>>      drm/sched: Move schedule policy to scheduler / entity
>>>>>      drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
>>>>>      drm/sched: Add generic scheduler message interface
>>>>>      drm/sched: Start run wq before TDR in drm_sched_start
>>>>>      drm/sched: Submit job before starting TDR
>>>>>      drm/sched: Add helper to set TDR timeout
>>>>>      drm/syncobj: Warn on long running dma-fences
>>>>>
>>>>> Thomas Hellström (2):
>>>>>      dma-buf/dma-fence: Introduce long-running completion fences
>>>>>      drm/sched: Support long-running sched entities
>>>>>
>>>>>     drivers/dma-buf/dma-fence.c                 | 142 +++++++---
>>>>>     drivers/dma-buf/dma-resv.c                  |   5 +
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
>>>>>     drivers/gpu/drm/drm_syncobj.c               |   5 +-
>>>>>     drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
>>>>>     drivers/gpu/drm/lima/lima_sched.c           |   5 +-
>>>>>     drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
>>>>>     drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
>>>>>     drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
>>>>>     drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
>>>>>     drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
>>>>>     drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
>>>>>     drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
>>>>>     include/drm/gpu_scheduler.h                 | 130 +++++++--
>>>>>     include/linux/dma-fence.h                   |  60 ++++-
>>>>>     16 files changed, 649 insertions(+), 184 deletions(-)
>>>>>
Daniel Vetter April 5, 2023, 9:07 a.m. UTC | #20
On Wed, Apr 05, 2023 at 10:53:26AM +0200, Christian König wrote:
> Am 05.04.23 um 10:34 schrieb Daniel Vetter:
> > On Wed, Apr 05, 2023 at 09:41:23AM +0200, Christian König wrote:
> > > Am 04.04.23 um 15:37 schrieb Matthew Brost:
> > > > On Tue, Apr 04, 2023 at 11:13:28AM +0200, Christian König wrote:
> > > > > Hi,
> > > > > 
> > > > > Am 04.04.23 um 02:22 schrieb Matthew Brost:
> > > > > > Hello,
> > > > > > 
> > > > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > > > > > have been asked to merge our common DRM scheduler patches first as well
> > > > > > as develop a common solution for long running workloads with the DRM
> > > > > > scheduler. This RFC series is our first attempt at doing this. We
> > > > > > welcome any and all feedback.
> > > > > > 
> > > > > > This can we thought of as 4 parts detailed below.
> > > > > > 
> > > > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > > > > > entity (patches 1-3)
> > > > > > 
> > > > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > > > > > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > > > > > severals problems as the DRM was originally designed to schedule jobs on
> > > > > > hardware queues. The main problem being that DRM scheduler expects the
> > > > > > submission order of jobs to be the completion order of jobs even across
> > > > > > multiple entities. This assumption falls apart with a firmware scheduler
> > > > > > as a firmware scheduler has no concept of jobs and jobs can complete out
> > > > > > of order. A novel solution for was originally thought of by Faith during
> > > > > > the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > > > > > and entity. I believe the AGX driver [3] is using this approach and
> > > > > > Boris may use approach as well for the Mali driver [4].
> > > > > > 
> > > > > > To support a 1 to 1 relationship we move the main execution function
> > > > > > from a kthread to a work queue and add a new scheduling mode which
> > > > > > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > > > > > The new scheduling mode should unify all drivers usage with a 1 to 1
> > > > > > relationship and can be thought of as using scheduler as a dependency /
> > > > > > infligt job tracker rather than a true scheduler.
> > > > > > 
> > > > > > - Generic messaging interface for DRM scheduler
> > > > > > 
> > > > > > Idea is to be able to communicate to the submission backend with in band
> > > > > > (relative to main execution function) messages. Messages are backend
> > > > > > defined and flexable enough for any use case. In Xe we use these
> > > > > > messages to clean up entites, set properties for entites, and suspend /
> > > > > > resume execution of an entity [5]. I suspect other driver can leverage
> > > > > > this messaging concept too as it a convenient way to avoid races in the
> > > > > > backend.
> > > > > Oh, please absolutely *don't* do this.
> > > > > 
> > > > > This is basically the design which makes a bunch of stuff so horrible broken
> > > > > on Windows.
> > > > > 
> > > > > I can explain it in more detail if necessary, but I strongly recommend to
> > > > > not go down this path.
> > > > > 
> > > > I'm afraid we are going to have to discuss this further. Let me explain
> > > > my reasoning, basically the idea is to have a single main entry point to
> > > > backend - the work queue. This avoids the need for lock between run_job
> > > > and any message that changes an entites state, also it really helps
> > > > during the reset flows (either TDR or GT reset) as we can call
> > > > drm_sched_run_wq_stop can ensure that nothing else is in the backend
> > > > changing an entity state. It all works out really nicely actually, our
> > > > GuC backend is incredibly stable (hasn't really had a bug pop in about a
> > > > year) and way simpler than what we did in the i915. I think the simplity
> > > > to largely due to this design of limiting the entry points.
> > > > 
> > > > I personally don't see how this a poor design, limiting entry points
> > > > absolutely makes sense to me, if it didn't why not just call cleanup_job
> > > > bypassing the main execution thread (now worker), this is the exact same
> > > > concept.
> > > Well then I strongly suggest to read a few analyses on the failure of the
> > > message processing loop on Windows.
> > > 
> > > Have you ever wondered why classic Win32 applications sometimes seems to be
> > > stuck and don't do anything? This design pattern combine with timeouts to
> > > solve deadlocks is the reason for that.
> > > 
> > > The major problem with this approach is that analyzing tools like lockdep
> > > have a hard time grasping the dependencies.
> > wq is fully annotated and actually splats. Plain kthread doesn't, without
> > adding something like the dma_fence_signalling stuff.
> > 
> > But yeah if you block badly in the work items and stall the entire queue,
> > then things go sideways real bad. There's not really any tools we have in
> > the kernel to enforce this, since we still want to allow mutex and
> > sleeping and stuff like that.
> > 
> > > What you can do is to offload all your operations which are supposed to be
> > > run in the same thread as work items into a work queue. This is something
> > > lockdep understands and is able to scream out lout if someone messes up the
> > > deadlock dependencies.
> > I thought that's the plan here?
> 
> At least from my impression that didn't looked like what was implemented
> here.

Yup the current patches aren't what we want I think, at least not in these
details.

> 
> >   Or at least what I thought the plan was,
> > and why I really think we need a per engine worqqueue to make it work well
> > (and also why I suggested the refactoring to split up drm_scheduler into
> > the driver api struct, which stays per-engine, and the internal backend
> > which would be per drm_sched_entity for fw schedulers that round-robin gpu
> > ctx on their own).
> > 
> > Also maybe we need to allow drivers to pass in the workqueue like we allow
> > for the tdr handling already, since that simplifies the locking.
> > 
> > At least for intel gpu I think this message passing design makes some
> > sense because fundamentally the fw only has a single blocking message
> > queue. And so intel/xe fundamentally needs to deal with the "stupid code
> > might block forward progress for everyone" problem you're describing, not
> > matter whether it's done with the help of drm/sched infra or not.
> > 
> > I do agree though that we shouldn't encourage drivers to use this which
> > don't have that kind of fw command queue design. So maybe a huge comment
> > to explain when (and _only_ when) it's ok to use that message passing
> > would be needed, and explaining why in other cases it would make things a
> > lot worse?
> 
> I would approach it from the complete other side. This component here is a
> tool to decide what job should run next.
> 
> How that is then signaled and run should not be part of the scheduler, but
> another more higher level component.
> 
> This way you also don't have a problem with not using DMA-fences as
> dependencies as well as constrains for running more jobs.

I think we're talking about two things here and mixing them up.

For the dependencies I agree with you, and imo that higher level tool
should probably just be an on-demand submit thread in userspace for the
rare case where the kernel would need to sort out a dependency otherwise
(due to running out of ringspace in the per-ctx ringbuffer).

The other thing is the message passing stuff, and this is what I was
talking about above. This has nothing to do with handling dependencies,
but with talking to the gpu fw. Here the intel design issue is that the fw
only provides a single queue, and it's in-order. Which means it
fundamentally has the stalling issue you describe as a point against a
message passing design. And fundamentally we need to be able to talk to
the fw in the scheduler ->run_job callback.

The proposal here for the message passing part is that since it has the
stalling issue already anyway, and the scheduler needs to be involved
anyway, it makes sense to integrated this (as an optional thing, only for
drivers which have this kind of fw interface) into the scheduler.
Otherwise you just end up with two layers for no reason and more ping-pong
delay because the ->run_job needs to kick off the subordinate driver layer
first. Note that for this case the optional message passing support in the
drm/scheduler actually makes things better, because it allows you to cut
out one layer.

Of course if a driver with better fw interface uses this message passing
support, then that's bad. Hence the big warning in the kerneldoc.
-Daniel

> 
> Christian.
> 
> > -Daniel
> > 
> > > Regards,
> > > Christian.
> > > 
> > > > FWIW Asahi liked the idea as well and think it could be useful for AGX.
> > > > Matt
> > > > 
> > > > > Regards,
> > > > > Christian.
> > > > > 
> > > > > > - Support for using TDR for all error paths of a scheduler / entity
> > > > > > 
> > > > > > Fix a few races / bugs, add function to dynamically set the TDR timeout.
> > > > > > 
> > > > > > - Annotate dma-fences for long running workloads.
> > > > > > 
> > > > > > The idea here is to use dma-fences only as sync points within the
> > > > > > scheduler and never export them for long running workloads. By
> > > > > > annotating these fences as long running we ensure that these dma-fences
> > > > > > are never used in a way that breaks the dma-fence rules. A benefit of
> > > > > > thus approach is the scheduler can still safely flow control the
> > > > > > execution ring buffer via the job limit without breaking the dma-fence
> > > > > > rules.
> > > > > > 
> > > > > > Again this a first draft and looking forward to feedback.
> > > > > > 
> > > > > > Enjoy - Matt
> > > > > > 
> > > > > > [1] https://gitlab.freedesktop.org/drm/xe/kernel
> > > > > > [2] https://patchwork.freedesktop.org/series/112188/
> > > > > > [3] https://patchwork.freedesktop.org/series/114772/
> > > > > > [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> > > > > > [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> > > > > > 
> > > > > > Matthew Brost (8):
> > > > > >      drm/sched: Convert drm scheduler to use a work queue rather than
> > > > > >        kthread
> > > > > >      drm/sched: Move schedule policy to scheduler / entity
> > > > > >      drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
> > > > > >      drm/sched: Add generic scheduler message interface
> > > > > >      drm/sched: Start run wq before TDR in drm_sched_start
> > > > > >      drm/sched: Submit job before starting TDR
> > > > > >      drm/sched: Add helper to set TDR timeout
> > > > > >      drm/syncobj: Warn on long running dma-fences
> > > > > > 
> > > > > > Thomas Hellström (2):
> > > > > >      dma-buf/dma-fence: Introduce long-running completion fences
> > > > > >      drm/sched: Support long-running sched entities
> > > > > > 
> > > > > >     drivers/dma-buf/dma-fence.c                 | 142 +++++++---
> > > > > >     drivers/dma-buf/dma-resv.c                  |   5 +
> > > > > >     drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> > > > > >     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
> > > > > >     drivers/gpu/drm/drm_syncobj.c               |   5 +-
> > > > > >     drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
> > > > > >     drivers/gpu/drm/lima/lima_sched.c           |   5 +-
> > > > > >     drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
> > > > > >     drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
> > > > > >     drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
> > > > > >     drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
> > > > > >     drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
> > > > > >     drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
> > > > > >     drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
> > > > > >     include/drm/gpu_scheduler.h                 | 130 +++++++--
> > > > > >     include/linux/dma-fence.h                   |  60 ++++-
> > > > > >     16 files changed, 649 insertions(+), 184 deletions(-)
> > > > > > 
>
Christian König April 5, 2023, 9:57 a.m. UTC | #21
Am 05.04.23 um 11:07 schrieb Daniel Vetter:
> [SNIP]
>> I would approach it from the complete other side. This component here is a
>> tool to decide what job should run next.
>>
>> How that is then signaled and run should not be part of the scheduler, but
>> another more higher level component.
>>
>> This way you also don't have a problem with not using DMA-fences as
>> dependencies as well as constrains for running more jobs.
> I think we're talking about two things here and mixing them up.
>
> For the dependencies I agree with you, and imo that higher level tool
> should probably just be an on-demand submit thread in userspace for the
> rare case where the kernel would need to sort out a dependency otherwise
> (due to running out of ringspace in the per-ctx ringbuffer).
>
> The other thing is the message passing stuff, and this is what I was
> talking about above. This has nothing to do with handling dependencies,
> but with talking to the gpu fw. Here the intel design issue is that the fw
> only provides a single queue, and it's in-order. Which means it
> fundamentally has the stalling issue you describe as a point against a
> message passing design. And fundamentally we need to be able to talk to
> the fw in the scheduler ->run_job callback.
>
> The proposal here for the message passing part is that since it has the
> stalling issue already anyway, and the scheduler needs to be involved
> anyway, it makes sense to integrated this (as an optional thing, only for
> drivers which have this kind of fw interface) into the scheduler.
> Otherwise you just end up with two layers for no reason and more ping-pong
> delay because the ->run_job needs to kick off the subordinate driver layer
> first. Note that for this case the optional message passing support in the
> drm/scheduler actually makes things better, because it allows you to cut
> out one layer.
>
> Of course if a driver with better fw interface uses this message passing
> support, then that's bad. Hence the big warning in the kerneldoc.

Well what I wanted to say is that if you design the dependency handling 
/ scheduler properly you don't need the message passing through it.

For example if the GPU scheduler component uses a work item to do it's 
handling instead of a kthread you could also let the driver specify the 
work queue where this work item is executed on.

When you design it like this the driver specifies the thread context of 
execution for it's job. In other words it can specify a single threaded 
firmware work queue as well.

When you then have other messages which needs to be passed to the 
firmware you can also use the same single threaded workqueue for this.

Drivers which have a different firmware interface would just use one of 
the system work queues instead.

This approach basically decouples the GPU scheduler component from the 
message passing functionality.

Regards,
Christian.


> -Daniel
>
Daniel Vetter April 5, 2023, 10:12 a.m. UTC | #22
On Wed, 5 Apr 2023 at 11:57, Christian König <christian.koenig@amd.com> wrote:
>
> Am 05.04.23 um 11:07 schrieb Daniel Vetter:
> > [SNIP]
> >> I would approach it from the complete other side. This component here is a
> >> tool to decide what job should run next.
> >>
> >> How that is then signaled and run should not be part of the scheduler, but
> >> another more higher level component.
> >>
> >> This way you also don't have a problem with not using DMA-fences as
> >> dependencies as well as constrains for running more jobs.
> > I think we're talking about two things here and mixing them up.
> >
> > For the dependencies I agree with you, and imo that higher level tool
> > should probably just be an on-demand submit thread in userspace for the
> > rare case where the kernel would need to sort out a dependency otherwise
> > (due to running out of ringspace in the per-ctx ringbuffer).
> >
> > The other thing is the message passing stuff, and this is what I was
> > talking about above. This has nothing to do with handling dependencies,
> > but with talking to the gpu fw. Here the intel design issue is that the fw
> > only provides a single queue, and it's in-order. Which means it
> > fundamentally has the stalling issue you describe as a point against a
> > message passing design. And fundamentally we need to be able to talk to
> > the fw in the scheduler ->run_job callback.
> >
> > The proposal here for the message passing part is that since it has the
> > stalling issue already anyway, and the scheduler needs to be involved
> > anyway, it makes sense to integrated this (as an optional thing, only for
> > drivers which have this kind of fw interface) into the scheduler.
> > Otherwise you just end up with two layers for no reason and more ping-pong
> > delay because the ->run_job needs to kick off the subordinate driver layer
> > first. Note that for this case the optional message passing support in the
> > drm/scheduler actually makes things better, because it allows you to cut
> > out one layer.
> >
> > Of course if a driver with better fw interface uses this message passing
> > support, then that's bad. Hence the big warning in the kerneldoc.
>
> Well what I wanted to say is that if you design the dependency handling
> / scheduler properly you don't need the message passing through it.
>
> For example if the GPU scheduler component uses a work item to do it's
> handling instead of a kthread you could also let the driver specify the
> work queue where this work item is executed on.
>
> When you design it like this the driver specifies the thread context of
> execution for it's job. In other words it can specify a single threaded
> firmware work queue as well.
>
> When you then have other messages which needs to be passed to the
> firmware you can also use the same single threaded workqueue for this.
>
> Drivers which have a different firmware interface would just use one of
> the system work queues instead.
>
> This approach basically decouples the GPU scheduler component from the
> message passing functionality.

Hm I guess we've been talking past each another big time, because
that's really what I thought was under discussions? Essentially the
current rfc, but implementing with some polish.

iow I agree with you (I think at least).
-Daniel
Zeng, Oak April 5, 2023, 6:06 p.m. UTC | #23
Hi,

Using dma-fence for completion/dependency tracking for long-run workload(more precisely on-demand paging/page fault enabled workload) can cause deadlock. This seems the significant issue here. Other issues such as the drm scheduler completion order implication etc are minors which can be solve inside the framework of drm scheduler. We need to evaluate below paths:

	1) still use drm scheduler for job submission, and use dma-fence for job completion waiting/dependency tracking. This is solution proposed in this series. Annotate dma-fence for long-run workload: user can still wait dma-fence for job completion but can't wait dma-fence while holding any memory management locks.  We still use dma-fence for dependency tracking. But it is just very easily run into deadlock when on-demand paging is in the picture. The annotation helps us to detect deadlock but not solve deadlock problems. Seems *not* a complete solution: It is almost impossible to completely avoid dependency deadlock in complex runtime environment
	
	2) Still use drm scheduler but not use dma-fence for completion signaling and dependency tracking. This way we still get some free functions (reset, err handling ring flow control as Matt said)from drm scheduler, but push the dependency/completion tracking completely to user space using techniques such as user space fence. User space doesn't have chance to wait fence while holding a kernel memory management lock, thus the dma-fence deadlock issue is solved.
	
	3) Completely discard drm scheduler and dma-fence for long-run workload. Use user queue/doorbell for super fast submission, directly interact with fw scheduler. Use user fence for completion/dependency tracking.

Thanks,
Oak

> -----Original Message-----
> From: Christian König <christian.koenig@amd.com>
> Sent: April 5, 2023 3:30 AM
> To: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak
> <oak.zeng@intel.com>
> Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org;
> robdclark@chromium.org; thomas.hellstrom@linux.intel.com; airlied@linux.ie;
> lina@asahilina.net; boris.brezillon@collabora.com; faith.ekstrand@collabora.com
> Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> plans
> 
> Am 04.04.23 um 20:08 schrieb Matthew Brost:
> > On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
> >> Hi Matt, Thomas,
> >>
> >> Some very bold out of box thinking in this area:
> >>
> >> 1. so you want to use drm scheduler and dma-fence for long running workload.
> Why you want to do this in the first place? What is the benefit? Drm scheduler is
> pretty much a software scheduler. Modern gpu has scheduler built at fw/hw
> level, as you said below for intel this is Guc. Can xe driver just directly submit job
> to Guc, bypassing drm scheduler?
> >>
> > If we did that now we have 2 paths for dependency track, flow controling
> > the ring, resets / error handling / backend submission implementations.
> > We don't want this.
> 
> Well exactly that's the point: Why?
> 
> As far as I can see that are two completely distinct use cases, so you
> absolutely do want two completely distinct implementations for this.
> 
> >> 2. using dma-fence for long run workload: I am well aware that page fault (and
> the consequent memory allocation/lock acquiring to fix the fault) can cause
> deadlock for a dma-fence wait. But I am not convinced that dma-fence can't be
> used purely because the nature of the workload that it runs very long (indefinite).
> I did a math: the dma_fence_wait_timeout function's third param is the timeout
> which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not long
> enough, can we just change the timeout parameter to signed 64 bits so it is much
> longer than our life time...
> >>
> >> So I mainly argue we can't use dma-fence for long-run workload is not
> because the workload runs very long, rather because of the fact that we use
> page fault for long-run workload. If we enable page fault for short-run workload,
> we can't use dma-fence either. Page fault is the key thing here.
> >>
> >> Now since we use page fault which is *fundamentally* controversial with
> dma-fence design, why now just introduce a independent concept such as user-
> fence instead of extending existing dma-fence?
> >>
> >> I like unified design. If drm scheduler, dma-fence can be extended to work for
> everything, it is beautiful. But seems we have some fundamental problem here.
> >>
> > Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use
> > the signal / CB infrastructure) and enforce we don't use use these
> > dma-fences from the scheduler in memory reclaim paths or export these to
> > user space or other drivers. Think of this mode as SW only fence.
> 
> Yeah and I truly think this is an really bad idea.
> 
> The signal/CB infrastructure in the dma_fence turned out to be the
> absolutely nightmare I initially predicted. Sorry to say that, but in
> this case the "I've told you so" is appropriate in my opinion.
> 
> If we need infrastructure for long running dependency tracking we should
> encapsulate that in a new framework and not try to mangle the existing
> code for something it was never intended for.
> 
> Christian.
> 
> >
> > Matt
> >
> >> Thanks,
> >> Oak
> >>
> >>> -----Original Message-----
> >>> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
> >>> Matthew Brost
> >>> Sent: April 3, 2023 8:22 PM
> >>> To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org
> >>> Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com;
> airlied@linux.ie;
> >>> lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew
> >>> <matthew.brost@intel.com>; christian.koenig@amd.com;
> >>> faith.ekstrand@collabora.com
> >>> Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> plans
> >>>
> >>> Hello,
> >>>
> >>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> >>> have been asked to merge our common DRM scheduler patches first as well
> >>> as develop a common solution for long running workloads with the DRM
> >>> scheduler. This RFC series is our first attempt at doing this. We
> >>> welcome any and all feedback.
> >>>
> >>> This can we thought of as 4 parts detailed below.
> >>>
> >>> - DRM scheduler changes for 1 to 1 relationship between scheduler and
> >>> entity (patches 1-3)
> >>>
> >>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> >>> GuC) which is a new paradigm WRT to the DRM scheduler and presents
> >>> severals problems as the DRM was originally designed to schedule jobs on
> >>> hardware queues. The main problem being that DRM scheduler expects the
> >>> submission order of jobs to be the completion order of jobs even across
> >>> multiple entities. This assumption falls apart with a firmware scheduler
> >>> as a firmware scheduler has no concept of jobs and jobs can complete out
> >>> of order. A novel solution for was originally thought of by Faith during
> >>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> >>> and entity. I believe the AGX driver [3] is using this approach and
> >>> Boris may use approach as well for the Mali driver [4].
> >>>
> >>> To support a 1 to 1 relationship we move the main execution function
> >>> from a kthread to a work queue and add a new scheduling mode which
> >>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> >>> The new scheduling mode should unify all drivers usage with a 1 to 1
> >>> relationship and can be thought of as using scheduler as a dependency /
> >>> infligt job tracker rather than a true scheduler.
> >>>
> >>> - Generic messaging interface for DRM scheduler
> >>>
> >>> Idea is to be able to communicate to the submission backend with in band
> >>> (relative to main execution function) messages. Messages are backend
> >>> defined and flexable enough for any use case. In Xe we use these
> >>> messages to clean up entites, set properties for entites, and suspend /
> >>> resume execution of an entity [5]. I suspect other driver can leverage
> >>> this messaging concept too as it a convenient way to avoid races in the
> >>> backend.
> >>>
> >>> - Support for using TDR for all error paths of a scheduler / entity
> >>>
> >>> Fix a few races / bugs, add function to dynamically set the TDR timeout.
> >>>
> >>> - Annotate dma-fences for long running workloads.
> >>>
> >>> The idea here is to use dma-fences only as sync points within the
> >>> scheduler and never export them for long running workloads. By
> >>> annotating these fences as long running we ensure that these dma-fences
> >>> are never used in a way that breaks the dma-fence rules. A benefit of
> >>> thus approach is the scheduler can still safely flow control the
> >>> execution ring buffer via the job limit without breaking the dma-fence
> >>> rules.
> >>>
> >>> Again this a first draft and looking forward to feedback.
> >>>
> >>> Enjoy - Matt
> >>>
> >>> [1] https://gitlab.freedesktop.org/drm/xe/kernel
> >>> [2] https://patchwork.freedesktop.org/series/112188/
> >>> [3] https://patchwork.freedesktop.org/series/114772/
> >>> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> >>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-
> >>> next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> >>>
> >>> Matthew Brost (8):
> >>>    drm/sched: Convert drm scheduler to use a work queue rather than
> >>>      kthread
> >>>    drm/sched: Move schedule policy to scheduler / entity
> >>>    drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
> >>>    drm/sched: Add generic scheduler message interface
> >>>    drm/sched: Start run wq before TDR in drm_sched_start
> >>>    drm/sched: Submit job before starting TDR
> >>>    drm/sched: Add helper to set TDR timeout
> >>>    drm/syncobj: Warn on long running dma-fences
> >>>
> >>> Thomas Hellström (2):
> >>>    dma-buf/dma-fence: Introduce long-running completion fences
> >>>    drm/sched: Support long-running sched entities
> >>>
> >>>   drivers/dma-buf/dma-fence.c                 | 142 +++++++---
> >>>   drivers/dma-buf/dma-resv.c                  |   5 +
> >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
> >>>   drivers/gpu/drm/drm_syncobj.c               |   5 +-
> >>>   drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
> >>>   drivers/gpu/drm/lima/lima_sched.c           |   5 +-
> >>>   drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
> >>>   drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
> >>>   drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
> >>>   drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
> >>>   drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
> >>>   drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
> >>>   drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
> >>>   include/drm/gpu_scheduler.h                 | 130 +++++++--
> >>>   include/linux/dma-fence.h                   |  60 ++++-
> >>>   16 files changed, 649 insertions(+), 184 deletions(-)
> >>>
> >>> --
> >>> 2.34.1
Matthew Brost April 5, 2023, 6:53 p.m. UTC | #24
On Wed, Apr 05, 2023 at 12:06:53PM -0600, Zeng, Oak wrote:
> Hi,
> 
> Using dma-fence for completion/dependency tracking for long-run workload(more precisely on-demand paging/page fault enabled workload) can cause deadlock. This seems the significant issue here. Other issues such as the drm scheduler completion order implication etc are minors which can be solve inside the framework of drm scheduler. We need to evaluate below paths:
> 
> 	1) still use drm scheduler for job submission, and use dma-fence for job completion waiting/dependency tracking. This is solution proposed in this series. Annotate dma-fence for long-run workload: user can still wait dma-fence for job completion but can't wait dma-fence while holding any memory management locks.  We still use dma-fence for dependency tracking. But it is just very easily run into deadlock when on-demand paging is in the picture. The annotation helps us to detect deadlock but not solve deadlock problems. Seems *not* a complete solution: It is almost impossible to completely avoid dependency deadlock in complex runtime environment
>

No one can wait on LR fence, so it is impossible to deadlock. The
annotations enforce this. Literally this is only for flow controling the
ring / hold pending jobs in in the DRM schedule list.

> 	2) Still use drm scheduler but not use dma-fence for completion signaling and dependency tracking. This way we still get some free functions (reset, err handling ring flow control as Matt said)from drm scheduler, but push the dependency/completion tracking completely to user space using techniques such as user space fence. User space doesn't have chance to wait fence while holding a kernel memory management lock, thus the dma-fence deadlock issue is solved.
>

We use user space fence for syncs.

> 	3) Completely discard drm scheduler and dma-fence for long-run workload. Use user queue/doorbell for super fast submission, directly interact with fw scheduler. Use user fence for completion/dependency tracking.
> 

This is a hard no from me, I want 1 submission path in Xe. Either we use
the DRM scheduler or we don't.

Matt

> Thanks,
> Oak
> 
> > -----Original Message-----
> > From: Christian König <christian.koenig@amd.com>
> > Sent: April 5, 2023 3:30 AM
> > To: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak
> > <oak.zeng@intel.com>
> > Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org;
> > robdclark@chromium.org; thomas.hellstrom@linux.intel.com; airlied@linux.ie;
> > lina@asahilina.net; boris.brezillon@collabora.com; faith.ekstrand@collabora.com
> > Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> > plans
> > 
> > Am 04.04.23 um 20:08 schrieb Matthew Brost:
> > > On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
> > >> Hi Matt, Thomas,
> > >>
> > >> Some very bold out of box thinking in this area:
> > >>
> > >> 1. so you want to use drm scheduler and dma-fence for long running workload.
> > Why you want to do this in the first place? What is the benefit? Drm scheduler is
> > pretty much a software scheduler. Modern gpu has scheduler built at fw/hw
> > level, as you said below for intel this is Guc. Can xe driver just directly submit job
> > to Guc, bypassing drm scheduler?
> > >>
> > > If we did that now we have 2 paths for dependency track, flow controling
> > > the ring, resets / error handling / backend submission implementations.
> > > We don't want this.
> > 
> > Well exactly that's the point: Why?
> > 
> > As far as I can see that are two completely distinct use cases, so you
> > absolutely do want two completely distinct implementations for this.
> > 
> > >> 2. using dma-fence for long run workload: I am well aware that page fault (and
> > the consequent memory allocation/lock acquiring to fix the fault) can cause
> > deadlock for a dma-fence wait. But I am not convinced that dma-fence can't be
> > used purely because the nature of the workload that it runs very long (indefinite).
> > I did a math: the dma_fence_wait_timeout function's third param is the timeout
> > which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not long
> > enough, can we just change the timeout parameter to signed 64 bits so it is much
> > longer than our life time...
> > >>
> > >> So I mainly argue we can't use dma-fence for long-run workload is not
> > because the workload runs very long, rather because of the fact that we use
> > page fault for long-run workload. If we enable page fault for short-run workload,
> > we can't use dma-fence either. Page fault is the key thing here.
> > >>
> > >> Now since we use page fault which is *fundamentally* controversial with
> > dma-fence design, why now just introduce a independent concept such as user-
> > fence instead of extending existing dma-fence?
> > >>
> > >> I like unified design. If drm scheduler, dma-fence can be extended to work for
> > everything, it is beautiful. But seems we have some fundamental problem here.
> > >>
> > > Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use
> > > the signal / CB infrastructure) and enforce we don't use use these
> > > dma-fences from the scheduler in memory reclaim paths or export these to
> > > user space or other drivers. Think of this mode as SW only fence.
> > 
> > Yeah and I truly think this is an really bad idea.
> > 
> > The signal/CB infrastructure in the dma_fence turned out to be the
> > absolutely nightmare I initially predicted. Sorry to say that, but in
> > this case the "I've told you so" is appropriate in my opinion.
> > 
> > If we need infrastructure for long running dependency tracking we should
> > encapsulate that in a new framework and not try to mangle the existing
> > code for something it was never intended for.
> > 
> > Christian.
> > 
> > >
> > > Matt
> > >
> > >> Thanks,
> > >> Oak
> > >>
> > >>> -----Original Message-----
> > >>> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
> > >>> Matthew Brost
> > >>> Sent: April 3, 2023 8:22 PM
> > >>> To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org
> > >>> Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com;
> > airlied@linux.ie;
> > >>> lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew
> > >>> <matthew.brost@intel.com>; christian.koenig@amd.com;
> > >>> faith.ekstrand@collabora.com
> > >>> Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> > plans
> > >>>
> > >>> Hello,
> > >>>
> > >>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > >>> have been asked to merge our common DRM scheduler patches first as well
> > >>> as develop a common solution for long running workloads with the DRM
> > >>> scheduler. This RFC series is our first attempt at doing this. We
> > >>> welcome any and all feedback.
> > >>>
> > >>> This can we thought of as 4 parts detailed below.
> > >>>
> > >>> - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > >>> entity (patches 1-3)
> > >>>
> > >>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > >>> GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > >>> severals problems as the DRM was originally designed to schedule jobs on
> > >>> hardware queues. The main problem being that DRM scheduler expects the
> > >>> submission order of jobs to be the completion order of jobs even across
> > >>> multiple entities. This assumption falls apart with a firmware scheduler
> > >>> as a firmware scheduler has no concept of jobs and jobs can complete out
> > >>> of order. A novel solution for was originally thought of by Faith during
> > >>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > >>> and entity. I believe the AGX driver [3] is using this approach and
> > >>> Boris may use approach as well for the Mali driver [4].
> > >>>
> > >>> To support a 1 to 1 relationship we move the main execution function
> > >>> from a kthread to a work queue and add a new scheduling mode which
> > >>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > >>> The new scheduling mode should unify all drivers usage with a 1 to 1
> > >>> relationship and can be thought of as using scheduler as a dependency /
> > >>> infligt job tracker rather than a true scheduler.
> > >>>
> > >>> - Generic messaging interface for DRM scheduler
> > >>>
> > >>> Idea is to be able to communicate to the submission backend with in band
> > >>> (relative to main execution function) messages. Messages are backend
> > >>> defined and flexable enough for any use case. In Xe we use these
> > >>> messages to clean up entites, set properties for entites, and suspend /
> > >>> resume execution of an entity [5]. I suspect other driver can leverage
> > >>> this messaging concept too as it a convenient way to avoid races in the
> > >>> backend.
> > >>>
> > >>> - Support for using TDR for all error paths of a scheduler / entity
> > >>>
> > >>> Fix a few races / bugs, add function to dynamically set the TDR timeout.
> > >>>
> > >>> - Annotate dma-fences for long running workloads.
> > >>>
> > >>> The idea here is to use dma-fences only as sync points within the
> > >>> scheduler and never export them for long running workloads. By
> > >>> annotating these fences as long running we ensure that these dma-fences
> > >>> are never used in a way that breaks the dma-fence rules. A benefit of
> > >>> thus approach is the scheduler can still safely flow control the
> > >>> execution ring buffer via the job limit without breaking the dma-fence
> > >>> rules.
> > >>>
> > >>> Again this a first draft and looking forward to feedback.
> > >>>
> > >>> Enjoy - Matt
> > >>>
> > >>> [1] https://gitlab.freedesktop.org/drm/xe/kernel
> > >>> [2] https://patchwork.freedesktop.org/series/112188/
> > >>> [3] https://patchwork.freedesktop.org/series/114772/
> > >>> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> > >>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-
> > >>> next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> > >>>
> > >>> Matthew Brost (8):
> > >>>    drm/sched: Convert drm scheduler to use a work queue rather than
> > >>>      kthread
> > >>>    drm/sched: Move schedule policy to scheduler / entity
> > >>>    drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
> > >>>    drm/sched: Add generic scheduler message interface
> > >>>    drm/sched: Start run wq before TDR in drm_sched_start
> > >>>    drm/sched: Submit job before starting TDR
> > >>>    drm/sched: Add helper to set TDR timeout
> > >>>    drm/syncobj: Warn on long running dma-fences
> > >>>
> > >>> Thomas Hellström (2):
> > >>>    dma-buf/dma-fence: Introduce long-running completion fences
> > >>>    drm/sched: Support long-running sched entities
> > >>>
> > >>>   drivers/dma-buf/dma-fence.c                 | 142 +++++++---
> > >>>   drivers/dma-buf/dma-resv.c                  |   5 +
> > >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> > >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
> > >>>   drivers/gpu/drm/drm_syncobj.c               |   5 +-
> > >>>   drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
> > >>>   drivers/gpu/drm/lima/lima_sched.c           |   5 +-
> > >>>   drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
> > >>>   drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
> > >>>   drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
> > >>>   drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
> > >>>   drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
> > >>>   drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
> > >>>   drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
> > >>>   include/drm/gpu_scheduler.h                 | 130 +++++++--
> > >>>   include/linux/dma-fence.h                   |  60 ++++-
> > >>>   16 files changed, 649 insertions(+), 184 deletions(-)
> > >>>
> > >>> --
> > >>> 2.34.1
>
Matthew Brost April 6, 2023, 2:08 a.m. UTC | #25
On Wed, Apr 05, 2023 at 12:12:27PM +0200, Daniel Vetter wrote:
> On Wed, 5 Apr 2023 at 11:57, Christian König <christian.koenig@amd.com> wrote:
> >
> > Am 05.04.23 um 11:07 schrieb Daniel Vetter:
> > > [SNIP]
> > >> I would approach it from the complete other side. This component here is a
> > >> tool to decide what job should run next.
> > >>
> > >> How that is then signaled and run should not be part of the scheduler, but
> > >> another more higher level component.
> > >>
> > >> This way you also don't have a problem with not using DMA-fences as
> > >> dependencies as well as constrains for running more jobs.
> > > I think we're talking about two things here and mixing them up.
> > >
> > > For the dependencies I agree with you, and imo that higher level tool
> > > should probably just be an on-demand submit thread in userspace for the
> > > rare case where the kernel would need to sort out a dependency otherwise
> > > (due to running out of ringspace in the per-ctx ringbuffer).
> > >
> > > The other thing is the message passing stuff, and this is what I was
> > > talking about above. This has nothing to do with handling dependencies,
> > > but with talking to the gpu fw. Here the intel design issue is that the fw
> > > only provides a single queue, and it's in-order. Which means it
> > > fundamentally has the stalling issue you describe as a point against a
> > > message passing design. And fundamentally we need to be able to talk to
> > > the fw in the scheduler ->run_job callback.
> > >
> > > The proposal here for the message passing part is that since it has the
> > > stalling issue already anyway, and the scheduler needs to be involved
> > > anyway, it makes sense to integrated this (as an optional thing, only for
> > > drivers which have this kind of fw interface) into the scheduler.
> > > Otherwise you just end up with two layers for no reason and more ping-pong
> > > delay because the ->run_job needs to kick off the subordinate driver layer
> > > first. Note that for this case the optional message passing support in the
> > > drm/scheduler actually makes things better, because it allows you to cut
> > > out one layer.
> > >
> > > Of course if a driver with better fw interface uses this message passing
> > > support, then that's bad. Hence the big warning in the kerneldoc.
> >
> > Well what I wanted to say is that if you design the dependency handling
> > / scheduler properly you don't need the message passing through it.
> >
> > For example if the GPU scheduler component uses a work item to do it's
> > handling instead of a kthread you could also let the driver specify the
> > work queue where this work item is executed on.
> >
> > When you design it like this the driver specifies the thread context of
> > execution for it's job. In other words it can specify a single threaded
> > firmware work queue as well.
> >
> > When you then have other messages which needs to be passed to the
> > firmware you can also use the same single threaded workqueue for this.
> >
> > Drivers which have a different firmware interface would just use one of
> > the system work queues instead.
> >
> > This approach basically decouples the GPU scheduler component from the
> > message passing functionality.
> 
> Hm I guess we've been talking past each another big time, because
> that's really what I thought was under discussions? Essentially the
> current rfc, but implementing with some polish.
>

I think Daniel pretty much nailed it here (thanks), to recap:

1. I want the messages in the same worker so run_job / free_job /
process_msg execution is mutual exclusive and also so during reset paths
if the worker is stopped all the entry points can't be entered.

If this is a NAK, then another worker is fine I guess. A lock between
run_job / free_job + process_msg should solve the exclusion issue and the
reset paths can also stop this new worker too. That being said I'd
rather leave this as is but will not fight this point.

2. process_msg is just used to communicate with the firmware using the
same queue as submission. Waiting for space in this queue is the only
place this function can block (same as submission), well actually we
have the concept a preempt time slice but that sleeps for 10 ms by
default. Also preempt is only used in LR entities so I don't think it is
relavent in this case either.

3. Agree this is in the dma-fence signaling path (if process_msg is in
the submission worker) so we can't block indefinitely or an unreasonable
period of time (i.e. we must obey dma-fence rules).

4. Agree the documentation for thw usage of the messaging interface
needs to be clear.

5. Agree that my code could alway use polishing.

Lets close on #1 then can I get on general Ack on this part of the RFC
and apply the polish in the full review process?

Matt

> iow I agree with you (I think at least).
> -Daniel
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
Daniel Vetter April 6, 2023, 6:37 a.m. UTC | #26
On Thu, Apr 06, 2023 at 02:08:10AM +0000, Matthew Brost wrote:
> On Wed, Apr 05, 2023 at 12:12:27PM +0200, Daniel Vetter wrote:
> > On Wed, 5 Apr 2023 at 11:57, Christian König <christian.koenig@amd.com> wrote:
> > >
> > > Am 05.04.23 um 11:07 schrieb Daniel Vetter:
> > > > [SNIP]
> > > >> I would approach it from the complete other side. This component here is a
> > > >> tool to decide what job should run next.
> > > >>
> > > >> How that is then signaled and run should not be part of the scheduler, but
> > > >> another more higher level component.
> > > >>
> > > >> This way you also don't have a problem with not using DMA-fences as
> > > >> dependencies as well as constrains for running more jobs.
> > > > I think we're talking about two things here and mixing them up.
> > > >
> > > > For the dependencies I agree with you, and imo that higher level tool
> > > > should probably just be an on-demand submit thread in userspace for the
> > > > rare case where the kernel would need to sort out a dependency otherwise
> > > > (due to running out of ringspace in the per-ctx ringbuffer).
> > > >
> > > > The other thing is the message passing stuff, and this is what I was
> > > > talking about above. This has nothing to do with handling dependencies,
> > > > but with talking to the gpu fw. Here the intel design issue is that the fw
> > > > only provides a single queue, and it's in-order. Which means it
> > > > fundamentally has the stalling issue you describe as a point against a
> > > > message passing design. And fundamentally we need to be able to talk to
> > > > the fw in the scheduler ->run_job callback.
> > > >
> > > > The proposal here for the message passing part is that since it has the
> > > > stalling issue already anyway, and the scheduler needs to be involved
> > > > anyway, it makes sense to integrated this (as an optional thing, only for
> > > > drivers which have this kind of fw interface) into the scheduler.
> > > > Otherwise you just end up with two layers for no reason and more ping-pong
> > > > delay because the ->run_job needs to kick off the subordinate driver layer
> > > > first. Note that for this case the optional message passing support in the
> > > > drm/scheduler actually makes things better, because it allows you to cut
> > > > out one layer.
> > > >
> > > > Of course if a driver with better fw interface uses this message passing
> > > > support, then that's bad. Hence the big warning in the kerneldoc.
> > >
> > > Well what I wanted to say is that if you design the dependency handling
> > > / scheduler properly you don't need the message passing through it.
> > >
> > > For example if the GPU scheduler component uses a work item to do it's
> > > handling instead of a kthread you could also let the driver specify the
> > > work queue where this work item is executed on.
> > >
> > > When you design it like this the driver specifies the thread context of
> > > execution for it's job. In other words it can specify a single threaded
> > > firmware work queue as well.
> > >
> > > When you then have other messages which needs to be passed to the
> > > firmware you can also use the same single threaded workqueue for this.
> > >
> > > Drivers which have a different firmware interface would just use one of
> > > the system work queues instead.
> > >
> > > This approach basically decouples the GPU scheduler component from the
> > > message passing functionality.
> > 
> > Hm I guess we've been talking past each another big time, because
> > that's really what I thought was under discussions? Essentially the
> > current rfc, but implementing with some polish.
> >
> 
> I think Daniel pretty much nailed it here (thanks), to recap:
> 
> 1. I want the messages in the same worker so run_job / free_job /
> process_msg execution is mutual exclusive and also so during reset paths
> if the worker is stopped all the entry points can't be entered.
> 
> If this is a NAK, then another worker is fine I guess. A lock between
> run_job / free_job + process_msg should solve the exclusion issue and the
> reset paths can also stop this new worker too. That being said I'd
> rather leave this as is but will not fight this point.
> 
> 2. process_msg is just used to communicate with the firmware using the
> same queue as submission. Waiting for space in this queue is the only
> place this function can block (same as submission), well actually we
> have the concept a preempt time slice but that sleeps for 10 ms by
> default. Also preempt is only used in LR entities so I don't think it is
> relavent in this case either.
> 
> 3. Agree this is in the dma-fence signaling path (if process_msg is in
> the submission worker) so we can't block indefinitely or an unreasonable
> period of time (i.e. we must obey dma-fence rules).

Just to hammer this in: Not just process_msg is in the dma_fence signaling
path, but the entire fw queue where everything is being funneled through,
including whatever the fw is doing to process these.

Yes this is terrible and blew up a few times already :-/

But also, probably something that the docs really need to hammer in, to
make sure people don't look at this and thinkg "hey this seems to be the
recommended way to do this on linux". We don't want hw people to build
more of these designs, they're an absolute pain to deal with with Linux'
dma_fence signalling and gpu job scheduling rules.

It's just that if you're stuck with such fw, then integrating the flow
into drm/sched instead of having an extra layer of workers seems the
better of two pretty bad solutions.
-Daniel

 
> 4. Agree the documentation for thw usage of the messaging interface
> needs to be clear.
> 
> 5. Agree that my code could alway use polishing.
> 
> Lets close on #1 then can I get on general Ack on this part of the RFC
> and apply the polish in the full review process?
> 
> Matt
> 
> > iow I agree with you (I think at least).
> > -Daniel
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch
Christian König April 6, 2023, 10:04 a.m. UTC | #27
Am 05.04.23 um 20:53 schrieb Matthew Brost:
> On Wed, Apr 05, 2023 at 12:06:53PM -0600, Zeng, Oak wrote:
>> Hi,
>>
>> Using dma-fence for completion/dependency tracking for long-run workload(more precisely on-demand paging/page fault enabled workload) can cause deadlock. This seems the significant issue here. Other issues such as the drm scheduler completion order implication etc are minors which can be solve inside the framework of drm scheduler. We need to evaluate below paths:
>>
>> 	1) still use drm scheduler for job submission, and use dma-fence for job completion waiting/dependency tracking. This is solution proposed in this series. Annotate dma-fence for long-run workload: user can still wait dma-fence for job completion but can't wait dma-fence while holding any memory management locks.  We still use dma-fence for dependency tracking. But it is just very easily run into deadlock when on-demand paging is in the picture. The annotation helps us to detect deadlock but not solve deadlock problems. Seems *not* a complete solution: It is almost impossible to completely avoid dependency deadlock in complex runtime environment
>>
> No one can wait on LR fence, so it is impossible to deadlock. The
> annotations enforce this. Literally this is only for flow controling the
> ring / hold pending jobs in in the DRM schedule list.

You can still have someone depend on the LR fence and cause a deadlock 
without waiting on it.

See my attempted solution to this problem. It tries to inherit the LR 
flag of a fence when something depended on it.

For example if you create a fence container and one of the fences inside 
the container is a LR fence then the container itself would be an LR 
fence as well.

>> 	2) Still use drm scheduler but not use dma-fence for completion signaling and dependency tracking. This way we still get some free functions (reset, err handling ring flow control as Matt said)from drm scheduler, but push the dependency/completion tracking completely to user space using techniques such as user space fence. User space doesn't have chance to wait fence while holding a kernel memory management lock, thus the dma-fence deadlock issue is solved.
>>
> We use user space fence for syncs.
>
>> 	3) Completely discard drm scheduler and dma-fence for long-run workload. Use user queue/doorbell for super fast submission, directly interact with fw scheduler. Use user fence for completion/dependency tracking.
>>
> This is a hard no from me, I want 1 submission path in Xe. Either we use
> the DRM scheduler or we don't.

Well I don't think that this will be acceptable. Especially if you not 
only have long running submission, but things like page faults/HMM in 
those jobs.

Regards,
Christian.

>
> Matt
>
>> Thanks,
>> Oak
>>
>>> -----Original Message-----
>>> From: Christian König <christian.koenig@amd.com>
>>> Sent: April 5, 2023 3:30 AM
>>> To: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak
>>> <oak.zeng@intel.com>
>>> Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org;
>>> robdclark@chromium.org; thomas.hellstrom@linux.intel.com; airlied@linux.ie;
>>> lina@asahilina.net; boris.brezillon@collabora.com; faith.ekstrand@collabora.com
>>> Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
>>> plans
>>>
>>> Am 04.04.23 um 20:08 schrieb Matthew Brost:
>>>> On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
>>>>> Hi Matt, Thomas,
>>>>>
>>>>> Some very bold out of box thinking in this area:
>>>>>
>>>>> 1. so you want to use drm scheduler and dma-fence for long running workload.
>>> Why you want to do this in the first place? What is the benefit? Drm scheduler is
>>> pretty much a software scheduler. Modern gpu has scheduler built at fw/hw
>>> level, as you said below for intel this is Guc. Can xe driver just directly submit job
>>> to Guc, bypassing drm scheduler?
>>>> If we did that now we have 2 paths for dependency track, flow controling
>>>> the ring, resets / error handling / backend submission implementations.
>>>> We don't want this.
>>> Well exactly that's the point: Why?
>>>
>>> As far as I can see that are two completely distinct use cases, so you
>>> absolutely do want two completely distinct implementations for this.
>>>
>>>>> 2. using dma-fence for long run workload: I am well aware that page fault (and
>>> the consequent memory allocation/lock acquiring to fix the fault) can cause
>>> deadlock for a dma-fence wait. But I am not convinced that dma-fence can't be
>>> used purely because the nature of the workload that it runs very long (indefinite).
>>> I did a math: the dma_fence_wait_timeout function's third param is the timeout
>>> which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not long
>>> enough, can we just change the timeout parameter to signed 64 bits so it is much
>>> longer than our life time...
>>>>> So I mainly argue we can't use dma-fence for long-run workload is not
>>> because the workload runs very long, rather because of the fact that we use
>>> page fault for long-run workload. If we enable page fault for short-run workload,
>>> we can't use dma-fence either. Page fault is the key thing here.
>>>>> Now since we use page fault which is *fundamentally* controversial with
>>> dma-fence design, why now just introduce a independent concept such as user-
>>> fence instead of extending existing dma-fence?
>>>>> I like unified design. If drm scheduler, dma-fence can be extended to work for
>>> everything, it is beautiful. But seems we have some fundamental problem here.
>>>> Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use
>>>> the signal / CB infrastructure) and enforce we don't use use these
>>>> dma-fences from the scheduler in memory reclaim paths or export these to
>>>> user space or other drivers. Think of this mode as SW only fence.
>>> Yeah and I truly think this is an really bad idea.
>>>
>>> The signal/CB infrastructure in the dma_fence turned out to be the
>>> absolutely nightmare I initially predicted. Sorry to say that, but in
>>> this case the "I've told you so" is appropriate in my opinion.
>>>
>>> If we need infrastructure for long running dependency tracking we should
>>> encapsulate that in a new framework and not try to mangle the existing
>>> code for something it was never intended for.
>>>
>>> Christian.
>>>
>>>> Matt
>>>>
>>>>> Thanks,
>>>>> Oak
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
>>>>>> Matthew Brost
>>>>>> Sent: April 3, 2023 8:22 PM
>>>>>> To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org
>>>>>> Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com;
>>> airlied@linux.ie;
>>>>>> lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew
>>>>>> <matthew.brost@intel.com>; christian.koenig@amd.com;
>>>>>> faith.ekstrand@collabora.com
>>>>>> Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
>>> plans
>>>>>> Hello,
>>>>>>
>>>>>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
>>>>>> have been asked to merge our common DRM scheduler patches first as well
>>>>>> as develop a common solution for long running workloads with the DRM
>>>>>> scheduler. This RFC series is our first attempt at doing this. We
>>>>>> welcome any and all feedback.
>>>>>>
>>>>>> This can we thought of as 4 parts detailed below.
>>>>>>
>>>>>> - DRM scheduler changes for 1 to 1 relationship between scheduler and
>>>>>> entity (patches 1-3)
>>>>>>
>>>>>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
>>>>>> GuC) which is a new paradigm WRT to the DRM scheduler and presents
>>>>>> severals problems as the DRM was originally designed to schedule jobs on
>>>>>> hardware queues. The main problem being that DRM scheduler expects the
>>>>>> submission order of jobs to be the completion order of jobs even across
>>>>>> multiple entities. This assumption falls apart with a firmware scheduler
>>>>>> as a firmware scheduler has no concept of jobs and jobs can complete out
>>>>>> of order. A novel solution for was originally thought of by Faith during
>>>>>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
>>>>>> and entity. I believe the AGX driver [3] is using this approach and
>>>>>> Boris may use approach as well for the Mali driver [4].
>>>>>>
>>>>>> To support a 1 to 1 relationship we move the main execution function
>>>>>> from a kthread to a work queue and add a new scheduling mode which
>>>>>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
>>>>>> The new scheduling mode should unify all drivers usage with a 1 to 1
>>>>>> relationship and can be thought of as using scheduler as a dependency /
>>>>>> infligt job tracker rather than a true scheduler.
>>>>>>
>>>>>> - Generic messaging interface for DRM scheduler
>>>>>>
>>>>>> Idea is to be able to communicate to the submission backend with in band
>>>>>> (relative to main execution function) messages. Messages are backend
>>>>>> defined and flexable enough for any use case. In Xe we use these
>>>>>> messages to clean up entites, set properties for entites, and suspend /
>>>>>> resume execution of an entity [5]. I suspect other driver can leverage
>>>>>> this messaging concept too as it a convenient way to avoid races in the
>>>>>> backend.
>>>>>>
>>>>>> - Support for using TDR for all error paths of a scheduler / entity
>>>>>>
>>>>>> Fix a few races / bugs, add function to dynamically set the TDR timeout.
>>>>>>
>>>>>> - Annotate dma-fences for long running workloads.
>>>>>>
>>>>>> The idea here is to use dma-fences only as sync points within the
>>>>>> scheduler and never export them for long running workloads. By
>>>>>> annotating these fences as long running we ensure that these dma-fences
>>>>>> are never used in a way that breaks the dma-fence rules. A benefit of
>>>>>> thus approach is the scheduler can still safely flow control the
>>>>>> execution ring buffer via the job limit without breaking the dma-fence
>>>>>> rules.
>>>>>>
>>>>>> Again this a first draft and looking forward to feedback.
>>>>>>
>>>>>> Enjoy - Matt
>>>>>>
>>>>>> [1] https://gitlab.freedesktop.org/drm/xe/kernel
>>>>>> [2] https://patchwork.freedesktop.org/series/112188/
>>>>>> [3] https://patchwork.freedesktop.org/series/114772/
>>>>>> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
>>>>>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-
>>>>>> next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
>>>>>>
>>>>>> Matthew Brost (8):
>>>>>>     drm/sched: Convert drm scheduler to use a work queue rather than
>>>>>>       kthread
>>>>>>     drm/sched: Move schedule policy to scheduler / entity
>>>>>>     drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
>>>>>>     drm/sched: Add generic scheduler message interface
>>>>>>     drm/sched: Start run wq before TDR in drm_sched_start
>>>>>>     drm/sched: Submit job before starting TDR
>>>>>>     drm/sched: Add helper to set TDR timeout
>>>>>>     drm/syncobj: Warn on long running dma-fences
>>>>>>
>>>>>> Thomas Hellström (2):
>>>>>>     dma-buf/dma-fence: Introduce long-running completion fences
>>>>>>     drm/sched: Support long-running sched entities
>>>>>>
>>>>>>    drivers/dma-buf/dma-fence.c                 | 142 +++++++---
>>>>>>    drivers/dma-buf/dma-resv.c                  |   5 +
>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
>>>>>>    drivers/gpu/drm/drm_syncobj.c               |   5 +-
>>>>>>    drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
>>>>>>    drivers/gpu/drm/lima/lima_sched.c           |   5 +-
>>>>>>    drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
>>>>>>    drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
>>>>>>    drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
>>>>>>    drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
>>>>>>    drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
>>>>>>    drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
>>>>>>    drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
>>>>>>    include/drm/gpu_scheduler.h                 | 130 +++++++--
>>>>>>    include/linux/dma-fence.h                   |  60 ++++-
>>>>>>    16 files changed, 649 insertions(+), 184 deletions(-)
>>>>>>
>>>>>> --
>>>>>> 2.34.1
Christian König April 6, 2023, 10:14 a.m. UTC | #28
Am 06.04.23 um 08:37 schrieb Daniel Vetter:
> On Thu, Apr 06, 2023 at 02:08:10AM +0000, Matthew Brost wrote:
>> On Wed, Apr 05, 2023 at 12:12:27PM +0200, Daniel Vetter wrote:
>>> On Wed, 5 Apr 2023 at 11:57, Christian König <christian.koenig@amd.com> wrote:
>>>> Am 05.04.23 um 11:07 schrieb Daniel Vetter:
>>>>> [SNIP]
>>>>>> I would approach it from the complete other side. This component here is a
>>>>>> tool to decide what job should run next.
>>>>>>
>>>>>> How that is then signaled and run should not be part of the scheduler, but
>>>>>> another more higher level component.
>>>>>>
>>>>>> This way you also don't have a problem with not using DMA-fences as
>>>>>> dependencies as well as constrains for running more jobs.
>>>>> I think we're talking about two things here and mixing them up.
>>>>>
>>>>> For the dependencies I agree with you, and imo that higher level tool
>>>>> should probably just be an on-demand submit thread in userspace for the
>>>>> rare case where the kernel would need to sort out a dependency otherwise
>>>>> (due to running out of ringspace in the per-ctx ringbuffer).
>>>>>
>>>>> The other thing is the message passing stuff, and this is what I was
>>>>> talking about above. This has nothing to do with handling dependencies,
>>>>> but with talking to the gpu fw. Here the intel design issue is that the fw
>>>>> only provides a single queue, and it's in-order. Which means it
>>>>> fundamentally has the stalling issue you describe as a point against a
>>>>> message passing design. And fundamentally we need to be able to talk to
>>>>> the fw in the scheduler ->run_job callback.
>>>>>
>>>>> The proposal here for the message passing part is that since it has the
>>>>> stalling issue already anyway, and the scheduler needs to be involved
>>>>> anyway, it makes sense to integrated this (as an optional thing, only for
>>>>> drivers which have this kind of fw interface) into the scheduler.
>>>>> Otherwise you just end up with two layers for no reason and more ping-pong
>>>>> delay because the ->run_job needs to kick off the subordinate driver layer
>>>>> first. Note that for this case the optional message passing support in the
>>>>> drm/scheduler actually makes things better, because it allows you to cut
>>>>> out one layer.
>>>>>
>>>>> Of course if a driver with better fw interface uses this message passing
>>>>> support, then that's bad. Hence the big warning in the kerneldoc.
>>>> Well what I wanted to say is that if you design the dependency handling
>>>> / scheduler properly you don't need the message passing through it.
>>>>
>>>> For example if the GPU scheduler component uses a work item to do it's
>>>> handling instead of a kthread you could also let the driver specify the
>>>> work queue where this work item is executed on.
>>>>
>>>> When you design it like this the driver specifies the thread context of
>>>> execution for it's job. In other words it can specify a single threaded
>>>> firmware work queue as well.
>>>>
>>>> When you then have other messages which needs to be passed to the
>>>> firmware you can also use the same single threaded workqueue for this.
>>>>
>>>> Drivers which have a different firmware interface would just use one of
>>>> the system work queues instead.
>>>>
>>>> This approach basically decouples the GPU scheduler component from the
>>>> message passing functionality.
>>> Hm I guess we've been talking past each another big time, because
>>> that's really what I thought was under discussions? Essentially the
>>> current rfc, but implementing with some polish.
>>>
>> I think Daniel pretty much nailed it here (thanks), to recap:
>>
>> 1. I want the messages in the same worker so run_job / free_job /
>> process_msg execution is mutual exclusive and also so during reset paths
>> if the worker is stopped all the entry points can't be entered.
>>
>> If this is a NAK, then another worker is fine I guess. A lock between
>> run_job / free_job + process_msg should solve the exclusion issue and the
>> reset paths can also stop this new worker too. That being said I'd
>> rather leave this as is but will not fight this point.
>>
>> 2. process_msg is just used to communicate with the firmware using the
>> same queue as submission. Waiting for space in this queue is the only
>> place this function can block (same as submission), well actually we
>> have the concept a preempt time slice but that sleeps for 10 ms by
>> default. Also preempt is only used in LR entities so I don't think it is
>> relavent in this case either.
>>
>> 3. Agree this is in the dma-fence signaling path (if process_msg is in
>> the submission worker) so we can't block indefinitely or an unreasonable
>> period of time (i.e. we must obey dma-fence rules).
> Just to hammer this in: Not just process_msg is in the dma_fence signaling
> path, but the entire fw queue where everything is being funneled through,
> including whatever the fw is doing to process these.
>
> Yes this is terrible and blew up a few times already :-/
>
> But also, probably something that the docs really need to hammer in, to
> make sure people don't look at this and thinkg "hey this seems to be the
> recommended way to do this on linux". We don't want hw people to build
> more of these designs, they're an absolute pain to deal with with Linux'
> dma_fence signalling and gpu job scheduling rules.
>
> It's just that if you're stuck with such fw, then integrating the flow
> into drm/sched instead of having an extra layer of workers seems the
> better of two pretty bad solutions.

Yeah and if you have such fw limitations, make sure that you use 
something which is understood by lockdep to feed into it.

In other words, either locks or work item/queue and not some message 
passing functionality through the scheduler.

As far as I can see the approach with the work item/queue should fit 
your needs here.

Christian.

> -Daniel
>
>   
>> 4. Agree the documentation for thw usage of the messaging interface
>> needs to be clear.
>>
>> 5. Agree that my code could alway use polishing.
>>
>> Lets close on #1 then can I get on general Ack on this part of the RFC
>> and apply the polish in the full review process?
>>
>> Matt
>>
>>> iow I agree with you (I think at least).
>>> -Daniel
>>> -- 
>>> Daniel Vetter
>>> Software Engineer, Intel Corporation
>>> http://blog.ffwll.ch
Daniel Vetter April 6, 2023, 10:32 a.m. UTC | #29
On Thu, Apr 06, 2023 at 12:14:36PM +0200, Christian König wrote:
> Am 06.04.23 um 08:37 schrieb Daniel Vetter:
> > On Thu, Apr 06, 2023 at 02:08:10AM +0000, Matthew Brost wrote:
> > > On Wed, Apr 05, 2023 at 12:12:27PM +0200, Daniel Vetter wrote:
> > > > On Wed, 5 Apr 2023 at 11:57, Christian König <christian.koenig@amd.com> wrote:
> > > > > Am 05.04.23 um 11:07 schrieb Daniel Vetter:
> > > > > > [SNIP]
> > > > > > > I would approach it from the complete other side. This component here is a
> > > > > > > tool to decide what job should run next.
> > > > > > > 
> > > > > > > How that is then signaled and run should not be part of the scheduler, but
> > > > > > > another more higher level component.
> > > > > > > 
> > > > > > > This way you also don't have a problem with not using DMA-fences as
> > > > > > > dependencies as well as constrains for running more jobs.
> > > > > > I think we're talking about two things here and mixing them up.
> > > > > > 
> > > > > > For the dependencies I agree with you, and imo that higher level tool
> > > > > > should probably just be an on-demand submit thread in userspace for the
> > > > > > rare case where the kernel would need to sort out a dependency otherwise
> > > > > > (due to running out of ringspace in the per-ctx ringbuffer).
> > > > > > 
> > > > > > The other thing is the message passing stuff, and this is what I was
> > > > > > talking about above. This has nothing to do with handling dependencies,
> > > > > > but with talking to the gpu fw. Here the intel design issue is that the fw
> > > > > > only provides a single queue, and it's in-order. Which means it
> > > > > > fundamentally has the stalling issue you describe as a point against a
> > > > > > message passing design. And fundamentally we need to be able to talk to
> > > > > > the fw in the scheduler ->run_job callback.
> > > > > > 
> > > > > > The proposal here for the message passing part is that since it has the
> > > > > > stalling issue already anyway, and the scheduler needs to be involved
> > > > > > anyway, it makes sense to integrated this (as an optional thing, only for
> > > > > > drivers which have this kind of fw interface) into the scheduler.
> > > > > > Otherwise you just end up with two layers for no reason and more ping-pong
> > > > > > delay because the ->run_job needs to kick off the subordinate driver layer
> > > > > > first. Note that for this case the optional message passing support in the
> > > > > > drm/scheduler actually makes things better, because it allows you to cut
> > > > > > out one layer.
> > > > > > 
> > > > > > Of course if a driver with better fw interface uses this message passing
> > > > > > support, then that's bad. Hence the big warning in the kerneldoc.
> > > > > Well what I wanted to say is that if you design the dependency handling
> > > > > / scheduler properly you don't need the message passing through it.
> > > > > 
> > > > > For example if the GPU scheduler component uses a work item to do it's
> > > > > handling instead of a kthread you could also let the driver specify the
> > > > > work queue where this work item is executed on.
> > > > > 
> > > > > When you design it like this the driver specifies the thread context of
> > > > > execution for it's job. In other words it can specify a single threaded
> > > > > firmware work queue as well.
> > > > > 
> > > > > When you then have other messages which needs to be passed to the
> > > > > firmware you can also use the same single threaded workqueue for this.
> > > > > 
> > > > > Drivers which have a different firmware interface would just use one of
> > > > > the system work queues instead.
> > > > > 
> > > > > This approach basically decouples the GPU scheduler component from the
> > > > > message passing functionality.
> > > > Hm I guess we've been talking past each another big time, because
> > > > that's really what I thought was under discussions? Essentially the
> > > > current rfc, but implementing with some polish.
> > > > 
> > > I think Daniel pretty much nailed it here (thanks), to recap:
> > > 
> > > 1. I want the messages in the same worker so run_job / free_job /
> > > process_msg execution is mutual exclusive and also so during reset paths
> > > if the worker is stopped all the entry points can't be entered.
> > > 
> > > If this is a NAK, then another worker is fine I guess. A lock between
> > > run_job / free_job + process_msg should solve the exclusion issue and the
> > > reset paths can also stop this new worker too. That being said I'd
> > > rather leave this as is but will not fight this point.
> > > 
> > > 2. process_msg is just used to communicate with the firmware using the
> > > same queue as submission. Waiting for space in this queue is the only
> > > place this function can block (same as submission), well actually we
> > > have the concept a preempt time slice but that sleeps for 10 ms by
> > > default. Also preempt is only used in LR entities so I don't think it is
> > > relavent in this case either.
> > > 
> > > 3. Agree this is in the dma-fence signaling path (if process_msg is in
> > > the submission worker) so we can't block indefinitely or an unreasonable
> > > period of time (i.e. we must obey dma-fence rules).
> > Just to hammer this in: Not just process_msg is in the dma_fence signaling
> > path, but the entire fw queue where everything is being funneled through,
> > including whatever the fw is doing to process these.
> > 
> > Yes this is terrible and blew up a few times already :-/
> > 
> > But also, probably something that the docs really need to hammer in, to
> > make sure people don't look at this and thinkg "hey this seems to be the
> > recommended way to do this on linux". We don't want hw people to build
> > more of these designs, they're an absolute pain to deal with with Linux'
> > dma_fence signalling and gpu job scheduling rules.
> > 
> > It's just that if you're stuck with such fw, then integrating the flow
> > into drm/sched instead of having an extra layer of workers seems the
> > better of two pretty bad solutions.
> 
> Yeah and if you have such fw limitations, make sure that you use something
> which is understood by lockdep to feed into it.
> 
> In other words, either locks or work item/queue and not some message passing
> functionality through the scheduler.
> 
> As far as I can see the approach with the work item/queue should fit your
> needs here.

dma_fence signalling annotations would also make the scheduler thread
suitable to catch issues with lockdep, just to make double-sure it can
catch issues.
-Daniel

> 
> Christian.
> 
> > -Daniel
> > 
> > > 4. Agree the documentation for thw usage of the messaging interface
> > > needs to be clear.
> > > 
> > > 5. Agree that my code could alway use polishing.
> > > 
> > > Lets close on #1 then can I get on general Ack on this part of the RFC
> > > and apply the polish in the full review process?
> > > 
> > > Matt
> > > 
> > > > iow I agree with you (I think at least).
> > > > -Daniel
> > > > -- 
> > > > Daniel Vetter
> > > > Software Engineer, Intel Corporation
> > > > http://blog.ffwll.ch
>
Zeng, Oak April 7, 2023, 12:20 a.m. UTC | #30
So this series basically go with option 2. The part that option2 makes me uncomfortable is, dma-fence doesn't work for long running workload, why we generate it in the first place? As long as dma-fence is generated, it will become a source of confusion in the future. It doesn't matter how much you annotate it/document it. So if we decide to go with option2, the bottom line is, don't generate dma-fence for long running workload during job submission. This requires some rework in drm scheduler.

The cleanest solution to me is option3. Dma-fence is a very old technology. When it was created, no gpu support page fault. Obviously this is not a good technology for modern gpu with page fault support. I think the best way is to create a new scheduler and dependency tracking mechanism works for both page fault enabled and page fault disabled context. I think this matches what Christian said below. Maybe nobody think this is easy?  

Thanks,
Oak

> -----Original Message-----
> From: Brost, Matthew <matthew.brost@intel.com>
> Sent: April 5, 2023 2:53 PM
> To: Zeng, Oak <oak.zeng@intel.com>
> Cc: Christian König <christian.koenig@amd.com>; Vetter, Daniel
> <daniel.vetter@intel.com>; Thomas Hellström
> <thomas.hellstrom@linux.intel.com>; dri-devel@lists.freedesktop.org; intel-
> xe@lists.freedesktop.org; robdclark@chromium.org; airlied@linux.ie;
> lina@asahilina.net; boris.brezillon@collabora.com; faith.ekstrand@collabora.com
> Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> plans
> 
> On Wed, Apr 05, 2023 at 12:06:53PM -0600, Zeng, Oak wrote:
> > Hi,
> >
> > Using dma-fence for completion/dependency tracking for long-run
> workload(more precisely on-demand paging/page fault enabled workload) can
> cause deadlock. This seems the significant issue here. Other issues such as the
> drm scheduler completion order implication etc are minors which can be solve
> inside the framework of drm scheduler. We need to evaluate below paths:
> >
> > 	1) still use drm scheduler for job submission, and use dma-fence for job
> completion waiting/dependency tracking. This is solution proposed in this series.
> Annotate dma-fence for long-run workload: user can still wait dma-fence for job
> completion but can't wait dma-fence while holding any memory management
> locks.  We still use dma-fence for dependency tracking. But it is just very easily
> run into deadlock when on-demand paging is in the picture. The annotation helps
> us to detect deadlock but not solve deadlock problems. Seems *not* a complete
> solution: It is almost impossible to completely avoid dependency deadlock in
> complex runtime environment
> >
> 
> No one can wait on LR fence, so it is impossible to deadlock. The
> annotations enforce this. Literally this is only for flow controling the
> ring / hold pending jobs in in the DRM schedule list.
> 
> > 	2) Still use drm scheduler but not use dma-fence for completion signaling
> and dependency tracking. This way we still get some free functions (reset, err
> handling ring flow control as Matt said)from drm scheduler, but push the
> dependency/completion tracking completely to user space using techniques such
> as user space fence. User space doesn't have chance to wait fence while holding
> a kernel memory management lock, thus the dma-fence deadlock issue is solved.
> >
> 
> We use user space fence for syncs.
> 
> > 	3) Completely discard drm scheduler and dma-fence for long-run
> workload. Use user queue/doorbell for super fast submission, directly interact
> with fw scheduler. Use user fence for completion/dependency tracking.
> >
> 
> This is a hard no from me, I want 1 submission path in Xe. Either we use
> the DRM scheduler or we don't.
> 
> Matt
> 
> > Thanks,
> > Oak
> >
> > > -----Original Message-----
> > > From: Christian König <christian.koenig@amd.com>
> > > Sent: April 5, 2023 3:30 AM
> > > To: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak
> > > <oak.zeng@intel.com>
> > > Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org;
> > > robdclark@chromium.org; thomas.hellstrom@linux.intel.com;
> airlied@linux.ie;
> > > lina@asahilina.net; boris.brezillon@collabora.com;
> faith.ekstrand@collabora.com
> > > Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> > > plans
> > >
> > > Am 04.04.23 um 20:08 schrieb Matthew Brost:
> > > > On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
> > > >> Hi Matt, Thomas,
> > > >>
> > > >> Some very bold out of box thinking in this area:
> > > >>
> > > >> 1. so you want to use drm scheduler and dma-fence for long running
> workload.
> > > Why you want to do this in the first place? What is the benefit? Drm scheduler
> is
> > > pretty much a software scheduler. Modern gpu has scheduler built at fw/hw
> > > level, as you said below for intel this is Guc. Can xe driver just directly submit
> job
> > > to Guc, bypassing drm scheduler?
> > > >>
> > > > If we did that now we have 2 paths for dependency track, flow controling
> > > > the ring, resets / error handling / backend submission implementations.
> > > > We don't want this.
> > >
> > > Well exactly that's the point: Why?
> > >
> > > As far as I can see that are two completely distinct use cases, so you
> > > absolutely do want two completely distinct implementations for this.
> > >
> > > >> 2. using dma-fence for long run workload: I am well aware that page fault
> (and
> > > the consequent memory allocation/lock acquiring to fix the fault) can cause
> > > deadlock for a dma-fence wait. But I am not convinced that dma-fence can't
> be
> > > used purely because the nature of the workload that it runs very long
> (indefinite).
> > > I did a math: the dma_fence_wait_timeout function's third param is the
> timeout
> > > which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not
> long
> > > enough, can we just change the timeout parameter to signed 64 bits so it is
> much
> > > longer than our life time...
> > > >>
> > > >> So I mainly argue we can't use dma-fence for long-run workload is not
> > > because the workload runs very long, rather because of the fact that we use
> > > page fault for long-run workload. If we enable page fault for short-run
> workload,
> > > we can't use dma-fence either. Page fault is the key thing here.
> > > >>
> > > >> Now since we use page fault which is *fundamentally* controversial with
> > > dma-fence design, why now just introduce a independent concept such as
> user-
> > > fence instead of extending existing dma-fence?
> > > >>
> > > >> I like unified design. If drm scheduler, dma-fence can be extended to work
> for
> > > everything, it is beautiful. But seems we have some fundamental problem
> here.
> > > >>
> > > > Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use
> > > > the signal / CB infrastructure) and enforce we don't use use these
> > > > dma-fences from the scheduler in memory reclaim paths or export these to
> > > > user space or other drivers. Think of this mode as SW only fence.
> > >
> > > Yeah and I truly think this is an really bad idea.
> > >
> > > The signal/CB infrastructure in the dma_fence turned out to be the
> > > absolutely nightmare I initially predicted. Sorry to say that, but in
> > > this case the "I've told you so" is appropriate in my opinion.
> > >
> > > If we need infrastructure for long running dependency tracking we should
> > > encapsulate that in a new framework and not try to mangle the existing
> > > code for something it was never intended for.
> > >
> > > Christian.
> > >
> > > >
> > > > Matt
> > > >
> > > >> Thanks,
> > > >> Oak
> > > >>
> > > >>> -----Original Message-----
> > > >>> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
> > > >>> Matthew Brost
> > > >>> Sent: April 3, 2023 8:22 PM
> > > >>> To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org
> > > >>> Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com;
> > > airlied@linux.ie;
> > > >>> lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew
> > > >>> <matthew.brost@intel.com>; christian.koenig@amd.com;
> > > >>> faith.ekstrand@collabora.com
> > > >>> Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> > > plans
> > > >>>
> > > >>> Hello,
> > > >>>
> > > >>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > > >>> have been asked to merge our common DRM scheduler patches first as
> well
> > > >>> as develop a common solution for long running workloads with the DRM
> > > >>> scheduler. This RFC series is our first attempt at doing this. We
> > > >>> welcome any and all feedback.
> > > >>>
> > > >>> This can we thought of as 4 parts detailed below.
> > > >>>
> > > >>> - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > > >>> entity (patches 1-3)
> > > >>>
> > > >>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > > >>> GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > > >>> severals problems as the DRM was originally designed to schedule jobs
> on
> > > >>> hardware queues. The main problem being that DRM scheduler expects
> the
> > > >>> submission order of jobs to be the completion order of jobs even across
> > > >>> multiple entities. This assumption falls apart with a firmware scheduler
> > > >>> as a firmware scheduler has no concept of jobs and jobs can complete
> out
> > > >>> of order. A novel solution for was originally thought of by Faith during
> > > >>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > > >>> and entity. I believe the AGX driver [3] is using this approach and
> > > >>> Boris may use approach as well for the Mali driver [4].
> > > >>>
> > > >>> To support a 1 to 1 relationship we move the main execution function
> > > >>> from a kthread to a work queue and add a new scheduling mode which
> > > >>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > > >>> The new scheduling mode should unify all drivers usage with a 1 to 1
> > > >>> relationship and can be thought of as using scheduler as a dependency /
> > > >>> infligt job tracker rather than a true scheduler.
> > > >>>
> > > >>> - Generic messaging interface for DRM scheduler
> > > >>>
> > > >>> Idea is to be able to communicate to the submission backend with in
> band
> > > >>> (relative to main execution function) messages. Messages are backend
> > > >>> defined and flexable enough for any use case. In Xe we use these
> > > >>> messages to clean up entites, set properties for entites, and suspend /
> > > >>> resume execution of an entity [5]. I suspect other driver can leverage
> > > >>> this messaging concept too as it a convenient way to avoid races in the
> > > >>> backend.
> > > >>>
> > > >>> - Support for using TDR for all error paths of a scheduler / entity
> > > >>>
> > > >>> Fix a few races / bugs, add function to dynamically set the TDR timeout.
> > > >>>
> > > >>> - Annotate dma-fences for long running workloads.
> > > >>>
> > > >>> The idea here is to use dma-fences only as sync points within the
> > > >>> scheduler and never export them for long running workloads. By
> > > >>> annotating these fences as long running we ensure that these dma-
> fences
> > > >>> are never used in a way that breaks the dma-fence rules. A benefit of
> > > >>> thus approach is the scheduler can still safely flow control the
> > > >>> execution ring buffer via the job limit without breaking the dma-fence
> > > >>> rules.
> > > >>>
> > > >>> Again this a first draft and looking forward to feedback.
> > > >>>
> > > >>> Enjoy - Matt
> > > >>>
> > > >>> [1] https://gitlab.freedesktop.org/drm/xe/kernel
> > > >>> [2] https://patchwork.freedesktop.org/series/112188/
> > > >>> [3] https://patchwork.freedesktop.org/series/114772/
> > > >>> [4]
> https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> > > >>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-
> > > >>> next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> > > >>>
> > > >>> Matthew Brost (8):
> > > >>>    drm/sched: Convert drm scheduler to use a work queue rather than
> > > >>>      kthread
> > > >>>    drm/sched: Move schedule policy to scheduler / entity
> > > >>>    drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling
> policy
> > > >>>    drm/sched: Add generic scheduler message interface
> > > >>>    drm/sched: Start run wq before TDR in drm_sched_start
> > > >>>    drm/sched: Submit job before starting TDR
> > > >>>    drm/sched: Add helper to set TDR timeout
> > > >>>    drm/syncobj: Warn on long running dma-fences
> > > >>>
> > > >>> Thomas Hellström (2):
> > > >>>    dma-buf/dma-fence: Introduce long-running completion fences
> > > >>>    drm/sched: Support long-running sched entities
> > > >>>
> > > >>>   drivers/dma-buf/dma-fence.c                 | 142 +++++++---
> > > >>>   drivers/dma-buf/dma-resv.c                  |   5 +
> > > >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> > > >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
> > > >>>   drivers/gpu/drm/drm_syncobj.c               |   5 +-
> > > >>>   drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
> > > >>>   drivers/gpu/drm/lima/lima_sched.c           |   5 +-
> > > >>>   drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
> > > >>>   drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
> > > >>>   drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
> > > >>>   drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
> > > >>>   drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
> > > >>>   drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++--
> ---
> > > >>>   drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
> > > >>>   include/drm/gpu_scheduler.h                 | 130 +++++++--
> > > >>>   include/linux/dma-fence.h                   |  60 ++++-
> > > >>>   16 files changed, 649 insertions(+), 184 deletions(-)
> > > >>>
> > > >>> --
> > > >>> 2.34.1
> >
Asahi Lina April 8, 2023, 7:05 a.m. UTC | #31
On 04/04/2023 10.58, Matthew Brost wrote:
> On Tue, Apr 04, 2023 at 10:07:48AM +0900, Asahi Lina wrote:
>> Hi, thanks for the Cc!
>>
> 
> No problem.
> 
>> On 04/04/2023 09.22, Matthew Brost wrote:
>>> Hello,
>>>
>>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
>>> have been asked to merge our common DRM scheduler patches first as well
>>> as develop a common solution for long running workloads with the DRM
>>> scheduler. This RFC series is our first attempt at doing this. We
>>> welcome any and all feedback.
>>>
>>> This can we thought of as 4 parts detailed below.
>>>
>>> - DRM scheduler changes for 1 to 1 relationship between scheduler and
>>> entity (patches 1-3)
>>>
>>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
>>> GuC) which is a new paradigm WRT to the DRM scheduler and presents
>>> severals problems as the DRM was originally designed to schedule jobs on
>>> hardware queues. The main problem being that DRM scheduler expects the
>>> submission order of jobs to be the completion order of jobs even across
>>> multiple entities. This assumption falls apart with a firmware scheduler
>>> as a firmware scheduler has no concept of jobs and jobs can complete out
>>> of order. A novel solution for was originally thought of by Faith during
>>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
>>> and entity. I believe the AGX driver [3] is using this approach and
>>> Boris may use approach as well for the Mali driver [4].
>>>
>>> To support a 1 to 1 relationship we move the main execution function
>>> from a kthread to a work queue and add a new scheduling mode which
>>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
>>> The new scheduling mode should unify all drivers usage with a 1 to 1
>>> relationship and can be thought of as using scheduler as a dependency /
>>> infligt job tracker rather than a true scheduler.
>>
>> Yup, we're in the exact same situation with drm/asahi, so this is very
>> welcome! We've been using the existing scheduler as-is, but this should help
>> remove some unneeded complexity in this use case.
>>
> 
> That's the idea.
> 
>> Do you want me to pull in this series into our tree and make sure this all
>> works out for us?
>>
> 
> We tested this in Xe and it definitely works for us but the more testing
> the better.
> 

I haven't gotten around to testing this series yet, but after more 
debugging of drm_sched issues I want to hear more about how Xe uses the 
scheduler.

 From what I can tell, and from what Christian says, drm_sched has the 
hidden requirement that all job objects outlive the scheduler. I've run 
into several UAF bugs due to this. Not only that, it also currently has 
the requirement that all drm_sched fences outlive the scheduler object.

These requirements are subtle and only manifest as kernel oopses in rare 
corner cases, so it wasn't at all obvious to me that this was somehow a 
fundamental design assumption when I started using it.

As far as I can tell, this design is going to work in 99% of cases for 
global-schedulers-per-GPU models, where those corner cases would have to 
be hit on top of a GPU removal scenario (and GPU remove is... well, not 
the most tested/exercised use case). When the scheduler basically lives 
forever, none of this really matters.

But with a one-scheduler-per-queue model, how do you deal with this when 
the queue goes away? So far, without any of the partial bugfixes I have 
sent so far (which Christian objected to):

- If you try to tear down a scheduler with any jobs currently scheduled 
at the hardware, drm_sched will oops when those jobs complete and the hw 
fences signal.
- If you try to tear down an entity (which should cancel all its pending 
jobs) and then the scheduler it was attached to without actually waiting 
for all the free_job() callbacks to be called on every job that ever 
existed for that entity, you can oops (entity cleanup is asynchronous in 
some cases like killed processes, so it will return before all jobs are 
freed and then that asynchronous process will crash and burn if the 
scheduler goes away out from under its feet). Waiting for job completion 
fences is not enough for this, you have to wait until free_job() has 
actually been called for all jobs.
- Even if you actually wait for all jobs to be truly gone and then tear 
down the scheduler, if any scheduler job fences remain alive, that will 
then oops if you try to call the debug functions on them (like cat 
/sys/kernel/debug/dma_buf/bufinfo).

I tried to fix these things, but Christian objected implying it was the 
driver's job to keep a reference from jobs and hw fences to the 
scheduler. But I find that completely broken, because besides the extra 
memory/resource usage keeping the scheduler alive when you're trying to 
free resources as fast as possible when a process goes away, you can't 
even use normal reference counting for that: if you try to drop the last 
drm_sched reference from within a free_job() callback, the whole thing 
deadlocks since that will be running in the scheduler's thread/workqueue 
context, which can't free itself. So now you both reference count the 
scheduler from jobs and fences, and on top of that you need to outsource 
drm_sched freeing to a workqueue in the driver to make sure you don't 
deadlock.

For job fences this is particularly broken, because those fences can 
live forever signaled and attached to shared buffers and there is no 
guarantee that they will be freed in any kind of reasonable time frame. 
If they have to keep the scheduler that created them alive, that creates 
a lot of dead object junk we have to drag around just because a signaled 
fence exists somewhere.

For a Rust abstraction we have to do all that tracking and refcounting 
in the abstraction itself to make it safe, which is starting to sound 
like reimplementing half of the job tracking drm_sched itself does just 
to fix the lifetime issues, which really tells me the existing design is 
not sound nor easy to use correctly in general.

How does Xe deal with this (does it deal with it at all)? What happens 
when you kill -9 a process using the GPU? Does freeing all of this wait 
for all jobs to complete *and be freed* with free_job()? What about 
exported dma_bufs with fences attached from that scheduler? Do you keep 
the scheduler alive for those?

Personally, after running into all this, and after seeing Christian's 
reaction to me trying to improve the state of things, I'm starting to 
doubt that drm_sched is the right solution at all for 
firmware-scheduling drivers.

If you want a workload to try to see if you run into any of these 
things, running and killing lots of things in parallel is a good thing 
to try (mess with the numbers and let it run for a while to see if you 
can hit any corner cases):

while true; do for i in $(seq 1 10); do timeout -k 0.01 0.05 glxgears & 
done; sleep 0.1; done

~~ Lina
Christian König April 11, 2023, 9:02 a.m. UTC | #32
The point is that this not only requires some work in the drm_scheduler, 
but rather it then makes only little sense to use the drm_scheduler in 
the first place.

The whole point of the drm_scheduler is to provide dma_fence 
implementation for the submitted jobs.

We also have dependency handling, but as Daniel and I said this can be 
easily extracted into a separate object/component.

Regards,
Christian.

Am 07.04.23 um 02:20 schrieb Zeng, Oak:
> So this series basically go with option 2. The part that option2 makes me uncomfortable is, dma-fence doesn't work for long running workload, why we generate it in the first place? As long as dma-fence is generated, it will become a source of confusion in the future. It doesn't matter how much you annotate it/document it. So if we decide to go with option2, the bottom line is, don't generate dma-fence for long running workload during job submission. This requires some rework in drm scheduler.
>
> The cleanest solution to me is option3. Dma-fence is a very old technology. When it was created, no gpu support page fault. Obviously this is not a good technology for modern gpu with page fault support. I think the best way is to create a new scheduler and dependency tracking mechanism works for both page fault enabled and page fault disabled context. I think this matches what Christian said below. Maybe nobody think this is easy?
>
> Thanks,
> Oak
>
>> -----Original Message-----
>> From: Brost, Matthew <matthew.brost@intel.com>
>> Sent: April 5, 2023 2:53 PM
>> To: Zeng, Oak <oak.zeng@intel.com>
>> Cc: Christian König <christian.koenig@amd.com>; Vetter, Daniel
>> <daniel.vetter@intel.com>; Thomas Hellström
>> <thomas.hellstrom@linux.intel.com>; dri-devel@lists.freedesktop.org; intel-
>> xe@lists.freedesktop.org; robdclark@chromium.org; airlied@linux.ie;
>> lina@asahilina.net; boris.brezillon@collabora.com; faith.ekstrand@collabora.com
>> Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
>> plans
>>
>> On Wed, Apr 05, 2023 at 12:06:53PM -0600, Zeng, Oak wrote:
>>> Hi,
>>>
>>> Using dma-fence for completion/dependency tracking for long-run
>> workload(more precisely on-demand paging/page fault enabled workload) can
>> cause deadlock. This seems the significant issue here. Other issues such as the
>> drm scheduler completion order implication etc are minors which can be solve
>> inside the framework of drm scheduler. We need to evaluate below paths:
>>> 	1) still use drm scheduler for job submission, and use dma-fence for job
>> completion waiting/dependency tracking. This is solution proposed in this series.
>> Annotate dma-fence for long-run workload: user can still wait dma-fence for job
>> completion but can't wait dma-fence while holding any memory management
>> locks.  We still use dma-fence for dependency tracking. But it is just very easily
>> run into deadlock when on-demand paging is in the picture. The annotation helps
>> us to detect deadlock but not solve deadlock problems. Seems *not* a complete
>> solution: It is almost impossible to completely avoid dependency deadlock in
>> complex runtime environment
>> No one can wait on LR fence, so it is impossible to deadlock. The
>> annotations enforce this. Literally this is only for flow controling the
>> ring / hold pending jobs in in the DRM schedule list.
>>
>>> 	2) Still use drm scheduler but not use dma-fence for completion signaling
>> and dependency tracking. This way we still get some free functions (reset, err
>> handling ring flow control as Matt said)from drm scheduler, but push the
>> dependency/completion tracking completely to user space using techniques such
>> as user space fence. User space doesn't have chance to wait fence while holding
>> a kernel memory management lock, thus the dma-fence deadlock issue is solved.
>> We use user space fence for syncs.
>>
>>> 	3) Completely discard drm scheduler and dma-fence for long-run
>> workload. Use user queue/doorbell for super fast submission, directly interact
>> with fw scheduler. Use user fence for completion/dependency tracking.
>> This is a hard no from me, I want 1 submission path in Xe. Either we use
>> the DRM scheduler or we don't.
>>
>> Matt
>>
>>> Thanks,
>>> Oak
>>>
>>>> -----Original Message-----
>>>> From: Christian König <christian.koenig@amd.com>
>>>> Sent: April 5, 2023 3:30 AM
>>>> To: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak
>>>> <oak.zeng@intel.com>
>>>> Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org;
>>>> robdclark@chromium.org; thomas.hellstrom@linux.intel.com;
>> airlied@linux.ie;
>>>> lina@asahilina.net; boris.brezillon@collabora.com;
>> faith.ekstrand@collabora.com
>>>> Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
>>>> plans
>>>>
>>>> Am 04.04.23 um 20:08 schrieb Matthew Brost:
>>>>> On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
>>>>>> Hi Matt, Thomas,
>>>>>>
>>>>>> Some very bold out of box thinking in this area:
>>>>>>
>>>>>> 1. so you want to use drm scheduler and dma-fence for long running
>> workload.
>>>> Why you want to do this in the first place? What is the benefit? Drm scheduler
>> is
>>>> pretty much a software scheduler. Modern gpu has scheduler built at fw/hw
>>>> level, as you said below for intel this is Guc. Can xe driver just directly submit
>> job
>>>> to Guc, bypassing drm scheduler?
>>>>> If we did that now we have 2 paths for dependency track, flow controling
>>>>> the ring, resets / error handling / backend submission implementations.
>>>>> We don't want this.
>>>> Well exactly that's the point: Why?
>>>>
>>>> As far as I can see that are two completely distinct use cases, so you
>>>> absolutely do want two completely distinct implementations for this.
>>>>
>>>>>> 2. using dma-fence for long run workload: I am well aware that page fault
>> (and
>>>> the consequent memory allocation/lock acquiring to fix the fault) can cause
>>>> deadlock for a dma-fence wait. But I am not convinced that dma-fence can't
>> be
>>>> used purely because the nature of the workload that it runs very long
>> (indefinite).
>>>> I did a math: the dma_fence_wait_timeout function's third param is the
>> timeout
>>>> which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not
>> long
>>>> enough, can we just change the timeout parameter to signed 64 bits so it is
>> much
>>>> longer than our life time...
>>>>>> So I mainly argue we can't use dma-fence for long-run workload is not
>>>> because the workload runs very long, rather because of the fact that we use
>>>> page fault for long-run workload. If we enable page fault for short-run
>> workload,
>>>> we can't use dma-fence either. Page fault is the key thing here.
>>>>>> Now since we use page fault which is *fundamentally* controversial with
>>>> dma-fence design, why now just introduce a independent concept such as
>> user-
>>>> fence instead of extending existing dma-fence?
>>>>>> I like unified design. If drm scheduler, dma-fence can be extended to work
>> for
>>>> everything, it is beautiful. But seems we have some fundamental problem
>> here.
>>>>> Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use
>>>>> the signal / CB infrastructure) and enforce we don't use use these
>>>>> dma-fences from the scheduler in memory reclaim paths or export these to
>>>>> user space or other drivers. Think of this mode as SW only fence.
>>>> Yeah and I truly think this is an really bad idea.
>>>>
>>>> The signal/CB infrastructure in the dma_fence turned out to be the
>>>> absolutely nightmare I initially predicted. Sorry to say that, but in
>>>> this case the "I've told you so" is appropriate in my opinion.
>>>>
>>>> If we need infrastructure for long running dependency tracking we should
>>>> encapsulate that in a new framework and not try to mangle the existing
>>>> code for something it was never intended for.
>>>>
>>>> Christian.
>>>>
>>>>> Matt
>>>>>
>>>>>> Thanks,
>>>>>> Oak
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
>>>>>>> Matthew Brost
>>>>>>> Sent: April 3, 2023 8:22 PM
>>>>>>> To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org
>>>>>>> Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com;
>>>> airlied@linux.ie;
>>>>>>> lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew
>>>>>>> <matthew.brost@intel.com>; christian.koenig@amd.com;
>>>>>>> faith.ekstrand@collabora.com
>>>>>>> Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
>>>> plans
>>>>>>> Hello,
>>>>>>>
>>>>>>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
>>>>>>> have been asked to merge our common DRM scheduler patches first as
>> well
>>>>>>> as develop a common solution for long running workloads with the DRM
>>>>>>> scheduler. This RFC series is our first attempt at doing this. We
>>>>>>> welcome any and all feedback.
>>>>>>>
>>>>>>> This can we thought of as 4 parts detailed below.
>>>>>>>
>>>>>>> - DRM scheduler changes for 1 to 1 relationship between scheduler and
>>>>>>> entity (patches 1-3)
>>>>>>>
>>>>>>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
>>>>>>> GuC) which is a new paradigm WRT to the DRM scheduler and presents
>>>>>>> severals problems as the DRM was originally designed to schedule jobs
>> on
>>>>>>> hardware queues. The main problem being that DRM scheduler expects
>> the
>>>>>>> submission order of jobs to be the completion order of jobs even across
>>>>>>> multiple entities. This assumption falls apart with a firmware scheduler
>>>>>>> as a firmware scheduler has no concept of jobs and jobs can complete
>> out
>>>>>>> of order. A novel solution for was originally thought of by Faith during
>>>>>>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
>>>>>>> and entity. I believe the AGX driver [3] is using this approach and
>>>>>>> Boris may use approach as well for the Mali driver [4].
>>>>>>>
>>>>>>> To support a 1 to 1 relationship we move the main execution function
>>>>>>> from a kthread to a work queue and add a new scheduling mode which
>>>>>>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
>>>>>>> The new scheduling mode should unify all drivers usage with a 1 to 1
>>>>>>> relationship and can be thought of as using scheduler as a dependency /
>>>>>>> infligt job tracker rather than a true scheduler.
>>>>>>>
>>>>>>> - Generic messaging interface for DRM scheduler
>>>>>>>
>>>>>>> Idea is to be able to communicate to the submission backend with in
>> band
>>>>>>> (relative to main execution function) messages. Messages are backend
>>>>>>> defined and flexable enough for any use case. In Xe we use these
>>>>>>> messages to clean up entites, set properties for entites, and suspend /
>>>>>>> resume execution of an entity [5]. I suspect other driver can leverage
>>>>>>> this messaging concept too as it a convenient way to avoid races in the
>>>>>>> backend.
>>>>>>>
>>>>>>> - Support for using TDR for all error paths of a scheduler / entity
>>>>>>>
>>>>>>> Fix a few races / bugs, add function to dynamically set the TDR timeout.
>>>>>>>
>>>>>>> - Annotate dma-fences for long running workloads.
>>>>>>>
>>>>>>> The idea here is to use dma-fences only as sync points within the
>>>>>>> scheduler and never export them for long running workloads. By
>>>>>>> annotating these fences as long running we ensure that these dma-
>> fences
>>>>>>> are never used in a way that breaks the dma-fence rules. A benefit of
>>>>>>> thus approach is the scheduler can still safely flow control the
>>>>>>> execution ring buffer via the job limit without breaking the dma-fence
>>>>>>> rules.
>>>>>>>
>>>>>>> Again this a first draft and looking forward to feedback.
>>>>>>>
>>>>>>> Enjoy - Matt
>>>>>>>
>>>>>>> [1] https://gitlab.freedesktop.org/drm/xe/kernel
>>>>>>> [2] https://patchwork.freedesktop.org/series/112188/
>>>>>>> [3] https://patchwork.freedesktop.org/series/114772/
>>>>>>> [4]
>> https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
>>>>>>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-
>>>>>>> next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
>>>>>>>
>>>>>>> Matthew Brost (8):
>>>>>>>     drm/sched: Convert drm scheduler to use a work queue rather than
>>>>>>>       kthread
>>>>>>>     drm/sched: Move schedule policy to scheduler / entity
>>>>>>>     drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling
>> policy
>>>>>>>     drm/sched: Add generic scheduler message interface
>>>>>>>     drm/sched: Start run wq before TDR in drm_sched_start
>>>>>>>     drm/sched: Submit job before starting TDR
>>>>>>>     drm/sched: Add helper to set TDR timeout
>>>>>>>     drm/syncobj: Warn on long running dma-fences
>>>>>>>
>>>>>>> Thomas Hellström (2):
>>>>>>>     dma-buf/dma-fence: Introduce long-running completion fences
>>>>>>>     drm/sched: Support long-running sched entities
>>>>>>>
>>>>>>>    drivers/dma-buf/dma-fence.c                 | 142 +++++++---
>>>>>>>    drivers/dma-buf/dma-resv.c                  |   5 +
>>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
>>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
>>>>>>>    drivers/gpu/drm/drm_syncobj.c               |   5 +-
>>>>>>>    drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
>>>>>>>    drivers/gpu/drm/lima/lima_sched.c           |   5 +-
>>>>>>>    drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
>>>>>>>    drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
>>>>>>>    drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
>>>>>>>    drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
>>>>>>>    drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
>>>>>>>    drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++--
>> ---
>>>>>>>    drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
>>>>>>>    include/drm/gpu_scheduler.h                 | 130 +++++++--
>>>>>>>    include/linux/dma-fence.h                   |  60 ++++-
>>>>>>>    16 files changed, 649 insertions(+), 184 deletions(-)
>>>>>>>
>>>>>>> --
>>>>>>> 2.34.1
Daniel Vetter April 11, 2023, 2:07 p.m. UTC | #33
On Sat, Apr 08, 2023 at 04:05:20PM +0900, Asahi Lina wrote:
> On 04/04/2023 10.58, Matthew Brost wrote:
> > On Tue, Apr 04, 2023 at 10:07:48AM +0900, Asahi Lina wrote:
> > > Hi, thanks for the Cc!
> > > 
> > 
> > No problem.
> > 
> > > On 04/04/2023 09.22, Matthew Brost wrote:
> > > > Hello,
> > > > 
> > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > > > have been asked to merge our common DRM scheduler patches first as well
> > > > as develop a common solution for long running workloads with the DRM
> > > > scheduler. This RFC series is our first attempt at doing this. We
> > > > welcome any and all feedback.
> > > > 
> > > > This can we thought of as 4 parts detailed below.
> > > > 
> > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > > > entity (patches 1-3)
> > > > 
> > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > > > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > > > severals problems as the DRM was originally designed to schedule jobs on
> > > > hardware queues. The main problem being that DRM scheduler expects the
> > > > submission order of jobs to be the completion order of jobs even across
> > > > multiple entities. This assumption falls apart with a firmware scheduler
> > > > as a firmware scheduler has no concept of jobs and jobs can complete out
> > > > of order. A novel solution for was originally thought of by Faith during
> > > > the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > > > and entity. I believe the AGX driver [3] is using this approach and
> > > > Boris may use approach as well for the Mali driver [4].
> > > > 
> > > > To support a 1 to 1 relationship we move the main execution function
> > > > from a kthread to a work queue and add a new scheduling mode which
> > > > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > > > The new scheduling mode should unify all drivers usage with a 1 to 1
> > > > relationship and can be thought of as using scheduler as a dependency /
> > > > infligt job tracker rather than a true scheduler.
> > > 
> > > Yup, we're in the exact same situation with drm/asahi, so this is very
> > > welcome! We've been using the existing scheduler as-is, but this should help
> > > remove some unneeded complexity in this use case.
> > > 
> > 
> > That's the idea.
> > 
> > > Do you want me to pull in this series into our tree and make sure this all
> > > works out for us?
> > > 
> > 
> > We tested this in Xe and it definitely works for us but the more testing
> > the better.
> > 
> 
> I haven't gotten around to testing this series yet, but after more debugging
> of drm_sched issues I want to hear more about how Xe uses the scheduler.
> 
> From what I can tell, and from what Christian says, drm_sched has the hidden
> requirement that all job objects outlive the scheduler. I've run into
> several UAF bugs due to this. Not only that, it also currently has the
> requirement that all drm_sched fences outlive the scheduler object.
> 
> These requirements are subtle and only manifest as kernel oopses in rare
> corner cases, so it wasn't at all obvious to me that this was somehow a
> fundamental design assumption when I started using it.
> 
> As far as I can tell, this design is going to work in 99% of cases for
> global-schedulers-per-GPU models, where those corner cases would have to be
> hit on top of a GPU removal scenario (and GPU remove is... well, not the
> most tested/exercised use case). When the scheduler basically lives forever,
> none of this really matters.
> 
> But with a one-scheduler-per-queue model, how do you deal with this when the
> queue goes away? So far, without any of the partial bugfixes I have sent so
> far (which Christian objected to):
> 
> - If you try to tear down a scheduler with any jobs currently scheduled at
> the hardware, drm_sched will oops when those jobs complete and the hw fences
> signal.
> - If you try to tear down an entity (which should cancel all its pending
> jobs) and then the scheduler it was attached to without actually waiting for
> all the free_job() callbacks to be called on every job that ever existed for
> that entity, you can oops (entity cleanup is asynchronous in some cases like
> killed processes, so it will return before all jobs are freed and then that
> asynchronous process will crash and burn if the scheduler goes away out from
> under its feet). Waiting for job completion fences is not enough for this,
> you have to wait until free_job() has actually been called for all jobs.
> - Even if you actually wait for all jobs to be truly gone and then tear down
> the scheduler, if any scheduler job fences remain alive, that will then oops
> if you try to call the debug functions on them (like cat
> /sys/kernel/debug/dma_buf/bufinfo).
> 
> I tried to fix these things, but Christian objected implying it was the
> driver's job to keep a reference from jobs and hw fences to the scheduler.
> But I find that completely broken, because besides the extra memory/resource
> usage keeping the scheduler alive when you're trying to free resources as
> fast as possible when a process goes away, you can't even use normal
> reference counting for that: if you try to drop the last drm_sched reference
> from within a free_job() callback, the whole thing deadlocks since that will
> be running in the scheduler's thread/workqueue context, which can't free
> itself. So now you both reference count the scheduler from jobs and fences,
> and on top of that you need to outsource drm_sched freeing to a workqueue in
> the driver to make sure you don't deadlock.
> 
> For job fences this is particularly broken, because those fences can live
> forever signaled and attached to shared buffers and there is no guarantee
> that they will be freed in any kind of reasonable time frame. If they have
> to keep the scheduler that created them alive, that creates a lot of dead
> object junk we have to drag around just because a signaled fence exists
> somewhere.
> 
> For a Rust abstraction we have to do all that tracking and refcounting in
> the abstraction itself to make it safe, which is starting to sound like
> reimplementing half of the job tracking drm_sched itself does just to fix
> the lifetime issues, which really tells me the existing design is not sound
> nor easy to use correctly in general.
> 
> How does Xe deal with this (does it deal with it at all)? What happens when
> you kill -9 a process using the GPU? Does freeing all of this wait for all
> jobs to complete *and be freed* with free_job()? What about exported
> dma_bufs with fences attached from that scheduler? Do you keep the scheduler
> alive for those?
> 
> Personally, after running into all this, and after seeing Christian's
> reaction to me trying to improve the state of things, I'm starting to doubt
> that drm_sched is the right solution at all for firmware-scheduling drivers.

Bit a wash-up reply on the more fundamental thing here:
 
For the current scheduler the issues you've found are indeed all driver
bugs (or most I think at least).

Which is why I think we shouldn't just try to shoehorn fundamentally new
semantics without updating the driver interfaces (the drm_sched split into
the driver interface part and the internal scheduler part). Once we have
that, including kerneldoc update and what the rules are, then all the
various uaf you've discovered become real bugs and I don't see any issue
merging all the fixes.

Without that we do have a chicken/egg problem between:

"here's a bunch of hacks to make the problems disappear I've hit in my
reuse of drm/sched for fw schedulers"

vs.

"this makes no sense for the current drm/sched interfaces and how current
upstream drivers use it"

I don't think there's a lot needed in terms of drm/sched driver api
rework, but I think it's also pretty clearly not ever going to get
anywhere with just nothing at all. Writing an entire new scheduler lib
instead of at least trying what minimal semantic changes (instead of just
a pile of hacks without even doc changes for the new rules) does not sound
like a good idea to me :-)

> If you want a workload to try to see if you run into any of these things,
> running and killing lots of things in parallel is a good thing to try (mess
> with the numbers and let it run for a while to see if you can hit any corner
> cases):
> 
> while true; do for i in $(seq 1 10); do timeout -k 0.01 0.05 glxgears &
> done; sleep 0.1; done

Maybe xe gets away with this due to synchronously killing everything
related to a ctx, but yeah I'd expect this to go boom in fun ways.
-Daniel
Daniel Vetter April 11, 2023, 2:13 p.m. UTC | #34
On Tue, Apr 11, 2023 at 11:02:55AM +0200, Christian König wrote:
> The point is that this not only requires some work in the drm_scheduler, but
> rather it then makes only little sense to use the drm_scheduler in the first
> place.
> 
> The whole point of the drm_scheduler is to provide dma_fence implementation
> for the submitted jobs.
> 
> We also have dependency handling, but as Daniel and I said this can be
> easily extracted into a separate object/component.

Uh that's not what I meant. My take is that minimally patching drm/sched
to make the out-fence either optional, or complete it right away, is the
simplest way to get at the dependency handling. For me at least the major
part of drm/sched is the dep handling and timeout stuff. And the later can
be reused with some glue to handle preempt timeouts too and other things,
since tdr is a work struct you can just issue any other gpu timeouts on
the same workqueue and using the roughly same pattern as the ->timed_out
hook and it'll just work.

The entire "oh we also make sure your hw fence doesn't leak into public
fences and causes lifetime mayhem" seems pretty minor. And maybe also
something we want to replicate for the preempt-ctx dma_fence that some
long-running context need (but more as part of drm_sched_entity I guess).

We can of course bikeshed how much flexibility really should be in the
different parts of drm/sched, but imo that's a bikeshed.
-Daniel


> 
> Regards,
> Christian.
> 
> Am 07.04.23 um 02:20 schrieb Zeng, Oak:
> > So this series basically go with option 2. The part that option2 makes me uncomfortable is, dma-fence doesn't work for long running workload, why we generate it in the first place? As long as dma-fence is generated, it will become a source of confusion in the future. It doesn't matter how much you annotate it/document it. So if we decide to go with option2, the bottom line is, don't generate dma-fence for long running workload during job submission. This requires some rework in drm scheduler.
> > 
> > The cleanest solution to me is option3. Dma-fence is a very old technology. When it was created, no gpu support page fault. Obviously this is not a good technology for modern gpu with page fault support. I think the best way is to create a new scheduler and dependency tracking mechanism works for both page fault enabled and page fault disabled context. I think this matches what Christian said below. Maybe nobody think this is easy?
> > 
> > Thanks,
> > Oak
> > 
> > > -----Original Message-----
> > > From: Brost, Matthew <matthew.brost@intel.com>
> > > Sent: April 5, 2023 2:53 PM
> > > To: Zeng, Oak <oak.zeng@intel.com>
> > > Cc: Christian König <christian.koenig@amd.com>; Vetter, Daniel
> > > <daniel.vetter@intel.com>; Thomas Hellström
> > > <thomas.hellstrom@linux.intel.com>; dri-devel@lists.freedesktop.org; intel-
> > > xe@lists.freedesktop.org; robdclark@chromium.org; airlied@linux.ie;
> > > lina@asahilina.net; boris.brezillon@collabora.com; faith.ekstrand@collabora.com
> > > Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> > > plans
> > > 
> > > On Wed, Apr 05, 2023 at 12:06:53PM -0600, Zeng, Oak wrote:
> > > > Hi,
> > > > 
> > > > Using dma-fence for completion/dependency tracking for long-run
> > > workload(more precisely on-demand paging/page fault enabled workload) can
> > > cause deadlock. This seems the significant issue here. Other issues such as the
> > > drm scheduler completion order implication etc are minors which can be solve
> > > inside the framework of drm scheduler. We need to evaluate below paths:
> > > > 	1) still use drm scheduler for job submission, and use dma-fence for job
> > > completion waiting/dependency tracking. This is solution proposed in this series.
> > > Annotate dma-fence for long-run workload: user can still wait dma-fence for job
> > > completion but can't wait dma-fence while holding any memory management
> > > locks.  We still use dma-fence for dependency tracking. But it is just very easily
> > > run into deadlock when on-demand paging is in the picture. The annotation helps
> > > us to detect deadlock but not solve deadlock problems. Seems *not* a complete
> > > solution: It is almost impossible to completely avoid dependency deadlock in
> > > complex runtime environment
> > > No one can wait on LR fence, so it is impossible to deadlock. The
> > > annotations enforce this. Literally this is only for flow controling the
> > > ring / hold pending jobs in in the DRM schedule list.
> > > 
> > > > 	2) Still use drm scheduler but not use dma-fence for completion signaling
> > > and dependency tracking. This way we still get some free functions (reset, err
> > > handling ring flow control as Matt said)from drm scheduler, but push the
> > > dependency/completion tracking completely to user space using techniques such
> > > as user space fence. User space doesn't have chance to wait fence while holding
> > > a kernel memory management lock, thus the dma-fence deadlock issue is solved.
> > > We use user space fence for syncs.
> > > 
> > > > 	3) Completely discard drm scheduler and dma-fence for long-run
> > > workload. Use user queue/doorbell for super fast submission, directly interact
> > > with fw scheduler. Use user fence for completion/dependency tracking.
> > > This is a hard no from me, I want 1 submission path in Xe. Either we use
> > > the DRM scheduler or we don't.
> > > 
> > > Matt
> > > 
> > > > Thanks,
> > > > Oak
> > > > 
> > > > > -----Original Message-----
> > > > > From: Christian König <christian.koenig@amd.com>
> > > > > Sent: April 5, 2023 3:30 AM
> > > > > To: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak
> > > > > <oak.zeng@intel.com>
> > > > > Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org;
> > > > > robdclark@chromium.org; thomas.hellstrom@linux.intel.com;
> > > airlied@linux.ie;
> > > > > lina@asahilina.net; boris.brezillon@collabora.com;
> > > faith.ekstrand@collabora.com
> > > > > Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> > > > > plans
> > > > > 
> > > > > Am 04.04.23 um 20:08 schrieb Matthew Brost:
> > > > > > On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
> > > > > > > Hi Matt, Thomas,
> > > > > > > 
> > > > > > > Some very bold out of box thinking in this area:
> > > > > > > 
> > > > > > > 1. so you want to use drm scheduler and dma-fence for long running
> > > workload.
> > > > > Why you want to do this in the first place? What is the benefit? Drm scheduler
> > > is
> > > > > pretty much a software scheduler. Modern gpu has scheduler built at fw/hw
> > > > > level, as you said below for intel this is Guc. Can xe driver just directly submit
> > > job
> > > > > to Guc, bypassing drm scheduler?
> > > > > > If we did that now we have 2 paths for dependency track, flow controling
> > > > > > the ring, resets / error handling / backend submission implementations.
> > > > > > We don't want this.
> > > > > Well exactly that's the point: Why?
> > > > > 
> > > > > As far as I can see that are two completely distinct use cases, so you
> > > > > absolutely do want two completely distinct implementations for this.
> > > > > 
> > > > > > > 2. using dma-fence for long run workload: I am well aware that page fault
> > > (and
> > > > > the consequent memory allocation/lock acquiring to fix the fault) can cause
> > > > > deadlock for a dma-fence wait. But I am not convinced that dma-fence can't
> > > be
> > > > > used purely because the nature of the workload that it runs very long
> > > (indefinite).
> > > > > I did a math: the dma_fence_wait_timeout function's third param is the
> > > timeout
> > > > > which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not
> > > long
> > > > > enough, can we just change the timeout parameter to signed 64 bits so it is
> > > much
> > > > > longer than our life time...
> > > > > > > So I mainly argue we can't use dma-fence for long-run workload is not
> > > > > because the workload runs very long, rather because of the fact that we use
> > > > > page fault for long-run workload. If we enable page fault for short-run
> > > workload,
> > > > > we can't use dma-fence either. Page fault is the key thing here.
> > > > > > > Now since we use page fault which is *fundamentally* controversial with
> > > > > dma-fence design, why now just introduce a independent concept such as
> > > user-
> > > > > fence instead of extending existing dma-fence?
> > > > > > > I like unified design. If drm scheduler, dma-fence can be extended to work
> > > for
> > > > > everything, it is beautiful. But seems we have some fundamental problem
> > > here.
> > > > > > Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use
> > > > > > the signal / CB infrastructure) and enforce we don't use use these
> > > > > > dma-fences from the scheduler in memory reclaim paths or export these to
> > > > > > user space or other drivers. Think of this mode as SW only fence.
> > > > > Yeah and I truly think this is an really bad idea.
> > > > > 
> > > > > The signal/CB infrastructure in the dma_fence turned out to be the
> > > > > absolutely nightmare I initially predicted. Sorry to say that, but in
> > > > > this case the "I've told you so" is appropriate in my opinion.
> > > > > 
> > > > > If we need infrastructure for long running dependency tracking we should
> > > > > encapsulate that in a new framework and not try to mangle the existing
> > > > > code for something it was never intended for.
> > > > > 
> > > > > Christian.
> > > > > 
> > > > > > Matt
> > > > > > 
> > > > > > > Thanks,
> > > > > > > Oak
> > > > > > > 
> > > > > > > > -----Original Message-----
> > > > > > > > From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
> > > > > > > > Matthew Brost
> > > > > > > > Sent: April 3, 2023 8:22 PM
> > > > > > > > To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org
> > > > > > > > Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com;
> > > > > airlied@linux.ie;
> > > > > > > > lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew
> > > > > > > > <matthew.brost@intel.com>; christian.koenig@amd.com;
> > > > > > > > faith.ekstrand@collabora.com
> > > > > > > > Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> > > > > plans
> > > > > > > > Hello,
> > > > > > > > 
> > > > > > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > > > > > > > have been asked to merge our common DRM scheduler patches first as
> > > well
> > > > > > > > as develop a common solution for long running workloads with the DRM
> > > > > > > > scheduler. This RFC series is our first attempt at doing this. We
> > > > > > > > welcome any and all feedback.
> > > > > > > > 
> > > > > > > > This can we thought of as 4 parts detailed below.
> > > > > > > > 
> > > > > > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > > > > > > > entity (patches 1-3)
> > > > > > > > 
> > > > > > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > > > > > > > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > > > > > > > severals problems as the DRM was originally designed to schedule jobs
> > > on
> > > > > > > > hardware queues. The main problem being that DRM scheduler expects
> > > the
> > > > > > > > submission order of jobs to be the completion order of jobs even across
> > > > > > > > multiple entities. This assumption falls apart with a firmware scheduler
> > > > > > > > as a firmware scheduler has no concept of jobs and jobs can complete
> > > out
> > > > > > > > of order. A novel solution for was originally thought of by Faith during
> > > > > > > > the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > > > > > > > and entity. I believe the AGX driver [3] is using this approach and
> > > > > > > > Boris may use approach as well for the Mali driver [4].
> > > > > > > > 
> > > > > > > > To support a 1 to 1 relationship we move the main execution function
> > > > > > > > from a kthread to a work queue and add a new scheduling mode which
> > > > > > > > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > > > > > > > The new scheduling mode should unify all drivers usage with a 1 to 1
> > > > > > > > relationship and can be thought of as using scheduler as a dependency /
> > > > > > > > infligt job tracker rather than a true scheduler.
> > > > > > > > 
> > > > > > > > - Generic messaging interface for DRM scheduler
> > > > > > > > 
> > > > > > > > Idea is to be able to communicate to the submission backend with in
> > > band
> > > > > > > > (relative to main execution function) messages. Messages are backend
> > > > > > > > defined and flexable enough for any use case. In Xe we use these
> > > > > > > > messages to clean up entites, set properties for entites, and suspend /
> > > > > > > > resume execution of an entity [5]. I suspect other driver can leverage
> > > > > > > > this messaging concept too as it a convenient way to avoid races in the
> > > > > > > > backend.
> > > > > > > > 
> > > > > > > > - Support for using TDR for all error paths of a scheduler / entity
> > > > > > > > 
> > > > > > > > Fix a few races / bugs, add function to dynamically set the TDR timeout.
> > > > > > > > 
> > > > > > > > - Annotate dma-fences for long running workloads.
> > > > > > > > 
> > > > > > > > The idea here is to use dma-fences only as sync points within the
> > > > > > > > scheduler and never export them for long running workloads. By
> > > > > > > > annotating these fences as long running we ensure that these dma-
> > > fences
> > > > > > > > are never used in a way that breaks the dma-fence rules. A benefit of
> > > > > > > > thus approach is the scheduler can still safely flow control the
> > > > > > > > execution ring buffer via the job limit without breaking the dma-fence
> > > > > > > > rules.
> > > > > > > > 
> > > > > > > > Again this a first draft and looking forward to feedback.
> > > > > > > > 
> > > > > > > > Enjoy - Matt
> > > > > > > > 
> > > > > > > > [1] https://gitlab.freedesktop.org/drm/xe/kernel
> > > > > > > > [2] https://patchwork.freedesktop.org/series/112188/
> > > > > > > > [3] https://patchwork.freedesktop.org/series/114772/
> > > > > > > > [4]
> > > https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> > > > > > > > [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-
> > > > > > > > next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> > > > > > > > 
> > > > > > > > Matthew Brost (8):
> > > > > > > >     drm/sched: Convert drm scheduler to use a work queue rather than
> > > > > > > >       kthread
> > > > > > > >     drm/sched: Move schedule policy to scheduler / entity
> > > > > > > >     drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling
> > > policy
> > > > > > > >     drm/sched: Add generic scheduler message interface
> > > > > > > >     drm/sched: Start run wq before TDR in drm_sched_start
> > > > > > > >     drm/sched: Submit job before starting TDR
> > > > > > > >     drm/sched: Add helper to set TDR timeout
> > > > > > > >     drm/syncobj: Warn on long running dma-fences
> > > > > > > > 
> > > > > > > > Thomas Hellström (2):
> > > > > > > >     dma-buf/dma-fence: Introduce long-running completion fences
> > > > > > > >     drm/sched: Support long-running sched entities
> > > > > > > > 
> > > > > > > >    drivers/dma-buf/dma-fence.c                 | 142 +++++++---
> > > > > > > >    drivers/dma-buf/dma-resv.c                  |   5 +
> > > > > > > >    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> > > > > > > >    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
> > > > > > > >    drivers/gpu/drm/drm_syncobj.c               |   5 +-
> > > > > > > >    drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
> > > > > > > >    drivers/gpu/drm/lima/lima_sched.c           |   5 +-
> > > > > > > >    drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
> > > > > > > >    drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
> > > > > > > >    drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
> > > > > > > >    drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
> > > > > > > >    drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
> > > > > > > >    drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++--
> > > ---
> > > > > > > >    drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
> > > > > > > >    include/drm/gpu_scheduler.h                 | 130 +++++++--
> > > > > > > >    include/linux/dma-fence.h                   |  60 ++++-
> > > > > > > >    16 files changed, 649 insertions(+), 184 deletions(-)
> > > > > > > > 
> > > > > > > > --
> > > > > > > > 2.34.1
>
Asahi Lina April 12, 2023, 5:47 a.m. UTC | #35
On 11/04/2023 23.07, Daniel Vetter wrote:
> On Sat, Apr 08, 2023 at 04:05:20PM +0900, Asahi Lina wrote:
>> On 04/04/2023 10.58, Matthew Brost wrote:
>>> On Tue, Apr 04, 2023 at 10:07:48AM +0900, Asahi Lina wrote:
>>>> Hi, thanks for the Cc!
>>>>
>>>
>>> No problem.
>>>
>>>> On 04/04/2023 09.22, Matthew Brost wrote:
>>>>> Hello,
>>>>>
>>>>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
>>>>> have been asked to merge our common DRM scheduler patches first as well
>>>>> as develop a common solution for long running workloads with the DRM
>>>>> scheduler. This RFC series is our first attempt at doing this. We
>>>>> welcome any and all feedback.
>>>>>
>>>>> This can we thought of as 4 parts detailed below.
>>>>>
>>>>> - DRM scheduler changes for 1 to 1 relationship between scheduler and
>>>>> entity (patches 1-3)
>>>>>
>>>>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
>>>>> GuC) which is a new paradigm WRT to the DRM scheduler and presents
>>>>> severals problems as the DRM was originally designed to schedule jobs on
>>>>> hardware queues. The main problem being that DRM scheduler expects the
>>>>> submission order of jobs to be the completion order of jobs even across
>>>>> multiple entities. This assumption falls apart with a firmware scheduler
>>>>> as a firmware scheduler has no concept of jobs and jobs can complete out
>>>>> of order. A novel solution for was originally thought of by Faith during
>>>>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
>>>>> and entity. I believe the AGX driver [3] is using this approach and
>>>>> Boris may use approach as well for the Mali driver [4].
>>>>>
>>>>> To support a 1 to 1 relationship we move the main execution function
>>>>> from a kthread to a work queue and add a new scheduling mode which
>>>>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
>>>>> The new scheduling mode should unify all drivers usage with a 1 to 1
>>>>> relationship and can be thought of as using scheduler as a dependency /
>>>>> infligt job tracker rather than a true scheduler.
>>>>
>>>> Yup, we're in the exact same situation with drm/asahi, so this is very
>>>> welcome! We've been using the existing scheduler as-is, but this should help
>>>> remove some unneeded complexity in this use case.
>>>>
>>>
>>> That's the idea.
>>>
>>>> Do you want me to pull in this series into our tree and make sure this all
>>>> works out for us?
>>>>
>>>
>>> We tested this in Xe and it definitely works for us but the more testing
>>> the better.
>>>
>>
>> I haven't gotten around to testing this series yet, but after more debugging
>> of drm_sched issues I want to hear more about how Xe uses the scheduler.
>>
>>  From what I can tell, and from what Christian says, drm_sched has the hidden
>> requirement that all job objects outlive the scheduler. I've run into
>> several UAF bugs due to this. Not only that, it also currently has the
>> requirement that all drm_sched fences outlive the scheduler object.
>>
>> These requirements are subtle and only manifest as kernel oopses in rare
>> corner cases, so it wasn't at all obvious to me that this was somehow a
>> fundamental design assumption when I started using it.
>>
>> As far as I can tell, this design is going to work in 99% of cases for
>> global-schedulers-per-GPU models, where those corner cases would have to be
>> hit on top of a GPU removal scenario (and GPU remove is... well, not the
>> most tested/exercised use case). When the scheduler basically lives forever,
>> none of this really matters.
>>
>> But with a one-scheduler-per-queue model, how do you deal with this when the
>> queue goes away? So far, without any of the partial bugfixes I have sent so
>> far (which Christian objected to):
>>
>> - If you try to tear down a scheduler with any jobs currently scheduled at
>> the hardware, drm_sched will oops when those jobs complete and the hw fences
>> signal.
>> - If you try to tear down an entity (which should cancel all its pending
>> jobs) and then the scheduler it was attached to without actually waiting for
>> all the free_job() callbacks to be called on every job that ever existed for
>> that entity, you can oops (entity cleanup is asynchronous in some cases like
>> killed processes, so it will return before all jobs are freed and then that
>> asynchronous process will crash and burn if the scheduler goes away out from
>> under its feet). Waiting for job completion fences is not enough for this,
>> you have to wait until free_job() has actually been called for all jobs.
>> - Even if you actually wait for all jobs to be truly gone and then tear down
>> the scheduler, if any scheduler job fences remain alive, that will then oops
>> if you try to call the debug functions on them (like cat
>> /sys/kernel/debug/dma_buf/bufinfo).
>>
>> I tried to fix these things, but Christian objected implying it was the
>> driver's job to keep a reference from jobs and hw fences to the scheduler.
>> But I find that completely broken, because besides the extra memory/resource
>> usage keeping the scheduler alive when you're trying to free resources as
>> fast as possible when a process goes away, you can't even use normal
>> reference counting for that: if you try to drop the last drm_sched reference
>> from within a free_job() callback, the whole thing deadlocks since that will
>> be running in the scheduler's thread/workqueue context, which can't free
>> itself. So now you both reference count the scheduler from jobs and fences,
>> and on top of that you need to outsource drm_sched freeing to a workqueue in
>> the driver to make sure you don't deadlock.
>>
>> For job fences this is particularly broken, because those fences can live
>> forever signaled and attached to shared buffers and there is no guarantee
>> that they will be freed in any kind of reasonable time frame. If they have
>> to keep the scheduler that created them alive, that creates a lot of dead
>> object junk we have to drag around just because a signaled fence exists
>> somewhere.
>>
>> For a Rust abstraction we have to do all that tracking and refcounting in
>> the abstraction itself to make it safe, which is starting to sound like
>> reimplementing half of the job tracking drm_sched itself does just to fix
>> the lifetime issues, which really tells me the existing design is not sound
>> nor easy to use correctly in general.
>>
>> How does Xe deal with this (does it deal with it at all)? What happens when
>> you kill -9 a process using the GPU? Does freeing all of this wait for all
>> jobs to complete *and be freed* with free_job()? What about exported
>> dma_bufs with fences attached from that scheduler? Do you keep the scheduler
>> alive for those?
>>
>> Personally, after running into all this, and after seeing Christian's
>> reaction to me trying to improve the state of things, I'm starting to doubt
>> that drm_sched is the right solution at all for firmware-scheduling drivers.
> 
> Bit a wash-up reply on the more fundamental thing here:
>   
> For the current scheduler the issues you've found are indeed all driver
> bugs (or most I think at least).

Even the last one with the fences? I can't see how that could be 
implemented correctly today by any driver, short of having the driver 
live until any buffers it has touched and installed a fence into go 
away, which doesn't sound right, since that would block cleanup (and 
module unloading) possibly forever, and that itself sounds like a bug...

This is why I'm a bit disappointed here, because even that one got me a 
"you're doing it wrong" response from Christian... but if scheduler 
fences are supposed to be outlived by the driver and its fences, what is 
even the point of having separate fences?

> 
> Which is why I think we shouldn't just try to shoehorn fundamentally new
> semantics without updating the driver interfaces (the drm_sched split into
> the driver interface part and the internal scheduler part). Once we have
> that, including kerneldoc update and what the rules are, then all the
> various uaf you've discovered become real bugs and I don't see any issue
> merging all the fixes.
> 
> Without that we do have a chicken/egg problem between:
> 
> "here's a bunch of hacks to make the problems disappear I've hit in my
> reuse of drm/sched for fw schedulers"
> 
> vs.
> 
> "this makes no sense for the current drm/sched interfaces and how current
> upstream drivers use it"
> 
> I don't think there's a lot needed in terms of drm/sched driver api
> rework, but I think it's also pretty clearly not ever going to get
> anywhere with just nothing at all. Writing an entire new scheduler lib
> instead of at least trying what minimal semantic changes (instead of just
> a pile of hacks without even doc changes for the new rules) does not sound
> like a good idea to me :-)

I wish I knew what the old rules were, since they're still not documented...

It's frustrating following what few rules are written down, running into 
a bug, writing a patch to fix it, and being told "no, you're just not 
following the unwritten rules"... several times now.

> 
>> If you want a workload to try to see if you run into any of these things,
>> running and killing lots of things in parallel is a good thing to try (mess
>> with the numbers and let it run for a while to see if you can hit any corner
>> cases):
>>
>> while true; do for i in $(seq 1 10); do timeout -k 0.01 0.05 glxgears &
>> done; sleep 0.1; done
> 
> Maybe xe gets away with this due to synchronously killing everything
> related to a ctx, but yeah I'd expect this to go boom in fun ways.

It'd have to explicitly refcount all the jobs and block killing the ctx 
until all jobs are freed (not just signaled) for that not to go boom 
right now, but even then you'd still have the issue with dangling fences 
in buffers making `cat /sys/kernel/debug/dma_buf/bufinfo` oops... and 
you can't synchronously kill those as far as I know.

~~ Lina
Daniel Vetter April 12, 2023, 8:18 a.m. UTC | #36
On Wed, Apr 12, 2023 at 02:47:52PM +0900, Asahi Lina wrote:
> On 11/04/2023 23.07, Daniel Vetter wrote:
> > On Sat, Apr 08, 2023 at 04:05:20PM +0900, Asahi Lina wrote:
> > > On 04/04/2023 10.58, Matthew Brost wrote:
> > > > On Tue, Apr 04, 2023 at 10:07:48AM +0900, Asahi Lina wrote:
> > > > > Hi, thanks for the Cc!
> > > > > 
> > > > 
> > > > No problem.
> > > > 
> > > > > On 04/04/2023 09.22, Matthew Brost wrote:
> > > > > > Hello,
> > > > > > 
> > > > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > > > > > have been asked to merge our common DRM scheduler patches first as well
> > > > > > as develop a common solution for long running workloads with the DRM
> > > > > > scheduler. This RFC series is our first attempt at doing this. We
> > > > > > welcome any and all feedback.
> > > > > > 
> > > > > > This can we thought of as 4 parts detailed below.
> > > > > > 
> > > > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > > > > > entity (patches 1-3)
> > > > > > 
> > > > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > > > > > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > > > > > severals problems as the DRM was originally designed to schedule jobs on
> > > > > > hardware queues. The main problem being that DRM scheduler expects the
> > > > > > submission order of jobs to be the completion order of jobs even across
> > > > > > multiple entities. This assumption falls apart with a firmware scheduler
> > > > > > as a firmware scheduler has no concept of jobs and jobs can complete out
> > > > > > of order. A novel solution for was originally thought of by Faith during
> > > > > > the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > > > > > and entity. I believe the AGX driver [3] is using this approach and
> > > > > > Boris may use approach as well for the Mali driver [4].
> > > > > > 
> > > > > > To support a 1 to 1 relationship we move the main execution function
> > > > > > from a kthread to a work queue and add a new scheduling mode which
> > > > > > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > > > > > The new scheduling mode should unify all drivers usage with a 1 to 1
> > > > > > relationship and can be thought of as using scheduler as a dependency /
> > > > > > infligt job tracker rather than a true scheduler.
> > > > > 
> > > > > Yup, we're in the exact same situation with drm/asahi, so this is very
> > > > > welcome! We've been using the existing scheduler as-is, but this should help
> > > > > remove some unneeded complexity in this use case.
> > > > > 
> > > > 
> > > > That's the idea.
> > > > 
> > > > > Do you want me to pull in this series into our tree and make sure this all
> > > > > works out for us?
> > > > > 
> > > > 
> > > > We tested this in Xe and it definitely works for us but the more testing
> > > > the better.
> > > > 
> > > 
> > > I haven't gotten around to testing this series yet, but after more debugging
> > > of drm_sched issues I want to hear more about how Xe uses the scheduler.
> > > 
> > >  From what I can tell, and from what Christian says, drm_sched has the hidden
> > > requirement that all job objects outlive the scheduler. I've run into
> > > several UAF bugs due to this. Not only that, it also currently has the
> > > requirement that all drm_sched fences outlive the scheduler object.
> > > 
> > > These requirements are subtle and only manifest as kernel oopses in rare
> > > corner cases, so it wasn't at all obvious to me that this was somehow a
> > > fundamental design assumption when I started using it.
> > > 
> > > As far as I can tell, this design is going to work in 99% of cases for
> > > global-schedulers-per-GPU models, where those corner cases would have to be
> > > hit on top of a GPU removal scenario (and GPU remove is... well, not the
> > > most tested/exercised use case). When the scheduler basically lives forever,
> > > none of this really matters.
> > > 
> > > But with a one-scheduler-per-queue model, how do you deal with this when the
> > > queue goes away? So far, without any of the partial bugfixes I have sent so
> > > far (which Christian objected to):
> > > 
> > > - If you try to tear down a scheduler with any jobs currently scheduled at
> > > the hardware, drm_sched will oops when those jobs complete and the hw fences
> > > signal.
> > > - If you try to tear down an entity (which should cancel all its pending
> > > jobs) and then the scheduler it was attached to without actually waiting for
> > > all the free_job() callbacks to be called on every job that ever existed for
> > > that entity, you can oops (entity cleanup is asynchronous in some cases like
> > > killed processes, so it will return before all jobs are freed and then that
> > > asynchronous process will crash and burn if the scheduler goes away out from
> > > under its feet). Waiting for job completion fences is not enough for this,
> > > you have to wait until free_job() has actually been called for all jobs.
> > > - Even if you actually wait for all jobs to be truly gone and then tear down
> > > the scheduler, if any scheduler job fences remain alive, that will then oops
> > > if you try to call the debug functions on them (like cat
> > > /sys/kernel/debug/dma_buf/bufinfo).
> > > 
> > > I tried to fix these things, but Christian objected implying it was the
> > > driver's job to keep a reference from jobs and hw fences to the scheduler.
> > > But I find that completely broken, because besides the extra memory/resource
> > > usage keeping the scheduler alive when you're trying to free resources as
> > > fast as possible when a process goes away, you can't even use normal
> > > reference counting for that: if you try to drop the last drm_sched reference
> > > from within a free_job() callback, the whole thing deadlocks since that will
> > > be running in the scheduler's thread/workqueue context, which can't free
> > > itself. So now you both reference count the scheduler from jobs and fences,
> > > and on top of that you need to outsource drm_sched freeing to a workqueue in
> > > the driver to make sure you don't deadlock.
> > > 
> > > For job fences this is particularly broken, because those fences can live
> > > forever signaled and attached to shared buffers and there is no guarantee
> > > that they will be freed in any kind of reasonable time frame. If they have
> > > to keep the scheduler that created them alive, that creates a lot of dead
> > > object junk we have to drag around just because a signaled fence exists
> > > somewhere.
> > > 
> > > For a Rust abstraction we have to do all that tracking and refcounting in
> > > the abstraction itself to make it safe, which is starting to sound like
> > > reimplementing half of the job tracking drm_sched itself does just to fix
> > > the lifetime issues, which really tells me the existing design is not sound
> > > nor easy to use correctly in general.
> > > 
> > > How does Xe deal with this (does it deal with it at all)? What happens when
> > > you kill -9 a process using the GPU? Does freeing all of this wait for all
> > > jobs to complete *and be freed* with free_job()? What about exported
> > > dma_bufs with fences attached from that scheduler? Do you keep the scheduler
> > > alive for those?
> > > 
> > > Personally, after running into all this, and after seeing Christian's
> > > reaction to me trying to improve the state of things, I'm starting to doubt
> > > that drm_sched is the right solution at all for firmware-scheduling drivers.
> > 
> > Bit a wash-up reply on the more fundamental thing here:
> > For the current scheduler the issues you've found are indeed all driver
> > bugs (or most I think at least).
> 
> Even the last one with the fences? I can't see how that could be implemented
> correctly today by any driver, short of having the driver live until any
> buffers it has touched and installed a fence into go away, which doesn't
> sound right, since that would block cleanup (and module unloading) possibly
> forever, and that itself sounds like a bug...
> 
> This is why I'm a bit disappointed here, because even that one got me a
> "you're doing it wrong" response from Christian... but if scheduler fences
> are supposed to be outlived by the driver and its fences, what is even the
> point of having separate fences?

Yeah that one sounds like a bug. Not sure what Christian was thinking, but
the point of the split between hw fence and public drm_job fence is that
the latter is supposed to be free-standing.

Otherwise we'd need to refcount the world and have an enourmous space
leak. i915-gem tried that, and it's not even close to pretty because you
get to untie all kinds of weak pointers in all kinds of really scary ways.

Maybe try resend that, but put a patch in front that fixes the kerneldoc
to explain this in really clear terms & why it's needed? Then 2nd patches
is more a "we fix the driver api contract here".

> > Which is why I think we shouldn't just try to shoehorn fundamentally new
> > semantics without updating the driver interfaces (the drm_sched split into
> > the driver interface part and the internal scheduler part). Once we have
> > that, including kerneldoc update and what the rules are, then all the
> > various uaf you've discovered become real bugs and I don't see any issue
> > merging all the fixes.
> > 
> > Without that we do have a chicken/egg problem between:
> > 
> > "here's a bunch of hacks to make the problems disappear I've hit in my
> > reuse of drm/sched for fw schedulers"
> > 
> > vs.
> > 
> > "this makes no sense for the current drm/sched interfaces and how current
> > upstream drivers use it"
> > 
> > I don't think there's a lot needed in terms of drm/sched driver api
> > rework, but I think it's also pretty clearly not ever going to get
> > anywhere with just nothing at all. Writing an entire new scheduler lib
> > instead of at least trying what minimal semantic changes (instead of just
> > a pile of hacks without even doc changes for the new rules) does not sound
> > like a good idea to me :-)
> 
> I wish I knew what the old rules were, since they're still not documented...
> 
> It's frustrating following what few rules are written down, running into a
> bug, writing a patch to fix it, and being told "no, you're just not
> following the unwritten rules"... several times now.

Yeah I get that. And I think I've put down a few places in the threads
where things derailed that the more constructive approach (instead of
random nak and shit like that) is to ask to improve the docs. Because rn
they're really not great for the scheduler :-/

> > > If you want a workload to try to see if you run into any of these things,
> > > running and killing lots of things in parallel is a good thing to try (mess
> > > with the numbers and let it run for a while to see if you can hit any corner
> > > cases):
> > > 
> > > while true; do for i in $(seq 1 10); do timeout -k 0.01 0.05 glxgears &
> > > done; sleep 0.1; done
> > 
> > Maybe xe gets away with this due to synchronously killing everything
> > related to a ctx, but yeah I'd expect this to go boom in fun ways.
> 
> It'd have to explicitly refcount all the jobs and block killing the ctx
> until all jobs are freed (not just signaled) for that not to go boom right
> now, but even then you'd still have the issue with dangling fences in
> buffers making `cat /sys/kernel/debug/dma_buf/bufinfo` oops... and you can't
> synchronously kill those as far as I know.

Yeah that sounds like a full on bug. Doc patch to sharpen the semantics we
want + fix should get this going - or I'll jump in and make it going :-)

For the bigger issue I guess we might just land on "drm/sched needs to
refcount a lot more for fw scheduler", which is again why I think some doc
patches + minimal driver api rework might be best. Sometimes just
submitting code isn't the best way to communicate.
-Daniel
Matthew Brost April 17, 2023, 12:03 a.m. UTC | #37
On Sat, Apr 08, 2023 at 04:05:20PM +0900, Asahi Lina wrote:
> On 04/04/2023 10.58, Matthew Brost wrote:
> > On Tue, Apr 04, 2023 at 10:07:48AM +0900, Asahi Lina wrote:
> > > Hi, thanks for the Cc!
> > > 
> > 
> > No problem.
> > 
> > > On 04/04/2023 09.22, Matthew Brost wrote:
> > > > Hello,
> > > > 
> > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > > > have been asked to merge our common DRM scheduler patches first as well
> > > > as develop a common solution for long running workloads with the DRM
> > > > scheduler. This RFC series is our first attempt at doing this. We
> > > > welcome any and all feedback.
> > > > 
> > > > This can we thought of as 4 parts detailed below.
> > > > 
> > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > > > entity (patches 1-3)
> > > > 
> > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > > > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > > > severals problems as the DRM was originally designed to schedule jobs on
> > > > hardware queues. The main problem being that DRM scheduler expects the
> > > > submission order of jobs to be the completion order of jobs even across
> > > > multiple entities. This assumption falls apart with a firmware scheduler
> > > > as a firmware scheduler has no concept of jobs and jobs can complete out
> > > > of order. A novel solution for was originally thought of by Faith during
> > > > the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > > > and entity. I believe the AGX driver [3] is using this approach and
> > > > Boris may use approach as well for the Mali driver [4].
> > > > 
> > > > To support a 1 to 1 relationship we move the main execution function
> > > > from a kthread to a work queue and add a new scheduling mode which
> > > > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > > > The new scheduling mode should unify all drivers usage with a 1 to 1
> > > > relationship and can be thought of as using scheduler as a dependency /
> > > > infligt job tracker rather than a true scheduler.
> > > 
> > > Yup, we're in the exact same situation with drm/asahi, so this is very
> > > welcome! We've been using the existing scheduler as-is, but this should help
> > > remove some unneeded complexity in this use case.
> > > 
> > 
> > That's the idea.
> > 
> > > Do you want me to pull in this series into our tree and make sure this all
> > > works out for us?
> > > 
> > 
> > We tested this in Xe and it definitely works for us but the more testing
> > the better.
> > 
> 
> I haven't gotten around to testing this series yet, but after more debugging
> of drm_sched issues I want to hear more about how Xe uses the scheduler.
> 
> From what I can tell, and from what Christian says, drm_sched has the hidden
> requirement that all job objects outlive the scheduler. I've run into
> several UAF bugs due to this. Not only that, it also currently has the
> requirement that all drm_sched fences outlive the scheduler object.
> 
> These requirements are subtle and only manifest as kernel oopses in rare
> corner cases, so it wasn't at all obvious to me that this was somehow a
> fundamental design assumption when I started using it.
> 
> As far as I can tell, this design is going to work in 99% of cases for
> global-schedulers-per-GPU models, where those corner cases would have to be
> hit on top of a GPU removal scenario (and GPU remove is... well, not the
> most tested/exercised use case). When the scheduler basically lives forever,
> none of this really matters.
> 
> But with a one-scheduler-per-queue model, how do you deal with this when the
> queue goes away? So far, without any of the partial bugfixes I have sent so
> far (which Christian objected to):
> 
> - If you try to tear down a scheduler with any jobs currently scheduled at
> the hardware, drm_sched will oops when those jobs complete and the hw fences
> signal.
> - If you try to tear down an entity (which should cancel all its pending
> jobs) and then the scheduler it was attached to without actually waiting for
> all the free_job() callbacks to be called on every job that ever existed for
> that entity, you can oops (entity cleanup is asynchronous in some cases like
> killed processes, so it will return before all jobs are freed and then that
> asynchronous process will crash and burn if the scheduler goes away out from
> under its feet). Waiting for job completion fences is not enough for this,
> you have to wait until free_job() has actually been called for all jobs.
> - Even if you actually wait for all jobs to be truly gone and then tear down
> the scheduler, if any scheduler job fences remain alive, that will then oops
> if you try to call the debug functions on them (like cat
> /sys/kernel/debug/dma_buf/bufinfo).
> 
> I tried to fix these things, but Christian objected implying it was the
> driver's job to keep a reference from jobs and hw fences to the scheduler.
> But I find that completely broken, because besides the extra memory/resource
> usage keeping the scheduler alive when you're trying to free resources as
> fast as possible when a process goes away, you can't even use normal
> reference counting for that: if you try to drop the last drm_sched reference
> from within a free_job() callback, the whole thing deadlocks since that will
> be running in the scheduler's thread/workqueue context, which can't free
> itself. So now you both reference count the scheduler from jobs and fences,
> and on top of that you need to outsource drm_sched freeing to a workqueue in
> the driver to make sure you don't deadlock.
> 

This what Xe does, jobs reference the scheduler / entity (xe_engine),
when the reference count of an xe_engine goes to zero we trigger the
teardown process (ping / pong with firmware) via a CLEANUP message, when
teardown is done the last step of killing the scheduler is indeed done
by an async worker as you suggest.

To kill a queue, we just kick the TDR which in turn kills any
outstanding job resulting the xe_engine ref count (at least from the
jobs) going to zero.

If a user holds ref to dma-fence of a job, then yes the scheduler isn't
going to be freed (it can be killed before as described above).

This all seems to work just fine for Xe.

> For job fences this is particularly broken, because those fences can live
> forever signaled and attached to shared buffers and there is no guarantee
> that they will be freed in any kind of reasonable time frame. If they have
> to keep the scheduler that created them alive, that creates a lot of dead
> object junk we have to drag around just because a signaled fence exists
> somewhere.
> 
> For a Rust abstraction we have to do all that tracking and refcounting in
> the abstraction itself to make it safe, which is starting to sound like
> reimplementing half of the job tracking drm_sched itself does just to fix
> the lifetime issues, which really tells me the existing design is not sound
> nor easy to use correctly in general.
> 
> How does Xe deal with this (does it deal with it at all)? What happens when
> you kill -9 a process using the GPU? Does freeing all of this wait for all
> jobs to complete *and be freed* with free_job()? What about exported
> dma_bufs with fences attached from that scheduler? Do you keep the scheduler
> alive for those?

kill -9 would trigger killing of the queue described above. Yes if
fences are exported the scheduler might hold onto some firmware
resources for a bit.

> 
> Personally, after running into all this, and after seeing Christian's
> reaction to me trying to improve the state of things, I'm starting to doubt
> that drm_sched is the right solution at all for firmware-scheduling drivers.
> 
> If you want a workload to try to see if you run into any of these things,
> running and killing lots of things in parallel is a good thing to try (mess
> with the numbers and let it run for a while to see if you can hit any corner
> cases):
> 
> while true; do for i in $(seq 1 10); do timeout -k 0.01 0.05 glxgears &
> done; sleep 0.1; done
>

Tested this and this works in Xe.

Feel free to ping me on IRC if you want to chat more about this.

Matt

> ~~ Lina
>
Christian König April 17, 2023, 6:47 a.m. UTC | #38
Am 11.04.23 um 16:13 schrieb Daniel Vetter:
> On Tue, Apr 11, 2023 at 11:02:55AM +0200, Christian König wrote:
>> The point is that this not only requires some work in the drm_scheduler, but
>> rather it then makes only little sense to use the drm_scheduler in the first
>> place.
>>
>> The whole point of the drm_scheduler is to provide dma_fence implementation
>> for the submitted jobs.
>>
>> We also have dependency handling, but as Daniel and I said this can be
>> easily extracted into a separate object/component.
> Uh that's not what I meant. My take is that minimally patching drm/sched
> to make the out-fence either optional, or complete it right away, is the
> simplest way to get at the dependency handling. For me at least the major
> part of drm/sched is the dep handling and timeout stuff. And the later can
> be reused with some glue to handle preempt timeouts too and other things,
> since tdr is a work struct you can just issue any other gpu timeouts on
> the same workqueue and using the roughly same pattern as the ->timed_out
> hook and it'll just work.

Well that strongly sounds like what I had in mind as well.

If we move the dependency/timeout functionality into a new component or 
if we move the scheduler fence into a new component doesn't seem to 
matter, the high level goal is that we have separated the two 
functionalities and both approach will work for that.

> The entire "oh we also make sure your hw fence doesn't leak into public
> fences and causes lifetime mayhem" seems pretty minor. And maybe also
> something we want to replicate for the preempt-ctx dma_fence that some
> long-running context need (but more as part of drm_sched_entity I guess).
>
> We can of course bikeshed how much flexibility really should be in the
> different parts of drm/sched, but imo that's a bikeshed.

Well the dependency handling in a separate component would still be 
interesting to have since we need something similar for user queues as well.

Christian.

> -Daniel
>
>
>> Regards,
>> Christian.
>>
>> Am 07.04.23 um 02:20 schrieb Zeng, Oak:
>>> So this series basically go with option 2. The part that option2 makes me uncomfortable is, dma-fence doesn't work for long running workload, why we generate it in the first place? As long as dma-fence is generated, it will become a source of confusion in the future. It doesn't matter how much you annotate it/document it. So if we decide to go with option2, the bottom line is, don't generate dma-fence for long running workload during job submission. This requires some rework in drm scheduler.
>>>
>>> The cleanest solution to me is option3. Dma-fence is a very old technology. When it was created, no gpu support page fault. Obviously this is not a good technology for modern gpu with page fault support. I think the best way is to create a new scheduler and dependency tracking mechanism works for both page fault enabled and page fault disabled context. I think this matches what Christian said below. Maybe nobody think this is easy?
>>>
>>> Thanks,
>>> Oak
>>>
>>>> -----Original Message-----
>>>> From: Brost, Matthew <matthew.brost@intel.com>
>>>> Sent: April 5, 2023 2:53 PM
>>>> To: Zeng, Oak <oak.zeng@intel.com>
>>>> Cc: Christian König <christian.koenig@amd.com>; Vetter, Daniel
>>>> <daniel.vetter@intel.com>; Thomas Hellström
>>>> <thomas.hellstrom@linux.intel.com>; dri-devel@lists.freedesktop.org; intel-
>>>> xe@lists.freedesktop.org; robdclark@chromium.org; airlied@linux.ie;
>>>> lina@asahilina.net; boris.brezillon@collabora.com; faith.ekstrand@collabora.com
>>>> Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
>>>> plans
>>>>
>>>> On Wed, Apr 05, 2023 at 12:06:53PM -0600, Zeng, Oak wrote:
>>>>> Hi,
>>>>>
>>>>> Using dma-fence for completion/dependency tracking for long-run
>>>> workload(more precisely on-demand paging/page fault enabled workload) can
>>>> cause deadlock. This seems the significant issue here. Other issues such as the
>>>> drm scheduler completion order implication etc are minors which can be solve
>>>> inside the framework of drm scheduler. We need to evaluate below paths:
>>>>> 	1) still use drm scheduler for job submission, and use dma-fence for job
>>>> completion waiting/dependency tracking. This is solution proposed in this series.
>>>> Annotate dma-fence for long-run workload: user can still wait dma-fence for job
>>>> completion but can't wait dma-fence while holding any memory management
>>>> locks.  We still use dma-fence for dependency tracking. But it is just very easily
>>>> run into deadlock when on-demand paging is in the picture. The annotation helps
>>>> us to detect deadlock but not solve deadlock problems. Seems *not* a complete
>>>> solution: It is almost impossible to completely avoid dependency deadlock in
>>>> complex runtime environment
>>>> No one can wait on LR fence, so it is impossible to deadlock. The
>>>> annotations enforce this. Literally this is only for flow controling the
>>>> ring / hold pending jobs in in the DRM schedule list.
>>>>
>>>>> 	2) Still use drm scheduler but not use dma-fence for completion signaling
>>>> and dependency tracking. This way we still get some free functions (reset, err
>>>> handling ring flow control as Matt said)from drm scheduler, but push the
>>>> dependency/completion tracking completely to user space using techniques such
>>>> as user space fence. User space doesn't have chance to wait fence while holding
>>>> a kernel memory management lock, thus the dma-fence deadlock issue is solved.
>>>> We use user space fence for syncs.
>>>>
>>>>> 	3) Completely discard drm scheduler and dma-fence for long-run
>>>> workload. Use user queue/doorbell for super fast submission, directly interact
>>>> with fw scheduler. Use user fence for completion/dependency tracking.
>>>> This is a hard no from me, I want 1 submission path in Xe. Either we use
>>>> the DRM scheduler or we don't.
>>>>
>>>> Matt
>>>>
>>>>> Thanks,
>>>>> Oak
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Christian König <christian.koenig@amd.com>
>>>>>> Sent: April 5, 2023 3:30 AM
>>>>>> To: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak
>>>>>> <oak.zeng@intel.com>
>>>>>> Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org;
>>>>>> robdclark@chromium.org; thomas.hellstrom@linux.intel.com;
>>>> airlied@linux.ie;
>>>>>> lina@asahilina.net; boris.brezillon@collabora.com;
>>>> faith.ekstrand@collabora.com
>>>>>> Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
>>>>>> plans
>>>>>>
>>>>>> Am 04.04.23 um 20:08 schrieb Matthew Brost:
>>>>>>> On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
>>>>>>>> Hi Matt, Thomas,
>>>>>>>>
>>>>>>>> Some very bold out of box thinking in this area:
>>>>>>>>
>>>>>>>> 1. so you want to use drm scheduler and dma-fence for long running
>>>> workload.
>>>>>> Why you want to do this in the first place? What is the benefit? Drm scheduler
>>>> is
>>>>>> pretty much a software scheduler. Modern gpu has scheduler built at fw/hw
>>>>>> level, as you said below for intel this is Guc. Can xe driver just directly submit
>>>> job
>>>>>> to Guc, bypassing drm scheduler?
>>>>>>> If we did that now we have 2 paths for dependency track, flow controling
>>>>>>> the ring, resets / error handling / backend submission implementations.
>>>>>>> We don't want this.
>>>>>> Well exactly that's the point: Why?
>>>>>>
>>>>>> As far as I can see that are two completely distinct use cases, so you
>>>>>> absolutely do want two completely distinct implementations for this.
>>>>>>
>>>>>>>> 2. using dma-fence for long run workload: I am well aware that page fault
>>>> (and
>>>>>> the consequent memory allocation/lock acquiring to fix the fault) can cause
>>>>>> deadlock for a dma-fence wait. But I am not convinced that dma-fence can't
>>>> be
>>>>>> used purely because the nature of the workload that it runs very long
>>>> (indefinite).
>>>>>> I did a math: the dma_fence_wait_timeout function's third param is the
>>>> timeout
>>>>>> which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not
>>>> long
>>>>>> enough, can we just change the timeout parameter to signed 64 bits so it is
>>>> much
>>>>>> longer than our life time...
>>>>>>>> So I mainly argue we can't use dma-fence for long-run workload is not
>>>>>> because the workload runs very long, rather because of the fact that we use
>>>>>> page fault for long-run workload. If we enable page fault for short-run
>>>> workload,
>>>>>> we can't use dma-fence either. Page fault is the key thing here.
>>>>>>>> Now since we use page fault which is *fundamentally* controversial with
>>>>>> dma-fence design, why now just introduce a independent concept such as
>>>> user-
>>>>>> fence instead of extending existing dma-fence?
>>>>>>>> I like unified design. If drm scheduler, dma-fence can be extended to work
>>>> for
>>>>>> everything, it is beautiful. But seems we have some fundamental problem
>>>> here.
>>>>>>> Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use
>>>>>>> the signal / CB infrastructure) and enforce we don't use use these
>>>>>>> dma-fences from the scheduler in memory reclaim paths or export these to
>>>>>>> user space or other drivers. Think of this mode as SW only fence.
>>>>>> Yeah and I truly think this is an really bad idea.
>>>>>>
>>>>>> The signal/CB infrastructure in the dma_fence turned out to be the
>>>>>> absolutely nightmare I initially predicted. Sorry to say that, but in
>>>>>> this case the "I've told you so" is appropriate in my opinion.
>>>>>>
>>>>>> If we need infrastructure for long running dependency tracking we should
>>>>>> encapsulate that in a new framework and not try to mangle the existing
>>>>>> code for something it was never intended for.
>>>>>>
>>>>>> Christian.
>>>>>>
>>>>>>> Matt
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Oak
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
>>>>>>>>> Matthew Brost
>>>>>>>>> Sent: April 3, 2023 8:22 PM
>>>>>>>>> To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org
>>>>>>>>> Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com;
>>>>>> airlied@linux.ie;
>>>>>>>>> lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew
>>>>>>>>> <matthew.brost@intel.com>; christian.koenig@amd.com;
>>>>>>>>> faith.ekstrand@collabora.com
>>>>>>>>> Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
>>>>>> plans
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
>>>>>>>>> have been asked to merge our common DRM scheduler patches first as
>>>> well
>>>>>>>>> as develop a common solution for long running workloads with the DRM
>>>>>>>>> scheduler. This RFC series is our first attempt at doing this. We
>>>>>>>>> welcome any and all feedback.
>>>>>>>>>
>>>>>>>>> This can we thought of as 4 parts detailed below.
>>>>>>>>>
>>>>>>>>> - DRM scheduler changes for 1 to 1 relationship between scheduler and
>>>>>>>>> entity (patches 1-3)
>>>>>>>>>
>>>>>>>>> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
>>>>>>>>> GuC) which is a new paradigm WRT to the DRM scheduler and presents
>>>>>>>>> severals problems as the DRM was originally designed to schedule jobs
>>>> on
>>>>>>>>> hardware queues. The main problem being that DRM scheduler expects
>>>> the
>>>>>>>>> submission order of jobs to be the completion order of jobs even across
>>>>>>>>> multiple entities. This assumption falls apart with a firmware scheduler
>>>>>>>>> as a firmware scheduler has no concept of jobs and jobs can complete
>>>> out
>>>>>>>>> of order. A novel solution for was originally thought of by Faith during
>>>>>>>>> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
>>>>>>>>> and entity. I believe the AGX driver [3] is using this approach and
>>>>>>>>> Boris may use approach as well for the Mali driver [4].
>>>>>>>>>
>>>>>>>>> To support a 1 to 1 relationship we move the main execution function
>>>>>>>>> from a kthread to a work queue and add a new scheduling mode which
>>>>>>>>> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
>>>>>>>>> The new scheduling mode should unify all drivers usage with a 1 to 1
>>>>>>>>> relationship and can be thought of as using scheduler as a dependency /
>>>>>>>>> infligt job tracker rather than a true scheduler.
>>>>>>>>>
>>>>>>>>> - Generic messaging interface for DRM scheduler
>>>>>>>>>
>>>>>>>>> Idea is to be able to communicate to the submission backend with in
>>>> band
>>>>>>>>> (relative to main execution function) messages. Messages are backend
>>>>>>>>> defined and flexable enough for any use case. In Xe we use these
>>>>>>>>> messages to clean up entites, set properties for entites, and suspend /
>>>>>>>>> resume execution of an entity [5]. I suspect other driver can leverage
>>>>>>>>> this messaging concept too as it a convenient way to avoid races in the
>>>>>>>>> backend.
>>>>>>>>>
>>>>>>>>> - Support for using TDR for all error paths of a scheduler / entity
>>>>>>>>>
>>>>>>>>> Fix a few races / bugs, add function to dynamically set the TDR timeout.
>>>>>>>>>
>>>>>>>>> - Annotate dma-fences for long running workloads.
>>>>>>>>>
>>>>>>>>> The idea here is to use dma-fences only as sync points within the
>>>>>>>>> scheduler and never export them for long running workloads. By
>>>>>>>>> annotating these fences as long running we ensure that these dma-
>>>> fences
>>>>>>>>> are never used in a way that breaks the dma-fence rules. A benefit of
>>>>>>>>> thus approach is the scheduler can still safely flow control the
>>>>>>>>> execution ring buffer via the job limit without breaking the dma-fence
>>>>>>>>> rules.
>>>>>>>>>
>>>>>>>>> Again this a first draft and looking forward to feedback.
>>>>>>>>>
>>>>>>>>> Enjoy - Matt
>>>>>>>>>
>>>>>>>>> [1] https://gitlab.freedesktop.org/drm/xe/kernel
>>>>>>>>> [2] https://patchwork.freedesktop.org/series/112188/
>>>>>>>>> [3] https://patchwork.freedesktop.org/series/114772/
>>>>>>>>> [4]
>>>> https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
>>>>>>>>> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-
>>>>>>>>> next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
>>>>>>>>>
>>>>>>>>> Matthew Brost (8):
>>>>>>>>>      drm/sched: Convert drm scheduler to use a work queue rather than
>>>>>>>>>        kthread
>>>>>>>>>      drm/sched: Move schedule policy to scheduler / entity
>>>>>>>>>      drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling
>>>> policy
>>>>>>>>>      drm/sched: Add generic scheduler message interface
>>>>>>>>>      drm/sched: Start run wq before TDR in drm_sched_start
>>>>>>>>>      drm/sched: Submit job before starting TDR
>>>>>>>>>      drm/sched: Add helper to set TDR timeout
>>>>>>>>>      drm/syncobj: Warn on long running dma-fences
>>>>>>>>>
>>>>>>>>> Thomas Hellström (2):
>>>>>>>>>      dma-buf/dma-fence: Introduce long-running completion fences
>>>>>>>>>      drm/sched: Support long-running sched entities
>>>>>>>>>
>>>>>>>>>     drivers/dma-buf/dma-fence.c                 | 142 +++++++---
>>>>>>>>>     drivers/dma-buf/dma-resv.c                  |   5 +
>>>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
>>>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
>>>>>>>>>     drivers/gpu/drm/drm_syncobj.c               |   5 +-
>>>>>>>>>     drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
>>>>>>>>>     drivers/gpu/drm/lima/lima_sched.c           |   5 +-
>>>>>>>>>     drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
>>>>>>>>>     drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
>>>>>>>>>     drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
>>>>>>>>>     drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
>>>>>>>>>     drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
>>>>>>>>>     drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++--
>>>> ---
>>>>>>>>>     drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
>>>>>>>>>     include/drm/gpu_scheduler.h                 | 130 +++++++--
>>>>>>>>>     include/linux/dma-fence.h                   |  60 ++++-
>>>>>>>>>     16 files changed, 649 insertions(+), 184 deletions(-)
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> 2.34.1
Daniel Vetter April 17, 2023, 8:39 a.m. UTC | #39
On Mon, Apr 17, 2023 at 08:47:19AM +0200, Christian König wrote:
> Am 11.04.23 um 16:13 schrieb Daniel Vetter:
> > On Tue, Apr 11, 2023 at 11:02:55AM +0200, Christian König wrote:
> > > The point is that this not only requires some work in the drm_scheduler, but
> > > rather it then makes only little sense to use the drm_scheduler in the first
> > > place.
> > > 
> > > The whole point of the drm_scheduler is to provide dma_fence implementation
> > > for the submitted jobs.
> > > 
> > > We also have dependency handling, but as Daniel and I said this can be
> > > easily extracted into a separate object/component.
> > Uh that's not what I meant. My take is that minimally patching drm/sched
> > to make the out-fence either optional, or complete it right away, is the
> > simplest way to get at the dependency handling. For me at least the major
> > part of drm/sched is the dep handling and timeout stuff. And the later can
> > be reused with some glue to handle preempt timeouts too and other things,
> > since tdr is a work struct you can just issue any other gpu timeouts on
> > the same workqueue and using the roughly same pattern as the ->timed_out
> > hook and it'll just work.
> 
> Well that strongly sounds like what I had in mind as well.
> 
> If we move the dependency/timeout functionality into a new component or if
> we move the scheduler fence into a new component doesn't seem to matter, the
> high level goal is that we have separated the two functionalities and both
> approach will work for that.

Ah ok, I guess I just got confused about your wording then.

> > The entire "oh we also make sure your hw fence doesn't leak into public
> > fences and causes lifetime mayhem" seems pretty minor. And maybe also
> > something we want to replicate for the preempt-ctx dma_fence that some
> > long-running context need (but more as part of drm_sched_entity I guess).
> > 
> > We can of course bikeshed how much flexibility really should be in the
> > different parts of drm/sched, but imo that's a bikeshed.
> 
> Well the dependency handling in a separate component would still be
> interesting to have since we need something similar for user queues as well.

Yeah it might be neater to refactor, but I think that part is optional at
least near-term. There's always room to polish shared code, and often it's
better to do that once you have at least some in-tree users for the new
need :-)
-Daniel

> 
> Christian.
> 
> > -Daniel
> > 
> > 
> > > Regards,
> > > Christian.
> > > 
> > > Am 07.04.23 um 02:20 schrieb Zeng, Oak:
> > > > So this series basically go with option 2. The part that option2 makes me uncomfortable is, dma-fence doesn't work for long running workload, why we generate it in the first place? As long as dma-fence is generated, it will become a source of confusion in the future. It doesn't matter how much you annotate it/document it. So if we decide to go with option2, the bottom line is, don't generate dma-fence for long running workload during job submission. This requires some rework in drm scheduler.
> > > > 
> > > > The cleanest solution to me is option3. Dma-fence is a very old technology. When it was created, no gpu support page fault. Obviously this is not a good technology for modern gpu with page fault support. I think the best way is to create a new scheduler and dependency tracking mechanism works for both page fault enabled and page fault disabled context. I think this matches what Christian said below. Maybe nobody think this is easy?
> > > > 
> > > > Thanks,
> > > > Oak
> > > > 
> > > > > -----Original Message-----
> > > > > From: Brost, Matthew <matthew.brost@intel.com>
> > > > > Sent: April 5, 2023 2:53 PM
> > > > > To: Zeng, Oak <oak.zeng@intel.com>
> > > > > Cc: Christian König <christian.koenig@amd.com>; Vetter, Daniel
> > > > > <daniel.vetter@intel.com>; Thomas Hellström
> > > > > <thomas.hellstrom@linux.intel.com>; dri-devel@lists.freedesktop.org; intel-
> > > > > xe@lists.freedesktop.org; robdclark@chromium.org; airlied@linux.ie;
> > > > > lina@asahilina.net; boris.brezillon@collabora.com; faith.ekstrand@collabora.com
> > > > > Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> > > > > plans
> > > > > 
> > > > > On Wed, Apr 05, 2023 at 12:06:53PM -0600, Zeng, Oak wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > Using dma-fence for completion/dependency tracking for long-run
> > > > > workload(more precisely on-demand paging/page fault enabled workload) can
> > > > > cause deadlock. This seems the significant issue here. Other issues such as the
> > > > > drm scheduler completion order implication etc are minors which can be solve
> > > > > inside the framework of drm scheduler. We need to evaluate below paths:
> > > > > > 	1) still use drm scheduler for job submission, and use dma-fence for job
> > > > > completion waiting/dependency tracking. This is solution proposed in this series.
> > > > > Annotate dma-fence for long-run workload: user can still wait dma-fence for job
> > > > > completion but can't wait dma-fence while holding any memory management
> > > > > locks.  We still use dma-fence for dependency tracking. But it is just very easily
> > > > > run into deadlock when on-demand paging is in the picture. The annotation helps
> > > > > us to detect deadlock but not solve deadlock problems. Seems *not* a complete
> > > > > solution: It is almost impossible to completely avoid dependency deadlock in
> > > > > complex runtime environment
> > > > > No one can wait on LR fence, so it is impossible to deadlock. The
> > > > > annotations enforce this. Literally this is only for flow controling the
> > > > > ring / hold pending jobs in in the DRM schedule list.
> > > > > 
> > > > > > 	2) Still use drm scheduler but not use dma-fence for completion signaling
> > > > > and dependency tracking. This way we still get some free functions (reset, err
> > > > > handling ring flow control as Matt said)from drm scheduler, but push the
> > > > > dependency/completion tracking completely to user space using techniques such
> > > > > as user space fence. User space doesn't have chance to wait fence while holding
> > > > > a kernel memory management lock, thus the dma-fence deadlock issue is solved.
> > > > > We use user space fence for syncs.
> > > > > 
> > > > > > 	3) Completely discard drm scheduler and dma-fence for long-run
> > > > > workload. Use user queue/doorbell for super fast submission, directly interact
> > > > > with fw scheduler. Use user fence for completion/dependency tracking.
> > > > > This is a hard no from me, I want 1 submission path in Xe. Either we use
> > > > > the DRM scheduler or we don't.
> > > > > 
> > > > > Matt
> > > > > 
> > > > > > Thanks,
> > > > > > Oak
> > > > > > 
> > > > > > > -----Original Message-----
> > > > > > > From: Christian König <christian.koenig@amd.com>
> > > > > > > Sent: April 5, 2023 3:30 AM
> > > > > > > To: Brost, Matthew <matthew.brost@intel.com>; Zeng, Oak
> > > > > > > <oak.zeng@intel.com>
> > > > > > > Cc: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org;
> > > > > > > robdclark@chromium.org; thomas.hellstrom@linux.intel.com;
> > > > > airlied@linux.ie;
> > > > > > > lina@asahilina.net; boris.brezillon@collabora.com;
> > > > > faith.ekstrand@collabora.com
> > > > > > > Subject: Re: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> > > > > > > plans
> > > > > > > 
> > > > > > > Am 04.04.23 um 20:08 schrieb Matthew Brost:
> > > > > > > > On Tue, Apr 04, 2023 at 12:02:03PM -0600, Zeng, Oak wrote:
> > > > > > > > > Hi Matt, Thomas,
> > > > > > > > > 
> > > > > > > > > Some very bold out of box thinking in this area:
> > > > > > > > > 
> > > > > > > > > 1. so you want to use drm scheduler and dma-fence for long running
> > > > > workload.
> > > > > > > Why you want to do this in the first place? What is the benefit? Drm scheduler
> > > > > is
> > > > > > > pretty much a software scheduler. Modern gpu has scheduler built at fw/hw
> > > > > > > level, as you said below for intel this is Guc. Can xe driver just directly submit
> > > > > job
> > > > > > > to Guc, bypassing drm scheduler?
> > > > > > > > If we did that now we have 2 paths for dependency track, flow controling
> > > > > > > > the ring, resets / error handling / backend submission implementations.
> > > > > > > > We don't want this.
> > > > > > > Well exactly that's the point: Why?
> > > > > > > 
> > > > > > > As far as I can see that are two completely distinct use cases, so you
> > > > > > > absolutely do want two completely distinct implementations for this.
> > > > > > > 
> > > > > > > > > 2. using dma-fence for long run workload: I am well aware that page fault
> > > > > (and
> > > > > > > the consequent memory allocation/lock acquiring to fix the fault) can cause
> > > > > > > deadlock for a dma-fence wait. But I am not convinced that dma-fence can't
> > > > > be
> > > > > > > used purely because the nature of the workload that it runs very long
> > > > > (indefinite).
> > > > > > > I did a math: the dma_fence_wait_timeout function's third param is the
> > > > > timeout
> > > > > > > which is a signed long type. If HZ is 1000, this is about 23 days. If 23 days is not
> > > > > long
> > > > > > > enough, can we just change the timeout parameter to signed 64 bits so it is
> > > > > much
> > > > > > > longer than our life time...
> > > > > > > > > So I mainly argue we can't use dma-fence for long-run workload is not
> > > > > > > because the workload runs very long, rather because of the fact that we use
> > > > > > > page fault for long-run workload. If we enable page fault for short-run
> > > > > workload,
> > > > > > > we can't use dma-fence either. Page fault is the key thing here.
> > > > > > > > > Now since we use page fault which is *fundamentally* controversial with
> > > > > > > dma-fence design, why now just introduce a independent concept such as
> > > > > user-
> > > > > > > fence instead of extending existing dma-fence?
> > > > > > > > > I like unified design. If drm scheduler, dma-fence can be extended to work
> > > > > for
> > > > > > > everything, it is beautiful. But seems we have some fundamental problem
> > > > > here.
> > > > > > > > Thomas's patches turn a dma-fence into KMD sync point (e.g. we just use
> > > > > > > > the signal / CB infrastructure) and enforce we don't use use these
> > > > > > > > dma-fences from the scheduler in memory reclaim paths or export these to
> > > > > > > > user space or other drivers. Think of this mode as SW only fence.
> > > > > > > Yeah and I truly think this is an really bad idea.
> > > > > > > 
> > > > > > > The signal/CB infrastructure in the dma_fence turned out to be the
> > > > > > > absolutely nightmare I initially predicted. Sorry to say that, but in
> > > > > > > this case the "I've told you so" is appropriate in my opinion.
> > > > > > > 
> > > > > > > If we need infrastructure for long running dependency tracking we should
> > > > > > > encapsulate that in a new framework and not try to mangle the existing
> > > > > > > code for something it was never intended for.
> > > > > > > 
> > > > > > > Christian.
> > > > > > > 
> > > > > > > > Matt
> > > > > > > > 
> > > > > > > > > Thanks,
> > > > > > > > > Oak
> > > > > > > > > 
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
> > > > > > > > > > Matthew Brost
> > > > > > > > > > Sent: April 3, 2023 8:22 PM
> > > > > > > > > > To: dri-devel@lists.freedesktop.org; intel-xe@lists.freedesktop.org
> > > > > > > > > > Cc: robdclark@chromium.org; thomas.hellstrom@linux.intel.com;
> > > > > > > airlied@linux.ie;
> > > > > > > > > > lina@asahilina.net; boris.brezillon@collabora.com; Brost, Matthew
> > > > > > > > > > <matthew.brost@intel.com>; christian.koenig@amd.com;
> > > > > > > > > > faith.ekstrand@collabora.com
> > > > > > > > > > Subject: [RFC PATCH 00/10] Xe DRM scheduler and long running workload
> > > > > > > plans
> > > > > > > > > > Hello,
> > > > > > > > > > 
> > > > > > > > > > As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> > > > > > > > > > have been asked to merge our common DRM scheduler patches first as
> > > > > well
> > > > > > > > > > as develop a common solution for long running workloads with the DRM
> > > > > > > > > > scheduler. This RFC series is our first attempt at doing this. We
> > > > > > > > > > welcome any and all feedback.
> > > > > > > > > > 
> > > > > > > > > > This can we thought of as 4 parts detailed below.
> > > > > > > > > > 
> > > > > > > > > > - DRM scheduler changes for 1 to 1 relationship between scheduler and
> > > > > > > > > > entity (patches 1-3)
> > > > > > > > > > 
> > > > > > > > > > In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> > > > > > > > > > GuC) which is a new paradigm WRT to the DRM scheduler and presents
> > > > > > > > > > severals problems as the DRM was originally designed to schedule jobs
> > > > > on
> > > > > > > > > > hardware queues. The main problem being that DRM scheduler expects
> > > > > the
> > > > > > > > > > submission order of jobs to be the completion order of jobs even across
> > > > > > > > > > multiple entities. This assumption falls apart with a firmware scheduler
> > > > > > > > > > as a firmware scheduler has no concept of jobs and jobs can complete
> > > > > out
> > > > > > > > > > of order. A novel solution for was originally thought of by Faith during
> > > > > > > > > > the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> > > > > > > > > > and entity. I believe the AGX driver [3] is using this approach and
> > > > > > > > > > Boris may use approach as well for the Mali driver [4].
> > > > > > > > > > 
> > > > > > > > > > To support a 1 to 1 relationship we move the main execution function
> > > > > > > > > > from a kthread to a work queue and add a new scheduling mode which
> > > > > > > > > > bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> > > > > > > > > > The new scheduling mode should unify all drivers usage with a 1 to 1
> > > > > > > > > > relationship and can be thought of as using scheduler as a dependency /
> > > > > > > > > > infligt job tracker rather than a true scheduler.
> > > > > > > > > > 
> > > > > > > > > > - Generic messaging interface for DRM scheduler
> > > > > > > > > > 
> > > > > > > > > > Idea is to be able to communicate to the submission backend with in
> > > > > band
> > > > > > > > > > (relative to main execution function) messages. Messages are backend
> > > > > > > > > > defined and flexable enough for any use case. In Xe we use these
> > > > > > > > > > messages to clean up entites, set properties for entites, and suspend /
> > > > > > > > > > resume execution of an entity [5]. I suspect other driver can leverage
> > > > > > > > > > this messaging concept too as it a convenient way to avoid races in the
> > > > > > > > > > backend.
> > > > > > > > > > 
> > > > > > > > > > - Support for using TDR for all error paths of a scheduler / entity
> > > > > > > > > > 
> > > > > > > > > > Fix a few races / bugs, add function to dynamically set the TDR timeout.
> > > > > > > > > > 
> > > > > > > > > > - Annotate dma-fences for long running workloads.
> > > > > > > > > > 
> > > > > > > > > > The idea here is to use dma-fences only as sync points within the
> > > > > > > > > > scheduler and never export them for long running workloads. By
> > > > > > > > > > annotating these fences as long running we ensure that these dma-
> > > > > fences
> > > > > > > > > > are never used in a way that breaks the dma-fence rules. A benefit of
> > > > > > > > > > thus approach is the scheduler can still safely flow control the
> > > > > > > > > > execution ring buffer via the job limit without breaking the dma-fence
> > > > > > > > > > rules.
> > > > > > > > > > 
> > > > > > > > > > Again this a first draft and looking forward to feedback.
> > > > > > > > > > 
> > > > > > > > > > Enjoy - Matt
> > > > > > > > > > 
> > > > > > > > > > [1] https://gitlab.freedesktop.org/drm/xe/kernel
> > > > > > > > > > [2] https://patchwork.freedesktop.org/series/112188/
> > > > > > > > > > [3] https://patchwork.freedesktop.org/series/114772/
> > > > > > > > > > [4]
> > > > > https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> > > > > > > > > > [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-
> > > > > > > > > > next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> > > > > > > > > > 
> > > > > > > > > > Matthew Brost (8):
> > > > > > > > > >      drm/sched: Convert drm scheduler to use a work queue rather than
> > > > > > > > > >        kthread
> > > > > > > > > >      drm/sched: Move schedule policy to scheduler / entity
> > > > > > > > > >      drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling
> > > > > policy
> > > > > > > > > >      drm/sched: Add generic scheduler message interface
> > > > > > > > > >      drm/sched: Start run wq before TDR in drm_sched_start
> > > > > > > > > >      drm/sched: Submit job before starting TDR
> > > > > > > > > >      drm/sched: Add helper to set TDR timeout
> > > > > > > > > >      drm/syncobj: Warn on long running dma-fences
> > > > > > > > > > 
> > > > > > > > > > Thomas Hellström (2):
> > > > > > > > > >      dma-buf/dma-fence: Introduce long-running completion fences
> > > > > > > > > >      drm/sched: Support long-running sched entities
> > > > > > > > > > 
> > > > > > > > > >     drivers/dma-buf/dma-fence.c                 | 142 +++++++---
> > > > > > > > > >     drivers/dma-buf/dma-resv.c                  |   5 +
> > > > > > > > > >     drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> > > > > > > > > >     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
> > > > > > > > > >     drivers/gpu/drm/drm_syncobj.c               |   5 +-
> > > > > > > > > >     drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
> > > > > > > > > >     drivers/gpu/drm/lima/lima_sched.c           |   5 +-
> > > > > > > > > >     drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
> > > > > > > > > >     drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
> > > > > > > > > >     drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
> > > > > > > > > >     drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
> > > > > > > > > >     drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
> > > > > > > > > >     drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++--
> > > > > ---
> > > > > > > > > >     drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
> > > > > > > > > >     include/drm/gpu_scheduler.h                 | 130 +++++++--
> > > > > > > > > >     include/linux/dma-fence.h                   |  60 ++++-
> > > > > > > > > >     16 files changed, 649 insertions(+), 184 deletions(-)
> > > > > > > > > > 
> > > > > > > > > > --
> > > > > > > > > > 2.34.1
>
Liviu Dudau April 18, 2023, 3:10 p.m. UTC | #40
On Mon, Apr 03, 2023 at 05:22:01PM -0700, Matthew Brost wrote:
> Hello,

Hello,

Jumping a bit late on this thread as I was waiting on some approvals and then
holidays kicked in, but I would like to (re)introduce myself and the people
I work with and to let you know that we are interested in the changes proposed
here and we would like to help if we can.

I currently maintain a number of Arm Mali Display drivers but in recent times
I have moved to the Mali GPU team and now we've got approval to start making
contributions to the upstream driver(s). We're planning to collaborate on
Boris' new Mali driver and make it work well on Mali GPUs. One of the first
things to look at (besides bringing the driver up on internal dev platforms)
are the scheduler changes proposed here.

As such, I would like to ask that people start including John Reitan,
Ketil Johnsen and me on patches. As soon as we have something working and we
can make comments on, we will do so.

Best regards,
Liviu


> 
> As a prerequisite to merging the new Intel Xe DRM driver [1] [2], we
> have been asked to merge our common DRM scheduler patches first as well
> as develop a common solution for long running workloads with the DRM
> scheduler. This RFC series is our first attempt at doing this. We
> welcome any and all feedback.
> 
> This can we thought of as 4 parts detailed below.
> 
> - DRM scheduler changes for 1 to 1 relationship between scheduler and
> entity (patches 1-3)
> 
> In Xe all of the scheduling of jobs is done by a firmware scheduler (the
> GuC) which is a new paradigm WRT to the DRM scheduler and presents
> severals problems as the DRM was originally designed to schedule jobs on
> hardware queues. The main problem being that DRM scheduler expects the
> submission order of jobs to be the completion order of jobs even across
> multiple entities. This assumption falls apart with a firmware scheduler
> as a firmware scheduler has no concept of jobs and jobs can complete out
> of order. A novel solution for was originally thought of by Faith during
> the initial prototype of Xe, create a 1 to 1 relationship between scheduler
> and entity. I believe the AGX driver [3] is using this approach and
> Boris may use approach as well for the Mali driver [4].
> 
> To support a 1 to 1 relationship we move the main execution function
> from a kthread to a work queue and add a new scheduling mode which
> bypasses code in the DRM which isn't needed in a 1 to 1 relationship.
> The new scheduling mode should unify all drivers usage with a 1 to 1
> relationship and can be thought of as using scheduler as a dependency /
> infligt job tracker rather than a true scheduler.
> 
> - Generic messaging interface for DRM scheduler
> 
> Idea is to be able to communicate to the submission backend with in band
> (relative to main execution function) messages. Messages are backend
> defined and flexable enough for any use case. In Xe we use these
> messages to clean up entites, set properties for entites, and suspend /
> resume execution of an entity [5]. I suspect other driver can leverage
> this messaging concept too as it a convenient way to avoid races in the
> backend.
> 
> - Support for using TDR for all error paths of a scheduler / entity
> 
> Fix a few races / bugs, add function to dynamically set the TDR timeout.
> 
> - Annotate dma-fences for long running workloads.
> 
> The idea here is to use dma-fences only as sync points within the
> scheduler and never export them for long running workloads. By
> annotating these fences as long running we ensure that these dma-fences
> are never used in a way that breaks the dma-fence rules. A benefit of
> thus approach is the scheduler can still safely flow control the
> execution ring buffer via the job limit without breaking the dma-fence
> rules.
> 
> Again this a first draft and looking forward to feedback.
> 
> Enjoy - Matt
> 
> [1] https://gitlab.freedesktop.org/drm/xe/kernel
> [2] https://patchwork.freedesktop.org/series/112188/ 
> [3] https://patchwork.freedesktop.org/series/114772/
> [4] https://patchwork.freedesktop.org/patch/515854/?series=112188&rev=1
> [5] https://gitlab.freedesktop.org/drm/xe/kernel/-/blob/drm-xe-next/drivers/gpu/drm/xe/xe_guc_submit.c#L1031
> 
> Matthew Brost (8):
>   drm/sched: Convert drm scheduler to use a work queue rather than
>     kthread
>   drm/sched: Move schedule policy to scheduler / entity
>   drm/sched: Add DRM_SCHED_POLICY_SINGLE_ENTITY scheduling policy
>   drm/sched: Add generic scheduler message interface
>   drm/sched: Start run wq before TDR in drm_sched_start
>   drm/sched: Submit job before starting TDR
>   drm/sched: Add helper to set TDR timeout
>   drm/syncobj: Warn on long running dma-fences
> 
> Thomas Hellström (2):
>   dma-buf/dma-fence: Introduce long-running completion fences
>   drm/sched: Support long-running sched entities
> 
>  drivers/dma-buf/dma-fence.c                 | 142 +++++++---
>  drivers/dma-buf/dma-resv.c                  |   5 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  15 +-
>  drivers/gpu/drm/drm_syncobj.c               |   5 +-
>  drivers/gpu/drm/etnaviv/etnaviv_sched.c     |   5 +-
>  drivers/gpu/drm/lima/lima_sched.c           |   5 +-
>  drivers/gpu/drm/msm/adreno/adreno_device.c  |   6 +-
>  drivers/gpu/drm/msm/msm_ringbuffer.c        |   5 +-
>  drivers/gpu/drm/panfrost/panfrost_job.c     |   5 +-
>  drivers/gpu/drm/scheduler/sched_entity.c    | 127 +++++++--
>  drivers/gpu/drm/scheduler/sched_fence.c     |   6 +-
>  drivers/gpu/drm/scheduler/sched_main.c      | 278 +++++++++++++++-----
>  drivers/gpu/drm/v3d/v3d_sched.c             |  25 +-
>  include/drm/gpu_scheduler.h                 | 130 +++++++--
>  include/linux/dma-fence.h                   |  60 ++++-
>  16 files changed, 649 insertions(+), 184 deletions(-)
> 
> -- 
> 2.34.1
>