
[1/2] drm/scheduler: improve GPU scheduler documentation v2

Message ID 20231116141547.206695-1-christian.koenig@amd.com (mailing list archive)
State New, archived
Series [1/2] drm/scheduler: improve GPU scheduler documentation v2

Commit Message

Christian König Nov. 16, 2023, 2:15 p.m. UTC
Start to improve the scheduler document. Especially document the
lifetime of each of the objects as well as the restrictions around
DMA-fence handling and userspace compatibility.

v2: Some improvements suggested by Danilo, add section about error
    handling.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 Documentation/gpu/drm-mm.rst           |  36 +++++
 drivers/gpu/drm/scheduler/sched_main.c | 174 +++++++++++++++++++++----
 2 files changed, 188 insertions(+), 22 deletions(-)

Comments

Alex Deucher Nov. 16, 2023, 6:44 p.m. UTC | #1
On Thu, Nov 16, 2023 at 12:52 PM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Start to improve the scheduler document. Especially document the
> lifetime of each of the objects as well as the restrictions around
> DMA-fence handling and userspace compatibility.

A few minor grammatical suggestions below.

>
> v2: Some improvements suggested by Danilo, add section about error
>     handling.
>
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>  Documentation/gpu/drm-mm.rst           |  36 +++++
>  drivers/gpu/drm/scheduler/sched_main.c | 174 +++++++++++++++++++++----
>  2 files changed, 188 insertions(+), 22 deletions(-)
>
> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
> index acc5901ac840..112463fa9f3a 100644
> --- a/Documentation/gpu/drm-mm.rst
> +++ b/Documentation/gpu/drm-mm.rst
> @@ -552,12 +552,48 @@ Overview
>  .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>     :doc: Overview
>
> +Job Object
> +----------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Job Object
> +
> +Entity Object
> +-------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Entity Object
> +
> +Hardware Fence Object
> +---------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Hardware Fence Object
> +
> +Scheduler Fence Object
> +----------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Scheduler Fence Object
> +
> +Scheduler and Run Queue Objects
> +-------------------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Scheduler and Run Queue Objects
> +
>  Flow Control
>  ------------
>
>  .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>     :doc: Flow Control
>
> +Error and Timeout handling
> +--------------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Error and Timeout handling
> +
>  Scheduler Function References
>  -----------------------------
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 044a8c4875ba..026123497b0e 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -24,28 +24,122 @@
>  /**
>   * DOC: Overview
>   *
> - * The GPU scheduler provides entities which allow userspace to push jobs
> - * into software queues which are then scheduled on a hardware run queue.
> - * The software queues have a priority among them. The scheduler selects the entities
> - * from the run queue using a FIFO. The scheduler provides dependency handling
> - * features among jobs. The driver is supposed to provide callback functions for
> - * backend operations to the scheduler like submitting a job to hardware run queue,
> - * returning the dependencies of a job etc.
> - *
> - * The organisation of the scheduler is the following:
> - *
> - * 1. Each hw run queue has one scheduler
> - * 2. Each scheduler has multiple run queues with different priorities
> - *    (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
> - * 3. Each scheduler run queue has a queue of entities to schedule
> - * 4. Entities themselves maintain a queue of jobs that will be scheduled on
> - *    the hardware.
> - *
> - * The jobs in a entity are always scheduled in the order that they were pushed.
> - *
> - * Note that once a job was taken from the entities queue and pushed to the
> - * hardware, i.e. the pending queue, the entity must not be referenced anymore
> - * through the jobs entity pointer.
> + * The GPU scheduler implements some logic to decide which command submission

s/some//

> + * to push next to the hardware. Another major use case of the GPU scheduler

s/use case/feature/

> + * is to enforce correct driver behavior around those command submissions.
> + * Because of this it's also used by drivers which don't need the actual

s/it's/it can also/

> + * scheduling functionality.
> + *
> + * All callbacks the driver needs to implement are restricted by DMA-fence
> + * signaling rules to guarantee deadlock free forward progress. This especially

s/especially//

> + * means that for normal operation no memory can be allocated in a callback.
> + * All memory which is needed for pushing the job to the hardware must be
> + * allocated before arming a job. It also means that no locks can be taken

s/also//

> + * under which memory might be allocated as well.
> + *
> + * Memory which is optional to allocate, for example for device core dumping or
> + * debugging, *must* be allocated with GFP_NOWAIT and appropriate error
> + * handling taking if that allocation fails. GFP_ATOMIC should only be used if

s/taking/taken/

> + * absolutely necessary since dipping into the special atomic reserves is
> + * usually not justified for a GPU driver.
> + */
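
Might be worth illustrating the GFP_NOWAIT rule with a tiny sketch here.
Something like this (my_job, my_fill_coredump() and the coredump field are
just placeholder names, not from this patch):

/* Optional debug data captured on a fence signalling path: GFP_NOWAIT only,
 * and an allocation failure is simply tolerated.  Needs <linux/slab.h>.
 */
static void my_capture_coredump(struct my_job *job, size_t size)
{
	void *buf = kzalloc(size, GFP_NOWAIT);

	if (!buf)
		return;	/* the dump is optional, just skip it */

	my_fill_coredump(job, buf, size);
	job->coredump = buf;
}
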
> +
> +/**
> + * DOC: Job Object
> + *
> + * The base job object contains submission dependencies in the form of DMA-fence
> + * objects. Drivers can also implement an optional prepare_job callback which
> + * returns additional dependencies as DMA-fence objects. It's important to note

s/it's/it is/

> + * that this callback can't allocate memory or grab locks under which memory is
> + * allocated.
> + *
> + * Drivers should use this as base class for an object which contains the

s/this as/this object as/

> + * necessary state to push the command submission to the hardware.
> + *
> + * The lifetime of the job object should at least be from pushing it into the

s/should at least be/should span/

> + * scheduler until the scheduler notes through the free callback that a job
> + * isn't needed any more. Drivers can of course keep their job object alive
> + * longer than that, but that's outside of the scope of the scheduler
> + * component. Job initialization is split into two parts, drm_sched_job_init()
> + * and drm_sched_job_arm(). It's important to note that after arming a job

Describe why we have job_init() and job_arm().

> + * drivers must follow the DMA-fence rules and can't easily allocate memory
> + * or takes locks under which memory is allocated.

s/takes/take/

> + */
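
To illustrate the init()/arm() split it might also help to sketch a typical
submission path (my_job, entity, owner and the error label are placeholders,
signatures as of this series, error handling trimmed):

	struct my_job *job;
	int ret;

	job = kzalloc(sizeof(*job), GFP_KERNEL);	/* allowed, nothing armed yet */
	if (!job)
		return -ENOMEM;

	ret = drm_sched_job_init(&job->base, entity, owner);
	if (ret)
		goto err_free_job;

	/* allocate and prepare everything run_job will need *before* arming */

	drm_sched_job_arm(&job->base);	/* DMA-fence rules apply from here on */
	drm_sched_entity_push_job(&job->base);
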
> +
> +/**
> + * DOC: Entity Object
> + *
> + * The entity object which is a container for jobs which should execute

s/which//

> + * sequentially. Drivers should create an entity for each individual context
> + * they maintain for command submissions which can run in parallel.
> + *
> + * The lifetime of the entity should *not* exceed the lifetime of the
> + * userspace process it was created for and drivers should call the
> + * drm_sched_entity_flush() function from their file_operations.flush
> + * callback. So it's possible that an entity object is not alive any
> + * more while jobs from it are still running on the hardware.

s/So it's possible that an entity object is not alive any more/It is
possible that an entity may be freed/

> + *
> + * Background is that for compatibility reasons with existing

s/Background is that for/For/

> + * userspace all results of a command submission should become visible
> + * externally even after after a process exits. This is normal POSIX behavior

s/after after/after/

> + * for I/O operations.
> + *
> + * The problem with this approach is that GPU submissions contain executable
> + * shaders enabling processes to evade their termination by offloading work to
> + * the GPU. So when a process is terminated with a SIGKILL the entity object

s/The problem with this approach is that GPU submissions contain
executable shaders enabling processes to evade their termination by
offloading work to the GPU/
GPU submissions contain work executing on the GPU which leaves an
aspect of the process behind even after process termination/

> + * makes sure that jobs are freed without running them while still maintaining
> + * correct sequential order for signaling fences.
> + */
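
A small sketch of the file_operations.flush wiring could help here too
(my_file_priv and the timeout are made-up example values):

static int my_driver_flush(struct file *file, fl_owner_t id)
{
	struct drm_file *drm_file = file->private_data;
	struct my_file_priv *fpriv = drm_file->driver_priv;

	/* wait (with a timeout) for the entity's queued jobs to drain */
	drm_sched_entity_flush(&fpriv->entity, msecs_to_jiffies(1000));
	return 0;
}
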
> +
> +/**
> + * DOC: Hardware Fence Object
> + *
> + * The hardware fence object is a DMA-fence provided by the driver as result of
> + * running jobs. Drivers need to make sure that the normal DMA-fence semantics
> + * are followed for this object. It's important to note that the memory for

s/It's/It is/

> + * this object can *not* be allocated in the run_job callback since that would
> + * violate the requirements for the DMA-fence implementation. The scheduler
> + * maintains a timeout handler which triggers if this fence doesn't signal in
> + * a configurable time frame.
> + *
> + * The lifetime of this object follows DMA-fence ref-counting rules, the
> + * scheduler takes ownership of the reference returned by the driver and drops
> + * it when it's not needed any more.
> + */
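
Maybe make the "no allocation in run_job" point concrete with a sketch
(my_job, my_ring_submit() and the pre-allocated hw_fence field are
illustrative only):

static struct dma_fence *my_run_job(struct drm_sched_job *sched_job)
{
	struct my_job *job = container_of(sched_job, struct my_job, base);

	/* job->hw_fence was allocated and initialised before arming the job,
	 * nothing is allocated here.
	 */
	my_ring_submit(job);

	/* the scheduler takes over this reference */
	return dma_fence_get(job->hw_fence);
}
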
> +
> +/**
> + * DOC: Scheduler Fence Object
> + *
> + * The scheduler fence object which encapsulates the whole time from pushing

s/which//

> + * the job into the scheduler until the hardware has finished processing it.
> + * This is internally managed by the scheduler, but drivers can grab additional
> + * reference to it after arming a job. The implementation provides DMA-fence
> + * interfaces for signaling both scheduling of a command submission as well as
> + * finishing of processing.
> + *
> + * The lifetime of this object also follows normal DMA-fence ref-counting
> + * rules. The finished fence is the one normally exposed outside of the
> + * scheduler, but the driver can grab references to both the scheduled as well
> + * as the finished fence when needed for pipe-lining optimizations.

s/pipe-lining/pipelining/

> + */
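
For the pipelining case a short example of grabbing the finished fence after
arming might help (continuing the hypothetical my_job from above):

	drm_sched_job_arm(&job->base);

	/* s_fence is valid after arming; keep the finished fence as out-fence */
	out_fence = dma_fence_get(&job->base.s_fence->finished);

	drm_sched_entity_push_job(&job->base);
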
> +
> +/**
> + * DOC: Scheduler and Run Queue Objects
> + *
> + * The scheduler object itself does the actual work of selecting a job and

s/itself//

> + * pushing it to the hardware. Both FIFO and RR selection algorithm are

s/FIFO and RR selection algorithm/FIFO and Round Robin (RR) selection
algorithms/

> + * supported, but FIFO is preferred for many use cases.

s/is preferred for many use cases/is the default and preferred for
most use cases/

> + *
> + * The lifetime of the scheduler is managed by the driver using it. Before
> + * destroying the scheduler the driver must ensure that all hardware processing

s/scheduler the/scheduler, the/

> + * involving this scheduler object has finished by calling for example
> + * disable_irq(). It is *not* sufficient to wait for the hardware fence here
> + * since this doesn't guarantee that all callback processing has finished.

I'm not quite sure what you are trying to say here.  I think the idea
is that the driver must wait for all signalling mechanisms (IRQ
handlers, callbacks, etc.) that might need access to the scheduler
object.
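
Maybe spell out one possible ordering as an illustration (the irq and
workqueue names are placeholders; whether waiting like this is sufficient in
all cases probably needs to be documented as well):

	/* 1. stop the interrupt that signals the hardware fences */
	disable_irq(my_dev->sched_irq);
	/* 2. make sure all pending fence callbacks and driver work have run */
	flush_workqueue(my_dev->fence_wq);
	/* 3. only now tear down the scheduler */
	drm_sched_fini(&my_dev->sched);
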

> + *
> + * The run queue object is a container of entities for a certain priority
> + * level. This object is internally managed by the scheduler and drivers

s/This object is/These objects are/

> + * shouldn't touch them directly. The lifetime of run queues are bound to the
> + * schedulers lifetime.
>   */
>
>  /**
> @@ -72,6 +166,42 @@
>   * limit.
>   */
>
> +/**
> + * DOC: Error and Timeout handling
> + *
> + * Errors schould be signaled by using dma_fence_set_error() on the hardware

s/schould/should/

> + * fence object before signaling it. Errors are then bubbled up from the
> + * hardware fence to the scheduler fence.
> + *
> + * The entity allows querying errors on the last run submission using the
> + * drm_sched_entity_error() function which can be used to cancel queued
> + * submissions in the run_job callback as well as preventing pushing further

s/further/additional/

> + * ones into the entity in the drivers submission function.

s/ones/jobs/
s/drivers/driver's/

> + *
> + * When the hardware fence fails to signal in a configurable amount of time the
> + * timedout_job callback is issued. The driver should then follow the procedure
> + * described on the &struct drm_sched_backend_ops.timedout_job callback (TODO:
> + * The timeout handler should probably switch to using the hardware fence as
> + * parameter instead of the job. Otherwise the handling will always race
> + * between timing out and signaling the fence).
> + *
> + * The scheduler also used to provided functionality for re-submitting jobs

s/The scheduler also used to provided/NOTE: The scheduler used to provide/

> + * with replacing the hardware fence during reset handling. This functionality

s/with replacing/while replacing/

> + * is now marked as deprecated. This has proven to be fundamentally racy and
> + * not compatible with DMA-fence rules and shouldn't be used in any new code.
> + *
> + * Additional there is the drm_sched_increase_karma() function which tries to

s/Additional/Additionally, /

> + * find the entity which submitted a job and increases it's 'karma'

s/it's/its/

> + * atomic variable to prevent re-submitting jobs from this entity. This has
> + * quite some overhead and re-submitting jobs is now marked as deprecated. So
> + * using this function is rather discouraged.

s/rather//

> + *
> + * Drivers can still re-create the GPU state should it be lost during timeout
> + * handling when they can guarantee that forward progress is made and this
> + * doesn't cause another timeout. But this is strongly hardware specific and

s/doesn't/will not/

> + * out of the scope of the general GPU scheduler.
> + */
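
A couple of short examples could make this section easier to apply (job and
fpriv are placeholder names):

	/* driver error path: record the error before signalling the HW fence */
	dma_fence_set_error(job->hw_fence, -EIO);
	dma_fence_signal(job->hw_fence);

	/* driver submission path: refuse new jobs on a broken entity */
	ret = drm_sched_entity_error(&fpriv->entity);
	if (ret)
		return ret;
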
> +
>  #include <linux/wait.h>
>  #include <linux/sched.h>
>  #include <linux/completion.h>
> --
> 2.34.1
>
Danilo Krummrich Nov. 16, 2023, 10:23 p.m. UTC | #2
Hi Christian,

Thanks for sending an update of this patch!

On Thu, Nov 16, 2023 at 03:15:46PM +0100, Christian König wrote:
> Start to improve the scheduler document. Especially document the
> lifetime of each of the objects as well as the restrictions around
> DMA-fence handling and userspace compatibility.
> 
> v2: Some improvements suggested by Danilo, add section about error
>     handling.
> 
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>  Documentation/gpu/drm-mm.rst           |  36 +++++
>  drivers/gpu/drm/scheduler/sched_main.c | 174 +++++++++++++++++++++----
>  2 files changed, 188 insertions(+), 22 deletions(-)
> 
> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
> index acc5901ac840..112463fa9f3a 100644
> --- a/Documentation/gpu/drm-mm.rst
> +++ b/Documentation/gpu/drm-mm.rst
> @@ -552,12 +552,48 @@ Overview
>  .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>     :doc: Overview
>  
> +Job Object
> +----------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Job Object
> +
> +Entity Object
> +-------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Entity Object
> +
> +Hardware Fence Object
> +---------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Hardware Fence Object
> +
> +Scheduler Fence Object
> +----------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Scheduler Fence Object
> +
> +Scheduler and Run Queue Objects
> +-------------------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Scheduler and Run Queue Objects
> +
>  Flow Control
>  ------------
>  
>  .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>     :doc: Flow Control
>  
> +Error and Timeout handling
> +--------------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Error and Timeout handling
> +
>  Scheduler Function References
>  -----------------------------
>  
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 044a8c4875ba..026123497b0e 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -24,28 +24,122 @@
>  /**
>   * DOC: Overview
>   *
> - * The GPU scheduler provides entities which allow userspace to push jobs
> - * into software queues which are then scheduled on a hardware run queue.
> - * The software queues have a priority among them. The scheduler selects the entities
> - * from the run queue using a FIFO. The scheduler provides dependency handling
> - * features among jobs. The driver is supposed to provide callback functions for
> - * backend operations to the scheduler like submitting a job to hardware run queue,
> - * returning the dependencies of a job etc.
> - *
> - * The organisation of the scheduler is the following:
> - *
> - * 1. Each hw run queue has one scheduler
> - * 2. Each scheduler has multiple run queues with different priorities
> - *    (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
> - * 3. Each scheduler run queue has a queue of entities to schedule
> - * 4. Entities themselves maintain a queue of jobs that will be scheduled on
> - *    the hardware.
> - *
> - * The jobs in a entity are always scheduled in the order that they were pushed.
> - *
> - * Note that once a job was taken from the entities queue and pushed to the
> - * hardware, i.e. the pending queue, the entity must not be referenced anymore
> - * through the jobs entity pointer.
> + * The GPU scheduler implements some logic to decide which command submission
> + * to push next to the hardware. Another major use case of the GPU scheduler
> + * is to enforce correct driver behavior around those command submissions.
> + * Because of this it's also used by drivers which don't need the actual
> + * scheduling functionality.

This reads a bit like we're already right in the middle of the documentation.
I'd propose to start with something like "The DRM GPU scheduler serves as a
common component intended to help drivers to manage command submissions." And
then mention the different solutions the scheduler provides, e.g. ordering of
command submissions, dma-fence handling in the context of command submissions,
etc.

Also, I think it would be good to give a rough overview of the topology of the
scheduler's components. And since you already mention that the component is
"also used by drivers which don't need the actual scheduling functionality", I'd
also mention the special case of a single entity and a single run-queue per
scheduler.

> + *
> + * All callbacks the driver needs to implement are restricted by DMA-fence
> + * signaling rules to guarantee deadlock free forward progress. This especially
> + * means that for normal operation no memory can be allocated in a callback.
> + * All memory which is needed for pushing the job to the hardware must be
> + * allocated before arming a job. It also means that no locks can be taken
> + * under which memory might be allocated as well.

I think that's good. Even though, with the recently merged workqueue patches,
drivers can actually create a setup where the free_job callback isn't part of
the fence signalling critical path anymore. But I agree with Sima that this is
probably too error prone to give drivers ideas. So, this paragraph is probably
good as it is. :-)

> + *
> + * Memory which is optional to allocate, for example for device core dumping or
> + * debugging, *must* be allocated with GFP_NOWAIT and appropriate error
> + * handling taking if that allocation fails. GFP_ATOMIC should only be used if
> + * absolutely necessary since dipping into the special atomic reserves is
> + * usually not justified for a GPU driver.
> + */
> +
> +/**
> + * DOC: Job Object
> + *
> + * The base job object contains submission dependencies in the form of DMA-fence
> + * objects. Drivers can also implement an optional prepare_job callback which
> + * returns additional dependencies as DMA-fence objects. It's important to note
> + * that this callback can't allocate memory or grab locks under which memory is
> + * allocated.
> + *
> + * Drivers should use this as base class for an object which contains the
> + * necessary state to push the command submission to the hardware.
> + *
> + * The lifetime of the job object should at least be from pushing it into the

"should at least last from"

> + * scheduler until the scheduler notes through the free callback that a job

What about "until the free_job callback has been called and hence the scheduler
does not require the job object anymore."? 

> + * isn't needed any more. Drivers can of course keep their job object alive
> + * longer than that, but that's outside of the scope of the scheduler
> + * component. Job initialization is split into two parts, drm_sched_job_init()
> + * and drm_sched_job_arm(). It's important to note that after arming a job

I suggest to add a brief comment on why job initialization is split up.

> + * drivers must follow the DMA-fence rules and can't easily allocate memory
> + * or takes locks under which memory is allocated.
> + */
> +
> +/**
> + * DOC: Entity Object
> + *
> + * The entity object which is a container for jobs which should execute
> + * sequentially. Drivers should create an entity for each individual context
> + * they maintain for command submissions which can run in parallel.
> + *
> + * The lifetime of the entity should *not* exceed the lifetime of the
> + * userspace process it was created for and drivers should call the
> + * drm_sched_entity_flush() function from their file_operations.flush
> + * callback. So it's possible that an entity object is not alive any

"Note that it is possible..."

> + * more while jobs from it are still running on the hardware.

"while jobs previously fetched from this entity are still..."

> + *
> + * Background is that for compatibility reasons with existing
> + * userspace all results of a command submission should become visible
> + * externally even after after a process exits. This is normal POSIX behavior
> + * for I/O operations.
> + *
> + * The problem with this approach is that GPU submissions contain executable
> + * shaders enabling processes to evade their termination by offloading work to
> + * the GPU. So when a process is terminated with a SIGKILL the entity object
> + * makes sure that jobs are freed without running them while still maintaining
> + * correct sequential order for signaling fences.
> + */
> +
> +/**
> + * DOC: Hardware Fence Object
> + *
> + * The hardware fence object is a DMA-fence provided by the driver as result of
> + * running jobs. Drivers need to make sure that the normal DMA-fence semantics
> + * are followed for this object. It's important to note that the memory for
> + * this object can *not* be allocated in the run_job callback since that would
> + * violate the requirements for the DMA-fence implementation. The scheduler
> + * maintains a timeout handler which triggers if this fence doesn't signal in
> + * a configurable time frame.
> + *
> + * The lifetime of this object follows DMA-fence ref-counting rules, the
> + * scheduler takes ownership of the reference returned by the driver and drops
> + * it when it's not needed any more.
> + */
> +
> +/**
> + * DOC: Scheduler Fence Object
> + *
> + * The scheduler fence object which encapsulates the whole time from pushing
> + * the job into the scheduler until the hardware has finished processing it.
> + * This is internally managed by the scheduler, but drivers can grab additional
> + * reference to it after arming a job. The implementation provides DMA-fence
> + * interfaces for signaling both scheduling of a command submission as well as
> + * finishing of processing.
> + *
> + * The lifetime of this object also follows normal DMA-fence ref-counting
> + * rules. The finished fence is the one normally exposed outside of the
> + * scheduler, but the driver can grab references to both the scheduled as well
> + * as the finished fence when needed for pipe-lining optimizations.
> + */
> +
> +/**
> + * DOC: Scheduler and Run Queue Objects
> + *
> + * The scheduler object itself does the actual work of selecting a job and
> + * pushing it to the hardware. Both FIFO and RR selection algorithm are
> + * supported, but FIFO is preferred for many use cases.

I suggest to name the use cases FIFO scheduling is preferred for and why.

If, instead, it's just a general recommendation, I also suggest to explain why.

> + *
> + * The lifetime of the scheduler is managed by the driver using it. Before
> + * destroying the scheduler the driver must ensure that all hardware processing
> + * involving this scheduler object has finished by calling for example
> + * disable_irq(). It is *not* sufficient to wait for the hardware fence here
> + * since this doesn't guarantee that all callback processing has finished.

This is the part I'm most concerned about, since I feel like we leave drivers
"up in the air" entirely. Hence, I think here we need to be more verbose and
detailed about the options drivers have to ensure that.

For instance, let's assume we have the single-entity-per-scheduler topology
because the driver only uses the GPU scheduler to feed a firmware scheduler with
dynamically allocated ring buffers.

In this case the entity, scheduler and ring buffer are bound to the lifetime of
a userspace process.

What do we expect the driver to do if the userspace process is killed? As you
mentioned, only waiting for the ring to be idle (which implies all HW fences
are signalled) is not enough. This doesn't guarantee all the free_job()
callbacks have been called yet and hence stopping the scheduler before the
pending_list is actually empty would leak the memory of the jobs on the
pending_list waiting to be freed.

I already brought this up when we were discussing Matt's Xe inspired scheduler
patch series and it seems there was no interest in providing drivers with some
common mechanism that guarantees that the pending_list is empty. Hence, I really
think we should at least give recommendations how drivers should deal with that.

> + *
> + * The run queue object is a container of entities for a certain priority
> + * level. This object is internally managed by the scheduler and drivers
> + * shouldn't touch them directly. The lifetime of run queues are bound to the
> + * schedulers lifetime.

I think we should also mention that we support a variable number of run-queues
up to DRM_SCHED_PRIORITY_COUNT. Also there is this weird restriction on which
priorities a driver can use when choosing less than DRM_SCHED_PRIORITY_COUNT
run-queues.

For instance, initializing the scheduler with a single run-queue requires the
corresponding entities to pick DRM_SCHED_PRIORITY_MIN, otherwise we'll just
fault since the priority is also used as an array index into sched->sched_rq[].
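
For instance (sched and fpriv being placeholder names):

	/* A scheduler initialized with a single run-queue only works with
	 * entities using DRM_SCHED_PRIORITY_MIN, since the priority is used
	 * as the index into sched->sched_rq[].
	 */
	ret = drm_sched_entity_init(&fpriv->entity, DRM_SCHED_PRIORITY_MIN,
				    &sched, 1, NULL);
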

>   */
>  
>  /**
> @@ -72,6 +166,42 @@
>   * limit.
>   */
>  
> +/**
> + * DOC: Error and Timeout handling
> + *
> + * Errors schould be signaled by using dma_fence_set_error() on the hardware
> + * fence object before signaling it. Errors are then bubbled up from the
> + * hardware fence to the scheduler fence.
> + *
> + * The entity allows querying errors on the last run submission using the
> + * drm_sched_entity_error() function which can be used to cancel queued
> + * submissions in the run_job callback as well as preventing pushing further
> + * ones into the entity in the drivers submission function.
> + *
> + * When the hardware fence fails to signal in a configurable amount of time the
> + * timedout_job callback is issued. The driver should then follow the procedure
> + * described on the &struct drm_sched_backend_ops.timedout_job callback (TODO:
> + * The timeout handler should probably switch to using the hardware fence as
> + * parameter instead of the job. Otherwise the handling will always race
> + * between timing out and signaling the fence).
> + *
> + * The scheduler also used to provided functionality for re-submitting jobs
> + * with replacing the hardware fence during reset handling. This functionality
> + * is now marked as deprecated. This has proven to be fundamentally racy and
> + * not compatible with DMA-fence rules and shouldn't be used in any new code.
> + *
> + * Additional there is the drm_sched_increase_karma() function which tries to
 
"Additionally"

> + * find the entity which submitted a job and increases it's 'karma'
> + * atomic variable to prevent re-submitting jobs from this entity. This has
> + * quite some overhead and re-submitting jobs is now marked as deprecated. So
> + * using this function is rather discouraged.
> + *
> + * Drivers can still re-create the GPU state should it be lost during timeout
> + * handling when they can guarantee that forward progress is made and this
> + * doesn't cause another timeout. But this is strongly hardware specific and
> + * out of the scope of the general GPU scheduler.
> + */
> +
>  #include <linux/wait.h>
>  #include <linux/sched.h>
>  #include <linux/completion.h>
> -- 
> 2.34.1
>
Luben Tuikov Nov. 17, 2023, 10:32 p.m. UTC | #3
Hi,

On 2023-11-16 09:15, Christian König wrote:
> Start to improve the scheduler document. Especially document the
> lifetime of each of the objects as well as the restrictions around
> DMA-fence handling and userspace compatibility.
> 
> v2: Some improvements suggested by Danilo, add section about error
>     handling.
> 
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>  Documentation/gpu/drm-mm.rst           |  36 +++++
>  drivers/gpu/drm/scheduler/sched_main.c | 174 +++++++++++++++++++++----
>  2 files changed, 188 insertions(+), 22 deletions(-)
> 
> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
> index acc5901ac840..112463fa9f3a 100644
> --- a/Documentation/gpu/drm-mm.rst
> +++ b/Documentation/gpu/drm-mm.rst
> @@ -552,12 +552,48 @@ Overview
>  .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>     :doc: Overview
>  
> +Job Object
> +----------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Job Object
> +
> +Entity Object
> +-------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Entity Object
> +
> +Hardware Fence Object
> +---------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Hardware Fence Object
> +
> +Scheduler Fence Object
> +----------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Scheduler Fence Object
> +
> +Scheduler and Run Queue Objects
> +-------------------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Scheduler and Run Queue Objects
> +
>  Flow Control
>  ------------
>  
>  .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>     :doc: Flow Control
>  
> +Error and Timeout handling
> +--------------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Error and Timeout handling
> +
>  Scheduler Function References
>  -----------------------------
>  
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 044a8c4875ba..026123497b0e 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -24,28 +24,122 @@
>  /**
>   * DOC: Overview
>   *
> - * The GPU scheduler provides entities which allow userspace to push jobs
> - * into software queues which are then scheduled on a hardware run queue.
> - * The software queues have a priority among them. The scheduler selects the entities
> - * from the run queue using a FIFO. The scheduler provides dependency handling
> - * features among jobs. The driver is supposed to provide callback functions for
> - * backend operations to the scheduler like submitting a job to hardware run queue,
> - * returning the dependencies of a job etc.
> - *
> - * The organisation of the scheduler is the following:
> - *
> - * 1. Each hw run queue has one scheduler
> - * 2. Each scheduler has multiple run queues with different priorities
> - *    (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
> - * 3. Each scheduler run queue has a queue of entities to schedule
> - * 4. Entities themselves maintain a queue of jobs that will be scheduled on
> - *    the hardware.
> - *
> - * The jobs in a entity are always scheduled in the order that they were pushed.
> - *
> - * Note that once a job was taken from the entities queue and pushed to the
> - * hardware, i.e. the pending queue, the entity must not be referenced anymore
> - * through the jobs entity pointer.
> + * The GPU scheduler implements some logic to decide which command submission
> + * to push next to the hardware. Another major use case of the GPU scheduler

You can't start a 2nd sentence with "Another major use ...", not unless you'd been
discussing the other major use for at least a few paragraphs, but not after just
one sentence.

Get rid of "some" in "some logic", and just say "implements logic to ..."

Then 2nd sentence should say, "The GPU scheduler also enforces correct ..."

> + * is to enforce correct driver behavior around those command submissions.
> + * Because of this it's also used by drivers which don't need the actual
> + * scheduling functionality.
> + *
> + * All callbacks the driver needs to implement are restricted by DMA-fence
> + * signaling rules to guarantee deadlock free forward progress. This especially

"deadlock-free"

Link to "DMA-fence signaling rules" would be nice to have. Can't mention them,
and not provide a link. Naturally someone reading this would immediately ask themselves,
"What are the ``DMA-fence signaling rules''?", and if they don't need to ask themselves
this, then they probably mostly know all of this here too.

> + * means that for normal operation no memory can be allocated in a callback.

What callback? Perhaps say, "callback into the driver", or name it/them,
as they're in the code.

> + * All memory which is needed for pushing the job to the hardware must be

"pushing _a_ job"

> + * allocated before arming a job. It also means that no locks can be taken
> + * under which memory might be allocated as well.
> + *
> + * Memory which is optional to allocate, for example for device core dumping or
> + * debugging, *must* be allocated with GFP_NOWAIT and appropriate error
> + * handling taking if that allocation fails. GFP_ATOMIC should only be used if
> + * absolutely necessary since dipping into the special atomic reserves is
> + * usually not justified for a GPU driver.
> + */
> +
> +/**
> + * DOC: Job Object
> + *
> + * The base job object contains submission dependencies in the form of DMA-fence
> + * objects. Drivers can also implement an optional prepare_job callback which

Drop "also".

> + * returns additional dependencies as DMA-fence objects. It's important to note
> + * that this callback can't allocate memory or grab locks under which memory is
> + * allocated.

Probably say "that this callback into the driver cannot allocate" for clarity.

> + *
> + * Drivers should use this as base class for an object which contains the
> + * necessary state to push the command submission to the hardware.

"this"? The context is lost by this sentence and you should say "Drivers should
use the job structure as a base for an object which contains ..."

I'd stay away from using the word "class".

> + *
> + * The lifetime of the job object should at least be from pushing it into the
> + * scheduler until the scheduler notes through the free callback that a job

"the free callback"? Perhaps be more specific here with something like,
"struct drm_sched_backend_jobs::free_job callback".

> + * isn't needed any more. Drivers can of course keep their job object alive
> + * longer than that, but that's outside of the scope of the scheduler
> + * component. Job initialization is split into two parts, drm_sched_job_init()

This "Job initialization is split into ..." needs to start on a new paragraph.

> + * and drm_sched_job_arm(). It's important to note that after arming a job

These two functions need to be listed, numeric or alphabetic, something like,

	Job initialization is split into two parts,
		a) drm_sched_job_init(), which <explain here what it does and
		   why drivers should/would call it and what is expected from them>, and
		b) drm_sched_arm_job(), which <explain here what this does and
		   why drivers should/would call this and what is expected from them>.

> + * drivers must follow the DMA-fence rules and can't easily allocate memory
> + * or takes locks under which memory is allocated.

I wouldn't use contractions for things which are forbidden and would do

s/can't/cannot/g

A link to those "DMA-fence rules" would be helpful.

> + */
> +
> +/**
> + * DOC: Entity Object
> + *
> + * The entity object which is a container for jobs which should execute

Drop "which" globally.
"Container of" not "for".
s/should/generally/

To become this,

	The entity object is a container of jobs which generally execute ...

Still a better way to explain what something _is_, is to start with,

	The entity object represents a context/user process/etc., which generates
	jobs to be executed by the GPU. It holds incoming jobs in its "job_queue".
	Entities are generally allowed to run on one or more schedulers via their
	"sched_list" member.

Something like that.

> + * sequentially. Drivers should create an entity for each individual context
> + * they maintain for command submissions which can run in parallel.

Drop "should" to become,
	Drivers create an entity ...

> + *
> + * The lifetime of the entity should *not* exceed the lifetime of the
> + * userspace process it was created for and drivers should call the
> + * drm_sched_entity_flush() function from their file_operations.flush

"their" here would be ambiguous to a first-time reader. Perhaps
clarify that it is the process' file ops.

> + * callback. So it's possible that an entity object is not alive any
> + * more while jobs from it are still running on the hardware.
> + *
> + * Background is that for compatibility reasons with existing

"Background" --> "The reason for this is for compatibility with existing ..."

> + * userspace all results of a command submission should become visible
> + * externally even after after a process exits. This is normal POSIX behavior

Remove one of the "after".

> + * for I/O operations.
> + *
> + * The problem with this approach is that GPU submissions contain executable
> + * shaders enabling processes to evade their termination by offloading work to
> + * the GPU. So when a process is terminated with a SIGKILL the entity object
> + * makes sure that jobs are freed without running them while still maintaining
> + * correct sequential order for signaling fences.
> + */
> +
> +/**
> + * DOC: Hardware Fence Object
> + *
> + * The hardware fence object is a DMA-fence provided by the driver as result of
> + * running jobs. Drivers need to make sure that the normal DMA-fence semantics
> + * are followed for this object. It's important to note that the memory for
> + * this object can *not* be allocated in the run_job callback since that would
> + * violate the requirements for the DMA-fence implementation. The scheduler
> + * maintains a timeout handler which triggers if this fence doesn't signal in
> + * a configurable time frame.

"frame" --> "period" or "interval".

It would greatly help to suggest where/how the hw fence should be allocated,
and give suggestions to drivers.

> + *
> + * The lifetime of this object follows DMA-fence ref-counting rules, the
> + * scheduler takes ownership of the reference returned by the driver and drops
> + * it when it's not needed any more.

Clarify the callback:
"of the reference returned by the driver in the run_job callback and drops it ..."

"when it's not needed any more" should be clarified since the lifetime is attempted
to be clarified in this paragraph here.

> + */
> +
> +/**
> + * DOC: Scheduler Fence Object

As these all refer to real structs, it would help to name them
in the DOC, as in "struct drm_sched_fence".

> + *
> + * The scheduler fence object which encapsulates the whole time from pushing

See my comments in v1.

Also, "The scheduler fence object, struct drm_sched_fence, ..."
Name it explicitly so that the reader knows exactly which structure it is.

> + * the job into the scheduler until the hardware has finished processing it.
> + * This is internally managed by the scheduler, but drivers can grab additional
> + * reference to it after arming a job. The implementation provides DMA-fence
> + * interfaces for signaling both scheduling of a command submission as well as
> + * finishing of processing.

"for signalling both scheduling of a command" is confusing.

I'd probably list them in an alphabetized list, as in
	The drm_sched_fence structure contains two DMA fences,
	a) "scheduled" which is signalled by ..., when ..., and
	b) "finished" which is signalled by ... when ...

> + *
> + * The lifetime of this object also follows normal DMA-fence ref-counting
> + * rules. The finished fence is the one normally exposed outside of the
> + * scheduler, but the driver can grab references to both the scheduled as well
> + * as the finished fence when needed for pipe-lining optimizations.

"as well as" --> "and"

> + */
> +
> +/**
> + * DOC: Scheduler and Run Queue Objects

Perhaps list their names,
struct drm_gpu_scheduler,
struct drm_sched_rq.

> + *
> + * The scheduler object itself does the actual work of selecting a job and
> + * pushing it to the hardware. Both FIFO and RR selection algorithm are
> + * supported, but FIFO is preferred for many use cases.
> + *
> + * The lifetime of the scheduler is managed by the driver using it. Before
> + * destroying the scheduler the driver must ensure that all hardware processing
> + * involving this scheduler object has finished by calling for example
> + * disable_irq(). It is *not* sufficient to wait for the hardware fence here
> + * since this doesn't guarantee that all callback processing has finished.

Perhaps give an alternative of what *is* sufficient, as to give guidelines
to driver writers.

> + *
> + * The run queue object is a container of entities for a certain priority

"for a certain" --> "of a certain"

> + * level. This object is internally managed by the scheduler and drivers
> + * shouldn't touch them directly. The lifetime of run queues are bound to the

"touch it directly"
"is bound to"

> + * schedulers lifetime.
>   */
>  
>  /**
> @@ -72,6 +166,42 @@
>   * limit.
>   */
>  
> +/**
> + * DOC: Error and Timeout handling
> + *
> + * Errors schould be signaled by using dma_fence_set_error() on the hardware

Run this patch through spell-check.

> + * fence object before signaling it. Errors are then bubbled up from the
> + * hardware fence to the scheduler fence.
> + *
> + * The entity allows querying errors on the last run submission using the
> + * drm_sched_entity_error() function which can be used to cancel queued
> + * submissions in the run_job callback as well as preventing pushing further
> + * ones into the entity in the drivers submission function.
> + *
> + * When the hardware fence fails to signal in a configurable amount of time the
> + * timedout_job callback is issued. The driver should then follow the procedure
> + * described on the &struct drm_sched_backend_ops.timedout_job callback (TODO:
> + * The timeout handler should probably switch to using the hardware fence as
> + * parameter instead of the job. Otherwise the handling will always race
> + * between timing out and signaling the fence).
> + *
> + * The scheduler also used to provided functionality for re-submitting jobs

"scheduler is also used to provide"

> + * with replacing the hardware fence during reset handling. This functionality
> + * is now marked as deprecated. This has proven to be fundamentally racy and
> + * not compatible with DMA-fence rules and shouldn't be used in any new code.
> + *
> + * Additional there is the drm_sched_increase_karma() function which tries to

"In addition," or "Additionally,"

> + * find the entity which submitted a job and increases it's 'karma'
> + * atomic variable to prevent re-submitting jobs from this entity. This has
> + * quite some overhead and re-submitting jobs is now marked as deprecated. So
> + * using this function is rather discouraged.

Perhaps add,
	"This function, as well as this 'karma'-business is slated for removal."

> + *
> + * Drivers can still re-create the GPU state should it be lost during timeout
> + * handling when they can guarantee that forward progress is made and this
> + * doesn't cause another timeout. But this is strongly hardware specific and
> + * out of the scope of the general GPU scheduler.
> + */
> +
>  #include <linux/wait.h>
>  #include <linux/sched.h>
>  #include <linux/completion.h>
Christian König Nov. 28, 2023, 1:51 p.m. UTC | #4
On 16.11.23 at 23:23, Danilo Krummrich wrote:
> [SNIP]
>> + *
>> + * The lifetime of the scheduler is managed by the driver using it. Before
>> + * destroying the scheduler the driver must ensure that all hardware processing
>> + * involving this scheduler object has finished by calling for example
>> + * disable_irq(). It is *not* sufficient to wait for the hardware fence here
>> + * since this doesn't guarantee that all callback processing has finished.
> This is the part I'm most concerned about, since I feel like we leave drivers
> "up in the air" entirely. Hence, I think here we need to be more verbose and
> detailed about the options drivers have to ensure that.
>
> For instance, let's assume we have the single-entity-per-scheduler topology
> because the driver only uses the GPU scheduler to feed a firmware scheduler with
> dynamically allocated ring buffers.
>
> In this case the entity, scheduler and ring buffer are bound to the lifetime of
> a userspace process.
>
> What do we expect the driver to do if the userspace process is killed? As you
> mentioned, only waiting for the ring to be idle (which implies all HW fences
> are signalled) is not enough. This doesn't guarantee all the free_job()
> callbacks have been called yet and hence stopping the scheduler before the
> pending_list is actually empty would leak the memory of the jobs on the
> pending_list waiting to be freed.
>
> I already brought this up when we were discussing Matt's Xe inspired scheduler
> patch series and it seems there was no interest in providing drivers with some
> common mechanism that guarantees that the pending_list is empty. Hence, I really
> think we should at least give recommendations how drivers should deal with that.

I put this work on hold to have time to look deeper into this and to try
to find alternative ways of handling it.

I think this is another good reason why the scheduler should really not 
be involved in freeing jobs, but let's first discuss another issue with 
this.

It goes far down into the underlying dma_fence mechanism which gives you 
guarantees that hardware operations have finished, but not that the 
associated software callbacks are done.

So what happens is that components like the scheduler can't just wait 
for dma_fences to be sure that a registered callback is not executed on 
another CPU.

See this patch here for another example where this totally bites us in 
drivers, completely independent of the GPU scheduler:

commit 7c703a7d3f2b50a6187267420a4d3d7e62fa3206
Author: xinhui pan <xinhui.pan@amd.com>
Date:   Tue Apr 12 19:52:16 2022 +0800

     drm/amdgpu: Fix one use-after-free of VM

Basically the solution amdgpu came up with is to take and drop the 
spinlock of the underlying dma_fence context:

/* Make sure that all fence callbacks have completed */
spin_lock_irqsave(vm->last_tlb_flush->lock, flags);
spin_unlock_irqrestore(vm->last_tlb_flush->lock, flags);

But this is just a hack which only works because amdgpu knows the internals of 
its own dma_fence implementation.

For the scheduler this is not applicable. I've mentioned this problem to 
Daniel before, but at least at that time he thought that this is 
completely a driver problem.

Ideas?

Regards,
Christian.
Danilo Krummrich Feb. 13, 2024, 5:37 p.m. UTC | #5
Hi Christian,

What's the status of this effort? Was there ever a follow-up?

- Danilo

On 11/16/23 23:23, Danilo Krummrich wrote:
> Hi Christian,
> 
> Thanks for sending an update of this patch!
> 
> On Thu, Nov 16, 2023 at 03:15:46PM +0100, Christian König wrote:
>> Start to improve the scheduler document. Especially document the
>> lifetime of each of the objects as well as the restrictions around
>> DMA-fence handling and userspace compatibility.
>>
>> v2: Some improvements suggested by Danilo, add section about error
>>      handling.
>>
>> Signed-off-by: Christian König <christian.koenig@amd.com>
>> ---
>>   Documentation/gpu/drm-mm.rst           |  36 +++++
>>   drivers/gpu/drm/scheduler/sched_main.c | 174 +++++++++++++++++++++----
>>   2 files changed, 188 insertions(+), 22 deletions(-)
>>
>> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
>> index acc5901ac840..112463fa9f3a 100644
>> --- a/Documentation/gpu/drm-mm.rst
>> +++ b/Documentation/gpu/drm-mm.rst
>> @@ -552,12 +552,48 @@ Overview
>>   .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>      :doc: Overview
>>   
>> +Job Object
>> +----------
>> +
>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>> +   :doc: Job Object
>> +
>> +Entity Object
>> +-------------
>> +
>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>> +   :doc: Entity Object
>> +
>> +Hardware Fence Object
>> +---------------------
>> +
>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>> +   :doc: Hardware Fence Object
>> +
>> +Scheduler Fence Object
>> +----------------------
>> +
>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>> +   :doc: Scheduler Fence Object
>> +
>> +Scheduler and Run Queue Objects
>> +-------------------------------
>> +
>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>> +   :doc: Scheduler and Run Queue Objects
>> +
>>   Flow Control
>>   ------------
>>   
>>   .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>      :doc: Flow Control
>>   
>> +Error and Timeout handling
>> +--------------------------
>> +
>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>> +   :doc: Error and Timeout handling
>> +
>>   Scheduler Function References
>>   -----------------------------
>>   
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index 044a8c4875ba..026123497b0e 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -24,28 +24,122 @@
>>   /**
>>    * DOC: Overview
>>    *
>> - * The GPU scheduler provides entities which allow userspace to push jobs
>> - * into software queues which are then scheduled on a hardware run queue.
>> - * The software queues have a priority among them. The scheduler selects the entities
>> - * from the run queue using a FIFO. The scheduler provides dependency handling
>> - * features among jobs. The driver is supposed to provide callback functions for
>> - * backend operations to the scheduler like submitting a job to hardware run queue,
>> - * returning the dependencies of a job etc.
>> - *
>> - * The organisation of the scheduler is the following:
>> - *
>> - * 1. Each hw run queue has one scheduler
>> - * 2. Each scheduler has multiple run queues with different priorities
>> - *    (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
>> - * 3. Each scheduler run queue has a queue of entities to schedule
>> - * 4. Entities themselves maintain a queue of jobs that will be scheduled on
>> - *    the hardware.
>> - *
>> - * The jobs in a entity are always scheduled in the order that they were pushed.
>> - *
>> - * Note that once a job was taken from the entities queue and pushed to the
>> - * hardware, i.e. the pending queue, the entity must not be referenced anymore
>> - * through the jobs entity pointer.
>> + * The GPU scheduler implements some logic to decide which command submission
>> + * to push next to the hardware. Another major use case of the GPU scheduler
>> + * is to enforce correct driver behavior around those command submissions.
>> + * Because of this it's also used by drivers which don't need the actual
>> + * scheduling functionality.
> 
> This reads a bit like we're already right in the middle of the documentation.
> I'd propose to start with something like "The DRM GPU scheduler serves as a
> common component intended to help drivers to manage command submissions." And
> then mention the different solutions the scheduler provides, e.g. ordering of
> command submissions, dma-fence handling in the context of command submissions,
> etc.
> 
> Also, I think it would be good to give a rough overview of the topology of the
> scheduler's components. And since you already mention that the component is
> "also used by drivers which don't need the actual scheduling functionality", I'd
> also mention the special case of a single entity and a single run-queue per
> scheduler.
> 
>> + *
>> + * All callbacks the driver needs to implement are restricted by DMA-fence
>> + * signaling rules to guarantee deadlock free forward progress. This especially
>> + * means that for normal operation no memory can be allocated in a callback.
>> + * All memory which is needed for pushing the job to the hardware must be
>> + * allocated before arming a job. It also means that no locks can be taken
>> + * under which memory might be allocated as well.
> 
> I think that's good. Even though, with the recently merged workqueue patches,
> drivers can actually create a setup where the free_job callback isn't part of
> the fence signalling critical path anymore. But I agree with Sima that this is
> probably too error prone to give drivers ideas. So, this paragraph is probably
> good as it is. :-)
> 
>> + *
>> + * Memory which is optional to allocate, for example for device core dumping or
>> + * debugging, *must* be allocated with GFP_NOWAIT and appropriate error
>> + * handling taking if that allocation fails. GFP_ATOMIC should only be used if
>> + * absolutely necessary since dipping into the special atomic reserves is
>> + * usually not justified for a GPU driver.
>> + */
>> +
>> +/**
>> + * DOC: Job Object
>> + *
>> + * The base job object contains submission dependencies in the form of DMA-fence
>> + * objects. Drivers can also implement an optional prepare_job callback which
>> + * returns additional dependencies as DMA-fence objects. It's important to note
>> + * that this callback can't allocate memory or grab locks under which memory is
>> + * allocated.
>> + *
>> + * Drivers should use this as base class for an object which contains the
>> + * necessary state to push the command submission to the hardware.
>> + *
>> + * The lifetime of the job object should at least be from pushing it into the
> 
> "should at least last from"
> 
>> + * scheduler until the scheduler notes through the free callback that a job
> 
> What about "until the free_job callback has been called and hence the scheduler
> does not require the job object anymore."?
> 
>> + * isn't needed any more. Drivers can of course keep their job object alive
>> + * longer than that, but that's outside of the scope of the scheduler
>> + * component. Job initialization is split into two parts, drm_sched_job_init()
>> + * and drm_sched_job_arm(). It's important to note that after arming a job
> 
> I suggest to add a brief comment on why job initialization is split up.
> 
>> + * drivers must follow the DMA-fence rules and can't easily allocate memory
>> + * or takes locks under which memory is allocated.
>> + */
>> +
>> +/**
>> + * DOC: Entity Object
>> + *
>> + * The entity object which is a container for jobs which should execute
>> + * sequentially. Drivers should create an entity for each individual context
>> + * they maintain for command submissions which can run in parallel.
>> + *
>> + * The lifetime of the entity should *not* exceed the lifetime of the
>> + * userspace process it was created for and drivers should call the
>> + * drm_sched_entity_flush() function from their file_operations.flush
>> + * callback. So it's possible that an entity object is not alive any
> 
> "Note that it is possible..."
> 
>> + * more while jobs from it are still running on the hardware.
> 
> "while jobs previously fetched from this entity are still..."
> 
>> + *
>> + * Background is that for compatibility reasons with existing
>> + * userspace all results of a command submission should become visible
>> + * externally even after after a process exits. This is normal POSIX behavior
>> + * for I/O operations.
>> + *
>> + * The problem with this approach is that GPU submissions contain executable
>> + * shaders enabling processes to evade their termination by offloading work to
>> + * the GPU. So when a process is terminated with a SIGKILL the entity object
>> + * makes sure that jobs are freed without running them while still maintaining
>> + * correct sequential order for signaling fences.
>> + */
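
A minimal sketch of such a flush callback, assuming a hypothetical
my_fpriv that holds one entity per open file:

  /* Wired up as file_operations.flush in the driver's fops. */
  static int my_drm_flush(struct file *f, fl_owner_t id)
  {
          struct drm_file *file_priv = f->private_data;
          struct my_fpriv *fpriv = file_priv->driver_priv;

          /*
           * Bounded wait for queued jobs; for a SIGKILLed process the
           * entity drops the remaining jobs instead of running them.
           */
          drm_sched_entity_flush(&fpriv->entity, msecs_to_jiffies(1000));
          return 0;
  }
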
>> +
>> +/**
>> + * DOC: Hardware Fence Object
>> + *
>> + * The hardware fence object is a DMA-fence provided by the driver as result of
>> + * running jobs. Drivers need to make sure that the normal DMA-fence semantics
>> + * are followed for this object. It's important to note that the memory for
>> + * this object can *not* be allocated in the run_job callback since that would
>> + * violate the requirements for the DMA-fence implementation. The scheduler
>> + * maintains a timeout handler which triggers if this fence doesn't signal in
>> + * a configurable time frame.
>> + *
>> + * The lifetime of this object follows DMA-fence ref-counting rules, the
>> + * scheduler takes ownership of the reference returned by the driver and drops
>> + * it when it's not needed any more.
>> + */
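
For illustration, a run_job callback that only initializes and returns a
hardware fence embedded in the (hypothetical) driver job, so nothing has
to be allocated at this point:

  static struct dma_fence *my_run_job(struct drm_sched_job *sched_job)
  {
          struct my_job *job = to_my_job(sched_job);
          struct my_ring *ring = job->ring;

          /* The fence storage is part of the job, no allocation here. */
          dma_fence_init(&job->hw_fence, &my_hw_fence_ops, &ring->fence_lock,
                         ring->fence_context, ++ring->seqno);

          my_ring_emit(ring, job);        /* kick the hardware */

          /* The scheduler takes over this reference. */
          return dma_fence_get(&job->hw_fence);
  }
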
>> +
>> +/**
>> + * DOC: Scheduler Fence Object
>> + *
>> + * The scheduler fence object which encapsulates the whole time from pushing
>> + * the job into the scheduler until the hardware has finished processing it.
>> + * This is internally managed by the scheduler, but drivers can grab additional
>> + * reference to it after arming a job. The implementation provides DMA-fence
>> + * interfaces for signaling both scheduling of a command submission as well as
>> + * finishing of processing.
>> + *
>> + * The lifetime of this object also follows normal DMA-fence ref-counting
>> + * rules. The finished fence is the one normally exposed outside of the
>> + * scheduler, but the driver can grab references to both the scheduled as well
>> + * as the finished fence when needed for pipe-lining optimizations.
>> + */
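
For example, a driver can hand the finished fence back to userspace as the
out-fence of a submission; a sketch continuing the hypothetical my_job
example, with out_syncobj assumed to come from the submission ioctl:

  static void my_arm_and_push(struct my_job *job,
                              struct drm_syncobj *out_syncobj)
  {
          struct dma_fence *fence;

          drm_sched_job_arm(&job->base);

          /* The finished fence is fully initialized by drm_sched_job_arm(). */
          fence = dma_fence_get(&job->base.s_fence->finished);

          drm_sched_entity_push_job(&job->base);

          /* Expose it as the submission's out-fence, then drop our reference. */
          drm_syncobj_replace_fence(out_syncobj, fence);
          dma_fence_put(fence);
  }
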
>> +
>> +/**
>> + * DOC: Scheduler and Run Queue Objects
>> + *
>> + * The scheduler object itself does the actual work of selecting a job and
>> + * pushing it to the hardware. Both FIFO and RR selection algorithm are
>> + * supported, but FIFO is preferred for many use cases.
> 
> I suggest to name the use cases FIFO scheduling is preferred for and why.
> 
> If, instead, it's just a general recommendation, I also suggest to explain why.
> 
>> + *
>> + * The lifetime of the scheduler is managed by the driver using it. Before
>> + * destroying the scheduler the driver must ensure that all hardware processing
>> + * involving this scheduler object has finished by calling for example
>> + * disable_irq(). It is *not* sufficient to wait for the hardware fence here
>> + * since this doesn't guarantee that all callback processing has finished.
> 
> This is the part I'm most concerned about, since I feel like we leave drivers
> "up in the air" entirely. Hence, I think here we need to be more verbose and
> detailed about the options drivers have to ensure that.
> 
> For instance, let's assume we have the single-entity-per-scheduler topology
> because the driver only uses the GPU scheduler to feed a firmware scheduler with
> dynamically allocated ring buffers.
> 
> In this case the entity, scheduler and ring buffer are bound to the lifetime of
> a userspace process.
> 
> What do we expect the driver to do if the userspace process is killed? As you
> mentioned, only waiting for the ring to be idle (which implies all HW fences
> are signalled) is not enough. This doesn't guarantee all the free_job()
> callbacks have been called yet and hence stopping the scheduler before the
> pending_list is actually empty would leak the memory of the jobs on the
> pending_list waiting to be freed.
> 
> I already brought this up when we were discussing Matt's Xe inspired scheduler
> patch series and it seems there was no interest to provide drivers with some
> common mechanism that guarantees that the pending_list is empty. Hence, I really
> think we should at least give recommendations how drivers should deal with that.
> 
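
One possible driver-side pattern, purely as a sketch and not something the
scheduler provides: the driver keeps its own in-flight counter that is
incremented before pushing a job and decremented (with a wake_up()) at the
end of its free_job callback, and drains it before teardown:

  /* Hypothetical teardown path of a per-context ring/scheduler. */
  static void my_ring_fini(struct my_ring *ring)
  {
          /* Wait for the hardware to go idle first ... */
          wait_event(ring->idle_wq, my_ring_idle(ring));

          /* ... then until free_job() has run for every pushed job. */
          wait_event(ring->free_wq, atomic_read(&ring->in_flight) == 0);

          /* Only now stop callback processing and tear everything down. */
          disable_irq(ring->irq);
          drm_sched_fini(&ring->sched);
  }
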
>> + *
>> + * The run queue object is a container of entities for a certain priority
>> + * level. This object is internally managed by the scheduler and drivers
>> + * shouldn't touch them directly. The lifetime of run queues are bound to the
>> + * schedulers lifetime.
> 
> I think we should also mention that we support a variable number of run-queues
> up to DRM_SCHED_PRIORITY_COUNT. Also there is this weird restriction on which
> priorities a driver can use when choosing less than DRM_SCHED_PRIORITY_COUNT
> run-queues.
> 
> For instance, initializing the scheduler with a single run-queue, requires the
> corresponding entities to pick DRM_SCHED_PRIORITY_MIN, otherwise we'll just
> fault since the priority is also used as an array index into sched->sched_rq[].
> 
>>    */
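
A sketch of the entity setup under that restriction (the my_* names are
hypothetical and the priority enum names differ between kernel versions):

  struct drm_gpu_scheduler *sched_list[] = { &ring->sched };

  /*
   * With a single run-queue the entity must use the lowest priority,
   * since the priority is also used as index into sched->sched_rq[].
   */
  ret = drm_sched_entity_init(&ctx->entity, DRM_SCHED_PRIORITY_MIN,
                              sched_list, ARRAY_SIZE(sched_list), NULL);
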
>>   
>>   /**
>> @@ -72,6 +166,42 @@
>>    * limit.
>>    */
>>   
>> +/**
>> + * DOC: Error and Timeout handling
>> + *
>> + * Errors schould be signaled by using dma_fence_set_error() on the hardware
>> + * fence object before signaling it. Errors are then bubbled up from the
>> + * hardware fence to the scheduler fence.
>> + *
>> + * The entity allows querying errors on the last run submission using the
>> + * drm_sched_entity_error() function which can be used to cancel queued
>> + * submissions in the run_job callback as well as preventing pushing further
>> + * ones into the entity in the drivers submission function.
>> + *
>> + * When the hardware fence fails to signal in a configurable amount of time the
>> + * timedout_job callback is issued. The driver should then follow the procedure
>> + * described on the &struct drm_sched_backend_ops.timedout_job callback (TODO:
>> + * The timeout handler should probably switch to using the hardware fence as
>> + * parameter instead of the job. Otherwise the handling will always race
>> + * between timing out and signaling the fence).
>> + *
>> + * The scheduler also used to provided functionality for re-submitting jobs
>> + * with replacing the hardware fence during reset handling. This functionality
>> + * is now marked as deprecated. This has proven to be fundamentally racy and
>> + * not compatible with DMA-fence rules and shouldn't be used in any new code.
>> + *
>> + * Additional there is the drm_sched_increase_karma() function which tries to
>   
> "Additionally"
> 
>> + * find the entity which submitted a job and increases it's 'karma'
>> + * atomic variable to prevent re-submitting jobs from this entity. This has
>> + * quite some overhead and re-submitting jobs is now marked as deprecated. So
>> + * using this function is rather discouraged.
>> + *
>> + * Drivers can still re-create the GPU state should it be lost during timeout
>> + * handling when they can guarantee that forward progress is made and this
>> + * doesn't cause another timeout. But this is strongly hardware specific and
>> + * out of the scope of the general GPU scheduler.
>> + */
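
Roughly what that looks like on the driver side (hw_fence, hw_status and
ctx are hypothetical):

  /* Fence completion path: record the error before signaling. */
  if (hw_status)
          dma_fence_set_error(&job->hw_fence, -EIO);
  dma_fence_signal(&job->hw_fence);

  /* Submission path: refuse further work on a broken context. */
  ret = drm_sched_entity_error(&ctx->entity);
  if (ret)
          return ret;
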
>> +
>>   #include <linux/wait.h>
>>   #include <linux/sched.h>
>>   #include <linux/completion.h>
>> -- 
>> 2.34.1
>>
Christian König Feb. 14, 2024, 8:23 a.m. UTC | #6
Am 13.02.24 um 18:37 schrieb Danilo Krummrich:
> Hi Christian,
>
> What's the status of this effort? Was there ever a follow-up?

It's unfortunately on hold for the moment since I have to look into some 
internal things with highest priority. No idea when this will calm down 
again.

Christian.

>
> - Danilo
>
> On 11/16/23 23:23, Danilo Krummrich wrote:
>> Hi Christian,
>>
>> Thanks for sending an update of this patch!
>>
>> On Thu, Nov 16, 2023 at 03:15:46PM +0100, Christian König wrote:
>>> Start to improve the scheduler document. Especially document the
>>> lifetime of each of the objects as well as the restrictions around
>>> DMA-fence handling and userspace compatibility.
>>>
>>> v2: Some improvements suggested by Danilo, add section about error
>>>      handling.
>>>
>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>> ---
>>>   Documentation/gpu/drm-mm.rst           |  36 +++++
>>>   drivers/gpu/drm/scheduler/sched_main.c | 174 
>>> +++++++++++++++++++++----
>>>   2 files changed, 188 insertions(+), 22 deletions(-)
>>>
>>> diff --git a/Documentation/gpu/drm-mm.rst 
>>> b/Documentation/gpu/drm-mm.rst
>>> index acc5901ac840..112463fa9f3a 100644
>>> --- a/Documentation/gpu/drm-mm.rst
>>> +++ b/Documentation/gpu/drm-mm.rst
>>> @@ -552,12 +552,48 @@ Overview
>>>   .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>>      :doc: Overview
>>>   +Job Object
>>> +----------
>>> +
>>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>> +   :doc: Job Object
>>> +
>>> +Entity Object
>>> +-------------
>>> +
>>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>> +   :doc: Entity Object
>>> +
>>> +Hardware Fence Object
>>> +---------------------
>>> +
>>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>> +   :doc: Hardware Fence Object
>>> +
>>> +Scheduler Fence Object
>>> +----------------------
>>> +
>>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>> +   :doc: Scheduler Fence Object
>>> +
>>> +Scheduler and Run Queue Objects
>>> +-------------------------------
>>> +
>>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>> +   :doc: Scheduler and Run Queue Objects
>>> +
>>>   Flow Control
>>>   ------------
>>>     .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>>      :doc: Flow Control
>>>   +Error and Timeout handling
>>> +--------------------------
>>> +
>>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>> +   :doc: Error and Timeout handling
>>> +
>>>   Scheduler Function References
>>>   -----------------------------
>>>   diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 044a8c4875ba..026123497b0e 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -24,28 +24,122 @@
>>>   /**
>>>    * DOC: Overview
>>>    *
>>> - * The GPU scheduler provides entities which allow userspace to 
>>> push jobs
>>> - * into software queues which are then scheduled on a hardware run 
>>> queue.
>>> - * The software queues have a priority among them. The scheduler 
>>> selects the entities
>>> - * from the run queue using a FIFO. The scheduler provides 
>>> dependency handling
>>> - * features among jobs. The driver is supposed to provide callback 
>>> functions for
>>> - * backend operations to the scheduler like submitting a job to 
>>> hardware run queue,
>>> - * returning the dependencies of a job etc.
>>> - *
>>> - * The organisation of the scheduler is the following:
>>> - *
>>> - * 1. Each hw run queue has one scheduler
>>> - * 2. Each scheduler has multiple run queues with different priorities
>>> - *    (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
>>> - * 3. Each scheduler run queue has a queue of entities to schedule
>>> - * 4. Entities themselves maintain a queue of jobs that will be 
>>> scheduled on
>>> - *    the hardware.
>>> - *
>>> - * The jobs in a entity are always scheduled in the order that they 
>>> were pushed.
>>> - *
>>> - * Note that once a job was taken from the entities queue and 
>>> pushed to the
>>> - * hardware, i.e. the pending queue, the entity must not be 
>>> referenced anymore
>>> - * through the jobs entity pointer.
>>> + * The GPU scheduler implements some logic to decide which command 
>>> submission
>>> + * to push next to the hardware. Another major use case of the GPU 
>>> scheduler
>>> + * is to enforce correct driver behavior around those command 
>>> submissions.
>>> + * Because of this it's also used by drivers which don't need the 
>>> actual
>>> + * scheduling functionality.
>>
>> This reads a bit like we're already right in the middle of the 
>> documentation.
>> I'd propose to start with something like "The DRM GPU scheduler 
>> serves as a
>> common component intended to help drivers to manage command 
>> submissions." And
>> then mention the different solutions the scheduler provides, e.g. 
>> ordering of
>> command submissions, dma-fence handling in the context of command 
>> submissions,
>> etc.
>>
>> Also, I think it would be good to give a rough overview of the 
>> topology of the
>> scheduler's components. And since you already mention that the 
>> component is
>> "also used by drivers which don't need the actual scheduling 
>> functionality", I'd
>> also mention the special case of a single entity and a single 
>> run-queue per
>> scheduler.
>>
>>> + *
>>> + * All callbacks the driver needs to implement are restricted by 
>>> DMA-fence
>>> + * signaling rules to guarantee deadlock free forward progress. 
>>> This especially
>>> + * means that for normal operation no memory can be allocated in a 
>>> callback.
>>> + * All memory which is needed for pushing the job to the hardware 
>>> must be
>>> + * allocated before arming a job. It also means that no locks can 
>>> be taken
>>> + * under which memory might be allocated as well.
>>
>> I think that's good. Even though, with the recently merged workqueue 
>> patches,
>> drivers can actually create a setup where the free_job callback isn't 
>> part of
>> the fence signalling critical path anymore. But I agree with Sima 
>> that this is
>> probably too error prone to give drivers ideas. So, this paragraph is 
>> probably
>> good as it is. :-)
>>
>>> + *
>>> + * Memory which is optional to allocate, for example for device 
>>> core dumping or
>>> + * debugging, *must* be allocated with GFP_NOWAIT and appropriate 
>>> error
>>> + * handling taking if that allocation fails. GFP_ATOMIC should only 
>>> be used if
>>> + * absolutely necessary since dipping into the special atomic 
>>> reserves is
>>> + * usually not justified for a GPU driver.
>>> + */
>>> +
>>> +/**
>>> + * DOC: Job Object
>>> + *
>>> + * The base job object contains submission dependencies in the form 
>>> of DMA-fence
>>> + * objects. Drivers can also implement an optional prepare_job 
>>> callback which
>>> + * returns additional dependencies as DMA-fence objects. It's 
>>> important to note
>>> + * that this callback can't allocate memory or grab locks under 
>>> which memory is
>>> + * allocated.
>>> + *
>>> + * Drivers should use this as base class for an object which 
>>> contains the
>>> + * necessary state to push the command submission to the hardware.
>>> + *
>>> + * The lifetime of the job object should at least be from pushing 
>>> it into the
>>
>> "should at least last from"
>>
>>> + * scheduler until the scheduler notes through the free callback 
>>> that a job
>>
>> What about "until the free_job callback has been called and hence the 
>> scheduler
>> does not require the job object anymore."?
>>
>>> + * isn't needed any more. Drivers can of course keep their job 
>>> object alive
>>> + * longer than that, but that's outside of the scope of the scheduler
>>> + * component. Job initialization is split into two parts, 
>>> drm_sched_job_init()
>>> + * and drm_sched_job_arm(). It's important to note that after 
>>> arming a job
>>
>> I suggest to add a brief comment on why job initialization is split up.
>>
>>> + * drivers must follow the DMA-fence rules and can't easily 
>>> allocate memory
>>> + * or takes locks under which memory is allocated.
>>> + */
>>> +
>>> +/**
>>> + * DOC: Entity Object
>>> + *
>>> + * The entity object which is a container for jobs which should 
>>> execute
>>> + * sequentially. Drivers should create an entity for each 
>>> individual context
>>> + * they maintain for command submissions which can run in parallel.
>>> + *
>>> + * The lifetime of the entity should *not* exceed the lifetime of the
>>> + * userspace process it was created for and drivers should call the
>>> + * drm_sched_entity_flush() function from their file_operations.flush
>>> + * callback. So it's possible that an entity object is not alive any
>>
>> "Note that it is possible..."
>>
>>> + * more while jobs from it are still running on the hardware.
>>
>> "while jobs previously fetched from this entity are still..."
>>
>>> + *
>>> + * Background is that for compatibility reasons with existing
>>> + * userspace all results of a command submission should become visible
>>> + * externally even after after a process exits. This is normal 
>>> POSIX behavior
>>> + * for I/O operations.
>>> + *
>>> + * The problem with this approach is that GPU submissions contain 
>>> executable
>>> + * shaders enabling processes to evade their termination by 
>>> offloading work to
>>> + * the GPU. So when a process is terminated with a SIGKILL the 
>>> entity object
>>> + * makes sure that jobs are freed without running them while still 
>>> maintaining
>>> + * correct sequential order for signaling fences.
>>> + */
>>> +
>>> +/**
>>> + * DOC: Hardware Fence Object
>>> + *
>>> + * The hardware fence object is a DMA-fence provided by the driver 
>>> as result of
>>> + * running jobs. Drivers need to make sure that the normal 
>>> DMA-fence semantics
>>> + * are followed for this object. It's important to note that the 
>>> memory for
>>> + * this object can *not* be allocated in the run_job callback since 
>>> that would
>>> + * violate the requirements for the DMA-fence implementation. The 
>>> scheduler
>>> + * maintains a timeout handler which triggers if this fence doesn't 
>>> signal in
>>> + * a configurable time frame.
>>> + *
>>> + * The lifetime of this object follows DMA-fence ref-counting 
>>> rules, the
>>> + * scheduler takes ownership of the reference returned by the 
>>> driver and drops
>>> + * it when it's not needed any more.
>>> + */
>>> +
>>> +/**
>>> + * DOC: Scheduler Fence Object
>>> + *
>>> + * The scheduler fence object which encapsulates the whole time 
>>> from pushing
>>> + * the job into the scheduler until the hardware has finished 
>>> processing it.
>>> + * This is internally managed by the scheduler, but drivers can 
>>> grab additional
>>> + * reference to it after arming a job. The implementation provides 
>>> DMA-fence
>>> + * interfaces for signaling both scheduling of a command submission 
>>> as well as
>>> + * finishing of processing.
>>> + *
>>> + * The lifetime of this object also follows normal DMA-fence 
>>> ref-counting
>>> + * rules. The finished fence is the one normally exposed outside of 
>>> the
>>> + * scheduler, but the driver can grab references to both the 
>>> scheduled as well
>>> + * as the finished fence when needed for pipe-lining optimizations.
>>> + */
>>> +
>>> +/**
>>> + * DOC: Scheduler and Run Queue Objects
>>> + *
>>> + * The scheduler object itself does the actual work of selecting a 
>>> job and
>>> + * pushing it to the hardware. Both FIFO and RR selection algorithm 
>>> are
>>> + * supported, but FIFO is preferred for many use cases.
>>
>> I suggest to name the use cases FIFO scheduling is preferred for and 
>> why.
>>
>> If, instead, it's just a general recommendation, I also suggest to 
>> explain why.
>>
>>> + *
>>> + * The lifetime of the scheduler is managed by the driver using it. 
>>> Before
>>> + * destroying the scheduler the driver must ensure that all 
>>> hardware processing
>>> + * involving this scheduler object has finished by calling for example
>>> + * disable_irq(). It is *not* sufficient to wait for the hardware 
>>> fence here
>>> + * since this doesn't guarantee that all callback processing has 
>>> finished.
>>
>> This is the part I'm most concerned about, since I feel like we leave 
>> drivers
>> "up in the air" entirely. Hence, I think here we need to be more 
>> verbose and
>> detailed about the options drivers have to ensure that.
>>
>> For instance, let's assume we have the single-entity-per-scheduler 
>> topology
>> because the driver only uses the GPU scheduler to feed a firmware 
>> scheduler with
>> dynamically allocated ring buffers.
>>
>> In this case the entity, scheduler and ring buffer are bound to the 
>> lifetime of
>> a userspace process.
>>
>> What do we expect the driver to do if the userspace process is 
>> killed? As you
>> mentioned, only waiting for the ring to be idle (which implies all HW 
>> fences
>> are signalled) is not enough. This doesn't guarantee all the free_job()
>> callbacks have been called yet and hence stopping the scheduler 
>> before the
>> pending_list is actually empty would leak the memory of the jobs on the
>> pending_list waiting to be freed.
>>
>> I already brought this up when we were discussing Matt's Xe inspired 
>> scheduler
>> patch series and it seems there was no interest to provide drivers 
>> with some
>> common mechanism that guarantees that the pending_list is empty. 
>> Hence, I really
>> think we should at least give recommendations how drivers should deal 
>> with that.
>>
>>> + *
>>> + * The run queue object is a container of entities for a certain 
>>> priority
>>> + * level. This object is internally managed by the scheduler and 
>>> drivers
>>> + * shouldn't touch them directly. The lifetime of run queues are 
>>> bound to the
>>> + * schedulers lifetime.
>>
>> I think we should also mention that we support a variable number of 
>> run-queues
>> up to DRM_SCHED_PRIORITY_COUNT. Also there is this weird restriction 
>> on which
>> priorities a driver can use when choosing less than 
>> DRM_SCHED_PRIORITY_COUNT
>> run-queues.
>>
>> For instance, initializing the scheduler with a single run-queue, 
>> requires the
>> corresponding entities to pick DRM_SCHED_PRIORITY_MIN, otherwise 
>> we'll just
>> fault since the priority is also used as an array index into 
>> sched->sched_rq[].
>>
>>>    */
>>>     /**
>>> @@ -72,6 +166,42 @@
>>>    * limit.
>>>    */
>>>   +/**
>>> + * DOC: Error and Timeout handling
>>> + *
>>> + * Errors schould be signaled by using dma_fence_set_error() on the 
>>> hardware
>>> + * fence object before signaling it. Errors are then bubbled up 
>>> from the
>>> + * hardware fence to the scheduler fence.
>>> + *
>>> + * The entity allows querying errors on the last run submission 
>>> using the
>>> + * drm_sched_entity_error() function which can be used to cancel 
>>> queued
>>> + * submissions in the run_job callback as well as preventing 
>>> pushing further
>>> + * ones into the entity in the drivers submission function.
>>> + *
>>> + * When the hardware fence fails to signal in a configurable amount 
>>> of time the
>>> + * timedout_job callback is issued. The driver should then follow 
>>> the procedure
>>> + * described on the &struct drm_sched_backend_ops.timedout_job 
>>> callback (TODO:
>>> + * The timeout handler should probably switch to using the hardware 
>>> fence as
>>> + * parameter instead of the job. Otherwise the handling will always 
>>> race
>>> + * between timing out and signaling the fence).
>>> + *
>>> + * The scheduler also used to provided functionality for 
>>> re-submitting jobs
>>> + * with replacing the hardware fence during reset handling. This 
>>> functionality
>>> + * is now marked as deprecated. This has proven to be fundamentally 
>>> racy and
>>> + * not compatible with DMA-fence rules and shouldn't be used in any 
>>> new code.
>>> + *
>>> + * Additional there is the drm_sched_increase_karma() function 
>>> which tries to
>>   "Additionally"
>>
>>> + * find the entity which submitted a job and increases it's 'karma'
>>> + * atomic variable to prevent re-submitting jobs from this entity. 
>>> This has
>>> + * quite some overhead and re-submitting jobs is now marked as 
>>> deprecated. So
>>> + * using this function is rather discouraged.
>>> + *
>>> + * Drivers can still re-create the GPU state should it be lost 
>>> during timeout
>>> + * handling when they can guarantee that forward progress is made 
>>> and this
>>> + * doesn't cause another timeout. But this is strongly hardware 
>>> specific and
>>> + * out of the scope of the general GPU scheduler.
>>> + */
>>> +
>>>   #include <linux/wait.h>
>>>   #include <linux/sched.h>
>>>   #include <linux/completion.h>
>>> -- 
>>> 2.34.1
>>>
>
Philipp Stanner July 19, 2024, 10:29 a.m. UTC | #7
On Thu, 2023-11-16 at 15:15 +0100, Christian König wrote:
> Start to improve the scheduler document. Especially document the
> lifetime of each of the objects as well as the restrictions around
> DMA-fence handling and userspace compatibility.

Hello Christian,

thanks for working on this.

I'm currently looking deeper into the scheduler and am documenting the
pitfalls etc. that I have found so far.


What are your current plans with this documentation series? If you
don't intend to get it upstreamed in the foreseeable future, I would
like to hijack the series and use it as a basis for my own improvements
to the documentation.

Please tell me what you think,


Regards,
P.


> 
> v2: Some improvements suggested by Danilo, add section about error
>     handling.
> 
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>  Documentation/gpu/drm-mm.rst           |  36 +++++
>  drivers/gpu/drm/scheduler/sched_main.c | 174 +++++++++++++++++++++--
> --
>  2 files changed, 188 insertions(+), 22 deletions(-)
> 
> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-
> mm.rst
> index acc5901ac840..112463fa9f3a 100644
> --- a/Documentation/gpu/drm-mm.rst
> +++ b/Documentation/gpu/drm-mm.rst
> @@ -552,12 +552,48 @@ Overview
>  .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>     :doc: Overview
>  
> +Job Object
> +----------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Job Object
> +
> +Entity Object
> +-------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Entity Object
> +
> +Hardware Fence Object
> +---------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Hardware Fence Object
> +
> +Scheduler Fence Object
> +----------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Scheduler Fence Object
> +
> +Scheduler and Run Queue Objects
> +-------------------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Scheduler and Run Queue Objects
> +
>  Flow Control
>  ------------
>  
>  .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>     :doc: Flow Control
>  
> +Error and Timeout handling
> +--------------------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
> +   :doc: Error and Timeout handling
> +
>  Scheduler Function References
>  -----------------------------
>  
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> b/drivers/gpu/drm/scheduler/sched_main.c
> index 044a8c4875ba..026123497b0e 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -24,28 +24,122 @@
>  /**
>   * DOC: Overview
>   *
> - * The GPU scheduler provides entities which allow userspace to push
> jobs
> - * into software queues which are then scheduled on a hardware run
> queue.
> - * The software queues have a priority among them. The scheduler
> selects the entities
> - * from the run queue using a FIFO. The scheduler provides
> dependency handling
> - * features among jobs. The driver is supposed to provide callback
> functions for
> - * backend operations to the scheduler like submitting a job to
> hardware run queue,
> - * returning the dependencies of a job etc.
> - *
> - * The organisation of the scheduler is the following:
> - *
> - * 1. Each hw run queue has one scheduler
> - * 2. Each scheduler has multiple run queues with different
> priorities
> - *    (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
> - * 3. Each scheduler run queue has a queue of entities to schedule
> - * 4. Entities themselves maintain a queue of jobs that will be
> scheduled on
> - *    the hardware.
> - *
> - * The jobs in a entity are always scheduled in the order that they
> were pushed.
> - *
> - * Note that once a job was taken from the entities queue and pushed
> to the
> - * hardware, i.e. the pending queue, the entity must not be
> referenced anymore
> - * through the jobs entity pointer.
> + * The GPU scheduler implements some logic to decide which command
> submission
> + * to push next to the hardware. Another major use case of the GPU
> scheduler
> + * is to enforce correct driver behavior around those command
> submissions.
> + * Because of this it's also used by drivers which don't need the
> actual
> + * scheduling functionality.
> + *
> + * All callbacks the driver needs to implement are restricted by
> DMA-fence
> + * signaling rules to guarantee deadlock free forward progress. This
> especially
> + * means that for normal operation no memory can be allocated in a
> callback.
> + * All memory which is needed for pushing the job to the hardware
> must be
> + * allocated before arming a job. It also means that no locks can be
> taken
> + * under which memory might be allocated as well.
> + *
> + * Memory which is optional to allocate, for example for device core
> dumping or
> + * debugging, *must* be allocated with GFP_NOWAIT and appropriate
> error
> + * handling taking if that allocation fails. GFP_ATOMIC should only
> be used if
> + * absolutely necessary since dipping into the special atomic
> reserves is
> + * usually not justified for a GPU driver.
> + */
> +
> +/**
> + * DOC: Job Object
> + *
> + * The base job object contains submission dependencies in the form
> of DMA-fence
> + * objects. Drivers can also implement an optional prepare_job
> callback which
> + * returns additional dependencies as DMA-fence objects. It's
> important to note
> + * that this callback can't allocate memory or grab locks under
> which memory is
> + * allocated.
> + *
> + * Drivers should use this as base class for an object which
> contains the
> + * necessary state to push the command submission to the hardware.
> + *
> + * The lifetime of the job object should at least be from pushing it
> into the
> + * scheduler until the scheduler notes through the free callback
> that a job
> + * isn't needed any more. Drivers can of course keep their job
> object alive
> + * longer than that, but that's outside of the scope of the
> scheduler
> + * component. Job initialization is split into two parts,
> drm_sched_job_init()
> + * and drm_sched_job_arm(). It's important to note that after arming
> a job
> + * drivers must follow the DMA-fence rules and can't easily allocate
> memory
> + * or takes locks under which memory is allocated.
> + */
> +
> +/**
> + * DOC: Entity Object
> + *
> + * The entity object which is a container for jobs which should
> execute
> + * sequentially. Drivers should create an entity for each individual
> context
> + * they maintain for command submissions which can run in parallel.
> + *
> + * The lifetime of the entity should *not* exceed the lifetime of
> the
> + * userspace process it was created for and drivers should call the
> + * drm_sched_entity_flush() function from their
> file_operations.flush
> + * callback. So it's possible that an entity object is not alive any
> + * more while jobs from it are still running on the hardware.
> + *
> + * Background is that for compatibility reasons with existing
> + * userspace all results of a command submission should become
> visible
> + * externally even after after a process exits. This is normal POSIX
> behavior
> + * for I/O operations.
> + *
> + * The problem with this approach is that GPU submissions contain
> executable
> + * shaders enabling processes to evade their termination by
> offloading work to
> + * the GPU. So when a process is terminated with a SIGKILL the
> entity object
> + * makes sure that jobs are freed without running them while still
> maintaining
> + * correct sequential order for signaling fences.
> + */
> +
> +/**
> + * DOC: Hardware Fence Object
> + *
> + * The hardware fence object is a DMA-fence provided by the driver
> as result of
> + * running jobs. Drivers need to make sure that the normal DMA-fence
> semantics
> + * are followed for this object. It's important to note that the
> memory for
> + * this object can *not* be allocated in the run_job callback since
> that would
> + * violate the requirements for the DMA-fence implementation. The
> scheduler
> + * maintains a timeout handler which triggers if this fence doesn't
> signal in
> + * a configurable time frame.
> + *
> + * The lifetime of this object follows DMA-fence ref-counting rules,
> the
> + * scheduler takes ownership of the reference returned by the driver
> and drops
> + * it when it's not needed any more.
> + */
> +
> +/**
> + * DOC: Scheduler Fence Object
> + *
> + * The scheduler fence object which encapsulates the whole time from
> pushing
> + * the job into the scheduler until the hardware has finished
> processing it.
> + * This is internally managed by the scheduler, but drivers can grab
> additional
> + * reference to it after arming a job. The implementation provides
> DMA-fence
> + * interfaces for signaling both scheduling of a command submission
> as well as
> + * finishing of processing.
> + *
> + * The lifetime of this object also follows normal DMA-fence ref-
> counting
> + * rules. The finished fence is the one normally exposed outside of
> the
> + * scheduler, but the driver can grab references to both the
> scheduled as well
> + * as the finished fence when needed for pipe-lining optimizations.
> + */
> +
> +/**
> + * DOC: Scheduler and Run Queue Objects
> + *
> + * The scheduler object itself does the actual work of selecting a
> job and
> + * pushing it to the hardware. Both FIFO and RR selection algorithm
> are
> + * supported, but FIFO is preferred for many use cases.
> + *
> + * The lifetime of the scheduler is managed by the driver using it.
> Before
> + * destroying the scheduler the driver must ensure that all hardware
> processing
> + * involving this scheduler object has finished by calling for
> example
> + * disable_irq(). It is *not* sufficient to wait for the hardware
> fence here
> + * since this doesn't guarantee that all callback processing has
> finished.
> + *
> + * The run queue object is a container of entities for a certain
> priority
> + * level. This object is internally managed by the scheduler and
> drivers
> + * shouldn't touch them directly. The lifetime of run queues are
> bound to the
> + * schedulers lifetime.
>   */
>  
>  /**
> @@ -72,6 +166,42 @@
>   * limit.
>   */
>  
> +/**
> + * DOC: Error and Timeout handling
> + *
> + * Errors schould be signaled by using dma_fence_set_error() on the
> hardware
> + * fence object before signaling it. Errors are then bubbled up from
> the
> + * hardware fence to the scheduler fence.
> + *
> + * The entity allows querying errors on the last run submission
> using the
> + * drm_sched_entity_error() function which can be used to cancel
> queued
> + * submissions in the run_job callback as well as preventing pushing
> further
> + * ones into the entity in the drivers submission function.
> + *
> + * When the hardware fence fails to signal in a configurable amount
> of time the
> + * timedout_job callback is issued. The driver should then follow
> the procedure
> + * described on the &struct drm_sched_backend_ops.timedout_job
> callback (TODO:
> + * The timeout handler should probably switch to using the hardware
> fence as
> + * parameter instead of the job. Otherwise the handling will always
> race
> + * between timing out and signaling the fence).
> + *
> + * The scheduler also used to provided functionality for re-
> submitting jobs
> + * with replacing the hardware fence during reset handling. This
> functionality
> + * is now marked as deprecated. This has proven to be fundamentally
> racy and
> + * not compatible with DMA-fence rules and shouldn't be used in any
> new code.
> + *
> + * Additional there is the drm_sched_increase_karma() function which
> tries to
> + * find the entity which submitted a job and increases it's 'karma'
> + * atomic variable to prevent re-submitting jobs from this entity.
> This has
> + * quite some overhead and re-submitting jobs is now marked as
> deprecated. So
> + * using this function is rather discouraged.
> + *
> + * Drivers can still re-create the GPU state should it be lost
> during timeout
> + * handling when they can guarantee that forward progress is made
> and this
> + * doesn't cause another timeout. But this is strongly hardware
> specific and
> + * out of the scope of the general GPU scheduler.
> + */
> +
>  #include <linux/wait.h>
>  #include <linux/sched.h>
>  #include <linux/completion.h>
Christian König July 19, 2024, 11:51 a.m. UTC | #8
Am 19.07.24 um 12:29 schrieb Philipp Stanner:
> On Thu, 2023-11-16 at 15:15 +0100, Christian König wrote:
>> Start to improve the scheduler document. Especially document the
>> lifetime of each of the objects as well as the restrictions around
>> DMA-fence handling and userspace compatibility.
> Hallo Christian,
>
> thanks for working on this.
>
> I'm currently looking deeper into the scheduler and am documenting the
> pitfalls etc. that I have found so far.
>
>
> What are your current plans with this documentation series? If you
> don't intend to get it upstreamed in the foreseeable future, I would
> like to hijack the series and use it as a basis for my own improvements
> to the documentation.
>
> Please tell me what you think,

Feel free to pick that up; I have around 10 different upstream things on 
my backlog at the moment and it hasn't gotten any smaller since the 
beginning of the year :(

I really need to sort out my time management to be able to work on 
upstream stuff without getting interrupted by anything AMD internal.

Cheers,
Christian.

>
>
> Regards,
> P.
>
>
>> v2: Some improvements suggested by Danilo, add section about error
>>      handling.
>>
>> Signed-off-by: Christian König <christian.koenig@amd.com>
>> ---
>>   Documentation/gpu/drm-mm.rst           |  36 +++++
>>   drivers/gpu/drm/scheduler/sched_main.c | 174 +++++++++++++++++++++--
>> --
>>   2 files changed, 188 insertions(+), 22 deletions(-)
>>
>> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-
>> mm.rst
>> index acc5901ac840..112463fa9f3a 100644
>> --- a/Documentation/gpu/drm-mm.rst
>> +++ b/Documentation/gpu/drm-mm.rst
>> @@ -552,12 +552,48 @@ Overview
>>   .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>      :doc: Overview
>>   
>> +Job Object
>> +----------
>> +
>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>> +   :doc: Job Object
>> +
>> +Entity Object
>> +-------------
>> +
>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>> +   :doc: Entity Object
>> +
>> +Hardware Fence Object
>> +---------------------
>> +
>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>> +   :doc: Hardware Fence Object
>> +
>> +Scheduler Fence Object
>> +----------------------
>> +
>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>> +   :doc: Scheduler Fence Object
>> +
>> +Scheduler and Run Queue Objects
>> +-------------------------------
>> +
>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>> +   :doc: Scheduler and Run Queue Objects
>> +
>>   Flow Control
>>   ------------
>>   
>>   .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>>      :doc: Flow Control
>>   
>> +Error and Timeout handling
>> +--------------------------
>> +
>> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>> +   :doc: Error and Timeout handling
>> +
>>   Scheduler Function References
>>   -----------------------------
>>   
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>> b/drivers/gpu/drm/scheduler/sched_main.c
>> index 044a8c4875ba..026123497b0e 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -24,28 +24,122 @@
>>   /**
>>    * DOC: Overview
>>    *
>> - * The GPU scheduler provides entities which allow userspace to push
>> jobs
>> - * into software queues which are then scheduled on a hardware run
>> queue.
>> - * The software queues have a priority among them. The scheduler
>> selects the entities
>> - * from the run queue using a FIFO. The scheduler provides
>> dependency handling
>> - * features among jobs. The driver is supposed to provide callback
>> functions for
>> - * backend operations to the scheduler like submitting a job to
>> hardware run queue,
>> - * returning the dependencies of a job etc.
>> - *
>> - * The organisation of the scheduler is the following:
>> - *
>> - * 1. Each hw run queue has one scheduler
>> - * 2. Each scheduler has multiple run queues with different
>> priorities
>> - *    (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
>> - * 3. Each scheduler run queue has a queue of entities to schedule
>> - * 4. Entities themselves maintain a queue of jobs that will be
>> scheduled on
>> - *    the hardware.
>> - *
>> - * The jobs in a entity are always scheduled in the order that they
>> were pushed.
>> - *
>> - * Note that once a job was taken from the entities queue and pushed
>> to the
>> - * hardware, i.e. the pending queue, the entity must not be
>> referenced anymore
>> - * through the jobs entity pointer.
>> + * The GPU scheduler implements some logic to decide which command
>> submission
>> + * to push next to the hardware. Another major use case of the GPU
>> scheduler
>> + * is to enforce correct driver behavior around those command
>> submissions.
>> + * Because of this it's also used by drivers which don't need the
>> actual
>> + * scheduling functionality.
>> + *
>> + * All callbacks the driver needs to implement are restricted by
>> DMA-fence
>> + * signaling rules to guarantee deadlock free forward progress. This
>> especially
>> + * means that for normal operation no memory can be allocated in a
>> callback.
>> + * All memory which is needed for pushing the job to the hardware
>> must be
>> + * allocated before arming a job. It also means that no locks can be
>> taken
>> + * under which memory might be allocated as well.
>> + *
>> + * Memory which is optional to allocate, for example for device core
>> dumping or
>> + * debugging, *must* be allocated with GFP_NOWAIT and appropriate
>> error
>> + * handling taking if that allocation fails. GFP_ATOMIC should only
>> be used if
>> + * absolutely necessary since dipping into the special atomic
>> reserves is
>> + * usually not justified for a GPU driver.
>> + */
>> +
>> +/**
>> + * DOC: Job Object
>> + *
>> + * The base job object contains submission dependencies in the form
>> of DMA-fence
>> + * objects. Drivers can also implement an optional prepare_job
>> callback which
>> + * returns additional dependencies as DMA-fence objects. It's
>> important to note
>> + * that this callback can't allocate memory or grab locks under
>> which memory is
>> + * allocated.
>> + *
>> + * Drivers should use this as base class for an object which
>> contains the
>> + * necessary state to push the command submission to the hardware.
>> + *
>> + * The lifetime of the job object should at least be from pushing it
>> into the
>> + * scheduler until the scheduler notes through the free callback
>> that a job
>> + * isn't needed any more. Drivers can of course keep their job
>> object alive
>> + * longer than that, but that's outside of the scope of the
>> scheduler
>> + * component. Job initialization is split into two parts,
>> drm_sched_job_init()
>> + * and drm_sched_job_arm(). It's important to note that after arming
>> a job
>> + * drivers must follow the DMA-fence rules and can't easily allocate
>> memory
>> + * or takes locks under which memory is allocated.
>> + */
>> +
>> +/**
>> + * DOC: Entity Object
>> + *
>> + * The entity object which is a container for jobs which should
>> execute
>> + * sequentially. Drivers should create an entity for each individual
>> context
>> + * they maintain for command submissions which can run in parallel.
>> + *
>> + * The lifetime of the entity should *not* exceed the lifetime of
>> the
>> + * userspace process it was created for and drivers should call the
>> + * drm_sched_entity_flush() function from their
>> file_operations.flush
>> + * callback. So it's possible that an entity object is not alive any
>> + * more while jobs from it are still running on the hardware.
>> + *
>> + * Background is that for compatibility reasons with existing
>> + * userspace all results of a command submission should become
>> visible
>> + * externally even after after a process exits. This is normal POSIX
>> behavior
>> + * for I/O operations.
>> + *
>> + * The problem with this approach is that GPU submissions contain
>> executable
>> + * shaders enabling processes to evade their termination by
>> offloading work to
>> + * the GPU. So when a process is terminated with a SIGKILL the
>> entity object
>> + * makes sure that jobs are freed without running them while still
>> maintaining
>> + * correct sequential order for signaling fences.
>> + */
>> +
>> +/**
>> + * DOC: Hardware Fence Object
>> + *
>> + * The hardware fence object is a DMA-fence provided by the driver
>> as result of
>> + * running jobs. Drivers need to make sure that the normal DMA-fence
>> semantics
>> + * are followed for this object. It's important to note that the
>> memory for
>> + * this object can *not* be allocated in the run_job callback since
>> that would
>> + * violate the requirements for the DMA-fence implementation. The
>> scheduler
>> + * maintains a timeout handler which triggers if this fence doesn't
>> signal in
>> + * a configurable time frame.
>> + *
>> + * The lifetime of this object follows DMA-fence ref-counting rules,
>> the
>> + * scheduler takes ownership of the reference returned by the driver
>> and drops
>> + * it when it's not needed any more.
>> + */
>> +
>> +/**
>> + * DOC: Scheduler Fence Object
>> + *
>> + * The scheduler fence object which encapsulates the whole time from
>> pushing
>> + * the job into the scheduler until the hardware has finished
>> processing it.
>> + * This is internally managed by the scheduler, but drivers can grab
>> additional
>> + * reference to it after arming a job. The implementation provides
>> DMA-fence
>> + * interfaces for signaling both scheduling of a command submission
>> as well as
>> + * finishing of processing.
>> + *
>> + * The lifetime of this object also follows normal DMA-fence ref-
>> counting
>> + * rules. The finished fence is the one normally exposed outside of
>> the
>> + * scheduler, but the driver can grab references to both the
>> scheduled as well
>> + * as the finished fence when needed for pipe-lining optimizations.
>> + */
>> +
>> +/**
>> + * DOC: Scheduler and Run Queue Objects
>> + *
>> + * The scheduler object itself does the actual work of selecting a
>> job and
>> + * pushing it to the hardware. Both FIFO and RR selection algorithm
>> are
>> + * supported, but FIFO is preferred for many use cases.
>> + *
>> + * The lifetime of the scheduler is managed by the driver using it.
>> Before
>> + * destroying the scheduler the driver must ensure that all hardware
>> processing
>> + * involving this scheduler object has finished by calling for
>> example
>> + * disable_irq(). It is *not* sufficient to wait for the hardware
>> fence here
>> + * since this doesn't guarantee that all callback processing has
>> finished.
>> + *
>> + * The run queue object is a container of entities for a certain
>> priority
>> + * level. This object is internally managed by the scheduler and
>> drivers
>> + * shouldn't touch them directly. The lifetime of run queues are
>> bound to the
>> + * schedulers lifetime.
>>    */
>>   
>>   /**
>> @@ -72,6 +166,42 @@
>>    * limit.
>>    */
>>   
>> +/**
>> + * DOC: Error and Timeout handling
>> + *
>> + * Errors schould be signaled by using dma_fence_set_error() on the
>> hardware
>> + * fence object before signaling it. Errors are then bubbled up from
>> the
>> + * hardware fence to the scheduler fence.
>> + *
>> + * The entity allows querying errors on the last run submission
>> using the
>> + * drm_sched_entity_error() function which can be used to cancel
>> queued
>> + * submissions in the run_job callback as well as preventing pushing
>> further
>> + * ones into the entity in the drivers submission function.
>> + *
>> + * When the hardware fence fails to signal in a configurable amount
>> of time the
>> + * timedout_job callback is issued. The driver should then follow
>> the procedure
>> + * described on the &struct drm_sched_backend_ops.timedout_job
>> callback (TODO:
>> + * The timeout handler should probably switch to using the hardware
>> fence as
>> + * parameter instead of the job. Otherwise the handling will always
>> race
>> + * between timing out and signaling the fence).
>> + *
>> + * The scheduler also used to provided functionality for re-
>> submitting jobs
>> + * with replacing the hardware fence during reset handling. This
>> functionality
>> + * is now marked as deprecated. This has proven to be fundamentally
>> racy and
>> + * not compatible with DMA-fence rules and shouldn't be used in any
>> new code.
>> + *
>> + * Additional there is the drm_sched_increase_karma() function which
>> tries to
>> + * find the entity which submitted a job and increases it's 'karma'
>> + * atomic variable to prevent re-submitting jobs from this entity.
>> This has
>> + * quite some overhead and re-submitting jobs is now marked as
>> deprecated. So
>> + * using this function is rather discouraged.
>> + *
>> + * Drivers can still re-create the GPU state should it be lost
>> during timeout
>> + * handling when they can guarantee that forward progress is made
>> and this
>> + * doesn't cause another timeout. But this is strongly hardware
>> specific and
>> + * out of the scope of the general GPU scheduler.
>> + */
>> +
>>   #include <linux/wait.h>
>>   #include <linux/sched.h>
>>   #include <linux/completion.h>
diff mbox series

Patch

diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
index acc5901ac840..112463fa9f3a 100644
--- a/Documentation/gpu/drm-mm.rst
+++ b/Documentation/gpu/drm-mm.rst
@@ -552,12 +552,48 @@  Overview
 .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
    :doc: Overview
 
+Job Object
+----------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+   :doc: Job Object
+
+Entity Object
+-------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+   :doc: Entity Object
+
+Hardware Fence Object
+---------------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+   :doc: Hardware Fence Object
+
+Scheduler Fence Object
+----------------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+   :doc: Scheduler Fence Object
+
+Scheduler and Run Queue Objects
+-------------------------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+   :doc: Scheduler and Run Queue Objects
+
 Flow Control
 ------------
 
 .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
    :doc: Flow Control
 
+Error and Timeout handling
+--------------------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
+   :doc: Error and Timeout handling
+
 Scheduler Function References
 -----------------------------
 
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 044a8c4875ba..026123497b0e 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -24,28 +24,122 @@ 
 /**
  * DOC: Overview
  *
- * The GPU scheduler provides entities which allow userspace to push jobs
- * into software queues which are then scheduled on a hardware run queue.
- * The software queues have a priority among them. The scheduler selects the entities
- * from the run queue using a FIFO. The scheduler provides dependency handling
- * features among jobs. The driver is supposed to provide callback functions for
- * backend operations to the scheduler like submitting a job to hardware run queue,
- * returning the dependencies of a job etc.
- *
- * The organisation of the scheduler is the following:
- *
- * 1. Each hw run queue has one scheduler
- * 2. Each scheduler has multiple run queues with different priorities
- *    (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
- * 3. Each scheduler run queue has a queue of entities to schedule
- * 4. Entities themselves maintain a queue of jobs that will be scheduled on
- *    the hardware.
- *
- * The jobs in a entity are always scheduled in the order that they were pushed.
- *
- * Note that once a job was taken from the entities queue and pushed to the
- * hardware, i.e. the pending queue, the entity must not be referenced anymore
- * through the jobs entity pointer.
+ * The GPU scheduler implements some logic to decide which command submission
+ * to push next to the hardware. Another major use case of the GPU scheduler
+ * is to enforce correct driver behavior around those command submissions.
+ * Because of this it's also used by drivers which don't need the actual
+ * scheduling functionality.
+ *
+ * All callbacks the driver needs to implement are restricted by DMA-fence
+ * signaling rules to guarantee deadlock free forward progress. This especially
+ * means that for normal operation no memory can be allocated in a callback.
+ * All memory which is needed for pushing the job to the hardware must be
+ * allocated before arming a job. It also means that no locks can be taken
+ * under which memory might be allocated as well.
+ *
+ * Memory which is optional to allocate, for example for device core dumping or
+ * debugging, *must* be allocated with GFP_NOWAIT and appropriate error
+ * handling taken if that allocation fails. GFP_ATOMIC should only be used if
+ * absolutely necessary since dipping into the special atomic reserves is
+ * usually not justified for a GPU driver.
+ */
+
+/**
+ * DOC: Job Object
+ *
+ * The base job object contains submission dependencies in the form of DMA-fence
+ * objects. Drivers can also implement an optional prepare_job callback which
+ * returns additional dependencies as DMA-fence objects. It's important to note
+ * that this callback can't allocate memory or grab locks under which memory is
+ * allocated.
+ *
+ * Drivers should use this as base class for an object which contains the
+ * necessary state to push the command submission to the hardware.
+ *
+ * The lifetime of the job object should at least be from pushing it into the
+ * scheduler until the scheduler notes through the free callback that a job
+ * isn't needed any more. Drivers can of course keep their job object alive
+ * longer than that, but that's outside of the scope of the scheduler
+ * component. Job initialization is split into two parts, drm_sched_job_init()
+ * and drm_sched_job_arm(). It's important to note that after arming a job
+ * drivers must follow the DMA-fence rules and can't easily allocate memory
+ * or take locks under which memory is allocated.
+ */
+
+/**
+ * DOC: Entity Object
+ *
+ * The entity object is a container for jobs which should execute
+ * sequentially. Drivers should create an entity for each individual context
+ * they maintain for command submissions which can run in parallel.
+ *
+ * The lifetime of the entity should *not* exceed the lifetime of the
+ * userspace process it was created for and drivers should call the
+ * drm_sched_entity_flush() function from their file_operations.flush
+ * callback. So it's possible that an entity object is not alive any
+ * more while jobs from it are still running on the hardware.
+ *
+ * Background is that for compatibility reasons with existing
+ * userspace all results of a command submission should become visible
+ * externally even after a process exits. This is normal POSIX behavior
+ * for I/O operations.
+ *
+ * The problem with this approach is that GPU submissions contain executable
+ * shaders enabling processes to evade their termination by offloading work to
+ * the GPU. So when a process is terminated with a SIGKILL the entity object
+ * makes sure that jobs are freed without running them while still maintaining
+ * correct sequential order for signaling fences.
+ */
+
+/**
+ * DOC: Hardware Fence Object
+ *
+ * The hardware fence object is a DMA-fence provided by the driver as a result of
+ * running jobs. Drivers need to make sure that the normal DMA-fence semantics
+ * are followed for this object. It's important to note that the memory for
+ * this object can *not* be allocated in the run_job callback since that would
+ * violate the requirements for the DMA-fence implementation. The scheduler
+ * maintains a timeout handler which triggers if this fence doesn't signal in
+ * a configurable time frame.
+ *
+ * The lifetime of this object follows DMA-fence ref-counting rules; the
+ * scheduler takes ownership of the reference returned by the driver and drops
+ * it when it's not needed any more.
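+ *
+ * A rough sketch of a run_job implementation with an embedded hardware
+ * fence; the my_* structures and helpers are hypothetical, the important
+ * point is that the fence memory exists before drm_sched_job_arm()::
+ *
+ *    static struct dma_fence *my_run_job(struct drm_sched_job *sched_job)
+ *    {
+ *        struct my_job *job = container_of(sched_job, struct my_job, base);
+ *        struct my_ring *ring = job->ring;
+ *
+ *        // hw_fence is embedded in the job, no allocation here.
+ *        dma_fence_init(&job->hw_fence, &my_fence_ops, &ring->fence_lock,
+ *                       ring->fence_context, ++ring->seqno);
+ *        my_ring_emit(ring, job);
+ *        return dma_fence_get(&job->hw_fence);
+ *    }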
+ */
+
+/**
+ * DOC: Scheduler Fence Object
+ *
+ * The scheduler fence object encapsulates the whole time from pushing the
+ * job into the scheduler until the hardware has finished processing it.
+ * This is internally managed by the scheduler, but drivers can grab an
+ * additional reference to it after arming a job. The implementation provides
+ * DMA-fence interfaces for signaling both the scheduling of a command
+ * submission as well as the completion of processing.
+ *
+ * The lifetime of this object also follows normal DMA-fence ref-counting
+ * rules. The finished fence is the one normally exposed outside of the
+ * scheduler, but the driver can grab references to both the scheduled and
+ * the finished fence when needed for pipelining optimizations.
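+ *
+ * A short sketch of grabbing both fences after arming a job; s_fence is part
+ * of &struct drm_sched_job, the surrounding driver code is hypothetical::
+ *
+ *    drm_sched_job_arm(&job->base);
+ *    scheduled = dma_fence_get(&job->base.s_fence->scheduled);
+ *    finished = dma_fence_get(&job->base.s_fence->finished);
+ *    // e.g. install 'finished' as the out-fence of the submission IOCTL
+ *    drm_sched_entity_push_job(&job->base);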
+ */
+
+/**
+ * DOC: Scheduler and Run Queue Objects
+ *
+ * The scheduler object itself does the actual work of selecting a job and
+ * pushing it to the hardware. Both FIFO and RR selection algorithms are
+ * supported, but FIFO is preferred for many use cases.
+ *
+ * The lifetime of the scheduler is managed by the driver using it. Before
+ * destroying the scheduler the driver must ensure that all hardware processing
+ * involving this scheduler object has finished by calling, for example,
+ * disable_irq(). It is *not* sufficient to wait for the hardware fence here
+ * since this doesn't guarantee that all callback processing has finished.
+ *
+ * The run queue object is a container of entities for a certain priority
+ * level. This object is internally managed by the scheduler and drivers
+ * shouldn't touch it directly. The lifetime of a run queue is bound to the
+ * scheduler's lifetime.
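+ *
+ * A teardown sketch, assuming a hypothetical my_ring object which owns the
+ * scheduler and its interrupt line::
+ *
+ *    my_ring_stop(ring);           // nothing reaches the hardware any more
+ *    disable_irq(ring->irq);       // no further hardware fence callbacks
+ *    drm_sched_fini(&ring->sched);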
  */
 
 /**
@@ -72,6 +166,42 @@ 
  * limit.
  */
 
+/**
+ * DOC: Error and Timeout handling
+ *
+ * Errors should be signaled by using dma_fence_set_error() on the hardware
+ * fence object before signaling it. Errors are then bubbled up from the
+ * hardware fence to the scheduler fence.
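+ *
+ * For example, a driver could flag a faulted submission in its fault or
+ * reset handler like this, with hw_fence being the hardware fence of the
+ * affected job and -ECANCELED just an example error code::
+ *
+ *    dma_fence_set_error(hw_fence, -ECANCELED);
+ *    dma_fence_signal(hw_fence);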
+ *
+ * The entity allows querying the error of the last run submission using the
+ * drm_sched_entity_error() function, which can be used to cancel queued
+ * submissions in the run_job callback as well as to prevent pushing further
+ * ones into the entity in the driver's submission function.
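+ *
+ * A sketch of such a check at the start of the run_job callback, turning
+ * queued submissions of a bad entity into errors instead of running them::
+ *
+ *    error = drm_sched_entity_error(sched_job->entity);
+ *    if (error)
+ *        return ERR_PTR(error);    // don't touch the hardware any more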
+ *
+ * When the hardware fence fails to signal in a configurable amount of time,
+ * the timedout_job callback is issued. The driver should then follow the
+ * procedure described for the &struct drm_sched_backend_ops.timedout_job
+ * callback (TODO:
+ * The timeout handler should probably switch to using the hardware fence as
+ * parameter instead of the job. Otherwise the handling will always race
+ * between timing out and signaling the fence).
+ *
+ * The scheduler also used to provide functionality for re-submitting jobs
+ * and replacing the hardware fence during reset handling. This functionality
+ * is now marked as deprecated since it has proven to be fundamentally racy,
+ * is not compatible with the DMA-fence rules and shouldn't be used in any
+ * new code.
+ *
+ * Additionally there is the drm_sched_increase_karma() function, which tries
+ * to find the entity which submitted a job and increases its 'karma'
+ * atomic variable to prevent re-submitting jobs from this entity. This has
+ * quite some overhead, and since re-submitting jobs is now marked as
+ * deprecated, using this function is rather discouraged.
+ *
+ * Drivers can still re-create the GPU state should it be lost during timeout
+ * handling, as long as they can guarantee that forward progress is made and
+ * this doesn't cause another timeout. But this is strongly hardware-specific
+ * and out of the scope of the general GPU scheduler.
+ */
+
 #include <linux/wait.h>
 #include <linux/sched.h>
 #include <linux/completion.h>