
[drm-misc-next] drm/sched: support multiple rings per gpu_scheduler

Message ID 20230809221729.3657-1-dakr@redhat.com (mailing list archive)
State New, archived
Series [drm-misc-next] drm/sched: support multiple rings per gpu_scheduler

Commit Message

Danilo Krummrich Aug. 9, 2023, 10:17 p.m. UTC
With the current mental model every GPU scheduler instance represents
a single HW ring, while every entity represents a software queue feeding
into one or multiple GPU scheduler instances and hence into one or
multiple HW rings.

This does not really scale with firmware schedulers feeding the actual
HW rings, where the driver feeds the firmware scheduler through an
arbitrary number of dynamically created rings, since for each of those
rings a GPU scheduler instance would be required and a separate kthread
would be created.

To overcome this, we can think of the scheduler as having two modes of
operation, single ring mode and multi ring mode. Depending on the mode
of operation, the mental model differs.

Single ring mode (which is the default) keeps the original behaviour of
the scheduler and its entities.

In multi ring mode (enabled by passing the GPU scheduler the
DRM_GPU_SCHEDULER_MODE_MULTI_RING flag) each entity represents a ring,
while the GPU scheduler itself only exists to sort out job dependencies
and actually process the jobs of each entity.

By just applying this different mental model, the required code change
is pretty minimal: in single ring mode, if a job depends on a dma-fence
from the same scheduler instance, we only wait for the dependency to be
scheduled (sched_fence->scheduled); in multi ring mode we fall through
and wait for the dependency to fully complete (sched_fence->finished) in
order to avoid races between separate rings. Or, to pick up the
scheduler's existing terminology, we prevent dependency pipelining.
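
For illustration, the resulting check in drm_sched_entity_add_dependency_cb()
then roughly looks like the sketch below (the actual hunk is further down);
this is a simplified excerpt, not the complete function:

	s_fence = to_drm_sched_fence(fence);
	if (!fence->error && s_fence && s_fence->sched == sched &&
	    !test_bit(DRM_SCHED_FENCE_DONT_PIPELINE, &fence->flags) &&
	    !(sched->flags & DRM_GPU_SCHEDULER_MODE_MULTI_RING)) {
		/*
		 * Single ring mode: the dependency runs on the same ring,
		 * hence waiting for it to be scheduled is sufficient.
		 */
		fence = dma_fence_get(&s_fence->scheduled);
		dma_fence_put(entity->dependency);
		entity->dependency = fence;
		if (!dma_fence_add_callback(fence, &entity->cb,
					    drm_sched_entity_clear_dep))
			return true;

		/* Ignore it when it is already scheduled. */
		dma_fence_put(fence);
	}
	/*
	 * In multi ring mode we fall through and wait for
	 * s_fence->finished like for any foreign fence.
	 */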

Signed-off-by: Danilo Krummrich <dakr@redhat.com>
---
Just before sending out this patch I was made aware of the "DRM Scheduler
changes for XE" [1] patch series.

However, I think bringing this alternative approach into discussion could
be useful.

At first glance it looks like both approaches semantically aim for the
same goal, namely addressing firmware schedulers with an arbitrary number
of software rings to feed from the driver side.

The "XE approach" for that seems to be to make GPU scheduler instances
scalable by submitting jobs to a workqueue rather than creating a kthread
per scheduler instance, such that per software ring a scheduler instance can
be created. Hence the definition of a 1:1 relationship between scheduler and
entity.

In Nouveau we approached it differently: we have just one GPU scheduler
per driver instance, while software rings are represented as entities.
As explained in the commit message, this seems to work fine, as long as
we can teach the scheduler to prevent dependency pipelining.
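
For illustration, with this patch the setup roughly looks like the sketch
below; the names around the per-ring entity (chan->entity, sched_list) are
just placeholders for this example, not the exact Nouveau code:

	/* One scheduler per driver instance, running in multi ring mode. */
	ret = drm_sched_init(&drm->sched, DRM_GPU_SCHEDULER_MODE_MULTI_RING,
			     &nouveau_sched_ops, NOUVEAU_SCHED_HW_SUBMISSIONS,
			     0, job_hang_limit, NULL, NULL, "nouveau_sched",
			     drm->dev->dev);
	if (ret)
		return ret;

	/* One entity per dynamically created software ring (channel). */
	struct drm_gpu_scheduler *sched_list[] = { &drm->sched };

	ret = drm_sched_entity_init(&chan->entity, DRM_SCHED_PRIORITY_NORMAL,
				    sched_list, ARRAY_SIZE(sched_list), NULL);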

As a hack, this basically already works without this patch by just
setting the DRM_SCHED_FENCE_DONT_PIPELINE flag on all out-fences.
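
E.g. something along these lines at submission time, assuming "job" is the
driver's job structure embedding a struct drm_sched_job as "base":

	/*
	 * Mark the out-fence so that other entities on the same scheduler
	 * wait for it to fully complete rather than just to be scheduled.
	 */
	set_bit(DRM_SCHED_FENCE_DONT_PIPELINE,
		&job->base.s_fence->finished.flags);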

[1] https://lore.kernel.org/intel-xe/20230801205103.627779-1-matthew.brost@intel.com/T/#t
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
 drivers/gpu/drm/etnaviv/etnaviv_sched.c    |  2 +-
 drivers/gpu/drm/lima/lima_sched.c          |  2 +-
 drivers/gpu/drm/msm/msm_ringbuffer.c       |  2 +-
 drivers/gpu/drm/nouveau/nouveau_sched.c    |  7 +--
 drivers/gpu/drm/panfrost/panfrost_job.c    |  2 +-
 drivers/gpu/drm/scheduler/sched_entity.c   |  3 +-
 drivers/gpu/drm/scheduler/sched_main.c     | 55 +++++++++++++++-------
 drivers/gpu/drm/v3d/v3d_sched.c            | 10 ++--
 include/drm/gpu_scheduler.h                | 18 ++++++-
 10 files changed, 72 insertions(+), 31 deletions(-)


base-commit: e05f3938c5a7896d09736e3381675a57ffee7a0a

Comments

Matthew Brost Aug. 10, 2023, 4:31 a.m. UTC | #1
On Thu, Aug 10, 2023 at 12:17:23AM +0200, Danilo Krummrich wrote:
> With the current mental model every GPU scheduler instance represents
> a single HW ring, while every entity represents a software queue feeding
> into one or multiple GPU scheduler instances and hence into one or
> multiple HW rings.
> 
> This does not really scale with firmware schedulers feeding the actual
> HW rings, while the driver feeds the firmware scheduler through an
> arbitrary amount of dynamically created rings, since for each of those
> rings a GPU scheduler instance would be required and a separate kthread
> would be created.
> 
> To overcome this we can think of the scheduler having two modes of
> operation, single ring mode and multi ring mode. Depending on the mode
> of operation, the mental model differs.
> 
> Single ring mode (which is the default) keeps the original behaviour of
> the scheduler and its entities.
> 
> In multi ring mode (enabled by passing the GPU scheduler the
> DRM_GPU_SCHEDULER_MODE_MULTI_RING flag) each entity represents a ring,
> while the GPU scheduler itself only exists to sort out job dependencies
> and actually process the jobs of each entity.
> 
> By just applying this different mental model, the required code change
> is pretty minimalistic: While in single ring mode, if a job depends on a
> dma-fence from the same scheduler instance we only wait for the
> dependency to be scheduled (sched_fence->scheduled), in multi ring mode
> we fall through and just wait for the dependency to fully complete
> (sched_fence->finished) in order to avoid races between separate rings.
> Or in other words, to pick up the schedulers existing terminology,
> prevent dependency pipelining.
> 
> Signed-off-by: Danilo Krummrich <dakr@redhat.com>
> ---
> Just before sending out this patch I was made aware of the "DRM Scheduler
> changes for XE" [1] patch series.
> 
> However, I think bringing this alternative approach into discussion could
> be useful.
> 
> From a first glance it looks like that both approaches semantically aim for
> the same goal, namely addressing firmware schedulers with arbitrary amounts
> of software rings to feed from the driver side.
> 
> The "XE approach" for that seems to be to make GPU scheduler instances
> scalable by submitting jobs to a workqueue rather than creating a kthread
> per scheduler instance, such that per software ring a scheduler instance can
> be created. Hence the definition of a 1:1 relationship between scheduler and
> entity.
> 
> In Nouveau we approached it differently, such that we have just one GPU
> scheduler per driver instance, while software rings are represented as
> entities. As explained in the commit message, this seems to work fine, as
> long as we can teach the scheduler to prevent dependency pipelining.
> 
> As a hack this basically already works without this patch by just setting the
> DRM_SCHED_FENCE_DONT_PIPELINE flag for all out-fences.
> 

A few questions.

1. With different rings + a firmware scheduler, how do you deal with the
pending list being in-order while jobs from different rings complete out
of order? The in-order nature of the pending list is required for
free_job being called and for the TDR to work properly.

2. How do you flow control each ring in Nouveau?

3. Do you support per-ring reset? Again, how does this work if the
pending list has jobs from multiple rings?

The answer to all of these in Xe is the 1:1 relationship between
scheduler and entity. In addition to the above questions, the 1:1
relationship allows sleeping in the main worker in the callback
functions, which we use for our preempt fence implementation.

This implementation as-is will not work for Xe. Also, in Xe it is
desirable to have multiple CPUs submitting jobs in parallel too.

Matt

> [1] https://lore.kernel.org/intel-xe/20230801205103.627779-1-matthew.brost@intel.com/T/#t
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>  drivers/gpu/drm/etnaviv/etnaviv_sched.c    |  2 +-
>  drivers/gpu/drm/lima/lima_sched.c          |  2 +-
>  drivers/gpu/drm/msm/msm_ringbuffer.c       |  2 +-
>  drivers/gpu/drm/nouveau/nouveau_sched.c    |  7 +--
>  drivers/gpu/drm/panfrost/panfrost_job.c    |  2 +-
>  drivers/gpu/drm/scheduler/sched_entity.c   |  3 +-
>  drivers/gpu/drm/scheduler/sched_main.c     | 55 +++++++++++++++-------
>  drivers/gpu/drm/v3d/v3d_sched.c            | 10 ++--
>  include/drm/gpu_scheduler.h                | 18 ++++++-
>  10 files changed, 72 insertions(+), 31 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index a2cdde0ca0a7..eed6f56e3957 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2491,7 +2491,7 @@ static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
>  			break;
>  		}
>  
> -		r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
> +		r = drm_sched_init(&ring->sched, 0, &amdgpu_sched_ops,
>  				   ring->num_hw_submission, 0,
>  				   timeout, adev->reset_domain->wq,
>  				   ring->sched_score, ring->name,
> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> index 1ae87dfd19c4..bfe761316ba9 100644
> --- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> +++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> @@ -133,7 +133,7 @@ int etnaviv_sched_init(struct etnaviv_gpu *gpu)
>  {
>  	int ret;
>  
> -	ret = drm_sched_init(&gpu->sched, &etnaviv_sched_ops,
> +	ret = drm_sched_init(&gpu->sched, 0, &etnaviv_sched_ops,
>  			     etnaviv_hw_jobs_limit, etnaviv_job_hang_limit,
>  			     msecs_to_jiffies(500), NULL, NULL,
>  			     dev_name(gpu->dev), gpu->dev);
> diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
> index ffd91a5ee299..f559d7de5658 100644
> --- a/drivers/gpu/drm/lima/lima_sched.c
> +++ b/drivers/gpu/drm/lima/lima_sched.c
> @@ -488,7 +488,7 @@ int lima_sched_pipe_init(struct lima_sched_pipe *pipe, const char *name)
>  
>  	INIT_WORK(&pipe->recover_work, lima_sched_recover_work);
>  
> -	return drm_sched_init(&pipe->base, &lima_sched_ops, 1,
> +	return drm_sched_init(&pipe->base, 0, &lima_sched_ops, 1,
>  			      lima_job_hang_limit,
>  			      msecs_to_jiffies(timeout), NULL,
>  			      NULL, name, pipe->ldev->dev);
> diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.c b/drivers/gpu/drm/msm/msm_ringbuffer.c
> index b60199184409..79fb8c96465d 100644
> --- a/drivers/gpu/drm/msm/msm_ringbuffer.c
> +++ b/drivers/gpu/drm/msm/msm_ringbuffer.c
> @@ -93,7 +93,7 @@ struct msm_ringbuffer *msm_ringbuffer_new(struct msm_gpu *gpu, int id,
>  	 /* currently managing hangcheck ourselves: */
>  	sched_timeout = MAX_SCHEDULE_TIMEOUT;
>  
> -	ret = drm_sched_init(&ring->sched, &msm_sched_ops,
> +	ret = drm_sched_init(&ring->sched, 0, &msm_sched_ops,
>  			num_hw_submissions, 0, sched_timeout,
>  			NULL, NULL, to_msm_bo(ring->bo)->name, gpu->dev->dev);
>  	if (ret) {
> diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
> index 3424a1bf6af3..6528c09fffc9 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_sched.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
> @@ -407,9 +407,10 @@ int nouveau_sched_init(struct nouveau_drm *drm)
>  	if (!drm->sched_wq)
>  		return -ENOMEM;
>  
> -	return drm_sched_init(sched, &nouveau_sched_ops,
> -			      NOUVEAU_SCHED_HW_SUBMISSIONS, 0, job_hang_limit,
> -			      NULL, NULL, "nouveau_sched", drm->dev->dev);
> +	return drm_sched_init(sched, DRM_GPU_SCHEDULER_MODE_MULTI_RING,
> +			      &nouveau_sched_ops, NOUVEAU_SCHED_HW_SUBMISSIONS,
> +			      0, job_hang_limit, NULL, NULL, "nouveau_sched",
> +			      drm->dev->dev);
>  }
>  
>  void nouveau_sched_fini(struct nouveau_drm *drm)
> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
> index dbc597ab46fb..dbbabf63f08f 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> @@ -814,7 +814,7 @@ int panfrost_job_init(struct panfrost_device *pfdev)
>  	for (j = 0; j < NUM_JOB_SLOTS; j++) {
>  		js->queue[j].fence_context = dma_fence_context_alloc(1);
>  
> -		ret = drm_sched_init(&js->queue[j].sched,
> +		ret = drm_sched_init(&js->queue[j].sched, 0,
>  				     &panfrost_sched_ops,
>  				     nentries, 0,
>  				     msecs_to_jiffies(JOB_TIMEOUT_MS),
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index a42763e1429d..6d1e11e6bdcd 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -401,7 +401,8 @@ static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>  
>  	s_fence = to_drm_sched_fence(fence);
>  	if (!fence->error && s_fence && s_fence->sched == sched &&
> -	    !test_bit(DRM_SCHED_FENCE_DONT_PIPELINE, &fence->flags)) {
> +	    !test_bit(DRM_SCHED_FENCE_DONT_PIPELINE, &fence->flags) &&
> +	    !(sched->flags & DRM_GPU_SCHEDULER_MODE_MULTI_RING)) {
>  
>  		/*
>  		 * Fence is from the same scheduler, only need to wait for
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 506371c42745..0c432c1dafdf 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -24,28 +24,50 @@
>  /**
>   * DOC: Overview
>   *
> - * The GPU scheduler provides entities which allow userspace to push jobs
> - * into software queues which are then scheduled on a hardware run queue.
> - * The software queues have a priority among them. The scheduler selects the entities
> - * from the run queue using a FIFO. The scheduler provides dependency handling
> - * features among jobs. The driver is supposed to provide callback functions for
> - * backend operations to the scheduler like submitting a job to hardware run queue,
> - * returning the dependencies of a job etc.
> + * The GPU scheduler is mainly structured into the scheduler itself
> + * (&drm_gpu_scheduler), scheduler entities (&drm_sched_entity) and scheduler
> + * jobs (&drm_sched_job).
>   *
> - * The organisation of the scheduler is the following:
> + * Each &drm_gpu_scheduler has different priority run queues (e.g. HIGH_HW,
> + * HIGH_SW, KERNEL, NORMAL), which themselves contain a list of &drm_sched_entity
> + * objects, while each &drm_sched_entity maintains a queue of &drm_sched_jobs.
>   *
> - * 1. Each hw run queue has one scheduler
> - * 2. Each scheduler has multiple run queues with different priorities
> - *    (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
> - * 3. Each scheduler run queue has a queue of entities to schedule
> - * 4. Entities themselves maintain a queue of jobs that will be scheduled on
> - *    the hardware.
> + * There are two modes of operation, single ring mode and multi ring mode.
> + * Depending on the mode of operation, the mental model of what the named
> + * structures represent differs.
>   *
> - * The jobs in a entity are always scheduled in the order that they were pushed.
> + * In single ring mode (which is the default) every &drm_gpu_scheduler
> + * instance represents a single HW ring, while every &drm_sched_entity
> + * represents a software queue feeding into one or multiple &drm_gpu_scheduler
> + * instances and hence into one or multiple HW rings.
> + *
> + * Single ring mode may be used when the GPU has a fixed amount of HW rings
> + * which can be directly fed by the driver.
> + *
> + * In multi ring mode (enabled by passing the &drm_gpu_scheduler the
> + * &DRM_GPU_SCHEDULER_MODE_MULTI_RING flag) each &drm_sched_entity represents a
> + * HW ring, while the &drm_gpu_scheduler itself only exists to sort out job
> + * dependencies and actually process the jobs of each &drm_sched_entity.
> + *
> + * Multi ring mode may be used when the GPU has a firmware scheduler feeding the
> + * actual HW rings, while the driver feeds the firmware scheduler through an
> + * arbitrary amount of dynamically created rings.
> + *
> + * While one or the other mental model could be applied without setting the ring
> + * mode through a flag, the scheduler needs to know the mode of operation in
> + * order to be able to make correct decisions when it comes to handling job
> + * dependencies.
> + *
> + * Independent of the mode of operation, jobs within an entity are always
> + * scheduled in the order in which they were submitted.
>   *
>   * Note that once a job was taken from the entities queue and pushed to the
>   * hardware, i.e. the pending queue, the entity must not be referenced anymore
>   * through the jobs entity pointer.
> + *
> + * In order for the scheduler to actually prepare, process or free a job once it
> + * is completed, the driver is supposed to provide the corresponding callback
> + * functions (&drm_sched_backend_ops) to the scheduler.
>   */
>  
>  #include <linux/kthread.h>
> @@ -1082,7 +1104,7 @@ static int drm_sched_main(void *param)
>   *
>   * Return 0 on success, otherwise error code.
>   */
> -int drm_sched_init(struct drm_gpu_scheduler *sched,
> +int drm_sched_init(struct drm_gpu_scheduler *sched, unsigned int flags,
>  		   const struct drm_sched_backend_ops *ops,
>  		   unsigned hw_submission, unsigned hang_limit,
>  		   long timeout, struct workqueue_struct *timeout_wq,
> @@ -1097,6 +1119,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>  	sched->hang_limit = hang_limit;
>  	sched->score = score ? score : &sched->_score;
>  	sched->dev = dev;
> +	sched->flags = flags;
>  	for (i = DRM_SCHED_PRIORITY_MIN; i < DRM_SCHED_PRIORITY_COUNT; i++)
>  		drm_sched_rq_init(sched, &sched->sched_rq[i]);
>  
> diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
> index 06238e6d7f5c..4ba561014a5d 100644
> --- a/drivers/gpu/drm/v3d/v3d_sched.c
> +++ b/drivers/gpu/drm/v3d/v3d_sched.c
> @@ -387,7 +387,7 @@ v3d_sched_init(struct v3d_dev *v3d)
>  	int hang_limit_ms = 500;
>  	int ret;
>  
> -	ret = drm_sched_init(&v3d->queue[V3D_BIN].sched,
> +	ret = drm_sched_init(&v3d->queue[V3D_BIN].sched, 0,
>  			     &v3d_bin_sched_ops,
>  			     hw_jobs_limit, job_hang_limit,
>  			     msecs_to_jiffies(hang_limit_ms), NULL,
> @@ -395,7 +395,7 @@ v3d_sched_init(struct v3d_dev *v3d)
>  	if (ret)
>  		return ret;
>  
> -	ret = drm_sched_init(&v3d->queue[V3D_RENDER].sched,
> +	ret = drm_sched_init(&v3d->queue[V3D_RENDER].sched, 0,
>  			     &v3d_render_sched_ops,
>  			     hw_jobs_limit, job_hang_limit,
>  			     msecs_to_jiffies(hang_limit_ms), NULL,
> @@ -403,7 +403,7 @@ v3d_sched_init(struct v3d_dev *v3d)
>  	if (ret)
>  		goto fail;
>  
> -	ret = drm_sched_init(&v3d->queue[V3D_TFU].sched,
> +	ret = drm_sched_init(&v3d->queue[V3D_TFU].sched, 0,
>  			     &v3d_tfu_sched_ops,
>  			     hw_jobs_limit, job_hang_limit,
>  			     msecs_to_jiffies(hang_limit_ms), NULL,
> @@ -412,7 +412,7 @@ v3d_sched_init(struct v3d_dev *v3d)
>  		goto fail;
>  
>  	if (v3d_has_csd(v3d)) {
> -		ret = drm_sched_init(&v3d->queue[V3D_CSD].sched,
> +		ret = drm_sched_init(&v3d->queue[V3D_CSD].sched, 0,
>  				     &v3d_csd_sched_ops,
>  				     hw_jobs_limit, job_hang_limit,
>  				     msecs_to_jiffies(hang_limit_ms), NULL,
> @@ -420,7 +420,7 @@ v3d_sched_init(struct v3d_dev *v3d)
>  		if (ret)
>  			goto fail;
>  
> -		ret = drm_sched_init(&v3d->queue[V3D_CACHE_CLEAN].sched,
> +		ret = drm_sched_init(&v3d->queue[V3D_CACHE_CLEAN].sched, 0,
>  				     &v3d_cache_clean_sched_ops,
>  				     hw_jobs_limit, job_hang_limit,
>  				     msecs_to_jiffies(hang_limit_ms), NULL,
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index f9544d9b670d..afe29196cc7e 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -465,6 +465,20 @@ struct drm_sched_backend_ops {
>  	void (*free_job)(struct drm_sched_job *sched_job);
>  };
>  
> +/**
> + * DRM_GPU_SCHEDULER_MODE_MULTI_RING - Run the scheduler in multi ring mode
> + *
> + * The &drm_gpu_scheduler typically represents one HW ring with
> + * &drm_sched_entities feeding into one or multiple scheduler instances.
> + *
> + * In multi ring mode it is assumed there is a single &drm_gpu_scheduler
> + * instance only, where &drm_sched_entities represent rings instead.
> + *
> + * This is useful for GPUs where the &drm_gpu_scheduler should feed a firmware
> + * scheduler with an arbitrary amount of rings.
> + */
> +#define DRM_GPU_SCHEDULER_MODE_MULTI_RING	(1 << 0)
> +
>  /**
>   * struct drm_gpu_scheduler - scheduler instance-specific data
>   *
> @@ -493,6 +507,7 @@ struct drm_sched_backend_ops {
>   * @ready: marks if the underlying HW is ready to work
>   * @free_guilty: A hit to time out handler to free the guilty job.
>   * @dev: system &struct device
> + * @flags: the flags for the &drm_gpu_scheduler
>   *
>   * One scheduler is implemented for each hardware ring.
>   */
> @@ -517,9 +532,10 @@ struct drm_gpu_scheduler {
>  	bool				ready;
>  	bool				free_guilty;
>  	struct device			*dev;
> +	unsigned long			flags;
>  };
>  
> -int drm_sched_init(struct drm_gpu_scheduler *sched,
> +int drm_sched_init(struct drm_gpu_scheduler *sched, unsigned int flags,
>  		   const struct drm_sched_backend_ops *ops,
>  		   uint32_t hw_submission, unsigned hang_limit,
>  		   long timeout, struct workqueue_struct *timeout_wq,
> 
> base-commit: e05f3938c5a7896d09736e3381675a57ffee7a0a
> -- 
> 2.41.0
>
Christian König Aug. 10, 2023, 6:34 a.m. UTC | #2
On 10.08.23 at 00:17, Danilo Krummrich wrote:
> With the current mental model every GPU scheduler instance represents
> a single HW ring, while every entity represents a software queue feeding
> into one or multiple GPU scheduler instances and hence into one or
> multiple HW rings.
>
> This does not really scale with firmware schedulers feeding the actual
> HW rings, while the driver feeds the firmware scheduler through an
> arbitrary amount of dynamically created rings, since for each of those
> rings a GPU scheduler instance would be required and a separate kthread
> would be created.
>
> To overcome this we can think of the scheduler having two modes of
> operation, single ring mode and multi ring mode. Depending on the mode
> of operation, the mental model differs.
>
> Single ring mode (which is the default) keeps the original behaviour of
> the scheduler and its entities.
>
> In multi ring mode (enabled by passing the GPU scheduler the
> DRM_GPU_SCHEDULER_MODE_MULTI_RING flag) each entity represents a ring,
> while the GPU scheduler itself only exists to sort out job dependencies
> and actually process the jobs of each entity.
>
> By just applying this different mental model, the required code change
> is pretty minimalistic: While in single ring mode, if a job depends on a
> dma-fence from the same scheduler instance we only wait for the
> dependency to be scheduled (sched_fence->scheduled), in multi ring mode
> we fall through and just wait for the dependency to fully complete
> (sched_fence->finished) in order to avoid races between separate rings.
> Or in other words, to pick up the schedulers existing terminology,
> prevent dependency pipelining.
>
> Signed-off-by: Danilo Krummrich <dakr@redhat.com>
> ---
> Just before sending out this patch I was made aware of the "DRM Scheduler
> changes for XE" [1] patch series.
>
> However, I think bringing this alternative approach into discussion could
> be useful.
>
>  From a first glance it looks like that both approaches semantically aim for
> the same goal, namely addressing firmware schedulers with arbitrary amounts
> of software rings to feed from the driver side.
>
> The "XE approach" for that seems to be to make GPU scheduler instances
> scalable by submitting jobs to a workqueue rather than creating a kthread
> per scheduler instance, such that per software ring a scheduler instance can
> be created. Hence the definition of a 1:1 relationship between scheduler and
> entity.
>
> In Nouveau we approached it differently, such that we have just one GPU
> scheduler per driver instance, while software rings are represented as
> entities. As explained in the commit message, this seems to work fine, as
> long as we can teach the scheduler to prevent dependency pipelining.
>
> As a hack this basically already works without this patch by just setting the
> DRM_SCHED_FENCE_DONT_PIPELINE flag for all out-fences.

I discussed that with other AMD folks in an internal meeting when the
first Xe changes came up, but we quickly abandoned it.

The scheduler depends on the sequential execution of submissions;
otherwise we can't guarantee the sequential signaling of scheduler
fences, and that in turn is a really big NO-GO for dma_fences.

You avoid most of the problems by setting the 
DRM_SCHED_FENCE_DONT_PIPELINE flag, but this is still something very flaky.

In addition to that, the scheduler would essentially still only push a
limited number of submissions to the HW rings, which makes the hardware
scheduler pretty much useless.
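
(For reference, the throttling in question is the per-scheduler
drm_sched_ready() check, roughly the following, so all entities of one
scheduler instance share a single hw_submission budget.)

	static bool drm_sched_ready(struct drm_gpu_scheduler *sched)
	{
		return atomic_read(&sched->hw_rq_count) <
			sched->hw_submission_limit;
	}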

Christian.

Danilo Krummrich Aug. 10, 2023, 3:06 p.m. UTC | #3
On 8/10/23 06:31, Matthew Brost wrote:
> On Thu, Aug 10, 2023 at 12:17:23AM +0200, Danilo Krummrich wrote:
>> With the current mental model every GPU scheduler instance represents
>> a single HW ring, while every entity represents a software queue feeding
>> into one or multiple GPU scheduler instances and hence into one or
>> multiple HW rings.
>>
>> This does not really scale with firmware schedulers feeding the actual
>> HW rings, while the driver feeds the firmware scheduler through an
>> arbitrary amount of dynamically created rings, since for each of those
>> rings a GPU scheduler instance would be required and a separate kthread
>> would be created.
>>
>> To overcome this we can think of the scheduler having two modes of
>> operation, single ring mode and multi ring mode. Depending on the mode
>> of operation, the mental model differs.
>>
>> Single ring mode (which is the default) keeps the original behaviour of
>> the scheduler and its entities.
>>
>> In multi ring mode (enabled by passing the GPU scheduler the
>> DRM_GPU_SCHEDULER_MODE_MULTI_RING flag) each entity represents a ring,
>> while the GPU scheduler itself only exists to sort out job dependencies
>> and actually process the jobs of each entity.
>>
>> By just applying this different mental model, the required code change
>> is pretty minimalistic: While in single ring mode, if a job depends on a
>> dma-fence from the same scheduler instance we only wait for the
>> dependency to be scheduled (sched_fence->scheduled), in multi ring mode
>> we fall through and just wait for the dependency to fully complete
>> (sched_fence->finished) in order to avoid races between separate rings.
>> Or in other words, to pick up the schedulers existing terminology,
>> prevent dependency pipelining.
>>
>> Signed-off-by: Danilo Krummrich <dakr@redhat.com>
>> ---
>> Just before sending out this patch I was made aware of the "DRM Scheduler
>> changes for XE" [1] patch series.
>>
>> However, I think bringing this alternative approach into discussion could
>> be useful.
>>
>>  From a first glance it looks like that both approaches semantically aim for
>> the same goal, namely addressing firmware schedulers with arbitrary amounts
>> of software rings to feed from the driver side.
>>
>> The "XE approach" for that seems to be to make GPU scheduler instances
>> scalable by submitting jobs to a workqueue rather than creating a kthread
>> per scheduler instance, such that per software ring a scheduler instance can
>> be created. Hence the definition of a 1:1 relationship between scheduler and
>> entity.
>>
>> In Nouveau we approached it differently, such that we have just one GPU
>> scheduler per driver instance, while software rings are represented as
>> entities. As explained in the commit message, this seems to work fine, as
>> long as we can teach the scheduler to prevent dependency pipelining.
>>
>> As a hack this basically already works without this patch by just setting the
>> DRM_SCHED_FENCE_DONT_PIPELINE flag for all out-fences.
>>
>

Hi Matt,

> A few questions.
> 
> 1. With different rings + a firmware scheduler how do you do with
> pending list being in-order and jobs from different rings completing of
> order. The pending list in order nature is required for free_job being
> called and the TDR to work properly.
> 
> 2. How do you flow control each ring in Nouveau?
> 
> 3. Do you support per ring reset? Again how does this work if the
> pending list has jobs from multiple rings?

Right, all of that requires more changes. And clearly, this would fork
multi ring mode off too far from single ring mode.

Hence I agree, considering all that, keeping a drm_gpu_scheduler 
instance per ring is clearly advisable.

- Danilo

> 
> The answer to all of these in Xe is solved by the 1 to 1 relationship
> between scheduler and entity. In addition to the above questions the 1
> to 1 relationship in allows sleeping in the main worker in the callback
> functions which we use our preempt fence implementation.
> 
> This implementation as is will not for Xe. Also in Xe it is desirable to
> have multiple CPUs to be submiting jobs in parallel too.
> 
> Matt
Danilo Krummrich Aug. 10, 2023, 3:25 p.m. UTC | #4
On 8/10/23 08:34, Christian König wrote:
> Am 10.08.23 um 00:17 schrieb Danilo Krummrich:
>> With the current mental model every GPU scheduler instance represents
>> a single HW ring, while every entity represents a software queue feeding
>> into one or multiple GPU scheduler instances and hence into one or
>> multiple HW rings.
>>
>> This does not really scale with firmware schedulers feeding the actual
>> HW rings, while the driver feeds the firmware scheduler through an
>> arbitrary amount of dynamically created rings, since for each of those
>> rings a GPU scheduler instance would be required and a separate kthread
>> would be created.
>>
>> To overcome this we can think of the scheduler having two modes of
>> operation, single ring mode and multi ring mode. Depending on the mode
>> of operation, the mental model differs.
>>
>> Single ring mode (which is the default) keeps the original behaviour of
>> the scheduler and its entities.
>>
>> In multi ring mode (enabled by passing the GPU scheduler the
>> DRM_GPU_SCHEDULER_MODE_MULTI_RING flag) each entity represents a ring,
>> while the GPU scheduler itself only exists to sort out job dependencies
>> and actually process the jobs of each entity.
>>
>> By just applying this different mental model, the required code change
>> is pretty minimalistic: While in single ring mode, if a job depends on a
>> dma-fence from the same scheduler instance we only wait for the
>> dependency to be scheduled (sched_fence->scheduled), in multi ring mode
>> we fall through and just wait for the dependency to fully complete
>> (sched_fence->finished) in order to avoid races between separate rings.
>> Or in other words, to pick up the schedulers existing terminology,
>> prevent dependency pipelining.
>>
>> Signed-off-by: Danilo Krummrich <dakr@redhat.com>
>> ---
>> Just before sending out this patch I was made aware of the "DRM Scheduler
>> changes for XE" [1] patch series.
>>
>> However, I think bringing this alternative approach into discussion could
>> be useful.
>>
>>  From a first glance it looks like that both approaches semantically 
>> aim for
>> the same goal, namely addressing firmware schedulers with arbitrary 
>> amounts
>> of software rings to feed from the driver side.
>>
>> The "XE approach" for that seems to be to make GPU scheduler instances
>> scalable by submitting jobs to a workqueue rather than creating a kthread
>> per scheduler instance, such that per software ring a scheduler 
>> instance can
>> be created. Hence the definition of a 1:1 relationship between 
>> scheduler and
>> entity.
>>
>> In Nouveau we approached it differently, such that we have just one GPU
>> scheduler per driver instance, while software rings are represented as
>> entities. As explained in the commit message, this seems to work fine, as
>> long as we can teach the scheduler to prevent dependency pipelining.
>>
>> As a hack this basically already works without this patch by just 
>> setting the
>> DRM_SCHED_FENCE_DONT_PIPELINE flag for all out-fences.
> 
> I discussed that with other AMD folks in an internal meeting when the
> first Xe changes came up, but we quickly abandoned the idea.
> 
> The scheduler depends on the sequential execution of submissions,
> otherwise we can't guarantee the sequential signaling of scheduler
> fences, and that in turn is a really big NO-GO for dma_fences.
> 
> You avoid most of the problems by setting the
> DRM_SCHED_FENCE_DONT_PIPELINE flag, but this approach is still very flaky.
> 
> In addition to that, the scheduler would essentially still only push a
> limited number of submissions to the hw rings, which makes the hardware
> scheduler pretty much useless.

I agree; that is also what Matt pointed out in his reply, besides some
other limitations. It looks like we really want to keep the one GPU
scheduler per ring relationship, unless we rework the whole scheduler
design entirely.

Thank you for clarifying.

> 
> Christian.
> 
>>
>> [1] 
>> https://lore.kernel.org/intel-xe/20230801205103.627779-1-matthew.brost@intel.com/T/#t
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  2 +-
>>   drivers/gpu/drm/etnaviv/etnaviv_sched.c    |  2 +-
>>   drivers/gpu/drm/lima/lima_sched.c          |  2 +-
>>   drivers/gpu/drm/msm/msm_ringbuffer.c       |  2 +-
>>   drivers/gpu/drm/nouveau/nouveau_sched.c    |  7 +--
>>   drivers/gpu/drm/panfrost/panfrost_job.c    |  2 +-
>>   drivers/gpu/drm/scheduler/sched_entity.c   |  3 +-
>>   drivers/gpu/drm/scheduler/sched_main.c     | 55 +++++++++++++++-------
>>   drivers/gpu/drm/v3d/v3d_sched.c            | 10 ++--
>>   include/drm/gpu_scheduler.h                | 18 ++++++-
>>   10 files changed, 72 insertions(+), 31 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index a2cdde0ca0a7..eed6f56e3957 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -2491,7 +2491,7 @@ static int amdgpu_device_init_schedulers(struct 
>> amdgpu_device *adev)
>>               break;
>>           }
>> -        r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
>> +        r = drm_sched_init(&ring->sched, 0, &amdgpu_sched_ops,
>>                      ring->num_hw_submission, 0,
>>                      timeout, adev->reset_domain->wq,
>>                      ring->sched_score, ring->name,
>> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c 
>> b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
>> index 1ae87dfd19c4..bfe761316ba9 100644
>> --- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
>> +++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
>> @@ -133,7 +133,7 @@ int etnaviv_sched_init(struct etnaviv_gpu *gpu)
>>   {
>>       int ret;
>> -    ret = drm_sched_init(&gpu->sched, &etnaviv_sched_ops,
>> +    ret = drm_sched_init(&gpu->sched, 0, &etnaviv_sched_ops,
>>                    etnaviv_hw_jobs_limit, etnaviv_job_hang_limit,
>>                    msecs_to_jiffies(500), NULL, NULL,
>>                    dev_name(gpu->dev), gpu->dev);
>> diff --git a/drivers/gpu/drm/lima/lima_sched.c 
>> b/drivers/gpu/drm/lima/lima_sched.c
>> index ffd91a5ee299..f559d7de5658 100644
>> --- a/drivers/gpu/drm/lima/lima_sched.c
>> +++ b/drivers/gpu/drm/lima/lima_sched.c
>> @@ -488,7 +488,7 @@ int lima_sched_pipe_init(struct lima_sched_pipe 
>> *pipe, const char *name)
>>       INIT_WORK(&pipe->recover_work, lima_sched_recover_work);
>> -    return drm_sched_init(&pipe->base, &lima_sched_ops, 1,
>> +    return drm_sched_init(&pipe->base, 0, &lima_sched_ops, 1,
>>                     lima_job_hang_limit,
>>                     msecs_to_jiffies(timeout), NULL,
>>                     NULL, name, pipe->ldev->dev);
>> diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.c 
>> b/drivers/gpu/drm/msm/msm_ringbuffer.c
>> index b60199184409..79fb8c96465d 100644
>> --- a/drivers/gpu/drm/msm/msm_ringbuffer.c
>> +++ b/drivers/gpu/drm/msm/msm_ringbuffer.c
>> @@ -93,7 +93,7 @@ struct msm_ringbuffer *msm_ringbuffer_new(struct 
>> msm_gpu *gpu, int id,
>>        /* currently managing hangcheck ourselves: */
>>       sched_timeout = MAX_SCHEDULE_TIMEOUT;
>> -    ret = drm_sched_init(&ring->sched, &msm_sched_ops,
>> +    ret = drm_sched_init(&ring->sched, 0, &msm_sched_ops,
>>               num_hw_submissions, 0, sched_timeout,
>>               NULL, NULL, to_msm_bo(ring->bo)->name, gpu->dev->dev);
>>       if (ret) {
>> diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c 
>> b/drivers/gpu/drm/nouveau/nouveau_sched.c
>> index 3424a1bf6af3..6528c09fffc9 100644
>> --- a/drivers/gpu/drm/nouveau/nouveau_sched.c
>> +++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
>> @@ -407,9 +407,10 @@ int nouveau_sched_init(struct nouveau_drm *drm)
>>       if (!drm->sched_wq)
>>           return -ENOMEM;
>> -    return drm_sched_init(sched, &nouveau_sched_ops,
>> -                  NOUVEAU_SCHED_HW_SUBMISSIONS, 0, job_hang_limit,
>> -                  NULL, NULL, "nouveau_sched", drm->dev->dev);
>> +    return drm_sched_init(sched, DRM_GPU_SCHEDULER_MODE_MULTI_RING,
>> +                  &nouveau_sched_ops, NOUVEAU_SCHED_HW_SUBMISSIONS,
>> +                  0, job_hang_limit, NULL, NULL, "nouveau_sched",
>> +                  drm->dev->dev);
>>   }
>>   void nouveau_sched_fini(struct nouveau_drm *drm)
>> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c 
>> b/drivers/gpu/drm/panfrost/panfrost_job.c
>> index dbc597ab46fb..dbbabf63f08f 100644
>> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
>> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
>> @@ -814,7 +814,7 @@ int panfrost_job_init(struct panfrost_device *pfdev)
>>       for (j = 0; j < NUM_JOB_SLOTS; j++) {
>>           js->queue[j].fence_context = dma_fence_context_alloc(1);
>> -        ret = drm_sched_init(&js->queue[j].sched,
>> +        ret = drm_sched_init(&js->queue[j].sched, 0,
>>                        &panfrost_sched_ops,
>>                        nentries, 0,
>>                        msecs_to_jiffies(JOB_TIMEOUT_MS),
>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c 
>> b/drivers/gpu/drm/scheduler/sched_entity.c
>> index a42763e1429d..6d1e11e6bdcd 100644
>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>> @@ -401,7 +401,8 @@ static bool 
>> drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
>>       s_fence = to_drm_sched_fence(fence);
>>       if (!fence->error && s_fence && s_fence->sched == sched &&
>> -        !test_bit(DRM_SCHED_FENCE_DONT_PIPELINE, &fence->flags)) {
>> +        !test_bit(DRM_SCHED_FENCE_DONT_PIPELINE, &fence->flags) &&
>> +        !(sched->flags & DRM_GPU_SCHEDULER_MODE_MULTI_RING)) {
>>           /*
>>            * Fence is from the same scheduler, only need to wait for
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>> b/drivers/gpu/drm/scheduler/sched_main.c
>> index 506371c42745..0c432c1dafdf 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -24,28 +24,50 @@
>>   /**
>>    * DOC: Overview
>>    *
>> - * The GPU scheduler provides entities which allow userspace to push 
>> jobs
>> - * into software queues which are then scheduled on a hardware run 
>> queue.
>> - * The software queues have a priority among them. The scheduler 
>> selects the entities
>> - * from the run queue using a FIFO. The scheduler provides dependency 
>> handling
>> - * features among jobs. The driver is supposed to provide callback 
>> functions for
>> - * backend operations to the scheduler like submitting a job to 
>> hardware run queue,
>> - * returning the dependencies of a job etc.
>> + * The GPU scheduler is mainly structured into the scheduler itself
>> + * (&drm_gpu_scheduler), scheduler entities (&drm_sched_entity) and 
>> scheduler
>> + * jobs (&drm_sched_job).
>>    *
>> - * The organisation of the scheduler is the following:
>> + * Each &drm_gpu_scheduler has different priority run queues (e.g. 
>> HIGH_HW,
>> + * HIGH_SW, KERNEL, NORMAL), which themselves contain a list of 
>> &drm_sched_entity
>> + * objects, while each &drm_sched_entity maintains a queue of 
>> &drm_sched_jobs.
>>    *
>> - * 1. Each hw run queue has one scheduler
>> - * 2. Each scheduler has multiple run queues with different priorities
>> - *    (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
>> - * 3. Each scheduler run queue has a queue of entities to schedule
>> - * 4. Entities themselves maintain a queue of jobs that will be 
>> scheduled on
>> - *    the hardware.
>> + * There are two modes of operation, single ring mode and multi ring 
>> mode.
>> + * Depending on the mode of operation, the mental model of what the 
>> named
>> + * structures represent differs.
>>    *
>> - * The jobs in a entity are always scheduled in the order that they 
>> were pushed.
>> + * In single ring mode (which is the default) every &drm_gpu_scheduler
>> + * instance represents a single HW ring, while every &drm_sched_entity
>> + * represents a software queue feeding into one or multiple 
>> &drm_gpu_scheduler
>> + * instances and hence into one or multiple HW rings.
>> + *
>> + * Single ring mode may be used when the GPU has a fixed amount of HW 
>> rings
>> + * which can be directly fed by the driver.
>> + *
>> + * In multi ring mode (enabled by passing the &drm_gpu_scheduler the
>> + * &DRM_GPU_SCHEDULER_MODE_MULTI_RING flag) each &drm_sched_entity 
>> represents a
>> + * HW ring, while the &drm_gpu_scheduler itself only exists to sort 
>> out job
>> + * dependencies and actually process the jobs of each &drm_sched_entity.
>> + *
>> + * Multi ring mode may be used when the GPU has a firmware scheduler 
>> feeding the
>> + * actual HW rings, while the driver feeds the firmware scheduler 
>> through an
>> + * arbitrary amount of dynamically created rings.
>> + *
>> + * While one or the other mental model could be applied without 
>> setting the ring
>> + * mode through a flag, the scheduler needs to know the mode of 
>> operation in
>> + * order to be able to make correct decisions when it comes to 
>> handling job
>> + * dependencies.
>> + *
>> + * Independent of the mode of operation jobs within an entity are always
>> + * scheduled in the order in which they were submitted.
>>    *
>>    * Note that once a job was taken from the entities queue and pushed 
>> to the
>>    * hardware, i.e. the pending queue, the entity must not be 
>> referenced anymore
>>    * through the jobs entity pointer.
>> + *
>> + * In order for the scheduler to actually prepare, process or free a 
>> job once it
>> + * is completed, the driver is supposed to provide the corresponding 
>> callback
>> + * functions (&drm_sched_backend_ops) to the scheduler.
>>    */
>>   #include <linux/kthread.h>
>> @@ -1082,7 +1104,7 @@ static int drm_sched_main(void *param)
>>    *
>>    * Return 0 on success, otherwise error code.
>>    */
>> -int drm_sched_init(struct drm_gpu_scheduler *sched,
>> +int drm_sched_init(struct drm_gpu_scheduler *sched, unsigned int flags,
>>              const struct drm_sched_backend_ops *ops,
>>              unsigned hw_submission, unsigned hang_limit,
>>              long timeout, struct workqueue_struct *timeout_wq,
>> @@ -1097,6 +1119,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>>       sched->hang_limit = hang_limit;
>>       sched->score = score ? score : &sched->_score;
>>       sched->dev = dev;
>> +    sched->flags = flags;
>>       for (i = DRM_SCHED_PRIORITY_MIN; i < DRM_SCHED_PRIORITY_COUNT; i++)
>>           drm_sched_rq_init(sched, &sched->sched_rq[i]);
>> diff --git a/drivers/gpu/drm/v3d/v3d_sched.c 
>> b/drivers/gpu/drm/v3d/v3d_sched.c
>> index 06238e6d7f5c..4ba561014a5d 100644
>> --- a/drivers/gpu/drm/v3d/v3d_sched.c
>> +++ b/drivers/gpu/drm/v3d/v3d_sched.c
>> @@ -387,7 +387,7 @@ v3d_sched_init(struct v3d_dev *v3d)
>>       int hang_limit_ms = 500;
>>       int ret;
>> -    ret = drm_sched_init(&v3d->queue[V3D_BIN].sched,
>> +    ret = drm_sched_init(&v3d->queue[V3D_BIN].sched, 0,
>>                    &v3d_bin_sched_ops,
>>                    hw_jobs_limit, job_hang_limit,
>>                    msecs_to_jiffies(hang_limit_ms), NULL,
>> @@ -395,7 +395,7 @@ v3d_sched_init(struct v3d_dev *v3d)
>>       if (ret)
>>           return ret;
>> -    ret = drm_sched_init(&v3d->queue[V3D_RENDER].sched,
>> +    ret = drm_sched_init(&v3d->queue[V3D_RENDER].sched, 0,
>>                    &v3d_render_sched_ops,
>>                    hw_jobs_limit, job_hang_limit,
>>                    msecs_to_jiffies(hang_limit_ms), NULL,
>> @@ -403,7 +403,7 @@ v3d_sched_init(struct v3d_dev *v3d)
>>       if (ret)
>>           goto fail;
>> -    ret = drm_sched_init(&v3d->queue[V3D_TFU].sched,
>> +    ret = drm_sched_init(&v3d->queue[V3D_TFU].sched, 0,
>>                    &v3d_tfu_sched_ops,
>>                    hw_jobs_limit, job_hang_limit,
>>                    msecs_to_jiffies(hang_limit_ms), NULL,
>> @@ -412,7 +412,7 @@ v3d_sched_init(struct v3d_dev *v3d)
>>           goto fail;
>>       if (v3d_has_csd(v3d)) {
>> -        ret = drm_sched_init(&v3d->queue[V3D_CSD].sched,
>> +        ret = drm_sched_init(&v3d->queue[V3D_CSD].sched, 0,
>>                        &v3d_csd_sched_ops,
>>                        hw_jobs_limit, job_hang_limit,
>>                        msecs_to_jiffies(hang_limit_ms), NULL,
>> @@ -420,7 +420,7 @@ v3d_sched_init(struct v3d_dev *v3d)
>>           if (ret)
>>               goto fail;
>> -        ret = drm_sched_init(&v3d->queue[V3D_CACHE_CLEAN].sched,
>> +        ret = drm_sched_init(&v3d->queue[V3D_CACHE_CLEAN].sched, 0,
>>                        &v3d_cache_clean_sched_ops,
>>                        hw_jobs_limit, job_hang_limit,
>>                        msecs_to_jiffies(hang_limit_ms), NULL,
>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
>> index f9544d9b670d..afe29196cc7e 100644
>> --- a/include/drm/gpu_scheduler.h
>> +++ b/include/drm/gpu_scheduler.h
>> @@ -465,6 +465,20 @@ struct drm_sched_backend_ops {
>>       void (*free_job)(struct drm_sched_job *sched_job);
>>   };
>> +/**
>> + * DRM_GPU_SCHEDULER_MODE_MULTI_RING - Run the scheduler in multi 
>> ring mode
>> + *
>> + * The &drm_gpu_scheduler typically represents one HW ring with
>> + * &drm_sched_entities feeding into one or multiple scheduler instances.
>> + *
>> + * In multi ring mode it is assumed there is a single &drm_gpu_scheduler
>> + * instance only, where &drm_sched_entities represent rings instead.
>> + *
>> + * This is useful for GPUs where the &drm_gpu_scheduler should feed a 
>> firmware
>> + * scheduler with an arbitrary amount of rings.
>> + */
>> +#define DRM_GPU_SCHEDULER_MODE_MULTI_RING    (1 << 0)
>> +
>>   /**
>>    * struct drm_gpu_scheduler - scheduler instance-specific data
>>    *
>> @@ -493,6 +507,7 @@ struct drm_sched_backend_ops {
>>    * @ready: marks if the underlying HW is ready to work
>>    * @free_guilty: A hit to time out handler to free the guilty job.
>>    * @dev: system &struct device
>> + * @flags: the flags for the &drm_gpu_scheduler
>>    *
>>    * One scheduler is implemented for each hardware ring.
>>    */
>> @@ -517,9 +532,10 @@ struct drm_gpu_scheduler {
>>       bool                ready;
>>       bool                free_guilty;
>>       struct device            *dev;
>> +    unsigned long            flags;
>>   };
>> -int drm_sched_init(struct drm_gpu_scheduler *sched,
>> +int drm_sched_init(struct drm_gpu_scheduler *sched, unsigned int flags,
>>              const struct drm_sched_backend_ops *ops,
>>              uint32_t hw_submission, unsigned hang_limit,
>>              long timeout, struct workqueue_struct *timeout_wq,
>>
>> base-commit: e05f3938c5a7896d09736e3381675a57ffee7a0a
>
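For reference, a minimal sketch (not taken from the patch or from any in-tree
driver) of how a driver sitting on top of a firmware scheduler might have used
the proposed interface: one &drm_gpu_scheduler in multi ring mode and one
&drm_sched_entity per dynamically created ring. Only the new flags argument of
drm_sched_init() and DRM_GPU_SCHEDULER_MODE_MULTI_RING come from this patch;
the foo_* identifiers, the numeric limits and the use of
drm_sched_entity_init() with DRM_SCHED_PRIORITY_NORMAL are illustrative
assumptions.

/*
 * Illustrative sketch only: a single scheduler in multi ring mode, one
 * entity per dynamically created ring. All foo_* names are hypothetical.
 */
#include <linux/jiffies.h>
#include <drm/gpu_scheduler.h>

#define FOO_NUM_RINGS		4
#define FOO_HW_SUBMISSIONS	64

struct foo_device {
	struct drm_gpu_scheduler sched;
	struct drm_sched_entity ring_entity[FOO_NUM_RINGS];
};

static int foo_sched_init(struct foo_device *foo, struct device *dev,
			  const struct drm_sched_backend_ops *ops)
{
	struct drm_gpu_scheduler *sched_list[] = { &foo->sched };
	int i, ret;

	/* A single scheduler instance feeds the firmware scheduler. */
	ret = drm_sched_init(&foo->sched, DRM_GPU_SCHEDULER_MODE_MULTI_RING,
			     ops, FOO_HW_SUBMISSIONS, 0,
			     msecs_to_jiffies(500), NULL, NULL,
			     "foo_sched", dev);
	if (ret)
		return ret;

	/* In multi ring mode, each entity represents one ring. */
	for (i = 0; i < FOO_NUM_RINGS; i++) {
		ret = drm_sched_entity_init(&foo->ring_entity[i],
					    DRM_SCHED_PRIORITY_NORMAL,
					    sched_list, 1, NULL);
		if (ret)
			return ret;
	}

	return 0;
}

Job submission would then go through the per-ring entity as usual; with the
flag set, the scheduler waits for cross-entity dependencies to fully signal
(sched_fence->finished) instead of pipelining them on sched_fence->scheduled.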
diff mbox series

Patch

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index a2cdde0ca0a7..eed6f56e3957 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2491,7 +2491,7 @@  static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
 			break;
 		}
 
-		r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
+		r = drm_sched_init(&ring->sched, 0, &amdgpu_sched_ops,
 				   ring->num_hw_submission, 0,
 				   timeout, adev->reset_domain->wq,
 				   ring->sched_score, ring->name,
diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
index 1ae87dfd19c4..bfe761316ba9 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
@@ -133,7 +133,7 @@  int etnaviv_sched_init(struct etnaviv_gpu *gpu)
 {
 	int ret;
 
-	ret = drm_sched_init(&gpu->sched, &etnaviv_sched_ops,
+	ret = drm_sched_init(&gpu->sched, 0, &etnaviv_sched_ops,
 			     etnaviv_hw_jobs_limit, etnaviv_job_hang_limit,
 			     msecs_to_jiffies(500), NULL, NULL,
 			     dev_name(gpu->dev), gpu->dev);
diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
index ffd91a5ee299..f559d7de5658 100644
--- a/drivers/gpu/drm/lima/lima_sched.c
+++ b/drivers/gpu/drm/lima/lima_sched.c
@@ -488,7 +488,7 @@  int lima_sched_pipe_init(struct lima_sched_pipe *pipe, const char *name)
 
 	INIT_WORK(&pipe->recover_work, lima_sched_recover_work);
 
-	return drm_sched_init(&pipe->base, &lima_sched_ops, 1,
+	return drm_sched_init(&pipe->base, 0, &lima_sched_ops, 1,
 			      lima_job_hang_limit,
 			      msecs_to_jiffies(timeout), NULL,
 			      NULL, name, pipe->ldev->dev);
diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.c b/drivers/gpu/drm/msm/msm_ringbuffer.c
index b60199184409..79fb8c96465d 100644
--- a/drivers/gpu/drm/msm/msm_ringbuffer.c
+++ b/drivers/gpu/drm/msm/msm_ringbuffer.c
@@ -93,7 +93,7 @@  struct msm_ringbuffer *msm_ringbuffer_new(struct msm_gpu *gpu, int id,
 	 /* currently managing hangcheck ourselves: */
 	sched_timeout = MAX_SCHEDULE_TIMEOUT;
 
-	ret = drm_sched_init(&ring->sched, &msm_sched_ops,
+	ret = drm_sched_init(&ring->sched, 0, &msm_sched_ops,
 			num_hw_submissions, 0, sched_timeout,
 			NULL, NULL, to_msm_bo(ring->bo)->name, gpu->dev->dev);
 	if (ret) {
diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
index 3424a1bf6af3..6528c09fffc9 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sched.c
+++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
@@ -407,9 +407,10 @@  int nouveau_sched_init(struct nouveau_drm *drm)
 	if (!drm->sched_wq)
 		return -ENOMEM;
 
-	return drm_sched_init(sched, &nouveau_sched_ops,
-			      NOUVEAU_SCHED_HW_SUBMISSIONS, 0, job_hang_limit,
-			      NULL, NULL, "nouveau_sched", drm->dev->dev);
+	return drm_sched_init(sched, DRM_GPU_SCHEDULER_MODE_MULTI_RING,
+			      &nouveau_sched_ops, NOUVEAU_SCHED_HW_SUBMISSIONS,
+			      0, job_hang_limit, NULL, NULL, "nouveau_sched",
+			      drm->dev->dev);
 }
 
 void nouveau_sched_fini(struct nouveau_drm *drm)
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index dbc597ab46fb..dbbabf63f08f 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -814,7 +814,7 @@  int panfrost_job_init(struct panfrost_device *pfdev)
 	for (j = 0; j < NUM_JOB_SLOTS; j++) {
 		js->queue[j].fence_context = dma_fence_context_alloc(1);
 
-		ret = drm_sched_init(&js->queue[j].sched,
+		ret = drm_sched_init(&js->queue[j].sched, 0,
 				     &panfrost_sched_ops,
 				     nentries, 0,
 				     msecs_to_jiffies(JOB_TIMEOUT_MS),
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index a42763e1429d..6d1e11e6bdcd 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -401,7 +401,8 @@  static bool drm_sched_entity_add_dependency_cb(struct drm_sched_entity *entity)
 
 	s_fence = to_drm_sched_fence(fence);
 	if (!fence->error && s_fence && s_fence->sched == sched &&
-	    !test_bit(DRM_SCHED_FENCE_DONT_PIPELINE, &fence->flags)) {
+	    !test_bit(DRM_SCHED_FENCE_DONT_PIPELINE, &fence->flags) &&
+	    !(sched->flags & DRM_GPU_SCHEDULER_MODE_MULTI_RING)) {
 
 		/*
 		 * Fence is from the same scheduler, only need to wait for
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 506371c42745..0c432c1dafdf 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -24,28 +24,50 @@ 
 /**
  * DOC: Overview
  *
- * The GPU scheduler provides entities which allow userspace to push jobs
- * into software queues which are then scheduled on a hardware run queue.
- * The software queues have a priority among them. The scheduler selects the entities
- * from the run queue using a FIFO. The scheduler provides dependency handling
- * features among jobs. The driver is supposed to provide callback functions for
- * backend operations to the scheduler like submitting a job to hardware run queue,
- * returning the dependencies of a job etc.
+ * The GPU scheduler is mainly structured into the scheduler itself
+ * (&drm_gpu_scheduler), scheduler entities (&drm_sched_entity) and scheduler
+ * jobs (&drm_sched_job).
  *
- * The organisation of the scheduler is the following:
+ * Each &drm_gpu_scheduler has different priority run queues (e.g. HIGH_HW,
+ * HIGH_SW, KERNEL, NORMAL), which themselves contain a list of &drm_sched_entity
+ * objects, while each &drm_sched_entity maintains a queue of &drm_sched_jobs.
  *
- * 1. Each hw run queue has one scheduler
- * 2. Each scheduler has multiple run queues with different priorities
- *    (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)
- * 3. Each scheduler run queue has a queue of entities to schedule
- * 4. Entities themselves maintain a queue of jobs that will be scheduled on
- *    the hardware.
+ * There are two modes of operation, single ring mode and multi ring mode.
+ * Depending on the mode of operation, the mental model of what the named
+ * structures represent differs.
  *
- * The jobs in a entity are always scheduled in the order that they were pushed.
+ * In single ring mode (which is the default) every &drm_gpu_scheduler
+ * instance represents a single HW ring, while every &drm_sched_entity
+ * represents a software queue feeding into one or multiple &drm_gpu_scheduler
+ * instances and hence into one or multiple HW rings.
+ *
+ * Single ring mode may be used when the GPU has a fixed number of HW rings
+ * which can be directly fed by the driver.
+ *
+ * In multi ring mode (enabled by passing the &drm_gpu_scheduler the
+ * &DRM_GPU_SCHEDULER_MODE_MULTI_RING flag) each &drm_sched_entity represents a
+ * HW ring, while the &drm_gpu_scheduler itself only exists to sort out job
+ * dependencies and actually process the jobs of each &drm_sched_entity.
+ *
+ * Multi ring mode may be used when the GPU has a firmware scheduler feeding the
+ * actual HW rings, while the driver feeds the firmware scheduler through an
+ * arbitrary number of dynamically created rings.
+ *
+ * While one or the other mental model could be applied without setting the ring
+ * mode through a flag, the scheduler needs to know the mode of operation in
+ * order to be able to make correct decisions when it comes to handling job
+ * dependencies.
+ *
+ * Independent of the mode of operation, jobs within an entity are always
+ * scheduled in the order in which they were submitted.
  *
  * Note that once a job was taken from the entities queue and pushed to the
  * hardware, i.e. the pending queue, the entity must not be referenced anymore
  * through the jobs entity pointer.
+ *
+ * In order for the scheduler to actually prepare and process a job, and to
+ * free it once it is completed, the driver is supposed to provide the
+ * corresponding callback functions (&drm_sched_backend_ops) to the scheduler.
  */
 
 #include <linux/kthread.h>
@@ -1082,7 +1104,7 @@  static int drm_sched_main(void *param)
  *
  * Return 0 on success, otherwise error code.
  */
-int drm_sched_init(struct drm_gpu_scheduler *sched,
+int drm_sched_init(struct drm_gpu_scheduler *sched, unsigned int flags,
 		   const struct drm_sched_backend_ops *ops,
 		   unsigned hw_submission, unsigned hang_limit,
 		   long timeout, struct workqueue_struct *timeout_wq,
@@ -1097,6 +1119,7 @@  int drm_sched_init(struct drm_gpu_scheduler *sched,
 	sched->hang_limit = hang_limit;
 	sched->score = score ? score : &sched->_score;
 	sched->dev = dev;
+	sched->flags = flags;
 	for (i = DRM_SCHED_PRIORITY_MIN; i < DRM_SCHED_PRIORITY_COUNT; i++)
 		drm_sched_rq_init(sched, &sched->sched_rq[i]);
 
diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
index 06238e6d7f5c..4ba561014a5d 100644
--- a/drivers/gpu/drm/v3d/v3d_sched.c
+++ b/drivers/gpu/drm/v3d/v3d_sched.c
@@ -387,7 +387,7 @@  v3d_sched_init(struct v3d_dev *v3d)
 	int hang_limit_ms = 500;
 	int ret;
 
-	ret = drm_sched_init(&v3d->queue[V3D_BIN].sched,
+	ret = drm_sched_init(&v3d->queue[V3D_BIN].sched, 0,
 			     &v3d_bin_sched_ops,
 			     hw_jobs_limit, job_hang_limit,
 			     msecs_to_jiffies(hang_limit_ms), NULL,
@@ -395,7 +395,7 @@  v3d_sched_init(struct v3d_dev *v3d)
 	if (ret)
 		return ret;
 
-	ret = drm_sched_init(&v3d->queue[V3D_RENDER].sched,
+	ret = drm_sched_init(&v3d->queue[V3D_RENDER].sched, 0,
 			     &v3d_render_sched_ops,
 			     hw_jobs_limit, job_hang_limit,
 			     msecs_to_jiffies(hang_limit_ms), NULL,
@@ -403,7 +403,7 @@  v3d_sched_init(struct v3d_dev *v3d)
 	if (ret)
 		goto fail;
 
-	ret = drm_sched_init(&v3d->queue[V3D_TFU].sched,
+	ret = drm_sched_init(&v3d->queue[V3D_TFU].sched, 0,
 			     &v3d_tfu_sched_ops,
 			     hw_jobs_limit, job_hang_limit,
 			     msecs_to_jiffies(hang_limit_ms), NULL,
@@ -412,7 +412,7 @@  v3d_sched_init(struct v3d_dev *v3d)
 		goto fail;
 
 	if (v3d_has_csd(v3d)) {
-		ret = drm_sched_init(&v3d->queue[V3D_CSD].sched,
+		ret = drm_sched_init(&v3d->queue[V3D_CSD].sched, 0,
 				     &v3d_csd_sched_ops,
 				     hw_jobs_limit, job_hang_limit,
 				     msecs_to_jiffies(hang_limit_ms), NULL,
@@ -420,7 +420,7 @@  v3d_sched_init(struct v3d_dev *v3d)
 		if (ret)
 			goto fail;
 
-		ret = drm_sched_init(&v3d->queue[V3D_CACHE_CLEAN].sched,
+		ret = drm_sched_init(&v3d->queue[V3D_CACHE_CLEAN].sched, 0,
 				     &v3d_cache_clean_sched_ops,
 				     hw_jobs_limit, job_hang_limit,
 				     msecs_to_jiffies(hang_limit_ms), NULL,
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index f9544d9b670d..afe29196cc7e 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -465,6 +465,20 @@  struct drm_sched_backend_ops {
 	void (*free_job)(struct drm_sched_job *sched_job);
 };
 
+/**
+ * DRM_GPU_SCHEDULER_MODE_MULTI_RING - Run the scheduler in multi ring mode
+ *
+ * The &drm_gpu_scheduler typically represents one HW ring with
+ * &drm_sched_entities feeding into one or multiple scheduler instances.
+ *
+ * In multi ring mode it is assumed there is a single &drm_gpu_scheduler
+ * instance only, where &drm_sched_entities represent rings instead.
+ *
+ * This is useful for GPUs where the &drm_gpu_scheduler should feed a firmware
+ * scheduler with an arbitrary number of rings.
+ */
+#define DRM_GPU_SCHEDULER_MODE_MULTI_RING	(1 << 0)
+
 /**
  * struct drm_gpu_scheduler - scheduler instance-specific data
  *
@@ -493,6 +507,7 @@  struct drm_sched_backend_ops {
  * @ready: marks if the underlying HW is ready to work
  * @free_guilty: A hit to time out handler to free the guilty job.
  * @dev: system &struct device
+ * @flags: the flags for the &drm_gpu_scheduler
  *
  * One scheduler is implemented for each hardware ring.
  */
@@ -517,9 +532,10 @@  struct drm_gpu_scheduler {
 	bool				ready;
 	bool				free_guilty;
 	struct device			*dev;
+	unsigned long			flags;
 };
 
-int drm_sched_init(struct drm_gpu_scheduler *sched,
+int drm_sched_init(struct drm_gpu_scheduler *sched, unsigned int flags,
 		   const struct drm_sched_backend_ops *ops,
 		   uint32_t hw_submission, unsigned hang_limit,
 		   long timeout, struct workqueue_struct *timeout_wq,