Message ID | 20230801205103.627779-5-matthew.brost@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | DRM scheduler changes for Xe | expand |
Am 01.08.23 um 22:50 schrieb Matthew Brost: > Add generic schedule message interface which sends messages to backend > from the drm_gpu_scheduler main submission thread. The idea is some of > these messages modify some state in drm_sched_entity which is also > modified during submission. By scheduling these messages and submission > in the same thread their is not race changing states in > drm_sched_entity. > > This interface will be used in XE, new Intel GPU driver, to cleanup, > suspend, resume, and change scheduling properties of a drm_sched_entity. > > The interface is designed to be generic and extendable with only the > backend understanding the messages. I'm still extremely frowned on this. If you need this functionality then let the drivers decide which runqueue the scheduler should use. When you then create a single threaded runqueue you can just submit work to it and serialize this with the scheduler work. This way we wouldn't duplicate this core kernel function inside the scheduler. Regards, Christian. > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > --- > drivers/gpu/drm/scheduler/sched_main.c | 52 +++++++++++++++++++++++++- > include/drm/gpu_scheduler.h | 29 +++++++++++++- > 2 files changed, 78 insertions(+), 3 deletions(-) > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c > index 2597fb298733..84821a124ca2 100644 > --- a/drivers/gpu/drm/scheduler/sched_main.c > +++ b/drivers/gpu/drm/scheduler/sched_main.c > @@ -1049,6 +1049,49 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list, > } > EXPORT_SYMBOL(drm_sched_pick_best); > > +/** > + * drm_sched_add_msg - add scheduler message > + * > + * @sched: scheduler instance > + * @msg: message to be added > + * > + * Can and will pass an jobs waiting on dependencies or in a runnable queue. > + * Messages processing will stop if schedule run wq is stopped and resume when > + * run wq is started. 
> + */ > +void drm_sched_add_msg(struct drm_gpu_scheduler *sched, > + struct drm_sched_msg *msg) > +{ > + spin_lock(&sched->job_list_lock); > + list_add_tail(&msg->link, &sched->msgs); > + spin_unlock(&sched->job_list_lock); > + > + drm_sched_run_wq_queue(sched); > +} > +EXPORT_SYMBOL(drm_sched_add_msg); > + > +/** > + * drm_sched_get_msg - get scheduler message > + * > + * @sched: scheduler instance > + * > + * Returns NULL or message > + */ > +static struct drm_sched_msg * > +drm_sched_get_msg(struct drm_gpu_scheduler *sched) > +{ > + struct drm_sched_msg *msg; > + > + spin_lock(&sched->job_list_lock); > + msg = list_first_entry_or_null(&sched->msgs, > + struct drm_sched_msg, link); > + if (msg) > + list_del(&msg->link); > + spin_unlock(&sched->job_list_lock); > + > + return msg; > +} > + > /** > * drm_sched_main - main scheduler thread > * > @@ -1060,6 +1103,7 @@ static void drm_sched_main(struct work_struct *w) > container_of(w, struct drm_gpu_scheduler, work_run); > struct drm_sched_entity *entity; > struct drm_sched_job *cleanup_job; > + struct drm_sched_msg *msg; > int r; > > if (READ_ONCE(sched->pause_run_wq)) > @@ -1067,12 +1111,15 @@ static void drm_sched_main(struct work_struct *w) > > cleanup_job = drm_sched_get_cleanup_job(sched); > entity = drm_sched_select_entity(sched); > + msg = drm_sched_get_msg(sched); > > - if (!entity && !cleanup_job) > + if (!entity && !cleanup_job && !msg) > return; /* No more work */ > > if (cleanup_job) > sched->ops->free_job(cleanup_job); > + if (msg) > + sched->ops->process_msg(msg); > > if (entity) { > struct dma_fence *fence; > @@ -1082,7 +1129,7 @@ static void drm_sched_main(struct work_struct *w) > sched_job = drm_sched_entity_pop_job(entity); > if (!sched_job) { > complete_all(&entity->entity_idle); > - if (!cleanup_job) > + if (!cleanup_job && !msg) > return; /* No more work */ > goto again; > } > @@ -1177,6 +1224,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, > > 
init_waitqueue_head(&sched->job_scheduled); > INIT_LIST_HEAD(&sched->pending_list); > + INIT_LIST_HEAD(&sched->msgs); > spin_lock_init(&sched->job_list_lock); > atomic_set(&sched->hw_rq_count, 0); > INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout); > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h > index df1993dd44ae..267bd060d178 100644 > --- a/include/drm/gpu_scheduler.h > +++ b/include/drm/gpu_scheduler.h > @@ -394,6 +394,23 @@ enum drm_gpu_sched_stat { > DRM_GPU_SCHED_STAT_ENODEV, > }; > > +/** > + * struct drm_sched_msg - an in-band (relative to GPU scheduler run queue) > + * message > + * > + * Generic enough for backend defined messages, backend can expand if needed. > + */ > +struct drm_sched_msg { > + /** @link: list link into the gpu scheduler list of messages */ > + struct list_head link; > + /** > + * @private_data: opaque pointer to message private data (backend defined) > + */ > + void *private_data; > + /** @opcode: opcode of message (backend defined) */ > + unsigned int opcode; > +}; > + > /** > * struct drm_sched_backend_ops - Define the backend operations > * called by the scheduler > @@ -471,6 +488,12 @@ struct drm_sched_backend_ops { > * and it's time to clean it up. > */ > void (*free_job)(struct drm_sched_job *sched_job); > + > + /** > + * @process_msg: Process a message. Allowed to block, it is this > + * function's responsibility to free message if dynamically allocated. > + */ > + void (*process_msg)(struct drm_sched_msg *msg); > }; > > /** > @@ -482,6 +505,7 @@ struct drm_sched_backend_ops { > * @timeout: the time after which a job is removed from the scheduler. > * @name: name of the ring for which this scheduler is being used. > * @sched_rq: priority wise array of run queues. > + * @msgs: list of messages to be processed in @work_run > * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler > * waits on this wait queue until all the scheduled jobs are > * finished. 
> @@ -489,7 +513,7 @@ struct drm_sched_backend_ops { > * @job_id_count: used to assign unique id to the each job. > * @run_wq: workqueue used to queue @work_run > * @timeout_wq: workqueue used to queue @work_tdr > - * @work_run: schedules jobs and cleans up entities > + * @work_run: schedules jobs, cleans up jobs, and processes messages > * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the > * timeout interval is over. > * @pending_list: the list of jobs which are currently in the job queue. > @@ -513,6 +537,7 @@ struct drm_gpu_scheduler { > long timeout; > const char *name; > struct drm_sched_rq sched_rq[DRM_SCHED_PRIORITY_COUNT]; > + struct list_head msgs; > wait_queue_head_t job_scheduled; > atomic_t hw_rq_count; > atomic64_t job_id_count; > @@ -566,6 +591,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity, > > void drm_sched_job_cleanup(struct drm_sched_job *job); > void drm_sched_wakeup(struct drm_gpu_scheduler *sched); > +void drm_sched_add_msg(struct drm_gpu_scheduler *sched, > + struct drm_sched_msg *msg); > void drm_sched_run_wq_stop(struct drm_gpu_scheduler *sched); > void drm_sched_run_wq_start(struct drm_gpu_scheduler *sched); > void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
On Thu, 3 Aug 2023 at 10:53, Christian König <christian.koenig@amd.com> wrote: > > Am 01.08.23 um 22:50 schrieb Matthew Brost: > > Add generic schedule message interface which sends messages to backend > > from the drm_gpu_scheduler main submission thread. The idea is some of > > these messages modify some state in drm_sched_entity which is also > > modified during submission. By scheduling these messages and submission > > in the same thread their is not race changing states in > > drm_sched_entity. > > > > This interface will be used in XE, new Intel GPU driver, to cleanup, > > suspend, resume, and change scheduling properties of a drm_sched_entity. > > > > The interface is designed to be generic and extendable with only the > > backend understanding the messages. > > I'm still extremely frowned on this. > > If you need this functionality then let the drivers decide which > runqueue the scheduler should use. > > When you then create a single threaded runqueue you can just submit work > to it and serialize this with the scheduler work. > > This way we wouldn't duplicate this core kernel function inside the > scheduler. Yeah that's essentially the design we picked for the tdr workers, where some drivers have requirements that all tdr work must be done on the same thread (because of cross-engine coordination issues). But that would require that we rework the scheduler as a pile of self-submitting work items, and I'm not sure that actually fits all that well into the core workqueue interfaces either. Worst case I think this isn't a dead-end and can be refactored to internally use the workqueue services, with the new functions here just being dumb wrappers until everyone is converted over. So it doesn't look like an expensive mistake, if it turns out to be a mistake. -Daniel > Regards, > Christian. 
> > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > --- > > drivers/gpu/drm/scheduler/sched_main.c | 52 +++++++++++++++++++++++++- > > include/drm/gpu_scheduler.h | 29 +++++++++++++- > > 2 files changed, 78 insertions(+), 3 deletions(-) > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c > > index 2597fb298733..84821a124ca2 100644 > > --- a/drivers/gpu/drm/scheduler/sched_main.c > > +++ b/drivers/gpu/drm/scheduler/sched_main.c > > @@ -1049,6 +1049,49 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list, > > } > > EXPORT_SYMBOL(drm_sched_pick_best); > > > > +/** > > + * drm_sched_add_msg - add scheduler message > > + * > > + * @sched: scheduler instance > > + * @msg: message to be added > > + * > > + * Can and will pass an jobs waiting on dependencies or in a runnable queue. > > + * Messages processing will stop if schedule run wq is stopped and resume when > > + * run wq is started. > > + */ > > +void drm_sched_add_msg(struct drm_gpu_scheduler *sched, > > + struct drm_sched_msg *msg) > > +{ > > + spin_lock(&sched->job_list_lock); > > + list_add_tail(&msg->link, &sched->msgs); > > + spin_unlock(&sched->job_list_lock); > > + > > + drm_sched_run_wq_queue(sched); > > +} > > +EXPORT_SYMBOL(drm_sched_add_msg); > > + > > +/** > > + * drm_sched_get_msg - get scheduler message > > + * > > + * @sched: scheduler instance > > + * > > + * Returns NULL or message > > + */ > > +static struct drm_sched_msg * > > +drm_sched_get_msg(struct drm_gpu_scheduler *sched) > > +{ > > + struct drm_sched_msg *msg; > > + > > + spin_lock(&sched->job_list_lock); > > + msg = list_first_entry_or_null(&sched->msgs, > > + struct drm_sched_msg, link); > > + if (msg) > > + list_del(&msg->link); > > + spin_unlock(&sched->job_list_lock); > > + > > + return msg; > > +} > > + > > /** > > * drm_sched_main - main scheduler thread > > * > > @@ -1060,6 +1103,7 @@ static void drm_sched_main(struct work_struct *w) > > container_of(w, 
struct drm_gpu_scheduler, work_run); > > struct drm_sched_entity *entity; > > struct drm_sched_job *cleanup_job; > > + struct drm_sched_msg *msg; > > int r; > > > > if (READ_ONCE(sched->pause_run_wq)) > > @@ -1067,12 +1111,15 @@ static void drm_sched_main(struct work_struct *w) > > > > cleanup_job = drm_sched_get_cleanup_job(sched); > > entity = drm_sched_select_entity(sched); > > + msg = drm_sched_get_msg(sched); > > > > - if (!entity && !cleanup_job) > > + if (!entity && !cleanup_job && !msg) > > return; /* No more work */ > > > > if (cleanup_job) > > sched->ops->free_job(cleanup_job); > > + if (msg) > > + sched->ops->process_msg(msg); > > > > if (entity) { > > struct dma_fence *fence; > > @@ -1082,7 +1129,7 @@ static void drm_sched_main(struct work_struct *w) > > sched_job = drm_sched_entity_pop_job(entity); > > if (!sched_job) { > > complete_all(&entity->entity_idle); > > - if (!cleanup_job) > > + if (!cleanup_job && !msg) > > return; /* No more work */ > > goto again; > > } > > @@ -1177,6 +1224,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, > > > > init_waitqueue_head(&sched->job_scheduled); > > INIT_LIST_HEAD(&sched->pending_list); > > + INIT_LIST_HEAD(&sched->msgs); > > spin_lock_init(&sched->job_list_lock); > > atomic_set(&sched->hw_rq_count, 0); > > INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout); > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h > > index df1993dd44ae..267bd060d178 100644 > > --- a/include/drm/gpu_scheduler.h > > +++ b/include/drm/gpu_scheduler.h > > @@ -394,6 +394,23 @@ enum drm_gpu_sched_stat { > > DRM_GPU_SCHED_STAT_ENODEV, > > }; > > > > +/** > > + * struct drm_sched_msg - an in-band (relative to GPU scheduler run queue) > > + * message > > + * > > + * Generic enough for backend defined messages, backend can expand if needed. 
> > + */ > > +struct drm_sched_msg { > > + /** @link: list link into the gpu scheduler list of messages */ > > + struct list_head link; > > + /** > > + * @private_data: opaque pointer to message private data (backend defined) > > + */ > > + void *private_data; > > + /** @opcode: opcode of message (backend defined) */ > > + unsigned int opcode; > > +}; > > + > > /** > > * struct drm_sched_backend_ops - Define the backend operations > > * called by the scheduler > > @@ -471,6 +488,12 @@ struct drm_sched_backend_ops { > > * and it's time to clean it up. > > */ > > void (*free_job)(struct drm_sched_job *sched_job); > > + > > + /** > > + * @process_msg: Process a message. Allowed to block, it is this > > + * function's responsibility to free message if dynamically allocated. > > + */ > > + void (*process_msg)(struct drm_sched_msg *msg); > > }; > > > > /** > > @@ -482,6 +505,7 @@ struct drm_sched_backend_ops { > > * @timeout: the time after which a job is removed from the scheduler. > > * @name: name of the ring for which this scheduler is being used. > > * @sched_rq: priority wise array of run queues. > > + * @msgs: list of messages to be processed in @work_run > > * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler > > * waits on this wait queue until all the scheduled jobs are > > * finished. > > @@ -489,7 +513,7 @@ struct drm_sched_backend_ops { > > * @job_id_count: used to assign unique id to the each job. > > * @run_wq: workqueue used to queue @work_run > > * @timeout_wq: workqueue used to queue @work_tdr > > - * @work_run: schedules jobs and cleans up entities > > + * @work_run: schedules jobs, cleans up jobs, and processes messages > > * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the > > * timeout interval is over. > > * @pending_list: the list of jobs which are currently in the job queue. 
> > @@ -513,6 +537,7 @@ struct drm_gpu_scheduler { > > long timeout; > > const char *name; > > struct drm_sched_rq sched_rq[DRM_SCHED_PRIORITY_COUNT]; > > + struct list_head msgs; > > wait_queue_head_t job_scheduled; > > atomic_t hw_rq_count; > > atomic64_t job_id_count; > > @@ -566,6 +591,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity, > > > > void drm_sched_job_cleanup(struct drm_sched_job *job); > > void drm_sched_wakeup(struct drm_gpu_scheduler *sched); > > +void drm_sched_add_msg(struct drm_gpu_scheduler *sched, > > + struct drm_sched_msg *msg); > > void drm_sched_run_wq_stop(struct drm_gpu_scheduler *sched); > > void drm_sched_run_wq_start(struct drm_gpu_scheduler *sched); > > void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad); >
Am 03.08.23 um 10:58 schrieb Daniel Vetter: > On Thu, 3 Aug 2023 at 10:53, Christian König <christian.koenig@amd.com> wrote: >> Am 01.08.23 um 22:50 schrieb Matthew Brost: >>> Add generic schedule message interface which sends messages to backend >>> from the drm_gpu_scheduler main submission thread. The idea is some of >>> these messages modify some state in drm_sched_entity which is also >>> modified during submission. By scheduling these messages and submission >>> in the same thread their is not race changing states in >>> drm_sched_entity. >>> >>> This interface will be used in XE, new Intel GPU driver, to cleanup, >>> suspend, resume, and change scheduling properties of a drm_sched_entity. >>> >>> The interface is designed to be generic and extendable with only the >>> backend understanding the messages. >> I'm still extremely frowned on this. >> >> If you need this functionality then let the drivers decide which >> runqueue the scheduler should use. >> >> When you then create a single threaded runqueue you can just submit work >> to it and serialize this with the scheduler work. >> >> This way we wouldn't duplicate this core kernel function inside the >> scheduler. > Yeah that's essentially the design we picked for the tdr workers, > where some drivers have requirements that all tdr work must be done on > the same thread (because of cross-engine coordination issues). But > that would require that we rework the scheduler as a pile of > self-submitting work items, and I'm not sure that actually fits all > that well into the core workqueue interfaces either. There were already patches floating around which did exactly that. Last time I checked those were actually looking pretty good. In addition to the message passing advantage, the real big issue with the scheduler and 1 to 1 mapping is that we create a kernel thread for each instance, which results in tons of overhead. Just using a work item which is submitted to a work queue completely avoids that. 
Regards, Christian. > > Worst case I think this isn't a dead-end and can be refactored to > internally use the workqueue services, with the new functions here > just being dumb wrappers until everyone is converted over. So it > doesn't look like an expensive mistake, if it turns out to be a > mistake. > -Daniel > > >> Regards, >> Christian. >> >>> Signed-off-by: Matthew Brost <matthew.brost@intel.com> >>> --- >>> drivers/gpu/drm/scheduler/sched_main.c | 52 +++++++++++++++++++++++++- >>> include/drm/gpu_scheduler.h | 29 +++++++++++++- >>> 2 files changed, 78 insertions(+), 3 deletions(-) >>> >>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c >>> index 2597fb298733..84821a124ca2 100644 >>> --- a/drivers/gpu/drm/scheduler/sched_main.c >>> +++ b/drivers/gpu/drm/scheduler/sched_main.c >>> @@ -1049,6 +1049,49 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list, >>> } >>> EXPORT_SYMBOL(drm_sched_pick_best); >>> >>> +/** >>> + * drm_sched_add_msg - add scheduler message >>> + * >>> + * @sched: scheduler instance >>> + * @msg: message to be added >>> + * >>> + * Can and will pass an jobs waiting on dependencies or in a runnable queue. >>> + * Messages processing will stop if schedule run wq is stopped and resume when >>> + * run wq is started. 
>>> + */ >>> +void drm_sched_add_msg(struct drm_gpu_scheduler *sched, >>> + struct drm_sched_msg *msg) >>> +{ >>> + spin_lock(&sched->job_list_lock); >>> + list_add_tail(&msg->link, &sched->msgs); >>> + spin_unlock(&sched->job_list_lock); >>> + >>> + drm_sched_run_wq_queue(sched); >>> +} >>> +EXPORT_SYMBOL(drm_sched_add_msg); >>> + >>> +/** >>> + * drm_sched_get_msg - get scheduler message >>> + * >>> + * @sched: scheduler instance >>> + * >>> + * Returns NULL or message >>> + */ >>> +static struct drm_sched_msg * >>> +drm_sched_get_msg(struct drm_gpu_scheduler *sched) >>> +{ >>> + struct drm_sched_msg *msg; >>> + >>> + spin_lock(&sched->job_list_lock); >>> + msg = list_first_entry_or_null(&sched->msgs, >>> + struct drm_sched_msg, link); >>> + if (msg) >>> + list_del(&msg->link); >>> + spin_unlock(&sched->job_list_lock); >>> + >>> + return msg; >>> +} >>> + >>> /** >>> * drm_sched_main - main scheduler thread >>> * >>> @@ -1060,6 +1103,7 @@ static void drm_sched_main(struct work_struct *w) >>> container_of(w, struct drm_gpu_scheduler, work_run); >>> struct drm_sched_entity *entity; >>> struct drm_sched_job *cleanup_job; >>> + struct drm_sched_msg *msg; >>> int r; >>> >>> if (READ_ONCE(sched->pause_run_wq)) >>> @@ -1067,12 +1111,15 @@ static void drm_sched_main(struct work_struct *w) >>> >>> cleanup_job = drm_sched_get_cleanup_job(sched); >>> entity = drm_sched_select_entity(sched); >>> + msg = drm_sched_get_msg(sched); >>> >>> - if (!entity && !cleanup_job) >>> + if (!entity && !cleanup_job && !msg) >>> return; /* No more work */ >>> >>> if (cleanup_job) >>> sched->ops->free_job(cleanup_job); >>> + if (msg) >>> + sched->ops->process_msg(msg); >>> >>> if (entity) { >>> struct dma_fence *fence; >>> @@ -1082,7 +1129,7 @@ static void drm_sched_main(struct work_struct *w) >>> sched_job = drm_sched_entity_pop_job(entity); >>> if (!sched_job) { >>> complete_all(&entity->entity_idle); >>> - if (!cleanup_job) >>> + if (!cleanup_job && !msg) >>> return; /* No more work */ 
>>> goto again; >>> } >>> @@ -1177,6 +1224,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, >>> >>> init_waitqueue_head(&sched->job_scheduled); >>> INIT_LIST_HEAD(&sched->pending_list); >>> + INIT_LIST_HEAD(&sched->msgs); >>> spin_lock_init(&sched->job_list_lock); >>> atomic_set(&sched->hw_rq_count, 0); >>> INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout); >>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h >>> index df1993dd44ae..267bd060d178 100644 >>> --- a/include/drm/gpu_scheduler.h >>> +++ b/include/drm/gpu_scheduler.h >>> @@ -394,6 +394,23 @@ enum drm_gpu_sched_stat { >>> DRM_GPU_SCHED_STAT_ENODEV, >>> }; >>> >>> +/** >>> + * struct drm_sched_msg - an in-band (relative to GPU scheduler run queue) >>> + * message >>> + * >>> + * Generic enough for backend defined messages, backend can expand if needed. >>> + */ >>> +struct drm_sched_msg { >>> + /** @link: list link into the gpu scheduler list of messages */ >>> + struct list_head link; >>> + /** >>> + * @private_data: opaque pointer to message private data (backend defined) >>> + */ >>> + void *private_data; >>> + /** @opcode: opcode of message (backend defined) */ >>> + unsigned int opcode; >>> +}; >>> + >>> /** >>> * struct drm_sched_backend_ops - Define the backend operations >>> * called by the scheduler >>> @@ -471,6 +488,12 @@ struct drm_sched_backend_ops { >>> * and it's time to clean it up. >>> */ >>> void (*free_job)(struct drm_sched_job *sched_job); >>> + >>> + /** >>> + * @process_msg: Process a message. Allowed to block, it is this >>> + * function's responsibility to free message if dynamically allocated. >>> + */ >>> + void (*process_msg)(struct drm_sched_msg *msg); >>> }; >>> >>> /** >>> @@ -482,6 +505,7 @@ struct drm_sched_backend_ops { >>> * @timeout: the time after which a job is removed from the scheduler. >>> * @name: name of the ring for which this scheduler is being used. >>> * @sched_rq: priority wise array of run queues. 
>>> + * @msgs: list of messages to be processed in @work_run >>> * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler >>> * waits on this wait queue until all the scheduled jobs are >>> * finished. >>> @@ -489,7 +513,7 @@ struct drm_sched_backend_ops { >>> * @job_id_count: used to assign unique id to the each job. >>> * @run_wq: workqueue used to queue @work_run >>> * @timeout_wq: workqueue used to queue @work_tdr >>> - * @work_run: schedules jobs and cleans up entities >>> + * @work_run: schedules jobs, cleans up jobs, and processes messages >>> * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the >>> * timeout interval is over. >>> * @pending_list: the list of jobs which are currently in the job queue. >>> @@ -513,6 +537,7 @@ struct drm_gpu_scheduler { >>> long timeout; >>> const char *name; >>> struct drm_sched_rq sched_rq[DRM_SCHED_PRIORITY_COUNT]; >>> + struct list_head msgs; >>> wait_queue_head_t job_scheduled; >>> atomic_t hw_rq_count; >>> atomic64_t job_id_count; >>> @@ -566,6 +591,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity, >>> >>> void drm_sched_job_cleanup(struct drm_sched_job *job); >>> void drm_sched_wakeup(struct drm_gpu_scheduler *sched); >>> +void drm_sched_add_msg(struct drm_gpu_scheduler *sched, >>> + struct drm_sched_msg *msg); >>> void drm_sched_run_wq_stop(struct drm_gpu_scheduler *sched); >>> void drm_sched_run_wq_start(struct drm_gpu_scheduler *sched); >>> void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad); >
On Thu, Aug 03, 2023 at 11:35:30AM +0200, Christian König wrote: > Am 03.08.23 um 10:58 schrieb Daniel Vetter: > > On Thu, 3 Aug 2023 at 10:53, Christian König <christian.koenig@amd.com> wrote: > > > Am 01.08.23 um 22:50 schrieb Matthew Brost: > > > > Add generic schedule message interface which sends messages to backend > > > > from the drm_gpu_scheduler main submission thread. The idea is some of > > > > these messages modify some state in drm_sched_entity which is also > > > > modified during submission. By scheduling these messages and submission > > > > in the same thread their is not race changing states in > > > > drm_sched_entity. > > > > > > > > This interface will be used in XE, new Intel GPU driver, to cleanup, > > > > suspend, resume, and change scheduling properties of a drm_sched_entity. > > > > > > > > The interface is designed to be generic and extendable with only the > > > > backend understanding the messages. > > > I'm still extremely frowned on this. > > > > > > If you need this functionality then let the drivers decide which > > > runqueue the scheduler should use. > > > > > > When you then create a single threaded runqueue you can just submit work > > > to it and serialize this with the scheduler work. > > > > > > This way we wouldn't duplicate this core kernel function inside the > > > scheduler. > > Yeah that's essentially the design we picked for the tdr workers, > > where some drivers have requirements that all tdr work must be done on > > the same thread (because of cross-engine coordination issues). But > > that would require that we rework the scheduler as a pile of > > self-submitting work items, and I'm not sure that actually fits all > > that well into the core workqueue interfaces either. > > There were already patches floating around which did exactly that. > > Last time I checked those were actually looking pretty good. 
> > Additional to message passing advantage the real big issue with the > scheduler and 1 to 1 mapping is that we create a kernel thread for each > instance, which results in tons on overhead. > > Just using a work item which is submitted to a work queue completely avoids > that. Hm I should have read the entire series first, since that does the conversion still. Apologies for the confusion, and yeah we should be able to just submit other work to the same wq with the first patch? And so hand-rolling this infra here isn't needed at all? Or what am I missing? > Regards, > Christian. > > > > > Worst case I think this isn't a dead-end and can be refactored to > > internally use the workqueue services, with the new functions here > > just being dumb wrappers until everyone is converted over. So it > > doesn't look like an expensive mistake, if it turns out to be a > > mistake. > > -Daniel > > > > > > > Regards, > > > Christian. > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > > > --- > > > > drivers/gpu/drm/scheduler/sched_main.c | 52 +++++++++++++++++++++++++- > > > > include/drm/gpu_scheduler.h | 29 +++++++++++++- > > > > 2 files changed, 78 insertions(+), 3 deletions(-) > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c > > > > index 2597fb298733..84821a124ca2 100644 > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c > > > > @@ -1049,6 +1049,49 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list, > > > > } > > > > EXPORT_SYMBOL(drm_sched_pick_best); > > > > > > > > +/** > > > > + * drm_sched_add_msg - add scheduler message > > > > + * > > > > + * @sched: scheduler instance > > > > + * @msg: message to be added > > > > + * > > > > + * Can and will pass an jobs waiting on dependencies or in a runnable queue. > > > > + * Messages processing will stop if schedule run wq is stopped and resume when > > > > + * run wq is started. 
> > > > + */ > > > > +void drm_sched_add_msg(struct drm_gpu_scheduler *sched, > > > > + struct drm_sched_msg *msg) > > > > +{ > > > > + spin_lock(&sched->job_list_lock); > > > > + list_add_tail(&msg->link, &sched->msgs); > > > > + spin_unlock(&sched->job_list_lock); > > > > + > > > > + drm_sched_run_wq_queue(sched); > > > > +} > > > > +EXPORT_SYMBOL(drm_sched_add_msg); > > > > + > > > > +/** > > > > + * drm_sched_get_msg - get scheduler message > > > > + * > > > > + * @sched: scheduler instance > > > > + * > > > > + * Returns NULL or message > > > > + */ > > > > +static struct drm_sched_msg * > > > > +drm_sched_get_msg(struct drm_gpu_scheduler *sched) > > > > +{ > > > > + struct drm_sched_msg *msg; > > > > + > > > > + spin_lock(&sched->job_list_lock); > > > > + msg = list_first_entry_or_null(&sched->msgs, > > > > + struct drm_sched_msg, link); > > > > + if (msg) > > > > + list_del(&msg->link); > > > > + spin_unlock(&sched->job_list_lock); > > > > + > > > > + return msg; > > > > +} > > > > + > > > > /** > > > > * drm_sched_main - main scheduler thread > > > > * > > > > @@ -1060,6 +1103,7 @@ static void drm_sched_main(struct work_struct *w) > > > > container_of(w, struct drm_gpu_scheduler, work_run); > > > > struct drm_sched_entity *entity; > > > > struct drm_sched_job *cleanup_job; > > > > + struct drm_sched_msg *msg; > > > > int r; > > > > > > > > if (READ_ONCE(sched->pause_run_wq)) > > > > @@ -1067,12 +1111,15 @@ static void drm_sched_main(struct work_struct *w) > > > > > > > > cleanup_job = drm_sched_get_cleanup_job(sched); > > > > entity = drm_sched_select_entity(sched); > > > > + msg = drm_sched_get_msg(sched); > > > > > > > > - if (!entity && !cleanup_job) > > > > + if (!entity && !cleanup_job && !msg) > > > > return; /* No more work */ > > > > > > > > if (cleanup_job) > > > > sched->ops->free_job(cleanup_job); > > > > + if (msg) > > > > + sched->ops->process_msg(msg); > > > > > > > > if (entity) { > > > > struct dma_fence *fence; > > > > @@ -1082,7 +1129,7 @@ 
static void drm_sched_main(struct work_struct *w) > > > > sched_job = drm_sched_entity_pop_job(entity); > > > > if (!sched_job) { > > > > complete_all(&entity->entity_idle); > > > > - if (!cleanup_job) > > > > + if (!cleanup_job && !msg) > > > > return; /* No more work */ > > > > goto again; > > > > } > > > > @@ -1177,6 +1224,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, > > > > > > > > init_waitqueue_head(&sched->job_scheduled); > > > > INIT_LIST_HEAD(&sched->pending_list); > > > > + INIT_LIST_HEAD(&sched->msgs); > > > > spin_lock_init(&sched->job_list_lock); > > > > atomic_set(&sched->hw_rq_count, 0); > > > > INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout); > > > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h > > > > index df1993dd44ae..267bd060d178 100644 > > > > --- a/include/drm/gpu_scheduler.h > > > > +++ b/include/drm/gpu_scheduler.h > > > > @@ -394,6 +394,23 @@ enum drm_gpu_sched_stat { > > > > DRM_GPU_SCHED_STAT_ENODEV, > > > > }; > > > > > > > > +/** > > > > + * struct drm_sched_msg - an in-band (relative to GPU scheduler run queue) > > > > + * message > > > > + * > > > > + * Generic enough for backend defined messages, backend can expand if needed. > > > > + */ > > > > +struct drm_sched_msg { > > > > + /** @link: list link into the gpu scheduler list of messages */ > > > > + struct list_head link; > > > > + /** > > > > + * @private_data: opaque pointer to message private data (backend defined) > > > > + */ > > > > + void *private_data; > > > > + /** @opcode: opcode of message (backend defined) */ > > > > + unsigned int opcode; > > > > +}; > > > > + > > > > /** > > > > * struct drm_sched_backend_ops - Define the backend operations > > > > * called by the scheduler > > > > @@ -471,6 +488,12 @@ struct drm_sched_backend_ops { > > > > * and it's time to clean it up. > > > > */ > > > > void (*free_job)(struct drm_sched_job *sched_job); > > > > + > > > > + /** > > > > + * @process_msg: Process a message. 
Allowed to block, it is this > > > > + * function's responsibility to free message if dynamically allocated. > > > > + */ > > > > + void (*process_msg)(struct drm_sched_msg *msg); > > > > }; > > > > > > > > /** > > > > @@ -482,6 +505,7 @@ struct drm_sched_backend_ops { > > > > * @timeout: the time after which a job is removed from the scheduler. > > > > * @name: name of the ring for which this scheduler is being used. > > > > * @sched_rq: priority wise array of run queues. > > > > + * @msgs: list of messages to be processed in @work_run > > > > * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler > > > > * waits on this wait queue until all the scheduled jobs are > > > > * finished. > > > > @@ -489,7 +513,7 @@ struct drm_sched_backend_ops { > > > > * @job_id_count: used to assign unique id to the each job. > > > > * @run_wq: workqueue used to queue @work_run > > > > * @timeout_wq: workqueue used to queue @work_tdr > > > > - * @work_run: schedules jobs and cleans up entities > > > > + * @work_run: schedules jobs, cleans up jobs, and processes messages > > > > * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the > > > > * timeout interval is over. > > > > * @pending_list: the list of jobs which are currently in the job queue. 
> > > > @@ -513,6 +537,7 @@ struct drm_gpu_scheduler { > > > > long timeout; > > > > const char *name; > > > > struct drm_sched_rq sched_rq[DRM_SCHED_PRIORITY_COUNT]; > > > > + struct list_head msgs; > > > > wait_queue_head_t job_scheduled; > > > > atomic_t hw_rq_count; > > > > atomic64_t job_id_count; > > > > @@ -566,6 +591,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity, > > > > > > > > void drm_sched_job_cleanup(struct drm_sched_job *job); > > > > void drm_sched_wakeup(struct drm_gpu_scheduler *sched); > > > > +void drm_sched_add_msg(struct drm_gpu_scheduler *sched, > > > > + struct drm_sched_msg *msg); > > > > void drm_sched_run_wq_stop(struct drm_gpu_scheduler *sched); > > > > void drm_sched_run_wq_start(struct drm_gpu_scheduler *sched); > > > > void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad); > > >
On Fri, Aug 04, 2023 at 10:50:36AM +0200, Daniel Vetter wrote: > On Thu, Aug 03, 2023 at 11:35:30AM +0200, Christian König wrote: > > Am 03.08.23 um 10:58 schrieb Daniel Vetter: > > > On Thu, 3 Aug 2023 at 10:53, Christian König <christian.koenig@amd.com> wrote: > > > > Am 01.08.23 um 22:50 schrieb Matthew Brost: > > > > > Add generic schedule message interface which sends messages to backend > > > > > from the drm_gpu_scheduler main submission thread. The idea is some of > > > > > these messages modify some state in drm_sched_entity which is also > > > > > modified during submission. By scheduling these messages and submission > > > > > in the same thread their is not race changing states in > > > > > drm_sched_entity. > > > > > > > > > > This interface will be used in XE, new Intel GPU driver, to cleanup, > > > > > suspend, resume, and change scheduling properties of a drm_sched_entity. > > > > > > > > > > The interface is designed to be generic and extendable with only the > > > > > backend understanding the messages. Christian / Daniel - I've read both of your comments and am having a hard time parsing them. I do not really understand the issue with this patch or exactly what is being suggested instead. Let's try to work through this. > > > > I'm still extremely frowned on this. > > > > > > > > If you need this functionality then let the drivers decide which > > > > runqueue the scheduler should use. What do you mean by runqueue here? Do you mean 'struct workqueue_struct'? The scheduler in this context is 'struct drm_gpu_scheduler', right? Yes, we have added this functionality in the first patch. > > > > > > > > When you then create a single threaded runqueue you can just submit work > > > > to it and serialize this with the scheduler work. 
> > > > We don't want to use a single threaded workqueue_struct in Xe, we want to use a system_wq as run_job() can be executed in parallel across multiple entities (or drm_gpu_scheduler as in Xe we have 1 to 1 relationship between entity and scheduler). What we want is on per entity / scheduler granularity to be able to communicate into the backend a message synchronously (run_job / free_job not executing, scheduler execution not paused for a reset). If I'm understanding what you're suggesting, in Xe we'd create an ordered workqueue_struct per drm_gpu_scheduler and then queue messages on the ordered workqueue_struct? This seems pretty messy to me as now we have open coded a solution bypassing the scheduler, every drm_gpu_scheduler creates its own workqueue_struct, and we'd also have to open code the pausing of these messages for resets too. IMO this is a pretty clean solution that follows the pattern of cleanup jobs already in place. > > > > This way we wouldn't duplicate this core kernel function inside the > > > > scheduler. > > > Yeah that's essentially the design we picked for the tdr workers, > > > where some drivers have requirements that all tdr work must be done on > > > the same thread (because of cross-engine coordination issues). But > > > that would require that we rework the scheduler as a pile of > > > self-submitting work items, and I'm not sure that actually fits all > > > that well into the core workqueue interfaces either. For the ordering between TDRs firing between different drm_gpu_schedulers and larger external resets (GT in Xe), an ordered workqueue_struct makes sense. Here we are talking about ordering jobs and messages within a single drm_gpu_scheduler. Using the main execution thread to do ordering makes sense in my opinion. > > > > There were already patches floating around which did exactly that. > > > > Last time I checked those were actually looking pretty good. > > Link to patches for reference. 
> > Additional to message passing advantage the real big issue with the > > scheduler and 1 to 1 mapping is that we create a kernel thread for each > > instance, which results in tons on overhead. First patch in the series switches from kthread to work queue, that is still a good idea. > > > > Just using a work item which is submitted to a work queue completely avoids > > that. > > Hm I should have read the entire series first, since that does the > conversion still. Apologies for the confusion, and yeah we should be able > to just submit other work to the same wq with the first patch? And so > hand-rolling this infra here isn't needed at all? > I wouldn't call this hand rolling, rather it follows a pattern already in place. Matt > Or what am I missing? > > > Regards, > > Christian. > > > > > > > > Worst case I think this isn't a dead-end and can be refactored to > > > internally use the workqueue services, with the new functions here > > > just being dumb wrappers until everyone is converted over. So it > > > doesn't look like an expensive mistake, if it turns out to be a > > > mistake. > > > -Daniel > > > > > > > > > > Regards, > > > > Christian. 
> > > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > > > > --- > > > > > drivers/gpu/drm/scheduler/sched_main.c | 52 +++++++++++++++++++++++++- > > > > > include/drm/gpu_scheduler.h | 29 +++++++++++++- > > > > > 2 files changed, 78 insertions(+), 3 deletions(-) > > > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c > > > > > index 2597fb298733..84821a124ca2 100644 > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c > > > > > @@ -1049,6 +1049,49 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list, > > > > > } > > > > > EXPORT_SYMBOL(drm_sched_pick_best); > > > > > > > > > > +/** > > > > > + * drm_sched_add_msg - add scheduler message > > > > > + * > > > > > + * @sched: scheduler instance > > > > > + * @msg: message to be added > > > > > + * > > > > > + * Can and will pass an jobs waiting on dependencies or in a runnable queue. > > > > > + * Messages processing will stop if schedule run wq is stopped and resume when > > > > > + * run wq is started. 
> > > > > + */ > > > > > +void drm_sched_add_msg(struct drm_gpu_scheduler *sched, > > > > > + struct drm_sched_msg *msg) > > > > > +{ > > > > > + spin_lock(&sched->job_list_lock); > > > > > + list_add_tail(&msg->link, &sched->msgs); > > > > > + spin_unlock(&sched->job_list_lock); > > > > > + > > > > > + drm_sched_run_wq_queue(sched); > > > > > +} > > > > > +EXPORT_SYMBOL(drm_sched_add_msg); > > > > > + > > > > > +/** > > > > > + * drm_sched_get_msg - get scheduler message > > > > > + * > > > > > + * @sched: scheduler instance > > > > > + * > > > > > + * Returns NULL or message > > > > > + */ > > > > > +static struct drm_sched_msg * > > > > > +drm_sched_get_msg(struct drm_gpu_scheduler *sched) > > > > > +{ > > > > > + struct drm_sched_msg *msg; > > > > > + > > > > > + spin_lock(&sched->job_list_lock); > > > > > + msg = list_first_entry_or_null(&sched->msgs, > > > > > + struct drm_sched_msg, link); > > > > > + if (msg) > > > > > + list_del(&msg->link); > > > > > + spin_unlock(&sched->job_list_lock); > > > > > + > > > > > + return msg; > > > > > +} > > > > > + > > > > > /** > > > > > * drm_sched_main - main scheduler thread > > > > > * > > > > > @@ -1060,6 +1103,7 @@ static void drm_sched_main(struct work_struct *w) > > > > > container_of(w, struct drm_gpu_scheduler, work_run); > > > > > struct drm_sched_entity *entity; > > > > > struct drm_sched_job *cleanup_job; > > > > > + struct drm_sched_msg *msg; > > > > > int r; > > > > > > > > > > if (READ_ONCE(sched->pause_run_wq)) > > > > > @@ -1067,12 +1111,15 @@ static void drm_sched_main(struct work_struct *w) > > > > > > > > > > cleanup_job = drm_sched_get_cleanup_job(sched); > > > > > entity = drm_sched_select_entity(sched); > > > > > + msg = drm_sched_get_msg(sched); > > > > > > > > > > - if (!entity && !cleanup_job) > > > > > + if (!entity && !cleanup_job && !msg) > > > > > return; /* No more work */ > > > > > > > > > > if (cleanup_job) > > > > > sched->ops->free_job(cleanup_job); > > > > > + if (msg) > > > > > + 
sched->ops->process_msg(msg); > > > > > > > > > > if (entity) { > > > > > struct dma_fence *fence; > > > > > @@ -1082,7 +1129,7 @@ static void drm_sched_main(struct work_struct *w) > > > > > sched_job = drm_sched_entity_pop_job(entity); > > > > > if (!sched_job) { > > > > > complete_all(&entity->entity_idle); > > > > > - if (!cleanup_job) > > > > > + if (!cleanup_job && !msg) > > > > > return; /* No more work */ > > > > > goto again; > > > > > } > > > > > @@ -1177,6 +1224,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, > > > > > > > > > > init_waitqueue_head(&sched->job_scheduled); > > > > > INIT_LIST_HEAD(&sched->pending_list); > > > > > + INIT_LIST_HEAD(&sched->msgs); > > > > > spin_lock_init(&sched->job_list_lock); > > > > > atomic_set(&sched->hw_rq_count, 0); > > > > > INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout); > > > > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h > > > > > index df1993dd44ae..267bd060d178 100644 > > > > > --- a/include/drm/gpu_scheduler.h > > > > > +++ b/include/drm/gpu_scheduler.h > > > > > @@ -394,6 +394,23 @@ enum drm_gpu_sched_stat { > > > > > DRM_GPU_SCHED_STAT_ENODEV, > > > > > }; > > > > > > > > > > +/** > > > > > + * struct drm_sched_msg - an in-band (relative to GPU scheduler run queue) > > > > > + * message > > > > > + * > > > > > + * Generic enough for backend defined messages, backend can expand if needed. 
> > > > > + */ > > > > > +struct drm_sched_msg { > > > > > + /** @link: list link into the gpu scheduler list of messages */ > > > > > + struct list_head link; > > > > > + /** > > > > > + * @private_data: opaque pointer to message private data (backend defined) > > > > > + */ > > > > > + void *private_data; > > > > > + /** @opcode: opcode of message (backend defined) */ > > > > > + unsigned int opcode; > > > > > +}; > > > > > + > > > > > /** > > > > > * struct drm_sched_backend_ops - Define the backend operations > > > > > * called by the scheduler > > > > > @@ -471,6 +488,12 @@ struct drm_sched_backend_ops { > > > > > * and it's time to clean it up. > > > > > */ > > > > > void (*free_job)(struct drm_sched_job *sched_job); > > > > > + > > > > > + /** > > > > > + * @process_msg: Process a message. Allowed to block, it is this > > > > > + * function's responsibility to free message if dynamically allocated. > > > > > + */ > > > > > + void (*process_msg)(struct drm_sched_msg *msg); > > > > > }; > > > > > > > > > > /** > > > > > @@ -482,6 +505,7 @@ struct drm_sched_backend_ops { > > > > > * @timeout: the time after which a job is removed from the scheduler. > > > > > * @name: name of the ring for which this scheduler is being used. > > > > > * @sched_rq: priority wise array of run queues. > > > > > + * @msgs: list of messages to be processed in @work_run > > > > > * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler > > > > > * waits on this wait queue until all the scheduled jobs are > > > > > * finished. > > > > > @@ -489,7 +513,7 @@ struct drm_sched_backend_ops { > > > > > * @job_id_count: used to assign unique id to the each job. 
> > > > > * @run_wq: workqueue used to queue @work_run > > > > > * @timeout_wq: workqueue used to queue @work_tdr > > > > > - * @work_run: schedules jobs and cleans up entities > > > > > + * @work_run: schedules jobs, cleans up jobs, and processes messages > > > > > * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the > > > > > * timeout interval is over. > > > > > * @pending_list: the list of jobs which are currently in the job queue. > > > > > @@ -513,6 +537,7 @@ struct drm_gpu_scheduler { > > > > > long timeout; > > > > > const char *name; > > > > > struct drm_sched_rq sched_rq[DRM_SCHED_PRIORITY_COUNT]; > > > > > + struct list_head msgs; > > > > > wait_queue_head_t job_scheduled; > > > > > atomic_t hw_rq_count; > > > > > atomic64_t job_id_count; > > > > > @@ -566,6 +591,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity, > > > > > > > > > > void drm_sched_job_cleanup(struct drm_sched_job *job); > > > > > void drm_sched_wakeup(struct drm_gpu_scheduler *sched); > > > > > +void drm_sched_add_msg(struct drm_gpu_scheduler *sched, > > > > > + struct drm_sched_msg *msg); > > > > > void drm_sched_run_wq_stop(struct drm_gpu_scheduler *sched); > > > > > void drm_sched_run_wq_start(struct drm_gpu_scheduler *sched); > > > > > void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad); > > > > > > > -- > Daniel Vetter > Software Engineer, Intel Corporation > http://blog.ffwll.ch
Am 04.08.23 um 16:13 schrieb Matthew Brost: > [SNIP] > Christian / Daniel - I've read both of you comments and having a hard > time parsing them. I do not really understand the issue with this patch > or exactly what is being suggested instead. Let's try to work through > this. > >>>>> I'm still extremely frowned on this. >>>>> >>>>> If you need this functionality then let the drivers decide which >>>>> runqueue the scheduler should use. > What do you mean by runqueue here? Do you mean 'struct > workqueue_struct'? The scheduler in this context is 'struct > drm_gpu_scheduler', right? Sorry for the confusing wording, your understanding is correct. > Yes, we have added this functionality iin the first patch. > >>>>> When you then create a single threaded runqueue you can just submit work >>>>> to it and serialize this with the scheduler work. >>>>> > We don't want to use a single threaded workqueue_struct in Xe, we want > to use a system_wq as run_job() can be executed in parallel across > multiple entites (or drm_gpu_scheduler as in Xe we have 1 to 1 > relationship between entity and scheduler). What we want is on per > entity / scheduler granularity to be able to communicate into the > backend a message synchronously (run_job / free_job not executing, > scheduler execution not paused for a reset). > > If I'm underatanding what you suggesting in Xe we'd create an ordered > workqueue_struct per drm_gpu_scheduler and the queue messages on the > ordered workqueue_struct? Yes, correct. > This seems pretty messy to me as now we have > open coded a solution bypassing the scheduler, every drm_gpu_scheduler > creates its own workqueue_struct, and we'd also have to open code the > pausing of these messages for resets too. > > IMO this is pretty clean solution that follows the pattern of cleanup > jobs already in place. Yeah, exactly that's the point. Moving the job cleanup into the scheduler thread is seen as very very bad idea by me. 
And I really don't want to exercise that again for different use cases. > >>>>> This way we wouldn't duplicate this core kernel function inside the >>>>> scheduler. >>>> Yeah that's essentially the design we picked for the tdr workers, >>>> where some drivers have requirements that all tdr work must be done on >>>> the same thread (because of cross-engine coordination issues). But >>>> that would require that we rework the scheduler as a pile of >>>> self-submitting work items, and I'm not sure that actually fits all >>>> that well into the core workqueue interfaces either. > This is the ordering between TDRs firing between different > drm_gpu_scheduler and larger external resets (GT in Xe) an ordered > workqueue_struct makes sense for this. Here we are talking about > ordering jobs and messages within a single drm_gpu_scheduler. Using the > main execution thread to do ordering makes sense in my opinion. I completely disagree to that. Take a look at how this came to be. This is a very very ugly hack and we already had a hard time making lockdep understand the different fence signaling dependencies with freeing the job and I'm pretty sure that is still not 100% correct. > >>> There were already patches floating around which did exactly that. >>> >>> Last time I checked those were actually looking pretty good. >>> > Link to patches for reference. > >>> Additional to message passing advantage the real big issue with the >>> scheduler and 1 to 1 mapping is that we create a kernel thread for each >>> instance, which results in tons on overhead. > First patch in the series switches from kthread to work queue, that is > still a good idea. This was the patch I was referring to. Sorry didn't remembered that this was in the same patch set. > >>> Just using a work item which is submitted to a work queue completely avoids >>> that. >> Hm I should have read the entire series first, since that does the >> conversion still. 
Apologies for the confusion, and yeah we should be able >> to just submit other work to the same wq with the first patch? And so >> hand-rolling this infra here isn't needed at all? >> > I wouldn't call this hand rolling, rather it following patten already in > place. Basically workqueues are the in kernel infrastructure for exactly that use case and we are trying to re-create that here and that is usually a rather bad idea. Regards, Christian. > > Matt > >> Or what am I missing? >> >>> Regards, >>> Christian. >>> >>>> Worst case I think this isn't a dead-end and can be refactored to >>>> internally use the workqueue services, with the new functions here >>>> just being dumb wrappers until everyone is converted over. So it >>>> doesn't look like an expensive mistake, if it turns out to be a >>>> mistake. >>>> -Daniel >>>> >>>> >>>>> Regards, >>>>> Christian. >>>>> >>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com> >>>>>> --- >>>>>> drivers/gpu/drm/scheduler/sched_main.c | 52 +++++++++++++++++++++++++- >>>>>> include/drm/gpu_scheduler.h | 29 +++++++++++++- >>>>>> 2 files changed, 78 insertions(+), 3 deletions(-) >>>>>> >>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c >>>>>> index 2597fb298733..84821a124ca2 100644 >>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c >>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c >>>>>> @@ -1049,6 +1049,49 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list, >>>>>> } >>>>>> EXPORT_SYMBOL(drm_sched_pick_best); >>>>>> >>>>>> +/** >>>>>> + * drm_sched_add_msg - add scheduler message >>>>>> + * >>>>>> + * @sched: scheduler instance >>>>>> + * @msg: message to be added >>>>>> + * >>>>>> + * Can and will pass an jobs waiting on dependencies or in a runnable queue. >>>>>> + * Messages processing will stop if schedule run wq is stopped and resume when >>>>>> + * run wq is started. 
>>>>>> + */ >>>>>> +void drm_sched_add_msg(struct drm_gpu_scheduler *sched, >>>>>> + struct drm_sched_msg *msg) >>>>>> +{ >>>>>> + spin_lock(&sched->job_list_lock); >>>>>> + list_add_tail(&msg->link, &sched->msgs); >>>>>> + spin_unlock(&sched->job_list_lock); >>>>>> + >>>>>> + drm_sched_run_wq_queue(sched); >>>>>> +} >>>>>> +EXPORT_SYMBOL(drm_sched_add_msg); >>>>>> + >>>>>> +/** >>>>>> + * drm_sched_get_msg - get scheduler message >>>>>> + * >>>>>> + * @sched: scheduler instance >>>>>> + * >>>>>> + * Returns NULL or message >>>>>> + */ >>>>>> +static struct drm_sched_msg * >>>>>> +drm_sched_get_msg(struct drm_gpu_scheduler *sched) >>>>>> +{ >>>>>> + struct drm_sched_msg *msg; >>>>>> + >>>>>> + spin_lock(&sched->job_list_lock); >>>>>> + msg = list_first_entry_or_null(&sched->msgs, >>>>>> + struct drm_sched_msg, link); >>>>>> + if (msg) >>>>>> + list_del(&msg->link); >>>>>> + spin_unlock(&sched->job_list_lock); >>>>>> + >>>>>> + return msg; >>>>>> +} >>>>>> + >>>>>> /** >>>>>> * drm_sched_main - main scheduler thread >>>>>> * >>>>>> @@ -1060,6 +1103,7 @@ static void drm_sched_main(struct work_struct *w) >>>>>> container_of(w, struct drm_gpu_scheduler, work_run); >>>>>> struct drm_sched_entity *entity; >>>>>> struct drm_sched_job *cleanup_job; >>>>>> + struct drm_sched_msg *msg; >>>>>> int r; >>>>>> >>>>>> if (READ_ONCE(sched->pause_run_wq)) >>>>>> @@ -1067,12 +1111,15 @@ static void drm_sched_main(struct work_struct *w) >>>>>> >>>>>> cleanup_job = drm_sched_get_cleanup_job(sched); >>>>>> entity = drm_sched_select_entity(sched); >>>>>> + msg = drm_sched_get_msg(sched); >>>>>> >>>>>> - if (!entity && !cleanup_job) >>>>>> + if (!entity && !cleanup_job && !msg) >>>>>> return; /* No more work */ >>>>>> >>>>>> if (cleanup_job) >>>>>> sched->ops->free_job(cleanup_job); >>>>>> + if (msg) >>>>>> + sched->ops->process_msg(msg); >>>>>> >>>>>> if (entity) { >>>>>> struct dma_fence *fence; >>>>>> @@ -1082,7 +1129,7 @@ static void drm_sched_main(struct work_struct *w) >>>>>> 
sched_job = drm_sched_entity_pop_job(entity); >>>>>> if (!sched_job) { >>>>>> complete_all(&entity->entity_idle); >>>>>> - if (!cleanup_job) >>>>>> + if (!cleanup_job && !msg) >>>>>> return; /* No more work */ >>>>>> goto again; >>>>>> } >>>>>> @@ -1177,6 +1224,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, >>>>>> >>>>>> init_waitqueue_head(&sched->job_scheduled); >>>>>> INIT_LIST_HEAD(&sched->pending_list); >>>>>> + INIT_LIST_HEAD(&sched->msgs); >>>>>> spin_lock_init(&sched->job_list_lock); >>>>>> atomic_set(&sched->hw_rq_count, 0); >>>>>> INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout); >>>>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h >>>>>> index df1993dd44ae..267bd060d178 100644 >>>>>> --- a/include/drm/gpu_scheduler.h >>>>>> +++ b/include/drm/gpu_scheduler.h >>>>>> @@ -394,6 +394,23 @@ enum drm_gpu_sched_stat { >>>>>> DRM_GPU_SCHED_STAT_ENODEV, >>>>>> }; >>>>>> >>>>>> +/** >>>>>> + * struct drm_sched_msg - an in-band (relative to GPU scheduler run queue) >>>>>> + * message >>>>>> + * >>>>>> + * Generic enough for backend defined messages, backend can expand if needed. >>>>>> + */ >>>>>> +struct drm_sched_msg { >>>>>> + /** @link: list link into the gpu scheduler list of messages */ >>>>>> + struct list_head link; >>>>>> + /** >>>>>> + * @private_data: opaque pointer to message private data (backend defined) >>>>>> + */ >>>>>> + void *private_data; >>>>>> + /** @opcode: opcode of message (backend defined) */ >>>>>> + unsigned int opcode; >>>>>> +}; >>>>>> + >>>>>> /** >>>>>> * struct drm_sched_backend_ops - Define the backend operations >>>>>> * called by the scheduler >>>>>> @@ -471,6 +488,12 @@ struct drm_sched_backend_ops { >>>>>> * and it's time to clean it up. >>>>>> */ >>>>>> void (*free_job)(struct drm_sched_job *sched_job); >>>>>> + >>>>>> + /** >>>>>> + * @process_msg: Process a message. Allowed to block, it is this >>>>>> + * function's responsibility to free message if dynamically allocated. 
>>>>>> + */ >>>>>> + void (*process_msg)(struct drm_sched_msg *msg); >>>>>> }; >>>>>> >>>>>> /** >>>>>> @@ -482,6 +505,7 @@ struct drm_sched_backend_ops { >>>>>> * @timeout: the time after which a job is removed from the scheduler. >>>>>> * @name: name of the ring for which this scheduler is being used. >>>>>> * @sched_rq: priority wise array of run queues. >>>>>> + * @msgs: list of messages to be processed in @work_run >>>>>> * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler >>>>>> * waits on this wait queue until all the scheduled jobs are >>>>>> * finished. >>>>>> @@ -489,7 +513,7 @@ struct drm_sched_backend_ops { >>>>>> * @job_id_count: used to assign unique id to the each job. >>>>>> * @run_wq: workqueue used to queue @work_run >>>>>> * @timeout_wq: workqueue used to queue @work_tdr >>>>>> - * @work_run: schedules jobs and cleans up entities >>>>>> + * @work_run: schedules jobs, cleans up jobs, and processes messages >>>>>> * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the >>>>>> * timeout interval is over. >>>>>> * @pending_list: the list of jobs which are currently in the job queue. 
>>>>>> @@ -513,6 +537,7 @@ struct drm_gpu_scheduler { >>>>>> long timeout; >>>>>> const char *name; >>>>>> struct drm_sched_rq sched_rq[DRM_SCHED_PRIORITY_COUNT]; >>>>>> + struct list_head msgs; >>>>>> wait_queue_head_t job_scheduled; >>>>>> atomic_t hw_rq_count; >>>>>> atomic64_t job_id_count; >>>>>> @@ -566,6 +591,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity, >>>>>> >>>>>> void drm_sched_job_cleanup(struct drm_sched_job *job); >>>>>> void drm_sched_wakeup(struct drm_gpu_scheduler *sched); >>>>>> +void drm_sched_add_msg(struct drm_gpu_scheduler *sched, >>>>>> + struct drm_sched_msg *msg); >>>>>> void drm_sched_run_wq_stop(struct drm_gpu_scheduler *sched); >>>>>> void drm_sched_run_wq_start(struct drm_gpu_scheduler *sched); >>>>>> void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad); >> -- >> Daniel Vetter >> Software Engineer, Intel Corporation >> http://blog.ffwll.ch
On Mon, Aug 07, 2023 at 05:46:16PM +0200, Christian König wrote: > Am 04.08.23 um 16:13 schrieb Matthew Brost: > > [SNIP] > > Christian / Daniel - I've read both of you comments and having a hard > > time parsing them. I do not really understand the issue with this patch > > or exactly what is being suggested instead. Let's try to work through > > this. > > > > > > > > I'm still extremely frowned on this. > > > > > > > > > > > > If you need this functionality then let the drivers decide which > > > > > > runqueue the scheduler should use. > > What do you mean by runqueue here? Do you mean 'struct > > workqueue_struct'? The scheduler in this context is 'struct > > drm_gpu_scheduler', right? > > Sorry for the confusing wording, your understanding is correct. > > > Yes, we have added this functionality iin the first patch. > > > > > > > > When you then create a single threaded runqueue you can just submit work > > > > > > to it and serialize this with the scheduler work. > > > > > > > > We don't want to use a single threaded workqueue_struct in Xe, we want > > to use a system_wq as run_job() can be executed in parallel across > > multiple entites (or drm_gpu_scheduler as in Xe we have 1 to 1 > > relationship between entity and scheduler). What we want is on per > > entity / scheduler granularity to be able to communicate into the > > backend a message synchronously (run_job / free_job not executing, > > scheduler execution not paused for a reset). > > > > If I'm underatanding what you suggesting in Xe we'd create an ordered > > workqueue_struct per drm_gpu_scheduler and the queue messages on the > > ordered workqueue_struct? > > Yes, correct. > > > This seems pretty messy to me as now we have > > open coded a solution bypassing the scheduler, every drm_gpu_scheduler > > creates its own workqueue_struct, and we'd also have to open code the > > pausing of these messages for resets too. 
> > > > IMO this is pretty clean solution that follows the pattern of cleanup > > jobs already in place. > > Yeah, exactly that's the point. Moving the job cleanup into the scheduler > thread is seen as very very bad idea by me. > > And I really don't want to exercise that again for different use cases. > > > > > > > > > This way we wouldn't duplicate this core kernel function inside the > > > > > > scheduler. > > > > > Yeah that's essentially the design we picked for the tdr workers, > > > > > where some drivers have requirements that all tdr work must be done on > > > > > the same thread (because of cross-engine coordination issues). But > > > > > that would require that we rework the scheduler as a pile of > > > > > self-submitting work items, and I'm not sure that actually fits all > > > > > that well into the core workqueue interfaces either. > > This is the ordering between TDRs firing between different > > drm_gpu_scheduler and larger external resets (GT in Xe) an ordered > > workqueue_struct makes sense for this. Here we are talking about > > ordering jobs and messages within a single drm_gpu_scheduler. Using the > > main execution thread to do ordering makes sense in my opinion. > > I completely disagree to that. > > Take a look at how this came to be. This is a very very ugly hack and we > already had a hard time making lockdep understand the different fence > signaling dependencies with freeing the job and I'm pretty sure that is > still not 100% correct. > > > > > > > There were already patches floating around which did exactly that. > > > > > > > > Last time I checked those were actually looking pretty good. > > > > > > Link to patches for reference. > > > > > > Additional to message passing advantage the real big issue with the > > > > scheduler and 1 to 1 mapping is that we create a kernel thread for each > > > > instance, which results in tons on overhead. 
> > First patch in the series switches from kthread to work queue, that is > > still a good idea. > > This was the patch I was referring to. Sorry didn't remembered that this was > in the same patch set. > > > > > > > Just using a work item which is submitted to a work queue completely avoids > > > > that. > > > Hm I should have read the entire series first, since that does the > > > conversion still. Apologies for the confusion, and yeah we should be able > > > to just submit other work to the same wq with the first patch? And so > > > hand-rolling this infra here isn't needed at all? > > > > > I wouldn't call this hand rolling, rather it following patten already in > > place. > > Basically workqueues are the in kernel infrastructure for exactly that use > case and we are trying to re-create that here and that is usually a rather > bad idea. > Ok let me play around with what this would look like in Xe, what you are suggesting would be ordered-wq per scheduler, work item for run job, work item for clean up job, and work item for a message. That might work I suppose? Only issue I see is scaling as this exposes an ordered-wq creation directly to an IOCTL. No idea if that is actually a concern though. Matt > Regards, > Christian. > > > > > Matt > > > > > Or what am I missing? > > > > > > > Regards, > > > > Christian. > > > > > > > > > Worst case I think this isn't a dead-end and can be refactored to > > > > > internally use the workqueue services, with the new functions here > > > > > just being dumb wrappers until everyone is converted over. So it > > > > > doesn't look like an expensive mistake, if it turns out to be a > > > > > mistake. > > > > > -Daniel > > > > > > > > > > > > > > > > Regards, > > > > > > Christian. 
> > > > > > > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > > > > > > --- > > > > > > > drivers/gpu/drm/scheduler/sched_main.c | 52 +++++++++++++++++++++++++- > > > > > > > include/drm/gpu_scheduler.h | 29 +++++++++++++- > > > > > > > 2 files changed, 78 insertions(+), 3 deletions(-) > > > > > > > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > index 2597fb298733..84821a124ca2 100644 > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > @@ -1049,6 +1049,49 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list, > > > > > > > } > > > > > > > EXPORT_SYMBOL(drm_sched_pick_best); > > > > > > > > > > > > > > +/** > > > > > > > + * drm_sched_add_msg - add scheduler message > > > > > > > + * > > > > > > > + * @sched: scheduler instance > > > > > > > + * @msg: message to be added > > > > > > > + * > > > > > > > + * Can and will pass an jobs waiting on dependencies or in a runnable queue. > > > > > > > + * Messages processing will stop if schedule run wq is stopped and resume when > > > > > > > + * run wq is started. 
> > > > > > > + */ > > > > > > > +void drm_sched_add_msg(struct drm_gpu_scheduler *sched, > > > > > > > + struct drm_sched_msg *msg) > > > > > > > +{ > > > > > > > + spin_lock(&sched->job_list_lock); > > > > > > > + list_add_tail(&msg->link, &sched->msgs); > > > > > > > + spin_unlock(&sched->job_list_lock); > > > > > > > + > > > > > > > + drm_sched_run_wq_queue(sched); > > > > > > > +} > > > > > > > +EXPORT_SYMBOL(drm_sched_add_msg); > > > > > > > + > > > > > > > +/** > > > > > > > + * drm_sched_get_msg - get scheduler message > > > > > > > + * > > > > > > > + * @sched: scheduler instance > > > > > > > + * > > > > > > > + * Returns NULL or message > > > > > > > + */ > > > > > > > +static struct drm_sched_msg * > > > > > > > +drm_sched_get_msg(struct drm_gpu_scheduler *sched) > > > > > > > +{ > > > > > > > + struct drm_sched_msg *msg; > > > > > > > + > > > > > > > + spin_lock(&sched->job_list_lock); > > > > > > > + msg = list_first_entry_or_null(&sched->msgs, > > > > > > > + struct drm_sched_msg, link); > > > > > > > + if (msg) > > > > > > > + list_del(&msg->link); > > > > > > > + spin_unlock(&sched->job_list_lock); > > > > > > > + > > > > > > > + return msg; > > > > > > > +} > > > > > > > + > > > > > > > /** > > > > > > > * drm_sched_main - main scheduler thread > > > > > > > * > > > > > > > @@ -1060,6 +1103,7 @@ static void drm_sched_main(struct work_struct *w) > > > > > > > container_of(w, struct drm_gpu_scheduler, work_run); > > > > > > > struct drm_sched_entity *entity; > > > > > > > struct drm_sched_job *cleanup_job; > > > > > > > + struct drm_sched_msg *msg; > > > > > > > int r; > > > > > > > > > > > > > > if (READ_ONCE(sched->pause_run_wq)) > > > > > > > @@ -1067,12 +1111,15 @@ static void drm_sched_main(struct work_struct *w) > > > > > > > > > > > > > > cleanup_job = drm_sched_get_cleanup_job(sched); > > > > > > > entity = drm_sched_select_entity(sched); > > > > > > > + msg = drm_sched_get_msg(sched); > > > > > > > > > > > > > > - if (!entity && 
!cleanup_job) > > > > > > > + if (!entity && !cleanup_job && !msg) > > > > > > > return; /* No more work */ > > > > > > > > > > > > > > if (cleanup_job) > > > > > > > sched->ops->free_job(cleanup_job); > > > > > > > + if (msg) > > > > > > > + sched->ops->process_msg(msg); > > > > > > > > > > > > > > if (entity) { > > > > > > > struct dma_fence *fence; > > > > > > > @@ -1082,7 +1129,7 @@ static void drm_sched_main(struct work_struct *w) > > > > > > > sched_job = drm_sched_entity_pop_job(entity); > > > > > > > if (!sched_job) { > > > > > > > complete_all(&entity->entity_idle); > > > > > > > - if (!cleanup_job) > > > > > > > + if (!cleanup_job && !msg) > > > > > > > return; /* No more work */ > > > > > > > goto again; > > > > > > > } > > > > > > > @@ -1177,6 +1224,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, > > > > > > > > > > > > > > init_waitqueue_head(&sched->job_scheduled); > > > > > > > INIT_LIST_HEAD(&sched->pending_list); > > > > > > > + INIT_LIST_HEAD(&sched->msgs); > > > > > > > spin_lock_init(&sched->job_list_lock); > > > > > > > atomic_set(&sched->hw_rq_count, 0); > > > > > > > INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout); > > > > > > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h > > > > > > > index df1993dd44ae..267bd060d178 100644 > > > > > > > --- a/include/drm/gpu_scheduler.h > > > > > > > +++ b/include/drm/gpu_scheduler.h > > > > > > > @@ -394,6 +394,23 @@ enum drm_gpu_sched_stat { > > > > > > > DRM_GPU_SCHED_STAT_ENODEV, > > > > > > > }; > > > > > > > > > > > > > > +/** > > > > > > > + * struct drm_sched_msg - an in-band (relative to GPU scheduler run queue) > > > > > > > + * message > > > > > > > + * > > > > > > > + * Generic enough for backend defined messages, backend can expand if needed. 
> > > > > > > + */ > > > > > > > +struct drm_sched_msg { > > > > > > > + /** @link: list link into the gpu scheduler list of messages */ > > > > > > > + struct list_head link; > > > > > > > + /** > > > > > > > + * @private_data: opaque pointer to message private data (backend defined) > > > > > > > + */ > > > > > > > + void *private_data; > > > > > > > + /** @opcode: opcode of message (backend defined) */ > > > > > > > + unsigned int opcode; > > > > > > > +}; > > > > > > > + > > > > > > > /** > > > > > > > * struct drm_sched_backend_ops - Define the backend operations > > > > > > > * called by the scheduler > > > > > > > @@ -471,6 +488,12 @@ struct drm_sched_backend_ops { > > > > > > > * and it's time to clean it up. > > > > > > > */ > > > > > > > void (*free_job)(struct drm_sched_job *sched_job); > > > > > > > + > > > > > > > + /** > > > > > > > + * @process_msg: Process a message. Allowed to block, it is this > > > > > > > + * function's responsibility to free message if dynamically allocated. > > > > > > > + */ > > > > > > > + void (*process_msg)(struct drm_sched_msg *msg); > > > > > > > }; > > > > > > > > > > > > > > /** > > > > > > > @@ -482,6 +505,7 @@ struct drm_sched_backend_ops { > > > > > > > * @timeout: the time after which a job is removed from the scheduler. > > > > > > > * @name: name of the ring for which this scheduler is being used. > > > > > > > * @sched_rq: priority wise array of run queues. > > > > > > > + * @msgs: list of messages to be processed in @work_run > > > > > > > * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler > > > > > > > * waits on this wait queue until all the scheduled jobs are > > > > > > > * finished. > > > > > > > @@ -489,7 +513,7 @@ struct drm_sched_backend_ops { > > > > > > > * @job_id_count: used to assign unique id to the each job. 
> > > > > > > * @run_wq: workqueue used to queue @work_run > > > > > > > * @timeout_wq: workqueue used to queue @work_tdr > > > > > > > - * @work_run: schedules jobs and cleans up entities > > > > > > > + * @work_run: schedules jobs, cleans up jobs, and processes messages > > > > > > > * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the > > > > > > > * timeout interval is over. > > > > > > > * @pending_list: the list of jobs which are currently in the job queue. > > > > > > > @@ -513,6 +537,7 @@ struct drm_gpu_scheduler { > > > > > > > long timeout; > > > > > > > const char *name; > > > > > > > struct drm_sched_rq sched_rq[DRM_SCHED_PRIORITY_COUNT]; > > > > > > > + struct list_head msgs; > > > > > > > wait_queue_head_t job_scheduled; > > > > > > > atomic_t hw_rq_count; > > > > > > > atomic64_t job_id_count; > > > > > > > @@ -566,6 +591,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity, > > > > > > > > > > > > > > void drm_sched_job_cleanup(struct drm_sched_job *job); > > > > > > > void drm_sched_wakeup(struct drm_gpu_scheduler *sched); > > > > > > > +void drm_sched_add_msg(struct drm_gpu_scheduler *sched, > > > > > > > + struct drm_sched_msg *msg); > > > > > > > void drm_sched_run_wq_stop(struct drm_gpu_scheduler *sched); > > > > > > > void drm_sched_run_wq_start(struct drm_gpu_scheduler *sched); > > > > > > > void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad); > > > -- > > > Daniel Vetter > > > Software Engineer, Intel Corporation > > > http://blog.ffwll.ch >
Am 08.08.23 um 16:06 schrieb Matthew Brost: > [SNIP] >> Basically workqueues are the in kernel infrastructure for exactly that use >> case and we are trying to re-create that here and that is usually a rather >> bad idea. >> > Ok let me play around with what this would look like in Xe, what you are > suggesting would be ordered-wq per scheduler, work item for run job, > work item for clean up job, and work item for a message. That might > work I suppose? Only issue I see is scaling as this exposes an > ordered-wq creation directly to an IOCTL. No idea if that is actually a > concern though. That's a very good question I can't answer offhand either. But from the history of work queues I know that they were invented to reduce the overhead/costs of having many kernel threads. So my educated guess is that you probably won't find anything better at the moment. If work queues indeed don't match this use case then we need to figure out how to improve them or find a different solution. Christian. > > Matt > >> Regards, >> Christian. >> >>> Matt >>> >>>> Or what am I missing? >>>> >>>>> Regards, >>>>> Christian. >>>>> >>>>>> Worst case I think this isn't a dead-end and can be refactored to >>>>>> internally use the workqueue services, with the new functions here >>>>>> just being dumb wrappers until everyone is converted over. So it >>>>>> doesn't look like an expensive mistake, if it turns out to be a >>>>>> mistake. >>>>>> -Daniel >>>>>> >>>>>> >>>>>>> Regards, >>>>>>> Christian. 
>>>>>>> >>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com> >>>>>>>> --- >>>>>>>> drivers/gpu/drm/scheduler/sched_main.c | 52 +++++++++++++++++++++++++- >>>>>>>> include/drm/gpu_scheduler.h | 29 +++++++++++++- >>>>>>>> 2 files changed, 78 insertions(+), 3 deletions(-) >>>>>>>> >>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c >>>>>>>> index 2597fb298733..84821a124ca2 100644 >>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c >>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c >>>>>>>> @@ -1049,6 +1049,49 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list, >>>>>>>> } >>>>>>>> EXPORT_SYMBOL(drm_sched_pick_best); >>>>>>>> >>>>>>>> +/** >>>>>>>> + * drm_sched_add_msg - add scheduler message >>>>>>>> + * >>>>>>>> + * @sched: scheduler instance >>>>>>>> + * @msg: message to be added >>>>>>>> + * >>>>>>>> + * Can and will pass an jobs waiting on dependencies or in a runnable queue. >>>>>>>> + * Messages processing will stop if schedule run wq is stopped and resume when >>>>>>>> + * run wq is started. 
>>>>>>>> + */ >>>>>>>> +void drm_sched_add_msg(struct drm_gpu_scheduler *sched, >>>>>>>> + struct drm_sched_msg *msg) >>>>>>>> +{ >>>>>>>> + spin_lock(&sched->job_list_lock); >>>>>>>> + list_add_tail(&msg->link, &sched->msgs); >>>>>>>> + spin_unlock(&sched->job_list_lock); >>>>>>>> + >>>>>>>> + drm_sched_run_wq_queue(sched); >>>>>>>> +} >>>>>>>> +EXPORT_SYMBOL(drm_sched_add_msg); >>>>>>>> + >>>>>>>> +/** >>>>>>>> + * drm_sched_get_msg - get scheduler message >>>>>>>> + * >>>>>>>> + * @sched: scheduler instance >>>>>>>> + * >>>>>>>> + * Returns NULL or message >>>>>>>> + */ >>>>>>>> +static struct drm_sched_msg * >>>>>>>> +drm_sched_get_msg(struct drm_gpu_scheduler *sched) >>>>>>>> +{ >>>>>>>> + struct drm_sched_msg *msg; >>>>>>>> + >>>>>>>> + spin_lock(&sched->job_list_lock); >>>>>>>> + msg = list_first_entry_or_null(&sched->msgs, >>>>>>>> + struct drm_sched_msg, link); >>>>>>>> + if (msg) >>>>>>>> + list_del(&msg->link); >>>>>>>> + spin_unlock(&sched->job_list_lock); >>>>>>>> + >>>>>>>> + return msg; >>>>>>>> +} >>>>>>>> + >>>>>>>> /** >>>>>>>> * drm_sched_main - main scheduler thread >>>>>>>> * >>>>>>>> @@ -1060,6 +1103,7 @@ static void drm_sched_main(struct work_struct *w) >>>>>>>> container_of(w, struct drm_gpu_scheduler, work_run); >>>>>>>> struct drm_sched_entity *entity; >>>>>>>> struct drm_sched_job *cleanup_job; >>>>>>>> + struct drm_sched_msg *msg; >>>>>>>> int r; >>>>>>>> >>>>>>>> if (READ_ONCE(sched->pause_run_wq)) >>>>>>>> @@ -1067,12 +1111,15 @@ static void drm_sched_main(struct work_struct *w) >>>>>>>> >>>>>>>> cleanup_job = drm_sched_get_cleanup_job(sched); >>>>>>>> entity = drm_sched_select_entity(sched); >>>>>>>> + msg = drm_sched_get_msg(sched); >>>>>>>> >>>>>>>> - if (!entity && !cleanup_job) >>>>>>>> + if (!entity && !cleanup_job && !msg) >>>>>>>> return; /* No more work */ >>>>>>>> >>>>>>>> if (cleanup_job) >>>>>>>> sched->ops->free_job(cleanup_job); >>>>>>>> + if (msg) >>>>>>>> + sched->ops->process_msg(msg); >>>>>>>> >>>>>>>> if (entity) { 
>>>>>>>> struct dma_fence *fence; >>>>>>>> @@ -1082,7 +1129,7 @@ static void drm_sched_main(struct work_struct *w) >>>>>>>> sched_job = drm_sched_entity_pop_job(entity); >>>>>>>> if (!sched_job) { >>>>>>>> complete_all(&entity->entity_idle); >>>>>>>> - if (!cleanup_job) >>>>>>>> + if (!cleanup_job && !msg) >>>>>>>> return; /* No more work */ >>>>>>>> goto again; >>>>>>>> } >>>>>>>> @@ -1177,6 +1224,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, >>>>>>>> >>>>>>>> init_waitqueue_head(&sched->job_scheduled); >>>>>>>> INIT_LIST_HEAD(&sched->pending_list); >>>>>>>> + INIT_LIST_HEAD(&sched->msgs); >>>>>>>> spin_lock_init(&sched->job_list_lock); >>>>>>>> atomic_set(&sched->hw_rq_count, 0); >>>>>>>> INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout); >>>>>>>> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h >>>>>>>> index df1993dd44ae..267bd060d178 100644 >>>>>>>> --- a/include/drm/gpu_scheduler.h >>>>>>>> +++ b/include/drm/gpu_scheduler.h >>>>>>>> @@ -394,6 +394,23 @@ enum drm_gpu_sched_stat { >>>>>>>> DRM_GPU_SCHED_STAT_ENODEV, >>>>>>>> }; >>>>>>>> >>>>>>>> +/** >>>>>>>> + * struct drm_sched_msg - an in-band (relative to GPU scheduler run queue) >>>>>>>> + * message >>>>>>>> + * >>>>>>>> + * Generic enough for backend defined messages, backend can expand if needed. >>>>>>>> + */ >>>>>>>> +struct drm_sched_msg { >>>>>>>> + /** @link: list link into the gpu scheduler list of messages */ >>>>>>>> + struct list_head link; >>>>>>>> + /** >>>>>>>> + * @private_data: opaque pointer to message private data (backend defined) >>>>>>>> + */ >>>>>>>> + void *private_data; >>>>>>>> + /** @opcode: opcode of message (backend defined) */ >>>>>>>> + unsigned int opcode; >>>>>>>> +}; >>>>>>>> + >>>>>>>> /** >>>>>>>> * struct drm_sched_backend_ops - Define the backend operations >>>>>>>> * called by the scheduler >>>>>>>> @@ -471,6 +488,12 @@ struct drm_sched_backend_ops { >>>>>>>> * and it's time to clean it up. 
>>>>>>>> */ >>>>>>>> void (*free_job)(struct drm_sched_job *sched_job); >>>>>>>> + >>>>>>>> + /** >>>>>>>> + * @process_msg: Process a message. Allowed to block, it is this >>>>>>>> + * function's responsibility to free message if dynamically allocated. >>>>>>>> + */ >>>>>>>> + void (*process_msg)(struct drm_sched_msg *msg); >>>>>>>> }; >>>>>>>> >>>>>>>> /** >>>>>>>> @@ -482,6 +505,7 @@ struct drm_sched_backend_ops { >>>>>>>> * @timeout: the time after which a job is removed from the scheduler. >>>>>>>> * @name: name of the ring for which this scheduler is being used. >>>>>>>> * @sched_rq: priority wise array of run queues. >>>>>>>> + * @msgs: list of messages to be processed in @work_run >>>>>>>> * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler >>>>>>>> * waits on this wait queue until all the scheduled jobs are >>>>>>>> * finished. >>>>>>>> @@ -489,7 +513,7 @@ struct drm_sched_backend_ops { >>>>>>>> * @job_id_count: used to assign unique id to the each job. >>>>>>>> * @run_wq: workqueue used to queue @work_run >>>>>>>> * @timeout_wq: workqueue used to queue @work_tdr >>>>>>>> - * @work_run: schedules jobs and cleans up entities >>>>>>>> + * @work_run: schedules jobs, cleans up jobs, and processes messages >>>>>>>> * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the >>>>>>>> * timeout interval is over. >>>>>>>> * @pending_list: the list of jobs which are currently in the job queue. 
>>>>>>>> @@ -513,6 +537,7 @@ struct drm_gpu_scheduler { >>>>>>>> long timeout; >>>>>>>> const char *name; >>>>>>>> struct drm_sched_rq sched_rq[DRM_SCHED_PRIORITY_COUNT]; >>>>>>>> + struct list_head msgs; >>>>>>>> wait_queue_head_t job_scheduled; >>>>>>>> atomic_t hw_rq_count; >>>>>>>> atomic64_t job_id_count; >>>>>>>> @@ -566,6 +591,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity, >>>>>>>> >>>>>>>> void drm_sched_job_cleanup(struct drm_sched_job *job); >>>>>>>> void drm_sched_wakeup(struct drm_gpu_scheduler *sched); >>>>>>>> +void drm_sched_add_msg(struct drm_gpu_scheduler *sched, >>>>>>>> + struct drm_sched_msg *msg); >>>>>>>> void drm_sched_run_wq_stop(struct drm_gpu_scheduler *sched); >>>>>>>> void drm_sched_run_wq_start(struct drm_gpu_scheduler *sched); >>>>>>>> void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad); >>>> -- >>>> Daniel Vetter >>>> Software Engineer, Intel Corporation >>>> http://blog.ffwll.ch
On Tue, Aug 08, 2023 at 04:14:55PM +0200, Christian König wrote: > Am 08.08.23 um 16:06 schrieb Matthew Brost: > > [SNIP] > > > Basically workqueues are the in kernel infrastructure for exactly that use > > > case and we are trying to re-create that here and that is usually a rather > > > bad idea. > > > > > Ok let me play around with what this would look like in Xe, what you are > > suggesting would be ordered-wq per scheduler, work item for run job, > > work item for clean up job, and work item for a message. That might > > work I suppose? Only issue I see is scaling as this exposes an > > ordered-wq creation directly to an IOCTL. No idea if that is actually a > > concern though. > > That's a very good question I can't answer of hand either. > > But from the history of work queues I know that they were invented to reduce > the overhead/costs of having many kernel threads. > > So my educated guess is that you probably won't find anything better at the > moment. If work queues then indeed don't match this use case then we need to > figure out how to improve them or find a different solution. > I looked at workqueue code and think the workqueue creation is decoupled from kthread creation in most cases so I think this fits. Hacked together a quick PoC of this on top of Xe, seems to work, and like how the code looks too. Need to clean it up a bit and run some perf tests but looks promising. Hopely it all comes together and can get another spin of this series out fairly soon. Matt > Christian. > > > > > Matt > > > > > Regards, > > > Christian. > > > > > > > Matt > > > > > > > > > Or what am I missing? > > > > > > > > > > > Regards, > > > > > > Christian. > > > > > > > > > > > > > Worst case I think this isn't a dead-end and can be refactored to > > > > > > > internally use the workqueue services, with the new functions here > > > > > > > just being dumb wrappers until everyone is converted over. 
So it > > > > > > > doesn't look like an expensive mistake, if it turns out to be a > > > > > > > mistake. > > > > > > > -Daniel > > > > > > > > > > > > > > > > > > > > > > Regards, > > > > > > > > Christian. > > > > > > > > > > > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > > > > > > > > --- > > > > > > > > > drivers/gpu/drm/scheduler/sched_main.c | 52 +++++++++++++++++++++++++- > > > > > > > > > include/drm/gpu_scheduler.h | 29 +++++++++++++- > > > > > > > > > 2 files changed, 78 insertions(+), 3 deletions(-) > > > > > > > > > > > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > > > index 2597fb298733..84821a124ca2 100644 > > > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > > > @@ -1049,6 +1049,49 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list, > > > > > > > > > } > > > > > > > > > EXPORT_SYMBOL(drm_sched_pick_best); > > > > > > > > > > > > > > > > > > +/** > > > > > > > > > + * drm_sched_add_msg - add scheduler message > > > > > > > > > + * > > > > > > > > > + * @sched: scheduler instance > > > > > > > > > + * @msg: message to be added > > > > > > > > > + * > > > > > > > > > + * Can and will pass an jobs waiting on dependencies or in a runnable queue. > > > > > > > > > + * Messages processing will stop if schedule run wq is stopped and resume when > > > > > > > > > + * run wq is started. 
> > > > > > > > > + */ > > > > > > > > > +void drm_sched_add_msg(struct drm_gpu_scheduler *sched, > > > > > > > > > + struct drm_sched_msg *msg) > > > > > > > > > +{ > > > > > > > > > + spin_lock(&sched->job_list_lock); > > > > > > > > > + list_add_tail(&msg->link, &sched->msgs); > > > > > > > > > + spin_unlock(&sched->job_list_lock); > > > > > > > > > + > > > > > > > > > + drm_sched_run_wq_queue(sched); > > > > > > > > > +} > > > > > > > > > +EXPORT_SYMBOL(drm_sched_add_msg); > > > > > > > > > + > > > > > > > > > +/** > > > > > > > > > + * drm_sched_get_msg - get scheduler message > > > > > > > > > + * > > > > > > > > > + * @sched: scheduler instance > > > > > > > > > + * > > > > > > > > > + * Returns NULL or message > > > > > > > > > + */ > > > > > > > > > +static struct drm_sched_msg * > > > > > > > > > +drm_sched_get_msg(struct drm_gpu_scheduler *sched) > > > > > > > > > +{ > > > > > > > > > + struct drm_sched_msg *msg; > > > > > > > > > + > > > > > > > > > + spin_lock(&sched->job_list_lock); > > > > > > > > > + msg = list_first_entry_or_null(&sched->msgs, > > > > > > > > > + struct drm_sched_msg, link); > > > > > > > > > + if (msg) > > > > > > > > > + list_del(&msg->link); > > > > > > > > > + spin_unlock(&sched->job_list_lock); > > > > > > > > > + > > > > > > > > > + return msg; > > > > > > > > > +} > > > > > > > > > + > > > > > > > > > /** > > > > > > > > > * drm_sched_main - main scheduler thread > > > > > > > > > * > > > > > > > > > @@ -1060,6 +1103,7 @@ static void drm_sched_main(struct work_struct *w) > > > > > > > > > container_of(w, struct drm_gpu_scheduler, work_run); > > > > > > > > > struct drm_sched_entity *entity; > > > > > > > > > struct drm_sched_job *cleanup_job; > > > > > > > > > + struct drm_sched_msg *msg; > > > > > > > > > int r; > > > > > > > > > > > > > > > > > > if (READ_ONCE(sched->pause_run_wq)) > > > > > > > > > @@ -1067,12 +1111,15 @@ static void drm_sched_main(struct work_struct *w) > > > > > > > > > > > > > > > > > > cleanup_job = 
drm_sched_get_cleanup_job(sched); > > > > > > > > > entity = drm_sched_select_entity(sched); > > > > > > > > > + msg = drm_sched_get_msg(sched); > > > > > > > > > > > > > > > > > > - if (!entity && !cleanup_job) > > > > > > > > > + if (!entity && !cleanup_job && !msg) > > > > > > > > > return; /* No more work */ > > > > > > > > > > > > > > > > > > if (cleanup_job) > > > > > > > > > sched->ops->free_job(cleanup_job); > > > > > > > > > + if (msg) > > > > > > > > > + sched->ops->process_msg(msg); > > > > > > > > > > > > > > > > > > if (entity) { > > > > > > > > > struct dma_fence *fence; > > > > > > > > > @@ -1082,7 +1129,7 @@ static void drm_sched_main(struct work_struct *w) > > > > > > > > > sched_job = drm_sched_entity_pop_job(entity); > > > > > > > > > if (!sched_job) { > > > > > > > > > complete_all(&entity->entity_idle); > > > > > > > > > - if (!cleanup_job) > > > > > > > > > + if (!cleanup_job && !msg) > > > > > > > > > return; /* No more work */ > > > > > > > > > goto again; > > > > > > > > > } > > > > > > > > > @@ -1177,6 +1224,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, > > > > > > > > > > > > > > > > > > init_waitqueue_head(&sched->job_scheduled); > > > > > > > > > INIT_LIST_HEAD(&sched->pending_list); > > > > > > > > > + INIT_LIST_HEAD(&sched->msgs); > > > > > > > > > spin_lock_init(&sched->job_list_lock); > > > > > > > > > atomic_set(&sched->hw_rq_count, 0); > > > > > > > > > INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout); > > > > > > > > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h > > > > > > > > > index df1993dd44ae..267bd060d178 100644 > > > > > > > > > --- a/include/drm/gpu_scheduler.h > > > > > > > > > +++ b/include/drm/gpu_scheduler.h > > > > > > > > > @@ -394,6 +394,23 @@ enum drm_gpu_sched_stat { > > > > > > > > > DRM_GPU_SCHED_STAT_ENODEV, > > > > > > > > > }; > > > > > > > > > > > > > > > > > > +/** > > > > > > > > > + * struct drm_sched_msg - an in-band (relative to GPU scheduler run queue) 
> > > > > > > > > + * message > > > > > > > > > + * > > > > > > > > > + * Generic enough for backend defined messages, backend can expand if needed. > > > > > > > > > + */ > > > > > > > > > +struct drm_sched_msg { > > > > > > > > > + /** @link: list link into the gpu scheduler list of messages */ > > > > > > > > > + struct list_head link; > > > > > > > > > + /** > > > > > > > > > + * @private_data: opaque pointer to message private data (backend defined) > > > > > > > > > + */ > > > > > > > > > + void *private_data; > > > > > > > > > + /** @opcode: opcode of message (backend defined) */ > > > > > > > > > + unsigned int opcode; > > > > > > > > > +}; > > > > > > > > > + > > > > > > > > > /** > > > > > > > > > * struct drm_sched_backend_ops - Define the backend operations > > > > > > > > > * called by the scheduler > > > > > > > > > @@ -471,6 +488,12 @@ struct drm_sched_backend_ops { > > > > > > > > > * and it's time to clean it up. > > > > > > > > > */ > > > > > > > > > void (*free_job)(struct drm_sched_job *sched_job); > > > > > > > > > + > > > > > > > > > + /** > > > > > > > > > + * @process_msg: Process a message. Allowed to block, it is this > > > > > > > > > + * function's responsibility to free message if dynamically allocated. > > > > > > > > > + */ > > > > > > > > > + void (*process_msg)(struct drm_sched_msg *msg); > > > > > > > > > }; > > > > > > > > > > > > > > > > > > /** > > > > > > > > > @@ -482,6 +505,7 @@ struct drm_sched_backend_ops { > > > > > > > > > * @timeout: the time after which a job is removed from the scheduler. > > > > > > > > > * @name: name of the ring for which this scheduler is being used. > > > > > > > > > * @sched_rq: priority wise array of run queues. > > > > > > > > > + * @msgs: list of messages to be processed in @work_run > > > > > > > > > * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler > > > > > > > > > * waits on this wait queue until all the scheduled jobs are > > > > > > > > > * finished. 
> > > > > > > > > @@ -489,7 +513,7 @@ struct drm_sched_backend_ops { > > > > > > > > > * @job_id_count: used to assign unique id to the each job. > > > > > > > > > * @run_wq: workqueue used to queue @work_run > > > > > > > > > * @timeout_wq: workqueue used to queue @work_tdr > > > > > > > > > - * @work_run: schedules jobs and cleans up entities > > > > > > > > > + * @work_run: schedules jobs, cleans up jobs, and processes messages > > > > > > > > > * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the > > > > > > > > > * timeout interval is over. > > > > > > > > > * @pending_list: the list of jobs which are currently in the job queue. > > > > > > > > > @@ -513,6 +537,7 @@ struct drm_gpu_scheduler { > > > > > > > > > long timeout; > > > > > > > > > const char *name; > > > > > > > > > struct drm_sched_rq sched_rq[DRM_SCHED_PRIORITY_COUNT]; > > > > > > > > > + struct list_head msgs; > > > > > > > > > wait_queue_head_t job_scheduled; > > > > > > > > > atomic_t hw_rq_count; > > > > > > > > > atomic64_t job_id_count; > > > > > > > > > @@ -566,6 +591,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity, > > > > > > > > > > > > > > > > > > void drm_sched_job_cleanup(struct drm_sched_job *job); > > > > > > > > > void drm_sched_wakeup(struct drm_gpu_scheduler *sched); > > > > > > > > > +void drm_sched_add_msg(struct drm_gpu_scheduler *sched, > > > > > > > > > + struct drm_sched_msg *msg); > > > > > > > > > void drm_sched_run_wq_stop(struct drm_gpu_scheduler *sched); > > > > > > > > > void drm_sched_run_wq_start(struct drm_gpu_scheduler *sched); > > > > > > > > > void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad); > > > > > -- > > > > > Daniel Vetter > > > > > Software Engineer, Intel Corporation > > > > > http://blog.ffwll.ch >
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c index 2597fb298733..84821a124ca2 100644 --- a/drivers/gpu/drm/scheduler/sched_main.c +++ b/drivers/gpu/drm/scheduler/sched_main.c @@ -1049,6 +1049,49 @@ drm_sched_pick_best(struct drm_gpu_scheduler **sched_list, } EXPORT_SYMBOL(drm_sched_pick_best); +/** + * drm_sched_add_msg - add scheduler message + * + * @sched: scheduler instance + * @msg: message to be added + * + * Can and will pass an jobs waiting on dependencies or in a runnable queue. + * Messages processing will stop if schedule run wq is stopped and resume when + * run wq is started. + */ +void drm_sched_add_msg(struct drm_gpu_scheduler *sched, + struct drm_sched_msg *msg) +{ + spin_lock(&sched->job_list_lock); + list_add_tail(&msg->link, &sched->msgs); + spin_unlock(&sched->job_list_lock); + + drm_sched_run_wq_queue(sched); +} +EXPORT_SYMBOL(drm_sched_add_msg); + +/** + * drm_sched_get_msg - get scheduler message + * + * @sched: scheduler instance + * + * Returns NULL or message + */ +static struct drm_sched_msg * +drm_sched_get_msg(struct drm_gpu_scheduler *sched) +{ + struct drm_sched_msg *msg; + + spin_lock(&sched->job_list_lock); + msg = list_first_entry_or_null(&sched->msgs, + struct drm_sched_msg, link); + if (msg) + list_del(&msg->link); + spin_unlock(&sched->job_list_lock); + + return msg; +} + /** * drm_sched_main - main scheduler thread * @@ -1060,6 +1103,7 @@ static void drm_sched_main(struct work_struct *w) container_of(w, struct drm_gpu_scheduler, work_run); struct drm_sched_entity *entity; struct drm_sched_job *cleanup_job; + struct drm_sched_msg *msg; int r; if (READ_ONCE(sched->pause_run_wq)) @@ -1067,12 +1111,15 @@ static void drm_sched_main(struct work_struct *w) cleanup_job = drm_sched_get_cleanup_job(sched); entity = drm_sched_select_entity(sched); + msg = drm_sched_get_msg(sched); - if (!entity && !cleanup_job) + if (!entity && !cleanup_job && !msg) return; /* No more work */ if 
(cleanup_job) sched->ops->free_job(cleanup_job); + if (msg) + sched->ops->process_msg(msg); if (entity) { struct dma_fence *fence; @@ -1082,7 +1129,7 @@ static void drm_sched_main(struct work_struct *w) sched_job = drm_sched_entity_pop_job(entity); if (!sched_job) { complete_all(&entity->entity_idle); - if (!cleanup_job) + if (!cleanup_job && !msg) return; /* No more work */ goto again; } @@ -1177,6 +1224,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, init_waitqueue_head(&sched->job_scheduled); INIT_LIST_HEAD(&sched->pending_list); + INIT_LIST_HEAD(&sched->msgs); spin_lock_init(&sched->job_list_lock); atomic_set(&sched->hw_rq_count, 0); INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout); diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h index df1993dd44ae..267bd060d178 100644 --- a/include/drm/gpu_scheduler.h +++ b/include/drm/gpu_scheduler.h @@ -394,6 +394,23 @@ enum drm_gpu_sched_stat { DRM_GPU_SCHED_STAT_ENODEV, }; +/** + * struct drm_sched_msg - an in-band (relative to GPU scheduler run queue) + * message + * + * Generic enough for backend defined messages, backend can expand if needed. + */ +struct drm_sched_msg { + /** @link: list link into the gpu scheduler list of messages */ + struct list_head link; + /** + * @private_data: opaque pointer to message private data (backend defined) + */ + void *private_data; + /** @opcode: opcode of message (backend defined) */ + unsigned int opcode; +}; + /** * struct drm_sched_backend_ops - Define the backend operations * called by the scheduler @@ -471,6 +488,12 @@ struct drm_sched_backend_ops { * and it's time to clean it up. */ void (*free_job)(struct drm_sched_job *sched_job); + + /** + * @process_msg: Process a message. Allowed to block, it is this + * function's responsibility to free message if dynamically allocated. 
+ */ + void (*process_msg)(struct drm_sched_msg *msg); }; /** @@ -482,6 +505,7 @@ struct drm_sched_backend_ops { * @timeout: the time after which a job is removed from the scheduler. * @name: name of the ring for which this scheduler is being used. * @sched_rq: priority wise array of run queues. + * @msgs: list of messages to be processed in @work_run * @job_scheduled: once @drm_sched_entity_do_release is called the scheduler * waits on this wait queue until all the scheduled jobs are * finished. @@ -489,7 +513,7 @@ struct drm_sched_backend_ops { * @job_id_count: used to assign unique id to the each job. * @run_wq: workqueue used to queue @work_run * @timeout_wq: workqueue used to queue @work_tdr - * @work_run: schedules jobs and cleans up entities + * @work_run: schedules jobs, cleans up jobs, and processes messages * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the * timeout interval is over. * @pending_list: the list of jobs which are currently in the job queue. @@ -513,6 +537,7 @@ struct drm_gpu_scheduler { long timeout; const char *name; struct drm_sched_rq sched_rq[DRM_SCHED_PRIORITY_COUNT]; + struct list_head msgs; wait_queue_head_t job_scheduled; atomic_t hw_rq_count; atomic64_t job_id_count; @@ -566,6 +591,8 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity, void drm_sched_job_cleanup(struct drm_sched_job *job); void drm_sched_wakeup(struct drm_gpu_scheduler *sched); +void drm_sched_add_msg(struct drm_gpu_scheduler *sched, + struct drm_sched_msg *msg); void drm_sched_run_wq_stop(struct drm_gpu_scheduler *sched); void drm_sched_run_wq_start(struct drm_gpu_scheduler *sched); void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
Add a generic scheduler message interface which sends messages to the backend from the drm_gpu_scheduler main submission thread. The idea is that some of these messages modify state in drm_sched_entity which is also modified during submission. By scheduling these messages and submission in the same thread there is no race when changing states in drm_sched_entity. This interface will be used in Xe, a new Intel GPU driver, to clean up, suspend, resume, and change scheduling properties of a drm_sched_entity. The interface is designed to be generic and extendable, with only the backend understanding the messages. Signed-off-by: Matthew Brost <matthew.brost@intel.com> --- drivers/gpu/drm/scheduler/sched_main.c | 52 +++++++++++++++++++++++++- include/drm/gpu_scheduler.h | 29 +++++++++++++- 2 files changed, 78 insertions(+), 3 deletions(-)