[RFC] drm/sched: Fix teardown leaks with waitqueue

Message ID 20240903094446.29797-2-pstanner@redhat.com (mailing list archive)
State New, archived

Commit Message

Philipp Stanner Sept. 3, 2024, 9:44 a.m. UTC
The GPU scheduler currently does not ensure that its pending_list is
empty before performing various other teardown tasks in
drm_sched_fini().

If there are still jobs in the pending_list, this is problematic because
after scheduler teardown, no one will call backend_ops.free_job()
anymore. This would, consequently, result in memory leaks.

One way to solve this is to implement a waitqueue that drm_sched_fini()
blocks on until the pending_list has become empty.

Add a waitqueue to struct drm_gpu_scheduler. Wake up waiters once the
pending_list becomes empty. Wait in drm_sched_fini() for that to happen.

Suggested-by: Danilo Krummrich <dakr@redhat.com>
Signed-off-by: Philipp Stanner <pstanner@redhat.com>
---
Hi all,

since the scheduler has many stakeholders, I want this solution
discussed as an RFC first.

This version here has IMO the advantage (and disadvantage...) that it
blocks infinitely if the driver messed up the cleanup, so problems
might become more visible than with the refcount solution I proposed in
parallel.

Cheers,
P.
---
 drivers/gpu/drm/scheduler/sched_main.c | 40 ++++++++++++++++++++++++++
 include/drm/gpu_scheduler.h            |  4 +++
 2 files changed, 44 insertions(+)

Comments

Simona Vetter Sept. 4, 2024, 5:47 p.m. UTC | #1
On Tue, Sep 03, 2024 at 11:44:47AM +0200, Philipp Stanner wrote:
> The GPU scheduler currently does not ensure that its pending_list is
> empty before performing various other teardown tasks in
> drm_sched_fini().
> 
> If there are still jobs in the pending_list, this is problematic because
> after scheduler teardown, no one will call backend_ops.free_job()
> anymore. This would, consequently, result in memory leaks.
> 
> One way to solve this is to implement a waitqueue that drm_sched_fini()
> blocks on until the pending_list has become empty.
> 
> Add a waitqueue to struct drm_gpu_scheduler. Wake up waiters once the
> pending_list becomes empty. Wait in drm_sched_fini() for that to happen.
> 
> Suggested-by: Danilo Krummrich <dakr@redhat.com>
> Signed-off-by: Philipp Stanner <pstanner@redhat.com>
> ---
> Hi all,
> 
> since the scheduler has many stakeholders, I want this solution
> discussed as an RFC first.
> 
> This version here has IMO the advantage (and disadvantage...) that it
> blocks infinitely if the driver messed up the cleanup, so problems
> might become more visible than with the refcount solution I proposed in
> parallel.

Very quick comment because I'm heading out for the r4l conference, but
maybe I can discuss this there with Danilo a bit.

Maybe we should do step 0 first, and document the current rules? The
kerneldoc isn't absolute zero anymore, but it's very, very bare-bones.
Then get that acked and merged, which is a very good way to make sure
we're actually standing on common ground.

Then maybe step 0.5 would be to add runtime asserts to enforce the rules,
like if you tear down the scheduler and there's stuff in flight, you splat
on a WARN_ON.
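As a very rough sketch of what I mean (the helper name is made up here
and the exact placement hasn't been thought through):

	/* Would run at the top of drm_sched_fini(). */
	static void drm_sched_warn_on_pending_jobs(struct drm_gpu_scheduler *sched)
	{
		spin_lock(&sched->job_list_lock);
		/*
		 * Jobs still in flight at teardown time mean nobody will ever
		 * call backend_ops.free_job() for them, i.e. they leak.
		 */
		WARN_ON(!list_empty(&sched->pending_list));
		spin_unlock(&sched->job_list_lock);
	}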

With that foundation there should be a much clearer basis to discuss
whether there is an issue, and what a better design could look like. In
the little pondering I've done, I've come up with a few more ideas along
similar lines.

One comment below, kinda unrelated.

> 
> Cheers,
> P.
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 40 ++++++++++++++++++++++++++
>  include/drm/gpu_scheduler.h            |  4 +++
>  2 files changed, 44 insertions(+)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 7e90c9f95611..200fa932f289 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -564,6 +564,13 @@ static void drm_sched_job_timedout(struct work_struct *work)
>  		 * is parked at which point it's safe.
>  		 */
>  		list_del_init(&job->list);
> +
> +		/*
> +		 * Inform tasks blocking in drm_sched_fini() that it's now safe to proceed.
> +		 */
> +		if (list_empty(&sched->pending_list))
> +			wake_up(&sched->job_list_waitque);
> +
>  		spin_unlock(&sched->job_list_lock);
>  
>  		status = job->sched->ops->timedout_job(job);
> @@ -584,6 +591,15 @@ static void drm_sched_job_timedout(struct work_struct *work)
>  		drm_sched_start_timeout_unlocked(sched);
>  }
>  
> +static bool drm_sched_no_jobs_pending(struct drm_gpu_scheduler *sched)
> +{
> +	/*
> +	 * For list_empty() to work without a lock.
> +	 */

So this is pretty far from the gold standard for documenting memory
barrier semantics in lockless code. Ideally we have a comment for both
sides of the barrier (you always need two, or there's no functioning
barrier), pointing at each other, and explaining exactly what's being
synchronized against what and how.
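Just to illustrate the comment style (not claiming these are the right
barriers, especially since wake_up()/wait_event() already imply some
ordering of their own):

	/* On the job-removal side, after list_del_init(&job->list): */

	/*
	 * Pairs with the smp_rmb() in drm_sched_no_jobs_pending(); makes the
	 * list removal visible before drm_sched_fini() re-checks the list.
	 */
	smp_wmb();
	if (list_empty(&sched->pending_list))
		wake_up(&sched->job_list_waitque);

	/* And in drm_sched_no_jobs_pending(): */

	/*
	 * Pairs with the smp_wmb() on the job-removal side; ensures we
	 * observe the final list_del_init() before reporting an empty list.
	 */
	smp_rmb();
	return list_empty(&sched->pending_list);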

I did years ago add a few missing barriers with that approach, see
b0a5303d4e14 ("drm/sched: Barriers are needed for
entity->last_scheduled"). Reading that patch again it feels a bit on the
terse side of things (plus I noticed spelling issues now too, oops).

Cheers, Sima

> +	rmb();
> +	return list_empty(&sched->pending_list);
> +}
Philipp Stanner Sept. 5, 2024, 7:38 a.m. UTC | #2
On Wed, 2024-09-04 at 19:47 +0200, Simona Vetter wrote:
> On Tue, Sep 03, 2024 at 11:44:47AM +0200, Philipp Stanner wrote:
> > The GPU scheduler currently does not ensure that its pending_list
> > is empty before performing various other teardown tasks in
> > drm_sched_fini().
> > 
> > If there are still jobs in the pending_list, this is problematic
> > because after scheduler teardown, no one will call
> > backend_ops.free_job() anymore. This would, consequently, result in
> > memory leaks.
> > 
> > One way to solve this is to implement a waitqueue that
> > drm_sched_fini() blocks on until the pending_list has become empty.
> > 
> > Add a waitqueue to struct drm_gpu_scheduler. Wake up waiters once
> > the pending_list becomes empty. Wait in drm_sched_fini() for that
> > to happen.
> > 
> > Suggested-by: Danilo Krummrich <dakr@redhat.com>
> > Signed-off-by: Philipp Stanner <pstanner@redhat.com>
> > ---
> > Hi all,
> > 
> > since the scheduler has many stakeholders, I want this solution
> > discussed as an RFC first.
> > 
> > This version here has IMO the advantage (and disadvantage...) that
> > it blocks infinitely if the driver messed up the cleanup, so
> > problems might become more visible than with the refcount solution
> > I proposed in parallel.
> 
> Very quick comment because I'm heading out for the r4l conference,
> but maybe I can discuss this there with Danilo a bit.
> 
> Maybe we should do step 0 first, and document the current rules? The
> kerneldoc isn't absolute zero anymore, but it's very, very bare-bones.
> Then get that acked and merged, which is a very good way to make sure
> we're actually standing on common ground.

Yes, documentation is definitely also on my TODO list. I wanted to send
out something clarifying the objects' lifetimes (based on Christian's
previous series [1]) quite soon.

> 
> Then maybe step 0.5 would be to add runtime asserts to enforce the
> rules, like if you tear down the scheduler and there's stuff in
> flight, you splat on a WARN_ON.
> 
> With that foundation there should be a much clearer basis to discuss
> whether there is an issue, and what a better design could look like.

I mean, I'm quite sure that there are teardown issues. But we could
indeed make them visible first through documentation (and a FIXME tag)
and, after establishing consensus by merging that, go on as you
suggested.
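Something like this in the drm_sched_fini() kerneldoc, as a first
sketch of the wording:

 * FIXME: Jobs that are still in the pending_list at this point are
 * leaked, since after teardown no one calls backend_ops.free_job()
 * for them anymore.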

> In the little pondering I've done, I've come up with a few more
> ideas along similar lines.
> 
> One comment below, kinda unrelated.
> 
> > 
> > Cheers,
> > P.
> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c | 40 ++++++++++++++++++++++++++
> >  include/drm/gpu_scheduler.h            |  4 +++
> >  2 files changed, 44 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 7e90c9f95611..200fa932f289 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -564,6 +564,13 @@ static void drm_sched_job_timedout(struct work_struct *work)
> >  		 * is parked at which point it's safe.
> >  		 */
> >  		list_del_init(&job->list);
> > +
> > +		/*
> > +		 * Inform tasks blocking in drm_sched_fini() that it's now safe to proceed.
> > +		 */
> > +		if (list_empty(&sched->pending_list))
> > +			wake_up(&sched->job_list_waitque);
> > +
> >  		spin_unlock(&sched->job_list_lock);
> >  
> >  		status = job->sched->ops->timedout_job(job);
> > @@ -584,6 +591,15 @@ static void drm_sched_job_timedout(struct work_struct *work)
> >  		drm_sched_start_timeout_unlocked(sched);
> >  }
> >  
> > +static bool drm_sched_no_jobs_pending(struct drm_gpu_scheduler *sched)
> > +{
> > +	/*
> > +	 * For list_empty() to work without a lock.
> > +	 */
> 
> So this is pretty far from the gold standard for documenting memory
> barrier semantics in lockless code. Ideally we have a comment for both
> sides of the barrier (you always need two, or there's no functioning
> barrier), pointing at each other, and explaining exactly what's being
> synchronized against what and how.
> 
> I did years ago add a few missing barriers with that approach, see
> b0a5303d4e14 ("drm/sched: Barriers are needed for
> entity->last_scheduled"). Reading that patch again it feels a bit on
> the terse side of things (plus I noticed spelling issues now too,
> oops).

ACK, will do that properly once we roll out the actual patch.


P.

Patch

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 7e90c9f95611..200fa932f289 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -564,6 +564,13 @@  static void drm_sched_job_timedout(struct work_struct *work)
 		 * is parked at which point it's safe.
 		 */
 		list_del_init(&job->list);
+
+		/*
+		 * Inform tasks blocking in drm_sched_fini() that it's now safe to proceed.
+		 */
+		if (list_empty(&sched->pending_list))
+			wake_up(&sched->job_list_waitque);
+
 		spin_unlock(&sched->job_list_lock);
 
 		status = job->sched->ops->timedout_job(job);
@@ -584,6 +591,15 @@  static void drm_sched_job_timedout(struct work_struct *work)
 		drm_sched_start_timeout_unlocked(sched);
 }
 
+static bool drm_sched_no_jobs_pending(struct drm_gpu_scheduler *sched)
+{
+	/*
+	 * For list_empty() to work without a lock.
+	 */
+	rmb();
+	return list_empty(&sched->pending_list);
+}
+
 /**
  * drm_sched_stop - stop the scheduler
  *
@@ -659,6 +675,12 @@  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
 		}
 	}
 
+	/*
+	 * Inform tasks blocking in drm_sched_fini() that it's now safe to proceed.
+	 */
+	if (drm_sched_no_jobs_pending(sched))
+		wake_up(&sched->job_list_waitque);
+
 	/*
 	 * Stop pending timer in flight as we rearm it in  drm_sched_start. This
 	 * avoids the pending timeout work in progress to fire right away after
@@ -1085,6 +1107,12 @@  drm_sched_get_finished_job(struct drm_gpu_scheduler *sched)
 		/* remove job from pending_list */
 		list_del_init(&job->list);
 
+		/*
+		 * Inform tasks blocking in drm_sched_fini() that it's now safe to proceed.
+		 */
+		if (list_empty(&sched->pending_list))
+			wake_up(&sched->job_list_waitque);
+
 		/* cancel this job's TO timer */
 		cancel_delayed_work(&sched->work_tdr);
 		/* make the scheduled timestamp more accurate */
@@ -1303,6 +1331,7 @@  int drm_sched_init(struct drm_gpu_scheduler *sched,
 	init_waitqueue_head(&sched->job_scheduled);
 	INIT_LIST_HEAD(&sched->pending_list);
 	spin_lock_init(&sched->job_list_lock);
+	init_waitqueue_head(&sched->job_list_waitque);
 	atomic_set(&sched->credit_count, 0);
 	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
 	INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
@@ -1333,12 +1362,23 @@  EXPORT_SYMBOL(drm_sched_init);
  * @sched: scheduler instance
  *
  * Tears down and cleans up the scheduler.
+ *
+ * Note that this function blocks until the fences returned by
+ * backend_ops.run_job() have been signalled.
  */
 void drm_sched_fini(struct drm_gpu_scheduler *sched)
 {
 	struct drm_sched_entity *s_entity;
 	int i;
 
+	/*
+	 * Jobs that were never scheduled or that have timed out are
+	 * gone by now, but jobs that have been submitted through
+	 * backend_ops.run_job() and have not yet terminated are still pending.
+	 *
+	 * Wait for the pending_list to become empty to avoid leaking those jobs.
+	 */
+	wait_event(sched->job_list_waitque, drm_sched_no_jobs_pending(sched));
 	drm_sched_wqueue_stop(sched);
 
 	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 5acc64954a88..bff092784405 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -29,6 +29,7 @@ 
 #include <linux/completion.h>
 #include <linux/xarray.h>
 #include <linux/workqueue.h>
+#include <linux/wait.h>
 
 #define MAX_WAIT_SCHED_ENTITY_Q_EMPTY msecs_to_jiffies(1000)
 
@@ -503,6 +504,8 @@  struct drm_sched_backend_ops {
  *            timeout interval is over.
  * @pending_list: the list of jobs which are currently in the job queue.
  * @job_list_lock: lock to protect the pending_list.
+ * @job_list_waitque: a waitqueue for drm_sched_fini() to block on until all
+ *		      pending jobs have finished.
  * @hang_limit: once the hangs by a job crosses this limit then it is marked
  *              guilty and it will no longer be considered for scheduling.
  * @score: score to help loadbalancer pick a idle sched
@@ -532,6 +535,7 @@  struct drm_gpu_scheduler {
 	struct delayed_work		work_tdr;
 	struct list_head		pending_list;
 	spinlock_t			job_list_lock;
+	wait_queue_head_t		job_list_waitque;
 	int				hang_limit;
 	atomic_t                        *score;
 	atomic_t                        _score;