[2/2] drm/scheduler: improve timedout_job wording a bit

Message ID

20231116141547.206695-2-christian.koenig@amd.com (mailing list archive)

State

New, archived

Headers

From: "Christian König" <ckoenig.leichtzumerken@gmail.com>
To: dakr@redhat.com, airlied@gmail.com, ltuikov89@gmail.com,
 matthew.brost@intel.com, boris.brezillon@collabora.com, daniel@ffwll.ch,
 alexander.deucher@amd.com, dri-devel@lists.freedesktop.org
Subject: [PATCH 2/2] drm/scheduler: improve timedout_job wording a bit
Date: Thu, 16 Nov 2023 15:15:47 +0100
Message-Id: <20231116141547.206695-2-christian.koenig@amd.com>
In-Reply-To: <20231116141547.206695-1-christian.koenig@amd.com>
References: <20231116141547.206695-1-christian.koenig@amd.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Precedence: list
Cc: Christian König <christian.koenig@amd.com>
Errors-To: dri-devel-bounces@lists.freedesktop.org
Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>

Series

[1/2] drm/scheduler: improve GPU scheduler documentation v2
[2/2] drm/scheduler: improve timedout_job wording a bit

Commit Message

Christian König Nov. 16, 2023, 2:15 p.m. UTC

Drop the reference to the deprecated re-submission of jobs.

Mention that it isn't the job which times out, but the hardware fence.
Mention that drivers can try a context based reset as well.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 include/drm/gpu_scheduler.h | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

Comments

Alex Deucher Nov. 16, 2023, 6:46 p.m. UTC | #1

On Thu, Nov 16, 2023 at 9:32 AM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Drop the reference to the deprecated re-submission of jobs.
>
> Mention that it isn't the job which times out, but the hardware fence.
> Mention that drivers can try a context based reset as well.
>
> Signed-off-by: Christian König <christian.koenig@amd.com>

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

> ---
>  include/drm/gpu_scheduler.h | 15 ++++++---------
>  1 file changed, 6 insertions(+), 9 deletions(-)
>
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 1d60eab747de..ac1d7222f5b2 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -418,8 +418,8 @@ struct drm_sched_backend_ops {
>         struct dma_fence *(*run_job)(struct drm_sched_job *sched_job);
>
>         /**
> -        * @timedout_job: Called when a job has taken too long to execute,
> -        * to trigger GPU recovery.
> +        * @timedout_job: Called when a hardware fence didn't signal in a
> +        * configurable amount of time to trigger GPU recovery.
>          *
>          * This method is called in a workqueue context.
>          *
> @@ -430,9 +430,8 @@ struct drm_sched_backend_ops {
>          *    scheduler thread and cancel the timeout work, guaranteeing that
>          *    nothing is queued while we reset the hardware queue
>          * 2. Try to gracefully stop non-faulty jobs (optional)
> -        * 3. Issue a GPU reset (driver-specific)
> -        * 4. Re-submit jobs using drm_sched_resubmit_jobs()
> -        * 5. Restart the scheduler using drm_sched_start(). At that point, new
> +        * 3. Issue a GPU or context reset (driver-specific)
> +        * 4. Restart the scheduler using drm_sched_start(). At that point, new
>          *    jobs can be queued, and the scheduler thread is unblocked
>          *
>          * Note that some GPUs have distinct hardware queues but need to reset
> @@ -448,16 +447,14 @@ struct drm_sched_backend_ops {
>          * 2. Try to gracefully stop non-faulty jobs on all queues impacted by
>          *    the reset (optional)
>          * 3. Issue a GPU reset on all faulty queues (driver-specific)
> -        * 4. Re-submit jobs on all schedulers impacted by the reset using
> -        *    drm_sched_resubmit_jobs()
> -        * 5. Restart all schedulers that were stopped in step #1 using
> +        * 4. Restart all schedulers that were stopped in step #1 using
>          *    drm_sched_start()
>          *
>          * Return DRM_GPU_SCHED_STAT_NOMINAL, when all is normal,
>          * and the underlying driver has started or completed recovery.
>          *
>          * Return DRM_GPU_SCHED_STAT_ENODEV, if the device is no longer
> -        * available, i.e. has been unplugged.
> +        * available, i.e. has been unplugged or failed to recover.
>          */
>         enum drm_gpu_sched_stat (*timedout_job)(struct drm_sched_job *sched_job);
>
> --
> 2.34.1
>

diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 1d60eab747de..ac1d7222f5b2 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -418,8 +418,8 @@  struct drm_sched_backend_ops {
 	struct dma_fence *(*run_job)(struct drm_sched_job *sched_job);
 
 	/**
-	 * @timedout_job: Called when a job has taken too long to execute,
-	 * to trigger GPU recovery.
+	 * @timedout_job: Called when a hardware fence didn't signal in a
+	 * configurable amount of time to trigger GPU recovery.
 	 *
 	 * This method is called in a workqueue context.
 	 *
@@ -430,9 +430,8 @@  struct drm_sched_backend_ops {
 	 *    scheduler thread and cancel the timeout work, guaranteeing that
 	 *    nothing is queued while we reset the hardware queue
 	 * 2. Try to gracefully stop non-faulty jobs (optional)
-	 * 3. Issue a GPU reset (driver-specific)
-	 * 4. Re-submit jobs using drm_sched_resubmit_jobs()
-	 * 5. Restart the scheduler using drm_sched_start(). At that point, new
+	 * 3. Issue a GPU or context reset (driver-specific)
+	 * 4. Restart the scheduler using drm_sched_start(). At that point, new
 	 *    jobs can be queued, and the scheduler thread is unblocked
 	 *
 	 * Note that some GPUs have distinct hardware queues but need to reset
@@ -448,16 +447,14 @@  struct drm_sched_backend_ops {
 	 * 2. Try to gracefully stop non-faulty jobs on all queues impacted by
 	 *    the reset (optional)
 	 * 3. Issue a GPU reset on all faulty queues (driver-specific)
-	 * 4. Re-submit jobs on all schedulers impacted by the reset using
-	 *    drm_sched_resubmit_jobs()
-	 * 5. Restart all schedulers that were stopped in step #1 using
+	 * 4. Restart all schedulers that were stopped in step #1 using
 	 *    drm_sched_start()
 	 *
 	 * Return DRM_GPU_SCHED_STAT_NOMINAL, when all is normal,
 	 * and the underlying driver has started or completed recovery.
 	 *
 	 * Return DRM_GPU_SCHED_STAT_ENODEV, if the device is no longer
-	 * available, i.e. has been unplugged.
+	 * available, i.e. has been unplugged or failed to recover.
 	 */
 	enum drm_gpu_sched_stat (*timedout_job)(struct drm_sched_job *sched_job);
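
The recovery procedure documented above for a single hardware queue can be sketched roughly as follows. This is a userspace mock, not kernel code: every mock_* name and the struct layout are hypothetical stand-ins for the real scheduler and driver entry points. Trying a context reset first and escalating to a full GPU reset is also an assumption about one possible driver policy; the comment only says the reset is driver-specific. Note that there is no re-submission step, matching the updated wording.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for enum drm_gpu_sched_stat. */
enum mock_sched_stat {
	MOCK_SCHED_STAT_NOMINAL, /* recovery started or completed */
	MOCK_SCHED_STAT_ENODEV,  /* device unplugged or failed to recover */
};

/* Hypothetical model of one hardware queue and its scheduler. */
struct mock_queue {
	bool stopped;         /* true while the scheduler is stopped */
	bool hw_present;      /* false once the device has been unplugged */
	bool ctx_reset_works; /* whether the simulated context reset succeeds */
	bool gpu_reset_works; /* whether the simulated GPU reset succeeds */
	int ctx_resets;       /* number of context resets issued */
	int gpu_resets;       /* number of full GPU resets issued */
};

/* Hypothetical stand-in for struct drm_sched_job. */
struct mock_job {
	struct mock_queue *queue;
};

static void mock_sched_stop(struct mock_queue *q)  { q->stopped = true; }
static void mock_sched_start(struct mock_queue *q) { q->stopped = false; }

static bool mock_context_reset(struct mock_queue *q)
{
	q->ctx_resets++;
	return q->ctx_reset_works;
}

static bool mock_gpu_reset(struct mock_queue *q)
{
	q->gpu_resets++;
	return q->gpu_reset_works;
}

/* Sketch of a timedout_job-style callback for a single queue. */
static enum mock_sched_stat mock_timedout_job(struct mock_job *job)
{
	struct mock_queue *q;

	assert(job != NULL && job->queue != NULL);
	q = job->queue;

	/* Device already gone: nothing to recover. */
	if (!q->hw_present)
		return MOCK_SCHED_STAT_ENODEV;

	/* 1. Stop the scheduler so nothing is queued during the reset. */
	mock_sched_stop(q);

	/* 2. Gracefully stopping non-faulty jobs is optional; skipped here. */

	/* 3. Context reset first, full GPU reset as fallback (assumed policy). */
	if (!mock_context_reset(q) && !mock_gpu_reset(q))
		return MOCK_SCHED_STAT_ENODEV; /* scheduler stays stopped */

	/* 4. Restart the scheduler; new jobs can be queued again. */
	mock_sched_start(q);
	return MOCK_SCHED_STAT_NOMINAL;
}
```

In this sketch a caller would fill in a struct mock_queue, point a struct mock_job at it, and check the value returned by mock_timedout_job(). When both resets fail the queue is left stopped and the ENODEV-style status is returned, which corresponds to the "failed to recover" case added to the return value description.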
