Message ID: 20231116141547.206695-2-christian.koenig@amd.com (mailing list archive)
State:      New, archived
Series:     [1/2] drm/scheduler: improve GPU scheduler documentation v2
On Thu, Nov 16, 2023 at 9:32 AM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Drop the reference to the deprecated re-submission of jobs.
>
> Mention that it isn't the job which times out, but the hardware fence.
> Mention that drivers can try a context based reset as well.
>
> Signed-off-by: Christian König <christian.koenig@amd.com>

Reviewed-by: Alex Deucher <alexander.deucher@amd.com>

> ---
>  include/drm/gpu_scheduler.h | 15 ++++++---------
>  1 file changed, 6 insertions(+), 9 deletions(-)
>
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 1d60eab747de..ac1d7222f5b2 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -418,8 +418,8 @@ struct drm_sched_backend_ops {
>  struct dma_fence *(*run_job)(struct drm_sched_job *sched_job);
>
>  /**
> - * @timedout_job: Called when a job has taken too long to execute,
> - * to trigger GPU recovery.
> + * @timedout_job: Called when a hardware fence didn't signal in a
> + * configurable amount of time to trigger GPU recovery.
>  *
>  * This method is called in a workqueue context.
>  *
> @@ -430,9 +430,8 @@ struct drm_sched_backend_ops {
>  * scheduler thread and cancel the timeout work, guaranteeing that
>  * nothing is queued while we reset the hardware queue
>  * 2. Try to gracefully stop non-faulty jobs (optional)
> - * 3. Issue a GPU reset (driver-specific)
> - * 4. Re-submit jobs using drm_sched_resubmit_jobs()
> - * 5. Restart the scheduler using drm_sched_start(). At that point, new
> + * 3. Issue a GPU or context reset (driver-specific)
> + * 4. Restart the scheduler using drm_sched_start(). At that point, new
>  * jobs can be queued, and the scheduler thread is unblocked
>  *
>  * Note that some GPUs have distinct hardware queues but need to reset
> @@ -448,16 +447,14 @@ struct drm_sched_backend_ops {
>  * 2. Try to gracefully stop non-faulty jobs on all queues impacted by
>  * the reset (optional)
>  * 3. Issue a GPU reset on all faulty queues (driver-specific)
> - * 4. Re-submit jobs on all schedulers impacted by the reset using
> - * drm_sched_resubmit_jobs()
> - * 5. Restart all schedulers that were stopped in step #1 using
> + * 4. Restart all schedulers that were stopped in step #1 using
>  * drm_sched_start()
>  *
>  * Return DRM_GPU_SCHED_STAT_NOMINAL, when all is normal,
>  * and the underlying driver has started or completed recovery.
>  *
>  * Return DRM_GPU_SCHED_STAT_ENODEV, if the device is no longer
> - * available, i.e. has been unplugged.
> + * available, i.e. has been unplugged or failed to recover.
>  */
>  enum drm_gpu_sched_stat (*timedout_job)(struct drm_sched_job *sched_job);
>
> --
> 2.34.1
>
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 1d60eab747de..ac1d7222f5b2 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -418,8 +418,8 @@ struct drm_sched_backend_ops {
 struct dma_fence *(*run_job)(struct drm_sched_job *sched_job);

 /**
- * @timedout_job: Called when a job has taken too long to execute,
- * to trigger GPU recovery.
+ * @timedout_job: Called when a hardware fence didn't signal in a
+ * configurable amount of time to trigger GPU recovery.
 *
 * This method is called in a workqueue context.
 *
@@ -430,9 +430,8 @@ struct drm_sched_backend_ops {
 * scheduler thread and cancel the timeout work, guaranteeing that
 * nothing is queued while we reset the hardware queue
 * 2. Try to gracefully stop non-faulty jobs (optional)
- * 3. Issue a GPU reset (driver-specific)
- * 4. Re-submit jobs using drm_sched_resubmit_jobs()
- * 5. Restart the scheduler using drm_sched_start(). At that point, new
+ * 3. Issue a GPU or context reset (driver-specific)
+ * 4. Restart the scheduler using drm_sched_start(). At that point, new
 * jobs can be queued, and the scheduler thread is unblocked
 *
 * Note that some GPUs have distinct hardware queues but need to reset
@@ -448,16 +447,14 @@ struct drm_sched_backend_ops {
 * 2. Try to gracefully stop non-faulty jobs on all queues impacted by
 * the reset (optional)
 * 3. Issue a GPU reset on all faulty queues (driver-specific)
- * 4. Re-submit jobs on all schedulers impacted by the reset using
- * drm_sched_resubmit_jobs()
- * 5. Restart all schedulers that were stopped in step #1 using
+ * 4. Restart all schedulers that were stopped in step #1 using
 * drm_sched_start()
 *
 * Return DRM_GPU_SCHED_STAT_NOMINAL, when all is normal,
 * and the underlying driver has started or completed recovery.
 *
 * Return DRM_GPU_SCHED_STAT_ENODEV, if the device is no longer
- * available, i.e. has been unplugged.
+ * available, i.e. has been unplugged or failed to recover.
 */
 enum drm_gpu_sched_stat (*timedout_job)(struct drm_sched_job *sched_job);
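
For a driver author, the single hardware queue procedure documented above maps roughly onto a ->timedout_job implementation along the following lines. This is only a minimal sketch: my_reset_queue() is a hypothetical driver helper, and the drm_sched_stop()/drm_sched_start() calls assume the interface as it exists around this series (the drm_sched_start() parameters have changed across kernel versions).

#include <drm/gpu_scheduler.h>

/* Hypothetical driver helper: reset the hardware queue or context behind
 * this scheduler; returns 0 on success, negative errno on failure. */
static int my_reset_queue(struct drm_gpu_scheduler *sched);

static enum drm_gpu_sched_stat
my_timedout_job(struct drm_sched_job *sched_job)
{
	struct drm_gpu_scheduler *sched = sched_job->sched;

	/* 1. Park the scheduler and cancel the timeout work so nothing new
	 *    is queued while the hardware queue is being reset. */
	drm_sched_stop(sched, sched_job);

	/* 2.-3. Driver specific: optionally drain non-faulty jobs, then
	 *       issue the GPU or context reset. */
	if (my_reset_queue(sched))
		return DRM_GPU_SCHED_STAT_ENODEV; /* unplugged or recovery failed */

	/* 4. Restart the scheduler; new jobs can be queued again. */
	drm_sched_start(sched, true);

	return DRM_GPU_SCHED_STAT_NOMINAL;
}

The whole-GPU variant described in the second list has the same shape, except that drm_sched_stop() and drm_sched_start() are applied to every scheduler affected by the reset.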
Drop the reference to the deprecated re-submission of jobs.

Mention that it isn't the job which times out, but the hardware fence.
Mention that drivers can try a context based reset as well.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 include/drm/gpu_scheduler.h | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)
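
To make the "it isn't the job which times out, but the hardware fence" point concrete: the fence a driver returns from ->run_job is what the scheduler's timeout handling watches, and ->timedout_job is invoked when that fence has not signalled within the configured timeout. A minimal, hypothetical sketch (my_ring_emit() is an assumed driver helper, not part of the scheduler API):

#include <linux/err.h>
#include <linux/dma-fence.h>
#include <drm/gpu_scheduler.h>

/* Hypothetical driver helper: write the job to the hardware ring and
 * return the hardware fence that signals when the job completes. */
static struct dma_fence *my_ring_emit(struct drm_sched_job *sched_job);

static struct dma_fence *my_run_job(struct drm_sched_job *sched_job)
{
	struct dma_fence *hw_fence = my_ring_emit(sched_job);

	if (IS_ERR(hw_fence))
		return hw_fence;

	/* The scheduler tracks this fence; if it does not signal within the
	 * configured timeout, ->timedout_job is called for the job. */
	return hw_fence;
}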