Message ID | 20230404002211.3611376-8-matthew.brost@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Xe DRM scheduler and long running workload plans | expand |
On 2023-04-03 20:22, Matthew Brost wrote: > Add helper to set TDR timeout and restart the TDR with new timeout > value. This will be used in XE, new Intel GPU driver, to trigger the TDR > to cleanup drm_sched_entity that encounter errors. > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > --- > drivers/gpu/drm/scheduler/sched_main.c | 18 ++++++++++++++++++ > include/drm/gpu_scheduler.h | 1 + > 2 files changed, 19 insertions(+) > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c > index 4eac02d212c1..d61880315d8d 100644 > --- a/drivers/gpu/drm/scheduler/sched_main.c > +++ b/drivers/gpu/drm/scheduler/sched_main.c > @@ -370,6 +370,24 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched) > queue_delayed_work(sched->timeout_wq, &sched->work_tdr, sched->timeout); > } > > +/** > + * drm_sched_set_timeout - set timeout for reset worker > + * > + * @sched: scheduler instance to set and (re)-start the worker for > + * @timeout: timeout period > + * > + * Set and (re)-start the timeout for the given scheduler. > + */ > +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout) > +{ > + spin_lock(&sched->job_list_lock); > + sched->timeout = timeout; > + cancel_delayed_work(&sched->work_tdr); I see that the comment says "(re-)start"(sic). Is the rest of the logic stable in that we don't need to use _sync() version, and/or at least inspect the return value of the one currently used? Regards, Luben > + drm_sched_start_timeout(sched); > + spin_unlock(&sched->job_list_lock); > +} > +EXPORT_SYMBOL(drm_sched_set_timeout); > + > /** > * drm_sched_fault - immediately start timeout handler > * > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h > index 18172ae63ab7..6258e324bd7c 100644 > --- a/include/drm/gpu_scheduler.h > +++ b/include/drm/gpu_scheduler.h > @@ -593,6 +593,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity, > struct drm_gpu_scheduler **sched_list, > unsigned int num_sched_list); > > +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout); > void drm_sched_job_cleanup(struct drm_sched_job *job); > void drm_sched_wakeup(struct drm_gpu_scheduler *sched); > void drm_sched_add_msg(struct drm_gpu_scheduler *sched,
On Thu, May 04, 2023 at 01:28:12AM -0400, Luben Tuikov wrote: > On 2023-04-03 20:22, Matthew Brost wrote: > > Add helper to set TDR timeout and restart the TDR with new timeout > > value. This will be used in XE, new Intel GPU driver, to trigger the TDR > > to cleanup drm_sched_entity that encounter errors. > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > --- > > drivers/gpu/drm/scheduler/sched_main.c | 18 ++++++++++++++++++ > > include/drm/gpu_scheduler.h | 1 + > > 2 files changed, 19 insertions(+) > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c > > index 4eac02d212c1..d61880315d8d 100644 > > --- a/drivers/gpu/drm/scheduler/sched_main.c > > +++ b/drivers/gpu/drm/scheduler/sched_main.c > > @@ -370,6 +370,24 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched) > > queue_delayed_work(sched->timeout_wq, &sched->work_tdr, sched->timeout); > > } > > > > +/** > > + * drm_sched_set_timeout - set timeout for reset worker > > + * > > + * @sched: scheduler instance to set and (re)-start the worker for > > + * @timeout: timeout period > > + * > > + * Set and (re)-start the timeout for the given scheduler. > > + */ > > +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout) > > +{ > > + spin_lock(&sched->job_list_lock); > > + sched->timeout = timeout; > > + cancel_delayed_work(&sched->work_tdr); > > I see that the comment says "(re-)start"(sic). Is the rest of the logic > stable in that we don't need to use _sync() version, and/or at least > inspect the return value of the one currently used? > Sorry for the delayed response, just reviving this series now and seeing this comment. We don't care if the TDR is currently executing (at least in Xe which makes use of this function), that is totally fine we only care to change the future timeout values. I believe we actually call this from the TDR in Xe to set the timeout value to zero so using a sync version would deadlock. We do this as a mechanism to kill the drm_gpu_scheduler and immediately timeout all remaining jobs. We also call this in a few other places too with a value of zero for the same reason (kill the drm_gpu_scheduler). Matt > Regards, > Luben > > > + drm_sched_start_timeout(sched); > > + spin_unlock(&sched->job_list_lock); > > +} > > +EXPORT_SYMBOL(drm_sched_set_timeout); > > + > > /** > > * drm_sched_fault - immediately start timeout handler > > * > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h > > index 18172ae63ab7..6258e324bd7c 100644 > > --- a/include/drm/gpu_scheduler.h > > +++ b/include/drm/gpu_scheduler.h > > @@ -593,6 +593,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity, > > struct drm_gpu_scheduler **sched_list, > > unsigned int num_sched_list); > > > > +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout); > > void drm_sched_job_cleanup(struct drm_sched_job *job); > > void drm_sched_wakeup(struct drm_gpu_scheduler *sched); > > void drm_sched_add_msg(struct drm_gpu_scheduler *sched, >
On 2023-07-30 21:09, Matthew Brost wrote: > On Thu, May 04, 2023 at 01:28:12AM -0400, Luben Tuikov wrote: >> On 2023-04-03 20:22, Matthew Brost wrote: >>> Add helper to set TDR timeout and restart the TDR with new timeout >>> value. This will be used in XE, new Intel GPU driver, to trigger the TDR >>> to cleanup drm_sched_entity that encounter errors. >>> >>> Signed-off-by: Matthew Brost <matthew.brost@intel.com> >>> --- >>> drivers/gpu/drm/scheduler/sched_main.c | 18 ++++++++++++++++++ >>> include/drm/gpu_scheduler.h | 1 + >>> 2 files changed, 19 insertions(+) >>> >>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c >>> index 4eac02d212c1..d61880315d8d 100644 >>> --- a/drivers/gpu/drm/scheduler/sched_main.c >>> +++ b/drivers/gpu/drm/scheduler/sched_main.c >>> @@ -370,6 +370,24 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched) >>> queue_delayed_work(sched->timeout_wq, &sched->work_tdr, sched->timeout); >>> } >>> >>> +/** >>> + * drm_sched_set_timeout - set timeout for reset worker >>> + * >>> + * @sched: scheduler instance to set and (re)-start the worker for >>> + * @timeout: timeout period >>> + * >>> + * Set and (re)-start the timeout for the given scheduler. >>> + */ >>> +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout) >>> +{ >>> + spin_lock(&sched->job_list_lock); >>> + sched->timeout = timeout; >>> + cancel_delayed_work(&sched->work_tdr); >> >> I see that the comment says "(re-)start"(sic). Is the rest of the logic >> stable in that we don't need to use _sync() version, and/or at least >> inspect the return value of the one currently used? >> > > Sorry for the delayed response, just reviving this series now and seeing > this comment. > > We don't care if the TDR is currently executing (at least in Xe which > makes use of this function), that is totally fine we only care to change > the future timeout values. I believe we actually call this from the TDR > in Xe to set the timeout value to zero so using a sync version would > deadlock. We do this as a mechanism to kill the drm_gpu_scheduler and > immediately timeout all remaining jobs. We also call this in a few other > places too with a value of zero for the same reason (kill the > drm_gpu_scheduler). (Catching up chronologically after vacation...) Okay, that's fine, but this shows a need for an interface/logic to simply kill the DRM gpu scheduler. So perhaps we need to provide that kind of functionality, as opposed to gaming the scheduler--setting the timeout to 0 to kill the scheduler. Perhaps that would be simpler...?
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c index 4eac02d212c1..d61880315d8d 100644 --- a/drivers/gpu/drm/scheduler/sched_main.c +++ b/drivers/gpu/drm/scheduler/sched_main.c @@ -370,6 +370,24 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched) queue_delayed_work(sched->timeout_wq, &sched->work_tdr, sched->timeout); } +/** + * drm_sched_set_timeout - set timeout for reset worker + * + * @sched: scheduler instance to set and (re)-start the worker for + * @timeout: timeout period + * + * Set and (re)-start the timeout for the given scheduler. + */ +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout) +{ + spin_lock(&sched->job_list_lock); + sched->timeout = timeout; + cancel_delayed_work(&sched->work_tdr); + drm_sched_start_timeout(sched); + spin_unlock(&sched->job_list_lock); +} +EXPORT_SYMBOL(drm_sched_set_timeout); + /** * drm_sched_fault - immediately start timeout handler * diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h index 18172ae63ab7..6258e324bd7c 100644 --- a/include/drm/gpu_scheduler.h +++ b/include/drm/gpu_scheduler.h @@ -593,6 +593,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity, struct drm_gpu_scheduler **sched_list, unsigned int num_sched_list); +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout); void drm_sched_job_cleanup(struct drm_sched_job *job); void drm_sched_wakeup(struct drm_gpu_scheduler *sched); void drm_sched_add_msg(struct drm_gpu_scheduler *sched,
Add helper to set TDR timeout and restart the TDR with new timeout value. This will be used in XE, new Intel GPU driver, to trigger the TDR to cleanup drm_sched_entity that encounter errors. Signed-off-by: Matthew Brost <matthew.brost@intel.com> --- drivers/gpu/drm/scheduler/sched_main.c | 18 ++++++++++++++++++ include/drm/gpu_scheduler.h | 1 + 2 files changed, 19 insertions(+)