Message ID | 20230912021615.2086698-11-matthew.brost@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | DRM scheduler changes for Xe | expand |
On 2023-09-11 22:16, Matthew Brost wrote: > Add helper to set TDR timeout and restart the TDR with new timeout > value. This will be used in XE, new Intel GPU driver, to trigger the TDR > to cleanup drm_sched_entity that encounter errors. Do you just want to trigger the cleanup or do you really want to modify the timeout and requeue TDR delayed work (to be triggered later at a different time)? If the former, then might as well just add a function to queue that right away. If the latter, then this would do, albeit with a few notes as mentioned below. > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > --- > drivers/gpu/drm/scheduler/sched_main.c | 18 ++++++++++++++++++ > include/drm/gpu_scheduler.h | 1 + > 2 files changed, 19 insertions(+) > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c > index 9dbfab7be2c6..689fb6686e01 100644 > --- a/drivers/gpu/drm/scheduler/sched_main.c > +++ b/drivers/gpu/drm/scheduler/sched_main.c > @@ -426,6 +426,24 @@ static void drm_sched_start_timeout_unlocked(struct drm_gpu_scheduler *sched) > spin_unlock(&sched->job_list_lock); > } > > +/** > + * drm_sched_set_timeout - set timeout for reset worker > + * > + * @sched: scheduler instance to set and (re)-start the worker for > + * @timeout: timeout period > + * > + * Set and (re)-start the timeout for the given scheduler. > + */ > +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout) > +{ Well, I'd perhaps call this "drm_sched_set_tdr_timeout()", or something to that effect, as "drm_sched_set_timeout()" isn't clear that it is indeed a cleanup timeout. However, it's totally up to you. :-) It appears that "long timeout" is the new job timeout, so it is possible that a stuck job might be given old timeout + new timeout recovery time, after this function is called. > + spin_lock(&sched->job_list_lock); > + sched->timeout = timeout; > + cancel_delayed_work(&sched->work_tdr); > + drm_sched_start_timeout(sched); > + spin_unlock(&sched->job_list_lock); > +} > +EXPORT_SYMBOL(drm_sched_set_timeout); Well, drm_sched_start_timeout() (which also has a name lacking description, perhaps it should be "drm_sched_start_tdr_timeout()" or "...start_cleanup_timeout()"), anyway, so that function compares to MAX_SCHEDULE_TIMEOUT and pending list not being empty before it requeues delayed TDR work item. So, while a remote possibility, this new function may have the unintended consequence of canceling TDR, and never restarting it. I see it grabs the lock, however. Maybe wrap it in "if (sched->timeout != MAX_SCHEDULE_TIMEOUT)"? How about using mod_delayed_work()?
On Wed, Sep 13, 2023 at 10:38:24PM -0400, Luben Tuikov wrote: > On 2023-09-11 22:16, Matthew Brost wrote: > > Add helper to set TDR timeout and restart the TDR with new timeout > > value. This will be used in XE, new Intel GPU driver, to trigger the TDR > > to cleanup drm_sched_entity that encounter errors. > > Do you just want to trigger the cleanup or do you really want to > modify the timeout and requeue TDR delayed work (to be triggered > later at a different time)? > > If the former, then might as well just add a function to queue that > right away. If the latter, then this would do, albeit with a few Let me change the function to queue it immediately as that is what is needed. Matt > notes as mentioned below. > > > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com> > > --- > > drivers/gpu/drm/scheduler/sched_main.c | 18 ++++++++++++++++++ > > include/drm/gpu_scheduler.h | 1 + > > 2 files changed, 19 insertions(+) > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c > > index 9dbfab7be2c6..689fb6686e01 100644 > > --- a/drivers/gpu/drm/scheduler/sched_main.c > > +++ b/drivers/gpu/drm/scheduler/sched_main.c > > @@ -426,6 +426,24 @@ static void drm_sched_start_timeout_unlocked(struct drm_gpu_scheduler *sched) > > spin_unlock(&sched->job_list_lock); > > } > > > > +/** > > + * drm_sched_set_timeout - set timeout for reset worker > > + * > > + * @sched: scheduler instance to set and (re)-start the worker for > > + * @timeout: timeout period > > + * > > + * Set and (re)-start the timeout for the given scheduler. > > + */ > > +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout) > > +{ > > Well, I'd perhaps call this "drm_sched_set_tdr_timeout()", or something > to that effect, as "drm_sched_set_timeout()" isn't clear that it is indeed > a cleanup timeout. However, it's totally up to you. :-) > > It appears that "long timeout" is the new job timeout, so it is possible > that a stuck job might be given old timeout + new timeout recovery time, > after this function is called. > > > + spin_lock(&sched->job_list_lock); > > + sched->timeout = timeout; > > + cancel_delayed_work(&sched->work_tdr); > > + drm_sched_start_timeout(sched); > > + spin_unlock(&sched->job_list_lock); > > +} > > +EXPORT_SYMBOL(drm_sched_set_timeout); > > Well, drm_sched_start_timeout() (which also has a name lacking description, perhaps > it should be "drm_sched_start_tdr_timeout()" or "...start_cleanup_timeout()"), anyway, > so that function compares to MAX_SCHEDULE_TIMEOUT and pending list not being empty > before it requeues delayed TDR work item. So, while a remote possibility, this new > function may have the unintended consequence of canceling TDR, and never restarting it. > I see it grabs the lock, however. Maybe wrap it in "if (sched->timeout != MAX_SCHEDULE_TIMEOUT)"? > How about using mod_delayed_work()? > -- > Regards, > Luben > > > + > > /** > > * drm_sched_fault - immediately start timeout handler > > * > > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h > > index 5d753ecb5d71..b7b818cd81b6 100644 > > --- a/include/drm/gpu_scheduler.h > > +++ b/include/drm/gpu_scheduler.h > > @@ -596,6 +596,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity, > > struct drm_gpu_scheduler **sched_list, > > unsigned int num_sched_list); > > > > +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout); > > void drm_sched_job_cleanup(struct drm_sched_job *job); > > void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched); > > void drm_sched_add_msg(struct drm_gpu_scheduler *sched, >
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c index 9dbfab7be2c6..689fb6686e01 100644 --- a/drivers/gpu/drm/scheduler/sched_main.c +++ b/drivers/gpu/drm/scheduler/sched_main.c @@ -426,6 +426,24 @@ static void drm_sched_start_timeout_unlocked(struct drm_gpu_scheduler *sched) spin_unlock(&sched->job_list_lock); } +/** + * drm_sched_set_timeout - set timeout for reset worker + * + * @sched: scheduler instance to set and (re)-start the worker for + * @timeout: timeout period + * + * Set and (re)-start the timeout for the given scheduler. + */ +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout) +{ + spin_lock(&sched->job_list_lock); + sched->timeout = timeout; + cancel_delayed_work(&sched->work_tdr); + drm_sched_start_timeout(sched); + spin_unlock(&sched->job_list_lock); +} +EXPORT_SYMBOL(drm_sched_set_timeout); + /** * drm_sched_fault - immediately start timeout handler * diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h index 5d753ecb5d71..b7b818cd81b6 100644 --- a/include/drm/gpu_scheduler.h +++ b/include/drm/gpu_scheduler.h @@ -596,6 +596,7 @@ void drm_sched_entity_modify_sched(struct drm_sched_entity *entity, struct drm_gpu_scheduler **sched_list, unsigned int num_sched_list); +void drm_sched_set_timeout(struct drm_gpu_scheduler *sched, long timeout); void drm_sched_job_cleanup(struct drm_sched_job *job); void drm_sched_wakeup_if_can_queue(struct drm_gpu_scheduler *sched); void drm_sched_add_msg(struct drm_gpu_scheduler *sched,
Add helper to set TDR timeout and restart the TDR with new timeout value. This will be used in XE, new Intel GPU driver, to trigger the TDR to cleanup drm_sched_entity that encounter errors. Signed-off-by: Matthew Brost <matthew.brost@intel.com> --- drivers/gpu/drm/scheduler/sched_main.c | 18 ++++++++++++++++++ include/drm/gpu_scheduler.h | 1 + 2 files changed, 19 insertions(+)