Message ID | 20241105143137.71893-2-pstanner@redhat.com |
---|---|
State | New, archived |
Series | drm/sched: Improve teardown documentation |
On Tue, 2024-11-05 at 15:31 +0100, Philipp Stanner wrote:
> If jobs are still enqueued in struct drm_gpu_scheduler.pending_list
> when drm_sched_fini() gets called, those jobs will be leaked, since that
> function stops both job-submission and (automatic) job-cleanup. It is,
> thus, up to the driver to take care of preventing leaks.
>
> The related function drm_sched_wqueue_stop() also prevents automatic job
> cleanup.
>
> Those pitfalls are currently not reflected in the documentation.
>
> Explicitly inform about the leak problem in the docstring of
> drm_sched_fini().
>
> Additionally, detail the purpose of drm_sched_wqueue_{start,stop} and
> hint at the consequences for automatic cleanup.
>
> Signed-off-by: Philipp Stanner <pstanner@redhat.com>

Got an off-list RB from Christian (the mail from dri-devel apparently got lost).

Applied to drm-misc-next.

P.
```diff
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index e97c6c60bc96..3dfa9db89484 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1333,6 +1333,19 @@ EXPORT_SYMBOL(drm_sched_init);
  * @sched: scheduler instance
  *
  * Tears down and cleans up the scheduler.
+ *
+ * This stops submission of new jobs to the hardware through
+ * drm_sched_backend_ops.run_job(). Consequently, drm_sched_backend_ops.free_job()
+ * will not be called for all jobs still in drm_gpu_scheduler.pending_list.
+ * There is no solution for this currently. Thus, it is up to the driver to make
+ * sure that
+ *  a) drm_sched_fini() is only called after for all submitted jobs
+ *     drm_sched_backend_ops.free_job() has been called or that
+ *  b) the jobs for which drm_sched_backend_ops.free_job() has not been called
+ *     after drm_sched_fini() ran are freed manually.
+ *
+ * FIXME: Take care of the above problem and prevent this function from leaking
+ * the jobs in drm_gpu_scheduler.pending_list under any circumstances.
  */
 void drm_sched_fini(struct drm_gpu_scheduler *sched)
 {
@@ -1428,8 +1441,10 @@ EXPORT_SYMBOL(drm_sched_wqueue_ready);
 
 /**
  * drm_sched_wqueue_stop - stop scheduler submission
- *
  * @sched: scheduler instance
+ *
+ * Stops the scheduler from pulling new jobs from entities. It also stops
+ * freeing jobs automatically through drm_sched_backend_ops.free_job().
  */
 void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched)
 {
@@ -1441,8 +1456,12 @@ EXPORT_SYMBOL(drm_sched_wqueue_stop);
 
 /**
  * drm_sched_wqueue_start - start scheduler submission
- *
  * @sched: scheduler instance
+ *
+ * Restarts the scheduler after drm_sched_wqueue_stop() has stopped it.
+ *
+ * This function is not necessary for 'conventional' startup. The scheduler is
+ * fully operational after drm_sched_init() succeeded.
  */
 void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched)
 {
```
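For context, here is a minimal, hypothetical driver sketch of how the drm_sched_wqueue_{stop,start}() semantics documented above play out in practice. The foo_device structure and foo_hw_reset() are made-up names, not part of the patch or of the scheduler API:

```c
#include <drm/gpu_scheduler.h>

/* Hypothetical driver device embedding a scheduler, for illustration only. */
struct foo_device {
	struct drm_gpu_scheduler sched;
};

static void foo_hw_reset(struct foo_device *fdev);	/* made-up HW reset */

/*
 * Pause scheduler submission around a device reset. While the workqueue is
 * stopped, neither run_job() nor free_job() is called, so even signalled
 * jobs remain on pending_list until drm_sched_wqueue_start() runs again.
 */
static void foo_device_reset(struct foo_device *fdev)
{
	drm_sched_wqueue_stop(&fdev->sched);	/* no run_job()/free_job() */

	foo_hw_reset(fdev);			/* reset the hardware */

	drm_sched_wqueue_start(&fdev->sched);	/* submission and cleanup resume */
}
```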
If jobs are still enqueued in struct drm_gpu_scheduler.pending_list when
drm_sched_fini() gets called, those jobs will be leaked, since that function
stops both job-submission and (automatic) job-cleanup. It is, thus, up to the
driver to take care of preventing leaks.

The related function drm_sched_wqueue_stop() also prevents automatic job
cleanup.

Those pitfalls are currently not reflected in the documentation.

Explicitly inform about the leak problem in the docstring of drm_sched_fini().

Additionally, detail the purpose of drm_sched_wqueue_{start,stop} and hint at
the consequences for automatic cleanup.

Signed-off-by: Philipp Stanner <pstanner@redhat.com>
---
Hi,

in our discussion about my proposed waitqueue cleanup for this problem, Sima
suggested [1] that we document the problems first and, by doing so, reach a
consensus on what the problems actually are and how we could solve them.

This is my proposal for documenting the leaks on teardown. Feedback very
welcome.

P.

[1] https://lore.kernel.org/dri-devel/ZtidJ8S9THvzkQ-6@phenom.ffwll.local/
---
 drivers/gpu/drm/scheduler/sched_main.c | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)
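To illustrate option a) from the new drm_sched_fini() docstring, below is a minimal sketch of a driver that tracks its in-flight jobs and only tears the scheduler down once free_job() has run for all of them. All foo_* names, the atomic counter, and the wait queue are assumptions made for illustration; the patch itself does not prescribe any particular mechanism:

```c
#include <linux/atomic.h>
#include <linux/slab.h>
#include <linux/wait.h>
#include <drm/gpu_scheduler.h>

/* Hypothetical driver device and job types, for illustration only. */
struct foo_device {
	struct drm_gpu_scheduler sched;
	atomic_t active_jobs;		/* jobs submitted but not yet freed */
	wait_queue_head_t teardown_wq;	/* woken when active_jobs hits zero */
};

struct foo_job {
	struct drm_sched_job base;
	struct foo_device *fdev;
};

/* drm_sched_backend_ops.free_job() callback of this made-up driver. */
static void foo_free_job(struct drm_sched_job *sched_job)
{
	struct foo_job *job = container_of(sched_job, struct foo_job, base);
	struct foo_device *fdev = job->fdev;

	drm_sched_job_cleanup(sched_job);
	kfree(job);

	/* Last pending job freed? Then teardown may proceed. */
	if (atomic_dec_and_test(&fdev->active_jobs))
		wake_up(&fdev->teardown_wq);
}

/* Option a): only call drm_sched_fini() once free_job() ran for all jobs. */
static void foo_fini(struct foo_device *fdev)
{
	wait_event(fdev->teardown_wq, atomic_read(&fdev->active_jobs) == 0);
	drm_sched_fini(&fdev->sched);
}
```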