diff mbox series

[v5,2/3] drm/sched: Adjust outdated docu for run_job()

Message ID 20250220112813.87992-4-phasta@kernel.org (mailing list archive)
State New
Headers show
Series drm/sched: Documentation and refcount improvements | expand

Commit Message

Philipp Stanner Feb. 20, 2025, 11:28 a.m. UTC
The documentation for drm_sched_backend_ops.run_job() mentions a certain
function called drm_sched_job_recovery(). This function does not exist.
What's actually meant is drm_sched_resubmit_jobs(), which is by now also
deprecated.

Remove the mention of the removed function.

Discourage the behavior of drm_sched_backend_ops.run_job() being called
multiple times for the same job.

Signed-off-by: Philipp Stanner <phasta@kernel.org>
---
 include/drm/gpu_scheduler.h | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

Comments

Maíra Canal Feb. 20, 2025, 1:28 p.m. UTC | #1
Hi Philipp,

On 20/02/25 08:28, Philipp Stanner wrote:
> The documentation for drm_sched_backend_ops.run_job() mentions a certain
> function called drm_sched_job_recovery(). This function does not exist.
> What's actually meant is drm_sched_resubmit_jobs(), which is by now also
> deprecated.
> 
> Remove the mention of the removed function.
> 
> Discourage the behavior of drm_sched_backend_ops.run_job() being called
> multiple times for the same job.

It looks odd to me that this patch removes lines that were added in
patch 1/3. Maybe you could change the patchset order and place this one
as the first.

> 
> Signed-off-by: Philipp Stanner <phasta@kernel.org>
> ---
>   include/drm/gpu_scheduler.h | 19 +++++++++++++------
>   1 file changed, 13 insertions(+), 6 deletions(-)
> 
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 916279b5aa00..29e5bda91806 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -421,20 +421,27 @@ struct drm_sched_backend_ops {
>   
>   	/**
>   	 * @run_job: Called to execute the job once all of the dependencies
> -	 * have been resolved. This may be called multiple times, if
> -	 * timedout_job() has happened and drm_sched_job_recovery() decides to
> -	 * try it again.
> +	 * have been resolved.
> +	 *
> +	 * The deprecated drm_sched_resubmit_jobs() (called from
> +	 * drm_sched_backend_ops.timedout_job()) can invoke this again with the

I think it would be "@timedout_job".

> +	 * same parameters. Using this is discouraged because it, presumably,
> +	 * violates dma_fence rules.

I believe it would be "struct dma_fence".

> +	 *
> +	 * TODO: Document which fence rules above.
>   	 *
>   	 * @sched_job: the job to run
>   	 *
> -	 * Returns: dma_fence the driver must signal once the hardware has
> -	 *	completed the job ("hardware fence").
> -	 *
>   	 * Note that the scheduler expects to 'inherit' its own reference to
>   	 * this fence from the callback. It does not invoke an extra
>   	 * dma_fence_get() on it. Consequently, this callback must take a
>   	 * reference for the scheduler, and additional ones for the driver's
>   	 * respective needs.

Would it be possible to add a comment that `run_job()` must check if
`s_fence->finished.error` is different than 0? If you increase the karma
of a job and don't check for `s_fence->finished.error`, you might run a
cancelled job.

> +	 *
> +	 * Return:
> +	 * * On success: dma_fence the driver must signal once the hardware has
> +	 * completed the job ("hardware fence").

A suggestion: "the fence that the driver must signal once the hardware
has completed the job".

Best Regards,
- Maíra

> +	 * * On failure: NULL or an ERR_PTR.
>   	 */
>   	struct dma_fence *(*run_job)(struct drm_sched_job *sched_job);
>
Philipp Stanner Feb. 20, 2025, 3:28 p.m. UTC | #2
On Thu, 2025-02-20 at 10:28 -0300, Maíra Canal wrote:
> Hi Philipp,
> 
> On 20/02/25 08:28, Philipp Stanner wrote:
> > The documentation for drm_sched_backend_ops.run_job() mentions a
> > certain
> > function called drm_sched_job_recovery(). This function does not
> > exist.
> > What's actually meant is drm_sched_resubmit_jobs(), which is by now
> > also
> > deprecated.
> > 
> > Remove the mention of the removed function.
> > 
> > Discourage the behavior of drm_sched_backend_ops.run_job() being
> > called
> > multiple times for the same job.
> 
> It looks odd to me that this patch removes lines that were added in
> patch 1/3. Maybe you could change the patchset order and place this
> one
> as the first.
> 
> > 
> > Signed-off-by: Philipp Stanner <phasta@kernel.org>
> > ---
> >   include/drm/gpu_scheduler.h | 19 +++++++++++++------
> >   1 file changed, 13 insertions(+), 6 deletions(-)
> > 
> > diff --git a/include/drm/gpu_scheduler.h
> > b/include/drm/gpu_scheduler.h
> > index 916279b5aa00..29e5bda91806 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -421,20 +421,27 @@ struct drm_sched_backend_ops {
> >   
> >   	/**
> >   	 * @run_job: Called to execute the job once all of the
> > dependencies
> > -	 * have been resolved. This may be called multiple times,
> > if
> > -	 * timedout_job() has happened and
> > drm_sched_job_recovery() decides to
> > -	 * try it again.
> > +	 * have been resolved.
> > +	 *
> > +	 * The deprecated drm_sched_resubmit_jobs() (called from
> > +	 * drm_sched_backend_ops.timedout_job()) can invoke this
> > again with the
> 
> I think it would be "@timedout_job".

Not sure, isn't referencing in docstrings done with '&'?

> 
> > +	 * same parameters. Using this is discouraged because it,
> > presumably,
> > +	 * violates dma_fence rules.
> 
> I believe it would be "struct dma_fence".

Well, in this case strictly speaking not IMO, because it's about the
rules of the "DMA Fence Subsystem", not about the struct itself.

I'd just keep it that way or call it "dma fence"

> 
> > +	 *
> > +	 * TODO: Document which fence rules above.
> >   	 *
> >   	 * @sched_job: the job to run
> >   	 *
> > -	 * Returns: dma_fence the driver must signal once the
> > hardware has
> > -	 *	completed the job ("hardware fence").
> > -	 *
> >   	 * Note that the scheduler expects to 'inherit' its own
> > reference to
> >   	 * this fence from the callback. It does not invoke an
> > extra
> >   	 * dma_fence_get() on it. Consequently, this callback must
> > take a
> >   	 * reference for the scheduler, and additional ones for
> > the driver's
> >   	 * respective needs.
> 
> Would it be possible to add a comment that `run_job()` must check if
> `s_fence->finished.error` is different than 0? If you increase the
> karma
> of a job and don't check for `s_fence->finished.error`, you might run
> a
> cancelled job.

s_fence->finished is only signaled and its error set once the hardware
fence got signaled; or when the entity is killed.

In any case, signaling "finished" will cause the job to be prevented
from being executed (again), and will never reach run_job() in the
first place.

Correct me if I am mistaken.

Or are you suggesting that there is a race?


P.

> 
> > +	 *
> > +	 * Return:
> > +	 * * On success: dma_fence the driver must signal once the
> > hardware has
> > +	 * completed the job ("hardware fence").
> 
> A suggestion: "the fence that the driver must signal once the
> hardware
> has completed the job".
> 
> Best Regards,
> - Maíra
> 
> > +	 * * On failure: NULL or an ERR_PTR.
> >   	 */
> >   	struct dma_fence *(*run_job)(struct drm_sched_job
> > *sched_job);
> >   
>
diff mbox series

Patch

diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 916279b5aa00..29e5bda91806 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -421,20 +421,27 @@  struct drm_sched_backend_ops {
 
 	/**
 	 * @run_job: Called to execute the job once all of the dependencies
-	 * have been resolved. This may be called multiple times, if
-	 * timedout_job() has happened and drm_sched_job_recovery() decides to
-	 * try it again.
+	 * have been resolved.
+	 *
+	 * The deprecated drm_sched_resubmit_jobs() (called from
+	 * drm_sched_backend_ops.timedout_job()) can invoke this again with the
+	 * same parameters. Using this is discouraged because it, presumably,
+	 * violates dma_fence rules.
+	 *
+	 * TODO: Document which fence rules above.
 	 *
 	 * @sched_job: the job to run
 	 *
-	 * Returns: dma_fence the driver must signal once the hardware has
-	 *	completed the job ("hardware fence").
-	 *
 	 * Note that the scheduler expects to 'inherit' its own reference to
 	 * this fence from the callback. It does not invoke an extra
 	 * dma_fence_get() on it. Consequently, this callback must take a
 	 * reference for the scheduler, and additional ones for the driver's
 	 * respective needs.
+	 *
+	 * Return:
+	 * * On success: dma_fence the driver must signal once the hardware has
+	 * completed the job ("hardware fence").
+	 * * On failure: NULL or an ERR_PTR.
 	 */
 	struct dma_fence *(*run_job)(struct drm_sched_job *sched_job);