diff mbox series

drm/sched: revert "drm_sched_job_cleanup(): correct false doc"

Message ID 20250310074414.2129157-1-christian.koenig@amd.com (mailing list archive)
State New
Headers show
Series drm/sched: revert "drm_sched_job_cleanup(): correct false doc" | expand

Commit Message

Christian König March 10, 2025, 7:44 a.m. UTC
This reverts commit 44d2f310f008613c1dbe5e234c2cf2be90cbbfab.

Sorry for the delayed response, I only stumbled over this now while going
over old mails and then re-thinking my reviewed by for this change.

The function drm_sched_job_arm() is indeed the point of no return. The
background is that it is nearly impossible for the driver to correctly
retract the fence and signal it in the order enforced by the dma_fence
framework.

The code in drm_sched_job_cleanup() is for the purpose to cleanup after
the job was armed through drm_sched_job_arm() *and* processed by the
scheduler.

The correct approach for error handling in this situation is to set the
error on the fences and then push to the entity anyway. We can certainly
improve the documentation, but removing the warning is clearly not a good
idea.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

Comments

Philipp Stanner March 10, 2025, 8:41 a.m. UTC | #1
On Mon, 2025-03-10 at 08:44 +0100, Christian König wrote:
> This reverts commit 44d2f310f008613c1dbe5e234c2cf2be90cbbfab.
> 
> Sorry for the delayed response, I only stumbled over this now while
> going
> over old mails and then re-thinking my reviewed by for this change.
> 
> The function drm_sched_job_arm() is indeed the point of no return.

So would you say that the comment in the function's body,
"drm_sched_job_arm() has been called" actually means / should mean "job
had been submitted with drm_sched_entity_push_job()"?

> The
> background is that it is nearly impossible for the driver to
> correctly
> retract the fence and signal it in the order enforced by the
> dma_fence
> framework.
> 
> The code in drm_sched_job_cleanup() is for the purpose to cleanup
> after
> the job was armed through drm_sched_job_arm() *and* processed by the
> scheduler.
> 
> The correct approach for error handling in this situation is to set
> the
> error on the fences and then push to the entity anyway.

You expect the driver to set an error on scheduled and finished fence?
I think this would be very likely to explode. AFAICS the scheduler code
has no awareness for anyone else having touched those fences.

If at all, if this is really a problem, we should tell the driver that
it must enforce that there will be no failure between
drm_sched_job_arm() and drm_sched_job_entity_push_job(). That's how
Nouveau does it.

Let's set our understanding straight before reverting. What
drm_sched_job_arm() does is:

   1. Pick scheduler, rq and priority for the job
   2. Atomically increment the job_id_count and assign to job
   3. Call drm_sched_fence_init() and therefore:
   4. Set the fence's scheduler
   5. Set the fences seq nr atomically
   6. Initialize scheduled and finished fence

What of the above precisely is the problem?

You say the driver cannot "correctly retract the fence and signal it in
the order enforced by the dma_fence framework". You mean that the
finished_fences have to be signalled in an order only the scheduler
knows? I assume you're referring to set dependencies?

P.

> We can certainly
> improve the documentation, but removing the warning is clearly not a
> good
> idea.
> 
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 12 +++++-------
>  1 file changed, 5 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> b/drivers/gpu/drm/scheduler/sched_main.c
> index 53e6aec37b46..4d4219fbe49d 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1015,13 +1015,11 @@ EXPORT_SYMBOL(drm_sched_job_has_dependency);
>   * Cleans up the resources allocated with drm_sched_job_init().
>   *
>   * Drivers should call this from their error unwind code if @job is
> aborted
> - * before it was submitted to an entity with
> drm_sched_entity_push_job().
> + * before drm_sched_job_arm() is called.
>   *
> - * Since calling drm_sched_job_arm() causes the job's fences to be
> initialized,
> - * it is up to the driver to ensure that fences that were exposed to
> external
> - * parties get signaled. drm_sched_job_cleanup() does not ensure
> this.
> - *
> - * This function must also be called in &struct
> drm_sched_backend_ops.free_job
> + * After that point of no return @job is committed to be executed by
> the
> + * scheduler, and this function should be called from the
> + * &drm_sched_backend_ops.free_job callback.
>   */
>  void drm_sched_job_cleanup(struct drm_sched_job *job)
>  {
> @@ -1032,7 +1030,7 @@ void drm_sched_job_cleanup(struct drm_sched_job
> *job)
>   /* drm_sched_job_arm() has been called */
>   dma_fence_put(&job->s_fence->finished);
>   } else {
> - /* aborted job before arming */
> + /* aborted job before committing to run it */
>   drm_sched_fence_free(job->s_fence);
>   }
>
Tvrtko Ursulin March 10, 2025, 9:30 a.m. UTC | #2
On 10/03/2025 08:41, Philipp Stanner wrote:
> On Mon, 2025-03-10 at 08:44 +0100, Christian König wrote:
>> This reverts commit 44d2f310f008613c1dbe5e234c2cf2be90cbbfab.
>>
>> Sorry for the delayed response, I only stumbled over this now while
>> going
>> over old mails and then re-thinking my reviewed by for this change.
>>
>> The function drm_sched_job_arm() is indeed the point of no return.
> 
> So would you say that the comment in the function's body,
> "drm_sched_job_arm() has been called" actually means / should mean "job
> had been submitted with drm_sched_entity_push_job()"?
> 
>> The
>> background is that it is nearly impossible for the driver to
>> correctly
>> retract the fence and signal it in the order enforced by the
>> dma_fence
>> framework.
>>
>> The code in drm_sched_job_cleanup() is for the purpose to cleanup
>> after
>> the job was armed through drm_sched_job_arm() *and* processed by the
>> scheduler.
>>
>> The correct approach for error handling in this situation is to set
>> the
>> error on the fences and then push to the entity anyway.
> 
> You expect the driver to set an error on scheduled and finished fence?
> I think this would be very likely to explode. AFAICS the scheduler code
> has no awareness for anyone else having touched those fences.
> 
> If at all, if this is really a problem, we should tell the driver that
> it must enforce that there will be no failure between
> drm_sched_job_arm() and drm_sched_job_entity_push_job(). That's how
> Nouveau does it.

On top of Philipp's questions - I just want to raise that 
amdgpu_cs_submit then also needs explaining. I always take amdgpu as 
kind of almost reference, since that is where scheduler originated from. 
And in there there is definitely the path of drm_sched_job_cleanup() 
called after drm_sched_job_arm() when adding gang dependencies fails.

Regards,

Tvrtko

> 
> Let's set our understanding straight before reverting. What
> drm_sched_job_arm() does is:
> 
>     1. Pick scheduler, rq and priority for the job
>     2. Atomically increment the job_id_count and assign to job
>     3. Call drm_sched_fence_init() and therefore:
>     4. Set the fence's scheduler
>     5. Set the fences seq nr atomically
>     6. Initialize scheduled and finished fence
> 
> What of the above precisely is the problem?
> 
> You say the driver cannot "correctly retract the fence and signal it in
> the order enforced by the dma_fence framework". You mean that the
> finished_fences have to be signalled in an order only the scheduler
> knows? I assume you're referring to set dependencies?
> 
> P.
> 
>> We can certainly
>> improve the documentation, but removing the warning is clearly not a
>> good
>> idea.
>>
>> Signed-off-by: Christian König <christian.koenig@amd.com>
>> ---
>>   drivers/gpu/drm/scheduler/sched_main.c | 12 +++++-------
>>   1 file changed, 5 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>> b/drivers/gpu/drm/scheduler/sched_main.c
>> index 53e6aec37b46..4d4219fbe49d 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -1015,13 +1015,11 @@ EXPORT_SYMBOL(drm_sched_job_has_dependency);
>>    * Cleans up the resources allocated with drm_sched_job_init().
>>    *
>>    * Drivers should call this from their error unwind code if @job is
>> aborted
>> - * before it was submitted to an entity with
>> drm_sched_entity_push_job().
>> + * before drm_sched_job_arm() is called.
>>    *
>> - * Since calling drm_sched_job_arm() causes the job's fences to be
>> initialized,
>> - * it is up to the driver to ensure that fences that were exposed to
>> external
>> - * parties get signaled. drm_sched_job_cleanup() does not ensure
>> this.
>> - *
>> - * This function must also be called in &struct
>> drm_sched_backend_ops.free_job
>> + * After that point of no return @job is committed to be executed by
>> the
>> + * scheduler, and this function should be called from the
>> + * &drm_sched_backend_ops.free_job callback.
>>    */
>>   void drm_sched_job_cleanup(struct drm_sched_job *job)
>>   {
>> @@ -1032,7 +1030,7 @@ void drm_sched_job_cleanup(struct drm_sched_job
>> *job)
>>    /* drm_sched_job_arm() has been called */
>>    dma_fence_put(&job->s_fence->finished);
>>    } else {
>> - /* aborted job before arming */
>> + /* aborted job before committing to run it */
>>    drm_sched_fence_free(job->s_fence);
>>    }
>>   
>
Christian König March 10, 2025, 9:56 a.m. UTC | #3
Am 10.03.25 um 10:30 schrieb Tvrtko Ursulin:
>
> On 10/03/2025 08:41, Philipp Stanner wrote:
>> On Mon, 2025-03-10 at 08:44 +0100, Christian König wrote:
>>> This reverts commit 44d2f310f008613c1dbe5e234c2cf2be90cbbfab.
>>>
>>> Sorry for the delayed response, I only stumbled over this now while
>>> going
>>> over old mails and then re-thinking my reviewed by for this change.
>>>
>>> The function drm_sched_job_arm() is indeed the point of no return.
>>
>> So would you say that the comment in the function's body,
>> "drm_sched_job_arm() has been called" actually means / should mean "job
>> had been submitted with drm_sched_entity_push_job()"?

Yes, probably. The problem is simply that we need to guarantee correct order of signaling for the dma_fence.

Otherwise functions like dma_fence_is_later()/dma_fence_later() start to blow up and we run into potential use after free problems.

>>
>>> The
>>> background is that it is nearly impossible for the driver to
>>> correctly
>>> retract the fence and signal it in the order enforced by the
>>> dma_fence
>>> framework.
>>>
>>> The code in drm_sched_job_cleanup() is for the purpose to cleanup
>>> after
>>> the job was armed through drm_sched_job_arm() *and* processed by the
>>> scheduler.
>>>
>>> The correct approach for error handling in this situation is to set
>>> the
>>> error on the fences and then push to the entity anyway.
>>
>> You expect the driver to set an error on scheduled and finished fence?

Ideally we would avoid that as well. In general drivers should be coded so that after calling drm_sched_job_arm() nothing can go wrong any more.

Setting an error on either the scheduler or the finished fence is basically just the last resort if we can't really avoid any error handling.

>> I think this would be very likely to explode. AFAICS the scheduler code
>> has no awareness for anyone else having touched those fences.

Yeah that is probably not well handled today.

>>
>> If at all, if this is really a problem, we should tell the driver that
>> it must enforce that there will be no failure between
>> drm_sched_job_arm() and drm_sched_job_entity_push_job(). That's how
>> Nouveau does it.
>
> On top of Philipp's questions - I just want to raise that amdgpu_cs_submit then also needs explaining. I always take amdgpu as kind of almost reference, since that is where scheduler originated from. And in there there is definitely the path of drm_sched_job_cleanup() called after drm_sched_job_arm() when adding gang dependencies fails.

Oh shit, yeah that doesn't work correctly at all. The user pages handling is completely broken as well.

Thanks for pointing that out.

Regards,
Christian.

>
> Regards,
>
> Tvrtko
>
>>
>> Let's set our understanding straight before reverting. What
>> drm_sched_job_arm() does is:
>>
>>     1. Pick scheduler, rq and priority for the job
>>     2. Atomically increment the job_id_count and assign to job
>>     3. Call drm_sched_fence_init() and therefore:
>>     4. Set the fence's scheduler
>>     5. Set the fences seq nr atomically
>>     6. Initialize scheduled and finished fence
>>
>> What of the above precisely is the problem?
>>
>> You say the driver cannot "correctly retract the fence and signal it in
>> the order enforced by the dma_fence framework". You mean that the
>> finished_fences have to be signalled in an order only the scheduler
>> knows? I assume you're referring to set dependencies?
>>
>> P.
>>
>>> We can certainly
>>> improve the documentation, but removing the warning is clearly not a
>>> good
>>> idea.
>>>
>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>> ---
>>>   drivers/gpu/drm/scheduler/sched_main.c | 12 +++++-------
>>>   1 file changed, 5 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 53e6aec37b46..4d4219fbe49d 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -1015,13 +1015,11 @@ EXPORT_SYMBOL(drm_sched_job_has_dependency);
>>>    * Cleans up the resources allocated with drm_sched_job_init().
>>>    *
>>>    * Drivers should call this from their error unwind code if @job is
>>> aborted
>>> - * before it was submitted to an entity with
>>> drm_sched_entity_push_job().
>>> + * before drm_sched_job_arm() is called.
>>>    *
>>> - * Since calling drm_sched_job_arm() causes the job's fences to be
>>> initialized,
>>> - * it is up to the driver to ensure that fences that were exposed to
>>> external
>>> - * parties get signaled. drm_sched_job_cleanup() does not ensure
>>> this.
>>> - *
>>> - * This function must also be called in &struct
>>> drm_sched_backend_ops.free_job
>>> + * After that point of no return @job is committed to be executed by
>>> the
>>> + * scheduler, and this function should be called from the
>>> + * &drm_sched_backend_ops.free_job callback.
>>>    */
>>>   void drm_sched_job_cleanup(struct drm_sched_job *job)
>>>   {
>>> @@ -1032,7 +1030,7 @@ void drm_sched_job_cleanup(struct drm_sched_job
>>> *job)
>>>    /* drm_sched_job_arm() has been called */
>>>    dma_fence_put(&job->s_fence->finished);
>>>    } else {
>>> - /* aborted job before arming */
>>> + /* aborted job before committing to run it */
>>>    drm_sched_fence_free(job->s_fence);
>>>    }
>>>   
>>
>
Philipp Stanner March 10, 2025, 12:11 p.m. UTC | #4
On Mon, 2025-03-10 at 08:44 +0100, Christian König wrote:
> This reverts commit 44d2f310f008613c1dbe5e234c2cf2be90cbbfab.

OK, your arguments with fence ordering are strong. Please update the
commit message according to our discussion:

> 
> Sorry for the delayed response, I only stumbled over this now while
> going
> over old mails and then re-thinking my reviewed by for this change.

Your RB hadn't even been applied (I merged before you gave it), so you
can remove this first paragraph from the commit message

> 
> The function drm_sched_job_arm() is indeed the point of no return.
> The
> background is that it is nearly impossible for the driver to
> correctly
> retract the fence and signal it in the order enforced by the
> dma_fence
> framework.
> 
> The code in drm_sched_job_cleanup() is for the purpose to cleanup
> after
> the job was armed through drm_sched_job_arm() *and* processed by the
> scheduler.
> 
> The correct approach for error handling in this situation is to set
> the
> error on the fences and then push to the entity anyway. We can
> certainly
> improve the documentation, but removing the warning is clearly not a
> good
> idea.

This last paragraph, as per our discussion, seems invalid. We shouldn't
have that in the commit log, so that it won't give later hackers
browsing it wrong ideas and we end up with someone actually mengling
with those fences.

Thx
P.

> 
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 12 +++++-------
>  1 file changed, 5 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> b/drivers/gpu/drm/scheduler/sched_main.c
> index 53e6aec37b46..4d4219fbe49d 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1015,13 +1015,11 @@ EXPORT_SYMBOL(drm_sched_job_has_dependency);
>   * Cleans up the resources allocated with drm_sched_job_init().
>   *
>   * Drivers should call this from their error unwind code if @job is
> aborted
> - * before it was submitted to an entity with
> drm_sched_entity_push_job().
> + * before drm_sched_job_arm() is called.
>   *
> - * Since calling drm_sched_job_arm() causes the job's fences to be
> initialized,
> - * it is up to the driver to ensure that fences that were exposed to
> external
> - * parties get signaled. drm_sched_job_cleanup() does not ensure
> this.
> - *
> - * This function must also be called in &struct
> drm_sched_backend_ops.free_job
> + * After that point of no return @job is committed to be executed by
> the
> + * scheduler, and this function should be called from the
> + * &drm_sched_backend_ops.free_job callback.
>   */
>  void drm_sched_job_cleanup(struct drm_sched_job *job)
>  {
> @@ -1032,7 +1030,7 @@ void drm_sched_job_cleanup(struct drm_sched_job
> *job)
>  		/* drm_sched_job_arm() has been called */
>  		dma_fence_put(&job->s_fence->finished);
>  	} else {
> -		/* aborted job before arming */
> +		/* aborted job before committing to run it */
>  		drm_sched_fence_free(job->s_fence);
>  	}
>
Christian König March 10, 2025, 12:25 p.m. UTC | #5
Am 10.03.25 um 13:11 schrieb Philipp Stanner:
> On Mon, 2025-03-10 at 08:44 +0100, Christian König wrote:
>> This reverts commit 44d2f310f008613c1dbe5e234c2cf2be90cbbfab.
> OK, your arguments with fence ordering are strong. Please update the
> commit message according to our discussion:
>
>> Sorry for the delayed response, I only stumbled over this now while
>> going
>> over old mails and then re-thinking my reviewed by for this change.
> Your RB hadn't even been applied (I merged before you gave it), so you
> can remove this first paragraph from the commit message
>
>> The function drm_sched_job_arm() is indeed the point of no return.
>> The
>> background is that it is nearly impossible for the driver to
>> correctly
>> retract the fence and signal it in the order enforced by the
>> dma_fence
>> framework.
>>
>> The code in drm_sched_job_cleanup() is for the purpose to cleanup
>> after
>> the job was armed through drm_sched_job_arm() *and* processed by the
>> scheduler.
>>
>> The correct approach for error handling in this situation is to set
>> the
>> error on the fences and then push to the entity anyway. We can
>> certainly
>> improve the documentation, but removing the warning is clearly not a
>> good
>> idea.
> This last paragraph, as per our discussion, seems invalid. We shouldn't
> have that in the commit log, so that it won't give later hackers
> browsing it wrong ideas and we end up with someone actually mengling
> with those fences.

Sure, going to make those updates. I just wanted to give people a possible direction to look into when they really run into a situation where they need to abort some submission very late.

Should I also clarify the comment in drm_sched_job_cleanup()?

Regards,
Christian.

>
> Thx
> P.
>
>> Signed-off-by: Christian König <christian.koenig@amd.com>
>> ---
>>  drivers/gpu/drm/scheduler/sched_main.c | 12 +++++-------
>>  1 file changed, 5 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>> b/drivers/gpu/drm/scheduler/sched_main.c
>> index 53e6aec37b46..4d4219fbe49d 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -1015,13 +1015,11 @@ EXPORT_SYMBOL(drm_sched_job_has_dependency);
>>   * Cleans up the resources allocated with drm_sched_job_init().
>>   *
>>   * Drivers should call this from their error unwind code if @job is
>> aborted
>> - * before it was submitted to an entity with
>> drm_sched_entity_push_job().
>> + * before drm_sched_job_arm() is called.
>>   *
>> - * Since calling drm_sched_job_arm() causes the job's fences to be
>> initialized,
>> - * it is up to the driver to ensure that fences that were exposed to
>> external
>> - * parties get signaled. drm_sched_job_cleanup() does not ensure
>> this.
>> - *
>> - * This function must also be called in &struct
>> drm_sched_backend_ops.free_job
>> + * After that point of no return @job is committed to be executed by
>> the
>> + * scheduler, and this function should be called from the
>> + * &drm_sched_backend_ops.free_job callback.
>>   */
>>  void drm_sched_job_cleanup(struct drm_sched_job *job)
>>  {
>> @@ -1032,7 +1030,7 @@ void drm_sched_job_cleanup(struct drm_sched_job
>> *job)
>>  		/* drm_sched_job_arm() has been called */
>>  		dma_fence_put(&job->s_fence->finished);
>>  	} else {
>> -		/* aborted job before arming */
>> +		/* aborted job before committing to run it */
>>  		drm_sched_fence_free(job->s_fence);
>>  	}
>>
Tvrtko Ursulin March 10, 2025, 12:27 p.m. UTC | #6
On 10/03/2025 12:11, Philipp Stanner wrote:
> On Mon, 2025-03-10 at 08:44 +0100, Christian König wrote:
>> This reverts commit 44d2f310f008613c1dbe5e234c2cf2be90cbbfab.
> 
> OK, your arguments with fence ordering are strong. Please update the
> commit message according to our discussion:

Could that argument please be explained in more concrete terms?

Are we talking here about skipping one seqno has potential to cause a 
problem, or there is more to it?

Because if it is just skipping I don't immediately see that breaks the 
monotonic/unique seqno ordering.

Only if we are worried about some code somewhere making assumptions  "if 
N got completed, that means N-1 got completed too". That generally isn't 
anything new and can happen with GPU resets, albeit in the latter case 
the fence error is I think always set.

Regards,

Tvrtko

>> Sorry for the delayed response, I only stumbled over this now while
>> going
>> over old mails and then re-thinking my reviewed by for this change.
> 
> Your RB hadn't even been applied (I merged before you gave it), so you
> can remove this first paragraph from the commit message
> 
>>
>> The function drm_sched_job_arm() is indeed the point of no return.
>> The
>> background is that it is nearly impossible for the driver to
>> correctly
>> retract the fence and signal it in the order enforced by the
>> dma_fence
>> framework.
>>
>> The code in drm_sched_job_cleanup() is for the purpose to cleanup
>> after
>> the job was armed through drm_sched_job_arm() *and* processed by the
>> scheduler.
>>
>> The correct approach for error handling in this situation is to set
>> the
>> error on the fences and then push to the entity anyway. We can
>> certainly
>> improve the documentation, but removing the warning is clearly not a
>> good
>> idea.
> 
> This last paragraph, as per our discussion, seems invalid. We shouldn't
> have that in the commit log, so that it won't give later hackers
> browsing it wrong ideas and we end up with someone actually mengling
> with those fences.
> 
> Thx
> P.
> 
>>
>> Signed-off-by: Christian König <christian.koenig@amd.com>
>> ---
>>   drivers/gpu/drm/scheduler/sched_main.c | 12 +++++-------
>>   1 file changed, 5 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>> b/drivers/gpu/drm/scheduler/sched_main.c
>> index 53e6aec37b46..4d4219fbe49d 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -1015,13 +1015,11 @@ EXPORT_SYMBOL(drm_sched_job_has_dependency);
>>    * Cleans up the resources allocated with drm_sched_job_init().
>>    *
>>    * Drivers should call this from their error unwind code if @job is
>> aborted
>> - * before it was submitted to an entity with
>> drm_sched_entity_push_job().
>> + * before drm_sched_job_arm() is called.
>>    *
>> - * Since calling drm_sched_job_arm() causes the job's fences to be
>> initialized,
>> - * it is up to the driver to ensure that fences that were exposed to
>> external
>> - * parties get signaled. drm_sched_job_cleanup() does not ensure
>> this.
>> - *
>> - * This function must also be called in &struct
>> drm_sched_backend_ops.free_job
>> + * After that point of no return @job is committed to be executed by
>> the
>> + * scheduler, and this function should be called from the
>> + * &drm_sched_backend_ops.free_job callback.
>>    */
>>   void drm_sched_job_cleanup(struct drm_sched_job *job)
>>   {
>> @@ -1032,7 +1030,7 @@ void drm_sched_job_cleanup(struct drm_sched_job
>> *job)
>>   		/* drm_sched_job_arm() has been called */
>>   		dma_fence_put(&job->s_fence->finished);
>>   	} else {
>> -		/* aborted job before arming */
>> +		/* aborted job before committing to run it */
>>   		drm_sched_fence_free(job->s_fence);
>>   	}
>>   
>
Philipp Stanner March 10, 2025, 12:37 p.m. UTC | #7
On Mon, 2025-03-10 at 13:25 +0100, Christian König wrote:
> Am 10.03.25 um 13:11 schrieb Philipp Stanner:
> > On Mon, 2025-03-10 at 08:44 +0100, Christian König wrote:
> > > This reverts commit 44d2f310f008613c1dbe5e234c2cf2be90cbbfab.
> > OK, your arguments with fence ordering are strong. Please update
> > the
> > commit message according to our discussion:
> > 
> > > Sorry for the delayed response, I only stumbled over this now
> > > while
> > > going
> > > over old mails and then re-thinking my reviewed by for this
> > > change.
> > Your RB hadn't even been applied (I merged before you gave it), so
> > you
> > can remove this first paragraph from the commit message
> > 
> > > The function drm_sched_job_arm() is indeed the point of no
> > > return.
> > > The
> > > background is that it is nearly impossible for the driver to
> > > correctly
> > > retract the fence and signal it in the order enforced by the
> > > dma_fence
> > > framework.
> > > 
> > > The code in drm_sched_job_cleanup() is for the purpose to cleanup
> > > after
> > > the job was armed through drm_sched_job_arm() *and* processed by
> > > the
> > > scheduler.
> > > 
> > > The correct approach for error handling in this situation is to
> > > set
> > > the
> > > error on the fences and then push to the entity anyway. We can
> > > certainly
> > > improve the documentation, but removing the warning is clearly
> > > not a
> > > good
> > > idea.
> > This last paragraph, as per our discussion, seems invalid. We
> > shouldn't
> > have that in the commit log, so that it won't give later hackers
> > browsing it wrong ideas and we end up with someone actually
> > mengling
> > with those fences.
> 
> Sure, going to make those updates. I just wanted to give people a
> possible direction to look into when they really run into a situation
> where they need to abort some submission very late.
> 
> Should I also clarify the comment in drm_sched_job_cleanup()?

Shouldn't be done in a revert-commit I think.

I'd say we revert for now (it's not a big loss, just a bit of docu) and
then you (or maybe I) submit a new patch where we discuss Tvrtko's
questions and then write down the rules and issues precisely in both
commit message and the function's comments.


P.

> 
> Regards,
> Christian.
> 
> > 
> > Thx
> > P.
> > 
> > > Signed-off-by: Christian König <christian.koenig@amd.com>
> > > ---
> > >  drivers/gpu/drm/scheduler/sched_main.c | 12 +++++-------
> > >  1 file changed, 5 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > index 53e6aec37b46..4d4219fbe49d 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > @@ -1015,13 +1015,11 @@
> > > EXPORT_SYMBOL(drm_sched_job_has_dependency);
> > >   * Cleans up the resources allocated with drm_sched_job_init().
> > >   *
> > >   * Drivers should call this from their error unwind code if @job
> > > is
> > > aborted
> > > - * before it was submitted to an entity with
> > > drm_sched_entity_push_job().
> > > + * before drm_sched_job_arm() is called.
> > >   *
> > > - * Since calling drm_sched_job_arm() causes the job's fences to
> > > be
> > > initialized,
> > > - * it is up to the driver to ensure that fences that were
> > > exposed to
> > > external
> > > - * parties get signaled. drm_sched_job_cleanup() does not ensure
> > > this.
> > > - *
> > > - * This function must also be called in &struct
> > > drm_sched_backend_ops.free_job
> > > + * After that point of no return @job is committed to be
> > > executed by
> > > the
> > > + * scheduler, and this function should be called from the
> > > + * &drm_sched_backend_ops.free_job callback.
> > >   */
> > >  void drm_sched_job_cleanup(struct drm_sched_job *job)
> > >  {
> > > @@ -1032,7 +1030,7 @@ void drm_sched_job_cleanup(struct
> > > drm_sched_job
> > > *job)
> > >  		/* drm_sched_job_arm() has been called */
> > >  		dma_fence_put(&job->s_fence->finished);
> > >  	} else {
> > > -		/* aborted job before arming */
> > > +		/* aborted job before committing to run it */
> > >  		drm_sched_fence_free(job->s_fence);
> > >  	}
> > >  
>
Christian König March 10, 2025, 2:28 p.m. UTC | #8
Am 10.03.25 um 13:27 schrieb Tvrtko Ursulin:
>
> On 10/03/2025 12:11, Philipp Stanner wrote:
>> On Mon, 2025-03-10 at 08:44 +0100, Christian König wrote:
>>> This reverts commit 44d2f310f008613c1dbe5e234c2cf2be90cbbfab.
>>
>> OK, your arguments with fence ordering are strong. Please update the
>> commit message according to our discussion:
>
> Could that argument please be explained in more concrete terms?
>
> Are we talking here about skipping one seqno has potential to cause a problem, or there is more to it?
>
> Because if it is just skipping I don't immediately see that breaks the monotonic/unique seqno ordering.
>
> Only if we are worried about some code somewhere making assumptions  "if N got completed, that means N-1 got completed too". That generally isn't anything new and can happen with GPU resets, albeit in the latter case the fence error is I think always set.

Exactly that is highly problematic.

In a case of a reset and all pending work canceled it doesn't matter if fences are signaled A,B,C or C,B,A.

But when you can make fence C signal while A is still running it can be that we start to cleanup the VM and free memory etc.. while the shaders from A are still able to access resources.

That's a security hole you can push an elephant through. Virtual memory on GPUs mitigates that on modern hardware quite a bit, but we still have a bunch of use cases which rely on getting this right.

Regards,
Christian.

>
> Regards,
>
> Tvrtko
>
>>> Sorry for the delayed response, I only stumbled over this now while
>>> going
>>> over old mails and then re-thinking my reviewed by for this change.
>>
>> Your RB hadn't even been applied (I merged before you gave it), so you
>> can remove this first paragraph from the commit message
>>
>>>
>>> The function drm_sched_job_arm() is indeed the point of no return.
>>> The
>>> background is that it is nearly impossible for the driver to
>>> correctly
>>> retract the fence and signal it in the order enforced by the
>>> dma_fence
>>> framework.
>>>
>>> The code in drm_sched_job_cleanup() is for the purpose to cleanup
>>> after
>>> the job was armed through drm_sched_job_arm() *and* processed by the
>>> scheduler.
>>>
>>> The correct approach for error handling in this situation is to set
>>> the
>>> error on the fences and then push to the entity anyway. We can
>>> certainly
>>> improve the documentation, but removing the warning is clearly not a
>>> good
>>> idea.
>>
>> This last paragraph, as per our discussion, seems invalid. We shouldn't
>> have that in the commit log, so that it won't give later hackers
>> browsing it wrong ideas and we end up with someone actually mengling
>> with those fences.
>>
>> Thx
>> P.
>>
>>>
>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>> ---
>>>   drivers/gpu/drm/scheduler/sched_main.c | 12 +++++-------
>>>   1 file changed, 5 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 53e6aec37b46..4d4219fbe49d 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -1015,13 +1015,11 @@ EXPORT_SYMBOL(drm_sched_job_has_dependency);
>>>    * Cleans up the resources allocated with drm_sched_job_init().
>>>    *
>>>    * Drivers should call this from their error unwind code if @job is
>>> aborted
>>> - * before it was submitted to an entity with
>>> drm_sched_entity_push_job().
>>> + * before drm_sched_job_arm() is called.
>>>    *
>>> - * Since calling drm_sched_job_arm() causes the job's fences to be
>>> initialized,
>>> - * it is up to the driver to ensure that fences that were exposed to
>>> external
>>> - * parties get signaled. drm_sched_job_cleanup() does not ensure
>>> this.
>>> - *
>>> - * This function must also be called in &struct
>>> drm_sched_backend_ops.free_job
>>> + * After that point of no return @job is committed to be executed by
>>> the
>>> + * scheduler, and this function should be called from the
>>> + * &drm_sched_backend_ops.free_job callback.
>>>    */
>>>   void drm_sched_job_cleanup(struct drm_sched_job *job)
>>>   {
>>> @@ -1032,7 +1030,7 @@ void drm_sched_job_cleanup(struct drm_sched_job
>>> *job)
>>>           /* drm_sched_job_arm() has been called */
>>>           dma_fence_put(&job->s_fence->finished);
>>>       } else {
>>> -        /* aborted job before arming */
>>> +        /* aborted job before committing to run it */
>>>           drm_sched_fence_free(job->s_fence);
>>>       }
>>>   
>>
>
diff mbox series

Patch

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 53e6aec37b46..4d4219fbe49d 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1015,13 +1015,11 @@  EXPORT_SYMBOL(drm_sched_job_has_dependency);
  * Cleans up the resources allocated with drm_sched_job_init().
  *
  * Drivers should call this from their error unwind code if @job is aborted
- * before it was submitted to an entity with drm_sched_entity_push_job().
+ * before drm_sched_job_arm() is called.
  *
- * Since calling drm_sched_job_arm() causes the job's fences to be initialized,
- * it is up to the driver to ensure that fences that were exposed to external
- * parties get signaled. drm_sched_job_cleanup() does not ensure this.
- *
- * This function must also be called in &struct drm_sched_backend_ops.free_job
+ * After that point of no return @job is committed to be executed by the
+ * scheduler, and this function should be called from the
+ * &drm_sched_backend_ops.free_job callback.
  */
 void drm_sched_job_cleanup(struct drm_sched_job *job)
 {
@@ -1032,7 +1030,7 @@  void drm_sched_job_cleanup(struct drm_sched_job *job)
 		/* drm_sched_job_arm() has been called */
 		dma_fence_put(&job->s_fence->finished);
 	} else {
-		/* aborted job before arming */
+		/* aborted job before committing to run it */
 		drm_sched_fence_free(job->s_fence);
 	}