diff mbox series

[1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini

Message ID 20240918133956.26557-1-christian.koenig@amd.com (mailing list archive)
State New, archived
Headers show
Series [1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini | expand

Commit Message

Christian König Sept. 18, 2024, 1:39 p.m. UTC
Tearing down the scheduler with jobs still on the pending list can
lead to use after free issues. Add a warning if drivers try to
destroy a scheduler which still has work pushed to the HW.

When there are still entities with jobs the situation is even worse
since the dma_fences for those jobs can never signal we can just
choose between potentially locking up core memory management and
random memory corruption. When drivers really mess it up that well
let them run into a BUG_ON().

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

Comments

Jani Nikula Sept. 18, 2024, 2:41 p.m. UTC | #1
On Wed, 18 Sep 2024, "Christian König" <ckoenig.leichtzumerken@gmail.com> wrote:
> Tearing down the scheduler with jobs still on the pending list can
> lead to use after free issues. Add a warning if drivers try to
> destroy a scheduler which still has work pushed to the HW.
>
> When there are still entities with jobs the situation is even worse
> since the dma_fences for those jobs can never signal we can just
> choose between potentially locking up core memory management and
> random memory corruption. When drivers really mess it up that well
> let them run into a BUG_ON().
>
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index f093616fe53c..8a46fab5cdc8 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
>  
>  	drm_sched_wqueue_stop(sched);
>  
> +	/*
> +	 * Tearing down the scheduler wile there are still unprocessed jobs can
> +	 * lead to use after free issues in the scheduler fence.
> +	 */
> +	WARN_ON(!list_empty(&sched->pending_list));

drm_WARN_ON(sched->dev, ...) would identify the device, which I presume
would be helpful in multi-GPU systems.

BR,
Jani.

> +
>  	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
>  		struct drm_sched_rq *rq = sched->sched_rq[i];
>  
>  		spin_lock(&rq->lock);
> -		list_for_each_entry(s_entity, &rq->entities, list)
> +		list_for_each_entry(s_entity, &rq->entities, list) {
> +			/*
> +			 * The justification for this BUG_ON() is that tearing
> +			 * down the scheduler while jobs are pending leaves
> +			 * dma_fences unsignaled. Since we have dependencies
> +			 * from the core memory management to eventually signal
> +			 * dma_fences this can trivially lead to a system wide
> +			 * stop because of a locked up memory management.
> +			 */
> +			BUG_ON(spsc_queue_count(&s_entity->job_queue));
> +
>  			/*
>  			 * Prevents reinsertion and marks job_queue as idle,
>  			 * it will removed from rq in drm_sched_entity_fini
>  			 * eventually
>  			 */
>  			s_entity->stopped = true;
> +		}
>  		spin_unlock(&rq->lock);
>  		kfree(sched->sched_rq[i]);
>  	}
Christian König Sept. 18, 2024, 2:56 p.m. UTC | #2
Am 18.09.24 um 16:41 schrieb Jani Nikula:
> On Wed, 18 Sep 2024, "Christian König" <ckoenig.leichtzumerken@gmail.com> wrote:
>> Tearing down the scheduler with jobs still on the pending list can
>> lead to use after free issues. Add a warning if drivers try to
>> destroy a scheduler which still has work pushed to the HW.
>>
>> When there are still entities with jobs the situation is even worse
>> since the dma_fences for those jobs can never signal we can just
>> choose between potentially locking up core memory management and
>> random memory corruption. When drivers really mess it up that well
>> let them run into a BUG_ON().
>>
>> Signed-off-by: Christian König <christian.koenig@amd.com>
>> ---
>>   drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++++-
>>   1 file changed, 18 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index f093616fe53c..8a46fab5cdc8 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>   
>>   	drm_sched_wqueue_stop(sched);
>>   
>> +	/*
>> +	 * Tearing down the scheduler wile there are still unprocessed jobs can
>> +	 * lead to use after free issues in the scheduler fence.
>> +	 */
>> +	WARN_ON(!list_empty(&sched->pending_list));
> drm_WARN_ON(sched->dev, ...) would identify the device, which I presume
> would be helpful in multi-GPU systems.

Good point, going to fix that.

Regards,
Christian.

>
> BR,
> Jani.
>
>> +
>>   	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
>>   		struct drm_sched_rq *rq = sched->sched_rq[i];
>>   
>>   		spin_lock(&rq->lock);
>> -		list_for_each_entry(s_entity, &rq->entities, list)
>> +		list_for_each_entry(s_entity, &rq->entities, list) {
>> +			/*
>> +			 * The justification for this BUG_ON() is that tearing
>> +			 * down the scheduler while jobs are pending leaves
>> +			 * dma_fences unsignaled. Since we have dependencies
>> +			 * from the core memory management to eventually signal
>> +			 * dma_fences this can trivially lead to a system wide
>> +			 * stop because of a locked up memory management.
>> +			 */
>> +			BUG_ON(spsc_queue_count(&s_entity->job_queue));
>> +
>>   			/*
>>   			 * Prevents reinsertion and marks job_queue as idle,
>>   			 * it will removed from rq in drm_sched_entity_fini
>>   			 * eventually
>>   			 */
>>   			s_entity->stopped = true;
>> +		}
>>   		spin_unlock(&rq->lock);
>>   		kfree(sched->sched_rq[i]);
>>   	}
Philipp Stanner Sept. 20, 2024, 8:57 a.m. UTC | #3
On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote:
> Tearing down the scheduler with jobs still on the pending list can
> lead to use after free issues. Add a warning if drivers try to
> destroy a scheduler which still has work pushed to the HW.

Did you have time yet to look into my proposed waitque-solution?

> 
> When there are still entities with jobs the situation is even worse
> since the dma_fences for those jobs can never signal we can just
> choose between potentially locking up core memory management and
> random memory corruption. When drivers really mess it up that well
> let them run into a BUG_ON().
> 
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> b/drivers/gpu/drm/scheduler/sched_main.c
> index f093616fe53c..8a46fab5cdc8 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct drm_gpu_scheduler
> *sched)

I agree with Sima that it should first be documented in the function's
docstring what the user is expected to have done before calling the
function.

P.

>  
>  	drm_sched_wqueue_stop(sched);
>  
> +	/*
> +	 * Tearing down the scheduler wile there are still
> unprocessed jobs can
> +	 * lead to use after free issues in the scheduler fence.
> +	 */
> +	WARN_ON(!list_empty(&sched->pending_list));
> +
>  	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++)
> {
>  		struct drm_sched_rq *rq = sched->sched_rq[i];
>  
>  		spin_lock(&rq->lock);
> -		list_for_each_entry(s_entity, &rq->entities, list)
> +		list_for_each_entry(s_entity, &rq->entities, list) {
> +			/*
> +			 * The justification for this BUG_ON() is
> that tearing
> +			 * down the scheduler while jobs are pending
> leaves
> +			 * dma_fences unsignaled. Since we have
> dependencies
> +			 * from the core memory management to
> eventually signal
> +			 * dma_fences this can trivially lead to a
> system wide
> +			 * stop because of a locked up memory
> management.
> +			 */
> +			BUG_ON(spsc_queue_count(&s_entity-
> >job_queue));
> +
>  			/*
>  			 * Prevents reinsertion and marks job_queue
> as idle,
>  			 * it will removed from rq in
> drm_sched_entity_fini
>  			 * eventually
>  			 */
>  			s_entity->stopped = true;
> +		}
>  		spin_unlock(&rq->lock);
>  		kfree(sched->sched_rq[i]);
>  	}
Christian König Sept. 20, 2024, 10:33 a.m. UTC | #4
Am 20.09.24 um 10:57 schrieb Philipp Stanner:
> On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote:
>> Tearing down the scheduler with jobs still on the pending list can
>> lead to use after free issues. Add a warning if drivers try to
>> destroy a scheduler which still has work pushed to the HW.
> Did you have time yet to look into my proposed waitque-solution?

I don't remember seeing anything. What have I missed?

>
>> When there are still entities with jobs the situation is even worse
>> since the dma_fences for those jobs can never signal we can just
>> choose between potentially locking up core memory management and
>> random memory corruption. When drivers really mess it up that well
>> let them run into a BUG_ON().
>>
>> Signed-off-by: Christian König <christian.koenig@amd.com>
>> ---
>>   drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++++-
>>   1 file changed, 18 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>> b/drivers/gpu/drm/scheduler/sched_main.c
>> index f093616fe53c..8a46fab5cdc8 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct drm_gpu_scheduler
>> *sched)
> I agree with Sima that it should first be documented in the function's
> docstring what the user is expected to have done before calling the
> function.

Good point, going to update the documentation as well.

Thanks,
Christian.

>
> P.
>
>>   
>>   	drm_sched_wqueue_stop(sched);
>>   
>> +	/*
>> +	 * Tearing down the scheduler wile there are still
>> unprocessed jobs can
>> +	 * lead to use after free issues in the scheduler fence.
>> +	 */
>> +	WARN_ON(!list_empty(&sched->pending_list));
>> +
>>   	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++)
>> {
>>   		struct drm_sched_rq *rq = sched->sched_rq[i];
>>   
>>   		spin_lock(&rq->lock);
>> -		list_for_each_entry(s_entity, &rq->entities, list)
>> +		list_for_each_entry(s_entity, &rq->entities, list) {
>> +			/*
>> +			 * The justification for this BUG_ON() is
>> that tearing
>> +			 * down the scheduler while jobs are pending
>> leaves
>> +			 * dma_fences unsignaled. Since we have
>> dependencies
>> +			 * from the core memory management to
>> eventually signal
>> +			 * dma_fences this can trivially lead to a
>> system wide
>> +			 * stop because of a locked up memory
>> management.
>> +			 */
>> +			BUG_ON(spsc_queue_count(&s_entity-
>>> job_queue));
>> +
>>   			/*
>>   			 * Prevents reinsertion and marks job_queue
>> as idle,
>>   			 * it will removed from rq in
>> drm_sched_entity_fini
>>   			 * eventually
>>   			 */
>>   			s_entity->stopped = true;
>> +		}
>>   		spin_unlock(&rq->lock);
>>   		kfree(sched->sched_rq[i]);
>>   	}
Philipp Stanner Sept. 20, 2024, 1:26 p.m. UTC | #5
On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:
> Am 20.09.24 um 10:57 schrieb Philipp Stanner:
> > On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote:
> > > Tearing down the scheduler with jobs still on the pending list
> > > can
> > > lead to use after free issues. Add a warning if drivers try to
> > > destroy a scheduler which still has work pushed to the HW.
> > Did you have time yet to look into my proposed waitque-solution?
> 
> I don't remember seeing anything. What have I missed?

https://lore.kernel.org/all/20240903094446.29797-2-pstanner@redhat.com/

> 
> > 
> > > When there are still entities with jobs the situation is even
> > > worse
> > > since the dma_fences for those jobs can never signal we can just
> > > choose between potentially locking up core memory management and
> > > random memory corruption. When drivers really mess it up that
> > > well
> > > let them run into a BUG_ON().
> > > 
> > > Signed-off-by: Christian König <christian.koenig@amd.com>
> > > ---
> > >   drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++++-
> > >   1 file changed, 18 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > index f093616fe53c..8a46fab5cdc8 100644
> > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct
> > > drm_gpu_scheduler
> > > *sched)
> > I agree with Sima that it should first be documented in the
> > function's
> > docstring what the user is expected to have done before calling the
> > function.
> 
> Good point, going to update the documentation as well.

Cool thing, thx.
Would be great if everything (not totally trivial) necessary to be done
before _fini() is mentioned.

One could also think about providing a hint at how the driver can do
that. AFAICS the only way for the driver to ensure that is to maintain
its own, separate list of submitted jobs.

P.

> 
> Thanks,
> Christian.
> 
> > 
> > P.
> > 
> > >   
> > >   	drm_sched_wqueue_stop(sched);
> > >   
> > > +	/*
> > > +	 * Tearing down the scheduler wile there are still
> > > unprocessed jobs can
> > > +	 * lead to use after free issues in the scheduler fence.
> > > +	 */
> > > +	WARN_ON(!list_empty(&sched->pending_list));
> > > +
> > >   	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs;
> > > i++)
> > > {
> > >   		struct drm_sched_rq *rq = sched->sched_rq[i];
> > >   
> > >   		spin_lock(&rq->lock);
> > > -		list_for_each_entry(s_entity, &rq->entities,
> > > list)
> > > +		list_for_each_entry(s_entity, &rq->entities,
> > > list) {
> > > +			/*
> > > +			 * The justification for this BUG_ON()
> > > is
> > > that tearing
> > > +			 * down the scheduler while jobs are
> > > pending
> > > leaves
> > > +			 * dma_fences unsignaled. Since we have
> > > dependencies
> > > +			 * from the core memory management to
> > > eventually signal
> > > +			 * dma_fences this can trivially lead to
> > > a
> > > system wide
> > > +			 * stop because of a locked up memory
> > > management.
> > > +			 */
> > > +			BUG_ON(spsc_queue_count(&s_entity-
> > > > job_queue));
> > > +
> > >   			/*
> > >   			 * Prevents reinsertion and marks
> > > job_queue
> > > as idle,
> > >   			 * it will removed from rq in
> > > drm_sched_entity_fini
> > >   			 * eventually
> > >   			 */
> > >   			s_entity->stopped = true;
> > > +		}
> > >   		spin_unlock(&rq->lock);
> > >   		kfree(sched->sched_rq[i]);
> > >   	}
>
Christian König Sept. 23, 2024, 3:24 p.m. UTC | #6
Am 20.09.24 um 15:26 schrieb Philipp Stanner:
> On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:
>> Am 20.09.24 um 10:57 schrieb Philipp Stanner:
>>> On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote:
>>>> Tearing down the scheduler with jobs still on the pending list
>>>> can
>>>> lead to use after free issues. Add a warning if drivers try to
>>>> destroy a scheduler which still has work pushed to the HW.
>>> Did you have time yet to look into my proposed waitque-solution?
>> I don't remember seeing anything. What have I missed?
> https://lore.kernel.org/all/20240903094446.29797-2-pstanner@redhat.com/

Mhm, I didn't got that in my inbox for some reason.

Interesting approach, I'm just not sure if we can or should wait in 
drm_sched_fini().

Probably better to make that a separate function, something like 
drm_sched_flush() or similar.

>>>> When there are still entities with jobs the situation is even
>>>> worse
>>>> since the dma_fences for those jobs can never signal we can just
>>>> choose between potentially locking up core memory management and
>>>> random memory corruption. When drivers really mess it up that
>>>> well
>>>> let them run into a BUG_ON().
>>>>
>>>> Signed-off-by: Christian König<christian.koenig@amd.com>
>>>> ---
>>>>    drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++++-
>>>>    1 file changed, 18 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>> index f093616fe53c..8a46fab5cdc8 100644
>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>> @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct
>>>> drm_gpu_scheduler
>>>> *sched)
>>> I agree with Sima that it should first be documented in the
>>> function's
>>> docstring what the user is expected to have done before calling the
>>> function.
>> Good point, going to update the documentation as well.
> Cool thing, thx.
> Would be great if everything (not totally trivial) necessary to be done
> before _fini() is mentioned.
>
> One could also think about providing a hint at how the driver can do
> that. AFAICS the only way for the driver to ensure that is to maintain
> its own, separate list of submitted jobs.

Even with a duplicated pending list it's actually currently impossible 
to do this fully cleanly.

The problem is that the dma_fence object gives no guarantee when 
callbacks are processed, e.g. they can be both processed from interrupt 
context as well as from a CPU which called dma_fence_is_signaled().

So when a driver (or drm_sched_fini) waits for the last submitted fence 
it actually can be that the drm_sched object still needs to do some 
processing. See the hack in amdgpu_vm_tlb_seq() for more background on 
the problem.

Regards,
Christian.

>
> P.
>
>> Thanks,
>> Christian.
>>
>>> P.
>>>
>>>>    
>>>>    	drm_sched_wqueue_stop(sched);
>>>>    
>>>> +	/*
>>>> +	 * Tearing down the scheduler wile there are still
>>>> unprocessed jobs can
>>>> +	 * lead to use after free issues in the scheduler fence.
>>>> +	 */
>>>> +	WARN_ON(!list_empty(&sched->pending_list));
>>>> +
>>>>    	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs;
>>>> i++)
>>>> {
>>>>    		struct drm_sched_rq *rq = sched->sched_rq[i];
>>>>    
>>>>    		spin_lock(&rq->lock);
>>>> -		list_for_each_entry(s_entity, &rq->entities,
>>>> list)
>>>> +		list_for_each_entry(s_entity, &rq->entities,
>>>> list) {
>>>> +			/*
>>>> +			 * The justification for this BUG_ON()
>>>> is
>>>> that tearing
>>>> +			 * down the scheduler while jobs are
>>>> pending
>>>> leaves
>>>> +			 * dma_fences unsignaled. Since we have
>>>> dependencies
>>>> +			 * from the core memory management to
>>>> eventually signal
>>>> +			 * dma_fences this can trivially lead to
>>>> a
>>>> system wide
>>>> +			 * stop because of a locked up memory
>>>> management.
>>>> +			 */
>>>> +			BUG_ON(spsc_queue_count(&s_entity-
>>>>> job_queue));
>>>> +
>>>>    			/*
>>>>    			 * Prevents reinsertion and marks
>>>> job_queue
>>>> as idle,
>>>>    			 * it will removed from rq in
>>>> drm_sched_entity_fini
>>>>    			 * eventually
>>>>    			 */
>>>>    			s_entity->stopped = true;
>>>> +		}
>>>>    		spin_unlock(&rq->lock);
>>>>    		kfree(sched->sched_rq[i]);
>>>>    	}
Simona Vetter Sept. 24, 2024, 11:18 a.m. UTC | #7
On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König wrote:
> Am 20.09.24 um 15:26 schrieb Philipp Stanner:
> > On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:
> > > Am 20.09.24 um 10:57 schrieb Philipp Stanner:
> > > > On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote:
> > > > > Tearing down the scheduler with jobs still on the pending list
> > > > > can
> > > > > lead to use after free issues. Add a warning if drivers try to
> > > > > destroy a scheduler which still has work pushed to the HW.
> > > > Did you have time yet to look into my proposed waitque-solution?
> > > I don't remember seeing anything. What have I missed?
> > https://lore.kernel.org/all/20240903094446.29797-2-pstanner@redhat.com/
> 
> Mhm, I didn't got that in my inbox for some reason.
> 
> Interesting approach, I'm just not sure if we can or should wait in
> drm_sched_fini().
> 
> Probably better to make that a separate function, something like
> drm_sched_flush() or similar.

Yeah I don't think we should smash this into drm_sched_fini
unconditionally. I think conceptually there's about three cases:

- Ringbuffer schedules. Probably want everything as-is, because
  drm_sched_fini is called long after all the entities are gone in
  drm_device cleanup.

- fw scheduler hardware with preemption support. There we probably want to
  nuke the context by setting the tdr timeout to zero (or maybe just as
  long as context preemption takes to be efficient), and relying on the
  normal gpu reset flow to handle things. drm_sched_entity_flush kinda
  does this, except not really and it's a lot more focused on the
  ringbuffer context. So maybe we want a new drm_sched_entity_kill.

  For this case calling drm_sched_fini() after the 1:1 entity is gone
  should not find any linger jobs, it would actually be a bug somewhere if
  there's a job lingering. Maybe a sanity check that there's not just no
  jobs lingering, but also no entity left would be good here?

- fw scheduler without preemption support. There we kinda need the
  drm_sched_flush, except blocking in fops->close is not great. So instead
  I think the following is better:
  1. drm_sched_entity_stopped, which only stops new submissions (for
  paranoia) but doesn't tear down the entity
  2. drm_dev_get
  3. launch a worker which does a) drm_sched_flush (or
  drm_sched_entity_flush or whatever we call it) b) drm_sched_entity_fini
  + drm_sched_fini c) drm_dev_put

  Note that semantically this implements the refcount in the other path
  from Phillip:

  https://lore.kernel.org/all/20240903094531.29893-2-pstanner@redhat.com/
  
  Except it doesn't impose refcount on everyone else who doesn't need it,
  and it doesn't even impose refcounting on drivers that do need it
  because we use drm_sched_flush and a worker to achieve the same.

Essentially helper functions for the common use-cases instead of trying to
solve them all by putting drm_sched_flush as a potentially very blocking
function into drm_sched_fini.


> > > > > When there are still entities with jobs the situation is even
> > > > > worse
> > > > > since the dma_fences for those jobs can never signal we can just
> > > > > choose between potentially locking up core memory management and
> > > > > random memory corruption. When drivers really mess it up that
> > > > > well
> > > > > let them run into a BUG_ON().
> > > > > 
> > > > > Signed-off-by: Christian König<christian.koenig@amd.com>
> > > > > ---
> > > > >    drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++++-
> > > > >    1 file changed, 18 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > index f093616fe53c..8a46fab5cdc8 100644
> > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct
> > > > > drm_gpu_scheduler
> > > > > *sched)
> > > > I agree with Sima that it should first be documented in the
> > > > function's
> > > > docstring what the user is expected to have done before calling the
> > > > function.
> > > Good point, going to update the documentation as well.
> > Cool thing, thx.
> > Would be great if everything (not totally trivial) necessary to be done
> > before _fini() is mentioned.
> > 
> > One could also think about providing a hint at how the driver can do
> > that. AFAICS the only way for the driver to ensure that is to maintain
> > its own, separate list of submitted jobs.
> 
> Even with a duplicated pending list it's actually currently impossible to do
> this fully cleanly.
> 
> The problem is that the dma_fence object gives no guarantee when callbacks
> are processed, e.g. they can be both processed from interrupt context as
> well as from a CPU which called dma_fence_is_signaled().
> 
> So when a driver (or drm_sched_fini) waits for the last submitted fence it
> actually can be that the drm_sched object still needs to do some processing.
> See the hack in amdgpu_vm_tlb_seq() for more background on the problem.

So I thought this should be fairly easy because of the sched hw/public
fence split: If we wait for both all jobs to finish and for all the sched
work/tdr work to finish, and we make sure there's no entity existing
that's not yet stopped we should catch them all? Or at least I think it's
a bug if any other code even tries to touch the hw fence.

If you have any other driver code which relies on the rcu freeing then I
think that's just a separate concern and we can ignore that here since the
fences themselves will till get rcu-delay freed even if drm_sched_fini has
finished.
-Sima

> 
> Regards,
> Christian.
> 
> > 
> > P.
> > 
> > > Thanks,
> > > Christian.
> > > 
> > > > P.
> > > > 
> > > > >    	drm_sched_wqueue_stop(sched);
> > > > > +	/*
> > > > > +	 * Tearing down the scheduler wile there are still
> > > > > unprocessed jobs can
> > > > > +	 * lead to use after free issues in the scheduler fence.
> > > > > +	 */
> > > > > +	WARN_ON(!list_empty(&sched->pending_list));
> > > > > +
> > > > >    	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs;
> > > > > i++)
> > > > > {
> > > > >    		struct drm_sched_rq *rq = sched->sched_rq[i];
> > > > >    		spin_lock(&rq->lock);
> > > > > -		list_for_each_entry(s_entity, &rq->entities,
> > > > > list)
> > > > > +		list_for_each_entry(s_entity, &rq->entities,
> > > > > list) {
> > > > > +			/*
> > > > > +			 * The justification for this BUG_ON()
> > > > > is
> > > > > that tearing
> > > > > +			 * down the scheduler while jobs are
> > > > > pending
> > > > > leaves
> > > > > +			 * dma_fences unsignaled. Since we have
> > > > > dependencies
> > > > > +			 * from the core memory management to
> > > > > eventually signal
> > > > > +			 * dma_fences this can trivially lead to
> > > > > a
> > > > > system wide
> > > > > +			 * stop because of a locked up memory
> > > > > management.
> > > > > +			 */
> > > > > +			BUG_ON(spsc_queue_count(&s_entity-
> > > > > > job_queue));
> > > > > +
> > > > >    			/*
> > > > >    			 * Prevents reinsertion and marks
> > > > > job_queue
> > > > > as idle,
> > > > >    			 * it will removed from rq in
> > > > > drm_sched_entity_fini
> > > > >    			 * eventually
> > > > >    			 */
> > > > >    			s_entity->stopped = true;
> > > > > +		}
> > > > >    		spin_unlock(&rq->lock);
> > > > >    		kfree(sched->sched_rq[i]);
> > > > >    	}
Philipp Stanner Sept. 25, 2024, 2:53 p.m. UTC | #8
On Tue, 2024-09-24 at 13:18 +0200, Simona Vetter wrote:
> On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König wrote:
> > Am 20.09.24 um 15:26 schrieb Philipp Stanner:
> > > On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:
> > > > Am 20.09.24 um 10:57 schrieb Philipp Stanner:
> > > > > On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote:
> > > > > > Tearing down the scheduler with jobs still on the pending
> > > > > > list
> > > > > > can
> > > > > > lead to use after free issues. Add a warning if drivers try
> > > > > > to
> > > > > > destroy a scheduler which still has work pushed to the HW.
> > > > > Did you have time yet to look into my proposed waitque-
> > > > > solution?
> > > > I don't remember seeing anything. What have I missed?
> > > https://lore.kernel.org/all/20240903094446.29797-2-pstanner@redhat.com/
> > 
> > Mhm, I didn't got that in my inbox for some reason.
> > 
> > Interesting approach, I'm just not sure if we can or should wait in
> > drm_sched_fini().

We do agree that jobs still pending when drm_sched_fini() starts is
always a bug, right?

If so, what are the disadvantages of waiting in drm_sched_fini()? We
could block buggy drivers as I see it. Which wouldn't be good, but
could then be fixed on drivers' site.

> > 
> > Probably better to make that a separate function, something like
> > drm_sched_flush() or similar.

We could do that. Such a function could then be called by drivers which
are not sure whether all jobs are done before they start tearing down.

> 
> Yeah I don't think we should smash this into drm_sched_fini
> unconditionally. I think conceptually there's about three cases:
> 
> - Ringbuffer schedules. Probably want everything as-is, because
>   drm_sched_fini is called long after all the entities are gone in
>   drm_device cleanup.
> 
> - fw scheduler hardware with preemption support. There we probably
> want to
>   nuke the context by setting the tdr timeout to zero (or maybe just
> as
>   long as context preemption takes to be efficient), and relying on
> the
>   normal gpu reset flow to handle things. drm_sched_entity_flush
> kinda
>   does this, except not really and it's a lot more focused on the
>   ringbuffer context. So maybe we want a new drm_sched_entity_kill.
> 
>   For this case calling drm_sched_fini() after the 1:1 entity is gone
>   should not find any linger jobs, it would actually be a bug
> somewhere if
>   there's a job lingering. Maybe a sanity check that there's not just
> no
>   jobs lingering, but also no entity left would be good here?

The check for lingering ones is in Christian's patch here IISC.
At which position would you imagine the check for the entity being
performed?

> 
> - fw scheduler without preemption support. There we kinda need the
>   drm_sched_flush, except blocking in fops->close is not great. So
> instead
>   I think the following is better:
>   1. drm_sched_entity_stopped, which only stops new submissions (for
>   paranoia) but doesn't tear down the entity

Who would call that function?
Drivers using it voluntarily could just as well stop accepting new jobs
from userspace to their entities, couldn't they?

>   2. drm_dev_get
>   3. launch a worker which does a) drm_sched_flush (or
>   drm_sched_entity_flush or whatever we call it) b)
> drm_sched_entity_fini
>   + drm_sched_fini c) drm_dev_put
> 
>   Note that semantically this implements the refcount in the other
> path
>   from Phillip:
> 
>  
> https://lore.kernel.org/all/20240903094531.29893-2-pstanner@redhat.com/
>   
>   Except it doesn't impose refcount on everyone else who doesn't need
> it,
>   and it doesn't even impose refcounting on drivers that do need it
>   because we use drm_sched_flush and a worker to achieve the same.

I indeed wasn't happy with the refcount approach for that reason,
agreed.

> 
> Essentially helper functions for the common use-cases instead of
> trying to
> solve them all by putting drm_sched_flush as a potentially very
> blocking
> function into drm_sched_fini.

I'm still not able to see why it blocking would be undesired – as far
as I can see, it is only invoked on driver teardown, so not during
active operation. Teardown doesn't happen that often, and it can (if
implemented correctly) only block until the driver's code has signaled
the last fences. If that doesn't happen, the block would reveal a bug.

But don't get me wrong: I don't want to *push* this solution. I just
want to understand when it could become a problem.


Wouldn't an explicitly blocking, separate function like
drm_sched_flush() or drm_sched_fini_flush() be a small, doable step
towards the right direction?


> 
> 
> > > > > > When there are still entities with jobs the situation is
> > > > > > even
> > > > > > worse
> > > > > > since the dma_fences for those jobs can never signal we can
> > > > > > just
> > > > > > choose between potentially locking up core memory
> > > > > > management and
> > > > > > random memory corruption. When drivers really mess it up
> > > > > > that
> > > > > > well
> > > > > > let them run into a BUG_ON().
> > > > > > 
> > > > > > Signed-off-by: Christian König<christian.koenig@amd.com>
> > > > > > ---
> > > > > >    drivers/gpu/drm/scheduler/sched_main.c | 19
> > > > > > ++++++++++++++++++-
> > > > > >    1 file changed, 18 insertions(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > index f093616fe53c..8a46fab5cdc8 100644
> > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct
> > > > > > drm_gpu_scheduler
> > > > > > *sched)
> > > > > I agree with Sima that it should first be documented in the
> > > > > function's
> > > > > docstring what the user is expected to have done before
> > > > > calling the
> > > > > function.
> > > > Good point, going to update the documentation as well.
> > > Cool thing, thx.
> > > Would be great if everything (not totally trivial) necessary to
> > > be done
> > > before _fini() is mentioned.
> > > 
> > > One could also think about providing a hint at how the driver can
> > > do
> > > that. AFAICS the only way for the driver to ensure that is to
> > > maintain
> > > its own, separate list of submitted jobs.
> > 
> > Even with a duplicated pending list it's actually currently
> > impossible to do
> > this fully cleanly.
> > 
> > The problem is that the dma_fence object gives no guarantee when
> > callbacks
> > are processed, e.g. they can be both processed from interrupt
> > context as
> > well as from a CPU which called dma_fence_is_signaled().
> > 
> > So when a driver (or drm_sched_fini) waits for the last submitted
> > fence it
> > actually can be that the drm_sched object still needs to do some
> > processing.
> > See the hack in amdgpu_vm_tlb_seq() for more background on the
> > problem.

Oh dear ^^'
We better work towards fixing that centrally

Thanks,
P.


> 
> So I thought this should be fairly easy because of the sched
> hw/public
> fence split: If we wait for both all jobs to finish and for all the
> sched
> work/tdr work to finish, and we make sure there's no entity existing
> that's not yet stopped we should catch them all? Or at least I think
> it's
> a bug if any other code even tries to touch the hw fence.
> 
> If you have any other driver code which relies on the rcu freeing
> then I
> think that's just a separate concern and we can ignore that here
> since the
> fences themselves will till get rcu-delay freed even if
> drm_sched_fini has
> finished.
> -Sima
> 
> > 
> > Regards,
> > Christian.
> > 
> > > 
> > > P.
> > > 
> > > > Thanks,
> > > > Christian.
> > > > 
> > > > > P.
> > > > > 
> > > > > >    	drm_sched_wqueue_stop(sched);
> > > > > > +	/*
> > > > > > +	 * Tearing down the scheduler wile there are still
> > > > > > unprocessed jobs can
> > > > > > +	 * lead to use after free issues in the scheduler
> > > > > > fence.
> > > > > > +	 */
> > > > > > +	WARN_ON(!list_empty(&sched->pending_list));
> > > > > > +
> > > > > >    	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched-
> > > > > > >num_rqs;
> > > > > > i++)
> > > > > > {
> > > > > >    		struct drm_sched_rq *rq = sched-
> > > > > > >sched_rq[i];
> > > > > >    		spin_lock(&rq->lock);
> > > > > > -		list_for_each_entry(s_entity, &rq-
> > > > > > >entities,
> > > > > > list)
> > > > > > +		list_for_each_entry(s_entity, &rq-
> > > > > > >entities,
> > > > > > list) {
> > > > > > +			/*
> > > > > > +			 * The justification for this
> > > > > > BUG_ON()
> > > > > > is
> > > > > > that tearing
> > > > > > +			 * down the scheduler while jobs
> > > > > > are
> > > > > > pending
> > > > > > leaves
> > > > > > +			 * dma_fences unsignaled. Since we
> > > > > > have
> > > > > > dependencies
> > > > > > +			 * from the core memory management
> > > > > > to
> > > > > > eventually signal
> > > > > > +			 * dma_fences this can trivially
> > > > > > lead to
> > > > > > a
> > > > > > system wide
> > > > > > +			 * stop because of a locked up
> > > > > > memory
> > > > > > management.
> > > > > > +			 */
> > > > > > +			BUG_ON(spsc_queue_count(&s_entity-
> > > > > > > job_queue));
> > > > > > +
> > > > > >    			/*
> > > > > >    			 * Prevents reinsertion and marks
> > > > > > job_queue
> > > > > > as idle,
> > > > > >    			 * it will removed from rq in
> > > > > > drm_sched_entity_fini
> > > > > >    			 * eventually
> > > > > >    			 */
> > > > > >    			s_entity->stopped = true;
> > > > > > +		}
> > > > > >    		spin_unlock(&rq->lock);
> > > > > >    		kfree(sched->sched_rq[i]);
> > > > > >    	}
>
Christian König Sept. 27, 2024, 9:04 a.m. UTC | #9
Am 25.09.24 um 16:53 schrieb Philipp Stanner:
> On Tue, 2024-09-24 at 13:18 +0200, Simona Vetter wrote:
>> On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König wrote:
>>> Am 20.09.24 um 15:26 schrieb Philipp Stanner:
>>>> On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:
>>>>> Am 20.09.24 um 10:57 schrieb Philipp Stanner:
>>>>>> On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote:
>>>>>>> Tearing down the scheduler with jobs still on the pending
>>>>>>> list
>>>>>>> can
>>>>>>> lead to use after free issues. Add a warning if drivers try
>>>>>>> to
>>>>>>> destroy a scheduler which still has work pushed to the HW.
>>>>>> Did you have time yet to look into my proposed waitque-
>>>>>> solution?
>>>>> I don't remember seeing anything. What have I missed?
>>>> https://lore.kernel.org/all/20240903094446.29797-2-pstanner@redhat.com/
>>> Mhm, I didn't got that in my inbox for some reason.
>>>
>>> Interesting approach, I'm just not sure if we can or should wait in
>>> drm_sched_fini().
> We do agree that jobs still pending when drm_sched_fini() starts is
> always a bug, right?

Correct, the question is how to avoid that.

> If so, what are the disadvantages of waiting in drm_sched_fini()? We
> could block buggy drivers as I see it. Which wouldn't be good, but
> could then be fixed on drivers' site.

Sima explained that pretty well: Don't block in fops->close, do that in 
fops->flush instead.

One issue this solves is that when you send a SIGTERM the tear down 
handling first flushes all the FDs and then closes them.

So if flushing the FDs blocks because the process initiated sending a 
terabyte of data over a 300bps line (for example) you can still throw a 
SIGKILL and abort that as well.

If you would block in fops-close() that SIGKILL won't have any effect 
any more because by the time close() is called the process is gone and 
signals are already blocked.

And yes when I learned about that issue I was also buffed that handling 
like this in the UNIX design is nearly 50 years old and still applies to 
today.
>>> Probably better to make that a separate function, something like
>>> drm_sched_flush() or similar.
> We could do that. Such a function could then be called by drivers which
> are not sure whether all jobs are done before they start tearing down.

Yes exactly that's the idea. And give that flush function a return code 
so that it can return -EINTR.

>> Yeah I don't think we should smash this into drm_sched_fini
>> unconditionally. I think conceptually there's about three cases:
>>
>> - Ringbuffer schedules. Probably want everything as-is, because
>>    drm_sched_fini is called long after all the entities are gone in
>>    drm_device cleanup.
>>
>> - fw scheduler hardware with preemption support. There we probably
>> want to
>>    nuke the context by setting the tdr timeout to zero (or maybe just
>> as
>>    long as context preemption takes to be efficient), and relying on
>> the
>>    normal gpu reset flow to handle things. drm_sched_entity_flush
>> kinda
>>    does this, except not really and it's a lot more focused on the
>>    ringbuffer context. So maybe we want a new drm_sched_entity_kill.
>>
>>    For this case calling drm_sched_fini() after the 1:1 entity is gone
>>    should not find any linger jobs, it would actually be a bug
>> somewhere if
>>    there's a job lingering. Maybe a sanity check that there's not just
>> no
>>    jobs lingering, but also no entity left would be good here?
> The check for lingering ones is in Christian's patch here IISC.
> At which position would you imagine the check for the entity being
> performed?
>
>> - fw scheduler without preemption support. There we kinda need the
>>    drm_sched_flush, except blocking in fops->close is not great. So
>> instead
>>    I think the following is better:
>>    1. drm_sched_entity_stopped, which only stops new submissions (for
>>    paranoia) but doesn't tear down the entity
> Who would call that function?
> Drivers using it voluntarily could just as well stop accepting new jobs
> from userspace to their entities, couldn't they?
>
>>    2. drm_dev_get
>>    3. launch a worker which does a) drm_sched_flush (or
>>    drm_sched_entity_flush or whatever we call it) b)
>> drm_sched_entity_fini
>>    + drm_sched_fini c) drm_dev_put
>>
>>    Note that semantically this implements the refcount in the other
>> path
>>    from Phillip:
>>
>>   
>> https://lore.kernel.org/all/20240903094531.29893-2-pstanner@redhat.com/
>>    
>>    Except it doesn't impose refcount on everyone else who doesn't need
>> it,
>>    and it doesn't even impose refcounting on drivers that do need it
>>    because we use drm_sched_flush and a worker to achieve the same.
> I indeed wasn't happy with the refcount approach for that reason,
> agreed.
>
>> Essentially helper functions for the common use-cases instead of
>> trying to
>> solve them all by putting drm_sched_flush as a potentially very
>> blocking
>> function into drm_sched_fini.
> I'm still not able to see why it blocking would be undesired – as far
> as I can see, it is only invoked on driver teardown, so not during
> active operation. Teardown doesn't happen that often, and it can (if
> implemented correctly) only block until the driver's code has signaled
> the last fences. If that doesn't happen, the block would reveal a bug.
>
> But don't get me wrong: I don't want to *push* this solution. I just
> want to understand when it could become a problem.
>
>
> Wouldn't an explicitly blocking, separate function like
> drm_sched_flush() or drm_sched_fini_flush() be a small, doable step
> towards the right direction?

I think that this is the right thing to do, yes.

>>>>>>> When there are still entities with jobs the situation is
>>>>>>> even
>>>>>>> worse
>>>>>>> since the dma_fences for those jobs can never signal we can
>>>>>>> just
>>>>>>> choose between potentially locking up core memory
>>>>>>> management and
>>>>>>> random memory corruption. When drivers really mess it up
>>>>>>> that
>>>>>>> well
>>>>>>> let them run into a BUG_ON().
>>>>>>>
>>>>>>> Signed-off-by: Christian König<christian.koenig@amd.com>
>>>>>>> ---
>>>>>>>     drivers/gpu/drm/scheduler/sched_main.c | 19
>>>>>>> ++++++++++++++++++-
>>>>>>>     1 file changed, 18 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>> index f093616fe53c..8a46fab5cdc8 100644
>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>> @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct
>>>>>>> drm_gpu_scheduler
>>>>>>> *sched)
>>>>>> I agree with Sima that it should first be documented in the
>>>>>> function's
>>>>>> docstring what the user is expected to have done before
>>>>>> calling the
>>>>>> function.
>>>>> Good point, going to update the documentation as well.
>>>> Cool thing, thx.
>>>> Would be great if everything (not totally trivial) necessary to
>>>> be done
>>>> before _fini() is mentioned.
>>>>
>>>> One could also think about providing a hint at how the driver can
>>>> do
>>>> that. AFAICS the only way for the driver to ensure that is to
>>>> maintain
>>>> its own, separate list of submitted jobs.
>>> Even with a duplicated pending list it's actually currently
>>> impossible to do
>>> this fully cleanly.
>>>
>>> The problem is that the dma_fence object gives no guarantee when
>>> callbacks
>>> are processed, e.g. they can be both processed from interrupt
>>> context as
>>> well as from a CPU which called dma_fence_is_signaled().
>>>
>>> So when a driver (or drm_sched_fini) waits for the last submitted
>>> fence it
>>> actually can be that the drm_sched object still needs to do some
>>> processing.
>>> See the hack in amdgpu_vm_tlb_seq() for more background on the
>>> problem.
> Oh dear ^^'
> We better work towards fixing that centrally
>
> Thanks,
> P.
>
>
>> So I thought this should be fairly easy because of the sched
>> hw/public
>> fence split: If we wait for both all jobs to finish and for all the
>> sched
>> work/tdr work to finish, and we make sure there's no entity existing
>> that's not yet stopped we should catch them all?

Unfortunately not.

Even when you do a dma_fence_wait() on the last submission it can still 
be that another CPU is executing the callbacks to wake up the scheduler 
work item and cleanup the job.

That's one of the reasons why I think the design of keeping the job 
alive is so extremely awkward. The dma_fence as representation of the hw 
submission has a much better defined state machine and lifetime.

Regards,
Christian.

>>   Or at least I think
>> it's
>> a bug if any other code even tries to touch the hw fence.
>>
>> If you have any other driver code which relies on the rcu freeing
>> then I
>> think that's just a separate concern and we can ignore that here
>> since the
>> fences themselves will till get rcu-delay freed even if
>> drm_sched_fini has
>> finished.
>> -Sima
>>
>>> Regards,
>>> Christian.
>>>
>>>> P.
>>>>
>>>>> Thanks,
>>>>> Christian.
>>>>>
>>>>>> P.
>>>>>>
>>>>>>>     	drm_sched_wqueue_stop(sched);
>>>>>>> +	/*
>>>>>>> +	 * Tearing down the scheduler wile there are still
>>>>>>> unprocessed jobs can
>>>>>>> +	 * lead to use after free issues in the scheduler
>>>>>>> fence.
>>>>>>> +	 */
>>>>>>> +	WARN_ON(!list_empty(&sched->pending_list));
>>>>>>> +
>>>>>>>     	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched-
>>>>>>>> num_rqs;
>>>>>>> i++)
>>>>>>> {
>>>>>>>     		struct drm_sched_rq *rq = sched-
>>>>>>>> sched_rq[i];
>>>>>>>     		spin_lock(&rq->lock);
>>>>>>> -		list_for_each_entry(s_entity, &rq-
>>>>>>>> entities,
>>>>>>> list)
>>>>>>> +		list_for_each_entry(s_entity, &rq-
>>>>>>>> entities,
>>>>>>> list) {
>>>>>>> +			/*
>>>>>>> +			 * The justification for this
>>>>>>> BUG_ON()
>>>>>>> is
>>>>>>> that tearing
>>>>>>> +			 * down the scheduler while jobs
>>>>>>> are
>>>>>>> pending
>>>>>>> leaves
>>>>>>> +			 * dma_fences unsignaled. Since we
>>>>>>> have
>>>>>>> dependencies
>>>>>>> +			 * from the core memory management
>>>>>>> to
>>>>>>> eventually signal
>>>>>>> +			 * dma_fences this can trivially
>>>>>>> lead to
>>>>>>> a
>>>>>>> system wide
>>>>>>> +			 * stop because of a locked up
>>>>>>> memory
>>>>>>> management.
>>>>>>> +			 */
>>>>>>> +			BUG_ON(spsc_queue_count(&s_entity-
>>>>>>>> job_queue));
>>>>>>> +
>>>>>>>     			/*
>>>>>>>     			 * Prevents reinsertion and marks
>>>>>>> job_queue
>>>>>>> as idle,
>>>>>>>     			 * it will removed from rq in
>>>>>>> drm_sched_entity_fini
>>>>>>>     			 * eventually
>>>>>>>     			 */
>>>>>>>     			s_entity->stopped = true;
>>>>>>> +		}
>>>>>>>     		spin_unlock(&rq->lock);
>>>>>>>     		kfree(sched->sched_rq[i]);
>>>>>>>     	}
Danilo Krummrich Oct. 14, 2024, 2:49 p.m. UTC | #10
On Tue, Sep 24, 2024 at 01:18:25PM +0200, Simona Vetter wrote:
> On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König wrote:
> > Am 20.09.24 um 15:26 schrieb Philipp Stanner:
> > > On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:
> > > > Am 20.09.24 um 10:57 schrieb Philipp Stanner:
> > > > > On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote:
> > > > > > Tearing down the scheduler with jobs still on the pending list
> > > > > > can
> > > > > > lead to use after free issues. Add a warning if drivers try to
> > > > > > destroy a scheduler which still has work pushed to the HW.
> > > > > Did you have time yet to look into my proposed waitque-solution?
> > > > I don't remember seeing anything. What have I missed?
> > > https://lore.kernel.org/all/20240903094446.29797-2-pstanner@redhat.com/
> > 
> > Mhm, I didn't got that in my inbox for some reason.
> > 
> > Interesting approach, I'm just not sure if we can or should wait in
> > drm_sched_fini().
> > 
> > Probably better to make that a separate function, something like
> > drm_sched_flush() or similar.
> 
> Yeah I don't think we should smash this into drm_sched_fini
> unconditionally. I think conceptually there's about three cases:
> 
> - Ringbuffer schedules. Probably want everything as-is, because
>   drm_sched_fini is called long after all the entities are gone in
>   drm_device cleanup.

Even if entities are long gone we can still have jobs on the pending list of the
scheduler. Calling drm_sched_fini() would still leak those jobs.

It's just that for this case it's unlikely, but I don't think we want to rely on
"unlikely".

Do I miss anything?

> 
> - fw scheduler hardware with preemption support. There we probably want to
>   nuke the context by setting the tdr timeout to zero (or maybe just as
>   long as context preemption takes to be efficient), and relying on the
>   normal gpu reset flow to handle things. drm_sched_entity_flush kinda
>   does this, except not really and it's a lot more focused on the
>   ringbuffer context. So maybe we want a new drm_sched_entity_kill.
> 
>   For this case calling drm_sched_fini() after the 1:1 entity is gone
>   should not find any linger jobs, it would actually be a bug somewhere if
>   there's a job lingering. Maybe a sanity check that there's not just no
>   jobs lingering, but also no entity left would be good here?

This is a timing assumption too, isn't it? What if we reach drm_sched_fini()
quicker than all the free() callbacks from the jobs on the pending list are
processed?

I know Xe uses the "tdr to zero trick", so I'm not aware of all implications --
in Nouveau I had to do it differently.

> 
> - fw scheduler without preemption support. There we kinda need the
>   drm_sched_flush, except blocking in fops->close is not great. So instead
>   I think the following is better:
>   1. drm_sched_entity_stopped, which only stops new submissions (for
>   paranoia) but doesn't tear down the entity
>   2. drm_dev_get
>   3. launch a worker which does a) drm_sched_flush (or
>   drm_sched_entity_flush or whatever we call it) b) drm_sched_entity_fini

I think it would be drm_sched_flush(); the jobs on the pending are not
associated with any entity any more. See also 96c7c2f4d5bd ("drm/scheduler: set
entity to NULL in drm_sched_entity_pop_job()").

>   + drm_sched_fini c) drm_dev_put
> 
>   Note that semantically this implements the refcount in the other path
>   from Phillip:
> 
>   https://lore.kernel.org/all/20240903094531.29893-2-pstanner@redhat.com/
>   
>   Except it doesn't impose refcount on everyone else who doesn't need it,
>   and it doesn't even impose refcounting on drivers that do need it
>   because we use drm_sched_flush and a worker to achieve the same.
> 
> Essentially helper functions for the common use-cases instead of trying to
> solve them all by putting drm_sched_flush as a potentially very blocking
> function into drm_sched_fini.
> 
> 
> > > > > > When there are still entities with jobs the situation is even
> > > > > > worse
> > > > > > since the dma_fences for those jobs can never signal we can just
> > > > > > choose between potentially locking up core memory management and
> > > > > > random memory corruption. When drivers really mess it up that
> > > > > > well
> > > > > > let them run into a BUG_ON().
> > > > > > 
> > > > > > Signed-off-by: Christian König<christian.koenig@amd.com>
> > > > > > ---
> > > > > >    drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++++-
> > > > > >    1 file changed, 18 insertions(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > index f093616fe53c..8a46fab5cdc8 100644
> > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct
> > > > > > drm_gpu_scheduler
> > > > > > *sched)
> > > > > I agree with Sima that it should first be documented in the
> > > > > function's
> > > > > docstring what the user is expected to have done before calling the
> > > > > function.
> > > > Good point, going to update the documentation as well.
> > > Cool thing, thx.
> > > Would be great if everything (not totally trivial) necessary to be done
> > > before _fini() is mentioned.
> > > 
> > > One could also think about providing a hint at how the driver can do
> > > that. AFAICS the only way for the driver to ensure that is to maintain
> > > its own, separate list of submitted jobs.
> > 
> > Even with a duplicated pending list it's actually currently impossible to do
> > this fully cleanly.
> > 
> > The problem is that the dma_fence object gives no guarantee when callbacks
> > are processed, e.g. they can be both processed from interrupt context as
> > well as from a CPU which called dma_fence_is_signaled().
> > 
> > So when a driver (or drm_sched_fini) waits for the last submitted fence it
> > actually can be that the drm_sched object still needs to do some processing.
> > See the hack in amdgpu_vm_tlb_seq() for more background on the problem.
> 
> So I thought this should be fairly easy because of the sched hw/public
> fence split: If we wait for both all jobs to finish and for all the sched
> work/tdr work to finish, and we make sure there's no entity existing
> that's not yet stopped we should catch them all? Or at least I think it's
> a bug if any other code even tries to touch the hw fence.
> 
> If you have any other driver code which relies on the rcu freeing then I
> think that's just a separate concern and we can ignore that here since the
> fences themselves will till get rcu-delay freed even if drm_sched_fini has
> finished.
> -Sima
> 
> > 
> > Regards,
> > Christian.
> > 
> > > 
> > > P.
> > > 
> > > > Thanks,
> > > > Christian.
> > > > 
> > > > > P.
> > > > > 
> > > > > >    	drm_sched_wqueue_stop(sched);
> > > > > > +	/*
> > > > > > +	 * Tearing down the scheduler wile there are still
> > > > > > unprocessed jobs can
> > > > > > +	 * lead to use after free issues in the scheduler fence.
> > > > > > +	 */
> > > > > > +	WARN_ON(!list_empty(&sched->pending_list));
> > > > > > +
> > > > > >    	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs;
> > > > > > i++)
> > > > > > {
> > > > > >    		struct drm_sched_rq *rq = sched->sched_rq[i];
> > > > > >    		spin_lock(&rq->lock);
> > > > > > -		list_for_each_entry(s_entity, &rq->entities,
> > > > > > list)
> > > > > > +		list_for_each_entry(s_entity, &rq->entities,
> > > > > > list) {
> > > > > > +			/*
> > > > > > +			 * The justification for this BUG_ON()
> > > > > > is
> > > > > > that tearing
> > > > > > +			 * down the scheduler while jobs are
> > > > > > pending
> > > > > > leaves
> > > > > > +			 * dma_fences unsignaled. Since we have
> > > > > > dependencies
> > > > > > +			 * from the core memory management to
> > > > > > eventually signal
> > > > > > +			 * dma_fences this can trivially lead to
> > > > > > a
> > > > > > system wide
> > > > > > +			 * stop because of a locked up memory
> > > > > > management.
> > > > > > +			 */
> > > > > > +			BUG_ON(spsc_queue_count(&s_entity-
> > > > > > > job_queue));
> > > > > > +
> > > > > >    			/*
> > > > > >    			 * Prevents reinsertion and marks
> > > > > > job_queue
> > > > > > as idle,
> > > > > >    			 * it will removed from rq in
> > > > > > drm_sched_entity_fini
> > > > > >    			 * eventually
> > > > > >    			 */
> > > > > >    			s_entity->stopped = true;
> > > > > > +		}
> > > > > >    		spin_unlock(&rq->lock);
> > > > > >    		kfree(sched->sched_rq[i]);
> > > > > >    	}
> 
> -- 
> Simona Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
>
Danilo Krummrich Oct. 14, 2024, 2:56 p.m. UTC | #11
On Fri, Sep 27, 2024 at 11:04:48AM +0200, Christian König wrote:
> Am 25.09.24 um 16:53 schrieb Philipp Stanner:
> > On Tue, 2024-09-24 at 13:18 +0200, Simona Vetter wrote:
> > > On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König wrote:
> > > > Am 20.09.24 um 15:26 schrieb Philipp Stanner:
> > > > > On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:
> > > > > > Am 20.09.24 um 10:57 schrieb Philipp Stanner:
> > > > > > > On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote:
> > > > > > > > Tearing down the scheduler with jobs still on the pending
> > > > > > > > list
> > > > > > > > can
> > > > > > > > lead to use after free issues. Add a warning if drivers try
> > > > > > > > to
> > > > > > > > destroy a scheduler which still has work pushed to the HW.
> > > > > > > Did you have time yet to look into my proposed waitque-
> > > > > > > solution?
> > > > > > I don't remember seeing anything. What have I missed?
> > > > > https://lore.kernel.org/all/20240903094446.29797-2-pstanner@redhat.com/
> > > > Mhm, I didn't got that in my inbox for some reason.
> > > > 
> > > > Interesting approach, I'm just not sure if we can or should wait in
> > > > drm_sched_fini().
> > We do agree that jobs still pending when drm_sched_fini() starts is
> > always a bug, right?
> 
> Correct, the question is how to avoid that.
> 
> > If so, what are the disadvantages of waiting in drm_sched_fini()? We
> > could block buggy drivers as I see it. Which wouldn't be good, but
> > could then be fixed on drivers' site.
> 
> Sima explained that pretty well: Don't block in fops->close, do that in
> fops->flush instead.

I agree that we shouldn't block in close(), but this effectively means that we
need to reference count the scheduler, right?

Otherwise, if we allow to just skip / interrupt the teardown, we can still leak
memory.

> 
> One issue this solves is that when you send a SIGTERM the tear down handling
> first flushes all the FDs and then closes them.
> 
> So if flushing the FDs blocks because the process initiated sending a
> terabyte of data over a 300bps line (for example) you can still throw a
> SIGKILL and abort that as well.
> 
> If you would block in fops-close() that SIGKILL won't have any effect any
> more because by the time close() is called the process is gone and signals
> are already blocked.
> 
> And yes when I learned about that issue I was also buffed that handling like
> this in the UNIX design is nearly 50 years old and still applies to today.
> > > > Probably better to make that a separate function, something like
> > > > drm_sched_flush() or similar.
> > We could do that. Such a function could then be called by drivers which
> > are not sure whether all jobs are done before they start tearing down.
> 
> Yes exactly that's the idea. And give that flush function a return code so
> that it can return -EINTR.
> 
> > > Yeah I don't think we should smash this into drm_sched_fini
> > > unconditionally. I think conceptually there's about three cases:
> > > 
> > > - Ringbuffer schedules. Probably want everything as-is, because
> > >    drm_sched_fini is called long after all the entities are gone in
> > >    drm_device cleanup.
> > > 
> > > - fw scheduler hardware with preemption support. There we probably
> > > want to
> > >    nuke the context by setting the tdr timeout to zero (or maybe just
> > > as
> > >    long as context preemption takes to be efficient), and relying on
> > > the
> > >    normal gpu reset flow to handle things. drm_sched_entity_flush
> > > kinda
> > >    does this, except not really and it's a lot more focused on the
> > >    ringbuffer context. So maybe we want a new drm_sched_entity_kill.
> > > 
> > >    For this case calling drm_sched_fini() after the 1:1 entity is gone
> > >    should not find any linger jobs, it would actually be a bug
> > > somewhere if
> > >    there's a job lingering. Maybe a sanity check that there's not just
> > > no
> > >    jobs lingering, but also no entity left would be good here?
> > The check for lingering ones is in Christian's patch here IISC.
> > At which position would you imagine the check for the entity being
> > performed?
> > 
> > > - fw scheduler without preemption support. There we kinda need the
> > >    drm_sched_flush, except blocking in fops->close is not great. So
> > > instead
> > >    I think the following is better:
> > >    1. drm_sched_entity_stopped, which only stops new submissions (for
> > >    paranoia) but doesn't tear down the entity
> > Who would call that function?
> > Drivers using it voluntarily could just as well stop accepting new jobs
> > from userspace to their entities, couldn't they?
> > 
> > >    2. drm_dev_get
> > >    3. launch a worker which does a) drm_sched_flush (or
> > >    drm_sched_entity_flush or whatever we call it) b)
> > > drm_sched_entity_fini
> > >    + drm_sched_fini c) drm_dev_put
> > > 
> > >    Note that semantically this implements the refcount in the other
> > > path
> > >    from Phillip:
> > > 
> > > https://lore.kernel.org/all/20240903094531.29893-2-pstanner@redhat.com/
> > >    Except it doesn't impose refcount on everyone else who doesn't need
> > > it,
> > >    and it doesn't even impose refcounting on drivers that do need it
> > >    because we use drm_sched_flush and a worker to achieve the same.
> > I indeed wasn't happy with the refcount approach for that reason,
> > agreed.
> > 
> > > Essentially helper functions for the common use-cases instead of
> > > trying to
> > > solve them all by putting drm_sched_flush as a potentially very
> > > blocking
> > > function into drm_sched_fini.
> > I'm still not able to see why it blocking would be undesired – as far
> > as I can see, it is only invoked on driver teardown, so not during
> > active operation. Teardown doesn't happen that often, and it can (if
> > implemented correctly) only block until the driver's code has signaled
> > the last fences. If that doesn't happen, the block would reveal a bug.
> > 
> > But don't get me wrong: I don't want to *push* this solution. I just
> > want to understand when it could become a problem.
> > 
> > 
> > Wouldn't an explicitly blocking, separate function like
> > drm_sched_flush() or drm_sched_fini_flush() be a small, doable step
> > towards the right direction?
> 
> I think that this is the right thing to do, yes.
> 
> > > > > > > > When there are still entities with jobs the situation is
> > > > > > > > even
> > > > > > > > worse
> > > > > > > > since the dma_fences for those jobs can never signal we can
> > > > > > > > just
> > > > > > > > choose between potentially locking up core memory
> > > > > > > > management and
> > > > > > > > random memory corruption. When drivers really mess it up
> > > > > > > > that
> > > > > > > > well
> > > > > > > > let them run into a BUG_ON().
> > > > > > > > 
> > > > > > > > Signed-off-by: Christian König<christian.koenig@amd.com>
> > > > > > > > ---
> > > > > > > >     drivers/gpu/drm/scheduler/sched_main.c | 19
> > > > > > > > ++++++++++++++++++-
> > > > > > > >     1 file changed, 18 insertions(+), 1 deletion(-)
> > > > > > > > 
> > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > index f093616fe53c..8a46fab5cdc8 100644
> > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct
> > > > > > > > drm_gpu_scheduler
> > > > > > > > *sched)
> > > > > > > I agree with Sima that it should first be documented in the
> > > > > > > function's
> > > > > > > docstring what the user is expected to have done before
> > > > > > > calling the
> > > > > > > function.
> > > > > > Good point, going to update the documentation as well.
> > > > > Cool thing, thx.
> > > > > Would be great if everything (not totally trivial) necessary to
> > > > > be done
> > > > > before _fini() is mentioned.
> > > > > 
> > > > > One could also think about providing a hint at how the driver can
> > > > > do
> > > > > that. AFAICS the only way for the driver to ensure that is to
> > > > > maintain
> > > > > its own, separate list of submitted jobs.
> > > > Even with a duplicated pending list it's actually currently
> > > > impossible to do
> > > > this fully cleanly.
> > > > 
> > > > The problem is that the dma_fence object gives no guarantee when
> > > > callbacks
> > > > are processed, e.g. they can be both processed from interrupt
> > > > context as
> > > > well as from a CPU which called dma_fence_is_signaled().
> > > > 
> > > > So when a driver (or drm_sched_fini) waits for the last submitted
> > > > fence it
> > > > actually can be that the drm_sched object still needs to do some
> > > > processing.
> > > > See the hack in amdgpu_vm_tlb_seq() for more background on the
> > > > problem.
> > Oh dear ^^'
> > We better work towards fixing that centrally
> > 
> > Thanks,
> > P.
> > 
> > 
> > > So I thought this should be fairly easy because of the sched
> > > hw/public
> > > fence split: If we wait for both all jobs to finish and for all the
> > > sched
> > > work/tdr work to finish, and we make sure there's no entity existing
> > > that's not yet stopped we should catch them all?
> 
> Unfortunately not.
> 
> Even when you do a dma_fence_wait() on the last submission it can still be
> that another CPU is executing the callbacks to wake up the scheduler work
> item and cleanup the job.
> 
> That's one of the reasons why I think the design of keeping the job alive is
> so extremely awkward. The dma_fence as representation of the hw submission
> has a much better defined state machine and lifetime.
> 
> Regards,
> Christian.
> 
> > >   Or at least I think
> > > it's
> > > a bug if any other code even tries to touch the hw fence.
> > > 
> > > If you have any other driver code which relies on the rcu freeing
> > > then I
> > > think that's just a separate concern and we can ignore that here
> > > since the
> > > fences themselves will till get rcu-delay freed even if
> > > drm_sched_fini has
> > > finished.
> > > -Sima
> > > 
> > > > Regards,
> > > > Christian.
> > > > 
> > > > > P.
> > > > > 
> > > > > > Thanks,
> > > > > > Christian.
> > > > > > 
> > > > > > > P.
> > > > > > > 
> > > > > > > >     	drm_sched_wqueue_stop(sched);
> > > > > > > > +	/*
> > > > > > > > +	 * Tearing down the scheduler wile there are still
> > > > > > > > unprocessed jobs can
> > > > > > > > +	 * lead to use after free issues in the scheduler
> > > > > > > > fence.
> > > > > > > > +	 */
> > > > > > > > +	WARN_ON(!list_empty(&sched->pending_list));
> > > > > > > > +
> > > > > > > >     	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched-
> > > > > > > > > num_rqs;
> > > > > > > > i++)
> > > > > > > > {
> > > > > > > >     		struct drm_sched_rq *rq = sched-
> > > > > > > > > sched_rq[i];
> > > > > > > >     		spin_lock(&rq->lock);
> > > > > > > > -		list_for_each_entry(s_entity, &rq-
> > > > > > > > > entities,
> > > > > > > > list)
> > > > > > > > +		list_for_each_entry(s_entity, &rq-
> > > > > > > > > entities,
> > > > > > > > list) {
> > > > > > > > +			/*
> > > > > > > > +			 * The justification for this
> > > > > > > > BUG_ON()
> > > > > > > > is
> > > > > > > > that tearing
> > > > > > > > +			 * down the scheduler while jobs
> > > > > > > > are
> > > > > > > > pending
> > > > > > > > leaves
> > > > > > > > +			 * dma_fences unsignaled. Since we
> > > > > > > > have
> > > > > > > > dependencies
> > > > > > > > +			 * from the core memory management
> > > > > > > > to
> > > > > > > > eventually signal
> > > > > > > > +			 * dma_fences this can trivially
> > > > > > > > lead to
> > > > > > > > a
> > > > > > > > system wide
> > > > > > > > +			 * stop because of a locked up
> > > > > > > > memory
> > > > > > > > management.
> > > > > > > > +			 */
> > > > > > > > +			BUG_ON(spsc_queue_count(&s_entity-
> > > > > > > > > job_queue));
> > > > > > > > +
> > > > > > > >     			/*
> > > > > > > >     			 * Prevents reinsertion and marks
> > > > > > > > job_queue
> > > > > > > > as idle,
> > > > > > > >     			 * it will removed from rq in
> > > > > > > > drm_sched_entity_fini
> > > > > > > >     			 * eventually
> > > > > > > >     			 */
> > > > > > > >     			s_entity->stopped = true;
> > > > > > > > +		}
> > > > > > > >     		spin_unlock(&rq->lock);
> > > > > > > >     		kfree(sched->sched_rq[i]);
> > > > > > > >     	}
Philipp Stanner Oct. 18, 2024, 12:07 p.m. UTC | #12
On Mon, 2024-10-14 at 16:56 +0200, Danilo Krummrich wrote:
> On Fri, Sep 27, 2024 at 11:04:48AM +0200, Christian König wrote:
> > Am 25.09.24 um 16:53 schrieb Philipp Stanner:
> > > On Tue, 2024-09-24 at 13:18 +0200, Simona Vetter wrote:
> > > > On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König
> > > > wrote:
> > > > > Am 20.09.24 um 15:26 schrieb Philipp Stanner:
> > > > > > On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:
> > > > > > > Am 20.09.24 um 10:57 schrieb Philipp Stanner:
> > > > > > > > On Wed, 2024-09-18 at 15:39 +0200, Christian König
> > > > > > > > wrote:
> > > > > > > > > Tearing down the scheduler with jobs still on the
> > > > > > > > > pending
> > > > > > > > > list
> > > > > > > > > can
> > > > > > > > > lead to use after free issues. Add a warning if
> > > > > > > > > drivers try
> > > > > > > > > to
> > > > > > > > > destroy a scheduler which still has work pushed to
> > > > > > > > > the HW.
> > > > > > > > Did you have time yet to look into my proposed waitque-
> > > > > > > > solution?
> > > > > > > I don't remember seeing anything. What have I missed?
> > > > > > https://lore.kernel.org/all/20240903094446.29797-2-pstanner@redhat.com/
> > > > > Mhm, I didn't got that in my inbox for some reason.
> > > > > 
> > > > > Interesting approach, I'm just not sure if we can or should
> > > > > wait in
> > > > > drm_sched_fini().
> > > We do agree that jobs still pending when drm_sched_fini() starts
> > > is
> > > always a bug, right?
> > 
> > Correct, the question is how to avoid that.
> > 
> > > If so, what are the disadvantages of waiting in drm_sched_fini()?
> > > We
> > > could block buggy drivers as I see it. Which wouldn't be good,
> > > but
> > > could then be fixed on drivers' site.
> > 
> > Sima explained that pretty well: Don't block in fops->close, do
> > that in
> > fops->flush instead.
> 
> I agree that we shouldn't block in close(), but this effectively
> means that we
> need to reference count the scheduler, right?
> 
> Otherwise, if we allow to just skip / interrupt the teardown, we can
> still leak
> memory.

Having thought about it, I agree with Danilo. Having something that
shall wait on green light, but can be interrupted, is no guarantee and
therefore not a feasible solution.

To break down the solution space, these seem to be our options:
   1. We have something (either drm_sched_fini() or a helper, e.g.,
      drm_sched_flush()) that definitely blocks until the pending list
      has become empty.
   2. We have jobs reference-count the scheduler, so the latter can
      outlive the driver and will be freed some time later.

Can anyone think of a third solution?


Solution #1 has the problem of obviously blocking unconditionally if
the driver didn't make sure that all fences will be signaled within
reasonable time. In my opinion, this would actually be an advantage,
because it will be *very* noticable and force users to repair their
driver. The driver *has* to guarantee that all fences will be signaled.
If the driver has to do fishy things, having the blocking outsourced to
the helper drm_sched_flush() would allow them to circumvent that.

Solution #2 has the problem of backend_ops.free_job() potentially using
driver-data after the driver is gone, causing UAF. So with this
solutions all drivers would have to be aware of the issue and handle it
through one of DRMs primitives dedicated to such problems.


Currently, all drivers either work around the problem internally or
simply ignore it, it seems.

So I'd argue that both solutions are an improvement over the existing
situation. My preference would be #1.


Opinions?

P.

> 
> > 
> > One issue this solves is that when you send a SIGTERM the tear down
> > handling
> > first flushes all the FDs and then closes them.
> > 
> > So if flushing the FDs blocks because the process initiated sending
> > a
> > terabyte of data over a 300bps line (for example) you can still
> > throw a
> > SIGKILL and abort that as well.
> > 
> > If you would block in fops-close() that SIGKILL won't have any
> > effect any
> > more because by the time close() is called the process is gone and
> > signals
> > are already blocked.
> > 
> > And yes when I learned about that issue I was also buffed that
> > handling like
> > this in the UNIX design is nearly 50 years old and still applies to
> > today.
> > > > > Probably better to make that a separate function, something
> > > > > like
> > > > > drm_sched_flush() or similar.
> > > We could do that. Such a function could then be called by drivers
> > > which
> > > are not sure whether all jobs are done before they start tearing
> > > down.
> > 
> > Yes exactly that's the idea. And give that flush function a return
> > code so
> > that it can return -EINTR.
> > 
> > > > Yeah I don't think we should smash this into drm_sched_fini
> > > > unconditionally. I think conceptually there's about three
> > > > cases:
> > > > 
> > > > - Ringbuffer schedules. Probably want everything as-is, because
> > > >    drm_sched_fini is called long after all the entities are
> > > > gone in
> > > >    drm_device cleanup.
> > > > 
> > > > - fw scheduler hardware with preemption support. There we
> > > > probably
> > > > want to
> > > >    nuke the context by setting the tdr timeout to zero (or
> > > > maybe just
> > > > as
> > > >    long as context preemption takes to be efficient), and
> > > > relying on
> > > > the
> > > >    normal gpu reset flow to handle things.
> > > > drm_sched_entity_flush
> > > > kinda
> > > >    does this, except not really and it's a lot more focused on
> > > > the
> > > >    ringbuffer context. So maybe we want a new
> > > > drm_sched_entity_kill.
> > > > 
> > > >    For this case calling drm_sched_fini() after the 1:1 entity
> > > > is gone
> > > >    should not find any linger jobs, it would actually be a bug
> > > > somewhere if
> > > >    there's a job lingering. Maybe a sanity check that there's
> > > > not just
> > > > no
> > > >    jobs lingering, but also no entity left would be good here?
> > > The check for lingering ones is in Christian's patch here IISC.
> > > At which position would you imagine the check for the entity
> > > being
> > > performed?
> > > 
> > > > - fw scheduler without preemption support. There we kinda need
> > > > the
> > > >    drm_sched_flush, except blocking in fops->close is not
> > > > great. So
> > > > instead
> > > >    I think the following is better:
> > > >    1. drm_sched_entity_stopped, which only stops new
> > > > submissions (for
> > > >    paranoia) but doesn't tear down the entity
> > > Who would call that function?
> > > Drivers using it voluntarily could just as well stop accepting
> > > new jobs
> > > from userspace to their entities, couldn't they?
> > > 
> > > >    2. drm_dev_get
> > > >    3. launch a worker which does a) drm_sched_flush (or
> > > >    drm_sched_entity_flush or whatever we call it) b)
> > > > drm_sched_entity_fini
> > > >    + drm_sched_fini c) drm_dev_put
> > > > 
> > > >    Note that semantically this implements the refcount in the
> > > > other
> > > > path
> > > >    from Phillip:
> > > > 
> > > > https://lore.kernel.org/all/20240903094531.29893-2-pstanner@redhat.com/
> > > >    Except it doesn't impose refcount on everyone else who
> > > > doesn't need
> > > > it,
> > > >    and it doesn't even impose refcounting on drivers that do
> > > > need it
> > > >    because we use drm_sched_flush and a worker to achieve the
> > > > same.
> > > I indeed wasn't happy with the refcount approach for that reason,
> > > agreed.
> > > 
> > > > Essentially helper functions for the common use-cases instead
> > > > of
> > > > trying to
> > > > solve them all by putting drm_sched_flush as a potentially very
> > > > blocking
> > > > function into drm_sched_fini.
> > > I'm still not able to see why it blocking would be undesired – as
> > > far
> > > as I can see, it is only invoked on driver teardown, so not
> > > during
> > > active operation. Teardown doesn't happen that often, and it can
> > > (if
> > > implemented correctly) only block until the driver's code has
> > > signaled
> > > the last fences. If that doesn't happen, the block would reveal a
> > > bug.
> > > 
> > > But don't get me wrong: I don't want to *push* this solution. I
> > > just
> > > want to understand when it could become a problem.
> > > 
> > > 
> > > Wouldn't an explicitly blocking, separate function like
> > > drm_sched_flush() or drm_sched_fini_flush() be a small, doable
> > > step
> > > towards the right direction?
> > 
> > I think that this is the right thing to do, yes.
> > 
> > > > > > > > > When there are still entities with jobs the situation
> > > > > > > > > is
> > > > > > > > > even
> > > > > > > > > worse
> > > > > > > > > since the dma_fences for those jobs can never signal
> > > > > > > > > we can
> > > > > > > > > just
> > > > > > > > > choose between potentially locking up core memory
> > > > > > > > > management and
> > > > > > > > > random memory corruption. When drivers really mess it
> > > > > > > > > up
> > > > > > > > > that
> > > > > > > > > well
> > > > > > > > > let them run into a BUG_ON().
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Christian
> > > > > > > > > König<christian.koenig@amd.com>
> > > > > > > > > ---
> > > > > > > > >     drivers/gpu/drm/scheduler/sched_main.c | 19
> > > > > > > > > ++++++++++++++++++-
> > > > > > > > >     1 file changed, 18 insertions(+), 1 deletion(-)
> > > > > > > > > 
> > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > index f093616fe53c..8a46fab5cdc8 100644
> > > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct
> > > > > > > > > drm_gpu_scheduler
> > > > > > > > > *sched)
> > > > > > > > I agree with Sima that it should first be documented in
> > > > > > > > the
> > > > > > > > function's
> > > > > > > > docstring what the user is expected to have done before
> > > > > > > > calling the
> > > > > > > > function.
> > > > > > > Good point, going to update the documentation as well.
> > > > > > Cool thing, thx.
> > > > > > Would be great if everything (not totally trivial)
> > > > > > necessary to
> > > > > > be done
> > > > > > before _fini() is mentioned.
> > > > > > 
> > > > > > One could also think about providing a hint at how the
> > > > > > driver can
> > > > > > do
> > > > > > that. AFAICS the only way for the driver to ensure that is
> > > > > > to
> > > > > > maintain
> > > > > > its own, separate list of submitted jobs.
> > > > > Even with a duplicated pending list it's actually currently
> > > > > impossible to do
> > > > > this fully cleanly.
> > > > > 
> > > > > The problem is that the dma_fence object gives no guarantee
> > > > > when
> > > > > callbacks
> > > > > are processed, e.g. they can be both processed from interrupt
> > > > > context as
> > > > > well as from a CPU which called dma_fence_is_signaled().
> > > > > 
> > > > > So when a driver (or drm_sched_fini) waits for the last
> > > > > submitted
> > > > > fence it
> > > > > actually can be that the drm_sched object still needs to do
> > > > > some
> > > > > processing.
> > > > > See the hack in amdgpu_vm_tlb_seq() for more background on
> > > > > the
> > > > > problem.
> > > Oh dear ^^'
> > > We better work towards fixing that centrally
> > > 
> > > Thanks,
> > > P.
> > > 
> > > 
> > > > So I thought this should be fairly easy because of the sched
> > > > hw/public
> > > > fence split: If we wait for both all jobs to finish and for all
> > > > the
> > > > sched
> > > > work/tdr work to finish, and we make sure there's no entity
> > > > existing
> > > > that's not yet stopped we should catch them all?
> > 
> > Unfortunately not.
> > 
> > Even when you do a dma_fence_wait() on the last submission it can
> > still be
> > that another CPU is executing the callbacks to wake up the
> > scheduler work
> > item and cleanup the job.
> > 
> > That's one of the reasons why I think the design of keeping the job
> > alive is
> > so extremely awkward. The dma_fence as representation of the hw
> > submission
> > has a much better defined state machine and lifetime.
> > 
> > Regards,
> > Christian.
> > 
> > > >   Or at least I think
> > > > it's
> > > > a bug if any other code even tries to touch the hw fence.
> > > > 
> > > > If you have any other driver code which relies on the rcu
> > > > freeing
> > > > then I
> > > > think that's just a separate concern and we can ignore that
> > > > here
> > > > since the
> > > > fences themselves will till get rcu-delay freed even if
> > > > drm_sched_fini has
> > > > finished.
> > > > -Sima
> > > > 
> > > > > Regards,
> > > > > Christian.
> > > > > 
> > > > > > P.
> > > > > > 
> > > > > > > Thanks,
> > > > > > > Christian.
> > > > > > > 
> > > > > > > > P.
> > > > > > > > 
> > > > > > > > >      drm_sched_wqueue_stop(sched);
> > > > > > > > > + /*
> > > > > > > > > + * Tearing down the scheduler wile there are still
> > > > > > > > > unprocessed jobs can
> > > > > > > > > + * lead to use after free issues in the scheduler
> > > > > > > > > fence.
> > > > > > > > > + */
> > > > > > > > > + WARN_ON(!list_empty(&sched->pending_list));
> > > > > > > > > +
> > > > > > > > >      for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched-
> > > > > > > > > > num_rqs;
> > > > > > > > > i++)
> > > > > > > > > {
> > > > > > > > >      struct drm_sched_rq *rq = sched-
> > > > > > > > > > sched_rq[i];
> > > > > > > > >      spin_lock(&rq->lock);
> > > > > > > > > - list_for_each_entry(s_entity, &rq-
> > > > > > > > > > entities,
> > > > > > > > > list)
> > > > > > > > > + list_for_each_entry(s_entity, &rq-
> > > > > > > > > > entities,
> > > > > > > > > list) {
> > > > > > > > > + /*
> > > > > > > > > + * The justification for this
> > > > > > > > > BUG_ON()
> > > > > > > > > is
> > > > > > > > > that tearing
> > > > > > > > > + * down the scheduler while jobs
> > > > > > > > > are
> > > > > > > > > pending
> > > > > > > > > leaves
> > > > > > > > > + * dma_fences unsignaled. Since we
> > > > > > > > > have
> > > > > > > > > dependencies
> > > > > > > > > + * from the core memory management
> > > > > > > > > to
> > > > > > > > > eventually signal
> > > > > > > > > + * dma_fences this can trivially
> > > > > > > > > lead to
> > > > > > > > > a
> > > > > > > > > system wide
> > > > > > > > > + * stop because of a locked up
> > > > > > > > > memory
> > > > > > > > > management.
> > > > > > > > > + */
> > > > > > > > > + BUG_ON(spsc_queue_count(&s_entity-
> > > > > > > > > > job_queue));
> > > > > > > > > +
> > > > > > > > >      /*
> > > > > > > > >      * Prevents reinsertion and marks
> > > > > > > > > job_queue
> > > > > > > > > as idle,
> > > > > > > > >      * it will removed from rq in
> > > > > > > > > drm_sched_entity_fini
> > > > > > > > >      * eventually
> > > > > > > > >      */
> > > > > > > > >      s_entity->stopped = true;
> > > > > > > > > + }
> > > > > > > > >      spin_unlock(&rq->lock);
> > > > > > > > >      kfree(sched->sched_rq[i]);
> > > > > > > > >      }
>
Philipp Stanner Oct. 29, 2024, 7:22 a.m. UTC | #13
Christian, Sima?

Matthew? (+CC)

Opinions on the below?

tl;dr:
I still think it's a good thing to detectably block in
drm_sched_fini(), or at the very least drm_sched_flush(), because then
you'll find out that the driver is broken and can repair it.

P.


On Fri, 2024-10-18 at 14:07 +0200, Philipp Stanner wrote:
> On Mon, 2024-10-14 at 16:56 +0200, Danilo Krummrich wrote:
> > On Fri, Sep 27, 2024 at 11:04:48AM +0200, Christian König wrote:
> > > Am 25.09.24 um 16:53 schrieb Philipp Stanner:
> > > > On Tue, 2024-09-24 at 13:18 +0200, Simona Vetter wrote:
> > > > > On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König
> > > > > wrote:
> > > > > > Am 20.09.24 um 15:26 schrieb Philipp Stanner:
> > > > > > > On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:
> > > > > > > > Am 20.09.24 um 10:57 schrieb Philipp Stanner:
> > > > > > > > > On Wed, 2024-09-18 at 15:39 +0200, Christian König
> > > > > > > > > wrote:
> > > > > > > > > > Tearing down the scheduler with jobs still on the
> > > > > > > > > > pending
> > > > > > > > > > list
> > > > > > > > > > can
> > > > > > > > > > lead to use after free issues. Add a warning if
> > > > > > > > > > drivers try
> > > > > > > > > > to
> > > > > > > > > > destroy a scheduler which still has work pushed to
> > > > > > > > > > the HW.
> > > > > > > > > Did you have time yet to look into my proposed
> > > > > > > > > waitque-
> > > > > > > > > solution?
> > > > > > > > I don't remember seeing anything. What have I missed?
> > > > > > > https://lore.kernel.org/all/20240903094446.29797-2-pstanner@redhat.com/
> > > > > > Mhm, I didn't got that in my inbox for some reason.
> > > > > > 
> > > > > > Interesting approach, I'm just not sure if we can or should
> > > > > > wait in
> > > > > > drm_sched_fini().
> > > > We do agree that jobs still pending when drm_sched_fini()
> > > > starts
> > > > is
> > > > always a bug, right?
> > > 
> > > Correct, the question is how to avoid that.
> > > 
> > > > If so, what are the disadvantages of waiting in
> > > > drm_sched_fini()?
> > > > We
> > > > could block buggy drivers as I see it. Which wouldn't be good,
> > > > but
> > > > could then be fixed on drivers' site.
> > > 
> > > Sima explained that pretty well: Don't block in fops->close, do
> > > that in
> > > fops->flush instead.
> > 
> > I agree that we shouldn't block in close(), but this effectively
> > means that we
> > need to reference count the scheduler, right?
> > 
> > Otherwise, if we allow to just skip / interrupt the teardown, we
> > can
> > still leak
> > memory.
> 
> Having thought about it, I agree with Danilo. Having something that
> shall wait on green light, but can be interrupted, is no guarantee
> and
> therefore not a feasible solution.
> 
> To break down the solution space, these seem to be our options:
>    1. We have something (either drm_sched_fini() or a helper, e.g.,
>       drm_sched_flush()) that definitely blocks until the pending
> list
>       has become empty.
>    2. We have jobs reference-count the scheduler, so the latter can
>       outlive the driver and will be freed some time later.
> 
> Can anyone think of a third solution?
> 
> 
> Solution #1 has the problem of obviously blocking unconditionally if
> the driver didn't make sure that all fences will be signaled within
> reasonable time. In my opinion, this would actually be an advantage,
> because it will be *very* noticable and force users to repair their
> driver. The driver *has* to guarantee that all fences will be
> signaled.
> If the driver has to do fishy things, having the blocking outsourced
> to
> the helper drm_sched_flush() would allow them to circumvent that.
> 
> Solution #2 has the problem of backend_ops.free_job() potentially
> using
> driver-data after the driver is gone, causing UAF. So with this
> solutions all drivers would have to be aware of the issue and handle
> it
> through one of DRMs primitives dedicated to such problems.
> 
> 
> Currently, all drivers either work around the problem internally or
> simply ignore it, it seems.
> 
> So I'd argue that both solutions are an improvement over the existing
> situation. My preference would be #1.
> 
> 
> Opinions?
> 
> P.
> 
> > 
> > > 
> > > One issue this solves is that when you send a SIGTERM the tear
> > > down
> > > handling
> > > first flushes all the FDs and then closes them.
> > > 
> > > So if flushing the FDs blocks because the process initiated
> > > sending
> > > a
> > > terabyte of data over a 300bps line (for example) you can still
> > > throw a
> > > SIGKILL and abort that as well.
> > > 
> > > If you would block in fops-close() that SIGKILL won't have any
> > > effect any
> > > more because by the time close() is called the process is gone
> > > and
> > > signals
> > > are already blocked.
> > > 
> > > And yes when I learned about that issue I was also buffed that
> > > handling like
> > > this in the UNIX design is nearly 50 years old and still applies
> > > to
> > > today.
> > > > > > Probably better to make that a separate function, something
> > > > > > like
> > > > > > drm_sched_flush() or similar.
> > > > We could do that. Such a function could then be called by
> > > > drivers
> > > > which
> > > > are not sure whether all jobs are done before they start
> > > > tearing
> > > > down.
> > > 
> > > Yes exactly that's the idea. And give that flush function a
> > > return
> > > code so
> > > that it can return -EINTR.
> > > 
> > > > > Yeah I don't think we should smash this into drm_sched_fini
> > > > > unconditionally. I think conceptually there's about three
> > > > > cases:
> > > > > 
> > > > > - Ringbuffer schedules. Probably want everything as-is,
> > > > > because
> > > > >    drm_sched_fini is called long after all the entities are
> > > > > gone in
> > > > >    drm_device cleanup.
> > > > > 
> > > > > - fw scheduler hardware with preemption support. There we
> > > > > probably
> > > > > want to
> > > > >    nuke the context by setting the tdr timeout to zero (or
> > > > > maybe just
> > > > > as
> > > > >    long as context preemption takes to be efficient), and
> > > > > relying on
> > > > > the
> > > > >    normal gpu reset flow to handle things.
> > > > > drm_sched_entity_flush
> > > > > kinda
> > > > >    does this, except not really and it's a lot more focused
> > > > > on
> > > > > the
> > > > >    ringbuffer context. So maybe we want a new
> > > > > drm_sched_entity_kill.
> > > > > 
> > > > >    For this case calling drm_sched_fini() after the 1:1
> > > > > entity
> > > > > is gone
> > > > >    should not find any linger jobs, it would actually be a
> > > > > bug
> > > > > somewhere if
> > > > >    there's a job lingering. Maybe a sanity check that there's
> > > > > not just
> > > > > no
> > > > >    jobs lingering, but also no entity left would be good
> > > > > here?
> > > > The check for lingering ones is in Christian's patch here IISC.
> > > > At which position would you imagine the check for the entity
> > > > being
> > > > performed?
> > > > 
> > > > > - fw scheduler without preemption support. There we kinda
> > > > > need
> > > > > the
> > > > >    drm_sched_flush, except blocking in fops->close is not
> > > > > great. So
> > > > > instead
> > > > >    I think the following is better:
> > > > >    1. drm_sched_entity_stopped, which only stops new
> > > > > submissions (for
> > > > >    paranoia) but doesn't tear down the entity
> > > > Who would call that function?
> > > > Drivers using it voluntarily could just as well stop accepting
> > > > new jobs
> > > > from userspace to their entities, couldn't they?
> > > > 
> > > > >    2. drm_dev_get
> > > > >    3. launch a worker which does a) drm_sched_flush (or
> > > > >    drm_sched_entity_flush or whatever we call it) b)
> > > > > drm_sched_entity_fini
> > > > >    + drm_sched_fini c) drm_dev_put
> > > > > 
> > > > >    Note that semantically this implements the refcount in the
> > > > > other
> > > > > path
> > > > >    from Phillip:
> > > > > 
> > > > > https://lore.kernel.org/all/20240903094531.29893-2-pstanner@redhat.com/
> > > > >    Except it doesn't impose refcount on everyone else who
> > > > > doesn't need
> > > > > it,
> > > > >    and it doesn't even impose refcounting on drivers that do
> > > > > need it
> > > > >    because we use drm_sched_flush and a worker to achieve the
> > > > > same.
> > > > I indeed wasn't happy with the refcount approach for that
> > > > reason,
> > > > agreed.
> > > > 
> > > > > Essentially helper functions for the common use-cases instead
> > > > > of
> > > > > trying to
> > > > > solve them all by putting drm_sched_flush as a potentially
> > > > > very
> > > > > blocking
> > > > > function into drm_sched_fini.
> > > > I'm still not able to see why it blocking would be undesired –
> > > > as
> > > > far
> > > > as I can see, it is only invoked on driver teardown, so not
> > > > during
> > > > active operation. Teardown doesn't happen that often, and it
> > > > can
> > > > (if
> > > > implemented correctly) only block until the driver's code has
> > > > signaled
> > > > the last fences. If that doesn't happen, the block would reveal
> > > > a
> > > > bug.
> > > > 
> > > > But don't get me wrong: I don't want to *push* this solution. I
> > > > just
> > > > want to understand when it could become a problem.
> > > > 
> > > > 
> > > > Wouldn't an explicitly blocking, separate function like
> > > > drm_sched_flush() or drm_sched_fini_flush() be a small, doable
> > > > step
> > > > towards the right direction?
> > > 
> > > I think that this is the right thing to do, yes.
> > > 
> > > > > > > > > > When there are still entities with jobs the
> > > > > > > > > > situation
> > > > > > > > > > is
> > > > > > > > > > even
> > > > > > > > > > worse
> > > > > > > > > > since the dma_fences for those jobs can never
> > > > > > > > > > signal
> > > > > > > > > > we can
> > > > > > > > > > just
> > > > > > > > > > choose between potentially locking up core memory
> > > > > > > > > > management and
> > > > > > > > > > random memory corruption. When drivers really mess
> > > > > > > > > > it
> > > > > > > > > > up
> > > > > > > > > > that
> > > > > > > > > > well
> > > > > > > > > > let them run into a BUG_ON().
> > > > > > > > > > 
> > > > > > > > > > Signed-off-by: Christian
> > > > > > > > > > König<christian.koenig@amd.com>
> > > > > > > > > > ---
> > > > > > > > > >     drivers/gpu/drm/scheduler/sched_main.c | 19
> > > > > > > > > > ++++++++++++++++++-
> > > > > > > > > >     1 file changed, 18 insertions(+), 1 deletion(-)
> > > > > > > > > > 
> > > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > index f093616fe53c..8a46fab5cdc8 100644
> > > > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct
> > > > > > > > > > drm_gpu_scheduler
> > > > > > > > > > *sched)
> > > > > > > > > I agree with Sima that it should first be documented
> > > > > > > > > in
> > > > > > > > > the
> > > > > > > > > function's
> > > > > > > > > docstring what the user is expected to have done
> > > > > > > > > before
> > > > > > > > > calling the
> > > > > > > > > function.
> > > > > > > > Good point, going to update the documentation as well.
> > > > > > > Cool thing, thx.
> > > > > > > Would be great if everything (not totally trivial)
> > > > > > > necessary to
> > > > > > > be done
> > > > > > > before _fini() is mentioned.
> > > > > > > 
> > > > > > > One could also think about providing a hint at how the
> > > > > > > driver can
> > > > > > > do
> > > > > > > that. AFAICS the only way for the driver to ensure that
> > > > > > > is
> > > > > > > to
> > > > > > > maintain
> > > > > > > its own, separate list of submitted jobs.
> > > > > > Even with a duplicated pending list it's actually currently
> > > > > > impossible to do
> > > > > > this fully cleanly.
> > > > > > 
> > > > > > The problem is that the dma_fence object gives no guarantee
> > > > > > when
> > > > > > callbacks
> > > > > > are processed, e.g. they can be both processed from
> > > > > > interrupt
> > > > > > context as
> > > > > > well as from a CPU which called dma_fence_is_signaled().
> > > > > > 
> > > > > > So when a driver (or drm_sched_fini) waits for the last
> > > > > > submitted
> > > > > > fence it
> > > > > > actually can be that the drm_sched object still needs to do
> > > > > > some
> > > > > > processing.
> > > > > > See the hack in amdgpu_vm_tlb_seq() for more background on
> > > > > > the
> > > > > > problem.
> > > > Oh dear ^^'
> > > > We better work towards fixing that centrally
> > > > 
> > > > Thanks,
> > > > P.
> > > > 
> > > > 
> > > > > So I thought this should be fairly easy because of the sched
> > > > > hw/public
> > > > > fence split: If we wait for both all jobs to finish and for
> > > > > all
> > > > > the
> > > > > sched
> > > > > work/tdr work to finish, and we make sure there's no entity
> > > > > existing
> > > > > that's not yet stopped we should catch them all?
> > > 
> > > Unfortunately not.
> > > 
> > > Even when you do a dma_fence_wait() on the last submission it can
> > > still be
> > > that another CPU is executing the callbacks to wake up the
> > > scheduler work
> > > item and cleanup the job.
> > > 
> > > That's one of the reasons why I think the design of keeping the
> > > job
> > > alive is
> > > so extremely awkward. The dma_fence as representation of the hw
> > > submission
> > > has a much better defined state machine and lifetime.
> > > 
> > > Regards,
> > > Christian.
> > > 
> > > > >   Or at least I think
> > > > > it's
> > > > > a bug if any other code even tries to touch the hw fence.
> > > > > 
> > > > > If you have any other driver code which relies on the rcu
> > > > > freeing
> > > > > then I
> > > > > think that's just a separate concern and we can ignore that
> > > > > here
> > > > > since the
> > > > > fences themselves will till get rcu-delay freed even if
> > > > > drm_sched_fini has
> > > > > finished.
> > > > > -Sima
> > > > > 
> > > > > > Regards,
> > > > > > Christian.
> > > > > > 
> > > > > > > P.
> > > > > > > 
> > > > > > > > Thanks,
> > > > > > > > Christian.
> > > > > > > > 
> > > > > > > > > P.
> > > > > > > > > 
> > > > > > > > > >      drm_sched_wqueue_stop(sched);
> > > > > > > > > > + /*
> > > > > > > > > > + * Tearing down the scheduler wile there are still
> > > > > > > > > > unprocessed jobs can
> > > > > > > > > > + * lead to use after free issues in the scheduler
> > > > > > > > > > fence.
> > > > > > > > > > + */
> > > > > > > > > > + WARN_ON(!list_empty(&sched->pending_list));
> > > > > > > > > > +
> > > > > > > > > >      for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched-
> > > > > > > > > > > num_rqs;
> > > > > > > > > > i++)
> > > > > > > > > > {
> > > > > > > > > >      struct drm_sched_rq *rq = sched-
> > > > > > > > > > > sched_rq[i];
> > > > > > > > > >      spin_lock(&rq->lock);
> > > > > > > > > > - list_for_each_entry(s_entity, &rq-
> > > > > > > > > > > entities,
> > > > > > > > > > list)
> > > > > > > > > > + list_for_each_entry(s_entity, &rq-
> > > > > > > > > > > entities,
> > > > > > > > > > list) {
> > > > > > > > > > + /*
> > > > > > > > > > + * The justification for this
> > > > > > > > > > BUG_ON()
> > > > > > > > > > is
> > > > > > > > > > that tearing
> > > > > > > > > > + * down the scheduler while jobs
> > > > > > > > > > are
> > > > > > > > > > pending
> > > > > > > > > > leaves
> > > > > > > > > > + * dma_fences unsignaled. Since we
> > > > > > > > > > have
> > > > > > > > > > dependencies
> > > > > > > > > > + * from the core memory management
> > > > > > > > > > to
> > > > > > > > > > eventually signal
> > > > > > > > > > + * dma_fences this can trivially
> > > > > > > > > > lead to
> > > > > > > > > > a
> > > > > > > > > > system wide
> > > > > > > > > > + * stop because of a locked up
> > > > > > > > > > memory
> > > > > > > > > > management.
> > > > > > > > > > + */
> > > > > > > > > > + BUG_ON(spsc_queue_count(&s_entity-
> > > > > > > > > > > job_queue));
> > > > > > > > > > +
> > > > > > > > > >      /*
> > > > > > > > > >      * Prevents reinsertion and marks
> > > > > > > > > > job_queue
> > > > > > > > > > as idle,
> > > > > > > > > >      * it will removed from rq in
> > > > > > > > > > drm_sched_entity_fini
> > > > > > > > > >      * eventually
> > > > > > > > > >      */
> > > > > > > > > >      s_entity->stopped = true;
> > > > > > > > > > + }
> > > > > > > > > >      spin_unlock(&rq->lock);
> > > > > > > > > >      kfree(sched->sched_rq[i]);
> > > > > > > > > >      }
> > 
>
Dave Airlie Nov. 8, 2024, 4 a.m. UTC | #14
On Wed, 18 Sept 2024 at 23:48, Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Tearing down the scheduler with jobs still on the pending list can
> lead to use after free issues. Add a warning if drivers try to
> destroy a scheduler which still has work pushed to the HW.
>
> When there are still entities with jobs the situation is even worse
> since the dma_fences for those jobs can never signal we can just
> choose between potentially locking up core memory management and
> random memory corruption. When drivers really mess it up that well
> let them run into a BUG_ON().

I've been talking a bit to Phillip about this offline.

I'm not sure we should ever be BUG_ON here, I think it might be an
extreme answer, considering we are saying blocking userspace to let
things finish is bad, I think hitting this would be much worse.

I'd rather we WARN_ON, and consider just saying screw it and block
userspace close.

If we really want to avoid the hang on close possibility (though I'm
mostly fine with that) then I think Sima's option to just keep a
reference on the driver, let userspace close and try and clean things
up on fences in the driver later.

I think this should be at least good for rust lifetimes.

Having an explicit memory leak is bad, having a managed memory leak is
less bad, because at least all the memory is still pointed to by
something and managed, at a guess we'd have to keep the whole driver
and scheduler around, to avoid having to make special free functions.
Unless there is some concept like TTM ghost objects we could get away
with, but I think having to signal the fence means we should keep all
the pieces.

Dave.

>
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index f093616fe53c..8a46fab5cdc8 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
>
>         drm_sched_wqueue_stop(sched);
>
> +       /*
> +        * Tearing down the scheduler wile there are still unprocessed jobs can
> +        * lead to use after free issues in the scheduler fence.
> +        */
> +       WARN_ON(!list_empty(&sched->pending_list));
> +
>         for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
>                 struct drm_sched_rq *rq = sched->sched_rq[i];
>
>                 spin_lock(&rq->lock);
> -               list_for_each_entry(s_entity, &rq->entities, list)
> +               list_for_each_entry(s_entity, &rq->entities, list) {
> +                       /*
> +                        * The justification for this BUG_ON() is that tearing
> +                        * down the scheduler while jobs are pending leaves
> +                        * dma_fences unsignaled. Since we have dependencies
> +                        * from the core memory management to eventually signal
> +                        * dma_fences this can trivially lead to a system wide
> +                        * stop because of a locked up memory management.
> +                        */
> +                       BUG_ON(spsc_queue_count(&s_entity->job_queue));
> +
>                         /*
>                          * Prevents reinsertion and marks job_queue as idle,
>                          * it will removed from rq in drm_sched_entity_fini
>                          * eventually
>                          */
>                         s_entity->stopped = true;
> +               }
>                 spin_unlock(&rq->lock);
>                 kfree(sched->sched_rq[i]);
>         }
> --
> 2.34.1
>
Christian König Nov. 12, 2024, 3:26 p.m. UTC | #15
Am 08.11.24 um 05:00 schrieb Dave Airlie:
> On Wed, 18 Sept 2024 at 23:48, Christian König
> <ckoenig.leichtzumerken@gmail.com>  wrote:
>> Tearing down the scheduler with jobs still on the pending list can
>> lead to use after free issues. Add a warning if drivers try to
>> destroy a scheduler which still has work pushed to the HW.
>>
>> When there are still entities with jobs the situation is even worse
>> since the dma_fences for those jobs can never signal we can just
>> choose between potentially locking up core memory management and
>> random memory corruption. When drivers really mess it up that well
>> let them run into a BUG_ON().
> I've been talking a bit to Phillip about this offline.
>
> I'm not sure we should ever be BUG_ON here, I think it might be an
> extreme answer, considering we are saying blocking userspace to let
> things finish is bad, I think hitting this would be much worse.

Yeah, completely agree that this is the most extreme response.

The problem is that the scheduler doesn't have much of a choice in this 
moment any more. What we can do is the following:

1. Try to block for the queued up jobs in the entities to be processed 
and signaling their hardware fence.

     There is no guarantee that this won't deadlock and we are 
potentially blocking a SIGKILL here.
     We already tried it before and that turned out to be quite unstable.

2. Signal all pending fences without actually executing anything.

     Because of the dma_fence design, pipe-lining work and only keeping 
the latest fence for each context in the containers that can potentially 
lead to random memory corruptions.

3. Don't signal anything, just free up all jobs.

     That can trivially result in the core memory management waiting 
forever for things to make progress. Leading to complete system stops.

So the choice you have are all really bad for the system as a whole.

> I'd rather we WARN_ON, and consider just saying screw it and block
> userspace close.

Going to make that a WARN_ON() if you insist, but IIRC Linus or Greg 
once summarized it as "A BUG() is only justified if you prevent even 
worse things from happening". If you ask me that's exactly the case here.

> If we really want to avoid the hang on close possibility (though I'm
> mostly fine with that) then I think Sima's option to just keep a
> reference on the driver, let userspace close and try and clean things
> up on fences in the driver later.
>
> I think this should be at least good for rust lifetimes.

The problem with the rust lifetime is that it got the direction of the 
dependency wrong.

A dma_fence represents an operation on the hardware, e.g. a direct 
access to some memory. This means that the dma_fence can only signal if 
the hardware tells the driver that the operation is completed.

If a dma_fence should signals because some object is not referenced any 
more, for example because userspace closed it, than that clearly points 
to a fundamentally broken design.

> Having an explicit memory leak is bad, having a managed memory leak is
> less bad, because at least all the memory is still pointed to by
> something and managed, at a guess we'd have to keep the whole driver
> and scheduler around, to avoid having to make special free functions.
> Unless there is some concept like TTM ghost objects we could get away
> with, but I think having to signal the fence means we should keep all
> the pieces.

Well I think memory leaks are more or less harmless. What we really 
really need to prevent are random memory corruptions.

Those can run undetected under the radar for a very long time and cause 
massive damage without anybody being able to pinpoint where corruptions 
came from.

Regards,
Christian.

>
> Dave.
>
>> Signed-off-by: Christian König<christian.koenig@amd.com>
>> ---
>>   drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++++-
>>   1 file changed, 18 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index f093616fe53c..8a46fab5cdc8 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>
>>          drm_sched_wqueue_stop(sched);
>>
>> +       /*
>> +        * Tearing down the scheduler wile there are still unprocessed jobs can
>> +        * lead to use after free issues in the scheduler fence.
>> +        */
>> +       WARN_ON(!list_empty(&sched->pending_list));
>> +
>>          for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
>>                  struct drm_sched_rq *rq = sched->sched_rq[i];
>>
>>                  spin_lock(&rq->lock);
>> -               list_for_each_entry(s_entity, &rq->entities, list)
>> +               list_for_each_entry(s_entity, &rq->entities, list) {
>> +                       /*
>> +                        * The justification for this BUG_ON() is that tearing
>> +                        * down the scheduler while jobs are pending leaves
>> +                        * dma_fences unsignaled. Since we have dependencies
>> +                        * from the core memory management to eventually signal
>> +                        * dma_fences this can trivially lead to a system wide
>> +                        * stop because of a locked up memory management.
>> +                        */
>> +                       BUG_ON(spsc_queue_count(&s_entity->job_queue));
>> +
>>                          /*
>>                           * Prevents reinsertion and marks job_queue as idle,
>>                           * it will removed from rq in drm_sched_entity_fini
>>                           * eventually
>>                           */
>>                          s_entity->stopped = true;
>> +               }
>>                  spin_unlock(&rq->lock);
>>                  kfree(sched->sched_rq[i]);
>>          }
>> --
>> 2.34.1
>>
Christian König Nov. 15, 2024, 11:26 a.m. UTC | #16
Interesting, those mails just showed up in my inbox as unread. More than 
14 days after you send it.

So sorry for the late reply.

Am 29.10.24 um 08:22 schrieb Philipp Stanner:
> Christian, Sima?
>
> Matthew? (+CC)
>
> Opinions on the below?
>
> tl;dr:
> I still think it's a good thing to detectably block in
> drm_sched_fini(), or at the very least drm_sched_flush(), because then
> you'll find out that the driver is broken and can repair it.

As discussed in private mail as well, the third option not mentioned so 
far is to completely drop the free_job callback.

A good bunch of the problems we have here are caused by abusing the job 
as state for executing the submission on the hardware.

The original design idea of the scheduler instead was to only have the 
job as intermediate state between submission and picking what to execute 
next. E.g. the scheduler in that view is just a pipeline were you push 
jobs in on one end and jobs come out on the other end.

In that approach the object which represents the hardware execution is 
the dma_fence instead of the job. And the dma_fence has a well defined 
lifetime, error handling, etc...

So when we go back to this original approach it would mean that we don't 
need to wait for any cleanup because the scheduler wouldn't be involved 
in the execution of the jobs on the hardware any more.

The worst thing that could happen is that the driver messes things up 
and still has not executed job in an entity, but in that case we could 
just warn, block for the hardware fence and then kill all pending 
submissions with an error.- And this blocking on the hardware fence 
can't deadlock because we know that it is a hardware fence with certain 
properties and not some random free_job callback where we don't have any 
idea what locks the driver need.

Regards,
Christian.

>
> P.
>
>
> On Fri, 2024-10-18 at 14:07 +0200, Philipp Stanner wrote:
>> On Mon, 2024-10-14 at 16:56 +0200, Danilo Krummrich wrote:
>>> On Fri, Sep 27, 2024 at 11:04:48AM +0200, Christian König wrote:
>>>> Am 25.09.24 um 16:53 schrieb Philipp Stanner:
>>>>> On Tue, 2024-09-24 at 13:18 +0200, Simona Vetter wrote:
>>>>>> On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König
>>>>>> wrote:
>>>>>>> Am 20.09.24 um 15:26 schrieb Philipp Stanner:
>>>>>>>> On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:
>>>>>>>>> Am 20.09.24 um 10:57 schrieb Philipp Stanner:
>>>>>>>>>> On Wed, 2024-09-18 at 15:39 +0200, Christian König
>>>>>>>>>> wrote:
>>>>>>>>>>> Tearing down the scheduler with jobs still on the
>>>>>>>>>>> pending
>>>>>>>>>>> list
>>>>>>>>>>> can
>>>>>>>>>>> lead to use after free issues. Add a warning if
>>>>>>>>>>> drivers try
>>>>>>>>>>> to
>>>>>>>>>>> destroy a scheduler which still has work pushed to
>>>>>>>>>>> the HW.
>>>>>>>>>> Did you have time yet to look into my proposed
>>>>>>>>>> waitque-
>>>>>>>>>> solution?
>>>>>>>>> I don't remember seeing anything. What have I missed?
>>>>>>>> https://lore.kernel.org/all/20240903094446.29797-2-pstanner@redhat.com/
>>>>>>> Mhm, I didn't got that in my inbox for some reason.
>>>>>>>
>>>>>>> Interesting approach, I'm just not sure if we can or should
>>>>>>> wait in
>>>>>>> drm_sched_fini().
>>>>> We do agree that jobs still pending when drm_sched_fini()
>>>>> starts
>>>>> is
>>>>> always a bug, right?
>>>> Correct, the question is how to avoid that.
>>>>
>>>>> If so, what are the disadvantages of waiting in
>>>>> drm_sched_fini()?
>>>>> We
>>>>> could block buggy drivers as I see it. Which wouldn't be good,
>>>>> but
>>>>> could then be fixed on drivers' site.
>>>> Sima explained that pretty well: Don't block in fops->close, do
>>>> that in
>>>> fops->flush instead.
>>> I agree that we shouldn't block in close(), but this effectively
>>> means that we
>>> need to reference count the scheduler, right?
>>>
>>> Otherwise, if we allow to just skip / interrupt the teardown, we
>>> can
>>> still leak
>>> memory.
>> Having thought about it, I agree with Danilo. Having something that
>> shall wait on green light, but can be interrupted, is no guarantee
>> and
>> therefore not a feasible solution.
>>
>> To break down the solution space, these seem to be our options:
>>     1. We have something (either drm_sched_fini() or a helper, e.g.,
>>        drm_sched_flush()) that definitely blocks until the pending
>> list
>>        has become empty.
>>     2. We have jobs reference-count the scheduler, so the latter can
>>        outlive the driver and will be freed some time later.
>>
>> Can anyone think of a third solution?
>>
>>
>> Solution #1 has the problem of obviously blocking unconditionally if
>> the driver didn't make sure that all fences will be signaled within
>> reasonable time. In my opinion, this would actually be an advantage,
>> because it will be *very* noticable and force users to repair their
>> driver. The driver *has* to guarantee that all fences will be
>> signaled.
>> If the driver has to do fishy things, having the blocking outsourced
>> to
>> the helper drm_sched_flush() would allow them to circumvent that.
>>
>> Solution #2 has the problem of backend_ops.free_job() potentially
>> using
>> driver-data after the driver is gone, causing UAF. So with this
>> solutions all drivers would have to be aware of the issue and handle
>> it
>> through one of DRMs primitives dedicated to such problems.
>>
>>
>> Currently, all drivers either work around the problem internally or
>> simply ignore it, it seems.
>>
>> So I'd argue that both solutions are an improvement over the existing
>> situation. My preference would be #1.
>>
>>
>> Opinions?
>>
>> P.
>>
>>>> One issue this solves is that when you send a SIGTERM the tear
>>>> down
>>>> handling
>>>> first flushes all the FDs and then closes them.
>>>>
>>>> So if flushing the FDs blocks because the process initiated
>>>> sending
>>>> a
>>>> terabyte of data over a 300bps line (for example) you can still
>>>> throw a
>>>> SIGKILL and abort that as well.
>>>>
>>>> If you would block in fops-close() that SIGKILL won't have any
>>>> effect any
>>>> more because by the time close() is called the process is gone
>>>> and
>>>> signals
>>>> are already blocked.
>>>>
>>>> And yes when I learned about that issue I was also buffed that
>>>> handling like
>>>> this in the UNIX design is nearly 50 years old and still applies
>>>> to
>>>> today.
>>>>>>> Probably better to make that a separate function, something
>>>>>>> like
>>>>>>> drm_sched_flush() or similar.
>>>>> We could do that. Such a function could then be called by
>>>>> drivers
>>>>> which
>>>>> are not sure whether all jobs are done before they start
>>>>> tearing
>>>>> down.
>>>> Yes exactly that's the idea. And give that flush function a
>>>> return
>>>> code so
>>>> that it can return -EINTR.
>>>>
>>>>>> Yeah I don't think we should smash this into drm_sched_fini
>>>>>> unconditionally. I think conceptually there's about three
>>>>>> cases:
>>>>>>
>>>>>> - Ringbuffer schedules. Probably want everything as-is,
>>>>>> because
>>>>>>     drm_sched_fini is called long after all the entities are
>>>>>> gone in
>>>>>>     drm_device cleanup.
>>>>>>
>>>>>> - fw scheduler hardware with preemption support. There we
>>>>>> probably
>>>>>> want to
>>>>>>     nuke the context by setting the tdr timeout to zero (or
>>>>>> maybe just
>>>>>> as
>>>>>>     long as context preemption takes to be efficient), and
>>>>>> relying on
>>>>>> the
>>>>>>     normal gpu reset flow to handle things.
>>>>>> drm_sched_entity_flush
>>>>>> kinda
>>>>>>     does this, except not really and it's a lot more focused
>>>>>> on
>>>>>> the
>>>>>>     ringbuffer context. So maybe we want a new
>>>>>> drm_sched_entity_kill.
>>>>>>
>>>>>>     For this case calling drm_sched_fini() after the 1:1
>>>>>> entity
>>>>>> is gone
>>>>>>     should not find any linger jobs, it would actually be a
>>>>>> bug
>>>>>> somewhere if
>>>>>>     there's a job lingering. Maybe a sanity check that there's
>>>>>> not just
>>>>>> no
>>>>>>     jobs lingering, but also no entity left would be good
>>>>>> here?
>>>>> The check for lingering ones is in Christian's patch here IISC.
>>>>> At which position would you imagine the check for the entity
>>>>> being
>>>>> performed?
>>>>>
>>>>>> - fw scheduler without preemption support. There we kinda
>>>>>> need
>>>>>> the
>>>>>>     drm_sched_flush, except blocking in fops->close is not
>>>>>> great. So
>>>>>> instead
>>>>>>     I think the following is better:
>>>>>>     1. drm_sched_entity_stopped, which only stops new
>>>>>> submissions (for
>>>>>>     paranoia) but doesn't tear down the entity
>>>>> Who would call that function?
>>>>> Drivers using it voluntarily could just as well stop accepting
>>>>> new jobs
>>>>> from userspace to their entities, couldn't they?
>>>>>
>>>>>>     2. drm_dev_get
>>>>>>     3. launch a worker which does a) drm_sched_flush (or
>>>>>>     drm_sched_entity_flush or whatever we call it) b)
>>>>>> drm_sched_entity_fini
>>>>>>     + drm_sched_fini c) drm_dev_put
>>>>>>
>>>>>>     Note that semantically this implements the refcount in the
>>>>>> other
>>>>>> path
>>>>>>     from Phillip:
>>>>>>
>>>>>> https://lore.kernel.org/all/20240903094531.29893-2-pstanner@redhat.com/
>>>>>>     Except it doesn't impose refcount on everyone else who
>>>>>> doesn't need
>>>>>> it,
>>>>>>     and it doesn't even impose refcounting on drivers that do
>>>>>> need it
>>>>>>     because we use drm_sched_flush and a worker to achieve the
>>>>>> same.
>>>>> I indeed wasn't happy with the refcount approach for that
>>>>> reason,
>>>>> agreed.
>>>>>
>>>>>> Essentially helper functions for the common use-cases instead
>>>>>> of
>>>>>> trying to
>>>>>> solve them all by putting drm_sched_flush as a potentially
>>>>>> very
>>>>>> blocking
>>>>>> function into drm_sched_fini.
>>>>> I'm still not able to see why it blocking would be undesired –
>>>>> as
>>>>> far
>>>>> as I can see, it is only invoked on driver teardown, so not
>>>>> during
>>>>> active operation. Teardown doesn't happen that often, and it
>>>>> can
>>>>> (if
>>>>> implemented correctly) only block until the driver's code has
>>>>> signaled
>>>>> the last fences. If that doesn't happen, the block would reveal
>>>>> a
>>>>> bug.
>>>>>
>>>>> But don't get me wrong: I don't want to *push* this solution. I
>>>>> just
>>>>> want to understand when it could become a problem.
>>>>>
>>>>>
>>>>> Wouldn't an explicitly blocking, separate function like
>>>>> drm_sched_flush() or drm_sched_fini_flush() be a small, doable
>>>>> step
>>>>> towards the right direction?
>>>> I think that this is the right thing to do, yes.
>>>>
>>>>>>>>>>> When there are still entities with jobs the
>>>>>>>>>>> situation
>>>>>>>>>>> is
>>>>>>>>>>> even
>>>>>>>>>>> worse
>>>>>>>>>>> since the dma_fences for those jobs can never
>>>>>>>>>>> signal
>>>>>>>>>>> we can
>>>>>>>>>>> just
>>>>>>>>>>> choose between potentially locking up core memory
>>>>>>>>>>> management and
>>>>>>>>>>> random memory corruption. When drivers really mess
>>>>>>>>>>> it
>>>>>>>>>>> up
>>>>>>>>>>> that
>>>>>>>>>>> well
>>>>>>>>>>> let them run into a BUG_ON().
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Christian
>>>>>>>>>>> König<christian.koenig@amd.com>
>>>>>>>>>>> ---
>>>>>>>>>>>      drivers/gpu/drm/scheduler/sched_main.c | 19
>>>>>>>>>>> ++++++++++++++++++-
>>>>>>>>>>>      1 file changed, 18 insertions(+), 1 deletion(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>> index f093616fe53c..8a46fab5cdc8 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>> @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct
>>>>>>>>>>> drm_gpu_scheduler
>>>>>>>>>>> *sched)
>>>>>>>>>> I agree with Sima that it should first be documented
>>>>>>>>>> in
>>>>>>>>>> the
>>>>>>>>>> function's
>>>>>>>>>> docstring what the user is expected to have done
>>>>>>>>>> before
>>>>>>>>>> calling the
>>>>>>>>>> function.
>>>>>>>>> Good point, going to update the documentation as well.
>>>>>>>> Cool thing, thx.
>>>>>>>> Would be great if everything (not totally trivial)
>>>>>>>> necessary to
>>>>>>>> be done
>>>>>>>> before _fini() is mentioned.
>>>>>>>>
>>>>>>>> One could also think about providing a hint at how the
>>>>>>>> driver can
>>>>>>>> do
>>>>>>>> that. AFAICS the only way for the driver to ensure that
>>>>>>>> is
>>>>>>>> to
>>>>>>>> maintain
>>>>>>>> its own, separate list of submitted jobs.
>>>>>>> Even with a duplicated pending list it's actually currently
>>>>>>> impossible to do
>>>>>>> this fully cleanly.
>>>>>>>
>>>>>>> The problem is that the dma_fence object gives no guarantee
>>>>>>> when
>>>>>>> callbacks
>>>>>>> are processed, e.g. they can be both processed from
>>>>>>> interrupt
>>>>>>> context as
>>>>>>> well as from a CPU which called dma_fence_is_signaled().
>>>>>>>
>>>>>>> So when a driver (or drm_sched_fini) waits for the last
>>>>>>> submitted
>>>>>>> fence it
>>>>>>> actually can be that the drm_sched object still needs to do
>>>>>>> some
>>>>>>> processing.
>>>>>>> See the hack in amdgpu_vm_tlb_seq() for more background on
>>>>>>> the
>>>>>>> problem.
>>>>> Oh dear ^^'
>>>>> We better work towards fixing that centrally
>>>>>
>>>>> Thanks,
>>>>> P.
>>>>>
>>>>>
>>>>>> So I thought this should be fairly easy because of the sched
>>>>>> hw/public
>>>>>> fence split: If we wait for both all jobs to finish and for
>>>>>> all
>>>>>> the
>>>>>> sched
>>>>>> work/tdr work to finish, and we make sure there's no entity
>>>>>> existing
>>>>>> that's not yet stopped we should catch them all?
>>>> Unfortunately not.
>>>>
>>>> Even when you do a dma_fence_wait() on the last submission it can
>>>> still be
>>>> that another CPU is executing the callbacks to wake up the
>>>> scheduler work
>>>> item and cleanup the job.
>>>>
>>>> That's one of the reasons why I think the design of keeping the
>>>> job
>>>> alive is
>>>> so extremely awkward. The dma_fence as representation of the hw
>>>> submission
>>>> has a much better defined state machine and lifetime.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>>    Or at least I think
>>>>>> it's
>>>>>> a bug if any other code even tries to touch the hw fence.
>>>>>>
>>>>>> If you have any other driver code which relies on the rcu
>>>>>> freeing
>>>>>> then I
>>>>>> think that's just a separate concern and we can ignore that
>>>>>> here
>>>>>> since the
>>>>>> fences themselves will till get rcu-delay freed even if
>>>>>> drm_sched_fini has
>>>>>> finished.
>>>>>> -Sima
>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>> P.
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>>> P.
>>>>>>>>>>
>>>>>>>>>>>       drm_sched_wqueue_stop(sched);
>>>>>>>>>>> + /*
>>>>>>>>>>> + * Tearing down the scheduler wile there are still
>>>>>>>>>>> unprocessed jobs can
>>>>>>>>>>> + * lead to use after free issues in the scheduler
>>>>>>>>>>> fence.
>>>>>>>>>>> + */
>>>>>>>>>>> + WARN_ON(!list_empty(&sched->pending_list));
>>>>>>>>>>> +
>>>>>>>>>>>       for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched-
>>>>>>>>>>>> num_rqs;
>>>>>>>>>>> i++)
>>>>>>>>>>> {
>>>>>>>>>>>       struct drm_sched_rq *rq = sched-
>>>>>>>>>>>> sched_rq[i];
>>>>>>>>>>>       spin_lock(&rq->lock);
>>>>>>>>>>> - list_for_each_entry(s_entity, &rq-
>>>>>>>>>>>> entities,
>>>>>>>>>>> list)
>>>>>>>>>>> + list_for_each_entry(s_entity, &rq-
>>>>>>>>>>>> entities,
>>>>>>>>>>> list) {
>>>>>>>>>>> + /*
>>>>>>>>>>> + * The justification for this
>>>>>>>>>>> BUG_ON()
>>>>>>>>>>> is
>>>>>>>>>>> that tearing
>>>>>>>>>>> + * down the scheduler while jobs
>>>>>>>>>>> are
>>>>>>>>>>> pending
>>>>>>>>>>> leaves
>>>>>>>>>>> + * dma_fences unsignaled. Since we
>>>>>>>>>>> have
>>>>>>>>>>> dependencies
>>>>>>>>>>> + * from the core memory management
>>>>>>>>>>> to
>>>>>>>>>>> eventually signal
>>>>>>>>>>> + * dma_fences this can trivially
>>>>>>>>>>> lead to
>>>>>>>>>>> a
>>>>>>>>>>> system wide
>>>>>>>>>>> + * stop because of a locked up
>>>>>>>>>>> memory
>>>>>>>>>>> management.
>>>>>>>>>>> + */
>>>>>>>>>>> + BUG_ON(spsc_queue_count(&s_entity-
>>>>>>>>>>>> job_queue));
>>>>>>>>>>> +
>>>>>>>>>>>       /*
>>>>>>>>>>>       * Prevents reinsertion and marks
>>>>>>>>>>> job_queue
>>>>>>>>>>> as idle,
>>>>>>>>>>>       * it will removed from rq in
>>>>>>>>>>> drm_sched_entity_fini
>>>>>>>>>>>       * eventually
>>>>>>>>>>>       */
>>>>>>>>>>>       s_entity->stopped = true;
>>>>>>>>>>> + }
>>>>>>>>>>>       spin_unlock(&rq->lock);
>>>>>>>>>>>       kfree(sched->sched_rq[i]);
>>>>>>>>>>>       }
Philipp Stanner Nov. 15, 2024, 1:55 p.m. UTC | #17
On Fri, 2024-11-15 at 12:26 +0100, Christian König wrote:
> Interesting, those mails just showed up in my inbox as unread. More
> than 
> 14 days after you send it.
> 
> So sorry for the late reply.

I smell google mail :p

> 
> Am 29.10.24 um 08:22 schrieb Philipp Stanner:
> > Christian, Sima?
> > 
> > Matthew? (+CC)
> > 
> > Opinions on the below?
> > 
> > tl;dr:
> > I still think it's a good thing to detectably block in
> > drm_sched_fini(), or at the very least drm_sched_flush(), because
> > then
> > you'll find out that the driver is broken and can repair it.
> 
> As discussed in private mail as well, the third option not mentioned
> so 
> far is to completely drop the free_job callback.

If it's really possible to do that I think this sounds like a great
solution at first glance because we could decrease complexity quite
substantially.

> 
> A good bunch of the problems we have here are caused by abusing the
> job 
> as state for executing the submission on the hardware.
> 
> The original design idea of the scheduler instead was to only have
> the 
> job as intermediate state between submission and picking what to
> execute 
> next. E.g. the scheduler in that view is just a pipeline were you
> push 
> jobs in on one end and jobs come out on the other end.

So let's get a bit more precise about this:
   1. Driver enqueues with drm_sched_job_arm()
   2. job ends up in pending_list
   3. Sooner or later scheduler calls run_job()
   4. Job is pushed to hardware
   5. Fence gets signaled
   6. ???

What would the "come out on the other end" part you describe look like?

How would jobs get removed from pending_list and, accordingly, how
would we avoid leaks?

> 
> In that approach the object which represents the hardware execution
> is 
> the dma_fence instead of the job. And the dma_fence has a well
> defined 
> lifetime, error handling, etc...
> 
> So when we go back to this original approach it would mean that we
> don't 
> need to wait for any cleanup because the scheduler wouldn't be
> involved 
> in the execution of the jobs on the hardware any more.

It would be involved (to speak precisely) in the sense of the scheduler
still being the one who pushes the job onto the hardware, agreed?

> 
> The worst thing that could happen is that the driver messes things up
> and still has not executed job in an entity,

I can't fully follow.

So in your mind, the driver would personally set the dma_fence callback
and hereby be informed about the job being completed, right?

But you wouldn't want to aim for getting rid of struct drm_sched_job
completely, or would you?


Grüße,
P.


> but in that case we could
> just warn, block for the hardware fence and then kill all pending 
> submissions with an error.- And this blocking on the hardware fence 
> can't deadlock because we know that it is a hardware fence with
> certain 
> properties and not some random free_job callback where we don't have
> any 
> idea what locks the driver need.
> 
> Regards,
> Christian.
> 
> > 
> > P.
> > 
> > 
> > On Fri, 2024-10-18 at 14:07 +0200, Philipp Stanner wrote:
> > > On Mon, 2024-10-14 at 16:56 +0200, Danilo Krummrich wrote:
> > > > On Fri, Sep 27, 2024 at 11:04:48AM +0200, Christian König
> > > > wrote:
> > > > > Am 25.09.24 um 16:53 schrieb Philipp Stanner:
> > > > > > On Tue, 2024-09-24 at 13:18 +0200, Simona Vetter wrote:
> > > > > > > On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König
> > > > > > > wrote:
> > > > > > > > Am 20.09.24 um 15:26 schrieb Philipp Stanner:
> > > > > > > > > On Fri, 2024-09-20 at 12:33 +0200, Christian König
> > > > > > > > > wrote:
> > > > > > > > > > Am 20.09.24 um 10:57 schrieb Philipp Stanner:
> > > > > > > > > > > On Wed, 2024-09-18 at 15:39 +0200, Christian
> > > > > > > > > > > König
> > > > > > > > > > > wrote:
> > > > > > > > > > > > Tearing down the scheduler with jobs still on
> > > > > > > > > > > > the
> > > > > > > > > > > > pending
> > > > > > > > > > > > list
> > > > > > > > > > > > can
> > > > > > > > > > > > lead to use after free issues. Add a warning if
> > > > > > > > > > > > drivers try
> > > > > > > > > > > > to
> > > > > > > > > > > > destroy a scheduler which still has work pushed
> > > > > > > > > > > > to
> > > > > > > > > > > > the HW.
> > > > > > > > > > > Did you have time yet to look into my proposed
> > > > > > > > > > > waitque-
> > > > > > > > > > > solution?
> > > > > > > > > > I don't remember seeing anything. What have I
> > > > > > > > > > missed?
> > > > > > > > > https://lore.kernel.org/all/20240903094446.29797-2-pstanner@redhat.com/
> > > > > > > > Mhm, I didn't got that in my inbox for some reason.
> > > > > > > > 
> > > > > > > > Interesting approach, I'm just not sure if we can or
> > > > > > > > should
> > > > > > > > wait in
> > > > > > > > drm_sched_fini().
> > > > > > We do agree that jobs still pending when drm_sched_fini()
> > > > > > starts
> > > > > > is
> > > > > > always a bug, right?
> > > > > Correct, the question is how to avoid that.
> > > > > 
> > > > > > If so, what are the disadvantages of waiting in
> > > > > > drm_sched_fini()?
> > > > > > We
> > > > > > could block buggy drivers as I see it. Which wouldn't be
> > > > > > good,
> > > > > > but
> > > > > > could then be fixed on drivers' site.
> > > > > Sima explained that pretty well: Don't block in fops->close,
> > > > > do
> > > > > that in
> > > > > fops->flush instead.
> > > > I agree that we shouldn't block in close(), but this
> > > > effectively
> > > > means that we
> > > > need to reference count the scheduler, right?
> > > > 
> > > > Otherwise, if we allow to just skip / interrupt the teardown,
> > > > we
> > > > can
> > > > still leak
> > > > memory.
> > > Having thought about it, I agree with Danilo. Having something
> > > that
> > > shall wait on green light, but can be interrupted, is no
> > > guarantee
> > > and
> > > therefore not a feasible solution.
> > > 
> > > To break down the solution space, these seem to be our options:
> > >     1. We have something (either drm_sched_fini() or a helper,
> > > e.g.,
> > >        drm_sched_flush()) that definitely blocks until the
> > > pending
> > > list
> > >        has become empty.
> > >     2. We have jobs reference-count the scheduler, so the latter
> > > can
> > >        outlive the driver and will be freed some time later.
> > > 
> > > Can anyone think of a third solution?
> > > 
> > > 
> > > Solution #1 has the problem of obviously blocking unconditionally
> > > if
> > > the driver didn't make sure that all fences will be signaled
> > > within
> > > reasonable time. In my opinion, this would actually be an
> > > advantage,
> > > because it will be *very* noticable and force users to repair
> > > their
> > > driver. The driver *has* to guarantee that all fences will be
> > > signaled.
> > > If the driver has to do fishy things, having the blocking
> > > outsourced
> > > to
> > > the helper drm_sched_flush() would allow them to circumvent that.
> > > 
> > > Solution #2 has the problem of backend_ops.free_job() potentially
> > > using
> > > driver-data after the driver is gone, causing UAF. So with this
> > > solutions all drivers would have to be aware of the issue and
> > > handle
> > > it
> > > through one of DRMs primitives dedicated to such problems.
> > > 
> > > 
> > > Currently, all drivers either work around the problem internally
> > > or
> > > simply ignore it, it seems.
> > > 
> > > So I'd argue that both solutions are an improvement over the
> > > existing
> > > situation. My preference would be #1.
> > > 
> > > 
> > > Opinions?
> > > 
> > > P.
> > > 
> > > > > One issue this solves is that when you send a SIGTERM the
> > > > > tear
> > > > > down
> > > > > handling
> > > > > first flushes all the FDs and then closes them.
> > > > > 
> > > > > So if flushing the FDs blocks because the process initiated
> > > > > sending
> > > > > a
> > > > > terabyte of data over a 300bps line (for example) you can
> > > > > still
> > > > > throw a
> > > > > SIGKILL and abort that as well.
> > > > > 
> > > > > If you would block in fops-close() that SIGKILL won't have
> > > > > any
> > > > > effect any
> > > > > more because by the time close() is called the process is
> > > > > gone
> > > > > and
> > > > > signals
> > > > > are already blocked.
> > > > > 
> > > > > And yes when I learned about that issue I was also buffed
> > > > > that
> > > > > handling like
> > > > > this in the UNIX design is nearly 50 years old and still
> > > > > applies
> > > > > to
> > > > > today.
> > > > > > > > Probably better to make that a separate function,
> > > > > > > > something
> > > > > > > > like
> > > > > > > > drm_sched_flush() or similar.
> > > > > > We could do that. Such a function could then be called by
> > > > > > drivers
> > > > > > which
> > > > > > are not sure whether all jobs are done before they start
> > > > > > tearing
> > > > > > down.
> > > > > Yes exactly that's the idea. And give that flush function a
> > > > > return
> > > > > code so
> > > > > that it can return -EINTR.
> > > > > 
> > > > > > > Yeah I don't think we should smash this into
> > > > > > > drm_sched_fini
> > > > > > > unconditionally. I think conceptually there's about three
> > > > > > > cases:
> > > > > > > 
> > > > > > > - Ringbuffer schedules. Probably want everything as-is,
> > > > > > > because
> > > > > > >     drm_sched_fini is called long after all the entities
> > > > > > > are
> > > > > > > gone in
> > > > > > >     drm_device cleanup.
> > > > > > > 
> > > > > > > - fw scheduler hardware with preemption support. There we
> > > > > > > probably
> > > > > > > want to
> > > > > > >     nuke the context by setting the tdr timeout to zero
> > > > > > > (or
> > > > > > > maybe just
> > > > > > > as
> > > > > > >     long as context preemption takes to be efficient),
> > > > > > > and
> > > > > > > relying on
> > > > > > > the
> > > > > > >     normal gpu reset flow to handle things.
> > > > > > > drm_sched_entity_flush
> > > > > > > kinda
> > > > > > >     does this, except not really and it's a lot more
> > > > > > > focused
> > > > > > > on
> > > > > > > the
> > > > > > >     ringbuffer context. So maybe we want a new
> > > > > > > drm_sched_entity_kill.
> > > > > > > 
> > > > > > >     For this case calling drm_sched_fini() after the 1:1
> > > > > > > entity
> > > > > > > is gone
> > > > > > >     should not find any linger jobs, it would actually be
> > > > > > > a
> > > > > > > bug
> > > > > > > somewhere if
> > > > > > >     there's a job lingering. Maybe a sanity check that
> > > > > > > there's
> > > > > > > not just
> > > > > > > no
> > > > > > >     jobs lingering, but also no entity left would be good
> > > > > > > here?
> > > > > > The check for lingering ones is in Christian's patch here
> > > > > > IISC.
> > > > > > At which position would you imagine the check for the
> > > > > > entity
> > > > > > being
> > > > > > performed?
> > > > > > 
> > > > > > > - fw scheduler without preemption support. There we kinda
> > > > > > > need
> > > > > > > the
> > > > > > >     drm_sched_flush, except blocking in fops->close is
> > > > > > > not
> > > > > > > great. So
> > > > > > > instead
> > > > > > >     I think the following is better:
> > > > > > >     1. drm_sched_entity_stopped, which only stops new
> > > > > > > submissions (for
> > > > > > >     paranoia) but doesn't tear down the entity
> > > > > > Who would call that function?
> > > > > > Drivers using it voluntarily could just as well stop
> > > > > > accepting
> > > > > > new jobs
> > > > > > from userspace to their entities, couldn't they?
> > > > > > 
> > > > > > >     2. drm_dev_get
> > > > > > >     3. launch a worker which does a) drm_sched_flush (or
> > > > > > >     drm_sched_entity_flush or whatever we call it) b)
> > > > > > > drm_sched_entity_fini
> > > > > > >     + drm_sched_fini c) drm_dev_put
> > > > > > > 
> > > > > > >     Note that semantically this implements the refcount
> > > > > > > in the
> > > > > > > other
> > > > > > > path
> > > > > > >     from Phillip:
> > > > > > > 
> > > > > > > https://lore.kernel.org/all/20240903094531.29893-2-pstanner@redhat.com/
> > > > > > >     Except it doesn't impose refcount on everyone else
> > > > > > > who
> > > > > > > doesn't need
> > > > > > > it,
> > > > > > >     and it doesn't even impose refcounting on drivers
> > > > > > > that do
> > > > > > > need it
> > > > > > >     because we use drm_sched_flush and a worker to
> > > > > > > achieve the
> > > > > > > same.
> > > > > > I indeed wasn't happy with the refcount approach for that
> > > > > > reason,
> > > > > > agreed.
> > > > > > 
> > > > > > > Essentially helper functions for the common use-cases
> > > > > > > instead
> > > > > > > of
> > > > > > > trying to
> > > > > > > solve them all by putting drm_sched_flush as a
> > > > > > > potentially
> > > > > > > very
> > > > > > > blocking
> > > > > > > function into drm_sched_fini.
> > > > > > I'm still not able to see why it blocking would be
> > > > > > undesired –
> > > > > > as
> > > > > > far
> > > > > > as I can see, it is only invoked on driver teardown, so not
> > > > > > during
> > > > > > active operation. Teardown doesn't happen that often, and
> > > > > > it
> > > > > > can
> > > > > > (if
> > > > > > implemented correctly) only block until the driver's code
> > > > > > has
> > > > > > signaled
> > > > > > the last fences. If that doesn't happen, the block would
> > > > > > reveal
> > > > > > a
> > > > > > bug.
> > > > > > 
> > > > > > But don't get me wrong: I don't want to *push* this
> > > > > > solution. I
> > > > > > just
> > > > > > want to understand when it could become a problem.
> > > > > > 
> > > > > > 
> > > > > > Wouldn't an explicitly blocking, separate function like
> > > > > > drm_sched_flush() or drm_sched_fini_flush() be a small,
> > > > > > doable
> > > > > > step
> > > > > > towards the right direction?
> > > > > I think that this is the right thing to do, yes.
> > > > > 
> > > > > > > > > > > > When there are still entities with jobs the
> > > > > > > > > > > > situation
> > > > > > > > > > > > is
> > > > > > > > > > > > even
> > > > > > > > > > > > worse
> > > > > > > > > > > > since the dma_fences for those jobs can never
> > > > > > > > > > > > signal
> > > > > > > > > > > > we can
> > > > > > > > > > > > just
> > > > > > > > > > > > choose between potentially locking up core
> > > > > > > > > > > > memory
> > > > > > > > > > > > management and
> > > > > > > > > > > > random memory corruption. When drivers really
> > > > > > > > > > > > mess
> > > > > > > > > > > > it
> > > > > > > > > > > > up
> > > > > > > > > > > > that
> > > > > > > > > > > > well
> > > > > > > > > > > > let them run into a BUG_ON().
> > > > > > > > > > > > 
> > > > > > > > > > > > Signed-off-by: Christian
> > > > > > > > > > > > König<christian.koenig@amd.com>
> > > > > > > > > > > > ---
> > > > > > > > > > > >      drivers/gpu/drm/scheduler/sched_main.c |
> > > > > > > > > > > > 19
> > > > > > > > > > > > ++++++++++++++++++-
> > > > > > > > > > > >      1 file changed, 18 insertions(+), 1
> > > > > > > > > > > > deletion(-)
> > > > > > > > > > > > 
> > > > > > > > > > > > diff --git
> > > > > > > > > > > > a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > > > index f093616fe53c..8a46fab5cdc8 100644
> > > > > > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > > > @@ -1333,17 +1333,34 @@ void
> > > > > > > > > > > > drm_sched_fini(struct
> > > > > > > > > > > > drm_gpu_scheduler
> > > > > > > > > > > > *sched)
> > > > > > > > > > > I agree with Sima that it should first be
> > > > > > > > > > > documented
> > > > > > > > > > > in
> > > > > > > > > > > the
> > > > > > > > > > > function's
> > > > > > > > > > > docstring what the user is expected to have done
> > > > > > > > > > > before
> > > > > > > > > > > calling the
> > > > > > > > > > > function.
> > > > > > > > > > Good point, going to update the documentation as
> > > > > > > > > > well.
> > > > > > > > > Cool thing, thx.
> > > > > > > > > Would be great if everything (not totally trivial)
> > > > > > > > > necessary to
> > > > > > > > > be done
> > > > > > > > > before _fini() is mentioned.
> > > > > > > > > 
> > > > > > > > > One could also think about providing a hint at how
> > > > > > > > > the
> > > > > > > > > driver can
> > > > > > > > > do
> > > > > > > > > that. AFAICS the only way for the driver to ensure
> > > > > > > > > that
> > > > > > > > > is
> > > > > > > > > to
> > > > > > > > > maintain
> > > > > > > > > its own, separate list of submitted jobs.
> > > > > > > > Even with a duplicated pending list it's actually
> > > > > > > > currently
> > > > > > > > impossible to do
> > > > > > > > this fully cleanly.
> > > > > > > > 
> > > > > > > > The problem is that the dma_fence object gives no
> > > > > > > > guarantee
> > > > > > > > when
> > > > > > > > callbacks
> > > > > > > > are processed, e.g. they can be both processed from
> > > > > > > > interrupt
> > > > > > > > context as
> > > > > > > > well as from a CPU which called
> > > > > > > > dma_fence_is_signaled().
> > > > > > > > 
> > > > > > > > So when a driver (or drm_sched_fini) waits for the last
> > > > > > > > submitted
> > > > > > > > fence it
> > > > > > > > actually can be that the drm_sched object still needs
> > > > > > > > to do
> > > > > > > > some
> > > > > > > > processing.
> > > > > > > > See the hack in amdgpu_vm_tlb_seq() for more background
> > > > > > > > on
> > > > > > > > the
> > > > > > > > problem.
> > > > > > Oh dear ^^'
> > > > > > We better work towards fixing that centrally
> > > > > > 
> > > > > > Thanks,
> > > > > > P.
> > > > > > 
> > > > > > 
> > > > > > > So I thought this should be fairly easy because of the
> > > > > > > sched
> > > > > > > hw/public
> > > > > > > fence split: If we wait for both all jobs to finish and
> > > > > > > for
> > > > > > > all
> > > > > > > the
> > > > > > > sched
> > > > > > > work/tdr work to finish, and we make sure there's no
> > > > > > > entity
> > > > > > > existing
> > > > > > > that's not yet stopped we should catch them all?
> > > > > Unfortunately not.
> > > > > 
> > > > > Even when you do a dma_fence_wait() on the last submission it
> > > > > can
> > > > > still be
> > > > > that another CPU is executing the callbacks to wake up the
> > > > > scheduler work
> > > > > item and cleanup the job.
> > > > > 
> > > > > That's one of the reasons why I think the design of keeping
> > > > > the
> > > > > job
> > > > > alive is
> > > > > so extremely awkward. The dma_fence as representation of the
> > > > > hw
> > > > > submission
> > > > > has a much better defined state machine and lifetime.
> > > > > 
> > > > > Regards,
> > > > > Christian.
> > > > > 
> > > > > > >    Or at least I think
> > > > > > > it's
> > > > > > > a bug if any other code even tries to touch the hw fence.
> > > > > > > 
> > > > > > > If you have any other driver code which relies on the rcu
> > > > > > > freeing
> > > > > > > then I
> > > > > > > think that's just a separate concern and we can ignore
> > > > > > > that
> > > > > > > here
> > > > > > > since the
> > > > > > > fences themselves will till get rcu-delay freed even if
> > > > > > > drm_sched_fini has
> > > > > > > finished.
> > > > > > > -Sima
> > > > > > > 
> > > > > > > > Regards,
> > > > > > > > Christian.
> > > > > > > > 
> > > > > > > > > P.
> > > > > > > > > 
> > > > > > > > > > Thanks,
> > > > > > > > > > Christian.
> > > > > > > > > > 
> > > > > > > > > > > P.
> > > > > > > > > > > 
> > > > > > > > > > > >       drm_sched_wqueue_stop(sched);
> > > > > > > > > > > > + /*
> > > > > > > > > > > > + * Tearing down the scheduler wile there are
> > > > > > > > > > > > still
> > > > > > > > > > > > unprocessed jobs can
> > > > > > > > > > > > + * lead to use after free issues in the
> > > > > > > > > > > > scheduler
> > > > > > > > > > > > fence.
> > > > > > > > > > > > + */
> > > > > > > > > > > > + WARN_ON(!list_empty(&sched->pending_list));
> > > > > > > > > > > > +
> > > > > > > > > > > >       for (i = DRM_SCHED_PRIORITY_KERNEL; i <
> > > > > > > > > > > > sched-
> > > > > > > > > > > > > num_rqs;
> > > > > > > > > > > > i++)
> > > > > > > > > > > > {
> > > > > > > > > > > >       struct drm_sched_rq *rq = sched-
> > > > > > > > > > > > > sched_rq[i];
> > > > > > > > > > > >       spin_lock(&rq->lock);
> > > > > > > > > > > > - list_for_each_entry(s_entity, &rq-
> > > > > > > > > > > > > entities,
> > > > > > > > > > > > list)
> > > > > > > > > > > > + list_for_each_entry(s_entity, &rq-
> > > > > > > > > > > > > entities,
> > > > > > > > > > > > list) {
> > > > > > > > > > > > + /*
> > > > > > > > > > > > + * The justification for this
> > > > > > > > > > > > BUG_ON()
> > > > > > > > > > > > is
> > > > > > > > > > > > that tearing
> > > > > > > > > > > > + * down the scheduler while jobs
> > > > > > > > > > > > are
> > > > > > > > > > > > pending
> > > > > > > > > > > > leaves
> > > > > > > > > > > > + * dma_fences unsignaled. Since we
> > > > > > > > > > > > have
> > > > > > > > > > > > dependencies
> > > > > > > > > > > > + * from the core memory management
> > > > > > > > > > > > to
> > > > > > > > > > > > eventually signal
> > > > > > > > > > > > + * dma_fences this can trivially
> > > > > > > > > > > > lead to
> > > > > > > > > > > > a
> > > > > > > > > > > > system wide
> > > > > > > > > > > > + * stop because of a locked up
> > > > > > > > > > > > memory
> > > > > > > > > > > > management.
> > > > > > > > > > > > + */
> > > > > > > > > > > > + BUG_ON(spsc_queue_count(&s_entity-
> > > > > > > > > > > > > job_queue));
> > > > > > > > > > > > +
> > > > > > > > > > > >       /*
> > > > > > > > > > > >       * Prevents reinsertion and marks
> > > > > > > > > > > > job_queue
> > > > > > > > > > > > as idle,
> > > > > > > > > > > >       * it will removed from rq in
> > > > > > > > > > > > drm_sched_entity_fini
> > > > > > > > > > > >       * eventually
> > > > > > > > > > > >       */
> > > > > > > > > > > >       s_entity->stopped = true;
> > > > > > > > > > > > + }
> > > > > > > > > > > >       spin_unlock(&rq->lock);
> > > > > > > > > > > >       kfree(sched->sched_rq[i]);
> > > > > > > > > > > >       }
>
Christian König Nov. 15, 2024, 2:39 p.m. UTC | #18
Am 15.11.24 um 14:55 schrieb Philipp Stanner:
> [SNIP]
>> A good bunch of the problems we have here are caused by abusing the
>> job
>> as state for executing the submission on the hardware.
>>
>> The original design idea of the scheduler instead was to only have
>> the
>> job as intermediate state between submission and picking what to
>> execute
>> next. E.g. the scheduler in that view is just a pipeline were you
>> push
>> jobs in on one end and jobs come out on the other end.
> So let's get a bit more precise about this:
>     1. Driver enqueues with drm_sched_job_arm()
>     2. job ends up in pending_list
>     3. Sooner or later scheduler calls run_job()
>     4. Job is pushed to hardware
>     5. Fence gets signaled
>     6. ???
>
> What would the "come out on the other end" part you describe look like?
>
> How would jobs get removed from pending_list and, accordingly, how
> would we avoid leaks?

Let me describe the full solution a bit further down.

>> In that approach the object which represents the hardware execution
>> is
>> the dma_fence instead of the job. And the dma_fence has a well
>> defined
>> lifetime, error handling, etc...
>>
>> So when we go back to this original approach it would mean that we
>> don't
>> need to wait for any cleanup because the scheduler wouldn't be
>> involved
>> in the execution of the jobs on the hardware any more.
> It would be involved (to speak precisely) in the sense of the scheduler
> still being the one who pushes the job onto the hardware, agreed?

Yes, exactly.

See in the original design the "job" wasn't even a defined structure, 
but rather just a void*.

So basically what the driver did was telling the scheduler here I have a 
bunch of different void* please tell me which one to run first.

And apart from being this identifier this void* had absolutely no 
meaning for the scheduler.

>> The worst thing that could happen is that the driver messes things up
>> and still has not executed job in an entity,
> I can't fully follow.
>
> So in your mind, the driver would personally set the dma_fence callback
> and hereby be informed about the job being completed, right?

No. The run_job callback would still return the hw fence so that the 
scheduler can install the callback and so gets informed when the next 
job could be run.

> But you wouldn't want to aim for getting rid of struct drm_sched_job
> completely, or would you?

No, the drm_sched_job structure was added to aid the single producer 
single consumer queue and so made it easier to put the jobs into a 
container.


In my mind the full solution for running jobs looks like this:

1. Driver enqueues with drm_sched_job_arm()
2. job ends up in pending_list
3. Sooner or later scheduler calls run_job()
4. In return scheduler gets a dma_fence representing the resulting 
hardware operation.
5, This fence is put into a container to keep around what the hw is 
actually executing.
6. At some point the fence gets signaled informing the scheduler that 
the next job can be pushed if enough credits are now available.

There is no free_job callback any more because after run_job is called 
the scheduler is done with the job and only the dma_fence which 
represents the actually HW operation is the object the scheduler now 
works with.

We would still need something like a kill_job callback which is used 
when an entity is released without flushing all jobs (see 
drm_sched_entity_kill()), but that is then only used for a specific 
corner case.

Blocking for the cleanup in drm_sched_fini() then becomes a trivial 
dma_fence_wait() on the remaining unsignaled HW fences and eventually on 
the latest killed fence.

The problem with this approach is that the idea of re-submitting jobs in 
a GPU reset by the scheduler is then basically dead. But to be honest 
that never worked correctly.

See the deprecated warning I already put on drm_sched_resubmit_jobs(). 
The job lifetime is not really well defined because of this, see the 
free_guilty hack as well.

Regards,
Christian.

>
>
> Grüße,
> P.
>
Philipp Stanner Nov. 15, 2024, 4:07 p.m. UTC | #19
On Fri, 2024-11-15 at 15:39 +0100, Christian König wrote:
>  Am 15.11.24 um 14:55 schrieb Philipp Stanner:
>  
> > [SNIP] 
> > >  
> > > A good bunch of the problems we have here are caused by abusing
> > > the
> > > job 
> > > as state for executing the submission on the hardware.
> > > 
> > > The original design idea of the scheduler instead was to only
> > > have
> > > the 
> > > job as intermediate state between submission and picking what to
> > > execute 
> > > next. E.g. the scheduler in that view is just a pipeline were you
> > > push 
> > > jobs in on one end and jobs come out on the other end.
> > >  
> >  
> > So let's get a bit more precise about this:
> >    1. Driver enqueues with drm_sched_job_arm()
> >    2. job ends up in pending_list
> >    3. Sooner or later scheduler calls run_job()
> >    4. Job is pushed to hardware
> >    5. Fence gets signaled
> >    6. ???
> > 
> > What would the "come out on the other end" part you describe look
> > like?
> > 
> > How would jobs get removed from pending_list and, accordingly, how
> > would we avoid leaks?
> >  
>  
>  Let me describe the full solution a bit further down.
>  
>  
>  
> >  
> > >  
> > > In that approach the object which represents the hardware
> > > execution
> > > is 
> > > the dma_fence instead of the job. And the dma_fence has a well
> > > defined 
> > > lifetime, error handling, etc...
> > > 
> > > So when we go back to this original approach it would mean that
> > > we
> > > don't 
> > > need to wait for any cleanup because the scheduler wouldn't be
> > > involved 
> > > in the execution of the jobs on the hardware any more.
> > >  
> >  
> > It would be involved (to speak precisely) in the sense of the
> > scheduler
> > still being the one who pushes the job onto the hardware, agreed?
> >  
>  
>  Yes, exactly.
>  
>  See in the original design the "job" wasn't even a defined
> structure, but rather just a void*.
>  
>  So basically what the driver did was telling the scheduler here I
> have a bunch of different void* please tell me which one to run
> first.
>  
>  And apart from being this identifier this void* had absolutely no
> meaning for the scheduler.

Interesting..

>  
> > >  
> > > The worst thing that could happen is that the driver messes
> > > things up
> > > and still has not executed job in an entity,
> > >  
> >  
> > I can't fully follow.
> > 
> > So in your mind, the driver would personally set the dma_fence
> > callback
> > and hereby be informed about the job being completed, right?
> >  
>  
>  No. The run_job callback would still return the hw fence so that the
> scheduler can install the callback and so gets informed when the next
> job could be run.
>  
>  
> >  
> > But you wouldn't want to aim for getting rid of struct
> > drm_sched_job
> > completely, or would you?
> >  
>  
>  No, the drm_sched_job structure was added to aid the single producer
> single consumer queue and so made it easier to put the jobs into a
> container.
>  
>  
>  In my mind the full solution for running jobs looks like this:
>  
>  1. Driver enqueues with drm_sched_job_arm()
>  2. job ends up in pending_list
>  3. Sooner or later scheduler calls run_job()
>  4. In return scheduler gets a dma_fence representing the resulting
> hardware operation.
>  5, This fence is put into a container to keep around what the hw is
> actually executing.
>  6. At some point the fence gets signaled informing the scheduler
> that the next job can be pushed if enough credits are now available.
>  
>  There is no free_job callback any more because after run_job is
> called the scheduler is done with the job and only the dma_fence
> which represents the actually HW operation is the object the
> scheduler now works with.
>  
>  We would still need something like a kill_job callback which is used
> when an entity is released without flushing all jobs (see
> drm_sched_entity_kill()), but that is then only used for a specific
> corner case.

Can't we just limit the scheduler's responsibility to telling the
driver that it has to flush, and if it doesn't it's a bug?
 
>  Blocking for the cleanup in drm_sched_fini() then becomes a trivial
> dma_fence_wait() on the remaining unsignaled HW fences and eventually
> on the latest killed fence.

But that results in exactly the same situation as my waitque solution,
doesn't it?

Blocking infinitely in drm_sched_fini():

If the driver doesn't guarantee that all fences will get signaled, then
wait_event() in the waitque solution will block forever. The same with
dma_fence_wait().

Or are you aiming at an interruptible dma_fence_wait(), thereby not
blocking SIGKILL?

That then would still result in leaks. So basically the same situation
as with an interruptible drm_sched_flush() function.

(Although I agree that would probably be more elegant)

>  
>  The problem with this approach is that the idea of re-submitting
> jobs in a GPU reset by the scheduler is then basically dead. But to
> be honest that never worked correctly.
>  
>  See the deprecated warning I already put on
> drm_sched_resubmit_jobs(). The job lifetime is not really well
> defined because of this, see the free_guilty hack as well.

drm_sched_resubmit_jobs() is being used by 5 drivers currently. So if
we go for this approach we have to port them, first. That doesn't sound
trivial to me


P.

>  
>  Regards,
>  Christian.
>  
>  
> >  
> > 
> > 
> > Grüße,
> > P.
> > 
> >  
>
Christian König Nov. 18, 2024, 11:48 a.m. UTC | #20
Am 15.11.24 um 17:07 schrieb Philipp Stanner:
> On Fri, 2024-11-15 at 15:39 +0100, Christian König wrote:
>>   Am 15.11.24 um 14:55 schrieb Philipp Stanner:
>>   
>>> [SNIP]
>>> But you wouldn't want to aim for getting rid of struct
>>> drm_sched_job
>>> completely, or would you?
>>>   
>>   
>>   No, the drm_sched_job structure was added to aid the single producer
>> single consumer queue and so made it easier to put the jobs into a
>> container.
>>   
>>   
>>   In my mind the full solution for running jobs looks like this:
>>   
>>   1. Driver enqueues with drm_sched_job_arm()
>>   2. job ends up in pending_list
>>   3. Sooner or later scheduler calls run_job()
>>   4. In return scheduler gets a dma_fence representing the resulting
>> hardware operation.
>>   5, This fence is put into a container to keep around what the hw is
>> actually executing.
>>   6. At some point the fence gets signaled informing the scheduler
>> that the next job can be pushed if enough credits are now available.
>>   
>>   There is no free_job callback any more because after run_job is
>> called the scheduler is done with the job and only the dma_fence
>> which represents the actually HW operation is the object the
>> scheduler now works with.
>>   
>>   We would still need something like a kill_job callback which is used
>> when an entity is released without flushing all jobs (see
>> drm_sched_entity_kill()), but that is then only used for a specific
>> corner case.
> Can't we just limit the scheduler's responsibility to telling the
> driver that it has to flush, and if it doesn't it's a bug?

We still need to remove the pending jobs from the scheduler if flushing 
times out.

Without timing out flushing and/or aborting when the process was killed 
we run into the same problem again that we don't want ti block on _fini().
>>   Blocking for the cleanup in drm_sched_fini() then becomes a trivial
>> dma_fence_wait() on the remaining unsignaled HW fences and eventually
>> on the latest killed fence.
> But that results in exactly the same situation as my waitque solution,
> doesn't it?

The key part is that dma_fence's have a documented requirement to never 
block infinitely.

> Blocking infinitely in drm_sched_fini():
>
> If the driver doesn't guarantee that all fences will get signaled, then
> wait_event() in the waitque solution will block forever. The same with
> dma_fence_wait().
>
> Or are you aiming at an interruptible dma_fence_wait(), thereby not
> blocking SIGKILL?

That is basically what drm_sched_entity_flush() is already doing.

> That then would still result in leaks. So basically the same situation
> as with an interruptible drm_sched_flush() function.
>
> (Although I agree that would probably be more elegant)

If the wait_event really would just waiting for dma_fences then yes.

The problem with the waitqueue approach is that we need to wait for the 
free_job callback and that callback is normally called from a work item 
without holding any additional locks.

When drm_sched_fini starts to block for that we create a rat-tail of new 
dependencies since that one is most likely called from a file descriptor 
destruction.

>
>>   
>>   The problem with this approach is that the idea of re-submitting
>> jobs in a GPU reset by the scheduler is then basically dead. But to
>> be honest that never worked correctly.
>>   
>>   See the deprecated warning I already put on
>> drm_sched_resubmit_jobs(). The job lifetime is not really well
>> defined because of this, see the free_guilty hack as well.
> drm_sched_resubmit_jobs() is being used by 5 drivers currently. So if
> we go for this approach we have to port them, first. That doesn't sound
> trivial to me

Yeah, completely agree. I think the scheduler should provide something 
like drm_sched_for_each_pending_fence() which helps to iterate over all 
the pending HW fences.

The individual drivers could then device by themself what to do, e.g. 
upcast the HW fence into the job and push it again or similar.

But what we really need to get away from are those fence pre-requisite 
violations Sima pointed out. For example we can't allocate memory for 
run_job.

Regards,
Christian.

>
>
> P.
>
>>   
>>   Regards,
>>   Christian.
>>   
>>   
>>>   
>>>
>>>
>>> Grüße,
>>> P.
>>>
>>>   
>>
Philipp Stanner Nov. 21, 2024, 9:06 a.m. UTC | #21
+CC Thomas Hellström


On Mon, 2024-11-18 at 12:48 +0100, Christian König wrote:
>  Am 15.11.24 um 17:07 schrieb Philipp Stanner:
>  
> >  
> > On Fri, 2024-11-15 at 15:39 +0100, Christian König wrote:
> >  
> > >  
> > >  Am 15.11.24 um 14:55 schrieb Philipp Stanner:
> > >  
> > >  
> > > >  
> > > > [SNIP] 
> > > >  
> > >  
> >  
> > >  
> > > >  
> > > > But you wouldn't want to aim for getting rid of struct
> > > > drm_sched_job
> > > > completely, or would you?
> > > >  
> > > >  
> > >  
> > >  
> > >  No, the drm_sched_job structure was added to aid the single
> > > producer
> > > single consumer queue and so made it easier to put the jobs into
> > > a
> > > container.
> > >  
> > >  
> > >  In my mind the full solution for running jobs looks like this:
> > >  
> > >  1. Driver enqueues with drm_sched_job_arm()
> > >  2. job ends up in pending_list
> > >  3. Sooner or later scheduler calls run_job()
> > >  4. In return scheduler gets a dma_fence representing the
> > > resulting
> > > hardware operation.
> > >  5, This fence is put into a container to keep around what the hw
> > > is
> > > actually executing.
> > >  6. At some point the fence gets signaled informing the scheduler
> > > that the next job can be pushed if enough credits are now
> > > available.
> > >  
> > >  There is no free_job callback any more because after run_job is
> > > called the scheduler is done with the job and only the dma_fence
> > > which represents the actually HW operation is the object the
> > > scheduler now works with.
> > >  
> > >  We would still need something like a kill_job callback which is
> > > used
> > > when an entity is released without flushing all jobs (see
> > > drm_sched_entity_kill()), but that is then only used for a
> > > specific
> > > corner case.
> > >  
> >  
> > Can't we just limit the scheduler's responsibility to telling the
> > driver that it has to flush, and if it doesn't it's a bug?
> >  
>  
>  We still need to remove the pending jobs from the scheduler if
> flushing times out.
>  
>  Without timing out flushing and/or aborting when the process was
> killed we run into the same problem again that we don't want ti block
> on _fini().
>    
> >  
> > >  
> > >  Blocking for the cleanup in drm_sched_fini() then becomes a
> > > trivial
> > > dma_fence_wait() on the remaining unsignaled HW fences and
> > > eventually
> > > on the latest killed fence.
> > >  
> >  
> > But that results in exactly the same situation as my waitque
> > solution,
> > doesn't it?
> >  
>  
>  The key part is that dma_fence's have a documented requirement to
> never block infinitely.
>  
>  
> >  
> > Blocking infinitely in drm_sched_fini():
> > 
> > If the driver doesn't guarantee that all fences will get signaled,
> > then
> > wait_event() in the waitque solution will block forever. The same
> > with
> > dma_fence_wait().
> > 
> > Or are you aiming at an interruptible dma_fence_wait(), thereby not
> > blocking SIGKILL?
> >  
>  
>  That is basically what drm_sched_entity_flush() is already doing.
>  
>  
> >  
> > That then would still result in leaks. So basically the same
> > situation
> > as with an interruptible drm_sched_flush() function.
> > 
> > (Although I agree that would probably be more elegant)
> >  
>  
>  If the wait_event really would just waiting for dma_fences then yes.
>  
>  The problem with the waitqueue approach is that we need to wait for
> the free_job callback and that callback is normally called from a
> work item without holding any additional locks.
>  
>  When drm_sched_fini starts to block for that we create a rat-tail of
> new dependencies since that one is most likely called from a file
> descriptor destruction.
>   
>  
> >  
> > 
> >  
> > >  
> > >  
> > >  The problem with this approach is that the idea of re-submitting
> > > jobs in a GPU reset by the scheduler is then basically dead. But
> > > to
> > > be honest that never worked correctly.
> > >  
> > >  See the deprecated warning I already put on
> > > drm_sched_resubmit_jobs(). The job lifetime is not really well
> > > defined because of this, see the free_guilty hack as well.
> > >  
> >  
> > drm_sched_resubmit_jobs() is being used by 5 drivers currently. So
> > if
> > we go for this approach we have to port them, first. That doesn't
> > sound
> > trivial to me
> >  
>  
>  Yeah, completely agree. I think the scheduler should provide
> something like drm_sched_for_each_pending_fence() which helps to
> iterate over all the pending HW fences.
>  
>  The individual drivers could then device by themself what to do,
> e.g. upcast the HW fence into the job and push it again or similar.

I have talked with Dave and we would be interested in exploring the
direction of getting rid of backend_ops.free_job() and doing the
modifications this implies.

Matthew, Thomas, any hard NACKs on principle, or can we look into this
in a future patch series?


P.





>  
>  But what we really need to get away from are those fence pre-
> requisite violations Sima pointed out. For example we can't allocate
> memory for run_job.
>  
>  Regards,
>  Christian.
>  
>  
> >  
> > 
> > 
> > P.
> > 
> >  
> > >  
> > >  
> > >  Regards,
> > >  Christian.
> > >  
> > >  
> > >  
> > > >  
> > > >  
> > > > 
> > > > 
> > > > Grüße,
> > > > P.
> > > > 
> > > >  
> > > >  
> > >  
> > >  
> > >  
> >   
>  
>
Matthew Brost Nov. 21, 2024, 6:05 p.m. UTC | #22
On Tue, Oct 29, 2024 at 08:22:22AM +0100, Philipp Stanner wrote:
> Christian, Sima?
> 
> Matthew? (+CC)
> 

Sorry catching up here. Internally in Xe we ref count the scheduler as
the scheduler is embedded into our user queue structure (xe_exec_queue).
Jobs ref the queue once they are armed. Upon queue killing we set TDR to
zero which will flush out all jobs / signal all the job's fences. Once
the ref count of queue is zero we do hardware / firmware teardown of the
queue and then finally call drm_sched_fini before freeing the queues
memory (and thus the scheduler).

This flow seems to work pretty well but largely bypasses the scheduler
functions to implement this. Not sure if this flow could be normalized
at all but I would expect usage models of a 1 to 1 relationship between
queue and scheduler to something similar to Xe's flow. Maybe we could
write this done as design guideline so other drivers don't have to
figure this out - this took me a bit to land on this.

With that, in general I agree with Christian's patch here complaining
about pending jobs if drm_sched_fini is called.

> Opinions on the below?
> 
> tl;dr:
> I still think it's a good thing to detectably block in
> drm_sched_fini(), or at the very least drm_sched_flush(), because then

See above. I don't think drm_sched_fini should block rather just
complain this patch does which says 'go fix your driver'.

Matt

> you'll find out that the driver is broken and can repair it.
> 
> P.
> 
> 
> On Fri, 2024-10-18 at 14:07 +0200, Philipp Stanner wrote:
> > On Mon, 2024-10-14 at 16:56 +0200, Danilo Krummrich wrote:
> > > On Fri, Sep 27, 2024 at 11:04:48AM +0200, Christian König wrote:
> > > > Am 25.09.24 um 16:53 schrieb Philipp Stanner:
> > > > > On Tue, 2024-09-24 at 13:18 +0200, Simona Vetter wrote:
> > > > > > On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König
> > > > > > wrote:
> > > > > > > Am 20.09.24 um 15:26 schrieb Philipp Stanner:
> > > > > > > > On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:
> > > > > > > > > Am 20.09.24 um 10:57 schrieb Philipp Stanner:
> > > > > > > > > > On Wed, 2024-09-18 at 15:39 +0200, Christian König
> > > > > > > > > > wrote:
> > > > > > > > > > > Tearing down the scheduler with jobs still on the
> > > > > > > > > > > pending
> > > > > > > > > > > list
> > > > > > > > > > > can
> > > > > > > > > > > lead to use after free issues. Add a warning if
> > > > > > > > > > > drivers try
> > > > > > > > > > > to
> > > > > > > > > > > destroy a scheduler which still has work pushed to
> > > > > > > > > > > the HW.
> > > > > > > > > > Did you have time yet to look into my proposed
> > > > > > > > > > waitque-
> > > > > > > > > > solution?
> > > > > > > > > I don't remember seeing anything. What have I missed?
> > > > > > > > https://lore.kernel.org/all/20240903094446.29797-2-pstanner@redhat.com/
> > > > > > > Mhm, I didn't got that in my inbox for some reason.
> > > > > > > 
> > > > > > > Interesting approach, I'm just not sure if we can or should
> > > > > > > wait in
> > > > > > > drm_sched_fini().
> > > > > We do agree that jobs still pending when drm_sched_fini()
> > > > > starts
> > > > > is
> > > > > always a bug, right?
> > > > 
> > > > Correct, the question is how to avoid that.
> > > > 
> > > > > If so, what are the disadvantages of waiting in
> > > > > drm_sched_fini()?
> > > > > We
> > > > > could block buggy drivers as I see it. Which wouldn't be good,
> > > > > but
> > > > > could then be fixed on drivers' site.
> > > > 
> > > > Sima explained that pretty well: Don't block in fops->close, do
> > > > that in
> > > > fops->flush instead.
> > > 
> > > I agree that we shouldn't block in close(), but this effectively
> > > means that we
> > > need to reference count the scheduler, right?
> > > 
> > > Otherwise, if we allow to just skip / interrupt the teardown, we
> > > can
> > > still leak
> > > memory.
> > 
> > Having thought about it, I agree with Danilo. Having something that
> > shall wait on green light, but can be interrupted, is no guarantee
> > and
> > therefore not a feasible solution.
> > 
> > To break down the solution space, these seem to be our options:
> >    1. We have something (either drm_sched_fini() or a helper, e.g.,
> >       drm_sched_flush()) that definitely blocks until the pending
> > list
> >       has become empty.
> >    2. We have jobs reference-count the scheduler, so the latter can
> >       outlive the driver and will be freed some time later.
> > 
> > Can anyone think of a third solution?
> > 
> > 
> > Solution #1 has the problem of obviously blocking unconditionally if
> > the driver didn't make sure that all fences will be signaled within
> > reasonable time. In my opinion, this would actually be an advantage,
> > because it will be *very* noticable and force users to repair their
> > driver. The driver *has* to guarantee that all fences will be
> > signaled.
> > If the driver has to do fishy things, having the blocking outsourced
> > to
> > the helper drm_sched_flush() would allow them to circumvent that.
> > 
> > Solution #2 has the problem of backend_ops.free_job() potentially
> > using
> > driver-data after the driver is gone, causing UAF. So with this
> > solutions all drivers would have to be aware of the issue and handle
> > it
> > through one of DRMs primitives dedicated to such problems.
> > 
> > 
> > Currently, all drivers either work around the problem internally or
> > simply ignore it, it seems.
> > 
> > So I'd argue that both solutions are an improvement over the existing
> > situation. My preference would be #1.
> > 
> > 
> > Opinions?
> > 
> > P.
> > 
> > > 
> > > > 
> > > > One issue this solves is that when you send a SIGTERM the tear
> > > > down
> > > > handling
> > > > first flushes all the FDs and then closes them.
> > > > 
> > > > So if flushing the FDs blocks because the process initiated
> > > > sending
> > > > a
> > > > terabyte of data over a 300bps line (for example) you can still
> > > > throw a
> > > > SIGKILL and abort that as well.
> > > > 
> > > > If you would block in fops-close() that SIGKILL won't have any
> > > > effect any
> > > > more because by the time close() is called the process is gone
> > > > and
> > > > signals
> > > > are already blocked.
> > > > 
> > > > And yes when I learned about that issue I was also buffed that
> > > > handling like
> > > > this in the UNIX design is nearly 50 years old and still applies
> > > > to
> > > > today.
> > > > > > > Probably better to make that a separate function, something
> > > > > > > like
> > > > > > > drm_sched_flush() or similar.
> > > > > We could do that. Such a function could then be called by
> > > > > drivers
> > > > > which
> > > > > are not sure whether all jobs are done before they start
> > > > > tearing
> > > > > down.
> > > > 
> > > > Yes exactly that's the idea. And give that flush function a
> > > > return
> > > > code so
> > > > that it can return -EINTR.
> > > > 
> > > > > > Yeah I don't think we should smash this into drm_sched_fini
> > > > > > unconditionally. I think conceptually there's about three
> > > > > > cases:
> > > > > > 
> > > > > > - Ringbuffer schedules. Probably want everything as-is,
> > > > > > because
> > > > > >    drm_sched_fini is called long after all the entities are
> > > > > > gone in
> > > > > >    drm_device cleanup.
> > > > > > 
> > > > > > - fw scheduler hardware with preemption support. There we
> > > > > > probably
> > > > > > want to
> > > > > >    nuke the context by setting the tdr timeout to zero (or
> > > > > > maybe just
> > > > > > as
> > > > > >    long as context preemption takes to be efficient), and
> > > > > > relying on
> > > > > > the
> > > > > >    normal gpu reset flow to handle things.
> > > > > > drm_sched_entity_flush
> > > > > > kinda
> > > > > >    does this, except not really and it's a lot more focused
> > > > > > on
> > > > > > the
> > > > > >    ringbuffer context. So maybe we want a new
> > > > > > drm_sched_entity_kill.
> > > > > > 
> > > > > >    For this case calling drm_sched_fini() after the 1:1
> > > > > > entity
> > > > > > is gone
> > > > > >    should not find any linger jobs, it would actually be a
> > > > > > bug
> > > > > > somewhere if
> > > > > >    there's a job lingering. Maybe a sanity check that there's
> > > > > > not just
> > > > > > no
> > > > > >    jobs lingering, but also no entity left would be good
> > > > > > here?
> > > > > The check for lingering ones is in Christian's patch here IISC.
> > > > > At which position would you imagine the check for the entity
> > > > > being
> > > > > performed?
> > > > > 
> > > > > > - fw scheduler without preemption support. There we kinda
> > > > > > need
> > > > > > the
> > > > > >    drm_sched_flush, except blocking in fops->close is not
> > > > > > great. So
> > > > > > instead
> > > > > >    I think the following is better:
> > > > > >    1. drm_sched_entity_stopped, which only stops new
> > > > > > submissions (for
> > > > > >    paranoia) but doesn't tear down the entity
> > > > > Who would call that function?
> > > > > Drivers using it voluntarily could just as well stop accepting
> > > > > new jobs
> > > > > from userspace to their entities, couldn't they?
> > > > > 
> > > > > >    2. drm_dev_get
> > > > > >    3. launch a worker which does a) drm_sched_flush (or
> > > > > >    drm_sched_entity_flush or whatever we call it) b)
> > > > > > drm_sched_entity_fini
> > > > > >    + drm_sched_fini c) drm_dev_put
> > > > > > 
> > > > > >    Note that semantically this implements the refcount in the
> > > > > > other
> > > > > > path
> > > > > >    from Phillip:
> > > > > > 
> > > > > > https://lore.kernel.org/all/20240903094531.29893-2-pstanner@redhat.com/
> > > > > >    Except it doesn't impose refcount on everyone else who
> > > > > > doesn't need
> > > > > > it,
> > > > > >    and it doesn't even impose refcounting on drivers that do
> > > > > > need it
> > > > > >    because we use drm_sched_flush and a worker to achieve the
> > > > > > same.
> > > > > I indeed wasn't happy with the refcount approach for that
> > > > > reason,
> > > > > agreed.
> > > > > 
> > > > > > Essentially helper functions for the common use-cases instead
> > > > > > of
> > > > > > trying to
> > > > > > solve them all by putting drm_sched_flush as a potentially
> > > > > > very
> > > > > > blocking
> > > > > > function into drm_sched_fini.
> > > > > I'm still not able to see why it blocking would be undesired –
> > > > > as
> > > > > far
> > > > > as I can see, it is only invoked on driver teardown, so not
> > > > > during
> > > > > active operation. Teardown doesn't happen that often, and it
> > > > > can
> > > > > (if
> > > > > implemented correctly) only block until the driver's code has
> > > > > signaled
> > > > > the last fences. If that doesn't happen, the block would reveal
> > > > > a
> > > > > bug.
> > > > > 
> > > > > But don't get me wrong: I don't want to *push* this solution. I
> > > > > just
> > > > > want to understand when it could become a problem.
> > > > > 
> > > > > 
> > > > > Wouldn't an explicitly blocking, separate function like
> > > > > drm_sched_flush() or drm_sched_fini_flush() be a small, doable
> > > > > step
> > > > > towards the right direction?
> > > > 
> > > > I think that this is the right thing to do, yes.
> > > > 
> > > > > > > > > > > When there are still entities with jobs the
> > > > > > > > > > > situation
> > > > > > > > > > > is
> > > > > > > > > > > even
> > > > > > > > > > > worse
> > > > > > > > > > > since the dma_fences for those jobs can never
> > > > > > > > > > > signal
> > > > > > > > > > > we can
> > > > > > > > > > > just
> > > > > > > > > > > choose between potentially locking up core memory
> > > > > > > > > > > management and
> > > > > > > > > > > random memory corruption. When drivers really mess
> > > > > > > > > > > it
> > > > > > > > > > > up
> > > > > > > > > > > that
> > > > > > > > > > > well
> > > > > > > > > > > let them run into a BUG_ON().
> > > > > > > > > > > 
> > > > > > > > > > > Signed-off-by: Christian
> > > > > > > > > > > König<christian.koenig@amd.com>
> > > > > > > > > > > ---
> > > > > > > > > > >     drivers/gpu/drm/scheduler/sched_main.c | 19
> > > > > > > > > > > ++++++++++++++++++-
> > > > > > > > > > >     1 file changed, 18 insertions(+), 1 deletion(-)
> > > > > > > > > > > 
> > > > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > > b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > > index f093616fe53c..8a46fab5cdc8 100644
> > > > > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > > > > > > > > > > @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct
> > > > > > > > > > > drm_gpu_scheduler
> > > > > > > > > > > *sched)
> > > > > > > > > > I agree with Sima that it should first be documented
> > > > > > > > > > in
> > > > > > > > > > the
> > > > > > > > > > function's
> > > > > > > > > > docstring what the user is expected to have done
> > > > > > > > > > before
> > > > > > > > > > calling the
> > > > > > > > > > function.
> > > > > > > > > Good point, going to update the documentation as well.
> > > > > > > > Cool thing, thx.
> > > > > > > > Would be great if everything (not totally trivial)
> > > > > > > > necessary to
> > > > > > > > be done
> > > > > > > > before _fini() is mentioned.
> > > > > > > > 
> > > > > > > > One could also think about providing a hint at how the
> > > > > > > > driver can
> > > > > > > > do
> > > > > > > > that. AFAICS the only way for the driver to ensure that
> > > > > > > > is
> > > > > > > > to
> > > > > > > > maintain
> > > > > > > > its own, separate list of submitted jobs.
> > > > > > > Even with a duplicated pending list it's actually currently
> > > > > > > impossible to do
> > > > > > > this fully cleanly.
> > > > > > > 
> > > > > > > The problem is that the dma_fence object gives no guarantee
> > > > > > > when
> > > > > > > callbacks
> > > > > > > are processed, e.g. they can be both processed from
> > > > > > > interrupt
> > > > > > > context as
> > > > > > > well as from a CPU which called dma_fence_is_signaled().
> > > > > > > 
> > > > > > > So when a driver (or drm_sched_fini) waits for the last
> > > > > > > submitted
> > > > > > > fence it
> > > > > > > actually can be that the drm_sched object still needs to do
> > > > > > > some
> > > > > > > processing.
> > > > > > > See the hack in amdgpu_vm_tlb_seq() for more background on
> > > > > > > the
> > > > > > > problem.
> > > > > Oh dear ^^'
> > > > > We better work towards fixing that centrally
> > > > > 
> > > > > Thanks,
> > > > > P.
> > > > > 
> > > > > 
> > > > > > So I thought this should be fairly easy because of the sched
> > > > > > hw/public
> > > > > > fence split: If we wait for both all jobs to finish and for
> > > > > > all
> > > > > > the
> > > > > > sched
> > > > > > work/tdr work to finish, and we make sure there's no entity
> > > > > > existing
> > > > > > that's not yet stopped we should catch them all?
> > > > 
> > > > Unfortunately not.
> > > > 
> > > > Even when you do a dma_fence_wait() on the last submission it can
> > > > still be
> > > > that another CPU is executing the callbacks to wake up the
> > > > scheduler work
> > > > item and cleanup the job.
> > > > 
> > > > That's one of the reasons why I think the design of keeping the
> > > > job
> > > > alive is
> > > > so extremely awkward. The dma_fence as representation of the hw
> > > > submission
> > > > has a much better defined state machine and lifetime.
> > > > 
> > > > Regards,
> > > > Christian.
> > > > 
> > > > > >   Or at least I think
> > > > > > it's
> > > > > > a bug if any other code even tries to touch the hw fence.
> > > > > > 
> > > > > > If you have any other driver code which relies on the rcu
> > > > > > freeing
> > > > > > then I
> > > > > > think that's just a separate concern and we can ignore that
> > > > > > here
> > > > > > since the
> > > > > > fences themselves will till get rcu-delay freed even if
> > > > > > drm_sched_fini has
> > > > > > finished.
> > > > > > -Sima
> > > > > > 
> > > > > > > Regards,
> > > > > > > Christian.
> > > > > > > 
> > > > > > > > P.
> > > > > > > > 
> > > > > > > > > Thanks,
> > > > > > > > > Christian.
> > > > > > > > > 
> > > > > > > > > > P.
> > > > > > > > > > 
> > > > > > > > > > >      drm_sched_wqueue_stop(sched);
> > > > > > > > > > > + /*
> > > > > > > > > > > + * Tearing down the scheduler wile there are still
> > > > > > > > > > > unprocessed jobs can
> > > > > > > > > > > + * lead to use after free issues in the scheduler
> > > > > > > > > > > fence.
> > > > > > > > > > > + */
> > > > > > > > > > > + WARN_ON(!list_empty(&sched->pending_list));
> > > > > > > > > > > +
> > > > > > > > > > >      for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched-
> > > > > > > > > > > > num_rqs;
> > > > > > > > > > > i++)
> > > > > > > > > > > {
> > > > > > > > > > >      struct drm_sched_rq *rq = sched-
> > > > > > > > > > > > sched_rq[i];
> > > > > > > > > > >      spin_lock(&rq->lock);
> > > > > > > > > > > - list_for_each_entry(s_entity, &rq-
> > > > > > > > > > > > entities,
> > > > > > > > > > > list)
> > > > > > > > > > > + list_for_each_entry(s_entity, &rq-
> > > > > > > > > > > > entities,
> > > > > > > > > > > list) {
> > > > > > > > > > > + /*
> > > > > > > > > > > + * The justification for this
> > > > > > > > > > > BUG_ON()
> > > > > > > > > > > is
> > > > > > > > > > > that tearing
> > > > > > > > > > > + * down the scheduler while jobs
> > > > > > > > > > > are
> > > > > > > > > > > pending
> > > > > > > > > > > leaves
> > > > > > > > > > > + * dma_fences unsignaled. Since we
> > > > > > > > > > > have
> > > > > > > > > > > dependencies
> > > > > > > > > > > + * from the core memory management
> > > > > > > > > > > to
> > > > > > > > > > > eventually signal
> > > > > > > > > > > + * dma_fences this can trivially
> > > > > > > > > > > lead to
> > > > > > > > > > > a
> > > > > > > > > > > system wide
> > > > > > > > > > > + * stop because of a locked up
> > > > > > > > > > > memory
> > > > > > > > > > > management.
> > > > > > > > > > > + */
> > > > > > > > > > > + BUG_ON(spsc_queue_count(&s_entity-
> > > > > > > > > > > > job_queue));
> > > > > > > > > > > +
> > > > > > > > > > >      /*
> > > > > > > > > > >      * Prevents reinsertion and marks
> > > > > > > > > > > job_queue
> > > > > > > > > > > as idle,
> > > > > > > > > > >      * it will removed from rq in
> > > > > > > > > > > drm_sched_entity_fini
> > > > > > > > > > >      * eventually
> > > > > > > > > > >      */
> > > > > > > > > > >      s_entity->stopped = true;
> > > > > > > > > > > + }
> > > > > > > > > > >      spin_unlock(&rq->lock);
> > > > > > > > > > >      kfree(sched->sched_rq[i]);
> > > > > > > > > > >      }
> > > 
> > 
>
Matthew Brost Nov. 21, 2024, 6:10 p.m. UTC | #23
On Thu, Nov 21, 2024 at 10:06:54AM +0100, Philipp Stanner wrote:
> +CC Thomas Hellström
> 
> 
> On Mon, 2024-11-18 at 12:48 +0100, Christian König wrote:
> >  Am 15.11.24 um 17:07 schrieb Philipp Stanner:
> >  
> > >  
> > > On Fri, 2024-11-15 at 15:39 +0100, Christian König wrote:
> > >  
> > > >  
> > > >  Am 15.11.24 um 14:55 schrieb Philipp Stanner:
> > > >  
> > > >  
> > > > >  
> > > > > [SNIP] 
> > > > >  
> > > >  
> > >  
> > > >  
> > > > >  
> > > > > But you wouldn't want to aim for getting rid of struct
> > > > > drm_sched_job
> > > > > completely, or would you?
> > > > >  
> > > > >  
> > > >  
> > > >  
> > > >  No, the drm_sched_job structure was added to aid the single
> > > > producer
> > > > single consumer queue and so made it easier to put the jobs into
> > > > a
> > > > container.
> > > >  
> > > >  
> > > >  In my mind the full solution for running jobs looks like this:
> > > >  
> > > >  1. Driver enqueues with drm_sched_job_arm()
> > > >  2. job ends up in pending_list
> > > >  3. Sooner or later scheduler calls run_job()
> > > >  4. In return scheduler gets a dma_fence representing the
> > > > resulting
> > > > hardware operation.
> > > >  5, This fence is put into a container to keep around what the hw
> > > > is
> > > > actually executing.
> > > >  6. At some point the fence gets signaled informing the scheduler
> > > > that the next job can be pushed if enough credits are now
> > > > available.
> > > >  
> > > >  There is no free_job callback any more because after run_job is
> > > > called the scheduler is done with the job and only the dma_fence
> > > > which represents the actually HW operation is the object the
> > > > scheduler now works with.
> > > >  
> > > >  We would still need something like a kill_job callback which is
> > > > used
> > > > when an entity is released without flushing all jobs (see
> > > > drm_sched_entity_kill()), but that is then only used for a
> > > > specific
> > > > corner case.
> > > >  
> > >  
> > > Can't we just limit the scheduler's responsibility to telling the
> > > driver that it has to flush, and if it doesn't it's a bug?
> > >  
> >  
> >  We still need to remove the pending jobs from the scheduler if
> > flushing times out.
> >  
> >  Without timing out flushing and/or aborting when the process was
> > killed we run into the same problem again that we don't want ti block
> > on _fini().
> >    
> > >  
> > > >  
> > > >  Blocking for the cleanup in drm_sched_fini() then becomes a
> > > > trivial
> > > > dma_fence_wait() on the remaining unsignaled HW fences and
> > > > eventually
> > > > on the latest killed fence.
> > > >  
> > >  
> > > But that results in exactly the same situation as my waitque
> > > solution,
> > > doesn't it?
> > >  
> >  
> >  The key part is that dma_fence's have a documented requirement to
> > never block infinitely.
> >  
> >  
> > >  
> > > Blocking infinitely in drm_sched_fini():
> > > 
> > > If the driver doesn't guarantee that all fences will get signaled,
> > > then
> > > wait_event() in the waitque solution will block forever. The same
> > > with
> > > dma_fence_wait().
> > > 
> > > Or are you aiming at an interruptible dma_fence_wait(), thereby not
> > > blocking SIGKILL?
> > >  
> >  
> >  That is basically what drm_sched_entity_flush() is already doing.
> >  
> >  
> > >  
> > > That then would still result in leaks. So basically the same
> > > situation
> > > as with an interruptible drm_sched_flush() function.
> > > 
> > > (Although I agree that would probably be more elegant)
> > >  
> >  
> >  If the wait_event really would just waiting for dma_fences then yes.
> >  
> >  The problem with the waitqueue approach is that we need to wait for
> > the free_job callback and that callback is normally called from a
> > work item without holding any additional locks.
> >  
> >  When drm_sched_fini starts to block for that we create a rat-tail of
> > new dependencies since that one is most likely called from a file
> > descriptor destruction.
> >   
> >  
> > >  
> > > 
> > >  
> > > >  
> > > >  
> > > >  The problem with this approach is that the idea of re-submitting
> > > > jobs in a GPU reset by the scheduler is then basically dead. But
> > > > to
> > > > be honest that never worked correctly.
> > > >  
> > > >  See the deprecated warning I already put on
> > > > drm_sched_resubmit_jobs(). The job lifetime is not really well
> > > > defined because of this, see the free_guilty hack as well.
> > > >  
> > >  
> > > drm_sched_resubmit_jobs() is being used by 5 drivers currently. So
> > > if
> > > we go for this approach we have to port them, first. That doesn't
> > > sound
> > > trivial to me
> > >  
> >  
> >  Yeah, completely agree. I think the scheduler should provide
> > something like drm_sched_for_each_pending_fence() which helps to
> > iterate over all the pending HW fences.
> >  
> >  The individual drivers could then device by themself what to do,
> > e.g. upcast the HW fence into the job and push it again or similar.
> 
> I have talked with Dave and we would be interested in exploring the
> direction of getting rid of backend_ops.free_job() and doing the
> modifications this implies.
> 

Sorry again, catching up late.

Got any details to dropping free_job?

Gut reaction is this is fairly large subsystem rework, touching many
drivers, and great way to regress stable software. 

> Matthew, Thomas, any hard NACKs on principle, or can we look into this
> in a future patch series?
> 

Definietly not NACK outright, but see my concerns above. If ends up not
being all that invasive and good cleanup, not opposed. If it starts
messing with existing reset / job cancel / queue teardown flows I would
be concerned given how hard it is to get though correct.

Matt

> 
> P.
> 
> 
> 
> 
> 
> >  
> >  But what we really need to get away from are those fence pre-
> > requisite violations Sima pointed out. For example we can't allocate
> > memory for run_job.
> >  
> >  Regards,
> >  Christian.
> >  
> >  
> > >  
> > > 
> > > 
> > > P.
> > > 
> > >  
> > > >  
> > > >  
> > > >  Regards,
> > > >  Christian.
> > > >  
> > > >  
> > > >  
> > > > >  
> > > > >  
> > > > > 
> > > > > 
> > > > > Grüße,
> > > > > P.
> > > > > 
> > > > >  
> > > > >  
> > > >  
> > > >  
> > > >  
> > >   
> >  
> >  
>
diff mbox series

Patch

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index f093616fe53c..8a46fab5cdc8 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1333,17 +1333,34 @@  void drm_sched_fini(struct drm_gpu_scheduler *sched)
 
 	drm_sched_wqueue_stop(sched);
 
+	/*
+	 * Tearing down the scheduler wile there are still unprocessed jobs can
+	 * lead to use after free issues in the scheduler fence.
+	 */
+	WARN_ON(!list_empty(&sched->pending_list));
+
 	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
 		struct drm_sched_rq *rq = sched->sched_rq[i];
 
 		spin_lock(&rq->lock);
-		list_for_each_entry(s_entity, &rq->entities, list)
+		list_for_each_entry(s_entity, &rq->entities, list) {
+			/*
+			 * The justification for this BUG_ON() is that tearing
+			 * down the scheduler while jobs are pending leaves
+			 * dma_fences unsignaled. Since we have dependencies
+			 * from the core memory management to eventually signal
+			 * dma_fences this can trivially lead to a system wide
+			 * stop because of a locked up memory management.
+			 */
+			BUG_ON(spsc_queue_count(&s_entity->job_queue));
+
 			/*
 			 * Prevents reinsertion and marks job_queue as idle,
 			 * it will removed from rq in drm_sched_entity_fini
 			 * eventually
 			 */
 			s_entity->stopped = true;
+		}
 		spin_unlock(&rq->lock);
 		kfree(sched->sched_rq[i]);
 	}