Message ID | 20240918133956.26557-1-christian.koenig@amd.com (mailing list archive)
---|---
State | New, archived
Series | [1/2] drm/sched: add WARN_ON and BUG_ON to drm_sched_fini
On Wed, 18 Sep 2024, "Christian König" <ckoenig.leichtzumerken@gmail.com> wrote:
> Tearing down the scheduler with jobs still on the pending list can
> lead to use after free issues. Add a warning if drivers try to
> destroy a scheduler which still has work pushed to the HW.
>
> When there are still entities with jobs the situation is even worse:
> since the dma_fences for those jobs can never signal, we can just
> choose between potentially locking up core memory management and
> random memory corruption. When drivers really mess it up that well,
> let them run into a BUG_ON().
>
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index f093616fe53c..8a46fab5cdc8 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
>
>  	drm_sched_wqueue_stop(sched);
>
> +	/*
> +	 * Tearing down the scheduler while there are still unprocessed jobs can
> +	 * lead to use after free issues in the scheduler fence.
> +	 */
> +	WARN_ON(!list_empty(&sched->pending_list));

drm_WARN_ON(sched->dev, ...) would identify the device, which I presume
would be helpful in multi-GPU systems.

BR,
Jani.

> +
>  	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
>  		struct drm_sched_rq *rq = sched->sched_rq[i];
>
>  		spin_lock(&rq->lock);
> -		list_for_each_entry(s_entity, &rq->entities, list)
> +		list_for_each_entry(s_entity, &rq->entities, list) {
> +			/*
> +			 * The justification for this BUG_ON() is that tearing
> +			 * down the scheduler while jobs are pending leaves
> +			 * dma_fences unsignaled. Since we have dependencies
> +			 * from the core memory management to eventually signal
> +			 * dma_fences this can trivially lead to a system wide
> +			 * stop because of a locked up memory management.
> +			 */
> +			BUG_ON(spsc_queue_count(&s_entity->job_queue));
> +
>  			/*
>  			 * Prevents reinsertion and marks job_queue as idle,
>  			 * it will be removed from rq in drm_sched_entity_fini
>  			 * eventually
>  			 */
>  			s_entity->stopped = true;
> +		}
>  		spin_unlock(&rq->lock);
>  		kfree(sched->sched_rq[i]);
>  	}
Am 18.09.24 um 16:41 schrieb Jani Nikula:
> On Wed, 18 Sep 2024, "Christian König" <ckoenig.leichtzumerken@gmail.com> wrote:
>> Tearing down the scheduler with jobs still on the pending list can
>> lead to use after free issues. Add a warning if drivers try to
>> destroy a scheduler which still has work pushed to the HW.
>>
>> [...]
>>
>> +	WARN_ON(!list_empty(&sched->pending_list));
> drm_WARN_ON(sched->dev, ...) would identify the device, which I presume
> would be helpful in multi-GPU systems.

Good point, going to fix that.

Regards,
Christian.
On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote:
> Tearing down the scheduler with jobs still on the pending list can
> lead to use after free issues. Add a warning if drivers try to
> destroy a scheduler which still has work pushed to the HW.

Did you have time yet to look into my proposed waitque-solution?

> [...]
> @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)

I agree with Sima that it should first be documented in the function's
docstring what the user is expected to have done before calling the
function.

P.
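As a sketch of what such a docstring could cover (the wording below is the editor's illustration of the suggestion, not an actual patch), the kernel-doc for drm_sched_fini() might spell out the preconditions like this:

```c
/**
 * drm_sched_fini - Destroy a gpu scheduler
 * @sched: scheduler instance
 *
 * Tears down and cleans up the scheduler.
 *
 * Before calling this function the driver must ensure that all jobs
 * pushed to the hardware have signaled their dma_fences and that no
 * entity is still using the scheduler; otherwise teardown can lead
 * to use-after-free of the scheduler fences.
 */
```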
Am 20.09.24 um 10:57 schrieb Philipp Stanner:
> On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote:
>> Tearing down the scheduler with jobs still on the pending list can
>> lead to use after free issues. Add a warning if drivers try to
>> destroy a scheduler which still has work pushed to the HW.
> Did you have time yet to look into my proposed waitque-solution?

I don't remember seeing anything. What have I missed?

> [...]
> I agree with Sima that it should first be documented in the function's
> docstring what the user is expected to have done before calling the
> function.

Good point, going to update the documentation as well.

Thanks,
Christian.
On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:
> Am 20.09.24 um 10:57 schrieb Philipp Stanner:
>> Did you have time yet to look into my proposed waitque-solution?
>
> I don't remember seeing anything. What have I missed?

https://lore.kernel.org/all/20240903094446.29797-2-pstanner@redhat.com/

> [...]
>> I agree with Sima that it should first be documented in the
>> function's docstring what the user is expected to have done before
>> calling the function.
>
> Good point, going to update the documentation as well.

Cool thing, thx.

Would be great if everything (not totally trivial) necessary to be done
before _fini() is mentioned.

One could also think about providing a hint at how the driver can do
that. AFAICS the only way for the driver to ensure that is to maintain
its own, separate list of submitted jobs.

P.
Am 20.09.24 um 15:26 schrieb Philipp Stanner:
> On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:
>> I don't remember seeing anything. What have I missed?
> https://lore.kernel.org/all/20240903094446.29797-2-pstanner@redhat.com/

Mhm, I didn't get that in my inbox for some reason.

Interesting approach, I'm just not sure if we can or should wait in
drm_sched_fini().

Probably better to make that a separate function, something like
drm_sched_flush() or similar.

> [...]
> One could also think about providing a hint at how the driver can do
> that. AFAICS the only way for the driver to ensure that is to maintain
> its own, separate list of submitted jobs.

Even with a duplicated pending list it's actually currently impossible
to do this fully cleanly.

The problem is that the dma_fence object gives no guarantee when
callbacks are processed, e.g. they can be both processed from interrupt
context as well as from a CPU which called dma_fence_is_signaled().

So when a driver (or drm_sched_fini) waits for the last submitted fence
it actually can be that the drm_sched object still needs to do some
processing. See the hack in amdgpu_vm_tlb_seq() for more background on
the problem.

Regards,
Christian.
On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König wrote:
> Am 20.09.24 um 15:26 schrieb Philipp Stanner:
>> https://lore.kernel.org/all/20240903094446.29797-2-pstanner@redhat.com/
>
> Mhm, I didn't get that in my inbox for some reason.
>
> Interesting approach, I'm just not sure if we can or should wait in
> drm_sched_fini().
>
> Probably better to make that a separate function, something like
> drm_sched_flush() or similar.

Yeah I don't think we should smash this into drm_sched_fini
unconditionally. I think conceptually there's about three cases:

- Ringbuffer schedulers. Probably want everything as-is, because
  drm_sched_fini is called long after all the entities are gone in
  drm_device cleanup.

- fw scheduler hardware with preemption support. There we probably want
  to nuke the context by setting the tdr timeout to zero (or maybe just
  as long as context preemption takes to be efficient), and relying on
  the normal gpu reset flow to handle things. drm_sched_entity_flush
  kinda does this, except not really and it's a lot more focused on the
  ringbuffer context. So maybe we want a new drm_sched_entity_kill.

  For this case calling drm_sched_fini() after the 1:1 entity is gone
  should not find any lingering jobs, it would actually be a bug
  somewhere if there's a job lingering. Maybe a sanity check that
  there's not just no jobs lingering, but also no entity left would be
  good here?

- fw scheduler without preemption support. There we kinda need the
  drm_sched_flush, except blocking in fops->close is not great. So
  instead I think the following is better:
  1. drm_sched_entity_stopped, which only stops new submissions (for
     paranoia) but doesn't tear down the entity
  2. drm_dev_get
  3. launch a worker which does a) drm_sched_flush (or
     drm_sched_entity_flush or whatever we call it) b)
     drm_sched_entity_fini + drm_sched_fini c) drm_dev_put

Note that semantically this implements the refcount in the other path
from Phillip:

https://lore.kernel.org/all/20240903094531.29893-2-pstanner@redhat.com/

Except it doesn't impose refcounting on everyone else who doesn't need
it, and it doesn't even impose refcounting on drivers that do need it,
because we use drm_sched_flush and a worker to achieve the same.
Essentially helper functions for the common use-cases instead of trying
to solve them all by putting drm_sched_flush as a potentially very
blocking function into drm_sched_fini.

> [...]
> Even with a duplicated pending list it's actually currently impossible
> to do this fully cleanly.
>
> The problem is that the dma_fence object gives no guarantee when
> callbacks are processed, e.g. they can be both processed from
> interrupt context as well as from a CPU which called
> dma_fence_is_signaled().
>
> So when a driver (or drm_sched_fini) waits for the last submitted
> fence it actually can be that the drm_sched object still needs to do
> some processing. See the hack in amdgpu_vm_tlb_seq() for more
> background on the problem.

So I thought this should be fairly easy because of the sched hw/public
fence split: If we wait for both all jobs to finish and for all the
sched work/tdr work to finish, and we make sure there's no entity
existing that's not yet stopped, we should catch them all? Or at least
I think it's a bug if any other code even tries to touch the hw fence.

If you have any other driver code which relies on the rcu freeing then
I think that's just a separate concern and we can ignore that here,
since the fences themselves will still get rcu-delay freed even if
drm_sched_fini has finished.
-Sima
On Tue, 2024-09-24 at 13:18 +0200, Simona Vetter wrote:
> On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König wrote:
>> Interesting approach, I'm just not sure if we can or should wait in
>> drm_sched_fini().

We do agree that jobs still pending when drm_sched_fini() starts is
always a bug, right?

If so, what are the disadvantages of waiting in drm_sched_fini()? We
could block buggy drivers as I see it. Which wouldn't be good, but
could then be fixed on drivers' side.

>> Probably better to make that a separate function, something like
>> drm_sched_flush() or similar.

We could do that. Such a function could then be called by drivers which
are not sure whether all jobs are done before they start tearing down.

> Yeah I don't think we should smash this into drm_sched_fini
> unconditionally. I think conceptually there's about three cases:
>
> - Ringbuffer schedulers. Probably want everything as-is, because
>   drm_sched_fini is called long after all the entities are gone in
>   drm_device cleanup.
>
> - fw scheduler hardware with preemption support. [...]
>
>   For this case calling drm_sched_fini() after the 1:1 entity is gone
>   should not find any lingering jobs, it would actually be a bug
>   somewhere if there's a job lingering. Maybe a sanity check that
>   there's not just no jobs lingering, but also no entity left would
>   be good here?

The check for lingering ones is in Christian's patch here IISC.

At which position would you imagine the check for the entity being
performed?

> - fw scheduler without preemption support. There we kinda need the
>   drm_sched_flush, except blocking in fops->close is not great. So
>   instead I think the following is better:
>   1. drm_sched_entity_stopped, which only stops new submissions (for
>      paranoia) but doesn't tear down the entity

Who would call that function? Drivers using it voluntarily could just
as well stop accepting new jobs from userspace to their entities,
couldn't they?

>   2. drm_dev_get
>   3. launch a worker which does a) drm_sched_flush (or
>      drm_sched_entity_flush or whatever we call it) b)
>      drm_sched_entity_fini + drm_sched_fini c) drm_dev_put
>
> Note that semantically this implements the refcount in the other path
> from Phillip:
>
> https://lore.kernel.org/all/20240903094531.29893-2-pstanner@redhat.com/
>
> Except it doesn't impose refcounting on everyone else who doesn't
> need it, and it doesn't even impose refcounting on drivers that do
> need it because we use drm_sched_flush and a worker to achieve the
> same.

I indeed wasn't happy with the refcount approach for that reason,
agreed.

> Essentially helper functions for the common use-cases instead of
> trying to solve them all by putting drm_sched_flush as a potentially
> very blocking function into drm_sched_fini.

I'm still not able to see why it blocking would be undesired – as far
as I can see, it is only invoked on driver teardown, so not during
active operation.

Teardown doesn't happen that often, and it can (if implemented
correctly) only block until the driver's code has signaled the last
fences. If that doesn't happen, the block would reveal a bug.

But don't get me wrong: I don't want to *push* this solution. I just
want to understand when it could become a problem.

Wouldn't an explicitly blocking, separate function like
drm_sched_flush() or drm_sched_fini_flush() be a small, doable step in
the right direction?

> [...]
>> So when a driver (or drm_sched_fini) waits for the last submitted
>> fence it actually can be that the drm_sched object still needs to do
>> some processing. See the hack in amdgpu_vm_tlb_seq() for more
>> background on the problem.

Oh dear ^^'
We better work towards fixing that centrally.

Thanks,
P.
Am 25.09.24 um 16:53 schrieb Philipp Stanner:
> On Tue, 2024-09-24 at 13:18 +0200, Simona Vetter wrote:
>> On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König wrote:
>>> Am 20.09.24 um 15:26 schrieb Philipp Stanner:
>>>> On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:
>>>>> Am 20.09.24 um 10:57 schrieb Philipp Stanner:
>>>>>> On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote:
>>>>>>> Tearing down the scheduler with jobs still on the pending list
>>>>>>> can lead to use after free issues. Add a warning if drivers try
>>>>>>> to destroy a scheduler which still has work pushed to the HW.
>>>>>> Did you have time yet to look into my proposed waitqueue solution?
>>>>> I don't remember seeing anything. What have I missed?
>>>> https://lore.kernel.org/all/20240903094446.29797-2-pstanner@redhat.com/
>>> Mhm, I didn't get that in my inbox for some reason.
>>>
>>> Interesting approach, I'm just not sure if we can or should wait in
>>> drm_sched_fini().
> We do agree that jobs still pending when drm_sched_fini() starts is
> always a bug, right?

Correct, the question is how to avoid that.

> If so, what are the disadvantages of waiting in drm_sched_fini()? We
> could block buggy drivers as I see it. Which wouldn't be good, but
> could then be fixed on the drivers' side.

Sima explained that pretty well: Don't block in fops->close, do that in
fops->flush instead.

One issue this solves is that when you send a SIGTERM the teardown
handling first flushes all the FDs and then closes them.

So if flushing the FDs blocks because the process initiated sending a
terabyte of data over a 300bps line (for example) you can still throw a
SIGKILL and abort that as well.

If you would block in fops->close() that SIGKILL won't have any effect
anymore because by the time close() is called the process is gone and
signals are already blocked.
And yes, when I learned about that issue I was also baffled that
handling like this in the UNIX design is nearly 50 years old and still
applies today.

>>> Probably better to make that a separate function, something like
>>> drm_sched_flush() or similar.
> We could do that. Such a function could then be called by drivers which
> are not sure whether all jobs are done before they start tearing down.

Yes, exactly that's the idea. And give that flush function a return code
so that it can return -EINTR.

>> Yeah I don't think we should smash this into drm_sched_fini
>> unconditionally. I think conceptually there's about three cases:
>>
>> - Ringbuffer schedulers. Probably want everything as-is, because
>>   drm_sched_fini is called long after all the entities are gone in
>>   drm_device cleanup.
>>
>> - fw scheduler hardware with preemption support. There we probably
>>   want to nuke the context by setting the tdr timeout to zero (or
>>   maybe just as long as context preemption takes to be efficient),
>>   and relying on the normal gpu reset flow to handle things.
>>   drm_sched_entity_flush kinda does this, except not really and it's
>>   a lot more focused on the ringbuffer context. So maybe we want a
>>   new drm_sched_entity_kill.
>>
>>   For this case calling drm_sched_fini() after the 1:1 entity is gone
>>   should not find any lingering jobs, it would actually be a bug
>>   somewhere if there's a job lingering. Maybe a sanity check that
>>   there's not just no jobs lingering, but also no entity left would
>>   be good here?
> The check for lingering ones is in Christian's patch here IIUC.
> At which position would you imagine the check for the entity being
> performed?
>
>> - fw scheduler without preemption support. There we kinda need the
>>   drm_sched_flush, except blocking in fops->close is not great. So
>>   instead I think the following is better:
>>   1. drm_sched_entity_stopped, which only stops new submissions (for
>>      paranoia) but doesn't tear down the entity
> Who would call that function?
> Drivers using it voluntarily could just as well stop accepting new
> jobs from userspace to their entities, couldn't they?
>
>>   2. drm_dev_get
>>   3. launch a worker which does a) drm_sched_flush (or
>>      drm_sched_entity_flush or whatever we call it)
>>      b) drm_sched_entity_fini + drm_sched_fini c) drm_dev_put
>>
>> Note that semantically this implements the refcount in the other path
>> from Phillip:
>>
>> https://lore.kernel.org/all/20240903094531.29893-2-pstanner@redhat.com/
>>
>> Except it doesn't impose refcount on everyone else who doesn't need
>> it, and it doesn't even impose refcounting on drivers that do need it
>> because we use drm_sched_flush and a worker to achieve the same.
> I indeed wasn't happy with the refcount approach for that reason,
> agreed.
>
>> Essentially helper functions for the common use-cases instead of
>> trying to solve them all by putting drm_sched_flush as a potentially
>> very blocking function into drm_sched_fini.
> I'm still not able to see why its blocking would be undesired – as far
> as I can see, it is only invoked on driver teardown, so not during
> active operation. Teardown doesn't happen that often, and it can (if
> implemented correctly) only block until the driver's code has signaled
> the last fences. If that doesn't happen, the block would reveal a bug.
>
> But don't get me wrong: I don't want to *push* this solution. I just
> want to understand when it could become a problem.
>
> Wouldn't an explicitly blocking, separate function like
> drm_sched_flush() or drm_sched_fini_flush() be a small, doable step
> towards the right direction?

I think that this is the right thing to do, yes.
>>>>>>> [... quoted commit message, patch and earlier discussion snipped ...]
>>> So when a driver (or drm_sched_fini) waits for the last submitted
>>> fence it actually can be that the drm_sched object still needs to do
>>> some processing. See the hack in amdgpu_vm_tlb_seq() for more
>>> background on the problem.
> Oh dear ^^'
> We better work towards fixing that centrally
>
> Thanks,
> P.
>
>> So I thought this should be fairly easy because of the sched
>> hw/public fence split: If we wait for both all jobs to finish and for
>> all the sched work/tdr work to finish, and we make sure there's no
>> entity existing that's not yet stopped we should catch them all?

Unfortunately not.

Even when you do a dma_fence_wait() on the last submission it can still
be that another CPU is executing the callbacks to wake up the scheduler
work item and clean up the job.

That's one of the reasons why I think the design of keeping the job
alive is so extremely awkward. The dma_fence as representation of the
hw submission has a much better defined state machine and lifetime.

Regards,
Christian.

>> Or at least I think it's a bug if any other code even tries to touch
>> the hw fence.
>>
>> If you have any other driver code which relies on the rcu freeing
>> then I think that's just a separate concern and we can ignore that
>> here since the fences themselves will still get rcu-delayed freed
>> even if drm_sched_fini has finished.
>> -Sima
>>
>>> Regards,
>>> Christian.
>>>
>>>> P.
>>>>
>>>>> Thanks,
>>>>> Christian.
>>>>>
>>>>>> P.
>>>>>>> [... quoted patch hunk snipped ...]
On Tue, Sep 24, 2024 at 01:18:25PM +0200, Simona Vetter wrote:
> On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König wrote:
> > Am 20.09.24 um 15:26 schrieb Philipp Stanner:
> > > On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:
> > > > Am 20.09.24 um 10:57 schrieb Philipp Stanner:
> > > > > On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote:
> > > > > > Tearing down the scheduler with jobs still on the pending
> > > > > > list can lead to use after free issues. Add a warning if
> > > > > > drivers try to destroy a scheduler which still has work
> > > > > > pushed to the HW.
> > > > > Did you have time yet to look into my proposed waitqueue
> > > > > solution?
> > > > I don't remember seeing anything. What have I missed?
> > > https://lore.kernel.org/all/20240903094446.29797-2-pstanner@redhat.com/
> >
> > Mhm, I didn't get that in my inbox for some reason.
> >
> > Interesting approach, I'm just not sure if we can or should wait in
> > drm_sched_fini().
> >
> > Probably better to make that a separate function, something like
> > drm_sched_flush() or similar.
>
> Yeah I don't think we should smash this into drm_sched_fini
> unconditionally. I think conceptually there's about three cases:
>
> - Ringbuffer schedulers. Probably want everything as-is, because
>   drm_sched_fini is called long after all the entities are gone in
>   drm_device cleanup.

Even if the entities are long gone we can still have jobs on the
pending list of the scheduler. Calling drm_sched_fini() would still
leak those jobs. It's just that for this case it's unlikely, but I
don't think we want to rely on "unlikely". Am I missing anything?

> - fw scheduler hardware with preemption support. There we probably
>   want to nuke the context by setting the tdr timeout to zero (or
>   maybe just as long as context preemption takes to be efficient),
>   and relying on the normal gpu reset flow to handle things.
>   drm_sched_entity_flush kinda does this, except not really and it's
>   a lot more focused on the ringbuffer context. So maybe we want a
>   new drm_sched_entity_kill.
>
>   For this case calling drm_sched_fini() after the 1:1 entity is gone
>   should not find any lingering jobs, it would actually be a bug
>   somewhere if there's a job lingering. Maybe a sanity check that
>   there's not just no jobs lingering, but also no entity left would
>   be good here?

This is a timing assumption too, isn't it? What if we reach
drm_sched_fini() quicker than all the free() callbacks from the jobs on
the pending list are processed?

I know Xe uses the "tdr to zero" trick, so I'm not aware of all the
implications -- in Nouveau I had to do it differently.

> - fw scheduler without preemption support. There we kinda need the
>   drm_sched_flush, except blocking in fops->close is not great. So
>   instead I think the following is better:
>   1. drm_sched_entity_stopped, which only stops new submissions (for
>      paranoia) but doesn't tear down the entity
>   2. drm_dev_get
>   3. launch a worker which does a) drm_sched_flush (or
>      drm_sched_entity_flush or whatever we call it)
>      b) drm_sched_entity_fini + drm_sched_fini c) drm_dev_put

I think it would be drm_sched_flush(); the jobs on the pending list are
not associated with any entity anymore. See also 96c7c2f4d5bd
("drm/scheduler: set entity to NULL in drm_sched_entity_pop_job()").

> Note that semantically this implements the refcount in the other path
> from Phillip:
>
> https://lore.kernel.org/all/20240903094531.29893-2-pstanner@redhat.com/
>
> Except it doesn't impose refcount on everyone else who doesn't need
> it, and it doesn't even impose refcounting on drivers that do need it
> because we use drm_sched_flush and a worker to achieve the same.
>
> Essentially helper functions for the common use-cases instead of
> trying to solve them all by putting drm_sched_flush as a potentially
> very blocking function into drm_sched_fini.
> > [... quoted commit message, patch and earlier discussion snipped ...]
> > So when a driver (or drm_sched_fini) waits for the last submitted
> > fence it actually can be that the drm_sched object still needs to do
> > some processing. See the hack in amdgpu_vm_tlb_seq() for more
> > background on the problem.
>
> So I thought this should be fairly easy because of the sched hw/public
> fence split: If we wait for both all jobs to finish and for all the
> sched work/tdr work to finish, and we make sure there's no entity
> existing that's not yet stopped we should catch them all? Or at least
> I think it's a bug if any other code even tries to touch the hw fence.
>
> If you have any other driver code which relies on the rcu freeing then
> I think that's just a separate concern and we can ignore that here
> since the fences themselves will still get rcu-delayed freed even if
> drm_sched_fini has finished.
> -Sima
>
> > [... quoted patch hunk snipped ...]
>
> --
> Simona Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
On Fri, Sep 27, 2024 at 11:04:48AM +0200, Christian König wrote:
> Am 25.09.24 um 16:53 schrieb Philipp Stanner:
> > On Tue, 2024-09-24 at 13:18 +0200, Simona Vetter wrote:
> > > On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König wrote:
> > > > [... earlier discussion snipped ...]
> > We do agree that jobs still pending when drm_sched_fini() starts is
> > always a bug, right?
>
> Correct, the question is how to avoid that.
>
> > If so, what are the disadvantages of waiting in drm_sched_fini()? We
> > could block buggy drivers as I see it. Which wouldn't be good, but
> > could then be fixed on the drivers' side.
>
> Sima explained that pretty well: Don't block in fops->close, do that
> in fops->flush instead.

I agree that we shouldn't block in close(), but this effectively means
that we need to reference count the scheduler, right?

Otherwise, if we allow to just skip / interrupt the teardown, we can
still leak memory.

> One issue this solves is that when you send a SIGTERM the teardown
> handling first flushes all the FDs and then closes them.
> [... remainder of quoted mail snipped ...]
On Mon, 2024-10-14 at 16:56 +0200, Danilo Krummrich wrote:
> On Fri, Sep 27, 2024 at 11:04:48AM +0200, Christian König wrote:
> > Am 25.09.24 um 16:53 schrieb Philipp Stanner:
> > > On Tue, 2024-09-24 at 13:18 +0200, Simona Vetter wrote:
> > > > On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König wrote:
> > > > > Am 20.09.24 um 15:26 schrieb Philipp Stanner:
> > > > > > On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:
> > > > > > > Am 20.09.24 um 10:57 schrieb Philipp Stanner:
> > > > > > > > On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote:
> > > > > > > > > Tearing down the scheduler with jobs still on the
> > > > > > > > > pending list can lead to use after free issues. Add a
> > > > > > > > > warning if drivers try to destroy a scheduler which
> > > > > > > > > still has work pushed to the HW.
> > > > > > > > Did you have time yet to look into my proposed
> > > > > > > > waitque-solution?
> > > > > > > I don't remember seeing anything. What have I missed?
> > > > > > https://lore.kernel.org/all/20240903094446.29797-2-pstanner@redhat.com/
> > > > > Mhm, I didn't got that in my inbox for some reason.
> > > > >
> > > > > Interesting approach, I'm just not sure if we can or should
> > > > > wait in drm_sched_fini().
> > > We do agree that jobs still pending when drm_sched_fini() starts
> > > is always a bug, right?
> >
> > Correct, the question is how to avoid that.
> >
> > > If so, what are the disadvantages of waiting in drm_sched_fini()?
> > > We could block buggy drivers as I see it. Which wouldn't be good,
> > > but could then be fixed on drivers' site.
> >
> > Sima explained that pretty well: Don't block in fops->close, do
> > that in fops->flush instead.
>
> I agree that we shouldn't block in close(), but this effectively means
> that we need to reference count the scheduler, right?
>
> Otherwise, if we allow to just skip / interrupt the teardown, we can
> still leak memory.

Having thought about it, I agree with Danilo. Having something that
shall wait on green light, but can be interrupted, is no guarantee and
therefore not a feasible solution.

To break down the solution space, these seem to be our options:
   1. We have something (either drm_sched_fini() or a helper, e.g.,
      drm_sched_flush()) that definitely blocks until the pending list
      has become empty.
   2. We have jobs reference-count the scheduler, so the latter can
      outlive the driver and will be freed some time later.

Can anyone think of a third solution?

Solution #1 has the problem of obviously blocking unconditionally if
the driver didn't make sure that all fences will be signaled within
reasonable time. In my opinion, this would actually be an advantage,
because it will be *very* noticeable and force users to repair their
driver. The driver *has* to guarantee that all fences will be signaled.
If the driver has to do fishy things, having the blocking outsourced to
the helper drm_sched_flush() would allow them to circumvent that.

Solution #2 has the problem of backend_ops.free_job() potentially using
driver-data after the driver is gone, causing UAF. So with this
solution, all drivers would have to be aware of the issue and handle it
through one of DRM's primitives dedicated to such problems.

Currently, all drivers either work around the problem internally or
simply ignore it, it seems.

So I'd argue that both solutions are an improvement over the existing
situation. My preference would be #1.

Opinions?

P.

>
> > One issue this solves is that when you send a SIGTERM the tear down
> > handling first flushes all the FDs and then closes them.
> >
> > So if flushing the FDs blocks because the process initiated sending
> > a terabyte of data over a 300bps line (for example) you can still
> > throw a SIGKILL and abort that as well.
> > > > If you would block in fops-close() that SIGKILL won't have any > > effect any > > more because by the time close() is called the process is gone and > > signals > > are already blocked. > > > > And yes when I learned about that issue I was also buffed that > > handling like > > this in the UNIX design is nearly 50 years old and still applies to > > today. > > > > > Probably better to make that a separate function, something > > > > > like > > > > > drm_sched_flush() or similar. > > > We could do that. Such a function could then be called by drivers > > > which > > > are not sure whether all jobs are done before they start tearing > > > down. > > > > Yes exactly that's the idea. And give that flush function a return > > code so > > that it can return -EINTR. > > > > > > Yeah I don't think we should smash this into drm_sched_fini > > > > unconditionally. I think conceptually there's about three > > > > cases: > > > > > > > > - Ringbuffer schedules. Probably want everything as-is, because > > > > drm_sched_fini is called long after all the entities are > > > > gone in > > > > drm_device cleanup. > > > > > > > > - fw scheduler hardware with preemption support. There we > > > > probably > > > > want to > > > > nuke the context by setting the tdr timeout to zero (or > > > > maybe just > > > > as > > > > long as context preemption takes to be efficient), and > > > > relying on > > > > the > > > > normal gpu reset flow to handle things. > > > > drm_sched_entity_flush > > > > kinda > > > > does this, except not really and it's a lot more focused on > > > > the > > > > ringbuffer context. So maybe we want a new > > > > drm_sched_entity_kill. > > > > > > > > For this case calling drm_sched_fini() after the 1:1 entity > > > > is gone > > > > should not find any linger jobs, it would actually be a bug > > > > somewhere if > > > > there's a job lingering. 
Maybe a sanity check that there's > > > > not just > > > > no > > > > jobs lingering, but also no entity left would be good here? > > > The check for lingering ones is in Christian's patch here IISC. > > > At which position would you imagine the check for the entity > > > being > > > performed? > > > > > > > - fw scheduler without preemption support. There we kinda need > > > > the > > > > drm_sched_flush, except blocking in fops->close is not > > > > great. So > > > > instead > > > > I think the following is better: > > > > 1. drm_sched_entity_stopped, which only stops new > > > > submissions (for > > > > paranoia) but doesn't tear down the entity > > > Who would call that function? > > > Drivers using it voluntarily could just as well stop accepting > > > new jobs > > > from userspace to their entities, couldn't they? > > > > > > > 2. drm_dev_get > > > > 3. launch a worker which does a) drm_sched_flush (or > > > > drm_sched_entity_flush or whatever we call it) b) > > > > drm_sched_entity_fini > > > > + drm_sched_fini c) drm_dev_put > > > > > > > > Note that semantically this implements the refcount in the > > > > other > > > > path > > > > from Phillip: > > > > > > > > https://lore.kernel.org/all/20240903094531.29893-2-pstanner@redhat.com/ > > > > Except it doesn't impose refcount on everyone else who > > > > doesn't need > > > > it, > > > > and it doesn't even impose refcounting on drivers that do > > > > need it > > > > because we use drm_sched_flush and a worker to achieve the > > > > same. > > > I indeed wasn't happy with the refcount approach for that reason, > > > agreed. > > > > > > > Essentially helper functions for the common use-cases instead > > > > of > > > > trying to > > > > solve them all by putting drm_sched_flush as a potentially very > > > > blocking > > > > function into drm_sched_fini. 
> > > I'm still not able to see why it blocking would be undesired – as > > > far > > > as I can see, it is only invoked on driver teardown, so not > > > during > > > active operation. Teardown doesn't happen that often, and it can > > > (if > > > implemented correctly) only block until the driver's code has > > > signaled > > > the last fences. If that doesn't happen, the block would reveal a > > > bug. > > > > > > But don't get me wrong: I don't want to *push* this solution. I > > > just > > > want to understand when it could become a problem. > > > > > > > > > Wouldn't an explicitly blocking, separate function like > > > drm_sched_flush() or drm_sched_fini_flush() be a small, doable > > > step > > > towards the right direction? > > > > I think that this is the right thing to do, yes. > > > > > > > > > > > When there are still entities with jobs the situation > > > > > > > > > is > > > > > > > > > even > > > > > > > > > worse > > > > > > > > > since the dma_fences for those jobs can never signal > > > > > > > > > we can > > > > > > > > > just > > > > > > > > > choose between potentially locking up core memory > > > > > > > > > management and > > > > > > > > > random memory corruption. When drivers really mess it > > > > > > > > > up > > > > > > > > > that > > > > > > > > > well > > > > > > > > > let them run into a BUG_ON(). 
> > > > > > > > > > > > > > > > > > Signed-off-by: Christian > > > > > > > > > König<christian.koenig@amd.com> > > > > > > > > > --- > > > > > > > > > drivers/gpu/drm/scheduler/sched_main.c | 19 > > > > > > > > > ++++++++++++++++++- > > > > > > > > > 1 file changed, 18 insertions(+), 1 deletion(-) > > > > > > > > > > > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > > > b/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > > > index f093616fe53c..8a46fab5cdc8 100644 > > > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > > > @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct > > > > > > > > > drm_gpu_scheduler > > > > > > > > > *sched) > > > > > > > > I agree with Sima that it should first be documented in > > > > > > > > the > > > > > > > > function's > > > > > > > > docstring what the user is expected to have done before > > > > > > > > calling the > > > > > > > > function. > > > > > > > Good point, going to update the documentation as well. > > > > > > Cool thing, thx. > > > > > > Would be great if everything (not totally trivial) > > > > > > necessary to > > > > > > be done > > > > > > before _fini() is mentioned. > > > > > > > > > > > > One could also think about providing a hint at how the > > > > > > driver can > > > > > > do > > > > > > that. AFAICS the only way for the driver to ensure that is > > > > > > to > > > > > > maintain > > > > > > its own, separate list of submitted jobs. > > > > > Even with a duplicated pending list it's actually currently > > > > > impossible to do > > > > > this fully cleanly. > > > > > > > > > > The problem is that the dma_fence object gives no guarantee > > > > > when > > > > > callbacks > > > > > are processed, e.g. they can be both processed from interrupt > > > > > context as > > > > > well as from a CPU which called dma_fence_is_signaled(). 
> > > > > > > > > > So when a driver (or drm_sched_fini) waits for the last > > > > > submitted > > > > > fence it > > > > > actually can be that the drm_sched object still needs to do > > > > > some > > > > > processing. > > > > > See the hack in amdgpu_vm_tlb_seq() for more background on > > > > > the > > > > > problem. > > > Oh dear ^^' > > > We better work towards fixing that centrally > > > > > > Thanks, > > > P. > > > > > > > > > > So I thought this should be fairly easy because of the sched > > > > hw/public > > > > fence split: If we wait for both all jobs to finish and for all > > > > the > > > > sched > > > > work/tdr work to finish, and we make sure there's no entity > > > > existing > > > > that's not yet stopped we should catch them all? > > > > Unfortunately not. > > > > Even when you do a dma_fence_wait() on the last submission it can > > still be > > that another CPU is executing the callbacks to wake up the > > scheduler work > > item and cleanup the job. > > > > That's one of the reasons why I think the design of keeping the job > > alive is > > so extremely awkward. The dma_fence as representation of the hw > > submission > > has a much better defined state machine and lifetime. > > > > Regards, > > Christian. > > > > > > Or at least I think > > > > it's > > > > a bug if any other code even tries to touch the hw fence. > > > > > > > > If you have any other driver code which relies on the rcu > > > > freeing > > > > then I > > > > think that's just a separate concern and we can ignore that > > > > here > > > > since the > > > > fences themselves will till get rcu-delay freed even if > > > > drm_sched_fini has > > > > finished. > > > > -Sima > > > > > > > > > Regards, > > > > > Christian. > > > > > > > > > > > P. > > > > > > > > > > > > > Thanks, > > > > > > > Christian. > > > > > > > > > > > > > > > P. 
> > > > > > > > >
> > > > > > > > >         drm_sched_wqueue_stop(sched);
> > > > > > > > >
> > > > > > > > > +       /*
> > > > > > > > > +        * Tearing down the scheduler while there are still unprocessed jobs can
> > > > > > > > > +        * lead to use after free issues in the scheduler fence.
> > > > > > > > > +        */
> > > > > > > > > +       WARN_ON(!list_empty(&sched->pending_list));
> > > > > > > > > +
> > > > > > > > >         for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
> > > > > > > > >                 struct drm_sched_rq *rq = sched->sched_rq[i];
> > > > > > > > >
> > > > > > > > >                 spin_lock(&rq->lock);
> > > > > > > > > -               list_for_each_entry(s_entity, &rq->entities, list)
> > > > > > > > > +               list_for_each_entry(s_entity, &rq->entities, list) {
> > > > > > > > > +                       /*
> > > > > > > > > +                        * The justification for this BUG_ON() is that tearing
> > > > > > > > > +                        * down the scheduler while jobs are pending leaves
> > > > > > > > > +                        * dma_fences unsignaled. Since we have dependencies
> > > > > > > > > +                        * from the core memory management to eventually signal
> > > > > > > > > +                        * dma_fences this can trivially lead to a system wide
> > > > > > > > > +                        * stop because of a locked up memory management.
> > > > > > > > > +                        */
> > > > > > > > > +                       BUG_ON(spsc_queue_count(&s_entity->job_queue));
> > > > > > > > > +
> > > > > > > > >                         /*
> > > > > > > > >                          * Prevents reinsertion and marks job_queue as idle,
> > > > > > > > >                          * it will be removed from rq in drm_sched_entity_fini
> > > > > > > > >                          * eventually
> > > > > > > > >                          */
> > > > > > > > >                         s_entity->stopped = true;
> > > > > > > > > +               }
> > > > > > > > >                 spin_unlock(&rq->lock);
> > > > > > > > >                 kfree(sched->sched_rq[i]);
> > > > > > > > >         }
>
Christian, Sima? Matthew? (+CC)

Opinions on the below?

tl;dr: I still think it's a good thing to detectably block in
drm_sched_fini(), or at the very least drm_sched_flush(), because then
you'll find out that the driver is broken and can repair it.

P.

On Fri, 2024-10-18 at 14:07 +0200, Philipp Stanner wrote:
> On Mon, 2024-10-14 at 16:56 +0200, Danilo Krummrich wrote:
> > On Fri, Sep 27, 2024 at 11:04:48AM +0200, Christian König wrote:
> > > Am 25.09.24 um 16:53 schrieb Philipp Stanner:
> > > > On Tue, 2024-09-24 at 13:18 +0200, Simona Vetter wrote:
> > > > > On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König
> > > > > wrote:
> > > > > > Am 20.09.24 um 15:26 schrieb Philipp Stanner:
> > > > > > > On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:
> > > > > > > > Am 20.09.24 um 10:57 schrieb Philipp Stanner:
> > > > > > > > > On Wed, 2024-09-18 at 15:39 +0200, Christian König
> > > > > > > > > wrote:
> > > > > > > > > > Tearing down the scheduler with jobs still on the
> > > > > > > > > > pending list can lead to use after free issues. Add
> > > > > > > > > > a warning if drivers try to destroy a scheduler
> > > > > > > > > > which still has work pushed to the HW.
> > > > > > > > > Did you have time yet to look into my proposed
> > > > > > > > > waitque-solution?
> > > > > > > > I don't remember seeing anything. What have I missed?
> > > > > > > https://lore.kernel.org/all/20240903094446.29797-2-pstanner@redhat.com/
> > > > > > Mhm, I didn't got that in my inbox for some reason.
> > > > > >
> > > > > > Interesting approach, I'm just not sure if we can or should
> > > > > > wait in drm_sched_fini().
> > > > We do agree that jobs still pending when drm_sched_fini()
> > > > starts is always a bug, right?
> > >
> > > Correct, the question is how to avoid that.
> > > > > > > If so, what are the disadvantages of waiting in > > > > drm_sched_fini()? > > > > We > > > > could block buggy drivers as I see it. Which wouldn't be good, > > > > but > > > > could then be fixed on drivers' site. > > > > > > Sima explained that pretty well: Don't block in fops->close, do > > > that in > > > fops->flush instead. > > > > I agree that we shouldn't block in close(), but this effectively > > means that we > > need to reference count the scheduler, right? > > > > Otherwise, if we allow to just skip / interrupt the teardown, we > > can > > still leak > > memory. > > Having thought about it, I agree with Danilo. Having something that > shall wait on green light, but can be interrupted, is no guarantee > and > therefore not a feasible solution. > > To break down the solution space, these seem to be our options: > 1. We have something (either drm_sched_fini() or a helper, e.g., > drm_sched_flush()) that definitely blocks until the pending > list > has become empty. > 2. We have jobs reference-count the scheduler, so the latter can > outlive the driver and will be freed some time later. > > Can anyone think of a third solution? > > > Solution #1 has the problem of obviously blocking unconditionally if > the driver didn't make sure that all fences will be signaled within > reasonable time. In my opinion, this would actually be an advantage, > because it will be *very* noticable and force users to repair their > driver. The driver *has* to guarantee that all fences will be > signaled. > If the driver has to do fishy things, having the blocking outsourced > to > the helper drm_sched_flush() would allow them to circumvent that. > > Solution #2 has the problem of backend_ops.free_job() potentially > using > driver-data after the driver is gone, causing UAF. So with this > solutions all drivers would have to be aware of the issue and handle > it > through one of DRMs primitives dedicated to such problems. 
> > > Currently, all drivers either work around the problem internally or > simply ignore it, it seems. > > So I'd argue that both solutions are an improvement over the existing > situation. My preference would be #1. > > > Opinions? > > P. > > > > > > > > > One issue this solves is that when you send a SIGTERM the tear > > > down > > > handling > > > first flushes all the FDs and then closes them. > > > > > > So if flushing the FDs blocks because the process initiated > > > sending > > > a > > > terabyte of data over a 300bps line (for example) you can still > > > throw a > > > SIGKILL and abort that as well. > > > > > > If you would block in fops-close() that SIGKILL won't have any > > > effect any > > > more because by the time close() is called the process is gone > > > and > > > signals > > > are already blocked. > > > > > > And yes when I learned about that issue I was also buffed that > > > handling like > > > this in the UNIX design is nearly 50 years old and still applies > > > to > > > today. > > > > > > Probably better to make that a separate function, something > > > > > > like > > > > > > drm_sched_flush() or similar. > > > > We could do that. Such a function could then be called by > > > > drivers > > > > which > > > > are not sure whether all jobs are done before they start > > > > tearing > > > > down. > > > > > > Yes exactly that's the idea. And give that flush function a > > > return > > > code so > > > that it can return -EINTR. > > > > > > > > Yeah I don't think we should smash this into drm_sched_fini > > > > > unconditionally. I think conceptually there's about three > > > > > cases: > > > > > > > > > > - Ringbuffer schedules. Probably want everything as-is, > > > > > because > > > > > drm_sched_fini is called long after all the entities are > > > > > gone in > > > > > drm_device cleanup. > > > > > > > > > > - fw scheduler hardware with preemption support. 
There we > > > > > probably > > > > > want to > > > > > nuke the context by setting the tdr timeout to zero (or > > > > > maybe just > > > > > as > > > > > long as context preemption takes to be efficient), and > > > > > relying on > > > > > the > > > > > normal gpu reset flow to handle things. > > > > > drm_sched_entity_flush > > > > > kinda > > > > > does this, except not really and it's a lot more focused > > > > > on > > > > > the > > > > > ringbuffer context. So maybe we want a new > > > > > drm_sched_entity_kill. > > > > > > > > > > For this case calling drm_sched_fini() after the 1:1 > > > > > entity > > > > > is gone > > > > > should not find any linger jobs, it would actually be a > > > > > bug > > > > > somewhere if > > > > > there's a job lingering. Maybe a sanity check that there's > > > > > not just > > > > > no > > > > > jobs lingering, but also no entity left would be good > > > > > here? > > > > The check for lingering ones is in Christian's patch here IISC. > > > > At which position would you imagine the check for the entity > > > > being > > > > performed? > > > > > > > > > - fw scheduler without preemption support. There we kinda > > > > > need > > > > > the > > > > > drm_sched_flush, except blocking in fops->close is not > > > > > great. So > > > > > instead > > > > > I think the following is better: > > > > > 1. drm_sched_entity_stopped, which only stops new > > > > > submissions (for > > > > > paranoia) but doesn't tear down the entity > > > > Who would call that function? > > > > Drivers using it voluntarily could just as well stop accepting > > > > new jobs > > > > from userspace to their entities, couldn't they? > > > > > > > > > 2. drm_dev_get > > > > > 3. 
launch a worker which does a) drm_sched_flush (or > > > > > drm_sched_entity_flush or whatever we call it) b) > > > > > drm_sched_entity_fini > > > > > + drm_sched_fini c) drm_dev_put > > > > > > > > > > Note that semantically this implements the refcount in the > > > > > other > > > > > path > > > > > from Phillip: > > > > > > > > > > https://lore.kernel.org/all/20240903094531.29893-2-pstanner@redhat.com/ > > > > > Except it doesn't impose refcount on everyone else who > > > > > doesn't need > > > > > it, > > > > > and it doesn't even impose refcounting on drivers that do > > > > > need it > > > > > because we use drm_sched_flush and a worker to achieve the > > > > > same. > > > > I indeed wasn't happy with the refcount approach for that > > > > reason, > > > > agreed. > > > > > > > > > Essentially helper functions for the common use-cases instead > > > > > of > > > > > trying to > > > > > solve them all by putting drm_sched_flush as a potentially > > > > > very > > > > > blocking > > > > > function into drm_sched_fini. > > > > I'm still not able to see why it blocking would be undesired – > > > > as > > > > far > > > > as I can see, it is only invoked on driver teardown, so not > > > > during > > > > active operation. Teardown doesn't happen that often, and it > > > > can > > > > (if > > > > implemented correctly) only block until the driver's code has > > > > signaled > > > > the last fences. If that doesn't happen, the block would reveal > > > > a > > > > bug. > > > > > > > > But don't get me wrong: I don't want to *push* this solution. I > > > > just > > > > want to understand when it could become a problem. > > > > > > > > > > > > Wouldn't an explicitly blocking, separate function like > > > > drm_sched_flush() or drm_sched_fini_flush() be a small, doable > > > > step > > > > towards the right direction? > > > > > > I think that this is the right thing to do, yes. 
> > > > > > > > > > > > > When there are still entities with jobs the > > > > > > > > > > situation > > > > > > > > > > is > > > > > > > > > > even > > > > > > > > > > worse > > > > > > > > > > since the dma_fences for those jobs can never > > > > > > > > > > signal > > > > > > > > > > we can > > > > > > > > > > just > > > > > > > > > > choose between potentially locking up core memory > > > > > > > > > > management and > > > > > > > > > > random memory corruption. When drivers really mess > > > > > > > > > > it > > > > > > > > > > up > > > > > > > > > > that > > > > > > > > > > well > > > > > > > > > > let them run into a BUG_ON(). > > > > > > > > > > > > > > > > > > > > Signed-off-by: Christian > > > > > > > > > > König<christian.koenig@amd.com> > > > > > > > > > > --- > > > > > > > > > > drivers/gpu/drm/scheduler/sched_main.c | 19 > > > > > > > > > > ++++++++++++++++++- > > > > > > > > > > 1 file changed, 18 insertions(+), 1 deletion(-) > > > > > > > > > > > > > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > > > > b/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > > > > index f093616fe53c..8a46fab5cdc8 100644 > > > > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > > > > @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct > > > > > > > > > > drm_gpu_scheduler > > > > > > > > > > *sched) > > > > > > > > > I agree with Sima that it should first be documented > > > > > > > > > in > > > > > > > > > the > > > > > > > > > function's > > > > > > > > > docstring what the user is expected to have done > > > > > > > > > before > > > > > > > > > calling the > > > > > > > > > function. > > > > > > > > Good point, going to update the documentation as well. > > > > > > > Cool thing, thx. > > > > > > > Would be great if everything (not totally trivial) > > > > > > > necessary to > > > > > > > be done > > > > > > > before _fini() is mentioned. 
> > > > > > > > > > > > > > One could also think about providing a hint at how the > > > > > > > driver can > > > > > > > do > > > > > > > that. AFAICS the only way for the driver to ensure that > > > > > > > is > > > > > > > to > > > > > > > maintain > > > > > > > its own, separate list of submitted jobs. > > > > > > Even with a duplicated pending list it's actually currently > > > > > > impossible to do > > > > > > this fully cleanly. > > > > > > > > > > > > The problem is that the dma_fence object gives no guarantee > > > > > > when > > > > > > callbacks > > > > > > are processed, e.g. they can be both processed from > > > > > > interrupt > > > > > > context as > > > > > > well as from a CPU which called dma_fence_is_signaled(). > > > > > > > > > > > > So when a driver (or drm_sched_fini) waits for the last > > > > > > submitted > > > > > > fence it > > > > > > actually can be that the drm_sched object still needs to do > > > > > > some > > > > > > processing. > > > > > > See the hack in amdgpu_vm_tlb_seq() for more background on > > > > > > the > > > > > > problem. > > > > Oh dear ^^' > > > > We better work towards fixing that centrally > > > > > > > > Thanks, > > > > P. > > > > > > > > > > > > > So I thought this should be fairly easy because of the sched > > > > > hw/public > > > > > fence split: If we wait for both all jobs to finish and for > > > > > all > > > > > the > > > > > sched > > > > > work/tdr work to finish, and we make sure there's no entity > > > > > existing > > > > > that's not yet stopped we should catch them all? > > > > > > Unfortunately not. > > > > > > Even when you do a dma_fence_wait() on the last submission it can > > > still be > > > that another CPU is executing the callbacks to wake up the > > > scheduler work > > > item and cleanup the job. > > > > > > That's one of the reasons why I think the design of keeping the > > > job > > > alive is > > > so extremely awkward. 
> > > The dma_fence as representation of the hw submission has a much
> > > better defined state machine and lifetime.
> > >
> > > Regards,
> > > Christian.
> > >
> > > > > Or at least I think it's a bug if any other code even tries
> > > > > to touch the hw fence.
> > > > >
> > > > > If you have any other driver code which relies on the rcu
> > > > > freeing then I think that's just a separate concern and we
> > > > > can ignore that here since the fences themselves will still
> > > > > get rcu-delay freed even if drm_sched_fini has finished.
> > > > > -Sima
> > > >
> > > > > > Regards,
> > > > > > Christian.
> > > > > >
> > > > > > > P.
> > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Christian.
> > > > > > > >
> > > > > > > > > P.
> > > > > > > > >
> > > > > > > > > >         drm_sched_wqueue_stop(sched);
> > > > > > > > > >
> > > > > > > > > > +       /*
> > > > > > > > > > +        * Tearing down the scheduler while there are still unprocessed jobs can
> > > > > > > > > > +        * lead to use after free issues in the scheduler fence.
> > > > > > > > > > +        */
> > > > > > > > > > +       WARN_ON(!list_empty(&sched->pending_list));
> > > > > > > > > > +
> > > > > > > > > >         for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
> > > > > > > > > >                 struct drm_sched_rq *rq = sched->sched_rq[i];
> > > > > > > > > >
> > > > > > > > > >                 spin_lock(&rq->lock);
> > > > > > > > > > -               list_for_each_entry(s_entity, &rq->entities, list)
> > > > > > > > > > +               list_for_each_entry(s_entity, &rq->entities, list) {
> > > > > > > > > > +                       /*
> > > > > > > > > > +                        * The justification for this BUG_ON() is that tearing
> > > > > > > > > > +                        * down the scheduler while jobs are pending leaves
> > > > > > > > > > +                        * dma_fences unsignaled. Since we have dependencies
> > > > > > > > > > +                        * from the core memory management to eventually signal
> > > > > > > > > > +                        * dma_fences this can trivially lead to a system wide
> > > > > > > > > > +                        * stop because of a locked up memory management.
> > > > > > > > > > +                        */
> > > > > > > > > > +                       BUG_ON(spsc_queue_count(&s_entity->job_queue));
> > > > > > > > > > +
> > > > > > > > > >                         /*
> > > > > > > > > >                          * Prevents reinsertion and marks job_queue as idle,
> > > > > > > > > >                          * it will be removed from rq in drm_sched_entity_fini
> > > > > > > > > >                          * eventually
> > > > > > > > > >                          */
> > > > > > > > > >                         s_entity->stopped = true;
> > > > > > > > > > +               }
> > > > > > > > > >                 spin_unlock(&rq->lock);
> > > > > > > > > >                 kfree(sched->sched_rq[i]);
> > > > > > > > > >         }
> > >
On Wed, 18 Sept 2024 at 23:48, Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> Tearing down the scheduler with jobs still on the pending list can
> lead to use after free issues. Add a warning if drivers try to
> destroy a scheduler which still has work pushed to the HW.
>
> When there are still entities with jobs the situation is even worse
> since the dma_fences for those jobs can never signal we can just
> choose between potentially locking up core memory management and
> random memory corruption. When drivers really mess it up that well
> let them run into a BUG_ON().

I've been talking a bit to Phillip about this offline.

I'm not sure we should ever BUG_ON here; I think it might be an
extreme answer. Considering we are saying blocking userspace to let
things finish is bad, I think hitting this would be much worse. I'd
rather we WARN_ON, and consider just saying screw it and block
userspace close.

If we really want to avoid the hang-on-close possibility (though I'm
mostly fine with that), then I think Sima's option is to just keep a
reference on the driver, let userspace close, and try to clean things
up on fences in the driver later. I think this should be at least good
for rust lifetimes.

Having an explicit memory leak is bad; having a managed memory leak is
less bad, because at least all the memory is still pointed to by
something and managed. At a guess we'd have to keep the whole driver
and scheduler around to avoid having to make special free functions.
Unless there is some concept like TTM ghost objects we could get away
with, but I think having to signal the fence means we should keep all
the pieces.

Dave.
>
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index f093616fe53c..8a46fab5cdc8 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
>
>  	drm_sched_wqueue_stop(sched);
>
> +	/*
> +	 * Tearing down the scheduler wile there are still unprocessed jobs can
> +	 * lead to use after free issues in the scheduler fence.
> +	 */
> +	WARN_ON(!list_empty(&sched->pending_list));
> +
>  	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
>  		struct drm_sched_rq *rq = sched->sched_rq[i];
>
>  		spin_lock(&rq->lock);
> -		list_for_each_entry(s_entity, &rq->entities, list)
> +		list_for_each_entry(s_entity, &rq->entities, list) {
> +			/*
> +			 * The justification for this BUG_ON() is that tearing
> +			 * down the scheduler while jobs are pending leaves
> +			 * dma_fences unsignaled. Since we have dependencies
> +			 * from the core memory management to eventually signal
> +			 * dma_fences this can trivially lead to a system wide
> +			 * stop because of a locked up memory management.
> +			 */
> +			BUG_ON(spsc_queue_count(&s_entity->job_queue));
> +
>  			/*
>  			 * Prevents reinsertion and marks job_queue as idle,
>  			 * it will removed from rq in drm_sched_entity_fini
>  			 * eventually
>  			 */
>  			s_entity->stopped = true;
> +		}
>  		spin_unlock(&rq->lock);
>  		kfree(sched->sched_rq[i]);
>  	}
> --
> 2.34.1
>
Am 08.11.24 um 05:00 schrieb Dave Airlie:
> On Wed, 18 Sept 2024 at 23:48, Christian König
> <ckoenig.leichtzumerken@gmail.com> wrote:
>> Tearing down the scheduler with jobs still on the pending list can
>> lead to use after free issues. Add a warning if drivers try to
>> destroy a scheduler which still has work pushed to the HW.
>>
>> When there are still entities with jobs the situation is even worse
>> since the dma_fences for those jobs can never signal we can just
>> choose between potentially locking up core memory management and
>> random memory corruption. When drivers really mess it up that well
>> let them run into a BUG_ON().
> I've been talking a bit to Phillip about this offline.
>
> I'm not sure we should ever be BUG_ON here, I think it might be an
> extreme answer, considering we are saying blocking userspace to let
> things finish is bad, I think hitting this would be much worse.

Yeah, completely agree that this is the most extreme response. The
problem is that the scheduler doesn't have much of a choice in this
moment any more.

What we can do is the following:

1. Try to block until the queued up jobs in the entities are processed
and have signaled their hardware fences. There is no guarantee that
this won't deadlock, and we are potentially blocking a SIGKILL here.
We already tried it before and that turned out to be quite unstable.

2. Signal all pending fences without actually executing anything.
Because of the dma_fence design (pipelining work and only keeping the
latest fence for each context in the containers), that can potentially
lead to random memory corruptions.

3. Don't signal anything, just free up all jobs. That can trivially
result in the core memory management waiting forever for things to
make progress, leading to complete system stops.

So the choices you have are all really bad for the system as a whole.

> I'd rather we WARN_ON, and consider just saying screw it and block
> userspace close.
Going to make that a WARN_ON() if you insist, but IIRC Linus or Greg
once summarized it as "A BUG() is only justified if you prevent even
worse things from happening". If you ask me that's exactly the case
here.

> If we really want to avoid the hang on close possibility (though I'm
> mostly fine with that) then I think Sima's option to just keep a
> reference on the driver, let userspace close and try and clean things
> up on fences in the driver later.
>
> I think this should be at least good for rust lifetimes.

The problem with the rust lifetime is that it got the direction of the
dependency wrong.

A dma_fence represents an operation on the hardware, e.g. a direct
access to some memory. This means that the dma_fence can only signal
if the hardware tells the driver that the operation is completed.

If a dma_fence should signal because some object is not referenced any
more, for example because userspace closed it, then that clearly
points to a fundamentally broken design.

> Having an explicit memory leak is bad, having a managed memory leak is
> less bad, because at least all the memory is still pointed to by
> something and managed, at a guess we'd have to keep the whole driver
> and scheduler around, to avoid having to make special free functions.
> Unless there is some concept like TTM ghost objects we could get away
> with, but I think having to signal the fence means we should keep all
> the pieces.

Well I think memory leaks are more or less harmless. What we really,
really need to prevent are random memory corruptions. Those can run
undetected under the radar for a very long time and cause massive
damage without anybody being able to pinpoint where the corruptions
came from.

Regards,
Christian.

>
> Dave.
>
>> Signed-off-by: Christian König <christian.koenig@amd.com>
>> ---
>>   drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++++-
>>   1 file changed, 18 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>> index f093616fe53c..8a46fab5cdc8 100644
>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>> @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
>>
>>  	drm_sched_wqueue_stop(sched);
>>
>> +	/*
>> +	 * Tearing down the scheduler wile there are still unprocessed jobs can
>> +	 * lead to use after free issues in the scheduler fence.
>> +	 */
>> +	WARN_ON(!list_empty(&sched->pending_list));
>> +
>>  	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
>>  		struct drm_sched_rq *rq = sched->sched_rq[i];
>>
>>  		spin_lock(&rq->lock);
>> -		list_for_each_entry(s_entity, &rq->entities, list)
>> +		list_for_each_entry(s_entity, &rq->entities, list) {
>> +			/*
>> +			 * The justification for this BUG_ON() is that tearing
>> +			 * down the scheduler while jobs are pending leaves
>> +			 * dma_fences unsignaled. Since we have dependencies
>> +			 * from the core memory management to eventually signal
>> +			 * dma_fences this can trivially lead to a system wide
>> +			 * stop because of a locked up memory management.
>> +			 */
>> +			BUG_ON(spsc_queue_count(&s_entity->job_queue));
>> +
>>  			/*
>>  			 * Prevents reinsertion and marks job_queue as idle,
>>  			 * it will removed from rq in drm_sched_entity_fini
>>  			 * eventually
>>  			 */
>>  			s_entity->stopped = true;
>> +		}
>>  		spin_unlock(&rq->lock);
>>  		kfree(sched->sched_rq[i]);
>>  	}
>> --
>> 2.34.1
>>
Interesting, those mails just showed up in my inbox as unread. More
than 14 days after you send it.

So sorry for the late reply.

Am 29.10.24 um 08:22 schrieb Philipp Stanner:
> Christian, Sima?
>
> Matthew? (+CC)
>
> Opinions on the below?
>
> tl;dr:
> I still think it's a good thing to detectably block in
> drm_sched_fini(), or at the very least drm_sched_flush(), because then
> you'll find out that the driver is broken and can repair it.

As discussed in private mail as well, the third option not mentioned
so far is to completely drop the free_job callback.

A good bunch of the problems we have here are caused by abusing the
job as state for executing the submission on the hardware.

The original design idea of the scheduler instead was to only have the
job as intermediate state between submission and picking what to
execute next. E.g. the scheduler in that view is just a pipeline where
you push jobs in on one end and jobs come out on the other end.

In that approach the object which represents the hardware execution is
the dma_fence instead of the job. And the dma_fence has a well defined
lifetime, error handling, etc...

So when we go back to this original approach it would mean that we
don't need to wait for any cleanup because the scheduler wouldn't be
involved in the execution of the jobs on the hardware any more.

The worst thing that could happen is that the driver messes things up
and still has a not yet executed job in an entity, but in that case we
could just warn, block for the hardware fence and then kill all
pending submissions with an error. And this blocking on the hardware
fence can't deadlock because we know that it is a hardware fence with
certain properties and not some random free_job callback where we
don't have any idea what locks the driver needs.

Regards,
Christian.

>
> P.
>
>
> On Fri, 2024-10-18 at 14:07 +0200, Philipp Stanner wrote:
>> On Mon, 2024-10-14 at 16:56 +0200, Danilo Krummrich wrote:
>>> On Fri, Sep 27, 2024 at 11:04:48AM +0200, Christian König wrote:
>>>> Am 25.09.24 um 16:53 schrieb Philipp Stanner:
>>>>> On Tue, 2024-09-24 at 13:18 +0200, Simona Vetter wrote:
>>>>>> On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König wrote:
>>>>>>> Am 20.09.24 um 15:26 schrieb Philipp Stanner:
>>>>>>>> On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote:
>>>>>>>>> Am 20.09.24 um 10:57 schrieb Philipp Stanner:
>>>>>>>>>> On Wed, 2024-09-18 at 15:39 +0200, Christian König wrote:
>>>>>>>>>>> Tearing down the scheduler with jobs still on the pending
>>>>>>>>>>> list can lead to use after free issues. Add a warning if
>>>>>>>>>>> drivers try to destroy a scheduler which still has work
>>>>>>>>>>> pushed to the HW.
>>>>>>>>>> Did you have time yet to look into my proposed
>>>>>>>>>> waitque-solution?
>>>>>>>>> I don't remember seeing anything. What have I missed?
>>>>>>>> https://lore.kernel.org/all/20240903094446.29797-2-pstanner@redhat.com/
>>>>>>> Mhm, I didn't got that in my inbox for some reason.
>>>>>>>
>>>>>>> Interesting approach, I'm just not sure if we can or should
>>>>>>> wait in drm_sched_fini().
>>>>> We do agree that jobs still pending when drm_sched_fini() starts
>>>>> is always a bug, right?
>>>> Correct, the question is how to avoid that.
>>>>
>>>>> If so, what are the disadvantages of waiting in drm_sched_fini()?
>>>>> We could block buggy drivers as I see it. Which wouldn't be good,
>>>>> but could then be fixed on drivers' site.
>>>> Sima explained that pretty well: Don't block in fops->close, do
>>>> that in fops->flush instead.
>>> I agree that we shouldn't block in close(), but this effectively
>>> means that we need to reference count the scheduler, right?
>>>
>>> Otherwise, if we allow to just skip / interrupt the teardown, we
>>> can still leak memory.
>> Having thought about it, I agree with Danilo. Having something that
>> shall wait on green light, but can be interrupted, is no guarantee
>> and therefore not a feasible solution.
>>
>> To break down the solution space, these seem to be our options:
>>    1. We have something (either drm_sched_fini() or a helper, e.g.,
>>       drm_sched_flush()) that definitely blocks until the pending
>>       list has become empty.
>>    2. We have jobs reference-count the scheduler, so the latter can
>>       outlive the driver and will be freed some time later.
>>
>> Can anyone think of a third solution?
>>
>>
>> Solution #1 has the problem of obviously blocking unconditionally if
>> the driver didn't make sure that all fences will be signaled within
>> reasonable time. In my opinion, this would actually be an advantage,
>> because it will be *very* noticable and force users to repair their
>> driver. The driver *has* to guarantee that all fences will be
>> signaled. If the driver has to do fishy things, having the blocking
>> outsourced to the helper drm_sched_flush() would allow them to
>> circumvent that.
>>
>> Solution #2 has the problem of backend_ops.free_job() potentially
>> using driver-data after the driver is gone, causing UAF. So with
>> this solutions all drivers would have to be aware of the issue and
>> handle it through one of DRMs primitives dedicated to such problems.
>>
>>
>> Currently, all drivers either work around the problem internally or
>> simply ignore it, it seems.
>>
>> So I'd argue that both solutions are an improvement over the
>> existing situation. My preference would be #1.
>>
>>
>> Opinions?
>>
>> P.
>>
>>>> One issue this solves is that when you send a SIGTERM the tear
>>>> down handling first flushes all the FDs and then closes them.
>>>>
>>>> So if flushing the FDs blocks because the process initiated
>>>> sending a terabyte of data over a 300bps line (for example) you
>>>> can still throw a SIGKILL and abort that as well.
>>>>
>>>> If you would block in fops-close() that SIGKILL won't have any
>>>> effect any more because by the time close() is called the process
>>>> is gone and signals are already blocked.
>>>>
>>>> And yes when I learned about that issue I was also buffed that
>>>> handling like this in the UNIX design is nearly 50 years old and
>>>> still applies to today.
>>>>>>> Probably better to make that a separate function, something
>>>>>>> like drm_sched_flush() or similar.
>>>>> We could do that. Such a function could then be called by drivers
>>>>> which are not sure whether all jobs are done before they start
>>>>> tearing down.
>>>> Yes exactly that's the idea. And give that flush function a return
>>>> code so that it can return -EINTR.
>>>>
>>>>>> Yeah I don't think we should smash this into drm_sched_fini
>>>>>> unconditionally. I think conceptually there's about three cases:
>>>>>>
>>>>>> - Ringbuffer schedules. Probably want everything as-is, because
>>>>>>   drm_sched_fini is called long after all the entities are gone
>>>>>>   in drm_device cleanup.
>>>>>>
>>>>>> - fw scheduler hardware with preemption support. There we
>>>>>>   probably want to nuke the context by setting the tdr timeout
>>>>>>   to zero (or maybe just as long as context preemption takes to
>>>>>>   be efficient), and relying on the normal gpu reset flow to
>>>>>>   handle things. drm_sched_entity_flush kinda does this, except
>>>>>>   not really and it's a lot more focused on the ringbuffer
>>>>>>   context. So maybe we want a new drm_sched_entity_kill.
>>>>>>
>>>>>>   For this case calling drm_sched_fini() after the 1:1 entity
>>>>>>   is gone should not find any linger jobs, it would actually be
>>>>>>   a bug somewhere if there's a job lingering. Maybe a sanity
>>>>>>   check that there's not just no jobs lingering, but also no
>>>>>>   entity left would be good here?
>>>>> The check for lingering ones is in Christian's patch here IISC.
>>>>> At which position would you imagine the check for the entity
>>>>> being performed?
>>>>>
>>>>>> - fw scheduler without preemption support. There we kinda need
>>>>>>   the drm_sched_flush, except blocking in fops->close is not
>>>>>>   great. So instead I think the following is better:
>>>>>>   1. drm_sched_entity_stopped, which only stops new submissions
>>>>>>      (for paranoia) but doesn't tear down the entity
>>>>> Who would call that function?
>>>>> Drivers using it voluntarily could just as well stop accepting
>>>>> new jobs from userspace to their entities, couldn't they?
>>>>>
>>>>>>   2. drm_dev_get
>>>>>>   3. launch a worker which does a) drm_sched_flush (or
>>>>>>      drm_sched_entity_flush or whatever we call it) b)
>>>>>>      drm_sched_entity_fini + drm_sched_fini c) drm_dev_put
>>>>>>
>>>>>> Note that semantically this implements the refcount in the other
>>>>>> path from Phillip:
>>>>>>
>>>>>> https://lore.kernel.org/all/20240903094531.29893-2-pstanner@redhat.com/
>>>>>> Except it doesn't impose refcount on everyone else who doesn't
>>>>>> need it, and it doesn't even impose refcounting on drivers that
>>>>>> do need it because we use drm_sched_flush and a worker to
>>>>>> achieve the same.
>>>>> I indeed wasn't happy with the refcount approach for that reason,
>>>>> agreed.
>>>>>
>>>>>> Essentially helper functions for the common use-cases instead of
>>>>>> trying to solve them all by putting drm_sched_flush as a
>>>>>> potentially very blocking function into drm_sched_fini.
>>>>> I'm still not able to see why it blocking would be undesired – as
>>>>> far as I can see, it is only invoked on driver teardown, so not
>>>>> during active operation. Teardown doesn't happen that often, and
>>>>> it can (if implemented correctly) only block until the driver's
>>>>> code has signaled the last fences. If that doesn't happen, the
>>>>> block would reveal a bug.
>>>>>
>>>>> But don't get me wrong: I don't want to *push* this solution. I
>>>>> just want to understand when it could become a problem.
>>>>>
>>>>>
>>>>> Wouldn't an explicitly blocking, separate function like
>>>>> drm_sched_flush() or drm_sched_fini_flush() be a small, doable
>>>>> step towards the right direction?
>>>> I think that this is the right thing to do, yes.
>>>>
>>>>>>>>>>> When there are still entities with jobs the situation is
>>>>>>>>>>> even worse
>>>>>>>>>>> since the dma_fences for those jobs can never signal we can
>>>>>>>>>>> just
>>>>>>>>>>> choose between potentially locking up core memory
>>>>>>>>>>> management and
>>>>>>>>>>> random memory corruption. When drivers really mess it up
>>>>>>>>>>> that well
>>>>>>>>>>> let them run into a BUG_ON().
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>>>>>>>>>> ---
>>>>>>>>>>>   drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++++-
>>>>>>>>>>>   1 file changed, 18 insertions(+), 1 deletion(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>> index f093616fe53c..8a46fab5cdc8 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>>>>>>>> @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct
>>>>>>>>>>> drm_gpu_scheduler *sched)
>>>>>>>>>> I agree with Sima that it should first be documented in the
>>>>>>>>>> function's docstring what the user is expected to have done
>>>>>>>>>> before calling the function.
>>>>>>>>> Good point, going to update the documentation as well.
>>>>>>>> Cool thing, thx.
>>>>>>>> Would be great if everything (not totally trivial) necessary
>>>>>>>> to be done before _fini() is mentioned.
>>>>>>>>
>>>>>>>> One could also think about providing a hint at how the driver
>>>>>>>> can do that. AFAICS the only way for the driver to ensure that
>>>>>>>> is to maintain its own, separate list of submitted jobs.
>>>>>>> Even with a duplicated pending list it's actually currently
>>>>>>> impossible to do this fully cleanly.
>>>>>>>
>>>>>>> The problem is that the dma_fence object gives no guarantee
>>>>>>> when callbacks are processed, e.g. they can be both processed
>>>>>>> from interrupt context as well as from a CPU which called
>>>>>>> dma_fence_is_signaled().
>>>>>>>
>>>>>>> So when a driver (or drm_sched_fini) waits for the last
>>>>>>> submitted fence it actually can be that the drm_sched object
>>>>>>> still needs to do some processing.
>>>>>>> See the hack in amdgpu_vm_tlb_seq() for more background on the
>>>>>>> problem.
>>>>> Oh dear ^^'
>>>>> We better work towards fixing that centrally
>>>>>
>>>>> Thanks,
>>>>> P.
>>>>>
>>>>>
>>>>>> So I thought this should be fairly easy because of the sched
>>>>>> hw/public fence split: If we wait for both all jobs to finish
>>>>>> and for all the sched work/tdr work to finish, and we make sure
>>>>>> there's no entity existing that's not yet stopped we should
>>>>>> catch them all?
>>>> Unfortunately not.
>>>>
>>>> Even when you do a dma_fence_wait() on the last submission it can
>>>> still be that another CPU is executing the callbacks to wake up
>>>> the scheduler work item and cleanup the job.
>>>>
>>>> That's one of the reasons why I think the design of keeping the
>>>> job alive is so extremely awkward. The dma_fence as representation
>>>> of the hw submission has a much better defined state machine and
>>>> lifetime.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>>> Or at least I think it's a bug if any other code even tries to
>>>>>> touch the hw fence.
>>>>>>
>>>>>> If you have any other driver code which relies on the rcu
>>>>>> freeing then I think that's just a separate concern and we can
>>>>>> ignore that here since the fences themselves will till get
>>>>>> rcu-delay freed even if drm_sched_fini has finished.
>>>>>> -Sima
>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>> P.
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Christian.
>>>>>>>>>
>>>>>>>>>> P.
>>>>>>>>>>
>>>>>>>>>>>  	drm_sched_wqueue_stop(sched);
>>>>>>>>>>> +	/*
>>>>>>>>>>> +	 * Tearing down the scheduler wile there are still
>>>>>>>>>>> unprocessed jobs can
>>>>>>>>>>> +	 * lead to use after free issues in the scheduler
>>>>>>>>>>> fence.
>>>>>>>>>>> +	 */
>>>>>>>>>>> +	WARN_ON(!list_empty(&sched->pending_list));
>>>>>>>>>>> +
>>>>>>>>>>>  	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
>>>>>>>>>>>  		struct drm_sched_rq *rq = sched->sched_rq[i];
>>>>>>>>>>>
>>>>>>>>>>>  		spin_lock(&rq->lock);
>>>>>>>>>>> -		list_for_each_entry(s_entity, &rq->entities, list)
>>>>>>>>>>> +		list_for_each_entry(s_entity, &rq->entities, list) {
>>>>>>>>>>> +			/*
>>>>>>>>>>> +			 * The justification for this BUG_ON() is that tearing
>>>>>>>>>>> +			 * down the scheduler while jobs are pending leaves
>>>>>>>>>>> +			 * dma_fences unsignaled. Since we have dependencies
>>>>>>>>>>> +			 * from the core memory management to eventually signal
>>>>>>>>>>> +			 * dma_fences this can trivially lead to a system wide
>>>>>>>>>>> +			 * stop because of a locked up memory management.
>>>>>>>>>>> +			 */
>>>>>>>>>>> +			BUG_ON(spsc_queue_count(&s_entity->job_queue));
>>>>>>>>>>> +
>>>>>>>>>>>  			/*
>>>>>>>>>>>  			 * Prevents reinsertion and marks job_queue as idle,
>>>>>>>>>>>  			 * it will removed from rq in drm_sched_entity_fini
>>>>>>>>>>>  			 * eventually
>>>>>>>>>>>  			 */
>>>>>>>>>>>  			s_entity->stopped = true;
>>>>>>>>>>> +		}
>>>>>>>>>>>  		spin_unlock(&rq->lock);
>>>>>>>>>>>  		kfree(sched->sched_rq[i]);
>>>>>>>>>>>  	}
On Fri, 2024-11-15 at 12:26 +0100, Christian König wrote:
> Interesting, those mails just showed up in my inbox as unread. More
> than 14 days after you send it.
>
> So sorry for the late reply.

I smell google mail :p

>
> Am 29.10.24 um 08:22 schrieb Philipp Stanner:
> > Christian, Sima?
> >
> > Matthew? (+CC)
> >
> > Opinions on the below?
> >
> > tl;dr:
> > I still think it's a good thing to detectably block in
> > drm_sched_fini(), or at the very least drm_sched_flush(), because
> > then you'll find out that the driver is broken and can repair it.
>
> As discussed in private mail as well, the third option not mentioned
> so far is to completely drop the free_job callback.

If it's really possible to do that I think this sounds like a great
solution at first glance because we could decrease complexity quite
substantially.

>
> A good bunch of the problems we have here are caused by abusing the
> job as state for executing the submission on the hardware.
>
> The original design idea of the scheduler instead was to only have
> the job as intermediate state between submission and picking what to
> execute next. E.g. the scheduler in that view is just a pipeline were
> you push jobs in on one end and jobs come out on the other end.

So let's get a bit more precise about this:

   1. Driver enqueues with drm_sched_job_arm()
   2. job ends up in pending_list
   3. Sooner or later scheduler calls run_job()
   4. Job is pushed to hardware
   5. Fence gets signaled
   6. ???

What would the "come out on the other end" part you describe look
like? How would jobs get removed from pending_list and, accordingly,
how would we avoid leaks?

>
> In that approach the object which represents the hardware execution
> is the dma_fence instead of the job. And the dma_fence has a well
> defined lifetime, error handling, etc...
>
> So when we go back to this original approach it would mean that we
> don't need to wait for any cleanup because the scheduler wouldn't be
> involved in the execution of the jobs on the hardware any more.

It would be involved (to speak precisely) in the sense of the
scheduler still being the one who pushes the job onto the hardware,
agreed?

>
> The worst thing that could happen is that the driver messes things up
> and still has not executed job in an entity,

I can't fully follow.

So in your mind, the driver would personally set the dma_fence
callback and hereby be informed about the job being completed, right?

But you wouldn't want to aim for getting rid of struct drm_sched_job
completely, or would you?

Regards,
P.

> but in that case we could
> just warn, block for the hardware fence and then kill all pending
> submissions with an error.- And this blocking on the hardware fence
> can't deadlock because we know that it is a hardware fence with
> certain properties and not some random free_job callback where we
> don't have any idea what locks the driver need.
>
> Regards,
> Christian.
>
> >
> > P.
> >
> >
> > On Fri, 2024-10-18 at 14:07 +0200, Philipp Stanner wrote:
> > > On Mon, 2024-10-14 at 16:56 +0200, Danilo Krummrich wrote:
> > > > On Fri, Sep 27, 2024 at 11:04:48AM +0200, Christian König
> > > > wrote:
> > > > > Am 25.09.24 um 16:53 schrieb Philipp Stanner:
> > > > > > On Tue, 2024-09-24 at 13:18 +0200, Simona Vetter wrote:
> > > > > > > On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König
> > > > > > > wrote:
> > > > > > > > Am 20.09.24 um 15:26 schrieb Philipp Stanner:
> > > > > > > > > On Fri, 2024-09-20 at 12:33 +0200, Christian König
> > > > > > > > > wrote:
> > > > > > > > > > Am 20.09.24 um 10:57 schrieb Philipp Stanner:
> > > > > > > > > > > On Wed, 2024-09-18 at 15:39 +0200, Christian
> > > > > > > > > > > König wrote:
> > > > > > > > > > > > Tearing down the scheduler with jobs still on
> > > > > > > > > > > > the pending list can lead to use after free
> > > > > > > > > > > > issues. Add a warning if drivers try to
> > > > > > > > > > > > destroy a scheduler which still has work pushed
> > > > > > > > > > > > to the HW.
> > > > > > > > > > > Did you have time yet to look into my proposed
> > > > > > > > > > > waitque-solution?
> > > > > > > > > > I don't remember seeing anything. What have I
> > > > > > > > > > missed?
> > > > > > > > > https://lore.kernel.org/all/20240903094446.29797-2-pstanner@redhat.com/
> > > > > > > > Mhm, I didn't got that in my inbox for some reason.
> > > > > > > >
> > > > > > > > Interesting approach, I'm just not sure if we can or
> > > > > > > > should wait in drm_sched_fini().
> > > > > > We do agree that jobs still pending when drm_sched_fini()
> > > > > > starts is always a bug, right?
> > > > > Correct, the question is how to avoid that.
> > > > >
> > > > > > If so, what are the disadvantages of waiting in
> > > > > > drm_sched_fini()? We could block buggy drivers as I see it.
> > > > > > Which wouldn't be good, but could then be fixed on drivers'
> > > > > > site.
> > > > > Sima explained that pretty well: Don't block in fops->close,
> > > > > do that in fops->flush instead.
> > > > I agree that we shouldn't block in close(), but this
> > > > effectively means that we need to reference count the
> > > > scheduler, right?
> > > >
> > > > Otherwise, if we allow to just skip / interrupt the teardown,
> > > > we can still leak memory.
> > > Having thought about it, I agree with Danilo. Having something
> > > that shall wait on green light, but can be interrupted, is no
> > > guarantee and therefore not a feasible solution.
> > >
> > > To break down the solution space, these seem to be our options:
> > >    1. We have something (either drm_sched_fini() or a helper,
> > >       e.g., drm_sched_flush()) that definitely blocks until the
> > >       pending list has become empty.
> > >    2. We have jobs reference-count the scheduler, so the latter
> > >       can outlive the driver and will be freed some time later.
> > >
> > > Can anyone think of a third solution?
> > >
> > >
> > > Solution #1 has the problem of obviously blocking unconditionally
> > > if the driver didn't make sure that all fences will be signaled
> > > within reasonable time. In my opinion, this would actually be an
> > > advantage, because it will be *very* noticable and force users to
> > > repair their driver. The driver *has* to guarantee that all
> > > fences will be signaled.
> > > If the driver has to do fishy things, having the blocking
> > > outsourced to the helper drm_sched_flush() would allow them to
> > > circumvent that.
> > >
> > > Solution #2 has the problem of backend_ops.free_job()
> > > potentially using driver-data after the driver is gone, causing
> > > UAF. So with this solutions all drivers would have to be aware of
> > > the issue and handle it through one of DRMs primitives dedicated
> > > to such problems.
> > >
> > >
> > > Currently, all drivers either work around the problem internally
> > > or simply ignore it, it seems.
> > >
> > > So I'd argue that both solutions are an improvement over the
> > > existing situation. My preference would be #1.
> > >
> > >
> > > Opinions?
> > >
> > > P.
> > >
> > > > > One issue this solves is that when you send a SIGTERM the
> > > > > tear down handling first flushes all the FDs and then closes
> > > > > them.
> > > > >
> > > > > So if flushing the FDs blocks because the process initiated
> > > > > sending a terabyte of data over a 300bps line (for example)
> > > > > you can still throw a SIGKILL and abort that as well.
> > > > >
> > > > > If you would block in fops-close() that SIGKILL won't have
> > > > > any effect any more because by the time close() is called the
> > > > > process is gone and signals are already blocked.
> > > > >
> > > > > And yes when I learned about that issue I was also buffed
> > > > > that handling like this in the UNIX design is nearly 50 years
> > > > > old and still applies to today.
> > > > > > > > Probably better to make that a separate function,
> > > > > > > > something like drm_sched_flush() or similar.
> > > > > > We could do that. Such a function could then be called by
> > > > > > drivers which are not sure whether all jobs are done before
> > > > > > they start tearing down.
> > > > > Yes exactly that's the idea. And give that flush function a
> > > > > return code so that it can return -EINTR.
> > > > >
> > > > > > > Yeah I don't think we should smash this into
> > > > > > > drm_sched_fini unconditionally. I think conceptually
> > > > > > > there's about three cases:
> > > > > > >
> > > > > > > - Ringbuffer schedules. Probably want everything as-is,
> > > > > > >   because drm_sched_fini is called long after all the
> > > > > > >   entities are gone in drm_device cleanup.
> > > > > > >
> > > > > > > - fw scheduler hardware with preemption support. There we
> > > > > > >   probably want to nuke the context by setting the tdr
> > > > > > >   timeout to zero (or maybe just as long as context
> > > > > > >   preemption takes to be efficient), and relying on the
> > > > > > >   normal gpu reset flow to handle things.
> > > > > > >   drm_sched_entity_flush kinda does this, except not
> > > > > > >   really and it's a lot more focused on the ringbuffer
> > > > > > >   context. So maybe we want a new drm_sched_entity_kill.
> > > > > > >
> > > > > > >   For this case calling drm_sched_fini() after the 1:1
> > > > > > >   entity is gone should not find any linger jobs, it
> > > > > > >   would actually be a bug somewhere if there's a job
> > > > > > >   lingering. Maybe a sanity check that there's not just
> > > > > > >   no jobs lingering, but also no entity left would be
> > > > > > >   good here?
> > > > > > The check for lingering ones is in Christian's patch here
> > > > > > IISC.
> > > > > > At which position would you imagine the check for the
> > > > > > entity being performed?
> > > > > >
> > > > > > > - fw scheduler without preemption support.
There we kinda > > > > > > > need > > > > > > > the > > > > > > > drm_sched_flush, except blocking in fops->close is > > > > > > > not > > > > > > > great. So > > > > > > > instead > > > > > > > I think the following is better: > > > > > > > 1. drm_sched_entity_stopped, which only stops new > > > > > > > submissions (for > > > > > > > paranoia) but doesn't tear down the entity > > > > > > Who would call that function? > > > > > > Drivers using it voluntarily could just as well stop > > > > > > accepting > > > > > > new jobs > > > > > > from userspace to their entities, couldn't they? > > > > > > > > > > > > > 2. drm_dev_get > > > > > > > 3. launch a worker which does a) drm_sched_flush (or > > > > > > > drm_sched_entity_flush or whatever we call it) b) > > > > > > > drm_sched_entity_fini > > > > > > > + drm_sched_fini c) drm_dev_put > > > > > > > > > > > > > > Note that semantically this implements the refcount > > > > > > > in the > > > > > > > other > > > > > > > path > > > > > > > from Phillip: > > > > > > > > > > > > > > https://lore.kernel.org/all/20240903094531.29893-2-pstanner@redhat.com/ > > > > > > > Except it doesn't impose refcount on everyone else > > > > > > > who > > > > > > > doesn't need > > > > > > > it, > > > > > > > and it doesn't even impose refcounting on drivers > > > > > > > that do > > > > > > > need it > > > > > > > because we use drm_sched_flush and a worker to > > > > > > > achieve the > > > > > > > same. > > > > > > I indeed wasn't happy with the refcount approach for that > > > > > > reason, > > > > > > agreed. > > > > > > > > > > > > > Essentially helper functions for the common use-cases > > > > > > > instead > > > > > > > of > > > > > > > trying to > > > > > > > solve them all by putting drm_sched_flush as a > > > > > > > potentially > > > > > > > very > > > > > > > blocking > > > > > > > function into drm_sched_fini. 
> > > > > > I'm still not able to see why it blocking would be > > > > > > undesired – > > > > > > as > > > > > > far > > > > > > as I can see, it is only invoked on driver teardown, so not > > > > > > during > > > > > > active operation. Teardown doesn't happen that often, and > > > > > > it > > > > > > can > > > > > > (if > > > > > > implemented correctly) only block until the driver's code > > > > > > has > > > > > > signaled > > > > > > the last fences. If that doesn't happen, the block would > > > > > > reveal > > > > > > a > > > > > > bug. > > > > > > > > > > > > But don't get me wrong: I don't want to *push* this > > > > > > solution. I > > > > > > just > > > > > > want to understand when it could become a problem. > > > > > > > > > > > > > > > > > > Wouldn't an explicitly blocking, separate function like > > > > > > drm_sched_flush() or drm_sched_fini_flush() be a small, > > > > > > doable > > > > > > step > > > > > > towards the right direction? > > > > > I think that this is the right thing to do, yes. > > > > > > > > > > > > > > > > > When there are still entities with jobs the > > > > > > > > > > > > situation > > > > > > > > > > > > is > > > > > > > > > > > > even > > > > > > > > > > > > worse > > > > > > > > > > > > since the dma_fences for those jobs can never > > > > > > > > > > > > signal > > > > > > > > > > > > we can > > > > > > > > > > > > just > > > > > > > > > > > > choose between potentially locking up core > > > > > > > > > > > > memory > > > > > > > > > > > > management and > > > > > > > > > > > > random memory corruption. When drivers really > > > > > > > > > > > > mess > > > > > > > > > > > > it > > > > > > > > > > > > up > > > > > > > > > > > > that > > > > > > > > > > > > well > > > > > > > > > > > > let them run into a BUG_ON(). 
> > > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Christian > > > > > > > > > > > > König<christian.koenig@amd.com> > > > > > > > > > > > > --- > > > > > > > > > > > > drivers/gpu/drm/scheduler/sched_main.c | > > > > > > > > > > > > 19 > > > > > > > > > > > > ++++++++++++++++++- > > > > > > > > > > > > 1 file changed, 18 insertions(+), 1 > > > > > > > > > > > > deletion(-) > > > > > > > > > > > > > > > > > > > > > > > > diff --git > > > > > > > > > > > > a/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > > > > > > b/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > > > > > > index f093616fe53c..8a46fab5cdc8 100644 > > > > > > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > > > > > > @@ -1333,17 +1333,34 @@ void > > > > > > > > > > > > drm_sched_fini(struct > > > > > > > > > > > > drm_gpu_scheduler > > > > > > > > > > > > *sched) > > > > > > > > > > > I agree with Sima that it should first be > > > > > > > > > > > documented > > > > > > > > > > > in > > > > > > > > > > > the > > > > > > > > > > > function's > > > > > > > > > > > docstring what the user is expected to have done > > > > > > > > > > > before > > > > > > > > > > > calling the > > > > > > > > > > > function. > > > > > > > > > > Good point, going to update the documentation as > > > > > > > > > > well. > > > > > > > > > Cool thing, thx. > > > > > > > > > Would be great if everything (not totally trivial) > > > > > > > > > necessary to > > > > > > > > > be done > > > > > > > > > before _fini() is mentioned. > > > > > > > > > > > > > > > > > > One could also think about providing a hint at how > > > > > > > > > the > > > > > > > > > driver can > > > > > > > > > do > > > > > > > > > that. AFAICS the only way for the driver to ensure > > > > > > > > > that > > > > > > > > > is > > > > > > > > > to > > > > > > > > > maintain > > > > > > > > > its own, separate list of submitted jobs. 
> > > > > > > > Even with a duplicated pending list it's actually > > > > > > > > currently > > > > > > > > impossible to do > > > > > > > > this fully cleanly. > > > > > > > > > > > > > > > > The problem is that the dma_fence object gives no > > > > > > > > guarantee > > > > > > > > when > > > > > > > > callbacks > > > > > > > > are processed, e.g. they can be both processed from > > > > > > > > interrupt > > > > > > > > context as > > > > > > > > well as from a CPU which called > > > > > > > > dma_fence_is_signaled(). > > > > > > > > > > > > > > > > So when a driver (or drm_sched_fini) waits for the last > > > > > > > > submitted > > > > > > > > fence it > > > > > > > > actually can be that the drm_sched object still needs > > > > > > > > to do > > > > > > > > some > > > > > > > > processing. > > > > > > > > See the hack in amdgpu_vm_tlb_seq() for more background > > > > > > > > on > > > > > > > > the > > > > > > > > problem. > > > > > > Oh dear ^^' > > > > > > We better work towards fixing that centrally > > > > > > > > > > > > Thanks, > > > > > > P. > > > > > > > > > > > > > > > > > > > So I thought this should be fairly easy because of the > > > > > > > sched > > > > > > > hw/public > > > > > > > fence split: If we wait for both all jobs to finish and > > > > > > > for > > > > > > > all > > > > > > > the > > > > > > > sched > > > > > > > work/tdr work to finish, and we make sure there's no > > > > > > > entity > > > > > > > existing > > > > > > > that's not yet stopped we should catch them all? > > > > > Unfortunately not. > > > > > > > > > > Even when you do a dma_fence_wait() on the last submission it > > > > > can > > > > > still be > > > > > that another CPU is executing the callbacks to wake up the > > > > > scheduler work > > > > > item and cleanup the job. > > > > > > > > > > That's one of the reasons why I think the design of keeping > > > > > the > > > > > job > > > > > alive is > > > > > so extremely awkward. 
The dma_fence as representation of the > > > > > hw > > > > > submission > > > > > has a much better defined state machine and lifetime. > > > > > > > > > > Regards, > > > > > Christian. > > > > > > > > > > > > Or at least I think > > > > > > > it's > > > > > > > a bug if any other code even tries to touch the hw fence. > > > > > > > > > > > > > > If you have any other driver code which relies on the rcu > > > > > > > freeing > > > > > > > then I > > > > > > > think that's just a separate concern and we can ignore > > > > > > > that > > > > > > > here > > > > > > > since the > > > > > > > fences themselves will till get rcu-delay freed even if > > > > > > > drm_sched_fini has > > > > > > > finished. > > > > > > > -Sima > > > > > > > > > > > > > > > Regards, > > > > > > > > Christian. > > > > > > > > > > > > > > > > > P. > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > Christian. > > > > > > > > > > > > > > > > > > > > > P. > > > > > > > > > > > > > > > > > > > > > > > drm_sched_wqueue_stop(sched); > > > > > > > > > > > > + /* > > > > > > > > > > > > + * Tearing down the scheduler wile there are > > > > > > > > > > > > still > > > > > > > > > > > > unprocessed jobs can > > > > > > > > > > > > + * lead to use after free issues in the > > > > > > > > > > > > scheduler > > > > > > > > > > > > fence. 
> > > > > > > > > > > > + */ > > > > > > > > > > > > + WARN_ON(!list_empty(&sched->pending_list)); > > > > > > > > > > > > + > > > > > > > > > > > > for (i = DRM_SCHED_PRIORITY_KERNEL; i < > > > > > > > > > > > > sched- > > > > > > > > > > > > > num_rqs; > > > > > > > > > > > > i++) > > > > > > > > > > > > { > > > > > > > > > > > > struct drm_sched_rq *rq = sched- > > > > > > > > > > > > > sched_rq[i]; > > > > > > > > > > > > spin_lock(&rq->lock); > > > > > > > > > > > > - list_for_each_entry(s_entity, &rq- > > > > > > > > > > > > > entities, > > > > > > > > > > > > list) > > > > > > > > > > > > + list_for_each_entry(s_entity, &rq- > > > > > > > > > > > > > entities, > > > > > > > > > > > > list) { > > > > > > > > > > > > + /* > > > > > > > > > > > > + * The justification for this > > > > > > > > > > > > BUG_ON() > > > > > > > > > > > > is > > > > > > > > > > > > that tearing > > > > > > > > > > > > + * down the scheduler while jobs > > > > > > > > > > > > are > > > > > > > > > > > > pending > > > > > > > > > > > > leaves > > > > > > > > > > > > + * dma_fences unsignaled. Since we > > > > > > > > > > > > have > > > > > > > > > > > > dependencies > > > > > > > > > > > > + * from the core memory management > > > > > > > > > > > > to > > > > > > > > > > > > eventually signal > > > > > > > > > > > > + * dma_fences this can trivially > > > > > > > > > > > > lead to > > > > > > > > > > > > a > > > > > > > > > > > > system wide > > > > > > > > > > > > + * stop because of a locked up > > > > > > > > > > > > memory > > > > > > > > > > > > management. 
> > > > > > > > > > > > + */ > > > > > > > > > > > > + BUG_ON(spsc_queue_count(&s_entity- > > > > > > > > > > > > > job_queue)); > > > > > > > > > > > > + > > > > > > > > > > > > /* > > > > > > > > > > > > * Prevents reinsertion and marks > > > > > > > > > > > > job_queue > > > > > > > > > > > > as idle, > > > > > > > > > > > > * it will removed from rq in > > > > > > > > > > > > drm_sched_entity_fini > > > > > > > > > > > > * eventually > > > > > > > > > > > > */ > > > > > > > > > > > > s_entity->stopped = true; > > > > > > > > > > > > + } > > > > > > > > > > > > spin_unlock(&rq->lock); > > > > > > > > > > > > kfree(sched->sched_rq[i]); > > > > > > > > > > > > } >
Am 15.11.24 um 14:55 schrieb Philipp Stanner: > [SNIP] >> A good bunch of the problems we have here are caused by abusing the >> job >> as state for executing the submission on the hardware. >> >> The original design idea of the scheduler instead was to only have >> the >> job as intermediate state between submission and picking what to >> execute >> next. E.g. the scheduler in that view is just a pipeline were you >> push >> jobs in on one end and jobs come out on the other end. > So let's get a bit more precise about this: > 1. Driver enqueues with drm_sched_job_arm() > 2. job ends up in pending_list > 3. Sooner or later scheduler calls run_job() > 4. Job is pushed to hardware > 5. Fence gets signaled > 6. ??? > > What would the "come out on the other end" part you describe look like? > > How would jobs get removed from pending_list and, accordingly, how > would we avoid leaks? Let me describe the full solution a bit further down. >> In that approach the object which represents the hardware execution >> is >> the dma_fence instead of the job. And the dma_fence has a well >> defined >> lifetime, error handling, etc... >> >> So when we go back to this original approach it would mean that we >> don't >> need to wait for any cleanup because the scheduler wouldn't be >> involved >> in the execution of the jobs on the hardware any more. > It would be involved (to speak precisely) in the sense of the scheduler > still being the one who pushes the job onto the hardware, agreed? Yes, exactly. See in the original design the "job" wasn't even a defined structure, but rather just a void*. So basically what the driver did was telling the scheduler here I have a bunch of different void* please tell me which one to run first. And apart from being this identifier this void* had absolutely no meaning for the scheduler. >> The worst thing that could happen is that the driver messes things up >> and still has not executed job in an entity, > I can't fully follow. 
>
> So in your mind, the driver would personally set the dma_fence callback
> and hereby be informed about the job being completed, right?

No. The run_job callback would still return the hw fence so that the
scheduler can install the callback and so gets informed when the next
job could be run.

> But you wouldn't want to aim for getting rid of struct drm_sched_job
> completely, or would you?

No, the drm_sched_job structure was added to aid the single producer
single consumer queue and so made it easier to put the jobs into a
container.

In my mind the full solution for running jobs looks like this:

1. Driver enqueues with drm_sched_job_arm()
2. Job ends up in pending_list
3. Sooner or later the scheduler calls run_job()
4. In return the scheduler gets a dma_fence representing the resulting
hardware operation.
5. This fence is put into a container to keep around what the hw is
actually executing.
6. At some point the fence gets signaled, informing the scheduler that
the next job can be pushed if enough credits are now available.

There is no free_job callback any more because after run_job is called
the scheduler is done with the job, and only the dma_fence which
represents the actual HW operation is the object the scheduler now
works with.

We would still need something like a kill_job callback which is used
when an entity is released without flushing all jobs (see
drm_sched_entity_kill()), but that is then only used for a specific
corner case.

Blocking for the cleanup in drm_sched_fini() then becomes a trivial
dma_fence_wait() on the remaining unsignaled HW fences and eventually
on the latest killed fence.

The problem with this approach is that the idea of re-submitting jobs
in a GPU reset by the scheduler is then basically dead. But to be
honest that never worked correctly.

See the deprecated warning I already put on drm_sched_resubmit_jobs().
The job lifetime is not really well defined because of this, see the
free_guilty hack as well.

Regards,
Christian.

>
>
> Grüße,
> P.
>
On Fri, 2024-11-15 at 15:39 +0100, Christian König wrote: > Am 15.11.24 um 14:55 schrieb Philipp Stanner: > > > [SNIP] > > > > > > A good bunch of the problems we have here are caused by abusing > > > the > > > job > > > as state for executing the submission on the hardware. > > > > > > The original design idea of the scheduler instead was to only > > > have > > > the > > > job as intermediate state between submission and picking what to > > > execute > > > next. E.g. the scheduler in that view is just a pipeline were you > > > push > > > jobs in on one end and jobs come out on the other end. > > > > > > > So let's get a bit more precise about this: > > 1. Driver enqueues with drm_sched_job_arm() > > 2. job ends up in pending_list > > 3. Sooner or later scheduler calls run_job() > > 4. Job is pushed to hardware > > 5. Fence gets signaled > > 6. ??? > > > > What would the "come out on the other end" part you describe look > > like? > > > > How would jobs get removed from pending_list and, accordingly, how > > would we avoid leaks? > > > > Let me describe the full solution a bit further down. > > > > > > > > > > > In that approach the object which represents the hardware > > > execution > > > is > > > the dma_fence instead of the job. And the dma_fence has a well > > > defined > > > lifetime, error handling, etc... > > > > > > So when we go back to this original approach it would mean that > > > we > > > don't > > > need to wait for any cleanup because the scheduler wouldn't be > > > involved > > > in the execution of the jobs on the hardware any more. > > > > > > > It would be involved (to speak precisely) in the sense of the > > scheduler > > still being the one who pushes the job onto the hardware, agreed? > > > > Yes, exactly. > > See in the original design the "job" wasn't even a defined > structure, but rather just a void*. 
> > So basically what the driver did was telling the scheduler here I > have a bunch of different void* please tell me which one to run > first. > > And apart from being this identifier this void* had absolutely no > meaning for the scheduler. Interesting.. > > > > > > > The worst thing that could happen is that the driver messes > > > things up > > > and still has not executed job in an entity, > > > > > > > I can't fully follow. > > > > So in your mind, the driver would personally set the dma_fence > > callback > > and hereby be informed about the job being completed, right? > > > > No. The run_job callback would still return the hw fence so that the > scheduler can install the callback and so gets informed when the next > job could be run. > > > > > > But you wouldn't want to aim for getting rid of struct > > drm_sched_job > > completely, or would you? > > > > No, the drm_sched_job structure was added to aid the single producer > single consumer queue and so made it easier to put the jobs into a > container. > > > In my mind the full solution for running jobs looks like this: > > 1. Driver enqueues with drm_sched_job_arm() > 2. job ends up in pending_list > 3. Sooner or later scheduler calls run_job() > 4. In return scheduler gets a dma_fence representing the resulting > hardware operation. > 5, This fence is put into a container to keep around what the hw is > actually executing. > 6. At some point the fence gets signaled informing the scheduler > that the next job can be pushed if enough credits are now available. > > There is no free_job callback any more because after run_job is > called the scheduler is done with the job and only the dma_fence > which represents the actually HW operation is the object the > scheduler now works with. > > We would still need something like a kill_job callback which is used > when an entity is released without flushing all jobs (see > drm_sched_entity_kill()), but that is then only used for a specific > corner case. 
Can't we just limit the scheduler's responsibility to telling the
driver that it has to flush, and if it doesn't it's a bug?

> Blocking for the cleanup in drm_sched_fini() then becomes a trivial
> dma_fence_wait() on the remaining unsignaled HW fences and eventually
> on the latest killed fence.

But that results in exactly the same situation as my waitque solution,
doesn't it? Blocking infinitely in drm_sched_fini():

If the driver doesn't guarantee that all fences will get signaled, then
wait_event() in the waitque solution will block forever. The same with
dma_fence_wait().

Or are you aiming at an interruptible dma_fence_wait(), thereby not
blocking SIGKILL? That then would still result in leaks. So basically
the same situation as with an interruptible drm_sched_flush() function.

(Although I agree that would probably be more elegant)

>
> The problem with this approach is that the idea of re-submitting
> jobs in a GPU reset by the scheduler is then basically dead. But to
> be honest that never worked correctly.
>
> See the deprecated warning I already put on
> drm_sched_resubmit_jobs(). The job lifetime is not really well
> defined because of this, see the free_guilty hack as well.

drm_sched_resubmit_jobs() is being used by 5 drivers currently. So if
we go for this approach we have to port them first. That doesn't sound
trivial to me.

P.

>
> Regards,
> Christian.
>
>>
>>
>> Grüße,
>> P.
>>
>
Am 15.11.24 um 17:07 schrieb Philipp Stanner:
> On Fri, 2024-11-15 at 15:39 +0100, Christian König wrote:
>> Am 15.11.24 um 14:55 schrieb Philipp Stanner:
>>
>>> [SNIP]
>>> But you wouldn't want to aim for getting rid of struct
>>> drm_sched_job
>>> completely, or would you?
>>>
>>
>> No, the drm_sched_job structure was added to aid the single producer
>> single consumer queue and so made it easier to put the jobs into a
>> container.
>>
>>
>> In my mind the full solution for running jobs looks like this:
>>
>> 1. Driver enqueues with drm_sched_job_arm()
>> 2. Job ends up in pending_list
>> 3. Sooner or later the scheduler calls run_job()
>> 4. In return the scheduler gets a dma_fence representing the resulting
>> hardware operation.
>> 5. This fence is put into a container to keep around what the hw is
>> actually executing.
>> 6. At some point the fence gets signaled, informing the scheduler
>> that the next job can be pushed if enough credits are now available.
>>
>> There is no free_job callback any more because after run_job is
>> called the scheduler is done with the job, and only the dma_fence
>> which represents the actual HW operation is the object the
>> scheduler now works with.
>>
>> We would still need something like a kill_job callback which is used
>> when an entity is released without flushing all jobs (see
>> drm_sched_entity_kill()), but that is then only used for a specific
>> corner case.
> Can't we just limit the scheduler's responsibility to telling the
> driver that it has to flush, and if it doesn't it's a bug?

We still need to remove the pending jobs from the scheduler if flushing
times out.

Without timing out flushing and/or aborting when the process was killed
we run into the same problem again that we don't want to block on
_fini().

>> Blocking for the cleanup in drm_sched_fini() then becomes a trivial
>> dma_fence_wait() on the remaining unsignaled HW fences and eventually
>> on the latest killed fence.
> But that results in exactly the same situation as my waitque solution,
> doesn't it?

The key part is that dma_fences have a documented requirement to never
block infinitely.

> Blocking infinitely in drm_sched_fini():
>
> If the driver doesn't guarantee that all fences will get signaled, then
> wait_event() in the waitque solution will block forever. The same with
> dma_fence_wait().
>
> Or are you aiming at an interruptible dma_fence_wait(), thereby not
> blocking SIGKILL?

That is basically what drm_sched_entity_flush() is already doing.

> That then would still result in leaks. So basically the same situation
> as with an interruptible drm_sched_flush() function.
>
> (Although I agree that would probably be more elegant)

If the wait_event really would just be waiting for dma_fences then yes.

The problem with the waitqueue approach is that we need to wait for the
free_job callback, and that callback is normally called from a work item
without holding any additional locks.

When drm_sched_fini starts to block for that we create a rat-tail of
new dependencies, since that one is most likely called from a file
descriptor destruction.

>
>>
>> The problem with this approach is that the idea of re-submitting
>> jobs in a GPU reset by the scheduler is then basically dead. But to
>> be honest that never worked correctly.
>>
>> See the deprecated warning I already put on
>> drm_sched_resubmit_jobs(). The job lifetime is not really well
>> defined because of this, see the free_guilty hack as well.
> drm_sched_resubmit_jobs() is being used by 5 drivers currently. So if
> we go for this approach we have to port them first. That doesn't sound
> trivial to me

Yeah, completely agree. I think the scheduler should provide something
like drm_sched_for_each_pending_fence() which helps to iterate over all
the pending HW fences.

The individual drivers could then decide by themselves what to do,
e.g. upcast the HW fence into the job and push it again or similar.
But what we really need to get away from are those fence pre-requisite violations Sima pointed out. For example we can't allocate memory for run_job. Regards, Christian. > > > P. > >> >> Regards, >> Christian. >> >> >>> >>> >>> >>> Grüße, >>> P. >>> >>> >>
+CC Thomas Hellström On Mon, 2024-11-18 at 12:48 +0100, Christian König wrote: > Am 15.11.24 um 17:07 schrieb Philipp Stanner: > > > > > On Fri, 2024-11-15 at 15:39 +0100, Christian König wrote: > > > > > > > > Am 15.11.24 um 14:55 schrieb Philipp Stanner: > > > > > > > > > > > > > > [SNIP] > > > > > > > > > > > > > > > > > > > > But you wouldn't want to aim for getting rid of struct > > > > drm_sched_job > > > > completely, or would you? > > > > > > > > > > > > > > > > > No, the drm_sched_job structure was added to aid the single > > > producer > > > single consumer queue and so made it easier to put the jobs into > > > a > > > container. > > > > > > > > > In my mind the full solution for running jobs looks like this: > > > > > > 1. Driver enqueues with drm_sched_job_arm() > > > 2. job ends up in pending_list > > > 3. Sooner or later scheduler calls run_job() > > > 4. In return scheduler gets a dma_fence representing the > > > resulting > > > hardware operation. > > > 5, This fence is put into a container to keep around what the hw > > > is > > > actually executing. > > > 6. At some point the fence gets signaled informing the scheduler > > > that the next job can be pushed if enough credits are now > > > available. > > > > > > There is no free_job callback any more because after run_job is > > > called the scheduler is done with the job and only the dma_fence > > > which represents the actually HW operation is the object the > > > scheduler now works with. > > > > > > We would still need something like a kill_job callback which is > > > used > > > when an entity is released without flushing all jobs (see > > > drm_sched_entity_kill()), but that is then only used for a > > > specific > > > corner case. > > > > > > > Can't we just limit the scheduler's responsibility to telling the > > driver that it has to flush, and if it doesn't it's a bug? > > > > We still need to remove the pending jobs from the scheduler if > flushing times out. 
> > Without timing out flushing and/or aborting when the process was > killed we run into the same problem again that we don't want ti block > on _fini(). > > > > > > > > > Blocking for the cleanup in drm_sched_fini() then becomes a > > > trivial > > > dma_fence_wait() on the remaining unsignaled HW fences and > > > eventually > > > on the latest killed fence. > > > > > > > But that results in exactly the same situation as my waitque > > solution, > > doesn't it? > > > > The key part is that dma_fence's have a documented requirement to > never block infinitely. > > > > > > Blocking infinitely in drm_sched_fini(): > > > > If the driver doesn't guarantee that all fences will get signaled, > > then > > wait_event() in the waitque solution will block forever. The same > > with > > dma_fence_wait(). > > > > Or are you aiming at an interruptible dma_fence_wait(), thereby not > > blocking SIGKILL? > > > > That is basically what drm_sched_entity_flush() is already doing. > > > > > > That then would still result in leaks. So basically the same > > situation > > as with an interruptible drm_sched_flush() function. > > > > (Although I agree that would probably be more elegant) > > > > If the wait_event really would just waiting for dma_fences then yes. > > The problem with the waitqueue approach is that we need to wait for > the free_job callback and that callback is normally called from a > work item without holding any additional locks. > > When drm_sched_fini starts to block for that we create a rat-tail of > new dependencies since that one is most likely called from a file > descriptor destruction. > > > > > > > > > > > > > > > > > The problem with this approach is that the idea of re-submitting > > > jobs in a GPU reset by the scheduler is then basically dead. But > > > to > > > be honest that never worked correctly. > > > > > > See the deprecated warning I already put on > > > drm_sched_resubmit_jobs(). 
The job lifetime is not really well > > > defined because of this, see the free_guilty hack as well. > > > > > > > drm_sched_resubmit_jobs() is being used by 5 drivers currently. So > > if > > we go for this approach we have to port them, first. That doesn't > > sound > > trivial to me > > > > Yeah, completely agree. I think the scheduler should provide > something like drm_sched_for_each_pending_fence() which helps to > iterate over all the pending HW fences. > > The individual drivers could then device by themself what to do, > e.g. upcast the HW fence into the job and push it again or similar. I have talked with Dave and we would be interested in exploring the direction of getting rid of backend_ops.free_job() and doing the modifications this implies. Matthew, Thomas, any hard NACKs on principle, or can we look into this in a future patch series? P. > > But what we really need to get away from are those fence pre- > requisite violations Sima pointed out. For example we can't allocate > memory for run_job. > > Regards, > Christian. > > > > > > > > > > P. > > > > > > > > > > > > > Regards, > > > Christian. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Grüße, > > > > P. > > > > > > > > > > > > > > > > > > > > > > > > >
On Tue, Oct 29, 2024 at 08:22:22AM +0100, Philipp Stanner wrote:
> Christian, Sima?
>
> Matthew? (+CC)
>

Sorry, catching up here.

Internally in Xe we ref count the scheduler, as the scheduler is
embedded into our user queue structure (xe_exec_queue). Jobs ref the
queue once they are armed. Upon queue killing we set the TDR to zero,
which will flush out all jobs / signal all the jobs' fences. Once the
ref count of the queue is zero we do hardware / firmware teardown of
the queue and then finally call drm_sched_fini before freeing the
queue's memory (and thus the scheduler).

This flow seems to work pretty well but largely bypasses the scheduler
functions to implement this. Not sure if this flow could be normalized
at all, but I would expect usage models with a 1 to 1 relationship
between queue and scheduler to do something similar to Xe's flow. Maybe
we could write this down as a design guideline so other drivers don't
have to figure it out - it took me a while to land on this flow.

With that, in general I agree with Christian's patch here complaining
about pending jobs if drm_sched_fini is called.

> Opinions on the below?
>
> tl;dr:
> I still think it's a good thing to detectably block in
> drm_sched_fini(), or at the very least drm_sched_flush(), because then

See above. I don't think drm_sched_fini should block, but rather just
complain as this patch does, which says 'go fix your driver'.

Matt

> you'll find out that the driver is broken and can repair it.
>
> P.
> > > On Fri, 2024-10-18 at 14:07 +0200, Philipp Stanner wrote: > > On Mon, 2024-10-14 at 16:56 +0200, Danilo Krummrich wrote: > > > On Fri, Sep 27, 2024 at 11:04:48AM +0200, Christian König wrote: > > > > Am 25.09.24 um 16:53 schrieb Philipp Stanner: > > > > > On Tue, 2024-09-24 at 13:18 +0200, Simona Vetter wrote: > > > > > > On Mon, Sep 23, 2024 at 05:24:10PM +0200, Christian König > > > > > > wrote: > > > > > > > Am 20.09.24 um 15:26 schrieb Philipp Stanner: > > > > > > > > On Fri, 2024-09-20 at 12:33 +0200, Christian König wrote: > > > > > > > > > Am 20.09.24 um 10:57 schrieb Philipp Stanner: > > > > > > > > > > On Wed, 2024-09-18 at 15:39 +0200, Christian König > > > > > > > > > > wrote: > > > > > > > > > > > Tearing down the scheduler with jobs still on the > > > > > > > > > > > pending > > > > > > > > > > > list > > > > > > > > > > > can > > > > > > > > > > > lead to use after free issues. Add a warning if > > > > > > > > > > > drivers try > > > > > > > > > > > to > > > > > > > > > > > destroy a scheduler which still has work pushed to > > > > > > > > > > > the HW. > > > > > > > > > > Did you have time yet to look into my proposed > > > > > > > > > > waitque- > > > > > > > > > > solution? > > > > > > > > > I don't remember seeing anything. What have I missed? > > > > > > > > https://lore.kernel.org/all/20240903094446.29797-2-pstanner@redhat.com/ > > > > > > > Mhm, I didn't got that in my inbox for some reason. > > > > > > > > > > > > > > Interesting approach, I'm just not sure if we can or should > > > > > > > wait in > > > > > > > drm_sched_fini(). > > > > > We do agree that jobs still pending when drm_sched_fini() > > > > > starts > > > > > is > > > > > always a bug, right? > > > > > > > > Correct, the question is how to avoid that. > > > > > > > > > If so, what are the disadvantages of waiting in > > > > > drm_sched_fini()? > > > > > We > > > > > could block buggy drivers as I see it. 
Which wouldn't be good, > > > > > but > > > > > could then be fixed on drivers' site. > > > > > > > > Sima explained that pretty well: Don't block in fops->close, do > > > > that in > > > > fops->flush instead. > > > > > > I agree that we shouldn't block in close(), but this effectively > > > means that we > > > need to reference count the scheduler, right? > > > > > > Otherwise, if we allow to just skip / interrupt the teardown, we > > > can > > > still leak > > > memory. > > > > Having thought about it, I agree with Danilo. Having something that > > shall wait on green light, but can be interrupted, is no guarantee > > and > > therefore not a feasible solution. > > > > To break down the solution space, these seem to be our options: > > 1. We have something (either drm_sched_fini() or a helper, e.g., > > drm_sched_flush()) that definitely blocks until the pending > > list > > has become empty. > > 2. We have jobs reference-count the scheduler, so the latter can > > outlive the driver and will be freed some time later. > > > > Can anyone think of a third solution? > > > > > > Solution #1 has the problem of obviously blocking unconditionally if > > the driver didn't make sure that all fences will be signaled within > > reasonable time. In my opinion, this would actually be an advantage, > > because it will be *very* noticable and force users to repair their > > driver. The driver *has* to guarantee that all fences will be > > signaled. > > If the driver has to do fishy things, having the blocking outsourced > > to > > the helper drm_sched_flush() would allow them to circumvent that. > > > > Solution #2 has the problem of backend_ops.free_job() potentially > > using > > driver-data after the driver is gone, causing UAF. So with this > > solutions all drivers would have to be aware of the issue and handle > > it > > through one of DRMs primitives dedicated to such problems. 
> > > > > > Currently, all drivers either work around the problem internally or > > simply ignore it, it seems. > > > > So I'd argue that both solutions are an improvement over the existing > > situation. My preference would be #1. > > > > > > Opinions? > > > > P. > > > > > > > > > > > > > One issue this solves is that when you send a SIGTERM the tear > > > > down > > > > handling > > > > first flushes all the FDs and then closes them. > > > > > > > > So if flushing the FDs blocks because the process initiated > > > > sending > > > > a > > > > terabyte of data over a 300bps line (for example) you can still > > > > throw a > > > > SIGKILL and abort that as well. > > > > > > > > If you would block in fops-close() that SIGKILL won't have any > > > > effect any > > > > more because by the time close() is called the process is gone > > > > and > > > > signals > > > > are already blocked. > > > > > > > > And yes when I learned about that issue I was also buffed that > > > > handling like > > > > this in the UNIX design is nearly 50 years old and still applies > > > > to > > > > today. > > > > > > > Probably better to make that a separate function, something > > > > > > > like > > > > > > > drm_sched_flush() or similar. > > > > > We could do that. Such a function could then be called by > > > > > drivers > > > > > which > > > > > are not sure whether all jobs are done before they start > > > > > tearing > > > > > down. > > > > > > > > Yes exactly that's the idea. And give that flush function a > > > > return > > > > code so > > > > that it can return -EINTR. > > > > > > > > > > Yeah I don't think we should smash this into drm_sched_fini > > > > > > unconditionally. I think conceptually there's about three > > > > > > cases: > > > > > > > > > > > > - Ringbuffer schedules. Probably want everything as-is, > > > > > > because > > > > > > drm_sched_fini is called long after all the entities are > > > > > > gone in > > > > > > drm_device cleanup. 
> > > > > > > > > > > > - fw scheduler hardware with preemption support. There we > > > > > > probably > > > > > > want to > > > > > > nuke the context by setting the tdr timeout to zero (or > > > > > > maybe just > > > > > > as > > > > > > long as context preemption takes to be efficient), and > > > > > > relying on > > > > > > the > > > > > > normal gpu reset flow to handle things. > > > > > > drm_sched_entity_flush > > > > > > kinda > > > > > > does this, except not really and it's a lot more focused > > > > > > on > > > > > > the > > > > > > ringbuffer context. So maybe we want a new > > > > > > drm_sched_entity_kill. > > > > > > > > > > > > For this case calling drm_sched_fini() after the 1:1 > > > > > > entity > > > > > > is gone > > > > > > should not find any linger jobs, it would actually be a > > > > > > bug > > > > > > somewhere if > > > > > > there's a job lingering. Maybe a sanity check that there's > > > > > > not just > > > > > > no > > > > > > jobs lingering, but also no entity left would be good > > > > > > here? > > > > > The check for lingering ones is in Christian's patch here IISC. > > > > > At which position would you imagine the check for the entity > > > > > being > > > > > performed? > > > > > > > > > > > - fw scheduler without preemption support. There we kinda > > > > > > need > > > > > > the > > > > > > drm_sched_flush, except blocking in fops->close is not > > > > > > great. So > > > > > > instead > > > > > > I think the following is better: > > > > > > 1. drm_sched_entity_stopped, which only stops new > > > > > > submissions (for > > > > > > paranoia) but doesn't tear down the entity > > > > > Who would call that function? > > > > > Drivers using it voluntarily could just as well stop accepting > > > > > new jobs > > > > > from userspace to their entities, couldn't they? > > > > > > > > > > > 2. drm_dev_get > > > > > > 3. 
launch a worker which does a) drm_sched_flush (or > > > > > > drm_sched_entity_flush or whatever we call it) b) > > > > > > drm_sched_entity_fini > > > > > > + drm_sched_fini c) drm_dev_put > > > > > > > > > > > > Note that semantically this implements the refcount in the > > > > > > other > > > > > > path > > > > > > from Phillip: > > > > > > > > > > > > https://lore.kernel.org/all/20240903094531.29893-2-pstanner@redhat.com/ > > > > > > Except it doesn't impose refcount on everyone else who > > > > > > doesn't need > > > > > > it, > > > > > > and it doesn't even impose refcounting on drivers that do > > > > > > need it > > > > > > because we use drm_sched_flush and a worker to achieve the > > > > > > same. > > > > > I indeed wasn't happy with the refcount approach for that > > > > > reason, > > > > > agreed. > > > > > > > > > > > Essentially helper functions for the common use-cases instead > > > > > > of > > > > > > trying to > > > > > > solve them all by putting drm_sched_flush as a potentially > > > > > > very > > > > > > blocking > > > > > > function into drm_sched_fini. > > > > > I'm still not able to see why it blocking would be undesired – > > > > > as > > > > > far > > > > > as I can see, it is only invoked on driver teardown, so not > > > > > during > > > > > active operation. Teardown doesn't happen that often, and it > > > > > can > > > > > (if > > > > > implemented correctly) only block until the driver's code has > > > > > signaled > > > > > the last fences. If that doesn't happen, the block would reveal > > > > > a > > > > > bug. > > > > > > > > > > But don't get me wrong: I don't want to *push* this solution. I > > > > > just > > > > > want to understand when it could become a problem. > > > > > > > > > > > > > > > Wouldn't an explicitly blocking, separate function like > > > > > drm_sched_flush() or drm_sched_fini_flush() be a small, doable > > > > > step > > > > > towards the right direction? 
> > > > > > > > I think that this is the right thing to do, yes. > > > > > > > > > > > > > > > When there are still entities with jobs the > > > > > > > > > > > situation > > > > > > > > > > > is > > > > > > > > > > > even > > > > > > > > > > > worse > > > > > > > > > > > since the dma_fences for those jobs can never > > > > > > > > > > > signal > > > > > > > > > > > we can > > > > > > > > > > > just > > > > > > > > > > > choose between potentially locking up core memory > > > > > > > > > > > management and > > > > > > > > > > > random memory corruption. When drivers really mess > > > > > > > > > > > it > > > > > > > > > > > up > > > > > > > > > > > that > > > > > > > > > > > well > > > > > > > > > > > let them run into a BUG_ON(). > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Christian > > > > > > > > > > > König<christian.koenig@amd.com> > > > > > > > > > > > --- > > > > > > > > > > > drivers/gpu/drm/scheduler/sched_main.c | 19 > > > > > > > > > > > ++++++++++++++++++- > > > > > > > > > > > 1 file changed, 18 insertions(+), 1 deletion(-) > > > > > > > > > > > > > > > > > > > > > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > > > > > b/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > > > > > index f093616fe53c..8a46fab5cdc8 100644 > > > > > > > > > > > --- a/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > > > > > +++ b/drivers/gpu/drm/scheduler/sched_main.c > > > > > > > > > > > @@ -1333,17 +1333,34 @@ void drm_sched_fini(struct > > > > > > > > > > > drm_gpu_scheduler > > > > > > > > > > > *sched) > > > > > > > > > > I agree with Sima that it should first be documented > > > > > > > > > > in > > > > > > > > > > the > > > > > > > > > > function's > > > > > > > > > > docstring what the user is expected to have done > > > > > > > > > > before > > > > > > > > > > calling the > > > > > > > > > > function. > > > > > > > > > Good point, going to update the documentation as well. > > > > > > > > Cool thing, thx. 
> > > > > > > > Would be great if everything (not totally trivial) > > > > > > > > necessary to > > > > > > > > be done > > > > > > > > before _fini() is mentioned. > > > > > > > > > > > > > > > > One could also think about providing a hint at how the > > > > > > > > driver can > > > > > > > > do > > > > > > > > that. AFAICS the only way for the driver to ensure that > > > > > > > > is > > > > > > > > to > > > > > > > > maintain > > > > > > > > its own, separate list of submitted jobs. > > > > > > > Even with a duplicated pending list it's actually currently > > > > > > > impossible to do > > > > > > > this fully cleanly. > > > > > > > > > > > > > > The problem is that the dma_fence object gives no guarantee > > > > > > > when > > > > > > > callbacks > > > > > > > are processed, e.g. they can be both processed from > > > > > > > interrupt > > > > > > > context as > > > > > > > well as from a CPU which called dma_fence_is_signaled(). > > > > > > > > > > > > > > So when a driver (or drm_sched_fini) waits for the last > > > > > > > submitted > > > > > > > fence it > > > > > > > actually can be that the drm_sched object still needs to do > > > > > > > some > > > > > > > processing. > > > > > > > See the hack in amdgpu_vm_tlb_seq() for more background on > > > > > > > the > > > > > > > problem. > > > > > Oh dear ^^' > > > > > We better work towards fixing that centrally > > > > > > > > > > Thanks, > > > > > P. > > > > > > > > > > > > > > > > So I thought this should be fairly easy because of the sched > > > > > > hw/public > > > > > > fence split: If we wait for both all jobs to finish and for > > > > > > all > > > > > > the > > > > > > sched > > > > > > work/tdr work to finish, and we make sure there's no entity > > > > > > existing > > > > > > that's not yet stopped we should catch them all? > > > > > > > > Unfortunately not. 
> > > > > > > > Even when you do a dma_fence_wait() on the last submission it can > > > > still be > > > > that another CPU is executing the callbacks to wake up the > > > > scheduler work > > > > item and cleanup the job. > > > > > > > > That's one of the reasons why I think the design of keeping the > > > > job > > > > alive is > > > > so extremely awkward. The dma_fence as representation of the hw > > > > submission > > > > has a much better defined state machine and lifetime. > > > > > > > > Regards, > > > > Christian. > > > > > > > > > > Or at least I think > > > > > > it's > > > > > > a bug if any other code even tries to touch the hw fence. > > > > > > > > > > > > If you have any other driver code which relies on the rcu > > > > > > freeing > > > > > > then I > > > > > > think that's just a separate concern and we can ignore that > > > > > > here > > > > > > since the > > > > > > fences themselves will till get rcu-delay freed even if > > > > > > drm_sched_fini has > > > > > > finished. > > > > > > -Sima > > > > > > > > > > > > > Regards, > > > > > > > Christian. > > > > > > > > > > > > > > > P. > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > Christian. > > > > > > > > > > > > > > > > > > > P. > > > > > > > > > > > > > > > > > > > > > drm_sched_wqueue_stop(sched); > > > > > > > > > > > + /* > > > > > > > > > > > + * Tearing down the scheduler wile there are still > > > > > > > > > > > unprocessed jobs can > > > > > > > > > > > + * lead to use after free issues in the scheduler > > > > > > > > > > > fence. 
> > > > > > > > > > > + */ > > > > > > > > > > > + WARN_ON(!list_empty(&sched->pending_list)); > > > > > > > > > > > + > > > > > > > > > > > for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched- > > > > > > > > > > > > num_rqs; > > > > > > > > > > > i++) > > > > > > > > > > > { > > > > > > > > > > > struct drm_sched_rq *rq = sched- > > > > > > > > > > > > sched_rq[i]; > > > > > > > > > > > spin_lock(&rq->lock); > > > > > > > > > > > - list_for_each_entry(s_entity, &rq- > > > > > > > > > > > > entities, > > > > > > > > > > > list) > > > > > > > > > > > + list_for_each_entry(s_entity, &rq- > > > > > > > > > > > > entities, > > > > > > > > > > > list) { > > > > > > > > > > > + /* > > > > > > > > > > > + * The justification for this > > > > > > > > > > > BUG_ON() > > > > > > > > > > > is > > > > > > > > > > > that tearing > > > > > > > > > > > + * down the scheduler while jobs > > > > > > > > > > > are > > > > > > > > > > > pending > > > > > > > > > > > leaves > > > > > > > > > > > + * dma_fences unsignaled. Since we > > > > > > > > > > > have > > > > > > > > > > > dependencies > > > > > > > > > > > + * from the core memory management > > > > > > > > > > > to > > > > > > > > > > > eventually signal > > > > > > > > > > > + * dma_fences this can trivially > > > > > > > > > > > lead to > > > > > > > > > > > a > > > > > > > > > > > system wide > > > > > > > > > > > + * stop because of a locked up > > > > > > > > > > > memory > > > > > > > > > > > management. 
> > > > > > > > > > > + */ > > > > > > > > > > > + BUG_ON(spsc_queue_count(&s_entity- > > > > > > > > > > > > job_queue)); > > > > > > > > > > > + > > > > > > > > > > > /* > > > > > > > > > > > * Prevents reinsertion and marks > > > > > > > > > > > job_queue > > > > > > > > > > > as idle, > > > > > > > > > > > * it will removed from rq in > > > > > > > > > > > drm_sched_entity_fini > > > > > > > > > > > * eventually > > > > > > > > > > > */ > > > > > > > > > > > s_entity->stopped = true; > > > > > > > > > > > + } > > > > > > > > > > > spin_unlock(&rq->lock); > > > > > > > > > > > kfree(sched->sched_rq[i]); > > > > > > > > > > > } > > > > > >
On Thu, Nov 21, 2024 at 10:06:54AM +0100, Philipp Stanner wrote:
> +CC Thomas Hellström
>
>
> On Mon, 2024-11-18 at 12:48 +0100, Christian König wrote:
> > Am 15.11.24 um 17:07 schrieb Philipp Stanner:
> > > On Fri, 2024-11-15 at 15:39 +0100, Christian König wrote:
> > > > Am 15.11.24 um 14:55 schrieb Philipp Stanner:
> > > > > [SNIP]
> > > > >
> > > > > But you wouldn't want to aim for getting rid of struct
> > > > > drm_sched_job completely, or would you?
> > > >
> > > > No, the drm_sched_job structure was added to aid the single
> > > > producer single consumer queue and so made it easier to put the
> > > > jobs into a container.
> > > >
> > > > In my mind the full solution for running jobs looks like this:
> > > >
> > > > 1. Driver enqueues with drm_sched_job_arm()
> > > > 2. job ends up in pending_list
> > > > 3. Sooner or later scheduler calls run_job()
> > > > 4. In return scheduler gets a dma_fence representing the
> > > >    resulting hardware operation.
> > > > 5, This fence is put into a container to keep around what the hw
> > > >    is actually executing.
> > > > 6. At some point the fence gets signaled informing the scheduler
> > > >    that the next job can be pushed if enough credits are now
> > > >    available.
> > > >
> > > > There is no free_job callback any more because after run_job is
> > > > called the scheduler is done with the job and only the dma_fence
> > > > which represents the actually HW operation is the object the
> > > > scheduler now works with.
> > > >
> > > > We would still need something like a kill_job callback which is
> > > > used when an entity is released without flushing all jobs (see
> > > > drm_sched_entity_kill()), but that is then only used for a
> > > > specific corner case.
> > > Can't we just limit the scheduler's responsibility to telling the
> > > driver that it has to flush, and if it doesn't it's a bug?
> >
> > We still need to remove the pending jobs from the scheduler if
> > flushing times out.
> >
> > Without timing out flushing and/or aborting when the process was
> > killed we run into the same problem again that we don't want ti block
> > on _fini().
> >
> > > > Blocking for the cleanup in drm_sched_fini() then becomes a
> > > > trivial dma_fence_wait() on the remaining unsignaled HW fences
> > > > and eventually on the latest killed fence.
> > >
> > > But that results in exactly the same situation as my waitque
> > > solution, doesn't it?
> >
> > The key part is that dma_fence's have a documented requirement to
> > never block infinitely.
> >
> > > Blocking infinitely in drm_sched_fini():
> > >
> > > If the driver doesn't guarantee that all fences will get signaled,
> > > then wait_event() in the waitque solution will block forever. The
> > > same with dma_fence_wait().
> > >
> > > Or are you aiming at an interruptible dma_fence_wait(), thereby not
> > > blocking SIGKILL?
> >
> > That is basically what drm_sched_entity_flush() is already doing.
> >
> > > That then would still result in leaks. So basically the same
> > > situation as with an interruptible drm_sched_flush() function.
> > >
> > > (Although I agree that would probably be more elegant)
> >
> > If the wait_event really would just waiting for dma_fences then yes.
> >
> > The problem with the waitqueue approach is that we need to wait for
> > the free_job callback and that callback is normally called from a
> > work item without holding any additional locks.
> >
> > When drm_sched_fini starts to block for that we create a rat-tail of
> > new dependencies since that one is most likely called from a file
> > descriptor destruction.
> > > > The problem with this approach is that the idea of re-submitting
> > > > jobs in a GPU reset by the scheduler is then basically dead. But
> > > > to be honest that never worked correctly.
> > > >
> > > > See the deprecated warning I already put on
> > > > drm_sched_resubmit_jobs(). The job lifetime is not really well
> > > > defined because of this, see the free_guilty hack as well.
> > >
> > > drm_sched_resubmit_jobs() is being used by 5 drivers currently. So
> > > if we go for this approach we have to port them, first. That
> > > doesn't sound trivial to me
> >
> > Yeah, completely agree. I think the scheduler should provide
> > something like drm_sched_for_each_pending_fence() which helps to
> > iterate over all the pending HW fences.
> >
> > The individual drivers could then device by themself what to do,
> > e.g. upcast the HW fence into the job and push it again or similar.
>
> I have talked with Dave and we would be interested in exploring the
> direction of getting rid of backend_ops.free_job() and doing the
> modifications this implies.
>

Sorry again, catching up late. Got any details on dropping free_job?
My gut reaction is that this is a fairly large subsystem rework,
touching many drivers, and a great way to regress stable software.

> Matthew, Thomas, any hard NACKs on principle, or can we look into this
> in a future patch series?
>

Definitely not a NACK outright, but see my concerns above. If it ends
up not being all that invasive and a good cleanup, I'm not opposed. If
it starts messing with existing reset / job cancel / queue teardown
flows I would be concerned, given how hard it is to get those correct.

Matt

> > P.
>
> > But what we really need to get away from are those fence pre-
> > requisite violations Sima pointed out. For example we can't allocate
> > memory for run_job.
> >
> > Regards,
> > Christian.
> > > > Regards,
> > > > Christian.
>
> > > > > Grüße,
> > > > > P.
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index f093616fe53c..8a46fab5cdc8 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1333,17 +1333,34 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
 
 	drm_sched_wqueue_stop(sched);
 
+	/*
+	 * Tearing down the scheduler while there are still unprocessed jobs can
+	 * lead to use after free issues in the scheduler fence.
+	 */
+	WARN_ON(!list_empty(&sched->pending_list));
+
 	for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
 		struct drm_sched_rq *rq = sched->sched_rq[i];
 
 		spin_lock(&rq->lock);
-		list_for_each_entry(s_entity, &rq->entities, list)
+		list_for_each_entry(s_entity, &rq->entities, list) {
+			/*
+			 * The justification for this BUG_ON() is that tearing
+			 * down the scheduler while jobs are pending leaves
+			 * dma_fences unsignaled. Since we have dependencies
+			 * from the core memory management to eventually signal
+			 * dma_fences this can trivially lead to a system wide
+			 * stop because of a locked up memory management.
+			 */
+			BUG_ON(spsc_queue_count(&s_entity->job_queue));
+
 			/*
 			 * Prevents reinsertion and marks job_queue as idle,
 			 * it will removed from rq in drm_sched_entity_fini
 			 * eventually
 			 */
 			s_entity->stopped = true;
+		}
 		spin_unlock(&rq->lock);
 		kfree(sched->sched_rq[i]);
 	}
Tearing down the scheduler with jobs still on the pending list can
lead to use after free issues. Add a warning if drivers try to
destroy a scheduler which still has work pushed to the HW.

When there are still entities with jobs the situation is even worse,
since the dma_fences for those jobs can never signal; we can only
choose between potentially locking up core memory management and
random memory corruption. When drivers really mess it up that badly,
let them run into a BUG_ON().

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)