[v3,1/5] drm/scheduler: rework job destruction

Message ID 1555357403-30813-1-git-send-email-andrey.grodzovsky@amd.com (mailing list archive)
State New, archived
Series: [v3,1/5] drm/scheduler: rework job destruction

Commit Message

Andrey Grodzovsky April 15, 2019, 7:43 p.m. UTC
From: Christian König <christian.koenig@amd.com>

We now destroy finished jobs from the worker thread to make sure that
we never destroy a job currently in timeout processing.
This way we avoid holding a lock around the ring mirror list in
drm_sched_stop, which should solve a deadlock reported by a user.

v2: Remove unused variable.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109692

Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  17 ++--
 drivers/gpu/drm/etnaviv/etnaviv_dump.c     |   4 -
 drivers/gpu/drm/etnaviv/etnaviv_sched.c    |   9 +-
 drivers/gpu/drm/scheduler/sched_main.c     | 138 +++++++++++++++++------------
 drivers/gpu/drm/v3d/v3d_sched.c            |   9 +-
 include/drm/gpu_scheduler.h                |   6 +-
 6 files changed, 108 insertions(+), 75 deletions(-)

Comments

Eric Anholt April 15, 2019, 9:17 p.m. UTC | #1
Andrey Grodzovsky <andrey.grodzovsky@amd.com> writes:

> From: Christian König <christian.koenig@amd.com>
>
> [snip]
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  17 ++--
>  drivers/gpu/drm/etnaviv/etnaviv_dump.c     |   4 -
>  drivers/gpu/drm/etnaviv/etnaviv_sched.c    |   9 +-
>  drivers/gpu/drm/scheduler/sched_main.c     | 138 +++++++++++++++++------------
>  drivers/gpu/drm/v3d/v3d_sched.c            |   9 +-

Missing corresponding panfrost and lima updates.  You should probably
pull in drm-misc for hacking on the scheduler.

> diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
> index ce7c737b..8efb091 100644
> --- a/drivers/gpu/drm/v3d/v3d_sched.c
> +++ b/drivers/gpu/drm/v3d/v3d_sched.c
> @@ -232,11 +232,18 @@ v3d_gpu_reset_for_timeout(struct v3d_dev *v3d, struct drm_sched_job *sched_job)
>  
>  	/* block scheduler */
>  	for (q = 0; q < V3D_MAX_QUEUES; q++)
> -		drm_sched_stop(&v3d->queue[q].sched);
> +		drm_sched_stop(&v3d->queue[q].sched, sched_job);
>  
>  	if(sched_job)
>  		drm_sched_increase_karma(sched_job);
>  
> +	/*
> +	 * Guilty job did complete and hence needs to be manually removed
> +	 * See drm_sched_stop doc.
> +	 */
> +	if (list_empty(&sched_job->node))
> +		sched_job->sched->ops->free_job(sched_job);

If the if (sched_job) is necessary up above, then this should clearly be
under it.

But, can we please have a core scheduler thing we call here instead of
drivers all replicating it?

> +
>  	/* get the GPU back into the init state */
>  	v3d_reset(v3d);
>
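
A core helper along the lines Eric suggests might look like this (a
minimal sketch; the name drm_sched_free_guilty_job is hypothetical and
not part of this series):

	/* Free the guilty job if drm_sched_stop() already removed it from
	 * the mirror list, so drivers don't open-code this after every stop.
	 */
	void drm_sched_free_guilty_job(struct drm_sched_job *bad)
	{
		if (bad && list_empty(&bad->node))
			bad->sched->ops->free_job(bad);
	}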
Christian König April 16, 2019, 9:47 a.m. UTC | #2
On 15.04.19 at 23:17, Eric Anholt wrote:
> [snip]
> If the if (sched_job) is necessary up above, then this should clearly be
> under it.
>
> But, can we please have a core scheduler thing we call here instead of
> drivers all replicating it?

Yeah, that's also something I noted before.

The essential problem is that we remove finished jobs from the mirror
list and so need to destroy them, because otherwise we leak them.

An alternative approach here would be to keep the jobs on the ring
mirror list, but not submit them again.

Regards,
Christian.

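
A rough sketch of that alternative, assuming finished jobs stay on the
mirror list and resubmission simply skips them (this is not what the
patch does):

	list_for_each_entry(s_job, &sched->ring_mirror_list, node) {
		/* Job already finished: leave it on the list, skip resubmit. */
		if (dma_fence_is_signaled(&s_job->s_fence->finished))
			continue;

		s_job->s_fence->parent = sched->ops->run_job(s_job);
	}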
Andrey Grodzovsky April 16, 2019, 2:36 p.m. UTC | #3
On 4/16/19 5:47 AM, Christian König wrote:
> [snip]
> Yeah, that's also something I noted before.
>
> The essential problem is that we remove finished jobs from the mirror
> list and so need to destroy them, because otherwise we leak them.
>
> An alternative approach here would be to keep the jobs on the ring
> mirror list, but not submit them again.
>
> Regards,
> Christian.


I really prefer to avoid this; it means adding an extra flag to sched_job
to check in each iteration of the ring mirror list. What about changing
the signature of drm_sched_backend_ops.timedout_job to return a
drm_sched_job* instead of void? This way we can return the guilty job
from the driver-specific handler to the generic drm_sched_job_timedout
and release it there.

Andrey

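
Roughly what that proposed signature change would look like (a sketch;
it was never part of this series):

	/* In struct drm_sched_backend_ops: return the guilty job (or NULL)
	 * instead of void, so the core knows whether to release it.
	 */
	struct drm_sched_job *(*timedout_job)(struct drm_sched_job *sched_job);

	/* In drm_sched_job_timedout(): */
	struct drm_sched_job *guilty = job->sched->ops->timedout_job(job);

	if (guilty)
		guilty->sched->ops->free_job(guilty);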
Christian König April 16, 2019, 2:43 p.m. UTC | #4
On 16.04.19 at 16:36, Grodzovsky, Andrey wrote:
> [snip]
>
> I really prefer to avoid this; it means adding an extra flag to sched_job
> to check in each iteration of the ring mirror list.

Mhm, why actually? We just need to check if the scheduler fence is signaled.

> What about changing
> the signature of drm_sched_backend_ops.timedout_job to return a
> drm_sched_job* instead of void? This way we can return the guilty job
> from the driver-specific handler to the generic drm_sched_job_timedout
> and release it there.

Well the timeout handler already has the job, so returning it doesn't 
make much sense.

The problem is rather that the timeout handler doesn't know if it should 
destroy the job or not.

Christian.

Andrey Grodzovsky April 16, 2019, 2:58 p.m. UTC | #5
On 4/16/19 10:43 AM, Koenig, Christian wrote:
> [snip]
>> I really prefer to avoid this; it means adding an extra flag to sched_job
>> to check in each iteration of the ring mirror list.
> Mhm, why actually? We just need to check if the scheduler fence is signaled.

OK, I see it's equivalent, but this is still an extra check for all the
iterations.

>
>> What about changing
>> the signature of drm_sched_backend_ops.timedout_job to return a
>> drm_sched_job* instead of void? This way we can return the guilty job
>> from the driver-specific handler to the generic drm_sched_job_timedout
>> and release it there.
> Well the timeout handler already has the job, so returning it doesn't
> make much sense.
>
> The problem is rather that the timeout handler doesn't know if it should
> destroy the job or not.


But the driver-specific handler does, and actually returning either the
pointer to the job or NULL will give an indication of that. We could
even return a bool.

Andrey

Andrey Grodzovsky April 16, 2019, 3:42 p.m. UTC | #6
On 4/16/19 10:58 AM, Grodzovsky, Andrey wrote:
> [snip]
>> The problem is rather that the timeout handler doesn't know if it should
>> destroy the job or not.
>
> But the driver-specific handler does, and actually returning either the
> pointer to the job or NULL will give an indication of that. We could
> even return a bool.
>
> Andrey


Thinking a bit more about this: the way this check is done now, "if
(list_empty(&sched_job->node)) then free the sched_job", actually makes
it possible to just move this as-is from the driver-specific callbacks
into drm_sched_job_timedout without any other changes.

Andrey

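
A sketch of what moving that check into the core timeout handler could
look like (assumed code; as the follow-up shows, this approach races
with the scheduler thread's cleanup):

	static void drm_sched_job_timedout(struct work_struct *work)
	{
		struct drm_gpu_scheduler *sched =
			container_of(work, struct drm_gpu_scheduler, work_tdr.work);
		struct drm_sched_job *job;

		job = list_first_entry_or_null(&sched->ring_mirror_list,
					       struct drm_sched_job, node);
		if (job)
			sched->ops->timedout_job(job);

		/* Guilty job completed and already left the mirror list,
		 * so free it here instead of in each driver's callback.
		 */
		if (job && list_empty(&job->node))
			sched->ops->free_job(job);
	}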
Christian König April 16, 2019, 4 p.m. UTC | #7
On 16.04.19 at 17:42, Grodzovsky, Andrey wrote:
> [snip]
>
> Thinking a bit more about this: the way this check is done now, "if
> (list_empty(&sched_job->node)) then free the sched_job", actually makes
> it possible to just move this as-is from the driver-specific callbacks
> into drm_sched_job_timedout without any other changes.

Oh, well that sounds like a good idea offhand.

Need to see the final code, but at least the best idea so far.

Christian.

Andrey Grodzovsky April 17, 2019, 9:06 p.m. UTC | #8
On 4/16/19 12:00 PM, Koenig, Christian wrote:
> [snip]
>> Thinking a bit more about this: the way this check is done now, "if
>> (list_empty(&sched_job->node)) then free the sched_job", actually makes
>> it possible to just move this as-is from the driver-specific callbacks
>> into drm_sched_job_timedout without any other changes.
> Oh, well that sounds like a good idea offhand.
>
> Need to see the final code, but at least the best idea so far.
>
> Christian.

Unfortunately it looks like it's not such a good idea after all; take a
look at the attached KASAN print. The scheduler thread's cleanup function
races against the TDR handler and removes the guilty job from the mirror
list, and we have no way of differentiating whether the job was removed
from within the TDR handler or from the scheduler thread's cleanup
function. So it looks like we either need to 'keep the jobs on the ring
mirror list, but not submit them again' as you suggested before, or add
a flag to sched_job to hint to drm_sched_job_timedout that the guilty
job requires manual removal. Your suggestion implies we will need an
extra check in almost every place that traverses the mirror list, to
avoid handling signaled jobs, while mine requires an extra flag in the
sched_job struct. I feel that keeping completed jobs in the mirror list
when they actually don't belong there any more is confusing and an
opening for future bugs.

Andrey
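
The race, roughly (a reconstruction from the KASAN report below; the
exact interleaving is assumed):

	scheduler thread                     timeout worker
	----------------                     --------------
	drm_sched_cleanup_jobs()
	  list_del_init(&job->node)
	  sched->ops->free_job(job)
	                                     drm_sched_job_timedout()
	                                       reads first job on mirror list
	                                       -> use-after-free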

[  121.189757 <    0.000171>] amdgpu 0000:01:00.0: GPU reset(5) succeeded!
passed
[  121.189894 <    0.000137>] ==================================================================


[  121.189951 <    0.000057>] BUG: KASAN: use-after-free in drm_sched_job_timedout+0x7a/0xf0 [gpu_sched]
Run Summary:    Type  Total    Ran Passed Failed Inactive
              suites      8      0    n/a      0        0
               tests     39      1      1      0        0
             asserts      8      8      8      0      n/a

Elapsed time =    0.001 seconds
[  121.189956 <    0.000005>] Read of size 8 at addr ffff88840389a8b0 by task kworker/2:2/1140


[  121.189969 <    0.000013>] CPU: 2 PID: 1140 Comm: kworker/2:2 Tainted: G           OE     5.1.0-rc2-misc+ #1
[  121.189972 <    0.000003>] Hardware name: System manufacturer System Product Name/Z170-PRO, BIOS 1902 06/27/2016
[  121.189977 <    0.000005>] Workqueue: events drm_sched_job_timedout [gpu_sched]
[  121.189980 <    0.000003>] Call Trace:
[  121.189985 <    0.000005>]  dump_stack+0x9b/0xf5
[  121.189992 <    0.000007>]  print_address_description+0x70/0x290
[  121.189997 <    0.000005>]  ? drm_sched_job_timedout+0x7a/0xf0 [gpu_sched]
[  121.190002 <    0.000005>]  kasan_report+0x134/0x191
[  121.190006 <    0.000004>]  ? drm_sched_job_timedout+0x7a/0xf0 [gpu_sched]
[  121.190014 <    0.000008>]  ? drm_sched_job_timedout+0x7a/0xf0 [gpu_sched]
[  121.190019 <    0.000005>]  __asan_load8+0x54/0x90
[  121.190024 <    0.000005>]  drm_sched_job_timedout+0x7a/0xf0 [gpu_sched]
[  121.190034 <    0.000010>]  process_one_work+0x466/0xb00
[  121.190046 <    0.000012>]  ? queue_work_node+0x180/0x180
[  121.190061 <    0.000015>]  worker_thread+0x83/0x6c0
[  121.190075 <    0.000014>]  kthread+0x1a9/0x1f0
[  121.190079 <    0.000004>]  ? rescuer_thread+0x760/0x760
[  121.190081 <    0.000002>]  ? kthread_cancel_delayed_work_sync+0x20/0x20
[  121.190088 <    0.000007>]  ret_from_fork+0x3a/0x50

[  121.190105 <    0.000017>] Allocated by task 1421:
[  121.190110 <    0.000005>]  save_stack+0x46/0xd0
[  121.190112 <    0.000002>]  __kasan_kmalloc+0xab/0xe0
[  121.190115 <    0.000003>]  kasan_kmalloc+0xf/0x20
[  121.190117 <    0.000002>]  __kmalloc+0x167/0x390
[  121.190210 <    0.000093>]  amdgpu_job_alloc+0x47/0x170 [amdgpu]
[  121.190289 <    0.000079>]  amdgpu_cs_ioctl+0x9bd/0x2e70 [amdgpu]
[  121.190312 <    0.000023>]  drm_ioctl_kernel+0x17e/0x1d0 [drm]
[  121.190334 <    0.000022>]  drm_ioctl+0x5e1/0x640 [drm]
[  121.190409 <    0.000075>]  amdgpu_drm_ioctl+0x78/0xd0 [amdgpu]
[  121.190413 <    0.000004>]  do_vfs_ioctl+0x152/0xa30
[  121.190415 <    0.000002>]  ksys_ioctl+0x6d/0x80
[  121.190418 <    0.000003>]  __x64_sys_ioctl+0x43/0x50
[  121.190425 <    0.000007>]  do_syscall_64+0x7d/0x240
[  121.190430 <    0.000005>]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

[  121.190440 <    0.000010>] Freed by task 1242:
[  121.190448 <    0.000008>]  save_stack+0x46/0xd0
[  121.190453 <    0.000005>]  __kasan_slab_free+0x13c/0x1a0
[  121.190458 <    0.000005>]  kasan_slab_free+0xe/0x10
[  121.190462 <    0.000004>]  kfree+0xfa/0x2e0
[  121.190584 <    0.000122>]  amdgpu_job_free_cb+0x7f/0x90 [amdgpu]
[  121.190589 <    0.000005>]  drm_sched_cleanup_jobs.part.10+0xcf/0x1a0 [gpu_sched]
[  121.190594 <    0.000005>]  drm_sched_main+0x38a/0x430 [gpu_sched]
[  121.190596 <    0.000002>]  kthread+0x1a9/0x1f0
[  121.190599 <    0.000003>]  ret_from_fork+0x3a/0x50

Patch

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index d46675b..8c41a9f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3341,15 +3341,23 @@  static int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
 		if (!ring || !ring->sched.thread)
 			continue;
 
-		drm_sched_stop(&ring->sched);
+		drm_sched_stop(&ring->sched, &job->base);
 
 		/* after all hw jobs are reset, hw fence is meaningless, so force_completion */
 		amdgpu_fence_driver_force_completion(ring);
 	}
 
-	if(job)
+	if(job) {
 		drm_sched_increase_karma(&job->base);
 
+		/*
+		 * Guilty job did complete and hence needs to be manually removed
+		 * See drm_sched_stop doc.
+		 */
+		if (list_empty(&job->base.node))
+			job->base.sched->ops->free_job(&job->base);
+	}
+
 
 
 	if (!amdgpu_sriov_vf(adev)) {
@@ -3489,8 +3497,7 @@  static int amdgpu_do_asic_reset(struct amdgpu_hive_info *hive,
 	return r;
 }
 
-static void amdgpu_device_post_asic_reset(struct amdgpu_device *adev,
-					  struct amdgpu_job *job)
+static void amdgpu_device_post_asic_reset(struct amdgpu_device *adev)
 {
 	int i;
 
@@ -3630,7 +3637,7 @@  int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 
 	/* Post ASIC reset for all devs .*/
 	list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
-		amdgpu_device_post_asic_reset(tmp_adev, tmp_adev == adev ? job : NULL);
+		amdgpu_device_post_asic_reset(tmp_adev);
 
 		if (r) {
 			/* bad news, how to tell it to userspace ? */
diff --git a/drivers/gpu/drm/etnaviv/etnaviv_dump.c b/drivers/gpu/drm/etnaviv/etnaviv_dump.c
index 3fbb485..8434715 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_dump.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_dump.c
@@ -135,13 +135,11 @@  void etnaviv_core_dump(struct etnaviv_gpu *gpu)
 		    mmu_size + gpu->buffer.size;
 
 	/* Add in the active command buffers */
-	spin_lock_irqsave(&gpu->sched.job_list_lock, flags);
 	list_for_each_entry(s_job, &gpu->sched.ring_mirror_list, node) {
 		submit = to_etnaviv_submit(s_job);
 		file_size += submit->cmdbuf.size;
 		n_obj++;
 	}
-	spin_unlock_irqrestore(&gpu->sched.job_list_lock, flags);
 
 	/* Add in the active buffer objects */
 	list_for_each_entry(vram, &gpu->mmu->mappings, mmu_node) {
@@ -183,14 +181,12 @@  void etnaviv_core_dump(struct etnaviv_gpu *gpu)
 			      gpu->buffer.size,
 			      etnaviv_cmdbuf_get_va(&gpu->buffer));
 
-	spin_lock_irqsave(&gpu->sched.job_list_lock, flags);
 	list_for_each_entry(s_job, &gpu->sched.ring_mirror_list, node) {
 		submit = to_etnaviv_submit(s_job);
 		etnaviv_core_dump_mem(&iter, ETDUMP_BUF_CMD,
 				      submit->cmdbuf.vaddr, submit->cmdbuf.size,
 				      etnaviv_cmdbuf_get_va(&submit->cmdbuf));
 	}
-	spin_unlock_irqrestore(&gpu->sched.job_list_lock, flags);
 
 	/* Reserve space for the bomap */
 	if (n_bomap_pages) {
diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
index 67ae266..8795c19 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
@@ -109,11 +109,18 @@  static void etnaviv_sched_timedout_job(struct drm_sched_job *sched_job)
 	}
 
 	/* block scheduler */
-	drm_sched_stop(&gpu->sched);
+	drm_sched_stop(&gpu->sched, sched_job);
 
 	if(sched_job)
 		drm_sched_increase_karma(sched_job);
 
+	/*
+	 * Guilty job did complete and hence needs to be manually removed
+	 * See drm_sched_stop doc.
+	 */
+	if (list_empty(&sched_job->node))
+		sched_job->sched->ops->free_job(sched_job);
+
 	/* get the GPU back into the init state */
 	etnaviv_core_dump(gpu);
 	etnaviv_gpu_recover_hang(gpu);
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 19fc601..05a4540 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -265,32 +265,6 @@  void drm_sched_resume_timeout(struct drm_gpu_scheduler *sched,
 }
 EXPORT_SYMBOL(drm_sched_resume_timeout);
 
-/* job_finish is called after hw fence signaled
- */
-static void drm_sched_job_finish(struct work_struct *work)
-{
-	struct drm_sched_job *s_job = container_of(work, struct drm_sched_job,
-						   finish_work);
-	struct drm_gpu_scheduler *sched = s_job->sched;
-	unsigned long flags;
-
-	/*
-	 * Canceling the timeout without removing our job from the ring mirror
-	 * list is safe, as we will only end up in this worker if our jobs
-	 * finished fence has been signaled. So even if some another worker
-	 * manages to find this job as the next job in the list, the fence
-	 * signaled check below will prevent the timeout to be restarted.
-	 */
-	cancel_delayed_work_sync(&sched->work_tdr);
-
-	spin_lock_irqsave(&sched->job_list_lock, flags);
-	/* queue TDR for next job */
-	drm_sched_start_timeout(sched);
-	spin_unlock_irqrestore(&sched->job_list_lock, flags);
-
-	sched->ops->free_job(s_job);
-}
-
 static void drm_sched_job_begin(struct drm_sched_job *s_job)
 {
 	struct drm_gpu_scheduler *sched = s_job->sched;
@@ -371,23 +345,26 @@  EXPORT_SYMBOL(drm_sched_increase_karma);
  * @sched: scheduler instance
  * @bad: bad scheduler job
  *
+ * Stop the scheduler and also remove and free all completed jobs.
+ * Note: the bad job will not be freed as it might be used later, and so
+ * it's the caller's responsibility to release it manually if it's not
+ * part of the mirror list any more.
+ *
  */
-void drm_sched_stop(struct drm_gpu_scheduler *sched)
+void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
 {
-	struct drm_sched_job *s_job;
+	struct drm_sched_job *s_job, *tmp;
 	unsigned long flags;
-	struct dma_fence *last_fence =  NULL;
 
 	kthread_park(sched->thread);
 
 	/*
-	 * Verify all the signaled jobs in mirror list are removed from the ring
-	 * by waiting for the latest job to enter the list. This should insure that
-	 * also all the previous jobs that were in flight also already singaled
-	 * and removed from the list.
+	 * Iterate the job list from later to earlier one and either deactivate
+	 * their HW callbacks or remove them from the mirror list if they already
+	 * signaled.
+	 * This iteration is thread safe as the sched thread is stopped.
 	 */
-	spin_lock_irqsave(&sched->job_list_lock, flags);
-	list_for_each_entry_reverse(s_job, &sched->ring_mirror_list, node) {
+	list_for_each_entry_safe_reverse(s_job, tmp, &sched->ring_mirror_list, node) {
 		if (s_job->s_fence->parent &&
 		    dma_fence_remove_callback(s_job->s_fence->parent,
 					      &s_job->cb)) {
@@ -395,16 +372,30 @@  void drm_sched_stop(struct drm_gpu_scheduler *sched)
 			s_job->s_fence->parent = NULL;
 			atomic_dec(&sched->hw_rq_count);
 		} else {
-			 last_fence = dma_fence_get(&s_job->s_fence->finished);
-			 break;
+			/*
+			 * remove job from ring_mirror_list.
+			 * Locking here is for concurrent resume timeout
+			 */
+			spin_lock_irqsave(&sched->job_list_lock, flags);
+			list_del_init(&s_job->node);
+			spin_unlock_irqrestore(&sched->job_list_lock, flags);
+
+			/*
+			 * Wait for job's HW fence callback to finish using s_job
+			 * before releasing it.
+			 *
+			 * Job is still alive so fence refcount at least 1
+			 */
+			dma_fence_wait(&s_job->s_fence->finished, false);
+
+			/*
+			 * We must keep bad job alive for later use during
+			 * recovery by some of the drivers
+			 */
+			if (bad != s_job)
+				sched->ops->free_job(s_job);
 		}
 	}
-	spin_unlock_irqrestore(&sched->job_list_lock, flags);
-
-	if (last_fence) {
-		dma_fence_wait(last_fence, false);
-		dma_fence_put(last_fence);
-	}
 }
 
 EXPORT_SYMBOL(drm_sched_stop);
@@ -418,6 +409,7 @@  EXPORT_SYMBOL(drm_sched_stop);
 void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
 {
 	struct drm_sched_job *s_job, *tmp;
+	unsigned long flags;
 	int r;
 
 	if (!full_recovery)
@@ -425,9 +417,7 @@  void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
 
 	/*
 	 * Locking the list is not required here as the sched thread is parked
-	 * so no new jobs are being pushed in to HW and in drm_sched_stop we
-	 * flushed all the jobs who were still in mirror list but who already
-	 * signaled and removed them self from the list. Also concurrent
+	 * so no new jobs are being inserted or removed. Also concurrent
 	 * GPU recovers can't run in parallel.
 	 */
 	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
@@ -445,7 +435,9 @@  void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
 			drm_sched_process_job(NULL, &s_job->cb);
 	}
 
+	spin_lock_irqsave(&sched->job_list_lock, flags);
 	drm_sched_start_timeout(sched);
+	spin_unlock_irqrestore(&sched->job_list_lock, flags);
 
 unpark:
 	kthread_unpark(sched->thread);
@@ -464,7 +456,6 @@  void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched)
 	uint64_t guilty_context;
 	bool found_guilty = false;
 
-	/*TODO DO we need spinlock here ? */
 	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
 		struct drm_sched_fence *s_fence = s_job->s_fence;
 
@@ -514,7 +505,6 @@  int drm_sched_job_init(struct drm_sched_job *job,
 		return -ENOMEM;
 	job->id = atomic64_inc_return(&sched->job_id_count);
 
-	INIT_WORK(&job->finish_work, drm_sched_job_finish);
 	INIT_LIST_HEAD(&job->node);
 
 	return 0;
@@ -597,24 +587,53 @@  static void drm_sched_process_job(struct dma_fence *f, struct dma_fence_cb *cb)
 	struct drm_sched_job *s_job = container_of(cb, struct drm_sched_job, cb);
 	struct drm_sched_fence *s_fence = s_job->s_fence;
 	struct drm_gpu_scheduler *sched = s_fence->sched;
-	unsigned long flags;
-
-	cancel_delayed_work(&sched->work_tdr);
 
 	atomic_dec(&sched->hw_rq_count);
 	atomic_dec(&sched->num_jobs);
 
-	spin_lock_irqsave(&sched->job_list_lock, flags);
-	/* remove job from ring_mirror_list */
-	list_del_init(&s_job->node);
-	spin_unlock_irqrestore(&sched->job_list_lock, flags);
+	trace_drm_sched_process_job(s_fence);
 
 	drm_sched_fence_finished(s_fence);
-
-	trace_drm_sched_process_job(s_fence);
 	wake_up_interruptible(&sched->wake_up_worker);
+}
+
+/**
+ * drm_sched_cleanup_jobs - destroy finished jobs
+ *
+ * @sched: scheduler instance
+ *
+ * Remove all finished jobs from the mirror list and destroy them.
+ */
+static void drm_sched_cleanup_jobs(struct drm_gpu_scheduler *sched)
+{
+	unsigned long flags;
+
+	/* Don't destroy jobs while the timeout worker is running */
+	if (!cancel_delayed_work(&sched->work_tdr))
+		return;
+
+
+	while (!list_empty(&sched->ring_mirror_list)) {
+		struct drm_sched_job *job;
+
+		job = list_first_entry(&sched->ring_mirror_list,
+				       struct drm_sched_job, node);
+		if (!dma_fence_is_signaled(&job->s_fence->finished))
+			break;
+
+		spin_lock_irqsave(&sched->job_list_lock, flags);
+		/* remove job from ring_mirror_list */
+		list_del_init(&job->node);
+		spin_unlock_irqrestore(&sched->job_list_lock, flags);
+
+		sched->ops->free_job(job);
+	}
+
+	/* queue timeout for next job */
+	spin_lock_irqsave(&sched->job_list_lock, flags);
+	drm_sched_start_timeout(sched);
+	spin_unlock_irqrestore(&sched->job_list_lock, flags);
 
-	schedule_work(&s_job->finish_work);
 }
 
 /**
@@ -656,9 +675,10 @@  static int drm_sched_main(void *param)
 		struct dma_fence *fence;
 
 		wait_event_interruptible(sched->wake_up_worker,
+					 (drm_sched_cleanup_jobs(sched),
 					 (!drm_sched_blocked(sched) &&
 					  (entity = drm_sched_select_entity(sched))) ||
-					 kthread_should_stop());
+					 kthread_should_stop()));
 
 		if (!entity)
 			continue;
diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
index ce7c737b..8efb091 100644
--- a/drivers/gpu/drm/v3d/v3d_sched.c
+++ b/drivers/gpu/drm/v3d/v3d_sched.c
@@ -232,11 +232,18 @@  v3d_gpu_reset_for_timeout(struct v3d_dev *v3d, struct drm_sched_job *sched_job)
 
 	/* block scheduler */
 	for (q = 0; q < V3D_MAX_QUEUES; q++)
-		drm_sched_stop(&v3d->queue[q].sched);
+		drm_sched_stop(&v3d->queue[q].sched, sched_job);
 
 	if(sched_job)
 		drm_sched_increase_karma(sched_job);
 
+	/*
+	 * Guilty job did complete and hence needs to be manually removed
+	 * See drm_sched_stop doc.
+	 */
+	if (list_empty(&sched_job->node))
+		sched_job->sched->ops->free_job(sched_job);
+
 	/* get the GPU back into the init state */
 	v3d_reset(v3d);
 
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 0daca4d..9ee0f27 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -167,9 +167,6 @@  struct drm_sched_fence *to_drm_sched_fence(struct dma_fence *f);
  * @sched: the scheduler instance on which this job is scheduled.
  * @s_fence: contains the fences for the scheduling of job.
  * @finish_cb: the callback for the finished fence.
- * @finish_work: schedules the function @drm_sched_job_finish once the job has
- *               finished to remove the job from the
- *               @drm_gpu_scheduler.ring_mirror_list.
  * @node: used to append this struct to the @drm_gpu_scheduler.ring_mirror_list.
  * @id: a unique id assigned to each job scheduled on the scheduler.
  * @karma: increment on every hang caused by this job. If this exceeds the hang
@@ -188,7 +185,6 @@  struct drm_sched_job {
 	struct drm_gpu_scheduler	*sched;
 	struct drm_sched_fence		*s_fence;
 	struct dma_fence_cb		finish_cb;
-	struct work_struct		finish_work;
 	struct list_head		node;
 	uint64_t			id;
 	atomic_t			karma;
@@ -296,7 +292,7 @@  int drm_sched_job_init(struct drm_sched_job *job,
 		       void *owner);
 void drm_sched_job_cleanup(struct drm_sched_job *job);
 void drm_sched_wakeup(struct drm_gpu_scheduler *sched);
-void drm_sched_stop(struct drm_gpu_scheduler *sched);
+void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
 void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery);
 void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
 void drm_sched_increase_karma(struct drm_sched_job *bad);
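
For reference, the driver-side pattern this series establishes (a
condensed sketch mirroring the v3d/etnaviv hunks above, with the guard
Eric requested):

	drm_sched_stop(&sched, bad_job);

	if (bad_job) {
		drm_sched_increase_karma(bad_job);

		/* Guilty job completed and already left the mirror list:
		 * drm_sched_stop() skipped freeing it, so do it here.
		 */
		if (list_empty(&bad_job->node))
			bad_job->sched->ops->free_job(bad_job);
	}

	/* ... reset hardware, resubmit jobs, then drm_sched_start() ... */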