
[3/3] drm/amdgpu: Refresh rq selection for job after ASIC reset

Message ID 1539888261-331-3-git-send-email-andrey.grodzovsky@amd.com (mailing list archive)
State New, archived
Series [1/3] drm/sched: Add callback to mark if sched is ready to work.

Commit Message

Andrey Grodzovsky Oct. 18, 2018, 6:44 p.m. UTC
A ring might become unusable after a reset. If that is the case,
drm_sched_entity_select_rq will choose another working rq
to run the job, if one exists.
Also, skip recovery of rings which are not ready.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)
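
For context, a simplified sketch of what drm_sched_entity_select_rq is
expected to do here (an assumption modeled on the 4.19-era scheduler plus
the readiness check from patch 1/3 of this series, not the exact upstream
code; the helper name is invented):

static struct drm_sched_rq *
pick_least_loaded_rq(struct drm_sched_entity *entity)
{
	struct drm_sched_rq *best = NULL;
	unsigned int i, min_jobs = UINT_MAX;

	for (i = 0; i < entity->num_rq_list; ++i) {
		struct drm_gpu_scheduler *sched = entity->rq_list[i]->sched;
		unsigned int jobs = atomic_read(&sched->num_jobs);

		/* Skip schedulers whose ring did not come back after reset
		 * (readiness check assumed from patch 1/3).
		 */
		if (!sched->ready)
			continue;

		if (jobs < min_jobs) {
			min_jobs = jobs;
			best = entity->rq_list[i];
		}
	}

	return best;	/* NULL if no working ring is left */
}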

Comments

Christian König Oct. 19, 2018, 7:08 a.m. UTC | #1
Am 18.10.18 um 20:44 schrieb Andrey Grodzovsky:
> A ring might become unusable after a reset. If that is the case,
> drm_sched_entity_select_rq will choose another working rq
> to run the job, if one exists.
> Also, skip recovery of rings which are not ready.

Well that is not even remotely sufficient.

If we can't bring up a ring any more after a reset, we would need to move 
all jobs which were previously scheduled on it, and all of its entities, 
to a different instance.

What you do here breaks dependencies between jobs and can result in 
unforeseen consequences including random memory writes.

As far as I can see that can't be done correctly with the current 
scheduler design.

In addition to that, when we can't restart one instance of a ring we 
usually can't restart the others either. So the whole approach is rather 
pointless.

Regards,
Christian.

>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++++++++++++++-
>   1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index d11489e..3124ca1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3355,10 +3355,24 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   	else
>   		r = amdgpu_device_reset(adev);
>   
> +	/*
> +	 * After reboot a ring might fail in which case this will
> +	 * move the job to different rq if possible
> +	 */
> +	if (job) {
> +		drm_sched_entity_select_rq(job->base.entity);
> +		if (job->base.entity->rq) {
> +			job->base.sched = job->base.entity->rq->sched;
> +		} else {
> +			job->base.sched = NULL;
> +			r = -ENOENT;
> +		}
> +	}
> +
>   	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>   		struct amdgpu_ring *ring = adev->rings[i];
>   
> -		if (!ring || !ring->sched.thread)
> +		if (!ring || !ring->ready || !ring->sched.thread)
>   			continue;
>   
>   		/* only need recovery sched of the given job's ring
Andrey Grodzovsky Oct. 19, 2018, 3:06 p.m. UTC | #2
On 10/19/2018 03:08 AM, Koenig, Christian wrote:
> Am 18.10.18 um 20:44 schrieb Andrey Grodzovsky:
>> A ring might become unusable after a reset. If that is the case,
>> drm_sched_entity_select_rq will choose another working rq
>> to run the job, if one exists.
>> Also, skip recovery of rings which are not ready.
> Well that is not even remotely sufficient.
>
> If we can't bring up a ring any more after a reset, we would need to move
> all jobs which were previously scheduled on it, and all of its entities,
> to a different instance.

That one should be easy to add inside amdgpu_device_gpu_recover in case 
the ring is dead: we just do the same for all the jobs in the dead ring's 
sched->ring_mirror_list before doing recovery for them, no?
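
Roughly, a hypothetical sketch of what I mean (the helper below does not 
exist; it assumes the current ring_mirror_list/node fields and skips all 
locking, fence bookkeeping and error handling):

static void resubmit_on_other_sched(struct drm_gpu_scheduler *dead,
				    struct drm_gpu_scheduler *alive)
{
	struct drm_sched_job *s_job;

	/* Re-emit every job still pending on the dead ring's scheduler
	 * on a scheduler whose ring survived the reset.
	 */
	list_for_each_entry(s_job, &dead->ring_mirror_list, node) {
		s_job->sched = alive;		/* retarget the job */
		alive->ops->run_job(s_job);	/* re-emit on the new ring */
	}
}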
>
> What you do here breaks dependencies between jobs and can result in
> unforeseen consequences including random memory writes.

Can you explain this a bit more? AFAIK any job dependent on this job 
will still wait for its completion before running, regardless of whether 
this job moved to a different ring. What am I missing?
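
The model I have in mind, as a simplified sketch (a blocking-wait 
simplification of the scheduler's callback-based dependency handling, 
not the real code path):

	struct dma_fence *fence;

	/* Before a job runs, the scheduler consumes every fence returned
	 * by the dependency callback.  Those fences follow the producing
	 * job, so a consumer still waits for it even if the producer was
	 * moved to another ring.
	 */
	while ((fence = sched->ops->dependency(s_job, entity))) {
		dma_fence_wait(fence, false);
		dma_fence_put(fence);
	}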
>
> As far as I can see that can't be done correctly with the current
> scheduler design.
>
> In addition to that, when we can't restart one instance of a ring we
> usually can't restart the others either. So the whole approach is rather
> pointless.

 From my testing it looks like we can: compute ring 0 is dead but IB tests 
pass on other compute rings.

Andrey

>
> Regards,
> Christian.
>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++++++++++++++-
>>    1 file changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index d11489e..3124ca1 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3355,10 +3355,24 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>    	else
>>    		r = amdgpu_device_reset(adev);
>>    
>> +	/*
>> +	 * After reboot a ring might fail in which case this will
>> +	 * move the job to different rq if possible
>> +	 */
>> +	if (job) {
>> +		drm_sched_entity_select_rq(job->base.entity);
>> +		if (job->base.entity->rq) {
>> +			job->base.sched = job->base.entity->rq->sched;
>> +		} else {
>> +			job->base.sched = NULL;
>> +			r = -ENOENT;
>> +		}
>> +	}
>> +
>>    	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>    		struct amdgpu_ring *ring = adev->rings[i];
>>    
>> -		if (!ring || !ring->sched.thread)
>> +		if (!ring || !ring->ready || !ring->sched.thread)
>>    			continue;
>>    
>>    		/* only need recovery sched of the given job's ring
Christian König Oct. 19, 2018, 4:28 p.m. UTC | #3
Am 19.10.18 um 17:06 schrieb Grodzovsky, Andrey:
>
> On 10/19/2018 03:08 AM, Koenig, Christian wrote:
>> Am 18.10.18 um 20:44 schrieb Andrey Grodzovsky:
>>> A ring might become unusable after a reset. If that is the case,
>>> drm_sched_entity_select_rq will choose another working rq
>>> to run the job, if one exists.
>>> Also, skip recovery of rings which are not ready.
>> Well that is not even remotely sufficient.
>>
>> If we can't bring up a ring any more after a reset, we would need to move
>> all jobs which were previously scheduled on it, and all of its entities,
>> to a different instance.
> That one should be easy to add inside amdgpu_device_gpu_recover in case
> the ring is dead: we just do the same for all the jobs in the dead ring's
> sched->ring_mirror_list before doing recovery for them, no?

Correct, you need to execute all jobs from the ring_mirror_list as well 
as move all entities to other rqs.

But there are some problems with that, see below.

>> What you do here breaks dependencies between jobs and can result in
>> unforeseen consequences including random memory writes.
> Can you explain this a bit more? AFAIK any job dependent on this job
> will still wait for its completion before running, regardless of whether
> this job moved to a different ring. What am I missing?

The stuff dependent on the moved jobs should indeed work correctly.

But you don't necessarily have space to execute the ring_mirror_list on 
another scheduler. Added to that, the jobs are prepared to run on one 
specific ring, e.g. UVD jobs are patched, VMIDs assigned, etc.

So that most likely won't work correctly.
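
As a concrete illustration of the kind of per-ring state meant here 
(field names invented for the example; amdgpu's real structures differ):

	/* By the time a job sits on the ring mirror list it already
	 * carries decisions tied to one specific ring.  Changing
	 * job->base.sched afterwards leaves all of this stale.
	 */
	struct example_ring_bound_state {
		unsigned int vmid;	/* reserved on the original ring's VM hub */
		u64 patched_ib_va;	/* UVD IB patched for the original ring */
	};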

>> As far as I can see that can't be done correctly with the current
>> scheduler design.
>>
>> In addition to that, when we can't restart one instance of a ring we
>> usually can't restart the others either. So the whole approach is rather
>> pointless.
> From my testing it looks like we can: compute ring 0 is dead but IB tests
> pass on other compute rings.

Interesting, but I would rather investigate why compute ring 0 is dead 
while the others still work.

At least some compute rings should be handled by the same engine, so if 
one is dead all the others should be dead as well.

Christian.

>
> Andrey
>
>> Regards,
>> Christian.
>>
>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>>> ---
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++++++++++++++-
>>>     1 file changed, 15 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index d11489e..3124ca1 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -3355,10 +3355,24 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>     	else
>>>     		r = amdgpu_device_reset(adev);
>>>     
>>> +	/*
>>> +	 * After reboot a ring might fail in which case this will
>>> +	 * move the job to different rq if possible
>>> +	 */
>>> +	if (job) {
>>> +		drm_sched_entity_select_rq(job->base.entity);
>>> +		if (job->base.entity->rq) {
>>> +			job->base.sched = job->base.entity->rq->sched;
>>> +		} else {
>>> +			job->base.sched = NULL;
>>> +			r = -ENOENT;
>>> +		}
>>> +	}
>>> +
>>>     	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>     		struct amdgpu_ring *ring = adev->rings[i];
>>>     
>>> -		if (!ring || !ring->sched.thread)
>>> +		if (!ring || !ring->ready || !ring->sched.thread)
>>>     			continue;
>>>     
>>>     		/* only need recovery sched of the given job's ring
Andrey Grodzovsky Oct. 19, 2018, 4:50 p.m. UTC | #4
That's my next step.

Andrey

On 10/19/2018 12:28 PM, Christian König wrote:
>> From my testing it looks like we can: compute ring 0 is dead but IB tests
>> pass on other compute rings.
>
> Interesting, but I would rather investigate why compute ring 0 is dead while the others still work.

Patch

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index d11489e..3124ca1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3355,10 +3355,24 @@  int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	else
 		r = amdgpu_device_reset(adev);
 
+	/*
+	 * After reboot a ring might fail in which case this will
+	 * move the job to different rq if possible
+	 */
+	if (job) {
+		drm_sched_entity_select_rq(job->base.entity);
+		if (job->base.entity->rq) {
+			job->base.sched = job->base.entity->rq->sched;
+		} else {
+			job->base.sched = NULL;
+			r = -ENOENT;
+		}
+	}
+
 	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
 		struct amdgpu_ring *ring = adev->rings[i];
 
-		if (!ring || !ring->sched.thread)
+		if (!ring || !ring->ready || !ring->sched.thread)
 			continue;
 
 		/* only need recovery sched of the given job's ring