diff mbox

[2/2] dma-buf: fix reservation_object_wait_timeout_rcu to wait correctly v2

Message ID 1502384509-10465-3-git-send-email-alexander.deucher@amd.com (mailing list archive)
State New, archived
Headers show

Commit Message

Alex Deucher Aug. 10, 2017, 5:01 p.m. UTC
From: Christian König <christian.koenig@amd.com>

With hardware resets in mind it is possible that all shared fences are
signaled, but the exlusive isn't. Fix waiting for everything in this situation.

v2: make sure we always wait for the exclusive fence

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Chunming Zhou <david1.zhou@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
---
 drivers/dma-buf/reservation.c | 33 +++++++++++++++------------------
 1 file changed, 15 insertions(+), 18 deletions(-)

Comments

Chris Wilson Aug. 10, 2017, 5:11 p.m. UTC | #1
Quoting Alex Deucher (2017-08-10 18:01:49)
> From: Christian König <christian.koenig@amd.com>
> 
> With hardware resets in mind it is possible that all shared fences are
> signaled, but the exlusive isn't. Fix waiting for everything in this situation.

I'm still puzzling over this one.

Setting an exclusive fence will clear all shared, so we must have added
the shared fences after the exclusive fence. But those shared fences must
follow the exclusive fence, you should not be able to reorder the shared
fences ahead of the exclusive. If you have completed shared fences, but
incomplete exclusive, doesn't that imply you have an ordering issue?
-Chris
Christian König Aug. 10, 2017, 6:19 p.m. UTC | #2
Am 10.08.2017 um 19:11 schrieb Chris Wilson:
> Quoting Alex Deucher (2017-08-10 18:01:49)
>> From: Christian König <christian.koenig@amd.com>
>>
>> With hardware resets in mind it is possible that all shared fences are
>> signaled, but the exlusive isn't. Fix waiting for everything in this situation.
> I'm still puzzling over this one.
>
> Setting an exclusive fence will clear all shared, so we must have added
> the shared fences after the exclusive fence. But those shared fences must
> follow the exclusive fence, you should not be able to reorder the shared
> fences ahead of the exclusive. If you have completed shared fences, but
> incomplete exclusive, doesn't that imply you have an ordering issue?

No, that is not an ordering issue.

The problem happens when the shared fences are "aborted" because of a 
GPU reset.

See the exclusive fence is often a DMA moving bytes on behalf of the 
kernel while the shared fences are the actual command submission from 
user space.

What can happen is that the userspace process is killed and/or it's 
scheduled command submission aborted for other reasons.

In this case the shared fences get an error code, are set to the 
signaled state and the job they represent is never executed.

Regards,
Christian.
Chris Wilson Aug. 10, 2017, 6:50 p.m. UTC | #3
Quoting Christian König (2017-08-10 19:19:52)
> Am 10.08.2017 um 19:11 schrieb Chris Wilson:
> > Quoting Alex Deucher (2017-08-10 18:01:49)
> >> From: Christian König <christian.koenig@amd.com>
> >>
> >> With hardware resets in mind it is possible that all shared fences are
> >> signaled, but the exlusive isn't. Fix waiting for everything in this situation.
> > I'm still puzzling over this one.
> >
> > Setting an exclusive fence will clear all shared, so we must have added
> > the shared fences after the exclusive fence. But those shared fences must
> > follow the exclusive fence, you should not be able to reorder the shared
> > fences ahead of the exclusive. If you have completed shared fences, but
> > incomplete exclusive, doesn't that imply you have an ordering issue?
> 
> No, that is not an ordering issue.
> 
> The problem happens when the shared fences are "aborted" because of a 
> GPU reset.
> 
> See the exclusive fence is often a DMA moving bytes on behalf of the 
> kernel while the shared fences are the actual command submission from 
> user space.
> 
> What can happen is that the userspace process is killed and/or it's 
> scheduled command submission aborted for other reasons.
> 
> In this case the shared fences get an error code, are set to the 
> signaled state and the job they represent is never executed.

I didn't allow the GPU reset to cause fences to appear to complete out
of order, just rewrote them so they would be skipped (with the error
status flagged) if the error was terminal but still update the hw
semaphore and breadcrumbs (i.e. the user portion was cancel, but our
bookkeeping was left intact).
-Chris
Koenig, Christian Aug. 10, 2017, 7:04 p.m. UTC | #4
Am 10.08.2017 um 20:50 schrieb Chris Wilson:
> Quoting Christian König (2017-08-10 19:19:52)
>> Am 10.08.2017 um 19:11 schrieb Chris Wilson:
>>> Quoting Alex Deucher (2017-08-10 18:01:49)
>>>> From: Christian König <christian.koenig@amd.com>
>>>>
>>>> With hardware resets in mind it is possible that all shared fences are
>>>> signaled, but the exlusive isn't. Fix waiting for everything in this situation.
>>> I'm still puzzling over this one.
>>>
>>> Setting an exclusive fence will clear all shared, so we must have added
>>> the shared fences after the exclusive fence. But those shared fences must
>>> follow the exclusive fence, you should not be able to reorder the shared
>>> fences ahead of the exclusive. If you have completed shared fences, but
>>> incomplete exclusive, doesn't that imply you have an ordering issue?
>> No, that is not an ordering issue.
>>
>> The problem happens when the shared fences are "aborted" because of a
>> GPU reset.
>>
>> See the exclusive fence is often a DMA moving bytes on behalf of the
>> kernel while the shared fences are the actual command submission from
>> user space.
>>
>> What can happen is that the userspace process is killed and/or it's
>> scheduled command submission aborted for other reasons.
>>
>> In this case the shared fences get an error code, are set to the
>> signaled state and the job they represent is never executed.
> I didn't allow the GPU reset to cause fences to appear to complete out
> of order, just rewrote them so they would be skipped (with the error
> status flagged) if the error was terminal but still update the hw
> semaphore and breadcrumbs (i.e. the user portion was cancel, but our
> bookkeeping was left intact).

Well again, they don't signal out of order. The exclusive fence and the 
shared fence which is aborted belong to completely independent contexts 
(and can even belong to completely independent hardware).

It's just that the job which should trigger the shared fence never gets 
executed.

Christian.
diff mbox

Patch

diff --git a/drivers/dma-buf/reservation.c b/drivers/dma-buf/reservation.c
index d4881f9..dec3a81 100644
--- a/drivers/dma-buf/reservation.c
+++ b/drivers/dma-buf/reservation.c
@@ -431,12 +431,25 @@  long reservation_object_wait_timeout_rcu(struct reservation_object *obj,
 	long ret = timeout ? timeout : 1;
 
 retry:
-	fence = NULL;
 	shared_count = 0;
 	seq = read_seqcount_begin(&obj->seq);
 	rcu_read_lock();
 
-	if (wait_all) {
+	fence = rcu_dereference(obj->fence_excl);
+	if (fence && !test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags)) {
+		if (!dma_fence_get_rcu(fence))
+			goto unlock_retry;
+
+		if (dma_fence_is_signaled(fence)) {
+			dma_fence_put(fence);
+			fence = NULL;
+		}
+
+	} else {
+		fence = NULL;
+	}
+
+	if (!fence && wait_all) {
 		struct reservation_object_list *fobj =
 						rcu_dereference(obj->fence);
 
@@ -463,22 +476,6 @@  long reservation_object_wait_timeout_rcu(struct reservation_object *obj,
 		}
 	}
 
-	if (!shared_count) {
-		struct dma_fence *fence_excl = rcu_dereference(obj->fence_excl);
-
-		if (fence_excl &&
-		    !test_bit(DMA_FENCE_FLAG_SIGNALED_BIT,
-			      &fence_excl->flags)) {
-			if (!dma_fence_get_rcu(fence_excl))
-				goto unlock_retry;
-
-			if (dma_fence_is_signaled(fence_excl))
-				dma_fence_put(fence_excl);
-			else
-				fence = fence_excl;
-		}
-	}
-
 	rcu_read_unlock();
 	if (fence) {
 		if (read_seqcount_retry(&obj->seq, seq)) {