
drm/i915: Before pageflip, also wait for shared dmabuf fences.

Message ID 1473293683-6247-1-git-send-email-mario.kleiner.de@gmail.com (mailing list archive)
State New, archived

Commit Message

Mario Kleiner Sept. 8, 2016, 12:14 a.m. UTC
amdgpu-kms uses shared fences for its prime exported dmabufs,
instead of an exclusive fence. Therefore we need to wait for
all fences of the dmabuf reservation object to prevent
unsynchronized rendering and flipping.

This patch was tested to behave properly with intel-kms +
radeon/amdgpu/nouveau-kms for correct prime sync during
pageflipping under DRI3/Present.

Should fix https://bugs.freedesktop.org/show_bug.cgi?id=95472
at least for page-flipped presentation.

Suggested-by: Michel Dänzer <michel.daenzer@amd.com>
Signed-off-by: Mario Kleiner <mario.kleiner.de@gmail.com>
Cc: Michel Dänzer <michel.daenzer@amd.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
---
 drivers/gpu/drm/i915/intel_display.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
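(The diff body is not reproduced on this page. Going by the diffstat, the
change is presumably of roughly the following shape -- a sketch against the
4.8-era reservation API, not the verbatim patch; the second argument of
reservation_object_wait_timeout_rcu() is the wait_all flag:)

	struct reservation_object *resv = obj->base.dma_buf ?
					  obj->base.dma_buf->resv : NULL;

	/* Wait on all fences of the dmabuf, not only the exclusive one. */
	if (resv)
		WARN_ON(reservation_object_wait_timeout_rcu(resv,
							    true /* wait_all */,
							    false /* !intr */,
							    MAX_SCHEDULE_TIMEOUT) < 0);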

Comments

Chris Wilson Sept. 8, 2016, 6:30 a.m. UTC | #1
On Thu, Sep 08, 2016 at 02:14:43AM +0200, Mario Kleiner wrote:
> amdgpu-kms uses shared fences for its prime exported dmabufs,
> instead of an exclusive fence. Therefore we need to wait for
> all fences of the dmabuf reservation object to prevent
> unsynchronized rendering and flipping.

No. Fix the root cause, as this affects not just flips but copies -
this implies that everybody using the resv object must wait for all
fences. The resv object is not just used for prime, but for all fencing,
so this breaks the ability to schedule parallel operations across engines.
-Chris
Mario Kleiner Sept. 8, 2016, 3:21 p.m. UTC | #2
On 09/08/2016 08:30 AM, Chris Wilson wrote:
> On Thu, Sep 08, 2016 at 02:14:43AM +0200, Mario Kleiner wrote:
>> amdgpu-kms uses shared fences for its prime exported dmabufs,
>> instead of an exclusive fence. Therefore we need to wait for
>> all fences of the dmabuf reservation object to prevent
>> unsynchronized rendering and flipping.
>
> No. Fix the root cause as this affects not just flips but copies -
> this implies that everybody using the resv object must wait for all
> fences. The resv object is not just used for prime, but all fencing, so
> this breaks the ability to schedule parallel operations across engine.
> -Chris
>

Ok. I think I now understand the difference, but let's check: the 
exclusive fence is essentially acting a bit like a write-lock, and the 
shared fences as reader locks? So you can have multiple readers but 
only one writer at a time?

I.e.:

Writer must wait for all fences before starting write access to a 
buffer, then attach the exclusive fence and signal it on end of write 
access. E.g., write to renderbuffer, write to texture etc.

Readers must wait for the exclusive fence, then attach a shared fence per 
reader and signal it on end of read access? E.g., read from texture, fb, 
scanout?

Is that correct? In that case we'd have a missing exclusive fence in 
amdgpu for the linear target dmabuf? Probably beyond my level of 
knowledge to fix this?
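In code, the pattern I have in mind would look roughly like this (a
hypothetical sketch against the reservation_object API, not taken from any
driver; the resv lock is assumed to be held for the reserve/add calls):

	/* Writer: wait for all fences, then publish an exclusive fence. */
	reservation_object_wait_timeout_rcu(resv, true /* wait_all */, false,
					    MAX_SCHEDULE_TIMEOUT);
	/* ... submit the write (render to the buffer) ... */
	reservation_object_add_excl_fence(resv, write_fence);

	/* Reader: wait only for the exclusive fence, then add a shared fence. */
	reservation_object_wait_timeout_rcu(resv, false /* excl only */, false,
					    MAX_SCHEDULE_TIMEOUT);
	reservation_object_reserve_shared(resv);
	/* ... submit the read (texturing, scanout, ...) ... */
	reservation_object_add_shared_fence(resv, read_fence);

(In practice a driver would make its command submission depend on those
fences rather than blocking, but the ordering would be the same.)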

thanks,
-mario
Chris Wilson Sept. 8, 2016, 4:23 p.m. UTC | #3
On Thu, Sep 08, 2016 at 05:21:42PM +0200, Mario Kleiner wrote:
> On 09/08/2016 08:30 AM, Chris Wilson wrote:
> >On Thu, Sep 08, 2016 at 02:14:43AM +0200, Mario Kleiner wrote:
> >>amdgpu-kms uses shared fences for its prime exported dmabufs,
> >>instead of an exclusive fence. Therefore we need to wait for
> >>all fences of the dmabuf reservation object to prevent
> >>unsynchronized rendering and flipping.
> >
> >No. Fix the root cause as this affects not just flips but copies -
> >this implies that everybody using the resv object must wait for all
> >fences. The resv object is not just used for prime, but all fencing, so
> >this breaks the ability to schedule parallel operations across engine.
> >-Chris
> >
> 
> Ok. I think i now understand the difference, but let's check: The
> exclusive fence is essentially acting a bit like a write-lock, and
> the shared fences as readers-locks? So you can have multiple readers
> but only one writer at a time?

That's how we (i915.ko and I hope the rest of the world) are using them.
In the model where there is just one reservation object on the GEM
object, that reservation object is then shared between internal driver
scheduling and external. We are reliant on being able to use buffers on
multiple engines by virtue of the shared fences, and to serialise
after writes by waiting on the exclusive fence. (So we can have concurrent
reads on the display engine, render engines and on the CPU - or
alternatively an exclusive writer.)

In the near future, i915 flips will wait on the common reservation object
not only for dma-bufs, but also for its own GEM objects.
 
> Ie.:
> 
> Writer must wait for all fences before starting write access to a
> buffer, then attach the exclusive fence and signal it on end of
> write access. E.g., write to renderbuffer, write to texture etc.

Yes.
 
> Readers must wait for exclusive fence, then attach a shared fence
> per reader and signal it on end of read access? E.g., read from
> texture, fb, scanout?

Yes.
 
> Is that correct? In that case we'd have a missing exclusive fence in
> amdgpu for the linear target dmabuf? Probably beyond my level of
> knowledge to fix this?

i915.ko requires the client to mark which buffers are written to.

In ttm, there are ttm_validate_buffer objects which mark whether they
should be using shared or exclusive fences. Afaict, in amdgpu they are
all set to shared; the relevant user interface seems to be
amdgpu_bo_list_set().
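(For reference, the relevant ttm bits look roughly like this -- paraphrased
rather than quoted verbatim from ttm_execbuf_util:)

	struct ttm_validate_buffer {
		struct list_head	  head;
		struct ttm_buffer_object *bo;
		bool			  shared; /* shared or exclusive slot */
	};

	/* ttm_eu_fence_buffer_objects() then does, per buffer on the list: */
	if (entry->shared)
		reservation_object_add_shared_fence(bo->resv, fence);
	else
		reservation_object_add_excl_fence(bo->resv, fence);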
-Chris
Michel Dänzer Sept. 9, 2016, 1:15 a.m. UTC | #4
On 09/09/16 01:23 AM, Chris Wilson wrote:
> On Thu, Sep 08, 2016 at 05:21:42PM +0200, Mario Kleiner wrote:
>> On 09/08/2016 08:30 AM, Chris Wilson wrote:
>>> On Thu, Sep 08, 2016 at 02:14:43AM +0200, Mario Kleiner wrote:
>>>> amdgpu-kms uses shared fences for its prime exported dmabufs,
>>>> instead of an exclusive fence. Therefore we need to wait for
>>>> all fences of the dmabuf reservation object to prevent
>>>> unsynchronized rendering and flipping.
>>>
>>> No. Fix the root cause as this affects not just flips but copies -
>>> this implies that everybody using the resv object must wait for all
>>> fences. The resv object is not just used for prime, but all fencing, so
>>> this breaks the ability to schedule parallel operations across engine.
>>> -Chris
>>>
>>
>> Ok. I think i now understand the difference, but let's check: The
>> exclusive fence is essentially acting a bit like a write-lock, and
>> the shared fences as readers-locks? So you can have multiple readers
>> but only one writer at a time?
> 
> That's how we (i915.ko and I hope the rest of the world) are using them.
> In the model where here is just one reservation object on the GEM
> object, that reservation object is then shared between internal driver
> scheduling and external. We are reliant on being able to use buffers on
> multiple engines through the virtue of the shared fences, and to serialise
> after writes by waiting on the exclusive fence. (So we can have concurrent
> reads on the display engine, render engines and on the CPU - or
> alternatively an exclusive writer.)
> 
> In the near future, i915 flips will wait on the common reservation object
> not only for dma-bufs, but also its own GEM objects.
>  
>> Ie.:
>>
>> Writer must wait for all fences before starting write access to a
>> buffer, then attach the exclusive fence and signal it on end of
>> write access. E.g., write to renderbuffer, write to texture etc.
> 
> Yes.
>  
>> Readers must wait for exclusive fence, then attach a shared fence
>> per reader and signal it on end of read access? E.g., read from
>> texture, fb, scanout?
> 
> Yes.
>  
>> Is that correct? In that case we'd have a missing exclusive fence in
>> amdgpu for the linear target dmabuf? Probably beyond my level of
>> knowledge to fix this?
> 
> i915.ko requires the client to mark which buffers are written to.
> 
> In ttm, there are ttm_validate_buffer objects which mark whether they
> should be using shared or exclusive fences. Afaict, in amdgpu they are
> all set to shared, the relevant user interface seems to be
> amdgpu_bo_list_set().

This all makes sense to me.

Christian, why is amdgpu setting only shared fences? Can we fix that?
Christian König Sept. 13, 2016, 8:44 a.m. UTC | #5
Am 09.09.2016 um 03:15 schrieb Michel Dänzer:
> On 09/09/16 01:23 AM, Chris Wilson wrote:
>> On Thu, Sep 08, 2016 at 05:21:42PM +0200, Mario Kleiner wrote:
>>> On 09/08/2016 08:30 AM, Chris Wilson wrote:
>>>> On Thu, Sep 08, 2016 at 02:14:43AM +0200, Mario Kleiner wrote:
>>>>> amdgpu-kms uses shared fences for its prime exported dmabufs,
>>>>> instead of an exclusive fence. Therefore we need to wait for
>>>>> all fences of the dmabuf reservation object to prevent
>>>>> unsynchronized rendering and flipping.
>>>> No. Fix the root cause as this affects not just flips but copies -
>>>> this implies that everybody using the resv object must wait for all
>>>> fences. The resv object is not just used for prime, but all fencing, so
>>>> this breaks the ability to schedule parallel operations across engine.
>>>> -Chris
>>>>
>>> Ok. I think i now understand the difference, but let's check: The
>>> exclusive fence is essentially acting a bit like a write-lock, and
>>> the shared fences as readers-locks? So you can have multiple readers
>>> but only one writer at a time?
>> That's how we (i915.ko and I hope the rest of the world) are using them.
>> In the model where here is just one reservation object on the GEM
>> object, that reservation object is then shared between internal driver
>> scheduling and external. We are reliant on being able to use buffers on
>> multiple engines through the virtue of the shared fences, and to serialise
>> after writes by waiting on the exclusive fence. (So we can have concurrent
>> reads on the display engine, render engines and on the CPU - or
>> alternatively an exclusive writer.)
>>
>> In the near future, i915 flips will wait on the common reservation object
>> not only for dma-bufs, but also its own GEM objects.
>>   
>>> Ie.:
>>>
>>> Writer must wait for all fences before starting write access to a
>>> buffer, then attach the exclusive fence and signal it on end of
>>> write access. E.g., write to renderbuffer, write to texture etc.
>> Yes.
>>   
>>> Readers must wait for exclusive fence, then attach a shared fence
>>> per reader and signal it on end of read access? E.g., read from
>>> texture, fb, scanout?
>> Yes.
>>   
>>> Is that correct? In that case we'd have a missing exclusive fence in
>>> amdgpu for the linear target dmabuf? Probably beyond my level of
>>> knowledge to fix this?
>> i915.ko requires the client to mark which buffers are written to.
>>
>> In ttm, there are ttm_validate_buffer objects which mark whether they
>> should be using shared or exclusive fences. Afaict, in amdgpu they are
>> all set to shared, the relevant user interface seems to be
>> amdgpu_bo_list_set().
> This all makes sense to me.
>
> Christian, why is amdgpu setting only shared fences? Can we fix that?

No, amdgpu relies on the fact that we even allow concurrent write 
accesses by userspace.

E.g. one part of the buffer can be rendered by one engine while another 
part could be rendered by another engine.

Just imagine X which is composing a buffer with both the 3D engine as 
well as the DMA engine.

All engines need to run in parallel and you need to wait for all of them 
to finish before scanout.

Everybody who needs exclusive access to the reservation object (like 
scanouts do) needs to wait for all fences, not just the exclusive one.
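(A simplified sketch of what that amounts to before scanout -- error
handling omitted, not the exact amdgpu flip code:)

	struct fence *excl, **shared;
	unsigned count, i;

	/* collect and wait on every fence of the BO's reservation object */
	reservation_object_get_fences_rcu(resv, &excl, &count, &shared);
	if (excl) {
		fence_wait(excl, false);
		fence_put(excl);
	}
	for (i = 0; i < count; ++i) {
		fence_wait(shared[i], false);
		fence_put(shared[i]);
	}
	kfree(shared);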

The Intel driver clearly needs to be fixed here.

Regards,
Christian.
Chris Wilson Sept. 13, 2016, 9:39 a.m. UTC | #6
On Tue, Sep 13, 2016 at 10:44:11AM +0200, Christian König wrote:
> Am 09.09.2016 um 03:15 schrieb Michel Dänzer:
> >On 09/09/16 01:23 AM, Chris Wilson wrote:
> >>On Thu, Sep 08, 2016 at 05:21:42PM +0200, Mario Kleiner wrote:
> >>>On 09/08/2016 08:30 AM, Chris Wilson wrote:
> >>>>On Thu, Sep 08, 2016 at 02:14:43AM +0200, Mario Kleiner wrote:
> >>>>>amdgpu-kms uses shared fences for its prime exported dmabufs,
> >>>>>instead of an exclusive fence. Therefore we need to wait for
> >>>>>all fences of the dmabuf reservation object to prevent
> >>>>>unsynchronized rendering and flipping.
> >>>>No. Fix the root cause as this affects not just flips but copies -
> >>>>this implies that everybody using the resv object must wait for all
> >>>>fences. The resv object is not just used for prime, but all fencing, so
> >>>>this breaks the ability to schedule parallel operations across engine.
> >>>>-Chris
> >>>>
> >>>Ok. I think i now understand the difference, but let's check: The
> >>>exclusive fence is essentially acting a bit like a write-lock, and
> >>>the shared fences as readers-locks? So you can have multiple readers
> >>>but only one writer at a time?
> >>That's how we (i915.ko and I hope the rest of the world) are using them.
> >>In the model where here is just one reservation object on the GEM
> >>object, that reservation object is then shared between internal driver
> >>scheduling and external. We are reliant on being able to use buffers on
> >>multiple engines through the virtue of the shared fences, and to serialise
> >>after writes by waiting on the exclusive fence. (So we can have concurrent
> >>reads on the display engine, render engines and on the CPU - or
> >>alternatively an exclusive writer.)
> >>
> >>In the near future, i915 flips will wait on the common reservation object
> >>not only for dma-bufs, but also its own GEM objects.
> >>>Ie.:
> >>>
> >>>Writer must wait for all fences before starting write access to a
> >>>buffer, then attach the exclusive fence and signal it on end of
> >>>write access. E.g., write to renderbuffer, write to texture etc.
> >>Yes.
> >>>Readers must wait for exclusive fence, then attach a shared fence
> >>>per reader and signal it on end of read access? E.g., read from
> >>>texture, fb, scanout?
> >>Yes.
> >>>Is that correct? In that case we'd have a missing exclusive fence in
> >>>amdgpu for the linear target dmabuf? Probably beyond my level of
> >>>knowledge to fix this?
> >>i915.ko requires the client to mark which buffers are written to.
> >>
> >>In ttm, there are ttm_validate_buffer objects which mark whether they
> >>should be using shared or exclusive fences. Afaict, in amdgpu they are
> >>all set to shared, the relevant user interface seems to be
> >>amdgpu_bo_list_set().
> >This all makes sense to me.
> >
> >Christian, why is amdgpu setting only shared fences? Can we fix that?
> 
> No, amdgpu relies on the fact that we even allow concurrent write
> accesses by userspace.
> 
> E.g. one part of the buffer can be rendered by one engine while
> another part could be rendered by another engine.
> 
> Just imagine X which is composing a buffer with both the 3D engine
> as well as the DMA engine.
> 
> All engines need to run in parallel and you need to wait for all of
> them to finish before scanout.
> 
> Everybody which needs exclusive access to the reservation object
> (like scanouts do) needs to wait for all fences, not just the
> exclusive one.
> 
> The Intel driver clearly needs to be fixed here.

If you are not using implicit fencing, you have to pass explicit fences
instead.
-Chris
Christian König Sept. 13, 2016, 12:52 p.m. UTC | #7
Am 13.09.2016 um 11:39 schrieb Chris Wilson:
> On Tue, Sep 13, 2016 at 10:44:11AM +0200, Christian König wrote:
>> Am 09.09.2016 um 03:15 schrieb Michel Dänzer:
>>> On 09/09/16 01:23 AM, Chris Wilson wrote:
>>>> On Thu, Sep 08, 2016 at 05:21:42PM +0200, Mario Kleiner wrote:
>>>>> On 09/08/2016 08:30 AM, Chris Wilson wrote:
>>>>>> On Thu, Sep 08, 2016 at 02:14:43AM +0200, Mario Kleiner wrote:
>>>>>>> amdgpu-kms uses shared fences for its prime exported dmabufs,
>>>>>>> instead of an exclusive fence. Therefore we need to wait for
>>>>>>> all fences of the dmabuf reservation object to prevent
>>>>>>> unsynchronized rendering and flipping.
>>>>>> No. Fix the root cause as this affects not just flips but copies -
>>>>>> this implies that everybody using the resv object must wait for all
>>>>>> fences. The resv object is not just used for prime, but all fencing, so
>>>>>> this breaks the ability to schedule parallel operations across engine.
>>>>>> -Chris
>>>>>>
>>>>> Ok. I think i now understand the difference, but let's check: The
>>>>> exclusive fence is essentially acting a bit like a write-lock, and
>>>>> the shared fences as readers-locks? So you can have multiple readers
>>>>> but only one writer at a time?
>>>> That's how we (i915.ko and I hope the rest of the world) are using them.
>>>> In the model where here is just one reservation object on the GEM
>>>> object, that reservation object is then shared between internal driver
>>>> scheduling and external. We are reliant on being able to use buffers on
>>>> multiple engines through the virtue of the shared fences, and to serialise
>>>> after writes by waiting on the exclusive fence. (So we can have concurrent
>>>> reads on the display engine, render engines and on the CPU - or
>>>> alternatively an exclusive writer.)
>>>>
>>>> In the near future, i915 flips will wait on the common reservation object
>>>> not only for dma-bufs, but also its own GEM objects.
>>>>> Ie.:
>>>>>
>>>>> Writer must wait for all fences before starting write access to a
>>>>> buffer, then attach the exclusive fence and signal it on end of
>>>>> write access. E.g., write to renderbuffer, write to texture etc.
>>>> Yes.
>>>>> Readers must wait for exclusive fence, then attach a shared fence
>>>>> per reader and signal it on end of read access? E.g., read from
>>>>> texture, fb, scanout?
>>>> Yes.
>>>>> Is that correct? In that case we'd have a missing exclusive fence in
>>>>> amdgpu for the linear target dmabuf? Probably beyond my level of
>>>>> knowledge to fix this?
>>>> i915.ko requires the client to mark which buffers are written to.
>>>>
>>>> In ttm, there are ttm_validate_buffer objects which mark whether they
>>>> should be using shared or exclusive fences. Afaict, in amdgpu they are
>>>> all set to shared, the relevant user interface seems to be
>>>> amdgpu_bo_list_set().
>>> This all makes sense to me.
>>>
>>> Christian, why is amdgpu setting only shared fences? Can we fix that?
>> No, amdgpu relies on the fact that we even allow concurrent write
>> accesses by userspace.
>>
>> E.g. one part of the buffer can be rendered by one engine while
>> another part could be rendered by another engine.
>>
>> Just imagine X which is composing a buffer with both the 3D engine
>> as well as the DMA engine.
>>
>> All engines need to run in parallel and you need to wait for all of
>> them to finish before scanout.
>>
>> Everybody which needs exclusive access to the reservation object
>> (like scanouts do) needs to wait for all fences, not just the
>> exclusive one.
>>
>> The Intel driver clearly needs to be fixed here.
> If you are not using implicit fencing, you have to pass explicit fences
> instead.

Which is exactly what we do, but only for the driver-internal command 
submissions.

All command submissions from the same process can run concurrently with 
amdgpu; only when we see a fence from another driver or process do we wait 
for it to complete before starting to run a command submission.

Other drivers can't make any assumptions about what a shared access is 
actually doing (e.g. writing or reading) with a buffer.

So the model in which i915.ko is using the reservation object and its shared 
fences is certainly not correct and needs to be fixed.

Regards,
Christian.

> -Chris
>
Michel Dänzer Sept. 21, 2016, 9:56 a.m. UTC | #8
On 13/09/16 09:52 PM, Christian König wrote:
> Am 13.09.2016 um 11:39 schrieb Chris Wilson:
>> On Tue, Sep 13, 2016 at 10:44:11AM +0200, Christian König wrote:
>>> Am 09.09.2016 um 03:15 schrieb Michel Dänzer:
>>>> On 09/09/16 01:23 AM, Chris Wilson wrote:
>>>>> On Thu, Sep 08, 2016 at 05:21:42PM +0200, Mario Kleiner wrote:
>>>>>> On 09/08/2016 08:30 AM, Chris Wilson wrote:
>>>>>>> On Thu, Sep 08, 2016 at 02:14:43AM +0200, Mario Kleiner wrote:
>>>>>>>> amdgpu-kms uses shared fences for its prime exported dmabufs,
>>>>>>>> instead of an exclusive fence. Therefore we need to wait for
>>>>>>>> all fences of the dmabuf reservation object to prevent
>>>>>>>> unsynchronized rendering and flipping.
>>>>>>> No. Fix the root cause as this affects not just flips but copies -
>>>>>>> this implies that everybody using the resv object must wait for all
>>>>>>> fences. The resv object is not just used for prime, but all
>>>>>>> fencing, so
>>>>>>> this breaks the ability to schedule parallel operations across
>>>>>>> engine.
>>>>>>> -Chris
>>>>>>>
>>>>>> Ok. I think i now understand the difference, but let's check: The
>>>>>> exclusive fence is essentially acting a bit like a write-lock, and
>>>>>> the shared fences as readers-locks? So you can have multiple readers
>>>>>> but only one writer at a time?
>>>>> That's how we (i915.ko and I hope the rest of the world) are using
>>>>> them.
>>>>> In the model where here is just one reservation object on the GEM
>>>>> object, that reservation object is then shared between internal driver
>>>>> scheduling and external. We are reliant on being able to use
>>>>> buffers on
>>>>> multiple engines through the virtue of the shared fences, and to
>>>>> serialise
>>>>> after writes by waiting on the exclusive fence. (So we can have
>>>>> concurrent
>>>>> reads on the display engine, render engines and on the CPU - or
>>>>> alternatively an exclusive writer.)
>>>>>
>>>>> In the near future, i915 flips will wait on the common reservation
>>>>> object
>>>>> not only for dma-bufs, but also its own GEM objects.
>>>>>> Ie.:
>>>>>>
>>>>>> Writer must wait for all fences before starting write access to a
>>>>>> buffer, then attach the exclusive fence and signal it on end of
>>>>>> write access. E.g., write to renderbuffer, write to texture etc.
>>>>> Yes.
>>>>>> Readers must wait for exclusive fence, then attach a shared fence
>>>>>> per reader and signal it on end of read access? E.g., read from
>>>>>> texture, fb, scanout?
>>>>> Yes.
>>>>>> Is that correct? In that case we'd have a missing exclusive fence in
>>>>>> amdgpu for the linear target dmabuf? Probably beyond my level of
>>>>>> knowledge to fix this?
>>>>> i915.ko requires the client to mark which buffers are written to.
>>>>>
>>>>> In ttm, there are ttm_validate_buffer objects which mark whether they
>>>>> should be using shared or exclusive fences. Afaict, in amdgpu they are
>>>>> all set to shared, the relevant user interface seems to be
>>>>> amdgpu_bo_list_set().
>>>> This all makes sense to me.
>>>>
>>>> Christian, why is amdgpu setting only shared fences? Can we fix that?
>>> No, amdgpu relies on the fact that we even allow concurrent write
>>> accesses by userspace.
>>>
>>> E.g. one part of the buffer can be rendered by one engine while
>>> another part could be rendered by another engine.
>>>
>>> Just imagine X which is composing a buffer with both the 3D engine
>>> as well as the DMA engine.
>>>
>>> All engines need to run in parallel and you need to wait for all of
>>> them to finish before scanout.
>>>
>>> Everybody which needs exclusive access to the reservation object
>>> (like scanouts do) needs to wait for all fences, not just the
>>> exclusive one.
>>>
>>> The Intel driver clearly needs to be fixed here.
>> If you are not using implicit fencing, you have to pass explicit fences
>> instead.
> 
> Which is exactly what we do, but only for the driver internally command
> submissions.
> 
> All command submissions from the same process can run concurrently with
> amdgpu, only when we see a fence from another driver or process we wait
> for it to complete before starting to run a command submission.
> 
> Other drivers can't make any assumption on what a shared access is
> actually doing (e.g. writing or reading) with a buffer.
> 
> So the model i915.ko is using the reservation object and it's shared
> fences is certainly not correct and needs to be fixed.

Looks like there are different interpretations of the semantics of
exclusive vs. shared fences. Where are these semantics documented?


FWIW, we seem to have the same issue with radeon vs. amdgpu: radeon only
seems to wait for exclusive fences, so e.g. running Xorg on amdgpu and
using PRIME slave scanout on radeon leaves artifacts.
Christian König Sept. 21, 2016, 10:30 a.m. UTC | #9
Am 21.09.2016 um 11:56 schrieb Michel Dänzer:
>
> Looks like there are different interpretations of the semantics of
> exclusive vs. shared fences. Where are these semantics documented?

Yeah, I think as well that this is the primary question here.

IIRC the fences were explicitly called exclusive/shared instead of 
writing/reading on purpose.

I absolutely don't mind switching them to writing/reading semantics, 
but amdgpu really needs multiple writers at the same time.

So in this case the writing side of a reservation object needs to be a 
collection of fences as well.

> FWIW, we seem to have the same issue with radeon vs. amdgpu: radeon only
> seems to wait for exclusive fences, so e.g. running Xorg on amdgpu and
> using PRIME slave scanout on radeon leaves artifacts.

Yeah, I know. See radeon_display.c radeon_flip_work_func().

We pretty much need the same patch here I've done for amdgpu as well.

Regards,
Christian.
Daniel Vetter Sept. 21, 2016, 11:04 a.m. UTC | #10
On Wed, Sep 21, 2016 at 12:30 PM, Christian König
<deathsimple@vodafone.de> wrote:
> Am 21.09.2016 um 11:56 schrieb Michel Dänzer:
>>
>>
>> Looks like there are different interpretations of the semantics of
>> exclusive vs. shared fences. Where are these semantics documented?
>
>
> Yeah, I think as well that this is the primary question here.
>
> IIRC the fences were explicitly called exclusive/shared instead of
> writing/reading on purpose.
>
> I absolutely don't mind switching to them to writing/reading semantics, but
> amdgpu really needs multiple writers at the same time.
>
> So in this case the writing side of a reservation object needs to be a
> collection of fences as well.

You can't have multiple writers with implicit syncing. That confusion
is exactly why we called them shared/exclusive. Multiple writers
generally means that you do some form of fencing in userspace
(unsync'ed gl buffer access is the common one). What you do for
private buffers doesn't matter, but when you render into a
shared/winsys buffer you really need to set the exclusive fence (and
there can only ever be one). So this probably needs some userspace
adjustments to make sure you don't accidentally set an exclusive write
hazard when you don't really want that implicit sync.
-Daniel
Christian König Sept. 21, 2016, 11:19 a.m. UTC | #11
Am 21.09.2016 um 13:04 schrieb Daniel Vetter:
> On Wed, Sep 21, 2016 at 12:30 PM, Christian König
> <deathsimple@vodafone.de> wrote:
>> Am 21.09.2016 um 11:56 schrieb Michel Dänzer:
>>>
>>> Looks like there are different interpretations of the semantics of
>>> exclusive vs. shared fences. Where are these semantics documented?
>>
>> Yeah, I think as well that this is the primary question here.
>>
>> IIRC the fences were explicitly called exclusive/shared instead of
>> writing/reading on purpose.
>>
>> I absolutely don't mind switching to them to writing/reading semantics, but
>> amdgpu really needs multiple writers at the same time.
>>
>> So in this case the writing side of a reservation object needs to be a
>> collection of fences as well.
> You can't have multiple writers with implicit syncing. That confusion
> is exactly why we called them shared/exclusive. Multiple writers
> generally means that you do some form of fencing in userspace
> (unsync'ed gl buffer access is the common one). What you do for
> private buffers doesn't matter, but when you render into a
> shared/winsys buffer you really need to set the exclusive fence (and
> there can only ever be one). So probably needs some userspace
> adjustments to make sure you don't accidentally set an exclusive write
> hazard when you don't really want that implicit sync.

Nope, that isn't true.

We use multiple writers without implicit syncing between processes in 
the amdgpu stack perfectly fine.

See amdgpu_sync.c for the implementation. What we do there is take a 
look at all the fences associated with a reservation object and only 
sync to those that are from another process.
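(Roughly, paraphrased from memory rather than quoted verbatim,
amdgpu_sync_resv() does:)

	/* always sync to the exclusive fence, e.g. buffer moves */
	f = reservation_object_get_excl(resv);
	amdgpu_sync_fence(adev, sync, f);

	flist = reservation_object_get_list(resv);
	for (i = 0; flist && i < flist->shared_count; ++i) {
		f = rcu_dereference_protected(flist->shared[i],
					      reservation_object_held(resv));
		/* skip shared fences attached by ourselves (same owner) */
		if (amdgpu_sync_same_dev(adev, f) &&
		    amdgpu_sync_get_owner(f) == owner)
			continue;
		amdgpu_sync_fence(adev, sync, f);
	}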

Then we use implicit syncing for command submissions in the form of 
"dependencies". E.g. for each CS we report back an identifier of that 
submission to user space and on the next submission you can give this 
identifier as dependency which needs to be satisfied before the command 
submission can start running.

This was done to allow multiple engines (3D, DMA, Compute) to compose a 
buffer while still allowing compatibility with protocols like DRI2/DRI3.

Regards,
Christian.

> -Daniel
Daniel Vetter Sept. 21, 2016, 12:56 p.m. UTC | #12
On Wed, Sep 21, 2016 at 1:19 PM, Christian König
<deathsimple@vodafone.de> wrote:
> Am 21.09.2016 um 13:04 schrieb Daniel Vetter:
>>
>> On Wed, Sep 21, 2016 at 12:30 PM, Christian König
>> <deathsimple@vodafone.de> wrote:
>>>
>>> Am 21.09.2016 um 11:56 schrieb Michel Dänzer:
>>>>
>>>>
>>>> Looks like there are different interpretations of the semantics of
>>>> exclusive vs. shared fences. Where are these semantics documented?
>>>
>>>
>>> Yeah, I think as well that this is the primary question here.
>>>
>>> IIRC the fences were explicitly called exclusive/shared instead of
>>> writing/reading on purpose.
>>>
>>> I absolutely don't mind switching to them to writing/reading semantics,
>>> but
>>> amdgpu really needs multiple writers at the same time.
>>>
>>> So in this case the writing side of a reservation object needs to be a
>>> collection of fences as well.
>>
>> You can't have multiple writers with implicit syncing. That confusion
>> is exactly why we called them shared/exclusive. Multiple writers
>> generally means that you do some form of fencing in userspace
>> (unsync'ed gl buffer access is the common one). What you do for
>> private buffers doesn't matter, but when you render into a
>> shared/winsys buffer you really need to set the exclusive fence (and
>> there can only ever be one). So probably needs some userspace
>> adjustments to make sure you don't accidentally set an exclusive write
>> hazard when you don't really want that implicit sync.
>
>
> Nope, that isn't true.
>
> We use multiple writers without implicit syncing between processes in the
> amdgpu stack perfectly fine.
>
> See amdgpu_sync.c for the implementation. What we do there is taking a look
> at all the fences associated with a reservation object and only sync to
> those who are from another process.
>
> Then we use implicit syncing for command submissions in the form of
> "dependencies". E.g. for each CS we report back an identifier of that
> submission to user space and on the next submission you can give this
> identifier as dependency which needs to be satisfied before the command
> submission can start running.

This is called explicit fencing. Implemented with a driver-private
primitive (and not sync_file fds like on android), but still
conceptually explicit fencing. Implicit fencing really only can handle
one writer, at least as currently implemented by struct
reservation_object.

> This was done to allow multiple engines (3D, DMA, Compute) to compose a
> buffer while still allow compatibility with protocols like DRI2/DRI3.

Instead of the current solution you need to stop attaching exclusive
fences to non-shared buffers (which are coordinated using the
driver-private explicit fencing you're describing), and only attach
exclusive fences to shared buffers (DRI2/3, PRIME, whatever). Since
you're doing explicit syncing for internal stuff anyway you can still
opt to ignore the exclusive fences if you want to (driven by a flag or
something similar).
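(A hypothetical illustration of that idea, with a made-up prime_shared
flag -- not existing amdgpu code:)

	/* When fencing BOs after a command submission: */
	if (bo->prime_shared)	/* hypothetical: set when exported via PRIME */
		reservation_object_add_excl_fence(resv, fence);
	else
		reservation_object_add_shared_fence(resv, fence);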
-Daniel
Michel Dänzer Sept. 21, 2016, 3:07 p.m. UTC | #13
On 21/09/16 09:56 PM, Daniel Vetter wrote:
> On Wed, Sep 21, 2016 at 1:19 PM, Christian König
> <deathsimple@vodafone.de> wrote:
>> Am 21.09.2016 um 13:04 schrieb Daniel Vetter:
>>> On Wed, Sep 21, 2016 at 12:30 PM, Christian König
>>> <deathsimple@vodafone.de> wrote:
>>>> Am 21.09.2016 um 11:56 schrieb Michel Dänzer:
>>>>>
>>>>>
>>>>> Looks like there are different interpretations of the semantics of
>>>>> exclusive vs. shared fences. Where are these semantics documented?
>>>>
>>>>
>>>> Yeah, I think as well that this is the primary question here.
>>>>
>>>> IIRC the fences were explicitly called exclusive/shared instead of
>>>> writing/reading on purpose.
>>>>
>>>> I absolutely don't mind switching to them to writing/reading semantics,
>>>> but
>>>> amdgpu really needs multiple writers at the same time.
>>>>
>>>> So in this case the writing side of a reservation object needs to be a
>>>> collection of fences as well.
>>>
>>> You can't have multiple writers with implicit syncing. That confusion
>>> is exactly why we called them shared/exclusive. Multiple writers
>>> generally means that you do some form of fencing in userspace
>>> (unsync'ed gl buffer access is the common one). What you do for
>>> private buffers doesn't matter, but when you render into a
>>> shared/winsys buffer you really need to set the exclusive fence (and
>>> there can only ever be one). So probably needs some userspace
>>> adjustments to make sure you don't accidentally set an exclusive write
>>> hazard when you don't really want that implicit sync.
>>
>>
>> Nope, that isn't true.
>>
>> We use multiple writers without implicit syncing between processes in the
>> amdgpu stack perfectly fine.
>>
>> See amdgpu_sync.c for the implementation. What we do there is taking a look
>> at all the fences associated with a reservation object and only sync to
>> those who are from another process.
>>
>> Then we use implicit syncing for command submissions in the form of
>> "dependencies". E.g. for each CS we report back an identifier of that
>> submission to user space and on the next submission you can give this
>> identifier as dependency which needs to be satisfied before the command
>> submission can start running.
> 
> This is called explicit fencing. Implemented with a driver-private
> primitive (and not sync_file fds like on android), but still
> conceptually explicit fencing. Implicit fencing really only can handle
> one writer, at least as currently implemented by struct
> reservation_object.
> 
>> This was done to allow multiple engines (3D, DMA, Compute) to compose a
>> buffer while still allow compatibility with protocols like DRI2/DRI3.
> 
> Instead of the current solution you need to stop attaching exclusive
> fences to non-shared buffers (which are coordinated using the
> driver-private explicit fencing you're describing),

Err, the current issue is actually that amdgpu never sets an exclusive
fence, only ever shared ones. :)

> and only attach exclusive fences to shared buffers (DRI2/3, PRIME,
> whatever).

Still, it occurred to me in the meantime that amdgpu setting the
exclusive fence for buffers shared via PRIME (no matter if it's a write
or read operation) might be a solution. Christian, what do you think?
Michel Dänzer Sept. 21, 2016, 3:13 p.m. UTC | #14
On 21/09/16 07:30 PM, Christian König wrote:
> Am 21.09.2016 um 11:56 schrieb Michel Dänzer:
>>
>> FWIW, we seem to have the same issue with radeon vs. amdgpu: radeon only
>> seems to wait for exclusive fences, so e.g. running Xorg on amdgpu and
>> using PRIME slave scanout on radeon leaves artifacts.
> 
> Yeah, I know. See radeon_display.c radeon_flip_work_func().
> 
> We pretty much need the same patch here I've done for amdgpu as well.

Actually, the PRIME slave can't scan out from the shared BOs directly
(recall the recent discussion we had about that with Mario), but has to
copy from the shared BO to a dedicated scanout BO. These copies need to
be synchronized with the primary GPU's copies to the shared BO.
Christian König Sept. 21, 2016, 3:15 p.m. UTC | #15
Am 21.09.2016 um 17:07 schrieb Michel Dänzer:
> On 21/09/16 09:56 PM, Daniel Vetter wrote:
>> On Wed, Sep 21, 2016 at 1:19 PM, Christian König
>> <deathsimple@vodafone.de> wrote:
>>> Am 21.09.2016 um 13:04 schrieb Daniel Vetter:
>>>> On Wed, Sep 21, 2016 at 12:30 PM, Christian König
>>>> <deathsimple@vodafone.de> wrote:
>>>>> Am 21.09.2016 um 11:56 schrieb Michel Dänzer:
>>>>>>
>>>>>> Looks like there are different interpretations of the semantics of
>>>>>> exclusive vs. shared fences. Where are these semantics documented?
>>>>>
>>>>> Yeah, I think as well that this is the primary question here.
>>>>>
>>>>> IIRC the fences were explicitly called exclusive/shared instead of
>>>>> writing/reading on purpose.
>>>>>
>>>>> I absolutely don't mind switching to them to writing/reading semantics,
>>>>> but
>>>>> amdgpu really needs multiple writers at the same time.
>>>>>
>>>>> So in this case the writing side of a reservation object needs to be a
>>>>> collection of fences as well.
>>>> You can't have multiple writers with implicit syncing. That confusion
>>>> is exactly why we called them shared/exclusive. Multiple writers
>>>> generally means that you do some form of fencing in userspace
>>>> (unsync'ed gl buffer access is the common one). What you do for
>>>> private buffers doesn't matter, but when you render into a
>>>> shared/winsys buffer you really need to set the exclusive fence (and
>>>> there can only ever be one). So probably needs some userspace
>>>> adjustments to make sure you don't accidentally set an exclusive write
>>>> hazard when you don't really want that implicit sync.
>>>
>>> Nope, that isn't true.
>>>
>>> We use multiple writers without implicit syncing between processes in the
>>> amdgpu stack perfectly fine.
>>>
>>> See amdgpu_sync.c for the implementation. What we do there is taking a look
>>> at all the fences associated with a reservation object and only sync to
>>> those who are from another process.
>>>
>>> Then we use implicit syncing for command submissions in the form of
>>> "dependencies". E.g. for each CS we report back an identifier of that
>>> submission to user space and on the next submission you can give this
>>> identifier as dependency which needs to be satisfied before the command
>>> submission can start running.
>> This is called explicit fencing. Implemented with a driver-private
>> primitive (and not sync_file fds like on android), but still
>> conceptually explicit fencing. Implicit fencing really only can handle
>> one writer, at least as currently implemented by struct
>> reservation_object.
>>
>>> This was done to allow multiple engines (3D, DMA, Compute) to compose a
>>> buffer while still allow compatibility with protocols like DRI2/DRI3.
>> Instead of the current solution you need to stop attaching exclusive
>> fences to non-shared buffers (which are coordinated using the
>> driver-private explicit fencing you're describing),
> Err, the current issue is actually that amdgpu never sets an exclusive
> fence, only ever shared ones. :)

Actually amdgpu does set the exclusive fence for buffer migrations, 
because that is an operation user space doesn't know about, and so it 
needs "exclusive" access to the buffer.


>> and only attach exclusive fences to shared buffers (DRI2/3, PRIME,
>> whatever).
> Still, it occurred to me in the meantime that amdgpu setting the
> exclusive fence for buffers shared via PRIME (no matter if it's a write
> or read operation) might be a solution. Christian, what do you think?

The problem is that we don't have one fence, but many.

E.g. there can be many operations on a buffer at the same time and we 
need to wait for all of them to complete before it can be displayed.

Regards,
Christian.
Christian König Sept. 21, 2016, 3:21 p.m. UTC | #16
Am 21.09.2016 um 17:13 schrieb Michel Dänzer:
> On 21/09/16 07:30 PM, Christian König wrote:
>> Am 21.09.2016 um 11:56 schrieb Michel Dänzer:
>>> FWIW, we seem to have the same issue with radeon vs. amdgpu: radeon only
>>> seems to wait for exclusive fences, so e.g. running Xorg on amdgpu and
>>> using PRIME slave scanout on radeon leaves artifacts.
>> Yeah, I know. See radeon_display.c radeon_flip_work_func().
>>
>> We pretty much need the same patch here I've done for amdgpu as well.
> Actually, the PRIME slave can't scan out from the shared BOs directly
> (recall the recent discussion we had about that with Mario), but has to
> copy from the shared BO to a dedicated scanout BO. These copies need to
> be synchronized with the primary GPU's copies to the shared BO.

Yeah, that thought came to my mind before as well.

Buffer migrations by the kernel caused by a prime export actually set 
the exclusive fence.

So this shouldn't be an issue in practice when the displaying GPU needs 
to copy from the BO again anyway.

The only case I can see when this can happen is when the BO is composed 
directly in system memory by the engines and not migrated there.

Could be that we run into this issue more often in the future, because 
that is pretty much what we want to have for 4K UVD decode.

Christian.
Michel Dänzer Sept. 21, 2016, 3:28 p.m. UTC | #17
On 22/09/16 12:21 AM, Christian König wrote:
> Am 21.09.2016 um 17:13 schrieb Michel Dänzer:
>> On 21/09/16 07:30 PM, Christian König wrote:
>>> Am 21.09.2016 um 11:56 schrieb Michel Dänzer:
>>>> FWIW, we seem to have the same issue with radeon vs. amdgpu: radeon
>>>> only
>>>> seems to wait for exclusive fences, so e.g. running Xorg on amdgpu and
>>>> using PRIME slave scanout on radeon leaves artifacts.
>>> Yeah, I know. See radeon_display.c radeon_flip_work_func().
>>>
>>> We pretty much need the same patch here I've done for amdgpu as well.
>> Actually, the PRIME slave can't scan out from the shared BOs directly
>> (recall the recent discussion we had about that with Mario), but has to
>> copy from the shared BO to a dedicated scanout BO. These copies need to
>> be synchronized with the primary GPU's copies to the shared BO.
> 
> Yeah, that thought came to my mind before as well.
> 
> Buffer migrations by the kernel caused by a prime export actually set
> the exclusive fence.
> 
> So this shouldn't be an issue in practice when the displaying GPU needs
> to copy from the BO again anyway.
> 
> The only case I can see when this can happen is when the BO is composed
> directly in system memory by the engines and not migrated there.

There is no migration going on in the steady state, just the primary GPU
writing to the shared BO and the slave GPU reading from it. If those
operations aren't properly synchronized, there is at least intermittent
tearing, but there can even be artifacts which stay until that area is
updated again.
Michel Dänzer Sept. 21, 2016, 3:29 p.m. UTC | #18
On 22/09/16 12:15 AM, Christian König wrote:
> Am 21.09.2016 um 17:07 schrieb Michel Dänzer:
>> On 21/09/16 09:56 PM, Daniel Vetter wrote:
>>> On Wed, Sep 21, 2016 at 1:19 PM, Christian König
>>> <deathsimple@vodafone.de> wrote:
>>>>
>>>> We use multiple writers without implicit syncing between processes
>>>> in the
>>>> amdgpu stack perfectly fine.
>>>>
>>>> See amdgpu_sync.c for the implementation. What we do there is taking
>>>> a look
>>>> at all the fences associated with a reservation object and only sync to
>>>> those who are from another process.
>>>>
>>>> Then we use implicit syncing for command submissions in the form of
>>>> "dependencies". E.g. for each CS we report back an identifier of that
>>>> submission to user space and on the next submission you can give this
>>>> identifier as dependency which needs to be satisfied before the command
>>>> submission can start running.
>>> This is called explicit fencing. Implemented with a driver-private
>>> primitive (and not sync_file fds like on android), but still
>>> conceptually explicit fencing. Implicit fencing really only can handle
>>> one writer, at least as currently implemented by struct
>>> reservation_object.
>>>
>>>> This was done to allow multiple engines (3D, DMA, Compute) to compose a
>>>> buffer while still allow compatibility with protocols like DRI2/DRI3.
>>> Instead of the current solution you need to stop attaching exclusive
>>> fences to non-shared buffers (which are coordinated using the
>>> driver-private explicit fencing you're describing),
>> Err, the current issue is actually that amdgpu never sets an exclusive
>> fence, only ever shared ones. :)
> 
> Actually amdgpu does set the exclusive fence for buffer migrations,
> cause that is an operation user space doesn't know about and so it needs
> to be "exclusive" access to the buffer.
> 
> 
>>> and only attach exclusive fences to shared buffers (DRI2/3, PRIME,
>>> whatever).
>> Still, it occurred to me in the meantime that amdgpu setting the
>> exclusive fence for buffers shared via PRIME (no matter if it's a write
>> or read operation) might be a solution. Christian, what do you think?
> 
> The problem is that we don't have one fence, but many.
> 
> E.g. there can be many operation on a buffer at the same time and we
> need to wait for all of them to complete before it can be displayed.

Maybe in theory, but with the problem we have in practice right now, the
amdgpu GPU should only ever access the shared BO with the same engine.

Anyway, this should be solvable by setting some kind of meta-fence as
the exclusive fence, which can internally be mapped to multiple fences,
maybe up to one for each ring which can access the BO?
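(Something like struct fence_array might fit that role -- a hypothetical
sketch, assuming the fence_array API that went in around 4.8:)

	struct fence_array *array;

	/* Bundle the per-ring fences into one "meta" fence and use that as
	 * the exclusive fence of the shared BO. */
	array = fence_array_create(count, fences, fence_context_alloc(1), 1,
				   false /* signal when all have signalled */);
	reservation_object_add_excl_fence(resv, &array->base);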
Christian König Sept. 21, 2016, 4:23 p.m. UTC | #19
Am 21.09.2016 um 17:29 schrieb Michel Dänzer:
> On 22/09/16 12:15 AM, Christian König wrote:
>> Am 21.09.2016 um 17:07 schrieb Michel Dänzer:
>>> On 21/09/16 09:56 PM, Daniel Vetter wrote:
>>>> On Wed, Sep 21, 2016 at 1:19 PM, Christian König
>>>> <deathsimple@vodafone.de> wrote:
>>>>> We use multiple writers without implicit syncing between processes
>>>>> in the
>>>>> amdgpu stack perfectly fine.
>>>>>
>>>>> See amdgpu_sync.c for the implementation. What we do there is taking
>>>>> a look
>>>>> at all the fences associated with a reservation object and only sync to
>>>>> those who are from another process.
>>>>>
>>>>> Then we use implicit syncing for command submissions in the form of
>>>>> "dependencies". E.g. for each CS we report back an identifier of that
>>>>> submission to user space and on the next submission you can give this
>>>>> identifier as dependency which needs to be satisfied before the command
>>>>> submission can start running.
>>>> This is called explicit fencing. Implemented with a driver-private
>>>> primitive (and not sync_file fds like on android), but still
>>>> conceptually explicit fencing. Implicit fencing really only can handle
>>>> one writer, at least as currently implemented by struct
>>>> reservation_object.
>>>>
>>>>> This was done to allow multiple engines (3D, DMA, Compute) to compose a
>>>>> buffer while still allow compatibility with protocols like DRI2/DRI3.
>>>> Instead of the current solution you need to stop attaching exclusive
>>>> fences to non-shared buffers (which are coordinated using the
>>>> driver-private explicit fencing you're describing),
>>> Err, the current issue is actually that amdgpu never sets an exclusive
>>> fence, only ever shared ones. :)
>> Actually amdgpu does set the exclusive fence for buffer migrations,
>> cause that is an operation user space doesn't know about and so it needs
>> to be "exclusive" access to the buffer.
>>
>>
>>>> and only attach exclusive fences to shared buffers (DRI2/3, PRIME,
>>>> whatever).
>>> Still, it occurred to me in the meantime that amdgpu setting the
>>> exclusive fence for buffers shared via PRIME (no matter if it's a write
>>> or read operation) might be a solution. Christian, what do you think?
>> The problem is that we don't have one fence, but many.
>>
>> E.g. there can be many operation on a buffer at the same time and we
>> need to wait for all of them to complete before it can be displayed.
> Maybe in theory, but with the problem we have in practice right now, the
> amdgpu GPU should only ever access the shared BO with the same engine.

That clearly won't work. What both Mesa and the pro stack do with the BO 
before it is displayed makes it mandatory to execute things in parallel 
(at least for the non-shared case).

> Anyway, this should be solvable by setting some kind of meta-fence as
> the exclusive fence, which can internally be mapped to multiple fences,
> maybe up to one for each ring which can access the BO?

I've thought about that as well, but this approach would also only work 
when we keep a collection of fences and not just an array, because of the 
scheduler.

For a quick workaround I suggest just serializing all accesses to BOs 
shared with different drivers, but essentially I think it is a perfectly 
valid requirement to have multiple writers to one BO.

Christian.
Daniel Vetter Sept. 22, 2016, 6:33 a.m. UTC | #20
On Thu, Sep 22, 2016 at 12:07:24AM +0900, Michel Dänzer wrote:
> On 21/09/16 09:56 PM, Daniel Vetter wrote:
> > On Wed, Sep 21, 2016 at 1:19 PM, Christian König
> > <deathsimple@vodafone.de> wrote:
> >> Am 21.09.2016 um 13:04 schrieb Daniel Vetter:
> >>> On Wed, Sep 21, 2016 at 12:30 PM, Christian König
> >>> <deathsimple@vodafone.de> wrote:
> >>>> Am 21.09.2016 um 11:56 schrieb Michel Dänzer:
> >>>>>
> >>>>>
> >>>>> Looks like there are different interpretations of the semantics of
> >>>>> exclusive vs. shared fences. Where are these semantics documented?
> >>>>
> >>>>
> >>>> Yeah, I think as well that this is the primary question here.
> >>>>
> >>>> IIRC the fences were explicitly called exclusive/shared instead of
> >>>> writing/reading on purpose.
> >>>>
> >>>> I absolutely don't mind switching to them to writing/reading semantics,
> >>>> but
> >>>> amdgpu really needs multiple writers at the same time.
> >>>>
> >>>> So in this case the writing side of a reservation object needs to be a
> >>>> collection of fences as well.
> >>>
> >>> You can't have multiple writers with implicit syncing. That confusion
> >>> is exactly why we called them shared/exclusive. Multiple writers
> >>> generally means that you do some form of fencing in userspace
> >>> (unsync'ed gl buffer access is the common one). What you do for
> >>> private buffers doesn't matter, but when you render into a
> >>> shared/winsys buffer you really need to set the exclusive fence (and
> >>> there can only ever be one). So probably needs some userspace
> >>> adjustments to make sure you don't accidentally set an exclusive write
> >>> hazard when you don't really want that implicit sync.
> >>
> >>
> >> Nope, that isn't true.
> >>
> >> We use multiple writers without implicit syncing between processes in the
> >> amdgpu stack perfectly fine.
> >>
> >> See amdgpu_sync.c for the implementation. What we do there is taking a look
> >> at all the fences associated with a reservation object and only sync to
> >> those who are from another process.
> >>
> >> Then we use implicit syncing for command submissions in the form of
> >> "dependencies". E.g. for each CS we report back an identifier of that
> >> submission to user space and on the next submission you can give this
> >> identifier as dependency which needs to be satisfied before the command
> >> submission can start running.
> > 
> > This is called explicit fencing. Implemented with a driver-private
> > primitive (and not sync_file fds like on android), but still
> > conceptually explicit fencing. Implicit fencing really only can handle
> > one writer, at least as currently implemented by struct
> > reservation_object.
> > 
> >> This was done to allow multiple engines (3D, DMA, Compute) to compose a
> >> buffer while still allow compatibility with protocols like DRI2/DRI3.
> > 
> > Instead of the current solution you need to stop attaching exclusive
> > fences to non-shared buffers (which are coordinated using the
> > driver-private explicit fencing you're describing),
> 
> Err, the current issue is actually that amdgpu never sets an exclusive
> fence, only ever shared ones. :)

Well, since you sometimes sync and sometimes don't, it is kind of a special
case of a semi-exclusive fence (even if attached to the shared slots).
> 
> > and only attach exclusive fences to shared buffers (DRI2/3, PRIME,
> > whatever).
> 
> Still, it occurred to me in the meantime that amdgpu setting the
> exclusive fence for buffers shared via PRIME (no matter if it's a write
> or read operation) might be a solution. Christian, what do you think?

Yup, that's what I mean. And it shouldn't cause a problem for shared
buffers (at least for protocols where implicit fencing is required), since
for those you really can't have multiple concurrent writers. And with the
special checks in amdgpu_sync.c that's what's happening in reality; the only
difference is that the filtering/selection of what is considered an
exclusive fence happens when you sync, and not when you attach them. And
that breaks reservation_object assumptions.
-Daniel
Daniel Vetter Sept. 22, 2016, 6:36 a.m. UTC | #21
On Wed, Sep 21, 2016 at 06:23:35PM +0200, Christian König wrote:
> For a quick workaround I suggest to just serialize all accesses to BO shared
> with different drivers, but essentially I think it is a perfectly valid
> requirement to have multiple writers to one BO.

It is, but it's not possible with implicit sync. If you want parallel
write access to the same shared buffer, you _must_ carry around some
explicit fences. Within amdgpu you can use driver-specific cookies; for
shared buffers we now have sync_file. But multiple writers with implicit
sync simply fundamentally don't work, because you have no idea which
writer is touching the same subrange you want to touch.
-Daniel
Christian König Sept. 22, 2016, 10:55 a.m. UTC | #22
Am 22.09.2016 um 08:36 schrieb Daniel Vetter:
> On Wed, Sep 21, 2016 at 06:23:35PM +0200, Christian König wrote:
>> For a quick workaround I suggest to just serialize all accesses to BO shared
>> with different drivers, but essentially I think it is a perfectly valid
>> requirement to have multiple writers to one BO.
> It is, but it's not possible with implicit sync. If you want parallel
> write access to the same shared buffer, you _must_ carry around some
> explicit fences. Within amdgpu you can use driver-specific cookies, for
> shared buffer we now have sync_file. But multiple writers with implicit
> sync simply fundamentally doesn't work. Because you have no idea with which
> writer, touching the same subrange you want to touch.

You don't need to separate the BO into subranges touched by different 
engines in order to allow multiple writers.

AMD hardware and I'm pretty sure others as well are perfectly capable of 
writing to the same memory from multiple engines and even multiple GPUs 
at the same time.

For a good hint of what is possible see the public AMD ISA documentation 
about atomic operations, but that is only the start of it.


The crux here is that we need to assume that we will have implicit and 
explicit sync mixed for backward compatibility.

This implies that we need some mechanism like amdgpu uses in its sync 
implementation, where every fence is associated with an owner which 
denotes the domain in which implicit sync happens. If you leave this 
domain you will automatically run into explicit sync.

Currently we define the borders of this domain in amdgpu on process 
boundary to keep things like DRI2/DRI3 working as expected.

I really don't see how you want to solve this with a single explicit 
fence for each reservation object. As long as you have multiple 
concurrently running operations accessing the same buffer you need to 
keep one fence for each operation no matter what.

Regards,
Christian.

> -Daniel
Daniel Vetter Sept. 22, 2016, 12:26 p.m. UTC | #23
On Thu, Sep 22, 2016 at 12:55 PM, Christian König
<deathsimple@vodafone.de> wrote:
> Am 22.09.2016 um 08:36 schrieb Daniel Vetter:
>>
>> On Wed, Sep 21, 2016 at 06:23:35PM +0200, Christian König wrote:
>>>
>>> For a quick workaround I suggest to just serialize all accesses to BO
>>> shared
>>> with different drivers, but essentially I think it is a perfectly valid
>>> requirement to have multiple writers to one BO.
>>
>> It is, but it's not possible with implicit sync. If you want parallel
>> write access to the same shared buffer, you _must_ carry around some
>> explicit fences. Within amdgpu you can use driver-specific cookies, for
>> shared buffer we now have sync_file. But multiple writers with implicit
>> sync simply fundamentally doesn't work. Because you have no idea with
>> which
>> writer, touching the same subrange you want to touch.
>
>
> You don't need to separate the BO into subranges which are touched by
> different engines for allowing multiple writers.
>
> AMD hardware and I'm pretty sure others as well are perfectly capable of
> writing to the same memory from multiple engines and even multiple GPUs at
> the same time.
>
> For a good hint of what is possible see the public AMD ISA documentation
> about atomic operations, but that is only the start of it.
>
>
> The crux here is that we need to assume that we will have implicit and
> explicit sync mixed for backward compatibility.
>
> This implies that we need some mechanism like amdgpu uses in it's sync
> implementation where every fence is associated with an owner which denotes
> the domain in which implicit sync happens. If you leave this domain you will
> automatically run into explicit sync.
>
> Currently we define the borders of this domain in amdgpu on process boundary
> to keep things like DRI2/DRI3 working as expected.
>
> I really don't see how you want to solve this with a single explicit fence
> for each reservation object. As long as you have multiple concurrently
> running operations accessing the same buffer you need to keep one fence for
> each operation no matter what.

I can't make sense of what you're saying, and I suspect we attach
different meanings to the same words. So let me define terms here:

- implicit fencing: Userspace does not track reads/writes to buffers;
only the kernel does that. This is the assumption DRI2/3 makes.
Since synchronization is by necessity on a per-buffer level you can
only have 1 writer. In the kernel the cross-driver interface for this
is struct reservation_object attached to dma-bufs. If you don't fill
out/wait for the exclusive fence in there, your driver is _not_
doing (cross-device) implicit fencing.

- explicit fencing: Userspace passes around distinct fence objects for
any work going on on the gpu. The kernel doesn't insert any stall of
its own (except for moving buffer objects around ofc). This is what
Android does. This also seems to be what amdgpu is doing within one
process/owner.

Given that, I'm not sure what you mean by "one explicit fence per
reservation_object", since explicit fencing should not attach anything
(at least not any exclusive fences) to a reservation_object. It sounds
a bit like you have the meaning the other way round: as if explicit
fencing meant the kernel explicitly takes care of fencing, when it is
actually "explicit" in the sense of explicit fences visible to userspace.
-Daniel
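
To make the implicit-fencing contract above concrete, here is a hedged
sketch against the reservation_object interface of this era. Locking,
error handling, and the fact that real drivers schedule against these
fences rather than blocking are all elided; "job_fence" stands for
whatever fence the driver's scheduler returns for the work.

#include <linux/reservation.h>
#include <linux/fence.h>
#include <linux/sched.h>

/* Writer: wait for everything already queued on the buffer, then
 * publish the job's fence in the exclusive slot. */
static void implicit_write(struct reservation_object *resv,
			   struct fence *job_fence)
{
	reservation_object_wait_timeout_rcu(resv, true /* wait_all */,
					    false, MAX_SCHEDULE_TIMEOUT);
	reservation_object_add_excl_fence(resv, job_fence);
}

/* Reader: wait only for the exclusive (write) fence, then add the
 * job's fence as one of many shared fences. */
static int implicit_read(struct reservation_object *resv,
			 struct fence *job_fence)
{
	int ret;

	reservation_object_wait_timeout_rcu(resv, false /* excl only */,
					    false, MAX_SCHEDULE_TIMEOUT);

	ret = reservation_object_reserve_shared(resv);
	if (ret)
		return ret;
	reservation_object_add_shared_fence(resv, job_fence);
	return 0;
}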
Christian König Sept. 22, 2016, 12:44 p.m. UTC | #24
Am 22.09.2016 um 14:26 schrieb Daniel Vetter:
> On Thu, Sep 22, 2016 at 12:55 PM, Christian König
> <deathsimple@vodafone.de> wrote:
>> Am 22.09.2016 um 08:36 schrieb Daniel Vetter:
>>> On Wed, Sep 21, 2016 at 06:23:35PM +0200, Christian König wrote:
>>>> For a quick workaround I suggest to just serialize all accesses to BO
>>>> shared
>>>> with different drivers, but essentially I think it is a perfectly valid
>>>> requirement to have multiple writers to one BO.
>>> It is, but it's not possible with implicit sync. If you want parallel
>>> write access to the same shared buffer, you _must_ carry around some
>>> explicit fences. Within amdgpu you can use driver-specific cookies, for
>>> shared buffer we now have sync_file. But multiple writers with implicit
>>> sync simply fundamentally doesn't work. Because you have no idea with
>>> which
>>> writer, touching the same subrange you want to touch.
>>
>> You don't need to separate the BO into subranges which are touched by
>> different engines for allowing multiple writers.
>>
>> AMD hardware and I'm pretty sure others as well are perfectly capable of
>> writing to the same memory from multiple engines and even multiple GPUs at
>> the same time.
>>
>> For a good hint of what is possible see the public AMD ISA documentation
>> about atomic operations, but that is only the start of it.
>>
>>
>> The crux here is that we need to assume that we will have implicit and
>> explicit sync mixed for backward compatibility.
>>
>> This implies that we need some mechanism like amdgpu uses in it's sync
>> implementation where every fence is associated with an owner which denotes
>> the domain in which implicit sync happens. If you leave this domain you will
>> automatically run into explicit sync.
>>
>> Currently we define the borders of this domain in amdgpu on process boundary
>> to keep things like DRI2/DRI3 working as expected.
>>
>> I really don't see how you want to solve this with a single explicit fence
>> for each reservation object. As long as you have multiple concurrently
>> running operations accessing the same buffer you need to keep one fence for
>> each operation no matter what.
> I can't make sense of what you're saying, and I suspect we put
> different meaning to different words. So let me define here:
>
> - implicit fencing: Userspace does not track read/writes to buffers,
> but only the kernel does that. This is the assumption DRI2/3 has.
> Since synchronization is by necessity on a per-buffer level you can
> only have 1 writer. In the kernel the cross-driver interface for this
> is struct reservation_object attached to dma-bufs. If you don't fill
> out/wait for the exclusive fence in there, you're driver is _not_
> doing (cross-device) implicit fencing.

I can confirm that my understanding of implicit fencing is exactly the 
same as yours.

> - explicit fencing: Userspace passes around distinct fence objects for
> any work going on on the gpu. The kernel doesn't insert any stall of
> it's own (except for moving buffer objects around ofc). This is what
> Android. This also seems to be what amdgpu is doing within one
> process/owner.

No, that is clearly not my understanding of explicit fencing.

Userspace doesn't necessarily need to pass around distinct fence objects 
with all of its protocols, and the kernel is still responsible for 
inserting stalls whenever a userspace protocol or application requires 
these semantics.

Otherwise you will never be able to use explicit fencing on the Linux 
desktop with protocols like DRI2/DRI3.

I would expect that every driver in the system waits for all fences of a 
reservation object as long as it isn't told otherwise by providing a 
distinct fence object with the IOCTL in question.

Regards,
Christian.
Daniel Vetter Sept. 22, 2016, 1:05 p.m. UTC | #25
On Thu, Sep 22, 2016 at 2:44 PM, Christian König
<deathsimple@vodafone.de> wrote:
>> - explicit fencing: Userspace passes around distinct fence objects for
>> any work going on on the gpu. The kernel doesn't insert any stall of
>> it's own (except for moving buffer objects around ofc). This is what
>> Android. This also seems to be what amdgpu is doing within one
>> process/owner.
>
>
> No, that is clearly not my understanding of explicit fencing.
>
> Userspace doesn't necessarily need to pass around distinct fence objects
> with all of it's protocols and the kernel is still responsible for inserting
> stalls whenever an userspace protocol or application requires this
> semantics.
>
> Otherwise you will never be able to use explicit fencing on the Linux
> desktop with protocols like DRI2/DRI3.

This is about mixing them. Explicit fencing still means userspace has
an explicit piece, separate from buffers (either a sync_file fd, a
driver-specific cookie, or similar).

> I would expect that every driver in the system waits for all fences of a
> reservation object as long as it isn't told otherwise by providing a
> distinct fence object with the IOCTL in question.

Yup agreed. This way if your explicitly-fencing driver reads a shared
buffer passed over a protocol that does implicit fencing (like
DRI2/3), then it will work.

The other interop direction is an explicitly-fencing driver passing a
buffer to a consumer which expects implicit fencing. In that case you
must attach the right fence to the exclusive slot, but _only_ in that
case. Otherwise you end up stalling your explicitly-fencing userspace,
since implicit fencing doesn't allow more than 1 writer. For amdgpu
one possible way to implement this might be to count how many users a
dma-buf has, and if it's more than just the current context set the
exclusive fence. Or do an uabi revision and let userspace decide (or
at least overwrite it).

But the current approach in amdgpu_sync.c of declaring a fence as
exclusive after the fact (if owners don't match) just isn't how
reservation_object works. You can of course change that, but that
means you must change all drivers implementing support for implicit
fencing of dma-buf. Fixing amdgpu will be easier ;-)
-Daniel
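
A hedged sketch of the counting heuristic suggested above. The importer
count bookkeeping and the helper name are hypothetical, and the
reservation_object_reserve_shared() call plus locking are elided:

#include <linux/reservation.h>
#include <linux/fence.h>

/* At command-submission time, make the write visible to implicit-sync
 * consumers only when the dma-buf actually has importers besides the
 * submitting context; otherwise keep it a shared fence so engines of
 * the same context can keep running in parallel. */
static void publish_cs_fence(struct reservation_object *resv,
			     struct fence *job_fence,
			     unsigned int other_importers)
{
	if (other_importers > 0)
		reservation_object_add_excl_fence(resv, job_fence);
	else
		reservation_object_add_shared_fence(resv, job_fence);
}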
Christian König Sept. 22, 2016, 1:22 p.m. UTC | #26
Am 22.09.2016 um 15:05 schrieb Daniel Vetter:
> On Thu, Sep 22, 2016 at 2:44 PM, Christian König
> <deathsimple@vodafone.de> wrote:
>>> - explicit fencing: Userspace passes around distinct fence objects for
>>> any work going on on the gpu. The kernel doesn't insert any stall of
>>> it's own (except for moving buffer objects around ofc). This is what
>>> Android. This also seems to be what amdgpu is doing within one
>>> process/owner.
>>
>> No, that is clearly not my understanding of explicit fencing.
>>
>> Userspace doesn't necessarily need to pass around distinct fence objects
>> with all of it's protocols and the kernel is still responsible for inserting
>> stalls whenever an userspace protocol or application requires this
>> semantics.
>>
>> Otherwise you will never be able to use explicit fencing on the Linux
>> desktop with protocols like DRI2/DRI3.
> This is about mixing them. Explicit fencing still means userspace has
> an explicit piece, separate from buffers, (either sync_file fd, or a
> driver-specific cookie, or similar).
>
>> I would expect that every driver in the system waits for all fences of a
>> reservation object as long as it isn't told otherwise by providing a
>> distinct fence object with the IOCTL in question.
> Yup agreed. This way if your explicitly-fencing driver reads a shared
> buffer passed over a protocol that does implicit fencing (like
> DRI2/3), then it will work.
>
> The other interop direction is explicitly-fencing driver passes a
> buffer to a consumer which expects implicit fencing. In that case you
> must attach the right fence to the exclusive slot, but _only_ in that
> case.

Ok, well, it sounds like you are close to understanding why I can't do 
exactly this: there simply is no right fence I could attach.

When amdgpu makes the command submissions it doesn't necessarily know 
that the buffer will be exported and shared with another device later on.

So when the buffer is exported and given to the other device you might 
have a whole bunch of fences which run concurrently and not in any 
serial order.

> Otherwise you end up stalling your explicitly-fencing userspace,
> since implicit fencing doesn't allow more than 1 writer. For amdgpu
> one possible way to implement this might be to count how many users a
> dma-buf has, and if it's more than just the current context set the
> exclusive fence. Or do an uabi revision and let userspace decide (or
> at least overwrite it).

I mean I can pick one fence and wait for the rest to finish manually, 
but that would certainly defeat the whole effort, wouldn't it?

I completely agree that you have only 1 writer with implicit fencing, 
but when you switch from explicit fencing back to implicit fencing you 
can have multiple ones.

> But the current approach in amdgpu_sync.c of declaring a fence as
> exclusive after the fact (if owners don't match) just isn't how
> reservation_object works. You can of course change that, but that
> means you must change all drivers implementing support for implicit
> fencing of dma-buf. Fixing amdgpu will be easier ;-)

Well as far as I can see there is no way I can fix amdgpu in this case.

The handling clearly needs to be changed on the receiving side of the 
reservation objects if I don't want to completely disable concurrent 
access to BOs in amdgpu.

Regards,
Christian.

> -Daniel
Michel Dänzer Sept. 23, 2016, 10 a.m. UTC | #27
On 22/09/16 10:22 PM, Christian König wrote:
> Am 22.09.2016 um 15:05 schrieb Daniel Vetter:
>> On Thu, Sep 22, 2016 at 2:44 PM, Christian König
>> <deathsimple@vodafone.de> wrote:
>>>> - explicit fencing: Userspace passes around distinct fence objects for
>>>> any work going on on the gpu. The kernel doesn't insert any stall of
>>>> it's own (except for moving buffer objects around ofc). This is what
>>>> Android. This also seems to be what amdgpu is doing within one
>>>> process/owner.
>>>
>>> No, that is clearly not my understanding of explicit fencing.
>>>
>>> Userspace doesn't necessarily need to pass around distinct fence objects
>>> with all of it's protocols and the kernel is still responsible for
>>> inserting
>>> stalls whenever an userspace protocol or application requires this
>>> semantics.
>>>
>>> Otherwise you will never be able to use explicit fencing on the Linux
>>> desktop with protocols like DRI2/DRI3.
>> This is about mixing them. Explicit fencing still means userspace has
>> an explicit piece, separate from buffers, (either sync_file fd, or a
>> driver-specific cookie, or similar).
>>
>>> I would expect that every driver in the system waits for all fences of a
>>> reservation object as long as it isn't told otherwise by providing a
>>> distinct fence object with the IOCTL in question.
>> Yup agreed. This way if your explicitly-fencing driver reads a shared
>> buffer passed over a protocol that does implicit fencing (like
>> DRI2/3), then it will work.
>>
>> The other interop direction is explicitly-fencing driver passes a
>> buffer to a consumer which expects implicit fencing. In that case you
>> must attach the right fence to the exclusive slot, but _only_ in that
>> case.
> 
> Ok well sounds like you are close to understand why I can't do exactly
> this: There simply is no right fence I could attach.
> 
> When amdgpu makes the command submissions it doesn't necessarily know
> that the buffer will be exported and shared with another device later on.
> 
> So when the buffer is exported and given to the other device you might
> have a whole bunch of fences which run concurrently and not in any
> serial order.

I feel like you're thinking too much of buffers shared between GPUs as
being short-lived and only shared late. In the use-cases I know about,
shared buffers are created separately and shared ahead of time, the
actual rendering work is done to non-shared buffers and then just copied
to the shared buffers for transfer between GPUs. These copies are always
performed by the same context, in such a way that they should always run
on the same HW engine and thus be implicitly serialized.

Do you have any specific use-cases in mind where buffers are only shared
between GPUs after the rendering operations creating the buffer contents
to be shared have already been submitted?


>> Otherwise you end up stalling your explicitly-fencing userspace,
>> since implicit fencing doesn't allow more than 1 writer. For amdgpu
>> one possible way to implement this might be to count how many users a
>> dma-buf has, and if it's more than just the current context set the
>> exclusive fence. Or do an uabi revision and let userspace decide (or
>> at least overwrite it).
> 
> I mean I can pick one fence and wait for the rest to finish manually,
> but that would certainly defeat the whole effort, doesn't it?

I'm afraid it's not clear to me why it would. Can you elaborate?


>> But the current approach in amdgpu_sync.c of declaring a fence as
>> exclusive after the fact (if owners don't match) just isn't how
>> reservation_object works. You can of course change that, but that
>> means you must change all drivers implementing support for implicit
>> fencing of dma-buf. Fixing amdgpu will be easier ;-)
> 
> Well as far as I can see there is no way I can fix amdgpu in this case.
> 
> The handling clearly needs to be changed on the receiving side of the
> reservation objects if I don't completely want to disable concurrent
> access to BOs in amdgpu.

Anyway, we need a solution for this between radeon and amdgpu, and I
don't think a solution which involves those drivers using reservation
object semantics between them which are different from all other drivers
is a good idea.
Daniel Vetter Sept. 23, 2016, 12:09 p.m. UTC | #28
On Fri, Sep 23, 2016 at 07:00:25PM +0900, Michel Dänzer wrote:
> On 22/09/16 10:22 PM, Christian König wrote:
> > Am 22.09.2016 um 15:05 schrieb Daniel Vetter:
> >> On Thu, Sep 22, 2016 at 2:44 PM, Christian König
> >> <deathsimple@vodafone.de> wrote:
> >>>> - explicit fencing: Userspace passes around distinct fence objects for
> >>>> any work going on on the gpu. The kernel doesn't insert any stall of
> >>>> it's own (except for moving buffer objects around ofc). This is what
> >>>> Android. This also seems to be what amdgpu is doing within one
> >>>> process/owner.
> >>>
> >>> No, that is clearly not my understanding of explicit fencing.
> >>>
> >>> Userspace doesn't necessarily need to pass around distinct fence objects
> >>> with all of it's protocols and the kernel is still responsible for
> >>> inserting
> >>> stalls whenever an userspace protocol or application requires this
> >>> semantics.
> >>>
> >>> Otherwise you will never be able to use explicit fencing on the Linux
> >>> desktop with protocols like DRI2/DRI3.
> >> This is about mixing them. Explicit fencing still means userspace has
> >> an explicit piece, separate from buffers, (either sync_file fd, or a
> >> driver-specific cookie, or similar).
> >>
> >>> I would expect that every driver in the system waits for all fences of a
> >>> reservation object as long as it isn't told otherwise by providing a
> >>> distinct fence object with the IOCTL in question.
> >> Yup agreed. This way if your explicitly-fencing driver reads a shared
> >> buffer passed over a protocol that does implicit fencing (like
> >> DRI2/3), then it will work.
> >>
> >> The other interop direction is explicitly-fencing driver passes a
> >> buffer to a consumer which expects implicit fencing. In that case you
> >> must attach the right fence to the exclusive slot, but _only_ in that
> >> case.
> > 
> > Ok well sounds like you are close to understand why I can't do exactly
> > this: There simply is no right fence I could attach.
> > 
> > When amdgpu makes the command submissions it doesn't necessarily know
> > that the buffer will be exported and shared with another device later on.
> > 
> > So when the buffer is exported and given to the other device you might
> > have a whole bunch of fences which run concurrently and not in any
> > serial order.
> 
> I feel like you're thinking too much of buffers shared between GPUs as
> being short-lived and only shared late. In the use-cases I know about,
> shared buffers are created separately and shared ahead of time, the
> actual rendering work is done to non-shared buffers and then just copied
> to the shared buffers for transfer between GPUs. These copies are always
> performed by the same context in such a way that they should always be
> performed by the same HW engine and thus implicitly serialized.
> 
> Do you have any specific use-cases in mind where buffers are only shared
> between GPUs after the rendering operations creating the buffer contents
> to be shared have already been submitted?

Yeah, it should be known which of the buffers you use (at least in
userspace, maybe not in the kernel) need implicit fencing. At least
with DRI2/3 it's really obvious which buffers are shared. Same holds for
external images and other imported buffers.

Yes that means you need to keep track of a few things in userspace, and
you need to add a special flag to CS to make sure the kernel does set the
exclusive fence.

> >> Otherwise you end up stalling your explicitly-fencing userspace,
> >> since implicit fencing doesn't allow more than 1 writer. For amdgpu
> >> one possible way to implement this might be to count how many users a
> >> dma-buf has, and if it's more than just the current context set the
> >> exclusive fence. Or do an uabi revision and let userspace decide (or
> >> at least overwrite it).
> > 
> > I mean I can pick one fence and wait for the rest to finish manually,
> > but that would certainly defeat the whole effort, doesn't it?
> 
> I'm afraid it's not clear to me why it would. Can you elaborate?
> 
> 
> >> But the current approach in amdgpu_sync.c of declaring a fence as
> >> exclusive after the fact (if owners don't match) just isn't how
> >> reservation_object works. You can of course change that, but that
> >> means you must change all drivers implementing support for implicit
> >> fencing of dma-buf. Fixing amdgpu will be easier ;-)
> > 
> > Well as far as I can see there is no way I can fix amdgpu in this case.
> > 
> > The handling clearly needs to be changed on the receiving side of the
> > reservation objects if I don't completely want to disable concurrent
> > access to BOs in amdgpu.
> 
> Anyway, we need a solution for this between radeon and amdgpu, and I
> don't think a solution which involves those drivers using reservation
> object semantics between them which are different from all other drivers
> is a good idea.

Afaik there are also amd+intel machines out there, so really the only option
is either to fix amdgpu to correctly set exclusive fences on shared
buffers (with the help of userspace hints), or to change all the existing
drivers. No idea which is simpler to do, since I don't know the amdgpu
userspace.
-Daniel
Michel Dänzer Sept. 26, 2016, 12:48 a.m. UTC | #29
On 23/09/16 09:09 PM, Daniel Vetter wrote:
> On Fri, Sep 23, 2016 at 07:00:25PM +0900, Michel Dänzer wrote:
>> On 22/09/16 10:22 PM, Christian König wrote:
>>> Am 22.09.2016 um 15:05 schrieb Daniel Vetter:
>>>> 
>>>> But the current approach in amdgpu_sync.c of declaring a fence as
>>>> exclusive after the fact (if owners don't match) just isn't how
>>>> reservation_object works. You can of course change that, but that
>>>> means you must change all drivers implementing support for implicit
>>>> fencing of dma-buf. Fixing amdgpu will be easier ;-)
>>>
>>> Well as far as I can see there is no way I can fix amdgpu in this case.
>>>
>>> The handling clearly needs to be changed on the receiving side of the
>>> reservation objects if I don't completely want to disable concurrent
>>> access to BOs in amdgpu.
>>
>> Anyway, we need a solution for this between radeon and amdgpu, and I
>> don't think a solution which involves those drivers using reservation
>> object semantics between them which are different from all other drivers
>> is a good idea.
> 
> Afaik there's also amd+intel machines out there,

Sure, what I meant was that even if we didn't care about those (which we
do), we'd still need a solution between our own drivers.


> so really the only option is to either fix amdgpu to correctly set
> exclusive fences on shared buffers (with the help of userspace hints).
> Or change all the existing drivers.

I got some fresh perspective on the problem while taking a walk, and I'm
now fairly convinced that neither amdgpu userspace nor other drivers
need to be modified:

It occurred to me that all the information we need is already there in
the exclusive and shared fences set by amdgpu. We just need to use it
differently to match the expectations of other drivers.

We should be able to set some sort of "pseudo" fence as the exclusive
fence of the reservation object. When we are asked by another driver to
wait for this fence to signal, we take the current "real" exclusive
fence (which we can keep track of e.g. in our BO struct) and shared
fences, and wait for all of those to signal; then we mark the "pseudo"
exclusive fence as signalled.

Am I missing anything which would prevent this from working?
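
A hedged sketch of how such a "pseudo" exclusive fence could signal. The
struct and helper names are illustrative, the fence_ops boilerplate needed
to actually create the proxy fence is elided, and nothing here is existing
amdgpu code:

#include <linux/kernel.h>
#include <linux/workqueue.h>
#include <linux/fence.h>

struct pseudo_excl_work {
	struct work_struct work;
	struct fence *proxy;		/* fence placed in the exclusive slot */
	struct fence *real_excl;	/* the BO's real exclusive fence, if any */
	struct fence **shared;		/* snapshot of the BO's shared fences */
	unsigned int shared_count;
};

static void pseudo_excl_work_func(struct work_struct *work)
{
	struct pseudo_excl_work *w =
		container_of(work, struct pseudo_excl_work, work);
	unsigned int i;

	/* Wait for all of the real fences tracked for this BO ... */
	if (w->real_excl)
		fence_wait(w->real_excl, false);
	for (i = 0; i < w->shared_count; i++)
		fence_wait(w->shared[i], false);

	/* ... then let cross-driver consumers of the exclusive slot proceed. */
	fence_signal(w->proxy);
}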
Daniel Vetter Sept. 26, 2016, 8:04 a.m. UTC | #30
On Mon, Sep 26, 2016 at 09:48:37AM +0900, Michel Dänzer wrote:
> On 23/09/16 09:09 PM, Daniel Vetter wrote:
> > On Fri, Sep 23, 2016 at 07:00:25PM +0900, Michel Dänzer wrote:
> >> On 22/09/16 10:22 PM, Christian König wrote:
> >>> Am 22.09.2016 um 15:05 schrieb Daniel Vetter:
> >>>> 
> >>>> But the current approach in amdgpu_sync.c of declaring a fence as
> >>>> exclusive after the fact (if owners don't match) just isn't how
> >>>> reservation_object works. You can of course change that, but that
> >>>> means you must change all drivers implementing support for implicit
> >>>> fencing of dma-buf. Fixing amdgpu will be easier ;-)
> >>>
> >>> Well as far as I can see there is no way I can fix amdgpu in this case.
> >>>
> >>> The handling clearly needs to be changed on the receiving side of the
> >>> reservation objects if I don't completely want to disable concurrent
> >>> access to BOs in amdgpu.
> >>
> >> Anyway, we need a solution for this between radeon and amdgpu, and I
> >> don't think a solution which involves those drivers using reservation
> >> object semantics between them which are different from all other drivers
> >> is a good idea.
> > 
> > Afaik there's also amd+intel machines out there,
> 
> Sure, what I meant was that even if we didn't care about those (which we
> do), we'd still need a solution between our own drivers.
> 
> 
> > so really the only option is to either fix amdgpu to correctly set
> > exclusive fences on shared buffers (with the help of userspace hints).
> > Or change all the existing drivers.
> 
> I got some fresh perspective on the problem while taking a walk, and I'm
> now fairly convinced that neither amdgpu userspace nor other drivers
> need to be modified:
> 
> It occurred to me that all the information we need is already there in
> the exclusive and shared fences set by amdgpu. We just need to use it
> differently to match the expectations of other drivers.
> 
> We should be able to set some sort of "pseudo" fence as the exclusive
> fence of the reservation object. When we are asked by another driver to
> wait for this fence to signal, we take the current "real" exclusive
> fence (which we can keep track of e.g. in our BO struct) and shared
> fences, and wait for all of those to signal; then we mark the "pseudo"
> exclusive fence as signalled.
> 
> Am I missing anything which would prevent this from working?

One thing to make sure is that you don't change the (real, privately stored)
fences you're waiting on over the lifetime of the public exclusive fence.
That might lead to some hilarity wrt potentially creating fence dependency
loops. But I think as long as you guarantee that the private internal
fences are always amdgpu ones, and never anything imported from a
different driver, even that should be fine. Not because this would break
the loops, but because amdgpu has a hangcheck, so it can still guarantee that
the fence eventually fires even if there is a loop.
-Daniel
Mike Lothian Oct. 7, 2016, 12:34 p.m. UTC | #31
Hi

This discussion has gone a little quiet

Was there any agreement about what needed doing to get this working
for i965/amdgpu?

Regards

Mike

On Mon, 26 Sep 2016 at 09:04 Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Mon, Sep 26, 2016 at 09:48:37AM +0900, Michel Dänzer wrote:
> > On 23/09/16 09:09 PM, Daniel Vetter wrote:
> > > On Fri, Sep 23, 2016 at 07:00:25PM +0900, Michel Dänzer wrote:
> > >> On 22/09/16 10:22 PM, Christian König wrote:
> > >>> Am 22.09.2016 um 15:05 schrieb Daniel Vetter:
> > >>>>
> > >>>> But the current approach in amdgpu_sync.c of declaring a fence as
> > >>>> exclusive after the fact (if owners don't match) just isn't how
> > >>>> reservation_object works. You can of course change that, but that
> > >>>> means you must change all drivers implementing support for implicit
> > >>>> fencing of dma-buf. Fixing amdgpu will be easier ;-)
> > >>>
> > >>> Well as far as I can see there is no way I can fix amdgpu in this case.
> > >>>
> > >>> The handling clearly needs to be changed on the receiving side of the
> > >>> reservation objects if I don't completely want to disable concurrent
> > >>> access to BOs in amdgpu.
> > >>
> > >> Anyway, we need a solution for this between radeon and amdgpu, and I
> > >> don't think a solution which involves those drivers using reservation
> > >> object semantics between them which are different from all other drivers
> > >> is a good idea.
> > >
> > > Afaik there's also amd+intel machines out there,
> >
> > Sure, what I meant was that even if we didn't care about those (which we
> > do), we'd still need a solution between our own drivers.
> >
> >
> > > so really the only option is to either fix amdgpu to correctly set
> > > exclusive fences on shared buffers (with the help of userspace hints).
> > > Or change all the existing drivers.
> >
> > I got some fresh perspective on the problem while taking a walk, and I'm
> > now fairly convinced that neither amdgpu userspace nor other drivers
> > need to be modified:
> >
> > It occurred to me that all the information we need is already there in
> > the exclusive and shared fences set by amdgpu. We just need to use it
> > differently to match the expectations of other drivers.
> >
> > We should be able to set some sort of "pseudo" fence as the exclusive
> > fence of the reservation object. When we are asked by another driver to
> > wait for this fence to signal, we take the current "real" exclusive
> > fence (which we can keep track of e.g. in our BO struct) and shared
> > fences, and wait for all of those to signal; then we mark the "pseudo"
> > exclusive fence as signalled.
> >
> > Am I missing anything which would prevent this from working?
>
> One thing to make sure is that you don't change the (real, private stored)
> fences you're waiting on over the lifetime of the public exclusive fence.
> That might lead to some hilarity wrt potentially creating fence depency
> loops. But I think as long as you guarantee that the private internal
> fences are always amdgpu ones, and never anything imported from a
> different driver even that should be fine. Not because this would break
> the loops, but since amgpud has a hangcheck it can still gurantee that the
> fence eventually fires even if there is a loop.
> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
Michel Dänzer Oct. 11, 2016, 3:58 a.m. UTC | #32
On 07/10/16 09:34 PM, Mike Lothian wrote:
> 
> This has discussion has gone a little quiet
> 
> Was there any agreement about what needed doing to get this working
> for i965/amdgpu?

Christian, do you see anything which could prevent the solution I
outlined from working?
Christian König Oct. 11, 2016, 12:04 p.m. UTC | #33
Am 11.10.2016 um 05:58 schrieb Michel Dänzer:
> On 07/10/16 09:34 PM, Mike Lothian wrote:
>> This has discussion has gone a little quiet
>>
>> Was there any agreement about what needed doing to get this working
>> for i965/amdgpu?
> Christian, do you see anything which could prevent the solution I
> outlined from working?

I thought about that approach as well, but unfortunately it also has a 
couple of downsides. In particular, keeping the exclusive fence set while 
we don't actually need it isn't really clean either.

I'm currently a bit busy with other tasks, so I've put Nayan on the road to 
getting a bit into the kernel driver (he asked for that anyway).

Implementing the simple workaround to sync when we export the DMA-buf 
should be something like 20 lines of code and fortunately Nayan has an 
I+A system and so can actually test it.

If it turns out to be more problematic or somebody really starts to need 
it I'm going to hack on that myself a bit more.

Regards,
Christian.
Michel Dänzer Oct. 12, 2016, 12:40 a.m. UTC | #34
On 11/10/16 09:04 PM, Christian König wrote:
> Am 11.10.2016 um 05:58 schrieb Michel Dänzer:
>> On 07/10/16 09:34 PM, Mike Lothian wrote:
>>> This has discussion has gone a little quiet
>>>
>>> Was there any agreement about what needed doing to get this working
>>> for i965/amdgpu?
>> Christian, do you see anything which could prevent the solution I
>> outlined from working?
> 
> I thought about that approach as well, but unfortunately it also has a
> couple of downsides. Especially keeping the exclusive fence set while we
> actually don't need it isn't really clean either.

I was wondering if it's possible to have a singleton pseudo exclusive
fence which is used for all BOs. That might keep the overhead acceptably
low.


> I'm currently a bit busy with other tasks and so put Nayan on a road to
> get a bit into the kernel driver (he asked for that anyway).
> 
> Implementing the simple workaround to sync when we export the DMA-buf
> should be something like 20 lines of code and fortunately Nayan has an
> I+A system and so can actually test it.
> 
> If it turns out to be more problematic or somebody really starts to need
> it I'm going to hack on that myself a bit more.

If you mean only syncing when a DMA-buf is exported, that's not enough,
as I explained before. The BOs being shared are long-lived, and
synchronization between GPUs is required for every command stream
submission.
Mike Lothian Oct. 27, 2016, 1:33 p.m. UTC | #35
Hi

Just another gentle ping to see where you are with this?

Cheers

Mike

On Wed, 12 Oct 2016 at 01:40 Michel Dänzer <michel@daenzer.net> wrote:

> On 11/10/16 09:04 PM, Christian König wrote:
> > Am 11.10.2016 um 05:58 schrieb Michel Dänzer:
> >> On 07/10/16 09:34 PM, Mike Lothian wrote:
> >>> This has discussion has gone a little quiet
> >>>
> >>> Was there any agreement about what needed doing to get this working
> >>> for i965/amdgpu?
> >> Christian, do you see anything which could prevent the solution I
> >> outlined from working?
> >
> > I thought about that approach as well, but unfortunately it also has a
> > couple of downsides. Especially keeping the exclusive fence set while we
> > actually don't need it isn't really clean either.
>
> I was wondering if it's possible to have a singleton pseudo exclusive
> fence which is used for all BOs. That might keep the overhead acceptably
> low.
>
>
> > I'm currently a bit busy with other tasks and so put Nayan on a road to
> > get a bit into the kernel driver (he asked for that anyway).
> >
> > Implementing the simple workaround to sync when we export the DMA-buf
> > should be something like 20 lines of code and fortunately Nayan has an
> > I+A system and so can actually test it.
> >
> > If it turns out to be more problematic or somebody really starts to need
> > it I'm going to hack on that myself a bit more.
>
> If you mean only syncing when a DMA-buf is exported, that's not enough,
> as I explained before. The BOs being shared are long-lived, and
> synchronization between GPUs is required for every command stream
> submission.
>
>
> --
> Earthling Michel Dänzer               |               http://www.amd.com
> Libre software enthusiast             |             Mesa and X developer
>
Michel Dänzer Oct. 28, 2016, 1:34 a.m. UTC | #36
On 27/10/16 10:33 PM, Mike Lothian wrote:
> 
> Just another gentle ping to see where you are with this?

I haven't had a chance to look into this any further.
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 922709b..4b74b96 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -12043,7 +12043,7 @@  static void intel_mmio_flip_work_func(struct work_struct *w)
 	/* For framebuffer backed by dmabuf, wait for fence */
 	resv = i915_gem_object_get_dmabuf_resv(obj);
 	if (resv)
-		WARN_ON(reservation_object_wait_timeout_rcu(resv, false, false,
+		WARN_ON(reservation_object_wait_timeout_rcu(resv, true, false,
 							    MAX_SCHEDULE_TIMEOUT) < 0);
 
 	intel_pipe_update_start(crtc);
@@ -14700,7 +14700,7 @@  intel_prepare_plane_fb(struct drm_plane *plane,
 	if (resv) {
 		long lret;
 
-		lret = reservation_object_wait_timeout_rcu(resv, false, true,
+		lret = reservation_object_wait_timeout_rcu(resv, true, true,
 							   MAX_SCHEDULE_TIMEOUT);
 		if (lret == -ERESTARTSYS)
 			return lret;
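
For reference, the boolean flipped from false to true in both hunks is the
wait_all parameter of the helper, whose signature in this kernel era is
roughly:

long reservation_object_wait_timeout_rcu(struct reservation_object *obj,
					 bool wait_all, bool intr,
					 unsigned long timeout);

With wait_all set, the call waits for the shared fences of the reservation
object in addition to the exclusive one, rather than for the exclusive fence
alone.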