| Message ID | 1473293683-6247-1-git-send-email-mario.kleiner.de@gmail.com (mailing list archive) |
|---|---|
| State | New, archived |
On Thu, Sep 08, 2016 at 02:14:43AM +0200, Mario Kleiner wrote: > amdgpu-kms uses shared fences for its prime exported dmabufs, > instead of an exclusive fence. Therefore we need to wait for > all fences of the dmabuf reservation object to prevent > unsynchronized rendering and flipping. No. Fix the root cause as this affects not just flips but copies - this implies that everybody using the resv object must wait for all fences. The resv object is not just used for prime, but all fencing, so this breaks the ability to schedule parallel operations across engines. -Chris
On 09/08/2016 08:30 AM, Chris Wilson wrote: > On Thu, Sep 08, 2016 at 02:14:43AM +0200, Mario Kleiner wrote: >> amdgpu-kms uses shared fences for its prime exported dmabufs, >> instead of an exclusive fence. Therefore we need to wait for >> all fences of the dmabuf reservation object to prevent >> unsynchronized rendering and flipping. > > No. Fix the root cause as this affects not just flips but copies - > this implies that everybody using the resv object must wait for all > fences. The resv object is not just used for prime, but all fencing, so > this breaks the ability to schedule parallel operations across engine. > -Chris > Ok. I think I now understand the difference, but let's check: The exclusive fence is essentially acting a bit like a write-lock, and the shared fences as read-locks? So you can have multiple readers but only one writer at a time? I.e.: A writer must wait for all fences before starting write access to a buffer, then attach the exclusive fence and signal it at the end of write access. E.g., write to renderbuffer, write to texture etc. Readers must wait for the exclusive fence, then attach a shared fence per reader and signal it at the end of read access? E.g., read from texture, fb, scanout? Is that correct? In that case we'd have a missing exclusive fence in amdgpu for the linear target dmabuf? Probably beyond my level of knowledge to fix this? thanks, -mario
On Thu, Sep 08, 2016 at 05:21:42PM +0200, Mario Kleiner wrote: > On 09/08/2016 08:30 AM, Chris Wilson wrote: > >On Thu, Sep 08, 2016 at 02:14:43AM +0200, Mario Kleiner wrote: > >>amdgpu-kms uses shared fences for its prime exported dmabufs, > >>instead of an exclusive fence. Therefore we need to wait for > >>all fences of the dmabuf reservation object to prevent > >>unsynchronized rendering and flipping. > > > >No. Fix the root cause as this affects not just flips but copies - > >this implies that everybody using the resv object must wait for all > >fences. The resv object is not just used for prime, but all fencing, so > >this breaks the ability to schedule parallel operations across engine. > >-Chris > > > > Ok. I think i now understand the difference, but let's check: The > exclusive fence is essentially acting a bit like a write-lock, and > the shared fences as readers-locks? So you can have multiple readers > but only one writer at a time? That's how we (i915.ko and I hope the rest of the world) are using them. In the model where there is just one reservation object on the GEM object, that reservation object is then shared between internal driver scheduling and external. We are reliant on being able to use buffers on multiple engines through the virtue of the shared fences, and to serialise after writes by waiting on the exclusive fence. (So we can have concurrent reads on the display engine, render engines and on the CPU - or alternatively an exclusive writer.) In the near future, i915 flips will wait on the common reservation object not only for dma-bufs, but also its own GEM objects. > Ie.: > > Writer must wait for all fences before starting write access to a > buffer, then attach the exclusive fence and signal it on end of > write access. E.g., write to renderbuffer, write to texture etc. Yes. > Readers must wait for exclusive fence, then attach a shared fence > per reader and signal it on end of read access? E.g., read from > texture, fb, scanout? Yes. > Is that correct? In that case we'd have a missing exclusive fence in > amdgpu for the linear target dmabuf? Probably beyond my level of > knowledge to fix this? i915.ko requires the client to mark which buffers are written to. In ttm, there are ttm_validate_buffer objects which mark whether they should be using shared or exclusive fences. Afaict, in amdgpu they are all set to shared; the relevant user interface seems to be amdgpu_bo_list_set(). -Chris
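A minimal sketch of the reader/writer convention described above, assuming the reservation_object API as it looked around Linux 4.8 (struct fence was later renamed to dma_fence, reservation_object to dma_resv). Real drivers hold the object's ww_mutex while attaching fences and usually turn the waits into GPU scheduler dependencies rather than blocking CPU waits; this only shows which fence slot each kind of access uses.

```c
#include <linux/fence.h>
#include <linux/reservation.h>
#include <linux/sched.h>

/* Writer: order against *all* previous work, then publish an exclusive fence. */
static void buffer_write_access(struct reservation_object *resv,
				struct fence *write_done)
{
	/* wait_all = true: wait for the exclusive fence and every shared fence */
	reservation_object_wait_timeout_rcu(resv, true, false,
					    MAX_SCHEDULE_TIMEOUT);
	reservation_object_add_excl_fence(resv, write_done);
}

/* Reader: order against the last writer only, then add a shared fence. */
static int buffer_read_access(struct reservation_object *resv,
			      struct fence *read_done)
{
	int ret;

	/* wait_all = false: only the exclusive (writer) fence */
	reservation_object_wait_timeout_rcu(resv, false, false,
					    MAX_SCHEDULE_TIMEOUT);

	ret = reservation_object_reserve_shared(resv);
	if (ret)
		return ret;
	reservation_object_add_shared_fence(resv, read_done);
	return 0;
}
```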
On 09/09/16 01:23 AM, Chris Wilson wrote: > On Thu, Sep 08, 2016 at 05:21:42PM +0200, Mario Kleiner wrote: >> On 09/08/2016 08:30 AM, Chris Wilson wrote: >>> On Thu, Sep 08, 2016 at 02:14:43AM +0200, Mario Kleiner wrote: >>>> amdgpu-kms uses shared fences for its prime exported dmabufs, >>>> instead of an exclusive fence. Therefore we need to wait for >>>> all fences of the dmabuf reservation object to prevent >>>> unsynchronized rendering and flipping. >>> >>> No. Fix the root cause as this affects not just flips but copies - >>> this implies that everybody using the resv object must wait for all >>> fences. The resv object is not just used for prime, but all fencing, so >>> this breaks the ability to schedule parallel operations across engine. >>> -Chris >>> >> >> Ok. I think i now understand the difference, but let's check: The >> exclusive fence is essentially acting a bit like a write-lock, and >> the shared fences as readers-locks? So you can have multiple readers >> but only one writer at a time? > > That's how we (i915.ko and I hope the rest of the world) are using them. > In the model where here is just one reservation object on the GEM > object, that reservation object is then shared between internal driver > scheduling and external. We are reliant on being able to use buffers on > multiple engines through the virtue of the shared fences, and to serialise > after writes by waiting on the exclusive fence. (So we can have concurrent > reads on the display engine, render engines and on the CPU - or > alternatively an exclusive writer.) > > In the near future, i915 flips will wait on the common reservation object > not only for dma-bufs, but also its own GEM objects. > >> Ie.: >> >> Writer must wait for all fences before starting write access to a >> buffer, then attach the exclusive fence and signal it on end of >> write access. E.g., write to renderbuffer, write to texture etc. > > Yes. > >> Readers must wait for exclusive fence, then attach a shared fence >> per reader and signal it on end of read access? E.g., read from >> texture, fb, scanout? > > Yes. > >> Is that correct? In that case we'd have a missing exclusive fence in >> amdgpu for the linear target dmabuf? Probably beyond my level of >> knowledge to fix this? > > i915.ko requires the client to mark which buffers are written to. > > In ttm, there are ttm_validate_buffer objects which mark whether they > should be using shared or exclusive fences. Afaict, in amdgpu they are > all set to shared, the relevant user interface seems to be > amdgpu_bo_list_set(). This all makes sense to me. Christian, why is amdgpu setting only shared fences? Can we fix that?
Am 09.09.2016 um 03:15 schrieb Michel Dänzer: > On 09/09/16 01:23 AM, Chris Wilson wrote: >> On Thu, Sep 08, 2016 at 05:21:42PM +0200, Mario Kleiner wrote: >>> On 09/08/2016 08:30 AM, Chris Wilson wrote: >>>> On Thu, Sep 08, 2016 at 02:14:43AM +0200, Mario Kleiner wrote: >>>>> amdgpu-kms uses shared fences for its prime exported dmabufs, >>>>> instead of an exclusive fence. Therefore we need to wait for >>>>> all fences of the dmabuf reservation object to prevent >>>>> unsynchronized rendering and flipping. >>>> No. Fix the root cause as this affects not just flips but copies - >>>> this implies that everybody using the resv object must wait for all >>>> fences. The resv object is not just used for prime, but all fencing, so >>>> this breaks the ability to schedule parallel operations across engine. >>>> -Chris >>>> >>> Ok. I think i now understand the difference, but let's check: The >>> exclusive fence is essentially acting a bit like a write-lock, and >>> the shared fences as readers-locks? So you can have multiple readers >>> but only one writer at a time? >> That's how we (i915.ko and I hope the rest of the world) are using them. >> In the model where here is just one reservation object on the GEM >> object, that reservation object is then shared between internal driver >> scheduling and external. We are reliant on being able to use buffers on >> multiple engines through the virtue of the shared fences, and to serialise >> after writes by waiting on the exclusive fence. (So we can have concurrent >> reads on the display engine, render engines and on the CPU - or >> alternatively an exclusive writer.) >> >> In the near future, i915 flips will wait on the common reservation object >> not only for dma-bufs, but also its own GEM objects. >> >>> Ie.: >>> >>> Writer must wait for all fences before starting write access to a >>> buffer, then attach the exclusive fence and signal it on end of >>> write access. E.g., write to renderbuffer, write to texture etc. >> Yes. >> >>> Readers must wait for exclusive fence, then attach a shared fence >>> per reader and signal it on end of read access? E.g., read from >>> texture, fb, scanout? >> Yes. >> >>> Is that correct? In that case we'd have a missing exclusive fence in >>> amdgpu for the linear target dmabuf? Probably beyond my level of >>> knowledge to fix this? >> i915.ko requires the client to mark which buffers are written to. >> >> In ttm, there are ttm_validate_buffer objects which mark whether they >> should be using shared or exclusive fences. Afaict, in amdgpu they are >> all set to shared, the relevant user interface seems to be >> amdgpu_bo_list_set(). > This all makes sense to me. > > Christian, why is amdgpu setting only shared fences? Can we fix that? No, amdgpu relies on the fact that we even allow concurrent write accesses by userspace. E.g. one part of the buffer can be rendered by one engine while another part could be rendered by another engine. Just imagine X which is composing a buffer with both the 3D engine as well as the DMA engine. All engines need to run in parallel and you need to wait for all of them to finish before scanout. Everybody which needs exclusive access to the reservation object (like scanouts do) needs to wait for all fences, not just the exclusive one. The Intel driver clearly needs to be fixed here. Regards, Christian.
On Tue, Sep 13, 2016 at 10:44:11AM +0200, Christian König wrote: > Am 09.09.2016 um 03:15 schrieb Michel Dänzer: > >On 09/09/16 01:23 AM, Chris Wilson wrote: > >>On Thu, Sep 08, 2016 at 05:21:42PM +0200, Mario Kleiner wrote: > >>>On 09/08/2016 08:30 AM, Chris Wilson wrote: > >>>>On Thu, Sep 08, 2016 at 02:14:43AM +0200, Mario Kleiner wrote: > >>>>>amdgpu-kms uses shared fences for its prime exported dmabufs, > >>>>>instead of an exclusive fence. Therefore we need to wait for > >>>>>all fences of the dmabuf reservation object to prevent > >>>>>unsynchronized rendering and flipping. > >>>>No. Fix the root cause as this affects not just flips but copies - > >>>>this implies that everybody using the resv object must wait for all > >>>>fences. The resv object is not just used for prime, but all fencing, so > >>>>this breaks the ability to schedule parallel operations across engine. > >>>>-Chris > >>>> > >>>Ok. I think i now understand the difference, but let's check: The > >>>exclusive fence is essentially acting a bit like a write-lock, and > >>>the shared fences as readers-locks? So you can have multiple readers > >>>but only one writer at a time? > >>That's how we (i915.ko and I hope the rest of the world) are using them. > >>In the model where here is just one reservation object on the GEM > >>object, that reservation object is then shared between internal driver > >>scheduling and external. We are reliant on being able to use buffers on > >>multiple engines through the virtue of the shared fences, and to serialise > >>after writes by waiting on the exclusive fence. (So we can have concurrent > >>reads on the display engine, render engines and on the CPU - or > >>alternatively an exclusive writer.) > >> > >>In the near future, i915 flips will wait on the common reservation object > >>not only for dma-bufs, but also its own GEM objects. > >>>Ie.: > >>> > >>>Writer must wait for all fences before starting write access to a > >>>buffer, then attach the exclusive fence and signal it on end of > >>>write access. E.g., write to renderbuffer, write to texture etc. > >>Yes. > >>>Readers must wait for exclusive fence, then attach a shared fence > >>>per reader and signal it on end of read access? E.g., read from > >>>texture, fb, scanout? > >>Yes. > >>>Is that correct? In that case we'd have a missing exclusive fence in > >>>amdgpu for the linear target dmabuf? Probably beyond my level of > >>>knowledge to fix this? > >>i915.ko requires the client to mark which buffers are written to. > >> > >>In ttm, there are ttm_validate_buffer objects which mark whether they > >>should be using shared or exclusive fences. Afaict, in amdgpu they are > >>all set to shared, the relevant user interface seems to be > >>amdgpu_bo_list_set(). > >This all makes sense to me. > > > >Christian, why is amdgpu setting only shared fences? Can we fix that? > > No, amdgpu relies on the fact that we even allow concurrent write > accesses by userspace. > > E.g. one part of the buffer can be rendered by one engine while > another part could be rendered by another engine. > > Just imagine X which is composing a buffer with both the 3D engine > as well as the DMA engine. > > All engines need to run in parallel and you need to wait for all of > them to finish before scanout. > > Everybody which needs exclusive access to the reservation object > (like scanouts do) needs to wait for all fences, not just the > exclusive one. > > The Intel driver clearly needs to be fixed here. 
If you are not using implicit fencing, you have to pass explicit fences instead. -Chris
Am 13.09.2016 um 11:39 schrieb Chris Wilson: > On Tue, Sep 13, 2016 at 10:44:11AM +0200, Christian König wrote: >> Am 09.09.2016 um 03:15 schrieb Michel Dänzer: >>> On 09/09/16 01:23 AM, Chris Wilson wrote: >>>> On Thu, Sep 08, 2016 at 05:21:42PM +0200, Mario Kleiner wrote: >>>>> On 09/08/2016 08:30 AM, Chris Wilson wrote: >>>>>> On Thu, Sep 08, 2016 at 02:14:43AM +0200, Mario Kleiner wrote: >>>>>>> amdgpu-kms uses shared fences for its prime exported dmabufs, >>>>>>> instead of an exclusive fence. Therefore we need to wait for >>>>>>> all fences of the dmabuf reservation object to prevent >>>>>>> unsynchronized rendering and flipping. >>>>>> No. Fix the root cause as this affects not just flips but copies - >>>>>> this implies that everybody using the resv object must wait for all >>>>>> fences. The resv object is not just used for prime, but all fencing, so >>>>>> this breaks the ability to schedule parallel operations across engine. >>>>>> -Chris >>>>>> >>>>> Ok. I think i now understand the difference, but let's check: The >>>>> exclusive fence is essentially acting a bit like a write-lock, and >>>>> the shared fences as readers-locks? So you can have multiple readers >>>>> but only one writer at a time? >>>> That's how we (i915.ko and I hope the rest of the world) are using them. >>>> In the model where here is just one reservation object on the GEM >>>> object, that reservation object is then shared between internal driver >>>> scheduling and external. We are reliant on being able to use buffers on >>>> multiple engines through the virtue of the shared fences, and to serialise >>>> after writes by waiting on the exclusive fence. (So we can have concurrent >>>> reads on the display engine, render engines and on the CPU - or >>>> alternatively an exclusive writer.) >>>> >>>> In the near future, i915 flips will wait on the common reservation object >>>> not only for dma-bufs, but also its own GEM objects. >>>>> Ie.: >>>>> >>>>> Writer must wait for all fences before starting write access to a >>>>> buffer, then attach the exclusive fence and signal it on end of >>>>> write access. E.g., write to renderbuffer, write to texture etc. >>>> Yes. >>>>> Readers must wait for exclusive fence, then attach a shared fence >>>>> per reader and signal it on end of read access? E.g., read from >>>>> texture, fb, scanout? >>>> Yes. >>>>> Is that correct? In that case we'd have a missing exclusive fence in >>>>> amdgpu for the linear target dmabuf? Probably beyond my level of >>>>> knowledge to fix this? >>>> i915.ko requires the client to mark which buffers are written to. >>>> >>>> In ttm, there are ttm_validate_buffer objects which mark whether they >>>> should be using shared or exclusive fences. Afaict, in amdgpu they are >>>> all set to shared, the relevant user interface seems to be >>>> amdgpu_bo_list_set(). >>> This all makes sense to me. >>> >>> Christian, why is amdgpu setting only shared fences? Can we fix that? >> No, amdgpu relies on the fact that we even allow concurrent write >> accesses by userspace. >> >> E.g. one part of the buffer can be rendered by one engine while >> another part could be rendered by another engine. >> >> Just imagine X which is composing a buffer with both the 3D engine >> as well as the DMA engine. >> >> All engines need to run in parallel and you need to wait for all of >> them to finish before scanout. >> >> Everybody which needs exclusive access to the reservation object >> (like scanouts do) needs to wait for all fences, not just the >> exclusive one. 
>> >> The Intel driver clearly needs to be fixed here. > If you are not using implicit fencing, you have to pass explicit fences > instead. Which is exactly what we do, but only for driver-internal command submissions. All command submissions from the same process can run concurrently with amdgpu; only when we see a fence from another driver or process do we wait for it to complete before starting to run a command submission. Other drivers can't make any assumption about what a shared access is actually doing (e.g. writing or reading) with a buffer. So the model in which i915.ko is using the reservation object and its shared fences is certainly not correct and needs to be fixed. Regards, Christian. > -Chris >
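A simplified illustration of the owner-based filtering described above for amdgpu_sync.c, not the actual amdgpu code: fence_owner() and sync_wait_fence() are hypothetical placeholders for amdgpu's real owner bookkeeping, and only reservation_object_get_fences_rcu() is the real kernel interface of that era.

```c
#include <linux/fence.h>
#include <linux/reservation.h>
#include <linux/slab.h>

/* Hypothetical helpers, standing in for amdgpu's real owner bookkeeping. */
extern void *fence_owner(struct fence *f);
extern void sync_wait_fence(struct fence *f);

static int sync_to_foreign_fences(struct reservation_object *resv,
				  void *my_owner)
{
	struct fence *excl;
	struct fence **shared;
	unsigned int count, i;
	int ret;

	ret = reservation_object_get_fences_rcu(resv, &excl, &count, &shared);
	if (ret)
		return ret;

	/* The exclusive fence (e.g. a kernel-internal buffer move) is always honoured. */
	if (excl)
		sync_wait_fence(excl);

	for (i = 0; i < count; i++) {
		/* Same owner: userspace orders this work itself, skip the wait. */
		if (fence_owner(shared[i]) == my_owner)
			continue;
		/* Foreign process or driver: fall back to implicit sync. */
		sync_wait_fence(shared[i]);
	}

	/* get_fences_rcu() handed us references; drop them again. */
	fence_put(excl);
	for (i = 0; i < count; i++)
		fence_put(shared[i]);
	kfree(shared);
	return 0;
}
```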
On 13/09/16 09:52 PM, Christian König wrote: > Am 13.09.2016 um 11:39 schrieb Chris Wilson: >> On Tue, Sep 13, 2016 at 10:44:11AM +0200, Christian König wrote: >>> Am 09.09.2016 um 03:15 schrieb Michel Dänzer: >>>> On 09/09/16 01:23 AM, Chris Wilson wrote: >>>>> On Thu, Sep 08, 2016 at 05:21:42PM +0200, Mario Kleiner wrote: >>>>>> On 09/08/2016 08:30 AM, Chris Wilson wrote: >>>>>>> On Thu, Sep 08, 2016 at 02:14:43AM +0200, Mario Kleiner wrote: >>>>>>>> amdgpu-kms uses shared fences for its prime exported dmabufs, >>>>>>>> instead of an exclusive fence. Therefore we need to wait for >>>>>>>> all fences of the dmabuf reservation object to prevent >>>>>>>> unsynchronized rendering and flipping. >>>>>>> No. Fix the root cause as this affects not just flips but copies - >>>>>>> this implies that everybody using the resv object must wait for all >>>>>>> fences. The resv object is not just used for prime, but all >>>>>>> fencing, so >>>>>>> this breaks the ability to schedule parallel operations across >>>>>>> engine. >>>>>>> -Chris >>>>>>> >>>>>> Ok. I think i now understand the difference, but let's check: The >>>>>> exclusive fence is essentially acting a bit like a write-lock, and >>>>>> the shared fences as readers-locks? So you can have multiple readers >>>>>> but only one writer at a time? >>>>> That's how we (i915.ko and I hope the rest of the world) are using >>>>> them. >>>>> In the model where here is just one reservation object on the GEM >>>>> object, that reservation object is then shared between internal driver >>>>> scheduling and external. We are reliant on being able to use >>>>> buffers on >>>>> multiple engines through the virtue of the shared fences, and to >>>>> serialise >>>>> after writes by waiting on the exclusive fence. (So we can have >>>>> concurrent >>>>> reads on the display engine, render engines and on the CPU - or >>>>> alternatively an exclusive writer.) >>>>> >>>>> In the near future, i915 flips will wait on the common reservation >>>>> object >>>>> not only for dma-bufs, but also its own GEM objects. >>>>>> Ie.: >>>>>> >>>>>> Writer must wait for all fences before starting write access to a >>>>>> buffer, then attach the exclusive fence and signal it on end of >>>>>> write access. E.g., write to renderbuffer, write to texture etc. >>>>> Yes. >>>>>> Readers must wait for exclusive fence, then attach a shared fence >>>>>> per reader and signal it on end of read access? E.g., read from >>>>>> texture, fb, scanout? >>>>> Yes. >>>>>> Is that correct? In that case we'd have a missing exclusive fence in >>>>>> amdgpu for the linear target dmabuf? Probably beyond my level of >>>>>> knowledge to fix this? >>>>> i915.ko requires the client to mark which buffers are written to. >>>>> >>>>> In ttm, there are ttm_validate_buffer objects which mark whether they >>>>> should be using shared or exclusive fences. Afaict, in amdgpu they are >>>>> all set to shared, the relevant user interface seems to be >>>>> amdgpu_bo_list_set(). >>>> This all makes sense to me. >>>> >>>> Christian, why is amdgpu setting only shared fences? Can we fix that? >>> No, amdgpu relies on the fact that we even allow concurrent write >>> accesses by userspace. >>> >>> E.g. one part of the buffer can be rendered by one engine while >>> another part could be rendered by another engine. >>> >>> Just imagine X which is composing a buffer with both the 3D engine >>> as well as the DMA engine. >>> >>> All engines need to run in parallel and you need to wait for all of >>> them to finish before scanout. 
>>> >>> Everybody which needs exclusive access to the reservation object >>> (like scanouts do) needs to wait for all fences, not just the >>> exclusive one. >>> >>> The Intel driver clearly needs to be fixed here. >> If you are not using implicit fencing, you have to pass explicit fences >> instead. > > Which is exactly what we do, but only for the driver internally command > submissions. > > All command submissions from the same process can run concurrently with > amdgpu, only when we see a fence from another driver or process we wait > for it to complete before starting to run a command submission. > > Other drivers can't make any assumption on what a shared access is > actually doing (e.g. writing or reading) with a buffer. > > So the model i915.ko is using the reservation object and it's shared > fences is certainly not correct and needs to be fixed. Looks like there are different interpretations of the semantics of exclusive vs. shared fences. Where are these semantics documented? FWIW, we seem to have the same issue with radeon vs. amdgpu: radeon only seems to wait for exclusive fences, so e.g. running Xorg on amdgpu and using PRIME slave scanout on radeon leaves artifacts.
Am 21.09.2016 um 11:56 schrieb Michel Dänzer: > > Looks like there are different interpretations of the semantics of > exclusive vs. shared fences. Where are these semantics documented? Yeah, I also think that this is the primary question here. IIRC the fences were explicitly called exclusive/shared instead of writing/reading on purpose. I absolutely don't mind switching them to writing/reading semantics, but amdgpu really needs multiple writers at the same time. So in this case the writing side of a reservation object needs to be a collection of fences as well. > FWIW, we seem to have the same issue with radeon vs. amdgpu: radeon only > seems to wait for exclusive fences, so e.g. running Xorg on amdgpu and > using PRIME slave scanout on radeon leaves artifacts. Yeah, I know. See radeon_flip_work_func() in radeon_display.c. We pretty much need the same patch there that I've done for amdgpu. Regards, Christian.
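The difference between what the radeon flip path does and what the proposed amdgpu patch does comes down to the wait_all flag of the reservation wait. A hedged sketch, assuming the ~4.8 API; radeon actually takes a reference to the exclusive fence at flip time and waits on it later rather than calling this helper directly.

```c
#include <linux/fence.h>
#include <linux/reservation.h>
#include <linux/sched.h>

/* Hedged sketch of the difference, not the actual flip code of either driver. */
static long wait_before_scanout(struct reservation_object *resv, bool wait_all)
{
	/*
	 * wait_all == false: roughly what radeon does today - it only picks up
	 * the exclusive fence of the shared BO before the flip/copy.
	 *
	 * wait_all == true: what the proposed amdgpu change does - also wait
	 * for every shared fence, so rendering that the exporter attached as
	 * shared fences cannot race the scanout copy.
	 */
	return reservation_object_wait_timeout_rcu(resv, wait_all, false,
						   MAX_SCHEDULE_TIMEOUT);
}
```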
On Wed, Sep 21, 2016 at 12:30 PM, Christian König <deathsimple@vodafone.de> wrote: > Am 21.09.2016 um 11:56 schrieb Michel Dänzer: >> >> >> Looks like there are different interpretations of the semantics of >> exclusive vs. shared fences. Where are these semantics documented? > > > Yeah, I think as well that this is the primary question here. > > IIRC the fences were explicitly called exclusive/shared instead of > writing/reading on purpose. > > I absolutely don't mind switching to them to writing/reading semantics, but > amdgpu really needs multiple writers at the same time. > > So in this case the writing side of a reservation object needs to be a > collection of fences as well. You can't have multiple writers with implicit syncing. That confusion is exactly why we called them shared/exclusive. Multiple writers generally means that you do some form of fencing in userspace (unsync'ed gl buffer access is the common one). What you do for private buffers doesn't matter, but when you render into a shared/winsys buffer you really need to set the exclusive fence (and there can only ever be one). So probably needs some userspace adjustments to make sure you don't accidentally set an exclusive write hazard when you don't really want that implicit sync. -Daniel
Am 21.09.2016 um 13:04 schrieb Daniel Vetter: > On Wed, Sep 21, 2016 at 12:30 PM, Christian König > <deathsimple@vodafone.de> wrote: >> Am 21.09.2016 um 11:56 schrieb Michel Dänzer: >>> >>> Looks like there are different interpretations of the semantics of >>> exclusive vs. shared fences. Where are these semantics documented? >> >> Yeah, I think as well that this is the primary question here. >> >> IIRC the fences were explicitly called exclusive/shared instead of >> writing/reading on purpose. >> >> I absolutely don't mind switching to them to writing/reading semantics, but >> amdgpu really needs multiple writers at the same time. >> >> So in this case the writing side of a reservation object needs to be a >> collection of fences as well. > You can't have multiple writers with implicit syncing. That confusion > is exactly why we called them shared/exclusive. Multiple writers > generally means that you do some form of fencing in userspace > (unsync'ed gl buffer access is the common one). What you do for > private buffers doesn't matter, but when you render into a > shared/winsys buffer you really need to set the exclusive fence (and > there can only ever be one). So probably needs some userspace > adjustments to make sure you don't accidentally set an exclusive write > hazard when you don't really want that implicit sync. Nope, that isn't true. We use multiple writers without implicit syncing between processes in the amdgpu stack perfectly fine. See amdgpu_sync.c for the implementation. What we do there is take a look at all the fences associated with a reservation object and only sync to those which are from another process. Then we use implicit syncing for command submissions in the form of "dependencies". E.g. for each CS we report back an identifier of that submission to user space, and on the next submission you can give this identifier as a dependency which needs to be satisfied before the command submission can start running. This was done to allow multiple engines (3D, DMA, Compute) to compose a buffer while still allowing compatibility with protocols like DRI2/DRI3. Regards, Christian. > -Daniel
On Wed, Sep 21, 2016 at 1:19 PM, Christian König <deathsimple@vodafone.de> wrote: > Am 21.09.2016 um 13:04 schrieb Daniel Vetter: >> >> On Wed, Sep 21, 2016 at 12:30 PM, Christian König >> <deathsimple@vodafone.de> wrote: >>> >>> Am 21.09.2016 um 11:56 schrieb Michel Dänzer: >>>> >>>> >>>> Looks like there are different interpretations of the semantics of >>>> exclusive vs. shared fences. Where are these semantics documented? >>> >>> >>> Yeah, I think as well that this is the primary question here. >>> >>> IIRC the fences were explicitly called exclusive/shared instead of >>> writing/reading on purpose. >>> >>> I absolutely don't mind switching to them to writing/reading semantics, >>> but >>> amdgpu really needs multiple writers at the same time. >>> >>> So in this case the writing side of a reservation object needs to be a >>> collection of fences as well. >> >> You can't have multiple writers with implicit syncing. That confusion >> is exactly why we called them shared/exclusive. Multiple writers >> generally means that you do some form of fencing in userspace >> (unsync'ed gl buffer access is the common one). What you do for >> private buffers doesn't matter, but when you render into a >> shared/winsys buffer you really need to set the exclusive fence (and >> there can only ever be one). So probably needs some userspace >> adjustments to make sure you don't accidentally set an exclusive write >> hazard when you don't really want that implicit sync. > > > Nope, that isn't true. > > We use multiple writers without implicit syncing between processes in the > amdgpu stack perfectly fine. > > See amdgpu_sync.c for the implementation. What we do there is taking a look > at all the fences associated with a reservation object and only sync to > those who are from another process. > > Then we use implicit syncing for command submissions in the form of > "dependencies". E.g. for each CS we report back an identifier of that > submission to user space and on the next submission you can give this > identifier as dependency which needs to be satisfied before the command > submission can start running. This is called explicit fencing. Implemented with a driver-private primitive (and not sync_file fds like on android), but still conceptually explicit fencing. Implicit fencing really only can handle one writer, at least as currently implemented by struct reservation_object. > This was done to allow multiple engines (3D, DMA, Compute) to compose a > buffer while still allow compatibility with protocols like DRI2/DRI3. Instead of the current solution you need to stop attaching exclusive fences to non-shared buffers (which are coordinated using the driver-private explicit fencing you're describing), and only attach exclusive fences to shared buffers (DRI2/3, PRIME, whatever). Since you're doing explicit syncing for internal stuff anyway you can still opt to ignore the exclusive fences if you want to (driven by a flag or something similar). -Daniel
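A sketch of the submission rule suggested above, under stated assumptions: the my_bo struct, its prime_shared flag and the cs_done end-of-batch fence are hypothetical placeholders, not amdgpu's real data structures, and error handling is omitted.

```c
#include <linux/fence.h>
#include <linux/reservation.h>

struct my_bo {				/* hypothetical, for illustration only */
	struct reservation_object *resv;
	bool prime_shared;		/* exported via PRIME / used by a winsys */
};

static void attach_cs_fence(struct my_bo *bo, struct fence *cs_done, bool written)
{
	if (written && bo->prime_shared) {
		/*
		 * Implicit-sync consumers (DRI2/3, PRIME importers) only look
		 * at the exclusive slot, so a write to a shared buffer must be
		 * published there.
		 */
		reservation_object_add_excl_fence(bo->resv, cs_done);
	} else {
		/*
		 * Private buffers keep using shared slots; ordering between the
		 * driver's own submissions stays explicit via CS dependencies.
		 */
		if (reservation_object_reserve_shared(bo->resv) == 0)
			reservation_object_add_shared_fence(bo->resv, cs_done);
	}
}
```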
On 21/09/16 09:56 PM, Daniel Vetter wrote: > On Wed, Sep 21, 2016 at 1:19 PM, Christian König > <deathsimple@vodafone.de> wrote: >> Am 21.09.2016 um 13:04 schrieb Daniel Vetter: >>> On Wed, Sep 21, 2016 at 12:30 PM, Christian König >>> <deathsimple@vodafone.de> wrote: >>>> Am 21.09.2016 um 11:56 schrieb Michel Dänzer: >>>>> >>>>> >>>>> Looks like there are different interpretations of the semantics of >>>>> exclusive vs. shared fences. Where are these semantics documented? >>>> >>>> >>>> Yeah, I think as well that this is the primary question here. >>>> >>>> IIRC the fences were explicitly called exclusive/shared instead of >>>> writing/reading on purpose. >>>> >>>> I absolutely don't mind switching to them to writing/reading semantics, >>>> but >>>> amdgpu really needs multiple writers at the same time. >>>> >>>> So in this case the writing side of a reservation object needs to be a >>>> collection of fences as well. >>> >>> You can't have multiple writers with implicit syncing. That confusion >>> is exactly why we called them shared/exclusive. Multiple writers >>> generally means that you do some form of fencing in userspace >>> (unsync'ed gl buffer access is the common one). What you do for >>> private buffers doesn't matter, but when you render into a >>> shared/winsys buffer you really need to set the exclusive fence (and >>> there can only ever be one). So probably needs some userspace >>> adjustments to make sure you don't accidentally set an exclusive write >>> hazard when you don't really want that implicit sync. >> >> >> Nope, that isn't true. >> >> We use multiple writers without implicit syncing between processes in the >> amdgpu stack perfectly fine. >> >> See amdgpu_sync.c for the implementation. What we do there is taking a look >> at all the fences associated with a reservation object and only sync to >> those who are from another process. >> >> Then we use implicit syncing for command submissions in the form of >> "dependencies". E.g. for each CS we report back an identifier of that >> submission to user space and on the next submission you can give this >> identifier as dependency which needs to be satisfied before the command >> submission can start running. > > This is called explicit fencing. Implemented with a driver-private > primitive (and not sync_file fds like on android), but still > conceptually explicit fencing. Implicit fencing really only can handle > one writer, at least as currently implemented by struct > reservation_object. > >> This was done to allow multiple engines (3D, DMA, Compute) to compose a >> buffer while still allow compatibility with protocols like DRI2/DRI3. > > Instead of the current solution you need to stop attaching exclusive > fences to non-shared buffers (which are coordinated using the > driver-private explicit fencing you're describing), Err, the current issue is actually that amdgpu never sets an exclusive fence, only ever shared ones. :) > and only attach exclusive fences to shared buffers (DRI2/3, PRIME, > whatever). Still, it occurred to me in the meantime that amdgpu setting the exclusive fence for buffers shared via PRIME (no matter if it's a write or read operation) might be a solution. Christian, what do you think?
On 21/09/16 07:30 PM, Christian König wrote: > Am 21.09.2016 um 11:56 schrieb Michel Dänzer: >> >> FWIW, we seem to have the same issue with radeon vs. amdgpu: radeon only >> seems to wait for exclusive fences, so e.g. running Xorg on amdgpu and >> using PRIME slave scanout on radeon leaves artifacts. > > Yeah, I know. See radeon_display.c radeon_flip_work_func(). > > We pretty much need the same patch here I've done for amdgpu as well. Actually, the PRIME slave can't scan out from the shared BOs directly (recall the recent discussion we had about that with Mario), but has to copy from the shared BO to a dedicated scanout BO. These copies need to be synchronized with the primary GPU's copies to the shared BO.
Am 21.09.2016 um 17:07 schrieb Michel Dänzer: > On 21/09/16 09:56 PM, Daniel Vetter wrote: >> On Wed, Sep 21, 2016 at 1:19 PM, Christian König >> <deathsimple@vodafone.de> wrote: >>> Am 21.09.2016 um 13:04 schrieb Daniel Vetter: >>>> On Wed, Sep 21, 2016 at 12:30 PM, Christian König >>>> <deathsimple@vodafone.de> wrote: >>>>> Am 21.09.2016 um 11:56 schrieb Michel Dänzer: >>>>>> >>>>>> Looks like there are different interpretations of the semantics of >>>>>> exclusive vs. shared fences. Where are these semantics documented? >>>>> >>>>> Yeah, I think as well that this is the primary question here. >>>>> >>>>> IIRC the fences were explicitly called exclusive/shared instead of >>>>> writing/reading on purpose. >>>>> >>>>> I absolutely don't mind switching to them to writing/reading semantics, >>>>> but >>>>> amdgpu really needs multiple writers at the same time. >>>>> >>>>> So in this case the writing side of a reservation object needs to be a >>>>> collection of fences as well. >>>> You can't have multiple writers with implicit syncing. That confusion >>>> is exactly why we called them shared/exclusive. Multiple writers >>>> generally means that you do some form of fencing in userspace >>>> (unsync'ed gl buffer access is the common one). What you do for >>>> private buffers doesn't matter, but when you render into a >>>> shared/winsys buffer you really need to set the exclusive fence (and >>>> there can only ever be one). So probably needs some userspace >>>> adjustments to make sure you don't accidentally set an exclusive write >>>> hazard when you don't really want that implicit sync. >>> >>> Nope, that isn't true. >>> >>> We use multiple writers without implicit syncing between processes in the >>> amdgpu stack perfectly fine. >>> >>> See amdgpu_sync.c for the implementation. What we do there is taking a look >>> at all the fences associated with a reservation object and only sync to >>> those who are from another process. >>> >>> Then we use implicit syncing for command submissions in the form of >>> "dependencies". E.g. for each CS we report back an identifier of that >>> submission to user space and on the next submission you can give this >>> identifier as dependency which needs to be satisfied before the command >>> submission can start running. >> This is called explicit fencing. Implemented with a driver-private >> primitive (and not sync_file fds like on android), but still >> conceptually explicit fencing. Implicit fencing really only can handle >> one writer, at least as currently implemented by struct >> reservation_object. >> >>> This was done to allow multiple engines (3D, DMA, Compute) to compose a >>> buffer while still allow compatibility with protocols like DRI2/DRI3. >> Instead of the current solution you need to stop attaching exclusive >> fences to non-shared buffers (which are coordinated using the >> driver-private explicit fencing you're describing), > Err, the current issue is actually that amdgpu never sets an exclusive > fence, only ever shared ones. :) Actually amdgpu does set the exclusive fence for buffer migrations, cause that is an operation user space doesn't know about and so it needs to be "exclusive" access to the buffer. >> and only attach exclusive fences to shared buffers (DRI2/3, PRIME, >> whatever). > Still, it occurred to me in the meantime that amdgpu setting the > exclusive fence for buffers shared via PRIME (no matter if it's a write > or read operation) might be a solution. Christian, what do you think? 
The problem is that we don't have one fence, but many. E.g. there can be many operations on a buffer at the same time, and we need to wait for all of them to complete before it can be displayed. Regards, Christian.
Am 21.09.2016 um 17:13 schrieb Michel Dänzer: > On 21/09/16 07:30 PM, Christian König wrote: >> Am 21.09.2016 um 11:56 schrieb Michel Dänzer: >>> FWIW, we seem to have the same issue with radeon vs. amdgpu: radeon only >>> seems to wait for exclusive fences, so e.g. running Xorg on amdgpu and >>> using PRIME slave scanout on radeon leaves artifacts. >> Yeah, I know. See radeon_display.c radeon_flip_work_func(). >> >> We pretty much need the same patch here I've done for amdgpu as well. > Actually, the PRIME slave can't scan out from the shared BOs directly > (recall the recent discussion we had about that with Mario), but has to > copy from the shared BO to a dedicated scanout BO. These copies need to > be synchronized with the primary GPU's copies to the shared BO. Yeah, that thought came to my mind before as well. Buffer migrations by the kernel caused by a prime export actually set the exclusive fence. So this shouldn't be an issue in practice when the displaying GPU needs to copy from the BO again anyway. The only case I can see when this can happen is when the BO is composed directly in system memory by the engines and not migrated there. Could be that we run into this issue more often in the future, because that is pretty much what we want to have for 4K UVD decode. Christian.
On 22/09/16 12:21 AM, Christian König wrote: > Am 21.09.2016 um 17:13 schrieb Michel Dänzer: >> On 21/09/16 07:30 PM, Christian König wrote: >>> Am 21.09.2016 um 11:56 schrieb Michel Dänzer: >>>> FWIW, we seem to have the same issue with radeon vs. amdgpu: radeon >>>> only >>>> seems to wait for exclusive fences, so e.g. running Xorg on amdgpu and >>>> using PRIME slave scanout on radeon leaves artifacts. >>> Yeah, I know. See radeon_display.c radeon_flip_work_func(). >>> >>> We pretty much need the same patch here I've done for amdgpu as well. >> Actually, the PRIME slave can't scan out from the shared BOs directly >> (recall the recent discussion we had about that with Mario), but has to >> copy from the shared BO to a dedicated scanout BO. These copies need to >> be synchronized with the primary GPU's copies to the shared BO. > > Yeah, that thought came to my mind before as well. > > Buffer migrations by the kernel caused by a prime export actually set > the exclusive fence. > > So this shouldn't be an issue in practice when the displaying GPU needs > to copy from the BO again anyway. > > The only case I can see when this can happen is when the BO is composed > directly in system memory by the engines and not migrated there. There is no migration going on in the steady state, just the primary GPU writing to the shared BO and the slave GPU reading from it. If those operations aren't properly synchronized, there is at least intermittent tearing, but there can even be artifacts which stay until that area is updated again.
On 22/09/16 12:15 AM, Christian König wrote: > Am 21.09.2016 um 17:07 schrieb Michel Dänzer: >> On 21/09/16 09:56 PM, Daniel Vetter wrote: >>> On Wed, Sep 21, 2016 at 1:19 PM, Christian König >>> <deathsimple@vodafone.de> wrote: >>>> >>>> We use multiple writers without implicit syncing between processes >>>> in the >>>> amdgpu stack perfectly fine. >>>> >>>> See amdgpu_sync.c for the implementation. What we do there is taking >>>> a look >>>> at all the fences associated with a reservation object and only sync to >>>> those who are from another process. >>>> >>>> Then we use implicit syncing for command submissions in the form of >>>> "dependencies". E.g. for each CS we report back an identifier of that >>>> submission to user space and on the next submission you can give this >>>> identifier as dependency which needs to be satisfied before the command >>>> submission can start running. >>> This is called explicit fencing. Implemented with a driver-private >>> primitive (and not sync_file fds like on android), but still >>> conceptually explicit fencing. Implicit fencing really only can handle >>> one writer, at least as currently implemented by struct >>> reservation_object. >>> >>>> This was done to allow multiple engines (3D, DMA, Compute) to compose a >>>> buffer while still allow compatibility with protocols like DRI2/DRI3. >>> Instead of the current solution you need to stop attaching exclusive >>> fences to non-shared buffers (which are coordinated using the >>> driver-private explicit fencing you're describing), >> Err, the current issue is actually that amdgpu never sets an exclusive >> fence, only ever shared ones. :) > > Actually amdgpu does set the exclusive fence for buffer migrations, > cause that is an operation user space doesn't know about and so it needs > to be "exclusive" access to the buffer. > > >>> and only attach exclusive fences to shared buffers (DRI2/3, PRIME, >>> whatever). >> Still, it occurred to me in the meantime that amdgpu setting the >> exclusive fence for buffers shared via PRIME (no matter if it's a write >> or read operation) might be a solution. Christian, what do you think? > > The problem is that we don't have one fence, but many. > > E.g. there can be many operation on a buffer at the same time and we > need to wait for all of them to complete before it can be displayed. Maybe in theory, but with the problem we have in practice right now, the amdgpu GPU should only ever access the shared BO with the same engine. Anyway, this should be solvable by setting some kind of meta-fence as the exclusive fence, which can internally be mapped to multiple fences, maybe up to one for each ring which can access the BO?
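Michel's "meta-fence" idea maps fairly closely onto the fence_array container available in kernels of that era (later dma_fence_array); a hedged sketch, ignoring error handling and the scheduler interaction raised in the following reply.

```c
#include <linux/fence.h>
#include <linux/fence-array.h>

/*
 * Bundle the per-ring fences into one container fence that could then be set
 * as the exclusive fence. fence_array_create() takes ownership of the fences[]
 * array and of the references it holds.
 */
static struct fence *make_meta_fence(struct fence **fences, int num_fences,
				     u64 context, unsigned int seqno)
{
	struct fence_array *array;

	array = fence_array_create(num_fences, fences, context, seqno,
				   false /* signal only when all have signalled */);
	return array ? &array->base : NULL;
}
```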
Am 21.09.2016 um 17:29 schrieb Michel Dänzer: > On 22/09/16 12:15 AM, Christian König wrote: >> Am 21.09.2016 um 17:07 schrieb Michel Dänzer: >>> On 21/09/16 09:56 PM, Daniel Vetter wrote: >>>> On Wed, Sep 21, 2016 at 1:19 PM, Christian König >>>> <deathsimple@vodafone.de> wrote: >>>>> We use multiple writers without implicit syncing between processes >>>>> in the >>>>> amdgpu stack perfectly fine. >>>>> >>>>> See amdgpu_sync.c for the implementation. What we do there is taking >>>>> a look >>>>> at all the fences associated with a reservation object and only sync to >>>>> those who are from another process. >>>>> >>>>> Then we use implicit syncing for command submissions in the form of >>>>> "dependencies". E.g. for each CS we report back an identifier of that >>>>> submission to user space and on the next submission you can give this >>>>> identifier as dependency which needs to be satisfied before the command >>>>> submission can start running. >>>> This is called explicit fencing. Implemented with a driver-private >>>> primitive (and not sync_file fds like on android), but still >>>> conceptually explicit fencing. Implicit fencing really only can handle >>>> one writer, at least as currently implemented by struct >>>> reservation_object. >>>> >>>>> This was done to allow multiple engines (3D, DMA, Compute) to compose a >>>>> buffer while still allow compatibility with protocols like DRI2/DRI3. >>>> Instead of the current solution you need to stop attaching exclusive >>>> fences to non-shared buffers (which are coordinated using the >>>> driver-private explicit fencing you're describing), >>> Err, the current issue is actually that amdgpu never sets an exclusive >>> fence, only ever shared ones. :) >> Actually amdgpu does set the exclusive fence for buffer migrations, >> cause that is an operation user space doesn't know about and so it needs >> to be "exclusive" access to the buffer. >> >> >>>> and only attach exclusive fences to shared buffers (DRI2/3, PRIME, >>>> whatever). >>> Still, it occurred to me in the meantime that amdgpu setting the >>> exclusive fence for buffers shared via PRIME (no matter if it's a write >>> or read operation) might be a solution. Christian, what do you think? >> The problem is that we don't have one fence, but many. >> >> E.g. there can be many operation on a buffer at the same time and we >> need to wait for all of them to complete before it can be displayed. > Maybe in theory, but with the problem we have in practice right now, the > amdgpu GPU should only ever access the shared BO with the same engine. That clearly won't work. Take a look at what both Mesa and the pro stack do with the BO before it is displayed: it makes it mandatory to execute things in parallel (at least for the non-shared case). > Anyway, this should be solvable by setting some kind of meta-fence as > the exclusive fence, which can internally be mapped to multiple fences, > maybe up to one for each ring which can access the BO? I've thought about that as well, but this approach would also only work when we keep a collection of fences and not just an array, because of the scheduler. For a quick workaround I suggest just serializing all accesses to BOs shared with different drivers, but essentially I think it is a perfectly valid requirement to have multiple writers to one BO. Christian.
On Thu, Sep 22, 2016 at 12:07:24AM +0900, Michel Dänzer wrote: > On 21/09/16 09:56 PM, Daniel Vetter wrote: > > On Wed, Sep 21, 2016 at 1:19 PM, Christian König > > <deathsimple@vodafone.de> wrote: > >> Am 21.09.2016 um 13:04 schrieb Daniel Vetter: > >>> On Wed, Sep 21, 2016 at 12:30 PM, Christian König > >>> <deathsimple@vodafone.de> wrote: > >>>> Am 21.09.2016 um 11:56 schrieb Michel Dänzer: > >>>>> > >>>>> > >>>>> Looks like there are different interpretations of the semantics of > >>>>> exclusive vs. shared fences. Where are these semantics documented? > >>>> > >>>> > >>>> Yeah, I think as well that this is the primary question here. > >>>> > >>>> IIRC the fences were explicitly called exclusive/shared instead of > >>>> writing/reading on purpose. > >>>> > >>>> I absolutely don't mind switching to them to writing/reading semantics, > >>>> but > >>>> amdgpu really needs multiple writers at the same time. > >>>> > >>>> So in this case the writing side of a reservation object needs to be a > >>>> collection of fences as well. > >>> > >>> You can't have multiple writers with implicit syncing. That confusion > >>> is exactly why we called them shared/exclusive. Multiple writers > >>> generally means that you do some form of fencing in userspace > >>> (unsync'ed gl buffer access is the common one). What you do for > >>> private buffers doesn't matter, but when you render into a > >>> shared/winsys buffer you really need to set the exclusive fence (and > >>> there can only ever be one). So probably needs some userspace > >>> adjustments to make sure you don't accidentally set an exclusive write > >>> hazard when you don't really want that implicit sync. > >> > >> > >> Nope, that isn't true. > >> > >> We use multiple writers without implicit syncing between processes in the > >> amdgpu stack perfectly fine. > >> > >> See amdgpu_sync.c for the implementation. What we do there is taking a look > >> at all the fences associated with a reservation object and only sync to > >> those who are from another process. > >> > >> Then we use implicit syncing for command submissions in the form of > >> "dependencies". E.g. for each CS we report back an identifier of that > >> submission to user space and on the next submission you can give this > >> identifier as dependency which needs to be satisfied before the command > >> submission can start running. > > > > This is called explicit fencing. Implemented with a driver-private > > primitive (and not sync_file fds like on android), but still > > conceptually explicit fencing. Implicit fencing really only can handle > > one writer, at least as currently implemented by struct > > reservation_object. > > > >> This was done to allow multiple engines (3D, DMA, Compute) to compose a > >> buffer while still allow compatibility with protocols like DRI2/DRI3. > > > > Instead of the current solution you need to stop attaching exclusive > > fences to non-shared buffers (which are coordinated using the > > driver-private explicit fencing you're describing), > > Err, the current issue is actually that amdgpu never sets an exclusive > fence, only ever shared ones. :) Well since you sometimes sync and sometimes not sync it is kinda a special case of semi-exclusive fence (even if attached to the shared slots). > > > and only attach exclusive fences to shared buffers (DRI2/3, PRIME, > > whatever). 
> > Still, it occurred to me in the meantime that amdgpu setting the > exclusive fence for buffers shared via PRIME (no matter if it's a write > or read operation) might be a solution. Christian, what do you think? Yup, that's what I mean. And it shouldn't cause a problem for shared buffers (at least for protocols where implicit fencing is required), since for those you really can't have multiple concurrent writers. And with the special checks in amdgpu_sync.c that's what's happening in reality; the only difference is that the filtering/selection of what is considered an exclusive fence happens when you sync, and not when you attach them. And that breaks reservation_object assumptions. -Daniel
On Wed, Sep 21, 2016 at 06:23:35PM +0200, Christian König wrote: > For a quick workaround I suggest to just serialize all accesses to BO shared > with different drivers, but essentially I think it is a perfectly valid > requirement to have multiple writers to one BO. It is, but it's not possible with implicit sync. If you want parallel write access to the same shared buffer, you _must_ carry around some explicit fences. Within amdgpu you can use driver-specific cookies, for shared buffers we now have sync_file. But multiple writers with implicit sync simply fundamentally don't work, because you have no idea which writer is touching the same subrange you want to touch. -Daniel
Am 22.09.2016 um 08:36 schrieb Daniel Vetter: > On Wed, Sep 21, 2016 at 06:23:35PM +0200, Christian König wrote: >> For a quick workaround I suggest to just serialize all accesses to BO shared >> with different drivers, but essentially I think it is a perfectly valid >> requirement to have multiple writers to one BO. > It is, but it's not possible with implicit sync. If you want parallel > write access to the same shared buffer, you _must_ carry around some > explicit fences. Within amdgpu you can use driver-specific cookies, for > shared buffer we now have sync_file. But multiple writers with implicit > sync simply fundamentally doesn't work. Because you have no idea with which > writer, touching the same subrange you want to touch. You don't need to separate the BO into subranges which are touched by different engines to allow multiple writers. AMD hardware, and I'm pretty sure other hardware as well, is perfectly capable of writing to the same memory from multiple engines and even multiple GPUs at the same time. For a good hint of what is possible see the public AMD ISA documentation about atomic operations, but that is only the start of it. The crux here is that we need to assume that we will have implicit and explicit sync mixed for backward compatibility. This implies that we need some mechanism like the one amdgpu uses in its sync implementation, where every fence is associated with an owner which denotes the domain in which implicit sync happens. If you leave this domain you will automatically run into explicit sync. Currently we define the borders of this domain in amdgpu at the process boundary to keep things like DRI2/DRI3 working as expected. I really don't see how you want to solve this with a single explicit fence for each reservation object. As long as you have multiple concurrently running operations accessing the same buffer you need to keep one fence for each operation no matter what. Regards, Christian. > -Daniel
On Thu, Sep 22, 2016 at 12:55 PM, Christian König <deathsimple@vodafone.de> wrote: > Am 22.09.2016 um 08:36 schrieb Daniel Vetter: >> >> On Wed, Sep 21, 2016 at 06:23:35PM +0200, Christian König wrote: >>> >>> For a quick workaround I suggest to just serialize all accesses to BO >>> shared >>> with different drivers, but essentially I think it is a perfectly valid >>> requirement to have multiple writers to one BO. >> >> It is, but it's not possible with implicit sync. If you want parallel >> write access to the same shared buffer, you _must_ carry around some >> explicit fences. Within amdgpu you can use driver-specific cookies, for >> shared buffer we now have sync_file. But multiple writers with implicit >> sync simply fundamentally doesn't work. Because you have no idea with >> which >> writer, touching the same subrange you want to touch. > > > You don't need to separate the BO into subranges which are touched by > different engines for allowing multiple writers. > > AMD hardware and I'm pretty sure others as well are perfectly capable of > writing to the same memory from multiple engines and even multiple GPUs at > the same time. > > For a good hint of what is possible see the public AMD ISA documentation > about atomic operations, but that is only the start of it. > > > The crux here is that we need to assume that we will have implicit and > explicit sync mixed for backward compatibility. > > This implies that we need some mechanism like amdgpu uses in it's sync > implementation where every fence is associated with an owner which denotes > the domain in which implicit sync happens. If you leave this domain you will > automatically run into explicit sync. > > Currently we define the borders of this domain in amdgpu on process boundary > to keep things like DRI2/DRI3 working as expected. > > I really don't see how you want to solve this with a single explicit fence > for each reservation object. As long as you have multiple concurrently > running operations accessing the same buffer you need to keep one fence for > each operation no matter what. I can't make sense of what you're saying, and I suspect we put different meanings to the same words. So let me define here: - implicit fencing: Userspace does not track reads/writes to buffers; only the kernel does that. This is the assumption DRI2/3 has. Since synchronization is by necessity on a per-buffer level you can only have 1 writer. In the kernel the cross-driver interface for this is struct reservation_object attached to dma-bufs. If you don't fill out/wait for the exclusive fence in there, your driver is _not_ doing (cross-device) implicit fencing. - explicit fencing: Userspace passes around distinct fence objects for any work going on on the gpu. The kernel doesn't insert any stall of its own (except for moving buffer objects around ofc). This is what Android does. This also seems to be what amdgpu is doing within one process/owner. Given that, I'm not sure what you mean by "one explicit fence per reservation_object", since explicit fencing should not attach anything (at least not any exclusive fences) to a reservation_object. It does sound a bit like you have the meaning the other way round (as in explicit fencing = the kernel explicitly takes care of fencing, but it's explicit as in explicit fences visible to userspace). -Daniel
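For contrast with the implicit path, a sketch of how the explicit-fencing model defined above can hand a fence to userspace with the sync_file primitive that had been mainlined shortly before this thread; export_job_fence() and the surrounding ioctl plumbing are hypothetical and driver-specific.

```c
#include <linux/fence.h>
#include <linux/file.h>
#include <linux/fcntl.h>
#include <linux/sync_file.h>

/* Wrap the fence of a just-submitted job in a sync_file fd for userspace. */
static int export_job_fence(struct fence *job_done)
{
	struct sync_file *sf;
	int fd;

	fd = get_unused_fd_flags(O_CLOEXEC);
	if (fd < 0)
		return fd;

	sf = sync_file_create(job_done);
	if (!sf) {
		put_unused_fd(fd);
		return -ENOMEM;
	}

	fd_install(fd, sf->file);
	return fd;	/* userspace hands this fd to the next consumer explicitly */
}
```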
Am 22.09.2016 um 14:26 schrieb Daniel Vetter: > On Thu, Sep 22, 2016 at 12:55 PM, Christian König > <deathsimple@vodafone.de> wrote: >> Am 22.09.2016 um 08:36 schrieb Daniel Vetter: >>> On Wed, Sep 21, 2016 at 06:23:35PM +0200, Christian König wrote: >>>> For a quick workaround I suggest to just serialize all accesses to BO >>>> shared >>>> with different drivers, but essentially I think it is a perfectly valid >>>> requirement to have multiple writers to one BO. >>> It is, but it's not possible with implicit sync. If you want parallel >>> write access to the same shared buffer, you _must_ carry around some >>> explicit fences. Within amdgpu you can use driver-specific cookies, for >>> shared buffer we now have sync_file. But multiple writers with implicit >>> sync simply fundamentally doesn't work. Because you have no idea with >>> which >>> writer, touching the same subrange you want to touch. >> >> You don't need to separate the BO into subranges which are touched by >> different engines for allowing multiple writers. >> >> AMD hardware and I'm pretty sure others as well are perfectly capable of >> writing to the same memory from multiple engines and even multiple GPUs at >> the same time. >> >> For a good hint of what is possible see the public AMD ISA documentation >> about atomic operations, but that is only the start of it. >> >> >> The crux here is that we need to assume that we will have implicit and >> explicit sync mixed for backward compatibility. >> >> This implies that we need some mechanism like amdgpu uses in it's sync >> implementation where every fence is associated with an owner which denotes >> the domain in which implicit sync happens. If you leave this domain you will >> automatically run into explicit sync. >> >> Currently we define the borders of this domain in amdgpu on process boundary >> to keep things like DRI2/DRI3 working as expected. >> >> I really don't see how you want to solve this with a single explicit fence >> for each reservation object. As long as you have multiple concurrently >> running operations accessing the same buffer you need to keep one fence for >> each operation no matter what. > I can't make sense of what you're saying, and I suspect we put > different meaning to different words. So let me define here: > > - implicit fencing: Userspace does not track read/writes to buffers, > but only the kernel does that. This is the assumption DRI2/3 has. > Since synchronization is by necessity on a per-buffer level you can > only have 1 writer. In the kernel the cross-driver interface for this > is struct reservation_object attached to dma-bufs. If you don't fill > out/wait for the exclusive fence in there, you're driver is _not_ > doing (cross-device) implicit fencing. I can confirm that my understanding of implicit fencing is exactly the same as yours. > - explicit fencing: Userspace passes around distinct fence objects for > any work going on on the gpu. The kernel doesn't insert any stall of > it's own (except for moving buffer objects around ofc). This is what > Android. This also seems to be what amdgpu is doing within one > process/owner. No, that is clearly not my understanding of explicit fencing. Userspace doesn't necessarily need to pass around distinct fence objects with all of its protocols, and the kernel is still responsible for inserting stalls whenever a userspace protocol or application requires these semantics. Otherwise you will never be able to use explicit fencing on the Linux desktop with protocols like DRI2/DRI3. 
I would expect that every driver in the system waits for all fences of a reservation object unless it is told otherwise by userspace providing a distinct fence object with the IOCTL in question. Regards, Christian.
On Thu, Sep 22, 2016 at 2:44 PM, Christian König <deathsimple@vodafone.de> wrote: >> - explicit fencing: Userspace passes around distinct fence objects for >> any work going on on the gpu. The kernel doesn't insert any stall of >> it's own (except for moving buffer objects around ofc). This is what >> Android. This also seems to be what amdgpu is doing within one >> process/owner. > > > No, that is clearly not my understanding of explicit fencing. > > Userspace doesn't necessarily need to pass around distinct fence objects > with all of it's protocols and the kernel is still responsible for inserting > stalls whenever an userspace protocol or application requires this > semantics. > > Otherwise you will never be able to use explicit fencing on the Linux > desktop with protocols like DRI2/DRI3. This is about mixing them. Explicit fencing still means userspace has an explicit piece, separate from buffers, (either sync_file fd, or a driver-specific cookie, or similar). > I would expect that every driver in the system waits for all fences of a > reservation object as long as it isn't told otherwise by providing a > distinct fence object with the IOCTL in question. Yup agreed. This way if your explicitly-fencing driver reads a shared buffer passed over a protocol that does implicit fencing (like DRI2/3), then it will work. The other interop direction is explicitly-fencing driver passes a buffer to a consumer which expects implicit fencing. In that case you must attach the right fence to the exclusive slot, but _only_ in that case. Otherwise you end up stalling your explicitly-fencing userspace, since implicit fencing doesn't allow more than 1 writer. For amdgpu one possible way to implement this might be to count how many users a dma-buf has, and if it's more than just the current context set the exclusive fence. Or do an uabi revision and let userspace decide (or at least overwrite it). But the current approach in amdgpu_sync.c of declaring a fence as exclusive after the fact (if owners don't match) just isn't how reservation_object works. You can of course change that, but that means you must change all drivers implementing support for implicit fencing of dma-buf. Fixing amdgpu will be easier ;-) -Daniel
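One way to picture the first option Daniel mentions above is a sketch like the following (assuming the 2016-era reservation API): the decision is made when the command-submission fence is published, based on whether the BO is visible outside the current context. bo_shared_externally would have to come from something like a dma-buf attachment count and is an assumption of this sketch, not existing amdgpu code.

#include <linux/fence.h>
#include <linux/reservation.h>

/*
 * Sketch: publish the command-submission fence. Only buffers visible to
 * implicit-fencing consumers get the exclusive slot, so driver-internal
 * work keeps its parallelism via shared fences. Assumes a shared slot was
 * reserved with reservation_object_reserve_shared() beforehand and that
 * the reservation lock is held.
 */
static void publish_cs_fence(struct reservation_object *resv,
                             struct fence *cs_fence,
                             bool bo_shared_externally)
{
        if (bo_shared_externally)
                /* Implicit-fencing consumers only look at this slot. */
                reservation_object_add_excl_fence(resv, cs_fence);
        else
                /* Internal users can still run concurrently. */
                reservation_object_add_shared_fence(resv, cs_fence);
}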
Am 22.09.2016 um 15:05 schrieb Daniel Vetter: > On Thu, Sep 22, 2016 at 2:44 PM, Christian König > <deathsimple@vodafone.de> wrote: >>> - explicit fencing: Userspace passes around distinct fence objects for >>> any work going on on the gpu. The kernel doesn't insert any stall of >>> it's own (except for moving buffer objects around ofc). This is what >>> Android. This also seems to be what amdgpu is doing within one >>> process/owner. >> >> No, that is clearly not my understanding of explicit fencing. >> >> Userspace doesn't necessarily need to pass around distinct fence objects >> with all of it's protocols and the kernel is still responsible for inserting >> stalls whenever an userspace protocol or application requires this >> semantics. >> >> Otherwise you will never be able to use explicit fencing on the Linux >> desktop with protocols like DRI2/DRI3. > This is about mixing them. Explicit fencing still means userspace has > an explicit piece, separate from buffers, (either sync_file fd, or a > driver-specific cookie, or similar). > >> I would expect that every driver in the system waits for all fences of a >> reservation object as long as it isn't told otherwise by providing a >> distinct fence object with the IOCTL in question. > Yup agreed. This way if your explicitly-fencing driver reads a shared > buffer passed over a protocol that does implicit fencing (like > DRI2/3), then it will work. > > The other interop direction is explicitly-fencing driver passes a > buffer to a consumer which expects implicit fencing. In that case you > must attach the right fence to the exclusive slot, but _only_ in that > case. Ok, well, it sounds like you are close to understanding why I can't do exactly this: There simply is no right fence I could attach. When amdgpu makes the command submissions, it doesn't necessarily know that the buffer will be exported and shared with another device later on. So when the buffer is exported and given to the other device, you might have a whole bunch of fences which run concurrently and not in any serial order. > Otherwise you end up stalling your explicitly-fencing userspace, > since implicit fencing doesn't allow more than 1 writer. For amdgpu > one possible way to implement this might be to count how many users a > dma-buf has, and if it's more than just the current context set the > exclusive fence. Or do an uabi revision and let userspace decide (or > at least overwrite it). I mean I can pick one fence and wait for the rest to finish manually, but that would certainly defeat the whole effort, wouldn't it? I completely agree that you have only 1 writer with implicit fencing, but when you switch from explicit fencing back to implicit fencing you can have multiple ones. > But the current approach in amdgpu_sync.c of declaring a fence as > exclusive after the fact (if owners don't match) just isn't how > reservation_object works. You can of course change that, but that > means you must change all drivers implementing support for implicit > fencing of dma-buf. Fixing amdgpu will be easier ;-) Well, as far as I can see there is no way I can fix amdgpu in this case. The handling clearly needs to be changed on the receiving side of the reservation objects if I don't want to completely disable concurrent access to BOs in amdgpu. Regards, Christian. > -Daniel
On 22/09/16 10:22 PM, Christian König wrote: > Am 22.09.2016 um 15:05 schrieb Daniel Vetter: >> On Thu, Sep 22, 2016 at 2:44 PM, Christian König >> <deathsimple@vodafone.de> wrote: >>>> - explicit fencing: Userspace passes around distinct fence objects for >>>> any work going on on the gpu. The kernel doesn't insert any stall of >>>> it's own (except for moving buffer objects around ofc). This is what >>>> Android. This also seems to be what amdgpu is doing within one >>>> process/owner. >>> >>> No, that is clearly not my understanding of explicit fencing. >>> >>> Userspace doesn't necessarily need to pass around distinct fence objects >>> with all of it's protocols and the kernel is still responsible for >>> inserting >>> stalls whenever an userspace protocol or application requires this >>> semantics. >>> >>> Otherwise you will never be able to use explicit fencing on the Linux >>> desktop with protocols like DRI2/DRI3. >> This is about mixing them. Explicit fencing still means userspace has >> an explicit piece, separate from buffers, (either sync_file fd, or a >> driver-specific cookie, or similar). >> >>> I would expect that every driver in the system waits for all fences of a >>> reservation object as long as it isn't told otherwise by providing a >>> distinct fence object with the IOCTL in question. >> Yup agreed. This way if your explicitly-fencing driver reads a shared >> buffer passed over a protocol that does implicit fencing (like >> DRI2/3), then it will work. >> >> The other interop direction is explicitly-fencing driver passes a >> buffer to a consumer which expects implicit fencing. In that case you >> must attach the right fence to the exclusive slot, but _only_ in that >> case. > > Ok well sounds like you are close to understand why I can't do exactly > this: There simply is no right fence I could attach. > > When amdgpu makes the command submissions it doesn't necessarily know > that the buffer will be exported and shared with another device later on. > > So when the buffer is exported and given to the other device you might > have a whole bunch of fences which run concurrently and not in any > serial order. I feel like you're thinking too much of buffers shared between GPUs as being short-lived and only shared late. In the use-cases I know about, shared buffers are created separately and shared ahead of time, the actual rendering work is done to non-shared buffers and then just copied to the shared buffers for transfer between GPUs. These copies are always performed by the same context in such a way that they should always be performed by the same HW engine and thus implicitly serialized. Do you have any specific use-cases in mind where buffers are only shared between GPUs after the rendering operations creating the buffer contents to be shared have already been submitted? >> Otherwise you end up stalling your explicitly-fencing userspace, >> since implicit fencing doesn't allow more than 1 writer. For amdgpu >> one possible way to implement this might be to count how many users a >> dma-buf has, and if it's more than just the current context set the >> exclusive fence. Or do an uabi revision and let userspace decide (or >> at least overwrite it). > > I mean I can pick one fence and wait for the rest to finish manually, > but that would certainly defeat the whole effort, doesn't it? I'm afraid it's not clear to me why it would. Can you elaborate? 
>> But the current approach in amdgpu_sync.c of declaring a fence as >> exclusive after the fact (if owners don't match) just isn't how >> reservation_object works. You can of course change that, but that >> means you must change all drivers implementing support for implicit >> fencing of dma-buf. Fixing amdgpu will be easier ;-) > > Well as far as I can see there is no way I can fix amdgpu in this case. > > The handling clearly needs to be changed on the receiving side of the > reservation objects if I don't completely want to disable concurrent > access to BOs in amdgpu. Anyway, we need a solution for this between radeon and amdgpu, and I don't think a solution which involves those drivers using reservation object semantics between them which are different from all other drivers is a good idea.
On Fri, Sep 23, 2016 at 07:00:25PM +0900, Michel Dänzer wrote: > On 22/09/16 10:22 PM, Christian König wrote: > > Am 22.09.2016 um 15:05 schrieb Daniel Vetter: > >> On Thu, Sep 22, 2016 at 2:44 PM, Christian König > >> <deathsimple@vodafone.de> wrote: > >>>> - explicit fencing: Userspace passes around distinct fence objects for > >>>> any work going on on the gpu. The kernel doesn't insert any stall of > >>>> it's own (except for moving buffer objects around ofc). This is what > >>>> Android. This also seems to be what amdgpu is doing within one > >>>> process/owner. > >>> > >>> No, that is clearly not my understanding of explicit fencing. > >>> > >>> Userspace doesn't necessarily need to pass around distinct fence objects > >>> with all of it's protocols and the kernel is still responsible for > >>> inserting > >>> stalls whenever an userspace protocol or application requires this > >>> semantics. > >>> > >>> Otherwise you will never be able to use explicit fencing on the Linux > >>> desktop with protocols like DRI2/DRI3. > >> This is about mixing them. Explicit fencing still means userspace has > >> an explicit piece, separate from buffers, (either sync_file fd, or a > >> driver-specific cookie, or similar). > >> > >>> I would expect that every driver in the system waits for all fences of a > >>> reservation object as long as it isn't told otherwise by providing a > >>> distinct fence object with the IOCTL in question. > >> Yup agreed. This way if your explicitly-fencing driver reads a shared > >> buffer passed over a protocol that does implicit fencing (like > >> DRI2/3), then it will work. > >> > >> The other interop direction is explicitly-fencing driver passes a > >> buffer to a consumer which expects implicit fencing. In that case you > >> must attach the right fence to the exclusive slot, but _only_ in that > >> case. > > > > Ok well sounds like you are close to understand why I can't do exactly > > this: There simply is no right fence I could attach. > > > > When amdgpu makes the command submissions it doesn't necessarily know > > that the buffer will be exported and shared with another device later on. > > > > So when the buffer is exported and given to the other device you might > > have a whole bunch of fences which run concurrently and not in any > > serial order. > > I feel like you're thinking too much of buffers shared between GPUs as > being short-lived and only shared late. In the use-cases I know about, > shared buffers are created separately and shared ahead of time, the > actual rendering work is done to non-shared buffers and then just copied > to the shared buffers for transfer between GPUs. These copies are always > performed by the same context in such a way that they should always be > performed by the same HW engine and thus implicitly serialized. > > Do you have any specific use-cases in mind where buffers are only shared > between GPUs after the rendering operations creating the buffer contents > to be shared have already been submitted? Yeah, it should be known which buffer you use (at least in userspace, maybe not in the kernel) for which you need implicit fencing. At least with DRI2/3 it's really obvious which buffers are shared. Same holds for external images and other imported buffers. Yes that means you need to keep track of a few things in userspace, and you need to add a special flag to CS to make sure the kernel does set the exclusive fence. 
> >> Otherwise you end up stalling your explicitly-fencing userspace, > >> since implicit fencing doesn't allow more than 1 writer. For amdgpu > >> one possible way to implement this might be to count how many users a > >> dma-buf has, and if it's more than just the current context set the > >> exclusive fence. Or do an uabi revision and let userspace decide (or > >> at least overwrite it). > > > > I mean I can pick one fence and wait for the rest to finish manually, > > but that would certainly defeat the whole effort, doesn't it? > > I'm afraid it's not clear to me why it would. Can you elaborate? > > > >> But the current approach in amdgpu_sync.c of declaring a fence as > >> exclusive after the fact (if owners don't match) just isn't how > >> reservation_object works. You can of course change that, but that > >> means you must change all drivers implementing support for implicit > >> fencing of dma-buf. Fixing amdgpu will be easier ;-) > > > > Well as far as I can see there is no way I can fix amdgpu in this case. > > > > The handling clearly needs to be changed on the receiving side of the > > reservation objects if I don't completely want to disable concurrent > > access to BOs in amdgpu. > > Anyway, we need a solution for this between radeon and amdgpu, and I > don't think a solution which involves those drivers using reservation > object semantics between them which are different from all other drivers > is a good idea. Afaik there are also amd+intel machines out there, so really the only options are to either fix amdgpu to correctly set exclusive fences on shared buffers (with the help of userspace hints), or to change all the existing drivers. No idea what's simpler to do, since I don't know about amdgpu userspace. -Daniel
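To make the "userspace hints" part concrete, here is a purely hypothetical uapi sketch; neither the flag nor the struct exist in the real amdgpu uapi. Userspace would mark the BOs it is about to hand over via an implicit-fencing protocol, and the kernel would then publish the CS fence as the exclusive fence for exactly those BOs, as in the earlier sketch.

#include <linux/types.h>

/*
 * Hypothetical uapi sketch only -- not part of the real amdgpu uapi.
 * Userspace sets the flag on buffers it will share over an
 * implicit-fencing protocol (DRI2/3, external images, ...); the kernel
 * then routes the CS fence into the exclusive slot for those BOs.
 */
#define HYPOTHETICAL_CS_BO_FLAG_IMPLICIT_SYNC_WRITE	(1u << 0)

struct hypothetical_cs_bo_entry {
	__u32 handle;	/* GEM handle of a BO referenced by this CS */
	__u32 flags;	/* HYPOTHETICAL_CS_BO_FLAG_* */
};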
On 23/09/16 09:09 PM, Daniel Vetter wrote: > On Fri, Sep 23, 2016 at 07:00:25PM +0900, Michel Dänzer wrote: >> On 22/09/16 10:22 PM, Christian König wrote: >>> Am 22.09.2016 um 15:05 schrieb Daniel Vetter: >>>> >>>> But the current approach in amdgpu_sync.c of declaring a fence as >>>> exclusive after the fact (if owners don't match) just isn't how >>>> reservation_object works. You can of course change that, but that >>>> means you must change all drivers implementing support for implicit >>>> fencing of dma-buf. Fixing amdgpu will be easier ;-) >>> >>> Well as far as I can see there is no way I can fix amdgpu in this case. >>> >>> The handling clearly needs to be changed on the receiving side of the >>> reservation objects if I don't completely want to disable concurrent >>> access to BOs in amdgpu. >> >> Anyway, we need a solution for this between radeon and amdgpu, and I >> don't think a solution which involves those drivers using reservation >> object semantics between them which are different from all other drivers >> is a good idea. > > Afaik there's also amd+intel machines out there, Sure, what I meant was that even if we didn't care about those (which we do), we'd still need a solution between our own drivers. > so really the only option is to either fix amdgpu to correctly set > exclusive fences on shared buffers (with the help of userspace hints). > Or change all the existing drivers. I got some fresh perspective on the problem while taking a walk, and I'm now fairly convinced that neither amdgpu userspace nor other drivers need to be modified: It occurred to me that all the information we need is already there in the exclusive and shared fences set by amdgpu. We just need to use it differently to match the expectations of other drivers. We should be able to set some sort of "pseudo" fence as the exclusive fence of the reservation object. When we are asked by another driver to wait for this fence to signal, we take the current "real" exclusive fence (which we can keep track of e.g. in our BO struct) and shared fences, and wait for all of those to signal; then we mark the "pseudo" exclusive fence as signalled. Am I missing anything which would prevent this from working?
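A very rough sketch of what such a "pseudo" exclusive fence could look like with the 2016-era struct fence API is shown below. This is not a complete fence implementation: refcounting, a .release hook for the snapshot, and signalling through enable_signaling callbacks are all omitted, and the snapshot of "real" fences is assumed to be filled in by hypothetical driver code at the point where the pseudo fence is installed in the exclusive slot.

#include <linux/fence.h>
#include <linux/kernel.h>
#include <linux/spinlock.h>

struct pseudo_excl_fence {
        struct fence base;
        spinlock_t lock;        /* lock handed to fence_init() */
        struct fence **real;    /* snapshot of the real excl + shared fences */
        unsigned int count;
};

static const char *pseudo_get_driver_name(struct fence *f)
{
        return "pseudo-excl";
}

static const char *pseudo_get_timeline_name(struct fence *f)
{
        return "pseudo-excl";
}

static bool pseudo_enable_signaling(struct fence *f)
{
        /* A complete version would chain callbacks onto the real fences
         * here so the pseudo fence also signals when nobody is waiting. */
        return true;
}

static signed long pseudo_wait(struct fence *f, bool intr, signed long timeout)
{
        struct pseudo_excl_fence *pf =
                container_of(f, struct pseudo_excl_fence, base);
        signed long ret = timeout;
        unsigned int i;

        /* Waiting on the pseudo fence means waiting for every fence it
         * stands in for, i.e. the real exclusive and all shared fences. */
        for (i = 0; i < pf->count && ret > 0; ++i)
                ret = fence_wait_timeout(pf->real[i], intr, ret);

        if (ret > 0)
                fence_signal(f);
        return ret;
}

static const struct fence_ops pseudo_excl_fence_ops = {
        .get_driver_name        = pseudo_get_driver_name,
        .get_timeline_name      = pseudo_get_timeline_name,
        .enable_signaling       = pseudo_enable_signaling,
        .wait                   = pseudo_wait,
};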
On Mon, Sep 26, 2016 at 09:48:37AM +0900, Michel Dänzer wrote: > On 23/09/16 09:09 PM, Daniel Vetter wrote: > > On Fri, Sep 23, 2016 at 07:00:25PM +0900, Michel Dänzer wrote: > >> On 22/09/16 10:22 PM, Christian König wrote: > >>> Am 22.09.2016 um 15:05 schrieb Daniel Vetter: > >>>> > >>>> But the current approach in amdgpu_sync.c of declaring a fence as > >>>> exclusive after the fact (if owners don't match) just isn't how > >>>> reservation_object works. You can of course change that, but that > >>>> means you must change all drivers implementing support for implicit > >>>> fencing of dma-buf. Fixing amdgpu will be easier ;-) > >>> > >>> Well as far as I can see there is no way I can fix amdgpu in this case. > >>> > >>> The handling clearly needs to be changed on the receiving side of the > >>> reservation objects if I don't completely want to disable concurrent > >>> access to BOs in amdgpu. > >> > >> Anyway, we need a solution for this between radeon and amdgpu, and I > >> don't think a solution which involves those drivers using reservation > >> object semantics between them which are different from all other drivers > >> is a good idea. > > > > Afaik there's also amd+intel machines out there, > > Sure, what I meant was that even if we didn't care about those (which we > do), we'd still need a solution between our own drivers. > > > > so really the only option is to either fix amdgpu to correctly set > > exclusive fences on shared buffers (with the help of userspace hints). > > Or change all the existing drivers. > > I got some fresh perspective on the problem while taking a walk, and I'm > now fairly convinced that neither amdgpu userspace nor other drivers > need to be modified: > > It occurred to me that all the information we need is already there in > the exclusive and shared fences set by amdgpu. We just need to use it > differently to match the expectations of other drivers. > > We should be able to set some sort of "pseudo" fence as the exclusive > fence of the reservation object. When we are asked by another driver to > wait for this fence to signal, we take the current "real" exclusive > fence (which we can keep track of e.g. in our BO struct) and shared > fences, and wait for all of those to signal; then we mark the "pseudo" > exclusive fence as signalled. > > Am I missing anything which would prevent this from working? One thing to make sure is that you don't change the (real, privately stored) fences you're waiting on over the lifetime of the public exclusive fence. That might lead to some hilarity wrt potentially creating fence dependency loops. But I think as long as you guarantee that the private internal fences are always amdgpu ones, and never anything imported from a different driver, even that should be fine. Not because this would break the loops, but since amdgpu has a hangcheck it can still guarantee that the fence eventually fires even if there is a loop. -Daniel
Hi This discussion has gone a little quiet Was there any agreement about what needed doing to get this working for i965/amdgpu? Regards Mike On Mon, 26 Sep 2016 at 09:04 Daniel Vetter <daniel@ffwll.ch> wrote: > > On Mon, Sep 26, 2016 at 09:48:37AM +0900, Michel Dänzer wrote: > > On 23/09/16 09:09 PM, Daniel Vetter wrote: > > > On Fri, Sep 23, 2016 at 07:00:25PM +0900, Michel Dänzer wrote: > > >> On 22/09/16 10:22 PM, Christian König wrote: > > >>> Am 22.09.2016 um 15:05 schrieb Daniel Vetter: > > >>>> > > >>>> But the current approach in amdgpu_sync.c of declaring a fence as > > >>>> exclusive after the fact (if owners don't match) just isn't how > > >>>> reservation_object works. You can of course change that, but that > > >>>> means you must change all drivers implementing support for implicit > > >>>> fencing of dma-buf. Fixing amdgpu will be easier ;-) > > >>> > > >>> Well as far as I can see there is no way I can fix amdgpu in this case. > > >>> > > >>> The handling clearly needs to be changed on the receiving side of the > > >>> reservation objects if I don't completely want to disable concurrent > > >>> access to BOs in amdgpu. > > >> > > >> Anyway, we need a solution for this between radeon and amdgpu, and I > > >> don't think a solution which involves those drivers using reservation > > >> object semantics between them which are different from all other drivers > > >> is a good idea. > > > > > > Afaik there's also amd+intel machines out there, > > > > Sure, what I meant was that even if we didn't care about those (which we > > do), we'd still need a solution between our own drivers. > > > > > > > so really the only option is to either fix amdgpu to correctly set > > > exclusive fences on shared buffers (with the help of userspace hints). > > > Or change all the existing drivers. > > > > I got some fresh perspective on the problem while taking a walk, and I'm > > now fairly convinced that neither amdgpu userspace nor other drivers > > need to be modified: > > > > It occurred to me that all the information we need is already there in > > the exclusive and shared fences set by amdgpu. We just need to use it > > differently to match the expectations of other drivers. > > > > We should be able to set some sort of "pseudo" fence as the exclusive > > fence of the reservation object. When we are asked by another driver to > > wait for this fence to signal, we take the current "real" exclusive > > fence (which we can keep track of e.g. in our BO struct) and shared > > fences, and wait for all of those to signal; then we mark the "pseudo" > > exclusive fence as signalled. > > > > Am I missing anything which would prevent this from working? > > One thing to make sure is that you don't change the (real, private stored) > fences you're waiting on over the lifetime of the public exclusive fence. > That might lead to some hilarity wrt potentially creating fence depency > loops. But I think as long as you guarantee that the private internal > fences are always amdgpu ones, and never anything imported from a > different driver even that should be fine. Not because this would break > the loops, but since amgpud has a hangcheck it can still gurantee that the > fence eventually fires even if there is a loop. > -Daniel > -- > Daniel Vetter > Software Engineer, Intel Corporation > http://blog.ffwll.ch > _______________________________________________ > dri-devel mailing list > dri-devel@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/dri-devel
On 07/10/16 09:34 PM, Mike Lothian wrote: > > This has discussion has gone a little quiet > > Was there any agreement about what needed doing to get this working > for i965/amdgpu? Christian, do you see anything which could prevent the solution I outlined from working?
Am 11.10.2016 um 05:58 schrieb Michel Dänzer: > On 07/10/16 09:34 PM, Mike Lothian wrote: >> This has discussion has gone a little quiet >> >> Was there any agreement about what needed doing to get this working >> for i965/amdgpu? > Christian, do you see anything which could prevent the solution I > outlined from working? I thought about that approach as well, but unfortunately it also has a couple of downsides. Especially keeping the exclusive fence set while we actually don't need it isn't really clean either. I'm currently a bit busy with other tasks and so put Nayan on a road to get a bit into the kernel driver (he asked for that anyway). Implementing the simple workaround to sync when we export the DMA-buf should be something like 20 lines of code and fortunately Nayan has an I+A system and so can actually test it. If it turns out to be more problematic or somebody really starts to need it I'm going to hack on that myself a bit more. Regards, Christian.
On 11/10/16 09:04 PM, Christian König wrote: > Am 11.10.2016 um 05:58 schrieb Michel Dänzer: >> On 07/10/16 09:34 PM, Mike Lothian wrote: >>> This has discussion has gone a little quiet >>> >>> Was there any agreement about what needed doing to get this working >>> for i965/amdgpu? >> Christian, do you see anything which could prevent the solution I >> outlined from working? > > I thought about that approach as well, but unfortunately it also has a > couple of downsides. Especially keeping the exclusive fence set while we > actually don't need it isn't really clean either. I was wondering if it's possible to have a singleton pseudo exclusive fence which is used for all BOs. That might keep the overhead acceptably low. > I'm currently a bit busy with other tasks and so put Nayan on a road to > get a bit into the kernel driver (he asked for that anyway). > > Implementing the simple workaround to sync when we export the DMA-buf > should be something like 20 lines of code and fortunately Nayan has an > I+A system and so can actually test it. > > If it turns out to be more problematic or somebody really starts to need > it I'm going to hack on that myself a bit more. If you mean only syncing when a DMA-buf is exported, that's not enough, as I explained before. The BOs being shared are long-lived, and synchronization between GPUs is required for every command stream submission.
Hi Just another gentle ping to see where you are with this? Cheers Mike On Wed, 12 Oct 2016 at 01:40 Michel Dänzer <michel@daenzer.net> wrote: > On 11/10/16 09:04 PM, Christian König wrote: > > Am 11.10.2016 um 05:58 schrieb Michel Dänzer: > >> On 07/10/16 09:34 PM, Mike Lothian wrote: > >>> This has discussion has gone a little quiet > >>> > >>> Was there any agreement about what needed doing to get this working > >>> for i965/amdgpu? > >> Christian, do you see anything which could prevent the solution I > >> outlined from working? > > > > I thought about that approach as well, but unfortunately it also has a > > couple of downsides. Especially keeping the exclusive fence set while we > > actually don't need it isn't really clean either. > > I was wondering if it's possible to have a singleton pseudo exclusive > fence which is used for all BOs. That might keep the overhead acceptably > low. > > > > I'm currently a bit busy with other tasks and so put Nayan on a road to > > get a bit into the kernel driver (he asked for that anyway). > > > > Implementing the simple workaround to sync when we export the DMA-buf > > should be something like 20 lines of code and fortunately Nayan has an > > I+A system and so can actually test it. > > > > If it turns out to be more problematic or somebody really starts to need > > it I'm going to hack on that myself a bit more. > > If you mean only syncing when a DMA-buf is exported, that's not enough, > as I explained before. The BOs being shared are long-lived, and > synchronization between GPUs is required for every command stream > submission. > > > -- > Earthling Michel Dänzer | http://www.amd.com > Libre software enthusiast | Mesa and X developer >
On 27/10/16 10:33 PM, Mike Lothian wrote: > > Just another gentle ping to see where you are with this? I haven't had a chance to look into this any further.
diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c index 922709b..4b74b96 100644 --- a/drivers/gpu/drm/i915/intel_display.c +++ b/drivers/gpu/drm/i915/intel_display.c @@ -12043,7 +12043,7 @@ static void intel_mmio_flip_work_func(struct work_struct *w) /* For framebuffer backed by dmabuf, wait for fence */ resv = i915_gem_object_get_dmabuf_resv(obj); if (resv) - WARN_ON(reservation_object_wait_timeout_rcu(resv, false, false, + WARN_ON(reservation_object_wait_timeout_rcu(resv, true, false, MAX_SCHEDULE_TIMEOUT) < 0); intel_pipe_update_start(crtc); @@ -14700,7 +14700,7 @@ intel_prepare_plane_fb(struct drm_plane *plane, if (resv) { long lret; - lret = reservation_object_wait_timeout_rcu(resv, false, true, + lret = reservation_object_wait_timeout_rcu(resv, true, true, MAX_SCHEDULE_TIMEOUT); if (lret == -ERESTARTSYS) return lret;
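For reference, this is the prototype of the helper whose arguments the hunks above change, as of this kernel generation (in much later kernels the reservation_object API was renamed to dma_resv). The patch flips the second argument, wait_all, from false to true, so these i915 paths wait for the shared (reader) fences as well as the exclusive one.

/* Prototype from include/linux/reservation.h (this kernel generation) */
long reservation_object_wait_timeout_rcu(struct reservation_object *obj,
                                         bool wait_all, /* also wait for all shared fences */
                                         bool intr,     /* interruptible wait */
                                         unsigned long timeout);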
amdgpu-kms uses shared fences for its prime exported dmabufs, instead of an exclusive fence. Therefore we need to wait for all fences of the dmabuf reservation object to prevent unsynchronized rendering and flipping. This patch was tested to behave properly with intel-kms + radeon/amdgpu/nouveau-kms for correct prime sync during pageflipping under DRI3/Present. Should fix https://bugs.freedesktop.org/show_bug.cgi?id=95472 at least for page-flipped presentation. Suggested-by: Michel Dänzer <michel.daenzer@amd.com> Signed-off-by: Mario Kleiner <mario.kleiner.de@gmail.com> Cc: Michel Dänzer <michel.daenzer@amd.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Airlie <airlied@linux.ie> --- drivers/gpu/drm/i915/intel_display.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)