Message ID: 20241109172942.482630-1-matthew.brost@intel.com (mailing list archive)
Series: Common preempt fences and semantics
On 09.11.24 at 18:29, Matthew Brost wrote: > The motivation for this series comes from pending UMD submission work by > AMD [1], ARM [3], and the Xe team, who are also beginning to look at > this. Sima has suggested [4] some common driver preemptive fences and > semantics, which we all agree on. This is the first attempt to implement > them, based on Xe's existing long-running preemptive fences. > > The semantics are fairly simple: preemptive fences attached to a > dma-resv must wait to enable signaling (and thus preempt device > execution) until all fences in other slots have been signaled. These > semantics are necessary to avoid deadlocks when preempting a device > while a user submission is pending, and resuming device execution > depends on the user submission completing. The semantics align with > Christian's view [5], which I also arrived at independently (with a > little help from Sima). Ah! I was really wondering for a moment why you wanted to add a separate dma_resv usage for that. But I strongly think that this approach won't work. The assumption that completion fences which depend on a preemption fence are always part of the same dma_resv object is most likely false in some cases. At least for the AMD design what can easily happen is that we put a completion fence only into a drm_syncobj for explicit synchronization and then engines or different devices which still use kernel submissions depend on it. That can go boom really trivially. What we do instead is to wait for the latest issued completion fence while holding a mutex to prevent creating new ones before stopping the threads and signaling the preemption fence. That code could of course be made some driver independent component. Regards, Christian. > > Implemented via the new dma-resv slot DMA_RESV_USAGE_PREEMPT, a common > embedded base preemptive fence class with driver operations, and updates > to the scheduler to adhere to these semantics. 
> > The current Xe long-running preemptive fences have been converted to the > new common code, and UMD submission is expected to follow (hopefully) > shortly thereafter in a subsequent series. > > Not intended to be presented as the final solution, but rather to > kickstart serious discussions on this topic. > > Matt > > [1] https://patchwork.freedesktop.org/series/113675/ > [2] https://patchwork.freedesktop.org/series/114385/ > [3] https://patchwork.freedesktop.org/series/137924/ > [4] https://patchwork.kernel.org/project/dri-devel/cover/20240828172605.19176-1-mihail.atanassov@arm.com/#26011577 > [5] https://patchwork.kernel.org/project/dri-devel/cover/20240828172605.19176-1-mihail.atanassov@arm.com/#26011853 > > Matthew Brost (6): > dma-resv: Add DMA_RESV_USAGE_PREEMPT > drm/sched: Teach scheduler about DMA_RESV_USAGE_PREEMPT > dma-fence: Add dma_fence_preempt base class > drm/sched: Teach scheduler about dma_fence_prempt type > drm/xe: Use DMA_RESV_USAGE_PREEMPT for preempt fences > drm/xe: Use dma_fence_preempt base class > > drivers/dma-buf/Makefile | 2 +- > drivers/dma-buf/dma-fence-preempt.c | 102 ++++++++++++++++++++ > drivers/dma-buf/dma-resv.c | 43 ++++++--- > drivers/dma-buf/st-dma-resv.c | 2 +- > drivers/gpu/drm/scheduler/sched_entity.c | 29 +++++- > drivers/gpu/drm/scheduler/sched_main.c | 50 +++++++--- > drivers/gpu/drm/xe/xe_bo.c | 22 +---- > drivers/gpu/drm/xe/xe_guc_submit.c | 3 + > drivers/gpu/drm/xe/xe_hw_engine_group.c | 4 +- > drivers/gpu/drm/xe/xe_migrate.c | 4 +- > drivers/gpu/drm/xe/xe_preempt_fence.c | 81 +++++----------- > drivers/gpu/drm/xe/xe_preempt_fence.h | 2 +- > drivers/gpu/drm/xe/xe_preempt_fence_types.h | 11 +-- > drivers/gpu/drm/xe/xe_pt.c | 12 +-- > drivers/gpu/drm/xe/xe_vm.c | 22 +---- > include/drm/gpu_scheduler.h | 15 +++ > include/linux/dma-fence-preempt.h | 54 +++++++++++ > include/linux/dma-fence.h | 1 + > include/linux/dma-resv.h | 24 +++-- > 19 files changed, 326 insertions(+), 157 deletions(-) > create mode 100644 
drivers/dma-buf/dma-fence-preempt.c > create mode 100644 include/linux/dma-fence-preempt.h >
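To make the proposed semantics concrete, here is a minimal userspace sketch of the signaling rule (toy types and names, not the real dma-resv/dma-fence API): a fence in the preempt slot may only enable signaling, and thus actually preempt the device, once every fence in the other slots of the same reservation object has signaled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for dma-resv usage slots; illustrative only. */
enum toy_usage {
	TOY_USAGE_KERNEL,
	TOY_USAGE_WRITE,
	TOY_USAGE_READ,
	TOY_USAGE_BOOKKEEP,
	TOY_USAGE_PREEMPT,
};

struct toy_fence {
	bool signaled;
	enum toy_usage usage;
};

#define TOY_MAX_FENCES 8

struct toy_resv {
	struct toy_fence *fences[TOY_MAX_FENCES];
	size_t num;
};

void toy_resv_add(struct toy_resv *resv, struct toy_fence *f)
{
	resv->fences[resv->num++] = f;
}

/*
 * The proposed rule: a preempt fence must not enable signaling until
 * every fence in the *other* slots of the same resv has signaled.
 */
bool toy_preempt_may_enable_signaling(const struct toy_resv *resv)
{
	for (size_t i = 0; i < resv->num; i++)
		if (resv->fences[i]->usage != TOY_USAGE_PREEMPT &&
		    !resv->fences[i]->signaled)
			return false;
	return true;
}
```

Christian's objection above is precisely that the completion fence a preemption depends on may live outside this reservation object (e.g. only in a drm_syncobj), in which case this per-resv rule alone is not enough.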
On Mon, Nov 11, 2024 at 02:42:02PM +0100, Christian König wrote: > Am 09.11.24 um 18:29 schrieb Matthew Brost: > > The motivation for this series comes from pending UMD submission work by > > AMD [1], ARM [3], and the Xe team, who are also beginning to look at > > this. Sima has suggested [4] some common driver preemptive fences and > > semantics, which we all agree on. This is the first attempt to implement > > them, based on Xe's existing long-running preemptive fences. > > > > The semantics are fairly simple: preemptive fences attached to a > > dma-resv must wait to enable signaling (and thus preempt device > > execution) until all fences in other slots have been signaled. These > > semantics are necessary to avoid deadlocks when preempting a device > > while a user submission is pending, and resuming device execution > > depends on the user submission completing. The semantics align with > > Christian's view [5], which I also arrived at independently (with a > > little help from Sima). > > Ah! I was really wondering for a moment why you wanted to add a separate > dma_resv usage for that. But I strongly think that this approach won't work. > > The assumption that completion fences which depend on a preemption fence are > always part of the same dma_resv object is most likely false in some cases. > > At least for the AMD design what can easily happen is that we put a > completion fence only into a drm_syncobj for explicit synchronization and > then engines or different devices which still use kernel submissions depend > on it. That can go boom really trivially. > > What we do instead is to wait for the latest issued completion fence while > holding a mutex to prevent creating new ones before stopping the threads and wrt to a mutex... A current code reference would be nice. I looked at some of the code on the list, and the dma-fencing rules are broken... e.g., this patch [1] takes 'uq_mgr->userq_mutex' in the dma-fencing path. 
This patch [2] takes 'uq_mgr->userq_mutex' and then the dma-resv lock / allocates memory. That is clearly not allowed. Perhaps this is fixed in your current code base though. [1] https://patchwork.freedesktop.org/patch/593210/?series=133345&rev=1 [2] https://patchwork.freedesktop.org/patch/593211/?series=133345&rev=1 > signaling the preemption fence. > Right, that works and is functionally equivalent to the intended purpose here. I left out a key detail: when a user fence is converted into a dma-fence and exported in a syncobj or syncfile, it must also be installed in all of the VM's dma-resv bookkeeping slots (i.e., in the VM's dma-resv and all extobj dma-resv mapped in the VM). Before exporting a dma-fence, the VM must be validated + resumed, and you must hold all dma-resv locks, so the above is trivial. If you're using gpuvm, just call drm_gpuvm_resv_add_fence. I assume AMD has a similarly simple call. Now the ordering works inherently in dma-resv and the scheduler, e.g. no need to track the last completion fences explicitly in preempt fences. I'm pretty sure that if you converted your code to this solution it would work; there are, however, a couple of bugs in this post which I have fixed. This is a common solution based on existing components, which I'd lean towards rather than some new component. Matt > That code could of course be made some driver independent component. > > Regards, > Christian. > > > > Implemented via the new dma-resv slot DMA_RESV_USAGE_PREEMPT, a common > > embedded base preemptive fence class with driver operations, and updates > > to the scheduler to adhere to these semantics. > > > > The current Xe long-running preemptive fences have been converted to the > > new common code, and UMD submission is expected to follow (hopefully) > > shortly thereafter in a subsequent series. > > > > Not intended to be presented as the final solution, but rather to > > kickstart serious discussions on this topic. 
> > > > Matt > > > > [1] https://patchwork.freedesktop.org/series/113675/ > > [2] https://patchwork.freedesktop.org/series/114385/ > > [3] https://patchwork.freedesktop.org/series/137924/ > > [4] https://patchwork.kernel.org/project/dri-devel/cover/20240828172605.19176-1-mihail.atanassov@arm.com/#26011577 > > [5] https://patchwork.kernel.org/project/dri-devel/cover/20240828172605.19176-1-mihail.atanassov@arm.com/#26011853 > > > > Matthew Brost (6): > > dma-resv: Add DMA_RESV_USAGE_PREEMPT > > drm/sched: Teach scheduler about DMA_RESV_USAGE_PREEMPT > > dma-fence: Add dma_fence_preempt base class > > drm/sched: Teach scheduler about dma_fence_prempt type > > drm/xe: Use DMA_RESV_USAGE_PREEMPT for preempt fences > > drm/xe: Use dma_fence_preempt base class > > > > drivers/dma-buf/Makefile | 2 +- > > drivers/dma-buf/dma-fence-preempt.c | 102 ++++++++++++++++++++ > > drivers/dma-buf/dma-resv.c | 43 ++++++--- > > drivers/dma-buf/st-dma-resv.c | 2 +- > > drivers/gpu/drm/scheduler/sched_entity.c | 29 +++++- > > drivers/gpu/drm/scheduler/sched_main.c | 50 +++++++--- > > drivers/gpu/drm/xe/xe_bo.c | 22 +---- > > drivers/gpu/drm/xe/xe_guc_submit.c | 3 + > > drivers/gpu/drm/xe/xe_hw_engine_group.c | 4 +- > > drivers/gpu/drm/xe/xe_migrate.c | 4 +- > > drivers/gpu/drm/xe/xe_preempt_fence.c | 81 +++++----------- > > drivers/gpu/drm/xe/xe_preempt_fence.h | 2 +- > > drivers/gpu/drm/xe/xe_preempt_fence_types.h | 11 +-- > > drivers/gpu/drm/xe/xe_pt.c | 12 +-- > > drivers/gpu/drm/xe/xe_vm.c | 22 +---- > > include/drm/gpu_scheduler.h | 15 +++ > > include/linux/dma-fence-preempt.h | 54 +++++++++++ > > include/linux/dma-fence.h | 1 + > > include/linux/dma-resv.h | 24 +++-- > > 19 files changed, 326 insertions(+), 157 deletions(-) > > create mode 100644 drivers/dma-buf/dma-fence-preempt.c > > create mode 100644 include/linux/dma-fence-preempt.h > > >
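The export rule described above can be sketched as a small toy model (hypothetical types and names, not the real gpuvm/syncobj API): installing the converted dma-fence into the VM's bookkeeping slots alongside the syncobj is what lets a preempt fence with the proposed semantics order after it without any explicit last-fence tracking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct xfence { bool signaled; };

#define XMAX 8

/* Stand-in for the VM's dma-resv bookkeeping slots. */
struct vm_resv {
	struct xfence *bookkeep[XMAX];
	size_t num;
};

/* Stand-in for a drm_syncobj holding the exported fence. */
struct syncobj { struct xfence *fence; };

/*
 * Export path: the converted dma-fence goes into the syncobj *and* into
 * the VM's bookkeeping slots (the drm_gpuvm_resv_add_fence step in the
 * text above).
 */
void export_user_fence(struct vm_resv *resv, struct syncobj *obj,
		       struct xfence *f)
{
	obj->fence = f;
	resv->bookkeep[resv->num++] = f;
}

/*
 * Under the proposed semantics the preempt fence waits on all other
 * slots, so it cannot fire before every exported fence has signaled.
 */
bool preempt_may_fire(const struct vm_resv *resv)
{
	for (size_t i = 0; i < resv->num; i++)
		if (!resv->bookkeep[i]->signaled)
			return false;
	return true;
}
```

The ordering then falls out of dma-resv itself: any consumer that only sees the syncobj still cannot be starved by a premature preemption, because the same fence also blocks the preempt slot.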
On 12.11.24 at 04:29, Matthew Brost wrote: > On Mon, Nov 11, 2024 at 02:42:02PM +0100, Christian König wrote: >> Am 09.11.24 um 18:29 schrieb Matthew Brost: >>> The motivation for this series comes from pending UMD submission work by >>> AMD [1], ARM [3], and the Xe team, who are also beginning to look at >>> this. Sima has suggested [4] some common driver preemptive fences and >>> semantics, which we all agree on. This is the first attempt to implement >>> them, based on Xe's existing long-running preemptive fences. >>> >>> The semantics are fairly simple: preemptive fences attached to a >>> dma-resv must wait to enable signaling (and thus preempt device >>> execution) until all fences in other slots have been signaled. These >>> semantics are necessary to avoid deadlocks when preempting a device >>> while a user submission is pending, and resuming device execution >>> depends on the user submission completing. The semantics align with >>> Christian's view [5], which I also arrived at independently (with a >>> little help from Sima). >> Ah! I was really wondering for a moment why you wanted to add a separate >> dma_resv usage for that. But I strongly think that this approach won't work. >> >> The assumption that completion fences which depend on a preemption fence are >> always part of the same dma_resv object is most likely false in some cases. >> >> At least for the AMD design what can easily happen is that we put a >> completion fence only into a drm_syncobj for explicit synchronization and >> then engines or different devices which still use kernel submissions depend >> on it. That can go boom really trivially. >> >> What we do instead is to wait for the latest issued completion fence while >> holding a mutex to prevent creating new ones before stopping the threads and > wrt to a mutex... > > A current code reference would be nice. I looked some of the code on > list and dma-fencing rules are broken... 
> > e.g., This patch [1], takes 'uq_mgr->userq_mutex' in the dma fencing path. > This patch [2], takes 'uq_mgr->userq_mutex' and then dma-resv lock / > allocates memory. That is clearly not allowed. No, that is unproblematic. The dma_resv is only locked after the preemption fence is already signaled. The problem is currently that we have separated the signaling worker from the resume worker, e.g. the mutex is dropped in between. > > Perhaps this is fixed in your current code base though. > > [1] https://patchwork.freedesktop.org/patch/593210/?series=133345&rev=1 > [2] https://patchwork.freedesktop.org/patch/593211/?series=133345&rev=1 > >> signaling the preemption fence. >> > Right, that works and is functionally equivalent to the intended purpose > here. > > I left out a key detail: when a user fence is converted into a dma-fence > and exported in a syncobj or syncfile, it must also be installed in all > of the VM's dma-resv bookkeeping slots (i.e., in VM's dma-resv and all > extobj dma-resv mapped in the VM). I don't think that this is something we should do. > Before exporting a dma-fence, the VM must be validated + resumed, and > you must hold all dma-resv locks, so the above is trivial. Huh? Why should we hold all the dma-resv locks for that? We only hold the userq_mutex to make sure that no user fence is created while we are in the signaling path of the eviction fence. > If you're using gpuvm, just call drm_gpuvm_resv_add_fence. I assume AMD has a > similarly simple call. Nope, we try to avoid locking all BOs in the VM as hard as we can. > Now the ordering works inherently in dma-resv and the scheduler. e.g. No > need to track the last completion fences explictly in preempt fences. I really don't think that this is a good approach. Explicitly keeping the last completion fence in the pre-emption fence is basically a must-have as far as I can see. The approach you take here looks like a really ugly hack to me. Regards, Christian. 
> > I'm pretty sure if converted your code this solution it would work, > there are however a couple of bugs in this post which I have fixed. > > This is a common solution based on existing components which I'd lean > towards rather than some new component. > > Matt > >> That code could of course be made some driver independent component. >> >> Regards, >> Christian. >> >>> Implemented via the new dma-resv slot DMA_RESV_USAGE_PREEMPT, a common >>> embedded base preemptive fence class with driver operations, and updates >>> to the scheduler to adhere to these semantics. >>> >>> The current Xe long-running preemptive fences have been converted to the >>> new common code, and UMD submission is expected to follow (hopefully) >>> shortly thereafter in a subsequent series. >>> >>> Not intended to be presented as the final solution, but rather to >>> kickstart serious discussions on this topic. >>> >>> Matt >>> >>> [1] https://patchwork.freedesktop.org/series/113675/ >>> [2] https://patchwork.freedesktop.org/series/114385/ >>> [3] https://patchwork.freedesktop.org/series/137924/ >>> [4] https://patchwork.kernel.org/project/dri-devel/cover/20240828172605.19176-1-mihail.atanassov@arm.com/#26011577 >>> [5] https://patchwork.kernel.org/project/dri-devel/cover/20240828172605.19176-1-mihail.atanassov@arm.com/#26011853 >>> >>> Matthew Brost (6): >>> dma-resv: Add DMA_RESV_USAGE_PREEMPT >>> drm/sched: Teach scheduler about DMA_RESV_USAGE_PREEMPT >>> dma-fence: Add dma_fence_preempt base class >>> drm/sched: Teach scheduler about dma_fence_prempt type >>> drm/xe: Use DMA_RESV_USAGE_PREEMPT for preempt fences >>> drm/xe: Use dma_fence_preempt base class >>> >>> drivers/dma-buf/Makefile | 2 +- >>> drivers/dma-buf/dma-fence-preempt.c | 102 ++++++++++++++++++++ >>> drivers/dma-buf/dma-resv.c | 43 ++++++--- >>> drivers/dma-buf/st-dma-resv.c | 2 +- >>> drivers/gpu/drm/scheduler/sched_entity.c | 29 +++++- >>> drivers/gpu/drm/scheduler/sched_main.c | 50 +++++++--- >>> 
drivers/gpu/drm/xe/xe_bo.c | 22 +---- >>> drivers/gpu/drm/xe/xe_guc_submit.c | 3 + >>> drivers/gpu/drm/xe/xe_hw_engine_group.c | 4 +- >>> drivers/gpu/drm/xe/xe_migrate.c | 4 +- >>> drivers/gpu/drm/xe/xe_preempt_fence.c | 81 +++++----------- >>> drivers/gpu/drm/xe/xe_preempt_fence.h | 2 +- >>> drivers/gpu/drm/xe/xe_preempt_fence_types.h | 11 +-- >>> drivers/gpu/drm/xe/xe_pt.c | 12 +-- >>> drivers/gpu/drm/xe/xe_vm.c | 22 +---- >>> include/drm/gpu_scheduler.h | 15 +++ >>> include/linux/dma-fence-preempt.h | 54 +++++++++++ >>> include/linux/dma-fence.h | 1 + >>> include/linux/dma-resv.h | 24 +++-- >>> 19 files changed, 326 insertions(+), 157 deletions(-) >>> create mode 100644 drivers/dma-buf/dma-fence-preempt.c >>> create mode 100644 include/linux/dma-fence-preempt.h >>>
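For comparison, the AMD-style scheme Christian describes can be sketched as a single-threaded toy model (hypothetical names, loosely following 'uq_mgr->userq_mutex' from the patches cited above; the real code blocks on a mutex rather than returning a status): a lock serializes completion-fence creation against the preemption path, which waits for the latest issued completion fence before signaling the preemption fence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct cfence { bool signaled; };

/* Single-threaded stand-ins for the pieces named in the thread. */
struct userq_mgr {
	bool mutex_held;                /* models uq_mgr->userq_mutex */
	struct cfence *last_completion; /* latest issued completion fence */
	struct cfence preempt;          /* the preemption/eviction fence */
};

/*
 * New completion fences may only be issued while no preemption is in
 * flight, i.e. while the mutex is free.
 */
bool issue_completion(struct userq_mgr *mgr, struct cfence *f)
{
	if (mgr->mutex_held)
		return false;
	f->signaled = false;
	mgr->last_completion = f;
	return true;
}

/*
 * Preemption path: take the mutex, wait for the latest completion
 * fence, stop the queues, then signal the preemption fence. Returns
 * false where real code would block waiting on the completion fence.
 */
bool try_preempt(struct userq_mgr *mgr)
{
	mgr->mutex_held = true;
	if (mgr->last_completion && !mgr->last_completion->signaled) {
		mgr->mutex_held = false; /* give up and retry later */
		return false;
	}
	mgr->preempt.signaled = true;
	mgr->mutex_held = false;
	return true;
}
```

The trade-off debated in the thread is visible here: this version needs no dma-resv bookkeeping of exported fences, but it must explicitly track the last completion fence and serialize fence creation against the signaling path.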
On Tue, Nov 12, 2024 at 12:09:52PM +0100, Christian König wrote: > Am 12.11.24 um 04:29 schrieb Matthew Brost: > > On Mon, Nov 11, 2024 at 02:42:02PM +0100, Christian König wrote: > > > Am 09.11.24 um 18:29 schrieb Matthew Brost: > > > > The motivation for this series comes from pending UMD submission work by > > > > AMD [1], ARM [3], and the Xe team, who are also beginning to look at > > > > this. Sima has suggested [4] some common driver preemptive fences and > > > > semantics, which we all agree on. This is the first attempt to implement > > > > them, based on Xe's existing long-running preemptive fences. > > > > > > > > The semantics are fairly simple: preemptive fences attached to a > > > > dma-resv must wait to enable signaling (and thus preempt device > > > > execution) until all fences in other slots have been signaled. These > > > > semantics are necessary to avoid deadlocks when preempting a device > > > > while a user submission is pending, and resuming device execution > > > > depends on the user submission completing. The semantics align with > > > > Christian's view [5], which I also arrived at independently (with a > > > > little help from Sima). > > > Ah! I was really wondering for a moment why you wanted to add a separate > > > dma_resv usage for that. But I strongly think that this approach won't work. > > > > > > The assumption that completion fences which depend on a preemption fence are > > > always part of the same dma_resv object is most likely false in some cases. > > > > > > At least for the AMD design what can easily happen is that we put a > > > completion fence only into a drm_syncobj for explicit synchronization and > > > then engines or different devices which still use kernel submissions depend > > > on it. That can go boom really trivially. > > > > > > What we do instead is to wait for the latest issued completion fence while > > > holding a mutex to prevent creating new ones before stopping the threads and > > wrt to a mutex... 
> > > > A current code reference would be nice. I looked some of the code on > > list and dma-fencing rules are broken... > > > > e.g., This patch [1], takes 'uq_mgr->userq_mutex' in the dma fencing path. > > This patch [2], takes 'uq_mgr->userq_mutex' and then dma-resv lock / > > allocates memory. That is clearly not allowed. > > No that is unproblematic. The dma_resv is only locked after the preemption > fence is already signaled. > That's kinda cheating, fragile, and a pretty poor practice IMO. Your driver though. > The problem is currently that we have separated the signaling worker from > the resume worker. E.g. the mutex is dropped in between. > Again, can you share any code refs? See below, will share work shortly. > > > > Perhaps this is fixed in your current code base though. > > > > [1] https://patchwork.freedesktop.org/patch/593210/?series=133345&rev=1 > > [2] https://patchwork.freedesktop.org/patch/593211/?series=133345&rev=1 > > > > > signaling the preemption fence. > > > > > Right, that works and is functionally equivalent to the intended purpose > > here. > > > > I left out a key detail: when a user fence is converted into a dma-fence > > and exported in a syncobj or syncfile, it must also be installed in all > > of the VM's dma-resv bookkeeping slots (i.e., in VM's dma-resv and all > > extobj dma-resv mapped in the VM). > > I don't think that this is something we should do. > > > Before exporting a dma-fence, the VM must be validated + resumed, and > > you must hold all dma-resv locks, so the above is trivial. > > Hui? Why should we hold all the dma-resv locks for that? > > We only hold the userq_mutex to make sure that no user fence is created > while we are in the signaling path of the eviction fence. 
> I reason that the user fence -> DMA fence IOCTL (and vice versa) is essentially another version of the rebind worker, which also performs the fence transformation and installs fences in the preempt slots to guard against unwanted preemption while a DMA fence has been exported. It seems to work quite nicely. > > If you're using gpuvm, just call drm_gpuvm_resv_add_fence. I assume AMD has a > > similarly simple call. > > Nope, we try to avoid locking all BOs in the VM as hard as we can. > Why? Calling in to perform fence conversion shouldn't be all that frequent and simplifies things. Also, it's likely that only a few locks are involved, as not too many external BOs are mapped within a VM (i.e., most BOs share the VM's dma-resv lock). > > Now the ordering works inherently in dma-resv and the scheduler. e.g. No > > need to track the last completion fences explictly in preempt fences. > > I really don't think that this is a good approach. Explicitly keeping the > last completion fence in the pre-emption fence is basically a must have as > far as I can see. > > The approach you take here looks like a really ugly hack to me. > Well, I have to disagree; it seems like a pretty solid, common design. Anyway, I think I have this more or less working. I want to run this by the Mesa team a bit to ensure I haven't missed anything, and will likely post something shortly after. We can discuss this more after I post and perhaps solicit other opinions, weighing the pros and cons of the approaches here. I do think they function roughly the same, so something commonly agreed upon would be good. Sharing a bit of code, if possible, is always a plus too. Matt > Regards, > Christian. > > > > > I'm pretty sure if converted your code this solution it would work, > > there are however a couple of bugs in this post which I have fixed. > > > > This is a common solution based on existing components which I'd lean > > towards rather than some new component. 
> > > > Matt > > > > > That code could of course be made some driver independent component. > > > > > > Regards, > > > Christian. > > > > > > > Implemented via the new dma-resv slot DMA_RESV_USAGE_PREEMPT, a common > > > > embedded base preemptive fence class with driver operations, and updates > > > > to the scheduler to adhere to these semantics. > > > > > > > > The current Xe long-running preemptive fences have been converted to the > > > > new common code, and UMD submission is expected to follow (hopefully) > > > > shortly thereafter in a subsequent series. > > > > > > > > Not intended to be presented as the final solution, but rather to > > > > kickstart serious discussions on this topic. > > > > > > > > Matt > > > > > > > > [1] https://patchwork.freedesktop.org/series/113675/ > > > > [2] https://patchwork.freedesktop.org/series/114385/ > > > > [3] https://patchwork.freedesktop.org/series/137924/ > > > > [4] https://patchwork.kernel.org/project/dri-devel/cover/20240828172605.19176-1-mihail.atanassov@arm.com/#26011577 > > > > [5] https://patchwork.kernel.org/project/dri-devel/cover/20240828172605.19176-1-mihail.atanassov@arm.com/#26011853 > > > > > > > > Matthew Brost (6): > > > > dma-resv: Add DMA_RESV_USAGE_PREEMPT > > > > drm/sched: Teach scheduler about DMA_RESV_USAGE_PREEMPT > > > > dma-fence: Add dma_fence_preempt base class > > > > drm/sched: Teach scheduler about dma_fence_prempt type > > > > drm/xe: Use DMA_RESV_USAGE_PREEMPT for preempt fences > > > > drm/xe: Use dma_fence_preempt base class > > > > > > > > drivers/dma-buf/Makefile | 2 +- > > > > drivers/dma-buf/dma-fence-preempt.c | 102 ++++++++++++++++++++ > > > > drivers/dma-buf/dma-resv.c | 43 ++++++--- > > > > drivers/dma-buf/st-dma-resv.c | 2 +- > > > > drivers/gpu/drm/scheduler/sched_entity.c | 29 +++++- > > > > drivers/gpu/drm/scheduler/sched_main.c | 50 +++++++--- > > > > drivers/gpu/drm/xe/xe_bo.c | 22 +---- > > > > drivers/gpu/drm/xe/xe_guc_submit.c | 3 + > > > > 
drivers/gpu/drm/xe/xe_hw_engine_group.c | 4 +- > > > > drivers/gpu/drm/xe/xe_migrate.c | 4 +- > > > > drivers/gpu/drm/xe/xe_preempt_fence.c | 81 +++++----------- > > > > drivers/gpu/drm/xe/xe_preempt_fence.h | 2 +- > > > > drivers/gpu/drm/xe/xe_preempt_fence_types.h | 11 +-- > > > > drivers/gpu/drm/xe/xe_pt.c | 12 +-- > > > > drivers/gpu/drm/xe/xe_vm.c | 22 +---- > > > > include/drm/gpu_scheduler.h | 15 +++ > > > > include/linux/dma-fence-preempt.h | 54 +++++++++++ > > > > include/linux/dma-fence.h | 1 + > > > > include/linux/dma-resv.h | 24 +++-- > > > > 19 files changed, 326 insertions(+), 157 deletions(-) > > > > create mode 100644 drivers/dma-buf/dma-fence-preempt.c > > > > create mode 100644 include/linux/dma-fence-preempt.h > > > > >
On Tue, Nov 12, 2024 at 06:27:31PM -0800, Matthew Brost wrote: > On Tue, Nov 12, 2024 at 12:09:52PM +0100, Christian König wrote: > > Am 12.11.24 um 04:29 schrieb Matthew Brost: > > > On Mon, Nov 11, 2024 at 02:42:02PM +0100, Christian König wrote: > > > > Am 09.11.24 um 18:29 schrieb Matthew Brost: > > > > > The motivation for this series comes from pending UMD submission work by > > > > > AMD [1], ARM [3], and the Xe team, who are also beginning to look at > > > > > this. Sima has suggested [4] some common driver preemptive fences and > > > > > semantics, which we all agree on. This is the first attempt to implement > > > > > them, based on Xe's existing long-running preemptive fences. > > > > > > > > > > The semantics are fairly simple: preemptive fences attached to a > > > > > dma-resv must wait to enable signaling (and thus preempt device > > > > > execution) until all fences in other slots have been signaled. These > > > > > semantics are necessary to avoid deadlocks when preempting a device > > > > > while a user submission is pending, and resuming device execution > > > > > depends on the user submission completing. The semantics align with > > > > > Christian's view [5], which I also arrived at independently (with a > > > > > little help from Sima). > > > > Ah! I was really wondering for a moment why you wanted to add a separate > > > > dma_resv usage for that. But I strongly think that this approach won't work. > > > > > > > > The assumption that completion fences which depend on a preemption fence are > > > > always part of the same dma_resv object is most likely false in some cases. > > > > > > > > At least for the AMD design what can easily happen is that we put a > > > > completion fence only into a drm_syncobj for explicit synchronization and > > > > then engines or different devices which still use kernel submissions depend > > > > on it. That can go boom really trivially. 
> > > > > > > > What we do instead is to wait for the latest issued completion fence while > > > > holding a mutex to prevent creating new ones before stopping the threads and > > > wrt to a mutex... > > > > > > A current code reference would be nice. I looked some of the code on > > > list and dma-fencing rules are broken... > > > > > > e.g., This patch [1], takes 'uq_mgr->userq_mutex' in the dma fencing path. > > > This patch [2], takes 'uq_mgr->userq_mutex' and then dma-resv lock / > > > allocates memory. That is clearly not allowed. > > > > No that is unproblematic. The dma_resv is only locked after the preemption > > fence is already signaled. > > > > That's kinda cheating, fragile, and a pretty poor practice IMO. Your > driver though. > > > The problem is currently that we have separated the signaling worker from > > the resume worker. E.g. the mutex is dropped in between. > > > > Again, can you share any code refs? See below, will share work shortly. > > > > > > > Perhaps this is fixed in your current code base though. > > > > > > [1] https://patchwork.freedesktop.org/patch/593210/?series=133345&rev=1 > > > [2] https://patchwork.freedesktop.org/patch/593211/?series=133345&rev=1 > > > > > > > signaling the preemption fence. > > > > > > > Right, that works and is functionally equivalent to the intended purpose > > > here. > > > > > > I left out a key detail: when a user fence is converted into a dma-fence > > > and exported in a syncobj or syncfile, it must also be installed in all > > > of the VM's dma-resv bookkeeping slots (i.e., in VM's dma-resv and all > > > extobj dma-resv mapped in the VM). > > > > I don't think that this is something we should do. > > > > > Before exporting a dma-fence, the VM must be validated + resumed, and > > > you must hold all dma-resv locks, so the above is trivial. > > > > Hui? Why should we hold all the dma-resv locks for that? 
> > > > We only hold the userq_mutex to make sure that no user fence is created > > while we are in the signaling path of the eviction fence. > > > > I reason that the user fence -> DMA fence IOCTL (and vice versa) is > essentially another version of the rebind worker, which also performs > the fence transformation and installs fences in the preempt slots to > guard against unwanted preemption while a DMA fence has been exported. > It seems to work quite nicely. > Ugh, typos. s/installs fences in the preempt slots/installs dma-fences in the bookkeep slots/ Matt > > > If you're using gpuvm, just call drm_gpuvm_resv_add_fence. I assume AMD has a > > > similarly simple call. > > > > Nope, we try to avoid locking all BOs in the VM as hard as we can. > > > > Why? Calling in to perform fence conversion shouldn't be all that > frequent and simplifies things. > > Also, it's likely that only a few locks are involved, as not too many > external BOs are mapped within a VM (i.e., most BOs share the VM's > dma-resv lock). > > > > Now the ordering works inherently in dma-resv and the scheduler. e.g. No > > > need to track the last completion fences explictly in preempt fences. > > > > I really don't think that this is a good approach. Explicitly keeping the > > last completion fence in the pre-emption fence is basically a must have as > > far as I can see. > > > > The approach you take here looks like a really ugly hack to me. > > > > Well, I have to disagree; it seems like a pretty solid, common design. > > Anyway, I think I have this more or less working. I want to run this by > the Mesa team a bit to ensure I haven't missed anything, and will likely > post something shortly after. > > We can discuss this more after I post and perhaps solicit other > opinions, weighing the pros and cons of the approaches here. I do think > they function roughly the same, so something commonly agreed upon would > be good. Sharing a bit of code, if possible, is always a plus too. 
> > Matt > > > Regards, > > Christian. > > > > > > > > I'm pretty sure if converted your code this solution it would work, > > > there are however a couple of bugs in this post which I have fixed. > > > > > > This is a common solution based on existing components which I'd lean > > > towards rather than some new component. > > > > > > Matt > > > > > > > That code could of course be made some driver independent component. > > > > > > > > Regards, > > > > Christian. > > > > > > > > > Implemented via the new dma-resv slot DMA_RESV_USAGE_PREEMPT, a common > > > > > embedded base preemptive fence class with driver operations, and updates > > > > > to the scheduler to adhere to these semantics. > > > > > > > > > > The current Xe long-running preemptive fences have been converted to the > > > > > new common code, and UMD submission is expected to follow (hopefully) > > > > > shortly thereafter in a subsequent series. > > > > > > > > > > Not intended to be presented as the final solution, but rather to > > > > > kickstart serious discussions on this topic. 
> > > > > Matt
> > > > >
> > > > > [1] https://patchwork.freedesktop.org/series/113675/
> > > > > [2] https://patchwork.freedesktop.org/series/114385/
> > > > > [3] https://patchwork.freedesktop.org/series/137924/
> > > > > [4] https://patchwork.kernel.org/project/dri-devel/cover/20240828172605.19176-1-mihail.atanassov@arm.com/#26011577
> > > > > [5] https://patchwork.kernel.org/project/dri-devel/cover/20240828172605.19176-1-mihail.atanassov@arm.com/#26011853
> > > > >
> > > > > Matthew Brost (6):
> > > > >   dma-resv: Add DMA_RESV_USAGE_PREEMPT
> > > > >   drm/sched: Teach scheduler about DMA_RESV_USAGE_PREEMPT
> > > > >   dma-fence: Add dma_fence_preempt base class
> > > > >   drm/sched: Teach scheduler about dma_fence_prempt type
> > > > >   drm/xe: Use DMA_RESV_USAGE_PREEMPT for preempt fences
> > > > >   drm/xe: Use dma_fence_preempt base class
> > > > >
> > > > >  drivers/dma-buf/Makefile                    |   2 +-
> > > > >  drivers/dma-buf/dma-fence-preempt.c         | 102 ++++++++++++++++++++
> > > > >  drivers/dma-buf/dma-resv.c                  |  43 ++++++---
> > > > >  drivers/dma-buf/st-dma-resv.c               |   2 +-
> > > > >  drivers/gpu/drm/scheduler/sched_entity.c    |  29 +++++-
> > > > >  drivers/gpu/drm/scheduler/sched_main.c      |  50 +++++++---
> > > > >  drivers/gpu/drm/xe/xe_bo.c                  |  22 +----
> > > > >  drivers/gpu/drm/xe/xe_guc_submit.c          |   3 +
> > > > >  drivers/gpu/drm/xe/xe_hw_engine_group.c     |   4 +-
> > > > >  drivers/gpu/drm/xe/xe_migrate.c             |   4 +-
> > > > >  drivers/gpu/drm/xe/xe_preempt_fence.c       |  81 +++++-----
> > > > >  drivers/gpu/drm/xe/xe_preempt_fence.h       |   2 +-
> > > > >  drivers/gpu/drm/xe/xe_preempt_fence_types.h |  11 +--
> > > > >  drivers/gpu/drm/xe/xe_pt.c                  |  12 +--
> > > > >  drivers/gpu/drm/xe/xe_vm.c                  |  22 +----
> > > > >  include/drm/gpu_scheduler.h                 |  15 +++
> > > > >  include/linux/dma-fence-preempt.h           |  54 +++++++++++
> > > > >  include/linux/dma-fence.h                   |   1 +
> > > > >  include/linux/dma-resv.h                    |  24 +++--
> > > > >  19 files changed, 326 insertions(+), 157 deletions(-)
> > > > >  create mode 100644 drivers/dma-buf/dma-fence-preempt.c
> > > > >  create mode 100644 include/linux/dma-fence-preempt.h
> > > > >
> > >
Am 13.11.24 um 03:30 schrieb Matthew Brost:
> [SNIP]
>>>> If you're using gpuvm, just call drm_gpuvm_resv_add_fence. I assume AMD has a
>>>> similarly simple call.
>>> Nope, we try to avoid locking all BOs in the VM as hard as we can.
>>>
>> Why? Calling in to perform fence conversion shouldn't be all that
>> frequent and simplifies things.
>>
>> Also, it's likely that only a few locks are involved, as not too many
>> external BOs are mapped within a VM (i.e., most BOs share the VM's
>> dma-resv lock).

The most common use case is multi-GPU systems which share a lot of data in a
NUMA cluster.

This configuration has almost all BOs shared between GPUs, making locking the
whole VM a task with massive overhead which should be avoided as much as
possible.

>>>> Now the ordering works inherently in dma-resv and the scheduler, e.g. no
>>>> need to track the last completion fences explicitly in preempt fences.
>>> I really don't think that this is a good approach. Explicitly keeping the
>>> last completion fence in the pre-emption fence is basically a must-have as
>>> far as I can see.
>>>
>>> The approach you take here looks like a really ugly hack to me.
>>>
>> Well, I have to disagree; it seems like a pretty solid, common design.

What you basically do is move the responsibility for signaling fences in the
right order from the provider of the fences to the consumer of them. Since we
have tons of consumers of that stuff, this is not even remotely a defensive
design.

>>
>> Anyway, I think I have this more or less working. I want to run this by
>> the Mesa team a bit to ensure I haven't missed anything, and will likely
>> post something shortly after.
>>
>> We can discuss this more after I post and perhaps solicit other
>> opinions, weighing the pros and cons of the approaches here. I do think
>> they function roughly the same, so something commonly agreed upon would
>> be good. Sharing a bit of code, if possible, is always a plus too.
Well, to make it clear, that will never ever get a green light from my side as
DMA-buf maintainer. What you suggest here is extremely fragile.

Why not simply wait for the pending completion fences as a dependency for
signaling preemption fences?

That should work for all drivers and is trivial to implement as far as I can
see.

Regards,
Christian.

>>
>> Matt
>>
>>> Regards,
>>> Christian.
>>>
On Wed, Nov 13, 2024 at 10:02:12AM +0100, Christian König wrote:
> Am 13.11.24 um 03:30 schrieb Matthew Brost:
> > [SNIP]
> > > > > If you're using gpuvm, just call drm_gpuvm_resv_add_fence. I assume AMD has a
> > > > > similarly simple call.
> > > > Nope, we try to avoid locking all BOs in the VM as hard as we can.
> > > >
> > > Why? Calling in to perform fence conversion shouldn't be all that
> > > frequent and simplifies things.
> > >
> > > Also, it's likely that only a few locks are involved, as not too many
> > > external BOs are mapped within a VM (i.e., most BOs share the VM's
> > > dma-resv lock).
> 
> The most common use case are multi GPU systems which share a lot of data in
> a NUMA cluster.
> 
> This configuration has almost all BOs shared between GPUs making locking the
> whole VM a task with massive overhead which should be avoided as much as
> possible.
> 

Let me look into use cases on our end.

> > > > > Now the ordering works inherently in dma-resv and the scheduler. e.g. No
> > > > > need to track the last completion fences explictly in preempt fences.
> > > > I really don't think that this is a good approach. Explicitly keeping the
> > > > last completion fence in the pre-emption fence is basically a must have as
> > > > far as I can see.
> > > >
> > > > The approach you take here looks like a really ugly hack to me.
> > > >
> > > Well, I have to disagree; it seems like a pretty solid, common design.
> 
> What you basically do is to move the responsibility to signal fences in the
> right order from the provider of the fences to the consumer of it.
> 

Are there not ordering rules already built into dma-resv? This is just
extending those to preempt fences.

I can maybe buy some of this argument, as if a random yahoo enables
signaling a preempt fence without properly going through dma-resv or the
scheduler, then yes, that could break things. But don't do that.
In Xe, we have exactly two places where these can be triggered: in the TTM BO
move vfunc (which the scheduler handles) and in the MMU invalidation path
(dma-resv). I would expect all drivers and users of dma-resv and the
scheduler with preempt fences to use the proper interfaces to signal preempt
fences, rather than bypassing them, thus ensuring the common rules for
preempt fences are adhered to.

> Since we have tons of consumers of that stuff this is not even remotely a
> defensive design.
> 

I can consider other designs, but I do think larger community input may
be required, as you mentioned there are several consumers of this code.

> > 
> > Anyway, I think I have this more or less working. I want to run this by
> > the Mesa team a bit to ensure I haven't missed anything, and will likely
> > post something shortly after.
> > 
> > We can discuss this more after I post and perhaps solicit other
> > opinions, weighing the pros and cons of the approaches here. I do think
> > they function roughly the same, so something commonly agreed upon would
> > be good. Sharing a bit of code, if possible, is always a plus too.
> 
> Well to make it clear that will never ever get a green light from my side as
> DMA-buf maintainer. What you suggest here is extremely fragile.
> 

It is unfortunate that this is your position, and I do feel it is a bit
premature to suggest as much. I didn’t know being a maintainer was akin
to being a dictator. As I’ve said multiple times, I feel this warrants a
bit more discussion with more stakeholders. If this is unacceptable,
sure, I can change it.

> Why not simply wait for the pending completion fences as dependency for
> signaling preemption fences?
> 
> That should work for all drivers and is trivial to implement as far as I can
> see.

Again, this is hard to understand without a clear picture of what AMD is
doing.
I pointed out a dma-fencing problem in the code on the list, and the response
was, "Well, we have some magic ordering rules that make it safe." That might
be true, but if you annotated your code, lockdep would complain. Anything
that cannot be annotated is a non-starter for me. This makes me nervous that,
in fact, it is not as trivial as you suggest, nor is the design as sound as
you believe.

Anyways, I'll still likely post a common solution with annotations. If it is
rejected, so be it, and I will rework it.

In the spirit of open development, here is my code in a public branch:

Kernel: https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-umd-submission/-/tree/v2-11-13-24?ref_type=heads
IGT: https://gitlab.freedesktop.org/mbrost/igt-gpu-tools-umd-submission/-/tree/umd_submission.v2?ref_type=heads

Matt

> 
> Regards,
> Christian.
> 
> > 
> > Matt
> > 
> > > Regards,
> > > Christian.
> > >
Am 13.11.24 um 16:34 schrieb Matthew Brost:
>>>>>> Now the ordering works inherently in dma-resv and the scheduler. e.g. No
>>>>>> need to track the last completion fences explictly in preempt fences.
>>>>> I really don't think that this is a good approach. Explicitly keeping the
>>>>> last completion fence in the pre-emption fence is basically a must have as
>>>>> far as I can see.
>>>>>
>>>>> The approach you take here looks like a really ugly hack to me.
>>>>>
>>>> Well, I have to disagree; it seems like a pretty solid, common design.
>> What you basically do is to move the responsibility to signal fences in the
>> right order from the provider of the fences to the consumer of it.
>>
> Are there not ordering rules already built into dma-resv? This is just
> extending those preempt fences.

Well, the usage flags are to distinguish which fences should be queried for
which use case.

The order the fences are used and returned in is completely undetermined.
E.g. currently you can get READ and WRITE fences all mixed together.

> I can maybe buy some of this argument, as if a random yahoo enables
> signaling a preempt fence without properly going through dma-resv or the
> scheduler, then yes, that could break things. But don't do that. In Xe,
> we have exactly two places where these can be triggered: in the TTM BO
> move vfunc (which the scheduler handles) and in the MMU invalidation
> path (dma-resv).

Yeah, but the purpose of all this is that the dma-resv object is shareable
between device drivers.

That one device driver comes up with a new requirement is certainly
possible, but then we need to make sure that all others can live with that
as well.

Just saying that we limit our scope to just the requirements of one driver
because others are never supposed to see this fence is a recipe for
problems.
> I would expect all drivers and users of dma-resv and the scheduler with
> preempt fences to use the proper interfaces to signal preempt fences,
> rather than bypassing this thus ensuring the common rules for preempt
> fences are adhered to.

Waiting for fences in any order is part of the design and a requirement of
some consumers.

See nouveau_fence_sync() for an example of what is usually done; radeon has
similar requirements.

But those approaches are for optimization purposes only and not for
correctness.

That a driver says "my fences must be waited on first A, then B" is
something completely new.

>> Since we have tons of consumers of that stuff this is not even remotely a
>> defensive design.
> I can consider other designs, but I do think larger community input may
> be required, as you mentioned there are several consumers of this code.

Each fence is an independent object without dependencies on anything else.
Imposing some order on consumers of dma_fences is clearly going against
that.

>>>> Anyway, I think I have this more or less working. I want to run this by
>>>> the Mesa team a bit to ensure I haven't missed anything, and will likely
>>>> post something shortly after.
>>>>
>>>> We can discuss this more after I post and perhaps solicit other
>>>> opinions, weighing the pros and cons of the approaches here. I do think
>>>> they function roughly the same, so something commonly agreed upon would
>>>> be good. Sharing a bit of code, if possible, is always a plus too.
>> Well to make it clear that will never ever get a green light from my side as
>> DMA-buf maintainer. What you suggest here is extremely fragile.
>>
> It is unfortunate that this is your position, and I do feel it is a bit
> premature to suggest as much. I didn’t know being a maintainer was akin
> to being a dictator. As I’ve said multiple times, I feel this warrants a
> bit more discussion with more stakeholders. If this is unacceptable,
> sure, I can change it.
I'm sorry you feel like that; it's really not my intention to dictate
anything. I have probably over-reacted.

It's just that, to me, the solution suggested here is so far off that I can't
really understand how you came up with it. I perfectly understand that you
must have some reason for it, I just don't see it.

Maybe we should concentrate on understanding that reasoning first instead of
discussing a possible solution.

>> Why not simply wait for the pending completion fences as dependency for
>> signaling preemption fences?
>>
>> That should work for all drivers and is trivial to implement as far as I can
>> see.
> Again, this is hard to understand without a clear picture of what AMD is
> doing. I pointed out a dma-fencing problem in the code on the list, and
> the response was, "Well, we have some magic ordering rules that make it
> safe." That might be true, but if you annotated your code, lockdep would
> complain. Anything that cannot be annotated is a non-starter for me.
> This makes me nervous that, in fact, it is not as trivial as you
> suggest, nor is the design as sound as you believe.

I'm pretty sure that the code is not even remotely bug-free. The design and
code have been under development for the last 16 months or so and are now
being published bit by bit.

We completely missed, both during internal review and testing, that lockdep
should complain about it, and I'm actually not sure why it doesn't.

The full code should land in amd-staging-drm-next in the next few
days/weeks; I can give you a detailed date later today.

> Anyways, I'll still likely post a common solution with annotations. If
> it is rejected, so be it, and I will rework it.

Well, that sounds great. But let us discuss what this solution looks like
instead of jumping ahead and implementing something.

Regards,
Christian.
> 
> In spirit of open development here is my code in a public branch:
> 
> Kernel: https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-umd-submission/-/tree/v2-11-13-24?ref_type=heads
> IGT: https://gitlab.freedesktop.org/mbrost/igt-gpu-tools-umd-submission/-/tree/umd_submission.v2?ref_type=heads
> 
> Matt
> 
>> Regards,
>> Christian.
>>
>>>> Matt
>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
On Thu, Nov 14, 2024 at 09:38:43AM +0100, Christian König wrote:
> Am 13.11.24 um 16:34 schrieb Matthew Brost:
> > > > > > > Now the ordering works inherently in dma-resv and the scheduler. e.g. No
> > > > > > > need to track the last completion fences explictly in preempt fences.
> > > > > > I really don't think that this is a good approach. Explicitly keeping the
> > > > > > last completion fence in the pre-emption fence is basically a must have as
> > > > > > far as I can see.
> > > > > > 
> > > > > > The approach you take here looks like a really ugly hack to me.
> > > > > > 
> > > > > Well, I have to disagree; it seems like a pretty solid, common design.
> > > What you basically do is to move the responsibility to signal fences in the
> > > right order from the provider of the fences to the consumer of it.
> > > 
> > Are there not ordering rules already built into dma-resv? This is just
> > extending those preempt fences.
> 
> Well, the usage flags are to distinct which fences should be queried for
> which use case.
> 
> The order the fence are used and returned is completely undetermined. E.g.
> currently you can get READ, WRITE fences all mixed together.
> 

I was thinking there was an order here, but yeah, looking now, that does
appear to be an oversight on my end.

> > I can maybe buy some of this argument, as if a random yahoo enables
> > signaling a preempt fence without properly going through dma-resv or the
> > scheduler, then yes, that could break things. But don't do that. In Xe,
> > we have exactly two places where these can be triggered: in the TTM BO
> > move vfunc (which the scheduler handles) and in the MMU invalidation
> > path (dma-resv).
> 
> Yeah, but the purpose of all this is that the dma-resv object is shareable
> between device drivers.
> 

Agree.

> That one device driver comes up with a new requirement is certainly
> possible, but then we need to make sure that all others can live with that
> as well.
> 

Also agree.
> Just saying that we limit our scope to just the requirements of one driver
> because other are never supposed to see this fence is a recipe for problems.
> 
> > I would expect all drivers and users of dma-resv and the scheduler with
> > preempt fences to use the proper interfaces to signal preempt fences,
> > rather than bypassing this thus ensuring the common rules for preempt
> > fences are adhered to.
> 
> Waiting for fences in any order is part of the design and a requirement by
> some consumers.
> 
> See nouveau_fence_sync() for an example of what is usually done, radeon has
> similar requirements.
> 
> But those approaches are here for optimization purposes only and not for
> correctness.
> 
> That a driver says "My fences must be waited on first A, then B" is
> something completely new.
> 

Ok, fair enough.

> > > Since we have tons of consumers of that stuff this is not even remotely a
> > > defensive design.
> > I can consider other designs, but I do think larger community input may
> > be required, as you mentioned there are several consumers of this code.
> 
> Each fence is an independent object without dependencies on anything else.
> Imposing some order on consumers of dma_fences is clearly going against
> that.
> 
> > > > > Anyway, I think I have this more or less working. I want to run this by
> > > > > the Mesa team a bit to ensure I haven't missed anything, and will likely
> > > > > post something shortly after.
> > > > > 
> > > > > We can discuss this more after I post and perhaps solicit other
> > > > > opinions, weighing the pros and cons of the approaches here. I do think
> > > > > they function roughly the same, so something commonly agreed upon would
> > > > > be good. Sharing a bit of code, if possible, is always a plus too.
> > > Well to make it clear that will never ever get a green light from my side as
> > > DMA-buf maintainer. What you suggest here is extremely fragile.
> > > 
> > It is unfortunate that this is your position, and I do feel it is a bit
> > premature to suggest as much. I didn’t know being a maintainer was akin
> > to being a dictator. As I’ve said multiple times, I feel this warrants a
> > bit more discussion with more stakeholders. If this is unacceptable,
> > sure, I can change it.
> 
> I'm sorry when you feel like that, it's really not my intention to dictate
> anything. I have probably over-reacted.
> 
> It's just to me suggesting this solution is so far of that I can't really
> understand how you came up with that.
> 
> I perfectly understand that you must have some reason for it, I just don't
> see it.
> 
> Maybe we should concentrate on understanding those reasoning first instead
> of discussing a possible solution.
> 
> > > Why not simply wait for the pending completion fences as dependency for
> > > signaling preemption fences?
> > > 
> > > That should work for all drivers and is trivial to implement as far as I can
> > > see.
> > Again, this is hard to understand without a clear picture of what AMD is
> > doing. I pointed out a dma-fencing problem in the code on the list, and
> > the response was, "Well, we have some magic ordering rules that make it
> > safe." That might be true, but if you annotated your code, lockdep would
> > complain. Anything that cannot be annotated is a non-starter for me.
> > This makes me nervous that, in fact, it is not as trivial as you
> > suggest, nor is the design as sound as you believe.
> 
> I'm pretty sure that the code is not even remotely bug free. The design and
> code has been under development for the last 16 month or so and is now
> published bit by bit.
> 
> We completely missed both during internal review as well as testing that
> lockdep should complain about it and I'm actually not sure why it doesn't.
> 
> The full code should land in amd-staging-drm-next in the next few
> days/weeks, I can give you a detailed date later today.
> 

Please let me know when this happens.
I will take a pass at reviewing. We have had preemption fences in Xe for LR
compute for 2+ years now and can help spot if you missed something. It took
us a while to get these right.

> > Anyways, I'll still likely post a common solution with annotations. If
> > it is rejected, so be it, and I will rework it.
> 
> Well that sounds great. But let us discuss what this solution looks like
> instead of jumping ahead and implementing something.
> 

Sure. FWIW, I changed my code to wait on the last fence and it was quite easy
to do, and it functioned the same. So my reasoning that new slots and waiting
on the last fence function roughly the same was correct. While there are some
properties of new slots I like, I'll just drop the new slots idea as it seems
contentious.

Matt

> Regards,
> Christian.
> 
> > 
> > In spirit of open development here is my code in a public branch:
> > 
> > Kernel: https://gitlab.freedesktop.org/mbrost/xe-kernel-driver-umd-submission/-/tree/v2-11-13-24?ref_type=heads
> > IGT: https://gitlab.freedesktop.org/mbrost/igt-gpu-tools-umd-submission/-/tree/umd_submission.v2?ref_type=heads
> > 
> > Matt
> > 
> > > Regards,
> > > Christian.
> > > 
> > > > > Matt
> > > > > 
> > > > > > Regards,
> > > > > > Christian.
> > > > > >