mbox series

[0/5] Rework amdgpu HW fence refocunt and update scheduler parent fence refcount.

Message ID 20220620220302.86389-1-andrey.grodzovsky@amd.com (mailing list archive)
Headers show
Series Rework amdgpu HW fence refocunt and update scheduler parent fence refcount. | expand

Message

Andrey Grodzovsky June 20, 2022, 10:02 p.m. UTC
Yiqing raised a problem of negative fence refcount for resubmitted jobs
in amdgpu and suggested a workaround in [1]. I took  a look myself and discovered
some deeper problems both in amdgpu and scheduler code.

Yiqing helped with testing the new code and also drew a detailed refcount and flow
tracing diagram for parent (HW) fence life cycle and refcount under various
cases for the proposed patchset at [2].

[1] - https://lore.kernel.org/all/731b7ff1-3cc9-e314-df2a-7c51b76d4db0@amd.com/t/#r00c728fcc069b1276642c325bfa9d82bf8fa21a3
[2] - https://drive.google.com/file/d/1yEoeW6OQC9WnwmzFW6NBLhFP_jD0xcHm/view?usp=sharing

Andrey Grodzovsky (5):
  drm/amdgpu: Fix possible refcount leak for release of
    external_hw_fence
  drm/amdgpu: Add put fence in amdgpu_fence_driver_clear_job_fences
  drm/amdgpu: Prevent race between late signaled fences and GPU reset.
  drm/sched: Partial revert of 'drm/sched: Keep s_fence->parent pointer'
  drm/amdgpu: Follow up change to previous drm scheduler change.

 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |  2 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 27 ++++++++++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 37 ++++++++++++++++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 12 +++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  1 +
 drivers/gpu/drm/scheduler/sched_main.c     | 16 ++++++++--
 6 files changed, 78 insertions(+), 17 deletions(-)