[0/6] Allow to extend the timeout without jobs disappearing

Message ID 20201125031708.6433-1-luben.tuikov@amd.com (mailing list archive)

Luben Tuikov Nov. 25, 2020, 3:17 a.m. UTC
Hi guys,

This series of patches implements a pending list for
jobs which are in the hardware, and a done list for
tasks which are done and need to be freed.

It implements a second thread, dedicated to freeing
tasks from the done list. The main scheduler thread no
longer frees (cleans up) done tasks by polling the head
of the pending list (drm_sched_get_cleanup_task() is
now gone); it only pushes tasks down to the GPU. As
tasks complete and call their DRM callback, their
fences are signalled, the tasks are queued to the done
list, and the done thread is woken up to free them.
This can take place concurrently with the main
scheduler thread pushing tasks down to the GPU.
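
To illustrate the pending/done list flow described above,
here is a minimal userspace sketch. It is not the code
from the patches: struct model_task, struct model_sched,
and the model_*() functions are hypothetical names made
up for this example, the lists are plain singly linked
lists rather than the kernel's list_head, and everything
runs in one thread, so the locking and the wake-up of
the done thread are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of a task; not the kernel's struct drm_sched_job. */
struct model_task {
	int id;
	struct model_task *next;
};

/* Hypothetical model of the scheduler's two lists. */
struct model_sched {
	struct model_task *pending; /* tasks the hardware currently owns */
	struct model_task *done;    /* completed tasks waiting to be freed */
	int freed;                  /* running count of freed tasks */
};

/* Main scheduler thread: pushing a task to the GPU puts it on the
 * pending list. This thread never frees anything. */
static struct model_task *model_push(struct model_sched *s, int id)
{
	struct model_task *t = malloc(sizeof(*t));

	assert(t);
	t->id = id;
	t->next = s->pending;
	s->pending = t;
	return t;
}

/* Completion callback: unlink the task from the pending list and
 * queue it on the done list. A real implementation would also
 * signal the fence and wake the done thread here. */
static void model_complete(struct model_sched *s, struct model_task *t)
{
	struct model_task **pp = &s->pending;

	while (*pp && *pp != t)
		pp = &(*pp)->next;
	assert(*pp == t); /* the task must have been pending */
	*pp = t->next;
	t->next = s->done;
	s->done = t;
}

/* Body of the done thread: free every task on the done list and
 * return how many were freed in this pass. */
static int model_reap(struct model_sched *s)
{
	int n = 0;

	while (s->done) {
		struct model_task *t = s->done;

		s->done = t->next;
		free(t);
		n++;
	}
	s->freed += n;
	return n;
}
```

In this model, model_push() and model_reap() never touch
the same list, which is the property that would let the
two threads run concurrently once proper locking is added.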

When a task times out, the timeout function now
returns a value back to DRM. The reason for this is
that the GPU driver has intimate knowledge of the
hardware and can tell DRM what to do: whether to
attempt to abort the task (say, by calling a driver
abort function, as the implementation dictates), or
whether the task needs more time. Note that the task
is not moved away from the pending list unless it is
no longer in the GPU. (The pending list holds tasks
which are pending from DRM's point of view, i.e. the
GPU has control over them: DMA is active for the task,
CUs are active for the task, and so on.)

The idea really is that what DRM wants to know is
whether the task is in the GPU or not. So now
drm_sched_backend_ops::timedout_job() returns
0 if the task is no longer with the GPU, or 1
if the task needs more time.
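
As a rough illustration of that return-value contract,
here is a small userspace sketch. It is an assumption-laden
model and not the interface from the patches: struct
model_job, the MODEL_JOB_* constants, model_driver_timedout()
and model_handle_timeout() are hypothetical names, and
"extending the timeout" is modelled as simply adding to a
millisecond counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a job; not the kernel's struct drm_sched_job. */
struct model_job {
	bool in_gpu;          /* true while the hardware still owns the job */
	bool on_pending_list; /* DRM's view: the job is still pending */
	unsigned timeout_ms;  /* currently armed timeout */
};

/* Return values as described in the cover letter:
 * 0 = job is no longer with the GPU, 1 = job needs more time.
 * The constant names are made up for this sketch. */
enum { MODEL_JOB_GONE = 0, MODEL_JOB_NEEDS_TIME = 1 };

typedef int (*model_timedout_fn)(struct model_job *job);

/* Hypothetical driver callback: only the driver knows whether the
 * hardware still has the job, so it reports that back. */
static int model_driver_timedout(struct model_job *job)
{
	return job->in_gpu ? MODEL_JOB_NEEDS_TIME : MODEL_JOB_GONE;
}

/* Scheduler side: act on the driver's answer. The job leaves the
 * pending list only when the driver says it is out of the GPU. */
static void model_handle_timeout(struct model_job *job,
				 model_timedout_fn cb,
				 unsigned extend_ms)
{
	assert(job && cb);
	if (cb(job) == MODEL_JOB_NEEDS_TIME)
		job->timeout_ms += extend_ms;  /* extend; job stays pending */
	else
		job->on_pending_list = false;  /* safe to hand off for freeing */
}
```

The point of the sketch is that the scheduler makes no
guess of its own: the job stays on the pending list for
as long as the driver keeps returning 1.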

Tested up to patch 5. Running with patch 6 seems to
make X/GDM just sleep, and I'm looking into this now.

This series applies to drm-misc-next.

Luben Tuikov (6):
  drm/scheduler: "node" --> "list"
  gpu/drm: ring_mirror_list --> pending_list
  drm/scheduler: Job timeout handler returns status
  drm/scheduler: Essentialize the job done callback
  drm/amdgpu: Don't hardcode thread name length
  drm/sched: Make use of a "done" thread

 drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |   6 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |   4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c     |   8 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h    |   2 +-
 drivers/gpu/drm/scheduler/sched_main.c      | 275 ++++++++++----------
 include/drm/gpu_scheduler.h                 |  43 ++-
 6 files changed, 186 insertions(+), 152 deletions(-)