drm/panthor: Make the timeout per-queue instead of per-job

Message ID: 20250307155556.173494-1-ashley.smith@collabora.com (mailing list archive)
State: New
Series: drm/panthor: Make the timeout per-queue instead of per-job

Commit Message

Ashley Smith March 7, 2025, 3:55 p.m. UTC
The timeout logic provided by drm_sched leads to races when we try
to suspend it while the drm_sched workqueue queues more jobs. Let's
overhaul the timeout handling in panthor to use our own delayed work,
which is resumed/suspended when a group is resumed/suspended. When an
actual timeout occurs, we still call drm_sched_fault() so it is
reported through drm_sched; otherwise the drm_sched timeout stays
disabled (set to MAX_SCHEDULE_TIMEOUT), which leaves us in control of
how modifications to the timer are protected.

One issue is that we call drm_sched_suspend_timeout() from both
queue_run_job() and tick_work(), which can race because
drm_sched_suspend_timeout() takes no lock. Another issue is in
queue_run_job(): if the group is not scheduled, we suspend the
timeout again, undoing what drm_sched_job_begin() did when it called
drm_sched_start_timeout(), so the timeout is not reset when a job
finishes.

Co-developed-by: Boris Brezillon <boris.brezillon@collabora.com>
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
Tested-by: Daniel Stone <daniels@collabora.com>
Fixes: de8548813824 ("drm/panthor: Add the scheduler logical block")
Signed-off-by: Ashley Smith <ashley.smith@collabora.com>
---
 drivers/gpu/drm/panthor/panthor_sched.c | 233 +++++++++++++++++-------
 1 file changed, 167 insertions(+), 66 deletions(-)


base-commit: b72f66f22c0e39ae6684c43fead774c13db24e73
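
For readers skimming the patch below, here is a minimal standalone
sketch of the suspend/resume pattern it introduces (simplified: the
fence_ctx lock and the blocked-queue special case are omitted, and the
struct and function names are made up for illustration). The trick is
that remaining == MAX_SCHEDULE_TIMEOUT doubles as the "timer is
running" marker, so both operations are idempotent:

/* Needs <linux/workqueue.h>, <linux/jiffies.h>, <linux/sched.h>. */
struct queue_timeout_sketch {
	struct delayed_work work;	/* calls drm_sched_fault() on expiry */
	unsigned long remaining;	/* jiffies left while suspended */
};

static void sketch_suspend_timeout(struct queue_timeout_sketch *qt)
{
	unsigned long expires = qt->work.timer.expires;

	if (qt->remaining != MAX_SCHEDULE_TIMEOUT)
		return;	/* already suspended */

	/* Stop the timer and save how much of the budget is left. */
	if (cancel_delayed_work(&qt->work) && time_after(expires, jiffies))
		qt->remaining = expires - jiffies;
	else
		qt->remaining = 0;
}

static void sketch_resume_timeout(struct queue_timeout_sketch *qt,
				  struct workqueue_struct *wq)
{
	if (qt->remaining == MAX_SCHEDULE_TIMEOUT)
		return;	/* already running */

	/* Re-arm the timer with the saved budget and mark it running. */
	mod_delayed_work(wq, &qt->work, qt->remaining);
	qt->remaining = MAX_SCHEDULE_TIMEOUT;
}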

Comments

kernel test robot March 8, 2025, 3:31 p.m. UTC | #1
Hi Ashley,

kernel test robot noticed the following build warnings:

[auto build test WARNING on b72f66f22c0e39ae6684c43fead774c13db24e73]

url:    https://github.com/intel-lab-lkp/linux/commits/Ashley-Smith/drm-panthor-Make-the-timeout-per-queue-instead-of-per-job/20250307-235830
base:   b72f66f22c0e39ae6684c43fead774c13db24e73
patch link:    https://lore.kernel.org/r/20250307155556.173494-1-ashley.smith%40collabora.com
patch subject: [PATCH] drm/panthor: Make the timeout per-queue instead of per-job
config: i386-buildonly-randconfig-004-20250308 (https://download.01.org/0day-ci/archive/20250308/202503082339.3TzIrrex-lkp@intel.com/config)
compiler: clang version 19.1.7 (https://github.com/llvm/llvm-project cd708029e0b2869e80abe31ddb175f7c35361f90)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250308/202503082339.3TzIrrex-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202503082339.3TzIrrex-lkp@intel.com/

All warnings (new ones prefixed by >>):

   drivers/gpu/drm/panthor/panthor_sched.c:318: warning: Excess struct member 'runnable' description in 'panthor_scheduler'
   drivers/gpu/drm/panthor/panthor_sched.c:318: warning: Excess struct member 'idle' description in 'panthor_scheduler'
   drivers/gpu/drm/panthor/panthor_sched.c:318: warning: Excess struct member 'waiting' description in 'panthor_scheduler'
   drivers/gpu/drm/panthor/panthor_sched.c:318: warning: Excess struct member 'has_ref' description in 'panthor_scheduler'
   drivers/gpu/drm/panthor/panthor_sched.c:318: warning: Excess struct member 'in_progress' description in 'panthor_scheduler'
   drivers/gpu/drm/panthor/panthor_sched.c:318: warning: Excess struct member 'stopped_groups' description in 'panthor_scheduler'
>> drivers/gpu/drm/panthor/panthor_sched.c:493: warning: Excess struct member 'remaining' description in 'panthor_queue'
   drivers/gpu/drm/panthor/panthor_sched.c:493: warning: Excess struct member 'mem' description in 'panthor_queue'
   drivers/gpu/drm/panthor/panthor_sched.c:493: warning: Excess struct member 'input' description in 'panthor_queue'
   drivers/gpu/drm/panthor/panthor_sched.c:493: warning: Excess struct member 'output' description in 'panthor_queue'
   drivers/gpu/drm/panthor/panthor_sched.c:493: warning: Excess struct member 'input_fw_va' description in 'panthor_queue'
   drivers/gpu/drm/panthor/panthor_sched.c:493: warning: Excess struct member 'output_fw_va' description in 'panthor_queue'
   drivers/gpu/drm/panthor/panthor_sched.c:493: warning: Excess struct member 'gpu_va' description in 'panthor_queue'
   drivers/gpu/drm/panthor/panthor_sched.c:493: warning: Excess struct member 'ref' description in 'panthor_queue'
   drivers/gpu/drm/panthor/panthor_sched.c:493: warning: Excess struct member 'gt' description in 'panthor_queue'
   drivers/gpu/drm/panthor/panthor_sched.c:493: warning: Excess struct member 'sync64' description in 'panthor_queue'
   drivers/gpu/drm/panthor/panthor_sched.c:493: warning: Excess struct member 'bo' description in 'panthor_queue'
   drivers/gpu/drm/panthor/panthor_sched.c:493: warning: Excess struct member 'offset' description in 'panthor_queue'
   drivers/gpu/drm/panthor/panthor_sched.c:493: warning: Excess struct member 'kmap' description in 'panthor_queue'
   drivers/gpu/drm/panthor/panthor_sched.c:493: warning: Excess struct member 'lock' description in 'panthor_queue'
   drivers/gpu/drm/panthor/panthor_sched.c:493: warning: Excess struct member 'id' description in 'panthor_queue'
   drivers/gpu/drm/panthor/panthor_sched.c:493: warning: Excess struct member 'seqno' description in 'panthor_queue'
   drivers/gpu/drm/panthor/panthor_sched.c:493: warning: Excess struct member 'last_fence' description in 'panthor_queue'
   drivers/gpu/drm/panthor/panthor_sched.c:493: warning: Excess struct member 'in_flight_jobs' description in 'panthor_queue'
   drivers/gpu/drm/panthor/panthor_sched.c:493: warning: Excess struct member 'slots' description in 'panthor_queue'
   drivers/gpu/drm/panthor/panthor_sched.c:493: warning: Excess struct member 'slot_count' description in 'panthor_queue'
   drivers/gpu/drm/panthor/panthor_sched.c:493: warning: Excess struct member 'seqno' description in 'panthor_queue'
   drivers/gpu/drm/panthor/panthor_sched.c:702: warning: Excess struct member 'data' description in 'panthor_group'
   drivers/gpu/drm/panthor/panthor_sched.c:838: warning: Excess struct member 'start' description in 'panthor_job'
   drivers/gpu/drm/panthor/panthor_sched.c:838: warning: Excess struct member 'size' description in 'panthor_job'
   drivers/gpu/drm/panthor/panthor_sched.c:838: warning: Excess struct member 'latest_flush' description in 'panthor_job'
   drivers/gpu/drm/panthor/panthor_sched.c:838: warning: Excess struct member 'start' description in 'panthor_job'
   drivers/gpu/drm/panthor/panthor_sched.c:838: warning: Excess struct member 'end' description in 'panthor_job'
   drivers/gpu/drm/panthor/panthor_sched.c:838: warning: Excess struct member 'mask' description in 'panthor_job'
   drivers/gpu/drm/panthor/panthor_sched.c:838: warning: Excess struct member 'slot' description in 'panthor_job'
   drivers/gpu/drm/panthor/panthor_sched.c:1832: warning: Function parameter or struct member 'ptdev' not described in 'panthor_sched_report_fw_events'
   drivers/gpu/drm/panthor/panthor_sched.c:1832: warning: Function parameter or struct member 'events' not described in 'panthor_sched_report_fw_events'
   drivers/gpu/drm/panthor/panthor_sched.c:2712: warning: Function parameter or struct member 'ptdev' not described in 'panthor_sched_report_mmu_fault'
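
Only the warning marked ">>" (the 'remaining' description at line 493)
is introduced by this patch; the rest are pre-existing. kernel-doc
cannot match members of anonymous nested structs unless the description
uses the dotted path, which the patch already does for @timeout.work
but not for @remaining. A likely fix for a v2, as an untested sketch:

	/** @timeout: Queue timeout related fields. */
	struct {
		/** @timeout.work: Work executed when a queue timeout occurs. */
		struct delayed_work work;

		/**
		 * @timeout.remaining: Time remaining before a queue timeout.
		 *
		 * Set to MAX_SCHEDULE_TIMEOUT while the timer runs, and to
		 * the saved remaining time while it is suspended.
		 */
		unsigned long remaining;
	} timeout;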


vim +493 drivers/gpu/drm/panthor/panthor_sched.c

de85488138247d0 Boris Brezillon 2024-02-29  147  
de85488138247d0 Boris Brezillon 2024-02-29  148  /**
de85488138247d0 Boris Brezillon 2024-02-29  149   * struct panthor_scheduler - Object used to manage the scheduler
de85488138247d0 Boris Brezillon 2024-02-29  150   */
de85488138247d0 Boris Brezillon 2024-02-29  151  struct panthor_scheduler {
de85488138247d0 Boris Brezillon 2024-02-29  152  	/** @ptdev: Device. */
de85488138247d0 Boris Brezillon 2024-02-29  153  	struct panthor_device *ptdev;
de85488138247d0 Boris Brezillon 2024-02-29  154  
de85488138247d0 Boris Brezillon 2024-02-29  155  	/**
de85488138247d0 Boris Brezillon 2024-02-29  156  	 * @wq: Workqueue used by our internal scheduler logic and
de85488138247d0 Boris Brezillon 2024-02-29  157  	 * drm_gpu_scheduler.
de85488138247d0 Boris Brezillon 2024-02-29  158  	 *
de85488138247d0 Boris Brezillon 2024-02-29  159  	 * Used for the scheduler tick, group update or other kind of FW
de85488138247d0 Boris Brezillon 2024-02-29  160  	 * event processing that can't be handled in the threaded interrupt
de85488138247d0 Boris Brezillon 2024-02-29  161  	 * path. Also passed to the drm_gpu_scheduler instances embedded
de85488138247d0 Boris Brezillon 2024-02-29  162  	 * in panthor_queue.
de85488138247d0 Boris Brezillon 2024-02-29  163  	 */
de85488138247d0 Boris Brezillon 2024-02-29  164  	struct workqueue_struct *wq;
de85488138247d0 Boris Brezillon 2024-02-29  165  
de85488138247d0 Boris Brezillon 2024-02-29  166  	/**
de85488138247d0 Boris Brezillon 2024-02-29  167  	 * @heap_alloc_wq: Workqueue used to schedule tiler_oom works.
de85488138247d0 Boris Brezillon 2024-02-29  168  	 *
de85488138247d0 Boris Brezillon 2024-02-29  169  	 * We have a queue dedicated to heap chunk allocation works to avoid
de85488138247d0 Boris Brezillon 2024-02-29  170  	 * blocking the rest of the scheduler if the allocation tries to
de85488138247d0 Boris Brezillon 2024-02-29  171  	 * reclaim memory.
de85488138247d0 Boris Brezillon 2024-02-29  172  	 */
de85488138247d0 Boris Brezillon 2024-02-29  173  	struct workqueue_struct *heap_alloc_wq;
de85488138247d0 Boris Brezillon 2024-02-29  174  
de85488138247d0 Boris Brezillon 2024-02-29  175  	/** @tick_work: Work executed on a scheduling tick. */
de85488138247d0 Boris Brezillon 2024-02-29  176  	struct delayed_work tick_work;
de85488138247d0 Boris Brezillon 2024-02-29  177  
de85488138247d0 Boris Brezillon 2024-02-29  178  	/**
de85488138247d0 Boris Brezillon 2024-02-29  179  	 * @sync_upd_work: Work used to process synchronization object updates.
de85488138247d0 Boris Brezillon 2024-02-29  180  	 *
de85488138247d0 Boris Brezillon 2024-02-29  181  	 * We use this work to unblock queues/groups that were waiting on a
de85488138247d0 Boris Brezillon 2024-02-29  182  	 * synchronization object.
de85488138247d0 Boris Brezillon 2024-02-29  183  	 */
de85488138247d0 Boris Brezillon 2024-02-29  184  	struct work_struct sync_upd_work;
de85488138247d0 Boris Brezillon 2024-02-29  185  
de85488138247d0 Boris Brezillon 2024-02-29  186  	/**
de85488138247d0 Boris Brezillon 2024-02-29  187  	 * @fw_events_work: Work used to process FW events outside the interrupt path.
de85488138247d0 Boris Brezillon 2024-02-29  188  	 *
de85488138247d0 Boris Brezillon 2024-02-29  189  	 * Even if the interrupt is threaded, we need any event processing
de85488138247d0 Boris Brezillon 2024-02-29  190  	 * that require taking the panthor_scheduler::lock to be processed
de85488138247d0 Boris Brezillon 2024-02-29  191  	 * outside the interrupt path so we don't block the tick logic when
de85488138247d0 Boris Brezillon 2024-02-29  192  	 * it calls panthor_fw_{csg,wait}_wait_acks(). Since most of the
de85488138247d0 Boris Brezillon 2024-02-29  193  	 * event processing requires taking this lock, we just delegate all
de85488138247d0 Boris Brezillon 2024-02-29  194  	 * FW event processing to the scheduler workqueue.
de85488138247d0 Boris Brezillon 2024-02-29  195  	 */
de85488138247d0 Boris Brezillon 2024-02-29  196  	struct work_struct fw_events_work;
de85488138247d0 Boris Brezillon 2024-02-29  197  
de85488138247d0 Boris Brezillon 2024-02-29  198  	/**
de85488138247d0 Boris Brezillon 2024-02-29  199  	 * @fw_events: Bitmask encoding pending FW events.
de85488138247d0 Boris Brezillon 2024-02-29  200  	 */
de85488138247d0 Boris Brezillon 2024-02-29  201  	atomic_t fw_events;
de85488138247d0 Boris Brezillon 2024-02-29  202  
de85488138247d0 Boris Brezillon 2024-02-29  203  	/**
de85488138247d0 Boris Brezillon 2024-02-29  204  	 * @resched_target: When the next tick should occur.
de85488138247d0 Boris Brezillon 2024-02-29  205  	 *
de85488138247d0 Boris Brezillon 2024-02-29  206  	 * Expressed in jiffies.
de85488138247d0 Boris Brezillon 2024-02-29  207  	 */
de85488138247d0 Boris Brezillon 2024-02-29  208  	u64 resched_target;
de85488138247d0 Boris Brezillon 2024-02-29  209  
de85488138247d0 Boris Brezillon 2024-02-29  210  	/**
de85488138247d0 Boris Brezillon 2024-02-29  211  	 * @last_tick: When the last tick occurred.
de85488138247d0 Boris Brezillon 2024-02-29  212  	 *
de85488138247d0 Boris Brezillon 2024-02-29  213  	 * Expressed in jiffies.
de85488138247d0 Boris Brezillon 2024-02-29  214  	 */
de85488138247d0 Boris Brezillon 2024-02-29  215  	u64 last_tick;
de85488138247d0 Boris Brezillon 2024-02-29  216  
de85488138247d0 Boris Brezillon 2024-02-29  217  	/** @tick_period: Tick period in jiffies. */
de85488138247d0 Boris Brezillon 2024-02-29  218  	u64 tick_period;
de85488138247d0 Boris Brezillon 2024-02-29  219  
de85488138247d0 Boris Brezillon 2024-02-29  220  	/**
de85488138247d0 Boris Brezillon 2024-02-29  221  	 * @lock: Lock protecting access to all the scheduler fields.
de85488138247d0 Boris Brezillon 2024-02-29  222  	 *
de85488138247d0 Boris Brezillon 2024-02-29  223  	 * Should be taken in the tick work, the irq handler, and anywhere the @groups
de85488138247d0 Boris Brezillon 2024-02-29  224  	 * fields are touched.
de85488138247d0 Boris Brezillon 2024-02-29  225  	 */
de85488138247d0 Boris Brezillon 2024-02-29  226  	struct mutex lock;
de85488138247d0 Boris Brezillon 2024-02-29  227  
de85488138247d0 Boris Brezillon 2024-02-29  228  	/** @groups: Various lists used to classify groups. */
de85488138247d0 Boris Brezillon 2024-02-29  229  	struct {
de85488138247d0 Boris Brezillon 2024-02-29  230  		/**
de85488138247d0 Boris Brezillon 2024-02-29  231  		 * @runnable: Runnable group lists.
de85488138247d0 Boris Brezillon 2024-02-29  232  		 *
de85488138247d0 Boris Brezillon 2024-02-29  233  		 * When a group has queues that want to execute something,
de85488138247d0 Boris Brezillon 2024-02-29  234  		 * its panthor_group::run_node should be inserted here.
de85488138247d0 Boris Brezillon 2024-02-29  235  		 *
de85488138247d0 Boris Brezillon 2024-02-29  236  		 * One list per-priority.
de85488138247d0 Boris Brezillon 2024-02-29  237  		 */
de85488138247d0 Boris Brezillon 2024-02-29  238  		struct list_head runnable[PANTHOR_CSG_PRIORITY_COUNT];
de85488138247d0 Boris Brezillon 2024-02-29  239  
de85488138247d0 Boris Brezillon 2024-02-29  240  		/**
de85488138247d0 Boris Brezillon 2024-02-29  241  		 * @idle: Idle group lists.
de85488138247d0 Boris Brezillon 2024-02-29  242  		 *
de85488138247d0 Boris Brezillon 2024-02-29  243  		 * When all queues of a group are idle (either because they
de85488138247d0 Boris Brezillon 2024-02-29  244  		 * have nothing to execute, or because they are blocked), the
de85488138247d0 Boris Brezillon 2024-02-29  245  		 * panthor_group::run_node field should be inserted here.
de85488138247d0 Boris Brezillon 2024-02-29  246  		 *
de85488138247d0 Boris Brezillon 2024-02-29  247  		 * One list per-priority.
de85488138247d0 Boris Brezillon 2024-02-29  248  		 */
de85488138247d0 Boris Brezillon 2024-02-29  249  		struct list_head idle[PANTHOR_CSG_PRIORITY_COUNT];
de85488138247d0 Boris Brezillon 2024-02-29  250  
de85488138247d0 Boris Brezillon 2024-02-29  251  		/**
de85488138247d0 Boris Brezillon 2024-02-29  252  		 * @waiting: List of groups whose queues are blocked on a
de85488138247d0 Boris Brezillon 2024-02-29  253  		 * synchronization object.
de85488138247d0 Boris Brezillon 2024-02-29  254  		 *
de85488138247d0 Boris Brezillon 2024-02-29  255  		 * Insert panthor_group::wait_node here when a group is waiting
de85488138247d0 Boris Brezillon 2024-02-29  256  		 * for synchronization objects to be signaled.
de85488138247d0 Boris Brezillon 2024-02-29  257  		 *
de85488138247d0 Boris Brezillon 2024-02-29  258  		 * This list is evaluated in the @sync_upd_work work.
de85488138247d0 Boris Brezillon 2024-02-29  259  		 */
de85488138247d0 Boris Brezillon 2024-02-29  260  		struct list_head waiting;
de85488138247d0 Boris Brezillon 2024-02-29  261  	} groups;
de85488138247d0 Boris Brezillon 2024-02-29  262  
de85488138247d0 Boris Brezillon 2024-02-29  263  	/**
de85488138247d0 Boris Brezillon 2024-02-29  264  	 * @csg_slots: FW command stream group slots.
de85488138247d0 Boris Brezillon 2024-02-29  265  	 */
de85488138247d0 Boris Brezillon 2024-02-29  266  	struct panthor_csg_slot csg_slots[MAX_CSGS];
de85488138247d0 Boris Brezillon 2024-02-29  267  
de85488138247d0 Boris Brezillon 2024-02-29  268  	/** @csg_slot_count: Number of command stream group slots exposed by the FW. */
de85488138247d0 Boris Brezillon 2024-02-29  269  	u32 csg_slot_count;
de85488138247d0 Boris Brezillon 2024-02-29  270  
de85488138247d0 Boris Brezillon 2024-02-29  271  	/** @cs_slot_count: Number of command stream slot per group slot exposed by the FW. */
de85488138247d0 Boris Brezillon 2024-02-29  272  	u32 cs_slot_count;
de85488138247d0 Boris Brezillon 2024-02-29  273  
de85488138247d0 Boris Brezillon 2024-02-29  274  	/** @as_slot_count: Number of address space slots supported by the MMU. */
de85488138247d0 Boris Brezillon 2024-02-29  275  	u32 as_slot_count;
de85488138247d0 Boris Brezillon 2024-02-29  276  
de85488138247d0 Boris Brezillon 2024-02-29  277  	/** @used_csg_slot_count: Number of command stream group slot currently used. */
de85488138247d0 Boris Brezillon 2024-02-29  278  	u32 used_csg_slot_count;
de85488138247d0 Boris Brezillon 2024-02-29  279  
de85488138247d0 Boris Brezillon 2024-02-29  280  	/** @sb_slot_count: Number of scoreboard slots. */
de85488138247d0 Boris Brezillon 2024-02-29  281  	u32 sb_slot_count;
de85488138247d0 Boris Brezillon 2024-02-29  282  
de85488138247d0 Boris Brezillon 2024-02-29  283  	/**
de85488138247d0 Boris Brezillon 2024-02-29  284  	 * @might_have_idle_groups: True if an active group might have become idle.
de85488138247d0 Boris Brezillon 2024-02-29  285  	 *
de85488138247d0 Boris Brezillon 2024-02-29  286  	 * This will force a tick, so other runnable groups can be scheduled if one
de85488138247d0 Boris Brezillon 2024-02-29  287  	 * or more active groups became idle.
de85488138247d0 Boris Brezillon 2024-02-29  288  	 */
de85488138247d0 Boris Brezillon 2024-02-29  289  	bool might_have_idle_groups;
de85488138247d0 Boris Brezillon 2024-02-29  290  
de85488138247d0 Boris Brezillon 2024-02-29  291  	/** @pm: Power management related fields. */
de85488138247d0 Boris Brezillon 2024-02-29  292  	struct {
de85488138247d0 Boris Brezillon 2024-02-29  293  		/** @has_ref: True if the scheduler owns a runtime PM reference. */
de85488138247d0 Boris Brezillon 2024-02-29  294  		bool has_ref;
de85488138247d0 Boris Brezillon 2024-02-29  295  	} pm;
de85488138247d0 Boris Brezillon 2024-02-29  296  
de85488138247d0 Boris Brezillon 2024-02-29  297  	/** @reset: Reset related fields. */
de85488138247d0 Boris Brezillon 2024-02-29  298  	struct {
de85488138247d0 Boris Brezillon 2024-02-29  299  		/** @lock: Lock protecting the other reset fields. */
de85488138247d0 Boris Brezillon 2024-02-29  300  		struct mutex lock;
de85488138247d0 Boris Brezillon 2024-02-29  301  
de85488138247d0 Boris Brezillon 2024-02-29  302  		/**
de85488138247d0 Boris Brezillon 2024-02-29  303  		 * @in_progress: True if a reset is in progress.
de85488138247d0 Boris Brezillon 2024-02-29  304  		 *
de85488138247d0 Boris Brezillon 2024-02-29  305  		 * Set to true in panthor_sched_pre_reset() and back to false in
de85488138247d0 Boris Brezillon 2024-02-29  306  		 * panthor_sched_post_reset().
de85488138247d0 Boris Brezillon 2024-02-29  307  		 */
de85488138247d0 Boris Brezillon 2024-02-29  308  		atomic_t in_progress;
de85488138247d0 Boris Brezillon 2024-02-29  309  
de85488138247d0 Boris Brezillon 2024-02-29  310  		/**
de85488138247d0 Boris Brezillon 2024-02-29  311  		 * @stopped_groups: List containing all groups that were stopped
de85488138247d0 Boris Brezillon 2024-02-29  312  		 * before a reset.
de85488138247d0 Boris Brezillon 2024-02-29  313  		 *
de85488138247d0 Boris Brezillon 2024-02-29  314  		 * Insert panthor_group::run_node in the pre_reset path.
de85488138247d0 Boris Brezillon 2024-02-29  315  		 */
de85488138247d0 Boris Brezillon 2024-02-29  316  		struct list_head stopped_groups;
de85488138247d0 Boris Brezillon 2024-02-29  317  	} reset;
de85488138247d0 Boris Brezillon 2024-02-29 @318  };
de85488138247d0 Boris Brezillon 2024-02-29  319  
de85488138247d0 Boris Brezillon 2024-02-29  320  /**
de85488138247d0 Boris Brezillon 2024-02-29  321   * struct panthor_syncobj_32b - 32-bit FW synchronization object
de85488138247d0 Boris Brezillon 2024-02-29  322   */
de85488138247d0 Boris Brezillon 2024-02-29  323  struct panthor_syncobj_32b {
de85488138247d0 Boris Brezillon 2024-02-29  324  	/** @seqno: Sequence number. */
de85488138247d0 Boris Brezillon 2024-02-29  325  	u32 seqno;
de85488138247d0 Boris Brezillon 2024-02-29  326  
de85488138247d0 Boris Brezillon 2024-02-29  327  	/**
de85488138247d0 Boris Brezillon 2024-02-29  328  	 * @status: Status.
de85488138247d0 Boris Brezillon 2024-02-29  329  	 *
de85488138247d0 Boris Brezillon 2024-02-29  330  	 * Not zero on failure.
de85488138247d0 Boris Brezillon 2024-02-29  331  	 */
de85488138247d0 Boris Brezillon 2024-02-29  332  	u32 status;
de85488138247d0 Boris Brezillon 2024-02-29  333  };
de85488138247d0 Boris Brezillon 2024-02-29  334  
de85488138247d0 Boris Brezillon 2024-02-29  335  /**
de85488138247d0 Boris Brezillon 2024-02-29  336   * struct panthor_syncobj_64b - 64-bit FW synchronization object
de85488138247d0 Boris Brezillon 2024-02-29  337   */
de85488138247d0 Boris Brezillon 2024-02-29  338  struct panthor_syncobj_64b {
de85488138247d0 Boris Brezillon 2024-02-29  339  	/** @seqno: Sequence number. */
de85488138247d0 Boris Brezillon 2024-02-29  340  	u64 seqno;
de85488138247d0 Boris Brezillon 2024-02-29  341  
de85488138247d0 Boris Brezillon 2024-02-29  342  	/**
de85488138247d0 Boris Brezillon 2024-02-29  343  	 * @status: Status.
de85488138247d0 Boris Brezillon 2024-02-29  344  	 *
de85488138247d0 Boris Brezillon 2024-02-29  345  	 * Not zero on failure.
de85488138247d0 Boris Brezillon 2024-02-29  346  	 */
de85488138247d0 Boris Brezillon 2024-02-29  347  	u32 status;
de85488138247d0 Boris Brezillon 2024-02-29  348  
de85488138247d0 Boris Brezillon 2024-02-29  349  	/** @pad: MBZ. */
de85488138247d0 Boris Brezillon 2024-02-29  350  	u32 pad;
de85488138247d0 Boris Brezillon 2024-02-29  351  };
de85488138247d0 Boris Brezillon 2024-02-29  352  
de85488138247d0 Boris Brezillon 2024-02-29  353  /**
de85488138247d0 Boris Brezillon 2024-02-29  354   * struct panthor_queue - Execution queue
de85488138247d0 Boris Brezillon 2024-02-29  355   */
de85488138247d0 Boris Brezillon 2024-02-29  356  struct panthor_queue {
de85488138247d0 Boris Brezillon 2024-02-29  357  	/** @scheduler: DRM scheduler used for this queue. */
de85488138247d0 Boris Brezillon 2024-02-29  358  	struct drm_gpu_scheduler scheduler;
de85488138247d0 Boris Brezillon 2024-02-29  359  
de85488138247d0 Boris Brezillon 2024-02-29  360  	/** @entity: DRM scheduling entity used for this queue. */
de85488138247d0 Boris Brezillon 2024-02-29  361  	struct drm_sched_entity entity;
de85488138247d0 Boris Brezillon 2024-02-29  362  
b571025809e4350 Ashley Smith    2025-03-07  363  	/** @timeout: Queue timeout related fields. */
b571025809e4350 Ashley Smith    2025-03-07  364  	struct {
b571025809e4350 Ashley Smith    2025-03-07  365  		/** @timeout.work: Work executed when a queue timeout occurs. */
b571025809e4350 Ashley Smith    2025-03-07  366  		struct delayed_work work;
b571025809e4350 Ashley Smith    2025-03-07  367  
de85488138247d0 Boris Brezillon 2024-02-29  368  		/**
b571025809e4350 Ashley Smith    2025-03-07  369  		 * @remaining: Time remaining before a queue timeout.
de85488138247d0 Boris Brezillon 2024-02-29  370  		 *
b571025809e4350 Ashley Smith    2025-03-07  371  		 * When the timer is running, this value is set to MAX_SCHEDULE_TIMEOUT.
b571025809e4350 Ashley Smith    2025-03-07  372  		 * When the timer is suspended, it's set to the time remaining when the
b571025809e4350 Ashley Smith    2025-03-07  373  		 * timer was suspended.
de85488138247d0 Boris Brezillon 2024-02-29  374  		 */
b571025809e4350 Ashley Smith    2025-03-07  375  		unsigned long remaining;
b571025809e4350 Ashley Smith    2025-03-07  376  	} timeout;
de85488138247d0 Boris Brezillon 2024-02-29  377  
de85488138247d0 Boris Brezillon 2024-02-29  378  	/**
de85488138247d0 Boris Brezillon 2024-02-29  379  	 * @doorbell_id: Doorbell assigned to this queue.
de85488138247d0 Boris Brezillon 2024-02-29  380  	 *
de85488138247d0 Boris Brezillon 2024-02-29  381  	 * Right now, all groups share the same doorbell, and the doorbell ID
de85488138247d0 Boris Brezillon 2024-02-29  382  	 * is assigned to group_slot + 1 when the group is assigned a slot. But
de85488138247d0 Boris Brezillon 2024-02-29  383  	 * we might decide to provide fine grained doorbell assignment at some
de85488138247d0 Boris Brezillon 2024-02-29  384  	 * point, so don't have to wake up all queues in a group every time one
de85488138247d0 Boris Brezillon 2024-02-29  385  	 * of them is updated.
de85488138247d0 Boris Brezillon 2024-02-29  386  	 */
de85488138247d0 Boris Brezillon 2024-02-29  387  	u8 doorbell_id;
de85488138247d0 Boris Brezillon 2024-02-29  388  
de85488138247d0 Boris Brezillon 2024-02-29  389  	/**
de85488138247d0 Boris Brezillon 2024-02-29  390  	 * @priority: Priority of the queue inside the group.
de85488138247d0 Boris Brezillon 2024-02-29  391  	 *
de85488138247d0 Boris Brezillon 2024-02-29  392  	 * Must be less than 16 (Only 4 bits available).
de85488138247d0 Boris Brezillon 2024-02-29  393  	 */
de85488138247d0 Boris Brezillon 2024-02-29  394  	u8 priority;
de85488138247d0 Boris Brezillon 2024-02-29  395  #define CSF_MAX_QUEUE_PRIO	GENMASK(3, 0)
de85488138247d0 Boris Brezillon 2024-02-29  396  
de85488138247d0 Boris Brezillon 2024-02-29  397  	/** @ringbuf: Command stream ring-buffer. */
de85488138247d0 Boris Brezillon 2024-02-29  398  	struct panthor_kernel_bo *ringbuf;
de85488138247d0 Boris Brezillon 2024-02-29  399  
de85488138247d0 Boris Brezillon 2024-02-29  400  	/** @iface: Firmware interface. */
de85488138247d0 Boris Brezillon 2024-02-29  401  	struct {
de85488138247d0 Boris Brezillon 2024-02-29  402  		/** @mem: FW memory allocated for this interface. */
de85488138247d0 Boris Brezillon 2024-02-29  403  		struct panthor_kernel_bo *mem;
de85488138247d0 Boris Brezillon 2024-02-29  404  
de85488138247d0 Boris Brezillon 2024-02-29  405  		/** @input: Input interface. */
de85488138247d0 Boris Brezillon 2024-02-29  406  		struct panthor_fw_ringbuf_input_iface *input;
de85488138247d0 Boris Brezillon 2024-02-29  407  
de85488138247d0 Boris Brezillon 2024-02-29  408  		/** @output: Output interface. */
de85488138247d0 Boris Brezillon 2024-02-29  409  		const struct panthor_fw_ringbuf_output_iface *output;
de85488138247d0 Boris Brezillon 2024-02-29  410  
de85488138247d0 Boris Brezillon 2024-02-29  411  		/** @input_fw_va: FW virtual address of the input interface buffer. */
de85488138247d0 Boris Brezillon 2024-02-29  412  		u32 input_fw_va;
de85488138247d0 Boris Brezillon 2024-02-29  413  
de85488138247d0 Boris Brezillon 2024-02-29  414  		/** @output_fw_va: FW virtual address of the output interface buffer. */
de85488138247d0 Boris Brezillon 2024-02-29  415  		u32 output_fw_va;
de85488138247d0 Boris Brezillon 2024-02-29  416  	} iface;
de85488138247d0 Boris Brezillon 2024-02-29  417  
de85488138247d0 Boris Brezillon 2024-02-29  418  	/**
de85488138247d0 Boris Brezillon 2024-02-29  419  	 * @syncwait: Stores information about the synchronization object this
de85488138247d0 Boris Brezillon 2024-02-29  420  	 * queue is waiting on.
de85488138247d0 Boris Brezillon 2024-02-29  421  	 */
de85488138247d0 Boris Brezillon 2024-02-29  422  	struct {
de85488138247d0 Boris Brezillon 2024-02-29  423  		/** @gpu_va: GPU address of the synchronization object. */
de85488138247d0 Boris Brezillon 2024-02-29  424  		u64 gpu_va;
de85488138247d0 Boris Brezillon 2024-02-29  425  
de85488138247d0 Boris Brezillon 2024-02-29  426  		/** @ref: Reference value to compare against. */
de85488138247d0 Boris Brezillon 2024-02-29  427  		u64 ref;
de85488138247d0 Boris Brezillon 2024-02-29  428  
de85488138247d0 Boris Brezillon 2024-02-29  429  		/** @gt: True if this is a greater-than test. */
de85488138247d0 Boris Brezillon 2024-02-29  430  		bool gt;
de85488138247d0 Boris Brezillon 2024-02-29  431  
de85488138247d0 Boris Brezillon 2024-02-29  432  		/** @sync64: True if this is a 64-bit sync object. */
de85488138247d0 Boris Brezillon 2024-02-29  433  		bool sync64;
de85488138247d0 Boris Brezillon 2024-02-29  434  
de85488138247d0 Boris Brezillon 2024-02-29  435  		/** @bo: Buffer object holding the synchronization object. */
de85488138247d0 Boris Brezillon 2024-02-29  436  		struct drm_gem_object *obj;
de85488138247d0 Boris Brezillon 2024-02-29  437  
de85488138247d0 Boris Brezillon 2024-02-29  438  		/** @offset: Offset of the synchronization object inside @bo. */
de85488138247d0 Boris Brezillon 2024-02-29  439  		u64 offset;
de85488138247d0 Boris Brezillon 2024-02-29  440  
de85488138247d0 Boris Brezillon 2024-02-29  441  		/**
de85488138247d0 Boris Brezillon 2024-02-29  442  		 * @kmap: Kernel mapping of the buffer object holding the
de85488138247d0 Boris Brezillon 2024-02-29  443  		 * synchronization object.
de85488138247d0 Boris Brezillon 2024-02-29  444  		 */
de85488138247d0 Boris Brezillon 2024-02-29  445  		void *kmap;
de85488138247d0 Boris Brezillon 2024-02-29  446  	} syncwait;
de85488138247d0 Boris Brezillon 2024-02-29  447  
de85488138247d0 Boris Brezillon 2024-02-29  448  	/** @fence_ctx: Fence context fields. */
de85488138247d0 Boris Brezillon 2024-02-29  449  	struct {
de85488138247d0 Boris Brezillon 2024-02-29  450  		/** @lock: Used to protect access to all fences allocated by this context. */
de85488138247d0 Boris Brezillon 2024-02-29  451  		spinlock_t lock;
de85488138247d0 Boris Brezillon 2024-02-29  452  
de85488138247d0 Boris Brezillon 2024-02-29  453  		/**
de85488138247d0 Boris Brezillon 2024-02-29  454  		 * @id: Fence context ID.
de85488138247d0 Boris Brezillon 2024-02-29  455  		 *
de85488138247d0 Boris Brezillon 2024-02-29  456  		 * Allocated with dma_fence_context_alloc().
de85488138247d0 Boris Brezillon 2024-02-29  457  		 */
de85488138247d0 Boris Brezillon 2024-02-29  458  		u64 id;
de85488138247d0 Boris Brezillon 2024-02-29  459  
de85488138247d0 Boris Brezillon 2024-02-29  460  		/** @seqno: Sequence number of the last initialized fence. */
de85488138247d0 Boris Brezillon 2024-02-29  461  		atomic64_t seqno;
de85488138247d0 Boris Brezillon 2024-02-29  462  
7b6f9ec6ad51125 Boris Brezillon 2024-07-03  463  		/**
7b6f9ec6ad51125 Boris Brezillon 2024-07-03  464  		 * @last_fence: Fence of the last submitted job.
7b6f9ec6ad51125 Boris Brezillon 2024-07-03  465  		 *
7b6f9ec6ad51125 Boris Brezillon 2024-07-03  466  		 * We return this fence when we get an empty command stream.
7b6f9ec6ad51125 Boris Brezillon 2024-07-03  467  		 * This way, we are guaranteed that all earlier jobs have completed
7b6f9ec6ad51125 Boris Brezillon 2024-07-03  468  		 * when drm_sched_job::s_fence::finished without having to feed
7b6f9ec6ad51125 Boris Brezillon 2024-07-03  469  		 * the CS ring buffer with a dummy job that only signals the fence.
7b6f9ec6ad51125 Boris Brezillon 2024-07-03  470  		 */
7b6f9ec6ad51125 Boris Brezillon 2024-07-03  471  		struct dma_fence *last_fence;
7b6f9ec6ad51125 Boris Brezillon 2024-07-03  472  
de85488138247d0 Boris Brezillon 2024-02-29  473  		/**
de85488138247d0 Boris Brezillon 2024-02-29  474  		 * @in_flight_jobs: List containing all in-flight jobs.
de85488138247d0 Boris Brezillon 2024-02-29  475  		 *
de85488138247d0 Boris Brezillon 2024-02-29  476  		 * Used to keep track and signal panthor_job::done_fence when the
de85488138247d0 Boris Brezillon 2024-02-29  477  		 * synchronization object attached to the queue is signaled.
de85488138247d0 Boris Brezillon 2024-02-29  478  		 */
de85488138247d0 Boris Brezillon 2024-02-29  479  		struct list_head in_flight_jobs;
de85488138247d0 Boris Brezillon 2024-02-29  480  	} fence_ctx;
f8ff51a47084517 Adrián Larumbe  2024-09-24  481  
f8ff51a47084517 Adrián Larumbe  2024-09-24  482  	/** @profiling: Job profiling data slots and access information. */
f8ff51a47084517 Adrián Larumbe  2024-09-24  483  	struct {
f8ff51a47084517 Adrián Larumbe  2024-09-24  484  		/** @slots: Kernel BO holding the slots. */
f8ff51a47084517 Adrián Larumbe  2024-09-24  485  		struct panthor_kernel_bo *slots;
f8ff51a47084517 Adrián Larumbe  2024-09-24  486  
f8ff51a47084517 Adrián Larumbe  2024-09-24  487  		/** @slot_count: Number of jobs ringbuffer can hold at once. */
f8ff51a47084517 Adrián Larumbe  2024-09-24  488  		u32 slot_count;
f8ff51a47084517 Adrián Larumbe  2024-09-24  489  
f8ff51a47084517 Adrián Larumbe  2024-09-24  490  		/** @seqno: Index of the next available profiling information slot. */
f8ff51a47084517 Adrián Larumbe  2024-09-24  491  		u32 seqno;
f8ff51a47084517 Adrián Larumbe  2024-09-24  492  	} profiling;
de85488138247d0 Boris Brezillon 2024-02-29 @493  };
de85488138247d0 Boris Brezillon 2024-02-29  494

Patch

diff --git a/drivers/gpu/drm/panthor/panthor_sched.c b/drivers/gpu/drm/panthor/panthor_sched.c
index 4d31d1967716..95eb5246c246 100644
--- a/drivers/gpu/drm/panthor/panthor_sched.c
+++ b/drivers/gpu/drm/panthor/panthor_sched.c
@@ -360,17 +360,20 @@  struct panthor_queue {
 	/** @entity: DRM scheduling entity used for this queue. */
 	struct drm_sched_entity entity;
 
-	/**
-	 * @remaining_time: Time remaining before the job timeout expires.
-	 *
-	 * The job timeout is suspended when the queue is not scheduled by the
-	 * FW. Every time we suspend the timer, we need to save the remaining
-	 * time so we can restore it later on.
-	 */
-	unsigned long remaining_time;
+	/** @timeout: Queue timeout related fields. */
+	struct {
+		/** @timeout.work: Work executed when a queue timeout occurs. */
+		struct delayed_work work;
 
-	/** @timeout_suspended: True if the job timeout was suspended. */
-	bool timeout_suspended;
+		/**
+		 * @remaining: Time remaining before a queue timeout.
+		 *
+		 * When the timer is running, this value is set to MAX_SCHEDULE_TIMEOUT.
+		 * When the timer is suspended, it's set to the time remaining when the
+		 * timer was suspended.
+		 */
+		unsigned long remaining;
+	} timeout;
 
 	/**
 	 * @doorbell_id: Doorbell assigned to this queue.
@@ -1031,6 +1034,82 @@  group_unbind_locked(struct panthor_group *group)
 	return 0;
 }
 
+static bool
+group_is_idle(struct panthor_group *group)
+{
+	struct panthor_device *ptdev = group->ptdev;
+	u32 inactive_queues;
+
+	if (group->csg_id >= 0)
+		return ptdev->scheduler->csg_slots[group->csg_id].idle;
+
+	inactive_queues = group->idle_queues | group->blocked_queues;
+	return hweight32(inactive_queues) == group->queue_count;
+}
+
+static void
+queue_suspend_timeout(struct panthor_queue *queue)
+{
+	unsigned long qtimeout, now;
+	struct panthor_group *group;
+	struct panthor_job *job;
+	bool timer_was_active;
+
+	spin_lock(&queue->fence_ctx.lock);
+
+	/* Already suspended, nothing to do. */
+	if (queue->timeout.remaining != MAX_SCHEDULE_TIMEOUT)
+		goto out_unlock;
+
+	job = list_first_entry_or_null(&queue->fence_ctx.in_flight_jobs,
+				       struct panthor_job, node);
+	group = job ? job->group : NULL;
+
+	/* If the queue is blocked and the group is idle, we want the timer to
+	 * keep running because the group can't be unblocked by other queues,
+	 * so it has to come from an external source, and we want to timebox
+	 * this external signalling.
+	 */
+	if (group && (group->blocked_queues & BIT(job->queue_idx)) &&
+	    group_is_idle(group))
+		goto out_unlock;
+
+	now = jiffies;
+	qtimeout = queue->timeout.work.timer.expires;
+
+	/* Cancel the timer. */
+	timer_was_active = cancel_delayed_work(&queue->timeout.work);
+	if (!timer_was_active || !job)
+		queue->timeout.remaining = msecs_to_jiffies(JOB_TIMEOUT_MS);
+	else if (time_after(qtimeout, now))
+		queue->timeout.remaining = qtimeout - now;
+	else
+		queue->timeout.remaining = 0;
+
+	if (WARN_ON_ONCE(queue->timeout.remaining > msecs_to_jiffies(JOB_TIMEOUT_MS)))
+		queue->timeout.remaining = msecs_to_jiffies(JOB_TIMEOUT_MS);
+
+out_unlock:
+	spin_unlock(&queue->fence_ctx.lock);
+}
+
+static void
+queue_resume_timeout(struct panthor_queue *queue)
+{
+	spin_lock(&queue->fence_ctx.lock);
+
+	/* When running, the remaining time is set to MAX_SCHEDULE_TIMEOUT. */
+	if (queue->timeout.remaining != MAX_SCHEDULE_TIMEOUT) {
+		mod_delayed_work(queue->scheduler.timeout_wq,
+				 &queue->timeout.work,
+				 queue->timeout.remaining);
+
+		queue->timeout.remaining = MAX_SCHEDULE_TIMEOUT;
+	}
+
+	spin_unlock(&queue->fence_ctx.lock);
+}
+
 /**
  * cs_slot_prog_locked() - Program a queue slot
  * @ptdev: Device.
@@ -1069,10 +1148,8 @@  cs_slot_prog_locked(struct panthor_device *ptdev, u32 csg_id, u32 cs_id)
 			       CS_IDLE_EMPTY |
 			       CS_STATE_MASK |
 			       CS_EXTRACT_EVENT);
-	if (queue->iface.input->insert != queue->iface.input->extract && queue->timeout_suspended) {
-		drm_sched_resume_timeout(&queue->scheduler, queue->remaining_time);
-		queue->timeout_suspended = false;
-	}
+	if (queue->iface.input->insert != queue->iface.input->extract)
+		queue_resume_timeout(queue);
 }
 
 /**
@@ -1099,14 +1176,7 @@  cs_slot_reset_locked(struct panthor_device *ptdev, u32 csg_id, u32 cs_id)
 			       CS_STATE_STOP,
 			       CS_STATE_MASK);
 
-	/* If the queue is blocked, we want to keep the timeout running, so
-	 * we can detect unbounded waits and kill the group when that happens.
-	 */
-	if (!(group->blocked_queues & BIT(cs_id)) && !queue->timeout_suspended) {
-		queue->remaining_time = drm_sched_suspend_timeout(&queue->scheduler);
-		queue->timeout_suspended = true;
-		WARN_ON(queue->remaining_time > msecs_to_jiffies(JOB_TIMEOUT_MS));
-	}
+	queue_suspend_timeout(queue);
 
 	return 0;
 }
@@ -1888,19 +1958,6 @@  tick_ctx_is_full(const struct panthor_scheduler *sched,
 	return ctx->group_count == sched->csg_slot_count;
 }
 
-static bool
-group_is_idle(struct panthor_group *group)
-{
-	struct panthor_device *ptdev = group->ptdev;
-	u32 inactive_queues;
-
-	if (group->csg_id >= 0)
-		return ptdev->scheduler->csg_slots[group->csg_id].idle;
-
-	inactive_queues = group->idle_queues | group->blocked_queues;
-	return hweight32(inactive_queues) == group->queue_count;
-}
-
 static bool
 group_can_run(struct panthor_group *group)
 {
@@ -2888,35 +2945,50 @@  void panthor_fdinfo_gather_group_samples(struct panthor_file *pfile)
 	xa_unlock(&gpool->xa);
 }
 
-static void group_sync_upd_work(struct work_struct *work)
+static bool queue_check_job_completion(struct panthor_queue *queue)
 {
-	struct panthor_group *group =
-		container_of(work, struct panthor_group, sync_upd_work);
+	struct panthor_syncobj_64b *syncobj = NULL;
 	struct panthor_job *job, *job_tmp;
+	bool cookie, progress = false;
 	LIST_HEAD(done_jobs);
-	u32 queue_idx;
-	bool cookie;
 
 	cookie = dma_fence_begin_signalling();
-	for (queue_idx = 0; queue_idx < group->queue_count; queue_idx++) {
-		struct panthor_queue *queue = group->queues[queue_idx];
-		struct panthor_syncobj_64b *syncobj;
+	spin_lock(&queue->fence_ctx.lock);
+	list_for_each_entry_safe(job, job_tmp, &queue->fence_ctx.in_flight_jobs, node) {
+		if (!syncobj) {
+			struct panthor_group *group = job->group;
 
-		if (!queue)
-			continue;
+			syncobj = group->syncobjs->kmap +
+				  (job->queue_idx * sizeof(*syncobj));
+		}
 
-		syncobj = group->syncobjs->kmap + (queue_idx * sizeof(*syncobj));
+		if (syncobj->seqno < job->done_fence->seqno)
+			break;
 
-		spin_lock(&queue->fence_ctx.lock);
-		list_for_each_entry_safe(job, job_tmp, &queue->fence_ctx.in_flight_jobs, node) {
-			if (syncobj->seqno < job->done_fence->seqno)
-				break;
+		list_move_tail(&job->node, &done_jobs);
+		dma_fence_signal_locked(job->done_fence);
+	}
 
-			list_move_tail(&job->node, &done_jobs);
-			dma_fence_signal_locked(job->done_fence);
-		}
-		spin_unlock(&queue->fence_ctx.lock);
+	if (list_empty(&queue->fence_ctx.in_flight_jobs)) {
+		/* If we have no job left, we cancel the timer, and reset remaining
+		 * time to its default so it can be restarted next time
+		 * queue_resume_timeout() is called.
+		 */
+		cancel_delayed_work(&queue->timeout.work);
+		queue->timeout.remaining = msecs_to_jiffies(JOB_TIMEOUT_MS);
+
+		/* If there's no job pending, we consider it progress to avoid a
+		 * spurious timeout if the timeout handler and the sync update
+		 * handler raced.
+		 */
+		progress = true;
+	} else if (!list_empty(&done_jobs)) {
+		mod_delayed_work(queue->scheduler.timeout_wq,
+				 &queue->timeout.work,
+				 msecs_to_jiffies(JOB_TIMEOUT_MS));
+		progress = true;
 	}
+	spin_unlock(&queue->fence_ctx.lock);
 	dma_fence_end_signalling(cookie);
 
 	list_for_each_entry_safe(job, job_tmp, &done_jobs, node) {
@@ -2926,6 +2998,27 @@  static void group_sync_upd_work(struct work_struct *work)
 		panthor_job_put(&job->base);
 	}
 
+	return progress;
+}
+
+static void group_sync_upd_work(struct work_struct *work)
+{
+	struct panthor_group *group =
+		container_of(work, struct panthor_group, sync_upd_work);
+	u32 queue_idx;
+	bool cookie;
+
+	cookie = dma_fence_begin_signalling();
+	for (queue_idx = 0; queue_idx < group->queue_count; queue_idx++) {
+		struct panthor_queue *queue = group->queues[queue_idx];
+
+		if (!queue)
+			continue;
+
+		queue_check_job_completion(queue);
+	}
+	dma_fence_end_signalling(cookie);
+
 	group_put(group);
 }
 
@@ -3173,17 +3266,6 @@  queue_run_job(struct drm_sched_job *sched_job)
 	queue->iface.input->insert = job->ringbuf.end;
 
 	if (group->csg_id < 0) {
-		/* If the queue is blocked, we want to keep the timeout running, so we
-		 * can detect unbounded waits and kill the group when that happens.
-		 * Otherwise, we suspend the timeout so the time we spend waiting for
-		 * a CSG slot is not counted.
-		 */
-		if (!(group->blocked_queues & BIT(job->queue_idx)) &&
-		    !queue->timeout_suspended) {
-			queue->remaining_time = drm_sched_suspend_timeout(&queue->scheduler);
-			queue->timeout_suspended = true;
-		}
-
 		group_schedule_locked(group, BIT(job->queue_idx));
 	} else {
 		gpu_write(ptdev, CSF_DOORBELL(queue->doorbell_id), 1);
@@ -3192,6 +3274,7 @@  queue_run_job(struct drm_sched_job *sched_job)
 			pm_runtime_get(ptdev->base.dev);
 			sched->pm.has_ref = true;
 		}
+		queue_resume_timeout(queue);
 		panthor_devfreq_record_busy(sched->ptdev);
 	}
 
@@ -3241,6 +3324,11 @@  queue_timedout_job(struct drm_sched_job *sched_job)
 
 	queue_start(queue);
 
+	/* We already flagged the queue as faulty, make sure we don't get
+	 * called again.
+	 */
+	queue->scheduler.timeout = MAX_SCHEDULE_TIMEOUT;
+
 	return DRM_GPU_SCHED_STAT_NOMINAL;
 }
 
@@ -3283,6 +3371,17 @@  static u32 calc_profiling_ringbuf_num_slots(struct panthor_device *ptdev,
 	return DIV_ROUND_UP(cs_ringbuf_size, min_profiled_job_instrs * sizeof(u64));
 }
 
+static void queue_timeout_work(struct work_struct *work)
+{
+	struct panthor_queue *queue = container_of(work, struct panthor_queue,
+						   timeout.work.work);
+	bool progress;
+
+	progress = queue_check_job_completion(queue);
+	if (!progress)
+		drm_sched_fault(&queue->scheduler);
+}
+
 static struct panthor_queue *
 group_create_queue(struct panthor_group *group,
 		   const struct drm_panthor_queue_create *args)
@@ -3298,7 +3397,7 @@  group_create_queue(struct panthor_group *group,
 		 * their profiling status.
 		 */
 		.credit_limit = args->ringbuf_size / sizeof(u64),
-		.timeout = msecs_to_jiffies(JOB_TIMEOUT_MS),
+		.timeout = MAX_SCHEDULE_TIMEOUT,
 		.timeout_wq = group->ptdev->reset.wq,
 		.name = "panthor-queue",
 		.dev = group->ptdev->base.dev,
@@ -3321,6 +3420,8 @@  group_create_queue(struct panthor_group *group,
 	if (!queue)
 		return ERR_PTR(-ENOMEM);
 
+	queue->timeout.remaining = msecs_to_jiffies(JOB_TIMEOUT_MS);
+	INIT_DELAYED_WORK(&queue->timeout.work, queue_timeout_work);
 	queue->fence_ctx.id = dma_fence_context_alloc(1);
 	spin_lock_init(&queue->fence_ctx.lock);
 	INIT_LIST_HEAD(&queue->fence_ctx.in_flight_jobs);
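
To recap how the pieces fit together at runtime (names from the patch;
this timeline is illustrative, not taken from the series):

  - group_create_queue(): timeout.remaining holds a full JOB_TIMEOUT_MS
    budget, the delayed work is initialized but idle, and the drm_sched
    timeout is parked at MAX_SCHEDULE_TIMEOUT.
  - queue_run_job() on a group holding a CSG slot: queue_resume_timeout()
    arms the work for the saved remaining time and marks the timer as
    running (remaining = MAX_SCHEDULE_TIMEOUT).
  - cs_slot_reset_locked() (group evicted at a tick):
    queue_suspend_timeout() cancels the work and saves expires - jiffies,
    so time spent off-slot is not billed to the job.
  - Sync update with jobs still in flight: queue_check_job_completion()
    re-arms the work for a full budget; with no jobs left, it cancels
    the work and resets remaining to the default.
  - If the work fires and queue_check_job_completion() reports no
    progress, queue_timeout_work() calls drm_sched_fault(), and drm_sched
    ends up in queue_timedout_job(), which parks the drm_sched timeout at
    MAX_SCHEDULE_TIMEOUT again to avoid being called twice.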