[20/46] drm/i915/guc: Add hang check to GuC submit engine


Message ID

20210803222943.27686-21-matthew.brost@intel.com (mailing list archive)

State

New, archived

Headers

DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 8FB5360184
From: Matthew Brost <matthew.brost@intel.com>
To: <intel-gfx@lists.freedesktop.org>,
	<dri-devel@lists.freedesktop.org>
Date: Tue,  3 Aug 2021 15:29:17 -0700
Message-Id: <20210803222943.27686-21-matthew.brost@intel.com>
In-Reply-To: <20210803222943.27686-1-matthew.brost@intel.com>
References: <20210803222943.27686-1-matthew.brost@intel.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Subject: [Intel-gfx] [PATCH 20/46] drm/i915/guc: Add hang check to GuC
 submit engine
Precedence: list
Errors-To: intel-gfx-bounces@lists.freedesktop.org
Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>

Series

Parallel submission aka multi-bb execbuf

Commit Message

Matthew Brost Aug. 3, 2021, 10:29 p.m. UTC

The heartbeat uses a single instance of a GuC submit engine (GSE) to do
the hang check. As such if a different GSE's state machine hangs, the
heartbeat cannot detect this hang. Add timer to each GSE which in turn
can disable all submissions if it is hung.

Cc: John Harrison <John.C.Harrison@Intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 36 +++++++++++++++++++
 .../i915/gt/uc/intel_guc_submission_types.h   |  3 ++
 2 files changed, 39 insertions(+)
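
The mechanism described in the commit message can be sketched in userspace C roughly as follows. This is an illustration only, not the driver code: it replaces the hrtimer callback with an explicit poll against a caller-supplied clock, and `struct engine`, `struct device`, `NUM_ENGINES` and all helper names are hypothetical. Only the 2 second timeout is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_BLOCKED_NS 2000000000ULL /* 2 s, same value as in the patch */
#define NUM_ENGINES 3                /* hypothetical engine count */

struct engine {
	bool blocked;
	uint64_t deadline_ns; /* only meaningful while blocked */
};

struct device {
	struct engine engines[NUM_ENGINES];
	bool submission_enabled; /* shared by all engines */
};

static void device_init(struct device *dev)
{
	for (int i = 0; i < NUM_ENGINES; i++) {
		dev->engines[i].blocked = false;
		dev->engines[i].deadline_ns = 0;
	}
	dev->submission_enabled = true;
}

/* Arm the per-engine watchdog when the engine starts waiting. */
static void engine_set_blocked(struct engine *e, uint64_t now_ns)
{
	assert(!e->blocked); /* re-arming would silently extend the deadline */
	e->blocked = true;
	e->deadline_ns = now_ns + MAX_BLOCKED_NS;
}

/* Disarm the watchdog once the engine makes progress again. */
static void engine_clear_blocked(struct engine *e)
{
	e->blocked = false;
}

/*
 * Check every engine's deadline. If one expired, disable submission for
 * the whole device and return that engine's index; otherwise return -1.
 */
static int device_check_hang(struct device *dev, uint64_t now_ns)
{
	for (int i = 0; i < NUM_ENGINES; i++) {
		struct engine *e = &dev->engines[i];

		if (e->blocked && now_ns >= e->deadline_ns) {
			dev->submission_enabled = false;
			return i;
		}
	}
	return -1;
}
```

The point of the sketch is that each engine carries its own deadline, so a stall on any one of them is noticed, while the reaction (disabling submission) is device-wide.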

Comments

Daniel Vetter Aug. 9, 2021, 3:35 p.m. UTC | #1

On Tue, Aug 03, 2021 at 03:29:17PM -0700, Matthew Brost wrote:
> The heartbeat uses a single instance of a GuC submit engine (GSE) to do
> the hang check. As such if a different GSE's state machine hangs, the
> heartbeat cannot detect this hang. Add timer to each GSE which in turn
> can disable all submissions if it is hung.
> 
> Cc: John Harrison <John.C.Harrison@Intel.com>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 36 +++++++++++++++++++
>  .../i915/gt/uc/intel_guc_submission_types.h   |  3 ++
>  2 files changed, 39 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index afb9b4bb8971..2d8296bcc583 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -105,15 +105,21 @@ static bool tasklet_blocked(struct guc_submit_engine *gse)
>  	return test_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
>  }
>  
> +/* 2 seconds seems like a reasonable timeout waiting for a G2H */
> +#define MAX_TASKLET_BLOCKED_NS	2000000000
>  static void set_tasklet_blocked(struct guc_submit_engine *gse)
>  {
>  	lockdep_assert_held(&gse->sched_engine.lock);
> +	hrtimer_start_range_ns(&gse->hang_timer,
> +			       ns_to_ktime(MAX_TASKLET_BLOCKED_NS), 0,
> +			       HRTIMER_MODE_REL_PINNED);
>  	set_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);

So with drm/scheduler the reset handling is assumed to be
single-threaded, and there's quite complex rules around that. I've
recently worked with Boris Brezillon to clarify all this a bit and
improve docs. Does this all still work in that glorious future? Might be
good to at least sprinkle some comments/thoughts around in the commit
message about the envisaged future direction for all this stuff, to keep
people in the loop. Especially future people.

Ofc plan is still to just largely land all this.

Also: set_bit is an unordered atomic, which means you need barriers, which
means ... *insert the full rant about justifying/documenting lockless
algorithms from earlier*

But I think this all falls out with the removal of the guc-id allocation
scheme?
-Daniel

>  }
>  
>  static void __clr_tasklet_blocked(struct guc_submit_engine *gse)
>  {
>  	lockdep_assert_held(&gse->sched_engine.lock);
> +	hrtimer_cancel(&gse->hang_timer);
>  	clear_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
>  }
>  
> @@ -1028,6 +1034,7 @@ static void disable_submission(struct intel_guc *guc)
>  		if (__tasklet_is_enabled(&sched_engine->tasklet)) {
>  			GEM_BUG_ON(!guc->ct.enabled);
>  			__tasklet_disable_sync_once(&sched_engine->tasklet);
> +			hrtimer_try_to_cancel(&guc->gse[i]->hang_timer);
>  			sched_engine->tasklet.callback = NULL;
>  		}
>  	}
> @@ -3750,6 +3757,33 @@ static void guc_sched_engine_destroy(struct kref *kref)
>  	kfree(gse);
>  }
>  
> +static enum hrtimer_restart gse_hang(struct hrtimer *hrtimer)
> +{
> +	struct guc_submit_engine *gse =
> +		container_of(hrtimer, struct guc_submit_engine, hang_timer);
> +	struct intel_guc *guc = gse->sched_engine.private_data;
> +
> +#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
> +	if (guc->gse_hang_expected)
> +		drm_dbg(&guc_to_gt(guc)->i915->drm,
> +			"GSE[%i] hung, disabling submission", gse->id);
> +	else
> +		drm_err(&guc_to_gt(guc)->i915->drm,
> +			"GSE[%i] hung, disabling submission", gse->id);
> +#else
> +	drm_err(&guc_to_gt(guc)->i915->drm,
> +		"GSE[%i] hung, disabling submission", gse->id);
> +#endif
> +
> +	/*
> +	 * Tasklet not making forward progress, disable submission which in turn
> +	 * will kick in the heartbeat to do a full GPU reset.
> +	 */
> +	disable_submission(guc);
> +
> +	return HRTIMER_NORESTART;
> +}
> +
>  static void guc_submit_engine_init(struct intel_guc *guc,
>  				   struct guc_submit_engine *gse,
>  				   int id)
> @@ -3767,6 +3801,8 @@ static void guc_submit_engine_init(struct intel_guc *guc,
>  	sched_engine->retire_inflight_request_prio =
>  		guc_retire_inflight_request_prio;
>  	sched_engine->private_data = guc;
> +	hrtimer_init(&gse->hang_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +	gse->hang_timer.function = gse_hang;
>  	gse->id = id;
>  }
>  
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
> index a5933e07bdd2..eae2e9725ede 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
> @@ -6,6 +6,8 @@
>  #ifndef _INTEL_GUC_SUBMISSION_TYPES_H_
>  #define _INTEL_GUC_SUBMISSION_TYPES_H_
>  
> +#include <linux/xarray.h>
> +
>  #include "gt/intel_engine_types.h"
>  #include "gt/intel_context_types.h"
>  #include "i915_scheduler_types.h"
> @@ -41,6 +43,7 @@ struct guc_submit_engine {
>  	unsigned long flags;
>  	int total_num_rq_with_no_guc_id;
>  	atomic_t num_guc_ids_not_ready;
> +	struct hrtimer hang_timer;
>  	int id;
>  
>  	/*
> -- 
> 2.28.0
>

Matthew Brost Aug. 9, 2021, 7:05 p.m. UTC | #2

On Mon, Aug 09, 2021 at 05:35:25PM +0200, Daniel Vetter wrote:
> On Tue, Aug 03, 2021 at 03:29:17PM -0700, Matthew Brost wrote:
> > The heartbeat uses a single instance of a GuC submit engine (GSE) to do
> > the hang check. As such if a different GSE's state machine hangs, the
> > heartbeat cannot detect this hang. Add timer to each GSE which in turn
> > can disable all submissions if it is hung.
> > 
> > Cc: John Harrison <John.C.Harrison@Intel.com>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 36 +++++++++++++++++++
> >  .../i915/gt/uc/intel_guc_submission_types.h   |  3 ++
> >  2 files changed, 39 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index afb9b4bb8971..2d8296bcc583 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -105,15 +105,21 @@ static bool tasklet_blocked(struct guc_submit_engine *gse)
> >  	return test_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
> >  }
> >  
> > +/* 2 seconds seems like a reasonable timeout waiting for a G2H */
> > +#define MAX_TASKLET_BLOCKED_NS	2000000000
> >  static void set_tasklet_blocked(struct guc_submit_engine *gse)
> >  {
> >  	lockdep_assert_held(&gse->sched_engine.lock);
> > +	hrtimer_start_range_ns(&gse->hang_timer,
> > +			       ns_to_ktime(MAX_TASKLET_BLOCKED_NS), 0,
> > +			       HRTIMER_MODE_REL_PINNED);
> >  	set_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
> 
> So with drm/scheduler the reset handling is assumed to be
> single-threaded, and there's quite complex rules around that. I've
> recently worked with Boris Brezillon to clarify all this a bit and
> improve docs. Does this all still work in that glorious future? Might be
> good to at least sprinkle some comments/thoughts around in the commit
> message about the envisaged future direction for all this stuff, to keep
> people in the loop. Especially future people.
> 
> Ofc plan is still to just largely land all this.
> 
> Also: set_bit is an unordered atomic, which means you need barriers, which
> means ... *insert the full rant about justifying/documenting lockless
> algorithms from earlier*
>

lockdep_assert_held(&gse->sched_engine.lock);

Not lockless. Also spin locks act as barriers, right?
 
> But I think this all falls out with the removal of the guc-id allocation
> scheme?

Yes, this patch is getting deleted.

Matt

> -Daniel
> 
> >  }
> >  
> >  static void __clr_tasklet_blocked(struct guc_submit_engine *gse)
> >  {
> >  	lockdep_assert_held(&gse->sched_engine.lock);
> > +	hrtimer_cancel(&gse->hang_timer);
> >  	clear_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
> >  }
> >  
> > @@ -1028,6 +1034,7 @@ static void disable_submission(struct intel_guc *guc)
> >  		if (__tasklet_is_enabled(&sched_engine->tasklet)) {
> >  			GEM_BUG_ON(!guc->ct.enabled);
> >  			__tasklet_disable_sync_once(&sched_engine->tasklet);
> > +			hrtimer_try_to_cancel(&guc->gse[i]->hang_timer);
> >  			sched_engine->tasklet.callback = NULL;
> >  		}
> >  	}
> > @@ -3750,6 +3757,33 @@ static void guc_sched_engine_destroy(struct kref *kref)
> >  	kfree(gse);
> >  }
> >  
> > +static enum hrtimer_restart gse_hang(struct hrtimer *hrtimer)
> > +{
> > +	struct guc_submit_engine *gse =
> > +		container_of(hrtimer, struct guc_submit_engine, hang_timer);
> > +	struct intel_guc *guc = gse->sched_engine.private_data;
> > +
> > +#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
> > +	if (guc->gse_hang_expected)
> > +		drm_dbg(&guc_to_gt(guc)->i915->drm,
> > +			"GSE[%i] hung, disabling submission", gse->id);
> > +	else
> > +		drm_err(&guc_to_gt(guc)->i915->drm,
> > +			"GSE[%i] hung, disabling submission", gse->id);
> > +#else
> > +	drm_err(&guc_to_gt(guc)->i915->drm,
> > +		"GSE[%i] hung, disabling submission", gse->id);
> > +#endif
> > +
> > +	/*
> > +	 * Tasklet not making forward progress, disable submission which in turn
> > +	 * will kick in the heartbeat to do a full GPU reset.
> > +	 */
> > +	disable_submission(guc);
> > +
> > +	return HRTIMER_NORESTART;
> > +}
> > +
> >  static void guc_submit_engine_init(struct intel_guc *guc,
> >  				   struct guc_submit_engine *gse,
> >  				   int id)
> > @@ -3767,6 +3801,8 @@ static void guc_submit_engine_init(struct intel_guc *guc,
> >  	sched_engine->retire_inflight_request_prio =
> >  		guc_retire_inflight_request_prio;
> >  	sched_engine->private_data = guc;
> > +	hrtimer_init(&gse->hang_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> > +	gse->hang_timer.function = gse_hang;
> >  	gse->id = id;
> >  }
> >  
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
> > index a5933e07bdd2..eae2e9725ede 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
> > @@ -6,6 +6,8 @@
> >  #ifndef _INTEL_GUC_SUBMISSION_TYPES_H_
> >  #define _INTEL_GUC_SUBMISSION_TYPES_H_
> >  
> > +#include <linux/xarray.h>
> > +
> >  #include "gt/intel_engine_types.h"
> >  #include "gt/intel_context_types.h"
> >  #include "i915_scheduler_types.h"
> > @@ -41,6 +43,7 @@ struct guc_submit_engine {
> >  	unsigned long flags;
> >  	int total_num_rq_with_no_guc_id;
> >  	atomic_t num_guc_ids_not_ready;
> > +	struct hrtimer hang_timer;
> >  	int id;
> >  
> >  	/*
> > -- 
> > 2.28.0
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

Daniel Vetter Aug. 10, 2021, 9:18 a.m. UTC | #3

On Mon, Aug 09, 2021 at 07:05:58PM +0000, Matthew Brost wrote:
> On Mon, Aug 09, 2021 at 05:35:25PM +0200, Daniel Vetter wrote:
> > On Tue, Aug 03, 2021 at 03:29:17PM -0700, Matthew Brost wrote:
> > > The heartbeat uses a single instance of a GuC submit engine (GSE) to do
> > > the hang check. As such if a different GSE's state machine hangs, the
> > > heartbeat cannot detect this hang. Add timer to each GSE which in turn
> > > can disable all submissions if it is hung.
> > > 
> > > Cc: John Harrison <John.C.Harrison@Intel.com>
> > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > ---
> > >  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 36 +++++++++++++++++++
> > >  .../i915/gt/uc/intel_guc_submission_types.h   |  3 ++
> > >  2 files changed, 39 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > index afb9b4bb8971..2d8296bcc583 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > > @@ -105,15 +105,21 @@ static bool tasklet_blocked(struct guc_submit_engine *gse)
> > >  	return test_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
> > >  }
> > >  
> > > +/* 2 seconds seems like a reasonable timeout waiting for a G2H */
> > > +#define MAX_TASKLET_BLOCKED_NS	2000000000
> > >  static void set_tasklet_blocked(struct guc_submit_engine *gse)
> > >  {
> > >  	lockdep_assert_held(&gse->sched_engine.lock);
> > > +	hrtimer_start_range_ns(&gse->hang_timer,
> > > +			       ns_to_ktime(MAX_TASKLET_BLOCKED_NS), 0,
> > > +			       HRTIMER_MODE_REL_PINNED);
> > >  	set_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
> > 
> > So with drm/scheduler the reset handling is assumed to be
> > single-threaded, and there's quite complex rules around that. I've
> > > recently worked with Boris Brezillon to clarify all this a bit and
> > improve docs. Does this all still work in that glorious future? Might be
> > good to at least sprinkle some comments/thoughts around in the commit
> > message about the envisaged future direction for all this stuff, to keep
> > people in the loop. Especially future people.
> > 
> > Ofc plan is still to just largely land all this.
> > 
> > Also: set_bit is an unordered atomic, which means you need barriers, which
> > > means ... *insert the full rant about justifying/documenting lockless
> > > algorithms from earlier*
> >
> 
> lockdep_assert_held(&gse->sched_engine.lock);
> 
> Not lockless. Also spin locks act as barriers, right?

Well if that spinlock is protecting that bit then that's good, but then it
shouldn't be an atomic set_bit. In that case:
- either make the entire bitfield non-atomic so it's clear there's boring
  dumb locking going on
- or split out your new bit into a separate field so that there's no false
  sharing with the existing bitfield state machinery, and add a kernel doc
  to that field explaining the locking

set_bit itself is atomic and unordered, which means you need barriers and
all that. If you don't have a lockless algorithm, don't use atomic bitops,
to avoid confusing readers, because set_bit/test_bit sets off all the
warning bells.
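
As a rough userspace analogy for "atomic but unordered", the sketch below uses C11 atomics. It is not kernel code and C11 is not the kernel memory model; `memory_order_relaxed` is used here only as the closest stand-in for an unordered atomic bitop, and `struct engine_state`, `STATE_BLOCKED` and the helper names are hypothetical. The relaxed variant sets the bit atomically but promises nothing about surrounding plain stores, while the release/acquire pair is what a lockless publisher and reader would need.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define STATE_BLOCKED (1ul << 0) /* hypothetical flag bit */

struct engine_state {
	atomic_ulong flags;
	int payload; /* plain data the flag is meant to publish */
};

static void engine_state_init(struct engine_state *s)
{
	atomic_init(&s->flags, 0);
	s->payload = 0;
}

/* Atomic read-modify-write with no ordering against other accesses. */
static void set_blocked_unordered(struct engine_state *s)
{
	atomic_fetch_or_explicit(&s->flags, STATE_BLOCKED,
				 memory_order_relaxed);
}

/* Release: the payload store is ordered before the flag becomes visible. */
static void publish_blocked(struct engine_state *s, int payload)
{
	s->payload = payload;
	atomic_fetch_or_explicit(&s->flags, STATE_BLOCKED,
				 memory_order_release);
}

/* Acquire: pairs with the release above, so payload is safe to read. */
static bool read_if_blocked(struct engine_state *s, int *payload)
{
	assert(payload != NULL);
	if (!(atomic_load_explicit(&s->flags, memory_order_acquire) &
	      STATE_BLOCKED))
		return false;
	*payload = s->payload;
	return true;
}
```

A reader seeing the bit set by `set_blocked_unordered()` from another thread could not assume anything about `payload`; that ordering has to be added and documented explicitly.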

And yes it's annoying that for bitops the atomic ones don't have an
atomic_ prefix. The non-atomic ones have a __ prefix. This is honestly why
I don't think we should use bitfields as much as we do, because the main
use-case for them is when you have bitfields which are longer than 64 bits.
They come from the cpumask world, and Linux supports a lot of CPUs.

Open-coding non-atomic simple bitfields with the usual C operators is
perfectly fine and legible imo. But that part is maybe more a bikeshed.
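
A minimal userspace sketch of that alternative, assuming the flag really is only ever touched under one lock: a plain `unsigned int` field manipulated with ordinary C operators. The spinlock is built from a C11 `atomic_flag` purely so the example is self-contained; `struct engine`, the `lock_held` debugging field and the helper names are hypothetical and only loosely mirror the functions in the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define STATE_TASKLET_BLOCKED (1u << 0) /* hypothetical flag bit */

struct engine {
	atomic_flag lock;   /* tiny spinlock, protects flags */
	bool lock_held;     /* crude stand-in for lockdep_assert_held() */
	unsigned int flags; /* plain field: only accessed under lock */
};

static void engine_init(struct engine *e)
{
	atomic_flag_clear(&e->lock);
	e->lock_held = false;
	e->flags = 0;
}

static void engine_lock(struct engine *e)
{
	/* Acquire on lock and release on unlock provide the ordering. */
	while (atomic_flag_test_and_set_explicit(&e->lock,
						 memory_order_acquire))
		; /* spin */
	e->lock_held = true;
}

static void engine_unlock(struct engine *e)
{
	e->lock_held = false;
	atomic_flag_clear_explicit(&e->lock, memory_order_release);
}

static void set_tasklet_blocked(struct engine *e)
{
	assert(e->lock_held);
	e->flags |= STATE_TASKLET_BLOCKED; /* non-atomic on purpose */
}

static void clr_tasklet_blocked(struct engine *e)
{
	assert(e->lock_held);
	e->flags &= ~STATE_TASKLET_BLOCKED;
}

static bool tasklet_blocked(struct engine *e)
{
	assert(e->lock_held);
	return e->flags & STATE_TASKLET_BLOCKED;
}
```

Because every access sits between `engine_lock()` and `engine_unlock()`, the lock's acquire/release semantics supply the ordering, and the plain `|=` / `&= ~` operators signal to the reader that nothing lockless is going on.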

> > But I think this all falls out with the removal of the guc-id allocation
> > scheme?
> 
> Yes, this patch is getting deleted.

That works too :-)
-Daniel

> 
> Matt
> 
> > -Daniel
> > 
> > >  }
> > >  
> > >  static void __clr_tasklet_blocked(struct guc_submit_engine *gse)
> > >  {
> > >  	lockdep_assert_held(&gse->sched_engine.lock);
> > > +	hrtimer_cancel(&gse->hang_timer);
> > >  	clear_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
> > >  }
> > >  
> > > @@ -1028,6 +1034,7 @@ static void disable_submission(struct intel_guc *guc)
> > >  		if (__tasklet_is_enabled(&sched_engine->tasklet)) {
> > >  			GEM_BUG_ON(!guc->ct.enabled);
> > >  			__tasklet_disable_sync_once(&sched_engine->tasklet);
> > > +			hrtimer_try_to_cancel(&guc->gse[i]->hang_timer);
> > >  			sched_engine->tasklet.callback = NULL;
> > >  		}
> > >  	}
> > > @@ -3750,6 +3757,33 @@ static void guc_sched_engine_destroy(struct kref *kref)
> > >  	kfree(gse);
> > >  }
> > >  
> > > +static enum hrtimer_restart gse_hang(struct hrtimer *hrtimer)
> > > +{
> > > +	struct guc_submit_engine *gse =
> > > +		container_of(hrtimer, struct guc_submit_engine, hang_timer);
> > > +	struct intel_guc *guc = gse->sched_engine.private_data;
> > > +
> > > +#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
> > > +	if (guc->gse_hang_expected)
> > > +		drm_dbg(&guc_to_gt(guc)->i915->drm,
> > > +			"GSE[%i] hung, disabling submission", gse->id);
> > > +	else
> > > +		drm_err(&guc_to_gt(guc)->i915->drm,
> > > +			"GSE[%i] hung, disabling submission", gse->id);
> > > +#else
> > > +	drm_err(&guc_to_gt(guc)->i915->drm,
> > > +		"GSE[%i] hung, disabling submission", gse->id);
> > > +#endif
> > > +
> > > +	/*
> > > +	 * Tasklet not making forward progress, disable submission which in turn
> > > +	 * will kick in the heartbeat to do a full GPU reset.
> > > +	 */
> > > +	disable_submission(guc);
> > > +
> > > +	return HRTIMER_NORESTART;
> > > +}
> > > +
> > >  static void guc_submit_engine_init(struct intel_guc *guc,
> > >  				   struct guc_submit_engine *gse,
> > >  				   int id)
> > > @@ -3767,6 +3801,8 @@ static void guc_submit_engine_init(struct intel_guc *guc,
> > >  	sched_engine->retire_inflight_request_prio =
> > >  		guc_retire_inflight_request_prio;
> > >  	sched_engine->private_data = guc;
> > > +	hrtimer_init(&gse->hang_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> > > +	gse->hang_timer.function = gse_hang;
> > >  	gse->id = id;
> > >  }
> > >  
> > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
> > > index a5933e07bdd2..eae2e9725ede 100644
> > > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
> > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
> > > @@ -6,6 +6,8 @@
> > >  #ifndef _INTEL_GUC_SUBMISSION_TYPES_H_
> > >  #define _INTEL_GUC_SUBMISSION_TYPES_H_
> > >  
> > > +#include <linux/xarray.h>
> > > +
> > >  #include "gt/intel_engine_types.h"
> > >  #include "gt/intel_context_types.h"
> > >  #include "i915_scheduler_types.h"
> > > @@ -41,6 +43,7 @@ struct guc_submit_engine {
> > >  	unsigned long flags;
> > >  	int total_num_rq_with_no_guc_id;
> > >  	atomic_t num_guc_ids_not_ready;
> > > +	struct hrtimer hang_timer;
> > >  	int id;
> > >  
> > >  	/*
> > > -- 
> > > 2.28.0
> > > 
> > 
> > -- 
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

Patch

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index afb9b4bb8971..2d8296bcc583 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -105,15 +105,21 @@  static bool tasklet_blocked(struct guc_submit_engine *gse)
 	return test_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
 }
 
+/* 2 seconds seems like a reasonable timeout waiting for a G2H */
+#define MAX_TASKLET_BLOCKED_NS	2000000000
 static void set_tasklet_blocked(struct guc_submit_engine *gse)
 {
 	lockdep_assert_held(&gse->sched_engine.lock);
+	hrtimer_start_range_ns(&gse->hang_timer,
+			       ns_to_ktime(MAX_TASKLET_BLOCKED_NS), 0,
+			       HRTIMER_MODE_REL_PINNED);
 	set_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
 }
 
 static void __clr_tasklet_blocked(struct guc_submit_engine *gse)
 {
 	lockdep_assert_held(&gse->sched_engine.lock);
+	hrtimer_cancel(&gse->hang_timer);
 	clear_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
 }
 
@@ -1028,6 +1034,7 @@  static void disable_submission(struct intel_guc *guc)
 		if (__tasklet_is_enabled(&sched_engine->tasklet)) {
 			GEM_BUG_ON(!guc->ct.enabled);
 			__tasklet_disable_sync_once(&sched_engine->tasklet);
+			hrtimer_try_to_cancel(&guc->gse[i]->hang_timer);
 			sched_engine->tasklet.callback = NULL;
 		}
 	}
@@ -3750,6 +3757,33 @@  static void guc_sched_engine_destroy(struct kref *kref)
 	kfree(gse);
 }
 
+static enum hrtimer_restart gse_hang(struct hrtimer *hrtimer)
+{
+	struct guc_submit_engine *gse =
+		container_of(hrtimer, struct guc_submit_engine, hang_timer);
+	struct intel_guc *guc = gse->sched_engine.private_data;
+
+#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
+	if (guc->gse_hang_expected)
+		drm_dbg(&guc_to_gt(guc)->i915->drm,
+			"GSE[%i] hung, disabling submission", gse->id);
+	else
+		drm_err(&guc_to_gt(guc)->i915->drm,
+			"GSE[%i] hung, disabling submission", gse->id);
+#else
+	drm_err(&guc_to_gt(guc)->i915->drm,
+		"GSE[%i] hung, disabling submission", gse->id);
+#endif
+
+	/*
+	 * Tasklet not making forward progress, disable submission which in turn
+	 * will kick in the heartbeat to do a full GPU reset.
+	 */
+	disable_submission(guc);
+
+	return HRTIMER_NORESTART;
+}
+
 static void guc_submit_engine_init(struct intel_guc *guc,
 				   struct guc_submit_engine *gse,
 				   int id)
@@ -3767,6 +3801,8 @@  static void guc_submit_engine_init(struct intel_guc *guc,
 	sched_engine->retire_inflight_request_prio =
 		guc_retire_inflight_request_prio;
 	sched_engine->private_data = guc;
+	hrtimer_init(&gse->hang_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	gse->hang_timer.function = gse_hang;
 	gse->id = id;
 }
 
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
index a5933e07bdd2..eae2e9725ede 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
@@ -6,6 +6,8 @@ 
 #ifndef _INTEL_GUC_SUBMISSION_TYPES_H_
 #define _INTEL_GUC_SUBMISSION_TYPES_H_
 
+#include <linux/xarray.h>
+
 #include "gt/intel_engine_types.h"
 #include "gt/intel_context_types.h"
 #include "i915_scheduler_types.h"
@@ -41,6 +43,7 @@  struct guc_submit_engine {
 	unsigned long flags;
 	int total_num_rq_with_no_guc_id;
 	atomic_t num_guc_ids_not_ready;
+	struct hrtimer hang_timer;
 	int id;
 
 	/*
