Message ID: 1469551257-26803-6-git-send-email-arun.siluvery@linux.intel.com (mailing list archive)
State: New, archived
On Tue, Jul 26, 2016 at 05:40:51PM +0100, Arun Siluvery wrote:
> The current active request is the one that caused the hang so this is
> retrieved and removed from elsp queue, otherwise we cannot submit other
> workloads to be processed by GPU.
>
> A consistency check between HW and driver is performed to ensure that we
> are dropping the correct request. Since this request doesn't get executed
> anymore, we also need to advance the seqno to mark it as complete. Head
> pointer is advanced to skip the offending batch so that HW resumes
> execution other workloads. If HW and SW don't agree then we won't proceed
> with engine reset, this is treated as an error condition and we fallback to
> full gpu reset.
>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/intel_lrc.c | 116 +++++++++++++++++++++++++++++++++++++++
>  drivers/gpu/drm/i915/intel_lrc.h |   2 +
>  2 files changed, 118 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index daf1279..8fc5a3b 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1026,6 +1026,122 @@ void intel_lr_context_unpin(struct i915_gem_context *ctx,
>  	i915_gem_context_put(ctx);
>  }
>
> +static void intel_lr_context_resync(struct i915_gem_context *ctx,
> +				    struct intel_engine_cs *engine)
> +{
> +	u32 head;
> +	u32 head_addr, tail_addr;
> +	u32 *reg_state;
> +	struct intel_ringbuffer *ringbuf;
> +	struct drm_i915_private *dev_priv = engine->i915;
> +
> +	ringbuf = ctx->engine[engine->id].ringbuf;
> +	reg_state = ctx->engine[engine->id].lrc_reg_state;
> +
> +	head = I915_READ_HEAD(engine);
> +	head_addr = head & HEAD_ADDR;
> +	tail_addr = reg_state[CTX_RING_TAIL+1] & TAIL_ADDR;

?

We know where we want the head to be to emit the breadcrumb and complete
the request since we can record that when constructing the request. That
also neatly solves the riddle of how to update the hw state.

resync? intel_lr_context_reset_ring may be more apt, or maybe
intel_execlists_reset_request?
-Chris
On 26/07/2016 22:37, Chris Wilson wrote:
> On Tue, Jul 26, 2016 at 05:40:51PM +0100, Arun Siluvery wrote:
>> The current active request is the one that caused the hang so this is
>> retrieved and removed from elsp queue, otherwise we cannot submit other
>> workloads to be processed by GPU.
>>
>> A consistency check between HW and driver is performed to ensure that we
>> are dropping the correct request. Since this request doesn't get executed
>> anymore, we also need to advance the seqno to mark it as complete. Head
>> pointer is advanced to skip the offending batch so that HW resumes
>> execution other workloads. If HW and SW don't agree then we won't proceed
>> with engine reset, this is treated as an error condition and we fallback to
>> full gpu reset.
>>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
>> ---
>>  drivers/gpu/drm/i915/intel_lrc.c | 116 +++++++++++++++++++++++++++++++++++++++
>>  drivers/gpu/drm/i915/intel_lrc.h |   2 +
>>  2 files changed, 118 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
>> index daf1279..8fc5a3b 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>> @@ -1026,6 +1026,122 @@ void intel_lr_context_unpin(struct i915_gem_context *ctx,
>>  	i915_gem_context_put(ctx);
>>  }
>>
>> +static void intel_lr_context_resync(struct i915_gem_context *ctx,
>> +				    struct intel_engine_cs *engine)
>> +{
>> +	u32 head;
>> +	u32 head_addr, tail_addr;
>> +	u32 *reg_state;
>> +	struct intel_ringbuffer *ringbuf;
>> +	struct drm_i915_private *dev_priv = engine->i915;
>> +
>> +	ringbuf = ctx->engine[engine->id].ringbuf;
>> +	reg_state = ctx->engine[engine->id].lrc_reg_state;
>> +
>> +	head = I915_READ_HEAD(engine);
>> +	head_addr = head & HEAD_ADDR;
>> +	tail_addr = reg_state[CTX_RING_TAIL+1] & TAIL_ADDR;
>
> ?
>
> We know where we want the head to be to emit the breadcrumb and complete
> the request since we can record that when constructing the request. That
> also neatly solves the riddle of how to update the hw state.

We want to skip only MI_BATCH_BUFFER_START and continue as usual so just
using existing info.

> resync? intel_lr_context_reset_ring may be more apt, or maybe
> intel_execlists_reset_request?

resync because we read current state and update it.
intel_execlists_reset_request() sounds better, will change it as suggested.
thanks.

regards
Arun

> -Chris
>
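To make the "skip and continue" step concrete, here is a minimal sketch of the head fix-up distilled from the resync logic in the patch; the standalone helper and its plain-u32 interface are illustrative only, the patch itself operates on the context image and struct intel_ringbuffer:

/*
 * Sketch of the head adjustment performed after a hang: advance past the
 * instruction the engine is stuck on, clamp to the tail, and wrap at the
 * end of the ring.  Helper name and bare-u32 parameters are illustrative.
 */
static u32 skip_hung_instruction(u32 head_addr, u32 tail_addr, u32 ring_size)
{
	/* Force the head onto the next QWORD (8 byte) boundary. */
	head_addr = (head_addr + 7) & ~7u;

	if (head_addr > tail_addr)
		head_addr = tail_addr;	/* never run past the tail */
	else if (head_addr >= ring_size)
		head_addr = 0;		/* wrap around the ring */

	return head_addr;
}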
On Wed, Jul 27, 2016 at 12:54:44PM +0100, Arun Siluvery wrote:
> On 26/07/2016 22:37, Chris Wilson wrote:
> >On Tue, Jul 26, 2016 at 05:40:51PM +0100, Arun Siluvery wrote:
> >>The current active request is the one that caused the hang so this is
> >>retrieved and removed from elsp queue, otherwise we cannot submit other
> >>workloads to be processed by GPU.
> >>
> >>A consistency check between HW and driver is performed to ensure that we
> >>are dropping the correct request. Since this request doesn't get executed
> >>anymore, we also need to advance the seqno to mark it as complete. Head
> >>pointer is advanced to skip the offending batch so that HW resumes
> >>execution other workloads. If HW and SW don't agree then we won't proceed
> >>with engine reset, this is treated as an error condition and we fallback to
> >>full gpu reset.
> >>
> >>Cc: Chris Wilson <chris@chris-wilson.co.uk>
> >>Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> >>Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> >>---
> >> drivers/gpu/drm/i915/intel_lrc.c | 116 +++++++++++++++++++++++++++++++++++++++
> >> drivers/gpu/drm/i915/intel_lrc.h |   2 +
> >> 2 files changed, 118 insertions(+)
> >>
> >>diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> >>index daf1279..8fc5a3b 100644
> >>--- a/drivers/gpu/drm/i915/intel_lrc.c
> >>+++ b/drivers/gpu/drm/i915/intel_lrc.c
> >>@@ -1026,6 +1026,122 @@ void intel_lr_context_unpin(struct i915_gem_context *ctx,
> >> 	i915_gem_context_put(ctx);
> >> }
> >>
> >>+static void intel_lr_context_resync(struct i915_gem_context *ctx,
> >>+				    struct intel_engine_cs *engine)
> >>+{
> >>+	u32 head;
> >>+	u32 head_addr, tail_addr;
> >>+	u32 *reg_state;
> >>+	struct intel_ringbuffer *ringbuf;
> >>+	struct drm_i915_private *dev_priv = engine->i915;
> >>+
> >>+	ringbuf = ctx->engine[engine->id].ringbuf;
> >>+	reg_state = ctx->engine[engine->id].lrc_reg_state;
> >>+
> >>+	head = I915_READ_HEAD(engine);
> >>+	head_addr = head & HEAD_ADDR;
> >>+	tail_addr = reg_state[CTX_RING_TAIL+1] & TAIL_ADDR;
> >
> >?
> >
> >We know where we want the head to be to emit the breadcrumb and complete
> >the request since we can record that when constructing the request. That
> >also neatly solves the riddle of how to update the hw state.
>
> We want to skip only MI_BATCH_BUFFER_START and continue as usual so
> just using existing info.

That's exactly my point and why this approach is overkill since we
already know where we need to resume from.
-Chris
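For contrast, a rough sketch of the alternative Chris describes, where the resume point is recorded while the request is constructed and then written straight into the saved context image after a hang; the request_resume structure and both helpers are hypothetical names, not the driver's actual API:

/*
 * Hypothetical sketch: remember, at request-construction time, the ring
 * offset from which execution should resume once the request is treated
 * as complete, then apply it on reset without reading back the live HEAD.
 */
struct request_resume {
	u32 resume_head;	/* ring offset just past the request's breadcrumb */
};

/* Called while the request is being built, before submission. */
static void record_resume_point(struct request_resume *rr, u32 ring_tail_after_breadcrumb)
{
	rr->resume_head = ring_tail_after_breadcrumb;
}

/* Called from the reset path: fix up the saved context image directly. */
static void resume_after_hang(u32 *lrc_reg_state, u32 ring_head_value_idx,
			      const struct request_resume *rr)
{
	/*
	 * No HW read-back or consistency juggling needed: SW already knows
	 * where the ring should restart.  In the driver this slot would be
	 * the RING_HEAD value in the context image (CTX_RING_HEAD+1).
	 */
	lrc_reg_state[ring_head_value_idx] = rr->resume_head;
}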
To clarify what I mean about knowing what values we need to write into the
ring following the reset, please consider these poc patches which implement
request recovery following the global reset.
-Chris
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index daf1279..8fc5a3b 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1026,6 +1026,122 @@ void intel_lr_context_unpin(struct i915_gem_context *ctx,
 	i915_gem_context_put(ctx);
 }
 
+static void intel_lr_context_resync(struct i915_gem_context *ctx,
+				    struct intel_engine_cs *engine)
+{
+	u32 head;
+	u32 head_addr, tail_addr;
+	u32 *reg_state;
+	struct intel_ringbuffer *ringbuf;
+	struct drm_i915_private *dev_priv = engine->i915;
+
+	ringbuf = ctx->engine[engine->id].ringbuf;
+	reg_state = ctx->engine[engine->id].lrc_reg_state;
+
+	head = I915_READ_HEAD(engine);
+	head_addr = head & HEAD_ADDR;
+	tail_addr = reg_state[CTX_RING_TAIL+1] & TAIL_ADDR;
+
+	/*
+	 * Force the head to advance to the next QWORD. In most cases the
+	 * engine head pointer will automatically advance to the next
+	 * instruction as soon as it has read the current instruction,
+	 * without waiting for it to complete. This seems to be the default
+	 * behaviour, however an MBOX wait inserted directly to the VCS/BCS
+	 * engines does not behave in the same way, instead the head
+	 * pointer will still be pointing at the MBOX instruction until it
+	 * completes.
+	 */
+	head_addr = roundup(head_addr, 8);
+
+	if (head_addr > tail_addr)
+		head_addr = tail_addr;
+	else if (head_addr >= ringbuf->size)
+		head_addr = 0;
+
+	head &= ~HEAD_ADDR;
+	head |= (head_addr & HEAD_ADDR);
+
+	/* update head in ctx */
+	reg_state[CTX_RING_HEAD+1] = head;
+	I915_WRITE_HEAD(engine, head);
+
+	ringbuf->head = head;
+	ringbuf->last_retired_head = -1;
+	intel_ring_update_space(ringbuf);
+}
+
+/**
+ * intel_execlists_reset_prepare() - identifies the request that is
+ * hung and drops it
+ *
+ * Head is adjusted to skip the batch that caused the hang
+ *
+ * @engine: Engine that is currently hung
+ *
+ * Returns:
+ * 0 - on success
+ * nonzero errorcode otherwise
+ */
+int intel_execlists_reset_prepare(struct intel_engine_cs *engine)
+{
+	struct drm_i915_gem_request *req;
+	bool continue_with_reset;
+
+	spin_lock_bh(&engine->execlist_lock);
+
+	req = list_first_entry_or_null(&engine->execlist_queue,
+				       struct drm_i915_gem_request,
+				       execlist_link);
+
+	/*
+	 * Only acknowledge the request in the execlist queue if it's actually
+	 * been submitted to hardware, otherwise it cannot cause hang.
+	 */
+	if (req && req->ctx && req->elsp_submitted) {
+		u32 execlist_status;
+		u32 hw_context;
+		u32 hw_active;
+		struct drm_i915_private *dev_priv = engine->i915;
+
+		hw_context = I915_READ(RING_EXECLIST_STATUS_HI(engine));
+		execlist_status = I915_READ(RING_EXECLIST_STATUS_LO(engine));
+		hw_active = ((execlist_status & EXECLIST_STATUS_ELEMENT0_ACTIVE) ||
+			     (execlist_status & EXECLIST_STATUS_ELEMENT1_ACTIVE));
+
+		continue_with_reset = hw_active && hw_context == req->ctx->hw_id;
+		if (!continue_with_reset) {
+			DRM_ERROR("GPU hung when HW is not active !!\n");
+			goto unlock;
+		}
+
+		/*
+		 * GPU is now hung and the request that caused it
+		 * will be dropped so mark it as completed
+		 */
+		intel_write_status_page(engine, I915_GEM_HWS_INDEX, req->fence.seqno);
+
+		intel_lr_context_resync(req->ctx, engine);
+
+		/*
+		 * remove the request from the elsp queue so that
+		 * engine can resume execution after reset when new
+		 * requests are submitted
+		 */
+		if (!--req->elsp_submitted) {
+			list_del(&req->execlist_link);
+			i915_gem_request_put(req);
+		}
+	} else {
+		WARN(1, "GPU hang detected with no active request\n");
+		continue_with_reset = false;
+	}
+
+unlock:
+	spin_unlock_bh(&engine->execlist_lock);
+	return !continue_with_reset;
+}
+
 static int intel_logical_ring_workarounds_emit(struct drm_i915_gem_request *req)
 {
 	int ret, i;
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index 3828730..1171ea1 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -31,6 +31,8 @@
 /* Execlists regs */
 #define RING_ELSP(engine)			_MMIO((engine)->mmio_base + 0x230)
 #define RING_EXECLIST_STATUS_LO(engine)		_MMIO((engine)->mmio_base + 0x234)
+#define EXECLIST_STATUS_ELEMENT0_ACTIVE		(1 << 14)
+#define EXECLIST_STATUS_ELEMENT1_ACTIVE		(1 << 15)
 #define RING_EXECLIST_STATUS_HI(engine)		_MMIO((engine)->mmio_base + 0x234 + 4)
 #define RING_CONTEXT_CONTROL(engine)		_MMIO((engine)->mmio_base + 0x244)
 #define CTX_CTRL_INHIBIT_SYN_CTX_SWITCH		(1 << 3)
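As a usage sketch (not part of the patch), intel_execlists_reset_prepare() could be called from a per-engine recovery path along these lines, falling back to a full GPU reset when the HW/SW consistency check fails; i915_do_engine_reset() and i915_full_gpu_reset() are hypothetical names used only for illustration:

/* Hypothetical helpers, named only for this sketch: */
int i915_full_gpu_reset(struct drm_i915_private *dev_priv);
int i915_do_engine_reset(struct intel_engine_cs *engine);

static int engine_hang_recovery(struct intel_engine_cs *engine)
{
	/* Drop the hung request and fix up the saved context image first. */
	if (intel_execlists_reset_prepare(engine))
		/* HW and SW disagree about what hung: play it safe. */
		return i915_full_gpu_reset(engine->i915);

	/* Reset only this engine; remaining queued requests then resume. */
	return i915_do_engine_reset(engine);
}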
The current active request is the one that caused the hang so this is
retrieved and removed from the elsp queue, otherwise we cannot submit other
workloads to be processed by the GPU.

A consistency check between HW and driver is performed to ensure that we
are dropping the correct request. Since this request doesn't get executed
anymore, we also need to advance the seqno to mark it as complete. The head
pointer is advanced to skip the offending batch so that the HW resumes
execution of other workloads. If HW and SW don't agree then we won't proceed
with the engine reset; this is treated as an error condition and we fall back
to a full GPU reset.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c | 116 +++++++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/intel_lrc.h |   2 +
 2 files changed, 118 insertions(+)
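For reference, the HW/SW consistency check described above boils down to the following predicate, condensed from the patch; the helper name is illustrative, the registers and bits are the ones the patch adds:

/*
 * Only proceed with the engine reset if the HW reports an active execlist
 * element and the context it is executing matches the request we are about
 * to drop.  Mirrors the check in intel_execlists_reset_prepare().
 */
static bool hw_agrees_with_sw(struct intel_engine_cs *engine,
			      struct drm_i915_gem_request *req)
{
	struct drm_i915_private *dev_priv = engine->i915;
	u32 status = I915_READ(RING_EXECLIST_STATUS_LO(engine));
	u32 hw_context = I915_READ(RING_EXECLIST_STATUS_HI(engine));
	bool hw_active = (status & EXECLIST_STATUS_ELEMENT0_ACTIVE) ||
			 (status & EXECLIST_STATUS_ELEMENT1_ACTIVE);

	return hw_active && hw_context == req->ctx->hw_id;
}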