diff mbox

[05/20] drm/i915: Cancel reset-engine if we couldn't find an active request

Message ID 20170515212001.16418-1-michel.thierry@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Michel Thierry May 15, 2017, 9:20 p.m. UTC
Before reseting an engine, check if there is an active request, and if
the _hung_ request has completed. In these two cases, the seqno has moved
after hang declaration and we can skip the reset.

Also store the active request so that we only search for it once.

v2: Check for request completion inside _prepare_engine, don't use
ECANCELED, remove unnecessary null checks (Chris).

Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c | 21 +++++++++++++------
 drivers/gpu/drm/i915/i915_drv.h |  6 ++++--
 drivers/gpu/drm/i915/i915_gem.c | 45 ++++++++++++++++++++++++++++-------------
 3 files changed, 50 insertions(+), 22 deletions(-)

Comments

Chris Wilson May 15, 2017, 9:31 p.m. UTC | #1
On Mon, May 15, 2017 at 02:20:01PM -0700, Michel Thierry wrote:
> @@ -2827,21 +2830,34 @@ int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
>  
>  	if (engine_stalled(engine)) {
>  		request = i915_gem_find_active_request(engine);
> -		if (request && request->fence.error == -EIO)
> -			err = -EIO; /* Previous reset failed! */
> +
> +		if (request) {
> +			if (request->fence.error == -EIO)
> +				return ERR_PTR(-EIO); /* Previous reset failed! */
> +
> +			if (__i915_gem_request_completed(request,
> +							 engine->hangcheck.seqno))

This is not the seqno for the request, so this is incorrect. It will
judge that the request was preempted (as hangcheck.seqno must be less
thn request->global_seqno) and so conclude that the request was never
completed.

You just want if (i915_gem_request_completed(request))
-Chris
Chris Wilson May 15, 2017, 9:47 p.m. UTC | #2
On Mon, May 15, 2017 at 10:31:58PM +0100, Chris Wilson wrote:
> On Mon, May 15, 2017 at 02:20:01PM -0700, Michel Thierry wrote:
> > @@ -2827,21 +2830,34 @@ int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
> >  
> >  	if (engine_stalled(engine)) {
> >  		request = i915_gem_find_active_request(engine);
> > -		if (request && request->fence.error == -EIO)
> > -			err = -EIO; /* Previous reset failed! */
> > +
> > +		if (request) {
> > +			if (request->fence.error == -EIO)
> > +				return ERR_PTR(-EIO); /* Previous reset failed! */
> > +
> > +			if (__i915_gem_request_completed(request,
> > +							 engine->hangcheck.seqno))
> 
> This is not the seqno for the request, so this is incorrect. It will
> judge that the request was preempted (as hangcheck.seqno must be less
> thn request->global_seqno) and so conclude that the request was never
> completed.
> 
> You just want if (i915_gem_request_completed(request))

Also not here. This pardon check should be deferred to the caller just
before commiting to thre reset. In the case of global reset, we want to
gather up all the engines' active requests first, complete our
preparations and then double check the engine was hung.
-Chris
Michel Thierry May 15, 2017, 10:25 p.m. UTC | #3
On 5/15/2017 2:47 PM, Chris Wilson wrote:
> On Mon, May 15, 2017 at 10:31:58PM +0100, Chris Wilson wrote:
>> On Mon, May 15, 2017 at 02:20:01PM -0700, Michel Thierry wrote:
>>> @@ -2827,21 +2830,34 @@ int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
>>>
>>>  	if (engine_stalled(engine)) {
>>>  		request = i915_gem_find_active_request(engine);
>>> -		if (request && request->fence.error == -EIO)
>>> -			err = -EIO; /* Previous reset failed! */
>>> +
>>> +		if (request) {
>>> +			if (request->fence.error == -EIO)
>>> +				return ERR_PTR(-EIO); /* Previous reset failed! */
>>> +
>>> +			if (__i915_gem_request_completed(request,
>>> +							 engine->hangcheck.seqno))
>>
>> This is not the seqno for the request, so this is incorrect. It will
>> judge that the request was preempted (as hangcheck.seqno must be less
>> thn request->global_seqno) and so conclude that the request was never
>> completed.
>>
>> You just want if (i915_gem_request_completed(request))

Thanks, I'll change the function.

>
> Also not here. This pardon check should be deferred to the caller just
> before commiting to thre reset. In the case of global reset, we want to
> gather up all the engines' active requests first, complete our
> preparations and then double check the engine was hung.

i915_reset_engine calls this directly, but 'full reset' [from 
i915_gem_reset_prepare()] would not be affected and it won't pardon 
anything... i915_gem_reset_engine is doing the double check you mention.

-Michel
Chris Wilson May 16, 2017, 7:54 a.m. UTC | #4
On Mon, May 15, 2017 at 03:25:27PM -0700, Michel Thierry wrote:
> On 5/15/2017 2:47 PM, Chris Wilson wrote:
> >On Mon, May 15, 2017 at 10:31:58PM +0100, Chris Wilson wrote:
> >>On Mon, May 15, 2017 at 02:20:01PM -0700, Michel Thierry wrote:
> >>>@@ -2827,21 +2830,34 @@ int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
> >>>
> >>> 	if (engine_stalled(engine)) {
> >>> 		request = i915_gem_find_active_request(engine);
> >>>-		if (request && request->fence.error == -EIO)
> >>>-			err = -EIO; /* Previous reset failed! */
> >>>+
> >>>+		if (request) {
> >>>+			if (request->fence.error == -EIO)
> >>>+				return ERR_PTR(-EIO); /* Previous reset failed! */
> >>>+
> >>>+			if (__i915_gem_request_completed(request,
> >>>+							 engine->hangcheck.seqno))
> >>
> >>This is not the seqno for the request, so this is incorrect. It will
> >>judge that the request was preempted (as hangcheck.seqno must be less
> >>thn request->global_seqno) and so conclude that the request was never
> >>completed.
> >>
> >>You just want if (i915_gem_request_completed(request))
> 
> Thanks, I'll change the function.
> 
> >
> >Also not here. This pardon check should be deferred to the caller just
> >before commiting to thre reset. In the case of global reset, we want to
> >gather up all the engines' active requests first, complete our
> >preparations and then double check the engine was hung.
> 
> i915_reset_engine calls this directly, but 'full reset' [from
> i915_gem_reset_prepare()] would not be affected and it won't pardon
> anything... i915_gem_reset_engine is doing the double check you
> mention.

Aye, but in the long run I was thinking of capturing this request in
engine->hangcheck.active_request and then we reuse that info in the later
phases.
-Chris
Michel Thierry May 17, 2017, 12:13 a.m. UTC | #5
On 16/05/17 00:54, Chris Wilson wrote:
> On Mon, May 15, 2017 at 03:25:27PM -0700, Michel Thierry wrote:
>> On 5/15/2017 2:47 PM, Chris Wilson wrote:
>>> On Mon, May 15, 2017 at 10:31:58PM +0100, Chris Wilson wrote:
>>>> On Mon, May 15, 2017 at 02:20:01PM -0700, Michel Thierry wrote:
>>>>> @@ -2827,21 +2830,34 @@ int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
>>>>>
>>>>> 	if (engine_stalled(engine)) {
>>>>> 		request = i915_gem_find_active_request(engine);
>>>>> -		if (request && request->fence.error == -EIO)
>>>>> -			err = -EIO; /* Previous reset failed! */
>>>>> +
>>>>> +		if (request) {
>>>>> +			if (request->fence.error == -EIO)
>>>>> +				return ERR_PTR(-EIO); /* Previous reset failed! */
>>>>> +
>>>>> +			if (__i915_gem_request_completed(request,
>>>>> +							 engine->hangcheck.seqno))
>>>>
>>>> This is not the seqno for the request, so this is incorrect. It will
>>>> judge that the request was preempted (as hangcheck.seqno must be less
>>>> thn request->global_seqno) and so conclude that the request was never
>>>> completed.
>>>>
>>>> You just want if (i915_gem_request_completed(request))
>>
>> Thanks, I'll change the function.
>>
>>>
>>> Also not here. This pardon check should be deferred to the caller just
>>> before commiting to thre reset. In the case of global reset, we want to
>>> gather up all the engines' active requests first, complete our
>>> preparations and then double check the engine was hung.
>>
>> i915_reset_engine calls this directly, but 'full reset' [from
>> i915_gem_reset_prepare()] would not be affected and it won't pardon
>> anything... i915_gem_reset_engine is doing the double check you
>> mention.
>
> Aye, but in the long run I was thinking of capturing this request in
> engine->hangcheck.active_request and then we reuse that info in the later
> phases.

Capture hangcheck.active_request during hangcheck_declare_hang? Or still 
here in reset_prepare?

Thanks
Chris Wilson May 17, 2017, 7:19 a.m. UTC | #6
On Tue, May 16, 2017 at 05:13:58PM -0700, Michel Thierry wrote:
> On 16/05/17 00:54, Chris Wilson wrote:
> >On Mon, May 15, 2017 at 03:25:27PM -0700, Michel Thierry wrote:
> >>On 5/15/2017 2:47 PM, Chris Wilson wrote:
> >>>On Mon, May 15, 2017 at 10:31:58PM +0100, Chris Wilson wrote:
> >>>>On Mon, May 15, 2017 at 02:20:01PM -0700, Michel Thierry wrote:
> >>>>>@@ -2827,21 +2830,34 @@ int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
> >>>>>
> >>>>>	if (engine_stalled(engine)) {
> >>>>>		request = i915_gem_find_active_request(engine);
> >>>>>-		if (request && request->fence.error == -EIO)
> >>>>>-			err = -EIO; /* Previous reset failed! */
> >>>>>+
> >>>>>+		if (request) {
> >>>>>+			if (request->fence.error == -EIO)
> >>>>>+				return ERR_PTR(-EIO); /* Previous reset failed! */
> >>>>>+
> >>>>>+			if (__i915_gem_request_completed(request,
> >>>>>+							 engine->hangcheck.seqno))
> >>>>
> >>>>This is not the seqno for the request, so this is incorrect. It will
> >>>>judge that the request was preempted (as hangcheck.seqno must be less
> >>>>thn request->global_seqno) and so conclude that the request was never
> >>>>completed.
> >>>>
> >>>>You just want if (i915_gem_request_completed(request))
> >>
> >>Thanks, I'll change the function.
> >>
> >>>
> >>>Also not here. This pardon check should be deferred to the caller just
> >>>before commiting to thre reset. In the case of global reset, we want to
> >>>gather up all the engines' active requests first, complete our
> >>>preparations and then double check the engine was hung.
> >>
> >>i915_reset_engine calls this directly, but 'full reset' [from
> >>i915_gem_reset_prepare()] would not be affected and it won't pardon
> >>anything... i915_gem_reset_engine is doing the double check you
> >>mention.
> >
> >Aye, but in the long run I was thinking of capturing this request in
> >engine->hangcheck.active_request and then we reuse that info in the later
> >phases.
> 
> Capture hangcheck.active_request during hangcheck_declare_hang? Or
> still here in reset_prepare?

Not in the hangcheck worker itself since we want that to be as lockless
as we can make it and since putting it inside the reset works just as
well, we should.
-Chris
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index d62793805794..6ee60c1e17ee 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1895,23 +1895,28 @@  int i915_reset_engine(struct intel_engine_cs *engine)
 	int ret;
 	struct drm_i915_private *dev_priv = engine->i915;
 	struct i915_gpu_error *error = &dev_priv->gpu_error;
+	struct drm_i915_gem_request *active_request;
 
 	GEM_BUG_ON(!test_bit(I915_RESET_ENGINE_IN_PROGRESS, &error->flags));
 
 	DRM_DEBUG_DRIVER("resetting %s\n", engine->name);
 
-	ret = i915_gem_reset_prepare_engine(engine);
-	if (ret) {
-		DRM_ERROR("Previous reset failed - promote to full reset\n");
+	active_request = i915_gem_reset_prepare_engine(engine);
+	if (!active_request) {
+		DRM_DEBUG_DRIVER("seqno moved after hang declaration, pardoned\n");
+		goto canceled;
+	} else if (IS_ERR(active_request)) {
+		DRM_DEBUG_DRIVER("Previous reset failed, promote to full reset\n");
+		ret = PTR_ERR(active_request);
 		goto out;
 	}
 
 	/*
-	 * the request that caused the hang is stuck on elsp, identify the
-	 * active request and drop it, adjust head to skip the offending
+	 * the request that caused the hang is stuck on elsp, we know the
+	 * active request and can drop it, adjust head to skip the offending
 	 * request to resume executing remaining requests in the queue.
 	 */
-	i915_gem_reset_engine(engine);
+	i915_gem_reset_engine(engine, active_request);
 
 	/* forcing engine to idle */
 	ret = intel_reset_engine_start(engine);
@@ -1942,6 +1947,10 @@  int i915_reset_engine(struct intel_engine_cs *engine)
 
 out:
 	return ret;
+
+canceled:
+	i915_gem_reset_finish_engine(engine);
+	return 0;
 }
 
 static int i915_pm_suspend(struct device *kdev)
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index a5b9c666b3bf..f8cbd286f904 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3370,14 +3370,16 @@  static inline u32 i915_reset_count(struct i915_gpu_error *error)
 	return READ_ONCE(error->reset_count);
 }
 
-int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine);
+struct drm_i915_gem_request *
+i915_gem_reset_prepare_engine(struct intel_engine_cs *engine);
 int i915_gem_reset_prepare(struct drm_i915_private *dev_priv);
 void i915_gem_reset(struct drm_i915_private *dev_priv);
 void i915_gem_reset_finish_engine(struct intel_engine_cs *engine);
 void i915_gem_reset_finish(struct drm_i915_private *dev_priv);
 void i915_gem_set_wedged(struct drm_i915_private *dev_priv);
 bool i915_gem_unset_wedged(struct drm_i915_private *dev_priv);
-void i915_gem_reset_engine(struct intel_engine_cs *engine);
+void i915_gem_reset_engine(struct intel_engine_cs *engine,
+			   struct drm_i915_gem_request *request);
 
 void i915_gem_init_mmio(struct drm_i915_private *i915);
 int __must_check i915_gem_init(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index b5dc073a5ddc..2e47678315d4 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2793,12 +2793,15 @@  static bool engine_stalled(struct intel_engine_cs *engine)
 	return true;
 }
 
-/* Ensure irq handler finishes, and not run again. */
-int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
+/*
+ * Ensure irq handler finishes, and not run again.
+ * For reset-engine we also store the active request so that we only search
+ * for it once.
+ */
+struct drm_i915_gem_request *
+i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
 {
-	struct drm_i915_gem_request *request;
-	int err = 0;
-
+	struct drm_i915_gem_request *request = NULL;
 
 	/* Prevent the signaler thread from updating the request
 	 * state (by calling dma_fence_signal) as we are processing
@@ -2827,21 +2830,34 @@  int i915_gem_reset_prepare_engine(struct intel_engine_cs *engine)
 
 	if (engine_stalled(engine)) {
 		request = i915_gem_find_active_request(engine);
-		if (request && request->fence.error == -EIO)
-			err = -EIO; /* Previous reset failed! */
+
+		if (request) {
+			if (request->fence.error == -EIO)
+				return ERR_PTR(-EIO); /* Previous reset failed! */
+
+			if (__i915_gem_request_completed(request,
+							 engine->hangcheck.seqno))
+				return NULL; /* request completed, skip reset */
+		}
 	}
 
-	return err;
+	return request;
 }
 
 int i915_gem_reset_prepare(struct drm_i915_private *dev_priv)
 {
 	struct intel_engine_cs *engine;
+	struct drm_i915_gem_request *request;
 	enum intel_engine_id id;
 	int err = 0;
 
-	for_each_engine(engine, dev_priv, id)
-		err = i915_gem_reset_prepare_engine(engine);
+	for_each_engine(engine, dev_priv, id) {
+		request = i915_gem_reset_prepare_engine(engine);
+		if (IS_ERR(request)) {
+			err = PTR_ERR(request);
+			break;
+		}
+	}
 
 	i915_gem_revoke_fences(dev_priv);
 
@@ -2928,11 +2944,12 @@  static bool i915_gem_reset_request(struct drm_i915_gem_request *request)
 	return guilty;
 }
 
-void i915_gem_reset_engine(struct intel_engine_cs *engine)
+void i915_gem_reset_engine(struct intel_engine_cs *engine,
+			   struct drm_i915_gem_request *request)
 {
-	struct drm_i915_gem_request *request;
+	if (!request)
+		request = i915_gem_find_active_request(engine);
 
-	request = i915_gem_find_active_request(engine);
 	if (request && i915_gem_reset_request(request)) {
 		DRM_DEBUG_DRIVER("resetting %s to restart from tail of request 0x%x\n",
 				 engine->name, request->global_seqno);
@@ -2958,7 +2975,7 @@  void i915_gem_reset(struct drm_i915_private *dev_priv)
 	for_each_engine(engine, dev_priv, id) {
 		struct i915_gem_context *ctx;
 
-		i915_gem_reset_engine(engine);
+		i915_gem_reset_engine(engine, NULL);
 		ctx = fetch_and_zero(&engine->last_retired_context);
 		if (ctx)
 			engine->context_unpin(engine, ctx);