[5/7] drm/i915: queue hangcheck on reset

Message ID	1372861332-6308-6-git-send-email-mika.kuoppala@intel.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <intel-gfx-bounces+patchwork-intel-gfx=patchwork.kernel.org@lists.freedesktop.org> From: Mika Kuoppala <mika.kuoppala@linux.intel.com> To: intel-gfx@lists.freedesktop.org Date: Wed, 3 Jul 2013 17:22:10 +0300 Message-Id: <1372861332-6308-6-git-send-email-mika.kuoppala@intel.com> In-Reply-To: <1372861332-6308-1-git-send-email-mika.kuoppala@intel.com> References: <1372861332-6308-1-git-send-email-mika.kuoppala@intel.com> Subject: [Intel-gfx] [PATCH 5/7] drm/i915: queue hangcheck on reset Precedence: list MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: intel-gfx-bounces+patchwork-intel-gfx=patchwork.kernel.org@lists.freedesktop.org Errors-To: intel-gfx-bounces+patchwork-intel-gfx=patchwork.kernel.org@lists.freedesktop.org

Message ID

1372861332-6308-6-git-send-email-mika.kuoppala@intel.com (mailing list archive)

State

New, archived

Headers

From: Mika Kuoppala <mika.kuoppala@linux.intel.com>
To: intel-gfx@lists.freedesktop.org
Date: Wed,  3 Jul 2013 17:22:10 +0300
Message-Id: <1372861332-6308-6-git-send-email-mika.kuoppala@intel.com>
In-Reply-To: <1372861332-6308-1-git-send-email-mika.kuoppala@intel.com>
References: <1372861332-6308-1-git-send-email-mika.kuoppala@intel.com>
Subject: [Intel-gfx] [PATCH 5/7] drm/i915: queue hangcheck on reset
Precedence: list
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Sender: intel-gfx-bounces+patchwork-intel-gfx=patchwork.kernel.org@lists.freedesktop.org
Errors-To: intel-gfx-bounces+patchwork-intel-gfx=patchwork.kernel.org@lists.freedesktop.org

Commit Message

Mika Kuoppala July 3, 2013, 2:22 p.m. UTC

From: Mika Kuoppala <mika.kuoppala@linux.intel.com>

Upon resetting the GPU, we begin processing batches once more, so
reset the hangcheck timer.

v2: kicking inside reset instead of hangcheck_elapsed and
    sane commit message by Chris Wilson

Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_irq.c |    2 ++
 1 file changed, 2 insertions(+)

Comments

Daniel Vetter July 16, 2013, 8:49 a.m. UTC | #1

On Wed, Jul 03, 2013 at 05:22:10PM +0300, Mika Kuoppala wrote:
> From: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> 
> Upon resetting the GPU, we begin processing batches once more, so
> reset the hangcheck timer.
> 
> v2: kicking inside reset instead of hangcheck_elapsed and
>     sane commit message by Chris Wilson
> 
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_irq.c |    2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index b0fec7f..1b0e903 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -1452,6 +1452,8 @@ static void i915_error_work_func(struct work_struct *work)
>  
>  			kobject_uevent_env(&dev->primary->kdev.kobj,
>  					   KOBJ_CHANGE, reset_done_event);
> +
> +			i915_queue_hangcheck(dev);

Hm, what exactly is this for? After reset we don't have any batches
running right now (since we reset all batches), so I don't understand why
we need this. And the commit message also doesn't give a reason.

Aside a little maintainer bikeshed: The important thing a commit message
should cover isn't really the "what changed" since the code should be
readable enough to not need additional explanation, but the "why". In some
cases the important part of the "why" is "why not a different approach" or
"why does this work at all" (e.g. for bugfixes).

Cheers, Daniel

Chris Wilson July 16, 2013, 9:16 a.m. UTC | #2

On Tue, Jul 16, 2013 at 10:49:29AM +0200, Daniel Vetter wrote:
> On Wed, Jul 03, 2013 at 05:22:10PM +0300, Mika Kuoppala wrote:
> > From: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > 
> > Upon resetting the GPU, we begin processing batches once more, so
> > reset the hangcheck timer.
> > 
> > v2: kicking inside reset instead of hangcheck_elapsed and
> >     sane commit message by Chris Wilson
> > 
> > Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> > ---
> >  drivers/gpu/drm/i915/i915_irq.c |    2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> > index b0fec7f..1b0e903 100644
> > --- a/drivers/gpu/drm/i915/i915_irq.c
> > +++ b/drivers/gpu/drm/i915/i915_irq.c
> > @@ -1452,6 +1452,8 @@ static void i915_error_work_func(struct work_struct *work)
> >  
> >  			kobject_uevent_env(&dev->primary->kdev.kobj,
> >  					   KOBJ_CHANGE, reset_done_event);
> > +
> > +			i915_queue_hangcheck(dev);
> 
> Hm, what exactly is this for? After reset we don't have any batches
> running right now (since we reset all batches), so I don't understand why
> we need this. And the commit message also doesn't give a reason.

Because our code is a little snafu after reset. We do have batches still
queued, but the rings are incorrectly reset. Instead they should just be
promoted past the failed batch and processing restarted. If you get
extremely fancy, we can no-op out any requests from the hung !default
context to prevent incorrect state leakage. This also likely explains
why we end up with active bo stuck after becoming wedged.
-Chris

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index b0fec7f..1b0e903 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1452,6 +1452,8 @@  static void i915_error_work_func(struct work_struct *work)
 
 			kobject_uevent_env(&dev->primary->kdev.kobj,
 					   KOBJ_CHANGE, reset_done_event);
+
+			i915_queue_hangcheck(dev);
 		} else {
 			atomic_set(&error->reset_counter, I915_WEDGED);
 		}

[5/7] drm/i915: queue hangcheck on reset

Commit Message

Comments

Patch