diff mbox

[v2] drm/i915: Avoid GPU hang when coming out of S3 or S4

Message ID 1430205494-12623-1-git-send-email-peter.antoine@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Peter Antoine April 28, 2015, 7:18 a.m. UTC
This patch fixed a timing issue that causes a GPU hang when a the system
comes out of power saving.

During pm_resume, We are submitting batchbuffers before enabling Interrupts
this is causing us to miss the context switch interrupt, and in consequence
intel_execlists_handle_ctx_events is not triggered.

This patch is based on a patch from Deepak S <deepak.s@intel.com> from
another platform.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=89600
Signed-off-by: Peter Antoine <peter.antoine@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

Comments

Chris Wilson April 28, 2015, 7:23 a.m. UTC | #1
On Tue, Apr 28, 2015 at 08:18:14AM +0100, Peter Antoine wrote:
> This patch fixed a timing issue that causes a GPU hang when a the system
> comes out of power saving.
> 
> During pm_resume, We are submitting batchbuffers before enabling Interrupts
> this is causing us to miss the context switch interrupt, and in consequence
> intel_execlists_handle_ctx_events is not triggered.

Ah. Thanks. How about a code comment? :)

But that doesn't explain moving init_hw...
-Chris
deepak.s@intel.com April 28, 2015, 8:29 a.m. UTC | #2
Thanks Chirs for review, We moved  "Init_hw" to initialize WA's before any BB submission. 

Init_hw calls " init_clock_gating"

Thanks
Deepak

-----Original Message-----
From: Chris Wilson [mailto:chris@chris-wilson.co.uk] 
Sent: Tuesday, April 28, 2015 12:53 PM
To: Antoine, Peter
Cc: intel-gfx@lists.freedesktop.org; S, Deepak; Tian, YeX; Weinehall, David
Subject: Re: [Intel-gfx] [PATCH v2] drm/i915: Avoid GPU hang when coming out of S3 or S4

On Tue, Apr 28, 2015 at 08:18:14AM +0100, Peter Antoine wrote:
> This patch fixed a timing issue that causes a GPU hang when a the 
> system comes out of power saving.
> 
> During pm_resume, We are submitting batchbuffers before enabling 
> Interrupts this is causing us to miss the context switch interrupt, 
> and in consequence intel_execlists_handle_ctx_events is not triggered.

Ah. Thanks. How about a code comment? :)

But that doesn't explain moving init_hw...
-Chris

--
Chris Wilson, Intel Open Source Technology Centre
Chris Wilson April 28, 2015, 8:44 a.m. UTC | #3
On Tue, Apr 28, 2015 at 08:29:13AM +0000, S, Deepak wrote:
> Thanks Chirs for review, We moved  "Init_hw" to initialize WA's before any BB submission. 
> 
> Init_hw calls " init_clock_gating"

But you appreciate the same issue exists on other paths?
-Chris
deepak.s@linux.intel.com April 28, 2015, 8:55 a.m. UTC | #4
Yes agreed, we need to make changes in other paths :)


On Tuesday 28 April 2015 02:14 PM, Chris Wilson wrote:
> On Tue, Apr 28, 2015 at 08:29:13AM +0000, S, Deepak wrote:
>> Thanks Chirs for review, We moved  "Init_hw" to initialize WA's before any BB submission.
>>
>> Init_hw calls " init_clock_gating"
> But you appreciate the same issue exists on other paths?
> -Chris
>
Shuang He April 28, 2015, 9:28 a.m. UTC | #5
Tested-By: Intel Graphics QA PRTS (Patch Regression Test System Contact: shuang.he@intel.com)
Task id: 6271
-------------------------------------Summary-------------------------------------
Platform          Delta          drm-intel-nightly          Series Applied
PNV                                  276/276              276/276
ILK                 -1              302/302              301/302
SNB                                  318/318              318/318
IVB                                  341/341              341/341
BYT                                  287/287              287/287
BDW                                  318/318              318/318
-------------------------------------Detailed-------------------------------------
Platform  Test                                drm-intel-nightly          Series Applied
*ILK  igt@gem_fenced_exec_thrash@no-spare-fences-busy      PASS(4)      DMESG_WARN(1)PASS(1)
(dmesg patch applied)drm:i915_hangcheck_elapsed[i915]]*ERROR*Hangcheck_timer_elapsed...bsd_ring_idle@Hangcheck timer elapsed... bsd ring idle
Note: You need to pay more attention to line start with '*'
Peter Antoine April 28, 2015, 2:38 p.m. UTC | #6
So is the plan to push these patches and have follow-on work to cover the other paths?
As this fixes the Bugzilla issue that has been raised.

Peter.

-----Original Message-----
From: Deepak S [mailto:deepak.s@linux.intel.com] 

Sent: Tuesday, April 28, 2015 9:56 AM
To: Chris Wilson; S, Deepak; Antoine, Peter; intel-gfx@lists.freedesktop.org; Tian, YeX; Weinehall, David
Subject: Re: [Intel-gfx] [PATCH v2] drm/i915: Avoid GPU hang when coming out of S3 or S4

Yes agreed, we need to make changes in other paths :)


On Tuesday 28 April 2015 02:14 PM, Chris Wilson wrote:
> On Tue, Apr 28, 2015 at 08:29:13AM +0000, S, Deepak wrote:

>> Thanks Chirs for review, We moved  "Init_hw" to initialize WA's before any BB submission.

>>

>> Init_hw calls " init_clock_gating"

> But you appreciate the same issue exists on other paths?

> -Chris

>
Chris Wilson April 28, 2015, 2:46 p.m. UTC | #7
On Tue, Apr 28, 2015 at 02:38:25PM +0000, Antoine, Peter wrote:
> So is the plan to push these patches and have follow-on work to cover the other paths?
> As this fixes the Bugzilla issue that has been raised.

You've identified an issue, but I think your patch is incomplete.
-Chris
David Weinehall April 29, 2015, 11:07 a.m. UTC | #8
On Tue, Apr 28, 2015 at 03:46:46PM +0100, Chris Wilson wrote:
> On Tue, Apr 28, 2015 at 02:38:25PM +0000, Antoine, Peter wrote:
> > So is the plan to push these patches and have follow-on work to cover the other paths?
> > As this fixes the Bugzilla issue that has been raised.
> 
> You've identified an issue, but I think your patch is incomplete.

I've tried my best to go through the remaining similar-looking code,
but the rest seems fine (I might've missed something though).

The only thing I reacted on was that in intel_runtime_resume() the call
to intel_init_pch_refclk() is conditional on IS_GEN6(), but none of the
other invocations of intel_init_pch_refclk() are.  The commit message
doesn't seem to provide a sufficient explanation for why this is so.


Regards, David
Chris Wilson April 29, 2015, 11:39 a.m. UTC | #9
On Wed, Apr 29, 2015 at 02:07:19PM +0300, David Weinehall wrote:
> On Tue, Apr 28, 2015 at 03:46:46PM +0100, Chris Wilson wrote:
> > On Tue, Apr 28, 2015 at 02:38:25PM +0000, Antoine, Peter wrote:
> > > So is the plan to push these patches and have follow-on work to cover the other paths?
> > > As this fixes the Bugzilla issue that has been raised.
> > 
> > You've identified an issue, but I think your patch is incomplete.
> 
> I've tried my best to go through the remaining similar-looking code,
> but the rest seems fine (I might've missed something though).
> 
> The only thing I reacted on was that in intel_runtime_resume() the call
> to intel_init_pch_refclk() is conditional on IS_GEN6(), but none of the
> other invocations of intel_init_pch_refclk() are.  The commit message
> doesn't seem to provide a sufficient explanation for why this is so.

The explanation for moving init_hw() was that it setups clock_gating. If
any in that path are required for enabling the rings, those should be move to
init_render_ring() or the init_context.
-Chris
Daniel Vetter May 4, 2015, 2:55 p.m. UTC | #10
On Wed, Apr 29, 2015 at 12:39:17PM +0100, Chris Wilson wrote:
> 
> On Wed, Apr 29, 2015 at 02:07:19PM +0300, David Weinehall wrote:
> > On Tue, Apr 28, 2015 at 03:46:46PM +0100, Chris Wilson wrote:
> > > On Tue, Apr 28, 2015 at 02:38:25PM +0000, Antoine, Peter wrote:
> > > > So is the plan to push these patches and have follow-on work to cover the other paths?
> > > > As this fixes the Bugzilla issue that has been raised.
> > > 
> > > You've identified an issue, but I think your patch is incomplete.
> > 
> > I've tried my best to go through the remaining similar-looking code,
> > but the rest seems fine (I might've missed something though).
> > 
> > The only thing I reacted on was that in intel_runtime_resume() the call
> > to intel_init_pch_refclk() is conditional on IS_GEN6(), but none of the
> > other invocations of intel_init_pch_refclk() are.  The commit message
> > doesn't seem to provide a sufficient explanation for why this is so.
> 
> The explanation for moving init_hw() was that it setups clock_gating. If
> any in that path are required for enabling the rings, those should be move to
> init_render_ring() or the init_context.

Yeah we've had this fun before. init_clock_gating is _not_ for GT
workarounds, only for modeset workarounds. We might need to rename that
hook to avoid getting bitten for eternity, but moving init_hw because of
clock_gating to get the rings up and running is bogus.

As Chris said we need to identify which bits need to get moved to the ring
init or w/a bb code and do that (and in a separate patch from enabling the
interrupts early enough like this one here does).
-Daniel
Peter Antoine May 5, 2015, 7:05 a.m. UTC | #11
On Mon, 2015-05-04 at 16:55 +0200, Daniel Vetter wrote:
> On Wed, Apr 29, 2015 at 12:39:17PM +0100, Chris Wilson wrote:

> > 

> > On Wed, Apr 29, 2015 at 02:07:19PM +0300, David Weinehall wrote:

> > > On Tue, Apr 28, 2015 at 03:46:46PM +0100, Chris Wilson wrote:

> > > > On Tue, Apr 28, 2015 at 02:38:25PM +0000, Antoine, Peter wrote:

> > > > > So is the plan to push these patches and have follow-on work to cover the other paths?

> > > > > As this fixes the Bugzilla issue that has been raised.

> > > > 

> > > > You've identified an issue, but I think your patch is incomplete.

> > > 

> > > I've tried my best to go through the remaining similar-looking code,

> > > but the rest seems fine (I might've missed something though).

> > > 

> > > The only thing I reacted on was that in intel_runtime_resume() the call

> > > to intel_init_pch_refclk() is conditional on IS_GEN6(), but none of the

> > > other invocations of intel_init_pch_refclk() are.  The commit message

> > > doesn't seem to provide a sufficient explanation for why this is so.

> > 

> > The explanation for moving init_hw() was that it setups clock_gating. If

> > any in that path are required for enabling the rings, those should be move to

> > init_render_ring() or the init_context.

> 

> Yeah we've had this fun before. init_clock_gating is _not_ for GT

> workarounds, only for modeset workarounds. We might need to rename that

> hook to avoid getting bitten for eternity, but moving init_hw because of

> clock_gating to get the rings up and running is bogus.

> 

> As Chris said we need to identify which bits need to get moved to the ring

> init or w/a bb code and do that (and in a separate patch from enabling the

> interrupts early enough like this one here does).

> -Daniel


Ok. More work is required.

But, we have two issues here. The open and accessible (to any user)
ioctl that causes the driver (and in testing the kernel) to misbehave
and cause the system to become unresponsive need or to reboot.This is
covered by a simple de-reference check that protects the driver and the
system.

Secondly, a nice to have, which is the hw-lock/context code that seems
to have more issues with it and should be turned off when it is not
required. This code has subtle connections that will need more work. The
un-niceness of this code has been known about for a while and it would
be good to turn off.

The first I would suggest is kind of important as is simply exploitable
security hole, the second is just probably full of security holes.

Can we split this into two jobs, fix the actionable security hole, and
get back to the nice to have later.

Peter.
Peter Antoine May 5, 2015, 7:06 a.m. UTC | #12
Ignore this response, replied to the wrong thread.


On Tue, 2015-05-05 at 08:05 +0100, Peter Antoine wrote:
> On Mon, 2015-05-04 at 16:55 +0200, Daniel Vetter wrote:

> > On Wed, Apr 29, 2015 at 12:39:17PM +0100, Chris Wilson wrote:

> > > 

> > > On Wed, Apr 29, 2015 at 02:07:19PM +0300, David Weinehall wrote:

> > > > On Tue, Apr 28, 2015 at 03:46:46PM +0100, Chris Wilson wrote:

> > > > > On Tue, Apr 28, 2015 at 02:38:25PM +0000, Antoine, Peter wrote:

> > > > > > So is the plan to push these patches and have follow-on work to cover the other paths?

> > > > > > As this fixes the Bugzilla issue that has been raised.

> > > > > 

> > > > > You've identified an issue, but I think your patch is incomplete.

> > > > 

> > > > I've tried my best to go through the remaining similar-looking code,

> > > > but the rest seems fine (I might've missed something though).

> > > > 

> > > > The only thing I reacted on was that in intel_runtime_resume() the call

> > > > to intel_init_pch_refclk() is conditional on IS_GEN6(), but none of the

> > > > other invocations of intel_init_pch_refclk() are.  The commit message

> > > > doesn't seem to provide a sufficient explanation for why this is so.

> > > 

> > > The explanation for moving init_hw() was that it setups clock_gating. If

> > > any in that path are required for enabling the rings, those should be move to

> > > init_render_ring() or the init_context.

> > 

> > Yeah we've had this fun before. init_clock_gating is _not_ for GT

> > workarounds, only for modeset workarounds. We might need to rename that

> > hook to avoid getting bitten for eternity, but moving init_hw because of

> > clock_gating to get the rings up and running is bogus.

> > 

> > As Chris said we need to identify which bits need to get moved to the ring

> > init or w/a bb code and do that (and in a separate patch from enabling the

> > interrupts early enough like this one here does).

> > -Daniel

> 

> Ok. More work is required.

> 

> But, we have two issues here. The open and accessible (to any user)

> ioctl that causes the driver (and in testing the kernel) to misbehave

> and cause the system to become unresponsive need or to reboot.This is

> covered by a simple de-reference check that protects the driver and the

> system.

> 

> Secondly, a nice to have, which is the hw-lock/context code that seems

> to have more issues with it and should be turned off when it is not

> required. This code has subtle connections that will need more work. The

> un-niceness of this code has been known about for a while and it would

> be good to turn off.

> 

> The first I would suggest is kind of important as is simply exploitable

> security hole, the second is just probably full of security holes.

> 

> Can we split this into two jobs, fix the actionable security hole, and

> get back to the nice to have later.

> 

> Peter.
deepak.s@intel.com May 5, 2015, 7:43 a.m. UTC | #13
I agree Daniel. We need two patches here, 1) moving the enabling of the interrupts early on. 2) split the WA initialization into GT & Display and move GT workarounds before ring init.

Thanks
Deepak  

-----Original Message-----
From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch] On Behalf Of Daniel Vetter
Sent: Monday, May 4, 2015 8:25 PM
To: Chris Wilson; Antoine, Peter; Deepak S; S, Deepak; intel-gfx@lists.freedesktop.org; Tian, YeX
Subject: Re: [Intel-gfx] [PATCH v2] drm/i915: Avoid GPU hang when coming out of S3 or S4

On Wed, Apr 29, 2015 at 12:39:17PM +0100, Chris Wilson wrote:
> 
> On Wed, Apr 29, 2015 at 02:07:19PM +0300, David Weinehall wrote:
> > On Tue, Apr 28, 2015 at 03:46:46PM +0100, Chris Wilson wrote:
> > > On Tue, Apr 28, 2015 at 02:38:25PM +0000, Antoine, Peter wrote:
> > > > So is the plan to push these patches and have follow-on work to cover the other paths?
> > > > As this fixes the Bugzilla issue that has been raised.
> > > 
> > > You've identified an issue, but I think your patch is incomplete.
> > 
> > I've tried my best to go through the remaining similar-looking code, 
> > but the rest seems fine (I might've missed something though).
> > 
> > The only thing I reacted on was that in intel_runtime_resume() the 
> > call to intel_init_pch_refclk() is conditional on IS_GEN6(), but 
> > none of the other invocations of intel_init_pch_refclk() are.  The 
> > commit message doesn't seem to provide a sufficient explanation for why this is so.
> 
> The explanation for moving init_hw() was that it setups clock_gating. 
> If any in that path are required for enabling the rings, those should 
> be move to
> init_render_ring() or the init_context.

Yeah we've had this fun before. init_clock_gating is _not_ for GT workarounds, only for modeset workarounds. We might need to rename that hook to avoid getting bitten for eternity, but moving init_hw because of clock_gating to get the rings up and running is bogus.

As Chris said we need to identify which bits need to get moved to the ring init or w/a bb code and do that (and in a separate patch from enabling the interrupts early enough like this one here does).
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index e70adfd..648866f 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -712,6 +712,11 @@  static int i915_drm_resume(struct drm_device *dev)
 	intel_init_pch_refclk(dev);
 	drm_mode_config_reset(dev);
 
+	/* We need working interrupts for modeset enabling ... */
+	intel_runtime_pm_enable_interrupts(dev_priv);
+
+	intel_modeset_init_hw(dev);
+
 	mutex_lock(&dev->struct_mutex);
 	if (i915_gem_init_hw(dev)) {
 		DRM_ERROR("failed to re-initialize GPU, declaring wedged!\n");
@@ -719,11 +724,6 @@  static int i915_drm_resume(struct drm_device *dev)
 	}
 	mutex_unlock(&dev->struct_mutex);
 
-	/* We need working interrupts for modeset enabling ... */
-	intel_runtime_pm_enable_interrupts(dev_priv);
-
-	intel_modeset_init_hw(dev);
-
 	spin_lock_irq(&dev_priv->irq_lock);
 	if (dev_priv->display.hpd_irq_setup)
 		dev_priv->display.hpd_irq_setup(dev);