Message ID | 1430205494-12623-1-git-send-email-peter.antoine@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Tue, Apr 28, 2015 at 08:18:14AM +0100, Peter Antoine wrote: > This patch fixed a timing issue that causes a GPU hang when a the system > comes out of power saving. > > During pm_resume, We are submitting batchbuffers before enabling Interrupts > this is causing us to miss the context switch interrupt, and in consequence > intel_execlists_handle_ctx_events is not triggered. Ah. Thanks. How about a code comment? :) But that doesn't explain moving init_hw... -Chris
Thanks Chirs for review, We moved "Init_hw" to initialize WA's before any BB submission. Init_hw calls " init_clock_gating" Thanks Deepak -----Original Message----- From: Chris Wilson [mailto:chris@chris-wilson.co.uk] Sent: Tuesday, April 28, 2015 12:53 PM To: Antoine, Peter Cc: intel-gfx@lists.freedesktop.org; S, Deepak; Tian, YeX; Weinehall, David Subject: Re: [Intel-gfx] [PATCH v2] drm/i915: Avoid GPU hang when coming out of S3 or S4 On Tue, Apr 28, 2015 at 08:18:14AM +0100, Peter Antoine wrote: > This patch fixed a timing issue that causes a GPU hang when a the > system comes out of power saving. > > During pm_resume, We are submitting batchbuffers before enabling > Interrupts this is causing us to miss the context switch interrupt, > and in consequence intel_execlists_handle_ctx_events is not triggered. Ah. Thanks. How about a code comment? :) But that doesn't explain moving init_hw... -Chris -- Chris Wilson, Intel Open Source Technology Centre
On Tue, Apr 28, 2015 at 08:29:13AM +0000, S, Deepak wrote: > Thanks Chirs for review, We moved "Init_hw" to initialize WA's before any BB submission. > > Init_hw calls " init_clock_gating" But you appreciate the same issue exists on other paths? -Chris
Yes agreed, we need to make changes in other paths :) On Tuesday 28 April 2015 02:14 PM, Chris Wilson wrote: > On Tue, Apr 28, 2015 at 08:29:13AM +0000, S, Deepak wrote: >> Thanks Chirs for review, We moved "Init_hw" to initialize WA's before any BB submission. >> >> Init_hw calls " init_clock_gating" > But you appreciate the same issue exists on other paths? > -Chris >
Tested-By: Intel Graphics QA PRTS (Patch Regression Test System Contact: shuang.he@intel.com)
Task id: 6271
-------------------------------------Summary-------------------------------------
Platform Delta drm-intel-nightly Series Applied
PNV 276/276 276/276
ILK -1 302/302 301/302
SNB 318/318 318/318
IVB 341/341 341/341
BYT 287/287 287/287
BDW 318/318 318/318
-------------------------------------Detailed-------------------------------------
Platform Test drm-intel-nightly Series Applied
*ILK igt@gem_fenced_exec_thrash@no-spare-fences-busy PASS(4) DMESG_WARN(1)PASS(1)
(dmesg patch applied)drm:i915_hangcheck_elapsed[i915]]*ERROR*Hangcheck_timer_elapsed...bsd_ring_idle@Hangcheck timer elapsed... bsd ring idle
Note: You need to pay more attention to line start with '*'
So is the plan to push these patches and have follow-on work to cover the other paths? As this fixes the Bugzilla issue that has been raised. Peter. -----Original Message----- From: Deepak S [mailto:deepak.s@linux.intel.com] Sent: Tuesday, April 28, 2015 9:56 AM To: Chris Wilson; S, Deepak; Antoine, Peter; intel-gfx@lists.freedesktop.org; Tian, YeX; Weinehall, David Subject: Re: [Intel-gfx] [PATCH v2] drm/i915: Avoid GPU hang when coming out of S3 or S4 Yes agreed, we need to make changes in other paths :) On Tuesday 28 April 2015 02:14 PM, Chris Wilson wrote: > On Tue, Apr 28, 2015 at 08:29:13AM +0000, S, Deepak wrote: >> Thanks Chirs for review, We moved "Init_hw" to initialize WA's before any BB submission. >> >> Init_hw calls " init_clock_gating" > But you appreciate the same issue exists on other paths? > -Chris >
On Tue, Apr 28, 2015 at 02:38:25PM +0000, Antoine, Peter wrote: > So is the plan to push these patches and have follow-on work to cover the other paths? > As this fixes the Bugzilla issue that has been raised. You've identified an issue, but I think your patch is incomplete. -Chris
On Tue, Apr 28, 2015 at 03:46:46PM +0100, Chris Wilson wrote: > On Tue, Apr 28, 2015 at 02:38:25PM +0000, Antoine, Peter wrote: > > So is the plan to push these patches and have follow-on work to cover the other paths? > > As this fixes the Bugzilla issue that has been raised. > > You've identified an issue, but I think your patch is incomplete. I've tried my best to go through the remaining similar-looking code, but the rest seems fine (I might've missed something though). The only thing I reacted on was that in intel_runtime_resume() the call to intel_init_pch_refclk() is conditional on IS_GEN6(), but none of the other invocations of intel_init_pch_refclk() are. The commit message doesn't seem to provide a sufficient explanation for why this is so. Regards, David
On Wed, Apr 29, 2015 at 02:07:19PM +0300, David Weinehall wrote: > On Tue, Apr 28, 2015 at 03:46:46PM +0100, Chris Wilson wrote: > > On Tue, Apr 28, 2015 at 02:38:25PM +0000, Antoine, Peter wrote: > > > So is the plan to push these patches and have follow-on work to cover the other paths? > > > As this fixes the Bugzilla issue that has been raised. > > > > You've identified an issue, but I think your patch is incomplete. > > I've tried my best to go through the remaining similar-looking code, > but the rest seems fine (I might've missed something though). > > The only thing I reacted on was that in intel_runtime_resume() the call > to intel_init_pch_refclk() is conditional on IS_GEN6(), but none of the > other invocations of intel_init_pch_refclk() are. The commit message > doesn't seem to provide a sufficient explanation for why this is so. The explanation for moving init_hw() was that it setups clock_gating. If any in that path are required for enabling the rings, those should be move to init_render_ring() or the init_context. -Chris
On Wed, Apr 29, 2015 at 12:39:17PM +0100, Chris Wilson wrote: > > On Wed, Apr 29, 2015 at 02:07:19PM +0300, David Weinehall wrote: > > On Tue, Apr 28, 2015 at 03:46:46PM +0100, Chris Wilson wrote: > > > On Tue, Apr 28, 2015 at 02:38:25PM +0000, Antoine, Peter wrote: > > > > So is the plan to push these patches and have follow-on work to cover the other paths? > > > > As this fixes the Bugzilla issue that has been raised. > > > > > > You've identified an issue, but I think your patch is incomplete. > > > > I've tried my best to go through the remaining similar-looking code, > > but the rest seems fine (I might've missed something though). > > > > The only thing I reacted on was that in intel_runtime_resume() the call > > to intel_init_pch_refclk() is conditional on IS_GEN6(), but none of the > > other invocations of intel_init_pch_refclk() are. The commit message > > doesn't seem to provide a sufficient explanation for why this is so. > > The explanation for moving init_hw() was that it setups clock_gating. If > any in that path are required for enabling the rings, those should be move to > init_render_ring() or the init_context. Yeah we've had this fun before. init_clock_gating is _not_ for GT workarounds, only for modeset workarounds. We might need to rename that hook to avoid getting bitten for eternity, but moving init_hw because of clock_gating to get the rings up and running is bogus. As Chris said we need to identify which bits need to get moved to the ring init or w/a bb code and do that (and in a separate patch from enabling the interrupts early enough like this one here does). -Daniel
On Mon, 2015-05-04 at 16:55 +0200, Daniel Vetter wrote: > On Wed, Apr 29, 2015 at 12:39:17PM +0100, Chris Wilson wrote: > > > > On Wed, Apr 29, 2015 at 02:07:19PM +0300, David Weinehall wrote: > > > On Tue, Apr 28, 2015 at 03:46:46PM +0100, Chris Wilson wrote: > > > > On Tue, Apr 28, 2015 at 02:38:25PM +0000, Antoine, Peter wrote: > > > > > So is the plan to push these patches and have follow-on work to cover the other paths? > > > > > As this fixes the Bugzilla issue that has been raised. > > > > > > > > You've identified an issue, but I think your patch is incomplete. > > > > > > I've tried my best to go through the remaining similar-looking code, > > > but the rest seems fine (I might've missed something though). > > > > > > The only thing I reacted on was that in intel_runtime_resume() the call > > > to intel_init_pch_refclk() is conditional on IS_GEN6(), but none of the > > > other invocations of intel_init_pch_refclk() are. The commit message > > > doesn't seem to provide a sufficient explanation for why this is so. > > > > The explanation for moving init_hw() was that it setups clock_gating. If > > any in that path are required for enabling the rings, those should be move to > > init_render_ring() or the init_context. > > Yeah we've had this fun before. init_clock_gating is _not_ for GT > workarounds, only for modeset workarounds. We might need to rename that > hook to avoid getting bitten for eternity, but moving init_hw because of > clock_gating to get the rings up and running is bogus. > > As Chris said we need to identify which bits need to get moved to the ring > init or w/a bb code and do that (and in a separate patch from enabling the > interrupts early enough like this one here does). > -Daniel Ok. More work is required. But, we have two issues here. The open and accessible (to any user) ioctl that causes the driver (and in testing the kernel) to misbehave and cause the system to become unresponsive need or to reboot.This is covered by a simple de-reference check that protects the driver and the system. Secondly, a nice to have, which is the hw-lock/context code that seems to have more issues with it and should be turned off when it is not required. This code has subtle connections that will need more work. The un-niceness of this code has been known about for a while and it would be good to turn off. The first I would suggest is kind of important as is simply exploitable security hole, the second is just probably full of security holes. Can we split this into two jobs, fix the actionable security hole, and get back to the nice to have later. Peter.
Ignore this response, replied to the wrong thread. On Tue, 2015-05-05 at 08:05 +0100, Peter Antoine wrote: > On Mon, 2015-05-04 at 16:55 +0200, Daniel Vetter wrote: > > On Wed, Apr 29, 2015 at 12:39:17PM +0100, Chris Wilson wrote: > > > > > > On Wed, Apr 29, 2015 at 02:07:19PM +0300, David Weinehall wrote: > > > > On Tue, Apr 28, 2015 at 03:46:46PM +0100, Chris Wilson wrote: > > > > > On Tue, Apr 28, 2015 at 02:38:25PM +0000, Antoine, Peter wrote: > > > > > > So is the plan to push these patches and have follow-on work to cover the other paths? > > > > > > As this fixes the Bugzilla issue that has been raised. > > > > > > > > > > You've identified an issue, but I think your patch is incomplete. > > > > > > > > I've tried my best to go through the remaining similar-looking code, > > > > but the rest seems fine (I might've missed something though). > > > > > > > > The only thing I reacted on was that in intel_runtime_resume() the call > > > > to intel_init_pch_refclk() is conditional on IS_GEN6(), but none of the > > > > other invocations of intel_init_pch_refclk() are. The commit message > > > > doesn't seem to provide a sufficient explanation for why this is so. > > > > > > The explanation for moving init_hw() was that it setups clock_gating. If > > > any in that path are required for enabling the rings, those should be move to > > > init_render_ring() or the init_context. > > > > Yeah we've had this fun before. init_clock_gating is _not_ for GT > > workarounds, only for modeset workarounds. We might need to rename that > > hook to avoid getting bitten for eternity, but moving init_hw because of > > clock_gating to get the rings up and running is bogus. > > > > As Chris said we need to identify which bits need to get moved to the ring > > init or w/a bb code and do that (and in a separate patch from enabling the > > interrupts early enough like this one here does). > > -Daniel > > Ok. More work is required. > > But, we have two issues here. The open and accessible (to any user) > ioctl that causes the driver (and in testing the kernel) to misbehave > and cause the system to become unresponsive need or to reboot.This is > covered by a simple de-reference check that protects the driver and the > system. > > Secondly, a nice to have, which is the hw-lock/context code that seems > to have more issues with it and should be turned off when it is not > required. This code has subtle connections that will need more work. The > un-niceness of this code has been known about for a while and it would > be good to turn off. > > The first I would suggest is kind of important as is simply exploitable > security hole, the second is just probably full of security holes. > > Can we split this into two jobs, fix the actionable security hole, and > get back to the nice to have later. > > Peter.
I agree Daniel. We need two patches here, 1) moving the enabling of the interrupts early on. 2) split the WA initialization into GT & Display and move GT workarounds before ring init. Thanks Deepak -----Original Message----- From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch] On Behalf Of Daniel Vetter Sent: Monday, May 4, 2015 8:25 PM To: Chris Wilson; Antoine, Peter; Deepak S; S, Deepak; intel-gfx@lists.freedesktop.org; Tian, YeX Subject: Re: [Intel-gfx] [PATCH v2] drm/i915: Avoid GPU hang when coming out of S3 or S4 On Wed, Apr 29, 2015 at 12:39:17PM +0100, Chris Wilson wrote: > > On Wed, Apr 29, 2015 at 02:07:19PM +0300, David Weinehall wrote: > > On Tue, Apr 28, 2015 at 03:46:46PM +0100, Chris Wilson wrote: > > > On Tue, Apr 28, 2015 at 02:38:25PM +0000, Antoine, Peter wrote: > > > > So is the plan to push these patches and have follow-on work to cover the other paths? > > > > As this fixes the Bugzilla issue that has been raised. > > > > > > You've identified an issue, but I think your patch is incomplete. > > > > I've tried my best to go through the remaining similar-looking code, > > but the rest seems fine (I might've missed something though). > > > > The only thing I reacted on was that in intel_runtime_resume() the > > call to intel_init_pch_refclk() is conditional on IS_GEN6(), but > > none of the other invocations of intel_init_pch_refclk() are. The > > commit message doesn't seem to provide a sufficient explanation for why this is so. > > The explanation for moving init_hw() was that it setups clock_gating. > If any in that path are required for enabling the rings, those should > be move to > init_render_ring() or the init_context. Yeah we've had this fun before. init_clock_gating is _not_ for GT workarounds, only for modeset workarounds. We might need to rename that hook to avoid getting bitten for eternity, but moving init_hw because of clock_gating to get the rings up and running is bogus. As Chris said we need to identify which bits need to get moved to the ring init or w/a bb code and do that (and in a separate patch from enabling the interrupts early enough like this one here does). -Daniel -- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c index e70adfd..648866f 100644 --- a/drivers/gpu/drm/i915/i915_drv.c +++ b/drivers/gpu/drm/i915/i915_drv.c @@ -712,6 +712,11 @@ static int i915_drm_resume(struct drm_device *dev) intel_init_pch_refclk(dev); drm_mode_config_reset(dev); + /* We need working interrupts for modeset enabling ... */ + intel_runtime_pm_enable_interrupts(dev_priv); + + intel_modeset_init_hw(dev); + mutex_lock(&dev->struct_mutex); if (i915_gem_init_hw(dev)) { DRM_ERROR("failed to re-initialize GPU, declaring wedged!\n"); @@ -719,11 +724,6 @@ static int i915_drm_resume(struct drm_device *dev) } mutex_unlock(&dev->struct_mutex); - /* We need working interrupts for modeset enabling ... */ - intel_runtime_pm_enable_interrupts(dev_priv); - - intel_modeset_init_hw(dev); - spin_lock_irq(&dev_priv->irq_lock); if (dev_priv->display.hpd_irq_setup) dev_priv->display.hpd_irq_setup(dev);
This patch fixed a timing issue that causes a GPU hang when a the system comes out of power saving. During pm_resume, We are submitting batchbuffers before enabling Interrupts this is causing us to miss the context switch interrupt, and in consequence intel_execlists_handle_ctx_events is not triggered. This patch is based on a patch from Deepak S <deepak.s@intel.com> from another platform. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=89600 Signed-off-by: Peter Antoine <peter.antoine@intel.com> --- drivers/gpu/drm/i915/i915_drv.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)