drm/i915: Warn when execlists changes context without IRQs

Message ID: 1431356607-25092-1-git-send-email-peter.antoine@intel.com
State: New, archived
From: Peter Antoine <peter.antoine@intel.com>
To: intel-gfx@lists.freedesktop.org
Cc: daniel.vetter@ffwll.ch
Date: Mon, 11 May 2015 16:03:27 +0100
Subject: [Intel-gfx] [PATCH] drm/i915: Warn when execlists changes context without IRQs

Peter Antoine May 11, 2015, 3:03 p.m. UTC

If a batch ends while the IRQs are not turned on, the notification can
go missing and the GPU can hang. So generate a warning in this case.

Signed-off-by: Peter Antoine <peter.antoine@intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c | 6 ++++++
 1 file changed, 6 insertions(+)

Daniel Vetter May 11, 2015, 4:37 p.m. UTC | #1

On Mon, May 11, 2015 at 04:03:27PM +0100, Peter Antoine wrote:
> If a batch ends while the IRQs are not turned on, the notification can
> go missing and the GPU can hang. So generate a warning in this case.
> 
> Signed-off-by: Peter Antoine <peter.antoine@intel.com>

Queued for -next, thanks for the patch.
-Daniel

> ---
>  drivers/gpu/drm/i915/intel_lrc.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 0fa9209..0413b8f 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -394,6 +394,12 @@ static void execlists_context_unqueue(struct intel_engine_cs *ring)
>  
>  	assert_spin_locked(&ring->execlist_lock);
>  
> +	/*
> +	 * If irqs are not active generate a warning as batches that finish
> +	 * without the irqs may get lost and a GPU Hang may occur.
> +	 */
> +	WARN_ON(!intel_irqs_enabled(ring->dev->dev_private));
> +
>  	if (list_empty(&ring->execlist_queue))
>  		return;
>  
> -- 
> 1.9.1
>

Chris Wilson May 11, 2015, 8:26 p.m. UTC | #2

On Mon, May 11, 2015 at 06:37:14PM +0200, Daniel Vetter wrote:
> On Mon, May 11, 2015 at 04:03:27PM +0100, Peter Antoine wrote:
> > If a batch ends while the IRQs are not turned on, the notification can
> > go missing and the GPU can hang. So generate a warning in this case.
> > 
> > Signed-off-by: Peter Antoine <peter.antoine@intel.com>
> 
> Queued for -next, thanks for the patch.

Please drop this. We already have the test inside the irq handler. The
instance where you want the guard is during resume, inside
intel_lr_context_render_state_init() where you can even emit an error
code!
-Chris
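
The boundary guard suggested above can be illustrated with a small standalone C model. This is a hedged sketch, not i915 code: `struct mock_device`, `mock_render_state_init()` and the choice of `-EIO` are hypothetical stand-ins. Only the idea of failing the initial submission with an error code when interrupts are off comes from the message above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver's private device state. */
struct mock_device {
	bool irqs_enabled;	/* models what intel_irqs_enabled() would report */
	int submissions;	/* counts initial submissions that were accepted */
};

/*
 * Models a guard at the resume/init boundary: refuse to submit the
 * initial render state while interrupts are off, and tell the caller
 * why instead of only printing a warning.
 */
static int mock_render_state_init(struct mock_device *dev)
{
	assert(dev);

	if (!dev->irqs_enabled)
		return -EIO;	/* the error value is an assumption */

	dev->submissions++;
	return 0;
}
```

A caller on the resume path would propagate the non-zero return value upwards, which a bare WARN_ON() cannot do.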

Daniel Vetter May 12, 2015, 6:42 a.m. UTC | #3

On Mon, May 11, 2015 at 09:26:40PM +0100, Chris Wilson wrote:
> On Mon, May 11, 2015 at 06:37:14PM +0200, Daniel Vetter wrote:
> > On Mon, May 11, 2015 at 04:03:27PM +0100, Peter Antoine wrote:
> > > If a batch ends while the IRQs are not turned on, the notification can
> > > go missing and the GPU can hang. So generate a warning in this case.
> > > 
> > > Signed-off-by: Peter Antoine <peter.antoine@intel.com>
> > 
> > Queued for -next, thanks for the patch.
> 
> Please drop this. We already have the test inside the irq handler. The
> instance where you want the guard is during resume, inside
> intel_lr_context_render_state_init() where you can even emit an error
> code!

_unqueue is also called from intel_logical_ring_advance_and_submit and
should cover us I hope. And I don't think we should add error handling for
this, that has cost of its own.
-Daniel

Chris Wilson May 12, 2015, 8:18 a.m. UTC | #4

On Tue, May 12, 2015 at 08:42:58AM +0200, Daniel Vetter wrote:
> On Mon, May 11, 2015 at 09:26:40PM +0100, Chris Wilson wrote:
> > On Mon, May 11, 2015 at 06:37:14PM +0200, Daniel Vetter wrote:
> > > On Mon, May 11, 2015 at 04:03:27PM +0100, Peter Antoine wrote:
> > > > If a batch ends while the IRQs are not turned on, the notification can
> > > > go missing and the GPU can hang. So generate a warning in this case.
> > > > 
> > > > Signed-off-by: Peter Antoine <peter.antoine@intel.com>
> > > 
> > > Queued for -next, thanks for the patch.
> > 
> > Please drop this. We already have the test inside the irq handler. The
> > instance where you want the guard is during resume, inside
> > intel_lr_context_render_state_init() where you can even emit an error
> > code!
> 
> _unqueue is also called from intel_logical_ring_advance_and_submit and
> should cover us I hope. And I don't think we should add error handling for
> this, that has cost of its own.

Pardon? Your explanation for adding it to the *interrupt* handler after
an existing check is that it also catches the initial submission?
-Chris

Daniel Vetter May 12, 2015, 8:42 a.m. UTC | #5

On Tue, May 12, 2015 at 09:18:00AM +0100, Chris Wilson wrote:
> On Tue, May 12, 2015 at 08:42:58AM +0200, Daniel Vetter wrote:
> > On Mon, May 11, 2015 at 09:26:40PM +0100, Chris Wilson wrote:
> > > On Mon, May 11, 2015 at 06:37:14PM +0200, Daniel Vetter wrote:
> > > > On Mon, May 11, 2015 at 04:03:27PM +0100, Peter Antoine wrote:
> > > > > If a batch ends while the IRQs are not turned on, the notification can
> > > > > go missing and the GPU can hang. So generate a warning in this case.
> > > > > 
> > > > > Signed-off-by: Peter Antoine <peter.antoine@intel.com>
> > > > 
> > > > Queued for -next, thanks for the patch.
> > > 
> > > Please drop this. We already have the test inside the irq handler. The
> > > instance where you want the guard is during resume, inside
> > > intel_lr_context_render_state_init() where you can even emit an error
> > > code!
> > 
> > _unqueue is also called from intel_logical_ring_advance_and_submit and
> > should cover us I hope. And I don't think we should add error handling for
> > this, that has cost of its own.
> 
> Pardon? Your explanation for adding it to the *interrupt* handler after
> an existing check is that it also catches the initial submission?

I wanted to add it as low down as possible to increase the chances of the
check surviving refactoring. This way it's as close as possible to the
writes to the submit ports. Yes that means it's also run from interrupt
context when requeueing, but I'm fairly meh about that. What's your
concern?
-Daniel

Chris Wilson May 12, 2015, 8:48 a.m. UTC | #6

On Tue, May 12, 2015 at 10:42:41AM +0200, Daniel Vetter wrote:
> On Tue, May 12, 2015 at 09:18:00AM +0100, Chris Wilson wrote:
> > On Tue, May 12, 2015 at 08:42:58AM +0200, Daniel Vetter wrote:
> > > On Mon, May 11, 2015 at 09:26:40PM +0100, Chris Wilson wrote:
> > > > On Mon, May 11, 2015 at 06:37:14PM +0200, Daniel Vetter wrote:
> > > > > On Mon, May 11, 2015 at 04:03:27PM +0100, Peter Antoine wrote:
> > > > > > If a batch ends while the IRQs are not turned on, the notification can
> > > > > > go missing and the GPU can hang. So generate a warning in this case.
> > > > > > 
> > > > > > Signed-off-by: Peter Antoine <peter.antoine@intel.com>
> > > > > 
> > > > > Queued for -next, thanks for the patch.
> > > > 
> > > > Please drop this. We already have the test inside the irq handler. The
> > > > instance where you want the guard is during resume, inside
> > > > intel_lr_context_render_state_init() where you can even emit an error
> > > > code!
> > > 
> > > _unqueue is also called from intel_logical_ring_advance_and_submit and
> > > should cover us I hope. And I don't think we should add error handling for
> > > this, that has cost of its own.
> > 
> > Pardon? Your explanation for adding it to the *interrupt* handler after
> > an existing check is that it also catches the initial submission?
> 
> I wanted to add it as low down as possible to increase the chances of the
> check surviving refactoring. This way it's as close as possible to the
> writes to the submit ports. Yes that means it's also run from interrupt
> context when requeueing, but I'm fairly meh about that. What's your
> concern?

The interrupt handler + submission is the rate-limiting factor in
execlists throughput. Tests for external factors really should be on the
boundary.
-Chris
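
The cost argument above can be made concrete with a toy C model that counts how often the IRQ-state test runs when it sits on the submission boundary rather than in the per-interrupt path. Everything here is a hypothetical stand-in (`struct mock_engine`, `mock_submit()`, `mock_unqueue()`, `mock_irq_handler()`, the `-EIO` return) and it does not measure real driver overhead. It only shows that a boundary check runs once per submission while the requeue path stays free of it.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of an engine with a submit port. */
struct mock_engine {
	bool irqs_enabled;
	unsigned int checks;		/* how many times IRQ state was tested */
	unsigned int port_writes;	/* how many times the port was written */
};

/* Hot path: called on initial submit and on every requeue, no checks. */
static void mock_unqueue(struct mock_engine *e)
{
	e->port_writes++;
}

/* Boundary: the external condition is tested once, on entry. */
static int mock_submit(struct mock_engine *e)
{
	e->checks++;
	if (!e->irqs_enabled)
		return -EIO;	/* the error value is an assumption */

	mock_unqueue(e);
	return 0;
}

/* Interrupt side: requeues once per context-switch event. */
static void mock_irq_handler(struct mock_engine *e, unsigned int events)
{
	for (unsigned int i = 0; i < events; i++)
		mock_unqueue(e);
}
```

With one submission followed by 1000 context-switch events, this model tests the IRQ state once and writes the port 1001 times. Putting the test inside `mock_unqueue()` instead would run it 1001 times.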

Shuang He May 15, 2015, 3:41 a.m. UTC | #7

Tested-By: Intel Graphics QA PRTS (Patch Regression Test System Contact: shuang.he@intel.com)
Task id: 6381
-------------------------------------Summary-------------------------------------
Platform  Delta  drm-intel-nightly  Series Applied
PNV              276/276            276/276
ILK              302/302            302/302
SNB       -1     314/314            313/314
IVB              338/338            338/338
BYT              286/286            286/286
BDW       -1     320/320            319/320
-------------------------------------Detailed-------------------------------------
Platform  Test                                drm-intel-nightly          Series Applied
 SNB  igt@pm_rpm@dpms-mode-unset-non-lpsp      DMESG_WARN(13)PASS(1)      DMESG_WARN(1)
(dmesg patch applied)WARNING:at_drivers/gpu/drm/i915/intel_uncore.c:#assert_device_not_suspended[i915]()@WARNING:.* at .* assert_device_not_suspended+0x
*BDW  igt@gem_gtt_hog      PASS(3)      DMESG_WARN(1)
(dmesg patch applied)WARNING:at_drivers/gpu/drm/i915/intel_display.c:#assert_plane[i915]()@WARNING:.* at .* assert_plane
assertion_failure@assertion failure
WARNING:at_drivers/gpu/drm/drm_irq.c:#drm_wait_one_vblank[drm]()@WARNING:.* at .* drm_wait_one_vblank+0x
Note: Pay particular attention to lines that start with '*'.
