From patchwork Thu May 30 06:04:29 2013
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Mika Kuoppala
X-Patchwork-Id: 2633781
Return-Path:
X-Original-To: patchwork-intel-gfx@patchwork.kernel.org
Delivered-To: patchwork-process-083081@patchwork2.kernel.org
Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177])
	by patchwork2.kernel.org (Postfix) with ESMTP id D2491DF2A1
	for ; Thu, 30 May 2013 06:05:21 +0000 (UTC)
Received: from gabe.freedesktop.org (localhost [127.0.0.1])
	by gabe.freedesktop.org (Postfix) with ESMTP id 7C802E5D0E
	for ; Wed, 29 May 2013 23:05:21 -0700 (PDT)
X-Original-To: intel-gfx@lists.freedesktop.org
Delivered-To: intel-gfx@lists.freedesktop.org
Received: from mga03.intel.com (mga03.intel.com [143.182.124.21])
	by gabe.freedesktop.org (Postfix) with ESMTP id F37D6E5D0E
	for ; Wed, 29 May 2013 23:05:08 -0700 (PDT)
Received: from azsmga002.ch.intel.com ([10.2.17.35])
	by azsmga101.ch.intel.com with ESMTP; 29 May 2013 23:05:08 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="4.87,768,1363158000";
	d="scan'208";a="248525157"
Received: from gaia.fi.intel.com (HELO gaia) ([10.237.72.66])
	by AZSMGA002.ch.intel.com with ESMTP; 29 May 2013 23:05:06 -0700
Received: by gaia (Postfix, from userid 1000)
	id 409884115C; Thu, 30 May 2013 09:04:30 +0300 (EEST)
From: Mika Kuoppala
To: intel-gfx@lists.freedesktop.org
Date: Thu, 30 May 2013 09:04:29 +0300
Message-Id: <1369893869-19110-1-git-send-email-mika.kuoppala@intel.com>
X-Mailer: git-send-email 1.7.9.5
In-Reply-To: <20130527013448.GA22410@bwidawsk.net>
References: <20130527013448.GA22410@bwidawsk.net>
Cc: miku@iki.fi
Subject: [Intel-gfx] [PATCH] drm/i915: detect hang using per ring
	hangcheck_score
X-BeenThere: intel-gfx@lists.freedesktop.org
X-Mailman-Version: 2.1.13
Precedence: list
List-Id: Intel graphics driver community testing & development
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
MIME-Version: 1.0
Sender: intel-gfx-bounces+patchwork-intel-gfx=patchwork.kernel.org@lists.freedesktop.org
Errors-To: intel-gfx-bounces+patchwork-intel-gfx=patchwork.kernel.org@lists.freedesktop.org

Keep track of ring seqno progress and, if no progress is
detected, declare a hang. Use the actual head (acthd) to
distinguish between a stuck ring and a looping batchbuffer.
A stuck ring will be kicked to trigger progress.

This commit adds a hard limit for batchbuffer completion time.
If a batchbuffer takes more than 4.5 seconds to complete,
the gpu will be declared hung.

v2: use acthd to detect stuck ring from loop (Ben Widawsky)

v3: Use acthd to check when ring needs kicking.
Declare hang on third time in order to give time for
kick_ring to take effect.

v4: Update commit msg

Signed-off-by: Mika Kuoppala
---
 drivers/gpu/drm/i915/i915_irq.c         |   80 ++++++++++++++++++-------------
 drivers/gpu/drm/i915/intel_ringbuffer.h |    2 +
 2 files changed, 49 insertions(+), 33 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 557acd3..c4492bf 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -683,7 +683,6 @@ static void notify_ring(struct drm_device *dev,
 
 	wake_up_all(&ring->irq_queue);
 	if (i915_enable_hangcheck) {
-		dev_priv->gpu_error.hangcheck_count = 0;
 		mod_timer(&dev_priv->gpu_error.hangcheck_timer,
 			  round_jiffies_up(jiffies + DRM_I915_HANGCHECK_JIFFIES));
 	}
@@ -2385,61 +2384,76 @@ static bool i915_hangcheck_hung(struct drm_device *dev)
 
 /**
  * This is called when the chip hasn't reported back with completed
- * batchbuffers in a long time. The first time this is called we simply record
- * ACTHD. If ACTHD hasn't changed by the time the hangcheck timer elapses
- * again, we assume the chip is wedged and try to fix it.
+ * batchbuffers in a long time. We keep track of per ring seqno progress and
+ * if there is no progress, the hangcheck score for that ring is increased.
+ * Further, acthd is inspected to see if the ring is stuck. If it is stuck
+ * we kick the ring. If we see no progress on three subsequent calls
+ * we assume the chip is wedged and try to fix it by resetting the chip.
  */
 void i915_hangcheck_elapsed(unsigned long data)
 {
 	struct drm_device *dev = (struct drm_device *)data;
 	drm_i915_private_t *dev_priv = dev->dev_private;
 	struct intel_ring_buffer *ring;
-	bool err = false, idle;
 	int i;
-	u32 seqno[I915_NUM_RINGS];
-	bool work_done;
+	int busy_count = 0, rings_hung = 0;
+	bool stuck[I915_NUM_RINGS];
 
 	if (!i915_enable_hangcheck)
 		return;
 
-	idle = true;
 	for_each_ring(ring, dev_priv, i) {
-		seqno[i] = ring->get_seqno(ring, false);
-		idle &= i915_hangcheck_ring_idle(ring, seqno[i], &err);
-	}
+		u32 seqno, acthd;
+		bool idle, err = false;
+
+		seqno = ring->get_seqno(ring, false);
+		acthd = intel_ring_get_active_head(ring);
+		idle = i915_hangcheck_ring_idle(ring, seqno, &err);
+		stuck[i] = ring->hangcheck.acthd == acthd;
+
+		if (idle) {
+			if (err)
+				ring->hangcheck.score += 2;
+			else
+				ring->hangcheck.score = 0;
+		} else {
+			busy_count++;
 
-	/* If all work is done then ACTHD clearly hasn't advanced. */
-	if (idle) {
-		if (err) {
-			if (i915_hangcheck_hung(dev))
-				return;
+			if (ring->hangcheck.seqno == seqno) {
+				ring->hangcheck.score++;
 
-			goto repeat;
+				/* Kick ring if stuck */
+				if (stuck[i])
+					i915_hangcheck_ring_hung(ring);
+			} else {
+				ring->hangcheck.score = 0;
+			}
 		}
 
-		dev_priv->gpu_error.hangcheck_count = 0;
-		return;
+		ring->hangcheck.seqno = seqno;
+		ring->hangcheck.acthd = acthd;
 	}
 
-	work_done = false;
 	for_each_ring(ring, dev_priv, i) {
-		if (ring->hangcheck.seqno != seqno[i]) {
-			work_done = true;
-			ring->hangcheck.seqno = seqno[i];
+		if (ring->hangcheck.score > 2) {
+			rings_hung++;
+			DRM_ERROR("%s: %s on %s 0x%x\n", ring->name,
+				  stuck[i] ? "stuck" : "no progress",
+				  stuck[i] ? "addr" : "seqno",
+				  stuck[i] ? ring->hangcheck.acthd & HEAD_ADDR :
+				  ring->hangcheck.seqno);
 		}
 	}
 
-	if (!work_done) {
-		if (i915_hangcheck_hung(dev))
-			return;
-	} else {
-		dev_priv->gpu_error.hangcheck_count = 0;
-	}
+	if (rings_hung)
+		return i915_handle_error(dev, true);
 
-repeat:
-	/* Reset timer case chip hangs without another request being added */
-	mod_timer(&dev_priv->gpu_error.hangcheck_timer,
-		  round_jiffies_up(jiffies + DRM_I915_HANGCHECK_JIFFIES));
+	if (busy_count)
+		/* Reset timer in case chip hangs without another request
+		 * being added */
+		mod_timer(&dev_priv->gpu_error.hangcheck_timer,
+			  round_jiffies_up(jiffies +
+					   DRM_I915_HANGCHECK_JIFFIES));
 }
 
 /* drm_dma.h hooks
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index ef374a8..5886667 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -39,6 +39,8 @@ struct intel_hw_status_page {
 
 struct intel_ring_hangcheck {
 	u32 seqno;
+	u32 acthd;
+	int score;
 };
 
 struct intel_ring_buffer {
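
Note for readers: the per ring scoring can be tried out in user space. The following is only a minimal sketch of the logic in i915_hangcheck_elapsed() and not driver code. struct ring_hangcheck, hangcheck_tick() and HANGCHECK_HUNG_SCORE are hypothetical names made up for illustration, and the idle/err inputs stand in for whatever i915_hangcheck_ring_idle() reports.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for struct intel_ring_hangcheck. */
struct ring_hangcheck {
	uint32_t seqno;	/* seqno sampled at the previous check */
	uint32_t acthd;	/* active head sampled at the previous check */
	int score;	/* rises while the ring makes no progress */
};

/* Assumption: mirrors the "score > 2" test in the patch. */
#define HANGCHECK_HUNG_SCORE 3

/*
 * Run one hangcheck period for a single ring.
 *
 * Returns true once the score has reached HANGCHECK_HUNG_SCORE.
 * *kick is set when the ring is busy, its seqno did not advance and
 * acthd did not move either, which is the case where the patch kicks
 * the ring.
 */
static bool hangcheck_tick(struct ring_hangcheck *hc, uint32_t seqno,
			   uint32_t acthd, bool idle, bool err, bool *kick)
{
	bool stuck = hc->acthd == acthd;

	*kick = false;

	if (idle) {
		/* Idle with an error reported scores double, else reset. */
		hc->score = err ? hc->score + 2 : 0;
	} else if (hc->seqno == seqno) {
		/* Busy without seqno progress since the last check. */
		hc->score++;
		*kick = stuck;
	} else {
		/* Seqno advanced, so the ring is making progress. */
		hc->score = 0;
	}

	/* Remember this sample for the next period. */
	hc->seqno = seqno;
	hc->acthd = acthd;

	return hc->score >= HANGCHECK_HUNG_SCORE;
}
```

A busy ring has to miss progress on three checks in a row before it is reported hung. Assuming a 1.5 second hangcheck period, that gives the 4.5 second limit from the commit message. A looping batchbuffer (acthd moves, seqno does not) scores the same way but never sets *kick.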