From patchwork Fri Aug 30 13:19:28 2013 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mika Kuoppala X-Patchwork-Id: 2852034 Return-Path: X-Original-To: patchwork-intel-gfx@patchwork.kernel.org Delivered-To: patchwork-parsemail@patchwork1.web.kernel.org Received: from mail.kernel.org (mail.kernel.org [198.145.19.201]) by patchwork1.web.kernel.org (Postfix) with ESMTP id 7E5429F313 for ; Fri, 30 Aug 2013 13:24:55 +0000 (UTC) Received: from mail.kernel.org (localhost [127.0.0.1]) by mail.kernel.org (Postfix) with ESMTP id 366122058F for ; Fri, 30 Aug 2013 13:24:54 +0000 (UTC) Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) by mail.kernel.org (Postfix) with ESMTP id 7E4F020598 for ; Fri, 30 Aug 2013 13:24:52 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 63189E6827 for ; Fri, 30 Aug 2013 06:24:52 -0700 (PDT) X-Original-To: intel-gfx@lists.freedesktop.org Delivered-To: intel-gfx@lists.freedesktop.org Received: from mga14.intel.com (mga14.intel.com [143.182.124.37]) by gabe.freedesktop.org (Postfix) with ESMTP id 4BB8EE66C0 for ; Fri, 30 Aug 2013 06:19:34 -0700 (PDT) Received: from azsmga002.ch.intel.com ([10.2.17.35]) by azsmga102.ch.intel.com with ESMTP; 30 Aug 2013 06:19:33 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="4.89,990,1367996400"; d="scan'208";a="288903163" Received: from rosetta.fi.intel.com (HELO rosetta) ([10.237.72.86]) by AZSMGA002.ch.intel.com with ESMTP; 30 Aug 2013 06:19:32 -0700 Received: by rosetta (Postfix, from userid 1000) id B59E980099; Fri, 30 Aug 2013 16:19:31 +0300 (EEST) From: Mika Kuoppala To: intel-gfx@lists.freedesktop.org Date: Fri, 30 Aug 2013 16:19:28 +0300 Message-Id: <1377868769-6049-1-git-send-email-mika.kuoppala@intel.com> X-Mailer: git-send-email 1.7.9.5 Subject: [Intel-gfx] [PATCH 1/2] drm/i915: ban badly behaving contexts X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Sender: intel-gfx-bounces+patchwork-intel-gfx=patchwork.kernel.org@lists.freedesktop.org Errors-To: intel-gfx-bounces+patchwork-intel-gfx=patchwork.kernel.org@lists.freedesktop.org X-Spam-Status: No, score=-6.3 required=5.0 tests=BAYES_00, RCVD_IN_DNSWL_MED, RP_MATCHES_RCVD, UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mail.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Now when we have mechanism in place to track which context was guilty of hanging the gpu, it is possible to punish for bad behaviour. If context has recently submitted a faulty batchbuffers guilty of gpu hang and submits another batch which hangs gpu in quick succession, ban it permanently. If ctx is banned, no more batchbuffers will be queued for execution. There is no need for global wedge machinery anymore and it would be unwise to wedge the whole gpu if we have multiple hanging batches queued for execution. Instead just ban the guilty ones and carry on. v2: Store guilty ban status bool in gpu_error instead of pointers that might become danling before hang is declared. v3: Use return value for banned status instead of stashing state into gpu_error (Chris Wilson) v4: - rebase on top of fixed hang stats api - add define for ban period - rename commit and improve commit msg v5: - rely context banning instead of wedging the gpu - beautification and fix for ban calculation (Chris) Signed-off-by: Mika Kuoppala Reviewed-by: Chris Wilson --- drivers/gpu/drm/i915/i915_drv.c | 29 ++++++++++++---------------- drivers/gpu/drm/i915/i915_drv.h | 11 +++++++++-- drivers/gpu/drm/i915/i915_gem.c | 22 +++++++++++++++++++-- drivers/gpu/drm/i915/i915_gem_execbuffer.c | 12 ++++++++++++ 4 files changed, 53 insertions(+), 21 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c index eb56f68..a3beef3 100644 --- a/drivers/gpu/drm/i915/i915_drv.c +++ b/drivers/gpu/drm/i915/i915_drv.c @@ -719,24 +719,19 @@ int i915_reset(struct drm_device *dev) simulated = dev_priv->gpu_error.stop_rings != 0; - if (!simulated && get_seconds() - dev_priv->gpu_error.last_reset < 5) { - DRM_ERROR("GPU hanging too fast, declaring wedged!\n"); - ret = -ENODEV; - } else { - ret = intel_gpu_reset(dev); - - /* Also reset the gpu hangman. */ - if (simulated) { - DRM_INFO("Simulated gpu hang, resetting stop_rings\n"); - dev_priv->gpu_error.stop_rings = 0; - if (ret == -ENODEV) { - DRM_ERROR("Reset not implemented, but ignoring " - "error for simulated gpu hangs\n"); - ret = 0; - } - } else - dev_priv->gpu_error.last_reset = get_seconds(); + ret = intel_gpu_reset(dev); + + /* Also reset the gpu hangman. */ + if (simulated) { + DRM_INFO("Simulated gpu hang, resetting stop_rings\n"); + dev_priv->gpu_error.stop_rings = 0; + if (ret == -ENODEV) { + DRM_ERROR("Reset not implemented, but ignoring " + "error for simulated gpu hangs\n"); + ret = 0; + } } + if (ret) { DRM_ERROR("Failed to reset chip.\n"); mutex_unlock(&dev->struct_mutex); diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h index 4da1f86..2c5f3bc 100644 --- a/drivers/gpu/drm/i915/i915_drv.h +++ b/drivers/gpu/drm/i915/i915_drv.h @@ -583,6 +583,12 @@ struct i915_ctx_hang_stats { /* This context had batch active when hang was declared */ unsigned batch_active; + + /* Time when this context was last blamed for a GPU reset */ + unsigned long guilty_ts; + + /* This context is banned to submit more work */ + bool banned; }; /* This must match up with the value previously used for execbuf2.rsvd1. */ @@ -984,6 +990,9 @@ struct i915_gpu_error { /* For hangcheck timer */ #define DRM_I915_HANGCHECK_PERIOD 1500 /* in ms */ #define DRM_I915_HANGCHECK_JIFFIES msecs_to_jiffies(DRM_I915_HANGCHECK_PERIOD) + /* Hang gpu twice in this window and your context gets banned */ +#define DRM_I915_CTX_BAN_PERIOD DIV_ROUND_UP(8*DRM_I915_HANGCHECK_PERIOD, 1000) + struct timer_list hangcheck_timer; /* For reset and error_state handling. */ @@ -992,8 +1001,6 @@ struct i915_gpu_error { struct drm_i915_error_state *first_error; struct work_struct work; - unsigned long last_reset; - /** * State variable and reset counter controlling the reset flow * diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c index 7392f64..754f875 100644 --- a/drivers/gpu/drm/i915/i915_gem.c +++ b/drivers/gpu/drm/i915/i915_gem.c @@ -2201,6 +2201,21 @@ static bool i915_request_guilty(struct drm_i915_gem_request *request, return false; } +static bool i915_context_is_banned(const struct i915_ctx_hang_stats *hs) +{ + const unsigned long elapsed = get_seconds() - hs->guilty_ts; + + if (hs->banned) + return true; + + if (elapsed <= DRM_I915_CTX_BAN_PERIOD) { + DRM_ERROR("context hanging too fast, declaring banned!\n"); + return true; + } + + return false; +} + static void i915_set_reset_status(struct intel_ring_buffer *ring, struct drm_i915_gem_request *request, u32 acthd) @@ -2237,10 +2252,13 @@ static void i915_set_reset_status(struct intel_ring_buffer *ring, hs = &request->file_priv->hang_stats; if (hs) { - if (guilty) + if (guilty) { + hs->banned = i915_context_is_banned(hs); hs->batch_active++; - else + hs->guilty_ts = get_seconds(); + } else { hs->batch_pending++; + } } } diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c index ea4bacb..426021a 100644 --- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c +++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c @@ -926,6 +926,7 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data, struct drm_i915_gem_object *batch_obj; struct drm_clip_rect *cliprects = NULL; struct intel_ring_buffer *ring; + struct i915_ctx_hang_stats *hs; u32 ctx_id = i915_execbuffer2_get_context_id(*args); u32 exec_start, exec_len; u32 mask, flags; @@ -1115,6 +1116,17 @@ i915_gem_do_execbuffer(struct drm_device *dev, void *data, if (ret) goto err; + hs = i915_gem_context_get_hang_stats(dev, file, ctx_id); + if (IS_ERR(hs)) { + ret = PTR_ERR(hs); + goto err; + } + + if (hs->banned) { + ret = -EIO; + goto err; + } + ret = i915_switch_context(ring, file, ctx_id); if (ret) goto err;