From patchwork Fri Jun 17 07:09:10 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: arun.siluvery@linux.intel.com
X-Patchwork-Id: 9182709
From: Arun Siluvery <arun.siluvery@linux.intel.com>
To: intel-gfx@lists.freedesktop.org
Date: Fri, 17 Jun 2016 08:09:10 +0100
Message-Id: <1466147355-4635-11-git-send-email-arun.siluvery@linux.intel.com>
X-Mailer: git-send-email 1.9.1
In-Reply-To: <1466147355-4635-1-git-send-email-arun.siluvery@linux.intel.com>
References: <1466147355-4635-1-git-send-email-arun.siluvery@linux.intel.com>
Cc: Ian Lister <ian.lister@intel.com>, Tomas Elf <tomas.elf@intel.com>
Subject: [Intel-gfx] [PATCH v2 10/15] drm/i915: Extending
	i915_gem_check_wedge to check engine reset in progress
X-BeenThere: intel-gfx@lists.freedesktop.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: Intel graphics driver community testing & development
	<intel-gfx.lists.freedesktop.org>
List-Unsubscribe: <https://lists.freedesktop.org/mailman/options/intel-gfx>,
	<mailto:intel-gfx-request@lists.freedesktop.org?subject=unsubscribe>
List-Archive: <https://lists.freedesktop.org/archives/intel-gfx>
List-Post: <mailto:intel-gfx@lists.freedesktop.org>
List-Help: <mailto:intel-gfx-request@lists.freedesktop.org?subject=help>
List-Subscribe: <https://lists.freedesktop.org/mailman/listinfo/intel-gfx>,
	<mailto:intel-gfx-request@lists.freedesktop.org?subject=subscribe>
Errors-To: intel-gfx-bounces@lists.freedesktop.org
Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>
X-Virus-Scanned: ClamAV using ClamSMTP

i915_gem_check_wedge() now returns a non-zero result in three different cases:

1. Legacy: A hang has been detected and a full GPU reset is in progress.

2. Per-engine recovery:
   a. An engine reference is passed to the function, in which case only
   that engine is checked. The result is non-zero if that engine has been
   detected as hung and is about to be reset; a reset in progress on any
   other engine does not affect the result.

   b. No engine reference is passed to the function, in which case all
   engines are checked for ongoing per-engine hang recovery.

__i915_wait_request() is updated so that, if an engine reset is pending, it
returns -EAGAIN and asks the waiter to try again, allowing engine recovery
to continue. If __i915_wait_request() did not take per-engine hang recovery
into account, a waiting thread would have no way to know that a per-engine
recovery is about to happen and that it needs to back off.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
---

These changes are based on the current nightly. I am aware of the changes
being made to wait_request in the "thundering herd" series, but my
understanding is that it has other dependencies. We can add incremental
changes once that series is merged.
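
For reviewers, the cases listed in the commit message can be sketched as a
small userspace model. This is only an illustration under stated
assumptions, not the driver code: gpu_error_model, engine_model,
NUM_ENGINES and the helpers below are hypothetical stand-ins for the real
i915 structures, and the -EAGAIN return for interruptible callers is
inferred from the existing comment in i915_gem_check_wedge().

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_ENGINES 5 /* assumed engine count, for the model only */

/* Hypothetical stand-in for the reset state kept in struct i915_gpu_error. */
struct gpu_error_model {
	bool terminally_wedged;         /* GPU declared unrecoverable */
	bool full_reset_in_progress;    /* case 1: legacy full GPU reset */
	bool engine_reset[NUM_ENGINES]; /* case 2: per-engine recovery flags */
};

/* Hypothetical stand-in for struct intel_engine_cs. */
struct engine_model {
	int id;
};

/* Check one engine if given (case 2a), otherwise all engines (case 2b). */
static bool engine_reset_pending(const struct gpu_error_model *err,
				 const struct engine_model *engine)
{
	int i;

	if (engine)
		return err->engine_reset[engine->id];

	for (i = 0; i < NUM_ENGINES; i++) {
		if (err->engine_reset[i])
			return true;
	}

	return false;
}

/* Model of the extended wedge check: 0, -EAGAIN or -EIO. */
static int check_wedge(const struct gpu_error_model *err,
		       const struct engine_model *engine,
		       bool interruptible)
{
	if (err->terminally_wedged)
		return -EIO;

	if (err->full_reset_in_progress || engine_reset_pending(err, engine)) {
		/* Non-interruptible callers cannot handle -EAGAIN. */
		return interruptible ? -EAGAIN : -EIO;
	}

	return 0;
}
```

In this model, a pending reset on engine 1 makes check_wedge() return
-EAGAIN when called with engine 1 or with NULL, but 0 when called with
engine 0, which matches cases 2a and 2b above.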

 drivers/gpu/drm/i915/i915_gem.c | 43 ++++++++++++++++++++++++++++++++++++-----
 1 file changed, 38 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 6160564..bc404da 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -100,12 +100,31 @@ static void i915_gem_info_remove_obj(struct drm_i915_private *dev_priv,
 	spin_unlock(&dev_priv->mm.object_stat_lock);
 }
 
+static bool i915_engine_reset_pending(struct i915_gpu_error *error,
+				     struct intel_engine_cs *engine)
+{
+	int i;
+
+	if (engine)
+		return i915_engine_reset_in_progress(error, engine->id);
+
+	for (i = 0; i < I915_NUM_ENGINES; ++i) {
+		if (i915_engine_reset_in_progress(error, i))
+			return true;
+	}
+
+	return false;
+}
+
 static int
 i915_gem_wait_for_error(struct i915_gpu_error *error)
 {
 	int ret;
 
-	if (!i915_reset_in_progress(error))
+#define EXIT_COND (!i915_reset_in_progress(error) &&	\
+		   !i915_engine_reset_pending(error, NULL))
+
+	if (EXIT_COND)
 		return 0;
 
 	/*
@@ -114,7 +133,7 @@ i915_gem_wait_for_error(struct i915_gpu_error *error)
 	 * we should simply try to bail out and fail as gracefully as possible.
 	 */
 	ret = wait_event_interruptible_timeout(error->reset_queue,
-					       !i915_reset_in_progress(error),
+					       EXIT_COND,
 					       10*HZ);
 	if (ret == 0) {
 		DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
@@ -1325,12 +1344,18 @@ put_rpm:
 }
 
 static int
-i915_gem_check_wedge(unsigned reset_counter, bool interruptible)
+i915_gem_check_wedge(struct drm_i915_private *dev_priv,
+		     struct intel_engine_cs *engine,
+		     bool interruptible)
 {
+	struct i915_gpu_error *error = &dev_priv->gpu_error;
+	unsigned reset_counter = i915_reset_counter(error);
+
 	if (__i915_terminally_wedged(reset_counter))
 		return -EIO;
 
-	if (__i915_reset_in_progress(reset_counter)) {
+	if (__i915_reset_in_progress(reset_counter) ||
+	    i915_engine_reset_pending(error, engine)) {
 		/* Non-interruptible callers can't handle -EAGAIN, hence return
 		 * -EIO unconditionally for these. */
 		if (!interruptible)
@@ -1500,6 +1525,7 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
 
 	for (;;) {
 		struct timer_list timer;
+		int reset_pending;
 
 		prepare_to_wait(&engine->irq_queue, &wait, state);
 
@@ -1515,6 +1541,13 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
 			break;
 		}
 
+		reset_pending = i915_engine_reset_pending(&dev_priv->gpu_error,
+							  NULL);
+		if (reset_pending) {
+			ret = -EAGAIN;
+			break;
+		}
+
 		if (i915_gem_request_completed(req, false)) {
 			ret = 0;
 			break;
@@ -2997,7 +3030,7 @@ __i915_gem_request_alloc(struct intel_engine_cs *engine,
 	 * EIO if the GPU is already wedged, or EAGAIN to drop the struct_mutex
 	 * and restart.
 	 */
-	ret = i915_gem_check_wedge(reset_counter, dev_priv->mm.interruptible);
+	ret = i915_gem_check_wedge(dev_priv, NULL, dev_priv->mm.interruptible);
 	if (ret)
 		return ret;