From patchwork Fri Jan 27 00:28:35 2023
X-Patchwork-Submitter: John Harrison
X-Patchwork-Id: 13118010
From: John.C.Harrison@Intel.com
To: Intel-GFX@Lists.FreeDesktop.Org
Cc: Matthew Brost, Chris Wilson, Michael Cheng, Alan Previn,
 Tvrtko Ursulin, Umesh Nerlige Ramappa, Matthew Auld, Lucas De Marchi,
 Daniele Ceraolo Spurio, DRI-Devel@Lists.FreeDesktop.Org, Rodrigo Vivi,
 Tejas Upadhyay, John Harrison, Bruce Chang
Subject: [PATCH v6 1/8] drm/i915/guc: Fix locking when searching for a hung request
Date: Thu, 26 Jan 2023 16:28:35 -0800
Message-Id: <20230127002842.3169194-2-John.C.Harrison@Intel.com>
In-Reply-To: <20230127002842.3169194-1-John.C.Harrison@Intel.com>
References: <20230127002842.3169194-1-John.C.Harrison@Intel.com>

From: John Harrison

intel_guc_find_hung_context() was not acquiring the correct spinlock
before searching the request list. So fix that up. While at it, add
some extra whitespace padding for readability.
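
For readers less familiar with the pattern, the shape of the fix can be
sketched in plain userspace C: walk the list only while holding the lock
that protects it, remember the outcome in a local flag, and act on it
after the lock is dropped. This is a minimal sketch and not the driver
code. The struct names, the pthread mutex standing in for the spinlock
and the helper context_has_active_request() are all assumptions made
for illustration.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a request and its owning context. */
enum req_state { REQ_PENDING, REQ_ACTIVE, REQ_COMPLETE };

struct request {
	enum req_state state;
	struct request *next;
};

struct context {
	pthread_mutex_t lock;		/* protects the requests list */
	struct request *requests;	/* head of a singly linked list */
};

/*
 * Search the list under the lock that protects it and report the
 * result through a local flag, so the caller can do any follow-up
 * work only after the lock has been released.
 */
static bool context_has_active_request(struct context *ctx)
{
	bool found = false;
	struct request *rq;

	pthread_mutex_lock(&ctx->lock);
	for (rq = ctx->requests; rq; rq = rq->next) {
		if (rq->state != REQ_ACTIVE)
			continue;

		found = true;
		break;
	}
	pthread_mutex_unlock(&ctx->lock);

	return found;
}
```

Keeping the follow-up work outside the locked region keeps the critical
section short and avoids taking further locks while this one is held.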

Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset with GuC")
Signed-off-by: John Harrison
Reviewed-by: Daniele Ceraolo Spurio
Acked-by: Tvrtko Ursulin
Cc: Matthew Brost
Cc: Jani Nikula
Cc: Joonas Lahtinen
Cc: Rodrigo Vivi
Cc: Matt Roper
Cc: Umesh Nerlige Ramappa
Cc: Michael Cheng
Cc: Lucas De Marchi
Cc: Tejas Upadhyay
Cc: Chris Wilson
Cc: Bruce Chang
Cc: Alan Previn
Cc: Matthew Auld
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index b436dd7f12e42..3b34a82d692be 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4820,6 +4820,8 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
 
 	xa_lock_irqsave(&guc->context_lookup, flags);
 	xa_for_each(&guc->context_lookup, index, ce) {
+		bool found;
+
 		if (!kref_get_unless_zero(&ce->ref))
 			continue;
 
@@ -4836,10 +4838,18 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
 			goto next;
 		}
 
+		found = false;
+		spin_lock(&ce->guc_state.lock);
 		list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
 			if (i915_test_request_state(rq) != I915_REQUEST_ACTIVE)
 				continue;
 
+			found = true;
+			break;
+		}
+		spin_unlock(&ce->guc_state.lock);
+
+		if (found) {
 			intel_engine_set_hung_context(engine, ce);
 
 			/* Can only cope with one hang at a time... */
@@ -4847,6 +4857,7 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
 			xa_lock(&guc->context_lookup);
 			goto done;
 		}
+
 next:
 		intel_context_put(ce);
 		xa_lock(&guc->context_lookup);

From patchwork Fri Jan 27 00:28:36 2023
X-Patchwork-Submitter: John Harrison
X-Patchwork-Id: 13118012
From: John.C.Harrison@Intel.com
To: Intel-GFX@Lists.FreeDesktop.Org
Cc: Matthew Brost, Andy Shevchenko, Michael Cheng, Aravind Iddamsetty,
 Alan Previn, Tvrtko Ursulin, Umesh Nerlige Ramappa, Lucas De Marchi,
 Bruce Chang, Daniele Ceraolo Spurio, DRI-Devel@Lists.FreeDesktop.Org,
 Andrzej Hajda, Rodrigo Vivi, Tejas Upadhyay, John Harrison, Matthew Auld
Subject: [PATCH v6 2/8] drm/i915: Fix request ref counting during error capture & debugfs dump
Date: Thu, 26 Jan 2023 16:28:36 -0800
Message-Id: <20230127002842.3169194-3-John.C.Harrison@Intel.com>
In-Reply-To: <20230127002842.3169194-1-John.C.Harrison@Intel.com>
References: <20230127002842.3169194-1-John.C.Harrison@Intel.com>

From: John Harrison

When GuC support was added to error capture, the reference counting
around the request object was broken. Fix it up.

The context based search manages the spinlocking around the search
internally. So it needs to grab the reference count internally as
well. The execlist only request based search relies on external
locking, so it needs an external reference count but within the
spinlock not outside it.

The only other caller of the context based search is the code for
dumping engine state to debugfs.
That code wasn't previously getting an explicit reference at all as it
does everything while holding the execlist specific spinlock. So, that
needs updating as well as that spinlock doesn't help when using GuC
submission. Rather than trying to conditionally get/put depending on
submission model, just change it to always do the get/put.

v2: Explicitly document adding an extra blank line in some dense code
(Andy Shevchenko). Fix multiple potential null pointer derefs in case
of no request found (some spotted by Tvrtko, but there was more!).
Also fix a leaked request in case of !started and another in
__guc_reset_context now that intel_context_find_active_request is
actually reference counting the returned request.
v3: Add a _get suffix to intel_context_find_active_request now that it
grabs a reference (Daniele).
v4: Split the intel_guc_find_hung_context change to a separate patch
and rename intel_context_find_active_request_get to
intel_context_get_active_request (Tvrtko).
v5: s/locking/reference counting/ in commit message (Tvrtko)

Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset with GuC")
Fixes: 573ba126aef3 ("drm/i915/guc: Capture error state on context reset")
Signed-off-by: John Harrison
Reviewed-by: Daniele Ceraolo Spurio
Acked-by: Tvrtko Ursulin
Cc: Matthew Brost
Cc: Jani Nikula
Cc: Joonas Lahtinen
Cc: Rodrigo Vivi
Cc: Andrzej Hajda
Cc: Matthew Auld
Cc: Matt Roper
Cc: Umesh Nerlige Ramappa
Cc: Michael Cheng
Cc: Lucas De Marchi
Cc: Tejas Upadhyay
Cc: Andy Shevchenko
Cc: Aravind Iddamsetty
Cc: Alan Previn
Cc: Bruce Chang
---
 drivers/gpu/drm/i915/gt/intel_context.c           |  4 +++-
 drivers/gpu/drm/i915/gt/intel_context.h           |  3 +--
 drivers/gpu/drm/i915/gt/intel_engine_cs.c         |  6 +++++-
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c |  3 ++-
 drivers/gpu/drm/i915/i915_gpu_error.c             | 13 ++++++-------
 5 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index e94365b08f1ef..2aa63ec521b89 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -528,7 +528,7 @@ struct i915_request *intel_context_create_request(struct intel_context *ce)
 	return rq;
 }
 
-struct i915_request *intel_context_find_active_request(struct intel_context *ce)
+struct i915_request *intel_context_get_active_request(struct intel_context *ce)
 {
 	struct intel_context *parent = intel_context_to_parent(ce);
 	struct i915_request *rq, *active = NULL;
@@ -552,6 +552,8 @@ struct i915_request *intel_context_find_active_request(struct intel_context *ce)
 
 		active = rq;
 	}
+	if (active)
+		active = i915_request_get_rcu(active);
 	spin_unlock_irqrestore(&parent->guc_state.lock, flags);
 
 	return active;
diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
index fb62b7b8cbcda..0a8d553da3f43 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.h
+++ b/drivers/gpu/drm/i915/gt/intel_context.h
@@ -268,8 +268,7 @@ int intel_context_prepare_remote_request(struct intel_context *ce,
 
 struct i915_request *intel_context_create_request(struct intel_context *ce);
 
-struct i915_request *
-intel_context_find_active_request(struct intel_context *ce);
+struct i915_request *intel_context_get_active_request(struct intel_context *ce);
 
 static inline bool intel_context_is_barrier(const struct intel_context *ce)
 {
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 922f1bb22dc68..a86bdbee7a6be 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -2237,9 +2237,11 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d
 	if (guc) {
 		ce = intel_engine_get_hung_context(engine);
 		if (ce)
-			hung_rq = intel_context_find_active_request(ce);
+			hung_rq = intel_context_get_active_request(ce);
 	} else {
 		hung_rq = intel_engine_execlist_find_hung_request(engine);
+		if (hung_rq)
+			hung_rq = i915_request_get_rcu(hung_rq);
 	}
 
 	if (hung_rq)
@@ -2250,6 +2252,8 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d
 	else
 		intel_engine_dump_active_requests(&engine->sched_engine->requests,
 						  hung_rq, m);
+	if (hung_rq)
+		i915_request_put(hung_rq);
 }
 
 void intel_engine_dump(struct intel_engine_cs *engine,
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 3b34a82d692be..a2b263e5fd667 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1702,7 +1702,7 @@ static void __guc_reset_context(struct intel_context *ce, intel_engine_mask_t st
 			goto next_context;
 
 		guilty = false;
-		rq = intel_context_find_active_request(ce);
+		rq = intel_context_get_active_request(ce);
 		if (!rq) {
 			head = ce->ring->tail;
 			goto out_replay;
@@ -1715,6 +1715,7 @@ static void __guc_reset_context(struct intel_context *ce, intel_engine_mask_t st
 		head = intel_ring_wrap(ce->ring, rq->head);
 
 		__i915_request_reset(rq, guilty);
+		i915_request_put(rq);
 out_replay:
 		guc_reset_state(ce, head, guilty);
 next_context:
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 9d5d5a397b64e..9e2d17785a9a8 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1607,7 +1607,7 @@ capture_engine(struct intel_engine_cs *engine,
 	ce = intel_engine_get_hung_context(engine);
 	if (ce) {
 		intel_engine_clear_hung_context(engine);
-		rq = intel_context_find_active_request(ce);
+		rq = intel_context_get_active_request(ce);
 		if (!rq || !i915_request_started(rq))
 			goto no_request_capture;
 	} else {
@@ -1618,21 +1618,18 @@ capture_engine(struct intel_engine_cs *engine,
 		if (!intel_uc_uses_guc_submission(&engine->gt->uc)) {
 			spin_lock_irqsave(&engine->sched_engine->lock, flags);
 			rq = intel_engine_execlist_find_hung_request(engine);
+			if (rq)
+				rq = i915_request_get_rcu(rq);
 			spin_unlock_irqrestore(&engine->sched_engine->lock,
 					       flags);
 		}
 	}
-	if (rq)
-		rq = i915_request_get_rcu(rq);
-
 	if (!rq)
 		goto no_request_capture;
 
 	capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
-	if (!capture) {
-		i915_request_put(rq);
+	if (!capture)
 		goto no_request_capture;
-	}
 	if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
 		intel_guc_capture_get_matching_node(engine->gt, ee, ce);
 
@@ -1642,6 +1639,8 @@ capture_engine(struct intel_engine_cs *engine,
 	return ee;
 
 no_request_capture:
+	if (rq)
+		i915_request_put(rq);
 	kfree(ee);
 	return NULL;
 }

From patchwork Fri Jan 27 00:28:37 2023
X-Patchwork-Submitter: John Harrison
X-Patchwork-Id: 13118014
From: John.C.Harrison@Intel.com
To: Intel-GFX@Lists.FreeDesktop.Org
Cc: Matthew Brost, Michael Cheng, Alan Previn, Tvrtko Ursulin,
 Umesh Nerlige Ramappa, Matthew Auld, Lucas De Marchi,
 Daniele Ceraolo Spurio, DRI-Devel@Lists.FreeDesktop.Org, Rodrigo Vivi,
 John Harrison, Bruce Chang
Subject: [PATCH v6 3/8] drm/i915: Fix up locking around dumping requests lists
Date: Thu, 26 Jan 2023 16:28:37 -0800
Message-Id: <20230127002842.3169194-4-John.C.Harrison@Intel.com>
In-Reply-To: <20230127002842.3169194-1-John.C.Harrison@Intel.com>
References: <20230127002842.3169194-1-John.C.Harrison@Intel.com>

From: John Harrison

The debugfs dump of requests was confused about what state requires
the execlist lock versus the GuC lock. There was also a bunch of
duplicated messy code between it and the error capture code.

So refactor the hung request search into a re-usable function.
And reduce the span of the execlist state lock to only the execlist
specific code paths. In order to do that, also move the report of hold
count (which is an execlist only concept) from the top level dump
function to the lower level execlist specific function. Also, move the
execlist specific code into the execlist source file.

v2: Rename some functions and move to more appropriate files (Daniele).
v3: Rename new execlist dump function (Daniele)

Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset with GuC")
Signed-off-by: John Harrison
Reviewed-by: Daniele Ceraolo Spurio
Acked-by: Tvrtko Ursulin
Cc: Matthew Brost
Cc: Jani Nikula
Cc: Joonas Lahtinen
Cc: Rodrigo Vivi
Cc: Matt Roper
Cc: Umesh Nerlige Ramappa
Cc: Michael Cheng
Cc: Lucas De Marchi
Cc: Bruce Chang
Cc: Alan Previn
Cc: Matthew Auld
---
 drivers/gpu/drm/i915/gt/intel_engine.h        |  4 +-
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     | 74 +++++++++----------
 .../drm/i915/gt/intel_execlists_submission.c  | 27 +++++++
 .../drm/i915/gt/intel_execlists_submission.h  |  4 +
 drivers/gpu/drm/i915/i915_gpu_error.c         | 26 +------
 5 files changed, 73 insertions(+), 62 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
index 0e24af5efee9c..b58c30ac8ef02 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine.h
@@ -250,8 +250,8 @@ void intel_engine_dump_active_requests(struct list_head *requests,
 ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine,
 				   ktime_t *now);
 
-struct i915_request *
-intel_engine_execlist_find_hung_request(struct intel_engine_cs *engine);
+void intel_engine_get_hung_entity(struct intel_engine_cs *engine,
+				  struct intel_context **ce, struct i915_request **rq);
 
 u32 intel_engine_context_size(struct intel_gt *gt, u8 class);
 struct intel_context *
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index a86bdbee7a6be..9f703f255d721 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -2114,17 +2114,6 @@ static void print_request_ring(struct drm_printer *m, struct i915_request *rq)
 	}
 }
 
-static unsigned long list_count(struct list_head *list)
-{
-	struct list_head *pos;
-	unsigned long count = 0;
-
-	list_for_each(pos, list)
-		count++;
-
-	return count;
-}
-
 static unsigned long read_ul(void *p, size_t x)
 {
 	return *(unsigned long *)(p + x);
@@ -2216,11 +2205,11 @@ void intel_engine_dump_active_requests(struct list_head *requests,
 	}
 }
 
-static void engine_dump_active_requests(struct intel_engine_cs *engine, struct drm_printer *m)
+static void engine_dump_active_requests(struct intel_engine_cs *engine,
+					struct drm_printer *m)
 {
+	struct intel_context *hung_ce = NULL;
 	struct i915_request *hung_rq = NULL;
-	struct intel_context *ce;
-	bool guc;
 
 	/*
 	 * No need for an engine->irq_seqno_barrier() before the seqno reads.
@@ -2229,29 +2218,20 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d
 	 * But the intention here is just to report an instantaneous snapshot
 	 * so that's fine.
 	 */
-	lockdep_assert_held(&engine->sched_engine->lock);
+	intel_engine_get_hung_entity(engine, &hung_ce, &hung_rq);
 
 	drm_printf(m, "\tRequests:\n");
 
-	guc = intel_uc_uses_guc_submission(&engine->gt->uc);
-	if (guc) {
-		ce = intel_engine_get_hung_context(engine);
-		if (ce)
-			hung_rq = intel_context_get_active_request(ce);
-	} else {
-		hung_rq = intel_engine_execlist_find_hung_request(engine);
-		if (hung_rq)
-			hung_rq = i915_request_get_rcu(hung_rq);
-	}
-
 	if (hung_rq)
 		engine_dump_request(hung_rq, m, "\t\thung");
+	else if (hung_ce)
+		drm_printf(m, "\t\tGot hung ce but no hung rq!\n");
 
-	if (guc)
+	if (intel_uc_uses_guc_submission(&engine->gt->uc))
 		intel_guc_dump_active_requests(engine, hung_rq, m);
 	else
-		intel_engine_dump_active_requests(&engine->sched_engine->requests,
-						  hung_rq, m);
+		intel_execlists_dump_active_requests(engine, hung_rq, m);
+
 	if (hung_rq)
 		i915_request_put(hung_rq);
 }
@@ -2263,7 +2243,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
 	struct i915_gpu_error * const error = &engine->i915->gpu_error;
 	struct i915_request *rq;
 	intel_wakeref_t wakeref;
-	unsigned long flags;
 	ktime_t dummy;
 
 	if (header) {
@@ -2300,13 +2279,8 @@ void intel_engine_dump(struct intel_engine_cs *engine,
 		   i915_reset_count(error));
 	print_properties(engine, m);
 
-	spin_lock_irqsave(&engine->sched_engine->lock, flags);
 	engine_dump_active_requests(engine, m);
 
-	drm_printf(m, "\tOn hold?: %lu\n",
-		   list_count(&engine->sched_engine->hold));
-	spin_unlock_irqrestore(&engine->sched_engine->lock, flags);
-
 	drm_printf(m, "\tMMIO base: 0x%08x\n", engine->mmio_base);
 	wakeref = intel_runtime_pm_get_if_in_use(engine->uncore->rpm);
 	if (wakeref) {
@@ -2352,8 +2326,7 @@ intel_engine_create_virtual(struct intel_engine_cs **siblings,
 	return siblings[0]->cops->create_virtual(siblings, count, flags);
 }
 
-struct i915_request *
-intel_engine_execlist_find_hung_request(struct intel_engine_cs *engine)
+static struct i915_request *engine_execlist_find_hung_request(struct intel_engine_cs *engine)
 {
 	struct i915_request *request, *active = NULL;
 
@@ -2405,6 +2378,33 @@ intel_engine_execlist_find_hung_request(struct intel_engine_cs *engine)
 	return active;
 }
 
+void intel_engine_get_hung_entity(struct intel_engine_cs *engine,
+				  struct intel_context **ce, struct i915_request **rq)
+{
+	unsigned long flags;
+
+	*ce = intel_engine_get_hung_context(engine);
+	if (*ce) {
+		intel_engine_clear_hung_context(engine);
+
+		*rq = intel_context_get_active_request(*ce);
+		return;
+	}
+
+	/*
+	 * Getting here with GuC enabled means it is a forced error capture
+	 * with no actual hang. So, no need to attempt the execlist search.
+	 */
+	if (intel_uc_uses_guc_submission(&engine->gt->uc))
+		return;
+
+	spin_lock_irqsave(&engine->sched_engine->lock, flags);
+	*rq = engine_execlist_find_hung_request(engine);
+	if (*rq)
+		*rq = i915_request_get_rcu(*rq);
+	spin_unlock_irqrestore(&engine->sched_engine->lock, flags);
+}
+
 void xehp_enable_ccs_engines(struct intel_engine_cs *engine)
 {
 	/*
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index 18ffe55282e59..3c573d41d4046 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -4150,6 +4150,33 @@ void intel_execlists_show_requests(struct intel_engine_cs *engine,
 	spin_unlock_irqrestore(&sched_engine->lock, flags);
 }
 
+static unsigned long list_count(struct list_head *list)
+{
+	struct list_head *pos;
+	unsigned long count = 0;
+
+	list_for_each(pos, list)
+		count++;
+
+	return count;
+}
+
+void intel_execlists_dump_active_requests(struct intel_engine_cs *engine,
+					  struct i915_request *hung_rq,
+					  struct drm_printer *m)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&engine->sched_engine->lock, flags);
+
+	intel_engine_dump_active_requests(&engine->sched_engine->requests, hung_rq, m);
+
+	drm_printf(m, "\tOn hold?: %lu\n",
+		   list_count(&engine->sched_engine->hold));
+
+	spin_unlock_irqrestore(&engine->sched_engine->lock, flags);
+}
+
 #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
 #include "selftest_execlists.c"
 #endif
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.h b/drivers/gpu/drm/i915/gt/intel_execlists_submission.h
index a1aa92c983a51..d2c7d45ea0623 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.h
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.h
@@ -32,6 +32,10 @@ void intel_execlists_show_requests(struct intel_engine_cs *engine,
 							int indent),
 				   unsigned int max);
 
+void intel_execlists_dump_active_requests(struct intel_engine_cs *engine,
+					  struct i915_request *hung_rq,
+					  struct drm_printer *m);
+
 bool
 intel_engine_in_execlists_submission_mode(const struct intel_engine_cs *engine);
 
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 9e2d17785a9a8..b20bd6365615b 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1596,35 +1596,15 @@ capture_engine(struct intel_engine_cs *engine,
 {
 	struct intel_engine_capture_vma *capture = NULL;
 	struct intel_engine_coredump *ee;
-	struct intel_context *ce;
+	struct intel_context *ce = NULL;
 	struct i915_request *rq = NULL;
-	unsigned long flags;
 
 	ee = intel_engine_coredump_alloc(engine, ALLOW_FAIL, dump_flags);
 	if (!ee)
 		return NULL;
 
-	ce = intel_engine_get_hung_context(engine);
-	if (ce) {
-		intel_engine_clear_hung_context(engine);
-		rq = intel_context_get_active_request(ce);
-		if (!rq || !i915_request_started(rq))
-			goto no_request_capture;
-	} else {
-		/*
-		 * Getting here with GuC enabled means it is a forced error capture
-		 * with no actual hang. So, no need to attempt the execlist search.
-		 */
-		if (!intel_uc_uses_guc_submission(&engine->gt->uc)) {
-			spin_lock_irqsave(&engine->sched_engine->lock, flags);
-			rq = intel_engine_execlist_find_hung_request(engine);
-			if (rq)
-				rq = i915_request_get_rcu(rq);
-			spin_unlock_irqrestore(&engine->sched_engine->lock,
-					       flags);
-		}
-	}
-	if (!rq)
+	intel_engine_get_hung_entity(engine, &ce, &rq);
+	if (!rq || !i915_request_started(rq))
 		goto no_request_capture;
 
 	capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);

From patchwork Fri Jan 27 00:28:38 2023
X-Patchwork-Submitter: John Harrison
X-Patchwork-Id: 13118013
From: John.C.Harrison@Intel.com
To: Intel-GFX@Lists.FreeDesktop.Org
Cc: Daniele Ceraolo Spurio, Umesh Nerlige Ramappa, John Harrison,
 DRI-Devel@Lists.FreeDesktop.Org, Tvrtko Ursulin
Subject: [PATCH v6 4/8] drm/i915: Allow error capture without a request
Date: Thu, 26 Jan 2023 16:28:38 -0800
Message-Id: <20230127002842.3169194-5-John.C.Harrison@Intel.com>
In-Reply-To: <20230127002842.3169194-1-John.C.Harrison@Intel.com>
References: <20230127002842.3169194-1-John.C.Harrison@Intel.com>

From: John Harrison

There was a report of error captures occurring without any hung
context being indicated despite the capture being initiated by a 'hung
context notification' from GuC. The problem was not reproducible.
However, it is possible to happen if the context in question has no
active requests. For example, if the hang was in the context switch
itself then the breadcrumb write would have occurred and the KMD would
see an idle context.
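
The idle-context case above can be pictured with a small userspace
model. This is only a sketch under stated assumptions: struct request,
its completed flag and find_active_request() are hypothetical
stand-ins, not the i915 types or search. It shows how a search for an
active request comes back empty once every breadcrumb has been
written, even though the context itself was reported as hung.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request: 'completed' models the breadcrumb having landed. */
struct request {
	bool completed;
};

/*
 * Return the first request that has not completed yet, or NULL when
 * every request on the context has already written its breadcrumb.
 */
static const struct request *
find_active_request(const struct request *rq, size_t count)
{
	size_t i;

	for (i = 0; i < count; i++) {
		if (!rq[i].completed)
			return &rq[i];
	}

	return NULL;
}
```

A NULL result here does not mean nothing went wrong. It only means the
request list cannot say what was running at the time of the hang.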

In the interests of attempting to provide as much information as
possible about a hang, it seems wise to include the engine info
regardless of whether a request was found or not. As opposed to just
pretending there was no hang at all.

So update the error capture code to always record engine information
if a context is given. Which means updating record_context() to take a
context instead of a request (which it only ever used to find the
context anyway). And split the request agnostic parts of
intel_engine_coredump_add_request() out into a separate function.

v2: Remove a duplicate 'if' statement (Umesh) and fix a put of a null
pointer.
v3: Tidy up request locking code flow (Tvrtko)
v4: Pull in improved info message from next patch and fix up potential
leak of GuC register state (Daniele)

Signed-off-by: John Harrison
Reviewed-by: Umesh Nerlige Ramappa (v2)
Reviewed-by: Daniele Ceraolo Spurio
Acked-by: Tvrtko Ursulin
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 74 ++++++++++++++++++---------
 1 file changed, 50 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index b20bd6365615b..225f1b11a6b93 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1370,14 +1370,14 @@ static void engine_record_execlists(struct intel_engine_coredump *ee)
 }
 
 static bool record_context(struct i915_gem_context_coredump *e,
-			   const struct i915_request *rq)
+			   struct intel_context *ce)
 {
 	struct i915_gem_context *ctx;
 	struct task_struct *task;
 	bool simulated;
 
 	rcu_read_lock();
-	ctx = rcu_dereference(rq->context->gem_context);
+	ctx = rcu_dereference(ce->gem_context);
 	if (ctx && !kref_get_unless_zero(&ctx->ref))
 		ctx = NULL;
 	rcu_read_unlock();
@@ -1396,8 +1396,8 @@ static bool record_context(struct i915_gem_context_coredump *e,
 	e->guilty = atomic_read(&ctx->guilty_count);
 	e->active = atomic_read(&ctx->active_count);
 
-	e->total_runtime = intel_context_get_total_runtime_ns(rq->context);
-	e->avg_runtime = intel_context_get_avg_runtime_ns(rq->context);
+	e->total_runtime = intel_context_get_total_runtime_ns(ce);
+	e->avg_runtime = intel_context_get_avg_runtime_ns(ce);
 
 	simulated = i915_gem_context_no_error_capture(ctx);
 
@@ -1532,15 +1532,37 @@ intel_engine_coredump_alloc(struct intel_engine_cs *engine, gfp_t gfp, u32 dump_
 	return ee;
 }
 
+static struct intel_engine_capture_vma *
+engine_coredump_add_context(struct intel_engine_coredump *ee,
+			    struct intel_context *ce,
+			    gfp_t gfp)
+{
+	struct intel_engine_capture_vma *vma = NULL;
+
+	ee->simulated |= record_context(&ee->context, ce);
+	if (ee->simulated)
+		return NULL;
+
+	/*
+	 * We need to copy these to an anonymous buffer
+	 * as the simplest method to avoid being overwritten
+	 * by userspace.
+	 */
+	vma = capture_vma(vma, ce->ring->vma, "ring", gfp);
+	vma = capture_vma(vma, ce->state, "HW context", gfp);
+
+	return vma;
+}
+
 struct intel_engine_capture_vma *
 intel_engine_coredump_add_request(struct intel_engine_coredump *ee,
 				  struct i915_request *rq,
 				  gfp_t gfp)
 {
-	struct intel_engine_capture_vma *vma = NULL;
+	struct intel_engine_capture_vma *vma;
 
-	ee->simulated |= record_context(&ee->context, rq);
-	if (ee->simulated)
+	vma = engine_coredump_add_context(ee, rq->context, gfp);
+	if (!vma)
 		return NULL;
 
 	/*
@@ -1550,8 +1572,6 @@ intel_engine_coredump_add_request(struct intel_engine_coredump *ee,
 	 */
 	vma = capture_vma_snapshot(vma, rq->batch_res, gfp, "batch");
 	vma = capture_user(vma, rq, gfp);
-	vma = capture_vma(vma, rq->ring->vma, "ring", gfp);
-	vma = capture_vma(vma, rq->context->state, "HW context", gfp);
 
 	ee->rq_head = rq->head;
 	ee->rq_post = rq->postfix;
@@ -1604,25 +1624,31 @@ capture_engine(struct intel_engine_cs *engine,
 		return NULL;
 
 	intel_engine_get_hung_entity(engine, &ce, &rq);
-	if (!rq || !i915_request_started(rq))
-		goto no_request_capture;
+	if (rq && !i915_request_started(rq)) {
+		drm_info(&engine->gt->i915->drm, "Got hung context on %s 
with active request %lld:%lld [0x%04X] not yet started\n", + engine->name, rq->fence.context, rq->fence.seqno, ce->guc_id.id); + i915_request_put(rq); + rq = NULL; + } - capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL); - if (!capture) - goto no_request_capture; - if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE) - intel_guc_capture_get_matching_node(engine->gt, ee, ce); + if (rq) { + capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL); + i915_request_put(rq); + } else if (ce) { + capture = engine_coredump_add_context(ee, ce, ATOMIC_MAYFAIL); + } - intel_engine_coredump_add_vma(ee, capture, compress); - i915_request_put(rq); + if (capture) { + intel_engine_coredump_add_vma(ee, capture, compress); - return ee; + if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE) + intel_guc_capture_get_matching_node(engine->gt, ee, ce); + } else { + kfree(ee); + ee = NULL; + } -no_request_capture: - if (rq) - i915_request_put(rq); - kfree(ee); - return NULL; + return ee; } static void From patchwork Fri Jan 27 00:28:39 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Harrison X-Patchwork-Id: 13118017 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id ECB8BC54EAA for ; Fri, 27 Jan 2023 00:29:21 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id F0A0210E161; Fri, 27 Jan 2023 00:29:19 +0000 (UTC) Received: from mga11.intel.com (mga11.intel.com [192.55.52.93]) by gabe.freedesktop.org (Postfix) with ESMTPS id D9B9110E157; Fri, 27 Jan 2023 00:29:00 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; 
i=@intel.com; q=dns/txt; s=Intel; t=1674779340; x=1706315340; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=MMGVM3oKDx1++G5z9tXhnD/tSjRvUWveU/E16hB4uys=; b=WloEnYf606i1G+nU9w0ph57EQzLpDmpkhiphfmIHzjtvHjZOhw4d9GyP M0pgVa8qMGu/rtspUdfOPfy1maUhkoC5TqiVTMtpk7M+trcDG8knsFuvN MbbGgf3Iati5E3nkvmJM35sDNPMJbBBOm7lGmdMGazk1tOLZGQPh0NQOu OVFt23kB2ZziBsVZty3eYd4XDLwP0w9luc51fcWWFFhdGdrYuHCdRNZcs cAgA3NepGF7bmHvwGux1mQ28qAkuCYax8uEwIHyKk9QgsIYuOxXT0qVhH 5W2+aXzrydgZkrUKkIENF7f3MrGKiRaEerqDTloOVEfYjN6aF/6vxVVmE w==; X-IronPort-AV: E=McAfee;i="6500,9779,10602"; a="324687310" X-IronPort-AV: E=Sophos;i="5.97,249,1669104000"; d="scan'208";a="324687310" Received: from fmsmga001.fm.intel.com ([10.253.24.23]) by fmsmga102.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jan 2023 16:28:52 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6500,9779,10602"; a="805621913" X-IronPort-AV: E=Sophos;i="5.97,249,1669104000"; d="scan'208";a="805621913" Received: from relo-linux-5.jf.intel.com ([10.165.21.152]) by fmsmga001.fm.intel.com with ESMTP; 26 Jan 2023 16:28:52 -0800 From: John.C.Harrison@Intel.com To: Intel-GFX@Lists.FreeDesktop.Org Subject: [PATCH v6 5/8] drm/i915: Allow error capture of a pending request Date: Thu, 26 Jan 2023 16:28:39 -0800 Message-Id: <20230127002842.3169194-6-John.C.Harrison@Intel.com> X-Mailer: git-send-email 2.39.1 In-Reply-To: <20230127002842.3169194-1-John.C.Harrison@Intel.com> References: <20230127002842.3169194-1-John.C.Harrison@Intel.com> MIME-Version: 1.0 Organization: Intel Corporation (UK) Ltd. - Co. Reg. 
#1134945 - Pipers Way, Swindon SN3 1RJ X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: John Harrison , DRI-Devel@Lists.FreeDesktop.Org, Tvrtko Ursulin Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: John Harrison A hang situation has been observed where the only requests on the context were either completed or not yet started according to the breadcrumbs. However, the register state claimed a batch was (maybe) in progress. So, allow capture of the pending request on the grounds that this might be better than nothing. v2: Reword 'not started' warning message (Tvrtko) Signed-off-by: John Harrison Reviewed-by: Tvrtko Ursulin --- drivers/gpu/drm/i915/i915_gpu_error.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index 225f1b11a6b93..904f21e1380cd 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -1624,12 +1624,9 @@ capture_engine(struct intel_engine_cs *engine, return NULL; intel_engine_get_hung_entity(engine, &ce, &rq); - if (rq && !i915_request_started(rq)) { + if (rq && !i915_request_started(rq)) drm_info(&engine->gt->i915->drm, "Got hung context on %s with active request %lld:%lld [0x%04X] not yet started\n", engine->name, rq->fence.context, rq->fence.seqno, ce->guc_id.id); - i915_request_put(rq); - rq = NULL; - } if (rq) { capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL); From patchwork Fri Jan 27 00:28:40 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Harrison X-Patchwork-Id: 13118015 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from 
gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id EBCC6C61D97 for ; Fri, 27 Jan 2023 00:29:18 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id C4EDE10E3D4; Fri, 27 Jan 2023 00:29:10 +0000 (UTC) Received: from mga11.intel.com (mga11.intel.com [192.55.52.93]) by gabe.freedesktop.org (Postfix) with ESMTPS id EB75110E155; Fri, 27 Jan 2023 00:29:00 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1674779340; x=1706315340; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=rdM+o6H8ybxuEDzfnAh/Dh6qJbeerttIapfW/5uZKIk=; b=mrVDLcuR4RrUawy1ssc15kw7sQ+KvEIQRCpevnMKSKaxEyTl4mg3bGM3 ZA5CYRdxj5U7Bd+WBm+pv5X60pQxMkw64qf2j6Hvz4Zt/b3lOQgAsiYMr ZKbnpHwn/v5B74eQ/ZFrU32osDqnYMFWD9ioAOGFRVlAckBthzTInwSsW seuWlvP5M16zzCyEL4PLMwiq1NahZ+4DHP8Q62hW3PIOqo/2w4h3NIDwh MVMrhMs8Yn1UgR2BNUwXSpcK6vXSr8ol5KCfnCsiE9bZsu1arkke05U9b iqXPprCvKHEy2QHiiKP0cASV5WQmkA/IqbF6ZA0tEtgMeGVvXr1JU/n6O A==; X-IronPort-AV: E=McAfee;i="6500,9779,10602"; a="324687311" X-IronPort-AV: E=Sophos;i="5.97,249,1669104000"; d="scan'208";a="324687311" Received: from fmsmga001.fm.intel.com ([10.253.24.23]) by fmsmga102.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jan 2023 16:28:53 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6500,9779,10602"; a="805621917" X-IronPort-AV: E=Sophos;i="5.97,249,1669104000"; d="scan'208";a="805621917" Received: from relo-linux-5.jf.intel.com ([10.165.21.152]) by fmsmga001.fm.intel.com with ESMTP; 26 Jan 2023 16:28:52 -0800 From: John.C.Harrison@Intel.com To: Intel-GFX@Lists.FreeDesktop.Org Subject: [PATCH v6 6/8] drm/i915/guc: Look for a guilty context when an engine reset fails Date: Thu, 26 Jan 2023 16:28:40 -0800 Message-Id: 
<20230127002842.3169194-7-John.C.Harrison@Intel.com> X-Mailer: git-send-email 2.39.1 In-Reply-To: <20230127002842.3169194-1-John.C.Harrison@Intel.com> References: <20230127002842.3169194-1-John.C.Harrison@Intel.com> MIME-Version: 1.0 Organization: Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Daniele Ceraolo Spurio , John Harrison , DRI-Devel@Lists.FreeDesktop.Org, Tvrtko Ursulin Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: John Harrison Engine resets are supposed to never fail. But in the case when one does (due to unknown reasons that normally come down to a missing w/a), it is useful to get as much information out of the system as possible. Given that the GuC intentionally dies in such a situation, it is not possible to get a guilty context notification back. So do a manual search instead. Given that GuC is dead, this is safe because GuC won't be changing the engine state asynchronously. 
v2: Change comment to be less alarming (Tvrtko) Signed-off-by: John Harrison Acked-by: Tvrtko Ursulin Reviewed-by: Daniele Ceraolo Spurio --- .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 17 +++++++++++++++-- 1 file changed, 15 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index a2b263e5fd667..7adc35bd4435a 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -4755,11 +4755,24 @@ static void reset_fail_worker_func(struct work_struct *w) guc->submission_state.reset_fail_mask = 0; spin_unlock_irqrestore(&guc->submission_state.lock, flags); - if (likely(reset_fail_mask)) + if (likely(reset_fail_mask)) { + struct intel_engine_cs *engine; + enum intel_engine_id id; + + /* + * GuC is toast at this point - it dead loops after sending the failed + * reset notification. So need to manually determine the guilty context. + * Note that it should be reliable to do this here because the GuC is + * toast and will not be scheduling behind the KMD's back. 
+ */ + for_each_engine_masked(engine, gt, reset_fail_mask, id) + intel_guc_find_hung_context(engine); + intel_gt_handle_error(gt, reset_fail_mask, I915_ERROR_CAPTURE, - "GuC failed to reset engine mask=0x%x\n", + "GuC failed to reset engine mask=0x%x", reset_fail_mask); + } } int intel_guc_engine_failure_process_msg(struct intel_guc *guc, From patchwork Fri Jan 27 00:28:41 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Harrison X-Patchwork-Id: 13118011 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 49FD8C05027 for ; Fri, 27 Jan 2023 00:29:10 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id A4A5F10E163; Fri, 27 Jan 2023 00:29:07 +0000 (UTC) Received: from mga11.intel.com (mga11.intel.com [192.55.52.93]) by gabe.freedesktop.org (Postfix) with ESMTPS id 065E710E159; Fri, 27 Jan 2023 00:29:01 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1674779341; x=1706315341; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=ge027wYS9Kpvkw04aKFhgn6e7W9uzBegmpREjabV0iY=; b=OWzfPyPHQnr8QKLOcWkQaS5BZnQr5dDaYTmiTfL8pillhvDdSfD323yP 75XV1YEywDmHsweSlXLHK4KFRUcUTUPcn5k+P3j5CZyAQCKFLE92jL4E+ 4vAla7BbjwtGs4FFQpKSuc9CPJVWmQ2bp0V766X64AmfKbmo1XYbeGHhj x8mfITtOW/4l1FGxllipKJf7DaekKV86vuDPVe/EXdrA0RP1lzqcZ3xok h7TqMr5ZMvFLQDeCIn0BwqO3M2sL8dVOyyvTK+o7VyslyQqUTmzb+ozdR FnwHYTGjmGIJJ5Mk92NwvmBEWkdwxQ75MiSSvnSSMHemS5YTxXa9MGspZ w==; X-IronPort-AV: E=McAfee;i="6500,9779,10602"; a="324687313" X-IronPort-AV: 
E=Sophos;i="5.97,249,1669104000"; d="scan'208";a="324687313" Received: from fmsmga001.fm.intel.com ([10.253.24.23]) by fmsmga102.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jan 2023 16:28:53 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6500,9779,10602"; a="805621922" X-IronPort-AV: E=Sophos;i="5.97,249,1669104000"; d="scan'208";a="805621922" Received: from relo-linux-5.jf.intel.com ([10.165.21.152]) by fmsmga001.fm.intel.com with ESMTP; 26 Jan 2023 16:28:53 -0800 From: John.C.Harrison@Intel.com To: Intel-GFX@Lists.FreeDesktop.Org Subject: [PATCH v6 7/8] drm/i915/guc: Add a debug print on GuC triggered reset Date: Thu, 26 Jan 2023 16:28:41 -0800 Message-Id: <20230127002842.3169194-8-John.C.Harrison@Intel.com> X-Mailer: git-send-email 2.39.1 In-Reply-To: <20230127002842.3169194-1-John.C.Harrison@Intel.com> References: <20230127002842.3169194-1-John.C.Harrison@Intel.com> MIME-Version: 1.0 Organization: Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: John Harrison , DRI-Devel@Lists.FreeDesktop.Org, Tvrtko Ursulin Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: John Harrison For understanding bug reports, it can be useful to have an explicit dmesg print when a reset notification is received from GuC. As opposed to simply inferring that this happened from other messages. 
Signed-off-by: John Harrison Reviewed-by: Tvrtko Ursulin --- drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index 7adc35bd4435a..2e6ab0bb5c2b6 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -4666,6 +4666,10 @@ static void guc_handle_context_reset(struct intel_guc *guc, { trace_intel_context_reset(ce); + drm_dbg(&guc_to_gt(guc)->i915->drm, "Got GuC reset of 0x%04X, exiting = %d, banned = %d\n", + ce->guc_id.id, test_bit(CONTEXT_EXITING, &ce->flags), + test_bit(CONTEXT_BANNED, &ce->flags)); + if (likely(intel_context_is_schedulable(ce))) { capture_error_state(guc, ce); guc_context_replay(ce); From patchwork Fri Jan 27 00:28:42 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Harrison X-Patchwork-Id: 13118018 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 07B92C54EAA for ; Fri, 27 Jan 2023 00:30:00 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 3426910E155; Fri, 27 Jan 2023 00:29:59 +0000 (UTC) Received: from mga11.intel.com (mga11.intel.com [192.55.52.93]) by gabe.freedesktop.org (Postfix) with ESMTPS id 1F35310E146; Fri, 27 Jan 2023 00:29:01 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1674779341; x=1706315341; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; 
bh=gE2TL8/wmSdfF/qfLiTs7dtrj7rx1/jqcUGqBVk0hBM=; b=Oof6KDCXLvcNLjsxHS/DNVNCsx0ir8ufxWY7dtfBkLDSgD74B7hkHcFY DpHxBmKmvYSFBCDe0GSL4Lm5r6q+O2dwBv9Z8lFQOUyLrj4diNIMCl9TA b7vhw8nbB8ur4tH2xzKclhbux0o25iL6kXDm1CjbMz1S7d5tvdoquWq0u kWx+QlpDbx5D2N7bGjWaFRKOFvSE7XDl/vvIKIfDvcbu8sbW2rgYMN7B2 nUvci+pLhOiW3A5Vf3Ov2NfNYrIOj1ySVwn18TpMaTPhaXiXPtfWuK1UT EclPaPNnIX9SnvNGuz5y4knllZoKZF6ev5GjNrjbUmbrYXwuS/vQtneMa g==; X-IronPort-AV: E=McAfee;i="6500,9779,10602"; a="324687314" X-IronPort-AV: E=Sophos;i="5.97,249,1669104000"; d="scan'208";a="324687314" Received: from fmsmga001.fm.intel.com ([10.253.24.23]) by fmsmga102.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 26 Jan 2023 16:28:53 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6500,9779,10602"; a="805621926" X-IronPort-AV: E=Sophos;i="5.97,249,1669104000"; d="scan'208";a="805621926" Received: from relo-linux-5.jf.intel.com ([10.165.21.152]) by fmsmga001.fm.intel.com with ESMTP; 26 Jan 2023 16:28:53 -0800 From: John.C.Harrison@Intel.com To: Intel-GFX@Lists.FreeDesktop.Org Subject: [PATCH v6 8/8] drm/i915/guc: Rename GuC register state capture node to be more obvious Date: Thu, 26 Jan 2023 16:28:42 -0800 Message-Id: <20230127002842.3169194-9-John.C.Harrison@Intel.com> X-Mailer: git-send-email 2.39.1 In-Reply-To: <20230127002842.3169194-1-John.C.Harrison@Intel.com> References: <20230127002842.3169194-1-John.C.Harrison@Intel.com> MIME-Version: 1.0 Organization: Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Daniele Ceraolo Spurio , John Harrison , DRI-Devel@Lists.FreeDesktop.Org Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: John Harrison The GuC specific register state entry in the error capture object was just called 'capture'. 
Although the companion 'node' entry was called 'guc_capture_node'. Rename the base entry to be 'guc_capture' instead so that it is a) more consistent and b) more obvious what it is. Signed-off-by: John Harrison Reviewed-by: Daniele Ceraolo Spurio --- drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c | 8 ++++---- drivers/gpu/drm/i915/i915_gpu_error.h | 2 +- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c index 1c1b85073b4bd..fc3b994626a4f 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c @@ -1506,7 +1506,7 @@ int intel_guc_capture_print_engine_node(struct drm_i915_error_state_buf *ebuf, if (!ebuf || !ee) return -EINVAL; - cap = ee->capture; + cap = ee->guc_capture; if (!cap || !ee->engine) return -ENODEV; @@ -1576,8 +1576,8 @@ void intel_guc_capture_free_node(struct intel_engine_coredump *ee) if (!ee || !ee->guc_capture_node) return; - guc_capture_add_node_to_cachelist(ee->capture, ee->guc_capture_node); - ee->capture = NULL; + guc_capture_add_node_to_cachelist(ee->guc_capture, ee->guc_capture_node); + ee->guc_capture = NULL; ee->guc_capture_node = NULL; } @@ -1611,7 +1611,7 @@ void intel_guc_capture_get_matching_node(struct intel_gt *gt, (ce->lrc.lrca & CTX_GTT_ADDRESS_MASK)) { list_del(&n->link); ee->guc_capture_node = n; - ee->capture = guc->capture; + ee->guc_capture = guc->capture; return; } } diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h index efc75cc2ffdb9..56027ffbce51f 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.h +++ b/drivers/gpu/drm/i915/i915_gpu_error.h @@ -94,7 +94,7 @@ struct intel_engine_coredump { struct intel_instdone instdone; /* GuC matched capture-lists info */ - struct intel_guc_state_capture *capture; + struct intel_guc_state_capture *guc_capture; struct __guc_capture_parsed_output *guc_capture_node; struct 
i915_gem_context_coredump {