From patchwork Tue Jan 17 21:36:26 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Harrison X-Patchwork-Id: 13105081 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 45576C54E76 for ; Tue, 17 Jan 2023 21:37:16 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 7425710E5E9; Tue, 17 Jan 2023 21:37:01 +0000 (UTC) Received: from mga02.intel.com (mga02.intel.com [134.134.136.20]) by gabe.freedesktop.org (Postfix) with ESMTPS id EB56C10E088; Tue, 17 Jan 2023 21:36:53 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1673991414; x=1705527414; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=9zCLjq2fwuiIkrwTjl39OJ+CMLrbrn0iy5u/pvPh2sA=; b=nQwJd4xb9AkEZfLslL5KaW5vWNExaPzTZfjKlF5IY86lokUCUed7hPrC NT28nSsdcDHVA879Gc0pd/ekv2P+pFkanHYWAOhuxmXGKxSeDqnjA9xfJ oC3Y6483VYjQSR2AyMCykw0j/SQNAM5Bz6iDA7KXO7Y95gQlUNXfy7NbU sAt+amJD+UoZg/CYHDSew7lY9Nxn+h4mCKcg7GlabjTELMLafIDpdi8jZ CMgncPV01ECWqTgcvqabopvE0Qg7t79A1jQFIi8A9JO6jpxIsew+DORl/ 5pNiUflj+03213493CDIvsJWzrlCAZh08jowiqOq7bD69+pRTqZiOa11g g==; X-IronPort-AV: E=McAfee;i="6500,9779,10593"; a="312696224" X-IronPort-AV: E=Sophos;i="5.97,224,1669104000"; d="scan'208";a="312696224" Received: from fmsmga005.fm.intel.com ([10.253.24.32]) by orsmga101.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 17 Jan 2023 13:36:53 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6500,9779,10593"; a="988285020" X-IronPort-AV: E=Sophos;i="5.97,224,1669104000"; d="scan'208";a="988285020" Received: from relo-linux-5.jf.intel.com ([10.165.21.152]) by fmsmga005.fm.intel.com with ESMTP; 17 Jan 2023 13:36:52 -0800 From: John.C.Harrison@Intel.com To: Intel-GFX@Lists.FreeDesktop.Org Subject: [PATCH v2 1/5] drm/i915: Fix request locking during error capture & debugfs dump Date: Tue, 17 Jan 2023 13:36:26 -0800 Message-Id: <20230117213630.2897570-2-John.C.Harrison@Intel.com> X-Mailer: git-send-email 2.39.0 In-Reply-To: <20230117213630.2897570-1-John.C.Harrison@Intel.com> References: <20230117213630.2897570-1-John.C.Harrison@Intel.com> MIME-Version: 1.0 Organization: Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Matthew Brost , Tvrtko Ursulin , Andy Shevchenko , Michael Cheng , Aravind Iddamsetty , Alan Previn , Umesh Nerlige Ramappa , intel-gfx@lists.freedesktop.org, Lucas De Marchi , Chris Wilson , Bruce Chang , Daniele Ceraolo Spurio , DRI-Devel@Lists.FreeDesktop.Org, Andrzej Hajda , Rodrigo Vivi , Tejas Upadhyay , John Harrison , Matthew Auld Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: John Harrison When GuC support was added to error capture, the locking around the request object was broken. Fix it up. The context based search manages the spinlocking around the search internally. So it needs to grab the reference count internally as well. The execlist only request based search relies on external locking, so it needs an external reference count. So no change to that code itself but the context version does change. The only other caller is the code for dumping engine state to debugfs. That code wasn't previously getting an explicit reference at all as it does everything while holding the execlist specific spinlock. So that needs updaing as well as that spinlock doesn't help when using GuC submission. Rather than trying to conditionally get/put depending on submission model, just change it to always do the get/put. In addition, intel_guc_find_hung_context() was not acquiring the correct spinlock before searching the request list. So fix that up too. Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset with GuC") Fixes: 573ba126aef3 ("drm/i915/guc: Capture error state on context reset") Cc: Matthew Brost Cc: John Harrison Cc: Jani Nikula Cc: Joonas Lahtinen Cc: Rodrigo Vivi Cc: Tvrtko Ursulin Cc: Daniele Ceraolo Spurio Cc: Andrzej Hajda Cc: Chris Wilson Cc: Matthew Auld Cc: Matt Roper Cc: Umesh Nerlige Ramappa Cc: Michael Cheng Cc: Lucas De Marchi Cc: Tejas Upadhyay Cc: Andy Shevchenko Cc: Aravind Iddamsetty Cc: Alan Previn Cc: Bruce Chang Cc: intel-gfx@lists.freedesktop.org Signed-off-by: John Harrison --- drivers/gpu/drm/i915/gt/intel_context.c | 1 + drivers/gpu/drm/i915/gt/intel_engine_cs.c | 7 ++++++- drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 +++++++++++ drivers/gpu/drm/i915/i915_gpu_error.c | 5 ++--- 4 files changed, 20 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c index e94365b08f1ef..df64cf1954c1d 100644 --- a/drivers/gpu/drm/i915/gt/intel_context.c +++ b/drivers/gpu/drm/i915/gt/intel_context.c @@ -552,6 +552,7 @@ struct i915_request *intel_context_find_active_request(struct intel_context *ce) active = rq; } + active = i915_request_get_rcu(active); spin_unlock_irqrestore(&parent->guc_state.lock, flags); return active; diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c index 922f1bb22dc68..517d1fb7ae333 100644 --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c @@ -2236,10 +2236,13 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d guc = intel_uc_uses_guc_submission(&engine->gt->uc); if (guc) { ce = intel_engine_get_hung_context(engine); - if (ce) + if (ce) { + /* This will reference count the request (if found) */ hung_rq = intel_context_find_active_request(ce); + } } else { hung_rq = intel_engine_execlist_find_hung_request(engine); + hung_rq = i915_request_get_rcu(hung_rq); } if (hung_rq) @@ -2250,6 +2253,8 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d else intel_engine_dump_active_requests(&engine->sched_engine->requests, hung_rq, m); + if (hung_rq) + i915_request_put(hung_rq); } void intel_engine_dump(struct intel_engine_cs *engine, diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index b436dd7f12e42..3b34a82d692be 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -4820,6 +4820,8 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine) xa_lock_irqsave(&guc->context_lookup, flags); xa_for_each(&guc->context_lookup, index, ce) { + bool found; + if (!kref_get_unless_zero(&ce->ref)) continue; @@ -4836,10 +4838,18 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine) goto next; } + found = false; + spin_lock(&ce->guc_state.lock); list_for_each_entry(rq, &ce->guc_state.requests, sched.link) { if (i915_test_request_state(rq) != I915_REQUEST_ACTIVE) continue; + found = true; + break; + } + spin_unlock(&ce->guc_state.lock); + + if (found) { intel_engine_set_hung_context(engine, ce); /* Can only cope with one hang at a time... */ @@ -4847,6 +4857,7 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine) xa_lock(&guc->context_lookup); goto done; } + next: intel_context_put(ce); xa_lock(&guc->context_lookup); diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index 9d5d5a397b64e..4107a0dfcca7d 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -1607,6 +1607,7 @@ capture_engine(struct intel_engine_cs *engine, ce = intel_engine_get_hung_context(engine); if (ce) { intel_engine_clear_hung_context(engine); + /* This will reference count the request (if found) */ rq = intel_context_find_active_request(ce); if (!rq || !i915_request_started(rq)) goto no_request_capture; @@ -1618,13 +1619,11 @@ capture_engine(struct intel_engine_cs *engine, if (!intel_uc_uses_guc_submission(&engine->gt->uc)) { spin_lock_irqsave(&engine->sched_engine->lock, flags); rq = intel_engine_execlist_find_hung_request(engine); + rq = i915_request_get_rcu(rq); spin_unlock_irqrestore(&engine->sched_engine->lock, flags); } } - if (rq) - rq = i915_request_get_rcu(rq); - if (!rq) goto no_request_capture; From patchwork Tue Jan 17 21:36:27 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Harrison X-Patchwork-Id: 13105078 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id CC1B3C54E76 for ; Tue, 17 Jan 2023 21:37:09 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 567E110E5E2; Tue, 17 Jan 2023 21:37:00 +0000 (UTC) Received: from mga02.intel.com (mga02.intel.com [134.134.136.20]) by gabe.freedesktop.org (Postfix) with ESMTPS id 46E9510E088; Tue, 17 Jan 2023 21:36:54 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1673991414; x=1705527414; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=6rKX0yXeVxyjWtrNQHbSm4KFl/Rn6MzlA8FP8hwjL9Y=; b=KhNC6B2xzdnt3xm2pxJOv0SPZztH05hPjeONkX4Ect/xVOSa5cpnG/xx NpEYNsdfTvDU0uaRfrtASrnTpaKYAeKx4RTyCLiRcznpTv96SD2U1/E5u ZWROv858//OCz/bVtm4ZukyxHb/IqJKSdQg7RGX/XFNC0MpkAU+376B4S BfJmagLtUyTiqLs/wtowVii8gNbv18dcIb0B/QMxKsqcqVqgYZH/Q5keE uQ4AyM+bHHTw24cg/8lKPcZy+O9Com7mWvZ0Qq1m4hJkvn/Ml10yfDmc3 OkSwvya1BVwoFuBhk7hbRhVL1BHYbh69xD3OEqUEiPEXNKVtIEimU5aKC Q==; X-IronPort-AV: E=McAfee;i="6500,9779,10593"; a="312696226" X-IronPort-AV: E=Sophos;i="5.97,224,1669104000"; d="scan'208";a="312696226" Received: from fmsmga005.fm.intel.com ([10.253.24.32]) by orsmga101.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 17 Jan 2023 13:36:53 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6500,9779,10593"; a="988285023" X-IronPort-AV: E=Sophos;i="5.97,224,1669104000"; d="scan'208";a="988285023" Received: from relo-linux-5.jf.intel.com ([10.165.21.152]) by fmsmga005.fm.intel.com with ESMTP; 17 Jan 2023 13:36:53 -0800 From: John.C.Harrison@Intel.com To: Intel-GFX@Lists.FreeDesktop.Org Subject: [PATCH v2 2/5] drm/i915: Allow error capture without a request Date: Tue, 17 Jan 2023 13:36:27 -0800 Message-Id: <20230117213630.2897570-3-John.C.Harrison@Intel.com> X-Mailer: git-send-email 2.39.0 In-Reply-To: <20230117213630.2897570-1-John.C.Harrison@Intel.com> References: <20230117213630.2897570-1-John.C.Harrison@Intel.com> MIME-Version: 1.0 Organization: Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Umesh Nerlige Ramappa , John Harrison , DRI-Devel@Lists.FreeDesktop.Org Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: John Harrison There was a report of error captures occurring without any hung context being indicated despite the capture being initiated by a 'hung context notification' from GuC. The problem was not reproducible. However, it is possible to happen if the context in question has no active requests. For example, if the hang was in the context switch itself then the breadcrumb write would have occurred and the KMD would see an idle context. In the interests of attempting to provide as much information as possible about a hang, it seems wise to include the engine info regardless of whether a request was found or not. As opposed to just prentending there was no hang at all. So update the error capture code to always record engine information if an engine is given. Which means updating record_context() to take a context instead of a request (which it only ever used to find the context anyway). And split the request agnostic parts of intel_engine_coredump_add_request() out into a seaprate function. v2: Remove a duplicate 'if' statement (Umesh) and fix a put of a null pointer. v3: Tidy up request locking code flow (Tvrtko) Signed-off-by: John Harrison Reviewed-by: Umesh Nerlige Ramappa Acked-by: Tvrtko Ursulin --- drivers/gpu/drm/i915/i915_gpu_error.c | 69 ++++++++++++++++++--------- 1 file changed, 46 insertions(+), 23 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index 4107a0dfcca7d..461489d599a7e 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -1370,14 +1370,14 @@ static void engine_record_execlists(struct intel_engine_coredump *ee) } static bool record_context(struct i915_gem_context_coredump *e, - const struct i915_request *rq) + struct intel_context *ce) { struct i915_gem_context *ctx; struct task_struct *task; bool simulated; rcu_read_lock(); - ctx = rcu_dereference(rq->context->gem_context); + ctx = rcu_dereference(ce->gem_context); if (ctx && !kref_get_unless_zero(&ctx->ref)) ctx = NULL; rcu_read_unlock(); @@ -1396,8 +1396,8 @@ static bool record_context(struct i915_gem_context_coredump *e, e->guilty = atomic_read(&ctx->guilty_count); e->active = atomic_read(&ctx->active_count); - e->total_runtime = intel_context_get_total_runtime_ns(rq->context); - e->avg_runtime = intel_context_get_avg_runtime_ns(rq->context); + e->total_runtime = intel_context_get_total_runtime_ns(ce); + e->avg_runtime = intel_context_get_avg_runtime_ns(ce); simulated = i915_gem_context_no_error_capture(ctx); @@ -1532,15 +1532,37 @@ intel_engine_coredump_alloc(struct intel_engine_cs *engine, gfp_t gfp, u32 dump_ return ee; } +static struct intel_engine_capture_vma * +engine_coredump_add_context(struct intel_engine_coredump *ee, + struct intel_context *ce, + gfp_t gfp) +{ + struct intel_engine_capture_vma *vma = NULL; + + ee->simulated |= record_context(&ee->context, ce); + if (ee->simulated) + return NULL; + + /* + * We need to copy these to an anonymous buffer + * as the simplest method to avoid being overwritten + * by userspace. + */ + vma = capture_vma(vma, ce->ring->vma, "ring", gfp); + vma = capture_vma(vma, ce->state, "HW context", gfp); + + return vma; +} + struct intel_engine_capture_vma * intel_engine_coredump_add_request(struct intel_engine_coredump *ee, struct i915_request *rq, gfp_t gfp) { - struct intel_engine_capture_vma *vma = NULL; + struct intel_engine_capture_vma *vma; - ee->simulated |= record_context(&ee->context, rq); - if (ee->simulated) + vma = engine_coredump_add_context(ee, rq->context, gfp); + if (!vma) return NULL; /* @@ -1550,8 +1572,6 @@ intel_engine_coredump_add_request(struct intel_engine_coredump *ee, */ vma = capture_vma_snapshot(vma, rq->batch_res, gfp, "batch"); vma = capture_user(vma, rq, gfp); - vma = capture_vma(vma, rq->ring->vma, "ring", gfp); - vma = capture_vma(vma, rq->context->state, "HW context", gfp); ee->rq_head = rq->head; ee->rq_post = rq->postfix; @@ -1609,8 +1629,12 @@ capture_engine(struct intel_engine_cs *engine, intel_engine_clear_hung_context(engine); /* This will reference count the request (if found) */ rq = intel_context_find_active_request(ce); - if (!rq || !i915_request_started(rq)) - goto no_request_capture; + if (rq && !i915_request_started(rq)) { + drm_info(&engine->gt->i915->drm, "Got hung context on %s with no active request!\n", + engine->name); + i915_request_put(rq); + rq = NULL; + } } else { /* * Getting here with GuC enabled means it is a forced error capture @@ -1624,25 +1648,24 @@ capture_engine(struct intel_engine_cs *engine, flags); } } - if (!rq) - goto no_request_capture; - - capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL); - if (!capture) { + if (rq) { + capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL); i915_request_put(rq); - goto no_request_capture; + } else if (ce) { + capture = engine_coredump_add_context(ee, ce, ATOMIC_MAYFAIL); } + if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE) intel_guc_capture_get_matching_node(engine->gt, ee, ce); - intel_engine_coredump_add_vma(ee, capture, compress); - i915_request_put(rq); + if (capture) { + intel_engine_coredump_add_vma(ee, capture, compress); + } else { + kfree(ee); + ee = NULL; + } return ee; - -no_request_capture: - kfree(ee); - return NULL; } static void From patchwork Tue Jan 17 21:36:28 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Harrison X-Patchwork-Id: 13105082 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id DE3A5C3DA78 for ; Tue, 17 Jan 2023 21:37:17 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 1D03910E5EA; Tue, 17 Jan 2023 21:37:02 +0000 (UTC) Received: from mga02.intel.com (mga02.intel.com [134.134.136.20]) by gabe.freedesktop.org (Postfix) with ESMTPS id 7764B10E5E1; Tue, 17 Jan 2023 21:36:54 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1673991414; x=1705527414; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=S2qa84XPCfwCvLSibgiJ0AamPPWLfsfr8XdwHJ/JzZs=; b=OIHwyexFV/8vl6VTbbimPDOPjtUAnv2jMGK14jzTfiwpRPStw8j7vXq2 hW5v9mMSWJ1P2wCke1pK3pEoXajB2qsBJw+C9IIZdwPa3DLIs4sOObm43 pNY5gClLGr+REzTT7r5qhjhvhBz9aZz7A+dE3++0mRA6ZmzJUcGGwnH4M lua5G9IlLYXeeTtrdLJTdy2FhDloecl5uOpMGcH5aF5JqK3aUXom2FIwB UYTdiH0LZYEFkI5oxw3Tr0ZSyDsmGYceFwWg7hGYbFuaOnpX1kPZ1L0rB VM1grPJzCkM9/SQ5vg56bERC2hV5XEAE41tHI3ym9LrI31Xwqg6hsJDCu Q==; X-IronPort-AV: E=McAfee;i="6500,9779,10593"; a="312696227" X-IronPort-AV: E=Sophos;i="5.97,224,1669104000"; d="scan'208";a="312696227" Received: from fmsmga005.fm.intel.com ([10.253.24.32]) by orsmga101.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 17 Jan 2023 13:36:53 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6500,9779,10593"; a="988285027" X-IronPort-AV: E=Sophos;i="5.97,224,1669104000"; d="scan'208";a="988285027" Received: from relo-linux-5.jf.intel.com ([10.165.21.152]) by fmsmga005.fm.intel.com with ESMTP; 17 Jan 2023 13:36:53 -0800 From: John.C.Harrison@Intel.com To: Intel-GFX@Lists.FreeDesktop.Org Subject: [PATCH v2 3/5] drm/i915: Allow error capture of a pending request Date: Tue, 17 Jan 2023 13:36:28 -0800 Message-Id: <20230117213630.2897570-4-John.C.Harrison@Intel.com> X-Mailer: git-send-email 2.39.0 In-Reply-To: <20230117213630.2897570-1-John.C.Harrison@Intel.com> References: <20230117213630.2897570-1-John.C.Harrison@Intel.com> MIME-Version: 1.0 Organization: Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: John Harrison , DRI-Devel@Lists.FreeDesktop.Org Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: John Harrison A hang situation has been observed where the only requests on the context were either completed or not yet started according to the breaadcrumbs. However, the register state claimed a batch was (maybe) in progress. So, allow capture of the pending request on the grounds that this might be better than nothing. v2: Reword 'not started' warning message (Tvrtko) Signed-off-by: John Harrison Reviewed-by: Tvrtko Ursulin --- drivers/gpu/drm/i915/i915_gpu_error.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index 461489d599a7e..1d33822a8ca23 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -1629,12 +1629,9 @@ capture_engine(struct intel_engine_cs *engine, intel_engine_clear_hung_context(engine); /* This will reference count the request (if found) */ rq = intel_context_find_active_request(ce); - if (rq && !i915_request_started(rq)) { - drm_info(&engine->gt->i915->drm, "Got hung context on %s with no active request!\n", - engine->name); - i915_request_put(rq); - rq = NULL; - } + if (rq && !i915_request_started(rq)) + drm_info(&engine->gt->i915->drm, "Got hung context on %s with active request %lld:%lld [0x%04X] not yet started\n", + engine->name, rq->fence.context, rq->fence.seqno, ce->guc_id.id); } else { /* * Getting here with GuC enabled means it is a forced error capture From patchwork Tue Jan 17 21:36:29 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Harrison X-Patchwork-Id: 13105079 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 73523C54E76 for ; Tue, 17 Jan 2023 21:37:12 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 756B210E5E3; Tue, 17 Jan 2023 21:37:00 +0000 (UTC) Received: from mga02.intel.com (mga02.intel.com [134.134.136.20]) by gabe.freedesktop.org (Postfix) with ESMTPS id 77CDD10E5E2; Tue, 17 Jan 2023 21:36:54 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1673991414; x=1705527414; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=sNI1D9hmQ3FjZk2ovxPA/9Shxc84mL4YIw0sgu6bSTQ=; b=MMvRnD5tGUP/eUro6Eql82LNHTKp1e7ori56oXROF90X3i14u5Bi9H0w ddbY9h7H9ErdkrW1I1UqVmkjcprALqIT/X67AL409olJM165rGtEtxzzu wNDiAVYczCGyMsP0FL4MVaqVR9ow9/k6rPtz/SwJUqWM9BiaVxNaZyJBo LSzNY487U30xtGxh1/skjiYyTGAXLpmKwgT5xYTPRr0e17t8T1TXmUbW7 o6369CV0a3lK/Hqa/tXTIfH/HpZC0cYac2+MYU+4hINWpDOwOsC9yNqsO XkLCoqywdIj1ISbUbPZqHFIqd9HY+H7Gp9yvvlDaAfPM3Pf75wOglK/Dl A==; X-IronPort-AV: E=McAfee;i="6500,9779,10593"; a="312696229" X-IronPort-AV: E=Sophos;i="5.97,224,1669104000"; d="scan'208";a="312696229" Received: from fmsmga005.fm.intel.com ([10.253.24.32]) by orsmga101.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 17 Jan 2023 13:36:54 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6500,9779,10593"; a="988285030" X-IronPort-AV: E=Sophos;i="5.97,224,1669104000"; d="scan'208";a="988285030" Received: from relo-linux-5.jf.intel.com ([10.165.21.152]) by fmsmga005.fm.intel.com with ESMTP; 17 Jan 2023 13:36:53 -0800 From: John.C.Harrison@Intel.com To: Intel-GFX@Lists.FreeDesktop.Org Subject: [PATCH v2 4/5] drm/i915/guc: Look for a guilty context when an engine reset fails Date: Tue, 17 Jan 2023 13:36:29 -0800 Message-Id: <20230117213630.2897570-5-John.C.Harrison@Intel.com> X-Mailer: git-send-email 2.39.0 In-Reply-To: <20230117213630.2897570-1-John.C.Harrison@Intel.com> References: <20230117213630.2897570-1-John.C.Harrison@Intel.com> MIME-Version: 1.0 Organization: Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: John Harrison , DRI-Devel@Lists.FreeDesktop.Org Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: John Harrison Engine resets are supposed to never fail. But in the case when one does (due to unknown reasons that normally come down to a missing w/a), it is useful to get as much information out of the system as possible. Given that the GuC effectively dies on such a situation, it is not possible to get a guilty context notification back. So do a manual search instead. Given that GuC is dead, this is safe because GuC won't be changing the engine state asynchronously. v2: Change comment to be less alarming (Tvrtko) Signed-off-by: John Harrison Acked-by: Tvrtko Ursulin --- .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 17 +++++++++++++++-- 1 file changed, 15 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index 3b34a82d692be..9bc80b807dbcc 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -4754,11 +4754,24 @@ static void reset_fail_worker_func(struct work_struct *w) guc->submission_state.reset_fail_mask = 0; spin_unlock_irqrestore(&guc->submission_state.lock, flags); - if (likely(reset_fail_mask)) + if (likely(reset_fail_mask)) { + struct intel_engine_cs *engine; + enum intel_engine_id id; + + /* + * GuC is toast at this point - it dead loops after sending the failed + * reset notification. So need to manually determine the guilty context. + * Note that it should be reliable to do this here because the GuC is + * toast and will not be scheduling behind the KMD's back. + */ + for_each_engine_masked(engine, gt, reset_fail_mask, id) + intel_guc_find_hung_context(engine); + intel_gt_handle_error(gt, reset_fail_mask, I915_ERROR_CAPTURE, - "GuC failed to reset engine mask=0x%x\n", + "GuC failed to reset engine mask=0x%x", reset_fail_mask); + } } int intel_guc_engine_failure_process_msg(struct intel_guc *guc, From patchwork Tue Jan 17 21:36:30 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Harrison X-Patchwork-Id: 13105080 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 9EBDBC677F1 for ; Tue, 17 Jan 2023 21:37:14 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 129B710E5E4; Tue, 17 Jan 2023 21:37:01 +0000 (UTC) Received: from mga02.intel.com (mga02.intel.com [134.134.136.20]) by gabe.freedesktop.org (Postfix) with ESMTPS id A775710E5E3; Tue, 17 Jan 2023 21:36:54 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1673991414; x=1705527414; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=wdasfmllz40XLlt0SGJMvTNdsjx3B7M8nIQJ/f5T2pw=; b=e+4kdotkngUWq21GGbXklcCIA7ZgfMYghRl/vZyJp9EVaklGZteDlc4O vLrSywsMQiFuhRl2fIZIhTJTTjNVixioAQKrcGPMJIgzPr/WDnkNf4rxn 2QrWlu6tMsZypCso/CW0mT8+b7OTI9GSK2nY8/gLh6EZXerxNYD14UFu2 HkQRH56EiMXel+A5OvKqjXHIS3m13bTzJL5uFWnmIBEKfjSJ0nwZMqkPn P0NOiVbaLQpL6/6cJNF7SmjWqSobkwwEczu8s3epn0Iuin3dxD2QvxmMC BKdnkIHd5p3WEBGvYPAvyGcQeUm/38n6X+SGqlHCBl9T/Kqvp9hx+mw10 g==; X-IronPort-AV: E=McAfee;i="6500,9779,10593"; a="312696230" X-IronPort-AV: E=Sophos;i="5.97,224,1669104000"; d="scan'208";a="312696230" Received: from fmsmga005.fm.intel.com ([10.253.24.32]) by orsmga101.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 17 Jan 2023 13:36:54 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6500,9779,10593"; a="988285035" X-IronPort-AV: E=Sophos;i="5.97,224,1669104000"; d="scan'208";a="988285035" Received: from relo-linux-5.jf.intel.com ([10.165.21.152]) by fmsmga005.fm.intel.com with ESMTP; 17 Jan 2023 13:36:53 -0800 From: John.C.Harrison@Intel.com To: Intel-GFX@Lists.FreeDesktop.Org Subject: [PATCH v2 5/5] drm/i915/guc: Add a debug print on GuC triggered reset Date: Tue, 17 Jan 2023 13:36:30 -0800 Message-Id: <20230117213630.2897570-6-John.C.Harrison@Intel.com> X-Mailer: git-send-email 2.39.0 In-Reply-To: <20230117213630.2897570-1-John.C.Harrison@Intel.com> References: <20230117213630.2897570-1-John.C.Harrison@Intel.com> MIME-Version: 1.0 Organization: Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: John Harrison , DRI-Devel@Lists.FreeDesktop.Org, Tvrtko Ursulin Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: John Harrison For understanding bug reports, it can be useful to have an explicit dmesg print when a reset notification is received from GuC. As opposed to simply inferring that this happened from other messages. Signed-off-by: John Harrison Reviewed-by: Tvrtko Ursulin --- drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index 9bc80b807dbcc..6029dbab5feeb 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -4665,6 +4665,10 @@ static void guc_handle_context_reset(struct intel_guc *guc, { trace_intel_context_reset(ce); + drm_dbg(&guc_to_gt(guc)->i915->drm, "Got GuC reset of 0x%04X, exiting = %d, banned = %d\n", + ce->guc_id.id, test_bit(CONTEXT_EXITING, &ce->flags), + test_bit(CONTEXT_BANNED, &ce->flags)); + if (likely(intel_context_is_schedulable(ce))) { capture_error_state(guc, ce); guc_context_replay(ce);