From patchwork Tue Nov 29 21:12:52 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: John Harrison
X-Patchwork-Id: 13059213
From: John.C.Harrison@Intel.com
To: Intel-GFX@Lists.FreeDesktop.Org
Cc: John Harrison, DRI-Devel@Lists.FreeDesktop.Org
Subject: [PATCH 1/2] drm/i915: Allow error capture without a request
Date: Tue, 29 Nov 2022 13:12:52 -0800
Message-Id: <20221129211253.3183480-2-John.C.Harrison@Intel.com>
X-Mailer: git-send-email 2.37.3
In-Reply-To: <20221129211253.3183480-1-John.C.Harrison@Intel.com>
References: <20221129211253.3183480-1-John.C.Harrison@Intel.com>
Organization: Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ

From: John Harrison

There was a report of error captures occurring without any hung
context being indicated, despite the capture being initiated by a
'hung context notification' from GuC. The problem was not
reproducible. However, it can happen if the context in question has
no active requests. For example, if the hang was in the context
switch itself then the breadcrumb write would have occurred and the
KMD would see an idle context.

In the interests of providing as much information as possible about
a hang, it seems wise to include the engine info regardless of
whether a request was found or not, as opposed to just pretending
there was no hang at all.

So update the error capture code to always record engine information
if an engine is given. This means updating record_context() to take
a context instead of a request (which it only ever used to find the
context anyway), and splitting the request-agnostic parts of
intel_engine_coredump_add_request() out into a separate function.

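As a reading aid only, the capture decision described above can be sketched in plain C. This is a simplified model and not driver code: the toy_* types and names are hypothetical stand-ins for the i915 structures, and the request is assumed to have been looked up already.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver structures; not the real i915 types. */
struct toy_context {
	int id;
};

struct toy_request {
	struct toy_context *context;	/* owning context */
	bool started;			/* has the hardware begun executing it? */
};

enum toy_capture {
	TOY_CAPTURE_NONE,		/* nothing to attribute the hang to */
	TOY_CAPTURE_CONTEXT_ONLY,	/* hung context known, no usable request */
	TOY_CAPTURE_REQUEST,		/* request found: context plus batch/user buffers */
};

/*
 * Decide what can be captured for an engine.
 * ce: hung context reported for the engine, or NULL.
 * rq: active request that was found, or NULL.
 */
enum toy_capture toy_capture_decide(const struct toy_context *ce,
				    const struct toy_request *rq)
{
	/* Modelling assumption: a request always belongs to some context. */
	assert(!rq || rq->context);

	/* On the hung-context path, a request that never started is ignored. */
	if (ce && rq && !rq->started)
		rq = NULL;

	if (rq)
		return TOY_CAPTURE_REQUEST;
	if (ce)
		return TOY_CAPTURE_CONTEXT_ONLY;

	return TOY_CAPTURE_NONE;
}
```

The behavioural change in this patch corresponds to the TOY_CAPTURE_CONTEXT_ONLY case: previously that situation was treated the same as TOY_CAPTURE_NONE.
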
Signed-off-by: John Harrison
Reviewed-by: Umesh Nerlige Ramappa
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 55 +++++++++++++++++++--------
 1 file changed, 40 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 9d5d5a397b64e..2ed1c84c9fab4 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1370,14 +1370,14 @@ static void engine_record_execlists(struct intel_engine_coredump *ee)
 }
 
 static bool record_context(struct i915_gem_context_coredump *e,
-			   const struct i915_request *rq)
+			   struct intel_context *ce)
 {
 	struct i915_gem_context *ctx;
 	struct task_struct *task;
 	bool simulated;
 
 	rcu_read_lock();
-	ctx = rcu_dereference(rq->context->gem_context);
+	ctx = rcu_dereference(ce->gem_context);
 	if (ctx && !kref_get_unless_zero(&ctx->ref))
 		ctx = NULL;
 	rcu_read_unlock();
@@ -1396,8 +1396,8 @@ static bool record_context(struct i915_gem_context_coredump *e,
 	e->guilty = atomic_read(&ctx->guilty_count);
 	e->active = atomic_read(&ctx->active_count);
 
-	e->total_runtime = intel_context_get_total_runtime_ns(rq->context);
-	e->avg_runtime = intel_context_get_avg_runtime_ns(rq->context);
+	e->total_runtime = intel_context_get_total_runtime_ns(ce);
+	e->avg_runtime = intel_context_get_avg_runtime_ns(ce);
 
 	simulated = i915_gem_context_no_error_capture(ctx);
 
@@ -1532,15 +1532,37 @@ intel_engine_coredump_alloc(struct intel_engine_cs *engine, gfp_t gfp, u32 dump_
 	return ee;
 }
 
+static struct intel_engine_capture_vma *
+engine_coredump_add_context(struct intel_engine_coredump *ee,
+			    struct intel_context *ce,
+			    gfp_t gfp)
+{
+	struct intel_engine_capture_vma *vma = NULL;
+
+	ee->simulated |= record_context(&ee->context, ce);
+	if (ee->simulated)
+		return NULL;
+
+	/*
+	 * We need to copy these to an anonymous buffer
+	 * as the simplest method to avoid being overwritten
+	 * by userspace.
+	 */
+	vma = capture_vma(vma, ce->ring->vma, "ring", gfp);
+	vma = capture_vma(vma, ce->state, "HW context", gfp);
+
+	return vma;
+}
+
 struct intel_engine_capture_vma *
 intel_engine_coredump_add_request(struct intel_engine_coredump *ee,
 				  struct i915_request *rq,
 				  gfp_t gfp)
 {
-	struct intel_engine_capture_vma *vma = NULL;
+	struct intel_engine_capture_vma *vma;
 
-	ee->simulated |= record_context(&ee->context, rq);
-	if (ee->simulated)
+	vma = engine_coredump_add_context(ee, rq->context, gfp);
+	if (!vma)
 		return NULL;
 
 	/*
@@ -1550,8 +1572,6 @@ intel_engine_coredump_add_request(struct intel_engine_coredump *ee,
 	 */
 	vma = capture_vma_snapshot(vma, rq->batch_res, gfp, "batch");
 	vma = capture_user(vma, rq, gfp);
-	vma = capture_vma(vma, rq->ring->vma, "ring", gfp);
-	vma = capture_vma(vma, rq->context->state, "HW context", gfp);
 
 	ee->rq_head = rq->head;
 	ee->rq_post = rq->postfix;
@@ -1608,8 +1628,11 @@ capture_engine(struct intel_engine_cs *engine,
 	if (ce) {
 		intel_engine_clear_hung_context(engine);
 		rq = intel_context_find_active_request(ce);
-		if (!rq || !i915_request_started(rq))
-			goto no_request_capture;
+		if (rq && !i915_request_started(rq)) {
+			drm_info(&engine->gt->i915->drm, "Got hung context on %s with no active request!\n",
+				 engine->name);
+			rq = NULL;
+		}
 	} else {
 		/*
 		 * Getting here with GuC enabled means it is a forced error capture
@@ -1625,12 +1648,14 @@ capture_engine(struct intel_engine_cs *engine,
 	if (rq)
 		rq = i915_request_get_rcu(rq);
 
-	if (!rq)
-		goto no_request_capture;
+	if (rq)
+		capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
+	else if (ce)
+		capture = engine_coredump_add_context(ee, ce, ATOMIC_MAYFAIL);
 
-	capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
 	if (!capture) {
-		i915_request_put(rq);
+		if (rq)
+			i915_request_put(rq);
 		goto no_request_capture;
 	}
 	if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)

From patchwork Tue Nov 29 21:12:53 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: John Harrison
X-Patchwork-Id: 13059212
From: John.C.Harrison@Intel.com
To: Intel-GFX@Lists.FreeDesktop.Org
Cc: John Harrison, DRI-Devel@Lists.FreeDesktop.Org
Subject: [PATCH 2/2] drm/i915/guc: Look for a guilty context when an engine reset fails
Date: Tue, 29 Nov 2022 13:12:53 -0800
Message-Id: <20221129211253.3183480-3-John.C.Harrison@Intel.com>
X-Mailer: git-send-email 2.37.3
In-Reply-To: <20221129211253.3183480-1-John.C.Harrison@Intel.com>
References: <20221129211253.3183480-1-John.C.Harrison@Intel.com>
Organization: Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ

From: John Harrison

Engine resets are supposed to never fail. But in the case when one
does (due to unknown reasons that normally come down to a missing
w/a), it is useful to get as much information out of the system as
possible. Given that the GuC effectively dies in such a situation,
it is not possible to get a guilty context notification back. So do
a manual search instead. Given that GuC is dead, this is safe
because GuC won't be changing the engine state asynchronously.

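For illustration, the manual search amounts to visiting every engine whose bit is set in the failed-reset mask. The following is a minimal standalone sketch of that iteration, assuming a simple bit-per-engine mask; the toy_* names are hypothetical and the open-coded loop only stands in for the driver's for_each_engine_masked() and intel_guc_find_hung_context().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_ENGINES 32

/* Hypothetical stand-in for an engine; not the real struct intel_engine_cs. */
struct toy_engine {
	bool searched;	/* set once the guilty-context search has run */
};

/* Hypothetical stand-in for intel_guc_find_hung_context(). */
void toy_find_hung_context(struct toy_engine *engine)
{
	engine->searched = true;
}

/*
 * Run the search on every engine whose bit is set in reset_fail_mask.
 * Returns the number of engines that were searched.
 */
unsigned int toy_search_failed_engines(struct toy_engine *engines,
				       unsigned int count,
				       uint32_t reset_fail_mask)
{
	unsigned int id, searched = 0;

	/* One mask bit per engine, so the mask width bounds the engine count. */
	assert(count <= TOY_MAX_ENGINES);

	for (id = 0; id < count; id++) {
		if (!(reset_fail_mask & (UINT32_C(1) << id)))
			continue;	/* this engine did not fail its reset */

		toy_find_hung_context(&engines[id]);
		searched++;
	}

	return searched;
}
```

With four engines and a mask of 0x5, only engines 0 and 2 are searched. In the patch itself the search runs before intel_gt_handle_error(), so a hung context is already marked by the time the error capture looks for one.
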
Signed-off-by: John Harrison
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 0a42f1807f52c..c82730804a1c4 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4751,11 +4751,24 @@ static void reset_fail_worker_func(struct work_struct *w)
 	guc->submission_state.reset_fail_mask = 0;
 	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
 
-	if (likely(reset_fail_mask))
+	if (likely(reset_fail_mask)) {
+		struct intel_engine_cs *engine;
+		enum intel_engine_id id;
+
+		/*
+		 * GuC is toast at this point - it dead loops after sending the failed
+		 * reset notification. So need to manually determine the guilty context.
+		 * Note that it should be safe/reliable to do this here because the GuC
+		 * is toast and will not be scheduling behind the KMD's back.
+		 */
+		for_each_engine_masked(engine, gt, reset_fail_mask, id)
+			intel_guc_find_hung_context(engine);
+
 		intel_gt_handle_error(gt, reset_fail_mask,
 				      I915_ERROR_CAPTURE,
 				      "GuC failed to reset engine mask=0x%x\n",
 				      reset_fail_mask);
+	}
 }
 
 int intel_guc_engine_failure_process_msg(struct intel_guc *guc,