From patchwork Mon Apr 10 19:25:22 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Harrison X-Patchwork-Id: 13206666 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id D7FC0C77B61 for ; Mon, 10 Apr 2023 19:25:57 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id BC5EB10E0E9; Mon, 10 Apr 2023 19:25:51 +0000 (UTC) Received: from mga17.intel.com (mga17.intel.com [192.55.52.151]) by gabe.freedesktop.org (Postfix) with ESMTPS id BC4DD10E0E9; Mon, 10 Apr 2023 19:25:48 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1681154748; x=1712690748; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=chvrDkpzYnMq+VYPjmz27fYL85B2c1Um8Sp8H6bMVhg=; b=UXSU7nS6OTxeRjpJ0mNYIPHko00eq1i+0vOL3FDujyTisogcOkgOH4xq gYTopgIzgm3QU75n77jwodv7/+E9nvDpxVvHxht5/NXJHF/lX999I1vcy jABMksUTd8hzt0rT2lO+OMBGad+F3fXNH48+stKxCXE0C4R25Htn4SO+V ck4XKSo6/AuWIxInie3yHwgQN0zmYzIDmCkH6ycV9udgPVPTZZl2a4VdC ZA5vlm48iSIiNypUWy+324dMvclJmD/XXBNyps2q+yJci6JNPYsOkTnsT 3SHQqmOuuzw0nG0hhjjvqKa1XiWJymcDNp1VriHnoBjv0HBMEw2LeSxBT g==; X-IronPort-AV: E=McAfee;i="6600,9927,10676"; a="323797875" X-IronPort-AV: E=Sophos;i="5.98,333,1673942400"; d="scan'208";a="323797875" Received: from orsmga003.jf.intel.com ([10.7.209.27]) by fmsmga107.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Apr 2023 12:25:47 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10676"; a="638554244" X-IronPort-AV: E=Sophos;i="5.98,333,1673942400"; d="scan'208";a="638554244" Received: from relo-linux-5.jf.intel.com ([10.165.21.152]) by orsmga003.jf.intel.com with ESMTP; 10 Apr 2023 12:25:43 -0700 From: John.C.Harrison@Intel.com To: Intel-GFX@Lists.FreeDesktop.Org Subject: [PATCH 1/2] drm/i915: Dump error capture to kernel log Date: Mon, 10 Apr 2023 12:25:22 -0700 Message-Id: <20230410192523.2020049-2-John.C.Harrison@Intel.com> X-Mailer: git-send-email 2.39.1 In-Reply-To: <20230410192523.2020049-1-John.C.Harrison@Intel.com> References: <20230410192523.2020049-1-John.C.Harrison@Intel.com> MIME-Version: 1.0 Organization: Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: John Harrison , DRI-Devel@Lists.FreeDesktop.Org Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: John Harrison This is useful for getting debug information out in certain situations, such as failing kernel selftests and CI runs that don't log error captures. It is especially useful for things like retrieving GuC logs as GuC operation can't be tracked by adding printk or ftrace entries. Signed-off-by: John Harrison --- drivers/gpu/drm/i915/i915_gpu_error.c | 130 ++++++++++++++++++++++++++ drivers/gpu/drm/i915/i915_gpu_error.h | 8 ++ 2 files changed, 138 insertions(+) diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index f020c0086fbcd..500fec34188a0 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -2219,3 +2219,133 @@ void i915_disable_error_state(struct drm_i915_private *i915, int err) i915->gpu_error.first_error = ERR_PTR(err); spin_unlock_irq(&i915->gpu_error.lock); } + +void intel_klog_error_capture(struct intel_gt *gt, + intel_engine_mask_t engine_mask) +{ + static int g_count; + struct drm_i915_private *i915 = gt->i915; + struct i915_gpu_coredump *error; + intel_wakeref_t wakeref; + size_t buf_size = PAGE_SIZE * 128; + size_t pos_err; + char *buf, *ptr, *next; + int l_count = g_count++; + int line = 0; + + /* Can't allocate memory during a reset */ + if (test_bit(I915_RESET_BACKOFF, >->reset.flags)) { + drm_err(>->i915->drm, "[Capture/%d.%d] Inside GT reset, skipping error capture :(\n", + l_count, line++); + return; + } + + error = READ_ONCE(i915->gpu_error.first_error); + if (error) { + drm_err(&i915->drm, "[Capture/%d.%d] Clearing existing error capture first...\n", + l_count, line++); + i915_reset_error_state(i915); + } + + with_intel_runtime_pm(&i915->runtime_pm, wakeref) + error = i915_gpu_coredump(gt, engine_mask, CORE_DUMP_FLAG_NONE); + + if (IS_ERR(error)) { + drm_err(&i915->drm, "[Capture/%d.%d] Failed to capture error capture: %ld!\n", + l_count, line++, PTR_ERR(error)); + return; + } + + buf = kvmalloc(buf_size, GFP_KERNEL); + if (!buf) { + drm_err(&i915->drm, "[Capture/%d.%d] Failed to allocate buffer for error capture!\n", + l_count, line++); + i915_gpu_coredump_put(error); + return; + } + + drm_info(&i915->drm, "[Capture/%d.%d] Dumping i915 error capture for %ps...\n", + l_count, line++, __builtin_return_address(0)); + + /* Largest string length safe to print via dmesg */ +# define MAX_CHUNK 800 + + pos_err = 0; + while (1) { + ssize_t got = i915_gpu_coredump_copy_to_buffer(error, buf, pos_err, buf_size - 1); + + if (got <= 0) + break; + + buf[got] = 0; + pos_err += got; + + ptr = buf; + while (got > 0) { + size_t count; + char tag[2]; + + next = strnchr(ptr, got, '\n'); + if (next) { + count = next - ptr; + *next = 0; + tag[0] = '>'; + tag[1] = '<'; + } else { + count = got; + tag[0] = '}'; + tag[1] = '{'; + } + + if (count > MAX_CHUNK) { + size_t pos; + char *ptr2 = ptr; + + for (pos = MAX_CHUNK; pos < count; pos += MAX_CHUNK) { + char chr = ptr[pos]; + + ptr[pos] = 0; + drm_info(&i915->drm, "[Capture/%d.%d] }%s{\n", + l_count, line++, ptr2); + ptr[pos] = chr; + ptr2 = ptr + pos; + + /* + * If spewing large amounts of data via a serial console, + * this can be a very slow process. So be friendly and try + * not to cause 'softlockup on CPU' problems. + */ + cond_resched(); + } + + if (ptr2 < (ptr + count)) + drm_info(&i915->drm, "[Capture/%d.%d] %c%s%c\n", + l_count, line++, tag[0], ptr2, tag[1]); + else if (tag[0] == '>') + drm_info(&i915->drm, "[Capture/%d.%d] ><\n", + l_count, line++); + } else { + drm_info(&i915->drm, "[Capture/%d.%d] %c%s%c\n", + l_count, line++, tag[0], ptr, tag[1]); + } + + ptr = next; + got -= count; + if (next) { + ptr++; + got--; + } + + /* As above. */ + cond_resched(); + } + + if (got) + drm_info(&i915->drm, "[Capture/%d.%d] Got %zd bytes remaining!\n", + l_count, line++, got); + } + + kvfree(buf); + + drm_info(&i915->drm, "[Capture/%d.%d] Dumped %zd bytes\n", l_count, line++, pos_err); +} diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h index a91932cc65317..b14c2c9915141 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.h +++ b/drivers/gpu/drm/i915/i915_gpu_error.h @@ -260,6 +260,9 @@ static inline u32 i915_reset_engine_count(struct i915_gpu_error *error, #if IS_ENABLED(CONFIG_DRM_I915_CAPTURE_ERROR) +void intel_klog_error_capture(struct intel_gt *gt, + intel_engine_mask_t engine_mask); + __printf(2, 3) void i915_error_printf(struct drm_i915_error_state_buf *e, const char *f, ...); void intel_gpu_error_print_vma(struct drm_i915_error_state_buf *m, @@ -323,6 +326,11 @@ void i915_disable_error_state(struct drm_i915_private *i915, int err); #else +static inline void intel_klog_error_capture(struct intel_gt *gt, + intel_engine_mask_t engine_mask) +{ +} + __printf(2, 3) static inline void i915_error_printf(struct drm_i915_error_state_buf *e, const char *f, ...) From patchwork Mon Apr 10 19:25:23 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Harrison X-Patchwork-Id: 13206665 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 39BD4C76196 for ; Mon, 10 Apr 2023 19:25:55 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 7A5AC10E109; Mon, 10 Apr 2023 19:25:50 +0000 (UTC) Received: from mga17.intel.com (mga17.intel.com [192.55.52.151]) by gabe.freedesktop.org (Postfix) with ESMTPS id DD5BE10E109; Mon, 10 Apr 2023 19:25:48 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1681154748; x=1712690748; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=s1BH6tajXubBdYRQWHtoogHpCiosKMkweNoPqwn/+Jo=; b=AFfuGcEoHcleRsSDmyAgVxBzmJIhzaRLBuIDUrwsojMPftUzLVJ3K9SK TXODNNjc0fLEA6z3mTFRm664x/cnP++qda4iFs1wCSiylTY6lqNNtnwy6 tY1DLQ0blOs4EICDsRJbSz5j07LKwUhdKyng6MNAXVUWbPugxml1psE3u kwg2HYTYWblXEB5Ba/tknnQRhULVHpjgSqBWwXpJbh41o+BqOp27LI53D +MbUKEopGcLHGc4EO+rr2NIQxWM/AAMtKHCnR/77qvi1Z9QjbEKZzs8jk 4yrWmjgZ1YcDWV31mpIBgyD8WmhbSsQTGRwNL+fP+8KXTL0iVl8UHPj4q w==; X-IronPort-AV: E=McAfee;i="6600,9927,10676"; a="323797883" X-IronPort-AV: E=Sophos;i="5.98,333,1673942400"; d="scan'208";a="323797883" Received: from orsmga003.jf.intel.com ([10.7.209.27]) by fmsmga107.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Apr 2023 12:25:47 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10676"; a="638554248" X-IronPort-AV: E=Sophos;i="5.98,333,1673942400"; d="scan'208";a="638554248" Received: from relo-linux-5.jf.intel.com ([10.165.21.152]) by orsmga003.jf.intel.com with ESMTP; 10 Apr 2023 12:25:43 -0700 From: John.C.Harrison@Intel.com To: Intel-GFX@Lists.FreeDesktop.Org Subject: [PATCH 2/2] drm/i915/guc: Dump error capture to dmesg on CTB error Date: Mon, 10 Apr 2023 12:25:23 -0700 Message-Id: <20230410192523.2020049-3-John.C.Harrison@Intel.com> X-Mailer: git-send-email 2.39.1 In-Reply-To: <20230410192523.2020049-1-John.C.Harrison@Intel.com> References: <20230410192523.2020049-1-John.C.Harrison@Intel.com> MIME-Version: 1.0 Organization: Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: John Harrison , DRI-Devel@Lists.FreeDesktop.Org Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: John Harrison In the past, There have been sporadic CTB failures which proved hard to reproduce manually. The most effective solution was to dump the GuC log at the point of failure and let the CI system do the repro. It is preferable not to dump the GuC log via dmesg for all issues as it is not always necessary and is not helpful for end users. But rather than trying to re-invent the code to do this each time it is wanted, commit the code but for DEBUG_GUC builds only. Signed-off-by: John Harrison --- drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c | 53 +++++++++++++++++++++++ drivers/gpu/drm/i915/gt/uc/intel_guc_ct.h | 6 +++ 2 files changed, 59 insertions(+) diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c index 1803a633ed648..66a1818a3485f 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.c @@ -13,6 +13,30 @@ #include "intel_guc_ct.h" #include "intel_guc_print.h" +#if defined(CONFIG_DRM_I915_DEBUG_GUC) +enum { + CT_DEAD_ALIVE = 0, + CT_DEAD_SETUP, + CT_DEAD_WRITE, + CT_DEAD_DEADLOCK, + CT_DEAD_H2G_HAS_ROOM, + CT_DEAD_READ, + CT_DEAD_PROCESS_FAILED, +}; + +static void ct_dead_ct_worker_func(struct work_struct *w); + +#define CT_DEAD(ct, reason) \ + do { \ + if (!(ct)->dead_ct_reported) { \ + (ct)->dead_ct_reason |= 1 << CT_DEAD_##reason; \ + queue_work(system_unbound_wq, &(ct)->dead_ct_worker); \ + } \ + } while (0) +#else +#define CT_DEAD(ct, reason) do { } while (0) +#endif + static inline struct intel_guc *ct_to_guc(struct intel_guc_ct *ct) { return container_of(ct, struct intel_guc, ct); @@ -93,6 +117,9 @@ void intel_guc_ct_init_early(struct intel_guc_ct *ct) spin_lock_init(&ct->requests.lock); INIT_LIST_HEAD(&ct->requests.pending); INIT_LIST_HEAD(&ct->requests.incoming); +#if defined(CONFIG_DRM_I915_DEBUG_GUC) + INIT_WORK(&ct->dead_ct_worker, ct_dead_ct_worker_func); +#endif INIT_WORK(&ct->requests.worker, ct_incoming_request_worker_func); tasklet_setup(&ct->receive_tasklet, ct_receive_tasklet_func); init_waitqueue_head(&ct->wq); @@ -319,11 +346,16 @@ int intel_guc_ct_enable(struct intel_guc_ct *ct) ct->enabled = true; ct->stall_time = KTIME_MAX; +#if defined(CONFIG_DRM_I915_DEBUG_GUC) + ct->dead_ct_reported = false; + ct->dead_ct_reason = CT_DEAD_ALIVE; +#endif return 0; err_out: CT_PROBE_ERROR(ct, "Failed to enable CTB (%pe)\n", ERR_PTR(err)); + CT_DEAD(ct, SETUP); return err; } @@ -434,6 +466,7 @@ static int ct_write(struct intel_guc_ct *ct, corrupted: CT_ERROR(ct, "Corrupted descriptor head=%u tail=%u status=%#x\n", desc->head, desc->tail, desc->status); + CT_DEAD(ct, WRITE); ctb->broken = true; return -EPIPE; } @@ -504,6 +537,7 @@ static inline bool ct_deadlocked(struct intel_guc_ct *ct) CT_ERROR(ct, "Head: %u\n (Dwords)", ct->ctbs.recv.desc->head); CT_ERROR(ct, "Tail: %u\n (Dwords)", ct->ctbs.recv.desc->tail); + CT_DEAD(ct, DEADLOCK); ct->ctbs.send.broken = true; } @@ -552,6 +586,7 @@ static inline bool h2g_has_room(struct intel_guc_ct *ct, u32 len_dw) head, ctb->size); desc->status |= GUC_CTB_STATUS_OVERFLOW; ctb->broken = true; + CT_DEAD(ct, H2G_HAS_ROOM); return false; } @@ -908,6 +943,7 @@ static int ct_read(struct intel_guc_ct *ct, struct ct_incoming_msg **msg) CT_ERROR(ct, "Corrupted descriptor head=%u tail=%u status=%#x\n", desc->head, desc->tail, desc->status); ctb->broken = true; + CT_DEAD(ct, READ); return -EPIPE; } @@ -1057,6 +1093,7 @@ static bool ct_process_incoming_requests(struct intel_guc_ct *ct) if (unlikely(err)) { CT_ERROR(ct, "Failed to process CT message (%pe) %*ph\n", ERR_PTR(err), 4 * request->size, request->msg); + CT_DEAD(ct, PROCESS_FAILED); ct_free_msg(request); } @@ -1233,3 +1270,19 @@ void intel_guc_ct_print_info(struct intel_guc_ct *ct, drm_printf(p, "Tail: %u\n", ct->ctbs.recv.desc->tail); } + +#if defined(CONFIG_DRM_I915_DEBUG_GUC) +static void ct_dead_ct_worker_func(struct work_struct *w) +{ + struct intel_guc_ct *ct = container_of(w, struct intel_guc_ct, dead_ct_worker); + struct intel_guc *guc = ct_to_guc(ct); + + if (ct->dead_ct_reported) + return; + + ct->dead_ct_reported = true; + + guc_info(guc, "CTB is dead - reason=0x%X\n", ct->dead_ct_reason); + intel_klog_error_capture(guc_to_gt(guc), (intel_engine_mask_t)~0U); +} +#endif diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.h index f709a19c7e214..d111d61449154 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.h +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_ct.h @@ -85,6 +85,12 @@ struct intel_guc_ct { /** @stall_time: time of first time a CTB submission is stalled */ ktime_t stall_time; + +#if defined(CONFIG_DRM_I915_DEBUG_GUC) + int dead_ct_reason; + bool dead_ct_reported; + struct work_struct dead_ct_worker; +#endif }; void intel_guc_ct_init_early(struct intel_guc_ct *ct);