From patchwork Fri Feb 14 20:37:54 2025
From: Jonathan Cavitt <jonathan.cavitt@intel.com>
To: intel-xe@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org, jonathan.cavitt@intel.com,
 saurabhg.gupta@intel.com, alex.zuo@intel.com, joonas.lahtinen@intel.com,
 lucas.demarchi@intel.com, matthew.brost@intel.com
Subject: [PATCH 1/4] drm/xe/xe_exec_queue: Add ID param to exec queue struct
Date: Fri, 14 Feb 2025 20:37:54 +0000
Message-ID: <20250214203757.27895-2-jonathan.cavitt@intel.com>
In-Reply-To: <20250214203757.27895-1-jonathan.cavitt@intel.com>

Add the exec queue ID to the exec queue struct. This is useful for
performing a reverse lookup into the xef->exec_queue xarray.
Signed-off-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
---
 drivers/gpu/drm/xe/xe_exec_queue.c       | 1 +
 drivers/gpu/drm/xe/xe_exec_queue_types.h | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
index 6051db78d706..a02e62465e01 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.c
+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
@@ -711,6 +711,7 @@ int xe_exec_queue_create_ioctl(struct drm_device *dev, void *data,
 	if (err)
 		goto kill_exec_queue;
 
+	q->id = id;
 	args->exec_queue_id = id;
 
 	return 0;
diff --git a/drivers/gpu/drm/xe/xe_exec_queue_types.h b/drivers/gpu/drm/xe/xe_exec_queue_types.h
index 6eb7ff091534..088d838218e9 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue_types.h
+++ b/drivers/gpu/drm/xe/xe_exec_queue_types.h
@@ -55,6 +55,8 @@ struct xe_exec_queue {
 	struct xe_vm *vm;
 	/** @class: class of this exec queue */
 	enum xe_engine_class class;
+	/** @id: exec queue ID as reported during create ioctl */
+	u32 id;
 	/**
 	 * @logical_mask: logical mask of where job submitted to exec queue can run
 	 */

From patchwork Fri Feb 14 20:37:55 2025
From: Jonathan Cavitt <jonathan.cavitt@intel.com>
To: intel-xe@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org, jonathan.cavitt@intel.com,
 saurabhg.gupta@intel.com, alex.zuo@intel.com, joonas.lahtinen@intel.com,
 lucas.demarchi@intel.com, matthew.brost@intel.com
Subject: [PATCH 2/4] drm/xe/xe_gt_pagefault: Migrate pagefault struct to header
Date: Fri, 14 Feb 2025 20:37:55 +0000
Message-ID: <20250214203757.27895-3-jonathan.cavitt@intel.com>
In-Reply-To: <20250214203757.27895-1-jonathan.cavitt@intel.com>
Migrate the pagefault struct from xe_gt_pagefault.c to the
xe_gt_pagefault.h header file, along with the associated enum values.
Additionally, add string tables for those enum values and helper
functions that translate an enum value to its string counterpart. The
string forms will be useful for debugging later.

Signed-off-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
---
 drivers/gpu/drm/xe/xe_gt_pagefault.c | 27 ---------------
 drivers/gpu/drm/xe/xe_gt_pagefault.h | 51 ++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
index 46701ca11ce0..fe18e3ec488a 100644
--- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
@@ -22,33 +22,6 @@
 #include "xe_trace_bo.h"
 #include "xe_vm.h"
 
-struct pagefault {
-	u64 page_addr;
-	u32 asid;
-	u16 pdata;
-	u8 vfid;
-	u8 access_type;
-	u8 fault_type;
-	u8 fault_level;
-	u8 engine_class;
-	u8 engine_instance;
-	u8 fault_unsuccessful;
-	bool trva_fault;
-};
-
-enum access_type {
-	ACCESS_TYPE_READ = 0,
-	ACCESS_TYPE_WRITE = 1,
-	ACCESS_TYPE_ATOMIC = 2,
-	ACCESS_TYPE_RESERVED = 3,
-};
-
-enum fault_type {
-	NOT_PRESENT = 0,
-	WRITE_ACCESS_VIOLATION = 1,
-	ATOMIC_ACCESS_VIOLATION = 2,
-};
-
 struct acc {
 	u64 va_range_base;
 	u32 asid;
diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.h b/drivers/gpu/drm/xe/xe_gt_pagefault.h
index 839c065a5e4c..d502fdb5b68c 100644
--- a/drivers/gpu/drm/xe/xe_gt_pagefault.h
+++ b/drivers/gpu/drm/xe/xe_gt_pagefault.h
@@ -11,6 +11,57 @@
 struct xe_gt;
 struct xe_guc;
 
+struct pagefault {
+	u64 page_addr;
+	u32 asid;
+	u16 pdata;
+	u8 vfid;
+	u8 access_type;
+	u8 fault_type;
+	u8 fault_level;
+	u8 engine_class;
+	u8 engine_instance;
+	u8 fault_unsuccessful;
+	bool prefetch;
+	bool trva_fault;
+};
+
+enum access_type {
+	ACCESS_TYPE_READ = 0,
+	ACCESS_TYPE_WRITE = 1,
+	ACCESS_TYPE_ATOMIC = 2,
+	ACCESS_TYPE_RESERVED = 3,
+};
+
+enum fault_type {
+	NOT_PRESENT = 0,
+	WRITE_ACCESS_VIOLATION = 1,
+	ATOMIC_ACCESS_VIOLATION = 2,
+};
+
+static char *access_type_str[] = {
+	"ACCESS_TYPE_READ",
+	"ACCESS_TYPE_WRITE",
+	"ACCESS_TYPE_ATOMIC",
+	"ACCESS_TYPE_RESERVED",
+};
+
+static char *fault_type_str[] = {
+	"NOT_PRESENT",
+	"WRITE_ACCESS_VIOLATION",
+	"ATOMIC_ACCESS_VIOLATION",
+};
+
+static inline char *access_type_to_str(enum access_type a)
+{
+	return access_type_str[a];
+}
+
+static inline char *fault_type_to_str(enum fault_type f)
+{
+	return fault_type_str[f];
+}
+
 int xe_gt_pagefault_init(struct xe_gt *gt);
 void xe_gt_pagefault_reset(struct xe_gt *gt);
 int xe_guc_pagefault_handler(struct xe_guc *guc, u32 *msg, u32 len);

From patchwork Fri Feb 14 20:37:56 2025
From: Jonathan Cavitt <jonathan.cavitt@intel.com>
To: intel-xe@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org, jonathan.cavitt@intel.com,
 saurabhg.gupta@intel.com, alex.zuo@intel.com, joonas.lahtinen@intel.com,
 lucas.demarchi@intel.com, matthew.brost@intel.com
Subject: [PATCH 3/4] FIXME: drm/xe/xe_drm_client: Add per drm client pagefault info
Date: Fri, 14 Feb 2025 20:37:56 +0000
Message-ID: <20250214203757.27895-4-jonathan.cavitt@intel.com>
In-Reply-To: <20250214203757.27895-1-jonathan.cavitt@intel.com>

Add additional information to the drm client so it can report the last
50 exec queues that have been banned on it, along with the last
pagefault seen at the time each exec queue was banned. Since we cannot
reasonably associate a pagefault with a specific exec queue, we
currently report the last pagefault seen on the associated hw engine
instead. The last pagefault seen is saved to the hw engine and updated
during the pagefault handling process in xe_gt_pagefault. It is cleared
when the engine is reset, because any future exec queue bans were likely
not caused by that pagefault once the engine has been reset.
v2: Remove exec queue from blame list on destroy and recreate (Joonas)

Signed-off-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
---
 drivers/gpu/drm/xe/xe_drm_client.c      | 128 ++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_drm_client.h      |  36 +++++++
 drivers/gpu/drm/xe/xe_exec_queue.c      |   7 ++
 drivers/gpu/drm/xe/xe_gt_pagefault.c    |  19 ++++
 drivers/gpu/drm/xe/xe_guc_submit.c      |  17 ++++
 drivers/gpu/drm/xe/xe_hw_engine.c       |   4 +
 drivers/gpu/drm/xe/xe_hw_engine_types.h |   8 ++
 7 files changed, 219 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c
index 2d4874d2b922..f15560d0b6ff 100644
--- a/drivers/gpu/drm/xe/xe_drm_client.c
+++ b/drivers/gpu/drm/xe/xe_drm_client.c
@@ -17,6 +17,7 @@
 #include "xe_exec_queue.h"
 #include "xe_force_wake.h"
 #include "xe_gt.h"
+#include "xe_gt_pagefault.h"
 #include "xe_hw_engine.h"
 #include "xe_pm.h"
 #include "xe_trace.h"
@@ -70,6 +71,21 @@
  * drm-total-cycles-ccs: 7655183225
  * drm-engine-capacity-ccs: 4
  *
+ * - Exec queue ban list -
+ *
+ * Exec queue 1 banned:
+ * 	Associated pagefault:
+ * 	ASID: 9
+ * 	VFID: 0
+ * 	PDATA: 0x0450
+ * 	Faulted Address: 0x000001fff86a9000
+ * 	FaultType: NOT_PRESENT
+ * 	AccessType: ACCESS_TYPE_WRITE
+ * 	FaultLevel: 0
+ * 	EngineClass: 1 vcs
+ * 	EngineInstance: 0
+ *
 * Possible `drm-cycles-` key names are: `rcs`, `ccs`, `bcs`, `vcs`, `vecs` and
 * "other".
 */
@@ -97,6 +113,8 @@ struct xe_drm_client *xe_drm_client_alloc(void)
 #ifdef CONFIG_PROC_FS
 	spin_lock_init(&client->bos_lock);
 	INIT_LIST_HEAD(&client->bos_list);
+	spin_lock_init(&client->blame_lock);
+	INIT_LIST_HEAD(&client->blame_list);
 #endif
 	return client;
 }
@@ -164,6 +182,72 @@ void xe_drm_client_remove_bo(struct xe_bo *bo)
 	xe_drm_client_put(client);
 }
 
+static void free_blame(struct blame *b)
+{
+	list_del(&b->list);
+	kfree(b->pf);
+	kfree(b);
+}
+
+void xe_drm_client_add_blame(struct xe_drm_client *client,
+			     struct xe_exec_queue *q)
+{
+	struct blame *b = NULL;
+	struct list_head *h;
+	struct pagefault *pf = NULL;
+	struct xe_file *xef = q->xef;
+	struct xe_hw_engine *hwe = q->hwe;
+	unsigned long count;
+
+	b = kzalloc(sizeof(struct blame), GFP_KERNEL);
+	xe_assert(xef->xe, b);
+
+	spin_lock(&client->blame_lock);
+	list_add_tail(&b->list, &client->blame_list);
+	/**
+	 * Limit the number of blames in the blame list to prevent memory
+	 * overuse.
+	 *
+	 * TODO: Parameterize max blame list size.
+	 */
+	count = 0;
+	list_for_each(h, &client->blame_list)
+		count++;
+	if (count >= 50) {
+		struct blame *rem = list_first_entry(&client->blame_list,
+						     struct blame, list);
+		free_blame(rem);
+	}
+	spin_unlock(&client->blame_lock);
+
+	/**
+	 * Duplicate pagefault on engine to blame, if one may have caused
+	 * the exec queue to be banned.
+	 */
+	b->pf = NULL;
+	spin_lock(&hwe->pf.lock);
+	if (hwe->pf.info) {
+		pf = kzalloc(sizeof(struct pagefault), GFP_KERNEL);
+		memcpy(pf, hwe->pf.info, sizeof(struct pagefault));
+	}
+	spin_unlock(&hwe->pf.lock);
+
+	/** Save blame data to list element */
+	b->exec_queue_id = q->id;
+	b->pf = pf;
+}
+
+void xe_drm_client_remove_blame(struct xe_drm_client *client,
+				struct xe_exec_queue *q)
+{
+	struct blame *b, *tmp;
+
+	spin_lock(&client->blame_lock);
+	list_for_each_entry_safe(b, tmp, &client->blame_list, list)
+		if (b->exec_queue_id == q->id)
+			free_blame(b);
+	spin_unlock(&client->blame_lock);
+}
+
 static void bo_meminfo(struct xe_bo *bo,
 		       struct drm_memory_stats stats[TTM_NUM_MEM_TYPES])
 {
@@ -380,6 +464,49 @@ static void show_run_ticks(struct drm_printer *p, struct drm_file *file)
 	}
 }
 
+static void print_pagefault(struct drm_printer *p, struct pagefault *pf)
+{
+	drm_printf(p, "\n\t\tASID: %d\n"
+		      "\t\tVFID: %d\n"
+		      "\t\tPDATA: 0x%04x\n"
+		      "\t\tFaulted Address: 0x%08x%08x\n"
+		      "\t\tFaultType: %s\n"
+		      "\t\tAccessType: %s\n"
+		      "\t\tFaultLevel: %d\n"
+		      "\t\tEngineClass: %d %s\n"
+		      "\t\tEngineInstance: %d\n",
+		   pf->asid, pf->vfid, pf->pdata, upper_32_bits(pf->page_addr),
+		   lower_32_bits(pf->page_addr),
+		   fault_type_to_str(pf->fault_type),
+		   access_type_to_str(pf->access_type),
+		   pf->fault_level, pf->engine_class,
+		   xe_hw_engine_class_to_str(pf->engine_class),
+		   pf->engine_instance);
+}
+
+static void show_blames(struct drm_printer *p, struct drm_file *file)
+{
+	struct xe_file *xef = file->driver_priv;
+	struct xe_drm_client *client;
+	struct blame *b;
+
+	client = xef->client;
+
+	drm_printf(p, "\n");
+	drm_printf(p, "- Exec queue ban list -\n");
+	spin_lock(&client->blame_lock);
+	list_for_each_entry(b, &client->blame_list, list) {
+		struct pagefault *pf = b->pf;
+
+		drm_printf(p, "\n\tExec queue %u banned:\n", b->exec_queue_id);
+		drm_printf(p, "\t\tAssociated pagefault:\n");
+		if (pf)
+			print_pagefault(p, pf);
+		else
+			drm_printf(p, "\t\t- No associated pagefault -\n");
+	}
+	spin_unlock(&client->blame_lock);
+}
+
 /**
  * xe_drm_client_fdinfo() - Callback for fdinfo interface
  * @p: The drm_printer ptr
@@ -394,5 +521,6 @@
 void xe_drm_client_fdinfo(struct drm_printer *p, struct drm_file *file)
 {
 	show_meminfo(p, file);
 	show_run_ticks(p, file);
+	show_blames(p, file);
 }
 #endif
diff --git a/drivers/gpu/drm/xe/xe_drm_client.h b/drivers/gpu/drm/xe/xe_drm_client.h
index a9649aa36011..d21fd0b90742 100644
--- a/drivers/gpu/drm/xe/xe_drm_client.h
+++ b/drivers/gpu/drm/xe/xe_drm_client.h
@@ -15,7 +15,18 @@
 struct drm_file;
 struct drm_printer;
+struct pagefault;
 struct xe_bo;
+struct xe_exec_queue;
+
+struct blame {
+	/** @exec_queue_id: ID number of banned exec queue */
+	u32 exec_queue_id;
+	/** @pf: pagefault on engine of banned exec queue, if any at time */
+	struct pagefault *pf;
+	/** @list: link into @xe_drm_client.blame_list */
+	struct list_head list;
+};
 
 struct xe_drm_client {
 	struct kref kref;
@@ -31,6 +42,17 @@ struct xe_drm_client {
 	 * Protected by @bos_lock.
 	 */
 	struct list_head bos_list;
+	/**
+	 * @blame_lock: lock protecting @blame_list
+	 */
+	spinlock_t blame_lock;
+	/**
+	 * @blame_list: list of banned exec queues associated with this drm
+	 * client, as well as any pagefaults at time of ban.
+	 *
+	 * Protected by @blame_lock.
+	 */
+	struct list_head blame_list;
 #endif
 };
@@ -57,6 +79,10 @@ void xe_drm_client_fdinfo(struct drm_printer *p, struct drm_file *file);
 void xe_drm_client_add_bo(struct xe_drm_client *client,
 			  struct xe_bo *bo);
 void xe_drm_client_remove_bo(struct xe_bo *bo);
+void xe_drm_client_add_blame(struct xe_drm_client *client,
+			     struct xe_exec_queue *q);
+void xe_drm_client_remove_blame(struct xe_drm_client *client,
+				struct xe_exec_queue *q);
 #else
 static inline void xe_drm_client_add_bo(struct xe_drm_client *client,
 					struct xe_bo *bo)
@@ -66,5 +92,15 @@
 static inline void xe_drm_client_remove_bo(struct xe_bo *bo)
 {
 }
+
+static inline void xe_drm_client_add_blame(struct xe_drm_client *client,
+					   struct xe_exec_queue *q)
+{
+}
+
+static inline void xe_drm_client_remove_blame(struct xe_drm_client *client,
+					      struct xe_exec_queue *q)
+{
+}
 #endif
 #endif
diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
index a02e62465e01..9c9bc617020c 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.c
+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
@@ -13,6 +13,7 @@
 #include
 
 #include "xe_device.h"
+#include "xe_drm_client.h"
 #include "xe_gt.h"
 #include "xe_hw_engine_class_sysfs.h"
 #include "xe_hw_engine_group.h"
@@ -714,6 +715,12 @@ int xe_exec_queue_create_ioctl(struct drm_device *dev, void *data,
 	q->id = id;
 	args->exec_queue_id = id;
 
+	/**
+	 * If an exec queue in the blame list shares the same exec queue
+	 * ID, remove it from the blame list to avoid confusion.
+	 */
+	xe_drm_client_remove_blame(q->xef->client, q);
+
 	return 0;
 
 kill_exec_queue:
diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
index fe18e3ec488a..a0e6f2281e37 100644
--- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
@@ -330,6 +330,23 @@ int xe_guc_pagefault_handler(struct xe_guc *guc, u32 *msg, u32 len)
 	return full ? -ENOSPC : 0;
 }
 
+static void save_pagefault_to_engine(struct xe_gt *gt, struct pagefault *pf)
+{
+	struct xe_hw_engine *hwe;
+
+	hwe = xe_gt_hw_engine(gt, pf->engine_class, pf->engine_instance, false);
+	if (hwe) {
+		spin_lock(&hwe->pf.lock);
+		/** The latest pagefault is pf, so remove old pf info from engine */
+		if (hwe->pf.info)
+			kfree(hwe->pf.info);
+		hwe->pf.info = kzalloc(sizeof(struct pagefault), GFP_KERNEL);
+		if (hwe->pf.info)
+			memcpy(hwe->pf.info, pf, sizeof(struct pagefault));
+		spin_unlock(&hwe->pf.lock);
+	}
+}
+
 #define USM_QUEUE_MAX_RUNTIME_MS	20
 
 static void pf_queue_work_func(struct work_struct *w)
@@ -352,6 +369,8 @@ static void pf_queue_work_func(struct work_struct *w)
 			drm_dbg(&xe->drm, "Fault response: Unsuccessful %d\n", ret);
 	}
 
+	save_pagefault_to_engine(gt, &pf);
+
 	reply.dw0 = FIELD_PREP(PFR_VALID, 1) |
 		    FIELD_PREP(PFR_SUCCESS, pf.fault_unsuccessful) |
 		    FIELD_PREP(PFR_REPLY, PFR_ACCESS) |
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 913c74d6e2ae..d9da5c89429e 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -20,11 +20,13 @@
 #include "xe_assert.h"
 #include "xe_devcoredump.h"
 #include "xe_device.h"
+#include "xe_drm_client.h"
 #include "xe_exec_queue.h"
 #include "xe_force_wake.h"
 #include "xe_gpu_scheduler.h"
 #include "xe_gt.h"
 #include "xe_gt_clock.h"
+#include "xe_gt_pagefault.h"
 #include "xe_gt_printk.h"
 #include "xe_guc.h"
 #include "xe_guc_capture.h"
@@ -146,6 +148,7 @@ static bool exec_queue_banned(struct xe_exec_queue *q)
 static void set_exec_queue_banned(struct xe_exec_queue *q)
 {
 	atomic_or(EXEC_QUEUE_STATE_BANNED, &q->guc->state);
+	xe_drm_client_add_blame(q->xef->client, q);
 }
 
 static bool exec_queue_suspended(struct xe_exec_queue *q)
@@ -1971,6 +1974,7 @@ int xe_guc_deregister_done_handler(struct xe_guc *guc, u32 *msg, u32 len)
 int xe_guc_exec_queue_reset_handler(struct xe_guc *guc, u32 *msg, u32 len)
 {
 	struct xe_gt *gt = guc_to_gt(guc);
+	struct xe_hw_engine *hwe;
 	struct xe_exec_queue *q;
 	u32 guc_id;
 
@@ -1983,11 +1987,24 @@ int xe_guc_exec_queue_reset_handler(struct xe_guc *guc, u32 *msg, u32 len)
 	if (unlikely(!q))
 		return -EPROTO;
 
+	hwe = q->hwe;
+
 	xe_gt_info(gt, "Engine reset: engine_class=%s, logical_mask: 0x%x, guc_id=%d",
 		   xe_hw_engine_class_to_str(q->class), q->logical_mask, guc_id);
 
 	trace_xe_exec_queue_reset(q);
 
+	/**
+	 * Clear last pagefault from engine. Any future exec queue bans likely
+	 * were not caused by said pagefault now that the engine has reset.
+	 */
+	spin_lock(&hwe->pf.lock);
+	if (hwe->pf.info) {
+		kfree(hwe->pf.info);
+		hwe->pf.info = kzalloc(sizeof(struct pagefault), GFP_KERNEL);
+	}
+	spin_unlock(&hwe->pf.lock);
+
 	/*
 	 * A banned engine is a NOP at this point (came from
 	 * guc_exec_queue_timedout_job). Otherwise, kick drm scheduler to cancel
diff --git a/drivers/gpu/drm/xe/xe_hw_engine.c b/drivers/gpu/drm/xe/xe_hw_engine.c
index fc447751fe78..69f61e4905e2 100644
--- a/drivers/gpu/drm/xe/xe_hw_engine.c
+++ b/drivers/gpu/drm/xe/xe_hw_engine.c
@@ -21,6 +21,7 @@
 #include "xe_gsc.h"
 #include "xe_gt.h"
 #include "xe_gt_ccs_mode.h"
+#include "xe_gt_pagefault.h"
 #include "xe_gt_printk.h"
 #include "xe_gt_mcr.h"
 #include "xe_gt_topology.h"
@@ -557,6 +558,9 @@ static void hw_engine_init_early(struct xe_gt *gt, struct xe_hw_engine *hwe,
 		hwe->eclass->defaults = hwe->eclass->sched_props;
 	}
 
+	hwe->pf.info = NULL;
+	spin_lock_init(&hwe->pf.lock);
+
 	xe_reg_sr_init(&hwe->reg_sr, hwe->name, gt_to_xe(gt));
 	xe_tuning_process_engine(hwe);
 	xe_wa_process_engine(hwe);
diff --git a/drivers/gpu/drm/xe/xe_hw_engine_types.h b/drivers/gpu/drm/xe/xe_hw_engine_types.h
index e4191a7a2c31..2e1be9481d9b 100644
--- a/drivers/gpu/drm/xe/xe_hw_engine_types.h
+++ b/drivers/gpu/drm/xe/xe_hw_engine_types.h
@@ -64,6 +64,7 @@ enum xe_hw_engine_id {
 struct xe_bo;
 struct xe_execlist_port;
 struct xe_gt;
+struct pagefault;
 
 /**
  * struct xe_hw_engine_class_intf - per hw engine class struct interface
@@ -150,6 +151,13 @@ struct xe_hw_engine {
 	struct xe_oa_unit *oa_unit;
 	/** @hw_engine_group: the group of hw engines this one belongs to */
 	struct xe_hw_engine_group *hw_engine_group;
+	/** @pf: the last pagefault seen on this engine */
+	struct {
+		/** @pf.info: info containing last seen pagefault details */
+		struct pagefault *info;
+		/** @pf.lock: lock protecting @pf.info */
+		spinlock_t lock;
+	} pf;
 };
 
 enum xe_hw_engine_snapshot_source_id {

From patchwork Fri Feb 14 20:37:57 2025
From: Jonathan Cavitt <jonathan.cavitt@intel.com>
To: intel-xe@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org, jonathan.cavitt@intel.com,
 saurabhg.gupta@intel.com, alex.zuo@intel.com, joonas.lahtinen@intel.com,
 lucas.demarchi@intel.com, matthew.brost@intel.com
Subject: [PATCH 4/4] drm/xe/xe_drm_client: Add per drm client reset stats
Date: Fri, 14 Feb 2025 20:37:57 +0000
Message-ID: <20250214203757.27895-5-jonathan.cavitt@intel.com>
In-Reply-To: <20250214203757.27895-1-jonathan.cavitt@intel.com>

Add a counter to xe_drm_client that tracks the number of times the
engine has been reset since the drm client was created.
Signed-off-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
---
 drivers/gpu/drm/xe/xe_drm_client.c | 2 ++
 drivers/gpu/drm/xe/xe_drm_client.h | 2 ++
 drivers/gpu/drm/xe/xe_guc_submit.c | 4 +++-
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c
index f15560d0b6ff..ecd2ce99fd19 100644
--- a/drivers/gpu/drm/xe/xe_drm_client.c
+++ b/drivers/gpu/drm/xe/xe_drm_client.c
@@ -492,6 +492,8 @@ static void show_blames(struct drm_printer *p, struct drm_file *file)
 	client = xef->client;
 
+	drm_printf(p, "drm-client-reset-count:%u\n",
+		   atomic_read(&client->reset_count));
 	drm_printf(p, "\n");
 	drm_printf(p, "- Exec queue ban list -\n");
 	spin_lock(&client->blame_lock);
diff --git a/drivers/gpu/drm/xe/xe_drm_client.h b/drivers/gpu/drm/xe/xe_drm_client.h
index d21fd0b90742..c35de675ccfa 100644
--- a/drivers/gpu/drm/xe/xe_drm_client.h
+++ b/drivers/gpu/drm/xe/xe_drm_client.h
@@ -53,6 +53,8 @@ struct xe_drm_client {
 	 * Protected by @blame_lock.
 	 */
 	struct list_head blame_list;
+	/** @reset_count: number of times this drm client has seen an engine reset */
+	atomic_t reset_count;
 #endif
 };
 
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index d9da5c89429e..8810abc8f04a 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -1988,7 +1988,9 @@ int xe_guc_exec_queue_reset_handler(struct xe_guc *guc, u32 *msg, u32 len)
 		return -EPROTO;
 
 	hwe = q->hwe;
-
+#ifdef CONFIG_PROC_FS
+	atomic_inc(&q->xef->client->reset_count);
+#endif
 	xe_gt_info(gt, "Engine reset: engine_class=%s, logical_mask: 0x%x, guc_id=%d",
 		   xe_hw_engine_class_to_str(q->class), q->logical_mask, guc_id);