From patchwork Thu Feb 20 20:38:27 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Jonathan Cavitt
X-Patchwork-Id: 13984490
From: Jonathan Cavitt
To: intel-xe@lists.freedesktop.org
Cc: saurabhg.gupta@intel.com, alex.zuo@intel.com, jonathan.cavitt@intel.com,
    joonas.lahtinen@linux.intel.com, tvrtko.ursulin@igalia.com,
    lucas.demarchi@intel.com, matthew.brost@intel.com,
    dri-devel@lists.freedesktop.org, simona.vetter@ffwll.ch
Subject: [PATCH v4 1/6] drm/xe/xe_exec_queue: Add ID param to exec queue struct
Date: Thu, 20 Feb 2025 20:38:27 +0000
Message-ID: <20250220203832.130430-2-jonathan.cavitt@intel.com>
In-Reply-To: <20250220203832.130430-1-jonathan.cavitt@intel.com>
References: <20250220203832.130430-1-jonathan.cavitt@intel.com>

Add the exec queue ID to the exec queue struct. This is useful for
performing a reverse lookup into the xef->exec_queue xarray.
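As an aside, the reverse lookup this field enables can be sketched in isolation. The following is a minimal, self-contained illustration, not the xe driver API: the fixed-size table stands in for the xef->exec_queue xarray, and all names (`exec_queue`, `xe_file_sketch`, `queue_create`, `queue_lookup`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct xe_exec_queue: only the new id field is modeled. */
struct exec_queue {
	unsigned int id;	/* mirrors the q->id member added by this patch */
};

/* Stand-in for the per-file xarray of exec queues. */
struct xe_file_sketch {
	struct exec_queue *table[16];
};

/* Create path: the chosen id is stored on the queue itself, so code that
 * later holds only the queue pointer can find its slot again. */
static unsigned int queue_create(struct xe_file_sketch *xef, struct exec_queue *q)
{
	for (unsigned int id = 0; id < 16; id++) {
		if (!xef->table[id]) {
			xef->table[id] = q;
			q->id = id;	/* the reverse-lookup key */
			return id;
		}
	}
	return (unsigned int)-1;	/* table full */
}

/* Reverse lookup: from a queue pointer back to its table entry via q->id. */
static struct exec_queue *queue_lookup(struct xe_file_sketch *xef, struct exec_queue *q)
{
	return xef->table[q->id];
}
```

Without the stored id, finding a queue's slot would require a linear scan of the table comparing pointers; storing the key on the object makes the lookup O(1).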
Signed-off-by: Jonathan Cavitt
---
 drivers/gpu/drm/xe/xe_exec_queue.c       | 1 +
 drivers/gpu/drm/xe/xe_exec_queue_types.h | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
index 23a9f519ce1c..4a98a5d0e405 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.c
+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
@@ -709,6 +709,7 @@ int xe_exec_queue_create_ioctl(struct drm_device *dev, void *data,
 	if (err)
 		goto kill_exec_queue;
 
+	q->id = id;
 	args->exec_queue_id = id;
 
 	return 0;
diff --git a/drivers/gpu/drm/xe/xe_exec_queue_types.h b/drivers/gpu/drm/xe/xe_exec_queue_types.h
index 6eb7ff091534..088d838218e9 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue_types.h
+++ b/drivers/gpu/drm/xe/xe_exec_queue_types.h
@@ -55,6 +55,8 @@ struct xe_exec_queue {
 	struct xe_vm *vm;
 	/** @class: class of this exec queue */
 	enum xe_engine_class class;
+	/** @id: exec queue ID as reported during create ioctl */
+	u32 id;
 	/**
 	 * @logical_mask: logical mask of where job submitted to exec queue can run
 	 */

From patchwork Thu Feb 20 20:38:28 2025
X-Patchwork-Submitter: Jonathan Cavitt
X-Patchwork-Id: 13984491
From: Jonathan Cavitt
To: intel-xe@lists.freedesktop.org
Cc: saurabhg.gupta@intel.com, alex.zuo@intel.com, jonathan.cavitt@intel.com,
    joonas.lahtinen@linux.intel.com, tvrtko.ursulin@igalia.com,
    lucas.demarchi@intel.com, matthew.brost@intel.com,
    dri-devel@lists.freedesktop.org, simona.vetter@ffwll.ch
Subject: [PATCH v4 2/6] drm/xe/xe_gt_pagefault: Migrate pagefault struct to header
Date: Thu, 20 Feb 2025 20:38:28 +0000
Message-ID: <20250220203832.130430-3-jonathan.cavitt@intel.com>
In-Reply-To: <20250220203832.130430-1-jonathan.cavitt@intel.com>
References: <20250220203832.130430-1-jonathan.cavitt@intel.com>

Migrate the pagefault struct from xe_gt_pagefault.c to the
xe_gt_pagefault.h header file, along with the associated enum values.

Signed-off-by: Jonathan Cavitt
---
 drivers/gpu/drm/xe/xe_gt_pagefault.c | 27 ---------------------------
 drivers/gpu/drm/xe/xe_gt_pagefault.h | 28 ++++++++++++++++++++++++++++
 2 files changed, 28 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
index 46701ca11ce0..fe18e3ec488a 100644
--- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
@@ -22,33 +22,6 @@
 #include "xe_trace_bo.h"
 #include "xe_vm.h"
 
-struct pagefault {
-	u64 page_addr;
-	u32 asid;
-	u16 pdata;
-	u8 vfid;
-	u8 access_type;
-	u8 fault_type;
-	u8 fault_level;
-	u8 engine_class;
-	u8 engine_instance;
-	u8 fault_unsuccessful;
-	bool trva_fault;
-};
-
-enum access_type {
-	ACCESS_TYPE_READ = 0,
-	ACCESS_TYPE_WRITE = 1,
-	ACCESS_TYPE_ATOMIC = 2,
-	ACCESS_TYPE_RESERVED = 3,
-};
-
-enum fault_type {
-	NOT_PRESENT = 0,
-	WRITE_ACCESS_VIOLATION = 1,
-	ATOMIC_ACCESS_VIOLATION = 2,
-};
-
 struct acc {
 	u64 va_range_base;
 	u32 asid;
diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.h b/drivers/gpu/drm/xe/xe_gt_pagefault.h
index 839c065a5e4c..e9911da5c8a7 100644
--- a/drivers/gpu/drm/xe/xe_gt_pagefault.h
+++ b/drivers/gpu/drm/xe/xe_gt_pagefault.h
@@ -11,6 +11,34 @@
 struct xe_gt;
 struct xe_guc;
 
+struct pagefault {
+	u64 page_addr;
+	u32 asid;
+	u16 pdata;
+	u8 vfid;
+	u8 access_type;
+	u8 fault_type;
+	u8 fault_level;
+	u8 engine_class;
+	u8 engine_instance;
+	u8 fault_unsuccessful;
+	bool prefetch;
+	bool trva_fault;
+};
+
+enum access_type {
+	ACCESS_TYPE_READ = 0,
+	ACCESS_TYPE_WRITE = 1,
+	ACCESS_TYPE_ATOMIC = 2,
+	ACCESS_TYPE_RESERVED = 3,
+};
+
+enum fault_type {
+	NOT_PRESENT = 0,
+	WRITE_ACCESS_VIOLATION = 1,
+	ATOMIC_ACCESS_VIOLATION = 2,
+};
+
 int xe_gt_pagefault_init(struct xe_gt *gt);
 void xe_gt_pagefault_reset(struct xe_gt *gt);
 int xe_guc_pagefault_handler(struct xe_guc *guc, u32 *msg, u32 len);

From patchwork Thu Feb 20 20:38:29 2025
X-Patchwork-Submitter: Jonathan Cavitt
X-Patchwork-Id: 13984492
From: Jonathan Cavitt
To: intel-xe@lists.freedesktop.org
Cc: saurabhg.gupta@intel.com, alex.zuo@intel.com, jonathan.cavitt@intel.com,
    joonas.lahtinen@linux.intel.com, tvrtko.ursulin@igalia.com,
    lucas.demarchi@intel.com, matthew.brost@intel.com,
    dri-devel@lists.freedesktop.org, simona.vetter@ffwll.ch
Subject: [PATCH v4 3/6] drm/xe/xe_drm_client: Add per drm client pagefault info
Date: Thu, 20 Feb 2025 20:38:29 +0000
Message-ID: <20250220203832.130430-4-jonathan.cavitt@intel.com>
In-Reply-To: <20250220203832.130430-1-jonathan.cavitt@intel.com>
References: <20250220203832.130430-1-jonathan.cavitt@intel.com>

Add additional information to the drm client so it can report up to the
last 50 exec queues to have been banned on it, as well as the last
pagefault seen when said exec queues were banned. Since we cannot
reasonably associate a pagefault with a specific exec queue, we
currently report the last pagefault seen on the associated hw engine
instead. The last pagefault is saved to the hw engine and is updated
during the pagefault handling process in xe_gt_pagefault. The saved
pagefault is cleared when the engine is reset, because after the reset
any future exec queue bans were likely not caused by that pagefault.

v2: Remove exec queue from blame list on destroy and recreate (Joonas)
v3: Do not print as part of xe_drm_client_fdinfo (Joonas)
v4: Fix formatting and kzalloc-during-lock warnings

Signed-off-by: Jonathan Cavitt
---
 drivers/gpu/drm/xe/xe_drm_client.c      | 68 +++++++++++++++++++++++++
 drivers/gpu/drm/xe/xe_drm_client.h      | 42 +++++++++++++++
 drivers/gpu/drm/xe/xe_exec_queue.c      |  7 +++
 drivers/gpu/drm/xe/xe_gt_pagefault.c    | 17 +++++++
 drivers/gpu/drm/xe/xe_guc_submit.c      | 15 ++++++
 drivers/gpu/drm/xe/xe_hw_engine.c       |  4 ++
 drivers/gpu/drm/xe/xe_hw_engine_types.h |  8 +++
 7 files changed, 161 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c
index 2d4874d2b922..1bc978ae4c2f 100644
--- a/drivers/gpu/drm/xe/xe_drm_client.c
+++ b/drivers/gpu/drm/xe/xe_drm_client.c
@@ -17,6 +17,7 @@
 #include "xe_exec_queue.h"
 #include "xe_force_wake.h"
 #include "xe_gt.h"
+#include "xe_gt_pagefault.h"
 #include "xe_hw_engine.h"
 #include "xe_pm.h"
 #include "xe_trace.h"
@@ -97,6 +98,8 @@ struct xe_drm_client *xe_drm_client_alloc(void)
 #ifdef CONFIG_PROC_FS
 	spin_lock_init(&client->bos_lock);
 	INIT_LIST_HEAD(&client->bos_list);
+	spin_lock_init(&client->blame_lock);
+	INIT_LIST_HEAD(&client->blame_list);
 #endif
 	return client;
 }
@@ -164,6 +167,71 @@ void xe_drm_client_remove_bo(struct xe_bo *bo)
 	xe_drm_client_put(client);
 }
 
+static void free_blame(struct blame *b)
+{
+	list_del(&b->list);
+	kfree(b->pf);
+	kfree(b);
+}
+
+void xe_drm_client_add_blame(struct xe_drm_client *client,
+			     struct xe_exec_queue *q)
+{
+	struct blame *b = NULL;
+	struct pagefault *pf = NULL;
+	struct xe_file *xef = q->xef;
+	struct xe_hw_engine *hwe = q->hwe;
+
+	b = kzalloc(sizeof(*b), GFP_KERNEL);
+	xe_assert(xef->xe, b);
+
+	spin_lock(&client->blame_lock);
+	list_add_tail(&b->list, &client->blame_list);
+	client->blame_len++;
+	/*
+	 * Limit the number of blames in the blame list to prevent memory
+	 * overuse.
+	 */
+	if (client->blame_len > MAX_BLAME_LEN) {
+		struct blame *rem = list_first_entry(&client->blame_list, struct blame, list);
+
+		free_blame(rem);
+		client->blame_len--;
+	}
+	spin_unlock(&client->blame_lock);
+
+	/*
+	 * Duplicate the pagefault on the engine into the blame, if one may
+	 * have caused the exec queue to be banned.
+	 */
+	b->pf = NULL;
+	pf = kzalloc(sizeof(*pf), GFP_KERNEL);
+	spin_lock(&hwe->pf.lock);
+	if (hwe->pf.info) {
+		memcpy(pf, hwe->pf.info, sizeof(struct pagefault));
+		b->pf = pf;
+	} else {
+		kfree(pf);
+	}
+	spin_unlock(&hwe->pf.lock);
+
+	/* Save blame data to list element */
+	b->exec_queue_id = q->id;
+}
+
+void xe_drm_client_remove_blame(struct xe_drm_client *client,
+				struct xe_exec_queue *q)
+{
+	struct blame *b, *tmp;
+
+	spin_lock(&client->blame_lock);
+	list_for_each_entry_safe(b, tmp, &client->blame_list, list)
+		if (b->exec_queue_id == q->id) {
+			free_blame(b);
+			client->blame_len--;
+		}
+	spin_unlock(&client->blame_lock);
+}
+
 static void bo_meminfo(struct xe_bo *bo,
 		       struct drm_memory_stats stats[TTM_NUM_MEM_TYPES])
 {
diff --git a/drivers/gpu/drm/xe/xe_drm_client.h b/drivers/gpu/drm/xe/xe_drm_client.h
index a9649aa36011..b3d9b279d55f 100644
--- a/drivers/gpu/drm/xe/xe_drm_client.h
+++ b/drivers/gpu/drm/xe/xe_drm_client.h
@@ -13,9 +13,22 @@
 #include
 #include
 
+#define MAX_BLAME_LEN 50
+
 struct drm_file;
 struct drm_printer;
+struct pagefault;
 struct xe_bo;
+struct xe_exec_queue;
+
+struct blame {
+	/** @exec_queue_id: ID number of banned exec queue */
+	u32 exec_queue_id;
+	/** @pf: pagefault on engine of banned exec queue, if any at time of ban */
+	struct pagefault *pf;
+	/** @list: link into @xe_drm_client.blame_list */
+	struct list_head list;
+};
 
 struct xe_drm_client {
 	struct kref kref;
@@ -31,6 +44,21 @@ struct xe_drm_client {
 	 * Protected by @bos_lock.
 	 */
 	struct list_head bos_list;
+	/**
+	 * @blame_lock: lock protecting @blame_list
+	 */
+	spinlock_t blame_lock;
+	/**
+	 * @blame_list: list of banned exec queues associated with this drm
+	 * client, as well as any pagefaults at time of ban.
+	 *
+	 * Protected by @blame_lock.
+	 */
+	struct list_head blame_list;
+	/**
+	 * @blame_len: length of @blame_list
+	 */
+	unsigned int blame_len;
 #endif
 };
 
@@ -57,6 +85,10 @@ void xe_drm_client_fdinfo(struct drm_printer *p, struct drm_file *file);
 void xe_drm_client_add_bo(struct xe_drm_client *client,
 			  struct xe_bo *bo);
 void xe_drm_client_remove_bo(struct xe_bo *bo);
+void xe_drm_client_add_blame(struct xe_drm_client *client,
+			     struct xe_exec_queue *q);
+void xe_drm_client_remove_blame(struct xe_drm_client *client,
+				struct xe_exec_queue *q);
 #else
 static inline void xe_drm_client_add_bo(struct xe_drm_client *client,
 					struct xe_bo *bo)
@@ -66,5 +98,15 @@ static inline void xe_drm_client_add_bo(struct xe_drm_client *client,
 static inline void xe_drm_client_remove_bo(struct xe_bo *bo)
 {
 }
+
+static inline void xe_drm_client_add_blame(struct xe_drm_client *client,
+					   struct xe_exec_queue *q)
+{
+}
+
+static inline void xe_drm_client_remove_blame(struct xe_drm_client *client,
+					      struct xe_exec_queue *q)
+{
+}
 #endif
 #endif
diff --git a/drivers/gpu/drm/xe/xe_exec_queue.c b/drivers/gpu/drm/xe/xe_exec_queue.c
index 4a98a5d0e405..f8bcf43b2a0e 100644
--- a/drivers/gpu/drm/xe/xe_exec_queue.c
+++ b/drivers/gpu/drm/xe/xe_exec_queue.c
@@ -13,6 +13,7 @@
 #include
 
 #include "xe_device.h"
+#include "xe_drm_client.h"
 #include "xe_gt.h"
 #include "xe_hw_engine_class_sysfs.h"
 #include "xe_hw_engine_group.h"
@@ -712,6 +713,12 @@ int xe_exec_queue_create_ioctl(struct drm_device *dev, void *data,
 	q->id = id;
 	args->exec_queue_id = id;
 
+	/*
+	 * If an exec queue in the blame list shares the same exec queue
+	 * ID, remove it from the blame list to avoid confusion.
+	 */
+	xe_drm_client_remove_blame(q->xef->client, q);
+
 	return 0;
 
 kill_exec_queue:
diff --git a/drivers/gpu/drm/xe/xe_gt_pagefault.c b/drivers/gpu/drm/xe/xe_gt_pagefault.c
index fe18e3ec488a..b95501076569 100644
--- a/drivers/gpu/drm/xe/xe_gt_pagefault.c
+++ b/drivers/gpu/drm/xe/xe_gt_pagefault.c
@@ -330,6 +330,21 @@ int xe_guc_pagefault_handler(struct xe_guc *guc, u32 *msg, u32 len)
 	return full ? -ENOSPC : 0;
 }
 
+static void save_pagefault_to_engine(struct xe_gt *gt, struct pagefault *pf)
+{
+	struct xe_hw_engine *hwe;
+
+	hwe = xe_gt_hw_engine(gt, pf->engine_class, pf->engine_instance, false);
+	if (hwe) {
+		spin_lock(&hwe->pf.lock);
+		/* Info initializes as NULL, so alloc on first pagefault.
+		 * GFP_ATOMIC: we cannot sleep under the spinlock. */
+		if (!hwe->pf.info)
+			hwe->pf.info = kzalloc(sizeof(*pf), GFP_ATOMIC);
+		if (hwe->pf.info)
+			memcpy(hwe->pf.info, pf, sizeof(*pf));
+		spin_unlock(&hwe->pf.lock);
+	}
+}
+
 #define USM_QUEUE_MAX_RUNTIME_MS 20
 
 static void pf_queue_work_func(struct work_struct *w)
@@ -352,6 +367,8 @@ static void pf_queue_work_func(struct work_struct *w)
 		drm_dbg(&xe->drm, "Fault response: Unsuccessful %d\n", ret);
 	}
 
+	save_pagefault_to_engine(gt, &pf);
+
 	reply.dw0 = FIELD_PREP(PFR_VALID, 1) |
 		    FIELD_PREP(PFR_SUCCESS, pf.fault_unsuccessful) |
 		    FIELD_PREP(PFR_REPLY, PFR_ACCESS) |
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 913c74d6e2ae..92de926bd505 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -20,11 +20,13 @@
 #include "xe_assert.h"
 #include "xe_devcoredump.h"
 #include "xe_device.h"
+#include "xe_drm_client.h"
 #include "xe_exec_queue.h"
 #include "xe_force_wake.h"
 #include "xe_gpu_scheduler.h"
 #include "xe_gt.h"
 #include "xe_gt_clock.h"
+#include "xe_gt_pagefault.h"
 #include "xe_gt_printk.h"
 #include "xe_guc.h"
 #include "xe_guc_capture.h"
@@ -146,6 +148,7 @@ static bool exec_queue_banned(struct xe_exec_queue *q)
 static void set_exec_queue_banned(struct xe_exec_queue *q)
 {
 	atomic_or(EXEC_QUEUE_STATE_BANNED, &q->guc->state);
+	xe_drm_client_add_blame(q->xef->client, q);
 }
 
 static bool exec_queue_suspended(struct xe_exec_queue *q)
@@ -1971,6 +1974,7 @@ int xe_guc_deregister_done_handler(struct xe_guc *guc, u32 *msg, u32 len)
 int xe_guc_exec_queue_reset_handler(struct xe_guc *guc, u32 *msg, u32 len)
 {
 	struct xe_gt *gt = guc_to_gt(guc);
+	struct xe_hw_engine *hwe;
 	struct xe_exec_queue *q;
 	u32 guc_id;
 
@@ -1983,11 +1987,22 @@ int xe_guc_exec_queue_reset_handler(struct xe_guc *guc, u32 *msg, u32 len)
 	if (unlikely(!q))
 		return -EPROTO;
 
+	hwe = q->hwe;
+
 	xe_gt_info(gt, "Engine reset: engine_class=%s, logical_mask: 0x%x, guc_id=%d",
 		   xe_hw_engine_class_to_str(q->class), q->logical_mask, guc_id);
 
 	trace_xe_exec_queue_reset(q);
 
+	/*
+	 * Clear the last pagefault from the engine. Any future exec queue
+	 * bans were likely not caused by said pagefault now that the engine
+	 * has reset.
+	 */
+	spin_lock(&hwe->pf.lock);
+	kfree(hwe->pf.info);
+	hwe->pf.info = NULL;
+	spin_unlock(&hwe->pf.lock);
+
 	/*
 	 * A banned engine is a NOP at this point (came from
 	 * guc_exec_queue_timedout_job). Otherwise, kick drm scheduler to cancel
diff --git a/drivers/gpu/drm/xe/xe_hw_engine.c b/drivers/gpu/drm/xe/xe_hw_engine.c
index fc447751fe78..69f61e4905e2 100644
--- a/drivers/gpu/drm/xe/xe_hw_engine.c
+++ b/drivers/gpu/drm/xe/xe_hw_engine.c
@@ -21,6 +21,7 @@
 #include "xe_gsc.h"
 #include "xe_gt.h"
 #include "xe_gt_ccs_mode.h"
+#include "xe_gt_pagefault.h"
 #include "xe_gt_printk.h"
 #include "xe_gt_mcr.h"
 #include "xe_gt_topology.h"
@@ -557,6 +558,9 @@ static void hw_engine_init_early(struct xe_gt *gt, struct xe_hw_engine *hwe,
 		hwe->eclass->defaults = hwe->eclass->sched_props;
 	}
 
+	hwe->pf.info = NULL;
+	spin_lock_init(&hwe->pf.lock);
+
 	xe_reg_sr_init(&hwe->reg_sr, hwe->name, gt_to_xe(gt));
 	xe_tuning_process_engine(hwe);
 	xe_wa_process_engine(hwe);
diff --git a/drivers/gpu/drm/xe/xe_hw_engine_types.h b/drivers/gpu/drm/xe/xe_hw_engine_types.h
index e4191a7a2c31..2e1be9481d9b 100644
--- a/drivers/gpu/drm/xe/xe_hw_engine_types.h
+++ b/drivers/gpu/drm/xe/xe_hw_engine_types.h
@@ -64,6 +64,7 @@ enum xe_hw_engine_id {
 struct xe_bo;
 struct xe_execlist_port;
 struct xe_gt;
+struct pagefault;
 
 /**
  * struct xe_hw_engine_class_intf - per hw engine class struct interface
@@ -150,6 +151,13 @@ struct xe_hw_engine {
 	struct xe_oa_unit *oa_unit;
 	/** @hw_engine_group: the group of hw engines this one belongs to */
 	struct xe_hw_engine_group *hw_engine_group;
+	/** @pf: the last pagefault seen on this engine */
+	struct {
+		/** @pf.info: info containing last seen pagefault details */
+		struct pagefault *info;
+		/** @pf.lock: lock protecting @pf.info */
+		spinlock_t lock;
+	} pf;
 };
 
 enum xe_hw_engine_snapshot_source_id {

From patchwork Thu Feb 20 20:38:30 2025
X-Patchwork-Submitter: Jonathan Cavitt
X-Patchwork-Id: 13984494
From: Jonathan Cavitt
To: intel-xe@lists.freedesktop.org
Cc: saurabhg.gupta@intel.com, alex.zuo@intel.com, jonathan.cavitt@intel.com,
    joonas.lahtinen@linux.intel.com, tvrtko.ursulin@igalia.com,
    lucas.demarchi@intel.com, matthew.brost@intel.com,
    dri-devel@lists.freedesktop.org, simona.vetter@ffwll.ch
Subject: [PATCH v4 4/6] drm/xe/xe_drm_client: Add per drm client reset stats
Date: Thu, 20 Feb 2025 20:38:30 +0000
Message-ID: <20250220203832.130430-5-jonathan.cavitt@intel.com>
In-Reply-To: <20250220203832.130430-1-jonathan.cavitt@intel.com>
References: <20250220203832.130430-1-jonathan.cavitt@intel.com>

Add a counter to xe_drm_client that tracks the number of times the
engine has been reset since the drm client was created.
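The counting scheme can be sketched in isolation with C11 atomics. This is a hedged illustration, not the driver code: the real patch uses the kernel's atomic_t and atomic_inc, and the names below (`client_sketch`, `note_engine_reset`, `resets_seen`) are hypothetical.

```c
#include <stdatomic.h>

/* Stand-in for struct xe_drm_client: only the new reset_count member
 * is modeled here. */
struct client_sketch {
	atomic_uint reset_count;
};

/* Mirrors the reset-handler hook: every engine reset attributed to this
 * client bumps the counter. The atomic add needs no external locking,
 * which is why the patch can increment from the GuC handler directly. */
static void note_engine_reset(struct client_sketch *c)
{
	atomic_fetch_add_explicit(&c->reset_count, 1, memory_order_relaxed);
}

static unsigned int resets_seen(struct client_sketch *c)
{
	return atomic_load_explicit(&c->reset_count, memory_order_relaxed);
}
```

A relaxed-ordering counter is sufficient here because the value is only a statistic read after the fact, not a synchronization point.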
Signed-off-by: Jonathan Cavitt
---
 drivers/gpu/drm/xe/xe_drm_client.h | 2 ++
 drivers/gpu/drm/xe/xe_guc_submit.c | 4 +++-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_drm_client.h b/drivers/gpu/drm/xe/xe_drm_client.h
index b3d9b279d55f..6579c4b60ae7 100644
--- a/drivers/gpu/drm/xe/xe_drm_client.h
+++ b/drivers/gpu/drm/xe/xe_drm_client.h
@@ -59,6 +59,8 @@ struct xe_drm_client {
 	 * @blame_len: length of @blame_list
 	 */
 	unsigned int blame_len;
+	/** @reset_count: number of times this drm client has seen an engine reset */
+	atomic_t reset_count;
 #endif
 };
 
diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c b/drivers/gpu/drm/xe/xe_guc_submit.c
index 92de926bd505..5d899de3dd83 100644
--- a/drivers/gpu/drm/xe/xe_guc_submit.c
+++ b/drivers/gpu/drm/xe/xe_guc_submit.c
@@ -1988,7 +1988,9 @@ int xe_guc_exec_queue_reset_handler(struct xe_guc *guc, u32 *msg, u32 len)
 		return -EPROTO;
 
 	hwe = q->hwe;
-
+#ifdef CONFIG_PROC_FS
+	atomic_inc(&q->xef->client->reset_count);
+#endif
 	xe_gt_info(gt, "Engine reset: engine_class=%s, logical_mask: 0x%x, guc_id=%d",
 		   xe_hw_engine_class_to_str(q->class), q->logical_mask, guc_id);
 

From patchwork Thu Feb 20 20:38:31 2025
X-Patchwork-Submitter: Jonathan Cavitt
X-Patchwork-Id: 13984495
header.d=intel.com header.i=@intel.com header.b="SQDLcf76"; dkim-atps=neutral Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.18]) by gabe.freedesktop.org (Postfix) with ESMTPS id BB42410E9DF; Thu, 20 Feb 2025 20:38:35 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1740083916; x=1771619916; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=60nckjAvEwo8lnuh60Lk9jDkK46KiEserXo4D8hJMuY=; b=SQDLcf76f5P5f0JGG8O+enOsHN1NTqZgMeNxAsV0GlrxO3XlHOg8WUHY yJEhIDcUkRsmbNedGUDxCFBzxgaUCp6ttX+exape53jtED8xIZG50T7aG XKQNs2tKaSJyzdZQO5RcwMVxUTlAwW1KB/8jEIQSaQzhg2fF0EcDb30i1 7XZKhWM7ZcxUAavk4tSwCJ24FXdFK/mpuD6iMPmVJ4plxWYxKnV3rS6k9 7KTsZIJ2OhKw9iS9j6JlPl/CiQ7NiNSUvv4nXOE0AA6Tzv91XqAMWXZds ZQ97VO8z76wXYZNb0FyyV5DdQOm9IyBmRsBjWnUuKDrMfPK2vHuMTJ0cv Q==; X-CSE-ConnectionGUID: /HUJagLCRxOrIbJHeDERHw== X-CSE-MsgGUID: i7pp+DPUS5CmNKhd8NFDLA== X-IronPort-AV: E=McAfee;i="6700,10204,11314"; a="41097941" X-IronPort-AV: E=Sophos;i="6.12,310,1728975600"; d="scan'208";a="41097941" Received: from orviesa006.jf.intel.com ([10.64.159.146]) by orvoesa110.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 20 Feb 2025 12:38:36 -0800 X-CSE-ConnectionGUID: nPPNdJlnSBudZVPUFX2u1w== X-CSE-MsgGUID: oAfG8eX+RJOW466+NyeZkw== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.13,302,1732608000"; d="scan'208";a="115100572" Received: from dut4086lnl.fm.intel.com ([10.105.10.90]) by orviesa006-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 20 Feb 2025 12:38:35 -0800 From: Jonathan Cavitt To: intel-xe@lists.freedesktop.org Cc: saurabhg.gupta@intel.com, alex.zuo@intel.com, jonathan.cavitt@intel.com, joonas.lahtinen@linux.intel.com, tvrtko.ursulin@igalia.com, lucas.demarchi@intel.com, matthew.brost@intel.com, dri-devel@lists.freedesktop.org, simona.vetter@ffwll.ch Subject: [PATCH v4 5/6] drm/xe/xe_query: Pass drm file to query funcs Date: Thu, 20 Feb 2025 20:38:31 +0000 
Message-ID: <20250220203832.130430-6-jonathan.cavitt@intel.com>
In-Reply-To: <20250220203832.130430-1-jonathan.cavitt@intel.com>
References: <20250220203832.130430-1-jonathan.cavitt@intel.com>

Pass the drm file to the query funcs in xe_query.c.  This will be
necessary for a future query.

Signed-off-by: Jonathan Cavitt
---
 drivers/gpu/drm/xe/xe_query.c | 39 ++++++++++++++++++++++++-----------
 1 file changed, 27 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_query.c b/drivers/gpu/drm/xe/xe_query.c
index 042f87a688e7..3aad4737bfec 100644
--- a/drivers/gpu/drm/xe/xe_query.c
+++ b/drivers/gpu/drm/xe/xe_query.c
@@ -110,7 +110,8 @@ hwe_read_timestamp(struct xe_hw_engine *hwe, u64 *engine_ts, u64 *cpu_ts,
 
 static int
 query_engine_cycles(struct xe_device *xe,
-		    struct drm_xe_device_query *query)
+		    struct drm_xe_device_query *query,
+		    struct drm_file *file)
 {
 	struct drm_xe_query_engine_cycles __user *query_ptr;
 	struct drm_xe_engine_class_instance *eci;
@@ -179,7 +180,8 @@ query_engine_cycles(struct xe_device *xe,
 }
 
 static int query_engines(struct xe_device *xe,
-			 struct drm_xe_device_query *query)
+			 struct drm_xe_device_query *query,
+			 struct drm_file *file)
 {
 	size_t size = calc_hw_engine_info_size(xe);
 	struct drm_xe_query_engines __user *query_ptr =
@@ -240,7 +242,8 @@ static size_t calc_mem_regions_size(struct xe_device *xe)
 }
 
 static int query_mem_regions(struct xe_device *xe,
-			     struct drm_xe_device_query *query)
+			     struct drm_xe_device_query *query,
+			     struct drm_file *file)
 {
 	size_t size = calc_mem_regions_size(xe);
 	struct drm_xe_query_mem_regions *mem_regions;
@@ -310,7 +313,9 @@ static int query_mem_regions(struct xe_device *xe,
 	return ret;
 }
 
-static int query_config(struct xe_device *xe, struct drm_xe_device_query *query)
+static int query_config(struct xe_device *xe,
+			struct drm_xe_device_query *query,
+			struct drm_file *file)
 {
 	const u32 num_params = DRM_XE_QUERY_CONFIG_MAX_EXEC_QUEUE_PRIORITY + 1;
 	size_t size =
@@ -351,7 +356,9 @@ static int query_config(struct xe_device *xe, struct drm_xe_device_query *query)
 	return 0;
 }
 
-static int query_gt_list(struct xe_device *xe, struct drm_xe_device_query *query)
+static int query_gt_list(struct xe_device *xe,
+			 struct drm_xe_device_query *query,
+			 struct drm_file *file)
 {
 	struct xe_gt *gt;
 	size_t size = sizeof(struct drm_xe_query_gt_list) +
@@ -422,7 +429,8 @@ static int query_gt_list(struct xe_device *xe, struct drm_xe_device_query *query
 }
 
 static int query_hwconfig(struct xe_device *xe,
-			  struct drm_xe_device_query *query)
+			  struct drm_xe_device_query *query,
+			  struct drm_file *file)
 {
 	struct xe_gt *gt = xe_root_mmio_gt(xe);
 	size_t size = xe_guc_hwconfig_size(&gt->uc.guc);
@@ -490,7 +498,8 @@ static int copy_mask(void __user **ptr,
 }
 
 static int query_gt_topology(struct xe_device *xe,
-			     struct drm_xe_device_query *query)
+			     struct drm_xe_device_query *query,
+			     struct drm_file *file)
 {
 	void __user *query_ptr = u64_to_user_ptr(query->data);
 	size_t size = calc_topo_query_size(xe);
@@ -549,7 +558,9 @@ static int query_gt_topology(struct xe_device *xe,
 }
 
 static int
-query_uc_fw_version(struct xe_device *xe, struct drm_xe_device_query *query)
+query_uc_fw_version(struct xe_device *xe,
+		    struct drm_xe_device_query *query,
+		    struct drm_file *file)
 {
 	struct drm_xe_query_uc_fw_version __user *query_ptr = u64_to_user_ptr(query->data);
 	size_t size = sizeof(struct drm_xe_query_uc_fw_version);
@@ -639,7 +650,8 @@ static size_t calc_oa_unit_query_size(struct xe_device *xe)
 }
 
 static int query_oa_units(struct xe_device *xe,
-			  struct drm_xe_device_query *query)
+			  struct drm_xe_device_query *query,
+			  struct drm_file *file)
 {
 	void __user *query_ptr = u64_to_user_ptr(query->data);
 	size_t size = calc_oa_unit_query_size(xe);
@@ -699,7 +711,9 @@ static int query_oa_units(struct xe_device *xe,
 	return ret ? -EFAULT : 0;
 }
 
-static int query_pxp_status(struct xe_device *xe, struct drm_xe_device_query *query)
+static int query_pxp_status(struct xe_device *xe,
+			    struct drm_xe_device_query *query,
+			    struct drm_file *file)
 {
 	struct drm_xe_query_pxp_status __user *query_ptr = u64_to_user_ptr(query->data);
 	size_t size = sizeof(struct drm_xe_query_pxp_status);
@@ -727,7 +741,8 @@ static int query_pxp_status(struct xe_device *xe, struct drm_xe_device_query *qu
 }
 
 static int (* const xe_query_funcs[])(struct xe_device *xe,
-				      struct drm_xe_device_query *query) = {
+				      struct drm_xe_device_query *query,
+				      struct drm_file *file) = {
 	query_engines,
 	query_mem_regions,
 	query_config,
@@ -757,5 +772,5 @@ int xe_query_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
 	if (XE_IOCTL_DBG(xe, !xe_query_funcs[idx]))
 		return -EINVAL;
 
-	return xe_query_funcs[idx](xe, query);
+	return xe_query_funcs[idx](xe, query, file);
 }

From patchwork Thu Feb 20 20:38:32 2025
X-Patchwork-Submitter: Jonathan Cavitt
X-Patchwork-Id: 13984493
From: Jonathan Cavitt
To: intel-xe@lists.freedesktop.org
Cc: saurabhg.gupta@intel.com, alex.zuo@intel.com, jonathan.cavitt@intel.com,
 joonas.lahtinen@linux.intel.com, tvrtko.ursulin@igalia.com,
 lucas.demarchi@intel.com, matthew.brost@intel.com,
 dri-devel@lists.freedesktop.org, simona.vetter@ffwll.ch
Subject: [PATCH v4 6/6] drm/xe/xe_query: Add support for per-drm-client reset
 stat querying
Date: Thu, 20 Feb 2025 20:38:32 +0000
Message-ID:
 <20250220203832.130430-7-jonathan.cavitt@intel.com>
In-Reply-To: <20250220203832.130430-1-jonathan.cavitt@intel.com>
References: <20250220203832.130430-1-jonathan.cavitt@intel.com>

Add support for userspace to query per-drm-client reset stats via the
query ioctl.  This includes the number of engine resets the drm client
has observed, as well as a list of up to the last 50 relevant exec
queue bans and their associated causal pagefaults (if they exist).

v2: Report EOPNOTSUPP if CONFIG_PROC_FS is not set in the kernel
config, as it is required to trace the reset count and exec queue bans.

Signed-off-by: Jonathan Cavitt
---
 drivers/gpu/drm/xe/xe_query.c | 70 +++++++++++++++++++++++++++++++++++
 include/uapi/drm/xe_drm.h     | 50 +++++++++++++++++++++++++
 2 files changed, 120 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_query.c b/drivers/gpu/drm/xe/xe_query.c
index 3aad4737bfec..671bc4270b93 100644
--- a/drivers/gpu/drm/xe/xe_query.c
+++ b/drivers/gpu/drm/xe/xe_query.c
@@ -16,10 +16,12 @@
 #include "regs/xe_gt_regs.h"
 #include "xe_bo.h"
 #include "xe_device.h"
+#include "xe_drm_client.h"
 #include "xe_exec_queue.h"
 #include "xe_force_wake.h"
 #include "xe_ggtt.h"
 #include "xe_gt.h"
+#include "xe_gt_pagefault.h"
 #include "xe_guc_hwconfig.h"
 #include "xe_macros.h"
 #include "xe_mmio.h"
@@ -740,6 +742,73 @@ static int query_pxp_status(struct xe_device *xe,
 	return 0;
 }
 
+static size_t calc_reset_stats_size(struct xe_drm_client *client)
+{
+	size_t size = sizeof(struct drm_xe_query_reset_stats);
+#ifdef CONFIG_PROC_FS
+	spin_lock(&client->blame_lock);
+	size += sizeof(struct drm_xe_exec_queue_ban) * client->blame_len;
+	spin_unlock(&client->blame_lock);
+#endif
+	return size;
+}
+
+static int query_reset_stats(struct xe_device *xe,
+			     struct drm_xe_device_query *query,
+			     struct drm_file *file)
+{
+	void __user *query_ptr = u64_to_user_ptr(query->data);
+	struct drm_xe_query_reset_stats resp;
+	struct xe_file *xef = to_xe_file(file);
+	struct xe_drm_client *client = xef->client;
+	struct blame *b;
+	size_t size = calc_reset_stats_size(client);
+	int i = 0;
+
+#ifdef CONFIG_PROC_FS
+	if (query->size == 0) {
+		query->size = size;
+		return 0;
+	} else if (XE_IOCTL_DBG(xe, query->size != size)) {
+		return -EINVAL;
+	}
+
+	if (copy_from_user(&resp, query_ptr, size))
+		return -EFAULT;
+
+	resp.reset_count = atomic_read(&client->reset_count);
+
+	spin_lock(&client->blame_lock);
+	resp.ban_count = client->blame_len;
+	list_for_each_entry(b, &client->blame_list, list) {
+		struct drm_xe_exec_queue_ban *ban = &resp.ban_list[i++];
+		struct pagefault *pf = b->pf;
+
+		ban->exec_queue_id = b->exec_queue_id;
+		ban->pf_found = pf ? 1 : 0;
+		if (!pf)
+			continue;
+
+		ban->access_type = pf->access_type;
+		ban->fault_type = pf->fault_type;
+		ban->vfid = pf->vfid;
+		ban->asid = pf->asid;
+		ban->pdata = pf->pdata;
+		ban->engine_class = xe_to_user_engine_class[pf->engine_class];
+		ban->engine_instance = pf->engine_instance;
+		ban->fault_addr = pf->page_addr;
+	}
+	spin_unlock(&client->blame_lock);
+
+	if (copy_to_user(query_ptr, &resp, size))
+		return -EFAULT;
+
+	return 0;
+#else
+	return -EOPNOTSUPP;
+#endif
+}
+
 static int (* const xe_query_funcs[])(struct xe_device *xe,
 				      struct drm_xe_device_query *query,
 				      struct drm_file *file) = {
@@ -753,6 +822,7 @@ static int (* const xe_query_funcs[])(struct xe_device *xe,
 	query_uc_fw_version,
 	query_oa_units,
 	query_pxp_status,
+	query_reset_stats,
 };
 
 int xe_query_ioctl(struct drm_device *dev, void *data, struct drm_file *file)
diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
index 892f54d3aa09..ffeb2a79e084 100644
--- a/include/uapi/drm/xe_drm.h
+++ b/include/uapi/drm/xe_drm.h
@@ -682,6 +682,7 @@ struct drm_xe_query_pxp_status {
  *  - %DRM_XE_DEVICE_QUERY_GT_TOPOLOGY
  *  - %DRM_XE_DEVICE_QUERY_ENGINE_CYCLES
  *  - %DRM_XE_DEVICE_QUERY_PXP_STATUS
+ *  - %DRM_XE_DEVICE_QUERY_RESET_STATS
  *
  * If size is set to 0, the driver fills it with the required size for
  * the requested type of data to query. If size is equal to the required
@@ -735,6 +736,7 @@ struct drm_xe_device_query {
 #define DRM_XE_DEVICE_QUERY_UC_FW_VERSION	7
 #define DRM_XE_DEVICE_QUERY_OA_UNITS		8
 #define DRM_XE_DEVICE_QUERY_PXP_STATUS		9
+#define DRM_XE_DEVICE_QUERY_RESET_STATS		10
 	/** @query: The type of data to query */
 	__u32 query;
 
@@ -1845,6 +1847,54 @@ enum drm_xe_pxp_session_type {
 	DRM_XE_PXP_TYPE_HWDRM = 1,
 };
 
+/**
+ * struct drm_xe_exec_queue_ban - Per drm client exec queue ban info returned
+ * from @DRM_XE_DEVICE_QUERY_RESET_STATS query. Includes the exec queue ID and
+ * all associated pagefault information, if relevant.
+ */
+struct drm_xe_exec_queue_ban {
+	/** @exec_queue_id: ID of banned exec queue */
+	__u32 exec_queue_id;
+	/**
+	 * @pf_found: whether or not the ban is associated with a pagefault.
+	 * If not, all pagefault data will default to 0 and will not be relevant.
+	 */
+	__u8 pf_found;
+	/** @access_type: access type of associated pagefault */
+	__u8 access_type;
+	/** @fault_type: fault type of associated pagefault */
+	__u8 fault_type;
+	/** @vfid: VFID of associated pagefault */
+	__u8 vfid;
+	/** @asid: ASID of associated pagefault */
+	__u32 asid;
+	/** @pdata: PDATA of associated pagefault */
+	__u16 pdata;
+	/** @engine_class: engine class of associated pagefault */
+	__u8 engine_class;
+	/** @engine_instance: engine instance of associated pagefault */
+	__u8 engine_instance;
+	/** @fault_addr: faulted address of associated pagefault */
+	__u64 fault_addr;
+};
+
+/**
+ * struct drm_xe_query_reset_stats - Per drm client reset stats query.
+ */
+struct drm_xe_query_reset_stats {
+	/** @extensions: Pointer to the first extension struct, if any */
+	__u64 extensions;
+	/** @reset_count: Number of times the drm client has observed an engine reset */
+	__u64 reset_count;
+	/** @ban_count: number of exec queue bans saved by the drm client */
+	__u64 ban_count;
+	/**
+	 * @ban_list: flexible array of struct drm_xe_exec_queue_ban, reporting all
+	 * observed exec queue bans on the drm client.
+	 */
+	struct drm_xe_exec_queue_ban ban_list[];
+};
+
 /* ID of the protected content session managed by Xe when PXP is active */
 #define DRM_XE_PXP_HWDRM_DEFAULT_SESSION 0xf