From patchwork Fri Oct 25 08:48:14 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Raag Jadav
X-Patchwork-Id: 13850218
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
 aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by smtp.lore.kernel.org (Postfix) with ESMTPS id 3FFD7D0C5EB
 for ; Fri, 25 Oct 2024 08:49:03 +0000 (UTC)
Received: from gabe.freedesktop.org (localhost [127.0.0.1])
 by gabe.freedesktop.org (Postfix) with ESMTP id 9FCD310EA2F;
 Fri, 25 Oct 2024 08:49:02 +0000 (UTC)
Authentication-Results: gabe.freedesktop.org;
 dkim=pass (2048-bit key; unprotected) header.d=intel.com
 header.i=@intel.com header.b="DoDW4kBT";
 dkim-atps=neutral
Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.13])
 by gabe.freedesktop.org (Postfix) with ESMTPS id 7145B10EA2F;
 Fri, 25 Oct 2024 08:49:01 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
 d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
 t=1729846141; x=1761382141;
 h=from:to:cc:subject:date:message-id:in-reply-to:
 references:mime-version:content-transfer-encoding;
 bh=aLjcWUY8SuEfKDR0WSvMQTy0mf+lzWHlH7axlY/MKng=;
 b=DoDW4kBTaV5AggQm6iQhN6LlXTr65pJKxSH4TMQQ3pdIDvbZJL5tQDYF
 6seCCV9lHSS/MqbpJiZcvfBt5yCq7itMHYk7hHcd1+nZJmvtdSGHK9yc+
 us91hccgF6uXUuT25g9w1hhgdNcHpT7NGo9imt/x6YVyGaFpzJsFjj7UF
 bgs/bIPD9PWyHfoOXCESjctcMMZ3SPZW/zA5CPCtwiKVkEek73mu5xmGU
 51x7BZNopJwsBNsyu178mMdFGKYIsn+9DN8yyv9/I/wDPSGnY6KEWtnp0
 57HMfQ7B0KewGILPuLKTmgPElTPP6mwNTL6GMSuB6f/5Rh5b1aoAZVmw2
 Q==;
X-CSE-ConnectionGUID: NjTPf7rGRlipy9l8KXpoyQ==
X-CSE-MsgGUID: HJ+XucT+RPu1rPam79kOkg==
X-IronPort-AV: E=McAfee;i="6700,10204,11235"; a="32369502"
X-IronPort-AV: E=Sophos;i="6.11,231,1725346800"; d="scan'208";a="32369502"
Received: from orviesa010.jf.intel.com ([10.64.159.150])
 by fmvoesa107.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 25 Oct 2024 01:49:01 -0700
X-CSE-ConnectionGUID: p4/LKv4XTMeTUgcqUGqbvA==
X-CSE-MsgGUID: ymsx/TmERsCU3VOUTi3NEg==
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="6.11,231,1725346800"; d="scan'208";a="80768525"
Received: from jraag-nuc8i7beh.iind.intel.com ([10.145.169.79])
 by orviesa010.jf.intel.com with ESMTP; 25 Oct 2024 01:48:55 -0700
From: Raag Jadav
To: airlied@gmail.com, simona@ffwll.ch, lucas.demarchi@intel.com,
 rodrigo.vivi@intel.com, jani.nikula@linux.intel.com,
 andriy.shevchenko@linux.intel.com, lina@asahilina.net,
 michal.wajdeczko@intel.com, christian.koenig@amd.com
Cc: intel-xe@lists.freedesktop.org, intel-gfx@lists.freedesktop.org,
 dri-devel@lists.freedesktop.org, himal.prasad.ghimiray@intel.com,
 aravind.iddamsetty@linux.intel.com, anshuman.gupta@intel.com,
 alexander.deucher@amd.com, andrealmeid@igalia.com,
 amd-gfx@lists.freedesktop.org, kernel-dev@igalia.com, Raag Jadav
Subject: [PATCH v8 1/4] drm: Introduce device wedged event
Date: Fri, 25 Oct 2024 14:18:14 +0530
Message-Id: <20241025084817.144621-2-raag.jadav@intel.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20241025084817.144621-1-raag.jadav@intel.com>
References: <20241025084817.144621-1-raag.jadav@intel.com>
MIME-Version: 1.0
X-BeenThere: dri-devel@lists.freedesktop.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Direct Rendering Infrastructure - Development
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: dri-devel-bounces@lists.freedesktop.org
Sender: "dri-devel"

Introduce device wedged event, which will notify userspace of wedged
(hanged/unusable) state of the DRM device through a uevent. This is
useful especially in cases where the device is no longer operating as
expected even after a reset and has become unrecoverable from driver
context.
Purpose of this implementation is to provide drivers a generic way to
recover with the help of userspace intervention without taking any
drastic measures in the driver.

A 'wedged' device is basically a dead device that needs attention. The
uevent is the notification that is sent to userspace along with a hint
about what could possibly be attempted to recover the device and bring
it back to usable state. Different drivers may have different ideas of
a 'wedged' device depending on their hardware implementation, and hence
the vendor agnostic nature of the event. It is up to the drivers to
decide when they see the need for recovery and how they want to recover
from the available methods.

Recovery
--------

Current implementation defines two recovery methods, out of which,
drivers can use any one, both or none. Method(s) of choice will be sent
in the uevent environment as ``WEDGED=<method1>[,<method2>]`` in order
of less to more side-effects. If driver is unsure about recovery or
method is unknown (like soft/hard reboot, firmware flashing, hardware
replacement or any other procedure which can't be attempted on the fly),
``WEDGED=none`` will be sent instead.

It is the responsibility of the driver to perform required cleanups
(like disabling system memory access or signalling dma_fences) and
prepare itself for the recovery before sending the event. Once the event
is sent, driver should block all IOCTLs with an error code. This will
signify the reason for wedging which can be reported to the application
if needed.

Userspace consumers can parse this event and attempt recovery as per
below expectations.

    =============== ==================================
    Recovery method Consumer expectations
    =============== ==================================
    rebind          unbind + rebind driver
    bus-reset       unbind + reset bus device + rebind
    none            admin/user policy
    =============== ==================================

Example for rebind
~~~~~~~~~~~~~~~~~~

Udev rule::

    SUBSYSTEM=="drm", ENV{WEDGED}=="rebind", DEVPATH=="*/drm/card[0-9]",
    RUN+="/path/to/rebind.sh $env{DEVPATH}"

Recovery script::

    #!/bin/sh

    DEVPATH=$(readlink -f /sys/$1/device)
    DEVICE=$(basename $DEVPATH)
    DRIVER=$(readlink -f $DEVPATH/driver)

    echo -n $DEVICE > $DRIVER/unbind
    sleep 1
    echo -n $DEVICE > $DRIVER/bind

Although scripts are simple enough for basic recovery, admin/users can
define customized policies around recovery action. For example if the
driver supports multiple recovery methods, consumers can opt for the
suitable one based on policy definition. Consumers can also take
additional steps like gathering telemetry information (devcoredump,
syslog), or have the device available for further debugging and data
collection before performing the recovery. This is useful especially
when the driver is unsure about recovery or method is unknown.
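
As a rough illustration of such a policy, a consumer could choose one
method out of the advertised list. The sketch below is not part of this
series: the helper name ``pick_method`` and its optional "preferred
method" argument are assumptions made for illustration only. It relies
solely on the ordering described above (methods are listed from less to
more side-effects), so the first entry is taken as the least disruptive
choice.

```shell
#!/bin/sh
# Hypothetical policy helper (illustration only, not from this series).
# $1: value of WEDGED from the uevent, e.g. "rebind,bus-reset" or "none"
# $2: optional method preferred by local policy, e.g. "bus-reset"
# Prints the method to attempt, or "none" when nothing can be automated.
pick_method() {
    wedged=$1
    prefer=${2:-}

    # Driver is unsure about recovery: leave it to admin/user policy.
    if [ -z "$wedged" ] || [ "$wedged" = "none" ]; then
        echo none
        return 0
    fi

    # Honour the preferred method only if the driver advertised it.
    if [ -n "$prefer" ]; then
        case ",$wedged," in
        *",$prefer,"*)
            echo "$prefer"
            return 0
            ;;
        esac
    fi

    # Otherwise take the first entry, which has the least side-effects.
    echo "${wedged%%,*}"
}
```

With this helper, ``pick_method "rebind,bus-reset" bus-reset`` selects
bus-reset, while ``pick_method "rebind,bus-reset"`` with no preference
falls back to rebind. The selected name can then be dispatched to a
method specific script such as the rebind example above.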

v4: s/drm_dev_wedged/drm_dev_wedged_event
    Use drm_info() (Jani)
    Kernel doc adjustment (Aravind)
v5: Send recovery method with uevent (Lina)
v6: Access wedge_recovery_opts[] using helper function (Jani)
    Use snprintf() (Jani)
v7: Convert recovery helpers into regular functions (Andy, Jani)
    Aesthetic adjustments (Andy)
    Handle invalid method cases
v8: Allow sending multiple methods with uevent (Lucas, Michal)
    static_assert() globally (Andy)

Signed-off-by: Raag Jadav
---
 drivers/gpu/drm/drm_drv.c | 51 +++++++++++++++++++++++++++++++++++++++
 include/drm/drm_device.h  |  7 ++++++
 include/drm/drm_drv.h     |  1 +
 3 files changed, 59 insertions(+)

diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index ac30b0ec9d93..ded6327fc242 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -26,6 +26,8 @@
  * DEALINGS IN THE SOFTWARE.
  */
 
+#include
+#include
 #include
 #include
 #include
@@ -33,6 +35,7 @@
 #include
 #include
 #include
+#include
 #include
 
 #include
@@ -70,6 +73,16 @@ static struct dentry *drm_debugfs_root;
 
 DEFINE_STATIC_SRCU(drm_unplug_srcu);
 
+/*
+ * Available recovery methods for wedged device. To be sent along with device
+ * wedged uevent.
+ */
+static const char *const drm_wedge_recovery_opts[] = {
+	[ffs(DRM_WEDGE_RECOVERY_REBIND) - 1] = "rebind",
+	[ffs(DRM_WEDGE_RECOVERY_BUS_RESET) - 1] = "bus-reset",
+};
+static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == ffs(DRM_WEDGE_RECOVERY_BUS_RESET));
+
 /*
  * DRM Minors
  * A DRM device can provide several char-dev interfaces on the DRM-Major. Each
@@ -497,6 +510,44 @@ void drm_dev_unplug(struct drm_device *dev)
 }
 EXPORT_SYMBOL(drm_dev_unplug);
 
+/**
+ * drm_dev_wedged_event - generate a device wedged uevent
+ * @dev: DRM device
+ * @method: method(s) to be used for recovery
+ *
+ * This generates a device wedged uevent for the DRM device specified by @dev.
+ * Recovery @method from drm_wedge_recovery_opts[] is sent in the uevent
+ * environment as ``WEDGED=<method1>[,<method2>]`` in order of less to more
+ * side-effects. If caller is unsure about recovery or @method is unknown (0),
+ * ``WEDGED=none`` will be sent instead.
+ *
+ * Returns: 0 on success, negative error code otherwise.
+ */
+int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
+{
+	unsigned int len, opt, size = ARRAY_SIZE(drm_wedge_recovery_opts);
+	const char *recovery = NULL;
+	/* Event string length up to 24+ characters with available methods */
+	char event_string[32];
+	char *envp[] = { event_string, NULL };
+
+	len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
+
+	for_each_set_bit(opt, &method, size) {
+		len += scnprintf(event_string + len, sizeof(event_string) - len,
+				 recovery ? ",%s" : "%s", drm_wedge_recovery_opts[opt]);
+		recovery = drm_wedge_recovery_opts[opt];
+	}
+
+	if (!recovery)
+		/* Caller is unsure about recovery, do the best we can at this point. */
+		scnprintf(event_string + len, sizeof(event_string) - len, "%s", "none");
+
+	drm_info(dev, "device wedged, needs recovery\n");
+	return kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
+}
+EXPORT_SYMBOL(drm_dev_wedged_event);
+
 /*
  * DRM internal mount
  * We want to be able to allocate our own "struct address_space" to control
diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
index c91f87b5242d..edf8b200891d 100644
--- a/include/drm/drm_device.h
+++ b/include/drm/drm_device.h
@@ -21,6 +21,13 @@ struct inode;
 struct pci_dev;
 struct pci_controller;
 
+/*
+ * Recovery methods for wedged device in order of less to more side-effects.
+ * To be used with drm_dev_wedged_event() as recovery @method. Callers can
+ * use any one, multiple (or'd) or none depending on their needs.
+ */
+#define DRM_WEDGE_RECOVERY_REBIND BIT(0) /* unbind + rebind driver */
+#define DRM_WEDGE_RECOVERY_BUS_RESET BIT(1) /* unbind + reset bus device + rebind */
 
 /**
  * enum switch_power_state - power state of drm device
diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index 02ea4e3248fd..cc7bcb94ad6a 100644
--- a/include/drm/drm_drv.h
+++ b/include/drm/drm_drv.h
@@ -461,6 +461,7 @@ void drm_put_dev(struct drm_device *dev);
 bool drm_dev_enter(struct drm_device *dev, int *idx);
 void drm_dev_exit(int idx);
 void drm_dev_unplug(struct drm_device *dev);
+int drm_dev_wedged_event(struct drm_device *dev, unsigned long method);
 
 /**
  * drm_dev_is_unplugged - is a DRM device unplugged

From patchwork Fri Oct 25 08:48:15 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Raag Jadav
X-Patchwork-Id: 13850219
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
 aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by smtp.lore.kernel.org (Postfix) with ESMTPS id 1D17AD0C5EB
 for ; Fri, 25 Oct 2024 08:49:10 +0000 (UTC)
Received: from gabe.freedesktop.org (localhost [127.0.0.1])
 by gabe.freedesktop.org (Postfix) with ESMTP id 93C1D10EA33;
 Fri, 25 Oct 2024 08:49:09 +0000 (UTC)
Authentication-Results: gabe.freedesktop.org;
 dkim=pass (2048-bit key; unprotected) header.d=intel.com
 header.i=@intel.com header.b="eqp39dA6";
 dkim-atps=neutral
Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.13])
 by gabe.freedesktop.org (Postfix) with ESMTPS id 6AD8010EA35;
 Fri, 25 Oct 2024 08:49:07 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
 d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
 t=1729846147; x=1761382147;
 h=from:to:cc:subject:date:message-id:in-reply-to:
 references:mime-version:content-transfer-encoding;
 bh=JwkocReqtyRt0zYjlyUPartVAUOebyiEt6E2ydkMjFU=;
 b=eqp39dA65FqQRzeQxZFdS82bD2+NO/l869YFlV7P+GN/0blyrCSpiN8U
 0O+luLTWjt03gaBv7zJ6VJwJPSQSiJgfuAGc6MfH6lCj+kU0KkpKVHPD+
 bHG7VfcUGdgoqqpeGwotJCh8gqcFcFWyINincbvjCcOw7W3vYiO2NVyRK
 FmIYJGlOUuCvb21w3qq66af3KlLpgVqWcdgZoHYE3xYZvB4L5DoSiyAoy
 f5c1vf8vT2y7RheYjUWrAQlkmWHzLe7fbU+HxAsKpYXpVqHkAlxD43DS5
 LNTCw5zhHbdyqVxBn/4ov1cEXMO2nXGEFfn7An/7PqQH0Iinu4S2jajro
 w==;
X-CSE-ConnectionGUID: W2dR/+GJT5qlkrBIIp/SNg==
X-CSE-MsgGUID: sWOd3+GPTjiDAuMOJprw3w==
X-IronPort-AV: E=McAfee;i="6700,10204,11235"; a="32369512"
X-IronPort-AV: E=Sophos;i="6.11,231,1725346800"; d="scan'208";a="32369512"
Received: from orviesa010.jf.intel.com ([10.64.159.150])
 by fmvoesa107.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 25 Oct 2024 01:49:07 -0700
X-CSE-ConnectionGUID: a/oWB9piQj+Ee923mMFE5A==
X-CSE-MsgGUID: J4RLUBdBSmWCe8Oq2B0bBA==
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="6.11,231,1725346800"; d="scan'208";a="80768563"
Received: from jraag-nuc8i7beh.iind.intel.com ([10.145.169.79])
 by orviesa010.jf.intel.com with ESMTP; 25 Oct 2024 01:49:01 -0700
From: Raag Jadav
To: airlied@gmail.com, simona@ffwll.ch, lucas.demarchi@intel.com,
 rodrigo.vivi@intel.com, jani.nikula@linux.intel.com,
 andriy.shevchenko@linux.intel.com, lina@asahilina.net,
 michal.wajdeczko@intel.com, christian.koenig@amd.com
Cc: intel-xe@lists.freedesktop.org, intel-gfx@lists.freedesktop.org,
 dri-devel@lists.freedesktop.org, himal.prasad.ghimiray@intel.com,
 aravind.iddamsetty@linux.intel.com, anshuman.gupta@intel.com,
 alexander.deucher@amd.com, andrealmeid@igalia.com,
 amd-gfx@lists.freedesktop.org, kernel-dev@igalia.com, Raag Jadav
Subject: [PATCH v8 2/4] drm/doc: Document device wedged event
Date: Fri, 25 Oct 2024 14:18:15 +0530
Message-Id: <20241025084817.144621-3-raag.jadav@intel.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20241025084817.144621-1-raag.jadav@intel.com>
References: <20241025084817.144621-1-raag.jadav@intel.com>
MIME-Version: 1.0
X-BeenThere: dri-devel@lists.freedesktop.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Direct Rendering Infrastructure - Development
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: dri-devel-bounces@lists.freedesktop.org
Sender: "dri-devel"

Add documentation for device wedged event in a new 'Device wedging'
chapter. This describes basic definitions and consumer expectations
along with an example.

v8: Improve documentation (Christian, Rodrigo)

Signed-off-by: Raag Jadav
---
 Documentation/gpu/drm-uapi.rst | 75 ++++++++++++++++++++++++++++++++++
 1 file changed, 75 insertions(+)

diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
index 370d820be248..11a7446233b5 100644
--- a/Documentation/gpu/drm-uapi.rst
+++ b/Documentation/gpu/drm-uapi.rst
@@ -362,6 +362,81 @@ the first place.
 DRM devices should make use of devcoredump to store relevant information about
 the reset, so this information can be added to user bug reports.
 
+Device wedging
+==============
+
+Drivers can optionally make use of device wedged event (implemented as
+drm_dev_wedged_event() in DRM subsystem) which notifies userspace of wedged
+(hanged/unusable) state of the DRM device through a uevent. This is useful
+especially in cases where the device is no longer operating as expected even
+after a reset and has become unrecoverable from driver context. Purpose of
+this implementation is to provide drivers a generic way to recover with the
+help of userspace intervention without taking any drastic measures in the
+driver.
+
+A 'wedged' device is basically a dead device that needs attention. The
+uevent is the notification that is sent to userspace along with a hint about
+what could possibly be attempted to recover the device and bring it back to
+usable state. Different drivers may have different ideas of a 'wedged' device
+depending on their hardware implementation, and hence the vendor agnostic
+nature of the event. It is up to the drivers to decide when they see the need
+for recovery and how they want to recover from the available methods.
+
+Recovery
+--------
+
+Current implementation defines two recovery methods, out of which, drivers
+can use any one, both or none. Method(s) of choice will be sent in the uevent
+environment as ``WEDGED=<method1>[,<method2>]`` in order of less to more side
+effects. If driver is unsure about recovery or method is unknown (like reboot,
+firmware flashing, hardware replacement or any other procedure which can't be
+attempted on the fly), ``WEDGED=none`` will be sent instead.
+
+It is the responsibility of the driver to perform required cleanups (like
+disabling system memory access or signalling dma_fences) and prepare itself
+for the recovery before sending the event. Once the event is sent, driver
+should block all IOCTLs with an error code. This will signify the reason for
+wedging which can be reported to the application if needed.
+
+Userspace consumers can parse this event and attempt recovery as per below
+expectations.
+
+    =============== ==================================
+    Recovery method Consumer expectations
+    =============== ==================================
+    rebind          unbind + rebind driver
+    bus-reset       unbind + reset bus device + rebind
+    none            admin/user policy
+    =============== ==================================
+
+Example for rebind
+~~~~~~~~~~~~~~~~~~
+
+Udev rule::
+
+    SUBSYSTEM=="drm", ENV{WEDGED}=="rebind", DEVPATH=="*/drm/card[0-9]",
+    RUN+="/path/to/rebind.sh $env{DEVPATH}"
+
+Recovery script::
+
+    #!/bin/sh
+
+    DEVPATH=$(readlink -f /sys/$1/device)
+    DEVICE=$(basename $DEVPATH)
+    DRIVER=$(readlink -f $DEVPATH/driver)
+
+    echo -n $DEVICE > $DRIVER/unbind
+    sleep 1
+    echo -n $DEVICE > $DRIVER/bind
+
+Although scripts are simple enough for basic recovery, admin/users can define
+customized policies around recovery action. For example if the driver supports
+multiple recovery methods, consumers can opt for the suitable one based on
+policy definition. Consumers can also take additional steps like gathering
+telemetry information (devcoredump, syslog), or have the device available for
+further debugging and data collection before performing the recovery. This is
+useful especially when the driver is unsure about recovery or method is unknown.
+
 .. _drm_driver_ioctl:
 
 IOCTL Support on Device Nodes

From patchwork Fri Oct 25 08:48:16 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Raag Jadav
X-Patchwork-Id: 13850220
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
 aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by smtp.lore.kernel.org (Postfix) with ESMTPS id 298FED0C5E6
 for ; Fri, 25 Oct 2024 08:49:16 +0000 (UTC)
Received: from gabe.freedesktop.org (localhost [127.0.0.1])
 by gabe.freedesktop.org (Postfix) with ESMTP id 83D6510EA38;
 Fri, 25 Oct 2024 08:49:15 +0000 (UTC)
Authentication-Results: gabe.freedesktop.org;
 dkim=pass (2048-bit key; unprotected) header.d=intel.com
 header.i=@intel.com header.b="cWYF6YIO";
 dkim-atps=neutral
Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.13])
 by gabe.freedesktop.org (Postfix) with ESMTPS id 7E55610EA36;
 Fri, 25 Oct 2024 08:49:13 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
 d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
 t=1729846154; x=1761382154;
 h=from:to:cc:subject:date:message-id:in-reply-to:
 references:mime-version:content-transfer-encoding;
 bh=DXftr2DH4pg0fUGyeeyFU1Xg9RIPK7GA/zDAk6ZGVJ4=;
 b=cWYF6YIOF9LDeSf1gdcw/edk4gPh4TDxIeXrtVEsCgBBjZaZexky2liN
 o9IdokhDnWNoAiO4H/jy+v0aCHjkLqN6goZZc2jLWMPC3aQH9VLguQxUH
 XsR2uSF2HAa639siqBZnfD94UoFv5EJZnk4EENhroztP6Sl11uZhIpL9u
 qNX2sGcVnCwlBMfajgqRLGzSKk4zzIeXv0XoEZT2VXKSYMj0ssOXBpMT5
 sV8Cwf86U7mb0CYBUMRwIK/FGxhe+5tVEEvB8DRrGXjGzILQHOhUuZ7Js
 KktrEtrbGoyiFru3MpAvXT0DKArwDMUTAmTebaMzr0XVWdjlLQSlHNo8F
 w==;
X-CSE-ConnectionGUID: HrseqB1dT+Cgy0mYYI+WCw==
X-CSE-MsgGUID: e07BSqo0TduYmM0aB5Gg9Q==
X-IronPort-AV: E=McAfee;i="6700,10204,11235"; a="32369517"
X-IronPort-AV: E=Sophos;i="6.11,231,1725346800"; d="scan'208";a="32369517"
Received: from orviesa010.jf.intel.com ([10.64.159.150])
 by fmvoesa107.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 25 Oct 2024 01:49:13 -0700
X-CSE-ConnectionGUID: tfJiozn1SSOxtuaPY06A/w==
X-CSE-MsgGUID: R+Twa2AORoe0PTMAS7zhIQ==
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="6.11,231,1725346800"; d="scan'208";a="80768578"
Received: from jraag-nuc8i7beh.iind.intel.com ([10.145.169.79])
 by orviesa010.jf.intel.com with ESMTP; 25 Oct 2024 01:49:07 -0700
From: Raag Jadav
To: airlied@gmail.com, simona@ffwll.ch, lucas.demarchi@intel.com,
 rodrigo.vivi@intel.com, jani.nikula@linux.intel.com,
 andriy.shevchenko@linux.intel.com, lina@asahilina.net,
 michal.wajdeczko@intel.com, christian.koenig@amd.com
Cc: intel-xe@lists.freedesktop.org, intel-gfx@lists.freedesktop.org,
 dri-devel@lists.freedesktop.org, himal.prasad.ghimiray@intel.com,
 aravind.iddamsetty@linux.intel.com, anshuman.gupta@intel.com,
 alexander.deucher@amd.com, andrealmeid@igalia.com,
 amd-gfx@lists.freedesktop.org, kernel-dev@igalia.com, Raag Jadav
Subject: [PATCH v8 3/4] drm/xe: Use device wedged event
Date: Fri, 25 Oct 2024 14:18:16 +0530
Message-Id: <20241025084817.144621-4-raag.jadav@intel.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20241025084817.144621-1-raag.jadav@intel.com>
References: <20241025084817.144621-1-raag.jadav@intel.com>
MIME-Version: 1.0
X-BeenThere: dri-devel@lists.freedesktop.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Direct Rendering Infrastructure - Development
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: dri-devel-bounces@lists.freedesktop.org
Sender: "dri-devel"

This was previously attempted as xe specific reset uevent but dropped in
commit 77a0d4d1cea2 ("drm/xe/uapi: Remove reset uevent for now") as part
of refactoring. Now that we have device wedged event provided by DRM
core, make use of it and support both driver rebind and bus-reset based
recovery.
With this in place userspace will be notified of wedged device, on the
basis of which, userspace may take respective action to recover the
device.

$ udevadm monitor --property --kernel
monitor will print the received events for:
KERNEL - the kernel uevent

KERNEL[265.802982] change /devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0 (drm)
ACTION=change
DEVPATH=/devices/pci0000:00/0000:00:01.0/0000:01:00.0/0000:02:01.0/0000:03:00.0/drm/card0
SUBSYSTEM=drm
WEDGED=rebind,bus-reset
DEVNAME=/dev/dri/card0
DEVTYPE=drm_minor
SEQNUM=5208
MAJOR=226
MINOR=0

v2: Change authorship to Himal (Aravind)
    Add uevent for all device wedged cases (Aravind)
v3: Generic re-implementation in DRM subsystem (Lucas)
v4: Change authorship to Raag (Aravind)

Signed-off-by: Raag Jadav
---
 drivers/gpu/drm/xe/xe_device.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 2da4affe4dfd..2477cf043397 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -999,11 +999,12 @@ static void xe_device_wedged_fini(struct drm_device *drm, void *arg)
  * xe_device_declare_wedged - Declare device wedged
  * @xe: xe device instance
  *
- * This is a final state that can only be cleared with a mudule
+ * This is a final state that can only be cleared with a module
  * re-probe (unbind + bind).
  * In this state every IOCTL will be blocked so the GT cannot be used.
  * In general it will be called upon any critical error such as gt reset
- * failure or guc loading failure.
+ * failure or guc loading failure. Userspace will be notified of this state
+ * by a DRM uevent.
  * If xe.wedged module parameter is set to 2, this function will be called
  * on every single execution timeout (a.k.a. GPU hang) right after devcoredump
  * snapshot capture. In this mode, GT reset won't be attempted so the state of
@@ -1033,6 +1034,10 @@ void xe_device_declare_wedged(struct xe_device *xe)
 			"IOCTLs and executions are blocked. Only a rebind may clear the failure\n"
 			"Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/xe/kernel/issues/new\n",
 			dev_name(xe->drm.dev));
+
+		/* Notify userspace of wedged device */
+		drm_dev_wedged_event(&xe->drm,
+				     DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET);
 	}
 
 	for_each_gt(gt, xe, id)

From patchwork Fri Oct 25 08:48:17 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Raag Jadav
X-Patchwork-Id: 13850221
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
 aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (No client certificate requested)
 by smtp.lore.kernel.org (Postfix) with ESMTPS id 50F4CD0C5EC
 for ; Fri, 25 Oct 2024 08:49:21 +0000 (UTC)
Received: from gabe.freedesktop.org (localhost [127.0.0.1])
 by gabe.freedesktop.org (Postfix) with ESMTP id CB1C310EA35;
 Fri, 25 Oct 2024 08:49:20 +0000 (UTC)
Authentication-Results: gabe.freedesktop.org;
 dkim=pass (2048-bit key; unprotected) header.d=intel.com
 header.i=@intel.com header.b="mmYsUhFk";
 dkim-atps=neutral
Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.13])
 by gabe.freedesktop.org (Postfix) with ESMTPS id 98F7410EA39;
 Fri, 25 Oct 2024 08:49:19 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
 d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
 t=1729846160; x=1761382160;
 h=from:to:cc:subject:date:message-id:in-reply-to:
 references:mime-version:content-transfer-encoding;
 bh=Svj8ABxPDx0YDOfkwsRDJHrtjUOb/SfFshkhtuXbvMI=;
 b=mmYsUhFkwIan4GegBwEofUN3Y//rD0qx2h4B/NjNRSw+Nd5y2bzb1naO
 fBBOl5pkjYJbe6xQgewMecabL2V7wi53sRQ0zXFxVoLEBarpbXw+CpqCQ
 QsWNt4+52Gd9EGE3YXTsYnMGSe25pxtxUv/mc9tBTA8yTH+fNhT8WwYEl
 g9okdJL46okP8KAikUHirb+zsnuS1I/ZK3oCoxbSpkQtpEgT9k2XZBqAx
 MuowBnRfp2UUx879nsuUaV+HHPJmASoEypLmbZBsjHY4RoAEHCrNzgk1N
 957oaTWXlxnqjAb0Te0CKm1Pkgtgl7WoEfypXe1QYWvBb69R2pBhB9Rrh
 Q==;
X-CSE-ConnectionGUID: kAQqa4/CSvetn/XXtakE6w==
X-CSE-MsgGUID: jCN0CTvxSpOw/79bocdGvA==
X-IronPort-AV: E=McAfee;i="6700,10204,11235"; a="32369533"
X-IronPort-AV: E=Sophos;i="6.11,231,1725346800"; d="scan'208";a="32369533"
Received: from orviesa010.jf.intel.com ([10.64.159.150])
 by fmvoesa107.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 25 Oct 2024 01:49:19 -0700
X-CSE-ConnectionGUID: ohI25//FQAaPij8uhzHFGA==
X-CSE-MsgGUID: f9RsNs2aQNOF6Y9vt8iGHw==
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="6.11,231,1725346800"; d="scan'208";a="80768588"
Received: from jraag-nuc8i7beh.iind.intel.com ([10.145.169.79])
 by orviesa010.jf.intel.com with ESMTP; 25 Oct 2024 01:49:13 -0700
From: Raag Jadav
To: airlied@gmail.com, simona@ffwll.ch, lucas.demarchi@intel.com,
 rodrigo.vivi@intel.com, jani.nikula@linux.intel.com,
 andriy.shevchenko@linux.intel.com, lina@asahilina.net,
 michal.wajdeczko@intel.com, christian.koenig@amd.com
Cc: intel-xe@lists.freedesktop.org, intel-gfx@lists.freedesktop.org,
 dri-devel@lists.freedesktop.org, himal.prasad.ghimiray@intel.com,
 aravind.iddamsetty@linux.intel.com, anshuman.gupta@intel.com,
 alexander.deucher@amd.com, andrealmeid@igalia.com,
 amd-gfx@lists.freedesktop.org, kernel-dev@igalia.com, Raag Jadav
Subject: [PATCH v8 4/4] drm/i915: Use device wedged event
Date: Fri, 25 Oct 2024 14:18:17 +0530
Message-Id: <20241025084817.144621-5-raag.jadav@intel.com>
X-Mailer: git-send-email 2.34.1
In-Reply-To: <20241025084817.144621-1-raag.jadav@intel.com>
References: <20241025084817.144621-1-raag.jadav@intel.com>
MIME-Version: 1.0
X-BeenThere: dri-devel@lists.freedesktop.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Direct Rendering Infrastructure - Development
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: dri-devel-bounces@lists.freedesktop.org
Sender: "dri-devel"

Now that we have device wedged event provided by DRM core, make use of
it and support both driver rebind and bus-reset based recovery. With
this in place, userspace will be notified of wedged device on gt reset
failure.

Signed-off-by: Raag Jadav
---
 drivers/gpu/drm/i915/gt/intel_reset.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
index 8f1ea95471ef..06bfd2dbb6c8 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -1418,6 +1418,9 @@ static void intel_gt_reset_global(struct intel_gt *gt,
 
 	if (!test_bit(I915_WEDGED, &gt->reset.flags))
 		kobject_uevent_env(kobj, KOBJ_CHANGE, reset_done_event);
+	else
+		drm_dev_wedged_event(&gt->i915->drm,
+				     DRM_WEDGE_RECOVERY_REBIND | DRM_WEDGE_RECOVERY_BUS_RESET);
 }
 
 /**