Message ID | 20241025084817.144621-2-raag.jadav@intel.com (mailing list archive) |
---|---|
State | New |
Series | Introduce DRM device wedged event |
On Fri, 25 Oct 2024, Raag Jadav <raag.jadav@intel.com> wrote:
> Introduce device wedged event, which will notify userspace of wedged
> (hanged/unusable) state of the DRM device through a uevent. This is
> useful especially in cases where the device is no longer operating as
> expected even after a reset and has become unrecoverable from driver
> context. Purpose of this implementation is to provide drivers a generic
> way to recover with the help of userspace intervention without taking
> any drastic measures in the driver.
>
> A 'wedged' device is basically a dead device that needs attention.
> The uevent is the notification that is sent to userspace along with a
> hint about what could possibly be attempted to recover the device and
> bring it back to usable state. Different drivers may have different
> ideas of a 'wedged' device depending on their hardware implementation,
> and hence the vendor agnostic nature of the event. It is up to the
> drivers to decide when they see the need for recovery and how they
> want to recover from the available methods.
>
> Recovery
> --------
>
> Current implementation defines two recovery methods, out of which,
> drivers can use any one, both or none. Method(s) of choice will be
> sent in the uevent environment as ``WEDGED=<method1>[,<method2>]``
> in order of less to more side-effects. If driver is unsure about
> recovery or method is unknown (like soft/hard reboot, firmware
> flashing, hardware replacement or any other procedure which can't
> be attempted on the fly), ``WEDGED=none`` will be sent instead.
>
> It is the responsibility of the driver to perform required cleanups
> (like disabling system memory access or signalling dma_fences) and
> prepare itself for the recovery before sending the event. Once the
> event is sent, driver should block all IOCTLs with an error code.
> This will signify the reason for wedging which can be reported to
> the application if needed.
>
> Userspace consumers can parse this event and attempt recovery as per
> below expectations.
>
>     =============== ==================================
>     Recovery method Consumer expectations
>     =============== ==================================
>     rebind          unbind + rebind driver
>     bus-reset       unbind + reset bus device + rebind
>     none            admin/user policy
>     =============== ==================================
>
> Example for rebind
> ~~~~~~~~~~~~~~~~~~
>
> Udev rule::
>
>     SUBSYSTEM=="drm", ENV{WEDGED}=="rebind", DEVPATH=="*/drm/card[0-9]",
>     RUN+="/path/to/rebind.sh $env{DEVPATH}"
>
> Recovery script::
>
>     #!/bin/sh
>
>     DEVPATH=$(readlink -f /sys/$1/device)
>     DEVICE=$(basename $DEVPATH)
>     DRIVER=$(readlink -f $DEVPATH/driver)
>
>     echo -n $DEVICE > $DRIVER/unbind
>     sleep 1
>     echo -n $DEVICE > $DRIVER/bind
>
> Although scripts are simple enough for basic recovery, admin/users
> can define customized policies around recovery action. For example if
> the driver supports multiple recovery methods, consumers can opt for
> the suitable one based on policy definition. Consumers can also take
> additional steps like gathering telemetry information (devcoredump,
> syslog), or have the device available for further debugging and data
> collection before performing the recovery. This is useful especially
> when the driver is unsure about recovery or method is unknown.
>
> v4: s/drm_dev_wedged/drm_dev_wedged_event
>     Use drm_info() (Jani)
>     Kernel doc adjustment (Aravind)
> v5: Send recovery method with uevent (Lina)
> v6: Access wedge_recovery_opts[] using helper function (Jani)
>     Use snprintf() (Jani)
> v7: Convert recovery helpers into regular functions (Andy, Jani)
>     Aesthetic adjustments (Andy)
>     Handle invalid method cases
> v8: Allow sending multiple methods with uevent (Lucas, Michal)
>     static_assert() globally (Andy)
>
> Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> ---
>  drivers/gpu/drm/drm_drv.c | 51 +++++++++++++++++++++++++++++++++++++++
>  include/drm/drm_device.h  |  7 ++++++
>  include/drm/drm_drv.h     |  1 +
>  3 files changed, 59 insertions(+)
>
> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> index ac30b0ec9d93..ded6327fc242 100644
> --- a/drivers/gpu/drm/drm_drv.c
> +++ b/drivers/gpu/drm/drm_drv.c
> @@ -26,6 +26,8 @@
>   * DEALINGS IN THE SOFTWARE.
>   */
>
> +#include <linux/array_size.h>
> +#include <linux/build_bug.h>
>  #include <linux/debugfs.h>
>  #include <linux/fs.h>
>  #include <linux/module.h>
> @@ -33,6 +35,7 @@
>  #include <linux/mount.h>
>  #include <linux/pseudo_fs.h>
>  #include <linux/slab.h>
> +#include <linux/sprintf.h>
>  #include <linux/srcu.h>
>  #include <linux/xarray.h>
>
> @@ -70,6 +73,16 @@ static struct dentry *drm_debugfs_root;
>
>  DEFINE_STATIC_SRCU(drm_unplug_srcu);
>
> +/*
> + * Available recovery methods for wedged device. To be sent along with device
> + * wedged uevent.
> + */
> +static const char *const drm_wedge_recovery_opts[] = {
> +	[ffs(DRM_WEDGE_RECOVERY_REBIND) - 1] = "rebind",
> +	[ffs(DRM_WEDGE_RECOVERY_BUS_RESET) - 1] = "bus-reset",
> +};
> +static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == ffs(DRM_WEDGE_RECOVERY_BUS_RESET));

This might work in most cases, but you also might end up finding that
there's an arch and compiler combo out there that just won't be able to
figure out ffs() at compile time, and the array initialization fails.

If that happens, you'd have to either convert back to an enum (and call
the wedge event function with BIT(DRM_WEDGE_RECOVERY_REBIND) etc.), or
make this an array of structs mapping the macro values to strings and
loop over it.

Also, the main point of the static assert was to ensure the array is
updated when a new recovery option is added, and there's no out of
bounds access. That no longer holds, and the static assert is pretty
much useless. You still have to manually find and update this.

> +
>  /*
>   * DRM Minors
>   * A DRM device can provide several char-dev interfaces on the DRM-Major. Each
> @@ -497,6 +510,44 @@ void drm_dev_unplug(struct drm_device *dev)
>  }
>  EXPORT_SYMBOL(drm_dev_unplug);
>
> +/**
> + * drm_dev_wedged_event - generate a device wedged uevent
> + * @dev: DRM device
> + * @method: method(s) to be used for recovery
> + *
> + * This generates a device wedged uevent for the DRM device specified by @dev.
> + * Recovery @method from drm_wedge_recovery_opts[] is sent in the uevent
> + * environment as ``WEDGED=<method1>[,<method2>]`` in order of less to more
> + * side-effects. If caller is unsure about recovery or @method is unknown (0),
> + * ``WEDGED=none`` will be sent instead.
> + *
> + * Returns: 0 on success, negative error code otherwise.
> + */
> +int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
> +{
> +	unsigned int len, opt, size = ARRAY_SIZE(drm_wedge_recovery_opts);
> +	const char *recovery = NULL;
> +	/* Event string length up to 24+ characters with available methods */
> +	char event_string[32];
> +	char *envp[] = { event_string, NULL };
> +
> +	len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
> +
> +	for_each_set_bit(opt, &method, size) {
> +		recovery = drm_wedge_recovery_opts[opt];

You've left out bounds checking with the idea that the static assert
covers this. I don't think it does.

BR,
Jani.

> +		len += scnprintf(event_string + len, sizeof(event_string),
> +				 opt == size - 1 ? "%s" : "%s,", recovery);
> +	}
> +
> +	if (!recovery)
> +		/* Caller is unsure about recovery, do the best we can at this point. */
> +		scnprintf(event_string + len, sizeof(event_string), "%s", "none");
> +
> +	drm_info(dev, "device wedged, needs recovery\n");
> +	return kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
> +}
> +EXPORT_SYMBOL(drm_dev_wedged_event);
> +
>  /*
>   * DRM internal mount
>   * We want to be able to allocate our own "struct address_space" to control
> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> index c91f87b5242d..edf8b200891d 100644
> --- a/include/drm/drm_device.h
> +++ b/include/drm/drm_device.h
> @@ -21,6 +21,13 @@ struct inode;
>  struct pci_dev;
>  struct pci_controller;
>
> +/*
> + * Recovery methods for wedged device in order of less to more side-effects.
> + * To be used with drm_dev_wedged_event() as recovery @method. Callers can
> + * use any one, multiple (or'd) or none depending on their needs.
> + */
> +#define DRM_WEDGE_RECOVERY_REBIND	BIT(0)	/* unbind + rebind driver */
> +#define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(1)	/* unbind + reset bus device + rebind */
>
>  /**
>   * enum switch_power_state - power state of drm device
> diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
> index 02ea4e3248fd..cc7bcb94ad6a 100644
> --- a/include/drm/drm_drv.h
> +++ b/include/drm/drm_drv.h
> @@ -461,6 +461,7 @@ void drm_put_dev(struct drm_device *dev);
>  bool drm_dev_enter(struct drm_device *dev, int *idx);
>  void drm_dev_exit(int idx);
>  void drm_dev_unplug(struct drm_device *dev);
> +int drm_dev_wedged_event(struct drm_device *dev, unsigned long method);
>
>  /**
>   * drm_dev_is_unplugged - is a DRM device unplugged
On Fri, Oct 25, 2024 at 12:08:50PM +0300, Jani Nikula wrote:
> On Fri, 25 Oct 2024, Raag Jadav <raag.jadav@intel.com> wrote:

...

> > +/*
> > + * Available recovery methods for wedged device. To be sent along with device
> > + * wedged uevent.
> > + */
> > +static const char *const drm_wedge_recovery_opts[] = {
> > +	[ffs(DRM_WEDGE_RECOVERY_REBIND) - 1] = "rebind",
> > +	[ffs(DRM_WEDGE_RECOVERY_BUS_RESET) - 1] = "bus-reset",
> > +};
> > +static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == ffs(DRM_WEDGE_RECOVERY_BUS_RESET));
>
> This might work in most cases, but you also might end up finding that
> there's an arch and compiler combo out there that just won't be able to
> figure out ffs() at compile time, and the array initialization fails.

We have ilog2() macro for such cases, but it is rather fls() and not ffs(),
and I have no idea why ffs() is even being used here, especially in the index
part of the array assignments. It's unreadable.

> If that happens, you'd have to either convert back to an enum (and call
> the wedge event function with BIT(DRM_WEDGE_RECOVERY_REBIND) etc.), or
> make this an array of structs mapping the macro values to strings and
> loop over it.
>
> Also, the main point of the static assert was to ensure the array is
> updated when a new recovery option is added, and there's no out of
> bounds access. That no longer holds, and the static assert is pretty
> much useless. You still have to manually find and update this.
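For illustration only, the "array of structs mapping the macro values to strings" option quoted above could look roughly like the userspace sketch below. Nothing in it comes from the patch: struct wedge_recovery_opt, the struct-based wedge_recovery_opts[] table and wedge_event_string() are hypothetical names, BIT() is redefined locally, and snprintf() stands in for the kernel's scnprintf(). Because the loop walks the table instead of indexing it by bit number, an unknown bit in the mask cannot cause an out-of-bounds access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Values mirror the patch; BIT() is redefined so the sketch builds in userspace. */
#define BIT(n)				(1UL << (n))
#define DRM_WEDGE_RECOVERY_REBIND	BIT(0)
#define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(1)

/* Hypothetical table entry: one recovery method and its uevent name. */
struct wedge_recovery_opt {
	unsigned long method;
	const char *name;
};

/* Ordered from less to more side-effects, as the commit message describes. */
static const struct wedge_recovery_opt wedge_recovery_opts[] = {
	{ DRM_WEDGE_RECOVERY_REBIND,	"rebind" },
	{ DRM_WEDGE_RECOVERY_BUS_RESET,	"bus-reset" },
};

/*
 * Hypothetical helper: build "WEDGED=<method1>[,<method2>]" or "WEDGED=none".
 * Returns the string length, or -1 if buf is too small.
 */
static int wedge_event_string(char *buf, size_t size, unsigned long method)
{
	size_t n = sizeof(wedge_recovery_opts) / sizeof(wedge_recovery_opts[0]);
	const char *sep = "";
	size_t len, i;
	int ret;

	assert(buf != NULL);

	ret = snprintf(buf, size, "WEDGED=");
	if (ret < 0 || (size_t)ret >= size)
		return -1;
	len = (size_t)ret;

	for (i = 0; i < n; i++) {
		if (!(method & wedge_recovery_opts[i].method))
			continue;
		/* Only the space left after len is offered to snprintf(). */
		ret = snprintf(buf + len, size - len, "%s%s", sep,
			       wedge_recovery_opts[i].name);
		if (ret < 0 || (size_t)ret >= size - len)
			return -1;
		len += (size_t)ret;
		sep = ",";
	}

	if (*sep == '\0') {
		/* No known method bit was set. */
		ret = snprintf(buf + len, size - len, "none");
		if (ret < 0 || (size_t)ret >= size - len)
			return -1;
		len += (size_t)ret;
	}

	return (int)len;
}
```

The cost of this variant is the extra struct; the gain is that adding a method means adding one table row and nothing else.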
Hi Raag,

kernel test robot noticed the following build errors:

[auto build test ERROR on drm-xe/drm-xe-next]
[also build test ERROR on drm-intel/for-linux-next drm-intel/for-linux-next-fixes drm-tip/drm-tip linus/master v6.12-rc4 next-20241025]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Raag-Jadav/drm-Introduce-device-wedged-event/20241025-165119
base:   https://gitlab.freedesktop.org/drm/xe/kernel.git drm-xe-next
patch link:    https://lore.kernel.org/r/20241025084817.144621-2-raag.jadav%40intel.com
patch subject: [PATCH v8 1/4] drm: Introduce device wedged event
config: alpha-allmodconfig (https://download.01.org/0day-ci/archive/20241026/202410261411.F8079SY8-lkp@intel.com/config)
compiler: alpha-linux-gcc (GCC) 13.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241026/202410261411.F8079SY8-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202410261411.F8079SY8-lkp@intel.com/

All errors (new ones prefixed by >>):

>> drivers/gpu/drm/drm_drv.c:81:10: error: nonconstant array index in initializer
      81 |         [ffs(DRM_WEDGE_RECOVERY_REBIND) - 1] = "rebind",
         |          ^~~
   drivers/gpu/drm/drm_drv.c:81:10: note: (near initialization for 'drm_wedge_recovery_opts')
   drivers/gpu/drm/drm_drv.c:82:10: error: nonconstant array index in initializer
      82 |         [ffs(DRM_WEDGE_RECOVERY_BUS_RESET) - 1] = "bus-reset",
         |          ^~~
   drivers/gpu/drm/drm_drv.c:82:10: note: (near initialization for 'drm_wedge_recovery_opts')
   In file included from drivers/gpu/drm/drm_drv.c:30:
>> drivers/gpu/drm/drm_drv.c:84:51: error: expression in static assertion is not constant
      84 | static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == ffs(DRM_WEDGE_RECOVERY_BUS_RESET));
   include/linux/build_bug.h:78:56: note: in definition of macro '__static_assert'
      78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
         |                                                        ^~~~
   drivers/gpu/drm/drm_drv.c:84:1: note: in expansion of macro 'static_assert'
      84 | static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == ffs(DRM_WEDGE_RECOVERY_BUS_RESET));
         | ^~~~~~~~~~~~~

vim +81 drivers/gpu/drm/drm_drv.c

    75
    76  /*
    77   * Available recovery methods for wedged device. To be sent along with device
    78   * wedged uevent.
    79   */
    80  static const char *const drm_wedge_recovery_opts[] = {
  > 81          [ffs(DRM_WEDGE_RECOVERY_REBIND) - 1] = "rebind",
    82          [ffs(DRM_WEDGE_RECOVERY_BUS_RESET) - 1] = "bus-reset",
    83  };
  > 84  static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == ffs(DRM_WEDGE_RECOVERY_BUS_RESET));
    85
Hi Raag,

kernel test robot noticed the following build errors:

[auto build test ERROR on drm-xe/drm-xe-next]
[also build test ERROR on drm-intel/for-linux-next drm-intel/for-linux-next-fixes drm-tip/drm-tip linus/master v6.12-rc4 next-20241025]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Raag-Jadav/drm-Introduce-device-wedged-event/20241025-165119
base:   https://gitlab.freedesktop.org/drm/xe/kernel.git drm-xe-next
patch link:    https://lore.kernel.org/r/20241025084817.144621-2-raag.jadav%40intel.com
patch subject: [PATCH v8 1/4] drm: Introduce device wedged event
config: arm-randconfig-002-20241026 (https://download.01.org/0day-ci/archive/20241026/202410261754.enck8cc6-lkp@intel.com/config)
compiler: clang version 20.0.0git (https://github.com/llvm/llvm-project 5886454669c3c9026f7f27eab13509dd0241f2d6)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241026/202410261754.enck8cc6-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202410261754.enck8cc6-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from drivers/gpu/drm/drm_drv.c:36:
   In file included from include/linux/pseudo_fs.h:4:
   In file included from include/linux/fs_context.h:14:
   In file included from include/linux/security.h:33:
   In file included from include/linux/mm.h:2213:
   include/linux/vmstat.h:518:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
     518 |         return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
         |                               ~~~~~~~~~~~ ^ ~~~
>> drivers/gpu/drm/drm_drv.c:81:3: error: expression is not an integer constant expression
      81 |         [ffs(DRM_WEDGE_RECOVERY_REBIND) - 1] = "rebind",
         |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/asm-generic/bitops/ffs.h:43:16: note: expanded from macro 'ffs'
      43 | #define ffs(x) generic_ffs(x)
         |                ^
   drivers/gpu/drm/drm_drv.c:82:3: error: expression is not an integer constant expression
      82 |         [ffs(DRM_WEDGE_RECOVERY_BUS_RESET) - 1] = "bus-reset",
         |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/asm-generic/bitops/ffs.h:43:16: note: expanded from macro 'ffs'
      43 | #define ffs(x) generic_ffs(x)
         |                ^
>> drivers/gpu/drm/drm_drv.c:84:15: error: invalid application of 'sizeof' to an incomplete type 'const char *const[]'
      84 | static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == ffs(DRM_WEDGE_RECOVERY_BUS_RESET));
         | ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/array_size.h:11:32: note: expanded from macro 'ARRAY_SIZE'
      11 | #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]) + __must_be_array(arr))
         |                                ^
   include/linux/build_bug.h:77:50: note: expanded from macro 'static_assert'
      77 | #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
         |                                  ~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:78:56: note: expanded from macro '__static_assert'
      78 | #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
         |                                                        ^~~~
   drivers/gpu/drm/drm_drv.c:528:32: error: invalid application of 'sizeof' to an incomplete type 'const char *const[]'
     528 |         unsigned int len, opt, size = ARRAY_SIZE(drm_wedge_recovery_opts);
         |                                       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/array_size.h:11:32: note: expanded from macro 'ARRAY_SIZE'
      11 | #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]) + __must_be_array(arr))
         |                                ^~~~~
   1 warning and 4 errors generated.

Kconfig warnings: (for reference only)
   WARNING: unmet direct dependencies detected for MODVERSIONS
   Depends on [n]: MODULES [=y] && !COMPILE_TEST [=y]
   Selected by [y]:
   - RANDSTRUCT_FULL [=y] && (CC_HAS_RANDSTRUCT [=y] || GCC_PLUGINS [=n]) && MODULES [=y]

vim +81 drivers/gpu/drm/drm_drv.c

    75
    76  /*
    77   * Available recovery methods for wedged device. To be sent along with device
    78   * wedged uevent.
    79   */
    80  static const char *const drm_wedge_recovery_opts[] = {
  > 81          [ffs(DRM_WEDGE_RECOVERY_REBIND) - 1] = "rebind",
    82          [ffs(DRM_WEDGE_RECOVERY_BUS_RESET) - 1] = "bus-reset",
    83  };
  > 84  static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == ffs(DRM_WEDGE_RECOVERY_BUS_RESET));
    85
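Both reports trip over the same construct: an array designator and a static assertion need an integer constant expression, and in the clang report ffs() expands to a call to generic_ffs(). As a rough sketch of the enum option raised in review (not code from the patch), the bit positions could be named by enumerators and the masks derived from them. The *_BIT enumerators, WEDGE_RECOVERY_NUM_BITS and wedge_recovery_name() are hypothetical names, and BIT()/ARRAY_SIZE() are simplified userspace stand-ins for the kernel macros. It assumes a C11 compiler for static_assert.

```c
#include <assert.h>
#include <stddef.h>

#define BIT(n)		(1UL << (n))
#define ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))

/* Hypothetical enumerators: bit positions, least side-effects first. */
enum wedge_recovery_bit {
	WEDGE_RECOVERY_REBIND_BIT,
	WEDGE_RECOVERY_BUS_RESET_BIT,
	WEDGE_RECOVERY_NUM_BITS,	/* keep last */
};

/* The masks from the patch, here derived from the enumerators. */
#define DRM_WEDGE_RECOVERY_REBIND	BIT(WEDGE_RECOVERY_REBIND_BIT)
#define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(WEDGE_RECOVERY_BUS_RESET_BIT)

/* Enumerators are integer constant expressions, so these designators are valid. */
static const char *const wedge_recovery_opts[] = {
	[WEDGE_RECOVERY_REBIND_BIT]	= "rebind",
	[WEDGE_RECOVERY_BUS_RESET_BIT]	= "bus-reset",
};

/* Fails to build if an enumerator is appended without a matching string. */
static_assert(ARRAY_SIZE(wedge_recovery_opts) == WEDGE_RECOVERY_NUM_BITS,
	      "wedge_recovery_opts[] is out of sync with enum wedge_recovery_bit");

/* Bounds-checked lookup: NULL for any bit position without a table entry. */
static const char *wedge_recovery_name(unsigned int bit)
{
	if (bit >= ARRAY_SIZE(wedge_recovery_opts))
		return NULL;
	return wedge_recovery_opts[bit];
}
```

With this layout callers keep passing the DRM_WEDGE_RECOVERY_* masks, so the enum stays an implementation detail rather than something drivers wrap in BIT() themselves.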
On Fri, Oct 25, 2024 at 05:45:59PM +0300, Andy Shevchenko wrote:
> On Fri, Oct 25, 2024 at 12:08:50PM +0300, Jani Nikula wrote:
> > On Fri, 25 Oct 2024, Raag Jadav <raag.jadav@intel.com> wrote:
>
> ...
>
> > > +/*
> > > + * Available recovery methods for wedged device. To be sent along with device
> > > + * wedged uevent.
> > > + */
> > > +static const char *const drm_wedge_recovery_opts[] = {
> > > +	[ffs(DRM_WEDGE_RECOVERY_REBIND) - 1] = "rebind",
> > > +	[ffs(DRM_WEDGE_RECOVERY_BUS_RESET) - 1] = "bus-reset",
> > > +};
> > > +static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == ffs(DRM_WEDGE_RECOVERY_BUS_RESET));
> >
> > This might work in most cases, but you also might end up finding that
> > there's an arch and compiler combo out there that just won't be able to
> > figure out ffs() at compile time, and the array initialization fails.
>
> We have ilog2() macro for such cases, but it is rather fls() and not ffs(),
> and I have no idea why ffs() even being used here, especially in the index
> part of the array assignments. It's unreadable.

I initially had __builtin_ffs() in mind which is even more ugly.

> > If that happens, you'd have to either convert back to an enum (and call
> > the wedge event function with BIT(DRM_WEDGE_RECOVERY_REBIND) etc.), or

Which would confuse the users since that's not how enums are normally used.

> > make this an array of structs mapping the macro values to strings and
> > loop over it.

Why not just switch() it?

	for_each_set_bit(opt, &method, size) {
		switch (BIT(opt)) {
		case DRM_WEDGE_RECOVERY_REBIND:
			recovery = "rebind";
			break;
		case DRM_WEDGE_RECOVERY_BUS_RESET:
			recovery = "bus-reset";
			break;
		}

		...
	}

I know we'll have to update it with new additions, but it'd be much simpler,
at least compared to introducing and maintaining a new struct.

> > Also, the main point of the static assert was to ensure the array is
> > updated when a new recovery option is added, and there's no out of
> > bounds access. That no longer holds, and the static assert is pretty
> > much useless. You still have to manually find and update this.

With above in place this won't be needed.

Raag
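To make the switch() idea concrete, here is a minimal userspace sketch of how the whole string could be built that way. It is an assumption-laden illustration, not the next revision of the patch: wedge_recovery_name(), append() and wedge_event_string() are hypothetical helpers, BIT() is redefined locally, and a plain loop replaces for_each_set_bit(). Two details differ on purpose from the posted hunk: each write is limited to the space still left in the buffer, and the comma is emitted before every method except the first, since choosing the separator with `opt == size - 1` appears to leave a trailing comma when only "rebind" is set.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define BIT(n)				(1UL << (n))
#define DRM_WEDGE_RECOVERY_REBIND	BIT(0)
#define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(1)

/* Map a single method bit to its uevent name; NULL for anything unknown. */
static const char *wedge_recovery_name(unsigned long bit)
{
	switch (bit) {
	case DRM_WEDGE_RECOVERY_REBIND:
		return "rebind";
	case DRM_WEDGE_RECOVERY_BUS_RESET:
		return "bus-reset";
	default:
		return NULL;
	}
}

/* Append src to the string in buf; returns -1 if it does not fit in size bytes. */
static int append(char *buf, size_t size, const char *src)
{
	size_t used = strlen(buf), need = strlen(src);

	if (need >= size - used)
		return -1;
	memcpy(buf + used, src, need + 1);	/* copies the terminating NUL too */
	return 0;
}

/* Hypothetical helper: 0 on success, -1 if buf is too small. */
static int wedge_event_string(char *buf, size_t size, unsigned long method)
{
	const char *name;
	unsigned int opt;
	int found = 0;

	assert(buf != NULL);
	if (size == 0)
		return -1;
	buf[0] = '\0';
	if (append(buf, size, "WEDGED="))
		return -1;

	for (opt = 0; opt < sizeof(method) * CHAR_BIT; opt++) {
		if (!(method & BIT(opt)))
			continue;
		name = wedge_recovery_name(BIT(opt));
		if (!name)
			continue;	/* unknown bit: no table to index, just skip */
		/* Separator goes before every method except the first. */
		if ((found && append(buf, size, ",")) || append(buf, size, name))
			return -1;
		found = 1;
	}

	if (!found && append(buf, size, "none"))
		return -1;
	return 0;
}
```

Since there is no array to index, the bounds-checking concern from the review does not arise; the trade-off is that each new method needs a new case label.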
On Fri, 25 Oct 2024, Jani Nikula <jani.nikula@linux.intel.com> wrote:
> On Fri, 25 Oct 2024, Raag Jadav <raag.jadav@intel.com> wrote:
>> @@ -70,6 +73,16 @@ static struct dentry *drm_debugfs_root;
>>
>>  DEFINE_STATIC_SRCU(drm_unplug_srcu);
>>
>> +/*
>> + * Available recovery methods for wedged device. To be sent along with device
>> + * wedged uevent.
>> + */
>> +static const char *const drm_wedge_recovery_opts[] = {
>> +	[ffs(DRM_WEDGE_RECOVERY_REBIND) - 1] = "rebind",
>> +	[ffs(DRM_WEDGE_RECOVERY_BUS_RESET) - 1] = "bus-reset",
>> +};
>> +static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == ffs(DRM_WEDGE_RECOVERY_BUS_RESET));
>
> This might work in most cases, but you also might end up finding that
> there's an arch and compiler combo out there that just won't be able to
> figure out ffs() at compile time, and the array initialization fails.

And the kernel test robot hits exactly this.

BR,
Jani.
diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index ac30b0ec9d93..ded6327fc242 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -26,6 +26,8 @@
  * DEALINGS IN THE SOFTWARE.
  */
 
+#include <linux/array_size.h>
+#include <linux/build_bug.h>
 #include <linux/debugfs.h>
 #include <linux/fs.h>
 #include <linux/module.h>
@@ -33,6 +35,7 @@
 #include <linux/mount.h>
 #include <linux/pseudo_fs.h>
 #include <linux/slab.h>
+#include <linux/sprintf.h>
 #include <linux/srcu.h>
 #include <linux/xarray.h>
 
@@ -70,6 +73,16 @@ static struct dentry *drm_debugfs_root;
 
 DEFINE_STATIC_SRCU(drm_unplug_srcu);
 
+/*
+ * Available recovery methods for wedged device. To be sent along with device
+ * wedged uevent.
+ */
+static const char *const drm_wedge_recovery_opts[] = {
+	[ffs(DRM_WEDGE_RECOVERY_REBIND) - 1] = "rebind",
+	[ffs(DRM_WEDGE_RECOVERY_BUS_RESET) - 1] = "bus-reset",
+};
+static_assert(ARRAY_SIZE(drm_wedge_recovery_opts) == ffs(DRM_WEDGE_RECOVERY_BUS_RESET));
+
 /*
  * DRM Minors
  * A DRM device can provide several char-dev interfaces on the DRM-Major. Each
@@ -497,6 +510,44 @@ void drm_dev_unplug(struct drm_device *dev)
 }
 EXPORT_SYMBOL(drm_dev_unplug);
 
+/**
+ * drm_dev_wedged_event - generate a device wedged uevent
+ * @dev: DRM device
+ * @method: method(s) to be used for recovery
+ *
+ * This generates a device wedged uevent for the DRM device specified by @dev.
+ * Recovery @method from drm_wedge_recovery_opts[] is sent in the uevent
+ * environment as ``WEDGED=<method1>[,<method2>]`` in order of less to more
+ * side-effects. If caller is unsure about recovery or @method is unknown (0),
+ * ``WEDGED=none`` will be sent instead.
+ *
+ * Returns: 0 on success, negative error code otherwise.
+ */
+int drm_dev_wedged_event(struct drm_device *dev, unsigned long method)
+{
+	unsigned int len, opt, size = ARRAY_SIZE(drm_wedge_recovery_opts);
+	const char *recovery = NULL;
+	/* Event string length up to 24+ characters with available methods */
+	char event_string[32];
+	char *envp[] = { event_string, NULL };
+
+	len = scnprintf(event_string, sizeof(event_string), "%s", "WEDGED=");
+
+	for_each_set_bit(opt, &method, size) {
+		recovery = drm_wedge_recovery_opts[opt];
+		len += scnprintf(event_string + len, sizeof(event_string),
+				 opt == size - 1 ? "%s" : "%s,", recovery);
+	}
+
+	if (!recovery)
+		/* Caller is unsure about recovery, do the best we can at this point. */
+		scnprintf(event_string + len, sizeof(event_string), "%s", "none");
+
+	drm_info(dev, "device wedged, needs recovery\n");
+	return kobject_uevent_env(&dev->primary->kdev->kobj, KOBJ_CHANGE, envp);
+}
+EXPORT_SYMBOL(drm_dev_wedged_event);
+
 /*
  * DRM internal mount
  * We want to be able to allocate our own "struct address_space" to control
diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
index c91f87b5242d..edf8b200891d 100644
--- a/include/drm/drm_device.h
+++ b/include/drm/drm_device.h
@@ -21,6 +21,13 @@ struct inode;
 struct pci_dev;
 struct pci_controller;
 
+/*
+ * Recovery methods for wedged device in order of less to more side-effects.
+ * To be used with drm_dev_wedged_event() as recovery @method. Callers can
+ * use any one, multiple (or'd) or none depending on their needs.
+ */
+#define DRM_WEDGE_RECOVERY_REBIND	BIT(0)	/* unbind + rebind driver */
+#define DRM_WEDGE_RECOVERY_BUS_RESET	BIT(1)	/* unbind + reset bus device + rebind */
 
 /**
  * enum switch_power_state - power state of drm device
diff --git a/include/drm/drm_drv.h b/include/drm/drm_drv.h
index 02ea4e3248fd..cc7bcb94ad6a 100644
--- a/include/drm/drm_drv.h
+++ b/include/drm/drm_drv.h
@@ -461,6 +461,7 @@ void drm_put_dev(struct drm_device *dev);
 bool drm_dev_enter(struct drm_device *dev, int *idx);
 void drm_dev_exit(int idx);
 void drm_dev_unplug(struct drm_device *dev);
+int drm_dev_wedged_event(struct drm_device *dev, unsigned long method);
 
 /**
  * drm_dev_is_unplugged - is a DRM device unplugged
Introduce device wedged event, which will notify userspace of wedged
(hanged/unusable) state of the DRM device through a uevent. This is
useful especially in cases where the device is no longer operating as
expected even after a reset and has become unrecoverable from driver
context. Purpose of this implementation is to provide drivers a generic
way to recover with the help of userspace intervention without taking
any drastic measures in the driver.

A 'wedged' device is basically a dead device that needs attention.
The uevent is the notification that is sent to userspace along with a
hint about what could possibly be attempted to recover the device and
bring it back to usable state. Different drivers may have different
ideas of a 'wedged' device depending on their hardware implementation,
and hence the vendor agnostic nature of the event. It is up to the
drivers to decide when they see the need for recovery and how they
want to recover from the available methods.

Recovery
--------

Current implementation defines two recovery methods, out of which,
drivers can use any one, both or none. Method(s) of choice will be
sent in the uevent environment as ``WEDGED=<method1>[,<method2>]``
in order of less to more side-effects. If driver is unsure about
recovery or method is unknown (like soft/hard reboot, firmware
flashing, hardware replacement or any other procedure which can't
be attempted on the fly), ``WEDGED=none`` will be sent instead.

It is the responsibility of the driver to perform required cleanups
(like disabling system memory access or signalling dma_fences) and
prepare itself for the recovery before sending the event. Once the
event is sent, driver should block all IOCTLs with an error code.
This will signify the reason for wedging which can be reported to
the application if needed.

Userspace consumers can parse this event and attempt recovery as per
below expectations.

    =============== ==================================
    Recovery method Consumer expectations
    =============== ==================================
    rebind          unbind + rebind driver
    bus-reset       unbind + reset bus device + rebind
    none            admin/user policy
    =============== ==================================

Example for rebind
~~~~~~~~~~~~~~~~~~

Udev rule::

    SUBSYSTEM=="drm", ENV{WEDGED}=="rebind", DEVPATH=="*/drm/card[0-9]",
    RUN+="/path/to/rebind.sh $env{DEVPATH}"

Recovery script::

    #!/bin/sh

    DEVPATH=$(readlink -f /sys/$1/device)
    DEVICE=$(basename $DEVPATH)
    DRIVER=$(readlink -f $DEVPATH/driver)

    echo -n $DEVICE > $DRIVER/unbind
    sleep 1
    echo -n $DEVICE > $DRIVER/bind

Although scripts are simple enough for basic recovery, admin/users
can define customized policies around recovery action. For example if
the driver supports multiple recovery methods, consumers can opt for
the suitable one based on policy definition. Consumers can also take
additional steps like gathering telemetry information (devcoredump,
syslog), or have the device available for further debugging and data
collection before performing the recovery. This is useful especially
when the driver is unsure about recovery or method is unknown.

v4: s/drm_dev_wedged/drm_dev_wedged_event
    Use drm_info() (Jani)
    Kernel doc adjustment (Aravind)
v5: Send recovery method with uevent (Lina)
v6: Access wedge_recovery_opts[] using helper function (Jani)
    Use snprintf() (Jani)
v7: Convert recovery helpers into regular functions (Andy, Jani)
    Aesthetic adjustments (Andy)
    Handle invalid method cases
v8: Allow sending multiple methods with uevent (Lucas, Michal)
    static_assert() globally (Andy)

Signed-off-by: Raag Jadav <raag.jadav@intel.com>
---
 drivers/gpu/drm/drm_drv.c | 51 +++++++++++++++++++++++++++++++++++++++
 include/drm/drm_device.h  |  7 ++++++
 include/drm/drm_drv.h     |  1 +
 3 files changed, 59 insertions(+)
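On the consumer side, the commit message says methods arrive as a comma-separated list ordered from less to more side-effects. As a rough sketch of a policy built on that ordering, a consumer written in C might pick the first advertised method it knows how to perform. wedged_pick_method() and the caller-provided `supported` list are hypothetical and not part of the patch or of udev; `value` is assumed to be the text after "WEDGED=".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical consumer policy: walk the comma-separated WEDGED value in
 * the order it was sent and return the first entry that also appears in
 * supported[]. Returns NULL when nothing matches (for example "none"),
 * leaving the decision to admin/user policy.
 */
static const char *wedged_pick_method(const char *value,
				      const char *const *supported, size_t n)
{
	size_t tok, i;

	assert(value != NULL && supported != NULL);

	while (*value) {
		tok = strcspn(value, ",");	/* length of the current entry */
		for (i = 0; i < n; i++)
			if (strlen(supported[i]) == tok &&
			    strncmp(value, supported[i], tok) == 0)
				return supported[i];
		value += tok;
		if (*value == ',')
			value++;
	}
	return NULL;
}
```

A consumer that can only do a bus reset would pass a one-element list and fall back to its own policy when the function returns NULL.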