diff mbox series

[v8,2/4] drm/doc: Document device wedged event

Message ID 20241025084817.144621-3-raag.jadav@intel.com (mailing list archive)
State New
Headers show
Series Introduce DRM device wedged event | expand

Commit Message

Raag Jadav Oct. 25, 2024, 8:48 a.m. UTC
Add documentation for device wedged event in a new 'Device wedging'
chapter. The describes basic definitions and consumer expectations
along with an example.

v8: Improve documentation (Christian, Rodrigo)

Signed-off-by: Raag Jadav <raag.jadav@intel.com>
---
 Documentation/gpu/drm-uapi.rst | 75 ++++++++++++++++++++++++++++++++++
 1 file changed, 75 insertions(+)

Comments

Christian König Oct. 29, 2024, 9:51 a.m. UTC | #1
Am 25.10.24 um 10:48 schrieb Raag Jadav:
> Add documentation for device wedged event in a new 'Device wedging'
> chapter. The describes basic definitions and consumer expectations
> along with an example.
>
> v8: Improve documentation (Christian, Rodrigo)
>
> Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> ---
>   Documentation/gpu/drm-uapi.rst | 75 ++++++++++++++++++++++++++++++++++
>   1 file changed, 75 insertions(+)
>
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 370d820be248..11a7446233b5 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -362,6 +362,81 @@ the first place. DRM devices should make use of devcoredump to store relevant
>   information about the reset, so this information can be added to user bug
>   reports.
>   
> +Device wedging
> +==============
> +
> +Drivers can optionally make use of device wedged event (implemented as
> +drm_dev_wedged_event() in DRM subsystem) which notifies userspace of wedged
> +(hanged/unusable) state of the DRM device through a uevent. This is useful
> +especially in cases where the device is no longer operating as expected even
> +after a reset and has become unrecoverable from driver context. Purpose of
> +this implementation is to provide drivers a generic way to recover with the
> +help of userspace intervention without taking any drastic measures in the
> +driver.
> +
> +A 'wedged' device is basically a dead device that needs attention. The
> +uevent is the notification that is sent to userspace along with a hint about
> +what could possibly be attempted to recover the device and bring it back to
> +usable state. Different drivers may have different ideas of a 'wedged' device
> +depending on their hardware implementation, and hence the vendor agnostic
> +nature of the event. It is up to the drivers to decide when they see the need
> +for recovery and how they want to recover from the available methods.
> +
> +Recovery
> +--------
> +
> +Current implementation defines two recovery methods, out of which, drivers
> +can use any one, both or none. Method(s) of choice will be sent in the uevent
> +environment as ``WEDGED=<method1>[,<method2>]`` in order of less to more side
> +effects. If driver is unsure about recovery or method is unknown (like reboot,
> +firmware flashing, hardware replacement or any other procedure which can't be
> +attempted on the fly), ``WEDGED=none`` will be sent instead.
> +
> +It is the responsibility of the driver to perform required cleanups (like
> +disabling system memory access or signalling dma_fences) and prepare itself
> +for the recovery before sending the event.

That needs to be more like a warning and should have a bit more text. 
Maybe even a separate section.

Something like this maybe:

Prerequisites
------------------

Drivers needs to make sure that the device won't harm the system as a
whole by keeping it in a wedged state. Necessary actions must include
disabling DMA to system memory as well as communication channels
with other devices.
Further drivers must ensure that all dma_fences are signaled and other
state the core kernel might depend on are cleaned up.
All ongoing waiting for hardware state changes must be aborted and
new accesses to the hardware sufficiently blocked....


I'm not a native speaker of English, so feel free to re-phrase that. But 
the general approach should be like don't do this before you have made 
sure a, b and c.

>   Once the event is sent, driver
> +should block all IOCTLs with an error code.

Better define which error code that should be. I think -ENODEV similar 
to a hotplug case would be appropriate.

>   This will signify the reason for
> +wegeding which can be reported to the application if needed.
> +
> +Userspace consumers can parse this event and attempt recovery as per below
> +expectations.
> +
> +    =============== ==================================
> +    Recovery method Consumer expectations
> +    =============== ==================================
> +    rebind          unbind + rebind driver
> +    bus-reset       unbind + reset bus device + rebind
> +    none            admin/user policy
> +    =============== ==================================
> +
> +Example for rebind
> +~~~~~~~~~~~~~~~~~~
> +
> +Udev rule::
> +
> +    SUBSYSTEM=="drm", ENV{WEDGED}=="rebind", DEVPATH=="*/drm/card[0-9]",
> +    RUN+="/path/to/rebind.sh $env{DEVPATH}"
> +
> +Recovery script::
> +
> +    #!/bin/sh
> +
> +    DEVPATH=$(readlink -f /sys/$1/device)
> +    DEVICE=$(basename $DEVPATH)
> +    DRIVER=$(readlink -f $DEVPATH/driver)
> +
> +    echo -n $DEVICE > $DRIVER/unbind
> +    sleep 1
> +    echo -n $DEVICE > $DRIVER/bind
> +
> +Although scripts are simple enough for basic recovery, admin/users can define
> +customized policies around recovery action. For example if the driver supports
> +multiple recovery methods, consumers can opt for the suitable one based on
> +policy definition. Consumers can also take additional steps like gathering
> +telemetry information (devcoredump, syslog)

I'm really wondering if we shouldn't standardize successful resets with 
this event as well?

E.g. like there was an issue with device X, please collect the devcoredump.

And then as a second step have the WEDGED property to indicate:
a) reset successful, no actions needed.
b) detach and rebind from the bus
c) bus-reset
d) impossible to recover but system as a whole can continue to work.
e) it's on fire! Run sync and shut down power immediately.
....

Regards,
Christian.

> , or have the device available for
> +further debugging and data collection before performing the recovery. This is
> +useful especially when the driver is unsure about recovery or method is unknown.
> +
>   .. _drm_driver_ioctl:
>   
>   IOCTL Support on Device Nodes
Raag Jadav Nov. 1, 2024, 4:26 a.m. UTC | #2
On Tue, Oct 29, 2024 at 10:51:34AM +0100, Christian König wrote:
> Am 25.10.24 um 10:48 schrieb Raag Jadav:
> > Add documentation for device wedged event in a new 'Device wedging'
> > chapter. The describes basic definitions and consumer expectations
> > along with an example.
> > 
> > v8: Improve documentation (Christian, Rodrigo)
> > 
> > Signed-off-by: Raag Jadav <raag.jadav@intel.com>
> > ---
> >   Documentation/gpu/drm-uapi.rst | 75 ++++++++++++++++++++++++++++++++++
> >   1 file changed, 75 insertions(+)
> > 
> > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > index 370d820be248..11a7446233b5 100644
> > --- a/Documentation/gpu/drm-uapi.rst
> > +++ b/Documentation/gpu/drm-uapi.rst
> > @@ -362,6 +362,81 @@ the first place. DRM devices should make use of devcoredump to store relevant
> >   information about the reset, so this information can be added to user bug
> >   reports.
> > +Device wedging
> > +==============
> > +
> > +Drivers can optionally make use of device wedged event (implemented as
> > +drm_dev_wedged_event() in DRM subsystem) which notifies userspace of wedged
> > +(hanged/unusable) state of the DRM device through a uevent. This is useful
> > +especially in cases where the device is no longer operating as expected even
> > +after a reset and has become unrecoverable from driver context. Purpose of
> > +this implementation is to provide drivers a generic way to recover with the
> > +help of userspace intervention without taking any drastic measures in the
> > +driver.
> > +
> > +A 'wedged' device is basically a dead device that needs attention. The
> > +uevent is the notification that is sent to userspace along with a hint about
> > +what could possibly be attempted to recover the device and bring it back to
> > +usable state. Different drivers may have different ideas of a 'wedged' device
> > +depending on their hardware implementation, and hence the vendor agnostic
> > +nature of the event. It is up to the drivers to decide when they see the need
> > +for recovery and how they want to recover from the available methods.
> > +
> > +Recovery
> > +--------
> > +
> > +Current implementation defines two recovery methods, out of which, drivers
> > +can use any one, both or none. Method(s) of choice will be sent in the uevent
> > +environment as ``WEDGED=<method1>[,<method2>]`` in order of less to more side
> > +effects. If driver is unsure about recovery or method is unknown (like reboot,
> > +firmware flashing, hardware replacement or any other procedure which can't be
> > +attempted on the fly), ``WEDGED=none`` will be sent instead.
> > +
> > +It is the responsibility of the driver to perform required cleanups (like
> > +disabling system memory access or signalling dma_fences) and prepare itself
> > +for the recovery before sending the event.
> 
> That needs to be more like a warning and should have a bit more text. Maybe
> even a separate section.
> 
> Something like this maybe:
> 
> Prerequisites
> ------------------
> 
> Drivers needs to make sure that the device won't harm the system as a
> whole by keeping it in a wedged state. Necessary actions must include
> disabling DMA to system memory as well as communication channels
> with other devices.
> Further drivers must ensure that all dma_fences are signaled and other
> state the core kernel might depend on are cleaned up.
> All ongoing waiting for hardware state changes must be aborted and
> new accesses to the hardware sufficiently blocked....
> 
> 
> I'm not a native speaker of English, so feel free to re-phrase that. But the

Neither am I, but will try my best.

> general approach should be like don't do this before you have made sure a, b
> and c.

Sure, thanks.

> >   Once the event is sent, driver
> > +should block all IOCTLs with an error code.
> 
> Better define which error code that should be. I think -ENODEV similar to a
> hotplug case would be appropriate.

Why not leave it to the drivers to decide depending on the type of failure?

> >   This will signify the reason for
> > +wegeding which can be reported to the application if needed.
> > +
> > +Userspace consumers can parse this event and attempt recovery as per below
> > +expectations.
> > +
> > +    =============== ==================================
> > +    Recovery method Consumer expectations
> > +    =============== ==================================
> > +    rebind          unbind + rebind driver
> > +    bus-reset       unbind + reset bus device + rebind
> > +    none            admin/user policy
> > +    =============== ==================================
> > +
> > +Example for rebind
> > +~~~~~~~~~~~~~~~~~~
> > +
> > +Udev rule::
> > +
> > +    SUBSYSTEM=="drm", ENV{WEDGED}=="rebind", DEVPATH=="*/drm/card[0-9]",
> > +    RUN+="/path/to/rebind.sh $env{DEVPATH}"
> > +
> > +Recovery script::
> > +
> > +    #!/bin/sh
> > +
> > +    DEVPATH=$(readlink -f /sys/$1/device)
> > +    DEVICE=$(basename $DEVPATH)
> > +    DRIVER=$(readlink -f $DEVPATH/driver)
> > +
> > +    echo -n $DEVICE > $DRIVER/unbind
> > +    sleep 1
> > +    echo -n $DEVICE > $DRIVER/bind
> > +
> > +Although scripts are simple enough for basic recovery, admin/users can define
> > +customized policies around recovery action. For example if the driver supports
> > +multiple recovery methods, consumers can opt for the suitable one based on
> > +policy definition. Consumers can also take additional steps like gathering
> > +telemetry information (devcoredump, syslog)
> 
> I'm really wondering if we shouldn't standardize successful resets with this
> event as well?
> 
> E.g. like there was an issue with device X, please collect the devcoredump.

This seems to fit into WEDGED=none case, although with this we'd probably
need to redefine what 'wedged' means.

> And then as a second step have the WEDGED property to indicate:
> a) reset successful, no actions needed.
> b) detach and rebind from the bus
> c) bus-reset
> d) impossible to recover but system as a whole can continue to work.
> e) it's on fire! Run sync and shut down power immediately.

Raag
diff mbox series

Patch

diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
index 370d820be248..11a7446233b5 100644
--- a/Documentation/gpu/drm-uapi.rst
+++ b/Documentation/gpu/drm-uapi.rst
@@ -362,6 +362,81 @@  the first place. DRM devices should make use of devcoredump to store relevant
 information about the reset, so this information can be added to user bug
 reports.
 
+Device wedging
+==============
+
+Drivers can optionally make use of device wedged event (implemented as
+drm_dev_wedged_event() in DRM subsystem) which notifies userspace of wedged
+(hanged/unusable) state of the DRM device through a uevent. This is useful
+especially in cases where the device is no longer operating as expected even
+after a reset and has become unrecoverable from driver context. Purpose of
+this implementation is to provide drivers a generic way to recover with the
+help of userspace intervention without taking any drastic measures in the
+driver.
+
+A 'wedged' device is basically a dead device that needs attention. The
+uevent is the notification that is sent to userspace along with a hint about
+what could possibly be attempted to recover the device and bring it back to
+usable state. Different drivers may have different ideas of a 'wedged' device
+depending on their hardware implementation, and hence the vendor agnostic
+nature of the event. It is up to the drivers to decide when they see the need
+for recovery and how they want to recover from the available methods.
+
+Recovery
+--------
+
+Current implementation defines two recovery methods, out of which, drivers
+can use any one, both or none. Method(s) of choice will be sent in the uevent
+environment as ``WEDGED=<method1>[,<method2>]`` in order of less to more side
+effects. If driver is unsure about recovery or method is unknown (like reboot,
+firmware flashing, hardware replacement or any other procedure which can't be
+attempted on the fly), ``WEDGED=none`` will be sent instead.
+
+It is the responsibility of the driver to perform required cleanups (like
+disabling system memory access or signalling dma_fences) and prepare itself
+for the recovery before sending the event. Once the event is sent, driver
+should block all IOCTLs with an error code. This will signify the reason for
+wegeding which can be reported to the application if needed.
+
+Userspace consumers can parse this event and attempt recovery as per below
+expectations.
+
+    =============== ==================================
+    Recovery method Consumer expectations
+    =============== ==================================
+    rebind          unbind + rebind driver
+    bus-reset       unbind + reset bus device + rebind
+    none            admin/user policy
+    =============== ==================================
+
+Example for rebind
+~~~~~~~~~~~~~~~~~~
+
+Udev rule::
+
+    SUBSYSTEM=="drm", ENV{WEDGED}=="rebind", DEVPATH=="*/drm/card[0-9]",
+    RUN+="/path/to/rebind.sh $env{DEVPATH}"
+
+Recovery script::
+
+    #!/bin/sh
+
+    DEVPATH=$(readlink -f /sys/$1/device)
+    DEVICE=$(basename $DEVPATH)
+    DRIVER=$(readlink -f $DEVPATH/driver)
+
+    echo -n $DEVICE > $DRIVER/unbind
+    sleep 1
+    echo -n $DEVICE > $DRIVER/bind
+
+Although scripts are simple enough for basic recovery, admin/users can define
+customized policies around recovery action. For example if the driver supports
+multiple recovery methods, consumers can opt for the suitable one based on
+policy definition. Consumers can also take additional steps like gathering
+telemetry information (devcoredump, syslog), or have the device available for
+further debugging and data collection before performing the recovery. This is
+useful especially when the driver is unsure about recovery or method is unknown.
+
 .. _drm_driver_ioctl:
 
 IOCTL Support on Device Nodes