Message ID | 20230627132323.115440-1-andrealmeid@igalia.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v5,1/1] drm/doc: Document DRM device reset expectations | expand |
Hi André, I have just a few more below: On 6/27/23 06:23, André Almeida wrote: > Create a section that specifies how to deal with DRM device resets for > kernel and userspace drivers. > > Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com> > Signed-off-by: André Almeida <andrealmeid@igalia.com> > --- > > v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/ > > Changes: > - Grammar fixes (Randy) > > Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++ > 1 file changed, 68 insertions(+) > > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst > index 65fb3036a580..3cbffa25ed93 100644 > --- a/Documentation/gpu/drm-uapi.rst > +++ b/Documentation/gpu/drm-uapi.rst > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for > mmapped regular files. Threads cause additional pain with signal > handling as well. > > +Device reset > +============ > + > +The GPU stack is really complex and is prone to errors, from hardware bugs, > +faulty applications and everything in between the many layers. Some errors > +require resetting the device in order to make the device usable again. This > +sections describes the expectations for DRM and usermode drivers when a section > +device resets and how to propagate the reset status. > + > +Kernel Mode Driver > +------------------ > + > +The KMD is responsible for checking if the device needs a reset, and to perform > +it as needed. Usually a hang is detected when a job gets stuck executing. KMD > +should keep track of resets, because userspace can query any time about the > +reset stats for an specific context. This is needed to propagate to the rest of for a specific stats or status? > +the stack that a reset has happened. Currently, this is implemented by each > +driver separately, with no common DRM interface. > + > +User Mode Driver > +---------------- > + > +The UMD should check before submitting new commands to the KMD if the device has > +been reset, and this can be checked more often if the UMD requires it. After > +detecting a reset, UMD will then proceed to report it to the application using > +the appropriate API error code, as explained in the section below about > +robustness. > + > +Robustness > +---------- > + > +The only way to try to keep an application working after a reset is if it > +complies with the robustness aspects of the graphical API that it is using. > + > +Graphical APIs provide ways to applications to deal with device resets. However, > +there is no guarantee that the app will use such features correctly, and the > +UMD can implement policies to close the app if it is a repeating offender, > +likely in a broken loop. This is done to ensure that it does not keep blocking > +the user interface from being correctly displayed. This should be done even if > +the app is correct but happens to trigger some bug in the hardware/driver. > + > +OpenGL > +~~~~~~ > + > +Apps using OpenGL should use the available robust interfaces, like the > +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This > +interface tells if a reset has happened, and if so, all the context state is > +considered lost and the app proceeds by creating new ones. If it is possible to > +determine that robustness is not in use, the UMD will terminate the app when a > +reset is detected, giving that the contexts are lost and the app won't be able > +to figure this out and recreate the contexts. > + > +Vulkan > +~~~~~~ > + > +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions. > +This error code means, among other things, that a device reset has happened and > +it needs to recreate the contexts to keep going. > + > +Reporting causes of resets > +-------------------------- > + > +Apart from propagating the reset through the stack so apps can recover, it's > +really useful for driver developers to learn more about what caused the reset in > +first place. DRM devices should make use of devcoredump to store relevant the first place. > +information about the reset, so this information can be added to user bug > +reports. > + > .. _drm_driver_ioctl: > > IOCTL Support on Device Nodes and with those addressed: Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Thanks for adding the documentation.
Am 27.06.23 um 15:23 schrieb André Almeida: > Create a section that specifies how to deal with DRM device resets for > kernel and userspace drivers. > > Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com> > Signed-off-by: André Almeida <andrealmeid@igalia.com> > --- > > v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/ > > Changes: > - Grammar fixes (Randy) > > Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++ > 1 file changed, 68 insertions(+) > > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst > index 65fb3036a580..3cbffa25ed93 100644 > --- a/Documentation/gpu/drm-uapi.rst > +++ b/Documentation/gpu/drm-uapi.rst > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for > mmapped regular files. Threads cause additional pain with signal > handling as well. > > +Device reset > +============ > + > +The GPU stack is really complex and is prone to errors, from hardware bugs, > +faulty applications and everything in between the many layers. Some errors > +require resetting the device in order to make the device usable again. This > +sections describes the expectations for DRM and usermode drivers when a > +device resets and how to propagate the reset status. > + > +Kernel Mode Driver > +------------------ > + > +The KMD is responsible for checking if the device needs a reset, and to perform > +it as needed. Usually a hang is detected when a job gets stuck executing. KMD > +should keep track of resets, because userspace can query any time about the > +reset stats for an specific context. Maybe drop the part "for a specific context". Essentially the reset query could use global counters instead and we won't need the context any more here. Apart from that this sounds good to me, feel free to add my rb. Regards, Christian. > This is needed to propagate to the rest of > +the stack that a reset has happened. Currently, this is implemented by each > +driver separately, with no common DRM interface. > + > +User Mode Driver > +---------------- > + > +The UMD should check before submitting new commands to the KMD if the device has > +been reset, and this can be checked more often if the UMD requires it. After > +detecting a reset, UMD will then proceed to report it to the application using > +the appropriate API error code, as explained in the section below about > +robustness. > + > +Robustness > +---------- > + > +The only way to try to keep an application working after a reset is if it > +complies with the robustness aspects of the graphical API that it is using. > + > +Graphical APIs provide ways to applications to deal with device resets. However, > +there is no guarantee that the app will use such features correctly, and the > +UMD can implement policies to close the app if it is a repeating offender, > +likely in a broken loop. This is done to ensure that it does not keep blocking > +the user interface from being correctly displayed. This should be done even if > +the app is correct but happens to trigger some bug in the hardware/driver. > + > +OpenGL > +~~~~~~ > + > +Apps using OpenGL should use the available robust interfaces, like the > +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This > +interface tells if a reset has happened, and if so, all the context state is > +considered lost and the app proceeds by creating new ones. If it is possible to > +determine that robustness is not in use, the UMD will terminate the app when a > +reset is detected, giving that the contexts are lost and the app won't be able > +to figure this out and recreate the contexts. > + > +Vulkan > +~~~~~~ > + > +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions. > +This error code means, among other things, that a device reset has happened and > +it needs to recreate the contexts to keep going. > + > +Reporting causes of resets > +-------------------------- > + > +Apart from propagating the reset through the stack so apps can recover, it's > +really useful for driver developers to learn more about what caused the reset in > +first place. DRM devices should make use of devcoredump to store relevant > +information about the reset, so this information can be added to user bug > +reports. > + > .. _drm_driver_ioctl: > > IOCTL Support on Device Nodes
On Tue, Jun 27, 2023, 09:23 André Almeida <andrealmeid@igalia.com> wrote: > Create a section that specifies how to deal with DRM device resets for > kernel and userspace drivers. > > Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com> > Signed-off-by: André Almeida <andrealmeid@igalia.com> > --- > > v4: > https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/ > > Changes: > - Grammar fixes (Randy) > > Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++ > 1 file changed, 68 insertions(+) > > diff --git a/Documentation/gpu/drm-uapi.rst > b/Documentation/gpu/drm-uapi.rst > index 65fb3036a580..3cbffa25ed93 100644 > --- a/Documentation/gpu/drm-uapi.rst > +++ b/Documentation/gpu/drm-uapi.rst > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third > handler for > mmapped regular files. Threads cause additional pain with signal > handling as well. > > +Device reset > +============ > + > +The GPU stack is really complex and is prone to errors, from hardware > bugs, > +faulty applications and everything in between the many layers. Some errors > +require resetting the device in order to make the device usable again. > This > +sections describes the expectations for DRM and usermode drivers when a > +device resets and how to propagate the reset status. > + > +Kernel Mode Driver > +------------------ > + > +The KMD is responsible for checking if the device needs a reset, and to > perform > +it as needed. Usually a hang is detected when a job gets stuck executing. > KMD > +should keep track of resets, because userspace can query any time about > the > +reset stats for an specific context. This is needed to propagate to the > rest of > +the stack that a reset has happened. Currently, this is implemented by > each > +driver separately, with no common DRM interface. > + > +User Mode Driver > +---------------- > + > +The UMD should check before submitting new commands to the KMD if the > device has > +been reset, and this can be checked more often if the UMD requires it. > After > +detecting a reset, UMD will then proceed to report it to the application > using > +the appropriate API error code, as explained in the section below about > +robustness. > The UMD won't check the device status before every command submission due to ioctl overhead. Instead, the KMD should skip command submission and return an error that it was skipped. The only case where that won't be applicable is user queues where drivers don't call into the kernel to submit work, but they do call into the kernel to create a dma_fence. In that case, the call to create a dma_fence can fail with an error. Marek + > +Robustness > +---------- > + > +The only way to try to keep an application working after a reset is if it > +complies with the robustness aspects of the graphical API that it is > using. > + > +Graphical APIs provide ways to applications to deal with device resets. > However, > +there is no guarantee that the app will use such features correctly, and > the > +UMD can implement policies to close the app if it is a repeating offender, > +likely in a broken loop. This is done to ensure that it does not keep > blocking > +the user interface from being correctly displayed. This should be done > even if > +the app is correct but happens to trigger some bug in the hardware/driver. > + > +OpenGL > +~~~~~~ > + > +Apps using OpenGL should use the available robust interfaces, like the > +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). > This > +interface tells if a reset has happened, and if so, all the context state > is > +considered lost and the app proceeds by creating new ones. If it is > possible to > +determine that robustness is not in use, the UMD will terminate the app > when a > +reset is detected, giving that the contexts are lost and the app won't be > able > +to figure this out and recreate the contexts. > + > +Vulkan > +~~~~~~ > + > +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for > submissions. > +This error code means, among other things, that a device reset has > happened and > +it needs to recreate the contexts to keep going. > + > +Reporting causes of resets > +-------------------------- > + > +Apart from propagating the reset through the stack so apps can recover, > it's > +really useful for driver developers to learn more about what caused the > reset in > +first place. DRM devices should make use of devcoredump to store relevant > +information about the reset, so this information can be added to user bug > +reports. > + > .. _drm_driver_ioctl: > > IOCTL Support on Device Nodes > -- > 2.41.0 > >
Em 27/06/2023 14:47, Christian König escreveu: > Am 27.06.23 um 15:23 schrieb André Almeida: >> Create a section that specifies how to deal with DRM device resets for >> kernel and userspace drivers. >> >> Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com> >> Signed-off-by: André Almeida <andrealmeid@igalia.com> >> --- >> >> v4: >> https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/ >> >> Changes: >> - Grammar fixes (Randy) >> >> Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++ >> 1 file changed, 68 insertions(+) >> >> diff --git a/Documentation/gpu/drm-uapi.rst >> b/Documentation/gpu/drm-uapi.rst >> index 65fb3036a580..3cbffa25ed93 100644 >> --- a/Documentation/gpu/drm-uapi.rst >> +++ b/Documentation/gpu/drm-uapi.rst >> @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a >> third handler for >> mmapped regular files. Threads cause additional pain with signal >> handling as well. >> +Device reset >> +============ >> + >> +The GPU stack is really complex and is prone to errors, from hardware >> bugs, >> +faulty applications and everything in between the many layers. Some >> errors >> +require resetting the device in order to make the device usable >> again. This >> +sections describes the expectations for DRM and usermode drivers when a >> +device resets and how to propagate the reset status. >> + >> +Kernel Mode Driver >> +------------------ >> + >> +The KMD is responsible for checking if the device needs a reset, and >> to perform >> +it as needed. Usually a hang is detected when a job gets stuck >> executing. KMD >> +should keep track of resets, because userspace can query any time >> about the >> +reset stats for an specific context. > > Maybe drop the part "for a specific context". Essentially the reset > query could use global counters instead and we won't need the context > any more here. > Right, I wrote like this to reflect how it's currently implemented. If follow correctly what you meant, KMD could always notify the global count for UMD, and we would move to the UMD the responsibility to manage the reset counters, right? This would also simplify my DRM_IOCTL_GET_RESET proposal. I'll apply your suggestion to the next doc version. > Apart from that this sounds good to me, feel free to add my rb. > > Regards, > Christian. > >
Hi Marek, Em 27/06/2023 15:57, Marek Olšák escreveu: > On Tue, Jun 27, 2023, 09:23 André Almeida <andrealmeid@igalia.com > <mailto:andrealmeid@igalia.com>> wrote: > > +User Mode Driver > +---------------- > + > +The UMD should check before submitting new commands to the KMD if > the device has > +been reset, and this can be checked more often if the UMD requires > it. After > +detecting a reset, UMD will then proceed to report it to the > application using > +the appropriate API error code, as explained in the section below about > +robustness. > > > The UMD won't check the device status before every command submission > due to ioctl overhead. Instead, the KMD should skip command submission > and return an error that it was skipped. I wrote like this because when reading the source code for vk::check_status()[0] and Gallium's si_flush_gfx_cs()[1], I was under the impression that UMD checks the reset status before every submission/flush. Is your comment about of how things are currently implemented, or how they would ideally work? Either way I can apply your suggestion, I just want to make it clear. [0] https://elixir.bootlin.com/mesa/mesa-23.1.3/source/src/vulkan/runtime/vk_device.h#L142 [1] https://elixir.bootlin.com/mesa/mesa-23.1.3/source/src/gallium/drivers/radeonsi/si_gfx_cs.c#L83 > > The only case where that won't be applicable is user queues where > drivers don't call into the kernel to submit work, but they do call into > the kernel to create a dma_fence. In that case, the call to create a > dma_fence can fail with an error. > > Marek
On Tue, Jun 27, 2023 at 5:31 PM André Almeida <andrealmeid@igalia.com> wrote: > Hi Marek, > > Em 27/06/2023 15:57, Marek Olšák escreveu: > > On Tue, Jun 27, 2023, 09:23 André Almeida <andrealmeid@igalia.com > > <mailto:andrealmeid@igalia.com>> wrote: > > > > +User Mode Driver > > +---------------- > > + > > +The UMD should check before submitting new commands to the KMD if > > the device has > > +been reset, and this can be checked more often if the UMD requires > > it. After > > +detecting a reset, UMD will then proceed to report it to the > > application using > > +the appropriate API error code, as explained in the section below > about > > +robustness. > > > > > > The UMD won't check the device status before every command submission > > due to ioctl overhead. Instead, the KMD should skip command submission > > and return an error that it was skipped. > > I wrote like this because when reading the source code for > vk::check_status()[0] and Gallium's si_flush_gfx_cs()[1], I was under > the impression that UMD checks the reset status before every > submission/flush. > It only does that before every command submission when the context is robust. When it's not robust, radeonsi doesn't do anything. > > Is your comment about of how things are currently implemented, or how > they would ideally work? Either way I can apply your suggestion, I just > want to make it clear. > Yes. Ideally, we would get the reply whether the context is lost from the CS ioctl. This is not currently implemented. Marek > > [0] > > https://elixir.bootlin.com/mesa/mesa-23.1.3/source/src/vulkan/runtime/vk_device.h#L142 > [1] > > https://elixir.bootlin.com/mesa/mesa-23.1.3/source/src/gallium/drivers/radeonsi/si_gfx_cs.c#L83 > > > > > The only case where that won't be applicable is user queues where > > drivers don't call into the kernel to submit work, but they do call into > > the kernel to create a dma_fence. In that case, the call to create a > > dma_fence can fail with an error. > > > > Marek > >
Em 27/06/2023 18:17, André Almeida escreveu: > Em 27/06/2023 14:47, Christian König escreveu: >> Am 27.06.23 um 15:23 schrieb André Almeida: >>> Create a section that specifies how to deal with DRM device resets for >>> kernel and userspace drivers. >>> >>> Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com> >>> Signed-off-by: André Almeida <andrealmeid@igalia.com> >>> --- >>> >>> v4: >>> https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/ >>> >>> Changes: >>> - Grammar fixes (Randy) >>> >>> Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++ >>> 1 file changed, 68 insertions(+) >>> >>> diff --git a/Documentation/gpu/drm-uapi.rst >>> b/Documentation/gpu/drm-uapi.rst >>> index 65fb3036a580..3cbffa25ed93 100644 >>> --- a/Documentation/gpu/drm-uapi.rst >>> +++ b/Documentation/gpu/drm-uapi.rst >>> @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a >>> third handler for >>> mmapped regular files. Threads cause additional pain with signal >>> handling as well. >>> +Device reset >>> +============ >>> + >>> +The GPU stack is really complex and is prone to errors, from >>> hardware bugs, >>> +faulty applications and everything in between the many layers. Some >>> errors >>> +require resetting the device in order to make the device usable >>> again. This >>> +sections describes the expectations for DRM and usermode drivers when a >>> +device resets and how to propagate the reset status. >>> + >>> +Kernel Mode Driver >>> +------------------ >>> + >>> +The KMD is responsible for checking if the device needs a reset, and >>> to perform >>> +it as needed. Usually a hang is detected when a job gets stuck >>> executing. KMD >>> +should keep track of resets, because userspace can query any time >>> about the >>> +reset stats for an specific context. >> >> Maybe drop the part "for a specific context". Essentially the reset >> query could use global counters instead and we won't need the context >> any more here. >> > > Right, I wrote like this to reflect how it's currently implemented. > > If follow correctly what you meant, KMD could always notify the global > count for UMD, and we would move to the UMD the responsibility to manage > the reset counters, right? This would also simplify my > DRM_IOCTL_GET_RESET proposal. I'll apply your suggestion to the next doc > version. > Actually, if we drop the context identifier we would lose the ability to track which is the guilty context. Vulkan API doesn't seem to care about this, but OpenGL does. >> Apart from that this sounds good to me, feel free to add my rb. >> >> Regards, >> Christian. >> >>
On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com> wrote: > > Create a section that specifies how to deal with DRM device resets for > kernel and userspace drivers. > > Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com> > Signed-off-by: André Almeida <andrealmeid@igalia.com> > --- > > v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/ > > Changes: > - Grammar fixes (Randy) > > Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++ > 1 file changed, 68 insertions(+) > > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst > index 65fb3036a580..3cbffa25ed93 100644 > --- a/Documentation/gpu/drm-uapi.rst > +++ b/Documentation/gpu/drm-uapi.rst > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for > mmapped regular files. Threads cause additional pain with signal > handling as well. > > +Device reset > +============ > + > +The GPU stack is really complex and is prone to errors, from hardware bugs, > +faulty applications and everything in between the many layers. Some errors > +require resetting the device in order to make the device usable again. This > +sections describes the expectations for DRM and usermode drivers when a > +device resets and how to propagate the reset status. > + > +Kernel Mode Driver > +------------------ > + > +The KMD is responsible for checking if the device needs a reset, and to perform > +it as needed. Usually a hang is detected when a job gets stuck executing. KMD > +should keep track of resets, because userspace can query any time about the > +reset stats for an specific context. This is needed to propagate to the rest of > +the stack that a reset has happened. Currently, this is implemented by each > +driver separately, with no common DRM interface. > + > +User Mode Driver > +---------------- > + > +The UMD should check before submitting new commands to the KMD if the device has > +been reset, and this can be checked more often if the UMD requires it. After > +detecting a reset, UMD will then proceed to report it to the application using > +the appropriate API error code, as explained in the section below about > +robustness. > + > +Robustness > +---------- > + > +The only way to try to keep an application working after a reset is if it > +complies with the robustness aspects of the graphical API that it is using. > + > +Graphical APIs provide ways to applications to deal with device resets. However, > +there is no guarantee that the app will use such features correctly, and the > +UMD can implement policies to close the app if it is a repeating offender, > +likely in a broken loop. This is done to ensure that it does not keep blocking > +the user interface from being correctly displayed. This should be done even if > +the app is correct but happens to trigger some bug in the hardware/driver. I still don't think it's good to let the kernel arbitrarily kill processes that it thinks are not well-behaved based on some heuristics and policy. Can't this be outsourced to user space? Expose the information about processes causing a device and let e.g. systemd deal with coming up with a policy and with killing stuff. > + > +OpenGL > +~~~~~~ > + > +Apps using OpenGL should use the available robust interfaces, like the > +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This > +interface tells if a reset has happened, and if so, all the context state is > +considered lost and the app proceeds by creating new ones. If it is possible to > +determine that robustness is not in use, the UMD will terminate the app when a > +reset is detected, giving that the contexts are lost and the app won't be able > +to figure this out and recreate the contexts. > + > +Vulkan > +~~~~~~ > + > +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions. > +This error code means, among other things, that a device reset has happened and > +it needs to recreate the contexts to keep going. > + > +Reporting causes of resets > +-------------------------- > + > +Apart from propagating the reset through the stack so apps can recover, it's > +really useful for driver developers to learn more about what caused the reset in > +first place. DRM devices should make use of devcoredump to store relevant > +information about the reset, so this information can be added to user bug > +reports. > + > .. _drm_driver_ioctl: > > IOCTL Support on Device Nodes > -- > 2.41.0 >
On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick <sebastian.wick@redhat.com> wrote: > > On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com> wrote: > > > > Create a section that specifies how to deal with DRM device resets for > > kernel and userspace drivers. > > > > Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com> > > Signed-off-by: André Almeida <andrealmeid@igalia.com> > > --- > > > > v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/ > > > > Changes: > > - Grammar fixes (Randy) > > > > Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++ > > 1 file changed, 68 insertions(+) > > > > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst > > index 65fb3036a580..3cbffa25ed93 100644 > > --- a/Documentation/gpu/drm-uapi.rst > > +++ b/Documentation/gpu/drm-uapi.rst > > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for > > mmapped regular files. Threads cause additional pain with signal > > handling as well. > > > > +Device reset > > +============ > > + > > +The GPU stack is really complex and is prone to errors, from hardware bugs, > > +faulty applications and everything in between the many layers. Some errors > > +require resetting the device in order to make the device usable again. This > > +sections describes the expectations for DRM and usermode drivers when a > > +device resets and how to propagate the reset status. > > + > > +Kernel Mode Driver > > +------------------ > > + > > +The KMD is responsible for checking if the device needs a reset, and to perform > > +it as needed. Usually a hang is detected when a job gets stuck executing. KMD > > +should keep track of resets, because userspace can query any time about the > > +reset stats for an specific context. This is needed to propagate to the rest of > > +the stack that a reset has happened. Currently, this is implemented by each > > +driver separately, with no common DRM interface. > > + > > +User Mode Driver > > +---------------- > > + > > +The UMD should check before submitting new commands to the KMD if the device has > > +been reset, and this can be checked more often if the UMD requires it. After > > +detecting a reset, UMD will then proceed to report it to the application using > > +the appropriate API error code, as explained in the section below about > > +robustness. > > + > > +Robustness > > +---------- > > + > > +The only way to try to keep an application working after a reset is if it > > +complies with the robustness aspects of the graphical API that it is using. > > + > > +Graphical APIs provide ways to applications to deal with device resets. However, > > +there is no guarantee that the app will use such features correctly, and the > > +UMD can implement policies to close the app if it is a repeating offender, > > +likely in a broken loop. This is done to ensure that it does not keep blocking > > +the user interface from being correctly displayed. This should be done even if > > +the app is correct but happens to trigger some bug in the hardware/driver. > > I still don't think it's good to let the kernel arbitrarily kill > processes that it thinks are not well-behaved based on some heuristics > and policy. > > Can't this be outsourced to user space? Expose the information about > processes causing a device and let e.g. systemd deal with coming up > with a policy and with killing stuff. I don't think it's the kernel doing the killing, it would be the UMD. E.g., if the app is guilty and doesn't support robustness the UMD can just call exit(). Alex > > > + > > +OpenGL > > +~~~~~~ > > + > > +Apps using OpenGL should use the available robust interfaces, like the > > +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This > > +interface tells if a reset has happened, and if so, all the context state is > > +considered lost and the app proceeds by creating new ones. If it is possible to > > +determine that robustness is not in use, the UMD will terminate the app when a > > +reset is detected, giving that the contexts are lost and the app won't be able > > +to figure this out and recreate the contexts. > > + > > +Vulkan > > +~~~~~~ > > + > > +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions. > > +This error code means, among other things, that a device reset has happened and > > +it needs to recreate the contexts to keep going. > > + > > +Reporting causes of resets > > +-------------------------- > > + > > +Apart from propagating the reset through the stack so apps can recover, it's > > +really useful for driver developers to learn more about what caused the reset in > > +first place. DRM devices should make use of devcoredump to store relevant > > +information about the reset, so this information can be added to user bug > > +reports. > > + > > .. _drm_driver_ioctl: > > > > IOCTL Support on Device Nodes > > -- > > 2.41.0 > > >
On 6/30/23 16:59, Alex Deucher wrote: > On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick > <sebastian.wick@redhat.com> wrote: >> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com> wrote: >>> >>> +Robustness >>> +---------- >>> + >>> +The only way to try to keep an application working after a reset is if it >>> +complies with the robustness aspects of the graphical API that it is using. >>> + >>> +Graphical APIs provide ways to applications to deal with device resets. However, >>> +there is no guarantee that the app will use such features correctly, and the >>> +UMD can implement policies to close the app if it is a repeating offender, >>> +likely in a broken loop. This is done to ensure that it does not keep blocking >>> +the user interface from being correctly displayed. This should be done even if >>> +the app is correct but happens to trigger some bug in the hardware/driver. >> >> I still don't think it's good to let the kernel arbitrarily kill >> processes that it thinks are not well-behaved based on some heuristics >> and policy. >> >> Can't this be outsourced to user space? Expose the information about >> processes causing a device and let e.g. systemd deal with coming up >> with a policy and with killing stuff. > > I don't think it's the kernel doing the killing, it would be the UMD. > E.g., if the app is guilty and doesn't support robustness the UMD can > just call exit(). It would be safer to just ignore API calls[0], similarly to what is done until the application destroys the context with robustness. Calling exit() likely results in losing any unsaved work, whereas at least some applications might otherwise allow saving the work by other means. [0] Possibly accompanied by a one-time message to stderr along the lines of "GPU reset detected but robustness not enabled in context, ignoring OpenGL API calls".
On Fri, Jun 30, 2023 at 4:59 PM Alex Deucher <alexdeucher@gmail.com> wrote: > > On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick > <sebastian.wick@redhat.com> wrote: > > > > On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com> wrote: > > > > > > Create a section that specifies how to deal with DRM device resets for > > > kernel and userspace drivers. > > > > > > Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com> > > > Signed-off-by: André Almeida <andrealmeid@igalia.com> > > > --- > > > > > > v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/ > > > > > > Changes: > > > - Grammar fixes (Randy) > > > > > > Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++ > > > 1 file changed, 68 insertions(+) > > > > > > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst > > > index 65fb3036a580..3cbffa25ed93 100644 > > > --- a/Documentation/gpu/drm-uapi.rst > > > +++ b/Documentation/gpu/drm-uapi.rst > > > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for > > > mmapped regular files. Threads cause additional pain with signal > > > handling as well. > > > > > > +Device reset > > > +============ > > > + > > > +The GPU stack is really complex and is prone to errors, from hardware bugs, > > > +faulty applications and everything in between the many layers. Some errors > > > +require resetting the device in order to make the device usable again. This > > > +sections describes the expectations for DRM and usermode drivers when a > > > +device resets and how to propagate the reset status. > > > + > > > +Kernel Mode Driver > > > +------------------ > > > + > > > +The KMD is responsible for checking if the device needs a reset, and to perform > > > +it as needed. Usually a hang is detected when a job gets stuck executing. KMD > > > +should keep track of resets, because userspace can query any time about the > > > +reset stats for an specific context. This is needed to propagate to the rest of > > > +the stack that a reset has happened. Currently, this is implemented by each > > > +driver separately, with no common DRM interface. > > > + > > > +User Mode Driver > > > +---------------- > > > + > > > +The UMD should check before submitting new commands to the KMD if the device has > > > +been reset, and this can be checked more often if the UMD requires it. After > > > +detecting a reset, UMD will then proceed to report it to the application using > > > +the appropriate API error code, as explained in the section below about > > > +robustness. > > > + > > > +Robustness > > > +---------- > > > + > > > +The only way to try to keep an application working after a reset is if it > > > +complies with the robustness aspects of the graphical API that it is using. > > > + > > > +Graphical APIs provide ways to applications to deal with device resets. However, > > > +there is no guarantee that the app will use such features correctly, and the > > > +UMD can implement policies to close the app if it is a repeating offender, > > > +likely in a broken loop. This is done to ensure that it does not keep blocking > > > +the user interface from being correctly displayed. This should be done even if > > > +the app is correct but happens to trigger some bug in the hardware/driver. > > > > I still don't think it's good to let the kernel arbitrarily kill > > processes that it thinks are not well-behaved based on some heuristics > > and policy. > > > > Can't this be outsourced to user space? Expose the information about > > processes causing a device and let e.g. systemd deal with coming up > > with a policy and with killing stuff. > > I don't think it's the kernel doing the killing, it would be the UMD. > E.g., if the app is guilty and doesn't support robustness the UMD can > just call exit(). Ah, right, completely skipped over the UMD part. That makes more sense. > > Alex > > > > > > + > > > +OpenGL > > > +~~~~~~ > > > + > > > +Apps using OpenGL should use the available robust interfaces, like the > > > +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This > > > +interface tells if a reset has happened, and if so, all the context state is > > > +considered lost and the app proceeds by creating new ones. If it is possible to > > > +determine that robustness is not in use, the UMD will terminate the app when a > > > +reset is detected, giving that the contexts are lost and the app won't be able > > > +to figure this out and recreate the contexts. > > > + > > > +Vulkan > > > +~~~~~~ > > > + > > > +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions. > > > +This error code means, among other things, that a device reset has happened and > > > +it needs to recreate the contexts to keep going. > > > + > > > +Reporting causes of resets > > > +-------------------------- > > > + > > > +Apart from propagating the reset through the stack so apps can recover, it's > > > +really useful for driver developers to learn more about what caused the reset in > > > +first place. DRM devices should make use of devcoredump to store relevant > > > +information about the reset, so this information can be added to user bug > > > +reports. > > > + > > > .. _drm_driver_ioctl: > > > > > > IOCTL Support on Device Nodes > > > -- > > > 2.41.0 > > > > > >
That's a terrible idea. Ignoring API calls would be identical to a freeze. You might as well disable GPU recovery because the result would be the same. There are 2 scenarios: - robust contexts: report the GPU reset status and skip API calls; let the app recreate the context to recover - non-robust contexts: call exit(1) immediately, which is the best way to recover Marek On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <michel.daenzer@mailbox.org> wrote: > On 6/30/23 16:59, Alex Deucher wrote: > > On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick > > <sebastian.wick@redhat.com> wrote: > >> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com> > wrote: > >>> > >>> +Robustness > >>> +---------- > >>> + > >>> +The only way to try to keep an application working after a reset is > if it > >>> +complies with the robustness aspects of the graphical API that it is > using. > >>> + > >>> +Graphical APIs provide ways to applications to deal with device > resets. However, > >>> +there is no guarantee that the app will use such features correctly, > and the > >>> +UMD can implement policies to close the app if it is a repeating > offender, > >>> +likely in a broken loop. This is done to ensure that it does not keep > blocking > >>> +the user interface from being correctly displayed. This should be > done even if > >>> +the app is correct but happens to trigger some bug in the > hardware/driver. > >> > >> I still don't think it's good to let the kernel arbitrarily kill > >> processes that it thinks are not well-behaved based on some heuristics > >> and policy. > >> > >> Can't this be outsourced to user space? Expose the information about > >> processes causing a device and let e.g. systemd deal with coming up > >> with a policy and with killing stuff. > > > > I don't think it's the kernel doing the killing, it would be the UMD. > > E.g., if the app is guilty and doesn't support robustness the UMD can > > just call exit(). > > It would be safer to just ignore API calls[0], similarly to what is done > until the application destroys the context with robustness. Calling exit() > likely results in losing any unsaved work, whereas at least some > applications might otherwise allow saving the work by other means. > > > [0] Possibly accompanied by a one-time message to stderr along the lines > of "GPU reset detected but robustness not enabled in context, ignoring > OpenGL API calls". > > -- > Earthling Michel Dänzer | https://redhat.com > Libre software enthusiast | Mesa and Xwayland developer > >
On 6/30/23 22:32, Marek Olšák wrote: > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <michel.daenzer@mailbox.org <mailto:michel.daenzer@mailbox.org>> wrote: >> On 6/30/23 16:59, Alex Deucher wrote: >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick >>> <sebastian.wick@redhat.com <mailto:sebastian.wick@redhat.com>> wrote: >>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com <mailto:andrealmeid@igalia.com>> wrote: >>>>> >>>>> +Robustness >>>>> +---------- >>>>> + >>>>> +The only way to try to keep an application working after a reset is if it >>>>> +complies with the robustness aspects of the graphical API that it is using. >>>>> + >>>>> +Graphical APIs provide ways to applications to deal with device resets. However, >>>>> +there is no guarantee that the app will use such features correctly, and the >>>>> +UMD can implement policies to close the app if it is a repeating offender, >>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking >>>>> +the user interface from being correctly displayed. This should be done even if >>>>> +the app is correct but happens to trigger some bug in the hardware/driver. >>>> >>>> I still don't think it's good to let the kernel arbitrarily kill >>>> processes that it thinks are not well-behaved based on some heuristics >>>> and policy. >>>> >>>> Can't this be outsourced to user space? Expose the information about >>>> processes causing a device and let e.g. systemd deal with coming up >>>> with a policy and with killing stuff. >>> >>> I don't think it's the kernel doing the killing, it would be the UMD. >>> E.g., if the app is guilty and doesn't support robustness the UMD can >>> just call exit(). >> >> It would be safer to just ignore API calls[0], similarly to what is done until the application destroys the context with robustness. Calling exit() likely results in losing any unsaved work, whereas at least some applications might otherwise allow saving the work by other means. > > That's a terrible idea. Ignoring API calls would be identical to a freeze. You might as well disable GPU recovery because the result would be the same. No GPU recovery would affect everything using the GPU, whereas this affects only non-robust applications. > - non-robust contexts: call exit(1) immediately, which is the best way to recover That's not the UMD's call to make. >> [0] Possibly accompanied by a one-time message to stderr along the lines of "GPU reset detected but robustness not enabled in context, ignoring OpenGL API calls".
On Mon, 3 Jul 2023 09:12:29 +0200 Michel Dänzer <michel.daenzer@mailbox.org> wrote: > On 6/30/23 22:32, Marek Olšák wrote: > > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <michel.daenzer@mailbox.org <mailto:michel.daenzer@mailbox.org>> wrote: > >> On 6/30/23 16:59, Alex Deucher wrote: > >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick > >>> <sebastian.wick@redhat.com <mailto:sebastian.wick@redhat.com>> wrote: > >>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com <mailto:andrealmeid@igalia.com>> wrote: > >>>>> > >>>>> +Robustness > >>>>> +---------- > >>>>> + > >>>>> +The only way to try to keep an application working after a reset is if it > >>>>> +complies with the robustness aspects of the graphical API that it is using. > >>>>> + > >>>>> +Graphical APIs provide ways to applications to deal with device resets. However, > >>>>> +there is no guarantee that the app will use such features correctly, and the > >>>>> +UMD can implement policies to close the app if it is a repeating offender, > >>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking > >>>>> +the user interface from being correctly displayed. This should be done even if > >>>>> +the app is correct but happens to trigger some bug in the hardware/driver. > >>>> > >>>> I still don't think it's good to let the kernel arbitrarily kill > >>>> processes that it thinks are not well-behaved based on some heuristics > >>>> and policy. > >>>> > >>>> Can't this be outsourced to user space? Expose the information about > >>>> processes causing a device and let e.g. systemd deal with coming up > >>>> with a policy and with killing stuff. > >>> > >>> I don't think it's the kernel doing the killing, it would be the UMD. > >>> E.g., if the app is guilty and doesn't support robustness the UMD can > >>> just call exit(). > >> > >> It would be safer to just ignore API calls[0], similarly to what > >> is done until the application destroys the context with > >> robustness. Calling exit() likely results in losing any unsaved > >> work, whereas at least some applications might otherwise allow > >> saving the work by other means. > > > > That's a terrible idea. Ignoring API calls would be identical to a > > freeze. You might as well disable GPU recovery because the result > > would be the same. > > No GPU recovery would affect everything using the GPU, whereas this > affects only non-robust applications. > > > > - non-robust contexts: call exit(1) immediately, which is the best > > way to recover > > That's not the UMD's call to make. > > > >> [0] Possibly accompanied by a one-time message to stderr along > >> the lines of "GPU reset detected but robustness not enabled in > >> context, ignoring OpenGL API calls". > Hi, Michel does have a point. It's not just games and display servers that use GPU, but productivity tools as well. They may have periodic autosave in anticipation of crashes, but being able to do the final save before quitting would be nice. UMD killing the process would be new behaviour, right? Previously either application's GPU thread hangs or various API calls return errors, but it didn't kill the process, did it? If an application freezes, that's "no problem"; the end user can just continue using everything else. Alt-tab away etc. if the app was fullscreen. I do that already with games on even Xorg. If a display server freezes, that's a desktop-wide problem, but so is killing it. OTOH, if UMD really does need to terminate the process, then please do it in a way that causes a crash report to be recorded. _exit() with an error code is not it. Thanks, pq
Em 03/07/2023 05:49, Pekka Paalanen escreveu: > On Mon, 3 Jul 2023 09:12:29 +0200 > Michel Dänzer <michel.daenzer@mailbox.org> wrote: > >> On 6/30/23 22:32, Marek Olšák wrote: >>> On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <michel.daenzer@mailbox.org <mailto:michel.daenzer@mailbox.org>> wrote: >>>> On 6/30/23 16:59, Alex Deucher wrote: >>>>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick >>>>> <sebastian.wick@redhat.com <mailto:sebastian.wick@redhat.com>> wrote: >>>>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com <mailto:andrealmeid@igalia.com>> wrote: >>>>>>> >>>>>>> +Robustness >>>>>>> +---------- >>>>>>> + >>>>>>> +The only way to try to keep an application working after a reset is if it >>>>>>> +complies with the robustness aspects of the graphical API that it is using. >>>>>>> + >>>>>>> +Graphical APIs provide ways to applications to deal with device resets. However, >>>>>>> +there is no guarantee that the app will use such features correctly, and the >>>>>>> +UMD can implement policies to close the app if it is a repeating offender, >>>>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking >>>>>>> +the user interface from being correctly displayed. This should be done even if >>>>>>> +the app is correct but happens to trigger some bug in the hardware/driver. >>>>>> >>>>>> I still don't think it's good to let the kernel arbitrarily kill >>>>>> processes that it thinks are not well-behaved based on some heuristics >>>>>> and policy. >>>>>> >>>>>> Can't this be outsourced to user space? Expose the information about >>>>>> processes causing a device and let e.g. systemd deal with coming up >>>>>> with a policy and with killing stuff. >>>>> >>>>> I don't think it's the kernel doing the killing, it would be the UMD. >>>>> E.g., if the app is guilty and doesn't support robustness the UMD can >>>>> just call exit(). >>>> >>>> It would be safer to just ignore API calls[0], similarly to what >>>> is done until the application destroys the context with >>>> robustness. Calling exit() likely results in losing any unsaved >>>> work, whereas at least some applications might otherwise allow >>>> saving the work by other means. >>> >>> That's a terrible idea. Ignoring API calls would be identical to a >>> freeze. You might as well disable GPU recovery because the result >>> would be the same. >> >> No GPU recovery would affect everything using the GPU, whereas this >> affects only non-robust applications. >> >> >>> - non-robust contexts: call exit(1) immediately, which is the best >>> way to recover >> >> That's not the UMD's call to make. >> >> >>>> [0] Possibly accompanied by a one-time message to stderr along >>>> the lines of "GPU reset detected but robustness not enabled in >>>> context, ignoring OpenGL API calls". >> > > Hi, > > Michel does have a point. It's not just games and display servers that > use GPU, but productivity tools as well. They may have periodic > autosave in anticipation of crashes, but being able to do the final > save before quitting would be nice. UMD killing the process would be > new behaviour, right? Previously either application's GPU thread hangs > or various API calls return errors, but it didn't kill the process, did > it? > In Intel's Iris, UMD may call abort() for the reset guilty application: https://elixir.bootlin.com/mesa/mesa-23.0.4/source/src/gallium/drivers/iris/iris_batch.c#L1063 I was pretty sure this was the same for RadeonSI, but I failed to find the code for this, so I might be wrong. > If an application freezes, that's "no problem"; the end user can just > continue using everything else. Alt-tab away etc. if the app was > fullscreen. I do that already with games on even Xorg. > > If a display server freezes, that's a desktop-wide problem, but so is > killing it. > Interesting, what GPU do you use? In my experience (AMD RX 5600 XT), hanging the GPU usually means that the rest of applications/compositor can't use the GPU either, freezing all user interactions. So killing the guilty app is one effective solution currently, but ignoring calls may help as well. > OTOH, if UMD really does need to terminate the process, then please do > it in a way that causes a crash report to be recorded. _exit() with an > error code is not it. > In the "Reporting causes of resets" subsection of this document I can add something for UMD as well. > > Thanks, > pq
On Mon, Jul 3, 2023, 03:12 Michel Dänzer <michel.daenzer@mailbox.org> wrote: > On 6/30/23 22:32, Marek Olšák wrote: > > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer < > michel.daenzer@mailbox.org <mailto:michel.daenzer@mailbox.org>> wrote: > >> On 6/30/23 16:59, Alex Deucher wrote: > >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick > >>> <sebastian.wick@redhat.com <mailto:sebastian.wick@redhat.com>> wrote: > >>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com > <mailto:andrealmeid@igalia.com>> wrote: > >>>>> > >>>>> +Robustness > >>>>> +---------- > >>>>> + > >>>>> +The only way to try to keep an application working after a reset is > if it > >>>>> +complies with the robustness aspects of the graphical API that it > is using. > >>>>> + > >>>>> +Graphical APIs provide ways to applications to deal with device > resets. However, > >>>>> +there is no guarantee that the app will use such features > correctly, and the > >>>>> +UMD can implement policies to close the app if it is a repeating > offender, > >>>>> +likely in a broken loop. This is done to ensure that it does not > keep blocking > >>>>> +the user interface from being correctly displayed. This should be > done even if > >>>>> +the app is correct but happens to trigger some bug in the > hardware/driver. > >>>> > >>>> I still don't think it's good to let the kernel arbitrarily kill > >>>> processes that it thinks are not well-behaved based on some heuristics > >>>> and policy. > >>>> > >>>> Can't this be outsourced to user space? Expose the information about > >>>> processes causing a device and let e.g. systemd deal with coming up > >>>> with a policy and with killing stuff. > >>> > >>> I don't think it's the kernel doing the killing, it would be the UMD. > >>> E.g., if the app is guilty and doesn't support robustness the UMD can > >>> just call exit(). > >> > >> It would be safer to just ignore API calls[0], similarly to what is > done until the application destroys the context with robustness. Calling > exit() likely results in losing any unsaved work, whereas at least some > applications might otherwise allow saving the work by other means. > > > > That's a terrible idea. Ignoring API calls would be identical to a > freeze. You might as well disable GPU recovery because the result would be > the same. > > No GPU recovery would affect everything using the GPU, whereas this > affects only non-robust applications. > which is currently the majority. > > > - non-robust contexts: call exit(1) immediately, which is the best way > to recover > > That's not the UMD's call to make. > That's absolutely the UMD's call to make because that's mandated by the hw and API design and only driver devs know this, which this thread is a proof of. The default behavior is to skip all command submission if a non-robust context is lost, which looks like a freeze. That's required to prevent infinite hangs from the same context and can be caused by the side effects of the GPU reset itself, not by the cause of the previous hang. The only way out of that is killing the process. Marek > > >> [0] Possibly accompanied by a one-time message to stderr along the > lines of "GPU reset detected but robustness not enabled in context, > ignoring OpenGL API calls". > > > -- > Earthling Michel Dänzer | https://redhat.com > Libre software enthusiast | Mesa and Xwayland developer > >
On 7/3/23 19:34, Marek Olšák wrote: > > > On Mon, Jul 3, 2023, 03:12 Michel Dänzer <michel.daenzer@mailbox.org <mailto:michel.daenzer@mailbox.org>> wrote: > Marek, Please stop sending html emails to the mailing lists. The mailing list software drops them. Please set your email interface to use plain text mode instead. Thanks.
On Mon, Jul 3, 2023, 22:38 Randy Dunlap <rdunlap@infradead.org> wrote: > > > On 7/3/23 19:34, Marek Olšák wrote: > > > > > > On Mon, Jul 3, 2023, 03:12 Michel Dänzer <michel.daenzer@mailbox.org > <mailto:michel.daenzer@mailbox.org>> wrote: > > > > Marek, > Please stop sending html emails to the mailing lists. > The mailing list software drops them. > > Please set your email interface to use plain text mode instead. > Thanks. > The mobile Gmail app doesn't support plain text, which I use frequently. Marek > -- > ~Randy >
On 7/3/23 19:44, Marek Olšák wrote: > > > On Mon, Jul 3, 2023, 22:38 Randy Dunlap <rdunlap@infradead.org <mailto:rdunlap@infradead.org>> wrote: > > > > On 7/3/23 19:34, Marek Olšák wrote: > > > > > > On Mon, Jul 3, 2023, 03:12 Michel Dänzer <michel.daenzer@mailbox.org <mailto:michel.daenzer@mailbox.org> <mailto:michel.daenzer@mailbox.org <mailto:michel.daenzer@mailbox.org>>> wrote: > > > > Marek, > Please stop sending html emails to the mailing lists. > The mailing list software drops them. > > Please set your email interface to use plain text mode instead. > Thanks. > > > The mobile Gmail app doesn't support plain text, which I use frequently. Perhaps you should consider some other mobile app for kernel discussions. E.g., it looks like the K-9 mail app works with gmail.
On Mon, 3 Jul 2023 12:00:22 -0300 André Almeida <andrealmeid@igalia.com> wrote: > Em 03/07/2023 05:49, Pekka Paalanen escreveu: > > If an application freezes, that's "no problem"; the end user can just > > continue using everything else. Alt-tab away etc. if the app was > > fullscreen. I do that already with games on even Xorg. > > > > If a display server freezes, that's a desktop-wide problem, but so is > > killing it. > > > > Interesting, what GPU do you use? In my experience (AMD RX 5600 XT), > hanging the GPU usually means that the rest of applications/compositor > can't use the GPU either, freezing all user interactions. So killing the > guilty app is one effective solution currently, but ignoring calls may > help as well. I don't know if what I'm seeing is a GPU hang or just e.g. Proton getting somehow stuck, all I see is a game freezing. I just Alt+tab back to Steam, force-stop it, and then all is fine again. This is how it should work regardless of why a game freezes. However, even if it was a GPU hang, if I am on a display server that actually handles GPU resets, I don't see why the rest of the desktop would not be able to recover. Individual apps are each to their own, but at the very least non-GPU apps and the DE itself should not have any problem (DE components can simply be restarted automatically). Thanks, pq
On 7/4/23 04:34, Marek Olšák wrote: > On Mon, Jul 3, 2023, 03:12 Michel Dänzer <michel.daenzer@mailbox.org <mailto:michel.daenzer@mailbox.org>> wrote: > On 6/30/23 22:32, Marek Olšák wrote: > > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <michel.daenzer@mailbox.org <mailto:michel.daenzer@mailbox.org> <mailto:michel.daenzer@mailbox.org <mailto:michel.daenzer@mailbox.org>>> wrote: > >> On 6/30/23 16:59, Alex Deucher wrote: > >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick > >>> <sebastian.wick@redhat.com <mailto:sebastian.wick@redhat.com> <mailto:sebastian.wick@redhat.com <mailto:sebastian.wick@redhat.com>>> wrote: > >>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com <mailto:andrealmeid@igalia.com> <mailto:andrealmeid@igalia.com <mailto:andrealmeid@igalia.com>>> wrote: > >>>>> > >>>>> +Robustness > >>>>> +---------- > >>>>> + > >>>>> +The only way to try to keep an application working after a reset is if it > >>>>> +complies with the robustness aspects of the graphical API that it is using. > >>>>> + > >>>>> +Graphical APIs provide ways to applications to deal with device resets. However, > >>>>> +there is no guarantee that the app will use such features correctly, and the > >>>>> +UMD can implement policies to close the app if it is a repeating offender, > >>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking > >>>>> +the user interface from being correctly displayed. This should be done even if > >>>>> +the app is correct but happens to trigger some bug in the hardware/driver. > >>>> > >>>> I still don't think it's good to let the kernel arbitrarily kill > >>>> processes that it thinks are not well-behaved based on some heuristics > >>>> and policy. > >>>> > >>>> Can't this be outsourced to user space? Expose the information about > >>>> processes causing a device and let e.g. systemd deal with coming up > >>>> with a policy and with killing stuff. > >>> > >>> I don't think it's the kernel doing the killing, it would be the UMD. > >>> E.g., if the app is guilty and doesn't support robustness the UMD can > >>> just call exit(). > >> > >> It would be safer to just ignore API calls[0], similarly to what is done until the application destroys the context with robustness. Calling exit() likely results in losing any unsaved work, whereas at least some applications might otherwise allow saving the work by other means. > > > > That's a terrible idea. Ignoring API calls would be identical to a freeze. You might as well disable GPU recovery because the result would be the same. > > No GPU recovery would affect everything using the GPU, whereas this affects only non-robust applications. > > which is currently the majority. Not sure where you're going with this. Applications need to use robustness to be able to recover from a GPU hang, and the GPU needs to be reset for that. So disabling GPU reset is not the same as what we're discussing here. > > - non-robust contexts: call exit(1) immediately, which is the best way to recover > > That's not the UMD's call to make. > > That's absolutely the UMD's call to make because that's mandated by the hw and API design Can you point us to a spec which mandates that the process must be killed in this case? > and only driver devs know this, which this thread is a proof of. The default behavior is to skip all command submission if a non-robust context is lost, which looks like a freeze. That's required to prevent infinite hangs from the same context and can be caused by the side effects of the GPU reset itself, not by the cause of the previous hang. The only way out of that is killing the process. The UMD killing the process is not the only way out of that, and doing so is overreach on its part. The UMD is but one out of many components in a process, not the main one or a special one. It doesn't get to decide when the process must die, certainly not under circumstances where it must be able to continue while ignoring API calls (that's required for robustness). > >> [0] Possibly accompanied by a one-time message to stderr along the lines of "GPU reset detected but robustness not enabled in context, ignoring OpenGL API calls".
On Tue, Jul 4, 2023, 03:55 Michel Dänzer <michel.daenzer@mailbox.org> wrote: > On 7/4/23 04:34, Marek Olšák wrote: > > On Mon, Jul 3, 2023, 03:12 Michel Dänzer <michel.daenzer@mailbox.org > <mailto:michel.daenzer@mailbox.org>> wrote: > > On 6/30/23 22:32, Marek Olšák wrote: > > > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer < > michel.daenzer@mailbox.org <mailto:michel.daenzer@mailbox.org> <mailto: > michel.daenzer@mailbox.org <mailto:michel.daenzer@mailbox.org>>> wrote: > > >> On 6/30/23 16:59, Alex Deucher wrote: > > >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick > > >>> <sebastian.wick@redhat.com <mailto:sebastian.wick@redhat.com> > <mailto:sebastian.wick@redhat.com <mailto:sebastian.wick@redhat.com>>> > wrote: > > >>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida < > andrealmeid@igalia.com <mailto:andrealmeid@igalia.com> <mailto: > andrealmeid@igalia.com <mailto:andrealmeid@igalia.com>>> wrote: > > >>>>> > > >>>>> +Robustness > > >>>>> +---------- > > >>>>> + > > >>>>> +The only way to try to keep an application working after a > reset is if it > > >>>>> +complies with the robustness aspects of the graphical API > that it is using. > > >>>>> + > > >>>>> +Graphical APIs provide ways to applications to deal with > device resets. However, > > >>>>> +there is no guarantee that the app will use such features > correctly, and the > > >>>>> +UMD can implement policies to close the app if it is a > repeating offender, > > >>>>> +likely in a broken loop. This is done to ensure that it does > not keep blocking > > >>>>> +the user interface from being correctly displayed. This > should be done even if > > >>>>> +the app is correct but happens to trigger some bug in the > hardware/driver. > > >>>> > > >>>> I still don't think it's good to let the kernel arbitrarily kill > > >>>> processes that it thinks are not well-behaved based on some > heuristics > > >>>> and policy. > > >>>> > > >>>> Can't this be outsourced to user space? Expose the information > about > > >>>> processes causing a device and let e.g. systemd deal with > coming up > > >>>> with a policy and with killing stuff. > > >>> > > >>> I don't think it's the kernel doing the killing, it would be the > UMD. > > >>> E.g., if the app is guilty and doesn't support robustness the > UMD can > > >>> just call exit(). > > >> > > >> It would be safer to just ignore API calls[0], similarly to what > is done until the application destroys the context with robustness. Calling > exit() likely results in losing any unsaved work, whereas at least some > applications might otherwise allow saving the work by other means. > > > > > > That's a terrible idea. Ignoring API calls would be identical to a > freeze. You might as well disable GPU recovery because the result would be > the same. > > > > No GPU recovery would affect everything using the GPU, whereas this > affects only non-robust applications. > > > > which is currently the majority. > > Not sure where you're going with this. Applications need to use robustness > to be able to recover from a GPU hang, and the GPU needs to be reset for > that. So disabling GPU reset is not the same as what we're discussing here. > > > > > - non-robust contexts: call exit(1) immediately, which is the best > way to recover > > > > That's not the UMD's call to make. > > > > That's absolutely the UMD's call to make because that's mandated by the > hw and API design > > Can you point us to a spec which mandates that the process must be killed > in this case? > > > > and only driver devs know this, which this thread is a proof of. The > default behavior is to skip all command submission if a non-robust context > is lost, which looks like a freeze. That's required to prevent infinite > hangs from the same context and can be caused by the side effects of the > GPU reset itself, not by the cause of the previous hang. The only way out > of that is killing the process. > > The UMD killing the process is not the only way out of that, and doing so > is overreach on its part. The UMD is but one out of many components in a > process, not the main one or a special one. It doesn't get to decide when > the process must die, certainly not under circumstances where it must be > able to continue while ignoring API calls (that's required for robustness). > You're mixing things up. Robust apps don't any special action from a UMD. Only non-robust apps need to be killed for proper recovery with the only other alternative being not updating the window/screen, which is not user-friendly because the user who's never heard of GPU hangs has no fucking idea why it's frozen and what do with it. It doesn't matter that you can debug it because you're not the average user. Also it's already used and required by our customers on Android because killing a process returns the user to the desktop screen and can generate a crash dump instead of keeping the app output frozen, and they agree that this is the best user experience given the circumstances. Also if the ML ignores html, that's fine. Marek > > > >> [0] Possibly accompanied by a one-time message to stderr > along the lines of "GPU reset detected but robustness not enabled in > context, ignoring OpenGL API calls". > > > -- > Earthling Michel Dänzer | https://redhat.com > Libre software enthusiast | Mesa and Xwayland developer > >
On 7/5/23 08:30, Marek Olšák wrote: > On Tue, Jul 4, 2023, 03:55 Michel Dänzer <michel.daenzer@mailbox.org> wrote: > On 7/4/23 04:34, Marek Olšák wrote: > > On Mon, Jul 3, 2023, 03:12 Michel Dänzer <michel.daenzer@mailbox.org > wrote: > > On 6/30/23 22:32, Marek Olšák wrote: > > > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <michel.daenzer@mailbox.org> wrote: > > >> On 6/30/23 16:59, Alex Deucher wrote: > > >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick > > >>> <sebastian.wick@redhat.com <mailto:sebastian.wick@redhat.com> wrote: > > >>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com> wrote: > > >>>>> > > >>>>> +Robustness > > >>>>> +---------- > > >>>>> + > > >>>>> +The only way to try to keep an application working after a reset is if it > > >>>>> +complies with the robustness aspects of the graphical API that it is using. > > >>>>> + > > >>>>> +Graphical APIs provide ways to applications to deal with device resets. However, > > >>>>> +there is no guarantee that the app will use such features correctly, and the > > >>>>> +UMD can implement policies to close the app if it is a repeating offender, > > >>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking > > >>>>> +the user interface from being correctly displayed. This should be done even if > > >>>>> +the app is correct but happens to trigger some bug in the hardware/driver. > > >>>> > > >>>> I still don't think it's good to let the kernel arbitrarily kill > > >>>> processes that it thinks are not well-behaved based on some heuristics > > >>>> and policy. > > >>>> > > >>>> Can't this be outsourced to user space? Expose the information about > > >>>> processes causing a device and let e.g. systemd deal with coming up > > >>>> with a policy and with killing stuff. > > >>> > > >>> I don't think it's the kernel doing the killing, it would be the UMD. > > >>> E.g., if the app is guilty and doesn't support robustness the UMD can > > >>> just call exit(). > > >> > > >> It would be safer to just ignore API calls[0], similarly to what is done until the application destroys the context with robustness. Calling exit() likely results in losing any unsaved work, whereas at least some applications might otherwise allow saving the work by other means. > > > > > > That's a terrible idea. Ignoring API calls would be identical to a freeze. You might as well disable GPU recovery because the result would be the same. > > > > No GPU recovery would affect everything using the GPU, whereas this affects only non-robust applications. > > > > which is currently the majority. > > Not sure where you're going with this. Applications need to use robustness to be able to recover from a GPU hang, and the GPU needs to be reset for that. So disabling GPU reset is not the same as what we're discussing here. > > > > > - non-robust contexts: call exit(1) immediately, which is the best way to recover > > > > That's not the UMD's call to make. > > > > That's absolutely the UMD's call to make because that's mandated by the hw and API design > > Can you point us to a spec which mandates that the process must be killed in this case? > > > > and only driver devs know this, which this thread is a proof of. The default behavior is to skip all command submission if a non-robust context is lost, which looks like a freeze. That's required to prevent infinite hangs from the same context and can be caused by the side effects of the GPU reset itself, not by the cause of the previous hang. The only way out of that is killing the process. > > The UMD killing the process is not the only way out of that, and doing so is overreach on its part. The UMD is but one out of many components in a process, not the main one or a special one. It doesn't get to decide when the process must die, certainly not under circumstances where it must be able to continue while ignoring API calls (that's required for robustness). > > > You're mixing things up. Robust apps don't any special action from a UMD. Only non-robust apps need to be killed for proper recovery with the only other alternative being not updating the window/screen, I'm saying they don't "need" to be killed, since the UMD must be able to keep going while ignoring API calls (or it couldn't support robustness). It's a choice, one which is not for the UMD to make. > Also it's already used and required by our customers on Android because killing a process returns the user to the desktop screen and can generate a crash dump instead of keeping the app output frozen, and they agree that this is the best user experience given the circumstances. Then some appropriate Android component needs to make that call. The UMD is not it. > > >> [0] Possibly accompanied by a one-time message to stderr along the lines of "GPU reset detected but robustness not enabled in context, ignoring OpenGL API calls".
On Wed, Jul 5, 2023 at 3:32 AM Michel Dänzer <michel.daenzer@mailbox.org> wrote: > > On 7/5/23 08:30, Marek Olšák wrote: > > On Tue, Jul 4, 2023, 03:55 Michel Dänzer <michel.daenzer@mailbox.org> wrote: > > On 7/4/23 04:34, Marek Olšák wrote: > > > On Mon, Jul 3, 2023, 03:12 Michel Dänzer <michel.daenzer@mailbox.org > wrote: > > > On 6/30/23 22:32, Marek Olšák wrote: > > > > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <michel.daenzer@mailbox.org> wrote: > > > >> On 6/30/23 16:59, Alex Deucher wrote: > > > >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick > > > >>> <sebastian.wick@redhat.com <mailto:sebastian.wick@redhat.com> wrote: > > > >>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com> wrote: > > > >>>>> > > > >>>>> +Robustness > > > >>>>> +---------- > > > >>>>> + > > > >>>>> +The only way to try to keep an application working after a reset is if it > > > >>>>> +complies with the robustness aspects of the graphical API that it is using. > > > >>>>> + > > > >>>>> +Graphical APIs provide ways to applications to deal with device resets. However, > > > >>>>> +there is no guarantee that the app will use such features correctly, and the > > > >>>>> +UMD can implement policies to close the app if it is a repeating offender, > > > >>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking > > > >>>>> +the user interface from being correctly displayed. This should be done even if > > > >>>>> +the app is correct but happens to trigger some bug in the hardware/driver. > > > >>>> > > > >>>> I still don't think it's good to let the kernel arbitrarily kill > > > >>>> processes that it thinks are not well-behaved based on some heuristics > > > >>>> and policy. > > > >>>> > > > >>>> Can't this be outsourced to user space? Expose the information about > > > >>>> processes causing a device and let e.g. systemd deal with coming up > > > >>>> with a policy and with killing stuff. > > > >>> > > > >>> I don't think it's the kernel doing the killing, it would be the UMD. > > > >>> E.g., if the app is guilty and doesn't support robustness the UMD can > > > >>> just call exit(). > > > >> > > > >> It would be safer to just ignore API calls[0], similarly to what is done until the application destroys the context with robustness. Calling exit() likely results in losing any unsaved work, whereas at least some applications might otherwise allow saving the work by other means. > > > > > > > > That's a terrible idea. Ignoring API calls would be identical to a freeze. You might as well disable GPU recovery because the result would be the same. > > > > > > No GPU recovery would affect everything using the GPU, whereas this affects only non-robust applications. > > > > > > which is currently the majority. > > > > Not sure where you're going with this. Applications need to use robustness to be able to recover from a GPU hang, and the GPU needs to be reset for that. So disabling GPU reset is not the same as what we're discussing here. > > > > > > > > - non-robust contexts: call exit(1) immediately, which is the best way to recover > > > > > > That's not the UMD's call to make. > > > > > > That's absolutely the UMD's call to make because that's mandated by the hw and API design > > > > Can you point us to a spec which mandates that the process must be killed in this case? > > > > > > > and only driver devs know this, which this thread is a proof of. The default behavior is to skip all command submission if a non-robust context is lost, which looks like a freeze. That's required to prevent infinite hangs from the same context and can be caused by the side effects of the GPU reset itself, not by the cause of the previous hang. The only way out of that is killing the process. > > > > The UMD killing the process is not the only way out of that, and doing so is overreach on its part. The UMD is but one out of many components in a process, not the main one or a special one. It doesn't get to decide when the process must die, certainly not under circumstances where it must be able to continue while ignoring API calls (that's required for robustness). > > > > > > You're mixing things up. Robust apps don't any special action from a UMD. Only non-robust apps need to be killed for proper recovery with the only other alternative being not updating the window/screen, > > I'm saying they don't "need" to be killed, since the UMD must be able to keep going while ignoring API calls (or it couldn't support robustness). It's a choice, one which is not for the UMD to make. > > > > Also it's already used and required by our customers on Android because killing a process returns the user to the desktop screen and can generate a crash dump instead of keeping the app output frozen, and they agree that this is the best user experience given the circumstances. > > Then some appropriate Android component needs to make that call. The UMD is not it. We can change it once Android and Linux have a better way to handle non-robust apps. Until then, generating a core dump after a GPU crash produces the best outcome for users and developers. Marek
Hi everyone, It's not clear what we should do about non-robust OpenGL apps after GPU resets, so I'll try to summarize the topic, show some options and my proposal to move forward on that. Em 27/06/2023 10:23, André Almeida escreveu: > +Robustness > +---------- > + > +The only way to try to keep an application working after a reset is if it > +complies with the robustness aspects of the graphical API that it is using. > + > +Graphical APIs provide ways to applications to deal with device resets. However, > +there is no guarantee that the app will use such features correctly, and the > +UMD can implement policies to close the app if it is a repeating offender, > +likely in a broken loop. This is done to ensure that it does not keep blocking > +the user interface from being correctly displayed. This should be done even if > +the app is correct but happens to trigger some bug in the hardware/driver. > + Depending on the OpenGL version, there are different robustness API available: - OpenGL ABR extension [0] - OpenGL KHR extension [1] - OpenGL ES extension [2] Apps written in OpenGL should use whatever version is available for them to make the app robust for GPU resets. That usually means calling GetGraphicsResetStatusARB(), checking the status, and if it encounter something different from NO_ERROR, that means that a reset has happened, the context is considered lost and should be recreated. If an app follow this, it will likely succeed recovering a reset. What should non-robustness apps do then? They certainly will not be notified if a reset happens, and thus can't recover if their context is lost. OpenGL specification does not explicitly define what should be done in such situations[3], and I believe that usually when the spec mandates to close the app, it would explicitly note it. However, in reality there are different types of device resets, causing different results. A reset can be precise enough to damage only the guilty context, and keep others alive. Given that, I believe drivers have the following options: a) Kill all non-robust apps after a reset. This may lead to lose work from innocent applications. b) Ignore all non-robust apps OpenGL calls. That means that applications would still be alive, but the user interface would be freeze. The user would need to close it manually anyway, but in some corner cases, the app could autosave some work or the user might be able to interact with it using some alternative method (command line?). c) Kill just the affected non-robust applications. To do that, the driver need to be 100% sure on the impact of its resets. RadeonSI currently implements a), as can be seen at [4], while Iris implements what I think it's c)[5]. For the user experience point-of-view, c) is clearly the best option, but it's the hardest to archive. There's not much gain on having b) over a), perhaps it could be an optional env var for such corner case applications. My proposal for the documentation is: implement a) if nothing else is available, have a MESA_NO_RESET_KILL for people that want b), ideally implement c) if the driver is able to know for sure that the non-guilty apps can still work after a reset. Thanks, André [0] https://registry.khronos.org/OpenGL/extensions/ARB/ARB_robustness.txt [1] https://registry.khronos.org/OpenGL/extensions/KHR/KHR_robustness.txt [2] https://registry.khronos.org/OpenGL/extensions/EXT/EXT_robustness.txt [3] https://registry.khronos.org/OpenGL/specs/gl/glspec46.core.pdf [4] https://gitlab.freedesktop.org/mesa/mesa/-/blob/23.1/src/gallium/winsys/amdgpu/drm/amdgpu_cs.c#L1657 [5] https://gitlab.freedesktop.org/mesa/mesa/-/blob/23.1/src/gallium/drivers/iris/iris_batch.c#L842
On Tuesday, July 25th, 2023 at 04:55, André Almeida <andrealmeid@igalia.com> wrote: > It's not clear what we should do about non-robust OpenGL apps after GPU > resets, so I'll try to summarize the topic, show some options and my > proposal to move forward on that. > > Em 27/06/2023 10:23, André Almeida escreveu: > > > +Robustness > > +---------- > > + > > +The only way to try to keep an application working after a reset is if it > > +complies with the robustness aspects of the graphical API that it is using. > > + > > +Graphical APIs provide ways to applications to deal with device resets. However, > > +there is no guarantee that the app will use such features correctly, and the > > +UMD can implement policies to close the app if it is a repeating offender, > > +likely in a broken loop. This is done to ensure that it does not keep blocking > > +the user interface from being correctly displayed. This should be done even if > > +the app is correct but happens to trigger some bug in the hardware/driver. > > + > > Depending on the OpenGL version, there are different robustness API > available: > > - OpenGL ABR extension [0] > - OpenGL KHR extension [1] > - OpenGL ES extension [2] > > Apps written in OpenGL should use whatever version is available for them > to make the app robust for GPU resets. That usually means calling > GetGraphicsResetStatusARB(), checking the status, and if it encounter > something different from NO_ERROR, that means that a reset has happened, > the context is considered lost and should be recreated. If an app follow > this, it will likely succeed recovering a reset. > > What should non-robustness apps do then? They certainly will not be > notified if a reset happens, and thus can't recover if their context is > lost. OpenGL specification does not explicitly define what should be > done in such situations[3], and I believe that usually when the spec > mandates to close the app, it would explicitly note it. > > However, in reality there are different types of device resets, causing > different results. A reset can be precise enough to damage only the > guilty context, and keep others alive. > > Given that, I believe drivers have the following options: > > a) Kill all non-robust apps after a reset. This may lead to lose work > from innocent applications. > > b) Ignore all non-robust apps OpenGL calls. That means that applications > would still be alive, but the user interface would be freeze. The user > would need to close it manually anyway, but in some corner cases, the > app could autosave some work or the user might be able to interact with > it using some alternative method (command line?). > > c) Kill just the affected non-robust applications. To do that, the > driver need to be 100% sure on the impact of its resets. We've discussed this a while back on #dri-devel IIRC. I think the best experience would be for the Wayland compositor to gray out apps which lost their GL context, and display an information dialog to explain what happened to the user and a button to kill the app. I'm not exactly sure how that would translate to a kernel or Mesa uAPI, and if there's appetite to do a lot of work to get "the best GPU reset UX" (IOW: maybe it's not worth all of the trouble).
On 7/25/23 04:55, André Almeida wrote: > Hi everyone, > > It's not clear what we should do about non-robust OpenGL apps after GPU resets, so I'll try to summarize the topic, show some options and my proposal to move forward on that. > > Em 27/06/2023 10:23, André Almeida escreveu: >> +Robustness >> +---------- >> + >> +The only way to try to keep an application working after a reset is if it >> +complies with the robustness aspects of the graphical API that it is using. >> + >> +Graphical APIs provide ways to applications to deal with device resets. However, >> +there is no guarantee that the app will use such features correctly, and the >> +UMD can implement policies to close the app if it is a repeating offender, >> +likely in a broken loop. This is done to ensure that it does not keep blocking >> +the user interface from being correctly displayed. This should be done even if >> +the app is correct but happens to trigger some bug in the hardware/driver. >> + > Depending on the OpenGL version, there are different robustness API available: > > - OpenGL ABR extension [0] > - OpenGL KHR extension [1] > - OpenGL ES extension [2] > > Apps written in OpenGL should use whatever version is available for them to make the app robust for GPU resets. That usually means calling GetGraphicsResetStatusARB(), checking the status, and if it encounter something different from NO_ERROR, that means that a reset has happened, the context is considered lost and should be recreated. If an app follow this, it will likely succeed recovering a reset. > > What should non-robustness apps do then? They certainly will not be notified if a reset happens, and thus can't recover if their context is lost. OpenGL specification does not explicitly define what should be done in such situations[3], and I believe that usually when the spec mandates to close the app, it would explicitly note it. > > However, in reality there are different types of device resets, causing different results. A reset can be precise enough to damage only the guilty context, and keep others alive. > > Given that, I believe drivers have the following options: > > a) Kill all non-robust apps after a reset. This may lead to lose work from innocent applications. > > b) Ignore all non-robust apps OpenGL calls. That means that applications would still be alive, but the user interface would be freeze. The user would need to close it manually anyway, but in some corner cases, the app could autosave some work or the user might be able to interact with it using some alternative method (command line?). > > c) Kill just the affected non-robust applications. To do that, the driver need to be 100% sure on the impact of its resets. > > RadeonSI currently implements a), as can be seen at [4], while Iris implements what I think it's c)[5]. > > For the user experience point-of-view, c) is clearly the best option, but it's the hardest to archive. There's not much gain on having b) over a), perhaps it could be an optional env var for such corner case applications. I disagree on these conclusions. c) is certainly better than a), but it's not "clearly the best" in all cases. The OpenGL UMD is not a privileged/special component and is in no position to decide whether or not the process as a whole (only some thread(s) of which may use OpenGL at all) gets to continue running or not. > [0] https://registry.khronos.org/OpenGL/extensions/ARB/ARB_robustness.txt > [1] https://registry.khronos.org/OpenGL/extensions/KHR/KHR_robustness.txt > [2] https://registry.khronos.org/OpenGL/extensions/EXT/EXT_robustness.txt > [3] https://registry.khronos.org/OpenGL/specs/gl/glspec46.core.pdf > [4] https://gitlab.freedesktop.org/mesa/mesa/-/blob/23.1/src/gallium/winsys/amdgpu/drm/amdgpu_cs.c#L1657 > [5] https://gitlab.freedesktop.org/mesa/mesa/-/blob/23.1/src/gallium/drivers/iris/iris_batch.c#L842
Hi Michel, Em 25/07/2023 05:03, Michel Dänzer escreveu: > On 7/25/23 04:55, André Almeida wrote: >> Hi everyone, >> >> It's not clear what we should do about non-robust OpenGL apps after GPU resets, so I'll try to summarize the topic, show some options and my proposal to move forward on that. >> >> Em 27/06/2023 10:23, André Almeida escreveu: >>> +Robustness >>> +---------- >>> + >>> +The only way to try to keep an application working after a reset is if it >>> +complies with the robustness aspects of the graphical API that it is using. >>> + >>> +Graphical APIs provide ways to applications to deal with device resets. However, >>> +there is no guarantee that the app will use such features correctly, and the >>> +UMD can implement policies to close the app if it is a repeating offender, >>> +likely in a broken loop. This is done to ensure that it does not keep blocking >>> +the user interface from being correctly displayed. This should be done even if >>> +the app is correct but happens to trigger some bug in the hardware/driver. >>> + >> Depending on the OpenGL version, there are different robustness API available: >> >> - OpenGL ABR extension [0] >> - OpenGL KHR extension [1] >> - OpenGL ES extension [2] >> >> Apps written in OpenGL should use whatever version is available for them to make the app robust for GPU resets. That usually means calling GetGraphicsResetStatusARB(), checking the status, and if it encounter something different from NO_ERROR, that means that a reset has happened, the context is considered lost and should be recreated. If an app follow this, it will likely succeed recovering a reset. >> >> What should non-robustness apps do then? They certainly will not be notified if a reset happens, and thus can't recover if their context is lost. OpenGL specification does not explicitly define what should be done in such situations[3], and I believe that usually when the spec mandates to close the app, it would explicitly note it. >> >> However, in reality there are different types of device resets, causing different results. A reset can be precise enough to damage only the guilty context, and keep others alive. >> >> Given that, I believe drivers have the following options: >> >> a) Kill all non-robust apps after a reset. This may lead to lose work from innocent applications. >> >> b) Ignore all non-robust apps OpenGL calls. That means that applications would still be alive, but the user interface would be freeze. The user would need to close it manually anyway, but in some corner cases, the app could autosave some work or the user might be able to interact with it using some alternative method (command line?). >> >> c) Kill just the affected non-robust applications. To do that, the driver need to be 100% sure on the impact of its resets. >> >> RadeonSI currently implements a), as can be seen at [4], while Iris implements what I think it's c)[5]. >> >> For the user experience point-of-view, c) is clearly the best option, but it's the hardest to archive. There's not much gain on having b) over a), perhaps it could be an optional env var for such corner case applications. > > I disagree on these conclusions. > > c) is certainly better than a), but it's not "clearly the best" in all cases. The OpenGL UMD is not a privileged/special component and is in no position to decide whether or not the process as a whole (only some thread(s) of which may use OpenGL at all) gets to continue running or not. > Thank you for the feedback. How do you think the documentation should look like for this part?
On Tue, Jul 25, 2023 at 4:03 AM Michel Dänzer <michel.daenzer@mailbox.org> wrote: > > On 7/25/23 04:55, André Almeida wrote: > > Hi everyone, > > > > It's not clear what we should do about non-robust OpenGL apps after GPU resets, so I'll try to summarize the topic, show some options and my proposal to move forward on that. > > > > Em 27/06/2023 10:23, André Almeida escreveu: > >> +Robustness > >> +---------- > >> + > >> +The only way to try to keep an application working after a reset is if it > >> +complies with the robustness aspects of the graphical API that it is using. > >> + > >> +Graphical APIs provide ways to applications to deal with device resets. However, > >> +there is no guarantee that the app will use such features correctly, and the > >> +UMD can implement policies to close the app if it is a repeating offender, > >> +likely in a broken loop. This is done to ensure that it does not keep blocking > >> +the user interface from being correctly displayed. This should be done even if > >> +the app is correct but happens to trigger some bug in the hardware/driver. > >> + > > Depending on the OpenGL version, there are different robustness API available: > > > > - OpenGL ABR extension [0] > > - OpenGL KHR extension [1] > > - OpenGL ES extension [2] > > > > Apps written in OpenGL should use whatever version is available for them to make the app robust for GPU resets. That usually means calling GetGraphicsResetStatusARB(), checking the status, and if it encounter something different from NO_ERROR, that means that a reset has happened, the context is considered lost and should be recreated. If an app follow this, it will likely succeed recovering a reset. > > > > What should non-robustness apps do then? They certainly will not be notified if a reset happens, and thus can't recover if their context is lost. OpenGL specification does not explicitly define what should be done in such situations[3], and I believe that usually when the spec mandates to close the app, it would explicitly note it. > > > > However, in reality there are different types of device resets, causing different results. A reset can be precise enough to damage only the guilty context, and keep others alive. > > > > Given that, I believe drivers have the following options: > > > > a) Kill all non-robust apps after a reset. This may lead to lose work from innocent applications. > > > > b) Ignore all non-robust apps OpenGL calls. That means that applications would still be alive, but the user interface would be freeze. The user would need to close it manually anyway, but in some corner cases, the app could autosave some work or the user might be able to interact with it using some alternative method (command line?). > > > > c) Kill just the affected non-robust applications. To do that, the driver need to be 100% sure on the impact of its resets. > > > > RadeonSI currently implements a), as can be seen at [4], while Iris implements what I think it's c)[5]. > > > > For the user experience point-of-view, c) is clearly the best option, but it's the hardest to archive. There's not much gain on having b) over a), perhaps it could be an optional env var for such corner case applications. > > I disagree on these conclusions. > > c) is certainly better than a), but it's not "clearly the best" in all cases. The OpenGL UMD is not a privileged/special component and is in no position to decide whether or not the process as a whole (only some thread(s) of which may use OpenGL at all) gets to continue running or not. That's not true. I recommend that you enable b) with your driver and then hang the GPU under different scenarios and see the result. Then enable a) and do the same and compare. Options a) and c) can be merged into one because they are not separate options to choose from. If Wayland wanted to grey out lost apps, they would appear as robust contexts in gallium, but the reset status would be piped through the Wayland protocol instead of the GL API. Marek Marek
On 7/25/23 17:05, Marek Olšák wrote: > On Tue, Jul 25, 2023 at 4:03 AM Michel Dänzer > <michel.daenzer@mailbox.org> wrote: >> On 7/25/23 04:55, André Almeida wrote: >>> Hi everyone, >>> >>> It's not clear what we should do about non-robust OpenGL apps after GPU resets, so I'll try to summarize the topic, show some options and my proposal to move forward on that. >>> >>> Em 27/06/2023 10:23, André Almeida escreveu: >>>> +Robustness >>>> +---------- >>>> + >>>> +The only way to try to keep an application working after a reset is if it >>>> +complies with the robustness aspects of the graphical API that it is using. >>>> + >>>> +Graphical APIs provide ways to applications to deal with device resets. However, >>>> +there is no guarantee that the app will use such features correctly, and the >>>> +UMD can implement policies to close the app if it is a repeating offender, >>>> +likely in a broken loop. This is done to ensure that it does not keep blocking >>>> +the user interface from being correctly displayed. This should be done even if >>>> +the app is correct but happens to trigger some bug in the hardware/driver. >>>> + >>> Depending on the OpenGL version, there are different robustness API available: >>> >>> - OpenGL ABR extension [0] >>> - OpenGL KHR extension [1] >>> - OpenGL ES extension [2] >>> >>> Apps written in OpenGL should use whatever version is available for them to make the app robust for GPU resets. That usually means calling GetGraphicsResetStatusARB(), checking the status, and if it encounter something different from NO_ERROR, that means that a reset has happened, the context is considered lost and should be recreated. If an app follow this, it will likely succeed recovering a reset. >>> >>> What should non-robustness apps do then? They certainly will not be notified if a reset happens, and thus can't recover if their context is lost. OpenGL specification does not explicitly define what should be done in such situations[3], and I believe that usually when the spec mandates to close the app, it would explicitly note it. >>> >>> However, in reality there are different types of device resets, causing different results. A reset can be precise enough to damage only the guilty context, and keep others alive. >>> >>> Given that, I believe drivers have the following options: >>> >>> a) Kill all non-robust apps after a reset. This may lead to lose work from innocent applications. >>> >>> b) Ignore all non-robust apps OpenGL calls. That means that applications would still be alive, but the user interface would be freeze. The user would need to close it manually anyway, but in some corner cases, the app could autosave some work or the user might be able to interact with it using some alternative method (command line?). >>> >>> c) Kill just the affected non-robust applications. To do that, the driver need to be 100% sure on the impact of its resets. >>> >>> RadeonSI currently implements a), as can be seen at [4], while Iris implements what I think it's c)[5]. >>> >>> For the user experience point-of-view, c) is clearly the best option, but it's the hardest to archive. There's not much gain on having b) over a), perhaps it could be an optional env var for such corner case applications. >> >> I disagree on these conclusions. >> >> c) is certainly better than a), but it's not "clearly the best" in all cases. The OpenGL UMD is not a privileged/special component and is in no position to decide whether or not the process as a whole (only some thread(s) of which may use OpenGL at all) gets to continue running or not. > > That's not true. Which part of what I wrote are you referring to? > I recommend that you enable b) with your driver and then hang the GPU under different scenarios and see the result. I've been doing GPU driver development for over two decades, I'm perfectly aware what it means. It doesn't change what I wrote above.
On Tue, 2023-07-25 at 19:00 +0200, Michel Dänzer wrote: > On 7/25/23 17:05, Marek Olšák wrote: > > On Tue, Jul 25, 2023 at 4:03 AM Michel Dänzer > > <michel.daenzer@mailbox.org> wrote: > > > On 7/25/23 04:55, André Almeida wrote: > > > > Hi everyone, > > > > > > > > It's not clear what we should do about non-robust OpenGL apps > > > > after GPU resets, so I'll try to summarize the topic, show some > > > > options and my proposal to move forward on that. > > > > > > > > Em 27/06/2023 10:23, André Almeida escreveu: > > > > > +Robustness > > > > > +---------- > > > > > + > > > > > +The only way to try to keep an application working after a > > > > > reset is if it > > > > > +complies with the robustness aspects of the graphical API > > > > > that it is using. > > > > > + > > > > > +Graphical APIs provide ways to applications to deal with > > > > > device resets. However, > > > > > +there is no guarantee that the app will use such features > > > > > correctly, and the > > > > > +UMD can implement policies to close the app if it is a > > > > > repeating offender, > > > > > +likely in a broken loop. This is done to ensure that it does > > > > > not keep blocking > > > > > +the user interface from being correctly displayed. This > > > > > should be done even if > > > > > +the app is correct but happens to trigger some bug in the > > > > > hardware/driver. > > > > > + > > > > Depending on the OpenGL version, there are different robustness > > > > API available: > > > > > > > > - OpenGL ABR extension [0] > > > > - OpenGL KHR extension [1] > > > > - OpenGL ES extension [2] > > > > > > > > Apps written in OpenGL should use whatever version is available > > > > for them to make the app robust for GPU resets. That usually > > > > means calling GetGraphicsResetStatusARB(), checking the status, > > > > and if it encounter something different from NO_ERROR, that > > > > means that a reset has happened, the context is considered lost > > > > and should be recreated. If an app follow this, it will likely > > > > succeed recovering a reset. > > > > > > > > What should non-robustness apps do then? They certainly will > > > > not be notified if a reset happens, and thus can't recover if > > > > their context is lost. OpenGL specification does not explicitly > > > > define what should be done in such situations[3], and I believe > > > > that usually when the spec mandates to close the app, it would > > > > explicitly note it. > > > > > > > > However, in reality there are different types of device resets, > > > > causing different results. A reset can be precise enough to > > > > damage only the guilty context, and keep others alive. > > > > > > > > Given that, I believe drivers have the following options: > > > > > > > > a) Kill all non-robust apps after a reset. This may lead to > > > > lose work from innocent applications. > > > > > > > > b) Ignore all non-robust apps OpenGL calls. That means that > > > > applications would still be alive, but the user interface would > > > > be freeze. The user would need to close it manually anyway, but > > > > in some corner cases, the app could autosave some work or the > > > > user might be able to interact with it using some alternative > > > > method (command line?). > > > > > > > > c) Kill just the affected non-robust applications. To do that, > > > > the driver need to be 100% sure on the impact of its resets. > > > > > > > > RadeonSI currently implements a), as can be seen at [4], while > > > > Iris implements what I think it's c)[5]. > > > > > > > > For the user experience point-of-view, c) is clearly the best > > > > option, but it's the hardest to archive. There's not much gain > > > > on having b) over a), perhaps it could be an optional env var > > > > for such corner case applications. > > > > > > I disagree on these conclusions. > > > > > > c) is certainly better than a), but it's not "clearly the best" > > > in all cases. The OpenGL UMD is not a privileged/special > > > component and is in no position to decide whether or not the > > > process as a whole (only some thread(s) of which may use OpenGL > > > at all) gets to continue running or not. > > > > That's not true. > > Which part of what I wrote are you referring to? > > > > I recommend that you enable b) with your driver and then hang the > > GPU under different scenarios and see the result. > > I've been doing GPU driver development for over two decades, I'm > perfectly aware what it means. It doesn't change what I wrote above. > Michel, I understand that you disagree with the proposed solutions in this email thread but from your mails it is unclear to me what exactly is the solution that you would actually recommend, can you please clarify? Thanks, Timur
On 7/25/23 15:02, André Almeida wrote: > Em 25/07/2023 05:03, Michel Dänzer escreveu: >> On 7/25/23 04:55, André Almeida wrote: >>> Hi everyone, >>> >>> It's not clear what we should do about non-robust OpenGL apps after GPU resets, so I'll try to summarize the topic, show some options and my proposal to move forward on that. >>> >>> Em 27/06/2023 10:23, André Almeida escreveu: >>>> +Robustness >>>> +---------- >>>> + >>>> +The only way to try to keep an application working after a reset is if it >>>> +complies with the robustness aspects of the graphical API that it is using. >>>> + >>>> +Graphical APIs provide ways to applications to deal with device resets. However, >>>> +there is no guarantee that the app will use such features correctly, and the >>>> +UMD can implement policies to close the app if it is a repeating offender, >>>> +likely in a broken loop. This is done to ensure that it does not keep blocking >>>> +the user interface from being correctly displayed. This should be done even if >>>> +the app is correct but happens to trigger some bug in the hardware/driver. >>>> + >>> Depending on the OpenGL version, there are different robustness API available: >>> >>> - OpenGL ABR extension [0] >>> - OpenGL KHR extension [1] >>> - OpenGL ES extension [2] >>> >>> Apps written in OpenGL should use whatever version is available for them to make the app robust for GPU resets. That usually means calling GetGraphicsResetStatusARB(), checking the status, and if it encounter something different from NO_ERROR, that means that a reset has happened, the context is considered lost and should be recreated. If an app follow this, it will likely succeed recovering a reset. >>> >>> What should non-robustness apps do then? They certainly will not be notified if a reset happens, and thus can't recover if their context is lost. OpenGL specification does not explicitly define what should be done in such situations[3], and I believe that usually when the spec mandates to close the app, it would explicitly note it. >>> >>> However, in reality there are different types of device resets, causing different results. A reset can be precise enough to damage only the guilty context, and keep others alive. >>> >>> Given that, I believe drivers have the following options: >>> >>> a) Kill all non-robust apps after a reset. This may lead to lose work from innocent applications. >>> >>> b) Ignore all non-robust apps OpenGL calls. That means that applications would still be alive, but the user interface would be freeze. The user would need to close it manually anyway, but in some corner cases, the app could autosave some work or the user might be able to interact with it using some alternative method (command line?). >>> >>> c) Kill just the affected non-robust applications. To do that, the driver need to be 100% sure on the impact of its resets. >>> >>> RadeonSI currently implements a), as can be seen at [4], while Iris implements what I think it's c)[5]. >>> >>> For the user experience point-of-view, c) is clearly the best option, but it's the hardest to archive. There's not much gain on having b) over a), perhaps it could be an optional env var for such corner case applications. >> >> I disagree on these conclusions. >> >> c) is certainly better than a), but it's not "clearly the best" in all cases. The OpenGL UMD is not a privileged/special component and is in no position to decide whether or not the process as a whole (only some thread(s) of which may use OpenGL at all) gets to continue running or not. >> > > Thank you for the feedback. How do you think the documentation should look like for this part? The initial paragraph about robustness should say "keep OpenGL working" instead of "keep an application working". If an OpenGL context stops working, it doesn't necessarily mean the application stops working altogether. If the application doesn't use the robustness extensions, your option b) is what should happen by default whenever possible. And it really has to be possible if the robustness extensions are supported.
A screen that doesn't update isn't usable. Killing the window system and returning to the login screen is one option. Killing the window system manually from a terminal or over ssh and then returning to the login screen is another option, but 99% of users either hard-reset the machine or do sysrq+REISUB anyway because it's faster that way. Those are all your options. If we don't do the kill, users might decide to do a hard reset with an unsync'd file system, which can cause more damage. The precedent from the CPU land is pretty strong here. There is SIGSEGV for invalid CPU memory access and SIGILL for invalid CPU instructions, yet we do nothing for invalid GPU memory access and invalid GPU instructions. Sending a terminating signal from the kernel would be the most natural thing to do. Instead, we just keep a frozen GUI to keep users helpless, or we continue command submission and then the hanging app can cause an infinite cycle of GPU hangs and resets, making the GPU unusable until somebody kills the app over ssh. That's why GL/Vulkan robustness is required - either robust apps, or a robust compositor that greys out lost windows and pops up a diagnostic message with a list of actions to choose from. That's the direction we should be taking. Non-robust apps under a non-robust compositor should just be killed if they crash the GPU. Marek On Wed, Jul 26, 2023 at 4:07 AM Michel Dänzer <michel.daenzer@mailbox.org> wrote: > > On 7/25/23 15:02, André Almeida wrote: > > Em 25/07/2023 05:03, Michel Dänzer escreveu: > >> On 7/25/23 04:55, André Almeida wrote: > >>> Hi everyone, > >>> > >>> It's not clear what we should do about non-robust OpenGL apps after GPU resets, so I'll try to summarize the topic, show some options and my proposal to move forward on that. > >>> > >>> Em 27/06/2023 10:23, André Almeida escreveu: > >>>> +Robustness > >>>> +---------- > >>>> + > >>>> +The only way to try to keep an application working after a reset is if it > >>>> +complies with the robustness aspects of the graphical API that it is using. > >>>> + > >>>> +Graphical APIs provide ways to applications to deal with device resets. However, > >>>> +there is no guarantee that the app will use such features correctly, and the > >>>> +UMD can implement policies to close the app if it is a repeating offender, > >>>> +likely in a broken loop. This is done to ensure that it does not keep blocking > >>>> +the user interface from being correctly displayed. This should be done even if > >>>> +the app is correct but happens to trigger some bug in the hardware/driver. > >>>> + > >>> Depending on the OpenGL version, there are different robustness API available: > >>> > >>> - OpenGL ABR extension [0] > >>> - OpenGL KHR extension [1] > >>> - OpenGL ES extension [2] > >>> > >>> Apps written in OpenGL should use whatever version is available for them to make the app robust for GPU resets. That usually means calling GetGraphicsResetStatusARB(), checking the status, and if it encounter something different from NO_ERROR, that means that a reset has happened, the context is considered lost and should be recreated. If an app follow this, it will likely succeed recovering a reset. > >>> > >>> What should non-robustness apps do then? They certainly will not be notified if a reset happens, and thus can't recover if their context is lost. OpenGL specification does not explicitly define what should be done in such situations[3], and I believe that usually when the spec mandates to close the app, it would explicitly note it. > >>> > >>> However, in reality there are different types of device resets, causing different results. A reset can be precise enough to damage only the guilty context, and keep others alive. > >>> > >>> Given that, I believe drivers have the following options: > >>> > >>> a) Kill all non-robust apps after a reset. This may lead to lose work from innocent applications. > >>> > >>> b) Ignore all non-robust apps OpenGL calls. That means that applications would still be alive, but the user interface would be freeze. The user would need to close it manually anyway, but in some corner cases, the app could autosave some work or the user might be able to interact with it using some alternative method (command line?). > >>> > >>> c) Kill just the affected non-robust applications. To do that, the driver need to be 100% sure on the impact of its resets. > >>> > >>> RadeonSI currently implements a), as can be seen at [4], while Iris implements what I think it's c)[5]. > >>> > >>> For the user experience point-of-view, c) is clearly the best option, but it's the hardest to archive. There's not much gain on having b) over a), perhaps it could be an optional env var for such corner case applications. > >> > >> I disagree on these conclusions. > >> > >> c) is certainly better than a), but it's not "clearly the best" in all cases. The OpenGL UMD is not a privileged/special component and is in no position to decide whether or not the process as a whole (only some thread(s) of which may use OpenGL at all) gets to continue running or not. > >> > > > > Thank you for the feedback. How do you think the documentation should look like for this part? > > The initial paragraph about robustness should say "keep OpenGL working" instead of "keep an application working". If an OpenGL context stops working, it doesn't necessarily mean the application stops working altogether. > > > If the application doesn't use the robustness extensions, your option b) is what should happen by default whenever possible. And it really has to be possible if the robustness extensions are supported. > > > -- > Earthling Michel Dänzer | https://redhat.com > Libre software enthusiast | Mesa and Xwayland developer >
On 8/2/23 09:38, Marek Olšák wrote: > > The precedent from the CPU land is pretty strong here. There is > SIGSEGV for invalid CPU memory access and SIGILL for invalid CPU > instructions, yet we do nothing for invalid GPU memory access and > invalid GPU instructions. Sending a terminating signal from the kernel > would be the most natural thing to do. After an unhandled SIGSEGV or SIGILL, the process is in an inconsistent state and cannot safely continue executing. That's why the process is terminated by default in those cases. The same is not true when an OpenGL context stops working. Any threads / other parts of the process not using that OpenGL context continue working normally. And any attempts to use that OpenGL context can be safely ignored (or the OpenGL implementation couldn't support the robustness extensions).
On Tue, Jun 27, 2023 at 10:23:23AM -0300, André Almeida wrote: > Create a section that specifies how to deal with DRM device resets for > kernel and userspace drivers. > > Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com> > Signed-off-by: André Almeida <andrealmeid@igalia.com> > --- > > v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/ > > Changes: > - Grammar fixes (Randy) > > Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++ > 1 file changed, 68 insertions(+) > > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst > index 65fb3036a580..3cbffa25ed93 100644 > --- a/Documentation/gpu/drm-uapi.rst > +++ b/Documentation/gpu/drm-uapi.rst > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for > mmapped regular files. Threads cause additional pain with signal > handling as well. > > +Device reset > +============ > + > +The GPU stack is really complex and is prone to errors, from hardware bugs, > +faulty applications and everything in between the many layers. Some errors > +require resetting the device in order to make the device usable again. This > +sections describes the expectations for DRM and usermode drivers when a > +device resets and how to propagate the reset status. > + > +Kernel Mode Driver > +------------------ > + > +The KMD is responsible for checking if the device needs a reset, and to perform > +it as needed. Usually a hang is detected when a job gets stuck executing. KMD > +should keep track of resets, because userspace can query any time about the > +reset stats for an specific context. This is needed to propagate to the rest of > +the stack that a reset has happened. Currently, this is implemented by each > +driver separately, with no common DRM interface. > + > +User Mode Driver > +---------------- > + > +The UMD should check before submitting new commands to the KMD if the device has > +been reset, and this can be checked more often if the UMD requires it. After > +detecting a reset, UMD will then proceed to report it to the application using > +the appropriate API error code, as explained in the section below about > +robustness. > + > +Robustness > +---------- > + > +The only way to try to keep an application working after a reset is if it > +complies with the robustness aspects of the graphical API that it is using. > + > +Graphical APIs provide ways to applications to deal with device resets. However, > +there is no guarantee that the app will use such features correctly, and the > +UMD can implement policies to close the app if it is a repeating offender, Not sure whether this one here is due to my input, but s/UMD/KMD. Repeat offender killing is more a policy where the kernel enforces policy, and no longer up to userspace to dtrt (because very clearly userspace is not really doing the right thing anymore when it's just hanging the gpu in an endless loop). Also maybe tune it down further to something like "the kernel driver may implemnent ..." In my opinion the umd shouldn't implement these kind of magic guesses, the entire point of robustness apis is to delegate responsibility for correctly recovering to the application. And the kernel is left with enforcing fair resource usage policies (which eventually might be a cgroups limit on how much gpu time you're allowed to waste with gpu resets). > +likely in a broken loop. This is done to ensure that it does not keep blocking > +the user interface from being correctly displayed. This should be done even if > +the app is correct but happens to trigger some bug in the hardware/driver. > + > +OpenGL > +~~~~~~ > + > +Apps using OpenGL should use the available robust interfaces, like the > +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This > +interface tells if a reset has happened, and if so, all the context state is > +considered lost and the app proceeds by creating new ones. If it is possible to > +determine that robustness is not in use, the UMD will terminate the app when a > +reset is detected, giving that the contexts are lost and the app won't be able > +to figure this out and recreate the contexts. > + > +Vulkan > +~~~~~~ > + > +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions. > +This error code means, among other things, that a device reset has happened and > +it needs to recreate the contexts to keep going. > + > +Reporting causes of resets > +-------------------------- > + > +Apart from propagating the reset through the stack so apps can recover, it's > +really useful for driver developers to learn more about what caused the reset in > +first place. DRM devices should make use of devcoredump to store relevant > +information about the reset, so this information can be added to user bug > +reports. Since we do not seem to have a solid consensus in the community about non-robust userspace, maybe we could just document that lack of consensus to unblock this patch? Something like this: Non-Robust Userspace -------------------- Userspace that doesn't support robust interfaces (like an non-robust OpenGL context or API without any robustness support like libva) leave the robustness handling entirely to the userspace driver. There is no strong community consensus on what the userspace driver should do in that case, since all reasonable approaches have some clear downsides. With the s/UMD/KMD/ further up and maybe something added to record the non-robustness non-consensus: Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> Cheers, Daniel > + > .. _drm_driver_ioctl: > > IOCTL Support on Device Nodes > -- > 2.41.0 >
On Fri, Aug 4, 2023 at 3:03 PM Daniel Vetter <daniel@ffwll.ch> wrote: > > On Tue, Jun 27, 2023 at 10:23:23AM -0300, André Almeida wrote: > > Create a section that specifies how to deal with DRM device resets for > > kernel and userspace drivers. > > > > Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com> > > Signed-off-by: André Almeida <andrealmeid@igalia.com> > > --- > > > > v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/ > > > > Changes: > > - Grammar fixes (Randy) > > > > Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++ > > 1 file changed, 68 insertions(+) > > > > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst > > index 65fb3036a580..3cbffa25ed93 100644 > > --- a/Documentation/gpu/drm-uapi.rst > > +++ b/Documentation/gpu/drm-uapi.rst > > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for > > mmapped regular files. Threads cause additional pain with signal > > handling as well. > > > > +Device reset > > +============ > > + > > +The GPU stack is really complex and is prone to errors, from hardware bugs, > > +faulty applications and everything in between the many layers. Some errors > > +require resetting the device in order to make the device usable again. This > > +sections describes the expectations for DRM and usermode drivers when a > > +device resets and how to propagate the reset status. > > + > > +Kernel Mode Driver > > +------------------ > > + > > +The KMD is responsible for checking if the device needs a reset, and to perform > > +it as needed. Usually a hang is detected when a job gets stuck executing. KMD > > +should keep track of resets, because userspace can query any time about the > > +reset stats for an specific context. This is needed to propagate to the rest of > > +the stack that a reset has happened. Currently, this is implemented by each > > +driver separately, with no common DRM interface. > > + > > +User Mode Driver > > +---------------- > > + > > +The UMD should check before submitting new commands to the KMD if the device has > > +been reset, and this can be checked more often if the UMD requires it. After > > +detecting a reset, UMD will then proceed to report it to the application using > > +the appropriate API error code, as explained in the section below about > > +robustness. > > + > > +Robustness > > +---------- > > + > > +The only way to try to keep an application working after a reset is if it > > +complies with the robustness aspects of the graphical API that it is using. > > + > > +Graphical APIs provide ways to applications to deal with device resets. However, > > +there is no guarantee that the app will use such features correctly, and the > > +UMD can implement policies to close the app if it is a repeating offender, > > Not sure whether this one here is due to my input, but s/UMD/KMD. Repeat > offender killing is more a policy where the kernel enforces policy, and no > longer up to userspace to dtrt (because very clearly userspace is not > really doing the right thing anymore when it's just hanging the gpu in an > endless loop). Also maybe tune it down further to something like "the > kernel driver may implemnent ..." > > In my opinion the umd shouldn't implement these kind of magic guesses, the > entire point of robustness apis is to delegate responsibility for > correctly recovering to the application. And the kernel is left with > enforcing fair resource usage policies (which eventually might be a > cgroups limit on how much gpu time you're allowed to waste with gpu > resets). Killing apps that the kernel thinks are misbehaving really doesn't seem like a good idea to me. What if the process is a service getting restarted after getting killed? What if killing that process leaves the system in a bad state? Can't the kernel provide some information to user space so that e.g. systemd can handle those situations? > > +likely in a broken loop. This is done to ensure that it does not keep blocking > > +the user interface from being correctly displayed. This should be done even if > > +the app is correct but happens to trigger some bug in the hardware/driver. > > + > > +OpenGL > > +~~~~~~ > > + > > +Apps using OpenGL should use the available robust interfaces, like the > > +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This > > +interface tells if a reset has happened, and if so, all the context state is > > +considered lost and the app proceeds by creating new ones. If it is possible to > > +determine that robustness is not in use, the UMD will terminate the app when a > > +reset is detected, giving that the contexts are lost and the app won't be able > > +to figure this out and recreate the contexts. > > + > > +Vulkan > > +~~~~~~ > > + > > +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions. > > +This error code means, among other things, that a device reset has happened and > > +it needs to recreate the contexts to keep going. > > + > > +Reporting causes of resets > > +-------------------------- > > + > > +Apart from propagating the reset through the stack so apps can recover, it's > > +really useful for driver developers to learn more about what caused the reset in > > +first place. DRM devices should make use of devcoredump to store relevant > > +information about the reset, so this information can be added to user bug > > +reports. > > Since we do not seem to have a solid consensus in the community about > non-robust userspace, maybe we could just document that lack of consensus > to unblock this patch? Something like this: > > Non-Robust Userspace > -------------------- > > Userspace that doesn't support robust interfaces (like an non-robust > OpenGL context or API without any robustness support like libva) leave the > robustness handling entirely to the userspace driver. There is no strong > community consensus on what the userspace driver should do in that case, > since all reasonable approaches have some clear downsides. > > With the s/UMD/KMD/ further up and maybe something added to record the > non-robustness non-consensus: > > Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> > > Cheers, Daniel > > > > > + > > .. _drm_driver_ioctl: > > > > IOCTL Support on Device Nodes > > -- > > 2.41.0 > > > > -- > Daniel Vetter > Software Engineer, Intel Corporation > http://blog.ffwll.ch >
It's the same situation as SIGSEGV. A process can catch the signal, but if it doesn't, it gets killed. GL and Vulkan APIs give you a way to catch the GPU error and prevent the process termination. If you don't use the API, you'll get undefined behavior, which means anything can happen, including process termination. Marek On Tue, Aug 8, 2023 at 8:14 AM Sebastian Wick <sebastian.wick@redhat.com> wrote: > > On Fri, Aug 4, 2023 at 3:03 PM Daniel Vetter <daniel@ffwll.ch> wrote: > > > > On Tue, Jun 27, 2023 at 10:23:23AM -0300, André Almeida wrote: > > > Create a section that specifies how to deal with DRM device resets for > > > kernel and userspace drivers. > > > > > > Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com> > > > Signed-off-by: André Almeida <andrealmeid@igalia.com> > > > --- > > > > > > v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/ > > > > > > Changes: > > > - Grammar fixes (Randy) > > > > > > Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++ > > > 1 file changed, 68 insertions(+) > > > > > > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst > > > index 65fb3036a580..3cbffa25ed93 100644 > > > --- a/Documentation/gpu/drm-uapi.rst > > > +++ b/Documentation/gpu/drm-uapi.rst > > > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for > > > mmapped regular files. Threads cause additional pain with signal > > > handling as well. > > > > > > +Device reset > > > +============ > > > + > > > +The GPU stack is really complex and is prone to errors, from hardware bugs, > > > +faulty applications and everything in between the many layers. Some errors > > > +require resetting the device in order to make the device usable again. This > > > +sections describes the expectations for DRM and usermode drivers when a > > > +device resets and how to propagate the reset status. > > > + > > > +Kernel Mode Driver > > > +------------------ > > > + > > > +The KMD is responsible for checking if the device needs a reset, and to perform > > > +it as needed. Usually a hang is detected when a job gets stuck executing. KMD > > > +should keep track of resets, because userspace can query any time about the > > > +reset stats for an specific context. This is needed to propagate to the rest of > > > +the stack that a reset has happened. Currently, this is implemented by each > > > +driver separately, with no common DRM interface. > > > + > > > +User Mode Driver > > > +---------------- > > > + > > > +The UMD should check before submitting new commands to the KMD if the device has > > > +been reset, and this can be checked more often if the UMD requires it. After > > > +detecting a reset, UMD will then proceed to report it to the application using > > > +the appropriate API error code, as explained in the section below about > > > +robustness. > > > + > > > +Robustness > > > +---------- > > > + > > > +The only way to try to keep an application working after a reset is if it > > > +complies with the robustness aspects of the graphical API that it is using. > > > + > > > +Graphical APIs provide ways to applications to deal with device resets. However, > > > +there is no guarantee that the app will use such features correctly, and the > > > +UMD can implement policies to close the app if it is a repeating offender, > > > > Not sure whether this one here is due to my input, but s/UMD/KMD. Repeat > > offender killing is more a policy where the kernel enforces policy, and no > > longer up to userspace to dtrt (because very clearly userspace is not > > really doing the right thing anymore when it's just hanging the gpu in an > > endless loop). Also maybe tune it down further to something like "the > > kernel driver may implemnent ..." > > > > In my opinion the umd shouldn't implement these kind of magic guesses, the > > entire point of robustness apis is to delegate responsibility for > > correctly recovering to the application. And the kernel is left with > > enforcing fair resource usage policies (which eventually might be a > > cgroups limit on how much gpu time you're allowed to waste with gpu > > resets). > > Killing apps that the kernel thinks are misbehaving really doesn't > seem like a good idea to me. What if the process is a service getting > restarted after getting killed? What if killing that process leaves > the system in a bad state? > > Can't the kernel provide some information to user space so that e.g. > systemd can handle those situations? > > > > +likely in a broken loop. This is done to ensure that it does not keep blocking > > > +the user interface from being correctly displayed. This should be done even if > > > +the app is correct but happens to trigger some bug in the hardware/driver. > > > + > > > +OpenGL > > > +~~~~~~ > > > + > > > +Apps using OpenGL should use the available robust interfaces, like the > > > +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This > > > +interface tells if a reset has happened, and if so, all the context state is > > > +considered lost and the app proceeds by creating new ones. If it is possible to > > > +determine that robustness is not in use, the UMD will terminate the app when a > > > +reset is detected, giving that the contexts are lost and the app won't be able > > > +to figure this out and recreate the contexts. > > > + > > > +Vulkan > > > +~~~~~~ > > > + > > > +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions. > > > +This error code means, among other things, that a device reset has happened and > > > +it needs to recreate the contexts to keep going. > > > + > > > +Reporting causes of resets > > > +-------------------------- > > > + > > > +Apart from propagating the reset through the stack so apps can recover, it's > > > +really useful for driver developers to learn more about what caused the reset in > > > +first place. DRM devices should make use of devcoredump to store relevant > > > +information about the reset, so this information can be added to user bug > > > +reports. > > > > Since we do not seem to have a solid consensus in the community about > > non-robust userspace, maybe we could just document that lack of consensus > > to unblock this patch? Something like this: > > > > Non-Robust Userspace > > -------------------- > > > > Userspace that doesn't support robust interfaces (like an non-robust > > OpenGL context or API without any robustness support like libva) leave the > > robustness handling entirely to the userspace driver. There is no strong > > community consensus on what the userspace driver should do in that case, > > since all reasonable approaches have some clear downsides. > > > > With the s/UMD/KMD/ further up and maybe something added to record the > > non-robustness non-consensus: > > > > Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> > > > > Cheers, Daniel > > > > > > > > > + > > > .. _drm_driver_ioctl: > > > > > > IOCTL Support on Device Nodes > > > -- > > > 2.41.0 > > > > > > > -- > > Daniel Vetter > > Software Engineer, Intel Corporation > > http://blog.ffwll.ch > > >
On 8/8/23 19:03, Marek Olšák wrote: > It's the same situation as SIGSEGV. A process can catch the signal, > but if it doesn't, it gets killed. GL and Vulkan APIs give you a way > to catch the GPU error and prevent the process termination. If you > don't use the API, you'll get undefined behavior, which means anything > can happen, including process termination. Got a spec reference for that? I know the spec allows process termination in response to e.g. out of bounds buffer access by the application (which corresponds to SIGSEGV). There are other causes for GPU hangs though, e.g. driver bugs. The ARB_robustness spec says: If the reset notification behavior is NO_RESET_NOTIFICATION_ARB, then the implementation will never deliver notification of reset events, and GetGraphicsResetStatusARB will always return NO_ERROR[fn1]. [fn1: In this case it is recommended that implementations should not allow loss of context state no matter what events occur. However, this is only a recommendation, and cannot be relied upon by applications.] No mention of process termination, that rather sounds to me like the GL implementation should do its best to keep the application running.
On Wed, Aug 9, 2023 at 3:35 AM Michel Dänzer <michel.daenzer@mailbox.org> wrote: > > On 8/8/23 19:03, Marek Olšák wrote: > > It's the same situation as SIGSEGV. A process can catch the signal, > > but if it doesn't, it gets killed. GL and Vulkan APIs give you a way > > to catch the GPU error and prevent the process termination. If you > > don't use the API, you'll get undefined behavior, which means anything > > can happen, including process termination. > > Got a spec reference for that? > > I know the spec allows process termination in response to e.g. out of bounds buffer access by the application (which corresponds to SIGSEGV). There are other causes for GPU hangs though, e.g. driver bugs. The ARB_robustness spec says: > > If the reset notification behavior is NO_RESET_NOTIFICATION_ARB, > then the implementation will never deliver notification of reset > events, and GetGraphicsResetStatusARB will always return > NO_ERROR[fn1]. > [fn1: In this case it is recommended that implementations should > not allow loss of context state no matter what events occur. > However, this is only a recommendation, and cannot be relied > upon by applications.] > > No mention of process termination, that rather sounds to me like the GL implementation should do its best to keep the application running. It basically says that we can do anything. A frozen window or flipping between 2 random frames can't be described as "keeping the application running". That's the worst user experience. I will not accept it. A window system can force-enable robustness for its non-robust apps and control that. That's the best possible user experience and it's achievable everywhere. Everything else doesn't matter. Marek Marek
On 8/9/23 21:15, Marek Olšák wrote: > On Wed, Aug 9, 2023 at 3:35 AM Michel Dänzer <michel.daenzer@mailbox.org> wrote: >> On 8/8/23 19:03, Marek Olšák wrote: >>> It's the same situation as SIGSEGV. A process can catch the signal, >>> but if it doesn't, it gets killed. GL and Vulkan APIs give you a way >>> to catch the GPU error and prevent the process termination. If you >>> don't use the API, you'll get undefined behavior, which means anything >>> can happen, including process termination. >> >> Got a spec reference for that? >> >> I know the spec allows process termination in response to e.g. out of bounds buffer access by the application (which corresponds to SIGSEGV). There are other causes for GPU hangs though, e.g. driver bugs. The ARB_robustness spec says: >> >> If the reset notification behavior is NO_RESET_NOTIFICATION_ARB, >> then the implementation will never deliver notification of reset >> events, and GetGraphicsResetStatusARB will always return >> NO_ERROR[fn1]. >> [fn1: In this case it is recommended that implementations should >> not allow loss of context state no matter what events occur. >> However, this is only a recommendation, and cannot be relied >> upon by applications.] >> >> No mention of process termination, that rather sounds to me like the GL implementation should do its best to keep the application running. > > It basically says that we can do anything. Not really? If program termination is a possible outcome, the spec otherwise mentions that explicitly, ala "including program termination". > A frozen window or flipping between 2 random frames can't be described > as "keeping the application running". This assumes that an application which uses OpenGL cannot have any other purpose than using OpenGL.
diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst index 65fb3036a580..3cbffa25ed93 100644 --- a/Documentation/gpu/drm-uapi.rst +++ b/Documentation/gpu/drm-uapi.rst @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for mmapped regular files. Threads cause additional pain with signal handling as well. +Device reset +============ + +The GPU stack is really complex and is prone to errors, from hardware bugs, +faulty applications and everything in between the many layers. Some errors +require resetting the device in order to make the device usable again. This +sections describes the expectations for DRM and usermode drivers when a +device resets and how to propagate the reset status. + +Kernel Mode Driver +------------------ + +The KMD is responsible for checking if the device needs a reset, and to perform +it as needed. Usually a hang is detected when a job gets stuck executing. KMD +should keep track of resets, because userspace can query any time about the +reset stats for an specific context. This is needed to propagate to the rest of +the stack that a reset has happened. Currently, this is implemented by each +driver separately, with no common DRM interface. + +User Mode Driver +---------------- + +The UMD should check before submitting new commands to the KMD if the device has +been reset, and this can be checked more often if the UMD requires it. After +detecting a reset, UMD will then proceed to report it to the application using +the appropriate API error code, as explained in the section below about +robustness. + +Robustness +---------- + +The only way to try to keep an application working after a reset is if it +complies with the robustness aspects of the graphical API that it is using. + +Graphical APIs provide ways to applications to deal with device resets. However, +there is no guarantee that the app will use such features correctly, and the +UMD can implement policies to close the app if it is a repeating offender, +likely in a broken loop. This is done to ensure that it does not keep blocking +the user interface from being correctly displayed. This should be done even if +the app is correct but happens to trigger some bug in the hardware/driver. + +OpenGL +~~~~~~ + +Apps using OpenGL should use the available robust interfaces, like the +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This +interface tells if a reset has happened, and if so, all the context state is +considered lost and the app proceeds by creating new ones. If it is possible to +determine that robustness is not in use, the UMD will terminate the app when a +reset is detected, giving that the contexts are lost and the app won't be able +to figure this out and recreate the contexts. + +Vulkan +~~~~~~ + +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions. +This error code means, among other things, that a device reset has happened and +it needs to recreate the contexts to keep going. + +Reporting causes of resets +-------------------------- + +Apart from propagating the reset through the stack so apps can recover, it's +really useful for driver developers to learn more about what caused the reset in +first place. DRM devices should make use of devcoredump to store relevant +information about the reset, so this information can be added to user bug +reports. + .. _drm_driver_ioctl: IOCTL Support on Device Nodes