[v5,1/1] drm/doc: Document DRM device reset expectations

Message ID	20230627132323.115440-1-andrealmeid@igalia.com (mailing list archive)
State	New, archived

Commit Message

André Almeida June 27, 2023, 1:23 p.m. UTC

Create a section that specifies how to deal with DRM device resets for
kernel and userspace drivers.

Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com>
Signed-off-by: André Almeida <andrealmeid@igalia.com>
---

v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/

Changes:
 - Grammar fixes (Randy)

 Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)

Comments

Randy Dunlap June 27, 2023, 4:09 p.m. UTC | #1

Hi André,

I have just a few more below:

On 6/27/23 06:23, André Almeida wrote:
> Create a section that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
> 
> Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
> 
> v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/
> 
> Changes:
>  - Grammar fixes (Randy)
> 
>  Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
>  1 file changed, 68 insertions(+)
> 
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 65fb3036a580..3cbffa25ed93 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
>  mmapped regular files. Threads cause additional pain with signal
>  handling as well.
>  
> +Device reset
> +============
> +
> +The GPU stack is really complex and is prone to errors, from hardware bugs,
> +faulty applications and everything in between the many layers. Some errors
> +require resetting the device in order to make the device usable again. This
> +sections describes the expectations for DRM and usermode drivers when a

   section

> +device resets and how to propagate the reset status.
> +
> +Kernel Mode Driver
> +------------------
> +
> +The KMD is responsible for checking if the device needs a reset, and to perform
> +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> +should keep track of resets, because userspace can query any time about the
> +reset stats for an specific context. This is needed to propagate to the rest of

               for a specific

stats or status?

> +the stack that a reset has happened. Currently, this is implemented by each
> +driver separately, with no common DRM interface.
> +
> +User Mode Driver
> +----------------
> +
> +The UMD should check before submitting new commands to the KMD if the device has
> +been reset, and this can be checked more often if the UMD requires it. After
> +detecting a reset, UMD will then proceed to report it to the application using
> +the appropriate API error code, as explained in the section below about
> +robustness.
> +
> +Robustness
> +----------
> +
> +The only way to try to keep an application working after a reset is if it
> +complies with the robustness aspects of the graphical API that it is using.
> +
> +Graphical APIs provide ways to applications to deal with device resets. However,
> +there is no guarantee that the app will use such features correctly, and the
> +UMD can implement policies to close the app if it is a repeating offender,
> +likely in a broken loop. This is done to ensure that it does not keep blocking
> +the user interface from being correctly displayed. This should be done even if
> +the app is correct but happens to trigger some bug in the hardware/driver.
> +
> +OpenGL
> +~~~~~~
> +
> +Apps using OpenGL should use the available robust interfaces, like the
> +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> +interface tells if a reset has happened, and if so, all the context state is
> +considered lost and the app proceeds by creating new ones. If it is possible to
> +determine that robustness is not in use, the UMD will terminate the app when a
> +reset is detected, giving that the contexts are lost and the app won't be able
> +to figure this out and recreate the contexts.
> +
> +Vulkan
> +~~~~~~
> +
> +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> +This error code means, among other things, that a device reset has happened and
> +it needs to recreate the contexts to keep going.
> +
> +Reporting causes of resets
> +--------------------------
> +
> +Apart from propagating the reset through the stack so apps can recover, it's
> +really useful for driver developers to learn more about what caused the reset in
> +first place. DRM devices should make use of devcoredump to store relevant

   the first place.

> +information about the reset, so this information can be added to user bug
> +reports.
> +
>  .. _drm_driver_ioctl:
>  
>  IOCTL Support on Device Nodes

and with those addressed:

Reviewed-by: Randy Dunlap <rdunlap@infradead.org>

Thanks for adding the documentation.

Christian König June 27, 2023, 5:47 p.m. UTC | #2

Am 27.06.23 um 15:23 schrieb André Almeida:
> Create a section that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
>
> Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
>
> v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/
>
> Changes:
>   - Grammar fixes (Randy)
>
>   Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
>   1 file changed, 68 insertions(+)
>
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 65fb3036a580..3cbffa25ed93 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
>   mmapped regular files. Threads cause additional pain with signal
>   handling as well.
>   
> +Device reset
> +============
> +
> +The GPU stack is really complex and is prone to errors, from hardware bugs,
> +faulty applications and everything in between the many layers. Some errors
> +require resetting the device in order to make the device usable again. This
> +sections describes the expectations for DRM and usermode drivers when a
> +device resets and how to propagate the reset status.
> +
> +Kernel Mode Driver
> +------------------
> +
> +The KMD is responsible for checking if the device needs a reset, and to perform
> +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> +should keep track of resets, because userspace can query any time about the
> +reset stats for an specific context.

Maybe drop the part "for a specific context". Essentially the reset 
query could use global counters instead and we won't need the context 
any more here.

Apart from that this sounds good to me, feel free to add my rb.

Regards,
Christian.

>   This is needed to propagate to the rest of
> +the stack that a reset has happened. Currently, this is implemented by each
> +driver separately, with no common DRM interface.
> +
> +User Mode Driver
> +----------------
> +
> +The UMD should check before submitting new commands to the KMD if the device has
> +been reset, and this can be checked more often if the UMD requires it. After
> +detecting a reset, UMD will then proceed to report it to the application using
> +the appropriate API error code, as explained in the section below about
> +robustness.
> +
> +Robustness
> +----------
> +
> +The only way to try to keep an application working after a reset is if it
> +complies with the robustness aspects of the graphical API that it is using.
> +
> +Graphical APIs provide ways to applications to deal with device resets. However,
> +there is no guarantee that the app will use such features correctly, and the
> +UMD can implement policies to close the app if it is a repeating offender,
> +likely in a broken loop. This is done to ensure that it does not keep blocking
> +the user interface from being correctly displayed. This should be done even if
> +the app is correct but happens to trigger some bug in the hardware/driver.
> +
> +OpenGL
> +~~~~~~
> +
> +Apps using OpenGL should use the available robust interfaces, like the
> +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> +interface tells if a reset has happened, and if so, all the context state is
> +considered lost and the app proceeds by creating new ones. If it is possible to
> +determine that robustness is not in use, the UMD will terminate the app when a
> +reset is detected, giving that the contexts are lost and the app won't be able
> +to figure this out and recreate the contexts.
> +
> +Vulkan
> +~~~~~~
> +
> +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> +This error code means, among other things, that a device reset has happened and
> +it needs to recreate the contexts to keep going.
> +
> +Reporting causes of resets
> +--------------------------
> +
> +Apart from propagating the reset through the stack so apps can recover, it's
> +really useful for driver developers to learn more about what caused the reset in
> +first place. DRM devices should make use of devcoredump to store relevant
> +information about the reset, so this information can be added to user bug
> +reports.
> +
>   .. _drm_driver_ioctl:
>   
>   IOCTL Support on Device Nodes

Marek Olšák June 27, 2023, 6:57 p.m. UTC | #3

On Tue, Jun 27, 2023, 09:23 André Almeida <andrealmeid@igalia.com> wrote:

> Create a section that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
>
> Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
>
> v4:
> https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/
>
> Changes:
>  - Grammar fixes (Randy)
>
>  Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
>  1 file changed, 68 insertions(+)
>
> diff --git a/Documentation/gpu/drm-uapi.rst
> b/Documentation/gpu/drm-uapi.rst
> index 65fb3036a580..3cbffa25ed93 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third
> handler for
>  mmapped regular files. Threads cause additional pain with signal
>  handling as well.
>
> +Device reset
> +============
> +
> +The GPU stack is really complex and is prone to errors, from hardware
> bugs,
> +faulty applications and everything in between the many layers. Some errors
> +require resetting the device in order to make the device usable again.
> This
> +sections describes the expectations for DRM and usermode drivers when a
> +device resets and how to propagate the reset status.
> +
> +Kernel Mode Driver
> +------------------
> +
> +The KMD is responsible for checking if the device needs a reset, and to
> perform
> +it as needed. Usually a hang is detected when a job gets stuck executing.
> KMD
> +should keep track of resets, because userspace can query any time about
> the
> +reset stats for an specific context. This is needed to propagate to the
> rest of
> +the stack that a reset has happened. Currently, this is implemented by
> each
> +driver separately, with no common DRM interface.
> +
> +User Mode Driver
> +----------------
> +
> +The UMD should check before submitting new commands to the KMD if the
> device has
> +been reset, and this can be checked more often if the UMD requires it.
> After
> +detecting a reset, UMD will then proceed to report it to the application
> using
> +the appropriate API error code, as explained in the section below about
> +robustness.
>

The UMD won't check the device status before every command submission due
to ioctl overhead. Instead, the KMD should skip command submission and
return an error that it was skipped.

The only case where that won't be applicable is user queues where drivers
don't call into the kernel to submit work, but they do call into the kernel
to create a dma_fence. In that case, the call to create a dma_fence can
fail with an error.

Marek
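
The flow described above can be sketched as follows. This is only an illustration under stated assumptions, not code from any driver: every ``sketch_*`` name is hypothetical, the "reset epoch" bookkeeping is one possible way for a KMD to notice that a context predates a reset, and ``-ECANCELED`` is merely an example error value. The point is that the UMD issues no extra status ioctl; it learns about the reset from the return value of the submission it was going to make anyway.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical device state kept by the KMD (not a real DRM structure). */
struct sketch_kmd_dev {
	uint64_t reset_count;		/* bumped by the KMD on every device reset */
};

/* Hypothetical per-context state kept by the KMD. */
struct sketch_kmd_ctx {
	uint64_t reset_epoch;		/* dev->reset_count when the context was created */
	unsigned int jobs_queued;	/* number of jobs accepted so far */
};

/*
 * Stand-in for a command submission ioctl handler. If the device was reset
 * after the context was created, the job is skipped and an error is returned
 * instead of being queued. The error value is an assumption of this sketch.
 */
int sketch_kmd_submit(const struct sketch_kmd_dev *dev,
		      struct sketch_kmd_ctx *ctx)
{
	if (ctx->reset_epoch != dev->reset_count)
		return -ECANCELED;

	ctx->jobs_queued++;
	return 0;
}

/* What the UMD would hand back to the API layer. */
enum sketch_api_result {
	SKETCH_SUCCESS,
	SKETCH_DEVICE_LOST,
	SKETCH_OTHER_ERROR,
};

/* UMD flush path: submit, then translate the error from the KMD. */
enum sketch_api_result sketch_umd_flush(const struct sketch_kmd_dev *dev,
					struct sketch_kmd_ctx *ctx)
{
	int ret = sketch_kmd_submit(dev, ctx);

	if (ret == 0)
		return SKETCH_SUCCESS;
	if (ret == -ECANCELED)
		return SKETCH_DEVICE_LOST;
	return SKETCH_OTHER_ERROR;
}
```

For user queues, the same translation would presumably hang off the failing dma_fence creation call rather than off a submission ioctl.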

> +
> +Robustness
> +----------
> +
> +The only way to try to keep an application working after a reset is if it
> +complies with the robustness aspects of the graphical API that it is
> using.
> +
> +Graphical APIs provide ways to applications to deal with device resets.
> However,
> +there is no guarantee that the app will use such features correctly, and
> the
> +UMD can implement policies to close the app if it is a repeating offender,
> +likely in a broken loop. This is done to ensure that it does not keep
> blocking
> +the user interface from being correctly displayed. This should be done
> even if
> +the app is correct but happens to trigger some bug in the hardware/driver.
> +
> +OpenGL
> +~~~~~~
> +
> +Apps using OpenGL should use the available robust interfaces, like the
> +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES).
> This
> +interface tells if a reset has happened, and if so, all the context state
> is
> +considered lost and the app proceeds by creating new ones. If it is
> possible to
> +determine that robustness is not in use, the UMD will terminate the app
> when a
> +reset is detected, giving that the contexts are lost and the app won't be
> able
> +to figure this out and recreate the contexts.
> +
> +Vulkan
> +~~~~~~
> +
> +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for
> submissions.
> +This error code means, among other things, that a device reset has
> happened and
> +it needs to recreate the contexts to keep going.
> +
> +Reporting causes of resets
> +--------------------------
> +
> +Apart from propagating the reset through the stack so apps can recover,
> it's
> +really useful for driver developers to learn more about what caused the
> reset in
> +first place. DRM devices should make use of devcoredump to store relevant
> +information about the reset, so this information can be added to user bug
> +reports.
> +
>  .. _drm_driver_ioctl:
>
>  IOCTL Support on Device Nodes
> --
> 2.41.0
>
>

André Almeida June 27, 2023, 9:17 p.m. UTC | #4

Em 27/06/2023 14:47, Christian König escreveu:
> Am 27.06.23 um 15:23 schrieb André Almeida:
>> Create a section that specifies how to deal with DRM device resets for
>> kernel and userspace drivers.
>>
>> Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com>
>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>> ---
>>
>> v4: 
>> https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/
>>
>> Changes:
>>   - Grammar fixes (Randy)
>>
>>   Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
>>   1 file changed, 68 insertions(+)
>>
>> diff --git a/Documentation/gpu/drm-uapi.rst 
>> b/Documentation/gpu/drm-uapi.rst
>> index 65fb3036a580..3cbffa25ed93 100644
>> --- a/Documentation/gpu/drm-uapi.rst
>> +++ b/Documentation/gpu/drm-uapi.rst
>> @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a 
>> third handler for
>>   mmapped regular files. Threads cause additional pain with signal
>>   handling as well.
>> +Device reset
>> +============
>> +
>> +The GPU stack is really complex and is prone to errors, from hardware 
>> bugs,
>> +faulty applications and everything in between the many layers. Some 
>> errors
>> +require resetting the device in order to make the device usable 
>> again. This
>> +sections describes the expectations for DRM and usermode drivers when a
>> +device resets and how to propagate the reset status.
>> +
>> +Kernel Mode Driver
>> +------------------
>> +
>> +The KMD is responsible for checking if the device needs a reset, and 
>> to perform
>> +it as needed. Usually a hang is detected when a job gets stuck 
>> executing. KMD
>> +should keep track of resets, because userspace can query any time 
>> about the
>> +reset stats for an specific context.
> 
> Maybe drop the part "for a specific context". Essentially the reset 
> query could use global counters instead and we won't need the context 
> any more here.
> 

Right, I wrote it like this to reflect how it's currently implemented.

If I follow correctly what you meant, KMD could always report the global
count to the UMD, and we would move the responsibility for managing the
reset counters to the UMD, right? This would also simplify my
DRM_IOCTL_GET_RESET proposal. I'll apply your suggestion to the next doc
version.

> Apart from that this sounds good to me, feel free to add my rb.
> 
> Regards,
> Christian.
> 
>

André Almeida June 27, 2023, 9:31 p.m. UTC | #5

Hi Marek,

Em 27/06/2023 15:57, Marek Olšák escreveu:
> On Tue, Jun 27, 2023, 09:23 André Almeida <andrealmeid@igalia.com 
> <mailto:andrealmeid@igalia.com>> wrote:
> 
>     +User Mode Driver
>     +----------------
>     +
>     +The UMD should check before submitting new commands to the KMD if
>     the device has
>     +been reset, and this can be checked more often if the UMD requires
>     it. After
>     +detecting a reset, UMD will then proceed to report it to the
>     application using
>     +the appropriate API error code, as explained in the section below about
>     +robustness.
> 
> 
> The UMD won't check the device status before every command submission 
> due to ioctl overhead. Instead, the KMD should skip command submission 
> and return an error that it was skipped.

I wrote it this way because, when reading the source code for
vk::check_status()[0] and Gallium's si_flush_gfx_cs()[1], I was under
the impression that UMD checks the reset status before every
submission/flush.

Is your comment about how things are currently implemented, or how
they would ideally work? Either way I can apply your suggestion, I just
want to make it clear.

[0] 
https://elixir.bootlin.com/mesa/mesa-23.1.3/source/src/vulkan/runtime/vk_device.h#L142
[1] 
https://elixir.bootlin.com/mesa/mesa-23.1.3/source/src/gallium/drivers/radeonsi/si_gfx_cs.c#L83

> 
> The only case where that won't be applicable is user queues where 
> drivers don't call into the kernel to submit work, but they do call into 
> the kernel to create a dma_fence. In that case, the call to create a 
> dma_fence can fail with an error.
> 
> Marek

Marek Olšák June 28, 2023, 12:36 a.m. UTC | #6

On Tue, Jun 27, 2023 at 5:31 PM André Almeida <andrealmeid@igalia.com>
wrote:

> Hi Marek,
>
> Em 27/06/2023 15:57, Marek Olšák escreveu:
> > On Tue, Jun 27, 2023, 09:23 André Almeida <andrealmeid@igalia.com
> > <mailto:andrealmeid@igalia.com>> wrote:
> >
> >     +User Mode Driver
> >     +----------------
> >     +
> >     +The UMD should check before submitting new commands to the KMD if
> >     the device has
> >     +been reset, and this can be checked more often if the UMD requires
> >     it. After
> >     +detecting a reset, UMD will then proceed to report it to the
> >     application using
> >     +the appropriate API error code, as explained in the section below
> about
> >     +robustness.
> >
> >
> > The UMD won't check the device status before every command submission
> > due to ioctl overhead. Instead, the KMD should skip command submission
> > and return an error that it was skipped.
>
> I wrote like this because when reading the source code for
> vk::check_status()[0] and Gallium's si_flush_gfx_cs()[1], I was under
> the impression that UMD checks the reset status before every
> submission/flush.
>

It only does that before every command submission when the context is
robust. When it's not robust, radeonsi doesn't do anything.


>
> Is your comment about of how things are currently implemented, or how
> they would ideally work? Either way I can apply your suggestion, I just
> want to make it clear.
>

Yes. Ideally, we would get the reply whether the context is lost from the
CS ioctl. This is not currently implemented.

Marek
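
A rough sketch of the current behaviour described above, where the reset status is queried before a submission only for robust contexts. All names are hypothetical and the reset query is reduced to a counter plus a boolean standing in for the driver-specific ioctl; this is not radeonsi code.

```c
#include <stdbool.h>

/* Hypothetical UMD context state. */
struct sketch_umd_ctx {
	bool robust;			/* created with a robustness extension enabled */
	bool lost;			/* set once a reset has been observed */
	unsigned int status_queries;	/* how many reset queries were issued */
	unsigned int submissions;	/* how many command buffers were submitted */
};

/*
 * Stand-in for the driver-specific "was there a reset?" query. The answer is
 * passed in by the caller so the sketch stays self-contained; a real query
 * costs one ioctl, which is the overhead mentioned above.
 */
bool sketch_query_reset(struct sketch_umd_ctx *ctx, bool device_was_reset)
{
	ctx->status_queries++;
	return device_was_reset;
}

/*
 * Flush path: only a robust context pays for the query before submitting.
 * Returns true if the commands were submitted.
 */
bool sketch_flush(struct sketch_umd_ctx *ctx, bool device_was_reset)
{
	if (ctx->robust && sketch_query_reset(ctx, device_was_reset)) {
		ctx->lost = true;
		return false;
	}

	ctx->submissions++;
	return true;
}
```

If the CS ioctl itself reported a lost context, the query in ``sketch_flush()`` could go away and both kinds of context would take the same path.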


>
> [0]
>
> https://elixir.bootlin.com/mesa/mesa-23.1.3/source/src/vulkan/runtime/vk_device.h#L142
> [1]
>
> https://elixir.bootlin.com/mesa/mesa-23.1.3/source/src/gallium/drivers/radeonsi/si_gfx_cs.c#L83
>
> >
> > The only case where that won't be applicable is user queues where
> > drivers don't call into the kernel to submit work, but they do call into
> > the kernel to create a dma_fence. In that case, the call to create a
> > dma_fence can fail with an error.
> >
> > Marek
>
>

André Almeida June 29, 2023, 1:11 p.m. UTC | #7

Em 27/06/2023 18:17, André Almeida escreveu:
> Em 27/06/2023 14:47, Christian König escreveu:
>> Am 27.06.23 um 15:23 schrieb André Almeida:
>>> Create a section that specifies how to deal with DRM device resets for
>>> kernel and userspace drivers.
>>>
>>> Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com>
>>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>>> ---
>>>
>>> v4: 
>>> https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/
>>>
>>> Changes:
>>>   - Grammar fixes (Randy)
>>>
>>>   Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
>>>   1 file changed, 68 insertions(+)
>>>
>>> diff --git a/Documentation/gpu/drm-uapi.rst 
>>> b/Documentation/gpu/drm-uapi.rst
>>> index 65fb3036a580..3cbffa25ed93 100644
>>> --- a/Documentation/gpu/drm-uapi.rst
>>> +++ b/Documentation/gpu/drm-uapi.rst
>>> @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a 
>>> third handler for
>>>   mmapped regular files. Threads cause additional pain with signal
>>>   handling as well.
>>> +Device reset
>>> +============
>>> +
>>> +The GPU stack is really complex and is prone to errors, from 
>>> hardware bugs,
>>> +faulty applications and everything in between the many layers. Some 
>>> errors
>>> +require resetting the device in order to make the device usable 
>>> again. This
>>> +sections describes the expectations for DRM and usermode drivers when a
>>> +device resets and how to propagate the reset status.
>>> +
>>> +Kernel Mode Driver
>>> +------------------
>>> +
>>> +The KMD is responsible for checking if the device needs a reset, and 
>>> to perform
>>> +it as needed. Usually a hang is detected when a job gets stuck 
>>> executing. KMD
>>> +should keep track of resets, because userspace can query any time 
>>> about the
>>> +reset stats for an specific context.
>>
>> Maybe drop the part "for a specific context". Essentially the reset 
>> query could use global counters instead and we won't need the context 
>> any more here.
>>
> 
> Right, I wrote like this to reflect how it's currently implemented.
> 
> If follow correctly what you meant, KMD could always notify the global 
> count for UMD, and we would move to the UMD the responsibility to manage 
> the reset counters, right? This would also simplify my 
> DRM_IOCTL_GET_RESET proposal. I'll apply your suggestion to the next doc 
> version.
> 

Actually, if we drop the context identifier we would lose the ability to 
track which is the guilty context. Vulkan API doesn't seem to care about 
this, but OpenGL does.
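
To make the trade-off concrete, here is a small sketch with hypothetical names and no real driver interface. With only a global counter the UMD can tell that some reset happened since its last check, but not who caused it, so the best it could report through ``GL_ARB_robustness`` would be the "unknown" status. With a per-context counter it can distinguish the guilty and innocent cases that ``glGetGraphicsResetStatusARB()`` exposes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the kinds of answer GL_ARB_robustness can give; values are local. */
enum sketch_reset_status {
	SKETCH_NO_RESET,
	SKETCH_GUILTY_RESET,
	SKETCH_INNOCENT_RESET,
	SKETCH_UNKNOWN_RESET,
};

/* Counters as last seen by the UMD. */
struct sketch_snapshot {
	uint64_t global_resets;
	uint64_t guilty_resets;
};

/* Counters as reported by a hypothetical KMD reset query. */
struct sketch_reset_info {
	uint64_t global_resets;		/* resets of the whole device */
	uint64_t guilty_resets;		/* resets blamed on this context */
	bool has_per_ctx_info;		/* false if only the global count exists */
};

/* Compare the new counters against the last snapshot. */
enum sketch_reset_status
sketch_classify(const struct sketch_snapshot *last,
		const struct sketch_reset_info *now)
{
	if (now->global_resets == last->global_resets)
		return SKETCH_NO_RESET;
	if (!now->has_per_ctx_info)
		return SKETCH_UNKNOWN_RESET;
	if (now->guilty_resets != last->guilty_resets)
		return SKETCH_GUILTY_RESET;
	return SKETCH_INNOCENT_RESET;
}
```

For Vulkan, all three reset outcomes would collapse into ``VK_ERROR_DEVICE_LOST``, which is why the per-context information only matters for OpenGL here.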

>> Apart from that this sounds good to me, feel free to add my rb.
>>
>> Regards,
>> Christian.
>>
>>

Sebastian Wick June 30, 2023, 2:48 p.m. UTC | #8

On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com> wrote:
>
> Create a section that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
>
> Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
>
> v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/
>
> Changes:
>  - Grammar fixes (Randy)
>
>  Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
>  1 file changed, 68 insertions(+)
>
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 65fb3036a580..3cbffa25ed93 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
>  mmapped regular files. Threads cause additional pain with signal
>  handling as well.
>
> +Device reset
> +============
> +
> +The GPU stack is really complex and is prone to errors, from hardware bugs,
> +faulty applications and everything in between the many layers. Some errors
> +require resetting the device in order to make the device usable again. This
> +sections describes the expectations for DRM and usermode drivers when a
> +device resets and how to propagate the reset status.
> +
> +Kernel Mode Driver
> +------------------
> +
> +The KMD is responsible for checking if the device needs a reset, and to perform
> +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> +should keep track of resets, because userspace can query any time about the
> +reset stats for an specific context. This is needed to propagate to the rest of
> +the stack that a reset has happened. Currently, this is implemented by each
> +driver separately, with no common DRM interface.
> +
> +User Mode Driver
> +----------------
> +
> +The UMD should check before submitting new commands to the KMD if the device has
> +been reset, and this can be checked more often if the UMD requires it. After
> +detecting a reset, UMD will then proceed to report it to the application using
> +the appropriate API error code, as explained in the section below about
> +robustness.
> +
> +Robustness
> +----------
> +
> +The only way to try to keep an application working after a reset is if it
> +complies with the robustness aspects of the graphical API that it is using.
> +
> +Graphical APIs provide ways to applications to deal with device resets. However,
> +there is no guarantee that the app will use such features correctly, and the
> +UMD can implement policies to close the app if it is a repeating offender,
> +likely in a broken loop. This is done to ensure that it does not keep blocking
> +the user interface from being correctly displayed. This should be done even if
> +the app is correct but happens to trigger some bug in the hardware/driver.

I still don't think it's good to let the kernel arbitrarily kill
processes that it thinks are not well-behaved based on some heuristics
and policy.

Can't this be outsourced to user space? Expose the information about
processes causing a device reset and let e.g. systemd deal with coming up
with a policy and with killing stuff.

> +
> +OpenGL
> +~~~~~~
> +
> +Apps using OpenGL should use the available robust interfaces, like the
> +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> +interface tells if a reset has happened, and if so, all the context state is
> +considered lost and the app proceeds by creating new ones. If it is possible to
> +determine that robustness is not in use, the UMD will terminate the app when a
> +reset is detected, giving that the contexts are lost and the app won't be able
> +to figure this out and recreate the contexts.
> +
> +Vulkan
> +~~~~~~
> +
> +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> +This error code means, among other things, that a device reset has happened and
> +it needs to recreate the contexts to keep going.
> +
> +Reporting causes of resets
> +--------------------------
> +
> +Apart from propagating the reset through the stack so apps can recover, it's
> +really useful for driver developers to learn more about what caused the reset in
> +first place. DRM devices should make use of devcoredump to store relevant
> +information about the reset, so this information can be added to user bug
> +reports.
> +
>  .. _drm_driver_ioctl:
>
>  IOCTL Support on Device Nodes
> --
> 2.41.0
>

Alex Deucher June 30, 2023, 2:59 p.m. UTC | #9

On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
<sebastian.wick@redhat.com> wrote:
>
> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com> wrote:
> >
> > Create a section that specifies how to deal with DRM device resets for
> > kernel and userspace drivers.
> >
> > Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com>
> > Signed-off-by: André Almeida <andrealmeid@igalia.com>
> > ---
> >
> > v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/
> >
> > Changes:
> >  - Grammar fixes (Randy)
> >
> >  Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
> >  1 file changed, 68 insertions(+)
> >
> > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > index 65fb3036a580..3cbffa25ed93 100644
> > --- a/Documentation/gpu/drm-uapi.rst
> > +++ b/Documentation/gpu/drm-uapi.rst
> > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
> >  mmapped regular files. Threads cause additional pain with signal
> >  handling as well.
> >
> > +Device reset
> > +============
> > +
> > +The GPU stack is really complex and is prone to errors, from hardware bugs,
> > +faulty applications and everything in between the many layers. Some errors
> > +require resetting the device in order to make the device usable again. This
> > +sections describes the expectations for DRM and usermode drivers when a
> > +device resets and how to propagate the reset status.
> > +
> > +Kernel Mode Driver
> > +------------------
> > +
> > +The KMD is responsible for checking if the device needs a reset, and to perform
> > +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> > +should keep track of resets, because userspace can query any time about the
> > +reset stats for an specific context. This is needed to propagate to the rest of
> > +the stack that a reset has happened. Currently, this is implemented by each
> > +driver separately, with no common DRM interface.
> > +
> > +User Mode Driver
> > +----------------
> > +
> > +The UMD should check before submitting new commands to the KMD if the device has
> > +been reset, and this can be checked more often if the UMD requires it. After
> > +detecting a reset, UMD will then proceed to report it to the application using
> > +the appropriate API error code, as explained in the section below about
> > +robustness.
> > +
> > +Robustness
> > +----------
> > +
> > +The only way to try to keep an application working after a reset is if it
> > +complies with the robustness aspects of the graphical API that it is using.
> > +
> > +Graphical APIs provide ways to applications to deal with device resets. However,
> > +there is no guarantee that the app will use such features correctly, and the
> > +UMD can implement policies to close the app if it is a repeating offender,
> > +likely in a broken loop. This is done to ensure that it does not keep blocking
> > +the user interface from being correctly displayed. This should be done even if
> > +the app is correct but happens to trigger some bug in the hardware/driver.
>
> I still don't think it's good to let the kernel arbitrarily kill
> processes that it thinks are not well-behaved based on some heuristics
> and policy.
>
> Can't this be outsourced to user space? Expose the information about
> processes causing a device reset and let e.g. systemd deal with coming up
> with a policy and with killing stuff.

I don't think it's the kernel doing the killing, it would be the UMD.
E.g., if the app is guilty and doesn't support robustness the UMD can
just call exit().

Alex

>
> > +
> > +OpenGL
> > +~~~~~~
> > +
> > +Apps using OpenGL should use the available robust interfaces, like the
> > +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> > +interface tells if a reset has happened, and if so, all the context state is
> > +considered lost and the app proceeds by creating new ones. If it is possible to
> > +determine that robustness is not in use, the UMD will terminate the app when a
> > +reset is detected, given that the contexts are lost and the app won't be able
> > +to figure this out and recreate the contexts.
> > +
> > +Vulkan
> > +~~~~~~
> > +
> > +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> > +This error code means, among other things, that a device reset has happened and
> > +it needs to recreate the contexts to keep going.
> > +
> > +Reporting causes of resets
> > +--------------------------
> > +
> > +Apart from propagating the reset through the stack so apps can recover, it's
> > +really useful for driver developers to learn more about what caused the reset in
> > +the first place. DRM devices should make use of devcoredump to store relevant
> > +information about the reset, so this information can be added to user bug
> > +reports.
> > +
> >  .. _drm_driver_ioctl:
> >
> >  IOCTL Support on Device Nodes
> > --
> > 2.41.0
> >
>

Michel Dänzer June 30, 2023, 3:11 p.m. UTC | #10

On 6/30/23 16:59, Alex Deucher wrote:
> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
> <sebastian.wick@redhat.com> wrote:
>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com> wrote:
>>>
>>> +Robustness
>>> +----------
>>> +
>>> +The only way to try to keep an application working after a reset is if it
>>> +complies with the robustness aspects of the graphical API that it is using.
>>> +
>>> +Graphical APIs provide ways for applications to deal with device resets. However,
>>> +there is no guarantee that the app will use such features correctly, and the
>>> +UMD can implement policies to close the app if it is a repeating offender,
>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
>>> +the user interface from being correctly displayed. This should be done even if
>>> +the app is correct but happens to trigger some bug in the hardware/driver.
>>
>> I still don't think it's good to let the kernel arbitrarily kill
>> processes that it thinks are not well-behaved based on some heuristics
>> and policy.
>>
>> Can't this be outsourced to user space? Expose the information about
>> processes causing a device reset and let e.g. systemd deal with coming up
>> with a policy and with killing stuff.
> 
> I don't think it's the kernel doing the killing, it would be the UMD.
> E.g., if the app is guilty and doesn't support robustness the UMD can
> just call exit().

It would be safer to just ignore API calls[0], similarly to what is done until the application destroys the context with robustness. Calling exit() likely results in losing any unsaved work, whereas at least some applications might otherwise allow saving the work by other means.


[0] Possibly accompanied by a one-time message to stderr along the lines of "GPU reset detected but robustness not enabled in context, ignoring OpenGL API calls".
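
A minimal Python sketch of the behaviour proposed above, for illustration only: once a reset has hit a context created without robustness, further API calls become no-ops and the warning is printed once. `Context`, `notify_reset` and `draw` are hypothetical names chosen for this model, not Mesa or OpenGL API.

```python
import sys


class Context:
    """Hypothetical model of a GL context inside a UMD (not a real Mesa type)."""

    def __init__(self, robust):
        self.robust = robust    # True if the app requested robustness
        self.lost = False       # set once a GPU reset has hit this context
        self.warned = False     # ensures the message is printed only once
        self.submitted = 0      # number of calls actually sent to the GPU

    def notify_reset(self):
        """Called when the UMD learns from the KMD that the context was reset."""
        self.lost = True

    def draw(self):
        """Stand-in for any API call that would submit work."""
        if self.lost:
            if not self.robust and not self.warned:
                print("GPU reset detected but robustness not enabled in "
                      "context, ignoring OpenGL API calls", file=sys.stderr)
                self.warned = True
            return False        # call ignored, nothing is submitted
        self.submitted += 1
        return True


ctx = Context(robust=False)
ctx.draw()            # submitted normally
ctx.notify_reset()    # simulated GPU reset
ctx.draw()            # ignored, warning printed once
ctx.draw()            # ignored silently
```

In this model the process keeps running after the reset, so the application can still save its work through paths that do not touch the lost context.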

Sebastian Wick June 30, 2023, 3:21 p.m. UTC | #11

On Fri, Jun 30, 2023 at 4:59 PM Alex Deucher <alexdeucher@gmail.com> wrote:
>
> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
> <sebastian.wick@redhat.com> wrote:
> >
> > On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com> wrote:
> > >
> > > Create a section that specifies how to deal with DRM device resets for
> > > kernel and userspace drivers.
> > >
> > > Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com>
> > > Signed-off-by: André Almeida <andrealmeid@igalia.com>
> > > ---
> > >
> > > v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/
> > >
> > > Changes:
> > >  - Grammar fixes (Randy)
> > >
> > >  Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
> > >  1 file changed, 68 insertions(+)
> > >
> > > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > > index 65fb3036a580..3cbffa25ed93 100644
> > > --- a/Documentation/gpu/drm-uapi.rst
> > > +++ b/Documentation/gpu/drm-uapi.rst
> > > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
> > >  mmapped regular files. Threads cause additional pain with signal
> > >  handling as well.
> > >
> > > +Device reset
> > > +============
> > > +
> > > +The GPU stack is really complex and is prone to errors, from hardware bugs,
> > > +faulty applications and everything in between the many layers. Some errors
> > > +require resetting the device in order to make the device usable again. This
> > > +section describes the expectations for DRM and usermode drivers when a
> > > +device resets and how to propagate the reset status.
> > > +
> > > +Kernel Mode Driver
> > > +------------------
> > > +
> > > +The KMD is responsible for checking if the device needs a reset, and to perform
> > > +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> > > +should keep track of resets, because userspace can query any time about the
> > > +reset stats for a specific context. This is needed to propagate to the rest of
> > > +the stack that a reset has happened. Currently, this is implemented by each
> > > +driver separately, with no common DRM interface.
> > > +
> > > +User Mode Driver
> > > +----------------
> > > +
> > > +The UMD should check before submitting new commands to the KMD if the device has
> > > +been reset, and this can be checked more often if the UMD requires it. After
> > > +detecting a reset, UMD will then proceed to report it to the application using
> > > +the appropriate API error code, as explained in the section below about
> > > +robustness.
> > > +
> > > +Robustness
> > > +----------
> > > +
> > > +The only way to try to keep an application working after a reset is if it
> > > +complies with the robustness aspects of the graphical API that it is using.
> > > +
> > > +Graphical APIs provide ways for applications to deal with device resets. However,
> > > +there is no guarantee that the app will use such features correctly, and the
> > > +UMD can implement policies to close the app if it is a repeating offender,
> > > +likely in a broken loop. This is done to ensure that it does not keep blocking
> > > +the user interface from being correctly displayed. This should be done even if
> > > +the app is correct but happens to trigger some bug in the hardware/driver.
> >
> > I still don't think it's good to let the kernel arbitrarily kill
> > processes that it thinks are not well-behaved based on some heuristics
> > and policy.
> >
> > Can't this be outsourced to user space? Expose the information about
> > > processes causing a device reset and let e.g. systemd deal with coming up
> > with a policy and with killing stuff.
>
> I don't think it's the kernel doing the killing, it would be the UMD.
> E.g., if the app is guilty and doesn't support robustness the UMD can
> just call exit().

Ah, right, completely skipped over the UMD part. That makes more sense.
>
> Alex
>
> >
> > > +
> > > +OpenGL
> > > +~~~~~~
> > > +
> > > +Apps using OpenGL should use the available robust interfaces, like the
> > > +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> > > +interface tells if a reset has happened, and if so, all the context state is
> > > +considered lost and the app proceeds by creating new ones. If it is possible to
> > > +determine that robustness is not in use, the UMD will terminate the app when a
> > > +reset is detected, given that the contexts are lost and the app won't be able
> > > +to figure this out and recreate the contexts.
> > > +
> > > +Vulkan
> > > +~~~~~~
> > > +
> > > +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> > > +This error code means, among other things, that a device reset has happened and
> > > +it needs to recreate the contexts to keep going.
> > > +
> > > +Reporting causes of resets
> > > +--------------------------
> > > +
> > > +Apart from propagating the reset through the stack so apps can recover, it's
> > > +really useful for driver developers to learn more about what caused the reset in
> > > +the first place. DRM devices should make use of devcoredump to store relevant
> > > +information about the reset, so this information can be added to user bug
> > > +reports.
> > > +
> > >  .. _drm_driver_ioctl:
> > >
> > >  IOCTL Support on Device Nodes
> > > --
> > > 2.41.0
> > >
> >
>

Marek Olšák June 30, 2023, 8:32 p.m. UTC | #12

That's a terrible idea. Ignoring API calls would be identical to a freeze.
You might as well disable GPU recovery because the result would be the same.

There are 2 scenarios:
- robust contexts: report the GPU reset status and skip API calls; let the
app recreate the context to recover
- non-robust contexts: call exit(1) immediately, which is the best way to
recover
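
The two scenarios can be written down as a small decision function. This is a Python sketch with hypothetical names (`Action`, `on_api_call`), not driver code; under this proposal a real UMD would call exit(1) where the sketch returns `TERMINATE`.

```python
from enum import Enum


class Action(Enum):
    SUBMIT = "submit"
    REPORT_AND_SKIP = "report reset status, skip call"
    TERMINATE = "terminate process"


def on_api_call(robust, context_lost):
    """Decide what the (hypothetical) UMD does for one API call."""
    if not context_lost:
        return Action.SUBMIT
    if robust:
        # The app can query the reset status and recreate the context.
        return Action.REPORT_AND_SKIP
    # Non-robust context: the app has no way to learn about the reset.
    return Action.TERMINATE


decisions = {
    (robust, lost): on_api_call(robust, lost)
    for robust in (True, False)
    for lost in (True, False)
}
```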

Marek

On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <michel.daenzer@mailbox.org>
wrote:

> On 6/30/23 16:59, Alex Deucher wrote:
> > On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
> > <sebastian.wick@redhat.com> wrote:
> >> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com>
> wrote:
> >>>
> >>> +Robustness
> >>> +----------
> >>> +
> >>> +The only way to try to keep an application working after a reset is
> if it
> >>> +complies with the robustness aspects of the graphical API that it is
> using.
> >>> +
> >>> +Graphical APIs provide ways for applications to deal with device
> resets. However,
> >>> +there is no guarantee that the app will use such features correctly,
> and the
> >>> +UMD can implement policies to close the app if it is a repeating
> offender,
> >>> +likely in a broken loop. This is done to ensure that it does not keep
> blocking
> >>> +the user interface from being correctly displayed. This should be
> done even if
> >>> +the app is correct but happens to trigger some bug in the
> hardware/driver.
> >>
> >> I still don't think it's good to let the kernel arbitrarily kill
> >> processes that it thinks are not well-behaved based on some heuristics
> >> and policy.
> >>
> >> Can't this be outsourced to user space? Expose the information about
> >> processes causing a device reset and let e.g. systemd deal with coming up
> >> with a policy and with killing stuff.
> >
> > I don't think it's the kernel doing the killing, it would be the UMD.
> > E.g., if the app is guilty and doesn't support robustness the UMD can
> > just call exit().
>
> It would be safer to just ignore API calls[0], similarly to what is done
> until the application destroys the context with robustness. Calling exit()
> likely results in losing any unsaved work, whereas at least some
> applications might otherwise allow saving the work by other means.
>
>
> [0] Possibly accompanied by a one-time message to stderr along the lines
> of "GPU reset detected but robustness not enabled in context, ignoring
> OpenGL API calls".
>
> --
> Earthling Michel Dänzer            |                  https://redhat.com
> Libre software enthusiast          |         Mesa and Xwayland developer
>
>

Michel Dänzer July 3, 2023, 7:12 a.m. UTC | #13

On 6/30/23 22:32, Marek Olšák wrote:
> On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <michel.daenzer@mailbox.org> wrote:
>> On 6/30/23 16:59, Alex Deucher wrote:
>>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
>>> <sebastian.wick@redhat.com> wrote:
>>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com> wrote:
>>>>>
>>>>> +Robustness
>>>>> +----------
>>>>> +
>>>>> +The only way to try to keep an application working after a reset is if it
>>>>> +complies with the robustness aspects of the graphical API that it is using.
>>>>> +
>>>>> +Graphical APIs provide ways for applications to deal with device resets. However,
>>>>> +there is no guarantee that the app will use such features correctly, and the
>>>>> +UMD can implement policies to close the app if it is a repeating offender,
>>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
>>>>> +the user interface from being correctly displayed. This should be done even if
>>>>> +the app is correct but happens to trigger some bug in the hardware/driver.
>>>>
>>>> I still don't think it's good to let the kernel arbitrarily kill
>>>> processes that it thinks are not well-behaved based on some heuristics
>>>> and policy.
>>>>
>>>> Can't this be outsourced to user space? Expose the information about
>>>> processes causing a device reset and let e.g. systemd deal with coming up
>>>> with a policy and with killing stuff.
>>>
>>> I don't think it's the kernel doing the killing, it would be the UMD.
>>> E.g., if the app is guilty and doesn't support robustness the UMD can
>>> just call exit().
>>
>> It would be safer to just ignore API calls[0], similarly to what is done until the application destroys the context with robustness. Calling exit() likely results in losing any unsaved work, whereas at least some applications might otherwise allow saving the work by other means.
> 
> That's a terrible idea. Ignoring API calls would be identical to a freeze. You might as well disable GPU recovery because the result would be the same.

No GPU recovery would affect everything using the GPU, whereas this affects only non-robust applications.


> - non-robust contexts: call exit(1) immediately, which is the best way to recover

That's not the UMD's call to make.


>>     [0] Possibly accompanied by a one-time message to stderr along the lines of "GPU reset detected but robustness not enabled in context, ignoring OpenGL API calls".

Pekka Paalanen July 3, 2023, 8:49 a.m. UTC | #14

On Mon, 3 Jul 2023 09:12:29 +0200
Michel Dänzer <michel.daenzer@mailbox.org> wrote:

> On 6/30/23 22:32, Marek Olšák wrote:
> > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <michel.daenzer@mailbox.org> wrote:
> >> On 6/30/23 16:59, Alex Deucher wrote:
> >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
> >>> <sebastian.wick@redhat.com> wrote:
> >>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com> wrote:
> >>>>>
> >>>>> +Robustness
> >>>>> +----------
> >>>>> +
> >>>>> +The only way to try to keep an application working after a reset is if it
> >>>>> +complies with the robustness aspects of the graphical API that it is using.
> >>>>> +
> >>>>> +Graphical APIs provide ways for applications to deal with device resets. However,
> >>>>> +there is no guarantee that the app will use such features correctly, and the
> >>>>> +UMD can implement policies to close the app if it is a repeating offender,
> >>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
> >>>>> +the user interface from being correctly displayed. This should be done even if
> >>>>> +the app is correct but happens to trigger some bug in the hardware/driver.  
> >>>>
> >>>> I still don't think it's good to let the kernel arbitrarily kill
> >>>> processes that it thinks are not well-behaved based on some heuristics
> >>>> and policy.
> >>>>
> >>>> Can't this be outsourced to user space? Expose the information about
> >>>> processes causing a device reset and let e.g. systemd deal with coming up
> >>>> with a policy and with killing stuff.  
> >>>
> >>> I don't think it's the kernel doing the killing, it would be the UMD.
> >>> E.g., if the app is guilty and doesn't support robustness the UMD can
> >>> just call exit().  
> >>
> >> It would be safer to just ignore API calls[0], similarly to what
> >> is done until the application destroys the context with
> >> robustness. Calling exit() likely results in losing any unsaved
> >> work, whereas at least some applications might otherwise allow
> >> saving the work by other means.  
> > 
> > That's a terrible idea. Ignoring API calls would be identical to a
> > freeze. You might as well disable GPU recovery because the result
> > would be the same.  
> 
> No GPU recovery would affect everything using the GPU, whereas this
> affects only non-robust applications.
> 
> 
> > - non-robust contexts: call exit(1) immediately, which is the best
> > way to recover  
> 
> That's not the UMD's call to make.
> 
> 
> >>     [0] Possibly accompanied by a one-time message to stderr along
> >> the lines of "GPU reset detected but robustness not enabled in
> >> context, ignoring OpenGL API calls".  
> 

Hi,

Michel does have a point. It's not just games and display servers that
use GPU, but productivity tools as well. They may have periodic
autosave in anticipation of crashes, but being able to do the final
save before quitting would be nice. UMD killing the process would be
new behaviour, right? Previously either application's GPU thread hangs
or various API calls return errors, but it didn't kill the process, did
it?

If an application freezes, that's "no problem"; the end user can just
continue using everything else. Alt-tab away etc. if the app was
fullscreen. I do that already with games, even on Xorg.

If a display server freezes, that's a desktop-wide problem, but so is
killing it.

OTOH, if UMD really does need to terminate the process, then please do
it in a way that causes a crash report to be recorded. _exit() with an
error code is not it.
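
To illustrate the difference, here is a small Python sketch (assuming a POSIX system; `run_child` is a hypothetical helper). A child that calls `_exit(1)` looks like an ordinary exit with a status code, while a child that calls `abort()` dies from SIGABRT, which is the kind of fatal signal that core-dump based crash reporting acts on.

```python
import signal
import subprocess
import sys


def run_child(snippet):
    """Run a Python one-liner in a child process and return its exit status."""
    return subprocess.run([sys.executable, "-c", snippet],
                          stderr=subprocess.DEVNULL).returncode


# _exit(): the parent only sees a plain exit status.
exit_status = run_child("import os; os._exit(1)")

# abort(): the child is killed by SIGABRT; subprocess reports this
# as the negative signal number.
abort_status = run_child("import os; os.abort()")

died_from_abort = abort_status == -signal.SIGABRT
print(exit_status, abort_status, died_from_abort)
```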


Thanks,
pq

André Almeida July 3, 2023, 3 p.m. UTC | #15

On 03/07/2023 05:49, Pekka Paalanen wrote:
> On Mon, 3 Jul 2023 09:12:29 +0200
> Michel Dänzer <michel.daenzer@mailbox.org> wrote:
> 
>> On 6/30/23 22:32, Marek Olšák wrote:
>>> On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <michel.daenzer@mailbox.org> wrote:
>>>> On 6/30/23 16:59, Alex Deucher wrote:
>>>>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
>>>>> <sebastian.wick@redhat.com> wrote:
>>>>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com> wrote:
>>>>>>>
>>>>>>> +Robustness
>>>>>>> +----------
>>>>>>> +
>>>>>>> +The only way to try to keep an application working after a reset is if it
>>>>>>> +complies with the robustness aspects of the graphical API that it is using.
>>>>>>> +
>>>>>>> +Graphical APIs provide ways for applications to deal with device resets. However,
>>>>>>> +there is no guarantee that the app will use such features correctly, and the
>>>>>>> +UMD can implement policies to close the app if it is a repeating offender,
>>>>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
>>>>>>> +the user interface from being correctly displayed. This should be done even if
>>>>>>> +the app is correct but happens to trigger some bug in the hardware/driver.
>>>>>>
>>>>>> I still don't think it's good to let the kernel arbitrarily kill
>>>>>> processes that it thinks are not well-behaved based on some heuristics
>>>>>> and policy.
>>>>>>
>>>>>> Can't this be outsourced to user space? Expose the information about
>>>>>> processes causing a device reset and let e.g. systemd deal with coming up
>>>>>> with a policy and with killing stuff.
>>>>>
>>>>> I don't think it's the kernel doing the killing, it would be the UMD.
>>>>> E.g., if the app is guilty and doesn't support robustness the UMD can
>>>>> just call exit().
>>>>
>>>> It would be safer to just ignore API calls[0], similarly to what
>>>> is done until the application destroys the context with
>>>> robustness. Calling exit() likely results in losing any unsaved
>>>> work, whereas at least some applications might otherwise allow
>>>> saving the work by other means.
>>>
>>> That's a terrible idea. Ignoring API calls would be identical to a
>>> freeze. You might as well disable GPU recovery because the result
>>> would be the same.
>>
>> No GPU recovery would affect everything using the GPU, whereas this
>> affects only non-robust applications.
>>
>>
>>> - non-robust contexts: call exit(1) immediately, which is the best
>>> way to recover
>>
>> That's not the UMD's call to make.
>>
>>
>>>>      [0] Possibly accompanied by a one-time message to stderr along
>>>> the lines of "GPU reset detected but robustness not enabled in
>>>> context, ignoring OpenGL API calls".
>>
> 
> Hi,
> 
> Michel does have a point. It's not just games and display servers that
> use GPU, but productivity tools as well. They may have periodic
> autosave in anticipation of crashes, but being able to do the final
> save before quitting would be nice. UMD killing the process would be
> new behaviour, right? Previously either application's GPU thread hangs
> or various API calls return errors, but it didn't kill the process, did
> it?
> 

In Intel's Iris, the UMD may call abort() for the application guilty of the reset:

https://elixir.bootlin.com/mesa/mesa-23.0.4/source/src/gallium/drivers/iris/iris_batch.c#L1063

I was pretty sure this was the same for RadeonSI, but I failed to find 
the code for this, so I might be wrong.

> If an application freezes, that's "no problem"; the end user can just
> continue using everything else. Alt-tab away etc. if the app was
> fullscreen. I do that already with games, even on Xorg.
> 
> If a display server freezes, that's a desktop-wide problem, but so is
> killing it.
> 

Interesting, what GPU do you use? In my experience (AMD RX 5600 XT), 
hanging the GPU usually means that the rest of applications/compositor 
can't use the GPU either, freezing all user interactions. So killing the 
guilty app is one effective solution currently, but ignoring calls may 
help as well.

> OTOH, if UMD really does need to terminate the process, then please do
> it in a way that causes a crash report to be recorded. _exit() with an
> error code is not it.
> 

In the "Reporting causes of resets" subsection of this document I can 
add something for UMD as well.

> 
> Thanks,
> pq

Marek Olšák July 4, 2023, 2:34 a.m. UTC | #16

On Mon, Jul 3, 2023, 03:12 Michel Dänzer <michel.daenzer@mailbox.org> wrote:

> On 6/30/23 22:32, Marek Olšák wrote:
> > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <michel.daenzer@mailbox.org> wrote:
> >> On 6/30/23 16:59, Alex Deucher wrote:
> >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
> >>> <sebastian.wick@redhat.com> wrote:
> >>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com> wrote:
> >>>>>
> >>>>> +Robustness
> >>>>> +----------
> >>>>> +
> >>>>> +The only way to try to keep an application working after a reset is
> if it
> >>>>> +complies with the robustness aspects of the graphical API that it
> is using.
> >>>>> +
> >>>>> +Graphical APIs provide ways for applications to deal with device
> resets. However,
> >>>>> +there is no guarantee that the app will use such features
> correctly, and the
> >>>>> +UMD can implement policies to close the app if it is a repeating
> offender,
> >>>>> +likely in a broken loop. This is done to ensure that it does not
> keep blocking
> >>>>> +the user interface from being correctly displayed. This should be
> done even if
> >>>>> +the app is correct but happens to trigger some bug in the
> hardware/driver.
> >>>>
> >>>> I still don't think it's good to let the kernel arbitrarily kill
> >>>> processes that it thinks are not well-behaved based on some heuristics
> >>>> and policy.
> >>>>
> >>>> Can't this be outsourced to user space? Expose the information about
> >>>> processes causing a device reset and let e.g. systemd deal with coming up
> >>>> with a policy and with killing stuff.
> >>>
> >>> I don't think it's the kernel doing the killing, it would be the UMD.
> >>> E.g., if the app is guilty and doesn't support robustness the UMD can
> >>> just call exit().
> >>
> >> It would be safer to just ignore API calls[0], similarly to what is
> done until the application destroys the context with robustness. Calling
> exit() likely results in losing any unsaved work, whereas at least some
> applications might otherwise allow saving the work by other means.
> >
> > That's a terrible idea. Ignoring API calls would be identical to a
> freeze. You might as well disable GPU recovery because the result would be
> the same.
>
> No GPU recovery would affect everything using the GPU, whereas this
> affects only non-robust applications.
>

which is currently the majority.


>
> > - non-robust contexts: call exit(1) immediately, which is the best way
> to recover
>
> That's not the UMD's call to make.
>

That's absolutely the UMD's call to make because that's mandated by the hw
and API design and only driver devs know this, which this thread is a proof
of. The default behavior is to skip all command submission if a non-robust
context is lost, which looks like a freeze. That's required to prevent
infinite hangs from the same context and can be caused by the side effects
of the GPU reset itself, not by the cause of the previous hang. The only
way out of that is killing the process.

Marek


>
> >>     [0] Possibly accompanied by a one-time message to stderr along the
> lines of "GPU reset detected but robustness not enabled in context,
> ignoring OpenGL API calls".
>
>
> --
> Earthling Michel Dänzer            |                  https://redhat.com
> Libre software enthusiast          |         Mesa and Xwayland developer
>
>

Randy Dunlap July 4, 2023, 2:38 a.m. UTC | #17

On 7/3/23 19:34, Marek Olšák wrote:
> 
> 
> On Mon, Jul 3, 2023, 03:12 Michel Dänzer <michel.daenzer@mailbox.org> wrote:
> 

Marek,
Please stop sending html emails to the mailing lists.
The mailing list software drops them.

Please set your email interface to use plain text mode instead.
Thanks.

Marek Olšák July 4, 2023, 2:44 a.m. UTC | #18

On Mon, Jul 3, 2023, 22:38 Randy Dunlap <rdunlap@infradead.org> wrote:

>
>
> On 7/3/23 19:34, Marek Olšák wrote:
> >
> >
> > On Mon, Jul 3, 2023, 03:12 Michel Dänzer <michel.daenzer@mailbox.org> wrote:
> >
>
> Marek,
> Please stop sending html emails to the mailing lists.
> The mailing list software drops them.
>
> Please set your email interface to use plain text mode instead.
> Thanks.
>

The mobile Gmail app doesn't support plain text, which I use frequently.

Marek


> --
> ~Randy
>

Randy Dunlap July 4, 2023, 2:48 a.m. UTC | #19

On 7/3/23 19:44, Marek Olšák wrote:
> 
> 
> On Mon, Jul 3, 2023, 22:38 Randy Dunlap <rdunlap@infradead.org> wrote:
> 
> 
> 
>     On 7/3/23 19:34, Marek Olšák wrote:
>     >
>     >
>     > On Mon, Jul 3, 2023, 03:12 Michel Dänzer <michel.daenzer@mailbox.org> wrote:
>     >
> 
>     Marek,
>     Please stop sending html emails to the mailing lists.
>     The mailing list software drops them.
> 
>     Please set your email interface to use plain text mode instead.
>     Thanks.
> 
> 
> The mobile Gmail app doesn't support plain text, which I use frequently.

Perhaps you should consider some other mobile app for kernel discussions.

E.g., it looks like the K-9 mail app works with gmail.

Pekka Paalanen July 4, 2023, 7:42 a.m. UTC | #20

On Mon, 3 Jul 2023 12:00:22 -0300
André Almeida <andrealmeid@igalia.com> wrote:

> On 03/07/2023 05:49, Pekka Paalanen wrote:

> > If an application freezes, that's "no problem"; the end user can just
> > continue using everything else. Alt-tab away etc. if the app was
> > fullscreen. I do that already with games, even on Xorg.
> > 
> > If a display server freezes, that's a desktop-wide problem, but so is
> > killing it.
> >   
> 
> Interesting, what GPU do you use? In my experience (AMD RX 5600 XT), 
> hanging the GPU usually means that the rest of applications/compositor 
> can't use the GPU either, freezing all user interactions. So killing the 
> guilty app is one effective solution currently, but ignoring calls may 
> help as well.

I don't know if what I'm seeing is a GPU hang or just e.g. Proton
getting somehow stuck, all I see is a game freezing. I just Alt+tab
back to Steam, force-stop it, and then all is fine again. This is how
it should work regardless of why a game freezes.

However, even if it was a GPU hang, if I am on a display server that
actually handles GPU resets, I don't see why the rest of the desktop
would not be able to recover. Individual apps are each to their own,
but at the very least non-GPU apps and the DE itself should not have
any problem (DE components can simply be restarted automatically).

Thanks,
pq

Michel Dänzer July 4, 2023, 7:54 a.m. UTC | #21

On 7/4/23 04:34, Marek Olšák wrote:
> On Mon, Jul 3, 2023, 03:12 Michel Dänzer <michel.daenzer@mailbox.org> wrote:
>     On 6/30/23 22:32, Marek Olšák wrote:
>     > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <michel.daenzer@mailbox.org> wrote:
>     >> On 6/30/23 16:59, Alex Deucher wrote:
>     >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
>     >>> <sebastian.wick@redhat.com> wrote:
>     >>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com> wrote:
>     >>>>>
>     >>>>> +Robustness
>     >>>>> +----------
>     >>>>> +
>     >>>>> +The only way to try to keep an application working after a reset is if it
>     >>>>> +complies with the robustness aspects of the graphical API that it is using.
>     >>>>> +
>     >>>>> +Graphical APIs provide ways for applications to deal with device resets. However,
>     >>>>> +there is no guarantee that the app will use such features correctly, and the
>     >>>>> +UMD can implement policies to close the app if it is a repeating offender,
>     >>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
>     >>>>> +the user interface from being correctly displayed. This should be done even if
>     >>>>> +the app is correct but happens to trigger some bug in the hardware/driver.
>     >>>>
>     >>>> I still don't think it's good to let the kernel arbitrarily kill
>     >>>> processes that it thinks are not well-behaved based on some heuristics
>     >>>> and policy.
>     >>>>
>     >>>> Can't this be outsourced to user space? Expose the information about
>     >>>> processes causing a device reset and let e.g. systemd deal with coming up
>     >>>> with a policy and with killing stuff.
>     >>>
>     >>> I don't think it's the kernel doing the killing, it would be the UMD.
>     >>> E.g., if the app is guilty and doesn't support robustness the UMD can
>     >>> just call exit().
>     >>
>     >> It would be safer to just ignore API calls[0], similarly to what is done until the application destroys the context with robustness. Calling exit() likely results in losing any unsaved work, whereas at least some applications might otherwise allow saving the work by other means.
>     >
>     > That's a terrible idea. Ignoring API calls would be identical to a freeze. You might as well disable GPU recovery because the result would be the same.
> 
>     No GPU recovery would affect everything using the GPU, whereas this affects only non-robust applications.
> 
> which is currently the majority.

Not sure where you're going with this. Applications need to use robustness to be able to recover from a GPU hang, and the GPU needs to be reset for that. So disabling GPU reset is not the same as what we're discussing here.


>     > - non-robust contexts: call exit(1) immediately, which is the best way to recover
> 
>     That's not the UMD's call to make.
> 
> That's absolutely the UMD's call to make because that's mandated by the hw and API design

Can you point us to a spec which mandates that the process must be killed in this case?


> and only driver devs know this, which this thread is a proof of. The default behavior is to skip all command submission if a non-robust context is lost, which looks like a freeze. That's required to prevent infinite hangs from the same context and can be caused by the side effects of the GPU reset itself, not by the cause of the previous hang. The only way out of that is killing the process.

The UMD killing the process is not the only way out of that, and doing so is overreach on its part. The UMD is but one out of many components in a process, not the main one or a special one. It doesn't get to decide when the process must die, certainly not under circumstances where it must be able to continue while ignoring API calls (that's required for robustness).


>     >>     [0] Possibly accompanied by a one-time message to stderr along the lines of "GPU reset detected but robustness not enabled in context, ignoring OpenGL API calls".

Marek Olšák July 5, 2023, 6:30 a.m. UTC | #22

On Tue, Jul 4, 2023, 03:55 Michel Dänzer <michel.daenzer@mailbox.org> wrote:

> On 7/4/23 04:34, Marek Olšák wrote:
> > On Mon, Jul 3, 2023, 03:12 Michel Dänzer <michel.daenzer@mailbox.org> wrote:
> >     On 6/30/23 22:32, Marek Olšák wrote:
> >     > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <michel.daenzer@mailbox.org> wrote:
> >     >> On 6/30/23 16:59, Alex Deucher wrote:
> >     >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
> >     >>> <sebastian.wick@redhat.com> wrote:
> >     >>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com> wrote:
> >     >>>>>
> >     >>>>> +Robustness
> >     >>>>> +----------
> >     >>>>> +
> >     >>>>> +The only way to try to keep an application working after a reset is if it
> >     >>>>> +complies with the robustness aspects of the graphical API that it is using.
> >     >>>>> +
> >     >>>>> +Graphical APIs provide ways to applications to deal with device resets. However,
> >     >>>>> +there is no guarantee that the app will use such features correctly, and the
> >     >>>>> +UMD can implement policies to close the app if it is a repeating offender,
> >     >>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
> >     >>>>> +the user interface from being correctly displayed. This should be done even if
> >     >>>>> +the app is correct but happens to trigger some bug in the hardware/driver.
> >     >>>>
> >     >>>> I still don't think it's good to let the kernel arbitrarily kill
> >     >>>> processes that it thinks are not well-behaved based on some heuristics
> >     >>>> and policy.
> >     >>>>
> >     >>>> Can't this be outsourced to user space? Expose the information about
> >     >>>> processes causing a device and let e.g. systemd deal with coming up
> >     >>>> with a policy and with killing stuff.
> >     >>>
> >     >>> I don't think it's the kernel doing the killing, it would be the UMD.
> >     >>> E.g., if the app is guilty and doesn't support robustness the UMD can
> >     >>> just call exit().
> >     >>
> >     >> It would be safer to just ignore API calls[0], similarly to what is done until the application destroys the context with robustness. Calling exit() likely results in losing any unsaved work, whereas at least some applications might otherwise allow saving the work by other means.
> >     >
> >     > That's a terrible idea. Ignoring API calls would be identical to a freeze. You might as well disable GPU recovery because the result would be the same.
> >
> >     No GPU recovery would affect everything using the GPU, whereas this affects only non-robust applications.
> >
> > which is currently the majority.
>
> Not sure where you're going with this. Applications need to use robustness to be able to recover from a GPU hang, and the GPU needs to be reset for that. So disabling GPU reset is not the same as what we're discussing here.
>
>
> >     > - non-robust contexts: call exit(1) immediately, which is the best way to recover
> >
> >     That's not the UMD's call to make.
> >
> > That's absolutely the UMD's call to make because that's mandated by the hw and API design
>
> Can you point us to a spec which mandates that the process must be killed in this case?
>
>
> > and only driver devs know this, which this thread is a proof of. The default behavior is to skip all command submission if a non-robust context is lost, which looks like a freeze. That's required to prevent infinite hangs from the same context and can be caused by the side effects of the GPU reset itself, not by the cause of the previous hang. The only way out of that is killing the process.
>
> The UMD killing the process is not the only way out of that, and doing so is overreach on its part. The UMD is but one out of many components in a process, not the main one or a special one. It doesn't get to decide when the process must die, certainly not under circumstances where it must be able to continue while ignoring API calls (that's required for robustness).
>

You're mixing things up. Robust apps don't need any special action from a UMD.
Only non-robust apps need to be killed for proper recovery with the only
other alternative being not updating the window/screen, which is not
user-friendly because the user who's never heard of GPU hangs has no
fucking idea why it's frozen and what to do with it. It doesn't matter that
you can debug it because you're not the average user. Also it's already
used and required by our customers on Android because killing a process
returns the user to the desktop screen and can generate a crash dump
instead of keeping the app output frozen, and they agree that this is the
best user experience given the circumstances.

Also if the ML ignores html, that's fine.

Marek


>
> >     >>     [0] Possibly accompanied by a one-time message to stderr along the lines of "GPU reset detected but robustness not enabled in context, ignoring OpenGL API calls".
>
>
> --
> Earthling Michel Dänzer            |                  https://redhat.com
> Libre software enthusiast          |         Mesa and Xwayland developer
>
>

Michel Dänzer July 5, 2023, 7:32 a.m. UTC | #23

On 7/5/23 08:30, Marek Olšák wrote:
> On Tue, Jul 4, 2023, 03:55 Michel Dänzer <michel.daenzer@mailbox.org> wrote:
>     On 7/4/23 04:34, Marek Olšák wrote:
>     > On Mon, Jul 3, 2023, 03:12 Michel Dänzer <michel.daenzer@mailbox.org> wrote:
>     >     On 6/30/23 22:32, Marek Olšák wrote:
>     >     > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <michel.daenzer@mailbox.org> wrote:
>     >     >> On 6/30/23 16:59, Alex Deucher wrote:
>     >     >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
>     >     >>> <sebastian.wick@redhat.com> wrote:
>     >     >>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com> wrote:
>     >     >>>>>
>     >     >>>>> +Robustness
>     >     >>>>> +----------
>     >     >>>>> +
>     >     >>>>> +The only way to try to keep an application working after a reset is if it
>     >     >>>>> +complies with the robustness aspects of the graphical API that it is using.
>     >     >>>>> +
>     >     >>>>> +Graphical APIs provide ways to applications to deal with device resets. However,
>     >     >>>>> +there is no guarantee that the app will use such features correctly, and the
>     >     >>>>> +UMD can implement policies to close the app if it is a repeating offender,
>     >     >>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
>     >     >>>>> +the user interface from being correctly displayed. This should be done even if
>     >     >>>>> +the app is correct but happens to trigger some bug in the hardware/driver.
>     >     >>>>
>     >     >>>> I still don't think it's good to let the kernel arbitrarily kill
>     >     >>>> processes that it thinks are not well-behaved based on some heuristics
>     >     >>>> and policy.
>     >     >>>>
>     >     >>>> Can't this be outsourced to user space? Expose the information about
>     >     >>>> processes causing a device and let e.g. systemd deal with coming up
>     >     >>>> with a policy and with killing stuff.
>     >     >>>
>     >     >>> I don't think it's the kernel doing the killing, it would be the UMD.
>     >     >>> E.g., if the app is guilty and doesn't support robustness the UMD can
>     >     >>> just call exit().
>     >     >>
>     >     >> It would be safer to just ignore API calls[0], similarly to what is done until the application destroys the context with robustness. Calling exit() likely results in losing any unsaved work, whereas at least some applications might otherwise allow saving the work by other means.
>     >     >
>     >     > That's a terrible idea. Ignoring API calls would be identical to a freeze. You might as well disable GPU recovery because the result would be the same.
>     >
>     >     No GPU recovery would affect everything using the GPU, whereas this affects only non-robust applications.
>     >
>     > which is currently the majority.
> 
>     Not sure where you're going with this. Applications need to use robustness to be able to recover from a GPU hang, and the GPU needs to be reset for that. So disabling GPU reset is not the same as what we're discussing here.
> 
> 
>     >     > - non-robust contexts: call exit(1) immediately, which is the best way to recover
>     >
>     >     That's not the UMD's call to make.
>     >
>     > That's absolutely the UMD's call to make because that's mandated by the hw and API design
> 
>     Can you point us to a spec which mandates that the process must be killed in this case?
> 
> 
>     > and only driver devs know this, which this thread is a proof of. The default behavior is to skip all command submission if a non-robust context is lost, which looks like a freeze. That's required to prevent infinite hangs from the same context and can be caused by the side effects of the GPU reset itself, not by the cause of the previous hang. The only way out of that is killing the process.
> 
>     The UMD killing the process is not the only way out of that, and doing so is overreach on its part. The UMD is but one out of many components in a process, not the main one or a special one. It doesn't get to decide when the process must die, certainly not under circumstances where it must be able to continue while ignoring API calls (that's required for robustness).
> 
> 
> You're mixing things up. Robust apps don't need any special action from a UMD. Only non-robust apps need to be killed for proper recovery with the only other alternative being not updating the window/screen,

I'm saying they don't "need" to be killed, since the UMD must be able to keep going while ignoring API calls (or it couldn't support robustness). It's a choice, one which is not for the UMD to make.


> Also it's already used and required by our customers on Android because killing a process returns the user to the desktop screen and can generate a crash dump instead of keeping the app output frozen, and they agree that this is the best user experience given the circumstances.

Then some appropriate Android component needs to make that call. The UMD is not it.


>     >     >>     [0] Possibly accompanied by a one-time message to stderr along the lines of "GPU reset detected but robustness not enabled in context, ignoring OpenGL API calls".

Marek Olšák July 5, 2023, 3:53 p.m. UTC | #24

On Wed, Jul 5, 2023 at 3:32 AM Michel Dänzer <michel.daenzer@mailbox.org> wrote:
>
> On 7/5/23 08:30, Marek Olšák wrote:
> > On Tue, Jul 4, 2023, 03:55 Michel Dänzer <michel.daenzer@mailbox.org> wrote:
> >     On 7/4/23 04:34, Marek Olšák wrote:
> >     > On Mon, Jul 3, 2023, 03:12 Michel Dänzer <michel.daenzer@mailbox.org> wrote:
> >     >     On 6/30/23 22:32, Marek Olšák wrote:
> >     >     > On Fri, Jun 30, 2023 at 11:11 AM Michel Dänzer <michel.daenzer@mailbox.org> wrote:
> >     >     >> On 6/30/23 16:59, Alex Deucher wrote:
> >     >     >>> On Fri, Jun 30, 2023 at 10:49 AM Sebastian Wick
> >     >     >>> <sebastian.wick@redhat.com> wrote:
> >     >     >>>> On Tue, Jun 27, 2023 at 3:23 PM André Almeida <andrealmeid@igalia.com> wrote:
> >     >     >>>>>
> >     >     >>>>> +Robustness
> >     >     >>>>> +----------
> >     >     >>>>> +
> >     >     >>>>> +The only way to try to keep an application working after a reset is if it
> >     >     >>>>> +complies with the robustness aspects of the graphical API that it is using.
> >     >     >>>>> +
> >     >     >>>>> +Graphical APIs provide ways to applications to deal with device resets. However,
> >     >     >>>>> +there is no guarantee that the app will use such features correctly, and the
> >     >     >>>>> +UMD can implement policies to close the app if it is a repeating offender,
> >     >     >>>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
> >     >     >>>>> +the user interface from being correctly displayed. This should be done even if
> >     >     >>>>> +the app is correct but happens to trigger some bug in the hardware/driver.
> >     >     >>>>
> >     >     >>>> I still don't think it's good to let the kernel arbitrarily kill
> >     >     >>>> processes that it thinks are not well-behaved based on some heuristics
> >     >     >>>> and policy.
> >     >     >>>>
> >     >     >>>> Can't this be outsourced to user space? Expose the information about
> >     >     >>>> processes causing a device and let e.g. systemd deal with coming up
> >     >     >>>> with a policy and with killing stuff.
> >     >     >>>
> >     >     >>> I don't think it's the kernel doing the killing, it would be the UMD.
> >     >     >>> E.g., if the app is guilty and doesn't support robustness the UMD can
> >     >     >>> just call exit().
> >     >     >>
> >     >     >> It would be safer to just ignore API calls[0], similarly to what is done until the application destroys the context with robustness. Calling exit() likely results in losing any unsaved work, whereas at least some applications might otherwise allow saving the work by other means.
> >     >     >
> >     >     > That's a terrible idea. Ignoring API calls would be identical to a freeze. You might as well disable GPU recovery because the result would be the same.
> >     >
> >     >     No GPU recovery would affect everything using the GPU, whereas this affects only non-robust applications.
> >     >
> >     > which is currently the majority.
> >
> >     Not sure where you're going with this. Applications need to use robustness to be able to recover from a GPU hang, and the GPU needs to be reset for that. So disabling GPU reset is not the same as what we're discussing here.
> >
> >
> >     >     > - non-robust contexts: call exit(1) immediately, which is the best way to recover
> >     >
> >     >     That's not the UMD's call to make.
> >     >
> >     > That's absolutely the UMD's call to make because that's mandated by the hw and API design
> >
> >     Can you point us to a spec which mandates that the process must be killed in this case?
> >
> >
> >     > and only driver devs know this, which this thread is a proof of. The default behavior is to skip all command submission if a non-robust context is lost, which looks like a freeze. That's required to prevent infinite hangs from the same context and can be caused by the side effects of the GPU reset itself, not by the cause of the previous hang. The only way out of that is killing the process.
> >
> >     The UMD killing the process is not the only way out of that, and doing so is overreach on its part. The UMD is but one out of many components in a process, not the main one or a special one. It doesn't get to decide when the process must die, certainly not under circumstances where it must be able to continue while ignoring API calls (that's required for robustness).
> >
> >
> > You're mixing things up. Robust apps don't need any special action from a UMD. Only non-robust apps need to be killed for proper recovery with the only other alternative being not updating the window/screen,
>
> I'm saying they don't "need" to be killed, since the UMD must be able to keep going while ignoring API calls (or it couldn't support robustness). It's a choice, one which is not for the UMD to make.
>
>
> > Also it's already used and required by our customers on Android because killing a process returns the user to the desktop screen and can generate a crash dump instead of keeping the app output frozen, and they agree that this is the best user experience given the circumstances.
>
> Then some appropriate Android component needs to make that call. The UMD is not it.

We can change it once Android and Linux have a better way to handle
non-robust apps. Until then, generating a core dump after a GPU crash
produces the best outcome for users and developers.

Marek

André Almeida July 25, 2023, 2:55 a.m. UTC | #25

Hi everyone,

It's not clear what we should do about non-robust OpenGL apps after GPU 
resets, so I'll try to summarize the topic, show some options and my 
proposal to move forward on that.

Em 27/06/2023 10:23, André Almeida escreveu:
> +Robustness
> +----------
> +
> +The only way to try to keep an application working after a reset is if it
> +complies with the robustness aspects of the graphical API that it is using.
> +
> +Graphical APIs provide ways to applications to deal with device resets. However,
> +there is no guarantee that the app will use such features correctly, and the
> +UMD can implement policies to close the app if it is a repeating offender,
> +likely in a broken loop. This is done to ensure that it does not keep blocking
> +the user interface from being correctly displayed. This should be done even if
> +the app is correct but happens to trigger some bug in the hardware/driver.
> +
Depending on the OpenGL version, there are different robustness APIs
available:

- OpenGL ARB extension [0]
- OpenGL KHR extension [1]
- OpenGL ES extension  [2]

Apps written in OpenGL should use whatever version is available for them
to make the app robust against GPU resets. That usually means calling
GetGraphicsResetStatusARB() and checking the status: if it returns
something different from NO_ERROR, a reset has happened, the context is
considered lost and should be recreated. If an app follows this, it will
likely succeed in recovering from a reset.
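
A minimal sketch of that check, written so it builds without GL headers. The
enum values mirror the ARB_robustness reset-status tokens; the function name
on_reset_status() and the app_action enum are hypothetical, and a real app
would pass in the value returned by glGetGraphicsResetStatusARB() on a context
created with robustness enabled.

```c
/* Reset-status values, mirroring the ARB_robustness tokens so that this
 * sketch does not depend on GL headers. */
enum reset_status {
    RESET_NO_ERROR = 0,      /* GL_NO_ERROR */
    RESET_GUILTY   = 0x8253, /* GL_GUILTY_CONTEXT_RESET_ARB */
    RESET_INNOCENT = 0x8254, /* GL_INNOCENT_CONTEXT_RESET_ARB */
    RESET_UNKNOWN  = 0x8255  /* GL_UNKNOWN_CONTEXT_RESET_ARB */
};

/* Hypothetical: what the app does next, once per frame. */
enum app_action {
    APP_CONTINUE,        /* no reset seen, keep rendering */
    APP_RECREATE_CONTEXT /* context lost: destroy it and build a new one */
};

/* Any status other than NO_ERROR means the context is lost, whether this
 * context was guilty, innocent, or the driver cannot tell. */
enum app_action on_reset_status(unsigned int status)
{
    return status == RESET_NO_ERROR ? APP_CONTINUE : APP_RECREATE_CONTEXT;
}
```

The guilty/innocent distinction is not needed to decide whether to recreate
the context, but an app could use it to avoid resubmitting the same work if
it was the guilty one.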

What should non-robust apps do then? They certainly will not be
notified if a reset happens, and thus can't recover if their context is
lost. The OpenGL specification does not explicitly define what should be
done in such situations[3], and I believe that usually when the spec
mandates closing the app, it would explicitly note it.

However, in reality there are different types of device resets, causing 
different results. A reset can be precise enough to damage only the 
guilty context, and keep others alive.

Given that, I believe drivers have the following options:

a) Kill all non-robust apps after a reset. This may lead to lost work
from innocent applications.

b) Ignore all OpenGL calls from non-robust apps. That means that
applications would still be alive, but the user interface would be
frozen. The user would need to close it manually anyway, but in some
corner cases, the app could autosave some work or the user might be able
to interact with it using some alternative method (command line?).

c) Kill just the affected non-robust applications. To do that, the
driver needs to be 100% sure about the impact of its resets.

RadeonSI currently implements a), as can be seen at [4], while Iris
implements what I think is c)[5].

From the user experience point of view, c) is clearly the best option,
but it's the hardest to achieve. There's not much gain in having b) over
a); perhaps it could be an optional env var for such corner case
applications.

My proposal for the documentation is: implement a) if nothing else is 
available, have a MESA_NO_RESET_KILL for people that want b), ideally 
implement c) if the driver is able to know for sure that the non-guilty 
apps can still work after a reset.
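
To make the proposal concrete, here is a rough sketch of how a UMD could pick
between a), b) and c). Everything in it is hypothetical: choose_policy(), the
lost_ctx_policy enum and the way MESA_NO_RESET_KILL is parsed are assumptions
for illustration only, and MESA_NO_RESET_KILL itself is just the name proposed
above, not an existing Mesa option.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical set of actions a UMD could take for one context. */
enum lost_ctx_policy {
    POLICY_NO_ACTION,    /* robust app, or context known to be unaffected */
    POLICY_IGNORE_CALLS, /* option b): keep the process, drop GL calls */
    POLICY_KILL_PROCESS  /* option a), or c) for an affected context */
};

/* Assumed parsing: any non-empty value other than "0..." opts into b). */
bool no_reset_kill_requested(void)
{
    const char *v = getenv("MESA_NO_RESET_KILL");
    return v != NULL && v[0] != '\0' && v[0] != '0';
}

/*
 * robust:   context was created with robustness enabled
 * affected: this context was lost in the reset
 * precise:  driver knows for sure which contexts the reset affected
 * no_kill:  user asked for b) instead of killing
 */
enum lost_ctx_policy choose_policy(bool robust, bool affected,
                                   bool precise, bool no_kill)
{
    if (robust)
        return POLICY_NO_ACTION; /* the app polls the reset status itself */
    if (precise && !affected)
        return POLICY_NO_ACTION; /* c): innocent app keeps running */
    if (no_kill)
        return POLICY_IGNORE_CALLS;
    return POLICY_KILL_PROCESS;  /* imprecise reset falls back to a) */
}
```

When the driver cannot be precise, the affected flag is ignored and every
non-robust context gets the same treatment, which matches a).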

Thanks,
     André

[0] https://registry.khronos.org/OpenGL/extensions/ARB/ARB_robustness.txt
[1] https://registry.khronos.org/OpenGL/extensions/KHR/KHR_robustness.txt
[2] https://registry.khronos.org/OpenGL/extensions/EXT/EXT_robustness.txt
[3] https://registry.khronos.org/OpenGL/specs/gl/glspec46.core.pdf
[4] https://gitlab.freedesktop.org/mesa/mesa/-/blob/23.1/src/gallium/winsys/amdgpu/drm/amdgpu_cs.c#L1657
[5] https://gitlab.freedesktop.org/mesa/mesa/-/blob/23.1/src/gallium/drivers/iris/iris_batch.c#L842

Simon Ser July 25, 2023, 7:02 a.m. UTC | #26

On Tuesday, July 25th, 2023 at 04:55, André Almeida <andrealmeid@igalia.com> wrote:

> It's not clear what we should do about non-robust OpenGL apps after GPU
> resets, so I'll try to summarize the topic, show some options and my
> proposal to move forward on that.
> 
> Em 27/06/2023 10:23, André Almeida escreveu:
> 
> > +Robustness
> > +----------
> > +
> > +The only way to try to keep an application working after a reset is if it
> > +complies with the robustness aspects of the graphical API that it is using.
> > +
> > +Graphical APIs provide ways to applications to deal with device resets. However,
> > +there is no guarantee that the app will use such features correctly, and the
> > +UMD can implement policies to close the app if it is a repeating offender,
> > +likely in a broken loop. This is done to ensure that it does not keep blocking
> > +the user interface from being correctly displayed. This should be done even if
> > +the app is correct but happens to trigger some bug in the hardware/driver.
> > +
> 
> Depending on the OpenGL version, there are different robustness APIs
> available:
> 
> - OpenGL ARB extension [0]
> - OpenGL KHR extension [1]
> - OpenGL ES extension [2]
> 
> Apps written in OpenGL should use whatever version is available for them
> to make the app robust against GPU resets. That usually means calling
> GetGraphicsResetStatusARB() and checking the status: if it returns
> something different from NO_ERROR, a reset has happened, the context is
> considered lost and should be recreated. If an app follows this, it will
> likely succeed in recovering from a reset.
> 
> What should non-robust apps do then? They certainly will not be
> notified if a reset happens, and thus can't recover if their context is
> lost. The OpenGL specification does not explicitly define what should be
> done in such situations[3], and I believe that usually when the spec
> mandates closing the app, it would explicitly note it.
> 
> However, in reality there are different types of device resets, causing
> different results. A reset can be precise enough to damage only the
> guilty context, and keep others alive.
> 
> Given that, I believe drivers have the following options:
> 
> a) Kill all non-robust apps after a reset. This may lead to lost work
> from innocent applications.
> 
> b) Ignore all OpenGL calls from non-robust apps. That means that
> applications would still be alive, but the user interface would be
> frozen. The user would need to close it manually anyway, but in some
> corner cases, the app could autosave some work or the user might be able
> to interact with it using some alternative method (command line?).
> 
> c) Kill just the affected non-robust applications. To do that, the
> driver needs to be 100% sure about the impact of its resets.

We've discussed this a while back on #dri-devel IIRC. I think the best
experience would be for the Wayland compositor to gray out apps which
lost their GL context, and display an information dialog to explain
what happened to the user and a button to kill the app. I'm not exactly
sure how that would translate to a kernel or Mesa uAPI, and if there's
appetite to do a lot of work to get "the best GPU reset UX" (IOW: maybe
it's not worth all of the trouble).

Michel Dänzer July 25, 2023, 8:03 a.m. UTC | #27

On 7/25/23 04:55, André Almeida wrote:
> Hi everyone,
> 
> It's not clear what we should do about non-robust OpenGL apps after GPU resets, so I'll try to summarize the topic, show some options and my proposal to move forward on that.
> 
> Em 27/06/2023 10:23, André Almeida escreveu:
>> +Robustness
>> +----------
>> +
>> +The only way to try to keep an application working after a reset is if it
>> +complies with the robustness aspects of the graphical API that it is using.
>> +
>> +Graphical APIs provide ways to applications to deal with device resets. However,
>> +there is no guarantee that the app will use such features correctly, and the
>> +UMD can implement policies to close the app if it is a repeating offender,
>> +likely in a broken loop. This is done to ensure that it does not keep blocking
>> +the user interface from being correctly displayed. This should be done even if
>> +the app is correct but happens to trigger some bug in the hardware/driver.
>> +
> Depending on the OpenGL version, there are different robustness APIs available:
> 
> - OpenGL ARB extension [0]
> - OpenGL KHR extension [1]
> - OpenGL ES extension  [2]
> 
> Apps written in OpenGL should use whatever version is available for them to make the app robust against GPU resets. That usually means calling GetGraphicsResetStatusARB() and checking the status: if it returns something different from NO_ERROR, a reset has happened, the context is considered lost and should be recreated. If an app follows this, it will likely succeed in recovering from a reset.
> 
> What should non-robust apps do then? They certainly will not be notified if a reset happens, and thus can't recover if their context is lost. The OpenGL specification does not explicitly define what should be done in such situations[3], and I believe that usually when the spec mandates closing the app, it would explicitly note it.
> 
> However, in reality there are different types of device resets, causing different results. A reset can be precise enough to damage only the guilty context, and keep others alive.
> 
> Given that, I believe drivers have the following options:
> 
> a) Kill all non-robust apps after a reset. This may lead to lost work from innocent applications.
> 
> b) Ignore all OpenGL calls from non-robust apps. That means that applications would still be alive, but the user interface would be frozen. The user would need to close it manually anyway, but in some corner cases, the app could autosave some work or the user might be able to interact with it using some alternative method (command line?).
> 
> c) Kill just the affected non-robust applications. To do that, the driver needs to be 100% sure about the impact of its resets.
> 
> RadeonSI currently implements a), as can be seen at [4], while Iris implements what I think is c)[5].
> 
> From the user experience point of view, c) is clearly the best option, but it's the hardest to achieve. There's not much gain in having b) over a); perhaps it could be an optional env var for such corner case applications.

I disagree on these conclusions.

c) is certainly better than a), but it's not "clearly the best" in all cases. The OpenGL UMD is not a privileged/special component and is in no position to decide whether or not the process as a whole (only some thread(s) of which may use OpenGL at all) gets to continue running or not.


> [0] https://registry.khronos.org/OpenGL/extensions/ARB/ARB_robustness.txt
> [1] https://registry.khronos.org/OpenGL/extensions/KHR/KHR_robustness.txt
> [2] https://registry.khronos.org/OpenGL/extensions/EXT/EXT_robustness.txt
> [3] https://registry.khronos.org/OpenGL/specs/gl/glspec46.core.pdf
> [4] https://gitlab.freedesktop.org/mesa/mesa/-/blob/23.1/src/gallium/winsys/amdgpu/drm/amdgpu_cs.c#L1657
> [5] https://gitlab.freedesktop.org/mesa/mesa/-/blob/23.1/src/gallium/drivers/iris/iris_batch.c#L842

André Almeida July 25, 2023, 1:02 p.m. UTC | #28

Hi Michel,

Em 25/07/2023 05:03, Michel Dänzer escreveu:
> On 7/25/23 04:55, André Almeida wrote:
>> Hi everyone,
>>
>> It's not clear what we should do about non-robust OpenGL apps after GPU resets, so I'll try to summarize the topic, show some options and my proposal to move forward on that.
>>
>> Em 27/06/2023 10:23, André Almeida escreveu:
>>> +Robustness
>>> +----------
>>> +
>>> +The only way to try to keep an application working after a reset is if it
>>> +complies with the robustness aspects of the graphical API that it is using.
>>> +
>>> +Graphical APIs provide ways to applications to deal with device resets. However,
>>> +there is no guarantee that the app will use such features correctly, and the
>>> +UMD can implement policies to close the app if it is a repeating offender,
>>> +likely in a broken loop. This is done to ensure that it does not keep blocking
>>> +the user interface from being correctly displayed. This should be done even if
>>> +the app is correct but happens to trigger some bug in the hardware/driver.
>>> +
>> Depending on the OpenGL version, there are different robustness APIs available:
>>
>> - OpenGL ARB extension [0]
>> - OpenGL KHR extension [1]
>> - OpenGL ES extension  [2]
>>
>> Apps written in OpenGL should use whatever version is available for them to make the app robust against GPU resets. That usually means calling GetGraphicsResetStatusARB() and checking the status: if it returns something different from NO_ERROR, a reset has happened, the context is considered lost and should be recreated. If an app follows this, it will likely succeed in recovering from a reset.
>>
>> What should non-robust apps do then? They certainly will not be notified if a reset happens, and thus can't recover if their context is lost. The OpenGL specification does not explicitly define what should be done in such situations[3], and I believe that usually when the spec mandates closing the app, it would explicitly note it.
>>
>> However, in reality there are different types of device resets, causing different results. A reset can be precise enough to damage only the guilty context, and keep others alive.
>>
>> Given that, I believe drivers have the following options:
>>
>> a) Kill all non-robust apps after a reset. This may lead to lost work from innocent applications.
>>
>> b) Ignore all OpenGL calls from non-robust apps. That means that applications would still be alive, but the user interface would be frozen. The user would need to close it manually anyway, but in some corner cases, the app could autosave some work or the user might be able to interact with it using some alternative method (command line?).
>>
>> c) Kill just the affected non-robust applications. To do that, the driver needs to be 100% sure about the impact of its resets.
>>
>> RadeonSI currently implements a), as can be seen at [4], while Iris implements what I think is c)[5].
>>
>> From the user experience point of view, c) is clearly the best option, but it's the hardest to achieve. There's not much gain in having b) over a); perhaps it could be an optional env var for such corner case applications.
> 
> I disagree on these conclusions.
> 
> c) is certainly better than a), but it's not "clearly the best" in all cases. The OpenGL UMD is not a privileged/special component and is in no position to decide whether or not the process as a whole (only some thread(s) of which may use OpenGL at all) gets to continue running or not.
> 

Thank you for the feedback. What do you think the documentation should
look like for this part?
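
For illustration, the robust path described in the quoted summary (poll the reset status, treat anything other than "no error" as a lost context, recreate it) can be sketched in self-contained C. This is only a sketch under stated assumptions: `query_reset_status()` stands in for `glGetGraphicsResetStatusARB()`, and the enum, the `app_context` struct and the helper names are hypothetical, not taken from any driver or application.

```c
#include <stdbool.h>

/* Stand-in for the status values defined by GL_ARB_robustness. */
enum reset_status {
    RESET_NONE,      /* corresponds to GL_NO_ERROR */
    RESET_GUILTY,    /* corresponds to GL_GUILTY_CONTEXT_RESET_ARB */
    RESET_INNOCENT,  /* corresponds to GL_INNOCENT_CONTEXT_RESET_ARB */
    RESET_UNKNOWN,   /* corresponds to GL_UNKNOWN_CONTEXT_RESET_ARB */
};

/* Hypothetical app-side wrapper around one rendering context. */
struct app_context {
    enum reset_status pending; /* what the stubbed driver reports next */
    unsigned generation;       /* bumped each time the context is recreated */
};

/* Assumption: stands in for glGetGraphicsResetStatusARB(). */
static enum reset_status query_reset_status(const struct app_context *ctx)
{
    return ctx->pending;
}

/* Assumption: stands in for destroying the lost context and creating
 * a fresh one, then re-uploading all resources. */
static void recreate_context(struct app_context *ctx)
{
    ctx->pending = RESET_NONE;
    ctx->generation++;
}

/* Call once per frame. Returns true if the context had to be recreated. */
static bool check_and_recover(struct app_context *ctx)
{
    if (query_reset_status(ctx) == RESET_NONE)
        return false;
    recreate_context(ctx);
    return true;
}
```

A real application would call the check from its main loop and rebuild textures, buffers and shaders after recreation, since all context state is considered lost.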

Marek Olšák July 25, 2023, 3:05 p.m. UTC | #29

On Tue, Jul 25, 2023 at 4:03 AM Michel Dänzer
<michel.daenzer@mailbox.org> wrote:
>
> On 7/25/23 04:55, André Almeida wrote:
> > [...]
>
> I disagree on these conclusions.
>
> c) is certainly better than a), but it's not "clearly the best" in all cases. The OpenGL UMD is not a privileged/special component and is in no position to decide whether or not the process as a whole (only some thread(s) of which may use OpenGL at all) gets to continue running or not.

That's not true. I recommend that you enable b) with your driver and
then hang the GPU under different scenarios and see the result. Then
enable a) and do the same and compare.

Options a) and c) can be merged into one because they are not separate
options to choose from.

If Wayland wanted to grey out lost apps, they would appear as robust
contexts in gallium, but the reset status would be piped through the
Wayland protocol instead of the GL API.

Marek

Michel Dänzer July 25, 2023, 5 p.m. UTC | #30

On 7/25/23 17:05, Marek Olšák wrote:
> On Tue, Jul 25, 2023 at 4:03 AM Michel Dänzer
> <michel.daenzer@mailbox.org> wrote:
>> On 7/25/23 04:55, André Almeida wrote:
>>> [...]
>>
>> I disagree on these conclusions.
>>
>> c) is certainly better than a), but it's not "clearly the best" in all cases. The OpenGL UMD is not a privileged/special component and is in no position to decide whether or not the process as a whole (only some thread(s) of which may use OpenGL at all) gets to continue running or not.
> 
> That's not true.

Which part of what I wrote are you referring to?


> I recommend that you enable b) with your driver and then hang the GPU under different scenarios and see the result.

I've been doing GPU driver development for over two decades; I'm perfectly aware of what it means. It doesn't change what I wrote above.

Timur Kristóf July 26, 2023, 7:55 a.m. UTC | #31

On Tue, 2023-07-25 at 19:00 +0200, Michel Dänzer wrote:
> On 7/25/23 17:05, Marek Olšák wrote:
> > On Tue, Jul 25, 2023 at 4:03 AM Michel Dänzer
> > <michel.daenzer@mailbox.org> wrote:
> > > On 7/25/23 04:55, André Almeida wrote:
> > > > [...]
> > > 
> > > I disagree on these conclusions.
> > > 
> > > c) is certainly better than a), but it's not "clearly the best"
> > > in all cases. The OpenGL UMD is not a privileged/special
> > > component and is in no position to decide whether or not the
> > > process as a whole (only some thread(s) of which may use OpenGL
> > > at all) gets to continue running or not.
> > 
> > That's not true.
> 
> Which part of what I wrote are you referring to?
> 
> 
> > I recommend that you enable b) with your driver and then hang the
> > GPU under different scenarios and see the result.
> 
> I've been doing GPU driver development for over two decades, I'm
> perfectly aware what it means. It doesn't change what I wrote above.
> 

Michel, I understand that you disagree with the proposed solutions in
this email thread, but from your mails it is unclear to me what exactly
the solution is that you would actually recommend. Can you please
clarify?

Thanks,
Timur

Michel Dänzer July 26, 2023, 8:07 a.m. UTC | #32

On 7/25/23 15:02, André Almeida wrote:
> On 25/07/2023 05:03, Michel Dänzer wrote:
>> On 7/25/23 04:55, André Almeida wrote:
>>> [...]
>>
>> I disagree on these conclusions.
>>
>> c) is certainly better than a), but it's not "clearly the best" in all cases. The OpenGL UMD is not a privileged/special component and is in no position to decide whether or not the process as a whole (only some thread(s) of which may use OpenGL at all) gets to continue running or not.
>>
> 
> Thank you for the feedback. What do you think the documentation should look like for this part?

The initial paragraph about robustness should say "keep OpenGL working" instead of "keep an application working". If an OpenGL context stops working, it doesn't necessarily mean the application stops working altogether.


If the application doesn't use the robustness extensions, your option b) is what should happen by default whenever possible. And it really has to be possible if the robustness extensions are supported.
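
As a rough sketch of what option b) could look like inside a UMD: once a context is lost, a robust context gets an error it can act on, while a non-robust context has its calls silently dropped and the process keeps running. The `umd_context` struct, the result codes and `umd_submit()` are hypothetical names chosen for this example; they are not Mesa interfaces.

```c
#include <stdbool.h>

/* Hypothetical per-context state kept by the user mode driver. */
struct umd_context {
    bool robust;        /* created with a robustness extension enabled */
    bool lost;          /* set once a reset affecting this context is seen */
    unsigned submitted; /* commands actually forwarded to the KMD */
};

enum submit_result {
    SUBMIT_OK,           /* command forwarded to the kernel driver */
    SUBMIT_CONTEXT_LOST, /* robust app: reported through the API */
    SUBMIT_IGNORED,      /* non-robust app: call dropped, process lives on */
};

/* Entry point for anything that would submit work on ctx. */
static enum submit_result umd_submit(struct umd_context *ctx)
{
    if (ctx->lost)
        return ctx->robust ? SUBMIT_CONTEXT_LOST : SUBMIT_IGNORED;

    ctx->submitted++;
    return SUBMIT_OK;
}
```

The point of the design is that only the lost context is affected: threads and other parts of the process that do not touch it continue to work normally.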

Marek Olšák Aug. 2, 2023, 7:38 a.m. UTC | #33

A screen that doesn't update isn't usable. Killing the window system
and returning to the login screen is one option. Killing the window
system manually from a terminal or over ssh and then returning to the
login screen is another option, but 99% of users either hard-reset the
machine or do sysrq+REISUB anyway because it's faster that way. Those
are all your options. If we don't do the kill, users might decide to
do a hard reset with an unsync'd file system, which can cause more
damage.

The precedent from the CPU land is pretty strong here. There is
SIGSEGV for invalid CPU memory access and SIGILL for invalid CPU
instructions, yet we do nothing for invalid GPU memory access and
invalid GPU instructions. Sending a terminating signal from the kernel
would be the most natural thing to do. Instead, we just keep a frozen
GUI to keep users helpless, or we continue command submission and then
the hanging app can cause an infinite cycle of GPU hangs and resets,
making the GPU unusable until somebody kills the app over ssh.

That's why GL/Vulkan robustness is required - either robust apps, or a
robust compositor that greys out lost windows and pops up a diagnostic
message with a list of actions to choose from. That's the direction we
should be taking. Non-robust apps under a non-robust compositor should
just be killed if they crash the GPU.


Marek

On Wed, Jul 26, 2023 at 4:07 AM Michel Dänzer
<michel.daenzer@mailbox.org> wrote:
>
> On 7/25/23 15:02, André Almeida wrote:
> > On 25/07/2023 05:03, Michel Dänzer wrote:
> >> On 7/25/23 04:55, André Almeida wrote:
> >>> [...]
> >>
> >> I disagree on these conclusions.
> >>
> >> c) is certainly better than a), but it's not "clearly the best" in all cases. The OpenGL UMD is not a privileged/special component and is in no position to decide whether or not the process as a whole (only some thread(s) of which may use OpenGL at all) gets to continue running or not.
> >>
> >
> > Thank you for the feedback. What do you think the documentation should look like for this part?
>
> The initial paragraph about robustness should say "keep OpenGL working" instead of "keep an application working". If an OpenGL context stops working, it doesn't necessarily mean the application stops working altogether.
>
>
> If the application doesn't use the robustness extensions, your option b) is what should happen by default whenever possible. And it really has to be possible if the robustness extensions are supported.

Michel Dänzer Aug. 2, 2023, 8:34 a.m. UTC | #34

On 8/2/23 09:38, Marek Olšák wrote:
> 
> The precedent from the CPU land is pretty strong here. There is
> SIGSEGV for invalid CPU memory access and SIGILL for invalid CPU
> instructions, yet we do nothing for invalid GPU memory access and
> invalid GPU instructions. Sending a terminating signal from the kernel
> would be the most natural thing to do.

After an unhandled SIGSEGV or SIGILL, the process is in an inconsistent state and cannot safely continue executing. That's why the process is terminated by default in those cases.

The same is not true when an OpenGL context stops working. Any threads / other parts of the process not using that OpenGL context continue working normally. And any attempts to use that OpenGL context can be safely ignored (or the OpenGL implementation couldn't support the robustness extensions).

Daniel Vetter Aug. 4, 2023, 1:03 p.m. UTC | #35

On Tue, Jun 27, 2023 at 10:23:23AM -0300, André Almeida wrote:
> Create a section that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
> 
> Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
> 
> v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/
> 
> Changes:
>  - Grammar fixes (Randy)
> 
>  Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
>  1 file changed, 68 insertions(+)
> 
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 65fb3036a580..3cbffa25ed93 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
>  mmapped regular files. Threads cause additional pain with signal
>  handling as well.
>  
> +Device reset
> +============
> +
> +The GPU stack is really complex and is prone to errors, from hardware bugs,
> +faulty applications and everything in between the many layers. Some errors
> +require resetting the device in order to make the device usable again. This
> +sections describes the expectations for DRM and usermode drivers when a
> +device resets and how to propagate the reset status.
> +
> +Kernel Mode Driver
> +------------------
> +
> +The KMD is responsible for checking if the device needs a reset, and to perform
> +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> +should keep track of resets, because userspace can query any time about the
> +reset stats for an specific context. This is needed to propagate to the rest of
> +the stack that a reset has happened. Currently, this is implemented by each
> +driver separately, with no common DRM interface.
> +
> +User Mode Driver
> +----------------
> +
> +The UMD should check before submitting new commands to the KMD if the device has
> +been reset, and this can be checked more often if the UMD requires it. After
> +detecting a reset, UMD will then proceed to report it to the application using
> +the appropriate API error code, as explained in the section below about
> +robustness.
> +
> +Robustness
> +----------
> +
> +The only way to try to keep an application working after a reset is if it
> +complies with the robustness aspects of the graphical API that it is using.
> +
> +Graphical APIs provide ways to applications to deal with device resets. However,
> +there is no guarantee that the app will use such features correctly, and the
> +UMD can implement policies to close the app if it is a repeating offender,

Not sure whether this one here is due to my input, but s/UMD/KMD/. Killing
repeat offenders is a policy that the kernel enforces; it is no longer up
to userspace to do the right thing (because very clearly userspace is not
really doing the right thing anymore when it's just hanging the GPU in an
endless loop). Also maybe tone it down further to something like "the
kernel driver may implement ..."

In my opinion the UMD shouldn't implement these kinds of magic guesses; the
entire point of robustness APIs is to delegate responsibility for
correctly recovering to the application. The kernel is left with
enforcing fair resource usage policies (which eventually might be a
cgroups limit on how much GPU time you're allowed to waste with GPU
resets).
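
To make the suggestion concrete, a repeat-offender policy that a kernel driver may implement could be sketched like this. Everything here is an assumption made for illustration: the `kmd_context` struct, the function names and the threshold of three guilty resets are hypothetical and do not describe any existing DRM driver.

```c
#include <stdbool.h>

/* Hypothetical threshold: guilty resets tolerated before a ban. */
#define MAX_GUILTY_RESETS 3

/* Hypothetical per-context bookkeeping in the kernel mode driver. */
struct kmd_context {
    unsigned guilty_resets; /* resets attributed to this context */
    bool banned;            /* further submissions are rejected */
};

/* Called by the reset handler for the context whose job hung. */
static void kmd_note_guilty_reset(struct kmd_context *ctx)
{
    if (++ctx->guilty_resets >= MAX_GUILTY_RESETS)
        ctx->banned = true;
}

/* Returns 0 if the submission is accepted, -1 if the context is banned. */
static int kmd_submit(const struct kmd_context *ctx)
{
    return ctx->banned ? -1 : 0;
}
```

A real policy would more likely be time-based or tied to a resource controller, as the cgroups remark above suggests; the fixed counter only keeps the sketch short.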

> +likely in a broken loop. This is done to ensure that it does not keep blocking
> +the user interface from being correctly displayed. This should be done even if
> +the app is correct but happens to trigger some bug in the hardware/driver.
> +
> +OpenGL
> +~~~~~~
> +
> +Apps using OpenGL should use the available robust interfaces, like the
> +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> +interface tells if a reset has happened, and if so, all the context state is
> +considered lost and the app proceeds by creating new ones. If it is possible to
> +determine that robustness is not in use, the UMD will terminate the app when a
> +reset is detected, giving that the contexts are lost and the app won't be able
> +to figure this out and recreate the contexts.
> +
> +Vulkan
> +~~~~~~
> +
> +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> +This error code means, among other things, that a device reset has happened and
> +it needs to recreate the contexts to keep going.
> +
> +Reporting causes of resets
> +--------------------------
> +
> +Apart from propagating the reset through the stack so apps can recover, it's
> +really useful for driver developers to learn more about what caused the reset in
> +first place. DRM devices should make use of devcoredump to store relevant
> +information about the reset, so this information can be added to user bug
> +reports.

Since we do not seem to have a solid consensus in the community about
non-robust userspace, maybe we could just document that lack of consensus
to unblock this patch? Something like this:

Non-Robust Userspace
--------------------

Userspace that doesn't support robust interfaces (like a non-robust
OpenGL context, or an API without any robustness support like libva) leaves
the robustness handling entirely to the userspace driver. There is no strong
community consensus on what the userspace driver should do in that case,
since all reasonable approaches have some clear downsides.

With the s/UMD/KMD/ further up and maybe something added to record the
non-robustness non-consensus:

Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>

Cheers, Daniel



> +
>  .. _drm_driver_ioctl:
>  
>  IOCTL Support on Device Nodes
> -- 
> 2.41.0
>

Sebastian Wick Aug. 8, 2023, 12:13 p.m. UTC | #36

On Fri, Aug 4, 2023 at 3:03 PM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Tue, Jun 27, 2023 at 10:23:23AM -0300, André Almeida wrote:
> > Create a section that specifies how to deal with DRM device resets for
> > kernel and userspace drivers.
> >
> > Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com>
> > Signed-off-by: André Almeida <andrealmeid@igalia.com>
> > ---
> >
> > v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/
> >
> > Changes:
> >  - Grammar fixes (Randy)
> >
> >  Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
> >  1 file changed, 68 insertions(+)
> >
> > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > index 65fb3036a580..3cbffa25ed93 100644
> > --- a/Documentation/gpu/drm-uapi.rst
> > +++ b/Documentation/gpu/drm-uapi.rst
> > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
> >  mmapped regular files. Threads cause additional pain with signal
> >  handling as well.
> >
> > +Device reset
> > +============
> > +
> > +The GPU stack is really complex and is prone to errors, from hardware bugs,
> > +faulty applications and everything in between the many layers. Some errors
> > +require resetting the device in order to make the device usable again. This
> > +sections describes the expectations for DRM and usermode drivers when a
> > +device resets and how to propagate the reset status.
> > +
> > +Kernel Mode Driver
> > +------------------
> > +
> > +The KMD is responsible for checking if the device needs a reset, and to perform
> > +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> > +should keep track of resets, because userspace can query any time about the
> > +reset stats for an specific context. This is needed to propagate to the rest of
> > +the stack that a reset has happened. Currently, this is implemented by each
> > +driver separately, with no common DRM interface.
> > +
> > +User Mode Driver
> > +----------------
> > +
> > +The UMD should check before submitting new commands to the KMD if the device has
> > +been reset, and this can be checked more often if the UMD requires it. After
> > +detecting a reset, UMD will then proceed to report it to the application using
> > +the appropriate API error code, as explained in the section below about
> > +robustness.
> > +
> > +Robustness
> > +----------
> > +
> > +The only way to try to keep an application working after a reset is if it
> > +complies with the robustness aspects of the graphical API that it is using.
> > +
> > +Graphical APIs provide ways to applications to deal with device resets. However,
> > +there is no guarantee that the app will use such features correctly, and the
> > +UMD can implement policies to close the app if it is a repeating offender,
>
> Not sure whether this one here is due to my input, but s/UMD/KMD. Repeat
> offender killing is more a policy where the kernel enforces policy, and no
> longer up to userspace to dtrt (because very clearly userspace is not
> really doing the right thing anymore when it's just hanging the gpu in an
> endless loop). Also maybe tune it down further to something like "the
> kernel driver may implement ..."
>
> In my opinion the umd shouldn't implement these kind of magic guesses, the
> entire point of robustness apis is to delegate responsibility for
> correctly recovering to the application. And the kernel is left with
> enforcing fair resource usage policies (which eventually might be a
> cgroups limit on how much gpu time you're allowed to waste with gpu
> resets).

Killing apps that the kernel thinks are misbehaving really doesn't
seem like a good idea to me. What if the process is a service getting
restarted after getting killed? What if killing that process leaves
the system in a bad state?

Can't the kernel provide some information to user space so that e.g.
systemd can handle those situations?

> > +likely in a broken loop. This is done to ensure that it does not keep blocking
> > +the user interface from being correctly displayed. This should be done even if
> > +the app is correct but happens to trigger some bug in the hardware/driver.
> > +
> > +OpenGL
> > +~~~~~~
> > +
> > +Apps using OpenGL should use the available robust interfaces, like the
> > +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> > +interface tells if a reset has happened, and if so, all the context state is
> > +considered lost and the app proceeds by creating new ones. If it is possible to
> > +determine that robustness is not in use, the UMD will terminate the app when a
> > +reset is detected, giving that the contexts are lost and the app won't be able
> > +to figure this out and recreate the contexts.
> > +
> > +Vulkan
> > +~~~~~~
> > +
> > +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> > +This error code means, among other things, that a device reset has happened and
> > +it needs to recreate the contexts to keep going.
> > +
> > +Reporting causes of resets
> > +--------------------------
> > +
> > +Apart from propagating the reset through the stack so apps can recover, it's
> > +really useful for driver developers to learn more about what caused the reset in
> > +first place. DRM devices should make use of devcoredump to store relevant
> > +information about the reset, so this information can be added to user bug
> > +reports.
>
> Since we do not seem to have a solid consensus in the community about
> non-robust userspace, maybe we could just document that lack of consensus
> to unblock this patch? Something like this:
>
> Non-Robust Userspace
> --------------------
>
> Userspace that doesn't support robust interfaces (like an non-robust
> OpenGL context or API without any robustness support like libva) leave the
> robustness handling entirely to the userspace driver. There is no strong
> community consensus on what the userspace driver should do in that case,
> since all reasonable approaches have some clear downsides.
>
> With the s/UMD/KMD/ further up and maybe something added to record the
> non-robustness non-consensus:
>
> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
>
> Cheers, Daniel
>
>
>
> > +
> >  .. _drm_driver_ioctl:
> >
> >  IOCTL Support on Device Nodes
> > --
> > 2.41.0
> >
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
>

Marek Olšák Aug. 8, 2023, 5:03 p.m. UTC | #37

It's the same situation as SIGSEGV. A process can catch the signal,
but if it doesn't, it gets killed. GL and Vulkan APIs give you a way
to catch the GPU error and prevent the process termination. If you
don't use the API, you'll get undefined behavior, which means anything
can happen, including process termination.
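
The signal half of this analogy can be sketched in a few lines of C. This is only an illustration, not anything from the patch: the helper name `survives_sigsegv` is hypothetical, and the signal is generated with `raise()` instead of a real invalid access, because returning from a handler after a genuine fault is undefined behavior.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Set by the handler; sig_atomic_t is safe to write from a signal handler. */
static volatile sig_atomic_t caught;

static void on_sigsegv(int signo)
{
	(void)signo;
	caught = 1;
}

/*
 * Hypothetical helper: install a SIGSEGV handler, raise the signal and
 * report whether the process caught it. Without the handler, the default
 * action for SIGSEGV would terminate the process.
 */
int survives_sigsegv(void)
{
	struct sigaction sa, old;
	int rc;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigsegv;
	sigemptyset(&sa.sa_mask);

	if (sigaction(SIGSEGV, &sa, &old) != 0)
		return 0;

	caught = 0;
	raise(SIGSEGV);	/* the handler runs before raise() returns */

	/* Restore whatever disposition was installed before. */
	rc = sigaction(SIGSEGV, &old, NULL);
	assert(rc == 0);
	(void)rc;

	return caught == 1;
}
```

A process that opts in by installing the handler keeps running; one that does not is terminated by the default action. This is the behavior being compared to robust and non-robust GPU contexts.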



Marek

On Tue, Aug 8, 2023 at 8:14 AM Sebastian Wick <sebastian.wick@redhat.com> wrote:
>
> On Fri, Aug 4, 2023 at 3:03 PM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Tue, Jun 27, 2023 at 10:23:23AM -0300, André Almeida wrote:
> > > Create a section that specifies how to deal with DRM device resets for
> > > kernel and userspace drivers.
> > >
> > > Acked-by: Pekka Paalanen <pekka.paalanen@collabora.com>
> > > Signed-off-by: André Almeida <andrealmeid@igalia.com>
> > > ---
> > >
> > > v4: https://lore.kernel.org/lkml/20230626183347.55118-1-andrealmeid@igalia.com/
> > >
> > > Changes:
> > >  - Grammar fixes (Randy)
> > >
> > >  Documentation/gpu/drm-uapi.rst | 68 ++++++++++++++++++++++++++++++++++
> > >  1 file changed, 68 insertions(+)
> > >
> > > diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> > > index 65fb3036a580..3cbffa25ed93 100644
> > > --- a/Documentation/gpu/drm-uapi.rst
> > > +++ b/Documentation/gpu/drm-uapi.rst
> > > @@ -285,6 +285,74 @@ for GPU1 and GPU2 from different vendors, and a third handler for
> > >  mmapped regular files. Threads cause additional pain with signal
> > >  handling as well.
> > >
> > > +Device reset
> > > +============
> > > +
> > > +The GPU stack is really complex and is prone to errors, from hardware bugs,
> > > +faulty applications and everything in between the many layers. Some errors
> > > +require resetting the device in order to make the device usable again. This
> > > +sections describes the expectations for DRM and usermode drivers when a
> > > +device resets and how to propagate the reset status.
> > > +
> > > +Kernel Mode Driver
> > > +------------------
> > > +
> > > +The KMD is responsible for checking if the device needs a reset, and to perform
> > > +it as needed. Usually a hang is detected when a job gets stuck executing. KMD
> > > +should keep track of resets, because userspace can query any time about the
> > > +reset stats for an specific context. This is needed to propagate to the rest of
> > > +the stack that a reset has happened. Currently, this is implemented by each
> > > +driver separately, with no common DRM interface.
> > > +
> > > +User Mode Driver
> > > +----------------
> > > +
> > > +The UMD should check before submitting new commands to the KMD if the device has
> > > +been reset, and this can be checked more often if the UMD requires it. After
> > > +detecting a reset, UMD will then proceed to report it to the application using
> > > +the appropriate API error code, as explained in the section below about
> > > +robustness.
> > > +
> > > +Robustness
> > > +----------
> > > +
> > > +The only way to try to keep an application working after a reset is if it
> > > +complies with the robustness aspects of the graphical API that it is using.
> > > +
> > > +Graphical APIs provide ways to applications to deal with device resets. However,
> > > +there is no guarantee that the app will use such features correctly, and the
> > > +UMD can implement policies to close the app if it is a repeating offender,
> >
> > Not sure whether this one here is due to my input, but s/UMD/KMD. Repeat
> > offender killing is more a policy where the kernel enforces policy, and no
> > longer up to userspace to dtrt (because very clearly userspace is not
> > really doing the right thing anymore when it's just hanging the gpu in an
> > endless loop). Also maybe tune it down further to something like "the
> > kernel driver may implement ..."
> >
> > In my opinion the umd shouldn't implement these kind of magic guesses, the
> > entire point of robustness apis is to delegate responsibility for
> > correctly recovering to the application. And the kernel is left with
> > enforcing fair resource usage policies (which eventually might be a
> > cgroups limit on how much gpu time you're allowed to waste with gpu
> > resets).
>
> Killing apps that the kernel thinks are misbehaving really doesn't
> seem like a good idea to me. What if the process is a service getting
> restarted after getting killed? What if killing that process leaves
> the system in a bad state?
>
> Can't the kernel provide some information to user space so that e.g.
> systemd can handle those situations?
>
> > > +likely in a broken loop. This is done to ensure that it does not keep blocking
> > > +the user interface from being correctly displayed. This should be done even if
> > > +the app is correct but happens to trigger some bug in the hardware/driver.
> > > +
> > > +OpenGL
> > > +~~~~~~
> > > +
> > > +Apps using OpenGL should use the available robust interfaces, like the
> > > +extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
> > > +interface tells if a reset has happened, and if so, all the context state is
> > > +considered lost and the app proceeds by creating new ones. If it is possible to
> > > +determine that robustness is not in use, the UMD will terminate the app when a
> > > +reset is detected, giving that the contexts are lost and the app won't be able
> > > +to figure this out and recreate the contexts.
> > > +
> > > +Vulkan
> > > +~~~~~~
> > > +
> > > +Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` for submissions.
> > > +This error code means, among other things, that a device reset has happened and
> > > +it needs to recreate the contexts to keep going.
> > > +
> > > +Reporting causes of resets
> > > +--------------------------
> > > +
> > > +Apart from propagating the reset through the stack so apps can recover, it's
> > > +really useful for driver developers to learn more about what caused the reset in
> > > +first place. DRM devices should make use of devcoredump to store relevant
> > > +information about the reset, so this information can be added to user bug
> > > +reports.
> >
> > Since we do not seem to have a solid consensus in the community about
> > non-robust userspace, maybe we could just document that lack of consensus
> > to unblock this patch? Something like this:
> >
> > Non-Robust Userspace
> > --------------------
> >
> > Userspace that doesn't support robust interfaces (like an non-robust
> > OpenGL context or API without any robustness support like libva) leave the
> > robustness handling entirely to the userspace driver. There is no strong
> > community consensus on what the userspace driver should do in that case,
> > since all reasonable approaches have some clear downsides.
> >
> > With the s/UMD/KMD/ further up and maybe something added to record the
> > non-robustness non-consensus:
> >
> > Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> >
> > Cheers, Daniel
> >
> >
> >
> > > +
> > >  .. _drm_driver_ioctl:
> > >
> > >  IOCTL Support on Device Nodes
> > > --
> > > 2.41.0
> > >
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch
> >
>

Michel Dänzer Aug. 9, 2023, 7:35 a.m. UTC | #38

On 8/8/23 19:03, Marek Olšák wrote:
> It's the same situation as SIGSEGV. A process can catch the signal,
> but if it doesn't, it gets killed. GL and Vulkan APIs give you a way
> to catch the GPU error and prevent the process termination. If you
> don't use the API, you'll get undefined behavior, which means anything
> can happen, including process termination.

Got a spec reference for that?

I know the spec allows process termination in response to e.g. out of bounds buffer access by the application (which corresponds to SIGSEGV). There are other causes for GPU hangs though, e.g. driver bugs. The ARB_robustness spec says:

    If the reset notification behavior is NO_RESET_NOTIFICATION_ARB,
    then the implementation will never deliver notification of reset
    events, and GetGraphicsResetStatusARB will always return
    NO_ERROR[fn1].
       [fn1: In this case it is recommended that implementations should
        not allow loss of context state no matter what events occur.
        However, this is only a recommendation, and cannot be relied
        upon by applications.]

No mention of process termination; that rather sounds to me like the GL implementation should do its best to keep the application running.

Marek Olšák Aug. 9, 2023, 7:15 p.m. UTC | #39

On Wed, Aug 9, 2023 at 3:35 AM Michel Dänzer <michel.daenzer@mailbox.org> wrote:
>
> On 8/8/23 19:03, Marek Olšák wrote:
> > It's the same situation as SIGSEGV. A process can catch the signal,
> > but if it doesn't, it gets killed. GL and Vulkan APIs give you a way
> > to catch the GPU error and prevent the process termination. If you
> > don't use the API, you'll get undefined behavior, which means anything
> > can happen, including process termination.
>
> Got a spec reference for that?
>
> I know the spec allows process termination in response to e.g. out of bounds buffer access by the application (which corresponds to SIGSEGV). There are other causes for GPU hangs though, e.g. driver bugs. The ARB_robustness spec says:
>
>     If the reset notification behavior is NO_RESET_NOTIFICATION_ARB,
>     then the implementation will never deliver notification of reset
>     events, and GetGraphicsResetStatusARB will always return
>     NO_ERROR[fn1].
>        [fn1: In this case it is recommended that implementations should
>         not allow loss of context state no matter what events occur.
>         However, this is only a recommendation, and cannot be relied
>         upon by applications.]
>
> No mention of process termination, that rather sounds to me like the GL implementation should do its best to keep the application running.

It basically says that we can do anything.

A frozen window or flipping between 2 random frames can't be described
as "keeping the application running". That's the worst user
experience. I will not accept it.

A window system can force-enable robustness for its non-robust apps
and control that. That's the best possible user experience and it's
achievable everywhere. Everything else doesn't matter.

Marek

Michel Dänzer Aug. 10, 2023, 7:33 a.m. UTC | #40

On 8/9/23 21:15, Marek Olšák wrote:
> On Wed, Aug 9, 2023 at 3:35 AM Michel Dänzer <michel.daenzer@mailbox.org> wrote:
>> On 8/8/23 19:03, Marek Olšák wrote:
>>> It's the same situation as SIGSEGV. A process can catch the signal,
>>> but if it doesn't, it gets killed. GL and Vulkan APIs give you a way
>>> to catch the GPU error and prevent the process termination. If you
>>> don't use the API, you'll get undefined behavior, which means anything
>>> can happen, including process termination.
>>
>> Got a spec reference for that?
>>
>> I know the spec allows process termination in response to e.g. out of bounds buffer access by the application (which corresponds to SIGSEGV). There are other causes for GPU hangs though, e.g. driver bugs. The ARB_robustness spec says:
>>
>>     If the reset notification behavior is NO_RESET_NOTIFICATION_ARB,
>>     then the implementation will never deliver notification of reset
>>     events, and GetGraphicsResetStatusARB will always return
>>     NO_ERROR[fn1].
>>        [fn1: In this case it is recommended that implementations should
>>         not allow loss of context state no matter what events occur.
>>         However, this is only a recommendation, and cannot be relied
>>         upon by applications.]
>>
>> No mention of process termination, that rather sounds to me like the GL implementation should do its best to keep the application running.
> 
> It basically says that we can do anything.

Not really? If program termination is a possible outcome, the spec otherwise mentions that explicitly, à la "including program termination".


> A frozen window or flipping between 2 random frames can't be described
> as "keeping the application running".

This assumes that an application which uses OpenGL cannot have any other purpose than using OpenGL.


Patch

diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
index 65fb3036a580..3cbffa25ed93 100644
--- a/Documentation/gpu/drm-uapi.rst
+++ b/Documentation/gpu/drm-uapi.rst
@@ -285,6 +285,74 @@  for GPU1 and GPU2 from different vendors, and a third handler for
 mmapped regular files. Threads cause additional pain with signal
 handling as well.
 
+Device reset
+============
+
+The GPU stack is really complex and is prone to errors, from hardware bugs,
+faulty applications and everything in between the many layers. Some errors
+require resetting the device in order to make the device usable again. This
+section describes the expectations for DRM and usermode drivers when a
+device resets and how to propagate the reset status.
+
+Kernel Mode Driver
+------------------
+
+The KMD is responsible for checking whether the device needs a reset and for
+performing it as needed. Usually a hang is detected when a job gets stuck
+executing. The KMD should keep track of resets, because userspace can query the
+reset stats for a specific context at any time. This is needed to propagate to
+the rest of the stack that a reset has happened. Currently, this is implemented
+by each driver separately, with no common DRM interface.
+
+User Mode Driver
+----------------
+
+The UMD should check before submitting new commands to the KMD if the device has
+been reset, and this can be checked more often if the UMD requires it. After
+detecting a reset, the UMD will then report it to the application using
+the appropriate API error code, as explained in the section below about
+robustness.
+
+Robustness
+----------
+
+The only way to try to keep an application working after a reset is if it
+complies with the robustness aspects of the graphical API that it is using.
+
+Graphical APIs provide ways for applications to deal with device resets. However,
+there is no guarantee that the app will use such features correctly, and the
+UMD can implement policies to close the app if it is a repeating offender,
+likely in a broken loop. This is done to ensure that it does not keep blocking
+the user interface from being correctly displayed. This should be done even if
+the app is correct but happens to trigger some bug in the hardware/driver.
+
+OpenGL
+~~~~~~
+
+Apps using OpenGL should use the available robust interfaces, like the
+extension ``GL_ARB_robustness`` (or ``GL_EXT_robustness`` for OpenGL ES). This
+interface tells if a reset has happened, and if so, all the context state is
+considered lost and the app proceeds by creating new contexts. If it is possible
+to determine that robustness is not in use, the UMD will terminate the app when
+a reset is detected, given that the contexts are lost and the app won't be able
+to figure this out and recreate the contexts.
+
+Vulkan
+~~~~~~
+
+Apps using Vulkan should check for ``VK_ERROR_DEVICE_LOST`` on submissions.
+This error code means, among other things, that a device reset has happened and
+the app needs to recreate the contexts to keep going.
+
+Reporting causes of resets
+--------------------------
+
+Apart from propagating the reset through the stack so apps can recover, it's
+really useful for driver developers to learn more about what caused the reset in
+the first place. DRM devices should make use of devcoredump to store relevant
+
+information about the reset, so this information can be added to user bug
+reports.
+
 .. _drm_driver_ioctl:
 
 IOCTL Support on Device Nodes
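
The query-before-submit flow described in the "Kernel Mode Driver" and "User Mode Driver" sections of the patch can be modelled with a toy sketch in C. Everything here is an assumption made for illustration: the `toy_*` names are hypothetical, a single device-wide counter stands in for whatever per-context reset stats a real driver keeps, and none of this is a DRM interface (the patch itself notes that no common one exists).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the reset bookkeeping a KMD would keep. */
struct toy_device {
	uint64_t reset_count;	/* incremented on every device reset */
};

/* Hypothetical stand-in for a UMD-side context. */
struct toy_context {
	const struct toy_device *dev;
	uint64_t seen_resets;	/* reset_count sampled at context creation */
};

void toy_context_init(struct toy_context *ctx, const struct toy_device *dev)
{
	assert(ctx && dev);
	ctx->dev = dev;
	ctx->seen_resets = dev->reset_count;
}

/* "KMD" side: a hang was detected and the device was reset. */
void toy_device_reset(struct toy_device *dev)
{
	dev->reset_count++;
}

/* "UMD" side: has the device been reset since this context was created? */
bool toy_context_lost(const struct toy_context *ctx)
{
	return ctx->dev->reset_count != ctx->seen_resets;
}

/*
 * "UMD" side: check for a reset before submitting new commands.
 * Returns 0 on success, or -1 if the context was lost and the caller
 * must report the reset to the application through its API.
 */
int toy_submit(struct toy_context *ctx)
{
	if (toy_context_lost(ctx))
		return -1;
	return 0;
}
```

In this model a context created before a reset keeps failing submission until the application creates a new one, which mirrors the expectation that a robust app recreates its contexts after being told about the reset.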