[v2,1/1] drm/doc: Document DRM device reset expectations

Message ID	20230227204000.56787-2-andrealmeid@igalia.com (mailing list archive)
State	New, archived

Headers

From: André Almeida <andrealmeid@igalia.com>
To: dri-devel@lists.freedesktop.org, amd-gfx@lists.freedesktop.org,
 linux-kernel@vger.kernel.org
Subject: [PATCH v2 1/1] drm/doc: Document DRM device reset expectations
Date: Mon, 27 Feb 2023 15:40:00 -0500
Message-Id: <20230227204000.56787-2-andrealmeid@igalia.com>
In-Reply-To: <20230227204000.56787-1-andrealmeid@igalia.com>
References: <20230227204000.56787-1-andrealmeid@igalia.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Precedence: list
Cc: pierre-eric.pelloux-prayer@amd.com,
 André Almeida <andrealmeid@igalia.com>, Marek Olšák <maraeo@gmail.com>,
 amaranath.somalapuram@amd.com, Pekka Paalanen <ppaalanen@gmail.com>,
 kernel-dev@igalia.com, alexander.deucher@amd.com,
 contactshashanksharma@gmail.com, christian.koenig@amd.com
Errors-To: dri-devel-bounces@lists.freedesktop.org
Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>

Series

drm: Add doc about GPU reset

Commit Message

André Almeida Feb. 27, 2023, 8:40 p.m. UTC

Create a section that specifies how to deal with DRM device resets for
kernel and userspace drivers.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
 Documentation/gpu/drm-uapi.rst | 51 ++++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)

Comments

Pekka Paalanen Feb. 28, 2023, 10:02 a.m. UTC | #1

On Mon, 27 Feb 2023 15:40:00 -0500
André Almeida <andrealmeid@igalia.com> wrote:

> Create a section that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
> 
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
>  Documentation/gpu/drm-uapi.rst | 51 ++++++++++++++++++++++++++++++++++
>  1 file changed, 51 insertions(+)
> 
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 65fb3036a580..3d6c3ed392ea 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -285,6 +285,57 @@ for GPU1 and GPU2 from different vendors, and a third handler for
>  mmapped regular files. Threads cause additional pain with signal
>  handling as well.
>  
> +Device reset
> +============
> +
> +The GPU stack is really complex and is prone to errors, from hardware bugs,
> +faulty applications and everything in the many layers in between. To recover
> +from this kind of state, sometimes is needed to reset the GPU. Unproper handling
> +of GPU resets can lead to an unstable userspace. This section describes what's
> +the expected behaviour from DRM drivers to do in those situations, from usermode
> +drivers and compositors as well. The end goal is to have a seamless experience
> +as possible, either the stack being able to recover itself or resetting to a new
> +stable state.
> +
> +Robustness
> +----------
> +
> +First of all, application robust APIs, when available, should be used. This
> +allows the application to correctly recover and continue to run after a reset.
> +Apps that doesn't use this should be promptly killed when the kernel driver
> +detects that it's in broken state. Specifically guidelines for some APIs:

Hi,

the "kill" wording is still here. It feels too harsh to me, like I say
in my comments below, but let's see what others think.

Even the device hot-unplug guide above this does not call for killing
anything and is prepared for userspace to keep going indefinitely if
userspace is broken enough.

> +
> +- OpenGL: KMD signals the abortion of submitted commands and the UMD should then
> +  react accordingly and abort the application.

No, not abort. Just return failures and make sure no API call will
block indefinitely.
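
As a rough illustration of "return failures, never block", here is a minimal C sketch of a UMD-side context that latches a lost state. All names (``umd_context``, ``umd_submit``, ``fake_kernel_submit``) are hypothetical and only stand in for whatever a real driver uses. The fake submit function merely simulates a kernel that rejects work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch, not taken from any real API. */
enum umd_result {
	UMD_OK = 0,
	UMD_ERROR_CONTEXT_LOST = -1,
};

/* Hypothetical per-context state kept by the user mode driver. */
struct umd_context {
	bool lost;	/* latched once the kernel rejects a submission */
};

/* Stand-in for a submission ioctl: 0 on success, negative on rejection. */
typedef int (*kernel_submit_fn)(void *job);

/* Fake kernel used only for illustration: a NULL job is rejected. */
int fake_kernel_submit(void *job)
{
	return job ? 0 : -1;
}

/*
 * Submit a job. After the first rejection every later call fails
 * immediately, without calling into the kernel and without waiting.
 */
enum umd_result umd_submit(struct umd_context *ctx, kernel_submit_fn submit,
			   void *job)
{
	assert(ctx != NULL && submit != NULL);

	if (ctx->lost)
		return UMD_ERROR_CONTEXT_LOST;

	if (submit(job) < 0) {
		ctx->lost = true;
		return UMD_ERROR_CONTEXT_LOST;
	}

	return UMD_OK;
}
```

The point of the latch is that the application keeps getting an error it can act on, and no call depends on work that will never complete.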

> +
> +- Vulkan: Assumes that every app is able to deal with ``VK_ERROR_DEVICE_LOST``.
> +  If it doesn't do it right, it's considered a broken application and UMD will
> +  deal with it, aborting it.

Is it even possible to detect if an app does it right?

What if the app does do it right, but not before it attempts to hammer
a few more jobs in?

> +
> +Kernel mode driver
> +------------------
> +
> +The KMD must be able to detect that something is wrong with the application
> +and that a reset is needed to take place to recover the device (e.g. an endless
> +wait). It needs to properly track the context that is broken and mark it as
> +dead, so any other syscalls to that context should be further rejected. The
> +other contexts should be preserved when possible, avoid crashing the rest of
> +userspace. KMD can ban a file descriptor that keeps causing resets, as it's
> +likely in a broken loop.

If userspace is in a broken loop repeatedly causing GPU reset, would it
keep using the same (render node) fd? To me it would be more likely to
close the fd and open a new one, then crash again. Robust or not, the
gfx library API would probably require tearing everything down and
starting from scratch. In fact, only robust apps would likely exhibit
this behaviour, and non-robust just get stuck or quit themselves.

I suppose in e.g. EGL, it is possible to just create a new context
instead of a new EGLDisplay, so both re-using and not using the old fd
are possible.

The process identity would usually remain, I believe, except in cases
like Chromium with its separate rendering processes, but then, would
you really want to ban whole Chromium in that case...

> +

Another thing maybe worth mentioning for the kernel mode driver is that
the driver could also simulate a hot-unplug if the GPU crash is so bad
that everything is at risk of being lost or corrupted.

> +User mode driver
> +----------------
> +
> +During a reset, UMD should be aware that rejected syscalls indicates that the
> +context is broken and for robust apps the recovery should happen for the
> +context. Non-robust apps must be terminated.

I think the termination thing probably needs to be much more nuanced,
and also interact with the repeat-offender policy.

Repeat-offender policy could be implemented in userspace too,
especially if userspace keeps using the same device fd which is likely
hidden by the gfx API.

> +
> +Compositors
> +-----------
> +
> +Compositors should be robust as well to properly deal with its errors.

What is the value of this note? To me as a compositor developer it is
obvious.

Thanks,
pq

> +
> +
>  .. _drm_driver_ioctl:
>  
>  IOCTL Support on Device Nodes

André Almeida Feb. 28, 2023, 3:26 p.m. UTC | #2

Hi Pekka,

Thank you for your feedback,

On 2/28/23 05:02, Pekka Paalanen wrote:
> On Mon, 27 Feb 2023 15:40:00 -0500
> André Almeida <andrealmeid@igalia.com> wrote:
>
>> Create a section that specifies how to deal with DRM device resets for
>> kernel and userspace drivers.
>>
>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>> ---
>>   Documentation/gpu/drm-uapi.rst | 51 ++++++++++++++++++++++++++++++++++
>>   1 file changed, 51 insertions(+)
>>
>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
>> index 65fb3036a580..3d6c3ed392ea 100644
>> --- a/Documentation/gpu/drm-uapi.rst
>> +++ b/Documentation/gpu/drm-uapi.rst
>> @@ -285,6 +285,57 @@ for GPU1 and GPU2 from different vendors, and a third handler for
>>   mmapped regular files. Threads cause additional pain with signal
>>   handling as well.
>>   
>> +Device reset
>> +============
>> +
>> +The GPU stack is really complex and is prone to errors, from hardware bugs,
>> +faulty applications and everything in the many layers in between. To recover
>> +from this kind of state, sometimes is needed to reset the GPU. Unproper handling
>> +of GPU resets can lead to an unstable userspace. This section describes what's
>> +the expected behaviour from DRM drivers to do in those situations, from usermode
>> +drivers and compositors as well. The end goal is to have a seamless experience
>> +as possible, either the stack being able to recover itself or resetting to a new
>> +stable state.
>> +
>> +Robustness
>> +----------
>> +
>> +First of all, application robust APIs, when available, should be used. This
>> +allows the application to correctly recover and continue to run after a reset.
>> +Apps that doesn't use this should be promptly killed when the kernel driver
>> +detects that it's in broken state. Specifically guidelines for some APIs:
> Hi,
>
> the "kill" wording is still here. It feels too harsh to me, like I say
> in my comments below, but let's see what others think.
>
> Even the device hot-unplug guide above this does not call for killing
> anything and is prepared for userspace to keep going indefinitely if
> userspace is broken enough.

If I understood correctly, you think that neither the KMD nor the UMD
should terminate apps that hang the GPU, right? Should those apps run
indefinitely until the user decides to do something about it?

At least on Intel GPUs, if I run an OpenGL infinite loop the app will be
terminated in a few moments, and the rest of userspace is preserved.
There's an app that does just that, if you want to have a look at how it
works: https://gitlab.freedesktop.org/andrealmeid/gpu-timeout

>
>> +
>> +- OpenGL: KMD signals the abortion of submitted commands and the UMD should then
>> +  react accordingly and abort the application.
> No, not abort. Just return failures and make sure no API call will
> block indefinitely.
>
>> +
>> +- Vulkan: Assumes that every app is able to deal with ``VK_ERROR_DEVICE_LOST``.
>> +  If it doesn't do it right, it's considered a broken application and UMD will
>> +  deal with it, aborting it.
> Is it even possible to detect if an app does it right?
>
> What if the app does do it right, but not before it attempts to hammer
> a few more jobs in?

I think what I meant was

+ If it doesn't support VK_ERROR_DEVICE_LOST, it's considered a broken 
app [...]

In the sense that if it doesn't support this, it is impossible for the
app to recover gracefully from a reset, so it's considered broken.

>> +
>> +Kernel mode driver
>> +------------------
>> +
>> +The KMD must be able to detect that something is wrong with the application
>> +and that a reset is needed to take place to recover the device (e.g. an endless
>> +wait). It needs to properly track the context that is broken and mark it as
>> +dead, so any other syscalls to that context should be further rejected. The
>> +other contexts should be preserved when possible, avoid crashing the rest of
>> +userspace. KMD can ban a file descriptor that keeps causing resets, as it's
>> +likely in a broken loop.
> If userspace is in a broken loop repeatedly causing GPU reset, would it
> keep using the same (render node) fd? To me it would be more likely to
> close the fd and open a new one, then crash again. Robust or not, the
> gfx library API would probably require tearing everything down and
> starting from scratch. In fact, only robust apps would likely exhibit
> this behaviour, and non-robust just get stuck or quit themselves.
>
> I suppose in e.g. EGL, it is possible to just create a new context
> instead of a new EGLDisplay, so both re-using and not using the old fd
> are possible.
>
> The process identity would usually remain, I believe, except in cases
> like Chromium with its separate rendering processes, but then, would
> you really want to ban whole Chromium in that case...
>
Right, so userspace is the right place to implement the repeat-offender 
policy, as you noted below.

>> +
> Another thing for the kernel mode driver maybe worth mentioning is that
> the driver could also pretend a hot-unplug if the GPU crash is so bad
> that everything is at risk being lost or corrupted.

Ack, I'll add that

>
>> +User mode driver
>> +----------------
>> +
>> +During a reset, UMD should be aware that rejected syscalls indicates that the
>> +context is broken and for robust apps the recovery should happen for the
>> +context. Non-robust apps must be terminated.
> I think the termination thing probably needs to be much more nuanced,
> and also interact with the repeat-offender policy.
>
> Repeat-offender policy could be implemented in userspace too,
> especially if userspace keeps using the same device fd which is likely
> hidden by the gfx API.
>
>> +
>> +Compositors
>> +-----------
>> +
>> +Compositors should be robust as well to properly deal with its errors.
> What is the worth of this note? To me as a compositor developer it is
> obvious.

As it is, it doesn't say much indeed. I think Christian's suggestion adds
something more meaningful to this part.

>
> Thanks,
> pq
>
>> +
>> +
>>   .. _drm_driver_ioctl:
>>   
>>   IOCTL Support on Device Nodes

Rob Clark Feb. 28, 2023, 5:20 p.m. UTC | #3

On Mon, Feb 27, 2023 at 12:40 PM André Almeida <andrealmeid@igalia.com> wrote:
>
> Create a section that specifies how to deal with DRM device resets for
> kernel and userspace drivers.
>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
>  Documentation/gpu/drm-uapi.rst | 51 ++++++++++++++++++++++++++++++++++
>  1 file changed, 51 insertions(+)
>
> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> index 65fb3036a580..3d6c3ed392ea 100644
> --- a/Documentation/gpu/drm-uapi.rst
> +++ b/Documentation/gpu/drm-uapi.rst
> @@ -285,6 +285,57 @@ for GPU1 and GPU2 from different vendors, and a third handler for
>  mmapped regular files. Threads cause additional pain with signal
>  handling as well.
>
> +Device reset
> +============
> +
> +The GPU stack is really complex and is prone to errors, from hardware bugs,
> +faulty applications and everything in the many layers in between. To recover
> +from this kind of state, sometimes is needed to reset the GPU. Unproper handling
> +of GPU resets can lead to an unstable userspace. This section describes what's
> +the expected behaviour from DRM drivers to do in those situations, from usermode
> +drivers and compositors as well. The end goal is to have a seamless experience
> +as possible, either the stack being able to recover itself or resetting to a new
> +stable state.
> +
> +Robustness
> +----------
> +
> +First of all, application robust APIs, when available, should be used. This
> +allows the application to correctly recover and continue to run after a reset.
> +Apps that doesn't use this should be promptly killed when the kernel driver
> +detects that it's in broken state. Specifically guidelines for some APIs:
> +
> +- OpenGL: KMD signals the abortion of submitted commands and the UMD should then
> +  react accordingly and abort the application.

I disagree.. what would be the point of GL_EXT_robustness
glGetGraphicsResetStatusEXT() if we are going to abort the application
before it has a chance to call this?

Also, this would break the deqp-egl robustness tests because they
would start crashing ;-)
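
To make the point concrete, here is a minimal C sketch of the check a robust app might run once per frame. It is only an illustration: the enum values are stand-ins for the ``GL_EXT_robustness`` reset statuses (a real app would call ``glGetGraphicsResetStatusEXT()`` with the GL headers), and ``fake_driver_query`` is a hypothetical substitute for the driver so the sketch builds without a GL context.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the GL_EXT_robustness reset statuses (assumption of this sketch). */
enum reset_status {
	RESET_NONE,	/* no reset observed */
	RESET_GUILTY,	/* this context caused the reset */
	RESET_INNOCENT,	/* another context caused the reset */
	RESET_UNKNOWN,	/* a reset happened, cause unknown */
};

/* What the application decides to do next. */
enum app_action {
	APP_CONTINUE,
	APP_RECREATE_CONTEXT,
};

/* Stand-in for the reset status query. */
typedef enum reset_status (*reset_query_fn)(void *ctx);

/* Fake driver for illustration: reports whatever status the caller stored. */
enum reset_status fake_driver_query(void *ctx)
{
	return *(enum reset_status *)ctx;
}

/*
 * The app can only reach this check if the driver returns control to it
 * after a reset instead of aborting the process.
 */
enum app_action app_check_reset(reset_query_fn query, void *ctx)
{
	assert(query != NULL);

	switch (query(ctx)) {
	case RESET_NONE:
		return APP_CONTINUE;
	default:
		/* Any reset means the context is gone: tear down and recreate. */
		return APP_RECREATE_CONTEXT;
	}
}
```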

> +
> +- Vulkan: Assumes that every app is able to deal with ``VK_ERROR_DEVICE_LOST``.
> +  If it doesn't do it right, it's considered a broken application and UMD will
> +  deal with it, aborting it.
> +
> +Kernel mode driver
> +------------------
> +
> +The KMD must be able to detect that something is wrong with the application
> +and that a reset is needed to take place to recover the device (e.g. an endless
> +wait). It needs to properly track the context that is broken and mark it as
> +dead, so any other syscalls to that context should be further rejected. The
> +other contexts should be preserved when possible, avoid crashing the rest of
> +userspace. KMD can ban a file descriptor that keeps causing resets, as it's
> +likely in a broken loop.

syscalls to the context?  Like the one querying the reset status?  :-P

In general I don't think the KMD should block syscalls.  _Maybe_ there
could be some threshold at which point we start blocking things, but I
think that would still cause problems with deqp-egl.

What we should perhaps do is encourage drivers to implement
devcoredump support for logging/reporting GPU crashes.  This would
have the benefit that distro error reporting could be standardized.
And hopefully some actionable bug reports come out of it.

And maybe we could standardize UABI for reporting crashes so a
compositor has a chance to realize an app is crashing and take action.
(But again, how does the compositor know that this isn't intentional,
it would be kinda inconvenient if the compositor kept killing my deqp
runs.)  But for all the rest, nak

BR,
-R


> +
> +User mode driver
> +----------------
> +
> +During a reset, UMD should be aware that rejected syscalls indicates that the
> +context is broken and for robust apps the recovery should happen for the
> +context. Non-robust apps must be terminated.
> +
> +Compositors
> +-----------
> +
> +Compositors should be robust as well to properly deal with its errors.
> +
> +
>  .. _drm_driver_ioctl:
>
>  IOCTL Support on Device Nodes
> --
> 2.39.2
>

Pekka Paalanen March 1, 2023, 8:47 a.m. UTC | #4

On Tue, 28 Feb 2023 10:26:04 -0500
André Almeida <andrealmeid@igalia.com> wrote:

> Hi Pekka,
> 
> Thank you for your feedback,
> 
> On 2/28/23 05:02, Pekka Paalanen wrote:
> > On Mon, 27 Feb 2023 15:40:00 -0500
> > André Almeida <andrealmeid@igalia.com> wrote:
> >  
> >> Create a section that specifies how to deal with DRM device resets for
> >> kernel and userspace drivers.
> >>
> >> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> >> ---
> >>   Documentation/gpu/drm-uapi.rst | 51 ++++++++++++++++++++++++++++++++++
> >>   1 file changed, 51 insertions(+)
> >>
> >> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> >> index 65fb3036a580..3d6c3ed392ea 100644
> >> --- a/Documentation/gpu/drm-uapi.rst
> >> +++ b/Documentation/gpu/drm-uapi.rst
> >> @@ -285,6 +285,57 @@ for GPU1 and GPU2 from different vendors, and a third handler for
> >>   mmapped regular files. Threads cause additional pain with signal
> >>   handling as well.
> >>   
> >> +Device reset
> >> +============
> >> +
> >> +The GPU stack is really complex and is prone to errors, from hardware bugs,
> >> +faulty applications and everything in the many layers in between. To recover
> >> +from this kind of state, sometimes is needed to reset the GPU. Unproper handling
> >> +of GPU resets can lead to an unstable userspace. This section describes what's
> >> +the expected behaviour from DRM drivers to do in those situations, from usermode
> >> +drivers and compositors as well. The end goal is to have a seamless experience
> >> +as possible, either the stack being able to recover itself or resetting to a new
> >> +stable state.
> >> +
> >> +Robustness
> >> +----------
> >> +
> >> +First of all, application robust APIs, when available, should be used. This
> >> +allows the application to correctly recover and continue to run after a reset.
> >> +Apps that doesn't use this should be promptly killed when the kernel driver
> >> +detects that it's in broken state. Specifically guidelines for some APIs:  
> > Hi,
> >
> > the "kill" wording is still here. It feels too harsh to me, like I say
> > in my comments below, but let's see what others think.
> >
> > Even the device hot-unplug guide above this does not call for killing
> > anything and is prepared for userspace to keep going indefinitely if
> > userspace is broken enough.  
> 
> If I understood correctly, you don't think that neither KMD or UMD 
> should terminate apps that hangs the GPU, right? Should those apps run 
> indefinitely until the user decides to do something about it?

I suspect it depends on what exactly is happening. I do think a policy
to do something harsh to repeat offenders would be a good idea, because
they might prevent the end user from breaking out of the situation, but
it needs a definition of what a "repeat offender" is, and from that it
should be possible to say what to do with it.
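
Purely as a strawman for such a definition, the C sketch below treats a client as a repeat offender once its last ``MAX_TRACKED`` resets all fall within a time window. The threshold, the window and every name here are assumptions made for illustration, not something any driver is known to implement.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TRACKED 4	/* hypothetical: resets needed to count as an offender */

/* Hypothetical per-client record of recent resets. */
struct offender_tracker {
	uint64_t stamps[MAX_TRACKED];	/* ring of the latest reset times, in seconds */
	unsigned int count;		/* total resets recorded so far */
	uint64_t window;		/* window length, in seconds */
};

void tracker_init(struct offender_tracker *t, uint64_t window_s)
{
	assert(t != NULL);
	t->count = 0;
	t->window = window_s;
}

/*
 * Record a reset caused by the client at time `now` (seconds, assumed
 * non-decreasing). Returns true when the last MAX_TRACKED resets all
 * happened within the window.
 */
bool tracker_record_reset(struct offender_tracker *t, uint64_t now)
{
	uint64_t oldest;

	assert(t != NULL);

	t->stamps[t->count % MAX_TRACKED] = now;
	t->count++;

	if (t->count < MAX_TRACKED)
		return false;

	/* The slot that will be overwritten next holds the oldest kept entry. */
	oldest = t->stamps[t->count % MAX_TRACKED];
	return now - oldest <= t->window;
}
```

What to do once this returns true (ban the context, notify the compositor, something else) is the open policy question; the sketch only shows that the definition can be made precise.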

But yes, I do think that killing anything should not be the first and
immediate reaction. It's not like OOM that hurts everything in the
system, it's just the app itself. It may mean the user may be staring
at a broken screen (app was fullscreen), but then they should be able
to get out of it with Alt+Tab or whatever their window system normally
offers as soon as the GPU reset is done if not immediately.

It is much more likely that arbitrary apps crash a GPU than, say, the
display server (compositor). If a display server causes a reset, there
should be a very high threshold to kill it in any case, because killing
it may mean other applications cannot save their work (there is no
reason why apps would need a display server to do an emergency save,
but not everything implements an emergency save, and maybe that will
cause a logout too).

Telling the difference between applications and display servers is
likely going to be a problem, so I would not even try.

IOW, significant timeouts (several seconds) would also be a usable tool
in the toolbox to return system control back to the end user.

> At least on Intel GPUs, if I run an OpenGL infinite loop the app will be 
> terminated in a few moments, and the rest of userspace is preserved. 
> There's an app that just do that if you want to have a look on how it 
> works: https://gitlab.freedesktop.org/andrealmeid/gpu-timeout

How exactly does the app get terminated? An abort in some library? A
signal from the kernel?

I do recall that one type of DRM fence guarantees that the fence will
signal, and if it doesn't, the kernel will force-signal it to unblock
dependent work and do something to the job that failed to complete. This
kind of fence is routinely used with KMS and with "naive" display servers,
which means that any application that has a long running GPU job (even
if just for 20 ms) will stall all desktop updates on screen, possibly
even the mouse pointer. 20 ms jobs will make the whole desktop feel jerky.
5 second jobs will make the desktop completely unusable.

In that case, to give system control back to the end user, it is not
enough to force-signal the fence, because the app will just submit another
one, and your desktop would update once every few seconds - the user
cannot realistically kill the offender, if they even realise which app
it is. So in that case, terminating at least the GPU context is a good
idea. Terminating the process that owned the GPU context is up for
debate.

This situation will change radically when display servers start
inspecting application fences and postpone using application provided
buffers until the related fence has signalled. Then no app can stall
the whole desktop by simply having long running GPU jobs (assuming
those jobs can be pre-empted). Then none of the force signalling or
terminating will be necessary to survive a broken application.
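
A minimal C sketch of that idea, assuming the application's fence reaches the display server as a pollable file descriptor that becomes readable once the fence has signalled (as a sync_file fd does). The function names and the integer buffer handles are hypothetical.

```c
#include <assert.h>
#include <poll.h>
#include <stdbool.h>

/*
 * Check whether a fence fd has signalled, without ever blocking.
 * A zero timeout makes poll() return immediately.
 */
bool fence_fd_signalled(int fence_fd)
{
	struct pollfd pfd = { .fd = fence_fd, .events = POLLIN };

	assert(fence_fd >= 0);

	return poll(&pfd, 1, 0) == 1 && (pfd.revents & POLLIN) != 0;
}

/*
 * Pick the buffer to present this frame: the application's new buffer
 * only if its fence has signalled, otherwise keep showing the old one.
 */
int pick_buffer(int fence_fd, int pending_buffer, int current_buffer)
{
	return fence_fd_signalled(fence_fd) ? pending_buffer : current_buffer;
}
```

With a check like this, a slow or hung application only delays its own window; the rest of the desktop keeps repainting on schedule.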

> 
> >  
> >> +
> >> +- OpenGL: KMD signals the abortion of submitted commands and the UMD should then
> >> +  react accordingly and abort the application.  
> > No, not abort. Just return failures and make sure no API call will
> > block indefinitely.
> >  
> >> +
> >> +- Vulkan: Assumes that every app is able to deal with ``VK_ERROR_DEVICE_LOST``.
> >> +  If it doesn't do it right, it's considered a broken application and UMD will
> >> +  deal with it, aborting it.  
> > Is it even possible to detect if an app does it right?
> >
> > What if the app does do it right, but not before it attempts to hammer
> > a few more jobs in?  
> 
> I think what I meant was
> 
> + If it doesn't support VK_ERROR_DEVICE_LOST, it's considered a broken 
> app [...]
> 
> In the sense that if it doesn't support this, it is impossible for the 
> app to recovery gracefully from a reset so it's considered broken

What does it mean to support VK_ERROR_DEVICE_LOST?

What if the app does support and react to VK_ERROR_DEVICE_LOST, but not
on the first API call that returns it? What about API calls that cannot
return it? Where do you draw the line?

> >> +
> >> +Kernel mode driver
> >> +------------------
> >> +
> >> +The KMD must be able to detect that something is wrong with the application
> >> +and that a reset is needed to take place to recover the device (e.g. an endless
> >> +wait). It needs to properly track the context that is broken and mark it as
> >> +dead, so any other syscalls to that context should be further rejected. The
> >> +other contexts should be preserved when possible, avoid crashing the rest of
> >> +userspace. KMD can ban a file descriptor that keeps causing resets, as it's
> >> +likely in a broken loop.  
> > If userspace is in a broken loop repeatedly causing GPU reset, would it
> > keep using the same (render node) fd? To me it would be more likely to
> > close the fd and open a new one, then crash again. Robust or not, the
> > gfx library API would probably require tearing everything down and
> > starting from scratch. In fact, only robust apps would likely exhibit
> > this behaviour, and non-robust just get stuck or quit themselves.
> >
> > I suppose in e.g. EGL, it is possible to just create a new context
> > instead of a new EGLDisplay, so both re-using and not using the old fd
> > are possible.
> >
> > The process identity would usually remain, I believe, except in cases
> > like Chromium with its separate rendering processes, but then, would
> > you really want to ban whole Chromium in that case...
> >  
> Right, so userspace is the right place to implement the repeat-offender 
> policy, as you noted below.

I think it probably depends... if userspace could do it, it is likely
the right place.

Thanks,
pq

> >> +  
> > Another thing for the kernel mode driver maybe worth mentioning is that
> > the driver could also pretend a hot-unplug if the GPU crash is so bad
> > that everything is at risk being lost or corrupted.  
> 
> Ack, I'll add that
> 
> >  
> >> +User mode driver
> >> +----------------
> >> +
> >> +During a reset, UMD should be aware that rejected syscalls indicates that the
> >> +context is broken and for robust apps the recovery should happen for the
> >> +context. Non-robust apps must be terminated.  
> > I think the termination thing probably needs to be much more nuanced,
> > and also interact with the repeat-offender policy.
> >
> > Repeat-offender policy could be implemented in userspace too,
> > especially if userspace keeps using the same device fd which is likely
> > hidden by the gfx API.
> >  
> >> +
> >> +Compositors
> >> +-----------
> >> +
> >> +Compositors should be robust as well to properly deal with its errors.  
> > What is the worth of this note? To me as a compositor developer it is
> > obvious.  
> 
> As it is it doesn't says much indeed, I think Christian suggestion adds 
> something more meaningful to this part.
> 
> >
> > Thanks,
> > pq
> >  
> >> +
> >> +
> >>   .. _drm_driver_ioctl:
> >>   
> >>   IOCTL Support on Device Nodes

diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
index 65fb3036a580..3d6c3ed392ea 100644
--- a/Documentation/gpu/drm-uapi.rst
+++ b/Documentation/gpu/drm-uapi.rst
@@ -285,6 +285,57 @@  for GPU1 and GPU2 from different vendors, and a third handler for
 mmapped regular files. Threads cause additional pain with signal
 handling as well.
 
+Device reset
+============
+
+The GPU stack is complex and prone to errors, from hardware bugs to faulty
+applications and everything in the many layers in between. To recover from
+such errors, it is sometimes necessary to reset the GPU. Improper handling
+of GPU resets can lead to an unstable userspace. This section describes the
+behaviour expected from DRM drivers in those situations, and from usermode
+drivers and compositors as well. The end goal is an experience that is as
+seamless as possible, with the stack either recovering by itself or resetting
+to a new stable state.
+
+Robustness
+----------
+
+First of all, application robustness APIs should be used when available. This
+allows the application to correctly recover and continue to run after a reset.
+Apps that don't use them should be promptly killed when the kernel driver
+detects that they are in a broken state. Specific guidelines for some APIs:
+
+- OpenGL: the KMD signals that submitted commands were aborted, and the UMD
+  should then react accordingly and abort the application.
+
+- Vulkan: every app is assumed to be able to deal with ``VK_ERROR_DEVICE_LOST``.
+  An app that doesn't handle it correctly is considered broken, and the UMD will
+  deal with it by aborting it.
+
+Kernel mode driver
+------------------
+
+The KMD must be able to detect that something is wrong with the application
+and that a reset is needed to recover the device (e.g. after an endless
+wait). It needs to properly track the context that is broken and mark it as
+dead, so that any further syscalls to that context are rejected. The other
+contexts should be preserved when possible, to avoid crashing the rest of
+userspace. The KMD can ban a file descriptor that keeps causing resets, as
+it's likely in a broken loop.
+
+User mode driver
+----------------
+
+During a reset, the UMD should be aware that rejected syscalls indicate that
+the context is broken, and for robust apps the recovery should happen for that
+context. Non-robust apps must be terminated.
+
+Compositors
+-----------
+
+Compositors should be robust as well, to properly deal with their errors.
+
+
 .. _drm_driver_ioctl:
 
 IOCTL Support on Device Nodes
