
[v2,0/3] drm: Add panic handling

Message ID 20190311174218.51899-1-noralf@tronnes.org (mailing list archive)

Message

Noralf Trønnes March 11, 2019, 5:42 p.m. UTC
This patchset adds a way for DRM drivers to display kernel messages
during panic().

I had a Windows blue screen last year and remembered this patchset, so I
did a new version that I never got around to sending out. Now that panic
handling has come up on the ML, I'm sending it out as part of the discussion.

TODO:
Determine which part of the framebuffer is actually visible. I
don't know enough about KMS to do this. See drm_panic_kmsg_render_screen().

Testing:
I have tested[1] using the formats and resolutions supported by vc4 and
modetest. I have even recorded it[2]. RG24 and BG24 gave an unexpected
result; I'm not sure what to make of it.

Noralf.

[1] https://gist.github.com/notro/b9fdff4c6a68a307091e970598f27ed0
[2] https://youtu.be/lZ80vL4dgpE
Version 1 (Aug 2016): https://patchwork.freedesktop.org/series/10867/

Changes since version 1:
- Use kernel message dumper
- Add multicolumn support
- Support some YUV formats
- cma: vmap PRIME buffers on import to support that case

Noralf Trønnes (3):
  drm: Add support for panic message output
  drm/cma-helper: Add support for panic screen
  drm/vc4: Support for panic screen

 drivers/gpu/drm/Kconfig              |   3 +
 drivers/gpu/drm/Makefile             |   2 +-
 drivers/gpu/drm/drm_drv.c            |   3 +
 drivers/gpu/drm/drm_fb_cma_helper.c  |  43 ++++
 drivers/gpu/drm/drm_framebuffer.c    | 117 +++++++++
 drivers/gpu/drm/drm_gem_cma_helper.c |   3 +
 drivers/gpu/drm/drm_internal.h       |   4 +
 drivers/gpu/drm/drm_panic.c          | 363 +++++++++++++++++++++++++++
 drivers/gpu/drm/vc4/vc4_kms.c        |   3 +-
 include/drm/drm_fb_cma_helper.h      |   4 +-
 include/drm/drm_framebuffer.h        |  41 +++
 11 files changed, 583 insertions(+), 3 deletions(-)
 create mode 100644 drivers/gpu/drm/drm_panic.c

Comments

Daniel Vetter March 11, 2019, 6:53 p.m. UTC | #1
On Mon, Mar 11, 2019 at 06:42:15PM +0100, Noralf Trønnes wrote:
> This patchset adds a way for DRM drivers to display kernel messages
> during panic().
> 
> I had a Windows blue screen last year and remembered this patchset, so I
> did a new version that I never got around to sending out. Now that panic
> handling has come up on the ML, I'm sending it out as part of the discussion.
> 
> TODO:
> Determine which part of the framebuffer is actually visible. I
> don't know enough about KMS to do this. See drm_panic_kmsg_render_screen().

drm_plane_state->src should have this, at least for all drivers using the
clip helpers, which I think are rolled out fairly well.

If that happens to be all 0, then drm_plane_state->src_[xyhw] is a decent
fallback. If the plane is occluded or scaled then I guess that's just bad
luck.
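
Something like this might work as a starting point (untested sketch, the
helper name is made up; assumes an atomic driver where plane->state is
valid and that ->src is in 16.16 fixed point):

#include <drm/drm_plane.h>
#include <drm/drm_rect.h>

/*
 * Untested sketch: work out which part of the framebuffer is actually
 * scanned out, based on the primary plane's atomic state.
 */
static void drm_panic_visible_rect(const struct drm_plane_state *state,
                                   struct drm_rect *clip)
{
        if (drm_rect_width(&state->src) && drm_rect_height(&state->src)) {
                *clip = state->src;
        } else {
                /* Fall back to the src_[xywh] members */
                clip->x1 = state->src_x;
                clip->y1 = state->src_y;
                clip->x2 = state->src_x + state->src_w;
                clip->y2 = state->src_y + state->src_h;
        }

        /* Convert from 16.16 fixed point to whole pixels */
        clip->x1 >>= 16;
        clip->y1 >>= 16;
        clip->x2 >>= 16;
        clip->y2 >>= 16;
}
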
-Daniel
> 
> Testing:
> I have tested[1] using the formats and resolutions supported by vc4 and
> modetest. I have even recorded it[2]. RG24 and BG24 gave an unexpected
> result; I'm not sure what to make of it.
> 
> Noralf.
> 
> [1] https://gist.github.com/notro/b9fdff4c6a68a307091e970598f27ed0
> [2] https://youtu.be/lZ80vL4dgpE
> Version 1 (Aug 2016): https://patchwork.freedesktop.org/series/10867/
> 
> Changes since version 1:
> - Use kernel message dumper
> - Add multicolumn support
> - Support some YUV formats
> - cma: vmap PRIME buffers on import to support that case
> 
> Noralf Trønnes (3):
>   drm: Add support for panic message output
>   drm/cma-helper: Add support for panic screen
>   drm/vc4: Support for panic screen
> 
>  drivers/gpu/drm/Kconfig              |   3 +
>  drivers/gpu/drm/Makefile             |   2 +-
>  drivers/gpu/drm/drm_drv.c            |   3 +
>  drivers/gpu/drm/drm_fb_cma_helper.c  |  43 ++++
>  drivers/gpu/drm/drm_framebuffer.c    | 117 +++++++++
>  drivers/gpu/drm/drm_gem_cma_helper.c |   3 +
>  drivers/gpu/drm/drm_internal.h       |   4 +
>  drivers/gpu/drm/drm_panic.c          | 363 +++++++++++++++++++++++++++
>  drivers/gpu/drm/vc4/vc4_kms.c        |   3 +-
>  include/drm/drm_fb_cma_helper.h      |   4 +-
>  include/drm/drm_framebuffer.h        |  41 +++
>  11 files changed, 583 insertions(+), 3 deletions(-)
>  create mode 100644 drivers/gpu/drm/drm_panic.c
> 
> -- 
> 2.20.1
>
Ahmed S. Darwish March 17, 2019, 11:06 p.m. UTC | #2
=> Now that the dust has settled, here's a summary of this huge
   50-email thread (thanks Daniel, Noralf, John, everyone!).

=> Parts of this document are a direct rewording of Daniel's replies,
   so I took the liberty of adding a Co-developed-by tag here..

=> This is only a summary, and _not_ an official patch submission.
   It's now Show-me-the-code time ;-)

Subject: [PATCH] Documentation: gpu: Add initial DRM panic design

Co-developed-by: Daniel Vetter <daniel@ffwll.ch>
Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
---
 Documentation/gpu/drm-panic-design.rst | 124 +++++++++++++++++++++++++
 1 file changed, 124 insertions(+)
 create mode 100644 Documentation/gpu/drm-panic-design.rst

diff --git a/Documentation/gpu/drm-panic-design.rst b/Documentation/gpu/drm-panic-design.rst
new file mode 100644
index 000000000000..ba306193f347
--- /dev/null
+++ b/Documentation/gpu/drm-panic-design.rst
@@ -0,0 +1,124 @@
+
+========================
+A DRM-based panic viewer
+========================
+
+The Linux kernel currently contains the necessary plumbing for viewing
+a kernel panic log using DRM-based kmsg dumpers, even if the system is
+running a graphical session (e.g. Wayland).
+
+.. _drm_panic_design:
+
+Implementation design
+=====================
+
+Code paths running in a panic context face several constraints:
+
+1. Code must be fully synchronous: interrupts are disabled
+2. Dynamic memory allocations are not allowed
+3. Cannot acquire any mutexes: atomic context, due to 1.
+4. Cannot acquire any spin locks: potential spinning-forever risk
+
+For the *DRM* panic code, the extra conditions below apply:
+
+5. Code must only trylock relevant DRM subsystem locks
+6. If any trylock operation fails, the code *must not* proceed
+7. All rendering is done on the GPU's "current display buffer" used
+   at the time of panic(): no HW programming is done at all.
+8. The code must be non-intrusive, and *must not* affect any other
+   panic handling mechanism (netconsole, ramoops, kexec, etc.)
+
+Rationale follows.
+
+
+Spin locks
+----------
+
+Acquiring a spin lock in a panic() context is potentially lethal:
+the lock might already have been acquired, *permanently*, by another
+core that is now fully shut down through an IPI from the panic()-ing
+core.
+
+Moreover, at least on x86, the first target architecture for this
+work, the panic()-ing core waits by default for *a full second* until
+all other cores finish their non-preemptible regions and terminate.
+If that does not work out, it even tries more aggressively with NMIs.
+
+So if one of the other non-panic()-ing cores was holding a DRM-related lock
+through spin_lock_irqsave() for more than one second, then it's a
+bug in the DRM layer code. Thus, the DRM panic viewer cannot do
+anything and should just exit.  [A]
+
+What if the non-panic()-ing core was holding a DRM lock through a
+bare spin_lock()? Interrupts are enabled there, so the IPI will be
+handled, and thus that core will effectively die while the lock is
+*forever held*.  [B]
+
+
+Trylocking
+----------
+
+The DRM panic code always *tries* to acquire the *minimum relevant
+set* of DRM-related locks, through the basic :c:func:`spin_trylock()`
+mechanism.
+
+From case [A] and case [B] above, if the trylock operation fails,
+there's no point in retrying it multiple times: the relevant locks
+are in a broken and unrecoverable state anyway.
+
+Moreover, the panic code also cannot just ignore the DRM locks and
+force its way through: a broken non-preemptible DRM region implies
+either invalid SW state (e.g. broken linked list pointers), or a GPU
+in an unknown HW state.
+
+A GPU in an unknown HW state is quite dangerous: it has access to the
+full system memory, and if poked incorrectly, there's a really good
+chance it can kill the entire machine.
+
+
+GPU hardware access
+-------------------
+
+In the current state, a full GPU reset, modesetting, or even disabling
+GPU planes is not doable in a panic() context: it implies going
+through a potentially huge set of DRM call-chains that cannot be
+sanely verified against the :ref:`drm_panic_design` requirements
+(e.g. memory allocations, spinlocks, etc.).
+
+The current approach is simple: run the minimal amount of code
+necessary to draw pixels onto the current scanout buffers. Instead
+of disabling GPU planes, the biggest visible rectangle is just picked.
+
+*Usually* there should be a main window that is big enough to show the
+oops.
+
+
+CI testing
+----------
+
+One of the things that killed earlier Linux DRM panic handling efforts,
+besides getting into deep DRM call-chains that cannot be verified, was
+that they couldn't be tested except with real oopses.
+
+The first set of bug reports were whack-a-mole bugs where the
+oops displayed was caused by the DRM panic handler itself instead of
+the real oops that caused the panic.
+
+Thus, the :ref:`drm_panic_design` requirements were created. Moreover:
+
+  - Special hooks are added at the spin_lock() level to complain
+    loudly if a spin lock operation is attempted under the DRM panic
+    context. This can easily be noticed/reported by CI testing.
+
+  - *Non-destructive* testing of the DRM panic code, emulating a
+    real panic path context as much as possible (e.g. by disabling
+    irqs and enabling the spin lock hooks mentioned earlier), is
+    added. This is necessary for heavily testing the DRM panic
+    code through `CI machines <https://lwn.net/Articles/735468/>`_.
+
+
+Supported drivers
+=================
+
+* Intel i915-compatible cards
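
To make the "Trylocking" and "GPU hardware access" rules above a bit
more concrete, here's a rough, untested sketch (not part of the patch)
of what a kmsg-dumper entry point honouring them could look like; all
drm_panic_* names are made up and only generic kernel primitives are
used:

#include <linux/kmsg_dump.h>
#include <linux/spinlock.h>

/* Made-up bookkeeping; the real layout is up to the implementation. */
struct drm_panic_ctx {
        spinlock_t lock;        /* protects the scanout/clip bookkeeping */
        /* ... scanout buffer address, pixel format, visible rectangle ... */
};

static struct drm_panic_ctx panic_ctx;

static void drm_panic_kmsg_dump(struct kmsg_dumper *dumper,
                                enum kmsg_dump_reason reason)
{
        if (reason != KMSG_DUMP_PANIC)
                return;

        /*
         * Constraints 5/6: trylock only, and bail out on failure. A lock
         * that is still held at this point means the DRM state cannot be
         * trusted, so drawing anything would be a gamble.
         */
        if (!spin_trylock(&panic_ctx.lock))
                return;

        /*
         * Constraint 7: draw the kernel log into the buffer that is
         * already being scanned out; no modeset, no plane disable, no
         * other HW programming. (Hypothetical renderer, not shown here.)
         */
        /* drm_panic_render_kmsg(&panic_ctx, dumper); */

        spin_unlock(&panic_ctx.lock);
}

static struct kmsg_dumper drm_panic_dumper = {
        .dump = drm_panic_kmsg_dump,
};

/* At driver load time: kmsg_dump_register(&drm_panic_dumper); */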


--
darwi
http://darwish.chasingpointers.com
Daniel Vetter March 25, 2019, 8:42 a.m. UTC | #3
On Mon, Mar 18, 2019 at 12:06:13AM +0100, Ahmed S. Darwish wrote:
> 
> => Now that the dust has settled, here's a summary of this huge
>    50-email thread (thanks Daniel, Noralf, John, everyone!).
> 
> => Parts of this document are a direct rewording of Daniel's replies,
>    so I took the liberty of adding a Co-developed-by tag here..
> 
> => This is only a summary, and _not_ an official patch submission.
>    It's now Show-me-the-code time ;-)
> 
> Subject: [PATCH] Documentation: gpu: Add initial DRM panic design
> 
> Co-developed-by: Daniel Vetter <daniel@ffwll.ch>
> Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
> ---
>  Documentation/gpu/drm-panic-design.rst | 124 +++++++++++++++++++++++++
>  1 file changed, 124 insertions(+)
>  create mode 100644 Documentation/gpu/drm-panic-design.rst
> 
> diff --git a/Documentation/gpu/drm-panic-design.rst b/Documentation/gpu/drm-panic-design.rst
> new file mode 100644
> index 000000000000..ba306193f347
> --- /dev/null
> +++ b/Documentation/gpu/drm-panic-design.rst
> @@ -0,0 +1,124 @@
> +
> +========================
> +A DRM-based panic viewer
> +========================
> +
> +The Linux kernel currently contains the necessary plumbing for viewing
> +a kernel panic log using DRM-based kmsg dumpers, even if the system is
> +running a graphical session (e.g. Wayland).
> +
> +.. _drm_panic_design:
> +
> +Implementation design
> +=====================
> +
> +Code paths running in a panic context face several constraints:
> +
> +1. Code must be fully synchronous: interrupts are disabled
> +2. Dynamic memory allocations are not allowed
> +3. Cannot acquire any mutexes: atomic context, due to 1.
> +4. Cannot acquire any spin locks: potential spinning-forever risk

Maybe rephrase as:

3. No sleeping (there are other ways to sleep than just memory allocations
and acquiring a mutex)
4. No unconditional locking at all (there's more than spinlocks/mutexes).
E.g. one of the most important locks is drm_modeset_lock, which is a
ww_mutex.

> +
> +For the *DRM* panic code, the extra conditions below apply:
> +
> +5. Code must only trylock relevant DRM subsystem locks
> +6. If any trylock operation fails, the code *must not* proceed
> +7. All rendering is done on the GPU's "current display buffer" used
> +   at the time of panic(): no HW programming is done at all.

Maybe let's clarify this a bit, since from the discussions it sounded like
at least amdgpu needs to touch a few bits (disable tiling, the indirect
vram write registers to access vram outside of the bar): "Only the least amount
of HW programming (preferably none) is done, exceptions would be
disabling tiled/compressed scanout."

> +8. The code must be non-intrusive, and *must not* affect any other
> +   panic handling mechanism (netconsole, ramoops, kexec, etc.)

Hm, I'm not entirely clear on what you mean here. Maybe some more examples
of what would be a bad idea?

> +
> +Rationale follows.
> +
> +
> +Spin locks
> +----------
> +
> +Acquiring a spin lock in a panic() context is potentially lethal:
> +the lock might already have been acquired, *permanently*, by another
> +core that is now fully shut down through an IPI from the panic()-ing
> +core.
> +
> +Moreover, at least on x86, the first target architecture for this
> +work, the panic()-ing core waits by default for *a full second* until
> +all other cores finish their non-preemptible regions and terminate.
> +If that does not work out, it even tries more aggressively with NMIs.
> +
> +So if one of the other non-panic()-ing cores was holding a DRM-related lock
> +through spin_lock_irqsave() for more than one second, then it's a
> +bug in the DRM layer code. Thus, the DRM panic viewer cannot do
> +anything and should just exit.  [A]
> +
> +What if the non-panic()-ing core was holding a DRM lock through a
> +bare spin_lock()? Interrupts are enabled there, so the IPI will be
> +handled, and thus that core will effectively die while the lock is
> +*forever held*.  [B]
> +
> +
> +Trylocking
> +----------
> +
> +The DRM panic code always *tries* to acquire the *minimum relevant
> +set* of DRM-related locks, through the basic :c:func:`spin_trylock()`
> +mechanism.
> +
> +From case [A] and case [B] above, if the trylock operation fails,
> +there's no point in retrying it multiple times: the relevant locks
> +are in a broken and unrecoverable state anyway.
> +
> +Moreover, the panic code also cannot just ignore the DRM locks and
> +force its way through: a broken non-preemptible DRM region implies
> +either invalid SW state (e.g. broken linked list pointers), or a GPU
> +in an unknown HW state.
> +
> +A GPU in an unknown HW state is quite dangerous: it has access to the
> +full system memory, and if poked incorrectly, there's a really good
> +chance it can kill the entire machine.
> +
> +
> +GPU hardware access
> +-------------------
> +
> +In the current state, a full GPU reset, modesetting, or even disabling
> +GPU planes is not doable in a panic() context: it implies going
> +through a potentially huge set of DRM call-chains that cannot be
> +sanely verified against the :ref:`drm_panic_design` requirements
> +(e.g. memory allocations, spinlocks, etc.).
> +
> +The current approach is simple: run the minimal amount of code
> +necessary to draw pixels onto the current scanout buffers. Instead
> +of disabling GPU planes, the biggest visible rectangle is just picked.
> +
> +*Usually* there should be a main window that is big enough to show the
> +oops.
> +
> +
> +CI testing
> +----------
> +
> +One of the things that killed earlier Linux DRM panic handling efforts,
> +besides getting into deep DRM call-chains that cannot be verified, was
> +that they couldn't be tested except with real oopses.
> +
> +The first set of bug reports were whack-a-mole bugs where the
> +oops displayed was caused by the DRM panic handler itself instead of
> +the real oops that caused the panic.
> +
> +Thus, the :ref:`drm_panic_design` requirements were created. Moreover:
> +
> +  - Special hooks are added at the spin_lock() level to complain
> +    loudly if a spin lock operation is attempted under the DRM panic
> +    context. This can easily be noticed/reported by CI testing.
> +
> +  - *Non-destructive* testing of the DRM panic code, emulating a
> +    real panic path context as much as possible (e.g. by disabling
> +    irqs and enabling the spin lock hooks mentioned earlier), is
> +    added. This is necessary for heavily testing the DRM panic
> +    code through `CI machines <https://lwn.net/Articles/735468/>`_.
> +
> +
> +Supported drivers
> +=================
> +
> +* Intel i915-compatible cards

Excellent summary, thanks for typing this up.
-Daniel

> 
> 
> --
> darwi
> http://darwish.chasingpointers.com