Message ID | 20190311174218.51899-1-noralf@tronnes.org (mailing list archive) |
---|---|
Headers | show |
Series | drm: Add panic handling | expand |
On Mon, Mar 11, 2019 at 06:42:15PM +0100, Noralf Trønnes wrote: > This patchset adds a way for DRM drivers to display kernel messages > during panic(). > > I had a Windows blue screen last year and remembered this patchset, so I > did a new version that I never got around to put out. Now that panic > handling came up on the ML, I'm sending it out as part of the discussion. > > TODO: > Determine which part of the framebuffer that is actually visible. I > don't know enough about KMS to do this. See drm_panic_kmsg_render_screen(). drm_plane_state->src should have this, at least for all drivers using the clip helpers. Which I think is rolled out fairly well. If that happens to be all 0, then drm_plane_state->src_[xyhw] is a decent fallback. If the plane is occluded or scaled then I guess that's just bad luck. -Daniel > > Testing: > I have tested[1] using the formats and resolutions support by vc4 and > modetest. I have even recorded it[2]. RG24 and BG24 gave an unexpected > result, not sure what to make of it. > > Noralf. > > [1] https://gist.github.com/notro/b9fdff4c6a68a307091e970598f27ed0 > [2] https://youtu.be/lZ80vL4dgpE > Version 1 (Aug 2016): https://patchwork.freedesktop.org/series/10867/ > > Changes since version 1: > - Use kernel message dumper > - Add multicolumn support > - Support some YUV formats > - cma: vmap PRIME buffers on import to support that case > > Noralf Trønnes (3): > drm: Add support for panic message output > drm/cma-helper: Add support for panic screen > drm/vc4: Support for panic screen > > drivers/gpu/drm/Kconfig | 3 + > drivers/gpu/drm/Makefile | 2 +- > drivers/gpu/drm/drm_drv.c | 3 + > drivers/gpu/drm/drm_fb_cma_helper.c | 43 ++++ > drivers/gpu/drm/drm_framebuffer.c | 117 +++++++++ > drivers/gpu/drm/drm_gem_cma_helper.c | 3 + > drivers/gpu/drm/drm_internal.h | 4 + > drivers/gpu/drm/drm_panic.c | 363 +++++++++++++++++++++++++++ > drivers/gpu/drm/vc4/vc4_kms.c | 3 +- > include/drm/drm_fb_cma_helper.h | 4 +- > include/drm/drm_framebuffer.h | 41 +++ > 11 files changed, 583 insertions(+), 3 deletions(-) > create mode 100644 drivers/gpu/drm/drm_panic.c > > -- > 2.20.1 >
=> Now that the dust has settled, here's a summary of this huge
50-email thread (thanks Daniel, Noralf, John, everyone!).
=> Parts of this document are a direct rewording of Daniel's replies,
so I took the liberty of adding a Co-developed-by tag here..
=> This is only a summary, and _not_ an official patch submission.
It's now Show-me-the-code time ;-)
Subject: [PATCH] Documentation: gpu: Add initial DRM panic design
Co-developed-by: Daniel Vetter <daniel@ffwll.ch>
Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com>
---
Documentation/gpu/drm-panic-design.rst | 124 +++++++++++++++++++++++++
1 file changed, 124 insertions(+)
create mode 100644 Documentation/gpu/drm-panic-design.rst
diff --git a/Documentation/gpu/drm-panic-design.rst b/Documentation/gpu/drm-panic-design.rst
new file mode 100644
index 000000000000..ba306193f347
--- /dev/null
+++ b/Documentation/gpu/drm-panic-design.rst
@@ -0,0 +1,124 @@
+
+========================
+A DRM-based panic viewer
+========================
+
+The Linux Kernel currently contains the necessary plumbing for viewing
+a kernel panic log using DRM-based kmsg dumpers, even if the system is
+currently running a graphical session (e.g. wayland).
+
+.. _drm_panic_design:
+
+Implementation design
+=====================
+
+Code pathes running in a panic context face several constraints:
+
+1. Code must be fully synchronous: interrupts are disabled
+2. Dynamic memory allocations are not allowed
+3. Cannot acquire any mutexes: atomic context, due to 1.
+4. Cannot acquire any spin locks: potential spinning-forever risk
+
+For the *DRM* panic code, the extra conditions below apply:
+
+5. Code must only trylock relevant DRM subsystem locks
+6. If any trylock operation fails, the code *must not* go through
+7. All rendering is done on the GPU's "current display buffer" used
+ at the time of panic(): no HW programming is done at all.
+8. The code must be non-intrusive, and *must not* affect any other
+ panic handling mechanism (netconsole, ramoops, kexec, etc.)
+
+Rationale follows.
+
+
+Spin locks
+----------
+
+Acquiring a spin lock in a panic() context is potentially lethal:
+the lock might've been already acquired, _permanently_, by another
+core that is now fully shut down through an IPI from the panic()-ing
+core.
+
+Moreover, at least on x86, the first target architecture for this
+work, the panic()-ing core wait by default for *a full second* until
+all other cores finish their non-preemptible regions and terminate.
+If that did not work out, it even tries more aggressively with NMIs.
+
+So if the other non panic()-ing cores was holding a DRM-related lock
+through spin_lock_irqsave() for more than one second, then it's a
+bug in the DRM layer code. Thus, the DRM panic viewer cannot do
+anything and should just exit. [A]
+
+What if the non panic()-ing core was holding a DRM lock through
+barebone spin_lock()? Interrupts are enabled there, so the IPI will be
+handled, and thus that core will effectively die while the lock is
+*forever held*. [B]
+
+
+Trylocking
+----------
+
+The DRM panic code always *tries* to acquire the *minimum relevant
+set* of DRM related locks, through the basic :c:func:`spin_trylock()`
+mechanism.
+
+From case [A] and case [B] above, if the trylock operation fails,
+there's no point in retrying it multiple times: the relevant locks
+are in a broken and unrecoverable state anyway.
+
+Moreover, The panic code cannot also just ignore the DRM locks and
+force its way through: a broken non-preemptible DRM region implies
+either invalid SW state (e.g. broken linked list pointers), or a GPU
+in an unknown HW state.
+
+A GPU in an unknown HW state is quite dangerous: it has access to the
+full system memory, and if poked incorrectly, there's a really good
+chance it can kill the entire machine.
+
+
+GPU hardware access
+-------------------
+
+In the current state, a full GPU reset, modesetting, or even disabling
+GPU planes, is not doable under a panic() context: it implies going
+through a potentially huge set of DRM call-chains that cannot be
+sanely verified against the :ref:`drm_panic_design` requirements
+(e.g. memory allocations, spinlocks, etc.).
+
+The current approach is simple: run the minimal amount of code
+necessary to draw pixels onto the current scanout buffers. Instead
+of disabling GPU planes, the biggest visible rectangle is just picked.
+
+*Usually* there should be a main window that is big enough to show the
+oops.
+
+
+CI testing
+----------
+
+One of the things that killed earlier linux DRM panic handling efforts,
+beside getting into deep DRM call-chains that cannot be verified, was
+that it couldn't be tested except with real oopses.
+
+The first set of bug reports was whack-a-molde kind of bugs where the
+oops displayed was caused by the DRM panic handler itself instead of
+the real oops causing the panic.
+
+Thus, the :ref:`drm_panic_design` requirements was created. Moreover:
+
+ - Special hooks are added at the spin_lock() level to complain
+ loudly if a spin lock operation was tried under the DRM panic
+ context. This could be easily noticed/reported by CI testing.
+
+ - *Non-destructive* testing of the DRM panic code, emulating a
+ real panic path context as much as possible (e.g. by disabling
+ irqs and enabling the spin lock hooks earlier mentioned), is
+ created. This is necessary for heaviling testing the DRM panic
+ code through `CI machines <https://lwn.net/Articles/735468/>`_.
+
+
+Supported drivers
+=================
+
+* Intel i915-compatible cards
--
darwi
http://darwish.chasingpointers.com
On Mon, Mar 18, 2019 at 12:06:13AM +0100, Ahmed S. Darwish wrote: > > => Now that the dust has settled, here's a summary of this huge > 50-email thread (thanks Daniel, Noralf, John, everyone!). > > => Parts of this document are a direct rewording of Daniel's replies, > so I took the liberty of adding a Co-developed-by tag here.. > > => This is only a summary, and _not_ an official patch submission. > It's now Show-me-the-code time ;-) > > Subject: [PATCH] Documentation: gpu: Add initial DRM panic design > > Co-developed-by: Daniel Vetter <daniel@ffwll.ch> > Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com> > --- > Documentation/gpu/drm-panic-design.rst | 124 +++++++++++++++++++++++++ > 1 file changed, 124 insertions(+) > create mode 100644 Documentation/gpu/drm-panic-design.rst > > diff --git a/Documentation/gpu/drm-panic-design.rst b/Documentation/gpu/drm-panic-design.rst > new file mode 100644 > index 000000000000..ba306193f347 > --- /dev/null > +++ b/Documentation/gpu/drm-panic-design.rst > @@ -0,0 +1,124 @@ > + > +======================== > +A DRM-based panic viewer > +======================== > + > +The Linux Kernel currently contains the necessary plumbing for viewing > +a kernel panic log using DRM-based kmsg dumpers, even if the system is > +currently running a graphical session (e.g. wayland). > + > +.. _drm_panic_design: > + > +Implementation design > +===================== > + > +Code pathes running in a panic context face several constraints: > + > +1. Code must be fully synchronous: interrupts are disabled > +2. Dynamic memory allocations are not allowed > +3. Cannot acquire any mutexes: atomic context, due to 1. > +4. Cannot acquire any spin locks: potential spinning-forever risk Maybe rephrase as: 3. No sleeping (there's other ways to sleep than just memory allocations and acquiring a mutex) 4. No unconditional locking at all (there's more than spinlocks/mutexes). E.g. one of the most important locks is drm_modest_lock, which is a ww_mutex. > + > +For the *DRM* panic code, the extra conditions below apply: > + > +5. Code must only trylock relevant DRM subsystem locks > +6. If any trylock operation fails, the code *must not* go through > +7. All rendering is done on the GPU's "current display buffer" used > + at the time of panic(): no HW programming is done at all. Maybe let's clarify this a bit, since from the discussions it sounded like at least amdgpu needs to touch a few bits (disable tiling, the indirect vram write registers to access vram outside of the bar): "Only the least amount of HW programming (preferrably none) is done, exceptions would be disabling tiled/compressed scanout." > +8. The code must be non-intrusive, and *must not* affect any other > + panic handling mechanism (netconsole, ramoops, kexec, etc.) Hm, I'm not entirely clear on what you mean here. Maybe some more examples of what would be a bad idea? > + > +Rationale follows. > + > + > +Spin locks > +---------- > + > +Acquiring a spin lock in a panic() context is potentially lethal: > +the lock might've been already acquired, _permanently_, by another > +core that is now fully shut down through an IPI from the panic()-ing > +core. > + > +Moreover, at least on x86, the first target architecture for this > +work, the panic()-ing core wait by default for *a full second* until > +all other cores finish their non-preemptible regions and terminate. > +If that did not work out, it even tries more aggressively with NMIs. > + > +So if the other non panic()-ing cores was holding a DRM-related lock > +through spin_lock_irqsave() for more than one second, then it's a > +bug in the DRM layer code. Thus, the DRM panic viewer cannot do > +anything and should just exit. [A] > + > +What if the non panic()-ing core was holding a DRM lock through > +barebone spin_lock()? Interrupts are enabled there, so the IPI will be > +handled, and thus that core will effectively die while the lock is > +*forever held*. [B] > + > + > +Trylocking > +---------- > + > +The DRM panic code always *tries* to acquire the *minimum relevant > +set* of DRM related locks, through the basic :c:func:`spin_trylock()` > +mechanism. > + > +From case [A] and case [B] above, if the trylock operation fails, > +there's no point in retrying it multiple times: the relevant locks > +are in a broken and unrecoverable state anyway. > + > +Moreover, The panic code cannot also just ignore the DRM locks and > +force its way through: a broken non-preemptible DRM region implies > +either invalid SW state (e.g. broken linked list pointers), or a GPU > +in an unknown HW state. > + > +A GPU in an unknown HW state is quite dangerous: it has access to the > +full system memory, and if poked incorrectly, there's a really good > +chance it can kill the entire machine. > + > + > +GPU hardware access > +------------------- > + > +In the current state, a full GPU reset, modesetting, or even disabling > +GPU planes, is not doable under a panic() context: it implies going > +through a potentially huge set of DRM call-chains that cannot be > +sanely verified against the :ref:`drm_panic_design` requirements > +(e.g. memory allocations, spinlocks, etc.). > + > +The current approach is simple: run the minimal amount of code > +necessary to draw pixels onto the current scanout buffers. Instead > +of disabling GPU planes, the biggest visible rectangle is just picked. > + > +*Usually* there should be a main window that is big enough to show the > +oops. > + > + > +CI testing > +---------- > + > +One of the things that killed earlier linux DRM panic handling efforts, > +beside getting into deep DRM call-chains that cannot be verified, was > +that it couldn't be tested except with real oopses. > + > +The first set of bug reports was whack-a-molde kind of bugs where the > +oops displayed was caused by the DRM panic handler itself instead of > +the real oops causing the panic. > + > +Thus, the :ref:`drm_panic_design` requirements was created. Moreover: > + > + - Special hooks are added at the spin_lock() level to complain > + loudly if a spin lock operation was tried under the DRM panic > + context. This could be easily noticed/reported by CI testing. > + > + - *Non-destructive* testing of the DRM panic code, emulating a > + real panic path context as much as possible (e.g. by disabling > + irqs and enabling the spin lock hooks earlier mentioned), is > + created. This is necessary for heaviling testing the DRM panic > + code through `CI machines <https://lwn.net/Articles/735468/>`_. > + > + > +Supported drivers > +================= > + > +* Intel i915-compatible cards Excellent summary, thanks for typing this up. -Daniel > > > -- > darwi > http://darwish.chasingpointers.com