[v2,1/3] drm: Add support for panic message output

Message ID	20190311174218.51899-2-noralf@tronnes.org (mailing list archive)
State	New, archived
Headers	show Return-Path: <intel-gfx-bounces@lists.freedesktop.org> From: =?utf-8?q?Noralf_Tr=C3=B8nnes?= <noralf@tronnes.org> To: dri-devel@lists.freedesktop.org Date: Mon, 11 Mar 2019 18:42:16 +0100 Message-Id: <20190311174218.51899-2-noralf@tronnes.org> In-Reply-To: <20190311174218.51899-1-noralf@tronnes.org> References: <20190311174218.51899-1-noralf@tronnes.org> MIME-Version: 1.0 X-Mailman-Original-DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=tronnes.org; s=ds201810; h=Content-Transfer-Encoding:Content-Type:MIME-Version:References:In-Reply-To:Message-Id:Date:Subject:Cc:To:From; bh=AYL43UC09OLtThuowsLT32JqUiAqTlqu/osntpJbXZI=; b=RgvrtAcW9FYZhmanbVlcbF7N7+HYV2Tkj4islMK1iEGnNGt86gL0ubvVawOhyDUlbnAMpE7FgJH9ycaJ7YaL8fPKvssHkyr5ePSzdQ3FN2sv/IBVpJVBvsVTdmDpIfCICGs2ZJxpw1ue/NgDMUSYlN6Q1cDvPHeobwsP7AaNsGZ6Y9TSRmHVv/tzbsfCovTUxISeyaRCKPw7NylwfQGSCUAX848hDBeIlKJ9PJ5Tv0gUaxwAiYiii08Cb0s/ungJ6D4BC6t4Ys1EJx1sEK8UHfw7dB9pHoJTCnYi03aNMVBn3Ti3zquhtM0AvALr8IIGJ+VK0IHR5TzDvS8pkibYew==; Subject: [Intel-gfx] [PATCH v2 1/3] drm: Add support for panic message output Precedence: list Cc: daniel.vetter@ffwll.ch, intel-gfx@lists.freedesktop.org, =?utf-8?q?Noral?= =?utf-8?q?f_Tr=C3=B8nnes?= <noralf@tronnes.org>, darwish.07@gmail.com Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>
Series	drm: Add panic handling \| expand [v2,0/3] drm: Add panic handling [v2,1/3] drm: Add support for panic message output [v2,2/3] drm/cma-helper: Add support for panic screen [v2,3/3] drm/vc4: Support for panic screen

Noralf Trønnes March 11, 2019, 5:42 p.m. UTC

This adds support for outputting kernel messages on panic().
A kernel message dumper is used to dump the log. The dumper iterates
over each DRM device and it's crtc's to find suitable framebuffers.

All the other dumpers are run before this one except mtdoops.
Only atomic drivers are supported.

Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
---
 drivers/gpu/drm/Kconfig           |   3 +
 drivers/gpu/drm/drm_drv.c         |   3 +
 drivers/gpu/drm/drm_framebuffer.c | 117 ++++++++++
 drivers/gpu/drm/drm_internal.h    |   4 +
 drivers/gpu/drm/drm_panic.c       | 363 ++++++++++++++++++++++++++++++
 include/drm/drm_framebuffer.h     |  41 ++++
 6 files changed, 531 insertions(+)
 create mode 100644 drivers/gpu/drm/drm_panic.c

Daniel Vetter March 11, 2019, 7:23 p.m. UTC | #1

On Mon, Mar 11, 2019 at 06:42:16PM +0100, Noralf Trønnes wrote:
> This adds support for outputting kernel messages on panic().
> A kernel message dumper is used to dump the log. The dumper iterates
> over each DRM device and it's crtc's to find suitable framebuffers.
> 
> All the other dumpers are run before this one except mtdoops.
> Only atomic drivers are supported.
> 
> Signed-off-by: Noralf Trønnes <noralf@tronnes.org>

Bunch of comments/ideas for you or Darwish below, whoever picks this up.
-Daniel

> ---
>  drivers/gpu/drm/Kconfig           |   3 +
>  drivers/gpu/drm/drm_drv.c         |   3 +
>  drivers/gpu/drm/drm_framebuffer.c | 117 ++++++++++
>  drivers/gpu/drm/drm_internal.h    |   4 +
>  drivers/gpu/drm/drm_panic.c       | 363 ++++++++++++++++++++++++++++++
>  include/drm/drm_framebuffer.h     |  41 ++++
>  6 files changed, 531 insertions(+)
>  create mode 100644 drivers/gpu/drm/drm_panic.c
> 
> diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> index bd943a71756c..74c37542f857 100644
> --- a/drivers/gpu/drm/Kconfig
> +++ b/drivers/gpu/drm/Kconfig
> @@ -14,6 +14,9 @@ menuconfig DRM
>  	select I2C_ALGOBIT
>  	select DMA_SHARED_BUFFER
>  	select SYNC_FILE
> +	select FONT_SUPPORT
> +	select FONT_8x8
> +	select FONT_8x16
>  	help
>  	  Kernel-level support for the Direct Rendering Infrastructure (DRI)
>  	  introduced in XFree86 4.0. If you say Y here, you need to select
> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> index 50d849d1bc6e..b0b870b5dd55 100644
> --- a/drivers/gpu/drm/drm_drv.c
> +++ b/drivers/gpu/drm/drm_drv.c
> @@ -1147,6 +1147,7 @@ static const struct file_operations drm_stub_fops = {
>  
>  static void drm_core_exit(void)
>  {
> +	drm_panic_exit(drm_debugfs_root);
>  	unregister_chrdev(DRM_MAJOR, "drm");
>  	debugfs_remove(drm_debugfs_root);
>  	drm_sysfs_destroy();
> @@ -1178,6 +1179,8 @@ static int __init drm_core_init(void)
>  	if (ret < 0)
>  		goto error;
>  
> +	drm_panic_init(drm_debugfs_root);
> +
>  	drm_core_init_complete = true;
>  
>  	DRM_DEBUG("Initialized\n");
> diff --git a/drivers/gpu/drm/drm_framebuffer.c b/drivers/gpu/drm/drm_framebuffer.c
> index d8d75e25f6fb..da2285c5ae90 100644
> --- a/drivers/gpu/drm/drm_framebuffer.c
> +++ b/drivers/gpu/drm/drm_framebuffer.c
> @@ -1087,3 +1087,120 @@ int drm_framebuffer_debugfs_init(struct drm_minor *minor)
>  				minor->debugfs_root, minor);
>  }
>  #endif
> +
> +/**
> + * drm_framebuffer_panic_draw_xy - draw pixel on fb during panic()
> + * @fb: DRM framebuffer
> + * @vmap: Linear virtual mapping
> + * @x: X coordinate
> + * @y: Y coordinate
> + * @foreground: Foreground pixel
> + *
> + * This function can be used to draw a pixel during panic message rendering.
> + * It requires @vmap to be a linear mapping. This is the default implementation
> + * used if &drm_framebuffer_funcs->panic_draw_xy is not set.
> + */
> +void drm_framebuffer_panic_draw_xy(struct drm_framebuffer *fb, void *vmap,
> +				   int x, int y, bool foreground)
> +{
> +	void *pix = vmap + fb->offsets[0] + (y * fb->pitches[0]);
> +	u8 *pix8 = pix;
> +	u16 *pix16 = pix;
> +	u32 *pix32 = pix;
> +
> +	switch (fb->format->format & ~DRM_FORMAT_BIG_ENDIAN) {
> +	case DRM_FORMAT_C8:
> +
> +	case DRM_FORMAT_RGB332:
> +	case DRM_FORMAT_BGR233:
> +
> +	case DRM_FORMAT_NV12:
> +	case DRM_FORMAT_NV21:
> +	case DRM_FORMAT_NV16:
> +	case DRM_FORMAT_NV61:
> +	case DRM_FORMAT_NV24:
> +	case DRM_FORMAT_NV42:
> +
> +	case DRM_FORMAT_YUV410:
> +	case DRM_FORMAT_YVU410:
> +	case DRM_FORMAT_YUV411:
> +	case DRM_FORMAT_YVU411:
> +	case DRM_FORMAT_YUV420:
> +	case DRM_FORMAT_YVU420:
> +	case DRM_FORMAT_YUV422:
> +	case DRM_FORMAT_YVU422:
> +	case DRM_FORMAT_YUV444:
> +	case DRM_FORMAT_YVU444:
> +		pix8[x] = foreground ? 0xff : 0x00;
> +		break;
> +
> +	case DRM_FORMAT_XRGB4444:
> +	case DRM_FORMAT_ARGB4444:
> +	case DRM_FORMAT_XBGR4444:
> +	case DRM_FORMAT_ABGR4444:
> +		pix16[x] = foreground ? 0xffff : 0xf000;
> +		break;
> +
> +	case DRM_FORMAT_RGBX4444:
> +	case DRM_FORMAT_RGBA4444:
> +	case DRM_FORMAT_BGRX4444:
> +	case DRM_FORMAT_BGRA4444:
> +		pix16[x] = foreground ? 0xffff : 0x000f;
> +		break;
> +
> +	case DRM_FORMAT_XRGB1555:
> +	case DRM_FORMAT_ARGB1555:
> +	case DRM_FORMAT_XBGR1555:
> +	case DRM_FORMAT_ABGR1555:
> +		pix16[x] = foreground ? 0xffff : 0x8000;
> +		break;
> +
> +	case DRM_FORMAT_RGBX5551:
> +	case DRM_FORMAT_RGBA5551:
> +	case DRM_FORMAT_BGRX5551:
> +	case DRM_FORMAT_BGRA5551:
> +		pix16[x] = foreground ? 0xffff : 0x0001;
> +		break;
> +
> +	case DRM_FORMAT_RGB565:
> +	case DRM_FORMAT_BGR565:
> +		pix16[x] = foreground ? 0xffff : 0x0000;
> +		break;
> +
> +	case DRM_FORMAT_RGB888:
> +	case DRM_FORMAT_BGR888:
> +		pix8[x * 3] = foreground ? 0xff : 0x00;
> +		pix8[(x * 3) + 1] = pix8[x];
> +		pix8[(x * 3) + 2] = pix8[x];
> +		break;
> +
> +	case DRM_FORMAT_XRGB8888:
> +	case DRM_FORMAT_ARGB8888:
> +	case DRM_FORMAT_XBGR8888:
> +	case DRM_FORMAT_ABGR8888:
> +		pix32[x] = foreground ? 0xffffffff : 0xff000000;
> +		break;
> +
> +	case DRM_FORMAT_RGBX8888:
> +	case DRM_FORMAT_RGBA8888:
> +	case DRM_FORMAT_BGRX8888:
> +	case DRM_FORMAT_BGRA8888:
> +		pix32[x] = foreground ? 0xffffffff : 0x000000ff;
> +		break;
> +
> +	case DRM_FORMAT_XRGB2101010:
> +	case DRM_FORMAT_ARGB2101010:
> +	case DRM_FORMAT_XBGR2101010:
> +	case DRM_FORMAT_ABGR2101010:
> +		pix32[x] = foreground ? 0xffffffff : 0xc0000000;
> +		break;
> +
> +	case DRM_FORMAT_RGBX1010102:
> +	case DRM_FORMAT_RGBA1010102:
> +	case DRM_FORMAT_BGRX1010102:
> +	case DRM_FORMAT_BGRA1010102:
> +		pix32[x] = foreground ? 0xffffffff : 0x00000003;
> +		break;
> +	}
> +}
> +EXPORT_SYMBOL(drm_framebuffer_panic_draw_xy);
> diff --git a/drivers/gpu/drm/drm_internal.h b/drivers/gpu/drm/drm_internal.h
> index 251d67e04c2d..694a5803ac3c 100644
> --- a/drivers/gpu/drm/drm_internal.h
> +++ b/drivers/gpu/drm/drm_internal.h
> @@ -191,3 +191,7 @@ int drm_syncobj_signal_ioctl(struct drm_device *dev, void *data,
>  void drm_framebuffer_print_info(struct drm_printer *p, unsigned int indent,
>  				const struct drm_framebuffer *fb);
>  int drm_framebuffer_debugfs_init(struct drm_minor *minor);
> +
> +/* drm_panic.c */
> +void drm_panic_init(struct dentry *debugfs_root);
> +void drm_panic_exit(struct dentry *debugfs_root);
> diff --git a/drivers/gpu/drm/drm_panic.c b/drivers/gpu/drm/drm_panic.c
> new file mode 100644
> index 000000000000..4bca7f0bc369
> --- /dev/null
> +++ b/drivers/gpu/drm/drm_panic.c
> @@ -0,0 +1,363 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright 2018 Noralf Trønnes
> +
> +#include <linux/debugfs.h>
> +#include <linux/font.h>
> +#include <linux/kernel.h>
> +#include <linux/kmsg_dump.h>
> +#include <linux/seq_file.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +
> +#include <drm/drm_crtc.h>
> +#include <drm/drm_device.h>
> +#include <drm/drm_drv.h>
> +#include <drm/drm_file.h>
> +#include <drm/drm_framebuffer.h>
> +#include <drm/drm_fb_helper.h>
> +#include <drm/drm_plane.h>
> +#include <drm/drm_print.h>
> +#include <drm/drmP.h>
> +
> +#include "drm_internal.h"
> +
> +/*
> + * The log lines in an ARM stack dump are 92 characters long
> + * and 120 is a nice multiple for HD and 4K.
> + */
> +#define DRM_PANIC_COLUMN_WIDTH	120

I think we should compute this dynamically from the actual fb width and
font witdth.

> +
> +struct drm_panic_ctx {
> +	struct drm_framebuffer *fb;
> +	unsigned int width;
> +	unsigned int height;
> +	void *vmap;
> +
> +	const struct font_desc *font;
> +	unsigned int col_width;
> +	unsigned int bottom_y;
> +	size_t max_line_length;
> +
> +	unsigned int x;
> +	unsigned int y;
> +};
> +
> +static const struct font_desc *drm_panic_font8x8, *drm_panic_font8x16;
> +
> +static void drm_panic_draw_xy(struct drm_framebuffer *fb, void *vmap,
> +			      int x, int y, bool fg)
> +{
> +	if (fb->funcs->panic_draw_xy)
> +		fb->funcs->panic_draw_xy(fb, vmap, x, y, fg);
> +	else
> +		drm_framebuffer_panic_draw_xy(fb, vmap, x, y, fg);
> +}
> +
> +static void drm_panic_render_char(struct drm_panic_ctx *ctx,
> +				  unsigned int offset, char c)
> +{
> +	unsigned int h, w, x, y;
> +	u8 fontline;
> +
> +	for (h = 0; h < ctx->font->height; h++) {
> +		fontline = *(u8 *)(ctx->font->data + c * ctx->font->height + h);
> +
> +		for (w = 0; w < ctx->font->width; w++) {
> +			x = ctx->x + (offset * ctx->font->width) + w;
> +			y = ctx->y + h;
> +			drm_panic_draw_xy(ctx->fb, ctx->vmap, x, y,
> +					  fontline & BIT(7 - w));
> +		}
> +	}
> +}
> +
> +static void drm_panic_render_line(struct drm_panic_ctx *ctx,
> +				  const char *line, size_t len)
> +{
> +	unsigned int i;
> +
> +	for (i = 0; i < len; i++)
> +		drm_panic_render_char(ctx, i, line[i]);
> +
> +	/* Clear out the rest of the line */
> +	for (i = len; i < ctx->max_line_length; i++)
> +		drm_panic_render_char(ctx, i, ' ');
> +}
> +
> +static bool drm_panic_newline(struct drm_panic_ctx *ctx)
> +{
> +	if (ctx->x == 0 && ctx->y == 0)
> +		return false;
> +	if (ctx->y == 0) {
> +		ctx->x -= ctx->col_width;
> +		ctx->y = ctx->bottom_y;
> +	} else {
> +		ctx->y -= ctx->font->height;
> +	}
> +
> +	return true;
> +}
> +
> +/* Render from bottom right most column */
> +static bool drm_panic_render_lines(struct drm_panic_ctx *ctx,
> +				   const char *str, size_t len)
> +{
> +	size_t l, line_length;
> +	const char *pos;
> +	int i;
> +
> +	while (len) {
> +		pos = str + len - 1;
> +
> +		if (*pos == '\n') {
> +			len--;
> +			if (!drm_panic_newline(ctx))
> +				return false;
> +			continue;
> +		}
> +
> +		while (pos > str && *(pos - 1) != '\n')
> +			pos--;
> +
> +		line_length = len - (pos - str);
> +
> +		if (!line_length || len < line_length) {
> +			pr_err("%s: Bug! line_length=%zu len=%zu\n",
> +			       __func__, line_length, len);
> +			return false;
> +		}
> +
> +		len -= line_length;
> +
> +		for (i = DIV_ROUND_UP(line_length, ctx->max_line_length) - 1; i >= 0; i--) {
> +			l = min(ctx->max_line_length, line_length - i * ctx->max_line_length);
> +			drm_panic_render_line(ctx, pos + (i * ctx->max_line_length), l);
> +			if (i && !drm_panic_newline(ctx))
> +				return false;
> +		}
> +	}
> +
> +	return true;
> +}
> +
> +static void drm_panic_kmsg_render_screen(struct drm_plane *plane,
> +					 struct kmsg_dumper *dumper)
> +{
> +	struct drm_framebuffer *fb = plane->state->fb;
> +	bool first_iteration = true;
> +	struct drm_panic_ctx ctx;
> +	static char text[1024];
> +	size_t len;
> +
> +	ctx.vmap = fb->funcs->panic_vmap(fb);
> +
> +	/* Print some info when testing */
> +	if (dumper->max_reason == KMSG_DUMP_OOPS) {
> +		struct drm_format_name_buf format_name;
> +
> +		pr_info("%s: [FB:%d] %ux%u format=%s vmap=%p\n",
> +			__func__, fb->base.id, fb->width, fb->height,
> +			drm_get_format_name(fb->format->format, &format_name),
> +			ctx.vmap);
> +	}
> +
> +	if (!ctx.vmap)
> +		return;

For some framebuffers it might be useful to not require vmap, but instead
have some page-by-page kind of access helper. Since preemptively setting
up a vmap for all the non-continous buffers is a bit much, and we really
can't set up the vmap in the panic handler.

Maybe we should call this panic_prepare/panic_finish and
s/vmap/cookie/, to make it entirely opaque?

But a bit overengineering perhaps :-)

> +
> +	ctx.fb = fb;
> +	ctx.width = drm_rect_width(&plane->state->src) >> 16;
> +	ctx.height = drm_rect_height(&plane->state->src) >> 16;
> +
> +	/*
> +	 * TODO:
> +	 * Find which part of the fb that is visible.
> +	 * Width and height are zero on vc4
> +	 */
> +	if (!ctx.width || !ctx.height) {
> +		ctx.width = fb->width;
> +		ctx.height = fb->height;
> +	}
> +
> +	/* Try to fit 50 lines */
> +	if (ctx.height < 50 * 16 && drm_panic_font8x8)
> +		ctx.font = drm_panic_font8x8;
> +	else
> +		ctx.font = drm_panic_font8x16;
> +
> +	ctx.col_width = DRM_PANIC_COLUMN_WIDTH * ctx.font->width;
> +	ctx.bottom_y = ((ctx.height / ctx.font->height) - 1) * ctx.font->height;
> +
> +	if (ctx.width < 2 * ctx.col_width)
> +		ctx.max_line_length = ctx.width / ctx.font->width;
> +	else
> +		ctx.max_line_length = DRM_PANIC_COLUMN_WIDTH - 2; /* border=2 */
> +
> +	ctx.x = 0;
> +	ctx.y = ctx.bottom_y;
> +	if (ctx.width > ctx.col_width)
> +		ctx.x = ((ctx.width / ctx.col_width) - 1) * ctx.col_width;
> +
> +	pr_debug("%s: font=%s %ux%u max_line_length=%zu (%u,%u)\n",
> +		 __func__, ctx.font->name, ctx.width, ctx.height,
> +		 ctx.max_line_length, ctx.x, ctx.y);
> +
> +	kmsg_dump_rewind(dumper);
> +
> +	/* The latest messages are fetched first */
> +	while (kmsg_dump_get_buffer(dumper, false, text, sizeof(text), &len)) {
> +		/* Strip off the very last newline so we start at the bottom */
> +		if (first_iteration) {
> +			len--;
> +			first_iteration = false;
> +		}
> +
> +		if (!drm_panic_render_lines(&ctx, text, len))
> +			break;
> +	}
> +
> +	if (fb->funcs->panic_vunmap)
> +		fb->funcs->panic_vunmap(fb, vmap);
> +}
> +
> +static void drm_panic_try_dev(struct drm_device *dev, struct kmsg_dumper *dumper)
> +{
> +	struct drm_framebuffer *fb;
> +	struct drm_plane *plane;
> +	struct drm_crtc *crtc;
> +
> +	if (!drm_core_check_feature(dev, DRIVER_ATOMIC))
> +		return;
> +
> +	if (dumper->max_reason == KMSG_DUMP_OOPS)
> +		pr_info("%s: %s on minor %d\n", __func__, dev->driver->name,
> +			dev->primary->index);
> +
> +	drm_for_each_crtc(crtc, dev) {
> +		if (!ww_mutex_trylock(&crtc->mutex.mutex))
> +			continue;
> +
> +		if (!crtc->enabled || !crtc->primary)
> +			goto crtc_unlock;
> +
> +		if (!crtc->state || !crtc->state->active)
> +			goto crtc_unlock;
> +
> +		plane = crtc->primary;
> +		if (!ww_mutex_trylock(&plane->mutex.mutex))
> +			goto crtc_unlock;
> +
> +		/*
> +		 * TODO: Should we check plane_state->visible?
> +		 * It is not set on vc4

I think we should. We should also check whether that primary is connected
to the crtc (some hw you can reassign planes freely between crtc, and the
crtc->primary pointer is only used for compat with legacy ioctl).

> +		if (!plane->state || !plane->state->visible)
> +		 */
> +		if (!plane->state)
> +			goto plane_unlock;
> +
> +		fb = plane->state->fb;
> +		if (!fb || !fb->funcs->panic_vmap)

Docs says this is optional, doesn't seem to be all that optional. I'd
check for this or a driver-specific ->panic_draw_xy instead.
> +			goto plane_unlock;
> +
> +		/*
> +		 * fbcon puts out panic messages so stay away to avoid jumbled
> +		 * output. If vc->vc_mode != KD_TEXT fbcon won't put out
> +		 * messages (see vt_console_print).
> +		 */
> +		if (dev->fb_helper && dev->fb_helper->fb == fb)
> +			goto plane_unlock;
> +
> +		drm_panic_kmsg_render_screen(plane, dumper);
> +plane_unlock:
> +		ww_mutex_unlock(&plane->mutex.mutex);
> +crtc_unlock:
> +		ww_mutex_unlock(&crtc->mutex.mutex);
> +	}
> +}
> +
> +static int drm_panic_dev_iter(struct device *dev, void *data)
> +{
> +	struct drm_minor *minor = dev_get_drvdata(dev);
> +
> +	if (minor && minor->type == DRM_MINOR_PRIMARY)
> +		drm_panic_try_dev(minor->dev, data);
> +
> +	return 0;
> +}
> +
> +static void drm_panic_kmsg_dump(struct kmsg_dumper *dumper,
> +				enum kmsg_dump_reason reason)
> +{
> +	class_for_each_device(drm_class, NULL, dumper, drm_panic_dev_iter);

class_for_each_device uses klist, which only uses an irqsave spinlock. I
think that's good enough. Comment to that effect would be good e.g.

	/* based on klist, which uses only a spin_lock_irqsave, which we
	 * assume still works */

If we aim for perfect this should be a trylock still, maybe using our own
device list.

> +}
> +
> +static struct kmsg_dumper drm_panic_kmsg_dumper = {
> +	.dump = drm_panic_kmsg_dump,
> +	.max_reason = KMSG_DUMP_PANIC,
> +};
> +
> +static ssize_t drm_panic_file_panic_write(struct file *file,
> +					  const char __user *user_buf,
> +					  size_t count, loff_t *ppos)
> +{
> +	unsigned long long val;
> +	char buf[24];
> +	size_t size;
> +	ssize_t ret;
> +
> +	size = min(sizeof(buf) - 1, count);
> +	if (copy_from_user(buf, user_buf, size))
> +		return -EFAULT;
> +
> +	buf[size] = '\0';
> +	ret = kstrtoull(buf, 0, &val);
> +	if (ret)
> +		return ret;
> +
> +	drm_panic_kmsg_dumper.max_reason = KMSG_DUMP_OOPS;
> +	wmb();
> +
> +	/* Do a real test with: echo c > /proc/sysrq-trigger */
> +
> +	if (val == 0) {
> +		pr_info("Test panic screen using kmsg_dump(OOPS)\n");
> +		kmsg_dump(KMSG_DUMP_OOPS);
> +	} else if (val == 1) {
> +		char *null_pointer = NULL;
> +
> +		pr_info("Test panic screen using NULL pointer dereference\n");
> +		*null_pointer = 1;
> +	} else {
> +		return -EINVAL;
> +	}

This isn't quite what I had in mind, since it still kills the kernel (like
sysrq-trigger). Instead what I had in mind is to recreate the worst
possible panic context as much as feasible (disabling interrupts should be
a good start, maybe we can even do an nmi callback), and then call our
panic implementation. That way we can test the panic handler in a
non-destructive way (i.e. aside from last dmesg lines printed to the
screen nothing bad happens to the kernel: No real panic, no oops, no
tainting).

And igt testcase could then exercise this, and CI run it.

> +
> +	return count;
> +}
> +
> +static const struct file_operations drm_panic_panic_ops = {
> +	.write =        drm_panic_file_panic_write,
> +	.open =         simple_open,
> +	.llseek =       default_llseek,
> +};
> +
> +static struct dentry *drm_panic_d_panic;
> +
> +void __init drm_panic_init(struct dentry *debugfs_root)
> +{
> +	drm_panic_font8x8 = find_font("VGA8x8");
> +	drm_panic_font8x16 = find_font("VGA8x16");
> +	if (!drm_panic_font8x16) {
> +		DRM_WARN("Couldn't find font, panic screen disabled\n");
> +		return;
> +	}
> +
> +	drm_panic_d_panic = debugfs_create_file("panic-test", 0200,
> +						debugfs_root, NULL,
> +						&drm_panic_panic_ops);
> +	kmsg_dump_register(&drm_panic_kmsg_dumper);
> +}
> +
> +void drm_panic_exit(struct dentry *debugfs_root)
> +{
> +	kmsg_dump_unregister(&drm_panic_kmsg_dumper);
> +	debugfs_remove(drm_panic_d_panic);
> +}
> diff --git a/include/drm/drm_framebuffer.h b/include/drm/drm_framebuffer.h
> index f0b34c977ec5..f3274798ecfe 100644
> --- a/include/drm/drm_framebuffer.h
> +++ b/include/drm/drm_framebuffer.h
> @@ -94,6 +94,44 @@ struct drm_framebuffer_funcs {
>  		     struct drm_file *file_priv, unsigned flags,
>  		     unsigned color, struct drm_clip_rect *clips,
>  		     unsigned num_clips);
> +
> +	/**
> +	 * @panic_vmap:
> +	 *
> +	 * Optional callback for panic handling.
> +	 *
> +	 * For vmapping the selected framebuffer in a panic context. Must
> +	 * be super careful about locking (only trylocking allowed).
> +	 *
> +	 * RETURNS:
> +	 *
> +	 * NULL if it didn't work out, otherwise an opaque cookie which is
> +	 * passed to @panic_draw_xy. It can be anything: vmap area, structure
> +	 * with more details, just a few flags, ...
> +	 */
> +	void *(*panic_vmap)(struct drm_framebuffer *fb);
> +
> +	/**
> +	 * @panic_vunmap:
> +	 *
> +	 * Optional callback for cleaning up after panic testing.
> +	 *
> +	 * Crtc and plane locks are released after this callback has run.
> +	 * vmap is the cookie returned by @panic_vmap.
> +	 */
> +	void (*panic_vunmap)(struct drm_framebuffer *fb, void *vmap);
> +
> +	/**
> +	 * @panic_draw_xy:
> +	 *
> +	 * Optional callback for drawing pixels during panic.
> +	 *
> +	 * For drawing pixels onto a framebuffer prepared with @panic_vmap.
> +	 * vmap is the cookie returned by @panic_vmap.
> +	 * If it's not set, drm_framebuffer_panic_draw_xy() is used.
> +	 */
> +	void (*panic_draw_xy)(struct drm_framebuffer *fb, void *vmap,
> +			      int x, int y, bool foreground);
>  };
>  
>  /**
> @@ -293,4 +331,7 @@ int drm_framebuffer_plane_width(int width,
>  int drm_framebuffer_plane_height(int height,
>  				 const struct drm_framebuffer *fb, int plane);
>  
> +void drm_framebuffer_panic_draw_xy(struct drm_framebuffer *fb, void *vmap,
> +				   int x, int y, bool foreground);
> +
>  #endif
> -- 
> 2.20.1
>

Daniel Vetter March 11, 2019, 7:29 p.m. UTC | #2

On Mon, Mar 11, 2019 at 08:23:38PM +0100, Daniel Vetter wrote:
> On Mon, Mar 11, 2019 at 06:42:16PM +0100, Noralf Trønnes wrote:
> > This adds support for outputting kernel messages on panic().
> > A kernel message dumper is used to dump the log. The dumper iterates
> > over each DRM device and it's crtc's to find suitable framebuffers.
> > 
> > All the other dumpers are run before this one except mtdoops.
> > Only atomic drivers are supported.
> > 
> > Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
> 
> Bunch of comments/ideas for you or Darwish below, whoever picks this up.
> -Daniel
> 
> > ---
> >  drivers/gpu/drm/Kconfig           |   3 +
> >  drivers/gpu/drm/drm_drv.c         |   3 +
> >  drivers/gpu/drm/drm_framebuffer.c | 117 ++++++++++
> >  drivers/gpu/drm/drm_internal.h    |   4 +
> >  drivers/gpu/drm/drm_panic.c       | 363 ++++++++++++++++++++++++++++++
> >  include/drm/drm_framebuffer.h     |  41 ++++
> >  6 files changed, 531 insertions(+)
> >  create mode 100644 drivers/gpu/drm/drm_panic.c
> > 
> > diff --git a/drivers/gpu/drm/Kconfig b/drivers/gpu/drm/Kconfig
> > index bd943a71756c..74c37542f857 100644
> > --- a/drivers/gpu/drm/Kconfig
> > +++ b/drivers/gpu/drm/Kconfig
> > @@ -14,6 +14,9 @@ menuconfig DRM
> >  	select I2C_ALGOBIT
> >  	select DMA_SHARED_BUFFER
> >  	select SYNC_FILE
> > +	select FONT_SUPPORT
> > +	select FONT_8x8
> > +	select FONT_8x16
> >  	help
> >  	  Kernel-level support for the Direct Rendering Infrastructure (DRI)
> >  	  introduced in XFree86 4.0. If you say Y here, you need to select
> > diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> > index 50d849d1bc6e..b0b870b5dd55 100644
> > --- a/drivers/gpu/drm/drm_drv.c
> > +++ b/drivers/gpu/drm/drm_drv.c
> > @@ -1147,6 +1147,7 @@ static const struct file_operations drm_stub_fops = {
> >  
> >  static void drm_core_exit(void)
> >  {
> > +	drm_panic_exit(drm_debugfs_root);
> >  	unregister_chrdev(DRM_MAJOR, "drm");
> >  	debugfs_remove(drm_debugfs_root);
> >  	drm_sysfs_destroy();
> > @@ -1178,6 +1179,8 @@ static int __init drm_core_init(void)
> >  	if (ret < 0)
> >  		goto error;
> >  
> > +	drm_panic_init(drm_debugfs_root);
> > +
> >  	drm_core_init_complete = true;
> >  
> >  	DRM_DEBUG("Initialized\n");
> > diff --git a/drivers/gpu/drm/drm_framebuffer.c b/drivers/gpu/drm/drm_framebuffer.c
> > index d8d75e25f6fb..da2285c5ae90 100644
> > --- a/drivers/gpu/drm/drm_framebuffer.c
> > +++ b/drivers/gpu/drm/drm_framebuffer.c
> > @@ -1087,3 +1087,120 @@ int drm_framebuffer_debugfs_init(struct drm_minor *minor)
> >  				minor->debugfs_root, minor);
> >  }
> >  #endif
> > +
> > +/**
> > + * drm_framebuffer_panic_draw_xy - draw pixel on fb during panic()
> > + * @fb: DRM framebuffer
> > + * @vmap: Linear virtual mapping
> > + * @x: X coordinate
> > + * @y: Y coordinate
> > + * @foreground: Foreground pixel
> > + *
> > + * This function can be used to draw a pixel during panic message rendering.
> > + * It requires @vmap to be a linear mapping. This is the default implementation
> > + * used if &drm_framebuffer_funcs->panic_draw_xy is not set.
> > + */
> > +void drm_framebuffer_panic_draw_xy(struct drm_framebuffer *fb, void *vmap,
> > +				   int x, int y, bool foreground)
> > +{
> > +	void *pix = vmap + fb->offsets[0] + (y * fb->pitches[0]);
> > +	u8 *pix8 = pix;
> > +	u16 *pix16 = pix;
> > +	u32 *pix32 = pix;
> > +
> > +	switch (fb->format->format & ~DRM_FORMAT_BIG_ENDIAN) {
> > +	case DRM_FORMAT_C8:
> > +
> > +	case DRM_FORMAT_RGB332:
> > +	case DRM_FORMAT_BGR233:
> > +
> > +	case DRM_FORMAT_NV12:
> > +	case DRM_FORMAT_NV21:
> > +	case DRM_FORMAT_NV16:
> > +	case DRM_FORMAT_NV61:
> > +	case DRM_FORMAT_NV24:
> > +	case DRM_FORMAT_NV42:
> > +
> > +	case DRM_FORMAT_YUV410:
> > +	case DRM_FORMAT_YVU410:
> > +	case DRM_FORMAT_YUV411:
> > +	case DRM_FORMAT_YVU411:
> > +	case DRM_FORMAT_YUV420:
> > +	case DRM_FORMAT_YVU420:
> > +	case DRM_FORMAT_YUV422:
> > +	case DRM_FORMAT_YVU422:
> > +	case DRM_FORMAT_YUV444:
> > +	case DRM_FORMAT_YVU444:
> > +		pix8[x] = foreground ? 0xff : 0x00;
> > +		break;
> > +
> > +	case DRM_FORMAT_XRGB4444:
> > +	case DRM_FORMAT_ARGB4444:
> > +	case DRM_FORMAT_XBGR4444:
> > +	case DRM_FORMAT_ABGR4444:
> > +		pix16[x] = foreground ? 0xffff : 0xf000;
> > +		break;
> > +
> > +	case DRM_FORMAT_RGBX4444:
> > +	case DRM_FORMAT_RGBA4444:
> > +	case DRM_FORMAT_BGRX4444:
> > +	case DRM_FORMAT_BGRA4444:
> > +		pix16[x] = foreground ? 0xffff : 0x000f;
> > +		break;
> > +
> > +	case DRM_FORMAT_XRGB1555:
> > +	case DRM_FORMAT_ARGB1555:
> > +	case DRM_FORMAT_XBGR1555:
> > +	case DRM_FORMAT_ABGR1555:
> > +		pix16[x] = foreground ? 0xffff : 0x8000;
> > +		break;
> > +
> > +	case DRM_FORMAT_RGBX5551:
> > +	case DRM_FORMAT_RGBA5551:
> > +	case DRM_FORMAT_BGRX5551:
> > +	case DRM_FORMAT_BGRA5551:
> > +		pix16[x] = foreground ? 0xffff : 0x0001;
> > +		break;
> > +
> > +	case DRM_FORMAT_RGB565:
> > +	case DRM_FORMAT_BGR565:
> > +		pix16[x] = foreground ? 0xffff : 0x0000;
> > +		break;
> > +
> > +	case DRM_FORMAT_RGB888:
> > +	case DRM_FORMAT_BGR888:
> > +		pix8[x * 3] = foreground ? 0xff : 0x00;
> > +		pix8[(x * 3) + 1] = pix8[x];
> > +		pix8[(x * 3) + 2] = pix8[x];
> > +		break;
> > +
> > +	case DRM_FORMAT_XRGB8888:
> > +	case DRM_FORMAT_ARGB8888:
> > +	case DRM_FORMAT_XBGR8888:
> > +	case DRM_FORMAT_ABGR8888:
> > +		pix32[x] = foreground ? 0xffffffff : 0xff000000;
> > +		break;
> > +
> > +	case DRM_FORMAT_RGBX8888:
> > +	case DRM_FORMAT_RGBA8888:
> > +	case DRM_FORMAT_BGRX8888:
> > +	case DRM_FORMAT_BGRA8888:
> > +		pix32[x] = foreground ? 0xffffffff : 0x000000ff;
> > +		break;
> > +
> > +	case DRM_FORMAT_XRGB2101010:
> > +	case DRM_FORMAT_ARGB2101010:
> > +	case DRM_FORMAT_XBGR2101010:
> > +	case DRM_FORMAT_ABGR2101010:
> > +		pix32[x] = foreground ? 0xffffffff : 0xc0000000;
> > +		break;
> > +
> > +	case DRM_FORMAT_RGBX1010102:
> > +	case DRM_FORMAT_RGBA1010102:
> > +	case DRM_FORMAT_BGRX1010102:
> > +	case DRM_FORMAT_BGRA1010102:
> > +		pix32[x] = foreground ? 0xffffffff : 0x00000003;
> > +		break;
> > +	}
> > +}
> > +EXPORT_SYMBOL(drm_framebuffer_panic_draw_xy);
> > diff --git a/drivers/gpu/drm/drm_internal.h b/drivers/gpu/drm/drm_internal.h
> > index 251d67e04c2d..694a5803ac3c 100644
> > --- a/drivers/gpu/drm/drm_internal.h
> > +++ b/drivers/gpu/drm/drm_internal.h
> > @@ -191,3 +191,7 @@ int drm_syncobj_signal_ioctl(struct drm_device *dev, void *data,
> >  void drm_framebuffer_print_info(struct drm_printer *p, unsigned int indent,
> >  				const struct drm_framebuffer *fb);
> >  int drm_framebuffer_debugfs_init(struct drm_minor *minor);
> > +
> > +/* drm_panic.c */
> > +void drm_panic_init(struct dentry *debugfs_root);
> > +void drm_panic_exit(struct dentry *debugfs_root);
> > diff --git a/drivers/gpu/drm/drm_panic.c b/drivers/gpu/drm/drm_panic.c
> > new file mode 100644
> > index 000000000000..4bca7f0bc369
> > --- /dev/null
> > +++ b/drivers/gpu/drm/drm_panic.c
> > @@ -0,0 +1,363 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +// Copyright 2018 Noralf Trønnes
> > +
> > +#include <linux/debugfs.h>
> > +#include <linux/font.h>
> > +#include <linux/kernel.h>
> > +#include <linux/kmsg_dump.h>
> > +#include <linux/seq_file.h>
> > +#include <linux/slab.h>
> > +#include <linux/uaccess.h>
> > +
> > +#include <drm/drm_crtc.h>
> > +#include <drm/drm_device.h>
> > +#include <drm/drm_drv.h>
> > +#include <drm/drm_file.h>
> > +#include <drm/drm_framebuffer.h>
> > +#include <drm/drm_fb_helper.h>
> > +#include <drm/drm_plane.h>
> > +#include <drm/drm_print.h>
> > +#include <drm/drmP.h>
> > +
> > +#include "drm_internal.h"
> > +
> > +/*
> > + * The log lines in an ARM stack dump are 92 characters long
> > + * and 120 is a nice multiple for HD and 4K.
> > + */
> > +#define DRM_PANIC_COLUMN_WIDTH	120
> 
> I think we should compute this dynamically from the actual fb width and
> font witdth.
> 
> > +
> > +struct drm_panic_ctx {
> > +	struct drm_framebuffer *fb;
> > +	unsigned int width;
> > +	unsigned int height;
> > +	void *vmap;
> > +
> > +	const struct font_desc *font;
> > +	unsigned int col_width;
> > +	unsigned int bottom_y;
> > +	size_t max_line_length;
> > +
> > +	unsigned int x;
> > +	unsigned int y;
> > +};
> > +
> > +static const struct font_desc *drm_panic_font8x8, *drm_panic_font8x16;
> > +
> > +static void drm_panic_draw_xy(struct drm_framebuffer *fb, void *vmap,
> > +			      int x, int y, bool fg)
> > +{
> > +	if (fb->funcs->panic_draw_xy)
> > +		fb->funcs->panic_draw_xy(fb, vmap, x, y, fg);
> > +	else
> > +		drm_framebuffer_panic_draw_xy(fb, vmap, x, y, fg);
> > +}
> > +
> > +static void drm_panic_render_char(struct drm_panic_ctx *ctx,
> > +				  unsigned int offset, char c)
> > +{
> > +	unsigned int h, w, x, y;
> > +	u8 fontline;
> > +
> > +	for (h = 0; h < ctx->font->height; h++) {
> > +		fontline = *(u8 *)(ctx->font->data + c * ctx->font->height + h);
> > +
> > +		for (w = 0; w < ctx->font->width; w++) {
> > +			x = ctx->x + (offset * ctx->font->width) + w;
> > +			y = ctx->y + h;
> > +			drm_panic_draw_xy(ctx->fb, ctx->vmap, x, y,
> > +					  fontline & BIT(7 - w));
> > +		}
> > +	}
> > +}
> > +
> > +static void drm_panic_render_line(struct drm_panic_ctx *ctx,
> > +				  const char *line, size_t len)
> > +{
> > +	unsigned int i;
> > +
> > +	for (i = 0; i < len; i++)
> > +		drm_panic_render_char(ctx, i, line[i]);
> > +
> > +	/* Clear out the rest of the line */
> > +	for (i = len; i < ctx->max_line_length; i++)
> > +		drm_panic_render_char(ctx, i, ' ');
> > +}
> > +
> > +static bool drm_panic_newline(struct drm_panic_ctx *ctx)
> > +{
> > +	if (ctx->x == 0 && ctx->y == 0)
> > +		return false;
> > +	if (ctx->y == 0) {
> > +		ctx->x -= ctx->col_width;
> > +		ctx->y = ctx->bottom_y;
> > +	} else {
> > +		ctx->y -= ctx->font->height;
> > +	}
> > +
> > +	return true;
> > +}
> > +
> > +/* Render from bottom right most column */
> > +static bool drm_panic_render_lines(struct drm_panic_ctx *ctx,
> > +				   const char *str, size_t len)
> > +{
> > +	size_t l, line_length;
> > +	const char *pos;
> > +	int i;
> > +
> > +	while (len) {
> > +		pos = str + len - 1;
> > +
> > +		if (*pos == '\n') {
> > +			len--;
> > +			if (!drm_panic_newline(ctx))
> > +				return false;
> > +			continue;
> > +		}
> > +
> > +		while (pos > str && *(pos - 1) != '\n')
> > +			pos--;
> > +
> > +		line_length = len - (pos - str);
> > +
> > +		if (!line_length || len < line_length) {
> > +			pr_err("%s: Bug! line_length=%zu len=%zu\n",
> > +			       __func__, line_length, len);
> > +			return false;
> > +		}
> > +
> > +		len -= line_length;
> > +
> > +		for (i = DIV_ROUND_UP(line_length, ctx->max_line_length) - 1; i >= 0; i--) {
> > +			l = min(ctx->max_line_length, line_length - i * ctx->max_line_length);
> > +			drm_panic_render_line(ctx, pos + (i * ctx->max_line_length), l);
> > +			if (i && !drm_panic_newline(ctx))
> > +				return false;
> > +		}
> > +	}
> > +
> > +	return true;
> > +}
> > +
> > +static void drm_panic_kmsg_render_screen(struct drm_plane *plane,
> > +					 struct kmsg_dumper *dumper)
> > +{
> > +	struct drm_framebuffer *fb = plane->state->fb;
> > +	bool first_iteration = true;
> > +	struct drm_panic_ctx ctx;
> > +	static char text[1024];
> > +	size_t len;
> > +
> > +	ctx.vmap = fb->funcs->panic_vmap(fb);
> > +
> > +	/* Print some info when testing */
> > +	if (dumper->max_reason == KMSG_DUMP_OOPS) {
> > +		struct drm_format_name_buf format_name;
> > +
> > +		pr_info("%s: [FB:%d] %ux%u format=%s vmap=%p\n",
> > +			__func__, fb->base.id, fb->width, fb->height,
> > +			drm_get_format_name(fb->format->format, &format_name),
> > +			ctx.vmap);
> > +	}
> > +
> > +	if (!ctx.vmap)
> > +		return;
> 
> For some framebuffers it might be useful to not require vmap, but instead
> have some page-by-page kind of access helper. Since preemptively setting
> up a vmap for all the non-continous buffers is a bit much, and we really
> can't set up the vmap in the panic handler.
> 
> Maybe we should call this panic_prepare/panic_finish and
> s/vmap/cookie/, to make it entirely opaque?
> 
> But a bit overengineering perhaps :-)
> 
> > +
> > +	ctx.fb = fb;
> > +	ctx.width = drm_rect_width(&plane->state->src) >> 16;
> > +	ctx.height = drm_rect_height(&plane->state->src) >> 16;
> > +
> > +	/*
> > +	 * TODO:
> > +	 * Find which part of the fb that is visible.
> > +	 * Width and height are zero on vc4
> > +	 */
> > +	if (!ctx.width || !ctx.height) {
> > +		ctx.width = fb->width;
> > +		ctx.height = fb->height;
> > +	}
> > +
> > +	/* Try to fit 50 lines */
> > +	if (ctx.height < 50 * 16 && drm_panic_font8x8)
> > +		ctx.font = drm_panic_font8x8;
> > +	else
> > +		ctx.font = drm_panic_font8x16;
> > +
> > +	ctx.col_width = DRM_PANIC_COLUMN_WIDTH * ctx.font->width;
> > +	ctx.bottom_y = ((ctx.height / ctx.font->height) - 1) * ctx.font->height;
> > +
> > +	if (ctx.width < 2 * ctx.col_width)
> > +		ctx.max_line_length = ctx.width / ctx.font->width;
> > +	else
> > +		ctx.max_line_length = DRM_PANIC_COLUMN_WIDTH - 2; /* border=2 */
> > +
> > +	ctx.x = 0;
> > +	ctx.y = ctx.bottom_y;
> > +	if (ctx.width > ctx.col_width)
> > +		ctx.x = ((ctx.width / ctx.col_width) - 1) * ctx.col_width;
> > +
> > +	pr_debug("%s: font=%s %ux%u max_line_length=%zu (%u,%u)\n",
> > +		 __func__, ctx.font->name, ctx.width, ctx.height,
> > +		 ctx.max_line_length, ctx.x, ctx.y);
> > +
> > +	kmsg_dump_rewind(dumper);
> > +
> > +	/* The latest messages are fetched first */
> > +	while (kmsg_dump_get_buffer(dumper, false, text, sizeof(text), &len)) {
> > +		/* Strip off the very last newline so we start at the bottom */
> > +		if (first_iteration) {
> > +			len--;
> > +			first_iteration = false;
> > +		}
> > +
> > +		if (!drm_panic_render_lines(&ctx, text, len))
> > +			break;
> > +	}
> > +
> > +	if (fb->funcs->panic_vunmap)
> > +		fb->funcs->panic_vunmap(fb, vmap);
> > +}
> > +
> > +static void drm_panic_try_dev(struct drm_device *dev, struct kmsg_dumper *dumper)
> > +{
> > +	struct drm_framebuffer *fb;
> > +	struct drm_plane *plane;
> > +	struct drm_crtc *crtc;
> > +
> > +	if (!drm_core_check_feature(dev, DRIVER_ATOMIC))
> > +		return;
> > +
> > +	if (dumper->max_reason == KMSG_DUMP_OOPS)
> > +		pr_info("%s: %s on minor %d\n", __func__, dev->driver->name,
> > +			dev->primary->index);
> > +
> > +	drm_for_each_crtc(crtc, dev) {
> > +		if (!ww_mutex_trylock(&crtc->mutex.mutex))
> > +			continue;
> > +
> > +		if (!crtc->enabled || !crtc->primary)
> > +			goto crtc_unlock;
> > +
> > +		if (!crtc->state || !crtc->state->active)
> > +			goto crtc_unlock;
> > +
> > +		plane = crtc->primary;
> > +		if (!ww_mutex_trylock(&plane->mutex.mutex))
> > +			goto crtc_unlock;
> > +
> > +		/*
> > +		 * TODO: Should we check plane_state->visible?
> > +		 * It is not set on vc4
> 
> I think we should. We should also check whether that primary is connected
> to the crtc (some hw you can reassign planes freely between crtc, and the
> crtc->primary pointer is only used for compat with legacy ioctl).
> 
> > +		if (!plane->state || !plane->state->visible)
> > +		 */
> > +		if (!plane->state)
> > +			goto plane_unlock;
> > +
> > +		fb = plane->state->fb;
> > +		if (!fb || !fb->funcs->panic_vmap)
> 
> Docs says this is optional, doesn't seem to be all that optional. I'd
> check for this or a driver-specific ->panic_draw_xy instead.
> > +			goto plane_unlock;
> > +
> > +		/*
> > +		 * fbcon puts out panic messages so stay away to avoid jumbled
> > +		 * output. If vc->vc_mode != KD_TEXT fbcon won't put out
> > +		 * messages (see vt_console_print).
> > +		 */
> > +		if (dev->fb_helper && dev->fb_helper->fb == fb)

This is a bit a layering violation. Could we instead tell fbcon that it
shouldn't do panic handling for drm drivers? I think that would resolve
this overlap here in a much cleaner way. drm fbdev helpers already have
lots of oops_in_progress checks all over to make sure fbcon doesn't do
anything bad. That only leaves the actual rendering, which I think we can
stop too with a simple flag.

Ofc only for atomic drivers which have this panic handling mode here
implemented.
-Daniel

> > +			goto plane_unlock;
> > +
> > +		drm_panic_kmsg_render_screen(plane, dumper);
> > +plane_unlock:
> > +		ww_mutex_unlock(&plane->mutex.mutex);
> > +crtc_unlock:
> > +		ww_mutex_unlock(&crtc->mutex.mutex);
> > +	}
> > +}
> > +
> > +static int drm_panic_dev_iter(struct device *dev, void *data)
> > +{
> > +	struct drm_minor *minor = dev_get_drvdata(dev);
> > +
> > +	if (minor && minor->type == DRM_MINOR_PRIMARY)
> > +		drm_panic_try_dev(minor->dev, data);
> > +
> > +	return 0;
> > +}
> > +
> > +static void drm_panic_kmsg_dump(struct kmsg_dumper *dumper,
> > +				enum kmsg_dump_reason reason)
> > +{
> > +	class_for_each_device(drm_class, NULL, dumper, drm_panic_dev_iter);
> 
> class_for_each_device uses klist, which only uses an irqsave spinlock. I
> think that's good enough. Comment to that effect would be good e.g.
> 
> 	/* based on klist, which uses only a spin_lock_irqsave, which we
> 	 * assume still works */
> 
> If we aim for perfect this should be a trylock still, maybe using our own
> device list.
> 
> > +}
> > +
> > +static struct kmsg_dumper drm_panic_kmsg_dumper = {
> > +	.dump = drm_panic_kmsg_dump,
> > +	.max_reason = KMSG_DUMP_PANIC,
> > +};
> > +
> > +static ssize_t drm_panic_file_panic_write(struct file *file,
> > +					  const char __user *user_buf,
> > +					  size_t count, loff_t *ppos)
> > +{
> > +	unsigned long long val;
> > +	char buf[24];
> > +	size_t size;
> > +	ssize_t ret;
> > +
> > +	size = min(sizeof(buf) - 1, count);
> > +	if (copy_from_user(buf, user_buf, size))
> > +		return -EFAULT;
> > +
> > +	buf[size] = '\0';
> > +	ret = kstrtoull(buf, 0, &val);
> > +	if (ret)
> > +		return ret;
> > +
> > +	drm_panic_kmsg_dumper.max_reason = KMSG_DUMP_OOPS;
> > +	wmb();
> > +
> > +	/* Do a real test with: echo c > /proc/sysrq-trigger */
> > +
> > +	if (val == 0) {
> > +		pr_info("Test panic screen using kmsg_dump(OOPS)\n");
> > +		kmsg_dump(KMSG_DUMP_OOPS);
> > +	} else if (val == 1) {
> > +		char *null_pointer = NULL;
> > +
> > +		pr_info("Test panic screen using NULL pointer dereference\n");
> > +		*null_pointer = 1;
> > +	} else {
> > +		return -EINVAL;
> > +	}
> 
> This isn't quite what I had in mind, since it still kills the kernel (like
> sysrq-trigger). Instead what I had in mind is to recreate the worst
> possible panic context as much as feasible (disabling interrupts should be
> a good start, maybe we can even do an nmi callback), and then call our
> panic implementation. That way we can test the panic handler in a
> non-destructive way (i.e. aside from last dmesg lines printed to the
> screen nothing bad happens to the kernel: No real panic, no oops, no
> tainting).
> 
> And igt testcase could then exercise this, and CI run it.
> 
> > +
> > +	return count;
> > +}
> > +
> > +static const struct file_operations drm_panic_panic_ops = {
> > +	.write =        drm_panic_file_panic_write,
> > +	.open =         simple_open,
> > +	.llseek =       default_llseek,
> > +};
> > +
> > +static struct dentry *drm_panic_d_panic;
> > +
> > +void __init drm_panic_init(struct dentry *debugfs_root)
> > +{
> > +	drm_panic_font8x8 = find_font("VGA8x8");
> > +	drm_panic_font8x16 = find_font("VGA8x16");
> > +	if (!drm_panic_font8x16) {
> > +		DRM_WARN("Couldn't find font, panic screen disabled\n");
> > +		return;
> > +	}
> > +
> > +	drm_panic_d_panic = debugfs_create_file("panic-test", 0200,
> > +						debugfs_root, NULL,
> > +						&drm_panic_panic_ops);
> > +	kmsg_dump_register(&drm_panic_kmsg_dumper);
> > +}
> > +
> > +void drm_panic_exit(struct dentry *debugfs_root)
> > +{
> > +	kmsg_dump_unregister(&drm_panic_kmsg_dumper);
> > +	debugfs_remove(drm_panic_d_panic);
> > +}
> > diff --git a/include/drm/drm_framebuffer.h b/include/drm/drm_framebuffer.h
> > index f0b34c977ec5..f3274798ecfe 100644
> > --- a/include/drm/drm_framebuffer.h
> > +++ b/include/drm/drm_framebuffer.h
> > @@ -94,6 +94,44 @@ struct drm_framebuffer_funcs {
> >  		     struct drm_file *file_priv, unsigned flags,
> >  		     unsigned color, struct drm_clip_rect *clips,
> >  		     unsigned num_clips);
> > +
> > +	/**
> > +	 * @panic_vmap:
> > +	 *
> > +	 * Optional callback for panic handling.
> > +	 *
> > +	 * For vmapping the selected framebuffer in a panic context. Must
> > +	 * be super careful about locking (only trylocking allowed).
> > +	 *
> > +	 * RETURNS:
> > +	 *
> > +	 * NULL if it didn't work out, otherwise an opaque cookie which is
> > +	 * passed to @panic_draw_xy. It can be anything: vmap area, structure
> > +	 * with more details, just a few flags, ...
> > +	 */
> > +	void *(*panic_vmap)(struct drm_framebuffer *fb);
> > +
> > +	/**
> > +	 * @panic_vunmap:
> > +	 *
> > +	 * Optional callback for cleaning up after panic testing.
> > +	 *
> > +	 * Crtc and plane locks are released after this callback has run.
> > +	 * vmap is the cookie returned by @panic_vmap.
> > +	 */
> > +	void (*panic_vunmap)(struct drm_framebuffer *fb, void *vmap);
> > +
> > +	/**
> > +	 * @panic_draw_xy:
> > +	 *
> > +	 * Optional callback for drawing pixels during panic.
> > +	 *
> > +	 * For drawing pixels onto a framebuffer prepared with @panic_vmap.
> > +	 * vmap is the cookie returned by @panic_vmap.
> > +	 * If it's not set, drm_framebuffer_panic_draw_xy() is used.
> > +	 */
> > +	void (*panic_draw_xy)(struct drm_framebuffer *fb, void *vmap,
> > +			      int x, int y, bool foreground);
> >  };
> >  
> >  /**
> > @@ -293,4 +331,7 @@ int drm_framebuffer_plane_width(int width,
> >  int drm_framebuffer_plane_height(int height,
> >  				 const struct drm_framebuffer *fb, int plane);
> >  
> > +void drm_framebuffer_panic_draw_xy(struct drm_framebuffer *fb, void *vmap,
> > +				   int x, int y, bool foreground);
> > +
> >  #endif
> > -- 
> > 2.20.1
> > 
> 
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

Sam Ravnborg March 11, 2019, 7:55 p.m. UTC | #3

Hi Noralf.

Nice!

> +++ b/drivers/gpu/drm/drm_panic.c
> @@ -0,0 +1,363 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright 2018 Noralf Trønnes
> +
> +#include <linux/debugfs.h>
> +#include <linux/font.h>
> +#include <linux/kernel.h>
> +#include <linux/kmsg_dump.h>
> +#include <linux/seq_file.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +
> +#include <drm/drm_crtc.h>
> +#include <drm/drm_device.h>
> +#include <drm/drm_drv.h>
> +#include <drm/drm_file.h>
> +#include <drm/drm_framebuffer.h>
> +#include <drm/drm_fb_helper.h>
> +#include <drm/drm_plane.h>
> +#include <drm/drm_print.h>
> +#include <drm/drmP.h>

Please do not use drmP.h in new stuff.
And res-sort the list of include files when add it.

I did not read much further, so no further pedantic comments in this round.

	Sam

Noralf Trønnes March 11, 2019, 10:33 p.m. UTC | #4

Den 11.03.2019 20.23, skrev Daniel Vetter:
> On Mon, Mar 11, 2019 at 06:42:16PM +0100, Noralf Trønnes wrote:
>> This adds support for outputting kernel messages on panic().
>> A kernel message dumper is used to dump the log. The dumper iterates
>> over each DRM device and it's crtc's to find suitable framebuffers.
>>
>> All the other dumpers are run before this one except mtdoops.
>> Only atomic drivers are supported.
>>
>> Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
> 
> Bunch of comments/ideas for you or Darwish below, whoever picks this up.

Actually it would ne nice if Darwish could pick it up since he will do
it on i915 which will be useful to a much broader audience.
If not I'll respin when I'm done with the drm_fb_helper refactoring.

<snip>

>> diff --git a/drivers/gpu/drm/drm_panic.c b/drivers/gpu/drm/drm_panic.c
>> new file mode 100644
>> index 000000000000..4bca7f0bc369
>> --- /dev/null
>> +++ b/drivers/gpu/drm/drm_panic.c
>> @@ -0,0 +1,363 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +// Copyright 2018 Noralf Trønnes
>> +
>> +#include <linux/debugfs.h>
>> +#include <linux/font.h>
>> +#include <linux/kernel.h>
>> +#include <linux/kmsg_dump.h>
>> +#include <linux/seq_file.h>
>> +#include <linux/slab.h>
>> +#include <linux/uaccess.h>
>> +
>> +#include <drm/drm_crtc.h>
>> +#include <drm/drm_device.h>
>> +#include <drm/drm_drv.h>
>> +#include <drm/drm_file.h>
>> +#include <drm/drm_framebuffer.h>
>> +#include <drm/drm_fb_helper.h>
>> +#include <drm/drm_plane.h>
>> +#include <drm/drm_print.h>
>> +#include <drm/drmP.h>
>> +
>> +#include "drm_internal.h"
>> +
>> +/*
>> + * The log lines in an ARM stack dump are 92 characters long
>> + * and 120 is a nice multiple for HD and 4K.
>> + */
>> +#define DRM_PANIC_COLUMN_WIDTH	120
> 
> I think we should compute this dynamically from the actual fb width and
> font witdth.

The font is picked based on whether it can fit 50 lines of text.

The next rule is to decide when to add another column. I decided to use
a number that gets most log lines in one line on the screen. If most
lines get broken into two lines, there's not much use in an extra column.

Do you have another idea on how to do this?

(There's even a 16x32 font now that wasn't available when I did this.)

> 
>> +
>> +struct drm_panic_ctx {
>> +	struct drm_framebuffer *fb;
>> +	unsigned int width;
>> +	unsigned int height;
>> +	void *vmap;
>> +
>> +	const struct font_desc *font;
>> +	unsigned int col_width;
>> +	unsigned int bottom_y;
>> +	size_t max_line_length;
>> +
>> +	unsigned int x;
>> +	unsigned int y;
>> +};
>> +
>> +static const struct font_desc *drm_panic_font8x8, *drm_panic_font8x16;
>> +
>> +static void drm_panic_draw_xy(struct drm_framebuffer *fb, void *vmap,
>> +			      int x, int y, bool fg)
>> +{
>> +	if (fb->funcs->panic_draw_xy)
>> +		fb->funcs->panic_draw_xy(fb, vmap, x, y, fg);
>> +	else
>> +		drm_framebuffer_panic_draw_xy(fb, vmap, x, y, fg);
>> +}
>> +
>> +static void drm_panic_render_char(struct drm_panic_ctx *ctx,
>> +				  unsigned int offset, char c)
>> +{
>> +	unsigned int h, w, x, y;
>> +	u8 fontline;
>> +
>> +	for (h = 0; h < ctx->font->height; h++) {
>> +		fontline = *(u8 *)(ctx->font->data + c * ctx->font->height + h);
>> +
>> +		for (w = 0; w < ctx->font->width; w++) {
>> +			x = ctx->x + (offset * ctx->font->width) + w;
>> +			y = ctx->y + h;
>> +			drm_panic_draw_xy(ctx->fb, ctx->vmap, x, y,
>> +					  fontline & BIT(7 - w));
>> +		}
>> +	}
>> +}
>> +
>> +static void drm_panic_render_line(struct drm_panic_ctx *ctx,
>> +				  const char *line, size_t len)
>> +{
>> +	unsigned int i;
>> +
>> +	for (i = 0; i < len; i++)
>> +		drm_panic_render_char(ctx, i, line[i]);
>> +
>> +	/* Clear out the rest of the line */
>> +	for (i = len; i < ctx->max_line_length; i++)
>> +		drm_panic_render_char(ctx, i, ' ');
>> +}
>> +
>> +static bool drm_panic_newline(struct drm_panic_ctx *ctx)
>> +{
>> +	if (ctx->x == 0 && ctx->y == 0)
>> +		return false;
>> +	if (ctx->y == 0) {
>> +		ctx->x -= ctx->col_width;
>> +		ctx->y = ctx->bottom_y;
>> +	} else {
>> +		ctx->y -= ctx->font->height;
>> +	}
>> +
>> +	return true;
>> +}
>> +
>> +/* Render from bottom right most column */
>> +static bool drm_panic_render_lines(struct drm_panic_ctx *ctx,
>> +				   const char *str, size_t len)
>> +{
>> +	size_t l, line_length;
>> +	const char *pos;
>> +	int i;
>> +
>> +	while (len) {
>> +		pos = str + len - 1;
>> +
>> +		if (*pos == '\n') {
>> +			len--;
>> +			if (!drm_panic_newline(ctx))
>> +				return false;
>> +			continue;
>> +		}
>> +
>> +		while (pos > str && *(pos - 1) != '\n')
>> +			pos--;
>> +
>> +		line_length = len - (pos - str);
>> +
>> +		if (!line_length || len < line_length) {
>> +			pr_err("%s: Bug! line_length=%zu len=%zu\n",
>> +			       __func__, line_length, len);
>> +			return false;
>> +		}
>> +
>> +		len -= line_length;
>> +
>> +		for (i = DIV_ROUND_UP(line_length, ctx->max_line_length) - 1; i >= 0; i--) {
>> +			l = min(ctx->max_line_length, line_length - i * ctx->max_line_length);
>> +			drm_panic_render_line(ctx, pos + (i * ctx->max_line_length), l);
>> +			if (i && !drm_panic_newline(ctx))
>> +				return false;
>> +		}
>> +	}
>> +
>> +	return true;
>> +}
>> +
>> +static void drm_panic_kmsg_render_screen(struct drm_plane *plane,
>> +					 struct kmsg_dumper *dumper)
>> +{
>> +	struct drm_framebuffer *fb = plane->state->fb;
>> +	bool first_iteration = true;
>> +	struct drm_panic_ctx ctx;
>> +	static char text[1024];
>> +	size_t len;
>> +
>> +	ctx.vmap = fb->funcs->panic_vmap(fb);
>> +
>> +	/* Print some info when testing */
>> +	if (dumper->max_reason == KMSG_DUMP_OOPS) {
>> +		struct drm_format_name_buf format_name;
>> +
>> +		pr_info("%s: [FB:%d] %ux%u format=%s vmap=%p\n",
>> +			__func__, fb->base.id, fb->width, fb->height,
>> +			drm_get_format_name(fb->format->format, &format_name),
>> +			ctx.vmap);
>> +	}
>> +
>> +	if (!ctx.vmap)
>> +		return;
> 
> For some framebuffers it might be useful to not require vmap, but instead
> have some page-by-page kind of access helper. Since preemptively setting
> up a vmap for all the non-continous buffers is a bit much, and we really
> can't set up the vmap in the panic handler.
> 
> Maybe we should call this panic_prepare/panic_finish and
> s/vmap/cookie/, to make it entirely opaque?
> 
> But a bit overengineering perhaps :-)
> 

I guess i915 wants something else than vmap, but I have no idea how a
page-by-page solution is to be implemented.

When we get bootsplash, we will at least have a kernel address during
that phase for all drivers supporting it.

>> +
>> +	ctx.fb = fb;
>> +	ctx.width = drm_rect_width(&plane->state->src) >> 16;
>> +	ctx.height = drm_rect_height(&plane->state->src) >> 16;
>> +
>> +	/*
>> +	 * TODO:
>> +	 * Find which part of the fb that is visible.
>> +	 * Width and height are zero on vc4
>> +	 */
>> +	if (!ctx.width || !ctx.height) {
>> +		ctx.width = fb->width;
>> +		ctx.height = fb->height;
>> +	}
>> +
>> +	/* Try to fit 50 lines */
>> +	if (ctx.height < 50 * 16 && drm_panic_font8x8)
>> +		ctx.font = drm_panic_font8x8;
>> +	else
>> +		ctx.font = drm_panic_font8x16;
>> +
>> +	ctx.col_width = DRM_PANIC_COLUMN_WIDTH * ctx.font->width;
>> +	ctx.bottom_y = ((ctx.height / ctx.font->height) - 1) * ctx.font->height;
>> +
>> +	if (ctx.width < 2 * ctx.col_width)
>> +		ctx.max_line_length = ctx.width / ctx.font->width;
>> +	else
>> +		ctx.max_line_length = DRM_PANIC_COLUMN_WIDTH - 2; /* border=2 */
>> +
>> +	ctx.x = 0;
>> +	ctx.y = ctx.bottom_y;
>> +	if (ctx.width > ctx.col_width)
>> +		ctx.x = ((ctx.width / ctx.col_width) - 1) * ctx.col_width;
>> +
>> +	pr_debug("%s: font=%s %ux%u max_line_length=%zu (%u,%u)\n",
>> +		 __func__, ctx.font->name, ctx.width, ctx.height,
>> +		 ctx.max_line_length, ctx.x, ctx.y);
>> +
>> +	kmsg_dump_rewind(dumper);
>> +
>> +	/* The latest messages are fetched first */
>> +	while (kmsg_dump_get_buffer(dumper, false, text, sizeof(text), &len)) {
>> +		/* Strip off the very last newline so we start at the bottom */
>> +		if (first_iteration) {
>> +			len--;
>> +			first_iteration = false;
>> +		}
>> +
>> +		if (!drm_panic_render_lines(&ctx, text, len))
>> +			break;
>> +	}
>> +
>> +	if (fb->funcs->panic_vunmap)
>> +		fb->funcs->panic_vunmap(fb, vmap);
>> +}
>> +
>> +static void drm_panic_try_dev(struct drm_device *dev, struct kmsg_dumper *dumper)
>> +{
>> +	struct drm_framebuffer *fb;
>> +	struct drm_plane *plane;
>> +	struct drm_crtc *crtc;
>> +
>> +	if (!drm_core_check_feature(dev, DRIVER_ATOMIC))
>> +		return;
>> +
>> +	if (dumper->max_reason == KMSG_DUMP_OOPS)
>> +		pr_info("%s: %s on minor %d\n", __func__, dev->driver->name,
>> +			dev->primary->index);
>> +
>> +	drm_for_each_crtc(crtc, dev) {
>> +		if (!ww_mutex_trylock(&crtc->mutex.mutex))
>> +			continue;
>> +
>> +		if (!crtc->enabled || !crtc->primary)
>> +			goto crtc_unlock;
>> +
>> +		if (!crtc->state || !crtc->state->active)
>> +			goto crtc_unlock;
>> +
>> +		plane = crtc->primary;
>> +		if (!ww_mutex_trylock(&plane->mutex.mutex))
>> +			goto crtc_unlock;
>> +
>> +		/*
>> +		 * TODO: Should we check plane_state->visible?
>> +		 * It is not set on vc4
> 
> I think we should. 

I need to recheck vc4, because when I did this a year ago ->visible was
not set on vc4.

> We should also check whether that primary is connected
> to the crtc (some hw you can reassign planes freely between crtc, and the
> crtc->primary pointer is only used for compat with legacy ioctl).

Like this?
		/* Check that the plane has not been reassigned */
		if (plane->state->crtc != crtc)
			goto plane_unlock;

> 
>> +		if (!plane->state || !plane->state->visible)
>> +		 */
>> +		if (!plane->state)
>> +			goto plane_unlock;
>> +
>> +		fb = plane->state->fb;
>> +		if (!fb || !fb->funcs->panic_vmap)
> 
> Docs says this is optional, doesn't seem to be all that optional. I'd
> check for this or a driver-specific ->panic_draw_xy instead.

->panic_vmap determines whether panic is supported or not, so in that
sense it's optional. ->panic_draw_xy is optional with a default fallback
for the linear case.

>> +			goto plane_unlock;
>> +
>> +		/*
>> +		 * fbcon puts out panic messages so stay away to avoid jumbled
>> +		 * output. If vc->vc_mode != KD_TEXT fbcon won't put out
>> +		 * messages (see vt_console_print).
>> +		 */
>> +		if (dev->fb_helper && dev->fb_helper->fb == fb)
>> +			goto plane_unlock;
>> +
>> +		drm_panic_kmsg_render_screen(plane, dumper);
>> +plane_unlock:
>> +		ww_mutex_unlock(&plane->mutex.mutex);
>> +crtc_unlock:
>> +		ww_mutex_unlock(&crtc->mutex.mutex);
>> +	}
>> +}
>> +
>> +static int drm_panic_dev_iter(struct device *dev, void *data)
>> +{
>> +	struct drm_minor *minor = dev_get_drvdata(dev);
>> +
>> +	if (minor && minor->type == DRM_MINOR_PRIMARY)
>> +		drm_panic_try_dev(minor->dev, data);
>> +
>> +	return 0;
>> +}
>> +
>> +static void drm_panic_kmsg_dump(struct kmsg_dumper *dumper,
>> +				enum kmsg_dump_reason reason)
>> +{
>> +	class_for_each_device(drm_class, NULL, dumper, drm_panic_dev_iter);
> 
> class_for_each_device uses klist, which only uses an irqsave spinlock. I
> think that's good enough. Comment to that effect would be good e.g.
> 
> 	/* based on klist, which uses only a spin_lock_irqsave, which we
> 	 * assume still works */
> 
> If we aim for perfect this should be a trylock still, maybe using our own
> device list.
> 
>> +}
>> +
>> +static struct kmsg_dumper drm_panic_kmsg_dumper = {
>> +	.dump = drm_panic_kmsg_dump,
>> +	.max_reason = KMSG_DUMP_PANIC,
>> +};
>> +
>> +static ssize_t drm_panic_file_panic_write(struct file *file,
>> +					  const char __user *user_buf,
>> +					  size_t count, loff_t *ppos)
>> +{
>> +	unsigned long long val;
>> +	char buf[24];
>> +	size_t size;
>> +	ssize_t ret;
>> +
>> +	size = min(sizeof(buf) - 1, count);
>> +	if (copy_from_user(buf, user_buf, size))
>> +		return -EFAULT;
>> +
>> +	buf[size] = '\0';
>> +	ret = kstrtoull(buf, 0, &val);
>> +	if (ret)
>> +		return ret;
>> +
>> +	drm_panic_kmsg_dumper.max_reason = KMSG_DUMP_OOPS;
>> +	wmb();
>> +
>> +	/* Do a real test with: echo c > /proc/sysrq-trigger */
>> +
>> +	if (val == 0) {
>> +		pr_info("Test panic screen using kmsg_dump(OOPS)\n");
>> +		kmsg_dump(KMSG_DUMP_OOPS);
>> +	} else if (val == 1) {
>> +		char *null_pointer = NULL;
>> +
>> +		pr_info("Test panic screen using NULL pointer dereference\n");
>> +		*null_pointer = 1;
>> +	} else {
>> +		return -EINVAL;
>> +	}
> 
> This isn't quite what I had in mind, since it still kills the kernel (like
> sysrq-trigger). 

If val == 0, it doesn't kill the kernel, it only dumps the kernel log.
And it doesn't taint the kernel either.

> Instead what I had in mind is to recreate the worst
> possible panic context as much as feasible (disabling interrupts should be
> a good start, maybe we can even do an nmi callback), and then call our
> panic implementation. That way we can test the panic handler in a
> non-destructive way (i.e. aside from last dmesg lines printed to the
> screen nothing bad happens to the kernel: No real panic, no oops, no
> tainting).

The interrupt case I can do, nmi I have no idea.

Noralf.

> 
> And igt testcase could then exercise this, and CI run it.
> 
>> +
>> +	return count;
>> +}
>> +
>> +static const struct file_operations drm_panic_panic_ops = {
>> +	.write =        drm_panic_file_panic_write,
>> +	.open =         simple_open,
>> +	.llseek =       default_llseek,
>> +};
>> +
>> +static struct dentry *drm_panic_d_panic;
>> +
>> +void __init drm_panic_init(struct dentry *debugfs_root)
>> +{
>> +	drm_panic_font8x8 = find_font("VGA8x8");
>> +	drm_panic_font8x16 = find_font("VGA8x16");
>> +	if (!drm_panic_font8x16) {
>> +		DRM_WARN("Couldn't find font, panic screen disabled\n");
>> +		return;
>> +	}
>> +
>> +	drm_panic_d_panic = debugfs_create_file("panic-test", 0200,
>> +						debugfs_root, NULL,
>> +						&drm_panic_panic_ops);
>> +	kmsg_dump_register(&drm_panic_kmsg_dumper);
>> +}
>> +
>> +void drm_panic_exit(struct dentry *debugfs_root)
>> +{
>> +	kmsg_dump_unregister(&drm_panic_kmsg_dumper);
>> +	debugfs_remove(drm_panic_d_panic);
>> +}
>> diff --git a/include/drm/drm_framebuffer.h b/include/drm/drm_framebuffer.h
>> index f0b34c977ec5..f3274798ecfe 100644
>> --- a/include/drm/drm_framebuffer.h
>> +++ b/include/drm/drm_framebuffer.h
>> @@ -94,6 +94,44 @@ struct drm_framebuffer_funcs {
>>  		     struct drm_file *file_priv, unsigned flags,
>>  		     unsigned color, struct drm_clip_rect *clips,
>>  		     unsigned num_clips);
>> +
>> +	/**
>> +	 * @panic_vmap:
>> +	 *
>> +	 * Optional callback for panic handling.
>> +	 *
>> +	 * For vmapping the selected framebuffer in a panic context. Must
>> +	 * be super careful about locking (only trylocking allowed).
>> +	 *
>> +	 * RETURNS:
>> +	 *
>> +	 * NULL if it didn't work out, otherwise an opaque cookie which is
>> +	 * passed to @panic_draw_xy. It can be anything: vmap area, structure
>> +	 * with more details, just a few flags, ...
>> +	 */
>> +	void *(*panic_vmap)(struct drm_framebuffer *fb);
>> +
>> +	/**
>> +	 * @panic_vunmap:
>> +	 *
>> +	 * Optional callback for cleaning up after panic testing.
>> +	 *
>> +	 * Crtc and plane locks are released after this callback has run.
>> +	 * vmap is the cookie returned by @panic_vmap.
>> +	 */
>> +	void (*panic_vunmap)(struct drm_framebuffer *fb, void *vmap);
>> +
>> +	/**
>> +	 * @panic_draw_xy:
>> +	 *
>> +	 * Optional callback for drawing pixels during panic.
>> +	 *
>> +	 * For drawing pixels onto a framebuffer prepared with @panic_vmap.
>> +	 * vmap is the cookie returned by @panic_vmap.
>> +	 * If it's not set, drm_framebuffer_panic_draw_xy() is used.
>> +	 */
>> +	void (*panic_draw_xy)(struct drm_framebuffer *fb, void *vmap,
>> +			      int x, int y, bool foreground);
>>  };
>>  
>>  /**
>> @@ -293,4 +331,7 @@ int drm_framebuffer_plane_width(int width,
>>  int drm_framebuffer_plane_height(int height,
>>  				 const struct drm_framebuffer *fb, int plane);
>>  
>> +void drm_framebuffer_panic_draw_xy(struct drm_framebuffer *fb, void *vmap,
>> +				   int x, int y, bool foreground);
>> +
>>  #endif
>> -- 
>> 2.20.1
>>
>

Noralf Trønnes March 11, 2019, 10:40 p.m. UTC | #5

Den 11.03.2019 20.29, skrev Daniel Vetter:
> On Mon, Mar 11, 2019 at 08:23:38PM +0100, Daniel Vetter wrote:
>> On Mon, Mar 11, 2019 at 06:42:16PM +0100, Noralf Trønnes wrote:
>>> This adds support for outputting kernel messages on panic().
>>> A kernel message dumper is used to dump the log. The dumper iterates
>>> over each DRM device and it's crtc's to find suitable framebuffers.
>>>
>>> All the other dumpers are run before this one except mtdoops.
>>> Only atomic drivers are supported.
>>>
>>> Signed-off-by: Noralf Trønnes <noralf@tronnes.org>

<snip>

>>> +static void drm_panic_try_dev(struct drm_device *dev, struct kmsg_dumper *dumper)
>>> +{
>>> +	struct drm_framebuffer *fb;
>>> +	struct drm_plane *plane;
>>> +	struct drm_crtc *crtc;
>>> +
>>> +	if (!drm_core_check_feature(dev, DRIVER_ATOMIC))
>>> +		return;
>>> +
>>> +	if (dumper->max_reason == KMSG_DUMP_OOPS)
>>> +		pr_info("%s: %s on minor %d\n", __func__, dev->driver->name,
>>> +			dev->primary->index);
>>> +
>>> +	drm_for_each_crtc(crtc, dev) {
>>> +		if (!ww_mutex_trylock(&crtc->mutex.mutex))
>>> +			continue;
>>> +
>>> +		if (!crtc->enabled || !crtc->primary)
>>> +			goto crtc_unlock;
>>> +
>>> +		if (!crtc->state || !crtc->state->active)
>>> +			goto crtc_unlock;
>>> +
>>> +		plane = crtc->primary;
>>> +		if (!ww_mutex_trylock(&plane->mutex.mutex))
>>> +			goto crtc_unlock;
>>> +
>>> +		/*
>>> +		 * TODO: Should we check plane_state->visible?
>>> +		 * It is not set on vc4
>>
>> I think we should. We should also check whether that primary is connected
>> to the crtc (some hw you can reassign planes freely between crtc, and the
>> crtc->primary pointer is only used for compat with legacy ioctl).
>>
>>> +		if (!plane->state || !plane->state->visible)
>>> +		 */
>>> +		if (!plane->state)
>>> +			goto plane_unlock;
>>> +
>>> +		fb = plane->state->fb;
>>> +		if (!fb || !fb->funcs->panic_vmap)
>>
>> Docs says this is optional, doesn't seem to be all that optional. I'd
>> check for this or a driver-specific ->panic_draw_xy instead.
>>> +			goto plane_unlock;
>>> +
>>> +		/*
>>> +		 * fbcon puts out panic messages so stay away to avoid jumbled
>>> +		 * output. If vc->vc_mode != KD_TEXT fbcon won't put out
>>> +		 * messages (see vt_console_print).
>>> +		 */
>>> +		if (dev->fb_helper && dev->fb_helper->fb == fb)
> 
> This is a bit a layering violation. Could we instead tell fbcon that it
> shouldn't do panic handling for drm drivers? I think that would resolve
> this overlap here in a much cleaner way. drm fbdev helpers already have
> lots of oops_in_progress checks all over to make sure fbcon doesn't do
> anything bad. That only leaves the actual rendering, which I think we can
> stop too with a simple flag.
> 
> Ofc only for atomic drivers which have this panic handling mode here
> implemented.

There used to be a fbdev flag FBINFO_CAN_FORCE_OUTPUT that controlled
vc->vc_panic_force_write, but it's gone now I see.

Noralf.

> -Daniel
> 
>>> +			goto plane_unlock;
>>> +
>>> +		drm_panic_kmsg_render_screen(plane, dumper);
>>> +plane_unlock:
>>> +		ww_mutex_unlock(&plane->mutex.mutex);
>>> +crtc_unlock:
>>> +		ww_mutex_unlock(&crtc->mutex.mutex);
>>> +	}
>>> +}

Daniel Vetter March 12, 2019, 9:53 a.m. UTC | #6

On Mon, Mar 11, 2019 at 11:40:36PM +0100, Noralf Trønnes wrote:
> 
> 
> Den 11.03.2019 20.29, skrev Daniel Vetter:
> > On Mon, Mar 11, 2019 at 08:23:38PM +0100, Daniel Vetter wrote:
> >> On Mon, Mar 11, 2019 at 06:42:16PM +0100, Noralf Trønnes wrote:
> >>> This adds support for outputting kernel messages on panic().
> >>> A kernel message dumper is used to dump the log. The dumper iterates
> >>> over each DRM device and it's crtc's to find suitable framebuffers.
> >>>
> >>> All the other dumpers are run before this one except mtdoops.
> >>> Only atomic drivers are supported.
> >>>
> >>> Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
> 
> <snip>
> 
> >>> +static void drm_panic_try_dev(struct drm_device *dev, struct kmsg_dumper *dumper)
> >>> +{
> >>> +	struct drm_framebuffer *fb;
> >>> +	struct drm_plane *plane;
> >>> +	struct drm_crtc *crtc;
> >>> +
> >>> +	if (!drm_core_check_feature(dev, DRIVER_ATOMIC))
> >>> +		return;
> >>> +
> >>> +	if (dumper->max_reason == KMSG_DUMP_OOPS)
> >>> +		pr_info("%s: %s on minor %d\n", __func__, dev->driver->name,
> >>> +			dev->primary->index);
> >>> +
> >>> +	drm_for_each_crtc(crtc, dev) {
> >>> +		if (!ww_mutex_trylock(&crtc->mutex.mutex))
> >>> +			continue;
> >>> +
> >>> +		if (!crtc->enabled || !crtc->primary)
> >>> +			goto crtc_unlock;
> >>> +
> >>> +		if (!crtc->state || !crtc->state->active)
> >>> +			goto crtc_unlock;
> >>> +
> >>> +		plane = crtc->primary;
> >>> +		if (!ww_mutex_trylock(&plane->mutex.mutex))
> >>> +			goto crtc_unlock;
> >>> +
> >>> +		/*
> >>> +		 * TODO: Should we check plane_state->visible?
> >>> +		 * It is not set on vc4
> >>
> >> I think we should. We should also check whether that primary is connected
> >> to the crtc (some hw you can reassign planes freely between crtc, and the
> >> crtc->primary pointer is only used for compat with legacy ioctl).
> >>
> >>> +		if (!plane->state || !plane->state->visible)
> >>> +		 */
> >>> +		if (!plane->state)
> >>> +			goto plane_unlock;
> >>> +
> >>> +		fb = plane->state->fb;
> >>> +		if (!fb || !fb->funcs->panic_vmap)
> >>
> >> Docs says this is optional, doesn't seem to be all that optional. I'd
> >> check for this or a driver-specific ->panic_draw_xy instead.
> >>> +			goto plane_unlock;
> >>> +
> >>> +		/*
> >>> +		 * fbcon puts out panic messages so stay away to avoid jumbled
> >>> +		 * output. If vc->vc_mode != KD_TEXT fbcon won't put out
> >>> +		 * messages (see vt_console_print).
> >>> +		 */
> >>> +		if (dev->fb_helper && dev->fb_helper->fb == fb)
> > 
> > This is a bit a layering violation. Could we instead tell fbcon that it
> > shouldn't do panic handling for drm drivers? I think that would resolve
> > this overlap here in a much cleaner way. drm fbdev helpers already have
> > lots of oops_in_progress checks all over to make sure fbcon doesn't do
> > anything bad. That only leaves the actual rendering, which I think we can
> > stop too with a simple flag.
> > 
> > Ofc only for atomic drivers which have this panic handling mode here
> > implemented.
> 
> There used to be a fbdev flag FBINFO_CAN_FORCE_OUTPUT that controlled
> vc->vc_panic_force_write, but it's gone now I see.

I killed that :-)

Looking at those patches again, it's not what we wanted. It was used to
force panic display even when not in KD_TEXT mode.

What we want instead here is a flag to tell fbcon/vt to _never_ write
dmesg to our console, not even when the console is in KD_TEXT mode.
Because we have a separate panic handler to display it. Heck maybe that QR
code thing could be resurrected eventually again.

Totally untested snippet below is what I'm thinking of:


diff --git a/drivers/video/fbdev/core/fbcon.c b/drivers/video/fbdev/core/fbcon.c
index 7f4c22a65f63..b08c63286ed0 100644
--- a/drivers/video/fbdev/core/fbcon.c
+++ b/drivers/video/fbdev/core/fbcon.c
@@ -282,6 +282,9 @@ static inline int fbcon_is_inactive(struct vc_data *vc, struct fb_info *info)
 {
 	struct fbcon_ops *ops = info->fbcon_par;
 
+	if (oops_in_progress && info->flags & NO_OOPS_OUTPUT)
+		return false;
+
 	return (info->state != FBINFO_STATE_RUNNING ||
 		vc->vc_mode != KD_TEXT || ops->graphics);
 }


Plus then unconditionally rendering the oops output on the drm side, even
if the current fb is the fbcon one.
-Daniel

Daniel Vetter March 12, 2019, 9:59 a.m. UTC | #7

On Tue, Mar 12, 2019 at 10:53:20AM +0100, Daniel Vetter wrote:
> On Mon, Mar 11, 2019 at 11:40:36PM +0100, Noralf Trønnes wrote:
> > 
> > 
> > Den 11.03.2019 20.29, skrev Daniel Vetter:
> > > On Mon, Mar 11, 2019 at 08:23:38PM +0100, Daniel Vetter wrote:
> > >> On Mon, Mar 11, 2019 at 06:42:16PM +0100, Noralf Trønnes wrote:
> > >>> This adds support for outputting kernel messages on panic().
> > >>> A kernel message dumper is used to dump the log. The dumper iterates
> > >>> over each DRM device and it's crtc's to find suitable framebuffers.
> > >>>
> > >>> All the other dumpers are run before this one except mtdoops.
> > >>> Only atomic drivers are supported.
> > >>>
> > >>> Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
> > 
> > <snip>
> > 
> > >>> +static void drm_panic_try_dev(struct drm_device *dev, struct kmsg_dumper *dumper)
> > >>> +{
> > >>> +	struct drm_framebuffer *fb;
> > >>> +	struct drm_plane *plane;
> > >>> +	struct drm_crtc *crtc;
> > >>> +
> > >>> +	if (!drm_core_check_feature(dev, DRIVER_ATOMIC))
> > >>> +		return;
> > >>> +
> > >>> +	if (dumper->max_reason == KMSG_DUMP_OOPS)
> > >>> +		pr_info("%s: %s on minor %d\n", __func__, dev->driver->name,
> > >>> +			dev->primary->index);
> > >>> +
> > >>> +	drm_for_each_crtc(crtc, dev) {
> > >>> +		if (!ww_mutex_trylock(&crtc->mutex.mutex))
> > >>> +			continue;
> > >>> +
> > >>> +		if (!crtc->enabled || !crtc->primary)
> > >>> +			goto crtc_unlock;
> > >>> +
> > >>> +		if (!crtc->state || !crtc->state->active)
> > >>> +			goto crtc_unlock;
> > >>> +
> > >>> +		plane = crtc->primary;
> > >>> +		if (!ww_mutex_trylock(&plane->mutex.mutex))
> > >>> +			goto crtc_unlock;
> > >>> +
> > >>> +		/*
> > >>> +		 * TODO: Should we check plane_state->visible?
> > >>> +		 * It is not set on vc4
> > >>
> > >> I think we should. We should also check whether that primary is connected
> > >> to the crtc (some hw you can reassign planes freely between crtc, and the
> > >> crtc->primary pointer is only used for compat with legacy ioctl).
> > >>
> > >>> +		if (!plane->state || !plane->state->visible)
> > >>> +		 */
> > >>> +		if (!plane->state)
> > >>> +			goto plane_unlock;
> > >>> +
> > >>> +		fb = plane->state->fb;
> > >>> +		if (!fb || !fb->funcs->panic_vmap)
> > >>
> > >> Docs says this is optional, doesn't seem to be all that optional. I'd
> > >> check for this or a driver-specific ->panic_draw_xy instead.
> > >>> +			goto plane_unlock;
> > >>> +
> > >>> +		/*
> > >>> +		 * fbcon puts out panic messages so stay away to avoid jumbled
> > >>> +		 * output. If vc->vc_mode != KD_TEXT fbcon won't put out
> > >>> +		 * messages (see vt_console_print).
> > >>> +		 */
> > >>> +		if (dev->fb_helper && dev->fb_helper->fb == fb)
> > > 
> > > This is a bit a layering violation. Could we instead tell fbcon that it
> > > shouldn't do panic handling for drm drivers? I think that would resolve
> > > this overlap here in a much cleaner way. drm fbdev helpers already have
> > > lots of oops_in_progress checks all over to make sure fbcon doesn't do
> > > anything bad. That only leaves the actual rendering, which I think we can
> > > stop too with a simple flag.
> > > 
> > > Ofc only for atomic drivers which have this panic handling mode here
> > > implemented.
> > 
> > There used to be a fbdev flag FBINFO_CAN_FORCE_OUTPUT that controlled
> > vc->vc_panic_force_write, but it's gone now I see.
> 
> I killed that :-)
> 
> Looking at those patches again, it's not what we wanted. It was used to
> force panic display even when not in KD_TEXT mode.
> 
> What we want instead here is a flag to tell fbcon/vt to _never_ write
> dmesg to our console, not even when the console is in KD_TEXT mode.
> Because we have a separate panic handler to display it. Heck maybe that QR
> code thing could be resurrected eventually again.
> 
> Totally untested snippet below is what I'm thinking of:
> 
> 
> diff --git a/drivers/video/fbdev/core/fbcon.c b/drivers/video/fbdev/core/fbcon.c
> index 7f4c22a65f63..b08c63286ed0 100644
> --- a/drivers/video/fbdev/core/fbcon.c
> +++ b/drivers/video/fbdev/core/fbcon.c
> @@ -282,6 +282,9 @@ static inline int fbcon_is_inactive(struct vc_data *vc, struct fb_info *info)
>  {
>  	struct fbcon_ops *ops = info->fbcon_par;
>  
> +	if (oops_in_progress && info->flags & NO_OOPS_OUTPUT)
> +		return false;
> +
>  	return (info->state != FBINFO_STATE_RUNNING ||
>  		vc->vc_mode != KD_TEXT || ops->graphics);
>  }
> 
> 
> Plus then unconditionally rendering the oops output on the drm side, even
> if the current fb is the fbcon one.

Hm thinking about this some more, I think this would allow us to replace
all the oops_in_progress checks we have in drm_fb_helper.c with
WARN_ON(oops_in_progress). So worthy cleanup on its own.

Assuming I'm seeing this right ofc, very much possible I'm missing
something :-)
-Daniel

Michel Dänzer March 12, 2019, 10:47 a.m. UTC | #8

On 2019-03-11 6:42 p.m., Noralf Trønnes wrote:
> This adds support for outputting kernel messages on panic().
> A kernel message dumper is used to dump the log. The dumper iterates
> over each DRM device and it's crtc's to find suitable framebuffers.
> 
> All the other dumpers are run before this one except mtdoops.
> Only atomic drivers are supported.
> 
> Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
> ---
>  [...]
> 
> diff --git a/include/drm/drm_framebuffer.h b/include/drm/drm_framebuffer.h
> index f0b34c977ec5..f3274798ecfe 100644
> --- a/include/drm/drm_framebuffer.h
> +++ b/include/drm/drm_framebuffer.h
> @@ -94,6 +94,44 @@ struct drm_framebuffer_funcs {
>  		     struct drm_file *file_priv, unsigned flags,
>  		     unsigned color, struct drm_clip_rect *clips,
>  		     unsigned num_clips);
> +
> +	/**
> +	 * @panic_vmap:
> +	 *
> +	 * Optional callback for panic handling.
> +	 *
> +	 * For vmapping the selected framebuffer in a panic context. Must
> +	 * be super careful about locking (only trylocking allowed).
> +	 *
> +	 * RETURNS:
> +	 *
> +	 * NULL if it didn't work out, otherwise an opaque cookie which is
> +	 * passed to @panic_draw_xy. It can be anything: vmap area, structure
> +	 * with more details, just a few flags, ...
> +	 */
> +	void *(*panic_vmap)(struct drm_framebuffer *fb);

FWIW, the panic_vmap hook cannot work in general with the amdgpu/radeon
drivers:

Framebuffers are normally tiled, writing to them with the CPU results in
garbled output.

With a discrete GPU having a large amount of VRAM, the framebuffer may
not be directly CPU accessible at all.


There would need to be a mechanism for switching scanout to a linear,
CPU accessible framebuffer.

Daniel Vetter March 12, 2019, 10:58 a.m. UTC | #9

On Mon, Mar 11, 2019 at 11:33:15PM +0100, Noralf Trønnes wrote:
> 
> 
> Den 11.03.2019 20.23, skrev Daniel Vetter:
> > On Mon, Mar 11, 2019 at 06:42:16PM +0100, Noralf Trønnes wrote:
> >> This adds support for outputting kernel messages on panic().
> >> A kernel message dumper is used to dump the log. The dumper iterates
> >> over each DRM device and it's crtc's to find suitable framebuffers.
> >>
> >> All the other dumpers are run before this one except mtdoops.
> >> Only atomic drivers are supported.
> >>
> >> Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
> > 
> > Bunch of comments/ideas for you or Darwish below, whoever picks this up.
> 
> Actually it would ne nice if Darwish could pick it up since he will do
> it on i915 which will be useful to a much broader audience.
> If not I'll respin when I'm done with the drm_fb_helper refactoring.
> 
> <snip>
> 
> >> diff --git a/drivers/gpu/drm/drm_panic.c b/drivers/gpu/drm/drm_panic.c
> >> new file mode 100644
> >> index 000000000000..4bca7f0bc369
> >> --- /dev/null
> >> +++ b/drivers/gpu/drm/drm_panic.c
> >> @@ -0,0 +1,363 @@
> >> +// SPDX-License-Identifier: GPL-2.0
> >> +// Copyright 2018 Noralf Trønnes
> >> +
> >> +#include <linux/debugfs.h>
> >> +#include <linux/font.h>
> >> +#include <linux/kernel.h>
> >> +#include <linux/kmsg_dump.h>
> >> +#include <linux/seq_file.h>
> >> +#include <linux/slab.h>
> >> +#include <linux/uaccess.h>
> >> +
> >> +#include <drm/drm_crtc.h>
> >> +#include <drm/drm_device.h>
> >> +#include <drm/drm_drv.h>
> >> +#include <drm/drm_file.h>
> >> +#include <drm/drm_framebuffer.h>
> >> +#include <drm/drm_fb_helper.h>
> >> +#include <drm/drm_plane.h>
> >> +#include <drm/drm_print.h>
> >> +#include <drm/drmP.h>
> >> +
> >> +#include "drm_internal.h"
> >> +
> >> +/*
> >> + * The log lines in an ARM stack dump are 92 characters long
> >> + * and 120 is a nice multiple for HD and 4K.
> >> + */
> >> +#define DRM_PANIC_COLUMN_WIDTH	120
> > 
> > I think we should compute this dynamically from the actual fb width and
> > font witdth.
> 
> The font is picked based on whether it can fit 50 lines of text.
> 
> The next rule is to decide when to add another column. I decided to use
> a number that gets most log lines in one line on the screen. If most
> lines get broken into two lines, there's not much use in an extra column.
> 
> Do you have another idea on how to do this?
> 
> (There's even a 16x32 font now that wasn't available when I did this.)

Oh I had no idea that this is for multi-column oops printing. That's
indeed neat for 4k screens.
> 
> > 
> >> +
> >> +struct drm_panic_ctx {
> >> +	struct drm_framebuffer *fb;
> >> +	unsigned int width;
> >> +	unsigned int height;
> >> +	void *vmap;
> >> +
> >> +	const struct font_desc *font;
> >> +	unsigned int col_width;
> >> +	unsigned int bottom_y;
> >> +	size_t max_line_length;
> >> +
> >> +	unsigned int x;
> >> +	unsigned int y;
> >> +};
> >> +
> >> +static const struct font_desc *drm_panic_font8x8, *drm_panic_font8x16;
> >> +
> >> +static void drm_panic_draw_xy(struct drm_framebuffer *fb, void *vmap,
> >> +			      int x, int y, bool fg)
> >> +{
> >> +	if (fb->funcs->panic_draw_xy)
> >> +		fb->funcs->panic_draw_xy(fb, vmap, x, y, fg);
> >> +	else
> >> +		drm_framebuffer_panic_draw_xy(fb, vmap, x, y, fg);
> >> +}
> >> +
> >> +static void drm_panic_render_char(struct drm_panic_ctx *ctx,
> >> +				  unsigned int offset, char c)
> >> +{
> >> +	unsigned int h, w, x, y;
> >> +	u8 fontline;
> >> +
> >> +	for (h = 0; h < ctx->font->height; h++) {
> >> +		fontline = *(u8 *)(ctx->font->data + c * ctx->font->height + h);
> >> +
> >> +		for (w = 0; w < ctx->font->width; w++) {
> >> +			x = ctx->x + (offset * ctx->font->width) + w;
> >> +			y = ctx->y + h;
> >> +			drm_panic_draw_xy(ctx->fb, ctx->vmap, x, y,
> >> +					  fontline & BIT(7 - w));
> >> +		}
> >> +	}
> >> +}
> >> +
> >> +static void drm_panic_render_line(struct drm_panic_ctx *ctx,
> >> +				  const char *line, size_t len)
> >> +{
> >> +	unsigned int i;
> >> +
> >> +	for (i = 0; i < len; i++)
> >> +		drm_panic_render_char(ctx, i, line[i]);
> >> +
> >> +	/* Clear out the rest of the line */
> >> +	for (i = len; i < ctx->max_line_length; i++)
> >> +		drm_panic_render_char(ctx, i, ' ');
> >> +}
> >> +
> >> +static bool drm_panic_newline(struct drm_panic_ctx *ctx)
> >> +{
> >> +	if (ctx->x == 0 && ctx->y == 0)
> >> +		return false;
> >> +	if (ctx->y == 0) {
> >> +		ctx->x -= ctx->col_width;
> >> +		ctx->y = ctx->bottom_y;
> >> +	} else {
> >> +		ctx->y -= ctx->font->height;
> >> +	}
> >> +
> >> +	return true;
> >> +}
> >> +
> >> +/* Render from bottom right most column */
> >> +static bool drm_panic_render_lines(struct drm_panic_ctx *ctx,
> >> +				   const char *str, size_t len)
> >> +{
> >> +	size_t l, line_length;
> >> +	const char *pos;
> >> +	int i;
> >> +
> >> +	while (len) {
> >> +		pos = str + len - 1;
> >> +
> >> +		if (*pos == '\n') {
> >> +			len--;
> >> +			if (!drm_panic_newline(ctx))
> >> +				return false;
> >> +			continue;
> >> +		}
> >> +
> >> +		while (pos > str && *(pos - 1) != '\n')
> >> +			pos--;
> >> +
> >> +		line_length = len - (pos - str);
> >> +
> >> +		if (!line_length || len < line_length) {
> >> +			pr_err("%s: Bug! line_length=%zu len=%zu\n",
> >> +			       __func__, line_length, len);
> >> +			return false;
> >> +		}
> >> +
> >> +		len -= line_length;
> >> +
> >> +		for (i = DIV_ROUND_UP(line_length, ctx->max_line_length) - 1; i >= 0; i--) {
> >> +			l = min(ctx->max_line_length, line_length - i * ctx->max_line_length);
> >> +			drm_panic_render_line(ctx, pos + (i * ctx->max_line_length), l);
> >> +			if (i && !drm_panic_newline(ctx))
> >> +				return false;
> >> +		}
> >> +	}
> >> +
> >> +	return true;
> >> +}
> >> +
> >> +static void drm_panic_kmsg_render_screen(struct drm_plane *plane,
> >> +					 struct kmsg_dumper *dumper)
> >> +{
> >> +	struct drm_framebuffer *fb = plane->state->fb;
> >> +	bool first_iteration = true;
> >> +	struct drm_panic_ctx ctx;
> >> +	static char text[1024];
> >> +	size_t len;
> >> +
> >> +	ctx.vmap = fb->funcs->panic_vmap(fb);
> >> +
> >> +	/* Print some info when testing */
> >> +	if (dumper->max_reason == KMSG_DUMP_OOPS) {
> >> +		struct drm_format_name_buf format_name;
> >> +
> >> +		pr_info("%s: [FB:%d] %ux%u format=%s vmap=%p\n",
> >> +			__func__, fb->base.id, fb->width, fb->height,
> >> +			drm_get_format_name(fb->format->format, &format_name),
> >> +			ctx.vmap);
> >> +	}
> >> +
> >> +	if (!ctx.vmap)
> >> +		return;
> > 
> > For some framebuffers it might be useful to not require vmap, but instead
> > have some page-by-page kind of access helper. Since preemptively setting
> > up a vmap for all the non-continous buffers is a bit much, and we really
> > can't set up the vmap in the panic handler.
> > 
> > Maybe we should call this panic_prepare/panic_finish and
> > s/vmap/cookie/, to make it entirely opaque?
> > 
> > But a bit overengineering perhaps :-)
> > 
> 
> I guess i915 wants something else than vmap, but I have no idea how a
> page-by-page solution is to be implemented.

i915 should be mostly fine, since we have a GTT for remapping and can make
it look continuous. It might not be in the part of the GTT accessible by
the cpu though.

Wrt implementation: My idea would be to extract the pixel writing function
and export it to drivers. The driver would then implement a
->panic_draw_xy function which looks up the right page, computes it
address, computes the x/y offset within (taking into account tiling and
stuff like that), and then uses the helper function to draw the right
pixel value for the format at the given address.

That's why I suggested that drivers need to either implement
->panic_prepare or ->panic_draw_xy. Since for this approach you might not
need a ->panic_prepare/cleanup implementation, since it's all done in
->panic_draw_xy on a pixel-by-pixel basis.

> When we get bootsplash, we will at least have a kernel address during
> that phase for all drivers supporting it.

Hm, not following what you mean here. Why is bootsplash special?

> >> +
> >> +	ctx.fb = fb;
> >> +	ctx.width = drm_rect_width(&plane->state->src) >> 16;
> >> +	ctx.height = drm_rect_height(&plane->state->src) >> 16;
> >> +
> >> +	/*
> >> +	 * TODO:
> >> +	 * Find which part of the fb that is visible.
> >> +	 * Width and height are zero on vc4
> >> +	 */
> >> +	if (!ctx.width || !ctx.height) {
> >> +		ctx.width = fb->width;
> >> +		ctx.height = fb->height;
> >> +	}
> >> +
> >> +	/* Try to fit 50 lines */
> >> +	if (ctx.height < 50 * 16 && drm_panic_font8x8)
> >> +		ctx.font = drm_panic_font8x8;
> >> +	else
> >> +		ctx.font = drm_panic_font8x16;
> >> +
> >> +	ctx.col_width = DRM_PANIC_COLUMN_WIDTH * ctx.font->width;
> >> +	ctx.bottom_y = ((ctx.height / ctx.font->height) - 1) * ctx.font->height;
> >> +
> >> +	if (ctx.width < 2 * ctx.col_width)
> >> +		ctx.max_line_length = ctx.width / ctx.font->width;
> >> +	else
> >> +		ctx.max_line_length = DRM_PANIC_COLUMN_WIDTH - 2; /* border=2 */
> >> +
> >> +	ctx.x = 0;
> >> +	ctx.y = ctx.bottom_y;
> >> +	if (ctx.width > ctx.col_width)
> >> +		ctx.x = ((ctx.width / ctx.col_width) - 1) * ctx.col_width;
> >> +
> >> +	pr_debug("%s: font=%s %ux%u max_line_length=%zu (%u,%u)\n",
> >> +		 __func__, ctx.font->name, ctx.width, ctx.height,
> >> +		 ctx.max_line_length, ctx.x, ctx.y);
> >> +
> >> +	kmsg_dump_rewind(dumper);
> >> +
> >> +	/* The latest messages are fetched first */
> >> +	while (kmsg_dump_get_buffer(dumper, false, text, sizeof(text), &len)) {
> >> +		/* Strip off the very last newline so we start at the bottom */
> >> +		if (first_iteration) {
> >> +			len--;
> >> +			first_iteration = false;
> >> +		}
> >> +
> >> +		if (!drm_panic_render_lines(&ctx, text, len))
> >> +			break;
> >> +	}
> >> +
> >> +	if (fb->funcs->panic_vunmap)
> >> +		fb->funcs->panic_vunmap(fb, vmap);
> >> +}
> >> +
> >> +static void drm_panic_try_dev(struct drm_device *dev, struct kmsg_dumper *dumper)
> >> +{
> >> +	struct drm_framebuffer *fb;
> >> +	struct drm_plane *plane;
> >> +	struct drm_crtc *crtc;
> >> +
> >> +	if (!drm_core_check_feature(dev, DRIVER_ATOMIC))
> >> +		return;
> >> +
> >> +	if (dumper->max_reason == KMSG_DUMP_OOPS)
> >> +		pr_info("%s: %s on minor %d\n", __func__, dev->driver->name,
> >> +			dev->primary->index);
> >> +
> >> +	drm_for_each_crtc(crtc, dev) {
> >> +		if (!ww_mutex_trylock(&crtc->mutex.mutex))
> >> +			continue;
> >> +
> >> +		if (!crtc->enabled || !crtc->primary)
> >> +			goto crtc_unlock;
> >> +
> >> +		if (!crtc->state || !crtc->state->active)
> >> +			goto crtc_unlock;
> >> +
> >> +		plane = crtc->primary;
> >> +		if (!ww_mutex_trylock(&plane->mutex.mutex))
> >> +			goto crtc_unlock;
> >> +
> >> +		/*
> >> +		 * TODO: Should we check plane_state->visible?
> >> +		 * It is not set on vc4
> > 
> > I think we should. 
> 
> I need to recheck vc4, because when I did this a year ago ->visible was
> not set on vc4.

I think if you look at plane_state->src (i.e. the clipped src rect) then
if that is non-zero I think you can assume ->visible is also set. Becuase
that's both computed by the same helpers.

And if that's not the case, fall back to ->src_x/y/h/w and just assume the
plane is visible (usually those are the simpler drivers anyway).

> > We should also check whether that primary is connected
> > to the crtc (some hw you can reassign planes freely between crtc, and the
> > crtc->primary pointer is only used for compat with legacy ioctl).
> 
> Like this?
> 		/* Check that the plane has not been reassigned */
> 		if (plane->state->crtc != crtc)
> 			goto plane_unlock;

Yup.

> >> +		if (!plane->state || !plane->state->visible)
> >> +		 */
> >> +		if (!plane->state)
> >> +			goto plane_unlock;
> >> +
> >> +		fb = plane->state->fb;
> >> +		if (!fb || !fb->funcs->panic_vmap)
> > 
> > Docs says this is optional, doesn't seem to be all that optional. I'd
> > check for this or a driver-specific ->panic_draw_xy instead.
> 
> ->panic_vmap determines whether panic is supported or not, so in that
> sense it's optional. ->panic_draw_xy is optional with a default fallback
> for the linear case.
> 
> >> +			goto plane_unlock;
> >> +
> >> +		/*
> >> +		 * fbcon puts out panic messages so stay away to avoid jumbled
> >> +		 * output. If vc->vc_mode != KD_TEXT fbcon won't put out
> >> +		 * messages (see vt_console_print).
> >> +		 */
> >> +		if (dev->fb_helper && dev->fb_helper->fb == fb)
> >> +			goto plane_unlock;
> >> +
> >> +		drm_panic_kmsg_render_screen(plane, dumper);
> >> +plane_unlock:
> >> +		ww_mutex_unlock(&plane->mutex.mutex);
> >> +crtc_unlock:
> >> +		ww_mutex_unlock(&crtc->mutex.mutex);
> >> +	}
> >> +}
> >> +
> >> +static int drm_panic_dev_iter(struct device *dev, void *data)
> >> +{
> >> +	struct drm_minor *minor = dev_get_drvdata(dev);
> >> +
> >> +	if (minor && minor->type == DRM_MINOR_PRIMARY)
> >> +		drm_panic_try_dev(minor->dev, data);
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +static void drm_panic_kmsg_dump(struct kmsg_dumper *dumper,
> >> +				enum kmsg_dump_reason reason)
> >> +{
> >> +	class_for_each_device(drm_class, NULL, dumper, drm_panic_dev_iter);
> > 
> > class_for_each_device uses klist, which only uses an irqsave spinlock. I
> > think that's good enough. Comment to that effect would be good e.g.
> > 
> > 	/* based on klist, which uses only a spin_lock_irqsave, which we
> > 	 * assume still works */
> > 
> > If we aim for perfect this should be a trylock still, maybe using our own
> > device list.
> > 
> >> +}
> >> +
> >> +static struct kmsg_dumper drm_panic_kmsg_dumper = {
> >> +	.dump = drm_panic_kmsg_dump,
> >> +	.max_reason = KMSG_DUMP_PANIC,
> >> +};
> >> +
> >> +static ssize_t drm_panic_file_panic_write(struct file *file,
> >> +					  const char __user *user_buf,
> >> +					  size_t count, loff_t *ppos)
> >> +{
> >> +	unsigned long long val;
> >> +	char buf[24];
> >> +	size_t size;
> >> +	ssize_t ret;
> >> +
> >> +	size = min(sizeof(buf) - 1, count);
> >> +	if (copy_from_user(buf, user_buf, size))
> >> +		return -EFAULT;
> >> +
> >> +	buf[size] = '\0';
> >> +	ret = kstrtoull(buf, 0, &val);
> >> +	if (ret)
> >> +		return ret;
> >> +
> >> +	drm_panic_kmsg_dumper.max_reason = KMSG_DUMP_OOPS;
> >> +	wmb();
> >> +
> >> +	/* Do a real test with: echo c > /proc/sysrq-trigger */
> >> +
> >> +	if (val == 0) {
> >> +		pr_info("Test panic screen using kmsg_dump(OOPS)\n");
> >> +		kmsg_dump(KMSG_DUMP_OOPS);
> >> +	} else if (val == 1) {
> >> +		char *null_pointer = NULL;
> >> +
> >> +		pr_info("Test panic screen using NULL pointer dereference\n");
> >> +		*null_pointer = 1;
> >> +	} else {
> >> +		return -EINVAL;
> >> +	}
> > 
> > This isn't quite what I had in mind, since it still kills the kernel (like
> > sysrq-trigger). 
> 
> If val == 0, it doesn't kill the kernel, it only dumps the kernel log.
> And it doesn't taint the kernel either.

Ah I didn't realize that. Sounds like a good option to keep.

> > Instead what I had in mind is to recreate the worst
> > possible panic context as much as feasible (disabling interrupts should be
> > a good start, maybe we can even do an nmi callback), and then call our
> > panic implementation. That way we can test the panic handler in a
> > non-destructive way (i.e. aside from last dmesg lines printed to the
> > screen nothing bad happens to the kernel: No real panic, no oops, no
> > tainting).
> 
> The interrupt case I can do, nmi I have no idea.

I just read the printk nmi code again and it looks like there's now even
more special handling for issues happening in nmi context, so we should
never see an oops from nmi context. So local_irq_disable() wrapping should
be good enough for testing.
-Daniel

> 
> Noralf.
> 
> > 
> > And igt testcase could then exercise this, and CI run it.
> > 
> >> +
> >> +	return count;
> >> +}
> >> +
> >> +static const struct file_operations drm_panic_panic_ops = {
> >> +	.write =        drm_panic_file_panic_write,
> >> +	.open =         simple_open,
> >> +	.llseek =       default_llseek,
> >> +};
> >> +
> >> +static struct dentry *drm_panic_d_panic;
> >> +
> >> +void __init drm_panic_init(struct dentry *debugfs_root)
> >> +{
> >> +	drm_panic_font8x8 = find_font("VGA8x8");
> >> +	drm_panic_font8x16 = find_font("VGA8x16");
> >> +	if (!drm_panic_font8x16) {
> >> +		DRM_WARN("Couldn't find font, panic screen disabled\n");
> >> +		return;
> >> +	}
> >> +
> >> +	drm_panic_d_panic = debugfs_create_file("panic-test", 0200,
> >> +						debugfs_root, NULL,
> >> +						&drm_panic_panic_ops);
> >> +	kmsg_dump_register(&drm_panic_kmsg_dumper);
> >> +}
> >> +
> >> +void drm_panic_exit(struct dentry *debugfs_root)
> >> +{
> >> +	kmsg_dump_unregister(&drm_panic_kmsg_dumper);
> >> +	debugfs_remove(drm_panic_d_panic);
> >> +}
> >> diff --git a/include/drm/drm_framebuffer.h b/include/drm/drm_framebuffer.h
> >> index f0b34c977ec5..f3274798ecfe 100644
> >> --- a/include/drm/drm_framebuffer.h
> >> +++ b/include/drm/drm_framebuffer.h
> >> @@ -94,6 +94,44 @@ struct drm_framebuffer_funcs {
> >>  		     struct drm_file *file_priv, unsigned flags,
> >>  		     unsigned color, struct drm_clip_rect *clips,
> >>  		     unsigned num_clips);
> >> +
> >> +	/**
> >> +	 * @panic_vmap:
> >> +	 *
> >> +	 * Optional callback for panic handling.
> >> +	 *
> >> +	 * For vmapping the selected framebuffer in a panic context. Must
> >> +	 * be super careful about locking (only trylocking allowed).
> >> +	 *
> >> +	 * RETURNS:
> >> +	 *
> >> +	 * NULL if it didn't work out, otherwise an opaque cookie which is
> >> +	 * passed to @panic_draw_xy. It can be anything: vmap area, structure
> >> +	 * with more details, just a few flags, ...
> >> +	 */
> >> +	void *(*panic_vmap)(struct drm_framebuffer *fb);
> >> +
> >> +	/**
> >> +	 * @panic_vunmap:
> >> +	 *
> >> +	 * Optional callback for cleaning up after panic testing.
> >> +	 *
> >> +	 * Crtc and plane locks are released after this callback has run.
> >> +	 * vmap is the cookie returned by @panic_vmap.
> >> +	 */
> >> +	void (*panic_vunmap)(struct drm_framebuffer *fb, void *vmap);
> >> +
> >> +	/**
> >> +	 * @panic_draw_xy:
> >> +	 *
> >> +	 * Optional callback for drawing pixels during panic.
> >> +	 *
> >> +	 * For drawing pixels onto a framebuffer prepared with @panic_vmap.
> >> +	 * vmap is the cookie returned by @panic_vmap.
> >> +	 * If it's not set, drm_framebuffer_panic_draw_xy() is used.
> >> +	 */
> >> +	void (*panic_draw_xy)(struct drm_framebuffer *fb, void *vmap,
> >> +			      int x, int y, bool foreground);
> >>  };
> >>  
> >>  /**
> >> @@ -293,4 +331,7 @@ int drm_framebuffer_plane_width(int width,
> >>  int drm_framebuffer_plane_height(int height,
> >>  				 const struct drm_framebuffer *fb, int plane);
> >>  
> >> +void drm_framebuffer_panic_draw_xy(struct drm_framebuffer *fb, void *vmap,
> >> +				   int x, int y, bool foreground);
> >> +
> >>  #endif
> >> -- 
> >> 2.20.1
> >>
> >

Noralf Trønnes March 12, 2019, 1:29 p.m. UTC | #10

Den 12.03.2019 11.58, skrev Daniel Vetter:
> On Mon, Mar 11, 2019 at 11:33:15PM +0100, Noralf Trønnes wrote:
>>
>>
>> Den 11.03.2019 20.23, skrev Daniel Vetter:
>>> On Mon, Mar 11, 2019 at 06:42:16PM +0100, Noralf Trønnes wrote:
>>>> This adds support for outputting kernel messages on panic().
>>>> A kernel message dumper is used to dump the log. The dumper iterates
>>>> over each DRM device and it's crtc's to find suitable framebuffers.
>>>>
>>>> All the other dumpers are run before this one except mtdoops.
>>>> Only atomic drivers are supported.
>>>>
>>>> Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
>>>

<snip>

>>>> +static void drm_panic_kmsg_render_screen(struct drm_plane *plane,
>>>> +					 struct kmsg_dumper *dumper)
>>>> +{
>>>> +	struct drm_framebuffer *fb = plane->state->fb;
>>>> +	bool first_iteration = true;
>>>> +	struct drm_panic_ctx ctx;
>>>> +	static char text[1024];
>>>> +	size_t len;
>>>> +
>>>> +	ctx.vmap = fb->funcs->panic_vmap(fb);
>>>> +
>>>> +	/* Print some info when testing */
>>>> +	if (dumper->max_reason == KMSG_DUMP_OOPS) {
>>>> +		struct drm_format_name_buf format_name;
>>>> +
>>>> +		pr_info("%s: [FB:%d] %ux%u format=%s vmap=%p\n",
>>>> +			__func__, fb->base.id, fb->width, fb->height,
>>>> +			drm_get_format_name(fb->format->format, &format_name),
>>>> +			ctx.vmap);
>>>> +	}
>>>> +
>>>> +	if (!ctx.vmap)
>>>> +		return;
>>>
>>> For some framebuffers it might be useful to not require vmap, but instead
>>> have some page-by-page kind of access helper. Since preemptively setting
>>> up a vmap for all the non-continous buffers is a bit much, and we really
>>> can't set up the vmap in the panic handler.
>>>
>>> Maybe we should call this panic_prepare/panic_finish and
>>> s/vmap/cookie/, to make it entirely opaque?
>>>
>>> But a bit overengineering perhaps :-)
>>>
>>
>> I guess i915 wants something else than vmap, but I have no idea how a
>> page-by-page solution is to be implemented.
> 
> i915 should be mostly fine, since we have a GTT for remapping and can make
> it look continuous. It might not be in the part of the GTT accessible by
> the cpu though.
> 
> Wrt implementation: My idea would be to extract the pixel writing function
> and export it to drivers. The driver would then implement a
> ->panic_draw_xy function which looks up the right page, computes it
> address, computes the x/y offset within (taking into account tiling and
> stuff like that), and then uses the helper function to draw the right
> pixel value for the format at the given address.
> 
> That's why I suggested that drivers need to either implement
> ->panic_prepare or ->panic_draw_xy. Since for this approach you might not
> need a ->panic_prepare/cleanup implementation, since it's all done in
> ->panic_draw_xy on a pixel-by-pixel basis.
> 
>> When we get bootsplash, we will at least have a kernel address during
>> that phase for all drivers supporting it.
> 
> Hm, not following what you mean here. Why is bootsplash special?
> 

Special in the sense that it's framebuffer will be linear and already
have a virtual address for its backing buffer since that's a
prerequisite for it to render the splash image.

Noralf.

Ville Syrjälä March 12, 2019, 4:17 p.m. UTC | #11

On Tue, Mar 12, 2019 at 11:47:04AM +0100, Michel Dänzer wrote:
> On 2019-03-11 6:42 p.m., Noralf Trønnes wrote:
> > This adds support for outputting kernel messages on panic().
> > A kernel message dumper is used to dump the log. The dumper iterates
> > over each DRM device and it's crtc's to find suitable framebuffers.
> > 
> > All the other dumpers are run before this one except mtdoops.
> > Only atomic drivers are supported.
> > 
> > Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
> > ---
> >  [...]
> > 
> > diff --git a/include/drm/drm_framebuffer.h b/include/drm/drm_framebuffer.h
> > index f0b34c977ec5..f3274798ecfe 100644
> > --- a/include/drm/drm_framebuffer.h
> > +++ b/include/drm/drm_framebuffer.h
> > @@ -94,6 +94,44 @@ struct drm_framebuffer_funcs {
> >  		     struct drm_file *file_priv, unsigned flags,
> >  		     unsigned color, struct drm_clip_rect *clips,
> >  		     unsigned num_clips);
> > +
> > +	/**
> > +	 * @panic_vmap:
> > +	 *
> > +	 * Optional callback for panic handling.
> > +	 *
> > +	 * For vmapping the selected framebuffer in a panic context. Must
> > +	 * be super careful about locking (only trylocking allowed).
> > +	 *
> > +	 * RETURNS:
> > +	 *
> > +	 * NULL if it didn't work out, otherwise an opaque cookie which is
> > +	 * passed to @panic_draw_xy. It can be anything: vmap area, structure
> > +	 * with more details, just a few flags, ...
> > +	 */
> > +	void *(*panic_vmap)(struct drm_framebuffer *fb);
> 
> FWIW, the panic_vmap hook cannot work in general with the amdgpu/radeon
> drivers:
> 
> Framebuffers are normally tiled, writing to them with the CPU results in
> garbled output.
> 
> With a discrete GPU having a large amount of VRAM, the framebuffer may
> not be directly CPU accessible at all.
> 
> 
> There would need to be a mechanism for switching scanout to a linear,
> CPU accessible framebuffer.

I suppose panic_vmap() could just provide a linear temp buffer
to the panic handler, and panic_unmap() could copy the contents
over to the real fb.

That said, this approach of scribbling over the primary plane's
framebuffer has some clear limitations:
* something may overwrite the oops message before the user
  can even read it
* there may be other planes obscuring part or all of the
  primary plane

Also scribbling over the user's framebuffer seems rather rude
to me, so I'm thinking this approach should be limited to kernel
panics only.

Noralf Trønnes March 12, 2019, 5:15 p.m. UTC | #12

Den 12.03.2019 17.17, skrev Ville Syrjälä:
> On Tue, Mar 12, 2019 at 11:47:04AM +0100, Michel Dänzer wrote:
>> On 2019-03-11 6:42 p.m., Noralf Trønnes wrote:
>>> This adds support for outputting kernel messages on panic().
>>> A kernel message dumper is used to dump the log. The dumper iterates
>>> over each DRM device and it's crtc's to find suitable framebuffers.
>>>
>>> All the other dumpers are run before this one except mtdoops.
>>> Only atomic drivers are supported.
>>>
>>> Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
>>> ---
>>>  [...]
>>>
>>> diff --git a/include/drm/drm_framebuffer.h b/include/drm/drm_framebuffer.h
>>> index f0b34c977ec5..f3274798ecfe 100644
>>> --- a/include/drm/drm_framebuffer.h
>>> +++ b/include/drm/drm_framebuffer.h
>>> @@ -94,6 +94,44 @@ struct drm_framebuffer_funcs {
>>>  		     struct drm_file *file_priv, unsigned flags,
>>>  		     unsigned color, struct drm_clip_rect *clips,
>>>  		     unsigned num_clips);
>>> +
>>> +	/**
>>> +	 * @panic_vmap:
>>> +	 *
>>> +	 * Optional callback for panic handling.
>>> +	 *
>>> +	 * For vmapping the selected framebuffer in a panic context. Must
>>> +	 * be super careful about locking (only trylocking allowed).
>>> +	 *
>>> +	 * RETURNS:
>>> +	 *
>>> +	 * NULL if it didn't work out, otherwise an opaque cookie which is
>>> +	 * passed to @panic_draw_xy. It can be anything: vmap area, structure
>>> +	 * with more details, just a few flags, ...
>>> +	 */
>>> +	void *(*panic_vmap)(struct drm_framebuffer *fb);
>>
>> FWIW, the panic_vmap hook cannot work in general with the amdgpu/radeon
>> drivers:
>>
>> Framebuffers are normally tiled, writing to them with the CPU results in
>> garbled output.
>>

In which case the driver needs to support the ->panic_draw_xy callback,
or maybe it's possible to make a generic helper for tiled buffers.

>> With a discrete GPU having a large amount of VRAM, the framebuffer may
>> not be directly CPU accessible at all.
>>

I would have been nice to know how Windows works around this.

>>
>> There would need to be a mechanism for switching scanout to a linear,
>> CPU accessible framebuffer.
> 
> I suppose panic_vmap() could just provide a linear temp buffer
> to the panic handler, and panic_unmap() could copy the contents
> over to the real fb.
> 
> That said, this approach of scribbling over the primary plane's
> framebuffer has some clear limitations:
> * something may overwrite the oops message before the user
>   can even read it

When the dumper drm_panic_kmsg_dump() runs, the other CPU's should have
been stopped. See panic().

> * there may be other planes obscuring part or all of the
>   primary plane
> 

Yeah, this is a problem, again I wonder how Windows deals with this.

> Also scribbling over the user's framebuffer seems rather rude
> to me, so I'm thinking this approach should be limited to kernel
> panics only.
> 

Yes this will only happen on kernel panics:

panic() -> kmsg_dump() -> drm_panic_kmsg_dump()

(Unless invoking through debugfs ofc)

Noralf.

Ville Syrjälä March 12, 2019, 5:25 p.m. UTC | #13

On Tue, Mar 12, 2019 at 06:15:24PM +0100, Noralf Trønnes wrote:
> 
> 
> Den 12.03.2019 17.17, skrev Ville Syrjälä:
> > On Tue, Mar 12, 2019 at 11:47:04AM +0100, Michel Dänzer wrote:
> >> On 2019-03-11 6:42 p.m., Noralf Trønnes wrote:
> >>> This adds support for outputting kernel messages on panic().
> >>> A kernel message dumper is used to dump the log. The dumper iterates
> >>> over each DRM device and it's crtc's to find suitable framebuffers.
> >>>
> >>> All the other dumpers are run before this one except mtdoops.
> >>> Only atomic drivers are supported.
> >>>
> >>> Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
> >>> ---
> >>>  [...]
> >>>
> >>> diff --git a/include/drm/drm_framebuffer.h b/include/drm/drm_framebuffer.h
> >>> index f0b34c977ec5..f3274798ecfe 100644
> >>> --- a/include/drm/drm_framebuffer.h
> >>> +++ b/include/drm/drm_framebuffer.h
> >>> @@ -94,6 +94,44 @@ struct drm_framebuffer_funcs {
> >>>  		     struct drm_file *file_priv, unsigned flags,
> >>>  		     unsigned color, struct drm_clip_rect *clips,
> >>>  		     unsigned num_clips);
> >>> +
> >>> +	/**
> >>> +	 * @panic_vmap:
> >>> +	 *
> >>> +	 * Optional callback for panic handling.
> >>> +	 *
> >>> +	 * For vmapping the selected framebuffer in a panic context. Must
> >>> +	 * be super careful about locking (only trylocking allowed).
> >>> +	 *
> >>> +	 * RETURNS:
> >>> +	 *
> >>> +	 * NULL if it didn't work out, otherwise an opaque cookie which is
> >>> +	 * passed to @panic_draw_xy. It can be anything: vmap area, structure
> >>> +	 * with more details, just a few flags, ...
> >>> +	 */
> >>> +	void *(*panic_vmap)(struct drm_framebuffer *fb);
> >>
> >> FWIW, the panic_vmap hook cannot work in general with the amdgpu/radeon
> >> drivers:
> >>
> >> Framebuffers are normally tiled, writing to them with the CPU results in
> >> garbled output.
> >>
> 
> In which case the driver needs to support the ->panic_draw_xy callback,
> or maybe it's possible to make a generic helper for tiled buffers.
> 
> >> With a discrete GPU having a large amount of VRAM, the framebuffer may
> >> not be directly CPU accessible at all.
> >>
> 
> I would have been nice to know how Windows works around this.
> 
> >>
> >> There would need to be a mechanism for switching scanout to a linear,
> >> CPU accessible framebuffer.
> > 
> > I suppose panic_vmap() could just provide a linear temp buffer
> > to the panic handler, and panic_unmap() could copy the contents
> > over to the real fb.
> > 
> > That said, this approach of scribbling over the primary plane's
> > framebuffer has some clear limitations:
> > * something may overwrite the oops message before the user
> >   can even read it
> 
> When the dumper drm_panic_kmsg_dump() runs, the other CPU's should have
> been stopped. See panic().

GPUs etc. may still be executing away.

> 
> > * there may be other planes obscuring part or all of the
> >   primary plane
> > 
> 
> Yeah, this is a problem, again I wonder how Windows deals with this.

Probably just disables all other planes. Not that it uses planes
all that heavily.

> 
> > Also scribbling over the user's framebuffer seems rather rude
> > to me, so I'm thinking this approach should be limited to kernel
> > panics only.
> > 
> 
> Yes this will only happen on kernel panics:
> 
> panic() -> kmsg_dump() -> drm_panic_kmsg_dump()
> 
> (Unless invoking through debugfs ofc)

I thought you set the max_level or whatever to OOPS. Doesn't that mean
it gets involved for non-panics as well?

Noralf Trønnes March 12, 2019, 5:37 p.m. UTC | #14

Den 12.03.2019 18.25, skrev Ville Syrjälä:
> On Tue, Mar 12, 2019 at 06:15:24PM +0100, Noralf Trønnes wrote:
>>
>>
>> Den 12.03.2019 17.17, skrev Ville Syrjälä:
>>> On Tue, Mar 12, 2019 at 11:47:04AM +0100, Michel Dänzer wrote:
>>>> On 2019-03-11 6:42 p.m., Noralf Trønnes wrote:
>>>>> This adds support for outputting kernel messages on panic().
>>>>> A kernel message dumper is used to dump the log. The dumper iterates
>>>>> over each DRM device and it's crtc's to find suitable framebuffers.
>>>>>
>>>>> All the other dumpers are run before this one except mtdoops.
>>>>> Only atomic drivers are supported.
>>>>>
>>>>> Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
>>>>> ---
>>>>>  [...]
>>>>>
>>>>> diff --git a/include/drm/drm_framebuffer.h b/include/drm/drm_framebuffer.h
>>>>> index f0b34c977ec5..f3274798ecfe 100644
>>>>> --- a/include/drm/drm_framebuffer.h
>>>>> +++ b/include/drm/drm_framebuffer.h
>>>>> @@ -94,6 +94,44 @@ struct drm_framebuffer_funcs {
>>>>>  		     struct drm_file *file_priv, unsigned flags,
>>>>>  		     unsigned color, struct drm_clip_rect *clips,
>>>>>  		     unsigned num_clips);
>>>>> +
>>>>> +	/**
>>>>> +	 * @panic_vmap:
>>>>> +	 *
>>>>> +	 * Optional callback for panic handling.
>>>>> +	 *
>>>>> +	 * For vmapping the selected framebuffer in a panic context. Must
>>>>> +	 * be super careful about locking (only trylocking allowed).
>>>>> +	 *
>>>>> +	 * RETURNS:
>>>>> +	 *
>>>>> +	 * NULL if it didn't work out, otherwise an opaque cookie which is
>>>>> +	 * passed to @panic_draw_xy. It can be anything: vmap area, structure
>>>>> +	 * with more details, just a few flags, ...
>>>>> +	 */
>>>>> +	void *(*panic_vmap)(struct drm_framebuffer *fb);
>>>>
>>>> FWIW, the panic_vmap hook cannot work in general with the amdgpu/radeon
>>>> drivers:
>>>>
>>>> Framebuffers are normally tiled, writing to them with the CPU results in
>>>> garbled output.
>>>>
>>
>> In which case the driver needs to support the ->panic_draw_xy callback,
>> or maybe it's possible to make a generic helper for tiled buffers.
>>
>>>> With a discrete GPU having a large amount of VRAM, the framebuffer may
>>>> not be directly CPU accessible at all.
>>>>
>>
>> I would have been nice to know how Windows works around this.
>>
>>>>
>>>> There would need to be a mechanism for switching scanout to a linear,
>>>> CPU accessible framebuffer.
>>>
>>> I suppose panic_vmap() could just provide a linear temp buffer
>>> to the panic handler, and panic_unmap() could copy the contents
>>> over to the real fb.
>>>
>>> That said, this approach of scribbling over the primary plane's
>>> framebuffer has some clear limitations:
>>> * something may overwrite the oops message before the user
>>>   can even read it
>>
>> When the dumper drm_panic_kmsg_dump() runs, the other CPU's should have
>> been stopped. See panic().
> 
> GPUs etc. may still be executing away.
> 

Would it be safe to stop it in a panic situation? It would ofc be bad to
crash the box even harder.

>>
>>> * there may be other planes obscuring part or all of the
>>>   primary plane
>>>
>>
>> Yeah, this is a problem, again I wonder how Windows deals with this.
> 
> Probably just disables all other planes. Not that it uses planes
> all that heavily.
> 
>>
>>> Also scribbling over the user's framebuffer seems rather rude
>>> to me, so I'm thinking this approach should be limited to kernel
>>> panics only.
>>>
>>
>> Yes this will only happen on kernel panics:
>>
>> panic() -> kmsg_dump() -> drm_panic_kmsg_dump()
>>
>> (Unless invoking through debugfs ofc)
> 
> I thought you set the max_level or whatever to OOPS. Doesn't that mean
> it gets involved for non-panics as well?
> 

I do that in the debugfs code, but I can't remember why I lower level, I
think can just change the level when invoking the dumper:
 drm_panic_file_panic_write(...)
-		kmsg_dump(KMSG_DUMP_OOPS);
+		kmsg_dump(KMSG_DUMP_PANIC);


This is the dumper config:

static struct kmsg_dumper drm_panic_kmsg_dumper = {
	.dump = drm_panic_kmsg_dump,
	.max_reason = KMSG_DUMP_PANIC,
};

Noralf Trønnes March 12, 2019, 5:44 p.m. UTC | #15

Den 12.03.2019 18.37, skrev Noralf Trønnes:
> 
> 
> Den 12.03.2019 18.25, skrev Ville Syrjälä:
>> On Tue, Mar 12, 2019 at 06:15:24PM +0100, Noralf Trønnes wrote:
>>>
>>>
>>> Den 12.03.2019 17.17, skrev Ville Syrjälä:
>>>> On Tue, Mar 12, 2019 at 11:47:04AM +0100, Michel Dänzer wrote:
>>>>> On 2019-03-11 6:42 p.m., Noralf Trønnes wrote:
>>>>>> This adds support for outputting kernel messages on panic().
>>>>>> A kernel message dumper is used to dump the log. The dumper iterates
>>>>>> over each DRM device and it's crtc's to find suitable framebuffers.
>>>>>>
>>>>>> All the other dumpers are run before this one except mtdoops.
>>>>>> Only atomic drivers are supported.
>>>>>>
>>>>>> Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
>>>>>> ---
>>>>>>  [...]
>>>>>>
>>>>>> diff --git a/include/drm/drm_framebuffer.h b/include/drm/drm_framebuffer.h
>>>>>> index f0b34c977ec5..f3274798ecfe 100644
>>>>>> --- a/include/drm/drm_framebuffer.h
>>>>>> +++ b/include/drm/drm_framebuffer.h
>>>>>> @@ -94,6 +94,44 @@ struct drm_framebuffer_funcs {
>>>>>>  		     struct drm_file *file_priv, unsigned flags,
>>>>>>  		     unsigned color, struct drm_clip_rect *clips,
>>>>>>  		     unsigned num_clips);
>>>>>> +
>>>>>> +	/**
>>>>>> +	 * @panic_vmap:
>>>>>> +	 *
>>>>>> +	 * Optional callback for panic handling.
>>>>>> +	 *
>>>>>> +	 * For vmapping the selected framebuffer in a panic context. Must
>>>>>> +	 * be super careful about locking (only trylocking allowed).
>>>>>> +	 *
>>>>>> +	 * RETURNS:
>>>>>> +	 *
>>>>>> +	 * NULL if it didn't work out, otherwise an opaque cookie which is
>>>>>> +	 * passed to @panic_draw_xy. It can be anything: vmap area, structure
>>>>>> +	 * with more details, just a few flags, ...
>>>>>> +	 */
>>>>>> +	void *(*panic_vmap)(struct drm_framebuffer *fb);
>>>>>
>>>>> FWIW, the panic_vmap hook cannot work in general with the amdgpu/radeon
>>>>> drivers:
>>>>>
>>>>> Framebuffers are normally tiled, writing to them with the CPU results in
>>>>> garbled output.
>>>>>
>>>
>>> In which case the driver needs to support the ->panic_draw_xy callback,
>>> or maybe it's possible to make a generic helper for tiled buffers.
>>>
>>>>> With a discrete GPU having a large amount of VRAM, the framebuffer may
>>>>> not be directly CPU accessible at all.
>>>>>
>>>
>>> I would have been nice to know how Windows works around this.
>>>
>>>>>
>>>>> There would need to be a mechanism for switching scanout to a linear,
>>>>> CPU accessible framebuffer.
>>>>
>>>> I suppose panic_vmap() could just provide a linear temp buffer
>>>> to the panic handler, and panic_unmap() could copy the contents
>>>> over to the real fb.
>>>>
>>>> That said, this approach of scribbling over the primary plane's
>>>> framebuffer has some clear limitations:
>>>> * something may overwrite the oops message before the user
>>>>   can even read it
>>>
>>> When the dumper drm_panic_kmsg_dump() runs, the other CPU's should have
>>> been stopped. See panic().
>>
>> GPUs etc. may still be executing away.
>>
> 
> Would it be safe to stop it in a panic situation? It would ofc be bad to
> crash the box even harder.
> 
>>>
>>>> * there may be other planes obscuring part or all of the
>>>>   primary plane
>>>>
>>>
>>> Yeah, this is a problem, again I wonder how Windows deals with this.
>>
>> Probably just disables all other planes. Not that it uses planes
>> all that heavily.
>>
>>>
>>>> Also scribbling over the user's framebuffer seems rather rude
>>>> to me, so I'm thinking this approach should be limited to kernel
>>>> panics only.
>>>>
>>>
>>> Yes this will only happen on kernel panics:
>>>
>>> panic() -> kmsg_dump() -> drm_panic_kmsg_dump()
>>>
>>> (Unless invoking through debugfs ofc)
>>
>> I thought you set the max_level or whatever to OOPS. Doesn't that mean
>> it gets involved for non-panics as well?
>>
> 
> I do that in the debugfs code, but I can't remember why I lower level,

Now I remember, it is so I can catch the null pointer debugfs test, but
that's not necessary since this is designed for panics, so I should
remove that test.

Noralf.

> I
> think can just change the level when invoking the dumper:
>  drm_panic_file_panic_write(...)
> -		kmsg_dump(KMSG_DUMP_OOPS);
> +		kmsg_dump(KMSG_DUMP_PANIC);
> 
> 
> This is the dumper config:
> 
> static struct kmsg_dumper drm_panic_kmsg_dumper = {
> 	.dump = drm_panic_kmsg_dump,
> 	.max_reason = KMSG_DUMP_PANIC,
> };
> 
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/dri-devel
>

Ville Syrjälä March 12, 2019, 6:02 p.m. UTC | #16

On Tue, Mar 12, 2019 at 06:37:57PM +0100, Noralf Trønnes wrote:
> 
> 
> Den 12.03.2019 18.25, skrev Ville Syrjälä:
> > On Tue, Mar 12, 2019 at 06:15:24PM +0100, Noralf Trønnes wrote:
> >>
> >>
> >> Den 12.03.2019 17.17, skrev Ville Syrjälä:
> >>> On Tue, Mar 12, 2019 at 11:47:04AM +0100, Michel Dänzer wrote:
> >>>> On 2019-03-11 6:42 p.m., Noralf Trønnes wrote:
> >>>>> This adds support for outputting kernel messages on panic().
> >>>>> A kernel message dumper is used to dump the log. The dumper iterates
> >>>>> over each DRM device and it's crtc's to find suitable framebuffers.
> >>>>>
> >>>>> All the other dumpers are run before this one except mtdoops.
> >>>>> Only atomic drivers are supported.
> >>>>>
> >>>>> Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
> >>>>> ---
> >>>>>  [...]
> >>>>>
> >>>>> diff --git a/include/drm/drm_framebuffer.h b/include/drm/drm_framebuffer.h
> >>>>> index f0b34c977ec5..f3274798ecfe 100644
> >>>>> --- a/include/drm/drm_framebuffer.h
> >>>>> +++ b/include/drm/drm_framebuffer.h
> >>>>> @@ -94,6 +94,44 @@ struct drm_framebuffer_funcs {
> >>>>>  		     struct drm_file *file_priv, unsigned flags,
> >>>>>  		     unsigned color, struct drm_clip_rect *clips,
> >>>>>  		     unsigned num_clips);
> >>>>> +
> >>>>> +	/**
> >>>>> +	 * @panic_vmap:
> >>>>> +	 *
> >>>>> +	 * Optional callback for panic handling.
> >>>>> +	 *
> >>>>> +	 * For vmapping the selected framebuffer in a panic context. Must
> >>>>> +	 * be super careful about locking (only trylocking allowed).
> >>>>> +	 *
> >>>>> +	 * RETURNS:
> >>>>> +	 *
> >>>>> +	 * NULL if it didn't work out, otherwise an opaque cookie which is
> >>>>> +	 * passed to @panic_draw_xy. It can be anything: vmap area, structure
> >>>>> +	 * with more details, just a few flags, ...
> >>>>> +	 */
> >>>>> +	void *(*panic_vmap)(struct drm_framebuffer *fb);
> >>>>
> >>>> FWIW, the panic_vmap hook cannot work in general with the amdgpu/radeon
> >>>> drivers:
> >>>>
> >>>> Framebuffers are normally tiled, writing to them with the CPU results in
> >>>> garbled output.
> >>>>
> >>
> >> In which case the driver needs to support the ->panic_draw_xy callback,
> >> or maybe it's possible to make a generic helper for tiled buffers.
> >>
> >>>> With a discrete GPU having a large amount of VRAM, the framebuffer may
> >>>> not be directly CPU accessible at all.
> >>>>
> >>
> >> I would have been nice to know how Windows works around this.
> >>
> >>>>
> >>>> There would need to be a mechanism for switching scanout to a linear,
> >>>> CPU accessible framebuffer.
> >>>
> >>> I suppose panic_vmap() could just provide a linear temp buffer
> >>> to the panic handler, and panic_unmap() could copy the contents
> >>> over to the real fb.
> >>>
> >>> That said, this approach of scribbling over the primary plane's
> >>> framebuffer has some clear limitations:
> >>> * something may overwrite the oops message before the user
> >>>   can even read it
> >>
> >> When the dumper drm_panic_kmsg_dump() runs, the other CPU's should have
> >> been stopped. See panic().
> > 
> > GPUs etc. may still be executing away.
> > 
> 
> Would it be safe to stop it in a panic situation? It would ofc be bad to
> crash the box even harder.

Some drivers/devices may have working (and hopefully even reliable)
gpu reset, some may not.

Ahmed S. Darwish March 12, 2019, 10:13 p.m. UTC | #17

Hi,

[[ CCing John for the trylock parts ]]

On Mon, Mar 11, 2019 at 11:33:15PM +0100, Noralf Trønnes wrote:
>
>
> Den 11.03.2019 20.23, skrev Daniel Vetter:
> > On Mon, Mar 11, 2019 at 06:42:16PM +0100, Noralf Trønnes wrote:
> >> This adds support for outputting kernel messages on panic().
> >> A kernel message dumper is used to dump the log. The dumper iterates
> >> over each DRM device and it's crtc's to find suitable framebuffers.
> >>
> >> All the other dumpers are run before this one except mtdoops.
> >> Only atomic drivers are supported.
> >>
> >> Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
> >
> > Bunch of comments/ideas for you or Darwish below, whoever picks this up.
>
> Actually it would ne nice if Darwish could pick it up since he will do
> it on i915 which will be useful to a much broader audience.
> If not I'll respin when I'm done with the drm_fb_helper refactoring.
>

Yup, I'll be more than happy to do this.. while preserving all of
Noralf's authorship and copyright notices of course.

I guess it can be:

  - Handle the comments posted by Daniel and others (I'll post
    some questions too)

  - Add the necessary i915 specific bits

  - Test, post v3/v4/../vn. Rinse and repeat. Keep it local at
    dri-devel until getting the necessary S-o-Bs.

  - Post to wider audience (some feedback from distribution folks
    would also be nice, before posting to lkml)

More comments below..

[...]

> >> +
> >> +static void drm_panic_kmsg_dump(struct kmsg_dumper *dumper,
> >> +				enum kmsg_dump_reason reason)
> >> +{
> >> +	class_for_each_device(drm_class, NULL, dumper, drm_panic_dev_iter);
> >
> > class_for_each_device uses klist, which only uses an irqsave spinlock. I
> > think that's good enough. Comment to that effect would be good e.g.
> >
> > 	/* based on klist, which uses only a spin_lock_irqsave, which we
> > 	 * assume still works */
> >
> > If we aim for perfect this should be a trylock still, maybe using our own
> > device list.
> >

I definitely agree here.

The lock may already be locked either by a stopped CPU, or by the
very same CPU we execute panic() on (e.g. NMI panic() on the
printing CPU).

This is why it's very common for example in serial consoles, which
are usually careful about re-entrance and panic contexts, to do:

  xxxxxx_console_write(...) {
	if (oops_in_progress)
		locked = spin_trylock_irqsave(&port->lock, flags);
	else
		spin_lock_irqsave(&port->lock, flags);
  }

I'm quite positive we should do the same for panic drm drivers.
John?

> >> +}
> >> +
> >> +static struct kmsg_dumper drm_panic_kmsg_dumper = {
> >> +	.dump = drm_panic_kmsg_dump,
> >> +	.max_reason = KMSG_DUMP_PANIC,
> >> +};
> >> +
> >> +static ssize_t drm_panic_file_panic_write(struct file *file,
> >> +					  const char __user *user_buf,
> >> +					  size_t count, loff_t *ppos)
> >> +{
> >> +	unsigned long long val;
> >> +	char buf[24];
> >> +	size_t size;
> >> +	ssize_t ret;
> >> +
> >> +	size = min(sizeof(buf) - 1, count);
> >> +	if (copy_from_user(buf, user_buf, size))
> >> +		return -EFAULT;
> >> +
> >> +	buf[size] = '\0';
> >> +	ret = kstrtoull(buf, 0, &val);
> >> +	if (ret)
> >> +		return ret;
> >> +
> >> +	drm_panic_kmsg_dumper.max_reason = KMSG_DUMP_OOPS;
> >> +	wmb();
> >> +
> >> +	/* Do a real test with: echo c > /proc/sysrq-trigger */
> >> +
> >> +	if (val == 0) {
> >> +		pr_info("Test panic screen using kmsg_dump(OOPS)\n");
> >> +		kmsg_dump(KMSG_DUMP_OOPS);
> >> +	} else if (val == 1) {
> >> +		char *null_pointer = NULL;
> >> +
> >> +		pr_info("Test panic screen using NULL pointer dereference\n");
> >> +		*null_pointer = 1;
> >> +	} else {
> >> +		return -EINVAL;
> >> +	}
> >
> > This isn't quite what I had in mind, since it still kills the kernel (like
> > sysrq-trigger).
>
> If val == 0, it doesn't kill the kernel, it only dumps the kernel log.
> And it doesn't taint the kernel either.
>
> > Instead what I had in mind is to recreate the worst
> > possible panic context as much as feasible (disabling interrupts should be
> > a good start, maybe we can even do an nmi callback), and then call our
> > panic implementation. That way we can test the panic handler in a
> > non-destructive way (i.e. aside from last dmesg lines printed to the
> > screen nothing bad happens to the kernel: No real panic, no oops, no
> > tainting).
>
> The interrupt case I can do, nmi I have no idea.
>

I agree too. Disabling interrupts + CONFIG_DEBUG_ATOMIC_SLEEP
would be a nice non-destructive test-case emulation.

thanks!

--
darwi
http://darwish.chasingpointers.com

Ahmed S. Darwish March 13, 2019, 3:53 a.m. UTC | #18

Hi,

On Tue, Mar 12, 2019 at 11:58:10AM +0100, Daniel Vetter wrote:
> On Mon, Mar 11, 2019 at 11:33:15PM +0100, Noralf Trønnes wrote:
> >
> >
> > Den 11.03.2019 20.23, skrev Daniel Vetter:
> > > On Mon, Mar 11, 2019 at 06:42:16PM +0100, Noralf Trønnes wrote:
[...]
> > >> +}
> > >> +
> > >> +static struct kmsg_dumper drm_panic_kmsg_dumper = {
> > >> +	.dump = drm_panic_kmsg_dump,
> > >> +	.max_reason = KMSG_DUMP_PANIC,
> > >> +};
> > >> +
> > >> +static ssize_t drm_panic_file_panic_write(struct file *file,
> > >> +					  const char __user *user_buf,
> > >> +					  size_t count, loff_t *ppos)
> > >> +{
> > >> +	unsigned long long val;
> > >> +	char buf[24];
> > >> +	size_t size;
> > >> +	ssize_t ret;
> > >> +
> > >> +	size = min(sizeof(buf) - 1, count);
> > >> +	if (copy_from_user(buf, user_buf, size))
> > >> +		return -EFAULT;
> > >> +
> > >> +	buf[size] = '\0';
> > >> +	ret = kstrtoull(buf, 0, &val);
> > >> +	if (ret)
> > >> +		return ret;
> > >> +
> > >> +	drm_panic_kmsg_dumper.max_reason = KMSG_DUMP_OOPS;
> > >> +	wmb();
> > >> +
> > >> +	/* Do a real test with: echo c > /proc/sysrq-trigger */
> > >> +
> > >> +	if (val == 0) {
> > >> +		pr_info("Test panic screen using kmsg_dump(OOPS)\n");
> > >> +		kmsg_dump(KMSG_DUMP_OOPS);
> > >> +	} else if (val == 1) {
> > >> +		char *null_pointer = NULL;
> > >> +
> > >> +		pr_info("Test panic screen using NULL pointer dereference\n");
> > >> +		*null_pointer = 1;
> > >> +	} else {
> > >> +		return -EINVAL;
> > >> +	}
>> > >
> > > This isn't quite what I had in mind, since it still kills the kernel (like
> > > sysrq-trigger).
> >
> > If val == 0, it doesn't kill the kernel, it only dumps the kernel log.
> > And it doesn't taint the kernel either.
>
> Ah I didn't realize that. Sounds like a good option to keep.
>
> > > Instead what I had in mind is to recreate the worst
> > > possible panic context as much as feasible (disabling interrupts should be
> > > a good start, maybe we can even do an nmi callback), and then call our
> > > panic implementation. That way we can test the panic handler in a
> > > non-destructive way (i.e. aside from last dmesg lines printed to the
> > > screen nothing bad happens to the kernel: No real panic, no oops, no
> > > tainting).
> >
> > The interrupt case I can do, nmi I have no idea.
>
> I just read the printk nmi code again and it looks like there's now even
> more special handling for issues happening in nmi context, so we should
> never see an oops from nmi context. So local_irq_disable() wrapping should
> be good enough for testing.
> -Daniel
>

Hmmm, John is on a mission to change that soon:

  - https://lore.kernel.org/lkml/20190212143003.48446-1-john.ogness@linutronix.de

  - https://lwn.net/Articles/780556

A v2 is on the way, and there's a lot of interest from different
groups, including RT, to get that in.

So I guess it would be nice to cover NMI contexts from the start,
not necessarily from a non-destructive CI testing perspective, but
at least while doing normal tests and code reviews.

(Is it possible to do non-destructive NMI tests? I don't know..)

thanks!

--
darwi
http://darwish.chasingpointers.com

Ahmed S. Darwish March 13, 2019, 4:05 a.m. UTC | #19

Hi,

[[ CCing John for the trylock parts ]]

On Mon, Mar 11, 2019 at 11:33:15PM +0100, Noralf Trønnes wrote:
>
> Den 11.03.2019 20.23, skrev Daniel Vetter:
> > On Mon, Mar 11, 2019 at 06:42:16PM +0100, Noralf Trønnes wrote:
> >> This adds support for outputting kernel messages on panic().
> >> A kernel message dumper is used to dump the log. The dumper iterates
> >> over each DRM device and it's crtc's to find suitable framebuffers.
> >>
> >> All the other dumpers are run before this one except mtdoops.
> >> Only atomic drivers are supported.
> >>
> >> Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
> >
> > Bunch of comments/ideas for you or Darwish below, whoever picks this up.
>
> Actually it would ne nice if Darwish could pick it up since he will do
> it on i915 which will be useful to a much broader audience.
> If not I'll respin when I'm done with the drm_fb_helper refactoring.
>

Yup, I'll be more than happy to do this.. while preserving all of
Noralf's authorship and copyright notices of course.

I guess it can be:

  - Handle the comments posted by Daniel and others (I'll post
    some questions too)

  - Add the necessary i915 specific bits

  - Test, post v3/v4/../vn. Rinse and repeat. Keep it local at
    dri-devel until getting the necessary S-o-Bs.

  - Post to wider audience (some feedback from distribution folks
    would also be nice, before posting to lkml)

More comments below..

[...]

> >> +
> >> +static void drm_panic_kmsg_dump(struct kmsg_dumper *dumper,
> >> +				enum kmsg_dump_reason reason)
> >> +{
> >> +	class_for_each_device(drm_class, NULL, dumper, drm_panic_dev_iter);
> >
> > class_for_each_device uses klist, which only uses an irqsave spinlock. I
> > think that's good enough. Comment to that effect would be good e.g.
> >
> > 	/* based on klist, which uses only a spin_lock_irqsave, which we
> > 	 * assume still works */
> >
> > If we aim for perfect this should be a trylock still, maybe using our own
> > device list.
> >

I definitely agree here.

The lock may already be locked either by a stopped CPU, or by the
very same CPU we execute panic() on (e.g. NMI panic() on the
printing CPU).

This is why it's very common for example in serial consoles, which
are usually careful about re-entrance and panic contexts, to do:

  xxxxxx_console_write(...) {
	if (oops_in_progress)
		locked = spin_trylock_irqsave(&port->lock, flags);
	else
		spin_lock_irqsave(&port->lock, flags);
  }

I'm quite positive we should do the same for panic drm drivers.
John?

> >> +}
> >> +
> >> +static struct kmsg_dumper drm_panic_kmsg_dumper = {
> >> +	.dump = drm_panic_kmsg_dump,
> >> +	.max_reason = KMSG_DUMP_PANIC,
> >> +};
> >> +
> >> +static ssize_t drm_panic_file_panic_write(struct file *file,
> >> +					  const char __user *user_buf,
> >> +					  size_t count, loff_t *ppos)
> >> +{
> >> +	unsigned long long val;
> >> +	char buf[24];
> >> +	size_t size;
> >> +	ssize_t ret;
> >> +
> >> +	size = min(sizeof(buf) - 1, count);
> >> +	if (copy_from_user(buf, user_buf, size))
> >> +		return -EFAULT;
> >> +
> >> +	buf[size] = '\0';
> >> +	ret = kstrtoull(buf, 0, &val);
> >> +	if (ret)
> >> +		return ret;
> >> +
> >> +	drm_panic_kmsg_dumper.max_reason = KMSG_DUMP_OOPS;
> >> +	wmb();
> >> +
> >> +	/* Do a real test with: echo c > /proc/sysrq-trigger */
> >> +
> >> +	if (val == 0) {
> >> +		pr_info("Test panic screen using kmsg_dump(OOPS)\n");
> >> +		kmsg_dump(KMSG_DUMP_OOPS);
> >> +	} else if (val == 1) {
> >> +		char *null_pointer = NULL;
> >> +
> >> +		pr_info("Test panic screen using NULL pointer dereference\n");
> >> +		*null_pointer = 1;
> >> +	} else {
> >> +		return -EINVAL;
> >> +	}
> >
> > This isn't quite what I had in mind, since it still kills the kernel (like
> > sysrq-trigger).
>
> If val == 0, it doesn't kill the kernel, it only dumps the kernel log.
> And it doesn't taint the kernel either.
>
> > Instead what I had in mind is to recreate the worst
> > possible panic context as much as feasible (disabling interrupts should be
> > a good start, maybe we can even do an nmi callback), and then call our
> > panic implementation. That way we can test the panic handler in a
> > non-destructive way (i.e. aside from last dmesg lines printed to the
> > screen nothing bad happens to the kernel: No real panic, no oops, no
> > tainting).
>
> The interrupt case I can do, nmi I have no idea.
>

I agree too. Disabling interrupts + CONFIG_DEBUG_ATOMIC_SLEEP
would be a nice non-destructive test-case emulation.

thanks!

--
darwi
http://darwish.chasingpointers.com

John Ogness March 13, 2019, 7:49 a.m. UTC | #20

On 2019-03-12, Ahmed S. Darwish <darwish.07@gmail.com> wrote:
>>>> +
>>>> +static void drm_panic_kmsg_dump(struct kmsg_dumper *dumper,
>>>> +				enum kmsg_dump_reason reason)
>>>> +{
>>>> +	class_for_each_device(drm_class, NULL, dumper, drm_panic_dev_iter);
>>>
>>> class_for_each_device uses klist, which only uses an irqsave
>>> spinlock. I think that's good enough. Comment to that effect would
>>> be good e.g.
>>>
>>> 	/* based on klist, which uses only a spin_lock_irqsave, which we
>>> 	 * assume still works */
>>>
>>> If we aim for perfect this should be a trylock still, maybe using
>>> our own device list.
>>>
>
> I definitely agree here.
>
> The lock may already be locked either by a stopped CPU, or by the
> very same CPU we execute panic() on (e.g. NMI panic() on the
> printing CPU).
>
> This is why it's very common for example in serial consoles, which
> are usually careful about re-entrance and panic contexts, to do:
>
>   xxxxxx_console_write(...) {
> 	if (oops_in_progress)
> 		locked = spin_trylock_irqsave(&port->lock, flags);
> 	else
> 		spin_lock_irqsave(&port->lock, flags);
>   }
>
> I'm quite positive we should do the same for panic drm drivers.

This construction will continue, even if the trylock fails. It only
makes sense to do this if the driver has a chance of being
successful. Ignoring locks is a serious issue. I personally am quite
unhappy that the serial drivers do this, which was part of my motivation
for the new printk design I'm working on.

If the driver is not capable of doing something useful on a failed
trylock, then I recommend just skipping that device. Maybe trying it
again later after trying all the devices?

John Ogness

Christian König March 13, 2019, 8:29 a.m. UTC | #21

Am 12.03.19 um 19:02 schrieb Ville Syrjälä:
> On Tue, Mar 12, 2019 at 06:37:57PM +0100, Noralf Trønnes wrote:
>>
>> Den 12.03.2019 18.25, skrev Ville Syrjälä:
>>> On Tue, Mar 12, 2019 at 06:15:24PM +0100, Noralf Trønnes wrote:
>>>>
>>>> Den 12.03.2019 17.17, skrev Ville Syrjälä:
>>>>> On Tue, Mar 12, 2019 at 11:47:04AM +0100, Michel Dänzer wrote:
>>>>>> On 2019-03-11 6:42 p.m., Noralf Trønnes wrote:
>>>>>>> This adds support for outputting kernel messages on panic().
>>>>>>> A kernel message dumper is used to dump the log. The dumper iterates
>>>>>>> over each DRM device and it's crtc's to find suitable framebuffers.
>>>>>>>
>>>>>>> All the other dumpers are run before this one except mtdoops.
>>>>>>> Only atomic drivers are supported.
>>>>>>>
>>>>>>> Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
>>>>>>> ---
>>>>>>>   [...]
>>>>>>>
>>>>>>> diff --git a/include/drm/drm_framebuffer.h b/include/drm/drm_framebuffer.h
>>>>>>> index f0b34c977ec5..f3274798ecfe 100644
>>>>>>> --- a/include/drm/drm_framebuffer.h
>>>>>>> +++ b/include/drm/drm_framebuffer.h
>>>>>>> @@ -94,6 +94,44 @@ struct drm_framebuffer_funcs {
>>>>>>>   		     struct drm_file *file_priv, unsigned flags,
>>>>>>>   		     unsigned color, struct drm_clip_rect *clips,
>>>>>>>   		     unsigned num_clips);
>>>>>>> +
>>>>>>> +	/**
>>>>>>> +	 * @panic_vmap:
>>>>>>> +	 *
>>>>>>> +	 * Optional callback for panic handling.
>>>>>>> +	 *
>>>>>>> +	 * For vmapping the selected framebuffer in a panic context. Must
>>>>>>> +	 * be super careful about locking (only trylocking allowed).
>>>>>>> +	 *
>>>>>>> +	 * RETURNS:
>>>>>>> +	 *
>>>>>>> +	 * NULL if it didn't work out, otherwise an opaque cookie which is
>>>>>>> +	 * passed to @panic_draw_xy. It can be anything: vmap area, structure
>>>>>>> +	 * with more details, just a few flags, ...
>>>>>>> +	 */
>>>>>>> +	void *(*panic_vmap)(struct drm_framebuffer *fb);
>>>>>> FWIW, the panic_vmap hook cannot work in general with the amdgpu/radeon
>>>>>> drivers:
>>>>>>
>>>>>> Framebuffers are normally tiled, writing to them with the CPU results in
>>>>>> garbled output.
>>>>>>
>>>> In which case the driver needs to support the ->panic_draw_xy callback,
>>>> or maybe it's possible to make a generic helper for tiled buffers.
>>>>
>>>>>> With a discrete GPU having a large amount of VRAM, the framebuffer may
>>>>>> not be directly CPU accessible at all.
>>>>>>
>>>> I would have been nice to know how Windows works around this.
>>>>
>>>>>> There would need to be a mechanism for switching scanout to a linear,
>>>>>> CPU accessible framebuffer.
>>>>> I suppose panic_vmap() could just provide a linear temp buffer
>>>>> to the panic handler, and panic_unmap() could copy the contents
>>>>> over to the real fb.
>>>>>
>>>>> That said, this approach of scribbling over the primary plane's
>>>>> framebuffer has some clear limitations:
>>>>> * something may overwrite the oops message before the user
>>>>>    can even read it
>>>> When the dumper drm_panic_kmsg_dump() runs, the other CPU's should have
>>>> been stopped. See panic().
>>> GPUs etc. may still be executing away.
>>>
>> Would it be safe to stop it in a panic situation? It would ofc be bad to
>> crash the box even harder.
> Some drivers/devices may have working (and hopefully even reliable)
> gpu reset, some may not.

Even if GPU reset is working, it certainly doesn't under a panic() 
condition when all other CPUs are already stopped.

I don't see how this approach should ever work reliable.

Christian.

Daniel Vetter March 13, 2019, 8:35 a.m. UTC | #22

On Tue, Mar 12, 2019 at 11:13:03PM +0100, Ahmed S. Darwish wrote:
> Hi,
> 
> [[ CCing John for the trylock parts ]]
> 
> On Mon, Mar 11, 2019 at 11:33:15PM +0100, Noralf Trønnes wrote:
> >
> >
> > Den 11.03.2019 20.23, skrev Daniel Vetter:
> > > On Mon, Mar 11, 2019 at 06:42:16PM +0100, Noralf Trønnes wrote:
> > >> This adds support for outputting kernel messages on panic().
> > >> A kernel message dumper is used to dump the log. The dumper iterates
> > >> over each DRM device and it's crtc's to find suitable framebuffers.
> > >>
> > >> All the other dumpers are run before this one except mtdoops.
> > >> Only atomic drivers are supported.
> > >>
> > >> Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
> > >
> > > Bunch of comments/ideas for you or Darwish below, whoever picks this up.
> >
> > Actually it would ne nice if Darwish could pick it up since he will do
> > it on i915 which will be useful to a much broader audience.
> > If not I'll respin when I'm done with the drm_fb_helper refactoring.
> >
> 
> Yup, I'll be more than happy to do this.. while preserving all of
> Noralf's authorship and copyright notices of course.
> 
> I guess it can be:
> 
>   - Handle the comments posted by Daniel and others (I'll post
>     some questions too)
> 
>   - Add the necessary i915 specific bits
> 
>   - Test, post v3/v4/../vn. Rinse and repeat. Keep it local at
>     dri-devel until getting the necessary S-o-Bs.
> 
>   - Post to wider audience (some feedback from distribution folks
>     would also be nice, before posting to lkml)
> 
> More comments below..
> 
> [...]
> 
> > >> +
> > >> +static void drm_panic_kmsg_dump(struct kmsg_dumper *dumper,
> > >> +				enum kmsg_dump_reason reason)
> > >> +{
> > >> +	class_for_each_device(drm_class, NULL, dumper, drm_panic_dev_iter);
> > >
> > > class_for_each_device uses klist, which only uses an irqsave spinlock. I
> > > think that's good enough. Comment to that effect would be good e.g.
> > >
> > > 	/* based on klist, which uses only a spin_lock_irqsave, which we
> > > 	 * assume still works */
> > >
> > > If we aim for perfect this should be a trylock still, maybe using our own
> > > device list.
> > >
> 
> I definitely agree here.
> 
> The lock may already be locked either by a stopped CPU, or by the
> very same CPU we execute panic() on (e.g. NMI panic() on the
> printing CPU).
> 
> This is why it's very common for example in serial consoles, which
> are usually careful about re-entrance and panic contexts, to do:
> 
>   xxxxxx_console_write(...) {
> 	if (oops_in_progress)
> 		locked = spin_trylock_irqsave(&port->lock, flags);
> 	else
> 		spin_lock_irqsave(&port->lock, flags);
>   }
> 
> I'm quite positive we should do the same for panic drm drivers.

Yeah Ideally all the locking in the drm path would be trylock only.

I wonder whether lockdep could help us validate this, with some "don't
allow anything except trylocks in this context". It's easy to audit the
core code with review, but drivers are much tougher. And often end up with
really deep callchains to get at the backing buffers.
> John?
> 
> > >> +}
> > >> +
> > >> +static struct kmsg_dumper drm_panic_kmsg_dumper = {
> > >> +	.dump = drm_panic_kmsg_dump,
> > >> +	.max_reason = KMSG_DUMP_PANIC,
> > >> +};
> > >> +
> > >> +static ssize_t drm_panic_file_panic_write(struct file *file,
> > >> +					  const char __user *user_buf,
> > >> +					  size_t count, loff_t *ppos)
> > >> +{
> > >> +	unsigned long long val;
> > >> +	char buf[24];
> > >> +	size_t size;
> > >> +	ssize_t ret;
> > >> +
> > >> +	size = min(sizeof(buf) - 1, count);
> > >> +	if (copy_from_user(buf, user_buf, size))
> > >> +		return -EFAULT;
> > >> +
> > >> +	buf[size] = '\0';
> > >> +	ret = kstrtoull(buf, 0, &val);
> > >> +	if (ret)
> > >> +		return ret;
> > >> +
> > >> +	drm_panic_kmsg_dumper.max_reason = KMSG_DUMP_OOPS;
> > >> +	wmb();
> > >> +
> > >> +	/* Do a real test with: echo c > /proc/sysrq-trigger */
> > >> +
> > >> +	if (val == 0) {
> > >> +		pr_info("Test panic screen using kmsg_dump(OOPS)\n");
> > >> +		kmsg_dump(KMSG_DUMP_OOPS);
> > >> +	} else if (val == 1) {
> > >> +		char *null_pointer = NULL;
> > >> +
> > >> +		pr_info("Test panic screen using NULL pointer dereference\n");
> > >> +		*null_pointer = 1;
> > >> +	} else {
> > >> +		return -EINVAL;
> > >> +	}
> > >
> > > This isn't quite what I had in mind, since it still kills the kernel (like
> > > sysrq-trigger).
> >
> > If val == 0, it doesn't kill the kernel, it only dumps the kernel log.
> > And it doesn't taint the kernel either.
> >
> > > Instead what I had in mind is to recreate the worst
> > > possible panic context as much as feasible (disabling interrupts should be
> > > a good start, maybe we can even do an nmi callback), and then call our
> > > panic implementation. That way we can test the panic handler in a
> > > non-destructive way (i.e. aside from last dmesg lines printed to the
> > > screen nothing bad happens to the kernel: No real panic, no oops, no
> > > tainting).
> >
> > The interrupt case I can do, nmi I have no idea.
> >
> 
> I agree too. Disabling interrupts + CONFIG_DEBUG_ATOMIC_SLEEP
> would be a nice non-destructive test-case emulation.

See above, if we can somehow emulate "all locks are held, only allow
trylock" with lockdep that would be great too.

Plus nmi context, once/if that somehow becomes relevant.

The thing that killed the old drm panic handling code definitely was that
we flat out couldnt' test it except with real oopses. And that's just
whack-a-mole and bug reporter frustration if you first have a few patch
iterations around "oh sry, it's still some oops in the panic handler, not
the first one that we're seeing" :-/
-Daniel

Daniel Vetter March 13, 2019, 8:37 a.m. UTC | #23

On Wed, Mar 13, 2019 at 08:49:17AM +0100, John Ogness wrote:
> On 2019-03-12, Ahmed S. Darwish <darwish.07@gmail.com> wrote:
> >>>> +
> >>>> +static void drm_panic_kmsg_dump(struct kmsg_dumper *dumper,
> >>>> +				enum kmsg_dump_reason reason)
> >>>> +{
> >>>> +	class_for_each_device(drm_class, NULL, dumper, drm_panic_dev_iter);
> >>>
> >>> class_for_each_device uses klist, which only uses an irqsave
> >>> spinlock. I think that's good enough. Comment to that effect would
> >>> be good e.g.
> >>>
> >>> 	/* based on klist, which uses only a spin_lock_irqsave, which we
> >>> 	 * assume still works */
> >>>
> >>> If we aim for perfect this should be a trylock still, maybe using
> >>> our own device list.
> >>>
> >
> > I definitely agree here.
> >
> > The lock may already be locked either by a stopped CPU, or by the
> > very same CPU we execute panic() on (e.g. NMI panic() on the
> > printing CPU).
> >
> > This is why it's very common for example in serial consoles, which
> > are usually careful about re-entrance and panic contexts, to do:
> >
> >   xxxxxx_console_write(...) {
> > 	if (oops_in_progress)
> > 		locked = spin_trylock_irqsave(&port->lock, flags);
> > 	else
> > 		spin_lock_irqsave(&port->lock, flags);
> >   }
> >
> > I'm quite positive we should do the same for panic drm drivers.
> 
> This construction will continue, even if the trylock fails. It only
> makes sense to do this if the driver has a chance of being
> successful. Ignoring locks is a serious issue. I personally am quite
> unhappy that the serial drivers do this, which was part of my motivation
> for the new printk design I'm working on.
> 
> If the driver is not capable of doing something useful on a failed
> trylock, then I recommend just skipping that device. Maybe trying it
> again later after trying all the devices?

Ah yes missed that. If the trylock fails anywhere, we must bail out.

Not sure retrying is useful, my experience from at least drm is that
either you're lucky, and drm wasn't doing anything right when the machine
blew up, and then the trylocks will all go through. Or you're unlucky, and
most likely that means drm itself blew up, and no amount of retrying is
going to help. I wouldn't bother.
-Daniel

Daniel Vetter March 13, 2019, 8:43 a.m. UTC | #24

On Tue, Mar 12, 2019 at 08:02:56PM +0200, Ville Syrjälä wrote:
> On Tue, Mar 12, 2019 at 06:37:57PM +0100, Noralf Trønnes wrote:
> > 
> > 
> > Den 12.03.2019 18.25, skrev Ville Syrjälä:
> > > On Tue, Mar 12, 2019 at 06:15:24PM +0100, Noralf Trønnes wrote:
> > >>
> > >>
> > >> Den 12.03.2019 17.17, skrev Ville Syrjälä:
> > >>> On Tue, Mar 12, 2019 at 11:47:04AM +0100, Michel Dänzer wrote:
> > >>>> On 2019-03-11 6:42 p.m., Noralf Trønnes wrote:
> > >>>>> This adds support for outputting kernel messages on panic().
> > >>>>> A kernel message dumper is used to dump the log. The dumper iterates
> > >>>>> over each DRM device and it's crtc's to find suitable framebuffers.
> > >>>>>
> > >>>>> All the other dumpers are run before this one except mtdoops.
> > >>>>> Only atomic drivers are supported.
> > >>>>>
> > >>>>> Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
> > >>>>> ---
> > >>>>>  [...]
> > >>>>>
> > >>>>> diff --git a/include/drm/drm_framebuffer.h b/include/drm/drm_framebuffer.h
> > >>>>> index f0b34c977ec5..f3274798ecfe 100644
> > >>>>> --- a/include/drm/drm_framebuffer.h
> > >>>>> +++ b/include/drm/drm_framebuffer.h
> > >>>>> @@ -94,6 +94,44 @@ struct drm_framebuffer_funcs {
> > >>>>>  		     struct drm_file *file_priv, unsigned flags,
> > >>>>>  		     unsigned color, struct drm_clip_rect *clips,
> > >>>>>  		     unsigned num_clips);
> > >>>>> +
> > >>>>> +	/**
> > >>>>> +	 * @panic_vmap:
> > >>>>> +	 *
> > >>>>> +	 * Optional callback for panic handling.
> > >>>>> +	 *
> > >>>>> +	 * For vmapping the selected framebuffer in a panic context. Must
> > >>>>> +	 * be super careful about locking (only trylocking allowed).
> > >>>>> +	 *
> > >>>>> +	 * RETURNS:
> > >>>>> +	 *
> > >>>>> +	 * NULL if it didn't work out, otherwise an opaque cookie which is
> > >>>>> +	 * passed to @panic_draw_xy. It can be anything: vmap area, structure
> > >>>>> +	 * with more details, just a few flags, ...
> > >>>>> +	 */
> > >>>>> +	void *(*panic_vmap)(struct drm_framebuffer *fb);
> > >>>>
> > >>>> FWIW, the panic_vmap hook cannot work in general with the amdgpu/radeon
> > >>>> drivers:
> > >>>>
> > >>>> Framebuffers are normally tiled, writing to them with the CPU results in
> > >>>> garbled output.
> > >>>>
> > >>
> > >> In which case the driver needs to support the ->panic_draw_xy callback,
> > >> or maybe it's possible to make a generic helper for tiled buffers.

I've proposed somewhere else that we rename panic_vmap to panic_prepare,
and the vmap pointer to an abstract cookie. Then the driver can do
whatever it wants too, e.g. in ->panic_prepare it does a few trylocks to
get at the buffer and make sure it can set up a temporary pte to write
into it page-by-page. ->panic_draw_xy can then do whatever it needs to do,
using the opaque void *cookie.

And if the trylock fails you just return NULL from ->panic_prepare

And ->panic_cleanup would be just to clean up the mess for the validation
use-case when running this from debugfs.

> > >>
> > >>>> With a discrete GPU having a large amount of VRAM, the framebuffer may
> > >>>> not be directly CPU accessible at all.
> > >>>>
> > >>
> > >> I would have been nice to know how Windows works around this.
> > >>
> > >>>>
> > >>>> There would need to be a mechanism for switching scanout to a linear,
> > >>>> CPU accessible framebuffer.
> > >>>
> > >>> I suppose panic_vmap() could just provide a linear temp buffer
> > >>> to the panic handler, and panic_unmap() could copy the contents
> > >>> over to the real fb.
> > >>>
> > >>> That said, this approach of scribbling over the primary plane's
> > >>> framebuffer has some clear limitations:
> > >>> * something may overwrite the oops message before the user
> > >>>   can even read it
> > >>
> > >> When the dumper drm_panic_kmsg_dump() runs, the other CPU's should have
> > >> been stopped. See panic().
> > > 
> > > GPUs etc. may still be executing away.
> > > 
> > 
> > Would it be safe to stop it in a panic situation? It would ofc be bad to
> > crash the box even harder.
> 
> Some drivers/devices may have working (and hopefully even reliable)
> gpu reset, some may not.

I don't think touching the gpu is a good idea. Even disabling planes and
all that feels risky. And there's really not much working anymore in panic
context, we can't even schedule a worker/timer to redraw the panic output
a bit later.
-Daniel

Michel Dänzer March 13, 2019, 9:35 a.m. UTC | #25

On 2019-03-12 6:15 p.m., Noralf Trønnes wrote:
> 
> 
> Den 12.03.2019 17.17, skrev Ville Syrjälä:
>> On Tue, Mar 12, 2019 at 11:47:04AM +0100, Michel Dänzer wrote:
>>> On 2019-03-11 6:42 p.m., Noralf Trønnes wrote:
>>>> This adds support for outputting kernel messages on panic().
>>>> A kernel message dumper is used to dump the log. The dumper iterates
>>>> over each DRM device and it's crtc's to find suitable framebuffers.
>>>>
>>>> All the other dumpers are run before this one except mtdoops.
>>>> Only atomic drivers are supported.
>>>>
>>>> Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
>>>> ---
>>>>  [...]
>>>>
>>>> diff --git a/include/drm/drm_framebuffer.h b/include/drm/drm_framebuffer.h
>>>> index f0b34c977ec5..f3274798ecfe 100644
>>>> --- a/include/drm/drm_framebuffer.h
>>>> +++ b/include/drm/drm_framebuffer.h
>>>> @@ -94,6 +94,44 @@ struct drm_framebuffer_funcs {
>>>>  		     struct drm_file *file_priv, unsigned flags,
>>>>  		     unsigned color, struct drm_clip_rect *clips,
>>>>  		     unsigned num_clips);
>>>> +
>>>> +	/**
>>>> +	 * @panic_vmap:
>>>> +	 *
>>>> +	 * Optional callback for panic handling.
>>>> +	 *
>>>> +	 * For vmapping the selected framebuffer in a panic context. Must
>>>> +	 * be super careful about locking (only trylocking allowed).
>>>> +	 *
>>>> +	 * RETURNS:
>>>> +	 *
>>>> +	 * NULL if it didn't work out, otherwise an opaque cookie which is
>>>> +	 * passed to @panic_draw_xy. It can be anything: vmap area, structure
>>>> +	 * with more details, just a few flags, ...
>>>> +	 */
>>>> +	void *(*panic_vmap)(struct drm_framebuffer *fb);
>>>
>>> FWIW, the panic_vmap hook cannot work in general with the amdgpu/radeon
>>> drivers:
>>>
>>> Framebuffers are normally tiled, writing to them with the CPU results in
>>> garbled output.
>>>
> 
> In which case the driver needs to support the ->panic_draw_xy callback,
> or maybe it's possible to make a generic helper for tiled buffers.

I'm afraid that won't help, at least not without porting big chunks of
https://gitlab.freedesktop.org/mesa/mesa/tree/master/src/amd/addrlib
into the kernel, none of which will be used for anything else.


>>> There would need to be a mechanism for switching scanout to a linear,
>>> CPU accessible framebuffer.
>>
>> I suppose panic_vmap() could just provide a linear temp buffer
>> to the panic handler, and panic_unmap() could copy the contents
>> over to the real fb.

Copy how? Using a GPU engine?

Noralf Trønnes March 13, 2019, 10:24 a.m. UTC | #26

Den 12.03.2019 23.13, skrev Ahmed S. Darwish:
> Hi,
> 
> [[ CCing John for the trylock parts ]]
> 
> On Mon, Mar 11, 2019 at 11:33:15PM +0100, Noralf Trønnes wrote:
>>
>>
>> Den 11.03.2019 20.23, skrev Daniel Vetter:
>>> On Mon, Mar 11, 2019 at 06:42:16PM +0100, Noralf Trønnes wrote:
>>>> This adds support for outputting kernel messages on panic().
>>>> A kernel message dumper is used to dump the log. The dumper iterates
>>>> over each DRM device and it's crtc's to find suitable framebuffers.
>>>>
>>>> All the other dumpers are run before this one except mtdoops.
>>>> Only atomic drivers are supported.
>>>>
>>>> Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
>>>
>>> Bunch of comments/ideas for you or Darwish below, whoever picks this up.
>>
>> Actually it would ne nice if Darwish could pick it up since he will do
>> it on i915 which will be useful to a much broader audience.
>> If not I'll respin when I'm done with the drm_fb_helper refactoring.
>>
> 
> Yup, I'll be more than happy to do this.. 

Thanks for doing that.

Noralf.

> while preserving all of
> Noralf's authorship and copyright notices of course.
> 
> I guess it can be:
> 
>   - Handle the comments posted by Daniel and others (I'll post
>     some questions too)
> 
>   - Add the necessary i915 specific bits
> 
>   - Test, post v3/v4/../vn. Rinse and repeat. Keep it local at
>     dri-devel until getting the necessary S-o-Bs.
> 
>   - Post to wider audience (some feedback from distribution folks
>     would also be nice, before posting to lkml)
> 
> More comments below..
> 
> [...]
> 
>>>> +
>>>> +static void drm_panic_kmsg_dump(struct kmsg_dumper *dumper,
>>>> +				enum kmsg_dump_reason reason)
>>>> +{
>>>> +	class_for_each_device(drm_class, NULL, dumper, drm_panic_dev_iter);
>>>
>>> class_for_each_device uses klist, which only uses an irqsave spinlock. I
>>> think that's good enough. Comment to that effect would be good e.g.
>>>
>>> 	/* based on klist, which uses only a spin_lock_irqsave, which we
>>> 	 * assume still works */
>>>
>>> If we aim for perfect this should be a trylock still, maybe using our own
>>> device list.
>>>
> 
> I definitely agree here.
> 
> The lock may already be locked either by a stopped CPU, or by the
> very same CPU we execute panic() on (e.g. NMI panic() on the
> printing CPU).
> 
> This is why it's very common for example in serial consoles, which
> are usually careful about re-entrance and panic contexts, to do:
> 
>   xxxxxx_console_write(...) {
> 	if (oops_in_progress)
> 		locked = spin_trylock_irqsave(&port->lock, flags);
> 	else
> 		spin_lock_irqsave(&port->lock, flags);
>   }
> 
> I'm quite positive we should do the same for panic drm drivers.
> John?
> 
>>>> +}
>>>> +
>>>> +static struct kmsg_dumper drm_panic_kmsg_dumper = {
>>>> +	.dump = drm_panic_kmsg_dump,
>>>> +	.max_reason = KMSG_DUMP_PANIC,
>>>> +};
>>>> +
>>>> +static ssize_t drm_panic_file_panic_write(struct file *file,
>>>> +					  const char __user *user_buf,
>>>> +					  size_t count, loff_t *ppos)
>>>> +{
>>>> +	unsigned long long val;
>>>> +	char buf[24];
>>>> +	size_t size;
>>>> +	ssize_t ret;
>>>> +
>>>> +	size = min(sizeof(buf) - 1, count);
>>>> +	if (copy_from_user(buf, user_buf, size))
>>>> +		return -EFAULT;
>>>> +
>>>> +	buf[size] = '\0';
>>>> +	ret = kstrtoull(buf, 0, &val);
>>>> +	if (ret)
>>>> +		return ret;
>>>> +
>>>> +	drm_panic_kmsg_dumper.max_reason = KMSG_DUMP_OOPS;
>>>> +	wmb();
>>>> +
>>>> +	/* Do a real test with: echo c > /proc/sysrq-trigger */
>>>> +
>>>> +	if (val == 0) {
>>>> +		pr_info("Test panic screen using kmsg_dump(OOPS)\n");
>>>> +		kmsg_dump(KMSG_DUMP_OOPS);
>>>> +	} else if (val == 1) {
>>>> +		char *null_pointer = NULL;
>>>> +
>>>> +		pr_info("Test panic screen using NULL pointer dereference\n");
>>>> +		*null_pointer = 1;
>>>> +	} else {
>>>> +		return -EINVAL;
>>>> +	}
>>>
>>> This isn't quite what I had in mind, since it still kills the kernel (like
>>> sysrq-trigger).
>>
>> If val == 0, it doesn't kill the kernel, it only dumps the kernel log.
>> And it doesn't taint the kernel either.
>>
>>> Instead what I had in mind is to recreate the worst
>>> possible panic context as much as feasible (disabling interrupts should be
>>> a good start, maybe we can even do an nmi callback), and then call our
>>> panic implementation. That way we can test the panic handler in a
>>> non-destructive way (i.e. aside from last dmesg lines printed to the
>>> screen nothing bad happens to the kernel: No real panic, no oops, no
>>> tainting).
>>
>> The interrupt case I can do, nmi I have no idea.
>>
> 
> I agree too. Disabling interrupts + CONFIG_DEBUG_ATOMIC_SLEEP
> would be a nice non-destructive test-case emulation.
> 
> thanks!
> 
> --
> darwi
> http://darwish.chasingpointers.com
>

Ville Syrjälä March 13, 2019, 1:31 p.m. UTC | #27

On Wed, Mar 13, 2019 at 10:35:08AM +0100, Michel Dänzer wrote:
> On 2019-03-12 6:15 p.m., Noralf Trønnes wrote:
> > 
> > 
> > Den 12.03.2019 17.17, skrev Ville Syrjälä:
> >> On Tue, Mar 12, 2019 at 11:47:04AM +0100, Michel Dänzer wrote:
> >>> On 2019-03-11 6:42 p.m., Noralf Trønnes wrote:
> >>>> This adds support for outputting kernel messages on panic().
> >>>> A kernel message dumper is used to dump the log. The dumper iterates
> >>>> over each DRM device and it's crtc's to find suitable framebuffers.
> >>>>
> >>>> All the other dumpers are run before this one except mtdoops.
> >>>> Only atomic drivers are supported.
> >>>>
> >>>> Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
> >>>> ---
> >>>>  [...]
> >>>>
> >>>> diff --git a/include/drm/drm_framebuffer.h b/include/drm/drm_framebuffer.h
> >>>> index f0b34c977ec5..f3274798ecfe 100644
> >>>> --- a/include/drm/drm_framebuffer.h
> >>>> +++ b/include/drm/drm_framebuffer.h
> >>>> @@ -94,6 +94,44 @@ struct drm_framebuffer_funcs {
> >>>>  		     struct drm_file *file_priv, unsigned flags,
> >>>>  		     unsigned color, struct drm_clip_rect *clips,
> >>>>  		     unsigned num_clips);
> >>>> +
> >>>> +	/**
> >>>> +	 * @panic_vmap:
> >>>> +	 *
> >>>> +	 * Optional callback for panic handling.
> >>>> +	 *
> >>>> +	 * For vmapping the selected framebuffer in a panic context. Must
> >>>> +	 * be super careful about locking (only trylocking allowed).
> >>>> +	 *
> >>>> +	 * RETURNS:
> >>>> +	 *
> >>>> +	 * NULL if it didn't work out, otherwise an opaque cookie which is
> >>>> +	 * passed to @panic_draw_xy. It can be anything: vmap area, structure
> >>>> +	 * with more details, just a few flags, ...
> >>>> +	 */
> >>>> +	void *(*panic_vmap)(struct drm_framebuffer *fb);
> >>>
> >>> FWIW, the panic_vmap hook cannot work in general with the amdgpu/radeon
> >>> drivers:
> >>>
> >>> Framebuffers are normally tiled, writing to them with the CPU results in
> >>> garbled output.
> >>>
> > 
> > In which case the driver needs to support the ->panic_draw_xy callback,
> > or maybe it's possible to make a generic helper for tiled buffers.
> 
> I'm afraid that won't help, at least not without porting big chunks of
> https://gitlab.freedesktop.org/mesa/mesa/tree/master/src/amd/addrlib
> into the kernel, none of which will be used for anything else.
> 
> 
> >>> There would need to be a mechanism for switching scanout to a linear,
> >>> CPU accessible framebuffer.
> >>
> >> I suppose panic_vmap() could just provide a linear temp buffer
> >> to the panic handler, and panic_unmap() could copy the contents
> >> over to the real fb.
> 
> Copy how? Using a GPU engine?

CPU maybe? Though I suppose that won't work if the buffer isn't CPU
accesible :/

Christian König March 13, 2019, 1:37 p.m. UTC | #28

Am 13.03.19 um 14:31 schrieb Ville Syrjälä:
> On Wed, Mar 13, 2019 at 10:35:08AM +0100, Michel Dänzer wrote:
>> On 2019-03-12 6:15 p.m., Noralf Trønnes wrote:
>>>
>>> Den 12.03.2019 17.17, skrev Ville Syrjälä:
>>>> On Tue, Mar 12, 2019 at 11:47:04AM +0100, Michel Dänzer wrote:
>>>>> On 2019-03-11 6:42 p.m., Noralf Trønnes wrote:
>>>>>> This adds support for outputting kernel messages on panic().
>>>>>> A kernel message dumper is used to dump the log. The dumper iterates
>>>>>> over each DRM device and it's crtc's to find suitable framebuffers.
>>>>>>
>>>>>> All the other dumpers are run before this one except mtdoops.
>>>>>> Only atomic drivers are supported.
>>>>>>
>>>>>> Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
>>>>>> ---
>>>>>>   [...]
>>>>>>
>>>>>> diff --git a/include/drm/drm_framebuffer.h b/include/drm/drm_framebuffer.h
>>>>>> index f0b34c977ec5..f3274798ecfe 100644
>>>>>> --- a/include/drm/drm_framebuffer.h
>>>>>> +++ b/include/drm/drm_framebuffer.h
>>>>>> @@ -94,6 +94,44 @@ struct drm_framebuffer_funcs {
>>>>>>   		     struct drm_file *file_priv, unsigned flags,
>>>>>>   		     unsigned color, struct drm_clip_rect *clips,
>>>>>>   		     unsigned num_clips);
>>>>>> +
>>>>>> +	/**
>>>>>> +	 * @panic_vmap:
>>>>>> +	 *
>>>>>> +	 * Optional callback for panic handling.
>>>>>> +	 *
>>>>>> +	 * For vmapping the selected framebuffer in a panic context. Must
>>>>>> +	 * be super careful about locking (only trylocking allowed).
>>>>>> +	 *
>>>>>> +	 * RETURNS:
>>>>>> +	 *
>>>>>> +	 * NULL if it didn't work out, otherwise an opaque cookie which is
>>>>>> +	 * passed to @panic_draw_xy. It can be anything: vmap area, structure
>>>>>> +	 * with more details, just a few flags, ...
>>>>>> +	 */
>>>>>> +	void *(*panic_vmap)(struct drm_framebuffer *fb);
>>>>> FWIW, the panic_vmap hook cannot work in general with the amdgpu/radeon
>>>>> drivers:
>>>>>
>>>>> Framebuffers are normally tiled, writing to them with the CPU results in
>>>>> garbled output.
>>>>>
>>> In which case the driver needs to support the ->panic_draw_xy callback,
>>> or maybe it's possible to make a generic helper for tiled buffers.
>> I'm afraid that won't help, at least not without porting big chunks of
>> https://gitlab.freedesktop.org/mesa/mesa/tree/master/src/amd/addrlib
>> into the kernel, none of which will be used for anything else.
>>
>>
>>>>> There would need to be a mechanism for switching scanout to a linear,
>>>>> CPU accessible framebuffer.
>>>> I suppose panic_vmap() could just provide a linear temp buffer
>>>> to the panic handler, and panic_unmap() could copy the contents
>>>> over to the real fb.
>> Copy how? Using a GPU engine?
> CPU maybe? Though I suppose that won't work if the buffer isn't CPU
> accesible :/

Well we do have a debug path for accessing invisible memory with the CPU.

E.g. three registers: DATA and auto increment OFFSET_LO/HI. So you can 
just read/write DATA over and over again if you want to access some memory.

But turning of tilling etc is still extremely tricky when the system is 
already unstable.

I mean from two miles high it looks like a nice to have feature, but up 
close is a different picture...

Christian.

Michel Dänzer March 13, 2019, 3:38 p.m. UTC | #29

On 2019-03-13 2:37 p.m., Christian König wrote:
> Am 13.03.19 um 14:31 schrieb Ville Syrjälä:
>> On Wed, Mar 13, 2019 at 10:35:08AM +0100, Michel Dänzer wrote:
>>> On 2019-03-12 6:15 p.m., Noralf Trønnes wrote:
>>>>
>>>> Den 12.03.2019 17.17, skrev Ville Syrjälä:
>>>>> On Tue, Mar 12, 2019 at 11:47:04AM +0100, Michel Dänzer wrote:
>>>>>> On 2019-03-11 6:42 p.m., Noralf Trønnes wrote:
>>>>>>> This adds support for outputting kernel messages on panic().
>>>>>>> A kernel message dumper is used to dump the log. The dumper iterates
>>>>>>> over each DRM device and it's crtc's to find suitable framebuffers.
>>>>>>>
>>>>>>> All the other dumpers are run before this one except mtdoops.
>>>>>>> Only atomic drivers are supported.
>>>>>>>
>>>>>>> Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
>>>>>>> ---
>>>>>>>   [...]
>>>>>>>
>>>>>>> diff --git a/include/drm/drm_framebuffer.h
>>>>>>> b/include/drm/drm_framebuffer.h
>>>>>>> index f0b34c977ec5..f3274798ecfe 100644
>>>>>>> --- a/include/drm/drm_framebuffer.h
>>>>>>> +++ b/include/drm/drm_framebuffer.h
>>>>>>> @@ -94,6 +94,44 @@ struct drm_framebuffer_funcs {
>>>>>>>                struct drm_file *file_priv, unsigned flags,
>>>>>>>                unsigned color, struct drm_clip_rect *clips,
>>>>>>>                unsigned num_clips);
>>>>>>> +
>>>>>>> +    /**
>>>>>>> +     * @panic_vmap:
>>>>>>> +     *
>>>>>>> +     * Optional callback for panic handling.
>>>>>>> +     *
>>>>>>> +     * For vmapping the selected framebuffer in a panic context.
>>>>>>> Must
>>>>>>> +     * be super careful about locking (only trylocking allowed).
>>>>>>> +     *
>>>>>>> +     * RETURNS:
>>>>>>> +     *
>>>>>>> +     * NULL if it didn't work out, otherwise an opaque cookie
>>>>>>> which is
>>>>>>> +     * passed to @panic_draw_xy. It can be anything: vmap area,
>>>>>>> structure
>>>>>>> +     * with more details, just a few flags, ...
>>>>>>> +     */
>>>>>>> +    void *(*panic_vmap)(struct drm_framebuffer *fb);
>>>>>> FWIW, the panic_vmap hook cannot work in general with the
>>>>>> amdgpu/radeon
>>>>>> drivers:
>>>>>>
>>>>>> Framebuffers are normally tiled, writing to them with the CPU
>>>>>> results in
>>>>>> garbled output.
>>>>>>
>>>> In which case the driver needs to support the ->panic_draw_xy callback,
>>>> or maybe it's possible to make a generic helper for tiled buffers.
>>> I'm afraid that won't help, at least not without porting big chunks of
>>> https://gitlab.freedesktop.org/mesa/mesa/tree/master/src/amd/addrlib
>>> into the kernel, none of which will be used for anything else.
>>>
>>>
>>>>>> There would need to be a mechanism for switching scanout to a linear,
>>>>>> CPU accessible framebuffer.
>>>>> I suppose panic_vmap() could just provide a linear temp buffer
>>>>> to the panic handler, and panic_unmap() could copy the contents
>>>>> over to the real fb.
>>> Copy how? Using a GPU engine?
>> CPU maybe? Though I suppose that won't work if the buffer isn't CPU
>> accesible :/
> 
> Well we do have a debug path for accessing invisible memory with the CPU.
> 
> E.g. three registers: DATA and auto increment OFFSET_LO/HI. So you can
> just read/write DATA over and over again if you want to access some memory.

Right. I assume that'll be very slow, but I guess it could do when the
memory isn't directly CPU accessible.


> But turning of tilling etc is still extremely tricky when the system is
> already unstable.

Maybe we could add a little hook to the display code, which just
disables tiling for scanout and maybe disables non-primary planes, but
doesn't touch anything else. Harry / Nicholas, does that seem feasible?


I'm coming around from "this is never going to work" to "it might
actually work" with our hardware...

Christian König March 13, 2019, 3:54 p.m. UTC | #30

Am 13.03.19 um 16:38 schrieb Michel Dänzer:
> On 2019-03-13 2:37 p.m., Christian König wrote:
>> Am 13.03.19 um 14:31 schrieb Ville Syrjälä:
>>> On Wed, Mar 13, 2019 at 10:35:08AM +0100, Michel Dänzer wrote:
>>>> On 2019-03-12 6:15 p.m., Noralf Trønnes wrote:
>>>>> Den 12.03.2019 17.17, skrev Ville Syrjälä:
>>>>>> On Tue, Mar 12, 2019 at 11:47:04AM +0100, Michel Dänzer wrote:
>>>>>>> On 2019-03-11 6:42 p.m., Noralf Trønnes wrote:
>>>>>>>> This adds support for outputting kernel messages on panic().
>>>>>>>> A kernel message dumper is used to dump the log. The dumper iterates
>>>>>>>> over each DRM device and it's crtc's to find suitable framebuffers.
>>>>>>>>
>>>>>>>> All the other dumpers are run before this one except mtdoops.
>>>>>>>> Only atomic drivers are supported.
>>>>>>>>
>>>>>>>> Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
>>>>>>>> ---
>>>>>>>>    [...]
>>>>>>>>
>>>>>>>> diff --git a/include/drm/drm_framebuffer.h
>>>>>>>> b/include/drm/drm_framebuffer.h
>>>>>>>> index f0b34c977ec5..f3274798ecfe 100644
>>>>>>>> --- a/include/drm/drm_framebuffer.h
>>>>>>>> +++ b/include/drm/drm_framebuffer.h
>>>>>>>> @@ -94,6 +94,44 @@ struct drm_framebuffer_funcs {
>>>>>>>>                 struct drm_file *file_priv, unsigned flags,
>>>>>>>>                 unsigned color, struct drm_clip_rect *clips,
>>>>>>>>                 unsigned num_clips);
>>>>>>>> +
>>>>>>>> +    /**
>>>>>>>> +     * @panic_vmap:
>>>>>>>> +     *
>>>>>>>> +     * Optional callback for panic handling.
>>>>>>>> +     *
>>>>>>>> +     * For vmapping the selected framebuffer in a panic context.
>>>>>>>> Must
>>>>>>>> +     * be super careful about locking (only trylocking allowed).
>>>>>>>> +     *
>>>>>>>> +     * RETURNS:
>>>>>>>> +     *
>>>>>>>> +     * NULL if it didn't work out, otherwise an opaque cookie
>>>>>>>> which is
>>>>>>>> +     * passed to @panic_draw_xy. It can be anything: vmap area,
>>>>>>>> structure
>>>>>>>> +     * with more details, just a few flags, ...
>>>>>>>> +     */
>>>>>>>> +    void *(*panic_vmap)(struct drm_framebuffer *fb);
>>>>>>> FWIW, the panic_vmap hook cannot work in general with the
>>>>>>> amdgpu/radeon
>>>>>>> drivers:
>>>>>>>
>>>>>>> Framebuffers are normally tiled, writing to them with the CPU
>>>>>>> results in
>>>>>>> garbled output.
>>>>>>>
>>>>> In which case the driver needs to support the ->panic_draw_xy callback,
>>>>> or maybe it's possible to make a generic helper for tiled buffers.
>>>> I'm afraid that won't help, at least not without porting big chunks of
>>>> https://gitlab.freedesktop.org/mesa/mesa/tree/master/src/amd/addrlib
>>>> into the kernel, none of which will be used for anything else.
>>>>
>>>>
>>>>>>> There would need to be a mechanism for switching scanout to a linear,
>>>>>>> CPU accessible framebuffer.
>>>>>> I suppose panic_vmap() could just provide a linear temp buffer
>>>>>> to the panic handler, and panic_unmap() could copy the contents
>>>>>> over to the real fb.
>>>> Copy how? Using a GPU engine?
>>> CPU maybe? Though I suppose that won't work if the buffer isn't CPU
>>> accesible :/
>> Well we do have a debug path for accessing invisible memory with the CPU.
>>
>> E.g. three registers: DATA and auto increment OFFSET_LO/HI. So you can
>> just read/write DATA over and over again if you want to access some memory.
> Right. I assume that'll be very slow, but I guess it could do when the
> memory isn't directly CPU accessible.

Just made a quick test and reading 33423360 bytes (4096x2040x4) using 
that interfaces takes about 13 seconds.

IIRC we don't use the auto increment optimization yet, so that can 
probably be improved by a factor of 3 or more.

>> But turning of tilling etc is still extremely tricky when the system is
>> already unstable.
> Maybe we could add a little hook to the display code, which just
> disables tiling for scanout and maybe disables non-primary planes, but
> doesn't touch anything else. Harry / Nicholas, does that seem feasible?
>
>
> I'm coming around from "this is never going to work" to "it might
> actually work" with our hardware...

Yeah, agree. It's a bit tricky, but doable.

Takeaway for Noralf is that this whole vmap on panic won't even remotely 
work. We need to get the data byte by byte without a page mapping if 
that is ever going to fly.

Christian.

>
>

Kazlauskas, Nicholas March 13, 2019, 4:16 p.m. UTC | #31

On 3/13/19 11:54 AM, Christian König wrote:
> Am 13.03.19 um 16:38 schrieb Michel Dänzer:
>> On 2019-03-13 2:37 p.m., Christian König wrote:
>>> Am 13.03.19 um 14:31 schrieb Ville Syrjälä:
>>>> On Wed, Mar 13, 2019 at 10:35:08AM +0100, Michel Dänzer wrote:
>>>>> On 2019-03-12 6:15 p.m., Noralf Trønnes wrote:
>>>>>> Den 12.03.2019 17.17, skrev Ville Syrjälä:
>>>>>>> On Tue, Mar 12, 2019 at 11:47:04AM +0100, Michel Dänzer wrote:
>>>>>>>> On 2019-03-11 6:42 p.m., Noralf Trønnes wrote:
>>>>>>>>> This adds support for outputting kernel messages on panic().
>>>>>>>>> A kernel message dumper is used to dump the log. The dumper 
>>>>>>>>> iterates
>>>>>>>>> over each DRM device and it's crtc's to find suitable 
>>>>>>>>> framebuffers.
>>>>>>>>>
>>>>>>>>> All the other dumpers are run before this one except mtdoops.
>>>>>>>>> Only atomic drivers are supported.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
>>>>>>>>> ---
>>>>>>>>>    [...]
>>>>>>>>>
>>>>>>>>> diff --git a/include/drm/drm_framebuffer.h
>>>>>>>>> b/include/drm/drm_framebuffer.h
>>>>>>>>> index f0b34c977ec5..f3274798ecfe 100644
>>>>>>>>> --- a/include/drm/drm_framebuffer.h
>>>>>>>>> +++ b/include/drm/drm_framebuffer.h
>>>>>>>>> @@ -94,6 +94,44 @@ struct drm_framebuffer_funcs {
>>>>>>>>>                 struct drm_file *file_priv, unsigned flags,
>>>>>>>>>                 unsigned color, struct drm_clip_rect *clips,
>>>>>>>>>                 unsigned num_clips);
>>>>>>>>> +
>>>>>>>>> +    /**
>>>>>>>>> +     * @panic_vmap:
>>>>>>>>> +     *
>>>>>>>>> +     * Optional callback for panic handling.
>>>>>>>>> +     *
>>>>>>>>> +     * For vmapping the selected framebuffer in a panic context.
>>>>>>>>> Must
>>>>>>>>> +     * be super careful about locking (only trylocking allowed).
>>>>>>>>> +     *
>>>>>>>>> +     * RETURNS:
>>>>>>>>> +     *
>>>>>>>>> +     * NULL if it didn't work out, otherwise an opaque cookie
>>>>>>>>> which is
>>>>>>>>> +     * passed to @panic_draw_xy. It can be anything: vmap area,
>>>>>>>>> structure
>>>>>>>>> +     * with more details, just a few flags, ...
>>>>>>>>> +     */
>>>>>>>>> +    void *(*panic_vmap)(struct drm_framebuffer *fb);
>>>>>>>> FWIW, the panic_vmap hook cannot work in general with the
>>>>>>>> amdgpu/radeon
>>>>>>>> drivers:
>>>>>>>>
>>>>>>>> Framebuffers are normally tiled, writing to them with the CPU
>>>>>>>> results in
>>>>>>>> garbled output.
>>>>>>>>
>>>>>> In which case the driver needs to support the ->panic_draw_xy 
>>>>>> callback,
>>>>>> or maybe it's possible to make a generic helper for tiled buffers.
>>>>> I'm afraid that won't help, at least not without porting big chunks of
>>>>> https://gitlab.freedesktop.org/mesa/mesa/tree/master/src/amd/addrlib
>>>>> into the kernel, none of which will be used for anything else.
>>>>>
>>>>>
>>>>>>>> There would need to be a mechanism for switching scanout to a 
>>>>>>>> linear,
>>>>>>>> CPU accessible framebuffer.
>>>>>>> I suppose panic_vmap() could just provide a linear temp buffer
>>>>>>> to the panic handler, and panic_unmap() could copy the contents
>>>>>>> over to the real fb.
>>>>> Copy how? Using a GPU engine?
>>>> CPU maybe? Though I suppose that won't work if the buffer isn't CPU
>>>> accesible :/
>>> Well we do have a debug path for accessing invisible memory with the 
>>> CPU.
>>>
>>> E.g. three registers: DATA and auto increment OFFSET_LO/HI. So you can
>>> just read/write DATA over and over again if you want to access some 
>>> memory.
>> Right. I assume that'll be very slow, but I guess it could do when the
>> memory isn't directly CPU accessible.
> 
> Just made a quick test and reading 33423360 bytes (4096x2040x4) using 
> that interfaces takes about 13 seconds.
> 
> IIRC we don't use the auto increment optimization yet, so that can 
> probably be improved by a factor of 3 or more.
> 
>>> But turning of tilling etc is still extremely tricky when the system is
>>> already unstable.
>> Maybe we could add a little hook to the display code, which just
>> disables tiling for scanout and maybe disables non-primary planes, but
>> doesn't touch anything else. Harry / Nicholas, does that seem feasible?
>>
>>
>> I'm coming around from "this is never going to work" to "it might
>> actually work" with our hardware...
> 
> Yeah, agree. It's a bit tricky, but doable.

A "disable_tiling" hook or something along those lines could work for 
display. It's a little bit non trivial when you want to start dealing 
with locking and any active DRM commits, but we have a global lock 
around all our hardware programming anyway that makes that easier to 
deal with.

I think we can just re-commit and update the existing hardware state 
with only the tiling info for every plane reset to off. For most buffers 
I don't think we'd have to really consider changing anything else here 
as long as you respect the current FB size and pitch.

Nicholas Kazlauskas

> 
> Takeaway for Noralf is that this whole vmap on panic won't even remotely 
> work. We need to get the data byte by byte without a page mapping if 
> that is ever going to fly.
> 
> Christian.
> 
>>
>>
>

Christian König March 13, 2019, 5:30 p.m. UTC | #32

Am 13.03.19 um 17:16 schrieb Kazlauskas, Nicholas:
> On 3/13/19 11:54 AM, Christian König wrote:
>> Am 13.03.19 um 16:38 schrieb Michel Dänzer:
>>> On 2019-03-13 2:37 p.m., Christian König wrote:
>>>> Am 13.03.19 um 14:31 schrieb Ville Syrjälä:
>>>>> On Wed, Mar 13, 2019 at 10:35:08AM +0100, Michel Dänzer wrote:
>>>>>> On 2019-03-12 6:15 p.m., Noralf Trønnes wrote:
>>>>>>> Den 12.03.2019 17.17, skrev Ville Syrjälä:
>>>>>>>> On Tue, Mar 12, 2019 at 11:47:04AM +0100, Michel Dänzer wrote:
>>>>>>>>> On 2019-03-11 6:42 p.m., Noralf Trønnes wrote:
>>>>>>>>>> This adds support for outputting kernel messages on panic().
>>>>>>>>>> A kernel message dumper is used to dump the log. The dumper
>>>>>>>>>> iterates
>>>>>>>>>> over each DRM device and it's crtc's to find suitable
>>>>>>>>>> framebuffers.
>>>>>>>>>>
>>>>>>>>>> All the other dumpers are run before this one except mtdoops.
>>>>>>>>>> Only atomic drivers are supported.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
>>>>>>>>>> ---
>>>>>>>>>>     [...]
>>>>>>>>>>
>>>>>>>>>> diff --git a/include/drm/drm_framebuffer.h
>>>>>>>>>> b/include/drm/drm_framebuffer.h
>>>>>>>>>> index f0b34c977ec5..f3274798ecfe 100644
>>>>>>>>>> --- a/include/drm/drm_framebuffer.h
>>>>>>>>>> +++ b/include/drm/drm_framebuffer.h
>>>>>>>>>> @@ -94,6 +94,44 @@ struct drm_framebuffer_funcs {
>>>>>>>>>>                  struct drm_file *file_priv, unsigned flags,
>>>>>>>>>>                  unsigned color, struct drm_clip_rect *clips,
>>>>>>>>>>                  unsigned num_clips);
>>>>>>>>>> +
>>>>>>>>>> +    /**
>>>>>>>>>> +     * @panic_vmap:
>>>>>>>>>> +     *
>>>>>>>>>> +     * Optional callback for panic handling.
>>>>>>>>>> +     *
>>>>>>>>>> +     * For vmapping the selected framebuffer in a panic context.
>>>>>>>>>> Must
>>>>>>>>>> +     * be super careful about locking (only trylocking allowed).
>>>>>>>>>> +     *
>>>>>>>>>> +     * RETURNS:
>>>>>>>>>> +     *
>>>>>>>>>> +     * NULL if it didn't work out, otherwise an opaque cookie
>>>>>>>>>> which is
>>>>>>>>>> +     * passed to @panic_draw_xy. It can be anything: vmap area,
>>>>>>>>>> structure
>>>>>>>>>> +     * with more details, just a few flags, ...
>>>>>>>>>> +     */
>>>>>>>>>> +    void *(*panic_vmap)(struct drm_framebuffer *fb);
>>>>>>>>> FWIW, the panic_vmap hook cannot work in general with the
>>>>>>>>> amdgpu/radeon
>>>>>>>>> drivers:
>>>>>>>>>
>>>>>>>>> Framebuffers are normally tiled, writing to them with the CPU
>>>>>>>>> results in
>>>>>>>>> garbled output.
>>>>>>>>>
>>>>>>> In which case the driver needs to support the ->panic_draw_xy
>>>>>>> callback,
>>>>>>> or maybe it's possible to make a generic helper for tiled buffers.
>>>>>> I'm afraid that won't help, at least not without porting big chunks of
>>>>>> https://gitlab.freedesktop.org/mesa/mesa/tree/master/src/amd/addrlib
>>>>>> into the kernel, none of which will be used for anything else.
>>>>>>
>>>>>>
>>>>>>>>> There would need to be a mechanism for switching scanout to a
>>>>>>>>> linear,
>>>>>>>>> CPU accessible framebuffer.
>>>>>>>> I suppose panic_vmap() could just provide a linear temp buffer
>>>>>>>> to the panic handler, and panic_unmap() could copy the contents
>>>>>>>> over to the real fb.
>>>>>> Copy how? Using a GPU engine?
>>>>> CPU maybe? Though I suppose that won't work if the buffer isn't CPU
>>>>> accesible :/
>>>> Well we do have a debug path for accessing invisible memory with the
>>>> CPU.
>>>>
>>>> E.g. three registers: DATA and auto increment OFFSET_LO/HI. So you can
>>>> just read/write DATA over and over again if you want to access some
>>>> memory.
>>> Right. I assume that'll be very slow, but I guess it could do when the
>>> memory isn't directly CPU accessible.
>> Just made a quick test and reading 33423360 bytes (4096x2040x4) using
>> that interfaces takes about 13 seconds.
>>
>> IIRC we don't use the auto increment optimization yet, so that can
>> probably be improved by a factor of 3 or more.
>>
>>>> But turning of tilling etc is still extremely tricky when the system is
>>>> already unstable.
>>> Maybe we could add a little hook to the display code, which just
>>> disables tiling for scanout and maybe disables non-primary planes, but
>>> doesn't touch anything else. Harry / Nicholas, does that seem feasible?
>>>
>>>
>>> I'm coming around from "this is never going to work" to "it might
>>> actually work" with our hardware...
>> Yeah, agree. It's a bit tricky, but doable.
> A "disable_tiling" hook or something along those lines could work for
> display. It's a little bit non trivial when you want to start dealing
> with locking and any active DRM commits, but we have a global lock
> around all our hardware programming anyway that makes that easier to
> deal with.
>
> I think we can just re-commit and update the existing hardware state
> with only the tiling info for every plane reset to off. For most buffers
> I don't think we'd have to really consider changing anything else here
> as long as you respect the current FB size and pitch.

That sounds like it is also already to invasive.

Keep in mind that all other CPU cores and processes/threads are dead and 
you can be in any context you want (e.g. NMI).

In other words we can't take any locks whatsoever and we need to keep 
the hardware interaction as minimal as possible.

Christian.

>
> Nicholas Kazlauskas
>
>> Takeaway for Noralf is that this whole vmap on panic won't even remotely
>> work. We need to get the data byte by byte without a page mapping if
>> that is ever going to fly.
>>
>> Christian.
>>
>>>

Michel Dänzer March 13, 2019, 5:33 p.m. UTC | #33

On 2019-03-13 5:16 p.m., Kazlauskas, Nicholas wrote:
> On 3/13/19 11:54 AM, Christian König wrote:
>> Am 13.03.19 um 16:38 schrieb Michel Dänzer:
>>> On 2019-03-13 2:37 p.m., Christian König wrote:
>>>> Am 13.03.19 um 14:31 schrieb Ville Syrjälä:
>>>>> On Wed, Mar 13, 2019 at 10:35:08AM +0100, Michel Dänzer wrote:
>>>>>> On 2019-03-12 6:15 p.m., Noralf Trønnes wrote:
>>>>>>> Den 12.03.2019 17.17, skrev Ville Syrjälä:
>>>>>>>> On Tue, Mar 12, 2019 at 11:47:04AM +0100, Michel Dänzer wrote:
>>>>>>>>> On 2019-03-11 6:42 p.m., Noralf Trønnes wrote:
>>>>>>>>>> This adds support for outputting kernel messages on panic().
>>>>>>>>>> A kernel message dumper is used to dump the log. The dumper 
>>>>>>>>>> iterates
>>>>>>>>>> over each DRM device and it's crtc's to find suitable 
>>>>>>>>>> framebuffers.
>>>>>>>>>>
>>>>>>>>>> All the other dumpers are run before this one except mtdoops.
>>>>>>>>>> Only atomic drivers are supported.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
>>>>>>>>>> ---
>>>>>>>>>>    [...]
>>>>>>>>>>
>>>>>>>>>> diff --git a/include/drm/drm_framebuffer.h
>>>>>>>>>> b/include/drm/drm_framebuffer.h
>>>>>>>>>> index f0b34c977ec5..f3274798ecfe 100644
>>>>>>>>>> --- a/include/drm/drm_framebuffer.h
>>>>>>>>>> +++ b/include/drm/drm_framebuffer.h
>>>>>>>>>> @@ -94,6 +94,44 @@ struct drm_framebuffer_funcs {
>>>>>>>>>>                 struct drm_file *file_priv, unsigned flags,
>>>>>>>>>>                 unsigned color, struct drm_clip_rect *clips,
>>>>>>>>>>                 unsigned num_clips);
>>>>>>>>>> +
>>>>>>>>>> +    /**
>>>>>>>>>> +     * @panic_vmap:
>>>>>>>>>> +     *
>>>>>>>>>> +     * Optional callback for panic handling.
>>>>>>>>>> +     *
>>>>>>>>>> +     * For vmapping the selected framebuffer in a panic context.
>>>>>>>>>> Must
>>>>>>>>>> +     * be super careful about locking (only trylocking allowed).
>>>>>>>>>> +     *
>>>>>>>>>> +     * RETURNS:
>>>>>>>>>> +     *
>>>>>>>>>> +     * NULL if it didn't work out, otherwise an opaque cookie
>>>>>>>>>> which is
>>>>>>>>>> +     * passed to @panic_draw_xy. It can be anything: vmap area,
>>>>>>>>>> structure
>>>>>>>>>> +     * with more details, just a few flags, ...
>>>>>>>>>> +     */
>>>>>>>>>> +    void *(*panic_vmap)(struct drm_framebuffer *fb);
>>>>>>>>> FWIW, the panic_vmap hook cannot work in general with the
>>>>>>>>> amdgpu/radeon
>>>>>>>>> drivers:
>>>>>>>>>
>>>>>>>>> Framebuffers are normally tiled, writing to them with the CPU
>>>>>>>>> results in
>>>>>>>>> garbled output.
>>>>>>>>>
>>>>>>> In which case the driver needs to support the ->panic_draw_xy 
>>>>>>> callback,
>>>>>>> or maybe it's possible to make a generic helper for tiled buffers.
>>>>>> I'm afraid that won't help, at least not without porting big chunks of
>>>>>> https://gitlab.freedesktop.org/mesa/mesa/tree/master/src/amd/addrlib
>>>>>> into the kernel, none of which will be used for anything else.
>>>>>>
>>>>>>
>>>>>>>>> There would need to be a mechanism for switching scanout to a 
>>>>>>>>> linear,
>>>>>>>>> CPU accessible framebuffer.
>>>>>>>> I suppose panic_vmap() could just provide a linear temp buffer
>>>>>>>> to the panic handler, and panic_unmap() could copy the contents
>>>>>>>> over to the real fb.
>>>>>> Copy how? Using a GPU engine?
>>>>> CPU maybe? Though I suppose that won't work if the buffer isn't CPU
>>>>> accesible :/
>>>> Well we do have a debug path for accessing invisible memory with the 
>>>> CPU.
>>>>
>>>> E.g. three registers: DATA and auto increment OFFSET_LO/HI. So you can
>>>> just read/write DATA over and over again if you want to access some 
>>>> memory.
>>> Right. I assume that'll be very slow, but I guess it could do when the
>>> memory isn't directly CPU accessible.
>>
>> Just made a quick test and reading 33423360 bytes (4096x2040x4) using 
>> that interfaces takes about 13 seconds.
>>
>> IIRC we don't use the auto increment optimization yet, so that can 
>> probably be improved by a factor of 3 or more.

I'd assume only writes are needed, no reads.


>>>> But turning of tilling etc is still extremely tricky when the system is
>>>> already unstable.
>>> Maybe we could add a little hook to the display code, which just
>>> disables tiling for scanout and maybe disables non-primary planes, but
>>> doesn't touch anything else. Harry / Nicholas, does that seem feasible?
>>>
>>>
>>> I'm coming around from "this is never going to work" to "it might
>>> actually work" with our hardware...
>>
>> Yeah, agree. It's a bit tricky, but doable.
> 
> A "disable_tiling" hook or something along those lines could work for 
> display. It's a little bit non trivial when you want to start dealing 
> with locking and any active DRM commits, but we have a global lock 
> around all our hardware programming anyway that makes that easier to 
> deal with.

This code only runs when there's a kernel panic, so it doesn't have to
care about such things. :) In fact, it should rather avoid them and only
do the absolute minimum register writes needed to do the job.


> I think we can just re-commit and update the existing hardware state 
> with only the tiling info for every plane reset to off.

There's also a concern that non-primary planes might prevent the output
from being visible.


> For most buffers I don't think we'd have to really consider changing
> anything else here as long as you respect the current FB size and
> pitch.

Hmm, e.g. compression might also need to be disabled? Or maybe that's
implicitly off with no tiling?

Kazlauskas, Nicholas March 13, 2019, 5:41 p.m. UTC | #34

On 3/13/19 1:33 PM, Michel Dänzer wrote:
> On 2019-03-13 5:16 p.m., Kazlauskas, Nicholas wrote:
>> On 3/13/19 11:54 AM, Christian König wrote:
>>> Am 13.03.19 um 16:38 schrieb Michel Dänzer:
>>>> On 2019-03-13 2:37 p.m., Christian König wrote:
>>>>> Am 13.03.19 um 14:31 schrieb Ville Syrjälä:
>>>>>> On Wed, Mar 13, 2019 at 10:35:08AM +0100, Michel Dänzer wrote:
>>>>>>> On 2019-03-12 6:15 p.m., Noralf Trønnes wrote:
>>>>>>>> Den 12.03.2019 17.17, skrev Ville Syrjälä:
>>>>>>>>> On Tue, Mar 12, 2019 at 11:47:04AM +0100, Michel Dänzer wrote:
>>>>>>>>>> On 2019-03-11 6:42 p.m., Noralf Trønnes wrote:
>>>>>>>>>>> This adds support for outputting kernel messages on panic().
>>>>>>>>>>> A kernel message dumper is used to dump the log. The dumper
>>>>>>>>>>> iterates
>>>>>>>>>>> over each DRM device and it's crtc's to find suitable
>>>>>>>>>>> framebuffers.
>>>>>>>>>>>
>>>>>>>>>>> All the other dumpers are run before this one except mtdoops.
>>>>>>>>>>> Only atomic drivers are supported.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
>>>>>>>>>>> ---
>>>>>>>>>>>     [...]
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/include/drm/drm_framebuffer.h
>>>>>>>>>>> b/include/drm/drm_framebuffer.h
>>>>>>>>>>> index f0b34c977ec5..f3274798ecfe 100644
>>>>>>>>>>> --- a/include/drm/drm_framebuffer.h
>>>>>>>>>>> +++ b/include/drm/drm_framebuffer.h
>>>>>>>>>>> @@ -94,6 +94,44 @@ struct drm_framebuffer_funcs {
>>>>>>>>>>>                  struct drm_file *file_priv, unsigned flags,
>>>>>>>>>>>                  unsigned color, struct drm_clip_rect *clips,
>>>>>>>>>>>                  unsigned num_clips);
>>>>>>>>>>> +
>>>>>>>>>>> +    /**
>>>>>>>>>>> +     * @panic_vmap:
>>>>>>>>>>> +     *
>>>>>>>>>>> +     * Optional callback for panic handling.
>>>>>>>>>>> +     *
>>>>>>>>>>> +     * For vmapping the selected framebuffer in a panic context.
>>>>>>>>>>> Must
>>>>>>>>>>> +     * be super careful about locking (only trylocking allowed).
>>>>>>>>>>> +     *
>>>>>>>>>>> +     * RETURNS:
>>>>>>>>>>> +     *
>>>>>>>>>>> +     * NULL if it didn't work out, otherwise an opaque cookie
>>>>>>>>>>> which is
>>>>>>>>>>> +     * passed to @panic_draw_xy. It can be anything: vmap area,
>>>>>>>>>>> structure
>>>>>>>>>>> +     * with more details, just a few flags, ...
>>>>>>>>>>> +     */
>>>>>>>>>>> +    void *(*panic_vmap)(struct drm_framebuffer *fb);
>>>>>>>>>> FWIW, the panic_vmap hook cannot work in general with the
>>>>>>>>>> amdgpu/radeon
>>>>>>>>>> drivers:
>>>>>>>>>>
>>>>>>>>>> Framebuffers are normally tiled, writing to them with the CPU
>>>>>>>>>> results in
>>>>>>>>>> garbled output.
>>>>>>>>>>
>>>>>>>> In which case the driver needs to support the ->panic_draw_xy
>>>>>>>> callback,
>>>>>>>> or maybe it's possible to make a generic helper for tiled buffers.
>>>>>>> I'm afraid that won't help, at least not without porting big chunks of
>>>>>>> https://gitlab.freedesktop.org/mesa/mesa/tree/master/src/amd/addrlib
>>>>>>> into the kernel, none of which will be used for anything else.
>>>>>>>
>>>>>>>
>>>>>>>>>> There would need to be a mechanism for switching scanout to a
>>>>>>>>>> linear,
>>>>>>>>>> CPU accessible framebuffer.
>>>>>>>>> I suppose panic_vmap() could just provide a linear temp buffer
>>>>>>>>> to the panic handler, and panic_unmap() could copy the contents
>>>>>>>>> over to the real fb.
>>>>>>> Copy how? Using a GPU engine?
>>>>>> CPU maybe? Though I suppose that won't work if the buffer isn't CPU
>>>>>> accesible :/
>>>>> Well we do have a debug path for accessing invisible memory with the
>>>>> CPU.
>>>>>
>>>>> E.g. three registers: DATA and auto increment OFFSET_LO/HI. So you can
>>>>> just read/write DATA over and over again if you want to access some
>>>>> memory.
>>>> Right. I assume that'll be very slow, but I guess it could do when the
>>>> memory isn't directly CPU accessible.
>>>
>>> Just made a quick test and reading 33423360 bytes (4096x2040x4) using
>>> that interfaces takes about 13 seconds.
>>>
>>> IIRC we don't use the auto increment optimization yet, so that can
>>> probably be improved by a factor of 3 or more.
> 
> I'd assume only writes are needed, no reads.
> 
> 
>>>>> But turning of tilling etc is still extremely tricky when the system is
>>>>> already unstable.
>>>> Maybe we could add a little hook to the display code, which just
>>>> disables tiling for scanout and maybe disables non-primary planes, but
>>>> doesn't touch anything else. Harry / Nicholas, does that seem feasible?
>>>>
>>>>
>>>> I'm coming around from "this is never going to work" to "it might
>>>> actually work" with our hardware...
>>>
>>> Yeah, agree. It's a bit tricky, but doable.
>>
>> A "disable_tiling" hook or something along those lines could work for
>> display. It's a little bit non trivial when you want to start dealing
>> with locking and any active DRM commits, but we have a global lock
>> around all our hardware programming anyway that makes that easier to
>> deal with.
> 
> This code only runs when there's a kernel panic, so it doesn't have to
> care about such things. :) In fact, it should rather avoid them and only
> do the absolute minimum register writes needed to do the job.

Good point. I guess it doesn't really matter at that point.

> 
> 
>> I think we can just re-commit and update the existing hardware state
>> with only the tiling info for every plane reset to off.
> 
> There's also a concern that non-primary planes might prevent the output
> from being visible.

I would just hope that the overlay doesn't cover the full screen, or 
that the panic handler writes to all buffers if possible.

DC would have to swap the current context in order to disable planes and 
we'd have to do memory allocations / frees etc in order to do so.

> 
> 
>> For most buffers I don't think we'd have to really consider changing
>> anything else here as long as you respect the current FB size and
>> pitch.
> 
> Hmm, e.g. compression might also need to be disabled? Or maybe that's
> implicitly off with no tiling?
> 
> 

Resetting the tiling info will also disable the compression in this 
case, so the buffer should still be usable for this purpose.

Nicholas Kazlauskas

Christian König March 13, 2019, 5:52 p.m. UTC | #35

Am 13.03.19 um 18:33 schrieb Michel Dänzer:
> [SNIP]
>>>>>>> Copy how? Using a GPU engine?
>>>>>> CPU maybe? Though I suppose that won't work if the buffer isn't CPU
>>>>>> accesible :/
>>>>> Well we do have a debug path for accessing invisible memory with the
>>>>> CPU.
>>>>>
>>>>> E.g. three registers: DATA and auto increment OFFSET_LO/HI. So you can
>>>>> just read/write DATA over and over again if you want to access some
>>>>> memory.
>>>> Right. I assume that'll be very slow, but I guess it could do when the
>>>> memory isn't directly CPU accessible.
>>> Just made a quick test and reading 33423360 bytes (4096x2040x4) using
>>> that interfaces takes about 13 seconds.
>>>
>>> IIRC we don't use the auto increment optimization yet, so that can
>>> probably be improved by a factor of 3 or more.
> I'd assume only writes are needed, no reads.

I've played around with that for a moment and with a bit of optimization 
I can actually get about 20 MB/s write performance out of the debugging 
interface.

This way overwriting a 4K framebuffer would take less than 2 seconds 
using this.

It's not ideal, but I think for a panic screen perfectly reasonable.

Christian.

Ahmed S. Darwish March 14, 2019, 2:51 a.m. UTC | #36

[[ Adding Sebastian, who is quite experienced in intricate
   locking situations due to daily PREEMPT_RT work.. ]]

On Wed, Mar 13, 2019 at 09:37:10AM +0100, Daniel Vetter wrote:
> On Wed, Mar 13, 2019 at 08:49:17AM +0100, John Ogness wrote:
> > On 2019-03-12, Ahmed S. Darwish <darwish.07@gmail.com> wrote:
> > > On Wed, Mar 13, 2019 at 09:37:10AM +0100, Daniel Vetter wrote:
[……]
> > > >
> > > > If we aim for perfect this should be a trylock still, maybe using
> > > > our own device list.
> > > >
> > >
> > > I definitely agree here.
> > >
> > > The lock may already be locked either by a stopped CPU, or by the
> > > very same CPU we execute panic() on (e.g. NMI panic() on the
> > > printing CPU).
> > >
> > > This is why it's very common for example in serial consoles, which
> > > are usually careful about re-entrance and panic contexts, to do:
> > >
> > >   xxxxxx_console_write(...) {
> > > 	if (oops_in_progress)
> > > 		locked = spin_trylock_irqsave(&port->lock, flags);
> > > 	else
> > > 		spin_lock_irqsave(&port->lock, flags);
> > >   }
> > >
> > > I'm quite positive we should do the same for panic drm drivers.
> >
> > This construction will continue, even if the trylock fails. It only
> > makes sense to do this if the driver has a chance of being
> > successful. Ignoring locks is a serious issue. I personally am quite
> > unhappy that the serial drivers do this, which was part of my motivation
> > for the new printk design I'm working on.
> >
> > If the driver is not capable of doing something useful on a failed
> > trylock, then I recommend just skipping that device. Maybe trying it
> > again later after trying all the devices?
>
> Ah yes missed that. If the trylock fails anywhere, we must bail out.
>
> Not sure retrying is useful, my experience from at least drm is that
> either you're lucky, and drm wasn't doing anything right when the machine
> blew up, and then the trylocks will all go through. Or you're unlucky, and
> most likely that means drm itself blew up, and no amount of retrying is
> going to help. I wouldn't bother.
>

Ack, retrying most probably won't help.

I see that, at least in x86:

  kernel/panic.c::panic()
    => kernel/panic.c::smp_send_stop()
      => arch/x86/incluse/smp.h::smp_send_stop(0)
        => arch/x86/kernel/smp.c::native_stop_other_cpus(wait)
          /*
           * Don't wait longer than a second if the caller
           * didn't ask us to wait.
           */
           timeout = USEC_PER_SEC;
           while (num_online_cpus() > 1 && (wait || timeout--))
                   udelay(1);

So the panic()-ing core wait by default for _a full second_ until
all other cores finish their non-preemptible regions. If that did
not work out, it even tries more aggressively with NMIs.

So if one of the other non panic()-ing cores was holding a DRM
related lock through spin_lock_irqsave() for more than one second,
well, it's a bug in the DRM layer code anyway..      [A]

Problem is, what will happen if the non panic()-ing core was
holding a DRM lock through barebone spin_lock()? Interrupts are
enabled there, so the IPI will be handled, and thus that core will
effectively die while the lock is forever held :-(   [B]

But, well, yeah, I don't think there's a solution for the [B] part
except by spin_trylock() the fewest amount of locks possible, and
bailing out if anyone of them is held.

For reference, I asked over IRC why not just resetting GPU state,
but it's not easy as it sounds to people outside of the gfx world:

  [abridged]
  <darwi>  So it seems Windows is forcing driver writers to
           implement a gpu reset method [*]
  <darwi>  since Windows vista
  <danvet> yeah
  <danvet> not going to do that from panic
  <danvet> there's a non-zero chance that gpu reset takes down
           the system with at least lots of hw/driver combos
  <danvet> I think all our drivers also have gpu reset support
  <danvet> if something is still rendering while we panic, mea culpa
  <danvet> only way around that is switching planes
  <danvet> which we tried with the old approach, and that's just
           too hard
  <danvet> you end up running a few 100kloc of code worst case
  <danvet> no way to audit/validate that for panic context
  <danvet> but the rendering shouldn't really matter
  <anholt> yeah, actually modesetting in a panic seems hopeless.
  <danvet> at least for modern compositors
  <danvet> anholt, yeah I killed that idea
  <danvet> so they only render to the backbuffer
  <danvet> and we won't show that with a panic
  <anholt> yep.  just draw on a plane.  for everything but
           uncomposited x11, it'll already be idle.
  <danvet> so only really an issue on boot splash and frontbuffer x11
  <danvet> and there a few cursors/characters drawn won't really
           matter in most cases
  <danvet> anholt, yeah that's the idea, we pick the currently
           showing plane and have a simple draw_xy functions which
	   knows how to draw bright/dark pixels for any drm_fourcc
  <danvet> so you can see the oops even if it's yuv or whatever
  <anholt> nice

Thanks!

[*] DXGKDDI_RESETFROMTIMEOUT callback function:
    https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/content/d3dkmddi/nc-d3dkmddi-dxgkddi_resetfromtimeout

--
darwi
http://darwish.chasingpointers.com

Ahmed S. Darwish March 14, 2019, 4:45 a.m. UTC | #37

Hi,

On Wed, Mar 13, 2019 at 09:35:05AM +0100, Daniel Vetter wrote:
> On Tue, Mar 12, 2019 at 11:13:03PM +0100, Ahmed S. Darwish wrote:
> > On Mon, Mar 11, 2019 at 11:33:15PM +0100, Noralf Trønnes wrote:
> > > Den 11.03.2019 20.23, skrev Daniel Vetter:
[……]
> > > >
> > > > class_for_each_device uses klist, which only uses an irqsave spinlock. I
> > > > think that's good enough. Comment to that effect would be good e.g.
> > > >
> > > > 	/* based on klist, which uses only a spin_lock_irqsave, which we
> > > > 	 * assume still works */
> > > >
> > > > If we aim for perfect this should be a trylock still, maybe using our own
> > > > device list.
> > > >
> >
> > I definitely agree here.
> >
> > The lock may already be locked either by a stopped CPU, or by the
> > very same CPU we execute panic() on (e.g. NMI panic() on the
> > printing CPU).
> >
> > This is why it's very common for example in serial consoles, which
> > are usually careful about re-entrance and panic contexts, to do:
> >
> >   xxxxxx_console_write(...) {
> > 	if (oops_in_progress)
> > 		locked = spin_trylock_irqsave(&port->lock, flags);
> > 	else
> > 		spin_lock_irqsave(&port->lock, flags);
> >   }
> >
> > I'm quite positive we should do the same for panic drm drivers.
>
> Yeah Ideally all the locking in the drm path would be trylock only.
>
> I wonder whether lockdep could help us validate this, with some "don't
> allow anything except trylocks in this context". It's easy to audit the
> core code with review, but drivers are much tougher. And often end up with
> really deep callchains to get at the backing buffers.
[……]
> > > > Instead what I had in mind is to recreate the worst
> > > > possible panic context as much as feasible (disabling interrupts should be
> > > > a good start, maybe we can even do an nmi callback), and then call our
> > > > panic implementation. That way we can test the panic handler in a
> > > > non-destructive way (i.e. aside from last dmesg lines printed to the
> > > > screen nothing bad happens to the kernel: No real panic, no oops, no
> > > > tainting).
> > >
> > > The interrupt case I can do, nmi I have no idea.
> > >
> >
> > I agree too. Disabling interrupts + CONFIG_DEBUG_ATOMIC_SLEEP
> > would be a nice non-destructive test-case emulation.
>
> See above, if we can somehow emulate "all locks are held, only allow
> trylock" with lockdep that would be great too.
>

Hmmmm, to prototype things faster, I'll do it as a hook:

   spin_lock(spinlock_t *lock)
   {
        might_spin_forever_under_panic_path();
        ...
   }

Just like how mutexes and DEBUG_ATOMIC_SLEEP do it today:

   void mutex_lock(struct mutex *lock)
   {
        might_sleep();
        ...
   }

Hopefully the locking maintainers can accept something like this.
If not, let's play with lockdep (which is a bit complex).

===> Thanks to Sebastian for suggesting this!

> Plus nmi context, once/if that somehow becomes relevant.
>
> The thing that killed the old drm panic handling code definitely was that
> we flat out couldnt' test it except with real oopses. And that's just
> whack-a-mole and bug reporter frustration if you first have a few patch
> iterations around "oh sry, it's still some oops in the panic handler, not
> the first one that we're seeing" :-/
>

Oh, that's a critical point:

   full data > some data > no data > invalid/useless data

First impressions for the feature will definitely count. Hopefully
the hooks above and some heavy DRM CI testing __beforehand__ can
improve things before publishing to a wider audience..

Cheers,

--
darwi
http://darwish.chasingpointers.com

Daniel Vetter March 14, 2019, 9:32 a.m. UTC | #38

On Thu, Mar 14, 2019 at 03:51:13AM +0100, Ahmed S. Darwish wrote:
> 
> [[ Adding Sebastian, who is quite experienced in intricate
>    locking situations due to daily PREEMPT_RT work.. ]]
> 
> On Wed, Mar 13, 2019 at 09:37:10AM +0100, Daniel Vetter wrote:
> > On Wed, Mar 13, 2019 at 08:49:17AM +0100, John Ogness wrote:
> > > On 2019-03-12, Ahmed S. Darwish <darwish.07@gmail.com> wrote:
> > > > On Wed, Mar 13, 2019 at 09:37:10AM +0100, Daniel Vetter wrote:
> [……]
> > > > >
> > > > > If we aim for perfect this should be a trylock still, maybe using
> > > > > our own device list.
> > > > >
> > > >
> > > > I definitely agree here.
> > > >
> > > > The lock may already be locked either by a stopped CPU, or by the
> > > > very same CPU we execute panic() on (e.g. NMI panic() on the
> > > > printing CPU).
> > > >
> > > > This is why it's very common for example in serial consoles, which
> > > > are usually careful about re-entrance and panic contexts, to do:
> > > >
> > > >   xxxxxx_console_write(...) {
> > > > 	if (oops_in_progress)
> > > > 		locked = spin_trylock_irqsave(&port->lock, flags);
> > > > 	else
> > > > 		spin_lock_irqsave(&port->lock, flags);
> > > >   }
> > > >
> > > > I'm quite positive we should do the same for panic drm drivers.
> > >
> > > This construction will continue, even if the trylock fails. It only
> > > makes sense to do this if the driver has a chance of being
> > > successful. Ignoring locks is a serious issue. I personally am quite
> > > unhappy that the serial drivers do this, which was part of my motivation
> > > for the new printk design I'm working on.
> > >
> > > If the driver is not capable of doing something useful on a failed
> > > trylock, then I recommend just skipping that device. Maybe trying it
> > > again later after trying all the devices?
> >
> > Ah yes missed that. If the trylock fails anywhere, we must bail out.
> >
> > Not sure retrying is useful, my experience from at least drm is that
> > either you're lucky, and drm wasn't doing anything right when the machine
> > blew up, and then the trylocks will all go through. Or you're unlucky, and
> > most likely that means drm itself blew up, and no amount of retrying is
> > going to help. I wouldn't bother.
> >
> 
> Ack, retrying most probably won't help.
> 
> I see that, at least in x86:
> 
>   kernel/panic.c::panic()
>     => kernel/panic.c::smp_send_stop()
>       => arch/x86/incluse/smp.h::smp_send_stop(0)
>         => arch/x86/kernel/smp.c::native_stop_other_cpus(wait)
>           /*
>            * Don't wait longer than a second if the caller
>            * didn't ask us to wait.
>            */
>            timeout = USEC_PER_SEC;
>            while (num_online_cpus() > 1 && (wait || timeout--))
>                    udelay(1);
> 
> So the panic()-ing core wait by default for _a full second_ until
> all other cores finish their non-preemptible regions. If that did
> not work out, it even tries more aggressively with NMIs.
> 
> So if one of the other non panic()-ing cores was holding a DRM
> related lock through spin_lock_irqsave() for more than one second,
> well, it's a bug in the DRM layer code anyway..      [A]
> 
> Problem is, what will happen if the non panic()-ing core was
> holding a DRM lock through barebone spin_lock()? Interrupts are
> enabled there, so the IPI will be handled, and thus that core will
> effectively die while the lock is forever held :-(   [B]

Just to hammer this in even more:

If any of those locks are held, the hw/driver is in an undefined state. If
you then touch the hw there's really good chances you kill the entire
machine. Happened plenty with the old oops handling, which tried to do
that. That's why we want none of any of these. Whethere it's modesetting
or a gpu reset, gpu hw and drivers is just way too complex to make that
feasible.

That's why we came up with the trylock + immediate bail out design if that
fails. Plus really only render the oops int whatever is the current
display buffer, so that we don't have to do any hw programming at all.
-Daniel
> 
> But, well, yeah, I don't think there's a solution for the [B] part
> except by spin_trylock() the fewest amount of locks possible, and
> bailing out if anyone of them is held.
> 
> For reference, I asked over IRC why not just resetting GPU state,
> but it's not easy as it sounds to people outside of the gfx world:
> 
>   [abridged]
>   <darwi>  So it seems Windows is forcing driver writers to
>            implement a gpu reset method [*]
>   <darwi>  since Windows vista
>   <danvet> yeah
>   <danvet> not going to do that from panic
>   <danvet> there's a non-zero chance that gpu reset takes down
>            the system with at least lots of hw/driver combos
>   <danvet> I think all our drivers also have gpu reset support
>   <danvet> if something is still rendering while we panic, mea culpa
>   <danvet> only way around that is switching planes
>   <danvet> which we tried with the old approach, and that's just
>            too hard
>   <danvet> you end up running a few 100kloc of code worst case
>   <danvet> no way to audit/validate that for panic context
>   <danvet> but the rendering shouldn't really matter
>   <anholt> yeah, actually modesetting in a panic seems hopeless.
>   <danvet> at least for modern compositors
>   <danvet> anholt, yeah I killed that idea
>   <danvet> so they only render to the backbuffer
>   <danvet> and we won't show that with a panic
>   <anholt> yep.  just draw on a plane.  for everything but
>            uncomposited x11, it'll already be idle.
>   <danvet> so only really an issue on boot splash and frontbuffer x11
>   <danvet> and there a few cursors/characters drawn won't really
>            matter in most cases
>   <danvet> anholt, yeah that's the idea, we pick the currently
>            showing plane and have a simple draw_xy functions which
> 	   knows how to draw bright/dark pixels for any drm_fourcc
>   <danvet> so you can see the oops even if it's yuv or whatever
>   <anholt> nice
> 
> Thanks!
> 
> [*] DXGKDDI_RESETFROMTIMEOUT callback function:
>     https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/content/d3dkmddi/nc-d3dkmddi-dxgkddi_resetfromtimeout
> 
> --
> darwi
> http://darwish.chasingpointers.com

Daniel Vetter March 14, 2019, 9:35 a.m. UTC | #39

On Thu, Mar 14, 2019 at 05:45:32AM +0100, Ahmed S. Darwish wrote:
> Hi,
> 
> On Wed, Mar 13, 2019 at 09:35:05AM +0100, Daniel Vetter wrote:
> > On Tue, Mar 12, 2019 at 11:13:03PM +0100, Ahmed S. Darwish wrote:
> > > On Mon, Mar 11, 2019 at 11:33:15PM +0100, Noralf Trønnes wrote:
> > > > Den 11.03.2019 20.23, skrev Daniel Vetter:
> [……]
> > > > >
> > > > > class_for_each_device uses klist, which only uses an irqsave spinlock. I
> > > > > think that's good enough. Comment to that effect would be good e.g.
> > > > >
> > > > > 	/* based on klist, which uses only a spin_lock_irqsave, which we
> > > > > 	 * assume still works */
> > > > >
> > > > > If we aim for perfect this should be a trylock still, maybe using our own
> > > > > device list.
> > > > >
> > >
> > > I definitely agree here.
> > >
> > > The lock may already be locked either by a stopped CPU, or by the
> > > very same CPU we execute panic() on (e.g. NMI panic() on the
> > > printing CPU).
> > >
> > > This is why it's very common for example in serial consoles, which
> > > are usually careful about re-entrance and panic contexts, to do:
> > >
> > >   xxxxxx_console_write(...) {
> > > 	if (oops_in_progress)
> > > 		locked = spin_trylock_irqsave(&port->lock, flags);
> > > 	else
> > > 		spin_lock_irqsave(&port->lock, flags);
> > >   }
> > >
> > > I'm quite positive we should do the same for panic drm drivers.
> >
> > Yeah Ideally all the locking in the drm path would be trylock only.
> >
> > I wonder whether lockdep could help us validate this, with some "don't
> > allow anything except trylocks in this context". It's easy to audit the
> > core code with review, but drivers are much tougher. And often end up with
> > really deep callchains to get at the backing buffers.
> [……]
> > > > > Instead what I had in mind is to recreate the worst
> > > > > possible panic context as much as feasible (disabling interrupts should be
> > > > > a good start, maybe we can even do an nmi callback), and then call our
> > > > > panic implementation. That way we can test the panic handler in a
> > > > > non-destructive way (i.e. aside from last dmesg lines printed to the
> > > > > screen nothing bad happens to the kernel: No real panic, no oops, no
> > > > > tainting).
> > > >
> > > > The interrupt case I can do, nmi I have no idea.
> > > >
> > >
> > > I agree too. Disabling interrupts + CONFIG_DEBUG_ATOMIC_SLEEP
> > > would be a nice non-destructive test-case emulation.
> >
> > See above, if we can somehow emulate "all locks are held, only allow
> > trylock" with lockdep that would be great too.
> >
> 
> Hmmmm, to prototype things faster, I'll do it as a hook:
> 
>    spin_lock(spinlock_t *lock)
>    {
>         might_spin_forever_under_panic_path();
>         ...
>    }
> 
> Just like how mutexes and DEBUG_ATOMIC_SLEEP do it today:
> 
>    void mutex_lock(struct mutex *lock)
>    {
>         might_sleep();
>         ...
>    }
> 
> Hopefully the locking maintainers can accept something like this.
> If not, let's play with lockdep (which is a bit complex).
> 
> ===> Thanks to Sebastian for suggesting this!

Ah yes this is a very nice idea. We only need to make sure that we don't
enable any of this for a real oops. Or we just scroll away the real oops
with tons of warnings.
-Daniel

> 
> > Plus nmi context, once/if that somehow becomes relevant.
> >
> > The thing that killed the old drm panic handling code definitely was that
> > we flat out couldnt' test it except with real oopses. And that's just
> > whack-a-mole and bug reporter frustration if you first have a few patch
> > iterations around "oh sry, it's still some oops in the panic handler, not
> > the first one that we're seeing" :-/
> >
> 
> Oh, that's a critical point:
> 
>    full data > some data > no data > invalid/useless data
> 
> First impressions for the feature will definitely count. Hopefully
> the hooks above and some heavy DRM CI testing __beforehand__ can
> improve things before publishing to a wider audience..

Yup, let's get this right!
-Daniel

Daniel Vetter March 14, 2019, 9:40 a.m. UTC | #40

On Wed, Mar 13, 2019 at 04:54:50PM +0100, Christian König wrote:
> Am 13.03.19 um 16:38 schrieb Michel Dänzer:
> > On 2019-03-13 2:37 p.m., Christian König wrote:
> > > Am 13.03.19 um 14:31 schrieb Ville Syrjälä:
> > > > On Wed, Mar 13, 2019 at 10:35:08AM +0100, Michel Dänzer wrote:
> > > > > On 2019-03-12 6:15 p.m., Noralf Trønnes wrote:
> > > > > > Den 12.03.2019 17.17, skrev Ville Syrjälä:
> > > > > > > On Tue, Mar 12, 2019 at 11:47:04AM +0100, Michel Dänzer wrote:
> > > > > > > > On 2019-03-11 6:42 p.m., Noralf Trønnes wrote:
> > > > > > > > > This adds support for outputting kernel messages on panic().
> > > > > > > > > A kernel message dumper is used to dump the log. The dumper iterates
> > > > > > > > > over each DRM device and it's crtc's to find suitable framebuffers.
> > > > > > > > > 
> > > > > > > > > All the other dumpers are run before this one except mtdoops.
> > > > > > > > > Only atomic drivers are supported.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
> > > > > > > > > ---
> > > > > > > > >    [...]
> > > > > > > > > 
> > > > > > > > > diff --git a/include/drm/drm_framebuffer.h
> > > > > > > > > b/include/drm/drm_framebuffer.h
> > > > > > > > > index f0b34c977ec5..f3274798ecfe 100644
> > > > > > > > > --- a/include/drm/drm_framebuffer.h
> > > > > > > > > +++ b/include/drm/drm_framebuffer.h
> > > > > > > > > @@ -94,6 +94,44 @@ struct drm_framebuffer_funcs {
> > > > > > > > >                 struct drm_file *file_priv, unsigned flags,
> > > > > > > > >                 unsigned color, struct drm_clip_rect *clips,
> > > > > > > > >                 unsigned num_clips);
> > > > > > > > > +
> > > > > > > > > +    /**
> > > > > > > > > +     * @panic_vmap:
> > > > > > > > > +     *
> > > > > > > > > +     * Optional callback for panic handling.
> > > > > > > > > +     *
> > > > > > > > > +     * For vmapping the selected framebuffer in a panic context.
> > > > > > > > > Must
> > > > > > > > > +     * be super careful about locking (only trylocking allowed).
> > > > > > > > > +     *
> > > > > > > > > +     * RETURNS:
> > > > > > > > > +     *
> > > > > > > > > +     * NULL if it didn't work out, otherwise an opaque cookie
> > > > > > > > > which is
> > > > > > > > > +     * passed to @panic_draw_xy. It can be anything: vmap area,
> > > > > > > > > structure
> > > > > > > > > +     * with more details, just a few flags, ...
> > > > > > > > > +     */
> > > > > > > > > +    void *(*panic_vmap)(struct drm_framebuffer *fb);
> > > > > > > > FWIW, the panic_vmap hook cannot work in general with the
> > > > > > > > amdgpu/radeon
> > > > > > > > drivers:
> > > > > > > > 
> > > > > > > > Framebuffers are normally tiled, writing to them with the CPU
> > > > > > > > results in
> > > > > > > > garbled output.
> > > > > > > > 
> > > > > > In which case the driver needs to support the ->panic_draw_xy callback,
> > > > > > or maybe it's possible to make a generic helper for tiled buffers.
> > > > > I'm afraid that won't help, at least not without porting big chunks of
> > > > > https://gitlab.freedesktop.org/mesa/mesa/tree/master/src/amd/addrlib
> > > > > into the kernel, none of which will be used for anything else.
> > > > > 
> > > > > 
> > > > > > > > There would need to be a mechanism for switching scanout to a linear,
> > > > > > > > CPU accessible framebuffer.
> > > > > > > I suppose panic_vmap() could just provide a linear temp buffer
> > > > > > > to the panic handler, and panic_unmap() could copy the contents
> > > > > > > over to the real fb.
> > > > > Copy how? Using a GPU engine?
> > > > CPU maybe? Though I suppose that won't work if the buffer isn't CPU
> > > > accesible :/
> > > Well we do have a debug path for accessing invisible memory with the CPU.
> > > 
> > > E.g. three registers: DATA and auto increment OFFSET_LO/HI. So you can
> > > just read/write DATA over and over again if you want to access some memory.
> > Right. I assume that'll be very slow, but I guess it could do when the
> > memory isn't directly CPU accessible.
> 
> Just made a quick test and reading 33423360 bytes (4096x2040x4) using that
> interfaces takes about 13 seconds.
> 
> IIRC we don't use the auto increment optimization yet, so that can probably
> be improved by a factor of 3 or more.
> 
> > > But turning of tilling etc is still extremely tricky when the system is
> > > already unstable.
> > Maybe we could add a little hook to the display code, which just
> > disables tiling for scanout and maybe disables non-primary planes, but
> > doesn't touch anything else. Harry / Nicholas, does that seem feasible?
> > 
> > 
> > I'm coming around from "this is never going to work" to "it might
> > actually work" with our hardware...
> 
> Yeah, agree. It's a bit tricky, but doable.
> 
> Takeaway for Noralf is that this whole vmap on panic won't even remotely
> work. We need to get the data byte by byte without a page mapping if that is
> ever going to fly.

Don't be hung up on the vmap, the main interface is a pixel-by-pixel
panic_draw_xy. Plus panic_prepare/cleanup to do whatever setup you need to
do (like disable tiling to avoid having a copy of the detiling library in
the kernel for nothing else than oops printing). vmap is just what
conveniently works for noral on his cma drivers.

The important bit is that we run the minimal amount of code to draw pixels
onto the current scanout buffers. For multiple I think best is to just
pick the biggest visible rectangle, but not do stuff like disable planes -
that sounds way too risky. Usually there's a main window that should be
big enough to show the oops.
-Daniel

John Ogness March 14, 2019, 9:43 a.m. UTC | #41

On 2019-03-14, Daniel Vetter <daniel@ffwll.ch> wrote:
> That's why we came up with the trylock + immediate bail out design if
> that fails. Plus really only render the oops int whatever is the
> current display buffer, so that we don't have to do any hw programming
> at all.

I think this is your best option. The real work will be identifying
any/all spin locking that currently exists. For all of those, the code
needs to change to:

1. trylock if oops_in_progress, otherwise spinlock

2. if trylock fails, the code must have a sane failure

The 2nd point will be the difficult one. For example, you may have
functions without a return value taking spinlocks. But now those
functions could fail.

John Ogness

Daniel Vetter March 14, 2019, 9:50 a.m. UTC | #42

On Wed, Mar 13, 2019 at 05:41:52PM +0000, Kazlauskas, Nicholas wrote:
> On 3/13/19 1:33 PM, Michel Dänzer wrote:
> > On 2019-03-13 5:16 p.m., Kazlauskas, Nicholas wrote:
> >> On 3/13/19 11:54 AM, Christian König wrote:
> >>> Am 13.03.19 um 16:38 schrieb Michel Dänzer:
> >>>> On 2019-03-13 2:37 p.m., Christian König wrote:
> >>>>> Am 13.03.19 um 14:31 schrieb Ville Syrjälä:
> >>>>>> On Wed, Mar 13, 2019 at 10:35:08AM +0100, Michel Dänzer wrote:
> >>>>>>> On 2019-03-12 6:15 p.m., Noralf Trønnes wrote:
> >>>>>>>> Den 12.03.2019 17.17, skrev Ville Syrjälä:
> >>>>>>>>> On Tue, Mar 12, 2019 at 11:47:04AM +0100, Michel Dänzer wrote:
> >>>>>>>>>> On 2019-03-11 6:42 p.m., Noralf Trønnes wrote:
> >>>>>>>>>>> This adds support for outputting kernel messages on panic().
> >>>>>>>>>>> A kernel message dumper is used to dump the log. The dumper
> >>>>>>>>>>> iterates
> >>>>>>>>>>> over each DRM device and it's crtc's to find suitable
> >>>>>>>>>>> framebuffers.
> >>>>>>>>>>>
> >>>>>>>>>>> All the other dumpers are run before this one except mtdoops.
> >>>>>>>>>>> Only atomic drivers are supported.
> >>>>>>>>>>>
> >>>>>>>>>>> Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
> >>>>>>>>>>> ---
> >>>>>>>>>>>     [...]
> >>>>>>>>>>>
> >>>>>>>>>>> diff --git a/include/drm/drm_framebuffer.h
> >>>>>>>>>>> b/include/drm/drm_framebuffer.h
> >>>>>>>>>>> index f0b34c977ec5..f3274798ecfe 100644
> >>>>>>>>>>> --- a/include/drm/drm_framebuffer.h
> >>>>>>>>>>> +++ b/include/drm/drm_framebuffer.h
> >>>>>>>>>>> @@ -94,6 +94,44 @@ struct drm_framebuffer_funcs {
> >>>>>>>>>>>                  struct drm_file *file_priv, unsigned flags,
> >>>>>>>>>>>                  unsigned color, struct drm_clip_rect *clips,
> >>>>>>>>>>>                  unsigned num_clips);
> >>>>>>>>>>> +
> >>>>>>>>>>> +    /**
> >>>>>>>>>>> +     * @panic_vmap:
> >>>>>>>>>>> +     *
> >>>>>>>>>>> +     * Optional callback for panic handling.
> >>>>>>>>>>> +     *
> >>>>>>>>>>> +     * For vmapping the selected framebuffer in a panic context.
> >>>>>>>>>>> Must
> >>>>>>>>>>> +     * be super careful about locking (only trylocking allowed).
> >>>>>>>>>>> +     *
> >>>>>>>>>>> +     * RETURNS:
> >>>>>>>>>>> +     *
> >>>>>>>>>>> +     * NULL if it didn't work out, otherwise an opaque cookie
> >>>>>>>>>>> which is
> >>>>>>>>>>> +     * passed to @panic_draw_xy. It can be anything: vmap area,
> >>>>>>>>>>> structure
> >>>>>>>>>>> +     * with more details, just a few flags, ...
> >>>>>>>>>>> +     */
> >>>>>>>>>>> +    void *(*panic_vmap)(struct drm_framebuffer *fb);
> >>>>>>>>>> FWIW, the panic_vmap hook cannot work in general with the
> >>>>>>>>>> amdgpu/radeon
> >>>>>>>>>> drivers:
> >>>>>>>>>>
> >>>>>>>>>> Framebuffers are normally tiled, writing to them with the CPU
> >>>>>>>>>> results in
> >>>>>>>>>> garbled output.
> >>>>>>>>>>
> >>>>>>>> In which case the driver needs to support the ->panic_draw_xy
> >>>>>>>> callback,
> >>>>>>>> or maybe it's possible to make a generic helper for tiled buffers.
> >>>>>>> I'm afraid that won't help, at least not without porting big chunks of
> >>>>>>> https://gitlab.freedesktop.org/mesa/mesa/tree/master/src/amd/addrlib
> >>>>>>> into the kernel, none of which will be used for anything else.
> >>>>>>>
> >>>>>>>
> >>>>>>>>>> There would need to be a mechanism for switching scanout to a
> >>>>>>>>>> linear,
> >>>>>>>>>> CPU accessible framebuffer.
> >>>>>>>>> I suppose panic_vmap() could just provide a linear temp buffer
> >>>>>>>>> to the panic handler, and panic_unmap() could copy the contents
> >>>>>>>>> over to the real fb.
> >>>>>>> Copy how? Using a GPU engine?
> >>>>>> CPU maybe? Though I suppose that won't work if the buffer isn't CPU
> >>>>>> accesible :/
> >>>>> Well we do have a debug path for accessing invisible memory with the
> >>>>> CPU.
> >>>>>
> >>>>> E.g. three registers: DATA and auto increment OFFSET_LO/HI. So you can
> >>>>> just read/write DATA over and over again if you want to access some
> >>>>> memory.
> >>>> Right. I assume that'll be very slow, but I guess it could do when the
> >>>> memory isn't directly CPU accessible.
> >>>
> >>> Just made a quick test and reading 33423360 bytes (4096x2040x4) using
> >>> that interfaces takes about 13 seconds.
> >>>
> >>> IIRC we don't use the auto increment optimization yet, so that can
> >>> probably be improved by a factor of 3 or more.
> > 
> > I'd assume only writes are needed, no reads.
> > 
> > 
> >>>>> But turning of tilling etc is still extremely tricky when the system is
> >>>>> already unstable.
> >>>> Maybe we could add a little hook to the display code, which just
> >>>> disables tiling for scanout and maybe disables non-primary planes, but
> >>>> doesn't touch anything else. Harry / Nicholas, does that seem feasible?
> >>>>
> >>>>
> >>>> I'm coming around from "this is never going to work" to "it might
> >>>> actually work" with our hardware...
> >>>
> >>> Yeah, agree. It's a bit tricky, but doable.
> >>
> >> A "disable_tiling" hook or something along those lines could work for
> >> display. It's a little bit non trivial when you want to start dealing
> >> with locking and any active DRM commits, but we have a global lock
> >> around all our hardware programming anyway that makes that easier to
> >> deal with.
> > 
> > This code only runs when there's a kernel panic, so it doesn't have to
> > care about such things. :) In fact, it should rather avoid them and only
> > do the absolute minimum register writes needed to do the job.
> 
> Good point. I guess it doesn't really matter at that point.
> 
> > 
> > 
> >> I think we can just re-commit and update the existing hardware state
> >> with only the tiling info for every plane reset to off.
> > 
> > There's also a concern that non-primary planes might prevent the output
> > from being visible.
> 
> I would just hope that the overlay doesn't cover the full screen, or 
> that the panic handler writes to all buffers if possible.
> 
> DC would have to swap the current context in order to disable planes and 
> we'd have to do memory allocations / frees etc in order to do so.

You can't. And I'm with Michel, you don't want to run any of the DC code.
Pulling in ~100kloc of dependency in panic context just doesn't work
(that's what we tried before essentially).

What you can do is
- trylock, and bail out if any lock is locked (since that means undefined
  hw or sw state, and then you really shouldn't touch anything)
- maybe write a few very specific registers with the most minimal amount
  of code (since we need to validate every line that runs in oops/panic
  context to make sure it still works in nmi context). Stuff like
  disabling tiling (on i915 that's just flipping a bit in a register).
- memory writes (or that slow indirect write interface you have for
  cpu invisible vram) to draw the actual oops.

Everything else is out of question. Especially calling existing display
code just doesn't work, it's ime impossible to keep allocations/spinlocks
and stuff like that out of it.
-Daniel

> >> For most buffers I don't think we'd have to really consider changing
> >> anything else here as long as you respect the current FB size and
> >> pitch.
> > 
> > Hmm, e.g. compression might also need to be disabled? Or maybe that's
> > implicitly off with no tiling?
> > 
> > 
> 
> Resetting the tiling info will also disable the compression in this 
> case, so the buffer should still be usable for this purpose.
> 
> Nicholas Kazlauskas

John Ogness March 14, 2019, 9:52 a.m. UTC | #43

On 2019-03-14, John Ogness <john.ogness@linutronix.de> wrote:
> On 2019-03-14, Daniel Vetter <daniel@ffwll.ch> wrote:
>> That's why we came up with the trylock + immediate bail out design if
>> that fails. Plus really only render the oops int whatever is the
>> current display buffer, so that we don't have to do any hw
>> programming at all.
>
> I think this is your best option. The real work will be identifying
> any/all spin locking that currently exists. For all of those, the code
> needs to change to:
>
> 1. trylock if oops_in_progress, otherwise spinlock

On second thought, you shouldn't use oops_in_progress. It would be
better if DRM had its own flag to signify that it is currently being
used in kmsg_dump context.

> 2. if trylock fails, the code must have a sane failure
>
> The 2nd point will be the difficult one. For example, you may have
> functions without a return value taking spinlocks. But now those
> functions could fail.

John Ogness

Kazlauskas, Nicholas March 14, 2019, 12:44 p.m. UTC | #44

On 3/14/19 5:50 AM, Daniel Vetter wrote:
> On Wed, Mar 13, 2019 at 05:41:52PM +0000, Kazlauskas, Nicholas wrote:
>> On 3/13/19 1:33 PM, Michel Dänzer wrote:
>>> On 2019-03-13 5:16 p.m., Kazlauskas, Nicholas wrote:
>>>> On 3/13/19 11:54 AM, Christian König wrote:
>>>>> Am 13.03.19 um 16:38 schrieb Michel Dänzer:
>>>>>> On 2019-03-13 2:37 p.m., Christian König wrote:
>>>>>>> Am 13.03.19 um 14:31 schrieb Ville Syrjälä:
>>>>>>>> On Wed, Mar 13, 2019 at 10:35:08AM +0100, Michel Dänzer wrote:
>>>>>>>>> On 2019-03-12 6:15 p.m., Noralf Trønnes wrote:
>>>>>>>>>> Den 12.03.2019 17.17, skrev Ville Syrjälä:
>>>>>>>>>>> On Tue, Mar 12, 2019 at 11:47:04AM +0100, Michel Dänzer wrote:
>>>>>>>>>>>> On 2019-03-11 6:42 p.m., Noralf Trønnes wrote:
>>>>>>>>>>>>> This adds support for outputting kernel messages on panic().
>>>>>>>>>>>>> A kernel message dumper is used to dump the log. The dumper
>>>>>>>>>>>>> iterates
>>>>>>>>>>>>> over each DRM device and it's crtc's to find suitable
>>>>>>>>>>>>> framebuffers.
>>>>>>>>>>>>>
>>>>>>>>>>>>> All the other dumpers are run before this one except mtdoops.
>>>>>>>>>>>>> Only atomic drivers are supported.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>>      [...]
>>>>>>>>>>>>>
>>>>>>>>>>>>> diff --git a/include/drm/drm_framebuffer.h
>>>>>>>>>>>>> b/include/drm/drm_framebuffer.h
>>>>>>>>>>>>> index f0b34c977ec5..f3274798ecfe 100644
>>>>>>>>>>>>> --- a/include/drm/drm_framebuffer.h
>>>>>>>>>>>>> +++ b/include/drm/drm_framebuffer.h
>>>>>>>>>>>>> @@ -94,6 +94,44 @@ struct drm_framebuffer_funcs {
>>>>>>>>>>>>>                   struct drm_file *file_priv, unsigned flags,
>>>>>>>>>>>>>                   unsigned color, struct drm_clip_rect *clips,
>>>>>>>>>>>>>                   unsigned num_clips);
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +    /**
>>>>>>>>>>>>> +     * @panic_vmap:
>>>>>>>>>>>>> +     *
>>>>>>>>>>>>> +     * Optional callback for panic handling.
>>>>>>>>>>>>> +     *
>>>>>>>>>>>>> +     * For vmapping the selected framebuffer in a panic context.
>>>>>>>>>>>>> Must
>>>>>>>>>>>>> +     * be super careful about locking (only trylocking allowed).
>>>>>>>>>>>>> +     *
>>>>>>>>>>>>> +     * RETURNS:
>>>>>>>>>>>>> +     *
>>>>>>>>>>>>> +     * NULL if it didn't work out, otherwise an opaque cookie
>>>>>>>>>>>>> which is
>>>>>>>>>>>>> +     * passed to @panic_draw_xy. It can be anything: vmap area,
>>>>>>>>>>>>> structure
>>>>>>>>>>>>> +     * with more details, just a few flags, ...
>>>>>>>>>>>>> +     */
>>>>>>>>>>>>> +    void *(*panic_vmap)(struct drm_framebuffer *fb);
>>>>>>>>>>>> FWIW, the panic_vmap hook cannot work in general with the
>>>>>>>>>>>> amdgpu/radeon
>>>>>>>>>>>> drivers:
>>>>>>>>>>>>
>>>>>>>>>>>> Framebuffers are normally tiled, writing to them with the CPU
>>>>>>>>>>>> results in
>>>>>>>>>>>> garbled output.
>>>>>>>>>>>>
>>>>>>>>>> In which case the driver needs to support the ->panic_draw_xy
>>>>>>>>>> callback,
>>>>>>>>>> or maybe it's possible to make a generic helper for tiled buffers.
>>>>>>>>> I'm afraid that won't help, at least not without porting big chunks of
>>>>>>>>> https://gitlab.freedesktop.org/mesa/mesa/tree/master/src/amd/addrlib
>>>>>>>>> into the kernel, none of which will be used for anything else.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>> There would need to be a mechanism for switching scanout to a
>>>>>>>>>>>> linear,
>>>>>>>>>>>> CPU accessible framebuffer.
>>>>>>>>>>> I suppose panic_vmap() could just provide a linear temp buffer
>>>>>>>>>>> to the panic handler, and panic_unmap() could copy the contents
>>>>>>>>>>> over to the real fb.
>>>>>>>>> Copy how? Using a GPU engine?
>>>>>>>> CPU maybe? Though I suppose that won't work if the buffer isn't CPU
>>>>>>>> accesible :/
>>>>>>> Well we do have a debug path for accessing invisible memory with the
>>>>>>> CPU.
>>>>>>>
>>>>>>> E.g. three registers: DATA and auto increment OFFSET_LO/HI. So you can
>>>>>>> just read/write DATA over and over again if you want to access some
>>>>>>> memory.
>>>>>> Right. I assume that'll be very slow, but I guess it could do when the
>>>>>> memory isn't directly CPU accessible.
>>>>>
>>>>> Just made a quick test and reading 33423360 bytes (4096x2040x4) using
>>>>> that interfaces takes about 13 seconds.
>>>>>
>>>>> IIRC we don't use the auto increment optimization yet, so that can
>>>>> probably be improved by a factor of 3 or more.
>>>
>>> I'd assume only writes are needed, no reads.
>>>
>>>
>>>>>>> But turning of tilling etc is still extremely tricky when the system is
>>>>>>> already unstable.
>>>>>> Maybe we could add a little hook to the display code, which just
>>>>>> disables tiling for scanout and maybe disables non-primary planes, but
>>>>>> doesn't touch anything else. Harry / Nicholas, does that seem feasible?
>>>>>>
>>>>>>
>>>>>> I'm coming around from "this is never going to work" to "it might
>>>>>> actually work" with our hardware...
>>>>>
>>>>> Yeah, agree. It's a bit tricky, but doable.
>>>>
>>>> A "disable_tiling" hook or something along those lines could work for
>>>> display. It's a little bit non trivial when you want to start dealing
>>>> with locking and any active DRM commits, but we have a global lock
>>>> around all our hardware programming anyway that makes that easier to
>>>> deal with.
>>>
>>> This code only runs when there's a kernel panic, so it doesn't have to
>>> care about such things. :) In fact, it should rather avoid them and only
>>> do the absolute minimum register writes needed to do the job.
>>
>> Good point. I guess it doesn't really matter at that point.
>>
>>>
>>>
>>>> I think we can just re-commit and update the existing hardware state
>>>> with only the tiling info for every plane reset to off.
>>>
>>> There's also a concern that non-primary planes might prevent the output
>>> from being visible.
>>
>> I would just hope that the overlay doesn't cover the full screen, or
>> that the panic handler writes to all buffers if possible.
>>
>> DC would have to swap the current context in order to disable planes and
>> we'd have to do memory allocations / frees etc in order to do so.
> 
> You can't. And I'm with Michel, you don't want to run any of the DC code.
> Pulling in ~100kloc of dependency in panic context just doesn't work
> (that's what we tried before essentially).
> 
> What you can do is
> - trylock, and bail out if any lock is locked (since that means undefined
>    hw or sw state, and then you really shouldn't touch anything)
> - maybe write a few very specific registers with the most minimal amount
>    of code (since we need to validate every line that runs in oops/panic
>    context to make sure it still works in nmi context). Stuff like
>    disabling tiling (on i915 that's just flipping a bit in a register).
> - memory writes (or that slow indirect write interface you have for
>    cpu invisible vram) to draw the actual oops.
> 
> Everything else is out of question. Especially calling existing display
> code just doesn't work, it's ime impossible to keep allocations/spinlocks
> and stuff like that out of it.
> -Daniel

At some point we'll have to run at least some DC code since disabling 
tiling and compression is ASIC specific and you need to know what 
registers to actually write.

There's no fast path that DC exposes specifically for this usecase, but 
locking is controlled outside of DC anyway (so we don't have to lock) 
and it DC itself won't be doing any memory allocations if it's just 
disabling tiling and compression.

It's not actually that many register reads/writes, and it could probably 
stripped down to just writes since the reads are only used for sanity 
checking and the system is going down anyway. That part might need a 
separate interface for dc_commit_updates_for_stream or for amdgpu_dm to 
disable sanity checking before the call through DC's debug options.

Nicholas Kazlauskas

> 
>>>> For most buffers I don't think we'd have to really consider changing
>>>> anything else here as long as you respect the current FB size and
>>>> pitch.
>>>
>>> Hmm, e.g. compression might also need to be disabled? Or maybe that's
>>> implicitly off with no tiling?
>>>
>>>
>>
>> Resetting the tiling info will also disable the compression in this
>> case, so the buffer should still be usable for this purpose.
>>
>> Nicholas Kazlauskas
>

Daniel Vetter March 15, 2019, 10:56 a.m. UTC | #45

On Thu, Mar 14, 2019 at 10:52:08AM +0100, John Ogness wrote:
> On 2019-03-14, John Ogness <john.ogness@linutronix.de> wrote:
> > On 2019-03-14, Daniel Vetter <daniel@ffwll.ch> wrote:
> >> That's why we came up with the trylock + immediate bail out design if
> >> that fails. Plus really only render the oops int whatever is the
> >> current display buffer, so that we don't have to do any hw
> >> programming at all.
> >
> > I think this is your best option. The real work will be identifying
> > any/all spin locking that currently exists. For all of those, the code
> > needs to change to:
> >
> > 1. trylock if oops_in_progress, otherwise spinlock
> 
> On second thought, you shouldn't use oops_in_progress. It would be
> better if DRM had its own flag to signify that it is currently being
> used in kmsg_dump context.

We use oops_in_progress because the fbcon design is a terrible layer cake.
I agree that for the new code we shouldn't use it: New code paths that do
trylock only.

> > 2. if trylock fails, the code must have a sane failure
> >
> > The 2nd point will be the difficult one. For example, you may have
> > functions without a return value taking spinlocks. But now those
> > functions could fail.

Nah, imo no code reuse for the oops path. If you do that you sooner or
later end up pulling in the 100kloc of modeset code, and we've been there.
Doesn't work. New, absolutely minimal oops code, with as little of it
shared with normal kms code.

That's also why we absolutely have to be able to test this stuff without a
real oops.
-Daniel

Daniel Vetter March 15, 2019, 10:58 a.m. UTC | #46

On Thu, Mar 14, 2019 at 12:44:06PM +0000, Kazlauskas, Nicholas wrote:
> On 3/14/19 5:50 AM, Daniel Vetter wrote:
> > On Wed, Mar 13, 2019 at 05:41:52PM +0000, Kazlauskas, Nicholas wrote:
> >> On 3/13/19 1:33 PM, Michel Dänzer wrote:
> >>> On 2019-03-13 5:16 p.m., Kazlauskas, Nicholas wrote:
> >>>> On 3/13/19 11:54 AM, Christian König wrote:
> >>>>> Am 13.03.19 um 16:38 schrieb Michel Dänzer:
> >>>>>> On 2019-03-13 2:37 p.m., Christian König wrote:
> >>>>>>> Am 13.03.19 um 14:31 schrieb Ville Syrjälä:
> >>>>>>>> On Wed, Mar 13, 2019 at 10:35:08AM +0100, Michel Dänzer wrote:
> >>>>>>>>> On 2019-03-12 6:15 p.m., Noralf Trønnes wrote:
> >>>>>>>>>> Den 12.03.2019 17.17, skrev Ville Syrjälä:
> >>>>>>>>>>> On Tue, Mar 12, 2019 at 11:47:04AM +0100, Michel Dänzer wrote:
> >>>>>>>>>>>> On 2019-03-11 6:42 p.m., Noralf Trønnes wrote:
> >>>>>>>>>>>>> This adds support for outputting kernel messages on panic().
> >>>>>>>>>>>>> A kernel message dumper is used to dump the log. The dumper
> >>>>>>>>>>>>> iterates
> >>>>>>>>>>>>> over each DRM device and it's crtc's to find suitable
> >>>>>>>>>>>>> framebuffers.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> All the other dumpers are run before this one except mtdoops.
> >>>>>>>>>>>>> Only atomic drivers are supported.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
> >>>>>>>>>>>>> ---
> >>>>>>>>>>>>>      [...]
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> diff --git a/include/drm/drm_framebuffer.h
> >>>>>>>>>>>>> b/include/drm/drm_framebuffer.h
> >>>>>>>>>>>>> index f0b34c977ec5..f3274798ecfe 100644
> >>>>>>>>>>>>> --- a/include/drm/drm_framebuffer.h
> >>>>>>>>>>>>> +++ b/include/drm/drm_framebuffer.h
> >>>>>>>>>>>>> @@ -94,6 +94,44 @@ struct drm_framebuffer_funcs {
> >>>>>>>>>>>>>                   struct drm_file *file_priv, unsigned flags,
> >>>>>>>>>>>>>                   unsigned color, struct drm_clip_rect *clips,
> >>>>>>>>>>>>>                   unsigned num_clips);
> >>>>>>>>>>>>> +
> >>>>>>>>>>>>> +    /**
> >>>>>>>>>>>>> +     * @panic_vmap:
> >>>>>>>>>>>>> +     *
> >>>>>>>>>>>>> +     * Optional callback for panic handling.
> >>>>>>>>>>>>> +     *
> >>>>>>>>>>>>> +     * For vmapping the selected framebuffer in a panic context.
> >>>>>>>>>>>>> Must
> >>>>>>>>>>>>> +     * be super careful about locking (only trylocking allowed).
> >>>>>>>>>>>>> +     *
> >>>>>>>>>>>>> +     * RETURNS:
> >>>>>>>>>>>>> +     *
> >>>>>>>>>>>>> +     * NULL if it didn't work out, otherwise an opaque cookie
> >>>>>>>>>>>>> which is
> >>>>>>>>>>>>> +     * passed to @panic_draw_xy. It can be anything: vmap area,
> >>>>>>>>>>>>> structure
> >>>>>>>>>>>>> +     * with more details, just a few flags, ...
> >>>>>>>>>>>>> +     */
> >>>>>>>>>>>>> +    void *(*panic_vmap)(struct drm_framebuffer *fb);
> >>>>>>>>>>>> FWIW, the panic_vmap hook cannot work in general with the
> >>>>>>>>>>>> amdgpu/radeon
> >>>>>>>>>>>> drivers:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Framebuffers are normally tiled, writing to them with the CPU
> >>>>>>>>>>>> results in
> >>>>>>>>>>>> garbled output.
> >>>>>>>>>>>>
> >>>>>>>>>> In which case the driver needs to support the ->panic_draw_xy
> >>>>>>>>>> callback,
> >>>>>>>>>> or maybe it's possible to make a generic helper for tiled buffers.
> >>>>>>>>> I'm afraid that won't help, at least not without porting big chunks of
> >>>>>>>>> https://gitlab.freedesktop.org/mesa/mesa/tree/master/src/amd/addrlib
> >>>>>>>>> into the kernel, none of which will be used for anything else.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>>> There would need to be a mechanism for switching scanout to a
> >>>>>>>>>>>> linear,
> >>>>>>>>>>>> CPU accessible framebuffer.
> >>>>>>>>>>> I suppose panic_vmap() could just provide a linear temp buffer
> >>>>>>>>>>> to the panic handler, and panic_unmap() could copy the contents
> >>>>>>>>>>> over to the real fb.
> >>>>>>>>> Copy how? Using a GPU engine?
> >>>>>>>> CPU maybe? Though I suppose that won't work if the buffer isn't CPU
> >>>>>>>> accesible :/
> >>>>>>> Well we do have a debug path for accessing invisible memory with the
> >>>>>>> CPU.
> >>>>>>>
> >>>>>>> E.g. three registers: DATA and auto increment OFFSET_LO/HI. So you can
> >>>>>>> just read/write DATA over and over again if you want to access some
> >>>>>>> memory.
> >>>>>> Right. I assume that'll be very slow, but I guess it could do when the
> >>>>>> memory isn't directly CPU accessible.
> >>>>>
> >>>>> Just made a quick test and reading 33423360 bytes (4096x2040x4) using
> >>>>> that interfaces takes about 13 seconds.
> >>>>>
> >>>>> IIRC we don't use the auto increment optimization yet, so that can
> >>>>> probably be improved by a factor of 3 or more.
> >>>
> >>> I'd assume only writes are needed, no reads.
> >>>
> >>>
> >>>>>>> But turning of tilling etc is still extremely tricky when the system is
> >>>>>>> already unstable.
> >>>>>> Maybe we could add a little hook to the display code, which just
> >>>>>> disables tiling for scanout and maybe disables non-primary planes, but
> >>>>>> doesn't touch anything else. Harry / Nicholas, does that seem feasible?
> >>>>>>
> >>>>>>
> >>>>>> I'm coming around from "this is never going to work" to "it might
> >>>>>> actually work" with our hardware...
> >>>>>
> >>>>> Yeah, agree. It's a bit tricky, but doable.
> >>>>
> >>>> A "disable_tiling" hook or something along those lines could work for
> >>>> display. It's a little bit non trivial when you want to start dealing
> >>>> with locking and any active DRM commits, but we have a global lock
> >>>> around all our hardware programming anyway that makes that easier to
> >>>> deal with.
> >>>
> >>> This code only runs when there's a kernel panic, so it doesn't have to
> >>> care about such things. :) In fact, it should rather avoid them and only
> >>> do the absolute minimum register writes needed to do the job.
> >>
> >> Good point. I guess it doesn't really matter at that point.
> >>
> >>>
> >>>
> >>>> I think we can just re-commit and update the existing hardware state
> >>>> with only the tiling info for every plane reset to off.
> >>>
> >>> There's also a concern that non-primary planes might prevent the output
> >>> from being visible.
> >>
> >> I would just hope that the overlay doesn't cover the full screen, or
> >> that the panic handler writes to all buffers if possible.
> >>
> >> DC would have to swap the current context in order to disable planes and
> >> we'd have to do memory allocations / frees etc in order to do so.
> > 
> > You can't. And I'm with Michel, you don't want to run any of the DC code.
> > Pulling in ~100kloc of dependency in panic context just doesn't work
> > (that's what we tried before essentially).
> > 
> > What you can do is
> > - trylock, and bail out if any lock is locked (since that means undefined
> >    hw or sw state, and then you really shouldn't touch anything)
> > - maybe write a few very specific registers with the most minimal amount
> >    of code (since we need to validate every line that runs in oops/panic
> >    context to make sure it still works in nmi context). Stuff like
> >    disabling tiling (on i915 that's just flipping a bit in a register).
> > - memory writes (or that slow indirect write interface you have for
> >    cpu invisible vram) to draw the actual oops.
> > 
> > Everything else is out of question. Especially calling existing display
> > code just doesn't work, it's ime impossible to keep allocations/spinlocks
> > and stuff like that out of it.
> > -Daniel
> 
> At some point we'll have to run at least some DC code since disabling 
> tiling and compression is ASIC specific and you need to know what 
> registers to actually write.
> 
> There's no fast path that DC exposes specifically for this usecase, but 
> locking is controlled outside of DC anyway (so we don't have to lock) 
> and it DC itself won't be doing any memory allocations if it's just 
> disabling tiling and compression.
> 
> It's not actually that many register reads/writes, and it could probably 
> stripped down to just writes since the reads are only used for sanity 
> checking and the system is going down anyway. That part might need a 
> separate interface for dc_commit_updates_for_stream or for amdgpu_dm to 
> disable sanity checking before the call through DC's debug options.

I'd copypaste that code for oops handling. Resetting to untiled shouldn't
be that much code. And the risk of accidentally pulling in a lot of other
bits are imo too high.
-Daniel

[v2,1/3] drm: Add support for panic message output

Commit Message

Comments

Patch