
[0/8] tracing: Persistent traces across a reboot or crash

Message ID 20240306015910.766510873@goodmis.org (mailing list archive)

Message

Steven Rostedt March 6, 2024, 1:59 a.m. UTC
This is a way to map a ring buffer instance across reboots.
The requirement is that you have a memory region that is not erased.
I tested this on a Debian VM running on QEMU on a Debian server, and
even tested it on a bare-metal box running Fedora. I was surprised that
it worked on the bare-metal box, but it does so quite consistently.

The idea is that you can reserve a memory region and save it in two
special variables:

  trace_buffer_start and trace_buffer_size

If these are set by the time fs_initcall() runs, then a "boot_mapped"
instance is created. The memory that was reserved is used by the ring
buffer of this instance. It acts like a memory mapped instance so it has
some limitations: it does not allow snapshots, nor does it allow tracers
that use a snapshot buffer (like the irqsoff and wakeup tracers).
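
To give a rough idea of what this could look like, here is a sketch
(and only a sketch: the hook, address and size below are made up; the
POC itself just hard codes an address, per the setup.c change in the
diffstat). Something early in boot reserves the range and publishes it
in those two variables before the initcalls run:

  /*
   * Sketch only. I'm assuming the two variables are the ones this
   * series declares (the diffstat touches include/linux/trace.h) and
   * that trace_buffer_start takes the kernel virtual address of the
   * reserved range.
   */
  #include <linux/memblock.h>
  #include <linux/sizes.h>
  #include <linux/trace.h>

  static void __init reserve_boot_trace_buffer(void)
  {
          phys_addr_t addr = 0x285400000ULL;  /* made-up fixed address */
          phys_addr_t size = SZ_8M;           /* made-up size */

          /* Keep the range away from the normal page allocator */
          if (memblock_reserve(addr, size))
                  return;

          trace_buffer_start = (unsigned long)__va(addr);
          trace_buffer_size  = size;
  }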

On boot up, when setting up the ring buffer, it looks at the current
content and performs a rigorous test to see if that content is valid.
It even walks the events in all the sub-buffers to make sure the ring
buffer metadata is correct. If it determines that the content is valid,
it will reconstruct the ring buffer to use the content it has found.

If the buffer is valid, then on the next boot the boot_mapped instance
will contain the data from the previous boot. You can cat the trace or
trace_pipe file, or even run trace-cmd extract on it to make a trace.dat
file that holds the data. This is much better than dealing with
ftrace_dump_on_oops (I wish I had this a decade ago!)

There are still some limitations to this buffer. One is that it assumes
that the kernel you are booting back into is the same one that crashed,
or at least that the trace_events (like sched_switch and friends) all
have the same ids. This holds true for the same kernel, as the ids are
determined at link time.

Module events could possibly be a problem, as their ids may not match.

One idea is to just print the raw fields and not process the print formats
for this instance, as the print formats may do some crazy things with
data that does not match.

Another limitation is that any print format that has "%pS" will likely
not work. That's because the pointer saved in the old ring buffer is for
an address that may differ from where that function is loaded now. I was
thinking of adding a file in the boot_mapped instance that holds the
delta between the old mapping and the new mapping, so that trace-cmd and
perf could calculate the current kallsyms from the old pointers.
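
Just to illustrate the idea (nothing here exists yet; the file, the
helper and its arguments are all hypothetical), the fixup on the
tooling side would be trivial:

  /*
   * Hypothetical sketch: if the instance exposed the old boot's text
   * base next to the current one, a tool could shift the stale %pS
   * pointers before resolving them against the current kallsyms.
   */
  static unsigned long fixup_old_text_ptr(unsigned long old_ptr,
                                          unsigned long old_text_base,
                                          unsigned long new_text_base)
  {
          return old_ptr - old_text_base + new_text_base;
  }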

Finally, this is still a proof of concept. How to create this memory
mapping isn't decided yet. In this patch set I simply hacked into the
kexec crash code and hard coded an address that worked for one of my
machines (for the other machine I had to play around to find another
address). Perhaps we could add a kernel command line parameter that lets
people decide, or an option where it could possibly look at the ACPI
(for Intel) tables to come up with an address on its own.

Anyway, I plan on using this for debugging, as it already is pretty
featureful but there's much more that can be done.

Basically, all you need to do is:

  echo 1 > /sys/kernel/tracing/instances/boot_mapped/events/enable

Do whatever you want until the system crashes (and boots back into the
same kernel). Then:

  cat /sys/kernel/tracing/instances/boot_mapped/trace

and it will have the trace.

I'm sure there are still some gotchas here, which is why this is
currently still just a POC.

Enjoy...

Steven Rostedt (Google) (8):
      ring-buffer: Allow mapped field to be set without mapping
      ring-buffer: Add ring_buffer_alloc_range()
      tracing: Create "boot_mapped" instance for memory mapped buffer
      HACK: Hard code in mapped tracing buffer address
      ring-buffer: Add ring_buffer_meta data
      ring-buffer: Add output of ring buffer meta page
      ring-buffer: Add test if range of boot buffer is valid
      ring-buffer: Validate boot range memory events

----
 arch/x86/kernel/setup.c     |  20 ++
 include/linux/ring_buffer.h |  17 +
 include/linux/trace.h       |   7 +
 kernel/trace/ring_buffer.c  | 826 ++++++++++++++++++++++++++++++++++++++------
 kernel/trace/trace.c        |  95 ++++-
 kernel/trace/trace.h        |   5 +
 6 files changed, 856 insertions(+), 114 deletions(-)

Comments

Steven Rostedt March 6, 2024, 2:01 a.m. UTC | #1
I forgot to add [POC] to the topic.

All these patches are a proof of concept.

-- Steve
Kees Cook March 9, 2024, 6:27 p.m. UTC | #2
On Tue, Mar 05, 2024 at 08:59:10PM -0500, Steven Rostedt wrote:
> This is a way to map a ring buffer instance across reboots.

As mentioned on Fedi, check out the persistent storage subsystem
(pstore)[1]. It already does what you're starting to construct for RAM
backends (but also supports reed-solomon ECC), and supports several
other backends including EFI storage (which is default enabled on at
least Fedora[2]), block devices, etc. It has an existing mechanism for
handling reservations (including via device tree), and supports multiple
"frontends" including the Oops handler, console output, and even ftrace
which does per-cpu recording and event reconstruction (Joel wrote this
frontend).

It should be pretty straight forward to implement a new frontend if the
ftrace one isn't flexible enough. It's a bit clunky still to add one,
but search for "ftrace" in fs/pstore/ram.c to see how to plumb a new
frontend into the RAM backend.

I continue to want to lift the frontend configuration options up into
the pstore core, since it would avoid a bunch of redundancy, but this is
where we are currently. :)

-Kees

[1] CONFIG_PSTORE et. al. in fs/pstore/ https://docs.kernel.org/admin-guide/ramoops.html
[2] https://www.freedesktop.org/software/systemd/man/latest/systemd-pstore.service.html
Steven Rostedt March 9, 2024, 6:51 p.m. UTC | #3
On Sat, 9 Mar 2024 10:27:47 -0800
Kees Cook <keescook@chromium.org> wrote:

> On Tue, Mar 05, 2024 at 08:59:10PM -0500, Steven Rostedt wrote:
> > This is a way to map a ring buffer instance across reboots.  
> 
> As mentioned on Fedi, check out the persistent storage subsystem
> (pstore)[1]. It already does what you're starting to construct for RAM
> backends (but also supports reed-solomon ECC), and supports several
> other backends including EFI storage (which is default enabled on at
> least Fedora[2]), block devices, etc. It has an existing mechanism for
> handling reservations (including via device tree), and supports multiple
> "frontends" including the Oops handler, console output, and even ftrace
> which does per-cpu recording and event reconstruction (Joel wrote this
> frontend).

Mathieu was telling me about the pmem infrastructure.

This patch set doesn't care where the memory comes from. You just give
it an address and size, and it will do the rest.

> 
> It should be pretty straight forward to implement a new frontend if the
> ftrace one isn't flexible enough. It's a bit clunky still to add one,
> but search for "ftrace" in fs/pstore/ram.c to see how to plumb a new
> frontend into the RAM backend.
> 
> I continue to want to lift the frontend configuration options up into
> the pstore core, since it would avoid a bunch of redundancy, but this is
> where we are currently. :)

Thanks for the info. We use pstore on ChromeOS, but it is currently
restricted to 1MB which is too small for the tracing buffers. From what
I understand, it's also in a specific location where there's only 1MB
available for contiguous memory.

I'm looking at finding a way to get consistent memory outside that
range. That's what I'll be doing next week ;-)

But this code was just to see if I could get a single contiguous range
of memory mapped to ftrace, and this patch set does exactly that.

> 
> -Kees
> 
> [1] CONFIG_PSTORE et. al. in fs/pstore/ https://docs.kernel.org/admin-guide/ramoops.html
> [2] https://www.freedesktop.org/software/systemd/man/latest/systemd-pstore.service.html
> 

Thanks!

-- Steve
Kees Cook March 9, 2024, 8:40 p.m. UTC | #4
On Sat, Mar 09, 2024 at 01:51:16PM -0500, Steven Rostedt wrote:
> On Sat, 9 Mar 2024 10:27:47 -0800
> Kees Cook <keescook@chromium.org> wrote:
> 
> > On Tue, Mar 05, 2024 at 08:59:10PM -0500, Steven Rostedt wrote:
> > > This is a way to map a ring buffer instance across reboots.  
> > 
> > As mentioned on Fedi, check out the persistent storage subsystem
> > (pstore)[1]. It already does what you're starting to construct for RAM
> > backends (but also supports reed-solomon ECC), and supports several
> > other backends including EFI storage (which is default enabled on at
> > least Fedora[2]), block devices, etc. It has an existing mechanism for
> > handling reservations (including via device tree), and supports multiple
> > "frontends" including the Oops handler, console output, and even ftrace
> > which does per-cpu recording and event reconstruction (Joel wrote this
> > frontend).
> 
> Mathieu was telling me about the pmem infrastructure.

I use nvdimm to back my RAM backend testing with qemu so I can examine
the storage "externally":

RAM_SIZE=16384
NVDIMM_SIZE=200
MAX_SIZE=$(( RAM_SIZE + NVDIMM_SIZE ))
...
qemu-system-x86_64 \
	...
        -machine pc,nvdimm=on \
        -m ${RAM_SIZE}M,slots=2,maxmem=${MAX_SIZE}M \
        -object memory-backend-file,id=mem1,share=on,mem-path=$IMAGES/x86/nvdimm.img,size=${NVDIMM_SIZE}M,align=128M \
        -device nvdimm,id=nvdimm1,memdev=mem1,label-size=1M \
	...
        -append 'console=uart,io,0x3f8,115200n8 loglevel=8 root=/dev/vda1 ro ramoops.mem_size=1048576 ramoops.ecc=1 ramoops.mem_address=0x440000000 ramoops.console_size=16384 ramoops.ftrace_size=16384 ramoops.pmsg_size=16384 ramoops.record_size=32768 panic=-1 init=/root/resume.sh '"$@"


The part I'd like to get wired up sanely is having pstore find the
nvdimm area automatically, but it never quite happened:
https://lore.kernel.org/lkml/CAGXu5jLtmb3qinZnX3rScUJLUFdf+pRDVPjy=CS4KUtW9tLHtw@mail.gmail.com/

> Thanks for the info. We use pstore on ChromeOS, but it is currently
> restricted to 1MB which is too small for the tracing buffers. From what
> I understand, it's also in a specific location where there's only 1MB
> available for contiguous memory.

That's the area that is specifically hardware backed with persistent
RAM.

> I'm looking at finding a way to get consistent memory outside that
> range. That's what I'll be doing next week ;-)
> 
> But this code was just to see if I could get a single contiguous range
> of memory mapped to ftrace, and this patch set does exactly that.

Well, please take a look at pstore. It should be able to do everything
you mention already; it just needs a way to define multiple regions if
you want to use an area outside of the persistent ram area defined by
Chrome OS's platform driver.

-Kees
Steven Rostedt March 20, 2024, 12:44 a.m. UTC | #5
On Sat, 9 Mar 2024 12:40:51 -0800
Kees Cook <keescook@chromium.org> wrote:

> The part I'd like to get wired up sanely is having pstore find the
> nvdimm area automatically, but it never quite happened:
> https://lore.kernel.org/lkml/CAGXu5jLtmb3qinZnX3rScUJLUFdf+pRDVPjy=CS4KUtW9tLHtw@mail.gmail.com/

The automatic detection is what I'm looking for.

> 
> > Thanks for the info. We use pstore on ChromeOS, but it is currently
> > restricted to 1MB which is too small for the tracing buffers. From what
> > I understand, it's also in a specific location where there's only 1MB
> > available for contiguous memory.  
> 
> That's the area that is specifically hardware backed with persistent
> RAM.
> 
> > I'm looking at finding a way to get consistent memory outside that
> > range. That's what I'll be doing next week ;-)
> > 
> > But this code was just to see if I could get a single contiguous range
> > of memory mapped to ftrace, and this patch set does exactly that.  
> 
> Well, please take a look at pstore. It should be able to do everything
> you mention already; it just needs a way to define multiple regions if
> you want to use an area outside of the persistent ram area defined by
> Chrome OS's platform driver.

I'm not exactly sure how to use pstore here. At boot up I just need
some memory consistently reserved for the tracing buffer. It just needs
to be at the same location on every boot.

I don't need a front end, if by that you mean a way to access it from
user space. The front end is the tracefs directory, as I need all the
features that the tracefs directory gives.

I'm going to look to see how pstore is set up in ChromeOS and see if I can
use whatever it does to allocate another location.

-- Steve