diff mbox series

[1/3] ring-buffer: Add uname to match criteria for persistent ring buffer

Message ID 20241217173520.314190793@goodmis.org (mailing list archive)
State New
Headers show
Series ring-buffer: Hardening of the persistent buffer | expand

Commit Message

Steven Rostedt Dec. 17, 2024, 5:32 p.m. UTC
From: Steven Rostedt <rostedt@goodmis.org>

The persistent ring buffer can live across boots. It is expected that the
content in the buffer can be translated to the current kernel with delta
offsets even with KASLR enabled. But it can only guarantee this if the
content of the ring buffer came from the same kernel as the one that is
currently running.

Add uname into the meta data and if the uname in the meta data from the
previous boot does not match the uname of the current boot, then clear the
buffer and re-initialize it.

This only handles the case of kernel versions. It does not clear the
buffer for development. There's several mechanisms to keep bad data from
crashing the kernel. The worse that can happen is some corrupt data may be
displayed.

Cc: stable@vger.kernel.org
Fixes: 8f3e6659656e6 ("ring-buffer: Save text and data locations in mapped meta data")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 kernel/trace/ring_buffer.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

Comments

Linus Torvalds Dec. 17, 2024, 5:46 p.m. UTC | #1
On Tue, 17 Dec 2024 at 09:34, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Add uname into the meta data and if the uname in the meta data from the
> previous boot does not match the uname of the current boot, then clear the
> buffer and re-initialize it.

This seems broken.

The problem is not that the previous boot data is wrong.

The problem is that you printed it *out* wrong by trying to interpret
pointers in it.

Now you basically hide that, and make it harder to see any data from a
bad kernel (since you presumably need to boot into a good kernel to do
analysis).

The real fix seems to have been your 3/3, which still prints out the
data, but stops trying to interpret the pointers in it.

Except you should also remove the last_text_delta / last_data_delta
stuff. That's all about exactly that "trying to interpret bogus
pointers". Instead you seem to have actually just *added* a case of
that in your 3/3.

           Linus
Steven Rostedt Dec. 17, 2024, 6:04 p.m. UTC | #2
On Tue, 17 Dec 2024 09:46:30 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 17 Dec 2024 at 09:34, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > Add uname into the meta data and if the uname in the meta data from the
> > previous boot does not match the uname of the current boot, then clear the
> > buffer and re-initialize it.  
> 
> This seems broken.

BTW, this isn't the fix for the trace_check_vprintf(). That is here:

  https://lore.kernel.org/linux-trace-kernel/20241217024118.587584221@goodmis.org/

Where the I completely remove that function.

> 
> The problem is not that the previous boot data is wrong.
> 
> The problem is that you printed it *out* wrong by trying to interpret
> pointers in it.
> 
> Now you basically hide that, and make it harder to see any data from a
> bad kernel (since you presumably need to boot into a good kernel to do
> analysis).
> 
> The real fix seems to have been your 3/3, which still prints out the
> data, but stops trying to interpret the pointers in it.
> 
> Except you should also remove the last_text_delta / last_data_delta
> stuff. That's all about exactly that "trying to interpret bogus
> pointers". Instead you seem to have actually just *added* a case of
> that in your 3/3.

I'm not sure what you mean. If the kernels are the same, then the pointers
should also be the same, as KASLR just shifts them. This no longer uses
module code. It only traces core kernel code which should always have the
same offsets.

With the last_text_delta (and last_data_delta is always the same, but I
added it in the case that ever changes), the shifts have always been
accurate. For example, for RCU event strings.

With the pointer update to print_fields() I get:

           <...>-912     [001] d..1.   304.817710: rcu_utilization: s=(0xffffffff9c4026b5:Start context switch)
           <...>-912     [001] d..1.   304.817713: rcu_utilization: s=(0xffffffff9c4ca779:End context switch)
          <idle>-0       [002] dN.1.   304.817769: rcu_utilization: s=(0xffffffff9c4026b5:Start context switch)
          <idle>-0       [002] dN.1.   304.817772: rcu_utilization: s=(0xffffffff9c4ca779:End context switch)
           <...>-19      [002] d..1.   304.817805: rcu_utilization: s=(0xffffffff9c4026b5:Start context switch)
           <...>-19      [002] d..1.   304.817806: rcu_utilization: s=(0xffffffff9c4ca779:End context switch)
          <idle>-0       [006] dN.1.   304.817819: rcu_utilization: s=(0xffffffff9c4026b5:Start context switch)
          <idle>-0       [006] dN.1.   304.817821: rcu_utilization: s=(0xffffffff9c4ca779:End context switch)
           <...>-902     [006] d.h1.   304.817901: rcu_utilization: s=(0xffffffff9c488fdd:Start scheduler-tick)
           <...>-902     [006] d.h1.   304.817903: rcu_utilization: s=(0xffffffff9c426d45:End scheduler-tick)
          <idle>-0       [007] dN.1.   304.817913: rcu_utilization: s=(0xffffffff9c4026b5:Start context switch)
          <idle>-0       [007] dN.1.   304.817915: rcu_utilization: s=(0xffffffff9c4ca779:End context switch)
           <...>-18      [007] d..1.   304.817931: rcu_utilization: s=(0xffffffff9c4026b5:Start context switch)
           <...>-18      [007] d..1.   304.817931: rcu_utilization: s=(0xffffffff9c4ca779:End context switch)
           <...>-902     [006] ..s1.   304.817941: rcu_utilization: s=(0xffffffff9c470f20:Start RCU core)
           <...>-902     [006] ..s1.   304.817958: rcu_utilization: s=(0xffffffff9c419bb8:End RCU core)
          <idle>-0       [002] dN.1.   304.818003: rcu_utilization: s=(0xffffffff9c4026b5:Start context switch)
          <idle>-0       [002] dN.1.   304.818003: rcu_utilization: s=(0xffffffff9c4ca779:End context switch)

Without that calculation, all I get is garbage and completely useless:

           <...>-903     [000] d..1.    24.712788: rcu_utilization: s=(0xffffffff970026af)
           <...>-903     [000] d..1.    24.712791: rcu_utilization: s=(0xffffffff970ca75b)
          <idle>-0       [004] dN.1.    24.712796: rcu_utilization: s=(0xffffffff970026af)
          <idle>-0       [004] dN.1.    24.712799: rcu_utilization: s=(0xffffffff970ca75b)
           <...>-19      [004] d..1.    24.712825: rcu_utilization: s=(0xffffffff970026af)
           <...>-19      [004] d..1.    24.712825: rcu_utilization: s=(0xffffffff970ca75b)
           <...>-18      [004] d..1.    24.712838: rcu_utilization: s=(0xffffffff970026af)
           <...>-18      [004] d..1.    24.712839: rcu_utilization: s=(0xffffffff970ca75b)
          <idle>-0       [005] dN.1.    24.712881: rcu_utilization: s=(0xffffffff970026af)
          <idle>-0       [005] dN.1.    24.712883: rcu_utilization: s=(0xffffffff970ca75b)
           <...>-893     [005] d.h1.    24.712911: rcu_utilization: s=(0xffffffff97088feb)
           <...>-893     [005] d.h1.    24.712912: rcu_utilization: s=(0xffffffff97026d3f)
           <...>-893     [005] ..s1.    24.712942: rcu_utilization: s=(0xffffffff97070f2e)
           <...>-893     [005] ..s1.    24.712944: rcu_utilization: s=(0xffffffff97019bb2)
           <...>-893     [005] dN.1.    24.713412: rcu_utilization: s=(0xffffffff970026af)
           <...>-893     [005] dN.1.    24.713412: rcu_utilization: s=(0xffffffff970ca75b)

The delta calculations are done by saving an address of a symbol into the
meta data, and when the meta data is considered a match, it calculates the
delta between what was saved in the meta data to the same symbol in the
current kernel.

This works across several reboots too. That is, I can save the boot mapped
data, reboot several times, and still see the correct data in the ring
buffer, as the delta is calculated at each boot. It's only saved when
tracing starts and the ring buffer is re-initialized.

So what exactly are you saying is broken?

-- Steve
Steven Rostedt Dec. 17, 2024, 6:17 p.m. UTC | #3
On Tue, 17 Dec 2024 09:46:30 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Now you basically hide that, and make it harder to see any data from a
> bad kernel (since you presumably need to boot into a good kernel to do
> analysis).

Note, this isn't for debugging kernels that crash at boot. There's other
methods to debug things like that. Like, ftrace_dump_on_oops. This is for
debugging something that happens after boot or in the field. Where the
"bad" kernel can still boot in a good state. The kernel should be good
enough to boot into a state where you can extract the trace and send it
offline somewhere.

I used this to debugging a few things in development, where someone hands
me a reproducer. I enabled the boot_mapped buffer, run the reproducer, the
system crashes and reboots into the same kernel. Then I was able to debug
what happened, as long as I don't re-run the reproducer. But I did do
several iterations of:

  Start boot mapped tracing, run reproducer, crash, reboot, look at boot
  mapped trace. Restart boot mapped tracing with more events, run reproducer, 
  crash, reboot, look at mapped trace, wash, rinse, repeat.

-- Steve
Linus Torvalds Dec. 17, 2024, 6:19 p.m. UTC | #4
On Tue, 17 Dec 2024 at 10:04, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> I'm not sure what you mean. If the kernels are the same, then the pointers
> should also be the same, as KASLR just shifts them. This no longer uses
> module code. It only traces core kernel code which should always have the
> same offsets.

 (a) the shifting is what caused problems with you having to hack
round the format string and %pS.

 (b) the data addresses are more than shifted, so that "data_delta" is
*completely* bogus

 (c) neither of them are AT ALL valid for modules regardless

Stop using the delta fields. They are broken. Even for the same
kernel. It's literally a "I made sh*t up and it sometimes works"
situation.

That "sometimes works" is not how we do kernel development. Stop it.

What *woiuld* have been an acceptable model is to actually modify the
boot-time buffers in place, using actual real heuristics that look at
whether a pointer was IN THE CODE SECTION OR THE STATIC DATA section
of the previous boot.

But you never did that. All this delta code has always been complete
and utter garbage, and complete hacks.

Remove it.

Then, if at some point you can do it *right* (see above), maybe you
can try to re-introduce it. But the current delta code is pure and
utter garbage and needs to die. No more of this "hacking shit up to
make it sometimes work".

             Linus
Linus Torvalds Dec. 17, 2024, 6:24 p.m. UTC | #5
On Tue, 17 Dec 2024 at 10:19, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> What *woiuld* have been an acceptable model is to actually modify the
> boot-time buffers in place, using actual real heuristics that look at
> whether a pointer was IN THE CODE SECTION OR THE STATIC DATA section
> of the previous boot.
>
> But you never did that. All this delta code has always been complete
> and utter garbage, and complete hacks.

Actually, I think the proper model isn't even that "modify boot time
buffers in place" thing.

The proper model was probably always to just do the "give the raw
data, and analyze the previous boot data in user mode". Don't do
"delta between old and new kernel", print out the actual KASLR base of
the old kernel, and let user mode figure it out. Because user mode may
actually have the old and different vmlinux image too - something that
kernel mode *never* has.

                   Linus
Steven Rostedt Dec. 17, 2024, 6:33 p.m. UTC | #6
On Tue, 17 Dec 2024 10:19:45 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 17 Dec 2024 at 10:04, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > I'm not sure what you mean. If the kernels are the same, then the pointers
> > should also be the same, as KASLR just shifts them. This no longer uses
> > module code. It only traces core kernel code which should always have the
> > same offsets.  
> 
>  (a) the shifting is what caused problems with you having to hack
> round the format string and %pS.

How else do I get the function name?

> 
>  (b) the data addresses are more than shifted, so that "data_delta" is
> *completely* bogus

The data_delta and text_delta are equal, and I could get rid of delta_data,
as I haven't used it. I only used the text_delta as they are always the
same.

> 
>  (c) neither of them are AT ALL valid for modules regardless

I can make sure that it only works for core kernel code, and print the raw
address if it isn't.

> 
> Stop using the delta fields. They are broken. Even for the same
> kernel. It's literally a "I made sh*t up and it sometimes works"
> situation.
> 
> That "sometimes works" is not how we do kernel development. Stop it.

For core kernel code it *always works*. I haven't seen it fail yet.

This is the point of this patch series, is to remove the cases where it can
fail. That is, if the kernel isn't the same, or the use of modules and
dynamic events that are not stable across reboots.

But for the core kernel code, I have not seen it fail once!

> 
> What *woiuld* have been an acceptable model is to actually modify the
> boot-time buffers in place, using actual real heuristics that look at
> whether a pointer was IN THE CODE SECTION OR THE STATIC DATA section
> of the previous boot.

So basically, you want a full parser of the trace event code that reads the
boot time buffer and makes it match the current kernel?

> 
> But you never did that. All this delta code has always been complete
> and utter garbage, and complete hacks.
> 
> Remove it.
> 
> Then, if at some point you can do it *right* (see above), maybe you
> can try to re-introduce it. But the current delta code is pure and
> utter garbage and needs to die. No more of this "hacking shit up to
> make it sometimes work".

As I said. It doesn't "sometimes work" it "always works", at least for the
core kernel. And I agree that it shouldn't "sometimes work" which is why
this patch series sets out to remove those cases that do not work.

-- Steve
Steven Rostedt Dec. 17, 2024, 6:42 p.m. UTC | #7
On Tue, 17 Dec 2024 10:24:42 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 17 Dec 2024 at 10:19, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > What *woiuld* have been an acceptable model is to actually modify the
> > boot-time buffers in place, using actual real heuristics that look at
> > whether a pointer was IN THE CODE SECTION OR THE STATIC DATA section
> > of the previous boot.
> >
> > But you never did that. All this delta code has always been complete
> > and utter garbage, and complete hacks.  
> 
> Actually, I think the proper model isn't even that "modify boot time
> buffers in place" thing.
> 
> The proper model was probably always to just do the "give the raw
> data, and analyze the previous boot data in user mode". Don't do
> "delta between old and new kernel", print out the actual KASLR base of
> the old kernel, and let user mode figure it out. Because user mode may
> actually have the old and different vmlinux image too - something that
> kernel mode *never* has.

I already support this somewhat, as the text_delta (and data_delta) are
presented in the tracefs directory so that trace-cmd can parse it.

For my use case, this would work, as we are extracting the raw data and
need to do most the processing in user space anyway. I could have it export
the KASLR offset of the previous boot and then trace-cmd should be able to
handle it fine if the events and kallsyms of the previous boot are all
saved. It could easily put things back together, including modules and
dynamic events.

This will just make it useless for those that want to use the tracefs
directly.

-- Steve
Linus Torvalds Dec. 17, 2024, 6:42 p.m. UTC | #8
On Tue, 17 Dec 2024 at 10:32, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> How else do I get the function name?

My initial suggestion was to just fix up the boot time array.

I think that's actually wrong. Just print the raw data and analyze it
in user space.

> I can make sure that it only works for core kernel code, and print the raw
> address if it isn't.

Streven, STOP HACKING AROUND GARBAGE.

Your solution to "I had a bad idea that resulted in bad code" seems to
always be "write more bad code".

STOP IT.

Really. This is literally what I started the whole original complaint
about. Go back to my original email, and try to understand the
original issue. Let me quote the really relevant part of that email
again:

  This stuff is full of crazy special cases for things that should never
  be done in the first place.

Note - and really INTERNALIZE - that "for stuff that should never be
done in the first place".

You started with the wrong design. Then you keep hacking it up, and
the hacks just get wilder and crazier as you notice there are more
special cases.

This is now getting to the point where I'm considering just ripping
out the whole boot-time previous kernel buffer crap because you seem
to have turned an interesting idea into just a morass of problems.

Your choice: get rid of the crazy, or have me rip it out.

            Linus
Steven Rostedt Dec. 17, 2024, 7:01 p.m. UTC | #9
On Tue, 17 Dec 2024 10:42:47 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:


> Really. This is literally what I started the whole original complaint
> about. Go back to my original email, and try to understand the
> original issue. Let me quote the really relevant part of that email
> again:
> 
>   This stuff is full of crazy special cases for things that should never
>   be done in the first place.

To my defense, all of ftrace (function tracing, live kernel patching, etc)
came from trying to do crazy special cases. The TRACE_EVENT() macro being
another one ;-)

> 
> Note - and really INTERNALIZE - that "for stuff that should never be
> done in the first place".
> 
> You started with the wrong design. Then you keep hacking it up, and
> the hacks just get wilder and crazier as you notice there are more
> special cases.
> 
> This is now getting to the point where I'm considering just ripping
> out the whole boot-time previous kernel buffer crap because you seem
> to have turned an interesting idea into just a morass of problems.
> 
> Your choice: get rid of the crazy, or have me rip it out.
> 

Point taken. And for my work use case, I can keep the ring buffer as is and
do most the work in user space. I did these "hacks" for those that do not
use trace-cmd and still use the tracefs file system directly (like a lot of
the embedded folks).

But instead, I'll replace the text/data_deltas with a kaslr offset (it will
only be exported if the trace contains data from a previous kernel so not
to export the current kaslr offset).

Then, on our production systems, we'll save the meta data of the events we
enable (this can include module events as well as dynamic events) and after
a crash, we'll extract the data along with the saved data stored on disk,
and be able to recreate the entire trace.

I'll only push patch 3 without the %s pointer update.

-- Steve
Linus Torvalds Dec. 17, 2024, 7:03 p.m. UTC | #10
On Tue, 17 Dec 2024 at 10:42, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> My initial suggestion was to just fix up the boot time array.
>
> I think that's actually wrong. Just print the raw data and analyze it
> in user space.

.. I still think it's not the optimal solution, but fixing up the
event data from the previous boot (*before* printing it, and entirely
independently of vsnprintf()) would at least avoid the whole "mess
with vsnprintf and switch the format string around as you are trying
to walk the va_list in sync".

Because that was really a non-starter. Both the format string hackery
and the va_list hackery was just fundamentally bogus.

If you massage the data before printing - and independently of it -
those two issues should at least go away.

             Linus
Steven Rostedt Dec. 17, 2024, 7:07 p.m. UTC | #11
On Tue, 17 Dec 2024 11:03:28 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 17 Dec 2024 at 10:42, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > My initial suggestion was to just fix up the boot time array.
> >
> > I think that's actually wrong. Just print the raw data and analyze it
> > in user space.  
> 
> .. I still think it's not the optimal solution, but fixing up the
> event data from the previous boot (*before* printing it, and entirely
> independently of vsnprintf()) would at least avoid the whole "mess
> with vsnprintf and switch the format string around as you are trying
> to walk the va_list in sync".
> 
> Because that was really a non-starter. Both the format string hackery
> and the va_list hackery was just fundamentally bogus.
> 
> If you massage the data before printing - and independently of it -
> those two issues should at least go away.

But I can't massage the data without the deltas. That takes us back to
having to have the same kernel and only processing kernel core data and
ignoring modules.

-- Steve
Steven Rostedt Dec. 17, 2024, 7:14 p.m. UTC | #12
On Tue, 17 Dec 2024 14:07:50 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Tue, 17 Dec 2024 11:03:28 -0800
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > On Tue, 17 Dec 2024 at 10:42, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:  
> > >
> > > My initial suggestion was to just fix up the boot time array.
> > >
> > > I think that's actually wrong. Just print the raw data and analyze it
> > > in user space.    
> > 
> > .. I still think it's not the optimal solution, but fixing up the
> > event data from the previous boot (*before* printing it, and entirely
> > independently of vsnprintf()) would at least avoid the whole "mess
> > with vsnprintf and switch the format string around as you are trying
> > to walk the va_list in sync".

And that code that does the va_list and vsnprintf() tricks is going to be
removed as soon as that patch set finishes going through my full test suite,
and you are OK with the solution.

Again, that patch set is here:

  https://lore.kernel.org/linux-trace-kernel/20241217024118.587584221@goodmis.org/

-- Steve
Mathieu Desnoyers Dec. 17, 2024, 7:22 p.m. UTC | #13
On 2024-12-17 13:24, Linus Torvalds wrote:
> On Tue, 17 Dec 2024 at 10:19, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> What *woiuld* have been an acceptable model is to actually modify the
>> boot-time buffers in place, using actual real heuristics that look at
>> whether a pointer was IN THE CODE SECTION OR THE STATIC DATA section
>> of the previous boot.
>>
>> But you never did that. All this delta code has always been complete
>> and utter garbage, and complete hacks.
> 
> Actually, I think the proper model isn't even that "modify boot time
> buffers in place" thing.
> 
> The proper model was probably always to just do the "give the raw
> data, and analyze the previous boot data in user mode".

It appears that you just summarized the LTTng (out-of-tree) kernel
tracer [1] model in one short sentence.

If this can help in some way, within the LTTng model, here is how
we're solving the problem of mapping addresses to symbols:

1- We have a statedump infrastructure, which dumps internal kernel
    state. It could dump the kernel and each module base addresses
    into the trace. (we do it for userspace tracing)

2- We can hook on module load/unload to insert event about insertion
    and removal of those base addresses into the trace buffers.
    (we do it for userspace tracing)

3- We augment the traces at post-processing with DWARF and ELF parsers
    in Babeltrace [2] to augment the trace with symbolic information
    using the ELF or DWARF files as inputs in addition to the traces.
    (this already exists, and is used for userspace traces)

4- We already have the integration of the LTTng Userspace tracer
    with PMEM and DAX to recover traces after a machine crash.
    Those buffers are self-described with an ABI which allows a
    userspace tool (lttng-crash) to extract well-formed Common
    Trace Format [3] traces from the buffers after reboot. We've
    never had the incentive to port this facility to the kernel
    tracer so far though.

Thanks,

Mathieu

[1] https://lttng.org
[2] https://babeltrace.org
[3] https://diamon.org/ctf
Linus Torvalds Dec. 17, 2024, 7:38 p.m. UTC | #14
On Tue, 17 Dec 2024 at 11:01, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> But instead, I'll replace the text/data_deltas with a kaslr offset (it will
> only be exported if the trace contains data from a previous kernel so not
> to export the current kaslr offset).

Right - never export the KASRL offset for the *current* kernel, but
the same interface that exports the "previous kernel trace data" can
certainly export the KASLR for that previous case.

> Then, on our production systems, we'll save the meta data of the events we
> enable (this can include module events as well as dynamic events) and after
> a crash, we'll extract the data along with the saved data stored on disk,
> and be able to recreate the entire trace.

Yes. And if you save the module names and load addresses, you can now
hopefully sort out things like %s (and %pS) from modules too.

Although maybe they don't happen that often?

            Linus
Steven Rostedt Dec. 17, 2024, 7:44 p.m. UTC | #15
On Tue, 17 Dec 2024 11:38:00 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 17 Dec 2024 at 11:01, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > But instead, I'll replace the text/data_deltas with a kaslr offset (it will
> > only be exported if the trace contains data from a previous kernel so not
> > to export the current kaslr offset).  
> 
> Right - never export the KASRL offset for the *current* kernel, but
> the same interface that exports the "previous kernel trace data" can
> certainly export the KASLR for that previous case.

But this will be future work and not something for this merge window, as
it's more of a feature. The only fix is to add that print_field() code, and
the patch series that removes trace_check_vprintf() (which fixes a
different bug).

> 
> > Then, on our production systems, we'll save the meta data of the events we
> > enable (this can include module events as well as dynamic events) and after
> > a crash, we'll extract the data along with the saved data stored on disk,
> > and be able to recreate the entire trace.  
> 
> Yes. And if you save the module names and load addresses, you can now
> hopefully sort out things like %s (and %pS) from modules too.
> 
> Although maybe they don't happen that often?

Actually, they do appear a bit. As the kmalloc trace event records the call
sites that do allocation and many of them are in module code.

-- Steve
Linus Torvalds Dec. 17, 2024, 10:24 p.m. UTC | #16
On Tue, 17 Dec 2024 at 11:43, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> But this will be future work and not something for this merge window, as
> it's more of a feature. The only fix is to add that print_field() code, and
> the patch series that removes trace_check_vprintf() (which fixes a
> different bug).

Side note: I've also been looking at the core vsnprintf() code to see
if it could be cleanly extended to have callers give more control over
it. Some kind of callback mechanism for the pointer handling so that
there is *not* any need for crazy hackery.

That needs some basic cleanups I want to do and that I'm playing with,
but even with some of that code cleaned up it looks a bit nasty.

I really don't want to expose too much of the internals to the
outside, and vsnprintf() is actually fairly performance-critical, and
indirect calls have become so expensive that I really don't want to
introduce function pointers in there.

But I'll try to see if some more cleanups would make it at least
possible to have a separate version. That said, we already have the
disgusting "binary printf" code, used by bpf and tracing, and I'd hate
to introduce a *third* interface, particularly since the binary printf
code is already doing things wrong (it puts things into a "word
array", but instead of using a single word for char/short like the
normal C varargs code does, it packs them at actual byte/word
boundaries and adds padding etc).

So just looking at that code, I'm not hugely excited about adding yet
more copies of this and new interfaces in this area.

(Aside: are those binary buffers actually exported to user space (that
"u32 *bin_buf, size_t size" thing), or could we fix the binary printf
code to really use a whole word for char/short values? The difference
between '%hhd' and '%d' should be how it's printed out, not how the
data is accessed)

               Linus
Steven Rostedt Dec. 17, 2024, 10:53 p.m. UTC | #17
On Tue, 17 Dec 2024 14:24:08 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> (Aside: are those binary buffers actually exported to user space (that
> "u32 *bin_buf, size_t size" thing), or could we fix the binary printf
> code to really use a whole word for char/short values? The difference
> between '%hhd' and '%d' should be how it's printed out, not how the
> data is accessed)

libtraceevent is able to parse the raw trace_printk() events:

  https://git.kernel.org/pub/scm/libs/libtrace/libtraceevent.git/tree/src/event-parse.c#n5155

The format it reads is from /sys/kernel/tracing/events/ftrace/bprint/format:

name: bprint
ID: 6
format:
	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
	field:int common_pid;	offset:4;	size:4;	signed:1;

	field:unsigned long ip;	offset:8;	size:8;	signed:0;
	field:const char * fmt;	offset:16;	size:8;	signed:0;
	field:u32 buf[];	offset:24;	size:0;	signed:0;

print fmt: "%ps: %s", (void *)REC->ip, REC->fmt

In this case, the "print fmt" is ignored.

Where the buf value holds the binary storage from vbin_printf() and written
in trace_vbprintk(). Yeah, it looks like it does depend on the arguments
being word aligned.

-- Steve
Linus Torvalds Dec. 17, 2024, 11:32 p.m. UTC | #18
On Tue, 17 Dec 2024 at 14:52, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Where the buf value holds the binary storage from vbin_printf() and written
> in trace_vbprintk(). Yeah, it looks like it does depend on the arguments
> being word aligned.

So the thing is, they actually aren't word-aligned at all.

Yes, the buffer is ostensibly individual words, and the buffer size is
given in words, and 64-bit data is saved/fetched as two separate
words.

But if you look more closely, you'll see that the way the buffer is
managed is actually not as a word array at all, but using

        char *str, *end;

instead of word pointers. And the reason is because it does

        str = PTR_ALIGN(str, sizeof(type));
        ...
        str += sizeof(type);

so byte-sized data is *not* given a word, it's only given a single
char, and then if it's followed by an "int", the pointer will be
aligned at that point.

It does mean that the binary buffers are a bit denser, but since %c
and %hhd etc are VERY unusual (and often will be mixed with int-sized
data), that denser format isn't commonly an actual advantage.

And the reason I noticed this? When I was trying to clean up and
simplify the vsnprintf() code to not have so many different cases, the
*regular* number handling for char/short/int-sized cases ends up being
just one case that looks like this:

        num = get_num(va_arg(args, int), fmt.state, spec);

because char/short/int are all just "va_arg(args, int)" values.

Then the actual printout depends on that printf_spec thing (and, in my
experimental branch, that "fmt.state", but that's a local trial thing
where I split the printf_spec differently for better code generation).

So basically the core vfsprintf case doesn't need to care about
fetching char/short/int, because va_args always just formats those
kinds of arguments as int, as per the usual C implicit type expansion
rules.

But the binary printf thing then has to have three different cases,
because unlike the normal calling convention, it actually packs
char/short/int differently.  So with all those nice cleanups I tried
still look like this:

                case FORMAT_STATE_2BYTE:
                        num = get_num(get_arg(short), fmt.state, spec);
                        break;
                case FORMAT_STATE_1BYTE:
                        num = get_num(get_arg(char), fmt.state, spec);
                        break;
                default:
                        num = get_num(get_arg(int), fmt.state, spec);
                        break;

which is admittedly still a lot better than the current situation in
top-of-tree, which has separate versions for a *lot* more types.

So right now top-of-tree has 11 different enumerated cases for number printing:

        FORMAT_TYPE_LONG_LONG,
        FORMAT_TYPE_ULONG,
        FORMAT_TYPE_LONG,
        FORMAT_TYPE_UBYTE,
        FORMAT_TYPE_BYTE,
        FORMAT_TYPE_USHORT,
        FORMAT_TYPE_SHORT,
        FORMAT_TYPE_UINT,
        FORMAT_TYPE_INT,
        FORMAT_TYPE_SIZE_T,
        FORMAT_TYPE_PTRDIFF

and in my test cleanup, I've cut this down to just four cases:
FORMAT_STATE_xBYTE (x = 1, 2, 4, 8).

And the actual argument fetch for the *normal* case is actually just
two cases (because the 8-byte and the "4 bytes or less" cases are
different for va_list, but 1/2/4 bytes are just that single case).

But the binary printf argument save/fetch is not able to be cut down
to just two cases because of how it does that odd "ostensibly a word
array, but packed byte/short fields" thing.

Oh well. It's not a big deal. But I was doing this to see if I could
regularize the vsnprintf() logic and make sharing better - and then
just the binary version already causes unnecessary duplication.

If the *only* thing that accesses that word array is vbin_printf and
bstr_printf, then I could just change the packing to act like va_list
does (ie the word array would *actually* be a word array, and char and
short values would get individual words).

But at least the bpf cde seems to know about the crazy packing, and
also does that

        tmp_buf = PTR_ALIGN(tmp_buf, sizeof(u32));

in bpf_bprintf_prepare() because it knows that it's not *actually* an
array of words despite it being documented as such.

Of course, the bpf code only does the packed access thing for '%c',
and doesn't seem to know that the same is true of '%hd' and '%hhd',
presumably because nobody actually uses that.

Let's add Alexei to the participants. I think bpf too would actually
prefer that the odd char/short packing *not* be done, if only because
it clearly does the wrong thing as-is for non-%c argument (ie those
%hd/%hhd cases).

Who else accesses that odd "binary printed pseudo-word array"?

          Linus
Linus Torvalds Dec. 18, 2024, 12:02 a.m. UTC | #19
On Tue, 17 Dec 2024 at 15:32, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But if you look more closely, you'll see that the way the buffer is
> managed is actually not as a word array at all, but using
>
>         char *str, *end;
>
> instead of word pointers.

Oh, and in addition to the smaller-than-int types ('%c' and '%hd'
etc), pointers that get dereferenced also get written as a byte string
to that word array. There might be other cases too.

So it's really a fairly odd kind of "sometimes words, sometimes not"
array, with the size of the array given in words.

That binary printf is very strange.

               Linus
Alexei Starovoitov Dec. 18, 2024, 12:47 a.m. UTC | #20
On Tue, Dec 17, 2024 at 3:32 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>
> If the *only* thing that accesses that word array is vbin_printf and
> bstr_printf, then I could just change the packing to act like va_list
> does (ie the word array would *actually* be a word array, and char and
> short values would get individual words).
>
> But at least the bpf cde seems to know about the crazy packing, and
> also does that
>
>         tmp_buf = PTR_ALIGN(tmp_buf, sizeof(u32));
>
> in bpf_bprintf_prepare() because it knows that it's not *actually* an
> array of words despite it being documented as such.
>
> Of course, the bpf code only does the packed access thing for '%c',
> and doesn't seem to know that the same is true of '%hd' and '%hhd',
> presumably because nobody actually uses that.
>
> Let's add Alexei to the participants. I think bpf too would actually
> prefer that the odd char/short packing *not* be done, if only because
> it clearly does the wrong thing as-is for non-%c argument (ie those
> %hd/%hhd cases).

We reject %hd case as EINVAL and do byte copy for %c.
All that was done as part of
commit 48cac3f4a96d ("bpf: Implement formatted output helpers with bstr_printf")
that cleaned things up greatly.
The byte copy for %c wasn't deliberate to save space.
Just happen to work with bstr_printf().
We can totally switch to u32 if that's the direction for bstr_printf.
To handle %s we use bpf_trace_copy_string(tmp_buf, )
which does _nofault() underneath.
Since the tmp_buf is byte packed because of strings the %c case
just adds a byte too. Strings and %c can be made u32 aligned.

Since we're on this topic, Daniel is looking to reuse format_decode()
in bpf_bprintf_prepare() to get rid of our manual format validation.
Steven Rostedt Dec. 18, 2024, 12:48 a.m. UTC | #21
On Tue, 17 Dec 2024 16:02:34 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 17 Dec 2024 at 15:32, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > But if you look more closely, you'll see that the way the buffer is
> > managed is actually not as a word array at all, but using
> >
> >         char *str, *end;
> >
> > instead of word pointers.  
> 
> Oh, and in addition to the smaller-than-int types ('%c' and '%hd'
> etc), pointers that get dereferenced also get written as a byte string
> to that word array. There might be other cases too.
> 
> So it's really a fairly odd kind of "sometimes words, sometimes not"
> array, with the size of the array given in words.
> 
> That binary printf is very strange.

Note that at least on the tracing user space side, as trace_printk() is
never used in production systems and mostly just used for debugging, we can
be pretty liberal if I have to change libtraceevent.

I could even add an update to the format file to have the library stay
backward compatible with older kernels and can see that the format file has
been updated to know that the vbin_printf() has changed.

-- Steve
Linus Torvalds Dec. 18, 2024, 1:26 a.m. UTC | #22
On Tue, 17 Dec 2024 at 16:47, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> Since we're on this topic, Daniel is looking to reuse format_decode()
> in bpf_bprintf_prepare() to get rid of our manual format validation.

That was literally why I started looking into this - the many separate
type formats actually end up causing format_decode() (and the callers)
to have to generate multiple different cases, which then in turn
either cause a jump table, or - more commonly due to the CPU indirect
branch mitigations - a chain of conditionals that are fairly ugly.

Compressing the state table for the types from 11 down to 4 types
helps a bit, but then also dealing with the "smaller than int" things
as just 'int' (with the formatting flags that are separate) also ends
up avoiding some unnecessary and extra cases.

Because in the end, 'size_t' and 'long' are the same thing, even on
architectures like 32-bit x86 where 'size_t' really is 'unsigned int'
- simply because the only thing that matters for fetching the value is
the size, which is 32-bit.

(The whole "is it signed" and the truncation to smaller-than-int etc
is then something we have to handle anyway in by the 'printf_spec'
thing).

So I have a patch series to clean some of this up and avoid the extra
states. I'm not entirely happy with it, though, and I've been going
back and forth on some of the code, so I'm not ready to post it or
have anybody use it as a basis for some "real" cleanups.

I guess I could at least post the "turn 11 different types into 4"
part. I have other things in there, but that part seems fairly
unambiguously good.

Let me go separate that part out and maybe people can point out where
I've done something silly.

               Linus
Linus Torvalds Dec. 18, 2024, 1:39 a.m. UTC | #23
On Tue, 17 Dec 2024 at 17:26, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Let me go separate that part out and maybe people can point out where
> I've done something silly.

Ok, that part I had actually already locally separated out better than
some of my later patches in the series, so I sent it out as

  https://lore.kernel.org/all/20241218013620.1679088-1-torvalds@linux-foundation.org/

but I'm not guaranteeing it's right. Consider it a WIP thing, and only
a first step.

           Linus
Alexei Starovoitov Dec. 18, 2024, 1:53 a.m. UTC | #24
On Tue, Dec 17, 2024 at 5:39 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, 17 Dec 2024 at 17:26, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Let me go separate that part out and maybe people can point out where
> > I've done something silly.
>
> Ok, that part I had actually already locally separated out better than
> some of my later patches in the series, so I sent it out as
>
>   https://lore.kernel.org/all/20241218013620.1679088-1-torvalds@linux-foundation.org/

lgtm.
Since bstr_printf() converts 1/2/4 to unsigned long long num
with a sign according to the format specifier it's good
from the calling convention perspective.
Doesn't matter here, but anyone passing 32-bit ints around
needs to be aware of odd riscv abi promotion rules.
x86-64 and arm64 zero extend 32-bit ints while riscv does sign
extension when s32 and u32 are passed.
Quote from the spec:
"In RV64, 32-bit types, such as int, are stored in integer registers
as proper sign extensions of their 32-bit values; that is, bits 63..31
are all equal. This restriction holds even for unsigned 32-bit types."
diff mbox series

Patch

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 7e257e855dd1..3c94c59d000c 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -17,6 +17,7 @@ 
 #include <linux/uaccess.h>
 #include <linux/hardirq.h>
 #include <linux/kthread.h>	/* for self test */
+#include <linux/utsname.h>
 #include <linux/module.h>
 #include <linux/percpu.h>
 #include <linux/mutex.h>
@@ -45,10 +46,13 @@ 
 static void update_pages_handler(struct work_struct *work);
 
 #define RING_BUFFER_META_MAGIC	0xBADFEED
+#define UNAME_SZ 64
 
 struct ring_buffer_meta {
 	int		magic;
 	int		struct_size;
+	char		uname[UNAME_SZ];
+
 	unsigned long	text_addr;
 	unsigned long	data_addr;
 	unsigned long	first_buffer;
@@ -1687,6 +1691,11 @@  static bool rb_meta_valid(struct ring_buffer_meta *meta, int cpu,
 		return false;
 	}
 
+	if (strncmp(init_utsname()->release, meta->uname, UNAME_SZ - 1)) {
+		pr_info("Ring buffer boot meta[%d] mismatch of uname\n", cpu);
+		return false;
+	}
+
 	/* The subbuffer's size and number of subbuffers must match */
 	if (meta->subbuf_size != subbuf_size ||
 	    meta->nr_subbufs != nr_pages + 1) {
@@ -1920,6 +1929,7 @@  static void rb_range_meta_init(struct trace_buffer *buffer, int nr_pages)
 
 		meta->magic = RING_BUFFER_META_MAGIC;
 		meta->struct_size = sizeof(*meta);
+		strscpy(meta->uname, init_utsname()->release, UNAME_SZ);
 
 		meta->nr_subbufs = nr_pages + 1;
 		meta->subbuf_size = PAGE_SIZE;