[RFC,0/3] tracing: Introduce relative stacktrace

Message-ID: <173807861687.1525539.15082309716909038251.stgit@mhiramat.roam.corp.google.com>
From: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
To: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>,
    Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
    Luis Chamberlain <mcgrof@kernel.org>,
    Petr Pavlu <petr.pavlu@suse.com>,
    Sami Tolvanen <samitolvanen@google.com>,
    Daniel Gomez <da.gomez@samsung.com>,
    linux-kernel@vger.kernel.org,
    linux-trace-kernel@vger.kernel.org,
    linux-modules@vger.kernel.org
Subject: [RFC PATCH 0/3] tracing: Introduce relative stacktrace
Date: Wed, 29 Jan 2025 00:36:56 +0900

Masami Hiramatsu (Google) Jan. 28, 2025, 3:36 p.m. UTC

Hi,

This series introduces a relative stacktrace, which records each
stacktrace entry as an offset from _stext instead of a raw address.
Users can enable this format by setting options/relative-stacktrace.

Basically, this does not change anything for users who use ftrace
through the text-formatted 'trace' interface. It changes how each
stacktrace entry address is stored, so users of 'trace_pipe_raw' need
to change how they decode the stacktrace.

Currently, the stacktrace is stored as raw kernel addresses. Thus, to
decode the binary trace data, we need to refer to kallsyms. But this
does not work on platforms that prohibit access to /proc/kallsyms for
security reasons. Since KASLR changes the kernel text address, we
cannot decode symbols in userspace without kallsyms.

On the other hand, if we record the stacktrace entries as offsets from
_stext, we can use the System.map file to decode them. This is also
good for stacktraces in the persistent ring buffer, because we no
longer need to save kallsyms before a crash.
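
As a rough illustration of the decoding step described above, the
following Python sketch resolves _stext-relative offsets against
System.map-style text. The sample map contents, the function names and
the assumption that offsets are non-negative are made up for
illustration and are not taken from the patches.

```python
import bisect

# Hypothetical System.map excerpt; real files have many more symbols.
SAMPLE_SYSTEM_MAP = """\
ffffffff81000000 T _stext
ffffffff81000040 T do_one_initcall
ffffffff81000400 T schedule
ffffffff81000900 T vfs_read
"""


def load_text_symbols(system_map_text):
    """Return (_stext address, sorted list of (address, name)) for text symbols."""
    symbols = []
    stext = None
    for line in system_map_text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue
        addr, sym_type, name = int(parts[0], 16), parts[1], parts[2]
        if name == "_stext":
            stext = addr
        if sym_type in ("T", "t"):
            symbols.append((addr, name))
    if stext is None:
        raise ValueError("_stext not found in System.map text")
    symbols.sort()
    return stext, symbols


def resolve_offset(offset, stext, symbols):
    """Map an offset from _stext to 'symbol+0xoff' using link-time addresses.

    Because both the recorded offset and System.map are relative to the
    same _stext, the KASLR slide never enters the calculation.
    """
    addr = stext + offset
    addrs = [a for a, _ in symbols]
    idx = bisect.bisect_right(addrs, addr) - 1
    if idx < 0:
        return hex(addr)  # below the first known text symbol
    sym_addr, name = symbols[idx]
    return f"{name}+{addr - sym_addr:#x}"
```

A decoder would call load_text_symbols() once per kernel build and
resolve_offset() once per stacktrace entry.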

The remaining problem is decoding addresses in modules, because each
module is loaded at a different place. To solve this, I also introduced
the 'module_text_offsets' event, which records a module's text and
init_text info as offsets from _stext when the module is loaded. Users
can store this event in (another) persistent ring buffer for decoding.
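
To make the module case concrete, here is a minimal Python sketch of
how a decoder might use such records. The record layout (name,
text_offset, text_size) is a guess for illustration only, since the
actual event fields are defined in the patch and not shown here; the
sketch also assumes module text lies above _stext so that offsets are
non-negative.

```python
import bisect
from typing import NamedTuple


class ModuleText(NamedTuple):
    """Hypothetical decoded record; field names are assumptions."""
    name: str
    text_offset: int  # start of module text, as an offset from _stext
    text_size: int    # size of the module text region in bytes


def build_index(records):
    """Sort records by start offset so lookups can use binary search."""
    ordered = sorted(records, key=lambda r: r.text_offset)
    starts = [r.text_offset for r in ordered]
    return starts, ordered


def find_module(offset, index):
    """Return (module name, offset inside module text) or None."""
    starts, ordered = index
    i = bisect.bisect_right(starts, offset) - 1
    if i < 0:
        return None
    mod = ordered[i]
    if offset < mod.text_offset + mod.text_size:
        return mod.name, offset - mod.text_offset
    return None  # falls in a gap between modules
```

An entry that find_module() cannot place would be treated as a core
kernel offset (or as unknown) by the caller.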

Thank you,

---

Masami Hiramatsu (Google) (3):
      tracing: Record stacktrace as the offset from _stext
      tracing: Introduce "rel_stack" option
      modules: tracing: Add module_text_offsets event


 include/trace/events/module.h |   40 ++++++++++++++++++++++++++++++++++++++++
 kernel/module/main.c          |    1 +
 kernel/trace/trace.c          |   11 ++++++++++-
 kernel/trace/trace.h          |    2 ++
 kernel/trace/trace_entries.h  |   22 ++++++++++++++++++++++
 kernel/trace/trace_output.c   |   35 +++++++++++++++++++++++++++++++----
 6 files changed, 106 insertions(+), 5 deletions(-)


Mathieu Desnoyers Jan. 28, 2025, 3:46 p.m. UTC | #1

On 2025-01-28 10:36, Masami Hiramatsu (Google) wrote:
> Hi,
> 
> This introduces relative stacktrace, which records stacktrace entry as
> the offset from _stext instead of raw address. User can enable this
> format by setting options/relative-stacktrace.
> 
> Basically, this does not change anything for users who are using ftrace
> with 'trace' text-formatted interface. This changes how each stacktrace
> entry address is stored, so users who is using 'trace_pipe_raw' needs
> to change how to decode the stacktrace.
> 
> Currently, the stacktrace is stored as raw kernel address. Thus, for
> decoding the binary trace data, we need to refer the kallsyms. But this
> is not useful on the platform which prohibits to access /proc/kallsyms
> for security reason. Since KASLR will change the kernel text address,
> we can not decode symbols without kallsyms in userspace.
> 
> On the other hand, if we record the stacktrace entries in the offset
> from _stext, we can use System.map file to decode it. This is also good
> for the stacktrace in the persistent ring buffer, because we don't need
> to save the kallsyms before crash anymore.
> 
> The problem is to decode the address in the modules because it will be
> loaded in the different place. To solve this issue, I also introduced
> 'module_text_offsets' event, which records module's text and init_text
> info as the offset from _stext when loading it. User can store this
> event in the (another) persistent ring buffer for decoding.

This does not handle the situation where a module is already loaded
before tracing starts. In LTTng we have a statedump facility for this,
where we can iterate on all modules at trace start and dump the relevant
information.

You may want to consider a similar approach for other tracers.

Thanks,

Mathieu

> 
> Thank you,
> 
> ---
> 
> Masami Hiramatsu (Google) (3):
>        tracing: Record stacktrace as the offset from _stext
>        tracing: Introduce "rel_stack" option
>        modules: tracing: Add module_text_offsets event
> 
> 
>   include/trace/events/module.h |   40 ++++++++++++++++++++++++++++++++++++++++
>   kernel/module/main.c          |    1 +
>   kernel/trace/trace.c          |   11 ++++++++++-
>   kernel/trace/trace.h          |    2 ++
>   kernel/trace/trace_entries.h  |   22 ++++++++++++++++++++++
>   kernel/trace/trace_output.c   |   35 +++++++++++++++++++++++++++++++----
>   6 files changed, 106 insertions(+), 5 deletions(-)
> 

Steven Rostedt Jan. 28, 2025, 4:27 p.m. UTC | #2

On Tue, 28 Jan 2025 10:46:21 -0500
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> This does not handle the situation where a module is already loaded
> before tracing starts. In LTTng we have a statedump facility for this,
> where we can iterate on all modules at trace start and dump the relevant
> information.
> 
> You may want to consider a similar approach for other tracers.

Last night Masami and I were talking about this. The idea I was thinking of
was to simply have a module load notifier that would add modules to an
array. It would only keep track of loaded modules, and when the trace hit,
if the address was outside of core text, it would search the array for the
module, and use that. When a module is removed, it would also be removed
from the array. We currently do not support tracing module removal (if the
module is traced, the buffers are cleared when the module is removed).

If it is a module address, set the MSB, and for 32 bit machines use the
next 7 bits as an index into the module array, and for 64 bit machines, use
the next 10 bits as an index. This would be exposed in the format file for
the kernel_stack_rel event, so if these numbers change, user space can cope
with it. In fact, it would need to use the format file to distinguish the
32 bit and 64 bit values.

That is, in a stack trace, core kernel addresses would simply be stored
as the address minus ".text". Module addresses would have the MSB set,
the next bits would be an index into the array that holds the module
information, and the remaining bits would be the address minus the
address where the module was loaded.
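
The bit layout described above can be sketched in Python as follows.
This only illustrates the proposed encoding (MSB flag, then 7 index
bits on 32-bit or 10 on 64-bit, then the offset); the function names
are hypothetical, and a real decoder would take the widths from the
event format file rather than hard-coding them.

```python
def layout(word_bits):
    """Return (index_bits, offset_bits) for a 32- or 64-bit entry."""
    if word_bits not in (32, 64):
        raise ValueError("word_bits must be 32 or 64")
    index_bits = 7 if word_bits == 32 else 10
    offset_bits = word_bits - 1 - index_bits  # one bit is the MSB flag
    return index_bits, offset_bits


def encode_core(addr, text_base, word_bits):
    """Core kernel entry: plain offset from the text base, MSB clear."""
    off = addr - text_base
    if not 0 <= off < (1 << (word_bits - 1)):
        raise ValueError("offset does not fit with the MSB clear")
    return off


def encode_module(addr, mod_index, mod_base, word_bits):
    """Module entry: MSB set, then the array index, then the offset."""
    index_bits, offset_bits = layout(word_bits)
    off = addr - mod_base
    if not 0 <= mod_index < (1 << index_bits):
        raise ValueError("module index out of range")
    if not 0 <= off < (1 << offset_bits):
        raise ValueError("module offset out of range")
    return (1 << (word_bits - 1)) | (mod_index << offset_bits) | off


def decode(entry, word_bits):
    """Return ('core', None, offset) or ('module', index, offset)."""
    index_bits, offset_bits = layout(word_bits)
    if entry >> (word_bits - 1) == 0:
        return ("core", None, entry)
    index = (entry >> offset_bits) & ((1 << index_bits) - 1)
    off = entry & ((1 << offset_bits) - 1)
    return ("module", index, off)
```

With these widths a 32-bit entry leaves 24 bits for the in-module
offset and a 64-bit entry leaves 53 bits.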

This way we do not need to save the information from any events. Also, for
the persistent ring buffer, this array could live in that memory, so that
it will be available on the next boot.

-- Steve

Mathieu Desnoyers Jan. 28, 2025, 4:46 p.m. UTC | #3

On 2025-01-28 11:27, Steven Rostedt wrote:
> On Tue, 28 Jan 2025 10:46:21 -0500
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>> This does not handle the situation where a module is already loaded
>> before tracing starts. In LTTng we have a statedump facility for this,
>> where we can iterate on all modules at trace start and dump the relevant
>> information.
>>
>> You may want to consider a similar approach for other tracers.
> 
> Last night Masami and I were talking about this. The idea I was thinking of
> was to simply have a module load notifier that would add modules to an
> array. It would only keep track of loaded modules, and when the trace hit,
> if the address was outside of core text, it would search the array for the
> module, and use that. When a module is removed, it would also be removed
> from the array. We currently do not support tracing module removal (if the
> module is traced, the buffers are cleared when the module is removed).

I'm trying to wrap my head around what you are trying to achieve here.

So AFAIU you are aiming to store the relative offset from kernel _text
and module base text address into the traced events rather than the
actual address.

Based on Masami's cover letter, this appears to be done to make sure
users can get to this base+offset information even if they cannot read
kallsyms.

Why make the tracing fast path more complex for a simple matter of
accessing this base address information ?

All you need to have to convert from kernel address to base + offset is:

- The kernel _text base address,
- Each loaded module text base address,
- Unloaded modules events to prune this information.

What is wrong with simply exporting this base address information in the
trace buffers rather than rely on kallsyms, and deal with the conversion
to module name / base+offset at post-processing ?
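
A post-processing pass along these lines could look like the Python
sketch below. It is only an illustration: the event tuple shapes, the
names and the presence of a size next to each base address are
assumptions (the list above mentions base addresses only), and a real
tool would read these records from the trace buffer.

```python
def to_base_offset(addr, kernel_base, kernel_size, modules):
    """Convert a raw address to (owner, offset) using known base addresses."""
    if kernel_base <= addr < kernel_base + kernel_size:
        return ("vmlinux", addr - kernel_base)
    for name, (base, size) in modules.items():
        if base <= addr < base + size:
            return (name, addr - base)
    return (None, addr)  # unknown region: keep the raw address


def symbolize_stream(kernel_base, kernel_size, events):
    """Walk events in trace order, tracking which modules are loaded.

    Hypothetical event shapes:
      ("load", name, base, size)  module text was mapped
      ("unload", name)            module was removed, prune its entry
      ("addr", address)           one raw stacktrace entry
    """
    modules = {}
    resolved = []
    for ev in events:
        kind = ev[0]
        if kind == "load":
            _, name, base, size = ev
            modules[name] = (base, size)
        elif kind == "unload":
            modules.pop(ev[1], None)
        elif kind == "addr":
            resolved.append(
                to_base_offset(ev[1], kernel_base, kernel_size, modules))
    return resolved
```

All of the work happens after tracing, so the recording path only has
to store raw addresses plus the load and unload records.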

Thanks,

Mathieu

> 
> If it is a module address, set the MSB, and for 32 bit machines use the
> next 7 bits as an index into the module array, and for 64 bit machines, use
> the next 10 bits as an index. This would be exposed in the format file for
> the kernel_stack_rel event, so if these numbers change, user space can cope
> with it. In fact, it would need to use the format file to distinguish the
> 32 bit and 64 bit values.
> 
> That is, a stack trace will contain addresses that are core kernel simply
> subtracted from ".text", and the modules address would have the MSB set,
> the next bits would be an index into that array that holds the module
> information, and the address would be the address minus the module address
> where it was loaded.
> 
> This way we do not need to save the information from any events. Also, for
> the persistent ring buffer, this array could live in that memory, so that
> it will be available on the next boot.

Steven Rostedt Jan. 28, 2025, 5:30 p.m. UTC | #4

On Tue, 28 Jan 2025 11:46:25 -0500
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> I'm trying to wrap my head around what you are trying to achieve here.
> 
> So AFAIU you are aiming to store the relative offset from kernel _text
> and module base text address into the traced events rather than the
> actual address.
> 
> Based on Masami's cover letter, this appears to be  done to make sure
> users can get to this base+offset information even if they cannot read
> kallsyms.
> 
> Why make the tracing fast path more complex for a simple matter of
> accessing this base address information ?
> 
> All you need to have to convert from kernel address to base + offset is:
> 
> - The kernel _text base address,
> - Each loaded module text base address,
> - Unloaded modules events to prune this information.
> 
> What is wrong with simply exporting this base address information in the
> trace buffers rather than rely on kallsyms, and deal with the conversion
> to module name / base+offset at post-processing ?

Hmm, we could probably get away with that too. I think we were focused on
kallsyms, where we wanted a way to not have to distinguish between current
boot info and previous boot info. But when we started pulling in the module
info, it may be possible to do a post processing.

I have said in the past that I wanted module information in the persistent
memory. By doing that this may not be needed. I'll look into it.

-- Steve

Masami Hiramatsu (Google) Jan. 29, 2025, 12:17 a.m. UTC | #5

On Tue, 28 Jan 2025 11:27:33 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Tue, 28 Jan 2025 10:46:21 -0500
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
> > This does not handle the situation where a module is already loaded
> > before tracing starts. In LTTng we have a statedump facility for this,
> > where we can iterate on all modules at trace start and dump the relevant
> > information.
> > 
> > You may want to consider a similar approach for other tracers.
> 
> Last night Masami and I were talking about this. The idea I was thinking of
> was to simply have a module load notifier that would add modules to an
> array. It would only keep track of loaded modules, and when the trace hit,
> if the address was outside of core text, it would search the array for the
> module, and use that. When a module is removed, it would also be removed
> from the array. We currently do not support tracing module removal (if the
> module is traced, the buffers are cleared when the module is removed).

Actually, we already have similar info in /proc/modules. Of course it is
not persistent.

> If it is a module address, set the MSB, and for 32 bit machines use the
> next 7 bits as an index into the module array, and for 64 bit machines, use
> the next 10 bits as an index.

I thought 7 bits would not be enough because some stacktraces are kept
after the module is unloaded. Of course we can ignore such cases (the
current "live" stacktrace does not handle them either).


> This would be exposed in the format file for
> the kernel_stack_rel event, so if these numbers change, user space can cope
> with it. In fact, it would need to use the format file to distinguish the
> 32 bit and 64 bit values.

Yeah, that can simplify userspace. But the problem with using an
address relative to the module .text is the larger overhead of finding
the module for each stacktrace entry.

Thank you,

> 
> That is, a stack trace will contain addresses that are core kernel simply
> subtracted from ".text", and the modules address would have the MSB set,
> the next bits would be an index into that array that holds the module
> information, and the address would be the address minus the module address
> where it was loaded.
> 
> This way we do not need to save the information from any events. Also, for
> the persistent ring buffer, this array could live in that memory, so that
> it will be available on the next boot.
> 
> -- Steve
>

Masami Hiramatsu (Google) Jan. 29, 2025, 12:19 a.m. UTC | #6

On Tue, 28 Jan 2025 10:46:21 -0500
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> On 2025-01-28 10:36, Masami Hiramatsu (Google) wrote:
> > Hi,
> > 
> > This introduces relative stacktrace, which records stacktrace entry as
> > the offset from _stext instead of raw address. User can enable this
> > format by setting options/relative-stacktrace.
> > 
> > Basically, this does not change anything for users who are using ftrace
> > with 'trace' text-formatted interface. This changes how each stacktrace
> > entry address is stored, so users who is using 'trace_pipe_raw' needs
> > to change how to decode the stacktrace.
> > 
> > Currently, the stacktrace is stored as raw kernel address. Thus, for
> > decoding the binary trace data, we need to refer the kallsyms. But this
> > is not useful on the platform which prohibits to access /proc/kallsyms
> > for security reason. Since KASLR will change the kernel text address,
> > we can not decode symbols without kallsyms in userspace.
> > 
> > On the other hand, if we record the stacktrace entries in the offset
> > from _stext, we can use System.map file to decode it. This is also good
> > for the stacktrace in the persistent ring buffer, because we don't need
> > to save the kallsyms before crash anymore.
> > 
> > The problem is to decode the address in the modules because it will be
> > loaded in the different place. To solve this issue, I also introduced
> > 'module_text_offsets' event, which records module's text and init_text
> > info as the offset from _stext when loading it. User can store this
> > event in the (another) persistent ring buffer for decoding.
> 
> This does not handle the situation where a module is already loaded
> before tracing starts. In LTTng we have a statedump facility for this,
> where we can iterate on all modules at trace start and dump the relevant
> information.

Thanks for the comment!
For the persistent ring buffer, I think we can enable this event at an
early boot stage, which allows us to store it. (But this overwrites the
previous data; hmm, we need an A-B buffer...)

Thank you,

Masami Hiramatsu (Google) Jan. 29, 2025, 12:23 a.m. UTC | #7

On Tue, 28 Jan 2025 10:46:21 -0500
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> On 2025-01-28 10:36, Masami Hiramatsu (Google) wrote:
> > Hi,
> > 
> > This introduces relative stacktrace, which records stacktrace entry as
> > the offset from _stext instead of raw address. User can enable this
> > format by setting options/relative-stacktrace.
> > 
> > Basically, this does not change anything for users who are using ftrace
> > with 'trace' text-formatted interface. This changes how each stacktrace
> > entry address is stored, so users who is using 'trace_pipe_raw' needs
> > to change how to decode the stacktrace.
> > 
> > Currently, the stacktrace is stored as raw kernel address. Thus, for
> > decoding the binary trace data, we need to refer the kallsyms. But this
> > is not useful on the platform which prohibits to access /proc/kallsyms
> > for security reason. Since KASLR will change the kernel text address,
> > we can not decode symbols without kallsyms in userspace.
> > 
> > On the other hand, if we record the stacktrace entries in the offset
> > from _stext, we can use System.map file to decode it. This is also good
> > for the stacktrace in the persistent ring buffer, because we don't need
> > to save the kallsyms before crash anymore.
> > 
> > The problem is to decode the address in the modules because it will be
> > loaded in the different place. To solve this issue, I also introduced
> > 'module_text_offsets' event, which records module's text and init_text
> > info as the offset from _stext when loading it. User can store this
> > event in the (another) persistent ring buffer for decoding.
> 
> This does not handle the situation where a module is already loaded
> before tracing starts. In LTTng we have a statedump facility for this,
> where we can iterate on all modules at trace start and dump the relevant
> information.

BTW, if we only cover crashes caused by a watchdog or an oops, we can
dump all the loaded module info in the panic code.

Thank you,

Masami Hiramatsu (Google) Jan. 29, 2025, 12:58 a.m. UTC | #8

On Tue, 28 Jan 2025 11:46:25 -0500
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> On 2025-01-28 11:27, Steven Rostedt wrote:
> > On Tue, 28 Jan 2025 10:46:21 -0500
> > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> > 
> >> This does not handle the situation where a module is already loaded
> >> before tracing starts. In LTTng we have a statedump facility for this,
> >> where we can iterate on all modules at trace start and dump the relevant
> >> information.
> >>
> >> You may want to consider a similar approach for other tracers.
> > 
> > Last night Masami and I were talking about this. The idea I was thinking of
> > was to simply have a module load notifier that would add modules to an
> > array. It would only keep track of loaded modules, and when the trace hit,
> > if the address was outside of core text, it would search the array for the
> > module, and use that. When a module is removed, it would also be removed
> > from the array. We currently do not support tracing module removal (if the
> > module is traced, the buffers are cleared when the module is removed).
> 
> I'm trying to wrap my head around what you are trying to achieve here.
> 
> So AFAIU you are aiming to store the relative offset from kernel _text
> and module base text address into the traced events rather than the
> actual address.
> 
> Based on Masami's cover letter, this appears to be  done to make sure
> users can get to this base+offset information even if they cannot read
> kallsyms.
> 
> Why make the tracing fast path more complex for a simple matter of
> accessing this base address information ?
> 
> All you need to have to convert from kernel address to base + offset is:
> 
> - The kernel _text base address,
> - Each loaded module text base address,
> - Unloaded modules events to prune this information.
> 
> What is wrong with simply exporting this base address information in the
> trace buffers rather than rely on kallsyms, and deal with the conversion
> to module name / base+offset at post-processing ?

Hmm, that also works if we only consider the kallsyms access. But that
means exporting KASLR information in the trace buffer. We need to check
whether that is OK.

My other concern is how to handle this stacktrace on a live system. The
stacktrace has to be handled in both crash and live traces, and in both
cases we need to avoid leaking the KASLR offset.

Hmm, to avoid the security concern, as Steve said, we may need to save
the module-relative address, which may introduce a bit more overhead,
but it should be safer.

Anyway, this v1 may leak the KASLR offset (or make it easier to
estimate). I think we have two options: (A) as Mathieu pointed out,
expose the offset information via the trace buffer; (B) as Steve pointed
out, store fully relative offsets in the stacktrace.

For crash analysis, if we expose the offset information only when the
machine panics, (A) is safe because no one will continue working on it.
But this may not work on a live system (if we cannot access kallsyms).

(B) is always OK, but saving the stacktrace costs more overhead (how
much more, we need to measure).

Thank you,

Steven Rostedt Jan. 29, 2025, 1:09 a.m. UTC | #9

On Wed, 29 Jan 2025 09:58:19 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:

> Hmm, that also works if we only consider the kallsyms access. But that
> means to export KASLR information in the trace buffer. We need to check
> it is OK.

If they say we can't have KASLR information in the ring buffer then
that is pretty much a brick wall, and we are done with this. The best
we can do is to prevent reading the current trace buffer. But honestly,
we want that too. Heck, we already get kernel stack traces from
perfetto, right? That has KASLR information, doesn't it?

> 
> My another concern is how to handle this stacktrace on live system. The
> stacktrace has to be handled in both crash and live trace, but in both case
> we need to consider not leaking KASLR offset.

I don't think we do.

> 
> Hmm, for avoiding the security concern, as Steve said, we may need to save
> the module relative address, which may introduce a bit more overhead, but
> it should be safer.

Actually, if we save the addresses of where the modules are in the
persistent ring buffer, and expose the addresses only if they are from
the previous boot (if it's the current boot, it just says "current"),
then we can decipher the modules from the previous boot.

> 
> Anyway, this v1 may be able to leak the KASLR offset (or estimate it easier).
> I think we have 2 options; (A) as Mathieu pointed, expose the offset
> information via trace buffer. (B) as Steve pointed, fully relative offset
> in stacktrace.

It should be fine to read the full offsets. Again, perf already does this.

-- Steve

Masami Hiramatsu (Google) Jan. 29, 2025, 7:25 a.m. UTC | #10

On Tue, 28 Jan 2025 20:09:38 -0500
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Wed, 29 Jan 2025 09:58:19 +0900
> Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:
> 
> > Hmm, that also works if we only consider the kallsyms access. But that
> > means to export KASLR information in the trace buffer. We need to check
> > it is OK.
> 
> If they say we can't have KASLR information in the ring buffer then
> that is pretty much a brick wall, and we are done with this. The best
> we can do is to prevent reading the current trace buffer. But honestly,
> we want that too. Heck, already get kernel stack traces from perfetto
> right? That has KASLR information doesn't it?

I read about the perfetto callstack feature, but it seems to support
user space callstacks.

https://perfetto.dev/docs/quickstart/callstack-sampling

> 
> > 
> > My another concern is how to handle this stacktrace on live system. The
> > stacktrace has to be handled in both crash and live trace, but in both case
> > we need to consider not leaking KASLR offset.
> 
> I don't think we do.

I meant that my [PATCH 3/3] can do it indirectly (not directly).
So I think your idea (storing the offset relative to the module) is
better.

> 
> > 
> > Hmm, for avoiding the security concern, as Steve said, we may need to save
> > the module relative address, which may introduce a bit more overhead, but
> > it should be safer.
> 
> Actually, if we save the addresses of where the modules are in the
> persistent ring buffer, and expose the addresses only if they are from
> the previous boot (if it's the current boot, it just says "current"),
> then we can decipher the modules from the previous boot.

OK, but when would we save it? Is it OK to do it in panic()?

> 
> > 
> > Anyway, this v1 may be able to leak the KASLR offset (or estimate it easier).
> > I think we have 2 options; (A) as Mathieu pointed, expose the offset
> > information via trace buffer. (B) as Steve pointed, fully relative offset
> > in stacktrace.
> 
> It should be fine to read the full offsets. Again, perf already does this.

Indeed. Hmm, I need to know how perf solves this limitation.

Thank you,

> 
> -- Steve

Steven Rostedt Jan. 29, 2025, 2:42 p.m. UTC | #11

On Wed, 29 Jan 2025 16:25:38 +0900
Masami Hiramatsu (Google) <mhiramat@kernel.org> wrote:


> > Actually, if we save the addresses of where the modules are in the
> > persistent ring buffer, and expose the addresses only if they are from
> > the previous boot (if it's the current boot, it just says "current"),
> > then we can decipher the modules from the previous boot.  
> 
> OK, but when would we save it? it is OK to do it in panic()?

It would be saved in the persistent memory region, and added when a module
is loaded. That is, it will already be recorded when a panic() occurs.

-- Steve
