Message ID | 20240522013845.1631305-4-andrii@kernel.org (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Fix user stack traces captured from uprobes | expand |
On Tue, May 21, 2024 at 06:38:44PM -0700, Andrii Nakryiko wrote: > When tracing user functions with uprobe functionality, it's common to > install the probe (e.g., a BPF program) at the first instruction of the > function. This is often going to be `push %rbp` instruction in function > preamble, which means that within that function frame pointer hasn't > been established yet. This leads to consistently missing an actual > caller of the traced function, because perf_callchain_user() only > records current IP (capturing traced function) and then following frame > pointer chain (which would be caller's frame, containing the address of > caller's caller). > > So when we have target_1 -> target_2 -> target_3 call chain and we are > tracing an entry to target_3, captured stack trace will report > target_1 -> target_3 call chain, which is wrong and confusing. > > This patch proposes a x86-64-specific heuristic to detect `push %rbp` > instruction being traced. If that's the case, with the assumption that > applicatoin is compiled with frame pointers, this instruction would be > a strong indicator that this is the entry to the function. In that case, > return address is still pointed to by %rsp, so we fetch it and add to > stack trace before proceeding to unwind the rest using frame > pointer-based logic. > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > --- > arch/x86/events/core.c | 20 ++++++++++++++++++++ > include/linux/uprobes.h | 2 ++ > kernel/events/uprobes.c | 2 ++ > 3 files changed, 24 insertions(+) > > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c > index 5b0dd07b1ef1..82d5570b58ff 100644 > --- a/arch/x86/events/core.c > +++ b/arch/x86/events/core.c > @@ -2884,6 +2884,26 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs > return; > > pagefault_disable(); > + > +#ifdef CONFIG_UPROBES > + /* > + * If we are called from uprobe handler, and we are indeed at the very > + * entry to user function (which is normally a `push %rbp` instruction, > + * under assumption of application being compiled with frame pointers), > + * we should read return address from *regs->sp before proceeding > + * to follow frame pointers, otherwise we'll skip immediate caller > + * as %rbp is not yet setup. > + */ > + if (current->utask) { > + struct arch_uprobe *auprobe = current->utask->auprobe; > + u64 ret_addr; > + > + if (auprobe && auprobe->insn[0] == 0x55 /* push %rbp */ && > + !__get_user(ret_addr, (const u64 __user *)regs->sp)) > + perf_callchain_store(entry, ret_addr); > + } > +#endif > + > while (entry->nr < entry->max_stack) { > if (!valid_user_frame(fp, sizeof(frame))) > break; > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h > index 0c57eec85339..7b785cd30d86 100644 > --- a/include/linux/uprobes.h > +++ b/include/linux/uprobes.h > @@ -76,6 +76,8 @@ struct uprobe_task { > struct uprobe *active_uprobe; > unsigned long xol_vaddr; > > + struct arch_uprobe *auprobe; I wonder we could use active_uprobe for this? jirka > + > struct return_instance *return_instances; > unsigned int depth; > }; > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > index 1c99380dc89d..504693845187 100644 > --- a/kernel/events/uprobes.c > +++ b/kernel/events/uprobes.c > @@ -2072,6 +2072,7 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs) > bool need_prep = false; /* prepare return uprobe, when needed */ > > down_read(&uprobe->register_rwsem); > + current->utask->auprobe = &uprobe->arch; > for (uc = uprobe->consumers; uc; uc = uc->next) { > int rc = 0; > > @@ -2086,6 +2087,7 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs) > > remove &= rc; > } > + current->utask->auprobe = NULL; > > if (need_prep && !remove) > prepare_uretprobe(uprobe, regs); /* put bp at return */ > -- > 2.43.0 > >
On Tue, 21 May 2024 18:38:44 -0700 Andrii Nakryiko <andrii@kernel.org> wrote: > When tracing user functions with uprobe functionality, it's common to > install the probe (e.g., a BPF program) at the first instruction of the > function. This is often going to be `push %rbp` instruction in function > preamble, which means that within that function frame pointer hasn't > been established yet. This leads to consistently missing an actual > caller of the traced function, because perf_callchain_user() only > records current IP (capturing traced function) and then following frame > pointer chain (which would be caller's frame, containing the address of > caller's caller). I thought this problem might be solved by sframe. > > So when we have target_1 -> target_2 -> target_3 call chain and we are > tracing an entry to target_3, captured stack trace will report > target_1 -> target_3 call chain, which is wrong and confusing. > > This patch proposes a x86-64-specific heuristic to detect `push %rbp` > instruction being traced. I like this kind of idea :) But I think this should be done in the user-space, not in the kernel because it is not always sure that the user program uses stack frames. > If that's the case, with the assumption that > applicatoin is compiled with frame pointers, this instruction would be > a strong indicator that this is the entry to the function. In that case, > return address is still pointed to by %rsp, so we fetch it and add to > stack trace before proceeding to unwind the rest using frame > pointer-based logic. Why don't we make it in the userspace BPF program? If it is done in the user space, like perf-probe, I'm OK. But I doubt to do this in kernel. That means it is not flexible. More than anything, without user-space helper to find function symbols, uprobe does not know the function entry. Then I'm curious why don't you do this in the user space. At least, this should be done in the user of uprobes, like trace_uprobe or bpf. Thank you, > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > --- > arch/x86/events/core.c | 20 ++++++++++++++++++++ > include/linux/uprobes.h | 2 ++ > kernel/events/uprobes.c | 2 ++ > 3 files changed, 24 insertions(+) > > diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c > index 5b0dd07b1ef1..82d5570b58ff 100644 > --- a/arch/x86/events/core.c > +++ b/arch/x86/events/core.c > @@ -2884,6 +2884,26 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs > return; > > pagefault_disable(); > + > +#ifdef CONFIG_UPROBES > + /* > + * If we are called from uprobe handler, and we are indeed at the very > + * entry to user function (which is normally a `push %rbp` instruction, > + * under assumption of application being compiled with frame pointers), > + * we should read return address from *regs->sp before proceeding > + * to follow frame pointers, otherwise we'll skip immediate caller > + * as %rbp is not yet setup. > + */ > + if (current->utask) { > + struct arch_uprobe *auprobe = current->utask->auprobe; > + u64 ret_addr; > + > + if (auprobe && auprobe->insn[0] == 0x55 /* push %rbp */ && > + !__get_user(ret_addr, (const u64 __user *)regs->sp)) > + perf_callchain_store(entry, ret_addr); > + } > +#endif > + > while (entry->nr < entry->max_stack) { > if (!valid_user_frame(fp, sizeof(frame))) > break; > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h > index 0c57eec85339..7b785cd30d86 100644 > --- a/include/linux/uprobes.h > +++ b/include/linux/uprobes.h > @@ -76,6 +76,8 @@ struct uprobe_task { > struct uprobe *active_uprobe; > unsigned long xol_vaddr; > > + struct arch_uprobe *auprobe; > + > struct return_instance *return_instances; > unsigned int depth; > }; > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > index 1c99380dc89d..504693845187 100644 > --- a/kernel/events/uprobes.c > +++ b/kernel/events/uprobes.c > @@ -2072,6 +2072,7 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs) > bool need_prep = false; /* prepare return uprobe, when needed */ > > down_read(&uprobe->register_rwsem); > + current->utask->auprobe = &uprobe->arch; > for (uc = uprobe->consumers; uc; uc = uc->next) { > int rc = 0; > > @@ -2086,6 +2087,7 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs) > > remove &= rc; > } > + current->utask->auprobe = NULL; > > if (need_prep && !remove) > prepare_uretprobe(uprobe, regs); /* put bp at return */ > -- > 2.43.0 >
On Tue, Jun 4, 2024 at 7:06 AM Masami Hiramatsu <mhiramat@kernel.org> wrote: > > On Tue, 21 May 2024 18:38:44 -0700 > Andrii Nakryiko <andrii@kernel.org> wrote: > > > When tracing user functions with uprobe functionality, it's common to > > install the probe (e.g., a BPF program) at the first instruction of the > > function. This is often going to be `push %rbp` instruction in function > > preamble, which means that within that function frame pointer hasn't > > been established yet. This leads to consistently missing an actual > > caller of the traced function, because perf_callchain_user() only > > records current IP (capturing traced function) and then following frame > > pointer chain (which would be caller's frame, containing the address of > > caller's caller). > > I thought this problem might be solved by sframe. Eventually, yes, when real-world applications switch to sframe and we get proper support for it in the kernel. But right now there are tons of applications relying on kernel capturing stack traces based on frame pointers, so it would be good to improve that as well. > > > > > So when we have target_1 -> target_2 -> target_3 call chain and we are > > tracing an entry to target_3, captured stack trace will report > > target_1 -> target_3 call chain, which is wrong and confusing. > > > > This patch proposes a x86-64-specific heuristic to detect `push %rbp` > > instruction being traced. > > I like this kind of idea :) But I think this should be done in > the user-space, not in the kernel because it is not always sure > that the user program uses stack frames. Existing kernel code that captures user space stack trace already assumes that code was compiled with a frame pointer (unconditionally), because that's the best kernel can do. So under that assumption this heuristic is valid and not harmful, IMO. User space can do nothing about this, because it is the kernel that captures stack trace (e.g., from BPF program), and we will lose the calling frame if we don't do it here. > > > If that's the case, with the assumption that > > applicatoin is compiled with frame pointers, this instruction would be > > a strong indicator that this is the entry to the function. In that case, > > return address is still pointed to by %rsp, so we fetch it and add to > > stack trace before proceeding to unwind the rest using frame > > pointer-based logic. > > Why don't we make it in the userspace BPF program? If it is done > in the user space, like perf-probe, I'm OK. But I doubt to do this in > kernel. That means it is not flexible. > You mean for the BPF program to capture the entire stack trace by itself, without asking the kernel for help? It's doable, but: a) it's inconvenient for all users to have to reimplement this low-level "primitive" operation, that we already promise is provided by kernel (through bpf_get_stack() API, and kernel has internal perf_callchain API for this) b) it's faster for kernel to do this, as kernel code disables page faults once and unwinds the stack, while BPF program would have to do multiple bpf_probe_read_user() calls, each individually disabling page faults. But really, there is an already existing API, which in some cases returns partially invalid stack traces (skipping function caller's frame), so this is an attempt to fix this issue. > More than anything, without user-space helper to find function > symbols, uprobe does not know the function entry. Then I'm curious > why don't you do this in the user space. You are mixing stack *capture* (in which we get memory addresses representing where a function call or currently running instruction pointer is) with stack *symbolization* (where user space needs ELF symbols and/or DWARF information to translate those addresses into something human-readable). This heuristic improves the former as performed by the kernel. Stack symbolization is completely orthogonal to all of this. > > At least, this should be done in the user of uprobes, like trace_uprobe > or bpf. > This is a really miserable user experience, if they have to implement their own stack trace capture for uprobes, but use built-in bpf_get_stack() API for any other type of program. > > Thank you, > > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > --- > > arch/x86/events/core.c | 20 ++++++++++++++++++++ > > include/linux/uprobes.h | 2 ++ > > kernel/events/uprobes.c | 2 ++ > > 3 files changed, 24 insertions(+) > > [...]
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 5b0dd07b1ef1..82d5570b58ff 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -2884,6 +2884,26 @@ perf_callchain_user(struct perf_callchain_entry_ctx *entry, struct pt_regs *regs return; pagefault_disable(); + +#ifdef CONFIG_UPROBES + /* + * If we are called from uprobe handler, and we are indeed at the very + * entry to user function (which is normally a `push %rbp` instruction, + * under assumption of application being compiled with frame pointers), + * we should read return address from *regs->sp before proceeding + * to follow frame pointers, otherwise we'll skip immediate caller + * as %rbp is not yet setup. + */ + if (current->utask) { + struct arch_uprobe *auprobe = current->utask->auprobe; + u64 ret_addr; + + if (auprobe && auprobe->insn[0] == 0x55 /* push %rbp */ && + !__get_user(ret_addr, (const u64 __user *)regs->sp)) + perf_callchain_store(entry, ret_addr); + } +#endif + while (entry->nr < entry->max_stack) { if (!valid_user_frame(fp, sizeof(frame))) break; diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index 0c57eec85339..7b785cd30d86 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -76,6 +76,8 @@ struct uprobe_task { struct uprobe *active_uprobe; unsigned long xol_vaddr; + struct arch_uprobe *auprobe; + struct return_instance *return_instances; unsigned int depth; }; diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 1c99380dc89d..504693845187 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -2072,6 +2072,7 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs) bool need_prep = false; /* prepare return uprobe, when needed */ down_read(&uprobe->register_rwsem); + current->utask->auprobe = &uprobe->arch; for (uc = uprobe->consumers; uc; uc = uc->next) { int rc = 0; @@ -2086,6 +2087,7 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs) remove &= rc; } + current->utask->auprobe = NULL; if (need_prep && !remove) prepare_uretprobe(uprobe, regs); /* put bp at return */
When tracing user functions with uprobe functionality, it's common to install the probe (e.g., a BPF program) at the first instruction of the function. This is often going to be `push %rbp` instruction in function preamble, which means that within that function frame pointer hasn't been established yet. This leads to consistently missing an actual caller of the traced function, because perf_callchain_user() only records current IP (capturing traced function) and then following frame pointer chain (which would be caller's frame, containing the address of caller's caller). So when we have target_1 -> target_2 -> target_3 call chain and we are tracing an entry to target_3, captured stack trace will report target_1 -> target_3 call chain, which is wrong and confusing. This patch proposes a x86-64-specific heuristic to detect `push %rbp` instruction being traced. If that's the case, with the assumption that applicatoin is compiled with frame pointers, this instruction would be a strong indicator that this is the entry to the function. In that case, return address is still pointed to by %rsp, so we fetch it and add to stack trace before proceeding to unwind the rest using frame pointer-based logic. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> --- arch/x86/events/core.c | 20 ++++++++++++++++++++ include/linux/uprobes.h | 2 ++ kernel/events/uprobes.c | 2 ++ 3 files changed, 24 insertions(+)