diff mbox series

[-tip,v8,11/13] x86/unwind: Recover kretprobe trampoline entry

Message ID 162400002631.506599.2413605639666466945.stgit@devnote2 (mailing list archive)
State Not Applicable
Delegated to: Netdev Maintainers
Headers show
Series kprobes: Fix stacktrace with kretprobes on x86 | expand

Checks

Context Check Description
netdev/tree_selection success Not a local patch

Commit Message

Masami Hiramatsu (Google) June 18, 2021, 7:07 a.m. UTC
Since the kretprobe replaces the function return address with
the kretprobe_trampoline on the stack, x86 unwinders can not
continue the stack unwinding at that point, or record
kretprobe_trampoline instead of correct return address.

To fix this issue, find the correct return address from task's
kretprobe_instances as like as function-graph tracer does.

With this fix, the unwinder can correctly unwind the stack
from kretprobe event on x86, as below.

           <...>-135     [003] ...1     6.722338: r_full_proxy_read_0: (vfs_read+0xab/0x1a0 <- full_proxy_read)
           <...>-135     [003] ...1     6.722377: <stack trace>
 => kretprobe_trace_func+0x209/0x2f0
 => kretprobe_dispatcher+0x4a/0x70
 => __kretprobe_trampoline_handler+0xca/0x150
 => trampoline_handler+0x44/0x70
 => kretprobe_trampoline+0x2a/0x50
 => vfs_read+0xab/0x1a0
 => ksys_read+0x5f/0xe0
 => do_syscall_64+0x33/0x40
 => entry_SYSCALL_64_after_hwframe+0x44/0xae


Reported-by: Daniel Xu <dxu@dxuuu.xyz>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Suggested-by: Josh Poimboeuf <jpoimboe@redhat.com>
Tested-by: Andrii Nakryik <andrii@kernel.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
  Changes in v7:
   - Remove superfluous #include <linux/kprobes.h>.
  Changes in v5:
   - Fix the case of interrupt happens on kretprobe_trampoline+0.
  Changes in v3:
   - Split out the kretprobe side patch
   - Fix build error when CONFIG_KRETPROBES=n.
  Changes in v2:
   - Remove kretprobe wrapper functions from unwind_orc.c
   - Do not fixup state->ip when unwinding with regs because
     kretprobe fixup instruction pointer before calling handler.
---
 arch/x86/include/asm/unwind.h  |   23 +++++++++++++++++++++++
 arch/x86/kernel/unwind_frame.c |    3 +--
 arch/x86/kernel/unwind_guess.c |    3 +--
 arch/x86/kernel/unwind_orc.c   |   18 ++++++++++++++----
 4 files changed, 39 insertions(+), 8 deletions(-)

Comments

Peter Zijlstra July 5, 2021, 11:36 a.m. UTC | #1
On Fri, Jun 18, 2021 at 04:07:06PM +0900, Masami Hiramatsu wrote:
> @@ -549,7 +548,15 @@ bool unwind_next_frame(struct unwind_state *state)
>  					 (void *)orig_ip);
>  			goto err;
>  		}
> -
> +		/*
> +		 * There is a small chance to interrupt at the entry of
> +		 * kretprobe_trampoline where the ORC info doesn't exist.
> +		 * That point is right after the RET to kretprobe_trampoline
> +		 * which was modified return address. So the @addr_p must
> +		 * be right before the regs->sp.
> +		 */
> +		state->ip = unwind_recover_kretprobe(state, state->ip,
> +				(unsigned long *)(state->sp - sizeof(long)));
>  		state->regs = (struct pt_regs *)sp;
>  		state->prev_regs = NULL;
>  		state->full_regs = true;
> @@ -562,6 +569,9 @@ bool unwind_next_frame(struct unwind_state *state)
>  					 (void *)orig_ip);
>  			goto err;
>  		}
> +		/* See UNWIND_HINT_TYPE_REGS case comment. */
> +		state->ip = unwind_recover_kretprobe(state, state->ip,
> +				(unsigned long *)(state->sp - sizeof(long)));
>  
>  		if (state->full_regs)
>  			state->prev_regs = state->regs;

Why doesn't the ftrace case have this? That is, why aren't both return
trampolines having the same general shape?
Masami Hiramatsu (Google) July 5, 2021, 3:42 p.m. UTC | #2
On Mon, 5 Jul 2021 13:36:14 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Jun 18, 2021 at 04:07:06PM +0900, Masami Hiramatsu wrote:
> > @@ -549,7 +548,15 @@ bool unwind_next_frame(struct unwind_state *state)
> >  					 (void *)orig_ip);
> >  			goto err;
> >  		}
> > -
> > +		/*
> > +		 * There is a small chance to interrupt at the entry of
> > +		 * kretprobe_trampoline where the ORC info doesn't exist.
> > +		 * That point is right after the RET to kretprobe_trampoline
> > +		 * which was modified return address. So the @addr_p must
> > +		 * be right before the regs->sp.
> > +		 */
> > +		state->ip = unwind_recover_kretprobe(state, state->ip,
> > +				(unsigned long *)(state->sp - sizeof(long)));
> >  		state->regs = (struct pt_regs *)sp;
> >  		state->prev_regs = NULL;
> >  		state->full_regs = true;
> > @@ -562,6 +569,9 @@ bool unwind_next_frame(struct unwind_state *state)
> >  					 (void *)orig_ip);
> >  			goto err;
> >  		}
> > +		/* See UNWIND_HINT_TYPE_REGS case comment. */
> > +		state->ip = unwind_recover_kretprobe(state, state->ip,
> > +				(unsigned long *)(state->sp - sizeof(long)));
> >  
> >  		if (state->full_regs)
> >  			state->prev_regs = state->regs;
> 
> Why doesn't the ftrace case have this? That is, why aren't both return
> trampolines having the same general shape?

Ah, this strongly depends what the trampoline code does.
For the kretprobe case, the PUSHQ at the entry of the kretprobe_trampoline()
does not covered by UNWIND_HINT_FUNC. Thus it needs to find 'correct_ret_addr'
by the frame pointer (which is next to the sp).

        "kretprobe_trampoline:\n"
#ifdef CONFIG_X86_64
        /* Push fake return address to tell the unwinder it's a kretprobe */
        "       pushq $kretprobe_trampoline\n"
        UNWIND_HINT_FUNC

But I'm not so sure how ftrace treat it. It seems that the return_to_handler()
doesn't care such case. (anyway, return_to_handler() does not return but jump
to the original call-site, in that case, the information will be lost.)

Thank you,
Peter Zijlstra July 6, 2021, 7:55 a.m. UTC | #3
On Tue, Jul 06, 2021 at 12:42:57AM +0900, Masami Hiramatsu wrote:
> On Mon, 5 Jul 2021 13:36:14 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Fri, Jun 18, 2021 at 04:07:06PM +0900, Masami Hiramatsu wrote:
> > > @@ -549,7 +548,15 @@ bool unwind_next_frame(struct unwind_state *state)
> > >  					 (void *)orig_ip);
> > >  			goto err;
> > >  		}
> > > -
> > > +		/*
> > > +		 * There is a small chance to interrupt at the entry of
> > > +		 * kretprobe_trampoline where the ORC info doesn't exist.
> > > +		 * That point is right after the RET to kretprobe_trampoline
> > > +		 * which was modified return address. So the @addr_p must
> > > +		 * be right before the regs->sp.
> > > +		 */
> > > +		state->ip = unwind_recover_kretprobe(state, state->ip,
> > > +				(unsigned long *)(state->sp - sizeof(long)));
> > >  		state->regs = (struct pt_regs *)sp;
> > >  		state->prev_regs = NULL;
> > >  		state->full_regs = true;
> > > @@ -562,6 +569,9 @@ bool unwind_next_frame(struct unwind_state *state)
> > >  					 (void *)orig_ip);
> > >  			goto err;
> > >  		}
> > > +		/* See UNWIND_HINT_TYPE_REGS case comment. */
> > > +		state->ip = unwind_recover_kretprobe(state, state->ip,
> > > +				(unsigned long *)(state->sp - sizeof(long)));
> > >  
> > >  		if (state->full_regs)
> > >  			state->prev_regs = state->regs;
> > 
> > Why doesn't the ftrace case have this? That is, why aren't both return
> > trampolines having the same general shape?
> 
> Ah, this strongly depends what the trampoline code does.
> For the kretprobe case, the PUSHQ at the entry of the kretprobe_trampoline()
> does not covered by UNWIND_HINT_FUNC. Thus it needs to find 'correct_ret_addr'
> by the frame pointer (which is next to the sp).
> 
>         "kretprobe_trampoline:\n"
> #ifdef CONFIG_X86_64
>         /* Push fake return address to tell the unwinder it's a kretprobe */
>         "       pushq $kretprobe_trampoline\n"
>         UNWIND_HINT_FUNC
> 
> But I'm not so sure how ftrace treat it. It seems that the return_to_handler()
> doesn't care such case. (anyway, return_to_handler() does not return but jump
> to the original call-site, in that case, the information will be lost.)

I find it bothersome (OCD, sorry :-) that both return trampolines behave
differently. Doubly so because I know people (Steve in particular) have
been talking about unifying them.

Steve, can you clarify the ftrace side here? Afaict return_to_handler()
is similarly affected.
Steven Rostedt July 6, 2021, 3:11 p.m. UTC | #4
On Tue, 6 Jul 2021 09:55:03 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> > But I'm not so sure how ftrace treat it. It seems that the return_to_handler()
> > doesn't care such case. (anyway, return_to_handler() does not return but jump
> > to the original call-site, in that case, the information will be lost.)  
> 
> I find it bothersome (OCD, sorry :-) that both return trampolines behave
> differently. Doubly so because I know people (Steve in particular) have
> been talking about unifying them.

They were developed separately, and designed differently with different
goals in mind. Yes, I want to unify them, but trying to get the
different goals together, compounded by the fact that almost every arch
also implemented them differently (in which case, we need to find a way
to do it one arch at a time), makes the process extremely frustrating.

> 
> Steve, can you clarify the ftrace side here? Afaict return_to_handler()
> is similarly affected.

I'm not exactly sure what the issue is. As Masami stated, kretprobe
uses a ret to return to the calling function, but ftrace uses a jmp.

kretprobe return tracing is more complex than the function graph return
tracing is (which is one of the issues I need to overcome to unify
them), and when the function graph return trampoline was created, it
did things as simple as possible (and before ORC existed).

Is this something to worry about now, or should we look to fix his in
the unifying process?

-- Steve
Peter Zijlstra July 7, 2021, 8:20 a.m. UTC | #5
On Tue, Jul 06, 2021 at 11:11:36AM -0400, Steven Rostedt wrote:
> On Tue, 6 Jul 2021 09:55:03 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > > But I'm not so sure how ftrace treat it. It seems that the return_to_handler()
> > > doesn't care such case. (anyway, return_to_handler() does not return but jump
> > > to the original call-site, in that case, the information will be lost.)  
> > 
> > I find it bothersome (OCD, sorry :-) that both return trampolines behave
> > differently. Doubly so because I know people (Steve in particular) have
> > been talking about unifying them.
> 
> They were developed separately, and designed differently with different
> goals in mind. Yes, I want to unify them, but trying to get the
> different goals together, compounded by the fact that almost every arch
> also implemented them differently (in which case, we need to find a way
> to do it one arch at a time), makes the process extremely frustrating.

Yeah.. that's going to be somewhat painful.

> > Steve, can you clarify the ftrace side here? Afaict return_to_handler()
> > is similarly affected.
> 
> I'm not exactly sure what the issue is. As Masami stated, kretprobe
> uses a ret to return to the calling function, but ftrace uses a jmp.

I'll have to re-read the ftrace bits, but from the top of my head you
cannot do an indirect jump and preserve all registers at the same time,
so a return stub must use jump from stack aka. ret.

> kretprobe return tracing is more complex than the function graph return
> tracing is (which is one of the issues I need to overcome to unify
> them),

I'm not sure it is. IIRC the biggest pain point with kretprobe is that
'silly' property that the kretprobe_instance are not the same between
kretprobes. Luckily, that's not actually used anywhere, so we can simply
rip that out.

That should also help Matt make the whole freelist thing faster, because
now the kretprobe instances are global.

> and when the function graph return trampoline was created, it
> did things as simple as possible (and before ORC existed).
> 
> Is this something to worry about now, or should we look to fix his in
> the unifying process?

There seems to be a lot of kretprobe activity now; so I figured we ought
to at least consider where we want to go so we don't make it harder
still.
Peter Zijlstra July 7, 2021, 8:36 a.m. UTC | #6
On Wed, Jul 07, 2021 at 10:20:41AM +0200, Peter Zijlstra wrote:

> > > Steve, can you clarify the ftrace side here? Afaict return_to_handler()
> > > is similarly affected.
> > 
> > I'm not exactly sure what the issue is. As Masami stated, kretprobe
> > uses a ret to return to the calling function, but ftrace uses a jmp.
> 
> I'll have to re-read the ftrace bits, but from the top of my head you
> cannot do an indirect jump and preserve all registers at the same time,
> so a return stub must use jump from stack aka. ret.

Hmm... there's callee clobbered regs ofcourse, which don't need to be
preserved. And that's exactly what ftrace seems to be doing, and I don't
think there's any reason why kretprobe cannot do the same. Lemme try.
Masami Hiramatsu (Google) July 7, 2021, 10:15 a.m. UTC | #7
On Wed, 7 Jul 2021 10:20:41 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> > > Steve, can you clarify the ftrace side here? Afaict return_to_handler()
> > > is similarly affected.
> > 
> > I'm not exactly sure what the issue is. As Masami stated, kretprobe
> > uses a ret to return to the calling function, but ftrace uses a jmp.
> 
> I'll have to re-read the ftrace bits, but from the top of my head you
> cannot do an indirect jump and preserve all registers at the same time,
> so a return stub must use jump from stack aka. ret.
> 
> > kretprobe return tracing is more complex than the function graph return
> > tracing is (which is one of the issues I need to overcome to unify
> > them),
> 
> I'm not sure it is. IIRC the biggest pain point with kretprobe is that
> 'silly' property that the kretprobe_instance are not the same between
> kretprobes. Luckily, that's not actually used anywhere, so we can simply
> rip that out.

I actually don't want to keep this feature because no one use it.
(only systemtap needs it?)

Anyway, if we keep the idea-level compatibility (not code level),
what we need is 'void *data' in the struct kretprobe_instance.
User who needs it can allocate their own instance data for their
kretprobes when initialising it and sets in their entry handler.

Then we can have a simple kretprobe_instance.

Thank you,
Peter Zijlstra July 7, 2021, 10:20 a.m. UTC | #8
On Wed, Jul 07, 2021 at 07:15:10PM +0900, Masami Hiramatsu wrote:

> I actually don't want to keep this feature because no one use it.
> (only systemtap needs it?)

Yeah, you mentioned systemtap, but since that's out-of-tree I don't
care. Their problem.

> Anyway, if we keep the idea-level compatibility (not code level),
> what we need is 'void *data' in the struct kretprobe_instance.
> User who needs it can allocate their own instance data for their
> kretprobes when initialising it and sets in their entry handler.
> 
> Then we can have a simple kretprobe_instance.

When would you do the alloc? When installing the retprobe, but that
might be inside the allocator, which means you can't call the allocator
etc.. :-)

If we look at struct ftrace_ret_stack, it has a few fixed function
fields. The calltime one is all that is needed for the kretprobe
example code.
Masami Hiramatsu (Google) July 7, 2021, 10:45 a.m. UTC | #9
On Wed, 7 Jul 2021 12:20:57 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Wed, Jul 07, 2021 at 07:15:10PM +0900, Masami Hiramatsu wrote:
> 
> > I actually don't want to keep this feature because no one use it.
> > (only systemtap needs it?)
> 
> Yeah, you mentioned systemtap, but since that's out-of-tree I don't
> care. Their problem.
> 
> > Anyway, if we keep the idea-level compatibility (not code level),
> > what we need is 'void *data' in the struct kretprobe_instance.
> > User who needs it can allocate their own instance data for their
> > kretprobes when initialising it and sets in their entry handler.
> > 
> > Then we can have a simple kretprobe_instance.
> 
> When would you do the alloc? When installing the retprobe, but that
> might be inside the allocator, which means you can't call the allocator
> etc.. :-)

Yes, so the user may need to allocate a pool right before register_kretprobe().
(whether per-kretprobe or per-task or global pool, that is user's choice.)

> 
> If we look at struct ftrace_ret_stack, it has a few fixed function
> fields. The calltime one is all that is needed for the kretprobe
> example code.

kretprobe consumes 3 fields, a pointer to 'struct kretprobe' (which
stores callee function address in 'kretprobe::kp.addr'), a return
address and a frame pointer (*).

* note that this frame pointer might be used for fixing up the
stack trace, but the fixup method depends on the architecture.

Thank you,
Masami Hiramatsu (Google) July 7, 2021, 1:29 p.m. UTC | #10
On Wed, 7 Jul 2021 19:45:30 +0900
Masami Hiramatsu <mhiramat@kernel.org> wrote:

> On Wed, 7 Jul 2021 12:20:57 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Wed, Jul 07, 2021 at 07:15:10PM +0900, Masami Hiramatsu wrote:
> > 
> > > I actually don't want to keep this feature because no one use it.
> > > (only systemtap needs it?)
> > 
> > Yeah, you mentioned systemtap, but since that's out-of-tree I don't
> > care. Their problem.

Yeah, maybe it is not hard to update.

> > 
> > > Anyway, if we keep the idea-level compatibility (not code level),
> > > what we need is 'void *data' in the struct kretprobe_instance.
> > > User who needs it can allocate their own instance data for their
> > > kretprobes when initialising it and sets in their entry handler.
> > > 
> > > Then we can have a simple kretprobe_instance.
> > 
> > When would you do the alloc? When installing the retprobe, but that
> > might be inside the allocator, which means you can't call the allocator
> > etc.. :-)
> 
> Yes, so the user may need to allocate a pool right before register_kretprobe().
> (whether per-kretprobe or per-task or global pool, that is user's choice.)
> 
> > 
> > If we look at struct ftrace_ret_stack, it has a few fixed function
> > fields. The calltime one is all that is needed for the kretprobe
> > example code.
> 
> kretprobe consumes 3 fields, a pointer to 'struct kretprobe' (which
> stores callee function address in 'kretprobe::kp.addr'), a return
> address and a frame pointer (*).

Oops, I forgot to add "void *data" for storing user data.

Thank you,
wuqiang.matt July 7, 2021, 2:42 p.m. UTC | #11
On 2021/7/7 PM9:29, Masami Hiramatsu wrote:
> On Wed, 7 Jul 2021 19:45:30 +0900
> Masami Hiramatsu <mhiramat@kernel.org> wrote:
> 
>> On Wed, 7 Jul 2021 12:20:57 +0200
>> Peter Zijlstra <peterz@infradead.org> wrote:
>>
>>> On Wed, Jul 07, 2021 at 07:15:10PM +0900, Masami Hiramatsu wrote:
>>>
>>>> I actually don't want to keep this feature because no one use it.
>>>> (only systemtap needs it?)
>>>
>>> Yeah, you mentioned systemtap, but since that's out-of-tree I don't
>>> care. Their problem.
> 
> Yeah, maybe it is not hard to update.
> 
>>>
>>>> Anyway, if we keep the idea-level compatibility (not code level),
>>>> what we need is 'void *data' in the struct kretprobe_instance.
>>>> User who needs it can allocate their own instance data for their
>>>> kretprobes when initialising it and sets in their entry handler.
>>>>
>>>> Then we can have a simple kretprobe_instance.
>>>
>>> When would you do the alloc? When installing the retprobe, but that
>>> might be inside the allocator, which means you can't call the allocator
>>> etc.. :-)
>>
>> Yes, so the user may need to allocate a pool right before register_kretprobe().
>> (whether per-kretprobe or per-task or global pool, that is user's choice.)
>>
>>>
>>> If we look at struct ftrace_ret_stack, it has a few fixed function
>>> fields. The calltime one is all that is needed for the kretprobe
>>> example code.
>>
>> kretprobe consumes 3 fields, a pointer to 'struct kretprobe' (which
>> stores callee function address in 'kretprobe::kp.addr'), a return
>> address and a frame pointer (*).
>  > Oops, I forgot to add "void *data" for storing user data.
> 

Should use "struct kretprobe_holder *rph", since "struct kretprobe" belongs
to 3rd-party module (which might be unloaded any time).

User's own pool might not work if the module can be unloaded. Better manage
the pool in kretprobe_holder, which needs no changes from user side.
Masami Hiramatsu (Google) July 11, 2021, 2:09 p.m. UTC | #12
On Wed, 7 Jul 2021 22:42:47 +0800
Matt Wu <wuqiang.matt@bytedance.com> wrote:

> On 2021/7/7 PM9:29, Masami Hiramatsu wrote:
> > On Wed, 7 Jul 2021 19:45:30 +0900
> > Masami Hiramatsu <mhiramat@kernel.org> wrote:
> > 
> >> On Wed, 7 Jul 2021 12:20:57 +0200
> >> Peter Zijlstra <peterz@infradead.org> wrote:
> >>
> >>> On Wed, Jul 07, 2021 at 07:15:10PM +0900, Masami Hiramatsu wrote:
> >>>
> >>>> I actually don't want to keep this feature because no one use it.
> >>>> (only systemtap needs it?)
> >>>
> >>> Yeah, you mentioned systemtap, but since that's out-of-tree I don't
> >>> care. Their problem.
> > 
> > Yeah, maybe it is not hard to update.
> > 
> >>>
> >>>> Anyway, if we keep the idea-level compatibility (not code level),
> >>>> what we need is 'void *data' in the struct kretprobe_instance.
> >>>> User who needs it can allocate their own instance data for their
> >>>> kretprobes when initialising it and sets in their entry handler.
> >>>>
> >>>> Then we can have a simple kretprobe_instance.
> >>>
> >>> When would you do the alloc? When installing the retprobe, but that
> >>> might be inside the allocator, which means you can't call the allocator
> >>> etc.. :-)
> >>
> >> Yes, so the user may need to allocate a pool right before register_kretprobe().
> >> (whether per-kretprobe or per-task or global pool, that is user's choice.)
> >>
> >>>
> >>> If we look at struct ftrace_ret_stack, it has a few fixed function
> >>> fields. The calltime one is all that is needed for the kretprobe
> >>> example code.
> >>
> >> kretprobe consumes 3 fields, a pointer to 'struct kretprobe' (which
> >> stores callee function address in 'kretprobe::kp.addr'), a return
> >> address and a frame pointer (*).
> >  > Oops, I forgot to add "void *data" for storing user data.
> > 
> 
> Should use "struct kretprobe_holder *rph", since "struct kretprobe" belongs
> to 3rd-party module (which might be unloaded any time).

Good catch. Yes, instead of 'struct kretprobe', we need to use the holder.

> User's own pool might not work if the module can be unloaded. Better manage
> the pool in kretprobe_holder, which needs no changes from user side.

No, since the 'data' will be only refered from user handler. If the kretprobe
is released, then the kretprobe_holder will clear the refernce to the 'struct
kretprobe'. Then, the user handler is never called. No one access the 'data'.

Thank you,
wuqiang.matt July 11, 2021, 3:28 p.m. UTC | #13
On 2021/7/11 PM10:09, Masami Hiramatsu wrote:
> On Wed, 7 Jul 2021 22:42:47 +0800
> Matt Wu <wuqiang.matt@bytedance.com> wrote:
> 
>> On 2021/7/7 PM9:29, Masami Hiramatsu wrote:
>>> On Wed, 7 Jul 2021 19:45:30 +0900
>>> Masami Hiramatsu <mhiramat@kernel.org> wrote:
>>>
>>>> On Wed, 7 Jul 2021 12:20:57 +0200
>>>> Peter Zijlstra <peterz@infradead.org> wrote:
>>>>
>>>>> On Wed, Jul 07, 2021 at 07:15:10PM +0900, Masami Hiramatsu wrote:
>>>>>
>>>>>> I actually don't want to keep this feature because no one use it.
>>>>>> (only systemtap needs it?)
>>>>>
>>>>> Yeah, you mentioned systemtap, but since that's out-of-tree I don't
>>>>> care. Their problem.
>>>
>>> Yeah, maybe it is not hard to update.
>>>
>>>>>
>>>>>> Anyway, if we keep the idea-level compatibility (not code level),
>>>>>> what we need is 'void *data' in the struct kretprobe_instance.
>>>>>> User who needs it can allocate their own instance data for their
>>>>>> kretprobes when initialising it and sets in their entry handler.
>>>>>>
>>>>>> Then we can have a simple kretprobe_instance.
>>>>>
>>>>> When would you do the alloc? When installing the retprobe, but that
>>>>> might be inside the allocator, which means you can't call the allocator
>>>>> etc.. :-)
>>>>
>>>> Yes, so the user may need to allocate a pool right before register_kretprobe().
>>>> (whether per-kretprobe or per-task or global pool, that is user's choice.)
>>>>
>>>>>
>>>>> If we look at struct ftrace_ret_stack, it has a few fixed function
>>>>> fields. The calltime one is all that is needed for the kretprobe
>>>>> example code.
>>>>
>>>> kretprobe consumes 3 fields, a pointer to 'struct kretprobe' (which
>>>> stores callee function address in 'kretprobe::kp.addr'), a return
>>>> address and a frame pointer (*).
>>>   > Oops, I forgot to add "void *data" for storing user data.
>>>
>>
>> Should use "struct kretprobe_holder *rph", since "struct kretprobe" belongs
>> to 3rd-party module (which might be unloaded any time).
> 
> Good catch. Yes, instead of 'struct kretprobe', we need to use the holder.
> 
>> User's own pool might not work if the module can be unloaded. Better manage
>> the pool in kretprobe_holder, which needs no changes from user side.
> 
> No, since the 'data' will be only refered from user handler. If the kretprobe
> is released, then the kretprobe_holder will clear the refernce to the 'struct
> kretprobe'. Then, the user handler is never called. No one access the 'data'.

Indeed, there is no race of "data" accessing, since unregister_kretprobes()
is taking care of it.

This implementation just increases the complexity of caller to keep track
of all allocated instances and release them after unregistration.

But guys are likely to use kmalloc in pre-handler and kfree in post-handler,
which will lead to memory leaks.
Masami Hiramatsu (Google) July 12, 2021, 4:57 a.m. UTC | #14
On Sun, 11 Jul 2021 23:28:49 +0800
Matt Wu <wuqiang.matt@bytedance.com> wrote:

> On 2021/7/11 PM10:09, Masami Hiramatsu wrote:
> > On Wed, 7 Jul 2021 22:42:47 +0800
> > Matt Wu <wuqiang.matt@bytedance.com> wrote:
> > 
> >> On 2021/7/7 PM9:29, Masami Hiramatsu wrote:
> >>> On Wed, 7 Jul 2021 19:45:30 +0900
> >>> Masami Hiramatsu <mhiramat@kernel.org> wrote:
> >>>
> >>>> On Wed, 7 Jul 2021 12:20:57 +0200
> >>>> Peter Zijlstra <peterz@infradead.org> wrote:
> >>>>
> >>>>> On Wed, Jul 07, 2021 at 07:15:10PM +0900, Masami Hiramatsu wrote:
> >>>>>
> >>>>>> I actually don't want to keep this feature because no one use it.
> >>>>>> (only systemtap needs it?)
> >>>>>
> >>>>> Yeah, you mentioned systemtap, but since that's out-of-tree I don't
> >>>>> care. Their problem.
> >>>
> >>> Yeah, maybe it is not hard to update.
> >>>
> >>>>>
> >>>>>> Anyway, if we keep the idea-level compatibility (not code level),
> >>>>>> what we need is 'void *data' in the struct kretprobe_instance.
> >>>>>> User who needs it can allocate their own instance data for their
> >>>>>> kretprobes when initialising it and sets in their entry handler.
> >>>>>>
> >>>>>> Then we can have a simple kretprobe_instance.
> >>>>>
> >>>>> When would you do the alloc? When installing the retprobe, but that
> >>>>> might be inside the allocator, which means you can't call the allocator
> >>>>> etc.. :-)
> >>>>
> >>>> Yes, so the user may need to allocate a pool right before register_kretprobe().
> >>>> (whether per-kretprobe or per-task or global pool, that is user's choice.)
> >>>>
> >>>>>
> >>>>> If we look at struct ftrace_ret_stack, it has a few fixed function
> >>>>> fields. The calltime one is all that is needed for the kretprobe
> >>>>> example code.
> >>>>
> >>>> kretprobe consumes 3 fields, a pointer to 'struct kretprobe' (which
> >>>> stores callee function address in 'kretprobe::kp.addr'), a return
> >>>> address and a frame pointer (*).
> >>>   > Oops, I forgot to add "void *data" for storing user data.
> >>>
> >>
> >> Should use "struct kretprobe_holder *rph", since "struct kretprobe" belongs
> >> to 3rd-party module (which might be unloaded any time).
> > 
> > Good catch. Yes, instead of 'struct kretprobe', we need to use the holder.
> > 
> >> User's own pool might not work if the module can be unloaded. Better manage
> >> the pool in kretprobe_holder, which needs no changes from user side.
> > 
> > No, since the 'data' will be only refered from user handler. If the kretprobe
> > is released, then the kretprobe_holder will clear the refernce to the 'struct
> > kretprobe'. Then, the user handler is never called. No one access the 'data'.
> 
> Indeed, there is no race of "data" accessing, since unregister_kretprobes()
> is taking care of it.
> 
> This implementation just increases the complexity of caller to keep track
> of all allocated instances and release them after unregistration.

Yes, but user can manage it with an array of pointers (or directly allocate
an array of their desired data). Not hard to track it in that case.

> But guys are likely to use kmalloc in pre-handler and kfree in post-handler,
> which will lead to memory leaks.

I will note "do not allocate memory inside kprobe handler" on manual.
I think that's all what we need. We cannot stop someone shooting their feet
especially in the kernel...

Thank you,
diff mbox series

Patch

diff --git a/arch/x86/include/asm/unwind.h b/arch/x86/include/asm/unwind.h
index 70fc159ebe69..36d3971c0a2c 100644
--- a/arch/x86/include/asm/unwind.h
+++ b/arch/x86/include/asm/unwind.h
@@ -4,6 +4,7 @@ 
 
 #include <linux/sched.h>
 #include <linux/ftrace.h>
+#include <linux/kprobes.h>
 #include <asm/ptrace.h>
 #include <asm/stacktrace.h>
 
@@ -15,6 +16,7 @@  struct unwind_state {
 	unsigned long stack_mask;
 	struct task_struct *task;
 	int graph_idx;
+	struct llist_node *kr_cur;
 	bool error;
 #if defined(CONFIG_UNWINDER_ORC)
 	bool signal, full_regs;
@@ -99,6 +101,27 @@  void unwind_module_init(struct module *mod, void *orc_ip, size_t orc_ip_size,
 			void *orc, size_t orc_size) {}
 #endif
 
+static inline
+unsigned long unwind_recover_kretprobe(struct unwind_state *state,
+				       unsigned long addr, unsigned long *addr_p)
+{
+	return is_kretprobe_trampoline(addr) ?
+		kretprobe_find_ret_addr(state->task, addr_p, &state->kr_cur) :
+		addr;
+}
+
+/* Recover the return address modified by instrumentation (e.g. kretprobe) */
+static inline
+unsigned long unwind_recover_ret_addr(struct unwind_state *state,
+				     unsigned long addr, unsigned long *addr_p)
+{
+	unsigned long ret;
+
+	ret = ftrace_graph_ret_addr(state->task, &state->graph_idx,
+				    addr, addr_p);
+	return unwind_recover_kretprobe(state, ret, addr_p);
+}
+
 /*
  * This disables KASAN checking when reading a value from another task's stack,
  * since the other task could be running on another CPU and could have poisoned
diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c
index d7c44b257f7f..8e1c50c86e5d 100644
--- a/arch/x86/kernel/unwind_frame.c
+++ b/arch/x86/kernel/unwind_frame.c
@@ -240,8 +240,7 @@  static bool update_stack_state(struct unwind_state *state,
 	else {
 		addr_p = unwind_get_return_address_ptr(state);
 		addr = READ_ONCE_TASK_STACK(state->task, *addr_p);
-		state->ip = ftrace_graph_ret_addr(state->task, &state->graph_idx,
-						  addr, addr_p);
+		state->ip = unwind_recover_ret_addr(state, addr, addr_p);
 	}
 
 	/* Save the original stack pointer for unwind_dump(): */
diff --git a/arch/x86/kernel/unwind_guess.c b/arch/x86/kernel/unwind_guess.c
index c49f10ffd8cd..884d68a6e714 100644
--- a/arch/x86/kernel/unwind_guess.c
+++ b/arch/x86/kernel/unwind_guess.c
@@ -15,8 +15,7 @@  unsigned long unwind_get_return_address(struct unwind_state *state)
 
 	addr = READ_ONCE_NOCHECK(*state->sp);
 
-	return ftrace_graph_ret_addr(state->task, &state->graph_idx,
-				     addr, state->sp);
+	return unwind_recover_ret_addr(state, addr, state->sp);
 }
 EXPORT_SYMBOL_GPL(unwind_get_return_address);
 
diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c
index a1202536fc57..ad6a9aece379 100644
--- a/arch/x86/kernel/unwind_orc.c
+++ b/arch/x86/kernel/unwind_orc.c
@@ -534,9 +534,8 @@  bool unwind_next_frame(struct unwind_state *state)
 		if (!deref_stack_reg(state, ip_p, &state->ip))
 			goto err;
 
-		state->ip = ftrace_graph_ret_addr(state->task, &state->graph_idx,
-						  state->ip, (void *)ip_p);
-
+		state->ip = unwind_recover_ret_addr(state, state->ip,
+						    (unsigned long *)ip_p);
 		state->sp = sp;
 		state->regs = NULL;
 		state->prev_regs = NULL;
@@ -549,7 +548,15 @@  bool unwind_next_frame(struct unwind_state *state)
 					 (void *)orig_ip);
 			goto err;
 		}
-
+		/*
+		 * There is a small chance to interrupt at the entry of
+		 * kretprobe_trampoline where the ORC info doesn't exist.
+		 * That point is right after the RET to kretprobe_trampoline
+		 * which was modified return address. So the @addr_p must
+		 * be right before the regs->sp.
+		 */
+		state->ip = unwind_recover_kretprobe(state, state->ip,
+				(unsigned long *)(state->sp - sizeof(long)));
 		state->regs = (struct pt_regs *)sp;
 		state->prev_regs = NULL;
 		state->full_regs = true;
@@ -562,6 +569,9 @@  bool unwind_next_frame(struct unwind_state *state)
 					 (void *)orig_ip);
 			goto err;
 		}
+		/* See UNWIND_HINT_TYPE_REGS case comment. */
+		state->ip = unwind_recover_kretprobe(state, state->ip,
+				(unsigned long *)(state->sp - sizeof(long)));
 
 		if (state->full_regs)
 			state->prev_regs = state->regs;