diff mbox series

[RFC,perf/core,05/11] uprobes: Add mapping for optimized uprobe trampolines

Message ID 20241105133405.2703607-6-jolsa@kernel.org (mailing list archive)
State New
Delegated to: Masami Hiramatsu
Headers show
Series uprobes: Add support to optimize usdt probes on x86_64 | expand

Commit Message

Jiri Olsa Nov. 5, 2024, 1:33 p.m. UTC
Adding interface to add special mapping for user space page that will be
used as place holder for uprobe trampoline in following changes.

The get_tramp_area(vaddr) function either finds 'callable' page or create
new one.  The 'callable' means it's reachable by call instruction (from
vaddr argument) and is decided by each arch via new arch_uprobe_is_callable
function.

The put_tramp_area function either drops refcount or destroys the special
mapping and all the maps are clean up when the process goes down.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 include/linux/uprobes.h |  12 ++++
 kernel/events/uprobes.c | 141 ++++++++++++++++++++++++++++++++++++++++
 kernel/fork.c           |   2 +
 3 files changed, 155 insertions(+)

Comments

Peter Zijlstra Nov. 5, 2024, 2:23 p.m. UTC | #1
On Tue, Nov 05, 2024 at 02:33:59PM +0100, Jiri Olsa wrote:
> Adding interface to add special mapping for user space page that will be
> used as place holder for uprobe trampoline in following changes.
> 
> The get_tramp_area(vaddr) function either finds 'callable' page or create
> new one.  The 'callable' means it's reachable by call instruction (from
> vaddr argument) and is decided by each arch via new arch_uprobe_is_callable
> function.
> 
> The put_tramp_area function either drops refcount or destroys the special
> mapping and all the maps are clean up when the process goes down.

In another thread somewhere, Andrii mentioned that Meta has executables
with more than 4G of .text. This isn't going to work for them, is it?
Jiri Olsa Nov. 5, 2024, 4:33 p.m. UTC | #2
On Tue, Nov 05, 2024 at 03:23:27PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 05, 2024 at 02:33:59PM +0100, Jiri Olsa wrote:
> > Adding interface to add special mapping for user space page that will be
> > used as place holder for uprobe trampoline in following changes.
> > 
> > The get_tramp_area(vaddr) function either finds 'callable' page or create
> > new one.  The 'callable' means it's reachable by call instruction (from
> > vaddr argument) and is decided by each arch via new arch_uprobe_is_callable
> > function.
> > 
> > The put_tramp_area function either drops refcount or destroys the special
> > mapping and all the maps are clean up when the process goes down.
> 
> In another thread somewhere, Andrii mentioned that Meta has executables
> with more than 4G of .text. This isn't going to work for them, is it?
> 

not if you can't reach the trampoline from the probed address

jirka
Andrii Nakryiko Nov. 14, 2024, 11:44 p.m. UTC | #3
On Tue, Nov 5, 2024 at 8:33 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Tue, Nov 05, 2024 at 03:23:27PM +0100, Peter Zijlstra wrote:
> > On Tue, Nov 05, 2024 at 02:33:59PM +0100, Jiri Olsa wrote:
> > > Adding interface to add special mapping for user space page that will be
> > > used as place holder for uprobe trampoline in following changes.
> > >
> > > The get_tramp_area(vaddr) function either finds 'callable' page or create
> > > new one.  The 'callable' means it's reachable by call instruction (from
> > > vaddr argument) and is decided by each arch via new arch_uprobe_is_callable
> > > function.
> > >
> > > The put_tramp_area function either drops refcount or destroys the special
> > > mapping and all the maps are clean up when the process goes down.
> >
> > In another thread somewhere, Andrii mentioned that Meta has executables
> > with more than 4G of .text. This isn't going to work for them, is it?
> >
>
> not if you can't reach the trampoline from the probed address

That specific example was about 1.5GB (though we might have bigger
.text, I didn't do exhaustive research). As Jiri said, this would be
best effort trying to find closest free mapping to stay within +/-2GB
offset. If that fails, we always would be falling back to slower
int3-based uprobing, yep.

Jiri, we could also have an option to support 64-bit call, right? We'd
need nop9 for that, but it's an option as well to future-proofing this
approach, no?

Also, can we somehow use fs/gs-based indirect calls/jumps somehow to
have a guarantee that offset is always small (<2GB away relative to
the base stored in fs/gs). Not sure if this is feasible, but I thought
it would be good to bring this up just to make sure it doesn't work.

If segment based absolute call is somehow feasible, we can probably
simplify a bunch of stuff by allocating it eagerly, once, and
somewhere high up next to VDSO (or maybe even put it into VDSO, don't
now).

Anyways, let's brainstorm if there are any clever alternatives here.


>
> jirka
Andrii Nakryiko Nov. 14, 2024, 11:44 p.m. UTC | #4
On Tue, Nov 5, 2024 at 5:35 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> Adding interface to add special mapping for user space page that will be
> used as place holder for uprobe trampoline in following changes.
>
> The get_tramp_area(vaddr) function either finds 'callable' page or create
> new one.  The 'callable' means it's reachable by call instruction (from
> vaddr argument) and is decided by each arch via new arch_uprobe_is_callable
> function.
>
> The put_tramp_area function either drops refcount or destroys the special
> mapping and all the maps are clean up when the process goes down.
>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  include/linux/uprobes.h |  12 ++++
>  kernel/events/uprobes.c | 141 ++++++++++++++++++++++++++++++++++++++++
>  kernel/fork.c           |   2 +
>  3 files changed, 155 insertions(+)
>
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index be306028ed59..222d8e82cee2 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -172,6 +172,15 @@ struct xol_area;
>
>  struct uprobes_state {
>         struct xol_area         *xol_area;
> +       struct hlist_head       tramp_head;
> +       struct mutex            tramp_mutex;
> +};
> +
> +struct tramp_area {
> +       unsigned long           vaddr;
> +       struct page             *page;
> +       struct hlist_node       node;
> +       refcount_t              ref;

nit: any reason we are unnecessarily trying to save 4 bytes on
refcount (and we don't actually, due to padding)

>  };
>
>  extern void __init uprobes_init(void);
> @@ -219,6 +228,9 @@ extern int uprobe_verify_opcode(struct page *page, unsigned long vaddr, uprobe_o
>  extern int arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr,
>                                      uprobe_opcode_t *new_opcode, void *data);
>  extern bool arch_uprobe_is_register(uprobe_opcode_t *insn, int len, void *data);
> +struct tramp_area *get_tramp_area(unsigned long vaddr);

uprobe_get_tramp_area() to make it clear this is uprobe specific,
given this is exposed function?

and add that extern like we do for other functions?

> +void put_tramp_area(struct tramp_area *area);

uprobe_put_tramp_area() ?

> +bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr);
>  #else /* !CONFIG_UPROBES */
>  struct uprobes_state {
>  };
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 944d9df1f081..a44305c559a4 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -616,6 +616,145 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
>                         (uprobe_opcode_t *)&auprobe->insn, UPROBE_SWBP_INSN_SIZE, NULL);
>  }
>
> +bool __weak arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr)
> +{
> +       return false;
> +}
> +
> +static unsigned long find_nearest_page(unsigned long vaddr)
> +{
> +       struct mm_struct *mm = current->mm;
> +       struct vm_area_struct *vma, *prev;
> +       VMA_ITERATOR(vmi, mm, 0);
> +
> +       prev = vma_next(&vmi);
> +       vma = vma_next(&vmi);
> +       while (vma) {
> +               if (vma->vm_start - prev->vm_end  >= PAGE_SIZE &&
> +                   arch_uprobe_is_callable(prev->vm_end, vaddr))
> +                       return prev->vm_end;

shouldn't we try both `prev->vm_end` and `vma->vm_start - PAGE_SIZE`
as two possible places

> +
> +               prev = vma;
> +               vma = vma_next(&vmi);
> +       }
> +
> +       return 0;
> +}
> +
> +static vm_fault_t tramp_fault(const struct vm_special_mapping *sm,
> +                             struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> +       struct hlist_head *head = &vma->vm_mm->uprobes_state.tramp_head;
> +       struct tramp_area *area;
> +
> +       hlist_for_each_entry(area, head, node) {
> +               if (vma->vm_start == area->vaddr) {
> +                       vmf->page = area->page;
> +                       get_page(vmf->page);
> +                       return 0;
> +               }
> +       }
> +
> +       return -EINVAL;
> +}
> +
> +static int tramp_mremap(const struct vm_special_mapping *sm, struct vm_area_struct *new_vma)
> +{
> +       return -EPERM;
> +}
> +
> +static const struct vm_special_mapping tramp_mapping = {
> +       .name = "[uprobes-trampoline]",
> +       .fault = tramp_fault,
> +       .mremap = tramp_mremap,
> +};
> +
> +static struct tramp_area *create_tramp_area(unsigned long vaddr)
> +{
> +       struct mm_struct *mm = current->mm;
> +       struct vm_area_struct *vma;
> +       struct tramp_area *area;
> +
> +       vaddr = find_nearest_page(vaddr);
> +       if (!vaddr)
> +               return NULL;
> +
> +       area = kzalloc(sizeof(*area), GFP_KERNEL);
> +       if (unlikely(!area))
> +               return NULL;
> +
> +       area->page = alloc_page(GFP_HIGHUSER);
> +       if (!area->page)
> +               goto free_area;
> +
> +       refcount_set(&area->ref, 1);
> +       area->vaddr = vaddr;
> +
> +       vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE,
> +                               VM_READ|VM_EXEC|VM_MAYEXEC|VM_MAYREAD|VM_DONTCOPY|VM_IO,
> +                               &tramp_mapping);
> +       if (!IS_ERR(vma))
> +               return area;

please keep a pattern, it's less surprising that way

    if (IS_ERR(vma))
        goto free_page;

    return area;

free_page:

> +
> +       __free_page(area->page);
> + free_area:
> +       kfree(area);
> +       return NULL;
> +}
> +
> +struct tramp_area *get_tramp_area(unsigned long vaddr)
> +{
> +       struct uprobes_state *state = &current->mm->uprobes_state;
> +       struct tramp_area *area = NULL;
> +
> +       mutex_lock(&state->tramp_mutex);
> +       hlist_for_each_entry(area, &state->tramp_head, node) {
> +               if (arch_uprobe_is_callable(area->vaddr, vaddr)) {
> +                       refcount_inc(&area->ref);
> +                       goto unlock;
> +               }
> +       }
> +
> +       area = create_tramp_area(vaddr);
> +       if (area)
> +               hlist_add_head(&area->node, &state->tramp_head);
> +
> +unlock:
> +       mutex_unlock(&state->tramp_mutex);
> +       return area;
> +}
> +
> +static void destroy_tramp_area(struct tramp_area *area)
> +{
> +       hlist_del(&area->node);
> +       put_page(area->page);
> +       kfree(area);
> +}
> +
> +void put_tramp_area(struct tramp_area *area)
> +{
> +       struct mm_struct *mm = current->mm;
> +       struct uprobes_state *state = &mm->uprobes_state;
> +
> +       if (area == NULL)

nit: !area

> +               return;
> +
> +       mutex_lock(&state->tramp_mutex);
> +       if (refcount_dec_and_test(&area->ref))
> +               destroy_tramp_area(area);
> +       mutex_unlock(&state->tramp_mutex);
> +}
> +

[...]
Jiri Olsa Nov. 16, 2024, 9:44 p.m. UTC | #5
On Thu, Nov 14, 2024 at 03:44:12PM -0800, Andrii Nakryiko wrote:
> On Tue, Nov 5, 2024 at 8:33 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > On Tue, Nov 05, 2024 at 03:23:27PM +0100, Peter Zijlstra wrote:
> > > On Tue, Nov 05, 2024 at 02:33:59PM +0100, Jiri Olsa wrote:
> > > > Adding interface to add special mapping for user space page that will be
> > > > used as place holder for uprobe trampoline in following changes.
> > > >
> > > > The get_tramp_area(vaddr) function either finds 'callable' page or create
> > > > new one.  The 'callable' means it's reachable by call instruction (from
> > > > vaddr argument) and is decided by each arch via new arch_uprobe_is_callable
> > > > function.
> > > >
> > > > The put_tramp_area function either drops refcount or destroys the special
> > > > mapping and all the maps are clean up when the process goes down.
> > >
> > > In another thread somewhere, Andrii mentioned that Meta has executables
> > > with more than 4G of .text. This isn't going to work for them, is it?
> > >
> >
> > not if you can't reach the trampoline from the probed address
> 
> That specific example was about 1.5GB (though we might have bigger
> .text, I didn't do exhaustive research). As Jiri said, this would be
> best effort trying to find closest free mapping to stay within +/-2GB
> offset. If that fails, we always would be falling back to slower
> int3-based uprobing, yep.
> 
> Jiri, we could also have an option to support 64-bit call, right? We'd
> need nop9 for that, but it's an option as well to future-proofing this
> approach, no?

hm, I don't think there's call with relative 64bit offset

there's indirect call through register or address.. but I think we would
fit in nop10 with the indirect call through address

> 
> Also, can we somehow use fs/gs-based indirect calls/jumps somehow to
> have a guarantee that offset is always small (<2GB away relative to
> the base stored in fs/gs). Not sure if this is feasible, but I thought
> it would be good to bring this up just to make sure it doesn't work.
> 
> If segment based absolute call is somehow feasible, we can probably
> simplify a bunch of stuff by allocating it eagerly, once, and
> somewhere high up next to VDSO (or maybe even put it into VDSO, don't
> now).

yes, that would be convenient

jirka

> 
> Anyways, let's brainstorm if there are any clever alternatives here.
> 
> 
> >
> > jirka
Jiri Olsa Nov. 16, 2024, 9:44 p.m. UTC | #6
On Thu, Nov 14, 2024 at 03:44:14PM -0800, Andrii Nakryiko wrote:
> On Tue, Nov 5, 2024 at 5:35 AM Jiri Olsa <jolsa@kernel.org> wrote:
> >
> > Adding interface to add special mapping for user space page that will be
> > used as place holder for uprobe trampoline in following changes.
> >
> > The get_tramp_area(vaddr) function either finds 'callable' page or create
> > new one.  The 'callable' means it's reachable by call instruction (from
> > vaddr argument) and is decided by each arch via new arch_uprobe_is_callable
> > function.
> >
> > The put_tramp_area function either drops refcount or destroys the special
> > mapping and all the maps are clean up when the process goes down.
> >
> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > ---
> >  include/linux/uprobes.h |  12 ++++
> >  kernel/events/uprobes.c | 141 ++++++++++++++++++++++++++++++++++++++++
> >  kernel/fork.c           |   2 +
> >  3 files changed, 155 insertions(+)
> >
> > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > index be306028ed59..222d8e82cee2 100644
> > --- a/include/linux/uprobes.h
> > +++ b/include/linux/uprobes.h
> > @@ -172,6 +172,15 @@ struct xol_area;
> >
> >  struct uprobes_state {
> >         struct xol_area         *xol_area;
> > +       struct hlist_head       tramp_head;
> > +       struct mutex            tramp_mutex;
> > +};
> > +
> > +struct tramp_area {
> > +       unsigned long           vaddr;
> > +       struct page             *page;
> > +       struct hlist_node       node;
> > +       refcount_t              ref;
> 
> nit: any reason we are unnecessarily trying to save 4 bytes on
> refcount (and we don't actually, due to padding)

hum, I'm not sure what you mean.. what's the alternative?

> 
> >  };
> >
> >  extern void __init uprobes_init(void);
> > @@ -219,6 +228,9 @@ extern int uprobe_verify_opcode(struct page *page, unsigned long vaddr, uprobe_o
> >  extern int arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr,
> >                                      uprobe_opcode_t *new_opcode, void *data);
> >  extern bool arch_uprobe_is_register(uprobe_opcode_t *insn, int len, void *data);
> > +struct tramp_area *get_tramp_area(unsigned long vaddr);
> 
> uprobe_get_tramp_area() to make it clear this is uprobe specific,
> given this is exposed function?
> 
> and add that extern like we do for other functions?

ok

> 
> > +void put_tramp_area(struct tramp_area *area);
> 
> uprobe_put_tramp_area() ?

ok

> 
> > +bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr);
> >  #else /* !CONFIG_UPROBES */
> >  struct uprobes_state {
> >  };
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 944d9df1f081..a44305c559a4 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -616,6 +616,145 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
> >                         (uprobe_opcode_t *)&auprobe->insn, UPROBE_SWBP_INSN_SIZE, NULL);
> >  }
> >
> > +bool __weak arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr)
> > +{
> > +       return false;
> > +}
> > +
> > +static unsigned long find_nearest_page(unsigned long vaddr)
> > +{
> > +       struct mm_struct *mm = current->mm;
> > +       struct vm_area_struct *vma, *prev;
> > +       VMA_ITERATOR(vmi, mm, 0);
> > +
> > +       prev = vma_next(&vmi);
> > +       vma = vma_next(&vmi);
> > +       while (vma) {
> > +               if (vma->vm_start - prev->vm_end  >= PAGE_SIZE &&
> > +                   arch_uprobe_is_callable(prev->vm_end, vaddr))
> > +                       return prev->vm_end;
> 
> shouldn't we try both `prev->vm_end` and `vma->vm_start - PAGE_SIZE`
> as two possible places

right, we should do that

SNIP

> > +static struct tramp_area *create_tramp_area(unsigned long vaddr)
> > +{
> > +       struct mm_struct *mm = current->mm;
> > +       struct vm_area_struct *vma;
> > +       struct tramp_area *area;
> > +
> > +       vaddr = find_nearest_page(vaddr);
> > +       if (!vaddr)
> > +               return NULL;
> > +
> > +       area = kzalloc(sizeof(*area), GFP_KERNEL);
> > +       if (unlikely(!area))
> > +               return NULL;
> > +
> > +       area->page = alloc_page(GFP_HIGHUSER);
> > +       if (!area->page)
> > +               goto free_area;
> > +
> > +       refcount_set(&area->ref, 1);
> > +       area->vaddr = vaddr;
> > +
> > +       vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE,
> > +                               VM_READ|VM_EXEC|VM_MAYEXEC|VM_MAYREAD|VM_DONTCOPY|VM_IO,
> > +                               &tramp_mapping);
> > +       if (!IS_ERR(vma))
> > +               return area;
> 
> please keep a pattern, it's less surprising that way
> 
>     if (IS_ERR(vma))
>         goto free_page;
> 
>     return area;
> 
> free_page:

ok


SNIP

> > +static void destroy_tramp_area(struct tramp_area *area)
> > +{
> > +       hlist_del(&area->node);
> > +       put_page(area->page);
> > +       kfree(area);
> > +}
> > +
> > +void put_tramp_area(struct tramp_area *area)
> > +{
> > +       struct mm_struct *mm = current->mm;
> > +       struct uprobes_state *state = &mm->uprobes_state;
> > +
> > +       if (area == NULL)
> 
> nit: !area

ok

thanks,
jirka
Andrii Nakryiko Nov. 19, 2024, 6:05 a.m. UTC | #7
On Sat, Nov 16, 2024 at 1:44 PM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Thu, Nov 14, 2024 at 03:44:14PM -0800, Andrii Nakryiko wrote:
> > On Tue, Nov 5, 2024 at 5:35 AM Jiri Olsa <jolsa@kernel.org> wrote:
> > >
> > > Adding interface to add special mapping for user space page that will be
> > > used as place holder for uprobe trampoline in following changes.
> > >
> > > The get_tramp_area(vaddr) function either finds 'callable' page or create
> > > new one.  The 'callable' means it's reachable by call instruction (from
> > > vaddr argument) and is decided by each arch via new arch_uprobe_is_callable
> > > function.
> > >
> > > The put_tramp_area function either drops refcount or destroys the special
> > > mapping and all the maps are clean up when the process goes down.
> > >
> > > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > > ---
> > >  include/linux/uprobes.h |  12 ++++
> > >  kernel/events/uprobes.c | 141 ++++++++++++++++++++++++++++++++++++++++
> > >  kernel/fork.c           |   2 +
> > >  3 files changed, 155 insertions(+)
> > >
> > > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > > index be306028ed59..222d8e82cee2 100644
> > > --- a/include/linux/uprobes.h
> > > +++ b/include/linux/uprobes.h
> > > @@ -172,6 +172,15 @@ struct xol_area;
> > >
> > >  struct uprobes_state {
> > >         struct xol_area         *xol_area;
> > > +       struct hlist_head       tramp_head;
> > > +       struct mutex            tramp_mutex;
> > > +};
> > > +
> > > +struct tramp_area {
> > > +       unsigned long           vaddr;
> > > +       struct page             *page;
> > > +       struct hlist_node       node;
> > > +       refcount_t              ref;
> >
> > nit: any reason we are unnecessarily trying to save 4 bytes on
> > refcount (and we don't actually, due to padding)
>
> hum, I'm not sure what you mean.. what's the alternative?

atomic64_t ?

>
> >
> > >  };
> > >
> > >  extern void __init uprobes_init(void);
> > > @@ -219,6 +228,9 @@ extern int uprobe_verify_opcode(struct page *page, unsigned long vaddr, uprobe_o
> > >  extern int arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr,
> > >                                      uprobe_opcode_t *new_opcode, void *data);
> > >  extern bool arch_uprobe_is_register(uprobe_opcode_t *insn, int len, void *data);
> > > +struct tramp_area *get_tramp_area(unsigned long vaddr);
> >
> > uprobe_get_tramp_area() to make it clear this is uprobe specific,
> > given this is exposed function?
> >
> > and add that extern like we do for other functions?
>
> ok
>
> >
> > > +void put_tramp_area(struct tramp_area *area);
> >
> > uprobe_put_tramp_area() ?
>
> ok
>
> >
> > > +bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr);
> > >  #else /* !CONFIG_UPROBES */
> > >  struct uprobes_state {
> > >  };
> > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > index 944d9df1f081..a44305c559a4 100644
> > > --- a/kernel/events/uprobes.c
> > > +++ b/kernel/events/uprobes.c
> > > @@ -616,6 +616,145 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
> > >                         (uprobe_opcode_t *)&auprobe->insn, UPROBE_SWBP_INSN_SIZE, NULL);
> > >  }
> > >
> > > +bool __weak arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr)
> > > +{
> > > +       return false;
> > > +}
> > > +
> > > +static unsigned long find_nearest_page(unsigned long vaddr)
> > > +{
> > > +       struct mm_struct *mm = current->mm;
> > > +       struct vm_area_struct *vma, *prev;
> > > +       VMA_ITERATOR(vmi, mm, 0);
> > > +
> > > +       prev = vma_next(&vmi);
> > > +       vma = vma_next(&vmi);
> > > +       while (vma) {
> > > +               if (vma->vm_start - prev->vm_end  >= PAGE_SIZE &&
> > > +                   arch_uprobe_is_callable(prev->vm_end, vaddr))
> > > +                       return prev->vm_end;
> >
> > shouldn't we try both `prev->vm_end` and `vma->vm_start - PAGE_SIZE`
> > as two possible places
>
> right, we should do that
>
> SNIP
>
> > > +static struct tramp_area *create_tramp_area(unsigned long vaddr)
> > > +{
> > > +       struct mm_struct *mm = current->mm;
> > > +       struct vm_area_struct *vma;
> > > +       struct tramp_area *area;
> > > +
> > > +       vaddr = find_nearest_page(vaddr);
> > > +       if (!vaddr)
> > > +               return NULL;
> > > +
> > > +       area = kzalloc(sizeof(*area), GFP_KERNEL);
> > > +       if (unlikely(!area))
> > > +               return NULL;
> > > +
> > > +       area->page = alloc_page(GFP_HIGHUSER);
> > > +       if (!area->page)
> > > +               goto free_area;
> > > +
> > > +       refcount_set(&area->ref, 1);
> > > +       area->vaddr = vaddr;
> > > +
> > > +       vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE,
> > > +                               VM_READ|VM_EXEC|VM_MAYEXEC|VM_MAYREAD|VM_DONTCOPY|VM_IO,
> > > +                               &tramp_mapping);
> > > +       if (!IS_ERR(vma))
> > > +               return area;
> >
> > please keep a pattern, it's less surprising that way
> >
> >     if (IS_ERR(vma))
> >         goto free_page;
> >
> >     return area;
> >
> > free_page:
>
> ok
>
>
> SNIP
>
> > > +static void destroy_tramp_area(struct tramp_area *area)
> > > +{
> > > +       hlist_del(&area->node);
> > > +       put_page(area->page);
> > > +       kfree(area);
> > > +}
> > > +
> > > +void put_tramp_area(struct tramp_area *area)
> > > +{
> > > +       struct mm_struct *mm = current->mm;
> > > +       struct uprobes_state *state = &mm->uprobes_state;
> > > +
> > > +       if (area == NULL)
> >
> > nit: !area
>
> ok
>
> thanks,
> jirka
Andrii Nakryiko Nov. 19, 2024, 6:06 a.m. UTC | #8
On Sat, Nov 16, 2024 at 1:44 PM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Thu, Nov 14, 2024 at 03:44:12PM -0800, Andrii Nakryiko wrote:
> > On Tue, Nov 5, 2024 at 8:33 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> > >
> > > On Tue, Nov 05, 2024 at 03:23:27PM +0100, Peter Zijlstra wrote:
> > > > On Tue, Nov 05, 2024 at 02:33:59PM +0100, Jiri Olsa wrote:
> > > > > Adding interface to add special mapping for user space page that will be
> > > > > used as place holder for uprobe trampoline in following changes.
> > > > >
> > > > > The get_tramp_area(vaddr) function either finds 'callable' page or create
> > > > > new one.  The 'callable' means it's reachable by call instruction (from
> > > > > vaddr argument) and is decided by each arch via new arch_uprobe_is_callable
> > > > > function.
> > > > >
> > > > > The put_tramp_area function either drops refcount or destroys the special
> > > > > mapping and all the maps are clean up when the process goes down.
> > > >
> > > > In another thread somewhere, Andrii mentioned that Meta has executables
> > > > with more than 4G of .text. This isn't going to work for them, is it?
> > > >
> > >
> > > not if you can't reach the trampoline from the probed address
> >
> > That specific example was about 1.5GB (though we might have bigger
> > .text, I didn't do exhaustive research). As Jiri said, this would be
> > best effort trying to find closest free mapping to stay within +/-2GB
> > offset. If that fails, we always would be falling back to slower
> > int3-based uprobing, yep.
> >
> > Jiri, we could also have an option to support 64-bit call, right? We'd
> > need nop9 for that, but it's an option as well to future-proofing this
> > approach, no?
>
> hm, I don't think there's call with relative 64bit offset

why do you need a relative, when you have 64 bits? ;) there is a call
to absolute address, no?

>
> there's indirect call through register or address.. but I think we would
> fit in nop10 with the indirect call through address
>
> >
> > Also, can we somehow use fs/gs-based indirect calls/jumps somehow to
> > have a guarantee that offset is always small (<2GB away relative to
> > the base stored in fs/gs). Not sure if this is feasible, but I thought
> > it would be good to bring this up just to make sure it doesn't work.
> >
> > If segment based absolute call is somehow feasible, we can probably
> > simplify a bunch of stuff by allocating it eagerly, once, and
> > somewhere high up next to VDSO (or maybe even put it into VDSO, don't
> > now).
>
> yes, that would be convenient
>
> jirka
>
> >
> > Anyways, let's brainstorm if there are any clever alternatives here.
> >
> >
> > >
> > > jirka
Peter Zijlstra Nov. 19, 2024, 9:13 a.m. UTC | #9
On Mon, Nov 18, 2024 at 10:06:51PM -0800, Andrii Nakryiko wrote:

> > > Jiri, we could also have an option to support 64-bit call, right? We'd
> > > need nop9 for that, but it's an option as well to future-proofing this
> > > approach, no?
> >
> > hm, I don't think there's call with relative 64bit offset
> 
> why do you need a relative, when you have 64 bits? ;) there is a call
> to absolute address, no?

No, there is not :/ You get to use an indirect call, which means
multiple instructions and all the speculation joy.

IFF USDT thingies have AX clobbered (I couldn't find in a hurry) then
patching the multi instruction thing is relatively straight forward, if
they don't, its going to be a pain.
Jiri Olsa Nov. 19, 2024, 3:14 p.m. UTC | #10
On Mon, Nov 18, 2024 at 10:05:41PM -0800, Andrii Nakryiko wrote:
> On Sat, Nov 16, 2024 at 1:44 PM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > On Thu, Nov 14, 2024 at 03:44:14PM -0800, Andrii Nakryiko wrote:
> > > On Tue, Nov 5, 2024 at 5:35 AM Jiri Olsa <jolsa@kernel.org> wrote:
> > > >
> > > > Adding interface to add special mapping for user space page that will be
> > > > used as place holder for uprobe trampoline in following changes.
> > > >
> > > > The get_tramp_area(vaddr) function either finds 'callable' page or create
> > > > new one.  The 'callable' means it's reachable by call instruction (from
> > > > vaddr argument) and is decided by each arch via new arch_uprobe_is_callable
> > > > function.
> > > >
> > > > The put_tramp_area function either drops refcount or destroys the special
> > > > mapping and all the maps are clean up when the process goes down.
> > > >
> > > > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > > > ---
> > > >  include/linux/uprobes.h |  12 ++++
> > > >  kernel/events/uprobes.c | 141 ++++++++++++++++++++++++++++++++++++++++
> > > >  kernel/fork.c           |   2 +
> > > >  3 files changed, 155 insertions(+)
> > > >
> > > > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > > > index be306028ed59..222d8e82cee2 100644
> > > > --- a/include/linux/uprobes.h
> > > > +++ b/include/linux/uprobes.h
> > > > @@ -172,6 +172,15 @@ struct xol_area;
> > > >
> > > >  struct uprobes_state {
> > > >         struct xol_area         *xol_area;
> > > > +       struct hlist_head       tramp_head;
> > > > +       struct mutex            tramp_mutex;
> > > > +};
> > > > +
> > > > +struct tramp_area {
> > > > +       unsigned long           vaddr;
> > > > +       struct page             *page;
> > > > +       struct hlist_node       node;
> > > > +       refcount_t              ref;
> > >
> > > nit: any reason we are unnecessarily trying to save 4 bytes on
> > > refcount (and we don't actually, due to padding)
> >
> > hum, I'm not sure what you mean.. what's the alternative?
> 
> atomic64_t ?

hum, just because we have extra 4 bytes padding? we use refcount_t
on other places so seems like better fit to me

jirka
Jiri Olsa Nov. 19, 2024, 3:15 p.m. UTC | #11
On Tue, Nov 19, 2024 at 10:13:48AM +0100, Peter Zijlstra wrote:
> On Mon, Nov 18, 2024 at 10:06:51PM -0800, Andrii Nakryiko wrote:
> 
> > > > Jiri, we could also have an option to support 64-bit call, right? We'd
> > > > need nop9 for that, but it's an option as well to future-proofing this
> > > > approach, no?
> > >
> > > hm, I don't think there's call with relative 64bit offset
> > 
> > why do you need a relative, when you have 64 bits? ;) there is a call
> > to absolute address, no?
> 
> No, there is not :/ You get to use an indirect call, which means
> multiple instructions and all the speculation joy.
> 
> IFF USDT thingies have AX clobbered (I couldn't find in a hurry) then
> patching the multi instruction thing is relatively straight forward, if
> they don't, its going to be a pain.

I don't follow, what's the reason for that?

jirka
Andrii Nakryiko Nov. 21, 2024, 12:07 a.m. UTC | #12
On Tue, Nov 19, 2024 at 1:13 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Nov 18, 2024 at 10:06:51PM -0800, Andrii Nakryiko wrote:
>
> > > > Jiri, we could also have an option to support 64-bit call, right? We'd
> > > > need nop9 for that, but it's an option as well to future-proofing this
> > > > approach, no?
> > >
> > > hm, I don't think there's call with relative 64bit offset
> >
> > why do you need a relative, when you have 64 bits? ;) there is a call
> > to absolute address, no?
>
> No, there is not :/ You get to use an indirect call, which means
> multiple instructions and all the speculation joy.

Ah, I misinterpreted `CALL m16:64` (from [0]) as an absolute address
call with 64-bit displacement encoded in the struct, my bad.

  [0] https://www.felixcloutier.com/x86/call

>
> IFF USDT thingies have AX clobbered (I couldn't find in a hurry) then
> patching the multi instruction thing is relatively straight forward, if
> they don't, its going to be a pain.

USDTs are meant to be "transparent" to the surrounding code and they
don't mark any clobbered registers. Technically it could be added, but
I'm not a fan of this.

I'd stick to the 32-bit relative call for now, and fallback to int3 if
we can't reach the uprobe trampoline. So huge .text might suffer
suboptimal performance, but at least USDTs won't pessimize the
surrounding code (and kernel won't make any assumptions about
registers that are ok to use).
Andrii Nakryiko Nov. 21, 2024, 12:10 a.m. UTC | #13
On Tue, Nov 19, 2024 at 7:14 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Mon, Nov 18, 2024 at 10:05:41PM -0800, Andrii Nakryiko wrote:
> > On Sat, Nov 16, 2024 at 1:44 PM Jiri Olsa <olsajiri@gmail.com> wrote:
> > >
> > > On Thu, Nov 14, 2024 at 03:44:14PM -0800, Andrii Nakryiko wrote:
> > > > On Tue, Nov 5, 2024 at 5:35 AM Jiri Olsa <jolsa@kernel.org> wrote:
> > > > >
> > > > > Adding interface to add special mapping for user space page that will be
> > > > > used as place holder for uprobe trampoline in following changes.
> > > > >
> > > > > The get_tramp_area(vaddr) function either finds 'callable' page or create
> > > > > new one.  The 'callable' means it's reachable by call instruction (from
> > > > > vaddr argument) and is decided by each arch via new arch_uprobe_is_callable
> > > > > function.
> > > > >
> > > > > The put_tramp_area function either drops refcount or destroys the special
> > > > > mapping and all the maps are clean up when the process goes down.
> > > > >
> > > > > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > > > > ---
> > > > >  include/linux/uprobes.h |  12 ++++
> > > > >  kernel/events/uprobes.c | 141 ++++++++++++++++++++++++++++++++++++++++
> > > > >  kernel/fork.c           |   2 +
> > > > >  3 files changed, 155 insertions(+)
> > > > >
> > > > > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > > > > index be306028ed59..222d8e82cee2 100644
> > > > > --- a/include/linux/uprobes.h
> > > > > +++ b/include/linux/uprobes.h
> > > > > @@ -172,6 +172,15 @@ struct xol_area;
> > > > >
> > > > >  struct uprobes_state {
> > > > >         struct xol_area         *xol_area;
> > > > > +       struct hlist_head       tramp_head;
> > > > > +       struct mutex            tramp_mutex;
> > > > > +};
> > > > > +
> > > > > +struct tramp_area {
> > > > > +       unsigned long           vaddr;
> > > > > +       struct page             *page;
> > > > > +       struct hlist_node       node;
> > > > > +       refcount_t              ref;
> > > >
> > > > nit: any reason we are unnecessarily trying to save 4 bytes on
> > > > refcount (and we don't actually, due to padding)
> > >
> > > hum, I'm not sure what you mean.. what's the alternative?
> >
> > atomic64_t ?
>
> hum, just because we have extra 4 bytes padding? we use refcount_t
> on other places so seems like better fit to me

My (minor) concern was that tramp_area is a very long-living object
that can be shared across huge amounts of uprobes, and 2 billion
uprobes is a lot, but still a number that one can, practically
speaking, reach (with enough effort). And so skimping on refcount to
save 4 bytes for something that we have one or two per *some*
processes seemed (and still seems) like wrong savings to pursue.

>
> jirka
diff mbox series

Patch

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index be306028ed59..222d8e82cee2 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -172,6 +172,15 @@  struct xol_area;
 
 struct uprobes_state {
 	struct xol_area		*xol_area;
+	struct hlist_head	tramp_head;
+	struct mutex		tramp_mutex;
+};
+
+struct tramp_area {
+	unsigned long		vaddr;
+	struct page		*page;
+	struct hlist_node	node;
+	refcount_t		ref;
 };
 
 extern void __init uprobes_init(void);
@@ -219,6 +228,9 @@  extern int uprobe_verify_opcode(struct page *page, unsigned long vaddr, uprobe_o
 extern int arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr,
 				     uprobe_opcode_t *new_opcode, void *data);
 extern bool arch_uprobe_is_register(uprobe_opcode_t *insn, int len, void *data);
+struct tramp_area *get_tramp_area(unsigned long vaddr);
+void put_tramp_area(struct tramp_area *area);
+bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 944d9df1f081..a44305c559a4 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -616,6 +616,145 @@  set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
 			(uprobe_opcode_t *)&auprobe->insn, UPROBE_SWBP_INSN_SIZE, NULL);
 }
 
+bool __weak arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr)
+{
+	return false;
+}
+
+static unsigned long find_nearest_page(unsigned long vaddr)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma, *prev;
+	VMA_ITERATOR(vmi, mm, 0);
+
+	prev = vma_next(&vmi);
+	vma = vma_next(&vmi);
+	while (vma) {
+		if (vma->vm_start - prev->vm_end  >= PAGE_SIZE &&
+		    arch_uprobe_is_callable(prev->vm_end, vaddr))
+			return prev->vm_end;
+
+		prev = vma;
+		vma = vma_next(&vmi);
+	}
+
+	return 0;
+}
+
+static vm_fault_t tramp_fault(const struct vm_special_mapping *sm,
+			      struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct hlist_head *head = &vma->vm_mm->uprobes_state.tramp_head;
+	struct tramp_area *area;
+
+	hlist_for_each_entry(area, head, node) {
+		if (vma->vm_start == area->vaddr) {
+			vmf->page = area->page;
+			get_page(vmf->page);
+			return 0;
+		}
+	}
+
+	return -EINVAL;
+}
+
+static int tramp_mremap(const struct vm_special_mapping *sm, struct vm_area_struct *new_vma)
+{
+	return -EPERM;
+}
+
+static const struct vm_special_mapping tramp_mapping = {
+	.name = "[uprobes-trampoline]",
+	.fault = tramp_fault,
+	.mremap = tramp_mremap,
+};
+
+static struct tramp_area *create_tramp_area(unsigned long vaddr)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma;
+	struct tramp_area *area;
+
+	vaddr = find_nearest_page(vaddr);
+	if (!vaddr)
+		return NULL;
+
+	area = kzalloc(sizeof(*area), GFP_KERNEL);
+	if (unlikely(!area))
+		return NULL;
+
+	area->page = alloc_page(GFP_HIGHUSER);
+	if (!area->page)
+		goto free_area;
+
+	refcount_set(&area->ref, 1);
+	area->vaddr = vaddr;
+
+	vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE,
+				VM_READ|VM_EXEC|VM_MAYEXEC|VM_MAYREAD|VM_DONTCOPY|VM_IO,
+				&tramp_mapping);
+	if (!IS_ERR(vma))
+		return area;
+
+	__free_page(area->page);
+ free_area:
+	kfree(area);
+	return NULL;
+}
+
+struct tramp_area *get_tramp_area(unsigned long vaddr)
+{
+	struct uprobes_state *state = &current->mm->uprobes_state;
+	struct tramp_area *area = NULL;
+
+	mutex_lock(&state->tramp_mutex);
+	hlist_for_each_entry(area, &state->tramp_head, node) {
+		if (arch_uprobe_is_callable(area->vaddr, vaddr)) {
+			refcount_inc(&area->ref);
+			goto unlock;
+		}
+	}
+
+	area = create_tramp_area(vaddr);
+	if (area)
+		hlist_add_head(&area->node, &state->tramp_head);
+
+unlock:
+	mutex_unlock(&state->tramp_mutex);
+	return area;
+}
+
+static void destroy_tramp_area(struct tramp_area *area)
+{
+	hlist_del(&area->node);
+	put_page(area->page);
+	kfree(area);
+}
+
+void put_tramp_area(struct tramp_area *area)
+{
+	struct mm_struct *mm = current->mm;
+	struct uprobes_state *state = &mm->uprobes_state;
+
+	if (area == NULL)
+		return;
+
+	mutex_lock(&state->tramp_mutex);
+	if (refcount_dec_and_test(&area->ref))
+		destroy_tramp_area(area);
+	mutex_unlock(&state->tramp_mutex);
+}
+
+static void clear_tramp_head(struct mm_struct *mm)
+{
+	struct uprobes_state *state = &mm->uprobes_state;
+	struct tramp_area *area;
+	struct hlist_node *n;
+
+	hlist_for_each_entry_safe(area, n, &state->tramp_head, node)
+		destroy_tramp_area(area);
+}
+
 /* uprobe should have guaranteed positive refcount */
 static struct uprobe *get_uprobe(struct uprobe *uprobe)
 {
@@ -1788,6 +1927,8 @@  void uprobe_clear_state(struct mm_struct *mm)
 	delayed_uprobe_remove(NULL, mm);
 	mutex_unlock(&delayed_uprobe_lock);
 
+	clear_tramp_head(mm);
+
 	if (!area)
 		return;
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 89ceb4a68af2..b1fe431e5cce 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1248,6 +1248,8 @@  static void mm_init_uprobes_state(struct mm_struct *mm)
 {
 #ifdef CONFIG_UPROBES
 	mm->uprobes_state.xol_area = NULL;
+	mutex_init(&mm->uprobes_state.tramp_mutex);
+	INIT_HLIST_HEAD(&mm->uprobes_state.tramp_head);
 #endif
 }