diff mbox series

[RFC,perf/core,05/11] uprobes: Add mapping for optimized uprobe trampolines

Message ID 20241105133405.2703607-6-jolsa@kernel.org (mailing list archive)
State New
Delegated to: Masami Hiramatsu
Headers show
Series uprobes: Add support to optimize usdt probes on x86_64 | expand

Commit Message

Jiri Olsa Nov. 5, 2024, 1:33 p.m. UTC
Adding interface to add special mapping for user space page that will be
used as place holder for uprobe trampoline in following changes.

The get_tramp_area(vaddr) function either finds 'callable' page or create
new one.  The 'callable' means it's reachable by call instruction (from
vaddr argument) and is decided by each arch via new arch_uprobe_is_callable
function.

The put_tramp_area function either drops refcount or destroys the special
mapping and all the maps are clean up when the process goes down.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
---
 include/linux/uprobes.h |  12 ++++
 kernel/events/uprobes.c | 141 ++++++++++++++++++++++++++++++++++++++++
 kernel/fork.c           |   2 +
 3 files changed, 155 insertions(+)

Comments

Peter Zijlstra Nov. 5, 2024, 2:23 p.m. UTC | #1
On Tue, Nov 05, 2024 at 02:33:59PM +0100, Jiri Olsa wrote:
> Adding interface to add special mapping for user space page that will be
> used as place holder for uprobe trampoline in following changes.
> 
> The get_tramp_area(vaddr) function either finds 'callable' page or create
> new one.  The 'callable' means it's reachable by call instruction (from
> vaddr argument) and is decided by each arch via new arch_uprobe_is_callable
> function.
> 
> The put_tramp_area function either drops refcount or destroys the special
> mapping and all the maps are clean up when the process goes down.

In another thread somewhere, Andrii mentioned that Meta has executables
with more than 4G of .text. This isn't going to work for them, is it?
Jiri Olsa Nov. 5, 2024, 4:33 p.m. UTC | #2
On Tue, Nov 05, 2024 at 03:23:27PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 05, 2024 at 02:33:59PM +0100, Jiri Olsa wrote:
> > Adding interface to add special mapping for user space page that will be
> > used as place holder for uprobe trampoline in following changes.
> > 
> > The get_tramp_area(vaddr) function either finds 'callable' page or create
> > new one.  The 'callable' means it's reachable by call instruction (from
> > vaddr argument) and is decided by each arch via new arch_uprobe_is_callable
> > function.
> > 
> > The put_tramp_area function either drops refcount or destroys the special
> > mapping and all the maps are clean up when the process goes down.
> 
> In another thread somewhere, Andrii mentioned that Meta has executables
> with more than 4G of .text. This isn't going to work for them, is it?
> 

not if you can't reach the trampoline from the probed address

jirka
Andrii Nakryiko Nov. 14, 2024, 11:44 p.m. UTC | #3
On Tue, Nov 5, 2024 at 8:33 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Tue, Nov 05, 2024 at 03:23:27PM +0100, Peter Zijlstra wrote:
> > On Tue, Nov 05, 2024 at 02:33:59PM +0100, Jiri Olsa wrote:
> > > Adding interface to add special mapping for user space page that will be
> > > used as place holder for uprobe trampoline in following changes.
> > >
> > > The get_tramp_area(vaddr) function either finds 'callable' page or create
> > > new one.  The 'callable' means it's reachable by call instruction (from
> > > vaddr argument) and is decided by each arch via new arch_uprobe_is_callable
> > > function.
> > >
> > > The put_tramp_area function either drops refcount or destroys the special
> > > mapping and all the maps are clean up when the process goes down.
> >
> > In another thread somewhere, Andrii mentioned that Meta has executables
> > with more than 4G of .text. This isn't going to work for them, is it?
> >
>
> not if you can't reach the trampoline from the probed address

That specific example was about 1.5GB (though we might have bigger
.text, I didn't do exhaustive research). As Jiri said, this would be
best effort trying to find closest free mapping to stay within +/-2GB
offset. If that fails, we always would be falling back to slower
int3-based uprobing, yep.

Jiri, we could also have an option to support 64-bit call, right? We'd
need nop9 for that, but it's an option as well to future-proofing this
approach, no?

Also, can we somehow use fs/gs-based indirect calls/jumps somehow to
have a guarantee that offset is always small (<2GB away relative to
the base stored in fs/gs). Not sure if this is feasible, but I thought
it would be good to bring this up just to make sure it doesn't work.

If segment based absolute call is somehow feasible, we can probably
simplify a bunch of stuff by allocating it eagerly, once, and
somewhere high up next to VDSO (or maybe even put it into VDSO, don't
now).

Anyways, let's brainstorm if there are any clever alternatives here.


>
> jirka
Andrii Nakryiko Nov. 14, 2024, 11:44 p.m. UTC | #4
On Tue, Nov 5, 2024 at 5:35 AM Jiri Olsa <jolsa@kernel.org> wrote:
>
> Adding interface to add special mapping for user space page that will be
> used as place holder for uprobe trampoline in following changes.
>
> The get_tramp_area(vaddr) function either finds 'callable' page or create
> new one.  The 'callable' means it's reachable by call instruction (from
> vaddr argument) and is decided by each arch via new arch_uprobe_is_callable
> function.
>
> The put_tramp_area function either drops refcount or destroys the special
> mapping and all the maps are clean up when the process goes down.
>
> Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> ---
>  include/linux/uprobes.h |  12 ++++
>  kernel/events/uprobes.c | 141 ++++++++++++++++++++++++++++++++++++++++
>  kernel/fork.c           |   2 +
>  3 files changed, 155 insertions(+)
>
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index be306028ed59..222d8e82cee2 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -172,6 +172,15 @@ struct xol_area;
>
>  struct uprobes_state {
>         struct xol_area         *xol_area;
> +       struct hlist_head       tramp_head;
> +       struct mutex            tramp_mutex;
> +};
> +
> +struct tramp_area {
> +       unsigned long           vaddr;
> +       struct page             *page;
> +       struct hlist_node       node;
> +       refcount_t              ref;

nit: any reason we are unnecessarily trying to save 4 bytes on
refcount (and we don't actually, due to padding)

>  };
>
>  extern void __init uprobes_init(void);
> @@ -219,6 +228,9 @@ extern int uprobe_verify_opcode(struct page *page, unsigned long vaddr, uprobe_o
>  extern int arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr,
>                                      uprobe_opcode_t *new_opcode, void *data);
>  extern bool arch_uprobe_is_register(uprobe_opcode_t *insn, int len, void *data);
> +struct tramp_area *get_tramp_area(unsigned long vaddr);

uprobe_get_tramp_area() to make it clear this is uprobe specific,
given this is exposed function?

and add that extern like we do for other functions?

> +void put_tramp_area(struct tramp_area *area);

uprobe_put_tramp_area() ?

> +bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr);
>  #else /* !CONFIG_UPROBES */
>  struct uprobes_state {
>  };
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 944d9df1f081..a44305c559a4 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -616,6 +616,145 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
>                         (uprobe_opcode_t *)&auprobe->insn, UPROBE_SWBP_INSN_SIZE, NULL);
>  }
>
> +bool __weak arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr)
> +{
> +       return false;
> +}
> +
> +static unsigned long find_nearest_page(unsigned long vaddr)
> +{
> +       struct mm_struct *mm = current->mm;
> +       struct vm_area_struct *vma, *prev;
> +       VMA_ITERATOR(vmi, mm, 0);
> +
> +       prev = vma_next(&vmi);
> +       vma = vma_next(&vmi);
> +       while (vma) {
> +               if (vma->vm_start - prev->vm_end  >= PAGE_SIZE &&
> +                   arch_uprobe_is_callable(prev->vm_end, vaddr))
> +                       return prev->vm_end;

shouldn't we try both `prev->vm_end` and `vma->vm_start - PAGE_SIZE`
as two possible places

> +
> +               prev = vma;
> +               vma = vma_next(&vmi);
> +       }
> +
> +       return 0;
> +}
> +
> +static vm_fault_t tramp_fault(const struct vm_special_mapping *sm,
> +                             struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> +       struct hlist_head *head = &vma->vm_mm->uprobes_state.tramp_head;
> +       struct tramp_area *area;
> +
> +       hlist_for_each_entry(area, head, node) {
> +               if (vma->vm_start == area->vaddr) {
> +                       vmf->page = area->page;
> +                       get_page(vmf->page);
> +                       return 0;
> +               }
> +       }
> +
> +       return -EINVAL;
> +}
> +
> +static int tramp_mremap(const struct vm_special_mapping *sm, struct vm_area_struct *new_vma)
> +{
> +       return -EPERM;
> +}
> +
> +static const struct vm_special_mapping tramp_mapping = {
> +       .name = "[uprobes-trampoline]",
> +       .fault = tramp_fault,
> +       .mremap = tramp_mremap,
> +};
> +
> +static struct tramp_area *create_tramp_area(unsigned long vaddr)
> +{
> +       struct mm_struct *mm = current->mm;
> +       struct vm_area_struct *vma;
> +       struct tramp_area *area;
> +
> +       vaddr = find_nearest_page(vaddr);
> +       if (!vaddr)
> +               return NULL;
> +
> +       area = kzalloc(sizeof(*area), GFP_KERNEL);
> +       if (unlikely(!area))
> +               return NULL;
> +
> +       area->page = alloc_page(GFP_HIGHUSER);
> +       if (!area->page)
> +               goto free_area;
> +
> +       refcount_set(&area->ref, 1);
> +       area->vaddr = vaddr;
> +
> +       vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE,
> +                               VM_READ|VM_EXEC|VM_MAYEXEC|VM_MAYREAD|VM_DONTCOPY|VM_IO,
> +                               &tramp_mapping);
> +       if (!IS_ERR(vma))
> +               return area;

please keep a pattern, it's less surprising that way

    if (IS_ERR(vma))
        goto free_page;

    return area;

free_page:

> +
> +       __free_page(area->page);
> + free_area:
> +       kfree(area);
> +       return NULL;
> +}
> +
> +struct tramp_area *get_tramp_area(unsigned long vaddr)
> +{
> +       struct uprobes_state *state = &current->mm->uprobes_state;
> +       struct tramp_area *area = NULL;
> +
> +       mutex_lock(&state->tramp_mutex);
> +       hlist_for_each_entry(area, &state->tramp_head, node) {
> +               if (arch_uprobe_is_callable(area->vaddr, vaddr)) {
> +                       refcount_inc(&area->ref);
> +                       goto unlock;
> +               }
> +       }
> +
> +       area = create_tramp_area(vaddr);
> +       if (area)
> +               hlist_add_head(&area->node, &state->tramp_head);
> +
> +unlock:
> +       mutex_unlock(&state->tramp_mutex);
> +       return area;
> +}
> +
> +static void destroy_tramp_area(struct tramp_area *area)
> +{
> +       hlist_del(&area->node);
> +       put_page(area->page);
> +       kfree(area);
> +}
> +
> +void put_tramp_area(struct tramp_area *area)
> +{
> +       struct mm_struct *mm = current->mm;
> +       struct uprobes_state *state = &mm->uprobes_state;
> +
> +       if (area == NULL)

nit: !area

> +               return;
> +
> +       mutex_lock(&state->tramp_mutex);
> +       if (refcount_dec_and_test(&area->ref))
> +               destroy_tramp_area(area);
> +       mutex_unlock(&state->tramp_mutex);
> +}
> +

[...]
Jiri Olsa Nov. 16, 2024, 9:44 p.m. UTC | #5
On Thu, Nov 14, 2024 at 03:44:12PM -0800, Andrii Nakryiko wrote:
> On Tue, Nov 5, 2024 at 8:33 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > On Tue, Nov 05, 2024 at 03:23:27PM +0100, Peter Zijlstra wrote:
> > > On Tue, Nov 05, 2024 at 02:33:59PM +0100, Jiri Olsa wrote:
> > > > Adding interface to add special mapping for user space page that will be
> > > > used as place holder for uprobe trampoline in following changes.
> > > >
> > > > The get_tramp_area(vaddr) function either finds 'callable' page or create
> > > > new one.  The 'callable' means it's reachable by call instruction (from
> > > > vaddr argument) and is decided by each arch via new arch_uprobe_is_callable
> > > > function.
> > > >
> > > > The put_tramp_area function either drops refcount or destroys the special
> > > > mapping and all the maps are clean up when the process goes down.
> > >
> > > In another thread somewhere, Andrii mentioned that Meta has executables
> > > with more than 4G of .text. This isn't going to work for them, is it?
> > >
> >
> > not if you can't reach the trampoline from the probed address
> 
> That specific example was about 1.5GB (though we might have bigger
> .text, I didn't do exhaustive research). As Jiri said, this would be
> best effort trying to find closest free mapping to stay within +/-2GB
> offset. If that fails, we always would be falling back to slower
> int3-based uprobing, yep.
> 
> Jiri, we could also have an option to support 64-bit call, right? We'd
> need nop9 for that, but it's an option as well to future-proofing this
> approach, no?

hm, I don't think there's call with relative 64bit offset

there's indirect call through register or address.. but I think we would
fit in nop10 with the indirect call through address

> 
> Also, can we somehow use fs/gs-based indirect calls/jumps somehow to
> have a guarantee that offset is always small (<2GB away relative to
> the base stored in fs/gs). Not sure if this is feasible, but I thought
> it would be good to bring this up just to make sure it doesn't work.
> 
> If segment based absolute call is somehow feasible, we can probably
> simplify a bunch of stuff by allocating it eagerly, once, and
> somewhere high up next to VDSO (or maybe even put it into VDSO, don't
> now).

yes, that would be convenient

jirka

> 
> Anyways, let's brainstorm if there are any clever alternatives here.
> 
> 
> >
> > jirka
Jiri Olsa Nov. 16, 2024, 9:44 p.m. UTC | #6
On Thu, Nov 14, 2024 at 03:44:14PM -0800, Andrii Nakryiko wrote:
> On Tue, Nov 5, 2024 at 5:35 AM Jiri Olsa <jolsa@kernel.org> wrote:
> >
> > Adding interface to add special mapping for user space page that will be
> > used as place holder for uprobe trampoline in following changes.
> >
> > The get_tramp_area(vaddr) function either finds 'callable' page or create
> > new one.  The 'callable' means it's reachable by call instruction (from
> > vaddr argument) and is decided by each arch via new arch_uprobe_is_callable
> > function.
> >
> > The put_tramp_area function either drops refcount or destroys the special
> > mapping and all the maps are clean up when the process goes down.
> >
> > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > ---
> >  include/linux/uprobes.h |  12 ++++
> >  kernel/events/uprobes.c | 141 ++++++++++++++++++++++++++++++++++++++++
> >  kernel/fork.c           |   2 +
> >  3 files changed, 155 insertions(+)
> >
> > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > index be306028ed59..222d8e82cee2 100644
> > --- a/include/linux/uprobes.h
> > +++ b/include/linux/uprobes.h
> > @@ -172,6 +172,15 @@ struct xol_area;
> >
> >  struct uprobes_state {
> >         struct xol_area         *xol_area;
> > +       struct hlist_head       tramp_head;
> > +       struct mutex            tramp_mutex;
> > +};
> > +
> > +struct tramp_area {
> > +       unsigned long           vaddr;
> > +       struct page             *page;
> > +       struct hlist_node       node;
> > +       refcount_t              ref;
> 
> nit: any reason we are unnecessarily trying to save 4 bytes on
> refcount (and we don't actually, due to padding)

hum, I'm not sure what you mean.. what's the alternative?

> 
> >  };
> >
> >  extern void __init uprobes_init(void);
> > @@ -219,6 +228,9 @@ extern int uprobe_verify_opcode(struct page *page, unsigned long vaddr, uprobe_o
> >  extern int arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr,
> >                                      uprobe_opcode_t *new_opcode, void *data);
> >  extern bool arch_uprobe_is_register(uprobe_opcode_t *insn, int len, void *data);
> > +struct tramp_area *get_tramp_area(unsigned long vaddr);
> 
> uprobe_get_tramp_area() to make it clear this is uprobe specific,
> given this is exposed function?
> 
> and add that extern like we do for other functions?

ok

> 
> > +void put_tramp_area(struct tramp_area *area);
> 
> uprobe_put_tramp_area() ?

ok

> 
> > +bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr);
> >  #else /* !CONFIG_UPROBES */
> >  struct uprobes_state {
> >  };
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 944d9df1f081..a44305c559a4 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -616,6 +616,145 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
> >                         (uprobe_opcode_t *)&auprobe->insn, UPROBE_SWBP_INSN_SIZE, NULL);
> >  }
> >
> > +bool __weak arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr)
> > +{
> > +       return false;
> > +}
> > +
> > +static unsigned long find_nearest_page(unsigned long vaddr)
> > +{
> > +       struct mm_struct *mm = current->mm;
> > +       struct vm_area_struct *vma, *prev;
> > +       VMA_ITERATOR(vmi, mm, 0);
> > +
> > +       prev = vma_next(&vmi);
> > +       vma = vma_next(&vmi);
> > +       while (vma) {
> > +               if (vma->vm_start - prev->vm_end  >= PAGE_SIZE &&
> > +                   arch_uprobe_is_callable(prev->vm_end, vaddr))
> > +                       return prev->vm_end;
> 
> shouldn't we try both `prev->vm_end` and `vma->vm_start - PAGE_SIZE`
> as two possible places

right, we should do that

SNIP

> > +static struct tramp_area *create_tramp_area(unsigned long vaddr)
> > +{
> > +       struct mm_struct *mm = current->mm;
> > +       struct vm_area_struct *vma;
> > +       struct tramp_area *area;
> > +
> > +       vaddr = find_nearest_page(vaddr);
> > +       if (!vaddr)
> > +               return NULL;
> > +
> > +       area = kzalloc(sizeof(*area), GFP_KERNEL);
> > +       if (unlikely(!area))
> > +               return NULL;
> > +
> > +       area->page = alloc_page(GFP_HIGHUSER);
> > +       if (!area->page)
> > +               goto free_area;
> > +
> > +       refcount_set(&area->ref, 1);
> > +       area->vaddr = vaddr;
> > +
> > +       vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE,
> > +                               VM_READ|VM_EXEC|VM_MAYEXEC|VM_MAYREAD|VM_DONTCOPY|VM_IO,
> > +                               &tramp_mapping);
> > +       if (!IS_ERR(vma))
> > +               return area;
> 
> please keep a pattern, it's less surprising that way
> 
>     if (IS_ERR(vma))
>         goto free_page;
> 
>     return area;
> 
> free_page:

ok


SNIP

> > +static void destroy_tramp_area(struct tramp_area *area)
> > +{
> > +       hlist_del(&area->node);
> > +       put_page(area->page);
> > +       kfree(area);
> > +}
> > +
> > +void put_tramp_area(struct tramp_area *area)
> > +{
> > +       struct mm_struct *mm = current->mm;
> > +       struct uprobes_state *state = &mm->uprobes_state;
> > +
> > +       if (area == NULL)
> 
> nit: !area

ok

thanks,
jirka
Andrii Nakryiko Nov. 19, 2024, 6:05 a.m. UTC | #7
On Sat, Nov 16, 2024 at 1:44 PM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Thu, Nov 14, 2024 at 03:44:14PM -0800, Andrii Nakryiko wrote:
> > On Tue, Nov 5, 2024 at 5:35 AM Jiri Olsa <jolsa@kernel.org> wrote:
> > >
> > > Adding interface to add special mapping for user space page that will be
> > > used as place holder for uprobe trampoline in following changes.
> > >
> > > The get_tramp_area(vaddr) function either finds 'callable' page or create
> > > new one.  The 'callable' means it's reachable by call instruction (from
> > > vaddr argument) and is decided by each arch via new arch_uprobe_is_callable
> > > function.
> > >
> > > The put_tramp_area function either drops refcount or destroys the special
> > > mapping and all the maps are clean up when the process goes down.
> > >
> > > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > > ---
> > >  include/linux/uprobes.h |  12 ++++
> > >  kernel/events/uprobes.c | 141 ++++++++++++++++++++++++++++++++++++++++
> > >  kernel/fork.c           |   2 +
> > >  3 files changed, 155 insertions(+)
> > >
> > > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > > index be306028ed59..222d8e82cee2 100644
> > > --- a/include/linux/uprobes.h
> > > +++ b/include/linux/uprobes.h
> > > @@ -172,6 +172,15 @@ struct xol_area;
> > >
> > >  struct uprobes_state {
> > >         struct xol_area         *xol_area;
> > > +       struct hlist_head       tramp_head;
> > > +       struct mutex            tramp_mutex;
> > > +};
> > > +
> > > +struct tramp_area {
> > > +       unsigned long           vaddr;
> > > +       struct page             *page;
> > > +       struct hlist_node       node;
> > > +       refcount_t              ref;
> >
> > nit: any reason we are unnecessarily trying to save 4 bytes on
> > refcount (and we don't actually, due to padding)
>
> hum, I'm not sure what you mean.. what's the alternative?

atomic64_t ?

>
> >
> > >  };
> > >
> > >  extern void __init uprobes_init(void);
> > > @@ -219,6 +228,9 @@ extern int uprobe_verify_opcode(struct page *page, unsigned long vaddr, uprobe_o
> > >  extern int arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr,
> > >                                      uprobe_opcode_t *new_opcode, void *data);
> > >  extern bool arch_uprobe_is_register(uprobe_opcode_t *insn, int len, void *data);
> > > +struct tramp_area *get_tramp_area(unsigned long vaddr);
> >
> > uprobe_get_tramp_area() to make it clear this is uprobe specific,
> > given this is exposed function?
> >
> > and add that extern like we do for other functions?
>
> ok
>
> >
> > > +void put_tramp_area(struct tramp_area *area);
> >
> > uprobe_put_tramp_area() ?
>
> ok
>
> >
> > > +bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr);
> > >  #else /* !CONFIG_UPROBES */
> > >  struct uprobes_state {
> > >  };
> > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > index 944d9df1f081..a44305c559a4 100644
> > > --- a/kernel/events/uprobes.c
> > > +++ b/kernel/events/uprobes.c
> > > @@ -616,6 +616,145 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
> > >                         (uprobe_opcode_t *)&auprobe->insn, UPROBE_SWBP_INSN_SIZE, NULL);
> > >  }
> > >
> > > +bool __weak arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr)
> > > +{
> > > +       return false;
> > > +}
> > > +
> > > +static unsigned long find_nearest_page(unsigned long vaddr)
> > > +{
> > > +       struct mm_struct *mm = current->mm;
> > > +       struct vm_area_struct *vma, *prev;
> > > +       VMA_ITERATOR(vmi, mm, 0);
> > > +
> > > +       prev = vma_next(&vmi);
> > > +       vma = vma_next(&vmi);
> > > +       while (vma) {
> > > +               if (vma->vm_start - prev->vm_end  >= PAGE_SIZE &&
> > > +                   arch_uprobe_is_callable(prev->vm_end, vaddr))
> > > +                       return prev->vm_end;
> >
> > shouldn't we try both `prev->vm_end` and `vma->vm_start - PAGE_SIZE`
> > as two possible places
>
> right, we should do that
>
> SNIP
>
> > > +static struct tramp_area *create_tramp_area(unsigned long vaddr)
> > > +{
> > > +       struct mm_struct *mm = current->mm;
> > > +       struct vm_area_struct *vma;
> > > +       struct tramp_area *area;
> > > +
> > > +       vaddr = find_nearest_page(vaddr);
> > > +       if (!vaddr)
> > > +               return NULL;
> > > +
> > > +       area = kzalloc(sizeof(*area), GFP_KERNEL);
> > > +       if (unlikely(!area))
> > > +               return NULL;
> > > +
> > > +       area->page = alloc_page(GFP_HIGHUSER);
> > > +       if (!area->page)
> > > +               goto free_area;
> > > +
> > > +       refcount_set(&area->ref, 1);
> > > +       area->vaddr = vaddr;
> > > +
> > > +       vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE,
> > > +                               VM_READ|VM_EXEC|VM_MAYEXEC|VM_MAYREAD|VM_DONTCOPY|VM_IO,
> > > +                               &tramp_mapping);
> > > +       if (!IS_ERR(vma))
> > > +               return area;
> >
> > please keep a pattern, it's less surprising that way
> >
> >     if (IS_ERR(vma))
> >         goto free_page;
> >
> >     return area;
> >
> > free_page:
>
> ok
>
>
> SNIP
>
> > > +static void destroy_tramp_area(struct tramp_area *area)
> > > +{
> > > +       hlist_del(&area->node);
> > > +       put_page(area->page);
> > > +       kfree(area);
> > > +}
> > > +
> > > +void put_tramp_area(struct tramp_area *area)
> > > +{
> > > +       struct mm_struct *mm = current->mm;
> > > +       struct uprobes_state *state = &mm->uprobes_state;
> > > +
> > > +       if (area == NULL)
> >
> > nit: !area
>
> ok
>
> thanks,
> jirka
Andrii Nakryiko Nov. 19, 2024, 6:06 a.m. UTC | #8
On Sat, Nov 16, 2024 at 1:44 PM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Thu, Nov 14, 2024 at 03:44:12PM -0800, Andrii Nakryiko wrote:
> > On Tue, Nov 5, 2024 at 8:33 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> > >
> > > On Tue, Nov 05, 2024 at 03:23:27PM +0100, Peter Zijlstra wrote:
> > > > On Tue, Nov 05, 2024 at 02:33:59PM +0100, Jiri Olsa wrote:
> > > > > Adding interface to add special mapping for user space page that will be
> > > > > used as place holder for uprobe trampoline in following changes.
> > > > >
> > > > > The get_tramp_area(vaddr) function either finds 'callable' page or create
> > > > > new one.  The 'callable' means it's reachable by call instruction (from
> > > > > vaddr argument) and is decided by each arch via new arch_uprobe_is_callable
> > > > > function.
> > > > >
> > > > > The put_tramp_area function either drops refcount or destroys the special
> > > > > mapping and all the maps are clean up when the process goes down.
> > > >
> > > > In another thread somewhere, Andrii mentioned that Meta has executables
> > > > with more than 4G of .text. This isn't going to work for them, is it?
> > > >
> > >
> > > not if you can't reach the trampoline from the probed address
> >
> > That specific example was about 1.5GB (though we might have bigger
> > .text, I didn't do exhaustive research). As Jiri said, this would be
> > best effort trying to find closest free mapping to stay within +/-2GB
> > offset. If that fails, we always would be falling back to slower
> > int3-based uprobing, yep.
> >
> > Jiri, we could also have an option to support 64-bit call, right? We'd
> > need nop9 for that, but it's an option as well to future-proofing this
> > approach, no?
>
> hm, I don't think there's call with relative 64bit offset

why do you need a relative, when you have 64 bits? ;) there is a call
to absolute address, no?

>
> there's indirect call through register or address.. but I think we would
> fit in nop10 with the indirect call through address
>
> >
> > Also, can we somehow use fs/gs-based indirect calls/jumps somehow to
> > have a guarantee that offset is always small (<2GB away relative to
> > the base stored in fs/gs). Not sure if this is feasible, but I thought
> > it would be good to bring this up just to make sure it doesn't work.
> >
> > If segment based absolute call is somehow feasible, we can probably
> > simplify a bunch of stuff by allocating it eagerly, once, and
> > somewhere high up next to VDSO (or maybe even put it into VDSO, don't
> > now).
>
> yes, that would be convenient
>
> jirka
>
> >
> > Anyways, let's brainstorm if there are any clever alternatives here.
> >
> >
> > >
> > > jirka
Peter Zijlstra Nov. 19, 2024, 9:13 a.m. UTC | #9
On Mon, Nov 18, 2024 at 10:06:51PM -0800, Andrii Nakryiko wrote:

> > > Jiri, we could also have an option to support 64-bit call, right? We'd
> > > need nop9 for that, but it's an option as well to future-proofing this
> > > approach, no?
> >
> > hm, I don't think there's call with relative 64bit offset
> 
> why do you need a relative, when you have 64 bits? ;) there is a call
> to absolute address, no?

No, there is not :/ You get to use an indirect call, which means
multiple instructions and all the speculation joy.

IFF USDT thingies have AX clobbered (I couldn't find in a hurry) then
patching the multi instruction thing is relatively straight forward, if
they don't, its going to be a pain.
Jiri Olsa Nov. 19, 2024, 3:14 p.m. UTC | #10
On Mon, Nov 18, 2024 at 10:05:41PM -0800, Andrii Nakryiko wrote:
> On Sat, Nov 16, 2024 at 1:44 PM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > On Thu, Nov 14, 2024 at 03:44:14PM -0800, Andrii Nakryiko wrote:
> > > On Tue, Nov 5, 2024 at 5:35 AM Jiri Olsa <jolsa@kernel.org> wrote:
> > > >
> > > > Adding interface to add special mapping for user space page that will be
> > > > used as place holder for uprobe trampoline in following changes.
> > > >
> > > > The get_tramp_area(vaddr) function either finds 'callable' page or create
> > > > new one.  The 'callable' means it's reachable by call instruction (from
> > > > vaddr argument) and is decided by each arch via new arch_uprobe_is_callable
> > > > function.
> > > >
> > > > The put_tramp_area function either drops refcount or destroys the special
> > > > mapping and all the maps are clean up when the process goes down.
> > > >
> > > > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > > > ---
> > > >  include/linux/uprobes.h |  12 ++++
> > > >  kernel/events/uprobes.c | 141 ++++++++++++++++++++++++++++++++++++++++
> > > >  kernel/fork.c           |   2 +
> > > >  3 files changed, 155 insertions(+)
> > > >
> > > > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > > > index be306028ed59..222d8e82cee2 100644
> > > > --- a/include/linux/uprobes.h
> > > > +++ b/include/linux/uprobes.h
> > > > @@ -172,6 +172,15 @@ struct xol_area;
> > > >
> > > >  struct uprobes_state {
> > > >         struct xol_area         *xol_area;
> > > > +       struct hlist_head       tramp_head;
> > > > +       struct mutex            tramp_mutex;
> > > > +};
> > > > +
> > > > +struct tramp_area {
> > > > +       unsigned long           vaddr;
> > > > +       struct page             *page;
> > > > +       struct hlist_node       node;
> > > > +       refcount_t              ref;
> > >
> > > nit: any reason we are unnecessarily trying to save 4 bytes on
> > > refcount (and we don't actually, due to padding)
> >
> > hum, I'm not sure what you mean.. what's the alternative?
> 
> atomic64_t ?

hum, just because we have extra 4 bytes padding? we use refcount_t
on other places so seems like better fit to me

jirka
Jiri Olsa Nov. 19, 2024, 3:15 p.m. UTC | #11
On Tue, Nov 19, 2024 at 10:13:48AM +0100, Peter Zijlstra wrote:
> On Mon, Nov 18, 2024 at 10:06:51PM -0800, Andrii Nakryiko wrote:
> 
> > > > Jiri, we could also have an option to support 64-bit call, right? We'd
> > > > need nop9 for that, but it's an option as well to future-proofing this
> > > > approach, no?
> > >
> > > hm, I don't think there's call with relative 64bit offset
> > 
> > why do you need a relative, when you have 64 bits? ;) there is a call
> > to absolute address, no?
> 
> No, there is not :/ You get to use an indirect call, which means
> multiple instructions and all the speculation joy.
> 
> IFF USDT thingies have AX clobbered (I couldn't find in a hurry) then
> patching the multi instruction thing is relatively straight forward, if
> they don't, its going to be a pain.

I don't follow, what's the reason for that?

jirka
Andrii Nakryiko Nov. 21, 2024, 12:07 a.m. UTC | #12
On Tue, Nov 19, 2024 at 1:13 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Nov 18, 2024 at 10:06:51PM -0800, Andrii Nakryiko wrote:
>
> > > > Jiri, we could also have an option to support 64-bit call, right? We'd
> > > > need nop9 for that, but it's an option as well to future-proofing this
> > > > approach, no?
> > >
> > > hm, I don't think there's call with relative 64bit offset
> >
> > why do you need a relative, when you have 64 bits? ;) there is a call
> > to absolute address, no?
>
> No, there is not :/ You get to use an indirect call, which means
> multiple instructions and all the speculation joy.

Ah, I misinterpreted `CALL m16:64` (from [0]) as an absolute address
call with 64-bit displacement encoded in the struct, my bad.

  [0] https://www.felixcloutier.com/x86/call

>
> IFF USDT thingies have AX clobbered (I couldn't find in a hurry) then
> patching the multi instruction thing is relatively straight forward, if
> they don't, its going to be a pain.

USDTs are meant to be "transparent" to the surrounding code and they
don't mark any clobbered registers. Technically it could be added, but
I'm not a fan of this.

I'd stick to the 32-bit relative call for now, and fallback to int3 if
we can't reach the uprobe trampoline. So huge .text might suffer
suboptimal performance, but at least USDTs won't pessimize the
surrounding code (and kernel won't make any assumptions about
registers that are ok to use).
Andrii Nakryiko Nov. 21, 2024, 12:10 a.m. UTC | #13
On Tue, Nov 19, 2024 at 7:14 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Mon, Nov 18, 2024 at 10:05:41PM -0800, Andrii Nakryiko wrote:
> > On Sat, Nov 16, 2024 at 1:44 PM Jiri Olsa <olsajiri@gmail.com> wrote:
> > >
> > > On Thu, Nov 14, 2024 at 03:44:14PM -0800, Andrii Nakryiko wrote:
> > > > On Tue, Nov 5, 2024 at 5:35 AM Jiri Olsa <jolsa@kernel.org> wrote:
> > > > >
> > > > > Adding interface to add special mapping for user space page that will be
> > > > > used as place holder for uprobe trampoline in following changes.
> > > > >
> > > > > The get_tramp_area(vaddr) function either finds 'callable' page or create
> > > > > new one.  The 'callable' means it's reachable by call instruction (from
> > > > > vaddr argument) and is decided by each arch via new arch_uprobe_is_callable
> > > > > function.
> > > > >
> > > > > The put_tramp_area function either drops refcount or destroys the special
> > > > > mapping and all the maps are clean up when the process goes down.
> > > > >
> > > > > Signed-off-by: Jiri Olsa <jolsa@kernel.org>
> > > > > ---
> > > > >  include/linux/uprobes.h |  12 ++++
> > > > >  kernel/events/uprobes.c | 141 ++++++++++++++++++++++++++++++++++++++++
> > > > >  kernel/fork.c           |   2 +
> > > > >  3 files changed, 155 insertions(+)
> > > > >
> > > > > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> > > > > index be306028ed59..222d8e82cee2 100644
> > > > > --- a/include/linux/uprobes.h
> > > > > +++ b/include/linux/uprobes.h
> > > > > @@ -172,6 +172,15 @@ struct xol_area;
> > > > >
> > > > >  struct uprobes_state {
> > > > >         struct xol_area         *xol_area;
> > > > > +       struct hlist_head       tramp_head;
> > > > > +       struct mutex            tramp_mutex;
> > > > > +};
> > > > > +
> > > > > +struct tramp_area {
> > > > > +       unsigned long           vaddr;
> > > > > +       struct page             *page;
> > > > > +       struct hlist_node       node;
> > > > > +       refcount_t              ref;
> > > >
> > > > nit: any reason we are unnecessarily trying to save 4 bytes on
> > > > refcount (and we don't actually, due to padding)
> > >
> > > hum, I'm not sure what you mean.. what's the alternative?
> >
> > atomic64_t ?
>
> hum, just because we have extra 4 bytes padding? we use refcount_t
> on other places so seems like better fit to me

My (minor) concern was that tramp_area is a very long-living object
that can be shared across huge amounts of uprobes, and 2 billion
uprobes is a lot, but still a number that one can, practically
speaking, reach (with enough effort). And so skimping on refcount to
save 4 bytes for something that we have one or two per *some*
processes seemed (and still seems) like wrong savings to pursue.

>
> jirka
Peter Zijlstra Nov. 21, 2024, 11:53 a.m. UTC | #14
On Wed, Nov 20, 2024 at 04:07:38PM -0800, Andrii Nakryiko wrote:

> USDTs are meant to be "transparent" to the surrounding code and they
> don't mark any clobbered registers. Technically it could be added, but
> I'm not a fan of this.

Sure. Anyway, another thing to consider is FRED, will all of this still
matter once that lands? If FRED gets us INT3 performance close to what
SYSCALL has, then all this work will go unused.
Alexei Starovoitov Nov. 21, 2024, 4:02 p.m. UTC | #15
On Thu, Nov 21, 2024 at 4:17 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Nov 20, 2024 at 04:07:38PM -0800, Andrii Nakryiko wrote:
>
> > USDTs are meant to be "transparent" to the surrounding code and they
> > don't mark any clobbered registers. Technically it could be added, but
> > I'm not a fan of this.
>
> Sure. Anyway, another thing to consider is FRED, will all of this still
> matter once that lands? If FRED gets us INT3 performance close to what
> SYSCALL has, then all this work will go unused.

afaik not a single cpu in the datacenter supports FRED while
uprobe overhead is real.
imo it's worth improving performance today for existing cpus.
I suspect arm64 might benefit too. Even if arm hw does the same
amount of work for trap vs syscall the sw overhead of handling
trap is different.
I suspect that equation will apply to future FRED cpus too.
Peter Zijlstra Nov. 21, 2024, 4:34 p.m. UTC | #16
On Thu, Nov 21, 2024 at 08:02:12AM -0800, Alexei Starovoitov wrote:
> On Thu, Nov 21, 2024 at 4:17 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Wed, Nov 20, 2024 at 04:07:38PM -0800, Andrii Nakryiko wrote:
> >
> > > USDTs are meant to be "transparent" to the surrounding code and they
> > > don't mark any clobbered registers. Technically it could be added, but
> > > I'm not a fan of this.
> >
> > Sure. Anyway, another thing to consider is FRED, will all of this still
> > matter once that lands? If FRED gets us INT3 performance close to what
> > SYSCALL has, then all this work will go unused.
> 
> afaik not a single cpu in the datacenter supports FRED while
> uprobe overhead is real.
> imo it's worth improving performance today for existing cpus.

I understand, but OTOH adding a syscall now, that we'll have to maintain
for years and years, even through we know it'll not be used much is a
bit annoying.

> I suspect arm64 might benefit too. Even if arm hw does the same
> amount of work for trap vs syscall the sw overhead of handling
> trap is different.

Well, the RISC CPUs have a much harder time using this, their immediate
range is typically puny and they end up needing multiple instructions
and some register in order to set up a call.

Elsewhere in the thread Mark Rutland already noted that arm64 really
doesn't need or want this.
Alexei Starovoitov Nov. 21, 2024, 4:47 p.m. UTC | #17
On Thu, Nov 21, 2024 at 8:34 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Nov 21, 2024 at 08:02:12AM -0800, Alexei Starovoitov wrote:
> > On Thu, Nov 21, 2024 at 4:17 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Wed, Nov 20, 2024 at 04:07:38PM -0800, Andrii Nakryiko wrote:
> > >
> > > > USDTs are meant to be "transparent" to the surrounding code and they
> > > > don't mark any clobbered registers. Technically it could be added, but
> > > > I'm not a fan of this.
> > >
> > > Sure. Anyway, another thing to consider is FRED, will all of this still
> > > matter once that lands? If FRED gets us INT3 performance close to what
> > > SYSCALL has, then all this work will go unused.
> >
> > afaik not a single cpu in the datacenter supports FRED while
> > uprobe overhead is real.
> > imo it's worth improving performance today for existing cpus.
>
> I understand, but OTOH adding a syscall now, that we'll have to maintain
> for years and years, even through we know it'll not be used much is a
> bit annoying.

No. It _will_ be used for years.

>
> > I suspect arm64 might benefit too. Even if arm hw does the same
> > amount of work for trap vs syscall the sw overhead of handling
> > trap is different.
>
> Well, the RISC CPUs have a much harder time using this, their immediate
> range is typically puny and they end up needing multiple instructions
> and some register in order to set up a call.

We don't care about 32-bit archs and other exotics.
They're not the reasons to leave performance on the table
on dominant archs.

> Elsewhere in the thread Mark Rutland already noted that arm64 really
> doesn't need or want this.

Doesn't look like you've read what you quoted above.
On arm64 the _HW_ cost may be the same.
The _SW_ difference in handling trap vs syscall is real.
I bet once uprobe syscall is benchmarked on arm64 there will
be a delta.
Mark Rutland Nov. 21, 2024, 7:38 p.m. UTC | #18
[resending as I somehow messed up the 'From' header and got a tonne of
bounces]

On Thu, Nov 21, 2024 at 08:47:56AM -0800, Alexei Starovoitov wrote:
> On Thu, Nov 21, 2024 at 8:34 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > Elsewhere in the thread Mark Rutland already noted that arm64 really
> > doesn't need or want this.
> 
> Doesn't look like you've read what you quoted above.
> On arm64 the _HW_ cost may be the same.
> The _SW_ difference in handling trap vs syscall is real.
> I bet once uprobe syscall is benchmarked on arm64 there will
> be a delta.

I already pointed out in [1] that on arm64 we can make the trap case
*faster* than the syscall. If that's not already the case, there's only
a small amount of rework needed, (pulling BRK handling into
entry-common.c), which we want to do for other reasons anyway.

On arm64 I do not want the syscall; the trap is faster and simpler to
maintain.

Mark

[1] https://lore.kernel.org/lkml/ZzsRfhGSYXVK0mst@J2N7QTR9R3/
diff mbox series

Patch

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index be306028ed59..222d8e82cee2 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -172,6 +172,15 @@  struct xol_area;
 
 struct uprobes_state {
 	struct xol_area		*xol_area;
+	struct hlist_head	tramp_head;
+	struct mutex		tramp_mutex;
+};
+
+struct tramp_area {
+	unsigned long		vaddr;
+	struct page		*page;
+	struct hlist_node	node;
+	refcount_t		ref;
 };
 
 extern void __init uprobes_init(void);
@@ -219,6 +228,9 @@  extern int uprobe_verify_opcode(struct page *page, unsigned long vaddr, uprobe_o
 extern int arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr,
 				     uprobe_opcode_t *new_opcode, void *data);
 extern bool arch_uprobe_is_register(uprobe_opcode_t *insn, int len, void *data);
+struct tramp_area *get_tramp_area(unsigned long vaddr);
+void put_tramp_area(struct tramp_area *area);
+bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr);
 #else /* !CONFIG_UPROBES */
 struct uprobes_state {
 };
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 944d9df1f081..a44305c559a4 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -616,6 +616,145 @@  set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
 			(uprobe_opcode_t *)&auprobe->insn, UPROBE_SWBP_INSN_SIZE, NULL);
 }
 
+bool __weak arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr)
+{
+	return false;
+}
+
+static unsigned long find_nearest_page(unsigned long vaddr)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma, *prev;
+	VMA_ITERATOR(vmi, mm, 0);
+
+	prev = vma_next(&vmi);
+	vma = vma_next(&vmi);
+	while (vma) {
+		if (vma->vm_start - prev->vm_end  >= PAGE_SIZE &&
+		    arch_uprobe_is_callable(prev->vm_end, vaddr))
+			return prev->vm_end;
+
+		prev = vma;
+		vma = vma_next(&vmi);
+	}
+
+	return 0;
+}
+
+static vm_fault_t tramp_fault(const struct vm_special_mapping *sm,
+			      struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct hlist_head *head = &vma->vm_mm->uprobes_state.tramp_head;
+	struct tramp_area *area;
+
+	hlist_for_each_entry(area, head, node) {
+		if (vma->vm_start == area->vaddr) {
+			vmf->page = area->page;
+			get_page(vmf->page);
+			return 0;
+		}
+	}
+
+	return -EINVAL;
+}
+
+static int tramp_mremap(const struct vm_special_mapping *sm, struct vm_area_struct *new_vma)
+{
+	return -EPERM;
+}
+
+static const struct vm_special_mapping tramp_mapping = {
+	.name = "[uprobes-trampoline]",
+	.fault = tramp_fault,
+	.mremap = tramp_mremap,
+};
+
+static struct tramp_area *create_tramp_area(unsigned long vaddr)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma;
+	struct tramp_area *area;
+
+	vaddr = find_nearest_page(vaddr);
+	if (!vaddr)
+		return NULL;
+
+	area = kzalloc(sizeof(*area), GFP_KERNEL);
+	if (unlikely(!area))
+		return NULL;
+
+	area->page = alloc_page(GFP_HIGHUSER);
+	if (!area->page)
+		goto free_area;
+
+	refcount_set(&area->ref, 1);
+	area->vaddr = vaddr;
+
+	vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE,
+				VM_READ|VM_EXEC|VM_MAYEXEC|VM_MAYREAD|VM_DONTCOPY|VM_IO,
+				&tramp_mapping);
+	if (!IS_ERR(vma))
+		return area;
+
+	__free_page(area->page);
+ free_area:
+	kfree(area);
+	return NULL;
+}
+
+struct tramp_area *get_tramp_area(unsigned long vaddr)
+{
+	struct uprobes_state *state = &current->mm->uprobes_state;
+	struct tramp_area *area = NULL;
+
+	mutex_lock(&state->tramp_mutex);
+	hlist_for_each_entry(area, &state->tramp_head, node) {
+		if (arch_uprobe_is_callable(area->vaddr, vaddr)) {
+			refcount_inc(&area->ref);
+			goto unlock;
+		}
+	}
+
+	area = create_tramp_area(vaddr);
+	if (area)
+		hlist_add_head(&area->node, &state->tramp_head);
+
+unlock:
+	mutex_unlock(&state->tramp_mutex);
+	return area;
+}
+
+static void destroy_tramp_area(struct tramp_area *area)
+{
+	hlist_del(&area->node);
+	put_page(area->page);
+	kfree(area);
+}
+
+void put_tramp_area(struct tramp_area *area)
+{
+	struct mm_struct *mm = current->mm;
+	struct uprobes_state *state = &mm->uprobes_state;
+
+	if (area == NULL)
+		return;
+
+	mutex_lock(&state->tramp_mutex);
+	if (refcount_dec_and_test(&area->ref))
+		destroy_tramp_area(area);
+	mutex_unlock(&state->tramp_mutex);
+}
+
+static void clear_tramp_head(struct mm_struct *mm)
+{
+	struct uprobes_state *state = &mm->uprobes_state;
+	struct tramp_area *area;
+	struct hlist_node *n;
+
+	hlist_for_each_entry_safe(area, n, &state->tramp_head, node)
+		destroy_tramp_area(area);
+}
+
 /* uprobe should have guaranteed positive refcount */
 static struct uprobe *get_uprobe(struct uprobe *uprobe)
 {
@@ -1788,6 +1927,8 @@  void uprobe_clear_state(struct mm_struct *mm)
 	delayed_uprobe_remove(NULL, mm);
 	mutex_unlock(&delayed_uprobe_lock);
 
+	clear_tramp_head(mm);
+
 	if (!area)
 		return;
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 89ceb4a68af2..b1fe431e5cce 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1248,6 +1248,8 @@  static void mm_init_uprobes_state(struct mm_struct *mm)
 {
 #ifdef CONFIG_UPROBES
 	mm->uprobes_state.xol_area = NULL;
+	mutex_init(&mm->uprobes_state.tramp_mutex);
+	INIT_HLIST_HEAD(&mm->uprobes_state.tramp_head);
 #endif
 }