Message ID | 20241105133405.2703607-6-jolsa@kernel.org (mailing list archive) |
---|---|
State | New |
Delegated to: | Masami Hiramatsu |
Headers | show |
Series | uprobes: Add support to optimize usdt probes on x86_64 | expand |
On Tue, Nov 05, 2024 at 02:33:59PM +0100, Jiri Olsa wrote: > Adding interface to add special mapping for user space page that will be > used as place holder for uprobe trampoline in following changes. > > The get_tramp_area(vaddr) function either finds 'callable' page or create > new one. The 'callable' means it's reachable by call instruction (from > vaddr argument) and is decided by each arch via new arch_uprobe_is_callable > function. > > The put_tramp_area function either drops refcount or destroys the special > mapping and all the maps are clean up when the process goes down. In another thread somewhere, Andrii mentioned that Meta has executables with more than 4G of .text. This isn't going to work for them, is it?
On Tue, Nov 05, 2024 at 03:23:27PM +0100, Peter Zijlstra wrote: > On Tue, Nov 05, 2024 at 02:33:59PM +0100, Jiri Olsa wrote: > > Adding interface to add special mapping for user space page that will be > > used as place holder for uprobe trampoline in following changes. > > > > The get_tramp_area(vaddr) function either finds 'callable' page or create > > new one. The 'callable' means it's reachable by call instruction (from > > vaddr argument) and is decided by each arch via new arch_uprobe_is_callable > > function. > > > > The put_tramp_area function either drops refcount or destroys the special > > mapping and all the maps are clean up when the process goes down. > > In another thread somewhere, Andrii mentioned that Meta has executables > with more than 4G of .text. This isn't going to work for them, is it? > not if you can't reach the trampoline from the probed address jirka
On Tue, Nov 5, 2024 at 8:33 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > On Tue, Nov 05, 2024 at 03:23:27PM +0100, Peter Zijlstra wrote: > > On Tue, Nov 05, 2024 at 02:33:59PM +0100, Jiri Olsa wrote: > > > Adding interface to add special mapping for user space page that will be > > > used as place holder for uprobe trampoline in following changes. > > > > > > The get_tramp_area(vaddr) function either finds 'callable' page or create > > > new one. The 'callable' means it's reachable by call instruction (from > > > vaddr argument) and is decided by each arch via new arch_uprobe_is_callable > > > function. > > > > > > The put_tramp_area function either drops refcount or destroys the special > > > mapping and all the maps are clean up when the process goes down. > > > > In another thread somewhere, Andrii mentioned that Meta has executables > > with more than 4G of .text. This isn't going to work for them, is it? > > > > not if you can't reach the trampoline from the probed address That specific example was about 1.5GB (though we might have bigger .text, I didn't do exhaustive research). As Jiri said, this would be best effort trying to find closest free mapping to stay within +/-2GB offset. If that fails, we always would be falling back to slower int3-based uprobing, yep. Jiri, we could also have an option to support 64-bit call, right? We'd need nop9 for that, but it's an option as well to future-proofing this approach, no? Also, can we somehow use fs/gs-based indirect calls/jumps somehow to have a guarantee that offset is always small (<2GB away relative to the base stored in fs/gs). Not sure if this is feasible, but I thought it would be good to bring this up just to make sure it doesn't work. If segment based absolute call is somehow feasible, we can probably simplify a bunch of stuff by allocating it eagerly, once, and somewhere high up next to VDSO (or maybe even put it into VDSO, don't now). Anyways, let's brainstorm if there are any clever alternatives here. > > jirka
On Tue, Nov 5, 2024 at 5:35 AM Jiri Olsa <jolsa@kernel.org> wrote: > > Adding interface to add special mapping for user space page that will be > used as place holder for uprobe trampoline in following changes. > > The get_tramp_area(vaddr) function either finds 'callable' page or create > new one. The 'callable' means it's reachable by call instruction (from > vaddr argument) and is decided by each arch via new arch_uprobe_is_callable > function. > > The put_tramp_area function either drops refcount or destroys the special > mapping and all the maps are clean up when the process goes down. > > Signed-off-by: Jiri Olsa <jolsa@kernel.org> > --- > include/linux/uprobes.h | 12 ++++ > kernel/events/uprobes.c | 141 ++++++++++++++++++++++++++++++++++++++++ > kernel/fork.c | 2 + > 3 files changed, 155 insertions(+) > > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h > index be306028ed59..222d8e82cee2 100644 > --- a/include/linux/uprobes.h > +++ b/include/linux/uprobes.h > @@ -172,6 +172,15 @@ struct xol_area; > > struct uprobes_state { > struct xol_area *xol_area; > + struct hlist_head tramp_head; > + struct mutex tramp_mutex; > +}; > + > +struct tramp_area { > + unsigned long vaddr; > + struct page *page; > + struct hlist_node node; > + refcount_t ref; nit: any reason we are unnecessarily trying to save 4 bytes on refcount (and we don't actually, due to padding) > }; > > extern void __init uprobes_init(void); > @@ -219,6 +228,9 @@ extern int uprobe_verify_opcode(struct page *page, unsigned long vaddr, uprobe_o > extern int arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr, > uprobe_opcode_t *new_opcode, void *data); > extern bool arch_uprobe_is_register(uprobe_opcode_t *insn, int len, void *data); > +struct tramp_area *get_tramp_area(unsigned long vaddr); uprobe_get_tramp_area() to make it clear this is uprobe specific, given this is exposed function? and add that extern like we do for other functions? > +void put_tramp_area(struct tramp_area *area); uprobe_put_tramp_area() ? > +bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr); > #else /* !CONFIG_UPROBES */ > struct uprobes_state { > }; > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > index 944d9df1f081..a44305c559a4 100644 > --- a/kernel/events/uprobes.c > +++ b/kernel/events/uprobes.c > @@ -616,6 +616,145 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v > (uprobe_opcode_t *)&auprobe->insn, UPROBE_SWBP_INSN_SIZE, NULL); > } > > +bool __weak arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr) > +{ > + return false; > +} > + > +static unsigned long find_nearest_page(unsigned long vaddr) > +{ > + struct mm_struct *mm = current->mm; > + struct vm_area_struct *vma, *prev; > + VMA_ITERATOR(vmi, mm, 0); > + > + prev = vma_next(&vmi); > + vma = vma_next(&vmi); > + while (vma) { > + if (vma->vm_start - prev->vm_end >= PAGE_SIZE && > + arch_uprobe_is_callable(prev->vm_end, vaddr)) > + return prev->vm_end; shouldn't we try both `prev->vm_end` and `vma->vm_start - PAGE_SIZE` as two possible places > + > + prev = vma; > + vma = vma_next(&vmi); > + } > + > + return 0; > +} > + > +static vm_fault_t tramp_fault(const struct vm_special_mapping *sm, > + struct vm_area_struct *vma, struct vm_fault *vmf) > +{ > + struct hlist_head *head = &vma->vm_mm->uprobes_state.tramp_head; > + struct tramp_area *area; > + > + hlist_for_each_entry(area, head, node) { > + if (vma->vm_start == area->vaddr) { > + vmf->page = area->page; > + get_page(vmf->page); > + return 0; > + } > + } > + > + return -EINVAL; > +} > + > +static int tramp_mremap(const struct vm_special_mapping *sm, struct vm_area_struct *new_vma) > +{ > + return -EPERM; > +} > + > +static const struct vm_special_mapping tramp_mapping = { > + .name = "[uprobes-trampoline]", > + .fault = tramp_fault, > + .mremap = tramp_mremap, > +}; > + > +static struct tramp_area *create_tramp_area(unsigned long vaddr) > +{ > + struct mm_struct *mm = current->mm; > + struct vm_area_struct *vma; > + struct tramp_area *area; > + > + vaddr = find_nearest_page(vaddr); > + if (!vaddr) > + return NULL; > + > + area = kzalloc(sizeof(*area), GFP_KERNEL); > + if (unlikely(!area)) > + return NULL; > + > + area->page = alloc_page(GFP_HIGHUSER); > + if (!area->page) > + goto free_area; > + > + refcount_set(&area->ref, 1); > + area->vaddr = vaddr; > + > + vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE, > + VM_READ|VM_EXEC|VM_MAYEXEC|VM_MAYREAD|VM_DONTCOPY|VM_IO, > + &tramp_mapping); > + if (!IS_ERR(vma)) > + return area; please keep a pattern, it's less surprising that way if (IS_ERR(vma)) goto free_page; return area; free_page: > + > + __free_page(area->page); > + free_area: > + kfree(area); > + return NULL; > +} > + > +struct tramp_area *get_tramp_area(unsigned long vaddr) > +{ > + struct uprobes_state *state = ¤t->mm->uprobes_state; > + struct tramp_area *area = NULL; > + > + mutex_lock(&state->tramp_mutex); > + hlist_for_each_entry(area, &state->tramp_head, node) { > + if (arch_uprobe_is_callable(area->vaddr, vaddr)) { > + refcount_inc(&area->ref); > + goto unlock; > + } > + } > + > + area = create_tramp_area(vaddr); > + if (area) > + hlist_add_head(&area->node, &state->tramp_head); > + > +unlock: > + mutex_unlock(&state->tramp_mutex); > + return area; > +} > + > +static void destroy_tramp_area(struct tramp_area *area) > +{ > + hlist_del(&area->node); > + put_page(area->page); > + kfree(area); > +} > + > +void put_tramp_area(struct tramp_area *area) > +{ > + struct mm_struct *mm = current->mm; > + struct uprobes_state *state = &mm->uprobes_state; > + > + if (area == NULL) nit: !area > + return; > + > + mutex_lock(&state->tramp_mutex); > + if (refcount_dec_and_test(&area->ref)) > + destroy_tramp_area(area); > + mutex_unlock(&state->tramp_mutex); > +} > + [...]
On Thu, Nov 14, 2024 at 03:44:12PM -0800, Andrii Nakryiko wrote: > On Tue, Nov 5, 2024 at 8:33 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > On Tue, Nov 05, 2024 at 03:23:27PM +0100, Peter Zijlstra wrote: > > > On Tue, Nov 05, 2024 at 02:33:59PM +0100, Jiri Olsa wrote: > > > > Adding interface to add special mapping for user space page that will be > > > > used as place holder for uprobe trampoline in following changes. > > > > > > > > The get_tramp_area(vaddr) function either finds 'callable' page or create > > > > new one. The 'callable' means it's reachable by call instruction (from > > > > vaddr argument) and is decided by each arch via new arch_uprobe_is_callable > > > > function. > > > > > > > > The put_tramp_area function either drops refcount or destroys the special > > > > mapping and all the maps are clean up when the process goes down. > > > > > > In another thread somewhere, Andrii mentioned that Meta has executables > > > with more than 4G of .text. This isn't going to work for them, is it? > > > > > > > not if you can't reach the trampoline from the probed address > > That specific example was about 1.5GB (though we might have bigger > .text, I didn't do exhaustive research). As Jiri said, this would be > best effort trying to find closest free mapping to stay within +/-2GB > offset. If that fails, we always would be falling back to slower > int3-based uprobing, yep. > > Jiri, we could also have an option to support 64-bit call, right? We'd > need nop9 for that, but it's an option as well to future-proofing this > approach, no? hm, I don't think there's call with relative 64bit offset there's indirect call through register or address.. but I think we would fit in nop10 with the indirect call through address > > Also, can we somehow use fs/gs-based indirect calls/jumps somehow to > have a guarantee that offset is always small (<2GB away relative to > the base stored in fs/gs). Not sure if this is feasible, but I thought > it would be good to bring this up just to make sure it doesn't work. > > If segment based absolute call is somehow feasible, we can probably > simplify a bunch of stuff by allocating it eagerly, once, and > somewhere high up next to VDSO (or maybe even put it into VDSO, don't > now). yes, that would be convenient jirka > > Anyways, let's brainstorm if there are any clever alternatives here. > > > > > > jirka
On Thu, Nov 14, 2024 at 03:44:14PM -0800, Andrii Nakryiko wrote: > On Tue, Nov 5, 2024 at 5:35 AM Jiri Olsa <jolsa@kernel.org> wrote: > > > > Adding interface to add special mapping for user space page that will be > > used as place holder for uprobe trampoline in following changes. > > > > The get_tramp_area(vaddr) function either finds 'callable' page or create > > new one. The 'callable' means it's reachable by call instruction (from > > vaddr argument) and is decided by each arch via new arch_uprobe_is_callable > > function. > > > > The put_tramp_area function either drops refcount or destroys the special > > mapping and all the maps are clean up when the process goes down. > > > > Signed-off-by: Jiri Olsa <jolsa@kernel.org> > > --- > > include/linux/uprobes.h | 12 ++++ > > kernel/events/uprobes.c | 141 ++++++++++++++++++++++++++++++++++++++++ > > kernel/fork.c | 2 + > > 3 files changed, 155 insertions(+) > > > > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h > > index be306028ed59..222d8e82cee2 100644 > > --- a/include/linux/uprobes.h > > +++ b/include/linux/uprobes.h > > @@ -172,6 +172,15 @@ struct xol_area; > > > > struct uprobes_state { > > struct xol_area *xol_area; > > + struct hlist_head tramp_head; > > + struct mutex tramp_mutex; > > +}; > > + > > +struct tramp_area { > > + unsigned long vaddr; > > + struct page *page; > > + struct hlist_node node; > > + refcount_t ref; > > nit: any reason we are unnecessarily trying to save 4 bytes on > refcount (and we don't actually, due to padding) hum, I'm not sure what you mean.. what's the alternative? > > > }; > > > > extern void __init uprobes_init(void); > > @@ -219,6 +228,9 @@ extern int uprobe_verify_opcode(struct page *page, unsigned long vaddr, uprobe_o > > extern int arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr, > > uprobe_opcode_t *new_opcode, void *data); > > extern bool arch_uprobe_is_register(uprobe_opcode_t *insn, int len, void *data); > > +struct tramp_area *get_tramp_area(unsigned long vaddr); > > uprobe_get_tramp_area() to make it clear this is uprobe specific, > given this is exposed function? > > and add that extern like we do for other functions? ok > > > +void put_tramp_area(struct tramp_area *area); > > uprobe_put_tramp_area() ? ok > > > +bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr); > > #else /* !CONFIG_UPROBES */ > > struct uprobes_state { > > }; > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > > index 944d9df1f081..a44305c559a4 100644 > > --- a/kernel/events/uprobes.c > > +++ b/kernel/events/uprobes.c > > @@ -616,6 +616,145 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v > > (uprobe_opcode_t *)&auprobe->insn, UPROBE_SWBP_INSN_SIZE, NULL); > > } > > > > +bool __weak arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr) > > +{ > > + return false; > > +} > > + > > +static unsigned long find_nearest_page(unsigned long vaddr) > > +{ > > + struct mm_struct *mm = current->mm; > > + struct vm_area_struct *vma, *prev; > > + VMA_ITERATOR(vmi, mm, 0); > > + > > + prev = vma_next(&vmi); > > + vma = vma_next(&vmi); > > + while (vma) { > > + if (vma->vm_start - prev->vm_end >= PAGE_SIZE && > > + arch_uprobe_is_callable(prev->vm_end, vaddr)) > > + return prev->vm_end; > > shouldn't we try both `prev->vm_end` and `vma->vm_start - PAGE_SIZE` > as two possible places right, we should do that SNIP > > +static struct tramp_area *create_tramp_area(unsigned long vaddr) > > +{ > > + struct mm_struct *mm = current->mm; > > + struct vm_area_struct *vma; > > + struct tramp_area *area; > > + > > + vaddr = find_nearest_page(vaddr); > > + if (!vaddr) > > + return NULL; > > + > > + area = kzalloc(sizeof(*area), GFP_KERNEL); > > + if (unlikely(!area)) > > + return NULL; > > + > > + area->page = alloc_page(GFP_HIGHUSER); > > + if (!area->page) > > + goto free_area; > > + > > + refcount_set(&area->ref, 1); > > + area->vaddr = vaddr; > > + > > + vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE, > > + VM_READ|VM_EXEC|VM_MAYEXEC|VM_MAYREAD|VM_DONTCOPY|VM_IO, > > + &tramp_mapping); > > + if (!IS_ERR(vma)) > > + return area; > > please keep a pattern, it's less surprising that way > > if (IS_ERR(vma)) > goto free_page; > > return area; > > free_page: ok SNIP > > +static void destroy_tramp_area(struct tramp_area *area) > > +{ > > + hlist_del(&area->node); > > + put_page(area->page); > > + kfree(area); > > +} > > + > > +void put_tramp_area(struct tramp_area *area) > > +{ > > + struct mm_struct *mm = current->mm; > > + struct uprobes_state *state = &mm->uprobes_state; > > + > > + if (area == NULL) > > nit: !area ok thanks, jirka
On Sat, Nov 16, 2024 at 1:44 PM Jiri Olsa <olsajiri@gmail.com> wrote: > > On Thu, Nov 14, 2024 at 03:44:14PM -0800, Andrii Nakryiko wrote: > > On Tue, Nov 5, 2024 at 5:35 AM Jiri Olsa <jolsa@kernel.org> wrote: > > > > > > Adding interface to add special mapping for user space page that will be > > > used as place holder for uprobe trampoline in following changes. > > > > > > The get_tramp_area(vaddr) function either finds 'callable' page or create > > > new one. The 'callable' means it's reachable by call instruction (from > > > vaddr argument) and is decided by each arch via new arch_uprobe_is_callable > > > function. > > > > > > The put_tramp_area function either drops refcount or destroys the special > > > mapping and all the maps are clean up when the process goes down. > > > > > > Signed-off-by: Jiri Olsa <jolsa@kernel.org> > > > --- > > > include/linux/uprobes.h | 12 ++++ > > > kernel/events/uprobes.c | 141 ++++++++++++++++++++++++++++++++++++++++ > > > kernel/fork.c | 2 + > > > 3 files changed, 155 insertions(+) > > > > > > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h > > > index be306028ed59..222d8e82cee2 100644 > > > --- a/include/linux/uprobes.h > > > +++ b/include/linux/uprobes.h > > > @@ -172,6 +172,15 @@ struct xol_area; > > > > > > struct uprobes_state { > > > struct xol_area *xol_area; > > > + struct hlist_head tramp_head; > > > + struct mutex tramp_mutex; > > > +}; > > > + > > > +struct tramp_area { > > > + unsigned long vaddr; > > > + struct page *page; > > > + struct hlist_node node; > > > + refcount_t ref; > > > > nit: any reason we are unnecessarily trying to save 4 bytes on > > refcount (and we don't actually, due to padding) > > hum, I'm not sure what you mean.. what's the alternative? atomic64_t ? > > > > > > }; > > > > > > extern void __init uprobes_init(void); > > > @@ -219,6 +228,9 @@ extern int uprobe_verify_opcode(struct page *page, unsigned long vaddr, uprobe_o > > > extern int arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr, > > > uprobe_opcode_t *new_opcode, void *data); > > > extern bool arch_uprobe_is_register(uprobe_opcode_t *insn, int len, void *data); > > > +struct tramp_area *get_tramp_area(unsigned long vaddr); > > > > uprobe_get_tramp_area() to make it clear this is uprobe specific, > > given this is exposed function? > > > > and add that extern like we do for other functions? > > ok > > > > > > +void put_tramp_area(struct tramp_area *area); > > > > uprobe_put_tramp_area() ? > > ok > > > > > > +bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr); > > > #else /* !CONFIG_UPROBES */ > > > struct uprobes_state { > > > }; > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > > > index 944d9df1f081..a44305c559a4 100644 > > > --- a/kernel/events/uprobes.c > > > +++ b/kernel/events/uprobes.c > > > @@ -616,6 +616,145 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v > > > (uprobe_opcode_t *)&auprobe->insn, UPROBE_SWBP_INSN_SIZE, NULL); > > > } > > > > > > +bool __weak arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr) > > > +{ > > > + return false; > > > +} > > > + > > > +static unsigned long find_nearest_page(unsigned long vaddr) > > > +{ > > > + struct mm_struct *mm = current->mm; > > > + struct vm_area_struct *vma, *prev; > > > + VMA_ITERATOR(vmi, mm, 0); > > > + > > > + prev = vma_next(&vmi); > > > + vma = vma_next(&vmi); > > > + while (vma) { > > > + if (vma->vm_start - prev->vm_end >= PAGE_SIZE && > > > + arch_uprobe_is_callable(prev->vm_end, vaddr)) > > > + return prev->vm_end; > > > > shouldn't we try both `prev->vm_end` and `vma->vm_start - PAGE_SIZE` > > as two possible places > > right, we should do that > > SNIP > > > > +static struct tramp_area *create_tramp_area(unsigned long vaddr) > > > +{ > > > + struct mm_struct *mm = current->mm; > > > + struct vm_area_struct *vma; > > > + struct tramp_area *area; > > > + > > > + vaddr = find_nearest_page(vaddr); > > > + if (!vaddr) > > > + return NULL; > > > + > > > + area = kzalloc(sizeof(*area), GFP_KERNEL); > > > + if (unlikely(!area)) > > > + return NULL; > > > + > > > + area->page = alloc_page(GFP_HIGHUSER); > > > + if (!area->page) > > > + goto free_area; > > > + > > > + refcount_set(&area->ref, 1); > > > + area->vaddr = vaddr; > > > + > > > + vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE, > > > + VM_READ|VM_EXEC|VM_MAYEXEC|VM_MAYREAD|VM_DONTCOPY|VM_IO, > > > + &tramp_mapping); > > > + if (!IS_ERR(vma)) > > > + return area; > > > > please keep a pattern, it's less surprising that way > > > > if (IS_ERR(vma)) > > goto free_page; > > > > return area; > > > > free_page: > > ok > > > SNIP > > > > +static void destroy_tramp_area(struct tramp_area *area) > > > +{ > > > + hlist_del(&area->node); > > > + put_page(area->page); > > > + kfree(area); > > > +} > > > + > > > +void put_tramp_area(struct tramp_area *area) > > > +{ > > > + struct mm_struct *mm = current->mm; > > > + struct uprobes_state *state = &mm->uprobes_state; > > > + > > > + if (area == NULL) > > > > nit: !area > > ok > > thanks, > jirka
On Sat, Nov 16, 2024 at 1:44 PM Jiri Olsa <olsajiri@gmail.com> wrote: > > On Thu, Nov 14, 2024 at 03:44:12PM -0800, Andrii Nakryiko wrote: > > On Tue, Nov 5, 2024 at 8:33 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > On Tue, Nov 05, 2024 at 03:23:27PM +0100, Peter Zijlstra wrote: > > > > On Tue, Nov 05, 2024 at 02:33:59PM +0100, Jiri Olsa wrote: > > > > > Adding interface to add special mapping for user space page that will be > > > > > used as place holder for uprobe trampoline in following changes. > > > > > > > > > > The get_tramp_area(vaddr) function either finds 'callable' page or create > > > > > new one. The 'callable' means it's reachable by call instruction (from > > > > > vaddr argument) and is decided by each arch via new arch_uprobe_is_callable > > > > > function. > > > > > > > > > > The put_tramp_area function either drops refcount or destroys the special > > > > > mapping and all the maps are clean up when the process goes down. > > > > > > > > In another thread somewhere, Andrii mentioned that Meta has executables > > > > with more than 4G of .text. This isn't going to work for them, is it? > > > > > > > > > > not if you can't reach the trampoline from the probed address > > > > That specific example was about 1.5GB (though we might have bigger > > .text, I didn't do exhaustive research). As Jiri said, this would be > > best effort trying to find closest free mapping to stay within +/-2GB > > offset. If that fails, we always would be falling back to slower > > int3-based uprobing, yep. > > > > Jiri, we could also have an option to support 64-bit call, right? We'd > > need nop9 for that, but it's an option as well to future-proofing this > > approach, no? > > hm, I don't think there's call with relative 64bit offset why do you need a relative, when you have 64 bits? ;) there is a call to absolute address, no? > > there's indirect call through register or address.. but I think we would > fit in nop10 with the indirect call through address > > > > > Also, can we somehow use fs/gs-based indirect calls/jumps somehow to > > have a guarantee that offset is always small (<2GB away relative to > > the base stored in fs/gs). Not sure if this is feasible, but I thought > > it would be good to bring this up just to make sure it doesn't work. > > > > If segment based absolute call is somehow feasible, we can probably > > simplify a bunch of stuff by allocating it eagerly, once, and > > somewhere high up next to VDSO (or maybe even put it into VDSO, don't > > now). > > yes, that would be convenient > > jirka > > > > > Anyways, let's brainstorm if there are any clever alternatives here. > > > > > > > > > > jirka
On Mon, Nov 18, 2024 at 10:06:51PM -0800, Andrii Nakryiko wrote: > > > Jiri, we could also have an option to support 64-bit call, right? We'd > > > need nop9 for that, but it's an option as well to future-proofing this > > > approach, no? > > > > hm, I don't think there's call with relative 64bit offset > > why do you need a relative, when you have 64 bits? ;) there is a call > to absolute address, no? No, there is not :/ You get to use an indirect call, which means multiple instructions and all the speculation joy. IFF USDT thingies have AX clobbered (I couldn't find in a hurry) then patching the multi instruction thing is relatively straight forward, if they don't, its going to be a pain.
On Mon, Nov 18, 2024 at 10:05:41PM -0800, Andrii Nakryiko wrote: > On Sat, Nov 16, 2024 at 1:44 PM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > On Thu, Nov 14, 2024 at 03:44:14PM -0800, Andrii Nakryiko wrote: > > > On Tue, Nov 5, 2024 at 5:35 AM Jiri Olsa <jolsa@kernel.org> wrote: > > > > > > > > Adding interface to add special mapping for user space page that will be > > > > used as place holder for uprobe trampoline in following changes. > > > > > > > > The get_tramp_area(vaddr) function either finds 'callable' page or create > > > > new one. The 'callable' means it's reachable by call instruction (from > > > > vaddr argument) and is decided by each arch via new arch_uprobe_is_callable > > > > function. > > > > > > > > The put_tramp_area function either drops refcount or destroys the special > > > > mapping and all the maps are clean up when the process goes down. > > > > > > > > Signed-off-by: Jiri Olsa <jolsa@kernel.org> > > > > --- > > > > include/linux/uprobes.h | 12 ++++ > > > > kernel/events/uprobes.c | 141 ++++++++++++++++++++++++++++++++++++++++ > > > > kernel/fork.c | 2 + > > > > 3 files changed, 155 insertions(+) > > > > > > > > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h > > > > index be306028ed59..222d8e82cee2 100644 > > > > --- a/include/linux/uprobes.h > > > > +++ b/include/linux/uprobes.h > > > > @@ -172,6 +172,15 @@ struct xol_area; > > > > > > > > struct uprobes_state { > > > > struct xol_area *xol_area; > > > > + struct hlist_head tramp_head; > > > > + struct mutex tramp_mutex; > > > > +}; > > > > + > > > > +struct tramp_area { > > > > + unsigned long vaddr; > > > > + struct page *page; > > > > + struct hlist_node node; > > > > + refcount_t ref; > > > > > > nit: any reason we are unnecessarily trying to save 4 bytes on > > > refcount (and we don't actually, due to padding) > > > > hum, I'm not sure what you mean.. what's the alternative? > > atomic64_t ? hum, just because we have extra 4 bytes padding? we use refcount_t on other places so seems like better fit to me jirka
On Tue, Nov 19, 2024 at 10:13:48AM +0100, Peter Zijlstra wrote: > On Mon, Nov 18, 2024 at 10:06:51PM -0800, Andrii Nakryiko wrote: > > > > > Jiri, we could also have an option to support 64-bit call, right? We'd > > > > need nop9 for that, but it's an option as well to future-proofing this > > > > approach, no? > > > > > > hm, I don't think there's call with relative 64bit offset > > > > why do you need a relative, when you have 64 bits? ;) there is a call > > to absolute address, no? > > No, there is not :/ You get to use an indirect call, which means > multiple instructions and all the speculation joy. > > IFF USDT thingies have AX clobbered (I couldn't find in a hurry) then > patching the multi instruction thing is relatively straight forward, if > they don't, its going to be a pain. I don't follow, what's the reason for that? jirka
On Tue, Nov 19, 2024 at 1:13 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Mon, Nov 18, 2024 at 10:06:51PM -0800, Andrii Nakryiko wrote: > > > > > Jiri, we could also have an option to support 64-bit call, right? We'd > > > > need nop9 for that, but it's an option as well to future-proofing this > > > > approach, no? > > > > > > hm, I don't think there's call with relative 64bit offset > > > > why do you need a relative, when you have 64 bits? ;) there is a call > > to absolute address, no? > > No, there is not :/ You get to use an indirect call, which means > multiple instructions and all the speculation joy. Ah, I misinterpreted `CALL m16:64` (from [0]) as an absolute address call with 64-bit displacement encoded in the struct, my bad. [0] https://www.felixcloutier.com/x86/call > > IFF USDT thingies have AX clobbered (I couldn't find in a hurry) then > patching the multi instruction thing is relatively straight forward, if > they don't, its going to be a pain. USDTs are meant to be "transparent" to the surrounding code and they don't mark any clobbered registers. Technically it could be added, but I'm not a fan of this. I'd stick to the 32-bit relative call for now, and fallback to int3 if we can't reach the uprobe trampoline. So huge .text might suffer suboptimal performance, but at least USDTs won't pessimize the surrounding code (and kernel won't make any assumptions about registers that are ok to use).
On Tue, Nov 19, 2024 at 7:14 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > On Mon, Nov 18, 2024 at 10:05:41PM -0800, Andrii Nakryiko wrote: > > On Sat, Nov 16, 2024 at 1:44 PM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > On Thu, Nov 14, 2024 at 03:44:14PM -0800, Andrii Nakryiko wrote: > > > > On Tue, Nov 5, 2024 at 5:35 AM Jiri Olsa <jolsa@kernel.org> wrote: > > > > > > > > > > Adding interface to add special mapping for user space page that will be > > > > > used as place holder for uprobe trampoline in following changes. > > > > > > > > > > The get_tramp_area(vaddr) function either finds 'callable' page or create > > > > > new one. The 'callable' means it's reachable by call instruction (from > > > > > vaddr argument) and is decided by each arch via new arch_uprobe_is_callable > > > > > function. > > > > > > > > > > The put_tramp_area function either drops refcount or destroys the special > > > > > mapping and all the maps are clean up when the process goes down. > > > > > > > > > > Signed-off-by: Jiri Olsa <jolsa@kernel.org> > > > > > --- > > > > > include/linux/uprobes.h | 12 ++++ > > > > > kernel/events/uprobes.c | 141 ++++++++++++++++++++++++++++++++++++++++ > > > > > kernel/fork.c | 2 + > > > > > 3 files changed, 155 insertions(+) > > > > > > > > > > diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h > > > > > index be306028ed59..222d8e82cee2 100644 > > > > > --- a/include/linux/uprobes.h > > > > > +++ b/include/linux/uprobes.h > > > > > @@ -172,6 +172,15 @@ struct xol_area; > > > > > > > > > > struct uprobes_state { > > > > > struct xol_area *xol_area; > > > > > + struct hlist_head tramp_head; > > > > > + struct mutex tramp_mutex; > > > > > +}; > > > > > + > > > > > +struct tramp_area { > > > > > + unsigned long vaddr; > > > > > + struct page *page; > > > > > + struct hlist_node node; > > > > > + refcount_t ref; > > > > > > > > nit: any reason we are unnecessarily trying to save 4 bytes on > > > > refcount (and we don't actually, due to padding) > > > > > > hum, I'm not sure what you mean.. what's the alternative? > > > > atomic64_t ? > > hum, just because we have extra 4 bytes padding? we use refcount_t > on other places so seems like better fit to me My (minor) concern was that tramp_area is a very long-living object that can be shared across huge amounts of uprobes, and 2 billion uprobes is a lot, but still a number that one can, practically speaking, reach (with enough effort). And so skimping on refcount to save 4 bytes for something that we have one or two per *some* processes seemed (and still seems) like wrong savings to pursue. > > jirka
On Wed, Nov 20, 2024 at 04:07:38PM -0800, Andrii Nakryiko wrote: > USDTs are meant to be "transparent" to the surrounding code and they > don't mark any clobbered registers. Technically it could be added, but > I'm not a fan of this. Sure. Anyway, another thing to consider is FRED, will all of this still matter once that lands? If FRED gets us INT3 performance close to what SYSCALL has, then all this work will go unused.
On Thu, Nov 21, 2024 at 4:17 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Wed, Nov 20, 2024 at 04:07:38PM -0800, Andrii Nakryiko wrote: > > > USDTs are meant to be "transparent" to the surrounding code and they > > don't mark any clobbered registers. Technically it could be added, but > > I'm not a fan of this. > > Sure. Anyway, another thing to consider is FRED, will all of this still > matter once that lands? If FRED gets us INT3 performance close to what > SYSCALL has, then all this work will go unused. afaik not a single cpu in the datacenter supports FRED while uprobe overhead is real. imo it's worth improving performance today for existing cpus. I suspect arm64 might benefit too. Even if arm hw does the same amount of work for trap vs syscall the sw overhead of handling trap is different. I suspect that equation will apply to future FRED cpus too.
On Thu, Nov 21, 2024 at 08:02:12AM -0800, Alexei Starovoitov wrote: > On Thu, Nov 21, 2024 at 4:17 AM Peter Zijlstra <peterz@infradead.org> wrote: > > > > On Wed, Nov 20, 2024 at 04:07:38PM -0800, Andrii Nakryiko wrote: > > > > > USDTs are meant to be "transparent" to the surrounding code and they > > > don't mark any clobbered registers. Technically it could be added, but > > > I'm not a fan of this. > > > > Sure. Anyway, another thing to consider is FRED, will all of this still > > matter once that lands? If FRED gets us INT3 performance close to what > > SYSCALL has, then all this work will go unused. > > afaik not a single cpu in the datacenter supports FRED while > uprobe overhead is real. > imo it's worth improving performance today for existing cpus. I understand, but OTOH adding a syscall now, that we'll have to maintain for years and years, even through we know it'll not be used much is a bit annoying. > I suspect arm64 might benefit too. Even if arm hw does the same > amount of work for trap vs syscall the sw overhead of handling > trap is different. Well, the RISC CPUs have a much harder time using this, their immediate range is typically puny and they end up needing multiple instructions and some register in order to set up a call. Elsewhere in the thread Mark Rutland already noted that arm64 really doesn't need or want this.
On Thu, Nov 21, 2024 at 8:34 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Thu, Nov 21, 2024 at 08:02:12AM -0800, Alexei Starovoitov wrote: > > On Thu, Nov 21, 2024 at 4:17 AM Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > On Wed, Nov 20, 2024 at 04:07:38PM -0800, Andrii Nakryiko wrote: > > > > > > > USDTs are meant to be "transparent" to the surrounding code and they > > > > don't mark any clobbered registers. Technically it could be added, but > > > > I'm not a fan of this. > > > > > > Sure. Anyway, another thing to consider is FRED, will all of this still > > > matter once that lands? If FRED gets us INT3 performance close to what > > > SYSCALL has, then all this work will go unused. > > > > afaik not a single cpu in the datacenter supports FRED while > > uprobe overhead is real. > > imo it's worth improving performance today for existing cpus. > > I understand, but OTOH adding a syscall now, that we'll have to maintain > for years and years, even through we know it'll not be used much is a > bit annoying. No. It _will_ be used for years. > > > I suspect arm64 might benefit too. Even if arm hw does the same > > amount of work for trap vs syscall the sw overhead of handling > > trap is different. > > Well, the RISC CPUs have a much harder time using this, their immediate > range is typically puny and they end up needing multiple instructions > and some register in order to set up a call. We don't care about 32-bit archs and other exotics. They're not the reasons to leave performance on the table on dominant archs. > Elsewhere in the thread Mark Rutland already noted that arm64 really > doesn't need or want this. Doesn't look like you've read what you quoted above. On arm64 the _HW_ cost may be the same. The _SW_ difference in handling trap vs syscall is real. I bet once uprobe syscall is benchmarked on arm64 there will be a delta.
[resending as I somehow messed up the 'From' header and got a tonne of bounces] On Thu, Nov 21, 2024 at 08:47:56AM -0800, Alexei Starovoitov wrote: > On Thu, Nov 21, 2024 at 8:34 AM Peter Zijlstra <peterz@infradead.org> wrote: > > Elsewhere in the thread Mark Rutland already noted that arm64 really > > doesn't need or want this. > > Doesn't look like you've read what you quoted above. > On arm64 the _HW_ cost may be the same. > The _SW_ difference in handling trap vs syscall is real. > I bet once uprobe syscall is benchmarked on arm64 there will > be a delta. I already pointed out in [1] that on arm64 we can make the trap case *faster* than the syscall. If that's not already the case, there's only a small amount of rework needed, (pulling BRK handling into entry-common.c), which we want to do for other reasons anyway. On arm64 I do not want the syscall; the trap is faster and simpler to maintain. Mark [1] https://lore.kernel.org/lkml/ZzsRfhGSYXVK0mst@J2N7QTR9R3/
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index be306028ed59..222d8e82cee2 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -172,6 +172,15 @@ struct xol_area; struct uprobes_state { struct xol_area *xol_area; + struct hlist_head tramp_head; + struct mutex tramp_mutex; +}; + +struct tramp_area { + unsigned long vaddr; + struct page *page; + struct hlist_node node; + refcount_t ref; }; extern void __init uprobes_init(void); @@ -219,6 +228,9 @@ extern int uprobe_verify_opcode(struct page *page, unsigned long vaddr, uprobe_o extern int arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode, void *data); extern bool arch_uprobe_is_register(uprobe_opcode_t *insn, int len, void *data); +struct tramp_area *get_tramp_area(unsigned long vaddr); +void put_tramp_area(struct tramp_area *area); +bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr); #else /* !CONFIG_UPROBES */ struct uprobes_state { }; diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 944d9df1f081..a44305c559a4 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -616,6 +616,145 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v (uprobe_opcode_t *)&auprobe->insn, UPROBE_SWBP_INSN_SIZE, NULL); } +bool __weak arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr) +{ + return false; +} + +static unsigned long find_nearest_page(unsigned long vaddr) +{ + struct mm_struct *mm = current->mm; + struct vm_area_struct *vma, *prev; + VMA_ITERATOR(vmi, mm, 0); + + prev = vma_next(&vmi); + vma = vma_next(&vmi); + while (vma) { + if (vma->vm_start - prev->vm_end >= PAGE_SIZE && + arch_uprobe_is_callable(prev->vm_end, vaddr)) + return prev->vm_end; + + prev = vma; + vma = vma_next(&vmi); + } + + return 0; +} + +static vm_fault_t tramp_fault(const struct vm_special_mapping *sm, + struct vm_area_struct *vma, struct vm_fault *vmf) +{ + struct hlist_head *head = &vma->vm_mm->uprobes_state.tramp_head; + struct tramp_area *area; + + hlist_for_each_entry(area, head, node) { + if (vma->vm_start == area->vaddr) { + vmf->page = area->page; + get_page(vmf->page); + return 0; + } + } + + return -EINVAL; +} + +static int tramp_mremap(const struct vm_special_mapping *sm, struct vm_area_struct *new_vma) +{ + return -EPERM; +} + +static const struct vm_special_mapping tramp_mapping = { + .name = "[uprobes-trampoline]", + .fault = tramp_fault, + .mremap = tramp_mremap, +}; + +static struct tramp_area *create_tramp_area(unsigned long vaddr) +{ + struct mm_struct *mm = current->mm; + struct vm_area_struct *vma; + struct tramp_area *area; + + vaddr = find_nearest_page(vaddr); + if (!vaddr) + return NULL; + + area = kzalloc(sizeof(*area), GFP_KERNEL); + if (unlikely(!area)) + return NULL; + + area->page = alloc_page(GFP_HIGHUSER); + if (!area->page) + goto free_area; + + refcount_set(&area->ref, 1); + area->vaddr = vaddr; + + vma = _install_special_mapping(mm, area->vaddr, PAGE_SIZE, + VM_READ|VM_EXEC|VM_MAYEXEC|VM_MAYREAD|VM_DONTCOPY|VM_IO, + &tramp_mapping); + if (!IS_ERR(vma)) + return area; + + __free_page(area->page); + free_area: + kfree(area); + return NULL; +} + +struct tramp_area *get_tramp_area(unsigned long vaddr) +{ + struct uprobes_state *state = ¤t->mm->uprobes_state; + struct tramp_area *area = NULL; + + mutex_lock(&state->tramp_mutex); + hlist_for_each_entry(area, &state->tramp_head, node) { + if (arch_uprobe_is_callable(area->vaddr, vaddr)) { + refcount_inc(&area->ref); + goto unlock; + } + } + + area = create_tramp_area(vaddr); + if (area) + hlist_add_head(&area->node, &state->tramp_head); + +unlock: + mutex_unlock(&state->tramp_mutex); + return area; +} + +static void destroy_tramp_area(struct tramp_area *area) +{ + hlist_del(&area->node); + put_page(area->page); + kfree(area); +} + +void put_tramp_area(struct tramp_area *area) +{ + struct mm_struct *mm = current->mm; + struct uprobes_state *state = &mm->uprobes_state; + + if (area == NULL) + return; + + mutex_lock(&state->tramp_mutex); + if (refcount_dec_and_test(&area->ref)) + destroy_tramp_area(area); + mutex_unlock(&state->tramp_mutex); +} + +static void clear_tramp_head(struct mm_struct *mm) +{ + struct uprobes_state *state = &mm->uprobes_state; + struct tramp_area *area; + struct hlist_node *n; + + hlist_for_each_entry_safe(area, n, &state->tramp_head, node) + destroy_tramp_area(area); +} + /* uprobe should have guaranteed positive refcount */ static struct uprobe *get_uprobe(struct uprobe *uprobe) { @@ -1788,6 +1927,8 @@ void uprobe_clear_state(struct mm_struct *mm) delayed_uprobe_remove(NULL, mm); mutex_unlock(&delayed_uprobe_lock); + clear_tramp_head(mm); + if (!area) return; diff --git a/kernel/fork.c b/kernel/fork.c index 89ceb4a68af2..b1fe431e5cce 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1248,6 +1248,8 @@ static void mm_init_uprobes_state(struct mm_struct *mm) { #ifdef CONFIG_UPROBES mm->uprobes_state.xol_area = NULL; + mutex_init(&mm->uprobes_state.tramp_mutex); + INIT_HLIST_HEAD(&mm->uprobes_state.tramp_head); #endif }
Adding interface to add special mapping for user space page that will be used as place holder for uprobe trampoline in following changes. The get_tramp_area(vaddr) function either finds 'callable' page or create new one. The 'callable' means it's reachable by call instruction (from vaddr argument) and is decided by each arch via new arch_uprobe_is_callable function. The put_tramp_area function either drops refcount or destroys the special mapping and all the maps are clean up when the process goes down. Signed-off-by: Jiri Olsa <jolsa@kernel.org> --- include/linux/uprobes.h | 12 ++++ kernel/events/uprobes.c | 141 ++++++++++++++++++++++++++++++++++++++++ kernel/fork.c | 2 + 3 files changed, 155 insertions(+)