Message ID:   20230801145414.418145-1-davemarchevsky@fb.com (mailing list archive)
State:        Changes Requested
Delegated to: BPF
Series:       [v1,bpf-next,1/2,RFC] bpf: Introduce BPF_F_VMA_NEXT flag for bpf_find_vma helper
On Tue, Aug 1, 2023 at 7:54 AM Dave Marchevsky <davemarchevsky@fb.com> wrote:
>
> At Meta we have a profiling daemon which periodically collects
> information on many hosts. This collection usually involves grabbing
> stacks (user and kernel) using perf_event BPF progs and later
> symbolicating them. For user stacks we try to use BPF_F_USER_BUILD_ID
> and rely on remote symbolication, but BPF_F_USER_BUILD_ID doesn't
> always succeed. In those cases we must fall back to digging around in
> /proc/PID/maps to map virtual address to (binary, offset). The
> /proc/PID/maps digging does not occur synchronously with stack
> collection, so the process might already be gone, in which case it
> won't have /proc/PID/maps and we will fail to symbolicate.
>
> This 'exited process problem' doesn't occur very often, as most of the
> prod services we care to profile are long-lived daemons, but there are
> enough usecases to warrant a workaround: a BPF program which can be
> optionally loaded at data collection time and essentially walks
> /proc/PID/maps. Currently this is done by walking the vma list:
>
>   struct vm_area_struct *mmap = BPF_CORE_READ(mm, mmap);
>   mmap_next = BPF_CORE_READ(mmap, vm_next); /* in a loop */
>
> Since commit 763ecb035029 ("mm: remove the vma linked list") there's no
> longer a vma linked list to walk. Walking the vma maple tree is not as
> simple as hopping struct vm_area_struct->vm_next. That commit replaces
> vm_next hopping with calls to the find_vma(mm, addr) helper function,
> which returns the vma containing addr, or, if no vma contains addr,
> the closest vma with a higher start addr.
>
> The BPF helper bpf_find_vma is unsurprisingly a thin wrapper around
> find_vma, with the major difference that no 'closest vma' is returned
> if there is no VMA containing a particular address. This prevents BPF
> programs from being able to use bpf_find_vma to iterate all vmas in a
> task in a reasonable way.
>
> This patch adds a BPF_F_VMA_NEXT flag to bpf_find_vma which restores
> 'closest vma' behavior when used. Because this is find_vma's default
> behavior, it's as straightforward as nerfing a 'vma contains addr'
> check on find_vma's retval.
>
> Also, change bpf_find_vma's address parameter to 'addr' instead of
> 'start'. The former is used in the documentation and more accurately
> describes the param.
>
> [
> RFC: This isn't an ideal solution for iteration of all vmas in a task
> in the long term, for a few reasons:
>
>   * In nmi context, a second call to bpf_find_vma will fail because
>     irq_work is busy, so we can't iterate all vmas
>   * It repeatedly takes and releases the mmap_read lock, where a
>     dedicated iterate_all_vmas(task) kfunc could just take it once
>     and hold it for all vmas
>
> My specific usecase doesn't do vma iteration in nmi context, and I
> think the 'closest vma' behavior can be useful here despite the
> locking inefficiencies.
>
> When Alexei and I discussed this offline, two alternatives that would
> provide similar functionality while addressing the above issues seemed
> reasonable:
>
>   * An open-coded iterator for task vmas. Similar to the existing
>     task_vma bpf_iter, but with no need to create a bpf_link and read
>     a bpf_iter fd from userspace.
>   * A new kfunc taking a callback, similar to bpf_find_vma, but
>     iterating over all vmas in one go.
>
> I think this patch is useful on its own since it's a fairly minimal
> change and fixes my usecase. Sending for early feedback and to
> solicit further thought about whether this should be dropped in
> favor of one of the above options.

- In theory this patch can work, but patch 2 didn't attempt to actually
  use it in a loop to iterate all vma-s. Which is a bit of a red flag
  for whether such iteration is practical (either via bpf_loop or
  bpf_for).

- This behavior of bpf_find_vma() feels too much like an implementation
  detail. find_vma will probably stay this way, since different parts of
  the kernel rely on it, but exposing it like BPF_F_VMA_NEXT leaks the
  implementation too much.

- Looking at task_vma_seq_get_next()... that's how vma iteration should
  be done, and I don't think a bpf prog can do it on its own. Because
  bpf_find_vma() drops the lock at every step, the problems described in
  that large comment will be hit sooner or later.

All concerns combined, I feel we'd better provide a new kfunc that
iterates vmas and drops the lock before invoking the callback. It can
be much simpler than task_vma_seq_get_next() if we don't drop the lock.
Maybe that's ok.

Doing it open-coded-iterators style is likely better:
bpf_iter_vma_new() kfunc will do
bpf_mmap_unlock_get_irq_work() + mmap_read_trylock(),
while bpf_iter_vma_destroy() will do bpf_mmap_unlock_mm().

I'd try the open-coded iter first. It's a good test for the iter infra.
bpf_iter_testmod_seq_new is an example of how to add a new iter.

Another issue with bpf_find_vma is .arg1_type = ARG_PTR_TO_BTF_ID.
It's not a trusted arg. We'd better move away from this legacy pointer.
bpf_iter_vma_new() should accept only a trusted ptr to task_struct.
fwiw bpf_get_current_task_btf_proto has
.ret_type = RET_PTR_TO_BTF_ID_TRUSTED, and it matters here.

The bpf prog might look like:

  task = bpf_get_current_task_btf();
  err = bpf_iter_vma_new(&it, task);
  while ((vma = bpf_iter_vma_next(&it))) ...;

assuming the lock is not dropped by _next.
On 8/1/23 4:41 PM, Alexei Starovoitov wrote:
> On Tue, Aug 1, 2023 at 7:54 AM Dave Marchevsky <davemarchevsky@fb.com> wrote:
>> [...]
>
> - This behavior of bpf_find_vma() feels too much implementation detail.
>   find_vma will probably stay this way, since different parts of the kernel
>   rely on it, but exposing it like BPF_F_VMA_NEXT leaks implementation too much.
>
> [...]

The only concern here that doesn't seem reasonable to me is the
"too much implementation detail". I agree with the rest, though, so I
will send a different series with the new implementation and point to
this discussion.
On 8/3/23 11:59 PM, David Marchevsky wrote:
> On 8/1/23 4:41 PM, Alexei Starovoitov wrote:
>> [...]
>
> The only concern here that doesn't seem reasonable to me is the
> "too much implementation detail". I agree with the rest, though,
> so will send a different series with new implementation and point
> to this discussion.
For reference, here is another use case for traversing vmas in a bpf
program, reported on the bcc mailing list:
  https://github.com/iovisor/bcc/pull/4679

The application may not have frame pointers, so the bpf program scans
the stack and reports any values that look like pointers into a user
text region. This is similar to how current arch code (e.g., x86)
reports a crash stack.
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 70da85200695..947187d76ebc 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5169,8 +5169,13 @@ union bpf_attr {
  *		function with *task*, *vma*, and *callback_ctx*.
  *		The *callback_fn* should be a static function and
  *		the *callback_ctx* should be a pointer to the stack.
- *		The *flags* is used to control certain aspects of the helper.
- *		Currently, the *flags* must be 0.
+ *		The *flags* is used to control certain aspects of the helper and
+ *		may be one of the following:
+ *
+ *		**BPF_F_VMA_NEXT**
+ *			If no vma contains *addr*, call *callback_fn* with the next vma,
+ *			i.e. the vma with lowest vm_start that is higher than *addr*.
+ *			This replicates behavior of kernel's find_vma helper.
  *
  *		The expected callback signature is
  *
@@ -6026,6 +6031,11 @@ enum {
 	BPF_F_EXCLUDE_INGRESS	= (1ULL << 4),
 };
 
+/* Flags for bpf_find_vma helper */
+enum {
+	BPF_F_VMA_NEXT		= (1ULL << 0),
+};
+
 #define __bpf_md_ptr(type, name)	\
 union {					\
 	type name;			\
diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
index c4ab9d6cdbe9..a8c87dcf36ad 100644
--- a/kernel/bpf/task_iter.c
+++ b/kernel/bpf/task_iter.c
@@ -777,7 +777,7 @@ static struct bpf_iter_reg task_vma_reg_info = {
 	.show_fdinfo		= bpf_iter_task_show_fdinfo,
 };
 
-BPF_CALL_5(bpf_find_vma, struct task_struct *, task, u64, start,
+BPF_CALL_5(bpf_find_vma, struct task_struct *, task, u64, addr,
 	   bpf_callback_t, callback_fn, void *, callback_ctx, u64, flags)
 {
 	struct mmap_unlock_irq_work *work = NULL;
@@ -785,10 +785,13 @@ BPF_CALL_5(bpf_find_vma, struct task_struct *, task, u64, start,
 	bool irq_work_busy = false;
 	struct mm_struct *mm;
 	int ret = -ENOENT;
+	bool vma_next;
 
-	if (flags)
+	if (flags & ~BPF_F_VMA_NEXT)
 		return -EINVAL;
 
+	vma_next = flags & BPF_F_VMA_NEXT;
+
 	if (!task)
 		return -ENOENT;
 
@@ -801,9 +804,10 @@ BPF_CALL_5(bpf_find_vma, struct task_struct *, task, u64, start,
 	if (irq_work_busy || !mmap_read_trylock(mm))
 		return -EBUSY;
 
-	vma = find_vma(mm, start);
+	vma = find_vma(mm, addr);
 
-	if (vma && vma->vm_start <= start && vma->vm_end > start) {
+	if (vma &&
+	    ((vma->vm_start <= addr && vma->vm_end > addr) || vma_next)) {
 		callback_fn((u64)(long)task, (u64)(long)vma,
 			    (u64)(long)callback_ctx, 0, 0);
 		ret = 0;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 70da85200695..947187d76ebc 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5169,8 +5169,13 @@ union bpf_attr {
  *		function with *task*, *vma*, and *callback_ctx*.
  *		The *callback_fn* should be a static function and
  *		the *callback_ctx* should be a pointer to the stack.
- *		The *flags* is used to control certain aspects of the helper.
- *		Currently, the *flags* must be 0.
+ *		The *flags* is used to control certain aspects of the helper and
+ *		may be one of the following:
+ *
+ *		**BPF_F_VMA_NEXT**
+ *			If no vma contains *addr*, call *callback_fn* with the next vma,
+ *			i.e. the vma with lowest vm_start that is higher than *addr*.
+ *			This replicates behavior of kernel's find_vma helper.
  *
  *		The expected callback signature is
  *
@@ -6026,6 +6031,11 @@ enum {
 	BPF_F_EXCLUDE_INGRESS	= (1ULL << 4),
 };
 
+/* Flags for bpf_find_vma helper */
+enum {
+	BPF_F_VMA_NEXT		= (1ULL << 0),
+};
+
 #define __bpf_md_ptr(type, name)	\
 union {					\
 	type name;			\
At Meta we have a profiling daemon which periodically collects
information on many hosts. This collection usually involves grabbing
stacks (user and kernel) using perf_event BPF progs and later
symbolicating them. For user stacks we try to use BPF_F_USER_BUILD_ID
and rely on remote symbolication, but BPF_F_USER_BUILD_ID doesn't
always succeed. In those cases we must fall back to digging around in
/proc/PID/maps to map virtual address to (binary, offset). The
/proc/PID/maps digging does not occur synchronously with stack
collection, so the process might already be gone, in which case it
won't have /proc/PID/maps and we will fail to symbolicate.

This 'exited process problem' doesn't occur very often, as most of the
prod services we care to profile are long-lived daemons, but there are
enough usecases to warrant a workaround: a BPF program which can be
optionally loaded at data collection time and essentially walks
/proc/PID/maps. Currently this is done by walking the vma list:

  struct vm_area_struct *mmap = BPF_CORE_READ(mm, mmap);
  mmap_next = BPF_CORE_READ(mmap, vm_next); /* in a loop */

Since commit 763ecb035029 ("mm: remove the vma linked list") there's no
longer a vma linked list to walk. Walking the vma maple tree is not as
simple as hopping struct vm_area_struct->vm_next. That commit replaces
vm_next hopping with calls to the find_vma(mm, addr) helper function,
which returns the vma containing addr, or, if no vma contains addr,
the closest vma with a higher start addr.

The BPF helper bpf_find_vma is unsurprisingly a thin wrapper around
find_vma, with the major difference that no 'closest vma' is returned
if there is no VMA containing a particular address. This prevents BPF
programs from being able to use bpf_find_vma to iterate all vmas in a
task in a reasonable way.

This patch adds a BPF_F_VMA_NEXT flag to bpf_find_vma which restores
'closest vma' behavior when used. Because this is find_vma's default
behavior, it's as straightforward as nerfing a 'vma contains addr'
check on find_vma's retval.

Also, change bpf_find_vma's address parameter to 'addr' instead of
'start'. The former is used in the documentation and more accurately
describes the param.

[
RFC: This isn't an ideal solution for iteration of all vmas in a task
in the long term, for a few reasons:

  * In nmi context, a second call to bpf_find_vma will fail because
    irq_work is busy, so we can't iterate all vmas
  * It repeatedly takes and releases the mmap_read lock, where a
    dedicated iterate_all_vmas(task) kfunc could just take it once
    and hold it for all vmas

My specific usecase doesn't do vma iteration in nmi context, and I
think the 'closest vma' behavior can be useful here despite the
locking inefficiencies.

When Alexei and I discussed this offline, two alternatives that would
provide similar functionality while addressing the above issues seemed
reasonable:

  * An open-coded iterator for task vmas. Similar to the existing
    task_vma bpf_iter, but with no need to create a bpf_link and read
    a bpf_iter fd from userspace.
  * A new kfunc taking a callback, similar to bpf_find_vma, but
    iterating over all vmas in one go.

I think this patch is useful on its own since it's a fairly minimal
change and fixes my usecase. Sending for early feedback and to
solicit further thought about whether this should be dropped in
favor of one of the above options.
]

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Cc: Nathan Slingerland <slinger@meta.com>
---
 include/uapi/linux/bpf.h       | 14 ++++++++++++--
 kernel/bpf/task_iter.c         | 12 ++++++++----
 tools/include/uapi/linux/bpf.h | 14 ++++++++++++--
 3 files changed, 32 insertions(+), 8 deletions(-)