Message ID | 20231013204426.1074286-1-davemarchevsky@fb.com (mailing list archive) |
---|---|
Headers | show |
Series | Open-coded task_vma iter | expand |
On 10/13/23 4:44 PM, Dave Marchevsky wrote: > At Meta we have a profiling daemon which periodically collects > information on many hosts. This collection usually involves grabbing > stacks (user and kernel) using perf_event BPF progs and later symbolicating > them. For user stacks we try to use BPF_F_USER_BUILD_ID and rely on > remote symbolication, but BPF_F_USER_BUILD_ID doesn't always succeed. In > those cases we must fall back to digging around in /proc/PID/maps to map > virtual address to (binary, offset). The /proc/PID/maps digging does not > occur synchronously with stack collection, so the process might already > be gone, in which case it won't have /proc/PID/maps and we will fail to > symbolicate. > > This 'exited process problem' doesn't occur very often as > most of the prod services we care to profile are long-lived daemons, but > there are enough usecases to warrant a workaround: a BPF program which > can be optionally loaded at data collection time and essentially walks > /proc/PID/maps. Currently this is done by walking the vma list: > > struct vm_area_struct* mmap = BPF_CORE_READ(mm, mmap); > mmap_next = BPF_CORE_READ(rmap, vm_next); /* in a loop */ > > Since commit 763ecb035029 ("mm: remove the vma linked list") there's no > longer a vma linked list to walk. Walking the vma maple tree is not as > simple as hopping struct vm_area_struct->vm_next. Luckily, > commit f39af05949a4 ("mm: add VMA iterator"), another commit in that series, > added struct vma_iterator and for_each_vma macro for easy vma iteration. If > similar functionality was exposed to BPF programs, it would be perfect for our > usecase. > > This series adds such functionality, specifically a BPF equivalent of > for_each_vma using the open-coded iterator style. > > Notes: > * This approach was chosen after discussion on a previous series [0] which > attempted to solve the same problem by adding a BPF_F_VMA_NEXT flag to > bpf_find_vma. > * Unlike the task_vma bpf_iter, the open-coded iterator kfuncs here do not > drop the vma read lock between iterations. See Alexei's response in [0]. > * The [vsyscall] page isn't really part of task->mm's vmas, but > /proc/PID/maps returns information about it anyways. The vma iter added > here does not do the same. See comment on selftest in patch 3. > * bpf_iter_task_vma allocates a _data struct which contains - among other > things - struct vma_iterator, using BPF allocator and keeps a pointer to > the bpf_iter_task_vma_data. This is done in order to prevent changes to > struct ma_state - which is wrapped by struct vma_iterator - from > necessitating changes to uapi struct bpf_iter_task_vma. > > Changelog: > > v6 -> v7: https://lore.kernel.org/bpf/20231010185944.3888849-1-davemarchevsky@fb.com/ > > Patch numbers correspond to their position in v6 > > Patch 2 ("selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c") > * Add Andrii ack > Patch 3 ("bpf: Introduce task_vma open-coded iterator kfuncs") > * Add Andrii ack > * Add missing __diag_ignore_all for -Wmissing-prototypes (Song) > Patch 4 ("selftests/bpf: Add tests for open-coded task_vma iter") > * Remove two unnecessary header includes (Andrii) > * Remove extraneous !vmas_seen check (Andrii) Whoops, forgot to add another bullet here: * Initialize FILE *f = NULL to address llvm-16 CI warnings (Andrii) > New Patch ("bpf: Add BPF_KFUNC_{START,END}_defs macros") > * After talking to Andrii, this is an attempt to clean up __diag_ignore_all > spam everywhere kfuncs are defined. If nontrivial changes are needed, > let's apply the other 4 and I'll respin as a standalone patch. > > v5 -> v6: https://lore.kernel.org/bpf/20231010175637.3405682-1-davemarchevsky@fb.com/ > > Patch 4 ("selftests/bpf: Add tests for open-coded task_vma iter") > * Remove extraneous blank line. I did this manually to the .patch file > for v5, which caused BPF CI to complain about failing to apply the > series > > v4 -> v5: https://lore.kernel.org/bpf/20231002195341.2940874-1-davemarchevsky@fb.com/ > > Patch numbers correspond to their position in v4 > > New Patch ("selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c") > * Patch 2's renaming of this selftest, and associated changes in the > userspace runner, are split out into this separate commit (Andrii) > > Patch 2 ("bpf: Introduce task_vma open-coded iterator kfuncs") > * Remove bpf_iter_task_vma kfuncs from libbpf's bpf_helpers.h, they'll be > added to selftests' bpf_experimental.h in selftests patch below (Andrii) > * Split bpf_iter_task_vma.c renaming into separate commit (Andrii) > > Patch 3 ("selftests/bpf: Add tests for open-coded task_vma iter") > * Add bpf_iter_task_vma kfuncs to bpf_experimental.h (Andrii) > * Remove '?' from prog SEC, open_and_load the skel in one operation (Andrii) > * Ensure that fclose() always happens in test runner (Andrii) > * Use global var w/ 1000 (vm_start, vm_end) structs instead of two > MAP_TYPE_ARRAY's w/ 1k u64s each (Andrii) > > > v3 -> v4: https://lore.kernel.org/bpf/20230822050558.2937659-1-davemarchevsky@fb.com/ > > Patch 1 ("bpf: Don't explicitly emit BTF for struct btf_iter_num") > * Add Andrii ack > Patch 2 ("bpf: Introduce task_vma open-coded iterator kfuncs") > * Mark bpf_iter_task_vma_new args KF_RCU and remove now-unnecessary !task > check (Yonghong) > * Although KF_RCU is a function-level flag, in reality it only applies to > the task_struct *task parameter, as the other two params are a scalar int > and a specially-handled KF_ARG_PTR_TO_ITER > * Remove struct bpf_iter_task_vma definition from uapi headers, define in > kernel/bpf/task_iter.c instead (Andrii) > Patch 3 ("selftests/bpf: Add tests for open-coded task_vma iter") > * Use a local var when looping over vmas to track map idx. Update vmas_seen > global after done iterating. Don't start iterating or update vmas_seen if > vmas_seen global is nonzero. (Andrii) > * Move getpgid() call to correct spot - above skel detach. (Andrii) > > v2 -> v3: https://lore.kernel.org/bpf/20230821173415.1970776-1-davemarchevsky@fb.com/ > > Patch 1 ("bpf: Don't explicitly emit BTF for struct btf_iter_num") > * Add Yonghong ack > > Patch 2 ("bpf: Introduce task_vma open-coded iterator kfuncs") > * UAPI bpf header and tools/ version should match > * Add bpf_iter_task_vma_kern_data which bpf_iter_task_vma_kern points to, > bpf_mem_alloc/free it instead of just vma_iterator. (Alexei) > * Inner data ptr == NULL implies initialization failed > > > v1 -> v2: https://lore.kernel.org/bpf/20230810183513.684836-1-davemarchevsky@fb.com/ > * Patch 1 > * Now removes the unnecessary BTF_TYPE_EMIT instead of changing the > type (Yonghong) > * Patch 2 > * Don't do unnecessary BTF_TYPE_EMIT (Yonghong) > * Bump task refcount to prevent ->mm reuse (Yonghong) > * Keep a pointer to vma_iterator in bpf_iter_task_vma, alloc/free > via BPF mem allocator (Yonghong, Stanislav) > * Patch 3 > > [0]: https://lore.kernel.org/bpf/20230801145414.418145-1-davemarchevsky@fb.com/ > > Dave Marchevsky (5): > bpf: Don't explicitly emit BTF for struct btf_iter_num > selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c > bpf: Introduce task_vma open-coded iterator kfuncs > selftests/bpf: Add tests for open-coded task_vma iter > bpf: Add BPF_KFUNC_{START,END}_defs macros > > include/linux/btf.h | 7 ++ > kernel/bpf/bpf_iter.c | 8 +- > kernel/bpf/cpumask.c | 6 +- > kernel/bpf/helpers.c | 9 +- > kernel/bpf/map_iter.c | 6 +- > kernel/bpf/task_iter.c | 89 +++++++++++++++++++ > kernel/trace/bpf_trace.c | 6 +- > net/bpf/test_run.c | 7 +- > net/core/filter.c | 13 +-- > net/core/xdp.c | 6 +- > net/ipv4/fou_bpf.c | 6 +- > net/netfilter/nf_conntrack_bpf.c | 6 +- > net/netfilter/nf_nat_bpf.c | 6 +- > net/xfrm/xfrm_interface_bpf.c | 6 +- > .../testing/selftests/bpf/bpf_experimental.h | 8 ++ > .../selftests/bpf/prog_tests/bpf_iter.c | 26 +++--- > .../testing/selftests/bpf/prog_tests/iters.c | 58 ++++++++++++ > ...f_iter_task_vma.c => bpf_iter_task_vmas.c} | 0 > .../selftests/bpf/progs/iters_task_vma.c | 43 +++++++++ > 19 files changed, 248 insertions(+), 68 deletions(-) > rename tools/testing/selftests/bpf/progs/{bpf_iter_task_vma.c => bpf_iter_task_vmas.c} (100%) > create mode 100644 tools/testing/selftests/bpf/progs/iters_task_vma.c >
On Fri, Oct 13, 2023 at 1:48 PM David Marchevsky <david.marchevsky@linux.dev> wrote: > > On 10/13/23 4:44 PM, Dave Marchevsky wrote: > > At Meta we have a profiling daemon which periodically collects > > information on many hosts. This collection usually involves grabbing > > stacks (user and kernel) using perf_event BPF progs and later symbolicating > > them. For user stacks we try to use BPF_F_USER_BUILD_ID and rely on > > remote symbolication, but BPF_F_USER_BUILD_ID doesn't always succeed. In > > those cases we must fall back to digging around in /proc/PID/maps to map > > virtual address to (binary, offset). The /proc/PID/maps digging does not > > occur synchronously with stack collection, so the process might already > > be gone, in which case it won't have /proc/PID/maps and we will fail to > > symbolicate. > > > > This 'exited process problem' doesn't occur very often as > > most of the prod services we care to profile are long-lived daemons, but > > there are enough usecases to warrant a workaround: a BPF program which > > can be optionally loaded at data collection time and essentially walks > > /proc/PID/maps. Currently this is done by walking the vma list: > > > > struct vm_area_struct* mmap = BPF_CORE_READ(mm, mmap); > > mmap_next = BPF_CORE_READ(rmap, vm_next); /* in a loop */ > > > > Since commit 763ecb035029 ("mm: remove the vma linked list") there's no > > longer a vma linked list to walk. Walking the vma maple tree is not as > > simple as hopping struct vm_area_struct->vm_next. Luckily, > > commit f39af05949a4 ("mm: add VMA iterator"), another commit in that series, > > added struct vma_iterator and for_each_vma macro for easy vma iteration. If > > similar functionality was exposed to BPF programs, it would be perfect for our > > usecase. > > > > This series adds such functionality, specifically a BPF equivalent of > > for_each_vma using the open-coded iterator style. > > > > Notes: > > * This approach was chosen after discussion on a previous series [0] which > > attempted to solve the same problem by adding a BPF_F_VMA_NEXT flag to > > bpf_find_vma. > > * Unlike the task_vma bpf_iter, the open-coded iterator kfuncs here do not > > drop the vma read lock between iterations. See Alexei's response in [0]. > > * The [vsyscall] page isn't really part of task->mm's vmas, but > > /proc/PID/maps returns information about it anyways. The vma iter added > > here does not do the same. See comment on selftest in patch 3. > > * bpf_iter_task_vma allocates a _data struct which contains - among other > > things - struct vma_iterator, using BPF allocator and keeps a pointer to > > the bpf_iter_task_vma_data. This is done in order to prevent changes to > > struct ma_state - which is wrapped by struct vma_iterator - from > > necessitating changes to uapi struct bpf_iter_task_vma. > > > > Changelog: > > > > v6 -> v7: https://lore.kernel.org/bpf/20231010185944.3888849-1-davemarchevsky@fb.com/ > > > > Patch numbers correspond to their position in v6 > > > > Patch 2 ("selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c") > > * Add Andrii ack > > Patch 3 ("bpf: Introduce task_vma open-coded iterator kfuncs") > > * Add Andrii ack > > * Add missing __diag_ignore_all for -Wmissing-prototypes (Song) > > Patch 4 ("selftests/bpf: Add tests for open-coded task_vma iter") > > * Remove two unnecessary header includes (Andrii) > > * Remove extraneous !vmas_seen check (Andrii) > > Whoops, forgot to add another bullet here: > * Initialize FILE *f = NULL to address llvm-16 CI warnings (Andrii) > > > New Patch ("bpf: Add BPF_KFUNC_{START,END}_defs macros") > > * After talking to Andrii, this is an attempt to clean up __diag_ignore_all > > spam everywhere kfuncs are defined. If nontrivial changes are needed, > > let's apply the other 4 and I'll respin as a standalone patch. > > > > v5 -> v6: https://lore.kernel.org/bpf/20231010175637.3405682-1-davemarchevsky@fb.com/ > > > > Patch 4 ("selftests/bpf: Add tests for open-coded task_vma iter") > > * Remove extraneous blank line. I did this manually to the .patch file > > for v5, which caused BPF CI to complain about failing to apply the > > series > > > > v4 -> v5: https://lore.kernel.org/bpf/20231002195341.2940874-1-davemarchevsky@fb.com/ > > > > Patch numbers correspond to their position in v4 > > > > New Patch ("selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c") > > * Patch 2's renaming of this selftest, and associated changes in the > > userspace runner, are split out into this separate commit (Andrii) > > > > Patch 2 ("bpf: Introduce task_vma open-coded iterator kfuncs") > > * Remove bpf_iter_task_vma kfuncs from libbpf's bpf_helpers.h, they'll be > > added to selftests' bpf_experimental.h in selftests patch below (Andrii) > > * Split bpf_iter_task_vma.c renaming into separate commit (Andrii) > > > > Patch 3 ("selftests/bpf: Add tests for open-coded task_vma iter") > > * Add bpf_iter_task_vma kfuncs to bpf_experimental.h (Andrii) > > * Remove '?' from prog SEC, open_and_load the skel in one operation (Andrii) > > * Ensure that fclose() always happens in test runner (Andrii) > > * Use global var w/ 1000 (vm_start, vm_end) structs instead of two > > MAP_TYPE_ARRAY's w/ 1k u64s each (Andrii) > > > > > > v3 -> v4: https://lore.kernel.org/bpf/20230822050558.2937659-1-davemarchevsky@fb.com/ > > > > Patch 1 ("bpf: Don't explicitly emit BTF for struct btf_iter_num") > > * Add Andrii ack > > Patch 2 ("bpf: Introduce task_vma open-coded iterator kfuncs") > > * Mark bpf_iter_task_vma_new args KF_RCU and remove now-unnecessary !task > > check (Yonghong) > > * Although KF_RCU is a function-level flag, in reality it only applies to > > the task_struct *task parameter, as the other two params are a scalar int > > and a specially-handled KF_ARG_PTR_TO_ITER > > * Remove struct bpf_iter_task_vma definition from uapi headers, define in > > kernel/bpf/task_iter.c instead (Andrii) > > Patch 3 ("selftests/bpf: Add tests for open-coded task_vma iter") > > * Use a local var when looping over vmas to track map idx. Update vmas_seen > > global after done iterating. Don't start iterating or update vmas_seen if > > vmas_seen global is nonzero. (Andrii) > > * Move getpgid() call to correct spot - above skel detach. (Andrii) > > > > v2 -> v3: https://lore.kernel.org/bpf/20230821173415.1970776-1-davemarchevsky@fb.com/ > > > > Patch 1 ("bpf: Don't explicitly emit BTF for struct btf_iter_num") > > * Add Yonghong ack > > > > Patch 2 ("bpf: Introduce task_vma open-coded iterator kfuncs") > > * UAPI bpf header and tools/ version should match > > * Add bpf_iter_task_vma_kern_data which bpf_iter_task_vma_kern points to, > > bpf_mem_alloc/free it instead of just vma_iterator. (Alexei) > > * Inner data ptr == NULL implies initialization failed > > > > > > v1 -> v2: https://lore.kernel.org/bpf/20230810183513.684836-1-davemarchevsky@fb.com/ > > * Patch 1 > > * Now removes the unnecessary BTF_TYPE_EMIT instead of changing the > > type (Yonghong) > > * Patch 2 > > * Don't do unnecessary BTF_TYPE_EMIT (Yonghong) > > * Bump task refcount to prevent ->mm reuse (Yonghong) > > * Keep a pointer to vma_iterator in bpf_iter_task_vma, alloc/free > > via BPF mem allocator (Yonghong, Stanislav) > > * Patch 3 > > > > [0]: https://lore.kernel.org/bpf/20230801145414.418145-1-davemarchevsky@fb.com/ > > > > Dave Marchevsky (5): > > bpf: Don't explicitly emit BTF for struct btf_iter_num > > selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c > > bpf: Introduce task_vma open-coded iterator kfuncs > > selftests/bpf: Add tests for open-coded task_vma iter > > bpf: Add BPF_KFUNC_{START,END}_defs macros I dropped patch #5 from the series and applied the rest, thanks. For BPF_KFUNC_{START,END}_DEFS, I like the clean up, but I feel like we might want to follow similar function-call like syntax as with __diag_push() and make sure we use semicolon to terminate their "invocations". If you agree, please adjust and submit separately, I'd be curious to see what others think. Also, keep in mind that __diag_ignore_all("-Wmissing-prototypes") that we use right now might not be enough. We started getting more and more reports with different warnings recently, so we probably would need to ignore more warnings for kfuncs: kernel/bpf/helpers.c:1906:19: warning: no previous declaration for 'bpf_percpu_obj_new_impl' [-Wmissing-declarations] kernel/bpf/helpers.c:1942:18: warning: no previous declaration for 'bpf_percpu_obj_drop_impl' [-Wmissing-declarations] kernel/bpf/helpers.c:2477:18: warning: no previous declaration for 'bpf_throw' [-Wmissing-declarations] Which just means that this refactoring is even more important and timely :) > > > > include/linux/btf.h | 7 ++ > > kernel/bpf/bpf_iter.c | 8 +- > > kernel/bpf/cpumask.c | 6 +- > > kernel/bpf/helpers.c | 9 +- > > kernel/bpf/map_iter.c | 6 +- > > kernel/bpf/task_iter.c | 89 +++++++++++++++++++ > > kernel/trace/bpf_trace.c | 6 +- > > net/bpf/test_run.c | 7 +- > > net/core/filter.c | 13 +-- > > net/core/xdp.c | 6 +- > > net/ipv4/fou_bpf.c | 6 +- > > net/netfilter/nf_conntrack_bpf.c | 6 +- > > net/netfilter/nf_nat_bpf.c | 6 +- > > net/xfrm/xfrm_interface_bpf.c | 6 +- > > .../testing/selftests/bpf/bpf_experimental.h | 8 ++ > > .../selftests/bpf/prog_tests/bpf_iter.c | 26 +++--- > > .../testing/selftests/bpf/prog_tests/iters.c | 58 ++++++++++++ > > ...f_iter_task_vma.c => bpf_iter_task_vmas.c} | 0 > > .../selftests/bpf/progs/iters_task_vma.c | 43 +++++++++ > > 19 files changed, 248 insertions(+), 68 deletions(-) > > rename tools/testing/selftests/bpf/progs/{bpf_iter_task_vma.c => bpf_iter_task_vmas.c} (100%) > > create mode 100644 tools/testing/selftests/bpf/progs/iters_task_vma.c > >
Hello: This series was applied to bpf/bpf-next.git (master) by Andrii Nakryiko <andrii@kernel.org>: On Fri, 13 Oct 2023 13:44:21 -0700 you wrote: > At Meta we have a profiling daemon which periodically collects > information on many hosts. This collection usually involves grabbing > stacks (user and kernel) using perf_event BPF progs and later symbolicating > them. For user stacks we try to use BPF_F_USER_BUILD_ID and rely on > remote symbolication, but BPF_F_USER_BUILD_ID doesn't always succeed. In > those cases we must fall back to digging around in /proc/PID/maps to map > virtual address to (binary, offset). The /proc/PID/maps digging does not > occur synchronously with stack collection, so the process might already > be gone, in which case it won't have /proc/PID/maps and we will fail to > symbolicate. > > [...] Here is the summary with links: - [v7,bpf-next,1/5] bpf: Don't explicitly emit BTF for struct btf_iter_num https://git.kernel.org/bpf/bpf-next/c/f10ca5da5bd7 - [v7,bpf-next,2/5] selftests/bpf: Rename bpf_iter_task_vma.c to bpf_iter_task_vmas.c https://git.kernel.org/bpf/bpf-next/c/45b38941c81f - [v7,bpf-next,3/5] bpf: Introduce task_vma open-coded iterator kfuncs https://git.kernel.org/bpf/bpf-next/c/4ac454682158 - [v7,bpf-next,4/5] selftests/bpf: Add tests for open-coded task_vma iter https://git.kernel.org/bpf/bpf-next/c/e0e1a7a5fc37 - [v7,bpf-next,5/5] bpf: Add BPF_KFUNC_{START,END}_defs macros (no matching commit) You are awesome, thank you!