Message ID | 20211104070016.2463668-2-songliubraving@fb.com (mailing list archive)
---|---
State | Superseded
Delegated to: | BPF
Series | introduce bpf_find_vma
On 11/4/21 12:00 AM, Song Liu wrote:
> In some profiler use cases, it is necessary to map an address to the
> backing file, e.g., a shared library. bpf_find_vma helper provides a
> flexible way to achieve this. bpf_find_vma maps an address of a task to
> the vma (vm_area_struct) for this address, and feed the vma to an callback
> BPF function. The callback function is necessary here, as we need to
> ensure mmap_sem is unlocked.
>
> It is necessary to lock mmap_sem for find_vma. To lock and unlock mmap_sem
> safely when irqs are disable, we use the same mechanism as stackmap with
> build_id. Specifically, when irqs are disabled, the unlocked is postponed
> in an irq_work. Refactor stackmap.c so that the irq_work is shared among
> bpf_find_vma and stackmap helpers.
>
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
>  include/linux/bpf.h            |  1 +
>  include/uapi/linux/bpf.h       | 20 ++++++++++
>  kernel/bpf/mmap_unlock_work.h  | 65 +++++++++++++++++++++++++++++++
>  kernel/bpf/stackmap.c          | 71 ++++++++--------------------------
>  kernel/bpf/task_iter.c         | 49 +++++++++++++++++++++++
>  kernel/bpf/verifier.c          | 36 +++++++++++++++++
>  kernel/trace/bpf_trace.c       |  2 +
>  tools/include/uapi/linux/bpf.h | 20 ++++++++++
>  8 files changed, 209 insertions(+), 55 deletions(-)
>  create mode 100644 kernel/bpf/mmap_unlock_work.h
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 2be6dfd68df99..df3410bff4b06 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -2157,6 +2157,7 @@ extern const struct bpf_func_proto bpf_btf_find_by_name_kind_proto;
>  extern const struct bpf_func_proto bpf_sk_setsockopt_proto;
>  extern const struct bpf_func_proto bpf_sk_getsockopt_proto;
>  extern const struct bpf_func_proto bpf_kallsyms_lookup_name_proto;
> +extern const struct bpf_func_proto bpf_find_vma_proto;
>
>  const struct bpf_func_proto *tracing_prog_func_proto(
>  	enum bpf_func_id func_id, const struct bpf_prog *prog);
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index ba5af15e25f5c..22fa7b74de451 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -4938,6 +4938,25 @@ union bpf_attr {
>   *		**-ENOENT** if symbol is not found.
>   *
>   *		**-EPERM** if caller does not have permission to obtain kernel address.
> + *
> + * long bpf_find_vma(struct task_struct *task, u64 addr, void *callback_fn, void *callback_ctx, u64 flags)
> + *	Description
> + *		Find vma of *task* that contains *addr*, call *callback_fn*
> + *		function with *task*, *vma*, and *callback_ctx*.
> + *		The *callback_fn* should be a static function and
> + *		the *callback_ctx* should be a pointer to the stack.
> + *		The *flags* is used to control certain aspects of the helper.
> + *		Currently, the *flags* must be 0.
> + *
> + *		The expected callback signature is
> + *
> + *		long (\*callback_fn)(struct task_struct \*task, struct vm_area_struct \*vma, void \*ctx);

ctx => callback_ctx

this should make it clear what this 'ctx' is.

> + *
> + *	Return
> + *		0 on success.
> + *		**-ENOENT** if *task->mm* is NULL, or no vma contains *addr*.
> + *		**-EBUSY** if failed to try lock mmap_lock.
> + *		**-EINVAL** for invalid **flags**.
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -5120,6 +5139,7 @@ union bpf_attr {
>  	FN(trace_vprintk),              \
>  	FN(skc_to_unix_sock),           \
>  	FN(kallsyms_lookup_name),       \
> +	FN(find_vma),			\
>  	/* */
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper

[...]

> +
>  static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
>  					  u64 *ips, u32 trace_nr, bool user)
>  {
> -	int i;
> +	struct mmap_unlock_irq_work *work = NULL;
> +	bool irq_work_busy = bpf_mmap_unlock_get_irq_work(&work);
>  	struct vm_area_struct *vma;
> -	bool irq_work_busy = false;
> -	struct stack_map_irq_work *work = NULL;
> -
> -	if (irqs_disabled()) {
> -		if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
> -			work = this_cpu_ptr(&up_read_work);
> -			if (irq_work_is_busy(&work->irq_work)) {
> -				/* cannot queue more up_read, fallback */
> -				irq_work_busy = true;
> -			}
> -		} else {
> -			/*
> -			 * PREEMPT_RT does not allow to trylock mmap sem in
> -			 * interrupt disabled context. Force the fallback code.
> -			 */
> -			irq_work_busy = true;
> -		}
> -	}
> +	int i;

I think moving 'int i' is unnecessary here since there is no functionality
change.

>
> -	/*
> -	 * We cannot do up_read() when the irq is disabled, because of
> -	 * risk to deadlock with rq_lock. To do build_id lookup when the
> -	 * irqs are disabled, we need to run up_read() in irq_work. We use
> -	 * a percpu variable to do the irq_work. If the irq_work is
> -	 * already used by another lookup, we fall back to report ips.
> -	 *
> -	 * Same fallback is used for kernel stack (!user) on a stackmap
> -	 * with build_id.
> +	/* If the irq_work is in use, fall back to report ips. Same
> +	 * fallback is used for kernel stack (!user) on a stackmap with
> +	 * build_id.
>  	 */
>  	if (!user || !current || !current->mm || irq_work_busy ||
>  	    !mmap_read_trylock(current->mm)) {

[...]

> +
> +	irq_work_busy = bpf_mmap_unlock_get_irq_work(&work);
> +
> +	if (irq_work_busy || !mmap_read_trylock(mm))
> +		return -EBUSY;
> +
> +	vma = find_vma(mm, start);
> +
> +	if (vma && vma->vm_start <= start && vma->vm_end > start) {
> +		callback_fn((u64)(long)task, (u64)(long)vma,
> +			    (u64)(long)callback_ctx, 0, 0);
> +		ret = 0;
> +	}
> +	bpf_mmap_unlock_mm(work, mm);
> +	return ret;
> +}
> +
> +BTF_ID_LIST_SINGLE(btf_find_vma_ids, struct, task_struct)

We have global btf_task_struct_ids, maybe reuse it?

> +
> +const struct bpf_func_proto bpf_find_vma_proto = {
> +	.func		= bpf_find_vma,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_BTF_ID,
> +	.arg1_btf_id	= &btf_find_vma_ids[0],
> +	.arg2_type	= ARG_ANYTHING,
> +	.arg3_type	= ARG_PTR_TO_FUNC,
> +	.arg4_type	= ARG_PTR_TO_STACK_OR_NULL,
> +	.arg5_type	= ARG_ANYTHING,
> +};
> +
>  static int __init task_iter_init(void)
>  {
>  	int ret;
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index f0dca726ebfde..a65526112924a 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -6132,6 +6132,35 @@ static int set_timer_callback_state(struct bpf_verifier_env *env,
>  	return 0;
>  }
>
> +BTF_ID_LIST_SINGLE(btf_set_find_vma_ids, struct, vm_area_struct)

In task_iter.c, we have
    BTF_ID_LIST(btf_task_file_ids)
    BTF_ID(struct, file)
    BTF_ID(struct, vm_area_struct)
Maybe it is worthwhile to separate them so we can put vm_area_struct as
global to be reused.

> +
> +static int set_find_vma_callback_state(struct bpf_verifier_env *env,
> +				       struct bpf_func_state *caller,
> +				       struct bpf_func_state *callee,
> +				       int insn_idx)
> +{
> +	/* bpf_find_vma(struct task_struct *task, u64 start,

start => addr ?

> +	 *               void *callback_fn, void *callback_ctx, u64 flags)
> +	 * (callback_fn)(struct task_struct *task,
> +	 *               struct vm_area_struct *vma, void *ctx);

ctx => callback_ctx ?

> +	 */
> +	callee->regs[BPF_REG_1] = caller->regs[BPF_REG_1];
> +
> +	callee->regs[BPF_REG_2].type = PTR_TO_BTF_ID;
> +	__mark_reg_known_zero(&callee->regs[BPF_REG_2]);
> +	callee->regs[BPF_REG_2].btf = btf_vmlinux;
> +	callee->regs[BPF_REG_2].btf_id = btf_set_find_vma_ids[0];
> +
> +	/* pointer to stack or null */
> +	callee->regs[BPF_REG_3] = caller->regs[BPF_REG_4];
> +
> +	/* unused */
> +	__mark_reg_not_init(env, &callee->regs[BPF_REG_4]);
> +	__mark_reg_not_init(env, &callee->regs[BPF_REG_5]);
> +	callee->in_callback_fn = true;
> +	return 0;
> +}
> +

[...]
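A minimal sketch of the consolidation suggested above, assuming the BTF_ID_LIST()/BTF_ID_LIST_GLOBAL() macros from <linux/btf_ids.h>; the global list name btf_vma_ids is hypothetical, and the macro shapes vary slightly across kernel versions:

	/* kernel/bpf/task_iter.c: keep 'file' in a local list, but export
	 * vm_area_struct's BTF ID so verifier.c can reuse it instead of
	 * declaring its own BTF_ID_LIST_SINGLE. Names are illustrative.
	 */
	BTF_ID_LIST(btf_task_file_ids)
	BTF_ID(struct, file)

	BTF_ID_LIST_GLOBAL(btf_vma_ids)
	BTF_ID(struct, vm_area_struct)

set_find_vma_callback_state() could then use btf_vma_ids[0], and bpf_find_vma's arg1_btf_id could likewise point at the existing global btf_task_struct_ids[0] instead of a new task_struct list.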
From: kernel test robot <lkp@intel.com>

Hi Song,
I love your patch! Yet something to improve:
[auto build test ERROR on bpf-next/master]
url: https://github.com/0day-ci/linux/commits/Song-Liu/introduce-bpf_find_vma/20211104-150210
base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: riscv-randconfig-r034-20211104 (attached as .config)
compiler: riscv64-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/e219efba8d04dede08b1d87fa1e8c5c01180caaf
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Song-Liu/introduce-bpf_find_vma/20211104-150210
git checkout e219efba8d04dede08b1d87fa1e8c5c01180caaf
# save the attached .config to linux build tree
mkdir build_dir
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=riscv SHELL=/bin/bash
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
All errors (new ones prefixed by >>):
riscv64-linux-ld: kernel/bpf/task_iter.o: in function `bpf_find_vma':
task_iter.c:(.text+0x3dc): undefined reference to `mmap_unlock_work'
>> riscv64-linux-ld: task_iter.c:(.text+0x3e0): undefined reference to `mmap_unlock_work'
riscv64-linux-ld: task_iter.c:(.text+0x400): undefined reference to `mmap_unlock_work'
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
> On Nov 4, 2021, at 1:37 PM, kernel test robot <lkp@intel.com> wrote:
>
> Hi Song,
>
> I love your patch! Yet something to improve:
>
> [auto build test ERROR on bpf-next/master]
>
> url:    https://github.com/0day-ci/linux/commits/Song-Liu/introduce-bpf_find_vma/20211104-150210
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
> config: riscv-randconfig-r034-20211104 (attached as .config)
> compiler: riscv64-linux-gcc (GCC) 11.2.0
> reproduce (this is a W=1 build):
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # https://github.com/0day-ci/linux/commit/e219efba8d04dede08b1d87fa1e8c5c01180caaf
>         git remote add linux-review https://github.com/0day-ci/linux
>         git fetch --no-tags linux-review Song-Liu/introduce-bpf_find_vma/20211104-150210
>         git checkout e219efba8d04dede08b1d87fa1e8c5c01180caaf
>         # save the attached .config to linux build tree
>         mkdir build_dir
>         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=riscv SHELL=/bin/bash
>
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot <lkp@intel.com>
>
> All errors (new ones prefixed by >>):
>
>    riscv64-linux-ld: kernel/bpf/task_iter.o: in function `bpf_find_vma':
>    task_iter.c:(.text+0x3dc): undefined reference to `mmap_unlock_work'
>>> riscv64-linux-ld: task_iter.c:(.text+0x3e0): undefined reference to `mmap_unlock_work'
>    riscv64-linux-ld: task_iter.c:(.text+0x400): undefined reference to `mmap_unlock_work'

Sigh, I didn't see this before sending v3. Will fix in v4.

>
> ---
> 0-DAY CI Kernel Test Service, Intel Corporation
> https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
> <.config.gz>
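The undefined references are consistent with v3 placing DEFINE_PER_CPU(struct mmap_unlock_irq_work, mmap_unlock_work) in stackmap.c: kernel/bpf/Makefile builds stackmap.o only when CONFIG_PERF_EVENTS=y, while task_iter.o, which now references the variable through the bpf_mmap_unlock_*() helpers, is built for every CONFIG_BPF_SYSCALL=y kernel, and this riscv randconfig evidently lacks CONFIG_PERF_EVENTS. One plausible fix, an assumption here rather than necessarily what v4 does, is to move the definition and its irq_work handler into task_iter.c:

	/* kernel/bpf/task_iter.c is built whenever CONFIG_BPF_SYSCALL=y, so
	 * defining the per-CPU work here keeps the symbol available even
	 * when stackmap.o (built only with CONFIG_PERF_EVENTS=y) is absent.
	 */
	DEFINE_PER_CPU(struct mmap_unlock_irq_work, mmap_unlock_work);

	static void do_mmap_read_unlock(struct irq_work *entry)
	{
		struct mmap_unlock_irq_work *work;

		if (WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_RT)))
			return;

		work = container_of(entry, struct mmap_unlock_irq_work, irq_work);
		mmap_read_unlock_non_owner(work->mm);
	}

with the matching init_irq_work() loop moved from stack_map_init() into task_iter_init().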
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 2be6dfd68df99..df3410bff4b06 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2157,6 +2157,7 @@ extern const struct bpf_func_proto bpf_btf_find_by_name_kind_proto;
 extern const struct bpf_func_proto bpf_sk_setsockopt_proto;
 extern const struct bpf_func_proto bpf_sk_getsockopt_proto;
 extern const struct bpf_func_proto bpf_kallsyms_lookup_name_proto;
+extern const struct bpf_func_proto bpf_find_vma_proto;
 
 const struct bpf_func_proto *tracing_prog_func_proto(
 	enum bpf_func_id func_id, const struct bpf_prog *prog);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ba5af15e25f5c..22fa7b74de451 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4938,6 +4938,25 @@ union bpf_attr {
  *		**-ENOENT** if symbol is not found.
  *
  *		**-EPERM** if caller does not have permission to obtain kernel address.
+ *
+ * long bpf_find_vma(struct task_struct *task, u64 addr, void *callback_fn, void *callback_ctx, u64 flags)
+ *	Description
+ *		Find vma of *task* that contains *addr*, call *callback_fn*
+ *		function with *task*, *vma*, and *callback_ctx*.
+ *		The *callback_fn* should be a static function and
+ *		the *callback_ctx* should be a pointer to the stack.
+ *		The *flags* is used to control certain aspects of the helper.
+ *		Currently, the *flags* must be 0.
+ *
+ *		The expected callback signature is
+ *
+ *		long (\*callback_fn)(struct task_struct \*task, struct vm_area_struct \*vma, void \*ctx);
+ *
+ *	Return
+ *		0 on success.
+ *		**-ENOENT** if *task->mm* is NULL, or no vma contains *addr*.
+ *		**-EBUSY** if failed to try lock mmap_lock.
+ *		**-EINVAL** for invalid **flags**.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5120,6 +5139,7 @@ union bpf_attr {
 	FN(trace_vprintk),              \
 	FN(skc_to_unix_sock),           \
 	FN(kallsyms_lookup_name),       \
+	FN(find_vma),			\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/mmap_unlock_work.h b/kernel/bpf/mmap_unlock_work.h
new file mode 100644
index 0000000000000..5d18d7d85bef9
--- /dev/null
+++ b/kernel/bpf/mmap_unlock_work.h
@@ -0,0 +1,65 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright (c) 2021 Facebook
+ */
+
+#ifndef __MMAP_UNLOCK_WORK_H__
+#define __MMAP_UNLOCK_WORK_H__
+#include <linux/irq_work.h>
+
+/* irq_work to run mmap_read_unlock() in irq_work */
+struct mmap_unlock_irq_work {
+	struct irq_work irq_work;
+	struct mm_struct *mm;
+};
+
+DECLARE_PER_CPU(struct mmap_unlock_irq_work, mmap_unlock_work);
+
+/*
+ * We cannot do mmap_read_unlock() when the irq is disabled, because of
+ * risk to deadlock with rq_lock. To look up vma when the irqs are
+ * disabled, we need to run mmap_read_unlock() in irq_work. We use a
+ * percpu variable to do the irq_work. If the irq_work is already used
+ * by another lookup, we fall over.
+ */
+static inline bool bpf_mmap_unlock_get_irq_work(struct mmap_unlock_irq_work **work_ptr)
+{
+	struct mmap_unlock_irq_work *work = NULL;
+	bool irq_work_busy = false;
+
+	if (irqs_disabled()) {
+		if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
+			work = this_cpu_ptr(&mmap_unlock_work);
+			if (irq_work_is_busy(&work->irq_work)) {
+				/* cannot queue more up_read, fallback */
+				irq_work_busy = true;
+			}
+		} else {
+			/*
+			 * PREEMPT_RT does not allow to trylock mmap sem in
+			 * interrupt disabled context. Force the fallback code.
+			 */
+			irq_work_busy = true;
+		}
+	}
+
+	*work_ptr = work;
+	return irq_work_busy;
+}
+
+static inline void bpf_mmap_unlock_mm(struct mmap_unlock_irq_work *work, struct mm_struct *mm)
+{
+	if (!work) {
+		mmap_read_unlock(mm);
+	} else {
+		work->mm = mm;
+
+		/* The lock will be released once we're out of interrupt
+		 * context. Tell lockdep that we've released it now so
+		 * it doesn't complain that we forgot to release it.
+		 */
+		rwsem_release(&mm->mmap_lock.dep_map, _RET_IP_);
+		irq_work_queue(&work->irq_work);
+	}
+}
+
+#endif /* __MMAP_UNLOCK_WORK_H__ */
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 6e75bbee39f0b..77e366f69212b 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -7,10 +7,10 @@
 #include <linux/kernel.h>
 #include <linux/stacktrace.h>
 #include <linux/perf_event.h>
-#include <linux/irq_work.h>
 #include <linux/btf_ids.h>
 #include <linux/buildid.h>
 #include "percpu_freelist.h"
+#include "mmap_unlock_work.h"
 
 #define STACK_CREATE_FLAG_MASK					\
 	(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY |	\
@@ -31,25 +31,19 @@ struct bpf_stack_map {
 	struct stack_map_bucket *buckets[];
 };
 
-/* irq_work to run up_read() for build_id lookup in nmi context */
-struct stack_map_irq_work {
-	struct irq_work irq_work;
-	struct mm_struct *mm;
-};
+DEFINE_PER_CPU(struct mmap_unlock_irq_work, mmap_unlock_work);
 
-static void do_up_read(struct irq_work *entry)
+static void do_mmap_read_unlock(struct irq_work *entry)
 {
-	struct stack_map_irq_work *work;
+	struct mmap_unlock_irq_work *work;
 
 	if (WARN_ON_ONCE(IS_ENABLED(CONFIG_PREEMPT_RT)))
 		return;
 
-	work = container_of(entry, struct stack_map_irq_work, irq_work);
+	work = container_of(entry, struct mmap_unlock_irq_work, irq_work);
 	mmap_read_unlock_non_owner(work->mm);
 }
 
-static DEFINE_PER_CPU(struct stack_map_irq_work, up_read_work);
-
 static inline bool stack_map_use_build_id(struct bpf_map *map)
 {
 	return (map->map_flags & BPF_F_STACK_BUILD_ID);
@@ -145,39 +139,18 @@ static struct bpf_map *stack_map_alloc(union bpf_attr *attr)
 	return ERR_PTR(err);
 }
 
+
 static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
 					  u64 *ips, u32 trace_nr, bool user)
 {
-	int i;
+	struct mmap_unlock_irq_work *work = NULL;
+	bool irq_work_busy = bpf_mmap_unlock_get_irq_work(&work);
 	struct vm_area_struct *vma;
-	bool irq_work_busy = false;
-	struct stack_map_irq_work *work = NULL;
-
-	if (irqs_disabled()) {
-		if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
-			work = this_cpu_ptr(&up_read_work);
-			if (irq_work_is_busy(&work->irq_work)) {
-				/* cannot queue more up_read, fallback */
-				irq_work_busy = true;
-			}
-		} else {
-			/*
-			 * PREEMPT_RT does not allow to trylock mmap sem in
-			 * interrupt disabled context. Force the fallback code.
-			 */
-			irq_work_busy = true;
-		}
-	}
+	int i;
 
-	/*
-	 * We cannot do up_read() when the irq is disabled, because of
-	 * risk to deadlock with rq_lock. To do build_id lookup when the
-	 * irqs are disabled, we need to run up_read() in irq_work. We use
-	 * a percpu variable to do the irq_work. If the irq_work is
-	 * already used by another lookup, we fall back to report ips.
-	 *
-	 * Same fallback is used for kernel stack (!user) on a stackmap
-	 * with build_id.
+	/* If the irq_work is in use, fall back to report ips. Same
+	 * fallback is used for kernel stack (!user) on a stackmap with
+	 * build_id.
 	 */
 	if (!user || !current || !current->mm || irq_work_busy ||
 	    !mmap_read_trylock(current->mm)) {
@@ -203,19 +176,7 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
 			- vma->vm_start;
 		id_offs[i].status = BPF_STACK_BUILD_ID_VALID;
 	}
-
-	if (!work) {
-		mmap_read_unlock(current->mm);
-	} else {
-		work->mm = current->mm;
-
-		/* The lock will be released once we're out of interrupt
-		 * context. Tell lockdep that we've released it now so
-		 * it doesn't complain that we forgot to release it.
-		 */
-		rwsem_release(&current->mm->mmap_lock.dep_map, _RET_IP_);
-		irq_work_queue(&work->irq_work);
-	}
+	bpf_mmap_unlock_mm(work, current->mm);
 }
 
 static struct perf_callchain_entry *
@@ -723,11 +684,11 @@ const struct bpf_map_ops stack_trace_map_ops = {
 static int __init stack_map_init(void)
 {
 	int cpu;
-	struct stack_map_irq_work *work;
+	struct mmap_unlock_irq_work *work;
 
 	for_each_possible_cpu(cpu) {
-		work = per_cpu_ptr(&up_read_work, cpu);
-		init_irq_work(&work->irq_work, do_up_read);
+		work = per_cpu_ptr(&mmap_unlock_work, cpu);
+		init_irq_work(&work->irq_work, do_mmap_read_unlock);
 	}
 	return 0;
 }
diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
index b48750bfba5aa..9d92e14a6eea4 100644
--- a/kernel/bpf/task_iter.c
+++ b/kernel/bpf/task_iter.c
@@ -8,6 +8,7 @@
 #include <linux/fdtable.h>
 #include <linux/filter.h>
 #include <linux/btf_ids.h>
+#include "mmap_unlock_work.h"
 
 struct bpf_iter_seq_task_common {
 	struct pid_namespace *ns;
@@ -586,6 +587,54 @@ static struct bpf_iter_reg task_vma_reg_info = {
 	.seq_info		= &task_vma_seq_info,
 };
 
+BPF_CALL_5(bpf_find_vma, struct task_struct *, task, u64, start,
+	   bpf_callback_t, callback_fn, void *, callback_ctx, u64, flags)
+{
+	struct mmap_unlock_irq_work *work = NULL;
+	struct vm_area_struct *vma;
+	bool irq_work_busy = false;
+	struct mm_struct *mm;
+	int ret = -ENOENT;
+
+	if (flags)
+		return -EINVAL;
+
+	if (!task)
+		return -ENOENT;
+
+	mm = task->mm;
+	if (!mm)
+		return -ENOENT;
+
+	irq_work_busy = bpf_mmap_unlock_get_irq_work(&work);
+
+	if (irq_work_busy || !mmap_read_trylock(mm))
+		return -EBUSY;
+
+	vma = find_vma(mm, start);
+
+	if (vma && vma->vm_start <= start && vma->vm_end > start) {
+		callback_fn((u64)(long)task, (u64)(long)vma,
+			    (u64)(long)callback_ctx, 0, 0);
+		ret = 0;
+	}
+	bpf_mmap_unlock_mm(work, mm);
+	return ret;
+}
+
+BTF_ID_LIST_SINGLE(btf_find_vma_ids, struct, task_struct)
+
+const struct bpf_func_proto bpf_find_vma_proto = {
+	.func		= bpf_find_vma,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_BTF_ID,
+	.arg1_btf_id	= &btf_find_vma_ids[0],
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_PTR_TO_FUNC,
+	.arg4_type	= ARG_PTR_TO_STACK_OR_NULL,
+	.arg5_type	= ARG_ANYTHING,
+};
+
 static int __init task_iter_init(void)
 {
 	int ret;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f0dca726ebfde..a65526112924a 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -6132,6 +6132,35 @@ static int set_timer_callback_state(struct bpf_verifier_env *env,
 	return 0;
 }
 
+BTF_ID_LIST_SINGLE(btf_set_find_vma_ids, struct, vm_area_struct)
+
+static int set_find_vma_callback_state(struct bpf_verifier_env *env,
+				       struct bpf_func_state *caller,
+				       struct bpf_func_state *callee,
+				       int insn_idx)
+{
+	/* bpf_find_vma(struct task_struct *task, u64 start,
+	 *               void *callback_fn, void *callback_ctx, u64 flags)
+	 * (callback_fn)(struct task_struct *task,
+	 *               struct vm_area_struct *vma, void *ctx);
+	 */
+	callee->regs[BPF_REG_1] = caller->regs[BPF_REG_1];
+
+	callee->regs[BPF_REG_2].type = PTR_TO_BTF_ID;
+	__mark_reg_known_zero(&callee->regs[BPF_REG_2]);
+	callee->regs[BPF_REG_2].btf = btf_vmlinux;
+	callee->regs[BPF_REG_2].btf_id = btf_set_find_vma_ids[0];
+
+	/* pointer to stack or null */
+	callee->regs[BPF_REG_3] = caller->regs[BPF_REG_4];
+
+	/* unused */
+	__mark_reg_not_init(env, &callee->regs[BPF_REG_4]);
+	__mark_reg_not_init(env, &callee->regs[BPF_REG_5]);
+	callee->in_callback_fn = true;
+	return 0;
+}
+
 static int prepare_func_exit(struct bpf_verifier_env *env, int *insn_idx)
 {
 	struct bpf_verifier_state *state = env->cur_state;
@@ -6489,6 +6518,13 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 			return -EINVAL;
 	}
 
+	if (func_id == BPF_FUNC_find_vma) {
+		err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
+					set_find_vma_callback_state);
+		if (err < 0)
+			return -EINVAL;
+	}
+
 	if (func_id == BPF_FUNC_snprintf) {
 		err = check_bpf_snprintf_call(env, regs);
 		if (err < 0)
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 7396488793ff7..390176a3031ab 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1208,6 +1208,8 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_func_ip_proto_tracing;
 	case BPF_FUNC_get_branch_snapshot:
 		return &bpf_get_branch_snapshot_proto;
+	case BPF_FUNC_find_vma:
+		return &bpf_find_vma_proto;
 	case BPF_FUNC_trace_vprintk:
 		return bpf_get_trace_vprintk_proto();
 	default:
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index ba5af15e25f5c..22fa7b74de451 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4938,6 +4938,25 @@ union bpf_attr {
  *		**-ENOENT** if symbol is not found.
  *
  *		**-EPERM** if caller does not have permission to obtain kernel address.
+ *
+ * long bpf_find_vma(struct task_struct *task, u64 addr, void *callback_fn, void *callback_ctx, u64 flags)
+ *	Description
+ *		Find vma of *task* that contains *addr*, call *callback_fn*
+ *		function with *task*, *vma*, and *callback_ctx*.
+ *		The *callback_fn* should be a static function and
+ *		the *callback_ctx* should be a pointer to the stack.
+ *		The *flags* is used to control certain aspects of the helper.
+ *		Currently, the *flags* must be 0.
+ *
+ *		The expected callback signature is
+ *
+ *		long (\*callback_fn)(struct task_struct \*task, struct vm_area_struct \*vma, void \*ctx);
+ *
+ *	Return
+ *		0 on success.
+ *		**-ENOENT** if *task->mm* is NULL, or no vma contains *addr*.
+ *		**-EBUSY** if failed to try lock mmap_lock.
+ *		**-EINVAL** for invalid **flags**.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5120,6 +5139,7 @@ union bpf_attr {
 	FN(trace_vprintk),              \
 	FN(skc_to_unix_sock),           \
 	FN(kallsyms_lookup_name),       \
+	FN(find_vma),			\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
In some profiler use cases, it is necessary to map an address to the
backing file, e.g., a shared library. The bpf_find_vma helper provides a
flexible way to achieve this. bpf_find_vma maps an address of a task to
the vma (vm_area_struct) covering this address, and feeds the vma to a
callback BPF function. The callback function is necessary here, as we
need to ensure mmap_sem is unlocked.

It is necessary to lock mmap_sem for find_vma. To lock and unlock mmap_sem
safely when irqs are disabled, we use the same mechanism as stackmap with
build_id. Specifically, when irqs are disabled, the unlock is postponed
to an irq_work. Refactor stackmap.c so that the irq_work is shared between
bpf_find_vma and the stackmap helpers.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 include/linux/bpf.h            |  1 +
 include/uapi/linux/bpf.h       | 20 ++++++++++
 kernel/bpf/mmap_unlock_work.h  | 65 +++++++++++++++++++++++++++++++
 kernel/bpf/stackmap.c          | 71 ++++++++--------------------------
 kernel/bpf/task_iter.c         | 49 +++++++++++++++++++++++
 kernel/bpf/verifier.c          | 36 +++++++++++++++++
 kernel/trace/bpf_trace.c       |  2 +
 tools/include/uapi/linux/bpf.h | 20 ++++++++++
 8 files changed, 209 insertions(+), 55 deletions(-)
 create mode 100644 kernel/bpf/mmap_unlock_work.h
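The documented callback contract translates into a libbpf-style program along the lines of the sketch below. This is a hypothetical usage example, not part of the patch or its selftests: the section, all identifiers, and the user-set 'addr' variable are illustrative only.

	/* SPDX-License-Identifier: GPL-2.0 */
	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>

	/* address to look up; set by the userspace loader before attach */
	const volatile __u64 addr = 0;

	struct callback_ctx {
		unsigned long vm_start;
		unsigned long vm_end;
	};

	/* Per the uapi comment: a static callback that receives the task,
	 * the vma covering 'addr', and the caller's callback_ctx.
	 */
	static long check_vma(struct task_struct *task, struct vm_area_struct *vma,
			      struct callback_ctx *data)
	{
		data->vm_start = vma->vm_start;
		data->vm_end = vma->vm_end;
		return 0;
	}

	SEC("raw_tp/sys_enter")
	int handle_sys_enter(void *ctx)
	{
		struct task_struct *task = bpf_get_current_task_btf();
		struct callback_ctx data = {};
		long ret;

		/* callback_ctx must point into the BPF stack; flags must be 0 */
		ret = bpf_find_vma(task, addr, check_vma, &data, 0);
		if (!ret)
			bpf_printk("addr in vma [%lx, %lx)",
				   data.vm_start, data.vm_end);
		return 0;
	}

	char LICENSE[] SEC("license") = "GPL";

On failure, ret carries the documented -ENOENT/-EBUSY/-EINVAL codes, so a profiler can fall back to reporting the raw address when the mmap_lock trylock loses the race.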