Message ID | 20230801142912.55078-1-laoar.shao@gmail.com (mailing list archive)
---|---
Series | bpf: Add new bpf helper bpf_for_each_cpu
On 8/1/23 7:29 AM, Yafang Shao wrote:
> Some statistic data is stored in percpu pointer but the kernel doesn't
> aggregate it into a single value, for example, the data in struct
> psi_group_cpu.
>
> Currently, we can traverse percpu data using for_loop and bpf_per_cpu_ptr:
>
>   for_loop(nr_cpus, callback_fn, callback_ctx, 0)
>
> In the callback_fn, we retrieve the percpu pointer with bpf_per_cpu_ptr().
> The drawback is that 'nr_cpus' cannot be a variable; otherwise, it will be
> rejected by the verifier, hindering deployment, as servers may have
> different 'nr_cpus'. Using CONFIG_NR_CPUS is not ideal.
>
> Alternatively, with the bpf_cpumask family, we can obtain a task's cpumask.
> However, it requires creating a bpf_cpumask, copying the cpumask from the
> task, and then parsing the CPU IDs from it, resulting in low efficiency.
> Introducing other kfuncs like bpf_cpumask_next might be necessary.
>
> A new bpf helper, bpf_for_each_cpu, is introduced to conveniently traverse
> percpu data, covering all scenarios. It includes
> for_each_{possible, present, online}_cpu. The user can also traverse CPUs
> from a specific task, such as walking the CPUs of a cpuset cgroup when the
> task is in that cgroup.

The bpf subsystem has adopted the kfunc approach, so no new bpf helpers are
being added; you need to add a bpf_for_each_cpu kfunc instead.

But I am wondering whether we should use open-coded iterator loops instead:

  06accc8779c1 ("bpf: add support for open-coded iterator loops")

In the kernel we have a global variable nr_cpu_ids (also referenced in
kernel/bpf/helpers.c) which is used in numerous places for per-cpu data
struct access. I am wondering whether we could have bpf code like

  int nr_cpu_ids __ksym;

  struct bpf_iter_num it;
  int i = 0;

  // nr_cpu_ids is special, we can give it a range [1, CONFIG_NR_CPUS].
  bpf_iter_num_new(&it, 1, nr_cpu_ids);
  while ((v = bpf_iter_num_next(&it))) {
          /* access cpu i data */
          i++;
  }
  bpf_iter_num_destroy(&it);

From all existing open-coded iterator loops, it looks like the upper bound
has to be a constant. We might need to extend support to a bounded scalar
upper bound if it is not already there.

> In our use case, we utilize this new helper to traverse percpu psi data.
> This aids in understanding why CPU, Memory, and IO pressure data are high
> on a server or a container.
>
> Due to the __percpu annotation, clang-14+ and pahole-1.23+ are required.
>
> Yafang Shao (3):
>   bpf: Add bpf_for_each_cpu helper
>   cgroup, psi: Init root cgroup psi to psi_system
>   selftests/bpf: Add selftest for for_each_cpu
>
>  include/linux/bpf.h | 1 +
>  include/linux/psi.h | 2 +-
>  include/uapi/linux/bpf.h | 32 +++++
>  kernel/bpf/bpf_iter.c | 72 +++++++++++
>  kernel/bpf/helpers.c | 2 +
>  kernel/bpf/verifier.c | 29 ++++-
>  kernel/cgroup/cgroup.c | 5 +-
>  tools/include/uapi/linux/bpf.h | 32 +++++
>  .../selftests/bpf/prog_tests/for_each_cpu.c | 137 +++++++++++++++++++++
>  .../selftests/bpf/progs/test_for_each_cpu.c | 63 ++++++++++
>  10 files changed, 372 insertions(+), 3 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/for_each_cpu.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_for_each_cpu.c
On Wed, Aug 2, 2023 at 1:53 AM Yonghong Song <yonghong.song@linux.dev> wrote: > > > > On 8/1/23 7:29 AM, Yafang Shao wrote: > > Some statistic data is stored in percpu pointer but the kernel doesn't > > aggregate it into a single value, for example, the data in struct > > psi_group_cpu. > > > > Currently, we can traverse percpu data using for_loop and bpf_per_cpu_ptr: > > > > for_loop(nr_cpus, callback_fn, callback_ctx, 0) > > > > In the callback_fn, we retrieve the percpu pointer with bpf_per_cpu_ptr(). > > The drawback is that 'nr_cpus' cannot be a variable; otherwise, it will be > > rejected by the verifier, hindering deployment, as servers may have > > different 'nr_cpus'. Using CONFIG_NR_CPUS is not ideal. > > > > Alternatively, with the bpf_cpumask family, we can obtain a task's cpumask. > > However, it requires creating a bpf_cpumask, copying the cpumask from the > > task, and then parsing the CPU IDs from it, resulting in low efficiency. > > Introducing other kfuncs like bpf_cpumask_next might be necessary. > > > > A new bpf helper, bpf_for_each_cpu, is introduced to conveniently traverse > > percpu data, covering all scenarios. It includes > > for_each_{possible, present, online}_cpu. The user can also traverse CPUs > > from a specific task, such as walking the CPUs of a cpuset cgroup when the > > task is in that cgroup. > > The bpf subsystem has adopted kfunc approach. So there is no bpf helper > any more. Could you pls. share some background why we made this decision ? > You need to have a bpf_for_each_cpu kfunc instead. > But I am wondering whether we should use open coded iterator loops > 06accc8779c1 bpf: add support for open-coded iterator loops The usage of open-coded iterator for-loop is almost the same with bpf_loop, except that it can start from a non-zero value. > > In kernel, we have a global variable > nr_cpu_ids (also in kernel/bpf/helpers.c) > which is used in numerous places for per cpu data struct access. > > I am wondering whether we could have bpf code like > int nr_cpu_ids __ksym; > > struct bpf_iter_num it; > int i = 0; > > // nr_cpu_ids is special, we can give it a range [1, CONFIG_NR_CPUS]. > bpf_iter_num_new(&it, 1, nr_cpu_ids); > while ((v = bpf_iter_num_next(&it))) { > /* access cpu i data */ > i++; > } > bpf_iter_num_destroy(&it); > > From all existing open coded iterator loops, looks like > upper bound has to be a constant. We might need to extend support > to bounded scalar upper bound if not there. Currently the upper bound is required by both the open-coded for-loop and the bpf_loop. I think we can extend it. It can't handle the cpumask case either. for_each_cpu(cpu, mask) In the 'mask', the CPU IDs might not be continuous. In our container environment, we always use the cpuset cgroup for some critical tasks, but it is not so convenient to traverse the percpu data of this cpuset cgroup. We have to do it as follows for this case : That's why we prefer to introduce a bpf_for_each_cpu helper. It is fine if it can be implemented as a kfunc.
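The example Yafang refers to above ("We have to do it as follows for this case") did not survive in the archived message. For context, a sketch of that kind of workaround with today's primitives might look like the following. It is illustrative only, not the code from the original email: the percpu variable and its type are placeholders standing in for the psi data the series cares about, the CPU count is taken from user space (the same idea David Vernet suggests later in this thread), and the only real kernel interfaces used are the bpf_per_cpu_ptr() helper and the existing bpf_cpumask_test_cpu() kfunc from kernel/bpf/cpumask.c. Whether the verifier accepts the mask pointer as trusted is exactly the question the rest of the thread goes on to discuss.

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include "bpf_experimental.h"   /* bpf_for() open-coded iterator */

    /* Placeholder type and percpu symbol; a real program would use the
     * actual kernel variable (and its type from vmlinux BTF) instead. */
    struct some_percpu_stat { __u64 value; };
    extern struct some_percpu_stat percpu_stat __ksym;

    /* Set from user space, e.g. to libbpf_num_possible_cpus(). */
    const volatile __u32 nr_cpus = 1;

    /* Existing kfunc from kernel/bpf/cpumask.c. */
    extern bool bpf_cpumask_test_cpu(__u32 cpu, const struct cpumask *cpumask) __ksym;

    static __u64 sum_over_mask(const struct cpumask *mask)
    {
        struct some_percpu_stat *s;
        __u64 total = 0;
        __u32 cpu;

        bpf_for(cpu, 0, nr_cpus) {
            /* The mask may be sparse (e.g. a cpuset's), so test each CPU. */
            if (!bpf_cpumask_test_cpu(cpu, mask))
                continue;
            s = bpf_per_cpu_ptr(&percpu_stat, cpu);
            if (s)
                total += s->value;   /* 'value' is a placeholder field */
        }
        return total;
    }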
On Tue, Aug 1, 2023 at 7:34 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > In kernel, we have a global variable > > nr_cpu_ids (also in kernel/bpf/helpers.c) > > which is used in numerous places for per cpu data struct access. > > > > I am wondering whether we could have bpf code like > > int nr_cpu_ids __ksym; > > > > struct bpf_iter_num it; > > int i = 0; > > > > // nr_cpu_ids is special, we can give it a range [1, CONFIG_NR_CPUS]. > > bpf_iter_num_new(&it, 1, nr_cpu_ids); > > while ((v = bpf_iter_num_next(&it))) { > > /* access cpu i data */ > > i++; > > } > > bpf_iter_num_destroy(&it); > > > > From all existing open coded iterator loops, looks like > > upper bound has to be a constant. We might need to extend support > > to bounded scalar upper bound if not there. > > Currently the upper bound is required by both the open-coded for-loop > and the bpf_loop. I think we can extend it. > > It can't handle the cpumask case either. > > for_each_cpu(cpu, mask) > > In the 'mask', the CPU IDs might not be continuous. In our container > environment, we always use the cpuset cgroup for some critical tasks, > but it is not so convenient to traverse the percpu data of this cpuset > cgroup. We have to do it as follows for this case : > > That's why we prefer to introduce a bpf_for_each_cpu helper. It is > fine if it can be implemented as a kfunc. I think open-coded-iterators is the only acceptable path forward here. Since existing bpf_iter_num doesn't fit due to sparse cpumask, let's introduce bpf_iter_cpumask and few additional kfuncs that return cpu_possible_mask and others. We already have some cpumask support in kernel/bpf/cpumask.c bpf_iter_cpumask will be a natural follow up.
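To make the proposal above concrete, here is a sketch of how a bpf_iter_cpumask open-coded iterator could look from the BPF side, modeled on the existing bpf_iter_num_new()/next()/destroy() pattern. None of these kfuncs exist at the time of this discussion; the iterator names, the signatures, and the accessor returning cpu_possible_mask are all assumptions about the proposed interface, and the opaque iterator struct would normally come from a header rather than being declared in the program.

    /* Hypothetical, mirroring bpf_iter_num_*; nothing below is merged yet. */
    struct bpf_iter_cpumask { __u64 __opaque[2]; } __attribute__((aligned(8)));

    extern int bpf_iter_cpumask_new(struct bpf_iter_cpumask *it,
                                    const struct cpumask *mask) __ksym;
    extern __u32 *bpf_iter_cpumask_next(struct bpf_iter_cpumask *it) __ksym;
    extern void bpf_iter_cpumask_destroy(struct bpf_iter_cpumask *it) __ksym;

    /* Assumed accessor kfunc returning a trusted cpu_possible_mask pointer. */
    extern const struct cpumask *bpf_cpu_possible_mask(void) __ksym;

    static int count_possible_cpus(void)
    {
        struct bpf_iter_cpumask it;
        const struct cpumask *mask;
        __u32 *cpu;
        int cnt = 0;

        mask = bpf_cpu_possible_mask();
        if (!mask)
            return 0;

        bpf_iter_cpumask_new(&it, mask);
        /* next() only yields CPU ids whose bits are set, so sparse masks
         * (e.g. a cpuset cgroup's) work naturally. */
        while ((cpu = bpf_iter_cpumask_next(&it)))
            cnt++;
        bpf_iter_cpumask_destroy(&it);

        return cnt;
    }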
On Wed, Aug 2, 2023 at 10:46 AM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Tue, Aug 1, 2023 at 7:34 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > In kernel, we have a global variable > > > nr_cpu_ids (also in kernel/bpf/helpers.c) > > > which is used in numerous places for per cpu data struct access. > > > > > > I am wondering whether we could have bpf code like > > > int nr_cpu_ids __ksym; > > > > > > struct bpf_iter_num it; > > > int i = 0; > > > > > > // nr_cpu_ids is special, we can give it a range [1, CONFIG_NR_CPUS]. > > > bpf_iter_num_new(&it, 1, nr_cpu_ids); > > > while ((v = bpf_iter_num_next(&it))) { > > > /* access cpu i data */ > > > i++; > > > } > > > bpf_iter_num_destroy(&it); > > > > > > From all existing open coded iterator loops, looks like > > > upper bound has to be a constant. We might need to extend support > > > to bounded scalar upper bound if not there. > > > > Currently the upper bound is required by both the open-coded for-loop > > and the bpf_loop. I think we can extend it. > > > > It can't handle the cpumask case either. > > > > for_each_cpu(cpu, mask) > > > > In the 'mask', the CPU IDs might not be continuous. In our container > > environment, we always use the cpuset cgroup for some critical tasks, > > but it is not so convenient to traverse the percpu data of this cpuset > > cgroup. We have to do it as follows for this case : > > > > That's why we prefer to introduce a bpf_for_each_cpu helper. It is > > fine if it can be implemented as a kfunc. > > I think open-coded-iterators is the only acceptable path forward here. > Since existing bpf_iter_num doesn't fit due to sparse cpumask, > let's introduce bpf_iter_cpumask and few additional kfuncs > that return cpu_possible_mask and others. > > We already have some cpumask support in kernel/bpf/cpumask.c > bpf_iter_cpumask will be a natural follow up. I will think about it. Thanks for your suggestion.
On Tue, Aug 01, 2023 at 07:45:57PM -0700, Alexei Starovoitov wrote:
> On Tue, Aug 1, 2023 at 7:34 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > >
> > > In kernel, we have a global variable
> > > nr_cpu_ids (also in kernel/bpf/helpers.c)
> > > which is used in numerous places for per cpu data struct access.
> > >
> > > I am wondering whether we could have bpf code like
> > > int nr_cpu_ids __ksym;

I think this would be useful in general, though any __ksym variable like
this would have to be const and mapped in .rodata, right? But yeah,
being able to R/O map global variables like this which have static
lifetimes would be nice.

It's not quite the same thing as nr_cpu_ids, but FWIW, you could
accomplish something close to this by doing something like this in your
BPF prog:

	/* Set in user space to libbpf_num_possible_cpus() */
	const volatile __u32 nr_cpus;

	...

	__u32 i;

	bpf_for(i, 0, nr_cpus)
		bpf_printk("Iterating over cpu %u", i);

	...

> > > struct bpf_iter_num it;
> > > int i = 0;
> > >
> > > // nr_cpu_ids is special, we can give it a range [1, CONFIG_NR_CPUS].
> > > bpf_iter_num_new(&it, 1, nr_cpu_ids);
> > > while ((v = bpf_iter_num_next(&it))) {
> > >         /* access cpu i data */
> > >         i++;
> > > }
> > > bpf_iter_num_destroy(&it);
> > >
> > > From all existing open coded iterator loops, looks like
> > > upper bound has to be a constant. We might need to extend support
> > > to bounded scalar upper bound if not there.
> >
> > Currently the upper bound is required by both the open-coded for-loop
> > and the bpf_loop. I think we can extend it.
> >
> > It can't handle the cpumask case either.
> >
> > for_each_cpu(cpu, mask)
> >
> > In the 'mask', the CPU IDs might not be continuous. In our container
> > environment, we always use the cpuset cgroup for some critical tasks,
> > but it is not so convenient to traverse the percpu data of this cpuset
> > cgroup. We have to do it as follows for this case :
> >
> > That's why we prefer to introduce a bpf_for_each_cpu helper. It is
> > fine if it can be implemented as a kfunc.
>
> I think open-coded-iterators is the only acceptable path forward here.
> Since existing bpf_iter_num doesn't fit due to sparse cpumask,
> let's introduce bpf_iter_cpumask and few additional kfuncs
> that return cpu_possible_mask and others.

I agree that this is the correct way to generalize this. The only thing
that we'll have to figure out is how to generalize treating const struct
cpumask * objects as kptrs. In sched_ext [0] we export
scx_bpf_get_idle_cpumask() and scx_bpf_get_idle_smtmask() kfuncs to
return trusted global cpumask kptrs that can then be "released" in
scx_bpf_put_idle_cpumask(). scx_bpf_put_idle_cpumask() is empty and
exists only to appease the verifier that the trusted cpumask kptrs
aren't being leaked and are having their references "dropped".

[0]: https://lore.kernel.org/all/20230711011412.100319-13-tj@kernel.org/

I'd imagine that we have 2 ways forward if we want to enable progs to
fetch other global cpumasks with static lifetimes (e.g.
__cpu_possible_mask or nohz.idle_cpus_mask):

1. The most straightforward thing to do would be to add a new kfunc in
   kernel/bpf/cpumask.c that's a drop-in replacement for
   scx_bpf_put_idle_cpumask():

	void bpf_global_cpumask_drop(const struct cpumask *cpumask)
	{}

2. Another would be to implement something resembling what Yonghong
   suggested in [1], where progs can link against globally allocated
   kptrs like:

	const struct cpumask *__cpu_possible_mask __ksym;

   [1]: https://lore.kernel.org/all/3f56b3b3-9b71-f0d3-ace1-406a8eeb64c0@linux.dev/#t

In my opinion (1) is more straightforward, (2) is a better UX.

Note again that both approaches only work for cpumasks with static
lifetimes. I can't think of a way to treat dynamically allocated struct
cpumask * objects as kptrs, as there's nowhere to put a reference. If
someone wants to track a dynamically allocated cpumask, they'd have to
create a kptr out of its container object, and then pass that object's
cpumask as a const struct cpumask * to other BPF cpumask kfuncs
(including e.g. the proposed iterator).

> We already have some cpumask support in kernel/bpf/cpumask.c
> bpf_iter_cpumask will be a natural follow up.

Yes, this should be easy to add.

- David
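For comparison, this is roughly how the two options above would read from a BPF program. Both snippets are sketches: the accessor kfunc in option (1) is a hypothetical name (only the no-op drop kfunc is spelled out in the email), option (2) assumes non-percpu globals become visible in vmlinux BTF, and since __cpu_possible_mask is a struct cpumask object in the kernel rather than a pointer, the extern here takes its address. The only call that exists today is bpf_cpumask_test_cpu() from kernel/bpf/cpumask.c.

    /* Option (1): accessor + no-op release kfuncs (both hypothetical). */
    extern const struct cpumask *bpf_possible_cpumask_get(void) __ksym;
    extern void bpf_global_cpumask_drop(const struct cpumask *cpumask) __ksym;

    /* Option (2): link against the kernel variable directly, assuming
     * non-percpu globals are added to vmlinux BTF. */
    extern const struct cpumask __cpu_possible_mask __ksym;

    /* Existing kfunc from kernel/bpf/cpumask.c, used by both options. */
    extern bool bpf_cpumask_test_cpu(__u32 cpu, const struct cpumask *cpumask) __ksym;

    static bool cpu0_possible_opt1(void)
    {
        const struct cpumask *mask = bpf_possible_cpumask_get();
        bool ret = false;

        if (mask) {
            ret = bpf_cpumask_test_cpu(0, mask);
            bpf_global_cpumask_drop(mask);   /* empty; only satisfies the verifier */
        }
        return ret;
    }

    static bool cpu0_possible_opt2(void)
    {
        return bpf_cpumask_test_cpu(0, &__cpu_possible_mask);
    }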
On 8/1/23 8:29 PM, David Vernet wrote: > On Tue, Aug 01, 2023 at 07:45:57PM -0700, Alexei Starovoitov wrote: >> On Tue, Aug 1, 2023 at 7:34 PM Yafang Shao <laoar.shao@gmail.com> wrote: >>> >>>> >>>> In kernel, we have a global variable >>>> nr_cpu_ids (also in kernel/bpf/helpers.c) >>>> which is used in numerous places for per cpu data struct access. >>>> >>>> I am wondering whether we could have bpf code like >>>> int nr_cpu_ids __ksym; > > I think this would be useful in general, though any __ksym variable like > this would have to be const and mapped in .rodata, right? But yeah, > being able to R/O map global variables like this which have static > lifetimes would be nice. No. There is no map here. __ksym symbol will have a ld_imm64 insn to load the value in the bpf code. The address will be the kernel address patched by libbpf. > > It's not quite the same thing as nr_cpu_ids, but FWIW, you could > accomplish something close to this by doing something like this in your > BPF prog: > > /* Set in user space to libbpf_num_possible_cpus() */ > const volatile __u32 nr_cpus; This is another approach. In this case, nr_cpus will be stored in a map. I don't know the difference between kernel nr_cpu_ids vs. libbpf_num_possible_cpus(). I am choosing nr_cpu_ids because it is the one used inside the kernel. If libbpf_num_possible_cpus() effectively nr_cpu_ids, then happy to use libbpf_num_possible_cpus() which is more user/libbpf friendly. > > ... > __u32 i; > > bpf_for(i, 0, nr_cpus) > bpf_printk("Iterating over cpu %u", i); > > ... > >>>> struct bpf_iter_num it; >>>> int i = 0; >>>> >>>> // nr_cpu_ids is special, we can give it a range [1, CONFIG_NR_CPUS]. >>>> bpf_iter_num_new(&it, 1, nr_cpu_ids); >>>> while ((v = bpf_iter_num_next(&it))) { >>>> /* access cpu i data */ >>>> i++; >>>> } >>>> bpf_iter_num_destroy(&it); >>>> >>>> From all existing open coded iterator loops, looks like >>>> upper bound has to be a constant. We might need to extend support >>>> to bounded scalar upper bound if not there. >>> >>> Currently the upper bound is required by both the open-coded for-loop >>> and the bpf_loop. I think we can extend it. >>> >>> It can't handle the cpumask case either. >>> >>> for_each_cpu(cpu, mask) >>> >>> In the 'mask', the CPU IDs might not be continuous. In our container >>> environment, we always use the cpuset cgroup for some critical tasks, >>> but it is not so convenient to traverse the percpu data of this cpuset >>> cgroup. We have to do it as follows for this case : >>> >>> That's why we prefer to introduce a bpf_for_each_cpu helper. It is >>> fine if it can be implemented as a kfunc. >> >> I think open-coded-iterators is the only acceptable path forward here. >> Since existing bpf_iter_num doesn't fit due to sparse cpumask, >> let's introduce bpf_iter_cpumask and few additional kfuncs >> that return cpu_possible_mask and others. > > I agree that this is the correct way to generalize this. The only thing > that we'll have to figure out is how to generalize treating const struct > cpumask * objects as kptrs. In sched_ext [0] we export > scx_bpf_get_idle_cpumask() and scx_bpf_get_idle_smtmask() kfuncs to > return trusted global cpumask kptrs that can then be "released" in > scx_bpf_put_idle_cpumask(). scx_bpf_put_idle_cpumask() is empty and > exists only to appease the verifier that the trusted cpumask kptrs > aren't being leaked and are having their references "dropped". 
> > [0]: https://lore.kernel.org/all/20230711011412.100319-13-tj@kernel.org/ > > I'd imagine that we have 2 ways forward if we want to enable progs to > fetch other global cpumasks with static lifetimes (e.g. > __cpu_possible_mask or nohz.idle_cpus_mask): > > 1. The most straightforward thing to do would be to add a new kfunc in > kernel/bpf/cpumask.c that's a drop-in replacment for > scx_bpf_put_idle_cpumask(): > > void bpf_global_cpumask_drop(const struct cpumask *cpumask) > {} > > 2. Another would be to implement something resembling what Yonghong > suggested in [1], where progs can link against global allocated kptrs > like: > > const struct cpumask *__cpu_possible_mask __ksym; > > [1]: https://lore.kernel.org/all/3f56b3b3-9b71-f0d3-ace1-406a8eeb64c0@linux.dev/#t > > In my opinion (1) is more straightforward, (2) is a better UX. > > Note again that both approaches only works for cpumasks with static > lifetimes. I can't think of a way to treat dynamically allocated struct > cpumask *objects as kptrs as there's nowhere to put a reference. If > someone wants to track a dynamically allocated cpumask, they'd have to > create a kptr out of its container object, and then pass that object's > cpumask as a const struct cpumask * to other BPF cpumask kfuncs > (including e.g. the proposed iterator). > >> We already have some cpumask support in kernel/bpf/cpumask.c >> bpf_iter_cpumask will be a natural follow up. > > Yes, this should be easy to add. > > - David
On Tue, Aug 01, 2023 at 11:54:18PM -0700, Yonghong Song wrote: > > > On 8/1/23 8:29 PM, David Vernet wrote: > > On Tue, Aug 01, 2023 at 07:45:57PM -0700, Alexei Starovoitov wrote: > > > On Tue, Aug 1, 2023 at 7:34 PM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > > > > > > > In kernel, we have a global variable > > > > > nr_cpu_ids (also in kernel/bpf/helpers.c) > > > > > which is used in numerous places for per cpu data struct access. > > > > > > > > > > I am wondering whether we could have bpf code like > > > > > int nr_cpu_ids __ksym; > > > > I think this would be useful in general, though any __ksym variable like > > this would have to be const and mapped in .rodata, right? But yeah, > > being able to R/O map global variables like this which have static > > lifetimes would be nice. > > No. There is no map here. __ksym symbol will have a ld_imm64 insn > to load the value in the bpf code. The address will be the kernel > address patched by libbpf. ld_imm64 is fine. I'm talking about stores. BPF progs should not be able to mutate these variables.
On Wed, Aug 2, 2023 at 8:46 AM David Vernet <void@manifault.com> wrote:
>
> On Tue, Aug 01, 2023 at 11:54:18PM -0700, Yonghong Song wrote:
> >
> > On 8/1/23 8:29 PM, David Vernet wrote:
> > > On Tue, Aug 01, 2023 at 07:45:57PM -0700, Alexei Starovoitov wrote:
> > > > On Tue, Aug 1, 2023 at 7:34 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > >
> > > > > >
> > > > > > In kernel, we have a global variable
> > > > > > nr_cpu_ids (also in kernel/bpf/helpers.c)
> > > > > > which is used in numerous places for per cpu data struct access.
> > > > > >
> > > > > > I am wondering whether we could have bpf code like
> > > > > > int nr_cpu_ids __ksym;
> > >
> > > I think this would be useful in general, though any __ksym variable like
> > > this would have to be const and mapped in .rodata, right? But yeah,
> > > being able to R/O map global variables like this which have static
> > > lifetimes would be nice.
> >
> > No. There is no map here. __ksym symbol will have a ld_imm64 insn
> > to load the value in the bpf code. The address will be the kernel
> > address patched by libbpf.
>
> ld_imm64 is fine. I'm talking about stores. BPF progs should not be able
> to mutate these variables.

ksyms are readonly and support for them is already in the kernel and libbpf.
The only reason

	extern int nr_cpu_ids __ksym;

doesn't work today is because nr_cpu_ids is not in vmlinux BTF.
pahole adds only per-cpu vars to BTF.
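As a reference point for what already works versus what this thread is asking for: the per-cpu ksym pattern below mirrors the existing test_ksyms_btf selftest (runqueues is a per-cpu variable that is already in vmlinux BTF), while the nr_cpu_ids extern is the part that would only start working once pahole also emits non-percpu globals. Reads go through the ld_imm64-patched kernel address; stores are rejected. This is a sketch under that assumption, not something that loads today.

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include "bpf_experimental.h"   /* bpf_for() */

    /* Per-cpu variable; already present in vmlinux BTF, works today. */
    extern const struct rq runqueues __ksym;

    /* Plain global; only resolvable once pahole adds non-percpu vars to BTF. */
    extern const unsigned int nr_cpu_ids __ksym;

    SEC("raw_tp/sys_enter")
    int dump_nr_running(const void *ctx)
    {
        unsigned int cpus = nr_cpu_ids;  /* read-only; a store would be rejected */
        const struct rq *rq;
        unsigned int cpu;

        bpf_for(cpu, 0, cpus) {
            rq = bpf_per_cpu_ptr(&runqueues, cpu);
            if (rq)
                bpf_printk("cpu %u nr_running %u", cpu, rq->nr_running);
        }
        return 0;
    }

    char _license[] SEC("license") = "GPL";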
On Tue, Aug 1, 2023 at 8:30 PM David Vernet <void@manifault.com> wrote: > I agree that this is the correct way to generalize this. The only thing > that we'll have to figure out is how to generalize treating const struct > cpumask * objects as kptrs. In sched_ext [0] we export > scx_bpf_get_idle_cpumask() and scx_bpf_get_idle_smtmask() kfuncs to > return trusted global cpumask kptrs that can then be "released" in > scx_bpf_put_idle_cpumask(). scx_bpf_put_idle_cpumask() is empty and > exists only to appease the verifier that the trusted cpumask kptrs > aren't being leaked and are having their references "dropped". why is it KF_ACQUIRE ? I think it can just return a trusted pointer without acquire. > [0]: https://lore.kernel.org/all/20230711011412.100319-13-tj@kernel.org/ > > I'd imagine that we have 2 ways forward if we want to enable progs to > fetch other global cpumasks with static lifetimes (e.g. > __cpu_possible_mask or nohz.idle_cpus_mask): > > 1. The most straightforward thing to do would be to add a new kfunc in > kernel/bpf/cpumask.c that's a drop-in replacment for > scx_bpf_put_idle_cpumask(): > > void bpf_global_cpumask_drop(const struct cpumask *cpumask) > {} > > 2. Another would be to implement something resembling what Yonghong > suggested in [1], where progs can link against global allocated kptrs > like: > > const struct cpumask *__cpu_possible_mask __ksym; > > [1]: https://lore.kernel.org/all/3f56b3b3-9b71-f0d3-ace1-406a8eeb64c0@linux.dev/#t > > In my opinion (1) is more straightforward, (2) is a better UX. 1 = adding few kfuncs. 2 = teaching pahole to emit certain global vars. nm vmlinux|g -w D|g -v __SCK_|g -v __tracepoint_|wc -l 1998 imo BTF increase trade off is acceptable.
On Wed, Aug 02, 2023 at 09:33:18AM -0700, Alexei Starovoitov wrote: > On Tue, Aug 1, 2023 at 8:30 PM David Vernet <void@manifault.com> wrote: > > I agree that this is the correct way to generalize this. The only thing > > that we'll have to figure out is how to generalize treating const struct > > cpumask * objects as kptrs. In sched_ext [0] we export > > scx_bpf_get_idle_cpumask() and scx_bpf_get_idle_smtmask() kfuncs to > > return trusted global cpumask kptrs that can then be "released" in > > scx_bpf_put_idle_cpumask(). scx_bpf_put_idle_cpumask() is empty and > > exists only to appease the verifier that the trusted cpumask kptrs > > aren't being leaked and are having their references "dropped". > > why is it KF_ACQUIRE ? > I think it can just return a trusted pointer without acquire. I don't think there's a way to do this yet without hard-coding the kfuncs as special, right? That's why we do stuff like this: 11479 } else if (meta.func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx]) { 11480 mark_reg_known_zero(env, regs, BPF_REG_0); 11481 regs[BPF_REG_0].type = PTR_TO_BTF_ID | PTR_TRUSTED; 11482 regs[BPF_REG_0].btf = desc_btf; 11483 regs[BPF_REG_0].btf_id = meta.ret_btf_id; We could continue to do that, but I wonder if it would be useful to add a kfunc flag that allowed a kfunc to specify that? Something like KF_ALWAYS_TRUSTED? In general though, yes, we can teach the verifier to not require KF_ACQUIRE if we want. It's just that what we have now doesn't really scale to the kernel for any global cpumask. > > [0]: https://lore.kernel.org/all/20230711011412.100319-13-tj@kernel.org/ > > > > I'd imagine that we have 2 ways forward if we want to enable progs to > > fetch other global cpumasks with static lifetimes (e.g. > > __cpu_possible_mask or nohz.idle_cpus_mask): > > > > 1. The most straightforward thing to do would be to add a new kfunc in > > kernel/bpf/cpumask.c that's a drop-in replacment for > > scx_bpf_put_idle_cpumask(): > > > > void bpf_global_cpumask_drop(const struct cpumask *cpumask) > > {} > > > > 2. Another would be to implement something resembling what Yonghong > > suggested in [1], where progs can link against global allocated kptrs > > like: > > > > const struct cpumask *__cpu_possible_mask __ksym; > > > > [1]: https://lore.kernel.org/all/3f56b3b3-9b71-f0d3-ace1-406a8eeb64c0@linux.dev/#t > > > > In my opinion (1) is more straightforward, (2) is a better UX. > > 1 = adding few kfuncs. > 2 = teaching pahole to emit certain global vars. > > nm vmlinux|g -w D|g -v __SCK_|g -v __tracepoint_|wc -l > 1998 > > imo BTF increase trade off is acceptable. Agreed
On Wed, Aug 2, 2023 at 10:06 AM David Vernet <void@manifault.com> wrote:
>
> On Wed, Aug 02, 2023 at 09:33:18AM -0700, Alexei Starovoitov wrote:
> > On Tue, Aug 1, 2023 at 8:30 PM David Vernet <void@manifault.com> wrote:
> > > I agree that this is the correct way to generalize this. The only thing
> > > that we'll have to figure out is how to generalize treating const struct
> > > cpumask * objects as kptrs. In sched_ext [0] we export
> > > scx_bpf_get_idle_cpumask() and scx_bpf_get_idle_smtmask() kfuncs to
> > > return trusted global cpumask kptrs that can then be "released" in
> > > scx_bpf_put_idle_cpumask(). scx_bpf_put_idle_cpumask() is empty and
> > > exists only to appease the verifier that the trusted cpumask kptrs
> > > aren't being leaked and are having their references "dropped".
> >
> > why is it KF_ACQUIRE ?
> > I think it can just return a trusted pointer without acquire.
>
> I don't think there's a way to do this yet without hard-coding the kfuncs as
> special, right? That's why we do stuff like this:
>
> 11479         } else if (meta.func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx]) {
> 11480                 mark_reg_known_zero(env, regs, BPF_REG_0);
> 11481                 regs[BPF_REG_0].type = PTR_TO_BTF_ID | PTR_TRUSTED;
> 11482                 regs[BPF_REG_0].btf = desc_btf;
> 11483                 regs[BPF_REG_0].btf_id = meta.ret_btf_id;
>
> We could continue to do that, but I wonder if it would be useful to add
> a kfunc flag that allowed a kfunc to specify that? Something like
> KF_ALWAYS_TRUSTED? In general though, yes, we can teach the verifier to
> not require KF_ACQUIRE if we want. It's just that what we have now
> doesn't really scale to the kernel for any global cpumask.

We should probably just do this:

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index e7b1af016841..f0c3f69ee5a8 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -11567,7 +11567,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 		} else {
 			mark_reg_known_zero(env, regs, BPF_REG_0);
 			regs[BPF_REG_0].btf = desc_btf;
-			regs[BPF_REG_0].type = PTR_TO_BTF_ID;
+			regs[BPF_REG_0].type = PTR_TO_BTF_ID | PTR_TRUSTED;
 			regs[BPF_REG_0].btf_id = ptr_type_id;

A quick audit of kfuncs shows that this code path is never used: everywhere
a ptr_to_btf is returned, the kfunc is marked with KF_ACQUIRE. We'd need
careful code review to make sure future kfuncs return trusted pointers.

That would simplify scx_bpf_get_idle_cpumask(). No need for get and nop put.
On 02/08/2023 17:33, Alexei Starovoitov wrote: > On Tue, Aug 1, 2023 at 8:30 PM David Vernet <void@manifault.com> wrote: >> I agree that this is the correct way to generalize this. The only thing >> that we'll have to figure out is how to generalize treating const struct >> cpumask * objects as kptrs. In sched_ext [0] we export >> scx_bpf_get_idle_cpumask() and scx_bpf_get_idle_smtmask() kfuncs to >> return trusted global cpumask kptrs that can then be "released" in >> scx_bpf_put_idle_cpumask(). scx_bpf_put_idle_cpumask() is empty and >> exists only to appease the verifier that the trusted cpumask kptrs >> aren't being leaked and are having their references "dropped". > > why is it KF_ACQUIRE ? > I think it can just return a trusted pointer without acquire. > >> [0]: https://lore.kernel.org/all/20230711011412.100319-13-tj@kernel.org/ >> >> I'd imagine that we have 2 ways forward if we want to enable progs to >> fetch other global cpumasks with static lifetimes (e.g. >> __cpu_possible_mask or nohz.idle_cpus_mask): >> >> 1. The most straightforward thing to do would be to add a new kfunc in >> kernel/bpf/cpumask.c that's a drop-in replacment for >> scx_bpf_put_idle_cpumask(): >> >> void bpf_global_cpumask_drop(const struct cpumask *cpumask) >> {} >> >> 2. Another would be to implement something resembling what Yonghong >> suggested in [1], where progs can link against global allocated kptrs >> like: >> >> const struct cpumask *__cpu_possible_mask __ksym; >> >> [1]: https://lore.kernel.org/all/3f56b3b3-9b71-f0d3-ace1-406a8eeb64c0@linux.dev/#t >> >> In my opinion (1) is more straightforward, (2) is a better UX. > > 1 = adding few kfuncs. > 2 = teaching pahole to emit certain global vars. > > nm vmlinux|g -w D|g -v __SCK_|g -v __tracepoint_|wc -l > 1998 > > imo BTF increase trade off is acceptable. Agreed, Stephen's numbers on BTF size increase were pretty modest [1]. What was gating that work in my mind was previous discussion around splitting aspects of BTF into a "vmlinux-extra". Experiments with this seemed to show it's hard to support, and worse, tooling would have to learn about its existence. We have to come up with a CONFIG convention about specifying what ends up in -extra versus core vmlinux BTF, what do we do about modules, etc. All feels like over-complication. I think a better path would be to support BTF in a vmlinux BTF module (controlled by making CONFIG_DEBUG_INFO_BTF tristate). The module is separately loadable, but puts vmlinux in the same place for tools - /sys/kernel/btf/vmlinux. That solves already-existing issues of BTF size for embedded use cases that have come up a few times, and lessens concerns about BTF size for other users, while it all works with existing tooling. I have a basic proof-of-concept but it will take time to hammer into shape. Because variable-related size increases are pretty modest, so should we proceed with the BTF variable support anyway? We can modularize BTF separately later on for those concerned about BTF size. [1] https://lore.kernel.org/bpf/20221104231103.752040-1-stephen.s.brennan@oracle.com/
On 8/3/23 1:21 AM, Alan Maguire wrote: > On 02/08/2023 17:33, Alexei Starovoitov wrote: >> On Tue, Aug 1, 2023 at 8:30 PM David Vernet <void@manifault.com> wrote: >>> I agree that this is the correct way to generalize this. The only thing >>> that we'll have to figure out is how to generalize treating const struct >>> cpumask * objects as kptrs. In sched_ext [0] we export >>> scx_bpf_get_idle_cpumask() and scx_bpf_get_idle_smtmask() kfuncs to >>> return trusted global cpumask kptrs that can then be "released" in >>> scx_bpf_put_idle_cpumask(). scx_bpf_put_idle_cpumask() is empty and >>> exists only to appease the verifier that the trusted cpumask kptrs >>> aren't being leaked and are having their references "dropped". >> >> why is it KF_ACQUIRE ? >> I think it can just return a trusted pointer without acquire. >> >>> [0]: https://lore.kernel.org/all/20230711011412.100319-13-tj@kernel.org/ >>> >>> I'd imagine that we have 2 ways forward if we want to enable progs to >>> fetch other global cpumasks with static lifetimes (e.g. >>> __cpu_possible_mask or nohz.idle_cpus_mask): >>> >>> 1. The most straightforward thing to do would be to add a new kfunc in >>> kernel/bpf/cpumask.c that's a drop-in replacment for >>> scx_bpf_put_idle_cpumask(): >>> >>> void bpf_global_cpumask_drop(const struct cpumask *cpumask) >>> {} >>> >>> 2. Another would be to implement something resembling what Yonghong >>> suggested in [1], where progs can link against global allocated kptrs >>> like: >>> >>> const struct cpumask *__cpu_possible_mask __ksym; >>> >>> [1]: https://lore.kernel.org/all/3f56b3b3-9b71-f0d3-ace1-406a8eeb64c0@linux.dev/#t >>> >>> In my opinion (1) is more straightforward, (2) is a better UX. >> >> 1 = adding few kfuncs. >> 2 = teaching pahole to emit certain global vars. >> >> nm vmlinux|g -w D|g -v __SCK_|g -v __tracepoint_|wc -l >> 1998 >> >> imo BTF increase trade off is acceptable. > > Agreed, Stephen's numbers on BTF size increase were pretty modest [1]. > > What was gating that work in my mind was previous discussion around > splitting aspects of BTF into a "vmlinux-extra". Experiments with this > seemed to show it's hard to support, and worse, tooling would have to > learn about its existence. We have to come up with a CONFIG convention > about specifying what ends up in -extra versus core vmlinux BTF, what do > we do about modules, etc. All feels like over-complication. > > I think a better path would be to support BTF in a vmlinux BTF module > (controlled by making CONFIG_DEBUG_INFO_BTF tristate). The module is > separately loadable, but puts vmlinux in the same place for tools - > /sys/kernel/btf/vmlinux. That solves already-existing issues of BTF size > for embedded use cases that have come up a few times, and lessens > concerns about BTF size for other users, while it all works with > existing tooling. I have a basic proof-of-concept but it will take time > to hammer into shape. > > Because variable-related size increases are pretty modest, so should we > proceed with the BTF variable support anyway? We can modularize BTF > separately later on for those concerned about BTF size. Alan, it seems a consensus has reached that we should include global variables (excluding special kernel made ones like __SCK_ and __tracepoint_) in vmlinux BTF. please go ahead and propose a patch. Thanks! > > [1] > https://lore.kernel.org/bpf/20221104231103.752040-1-stephen.s.brennan@oracle.com/
On 03/08/2023 16:22, Yonghong Song wrote: > > > On 8/3/23 1:21 AM, Alan Maguire wrote: >> On 02/08/2023 17:33, Alexei Starovoitov wrote: >>> On Tue, Aug 1, 2023 at 8:30 PM David Vernet <void@manifault.com> wrote: >>>> I agree that this is the correct way to generalize this. The only thing >>>> that we'll have to figure out is how to generalize treating const >>>> struct >>>> cpumask * objects as kptrs. In sched_ext [0] we export >>>> scx_bpf_get_idle_cpumask() and scx_bpf_get_idle_smtmask() kfuncs to >>>> return trusted global cpumask kptrs that can then be "released" in >>>> scx_bpf_put_idle_cpumask(). scx_bpf_put_idle_cpumask() is empty and >>>> exists only to appease the verifier that the trusted cpumask kptrs >>>> aren't being leaked and are having their references "dropped". >>> >>> why is it KF_ACQUIRE ? >>> I think it can just return a trusted pointer without acquire. >>> >>>> [0]: >>>> https://lore.kernel.org/all/20230711011412.100319-13-tj@kernel.org/ >>>> >>>> I'd imagine that we have 2 ways forward if we want to enable progs to >>>> fetch other global cpumasks with static lifetimes (e.g. >>>> __cpu_possible_mask or nohz.idle_cpus_mask): >>>> >>>> 1. The most straightforward thing to do would be to add a new kfunc in >>>> kernel/bpf/cpumask.c that's a drop-in replacment for >>>> scx_bpf_put_idle_cpumask(): >>>> >>>> void bpf_global_cpumask_drop(const struct cpumask *cpumask) >>>> {} >>>> >>>> 2. Another would be to implement something resembling what Yonghong >>>> suggested in [1], where progs can link against global allocated >>>> kptrs >>>> like: >>>> >>>> const struct cpumask *__cpu_possible_mask __ksym; >>>> >>>> [1]: >>>> https://lore.kernel.org/all/3f56b3b3-9b71-f0d3-ace1-406a8eeb64c0@linux.dev/#t >>>> >>>> In my opinion (1) is more straightforward, (2) is a better UX. >>> >>> 1 = adding few kfuncs. >>> 2 = teaching pahole to emit certain global vars. >>> >>> nm vmlinux|g -w D|g -v __SCK_|g -v __tracepoint_|wc -l >>> 1998 >>> >>> imo BTF increase trade off is acceptable. >> >> Agreed, Stephen's numbers on BTF size increase were pretty modest [1]. >> >> What was gating that work in my mind was previous discussion around >> splitting aspects of BTF into a "vmlinux-extra". Experiments with this >> seemed to show it's hard to support, and worse, tooling would have to >> learn about its existence. We have to come up with a CONFIG convention >> about specifying what ends up in -extra versus core vmlinux BTF, what do >> we do about modules, etc. All feels like over-complication. >> >> I think a better path would be to support BTF in a vmlinux BTF module >> (controlled by making CONFIG_DEBUG_INFO_BTF tristate). The module is >> separately loadable, but puts vmlinux in the same place for tools - >> /sys/kernel/btf/vmlinux. That solves already-existing issues of BTF size >> for embedded use cases that have come up a few times, and lessens >> concerns about BTF size for other users, while it all works with >> existing tooling. I have a basic proof-of-concept but it will take time >> to hammer into shape. >> >> Because variable-related size increases are pretty modest, so should we >> proceed with the BTF variable support anyway? We can modularize BTF >> separately later on for those concerned about BTF size. > > Alan, it seems a consensus has reached that we should include > global variables (excluding special kernel made ones like > __SCK_ and __tracepoint_) in vmlinux BTF. > please go ahead and propose a patch. Thanks! > Sounds good! 
Stephen and I will hopefully have something ready soon; the changes are mostly in dwarves plus a kernel selftest. And good idea on the filtering; this will eliminate a lot of unneeded variables and cut down on size overheads. Thanks! Alan >> >> [1] >> https://lore.kernel.org/bpf/20221104231103.752040-1-stephen.s.brennan@oracle.com/