[RFC,bpf-next,0/3] bpf: Add new bpf helper bpf_for_each_cpu

Message ID 20230801142912.55078-1-laoar.shao@gmail.com (mailing list archive)

Message

Yafang Shao Aug. 1, 2023, 2:29 p.m. UTC
Some statistics are stored in percpu pointers, but the kernel doesn't
aggregate them into a single value; one example is the data in struct
psi_group_cpu.

Currently, we can traverse percpu data using bpf_loop and bpf_per_cpu_ptr:

  bpf_loop(nr_cpus, callback_fn, callback_ctx, 0)

In the callback_fn, we retrieve the percpu pointer with bpf_per_cpu_ptr().
The drawback is that 'nr_cpus' cannot be a variable; otherwise, it will be
rejected by the verifier, hindering deployment, as servers may have
different 'nr_cpus'. Using CONFIG_NR_CPUS is not ideal.
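
For illustration, the pattern looks roughly like the following (untested
sketch; the callback ctx layout and the field being aggregated are just
placeholders for our real code):

  struct cb_ctx {
          struct psi_group_cpu *pcpu;     /* percpu pointer, e.g. psi_group->pcpu */
          __u64 sum;
  };

  static long cpu_cb(__u64 cpu, void *data)
  {
          struct cb_ctx *c = data;
          struct psi_group_cpu *gc;

          gc = bpf_per_cpu_ptr(c->pcpu, cpu);
          if (gc)
                  c->sum += gc->times[0]; /* e.g. the PSI_IO_SOME time */
          return 0;
  }

  /* in the program body, with 'grp' a struct psi_group pointer: */
  struct cb_ctx ctx = { .pcpu = grp->pcpu };

  bpf_loop(128 /* must be a compile-time constant today */, cpu_cb, &ctx, 0);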

Alternatively, with the bpf_cpumask family, we can obtain a task's cpumask.
However, it requires creating a bpf_cpumask, copying the cpumask from the
task, and then parsing the CPU IDs from it, resulting in low efficiency.
Introducing other kfuncs like bpf_cpumask_next might be necessary.

A new bpf helper, bpf_for_each_cpu, is introduced to conveniently traverse
percpu data, covering all scenarios. It includes
for_each_{possible, present, online}_cpu. The user can also traverse CPUs
from a specific task, such as walking the CPUs of a cpuset cgroup when the
task is in that cgroup.

In our use case, we utilize this new helper to traverse percpu psi data.
This aids in understanding why CPU, Memory, and IO pressure data are high
on a server or a container.

Due to the __percpu annotation, clang-14+ and pahole-1.23+ are required.

Yafang Shao (3):
  bpf: Add bpf_for_each_cpu helper
  cgroup, psi: Init root cgroup psi to psi_system
  selftests/bpf: Add selftest for for_each_cpu

 include/linux/bpf.h                                |   1 +
 include/linux/psi.h                                |   2 +-
 include/uapi/linux/bpf.h                           |  32 +++++
 kernel/bpf/bpf_iter.c                              |  72 +++++++++++
 kernel/bpf/helpers.c                               |   2 +
 kernel/bpf/verifier.c                              |  29 ++++-
 kernel/cgroup/cgroup.c                             |   5 +-
 tools/include/uapi/linux/bpf.h                     |  32 +++++
 .../selftests/bpf/prog_tests/for_each_cpu.c        | 137 +++++++++++++++++++++
 .../selftests/bpf/progs/test_for_each_cpu.c        |  63 ++++++++++
 10 files changed, 372 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/for_each_cpu.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_for_each_cpu.c

Comments

Yonghong Song Aug. 1, 2023, 5:53 p.m. UTC | #1
On 8/1/23 7:29 AM, Yafang Shao wrote:
> Some statistic data is stored in percpu pointer but the kernel doesn't
> aggregate it into a single value, for example, the data in struct
> psi_group_cpu.
> 
> Currently, we can traverse percpu data using for_loop and bpf_per_cpu_ptr:
> 
>    for_loop(nr_cpus, callback_fn, callback_ctx, 0)
> 
> In the callback_fn, we retrieve the percpu pointer with bpf_per_cpu_ptr().
> The drawback is that 'nr_cpus' cannot be a variable; otherwise, it will be
> rejected by the verifier, hindering deployment, as servers may have
> different 'nr_cpus'. Using CONFIG_NR_CPUS is not ideal.
> 
> Alternatively, with the bpf_cpumask family, we can obtain a task's cpumask.
> However, it requires creating a bpf_cpumask, copying the cpumask from the
> task, and then parsing the CPU IDs from it, resulting in low efficiency.
> Introducing other kfuncs like bpf_cpumask_next might be necessary.
> 
> A new bpf helper, bpf_for_each_cpu, is introduced to conveniently traverse
> percpu data, covering all scenarios. It includes
> for_each_{possible, present, online}_cpu. The user can also traverse CPUs
> from a specific task, such as walking the CPUs of a cpuset cgroup when the
> task is in that cgroup.

The bpf subsystem has adopted the kfunc approach, so new bpf helpers are
no longer accepted. You need to implement bpf_for_each_cpu as a kfunc instead.

But I am wondering whether we should use open coded iterator loops
    06accc8779c1  bpf: add support for open-coded iterator loops

In kernel, we have a global variable
    nr_cpu_ids (also in kernel/bpf/helpers.c)
which is used in numerous places for per cpu data struct access.

I am wondering whether we could have bpf code like
    int nr_cpu_ids __ksym;

    struct bpf_iter_num it;
    int i = 0;

    // nr_cpu_ids is special, we can give it a range [1, CONFIG_NR_CPUS].
    bpf_iter_num_new(&it, 1, nr_cpu_ids);
    while ((v = bpf_iter_num_next(&it))) {
           /* access cpu i data */
           i++;
    }
    bpf_iter_num_destroy(&it);

From all existing open-coded iterator loops, it looks like the
upper bound has to be a constant. We might need to extend support
to a bounded scalar upper bound if it is not there yet.
> 
> In our use case, we utilize this new helper to traverse percpu psi data.
> This aids in understanding why CPU, Memory, and IO pressure data are high
> on a server or a container.
> 
> Due to the __percpu annotation, clang-14+ and pahole-1.23+ are required.
> 
> Yafang Shao (3):
>    bpf: Add bpf_for_each_cpu helper
>    cgroup, psi: Init root cgroup psi to psi_system
>    selftests/bpf: Add selftest for for_each_cpu
> 
>   include/linux/bpf.h                                |   1 +
>   include/linux/psi.h                                |   2 +-
>   include/uapi/linux/bpf.h                           |  32 +++++
>   kernel/bpf/bpf_iter.c                              |  72 +++++++++++
>   kernel/bpf/helpers.c                               |   2 +
>   kernel/bpf/verifier.c                              |  29 ++++-
>   kernel/cgroup/cgroup.c                             |   5 +-
>   tools/include/uapi/linux/bpf.h                     |  32 +++++
>   .../selftests/bpf/prog_tests/for_each_cpu.c        | 137 +++++++++++++++++++++
>   .../selftests/bpf/progs/test_for_each_cpu.c        |  63 ++++++++++
>   10 files changed, 372 insertions(+), 3 deletions(-)
>   create mode 100644 tools/testing/selftests/bpf/prog_tests/for_each_cpu.c
>   create mode 100644 tools/testing/selftests/bpf/progs/test_for_each_cpu.c
>
Yafang Shao Aug. 2, 2023, 2:33 a.m. UTC | #2
On Wed, Aug 2, 2023 at 1:53 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
>
> On 8/1/23 7:29 AM, Yafang Shao wrote:
> > Some statistic data is stored in percpu pointer but the kernel doesn't
> > aggregate it into a single value, for example, the data in struct
> > psi_group_cpu.
> >
> > Currently, we can traverse percpu data using for_loop and bpf_per_cpu_ptr:
> >
> >    for_loop(nr_cpus, callback_fn, callback_ctx, 0)
> >
> > In the callback_fn, we retrieve the percpu pointer with bpf_per_cpu_ptr().
> > The drawback is that 'nr_cpus' cannot be a variable; otherwise, it will be
> > rejected by the verifier, hindering deployment, as servers may have
> > different 'nr_cpus'. Using CONFIG_NR_CPUS is not ideal.
> >
> > Alternatively, with the bpf_cpumask family, we can obtain a task's cpumask.
> > However, it requires creating a bpf_cpumask, copying the cpumask from the
> > task, and then parsing the CPU IDs from it, resulting in low efficiency.
> > Introducing other kfuncs like bpf_cpumask_next might be necessary.
> >
> > A new bpf helper, bpf_for_each_cpu, is introduced to conveniently traverse
> > percpu data, covering all scenarios. It includes
> > for_each_{possible, present, online}_cpu. The user can also traverse CPUs
> > from a specific task, such as walking the CPUs of a cpuset cgroup when the
> > task is in that cgroup.
>
> The bpf subsystem has adopted kfunc approach. So there is no bpf helper
> any more.

Could you please share some background on why we made this decision?

> You need to have a bpf_for_each_cpu kfunc instead.
> But I am wondering whether we should use open coded iterator loops
>     06accc8779c1  bpf: add support for open-coded iterator loops

The usage of the open-coded iterator for-loop is almost the same as
bpf_loop, except that it can start from a non-zero value.

>
> In kernel, we have a global variable
>     nr_cpu_ids (also in kernel/bpf/helpers.c)
> which is used in numerous places for per cpu data struct access.
>
> I am wondering whether we could have bpf code like
>     int nr_cpu_ids __ksym;
>
>     struct bpf_iter_num it;
>     int i = 0;
>
>     // nr_cpu_ids is special, we can give it a range [1, CONFIG_NR_CPUS].
>     bpf_iter_num_new(&it, 1, nr_cpu_ids);
>     while ((v = bpf_iter_num_next(&it))) {
>            /* access cpu i data */
>            i++;
>     }
>     bpf_iter_num_destroy(&it);
>
>  From all existing open coded iterator loops, looks like
> upper bound has to be a constant. We might need to extend support
> to bounded scalar upper bound if not there.

Currently a constant upper bound is required by both the open-coded
for-loop and bpf_loop. I think we can extend it.

It can't handle the cpumask case either.

    for_each_cpu(cpu, mask)

In the 'mask', the CPU IDs might not be contiguous. In our container
environment, we always use the cpuset cgroup for some critical tasks,
but it is not so convenient to traverse the percpu data of such a cpuset
cgroup. We have to do it roughly as follows for this case:
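
    /* Untested sketch with the existing bpf_cpumask kfuncs; the exact
     * details may differ from our real code. 'p' is a trusted
     * struct task_struct pointer.
     */
    struct bpf_cpumask *mask;
    int cpu;

    mask = bpf_cpumask_create();
    if (!mask)
            return 0;
    bpf_cpumask_copy(mask, p->cpus_ptr);
    bpf_for(cpu, 0, 128 /* must be a constant today */) {
            if (!bpf_cpumask_test_cpu(cpu, (const struct cpumask *)mask))
                    continue;
            /* bpf_per_cpu_ptr(..., cpu) and aggregate the data */
    }
    bpf_cpumask_release(mask);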

That's why we prefer to introduce a bpf_for_each_cpu helper. It is
fine if it can be implemented as a kfunc.
Alexei Starovoitov Aug. 2, 2023, 2:45 a.m. UTC | #3
On Tue, Aug 1, 2023 at 7:34 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> >
> > In kernel, we have a global variable
> >     nr_cpu_ids (also in kernel/bpf/helpers.c)
> > which is used in numerous places for per cpu data struct access.
> >
> > I am wondering whether we could have bpf code like
> >     int nr_cpu_ids __ksym;
> >
> >     struct bpf_iter_num it;
> >     int i = 0;
> >
> >     // nr_cpu_ids is special, we can give it a range [1, CONFIG_NR_CPUS].
> >     bpf_iter_num_new(&it, 1, nr_cpu_ids);
> >     while ((v = bpf_iter_num_next(&it))) {
> >            /* access cpu i data */
> >            i++;
> >     }
> >     bpf_iter_num_destroy(&it);
> >
> >  From all existing open coded iterator loops, looks like
> > upper bound has to be a constant. We might need to extend support
> > to bounded scalar upper bound if not there.
>
> Currently the upper bound is required by both the open-coded for-loop
> and the bpf_loop. I think we can extend it.
>
> It can't handle the cpumask case either.
>
>     for_each_cpu(cpu, mask)
>
> In the 'mask', the CPU IDs might not be continuous. In our container
> environment, we always use the cpuset cgroup for some critical tasks,
> but it is not so convenient to traverse the percpu data of this cpuset
> cgroup.  We have to do it as follows for this case :
>
> That's why we prefer to introduce a bpf_for_each_cpu helper. It is
> fine if it can be implemented as a kfunc.

I think open-coded-iterators is the only acceptable path forward here.
Since existing bpf_iter_num doesn't fit due to sparse cpumask,
let's introduce bpf_iter_cpumask and few additional kfuncs
that return cpu_possible_mask and others.

We already have some cpumask support in kernel/bpf/cpumask.c
bpf_iter_cpumask will be a natural follow up.
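
Something along these lines (hypothetical sketch, exact kfunc names TBD):

    struct bpf_iter_cpumask it;
    int *cpu;

    /* 'mask' would come from a new kfunc, e.g. one returning cpu_possible_mask */
    bpf_iter_cpumask_new(&it, mask);
    while ((cpu = bpf_iter_cpumask_next(&it))) {
            /* access per-cpu data for *cpu */
    }
    bpf_iter_cpumask_destroy(&it);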
Yafang Shao Aug. 2, 2023, 2:57 a.m. UTC | #4
On Wed, Aug 2, 2023 at 10:46 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Aug 1, 2023 at 7:34 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > >
> > > In kernel, we have a global variable
> > >     nr_cpu_ids (also in kernel/bpf/helpers.c)
> > > which is used in numerous places for per cpu data struct access.
> > >
> > > I am wondering whether we could have bpf code like
> > >     int nr_cpu_ids __ksym;
> > >
> > >     struct bpf_iter_num it;
> > >     int i = 0;
> > >
> > >     // nr_cpu_ids is special, we can give it a range [1, CONFIG_NR_CPUS].
> > >     bpf_iter_num_new(&it, 1, nr_cpu_ids);
> > >     while ((v = bpf_iter_num_next(&it))) {
> > >            /* access cpu i data */
> > >            i++;
> > >     }
> > >     bpf_iter_num_destroy(&it);
> > >
> > >  From all existing open coded iterator loops, looks like
> > > upper bound has to be a constant. We might need to extend support
> > > to bounded scalar upper bound if not there.
> >
> > Currently the upper bound is required by both the open-coded for-loop
> > and the bpf_loop. I think we can extend it.
> >
> > It can't handle the cpumask case either.
> >
> >     for_each_cpu(cpu, mask)
> >
> > In the 'mask', the CPU IDs might not be continuous. In our container
> > environment, we always use the cpuset cgroup for some critical tasks,
> > but it is not so convenient to traverse the percpu data of this cpuset
> > cgroup.  We have to do it as follows for this case :
> >
> > That's why we prefer to introduce a bpf_for_each_cpu helper. It is
> > fine if it can be implemented as a kfunc.
>
> I think open-coded-iterators is the only acceptable path forward here.
> Since existing bpf_iter_num doesn't fit due to sparse cpumask,
> let's introduce bpf_iter_cpumask and few additional kfuncs
> that return cpu_possible_mask and others.
>
> We already have some cpumask support in kernel/bpf/cpumask.c
> bpf_iter_cpumask will be a natural follow up.

I will think about it. Thanks for your suggestion.
David Vernet Aug. 2, 2023, 3:29 a.m. UTC | #5
On Tue, Aug 01, 2023 at 07:45:57PM -0700, Alexei Starovoitov wrote:
> On Tue, Aug 1, 2023 at 7:34 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > >
> > > In kernel, we have a global variable
> > >     nr_cpu_ids (also in kernel/bpf/helpers.c)
> > > which is used in numerous places for per cpu data struct access.
> > >
> > > I am wondering whether we could have bpf code like
> > >     int nr_cpu_ids __ksym;

I think this would be useful in general, though any __ksym variable like
this would have to be const and mapped in .rodata, right? But yeah,
being able to R/O map global variables like this which have static
lifetimes would be nice.

It's not quite the same thing as nr_cpu_ids, but FWIW, you could
accomplish something close to this by doing something like this in your
BPF prog:

/* Set in user space to libbpf_num_possible_cpus() */
const volatile __u32 nr_cpus;

...
	__u32 i;

	bpf_for(i, 0, nr_cpus)
		bpf_printk("Iterating over cpu %u", i);

...

> > >     struct bpf_iter_num it;
> > >     int i = 0;
> > >
> > >     // nr_cpu_ids is special, we can give it a range [1, CONFIG_NR_CPUS].
> > >     bpf_iter_num_new(&it, 1, nr_cpu_ids);
> > >     while ((v = bpf_iter_num_next(&it))) {
> > >            /* access cpu i data */
> > >            i++;
> > >     }
> > >     bpf_iter_num_destroy(&it);
> > >
> > >  From all existing open coded iterator loops, looks like
> > > upper bound has to be a constant. We might need to extend support
> > > to bounded scalar upper bound if not there.
> >
> > Currently the upper bound is required by both the open-coded for-loop
> > and the bpf_loop. I think we can extend it.
> >
> > It can't handle the cpumask case either.
> >
> >     for_each_cpu(cpu, mask)
> >
> > In the 'mask', the CPU IDs might not be continuous. In our container
> > environment, we always use the cpuset cgroup for some critical tasks,
> > but it is not so convenient to traverse the percpu data of this cpuset
> > cgroup.  We have to do it as follows for this case :
> >
> > That's why we prefer to introduce a bpf_for_each_cpu helper. It is
> > fine if it can be implemented as a kfunc.
> 
> I think open-coded-iterators is the only acceptable path forward here.
> Since existing bpf_iter_num doesn't fit due to sparse cpumask,
> let's introduce bpf_iter_cpumask and few additional kfuncs
> that return cpu_possible_mask and others.

I agree that this is the correct way to generalize this. The only thing
that we'll have to figure out is how to generalize treating const struct
cpumask * objects as kptrs. In sched_ext [0] we export
scx_bpf_get_idle_cpumask() and scx_bpf_get_idle_smtmask() kfuncs to
return trusted global cpumask kptrs that can then be "released" in
scx_bpf_put_idle_cpumask(). scx_bpf_put_idle_cpumask() is empty and
exists only to appease the verifier that the trusted cpumask kptrs
aren't being leaked and are having their references "dropped".

[0]: https://lore.kernel.org/all/20230711011412.100319-13-tj@kernel.org/

I'd imagine that we have 2 ways forward if we want to enable progs to
fetch other global cpumasks with static lifetimes (e.g.
__cpu_possible_mask or nohz.idle_cpus_mask):

1. The most straightforward thing to do would be to add a new kfunc in
   kernel/bpf/cpumask.c that's a drop-in replacment for
   scx_bpf_put_idle_cpumask():

void bpf_global_cpumask_drop(const struct cpumask *cpumask)
{}

2. Another would be to implement something resembling what Yonghong
   suggested in [1], where progs can link against global allocated kptrs
   like:

const struct cpumask *__cpu_possible_mask __ksym;

[1]: https://lore.kernel.org/all/3f56b3b3-9b71-f0d3-ace1-406a8eeb64c0@linux.dev/#t

In my opinion (1) is more straightforward, (2) is a better UX.

Note again that both approaches only work for cpumasks with static
lifetimes.  I can't think of a way to treat dynamically allocated struct
cpumask *objects as kptrs as there's nowhere to put a reference. If
someone wants to track a dynamically allocated cpumask, they'd have to
create a kptr out of its container object, and then pass that object's
cpumask as a const struct cpumask * to other BPF cpumask kfuncs
(including e.g. the proposed iterator).

> We already have some cpumask support in kernel/bpf/cpumask.c
> bpf_iter_cpumask will be a natural follow up.

Yes, this should be easy to add.

- David
Yonghong Song Aug. 2, 2023, 6:54 a.m. UTC | #6
On 8/1/23 8:29 PM, David Vernet wrote:
> On Tue, Aug 01, 2023 at 07:45:57PM -0700, Alexei Starovoitov wrote:
>> On Tue, Aug 1, 2023 at 7:34 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>>>
>>>>
>>>> In kernel, we have a global variable
>>>>      nr_cpu_ids (also in kernel/bpf/helpers.c)
>>>> which is used in numerous places for per cpu data struct access.
>>>>
>>>> I am wondering whether we could have bpf code like
>>>>      int nr_cpu_ids __ksym;
> 
> I think this would be useful in general, though any __ksym variable like
> this would have to be const and mapped in .rodata, right? But yeah,
> being able to R/O map global variables like this which have static
> lifetimes would be nice.

No. There is no map here. A __ksym symbol will have a ld_imm64 insn
in the bpf code; the address it loads will be the kernel address
patched in by libbpf.

> 
> It's not quite the same thing as nr_cpu_ids, but FWIW, you could
> accomplish something close to this by doing something like this in your
> BPF prog:
> 
> /* Set in user space to libbpf_num_possible_cpus() */
> const volatile __u32 nr_cpus;

This is another approach. In this case, nr_cpus will be
stored in a map.

I don't know the difference between the kernel's nr_cpu_ids and
libbpf_num_possible_cpus(). I am choosing nr_cpu_ids
because it is the one used inside the kernel. If
libbpf_num_possible_cpus() is effectively nr_cpu_ids,
then I'm happy to use libbpf_num_possible_cpus(), which
is more user/libbpf friendly.

> 
> ...
> 	__u32 i;
> 
> 	bpf_for(i, 0, nr_cpus)
> 		bpf_printk("Iterating over cpu %u", i);
> 
> ...
> 
>>>>      struct bpf_iter_num it;
>>>>      int i = 0;
>>>>
>>>>      // nr_cpu_ids is special, we can give it a range [1, CONFIG_NR_CPUS].
>>>>      bpf_iter_num_new(&it, 1, nr_cpu_ids);
>>>>      while ((v = bpf_iter_num_next(&it))) {
>>>>             /* access cpu i data */
>>>>             i++;
>>>>      }
>>>>      bpf_iter_num_destroy(&it);
>>>>
>>>>   From all existing open coded iterator loops, looks like
>>>> upper bound has to be a constant. We might need to extend support
>>>> to bounded scalar upper bound if not there.
>>>
>>> Currently the upper bound is required by both the open-coded for-loop
>>> and the bpf_loop. I think we can extend it.
>>>
>>> It can't handle the cpumask case either.
>>>
>>>      for_each_cpu(cpu, mask)
>>>
>>> In the 'mask', the CPU IDs might not be continuous. In our container
>>> environment, we always use the cpuset cgroup for some critical tasks,
>>> but it is not so convenient to traverse the percpu data of this cpuset
>>> cgroup.  We have to do it as follows for this case :
>>>
>>> That's why we prefer to introduce a bpf_for_each_cpu helper. It is
>>> fine if it can be implemented as a kfunc.
>>
>> I think open-coded-iterators is the only acceptable path forward here.
>> Since existing bpf_iter_num doesn't fit due to sparse cpumask,
>> let's introduce bpf_iter_cpumask and few additional kfuncs
>> that return cpu_possible_mask and others.
> 
> I agree that this is the correct way to generalize this. The only thing
> that we'll have to figure out is how to generalize treating const struct
> cpumask * objects as kptrs. In sched_ext [0] we export
> scx_bpf_get_idle_cpumask() and scx_bpf_get_idle_smtmask() kfuncs to
> return trusted global cpumask kptrs that can then be "released" in
> scx_bpf_put_idle_cpumask(). scx_bpf_put_idle_cpumask() is empty and
> exists only to appease the verifier that the trusted cpumask kptrs
> aren't being leaked and are having their references "dropped".
> 
> [0]: https://lore.kernel.org/all/20230711011412.100319-13-tj@kernel.org/
> 
> I'd imagine that we have 2 ways forward if we want to enable progs to
> fetch other global cpumasks with static lifetimes (e.g.
> __cpu_possible_mask or nohz.idle_cpus_mask):
> 
> 1. The most straightforward thing to do would be to add a new kfunc in
>     kernel/bpf/cpumask.c that's a drop-in replacment for
>     scx_bpf_put_idle_cpumask():
> 
> void bpf_global_cpumask_drop(const struct cpumask *cpumask)
> {}
> 
> 2. Another would be to implement something resembling what Yonghong
>     suggested in [1], where progs can link against global allocated kptrs
>     like:
> 
> const struct cpumask *__cpu_possible_mask __ksym;
> 
> [1]: https://lore.kernel.org/all/3f56b3b3-9b71-f0d3-ace1-406a8eeb64c0@linux.dev/#t
> 
> In my opinion (1) is more straightforward, (2) is a better UX.
> 
> Note again that both approaches only works for cpumasks with static
> lifetimes.  I can't think of a way to treat dynamically allocated struct
> cpumask *objects as kptrs as there's nowhere to put a reference. If
> someone wants to track a dynamically allocated cpumask, they'd have to
> create a kptr out of its container object, and then pass that object's
> cpumask as a const struct cpumask * to other BPF cpumask kfuncs
> (including e.g. the proposed iterator).
> 
>> We already have some cpumask support in kernel/bpf/cpumask.c
>> bpf_iter_cpumask will be a natural follow up.
> 
> Yes, this should be easy to add.
> 
> - David
David Vernet Aug. 2, 2023, 3:46 p.m. UTC | #7
On Tue, Aug 01, 2023 at 11:54:18PM -0700, Yonghong Song wrote:
> 
> 
> On 8/1/23 8:29 PM, David Vernet wrote:
> > On Tue, Aug 01, 2023 at 07:45:57PM -0700, Alexei Starovoitov wrote:
> > > On Tue, Aug 1, 2023 at 7:34 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > 
> > > > > 
> > > > > In kernel, we have a global variable
> > > > >      nr_cpu_ids (also in kernel/bpf/helpers.c)
> > > > > which is used in numerous places for per cpu data struct access.
> > > > > 
> > > > > I am wondering whether we could have bpf code like
> > > > >      int nr_cpu_ids __ksym;
> > 
> > I think this would be useful in general, though any __ksym variable like
> > this would have to be const and mapped in .rodata, right? But yeah,
> > being able to R/O map global variables like this which have static
> > lifetimes would be nice.
> 
> No. There is no map here. __ksym symbol will have a ld_imm64 insn
> to load the value in the bpf code. The address will be the kernel
> address patched by libbpf.

ld_imm64 is fine. I'm talking about stores. BPF progs should not be able
to mutate these variables.
Alexei Starovoitov Aug. 2, 2023, 4:23 p.m. UTC | #8
On Wed, Aug 2, 2023 at 8:46 AM David Vernet <void@manifault.com> wrote:
>
> On Tue, Aug 01, 2023 at 11:54:18PM -0700, Yonghong Song wrote:
> >
> >
> > On 8/1/23 8:29 PM, David Vernet wrote:
> > > On Tue, Aug 01, 2023 at 07:45:57PM -0700, Alexei Starovoitov wrote:
> > > > On Tue, Aug 1, 2023 at 7:34 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > >
> > > > > >
> > > > > > In kernel, we have a global variable
> > > > > >      nr_cpu_ids (also in kernel/bpf/helpers.c)
> > > > > > which is used in numerous places for per cpu data struct access.
> > > > > >
> > > > > > I am wondering whether we could have bpf code like
> > > > > >      int nr_cpu_ids __ksym;
> > >
> > > I think this would be useful in general, though any __ksym variable like
> > > this would have to be const and mapped in .rodata, right? But yeah,
> > > being able to R/O map global variables like this which have static
> > > lifetimes would be nice.
> >
> > No. There is no map here. __ksym symbol will have a ld_imm64 insn
> > to load the value in the bpf code. The address will be the kernel
> > address patched by libbpf.
>
> ld_imm64 is fine. I'm talking about stores. BPF progs should not be able
> to mutate these variables.

ksyms are read-only and support for them is already in the kernel and libbpf.
The only reason
extern int nr_cpu_ids __ksym;
doesn't work today is that nr_cpu_ids is not in vmlinux BTF;
pahole adds only per-cpu vars to BTF.
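
For example, a per-cpu var already works as a ksym today (see the
test_ksyms_btf selftest), while the plain global does not:

    /* works: per-cpu kernel variable, present in vmlinux BTF */
    extern const struct rq runqueues __ksym;
    ...
    struct rq *rq = bpf_per_cpu_ptr(&runqueues, cpu);

    /* doesn't work yet: plain global, not emitted into BTF by pahole */
    extern int nr_cpu_ids __ksym;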
Alexei Starovoitov Aug. 2, 2023, 4:33 p.m. UTC | #9
On Tue, Aug 1, 2023 at 8:30 PM David Vernet <void@manifault.com> wrote:
> I agree that this is the correct way to generalize this. The only thing
> that we'll have to figure out is how to generalize treating const struct
> cpumask * objects as kptrs. In sched_ext [0] we export
> scx_bpf_get_idle_cpumask() and scx_bpf_get_idle_smtmask() kfuncs to
> return trusted global cpumask kptrs that can then be "released" in
> scx_bpf_put_idle_cpumask(). scx_bpf_put_idle_cpumask() is empty and
> exists only to appease the verifier that the trusted cpumask kptrs
> aren't being leaked and are having their references "dropped".

why is it KF_ACQUIRE ?
I think it can just return a trusted pointer without acquire.

> [0]: https://lore.kernel.org/all/20230711011412.100319-13-tj@kernel.org/
>
> I'd imagine that we have 2 ways forward if we want to enable progs to
> fetch other global cpumasks with static lifetimes (e.g.
> __cpu_possible_mask or nohz.idle_cpus_mask):
>
> 1. The most straightforward thing to do would be to add a new kfunc in
>    kernel/bpf/cpumask.c that's a drop-in replacment for
>    scx_bpf_put_idle_cpumask():
>
> void bpf_global_cpumask_drop(const struct cpumask *cpumask)
> {}
>
> 2. Another would be to implement something resembling what Yonghong
>    suggested in [1], where progs can link against global allocated kptrs
>    like:
>
> const struct cpumask *__cpu_possible_mask __ksym;
>
> [1]: https://lore.kernel.org/all/3f56b3b3-9b71-f0d3-ace1-406a8eeb64c0@linux.dev/#t
>
> In my opinion (1) is more straightforward, (2) is a better UX.

1 = adding few kfuncs.
2 = teaching pahole to emit certain global vars.

nm vmlinux|g -w D|g -v __SCK_|g -v __tracepoint_|wc -l
1998

imo BTF increase trade off is acceptable.
David Vernet Aug. 2, 2023, 5:06 p.m. UTC | #10
On Wed, Aug 02, 2023 at 09:33:18AM -0700, Alexei Starovoitov wrote:
> On Tue, Aug 1, 2023 at 8:30 PM David Vernet <void@manifault.com> wrote:
> > I agree that this is the correct way to generalize this. The only thing
> > that we'll have to figure out is how to generalize treating const struct
> > cpumask * objects as kptrs. In sched_ext [0] we export
> > scx_bpf_get_idle_cpumask() and scx_bpf_get_idle_smtmask() kfuncs to
> > return trusted global cpumask kptrs that can then be "released" in
> > scx_bpf_put_idle_cpumask(). scx_bpf_put_idle_cpumask() is empty and
> > exists only to appease the verifier that the trusted cpumask kptrs
> > aren't being leaked and are having their references "dropped".
> 
> why is it KF_ACQUIRE ?
> I think it can just return a trusted pointer without acquire.

I don't think there's a way to do this yet without hard-coding the kfuncs as
special, right? That's why we do stuff like this:


11479                 } else if (meta.func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx]) {
11480                         mark_reg_known_zero(env, regs, BPF_REG_0);
11481                         regs[BPF_REG_0].type = PTR_TO_BTF_ID | PTR_TRUSTED;
11482                         regs[BPF_REG_0].btf = desc_btf;
11483                         regs[BPF_REG_0].btf_id = meta.ret_btf_id;

We could continue to do that, but I wonder if it would be useful to add
a kfunc flag that allowed a kfunc to specify that? Something like
KF_ALWAYS_TRUSTED? In general though, yes, we can teach the verifier to
not require KF_ACQUIRE if we want. It's just that what we have now
doesn't really scale to the kernel for any global cpumask.

> > [0]: https://lore.kernel.org/all/20230711011412.100319-13-tj@kernel.org/
> >
> > I'd imagine that we have 2 ways forward if we want to enable progs to
> > fetch other global cpumasks with static lifetimes (e.g.
> > __cpu_possible_mask or nohz.idle_cpus_mask):
> >
> > 1. The most straightforward thing to do would be to add a new kfunc in
> >    kernel/bpf/cpumask.c that's a drop-in replacment for
> >    scx_bpf_put_idle_cpumask():
> >
> > void bpf_global_cpumask_drop(const struct cpumask *cpumask)
> > {}
> >
> > 2. Another would be to implement something resembling what Yonghong
> >    suggested in [1], where progs can link against global allocated kptrs
> >    like:
> >
> > const struct cpumask *__cpu_possible_mask __ksym;
> >
> > [1]: https://lore.kernel.org/all/3f56b3b3-9b71-f0d3-ace1-406a8eeb64c0@linux.dev/#t
> >
> > In my opinion (1) is more straightforward, (2) is a better UX.
> 
> 1 = adding few kfuncs.
> 2 = teaching pahole to emit certain global vars.
> 
> nm vmlinux|g -w D|g -v __SCK_|g -v __tracepoint_|wc -l
> 1998
> 
> imo BTF increase trade off is acceptable.

Agreed
Alexei Starovoitov Aug. 2, 2023, 6:13 p.m. UTC | #11
On Wed, Aug 2, 2023 at 10:06 AM David Vernet <void@manifault.com> wrote:
>
> On Wed, Aug 02, 2023 at 09:33:18AM -0700, Alexei Starovoitov wrote:
> > On Tue, Aug 1, 2023 at 8:30 PM David Vernet <void@manifault.com> wrote:
> > > I agree that this is the correct way to generalize this. The only thing
> > > that we'll have to figure out is how to generalize treating const struct
> > > cpumask * objects as kptrs. In sched_ext [0] we export
> > > scx_bpf_get_idle_cpumask() and scx_bpf_get_idle_smtmask() kfuncs to
> > > return trusted global cpumask kptrs that can then be "released" in
> > > scx_bpf_put_idle_cpumask(). scx_bpf_put_idle_cpumask() is empty and
> > > exists only to appease the verifier that the trusted cpumask kptrs
> > > aren't being leaked and are having their references "dropped".
> >
> > why is it KF_ACQUIRE ?
> > I think it can just return a trusted pointer without acquire.
>
> I don't think there's a way to do this yet without hard-coding the kfuncs as
> special, right? That's why we do stuff like this:
>
>
> 11479                 } else if (meta.func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx]) {
> 11480                         mark_reg_known_zero(env, regs, BPF_REG_0);
> 11481                         regs[BPF_REG_0].type = PTR_TO_BTF_ID | PTR_TRUSTED;
> 11482                         regs[BPF_REG_0].btf = desc_btf;
> 11483                         regs[BPF_REG_0].btf_id = meta.ret_btf_id;
>
> We could continue to do that, but I wonder if it would be useful to add
> a kfunc flag that allowed a kfunc to specify that? Something like
> KF_ALWAYS_TRUSTED? In general though, yes, we can teach the verifier to
> not require KF_ACQUIRE if we want. It's just that what we have now
> doesn't really scale to the kernel for any global cpumask.

We should probably just do this:
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index e7b1af016841..f0c3f69ee5a8 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -11567,7 +11567,7 @@ static int check_kfunc_call(struct
bpf_verifier_env *env, struct bpf_insn *insn,
                } else {
                        mark_reg_known_zero(env, regs, BPF_REG_0);
                        regs[BPF_REG_0].btf = desc_btf;
-                       regs[BPF_REG_0].type = PTR_TO_BTF_ID;
+                       regs[BPF_REG_0].type = PTR_TO_BTF_ID | PTR_TRUSTED;
                        regs[BPF_REG_0].btf_id = ptr_type_id;

A quick audit of kfuncs shows that this code is never used.
Everywhere ptr_to_btf is returned, the kfunc is marked with KF_ACQUIRE.
We'd need careful code review to make sure future kfuncs
return trusted pointers.
That would simplify scx_bpf_get_idle_cpumask(): no need for the get and the no-op put.
Alan Maguire Aug. 3, 2023, 8:21 a.m. UTC | #12
On 02/08/2023 17:33, Alexei Starovoitov wrote:
> On Tue, Aug 1, 2023 at 8:30 PM David Vernet <void@manifault.com> wrote:
>> I agree that this is the correct way to generalize this. The only thing
>> that we'll have to figure out is how to generalize treating const struct
>> cpumask * objects as kptrs. In sched_ext [0] we export
>> scx_bpf_get_idle_cpumask() and scx_bpf_get_idle_smtmask() kfuncs to
>> return trusted global cpumask kptrs that can then be "released" in
>> scx_bpf_put_idle_cpumask(). scx_bpf_put_idle_cpumask() is empty and
>> exists only to appease the verifier that the trusted cpumask kptrs
>> aren't being leaked and are having their references "dropped".
> 
> why is it KF_ACQUIRE ?
> I think it can just return a trusted pointer without acquire.
> 
>> [0]: https://lore.kernel.org/all/20230711011412.100319-13-tj@kernel.org/
>>
>> I'd imagine that we have 2 ways forward if we want to enable progs to
>> fetch other global cpumasks with static lifetimes (e.g.
>> __cpu_possible_mask or nohz.idle_cpus_mask):
>>
>> 1. The most straightforward thing to do would be to add a new kfunc in
>>    kernel/bpf/cpumask.c that's a drop-in replacment for
>>    scx_bpf_put_idle_cpumask():
>>
>> void bpf_global_cpumask_drop(const struct cpumask *cpumask)
>> {}
>>
>> 2. Another would be to implement something resembling what Yonghong
>>    suggested in [1], where progs can link against global allocated kptrs
>>    like:
>>
>> const struct cpumask *__cpu_possible_mask __ksym;
>>
>> [1]: https://lore.kernel.org/all/3f56b3b3-9b71-f0d3-ace1-406a8eeb64c0@linux.dev/#t
>>
>> In my opinion (1) is more straightforward, (2) is a better UX.
> 
> 1 = adding few kfuncs.
> 2 = teaching pahole to emit certain global vars.
> 
> nm vmlinux|g -w D|g -v __SCK_|g -v __tracepoint_|wc -l
> 1998
> 
> imo BTF increase trade off is acceptable.

Agreed, Stephen's numbers on BTF size increase were pretty modest [1].

What was gating that work in my mind was the previous discussion around
splitting aspects of BTF into a "vmlinux-extra". Experiments with this
seemed to show it's hard to support, and worse, tooling would have to
learn about its existence. We would have to come up with a CONFIG convention
for specifying what ends up in -extra versus core vmlinux BTF, decide what
to do about modules, etc. It all feels like over-complication.

I think a better path would be to support BTF in a vmlinux BTF module
(controlled by making CONFIG_DEBUG_INFO_BTF tristate). The module is
separately loadable, but puts vmlinux in the same place for tools -
/sys/kernel/btf/vmlinux. That solves already-existing issues of BTF size
for embedded use cases that have come up a few times, and lessens
concerns about BTF size for other users, while it all works with
existing tooling. I have a basic proof-of-concept but it will take time
to hammer into shape.

Since the variable-related size increases are pretty modest, should we
proceed with the BTF variable support anyway? We can modularize BTF
separately later on for those concerned about BTF size.

[1]
https://lore.kernel.org/bpf/20221104231103.752040-1-stephen.s.brennan@oracle.com/
Yonghong Song Aug. 3, 2023, 3:22 p.m. UTC | #13
On 8/3/23 1:21 AM, Alan Maguire wrote:
> On 02/08/2023 17:33, Alexei Starovoitov wrote:
>> On Tue, Aug 1, 2023 at 8:30 PM David Vernet <void@manifault.com> wrote:
>>> I agree that this is the correct way to generalize this. The only thing
>>> that we'll have to figure out is how to generalize treating const struct
>>> cpumask * objects as kptrs. In sched_ext [0] we export
>>> scx_bpf_get_idle_cpumask() and scx_bpf_get_idle_smtmask() kfuncs to
>>> return trusted global cpumask kptrs that can then be "released" in
>>> scx_bpf_put_idle_cpumask(). scx_bpf_put_idle_cpumask() is empty and
>>> exists only to appease the verifier that the trusted cpumask kptrs
>>> aren't being leaked and are having their references "dropped".
>>
>> why is it KF_ACQUIRE ?
>> I think it can just return a trusted pointer without acquire.
>>
>>> [0]: https://lore.kernel.org/all/20230711011412.100319-13-tj@kernel.org/
>>>
>>> I'd imagine that we have 2 ways forward if we want to enable progs to
>>> fetch other global cpumasks with static lifetimes (e.g.
>>> __cpu_possible_mask or nohz.idle_cpus_mask):
>>>
>>> 1. The most straightforward thing to do would be to add a new kfunc in
>>>     kernel/bpf/cpumask.c that's a drop-in replacment for
>>>     scx_bpf_put_idle_cpumask():
>>>
>>> void bpf_global_cpumask_drop(const struct cpumask *cpumask)
>>> {}
>>>
>>> 2. Another would be to implement something resembling what Yonghong
>>>     suggested in [1], where progs can link against global allocated kptrs
>>>     like:
>>>
>>> const struct cpumask *__cpu_possible_mask __ksym;
>>>
>>> [1]: https://lore.kernel.org/all/3f56b3b3-9b71-f0d3-ace1-406a8eeb64c0@linux.dev/#t
>>>
>>> In my opinion (1) is more straightforward, (2) is a better UX.
>>
>> 1 = adding few kfuncs.
>> 2 = teaching pahole to emit certain global vars.
>>
>> nm vmlinux|g -w D|g -v __SCK_|g -v __tracepoint_|wc -l
>> 1998
>>
>> imo BTF increase trade off is acceptable.
> 
> Agreed, Stephen's numbers on BTF size increase were pretty modest [1].
> 
> What was gating that work in my mind was previous discussion around
> splitting aspects of BTF into a "vmlinux-extra". Experiments with this
> seemed to show it's hard to support, and worse, tooling would have to
> learn about its existence. We have to come up with a CONFIG convention
> about specifying what ends up in -extra versus core vmlinux BTF, what do
> we do about modules, etc. All feels like over-complication.
> 
> I think a better path would be to support BTF in a vmlinux BTF module
> (controlled by making CONFIG_DEBUG_INFO_BTF tristate). The module is
> separately loadable, but puts vmlinux in the same place for tools -
> /sys/kernel/btf/vmlinux. That solves already-existing issues of BTF size
> for embedded use cases that have come up a few times, and lessens
> concerns about BTF size for other users, while it all works with
> existing tooling. I have a basic proof-of-concept but it will take time
> to hammer into shape.
> 
> Because variable-related size increases are pretty modest, so should we
> proceed with the BTF variable support anyway? We can modularize BTF
> separately later on for those concerned about BTF size.

Alan, it seems a consensus has been reached that we should include
global variables (excluding special kernel-generated ones like
__SCK_ and __tracepoint_) in vmlinux BTF.
Please go ahead and propose a patch. Thanks!

> 
> [1]
> https://lore.kernel.org/bpf/20221104231103.752040-1-stephen.s.brennan@oracle.com/
Alan Maguire Aug. 3, 2023, 4:10 p.m. UTC | #14
On 03/08/2023 16:22, Yonghong Song wrote:
> 
> 
> On 8/3/23 1:21 AM, Alan Maguire wrote:
>> On 02/08/2023 17:33, Alexei Starovoitov wrote:
>>> On Tue, Aug 1, 2023 at 8:30 PM David Vernet <void@manifault.com> wrote:
>>>> I agree that this is the correct way to generalize this. The only thing
>>>> that we'll have to figure out is how to generalize treating const
>>>> struct
>>>> cpumask * objects as kptrs. In sched_ext [0] we export
>>>> scx_bpf_get_idle_cpumask() and scx_bpf_get_idle_smtmask() kfuncs to
>>>> return trusted global cpumask kptrs that can then be "released" in
>>>> scx_bpf_put_idle_cpumask(). scx_bpf_put_idle_cpumask() is empty and
>>>> exists only to appease the verifier that the trusted cpumask kptrs
>>>> aren't being leaked and are having their references "dropped".
>>>
>>> why is it KF_ACQUIRE ?
>>> I think it can just return a trusted pointer without acquire.
>>>
>>>> [0]:
>>>> https://lore.kernel.org/all/20230711011412.100319-13-tj@kernel.org/
>>>>
>>>> I'd imagine that we have 2 ways forward if we want to enable progs to
>>>> fetch other global cpumasks with static lifetimes (e.g.
>>>> __cpu_possible_mask or nohz.idle_cpus_mask):
>>>>
>>>> 1. The most straightforward thing to do would be to add a new kfunc in
>>>>     kernel/bpf/cpumask.c that's a drop-in replacment for
>>>>     scx_bpf_put_idle_cpumask():
>>>>
>>>> void bpf_global_cpumask_drop(const struct cpumask *cpumask)
>>>> {}
>>>>
>>>> 2. Another would be to implement something resembling what Yonghong
>>>>     suggested in [1], where progs can link against global allocated
>>>> kptrs
>>>>     like:
>>>>
>>>> const struct cpumask *__cpu_possible_mask __ksym;
>>>>
>>>> [1]:
>>>> https://lore.kernel.org/all/3f56b3b3-9b71-f0d3-ace1-406a8eeb64c0@linux.dev/#t
>>>>
>>>> In my opinion (1) is more straightforward, (2) is a better UX.
>>>
>>> 1 = adding few kfuncs.
>>> 2 = teaching pahole to emit certain global vars.
>>>
>>> nm vmlinux|g -w D|g -v __SCK_|g -v __tracepoint_|wc -l
>>> 1998
>>>
>>> imo BTF increase trade off is acceptable.
>>
>> Agreed, Stephen's numbers on BTF size increase were pretty modest [1].
>>
>> What was gating that work in my mind was previous discussion around
>> splitting aspects of BTF into a "vmlinux-extra". Experiments with this
>> seemed to show it's hard to support, and worse, tooling would have to
>> learn about its existence. We have to come up with a CONFIG convention
>> about specifying what ends up in -extra versus core vmlinux BTF, what do
>> we do about modules, etc. All feels like over-complication.
>>
>> I think a better path would be to support BTF in a vmlinux BTF module
>> (controlled by making CONFIG_DEBUG_INFO_BTF tristate). The module is
>> separately loadable, but puts vmlinux in the same place for tools -
>> /sys/kernel/btf/vmlinux. That solves already-existing issues of BTF size
>> for embedded use cases that have come up a few times, and lessens
>> concerns about BTF size for other users, while it all works with
>> existing tooling. I have a basic proof-of-concept but it will take time
>> to hammer into shape.
>>
>> Because variable-related size increases are pretty modest, so should we
>> proceed with the BTF variable support anyway? We can modularize BTF
>> separately later on for those concerned about BTF size.
> 
> Alan, it seems a consensus has reached that we should include
> global variables (excluding special kernel made ones like
> __SCK_ and __tracepoint_) in vmlinux BTF.
> please go ahead and propose a patch. Thanks!
>

Sounds good! Stephen and I will hopefully have something ready soon;
the changes are mostly in dwarves plus a kernel selftest. And
good idea on the filtering; this will eliminate a lot of unneeded
variables and cut down on size overheads. Thanks!

Alan

>>
>> [1]
>> https://lore.kernel.org/bpf/20221104231103.752040-1-stephen.s.brennan@oracle.com/