
[bpf-next,v4,00/10] bpf: Support private stack for bpf progs

Message ID 20241010175552.1895980-1-yonghong.song@linux.dev

Message

Yonghong Song Oct. 10, 2024, 5:55 p.m. UTC
The main motivation for the private stack comes from the nested scheduler in
sched-ext from Tejun. The basic idea is that
 - each cgroup will have its own associated bpf program,
 - the bpf program of a parent cgroup will call the bpf programs
   of its immediate child cgroups.

Let us say we have the following cgroup hierarchy:
  root_cg (prog0):
    cg1 (prog1):
      cg11 (prog11):
        cg111 (prog111)
        cg112 (prog112)
      cg12 (prog12):
        cg121 (prog121)
        cg122 (prog122)
    cg2 (prog2):
      cg21 (prog21)
      cg22 (prog22)
      cg23 (prog23)

In the above example, prog0 will call a kfunc which will call prog1 and
prog2 to get sched info for cg1 and cg2 and then the information is
summarized and sent back to prog0. Similarly, prog11 and prog12 will be
invoked in the kfunc and the result will be summarized and sent back to
prog1, etc.

Currently, for each thread, the x86 kernel allocates an 8KB stack. Each
bpf program (including its subprograms) has a maximum stack size of 512B to
avoid potential stack overflow, and nested bpf programs increase the risk
of stack overflow. To avoid potential stack overflow caused by bpf
programs, this patch set implements a private stack, so bpf program stack
space is allocated dynamically when the program is jitted. Such a private
stack is applied to tracing programs like kprobe/uprobe, perf_event,
tracepoint, raw tracepoint and tracing.

But more than one instance of the same bpf program may run in the system.
To keep things simple, a percpu private stack is allocated for each program,
so if the same program is running on different cpus concurrently, we won't
have any issue. Note that the kernel already has logic to prevent recursion
of the same bpf program on the same cpu (kprobe, fentry, etc.).

This patch set implements a percpu private stack based approach for the x86
arch. A new kfunc bpf_prog_call() is introduced for the above nested scheduler
use case. If bpf_prog_call() is used in a program and bpf_tail_call() is not
used in the same program, then the private stack will be used. Internally,
the private stack allows a certain number of recursions by allocating more
space. Please see each individual patch for details.

Change logs:
  v3 -> v4:
    - v3 link: https://lore.kernel.org/bpf/20240926234506.1769256-1-yonghong.song@linux.dev/
There is a long discussion in the above v3 link about allowing the private
stack to be used by kernel functions in order to simplify the implementation.
But unfortunately we have not found a workable solution yet, so we returned
to the approach where the private stack is only used by bpf programs.
    - Add bpf_prog_call() kfunc.
  v2 -> v3:
- Instead of per-subprog private stack allocation, allocate private
  stacks at the main prog or callback entry prog. Subprogs that are not main
  or callback progs will increment the inherited stack pointer to use as
  their frame pointer.
- The private stack allows each prog a max stack size of 512 bytes, instead
  of limiting the whole prog hierarchy to 512 bytes.
    - Add some tests.

Yonghong Song (10):
  bpf: Allow each subprog having stack size of 512 bytes
  bpf: Mark each subprog with proper private stack modes
  bpf, x86: Refactor func emit_prologue
  bpf, x86: Create a helper for certain "reg <op>= imm" operations
  bpf, x86: Add jit support for private stack
  selftests/bpf: Add private stack tests
  bpf: Support calling non-tailcall bpf prog
  bpf, x86: Create two helpers for some arith operations
  bpf, x86: Jit support for nested bpf_prog_call
  selftests/bpf: Add tests for bpf_prog_call()

 arch/x86/net/bpf_jit_comp.c                   | 318 ++++++++++++++----
 include/linux/bpf.h                           |  14 +
 include/linux/bpf_verifier.h                  |   3 +
 include/linux/filter.h                        |   1 +
 kernel/bpf/core.c                             |  27 ++
 kernel/bpf/helpers.c                          |  20 ++
 kernel/bpf/trampoline.c                       |  16 +
 kernel/bpf/verifier.c                         | 145 +++++++-
 .../selftests/bpf/prog_tests/prog_call.c      |  78 +++++
 .../selftests/bpf/prog_tests/verifier.c       |   2 +
 tools/testing/selftests/bpf/progs/prog_call.c |  92 +++++
 .../bpf/progs/verifier_private_stack.c        | 216 ++++++++++++
 12 files changed, 856 insertions(+), 76 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/prog_call.c
 create mode 100644 tools/testing/selftests/bpf/progs/prog_call.c
 create mode 100644 tools/testing/selftests/bpf/progs/verifier_private_stack.c

Comments

Tejun Heo Oct. 15, 2024, 9:28 p.m. UTC | #1
Hello,

On Thu, Oct 10, 2024 at 10:55:52AM -0700, Yonghong Song wrote:
> The main motivation for private stack comes from nested scheduler in
> sched-ext from Tejun. The basic idea is that
>  - each cgroup will its own associated bpf program,
>  - bpf program with parent cgroup will call bpf programs
>    in immediate child cgroups.
> 
> Let us say we have the following cgroup hierarchy:
>   root_cg (prog0):
>     cg1 (prog1):
>       cg11 (prog11):
>         cg111 (prog111)
>         cg112 (prog112)
>       cg12 (prog12):
>         cg121 (prog121)
>         cg122 (prog122)
>     cg2 (prog2):
>       cg21 (prog21)
>       cg22 (prog22)
>       cg23 (prog23)

Thank you so much for working on this. I have some basic and a bit
tangential questions around how stacks are allocated. So, for sched_ext,
each scheduler would be represented by struct_ops and I think the interface
to load them would be attaching a struct_ops to a cgroup.

- I suppose each operation in a struct_ops would count as a separate program
  and would thus allocate 512 * nr_cpus stacks, right?

- If the same scheduler implementation is attached to more than one cgroup,
  would each instance be treated as a separate set of programs or would they
  share the stack?

- Most struct_ops operations won't need to be nested and thus wouldn't need
  to use a private stack. Would it be possible to indicate which one should
  use a private stack?

Thanks.
Alexei Starovoitov Oct. 15, 2024, 9:39 p.m. UTC | #2
On Tue, Oct 15, 2024 at 2:28 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Thu, Oct 10, 2024 at 10:55:52AM -0700, Yonghong Song wrote:
> > The main motivation for private stack comes from nested scheduler in
> > sched-ext from Tejun. The basic idea is that
> >  - each cgroup will its own associated bpf program,
> >  - bpf program with parent cgroup will call bpf programs
> >    in immediate child cgroups.
> >
> > Let us say we have the following cgroup hierarchy:
> >   root_cg (prog0):
> >     cg1 (prog1):
> >       cg11 (prog11):
> >         cg111 (prog111)
> >         cg112 (prog112)
> >       cg12 (prog12):
> >         cg121 (prog121)
> >         cg122 (prog122)
> >     cg2 (prog2):
> >       cg21 (prog21)
> >       cg22 (prog22)
> >       cg23 (prog23)
>
> Thank you so much for working on this. I have some basic and a bit
> tangential questions around how stacks are allocated. So, for sched_ext,
> each scheduler would be represented by struct_ops and I think the interface
> to load them would be attaching a struct_ops to a cgroup.
>
> - I suppose each operation in a struct_ops would count as a separate program
>   and would thus allocate 512 * nr_cpus stacks, right?

It's one stack per program.
Its size will be ~512 * nr_cpus * max_allowed_recursion.

We hope max_allowed_recursion == 4 or something small.

> - If the same scheduler implementation is attached to more than one cgroups,
>   would each instance be treated as a separate set of programs or would they
>   share the stack?

I think there is only one sched_ext struct_ops with
its set of progs. They are global and not "attached to a cgroup".

> - Most struct_ops operations won't need to be nested and thus wouldn't need
>   to use a private stack. Would it be possible to indicate which one should
>   use a private stack?

See my other reply. One of the bpf_verifier_ops callbacks would need to
indicate back to the trampoline which callback is nested with limited recursion.