diff mbox series

[bpf-next,v2,2/2] riscv, bpf: inline bpf_get_smp_processor_id()

Message ID 20240430175834.33152-3-puranjay@kernel.org (mailing list archive)
State New
Headers show
Series riscv, bpf: Support per-CPU insn and inline bpf_get_smp_processor_id() | expand

Checks

Context Check Description
conchuod/vmtest-fixes-PR fail merge-conflict

Commit Message

Puranjay Mohan April 30, 2024, 5:58 p.m. UTC
Inline the calls to bpf_get_smp_processor_id() in the riscv bpf jit.

RISCV saves the pointer to the CPU's task_struct in the TP (thread
pointer) register. This makes it trivial to get the CPU's processor id.
As thread_info is the first member of task_struct, we can read the
processor id from TP + offsetof(struct thread_info, cpu).

          RISCV64 JIT output for `call bpf_get_smp_processor_id`
	  ======================================================

                Before                           After
               --------                         -------

         auipc   t1,0x848c                  ld    a5,32(tp)
         jalr    604(t1)
         mv      a5,a0

Benchmark using [1] on Qemu.

./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc

+---------------+------------------+------------------+--------------+
|      Name     |     Before       |       After      |   % change   |
|---------------+------------------+------------------+--------------|
| glob-arr-inc  | 1.077 ± 0.006M/s | 1.336 ± 0.010M/s |   + 24.04%   |
| arr-inc       | 1.078 ± 0.002M/s | 1.332 ± 0.015M/s |   + 23.56%   |
| hash-inc      | 0.494 ± 0.004M/s | 0.653 ± 0.001M/s |   + 32.18%   |
+---------------+------------------+------------------+--------------+

NOTE: This benchmark includes changes from this patch and the previous
      patch that implemented the per-cpu insn.

[1] https://github.com/anakryiko/linux/commit/8dec900975ef

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 arch/riscv/net/bpf_jit_comp64.c | 26 ++++++++++++++++++++++++++
 include/linux/filter.h          |  1 +
 kernel/bpf/core.c               | 11 +++++++++++
 kernel/bpf/verifier.c           |  2 ++
 4 files changed, 40 insertions(+)

Comments

Kumar Kartikeya Dwivedi April 30, 2024, 7:18 p.m. UTC | #1
On Tue, 30 Apr 2024 at 20:00, Puranjay Mohan <puranjay@kernel.org> wrote:
>
> Inline the calls to bpf_get_smp_processor_id() in the riscv bpf jit.
>
> RISCV saves the pointer to the CPU's task_struct in the TP (thread
> pointer) register. This makes it trivial to get the CPU's processor id.
> As thread_info is the first member of task_struct, we can read the
> processor id from TP + offsetof(struct thread_info, cpu).
>
>           RISCV64 JIT output for `call bpf_get_smp_processor_id`
>           ======================================================
>
>                 Before                           After
>                --------                         -------
>
>          auipc   t1,0x848c                  ld    a5,32(tp)
>          jalr    604(t1)
>          mv      a5,a0
>
> Benchmark using [1] on Qemu.
>
> ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc
>
> +---------------+------------------+------------------+--------------+
> |      Name     |     Before       |       After      |   % change   |
> |---------------+------------------+------------------+--------------|
> | glob-arr-inc  | 1.077 ± 0.006M/s | 1.336 ± 0.010M/s |   + 24.04%   |
> | arr-inc       | 1.078 ± 0.002M/s | 1.332 ± 0.015M/s |   + 23.56%   |
> | hash-inc      | 0.494 ± 0.004M/s | 0.653 ± 0.001M/s |   + 32.18%   |
> +---------------+------------------+------------------+--------------+
>
> NOTE: This benchmark includes changes from this patch and the previous
>       patch that implemented the per-cpu insn.
>
> [1] https://github.com/anakryiko/linux/commit/8dec900975ef
>
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> ---

For non-riscv bits (& fwiw):

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Andrii Nakryiko May 1, 2024, 4:46 p.m. UTC | #2
On Tue, Apr 30, 2024 at 10:59 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>
> Inline the calls to bpf_get_smp_processor_id() in the riscv bpf jit.
>
> RISCV saves the pointer to the CPU's task_struct in the TP (thread
> pointer) register. This makes it trivial to get the CPU's processor id.
> As thread_info is the first member of task_struct, we can read the
> processor id from TP + offsetof(struct thread_info, cpu).
>
>           RISCV64 JIT output for `call bpf_get_smp_processor_id`
>           ======================================================
>
>                 Before                           After
>                --------                         -------
>
>          auipc   t1,0x848c                  ld    a5,32(tp)
>          jalr    604(t1)
>          mv      a5,a0
>

Nice, great find! Would you be able to do similar inlining for x86-64
as well? Disassembling bpf_get_smp_processor_id for x86-64 shows this:

Dump of assembler code for function bpf_get_smp_processor_id:
   0xffffffff810f91a0 <+0>:     0f 1f 44 00 00  nopl   0x0(%rax,%rax,1)
   0xffffffff810f91a5 <+5>:     65 8b 05 60 79 f3 7e    mov
%gs:0x7ef37960(%rip),%eax        # 0x30b0c <pcpu_hot+12>
   0xffffffff810f91ac <+12>:    48 98   cltq
   0xffffffff810f91ae <+14>:    c3      ret
End of assembler dump.

We should be able to do the same in x86-64 BPF JIT. (it's actually how
I started initially, I had a dedicated instruction reading per-cpu
memory, but ended up with more general "calculate per-cpu address").

Anyways, great work, a small nit below.

Acked-by: Andrii Nakryiko <andrii@kernel.org>

> Benchmark using [1] on Qemu.
>
> ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc
>
> +---------------+------------------+------------------+--------------+
> |      Name     |     Before       |       After      |   % change   |
> |---------------+------------------+------------------+--------------|
> | glob-arr-inc  | 1.077 ± 0.006M/s | 1.336 ± 0.010M/s |   + 24.04%   |
> | arr-inc       | 1.078 ± 0.002M/s | 1.332 ± 0.015M/s |   + 23.56%   |
> | hash-inc      | 0.494 ± 0.004M/s | 0.653 ± 0.001M/s |   + 32.18%   |
> +---------------+------------------+------------------+--------------+
>
> NOTE: This benchmark includes changes from this patch and the previous
>       patch that implemented the per-cpu insn.
>
> [1] https://github.com/anakryiko/linux/commit/8dec900975ef
>
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> ---
>  arch/riscv/net/bpf_jit_comp64.c | 26 ++++++++++++++++++++++++++
>  include/linux/filter.h          |  1 +
>  kernel/bpf/core.c               | 11 +++++++++++
>  kernel/bpf/verifier.c           |  2 ++
>  4 files changed, 40 insertions(+)
>
> diff --git a/arch/riscv/net/bpf_jit_comp64.c b/arch/riscv/net/bpf_jit_comp64.c
> index 99d7006f1420..5789b7afae47 100644
> --- a/arch/riscv/net/bpf_jit_comp64.c
> +++ b/arch/riscv/net/bpf_jit_comp64.c
> @@ -1493,6 +1493,22 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
>                 bool fixed_addr;
>                 u64 addr;
>
> +               /* Inline calls to bpf_get_smp_processor_id()
> +                *
> +                * RV_REG_TP holds the address of the current CPU's task_struct and thread_info is
> +                * at offset 0 in task_struct.
> +                * Load cpu from thread_info:
> +                *     Set R0 to ((struct thread_info *)(RV_REG_TP))->cpu
> +                *
> +                * This replicates the implementation of raw_smp_processor_id() on RISCV
> +                */
> +               if (insn->src_reg == 0 && insn->imm == BPF_FUNC_get_smp_processor_id) {
> +                       /* Load current CPU number in R0 */
> +                       emit_ld(bpf_to_rv_reg(BPF_REG_0, ctx), offsetof(struct thread_info, cpu),
> +                               RV_REG_TP, ctx);
> +                       break;
> +               }
> +
>                 mark_call(ctx);
>                 ret = bpf_jit_get_func_addr(ctx->prog, insn, extra_pass,
>                                             &addr, &fixed_addr);
> @@ -2062,3 +2078,13 @@ bool bpf_jit_supports_percpu_insn(void)
>  {
>         return true;
>  }
> +
> +bool bpf_jit_inlines_helper_call(s32 imm)
> +{
> +       switch (imm) {
> +       case BPF_FUNC_get_smp_processor_id:
> +               return true;
> +       }
> +
> +       return false;

nit: why not

default:
    return false;

to keep everything within the switch?

> +}
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 7a27f19bf44d..3e19bb62ed1a 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -993,6 +993,7 @@ u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
>  struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog);
>  void bpf_jit_compile(struct bpf_prog *prog);
>  bool bpf_jit_needs_zext(void);
> +bool bpf_jit_inlines_helper_call(s32 imm);
>  bool bpf_jit_supports_subprog_tailcalls(void);
>  bool bpf_jit_supports_percpu_insn(void);
>  bool bpf_jit_supports_kfunc_call(void);
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 99b8b1c9a248..aa59af9f9bd9 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -2941,6 +2941,17 @@ bool __weak bpf_jit_needs_zext(void)
>         return false;
>  }
>
> +/* Return true if the JIT inlines the call to the helper corresponding to
> + * the imm.
> + *
> + * The verifier will not patch the insn->imm for the call to the helper if
> + * this returns true.
> + */
> +bool __weak bpf_jit_inlines_helper_call(s32 imm)
> +{
> +       return false;
> +}
> +
>  /* Return TRUE if the JIT backend supports mixing bpf2bpf and tailcalls. */
>  bool __weak bpf_jit_supports_subprog_tailcalls(void)
>  {
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 5d42db05315e..e78f766d7f91 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -20013,6 +20013,8 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
>                         goto next_insn;
>                 }
>
> +               if (bpf_jit_inlines_helper_call(insn->imm))
> +                       goto next_insn;

It's nice to be able to allow BPF JIT to do a higher-performance
implementation. Let's add a short comment above to mention that this
is bypassing normal inlining because BPF JIT will do it better (I know
you have this description for the function definition, but a short
remark here would be helpful).

And please add an empty line after this check to logically separate it
from the rest of helper inlining logic in verifier, thanks!

pw-bot: cr

>                 if (insn->imm == BPF_FUNC_get_route_realm)
>                         prog->dst_needed = 1;
>                 if (insn->imm == BPF_FUNC_get_prandom_u32)
> --
> 2.40.1
>
Puranjay Mohan May 2, 2024, 1:16 p.m. UTC | #3
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Tue, Apr 30, 2024 at 10:59 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>>
>> Inline the calls to bpf_get_smp_processor_id() in the riscv bpf jit.
>>
>> RISCV saves the pointer to the CPU's task_struct in the TP (thread
>> pointer) register. This makes it trivial to get the CPU's processor id.
>> As thread_info is the first member of task_struct, we can read the
>> processor id from TP + offsetof(struct thread_info, cpu).
>>
>>           RISCV64 JIT output for `call bpf_get_smp_processor_id`
>>           ======================================================
>>
>>                 Before                           After
>>                --------                         -------
>>
>>          auipc   t1,0x848c                  ld    a5,32(tp)
>>          jalr    604(t1)
>>          mv      a5,a0
>>
>
> Nice, great find! Would you be able to do similar inlining for x86-64
> as well? Disassembling bpf_get_smp_processor_id for x86-64 shows this:
>
> Dump of assembler code for function bpf_get_smp_processor_id:
>    0xffffffff810f91a0 <+0>:     0f 1f 44 00 00  nopl   0x0(%rax,%rax,1)
>    0xffffffff810f91a5 <+5>:     65 8b 05 60 79 f3 7e    mov
> %gs:0x7ef37960(%rip),%eax        # 0x30b0c <pcpu_hot+12>
>    0xffffffff810f91ac <+12>:    48 98   cltq
>    0xffffffff810f91ae <+14>:    c3      ret
> End of assembler dump.
> We should be able to do the same in x86-64 BPF JIT. (it's actually how
> I started initially, I had a dedicated instruction reading per-cpu
> memory, but ended up with more general "calculate per-cpu address").

I feel in x86-64's case JIT can not do a (much) better job compared to the
current approach in the verifier.

On RISC-V and ARM64, JIT was able to do it better because both of these
architectures save a pointer to the task struct in a special CPU
register. As x86-64 doesn't have enough extra registers, it uses a
percpu variable to store task struct, thread_info, and the cpu
number.

P.S. - While doing this for BPF, I realized that ARM64 kernel code is
also not optimal as it is using the percpu variable and is not reading
the CPU register directly. So, I sent a patch[1] to fix it in the kernel
and get rid of the per-cpu variable in ARM64.


[1] https://lore.kernel.org/all/20240502123449.2690-2-puranjay@kernel.org/

> Anyways, great work, a small nit below.
>
> Acked-by: Andrii Nakryiko <andrii@kernel.org>

Thanks,
Puranjay
Andrii Nakryiko May 2, 2024, 4:03 p.m. UTC | #4
On Thu, May 2, 2024 at 6:16 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
> > On Tue, Apr 30, 2024 at 10:59 AM Puranjay Mohan <puranjay@kernel.org> wrote:
> >>
> >> Inline the calls to bpf_get_smp_processor_id() in the riscv bpf jit.
> >>
> >> RISCV saves the pointer to the CPU's task_struct in the TP (thread
> >> pointer) register. This makes it trivial to get the CPU's processor id.
> >> As thread_info is the first member of task_struct, we can read the
> >> processor id from TP + offsetof(struct thread_info, cpu).
> >>
> >>           RISCV64 JIT output for `call bpf_get_smp_processor_id`
> >>           ======================================================
> >>
> >>                 Before                           After
> >>                --------                         -------
> >>
> >>          auipc   t1,0x848c                  ld    a5,32(tp)
> >>          jalr    604(t1)
> >>          mv      a5,a0
> >>
> >
> > Nice, great find! Would you be able to do similar inlining for x86-64
> > as well? Disassembling bpf_get_smp_processor_id for x86-64 shows this:
> >
> > Dump of assembler code for function bpf_get_smp_processor_id:
> >    0xffffffff810f91a0 <+0>:     0f 1f 44 00 00  nopl   0x0(%rax,%rax,1)
> >    0xffffffff810f91a5 <+5>:     65 8b 05 60 79 f3 7e    mov
> > %gs:0x7ef37960(%rip),%eax        # 0x30b0c <pcpu_hot+12>
> >    0xffffffff810f91ac <+12>:    48 98   cltq
> >    0xffffffff810f91ae <+14>:    c3      ret
> > End of assembler dump.
> > We should be able to do the same in x86-64 BPF JIT. (it's actually how
> > I started initially, I had a dedicated instruction reading per-cpu
> > memory, but ended up with more general "calculate per-cpu address").
>
> I feel in x86-64's case JIT can not do a (much) better job compared to the
> current approach in the verifier.

This direct memory read (using gs segment) ought to be a bit faster
than calculating offset and then doing memory dereference, but yes,
the difference won't be as big as you got with RISC-V and ARM64. Ok,
never mind, we can always benchmark and add that later, no big deal.

>
> On RISC-V and ARM64, JIT was able to do it better because both of these
> architectures save a pointer to the task struct in a special CPU
> register. As x86-64 doesn't have enough extra registers, it uses a
> percpu variable to store task struct, thread_info, and the cpu
> number.
>
> P.S. - While doing this for BPF, I realized that ARM64 kernel code is
> also not optimal as it is using the percpu variable and is not reading
> the CPU register directly. So, I sent a patch[1] to fix it in the kernel
> and get rid of the per-cpu variable in ARM64.
>
>
> [1] https://lore.kernel.org/all/20240502123449.2690-2-puranjay@kernel.org/
>
> > Anyways, great work, a small nit below.
> >
> > Acked-by: Andrii Nakryiko <andrii@kernel.org>
>
> Thanks,
> Puranjay
Björn Töpel May 2, 2024, 4:19 p.m. UTC | #5
Puranjay Mohan <puranjay@kernel.org> writes:

> Inline the calls to bpf_get_smp_processor_id() in the riscv bpf jit.
>
> RISCV saves the pointer to the CPU's task_struct in the TP (thread
> pointer) register. This makes it trivial to get the CPU's processor id.
> As thread_info is the first member of task_struct, we can read the
> processor id from TP + offsetof(struct thread_info, cpu).
>
>           RISCV64 JIT output for `call bpf_get_smp_processor_id`
> 	  ======================================================
>
>                 Before                           After
>                --------                         -------
>
>          auipc   t1,0x848c                  ld    a5,32(tp)
>          jalr    604(t1)
>          mv      a5,a0
>
> Benchmark using [1] on Qemu.
>
> ./benchs/run_bench_trigger.sh glob-arr-inc arr-inc hash-inc
>
> +---------------+------------------+------------------+--------------+
> |      Name     |     Before       |       After      |   % change   |
> |---------------+------------------+------------------+--------------|
> | glob-arr-inc  | 1.077 ± 0.006M/s | 1.336 ± 0.010M/s |   + 24.04%   |
> | arr-inc       | 1.078 ± 0.002M/s | 1.332 ± 0.015M/s |   + 23.56%   |
> | hash-inc      | 0.494 ± 0.004M/s | 0.653 ± 0.001M/s |   + 32.18%   |
> +---------------+------------------+------------------+--------------+
>
> NOTE: This benchmark includes changes from this patch and the previous
>       patch that implemented the per-cpu insn.
>
> [1] https://github.com/anakryiko/linux/commit/8dec900975ef
>
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>

Acked-by: Björn Töpel <bjorn@kernel.org>
diff mbox series

Patch

diff --git a/arch/riscv/net/bpf_jit_comp64.c b/arch/riscv/net/bpf_jit_comp64.c
index 99d7006f1420..5789b7afae47 100644
--- a/arch/riscv/net/bpf_jit_comp64.c
+++ b/arch/riscv/net/bpf_jit_comp64.c
@@ -1493,6 +1493,22 @@  int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
 		bool fixed_addr;
 		u64 addr;
 
+		/* Inline calls to bpf_get_smp_processor_id()
+		 *
+		 * RV_REG_TP holds the address of the current CPU's task_struct and thread_info is
+		 * at offset 0 in task_struct.
+		 * Load cpu from thread_info:
+		 *     Set R0 to ((struct thread_info *)(RV_REG_TP))->cpu
+		 *
+		 * This replicates the implementation of raw_smp_processor_id() on RISCV
+		 */
+		if (insn->src_reg == 0 && insn->imm == BPF_FUNC_get_smp_processor_id) {
+			/* Load current CPU number in R0 */
+			emit_ld(bpf_to_rv_reg(BPF_REG_0, ctx), offsetof(struct thread_info, cpu),
+				RV_REG_TP, ctx);
+			break;
+		}
+
 		mark_call(ctx);
 		ret = bpf_jit_get_func_addr(ctx->prog, insn, extra_pass,
 					    &addr, &fixed_addr);
@@ -2062,3 +2078,13 @@  bool bpf_jit_supports_percpu_insn(void)
 {
 	return true;
 }
+
+bool bpf_jit_inlines_helper_call(s32 imm)
+{
+	switch (imm) {
+	case BPF_FUNC_get_smp_processor_id:
+		return true;
+	}
+
+	return false;
+}
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 7a27f19bf44d..3e19bb62ed1a 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -993,6 +993,7 @@  u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog);
 void bpf_jit_compile(struct bpf_prog *prog);
 bool bpf_jit_needs_zext(void);
+bool bpf_jit_inlines_helper_call(s32 imm);
 bool bpf_jit_supports_subprog_tailcalls(void);
 bool bpf_jit_supports_percpu_insn(void);
 bool bpf_jit_supports_kfunc_call(void);
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 99b8b1c9a248..aa59af9f9bd9 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2941,6 +2941,17 @@  bool __weak bpf_jit_needs_zext(void)
 	return false;
 }
 
+/* Return true if the JIT inlines the call to the helper corresponding to
+ * the imm.
+ *
+ * The verifier will not patch the insn->imm for the call to the helper if
+ * this returns true.
+ */
+bool __weak bpf_jit_inlines_helper_call(s32 imm)
+{
+	return false;
+}
+
 /* Return TRUE if the JIT backend supports mixing bpf2bpf and tailcalls. */
 bool __weak bpf_jit_supports_subprog_tailcalls(void)
 {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 5d42db05315e..e78f766d7f91 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -20013,6 +20013,8 @@  static int do_misc_fixups(struct bpf_verifier_env *env)
 			goto next_insn;
 		}
 
+		if (bpf_jit_inlines_helper_call(insn->imm))
+			goto next_insn;
 		if (insn->imm == BPF_FUNC_get_route_realm)
 			prog->dst_needed = 1;
 		if (insn->imm == BPF_FUNC_get_prandom_u32)