
[RFC,bpf-next,01/11] bpf: use branch predictions in opt_hard_wire_dead_code_branches()

Message ID 20241107175040.1659341-2-eddyz87@gmail.com (mailing list archive)
State RFC
Delegated to: BPF
Series bpf: inlinable kfuncs for BPF

Checks

Context Check Description
bpf/vmtest-bpf-next-PR success PR summary
bpf/vmtest-bpf-next-VM_Test-0 success Logs for Lint
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-8 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-3 success Logs for Validate matrix.py
bpf/vmtest-bpf-next-VM_Test-2 success Logs for Unittests
bpf/vmtest-bpf-next-VM_Test-5 success Logs for aarch64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-4 fail Logs for aarch64-gcc / build / build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-7 success Logs for aarch64-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-6 success Logs for aarch64-gcc / test
bpf/vmtest-bpf-next-VM_Test-9 success Logs for s390x-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-10 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-11 pending Logs for x86_64-gcc / build / build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-12 success Logs for x86_64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-13 pending Logs for x86_64-llvm-17 / build / build for x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-14 pending Logs for x86_64-llvm-17 / build-release / build for x86_64 with llvm-17-O2
bpf/vmtest-bpf-next-VM_Test-15 pending Logs for x86_64-llvm-18 / build / build for x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-16 pending Logs for x86_64-llvm-18 / build-release / build for x86_64 with llvm-18-O2
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 5 this patch: 5
netdev/build_tools success Errors and warnings before: 0 (+0) this patch: 0 (+0)
netdev/cc_maintainers warning 6 maintainers not CCed: john.fastabend@gmail.com kpsingh@kernel.org song@kernel.org jolsa@kernel.org haoluo@google.com sdf@fomichev.me
netdev/build_clang success Errors and warnings before: 3 this patch: 3
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 68 this patch: 68
netdev/checkpatch warning WARNING: line length of 81 exceeds 80 columns WARNING: line length of 85 exceeds 80 columns WARNING: line length of 90 exceeds 80 columns WARNING: line length of 91 exceeds 80 columns WARNING: line length of 93 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Eduard Zingerman Nov. 7, 2024, 5:50 p.m. UTC
Consider dead code elimination problem for program like below:

    main:
      1: r1 = 42
      2: call <subprogram>;
      3: exit

    subprogram:
      4: r0 = 1
      5: if r1 != 42 goto +1
      6: r0 = 2
      7: exit;

Here the verifier visits every instruction, so the
bpf_insn_aux_data->seen flag is set for both the true (7)
and fallthrough (6) branches of the conditional (5).
Hence opt_hard_wire_dead_code_branches() will not replace
the conditional (5) with an unconditional jump.

To cover such cases:
- add two fields in struct bpf_insn_aux_data:
  - true_branch_taken;
  - false_branch_taken;
- adjust check_cond_jmp_op() to set the fields according to jump
  predictions;
- handle these flags in the opt_hard_wire_dead_code_branches():
  - true_branch_taken && !false_branch_taken
    always jumps, replace the instruction with 'goto off';
  - !true_branch_taken && false_branch_taken
    always falls through, replace with 'goto +0';
  - true_branch_taken && false_branch_taken
    both jump and fallthrough are possible, don't change the instruction;
  - !true_branch_taken && !false_branch_taken
    neither jump nor fallthrough is possible; the condition itself
    must be dead code (should be removed by opt_remove_dead_code).
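
For the example above, conditional (5) always falls through (r1 is
always 42), so the !true_branch_taken && false_branch_taken case
applies and the expected rewrite is (a sketch; the resulting 'goto +0'
can then be dropped as a nop by later passes):

    subprogram:
      4: r0 = 1
      5: goto +0    /* was: if r1 != 42 goto +1 */
      6: r0 = 2
      7: exit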

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
---
 include/linux/bpf_verifier.h |  4 +++-
 kernel/bpf/verifier.c        | 16 ++++++++++++----
 2 files changed, 15 insertions(+), 5 deletions(-)

Comments

Eduard Zingerman Nov. 14, 2024, 10:20 p.m. UTC | #1
On Thu, 2024-11-07 at 09:50 -0800, Eduard Zingerman wrote:
> Consider dead code elimination problem for program like below:
> 
>     main:
>       1: r1 = 42
>       2: call <subprogram>;
>       3: exit
> 
>     subprogram:
>       4: r0 = 1
>       5: if r1 != 42 goto +1
>       6: r0 = 2
>       7: exit;
> 
> Here the verifier visits every instruction, so the
> bpf_insn_aux_data->seen flag is set for both the true (7)
> and fallthrough (6) branches of the conditional (5).
> Hence opt_hard_wire_dead_code_branches() will not replace
> the conditional (5) with an unconditional jump.

[...]

Had an off-list discussion with Alexei yesterday,
here are some answers to questions raised:
- The patches #1,2 with opt_hard_wire_dead_code_branches() changes are
  not necessary for dynptr_slice kfunc inlining / branch removal.
  I will drop these patches and adjust test cases.
- Did some measurements for dynptr_slice call using simple benchmark
  from patch #11:
  - baseline:
    76.167 ± 0.030 million calls per second;
  - with call inlining, but without branch pruning (only patch #3):
    101.198 ± 0.101 million calls per second;
  - with call inlining and with branch pruning (full patch-set):
    116.935 ± 0.142 million calls per second.
Andrii Nakryiko Nov. 15, 2024, 12:16 a.m. UTC | #2
On Thu, Nov 7, 2024 at 9:51 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> Consider dead code elimination problem for program like below:
>
>     main:
>       1: r1 = 42
>       2: call <subprogram>;
>       3: exit
>
>     subprogram:
>       4: r0 = 1
>       5: if r1 != 42 goto +1
>       6: r0 = 2
>       7: exit;
>
> Here the verifier visits every instruction, so the
> bpf_insn_aux_data->seen flag is set for both the true (7)
> and fallthrough (6) branches of the conditional (5).
> Hence opt_hard_wire_dead_code_branches() will not replace
> the conditional (5) with an unconditional jump.
>
> To cover such cases:
> - add two fields in struct bpf_insn_aux_data:
>   - true_branch_taken;
>   - false_branch_taken;
> - adjust check_cond_jmp_op() to set the fields according to jump
>   predictions;
> - handle these flags in the opt_hard_wire_dead_code_branches():
>   - true_branch_taken && !false_branch_taken
>     always jumps, replace the instruction with 'goto off';
>   - !true_branch_taken && false_branch_taken
>     always falls through, replace with 'goto +0';
>   - true_branch_taken && false_branch_taken
>     both jump and fallthrough are possible, don't change the instruction;
>   - !true_branch_taken && !false_branch_taken
>     neither jump nor fallthrough is possible; the condition itself
>     must be dead code (should be removed by opt_remove_dead_code).
>
> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
> ---
>  include/linux/bpf_verifier.h |  4 +++-
>  kernel/bpf/verifier.c        | 16 ++++++++++++----
>  2 files changed, 15 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index 4513372c5bc8..ed4eacfd4db7 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -570,7 +570,9 @@ struct bpf_insn_aux_data {
>         struct btf_struct_meta *kptr_struct_meta;
>         u64 map_key_state; /* constant (32 bit) key tracking for maps */
>         int ctx_field_size; /* the ctx field size for load insn, maybe 0 */
> -       u32 seen; /* this insn was processed by the verifier at env->pass_cnt */
> +       bool seen; /* this insn was processed by the verifier at env->pass_cnt */
> +       bool true_branch_taken; /* for cond jumps, set if verifier ever took true branch */
> +       bool false_branch_taken; /* for cond jumps, set if verifier ever took false branch */

we'll now have 12 bool fields in bpf_insn_aux_data; it's probably time
to switch to bitfields for those (even though you are trading 4 bytes
for 3 here). 72+ bytes per instruction is not negligible, especially
for big BPF programs.
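
A minimal sketch of what that could look like (field names taken from
this hunk, the rest of the struct omitted; exact packing and naming
illustrative):

    #include <stdbool.h>

    /* One-bit bitfields pack all the flags into a couple of bytes
     * instead of one byte per bool.
     */
    struct insn_aux_flags {
    	bool seen:1;                 /* insn processed by the verifier */
    	bool true_branch_taken:1;    /* cond jump: true branch ever taken */
    	bool false_branch_taken:1;   /* cond jump: false branch ever taken */
    	bool sanitize_stack_spill:1; /* subject to Spectre v4 sanitation */
    	bool zext_dst:1;             /* this insn zero extends dst reg */
    	bool needs_zext:1;           /* alu op needs to clear upper bits */
    };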

>         bool sanitize_stack_spill; /* subject to Spectre v4 sanitation */
>         bool zext_dst; /* this insn zero extends dst reg */
>         bool needs_zext; /* alu op needs to clear upper bits */

[...]
Andrii Nakryiko Nov. 15, 2024, 12:17 a.m. UTC | #3
On Thu, Nov 14, 2024 at 2:20 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Thu, 2024-11-07 at 09:50 -0800, Eduard Zingerman wrote:
> > Consider dead code elimination problem for program like below:
> >
> >     main:
> >       1: r1 = 42
> >       2: call <subprogram>;
> >       3: exit
> >
> >     subprogram:
> >       4: r0 = 1
> >       5: if r1 != 42 goto +1
> >       6: r0 = 2
> >       7: exit;
> >
> > Here the verifier visits every instruction, so the
> > bpf_insn_aux_data->seen flag is set for both the true (7)
> > and fallthrough (6) branches of the conditional (5).
> > Hence opt_hard_wire_dead_code_branches() will not replace
> > the conditional (5) with an unconditional jump.
>
> [...]
>
> Had an off-list discussion with Alexei yesterday,
> here are some answers to questions raised:
> - The patches #1,2 with opt_hard_wire_dead_code_branches() changes are
>   not necessary for dynptr_slice kfunc inlining / branch removal.
>   I will drop these patches and adjust test cases.
> - Did some measurements for dynptr_slice call using simple benchmark
>   from patch #11:
>   - baseline:
>     76.167 ± 0.030 million calls per second;
>   - with call inlining, but without branch pruning (only patch #3):
>     101.198 ± 0.101 million calls per second;
>   - with call inlining and with branch pruning (full patch-set):
>     116.935 ± 0.142 million calls per second.
>

This true/false logic seems generally useful, not just for this use
case; is there anything wrong with landing it? Seems pretty
straightforward. I'd split it from the kfunc inlining and land it
independently.
Andrii Nakryiko Nov. 15, 2024, 12:19 a.m. UTC | #4
On Thu, Nov 14, 2024 at 4:17 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Nov 14, 2024 at 2:20 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
> >
> > On Thu, 2024-11-07 at 09:50 -0800, Eduard Zingerman wrote:
> > > Consider dead code elimination problem for program like below:
> > >
> > >     main:
> > >       1: r1 = 42
> > >       2: call <subprogram>;
> > >       3: exit
> > >
> > >     subprogram:
> > >       4: r0 = 1
> > >       5: if r1 != 42 goto +1
> > >       6: r0 = 2
> > >       7: exit;
> > >
> > > Here the verifier visits every instruction, so the
> > > bpf_insn_aux_data->seen flag is set for both the true (7)
> > > and fallthrough (6) branches of the conditional (5).
> > > Hence opt_hard_wire_dead_code_branches() will not replace
> > > the conditional (5) with an unconditional jump.
> >
> > [...]
> >
> > Had an off-list discussion with Alexei yesterday,
> > here are some answers to questions raised:
> > - The patches #1,2 with opt_hard_wire_dead_code_branches() changes are
> >   not necessary for dynptr_slice kfunc inlining / branch removal.
> >   I will drop these patches and adjust test cases.
> > - Did some measurements for dynptr_slice call using simple benchmark
> >   from patch #11:
> >   - baseline:
> >     76.167 ± 0.030 million calls per second;
> >   - with call inlining, but without branch pruning (only patch #3):
> >     101.198 ± 0.101 million calls per second;
> >   - with call inlining and with branch pruning (full patch-set):
> >     116.935 ± 0.142 million calls per second.
> >
>
> This true/false logic seems generally useful, not just for this use
> case; is there anything wrong with landing it? Seems pretty
> straightforward. I'd split it from the kfunc inlining and land it
> independently.

I was also always hoping that we'll eventually optimize the following pattern:

r1 = *(global var)
if r1 == 1 /* always 1 or 0 */
   goto +...
...


This is extremely common with .rodata global variables, and while the
branches are dead code eliminated, memory reads are not. Not sure how
involved it would be to do this.
Eduard Zingerman Nov. 15, 2024, 12:20 a.m. UTC | #5
On Thu, 2024-11-14 at 16:17 -0800, Andrii Nakryiko wrote:

[...]

> This true/false logic seems generally useful, not just for this use
> case; is there anything wrong with landing it? Seems pretty
> straightforward. I'd split it from the kfunc inlining and land it
> independently.

I don't see anything wrong with it, but Alexei didn't like it.
Agree with your comment about organizing flags as bit-fields.
Alexei Starovoitov Nov. 15, 2024, 12:27 a.m. UTC | #6
On Thu, Nov 14, 2024 at 4:20 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Thu, 2024-11-14 at 16:17 -0800, Andrii Nakryiko wrote:
>
> [...]
>
> > This true/false logic seems generally useful, not just for this use
> > case; is there anything wrong with landing it? Seems pretty
> > straightforward. I'd split it from the kfunc inlining and land it
> > independently.
>
> I don't see anything wrong with it, but Alexei didn't like it.
> Agree with your comment about organizing flags as bit-fields.

Well, I asked whether it makes a difference.
And looks like your numbers show that this optimization adds 101m to 116m ?

If so, it's worth doing. I still don't like two bool flags though.
Eduard Zingerman Nov. 15, 2024, 12:33 a.m. UTC | #7
On Thu, 2024-11-14 at 16:27 -0800, Alexei Starovoitov wrote:

[...]

> Well, I asked whether it makes a difference.
> And looks like your numbers show that this optimization adds 101m to 116m ?

No, it does not make a difference for dynptr_slice:

> > - The patches #1,2 with opt_hard_wire_dead_code_branches() changes are
> >   not necessary for dynptr_slice kfunc inlining / branch removal.
> >  I will drop these patches and adjust test cases.

The 101m -> 116m is for inlining w/o known branch removal -> inlining with branch removal.
(With 76m being no inlining at all).

> If so, it's worth doing. I still don't like two bool flags though.

The pieces of information to encode here are:
- whether one branch outcome is always taken;
- if so, which one.

I don't think this can be narrowed down to a single bit.
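
Put differently, the two bools form a single 2-bit state; a sketch
(names hypothetical, not part of the patch):

    enum branch_outcomes {
    	BRANCHES_NONE  = 0, /* neither seen: conditional itself is dead code */
    	BRANCHES_TRUE  = 1, /* only the jump seen: rewrite to 'goto off' */
    	BRANCHES_FALSE = 2, /* only the fallthrough seen: rewrite to 'goto +0' */
    	BRANCHES_BOTH  = 3, /* both seen: keep the conditional */
    };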
Alexei Starovoitov Nov. 15, 2024, 12:38 a.m. UTC | #8
On Thu, Nov 14, 2024 at 4:34 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Thu, 2024-11-14 at 16:27 -0800, Alexei Starovoitov wrote:
>
> [...]
>
> > Well, I asked whether it makes a difference.
> > And looks like your numbers show that this optimization adds 101m to 116m ?
>
> No, it does not make a difference for dynptr_slice:

Then let's not do it. If there is no observable performance
difference there is no need to do it just to make selftests shorter.

> > > - The patches #1,2 with opt_hard_wire_dead_code_branches() changes are
> > >   not necessary for dynptr_slice kfunc inlining / branch removal.
> > >  I will drop these patches and adjust test cases.
>
> The 101m -> 116m is for inlining w/o known branch removal -> inlining with branch removal.
> (With 76m being no inlining at all).

Not following. Which patch # does branch removal then?
Eduard Zingerman Nov. 15, 2024, 12:43 a.m. UTC | #9
On Thu, 2024-11-14 at 16:38 -0800, Alexei Starovoitov wrote:

[...]

> > The 101m -> 116m is for inlining w/o known branch removal -> inlining with branch removal.
> > (With 76m being no inlining at all).
> 
> Not following. Which patch # does branch removal then?

- "bpf: shared BPF/native kfuncs" (patch #3)
  Build system integration and kfunc inlining after verification.

- "bpf: instantiate inlinable kfuncs before verification" (patch #7)
  Adds a pass that clones inlinable kfunc bodies as hidden
  subprograms, one subprogram per callsite.

#3 does inlining, but does not remove any branches.
#7 moves where inlining is done, which makes it possible to remove branches.

Performance numbers for the simple test:
- #3 alone : 76m -> 101m
- #3 + #7  : 76m -> 116m
Eduard Zingerman Nov. 15, 2024, 12:50 a.m. UTC | #10
On Thu, 2024-11-14 at 16:19 -0800, Andrii Nakryiko wrote:

[...]

> I was also always hoping that we'll eventually optimize the following pattern:
> 
> r1 = *(global var)
> if r1 == 1 /* always 1 or 0 */
>    goto +...
> ...
> 
> 
> This is extremely common with .rodata global variables, and while the
> branches are dead code eliminated, memory reads are not. Not sure how
> involved it would be to do this.

Could you please elaborate a bit.
For a simple test like the one below, the compiler replaces 'flag' with 1,
so no action is needed from the verifier:

    const int flag = 1;

    SEC("socket")
    __success
    __xlated("foobar")
    int rodata_test(void *ctx)
    {
    	if (flag)
    		return 1;
    	return 0;
    }
Andrii Nakryiko Nov. 15, 2024, 3:03 a.m. UTC | #11
On Thu, Nov 14, 2024 at 4:50 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Thu, 2024-11-14 at 16:19 -0800, Andrii Nakryiko wrote:
>
> [...]
>
> > I was also always hoping that we'll eventually optimize the following pattern:
> >
> > r1 = *(global var)
> > if r1 == 1 /* always 1 or 0 */
> >    goto +...
> > ...
> >
> >
> > This is extremely common with .rodata global variables, and while the
> > branches are dead code eliminated, memory reads are not. Not sure how
> > involved it would be to do this.
>
> Could you please elaborate a bit.
> For a simple test like the one below, the compiler replaces 'flag' with 1,
> so no action is needed from the verifier:
>
>     const int flag = 1;

now change this to an actual .rodata global variable:

const volatile int flag = 1;
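
With volatile the compiler can no longer constant-fold the load, so the
program keeps the pattern above: the verifier prunes the dead branch,
but the memory read stays. A sketch of the resulting xlated code
(registers and offsets illustrative):

    r1 = map[.rodata]        /* ld_imm64 pointing at the rodata map value */
    r1 = *(u32 *)(r1 + 0)    /* the load of 'flag' survives */
    /* 'if r1 == 0 goto ...' removed as dead code */
    r0 = 1
    exit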

>
>     SEC("socket")
>     __success
>     __xlated("foobar")
>     int rodata_test(void *ctx)
>     {
>         if (flag)
>                 return 1;
>         return 0;
>     }
>

Patch

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 4513372c5bc8..ed4eacfd4db7 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -570,7 +570,9 @@  struct bpf_insn_aux_data {
 	struct btf_struct_meta *kptr_struct_meta;
 	u64 map_key_state; /* constant (32 bit) key tracking for maps */
 	int ctx_field_size; /* the ctx field size for load insn, maybe 0 */
-	u32 seen; /* this insn was processed by the verifier at env->pass_cnt */
+	bool seen; /* this insn was processed by the verifier at env->pass_cnt */
+	bool true_branch_taken; /* for cond jumps, set if verifier ever took true branch */
+	bool false_branch_taken; /* for cond jumps, set if verifier ever took false branch */
 	bool sanitize_stack_spill; /* subject to Spectre v4 sanitation */
 	bool zext_dst; /* this insn zero extends dst reg */
 	bool needs_zext; /* alu op needs to clear upper bits */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 7958d6ff6b73..3bae0bbc1da9 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -13265,7 +13265,7 @@  static void sanitize_mark_insn_seen(struct bpf_verifier_env *env)
 	 * rewrite/sanitize them.
 	 */
 	if (!vstate->speculative)
-		env->insn_aux_data[env->insn_idx].seen = env->pass_cnt;
+		env->insn_aux_data[env->insn_idx].seen = env->pass_cnt > 0;
 }
 
 static int sanitize_err(struct bpf_verifier_env *env,
@@ -15484,6 +15484,7 @@  static int check_cond_jmp_op(struct bpf_verifier_env *env,
 {
 	struct bpf_verifier_state *this_branch = env->cur_state;
 	struct bpf_verifier_state *other_branch;
+	struct bpf_insn_aux_data *aux = &env->insn_aux_data[*insn_idx];
 	struct bpf_reg_state *regs = this_branch->frame[this_branch->curframe]->regs;
 	struct bpf_reg_state *dst_reg, *other_branch_regs, *src_reg = NULL;
 	struct bpf_reg_state *eq_branch_regs;
@@ -15510,6 +15511,8 @@  static int check_cond_jmp_op(struct bpf_verifier_env *env,
 				insn->off, insn->imm);
 			return -EINVAL;
 		}
+		aux->true_branch_taken = true;
+		aux->false_branch_taken = true;
 		prev_st = find_prev_entry(env, cur_st->parent, idx);
 
 		/* branch out 'fallthrough' insn as a new state to explore */
@@ -15579,6 +15582,7 @@  static int check_cond_jmp_op(struct bpf_verifier_env *env,
 		 * the fall-through branch for simulation under speculative
 		 * execution.
 		 */
+		aux->true_branch_taken = true;
 		if (!env->bypass_spec_v1 &&
 		    !sanitize_speculative_path(env, insn, *insn_idx + 1,
 					       *insn_idx))
@@ -15592,6 +15596,7 @@  static int check_cond_jmp_op(struct bpf_verifier_env *env,
 		 * program will go. If needed, push the goto branch for
 		 * simulation under speculative execution.
 		 */
+		aux->false_branch_taken = true;
 		if (!env->bypass_spec_v1 &&
 		    !sanitize_speculative_path(env, insn,
 					       *insn_idx + insn->off + 1,
@@ -15602,6 +15607,9 @@  static int check_cond_jmp_op(struct bpf_verifier_env *env,
 		return 0;
 	}
 
+	aux->true_branch_taken = true;
+	aux->false_branch_taken = true;
+
 	/* Push scalar registers sharing same ID to jump history,
 	 * do this before creating 'other_branch', so that both
 	 * 'this_branch' and 'other_branch' share this history
@@ -19288,7 +19296,7 @@  static void adjust_insn_aux_data(struct bpf_verifier_env *env,
 {
 	struct bpf_insn_aux_data *old_data = env->insn_aux_data;
 	struct bpf_insn *insn = new_prog->insnsi;
-	u32 old_seen = old_data[off].seen;
+	bool old_seen = old_data[off].seen;
 	u32 prog_len;
 	int i;
 
@@ -19608,9 +19616,9 @@  static void opt_hard_wire_dead_code_branches(struct bpf_verifier_env *env)
 		if (!insn_is_cond_jump(insn->code))
 			continue;
 
-		if (!aux_data[i + 1].seen)
+		if (aux_data[i].true_branch_taken && !aux_data[i].false_branch_taken)
 			ja.off = insn->off;
-		else if (!aux_data[i + 1 + insn->off].seen)
+		else if (!aux_data[i].true_branch_taken && aux_data[i].false_branch_taken)
 			ja.off = 0;
 		else
 			continue;