Message ID | 20241107175040.1659341-2-eddyz87@gmail.com (mailing list archive) |
---|---|
State | RFC |
Delegated to: | BPF |
Series | bpf: inlinable kfuncs for BPF |
On Thu, 2024-11-07 at 09:50 -0800, Eduard Zingerman wrote:
> Consider the dead code elimination problem for a program like below:
>
> main:
> 1: r1 = 42
> 2: call <subprogram>;
> 3: exit
>
> subprogram:
> 4: r0 = 1
> 5: if r1 != 42 goto +1
> 6: r0 = 2
> 7: exit;
>
> Here the verifier would visit every instruction, and thus the
> bpf_insn_aux_data->seen flag would be set for both the true (7)
> and fallthrough (6) branches of conditional (5).
> Hence opt_hard_wire_dead_code_branches() will not replace
> conditional (5) with an unconditional jump.

[...]

Had an off-list discussion with Alexei yesterday,
here are some answers to questions raised:
- The patches #1,2 with the opt_hard_wire_dead_code_branches() changes
  are not necessary for dynptr_slice kfunc inlining / branch removal.
  I will drop these patches and adjust the test cases.
- Did some measurements for a dynptr_slice call using the simple
  benchmark from patch #11:
  - baseline:
    76.167 ± 0.030 million calls per second;
  - with call inlining, but without branch pruning (only patch #3):
    101.198 ± 0.101 million calls per second;
  - with call inlining and with branch pruning (full patch set):
    116.935 ± 0.142 million calls per second.
On Thu, Nov 7, 2024 at 9:51 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> Consider the dead code elimination problem for a program like below:
>
> main:
> 1: r1 = 42
> 2: call <subprogram>;
> 3: exit
>
> subprogram:
> 4: r0 = 1
> 5: if r1 != 42 goto +1
> 6: r0 = 2
> 7: exit;
>
> Here the verifier would visit every instruction, and thus the
> bpf_insn_aux_data->seen flag would be set for both the true (7)
> and fallthrough (6) branches of conditional (5).
> Hence opt_hard_wire_dead_code_branches() will not replace
> conditional (5) with an unconditional jump.
>
> To cover such cases:
> - add two fields to struct bpf_insn_aux_data:
>   - true_branch_taken;
>   - false_branch_taken;
> - adjust check_cond_jmp_op() to set the fields according to jump
>   predictions;
> - handle these flags in opt_hard_wire_dead_code_branches():
>   - true_branch_taken && !false_branch_taken:
>     always jumps, replace the instruction with 'goto off';
>   - !true_branch_taken && false_branch_taken:
>     always falls through, replace with 'goto +0';
>   - true_branch_taken && false_branch_taken:
>     both jump and fallthrough are possible, don't change the instruction;
>   - !true_branch_taken && !false_branch_taken:
>     neither jump nor fallthrough is possible, the condition itself
>     must be dead code (should be removed by opt_remove_dead_code).
>
> Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
> ---
>  include/linux/bpf_verifier.h |  4 +++-
>  kernel/bpf/verifier.c        | 16 ++++++++++++----
>  2 files changed, 15 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index 4513372c5bc8..ed4eacfd4db7 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -570,7 +570,9 @@ struct bpf_insn_aux_data {
>  	struct btf_struct_meta *kptr_struct_meta;
>  	u64 map_key_state; /* constant (32 bit) key tracking for maps */
>  	int ctx_field_size; /* the ctx field size for load insn, maybe 0 */
> -	u32 seen; /* this insn was processed by the verifier at env->pass_cnt */
> +	bool seen; /* this insn was processed by the verifier at env->pass_cnt */
> +	bool true_branch_taken; /* for cond jumps, set if verifier ever took true branch */
> +	bool false_branch_taken; /* for cond jumps, set if verifier ever took false branch */

We'll now have 12 bool fields in bpf_insn_aux_data; it's probably time
to switch to bitfields for those (even though you are trading 4 bytes
for 3 here). 72+ bytes per instruction is not unnoticeable, especially
for big BPF programs.

>  	bool sanitize_stack_spill; /* subject to Spectre v4 sanitation */
>  	bool zext_dst; /* this insn zero extends dst reg */
>  	bool needs_zext; /* alu op needs to clear upper bits */

[...]
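A minimal, self-contained sketch of the bitfield packing suggested above
(illustrative struct name and abbreviated field subset, not the actual
bpf_insn_aux_data definition):

#include <stdbool.h>
#include <stdio.h>

/* Sketch of the suggested packing: one bit per flag instead of one
 * byte per bool. Field names mirror the patch; the real struct has
 * many more members.
 */
struct insn_aux_flags {
	bool seen:1;
	bool true_branch_taken:1;
	bool false_branch_taken:1;
	bool sanitize_stack_spill:1;
	bool zext_dst:1;
	bool needs_zext:1;
};

int main(void)
{
	/* On typical ABIs this prints 1: all six flags share a single
	 * byte, where six plain bools would occupy six bytes.
	 */
	printf("%zu\n", sizeof(struct insn_aux_flags));
	return 0;
}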
On Thu, Nov 14, 2024 at 2:20 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Thu, 2024-11-07 at 09:50 -0800, Eduard Zingerman wrote:
> > Consider the dead code elimination problem for a program like below:
> >
> > main:
> > 1: r1 = 42
> > 2: call <subprogram>;
> > 3: exit
> >
> > subprogram:
> > 4: r0 = 1
> > 5: if r1 != 42 goto +1
> > 6: r0 = 2
> > 7: exit;
> >
> > Here the verifier would visit every instruction, and thus the
> > bpf_insn_aux_data->seen flag would be set for both the true (7)
> > and fallthrough (6) branches of conditional (5).
> > Hence opt_hard_wire_dead_code_branches() will not replace
> > conditional (5) with an unconditional jump.
>
> [...]
>
> Had an off-list discussion with Alexei yesterday,
> here are some answers to questions raised:
> - The patches #1,2 with the opt_hard_wire_dead_code_branches() changes
>   are not necessary for dynptr_slice kfunc inlining / branch removal.
>   I will drop these patches and adjust the test cases.
> - Did some measurements for a dynptr_slice call using the simple
>   benchmark from patch #11:
>   - baseline:
>     76.167 ± 0.030 million calls per second;
>   - with call inlining, but without branch pruning (only patch #3):
>     101.198 ± 0.101 million calls per second;
>   - with call inlining and with branch pruning (full patch set):
>     116.935 ± 0.142 million calls per second.

This true/false logic seems generally useful not just for this use
case, is there anything wrong with landing it? Seems pretty
straightforward. I'd split it from the kfunc inlining and land
independently.
On Thu, Nov 14, 2024 at 4:17 PM Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Nov 14, 2024 at 2:20 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
> >
> > On Thu, 2024-11-07 at 09:50 -0800, Eduard Zingerman wrote:
> > > Consider the dead code elimination problem for a program like below:
> > >
> > > main:
> > > 1: r1 = 42
> > > 2: call <subprogram>;
> > > 3: exit
> > >
> > > subprogram:
> > > 4: r0 = 1
> > > 5: if r1 != 42 goto +1
> > > 6: r0 = 2
> > > 7: exit;
> > >
> > > Here the verifier would visit every instruction, and thus the
> > > bpf_insn_aux_data->seen flag would be set for both the true (7)
> > > and fallthrough (6) branches of conditional (5).
> > > Hence opt_hard_wire_dead_code_branches() will not replace
> > > conditional (5) with an unconditional jump.
> >
> > [...]
> >
> > Had an off-list discussion with Alexei yesterday,
> > here are some answers to questions raised:
> > - The patches #1,2 with the opt_hard_wire_dead_code_branches() changes
> >   are not necessary for dynptr_slice kfunc inlining / branch removal.
> >   I will drop these patches and adjust the test cases.
> > - Did some measurements for a dynptr_slice call using the simple
> >   benchmark from patch #11:
> >   - baseline:
> >     76.167 ± 0.030 million calls per second;
> >   - with call inlining, but without branch pruning (only patch #3):
> >     101.198 ± 0.101 million calls per second;
> >   - with call inlining and with branch pruning (full patch set):
> >     116.935 ± 0.142 million calls per second.
>
> This true/false logic seems generally useful not just for this use
> case, is there anything wrong with landing it? Seems pretty
> straightforward. I'd split it from the kfunc inlining and land
> independently.

I was also always hoping that we'll eventually optimize the following pattern:

r1 = *(global var)
if r1 == 1 /* always 1 or 0 */
    goto +...
...

This is extremely common with .rodata global variables, and while the
branches are dead code eliminated, memory reads are not. Not sure how
involved it would be to do this.
On Thu, 2024-11-14 at 16:17 -0800, Andrii Nakryiko wrote:

[...]

> This true/false logic seems generally useful not just for this use
> case, is there anything wrong with landing it? Seems pretty
> straightforward. I'd split it from the kfunc inlining and land
> independently.

I don't see anything wrong with it, but Alexei didn't like it.
Agree with your comment about organizing flags as bit-fields.
On Thu, Nov 14, 2024 at 4:20 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Thu, 2024-11-14 at 16:17 -0800, Andrii Nakryiko wrote:
>
> [...]
>
> > This true/false logic seems generally useful not just for this use
> > case, is there anything wrong with landing it? Seems pretty
> > straightforward. I'd split it from the kfunc inlining and land
> > independently.
>
> I don't see anything wrong with it, but Alexei didn't like it.
> Agree with your comment about organizing flags as bit-fields.

Well, I asked whether it makes a difference.
And it looks like your numbers show that this optimization takes 101m to 116m?
If so, it's worth doing. I still don't like the two bool flags though.
On Thu, 2024-11-14 at 16:27 -0800, Alexei Starovoitov wrote:

[...]

> Well, I asked whether it makes a difference.
> And it looks like your numbers show that this optimization takes 101m to 116m?

No, it does not make a difference for dynptr_slice:

> > - The patches #1,2 with the opt_hard_wire_dead_code_branches() changes
> >   are not necessary for dynptr_slice kfunc inlining / branch removal.
> >   I will drop these patches and adjust the test cases.

The 101m -> 116m is for inlining without known-branch removal -> inlining
with branch removal. (With 76m being no inlining at all.)

> If so, it's worth doing. I still don't like the two bool flags though.

The pieces of information to encode here are:
- whether the branch is always predicted;
- if so, which branch.

I don't think this could be narrowed down to a single bit.
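To make the encoding argument concrete: the four (true_branch_taken,
false_branch_taken) combinations map onto the rewrites listed in the
commit message, and three of the four states carry distinct meaning, so
one bit cannot represent them. An illustrative sketch (not code from
the patch set):

/* The (true_branch_taken, false_branch_taken) states and the rewrite
 * each one implies in opt_hard_wire_dead_code_branches().
 */
enum cond_jump_verdict {
	NEITHER_TAKEN, /* insn is unreachable; left to opt_remove_dead_code */
	FALSE_ONLY,    /* always falls through; rewrite to 'goto +0'        */
	TRUE_ONLY,     /* always jumps; rewrite to 'goto off'               */
	BOTH_TAKEN,    /* genuinely conditional; keep the instruction       */
};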
On Thu, Nov 14, 2024 at 4:34 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Thu, 2024-11-14 at 16:27 -0800, Alexei Starovoitov wrote:
>
> [...]
>
> > Well, I asked whether it makes a difference.
> > And it looks like your numbers show that this optimization takes 101m to 116m?
>
> No, it does not make a difference for dynptr_slice:

Then let's not do it. If there is no observable performance difference,
there is no need to do it just to make selftests shorter.

> > > - The patches #1,2 with the opt_hard_wire_dead_code_branches() changes
> > >   are not necessary for dynptr_slice kfunc inlining / branch removal.
> > >   I will drop these patches and adjust the test cases.
>
> The 101m -> 116m is for inlining without known-branch removal -> inlining
> with branch removal. (With 76m being no inlining at all.)

Not following. Which patch # does branch removal then?
On Thu, 2024-11-14 at 16:38 -0800, Alexei Starovoitov wrote:

[...]

> > The 101m -> 116m is for inlining without known-branch removal -> inlining
> > with branch removal. (With 76m being no inlining at all.)
>
> Not following. Which patch # does branch removal then?

- "bpf: shared BPF/native kfuncs" (patch #3)
  Build system integration and kfunc inlining after verification.
- "bpf: instantiate inlinable kfuncs before verification" (patch #7)
  Adds a pass that clones inlinable kfunc bodies as hidden subprograms,
  one subprogram per callsite.

#3 does inlining, but does not remove any branches.
#7 moves inlining to before verification, which makes it possible to
remove branches.

Performance numbers for the simple test:
- #3 alone: 76m -> 101m
- #3 + #7 : 76m -> 116m
On Thu, 2024-11-14 at 16:19 -0800, Andrii Nakryiko wrote:

[...]

> I was also always hoping that we'll eventually optimize the following pattern:
>
> r1 = *(global var)
> if r1 == 1 /* always 1 or 0 */
>     goto +...
> ...
>
> This is extremely common with .rodata global variables, and while the
> branches are dead code eliminated, memory reads are not. Not sure how
> involved it would be to do this.

Could you please elaborate a bit.
For a simple test like the one below, the compiler replaces 'flag' with 1,
so no action is needed from the verifier:

const int flag = 1;

SEC("socket")
__success
__xlated("foobar")
int rodata_test(void *ctx)
{
	if (flag)
		return 1;
	return 0;
}
On Thu, Nov 14, 2024 at 4:50 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Thu, 2024-11-14 at 16:19 -0800, Andrii Nakryiko wrote:
>
> [...]
>
> > I was also always hoping that we'll eventually optimize the following pattern:
> >
> > r1 = *(global var)
> > if r1 == 1 /* always 1 or 0 */
> >     goto +...
> > ...
> >
> > This is extremely common with .rodata global variables, and while the
> > branches are dead code eliminated, memory reads are not. Not sure how
> > involved it would be to do this.
>
> Could you please elaborate a bit.
> For a simple test like the one below, the compiler replaces 'flag' with 1,
> so no action is needed from the verifier:
>
> const int flag = 1;

Now change this to an actual .rodata global variable:

const volatile int flag = 1;

>
> SEC("socket")
> __success
> __xlated("foobar")
> int rodata_test(void *ctx)
> {
> 	if (flag)
> 		return 1;
> 	return 0;
> }
>
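Putting the two messages together, a minimal standalone version of the
pattern under discussion looks like the sketch below (the selftest-only
__success/__xlated annotations are dropped; the include and license line
are assumptions needed to make it a loadable libbpf program). With
'const volatile' the compiler can no longer constant-fold the value, so
the load of 'flag' survives even though the verifier proves the branch
direction and prunes the dead branch:

#include <bpf/bpf_helpers.h>

/* 'const volatile' places 'flag' in the .rodata map and prevents the
 * compiler from folding its value into the generated code.
 */
const volatile int flag = 1;

SEC("socket")
int rodata_test(void *ctx)
{
	if (flag)	/* the memory read of 'flag' remains in the program */
		return 1;
	return 0;
}

char _license[] SEC("license") = "GPL";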
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 4513372c5bc8..ed4eacfd4db7 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -570,7 +570,9 @@ struct bpf_insn_aux_data {
 	struct btf_struct_meta *kptr_struct_meta;
 	u64 map_key_state; /* constant (32 bit) key tracking for maps */
 	int ctx_field_size; /* the ctx field size for load insn, maybe 0 */
-	u32 seen; /* this insn was processed by the verifier at env->pass_cnt */
+	bool seen; /* this insn was processed by the verifier at env->pass_cnt */
+	bool true_branch_taken; /* for cond jumps, set if verifier ever took true branch */
+	bool false_branch_taken; /* for cond jumps, set if verifier ever took false branch */
 	bool sanitize_stack_spill; /* subject to Spectre v4 sanitation */
 	bool zext_dst; /* this insn zero extends dst reg */
 	bool needs_zext; /* alu op needs to clear upper bits */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 7958d6ff6b73..3bae0bbc1da9 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -13265,7 +13265,7 @@ static void sanitize_mark_insn_seen(struct bpf_verifier_env *env)
 	 * rewrite/sanitize them.
	 */
 	if (!vstate->speculative)
-		env->insn_aux_data[env->insn_idx].seen = env->pass_cnt;
+		env->insn_aux_data[env->insn_idx].seen = env->pass_cnt > 0;
 }
 
 static int sanitize_err(struct bpf_verifier_env *env,
@@ -15484,6 +15484,7 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env,
 {
 	struct bpf_verifier_state *this_branch = env->cur_state;
 	struct bpf_verifier_state *other_branch;
+	struct bpf_insn_aux_data *aux = &env->insn_aux_data[*insn_idx];
 	struct bpf_reg_state *regs = this_branch->frame[this_branch->curframe]->regs;
 	struct bpf_reg_state *dst_reg, *other_branch_regs, *src_reg = NULL;
 	struct bpf_reg_state *eq_branch_regs;
@@ -15510,6 +15511,8 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env,
 				insn->off, insn->imm);
 			return -EINVAL;
 		}
+		aux->true_branch_taken = true;
+		aux->false_branch_taken = true;
 		prev_st = find_prev_entry(env, cur_st->parent, idx);
 
 		/* branch out 'fallthrough' insn as a new state to explore */
@@ -15579,6 +15582,7 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env,
 			 * the fall-through branch for simulation under speculative
 			 * execution.
 			 */
+			aux->true_branch_taken = true;
 			if (!env->bypass_spec_v1 &&
 			    !sanitize_speculative_path(env, insn, *insn_idx + 1,
 						       *insn_idx))
@@ -15592,6 +15596,7 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env,
 			 * program will go. If needed, push the goto branch for
 			 * simulation under speculative execution.
 			 */
+			aux->false_branch_taken = true;
 			if (!env->bypass_spec_v1 &&
 			    !sanitize_speculative_path(env, insn,
 						       *insn_idx + insn->off + 1,
 						       *insn_idx))
@@ -15602,6 +15607,9 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env,
 		return 0;
 	}
 
+	aux->true_branch_taken = true;
+	aux->false_branch_taken = true;
+
 	/* Push scalar registers sharing same ID to jump history,
 	 * do this before creating 'other_branch', so that both
 	 * 'this_branch' and 'other_branch' share this history
@@ -19288,7 +19296,7 @@ static void adjust_insn_aux_data(struct bpf_verifier_env *env,
 {
 	struct bpf_insn_aux_data *old_data = env->insn_aux_data;
 	struct bpf_insn *insn = new_prog->insnsi;
-	u32 old_seen = old_data[off].seen;
+	bool old_seen = old_data[off].seen;
 	u32 prog_len;
 	int i;
 
@@ -19608,9 +19616,9 @@ static void opt_hard_wire_dead_code_branches(struct bpf_verifier_env *env)
 		if (!insn_is_cond_jump(insn->code))
 			continue;
 
-		if (!aux_data[i + 1].seen)
+		if (aux_data[i].true_branch_taken && !aux_data[i].false_branch_taken)
 			ja.off = insn->off;
-		else if (!aux_data[i + 1 + insn->off].seen)
+		else if (!aux_data[i].true_branch_taken && aux_data[i].false_branch_taken)
 			ja.off = 0;
 		else
 			continue;
Consider the dead code elimination problem for a program like below:

main:
1: r1 = 42
2: call <subprogram>;
3: exit

subprogram:
4: r0 = 1
5: if r1 != 42 goto +1
6: r0 = 2
7: exit;

Here the verifier would visit every instruction, and thus the
bpf_insn_aux_data->seen flag would be set for both the true (7)
and fallthrough (6) branches of conditional (5).
Hence opt_hard_wire_dead_code_branches() will not replace
conditional (5) with an unconditional jump.

To cover such cases:
- add two fields to struct bpf_insn_aux_data:
  - true_branch_taken;
  - false_branch_taken;
- adjust check_cond_jmp_op() to set the fields according to jump
  predictions;
- handle these flags in opt_hard_wire_dead_code_branches():
  - true_branch_taken && !false_branch_taken:
    always jumps, replace the instruction with 'goto off';
  - !true_branch_taken && false_branch_taken:
    always falls through, replace with 'goto +0';
  - true_branch_taken && false_branch_taken:
    both jump and fallthrough are possible, don't change the instruction;
  - !true_branch_taken && !false_branch_taken:
    neither jump nor fallthrough is possible, the condition itself
    must be dead code (should be removed by opt_remove_dead_code).

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
---
 include/linux/bpf_verifier.h |  4 +++-
 kernel/bpf/verifier.c        | 16 ++++++++++++----
 2 files changed, 15 insertions(+), 5 deletions(-)
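For the example program in this commit message, the verifier only ever
sees r1 == 42 at instruction (5), so check_cond_jmp_op() sets
false_branch_taken but not true_branch_taken, and
opt_hard_wire_dead_code_branches() applies the second rule above. A
sketch of the intended effect (illustrative, not actual verifier
output):

subprogram (before):        subprogram (after rewrite):
4: r0 = 1                   4: r0 = 1
5: if r1 != 42 goto +1      5: goto +0  /* jump never taken */
6: r0 = 2                   6: r0 = 2
7: exit;                    7: exit;

The resulting 'goto +0' is effectively a nop and can be dropped by the
later opt_remove_nops() pass.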