Message ID: 20241009021254.2805446-1-eddyz87@gmail.com (mailing list archive)
State: Changes Requested
Delegated to: BPF
Series: [bpf-next,v1,1/2] bpf: force checkpoints at loop back-edges
On Tue, 2024-10-08 at 19:12 -0700, Eduard Zingerman wrote:

[...]

> This patch forcibly enables checkpoints for each loop back-edge.
> This helps with the programs in question, as verification of both
> syzbot program and reduced snippet finishes in ~2.5 sec.

[...]

> File                        Program   Insns (DIFF)      States (DIFF)
> --------------------------  --------  ----------------  -------------------
> ...
> pyperf600.bpf.o             on_event  -53870 (-9.81%)   -3000 (-10.16%)
> ...

fwiw, this patch speeds up verification of pyperf600 by a small margin,
sufficient for it to pass in combination with the jump history bump when
LLVM 20 is used.

# ./veristat pyperf600.bpf.o no_alu32/pyperf600.bpf.o cpuv4/pyperf600.bpf.o
...
File             Program   Verdict  Duration (us)  Insns   States  Peak states
---------------  --------  -------  -------------  ------  ------  -----------
pyperf600.bpf.o  on_event  success        2400571  872914   26490        25480
pyperf600.bpf.o  on_event  success        2460908  947038   26090        25330
pyperf600.bpf.o  on_event  success        2158117  788481   26329        25368
---------------  --------  -------  -------------  ------  ------  -----------

W/o this patch the jump history bump is not sufficient to get the no_alu32
version verified, the instruction limit is reached.

The draft is here:
https://github.com/eddyz87/bpf/tree/jmp-history-pyperf600

[...]
On Tue, 2024-10-08 at 19:12 -0700, Eduard Zingerman wrote:
> In [1] syzbot reported an interesting BPF program.
> Verification for this program takes a very long time.
>
> [1] https://lore.kernel.org/bpf/670429f6.050a0220.49194.0517.GAE@google.com/
>
> The program could be simplified to the following snippet:
>
>     /* Program type is kprobe */
>        r7 = *(u16 *)(r1 +0);
>     1: r7 += 0x1ab064b9;
>        if r7 & 0x702000 goto 1b;
>        r7 &= 0x1ee60e;
>        r7 += r1;
>        if r7 s> 0x37d2 goto +0;
>        r0 = 0;
>        exit;

Answering a few questions from an off-list discussion with Alexei.
The test is not specific to the jset instruction, e.g. the following
program exhibits similar behaviour:

    SEC("kprobe")
    __failure __log_level(4)
    __msg("BPF program is too large.")
    __naked void short_loop1(void)
    {
    	asm volatile (
    	"   r7 = *(u16 *)(r1 +0);"
    	"   r8 = *(u64 *)(r1 +16);"
    	"1: r7 += 0x1ab064b9;"
    	"   if r7 < r8 goto 1b;"
    	"   r7 &= 0x1ee60e;"
    	"   r7 += r1;"
    	"   if r7 s> 0x37d2 goto +0;"
    	"   r0 = 0;"
    	"   exit;"
    	::: __clobber_all);
    }

> The snippet exhibits the following behaviour depending on
> BPF_COMPLEXITY_LIMIT_INSNS:
> - at 1,000,000 verification does not finish in 15 minutes;
> - at 100,000 verification finishes in 15 seconds;
> - at 100 it is possible to get some verifier log.

Still investigating why the running time change is non-linear.

[...]

> This patch forcibly enables checkpoints for each loop back-edge.
> This helps with the programs in question, as verification of both
> syzbot program and reduced snippet finishes in ~2.5 sec.

There is the following code in is_state_visited():

	...
	/* if the verifier is processing a loop, avoid adding new state
	 * too often, since different loop iterations have distinct
	 * states and may not help future pruning.
	 * This threshold shouldn't be too low to make sure that
	 * a loop with large bound will be rejected quickly.
	 * The most abusive loop will be:
	 *   r1 += 1
	 *   if r1 < 1000000 goto pc-2
	 * 1M insn_processed limit / 100 == 10k peak states.
	 * This threshold shouldn't be too high either, since states
	 * at the end of the loop are likely to be useful in pruning.
	 */
skip_inf_loop_check:
	if (!env->test_state_freq &&
	    env->jmps_processed - env->prev_jmps_processed < 20 &&
	    env->insn_processed - env->prev_insn_processed < 100)
		add_new_state = false;
	goto miss;
	...

This runs into a direct contradiction with what I do in this patch,
so either I need to change the patch or this fragment needs adjustment.

[...]
On Wed, Oct 9, 2024 at 12:41 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Tue, 2024-10-08 at 19:12 -0700, Eduard Zingerman wrote:

[...]

> There is the following code in is_state_visited():
>
[...]
>
> This runs into a direct contradiction with what I do in this patch,
> so either I need to change the patch or this fragment needs adjustment.

Yeah. It's not saving all states exactly to avoid:

  loop3.bpf.o  while_true  +350359 (+106.88%)  +0 (+0.00%)  +101448 (+1049.86%)

100k states is a lot of memory. It might cause OOMs.

The pyperf600 improvement is certainly nice though.

It feels like it's better to adjust the 20 && 100 heuristic above instead
of forcing a state save everywhere.

Something should be done about:
  71.25%  [k] __mark_chain_precision
  24.81%  [k] bt_sync_linked_regs
as well. The algorithm there needs some tweaks.

pw-bot: cr
On Wed, Oct 9, 2024 at 6:09 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Oct 9, 2024 at 12:41 PM Eduard Zingerman <eddyz87@gmail.com> wrote:

[...]

> Something should be done about:
>   71.25%  [k] __mark_chain_precision
>   24.81%  [k] bt_sync_linked_regs
> as well.
> The algorithm there needs some tweaks.

If we were to store a bpf_jmp_history_entry for each instruction (and we
can do that efficiently, memory-wise, I had the patch), and then for each
instruction we maintained a list of "input" regs/slots and corresponding
"output" regs/slots as we simulate each instruction forward, I think
__mark_chain_precision would be much simpler and thus faster. We'd
basically just walk backwards instruction by instruction, check if any of
the output regs/slots need to be precise (a few bitmask intersections),
and if yes, set all input regs/slots as "need precision", and just
continue the walk.

I think it's actually a simpler approach and should be faster. Simpler
because it's easy to tell inputs/outputs while doing forward instruction
processing. Faster because __mark_chain_precision would only do a very
simple operation without lots of branching and checks.

We are already halfway there, tbh, with the linked registers and
insn_stack_access_flags() (i.e., information that we couldn't re-derive
in the backwards pass). So maybe we should bite the bullet and go all in?

P.S. And just in general, there is a certain appeal to always having a
complete history up to any current point in time. Feels like this would
be useful in the future for some extra optimizations or checks.

> pw-bot: cr
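For illustration only: the sketch below is not code from the thread, and
the structure and function names in it are assumptions. It shows the shape
of the idea above, recording input/output register masks for each simulated
instruction and propagating precision requirements with a single backward
walk over those masks. Stack slots would need the same treatment as
registers; only registers are tracked here to keep the sketch short.

    #include <stdint.h>

    /* One entry per simulated instruction: bitmasks of registers it reads
     * (inputs) and writes (outputs). Names are illustrative, not the
     * kernel's actual data structures.
     */
    struct insn_hist_entry {
    	uint32_t idx;       /* instruction index */
    	uint16_t in_regs;   /* registers this insn reads */
    	uint16_t out_regs;  /* registers this insn writes */
    };

    /* Walk the recorded history backwards: whenever an instruction produced
     * a register that must be precise, the requirement moves to its inputs.
     * Returns the registers whose precision must be requested from the
     * parent (checkpoint) state.
     */
    static uint16_t propagate_precision(const struct insn_hist_entry *hist,
    					int cnt, uint16_t need_precise)
    {
    	for (int i = cnt - 1; i >= 0 && need_precise; i--) {
    		if (hist[i].out_regs & need_precise) {
    			need_precise &= ~hist[i].out_regs;
    			need_precise |= hist[i].in_regs;
    		}
    	}
    	return need_precise;
    }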
On Thu, 2024-10-10 at 15:13 -0700, Andrii Nakryiko wrote:
> On Wed, Oct 9, 2024 at 6:09 PM Alexei Starovoitov

[...]

> > Something should be done about:
> >   71.25%  [k] __mark_chain_precision
> >   24.81%  [k] bt_sync_linked_regs
> > as well. The algorithm there needs some tweaks.
>
> If we were to store a bpf_jmp_history_entry for each instruction (and we
> can do that efficiently, memory-wise, I had the patch), and then for each
> instruction we maintained a list of "input" regs/slots and corresponding
> "output" regs/slots as we simulate each instruction forward, I think
> __mark_chain_precision would be much simpler and thus faster. We'd
> basically just walk backwards instruction by instruction, check if any of
> the output regs/slots need to be precise (a few bitmask intersections),
> and if yes, set all input regs/slots as "need precision", and just
> continue the walk.
>
> [...]

I think this would bring a significant speedup.
Not sure it would completely fix the issue at hand,
as mark_chain_precision() walks like 100 instructions back on each
iteration of the loop, but it might be a step in the right direction.

Do you mind if I refresh your old patches for jump history,
or do you want to work on this yourself?

[...]
On Thu, 2024-10-10 at 15:40 -0700, Eduard Zingerman wrote:

> I think this would bring a significant speedup.
> Not sure it would completely fix the issue at hand,
> as mark_chain_precision() walks like 100 instructions back on each
> iteration of the loop, but it might be a step in the right direction.

In theory, we can do mark_chain_precision() lazily:
- mark that certain registers need precision for an instruction;
- wait till next checkpoint is created;
- do the walk for marked registers.

This should be simpler to implement on top of what Andrii suggests.

[...]
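Again for illustration only (the names below are made up, not from the
thread): the lazy scheme above could be structured so that precision
requests are merely recorded when they occur, and one combined backward
walk is done when the next checkpoint is about to be created. The
hoped-for benefit is that requests raised within the same stretch of the
loop share one walk over the same history instead of each paying for it
separately.

    #include <stdint.h>

    /* Hypothetical illustration of lazy precision marking. */
    struct lazy_precision {
    	uint16_t pending_regs;	/* registers marked since the last flush */
    };

    /* Instead of walking the history immediately, only remember the request. */
    static void mark_reg_precise_lazy(struct lazy_precision *lp, int regno)
    {
    	lp->pending_regs |= (uint16_t)(1u << regno);
    }

    /* Called right before a new checkpoint state is pushed: one backward
     * walk covers everything accumulated since the previous checkpoint.
     */
    static void flush_precision_marks(struct lazy_precision *lp,
    				      void (*backward_walk)(uint16_t regs))
    {
    	if (lp->pending_regs) {
    		backward_walk(lp->pending_regs);
    		lp->pending_regs = 0;
    	}
    }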
On Thu, Oct 10, 2024 at 3:52 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Thu, 2024-10-10 at 15:40 -0700, Eduard Zingerman wrote:

[...]

> In theory, we can do mark_chain_precision() lazily:
> - mark that certain registers need precision for an instruction;
> - wait till next checkpoint is created;
> - do the walk for marked registers.
>
> This should be simpler to implement on top of what Andrii suggests.

Indeed it can wait until the next new_state, since when the next
iteration of the loop starts the sl->state are guaranteed to be some
older states. Though the verifier may have done multiple loop iteration
traversals, it's still in the 'cur' state, so
states_equal(env, &sl->state, cur...) should still be correct.
On Thu, Oct 10, 2024 at 3:40 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Thu, 2024-10-10 at 15:13 -0700, Andrii Nakryiko wrote:

[...]

> I think this would bring a significant speedup.
> Not sure it would completely fix the issue at hand,
> as mark_chain_precision() walks like 100 instructions back on each
> iteration of the loop, but it might be a step in the right direction.
>
> Do you mind if I refresh your old patches for jump history,
> or do you want to work on this yourself?

No, I don't mind at all. Go ahead.

> [...]
On Thu, 2024-10-10 at 16:23 -0700, Alexei Starovoitov wrote:
> On Thu, Oct 10, 2024 at 3:52 PM Eduard Zingerman <eddyz87@gmail.com> wrote:

[...]

> > In theory, we can do mark_chain_precision() lazily:
> > - mark that certain registers need precision for an instruction;
> > - wait till next checkpoint is created;
> > - do the walk for marked registers.
> >
> > This should be simpler to implement on top of what Andrii suggests.
>
> Indeed it can wait until the next new_state, since when the next
> iteration of the loop starts the sl->state are guaranteed to be some
> older states. Though the verifier may have done multiple loop iteration
> traversals, it's still in the 'cur' state, so
> states_equal(env, &sl->state, cur...) should still be correct.

After spending an absolutely unreasonable amount of time on this
I finally identified the missing piece of the puzzle.

The example program takes so much time to verify because it accumulates
a huge jmp_history array (262145 entries) within a single state and
constantly executes mark_chain_precision() against it.

The reason for such a huge jmp_history is an interplay between two conditions:

	if (env->jmps_processed - env->prev_jmps_processed >= 2 &&
	    env->insn_processed - env->prev_insn_processed >= 8)
		add_new_state = true;

and

	if (!force_new_state &&
	    env->jmps_processed - env->prev_jmps_processed < 20 &&
	    env->insn_processed - env->prev_insn_processed < 100)
		add_new_state = false;

I try to explain it in the commit message for [1].

If one waits long enough, the verifier eventually finishes with -ENOMEM,
as it fails to allocate a big enough jmp_history array (one has to wait
15 minutes on master, and about a minute with my patch [2]; the patch has
not changed since our discussion on Wednesday, and the error message is
the same as we discussed then).

The discussed change to lazy precision mark propagation turned out to be a dud:
- it does not solve the issue at hand, the verifier still exits with
  -ENOMEM, just faster;
- the overall verification speedup for regular programs is not big:

$ ./veristat -e file,prog,duration -f 'duration>100000' -C master-baseline.log current.log
File                      Program           Duration (us) (A)  Duration (us) (B)  Duration (us) (DIFF)
------------------------  ----------------  -----------------  -----------------  --------------------
loop1.bpf.o               nested_loops                 124334             117382        -6952 (-5.59%)
loop3.bpf.o               while_true                   318165             253264      -64901 (-20.40%)
pyperf100.bpf.o           on_event                     704852             692300       -12552 (-1.78%)
pyperf180.bpf.o           on_event                    2304115            2251502       -52613 (-2.28%)
pyperf50.bpf.o            on_event                     183971             184124         +153 (+0.08%)
pyperf600.bpf.o           on_event                    2076073            2051733       -24340 (-1.17%)
pyperf600_nounroll.bpf.o  on_event                     394093             402552        +8459 (+2.15%)
strobemeta.bpf.o          on_event                     169218             169835         +617 (+0.36%)
test_verif_scale1.bpf.o   balancer_ingress             144779             146486        +1707 (+1.18%)
test_verif_scale2.bpf.o   balancer_ingress             147078             151902        +4824 (+3.28%)
test_verif_scale3.bpf.o   balancer_ingress             184959             188723        +3764 (+2.04%)

Here only loop3 demonstrates a significant speedup (I re-ran loop3
several times), but it does not warrant the code change in my opinion.
[1] https://lore.kernel.org/bpf/20241018020307.1766906-1-eddyz87@gmail.com/
[2] https://github.com/eddyz87/bpf/tree/lazy-mark-chain-precision
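A toy model (not verifier code; the loop count is only chosen to match the
order of magnitude reported in the message above) of why a jump history
that keeps growing within a single state is expensive: if each precision
walk scans everything accumulated since the last checkpoint, the total
number of history entries visited grows roughly quadratically with the
number of iterations verified between checkpoints.

    #include <stdio.h>

    int main(void)
    {
    	long long history = 0, total_walked = 0;
    	const int iterations = 262144;	/* order of the jmp_history size seen above */

    	for (int i = 0; i < iterations; i++) {
    		history += 1;			/* history grows every iteration      */
    		total_walked += history;	/* one backward walk over all of it   */
    	}
    	printf("history=%lld entries, total entries walked=%lld\n",
    	       history, total_walked);
    	return 0;
    }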
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 7d9b38ffd220..3fa517b3fb57 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -16055,9 +16055,11 @@ static int push_insn(int t, int w, int e, struct bpf_verifier_env *env)
 	int *insn_state = env->cfg.insn_state;
 
 	if (e == FALLTHROUGH && insn_state[t] >= (DISCOVERED | FALLTHROUGH))
+		/* FALLTHROUGH from t already processed */
 		return DONE_EXPLORING;
 
 	if (e == BRANCH && insn_state[t] >= (DISCOVERED | BRANCH))
+		/* BRANCH from t already processed */
 		return DONE_EXPLORING;
 
 	if (w < 0 || w >= env->prog->len) {
@@ -16081,6 +16083,8 @@ static int push_insn(int t, int w, int e, struct bpf_verifier_env *env)
 		insn_stack[env->cfg.cur_stack++] = w;
 		return KEEP_EXPLORING;
 	} else if ((insn_state[w] & 0xF0) == DISCOVERED) {
+		/* back-edge */
+		mark_force_checkpoint(env, w);
 		if (env->bpf_capable)
 			return DONE_EXPLORING;
 		verbose_linfo(env, t, "%d: ", t);
diff --git a/tools/testing/selftests/bpf/progs/verifier_precision.c b/tools/testing/selftests/bpf/progs/verifier_precision.c
index 6b564d4c0986..7215c0cc0ccb 100644
--- a/tools/testing/selftests/bpf/progs/verifier_precision.c
+++ b/tools/testing/selftests/bpf/progs/verifier_precision.c
@@ -95,34 +95,37 @@ __naked int bpf_end_bswap(void)
 SEC("?raw_tp")
 __success __log_level(2)
 /*
- * Without the bug fix there will be no history between "last_idx 3 first_idx 3"
+ * Without the bug fix there will be no history between "last_idx 1 first_idx 1"
  * and "parent state regs=" lines. "R0_w=6" parts are here to help anchor
  * expected log messages to the one specific mark_chain_precision operation.
  *
  * This is quite fragile: if verifier checkpointing heuristic changes, this
  * might need adjusting.
  */
-__msg("2: (07) r0 += 1 ; R0_w=6")
-__msg("3: (35) if r0 >= 0xa goto pc+1")
-__msg("mark_precise: frame0: last_idx 3 first_idx 3 subseq_idx -1")
-__msg("mark_precise: frame0: regs=r0 stack= before 2: (07) r0 += 1")
-__msg("mark_precise: frame0: regs=r0 stack= before 1: (07) r0 += 1")
+__msg("3: (07) r0 += 1 ; R0_w=6")
+__msg("4: (05) goto pc-4")
+__msg("1: (35) if r0 >= 0xa goto pc+3")
+__msg("mark_precise: frame0: last_idx 1 first_idx 1 subseq_idx -1")
+__msg("mark_precise: frame0: parent state regs=r0 stack=: R0_rw=P6 R1=ctx() R10=fp0")
+__msg("mark_precise: frame0: last_idx 4 first_idx 1 subseq_idx 1")
 __msg("mark_precise: frame0: regs=r0 stack= before 4: (05) goto pc-4")
-__msg("mark_precise: frame0: regs=r0 stack= before 3: (35) if r0 >= 0xa goto pc+1")
-__msg("mark_precise: frame0: parent state regs= stack=: R0_rw=P4")
-__msg("3: R0_w=6")
+__msg("mark_precise: frame0: regs=r0 stack= before 3: (07) r0 += 1")
+__msg("mark_precise: frame0: regs=r0 stack= before 2: (07) r0 += 1")
+__msg("mark_precise: frame0: regs=r0 stack= before 1: (35) if r0 >= 0xa goto pc+3")
+__msg("mark_precise: frame0: parent state regs= stack=: R0_rw=P4 R1=ctx() R10=fp0")
+__msg("1: R0=6")
 __naked int state_loop_first_last_equal(void)
 {
 	asm volatile (
 		"r0 = 0;"
 	"l0_%=:"
-		"r0 += 1;"
-		"r0 += 1;"
 		/* every few iterations we'll have a checkpoint here with
 		 * first_idx == last_idx, potentially confusing precision
 		 * backtracking logic
 		 */
 		"if r0 >= 10 goto l1_%=;"	/* checkpoint + mark_precise */
+		"r0 += 1;"
+		"r0 += 1;"
 		"goto l0_%=;"
 	"l1_%=:"
 		"exit;"
In [1] syzbot reported an interesting BPF program.
Verification for this program takes a very long time.

[1] https://lore.kernel.org/bpf/670429f6.050a0220.49194.0517.GAE@google.com/

The program could be simplified to the following snippet:

    /* Program type is kprobe */
       r7 = *(u16 *)(r1 +0);
    1: r7 += 0x1ab064b9;
       if r7 & 0x702000 goto 1b;
       r7 &= 0x1ee60e;
       r7 += r1;
       if r7 s> 0x37d2 goto +0;
       r0 = 0;
       exit;

The snippet exhibits the following behaviour depending on
BPF_COMPLEXITY_LIMIT_INSNS:
- at 1,000,000 verification does not finish in 15 minutes;
- at 100,000 verification finishes in 15 seconds;
- at 100 it is possible to get some verifier log.

Perf report collected for verification of the snippet:

    99.34%  veristat
    99.34%  [kernel.kallsyms]
    71.25%  [k] __mark_chain_precision
    24.81%  [k] bt_sync_linked_regs
    ...

Fragment of the log collected with the instruction limit set to 100:

    1: (07) r7 += 447767737 ; R7_w=scalar(...)
    2: (45) if r7 & 0x702000 goto pc-2
    mark_precise: frame0: last_idx 2 first_idx 2 subseq_idx -1
    mark_precise: frame0: regs=r7 stack= before 1: (07) r7 += 447767737
    mark_precise: frame0: regs=r7 stack= before 2: (45) if r7 & 0x702000 goto pc-2
    ... repeats 25 times ...
    mark_precise: frame0: regs=r7 stack= before 1: (07) r7 += 447767737
    mark_precise: frame0: regs=r7 stack= before 2: (45) if r7 & 0x702000 goto pc-2
    mark_precise: frame0: parent state regs= stack=: R1_r=ctx() R7_rw=Pscalar(...) R10=fp0
    2: R7_w=scalar(...)
    1: (07) r7 += 447767737

This shows very long backtracking walks done by mark_chain_precision()
inside of the very small loop. Such backtracking walks terminate at
checkpoint state boundaries, but checkpoints are introduced too
infrequently for this loop.

This patch forcibly enables checkpoints for each loop back-edge.
This helps with the programs in question, as verification of both
the syzbot program and the reduced snippet finishes in ~2.5 sec.

However, this comes with some cost. Here are veristat results for
selftests listed in veristat.cfg and Cilium BPF programs from [2].
[2] https://github.com/anakryiko/cilium

$ ./veristat -e file,prog,duration,insns,states -f 'states_pct>5' \
      -C master-baseline.log this-patch.log
File                        Program                               Duration (us) (DIFF)  Insns (DIFF)       States (DIFF)
--------------------------  ------------------------------------  --------------------  -----------------  -------------------
loop1.bpf.o                 nested_loops                              +52014 (+40.34%)        +0 (+0.00%)    +39649 (+720.37%)
loop2.bpf.o                 while_true                                +1308 (+161.48%)   +2861 (+160.46%)     +316 (+554.39%)
loop3.bpf.o                 while_true                              +350359 (+106.88%)        +0 (+0.00%)  +101448 (+1049.86%)
loop4.bpf.o                 combinations                                 -88 (-34.78%)      -221 (-42.18%)       +13 (+72.22%)
loop5.bpf.o                 while_true                                    +5 (+10.64%)         +3 (+3.57%)        +2 (+22.22%)
loop6.bpf.o                 trace_virtqueue_add_sgs                   -1886 (-34.01%)     -3717 (-37.38%)        +18 (+8.61%)
netif_receive_skb.bpf.o     trace_netif_receive_skb                       +5 (+0.04%)         +0 (+0.00%)     +562 (+111.29%)
profiler2.bpf.o             kprobe__vfs_link                          -1132 (-12.50%)      -1142 (-8.41%)      +158 (+53.20%)
profiler2.bpf.o             kprobe__vfs_symlink                       -1059 (-17.39%)     -1173 (-13.63%)       +77 (+37.75%)
profiler2.bpf.o             kprobe_ret__do_filp_open                    -465 (-8.40%)       -832 (-9.91%)       +84 (+40.00%)
profiler2.bpf.o             raw_tracepoint__sched_process_exit          -524 (-29.91%)     -1050 (-33.89%)        -5 (-6.02%)
profiler2.bpf.o             tracepoint__syscalls__sys_enter_kill       -4977 (-22.00%)     -4585 (-18.96%)        +70 (+6.75%)
pyperf600.bpf.o             on_event                                  +16312 (+0.81%)     -53870 (-9.81%)     -3000 (-10.16%)
strobemeta_nounroll1.bpf.o  on_event                                 -16609 (-40.28%)    -20944 (-38.87%)      -294 (-18.62%)
strobemeta_nounroll2.bpf.o  on_event                                 -22022 (-23.60%)    -40646 (-34.23%)      -578 (-15.41%)
strobemeta_subprogs.bpf.o   on_event                                 -17414 (-33.30%)    -19758 (-36.47%)      -185 (-12.33%)
test_sysctl_loop1.bpf.o     sysctl_tcp_mem                                +76 (+8.05%)        +60 (+4.33%)      +71 (+273.08%)
test_sysctl_loop2.bpf.o     sysctl_tcp_mem                               +156 (+13.38%)       +77 (+5.13%)      +82 (+282.76%)
xdp_synproxy_kern.bpf.o     syncookie_tc                               +8711 (+52.72%)     +3610 (+26.23%)     +209 (+53.87%)
xdp_synproxy_kern.bpf.o     syncookie_xdp                              +8571 (+50.01%)     +3610 (+26.10%)     +209 (+53.73%)

The test case 'state_loop_first_last_equal' needs an update because
checkpoint placement for it changed (there is now a forced checkpoint
at 'l0'). The goal of the test is to check that there is a jump history
entry with the same last_idx/first_idx pair. Update the test by moving
the 'if' instruction to 'l0', so that the total number of checkpoints in
the test remains the same (but now the interesting last_idx/first_idx
pair is 1/1 instead of 4/4).

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
---
 kernel/bpf/verifier.c                         |  4 +++
 .../selftests/bpf/progs/verifier_precision.c  | 25 +++++++++++--------
 2 files changed, 18 insertions(+), 11 deletions(-)
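The diff above calls mark_force_checkpoint(), whose definition is not part
of this excerpt. Below is a minimal sketch of the mechanism only, under
the assumption that the helper tags the back-edge target instruction in a
per-instruction aux array and that is_state_visited() later forces a new
state there; the type and field names are stand-ins, not the actual patch.

    #include <stdbool.h>

    /* Stand-in types; the real ones live in include/linux/bpf_verifier.h. */
    struct insn_aux {
    	bool force_checkpoint;			/* assumed flag name */
    };

    struct verifier_env {
    	struct insn_aux *insn_aux_data;		/* one entry per instruction */
    };

    /* Called from push_insn() when the CFG walk finds a back-edge to
     * instruction 'w': remember that 'w' starts a loop so that a
     * checkpoint is forced there regardless of the usual heuristics.
     */
    static void mark_force_checkpoint(struct verifier_env *env, int w)
    {
    	env->insn_aux_data[w].force_checkpoint = true;
    }

    /* In is_state_visited() the flag would then feed force_new_state, e.g.:
     *
     *	force_new_state |= env->insn_aux_data[env->insn_idx].force_checkpoint;
     */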