
[bpf-next,v1,1/2] bpf: force checkpoints at loop back-edges

Message ID 20241009021254.2805446-1-eddyz87@gmail.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Series [bpf-next,v1,1/2] bpf: force checkpoints at loop back-edges

Checks

Context Check Description
bpf/vmtest-bpf-next-PR success PR summary
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-2 success Logs for Unittests
bpf/vmtest-bpf-next-VM_Test-5 success Logs for aarch64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-3 success Logs for Validate matrix.py
bpf/vmtest-bpf-next-VM_Test-0 success Logs for Lint
bpf/vmtest-bpf-next-VM_Test-4 success Logs for aarch64-gcc / build / build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for aarch64-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-11 success Logs for s390x-gcc / build / build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-12 success Logs for s390x-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-16 success Logs for s390x-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-18 success Logs for x86_64-gcc / build / build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-19 success Logs for x86_64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-33 success Logs for x86_64-llvm-17 / veristat
bpf/vmtest-bpf-next-VM_Test-27 success Logs for x86_64-llvm-17 / build / build for x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-35 success Logs for x86_64-llvm-18 / build-release / build for x86_64 with llvm-18-O2
bpf/vmtest-bpf-next-VM_Test-34 success Logs for x86_64-llvm-18 / build / build for x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-28 success Logs for x86_64-llvm-17 / build-release / build for x86_64 with llvm-17-O2
bpf/vmtest-bpf-next-VM_Test-41 success Logs for x86_64-llvm-18 / veristat
bpf/vmtest-bpf-next-VM_Test-6 success Logs for aarch64-gcc / test (test_maps, false, 360) / test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-7 success Logs for aarch64-gcc / test (test_progs, false, 360) / test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-8 success Logs for aarch64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-9 success Logs for aarch64-gcc / test (test_verifier, false, 360) / test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-15 success Logs for s390x-gcc / test (test_verifier, false, 360) / test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-21 success Logs for x86_64-gcc / test (test_progs, false, 360) / test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for x86_64-gcc / test (test_progs_no_alu32_parallel, true, 30) / test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-20 success Logs for x86_64-gcc / test (test_maps, false, 360) / test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-25 success Logs for x86_64-gcc / test (test_verifier, false, 360) / test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-26 success Logs for x86_64-gcc / veristat / veristat on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-22 success Logs for x86_64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-24 success Logs for x86_64-gcc / test (test_progs_parallel, true, 30) / test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-31 success Logs for x86_64-llvm-17 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-30 success Logs for x86_64-llvm-17 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-29 success Logs for x86_64-llvm-17 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-36 success Logs for x86_64-llvm-18 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-32 success Logs for x86_64-llvm-17 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-37 success Logs for x86_64-llvm-18 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-38 success Logs for x86_64-llvm-18 / test (test_progs_cpuv4, false, 360) / test_progs_cpuv4 on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-40 success Logs for x86_64-llvm-18 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-39 success Logs for x86_64-llvm-18 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-14 success Logs for s390x-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-13 success Logs for s390x-gcc / test (test_progs, false, 360) / test_progs on s390x with gcc
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 6 this patch: 6
netdev/build_tools success Errors and warnings before: 0 (+1) this patch: 0 (+1)
netdev/cc_maintainers warning 13 maintainers not CCed: song@kernel.org shuah@kernel.org haoluo@google.com mykolal@fb.com shung-hsi.yu@suse.com netdev@vger.kernel.org kuba@kernel.org hawk@kernel.org john.fastabend@gmail.com sdf@fomichev.me kpsingh@kernel.org linux-kselftest@vger.kernel.org jolsa@kernel.org
netdev/build_clang success Errors and warnings before: 6 this patch: 6
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 15 this patch: 15
netdev/checkpatch warning WARNING: Prefer 'fallthrough;' over fallthrough comment WARNING: line length of 83 exceeds 80 columns WARNING: line length of 84 exceeds 80 columns WARNING: line length of 86 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Eduard Zingerman Oct. 9, 2024, 2:12 a.m. UTC
In [1] syzbot reported an interesting BPF program.
Verification for this program takes a very long time.

[1] https://lore.kernel.org/bpf/670429f6.050a0220.49194.0517.GAE@google.com/

The program could be simplified to the following snippet:

    /* Program type is kprobe */
       r7 = *(u16 *)(r1 +0);
    1: r7 += 0x1ab064b9;
       if r7 & 0x702000 goto 1b;
       r7 &= 0x1ee60e;
       r7 += r1;
       if r7 s> 0x37d2 goto +0;
       r0 = 0;
       exit;

The snippet exhibits the following behaviour depending on
BPF_COMPLEXITY_LIMIT_INSNS:
- at 1,000,000 verification does not finish in 15 minutes;
- at 100,000 verification finishes in 15 seconds;
- at 100 it is possible to get some verifier log.

Perf report collected for verification of the snippet:

    99.34%        veristat
       99.34%        [kernel.kallsyms]
          71.25%        [k] __mark_chain_precision
          24.81%        [k] bt_sync_linked_regs
          ...

Fragment of the log collected with the instruction limit set to 100:

    1: (07) r7 += 447767737               ; R7_w=scalar(...)
    2: (45) if r7 & 0x702000 goto pc-2
    mark_precise: frame0: last_idx 2 first_idx 2 subseq_idx -1
    mark_precise: frame0: regs=r7 stack= before 1: (07) r7 += 447767737
    mark_precise: frame0: regs=r7 stack= before 2: (45) if r7 & 0x702000 goto pc-2
      ... repeats 25 times ...
    mark_precise: frame0: regs=r7 stack= before 1: (07) r7 += 447767737
    mark_precise: frame0: regs=r7 stack= before 2: (45) if r7 & 0x702000 goto pc-2
    mark_precise: frame0: parent state regs= stack=:  R1_r=ctx() R7_rw=Pscalar(...) R10=fp0
    2: R7_w=scalar(...)
    1: (07) r7 += 447767737

This shows very long backtracking walks done by mark_chain_precision()
inside the very small loop. Such backtracking walks terminate at
checkpoint state boundaries, but checkpoints are introduced too
infrequently for this loop.
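
To make the cost concrete, below is a toy userspace model (not verifier
code) of how many history entries the backtracking walks touch in
total, depending on how often a checkpoint truncates them:

    #include <stdio.h>

    /* Toy model: every loop iteration appends 2 history entries and
     * triggers one precision walk back to the nearest checkpoint.
     * Rare checkpoints make each walk re-traverse the accumulated
     * history, so total work scales with the checkpoint interval.
     */
    static long total_walked(int iters, int checkpoint_every)
    {
            long walked = 0, since_checkpoint = 0;
            int i;

            for (i = 0; i < iters; i++) {
                    since_checkpoint += 2;      /* history keeps growing */
                    walked += since_checkpoint; /* one walk per iteration */
                    if ((i + 1) % checkpoint_every == 0)
                            since_checkpoint = 0; /* checkpoint truncates walks */
            }
            return walked;
    }

    int main(void)
    {
            printf("checkpoint every iteration: %ld\n", total_walked(100000, 1));
            printf("checkpoint every 50 iters:  %ld\n", total_walked(100000, 50));
            return 0;
    }

For 100k iterations the model walks ~200k entries when every iteration
is a checkpoint vs ~5.1M entries with a checkpoint every 50 iterations.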

This patch forcibly enables checkpoints for each loop back-edge.
This helps with the programs in question, as verification of both
the syzbot program and the reduced snippet finishes in ~2.5 sec.
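
The back-edge marking reuses the force-checkpoint helper that
open-coded iterators already rely on; paraphrased from
kernel/bpf/verifier.c, it tags the instruction so that
is_state_visited() pushes a new checkpoint state there:

    static void mark_force_checkpoint(struct bpf_verifier_env *env, int idx)
    {
            env->insn_aux_data[idx].force_checkpoint = true;
    }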

However, this comes with some cost. Here are veristat results for
selftests listed in veristat.cfg and Cilium BPF programs from [2].

[2] https://github.com/anakryiko/cilium

$ ./veristat -e file,prog,duration,insns,states -f 'states_pct>5' \
  -C master-baseline.log this-patch.log
File                        Program                               Duration (us) (DIFF)  Insns     (DIFF)  States       (DIFF)
--------------------------  ------------------------------------  --------------------  ----------------  -------------------
loop1.bpf.o                 nested_loops                              +52014 (+40.34%)       +0 (+0.00%)    +39649 (+720.37%)
loop2.bpf.o                 while_true                                +1308 (+161.48%)  +2861 (+160.46%)      +316 (+554.39%)
loop3.bpf.o                 while_true                              +350359 (+106.88%)       +0 (+0.00%)  +101448 (+1049.86%)
loop4.bpf.o                 combinations                                 -88 (-34.78%)    -221 (-42.18%)        +13 (+72.22%)
loop5.bpf.o                 while_true                                    +5 (+10.64%)       +3 (+3.57%)         +2 (+22.22%)
loop6.bpf.o                 trace_virtqueue_add_sgs                    -1886 (-34.01%)   -3717 (-37.38%)         +18 (+8.61%)
netif_receive_skb.bpf.o     trace_netif_receive_skb                        +5 (+0.04%)       +0 (+0.00%)      +562 (+111.29%)
profiler2.bpf.o             kprobe__vfs_link                           -1132 (-12.50%)    -1142 (-8.41%)       +158 (+53.20%)
profiler2.bpf.o             kprobe__vfs_symlink                        -1059 (-17.39%)   -1173 (-13.63%)        +77 (+37.75%)
profiler2.bpf.o             kprobe_ret__do_filp_open                     -465 (-8.40%)     -832 (-9.91%)        +84 (+40.00%)
profiler2.bpf.o             raw_tracepoint__sched_process_exit          -524 (-29.91%)   -1050 (-33.89%)          -5 (-6.02%)
profiler2.bpf.o             tracepoint__syscalls__sys_enter_kill       -4977 (-22.00%)   -4585 (-18.96%)         +70 (+6.75%)
pyperf600.bpf.o             on_event                                   +16312 (+0.81%)   -53870 (-9.81%)      -3000 (-10.16%)
strobemeta_nounroll1.bpf.o  on_event                                  -16609 (-40.28%)  -20944 (-38.87%)       -294 (-18.62%)
strobemeta_nounroll2.bpf.o  on_event                                  -22022 (-23.60%)  -40646 (-34.23%)       -578 (-15.41%)
strobemeta_subprogs.bpf.o   on_event                                  -17414 (-33.30%)  -19758 (-36.47%)       -185 (-12.33%)
test_sysctl_loop1.bpf.o     sysctl_tcp_mem                                +76 (+8.05%)      +60 (+4.33%)       +71 (+273.08%)
test_sysctl_loop2.bpf.o     sysctl_tcp_mem                              +156 (+13.38%)      +77 (+5.13%)       +82 (+282.76%)
xdp_synproxy_kern.bpf.o     syncookie_tc                               +8711 (+52.72%)   +3610 (+26.23%)       +209 (+53.87%)
xdp_synproxy_kern.bpf.o     syncookie_xdp                              +8571 (+50.01%)   +3610 (+26.10%)       +209 (+53.73%)

The test case 'state_loop_first_last_equal' needs an update because
checkpoint placement for it has changed (there is now a forced
checkpoint at 'l0'). The goal of the test is to check that there is a
jump history entry with the same last_idx/first_idx pair.
Update the test by moving the 'if' instruction to 'l0',
so that the total number of checkpoints in the test remains the same.
(But the interesting last_idx/first_idx pair is now 1/1 instead of 4/4.)

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
---
 kernel/bpf/verifier.c                         |  4 +++
 .../selftests/bpf/progs/verifier_precision.c  | 25 +++++++++++--------
 2 files changed, 18 insertions(+), 11 deletions(-)

Comments

Eduard Zingerman Oct. 9, 2024, 9:23 a.m. UTC | #1
On Tue, 2024-10-08 at 19:12 -0700, Eduard Zingerman wrote:

[...]

> This patch forcibly enables checkpoints for each loop back-edge.
> This helps with the programs in question, as verification of both
> the syzbot program and the reduced snippet finishes in ~2.5 sec.

[...]

> File                        Program                               Insns     (DIFF)  States       (DIFF)
> --------------------------  ------------------------------------  ----------------  -------------------
> ...
> pyperf600.bpf.o             on_event                               -53870 (-9.81%)      -3000 (-10.16%)
> ...

fwiw, this patch speeds up verification of pyperf600 by a small margin,
sufficient for it to pass in combination with the jump history bump when LLVM 20 is used.

# ./veristat pyperf600.bpf.o no_alu32/pyperf600.bpf.o cpuv4/pyperf600.bpf.o
...
File             Program   Verdict  Duration (us)   Insns  States  Peak states
---------------  --------  -------  -------------  ------  ------  -----------
pyperf600.bpf.o  on_event  success        2400571  872914   26490        25480
pyperf600.bpf.o  on_event  success        2460908  947038   26090        25330
pyperf600.bpf.o  on_event  success        2158117  788481   26329        25368
---------------  --------  -------  -------------  ------  ------  -----------

W/o this patch the jump history bump is not sufficient to get the no_alu32 version verified;
the instruction limit is reached.

The draft is here:
https://github.com/eddyz87/bpf/tree/jmp-history-pyperf600

[...]
Eduard Zingerman Oct. 9, 2024, 7:41 p.m. UTC | #2
On Tue, 2024-10-08 at 19:12 -0700, Eduard Zingerman wrote:
> In [1] syzbot reported an interesting BPF program.
> Verification for this program takes a very long time.
> 
> [1] https://lore.kernel.org/bpf/670429f6.050a0220.49194.0517.GAE@google.com/
> 
> The program could be simplified to the following snippet:
> 
>     /* Program type is kprobe */
>        r7 = *(u16 *)(r1 +0);
>     1: r7 += 0x1ab064b9;
>        if r7 & 0x702000 goto 1b;
>        r7 &= 0x1ee60e;
>        r7 += r1;
>        if r7 s> 0x37d2 goto +0;
>        r0 = 0;
>        exit;

Answering a few questions from an off-list discussion with Alexei.
The test is not specific to the jset instruction; e.g., the following
program exhibits similar behaviour:

SEC("kprobe")
__failure __log_level(4)
__msg("BPF program is too large.")
__naked void short_loop1(void)
{
	asm volatile (
	"   r7 = *(u16 *)(r1 +0);"
	"   r8 = *(u64 *)(r1 +16);"
	"1: r7 += 0x1ab064b9;"
	"if r7 < r8 goto 1b;"
	"   r7 &= 0x1ee60e;"
	"   r7 += r1;"
	"   if r7 s> 0x37d2 goto +0;"
	"   r0 = 0;"
	"   exit;"
	::: __clobber_all);
}

> The snippet exhibits the following behaviour depending on
> BPF_COMPLEXITY_LIMIT_INSNS:
> - at 1,000,000 verification does not finish in 15 minutes;
> - at 100,000 verification finishes in 15 seconds;
> - at 100 it is possible to get some verifier log.

Still investigating why the running time changes non-linearly.

[...]

> This patch forcibly enables checkpoints for each loop back-edge.
> This helps with the programs in question, as verification of both
> the syzbot program and the reduced snippet finishes in ~2.5 sec.

There is the following code in is_state_visited():

			...
			/* if the verifier is processing a loop, avoid adding new state
			 * too often, since different loop iterations have distinct
			 * states and may not help future pruning.
			 * This threshold shouldn't be too low to make sure that
			 * a loop with large bound will be rejected quickly.
			 * The most abusive loop will be:
			 * r1 += 1
			 * if r1 < 1000000 goto pc-2
			 * 1M insn_processed limit / 100 == 10k peak states.
			 * This threshold shouldn't be too high either, since states
			 * at the end of the loop are likely to be useful in pruning.
			 */
skip_inf_loop_check:
			if (!env->test_state_freq &&
			    env->jmps_processed - env->prev_jmps_processed < 20 &&
			    env->insn_processed - env->prev_insn_processed < 100)
				add_new_state = false;
			goto miss;
			...

This runs into a direct contradiction with what I do in this patch,
so either I need to change the patch or this fragment needs adjustment.
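
One way to resolve this would be to fold the existing force_new_state
flag into the guard, so that a forced back-edge checkpoint overrides
the rate limit. A sketch (just moving the flag into the condition):

			if (!force_new_state &&
			    env->jmps_processed - env->prev_jmps_processed < 20 &&
			    env->insn_processed - env->prev_insn_processed < 100)
				add_new_state = false;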

[...]
Alexei Starovoitov Oct. 10, 2024, 1:08 a.m. UTC | #3
On Wed, Oct 9, 2024 at 12:41 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Tue, 2024-10-08 at 19:12 -0700, Eduard Zingerman wrote:
> > In [1] syzbot reported an interesting BPF program.
> > Verification for this program takes a very long time.
> >
> > [1] https://lore.kernel.org/bpf/670429f6.050a0220.49194.0517.GAE@google.com/
> >
> > The program could be simplified to the following snippet:
> >
> >     /* Program type is kprobe */
> >        r7 = *(u16 *)(r1 +0);
> >     1: r7 += 0x1ab064b9;
> >        if r7 & 0x702000 goto 1b;
> >        r7 &= 0x1ee60e;
> >        r7 += r1;
> >        if r7 s> 0x37d2 goto +0;
> >        r0 = 0;
> >        exit;
>
> Answering a few questions from an off-list discussion with Alexei.
> The test is not specific to the jset instruction; e.g., the following
> program exhibits similar behaviour:
>
> SEC("kprobe")
> __failure __log_level(4)
> __msg("BPF program is too large.")
> __naked void short_loop1(void)
> {
>         asm volatile (
>         "   r7 = *(u16 *)(r1 +0);"
>         "   r8 = *(u64 *)(r1 +16);"
>         "1: r7 += 0x1ab064b9;"
>         "if r7 < r8 goto 1b;"
>         "   r7 &= 0x1ee60e;"
>         "   r7 += r1;"
>         "   if r7 s> 0x37d2 goto +0;"
>         "   r0 = 0;"
>         "   exit;"
>         ::: __clobber_all);
> }
>
> > The snippet exhibits the following behaviour depending on
> > BPF_COMPLEXITY_LIMIT_INSNS:
> > - at 1,000,000 verification does not finish in 15 minutes;
> > - at 100,000 verification finishes in 15 seconds;
> > - at 100 it is possible to get some verifier log.
>
> Still investigating why the running time changes non-linearly.
>
> [...]
>
> > This patch forcibly enables checkpoints for each loop back-edge.
> > This helps with the programs in question, as verification of both
> > the syzbot program and the reduced snippet finishes in ~2.5 sec.
>
> There is the following code in is_state_visited():
>
>                         ...
>                         /* if the verifier is processing a loop, avoid adding new state
>                          * too often, since different loop iterations have distinct
>                          * states and may not help future pruning.
>                          * This threshold shouldn't be too low to make sure that
>                          * a loop with large bound will be rejected quickly.
>                          * The most abusive loop will be:
>                          * r1 += 1
>                          * if r1 < 1000000 goto pc-2
>                          * 1M insn_processed limit / 100 == 10k peak states.
>                          * This threshold shouldn't be too high either, since states
>                          * at the end of the loop are likely to be useful in pruning.
>                          */
> skip_inf_loop_check:
>                         if (!env->test_state_freq &&
>                             env->jmps_processed - env->prev_jmps_processed < 20 &&
>                             env->insn_processed - env->prev_insn_processed < 100)
>                                 add_new_state = false;
>                         goto miss;
>                         ...
>
> This runs into a direct contradiction with what I do in this patch,
> so either I need to change the patch or this fragment needs adjustment.

Yeah. It's not saving all states, exactly to avoid:
loop3.bpf.o                 while_true
+350359 (+106.88%)       +0 (+0.00%)  +101448 (+1049.86%)

100k states is a lot of memory. It might cause OOMs.

pyperf600 improvement is certainly nice though.

It feels like it's better to adjust the above 20 && 100 heuristic instead
of forcing state saving everywhere.

Something should be done about:
          71.25%        [k] __mark_chain_precision
          24.81%        [k] bt_sync_linked_regs
as well.
The algorithm there needs some tweaks.

pw-bot: cr
Andrii Nakryiko Oct. 10, 2024, 10:13 p.m. UTC | #4
On Wed, Oct 9, 2024 at 6:09 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Oct 9, 2024 at 12:41 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
> >
> > On Tue, 2024-10-08 at 19:12 -0700, Eduard Zingerman wrote:
> > > In [1] syzbot reported an interesting BPF program.
> > > Verification for this program takes a very long time.
> > >
> > > [1] https://lore.kernel.org/bpf/670429f6.050a0220.49194.0517.GAE@google.com/
> > >
> > > The program could be simplified to the following snippet:
> > >
> > >     /* Program type is kprobe */
> > >        r7 = *(u16 *)(r1 +0);
> > >     1: r7 += 0x1ab064b9;
> > >        if r7 & 0x702000 goto 1b;
> > >        r7 &= 0x1ee60e;
> > >        r7 += r1;
> > >        if r7 s> 0x37d2 goto +0;
> > >        r0 = 0;
> > >        exit;
> >
> > Answering a few questions from an off-list discussion with Alexei.
> > The test is not specific to the jset instruction; e.g., the following
> > program exhibits similar behaviour:
> >
> > SEC("kprobe")
> > __failure __log_level(4)
> > __msg("BPF program is too large.")
> > __naked void short_loop1(void)
> > {
> >         asm volatile (
> >         "   r7 = *(u16 *)(r1 +0);"
> >         "   r8 = *(u64 *)(r1 +16);"
> >         "1: r7 += 0x1ab064b9;"
> >         "if r7 < r8 goto 1b;"
> >         "   r7 &= 0x1ee60e;"
> >         "   r7 += r1;"
> >         "   if r7 s> 0x37d2 goto +0;"
> >         "   r0 = 0;"
> >         "   exit;"
> >         ::: __clobber_all);
> > }
> >
> > > The snippet exhibits the following behaviour depending on
> > > BPF_COMPLEXITY_LIMIT_INSNS:
> > > - at 1,000,000 verification does not finish in 15 minutes;
> > > - at 100,000 verification finishes in 15 seconds;
> > > - at 100 it is possible to get some verifier log.
> >
> > Still investigating why the running time changes non-linearly.
> >
> > [...]
> >
> > > This patch forcibly enables checkpoints for each loop back-edge.
> > > This helps with the programs in question, as verification of both
> > > the syzbot program and the reduced snippet finishes in ~2.5 sec.
> >
> > There is the following code in is_state_visited():
> >
> >                         ...
> >                         /* if the verifier is processing a loop, avoid adding new state
> >                          * too often, since different loop iterations have distinct
> >                          * states and may not help future pruning.
> >                          * This threshold shouldn't be too low to make sure that
> >                          * a loop with large bound will be rejected quickly.
> >                          * The most abusive loop will be:
> >                          * r1 += 1
> >                          * if r1 < 1000000 goto pc-2
> >                          * 1M insn_processed limit / 100 == 10k peak states.
> >                          * This threshold shouldn't be too high either, since states
> >                          * at the end of the loop are likely to be useful in pruning.
> >                          */
> > skip_inf_loop_check:
> >                         if (!env->test_state_freq &&
> >                             env->jmps_processed - env->prev_jmps_processed < 20 &&
> >                             env->insn_processed - env->prev_insn_processed < 100)
> >                                 add_new_state = false;
> >                         goto miss;
> >                         ...
> >
> > This runs into a direct contradiction with what I do in this patch,
> > so either I need to change the patch or this fragment needs adjustment.
>
> Yeah. It's not saving all states, exactly to avoid:
> loop3.bpf.o                 while_true
> +350359 (+106.88%)       +0 (+0.00%)  +101448 (+1049.86%)
>
> 100k states is a lot of memory. It might cause OOMs.
>
> pyperf600 improvement is certainly nice though.
>
> It feels like it's better to adjust the above 20 && 100 heuristic instead
> of forcing state saving everywhere.
>
> Something should be done about:
>           71.25%        [k] __mark_chain_precision
>           24.81%        [k] bt_sync_linked_regs
> as well.
> The algorithm there needs some tweaks.

If we were to store bpf_jmp_history_entry for each instruction (and we
can do that efficiently, memory-wise, I had the patch), and then for
each instruction we maintained a list of "input" regs/slots and
corresponding "output" regs/slots as we simulate each instruction
forward, I think __mark_chain_precision would be much simpler and thus
faster. We'd basically just walk backwards instruction by instruction,
check if any of the output regs/slots need to be precise (few bitmasks
intersection), and if yes, set all input regs/slots as "need
precision", and just continue forward.

I think it's actually a simpler approach and should be faster. Simpler
because it's easy to tell inputs/outputs while doing forward
instruction processing. Faster because __mark_chain_precision would
only do very simple operation without lots of branching and checks.

We are already halfway there, tbh, with the linked registers and
insn_stack_access_flags() (i.e., information that we couldn't
re-derive in the backwards pass). So maybe we should bite the bullet
and go all in?
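
A standalone toy model of this scheme (all names are made up, nothing
kernel-specific):

    #include <stdint.h>
    #include <stdio.h>

    /* Per-instruction record of which registers were read (inputs) and
     * written (outputs) during forward simulation.
     */
    struct hist_entry {
            uint16_t in_mask;
            uint16_t out_mask;
    };

    /* Backwards pass: if an instruction wrote a register that needs to
     * be precise, drop those bits and add the instruction's inputs.
     */
    static uint16_t backtrack(const struct hist_entry *hist, int len,
                              uint16_t need)
    {
            int i;

            for (i = len - 1; i >= 0 && need; i--) {
                    if (hist[i].out_mask & need) {
                            need &= ~hist[i].out_mask;
                            need |= hist[i].in_mask;
                    }
            }
            return need; /* leftover marks go to the parent state */
    }

    int main(void)
    {
            /* r2 = r1; r3 = r2 + r7; then r3 must be precise */
            struct hist_entry hist[] = {
                    { .in_mask = 1 << 1, .out_mask = 1 << 2 },
                    { .in_mask = (1 << 2) | (1 << 7), .out_mask = 1 << 3 },
            };

            printf("left for parent: %#x\n", backtrack(hist, 2, 1 << 3));
            return 0;
    }

The whole pass is a couple of bitmask operations per history entry,
with whatever is left over propagating into the parent state.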

P.S. And just in general, there is a certain appeal to always having a
complete history up to any current point in time. Feels like this
would be useful in the future for some extra optimizations or checks.

>
> pw-bot: cr
Eduard Zingerman Oct. 10, 2024, 10:40 p.m. UTC | #5
On Thu, 2024-10-10 at 15:13 -0700, Andrii Nakryiko wrote:
> On Wed, Oct 9, 2024 at 6:09 PM Alexei Starovoitov

[...]

> > Something should be done about:
> >           71.25%        [k] __mark_chain_precision
> >           24.81%        [k] bt_sync_linked_regs
> > as well.
> > The algorithm there needs some tweaks.
> 
> If we were to store bpf_jmp_history_entry for each instruction (and we
> can do that efficiently, memory-wise, I had the patch), and then for
> each instruction we maintained a list of "input" regs/slots and
> corresponding "output" regs/slots as we simulate each instruction
> forward, I think __mark_chain_precision would be much simpler and thus
> faster. We'd basically just walk backwards instruction by instruction,
> check if any of the output regs/slots need to be precise (few bitmasks
> intersection), and if yes, set all input regs/slots as "need
> precision", and just continue forward.
> 
> I think it's actually a simpler approach and should be faster. Simpler
> because it's easy to tell inputs/outputs while doing forward
> instruction processing. Faster because __mark_chain_precision would
> only do very simple operation without lots of branching and checks.

I think this would bring significant speedup.
Not sure it would completely fix the issue at hand,
as mark_chain_precision() walks like 100 instructions back on each
iteration of the loop, but it might be a step in the right direction.

Do you mind if I refresh your old patches for jump history,
or do you want to work on this yourself?

[...]
Eduard Zingerman Oct. 10, 2024, 10:52 p.m. UTC | #6
On Thu, 2024-10-10 at 15:40 -0700, Eduard Zingerman wrote:

> I think this would bring significant speedup.
> Not sure it would completely fix the issue at hand,
> as mark_chain_precision() walks like 100 instructions back on each
> iteration of the loop, but it might be a step in the right direction.

In theory, we can do mark_chain_precision() lazily:
- mark that certain registers need precision for an instruction;
- wait till next checkpoint is created;
- do the walk for marked registers.

This should be simpler to implement on top of what Andrii suggests.
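
A toy sketch of the batching (userspace, all names are hypothetical):

    #include <stdint.h>
    #include <stdio.h>

    /* Precision requests are only recorded into a bitmask; the expensive
     * backwards walk runs once per checkpoint instead of once per request.
     */
    struct toy_env {
            uint32_t pending; /* registers waiting for a precision walk */
            int walks;        /* how many expensive walks actually ran */
    };

    static void mark_precise_lazy(struct toy_env *env, int regno)
    {
            env->pending |= 1u << regno; /* O(1), no walk here */
    }

    static void on_checkpoint(struct toy_env *env)
    {
            if (env->pending) {
                    env->walks++; /* one combined walk for all pending regs */
                    env->pending = 0;
            }
    }

    int main(void)
    {
            struct toy_env env = { 0 };
            int i;

            /* 100 requests within one checkpoint interval -> one walk */
            for (i = 0; i < 100; i++)
                    mark_precise_lazy(&env, i % 10);
            on_checkpoint(&env);
            printf("walks: %d (vs 100 eager walks)\n", env.walks);
            return 0;
    }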

[...]
Alexei Starovoitov Oct. 10, 2024, 11:23 p.m. UTC | #7
On Thu, Oct 10, 2024 at 3:52 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Thu, 2024-10-10 at 15:40 -0700, Eduard Zingerman wrote:
>
> > I think this would bring significant speedup.
> > Not sure it would completely fix the issue at hand,
> > as mark_chain_precision() walks like 100 instructions back on each
> > iteration of the loop, but it might be a step in the right direction.
>
> In theory, we can do mark_chain_precision() lazily:
> - mark that certain registers need precision for an instruction;
> - wait till next checkpoint is created;
> - do the walk for marked registers.
>
> This should be simpler to implement on top of what Andrii suggests.

Indeed it can wait until the next new_state, since
when the next iteration of the loop starts, the sl->state entries
are guaranteed to be older states. Though the verifier may have done
multiple loop iteration traversals, it's still in the 'cur' state,
so states_equal(env, &sl->state, cur...) should still be correct.
Andrii Nakryiko Oct. 11, 2024, 1:44 a.m. UTC | #8
On Thu, Oct 10, 2024 at 3:40 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> On Thu, 2024-10-10 at 15:13 -0700, Andrii Nakryiko wrote:
> > On Wed, Oct 9, 2024 at 6:09 PM Alexei Starovoitov
>
> [...]
>
> > > Something should be done about:
> > >           71.25%        [k] __mark_chain_precision
> > >           24.81%        [k] bt_sync_linked_regs
> > > as well.
> > > The algorithm there needs some tweaks.
> >
> > If we were to store bpf_jmp_history_entry for each instruction (and we
> > can do that efficiently, memory-wise, I had the patch), and then for
> > each instruction we maintained a list of "input" regs/slots and
> > corresponding "output" regs/slots as we simulate each instruction
> > forward, I think __mark_chain_precision would be much simpler and thus
> > faster. We'd basically just walk backwards instruction by instruction,
> > check if any of the output regs/slots need to be precise (few bitmasks
> > intersection), and if yes, set all input regs/slots as "need
> > precision", and just continue forward.
> >
> > I think it's actually a simpler approach and should be faster. Simpler
> > because it's easy to tell inputs/outputs while doing forward
> > instruction processing. Faster because __mark_chain_precision would
> > only do very simple operation without lots of branching and checks.
>
> I think this would bring significant speedup.
> Not sure it would completely fix the issue at hand,
> as mark_chain_precision() walks like 100 instructions back on each
> iteration of the loop, but it might be a step in the right direction.
>
> Do you mind if I refresh your old patches for jump history,
> or do you want to work on this yourself?

No, I don't mind at all. Go ahead.

>
> [...]
Eduard Zingerman Oct. 18, 2024, 2:17 a.m. UTC | #9
On Thu, 2024-10-10 at 16:23 -0700, Alexei Starovoitov wrote:
> On Thu, Oct 10, 2024 at 3:52 PM Eduard Zingerman <eddyz87@gmail.com> wrote:
> > 
> > On Thu, 2024-10-10 at 15:40 -0700, Eduard Zingerman wrote:
> > 
> > > I think this would bring significant speedup.
> > > Not sure it would completely fix the issue at hand,
> > > as mark_chain_precision() walks like 100 instructions back on each
> > > iteration of the loop, but it might be a step in the right direction.
> > 
> > In theory, we can do mark_chain_precision() lazily:
> > - mark that certain registers need precision for an instruction;
> > - wait till next checkpoint is created;
> > - do the walk for marked registers.
> > 
> > This should be simpler to implement on top of what Andrii suggests.
> 
> Indeed it can wait until the next new_state, since
> when the next iteration of the loop starts, the sl->state entries
> are guaranteed to be older states. Though the verifier may have done
> multiple loop iteration traversals, it's still in the 'cur' state,
> so states_equal(env, &sl->state, cur...) should still be correct.

After spending an absolutely unreasonable amount of time on this I
finally identified the missing piece of the puzzle.

The example program takes so much time to verify because it
accumulates a huge jmp_history array (262145 entries)
within a single state and constantly executes mark_chain_precision()
against it.

The reason for such a huge jmp_history is an interplay between two
conditions:

	if (env->jmps_processed - env->prev_jmps_processed >= 2 &&
	    env->insn_processed - env->prev_insn_processed >= 8)
		add_new_state = true;

and

	if (!force_new_state &&
	    env->jmps_processed - env->prev_jmps_processed < 20 &&
	    env->insn_processed - env->prev_insn_processed < 100)
		add_new_state = false;

I try to explain it in the commit message for [1].

If one waits long enough, the verifier eventually finishes with
-ENOMEM, as it fails to allocate a big enough jmp_history array
(one has to wait 15 minutes on master and about a minute with my patch [2];
 the patch has not changed since our discussion on Wednesday,
 and the error message is the same as we discussed then).

The discussed change to lazy precision mark propagation turned out to
be a dud:
- it does not solve the issue at hand, the verifier still exits with
  -ENOMEM, just faster;
- overall verification speedup for regular programs is not big:

$ ./veristat -e file,prog,duration -f 'duration>100000' -C master-baseline.log current.log 
File                      Program           Duration (us) (A)  Duration (us) (B)  Duration (us) (DIFF)
------------------------  ----------------  -----------------  -----------------  --------------------
loop1.bpf.o               nested_loops                 124334             117382        -6952 (-5.59%)
loop3.bpf.o               while_true                   318165             253264      -64901 (-20.40%)
pyperf100.bpf.o           on_event                     704852             692300       -12552 (-1.78%)
pyperf180.bpf.o           on_event                    2304115            2251502       -52613 (-2.28%)
pyperf50.bpf.o            on_event                     183971             184124         +153 (+0.08%)
pyperf600.bpf.o           on_event                    2076073            2051733       -24340 (-1.17%)
pyperf600_nounroll.bpf.o  on_event                     394093             402552        +8459 (+2.15%)
strobemeta.bpf.o          on_event                     169218             169835         +617 (+0.36%)
test_verif_scale1.bpf.o   balancer_ingress             144779             146486        +1707 (+1.18%)
test_verif_scale2.bpf.o   balancer_ingress             147078             151902        +4824 (+3.28%)
test_verif_scale3.bpf.o   balancer_ingress             184959             188723        +3764 (+2.04%)

Here only loop3 demonstrates a significant speedup (I re-ran loop3
several times), but in my opinion it does not warrant the code change.

[1] https://lore.kernel.org/bpf/20241018020307.1766906-1-eddyz87@gmail.com/
[2] https://github.com/eddyz87/bpf/tree/lazy-mark-chain-precision

Patch

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 7d9b38ffd220..3fa517b3fb57 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -16055,9 +16055,11 @@  static int push_insn(int t, int w, int e, struct bpf_verifier_env *env)
 	int *insn_state = env->cfg.insn_state;
 
 	if (e == FALLTHROUGH && insn_state[t] >= (DISCOVERED | FALLTHROUGH))
+		/* FALLTHROUGH from t already processed */
 		return DONE_EXPLORING;
 
 	if (e == BRANCH && insn_state[t] >= (DISCOVERED | BRANCH))
+		/* BRANCH from t already processed */
 		return DONE_EXPLORING;
 
 	if (w < 0 || w >= env->prog->len) {
@@ -16081,6 +16083,8 @@  static int push_insn(int t, int w, int e, struct bpf_verifier_env *env)
 		insn_stack[env->cfg.cur_stack++] = w;
 		return KEEP_EXPLORING;
 	} else if ((insn_state[w] & 0xF0) == DISCOVERED) {
+		/* back-edge */
+		mark_force_checkpoint(env, w);
 		if (env->bpf_capable)
 			return DONE_EXPLORING;
 		verbose_linfo(env, t, "%d: ", t);
diff --git a/tools/testing/selftests/bpf/progs/verifier_precision.c b/tools/testing/selftests/bpf/progs/verifier_precision.c
index 6b564d4c0986..7215c0cc0ccb 100644
--- a/tools/testing/selftests/bpf/progs/verifier_precision.c
+++ b/tools/testing/selftests/bpf/progs/verifier_precision.c
@@ -95,34 +95,37 @@  __naked int bpf_end_bswap(void)
 SEC("?raw_tp")
 __success __log_level(2)
 /*
- * Without the bug fix there will be no history between "last_idx 3 first_idx 3"
+ * Without the bug fix there will be no history between "last_idx 1 first_idx 1"
  * and "parent state regs=" lines. "R0_w=6" parts are here to help anchor
  * expected log messages to the one specific mark_chain_precision operation.
  *
  * This is quite fragile: if verifier checkpointing heuristic changes, this
  * might need adjusting.
  */
-__msg("2: (07) r0 += 1                       ; R0_w=6")
-__msg("3: (35) if r0 >= 0xa goto pc+1")
-__msg("mark_precise: frame0: last_idx 3 first_idx 3 subseq_idx -1")
-__msg("mark_precise: frame0: regs=r0 stack= before 2: (07) r0 += 1")
-__msg("mark_precise: frame0: regs=r0 stack= before 1: (07) r0 += 1")
+__msg("3: (07) r0 += 1                       ; R0_w=6")
+__msg("4: (05) goto pc-4")
+__msg("1: (35) if r0 >= 0xa goto pc+3")
+__msg("mark_precise: frame0: last_idx 1 first_idx 1 subseq_idx -1")
+__msg("mark_precise: frame0: parent state regs=r0 stack=:  R0_rw=P6 R1=ctx() R10=fp0")
+__msg("mark_precise: frame0: last_idx 4 first_idx 1 subseq_idx 1")
 __msg("mark_precise: frame0: regs=r0 stack= before 4: (05) goto pc-4")
-__msg("mark_precise: frame0: regs=r0 stack= before 3: (35) if r0 >= 0xa goto pc+1")
-__msg("mark_precise: frame0: parent state regs= stack=:  R0_rw=P4")
-__msg("3: R0_w=6")
+__msg("mark_precise: frame0: regs=r0 stack= before 3: (07) r0 += 1")
+__msg("mark_precise: frame0: regs=r0 stack= before 2: (07) r0 += 1")
+__msg("mark_precise: frame0: regs=r0 stack= before 1: (35) if r0 >= 0xa goto pc+3")
+__msg("mark_precise: frame0: parent state regs= stack=:  R0_rw=P4 R1=ctx() R10=fp0")
+__msg("1: R0=6")
 __naked int state_loop_first_last_equal(void)
 {
 	asm volatile (
 		"r0 = 0;"
 	"l0_%=:"
-		"r0 += 1;"
-		"r0 += 1;"
 		/* every few iterations we'll have a checkpoint here with
 		 * first_idx == last_idx, potentially confusing precision
 		 * backtracking logic
 		 */
 		"if r0 >= 10 goto l1_%=;"	/* checkpoint + mark_precise */
+		"r0 += 1;"
+		"r0 += 1;"
 		"goto l0_%=;"
 	"l1_%=:"
 		"exit;"