
[RFC,bpf-next,0/5] Support for BPF_ST instruction in LLVM C compiler

Message ID 20221231163122.1360813-1-eddyz87@gmail.com (mailing list archive)

Message

Eduard Zingerman Dec. 31, 2022, 4:31 p.m. UTC
BPF has two documented (non-atomic) memory store instructions:

BPF_STX: *(size *) (dst_reg + off) = src_reg
BPF_ST : *(size *) (dst_reg + off) = imm32

Currently the LLVM BPF back-end does not emit the BPF_ST instruction
and does not allow one to be specified in inline assembly.

Recently I've been exploring ways to port some of the verifier test
cases from tools/testing/selftests/bpf/verifier/*.c to inline assembly,
using the machinery provided in tools/testing/selftests/bpf/test_loader.c
(which should hopefully simplify test maintenance).
The BPF_ST instruction is popular in these tests: it is used in 52 of 94 files.

While it is possible to adjust LLVM to support BPF_ST only in inline
assembly blocks, that seems a bit wasteful. This patch-set contains the
verifier changes necessary if LLVM is allowed to freely emit BPF_ST
instructions (compiler source code is available here [1]).
The changes include:
 - an update to the verifier.c:check_stack_write_*() functions to track
   constant values spilled to the stack via the BPF_ST instruction in
   the same way stack spills of known registers by BPF_STX are tracked;
 - updates to verifier.c:convert_ctx_access() and its callbacks to
   handle the BPF_ST instruction in a way similar to BPF_STX;
 - some test adjustments and a few new tests.

With the above changes applied and the LLVM from [1], all test_verifier,
test_maps, test_progs and test_progs-no_alu32 test cases pass.

When built using the LLVM version from [1], the BPF programs generated
for the selftests and the Cilium programs (taken from [2]) see a modest
reduction in size; e.g. below are the total instruction counts for
files with over 5K instructions:

File                                    Insns   Insns   Insns   Diff
                                        w/o     with    diff    pct
                                        BPF_ST  BPF_ST
cilium/bpf_host.o                       44620   43774   -846    -1.90%
cilium/bpf_lxc.o                        36842   36060   -782    -2.12%
cilium/bpf_overlay.o                    23557   23186   -371    -1.57%
cilium/bpf_xdp.o                        26397   25931   -466    -1.77%
selftests/core_kern.bpf.o               12359   12359    0       0.00%
selftests/linked_list_fail.bpf.o        5501    5302    -199    -3.62%
selftests/profiler1.bpf.o               17828   17709   -119    -0.67%
selftests/pyperf100.bpf.o               49793   49268   -525    -1.05%
selftests/pyperf180.bpf.o               88738   87813   -925    -1.04%
selftests/pyperf50.bpf.o                25388   25113   -275    -1.08%
selftests/pyperf600.bpf.o               78330   78300   -30     -0.04%
selftests/pyperf_global.bpf.o           5244    5188    -56     -1.07%
selftests/pyperf_subprogs.bpf.o         5262    5192    -70     -1.33%
selftests/strobemeta.bpf.o              17154   16065   -1089   -6.35%
selftests/test_verif_scale2.bpf.o       11337   11337    0       0.00%

(Instructions are counted as the number of instruction lines in the
 disassembly.)

Is the community interested in this work?
Are there any omissions in my changes to the verifier?

Known issue:

There are two programs (one Cilium, one selftest) that exhibit an
anomalous increase in the number of instructions processed by the
verifier with this patch-set:

 File                 Program                        Insns (A)  Insns (B)  Insns   (DIFF)
 -------------------  -----------------------------  ---------  ---------  --------------
 bpf_host.o           tail_ipv6_host_policy_ingress       1576       2403  +827 (+52.47%)
 map_kptr.bpf.o       test_map_kptr                        400        475   +75 (+18.75%)
 -------------------  -----------------------------  ---------  ---------  --------------

These are under investigation.

Thanks,
Eduard

[1] https://reviews.llvm.org/D140804
[2] git@github.com:anakryiko/cilium.git

Eduard Zingerman (5):
  bpf: more precise stack write reasoning for BPF_ST instruction
  selftests/bpf: check if verifier tracks constants spilled by
    BPF_ST_MEM
  bpf: allow ctx writes using BPF_ST_MEM instruction
  selftests/bpf: test if pointer type is tracked for BPF_ST_MEM
  selftests/bpf: don't match exact insn index in expected error message

 kernel/bpf/cgroup.c                           |  49 +++++---
 kernel/bpf/verifier.c                         | 102 +++++++++-------
 net/core/filter.c                             |  72 ++++++------
 .../selftests/bpf/prog_tests/log_fixup.c      |   2 +-
 .../selftests/bpf/prog_tests/spin_lock.c      |   8 +-
 .../bpf/verifier/bounds_mix_sign_unsign.c     | 110 ++++++++++--------
 .../selftests/bpf/verifier/bpf_st_mem.c       |  29 +++++
 tools/testing/selftests/bpf/verifier/ctx.c    |  11 --
 tools/testing/selftests/bpf/verifier/unpriv.c |  23 ++++
 9 files changed, 249 insertions(+), 157 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/verifier/bpf_st_mem.c

Comments

Andrii Nakryiko Jan. 4, 2023, 10:10 p.m. UTC | #1
On Sat, Dec 31, 2022 at 8:31 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
>
> BPF has two documented (non-atomic) memory store instructions:
>
> BPF_STX: *(size *) (dst_reg + off) = src_reg
> BPF_ST : *(size *) (dst_reg + off) = imm32
>
> Currently LLVM BPF back-end does not emit BPF_ST instruction and does
> not allow one to be specified as inline assembly.
>
> Recently I've been exploring ways to port some of the verifier test
> cases from tools/testing/selftests/bpf/verifier/*.c to use inline assembly
> and machinery provided in tools/testing/selftests/bpf/test_loader.c
> (which should hopefully simplify tests maintenance).
> The BPF_ST instruction is popular in these tests: used in 52 of 94 files.
>
> While it is possible to adjust LLVM to only support BPF_ST for inline
> assembly blocks it seems a bit wasteful. This patch-set contains a set
> of changes to verifier necessary in case when LLVM is allowed to
> freely emit BPF_ST instructions (source code is available here [1]).

Would we gate LLVM's emitting of BPF_ST for C code behind some new
cpu=v4? What is the benefit of having the compiler automatically emit
such instructions? Thinking about logistics: if there isn't much
benefit, then as a BPF application owner I wouldn't bother enabling
this behavior and risking regressions on old kernels that don't have
these changes.

So I feel like the biggest benefit is to be able to use this
instruction in embedded assembly, to make writing and maintaining
tests easier.

> The changes include:
>  - update to verifier.c:check_stack_write_*() functions to track
>    constant values spilled to stack via BPF_ST instruction in a same
>    way stack spills of known registers by BPF_STX are tracked;
>  - updates to verifier.c:convert_ctx_access() and it's callbacks to
>    handle BPF_ST instruction in a way similar to BPF_STX;
>  - some test adjustments and a few new tests.
>
> With the above changes applied and LLVM from [1] all test_verifier,
> test_maps, test_progs and test_progs-no_alu32 test cases are passing.
>
> When built using the LLVM version from [1] BPF programs generated for
> selftests and Cilium programs (taken from [2]) see certain reduction
> in size, e.g. below are total numbers of instructions for files with
> over 5K instructions:
>
> File                                    Insns   Insns   Insns   Diff
>                                         w/o     with    diff    pct
>                                         BPF_ST  BPF_ST
> cilium/bpf_host.o                       44620   43774   -846    -1.90%
> cilium/bpf_lxc.o                        36842   36060   -782    -2.12%
> cilium/bpf_overlay.o                    23557   23186   -371    -1.57%
> cilium/bpf_xdp.o                        26397   25931   -466    -1.77%
> selftests/core_kern.bpf.o               12359   12359    0       0.00%
> selftests/linked_list_fail.bpf.o        5501    5302    -199    -3.62%
> selftests/profiler1.bpf.o               17828   17709   -119    -0.67%
> selftests/pyperf100.bpf.o               49793   49268   -525    -1.05%
> selftests/pyperf180.bpf.o               88738   87813   -925    -1.04%
> selftests/pyperf50.bpf.o                25388   25113   -275    -1.08%
> selftests/pyperf600.bpf.o               78330   78300   -30     -0.04%
> selftests/pyperf_global.bpf.o           5244    5188    -56     -1.07%
> selftests/pyperf_subprogs.bpf.o         5262    5192    -70     -1.33%
> selftests/strobemeta.bpf.o              17154   16065   -1089   -6.35%
> selftests/test_verif_scale2.bpf.o       11337   11337    0       0.00%
>
> (Instructions are counted by counting the number of instruction lines
>  in disassembly).
>
> Is community interested in this work?
> Are there any omissions in my changes to the verifier?
>
> Known issue:
>
> There are two programs (one Cilium, one selftest) that exhibit
> anomalous increase in verifier processing time with this patch-set:
>
>  File                 Program                        Insns (A)  Insns (B)  Insns   (DIFF)
>  -------------------  -----------------------------  ---------  ---------  --------------
>  bpf_host.o           tail_ipv6_host_policy_ingress       1576       2403  +827 (+52.47%)
>  map_kptr.bpf.o       test_map_kptr                        400        475   +75 (+18.75%)
>  -------------------  -----------------------------  ---------  ---------  --------------
>
> These are under investigation.
>
> Thanks,
> Eduard
>
> [1] https://reviews.llvm.org/D140804
> [2] git@github.com:anakryiko/cilium.git
>
> Eduard Zingerman (5):
>   bpf: more precise stack write reasoning for BPF_ST instruction
>   selftests/bpf: check if verifier tracks constants spilled by
>     BPF_ST_MEM
>   bpf: allow ctx writes using BPF_ST_MEM instruction
>   selftests/bpf: test if pointer type is tracked for BPF_ST_MEM
>   selftests/bpf: don't match exact insn index in expected error message
>
>  kernel/bpf/cgroup.c                           |  49 +++++---
>  kernel/bpf/verifier.c                         | 102 +++++++++-------
>  net/core/filter.c                             |  72 ++++++------
>  .../selftests/bpf/prog_tests/log_fixup.c      |   2 +-
>  .../selftests/bpf/prog_tests/spin_lock.c      |   8 +-
>  .../bpf/verifier/bounds_mix_sign_unsign.c     | 110 ++++++++++--------
>  .../selftests/bpf/verifier/bpf_st_mem.c       |  29 +++++
>  tools/testing/selftests/bpf/verifier/ctx.c    |  11 --
>  tools/testing/selftests/bpf/verifier/unpriv.c |  23 ++++
>  9 files changed, 249 insertions(+), 157 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/verifier/bpf_st_mem.c
>
> --
> 2.39.0
>
Jose E. Marchesi Jan. 5, 2023, 10:06 a.m. UTC | #2
> On Sat, Dec 31, 2022 at 8:31 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
>>
>> BPF has two documented (non-atomic) memory store instructions:
>>
>> BPF_STX: *(size *) (dst_reg + off) = src_reg
>> BPF_ST : *(size *) (dst_reg + off) = imm32
>>
>> Currently LLVM BPF back-end does not emit BPF_ST instruction and does
>> not allow one to be specified as inline assembly.
>>
>> Recently I've been exploring ways to port some of the verifier test
>> cases from tools/testing/selftests/bpf/verifier/*.c to use inline assembly
>> and machinery provided in tools/testing/selftests/bpf/test_loader.c
>> (which should hopefully simplify tests maintenance).
>> The BPF_ST instruction is popular in these tests: used in 52 of 94 files.
>>
>> While it is possible to adjust LLVM to only support BPF_ST for inline
>> assembly blocks it seems a bit wasteful. This patch-set contains a set
>> of changes to verifier necessary in case when LLVM is allowed to
>> freely emit BPF_ST instructions (source code is available here [1]).
>
> Would we gate LLVM's emitting of BPF_ST for C code behind some new
> cpu=v4? What is the benefit for compiler to start automatically emit
> such instructions? Such thinking about logistics, if there isn't much
> benefit, as BPF application owner I wouldn't bother enabling this
> behavior risking regressions on old kernels that don't have these
> changes.

Hmm, GCC happily generates BPF_ST instructions:

  $ echo 'int v; void foo () {  v = 666; }' | bpf-unknown-none-gcc -O2 -xc -S -o foo.s -
  $ cat foo.s
        .file	"<stdin>"
        .text
        .align	3
        .global	foo
        .type	foo, @function
  foo:
        lddw	%r0,v
        stw	[%r0+0],666
        exit
        .size	foo, .-foo
        .global	v
        .type	v, @object
        .lcomm	v,4,4
        .ident	"GCC: (GNU) 12.0.0 20211206 (experimental)"

GCC has been doing that since October 2019, I think from before the
cpu versioning mechanism was put in place?

We weren't aware this was problematic.  Does the verifier reject such
instructions?

> So I feel like the biggest benefit is to be able to use this
> instruction in embedded assembly, to make writing and maintaining
> tests easier.
>
>> The changes include:
>>  - update to verifier.c:check_stack_write_*() functions to track
>>    constant values spilled to stack via BPF_ST instruction in a same
>>    way stack spills of known registers by BPF_STX are tracked;
>>  - updates to verifier.c:convert_ctx_access() and it's callbacks to
>>    handle BPF_ST instruction in a way similar to BPF_STX;
>>  - some test adjustments and a few new tests.
>>
>> With the above changes applied and LLVM from [1] all test_verifier,
>> test_maps, test_progs and test_progs-no_alu32 test cases are passing.
>>
>> When built using the LLVM version from [1] BPF programs generated for
>> selftests and Cilium programs (taken from [2]) see certain reduction
>> in size, e.g. below are total numbers of instructions for files with
>> over 5K instructions:
>>
>> File                                    Insns   Insns   Insns   Diff
>>                                         w/o     with    diff    pct
>>                                         BPF_ST  BPF_ST
>> cilium/bpf_host.o                       44620   43774   -846    -1.90%
>> cilium/bpf_lxc.o                        36842   36060   -782    -2.12%
>> cilium/bpf_overlay.o                    23557   23186   -371    -1.57%
>> cilium/bpf_xdp.o                        26397   25931   -466    -1.77%
>> selftests/core_kern.bpf.o               12359   12359    0       0.00%
>> selftests/linked_list_fail.bpf.o        5501    5302    -199    -3.62%
>> selftests/profiler1.bpf.o               17828   17709   -119    -0.67%
>> selftests/pyperf100.bpf.o               49793   49268   -525    -1.05%
>> selftests/pyperf180.bpf.o               88738   87813   -925    -1.04%
>> selftests/pyperf50.bpf.o                25388   25113   -275    -1.08%
>> selftests/pyperf600.bpf.o               78330   78300   -30     -0.04%
>> selftests/pyperf_global.bpf.o           5244    5188    -56     -1.07%
>> selftests/pyperf_subprogs.bpf.o         5262    5192    -70     -1.33%
>> selftests/strobemeta.bpf.o              17154   16065   -1089   -6.35%
>> selftests/test_verif_scale2.bpf.o       11337   11337    0       0.00%
>>
>> (Instructions are counted by counting the number of instruction lines
>>  in disassembly).
>>
>> Is community interested in this work?
>> Are there any omissions in my changes to the verifier?
>>
>> Known issue:
>>
>> There are two programs (one Cilium, one selftest) that exhibit
>> anomalous increase in verifier processing time with this patch-set:
>>
>>  File                 Program                        Insns (A)  Insns (B)  Insns   (DIFF)
>>  -------------------  -----------------------------  ---------  ---------  --------------
>>  bpf_host.o           tail_ipv6_host_policy_ingress       1576       2403  +827 (+52.47%)
>>  map_kptr.bpf.o       test_map_kptr                        400        475   +75 (+18.75%)
>>  -------------------  -----------------------------  ---------  ---------  --------------
>>
>> These are under investigation.
>>
>> Thanks,
>> Eduard
>>
>> [1] https://reviews.llvm.org/D140804
>> [2] git@github.com:anakryiko/cilium.git
>>
>> Eduard Zingerman (5):
>>   bpf: more precise stack write reasoning for BPF_ST instruction
>>   selftests/bpf: check if verifier tracks constants spilled by
>>     BPF_ST_MEM
>>   bpf: allow ctx writes using BPF_ST_MEM instruction
>>   selftests/bpf: test if pointer type is tracked for BPF_ST_MEM
>>   selftests/bpf: don't match exact insn index in expected error message
>>
>>  kernel/bpf/cgroup.c                           |  49 +++++---
>>  kernel/bpf/verifier.c                         | 102 +++++++++-------
>>  net/core/filter.c                             |  72 ++++++------
>>  .../selftests/bpf/prog_tests/log_fixup.c      |   2 +-
>>  .../selftests/bpf/prog_tests/spin_lock.c      |   8 +-
>>  .../bpf/verifier/bounds_mix_sign_unsign.c     | 110 ++++++++++--------
>>  .../selftests/bpf/verifier/bpf_st_mem.c       |  29 +++++
>>  tools/testing/selftests/bpf/verifier/ctx.c    |  11 --
>>  tools/testing/selftests/bpf/verifier/unpriv.c |  23 ++++
>>  9 files changed, 249 insertions(+), 157 deletions(-)
>>  create mode 100644 tools/testing/selftests/bpf/verifier/bpf_st_mem.c
>>
>> --
>> 2.39.0
>>
Eduard Zingerman Jan. 5, 2023, 12:07 p.m. UTC | #3
On Thu, 2023-01-05 at 11:06 +0100, Jose E. Marchesi wrote:
> > On Sat, Dec 31, 2022 at 8:31 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
> > > 
> > > BPF has two documented (non-atomic) memory store instructions:
> > > 
> > > BPF_STX: *(size *) (dst_reg + off) = src_reg
> > > BPF_ST : *(size *) (dst_reg + off) = imm32
> > > 
> > > Currently LLVM BPF back-end does not emit BPF_ST instruction and does
> > > not allow one to be specified as inline assembly.
> > > 
> > > Recently I've been exploring ways to port some of the verifier test
> > > cases from tools/testing/selftests/bpf/verifier/*.c to use inline assembly
> > > and machinery provided in tools/testing/selftests/bpf/test_loader.c
> > > (which should hopefully simplify tests maintenance).
> > > The BPF_ST instruction is popular in these tests: used in 52 of 94 files.
> > > 
> > > While it is possible to adjust LLVM to only support BPF_ST for inline
> > > assembly blocks it seems a bit wasteful. This patch-set contains a set
> > > of changes to verifier necessary in case when LLVM is allowed to
> > > freely emit BPF_ST instructions (source code is available here [1]).
> > 
> > Would we gate LLVM's emitting of BPF_ST for C code behind some new
> > cpu=v4? What is the benefit for compiler to start automatically emit
> > such instructions? Such thinking about logistics, if there isn't much
> > benefit, as BPF application owner I wouldn't bother enabling this
> > behavior risking regressions on old kernels that don't have these
> > changes.
> 
> Hmm, GCC happily generates BPF_ST instructions:
> 
>   $ echo 'int v; void foo () {  v = 666; }' | bpf-unknown-none-gcc -O2 -xc -S -o foo.s -
>   $ cat foo.s
>         .file	"<stdin>"
>         .text
>         .align	3
>         .global	foo
>         .type	foo, @function
>   foo:
>         lddw	%r0,v
>         stw	[%r0+0],666
>         exit
>         .size	foo, .-foo
>         .global	v
>         .type	v, @object
>         .lcomm	v,4,4
>         .ident	"GCC: (GNU) 12.0.0 20211206 (experimental)"
> 
> Been doing that since October 2019, I think before the cpu versioning
> mechanism was got in place?
> 
> We weren't aware this was problematic.  Does the verifier reject such
> instructions?

Interesting, do BPF selftests generated by GCC pass the same way they
do if generated by clang?

I had to make the following changes to the verifier for the selftests
to pass when the BPF_ST instruction is allowed for instruction selection:

- patch #1 in this patchset: track values of constants written to the
  stack using BPF_ST. Currently these are tracked imprecisely, unlike
  writes using BPF_STX, e.g.:

    fp[-8] = 42;   currently the verifier assumes that fp[-8]=mmmmmmmm
                   after this instruction, where m stands for "misc",
                   i.e. just a note that something was written at fp[-8].

    r1 = 42;       the verifier tracks r1=42 after this instruction.
    fp[-8] = r1;   the verifier tracks fp[-8]=42 after this instruction.

  The patch makes both cases equivalent.
  
- patch #3 in this patchset: adjusts verifier.c:convert_ctx_access()
  to operate on BPF_ST alongside BPF_STX.

  The context parameters of some BPF program types are "fake" data
  structures. The verifier matches all BPF_STX and BPF_LDX
  instructions that operate on pointers to such contexts and rewrites
  these instructions. It might change an offset, add another layer of
  indirection, etc. E.g. see filter.c:bpf_convert_ctx_access().
  (This also implies that the verifier forbids writes to non-constant
   offsets inside such structures.)

  The patch extends this logic to also handle BPF_ST.

> 
> > So I feel like the biggest benefit is to be able to use this
> > instruction in embedded assembly, to make writing and maintaining
> > tests easier.
> > 
> > > The changes include:
> > >  - update to verifier.c:check_stack_write_*() functions to track
> > >    constant values spilled to stack via BPF_ST instruction in a same
> > >    way stack spills of known registers by BPF_STX are tracked;
> > >  - updates to verifier.c:convert_ctx_access() and it's callbacks to
> > >    handle BPF_ST instruction in a way similar to BPF_STX;
> > >  - some test adjustments and a few new tests.
> > > 
> > > With the above changes applied and LLVM from [1] all test_verifier,
> > > test_maps, test_progs and test_progs-no_alu32 test cases are passing.
> > > 
> > > When built using the LLVM version from [1] BPF programs generated for
> > > selftests and Cilium programs (taken from [2]) see certain reduction
> > > in size, e.g. below are total numbers of instructions for files with
> > > over 5K instructions:
> > > 
> > > File                                    Insns   Insns   Insns   Diff
> > >                                         w/o     with    diff    pct
> > >                                         BPF_ST  BPF_ST
> > > cilium/bpf_host.o                       44620   43774   -846    -1.90%
> > > cilium/bpf_lxc.o                        36842   36060   -782    -2.12%
> > > cilium/bpf_overlay.o                    23557   23186   -371    -1.57%
> > > cilium/bpf_xdp.o                        26397   25931   -466    -1.77%
> > > selftests/core_kern.bpf.o               12359   12359    0       0.00%
> > > selftests/linked_list_fail.bpf.o        5501    5302    -199    -3.62%
> > > selftests/profiler1.bpf.o               17828   17709   -119    -0.67%
> > > selftests/pyperf100.bpf.o               49793   49268   -525    -1.05%
> > > selftests/pyperf180.bpf.o               88738   87813   -925    -1.04%
> > > selftests/pyperf50.bpf.o                25388   25113   -275    -1.08%
> > > selftests/pyperf600.bpf.o               78330   78300   -30     -0.04%
> > > selftests/pyperf_global.bpf.o           5244    5188    -56     -1.07%
> > > selftests/pyperf_subprogs.bpf.o         5262    5192    -70     -1.33%
> > > selftests/strobemeta.bpf.o              17154   16065   -1089   -6.35%
> > > selftests/test_verif_scale2.bpf.o       11337   11337    0       0.00%
> > > 
> > > (Instructions are counted by counting the number of instruction lines
> > >  in disassembly).
> > > 
> > > Is community interested in this work?
> > > Are there any omissions in my changes to the verifier?
> > > 
> > > Known issue:
> > > 
> > > There are two programs (one Cilium, one selftest) that exhibit
> > > anomalous increase in verifier processing time with this patch-set:
> > > 
> > >  File                 Program                        Insns (A)  Insns (B)  Insns   (DIFF)
> > >  -------------------  -----------------------------  ---------  ---------  --------------
> > >  bpf_host.o           tail_ipv6_host_policy_ingress       1576       2403  +827 (+52.47%)
> > >  map_kptr.bpf.o       test_map_kptr                        400        475   +75 (+18.75%)
> > >  -------------------  -----------------------------  ---------  ---------  --------------
> > > 
> > > These are under investigation.
> > > 
> > > Thanks,
> > > Eduard
> > > 
> > > [1] https://reviews.llvm.org/D140804
> > > [2] git@github.com:anakryiko/cilium.git
> > > 
> > > Eduard Zingerman (5):
> > >   bpf: more precise stack write reasoning for BPF_ST instruction
> > >   selftests/bpf: check if verifier tracks constants spilled by
> > >     BPF_ST_MEM
> > >   bpf: allow ctx writes using BPF_ST_MEM instruction
> > >   selftests/bpf: test if pointer type is tracked for BPF_ST_MEM
> > >   selftests/bpf: don't match exact insn index in expected error message
> > > 
> > >  kernel/bpf/cgroup.c                           |  49 +++++---
> > >  kernel/bpf/verifier.c                         | 102 +++++++++-------
> > >  net/core/filter.c                             |  72 ++++++------
> > >  .../selftests/bpf/prog_tests/log_fixup.c      |   2 +-
> > >  .../selftests/bpf/prog_tests/spin_lock.c      |   8 +-
> > >  .../bpf/verifier/bounds_mix_sign_unsign.c     | 110 ++++++++++--------
> > >  .../selftests/bpf/verifier/bpf_st_mem.c       |  29 +++++
> > >  tools/testing/selftests/bpf/verifier/ctx.c    |  11 --
> > >  tools/testing/selftests/bpf/verifier/unpriv.c |  23 ++++
> > >  9 files changed, 249 insertions(+), 157 deletions(-)
> > >  create mode 100644 tools/testing/selftests/bpf/verifier/bpf_st_mem.c
> > > 
> > > --
> > > 2.39.0
> > >
Jose E. Marchesi Jan. 5, 2023, 3:07 p.m. UTC | #4
> On Thu, 2023-01-05 at 11:06 +0100, Jose E. Marchesi wrote:
>> > On Sat, Dec 31, 2022 at 8:31 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
>> > > 
>> > > BPF has two documented (non-atomic) memory store instructions:
>> > > 
>> > > BPF_STX: *(size *) (dst_reg + off) = src_reg
>> > > BPF_ST : *(size *) (dst_reg + off) = imm32
>> > > 
>> > > Currently LLVM BPF back-end does not emit BPF_ST instruction and does
>> > > not allow one to be specified as inline assembly.
>> > > 
>> > > Recently I've been exploring ways to port some of the verifier test
>> > > cases from tools/testing/selftests/bpf/verifier/*.c to use inline assembly
>> > > and machinery provided in tools/testing/selftests/bpf/test_loader.c
>> > > (which should hopefully simplify tests maintenance).
>> > > The BPF_ST instruction is popular in these tests: used in 52 of 94 files.
>> > > 
>> > > While it is possible to adjust LLVM to only support BPF_ST for inline
>> > > assembly blocks it seems a bit wasteful. This patch-set contains a set
>> > > of changes to verifier necessary in case when LLVM is allowed to
>> > > freely emit BPF_ST instructions (source code is available here [1]).
>> > 
>> > Would we gate LLVM's emitting of BPF_ST for C code behind some new
>> > cpu=v4? What is the benefit for compiler to start automatically emit
>> > such instructions? Such thinking about logistics, if there isn't much
>> > benefit, as BPF application owner I wouldn't bother enabling this
>> > behavior risking regressions on old kernels that don't have these
>> > changes.
>> 
>> Hmm, GCC happily generates BPF_ST instructions:
>> 
>>   $ echo 'int v; void foo () {  v = 666; }' | bpf-unknown-none-gcc -O2 -xc -S -o foo.s -
>>   $ cat foo.s
>>         .file	"<stdin>"
>>         .text
>>         .align	3
>>         .global	foo
>>         .type	foo, @function
>>   foo:
>>         lddw	%r0,v
>>         stw	[%r0+0],666
>>         exit
>>         .size	foo, .-foo
>>         .global	v
>>         .type	v, @object
>>         .lcomm	v,4,4
>>         .ident	"GCC: (GNU) 12.0.0 20211206 (experimental)"
>> 
>> Been doing that since October 2019, I think before the cpu versioning
>> mechanism was got in place?
>> 
>> We weren't aware this was problematic.  Does the verifier reject such
>> instructions?
>
> Interesting, do BPF selftests generated by GCC pass the same way they
> do if generated by clang?

We are still working on other issues in GCC; we are not yet on par with
clang when it comes to building and running the BPF selftests.  So I
guess we haven't reached that particular breaking point yet ;)

> I had to do the following changes to the verifier to make the
> selftests pass when BPF_ST instruction is allowed for selection:

So, if the current state of affairs is that the kernel rejects BPF_ST
instructions, we shouldn't be generating them from GCC.

I like Andrii's idea of introducing a -mcpu=v4 and conditioning the
compiler's generation of BPF_ST on it.  GCC defaults to -mcpu=v3 atm.

Yonghong, WDYT?  Would that be acceptable for clang as well?

>
> - patch #1 in this patchset: track values of constants written to
>   stack using BPF_ST. Currently these are tracked imprecisely, unlike
>   the writes using BPF_STX, e.g.:
>   
>     fp[-8] = 42;   currently verifier assumes that fp[-8]=mmmmmmmm
>                    after such instruction, where m stands for "misc",
>                    just a note that something is written at fp[-8].
>                    
>     r1 = 42;       verifier tracks r1=42 after this instruction.
>     fp[-8] = r1;   verifier tracks fp[-8]=42 after this instruction.
>
>   So the patch makes both cases equivalent.
>   
> - patch #3 in this patchset: adjusts verifier.c:convert_ctx_access()
>   to operate on BPF_ST alongside BPF_STX.
>   
>   Context parameters for some BPF programs types are "fake" data
>   structures. The verifier matches all BPF_STX and BPF_LDX
>   instructions that operate on pointers to such contexts and rewrites
>   these instructions. It might change an offset or add another layer
>   of indirection, etc. E.g. see filter.c:bpf_convert_ctx_access().
>   (This also implies that verifier forbids writes to non-constant
>    offsets inside such structures).
>    
>   So the patch extends this logic to also handle BPF_ST.
>
>> 
>> > So I feel like the biggest benefit is to be able to use this
>> > instruction in embedded assembly, to make writing and maintaining
>> > tests easier.
>> > 
>> > > The changes include:
>> > >  - update to verifier.c:check_stack_write_*() functions to track
>> > >    constant values spilled to stack via BPF_ST instruction in a same
>> > >    way stack spills of known registers by BPF_STX are tracked;
>> > >  - updates to verifier.c:convert_ctx_access() and it's callbacks to
>> > >    handle BPF_ST instruction in a way similar to BPF_STX;
>> > >  - some test adjustments and a few new tests.
>> > > 
>> > > With the above changes applied and LLVM from [1] all test_verifier,
>> > > test_maps, test_progs and test_progs-no_alu32 test cases are passing.
>> > > 
>> > > When built using the LLVM version from [1] BPF programs generated for
>> > > selftests and Cilium programs (taken from [2]) see certain reduction
>> > > in size, e.g. below are total numbers of instructions for files with
>> > > over 5K instructions:
>> > > 
>> > > File                                    Insns   Insns   Insns   Diff
>> > >                                         w/o     with    diff    pct
>> > >                                         BPF_ST  BPF_ST
>> > > cilium/bpf_host.o                       44620   43774   -846    -1.90%
>> > > cilium/bpf_lxc.o                        36842   36060   -782    -2.12%
>> > > cilium/bpf_overlay.o                    23557   23186   -371    -1.57%
>> > > cilium/bpf_xdp.o                        26397   25931   -466    -1.77%
>> > > selftests/core_kern.bpf.o               12359   12359    0       0.00%
>> > > selftests/linked_list_fail.bpf.o        5501    5302    -199    -3.62%
>> > > selftests/profiler1.bpf.o               17828   17709   -119    -0.67%
>> > > selftests/pyperf100.bpf.o               49793   49268   -525    -1.05%
>> > > selftests/pyperf180.bpf.o               88738   87813   -925    -1.04%
>> > > selftests/pyperf50.bpf.o                25388   25113   -275    -1.08%
>> > > selftests/pyperf600.bpf.o               78330   78300   -30     -0.04%
>> > > selftests/pyperf_global.bpf.o           5244    5188    -56     -1.07%
>> > > selftests/pyperf_subprogs.bpf.o         5262    5192    -70     -1.33%
>> > > selftests/strobemeta.bpf.o              17154   16065   -1089   -6.35%
>> > > selftests/test_verif_scale2.bpf.o       11337   11337    0       0.00%
>> > > 
>> > > (Instruction counts were obtained by counting the number of
>> > >  instruction lines in the disassembly.)
>> > > 
>> > > Is the community interested in this work?
>> > > Are there any omissions in my changes to the verifier?
>> > > 
>> > > Known issue:
>> > > 
>> > > There are two programs (one Cilium, one selftest) that exhibit
>> > > anomalous increase in verifier processing time with this patch-set:
>> > > 
>> > >  File                 Program                        Insns (A)  Insns (B)  Insns   (DIFF)
>> > >  -------------------  -----------------------------  ---------  ---------  --------------
>> > >  bpf_host.o           tail_ipv6_host_policy_ingress       1576       2403  +827 (+52.47%)
>> > >  map_kptr.bpf.o       test_map_kptr                        400        475   +75 (+18.75%)
>> > >  -------------------  -----------------------------  ---------  ---------  --------------
>> > > 
>> > > These are under investigation.
>> > > 
>> > > Thanks,
>> > > Eduard
>> > > 
>> > > [1] https://reviews.llvm.org/D140804
>> > > [2] git@github.com:anakryiko/cilium.git
>> > > 
>> > > Eduard Zingerman (5):
>> > >   bpf: more precise stack write reasoning for BPF_ST instruction
>> > >   selftests/bpf: check if verifier tracks constants spilled by
>> > >     BPF_ST_MEM
>> > >   bpf: allow ctx writes using BPF_ST_MEM instruction
>> > >   selftests/bpf: test if pointer type is tracked for BPF_ST_MEM
>> > >   selftests/bpf: don't match exact insn index in expected error message
>> > > 
>> > >  kernel/bpf/cgroup.c                           |  49 +++++---
>> > >  kernel/bpf/verifier.c                         | 102 +++++++++-------
>> > >  net/core/filter.c                             |  72 ++++++------
>> > >  .../selftests/bpf/prog_tests/log_fixup.c      |   2 +-
>> > >  .../selftests/bpf/prog_tests/spin_lock.c      |   8 +-
>> > >  .../bpf/verifier/bounds_mix_sign_unsign.c     | 110 ++++++++++--------
>> > >  .../selftests/bpf/verifier/bpf_st_mem.c       |  29 +++++
>> > >  tools/testing/selftests/bpf/verifier/ctx.c    |  11 --
>> > >  tools/testing/selftests/bpf/verifier/unpriv.c |  23 ++++
>> > >  9 files changed, 249 insertions(+), 157 deletions(-)
>> > >  create mode 100644 tools/testing/selftests/bpf/verifier/bpf_st_mem.c
>> > > 
>> > > --
>> > > 2.39.0
>> > >
Alexei Starovoitov Jan. 12, 2023, 10:27 p.m. UTC | #5
On Thu, Jan 05, 2023 at 02:07:05PM +0200, Eduard Zingerman wrote:
> On Thu, 2023-01-05 at 11:06 +0100, Jose E. Marchesi wrote:
> > > On Sat, Dec 31, 2022 at 8:31 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
> > > > 
> > > > BPF has two documented (non-atomic) memory store instructions:
> > > > 
> > > > BPF_STX: *(size *) (dst_reg + off) = src_reg
> > > > BPF_ST : *(size *) (dst_reg + off) = imm32
> > > > 
> > > > Currently LLVM BPF back-end does not emit BPF_ST instruction and does
> > > > not allow one to be specified as inline assembly.
> > > > 
> > > > Recently I've been exploring ways to port some of the verifier test
> > > > cases from tools/testing/selftests/bpf/verifier/*.c to use inline assembly
> > > > and machinery provided in tools/testing/selftests/bpf/test_loader.c
> > > > (which should hopefully simplify tests maintenance).
> > > > The BPF_ST instruction is popular in these tests: used in 52 of 94 files.
> > > > 
> > > > While it is possible to adjust LLVM to only support BPF_ST for inline
> > > > assembly blocks it seems a bit wasteful. This patch-set contains a set
> > > > of changes to verifier necessary in case when LLVM is allowed to
> > > > freely emit BPF_ST instructions (source code is available here [1]).
> > > 
> > > Would we gate LLVM's emitting of BPF_ST for C code behind some new
> > > cpu=v4? What is the benefit for compiler to start automatically emit
> > > such instructions? Such thinking about logistics, if there isn't much
> > > benefit, as BPF application owner I wouldn't bother enabling this
> > > behavior risking regressions on old kernels that don't have these
> > > changes.
> > 
> > Hmm, GCC happily generates BPF_ST instructions:
> > 
> >   $ echo 'int v; void foo () {  v = 666; }' | bpf-unknown-none-gcc -O2 -xc -S -o foo.s -
> >   $ cat foo.s
> >         .file	"<stdin>"
> >         .text
> >         .align	3
> >         .global	foo
> >         .type	foo, @function
> >   foo:
> >         lddw	%r0,v
> >         stw	[%r0+0],666
> >         exit
> >         .size	foo, .-foo
> >         .global	v
> >         .type	v, @object
> >         .lcomm	v,4,4
> >         .ident	"GCC: (GNU) 12.0.0 20211206 (experimental)"
> > 
> > Been doing that since October 2019, I think before the cpu versioning
> > mechanism was got in place?
> > 
> > We weren't aware this was problematic.  Does the verifier reject such
> > instructions?
> 
> Interesting, do BPF selftests generated by GCC pass the same way they
> do if generated by clang?
> 
> I had to do the following changes to the verifier to make the
> selftests pass when BPF_ST instruction is allowed for selection:
> 
> - patch #1 in this patchset: track values of constants written to
>   stack using BPF_ST. Currently these are tracked imprecisely, unlike
>   the writes using BPF_STX, e.g.:
>   
>     fp[-8] = 42;   currently verifier assumes that fp[-8]=mmmmmmmm
>                    after such instruction, where m stands for "misc",
>                    just a note that something is written at fp[-8].
>                    
>     r1 = 42;       verifier tracks r1=42 after this instruction.
>     fp[-8] = r1;   verifier tracks fp[-8]=42 after this instruction.
> 
>   So the patch makes both cases equivalent.
>   
> - patch #3 in this patchset: adjusts verifier.c:convert_ctx_access()
>   to operate on BPF_ST alongside BPF_STX.
>   
>   Context parameters for some BPF programs types are "fake" data
>   structures. The verifier matches all BPF_STX and BPF_LDX
>   instructions that operate on pointers to such contexts and rewrites
>   these instructions. It might change an offset or add another layer
>   of indirection, etc. E.g. see filter.c:bpf_convert_ctx_access().
>   (This also implies that verifier forbids writes to non-constant
>    offsets inside such structures).
>    
>   So the patch extends this logic to also handle BPF_ST.

The patch 3 is necessary to land before llvm starts generating 'st' for ctx access.
That's clear, but I'm missing why patch 1 is necessary.
Sure, it's making the verifier understand scalar spills with 'st' and
makes 'st' equivalent to 'stx', but I'm missing why it's necessary.
What kind of programs fail to be verified when llvm starts generating 'st' ?

Regarding -mcpu=v4.
I think we need to add all of our upcoming instructions as a single flag.
Otherwise we'll have -mcpu=v5,v6,v7 and full combinations of them.

-mcpu=v4 could mean:
- ST
- sign extending loads
- sign extend a register
- 32-bit JA
- proper bswap insns: bswap16, bswap32, bswap64

The sign and 32-bit JA we've discussed earlier.
The bswap was on my wish list forever.
The existing TO_LE, TO_BE insns are really odd from compiler pov.
The compiler should translate bswap IR op into proper bswap insn
just like it does on all cpus.
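For illustration, here is a plain C model of the unconditional byte reversal
that compilers lower bswap IR into (helper names are ours, not proposed
mnemonics); unlike TO_LE/TO_BE, the result does not depend on host endianness:

```c
#include <stdint.h>

/* Unconditional byte swaps: what a dedicated bswap insn would compute.
 * The existing BPF_TO_LE/BPF_TO_BE ops are conditional instead: e.g.
 * TO_LE is a no-op on a little-endian host, which is why they map
 * poorly onto the compiler's unconditional bswap IR operation. */
static uint16_t bswap16(uint16_t x)
{
	return (uint16_t)((x >> 8) | (x << 8));
}

static uint32_t bswap32(uint32_t x)
{
	return ((x >> 24) & 0x000000ffu) |
	       ((x >>  8) & 0x0000ff00u) |
	       ((x <<  8) & 0x00ff0000u) |
	       ((x << 24) & 0xff000000u);
}

static uint64_t bswap64(uint64_t x)
{
	return ((uint64_t)bswap32((uint32_t)x) << 32) |
	       bswap32((uint32_t)(x >> 32));
}
```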

Maybe add SDIV to -mcpu=v4 as well?
Yonghong Song Jan. 13, 2023, 8:02 a.m. UTC | #6
On 1/12/23 2:27 PM, Alexei Starovoitov wrote:
> [...]
> 
> The patch 3 is necessary to land before llvm starts generating 'st' for ctx access.
> That's clear, but I'm missing why patch 1 is necessary.
> Sure, it's making the verifier understand scalar spills with 'st' and
> makes 'st' equivalent to 'stx', but I'm missing why it's necessary.
> What kind of programs fail to be verified when llvm starts generating 'st' ?
> 
> Regarding -mcpu=v4.
> I think we need to add all of our upcoming instructions as a single flag.
> Otherwise we'll have -mcpu=v5,v6,v7 and full combinations of them.
> 
> -mcpu=v4 could mean:
> - ST
> - sign extending loads
> - sign extend a register
> - 32-bit JA
> - proper bswap insns: bswap16, bswap32, bswap64
> 
> The sign and 32-bit JA we've discussed earlier.
> The bswap was on my wish list forever.
> The existing TO_LE, TO_BE insns are really odd from compiler pov.
> The compiler should translate bswap IR op into proper bswap insn
> just like it does on all cpus.
> 
> Maybe add SDIV to -mcpu=v4 as well?

Right, we should add these insns in llvm17 with -mcpu=v4, so we
can keep the number of cpu generations to a minimum.
Jose E. Marchesi Jan. 13, 2023, 8:53 a.m. UTC | #7
> On 1/12/23 2:27 PM, Alexei Starovoitov wrote:
>> [...]
>> The patch 3 is necessary to land before llvm starts generating 'st'
>> for ctx access.
>> That's clear, but I'm missing why patch 1 is necessary.
>> Sure, it's making the verifier understand scalar spills with 'st' and
>> makes 'st' equivalent to 'stx', but I'm missing why it's necessary.
>> What kind of programs fail to be verified when llvm starts generating 'st' ?
>> Regarding -mcpu=v4.
>> I think we need to add all of our upcoming instructions as a single flag.
>> Otherwise we'll have -mcpu=v5,v6,v7 and full combinations of them.
>> -mcpu=v4 could mean:
>> - ST
>> - sign extending loads
>> - sign extend a register
>> - 32-bit JA
>> - proper bswap insns: bswap16, bswap32, bswap64
>> The sign and 32-bit JA we've discussed earlier.
>> The bswap was on my wish list forever.
>> The existing TO_LE, TO_BE insns are really odd from compiler pov.
>> The compiler should translate bswap IR op into proper bswap insn
>> just like it does on all cpus.
>> Maybe add SDIV to -mcpu=v4 as well?
>
> Right, we should add these insns in llvm17 with -mcpu=v4, so we
> can keep the number of cpu generations minimum.

How do you plan to encode the sign-extend load instructions?

I guess a possibility would be to use one of the available op-mode
values for load instructions that are currently marked as reserved.
For example:

   IMM  = 0b000
   ABS  = 0b001
   IND  = 0b010
   MEM  = 0b011
   SEM  = 0b100  <- new

Then we would have the following code fields for sign-extending LDX
instructions:

   op-mode:SEM op-size:{W,H,B,DW} op-class:LDX
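If it helps, the proposal fits the existing opcode byte layout (3-bit mode,
2-bit size, 3-bit class); a quick sketch, where BPF_MODE_SEM uses the
hypothetical 0b100 value from above:

```c
#include <stdint.h>

/* BPF load/store opcode byte: mode (bits 7-5), size (bits 4-3),
 * class (bits 2-0). Constants mirror the existing uapi values;
 * BPF_MODE_SEM is the hypothetical mode proposed above. */
#define BPF_CLASS_LDX  0x01
#define BPF_SIZE_W     0x00        /* 32-bit */
#define BPF_SIZE_H     0x08        /* 16-bit */
#define BPF_SIZE_B     0x10        /*  8-bit */
#define BPF_SIZE_DW    0x18        /* 64-bit */
#define BPF_MODE_MEM   (0x3u << 5) /* 0b011, existing */
#define BPF_MODE_SEM   (0x4u << 5) /* 0b100, proposed */

static uint8_t ldx_opcode(uint32_t mode, uint32_t size)
{
	return (uint8_t)(mode | size | BPF_CLASS_LDX);
}
```

For instance, `ldx_opcode(BPF_MODE_MEM, BPF_SIZE_W)` yields 0x61, the
existing ldxw opcode, while the proposed sign-extending word load would
come out as 0x81.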
Eduard Zingerman Jan. 13, 2023, 4:47 p.m. UTC | #8
On Fri, 2023-01-13 at 09:53 +0100, Jose E. Marchesi wrote:
> > On 1/12/23 2:27 PM, Alexei Starovoitov wrote:
> > > [...]
> > > The patch 3 is necessary to land before llvm starts generating 'st'
> > > for ctx access.
> > > That's clear, but I'm missing why patch 1 is necessary.
> > > Sure, it's making the verifier understand scalar spills with 'st' and
> > > makes 'st' equivalent to 'stx', but I'm missing why it's necessary.
> > > What kind of programs fail to be verified when llvm starts generating 'st' ?

I should have added an example to the summary. There are a few
test_progs tests that fail w/o this patch, namely atomic_bounds,
dynptr, for_each, xdp_noinline and xdp_synproxy.

Here is how atomic_bounds looks:

  SEC("fentry/bpf_fentry_test1")
  int BPF_PROG(sub, int x)
  {
          int a = 0;
          int b = __sync_fetch_and_add(&a, 1);
          /* b is certainly 0 here. Can the verifier tell? */
          while (b)
                  continue;
          return 0;
  }

    Compiled w/o BPF_ST                  Compiled with BPF_ST
  
  <sub>:                               <sub>:
    w1 = 0x0
    *(u32 *)(r10 - 0x4) = r1             *(u32 *)(r10 - 0x4) = 0x0
    w1 = 0x1                             w1 = 0x1
    lock *(u32 *)(r10 - 0x4) += r1       lock *(u32 *)(r10 - 0x4) += r1
    if w1 == 0x0 goto +0x1 <LBB0_2>      if w1 == 0x0 goto +0x1 <LBB0_2>
  
  <LBB0_1>:                            <LBB0_1>:
    goto -0x1 <LBB0_1>                   goto -0x1 <LBB0_1>
  
  <LBB0_2>:                            <LBB0_2>:
    w0 = 0x0                             w0 = 0x0
    exit                                 exit

When compiled with BPF_ST and verified w/o the patch #1 verification log
looks as follows:

  0: R1=ctx(off=0,imm=0) R10=fp0
  0: (62) *(u32 *)(r10 -4) = 0          ; R10=fp0 fp-8=mmmm????
  1: (b4) w1 = 1                        ; R1_w=1
  2: (c3) r1 = atomic_fetch_add((u32 *)(r10 -4), r1)    ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R10=fp0 fp-8=mmmm????
  3: (16) if w1 == 0x0 goto pc+1        ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
  4: (05) goto pc-1
  4: (05) goto pc-1
  4: (05) goto pc-1
  4: (05) goto pc-1
  infinite loop detected at insn 4

When compiled w/o BPF_ST and verified w/o the patch #1 verification log
looks as follows:

  func#0 @0
  reg type unsupported for arg#0 function sub#5
  0: R1=ctx(off=0,imm=0) R10=fp0
  0: (b4) w1 = 0                        ; R1_w=0
  1: (63) *(u32 *)(r10 -4) = r1
  last_idx 1 first_idx 0
  regs=2 stack=0 before 0: (b4) w1 = 0
  2: R1_w=0 R10=fp0 fp-8=0000????
  2: (b4) w1 = 1                        ; R1_w=1
  ; int b = __sync_fetch_and_add(&a, 1);
  3: (c3) r1 = atomic_fetch_add((u32 *)(r10 -4), r1)    ; R1_w=P0 R10=fp0 fp-8=mmmm????
  4: (16) if w1 == 0x0 goto pc+1
  6: R1_w=P0
  6: (b4) w0 = 0                        ; R0_w=0
  7: (95) exit
  processed 7 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0

The difference comes from the way the zero write to `r10-4` is
processed: with BPF_ST it is tracked as `fp-8=mmmm????` after the
write, while without BPF_ST it is tracked as `fp-8=0000????`.

This is caused by the way `check_stack_write_fixed_off()` is written.
For register spills it saves either the complete register state or
STACK_ZERO if the register is known to be zero. However, for BPF_ST it
just saves STACK_MISC. Hence patch #1.
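As a toy model of the distinction (grossly simplified relative to the real
verifier state; names below are ours, only STACK_MISC/STACK_ZERO/spill
correspond to actual slot markings):

```c
/* Grossly simplified model of how check_stack_write_fixed_off() marks
 * a stack slot; the real verifier state is much richer. */
enum slot_mark { SLOT_MISC, SLOT_ZERO, SLOT_SPILL };

/* BPF_STX spill of a register whose value the verifier may know. */
static enum slot_mark mark_stx_spill(int value_known, long value)
{
	if (value_known && value == 0)
		return SLOT_ZERO;   /* precise: slot holds zero */
	return value_known ? SLOT_SPILL : SLOT_MISC;
}

/* BPF_ST before patch #1: imm is known, yet only "misc" is recorded. */
static enum slot_mark mark_st_old(long imm)
{
	(void)imm;
	return SLOT_MISC;
}

/* BPF_ST after patch #1: treated like a spill of a known constant. */
static enum slot_mark mark_st_new(long imm)
{
	return mark_stx_spill(1, imm);
}
```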

> > > Regarding -mcpu=v4.
> > > I think we need to add all of our upcoming instructions as a single flag.
> > > Otherwise we'll have -mcpu=v5,v6,v7 and full combinations of them.
> > > -mcpu=v4 could mean:
> > > - ST
> > > - sign extending loads
> > > - sign extend a register
> > > - 32-bit JA
> > > - proper bswap insns: bswap16, bswap32, bswap64
> > > The sign and 32-bit JA we've discussed earlier.
> > > The bswap was on my wish list forever.
> > > The existing TO_LE, TO_BE insns are really odd from compiler pov.
> > > The compiler should translate bswap IR op into proper bswap insn
> > > just like it does on all cpus.
> > > Maybe add SDIV to -mcpu=v4 as well?

There is also BPF_JSET "PC += off if dst & src", which is not currently
emitted by the LLVM backend as far as I can tell.

> > 
> > Right, we should add these insns in llvm17 with -mcpu=v4, so we
> > can keep the number of cpu generations minimum.
> 
> How do you plan to encode the sign-extend load instructions?
> 
> I guess a possibility would be to use one of the available op-mode for
> load instructions that are currently marked as reserved.  For example:
> 
>    IMM  = 0b000
>    ABS  = 0b001
>    IND  = 0b010
>    MEM  = 0b011
>    SEM = 0b100  <- new
> 
> Then we would have the following code fields for sign-extending LDX
> instructions:
> 
>    op-mode:SEM op-size:{W,H,B,DW} op-class:LDX

I second Jose's question about the encoding of the new instructions.

> - sign extend a register

E.g. like this:

BPF_SEXT = 0xb0 
opcode: BPF_SEXT | BPF_X | BPF_ALU
imm32: {8,16,32}                    // to be consistent with BPF_END insn
    or BPF_{B,H,W}                  // to be consistent with LDX/STX

Sign extend an 8, 16, or 32-bit value from src into the 64-bit dst register:
  dst = sext{8,16,32}(src)
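In C terms the intended semantics would be the following (a sketch of the
proposal above, not a ratified encoding):

```c
#include <stdint.h>

/* Proposed register sign-extension: the low 8/16/32 bits of src are
 * sign-extended into the 64-bit dst register (sketch only). */
static int64_t sext8(int64_t src)  { return (int8_t)src;  }
static int64_t sext16(int64_t src) { return (int16_t)src; }
static int64_t sext32(int64_t src) { return (int32_t)src; }
```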

> The sign and 32-bit JA we've discussed earlier.

Could you please share a link to this discussion?

> - proper bswap insns: bswap16, bswap32, bswap64

Could you please elaborate on this?
Current instructions operate on a single register, e.g.:

  dst_reg = htole16(dst_reg).

It looks like this could be reflected in x86 assembly as a single
instruction using either bswap or xchg instructions (see [1] and [2]).
x86/net/bpf_jit_comp.c:1266 can probably be adjusted to use xchg for
the 16-bit case.

On ARM, rev16, rev32 and rev64 allow using the same register as source
and destination ([3]).

[1] https://www.felixcloutier.com/x86/bswap
[2] https://www.felixcloutier.com/x86/xchg
[3] https://developer.arm.com/documentation/ddi0602/2022-06/Base-Instructions/REV16--Reverse-bytes-in-16-bit-halfwords-?lang=en
    https://developer.arm.com/documentation/ddi0602/2022-06/Base-Instructions/REV32--Reverse-bytes-in-32-bit-words-?lang=en
    https://developer.arm.com/documentation/ddi0602/2022-06/Base-Instructions/REV64--Reverse-Bytes--an-alias-of-REV-?lang=en
Yonghong Song Jan. 13, 2023, 7:23 p.m. UTC | #9
On 1/13/23 12:53 AM, Jose E. Marchesi wrote:
> 
>> On 1/12/23 2:27 PM, Alexei Starovoitov wrote:
>>> On Thu, Jan 05, 2023 at 02:07:05PM +0200, Eduard Zingerman wrote:
>>>> On Thu, 2023-01-05 at 11:06 +0100, Jose E. Marchesi wrote:
>>>>>> On Sat, Dec 31, 2022 at 8:31 AM Eduard Zingerman <eddyz87@gmail.com> wrote:
>>>>>>>
>>>>>>> BPF has two documented (non-atomic) memory store instructions:
>>>>>>>
>>>>>>> BPF_STX: *(size *) (dst_reg + off) = src_reg
>>>>>>> BPF_ST : *(size *) (dst_reg + off) = imm32
>>>>>>>
>>>>>>> Currently LLVM BPF back-end does not emit BPF_ST instruction and does
>>>>>>> not allow one to be specified as inline assembly.
>>>>>>>
>>>>>>> Recently I've been exploring ways to port some of the verifier test
>>>>>>> cases from tools/testing/selftests/bpf/verifier/*.c to use inline assembly
>>>>>>> and machinery provided in tools/testing/selftests/bpf/test_loader.c
>>>>>>> (which should hopefully simplify tests maintenance).
>>>>>>> The BPF_ST instruction is popular in these tests: used in 52 of 94 files.
>>>>>>>
>>>>>>> While it is possible to adjust LLVM to only support BPF_ST for inline
>>>>>>> assembly blocks it seems a bit wasteful. This patch-set contains a set
>>>>>>> of changes to verifier necessary in case when LLVM is allowed to
>>>>>>> freely emit BPF_ST instructions (source code is available here [1]).
>>>>>>
>>>>>> Would we gate LLVM's emitting of BPF_ST for C code behind some new
>>>>>> cpu=v4? What is the benefit for compiler to start automatically emit
>>>>>> such instructions? Such thinking about logistics, if there isn't much
>>>>>> benefit, as BPF application owner I wouldn't bother enabling this
>>>>>> behavior risking regressions on old kernels that don't have these
>>>>>> changes.
>>>>>
>>>>> Hmm, GCC happily generates BPF_ST instructions:
>>>>>
>>>>>     $ echo 'int v; void foo () {  v = 666; }' | bpf-unknown-none-gcc -O2 -xc -S -o foo.s -
>>>>>     $ cat foo.s
>>>>>           .file	"<stdin>"
>>>>>           .text
>>>>>           .align	3
>>>>>           .global	foo
>>>>>           .type	foo, @function
>>>>>     foo:
>>>>>           lddw	%r0,v
>>>>>           stw	[%r0+0],666
>>>>>           exit
>>>>>           .size	foo, .-foo
>>>>>           .global	v
>>>>>           .type	v, @object
>>>>>           .lcomm	v,4,4
>>>>>           .ident	"GCC: (GNU) 12.0.0 20211206 (experimental)"
>>>>>
>>>>> Been doing that since October 2019, I think before the cpu versioning
>>>>> mechanism was got in place?
>>>>>
>>>>> We weren't aware this was problematic.  Does the verifier reject such
>>>>> instructions?
>>>>
>>>> Interesting, do BPF selftests generated by GCC pass the same way they
>>>> do if generated by clang?
>>>>
>>>> I had to do the following changes to the verifier to make the
>>>> selftests pass when BPF_ST instruction is allowed for selection:
>>>>
>>>> - patch #1 in this patchset: track values of constants written to
>>>>   stack using BPF_ST. Currently these are tracked imprecisely, unlike
>>>>   the writes using BPF_STX, e.g.:
>>>>
>>>>     fp[-8] = 42;   currently verifier assumes that fp[-8]=mmmmmmmm
>>>>                    after such instruction, where m stands for "misc",
>>>>                    just a note that something is written at fp[-8].
>>>>
>>>>     r1 = 42;       verifier tracks r1=42 after this instruction.
>>>>     fp[-8] = r1;   verifier tracks fp[-8]=42 after this instruction.
>>>>
>>>>   So the patch makes both cases equivalent.
>>>> - patch #3 in this patchset: adjusts verifier.c:convert_ctx_access()
>>>>   to operate on BPF_ST alongside BPF_STX.
>>>>
>>>>   Context parameters for some BPF program types are "fake" data
>>>>   structures. The verifier matches all BPF_STX and BPF_LDX
>>>>   instructions that operate on pointers to such contexts and rewrites
>>>>   these instructions. It might change an offset or add another layer
>>>>   of indirection, etc. E.g. see filter.c:bpf_convert_ctx_access().
>>>>   (This also implies that the verifier forbids writes to non-constant
>>>>   offsets inside such structures).
>>>>
>>>>   So the patch extends this logic to also handle BPF_ST.
>>> The patch 3 is necessary to land before llvm starts generating 'st'
>>> for ctx access.
>>> That's clear, but I'm missing why patch 1 is necessary.
>>> Sure, it's making the verifier understand scalar spills with 'st' and
>>> makes 'st' equivalent to 'stx', but I'm missing why it's necessary.
>>> What kind of programs fail to be verified when llvm starts generating 'st' ?
>>> Regarding -mcpu=v4.
>>> I think we need to add all of our upcoming instructions as a single flag.
>>> Otherwise we'll have -mcpu=v5,v6,v7 and full combinations of them.
>>> -mcpu=v4 could mean:
>>> - ST
>>> - sign extending loads
>>> - sign extend a register
>>> - 32-bit JA
>>> - proper bswap insns: bswap16, bswap32, bswap64
>>> The sign and 32-bit JA we've discussed earlier.
>>> The bswap was on my wish list forever.
>>> The existing TO_LE, TO_BE insns are really odd from compiler pov.
>>> The compiler should translate bswap IR op into proper bswap insn
>>> just like it does on all cpus.
>>> Maybe add SDIV to -mcpu=v4 as well?
>>
>> Right, we should add these insns in llvm17 with -mcpu=v4, so we
>> can keep the number of cpu generations to a minimum.
> 
> How do you plan to encode the sign-extend load instructions?
> 
> I guess a possibility would be to use one of the available op-mode for
> load instructions that are currently marked as reserved.  For example:
> 
>     IMM  = 0b000
>     ABS  = 0b001
>     IND  = 0b010
>     MEM  = 0b011
>     SEM  = 0b100  <- new
> 
> Then we would have the following code fields for sign-extending LDX
> instructions:
> 
>     op-mode:SEM op-size:{W,H,B,DW} op-class:LDX

Right, this is what I plan to do as well. See my incomplete llvm
prototype here:
   https://reviews.llvm.org/D133464
Alexei Starovoitov Jan. 26, 2023, 5:49 a.m. UTC | #10
On 1/13/23 8:47 AM, Eduard Zingerman wrote:
>>>>>         So the patch extends this logic to also handle BPF_ST.
>> The patch 3 is necessary to land before llvm starts generating 'st'
>> for ctx access.
>> That's clear, but I'm missing why patch 1 is necessary.
>> Sure, it's making the verifier understand scalar spills with 'st' and
>> makes 'st' equivalent to 'stx', but I'm missing why it's necessary.
>> What kind of programs fail to be verified when llvm starts generating 'st' ?
> 
> I should have added an example to the summary. There are a few
> test_progs tests that fail w/o this patch, namely atomic_bounds,
> dynptr, for_each, xdp_noinline, xdp_synproxy.
> 
> Here is how atomic_bounds looks:
> 
>    SEC("fentry/bpf_fentry_test1")
>    int BPF_PROG(sub, int x)
>    {
>            int a = 0;
>            int b = __sync_fetch_and_add(&a, 1);
>            /* b is certainly 0 here. Can the verifier tell? */
>            while (b)
>                    continue;
>            return 0;
>    }
> 
>      Compiled w/o BPF_ST                  Compiled with BPF_ST
>    
>    <sub>:                               <sub>:
>      w1 = 0x0
>      *(u32 *)(r10 - 0x4) = r1             *(u32 *)(r10 - 0x4) = 0x0
>      w1 = 0x1                             w1 = 0x1
>      lock *(u32 *)(r10 - 0x4) += r1       lock *(u32 *)(r10 - 0x4) += r1
>      if w1 == 0x0 goto +0x1 <LBB0_2>      if w1 == 0x0 goto +0x1 <LBB0_2>
>    
>    <LBB0_1>:                            <LBB0_1>:
>      goto -0x1 <LBB0_1>                   goto -0x1 <LBB0_1>
>    
>    <LBB0_2>:                            <LBB0_2>:
>      w0 = 0x0                             w0 = 0x0
>      exit                                 exit
> 
> When compiled with BPF_ST and verified w/o the patch #1 verification log
> looks as follows:
> 
>    0: R1=ctx(off=0,imm=0) R10=fp0
>    0: (62) *(u32 *)(r10 -4) = 0          ; R10=fp0 fp-8=mmmm????
>    1: (b4) w1 = 1                        ; R1_w=1
>    2: (c3) r1 = atomic_fetch_add((u32 *)(r10 -4), r1)    ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R10=fp0 fp-8=mmmm????
>    3: (16) if w1 == 0x0 goto pc+1        ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
>    4: (05) goto pc-1
>    4: (05) goto pc-1
>    4: (05) goto pc-1
>    4: (05) goto pc-1
>    infinite loop detected at insn 4
> 
> When compiled w/o BPF_ST and verified w/o the patch #1 verification log
> looks as follows:
> 
>    func#0 @0
>    reg type unsupported for arg#0 function sub#5
>    0: R1=ctx(off=0,imm=0) R10=fp0
>    0: (b4) w1 = 0                        ; R1_w=0
>    1: (63) *(u32 *)(r10 -4) = r1
>    last_idx 1 first_idx 0
>    regs=2 stack=0 before 0: (b4) w1 = 0
>    2: R1_w=0 R10=fp0 fp-8=0000????
>    2: (b4) w1 = 1                        ; R1_w=1
>    ; int b = __sync_fetch_and_add(&a, 1);
>    3: (c3) r1 = atomic_fetch_add((u32 *)(r10 -4), r1)    ; R1_w=P0 R10=fp0 fp-8=mmmm????
>    4: (16) if w1 == 0x0 goto pc+1
>    6: R1_w=P0
>    6: (b4) w0 = 0                        ; R0_w=0
>    7: (95) exit
>    processed 7 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
> 
> The difference comes from the way the zero write to `r10-4` is processed:
> with BPF_ST it is tracked as `fp-8=mmmm????` after the write, without
> BPF_ST it is tracked as `fp-8=0000????` after the write.
> 
> Which is caused by the way `check_stack_write_fixed_off()` is written.
> For register spills it either saves the complete register state, or
> STACK_ZERO if the register is known to be zero. However, for BPF_ST it
> just saves STACK_MISC. Hence, the patch #1.

Thank you for explaining.
Though llvm changes are not ready, let's land the required
kernel changes now.
Please craft asm test cases for the issues you've seen with BPF_ST.