mbox series

[bpf-next,00/24] Support bpf trampoline for s390x

Message ID 20230125213817.1424447-1-iii@linux.ibm.com (mailing list archive)
Headers show
Series Support bpf trampoline for s390x | expand

Message

Ilya Leoshkevich Jan. 25, 2023, 9:37 p.m. UTC
Hi,

This series implements poke, trampoline, kfunc, mixing subprogs and
tailcalls, and fixes a number of tests on s390x.

The following failures still remain:

#52      core_read_macros:FAIL
Uses BPF_PROBE_READ(), shouldn't there be BPF_PROBE_READ_KERNEL()?

#82      get_stack_raw_tp:FAIL
get_stack_print_output:FAIL:user_stack corrupted user stack
Known issue:
We cannot reliably unwind userspace on s390x without DWARF.

#101     ksyms_module:FAIL
address of kernel function bpf_testmod_test_mod_kfunc is out of range
Known issue:
Kernel and modules are too far away from each other on s390x.

#167     sk_assign:FAIL
Uses legacy map definitions in 'maps' section.

#190     stacktrace_build_id:FAIL
Known issue:
We cannot reliably unwind userspace on s390x without DWARF.

#211     test_bpffs:FAIL
iterators.bpf.c is broken on s390x, it uses BPF_CORE_READ(), shouldn't
there be BPF_CORE_READ_KERNEL()?

#218     test_profiler:FAIL
A lot of BPF_PROBE_READ() usages.

#281     xdp_metadata:FAIL
See patch 24.

#284     xdp_synproxy:FAIL
Verifier error:
; value = bpf_tcp_raw_gen_syncookie_ipv4(hdr->ipv4, hdr->tcp,
281: (79) r1 = *(u64 *)(r10 -80)      ; R1_w=pkt(off=14,r=74,imm=0) R10=fp0
282: (bf) r2 = r8                     ; R2_w=pkt(id=5,off=14,r=74,umax=60,var_off=(0x0; 0x3c)) R8=pkt(id=5,off=14,r=74,umax=60,var_off=(0x0; 0x3c))
283: (79) r3 = *(u64 *)(r10 -104)     ; R3_w=scalar(umax=60,var_off=(0x0; 0x3c)) R10=fp0
284: (85) call bpf_tcp_raw_gen_syncookie_ipv4#204
invalid access to packet, off=14 size=0, R2(id=5,off=14,r=74)
R2 offset is outside of the packet

None of these seem to be due to the new changes.

Best regards,
Ilya

Ilya Leoshkevich (24):
  selftests/bpf: Fix liburandom_read.so linker error
  selftests/bpf: Fix symlink creation error
  selftests/bpf: Fix fexit_stress on s390x
  selftests/bpf: Fix trampoline_count on s390x
  selftests/bpf: Fix kfree_skb on s390x
  selftests/bpf: Set errno when urand_spawn() fails
  selftests/bpf: Fix decap_sanity_ns cleanup
  selftests/bpf: Fix verify_pkcs7_sig on s390x
  selftests/bpf: Fix xdp_do_redirect on s390x
  selftests/bpf: Fix cgrp_local_storage on s390x
  selftests/bpf: Check stack_mprotect() return value
  selftests/bpf: Increase SIZEOF_BPF_LOCAL_STORAGE_ELEM on s390x
  selftests/bpf: Add a sign-extension test for kfuncs
  selftests/bpf: Fix test_lsm on s390x
  selftests/bpf: Fix test_xdp_adjust_tail_grow2 on s390x
  selftests/bpf: Fix vmlinux test on s390x
  libbpf: Read usdt arg spec with bpf_probe_read_kernel()
  s390/bpf: Fix a typo in a comment
  s390/bpf: Add expoline to tail calls
  s390/bpf: Implement bpf_arch_text_poke()
  bpf: btf: Add BTF_FMODEL_SIGNED_ARG flag
  s390/bpf: Implement arch_prepare_bpf_trampoline()
  s390/bpf: Implement bpf_jit_supports_subprog_tailcalls()
  s390/bpf: Implement bpf_jit_supports_kfunc_call()

 arch/s390/net/bpf_jit_comp.c                  | 708 +++++++++++++++++-
 include/linux/bpf.h                           |   8 +
 include/linux/btf.h                           |  15 +-
 kernel/bpf/btf.c                              |  16 +-
 net/bpf/test_run.c                            |   9 +
 tools/lib/bpf/usdt.bpf.h                      |  33 +-
 tools/testing/selftests/bpf/Makefile          |   7 +-
 tools/testing/selftests/bpf/netcnt_common.h   |   6 +-
 .../selftests/bpf/prog_tests/bpf_cookie.c     |   6 +-
 .../bpf/prog_tests/cgrp_local_storage.c       |   2 +-
 .../selftests/bpf/prog_tests/decap_sanity.c   |   2 +-
 .../selftests/bpf/prog_tests/fexit_stress.c   |   6 +-
 .../selftests/bpf/prog_tests/kfree_skb.c      |   2 +-
 .../selftests/bpf/prog_tests/kfunc_call.c     |   1 +
 .../selftests/bpf/prog_tests/test_lsm.c       |   3 +-
 .../bpf/prog_tests/trampoline_count.c         |   4 +
 tools/testing/selftests/bpf/prog_tests/usdt.c |   1 +
 .../bpf/prog_tests/verify_pkcs7_sig.c         |   9 +
 .../bpf/prog_tests/xdp_adjust_tail.c          |   7 +-
 .../bpf/prog_tests/xdp_do_redirect.c          |   4 +
 .../selftests/bpf/progs/kfunc_call_test.c     |  18 +
 tools/testing/selftests/bpf/progs/lsm.c       |   7 +-
 .../bpf/progs/test_verify_pkcs7_sig.c         |  12 +-
 .../selftests/bpf/progs/test_vmlinux.c        |   4 +-
 .../bpf/progs/test_xdp_adjust_tail_grow.c     |   8 +-
 25 files changed, 816 insertions(+), 82 deletions(-)

Comments

Andrii Nakryiko Jan. 26, 2023, 12:45 a.m. UTC | #1
On Wed, Jan 25, 2023 at 1:39 PM Ilya Leoshkevich <iii@linux.ibm.com> wrote:
>
> Hi,
>
> This series implements poke, trampoline, kfunc, mixing subprogs and
> tailcalls, and fixes a number of tests on s390x.
>
> The following failures still remain:
>
> #52      core_read_macros:FAIL
> Uses BPF_PROBE_READ(), shouldn't there be BPF_PROBE_READ_KERNEL()?

BPF_PROBE_READ(), similarly to BPF_CORE_READ() both use
bpf_probe_read_kernel() internally, as it's most common use case. We
have separate BPF_PROBE_READ_USER() and BPF_CORE_READ_USER() for when
bpf_probe_read_user() has to be used.

>
> #82      get_stack_raw_tp:FAIL
> get_stack_print_output:FAIL:user_stack corrupted user stack
> Known issue:
> We cannot reliably unwind userspace on s390x without DWARF.

like in principle, or frame pointers (or some equivalent) needs to be
configured for this to work?

Asking also in the context of [0], where s390x was excluded. If there
is actually a way to enable frame pointer-based stack unwinding on
s390x, would be nice to hear from an expert.

  [0] https://pagure.io/fesco/issue/2923

>
> #101     ksyms_module:FAIL
> address of kernel function bpf_testmod_test_mod_kfunc is out of range
> Known issue:
> Kernel and modules are too far away from each other on s390x.
>
> #167     sk_assign:FAIL
> Uses legacy map definitions in 'maps' section.

Hm.. assuming new enough iproute2, new-style .maps definition should
be supported, right? Let's convert map definition?

>
> #190     stacktrace_build_id:FAIL
> Known issue:
> We cannot reliably unwind userspace on s390x without DWARF.
>
> #211     test_bpffs:FAIL
> iterators.bpf.c is broken on s390x, it uses BPF_CORE_READ(), shouldn't
> there be BPF_CORE_READ_KERNEL()?

BPF_CORE_READ() is that, so must be something else

>
> #218     test_profiler:FAIL
> A lot of BPF_PROBE_READ() usages.

ditto, something else

>
> #281     xdp_metadata:FAIL
> See patch 24.
>
> #284     xdp_synproxy:FAIL
> Verifier error:
> ; value = bpf_tcp_raw_gen_syncookie_ipv4(hdr->ipv4, hdr->tcp,
> 281: (79) r1 = *(u64 *)(r10 -80)      ; R1_w=pkt(off=14,r=74,imm=0) R10=fp0
> 282: (bf) r2 = r8                     ; R2_w=pkt(id=5,off=14,r=74,umax=60,var_off=(0x0; 0x3c)) R8=pkt(id=5,off=14,r=74,umax=60,var_off=(0x0; 0x3c))
> 283: (79) r3 = *(u64 *)(r10 -104)     ; R3_w=scalar(umax=60,var_off=(0x0; 0x3c)) R10=fp0
> 284: (85) call bpf_tcp_raw_gen_syncookie_ipv4#204
> invalid access to packet, off=14 size=0, R2(id=5,off=14,r=74)
> R2 offset is outside of the packet

third arg to bpf_tcp_raw_gen_syncookie_ipv4() is defined as
ARG_CONST_SIZE, so is required to be strictly positive, which doesn't
seem to be "proven" to verifier. Please provided bigger log, it might
another instance of needing to sprinkle barrier_var() around.

And maybe thinking about using ARG_CONST_SIZE_OR_ZERO instead of ARG_CONST_SIZE.

>
> None of these seem to be due to the new changes.
>
> Best regards,
> Ilya
>
> Ilya Leoshkevich (24):
>   selftests/bpf: Fix liburandom_read.so linker error
>   selftests/bpf: Fix symlink creation error
>   selftests/bpf: Fix fexit_stress on s390x
>   selftests/bpf: Fix trampoline_count on s390x
>   selftests/bpf: Fix kfree_skb on s390x
>   selftests/bpf: Set errno when urand_spawn() fails
>   selftests/bpf: Fix decap_sanity_ns cleanup
>   selftests/bpf: Fix verify_pkcs7_sig on s390x
>   selftests/bpf: Fix xdp_do_redirect on s390x
>   selftests/bpf: Fix cgrp_local_storage on s390x
>   selftests/bpf: Check stack_mprotect() return value
>   selftests/bpf: Increase SIZEOF_BPF_LOCAL_STORAGE_ELEM on s390x
>   selftests/bpf: Add a sign-extension test for kfuncs
>   selftests/bpf: Fix test_lsm on s390x
>   selftests/bpf: Fix test_xdp_adjust_tail_grow2 on s390x
>   selftests/bpf: Fix vmlinux test on s390x
>   libbpf: Read usdt arg spec with bpf_probe_read_kernel()
>   s390/bpf: Fix a typo in a comment
>   s390/bpf: Add expoline to tail calls
>   s390/bpf: Implement bpf_arch_text_poke()
>   bpf: btf: Add BTF_FMODEL_SIGNED_ARG flag
>   s390/bpf: Implement arch_prepare_bpf_trampoline()
>   s390/bpf: Implement bpf_jit_supports_subprog_tailcalls()
>   s390/bpf: Implement bpf_jit_supports_kfunc_call()
>
>  arch/s390/net/bpf_jit_comp.c                  | 708 +++++++++++++++++-
>  include/linux/bpf.h                           |   8 +
>  include/linux/btf.h                           |  15 +-
>  kernel/bpf/btf.c                              |  16 +-
>  net/bpf/test_run.c                            |   9 +
>  tools/lib/bpf/usdt.bpf.h                      |  33 +-
>  tools/testing/selftests/bpf/Makefile          |   7 +-
>  tools/testing/selftests/bpf/netcnt_common.h   |   6 +-
>  .../selftests/bpf/prog_tests/bpf_cookie.c     |   6 +-
>  .../bpf/prog_tests/cgrp_local_storage.c       |   2 +-
>  .../selftests/bpf/prog_tests/decap_sanity.c   |   2 +-
>  .../selftests/bpf/prog_tests/fexit_stress.c   |   6 +-
>  .../selftests/bpf/prog_tests/kfree_skb.c      |   2 +-
>  .../selftests/bpf/prog_tests/kfunc_call.c     |   1 +
>  .../selftests/bpf/prog_tests/test_lsm.c       |   3 +-
>  .../bpf/prog_tests/trampoline_count.c         |   4 +
>  tools/testing/selftests/bpf/prog_tests/usdt.c |   1 +
>  .../bpf/prog_tests/verify_pkcs7_sig.c         |   9 +
>  .../bpf/prog_tests/xdp_adjust_tail.c          |   7 +-
>  .../bpf/prog_tests/xdp_do_redirect.c          |   4 +
>  .../selftests/bpf/progs/kfunc_call_test.c     |  18 +
>  tools/testing/selftests/bpf/progs/lsm.c       |   7 +-
>  .../bpf/progs/test_verify_pkcs7_sig.c         |  12 +-
>  .../selftests/bpf/progs/test_vmlinux.c        |   4 +-
>  .../bpf/progs/test_xdp_adjust_tail_grow.c     |   8 +-
>  25 files changed, 816 insertions(+), 82 deletions(-)
>
> --
> 2.39.1
>
Ilya Leoshkevich Jan. 27, 2023, 4:51 p.m. UTC | #2
On Wed, 2023-01-25 at 16:45 -0800, Andrii Nakryiko wrote:
> On Wed, Jan 25, 2023 at 1:39 PM Ilya Leoshkevich <iii@linux.ibm.com>
> wrote:
> > 
> > Hi,
> > 
> > This series implements poke, trampoline, kfunc, mixing subprogs and
> > tailcalls, and fixes a number of tests on s390x.
> > 
> > The following failures still remain:
> > 
> > #52      core_read_macros:FAIL
> > Uses BPF_PROBE_READ(), shouldn't there be BPF_PROBE_READ_KERNEL()?
> 
> BPF_PROBE_READ(), similarly to BPF_CORE_READ() both use
> bpf_probe_read_kernel() internally, as it's most common use case. We
> have separate BPF_PROBE_READ_USER() and BPF_CORE_READ_USER() for when
> bpf_probe_read_user() has to be used.

At least purely from the code perspective, BPF_PROBE_READ() seems to
delegate to bpf_probe_read(). The following therefore helps with this
test:

--- a/tools/lib/bpf/bpf_core_read.h
+++ b/tools/lib/bpf/bpf_core_read.h
@@ -364,7 +364,7 @@ enum bpf_enum_value_kind {
 
 /* Non-CO-RE variant of BPF_CORE_READ_INTO() */
 #define BPF_PROBE_READ_INTO(dst, src, a, ...) ({                     
\
-       ___core_read(bpf_probe_read, bpf_probe_read,                  
\
+       ___core_read(bpf_probe_read_kernel, bpf_probe_read_kernel,    
\
                     dst, (src), a, ##__VA_ARGS__)                    
\
 })
 
@@ -400,7 +400,7 @@ enum bpf_enum_value_kind {
 
 /* Non-CO-RE variant of BPF_CORE_READ_STR_INTO() */
 #define BPF_PROBE_READ_STR_INTO(dst, src, a, ...) ({                 
\
-       ___core_read(bpf_probe_read_str, bpf_probe_read,              
\
+       ___core_read(bpf_probe_read_kernel_str, bpf_probe_read_kernel,
\
                     dst, (src), a, ##__VA_ARGS__)                    
\
 })

but I'm not sure if there are backward compatibility concerns, or if
libbpf is supposed to rewrite this when
!ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE.

> > 
> > #82      get_stack_raw_tp:FAIL
> > get_stack_print_output:FAIL:user_stack corrupted user stack
> > Known issue:
> > We cannot reliably unwind userspace on s390x without DWARF.
> 
> like in principle, or frame pointers (or some equivalent) needs to be
> configured for this to work?
> 
> Asking also in the context of [0], where s390x was excluded. If there
> is actually a way to enable frame pointer-based stack unwinding on
> s390x, would be nice to hear from an expert.
> 
>   [0] https://pagure.io/fesco/issue/2923

For DWARFless unwinding we have -mbackchain (not to be confused with
-fno-omit-frame-pointer, which we also have, but which just hurts
performance without providing tangible benefits).
-mbackchain has a few problems though:

- It's not atomic. Here is a typical prologue with -mbackchain:

        1: stmg    %r11,%r15,88(%r15)  # save non-volatile registers
        2: lgr     %r14,%r15           # %r14 = sp
        3: lay     %r15,-160(%r15)     # sp -= 160
        4: stg     %r14,0(%r15)        # *(void **)sp = %r14

  The invariant here is that *(void **)%r15 is always a pointer to the
  next frame. This means that if we unwind from (4), we are totally
  broken. This does not happen if we unwind from any other instruction,
  but still.

- Unwinding from (1)-(3) is not particularly good either. PSW points to
  the callee, but R15 points to the caller's frame.

- Unwinding leaf functions is like the previous problem, but worse:
  they often do not establish a stack frame at all, so PSW and R15 are
  out of sync for the entire duration of the call.

Therefore .eh_frame-based unwinding is preferred, since it covers all
these corner cases and is therefore reliable. From what I understand,
adding .eh_frame unwinding to the kernel is not desirable. In an
internal discussion we had another idea though: would it be possible to
have an eBPF program that does .eh_frame parsing and unwinding? I
understand that it can be technically challenging at the moment, but
the end result would not be exploitable by crafting malicious
.eh_frame sections, won't loop endlessly, will have good performance,
etc.

> > #101     ksyms_module:FAIL
> > address of kernel function bpf_testmod_test_mod_kfunc is out of
> > range
> > Known issue:
> > Kernel and modules are too far away from each other on s390x.
> > 
> > #167     sk_assign:FAIL
> > Uses legacy map definitions in 'maps' section.
> 
> Hm.. assuming new enough iproute2, new-style .maps definition should
> be supported, right? Let's convert map definition?

Yep, that worked. Will include in v2.

> > #190     stacktrace_build_id:FAIL
> > Known issue:
> > We cannot reliably unwind userspace on s390x without DWARF.
> > 
> > #211     test_bpffs:FAIL
> > iterators.bpf.c is broken on s390x, it uses BPF_CORE_READ(),
> > shouldn't
> > there be BPF_CORE_READ_KERNEL()?
> 
> BPF_CORE_READ() is that, so must be something else
> 
> > 
> > #218     test_profiler:FAIL
> > A lot of BPF_PROBE_READ() usages.
> 
> ditto, something else
> 
> > 
> > #281     xdp_metadata:FAIL
> > See patch 24.
> > 
> > #284     xdp_synproxy:FAIL
> > Verifier error:
> > ; value = bpf_tcp_raw_gen_syncookie_ipv4(hdr->ipv4, hdr->tcp,
> > 281: (79) r1 = *(u64 *)(r10 -80)      ; R1_w=pkt(off=14,r=74,imm=0)
> > R10=fp0
> > 282: (bf) r2 = r8                     ;
> > R2_w=pkt(id=5,off=14,r=74,umax=60,var_off=(0x0; 0x3c))
> > R8=pkt(id=5,off=14,r=74,umax=60,var_off=(0x0; 0x3c))
> > 283: (79) r3 = *(u64 *)(r10 -104)     ;
> > R3_w=scalar(umax=60,var_off=(0x0; 0x3c)) R10=fp0
> > 284: (85) call bpf_tcp_raw_gen_syncookie_ipv4#204
> > invalid access to packet, off=14 size=0, R2(id=5,off=14,r=74)
> > R2 offset is outside of the packet
> 
> third arg to bpf_tcp_raw_gen_syncookie_ipv4() is defined as
> ARG_CONST_SIZE, so is required to be strictly positive, which doesn't
> seem to be "proven" to verifier. Please provided bigger log, it might
> another instance of needing to sprinkle barrier_var() around.
> 
> And maybe thinking about using ARG_CONST_SIZE_OR_ZERO instead of
> ARG_CONST_SIZE.

Here is the full log:

https://gist.github.com/iii-i/8e20100c33ab6f0dffb5e6e51d1330e8

Apparently we do indeed lose a constraint established by
if (hdr->tcp_len < sizeof(*hdr->tcp)). But the naive

--- a/tools/testing/selftests/bpf/progs/xdp_synproxy_kern.c
+++ b/tools/testing/selftests/bpf/progs/xdp_synproxy_kern.c
@@ -401,6 +401,7 @@ static __always_inline int tcp_dissect(void *data,
void *data_end,
        hdr->tcp_len = hdr->tcp->doff * 4;
        if (hdr->tcp_len < sizeof(*hdr->tcp))
                return XDP_DROP;
+       barrier_var(hdr->tcp_len);
 
        return XDP_TX;
 }
@@ -791,6 +792,7 @@ static __always_inline int syncookie_part2(void
*ctx, void *data, void *data_end
        hdr->tcp_len = hdr->tcp->doff * 4;
        if (hdr->tcp_len < sizeof(*hdr->tcp))
                return XDP_ABORTED;
+       barrier_var(hdr->tcp_len);
 
        return hdr->tcp->syn ? syncookie_handle_syn(hdr, ctx, data,
data_end, xdp) :
                               syncookie_handle_ack(hdr);

does not help.

> 
> > 
> > None of these seem to be due to the new changes.
> > 
> > Best regards,
> > Ilya
> > 
> > Ilya Leoshkevich (24):
> >   selftests/bpf: Fix liburandom_read.so linker error
> >   selftests/bpf: Fix symlink creation error
> >   selftests/bpf: Fix fexit_stress on s390x
> >   selftests/bpf: Fix trampoline_count on s390x
> >   selftests/bpf: Fix kfree_skb on s390x
> >   selftests/bpf: Set errno when urand_spawn() fails
> >   selftests/bpf: Fix decap_sanity_ns cleanup
> >   selftests/bpf: Fix verify_pkcs7_sig on s390x
> >   selftests/bpf: Fix xdp_do_redirect on s390x
> >   selftests/bpf: Fix cgrp_local_storage on s390x
> >   selftests/bpf: Check stack_mprotect() return value
> >   selftests/bpf: Increase SIZEOF_BPF_LOCAL_STORAGE_ELEM on s390x
> >   selftests/bpf: Add a sign-extension test for kfuncs
> >   selftests/bpf: Fix test_lsm on s390x
> >   selftests/bpf: Fix test_xdp_adjust_tail_grow2 on s390x
> >   selftests/bpf: Fix vmlinux test on s390x
> >   libbpf: Read usdt arg spec with bpf_probe_read_kernel()
> >   s390/bpf: Fix a typo in a comment
> >   s390/bpf: Add expoline to tail calls
> >   s390/bpf: Implement bpf_arch_text_poke()
> >   bpf: btf: Add BTF_FMODEL_SIGNED_ARG flag
> >   s390/bpf: Implement arch_prepare_bpf_trampoline()
> >   s390/bpf: Implement bpf_jit_supports_subprog_tailcalls()
> >   s390/bpf: Implement bpf_jit_supports_kfunc_call()
> > 
> >  arch/s390/net/bpf_jit_comp.c                  | 708
> > +++++++++++++++++-
> >  include/linux/bpf.h                           |   8 +
> >  include/linux/btf.h                           |  15 +-
> >  kernel/bpf/btf.c                              |  16 +-
> >  net/bpf/test_run.c                            |   9 +
> >  tools/lib/bpf/usdt.bpf.h                      |  33 +-
> >  tools/testing/selftests/bpf/Makefile          |   7 +-
> >  tools/testing/selftests/bpf/netcnt_common.h   |   6 +-
> >  .../selftests/bpf/prog_tests/bpf_cookie.c     |   6 +-
> >  .../bpf/prog_tests/cgrp_local_storage.c       |   2 +-
> >  .../selftests/bpf/prog_tests/decap_sanity.c   |   2 +-
> >  .../selftests/bpf/prog_tests/fexit_stress.c   |   6 +-
> >  .../selftests/bpf/prog_tests/kfree_skb.c      |   2 +-
> >  .../selftests/bpf/prog_tests/kfunc_call.c     |   1 +
> >  .../selftests/bpf/prog_tests/test_lsm.c       |   3 +-
> >  .../bpf/prog_tests/trampoline_count.c         |   4 +
> >  tools/testing/selftests/bpf/prog_tests/usdt.c |   1 +
> >  .../bpf/prog_tests/verify_pkcs7_sig.c         |   9 +
> >  .../bpf/prog_tests/xdp_adjust_tail.c          |   7 +-
> >  .../bpf/prog_tests/xdp_do_redirect.c          |   4 +
> >  .../selftests/bpf/progs/kfunc_call_test.c     |  18 +
> >  tools/testing/selftests/bpf/progs/lsm.c       |   7 +-
> >  .../bpf/progs/test_verify_pkcs7_sig.c         |  12 +-
> >  .../selftests/bpf/progs/test_vmlinux.c        |   4 +-
> >  .../bpf/progs/test_xdp_adjust_tail_grow.c     |   8 +-
> >  25 files changed, 816 insertions(+), 82 deletions(-)
> > 
> > --
> > 2.39.1
> >
Andrii Nakryiko Jan. 27, 2023, 5:24 p.m. UTC | #3
On Fri, Jan 27, 2023 at 8:51 AM Ilya Leoshkevich <iii@linux.ibm.com> wrote:
>
> On Wed, 2023-01-25 at 16:45 -0800, Andrii Nakryiko wrote:
> > On Wed, Jan 25, 2023 at 1:39 PM Ilya Leoshkevich <iii@linux.ibm.com>
> > wrote:
> > >
> > > Hi,
> > >
> > > This series implements poke, trampoline, kfunc, mixing subprogs and
> > > tailcalls, and fixes a number of tests on s390x.
> > >
> > > The following failures still remain:
> > >
> > > #52      core_read_macros:FAIL
> > > Uses BPF_PROBE_READ(), shouldn't there be BPF_PROBE_READ_KERNEL()?
> >
> > BPF_PROBE_READ(), similarly to BPF_CORE_READ() both use
> > bpf_probe_read_kernel() internally, as it's most common use case. We
> > have separate BPF_PROBE_READ_USER() and BPF_CORE_READ_USER() for when
> > bpf_probe_read_user() has to be used.
>
> At least purely from the code perspective, BPF_PROBE_READ() seems to
> delegate to bpf_probe_read(). The following therefore helps with this
> test:
>
> --- a/tools/lib/bpf/bpf_core_read.h
> +++ b/tools/lib/bpf/bpf_core_read.h
> @@ -364,7 +364,7 @@ enum bpf_enum_value_kind {
>
>  /* Non-CO-RE variant of BPF_CORE_READ_INTO() */
>  #define BPF_PROBE_READ_INTO(dst, src, a, ...) ({
> \
> -       ___core_read(bpf_probe_read, bpf_probe_read,
> \
> +       ___core_read(bpf_probe_read_kernel, bpf_probe_read_kernel,
> \
>                      dst, (src), a, ##__VA_ARGS__)
> \
>  })
>
> @@ -400,7 +400,7 @@ enum bpf_enum_value_kind {
>
>  /* Non-CO-RE variant of BPF_CORE_READ_STR_INTO() */
>  #define BPF_PROBE_READ_STR_INTO(dst, src, a, ...) ({
> \
> -       ___core_read(bpf_probe_read_str, bpf_probe_read,
> \
> +       ___core_read(bpf_probe_read_kernel_str, bpf_probe_read_kernel,
> \
>                      dst, (src), a, ##__VA_ARGS__)
> \
>  })
>
> but I'm not sure if there are backward compatibility concerns, or if
> libbpf is supposed to rewrite this when
> !ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE.

Oh, this is just a bug, it was an omission when we converted
BPF_CORE_READ to bpf_probe_read_kernel. If you look at bpf_core_read
definition, it is using bpf_probe_read_kernel, which is also the
intent for BPF_PROBE_READ. Let's fix this.

And there is no backwards compat concerns because libbpf will
automatically convert calls to bpf_probe_read_{kernel,user}[_str] to
bpf_probe_read[_str] if host kernel doesn't yet support kernel or user
specific variants.

>
> > >
> > > #82      get_stack_raw_tp:FAIL
> > > get_stack_print_output:FAIL:user_stack corrupted user stack
> > > Known issue:
> > > We cannot reliably unwind userspace on s390x without DWARF.
> >
> > like in principle, or frame pointers (or some equivalent) needs to be
> > configured for this to work?
> >
> > Asking also in the context of [0], where s390x was excluded. If there
> > is actually a way to enable frame pointer-based stack unwinding on
> > s390x, would be nice to hear from an expert.
> >
> >   [0] https://pagure.io/fesco/issue/2923
>
> For DWARFless unwinding we have -mbackchain (not to be confused with
> -fno-omit-frame-pointer, which we also have, but which just hurts
> performance without providing tangible benefits).
> -mbackchain has a few problems though:
>
> - It's not atomic. Here is a typical prologue with -mbackchain:
>
>         1: stmg    %r11,%r15,88(%r15)  # save non-volatile registers
>         2: lgr     %r14,%r15           # %r14 = sp
>         3: lay     %r15,-160(%r15)     # sp -= 160
>         4: stg     %r14,0(%r15)        # *(void **)sp = %r14
>
>   The invariant here is that *(void **)%r15 is always a pointer to the
>   next frame. This means that if we unwind from (4), we are totally
>   broken. This does not happen if we unwind from any other instruction,
>   but still.
>
> - Unwinding from (1)-(3) is not particularly good either. PSW points to
>   the callee, but R15 points to the caller's frame.
>
> - Unwinding leaf functions is like the previous problem, but worse:
>   they often do not establish a stack frame at all, so PSW and R15 are
>   out of sync for the entire duration of the call.
>
> Therefore .eh_frame-based unwinding is preferred, since it covers all
> these corner cases and is therefore reliable. From what I understand,
> adding .eh_frame unwinding to the kernel is not desirable. In an
> internal discussion we had another idea though: would it be possible to
> have an eBPF program that does .eh_frame parsing and unwinding? I
> understand that it can be technically challenging at the moment, but
> the end result would not be exploitable by crafting malicious
> .eh_frame sections, won't loop endlessly, will have good performance,
> etc.

Thanks for details. This was all discussed at length in Fedora
-fno-omit-frame-pointer discussion I linked above, so no real need to
go over this again. .eh_frame-based unwinding on BPF side is possible,
but only for processes that you knew about (and preprocessed) before
you started profiling session. Pre-processing is memory and
cpu-intensive operation on busy systems, and they will miss any
processes started during profiling. So as a general approach for
system-wide profiling it leaves a lot to be desired.

Should we enable -mbackchain in our CI for s390x to be able to capture
stack traces (even if on some instructions they might be incomplete or
outright broken)?

>
> > > #101     ksyms_module:FAIL
> > > address of kernel function bpf_testmod_test_mod_kfunc is out of
> > > range
> > > Known issue:
> > > Kernel and modules are too far away from each other on s390x.
> > >
> > > #167     sk_assign:FAIL
> > > Uses legacy map definitions in 'maps' section.
> >
> > Hm.. assuming new enough iproute2, new-style .maps definition should
> > be supported, right? Let's convert map definition?
>
> Yep, that worked. Will include in v2.

Nice.

>
> > > #190     stacktrace_build_id:FAIL
> > > Known issue:
> > > We cannot reliably unwind userspace on s390x without DWARF.
> > >
> > > #211     test_bpffs:FAIL
> > > iterators.bpf.c is broken on s390x, it uses BPF_CORE_READ(),
> > > shouldn't
> > > there be BPF_CORE_READ_KERNEL()?
> >
> > BPF_CORE_READ() is that, so must be something else
> >
> > >
> > > #218     test_profiler:FAIL
> > > A lot of BPF_PROBE_READ() usages.
> >
> > ditto, something else
> >
> > >
> > > #281     xdp_metadata:FAIL
> > > See patch 24.
> > >
> > > #284     xdp_synproxy:FAIL
> > > Verifier error:
> > > ; value = bpf_tcp_raw_gen_syncookie_ipv4(hdr->ipv4, hdr->tcp,
> > > 281: (79) r1 = *(u64 *)(r10 -80)      ; R1_w=pkt(off=14,r=74,imm=0)
> > > R10=fp0
> > > 282: (bf) r2 = r8                     ;
> > > R2_w=pkt(id=5,off=14,r=74,umax=60,var_off=(0x0; 0x3c))
> > > R8=pkt(id=5,off=14,r=74,umax=60,var_off=(0x0; 0x3c))
> > > 283: (79) r3 = *(u64 *)(r10 -104)     ;
> > > R3_w=scalar(umax=60,var_off=(0x0; 0x3c)) R10=fp0
> > > 284: (85) call bpf_tcp_raw_gen_syncookie_ipv4#204
> > > invalid access to packet, off=14 size=0, R2(id=5,off=14,r=74)
> > > R2 offset is outside of the packet
> >
> > third arg to bpf_tcp_raw_gen_syncookie_ipv4() is defined as
> > ARG_CONST_SIZE, so is required to be strictly positive, which doesn't
> > seem to be "proven" to verifier. Please provided bigger log, it might
> > another instance of needing to sprinkle barrier_var() around.
> >
> > And maybe thinking about using ARG_CONST_SIZE_OR_ZERO instead of
> > ARG_CONST_SIZE.
>
> Here is the full log:
>
> https://gist.github.com/iii-i/8e20100c33ab6f0dffb5e6e51d1330e8
>
> Apparently we do indeed lose a constraint established by
> if (hdr->tcp_len < sizeof(*hdr->tcp)). But the naive

The test is too big and unfamiliar for me to figure this out. And the
problem is not upper bound, but lower bound. hdr->tcp_len is not
proven to be strictly greater than zero, which is what verifier
complains about. Not sure how it works on other arches right now.


But I see that bpf_csum_diff defines size arguments as
ARG_CONST_SIZE_OR_ZERO while bpf_tcp_raw_gen_syncookie_ipv4 has
ARG_CONST_SIZE. I generally found ARG_CONST_SIZE way too problematic
in practice, I'd say we should change it to ARG_CONST_SIZE_OR_ZERO.

>
> --- a/tools/testing/selftests/bpf/progs/xdp_synproxy_kern.c
> +++ b/tools/testing/selftests/bpf/progs/xdp_synproxy_kern.c
> @@ -401,6 +401,7 @@ static __always_inline int tcp_dissect(void *data,
> void *data_end,
>         hdr->tcp_len = hdr->tcp->doff * 4;
>         if (hdr->tcp_len < sizeof(*hdr->tcp))
>                 return XDP_DROP;
> +       barrier_var(hdr->tcp_len);
>
>         return XDP_TX;
>  }
> @@ -791,6 +792,7 @@ static __always_inline int syncookie_part2(void
> *ctx, void *data, void *data_end
>         hdr->tcp_len = hdr->tcp->doff * 4;
>         if (hdr->tcp_len < sizeof(*hdr->tcp))
>                 return XDP_ABORTED;
> +       barrier_var(hdr->tcp_len);
>
>         return hdr->tcp->syn ? syncookie_handle_syn(hdr, ctx, data,
> data_end, xdp) :
>                                syncookie_handle_ack(hdr);
>
> does not help.
>
> >
> > >
> > > None of these seem to be due to the new changes.
> > >
> > > Best regards,
> > > Ilya
> > >
> > > Ilya Leoshkevich (24):
> > >   selftests/bpf: Fix liburandom_read.so linker error
> > >   selftests/bpf: Fix symlink creation error
> > >   selftests/bpf: Fix fexit_stress on s390x
> > >   selftests/bpf: Fix trampoline_count on s390x
> > >   selftests/bpf: Fix kfree_skb on s390x
> > >   selftests/bpf: Set errno when urand_spawn() fails
> > >   selftests/bpf: Fix decap_sanity_ns cleanup
> > >   selftests/bpf: Fix verify_pkcs7_sig on s390x
> > >   selftests/bpf: Fix xdp_do_redirect on s390x
> > >   selftests/bpf: Fix cgrp_local_storage on s390x
> > >   selftests/bpf: Check stack_mprotect() return value
> > >   selftests/bpf: Increase SIZEOF_BPF_LOCAL_STORAGE_ELEM on s390x
> > >   selftests/bpf: Add a sign-extension test for kfuncs
> > >   selftests/bpf: Fix test_lsm on s390x
> > >   selftests/bpf: Fix test_xdp_adjust_tail_grow2 on s390x
> > >   selftests/bpf: Fix vmlinux test on s390x
> > >   libbpf: Read usdt arg spec with bpf_probe_read_kernel()
> > >   s390/bpf: Fix a typo in a comment
> > >   s390/bpf: Add expoline to tail calls
> > >   s390/bpf: Implement bpf_arch_text_poke()
> > >   bpf: btf: Add BTF_FMODEL_SIGNED_ARG flag
> > >   s390/bpf: Implement arch_prepare_bpf_trampoline()
> > >   s390/bpf: Implement bpf_jit_supports_subprog_tailcalls()
> > >   s390/bpf: Implement bpf_jit_supports_kfunc_call()
> > >
> > >  arch/s390/net/bpf_jit_comp.c                  | 708
> > > +++++++++++++++++-
> > >  include/linux/bpf.h                           |   8 +
> > >  include/linux/btf.h                           |  15 +-
> > >  kernel/bpf/btf.c                              |  16 +-
> > >  net/bpf/test_run.c                            |   9 +
> > >  tools/lib/bpf/usdt.bpf.h                      |  33 +-
> > >  tools/testing/selftests/bpf/Makefile          |   7 +-
> > >  tools/testing/selftests/bpf/netcnt_common.h   |   6 +-
> > >  .../selftests/bpf/prog_tests/bpf_cookie.c     |   6 +-
> > >  .../bpf/prog_tests/cgrp_local_storage.c       |   2 +-
> > >  .../selftests/bpf/prog_tests/decap_sanity.c   |   2 +-
> > >  .../selftests/bpf/prog_tests/fexit_stress.c   |   6 +-
> > >  .../selftests/bpf/prog_tests/kfree_skb.c      |   2 +-
> > >  .../selftests/bpf/prog_tests/kfunc_call.c     |   1 +
> > >  .../selftests/bpf/prog_tests/test_lsm.c       |   3 +-
> > >  .../bpf/prog_tests/trampoline_count.c         |   4 +
> > >  tools/testing/selftests/bpf/prog_tests/usdt.c |   1 +
> > >  .../bpf/prog_tests/verify_pkcs7_sig.c         |   9 +
> > >  .../bpf/prog_tests/xdp_adjust_tail.c          |   7 +-
> > >  .../bpf/prog_tests/xdp_do_redirect.c          |   4 +
> > >  .../selftests/bpf/progs/kfunc_call_test.c     |  18 +
> > >  tools/testing/selftests/bpf/progs/lsm.c       |   7 +-
> > >  .../bpf/progs/test_verify_pkcs7_sig.c         |  12 +-
> > >  .../selftests/bpf/progs/test_vmlinux.c        |   4 +-
> > >  .../bpf/progs/test_xdp_adjust_tail_grow.c     |   8 +-
> > >  25 files changed, 816 insertions(+), 82 deletions(-)
> > >
> > > --
> > > 2.39.1
> > >
>
Ilya Leoshkevich Jan. 27, 2023, 10:50 p.m. UTC | #4
On Fri, 2023-01-27 at 09:24 -0800, Andrii Nakryiko wrote:
> On Fri, Jan 27, 2023 at 8:51 AM Ilya Leoshkevich <iii@linux.ibm.com>
> wrote:
> > 
> > On Wed, 2023-01-25 at 16:45 -0800, Andrii Nakryiko wrote:
> > > On Wed, Jan 25, 2023 at 1:39 PM Ilya Leoshkevich
> > > <iii@linux.ibm.com>
> > > wrote:
> > > > 
> > > > Hi,
> > > > 
> > > > This series implements poke, trampoline, kfunc, mixing subprogs
> > > > and
> > > > tailcalls, and fixes a number of tests on s390x.
> > > > 
> > > > The following failures still remain:
> > > > 
> > > > #52      core_read_macros:FAIL
> > > > Uses BPF_PROBE_READ(), shouldn't there be
> > > > BPF_PROBE_READ_KERNEL()?
> > > 
> > > BPF_PROBE_READ(), similarly to BPF_CORE_READ() both use
> > > bpf_probe_read_kernel() internally, as it's most common use case.
> > > We
> > > have separate BPF_PROBE_READ_USER() and BPF_CORE_READ_USER() for
> > > when
> > > bpf_probe_read_user() has to be used.
> > 
> > At least purely from the code perspective, BPF_PROBE_READ() seems
> > to
> > delegate to bpf_probe_read(). The following therefore helps with
> > this
> > test:
> > 
> > --- a/tools/lib/bpf/bpf_core_read.h
> > +++ b/tools/lib/bpf/bpf_core_read.h
> > @@ -364,7 +364,7 @@ enum bpf_enum_value_kind {
> > 
> >  /* Non-CO-RE variant of BPF_CORE_READ_INTO() */
> >  #define BPF_PROBE_READ_INTO(dst, src, a, ...) ({
> > \
> > -       ___core_read(bpf_probe_read, bpf_probe_read,
> > \
> > +       ___core_read(bpf_probe_read_kernel, bpf_probe_read_kernel,
> > \
> >                      dst, (src), a, ##__VA_ARGS__)
> > \
> >  })
> > 
> > @@ -400,7 +400,7 @@ enum bpf_enum_value_kind {
> > 
> >  /* Non-CO-RE variant of BPF_CORE_READ_STR_INTO() */
> >  #define BPF_PROBE_READ_STR_INTO(dst, src, a, ...) ({
> > \
> > -       ___core_read(bpf_probe_read_str, bpf_probe_read,
> > \
> > +       ___core_read(bpf_probe_read_kernel_str,
> > bpf_probe_read_kernel,
> > \
> >                      dst, (src), a, ##__VA_ARGS__)
> > \
> >  })
> > 
> > but I'm not sure if there are backward compatibility concerns, or
> > if
> > libbpf is supposed to rewrite this when
> > !ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE.
> 
> Oh, this is just a bug, it was an omission when we converted
> BPF_CORE_READ to bpf_probe_read_kernel. If you look at bpf_core_read
> definition, it is using bpf_probe_read_kernel, which is also the
> intent for BPF_PROBE_READ. Let's fix this.
> 
> And there is no backwards compat concerns because libbpf will
> automatically convert calls to bpf_probe_read_{kernel,user}[_str] to
> bpf_probe_read[_str] if host kernel doesn't yet support kernel or
> user
> specific variants.

Thanks for confirming! I will include the fix in v2.

> > > > #82      get_stack_raw_tp:FAIL
> > > > get_stack_print_output:FAIL:user_stack corrupted user stack
> > > > Known issue:
> > > > We cannot reliably unwind userspace on s390x without DWARF.
> > > 
> > > like in principle, or frame pointers (or some equivalent) needs
> > > to be
> > > configured for this to work?
> > > 
> > > Asking also in the context of [0], where s390x was excluded. If
> > > there
> > > is actually a way to enable frame pointer-based stack unwinding
> > > on
> > > s390x, would be nice to hear from an expert.
> > > 
> > >   [0] https://pagure.io/fesco/issue/2923
> > 
> > For DWARFless unwinding we have -mbackchain (not to be confused
> > with
> > -fno-omit-frame-pointer, which we also have, but which just hurts
> > performance without providing tangible benefits).
> > -mbackchain has a few problems though:
> > 
> > - It's not atomic. Here is a typical prologue with -mbackchain:
> > 
> >         1: stmg    %r11,%r15,88(%r15)  # save non-volatile
> > registers
> >         2: lgr     %r14,%r15           # %r14 = sp
> >         3: lay     %r15,-160(%r15)     # sp -= 160
> >         4: stg     %r14,0(%r15)        # *(void **)sp = %r14
> > 
> >   The invariant here is that *(void **)%r15 is always a pointer to
> > the
> >   next frame. This means that if we unwind from (4), we are totally
> >   broken. This does not happen if we unwind from any other
> > instruction,
> >   but still.
> > 
> > - Unwinding from (1)-(3) is not particularly good either. PSW
> > points to
> >   the callee, but R15 points to the caller's frame.
> > 
> > - Unwinding leaf functions is like the previous problem, but worse:
> >   they often do not establish a stack frame at all, so PSW and R15
> > are
> >   out of sync for the entire duration of the call.
> > 
> > Therefore .eh_frame-based unwinding is preferred, since it covers
> > all
> > these corner cases and is therefore reliable. From what I
> > understand,
> > adding .eh_frame unwinding to the kernel is not desirable. In an
> > internal discussion we had another idea though: would it be
> > possible to
> > have an eBPF program that does .eh_frame parsing and unwinding? I
> > understand that it can be technically challenging at the moment,
> > but
> > the end result would not be exploitable by crafting malicious
> > .eh_frame sections, won't loop endlessly, will have good
> > performance,
> > etc.
> 
> Thanks for details. This was all discussed at length in Fedora
> -fno-omit-frame-pointer discussion I linked above, so no real need to
> go over this again. .eh_frame-based unwinding on BPF side is
> possible,
> but only for processes that you knew about (and preprocessed) before
> you started profiling session. Pre-processing is memory and
> cpu-intensive operation on busy systems, and they will miss any
> processes started during profiling. So as a general approach for
> system-wide profiling it leaves a lot to be desired.

Thanks for the explanation, I'll read the thread and come back if I
get some new ideas not listed here.

> Should we enable -mbackchain in our CI for s390x to be able to
> capture
> stack traces (even if on some instructions they might be incomplete
> or
> outright broken)?

Let's do it, I don't have anything against this.

[...]
> > > 
> 

> > Here is the full log:
> > 
> > https://gist.github.com/iii-i/8e20100c33ab6f0dffb5e6e51d1330e8
> > 
> > Apparently we do indeed lose a constraint established by
> > if (hdr->tcp_len < sizeof(*hdr->tcp)). But the naive
> 
> The test is too big and unfamiliar for me to figure this out. And the
> problem is not upper bound, but lower bound. hdr->tcp_len is not
> proven to be strictly greater than zero, which is what verifier
> complains about. Not sure how it works on other arches right now.
> 
> 
> But I see that bpf_csum_diff defines size arguments as
> ARG_CONST_SIZE_OR_ZERO while bpf_tcp_raw_gen_syncookie_ipv4 has
> ARG_CONST_SIZE. I generally found ARG_CONST_SIZE way too problematic
> in practice, I'd say we should change it to ARG_CONST_SIZE_OR_ZERO.

Yes, this helps, and doesn't seem to introduce issues, since the 
minimum length is enforced inside this function anyway. I will include
the change for bpf_tcp_raw_gen_syncookie_ipv{4,6} in v2; I guess some
other functions may benefit from this as well, but this is outside the
scope of this series.

[...]