pull-request: bpf-next 2023-12-18

Message ID	20231219000520.34178-1-alexei.starovoitov@gmail.com (mailing list archive)
State	Accepted
Commit	c49b292d031e385abf764ded32cd953c77e73f2d
Headers	show Received: from mail-pl1-f169.google.com (mail-pl1-f169.google.com [209.85.214.169]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E38657F; Tue, 19 Dec 2023 00:05:25 +0000 (UTC) From: Alexei Starovoitov <alexei.starovoitov@gmail.com> To: kuba@kernel.org Cc: davem@davemloft.net, edumazet@google.com, pabeni@redhat.com, daniel@iogearbox.net, andrii@kernel.org, torvalds@linuxfoundation.org, peterz@infradead.org, brauner@kernel.org, netdev@vger.kernel.org, bpf@vger.kernel.org, kernel-team@fb.com Subject: pull-request: bpf-next 2023-12-18 Date: Mon, 18 Dec 2023 16:05:20 -0800 Message-Id: <20231219000520.34178-1-alexei.starovoitov@gmail.com> Precedence: bulk MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit
Series	pull-request: bpf-next 2023-12-18 \| expand pull-request: bpf-next 2023-12-18

Context	Check	Description
netdev/tree_selection	success	Pull request for net-next, async
netdev/build_32bit	fail	Errors and warnings before: 7908 this patch: 7917
netdev/build_clang	fail	Errors and warnings before: 1330 this patch: 1330
netdev/verify_signedoff	success	Signed-off-by tag matches author and committer
netdev/verify_fixes	success	Fixes tag looks correct
netdev/build_allmodconfig_warn	fail	Errors and warnings before: 8444 this patch: 8453
netdev/build_clang_rust	success	No Rust files in patch. Skipping build

Alexei Starovoitov Dec. 19, 2023, 12:05 a.m. UTC

Hi David, hi Jakub, hi Paolo, hi Eric,

The following pull-request contains BPF updates for your *net-next* tree.

This PR is larger than usual and contains changes in various parts of the kernel.

The main changes are:

1) Fix kCFI bugs in BPF, from Peter Zijlstra.
End result: all forms of indirect calls from BPF into kernel and from kernel into BPF
work with CFI enabled. This allows BPF to work with CONFIG_FINEIBT=y.

2) Introduce BPF token object, from Andrii Nakryiko.
It adds an ability to delegate a subset of BPF features from privileged daemon
(e.g., systemd) through special mount options for userns-bound BPF FS to a
trusted unprivileged application. The design accommodates suggestions from
Christian Brauner and Paul Moore.
Example:
$ sudo mkdir -p /sys/fs/bpf/token
$ sudo mount -t bpf bpffs /sys/fs/bpf/token \
             -o delegate_cmds=prog_load:MAP_CREATE \
             -o delegate_progs=kprobe \
             -o delegate_attachs=xdp

3) Various verifier improvements and fixes, from Andrii Nakryiko, Andrei Matei.
 - Complete precision tracking support for register spills
 - Fix verification of possibly-zero-sized stack accesses
 - Fix access to uninit stack slots
 - Track aligned STACK_ZERO cases as imprecise spilled registers.
   It improves the verifier "instructions processed" metric from single digit
   to 50-60% for some programs.
 - Fix verifier retval logic

4) Support for VLAN tag in XDP hints, from Larysa Zaremba.

5) Allocate BPF trampoline via bpf_prog_pack mechanism, from Song Liu.
End result: better memory utilization and lower I$ miss for calls to BPF
via BPF trampoline.

6) Fix race between BPF prog accessing inner map and parallel delete, from Hou Tao.

7) Add bpf_xdp_get_xfrm_state() kfunc, from Daniel Xu.
It allows BPF interact with IPSEC infra. The intent is to support software RSS
(via XDP) for the upcoming ipsec pcpu work. Experiments on AWS demonstrate
single tunnel pcpu ipsec reaching line rate on 100G ENA nics.

8) Expand bpf_cgrp_storage to support cgroup1 non-attach, from Yafang Shao.

9) BPF file verification via fsverity, from Song Liu.
It allows BPF progs get fsverity digest.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git tags/for-netdev

Thanks a lot!

Also thanks to reporters, reviewers and testers of commits in this pull-request:

Abaci Robot, Alan Maguire, Andrii Nakryiko, Björn Töpel, Christian 
Brauner, Daniel Borkmann, Eduard Zingerman, Eric Biggers, Hao Sun, Ilya 
Leoshkevich, Jesper Dangaard Brouer, Jiri Olsa, John Fastabend, KP 
Singh, Maciej Fijalkowski, Magnus Karlsson, Maxim Mikityanskiy, Mike 
Frysinger, Minh Le Hoang, Paul Moore, Peter Zijlstra, Sami Tolvanen, 
Shung-Hsi Yu, Song Liu, Stanislav Fomichev, Steffen Klassert, Tao Lyu, 
Tariq Toukan, Tejun Heo, Xingwei Lee, Yafang Shao, Yonghong Song

----------------------------------------------------------------

The following changes since commit 15bc81212f593fbd7bda787598418b931842dc14:

  octeon_ep: set backpressure watermark for RX queues (2023-12-01 12:14:32 +0000)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git tags/for-netdev

for you to fetch changes up to 8e432e6197cef6250dfd6fdffd41c06613c874ca:

  bpf: Ensure precise is reset to false in __mark_reg_const_zero() (2023-12-18 23:54:21 +0100)

----------------------------------------------------------------
netdev

----------------------------------------------------------------
Aleksander Lobakin (1):
      net, xdp: Allow metadata > 32

Alexei Starovoitov (17):
      Merge branch 'bpf-file-verification-with-lsm-and-fsverity'
      Merge branch 'bpf-verifier-retval-logic-fixes'
      Merge branch 'bpf-fix-the-release-of-inner-map'
      Merge branch 'complete-bpf-verifier-precision-tracking-support-for-register-spills'
      Merge branch 'bpf-token-and-bpf-fs-based-delegation'
      Merge branch 'allocate-bpf-trampoline-on-bpf_prog_pack'
      Merge branch 'bpf-fixes-for-maybe_wait_bpf_programs'
      Merge branch 'add-new-bpf_cpumask_weight-kfunc'
      Merge branch 'bpf-token-support-in-libbpf-s-bpf-object'
      Merge branch 'xdp-metadata-via-kfuncs-for-ice-vlan-hint'
      Merge branch 'bpf-use-gfp_kernel-in-bpf_event_entry_gen'
      Merge branch 'add-bpf_xdp_get_xfrm_state-kfunc'
      Merge branch 'bpf-fs-mount-options-parsing-follow-ups'
      x86/cfi,bpf: Fix bpf_exception_cb() signature
      Merge branch 'x86-cfi-bpf-fix-cfi-vs-ebpf'
      selftests/bpf: Temporarily disable dummy_struct_ops test on s390
      s390/bpf: Fix indirect trampoline generation

Andrei Matei (8):
      bpf: Minor logging improvement
      bpf: Fix verification of indirect var-off stack access
      bpf: Add verifier regression test for previous patch
      bpf: Guard stack limits against 32bit overflow
      bpf: Add some comments to stack representation
      bpf: Fix accesses to uninit stack slots
      bpf: Minor cleanup around stack bounds
      bpf: Comment on check_mem_size_reg

Andrii Nakryiko (63):
      bpf: rearrange bpf_func_state fields to save a bit of memory
      bpf: provide correct register name for exception callback retval check
      bpf: enforce precision of R0 on callback return
      bpf: enforce exact retval range on subprog/callback exit
      selftests/bpf: add selftest validating callback result is enforced
      bpf: enforce precise retval range on program exit
      bpf: unify async callback and program retval checks
      bpf: enforce precision of R0 on program/async callback return
      selftests/bpf: validate async callback return value check correctness
      selftests/bpf: adjust global_func15 test to validate prog exit precision
      bpf: simplify tnum output if a fully known constant
      bpf: support non-r10 register spill/fill to/from stack in precision tracking
      selftests/bpf: add stack access precision test
      bpf: fix check for attempt to corrupt spilled pointer
      bpf: preserve STACK_ZERO slots on partial reg spills
      selftests/bpf: validate STACK_ZERO is preserved on subreg spill
      bpf: preserve constant zero when doing partial register restore
      selftests/bpf: validate zero preservation for sub-slot loads
      bpf: track aligned STACK_ZERO cases as imprecise spilled registers
      selftests/bpf: validate precision logic in partial_stack_load_preserves_zeros
      bpf: align CAP_NET_ADMIN checks with bpf_capable() approach
      bpf: add BPF token delegation mount options to BPF FS
      bpf: introduce BPF token object
      bpf: add BPF token support to BPF_MAP_CREATE command
      bpf: add BPF token support to BPF_BTF_LOAD command
      bpf: add BPF token support to BPF_PROG_LOAD command
      bpf: take into account BPF token when fetching helper protos
      bpf: consistently use BPF token throughout BPF verifier logic
      bpf,lsm: refactor bpf_prog_alloc/bpf_prog_free LSM hooks
      bpf,lsm: refactor bpf_map_alloc/bpf_map_free LSM hooks
      bpf,lsm: add BPF token LSM hooks
      libbpf: add bpf_token_create() API
      libbpf: add BPF token support to bpf_map_create() API
      libbpf: add BPF token support to bpf_btf_load() API
      libbpf: add BPF token support to bpf_prog_load() API
      selftests/bpf: add BPF token-enabled tests
      bpf,selinux: allocate bpf_security_struct per BPF token
      bpf: rename MAX_BPF_LINK_TYPE into __MAX_BPF_LINK_TYPE for consistency
      Merge branch 'bpf-fix-verification-of-indirect-var-off-stack-access'
      Merge branch 'bpf-fix-accesses-to-uninit-stack-slots'
      selftests/bpf: fix timer/test_bad_ret subtest on test_progs-cpuv4 flavor
      bpf: handle fake register spill to stack with BPF_ST_MEM instruction
      selftests/bpf: validate fake register spill/fill precision backtracking logic
      selftests/bpf: validate eliminated global subprog is not freplaceable
      bpf: log PTR_TO_MEM memory size in verifier log
      bpf: emit more dynptr information in verifier log
      bpf: tidy up exception callback management a bit
      bpf: use bitfields for simple per-subprog bool flags
      selftests/bpf: fix compiler warnings in RELEASE=1 mode
      bpf: fail BPF_TOKEN_CREATE if no delegation option was set on BPF FS
      libbpf: split feature detectors definitions from cached results
      libbpf: further decouple feature checking logic from bpf_object
      libbpf: move feature detection code into its own file
      libbpf: wire up token_fd into feature probing logic
      libbpf: wire up BPF token support at BPF object level
      selftests/bpf: add BPF object loading tests with explicit token passing
      selftests/bpf: add tests for BPF object load with implicit token
      libbpf: support BPF token path setting through LIBBPF_BPF_TOKEN_PATH envvar
      selftests/bpf: add tests for LIBBPF_BPF_TOKEN_PATH envvar
      bpf: support symbolic BPF FS delegation mount options
      selftests/bpf: utilize string values for delegate_xxx mount options
      Merge branch 'bpf-add-check-for-negative-uprobe-multi-offset'
      bpf: Ensure precise is reset to false in __mark_reg_const_zero()

Colin Ian King (1):
      selftests/bpf: Fix spelling mistake "get_signaure_size" -> "get_signature_size"

Daniel Xu (9):
      libbpf: Add BPF_CORE_WRITE_BITFIELD() macro
      bpf: selftests: test_loader: Support __btf_path() annotation
      bpf: selftests: Add verifier tests for CO-RE bitfield writes
      bpf: xfrm: Add bpf_xdp_get_xfrm_state() kfunc
      bpf: selftests: test_tunnel: Setup fresh topology for each subtest
      bpf: selftests: test_tunnel: Use vmlinux.h declarations
      bpf: selftests: Move xfrm tunnel test to test_progs
      bpf: xfrm: Add selftest for bpf_xdp_get_xfrm_state()
      bpf: xdp: Register generic_kfunc_set with XDP programs

Dave Marchevsky (1):
      selftests/bpf: Test bpf_kptr_xchg stashing of bpf_rb_root

David Vernet (3):
      bpf: Load vmlinux btf for any struct_ops map
      bpf: Add bpf_cpumask_weight() kfunc
      selftests/bpf: Add test for bpf_cpumask_weight() kfunc

Hou Tao (21):
      bpf: Check rcu_read_lock_trace_held() before calling bpf map helpers
      bpf: Add map and need_defer parameters to .map_fd_put_ptr()
      bpf: Set need_defer as false when clearing fd array during map free
      bpf: Defer the free of inner map when necessary
      bpf: Optimize the free of inner map
      selftests/bpf: Add test cases for inner map
      selftests/bpf: Test outer map update operations in syscall program
      bpf: Remove unnecessary wait from bpf_map_copy_value()
      bpf: Call maybe_wait_bpf_programs() only once for generic_map_update_batch()
      bpf: Add missed maybe_wait_bpf_programs() for htab of maps
      bpf: Only call maybe_wait_bpf_programs() when map operation succeeds
      bpf: Set uattr->batch.count as zero before batched update or deletion
      bpf: Update the comments in maybe_wait_bpf_programs()
      bpf: Reduce the scope of rcu_read_lock when updating fd map
      bpf: Use GFP_KERNEL in bpf_event_entry_gen()
      bpf: Limit the number of uprobes when attaching program to multiple uprobes
      bpf: Limit the number of kprobes when attaching program to multiple kprobes
      selftests/bpf: Add test for abnormal cnt during multi-uprobe attachment
      selftests/bpf: Don't use libbpf_get_error() in kprobe_multi_test
      selftests/bpf: Add test for abnormal cnt during multi-kprobe attachment
      selftests/bpf: Test the release of map btf

Jeroen van Ingen Schenau (1):
      selftests/bpf: Fix erroneous bitmask operation

Jie Jiang (1):
      bpf: Support uid and gid when mounting bpffs

Jiri Olsa (2):
      bpf: Fail uprobe multi link with negative offset
      selftests/bpf: Add more uprobe multi fail tests

Larysa Zaremba (18):
      selftests/bpf: Increase invalid metadata size
      ice: make RX hash reading code more reusable
      ice: make RX HW timestamp reading code more reusable
      ice: Make ptype internal to descriptor info processing
      ice: Introduce ice_xdp_buff
      ice: Support HW timestamp hint
      ice: Support RX hash XDP hint
      ice: Support XDP hints in AF_XDP ZC mode
      xdp: Add VLAN tag hint
      ice: Implement VLAN tag hint
      ice: use VLAN proto from ring packet context in skb path
      veth: Implement VLAN tag XDP hint
      net: make vlan_get_tag() return -ENODATA instead of -EINVAL
      mlx5: implement VLAN tag XDP hint
      selftests/bpf: Allow VLAN packets in xdp_hw_metadata
      selftests/bpf: Add flags and VLAN hint to xdp_hw_metadata
      selftests/bpf: Add AF_INET packet generation to xdp_metadata
      selftests/bpf: Check VLAN tag and proto in xdp_metadata

Maciej Fijalkowski (1):
      xsk: add functions to fill control buffer

Manu Bretelle (1):
      selftests/bpf: Fixes tests for filesystem kfuncs

Martin KaFai Lau (1):
      Merge branch 'bpf: Expand bpf_cgrp_storage to support cgroup1 non-attach case'

Matt Bobrowski (1):
      bpf: add small subset of SECURITY_PATH hooks to BPF sleepable_lsm_hooks list

Peter Zijlstra (6):
      cfi: Flip headers
      x86/cfi,bpf: Fix BPF JIT call
      x86/cfi,bpf: Fix bpf_callback_t CFI
      x86/cfi,bpf: Fix bpf_struct_ops CFI
      cfi: Add CFI_NOSEAL()
      bpf: Fix dtor CFI

Randy Dunlap (1):
      net, xdp: Correct grammar

Sergei Trofimovich (1):
      libbpf: Add pr_warn() for EINVAL cases in linker_sanity_check_elf

Song Liu (13):
      bpf: Add kfunc bpf_get_file_xattr
      bpf, fsverity: Add kfunc bpf_get_fsverity_digest
      Documentation/bpf: Add documentation for filesystem kfuncs
      selftests/bpf: Sort config in alphabetic order
      selftests/bpf: Add tests for filesystem kfuncs
      selftests/bpf: Add test that uses fsverity and xattr to sign a file
      bpf: Let bpf_prog_pack_free handle any pointer
      bpf: Adjust argument names of arch_prepare_bpf_trampoline()
      bpf: Add helpers for trampoline image management
      bpf, x86: Adjust arch_prepare_bpf_trampoline return value
      bpf: Add arch_bpf_trampoline_size()
      bpf: Use arch_bpf_trampoline_size
      x86, bpf: Use bpf_prog_pack for bpf trampoline

Stanislav Fomichev (2):
      xsk: Add missing SPDX to AF_XDP TX metadata documentation
      selftests/bpf: Make sure we trigger metadata kfuncs for dst 8080

Tiezhu Yang (1):
      test_bpf: Rename second ALU64_SMOD_X to ALU64_SMOD_K

Tushar Vyavahare (1):
      selftests/xsk: Fix for SEND_RECEIVE_UNALIGNED test

Yafang Shao (3):
      bpf: Enable bpf_cgrp_storage for cgroup1 non-attach case
      selftests/bpf: Add a new cgroup helper open_classid()
      selftests/bpf: Add selftests for cgroup1 local storage

Yang Li (1):
      bpf: Remove unused backtrack_state helper functions

YiFei Zhu (1):
      selftests/bpf: Relax time_tai test for equal timestamps in tai_forward

Yonghong Song (2):
      bpf: Fix a race condition between btf_put() and map_free()
      selftests/bpf: Remove flaky test_btf_id test

 Documentation/bpf/cpumasks.rst                     |    2 +-
 Documentation/bpf/fs_kfuncs.rst                    |   21 +
 Documentation/bpf/index.rst                        |    1 +
 Documentation/netlink/specs/netdev.yaml            |    4 +
 Documentation/networking/xdp-rx-metadata.rst       |    8 +-
 Documentation/networking/xsk-tx-metadata.rst       |    2 +
 arch/arm64/net/bpf_jit_comp.c                      |   55 +-
 arch/riscv/include/asm/cfi.h                       |    3 +-
 arch/riscv/kernel/cfi.c                            |    2 +-
 arch/riscv/net/bpf_jit_comp64.c                    |   25 +-
 arch/s390/net/bpf_jit_comp.c                       |   59 +-
 arch/x86/include/asm/cfi.h                         |  126 ++-
 arch/x86/kernel/alternative.c                      |   87 +-
 arch/x86/kernel/cfi.c                              |    4 +-
 arch/x86/net/bpf_jit_comp.c                        |  264 ++++-
 drivers/media/rc/bpf-lirc.c                        |    2 +-
 drivers/net/ethernet/intel/ice/ice.h               |    2 +
 drivers/net/ethernet/intel/ice/ice_base.c          |   15 +
 drivers/net/ethernet/intel/ice/ice_lan_tx_rx.h     |  412 ++++----
 drivers/net/ethernet/intel/ice/ice_main.c          |   21 +
 drivers/net/ethernet/intel/ice/ice_ptp.c           |   22 +-
 drivers/net/ethernet/intel/ice/ice_ptp.h           |   16 +-
 drivers/net/ethernet/intel/ice/ice_txrx.c          |   19 +-
 drivers/net/ethernet/intel/ice/ice_txrx.h          |   32 +-
 drivers/net/ethernet/intel/ice/ice_txrx_lib.c      |  207 +++-
 drivers/net/ethernet/intel/ice/ice_txrx_lib.h      |   18 +-
 drivers/net/ethernet/intel/ice/ice_xsk.c           |   17 +-
 drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c   |   15 +
 drivers/net/veth.c                                 |   19 +
 fs/verity/fsverity_private.h                       |   10 +
 fs/verity/init.c                                   |    1 +
 fs/verity/measure.c                                |   84 ++
 include/asm-generic/Kbuild                         |    1 +
 include/asm-generic/cfi.h                          |    5 +
 include/linux/bpf.h                                |  143 ++-
 include/linux/bpf_verifier.h                       |   66 +-
 include/linux/cfi.h                                |   12 +
 include/linux/filter.h                             |    4 +-
 include/linux/if_vlan.h                            |    4 +-
 include/linux/lsm_hook_defs.h                      |   15 +-
 include/linux/mlx5/device.h                        |    2 +-
 include/linux/security.h                           |   43 +-
 include/linux/skbuff.h                             |   13 +-
 include/net/xdp.h                                  |   20 +-
 include/net/xdp_sock_drv.h                         |   17 +
 include/net/xfrm.h                                 |    9 +
 include/net/xsk_buff_pool.h                        |    2 +
 include/uapi/linux/bpf.h                           |   46 +-
 include/uapi/linux/netdev.h                        |    3 +
 kernel/bpf/Makefile                                |    2 +-
 kernel/bpf/arraymap.c                              |   37 +-
 kernel/bpf/bpf_cgrp_storage.c                      |    6 +-
 kernel/bpf/bpf_lsm.c                               |   27 +-
 kernel/bpf/bpf_struct_ops.c                        |   35 +-
 kernel/bpf/btf.c                                   |   11 +-
 kernel/bpf/cgroup.c                                |    6 +-
 kernel/bpf/core.c                                  |   53 +-
 kernel/bpf/cpumask.c                               |   20 +-
 kernel/bpf/dispatcher.c                            |    7 +-
 kernel/bpf/hashtab.c                               |   12 +-
 kernel/bpf/helpers.c                               |   38 +-
 kernel/bpf/inode.c                                 |  326 ++++++-
 kernel/bpf/log.c                                   |   42 +-
 kernel/bpf/map_in_map.c                            |   17 +-
 kernel/bpf/map_in_map.h                            |    2 +-
 kernel/bpf/syscall.c                               |  294 ++++--
 kernel/bpf/tnum.c                                  |    6 -
 kernel/bpf/token.c                                 |  271 +++++
 kernel/bpf/trampoline.c                            |  101 +-
 kernel/bpf/verifier.c                              |  558 ++++++-----
 kernel/trace/bpf_trace.c                           |   84 +-
 lib/test_bpf.c                                     |    2 +-
 net/bpf/bpf_dummy_struct_ops.c                     |   38 +-
 net/bpf/test_run.c                                 |   15 +-
 net/core/filter.c                                  |   36 +-
 net/core/xdp.c                                     |   33 +
 net/ipv4/bpf_tcp_ca.c                              |   71 +-
 net/netfilter/nf_bpf_link.c                        |    2 +-
 net/xdp/xsk_buff_pool.c                            |   12 +
 net/xfrm/Makefile                                  |    1 +
 net/xfrm/xfrm_policy.c                             |    2 +
 net/xfrm/xfrm_state_bpf.c                          |  134 +++
 security/security.c                                |  101 +-
 security/selinux/hooks.c                           |   47 +-
 tools/include/uapi/linux/bpf.h                     |   46 +-
 tools/include/uapi/linux/netdev.h                  |    3 +
 tools/lib/bpf/Build                                |    2 +-
 tools/lib/bpf/bpf.c                                |   37 +-
 tools/lib/bpf/bpf.h                                |   35 +-
 tools/lib/bpf/bpf_core_read.h                      |   32 +
 tools/lib/bpf/btf.c                                |    7 +-
 tools/lib/bpf/elf.c                                |    2 -
 tools/lib/bpf/features.c                           |  478 +++++++++
 tools/lib/bpf/libbpf.c                             |  584 +++--------
 tools/lib/bpf/libbpf.h                             |   37 +-
 tools/lib/bpf/libbpf.map                           |    1 +
 tools/lib/bpf/libbpf_internal.h                    |   36 +-
 tools/lib/bpf/libbpf_probes.c                      |    8 +-
 tools/lib/bpf/linker.c                             |   24 +-
 tools/lib/bpf/str_error.h                          |    3 +
 tools/net/ynl/generated/netdev-user.c              |    1 +
 tools/testing/selftests/bpf/bpf_kfuncs.h           |   10 +
 tools/testing/selftests/bpf/cgroup_helpers.c       |   16 +
 tools/testing/selftests/bpf/cgroup_helpers.h       |    1 +
 tools/testing/selftests/bpf/config                 |    3 +-
 tools/testing/selftests/bpf/prog_tests/btf.c       |    5 -
 .../selftests/bpf/prog_tests/cgrp_local_storage.c  |   98 +-
 tools/testing/selftests/bpf/prog_tests/cpumask.c   |    1 +
 tools/testing/selftests/bpf/prog_tests/fs_kfuncs.c |  142 +++
 .../bpf/prog_tests/global_func_dead_code.c         |   60 ++
 .../selftests/bpf/prog_tests/kprobe_multi_test.c   |   31 +-
 .../selftests/bpf/prog_tests/libbpf_probes.c       |    4 +
 .../testing/selftests/bpf/prog_tests/libbpf_str.c  |    8 +-
 .../selftests/bpf/prog_tests/local_kptr_stash.c    |   23 +
 tools/testing/selftests/bpf/prog_tests/map_btf.c   |   98 ++
 .../testing/selftests/bpf/prog_tests/map_in_map.c  |  141 +++
 tools/testing/selftests/bpf/prog_tests/syscall.c   |   30 +-
 .../testing/selftests/bpf/prog_tests/test_tunnel.c |  162 ++-
 tools/testing/selftests/bpf/prog_tests/time_tai.c  |    2 +-
 tools/testing/selftests/bpf/prog_tests/token.c     | 1031 ++++++++++++++++++++
 .../selftests/bpf/prog_tests/uprobe_multi_test.c   |  175 +++-
 tools/testing/selftests/bpf/prog_tests/verifier.c  |    2 +
 .../selftests/bpf/prog_tests/verify_pkcs7_sig.c    |  165 +++-
 .../bpf/prog_tests/xdp_context_test_run.c          |    4 +-
 .../selftests/bpf/prog_tests/xdp_metadata.c        |  132 ++-
 .../selftests/bpf/progs/access_map_in_map.c        |   93 ++
 tools/testing/selftests/bpf/progs/bpf_misc.h       |    1 +
 .../testing/selftests/bpf/progs/bpf_tracing_net.h  |    1 +
 .../selftests/bpf/progs/cgrp_ls_recursion.c        |   84 +-
 .../selftests/bpf/progs/cgrp_ls_sleepable.c        |   61 +-
 tools/testing/selftests/bpf/progs/cgrp_ls_tp_btf.c |   82 +-
 tools/testing/selftests/bpf/progs/cpumask_common.h |    1 +
 .../testing/selftests/bpf/progs/cpumask_success.c  |   43 +
 .../selftests/bpf/progs/exceptions_assert.c        |    2 +-
 .../testing/selftests/bpf/progs/exceptions_fail.c  |    2 +-
 .../bpf/progs/freplace_dead_global_func.c          |   11 +
 tools/testing/selftests/bpf/progs/iters.c          |    2 +-
 .../testing/selftests/bpf/progs/local_kptr_stash.c |   53 +
 tools/testing/selftests/bpf/progs/map_in_map_btf.c |   73 ++
 tools/testing/selftests/bpf/progs/normal_map_btf.c |   56 ++
 tools/testing/selftests/bpf/progs/priv_map.c       |   13 +
 tools/testing/selftests/bpf/progs/priv_prog.c      |   13 +
 tools/testing/selftests/bpf/progs/syscall.c        |   96 +-
 tools/testing/selftests/bpf/progs/test_fsverity.c  |   48 +
 tools/testing/selftests/bpf/progs/test_get_xattr.c |   37 +
 .../selftests/bpf/progs/test_global_func15.c       |   34 +-
 .../selftests/bpf/progs/test_global_func16.c       |    2 +-
 .../selftests/bpf/progs/test_sig_in_xattr.c        |   83 ++
 .../testing/selftests/bpf/progs/test_tunnel_kern.c |  138 +--
 .../selftests/bpf/progs/test_verify_pkcs7_sig.c    |    8 +-
 tools/testing/selftests/bpf/progs/timer_failure.c  |   37 +-
 .../selftests/bpf/progs/user_ringbuf_fail.c        |    2 +-
 .../selftests/bpf/progs/verifier_basic_stack.c     |    8 +-
 .../selftests/bpf/progs/verifier_bitfield_write.c  |  100 ++
 .../bpf/progs/verifier_cgroup_inv_retcode.c        |    8 +-
 .../bpf/progs/verifier_direct_packet_access.c      |    2 +-
 .../selftests/bpf/progs/verifier_global_subprogs.c |   15 +-
 .../testing/selftests/bpf/progs/verifier_int_ptr.c |    7 +-
 .../bpf/progs/verifier_netfilter_retcode.c         |    2 +-
 .../selftests/bpf/progs/verifier_raw_stack.c       |    5 +-
 .../selftests/bpf/progs/verifier_spill_fill.c      |  287 ++++++
 .../selftests/bpf/progs/verifier_stack_ptr.c       |    4 +-
 .../bpf/progs/verifier_subprog_precision.c         |  137 ++-
 .../testing/selftests/bpf/progs/verifier_var_off.c |   91 +-
 .../testing/selftests/bpf/progs/xdp_hw_metadata.c  |   38 +-
 tools/testing/selftests/bpf/progs/xdp_metadata.c   |   36 +-
 .../selftests/bpf/progs/xdp_synproxy_kern.c        |    4 +-
 tools/testing/selftests/bpf/test_loader.c          |    7 +
 tools/testing/selftests/bpf/test_tunnel.sh         |   92 --
 tools/testing/selftests/bpf/testing_helpers.h      |    3 +
 .../selftests/bpf/verifier/atomic_cmpxchg.c        |   11 -
 tools/testing/selftests/bpf/verifier/calls.c       |    4 +-
 tools/testing/selftests/bpf/verifier/precise.c     |   38 +-
 tools/testing/selftests/bpf/verify_sig_setup.sh    |   25 +
 tools/testing/selftests/bpf/veristat.c             |    2 +-
 tools/testing/selftests/bpf/xdp_hw_metadata.c      |   36 +-
 tools/testing/selftests/bpf/xdp_metadata.h         |   34 +-
 tools/testing/selftests/bpf/xskxceiver.c           |   25 +-
 178 files changed, 8408 insertions(+), 1798 deletions(-)
 create mode 100644 Documentation/bpf/fs_kfuncs.rst
 create mode 100644 include/asm-generic/cfi.h
 create mode 100644 kernel/bpf/token.c
 create mode 100644 net/xfrm/xfrm_state_bpf.c
 create mode 100644 tools/lib/bpf/features.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/fs_kfuncs.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/global_func_dead_code.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/map_btf.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/map_in_map.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/token.c
 create mode 100644 tools/testing/selftests/bpf/progs/access_map_in_map.c
 create mode 100644 tools/testing/selftests/bpf/progs/freplace_dead_global_func.c
 create mode 100644 tools/testing/selftests/bpf/progs/map_in_map_btf.c
 create mode 100644 tools/testing/selftests/bpf/progs/normal_map_btf.c
 create mode 100644 tools/testing/selftests/bpf/progs/priv_map.c
 create mode 100644 tools/testing/selftests/bpf/progs/priv_prog.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_fsverity.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_get_xattr.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_sig_in_xattr.c
 create mode 100644 tools/testing/selftests/bpf/progs/verifier_bitfield_write.c

Alexei Starovoitov Dec. 19, 2023, 12:19 a.m. UTC | #1

On Mon, Dec 18, 2023 at 4:05 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> Please consider pulling these changes from:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git tags/for-netdev

Forgot to mention that there are two conflicts in
include/linux/skbuff.h
and
tools/net/ynl/generated/netdev-user.c
that should be easy to resolve.
First is due to typo fix in the word 'variant' and code move.

Jakub Kicinski Dec. 19, 2023, 12:55 a.m. UTC | #2

On Mon, 18 Dec 2023 16:05:20 -0800 Alexei Starovoitov wrote:
> 2) Introduce BPF token object, from Andrii Nakryiko.
> It adds an ability to delegate a subset of BPF features from privileged daemon
> (e.g., systemd) through special mount options for userns-bound BPF FS to a
> trusted unprivileged application. The design accommodates suggestions from
> Christian Brauner and Paul Moore.
> Example:
> $ sudo mkdir -p /sys/fs/bpf/token
> $ sudo mount -t bpf bpffs /sys/fs/bpf/token \
>              -o delegate_cmds=prog_load:MAP_CREATE \
>              -o delegate_progs=kprobe \
>              -o delegate_attachs=xdp

LGTM, but what do I know about file systems.. Adding LKML to the CC
list, if anyone has any late comments on the BPF token come forward
now, petty please?

patchwork-bot+netdevbpf@kernel.org Dec. 19, 2023, 1:10 a.m. UTC | #3

Hello:

This pull request was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Mon, 18 Dec 2023 16:05:20 -0800 you wrote:
> Hi David, hi Jakub, hi Paolo, hi Eric,
> 
> The following pull-request contains BPF updates for your *net-next* tree.
> 
> This PR is larger than usual and contains changes in various parts of the kernel.
> 
> The main changes are:
> 
> [...]

Here is the summary with links:
  - pull-request: bpf-next 2023-12-18
    https://git.kernel.org/netdev/net-next/c/c49b292d031e

You are awesome, thank you!

Linus Torvalds Dec. 19, 2023, 1:11 a.m. UTC | #4

On Mon, 18 Dec 2023 at 16:05, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> 2) Introduce BPF token object, from Andrii Nakryiko.

I assume this is why I and some other unusual recipients are cc'd,
because the networking people feel like they can't judge this and
shouldn't merge non-networking code like this.

Honestly, I was told - and expected - that this part would come in a
branch of its own, so that it would be sanely reviewable.

Now it's mixed in with everything else.

This is *literally* why we have branches in git, so that people can
make more independent changes and judgements, and so that we don't
have to be in a situation where "look, here's ten different things,
pull it all or nothing".

Many of the changes *look* like they are in branches, but they've been
the "fake branches" that are just done as "patch series in a branch,
with the cover letter as the merge message".

Which is great for maintaining that cover letter information and a
certain amount of historical clarity, but not helpful AT ALL for the
"independent changes" thing when it is all mixed up in history, where
independent things are mostly serialized and not actually independent
in history at all.

So now it appears to be one big mess, and exactly that "all or
nothing" thing that isn't great, since the whole point was that the
networking people weren't comfortable with the reviewing filesystem
side.

And honestly, the bpf side *still* seems to be absolutely conbfused
and complkete crap when it comes to file descriptors.

I took a quick look, and I *still* see new code being introduced there
that thinks that file descriptor zero is special, and we tols you a
*year* ago that that wasn't true, and that you need to fix this.

I literally see complete garbage like tghis:

        ..
        __u32 btf_token_fd;
        ...
        if (attr->btf_token_fd) {
                token = bpf_token_get_from_fd(attr->btf_token_fd);

and this is all *new* code that makes that same bogus sh*t-for-brains
mistake that was wrong the first time.

So now I'm saying NAK. Enough is enough.  No more of this crazy "I
don't understand even the _basics_ of file descriptors, and yet I'm
introducing new random interfaces".

I know you thought fd zero was something invalid. You were told
otherwise. Apparently you just ignored being wrong, and have decided
to double down on being wrong.

We don't take this kind of flat-Earther crap.

File descriptors don't start at 1. Deal with reality. Stop making the
same mistake over and over. If you ant to have a "no file descriptor"
flag, you use a signed type, and a signed value for that, because file
descriptor zero is perfectly valid, and I don't want to hear any more
uninformed denialism.

Stop polluting the kernel with incorrect assumptions.

So yes, I will keep NAK'ing this until this kind of fundamental
mistake is fixed. This is not rocket science, and this is not
something that wasn't discussed before. Your ignorance has now turned
from "I didn't know" to "I didn 't care", and at that point I really
don't want to see new code any more.

               Linus

Linus Torvalds Dec. 19, 2023, 1:17 a.m. UTC | #5

On Mon, 18 Dec 2023 at 16:55, Jakub Kicinski <kuba@kernel.org> wrote:
>
> LGTM, but what do I know about file systems.. Adding LKML to the CC
> list, if anyone has any late comments on the BPF token come forward
> now, petty please?

See my crossed email reply.

The file descriptor handling is FUNDAMENTALLY wrong. The first time
that happened, we chalked it up to a mistake. Now it's something
worse.

Please don't pull until at least that part is fixed.

I tried to review the token patches, but honestly, I got to that part
and I just gave up.

We had this whole discussion more than 6 months ago:

  https://lore.kernel.org/all/20230517-allabendlich-umgekehrt-8cc81f8313ac@brauner/

and I really thought the bpf people had *understood* they their
special use of "fd == 0" was wrong.

But it seems that they never did. Once is a mistake. Twice is a
choice. And the bpf people have chosen insanity.

               Linus

Alexei Starovoitov Dec. 19, 2023, 1:48 a.m. UTC | #6

On Mon, Dec 18, 2023 at 5:11 PM Linus Torvalds
<torvalds@linuxfoundation.org> wrote:
>
> I literally see complete garbage like tghis:
>
>         ..
>         __u32 btf_token_fd;
>         ...
>         if (attr->btf_token_fd) {
>                 token = bpf_token_get_from_fd(attr->btf_token_fd);
>
> and this is all *new* code that makes that same bogus sh*t-for-brains
> mistake that was wrong the first time.

Point taken.
We can do s/__u32 token_fd/__u64 token/
and waste upper 32-bit as flags that indicate that lower 32-bit is an FD
or
are you ok with __u32 token that is 'fd + 1'.
zero - invalid
one - FD==0
two - FD==1
?

Naming is hard. 'token_handle' maybe?

Linus Torvalds Dec. 19, 2023, 3:57 a.m. UTC | #7

On Mon, 18 Dec 2023 at 17:48, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
>
> Point taken.
> We can do s/__u32 token_fd/__u64 token/
> and waste upper 32-bit as flags that indicate that lower 32-bit is an FD
> or
> are you ok with __u32 token that is 'fd + 1'.

No, you make it follow the standard pattern that Unix has always had:
file descriptors are _signed_ integer, and negative means error (or
special cases).

Now, traditionally a 'fd' is literally just of type "int", but for
structures it's actually good to make it be a sized entity, so just
make it be __s32, and make any special cases be actual negative
numbers.

Because I'll just go out on a limb and say that two billion file
descriptors is enough for anybody, and if we ever were to hit that
number, we'll have *way* more serious problems elsewhere long long
before. And in practice, "int" is 32-bit on all current and
near-future architectures, so "__s32" really is the same as "int" in
all practical respects, and making the size explicit is just a good
idea.

You might want to perhaps pre-reserve a few negative numbers for
actual special cases, eg "openat()" uses

    #define AT_FDCWD -100

which I don't think is a great example to follow in the details: it
should have parenthesis, and "100" is a rather odd number to choose,
but it's certainly an example of a not-fundamentally-broken "not a
file descriptor, but a special case".

Now, if you have a 'flags' or 'cmd' field for *other* reasons, then
you can certainly just use one of the flags for "I have a file
descriptor". But don't do some odd "translate values", and don't add
32 bits just for that.

That's also a perfectly fine traditional unix use (example: socket
control messages - "struct cmsghdr" with "cmsg_type = SCM_RIGHTS" in
unix domain sockets).

But if you don't have some other reason for having a separate flag for
"I also have a file descriptor you should use", then just make a
negative number mean "no file descriptor".

It's easy to test for the number being negative, but it's also just
easy to *not* test for, ie it's also perfectly fine to just do
something like

        struct fd f = fdget(fd);

without ever even bothering to test whether 'fd' is negative or not.
It is guaranteed to fail for negative numbers and just look exactly
like the "not open" case, so if you don't care about the difference
between "invalid" and "not open", then a negative fd also works just
as-is with no extra code at all.

                   Linus

                     Linus

Andrii Nakryiko Dec. 19, 2023, 4:34 a.m. UTC | #8

On Mon, Dec 18, 2023 at 7:58 PM Linus Torvalds
<torvalds@linuxfoundation.org> wrote:
>
> On Mon, 18 Dec 2023 at 17:48, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> >
> > Point taken.
> > We can do s/__u32 token_fd/__u64 token/
> > and waste upper 32-bit as flags that indicate that lower 32-bit is an FD
> > or
> > are you ok with __u32 token that is 'fd + 1'.
>
> No, you make it follow the standard pattern that Unix has always had:
> file descriptors are _signed_ integer, and negative means error (or
> special cases).
>
> Now, traditionally a 'fd' is literally just of type "int", but for
> structures it's actually good to make it be a sized entity, so just
> make it be __s32, and make any special cases be actual negative
> numbers.
>
> Because I'll just go out on a limb and say that two billion file
> descriptors is enough for anybody, and if we ever were to hit that
> number, we'll have *way* more serious problems elsewhere long long
> before. And in practice, "int" is 32-bit on all current and
> near-future architectures, so "__s32" really is the same as "int" in
> all practical respects, and making the size explicit is just a good
> idea.
>
> You might want to perhaps pre-reserve a few negative numbers for
> actual special cases, eg "openat()" uses
>
>     #define AT_FDCWD -100
>
> which I don't think is a great example to follow in the details: it
> should have parenthesis, and "100" is a rather odd number to choose,
> but it's certainly an example of a not-fundamentally-broken "not a
> file descriptor, but a special case".
>
> Now, if you have a 'flags' or 'cmd' field for *other* reasons, then
> you can certainly just use one of the flags for "I have a file
> descriptor". But don't do some odd "translate values", and don't add
> 32 bits just for that.
>

Makes sense. Yes, we do have flags for all commands accepting token
FD, except for one, BPF_BTF_LOAD, but it's trivial to add flags there
as well. I'll prepare a patch.

> That's also a perfectly fine traditional unix use (example: socket
> control messages - "struct cmsghdr" with "cmsg_type = SCM_RIGHTS" in
> unix domain sockets).
>
> But if you don't have some other reason for having a separate flag for
> "I also have a file descriptor you should use", then just make a
> negative number mean "no file descriptor".
>
> It's easy to test for the number being negative, but it's also just
> easy to *not* test for, ie it's also perfectly fine to just do
> something like
>
>         struct fd f = fdget(fd);
>
> without ever even bothering to test whether 'fd' is negative or not.
> It is guaranteed to fail for negative numbers and just look exactly
> like the "not open" case, so if you don't care about the difference
> between "invalid" and "not open", then a negative fd also works just
> as-is with no extra code at all.
>
>                    Linus
>
>                      Linus

Andrii Nakryiko Dec. 19, 2023, 5:38 a.m. UTC | #9

On Mon, Dec 18, 2023 at 8:34 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Dec 18, 2023 at 7:58 PM Linus Torvalds
> <torvalds@linuxfoundation.org> wrote:
> >
> > On Mon, 18 Dec 2023 at 17:48, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > >
> > > Point taken.
> > > We can do s/__u32 token_fd/__u64 token/
> > > and waste upper 32-bit as flags that indicate that lower 32-bit is an FD
> > > or
> > > are you ok with __u32 token that is 'fd + 1'.
> >
> > No, you make it follow the standard pattern that Unix has always had:
> > file descriptors are _signed_ integer, and negative means error (or
> > special cases).
> >
> > Now, traditionally a 'fd' is literally just of type "int", but for
> > structures it's actually good to make it be a sized entity, so just
> > make it be __s32, and make any special cases be actual negative
> > numbers.
> >
> > Because I'll just go out on a limb and say that two billion file
> > descriptors is enough for anybody, and if we ever were to hit that
> > number, we'll have *way* more serious problems elsewhere long long
> > before. And in practice, "int" is 32-bit on all current and
> > near-future architectures, so "__s32" really is the same as "int" in
> > all practical respects, and making the size explicit is just a good
> > idea.
> >
> > You might want to perhaps pre-reserve a few negative numbers for
> > actual special cases, eg "openat()" uses
> >
> >     #define AT_FDCWD -100
> >
> > which I don't think is a great example to follow in the details: it
> > should have parenthesis, and "100" is a rather odd number to choose,
> > but it's certainly an example of a not-fundamentally-broken "not a
> > file descriptor, but a special case".
> >
> > Now, if you have a 'flags' or 'cmd' field for *other* reasons, then
> > you can certainly just use one of the flags for "I have a file
> > descriptor". But don't do some odd "translate values", and don't add
> > 32 bits just for that.
> >
>
> Makes sense. Yes, we do have flags for all commands accepting token
> FD, except for one, BPF_BTF_LOAD, but it's trivial to add flags there
> as well. I'll prepare a patch.

The patch is at [0], thanks.

  [0] https://patchwork.kernel.org/project/netdevbpf/patch/20231219053150.336991-1-andrii@kernel.org/

>
> > That's also a perfectly fine traditional unix use (example: socket
> > control messages - "struct cmsghdr" with "cmsg_type = SCM_RIGHTS" in
> > unix domain sockets).
> >
> > But if you don't have some other reason for having a separate flag for
> > "I also have a file descriptor you should use", then just make a
> > negative number mean "no file descriptor".
> >
> > It's easy to test for the number being negative, but it's also just
> > easy to *not* test for, ie it's also perfectly fine to just do
> > something like
> >
> >         struct fd f = fdget(fd);
> >
> > without ever even bothering to test whether 'fd' is negative or not.
> > It is guaranteed to fail for negative numbers and just look exactly
> > like the "not open" case, so if you don't care about the difference
> > between "invalid" and "not open", then a negative fd also works just
> > as-is with no extra code at all.
> >
> >                    Linus
> >
> >                      Linus

Christian Brauner Dec. 19, 2023, 10:23 a.m. UTC | #10

On Mon, Dec 18, 2023 at 05:11:23PM -0800, Linus Torvalds wrote:
> On Mon, 18 Dec 2023 at 16:05, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > 2) Introduce BPF token object, from Andrii Nakryiko.
> 
> I assume this is why I and some other unusual recipients are cc'd,
> because the networking people feel like they can't judge this and
> shouldn't merge non-networking code like this.
> 
> Honestly, I was told - and expected - that this part would come in a
> branch of its own, so that it would be sanely reviewable.
> 
> Now it's mixed in with everything else.
> 
> This is *literally* why we have branches in git, so that people can
> make more independent changes and judgements, and so that we don't
> have to be in a situation where "look, here's ten different things,
> pull it all or nothing".
> 
> Many of the changes *look* like they are in branches, but they've been
> the "fake branches" that are just done as "patch series in a branch,
> with the cover letter as the merge message".
> 
> Which is great for maintaining that cover letter information and a
> certain amount of historical clarity, but not helpful AT ALL for the
> "independent changes" thing when it is all mixed up in history, where
> independent things are mostly serialized and not actually independent
> in history at all.
> 
> So now it appears to be one big mess, and exactly that "all or
> nothing" thing that isn't great, since the whole point was that the
> networking people weren't comfortable with the reviewing filesystem
> side.
> 
> And honestly, the bpf side *still* seems to be absolutely conbfused
> and complkete crap when it comes to file descriptors.
> 
> I took a quick look, and I *still* see new code being introduced there
> that thinks that file descriptor zero is special, and we tols you a
> *year* ago that that wasn't true, and that you need to fix this.
> 
> I literally see complete garbage like tghis:
> 
>         ..
>         __u32 btf_token_fd;
>         ...
>         if (attr->btf_token_fd) {
>                 token = bpf_token_get_from_fd(attr->btf_token_fd);
> 
> and this is all *new* code that makes that same bogus sh*t-for-brains
> mistake that was wrong the first time.
> 
> So now I'm saying NAK. Enough is enough.  No more of this crazy "I
> don't understand even the _basics_ of file descriptors, and yet I'm
> introducing new random interfaces".
> 
> I know you thought fd zero was something invalid. You were told
> otherwise. Apparently you just ignored being wrong, and have decided
> to double down on being wrong.
> 
> We don't take this kind of flat-Earther crap.
> 
> File descriptors don't start at 1. Deal with reality. Stop making the
> same mistake over and over. If you ant to have a "no file descriptor"
> flag, you use a signed type, and a signed value for that, because file
> descriptor zero is perfectly valid, and I don't want to hear any more
> uninformed denialism.
> 
> Stop polluting the kernel with incorrect assumptions.
> 
> So yes, I will keep NAK'ing this until this kind of fundamental
> mistake is fixed. This is not rocket science, and this is not
> something that wasn't discussed before. Your ignorance has now turned
> from "I didn't know" to "I didn 't care", and at that point I really
> don't want to see new code any more.

Alexei, Andrii, this is a massive breach of trust and flatout
disrespectful. I barely reword mails and believe me I've reworded this
mail many times. I'm furious. 

Over the last couple of months since LSFMM in May 2023 until almost last
week I've given you extensive design and review for this whole approach
to get this into even remotely sane shape from a VFS perspective.

The VFS maintainers including Linus have explicitly NAKed this "zero is
not a valid fd nonsense" and told you to stop doing that. We told you
that such fundamental VFS semantics are not yours to decide.

And yet you put a patch into a series that did exactly that and then had
the unbelievable audacity to repeatedly ask me to put my ACK under this
- both in person and on list.

I'm glad I only gave my ACK to the two patches that I extensivly
reviewed and never to the whole series.

@Linus, I'd like to ask you to please not pull any BPF code that touches
fs/ in any way without an explicit ACK/RVB from Al, Jan, or myself. For
now, everything BPF related to fs/ is proactively NAKed by me.

This is disrespectful to the whole fs community and to me personally and
precisely why we will keep resisting meaningful BPF integration in fs/
until we can be sure that we can trust this subsystem.

Pleasant in-person interactions are one thing. But they're meaningless
if they're inconsistent with on-list behavior and matters of trust.
Disgraceful and tasteless is what I keep coming back to.

Matthew Wilcox Dec. 19, 2023, 4:06 p.m. UTC | #11

On Tue, Dec 19, 2023 at 11:23:50AM +0100, Christian Brauner wrote:
> Alexei, Andrii, this is a massive breach of trust and flatout
> disrespectful. I barely reword mails and believe me I've reworded this
> mail many times. I'm furious. 
> 
> Over the last couple of months since LSFMM in May 2023 until almost last
> week I've given you extensive design and review for this whole approach
> to get this into even remotely sane shape from a VFS perspective.

This isn't new behaviour from the BPF people.  They always go their own
way on everything.  They refuse to collaborate with anyone in MM to make
the memory allocators work with their constraints; instead they implement
their own.  It feels like they're on a Mission From God to implement the
BPF Operating System and dealing with everyone else is an inconvenience.

https://lore.kernel.org/bpf/20220623003230.37497-1-alexei.starovoitov@gmail.com/

Daniel Borkmann Dec. 19, 2023, 4:36 p.m. UTC | #12

On 12/19/23 11:23 AM, Christian Brauner wrote:
> On Mon, Dec 18, 2023 at 05:11:23PM -0800, Linus Torvalds wrote:
>> On Mon, 18 Dec 2023 at 16:05, Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>>>
>>> 2) Introduce BPF token object, from Andrii Nakryiko.
>>
>> I assume this is why I and some other unusual recipients are cc'd,
>> because the networking people feel like they can't judge this and
>> shouldn't merge non-networking code like this.
>>
>> Honestly, I was told - and expected - that this part would come in a
>> branch of its own, so that it would be sanely reviewable.
>>
>> Now it's mixed in with everything else.
>>
>> This is *literally* why we have branches in git, so that people can
>> make more independent changes and judgements, and so that we don't
>> have to be in a situation where "look, here's ten different things,
>> pull it all or nothing".
>>
>> Many of the changes *look* like they are in branches, but they've been
>> the "fake branches" that are just done as "patch series in a branch,
>> with the cover letter as the merge message".
>>
>> Which is great for maintaining that cover letter information and a
>> certain amount of historical clarity, but not helpful AT ALL for the
>> "independent changes" thing when it is all mixed up in history, where
>> independent things are mostly serialized and not actually independent
>> in history at all.
>>
>> So now it appears to be one big mess, and exactly that "all or
>> nothing" thing that isn't great, since the whole point was that the
>> networking people weren't comfortable with the reviewing filesystem
>> side.
>>
>> And honestly, the bpf side *still* seems to be absolutely conbfused
>> and complkete crap when it comes to file descriptors.
>>
>> I took a quick look, and I *still* see new code being introduced there
>> that thinks that file descriptor zero is special, and we tols you a
>> *year* ago that that wasn't true, and that you need to fix this.
>>
>> I literally see complete garbage like tghis:
>>
>>          ..
>>          __u32 btf_token_fd;
>>          ...
>>          if (attr->btf_token_fd) {
>>                  token = bpf_token_get_from_fd(attr->btf_token_fd);
>>
>> and this is all *new* code that makes that same bogus sh*t-for-brains
>> mistake that was wrong the first time.
>>
>> So now I'm saying NAK. Enough is enough.  No more of this crazy "I
>> don't understand even the _basics_ of file descriptors, and yet I'm
>> introducing new random interfaces".
>>
>> I know you thought fd zero was something invalid. You were told
>> otherwise. Apparently you just ignored being wrong, and have decided
>> to double down on being wrong.
>>
>> We don't take this kind of flat-Earther crap.
>>
>> File descriptors don't start at 1. Deal with reality. Stop making the
>> same mistake over and over. If you ant to have a "no file descriptor"
>> flag, you use a signed type, and a signed value for that, because file
>> descriptor zero is perfectly valid, and I don't want to hear any more
>> uninformed denialism.
>>
>> Stop polluting the kernel with incorrect assumptions.
>>
>> So yes, I will keep NAK'ing this until this kind of fundamental
>> mistake is fixed. This is not rocket science, and this is not
>> something that wasn't discussed before. Your ignorance has now turned
>> from "I didn't know" to "I didn 't care", and at that point I really
>> don't want to see new code any more.
> 
> Alexei, Andrii, this is a massive breach of trust and flatout
> disrespectful. I barely reword mails and believe me I've reworded this
> mail many times. I'm furious.
> 
> Over the last couple of months since LSFMM in May 2023 until almost last
> week I've given you extensive design and review for this whole approach
> to get this into even remotely sane shape from a VFS perspective.
> 
> The VFS maintainers including Linus have explicitly NAKed this "zero is
> not a valid fd nonsense" and told you to stop doing that. We told you
> that such fundamental VFS semantics are not yours to decide.
> 
> And yet you put a patch into a series that did exactly that and then had
> the unbelievable audacity to repeatedly ask me to put my ACK under this
> - both in person and on list.
> 
> I'm glad I only gave my ACK to the two patches that I extensivly
> reviewed and never to the whole series.

Sincere apologies for this whole mess. All token series related patches
have been reverted in bpf-next now, and I'm prepping a PR for net-next
so that this is also fully removed from there.

Thanks,
Daniel

Alexei Starovoitov Dec. 19, 2023, 4:42 p.m. UTC | #13

On Tue, Dec 19, 2023 at 2:23 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Mon, Dec 18, 2023 at 05:11:23PM -0800, Linus Torvalds wrote:
> > On Mon, 18 Dec 2023 at 16:05, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > 2) Introduce BPF token object, from Andrii Nakryiko.
> >
> > I assume this is why I and some other unusual recipients are cc'd,
> > because the networking people feel like they can't judge this and
> > shouldn't merge non-networking code like this.
> >
> > Honestly, I was told - and expected - that this part would come in a
> > branch of its own, so that it would be sanely reviewable.
> >
> > Now it's mixed in with everything else.
> >
> > This is *literally* why we have branches in git, so that people can
> > make more independent changes and judgements, and so that we don't
> > have to be in a situation where "look, here's ten different things,
> > pull it all or nothing".
> >
> > Many of the changes *look* like they are in branches, but they've been
> > the "fake branches" that are just done as "patch series in a branch,
> > with the cover letter as the merge message".
> >
> > Which is great for maintaining that cover letter information and a
> > certain amount of historical clarity, but not helpful AT ALL for the
> > "independent changes" thing when it is all mixed up in history, where
> > independent things are mostly serialized and not actually independent
> > in history at all.
> >
> > So now it appears to be one big mess, and exactly that "all or
> > nothing" thing that isn't great, since the whole point was that the
> > networking people weren't comfortable with the reviewing filesystem
> > side.
> >
> > And honestly, the bpf side *still* seems to be absolutely conbfused
> > and complkete crap when it comes to file descriptors.
> >
> > I took a quick look, and I *still* see new code being introduced there
> > that thinks that file descriptor zero is special, and we tols you a
> > *year* ago that that wasn't true, and that you need to fix this.
> >
> > I literally see complete garbage like tghis:
> >
> >         ..
> >         __u32 btf_token_fd;
> >         ...
> >         if (attr->btf_token_fd) {
> >                 token = bpf_token_get_from_fd(attr->btf_token_fd);
> >
> > and this is all *new* code that makes that same bogus sh*t-for-brains
> > mistake that was wrong the first time.
> >
> > So now I'm saying NAK. Enough is enough.  No more of this crazy "I
> > don't understand even the _basics_ of file descriptors, and yet I'm
> > introducing new random interfaces".
> >
> > I know you thought fd zero was something invalid. You were told
> > otherwise. Apparently you just ignored being wrong, and have decided
> > to double down on being wrong.
> >
> > We don't take this kind of flat-Earther crap.
> >
> > File descriptors don't start at 1. Deal with reality. Stop making the
> > same mistake over and over. If you ant to have a "no file descriptor"
> > flag, you use a signed type, and a signed value for that, because file
> > descriptor zero is perfectly valid, and I don't want to hear any more
> > uninformed denialism.
> >
> > Stop polluting the kernel with incorrect assumptions.
> >
> > So yes, I will keep NAK'ing this until this kind of fundamental
> > mistake is fixed. This is not rocket science, and this is not
> > something that wasn't discussed before. Your ignorance has now turned
> > from "I didn't know" to "I didn 't care", and at that point I really
> > don't want to see new code any more.
>
> Alexei, Andrii, this is a massive breach of trust and flatout
> disrespectful. I barely reword mails and believe me I've reworded this
> mail many times. I'm furious.
>
> Over the last couple of months since LSFMM in May 2023 until almost last
> week I've given you extensive design and review for this whole approach
> to get this into even remotely sane shape from a VFS perspective.
>
> The VFS maintainers including Linus have explicitly NAKed this "zero is
> not a valid fd nonsense" and told you to stop doing that. We told you
> that such fundamental VFS semantics are not yours to decide.
>
> And yet you put a patch into a series that did exactly that and then had
> the unbelievable audacity to repeatedly ask me to put my ACK under this
> - both in person and on list.
>
> I'm glad I only gave my ACK to the two patches that I extensivly
> reviewed and never to the whole series.

fwiw to three patches:
https://lore.kernel.org/bpf/20231208-besessen-vibrieren-4e963e3ca3ba@brauner/
which are all the main bits of it.

The patch 4 that does:
if (attr->map_token_fd)
wasn't sneaked in in any way.
You were cc-ed on it just like linux-fsdevel@vger
during all 12 revisions of the token series over many months.

So this accusation of breach of trust is baseless.

Indeed we didn't internalize that you guys hate fd=0 so much.
In the past you made it clear fd=0 shouldn't be an alias to AT_FDCWD.
We got that part. Meaning of fd=0 here wasn't a special new thing.
We made this mistake in the past and assumed it's ok-ish to continue
in similar situations.
As I said. Point taken. We'll use flag+fd approach as Linus suggested.

Alexei Starovoitov Dec. 19, 2023, 4:51 p.m. UTC | #14

On Tue, Dec 19, 2023 at 8:06 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Dec 19, 2023 at 11:23:50AM +0100, Christian Brauner wrote:
> > Alexei, Andrii, this is a massive breach of trust and flatout
> > disrespectful. I barely reword mails and believe me I've reworded this
> > mail many times. I'm furious.
> >
> > Over the last couple of months since LSFMM in May 2023 until almost last
> > week I've given you extensive design and review for this whole approach
> > to get this into even remotely sane shape from a VFS perspective.
>
> This isn't new behaviour from the BPF people.  They always go their own
> way on everything.  They refuse to collaborate with anyone in MM to make
> the memory allocators work with their constraints; instead they implement
> their own.  It feels like they're on a Mission From God to implement the
> BPF Operating System and dealing with everyone else is an inconvenience.
>
> https://lore.kernel.org/bpf/20220623003230.37497-1-alexei.starovoitov@gmail.com/

Matthew,
I thought I answered in that thread that it is not a memory allocator.
It's small free list of cached elements that bpf prog peeks from
when prog runs in unknown context == tracing deep inside the kernel.
Do you want to design a memory allocator that is fully re-entrant ?
Meaning that kmalloc(GFP_REENTRANT) can be called from any context
deep inside slab, inside arch code, inside _any_ and all code of the kernel?
If the answer is yes, please go ahead.
We'll happily switch to your thing.
We used to preallocate all memory for such tracing use cases
which was wasteful. This thingy is preallocating a few elements instead
of preallocating them all. That's all there is.

Andrii Nakryiko Dec. 19, 2023, 6:31 p.m. UTC | #15

On Tue, Dec 19, 2023 at 8:43 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Dec 19, 2023 at 2:23 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Mon, Dec 18, 2023 at 05:11:23PM -0800, Linus Torvalds wrote:
> > > On Mon, 18 Dec 2023 at 16:05, Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > 2) Introduce BPF token object, from Andrii Nakryiko.
> > >
> > > I assume this is why I and some other unusual recipients are cc'd,
> > > because the networking people feel like they can't judge this and
> > > shouldn't merge non-networking code like this.
> > >
> > > Honestly, I was told - and expected - that this part would come in a
> > > branch of its own, so that it would be sanely reviewable.
> > >
> > > Now it's mixed in with everything else.
> > >
> > > This is *literally* why we have branches in git, so that people can
> > > make more independent changes and judgements, and so that we don't
> > > have to be in a situation where "look, here's ten different things,
> > > pull it all or nothing".
> > >
> > > Many of the changes *look* like they are in branches, but they've been
> > > the "fake branches" that are just done as "patch series in a branch,
> > > with the cover letter as the merge message".
> > >
> > > Which is great for maintaining that cover letter information and a
> > > certain amount of historical clarity, but not helpful AT ALL for the
> > > "independent changes" thing when it is all mixed up in history, where
> > > independent things are mostly serialized and not actually independent
> > > in history at all.
> > >
> > > So now it appears to be one big mess, and exactly that "all or
> > > nothing" thing that isn't great, since the whole point was that the
> > > networking people weren't comfortable with the reviewing filesystem
> > > side.
> > >
> > > And honestly, the bpf side *still* seems to be absolutely conbfused
> > > and complkete crap when it comes to file descriptors.
> > >
> > > I took a quick look, and I *still* see new code being introduced there
> > > that thinks that file descriptor zero is special, and we tols you a
> > > *year* ago that that wasn't true, and that you need to fix this.
> > >
> > > I literally see complete garbage like tghis:
> > >
> > >         ..
> > >         __u32 btf_token_fd;
> > >         ...
> > >         if (attr->btf_token_fd) {
> > >                 token = bpf_token_get_from_fd(attr->btf_token_fd);
> > >
> > > and this is all *new* code that makes that same bogus sh*t-for-brains
> > > mistake that was wrong the first time.
> > >
> > > So now I'm saying NAK. Enough is enough.  No more of this crazy "I
> > > don't understand even the _basics_ of file descriptors, and yet I'm
> > > introducing new random interfaces".
> > >
> > > I know you thought fd zero was something invalid. You were told
> > > otherwise. Apparently you just ignored being wrong, and have decided
> > > to double down on being wrong.
> > >
> > > We don't take this kind of flat-Earther crap.
> > >
> > > File descriptors don't start at 1. Deal with reality. Stop making the
> > > same mistake over and over. If you ant to have a "no file descriptor"
> > > flag, you use a signed type, and a signed value for that, because file
> > > descriptor zero is perfectly valid, and I don't want to hear any more
> > > uninformed denialism.
> > >
> > > Stop polluting the kernel with incorrect assumptions.
> > >
> > > So yes, I will keep NAK'ing this until this kind of fundamental
> > > mistake is fixed. This is not rocket science, and this is not
> > > something that wasn't discussed before. Your ignorance has now turned
> > > from "I didn't know" to "I didn 't care", and at that point I really
> > > don't want to see new code any more.
> >
> > Alexei, Andrii, this is a massive breach of trust and flatout
> > disrespectful. I barely reword mails and believe me I've reworded this
> > mail many times. I'm furious.
> >
> > Over the last couple of months since LSFMM in May 2023 until almost last
> > week I've given you extensive design and review for this whole approach
> > to get this into even remotely sane shape from a VFS perspective.

Yes, and I appreciate your reviews and feedback a lot. There was never
an intent to clandestinely land anything bad or outlandish, so I'm
sorry you feel this way. I've cc'ed you and fsdevel mailing list, just
like LSM folks, on every relevant patch set, each and every patch in
them, and incorporated all the feedback I got over the last multiple
months.

> >
> > The VFS maintainers including Linus have explicitly NAKed this "zero is
> > not a valid fd nonsense" and told you to stop doing that. We told you
> > that such fundamental VFS semantics are not yours to decide.

It's on me to have interpreted FD=0 as AT_CWD in my original patch set
for BPF_OBJ_PIN/BPF_OBJ_GET ([0]). It was totally my fault not
thinking through all the negative consequences of defaulting to
AT_CWD, and I acknowledged that and fixed it.

It's also my bad that I kept using the "fd=0 means no FD was
specified" approach which has been a consistent approach within bpf()
syscall API without really thinking twice about this and how much it
might irritate kernel people. I sent a fix ([1]) and going forward
I'll remember to always add a flag for any new FD-based field in BPF
UAPI.

  [0] https://lore.kernel.org/all/20230516001348.286414-1-andrii@kernel.org/
  [1] https://patchwork.kernel.org/project/netdevbpf/patch/20231219053150.336991-1-andrii@kernel.org/

> >
> > And yet you put a patch into a series that did exactly that and then had
> > the unbelievable audacity to repeatedly ask me to put my ACK under this
> > - both in person and on list.

I did ask for a review and an ack as a sign that it looks good to you.
Precisely to make sure that *everything* looks good overall from the
POV of people outside of the BPF subsystem. I didn't ask for rubber
stamping anything, if that's what is implied here.

> >
> > I'm glad I only gave my ACK to the two patches that I extensivly
> > reviewed and never to the whole series.
>
> fwiw to three patches:
> https://lore.kernel.org/bpf/20231208-besessen-vibrieren-4e963e3ca3ba@brauner/
> which are all the main bits of it.
>
> The patch 4 that does:
> if (attr->map_token_fd)
> wasn't sneaked in in any way.
> You were cc-ed on it just like linux-fsdevel@vger
> during all 12 revisions of the token series over many months.
>
> So this accusation of breach of trust is baseless.
>
> Indeed we didn't internalize that you guys hate fd=0 so much.
> In the past you made it clear fd=0 shouldn't be an alias to AT_FDCWD.

Right, and AT_FDCWD interpretation for fd=0 had security implications
which was a clear and a bad bug.

> We got that part. Meaning of fd=0 here wasn't a special new thing.
> We made this mistake in the past and assumed it's ok-ish to continue
> in similar situations.
> As I said. Point taken. We'll use flag+fd approach as Linus suggested.

Yes, I second the above.

Christian Brauner Dec. 20, 2023, 11:18 a.m. UTC | #16

> The patch 4 that does:
> if (attr->map_token_fd)
> wasn't sneaked in in any way.
> You were cc-ed on it just like linux-fsdevel@vger
> during all 12 revisions of the token series over many months.
> 
> So this accusation of breach of trust is baseless.

I was expecting this reply and I'm still disappointed.

Both of you were explicitly told in very clear words that special-casing
fd 0 is not ok.

Fast forward a few weeks, you chose to not just add patches that forbid
fd 0 again, no, the heinous part is that you chose to not lose a single
word about this: not in the cover letter, not in the relevant commit,
not in all the discussions we had around this.

You were absolutely aware how opposed we are to this. It cannot get any
more sneaky than this. And it's frankly insulting that you choose to
defend this by feigning ignorance. No one is buying this.

But let's assume for a second that both you and Andrii somehow managed
to forget the very clear and heated discussion on-list last time, the
resulting LWN article written about it and the in-person discussion
around this we had in November at LPC.

You still would have put a major deviation from file descriptor
semantics in the bpf specific parts of the patches yet failed to lose a
single word on this anywhere. Yet we explicitly requested in the last
thread that if bpf does deviate from core fs semantics you clearly
communicate this.

But shame on me as well. I should've caught this during review. I
trusted you both enough that I only focussed on the parts that matter
for the VFS which were the two patches I ACKed. I didn't think it
necessary to wade through the completely uninteresting BPF bits that I
couldn't care less about. That won't happen again.

What I want for the future is for bpf to clearly, openly, and explicitly
communicate any decisions that affect core fs semantics. It's the exact
same request I put forward last time. This is a path forward.

Andrii Nakryiko Dec. 20, 2023, 7:17 p.m. UTC | #17

On Wed, Dec 20, 2023 at 3:18 AM Christian Brauner <brauner@kernel.org> wrote:
>
> > The patch 4 that does:
> > if (attr->map_token_fd)
> > wasn't sneaked in in any way.
> > You were cc-ed on it just like linux-fsdevel@vger
> > during all 12 revisions of the token series over many months.
> >
> > So this accusation of breach of trust is baseless.
>
> I was expecting this reply and I'm still disappointed.
>
> Both of you were explicitly told in very clear words that special-casing
> fd 0 is not ok.
>
> Fast forward a few weeks, you chose to not just add patches that forbid
> fd 0 again, no, the heinous part is that you chose to not lose a single
> word about this: not in the cover letter, not in the relevant commit,
> not in all the discussions we had around this.
>
> You were absolutely aware how opposed we are to this. It cannot get any
> more sneaky than this. And it's frankly insulting that you choose to
> defend this by feigning ignorance. No one is buying this.

Christian, I'm sorry you feel this way, but I refuse to accept the
blame of malicious or heinous intent. There was neither intent nor
attempt to mislead you (or anyone else for that matter) or silently
sneak anything in.

Yes, we did continue to use the convention that FD=0 means "no FD is
specified", which was consistent throughout BPF UAPI. But only when it
comes to passing BPF objects around (program, BTF, map, link, and
token). You hate it, I get it. But user-space already deals with this,
because it is present in BPF UAPI in many commands, and it's never a
problem in practice.

This is *very different in implications* compared to passing VFS path
FDs, like it was during the BPF_OBJ_PIN/BPF_OBJ_GET patch set ([0]).
Back then I acknowledged the wrongness of treating FD=0 as AT_FDCWD
due to security implications and fixed that with a special flag. It
was a special case as far as BPF UAPI goes, and it was the first one
of that kind.

In my mind, path FDs (and any other FD that points to a file or other
objects that didn't originate in bpf() syscall) and BPF FDs are two
completely different classes of cases. And that's why I didn't give it
too much thought when adding bpf_token_fd with default (for bpf()
syscall) semantics of treating FD=0 as "no FD was provided". It *is* a
*quirk* of the BPF UAPI that users have to take into account (and
libbpf does take into account and never returns FD=0 to users, so in
practice it is never a problem), but we are not defining any new
semantics here. We do say "dup your FD=0, if you happen to want to
pass it to BPF UAPI", a quirk that libbpf (and I presume other BPF
libraries) hides and doesn't even mention in API.

I'm elaborating on this so much just to explain *the thought process*
(and not to make excuses) and why this was done the way it was done.
This discussion made it very clear that this BPF FD special treatment
won't be tolerated. OK, ack. We are going to add new flags whenever
any FD field is added to BPF UAPI from now on.

Back in [0] I didn't remember such a strong "we forbid fd 0 again"
wording, tbh. At least before the discussion devolved into an
unfortunate "let's prevent kernel from returning fd<3" discussion.
Quoting [0] a bit:

 > If it was discussed then great but if not then I would like to make it
 > very clear that if in the future you decide to introduce custom
 > semantics for vfs provided infrastructure - especially when exposed to
 > userspace - that you please Cc us.

And I did CC you.

 > I personally find this extremely weird to treat fd 0 as anything other
 > than a random fd number as it goes against any userspace assumptions and
 > drastically deviates from basically every file descriptor interface we
 > have. I mean, you're not just saying fd 0 is invalid you're even saying
 > it means AT_FDCWD.

Yes, the AT_FDCWD thing was completely wrong. You did express general
dissatisfaction with BPF UAPI's choice to treat optional FD fields as
"no FD" if FD=0, and we can't fix it without breaking user-space.
Still, the way I read it was that your main concern was, justifiably,
AT_FDCWD treatment.

 > For every other interface, including those that pass fds in structs
 > whose extensibility is premised on unknown fields being set to zero,
 > have ways to make fd 0 work just fine. You could've done that to without
 > inventing custom fd semantics.

You did explicitly ask this, but still, not in a "I forbid" fashion.
Especially, taking this into account:

 > This is not a rant I'm really just trying to make sure that we agree on
 > common ground when it comes to touching each others code or semantic
 > assumptions.

I didn't feel like bpf() syscall UAPI was "touching each others code".
But I'll be honest, I'm not sure how widely I should have treated
"custom semantics for vfs provided infrastructure" and "inventing
custom fd semantics", and so I chose UAPI consistency relegating FD=0
convention as a BPF UAPI quirk. Are the BPF kernel objects the VFS
infrastructure? Or that was about path FDs, and other "standard" FS
files? It's a bit of a moot point now, though, as we agreed to do it
for any FD field going forward, but still, clarity would be helpful.

Again, there was no bad faith in my actions and everything was done in
the open, with plenty of opportunities and time to raise concerns.

  [0] https://lore.kernel.org/all/20230517-allabendlich-umgekehrt-8cc81f8313ac@brauner/

>
> But let's assume for a second that both you and Andrii somehow managed
> to forget the very clear and heated discussion on-list last time, the
> resulting LWN article written about it and the in-person discussion
> around this we had in November at LPC.

As far as I can remember LPC, we never touched on passing the BPF
token FD aspect at all during discussions. We did talk about
BPF_OBJ_PIN and how wrong it was to assume AT_FDCWD, which, again, is
objectively wrong due to potential security attacks. It seems like in
your mind AT_FDCWD and generally passing BPF object FDs is exactly the
same problem, which I disagree with, but I think that's where your
accusations come from. You can say both cases are problems, but they
are different problems (security vs deviation of API from the rest of
kernel APIs).

>
> You still would have put a major deviation from file descriptor
> semantics in the bpf specific parts of the patches yet failed to lose a
> single word on this anywhere. Yet we explicitly requested in the last
> thread that if bpf does deviate from core fs semantics you clearly
> communicate this.

FD 0 is still a valid file descriptor, in general. No one claims
otherwise, there is no change in semantics. BPF syscall won't see FD=0
for some optional fields, and that's a deviation from other APIs, yes,
but with no security implications.

And it might be hard to believe, but it's been like that for so long
and is just such an ingrained default behavior, that yes, out of habit
I didn't even think to highlight that in commit messages or cover
letters. My bad, certainly, but hardly a heinous act.

>
> But shame on me as well. I should've caught this during review. I
> trusted you both enough that I only focussed on the parts that matter
> for the VFS which were the two patches I ACKed. I didn't think it
> necessary to wade through the completely uninteresting BPF bits that I
> couldn't care less about. That won't happen again.
>
> What I want for the future is for bpf to clearly, openly, and explicitly
> communicate any decisions that affect core fs semantics. It's the exact
> same request I put forward last time. This is a path forward.

Of course, and you'll be CC'ed on all the BPF token patches I will
resend after the holidays.

And just to be clear for the future, by "core fs semantics" you also
mean any BPF UAPI FD field, right?

Christian Brauner Dec. 21, 2023, 1:05 p.m. UTC | #18

> Of course, and you'll be CC'ed on all the BPF token patches I will
> resend after the holidays.
> 
> And just to be clear for the future, by "core fs semantics" you also
> mean any BPF UAPI FD field, right?

Yes, because ultimately you end up with calling:

fdget()/fdget_raw()/fget()

to turn a userspace handle in the form of an fd and turn it into a
struct file. And that is uniform across the kernel. And therein lies the
beauty of it all imo.

IMHO, a file descriptor is one of the most widely used generic
abstraction we have across all of the kernel. It is almost literally
used everywhere. And everyone has the same contract: a non-negative
integer is a valid fd, a negative one is invalid. It's simple, there
aren't corner cases, there aren't custom semantics.

And it's also arguably one of the most successful ones as we keep
implementing new apis on top of this abstraction (pidfd, seccomp,
process_*(), memfd_*(), endless kvm ioctls etc etc).

pull-request: bpf-next 2023-12-18

Pull-request

Checks

Message

Comments