diff mbox series

[v9,bpf-next,3/5] bpf: Add skb dynptrs

Message ID 20230127191703.3864860-4-joannelkoong@gmail.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Headers show
Series Add skb + xdp dynptrs | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit fail Errors and warnings before: 1745 this patch: 1747
netdev/cc_maintainers warning 12 maintainers not CCed: sdf@google.com kuba@kernel.org kpsingh@kernel.org jolsa@kernel.org haoluo@google.com martin.lau@linux.dev davem@davemloft.net song@kernel.org pabeni@redhat.com john.fastabend@gmail.com edumazet@google.com yhs@fb.com
netdev/build_clang fail Errors and warnings before: 161 this patch: 163
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn fail Errors and warnings before: 1742 this patch: 1744
netdev/checkpatch warning WARNING: line length of 102 exceeds 80 columns WARNING: line length of 81 exceeds 80 columns WARNING: line length of 82 exceeds 80 columns WARNING: line length of 83 exceeds 80 columns WARNING: line length of 84 exceeds 80 columns WARNING: line length of 85 exceeds 80 columns WARNING: line length of 87 exceeds 80 columns WARNING: line length of 88 exceeds 80 columns WARNING: line length of 89 exceeds 80 columns WARNING: line length of 90 exceeds 80 columns WARNING: line length of 91 exceeds 80 columns WARNING: line length of 92 exceeds 80 columns WARNING: line length of 93 exceeds 80 columns WARNING: line length of 94 exceeds 80 columns WARNING: line length of 95 exceeds 80 columns WARNING: line length of 96 exceeds 80 columns WARNING: line length of 97 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-7 success Logs for llvm-toolchain
bpf/vmtest-bpf-next-VM_Test-8 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-2 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-3 success Logs for build for aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-5 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-6 success Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-4 success Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-37 success Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-9 success Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-12 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-14 success Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-18 success Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-19 success Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-22 success Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-24 success Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-25 success Logs for test_progs_no_alu32_parallel on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-27 success Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-28 success Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-29 success Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-30 success Logs for test_progs_parallel on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-32 success Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-33 success Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-34 success Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-35 success Logs for test_verifier on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-38 success Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-10 success Logs for test_maps on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-13 success Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-15 success Logs for test_progs on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-20 fail Logs for test_progs_no_alu32 on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-26 success Logs for test_progs_no_alu32_parallel on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-36 success Logs for test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-16 fail Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-21 fail Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-11 success Logs for test_maps on s390x with gcc
bpf/vmtest-bpf-next-PR fail PR summary
bpf/vmtest-bpf-next-VM_Test-31 success Logs for test_progs_parallel on s390x with gcc

Commit Message

Joanne Koong Jan. 27, 2023, 7:17 p.m. UTC
Add skb dynptrs, which are dynptrs whose underlying pointer points
to a skb. The dynptr acts on skb data. skb dynptrs have two main
benefits. One is that they allow operations on sizes that are not
statically known at compile-time (eg variable-sized accesses).
Another is that parsing the packet data through dynptrs (instead of
through direct access of skb->data and skb->data_end) can be more
ergonomic and less brittle (eg does not need manual if checking for
being within bounds of data_end).

For bpf prog types that don't support writes on skb data, the dynptr is
read-only (bpf_dynptr_write() will return an error and bpf_dynptr_data()
will return a data slice that is read-only where any writes to it will
be rejected by the verifier).

For reads and writes through the bpf_dynptr_read() and bpf_dynptr_write()
interfaces, reading and writing from/to data in the head as well as from/to
non-linear paged buffers is supported. For data slices (through the
bpf_dynptr_data() interface), if the data is in a paged buffer, the user
must first call bpf_skb_pull_data() to pull the data into the linear
portion.

Any bpf_dynptr_write() automatically invalidates any prior data slices
to the skb dynptr. This is because a bpf_dynptr_write() may be writing
to data in a paged buffer, so it will need to pull the buffer first into
the head. The reason it needs to be pulled instead of writing directly to
the paged buffers is because they may be cloned (only the head of the skb
is by default uncloned). As such, any bpf_dynptr_write() will
automatically have its prior data slices invalidated, even if the write
is to data in the skb head (the verifier has no way of differentiating
whether the write is to the head or paged buffers during program load
time). Please note as well that any other helper calls that change the
underlying packet buffer (eg bpf_skb_pull_data()) invalidates any data
slices of the skb dynptr as well. The stack trace for this is
check_helper_call() -> clear_all_pkt_pointers() ->
__clear_all_pkt_pointers() -> mark_reg_unknown().

For examples of how skb dynptrs can be used, please see the attached
selftests.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
 include/linux/bpf.h            |  82 +++++++++------
 include/linux/filter.h         |  18 ++++
 include/uapi/linux/bpf.h       |  37 +++++--
 kernel/bpf/btf.c               |  18 ++++
 kernel/bpf/helpers.c           |  95 ++++++++++++++---
 kernel/bpf/verifier.c          | 185 ++++++++++++++++++++++++++-------
 net/core/filter.c              |  60 ++++++++++-
 tools/include/uapi/linux/bpf.h |  37 +++++--
 8 files changed, 432 insertions(+), 100 deletions(-)

Comments

Alexei Starovoitov Jan. 29, 2023, 11:39 p.m. UTC | #1
On Fri, Jan 27, 2023 at 11:17:01AM -0800, Joanne Koong wrote:
> Add skb dynptrs, which are dynptrs whose underlying pointer points
> to a skb. The dynptr acts on skb data. skb dynptrs have two main
> benefits. One is that they allow operations on sizes that are not
> statically known at compile-time (eg variable-sized accesses).
> Another is that parsing the packet data through dynptrs (instead of
> through direct access of skb->data and skb->data_end) can be more
> ergonomic and less brittle (eg does not need manual if checking for
> being within bounds of data_end).
> 
> For bpf prog types that don't support writes on skb data, the dynptr is
> read-only (bpf_dynptr_write() will return an error and bpf_dynptr_data()
> will return a data slice that is read-only where any writes to it will
> be rejected by the verifier).
> 
> For reads and writes through the bpf_dynptr_read() and bpf_dynptr_write()
> interfaces, reading and writing from/to data in the head as well as from/to
> non-linear paged buffers is supported. For data slices (through the
> bpf_dynptr_data() interface), if the data is in a paged buffer, the user
> must first call bpf_skb_pull_data() to pull the data into the linear
> portion.

Looks like there is an assumption in parts of this patch that
linear part of skb is always writeable. That's not the case.
See if (ops->gen_prologue || env->seen_direct_write) in convert_ctx_accesses().
For TC progs it calls bpf_unclone_prologue() which adds hidden
bpf_skb_pull_data() in the beginning of the prog to make it writeable.

> Any bpf_dynptr_write() automatically invalidates any prior data slices
> to the skb dynptr. This is because a bpf_dynptr_write() may be writing
> to data in a paged buffer, so it will need to pull the buffer first into
> the head. The reason it needs to be pulled instead of writing directly to
> the paged buffers is because they may be cloned (only the head of the skb
> is by default uncloned). As such, any bpf_dynptr_write() will
> automatically have its prior data slices invalidated, even if the write
> is to data in the skb head (the verifier has no way of differentiating
> whether the write is to the head or paged buffers during program load
> time). 

Could you explain the workflow how bpf_dynptr_write() invalidates other
pkt pointers ?
I expected bpf_dynptr_write() to be in bpf_helper_changes_pkt_data().
Looks like bpf_dynptr_write() calls bpf_skb_store_bytes() underneath,
but that doesn't help the verifier.

> Please note as well that any other helper calls that change the
> underlying packet buffer (eg bpf_skb_pull_data()) invalidates any data
> slices of the skb dynptr as well. The stack trace for this is
> check_helper_call() -> clear_all_pkt_pointers() ->
> __clear_all_pkt_pointers() -> mark_reg_unknown().

__clear_all_pkt_pointers isn't present in the tree. Typo ?

> 
> For examples of how skb dynptrs can be used, please see the attached
> selftests.
> 
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
>  include/linux/bpf.h            |  82 +++++++++------
>  include/linux/filter.h         |  18 ++++
>  include/uapi/linux/bpf.h       |  37 +++++--
>  kernel/bpf/btf.c               |  18 ++++
>  kernel/bpf/helpers.c           |  95 ++++++++++++++---
>  kernel/bpf/verifier.c          | 185 ++++++++++++++++++++++++++-------
>  net/core/filter.c              |  60 ++++++++++-
>  tools/include/uapi/linux/bpf.h |  37 +++++--
>  8 files changed, 432 insertions(+), 100 deletions(-)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 14a0264fac57..1ac061b64582 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -575,11 +575,14 @@ enum bpf_type_flag {
>  	/* MEM is tagged with rcu and memory access needs rcu_read_lock protection. */
>  	MEM_RCU			= BIT(13 + BPF_BASE_TYPE_BITS),
>  
> +	/* DYNPTR points to sk_buff */
> +	DYNPTR_TYPE_SKB		= BIT(14 + BPF_BASE_TYPE_BITS),
> +
>  	__BPF_TYPE_FLAG_MAX,
>  	__BPF_TYPE_LAST_FLAG	= __BPF_TYPE_FLAG_MAX - 1,
>  };
>  
> -#define DYNPTR_TYPE_FLAG_MASK	(DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF)
> +#define DYNPTR_TYPE_FLAG_MASK	(DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF | DYNPTR_TYPE_SKB)
>  
>  /* Max number of base types. */
>  #define BPF_BASE_TYPE_LIMIT	(1UL << BPF_BASE_TYPE_BITS)
> @@ -1082,6 +1085,35 @@ static __always_inline __nocfi unsigned int bpf_dispatcher_nop_func(
>  	return bpf_func(ctx, insnsi);
>  }
>  
> +/* the implementation of the opaque uapi struct bpf_dynptr */
> +struct bpf_dynptr_kern {
> +	void *data;
> +	/* Size represents the number of usable bytes of dynptr data.
> +	 * If for example the offset is at 4 for a local dynptr whose data is
> +	 * of type u64, the number of usable bytes is 4.
> +	 *
> +	 * The upper 8 bits are reserved. It is as follows:
> +	 * Bits 0 - 23 = size
> +	 * Bits 24 - 30 = dynptr type
> +	 * Bit 31 = whether dynptr is read-only
> +	 */
> +	u32 size;
> +	u32 offset;
> +} __aligned(8);
> +
> +enum bpf_dynptr_type {
> +	BPF_DYNPTR_TYPE_INVALID,
> +	/* Points to memory that is local to the bpf program */
> +	BPF_DYNPTR_TYPE_LOCAL,
> +	/* Underlying data is a ringbuf record */
> +	BPF_DYNPTR_TYPE_RINGBUF,
> +	/* Underlying data is a sk_buff */
> +	BPF_DYNPTR_TYPE_SKB,
> +};
> +
> +int bpf_dynptr_check_size(u32 size);
> +u32 bpf_dynptr_get_size(const struct bpf_dynptr_kern *ptr);
> +
>  #ifdef CONFIG_BPF_JIT
>  int bpf_trampoline_link_prog(struct bpf_tramp_link *link, struct bpf_trampoline *tr);
>  int bpf_trampoline_unlink_prog(struct bpf_tramp_link *link, struct bpf_trampoline *tr);
> @@ -2216,6 +2248,11 @@ static inline bool has_current_bpf_ctx(void)
>  }
>  
>  void notrace bpf_prog_inc_misses_counter(struct bpf_prog *prog);
> +
> +void bpf_dynptr_init(struct bpf_dynptr_kern *ptr, void *data,
> +		     enum bpf_dynptr_type type, u32 offset, u32 size);
> +void bpf_dynptr_set_null(struct bpf_dynptr_kern *ptr);
> +void bpf_dynptr_set_rdonly(struct bpf_dynptr_kern *ptr);
>  #else /* !CONFIG_BPF_SYSCALL */
>  static inline struct bpf_prog *bpf_prog_get(u32 ufd)
>  {
> @@ -2445,6 +2482,19 @@ static inline void bpf_prog_inc_misses_counter(struct bpf_prog *prog)
>  static inline void bpf_cgrp_storage_free(struct cgroup *cgroup)
>  {
>  }
> +
> +static inline void bpf_dynptr_init(struct bpf_dynptr_kern *ptr, void *data,
> +				   enum bpf_dynptr_type type, u32 offset, u32 size)
> +{
> +}
> +
> +static inline void bpf_dynptr_set_null(struct bpf_dynptr_kern *ptr)
> +{
> +}
> +
> +static inline void bpf_dynptr_set_rdonly(struct bpf_dynptr_kern *ptr)
> +{
> +}
>  #endif /* CONFIG_BPF_SYSCALL */
>  
>  void __bpf_free_used_btfs(struct bpf_prog_aux *aux,
> @@ -2863,36 +2913,6 @@ int bpf_bprintf_prepare(char *fmt, u32 fmt_size, const u64 *raw_args,
>  			u32 num_args, struct bpf_bprintf_data *data);
>  void bpf_bprintf_cleanup(struct bpf_bprintf_data *data);
>  
> -/* the implementation of the opaque uapi struct bpf_dynptr */
> -struct bpf_dynptr_kern {
> -	void *data;
> -	/* Size represents the number of usable bytes of dynptr data.
> -	 * If for example the offset is at 4 for a local dynptr whose data is
> -	 * of type u64, the number of usable bytes is 4.
> -	 *
> -	 * The upper 8 bits are reserved. It is as follows:
> -	 * Bits 0 - 23 = size
> -	 * Bits 24 - 30 = dynptr type
> -	 * Bit 31 = whether dynptr is read-only
> -	 */
> -	u32 size;
> -	u32 offset;
> -} __aligned(8);
> -
> -enum bpf_dynptr_type {
> -	BPF_DYNPTR_TYPE_INVALID,
> -	/* Points to memory that is local to the bpf program */
> -	BPF_DYNPTR_TYPE_LOCAL,
> -	/* Underlying data is a kernel-produced ringbuf record */
> -	BPF_DYNPTR_TYPE_RINGBUF,
> -};
> -
> -void bpf_dynptr_init(struct bpf_dynptr_kern *ptr, void *data,
> -		     enum bpf_dynptr_type type, u32 offset, u32 size);
> -void bpf_dynptr_set_null(struct bpf_dynptr_kern *ptr);
> -int bpf_dynptr_check_size(u32 size);
> -u32 bpf_dynptr_get_size(const struct bpf_dynptr_kern *ptr);
> -
>  #ifdef CONFIG_BPF_LSM
>  void bpf_cgroup_atype_get(u32 attach_btf_id, int cgroup_atype);
>  void bpf_cgroup_atype_put(int cgroup_atype);
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index ccc4a4a58c72..c87d13954d89 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -1541,4 +1541,22 @@ static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u64 index
>  	return XDP_REDIRECT;
>  }
>  
> +#ifdef CONFIG_NET
> +int __bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void *to, u32 len);
> +int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from,
> +			  u32 len, u64 flags);
> +#else /* CONFIG_NET */
> +static inline int __bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset,
> +				       void *to, u32 len)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
> +static inline int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset,
> +					const void *from, u32 len, u64 flags)
> +{
> +	return -EOPNOTSUPP;
> +}
> +#endif /* CONFIG_NET */
> +
>  #endif /* __LINUX_FILTER_H__ */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index ba0f0cfb5e42..f6910392d339 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -5320,22 +5320,45 @@ union bpf_attr {
>   *	Description
>   *		Write *len* bytes from *src* into *dst*, starting from *offset*
>   *		into *dst*.
> - *		*flags* is currently unused.
> + *
> + *		*flags* must be 0 except for skb-type dynptrs.
> + *
> + *		For skb-type dynptrs:
> + *		    *  All data slices of the dynptr are automatically
> + *		       invalidated after **bpf_dynptr_write**\ (). If you wish to
> + *		       avoid this, please perform the write using direct data slices
> + *		       instead.
> + *
> + *		    *  For *flags*, please see the flags accepted by
> + *		       **bpf_skb_store_bytes**\ ().
>   *	Return
>   *		0 on success, -E2BIG if *offset* + *len* exceeds the length
>   *		of *dst*'s data, -EINVAL if *dst* is an invalid dynptr or if *dst*
> - *		is a read-only dynptr or if *flags* is not 0.
> + *		is a read-only dynptr or if *flags* is not correct. For skb-type dynptrs,
> + *		other errors correspond to errors returned by **bpf_skb_store_bytes**\ ().
>   *
>   * void *bpf_dynptr_data(const struct bpf_dynptr *ptr, u32 offset, u32 len)
>   *	Description
>   *		Get a pointer to the underlying dynptr data.
>   *
>   *		*len* must be a statically known value. The returned data slice
> - *		is invalidated whenever the dynptr is invalidated.
> - *	Return
> - *		Pointer to the underlying dynptr data, NULL if the dynptr is
> - *		read-only, if the dynptr is invalid, or if the offset and length
> - *		is out of bounds.
> + *		is invalidated whenever the dynptr is invalidated. Please note
> + *		that if the dynptr is read-only, then the returned data slice will
> + *		be read-only.
> + *
> + *		For skb-type dynptrs:
> + *		    * If *offset* + *len* extends into the skb's paged buffers,
> + *		      the user should manually pull the skb with **bpf_skb_pull_data**\ ()
> + *		      and try again.
> + *
> + *		    * The data slice is automatically invalidated anytime
> + *		      **bpf_dynptr_write**\ () or a helper call that changes
> + *		      the underlying packet buffer (eg **bpf_skb_pull_data**\ ())
> + *		      is called.
> + *	Return
> + *		Pointer to the underlying dynptr data, NULL if the dynptr is invalid,
> + *		or if the offset and length is out of bounds or in a paged buffer for
> + *		skb-type dynptrs.
>   *
>   * s64 bpf_tcp_raw_gen_syncookie_ipv4(struct iphdr *iph, struct tcphdr *th, u32 th_len)
>   *	Description
> diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
> index b4da17688c65..35d0780f2eb9 100644
> --- a/kernel/bpf/btf.c
> +++ b/kernel/bpf/btf.c
> @@ -207,6 +207,11 @@ enum btf_kfunc_hook {
>  	BTF_KFUNC_HOOK_TRACING,
>  	BTF_KFUNC_HOOK_SYSCALL,
>  	BTF_KFUNC_HOOK_FMODRET,
> +	BTF_KFUNC_HOOK_CGROUP_SKB,
> +	BTF_KFUNC_HOOK_SCHED_ACT,
> +	BTF_KFUNC_HOOK_SK_SKB,
> +	BTF_KFUNC_HOOK_SOCKET_FILTER,
> +	BTF_KFUNC_HOOK_LWT,
>  	BTF_KFUNC_HOOK_MAX,
>  };
>  
> @@ -7609,6 +7614,19 @@ static int bpf_prog_type_to_kfunc_hook(enum bpf_prog_type prog_type)
>  		return BTF_KFUNC_HOOK_TRACING;
>  	case BPF_PROG_TYPE_SYSCALL:
>  		return BTF_KFUNC_HOOK_SYSCALL;
> +	case BPF_PROG_TYPE_CGROUP_SKB:
> +		return BTF_KFUNC_HOOK_CGROUP_SKB;
> +	case BPF_PROG_TYPE_SCHED_ACT:
> +		return BTF_KFUNC_HOOK_SCHED_ACT;
> +	case BPF_PROG_TYPE_SK_SKB:
> +		return BTF_KFUNC_HOOK_SK_SKB;
> +	case BPF_PROG_TYPE_SOCKET_FILTER:
> +		return BTF_KFUNC_HOOK_SOCKET_FILTER;
> +	case BPF_PROG_TYPE_LWT_OUT:
> +	case BPF_PROG_TYPE_LWT_IN:
> +	case BPF_PROG_TYPE_LWT_XMIT:
> +	case BPF_PROG_TYPE_LWT_SEG6LOCAL:
> +		return BTF_KFUNC_HOOK_LWT;
>  	default:
>  		return BTF_KFUNC_HOOK_MAX;
>  	}
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 458db2db2f81..a79d522b3a26 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -1420,11 +1420,21 @@ static bool bpf_dynptr_is_rdonly(const struct bpf_dynptr_kern *ptr)
>  	return ptr->size & DYNPTR_RDONLY_BIT;
>  }
>  
> +void bpf_dynptr_set_rdonly(struct bpf_dynptr_kern *ptr)
> +{
> +	ptr->size |= DYNPTR_RDONLY_BIT;
> +}
> +
>  static void bpf_dynptr_set_type(struct bpf_dynptr_kern *ptr, enum bpf_dynptr_type type)
>  {
>  	ptr->size |= type << DYNPTR_TYPE_SHIFT;
>  }
>  
> +static enum bpf_dynptr_type bpf_dynptr_get_type(const struct bpf_dynptr_kern *ptr)
> +{
> +	return (ptr->size & ~(DYNPTR_RDONLY_BIT)) >> DYNPTR_TYPE_SHIFT;
> +}
> +
>  u32 bpf_dynptr_get_size(const struct bpf_dynptr_kern *ptr)
>  {
>  	return ptr->size & DYNPTR_SIZE_MASK;
> @@ -1497,6 +1507,7 @@ static const struct bpf_func_proto bpf_dynptr_from_mem_proto = {
>  BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len, const struct bpf_dynptr_kern *, src,
>  	   u32, offset, u64, flags)
>  {
> +	enum bpf_dynptr_type type;
>  	int err;
>  
>  	if (!src->data || flags)
> @@ -1506,13 +1517,23 @@ BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len, const struct bpf_dynptr_kern
>  	if (err)
>  		return err;
>  
> -	/* Source and destination may possibly overlap, hence use memmove to
> -	 * copy the data. E.g. bpf_dynptr_from_mem may create two dynptr
> -	 * pointing to overlapping PTR_TO_MAP_VALUE regions.
> -	 */
> -	memmove(dst, src->data + src->offset + offset, len);
> +	type = bpf_dynptr_get_type(src);
>  
> -	return 0;
> +	switch (type) {
> +	case BPF_DYNPTR_TYPE_LOCAL:
> +	case BPF_DYNPTR_TYPE_RINGBUF:
> +		/* Source and destination may possibly overlap, hence use memmove to
> +		 * copy the data. E.g. bpf_dynptr_from_mem may create two dynptr
> +		 * pointing to overlapping PTR_TO_MAP_VALUE regions.
> +		 */
> +		memmove(dst, src->data + src->offset + offset, len);
> +		return 0;
> +	case BPF_DYNPTR_TYPE_SKB:
> +		return __bpf_skb_load_bytes(src->data, src->offset + offset, dst, len);
> +	default:
> +		WARN_ONCE(true, "bpf_dynptr_read: unknown dynptr type %d\n", type);
> +		return -EFAULT;
> +	}
>  }
>  
>  static const struct bpf_func_proto bpf_dynptr_read_proto = {
> @@ -1529,22 +1550,36 @@ static const struct bpf_func_proto bpf_dynptr_read_proto = {
>  BPF_CALL_5(bpf_dynptr_write, const struct bpf_dynptr_kern *, dst, u32, offset, void *, src,
>  	   u32, len, u64, flags)
>  {
> +	enum bpf_dynptr_type type;
>  	int err;
>  
> -	if (!dst->data || flags || bpf_dynptr_is_rdonly(dst))
> +	if (!dst->data || bpf_dynptr_is_rdonly(dst))
>  		return -EINVAL;
>  
>  	err = bpf_dynptr_check_off_len(dst, offset, len);
>  	if (err)
>  		return err;
>  
> -	/* Source and destination may possibly overlap, hence use memmove to
> -	 * copy the data. E.g. bpf_dynptr_from_mem may create two dynptr
> -	 * pointing to overlapping PTR_TO_MAP_VALUE regions.
> -	 */
> -	memmove(dst->data + dst->offset + offset, src, len);
> +	type = bpf_dynptr_get_type(dst);
>  
> -	return 0;
> +	switch (type) {
> +	case BPF_DYNPTR_TYPE_LOCAL:
> +	case BPF_DYNPTR_TYPE_RINGBUF:
> +		if (flags)
> +			return -EINVAL;
> +		/* Source and destination may possibly overlap, hence use memmove to
> +		 * copy the data. E.g. bpf_dynptr_from_mem may create two dynptr
> +		 * pointing to overlapping PTR_TO_MAP_VALUE regions.
> +		 */
> +		memmove(dst->data + dst->offset + offset, src, len);
> +		return 0;
> +	case BPF_DYNPTR_TYPE_SKB:
> +		return __bpf_skb_store_bytes(dst->data, dst->offset + offset, src, len,
> +					     flags);
> +	default:
> +		WARN_ONCE(true, "bpf_dynptr_write: unknown dynptr type %d\n", type);
> +		return -EFAULT;
> +	}
>  }
>  
>  static const struct bpf_func_proto bpf_dynptr_write_proto = {
> @@ -1560,6 +1595,8 @@ static const struct bpf_func_proto bpf_dynptr_write_proto = {
>  
>  BPF_CALL_3(bpf_dynptr_data, const struct bpf_dynptr_kern *, ptr, u32, offset, u32, len)
>  {
> +	enum bpf_dynptr_type type;
> +	void *data;
>  	int err;
>  
>  	if (!ptr->data)
> @@ -1569,10 +1606,36 @@ BPF_CALL_3(bpf_dynptr_data, const struct bpf_dynptr_kern *, ptr, u32, offset, u3
>  	if (err)
>  		return 0;
>  
> -	if (bpf_dynptr_is_rdonly(ptr))
> -		return 0;
> +	type = bpf_dynptr_get_type(ptr);
> +
> +	switch (type) {
> +	case BPF_DYNPTR_TYPE_LOCAL:
> +	case BPF_DYNPTR_TYPE_RINGBUF:
> +		if (bpf_dynptr_is_rdonly(ptr))
> +			return 0;
> +
> +		data = ptr->data;
> +		break;
> +	case BPF_DYNPTR_TYPE_SKB:
> +	{
> +		struct sk_buff *skb = ptr->data;
>  
> -	return (unsigned long)(ptr->data + ptr->offset + offset);
> +		/* if the data is paged, the caller needs to pull it first */
> +		if (ptr->offset + offset + len > skb_headlen(skb))
> +			return 0;
> +
> +		/* Depending on the prog type, the data slice will be either
> +		 * read-writable or read-only. The verifier will enforce that
> +		 * any writes to read-only data slices are rejected
> +		 */
> +		data = skb->data;
> +		break;
> +	}
> +	default:
> +		WARN_ONCE(true, "bpf_dynptr_data: unknown dynptr type %d\n", type);
> +		return 0;
> +	}
> +	return (unsigned long)(data + ptr->offset + offset);
>  }
>  
>  static const struct bpf_func_proto bpf_dynptr_data_proto = {
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 853ab671be0b..3b022abc34e3 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -741,6 +741,8 @@ static enum bpf_dynptr_type arg_to_dynptr_type(enum bpf_arg_type arg_type)
>  		return BPF_DYNPTR_TYPE_LOCAL;
>  	case DYNPTR_TYPE_RINGBUF:
>  		return BPF_DYNPTR_TYPE_RINGBUF;
> +	case DYNPTR_TYPE_SKB:
> +		return BPF_DYNPTR_TYPE_SKB;
>  	default:
>  		return BPF_DYNPTR_TYPE_INVALID;
>  	}
> @@ -1625,6 +1627,12 @@ static bool reg_is_pkt_pointer_any(const struct bpf_reg_state *reg)
>  	       reg->type == PTR_TO_PACKET_END;
>  }
>  
> +static bool reg_is_dynptr_slice_pkt(const struct bpf_reg_state *reg)
> +{
> +	return base_type(reg->type) == PTR_TO_MEM &&
> +		reg->type & DYNPTR_TYPE_SKB;
> +}
> +
>  /* Unmodified PTR_TO_PACKET[_META,_END] register from ctx access. */
>  static bool reg_is_init_pkt_pointer(const struct bpf_reg_state *reg,
>  				    enum bpf_reg_type which)
> @@ -6148,7 +6156,7 @@ static int process_kptr_func(struct bpf_verifier_env *env, int regno,
>   * type, and declare it as 'const struct bpf_dynptr *' in their prototype.
>   */
>  int process_dynptr_func(struct bpf_verifier_env *env, int regno, int insn_idx,
> -			enum bpf_arg_type arg_type)
> +			enum bpf_arg_type arg_type, int func_id)
>  {
>  	struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
>  	int err;
> @@ -6233,6 +6241,9 @@ int process_dynptr_func(struct bpf_verifier_env *env, int regno, int insn_idx,
>  			case DYNPTR_TYPE_RINGBUF:
>  				err_extra = "ringbuf";
>  				break;
> +			case DYNPTR_TYPE_SKB:
> +				err_extra = "skb ";
> +				break;
>  			default:
>  				err_extra = "<unknown>";
>  				break;
> @@ -6581,6 +6592,28 @@ int check_func_arg_reg_off(struct bpf_verifier_env *env,
>  	}
>  }
>  
> +static struct bpf_reg_state *get_dynptr_arg_reg(struct bpf_verifier_env *env,
> +						const struct bpf_func_proto *fn,
> +						struct bpf_reg_state *regs)
> +{
> +	struct bpf_reg_state *state = NULL;
> +	int i;
> +
> +	for (i = 0; i < MAX_BPF_FUNC_REG_ARGS; i++)
> +		if (arg_type_is_dynptr(fn->arg_type[i])) {
> +			if (state) {
> +				verbose(env, "verifier internal error: multiple dynptr args\n");
> +				return NULL;
> +			}
> +			state = &regs[BPF_REG_1 + i];
> +		}
> +
> +	if (!state)
> +		verbose(env, "verifier internal error: no dynptr arg found\n");
> +
> +	return state;
> +}

Looks like refactoring is mixed with new features.
Moving struct bpf_dynptr_kern to a different place and factoring out get_dynptr_arg_reg()
could have been a separate patch to make it easier to review.

> +
>  static int dynptr_id(struct bpf_verifier_env *env, struct bpf_reg_state *reg)
>  {
>  	struct bpf_func_state *state = func(env, reg);
> @@ -6607,6 +6640,24 @@ static int dynptr_ref_obj_id(struct bpf_verifier_env *env, struct bpf_reg_state
>  	return state->stack[spi].spilled_ptr.ref_obj_id;
>  }
>  
> +static enum bpf_dynptr_type dynptr_get_type(struct bpf_verifier_env *env,
> +					    struct bpf_reg_state *reg)
> +{
> +	struct bpf_func_state *state = func(env, reg);
> +	int spi;
> +
> +	if (reg->type == CONST_PTR_TO_DYNPTR)
> +		return reg->dynptr.type;
> +
> +	spi = __get_spi(reg->off);
> +	if (spi < 0) {
> +		verbose(env, "verifier internal error: invalid spi when querying dynptr type\n");
> +		return BPF_DYNPTR_TYPE_INVALID;
> +	}
> +
> +	return state->stack[spi].spilled_ptr.dynptr.type;
> +}
> +
>  static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
>  			  struct bpf_call_arg_meta *meta,
>  			  const struct bpf_func_proto *fn,
> @@ -6819,7 +6870,7 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
>  		err = check_mem_size_reg(env, reg, regno, true, meta);
>  		break;
>  	case ARG_PTR_TO_DYNPTR:
> -		err = process_dynptr_func(env, regno, insn_idx, arg_type);
> +		err = process_dynptr_func(env, regno, insn_idx, arg_type, meta->func_id);
>  		if (err)
>  			return err;
>  		break;
> @@ -7267,6 +7318,9 @@ static int check_func_proto(const struct bpf_func_proto *fn, int func_id)
>  
>  /* Packet data might have moved, any old PTR_TO_PACKET[_META,_END]
>   * are now invalid, so turn them into unknown SCALAR_VALUE.
> + *
> + * This also applies to dynptr slices belonging to skb dynptrs,
> + * since these slices point to packet data.
>   */
>  static void clear_all_pkt_pointers(struct bpf_verifier_env *env)
>  {
> @@ -7274,7 +7328,7 @@ static void clear_all_pkt_pointers(struct bpf_verifier_env *env)
>  	struct bpf_reg_state *reg;
>  
>  	bpf_for_each_reg_in_vstate(env->cur_state, state, reg, ({
> -		if (reg_is_pkt_pointer_any(reg))
> +		if (reg_is_pkt_pointer_any(reg) || reg_is_dynptr_slice_pkt(reg))
>  			__mark_reg_unknown(env, reg);
>  	}));
>  }
> @@ -7958,6 +8012,7 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
>  			     int *insn_idx_p)
>  {
>  	enum bpf_prog_type prog_type = resolve_prog_type(env->prog);
> +	enum bpf_dynptr_type dynptr_type = BPF_DYNPTR_TYPE_INVALID;
>  	const struct bpf_func_proto *fn = NULL;
>  	enum bpf_return_type ret_type;
>  	enum bpf_type_flag ret_flag;
> @@ -8140,43 +8195,61 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
>  		}
>  		break;
>  	case BPF_FUNC_dynptr_data:
> -		for (i = 0; i < MAX_BPF_FUNC_REG_ARGS; i++) {
> -			if (arg_type_is_dynptr(fn->arg_type[i])) {
> -				struct bpf_reg_state *reg = &regs[BPF_REG_1 + i];
> -				int id, ref_obj_id;
> -
> -				if (meta.dynptr_id) {
> -					verbose(env, "verifier internal error: meta.dynptr_id already set\n");
> -					return -EFAULT;
> -				}
> +	{
> +		struct bpf_reg_state *reg;
> +		int id, ref_obj_id;
>  
> -				if (meta.ref_obj_id) {
> -					verbose(env, "verifier internal error: meta.ref_obj_id already set\n");
> -					return -EFAULT;
> -				}
> +		reg = get_dynptr_arg_reg(env, fn, regs);
> +		if (!reg)
> +			return -EFAULT;
>  
> -				id = dynptr_id(env, reg);
> -				if (id < 0) {
> -					verbose(env, "verifier internal error: failed to obtain dynptr id\n");
> -					return id;
> -				}
> +		if (meta.dynptr_id) {
> +			verbose(env, "verifier internal error: meta.dynptr_id already set\n");
> +			return -EFAULT;
> +		}
> +		if (meta.ref_obj_id) {
> +			verbose(env, "verifier internal error: meta.ref_obj_id already set\n");
> +			return -EFAULT;
> +		}
>  
> -				ref_obj_id = dynptr_ref_obj_id(env, reg);
> -				if (ref_obj_id < 0) {
> -					verbose(env, "verifier internal error: failed to obtain dynptr ref_obj_id\n");
> -					return ref_obj_id;
> -				}
> +		id = dynptr_id(env, reg);
> +		if (id < 0) {
> +			verbose(env, "verifier internal error: failed to obtain dynptr id\n");
> +			return id;
> +		}
>  
> -				meta.dynptr_id = id;
> -				meta.ref_obj_id = ref_obj_id;
> -				break;
> -			}
> +		ref_obj_id = dynptr_ref_obj_id(env, reg);
> +		if (ref_obj_id < 0) {
> +			verbose(env, "verifier internal error: failed to obtain dynptr ref_obj_id\n");
> +			return ref_obj_id;
>  		}
> -		if (i == MAX_BPF_FUNC_REG_ARGS) {
> -			verbose(env, "verifier internal error: no dynptr in bpf_dynptr_data()\n");
> +
> +		meta.dynptr_id = id;
> +		meta.ref_obj_id = ref_obj_id;
> +
> +		dynptr_type = dynptr_get_type(env, reg);
> +		if (dynptr_type == BPF_DYNPTR_TYPE_INVALID)
>  			return -EFAULT;
> -		}
> +
>  		break;
> +	}
> +	case BPF_FUNC_dynptr_write:
> +	{
> +		struct bpf_reg_state *reg;
> +
> +		reg = get_dynptr_arg_reg(env, fn, regs);
> +		if (!reg)
> +			return -EFAULT;
> +
> +		dynptr_type = dynptr_get_type(env, reg);
> +		if (dynptr_type == BPF_DYNPTR_TYPE_INVALID)
> +			return -EFAULT;
> +
> +		if (dynptr_type == BPF_DYNPTR_TYPE_SKB)
> +			changes_data = true;
> +
> +		break;
> +	}
>  	case BPF_FUNC_user_ringbuf_drain:
>  		err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
>  					set_user_ringbuf_callback_state);
> @@ -8243,6 +8316,28 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
>  		mark_reg_known_zero(env, regs, BPF_REG_0);
>  		regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
>  		regs[BPF_REG_0].mem_size = meta.mem_size;
> +		if (func_id == BPF_FUNC_dynptr_data &&
> +		    dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> +			bool seen_direct_write = env->seen_direct_write;
> +
> +			regs[BPF_REG_0].type |= DYNPTR_TYPE_SKB;
> +			if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
> +				regs[BPF_REG_0].type |= MEM_RDONLY;
> +			else
> +				/*
> +				 * Calling may_access_direct_pkt_data() will set
> +				 * env->seen_direct_write to true if the skb is
> +				 * writable. As an optimization, we can ignore
> +				 * setting env->seen_direct_write.
> +				 *
> +				 * env->seen_direct_write is used by skb
> +				 * programs to determine whether the skb's page
> +				 * buffers should be cloned. Since data slice
> +				 * writes would only be to the head, we can skip
> +				 * this.
> +				 */
> +				env->seen_direct_write = seen_direct_write;

This looks incorrect. skb head might not be writeable.

> +		}
>  		break;
>  	case RET_PTR_TO_MEM_OR_BTF_ID:
>  	{
> @@ -8649,6 +8744,7 @@ enum special_kfunc_type {
>  	KF_bpf_list_pop_back,
>  	KF_bpf_cast_to_kern_ctx,
>  	KF_bpf_rdonly_cast,
> +	KF_bpf_dynptr_from_skb,
>  	KF_bpf_rcu_read_lock,
>  	KF_bpf_rcu_read_unlock,
>  };
> @@ -8662,6 +8758,7 @@ BTF_ID(func, bpf_list_pop_front)
>  BTF_ID(func, bpf_list_pop_back)
>  BTF_ID(func, bpf_cast_to_kern_ctx)
>  BTF_ID(func, bpf_rdonly_cast)
> +BTF_ID(func, bpf_dynptr_from_skb)
>  BTF_SET_END(special_kfunc_set)
>  
>  BTF_ID_LIST(special_kfunc_list)
> @@ -8673,6 +8770,7 @@ BTF_ID(func, bpf_list_pop_front)
>  BTF_ID(func, bpf_list_pop_back)
>  BTF_ID(func, bpf_cast_to_kern_ctx)
>  BTF_ID(func, bpf_rdonly_cast)
> +BTF_ID(func, bpf_dynptr_from_skb)
>  BTF_ID(func, bpf_rcu_read_lock)
>  BTF_ID(func, bpf_rcu_read_unlock)
>  
> @@ -9263,17 +9361,26 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
>  				return ret;
>  			break;
>  		case KF_ARG_PTR_TO_DYNPTR:
> +		{
> +			enum bpf_arg_type dynptr_arg_type = ARG_PTR_TO_DYNPTR;
> +
>  			if (reg->type != PTR_TO_STACK &&
>  			    reg->type != CONST_PTR_TO_DYNPTR) {
>  				verbose(env, "arg#%d expected pointer to stack or dynptr_ptr\n", i);
>  				return -EINVAL;
>  			}
>  
> -			ret = process_dynptr_func(env, regno, insn_idx,
> -						  ARG_PTR_TO_DYNPTR | MEM_RDONLY);
> +			if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb])
> +				dynptr_arg_type |= MEM_UNINIT | DYNPTR_TYPE_SKB;
> +			else
> +				dynptr_arg_type |= MEM_RDONLY;
> +
> +			ret = process_dynptr_func(env, regno, insn_idx, dynptr_arg_type,
> +						  meta->func_id);
>  			if (ret < 0)
>  				return ret;
>  			break;
> +		}
>  		case KF_ARG_PTR_TO_LIST_HEAD:
>  			if (reg->type != PTR_TO_MAP_VALUE &&
>  			    reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
> @@ -15857,6 +15964,14 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
>  		   desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
>  		insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
>  		*cnt = 1;
> +	} else if (desc->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb]) {
> +		bool is_rdonly = !may_access_direct_pkt_data(env, NULL, BPF_WRITE);
> +		struct bpf_insn addr[2] = { BPF_LD_IMM64(BPF_REG_4, is_rdonly) };

Why use 16-byte insn to pass boolean in R4 ?
Single 8-byte MOV would do.

> +
> +		insn_buf[0] = addr[0];
> +		insn_buf[1] = addr[1];
> +		insn_buf[2] = *insn;
> +		*cnt = 3;
>  	}
>  	return 0;
>  }
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 6da78b3d381e..ddb47126071a 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -1684,8 +1684,8 @@ static inline void bpf_pull_mac_rcsum(struct sk_buff *skb)
>  		skb_postpull_rcsum(skb, skb_mac_header(skb), skb->mac_len);
>  }
>  
> -BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *, skb, u32, offset,
> -	   const void *, from, u32, len, u64, flags)
> +int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from,
> +			  u32 len, u64 flags)

This change is just to be able to call __bpf_skb_store_bytes() ?
If so, it's unnecessary.
See:
BPF_CALL_4(sk_reuseport_load_bytes,
           const struct sk_reuseport_kern *, reuse_kern, u32, offset,
           void *, to, u32, len)
{
        return ____bpf_skb_load_bytes(reuse_kern->skb, offset, to, len);
}

>  {
>  	void *ptr;
>  
> @@ -1710,6 +1710,12 @@ BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *, skb, u32, offset,
>  	return 0;
>  }
>  
> +BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *, skb, u32, offset,
> +	   const void *, from, u32, len, u64, flags)
> +{
> +	return __bpf_skb_store_bytes(skb, offset, from, len, flags);
> +}
> +
>  static const struct bpf_func_proto bpf_skb_store_bytes_proto = {
>  	.func		= bpf_skb_store_bytes,
>  	.gpl_only	= false,
> @@ -1721,8 +1727,7 @@ static const struct bpf_func_proto bpf_skb_store_bytes_proto = {
>  	.arg5_type	= ARG_ANYTHING,
>  };
>  
> -BPF_CALL_4(bpf_skb_load_bytes, const struct sk_buff *, skb, u32, offset,
> -	   void *, to, u32, len)
> +int __bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void *to, u32 len)
>  {
>  	void *ptr;
>  
> @@ -1741,6 +1746,12 @@ BPF_CALL_4(bpf_skb_load_bytes, const struct sk_buff *, skb, u32, offset,
>  	return -EFAULT;
>  }
>  
> +BPF_CALL_4(bpf_skb_load_bytes, const struct sk_buff *, skb, u32, offset,
> +	   void *, to, u32, len)
> +{
> +	return __bpf_skb_load_bytes(skb, offset, to, len);
> +}
> +
>  static const struct bpf_func_proto bpf_skb_load_bytes_proto = {
>  	.func		= bpf_skb_load_bytes,
>  	.gpl_only	= false,
> @@ -1852,6 +1863,22 @@ static const struct bpf_func_proto bpf_skb_pull_data_proto = {
>  	.arg2_type	= ARG_ANYTHING,
>  };
>  
> +int bpf_dynptr_from_skb(struct sk_buff *skb, u64 flags,
> +			struct bpf_dynptr_kern *ptr, int is_rdonly)

It probably needs
__diag_ignore_all("-Wmissing-prototypes",
like other kfuncs to suppress build warn.

> +{
> +	if (flags) {
> +		bpf_dynptr_set_null(ptr);
> +		return -EINVAL;
> +	}
> +
> +	bpf_dynptr_init(ptr, skb, BPF_DYNPTR_TYPE_SKB, 0, skb->len);
> +
> +	if (is_rdonly)
> +		bpf_dynptr_set_rdonly(ptr);
> +
> +	return 0;
> +}
> +
>  BPF_CALL_1(bpf_sk_fullsock, struct sock *, sk)
>  {
>  	return sk_fullsock(sk) ? (unsigned long)sk : (unsigned long)NULL;
> @@ -11607,3 +11634,28 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id)
>  
>  	return func;
>  }
> +
> +BTF_SET8_START(bpf_kfunc_check_set_skb)
> +BTF_ID_FLAGS(func, bpf_dynptr_from_skb)
> +BTF_SET8_END(bpf_kfunc_check_set_skb)
> +
> +static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {
> +	.owner = THIS_MODULE,
> +	.set = &bpf_kfunc_check_set_skb,
> +};
> +
> +static int __init bpf_kfunc_init(void)
> +{
> +	int ret;
> +
> +	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_skb);
> +	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_ACT, &bpf_kfunc_set_skb);
> +	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SK_SKB, &bpf_kfunc_set_skb);
> +	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SOCKET_FILTER, &bpf_kfunc_set_skb);
> +	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SKB, &bpf_kfunc_set_skb);
> +	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_LWT_OUT, &bpf_kfunc_set_skb);
> +	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_LWT_IN, &bpf_kfunc_set_skb);
> +	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_LWT_XMIT, &bpf_kfunc_set_skb);
> +	return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_LWT_SEG6LOCAL, &bpf_kfunc_set_skb);
> +}
> +late_initcall(bpf_kfunc_init);
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 7f024ac22edd..6b58e5a75fc5 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -5320,22 +5320,45 @@ union bpf_attr {
>   *	Description
>   *		Write *len* bytes from *src* into *dst*, starting from *offset*
>   *		into *dst*.
> - *		*flags* is currently unused.
> + *
> + *		*flags* must be 0 except for skb-type dynptrs.
> + *
> + *		For skb-type dynptrs:
> + *		    *  All data slices of the dynptr are automatically
> + *		       invalidated after **bpf_dynptr_write**\ (). If you wish to
> + *		       avoid this, please perform the write using direct data slices
> + *		       instead.
> + *
> + *		    *  For *flags*, please see the flags accepted by
> + *		       **bpf_skb_store_bytes**\ ().
>   *	Return
>   *		0 on success, -E2BIG if *offset* + *len* exceeds the length
>   *		of *dst*'s data, -EINVAL if *dst* is an invalid dynptr or if *dst*
> - *		is a read-only dynptr or if *flags* is not 0.
> + *		is a read-only dynptr or if *flags* is not correct. For skb-type dynptrs,
> + *		other errors correspond to errors returned by **bpf_skb_store_bytes**\ ().
>   *
>   * void *bpf_dynptr_data(const struct bpf_dynptr *ptr, u32 offset, u32 len)
>   *	Description
>   *		Get a pointer to the underlying dynptr data.
>   *
>   *		*len* must be a statically known value. The returned data slice
> - *		is invalidated whenever the dynptr is invalidated.
> - *	Return
> - *		Pointer to the underlying dynptr data, NULL if the dynptr is
> - *		read-only, if the dynptr is invalid, or if the offset and length
> - *		is out of bounds.
> + *		is invalidated whenever the dynptr is invalidated. Please note
> + *		that if the dynptr is read-only, then the returned data slice will
> + *		be read-only.
> + *
> + *		For skb-type dynptrs:
> + *		    * If *offset* + *len* extends into the skb's paged buffers,
> + *		      the user should manually pull the skb with **bpf_skb_pull_data**\ ()
> + *		      and try again.
> + *
> + *		    * The data slice is automatically invalidated anytime
> + *		      **bpf_dynptr_write**\ () or a helper call that changes
> + *		      the underlying packet buffer (eg **bpf_skb_pull_data**\ ())
> + *		      is called.
> + *	Return
> + *		Pointer to the underlying dynptr data, NULL if the dynptr is invalid,
> + *		or if the offset and length is out of bounds or in a paged buffer for
> + *		skb-type dynptrs.
>   *
>   * s64 bpf_tcp_raw_gen_syncookie_ipv4(struct iphdr *iph, struct tcphdr *th, u32 th_len)
>   *	Description
> -- 
> 2.30.2
>
Martin KaFai Lau Jan. 30, 2023, 10:04 p.m. UTC | #2
On 1/27/23 11:17 AM, Joanne Koong wrote:
> @@ -8243,6 +8316,28 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
>   		mark_reg_known_zero(env, regs, BPF_REG_0);
>   		regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
>   		regs[BPF_REG_0].mem_size = meta.mem_size;
> +		if (func_id == BPF_FUNC_dynptr_data &&
> +		    dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> +			bool seen_direct_write = env->seen_direct_write;
> +
> +			regs[BPF_REG_0].type |= DYNPTR_TYPE_SKB;
> +			if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
> +				regs[BPF_REG_0].type |= MEM_RDONLY;
> +			else
> +				/*
> +				 * Calling may_access_direct_pkt_data() will set
> +				 * env->seen_direct_write to true if the skb is
> +				 * writable. As an optimization, we can ignore
> +				 * setting env->seen_direct_write.
> +				 *
> +				 * env->seen_direct_write is used by skb
> +				 * programs to determine whether the skb's page
> +				 * buffers should be cloned. Since data slice
> +				 * writes would only be to the head, we can skip
> +				 * this.
> +				 */
> +				env->seen_direct_write = seen_direct_write;
> +		}

[ ... ]

> @@ -9263,17 +9361,26 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
>   				return ret;
>   			break;
>   		case KF_ARG_PTR_TO_DYNPTR:
> +		{
> +			enum bpf_arg_type dynptr_arg_type = ARG_PTR_TO_DYNPTR;
> +
>   			if (reg->type != PTR_TO_STACK &&
>   			    reg->type != CONST_PTR_TO_DYNPTR) {
>   				verbose(env, "arg#%d expected pointer to stack or dynptr_ptr\n", i);
>   				return -EINVAL;
>   			}
>   
> -			ret = process_dynptr_func(env, regno, insn_idx,
> -						  ARG_PTR_TO_DYNPTR | MEM_RDONLY);
> +			if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb])
> +				dynptr_arg_type |= MEM_UNINIT | DYNPTR_TYPE_SKB;
> +			else
> +				dynptr_arg_type |= MEM_RDONLY;
> +
> +			ret = process_dynptr_func(env, regno, insn_idx, dynptr_arg_type,
> +						  meta->func_id);
>   			if (ret < 0)
>   				return ret;
>   			break;
> +		}
>   		case KF_ARG_PTR_TO_LIST_HEAD:
>   			if (reg->type != PTR_TO_MAP_VALUE &&
>   			    reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
> @@ -15857,6 +15964,14 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
>   		   desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
>   		insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
>   		*cnt = 1;
> +	} else if (desc->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb]) {
> +		bool is_rdonly = !may_access_direct_pkt_data(env, NULL, BPF_WRITE);

Does it need to restore the env->seen_direct_write here also?

It seems this 'seen_direct_write' saving/restoring is needed now because 
'may_access_direct_pkt_data(BPF_WRITE)' is not only called when it is actually 
writing the packet. Some refactoring can help to avoid issue like this.

While at 'seen_direct_write', Alexei has also pointed out that the verifier 
needs to track whether the (packet) 'slice' returned by bpf_dynptr_data() has 
been written. It should be tracked in 'seen_direct_write'. Take a look at how 
reg_is_pkt_pointer() and may_access_direct_pkt_data() are done in 
check_mem_access(). iirc, this reg_is_pkt_pointer() part got loss somewhere in 
v5 (or v4?) when bpf_dynptr_data() was changed to return register typed 
PTR_TO_MEM instead of PTR_TO_PACKET.


[ ... ]

> +int bpf_dynptr_from_skb(struct sk_buff *skb, u64 flags,
> +			struct bpf_dynptr_kern *ptr, int is_rdonly)

hmm... this exposed kfunc takes "int is_rdonly".

What if the bpf prog calls it like bpf_dynptr_from_skb(..., false) in some hook 
that is not writable to packet?

> +{
> +	if (flags) {
> +		bpf_dynptr_set_null(ptr);
> +		return -EINVAL;
> +	}
> +
> +	bpf_dynptr_init(ptr, skb, BPF_DYNPTR_TYPE_SKB, 0, skb->len);
> +
> +	if (is_rdonly)
> +		bpf_dynptr_set_rdonly(ptr);
> +
> +	return 0;
> +}
> +
>   BPF_CALL_1(bpf_sk_fullsock, struct sock *, sk)
>   {
>   	return sk_fullsock(sk) ? (unsigned long)sk : (unsigned long)NULL;
> @@ -11607,3 +11634,28 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id)
>   
>   	return func;
>   }
> +
> +BTF_SET8_START(bpf_kfunc_check_set_skb)
> +BTF_ID_FLAGS(func, bpf_dynptr_from_skb)
> +BTF_SET8_END(bpf_kfunc_check_set_skb)
> +
> +static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {
> +	.owner = THIS_MODULE,
> +	.set = &bpf_kfunc_check_set_skb,
> +};
> +
> +static int __init bpf_kfunc_init(void)
> +{
> +	int ret;
> +
> +	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_skb);
> +	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_ACT, &bpf_kfunc_set_skb);
> +	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SK_SKB, &bpf_kfunc_set_skb);
> +	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SOCKET_FILTER, &bpf_kfunc_set_skb);
> +	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SKB, &bpf_kfunc_set_skb);
> +	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_LWT_OUT, &bpf_kfunc_set_skb);
> +	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_LWT_IN, &bpf_kfunc_set_skb);
> +	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_LWT_XMIT, &bpf_kfunc_set_skb);
> +	return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_LWT_SEG6LOCAL, &bpf_kfunc_set_skb);
> +}
> +late_initcall(bpf_kfunc_init);
Alexei Starovoitov Jan. 30, 2023, 10:31 p.m. UTC | #3
On Mon, Jan 30, 2023 at 02:04:08PM -0800, Martin KaFai Lau wrote:
> On 1/27/23 11:17 AM, Joanne Koong wrote:
> > @@ -8243,6 +8316,28 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> >   		mark_reg_known_zero(env, regs, BPF_REG_0);
> >   		regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> >   		regs[BPF_REG_0].mem_size = meta.mem_size;
> > +		if (func_id == BPF_FUNC_dynptr_data &&
> > +		    dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> > +			bool seen_direct_write = env->seen_direct_write;
> > +
> > +			regs[BPF_REG_0].type |= DYNPTR_TYPE_SKB;
> > +			if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
> > +				regs[BPF_REG_0].type |= MEM_RDONLY;
> > +			else
> > +				/*
> > +				 * Calling may_access_direct_pkt_data() will set
> > +				 * env->seen_direct_write to true if the skb is
> > +				 * writable. As an optimization, we can ignore
> > +				 * setting env->seen_direct_write.
> > +				 *
> > +				 * env->seen_direct_write is used by skb
> > +				 * programs to determine whether the skb's page
> > +				 * buffers should be cloned. Since data slice
> > +				 * writes would only be to the head, we can skip
> > +				 * this.
> > +				 */
> > +				env->seen_direct_write = seen_direct_write;
> > +		}
> 
> [ ... ]
> 
> > @@ -9263,17 +9361,26 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
> >   				return ret;
> >   			break;
> >   		case KF_ARG_PTR_TO_DYNPTR:
> > +		{
> > +			enum bpf_arg_type dynptr_arg_type = ARG_PTR_TO_DYNPTR;
> > +
> >   			if (reg->type != PTR_TO_STACK &&
> >   			    reg->type != CONST_PTR_TO_DYNPTR) {
> >   				verbose(env, "arg#%d expected pointer to stack or dynptr_ptr\n", i);
> >   				return -EINVAL;
> >   			}
> > -			ret = process_dynptr_func(env, regno, insn_idx,
> > -						  ARG_PTR_TO_DYNPTR | MEM_RDONLY);
> > +			if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb])
> > +				dynptr_arg_type |= MEM_UNINIT | DYNPTR_TYPE_SKB;
> > +			else
> > +				dynptr_arg_type |= MEM_RDONLY;
> > +
> > +			ret = process_dynptr_func(env, regno, insn_idx, dynptr_arg_type,
> > +						  meta->func_id);
> >   			if (ret < 0)
> >   				return ret;
> >   			break;
> > +		}
> >   		case KF_ARG_PTR_TO_LIST_HEAD:
> >   			if (reg->type != PTR_TO_MAP_VALUE &&
> >   			    reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
> > @@ -15857,6 +15964,14 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> >   		   desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
> >   		insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
> >   		*cnt = 1;
> > +	} else if (desc->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb]) {
> > +		bool is_rdonly = !may_access_direct_pkt_data(env, NULL, BPF_WRITE);
> 
> Does it need to restore the env->seen_direct_write here also?
> 
> It seems this 'seen_direct_write' saving/restoring is needed now because
> 'may_access_direct_pkt_data(BPF_WRITE)' is not only called when it is
> actually writing the packet. Some refactoring can help to avoid issue like
> this.
> 
> While at 'seen_direct_write', Alexei has also pointed out that the verifier
> needs to track whether the (packet) 'slice' returned by bpf_dynptr_data()
> has been written. It should be tracked in 'seen_direct_write'. Take a look
> at how reg_is_pkt_pointer() and may_access_direct_pkt_data() are done in
> check_mem_access(). iirc, this reg_is_pkt_pointer() part got loss somewhere
> in v5 (or v4?) when bpf_dynptr_data() was changed to return register typed
> PTR_TO_MEM instead of PTR_TO_PACKET.

btw tc progs are using gen_prologue() approach because data/data_end are not kfuncs
(nothing is being called by the bpf prog).
In this case we don't need to repeat this approach. If so we don't need to
set seen_direct_write.
Instead bpf_dynptr_data() can call bpf_skb_pull_data() directly.
And technically we don't need to limit it to skb head. It can handle any off/len.
It will work for skb, but there is no equivalent for xdp_pull_data().
I don't think we can implement xdp_pull_data in all drivers.
That's massive amount of work, but we need to be consistent if we want
dynptr to wrap both skb and xdp.
We can say dynptr_data is for head only, but we've seen bugs where people
had to switch from data/data_end to load_bytes.

Also bpf_skb_pull_data is quite heavy. For progs that only want to parse
the packet calling that in bpf_dynptr_data is a heavy hammer.

It feels that we need to go back to skb_header_pointer-like discussion.
Something like:
bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len, void *buffer)
Whether buffer is a part of dynptr or program provided is tbd.
Martin KaFai Lau Jan. 30, 2023, 11:11 p.m. UTC | #4
On 1/30/23 2:31 PM, Alexei Starovoitov wrote:
> On Mon, Jan 30, 2023 at 02:04:08PM -0800, Martin KaFai Lau wrote:
>> On 1/27/23 11:17 AM, Joanne Koong wrote:
>>> @@ -8243,6 +8316,28 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
>>>    		mark_reg_known_zero(env, regs, BPF_REG_0);
>>>    		regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
>>>    		regs[BPF_REG_0].mem_size = meta.mem_size;
>>> +		if (func_id == BPF_FUNC_dynptr_data &&
>>> +		    dynptr_type == BPF_DYNPTR_TYPE_SKB) {
>>> +			bool seen_direct_write = env->seen_direct_write;
>>> +
>>> +			regs[BPF_REG_0].type |= DYNPTR_TYPE_SKB;
>>> +			if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
>>> +				regs[BPF_REG_0].type |= MEM_RDONLY;
>>> +			else
>>> +				/*
>>> +				 * Calling may_access_direct_pkt_data() will set
>>> +				 * env->seen_direct_write to true if the skb is
>>> +				 * writable. As an optimization, we can ignore
>>> +				 * setting env->seen_direct_write.
>>> +				 *
>>> +				 * env->seen_direct_write is used by skb
>>> +				 * programs to determine whether the skb's page
>>> +				 * buffers should be cloned. Since data slice
>>> +				 * writes would only be to the head, we can skip
>>> +				 * this.
>>> +				 */
>>> +				env->seen_direct_write = seen_direct_write;
>>> +		}
>>
>> [ ... ]
>>
>>> @@ -9263,17 +9361,26 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
>>>    				return ret;
>>>    			break;
>>>    		case KF_ARG_PTR_TO_DYNPTR:
>>> +		{
>>> +			enum bpf_arg_type dynptr_arg_type = ARG_PTR_TO_DYNPTR;
>>> +
>>>    			if (reg->type != PTR_TO_STACK &&
>>>    			    reg->type != CONST_PTR_TO_DYNPTR) {
>>>    				verbose(env, "arg#%d expected pointer to stack or dynptr_ptr\n", i);
>>>    				return -EINVAL;
>>>    			}
>>> -			ret = process_dynptr_func(env, regno, insn_idx,
>>> -						  ARG_PTR_TO_DYNPTR | MEM_RDONLY);
>>> +			if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb])
>>> +				dynptr_arg_type |= MEM_UNINIT | DYNPTR_TYPE_SKB;
>>> +			else
>>> +				dynptr_arg_type |= MEM_RDONLY;
>>> +
>>> +			ret = process_dynptr_func(env, regno, insn_idx, dynptr_arg_type,
>>> +						  meta->func_id);
>>>    			if (ret < 0)
>>>    				return ret;
>>>    			break;
>>> +		}
>>>    		case KF_ARG_PTR_TO_LIST_HEAD:
>>>    			if (reg->type != PTR_TO_MAP_VALUE &&
>>>    			    reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
>>> @@ -15857,6 +15964,14 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
>>>    		   desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
>>>    		insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
>>>    		*cnt = 1;
>>> +	} else if (desc->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb]) {
>>> +		bool is_rdonly = !may_access_direct_pkt_data(env, NULL, BPF_WRITE);
>>
>> Does it need to restore the env->seen_direct_write here also?
>>
>> It seems this 'seen_direct_write' saving/restoring is needed now because
>> 'may_access_direct_pkt_data(BPF_WRITE)' is not only called when it is
>> actually writing the packet. Some refactoring can help to avoid issue like
>> this.
>>
>> While at 'seen_direct_write', Alexei has also pointed out that the verifier
>> needs to track whether the (packet) 'slice' returned by bpf_dynptr_data()
>> has been written. It should be tracked in 'seen_direct_write'. Take a look
>> at how reg_is_pkt_pointer() and may_access_direct_pkt_data() are done in
>> check_mem_access(). iirc, this reg_is_pkt_pointer() part got loss somewhere
>> in v5 (or v4?) when bpf_dynptr_data() was changed to return register typed
>> PTR_TO_MEM instead of PTR_TO_PACKET.
> 
> btw tc progs are using gen_prologue() approach because data/data_end are not kfuncs
> (nothing is being called by the bpf prog).
> In this case we don't need to repeat this approach. If so we don't need to
> set seen_direct_write.
> Instead bpf_dynptr_data() can call bpf_skb_pull_data() directly.
> And technically we don't need to limit it to skb head. It can handle any off/len.
> It will work for skb, but there is no equivalent for xdp_pull_data().
> I don't think we can implement xdp_pull_data in all drivers.
> That's massive amount of work, but we need to be consistent if we want
> dynptr to wrap both skb and xdp.
> We can say dynptr_data is for head only, but we've seen bugs where people
> had to switch from data/data_end to load_bytes.
> 
> Also bpf_skb_pull_data is quite heavy. For progs that only want to parse
> the packet calling that in bpf_dynptr_data is a heavy hammer.
> 
> It feels that we need to go back to skb_header_pointer-like discussion.
> Something like:
> bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len, void *buffer)
> Whether buffer is a part of dynptr or program provided is tbd.
It would be nice if it can get something similar to skb_header_pointer(). I 
think Jakub has mentioned that also.

No pull in the common case. Copy to 'buffer' when it is not in head (for skb) or 
across frags (for xdp). In the copy-to-'buffer' case and the bpf prog did change 
the 'buffer', the bpf prog will be responsible to call bpf_dynptr_write() to 
write back to the skb/xdp if needed?
Joanne Koong Jan. 31, 2023, 12:44 a.m. UTC | #5
On Sun, Jan 29, 2023 at 3:39 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Jan 27, 2023 at 11:17:01AM -0800, Joanne Koong wrote:
> > Add skb dynptrs, which are dynptrs whose underlying pointer points
> > to a skb. The dynptr acts on skb data. skb dynptrs have two main
> > benefits. One is that they allow operations on sizes that are not
> > statically known at compile-time (eg variable-sized accesses).
> > Another is that parsing the packet data through dynptrs (instead of
> > through direct access of skb->data and skb->data_end) can be more
> > ergonomic and less brittle (eg does not need manual if checking for
> > being within bounds of data_end).
> >
> > For bpf prog types that don't support writes on skb data, the dynptr is
> > read-only (bpf_dynptr_write() will return an error and bpf_dynptr_data()
> > will return a data slice that is read-only where any writes to it will
> > be rejected by the verifier).
> >
> > For reads and writes through the bpf_dynptr_read() and bpf_dynptr_write()
> > interfaces, reading and writing from/to data in the head as well as from/to
> > non-linear paged buffers is supported. For data slices (through the
> > bpf_dynptr_data() interface), if the data is in a paged buffer, the user
> > must first call bpf_skb_pull_data() to pull the data into the linear
> > portion.
>
> Looks like there is an assumption in parts of this patch that
> linear part of skb is always writeable. That's not the case.
> See if (ops->gen_prologue || env->seen_direct_write) in convert_ctx_accesses().
> For TC progs it calls bpf_unclone_prologue() which adds hidden
> bpf_skb_pull_data() in the beginning of the prog to make it writeable.

I think we can make this assumption? For writable progs (referenced in
the may_access_direct_pkt_data() function), all of them have a
gen_prologue that unclones the buffer (eg tc_cls_act, lwt_xmit, sk_skb
progs) or their linear portion is okay to write into by default (eg
xdp, sk_msg, cg_sockopt progs).

>
> > Any bpf_dynptr_write() automatically invalidates any prior data slices
> > to the skb dynptr. This is because a bpf_dynptr_write() may be writing
> > to data in a paged buffer, so it will need to pull the buffer first into
> > the head. The reason it needs to be pulled instead of writing directly to
> > the paged buffers is because they may be cloned (only the head of the skb
> > is by default uncloned). As such, any bpf_dynptr_write() will
> > automatically have its prior data slices invalidated, even if the write
> > is to data in the skb head (the verifier has no way of differentiating
> > whether the write is to the head or paged buffers during program load
> > time).
>
> Could you explain the workflow how bpf_dynptr_write() invalidates other
> pkt pointers ?
> I expected bpf_dynptr_write() to be in bpf_helper_changes_pkt_data().
> Looks like bpf_dynptr_write() calls bpf_skb_store_bytes() underneath,
> but that doesn't help the verifier.

In the verifier in check_helper_call(), for the BPF_FUNC_dynptr_write
case (line 8236) the "changes_data" variable gets set to true if the
dynptr is an skb type. At the end of check_helper_call() on line 8474,
since "changes_data" is true, clear_all_pkt_pointer() gets called,
which invalidates the other packet pointers.

>
> > Please note as well that any other helper calls that change the
> > underlying packet buffer (eg bpf_skb_pull_data()) invalidates any data
> > slices of the skb dynptr as well. The stack trace for this is
> > check_helper_call() -> clear_all_pkt_pointers() ->
> > __clear_all_pkt_pointers() -> mark_reg_unknown().
>
> __clear_all_pkt_pointers isn't present in the tree. Typo ?

I'll update this message, clear_all_pkt_pointers() and
__clear_all_pkt_pointers() were combined in a previous commit.

>
> >
> > For examples of how skb dynptrs can be used, please see the attached
> > selftests.
> >
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > ---
> >  include/linux/bpf.h            |  82 +++++++++------
> >  include/linux/filter.h         |  18 ++++
> >  include/uapi/linux/bpf.h       |  37 +++++--
> >  kernel/bpf/btf.c               |  18 ++++
> >  kernel/bpf/helpers.c           |  95 ++++++++++++++---
> >  kernel/bpf/verifier.c          | 185 ++++++++++++++++++++++++++-------
> >  net/core/filter.c              |  60 ++++++++++-
> >  tools/include/uapi/linux/bpf.h |  37 +++++--
> >  8 files changed, 432 insertions(+), 100 deletions(-)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 14a0264fac57..1ac061b64582 100644
[...]
> > @@ -8243,6 +8316,28 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> >               mark_reg_known_zero(env, regs, BPF_REG_0);
> >               regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> >               regs[BPF_REG_0].mem_size = meta.mem_size;
> > +             if (func_id == BPF_FUNC_dynptr_data &&
> > +                 dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> > +                     bool seen_direct_write = env->seen_direct_write;
> > +
> > +                     regs[BPF_REG_0].type |= DYNPTR_TYPE_SKB;
> > +                     if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
> > +                             regs[BPF_REG_0].type |= MEM_RDONLY;
> > +                     else
> > +                             /*
> > +                              * Calling may_access_direct_pkt_data() will set
> > +                              * env->seen_direct_write to true if the skb is
> > +                              * writable. As an optimization, we can ignore
> > +                              * setting env->seen_direct_write.
> > +                              *
> > +                              * env->seen_direct_write is used by skb
> > +                              * programs to determine whether the skb's page
> > +                              * buffers should be cloned. Since data slice
> > +                              * writes would only be to the head, we can skip
> > +                              * this.
> > +                              */
> > +                             env->seen_direct_write = seen_direct_write;
>
> This looks incorrect. skb head might not be writeable.
>
> > +             }
> >               break;
> >       case RET_PTR_TO_MEM_OR_BTF_ID:
> >       {
> > @@ -8649,6 +8744,7 @@ enum special_kfunc_type {
> >       KF_bpf_list_pop_back,
> >       KF_bpf_cast_to_kern_ctx,
> >       KF_bpf_rdonly_cast,
> > +     KF_bpf_dynptr_from_skb,
> >       KF_bpf_rcu_read_lock,
> >       KF_bpf_rcu_read_unlock,
> >  };
> > @@ -8662,6 +8758,7 @@ BTF_ID(func, bpf_list_pop_front)
> >  BTF_ID(func, bpf_list_pop_back)
> >  BTF_ID(func, bpf_cast_to_kern_ctx)
> >  BTF_ID(func, bpf_rdonly_cast)
> > +BTF_ID(func, bpf_dynptr_from_skb)
> >  BTF_SET_END(special_kfunc_set)
> >
> >  BTF_ID_LIST(special_kfunc_list)
> > @@ -8673,6 +8770,7 @@ BTF_ID(func, bpf_list_pop_front)
> >  BTF_ID(func, bpf_list_pop_back)
> >  BTF_ID(func, bpf_cast_to_kern_ctx)
> >  BTF_ID(func, bpf_rdonly_cast)
> > +BTF_ID(func, bpf_dynptr_from_skb)
> >  BTF_ID(func, bpf_rcu_read_lock)
> >  BTF_ID(func, bpf_rcu_read_unlock)
> >
> > @@ -9263,17 +9361,26 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
> >                               return ret;
> >                       break;
> >               case KF_ARG_PTR_TO_DYNPTR:
> > +             {
> > +                     enum bpf_arg_type dynptr_arg_type = ARG_PTR_TO_DYNPTR;
> > +
> >                       if (reg->type != PTR_TO_STACK &&
> >                           reg->type != CONST_PTR_TO_DYNPTR) {
> >                               verbose(env, "arg#%d expected pointer to stack or dynptr_ptr\n", i);
> >                               return -EINVAL;
> >                       }
> >
> > -                     ret = process_dynptr_func(env, regno, insn_idx,
> > -                                               ARG_PTR_TO_DYNPTR | MEM_RDONLY);
> > +                     if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb])
> > +                             dynptr_arg_type |= MEM_UNINIT | DYNPTR_TYPE_SKB;
> > +                     else
> > +                             dynptr_arg_type |= MEM_RDONLY;
> > +
> > +                     ret = process_dynptr_func(env, regno, insn_idx, dynptr_arg_type,
> > +                                               meta->func_id);
> >                       if (ret < 0)
> >                               return ret;
> >                       break;
> > +             }
> >               case KF_ARG_PTR_TO_LIST_HEAD:
> >                       if (reg->type != PTR_TO_MAP_VALUE &&
> >                           reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
> > @@ -15857,6 +15964,14 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> >                  desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
> >               insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
> >               *cnt = 1;
> > +     } else if (desc->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb]) {
> > +             bool is_rdonly = !may_access_direct_pkt_data(env, NULL, BPF_WRITE);
> > +             struct bpf_insn addr[2] = { BPF_LD_IMM64(BPF_REG_4, is_rdonly) };
>
> Why use 16-byte insn to pass boolean in R4 ?
> Single 8-byte MOV would do.

Great, I'll change it to a 8-byte MOV

>
> > +
> > +             insn_buf[0] = addr[0];
> > +             insn_buf[1] = addr[1];
> > +             insn_buf[2] = *insn;
> > +             *cnt = 3;
> >       }
> >       return 0;
> >  }
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 6da78b3d381e..ddb47126071a 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -1684,8 +1684,8 @@ static inline void bpf_pull_mac_rcsum(struct sk_buff *skb)
> >               skb_postpull_rcsum(skb, skb_mac_header(skb), skb->mac_len);
> >  }
> >
> > -BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *, skb, u32, offset,
> > -        const void *, from, u32, len, u64, flags)
> > +int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from,
> > +                       u32 len, u64 flags)
>
> This change is just to be able to call __bpf_skb_store_bytes() ?
> If so, it's unnecessary.
> See:
> BPF_CALL_4(sk_reuseport_load_bytes,
>            const struct sk_reuseport_kern *, reuse_kern, u32, offset,
>            void *, to, u32, len)
> {
>         return ____bpf_skb_load_bytes(reuse_kern->skb, offset, to, len);
> }
>

There was prior feedback [0] that using four underscores to call a
helper function is confusing and makes it ungreppable

[0] https://lore.kernel.org/bpf/CAEf4Bzaz4=tEvESd_twhx1bdepdOP3L4SmUiaKqGFJtX=CJruQ@mail.gmail.com/

> >  {
> >       void *ptr;
> >
> > @@ -1710,6 +1710,12 @@ BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *, skb, u32, offset,
> >       return 0;
> >  }
> >
> > +BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *, skb, u32, offset,
> > +        const void *, from, u32, len, u64, flags)
> > +{
> > +     return __bpf_skb_store_bytes(skb, offset, from, len, flags);
> > +}
> > +
[...]
> > @@ -1852,6 +1863,22 @@ static const struct bpf_func_proto bpf_skb_pull_data_proto = {
> >       .arg2_type      = ARG_ANYTHING,
> >  };
> >
> > +int bpf_dynptr_from_skb(struct sk_buff *skb, u64 flags,
> > +                     struct bpf_dynptr_kern *ptr, int is_rdonly)
>
> It probably needs
> __diag_ignore_all("-Wmissing-prototypes",
> like other kfuncs to suppress build warn.
>

Awesome, thanks. I'll add this in.

> > +{
> > +     if (flags) {
> > +             bpf_dynptr_set_null(ptr);
> > +             return -EINVAL;
> > +     }
> > +
> > +     bpf_dynptr_init(ptr, skb, BPF_DYNPTR_TYPE_SKB, 0, skb->len);
> > +
> > +     if (is_rdonly)
> > +             bpf_dynptr_set_rdonly(ptr);
> > +
> > +     return 0;
> > +}
> > +
> >  BPF_CALL_1(bpf_sk_fullsock, struct sock *, sk)
[...]
> > --
> > 2.30.2
> >
Andrii Nakryiko Jan. 31, 2023, 12:48 a.m. UTC | #6
On Fri, Jan 27, 2023 at 11:18 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> Add skb dynptrs, which are dynptrs whose underlying pointer points
> to a skb. The dynptr acts on skb data. skb dynptrs have two main
> benefits. One is that they allow operations on sizes that are not
> statically known at compile-time (eg variable-sized accesses).
> Another is that parsing the packet data through dynptrs (instead of
> through direct access of skb->data and skb->data_end) can be more
> ergonomic and less brittle (eg does not need manual if checking for
> being within bounds of data_end).
>
> For bpf prog types that don't support writes on skb data, the dynptr is
> read-only (bpf_dynptr_write() will return an error and bpf_dynptr_data()
> will return a data slice that is read-only where any writes to it will
> be rejected by the verifier).
>
> For reads and writes through the bpf_dynptr_read() and bpf_dynptr_write()
> interfaces, reading and writing from/to data in the head as well as from/to
> non-linear paged buffers is supported. For data slices (through the
> bpf_dynptr_data() interface), if the data is in a paged buffer, the user
> must first call bpf_skb_pull_data() to pull the data into the linear
> portion.
>
> Any bpf_dynptr_write() automatically invalidates any prior data slices
> to the skb dynptr. This is because a bpf_dynptr_write() may be writing
> to data in a paged buffer, so it will need to pull the buffer first into
> the head. The reason it needs to be pulled instead of writing directly to
> the paged buffers is because they may be cloned (only the head of the skb
> is by default uncloned). As such, any bpf_dynptr_write() will
> automatically have its prior data slices invalidated, even if the write
> is to data in the skb head (the verifier has no way of differentiating
> whether the write is to the head or paged buffers during program load
> time). Please note as well that any other helper calls that change the
> underlying packet buffer (eg bpf_skb_pull_data()) invalidates any data
> slices of the skb dynptr as well. The stack trace for this is
> check_helper_call() -> clear_all_pkt_pointers() ->
> __clear_all_pkt_pointers() -> mark_reg_unknown().
>
> For examples of how skb dynptrs can be used, please see the attached
> selftests.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
>  include/linux/bpf.h            |  82 +++++++++------
>  include/linux/filter.h         |  18 ++++
>  include/uapi/linux/bpf.h       |  37 +++++--
>  kernel/bpf/btf.c               |  18 ++++
>  kernel/bpf/helpers.c           |  95 ++++++++++++++---
>  kernel/bpf/verifier.c          | 185 ++++++++++++++++++++++++++-------
>  net/core/filter.c              |  60 ++++++++++-
>  tools/include/uapi/linux/bpf.h |  37 +++++--
>  8 files changed, 432 insertions(+), 100 deletions(-)
>

[...]

>  static const struct bpf_func_proto bpf_dynptr_write_proto = {
> @@ -1560,6 +1595,8 @@ static const struct bpf_func_proto bpf_dynptr_write_proto = {
>
>  BPF_CALL_3(bpf_dynptr_data, const struct bpf_dynptr_kern *, ptr, u32, offset, u32, len)
>  {
> +       enum bpf_dynptr_type type;
> +       void *data;
>         int err;
>
>         if (!ptr->data)
> @@ -1569,10 +1606,36 @@ BPF_CALL_3(bpf_dynptr_data, const struct bpf_dynptr_kern *, ptr, u32, offset, u3
>         if (err)
>                 return 0;
>
> -       if (bpf_dynptr_is_rdonly(ptr))
> -               return 0;
> +       type = bpf_dynptr_get_type(ptr);
> +
> +       switch (type) {
> +       case BPF_DYNPTR_TYPE_LOCAL:
> +       case BPF_DYNPTR_TYPE_RINGBUF:
> +               if (bpf_dynptr_is_rdonly(ptr))
> +                       return 0;

will something break if we return ptr->data for read-only LOCAL/RINGBUF dynptr?

> +
> +               data = ptr->data;
> +               break;
> +       case BPF_DYNPTR_TYPE_SKB:
> +       {
> +               struct sk_buff *skb = ptr->data;
>

[...]
Joanne Koong Jan. 31, 2023, 12:55 a.m. UTC | #7
On Mon, Jan 30, 2023 at 4:48 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Fri, Jan 27, 2023 at 11:18 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > Add skb dynptrs, which are dynptrs whose underlying pointer points
> > to a skb. The dynptr acts on skb data. skb dynptrs have two main
> > benefits. One is that they allow operations on sizes that are not
> > statically known at compile-time (eg variable-sized accesses).
> > Another is that parsing the packet data through dynptrs (instead of
> > through direct access of skb->data and skb->data_end) can be more
> > ergonomic and less brittle (eg does not need manual if checking for
> > being within bounds of data_end).
> >
> > For bpf prog types that don't support writes on skb data, the dynptr is
> > read-only (bpf_dynptr_write() will return an error and bpf_dynptr_data()
> > will return a data slice that is read-only where any writes to it will
> > be rejected by the verifier).
> >
> > For reads and writes through the bpf_dynptr_read() and bpf_dynptr_write()
> > interfaces, reading and writing from/to data in the head as well as from/to
> > non-linear paged buffers is supported. For data slices (through the
> > bpf_dynptr_data() interface), if the data is in a paged buffer, the user
> > must first call bpf_skb_pull_data() to pull the data into the linear
> > portion.
> >
> > Any bpf_dynptr_write() automatically invalidates any prior data slices
> > to the skb dynptr. This is because a bpf_dynptr_write() may be writing
> > to data in a paged buffer, so it will need to pull the buffer first into
> > the head. The reason it needs to be pulled instead of writing directly to
> > the paged buffers is because they may be cloned (only the head of the skb
> > is by default uncloned). As such, any bpf_dynptr_write() will
> > automatically have its prior data slices invalidated, even if the write
> > is to data in the skb head (the verifier has no way of differentiating
> > whether the write is to the head or paged buffers during program load
> > time). Please note as well that any other helper calls that change the
> > underlying packet buffer (eg bpf_skb_pull_data()) invalidates any data
> > slices of the skb dynptr as well. The stack trace for this is
> > check_helper_call() -> clear_all_pkt_pointers() ->
> > __clear_all_pkt_pointers() -> mark_reg_unknown().
> >
> > For examples of how skb dynptrs can be used, please see the attached
> > selftests.
> >
> > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > ---
> >  include/linux/bpf.h            |  82 +++++++++------
> >  include/linux/filter.h         |  18 ++++
> >  include/uapi/linux/bpf.h       |  37 +++++--
> >  kernel/bpf/btf.c               |  18 ++++
> >  kernel/bpf/helpers.c           |  95 ++++++++++++++---
> >  kernel/bpf/verifier.c          | 185 ++++++++++++++++++++++++++-------
> >  net/core/filter.c              |  60 ++++++++++-
> >  tools/include/uapi/linux/bpf.h |  37 +++++--
> >  8 files changed, 432 insertions(+), 100 deletions(-)
> >
>
> [...]
>
> >  static const struct bpf_func_proto bpf_dynptr_write_proto = {
> > @@ -1560,6 +1595,8 @@ static const struct bpf_func_proto bpf_dynptr_write_proto = {
> >
> >  BPF_CALL_3(bpf_dynptr_data, const struct bpf_dynptr_kern *, ptr, u32, offset, u32, len)
> >  {
> > +       enum bpf_dynptr_type type;
> > +       void *data;
> >         int err;
> >
> >         if (!ptr->data)
> > @@ -1569,10 +1606,36 @@ BPF_CALL_3(bpf_dynptr_data, const struct bpf_dynptr_kern *, ptr, u32, offset, u3
> >         if (err)
> >                 return 0;
> >
> > -       if (bpf_dynptr_is_rdonly(ptr))
> > -               return 0;
> > +       type = bpf_dynptr_get_type(ptr);
> > +
> > +       switch (type) {
> > +       case BPF_DYNPTR_TYPE_LOCAL:
> > +       case BPF_DYNPTR_TYPE_RINGBUF:
> > +               if (bpf_dynptr_is_rdonly(ptr))
> > +                       return 0;
>
> will something break if we return ptr->data for read-only LOCAL/RINGBUF dynptr?

There will be nothing guarding against direct writes into read-only
LOCAL/RINGBUF dynptrs if we return ptr->data. For skb type dynptrs,
it's guarded by the ptr->data return pointer being marked as
MEM_RDONLY in the verifier if the skb is non-writable.

>
> > +
> > +               data = ptr->data;
> > +               break;
> > +       case BPF_DYNPTR_TYPE_SKB:
> > +       {
> > +               struct sk_buff *skb = ptr->data;
> >
>
> [...]
Andrii Nakryiko Jan. 31, 2023, 1:04 a.m. UTC | #8
On Mon, Jan 30, 2023 at 2:31 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Jan 30, 2023 at 02:04:08PM -0800, Martin KaFai Lau wrote:
> > On 1/27/23 11:17 AM, Joanne Koong wrote:
> > > @@ -8243,6 +8316,28 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> > >             mark_reg_known_zero(env, regs, BPF_REG_0);
> > >             regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> > >             regs[BPF_REG_0].mem_size = meta.mem_size;
> > > +           if (func_id == BPF_FUNC_dynptr_data &&
> > > +               dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> > > +                   bool seen_direct_write = env->seen_direct_write;
> > > +
> > > +                   regs[BPF_REG_0].type |= DYNPTR_TYPE_SKB;
> > > +                   if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
> > > +                           regs[BPF_REG_0].type |= MEM_RDONLY;
> > > +                   else
> > > +                           /*
> > > +                            * Calling may_access_direct_pkt_data() will set
> > > +                            * env->seen_direct_write to true if the skb is
> > > +                            * writable. As an optimization, we can ignore
> > > +                            * setting env->seen_direct_write.
> > > +                            *
> > > +                            * env->seen_direct_write is used by skb
> > > +                            * programs to determine whether the skb's page
> > > +                            * buffers should be cloned. Since data slice
> > > +                            * writes would only be to the head, we can skip
> > > +                            * this.
> > > +                            */
> > > +                           env->seen_direct_write = seen_direct_write;
> > > +           }
> >
> > [ ... ]
> >
> > > @@ -9263,17 +9361,26 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
> > >                             return ret;
> > >                     break;
> > >             case KF_ARG_PTR_TO_DYNPTR:
> > > +           {
> > > +                   enum bpf_arg_type dynptr_arg_type = ARG_PTR_TO_DYNPTR;
> > > +
> > >                     if (reg->type != PTR_TO_STACK &&
> > >                         reg->type != CONST_PTR_TO_DYNPTR) {
> > >                             verbose(env, "arg#%d expected pointer to stack or dynptr_ptr\n", i);
> > >                             return -EINVAL;
> > >                     }
> > > -                   ret = process_dynptr_func(env, regno, insn_idx,
> > > -                                             ARG_PTR_TO_DYNPTR | MEM_RDONLY);
> > > +                   if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb])
> > > +                           dynptr_arg_type |= MEM_UNINIT | DYNPTR_TYPE_SKB;
> > > +                   else
> > > +                           dynptr_arg_type |= MEM_RDONLY;
> > > +
> > > +                   ret = process_dynptr_func(env, regno, insn_idx, dynptr_arg_type,
> > > +                                             meta->func_id);
> > >                     if (ret < 0)
> > >                             return ret;
> > >                     break;
> > > +           }
> > >             case KF_ARG_PTR_TO_LIST_HEAD:
> > >                     if (reg->type != PTR_TO_MAP_VALUE &&
> > >                         reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
> > > @@ -15857,6 +15964,14 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> > >                desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
> > >             insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
> > >             *cnt = 1;
> > > +   } else if (desc->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb]) {
> > > +           bool is_rdonly = !may_access_direct_pkt_data(env, NULL, BPF_WRITE);
> >
> > Does it need to restore the env->seen_direct_write here also?
> >
> > It seems this 'seen_direct_write' saving/restoring is needed now because
> > 'may_access_direct_pkt_data(BPF_WRITE)' is not only called when it is
> > actually writing the packet. Some refactoring can help to avoid issue like
> > this.
> >
> > While at 'seen_direct_write', Alexei has also pointed out that the verifier
> > needs to track whether the (packet) 'slice' returned by bpf_dynptr_data()
> > has been written. It should be tracked in 'seen_direct_write'. Take a look
> > at how reg_is_pkt_pointer() and may_access_direct_pkt_data() are done in
> > check_mem_access(). iirc, this reg_is_pkt_pointer() part got loss somewhere
> > in v5 (or v4?) when bpf_dynptr_data() was changed to return register typed
> > PTR_TO_MEM instead of PTR_TO_PACKET.
>
> btw tc progs are using gen_prologue() approach because data/data_end are not kfuncs
> (nothing is being called by the bpf prog).
> In this case we don't need to repeat this approach. If so we don't need to
> set seen_direct_write.
> Instead bpf_dynptr_data() can call bpf_skb_pull_data() directly.
> And technically we don't need to limit it to skb head. It can handle any off/len.
> It will work for skb, but there is no equivalent for xdp_pull_data().
> I don't think we can implement xdp_pull_data in all drivers.
> That's massive amount of work, but we need to be consistent if we want
> dynptr to wrap both skb and xdp.
> We can say dynptr_data is for head only, but we've seen bugs where people
> had to switch from data/data_end to load_bytes.
>
> Also bpf_skb_pull_data is quite heavy. For progs that only want to parse
> the packet calling that in bpf_dynptr_data is a heavy hammer.
>
> It feels that we need to go back to skb_header_pointer-like discussion.
> Something like:
> bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len, void *buffer)
> Whether buffer is a part of dynptr or program provided is tbd.

making it hidden within dynptr would make this approach unreliable
(memory allocations, which can fail, etc). But if we ask users to pass
it directly, then it should be relatively easy to use in practice with
some pre-allocated per-CPU buffer:


struct {
  __int(type, BPF_MAP_TYPE_PERCPU_ARRAY);
  __int(max_entries, 1);
  __type(key, int);
  __type(value, char[4096]);
} scratch SEC(".maps");


...


struct dyn_ptr *dp = bpf_dynptr_from_skb(...).
void *p, *buf;
int zero = 0;

buf = bpf_map_lookup_elem(&scratch, &zero);
if (!buf) return 0; /* can't happen */

p = bpf_dynptr_slice(dp, off, 16, buf);
if (p == NULL) {
   /* out of range */
} else {
   /* work with p directly */
}

/* if we wrote something to p and it was copied to buffer, write it back */
if (p == buf) {
    bpf_dynptr_write(dp, buf, 16);
}


We'll just need to teach verifier to make sure that buf is at least 16
byte long.


But I wonder if for simple cases when users are mostly sure that they
are going to access only header data directly we can have an option
for bpf_dynptr_from_skb() to specify what should be the behavior for
bpf_dynptr_slice():

 - either return NULL for anything that crosses into frags (no
surprising perf penalty, but surprising NULLs);
 - do bpf_skb_pull_data() if bpf_dynptr_data() needs to point to data
beyond header (potential perf penalty, but on NULLs, if off+len is
within packet).

And then bpf_dynptr_from_skb() can accept a flag specifying this
behavior and store it somewhere in struct bpf_dynptr.

Thoughts?
Andrii Nakryiko Jan. 31, 2023, 1:06 a.m. UTC | #9
On Mon, Jan 30, 2023 at 4:56 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Mon, Jan 30, 2023 at 4:48 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Fri, Jan 27, 2023 at 11:18 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > Add skb dynptrs, which are dynptrs whose underlying pointer points
> > > to a skb. The dynptr acts on skb data. skb dynptrs have two main
> > > benefits. One is that they allow operations on sizes that are not
> > > statically known at compile-time (eg variable-sized accesses).
> > > Another is that parsing the packet data through dynptrs (instead of
> > > through direct access of skb->data and skb->data_end) can be more
> > > ergonomic and less brittle (eg does not need manual if checking for
> > > being within bounds of data_end).
> > >
> > > For bpf prog types that don't support writes on skb data, the dynptr is
> > > read-only (bpf_dynptr_write() will return an error and bpf_dynptr_data()
> > > will return a data slice that is read-only where any writes to it will
> > > be rejected by the verifier).
> > >
> > > For reads and writes through the bpf_dynptr_read() and bpf_dynptr_write()
> > > interfaces, reading and writing from/to data in the head as well as from/to
> > > non-linear paged buffers is supported. For data slices (through the
> > > bpf_dynptr_data() interface), if the data is in a paged buffer, the user
> > > must first call bpf_skb_pull_data() to pull the data into the linear
> > > portion.
> > >
> > > Any bpf_dynptr_write() automatically invalidates any prior data slices
> > > to the skb dynptr. This is because a bpf_dynptr_write() may be writing
> > > to data in a paged buffer, so it will need to pull the buffer first into
> > > the head. The reason it needs to be pulled instead of writing directly to
> > > the paged buffers is because they may be cloned (only the head of the skb
> > > is by default uncloned). As such, any bpf_dynptr_write() will
> > > automatically have its prior data slices invalidated, even if the write
> > > is to data in the skb head (the verifier has no way of differentiating
> > > whether the write is to the head or paged buffers during program load
> > > time). Please note as well that any other helper calls that change the
> > > underlying packet buffer (eg bpf_skb_pull_data()) invalidates any data
> > > slices of the skb dynptr as well. The stack trace for this is
> > > check_helper_call() -> clear_all_pkt_pointers() ->
> > > __clear_all_pkt_pointers() -> mark_reg_unknown().
> > >
> > > For examples of how skb dynptrs can be used, please see the attached
> > > selftests.
> > >
> > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > ---
> > >  include/linux/bpf.h            |  82 +++++++++------
> > >  include/linux/filter.h         |  18 ++++
> > >  include/uapi/linux/bpf.h       |  37 +++++--
> > >  kernel/bpf/btf.c               |  18 ++++
> > >  kernel/bpf/helpers.c           |  95 ++++++++++++++---
> > >  kernel/bpf/verifier.c          | 185 ++++++++++++++++++++++++++-------
> > >  net/core/filter.c              |  60 ++++++++++-
> > >  tools/include/uapi/linux/bpf.h |  37 +++++--
> > >  8 files changed, 432 insertions(+), 100 deletions(-)
> > >
> >
> > [...]
> >
> > >  static const struct bpf_func_proto bpf_dynptr_write_proto = {
> > > @@ -1560,6 +1595,8 @@ static const struct bpf_func_proto bpf_dynptr_write_proto = {
> > >
> > >  BPF_CALL_3(bpf_dynptr_data, const struct bpf_dynptr_kern *, ptr, u32, offset, u32, len)
> > >  {
> > > +       enum bpf_dynptr_type type;
> > > +       void *data;
> > >         int err;
> > >
> > >         if (!ptr->data)
> > > @@ -1569,10 +1606,36 @@ BPF_CALL_3(bpf_dynptr_data, const struct bpf_dynptr_kern *, ptr, u32, offset, u3
> > >         if (err)
> > >                 return 0;
> > >
> > > -       if (bpf_dynptr_is_rdonly(ptr))
> > > -               return 0;
> > > +       type = bpf_dynptr_get_type(ptr);
> > > +
> > > +       switch (type) {
> > > +       case BPF_DYNPTR_TYPE_LOCAL:
> > > +       case BPF_DYNPTR_TYPE_RINGBUF:
> > > +               if (bpf_dynptr_is_rdonly(ptr))
> > > +                       return 0;
> >
> > will something break if we return ptr->data for read-only LOCAL/RINGBUF dynptr?
>
> There will be nothing guarding against direct writes into read-only
> LOCAL/RINGBUF dynptrs if we return ptr->data. For skb type dynptrs,
> it's guarded by the ptr->data return pointer being marked as
> MEM_RDONLY in the verifier if the skb is non-writable.
>

Ah, so we won't add MEM_RDONLY for bpf_dynptr_data()'s returned
PTR_TO_MEM if we know (statically) that dynptr is read-only?

Ok, not a big deal, just something that we might want to improve in the future.

> >
> > > +
> > > +               data = ptr->data;
> > > +               break;
> > > +       case BPF_DYNPTR_TYPE_SKB:
> > > +       {
> > > +               struct sk_buff *skb = ptr->data;
> > >
> >
> > [...]
Joanne Koong Jan. 31, 2023, 1:13 a.m. UTC | #10
On Mon, Jan 30, 2023 at 5:06 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Jan 30, 2023 at 4:56 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Mon, Jan 30, 2023 at 4:48 PM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Fri, Jan 27, 2023 at 11:18 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > >
> > > > Add skb dynptrs, which are dynptrs whose underlying pointer points
> > > > to a skb. The dynptr acts on skb data. skb dynptrs have two main
> > > > benefits. One is that they allow operations on sizes that are not
> > > > statically known at compile-time (eg variable-sized accesses).
> > > > Another is that parsing the packet data through dynptrs (instead of
> > > > through direct access of skb->data and skb->data_end) can be more
> > > > ergonomic and less brittle (eg does not need manual if checking for
> > > > being within bounds of data_end).
> > > >
> > > > For bpf prog types that don't support writes on skb data, the dynptr is
> > > > read-only (bpf_dynptr_write() will return an error and bpf_dynptr_data()
> > > > will return a data slice that is read-only where any writes to it will
> > > > be rejected by the verifier).
> > > >
> > > > For reads and writes through the bpf_dynptr_read() and bpf_dynptr_write()
> > > > interfaces, reading and writing from/to data in the head as well as from/to
> > > > non-linear paged buffers is supported. For data slices (through the
> > > > bpf_dynptr_data() interface), if the data is in a paged buffer, the user
> > > > must first call bpf_skb_pull_data() to pull the data into the linear
> > > > portion.
> > > >
> > > > Any bpf_dynptr_write() automatically invalidates any prior data slices
> > > > to the skb dynptr. This is because a bpf_dynptr_write() may be writing
> > > > to data in a paged buffer, so it will need to pull the buffer first into
> > > > the head. The reason it needs to be pulled instead of writing directly to
> > > > the paged buffers is because they may be cloned (only the head of the skb
> > > > is by default uncloned). As such, any bpf_dynptr_write() will
> > > > automatically have its prior data slices invalidated, even if the write
> > > > is to data in the skb head (the verifier has no way of differentiating
> > > > whether the write is to the head or paged buffers during program load
> > > > time). Please note as well that any other helper calls that change the
> > > > underlying packet buffer (eg bpf_skb_pull_data()) invalidates any data
> > > > slices of the skb dynptr as well. The stack trace for this is
> > > > check_helper_call() -> clear_all_pkt_pointers() ->
> > > > __clear_all_pkt_pointers() -> mark_reg_unknown().
> > > >
> > > > For examples of how skb dynptrs can be used, please see the attached
> > > > selftests.
> > > >
> > > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > > ---
> > > >  include/linux/bpf.h            |  82 +++++++++------
> > > >  include/linux/filter.h         |  18 ++++
> > > >  include/uapi/linux/bpf.h       |  37 +++++--
> > > >  kernel/bpf/btf.c               |  18 ++++
> > > >  kernel/bpf/helpers.c           |  95 ++++++++++++++---
> > > >  kernel/bpf/verifier.c          | 185 ++++++++++++++++++++++++++-------
> > > >  net/core/filter.c              |  60 ++++++++++-
> > > >  tools/include/uapi/linux/bpf.h |  37 +++++--
> > > >  8 files changed, 432 insertions(+), 100 deletions(-)
> > > >
> > >
> > > [...]
> > >
> > > >  static const struct bpf_func_proto bpf_dynptr_write_proto = {
> > > > @@ -1560,6 +1595,8 @@ static const struct bpf_func_proto bpf_dynptr_write_proto = {
> > > >
> > > >  BPF_CALL_3(bpf_dynptr_data, const struct bpf_dynptr_kern *, ptr, u32, offset, u32, len)
> > > >  {
> > > > +       enum bpf_dynptr_type type;
> > > > +       void *data;
> > > >         int err;
> > > >
> > > >         if (!ptr->data)
> > > > @@ -1569,10 +1606,36 @@ BPF_CALL_3(bpf_dynptr_data, const struct bpf_dynptr_kern *, ptr, u32, offset, u3
> > > >         if (err)
> > > >                 return 0;
> > > >
> > > > -       if (bpf_dynptr_is_rdonly(ptr))
> > > > -               return 0;
> > > > +       type = bpf_dynptr_get_type(ptr);
> > > > +
> > > > +       switch (type) {
> > > > +       case BPF_DYNPTR_TYPE_LOCAL:
> > > > +       case BPF_DYNPTR_TYPE_RINGBUF:
> > > > +               if (bpf_dynptr_is_rdonly(ptr))
> > > > +                       return 0;
> > >
> > > will something break if we return ptr->data for read-only LOCAL/RINGBUF dynptr?
> >
> > There will be nothing guarding against direct writes into read-only
> > LOCAL/RINGBUF dynptrs if we return ptr->data. For skb type dynptrs,
> > it's guarded by the ptr->data return pointer being marked as
> > MEM_RDONLY in the verifier if the skb is non-writable.
> >
>
> Ah, so we won't add MEM_RDONLY for bpf_dynptr_data()'s returned
> PTR_TO_MEM if we know (statically) that dynptr is read-only?

I think you meant will, not won't? If so, then yes we only add
MEM_RDONLY for the returned data slice if we can pre-determine that
the dynptr is read-only, else bpf_dynptr_data() will return null.

>
> Ok, not a big deal, just something that we might want to improve in the future.

I'm curious to hear how you think this could be improved. If we're not
able to know statically whether the dynptr is read-only or writable,
then there's no way to enforce it in the verifier before the bpf
program runs. Or is there some way to do this?

>
> > >
> > > > +
> > > > +               data = ptr->data;
> > > > +               break;
> > > > +       case BPF_DYNPTR_TYPE_SKB:
> > > > +       {
> > > > +               struct sk_buff *skb = ptr->data;
> > > >
> > >
> > > [...]
Andrii Nakryiko Jan. 31, 2023, 1:19 a.m. UTC | #11
On Mon, Jan 30, 2023 at 5:13 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Mon, Jan 30, 2023 at 5:06 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Mon, Jan 30, 2023 at 4:56 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > On Mon, Jan 30, 2023 at 4:48 PM Andrii Nakryiko
> > > <andrii.nakryiko@gmail.com> wrote:
> > > >
> > > > On Fri, Jan 27, 2023 at 11:18 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > >
> > > > > Add skb dynptrs, which are dynptrs whose underlying pointer points
> > > > > to a skb. The dynptr acts on skb data. skb dynptrs have two main
> > > > > benefits. One is that they allow operations on sizes that are not
> > > > > statically known at compile-time (eg variable-sized accesses).
> > > > > Another is that parsing the packet data through dynptrs (instead of
> > > > > through direct access of skb->data and skb->data_end) can be more
> > > > > ergonomic and less brittle (eg does not need manual if checking for
> > > > > being within bounds of data_end).
> > > > >
> > > > > For bpf prog types that don't support writes on skb data, the dynptr is
> > > > > read-only (bpf_dynptr_write() will return an error and bpf_dynptr_data()
> > > > > will return a data slice that is read-only where any writes to it will
> > > > > be rejected by the verifier).
> > > > >
> > > > > For reads and writes through the bpf_dynptr_read() and bpf_dynptr_write()
> > > > > interfaces, reading and writing from/to data in the head as well as from/to
> > > > > non-linear paged buffers is supported. For data slices (through the
> > > > > bpf_dynptr_data() interface), if the data is in a paged buffer, the user
> > > > > must first call bpf_skb_pull_data() to pull the data into the linear
> > > > > portion.
> > > > >
> > > > > Any bpf_dynptr_write() automatically invalidates any prior data slices
> > > > > to the skb dynptr. This is because a bpf_dynptr_write() may be writing
> > > > > to data in a paged buffer, so it will need to pull the buffer first into
> > > > > the head. The reason it needs to be pulled instead of writing directly to
> > > > > the paged buffers is because they may be cloned (only the head of the skb
> > > > > is by default uncloned). As such, any bpf_dynptr_write() will
> > > > > automatically have its prior data slices invalidated, even if the write
> > > > > is to data in the skb head (the verifier has no way of differentiating
> > > > > whether the write is to the head or paged buffers during program load
> > > > > time). Please note as well that any other helper calls that change the
> > > > > underlying packet buffer (eg bpf_skb_pull_data()) invalidates any data
> > > > > slices of the skb dynptr as well. The stack trace for this is
> > > > > check_helper_call() -> clear_all_pkt_pointers() ->
> > > > > __clear_all_pkt_pointers() -> mark_reg_unknown().
> > > > >
> > > > > For examples of how skb dynptrs can be used, please see the attached
> > > > > selftests.
> > > > >
> > > > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > > > ---
> > > > >  include/linux/bpf.h            |  82 +++++++++------
> > > > >  include/linux/filter.h         |  18 ++++
> > > > >  include/uapi/linux/bpf.h       |  37 +++++--
> > > > >  kernel/bpf/btf.c               |  18 ++++
> > > > >  kernel/bpf/helpers.c           |  95 ++++++++++++++---
> > > > >  kernel/bpf/verifier.c          | 185 ++++++++++++++++++++++++++-------
> > > > >  net/core/filter.c              |  60 ++++++++++-
> > > > >  tools/include/uapi/linux/bpf.h |  37 +++++--
> > > > >  8 files changed, 432 insertions(+), 100 deletions(-)
> > > > >
> > > >
> > > > [...]
> > > >
> > > > >  static const struct bpf_func_proto bpf_dynptr_write_proto = {
> > > > > @@ -1560,6 +1595,8 @@ static const struct bpf_func_proto bpf_dynptr_write_proto = {
> > > > >
> > > > >  BPF_CALL_3(bpf_dynptr_data, const struct bpf_dynptr_kern *, ptr, u32, offset, u32, len)
> > > > >  {
> > > > > +       enum bpf_dynptr_type type;
> > > > > +       void *data;
> > > > >         int err;
> > > > >
> > > > >         if (!ptr->data)
> > > > > @@ -1569,10 +1606,36 @@ BPF_CALL_3(bpf_dynptr_data, const struct bpf_dynptr_kern *, ptr, u32, offset, u3
> > > > >         if (err)
> > > > >                 return 0;
> > > > >
> > > > > -       if (bpf_dynptr_is_rdonly(ptr))
> > > > > -               return 0;
> > > > > +       type = bpf_dynptr_get_type(ptr);
> > > > > +
> > > > > +       switch (type) {
> > > > > +       case BPF_DYNPTR_TYPE_LOCAL:
> > > > > +       case BPF_DYNPTR_TYPE_RINGBUF:
> > > > > +               if (bpf_dynptr_is_rdonly(ptr))
> > > > > +                       return 0;
> > > >
> > > > will something break if we return ptr->data for read-only LOCAL/RINGBUF dynptr?
> > >
> > > There will be nothing guarding against direct writes into read-only
> > > LOCAL/RINGBUF dynptrs if we return ptr->data. For skb type dynptrs,
> > > it's guarded by the ptr->data return pointer being marked as
> > > MEM_RDONLY in the verifier if the skb is non-writable.
> > >
> >
> > Ah, so we won't add MEM_RDONLY for bpf_dynptr_data()'s returned
> > PTR_TO_MEM if we know (statically) that dynptr is read-only?
>
> I think you meant will, not won't? If so, then yes we only add
> MEM_RDONLY for the returned data slice if we can pre-determine that
> the dynptr is read-only, else bpf_dynptr_data() will return null.
>
> >
> > Ok, not a big deal, just something that we might want to improve in the future.
>
> I'm curious to hear how you think this could be improved. If we're not
> able to know statically whether the dynptr is read-only or writable,
> then there's no way to enforce it in the verifier before the bpf
> program runs. Or is there some way to do this?

I might be just confused, I thought the conclusion from previous
discussions were that we do know statically if dynptr is read-only? If
that's not the case, then yeah, we can't really do much about this.

Either way, I think this is a small thing, as in practice
LOCAL/RINGBUF dynptrs will always be read-write, right?

>
> >
> > > >
> > > > > +
> > > > > +               data = ptr->data;
> > > > > +               break;
> > > > > +       case BPF_DYNPTR_TYPE_SKB:
> > > > > +       {
> > > > > +               struct sk_buff *skb = ptr->data;
> > > > >
> > > >
> > > > [...]
Martin KaFai Lau Jan. 31, 2023, 1:49 a.m. UTC | #12
On 1/30/23 5:04 PM, Andrii Nakryiko wrote:
> On Mon, Jan 30, 2023 at 2:31 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>>
>> On Mon, Jan 30, 2023 at 02:04:08PM -0800, Martin KaFai Lau wrote:
>>> On 1/27/23 11:17 AM, Joanne Koong wrote:
>>>> @@ -8243,6 +8316,28 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
>>>>              mark_reg_known_zero(env, regs, BPF_REG_0);
>>>>              regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
>>>>              regs[BPF_REG_0].mem_size = meta.mem_size;
>>>> +           if (func_id == BPF_FUNC_dynptr_data &&
>>>> +               dynptr_type == BPF_DYNPTR_TYPE_SKB) {
>>>> +                   bool seen_direct_write = env->seen_direct_write;
>>>> +
>>>> +                   regs[BPF_REG_0].type |= DYNPTR_TYPE_SKB;
>>>> +                   if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
>>>> +                           regs[BPF_REG_0].type |= MEM_RDONLY;
>>>> +                   else
>>>> +                           /*
>>>> +                            * Calling may_access_direct_pkt_data() will set
>>>> +                            * env->seen_direct_write to true if the skb is
>>>> +                            * writable. As an optimization, we can ignore
>>>> +                            * setting env->seen_direct_write.
>>>> +                            *
>>>> +                            * env->seen_direct_write is used by skb
>>>> +                            * programs to determine whether the skb's page
>>>> +                            * buffers should be cloned. Since data slice
>>>> +                            * writes would only be to the head, we can skip
>>>> +                            * this.
>>>> +                            */
>>>> +                           env->seen_direct_write = seen_direct_write;
>>>> +           }
>>>
>>> [ ... ]
>>>
>>>> @@ -9263,17 +9361,26 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
>>>>                              return ret;
>>>>                      break;
>>>>              case KF_ARG_PTR_TO_DYNPTR:
>>>> +           {
>>>> +                   enum bpf_arg_type dynptr_arg_type = ARG_PTR_TO_DYNPTR;
>>>> +
>>>>                      if (reg->type != PTR_TO_STACK &&
>>>>                          reg->type != CONST_PTR_TO_DYNPTR) {
>>>>                              verbose(env, "arg#%d expected pointer to stack or dynptr_ptr\n", i);
>>>>                              return -EINVAL;
>>>>                      }
>>>> -                   ret = process_dynptr_func(env, regno, insn_idx,
>>>> -                                             ARG_PTR_TO_DYNPTR | MEM_RDONLY);
>>>> +                   if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb])
>>>> +                           dynptr_arg_type |= MEM_UNINIT | DYNPTR_TYPE_SKB;
>>>> +                   else
>>>> +                           dynptr_arg_type |= MEM_RDONLY;
>>>> +
>>>> +                   ret = process_dynptr_func(env, regno, insn_idx, dynptr_arg_type,
>>>> +                                             meta->func_id);
>>>>                      if (ret < 0)
>>>>                              return ret;
>>>>                      break;
>>>> +           }
>>>>              case KF_ARG_PTR_TO_LIST_HEAD:
>>>>                      if (reg->type != PTR_TO_MAP_VALUE &&
>>>>                          reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
>>>> @@ -15857,6 +15964,14 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
>>>>                 desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
>>>>              insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
>>>>              *cnt = 1;
>>>> +   } else if (desc->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb]) {
>>>> +           bool is_rdonly = !may_access_direct_pkt_data(env, NULL, BPF_WRITE);
>>>
>>> Does it need to restore the env->seen_direct_write here also?
>>>
>>> It seems this 'seen_direct_write' saving/restoring is needed now because
>>> 'may_access_direct_pkt_data(BPF_WRITE)' is not only called when it is
>>> actually writing the packet. Some refactoring can help to avoid issue like
>>> this.
>>>
>>> While at 'seen_direct_write', Alexei has also pointed out that the verifier
>>> needs to track whether the (packet) 'slice' returned by bpf_dynptr_data()
>>> has been written. It should be tracked in 'seen_direct_write'. Take a look
>>> at how reg_is_pkt_pointer() and may_access_direct_pkt_data() are done in
>>> check_mem_access(). iirc, this reg_is_pkt_pointer() part got loss somewhere
>>> in v5 (or v4?) when bpf_dynptr_data() was changed to return register typed
>>> PTR_TO_MEM instead of PTR_TO_PACKET.
>>
>> btw tc progs are using gen_prologue() approach because data/data_end are not kfuncs
>> (nothing is being called by the bpf prog).
>> In this case we don't need to repeat this approach. If so we don't need to
>> set seen_direct_write.
>> Instead bpf_dynptr_data() can call bpf_skb_pull_data() directly.
>> And technically we don't need to limit it to skb head. It can handle any off/len.
>> It will work for skb, but there is no equivalent for xdp_pull_data().
>> I don't think we can implement xdp_pull_data in all drivers.
>> That's massive amount of work, but we need to be consistent if we want
>> dynptr to wrap both skb and xdp.
>> We can say dynptr_data is for head only, but we've seen bugs where people
>> had to switch from data/data_end to load_bytes.
>>
>> Also bpf_skb_pull_data is quite heavy. For progs that only want to parse
>> the packet calling that in bpf_dynptr_data is a heavy hammer.
>>
>> It feels that we need to go back to skb_header_pointer-like discussion.
>> Something like:
>> bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len, void *buffer)
>> Whether buffer is a part of dynptr or program provided is tbd.
> 
> making it hidden within dynptr would make this approach unreliable
> (memory allocations, which can fail, etc). But if we ask users to pass
> it directly, then it should be relatively easy to use in practice with
> some pre-allocated per-CPU buffer:
> 
> 
> struct {
>    __int(type, BPF_MAP_TYPE_PERCPU_ARRAY);
>    __int(max_entries, 1);
>    __type(key, int);
>    __type(value, char[4096]);
> } scratch SEC(".maps");
> 
> 
> ...
> 
> 
> struct dyn_ptr *dp = bpf_dynptr_from_skb(...).
> void *p, *buf;
> int zero = 0;
> 
> buf = bpf_map_lookup_elem(&scratch, &zero);
> if (!buf) return 0; /* can't happen */
> 
> p = bpf_dynptr_slice(dp, off, 16, buf);
> if (p == NULL) {
>     /* out of range */
> } else {
>     /* work with p directly */
> }
> 
> /* if we wrote something to p and it was copied to buffer, write it back */
> if (p == buf) {
>      bpf_dynptr_write(dp, buf, 16);
> }
> 
> 
> We'll just need to teach verifier to make sure that buf is at least 16
> byte long.

A fifth __sz arg may do:
bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len, void 
*buffer, u32 buffer__sz);

The bpf prog usually has buffer in the stack for the common small header parsing.

One side note is the bpf_dynptr_slice() still needs to check if the skb is 
cloned or not even the off/len is within the head range.

> But I wonder if for simple cases when users are mostly sure that they
> are going to access only header data directly we can have an option
> for bpf_dynptr_from_skb() to specify what should be the behavior for
> bpf_dynptr_slice():
> 
>   - either return NULL for anything that crosses into frags (no
> surprising perf penalty, but surprising NULLs);
>   - do bpf_skb_pull_data() if bpf_dynptr_data() needs to point to data
> beyond header (potential perf penalty, but on NULLs, if off+len is
> within packet).
> 
> And then bpf_dynptr_from_skb() can accept a flag specifying this
> behavior and store it somewhere in struct bpf_dynptr.

xdp does not have the bpf_skb_pull_data() equivalent, so xdp prog will still 
need the write back handling.
Andrii Nakryiko Jan. 31, 2023, 4:43 a.m. UTC | #13
On Mon, Jan 30, 2023 at 5:49 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 1/30/23 5:04 PM, Andrii Nakryiko wrote:
> > On Mon, Jan 30, 2023 at 2:31 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> >>
> >> On Mon, Jan 30, 2023 at 02:04:08PM -0800, Martin KaFai Lau wrote:
> >>> On 1/27/23 11:17 AM, Joanne Koong wrote:
> >>>> @@ -8243,6 +8316,28 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> >>>>              mark_reg_known_zero(env, regs, BPF_REG_0);
> >>>>              regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> >>>>              regs[BPF_REG_0].mem_size = meta.mem_size;
> >>>> +           if (func_id == BPF_FUNC_dynptr_data &&
> >>>> +               dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> >>>> +                   bool seen_direct_write = env->seen_direct_write;
> >>>> +
> >>>> +                   regs[BPF_REG_0].type |= DYNPTR_TYPE_SKB;
> >>>> +                   if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
> >>>> +                           regs[BPF_REG_0].type |= MEM_RDONLY;
> >>>> +                   else
> >>>> +                           /*
> >>>> +                            * Calling may_access_direct_pkt_data() will set
> >>>> +                            * env->seen_direct_write to true if the skb is
> >>>> +                            * writable. As an optimization, we can ignore
> >>>> +                            * setting env->seen_direct_write.
> >>>> +                            *
> >>>> +                            * env->seen_direct_write is used by skb
> >>>> +                            * programs to determine whether the skb's page
> >>>> +                            * buffers should be cloned. Since data slice
> >>>> +                            * writes would only be to the head, we can skip
> >>>> +                            * this.
> >>>> +                            */
> >>>> +                           env->seen_direct_write = seen_direct_write;
> >>>> +           }
> >>>
> >>> [ ... ]
> >>>
> >>>> @@ -9263,17 +9361,26 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
> >>>>                              return ret;
> >>>>                      break;
> >>>>              case KF_ARG_PTR_TO_DYNPTR:
> >>>> +           {
> >>>> +                   enum bpf_arg_type dynptr_arg_type = ARG_PTR_TO_DYNPTR;
> >>>> +
> >>>>                      if (reg->type != PTR_TO_STACK &&
> >>>>                          reg->type != CONST_PTR_TO_DYNPTR) {
> >>>>                              verbose(env, "arg#%d expected pointer to stack or dynptr_ptr\n", i);
> >>>>                              return -EINVAL;
> >>>>                      }
> >>>> -                   ret = process_dynptr_func(env, regno, insn_idx,
> >>>> -                                             ARG_PTR_TO_DYNPTR | MEM_RDONLY);
> >>>> +                   if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb])
> >>>> +                           dynptr_arg_type |= MEM_UNINIT | DYNPTR_TYPE_SKB;
> >>>> +                   else
> >>>> +                           dynptr_arg_type |= MEM_RDONLY;
> >>>> +
> >>>> +                   ret = process_dynptr_func(env, regno, insn_idx, dynptr_arg_type,
> >>>> +                                             meta->func_id);
> >>>>                      if (ret < 0)
> >>>>                              return ret;
> >>>>                      break;
> >>>> +           }
> >>>>              case KF_ARG_PTR_TO_LIST_HEAD:
> >>>>                      if (reg->type != PTR_TO_MAP_VALUE &&
> >>>>                          reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
> >>>> @@ -15857,6 +15964,14 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> >>>>                 desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
> >>>>              insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
> >>>>              *cnt = 1;
> >>>> +   } else if (desc->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb]) {
> >>>> +           bool is_rdonly = !may_access_direct_pkt_data(env, NULL, BPF_WRITE);
> >>>
> >>> Does it need to restore the env->seen_direct_write here also?
> >>>
> >>> It seems this 'seen_direct_write' saving/restoring is needed now because
> >>> 'may_access_direct_pkt_data(BPF_WRITE)' is not only called when it is
> >>> actually writing the packet. Some refactoring can help to avoid issue like
> >>> this.
> >>>
> >>> While at 'seen_direct_write', Alexei has also pointed out that the verifier
> >>> needs to track whether the (packet) 'slice' returned by bpf_dynptr_data()
> >>> has been written. It should be tracked in 'seen_direct_write'. Take a look
> >>> at how reg_is_pkt_pointer() and may_access_direct_pkt_data() are done in
> >>> check_mem_access(). iirc, this reg_is_pkt_pointer() part got loss somewhere
> >>> in v5 (or v4?) when bpf_dynptr_data() was changed to return register typed
> >>> PTR_TO_MEM instead of PTR_TO_PACKET.
> >>
> >> btw tc progs are using gen_prologue() approach because data/data_end are not kfuncs
> >> (nothing is being called by the bpf prog).
> >> In this case we don't need to repeat this approach. If so we don't need to
> >> set seen_direct_write.
> >> Instead bpf_dynptr_data() can call bpf_skb_pull_data() directly.
> >> And technically we don't need to limit it to skb head. It can handle any off/len.
> >> It will work for skb, but there is no equivalent for xdp_pull_data().
> >> I don't think we can implement xdp_pull_data in all drivers.
> >> That's massive amount of work, but we need to be consistent if we want
> >> dynptr to wrap both skb and xdp.
> >> We can say dynptr_data is for head only, but we've seen bugs where people
> >> had to switch from data/data_end to load_bytes.
> >>
> >> Also bpf_skb_pull_data is quite heavy. For progs that only want to parse
> >> the packet calling that in bpf_dynptr_data is a heavy hammer.
> >>
> >> It feels that we need to go back to skb_header_pointer-like discussion.
> >> Something like:
> >> bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len, void *buffer)
> >> Whether buffer is a part of dynptr or program provided is tbd.
> >
> > making it hidden within dynptr would make this approach unreliable
> > (memory allocations, which can fail, etc). But if we ask users to pass
> > it directly, then it should be relatively easy to use in practice with
> > some pre-allocated per-CPU buffer:
> >
> >
> > struct {
> >    __int(type, BPF_MAP_TYPE_PERCPU_ARRAY);
> >    __int(max_entries, 1);
> >    __type(key, int);
> >    __type(value, char[4096]);
> > } scratch SEC(".maps");
> >
> >
> > ...
> >
> >
> > struct dyn_ptr *dp = bpf_dynptr_from_skb(...).
> > void *p, *buf;
> > int zero = 0;
> >
> > buf = bpf_map_lookup_elem(&scratch, &zero);
> > if (!buf) return 0; /* can't happen */
> >
> > p = bpf_dynptr_slice(dp, off, 16, buf);
> > if (p == NULL) {
> >     /* out of range */
> > } else {
> >     /* work with p directly */
> > }
> >
> > /* if we wrote something to p and it was copied to buffer, write it back */
> > if (p == buf) {
> >      bpf_dynptr_write(dp, buf, 16);
> > }
> >
> >
> > We'll just need to teach verifier to make sure that buf is at least 16
> > byte long.
>
> A fifth __sz arg may do:
> bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len, void
> *buffer, u32 buffer__sz);

We'll need to make sure that buffer__sz is >= len (or preferably not
require extra size at all). We can check that at runtime, of course,
but rejecting too small buffer at verification time would be a better
experience.

>
> The bpf prog usually has buffer in the stack for the common small header parsing.

sure, that would work for small chunks

>
> One side note is the bpf_dynptr_slice() still needs to check if the skb is
> cloned or not even the off/len is within the head range.

yep, and the above snippet will still do the right thing with
bpf_dynptr_write(), right? bpf_dynptr_write() will have to pull
anyways, if I understand correctly?

>
> > But I wonder if for simple cases when users are mostly sure that they
> > are going to access only header data directly we can have an option
> > for bpf_dynptr_from_skb() to specify what should be the behavior for
> > bpf_dynptr_slice():
> >
> >   - either return NULL for anything that crosses into frags (no
> > surprising perf penalty, but surprising NULLs);
> >   - do bpf_skb_pull_data() if bpf_dynptr_data() needs to point to data
> > beyond header (potential perf penalty, but on NULLs, if off+len is
> > within packet).
> >
> > And then bpf_dynptr_from_skb() can accept a flag specifying this
> > behavior and store it somewhere in struct bpf_dynptr.
>
> xdp does not have the bpf_skb_pull_data() equivalent, so xdp prog will still
> need the write back handling.
>

Sure, unfortunately, can't have everything. I'm just thinking how to
make bpf_dynptr_data() generically usable. Think about some common BPF
routine that calculates hash for all bytes pointed to by dynptr,
regardless of underlying dynptr type; it can iterate in small chunks,
get memory slice, if possible, but fallback to generic
bpf_dynptr_read() if doesn't. This will work for skb, xdp, LOCAL,
RINGBUF, any other dynptr type.
Alexei Starovoitov Jan. 31, 2023, 5:30 a.m. UTC | #14
On Mon, Jan 30, 2023 at 08:43:47PM -0800, Andrii Nakryiko wrote:
> On Mon, Jan 30, 2023 at 5:49 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >
> > On 1/30/23 5:04 PM, Andrii Nakryiko wrote:
> > > On Mon, Jan 30, 2023 at 2:31 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > >>
> > >> On Mon, Jan 30, 2023 at 02:04:08PM -0800, Martin KaFai Lau wrote:
> > >>> On 1/27/23 11:17 AM, Joanne Koong wrote:
> > >>>> @@ -8243,6 +8316,28 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> > >>>>              mark_reg_known_zero(env, regs, BPF_REG_0);
> > >>>>              regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> > >>>>              regs[BPF_REG_0].mem_size = meta.mem_size;
> > >>>> +           if (func_id == BPF_FUNC_dynptr_data &&
> > >>>> +               dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> > >>>> +                   bool seen_direct_write = env->seen_direct_write;
> > >>>> +
> > >>>> +                   regs[BPF_REG_0].type |= DYNPTR_TYPE_SKB;
> > >>>> +                   if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
> > >>>> +                           regs[BPF_REG_0].type |= MEM_RDONLY;
> > >>>> +                   else
> > >>>> +                           /*
> > >>>> +                            * Calling may_access_direct_pkt_data() will set
> > >>>> +                            * env->seen_direct_write to true if the skb is
> > >>>> +                            * writable. As an optimization, we can ignore
> > >>>> +                            * setting env->seen_direct_write.
> > >>>> +                            *
> > >>>> +                            * env->seen_direct_write is used by skb
> > >>>> +                            * programs to determine whether the skb's page
> > >>>> +                            * buffers should be cloned. Since data slice
> > >>>> +                            * writes would only be to the head, we can skip
> > >>>> +                            * this.
> > >>>> +                            */
> > >>>> +                           env->seen_direct_write = seen_direct_write;
> > >>>> +           }
> > >>>
> > >>> [ ... ]
> > >>>
> > >>>> @@ -9263,17 +9361,26 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
> > >>>>                              return ret;
> > >>>>                      break;
> > >>>>              case KF_ARG_PTR_TO_DYNPTR:
> > >>>> +           {
> > >>>> +                   enum bpf_arg_type dynptr_arg_type = ARG_PTR_TO_DYNPTR;
> > >>>> +
> > >>>>                      if (reg->type != PTR_TO_STACK &&
> > >>>>                          reg->type != CONST_PTR_TO_DYNPTR) {
> > >>>>                              verbose(env, "arg#%d expected pointer to stack or dynptr_ptr\n", i);
> > >>>>                              return -EINVAL;
> > >>>>                      }
> > >>>> -                   ret = process_dynptr_func(env, regno, insn_idx,
> > >>>> -                                             ARG_PTR_TO_DYNPTR | MEM_RDONLY);
> > >>>> +                   if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb])
> > >>>> +                           dynptr_arg_type |= MEM_UNINIT | DYNPTR_TYPE_SKB;
> > >>>> +                   else
> > >>>> +                           dynptr_arg_type |= MEM_RDONLY;
> > >>>> +
> > >>>> +                   ret = process_dynptr_func(env, regno, insn_idx, dynptr_arg_type,
> > >>>> +                                             meta->func_id);
> > >>>>                      if (ret < 0)
> > >>>>                              return ret;
> > >>>>                      break;
> > >>>> +           }
> > >>>>              case KF_ARG_PTR_TO_LIST_HEAD:
> > >>>>                      if (reg->type != PTR_TO_MAP_VALUE &&
> > >>>>                          reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
> > >>>> @@ -15857,6 +15964,14 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> > >>>>                 desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
> > >>>>              insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
> > >>>>              *cnt = 1;
> > >>>> +   } else if (desc->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb]) {
> > >>>> +           bool is_rdonly = !may_access_direct_pkt_data(env, NULL, BPF_WRITE);
> > >>>
> > >>> Does it need to restore the env->seen_direct_write here also?
> > >>>
> > >>> It seems this 'seen_direct_write' saving/restoring is needed now because
> > >>> 'may_access_direct_pkt_data(BPF_WRITE)' is not only called when it is
> > >>> actually writing the packet. Some refactoring can help to avoid issue like
> > >>> this.
> > >>>
> > >>> While at 'seen_direct_write', Alexei has also pointed out that the verifier
> > >>> needs to track whether the (packet) 'slice' returned by bpf_dynptr_data()
> > >>> has been written. It should be tracked in 'seen_direct_write'. Take a look
> > >>> at how reg_is_pkt_pointer() and may_access_direct_pkt_data() are done in
> > >>> check_mem_access(). iirc, this reg_is_pkt_pointer() part got loss somewhere
> > >>> in v5 (or v4?) when bpf_dynptr_data() was changed to return register typed
> > >>> PTR_TO_MEM instead of PTR_TO_PACKET.
> > >>
> > >> btw tc progs are using gen_prologue() approach because data/data_end are not kfuncs
> > >> (nothing is being called by the bpf prog).
> > >> In this case we don't need to repeat this approach. If so we don't need to
> > >> set seen_direct_write.
> > >> Instead bpf_dynptr_data() can call bpf_skb_pull_data() directly.
> > >> And technically we don't need to limit it to skb head. It can handle any off/len.
> > >> It will work for skb, but there is no equivalent for xdp_pull_data().
> > >> I don't think we can implement xdp_pull_data in all drivers.
> > >> That's massive amount of work, but we need to be consistent if we want
> > >> dynptr to wrap both skb and xdp.
> > >> We can say dynptr_data is for head only, but we've seen bugs where people
> > >> had to switch from data/data_end to load_bytes.
> > >>
> > >> Also bpf_skb_pull_data is quite heavy. For progs that only want to parse
> > >> the packet calling that in bpf_dynptr_data is a heavy hammer.
> > >>
> > >> It feels that we need to go back to skb_header_pointer-like discussion.
> > >> Something like:
> > >> bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len, void *buffer)
> > >> Whether buffer is a part of dynptr or program provided is tbd.
> > >
> > > making it hidden within dynptr would make this approach unreliable
> > > (memory allocations, which can fail, etc). But if we ask users to pass
> > > it directly, then it should be relatively easy to use in practice with
> > > some pre-allocated per-CPU buffer:

bpf_skb_pull_data() is even more unreliable, since it's a bigger allocation.
I like preallocated approach more, so we're in agreement here.

> > >
> > >
> > > struct {
> > >    __int(type, BPF_MAP_TYPE_PERCPU_ARRAY);
> > >    __int(max_entries, 1);
> > >    __type(key, int);
> > >    __type(value, char[4096]);
> > > } scratch SEC(".maps");
> > >
> > >
> > > ...
> > >
> > >
> > > struct dyn_ptr *dp = bpf_dynptr_from_skb(...).
> > > void *p, *buf;
> > > int zero = 0;
> > >
> > > buf = bpf_map_lookup_elem(&scratch, &zero);
> > > if (!buf) return 0; /* can't happen */
> > >
> > > p = bpf_dynptr_slice(dp, off, 16, buf);
> > > if (p == NULL) {
> > >     /* out of range */
> > > } else {
> > >     /* work with p directly */
> > > }
> > >
> > > /* if we wrote something to p and it was copied to buffer, write it back */
> > > if (p == buf) {
> > >      bpf_dynptr_write(dp, buf, 16);
> > > }
> > >
> > >
> > > We'll just need to teach verifier to make sure that buf is at least 16
> > > byte long.
> >
> > A fifth __sz arg may do:
> > bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len, void
> > *buffer, u32 buffer__sz);
> 
> We'll need to make sure that buffer__sz is >= len (or preferably not
> require extra size at all). We can check that at runtime, of course,
> but rejecting too small buffer at verification time would be a better
> experience.

I don't follow. Why two equivalent 'len' args ?
Just to allow 'len' to be a variable instead of constant ?
It's unusual for the verifier to have 'len' before 'buffer',
but this is fixable.

How about adding 'rd_only vs rdwr' flag ?
Then MEM_RDONLY for ret value of bpf_dynptr_slice can be set by the verifier
and in run-time bpf_dynptr_slice() wouldn't need to check for skb->cloned.
if (rd_only) return skb_header_pointer()
if (rdwr) bpf_try_make_writable(); return skb->data + off;
and final bpf_dynptr_write() is not needed.

But that doesn't work for xdp, since there is no pull.

It's not clear how to deal with BPF_F_RECOMPUTE_CSUM though.
Expose __skb_postpull_rcsum/__skb_postpush_rcsum as kfuncs?
But that defeats Andrii's goal to use dynptr as a generic wrapper.
skb is quite special.

Maybe something like:
void *bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len,
                       void *buffer, u32 buffer__sz)
{
  if (skb_cloned()) {
    skb_copy_bits(skb, offset, buffer, len);
    return buffer;
  }
  return skb_header_pointer(...);
}

When prog is just parsing the packet it doesn't need to finalize with bpf_dynptr_write.
The prog can always write into the pointer followed by if (p == buf) bpf_dynptr_write.
No need for rdonly flag, but extra copy is there in case of cloned which
could have been avoided with extra rd_only flag.

In case of xdp it will be:
void *bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len,
                       void *buffer, u32 buffer__sz)
{
   ptr = bpf_xdp_pointer(xdp, offset, len);
   if (ptr)
      return ptr;
   bpf_xdp_copy_buf(xdp, offset, buffer, len, false); /* copy into buf */
   return buffer;
}

bpf_dynptr_write will use bpf_xdp_copy_buf(,true); /* copy into xdp */

> >
> > The bpf prog usually has buffer in the stack for the common small header parsing.
> 
> sure, that would work for small chunks
> 
> >
> > One side note is the bpf_dynptr_slice() still needs to check if the skb is
> > cloned or not even the off/len is within the head range.
> 
> yep, and the above snippet will still do the right thing with
> bpf_dynptr_write(), right? bpf_dynptr_write() will have to pull
> anyways, if I understand correctly?

Yes and No. bpf_skb_store_bytes is doing pull followed by memcpy,
while xdp_store_bytes does scatter gather copy into frags.
We should probably add similar copy to skb case to avoid allocations and pull.
Then in case of:
 if (p == buf) {
      bpf_dynptr_write(dp, buf, 16);
 }

the write will guarantee to succeed for both xdp and skb and the user
doesn't need to add error checking for alloc failures in case of skb.

> >
> > > But I wonder if for simple cases when users are mostly sure that they
> > > are going to access only header data directly we can have an option
> > > for bpf_dynptr_from_skb() to specify what should be the behavior for
> > > bpf_dynptr_slice():
> > >
> > >   - either return NULL for anything that crosses into frags (no
> > > surprising perf penalty, but surprising NULLs);
> > >   - do bpf_skb_pull_data() if bpf_dynptr_data() needs to point to data
> > > beyond header (potential perf penalty, but on NULLs, if off+len is
> > > within packet).
> > >
> > > And then bpf_dynptr_from_skb() can accept a flag specifying this
> > > behavior and store it somewhere in struct bpf_dynptr.
> >
> > xdp does not have the bpf_skb_pull_data() equivalent, so xdp prog will still
> > need the write back handling.
> >
> 
> Sure, unfortunately, can't have everything. I'm just thinking how to
> make bpf_dynptr_data() generically usable. Think about some common BPF
> routine that calculates hash for all bytes pointed to by dynptr,
> regardless of underlying dynptr type; it can iterate in small chunks,
> get memory slice, if possible, but fallback to generic
> bpf_dynptr_read() if doesn't. This will work for skb, xdp, LOCAL,
> RINGBUF, any other dynptr type.

It looks to me that dynptr on top of skb, xdp, local can work as generic reader,
but dynptr as a generic writer doesn't look possible.
BPF_F_RECOMPUTE_CSUM and BPF_F_INVALIDATE_HASH are special to skb.
There is also bpf_skb_change_proto and crazy complex bpf_skb_adjust_room.
I don't think writing into skb vs xdp vs ringbuf are generalizable.
The prog needs to do a ton more work to write into skb correctly.
Alexei Starovoitov Jan. 31, 2023, 5:36 a.m. UTC | #15
On Mon, Jan 30, 2023 at 04:44:12PM -0800, Joanne Koong wrote:
> On Sun, Jan 29, 2023 at 3:39 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Fri, Jan 27, 2023 at 11:17:01AM -0800, Joanne Koong wrote:
> > > Add skb dynptrs, which are dynptrs whose underlying pointer points
> > > to a skb. The dynptr acts on skb data. skb dynptrs have two main
> > > benefits. One is that they allow operations on sizes that are not
> > > statically known at compile-time (eg variable-sized accesses).
> > > Another is that parsing the packet data through dynptrs (instead of
> > > through direct access of skb->data and skb->data_end) can be more
> > > ergonomic and less brittle (eg does not need manual if checking for
> > > being within bounds of data_end).
> > >
> > > For bpf prog types that don't support writes on skb data, the dynptr is
> > > read-only (bpf_dynptr_write() will return an error and bpf_dynptr_data()
> > > will return a data slice that is read-only where any writes to it will
> > > be rejected by the verifier).
> > >
> > > For reads and writes through the bpf_dynptr_read() and bpf_dynptr_write()
> > > interfaces, reading and writing from/to data in the head as well as from/to
> > > non-linear paged buffers is supported. For data slices (through the
> > > bpf_dynptr_data() interface), if the data is in a paged buffer, the user
> > > must first call bpf_skb_pull_data() to pull the data into the linear
> > > portion.
> >
> > Looks like there is an assumption in parts of this patch that
> > linear part of skb is always writeable. That's not the case.
> > See if (ops->gen_prologue || env->seen_direct_write) in convert_ctx_accesses().
> > For TC progs it calls bpf_unclone_prologue() which adds hidden
> > bpf_skb_pull_data() in the beginning of the prog to make it writeable.
> 
> I think we can make this assumption? For writable progs (referenced in
> the may_access_direct_pkt_data() function), all of them have a
> gen_prologue that unclones the buffer (eg tc_cls_act, lwt_xmit, sk_skb
> progs) or their linear portion is okay to write into by default (eg
> xdp, sk_msg, cg_sockopt progs).

but the patch was preserving seen_direct_write in some cases.
I'm still confused.

> >
> > > Any bpf_dynptr_write() automatically invalidates any prior data slices
> > > to the skb dynptr. This is because a bpf_dynptr_write() may be writing
> > > to data in a paged buffer, so it will need to pull the buffer first into
> > > the head. The reason it needs to be pulled instead of writing directly to
> > > the paged buffers is because they may be cloned (only the head of the skb
> > > is by default uncloned). As such, any bpf_dynptr_write() will
> > > automatically have its prior data slices invalidated, even if the write
> > > is to data in the skb head (the verifier has no way of differentiating
> > > whether the write is to the head or paged buffers during program load
> > > time).
> >
> > Could you explain the workflow how bpf_dynptr_write() invalidates other
> > pkt pointers ?
> > I expected bpf_dynptr_write() to be in bpf_helper_changes_pkt_data().
> > Looks like bpf_dynptr_write() calls bpf_skb_store_bytes() underneath,
> > but that doesn't help the verifier.
> 
> In the verifier in check_helper_call(), for the BPF_FUNC_dynptr_write
> case (line 8236) the "changes_data" variable gets set to true if the
> dynptr is an skb type. At the end of check_helper_call() on line 8474,
> since "changes_data" is true, clear_all_pkt_pointer() gets called,
> which invalidates the other packet pointers.

Ahh. I see. Thanks for explaining.

> >
> > > Please note as well that any other helper calls that change the
> > > underlying packet buffer (eg bpf_skb_pull_data()) invalidates any data
> > > slices of the skb dynptr as well. The stack trace for this is
> > > check_helper_call() -> clear_all_pkt_pointers() ->
> > > __clear_all_pkt_pointers() -> mark_reg_unknown().
> >
> > __clear_all_pkt_pointers isn't present in the tree. Typo ?
> 
> I'll update this message, clear_all_pkt_pointers() and
> __clear_all_pkt_pointers() were combined in a previous commit.
> 
> >
> > >
> > > For examples of how skb dynptrs can be used, please see the attached
> > > selftests.
> > >
> > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > ---
> > >  include/linux/bpf.h            |  82 +++++++++------
> > >  include/linux/filter.h         |  18 ++++
> > >  include/uapi/linux/bpf.h       |  37 +++++--
> > >  kernel/bpf/btf.c               |  18 ++++
> > >  kernel/bpf/helpers.c           |  95 ++++++++++++++---
> > >  kernel/bpf/verifier.c          | 185 ++++++++++++++++++++++++++-------
> > >  net/core/filter.c              |  60 ++++++++++-
> > >  tools/include/uapi/linux/bpf.h |  37 +++++--
> > >  8 files changed, 432 insertions(+), 100 deletions(-)
> > >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index 14a0264fac57..1ac061b64582 100644
> [...]
> > > @@ -8243,6 +8316,28 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> > >               mark_reg_known_zero(env, regs, BPF_REG_0);
> > >               regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> > >               regs[BPF_REG_0].mem_size = meta.mem_size;
> > > +             if (func_id == BPF_FUNC_dynptr_data &&
> > > +                 dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> > > +                     bool seen_direct_write = env->seen_direct_write;
> > > +
> > > +                     regs[BPF_REG_0].type |= DYNPTR_TYPE_SKB;
> > > +                     if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
> > > +                             regs[BPF_REG_0].type |= MEM_RDONLY;
> > > +                     else
> > > +                             /*
> > > +                              * Calling may_access_direct_pkt_data() will set
> > > +                              * env->seen_direct_write to true if the skb is
> > > +                              * writable. As an optimization, we can ignore
> > > +                              * setting env->seen_direct_write.
> > > +                              *
> > > +                              * env->seen_direct_write is used by skb
> > > +                              * programs to determine whether the skb's page
> > > +                              * buffers should be cloned. Since data slice
> > > +                              * writes would only be to the head, we can skip
> > > +                              * this.

I was talking about above comment. It reads as 'write to the head are allowed'.
But they're not. seen_direct_write is needed to do hidden pull.

> > > +                              */
> > > +                             env->seen_direct_write = seen_direct_write;
> >
> > This looks incorrect. skb head might not be writeable.
> >
> > > +             }
> > >               break;
> > >       case RET_PTR_TO_MEM_OR_BTF_ID:
> > >       {
> > > @@ -8649,6 +8744,7 @@ enum special_kfunc_type {
> > >       KF_bpf_list_pop_back,
> > >       KF_bpf_cast_to_kern_ctx,
> > >       KF_bpf_rdonly_cast,
> > > +     KF_bpf_dynptr_from_skb,
> > >       KF_bpf_rcu_read_lock,
> > >       KF_bpf_rcu_read_unlock,
> > >  };
> > > @@ -8662,6 +8758,7 @@ BTF_ID(func, bpf_list_pop_front)
> > >  BTF_ID(func, bpf_list_pop_back)
> > >  BTF_ID(func, bpf_cast_to_kern_ctx)
> > >  BTF_ID(func, bpf_rdonly_cast)
> > > +BTF_ID(func, bpf_dynptr_from_skb)
> > >  BTF_SET_END(special_kfunc_set)
> > >
> > >  BTF_ID_LIST(special_kfunc_list)
> > > @@ -8673,6 +8770,7 @@ BTF_ID(func, bpf_list_pop_front)
> > >  BTF_ID(func, bpf_list_pop_back)
> > >  BTF_ID(func, bpf_cast_to_kern_ctx)
> > >  BTF_ID(func, bpf_rdonly_cast)
> > > +BTF_ID(func, bpf_dynptr_from_skb)
> > >  BTF_ID(func, bpf_rcu_read_lock)
> > >  BTF_ID(func, bpf_rcu_read_unlock)
> > >
> > > @@ -9263,17 +9361,26 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
> > >                               return ret;
> > >                       break;
> > >               case KF_ARG_PTR_TO_DYNPTR:
> > > +             {
> > > +                     enum bpf_arg_type dynptr_arg_type = ARG_PTR_TO_DYNPTR;
> > > +
> > >                       if (reg->type != PTR_TO_STACK &&
> > >                           reg->type != CONST_PTR_TO_DYNPTR) {
> > >                               verbose(env, "arg#%d expected pointer to stack or dynptr_ptr\n", i);
> > >                               return -EINVAL;
> > >                       }
> > >
> > > -                     ret = process_dynptr_func(env, regno, insn_idx,
> > > -                                               ARG_PTR_TO_DYNPTR | MEM_RDONLY);
> > > +                     if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb])
> > > +                             dynptr_arg_type |= MEM_UNINIT | DYNPTR_TYPE_SKB;
> > > +                     else
> > > +                             dynptr_arg_type |= MEM_RDONLY;
> > > +
> > > +                     ret = process_dynptr_func(env, regno, insn_idx, dynptr_arg_type,
> > > +                                               meta->func_id);
> > >                       if (ret < 0)
> > >                               return ret;
> > >                       break;
> > > +             }
> > >               case KF_ARG_PTR_TO_LIST_HEAD:
> > >                       if (reg->type != PTR_TO_MAP_VALUE &&
> > >                           reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
> > > @@ -15857,6 +15964,14 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> > >                  desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
> > >               insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
> > >               *cnt = 1;
> > > +     } else if (desc->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb]) {
> > > +             bool is_rdonly = !may_access_direct_pkt_data(env, NULL, BPF_WRITE);
> > > +             struct bpf_insn addr[2] = { BPF_LD_IMM64(BPF_REG_4, is_rdonly) };
> >
> > Why use 16-byte insn to pass boolean in R4 ?
> > Single 8-byte MOV would do.
> 
> Great, I'll change it to a 8-byte MOV
> 
> >
> > > +
> > > +             insn_buf[0] = addr[0];
> > > +             insn_buf[1] = addr[1];
> > > +             insn_buf[2] = *insn;
> > > +             *cnt = 3;
> > >       }
> > >       return 0;
> > >  }
> > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > index 6da78b3d381e..ddb47126071a 100644
> > > --- a/net/core/filter.c
> > > +++ b/net/core/filter.c
> > > @@ -1684,8 +1684,8 @@ static inline void bpf_pull_mac_rcsum(struct sk_buff *skb)
> > >               skb_postpull_rcsum(skb, skb_mac_header(skb), skb->mac_len);
> > >  }
> > >
> > > -BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *, skb, u32, offset,
> > > -        const void *, from, u32, len, u64, flags)
> > > +int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from,
> > > +                       u32 len, u64 flags)
> >
> > This change is just to be able to call __bpf_skb_store_bytes() ?
> > If so, it's unnecessary.
> > See:
> > BPF_CALL_4(sk_reuseport_load_bytes,
> >            const struct sk_reuseport_kern *, reuse_kern, u32, offset,
> >            void *, to, u32, len)
> > {
> >         return ____bpf_skb_load_bytes(reuse_kern->skb, offset, to, len);
> > }
> >
> 
> There was prior feedback [0] that using four underscores to call a
> helper function is confusing and makes it ungreppable

There are plenty of ungreppable funcs in the kernel.
Try finding where folio_test_dirty() is defined.
mm subsystem is full of such 'features'.
Not friendly for casual kernel code reader, but useful.

Since quadruple underscore is already used in the code base
I see no reason to sacrifice bpf_skb_load_bytes performance with extra call.
Joanne Koong Jan. 31, 2023, 5:54 p.m. UTC | #16
On Mon, Jan 30, 2023 at 9:36 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Jan 30, 2023 at 04:44:12PM -0800, Joanne Koong wrote:
> > On Sun, Jan 29, 2023 at 3:39 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Fri, Jan 27, 2023 at 11:17:01AM -0800, Joanne Koong wrote:
> > > > Add skb dynptrs, which are dynptrs whose underlying pointer points
> > > > to a skb. The dynptr acts on skb data. skb dynptrs have two main
> > > > benefits. One is that they allow operations on sizes that are not
> > > > statically known at compile-time (eg variable-sized accesses).
> > > > Another is that parsing the packet data through dynptrs (instead of
> > > > through direct access of skb->data and skb->data_end) can be more
> > > > ergonomic and less brittle (eg does not need manual if checking for
> > > > being within bounds of data_end).
> > > >
> > > > For bpf prog types that don't support writes on skb data, the dynptr is
> > > > read-only (bpf_dynptr_write() will return an error and bpf_dynptr_data()
> > > > will return a data slice that is read-only where any writes to it will
> > > > be rejected by the verifier).
> > > >
> > > > For reads and writes through the bpf_dynptr_read() and bpf_dynptr_write()
> > > > interfaces, reading and writing from/to data in the head as well as from/to
> > > > non-linear paged buffers is supported. For data slices (through the
> > > > bpf_dynptr_data() interface), if the data is in a paged buffer, the user
> > > > must first call bpf_skb_pull_data() to pull the data into the linear
> > > > portion.
> > >
> > > Looks like there is an assumption in parts of this patch that
> > > linear part of skb is always writeable. That's not the case.
> > > See if (ops->gen_prologue || env->seen_direct_write) in convert_ctx_accesses().
> > > For TC progs it calls bpf_unclone_prologue() which adds hidden
> > > bpf_skb_pull_data() in the beginning of the prog to make it writeable.
> >
> > I think we can make this assumption? For writable progs (referenced in
> > the may_access_direct_pkt_data() function), all of them have a
> > gen_prologue that unclones the buffer (eg tc_cls_act, lwt_xmit, sk_skb
> > progs) or their linear portion is okay to write into by default (eg
> > xdp, sk_msg, cg_sockopt progs).
>
> but the patch was preserving seen_direct_write in some cases.
> I'm still confused.

seen_direct_write is used to determine whether to actually unclone or
not in the program's prologue function (eg tc_cls_act_prologue() ->
bpf_unclone_prologue() where in bpf_unclone_prologue(), if
direct_write was not true, then it can skip doing the actual
uncloning).

I think the part of the patch you're talking about regarding
seen_direct_write is this in check_helper_call():

+ if (func_id == BPF_FUNC_dynptr_data &&
+    dynptr_type == BPF_DYNPTR_TYPE_SKB) {
+   bool seen_direct_write = env->seen_direct_write;
+
+   regs[BPF_REG_0].type |= DYNPTR_TYPE_SKB;
+   if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
+     regs[BPF_REG_0].type |= MEM_RDONLY;
+   else
+     /*
+     * Calling may_access_direct_pkt_data() will set
+     * env->seen_direct_write to true if the skb is
+     * writable. As an optimization, we can ignore
+     * setting env->seen_direct_write.
+     *
+     * env->seen_direct_write is used by skb
+     * programs to determine whether the skb's page
+     * buffers should be cloned. Since data slice
+     * writes would only be to the head, we can skip
+     * this.
+     */
+     env->seen_direct_write = seen_direct_write;
+ }

If the data slice for a skb dynptr is writable, then seen_direct_write
gets set to true (done internally in may_access_direct_pkt_data()) so
that the skb is actually uncloned, whereas if it's read-only, then
env->seen_direct_write gets reset to its original value (since the
may_access_direct_pkt_data() call will have set env->seen_direct_write
to true)

>
> > >
> > > > Any bpf_dynptr_write() automatically invalidates any prior data slices
> > > > to the skb dynptr. This is because a bpf_dynptr_write() may be writing
> > > > to data in a paged buffer, so it will need to pull the buffer first into
> > > > the head. The reason it needs to be pulled instead of writing directly to
> > > > the paged buffers is because they may be cloned (only the head of the skb
> > > > is by default uncloned). As such, any bpf_dynptr_write() will
> > > > automatically have its prior data slices invalidated, even if the write
> > > > is to data in the skb head (the verifier has no way of differentiating
> > > > whether the write is to the head or paged buffers during program load
> > > > time).
> > >
> > > Could you explain the workflow how bpf_dynptr_write() invalidates other
> > > pkt pointers ?
> > > I expected bpf_dynptr_write() to be in bpf_helper_changes_pkt_data().
> > > Looks like bpf_dynptr_write() calls bpf_skb_store_bytes() underneath,
> > > but that doesn't help the verifier.
> >
> > In the verifier in check_helper_call(), for the BPF_FUNC_dynptr_write
> > case (line 8236) the "changes_data" variable gets set to true if the
> > dynptr is an skb type. At the end of check_helper_call() on line 8474,
> > since "changes_data" is true, clear_all_pkt_pointer() gets called,
> > which invalidates the other packet pointers.
>
> Ahh. I see. Thanks for explaining.
>
> > >
> > > > Please note as well that any other helper calls that change the
> > > > underlying packet buffer (eg bpf_skb_pull_data()) invalidates any data
> > > > slices of the skb dynptr as well. The stack trace for this is
> > > > check_helper_call() -> clear_all_pkt_pointers() ->
> > > > __clear_all_pkt_pointers() -> mark_reg_unknown().
> > >
> > > __clear_all_pkt_pointers isn't present in the tree. Typo ?
> >
> > I'll update this message, clear_all_pkt_pointers() and
> > __clear_all_pkt_pointers() were combined in a previous commit.
> >
> > >
> > > >
> > > > For examples of how skb dynptrs can be used, please see the attached
> > > > selftests.
> > > >
> > > > Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> > > > ---
> > > >  include/linux/bpf.h            |  82 +++++++++------
> > > >  include/linux/filter.h         |  18 ++++
> > > >  include/uapi/linux/bpf.h       |  37 +++++--
> > > >  kernel/bpf/btf.c               |  18 ++++
> > > >  kernel/bpf/helpers.c           |  95 ++++++++++++++---
> > > >  kernel/bpf/verifier.c          | 185 ++++++++++++++++++++++++++-------
> > > >  net/core/filter.c              |  60 ++++++++++-
> > > >  tools/include/uapi/linux/bpf.h |  37 +++++--
> > > >  8 files changed, 432 insertions(+), 100 deletions(-)
> > > >
> > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > > index 14a0264fac57..1ac061b64582 100644
> > [...]
> > > > @@ -8243,6 +8316,28 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> > > >               mark_reg_known_zero(env, regs, BPF_REG_0);
> > > >               regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> > > >               regs[BPF_REG_0].mem_size = meta.mem_size;
> > > > +             if (func_id == BPF_FUNC_dynptr_data &&
> > > > +                 dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> > > > +                     bool seen_direct_write = env->seen_direct_write;
> > > > +
> > > > +                     regs[BPF_REG_0].type |= DYNPTR_TYPE_SKB;
> > > > +                     if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
> > > > +                             regs[BPF_REG_0].type |= MEM_RDONLY;
> > > > +                     else
> > > > +                             /*
> > > > +                              * Calling may_access_direct_pkt_data() will set
> > > > +                              * env->seen_direct_write to true if the skb is
> > > > +                              * writable. As an optimization, we can ignore
> > > > +                              * setting env->seen_direct_write.
> > > > +                              *
> > > > +                              * env->seen_direct_write is used by skb
> > > > +                              * programs to determine whether the skb's page
> > > > +                              * buffers should be cloned. Since data slice
> > > > +                              * writes would only be to the head, we can skip
> > > > +                              * this.
>
> I was talking about above comment. It reads as 'write to the head are allowed'.
> But they're not. seen_direct_write is needed to do hidden pull.
>

I will remove this line, I agree that it is confusing.

> > > > +                              */
> > > > +                             env->seen_direct_write = seen_direct_write;
> > >
> > > This looks incorrect. skb head might not be writeable.
> > >
> > > > +             }
> > > >               break;
> > > >       case RET_PTR_TO_MEM_OR_BTF_ID:
> > > >       {
> > > > @@ -8649,6 +8744,7 @@ enum special_kfunc_type {
> > > >       KF_bpf_list_pop_back,
> > > >       KF_bpf_cast_to_kern_ctx,
> > > >       KF_bpf_rdonly_cast,
> > > > +     KF_bpf_dynptr_from_skb,
> > > >       KF_bpf_rcu_read_lock,
> > > >       KF_bpf_rcu_read_unlock,
> > > >  };
> > > > @@ -8662,6 +8758,7 @@ BTF_ID(func, bpf_list_pop_front)
> > > >  BTF_ID(func, bpf_list_pop_back)
> > > >  BTF_ID(func, bpf_cast_to_kern_ctx)
> > > >  BTF_ID(func, bpf_rdonly_cast)
> > > > +BTF_ID(func, bpf_dynptr_from_skb)
> > > >  BTF_SET_END(special_kfunc_set)
> > > >
> > > >  BTF_ID_LIST(special_kfunc_list)
> > > > @@ -8673,6 +8770,7 @@ BTF_ID(func, bpf_list_pop_front)
> > > >  BTF_ID(func, bpf_list_pop_back)
> > > >  BTF_ID(func, bpf_cast_to_kern_ctx)
> > > >  BTF_ID(func, bpf_rdonly_cast)
> > > > +BTF_ID(func, bpf_dynptr_from_skb)
> > > >  BTF_ID(func, bpf_rcu_read_lock)
> > > >  BTF_ID(func, bpf_rcu_read_unlock)
> > > >
> > > > @@ -9263,17 +9361,26 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
> > > >                               return ret;
> > > >                       break;
> > > >               case KF_ARG_PTR_TO_DYNPTR:
> > > > +             {
> > > > +                     enum bpf_arg_type dynptr_arg_type = ARG_PTR_TO_DYNPTR;
> > > > +
> > > >                       if (reg->type != PTR_TO_STACK &&
> > > >                           reg->type != CONST_PTR_TO_DYNPTR) {
> > > >                               verbose(env, "arg#%d expected pointer to stack or dynptr_ptr\n", i);
> > > >                               return -EINVAL;
> > > >                       }
> > > >
> > > > -                     ret = process_dynptr_func(env, regno, insn_idx,
> > > > -                                               ARG_PTR_TO_DYNPTR | MEM_RDONLY);
> > > > +                     if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb])
> > > > +                             dynptr_arg_type |= MEM_UNINIT | DYNPTR_TYPE_SKB;
> > > > +                     else
> > > > +                             dynptr_arg_type |= MEM_RDONLY;
> > > > +
> > > > +                     ret = process_dynptr_func(env, regno, insn_idx, dynptr_arg_type,
> > > > +                                               meta->func_id);
> > > >                       if (ret < 0)
> > > >                               return ret;
> > > >                       break;
> > > > +             }
> > > >               case KF_ARG_PTR_TO_LIST_HEAD:
> > > >                       if (reg->type != PTR_TO_MAP_VALUE &&
> > > >                           reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
> > > > @@ -15857,6 +15964,14 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> > > >                  desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
> > > >               insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
> > > >               *cnt = 1;
> > > > +     } else if (desc->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb]) {
> > > > +             bool is_rdonly = !may_access_direct_pkt_data(env, NULL, BPF_WRITE);
> > > > +             struct bpf_insn addr[2] = { BPF_LD_IMM64(BPF_REG_4, is_rdonly) };
> > >
> > > Why use 16-byte insn to pass boolean in R4 ?
> > > Single 8-byte MOV would do.
> >
> > Great, I'll change it to a 8-byte MOV
> >
> > >
> > > > +
> > > > +             insn_buf[0] = addr[0];
> > > > +             insn_buf[1] = addr[1];
> > > > +             insn_buf[2] = *insn;
> > > > +             *cnt = 3;
> > > >       }
> > > >       return 0;
> > > >  }
> > > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > > index 6da78b3d381e..ddb47126071a 100644
> > > > --- a/net/core/filter.c
> > > > +++ b/net/core/filter.c
> > > > @@ -1684,8 +1684,8 @@ static inline void bpf_pull_mac_rcsum(struct sk_buff *skb)
> > > >               skb_postpull_rcsum(skb, skb_mac_header(skb), skb->mac_len);
> > > >  }
> > > >
> > > > -BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *, skb, u32, offset,
> > > > -        const void *, from, u32, len, u64, flags)
> > > > +int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from,
> > > > +                       u32 len, u64 flags)
> > >
> > > This change is just to be able to call __bpf_skb_store_bytes() ?
> > > If so, it's unnecessary.
> > > See:
> > > BPF_CALL_4(sk_reuseport_load_bytes,
> > >            const struct sk_reuseport_kern *, reuse_kern, u32, offset,
> > >            void *, to, u32, len)
> > > {
> > >         return ____bpf_skb_load_bytes(reuse_kern->skb, offset, to, len);
> > > }
> > >
> >
> > There was prior feedback [0] that using four underscores to call a
> > helper function is confusing and makes it ungreppable
>
> There are plenty of ungreppable funcs in the kernel.
> Try finding where folio_test_dirty() is defined.
> mm subsystem is full of such 'features'.
> Not friendly for casual kernel code reader, but useful.
>
> Since quadruple underscore is already used in the code base
> I see no reason to sacrifice bpf_skb_load_bytes performance with extra call.

I don't have a preference either way, I'll change it to use the
quadruple underscore in the next version
Joanne Koong Jan. 31, 2023, 6:18 p.m. UTC | #17
On Mon, Jan 30, 2023 at 2:04 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 1/27/23 11:17 AM, Joanne Koong wrote:
> > @@ -8243,6 +8316,28 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> >               mark_reg_known_zero(env, regs, BPF_REG_0);
> >               regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> >               regs[BPF_REG_0].mem_size = meta.mem_size;
> > +             if (func_id == BPF_FUNC_dynptr_data &&
> > +                 dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> > +                     bool seen_direct_write = env->seen_direct_write;
> > +
> > +                     regs[BPF_REG_0].type |= DYNPTR_TYPE_SKB;
> > +                     if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
> > +                             regs[BPF_REG_0].type |= MEM_RDONLY;
> > +                     else
> > +                             /*
> > +                              * Calling may_access_direct_pkt_data() will set
> > +                              * env->seen_direct_write to true if the skb is
> > +                              * writable. As an optimization, we can ignore
> > +                              * setting env->seen_direct_write.
> > +                              *
> > +                              * env->seen_direct_write is used by skb
> > +                              * programs to determine whether the skb's page
> > +                              * buffers should be cloned. Since data slice
> > +                              * writes would only be to the head, we can skip
> > +                              * this.
> > +                              */
> > +                             env->seen_direct_write = seen_direct_write;
> > +             }
>
> [ ... ]
>
> > @@ -9263,17 +9361,26 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
> >                               return ret;
> >                       break;
> >               case KF_ARG_PTR_TO_DYNPTR:
> > +             {
> > +                     enum bpf_arg_type dynptr_arg_type = ARG_PTR_TO_DYNPTR;
> > +
> >                       if (reg->type != PTR_TO_STACK &&
> >                           reg->type != CONST_PTR_TO_DYNPTR) {
> >                               verbose(env, "arg#%d expected pointer to stack or dynptr_ptr\n", i);
> >                               return -EINVAL;
> >                       }
> >
> > -                     ret = process_dynptr_func(env, regno, insn_idx,
> > -                                               ARG_PTR_TO_DYNPTR | MEM_RDONLY);
> > +                     if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb])
> > +                             dynptr_arg_type |= MEM_UNINIT | DYNPTR_TYPE_SKB;
> > +                     else
> > +                             dynptr_arg_type |= MEM_RDONLY;
> > +
> > +                     ret = process_dynptr_func(env, regno, insn_idx, dynptr_arg_type,
> > +                                               meta->func_id);
> >                       if (ret < 0)
> >                               return ret;
> >                       break;
> > +             }
> >               case KF_ARG_PTR_TO_LIST_HEAD:
> >                       if (reg->type != PTR_TO_MAP_VALUE &&
> >                           reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
> > @@ -15857,6 +15964,14 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> >                  desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
> >               insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
> >               *cnt = 1;
> > +     } else if (desc->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb]) {
> > +             bool is_rdonly = !may_access_direct_pkt_data(env, NULL, BPF_WRITE);
>
> Does it need to restore the env->seen_direct_write here also?
>
> It seems this 'seen_direct_write' saving/restoring is needed now because
> 'may_access_direct_pkt_data(BPF_WRITE)' is not only called when it is actually
> writing the packet. Some refactoring can help to avoid issue like this.

Yes! Great catch! I'll submit a patch that refactors this, so that
env->seen_direct_write isn't set implicitly within
may_access_direct_pkt_data()

>
> While at 'seen_direct_write', Alexei has also pointed out that the verifier
> needs to track whether the (packet) 'slice' returned by bpf_dynptr_data() has
> been written. It should be tracked in 'seen_direct_write'. Take a look at how
> reg_is_pkt_pointer() and may_access_direct_pkt_data() are done in
> check_mem_access(). iirc, this reg_is_pkt_pointer() part got loss somewhere in
> v5 (or v4?) when bpf_dynptr_data() was changed to return register typed
> PTR_TO_MEM instead of PTR_TO_PACKET.
>

The verifier right now does track whether the dynptr skb 'slice' is
writable or not and sets seen_direct_write accordingly. However, it
currently does it in check_helper_call() where if the bpf program is
writable, then the env->seen_direct_write is set (regardless of
whether actual writes occur or not), so I like your idea of moving
this to check_mem_access(). The PTR_TO_MEM that gets returned for the
data slice will need to be tagged with DYNPTR_TYPE_SKB.

>
> [ ... ]
>
> > +int bpf_dynptr_from_skb(struct sk_buff *skb, u64 flags,
> > +                     struct bpf_dynptr_kern *ptr, int is_rdonly)
>
> hmm... this exposed kfunc takes "int is_rdonly".
>
> What if the bpf prog calls it like bpf_dynptr_from_skb(..., false) in some hook
> that is not writable to packet?

If the bpf prog tries to do this, their "false" value will be ignored,
because the "int is_rdonly" arg value gets set by the verifier (in
fixup_kfunc_call() in line 15969)

>
> > +{
> > +     if (flags) {
> > +             bpf_dynptr_set_null(ptr);
> > +             return -EINVAL;
> > +     }
> > +
> > +     bpf_dynptr_init(ptr, skb, BPF_DYNPTR_TYPE_SKB, 0, skb->len);
> > +
> > +     if (is_rdonly)
> > +             bpf_dynptr_set_rdonly(ptr);
> > +
> > +     return 0;
> > +}
> > +
> >   BPF_CALL_1(bpf_sk_fullsock, struct sock *, sk)
> >   {
> >       return sk_fullsock(sk) ? (unsigned long)sk : (unsigned long)NULL;
> > @@ -11607,3 +11634,28 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id)
> >
> >       return func;
> >   }
> > +
> > +BTF_SET8_START(bpf_kfunc_check_set_skb)
> > +BTF_ID_FLAGS(func, bpf_dynptr_from_skb)
> > +BTF_SET8_END(bpf_kfunc_check_set_skb)
> > +
> > +static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {
> > +     .owner = THIS_MODULE,
> > +     .set = &bpf_kfunc_check_set_skb,
> > +};
> > +
> > +static int __init bpf_kfunc_init(void)
> > +{
> > +     int ret;
> > +
> > +     ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_skb);
> > +     ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_ACT, &bpf_kfunc_set_skb);
> > +     ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SK_SKB, &bpf_kfunc_set_skb);
> > +     ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SOCKET_FILTER, &bpf_kfunc_set_skb);
> > +     ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SKB, &bpf_kfunc_set_skb);
> > +     ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_LWT_OUT, &bpf_kfunc_set_skb);
> > +     ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_LWT_IN, &bpf_kfunc_set_skb);
> > +     ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_LWT_XMIT, &bpf_kfunc_set_skb);
> > +     return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_LWT_SEG6LOCAL, &bpf_kfunc_set_skb);
> > +}
> > +late_initcall(bpf_kfunc_init);
>
>
Joanne Koong Jan. 31, 2023, 6:30 p.m. UTC | #18
On Mon, Jan 30, 2023 at 5:04 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Jan 30, 2023 at 2:31 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, Jan 30, 2023 at 02:04:08PM -0800, Martin KaFai Lau wrote:
> > > On 1/27/23 11:17 AM, Joanne Koong wrote:
> > > > @@ -8243,6 +8316,28 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> > > >             mark_reg_known_zero(env, regs, BPF_REG_0);
> > > >             regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> > > >             regs[BPF_REG_0].mem_size = meta.mem_size;
> > > > +           if (func_id == BPF_FUNC_dynptr_data &&
> > > > +               dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> > > > +                   bool seen_direct_write = env->seen_direct_write;
> > > > +
> > > > +                   regs[BPF_REG_0].type |= DYNPTR_TYPE_SKB;
> > > > +                   if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
> > > > +                           regs[BPF_REG_0].type |= MEM_RDONLY;
> > > > +                   else
> > > > +                           /*
> > > > +                            * Calling may_access_direct_pkt_data() will set
> > > > +                            * env->seen_direct_write to true if the skb is
> > > > +                            * writable. As an optimization, we can ignore
> > > > +                            * setting env->seen_direct_write.
> > > > +                            *
> > > > +                            * env->seen_direct_write is used by skb
> > > > +                            * programs to determine whether the skb's page
> > > > +                            * buffers should be cloned. Since data slice
> > > > +                            * writes would only be to the head, we can skip
> > > > +                            * this.
> > > > +                            */
> > > > +                           env->seen_direct_write = seen_direct_write;
> > > > +           }
> > >
> > > [ ... ]
> > >
> > > > @@ -9263,17 +9361,26 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
> > > >                             return ret;
> > > >                     break;
> > > >             case KF_ARG_PTR_TO_DYNPTR:
> > > > +           {
> > > > +                   enum bpf_arg_type dynptr_arg_type = ARG_PTR_TO_DYNPTR;
> > > > +
> > > >                     if (reg->type != PTR_TO_STACK &&
> > > >                         reg->type != CONST_PTR_TO_DYNPTR) {
> > > >                             verbose(env, "arg#%d expected pointer to stack or dynptr_ptr\n", i);
> > > >                             return -EINVAL;
> > > >                     }
> > > > -                   ret = process_dynptr_func(env, regno, insn_idx,
> > > > -                                             ARG_PTR_TO_DYNPTR | MEM_RDONLY);
> > > > +                   if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb])
> > > > +                           dynptr_arg_type |= MEM_UNINIT | DYNPTR_TYPE_SKB;
> > > > +                   else
> > > > +                           dynptr_arg_type |= MEM_RDONLY;
> > > > +
> > > > +                   ret = process_dynptr_func(env, regno, insn_idx, dynptr_arg_type,
> > > > +                                             meta->func_id);
> > > >                     if (ret < 0)
> > > >                             return ret;
> > > >                     break;
> > > > +           }
> > > >             case KF_ARG_PTR_TO_LIST_HEAD:
> > > >                     if (reg->type != PTR_TO_MAP_VALUE &&
> > > >                         reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
> > > > @@ -15857,6 +15964,14 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> > > >                desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
> > > >             insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
> > > >             *cnt = 1;
> > > > +   } else if (desc->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb]) {
> > > > +           bool is_rdonly = !may_access_direct_pkt_data(env, NULL, BPF_WRITE);
> > >
> > > Does it need to restore the env->seen_direct_write here also?
> > >
> > > It seems this 'seen_direct_write' saving/restoring is needed now because
> > > 'may_access_direct_pkt_data(BPF_WRITE)' is not only called when it is
> > > actually writing the packet. Some refactoring can help to avoid issue like
> > > this.
> > >
> > > While at 'seen_direct_write', Alexei has also pointed out that the verifier
> > > needs to track whether the (packet) 'slice' returned by bpf_dynptr_data()
> > > has been written. It should be tracked in 'seen_direct_write'. Take a look
> > > at how reg_is_pkt_pointer() and may_access_direct_pkt_data() are done in
> > > check_mem_access(). iirc, this reg_is_pkt_pointer() part got loss somewhere
> > > in v5 (or v4?) when bpf_dynptr_data() was changed to return register typed
> > > PTR_TO_MEM instead of PTR_TO_PACKET.
> >
> > btw tc progs are using gen_prologue() approach because data/data_end are not kfuncs
> > (nothing is being called by the bpf prog).
> > In this case we don't need to repeat this approach. If so we don't need to
> > set seen_direct_write.
> > Instead bpf_dynptr_data() can call bpf_skb_pull_data() directly.
> > And technically we don't need to limit it to skb head. It can handle any off/len.
> > It will work for skb, but there is no equivalent for xdp_pull_data().
> > I don't think we can implement xdp_pull_data in all drivers.
> > That's massive amount of work, but we need to be consistent if we want
> > dynptr to wrap both skb and xdp.
> > We can say dynptr_data is for head only, but we've seen bugs where people
> > had to switch from data/data_end to load_bytes.
> >
> > Also bpf_skb_pull_data is quite heavy. For progs that only want to parse
> > the packet calling that in bpf_dynptr_data is a heavy hammer.
> >
> > It feels that we need to go back to skb_header_pointer-like discussion.
> > Something like:
> > bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len, void *buffer)
> > Whether buffer is a part of dynptr or program provided is tbd.
>
> making it hidden within dynptr would make this approach unreliable
> (memory allocations, which can fail, etc). But if we ask users to pass
> it directly, then it should be relatively easy to use in practice with
> some pre-allocated per-CPU buffer:
>
>
> struct {
>   __int(type, BPF_MAP_TYPE_PERCPU_ARRAY);
>   __int(max_entries, 1);
>   __type(key, int);
>   __type(value, char[4096]);
> } scratch SEC(".maps");
>
>
> ...
>
>
> struct dyn_ptr *dp = bpf_dynptr_from_skb(...).
> void *p, *buf;
> int zero = 0;
>
> buf = bpf_map_lookup_elem(&scratch, &zero);
> if (!buf) return 0; /* can't happen */
>
> p = bpf_dynptr_slice(dp, off, 16, buf);
> if (p == NULL) {
>    /* out of range */
> } else {
>    /* work with p directly */
> }
>
> /* if we wrote something to p and it was copied to buffer, write it back */
> if (p == buf) {
>     bpf_dynptr_write(dp, buf, 16);
> }
>
>
> We'll just need to teach verifier to make sure that buf is at least 16
> byte long.

I'm confused what the benefit of passing in the buffer is. If it's to
avoid the uncloning, this will still need to happen if the user writes
back the data to the skb (which will be the majority of cases). If
it's to avoid uncloning if the user only reads the data of a writable
prog, then we could add logic in the verifier so that we don't pull
the data in this case; the uncloning might still happen regardless if
another part of the program does a direct write. If the benefit is to
avoid needing to pull the data, then can't the user just use
bpf_dynptr_read, which takes in a buffer?

>
>
> But I wonder if for simple cases when users are mostly sure that they
> are going to access only header data directly we can have an option
> for bpf_dynptr_from_skb() to specify what should be the behavior for
> bpf_dynptr_slice():
>
>  - either return NULL for anything that crosses into frags (no
> surprising perf penalty, but surprising NULLs);
>  - do bpf_skb_pull_data() if bpf_dynptr_data() needs to point to data
> beyond header (potential perf penalty, but on NULLs, if off+len is
> within packet).
>
> And then bpf_dynptr_from_skb() can accept a flag specifying this
> behavior and store it somewhere in struct bpf_dynptr.
>
> Thoughts?
Alexei Starovoitov Jan. 31, 2023, 7:50 p.m. UTC | #19
On Tue, Jan 31, 2023 at 9:55 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Mon, Jan 30, 2023 at 9:36 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, Jan 30, 2023 at 04:44:12PM -0800, Joanne Koong wrote:
> > > On Sun, Jan 29, 2023 at 3:39 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Fri, Jan 27, 2023 at 11:17:01AM -0800, Joanne Koong wrote:
> > > > > Add skb dynptrs, which are dynptrs whose underlying pointer points
> > > > > to a skb. The dynptr acts on skb data. skb dynptrs have two main
> > > > > benefits. One is that they allow operations on sizes that are not
> > > > > statically known at compile-time (eg variable-sized accesses).
> > > > > Another is that parsing the packet data through dynptrs (instead of
> > > > > through direct access of skb->data and skb->data_end) can be more
> > > > > ergonomic and less brittle (eg does not need manual if checking for
> > > > > being within bounds of data_end).
> > > > >
> > > > > For bpf prog types that don't support writes on skb data, the dynptr is
> > > > > read-only (bpf_dynptr_write() will return an error and bpf_dynptr_data()
> > > > > will return a data slice that is read-only where any writes to it will
> > > > > be rejected by the verifier).
> > > > >
> > > > > For reads and writes through the bpf_dynptr_read() and bpf_dynptr_write()
> > > > > interfaces, reading and writing from/to data in the head as well as from/to
> > > > > non-linear paged buffers is supported. For data slices (through the
> > > > > bpf_dynptr_data() interface), if the data is in a paged buffer, the user
> > > > > must first call bpf_skb_pull_data() to pull the data into the linear
> > > > > portion.
> > > >
> > > > Looks like there is an assumption in parts of this patch that
> > > > linear part of skb is always writeable. That's not the case.
> > > > See if (ops->gen_prologue || env->seen_direct_write) in convert_ctx_accesses().
> > > > For TC progs it calls bpf_unclone_prologue() which adds hidden
> > > > bpf_skb_pull_data() in the beginning of the prog to make it writeable.
> > >
> > > I think we can make this assumption? For writable progs (referenced in
> > > the may_access_direct_pkt_data() function), all of them have a
> > > gen_prologue that unclones the buffer (eg tc_cls_act, lwt_xmit, sk_skb
> > > progs) or their linear portion is okay to write into by default (eg
> > > xdp, sk_msg, cg_sockopt progs).
> >
> > but the patch was preserving seen_direct_write in some cases.
> > I'm still confused.
>
> seen_direct_write is used to determine whether to actually unclone or
> not in the program's prologue function (eg tc_cls_act_prologue() ->
> bpf_unclone_prologue() where in bpf_unclone_prologue(), if
> direct_write was not true, then it can skip doing the actual
> uncloning).
>
> I think the part of the patch you're talking about regarding
> seen_direct_write is this in check_helper_call():
>
> + if (func_id == BPF_FUNC_dynptr_data &&
> +    dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> +   bool seen_direct_write = env->seen_direct_write;
> +
> +   regs[BPF_REG_0].type |= DYNPTR_TYPE_SKB;
> +   if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
> +     regs[BPF_REG_0].type |= MEM_RDONLY;
> +   else
> +     /*
> +     * Calling may_access_direct_pkt_data() will set
> +     * env->seen_direct_write to true if the skb is
> +     * writable. As an optimization, we can ignore
> +     * setting env->seen_direct_write.
> +     *
> +     * env->seen_direct_write is used by skb
> +     * programs to determine whether the skb's page
> +     * buffers should be cloned. Since data slice
> +     * writes would only be to the head, we can skip
> +     * this.
> +     */
> +     env->seen_direct_write = seen_direct_write;
> + }
>
> If the data slice for a skb dynptr is writable, then seen_direct_write
> gets set to true (done internally in may_access_direct_pkt_data()) so
> that the skb is actually uncloned, whereas if it's read-only, then
> env->seen_direct_write gets reset to its original value (since the
> may_access_direct_pkt_data() call will have set env->seen_direct_write
> to true)

I'm still confused.
When may_access_direct_pkt_data() returns false
it doesn't change seen_direct_write.
When it returns true it also sets seen_direct_write=true.
But the code above restores it to whatever value it had before.
How is this correct?
Are you saying that another may_access_direct_pkt_data() gets
called somewhere in the verifier that sets seen_direct_write=true?
But what's the harm in doing it twice or N times in all cases?
Alexei Starovoitov Jan. 31, 2023, 7:58 p.m. UTC | #20
On Tue, Jan 31, 2023 at 10:30 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Mon, Jan 30, 2023 at 5:04 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Mon, Jan 30, 2023 at 2:31 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Mon, Jan 30, 2023 at 02:04:08PM -0800, Martin KaFai Lau wrote:
> > > > On 1/27/23 11:17 AM, Joanne Koong wrote:
> > > > > @@ -8243,6 +8316,28 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> > > > >             mark_reg_known_zero(env, regs, BPF_REG_0);
> > > > >             regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> > > > >             regs[BPF_REG_0].mem_size = meta.mem_size;
> > > > > +           if (func_id == BPF_FUNC_dynptr_data &&
> > > > > +               dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> > > > > +                   bool seen_direct_write = env->seen_direct_write;
> > > > > +
> > > > > +                   regs[BPF_REG_0].type |= DYNPTR_TYPE_SKB;
> > > > > +                   if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
> > > > > +                           regs[BPF_REG_0].type |= MEM_RDONLY;
> > > > > +                   else
> > > > > +                           /*
> > > > > +                            * Calling may_access_direct_pkt_data() will set
> > > > > +                            * env->seen_direct_write to true if the skb is
> > > > > +                            * writable. As an optimization, we can ignore
> > > > > +                            * setting env->seen_direct_write.
> > > > > +                            *
> > > > > +                            * env->seen_direct_write is used by skb
> > > > > +                            * programs to determine whether the skb's page
> > > > > +                            * buffers should be cloned. Since data slice
> > > > > +                            * writes would only be to the head, we can skip
> > > > > +                            * this.
> > > > > +                            */
> > > > > +                           env->seen_direct_write = seen_direct_write;
> > > > > +           }
> > > >
> > > > [ ... ]
> > > >
> > > > > @@ -9263,17 +9361,26 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
> > > > >                             return ret;
> > > > >                     break;
> > > > >             case KF_ARG_PTR_TO_DYNPTR:
> > > > > +           {
> > > > > +                   enum bpf_arg_type dynptr_arg_type = ARG_PTR_TO_DYNPTR;
> > > > > +
> > > > >                     if (reg->type != PTR_TO_STACK &&
> > > > >                         reg->type != CONST_PTR_TO_DYNPTR) {
> > > > >                             verbose(env, "arg#%d expected pointer to stack or dynptr_ptr\n", i);
> > > > >                             return -EINVAL;
> > > > >                     }
> > > > > -                   ret = process_dynptr_func(env, regno, insn_idx,
> > > > > -                                             ARG_PTR_TO_DYNPTR | MEM_RDONLY);
> > > > > +                   if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb])
> > > > > +                           dynptr_arg_type |= MEM_UNINIT | DYNPTR_TYPE_SKB;
> > > > > +                   else
> > > > > +                           dynptr_arg_type |= MEM_RDONLY;
> > > > > +
> > > > > +                   ret = process_dynptr_func(env, regno, insn_idx, dynptr_arg_type,
> > > > > +                                             meta->func_id);
> > > > >                     if (ret < 0)
> > > > >                             return ret;
> > > > >                     break;
> > > > > +           }
> > > > >             case KF_ARG_PTR_TO_LIST_HEAD:
> > > > >                     if (reg->type != PTR_TO_MAP_VALUE &&
> > > > >                         reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
> > > > > @@ -15857,6 +15964,14 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> > > > >                desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
> > > > >             insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
> > > > >             *cnt = 1;
> > > > > +   } else if (desc->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb]) {
> > > > > +           bool is_rdonly = !may_access_direct_pkt_data(env, NULL, BPF_WRITE);
> > > >
> > > > Does it need to restore the env->seen_direct_write here also?
> > > >
> > > > It seems this 'seen_direct_write' saving/restoring is needed now because
> > > > 'may_access_direct_pkt_data(BPF_WRITE)' is not only called when it is
> > > > actually writing the packet. Some refactoring can help to avoid issue like
> > > > this.
> > > >
> > > > While at 'seen_direct_write', Alexei has also pointed out that the verifier
> > > > needs to track whether the (packet) 'slice' returned by bpf_dynptr_data()
> > > > has been written. It should be tracked in 'seen_direct_write'. Take a look
> > > > at how reg_is_pkt_pointer() and may_access_direct_pkt_data() are done in
> > > > check_mem_access(). iirc, this reg_is_pkt_pointer() part got loss somewhere
> > > > in v5 (or v4?) when bpf_dynptr_data() was changed to return register typed
> > > > PTR_TO_MEM instead of PTR_TO_PACKET.
> > >
> > > btw tc progs are using gen_prologue() approach because data/data_end are not kfuncs
> > > (nothing is being called by the bpf prog).
> > > In this case we don't need to repeat this approach. If so we don't need to
> > > set seen_direct_write.
> > > Instead bpf_dynptr_data() can call bpf_skb_pull_data() directly.
> > > And technically we don't need to limit it to skb head. It can handle any off/len.
> > > It will work for skb, but there is no equivalent for xdp_pull_data().
> > > I don't think we can implement xdp_pull_data in all drivers.
> > > That's massive amount of work, but we need to be consistent if we want
> > > dynptr to wrap both skb and xdp.
> > > We can say dynptr_data is for head only, but we've seen bugs where people
> > > had to switch from data/data_end to load_bytes.
> > >
> > > Also bpf_skb_pull_data is quite heavy. For progs that only want to parse
> > > the packet calling that in bpf_dynptr_data is a heavy hammer.
> > >
> > > It feels that we need to go back to skb_header_pointer-like discussion.
> > > Something like:
> > > bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len, void *buffer)
> > > Whether buffer is a part of dynptr or program provided is tbd.
> >
> > making it hidden within dynptr would make this approach unreliable
> > (memory allocations, which can fail, etc). But if we ask users to pass
> > it directly, then it should be relatively easy to use in practice with
> > some pre-allocated per-CPU buffer:
> >
> >
> > struct {
> >   __int(type, BPF_MAP_TYPE_PERCPU_ARRAY);
> >   __int(max_entries, 1);
> >   __type(key, int);
> >   __type(value, char[4096]);
> > } scratch SEC(".maps");
> >
> >
> > ...
> >
> >
> > struct dyn_ptr *dp = bpf_dynptr_from_skb(...).
> > void *p, *buf;
> > int zero = 0;
> >
> > buf = bpf_map_lookup_elem(&scratch, &zero);
> > if (!buf) return 0; /* can't happen */
> >
> > p = bpf_dynptr_slice(dp, off, 16, buf);
> > if (p == NULL) {
> >    /* out of range */
> > } else {
> >    /* work with p directly */
> > }
> >
> > /* if we wrote something to p and it was copied to buffer, write it back */
> > if (p == buf) {
> >     bpf_dynptr_write(dp, buf, 16);
> > }
> >
> >
> > We'll just need to teach verifier to make sure that buf is at least 16
> > byte long.
>
> I'm confused what the benefit of passing in the buffer is. If it's to
> avoid the uncloning, this will still need to happen if the user writes
> back the data to the skb (which will be the majority of cases). If
> it's to avoid uncloning if the user only reads the data of a writable
> prog, then we could add logic in the verifier so that we don't pull
> the data in this case; the uncloning might still happen regardless if
> another part of the program does a direct write. If the benefit is to
> avoid needing to pull the data, then can't the user just use
> bpf_dynptr_read, which takes in a buffer?

There is no unclone and there is no pull in xdp.
The main idea of this semantics of bpf_dynptr_slice is to make it
work the same way on skb and xdp for _read_ case.
Writes are going to be different between skb and xdp anyway.
In some rare cases the writes can be the same for skb and xdp
with this bpf_dynptr_slice + bpf_dynptr_write logic,
but that's a minor feature addition of the api.

I'd say in skb cases the progs do reads and either drop
or forward the skb.
Writes to skb are done from time to time too, because
they're a pain to do correctly.
nat is the main use case for skb rewrites.
In xdp cases the progs do parse, drop, rewrite, xmit more or less equally.
Joanne Koong Jan. 31, 2023, 8:47 p.m. UTC | #21
On Tue, Jan 31, 2023 at 11:59 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Jan 31, 2023 at 10:30 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Mon, Jan 30, 2023 at 5:04 PM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Mon, Jan 30, 2023 at 2:31 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Mon, Jan 30, 2023 at 02:04:08PM -0800, Martin KaFai Lau wrote:
> > > > > On 1/27/23 11:17 AM, Joanne Koong wrote:
> > > > > > @@ -8243,6 +8316,28 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> > > > > >             mark_reg_known_zero(env, regs, BPF_REG_0);
> > > > > >             regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> > > > > >             regs[BPF_REG_0].mem_size = meta.mem_size;
> > > > > > +           if (func_id == BPF_FUNC_dynptr_data &&
> > > > > > +               dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> > > > > > +                   bool seen_direct_write = env->seen_direct_write;
> > > > > > +
> > > > > > +                   regs[BPF_REG_0].type |= DYNPTR_TYPE_SKB;
> > > > > > +                   if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
> > > > > > +                           regs[BPF_REG_0].type |= MEM_RDONLY;
> > > > > > +                   else
> > > > > > +                           /*
> > > > > > +                            * Calling may_access_direct_pkt_data() will set
> > > > > > +                            * env->seen_direct_write to true if the skb is
> > > > > > +                            * writable. As an optimization, we can ignore
> > > > > > +                            * setting env->seen_direct_write.
> > > > > > +                            *
> > > > > > +                            * env->seen_direct_write is used by skb
> > > > > > +                            * programs to determine whether the skb's page
> > > > > > +                            * buffers should be cloned. Since data slice
> > > > > > +                            * writes would only be to the head, we can skip
> > > > > > +                            * this.
> > > > > > +                            */
> > > > > > +                           env->seen_direct_write = seen_direct_write;
> > > > > > +           }
> > > > >
> > > > > [ ... ]
> > > > >
> > > > > > @@ -9263,17 +9361,26 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
> > > > > >                             return ret;
> > > > > >                     break;
> > > > > >             case KF_ARG_PTR_TO_DYNPTR:
> > > > > > +           {
> > > > > > +                   enum bpf_arg_type dynptr_arg_type = ARG_PTR_TO_DYNPTR;
> > > > > > +
> > > > > >                     if (reg->type != PTR_TO_STACK &&
> > > > > >                         reg->type != CONST_PTR_TO_DYNPTR) {
> > > > > >                             verbose(env, "arg#%d expected pointer to stack or dynptr_ptr\n", i);
> > > > > >                             return -EINVAL;
> > > > > >                     }
> > > > > > -                   ret = process_dynptr_func(env, regno, insn_idx,
> > > > > > -                                             ARG_PTR_TO_DYNPTR | MEM_RDONLY);
> > > > > > +                   if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb])
> > > > > > +                           dynptr_arg_type |= MEM_UNINIT | DYNPTR_TYPE_SKB;
> > > > > > +                   else
> > > > > > +                           dynptr_arg_type |= MEM_RDONLY;
> > > > > > +
> > > > > > +                   ret = process_dynptr_func(env, regno, insn_idx, dynptr_arg_type,
> > > > > > +                                             meta->func_id);
> > > > > >                     if (ret < 0)
> > > > > >                             return ret;
> > > > > >                     break;
> > > > > > +           }
> > > > > >             case KF_ARG_PTR_TO_LIST_HEAD:
> > > > > >                     if (reg->type != PTR_TO_MAP_VALUE &&
> > > > > >                         reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
> > > > > > @@ -15857,6 +15964,14 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> > > > > >                desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
> > > > > >             insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
> > > > > >             *cnt = 1;
> > > > > > +   } else if (desc->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb]) {
> > > > > > +           bool is_rdonly = !may_access_direct_pkt_data(env, NULL, BPF_WRITE);
> > > > >
> > > > > Does it need to restore the env->seen_direct_write here also?
> > > > >
> > > > > It seems this 'seen_direct_write' saving/restoring is needed now because
> > > > > 'may_access_direct_pkt_data(BPF_WRITE)' is not only called when it is
> > > > > actually writing the packet. Some refactoring can help to avoid issue like
> > > > > this.
> > > > >
> > > > > While at 'seen_direct_write', Alexei has also pointed out that the verifier
> > > > > needs to track whether the (packet) 'slice' returned by bpf_dynptr_data()
> > > > > has been written. It should be tracked in 'seen_direct_write'. Take a look
> > > > > at how reg_is_pkt_pointer() and may_access_direct_pkt_data() are done in
> > > > > check_mem_access(). iirc, this reg_is_pkt_pointer() part got loss somewhere
> > > > > in v5 (or v4?) when bpf_dynptr_data() was changed to return register typed
> > > > > PTR_TO_MEM instead of PTR_TO_PACKET.
> > > >
> > > > btw tc progs are using gen_prologue() approach because data/data_end are not kfuncs
> > > > (nothing is being called by the bpf prog).
> > > > In this case we don't need to repeat this approach. If so we don't need to
> > > > set seen_direct_write.
> > > > Instead bpf_dynptr_data() can call bpf_skb_pull_data() directly.
> > > > And technically we don't need to limit it to skb head. It can handle any off/len.
> > > > It will work for skb, but there is no equivalent for xdp_pull_data().
> > > > I don't think we can implement xdp_pull_data in all drivers.
> > > > That's massive amount of work, but we need to be consistent if we want
> > > > dynptr to wrap both skb and xdp.
> > > > We can say dynptr_data is for head only, but we've seen bugs where people
> > > > had to switch from data/data_end to load_bytes.
> > > >
> > > > Also bpf_skb_pull_data is quite heavy. For progs that only want to parse
> > > > the packet calling that in bpf_dynptr_data is a heavy hammer.
> > > >
> > > > It feels that we need to go back to skb_header_pointer-like discussion.
> > > > Something like:
> > > > bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len, void *buffer)
> > > > Whether buffer is a part of dynptr or program provided is tbd.
> > >
> > > making it hidden within dynptr would make this approach unreliable
> > > (memory allocations, which can fail, etc). But if we ask users to pass
> > > it directly, then it should be relatively easy to use in practice with
> > > some pre-allocated per-CPU buffer:
> > >
> > >
> > > struct {
> > >   __int(type, BPF_MAP_TYPE_PERCPU_ARRAY);
> > >   __int(max_entries, 1);
> > >   __type(key, int);
> > >   __type(value, char[4096]);
> > > } scratch SEC(".maps");
> > >
> > >
> > > ...
> > >
> > >
> > > struct dyn_ptr *dp = bpf_dynptr_from_skb(...).
> > > void *p, *buf;
> > > int zero = 0;
> > >
> > > buf = bpf_map_lookup_elem(&scratch, &zero);
> > > if (!buf) return 0; /* can't happen */
> > >
> > > p = bpf_dynptr_slice(dp, off, 16, buf);
> > > if (p == NULL) {
> > >    /* out of range */
> > > } else {
> > >    /* work with p directly */
> > > }
> > >
> > > /* if we wrote something to p and it was copied to buffer, write it back */
> > > if (p == buf) {
> > >     bpf_dynptr_write(dp, buf, 16);
> > > }
> > >
> > >
> > > We'll just need to teach verifier to make sure that buf is at least 16
> > > byte long.
> >
> > I'm confused what the benefit of passing in the buffer is. If it's to
> > avoid the uncloning, this will still need to happen if the user writes
> > back the data to the skb (which will be the majority of cases). If
> > it's to avoid uncloning if the user only reads the data of a writable
> > prog, then we could add logic in the verifier so that we don't pull
> > the data in this case; the uncloning might still happen regardless if
> > another part of the program does a direct write. If the benefit is to
> > avoid needing to pull the data, then can't the user just use
> > bpf_dynptr_read, which takes in a buffer?
>
> There is no unclone and there is no pull in xdp.
> The main idea of this semantics of bpf_dynptr_slice is to make it
> work the same way on skb and xdp for _read_ case.
> Writes are going to be different between skb and xdp anyway.
> In some rare cases the writes can be the same for skb and xdp
> with this bpf_dynptr_slice + bpf_dynptr_write logic,
> but that's a minor feature addition of the api.

bpf_dynptr_read works the same way on skb and xdp. bpf_dynptr_read
takes in a buffer as well, so what is the added benefit of
bpf_dynptr_slice?

>
> I'd say in skb cases the progs do reads and either drop
> or forward the skb.
> Writes to skb are done from time to time too, because
> they're a pain to do correctly.
> nat is the main use case for skb rewrites.
> In xdp cases the progs do parse, drop, rewrite, xmit more or less equally.
Alexei Starovoitov Jan. 31, 2023, 9:10 p.m. UTC | #22
On Tue, Jan 31, 2023 at 12:48 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > >
> > > > p = bpf_dynptr_slice(dp, off, 16, buf);
> > > > if (p == NULL) {
> > > >    /* out of range */
> > > > } else {
> > > >    /* work with p directly */
> > > > }
> > > >
> > > > /* if we wrote something to p and it was copied to buffer, write it back */
> > > > if (p == buf) {
> > > >     bpf_dynptr_write(dp, buf, 16);
> > > > }
> > > >
> > > >
> > > > We'll just need to teach verifier to make sure that buf is at least 16
> > > > byte long.
> > >
> > > I'm confused what the benefit of passing in the buffer is. If it's to
> > > avoid the uncloning, this will still need to happen if the user writes
> > > back the data to the skb (which will be the majority of cases). If
> > > it's to avoid uncloning if the user only reads the data of a writable
> > > prog, then we could add logic in the verifier so that we don't pull
> > > the data in this case; the uncloning might still happen regardless if
> > > another part of the program does a direct write. If the benefit is to
> > > avoid needing to pull the data, then can't the user just use
> > > bpf_dynptr_read, which takes in a buffer?
> >
> > There is no unclone and there is no pull in xdp.
> > The main idea of this semantics of bpf_dynptr_slice is to make it
> > work the same way on skb and xdp for _read_ case.
> > Writes are going to be different between skb and xdp anyway.
> > In some rare cases the writes can be the same for skb and xdp
> > with this bpf_dynptr_slice + bpf_dynptr_write logic,
> > but that's a minor feature addition of the api.
>
> bpf_dynptr_read works the same way on skb and xdp. bpf_dynptr_read
> takes in a buffer as well, so what is the added benefit of
> bpf_dynptr_slice?

That it doesn't copy most of the time.
Joanne Koong Jan. 31, 2023, 9:29 p.m. UTC | #23
On Tue, Jan 31, 2023 at 11:50 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Jan 31, 2023 at 9:55 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Mon, Jan 30, 2023 at 9:36 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Mon, Jan 30, 2023 at 04:44:12PM -0800, Joanne Koong wrote:
> > > > On Sun, Jan 29, 2023 at 3:39 PM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Fri, Jan 27, 2023 at 11:17:01AM -0800, Joanne Koong wrote:
> > > > > > Add skb dynptrs, which are dynptrs whose underlying pointer points
> > > > > > to a skb. The dynptr acts on skb data. skb dynptrs have two main
> > > > > > benefits. One is that they allow operations on sizes that are not
> > > > > > statically known at compile-time (eg variable-sized accesses).
> > > > > > Another is that parsing the packet data through dynptrs (instead of
> > > > > > through direct access of skb->data and skb->data_end) can be more
> > > > > > ergonomic and less brittle (eg does not need manual if checking for
> > > > > > being within bounds of data_end).
> > > > > >
> > > > > > For bpf prog types that don't support writes on skb data, the dynptr is
> > > > > > read-only (bpf_dynptr_write() will return an error and bpf_dynptr_data()
> > > > > > will return a data slice that is read-only where any writes to it will
> > > > > > be rejected by the verifier).
> > > > > >
> > > > > > For reads and writes through the bpf_dynptr_read() and bpf_dynptr_write()
> > > > > > interfaces, reading and writing from/to data in the head as well as from/to
> > > > > > non-linear paged buffers is supported. For data slices (through the
> > > > > > bpf_dynptr_data() interface), if the data is in a paged buffer, the user
> > > > > > must first call bpf_skb_pull_data() to pull the data into the linear
> > > > > > portion.
> > > > >
> > > > > Looks like there is an assumption in parts of this patch that
> > > > > linear part of skb is always writeable. That's not the case.
> > > > > See if (ops->gen_prologue || env->seen_direct_write) in convert_ctx_accesses().
> > > > > For TC progs it calls bpf_unclone_prologue() which adds hidden
> > > > > bpf_skb_pull_data() in the beginning of the prog to make it writeable.
> > > >
> > > > I think we can make this assumption? For writable progs (referenced in
> > > > the may_access_direct_pkt_data() function), all of them have a
> > > > gen_prologue that unclones the buffer (eg tc_cls_act, lwt_xmit, sk_skb
> > > > progs) or their linear portion is okay to write into by default (eg
> > > > xdp, sk_msg, cg_sockopt progs).
> > >
> > > but the patch was preserving seen_direct_write in some cases.
> > > I'm still confused.
> >
> > seen_direct_write is used to determine whether to actually unclone or
> > not in the program's prologue function (eg tc_cls_act_prologue() ->
> > bpf_unclone_prologue() where in bpf_unclone_prologue(), if
> > direct_write was not true, then it can skip doing the actual
> > uncloning).
> >
> > I think the part of the patch you're talking about regarding
> > seen_direct_write is this in check_helper_call():
> >
> > + if (func_id == BPF_FUNC_dynptr_data &&
> > +    dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> > +   bool seen_direct_write = env->seen_direct_write;
> > +
> > +   regs[BPF_REG_0].type |= DYNPTR_TYPE_SKB;
> > +   if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
> > +     regs[BPF_REG_0].type |= MEM_RDONLY;
> > +   else
> > +     /*
> > +     * Calling may_access_direct_pkt_data() will set
> > +     * env->seen_direct_write to true if the skb is
> > +     * writable. As an optimization, we can ignore
> > +     * setting env->seen_direct_write.
> > +     *
> > +     * env->seen_direct_write is used by skb
> > +     * programs to determine whether the skb's page
> > +     * buffers should be cloned. Since data slice
> > +     * writes would only be to the head, we can skip
> > +     * this.
> > +     */
> > +     env->seen_direct_write = seen_direct_write;
> > + }
> >
> > If the data slice for a skb dynptr is writable, then seen_direct_write
> > gets set to true (done internally in may_access_direct_pkt_data()) so
> > that the skb is actually uncloned, whereas if it's read-only, then
> > env->seen_direct_write gets reset to its original value (since the
> > may_access_direct_pkt_data() call will have set env->seen_direct_write
> > to true)
>
> I'm still confused.
> When may_access_direct_pkt_data() returns false
> it doesn't change seen_direct_write.
> When it returns true it also sets seen_direct_write=true.
> But the code above restores it to whatever value it had before.
> How is this correct?
> Are you saying that another may_access_direct_pkt_data() gets
> called somewhere in the verifier that sets seen_direct_write=true?
> But what's the harm in doing it twice or N times in all cases?

I'm confused now too. I added this in v7, judging from the comment
block, I think I added this because I thought uncloning an skb only
needs to happen if the skb's page buffers get written to (aka only if
the skb needs to be pulled), not if it's linear portion gets written
to. This is incorrect - writing to the linear part also needs to
unclone the skb. I will fix this section when I resubmit
Joanne Koong Jan. 31, 2023, 9:33 p.m. UTC | #24
On Tue, Jan 31, 2023 at 1:11 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Jan 31, 2023 at 12:48 PM Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > >
> > > > > p = bpf_dynptr_slice(dp, off, 16, buf);
> > > > > if (p == NULL) {
> > > > >    /* out of range */
> > > > > } else {
> > > > >    /* work with p directly */
> > > > > }
> > > > >
> > > > > /* if we wrote something to p and it was copied to buffer, write it back */
> > > > > if (p == buf) {
> > > > >     bpf_dynptr_write(dp, buf, 16);
> > > > > }
> > > > >
> > > > >
> > > > > We'll just need to teach verifier to make sure that buf is at least 16
> > > > > byte long.
> > > >
> > > > I'm confused what the benefit of passing in the buffer is. If it's to
> > > > avoid the uncloning, this will still need to happen if the user writes
> > > > back the data to the skb (which will be the majority of cases). If
> > > > it's to avoid uncloning if the user only reads the data of a writable
> > > > prog, then we could add logic in the verifier so that we don't pull
> > > > the data in this case; the uncloning might still happen regardless if
> > > > another part of the program does a direct write. If the benefit is to
> > > > avoid needing to pull the data, then can't the user just use
> > > > bpf_dynptr_read, which takes in a buffer?
> > >
> > > There is no unclone and there is no pull in xdp.
> > > The main idea of this semantics of bpf_dynptr_slice is to make it
> > > work the same way on skb and xdp for _read_ case.
> > > Writes are going to be different between skb and xdp anyway.
> > > In some rare cases the writes can be the same for skb and xdp
> > > with this bpf_dynptr_slice + bpf_dynptr_write logic,
> > > but that's a minor feature addition of the api.
> >
> > bpf_dynptr_read works the same way on skb and xdp. bpf_dynptr_read
> > takes in a buffer as well, so what is the added benefit of
> > bpf_dynptr_slice?
>
> That it doesn't copy most of the time.

Ohh I see, I missed that bpf_dynptr_slice also returns back a ptr.
This makes sense to me now, thanks for clarifying.
Martin KaFai Lau Jan. 31, 2023, 10:07 p.m. UTC | #25
On 1/30/23 9:30 PM, Alexei Starovoitov wrote:
>>>>> Also bpf_skb_pull_data is quite heavy. For progs that only want to parse
>>>>> the packet calling that in bpf_dynptr_data is a heavy hammer.
>>>>>
>>>>> It feels that we need to go back to skb_header_pointer-like discussion.
>>>>> Something like:
>>>>> bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len, void *buffer)
>>>>> Whether buffer is a part of dynptr or program provided is tbd.
>>>>
>>>> making it hidden within dynptr would make this approach unreliable
>>>> (memory allocations, which can fail, etc). But if we ask users to pass
>>>> it directly, then it should be relatively easy to use in practice with
>>>> some pre-allocated per-CPU buffer:
> 
> bpf_skb_pull_data() is even more unreliable, since it's a bigger allocation.
> I like preallocated approach more, so we're in agreement here.
> 
>>>>
>>>>
>>>> struct {
>>>>     __int(type, BPF_MAP_TYPE_PERCPU_ARRAY);
>>>>     __int(max_entries, 1);
>>>>     __type(key, int);
>>>>     __type(value, char[4096]);
>>>> } scratch SEC(".maps");
>>>>
>>>>
>>>> ...
>>>>
>>>>
>>>> struct dyn_ptr *dp = bpf_dynptr_from_skb(...).
>>>> void *p, *buf;
>>>> int zero = 0;
>>>>
>>>> buf = bpf_map_lookup_elem(&scratch, &zero);
>>>> if (!buf) return 0; /* can't happen */
>>>>
>>>> p = bpf_dynptr_slice(dp, off, 16, buf);
>>>> if (p == NULL) {
>>>>      /* out of range */
>>>> } else {
>>>>      /* work with p directly */
>>>> }
>>>>
>>>> /* if we wrote something to p and it was copied to buffer, write it back */
>>>> if (p == buf) {
>>>>       bpf_dynptr_write(dp, buf, 16);
>>>> }
>>>>
>>>>
>>>> We'll just need to teach verifier to make sure that buf is at least 16
>>>> byte long.
>>>
>>> A fifth __sz arg may do:
>>> bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len, void
>>> *buffer, u32 buffer__sz);
>>
>> We'll need to make sure that buffer__sz is >= len (or preferably not
>> require extra size at all). We can check that at runtime, of course,
>> but rejecting too small buffer at verification time would be a better
>> experience.
> 
> I don't follow. Why two equivalent 'len' args ?
> Just to allow 'len' to be a variable instead of constant ?
> It's unusual for the verifier to have 'len' before 'buffer',
> but this is fixable.

Agree. One const scalar 'len' should be enough. Buffer should have the same size 
as the requesting slice.

> 
> How about adding 'rd_only vs rdwr' flag ?
> Then MEM_RDONLY for ret value of bpf_dynptr_slice can be set by the verifier
> and in run-time bpf_dynptr_slice() wouldn't need to check for skb->cloned.
> if (rd_only) return skb_header_pointer()
> if (rdwr) bpf_try_make_writable(); return skb->data + off;
> and final bpf_dynptr_write() is not needed.
> 
> But that doesn't work for xdp, since there is no pull.
> 
> It's not clear how to deal with BPF_F_RECOMPUTE_CSUM though.
> Expose __skb_postpull_rcsum/__skb_postpush_rcsum as kfuncs?
> But that defeats Andrii's goal to use dynptr as a generic wrapper.
> skb is quite special.
> 
> Maybe something like:
> void *bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len,
>                         void *buffer, u32 buffer__sz)
> {
>    if (skb_cloned()) {
>      skb_copy_bits(skb, offset, buffer, len);
>      return buffer;
>    }
>    return skb_header_pointer(...);
> }
> 
> When prog is just parsing the packet it doesn't need to finalize with bpf_dynptr_write.
> The prog can always write into the pointer followed by if (p == buf) bpf_dynptr_write.
> No need for rdonly flag, but extra copy is there in case of cloned which
> could have been avoided with extra rd_only flag.
> 
> In case of xdp it will be:
> void *bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len,
>                         void *buffer, u32 buffer__sz)
> {
>     ptr = bpf_xdp_pointer(xdp, offset, len);
>     if (ptr)
>        return ptr;
>     bpf_xdp_copy_buf(xdp, offset, buffer, len, false); /* copy into buf */
>     return buffer;
> }
> 
> bpf_dynptr_write will use bpf_xdp_copy_buf(,true); /* copy into xdp */

My preference would be making bpf_dynptr_slice() work similarly for skb and xdp, 
so above bpf_dynptr_slice() skb and xdp logic looks good.

Regarding, MEM_RDONLY, it probably is not relevant to xdp. For skb, not sure how 
often is the 'skb_cloned() && !skb_clone_writable()'. May be it can be left for 
later optimization?

Regarding BPF_F_RECOMPUTE_CSUM, I wonder if bpf_csum_diff() is enough to come up 
the csum. Then the missing kfunc is to update the skb->csum. Not sure how the 
csum logic will look like in xdp, probably getting csum from the xdp-hint, 
calculate csum_diff and then set it to the to-be-created skb. All this is likely 
a kfunc also, eg. a kfunc to directly allocate skb during the XDP_PASS case. The 
bpf prog will have to be written differently if it needs to deal with the csum 
but the header parsing part could at least be shared.
Joanne Koong Jan. 31, 2023, 11:17 p.m. UTC | #26
oOn Mon, Jan 30, 2023 at 9:30 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Jan 30, 2023 at 08:43:47PM -0800, Andrii Nakryiko wrote:
> > On Mon, Jan 30, 2023 at 5:49 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> > >
> > > On 1/30/23 5:04 PM, Andrii Nakryiko wrote:
> > > > On Mon, Jan 30, 2023 at 2:31 PM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > >>
> > > >> On Mon, Jan 30, 2023 at 02:04:08PM -0800, Martin KaFai Lau wrote:
> > > >>> On 1/27/23 11:17 AM, Joanne Koong wrote:
> > > >>>> @@ -8243,6 +8316,28 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> > > >>>>              mark_reg_known_zero(env, regs, BPF_REG_0);
> > > >>>>              regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> > > >>>>              regs[BPF_REG_0].mem_size = meta.mem_size;
> > > >>>> +           if (func_id == BPF_FUNC_dynptr_data &&
> > > >>>> +               dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> > > >>>> +                   bool seen_direct_write = env->seen_direct_write;
> > > >>>> +
> > > >>>> +                   regs[BPF_REG_0].type |= DYNPTR_TYPE_SKB;
> > > >>>> +                   if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
> > > >>>> +                           regs[BPF_REG_0].type |= MEM_RDONLY;
> > > >>>> +                   else
> > > >>>> +                           /*
> > > >>>> +                            * Calling may_access_direct_pkt_data() will set
> > > >>>> +                            * env->seen_direct_write to true if the skb is
> > > >>>> +                            * writable. As an optimization, we can ignore
> > > >>>> +                            * setting env->seen_direct_write.
> > > >>>> +                            *
> > > >>>> +                            * env->seen_direct_write is used by skb
> > > >>>> +                            * programs to determine whether the skb's page
> > > >>>> +                            * buffers should be cloned. Since data slice
> > > >>>> +                            * writes would only be to the head, we can skip
> > > >>>> +                            * this.
> > > >>>> +                            */
> > > >>>> +                           env->seen_direct_write = seen_direct_write;
> > > >>>> +           }
> > > >>>
> > > >>> [ ... ]
> > > >>>
> > > >>>> @@ -9263,17 +9361,26 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
> > > >>>>                              return ret;
> > > >>>>                      break;
> > > >>>>              case KF_ARG_PTR_TO_DYNPTR:
> > > >>>> +           {
> > > >>>> +                   enum bpf_arg_type dynptr_arg_type = ARG_PTR_TO_DYNPTR;
> > > >>>> +
> > > >>>>                      if (reg->type != PTR_TO_STACK &&
> > > >>>>                          reg->type != CONST_PTR_TO_DYNPTR) {
> > > >>>>                              verbose(env, "arg#%d expected pointer to stack or dynptr_ptr\n", i);
> > > >>>>                              return -EINVAL;
> > > >>>>                      }
> > > >>>> -                   ret = process_dynptr_func(env, regno, insn_idx,
> > > >>>> -                                             ARG_PTR_TO_DYNPTR | MEM_RDONLY);
> > > >>>> +                   if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb])
> > > >>>> +                           dynptr_arg_type |= MEM_UNINIT | DYNPTR_TYPE_SKB;
> > > >>>> +                   else
> > > >>>> +                           dynptr_arg_type |= MEM_RDONLY;
> > > >>>> +
> > > >>>> +                   ret = process_dynptr_func(env, regno, insn_idx, dynptr_arg_type,
> > > >>>> +                                             meta->func_id);
> > > >>>>                      if (ret < 0)
> > > >>>>                              return ret;
> > > >>>>                      break;
> > > >>>> +           }
> > > >>>>              case KF_ARG_PTR_TO_LIST_HEAD:
> > > >>>>                      if (reg->type != PTR_TO_MAP_VALUE &&
> > > >>>>                          reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
> > > >>>> @@ -15857,6 +15964,14 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> > > >>>>                 desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
> > > >>>>              insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
> > > >>>>              *cnt = 1;
> > > >>>> +   } else if (desc->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb]) {
> > > >>>> +           bool is_rdonly = !may_access_direct_pkt_data(env, NULL, BPF_WRITE);
> > > >>>
> > > >>> Does it need to restore the env->seen_direct_write here also?
> > > >>>
> > > >>> It seems this 'seen_direct_write' saving/restoring is needed now because
> > > >>> 'may_access_direct_pkt_data(BPF_WRITE)' is not only called when it is
> > > >>> actually writing the packet. Some refactoring can help to avoid issue like
> > > >>> this.
> > > >>>
> > > >>> While at 'seen_direct_write', Alexei has also pointed out that the verifier
> > > >>> needs to track whether the (packet) 'slice' returned by bpf_dynptr_data()
> > > >>> has been written. It should be tracked in 'seen_direct_write'. Take a look
> > > >>> at how reg_is_pkt_pointer() and may_access_direct_pkt_data() are done in
> > > >>> check_mem_access(). iirc, this reg_is_pkt_pointer() part got loss somewhere
> > > >>> in v5 (or v4?) when bpf_dynptr_data() was changed to return register typed
> > > >>> PTR_TO_MEM instead of PTR_TO_PACKET.
> > > >>
> > > >> btw tc progs are using gen_prologue() approach because data/data_end are not kfuncs
> > > >> (nothing is being called by the bpf prog).
> > > >> In this case we don't need to repeat this approach. If so we don't need to
> > > >> set seen_direct_write.
> > > >> Instead bpf_dynptr_data() can call bpf_skb_pull_data() directly.
> > > >> And technically we don't need to limit it to skb head. It can handle any off/len.
> > > >> It will work for skb, but there is no equivalent for xdp_pull_data().
> > > >> I don't think we can implement xdp_pull_data in all drivers.
> > > >> That's massive amount of work, but we need to be consistent if we want
> > > >> dynptr to wrap both skb and xdp.
> > > >> We can say dynptr_data is for head only, but we've seen bugs where people
> > > >> had to switch from data/data_end to load_bytes.
> > > >>
> > > >> Also bpf_skb_pull_data is quite heavy. For progs that only want to parse
> > > >> the packet calling that in bpf_dynptr_data is a heavy hammer.
> > > >>
> > > >> It feels that we need to go back to skb_header_pointer-like discussion.
> > > >> Something like:
> > > >> bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len, void *buffer)
> > > >> Whether buffer is a part of dynptr or program provided is tbd.
> > > >
> > > > making it hidden within dynptr would make this approach unreliable
> > > > (memory allocations, which can fail, etc). But if we ask users to pass
> > > > it directly, then it should be relatively easy to use in practice with
> > > > some pre-allocated per-CPU buffer:
>
> bpf_skb_pull_data() is even more unreliable, since it's a bigger allocation.
> I like preallocated approach more, so we're in agreement here.
>
> > > >
> > > >
> > > > struct {
> > > >    __int(type, BPF_MAP_TYPE_PERCPU_ARRAY);
> > > >    __int(max_entries, 1);
> > > >    __type(key, int);
> > > >    __type(value, char[4096]);
> > > > } scratch SEC(".maps");
> > > >
> > > >
> > > > ...
> > > >
> > > >
> > > > struct dyn_ptr *dp = bpf_dynptr_from_skb(...).
> > > > void *p, *buf;
> > > > int zero = 0;
> > > >
> > > > buf = bpf_map_lookup_elem(&scratch, &zero);
> > > > if (!buf) return 0; /* can't happen */
> > > >
> > > > p = bpf_dynptr_slice(dp, off, 16, buf);
> > > > if (p == NULL) {
> > > >     /* out of range */
> > > > } else {
> > > >     /* work with p directly */
> > > > }
> > > >
> > > > /* if we wrote something to p and it was copied to buffer, write it back */
> > > > if (p == buf) {
> > > >      bpf_dynptr_write(dp, buf, 16);
> > > > }
> > > >
> > > >
> > > > We'll just need to teach verifier to make sure that buf is at least 16
> > > > byte long.
> > >
> > > A fifth __sz arg may do:
> > > bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len, void
> > > *buffer, u32 buffer__sz);
> >
> > We'll need to make sure that buffer__sz is >= len (or preferably not
> > require extra size at all). We can check that at runtime, of course,
> > but rejecting too small buffer at verification time would be a better
> > experience.
>
> I don't follow. Why two equivalent 'len' args ?
> Just to allow 'len' to be a variable instead of constant ?
> It's unusual for the verifier to have 'len' before 'buffer',
> but this is fixable.
>
> How about adding 'rd_only vs rdwr' flag ?
> Then MEM_RDONLY for ret value of bpf_dynptr_slice can be set by the verifier
> and in run-time bpf_dynptr_slice() wouldn't need to check for skb->cloned.
> if (rd_only) return skb_header_pointer()
> if (rdwr) bpf_try_make_writable(); return skb->data + off;
> and final bpf_dynptr_write() is not needed.
>
> But that doesn't work for xdp, since there is no pull.
>
> It's not clear how to deal with BPF_F_RECOMPUTE_CSUM though.
> Expose __skb_postpull_rcsum/__skb_postpush_rcsum as kfuncs?
> But that defeats Andrii's goal to use dynptr as a generic wrapper.
> skb is quite special.

If it's the common case that skbs use the same flag across writes in
their bpf prog, then we can have bpf_dynptr_from_skb take in
BPF_F_RECOMPUTE_CSUM/BPF_F_INVALIDATE_HASH in its flags arg and then
always apply this when the skb does a write to packet data.

>
> Maybe something like:
> void *bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len,
>                        void *buffer, u32 buffer__sz)
> {
>   if (skb_cloned()) {
>     skb_copy_bits(skb, offset, buffer, len);
>     return buffer;
>   }
>   return skb_header_pointer(...);
> }
>
> When prog is just parsing the packet it doesn't need to finalize with bpf_dynptr_write.
> The prog can always write into the pointer followed by if (p == buf) bpf_dynptr_write.
> No need for rdonly flag, but extra copy is there in case of cloned which
> could have been avoided with extra rd_only flag.

We're able to track in the verifier whether the slice gets written to
or not, so if it does get written to in the skb case, can't we just
add in a call to bpf_try_make_writable() as a post-processing fixup
that gets called before bpf_dynptr_slice? Then bpf_dynptr_slice() can
just return a directly writable ptr and avoid the extra memcpy

>
> In case of xdp it will be:
> void *bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len,
>                        void *buffer, u32 buffer__sz)
> {
>    ptr = bpf_xdp_pointer(xdp, offset, len);
>    if (ptr)
>       return ptr;
>    bpf_xdp_copy_buf(xdp, offset, buffer, len, false); /* copy into buf */
>    return buffer;
> }
>
> bpf_dynptr_write will use bpf_xdp_copy_buf(,true); /* copy into xdp */
>
> > >
> > > The bpf prog usually has buffer in the stack for the common small header parsing.
> >
> > sure, that would work for small chunks
> >
> > >
> > > One side note is the bpf_dynptr_slice() still needs to check if the skb is
> > > cloned or not even the off/len is within the head range.
[...]
Andrii Nakryiko Feb. 1, 2023, 12:11 a.m. UTC | #27
On Mon, Jan 30, 2023 at 9:30 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Jan 30, 2023 at 08:43:47PM -0800, Andrii Nakryiko wrote:
> > On Mon, Jan 30, 2023 at 5:49 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> > >
> > > On 1/30/23 5:04 PM, Andrii Nakryiko wrote:
> > > > On Mon, Jan 30, 2023 at 2:31 PM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > >>
> > > >> On Mon, Jan 30, 2023 at 02:04:08PM -0800, Martin KaFai Lau wrote:
> > > >>> On 1/27/23 11:17 AM, Joanne Koong wrote:
> > > >>>> @@ -8243,6 +8316,28 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> > > >>>>              mark_reg_known_zero(env, regs, BPF_REG_0);
> > > >>>>              regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
> > > >>>>              regs[BPF_REG_0].mem_size = meta.mem_size;
> > > >>>> +           if (func_id == BPF_FUNC_dynptr_data &&
> > > >>>> +               dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> > > >>>> +                   bool seen_direct_write = env->seen_direct_write;
> > > >>>> +
> > > >>>> +                   regs[BPF_REG_0].type |= DYNPTR_TYPE_SKB;
> > > >>>> +                   if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
> > > >>>> +                           regs[BPF_REG_0].type |= MEM_RDONLY;
> > > >>>> +                   else
> > > >>>> +                           /*
> > > >>>> +                            * Calling may_access_direct_pkt_data() will set
> > > >>>> +                            * env->seen_direct_write to true if the skb is
> > > >>>> +                            * writable. As an optimization, we can ignore
> > > >>>> +                            * setting env->seen_direct_write.
> > > >>>> +                            *
> > > >>>> +                            * env->seen_direct_write is used by skb
> > > >>>> +                            * programs to determine whether the skb's page
> > > >>>> +                            * buffers should be cloned. Since data slice
> > > >>>> +                            * writes would only be to the head, we can skip
> > > >>>> +                            * this.
> > > >>>> +                            */
> > > >>>> +                           env->seen_direct_write = seen_direct_write;
> > > >>>> +           }
> > > >>>
> > > >>> [ ... ]
> > > >>>
> > > >>>> @@ -9263,17 +9361,26 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
> > > >>>>                              return ret;
> > > >>>>                      break;
> > > >>>>              case KF_ARG_PTR_TO_DYNPTR:
> > > >>>> +           {
> > > >>>> +                   enum bpf_arg_type dynptr_arg_type = ARG_PTR_TO_DYNPTR;
> > > >>>> +
> > > >>>>                      if (reg->type != PTR_TO_STACK &&
> > > >>>>                          reg->type != CONST_PTR_TO_DYNPTR) {
> > > >>>>                              verbose(env, "arg#%d expected pointer to stack or dynptr_ptr\n", i);
> > > >>>>                              return -EINVAL;
> > > >>>>                      }
> > > >>>> -                   ret = process_dynptr_func(env, regno, insn_idx,
> > > >>>> -                                             ARG_PTR_TO_DYNPTR | MEM_RDONLY);
> > > >>>> +                   if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb])
> > > >>>> +                           dynptr_arg_type |= MEM_UNINIT | DYNPTR_TYPE_SKB;
> > > >>>> +                   else
> > > >>>> +                           dynptr_arg_type |= MEM_RDONLY;
> > > >>>> +
> > > >>>> +                   ret = process_dynptr_func(env, regno, insn_idx, dynptr_arg_type,
> > > >>>> +                                             meta->func_id);
> > > >>>>                      if (ret < 0)
> > > >>>>                              return ret;
> > > >>>>                      break;
> > > >>>> +           }
> > > >>>>              case KF_ARG_PTR_TO_LIST_HEAD:
> > > >>>>                      if (reg->type != PTR_TO_MAP_VALUE &&
> > > >>>>                          reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
> > > >>>> @@ -15857,6 +15964,14 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
> > > >>>>                 desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
> > > >>>>              insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
> > > >>>>              *cnt = 1;
> > > >>>> +   } else if (desc->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb]) {
> > > >>>> +           bool is_rdonly = !may_access_direct_pkt_data(env, NULL, BPF_WRITE);
> > > >>>
> > > >>> Does it need to restore the env->seen_direct_write here also?
> > > >>>
> > > >>> It seems this 'seen_direct_write' saving/restoring is needed now because
> > > >>> 'may_access_direct_pkt_data(BPF_WRITE)' is not only called when it is
> > > >>> actually writing the packet. Some refactoring can help to avoid issue like
> > > >>> this.
> > > >>>
> > > >>> While at 'seen_direct_write', Alexei has also pointed out that the verifier
> > > >>> needs to track whether the (packet) 'slice' returned by bpf_dynptr_data()
> > > >>> has been written. It should be tracked in 'seen_direct_write'. Take a look
> > > >>> at how reg_is_pkt_pointer() and may_access_direct_pkt_data() are done in
> > > >>> check_mem_access(). iirc, this reg_is_pkt_pointer() part got loss somewhere
> > > >>> in v5 (or v4?) when bpf_dynptr_data() was changed to return register typed
> > > >>> PTR_TO_MEM instead of PTR_TO_PACKET.
> > > >>
> > > >> btw tc progs are using gen_prologue() approach because data/data_end are not kfuncs
> > > >> (nothing is being called by the bpf prog).
> > > >> In this case we don't need to repeat this approach. If so we don't need to
> > > >> set seen_direct_write.
> > > >> Instead bpf_dynptr_data() can call bpf_skb_pull_data() directly.
> > > >> And technically we don't need to limit it to skb head. It can handle any off/len.
> > > >> It will work for skb, but there is no equivalent for xdp_pull_data().
> > > >> I don't think we can implement xdp_pull_data in all drivers.
> > > >> That's massive amount of work, but we need to be consistent if we want
> > > >> dynptr to wrap both skb and xdp.
> > > >> We can say dynptr_data is for head only, but we've seen bugs where people
> > > >> had to switch from data/data_end to load_bytes.
> > > >>
> > > >> Also bpf_skb_pull_data is quite heavy. For progs that only want to parse
> > > >> the packet calling that in bpf_dynptr_data is a heavy hammer.
> > > >>
> > > >> It feels that we need to go back to skb_header_pointer-like discussion.
> > > >> Something like:
> > > >> bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len, void *buffer)
> > > >> Whether buffer is a part of dynptr or program provided is tbd.
> > > >
> > > > making it hidden within dynptr would make this approach unreliable
> > > > (memory allocations, which can fail, etc). But if we ask users to pass
> > > > it directly, then it should be relatively easy to use in practice with
> > > > some pre-allocated per-CPU buffer:
>
> bpf_skb_pull_data() is even more unreliable, since it's a bigger allocation.
> I like preallocated approach more, so we're in agreement here.
>
> > > >
> > > >
> > > > struct {
> > > >    __int(type, BPF_MAP_TYPE_PERCPU_ARRAY);
> > > >    __int(max_entries, 1);
> > > >    __type(key, int);
> > > >    __type(value, char[4096]);
> > > > } scratch SEC(".maps");
> > > >
> > > >
> > > > ...
> > > >
> > > >
> > > > struct dyn_ptr *dp = bpf_dynptr_from_skb(...).
> > > > void *p, *buf;
> > > > int zero = 0;
> > > >
> > > > buf = bpf_map_lookup_elem(&scratch, &zero);
> > > > if (!buf) return 0; /* can't happen */
> > > >
> > > > p = bpf_dynptr_slice(dp, off, 16, buf);
> > > > if (p == NULL) {
> > > >     /* out of range */
> > > > } else {
> > > >     /* work with p directly */
> > > > }
> > > >
> > > > /* if we wrote something to p and it was copied to buffer, write it back */
> > > > if (p == buf) {
> > > >      bpf_dynptr_write(dp, buf, 16);
> > > > }
> > > >
> > > >
> > > > We'll just need to teach verifier to make sure that buf is at least 16
> > > > byte long.
> > >
> > > A fifth __sz arg may do:
> > > bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len, void
> > > *buffer, u32 buffer__sz);
> >
> > We'll need to make sure that buffer__sz is >= len (or preferably not
> > require extra size at all). We can check that at runtime, of course,
> > but rejecting too small buffer at verification time would be a better
> > experience.
>
> I don't follow. Why two equivalent 'len' args ?
> Just to allow 'len' to be a variable instead of constant ?
> It's unusual for the verifier to have 'len' before 'buffer',
> but this is fixable.

Right, I don't like two lens as well. And no, len can't be variable,
it has to be a constant known at verification time. We could define
bpf_dynptr_slice as

void *bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, void
*buffer, u32 buffer__sz) and it would follow current conventions,
though feels a bit weird.

But either way we'd have to teach verifier to take buffer__sz and mark
it as the size of PTR_TO_MEM returned from bpf_dynptr_slice.

All this is doable.

>
> How about adding 'rd_only vs rdwr' flag ?
> Then MEM_RDONLY for ret value of bpf_dynptr_slice can be set by the verifier
> and in run-time bpf_dynptr_slice() wouldn't need to check for skb->cloned.
> if (rd_only) return skb_header_pointer()
> if (rdwr) bpf_try_make_writable(); return skb->data + off;
> and final bpf_dynptr_write() is not needed.
>
> But that doesn't work for xdp, since there is no pull.
>
> It's not clear how to deal with BPF_F_RECOMPUTE_CSUM though.
> Expose __skb_postpull_rcsum/__skb_postpush_rcsum as kfuncs?
> But that defeats Andrii's goal to use dynptr as a generic wrapper.
> skb is quite special.
>
> Maybe something like:
> void *bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len,
>                        void *buffer, u32 buffer__sz)
> {
>   if (skb_cloned()) {
>     skb_copy_bits(skb, offset, buffer, len);
>     return buffer;
>   }
>   return skb_header_pointer(...);
> }
>
> When prog is just parsing the packet it doesn't need to finalize with bpf_dynptr_write.
> The prog can always write into the pointer followed by if (p == buf) bpf_dynptr_write.
> No need for rdonly flag, but extra copy is there in case of cloned which
> could have been avoided with extra rd_only flag.

Yep, given we are designing bpf_dynptr_slice for performance, extra
copy on reads is unfortunate. ro/rw flag or have separate
bpf_dynptr_slice_rw vs bpf_dynptr_slice_ro?

>
> In case of xdp it will be:
> void *bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len,
>                        void *buffer, u32 buffer__sz)
> {
>    ptr = bpf_xdp_pointer(xdp, offset, len);
>    if (ptr)
>       return ptr;
>    bpf_xdp_copy_buf(xdp, offset, buffer, len, false); /* copy into buf */
>    return buffer;
> }
>
> bpf_dynptr_write will use bpf_xdp_copy_buf(,true); /* copy into xdp */
>
> > >
> > > The bpf prog usually has buffer in the stack for the common small header parsing.
> >
> > sure, that would work for small chunks
> >
> > >
> > > One side note is the bpf_dynptr_slice() still needs to check if the skb is
> > > cloned or not even the off/len is within the head range.
> >
> > yep, and the above snippet will still do the right thing with
> > bpf_dynptr_write(), right? bpf_dynptr_write() will have to pull
> > anyways, if I understand correctly?
>
> Yes and No. bpf_skb_store_bytes is doing pull followed by memcpy,
> while xdp_store_bytes does scatter gather copy into frags.
> We should probably add similar copy to skb case to avoid allocations and pull.
> Then in case of:
>  if (p == buf) {
>       bpf_dynptr_write(dp, buf, 16);
>  }
>
> the write will guarantee to succeed for both xdp and skb and the user
> doesn't need to add error checking for alloc failures in case of skb.
>

That seems like a nice guarantee, agreed.

> > >
> > > > But I wonder if for simple cases when users are mostly sure that they
> > > > are going to access only header data directly we can have an option
> > > > for bpf_dynptr_from_skb() to specify what should be the behavior for
> > > > bpf_dynptr_slice():
> > > >
> > > >   - either return NULL for anything that crosses into frags (no
> > > > surprising perf penalty, but surprising NULLs);
> > > >   - do bpf_skb_pull_data() if bpf_dynptr_data() needs to point to data
> > > > beyond header (potential perf penalty, but on NULLs, if off+len is
> > > > within packet).
> > > >
> > > > And then bpf_dynptr_from_skb() can accept a flag specifying this
> > > > behavior and store it somewhere in struct bpf_dynptr.
> > >
> > > xdp does not have the bpf_skb_pull_data() equivalent, so xdp prog will still
> > > need the write back handling.
> > >
> >
> > Sure, unfortunately, can't have everything. I'm just thinking how to
> > make bpf_dynptr_data() generically usable. Think about some common BPF
> > routine that calculates hash for all bytes pointed to by dynptr,
> > regardless of underlying dynptr type; it can iterate in small chunks,
> > get memory slice, if possible, but fallback to generic
> > bpf_dynptr_read() if doesn't. This will work for skb, xdp, LOCAL,
> > RINGBUF, any other dynptr type.
>
> It looks to me that dynptr on top of skb, xdp, local can work as generic reader,
> but dynptr as a generic writer doesn't look possible.
> BPF_F_RECOMPUTE_CSUM and BPF_F_INVALIDATE_HASH are special to skb.
> There is also bpf_skb_change_proto and crazy complex bpf_skb_adjust_room.
> I don't think writing into skb vs xdp vs ringbuf are generalizable.
> The prog needs to do a ton more work to write into skb correctly.

If that's the case, then yeah, bpf_dynptr_write() can just return
error for skb/xdp dynptrs?
Alexei Starovoitov Feb. 1, 2023, 12:40 a.m. UTC | #28
On Tue, Jan 31, 2023 at 04:11:47PM -0800, Andrii Nakryiko wrote:
> >
> > When prog is just parsing the packet it doesn't need to finalize with bpf_dynptr_write.
> > The prog can always write into the pointer followed by if (p == buf) bpf_dynptr_write.
> > No need for rdonly flag, but extra copy is there in case of cloned which
> > could have been avoided with extra rd_only flag.
> 
> Yep, given we are designing bpf_dynptr_slice for performance, extra
> copy on reads is unfortunate. ro/rw flag or have separate
> bpf_dynptr_slice_rw vs bpf_dynptr_slice_ro?

Either flag or two kfuncs sound good to me.

> > Yes and No. bpf_skb_store_bytes is doing pull followed by memcpy,
> > while xdp_store_bytes does scatter gather copy into frags.
> > We should probably add similar copy to skb case to avoid allocations and pull.
> > Then in case of:
> >  if (p == buf) {
> >       bpf_dynptr_write(dp, buf, 16);
> >  }
> >
> > the write will guarantee to succeed for both xdp and skb and the user
> > doesn't need to add error checking for alloc failures in case of skb.
> >
> 
> That seems like a nice guarantee, agreed.

Just grepped through few projects that use skb_store_bytes.
Everywhere it looks like:
if (bpf_skb_store_byte(...))
   return error;

Not a pretty code to read.
I should prioritize bpf_assert() work, so we can assert from inside of
bpf_dynptr_write() eventually and remove all these IFs.

> > > >
> > > > > But I wonder if for simple cases when users are mostly sure that they
> > > > > are going to access only header data directly we can have an option
> > > > > for bpf_dynptr_from_skb() to specify what should be the behavior for
> > > > > bpf_dynptr_slice():
> > > > >
> > > > >   - either return NULL for anything that crosses into frags (no
> > > > > surprising perf penalty, but surprising NULLs);
> > > > >   - do bpf_skb_pull_data() if bpf_dynptr_data() needs to point to data
> > > > > beyond header (potential perf penalty, but on NULLs, if off+len is
> > > > > within packet).
> > > > >
> > > > > And then bpf_dynptr_from_skb() can accept a flag specifying this
> > > > > behavior and store it somewhere in struct bpf_dynptr.
> > > >
> > > > xdp does not have the bpf_skb_pull_data() equivalent, so xdp prog will still
> > > > need the write back handling.
> > > >
> > >
> > > Sure, unfortunately, can't have everything. I'm just thinking how to
> > > make bpf_dynptr_data() generically usable. Think about some common BPF
> > > routine that calculates hash for all bytes pointed to by dynptr,
> > > regardless of underlying dynptr type; it can iterate in small chunks,
> > > get memory slice, if possible, but fallback to generic
> > > bpf_dynptr_read() if doesn't. This will work for skb, xdp, LOCAL,
> > > RINGBUF, any other dynptr type.
> >
> > It looks to me that dynptr on top of skb, xdp, local can work as generic reader,
> > but dynptr as a generic writer doesn't look possible.
> > BPF_F_RECOMPUTE_CSUM and BPF_F_INVALIDATE_HASH are special to skb.
> > There is also bpf_skb_change_proto and crazy complex bpf_skb_adjust_room.
> > I don't think writing into skb vs xdp vs ringbuf are generalizable.
> > The prog needs to do a ton more work to write into skb correctly.
> 
> If that's the case, then yeah, bpf_dynptr_write() can just return
> error for skb/xdp dynptrs?

You mean to error when these skb only flags are present, but dynptr->type == xdp ?
Yep. I don't see another option. My point was that dynptr doesn't quite work as an
abstraction for writing into networking things.
While libraries like: parse_http(&dynptr), compute_hash(&dynptr), find_string(&dynptr)
can indeed be generic and work with raw bytes, skb, xdp as an input,
which I think was on top of your wishlist for dynptr.
Alexei Starovoitov Feb. 1, 2023, 12:46 a.m. UTC | #29
On Tue, Jan 31, 2023 at 03:17:08PM -0800, Joanne Koong wrote:
> >
> > It's not clear how to deal with BPF_F_RECOMPUTE_CSUM though.
> > Expose __skb_postpull_rcsum/__skb_postpush_rcsum as kfuncs?
> > But that defeats Andrii's goal to use dynptr as a generic wrapper.
> > skb is quite special.
> 
> If it's the common case that skbs use the same flag across writes in
> their bpf prog, then we can have bpf_dynptr_from_skb take in
> BPF_F_RECOMPUTE_CSUM/BPF_F_INVALIDATE_HASH in its flags arg and then
> always apply this when the skb does a write to packet data.

Remembering these flags at creation of dynptr is an interesting idea,
but it doesn't help with direct write into ptr returned from bpf_dynptr_slice.
The __skb_postpull_rcsum needs to be done before the write and
__skb_postpush_rcsum after the write.

> >
> > Maybe something like:
> > void *bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len,
> >                        void *buffer, u32 buffer__sz)
> > {
> >   if (skb_cloned()) {
> >     skb_copy_bits(skb, offset, buffer, len);
> >     return buffer;
> >   }
> >   return skb_header_pointer(...);
> > }
> >
> > When prog is just parsing the packet it doesn't need to finalize with bpf_dynptr_write.
> > The prog can always write into the pointer followed by if (p == buf) bpf_dynptr_write.
> > No need for rdonly flag, but extra copy is there in case of cloned which
> > could have been avoided with extra rd_only flag.
> 
> We're able to track in the verifier whether the slice gets written to
> or not, so if it does get written to in the skb case, can't we just
> add in a call to bpf_try_make_writable() as a post-processing fixup
> that gets called before bpf_dynptr_slice? Then bpf_dynptr_slice() can
> just return a directly writable ptr and avoid the extra memcpy

It's doable, but bpf_try_make_writable can fail and it's much slower than memcpy.
I'm not sure what you're optimizing here.
Andrii Nakryiko Feb. 2, 2023, 1:21 a.m. UTC | #30
On Tue, Jan 31, 2023 at 4:40 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Jan 31, 2023 at 04:11:47PM -0800, Andrii Nakryiko wrote:
> > >
> > > When prog is just parsing the packet it doesn't need to finalize with bpf_dynptr_write.
> > > The prog can always write into the pointer followed by if (p == buf) bpf_dynptr_write.
> > > No need for rdonly flag, but extra copy is there in case of cloned which
> > > could have been avoided with extra rd_only flag.
> >
> > Yep, given we are designing bpf_dynptr_slice for performance, extra
> > copy on reads is unfortunate. ro/rw flag or have separate
> > bpf_dynptr_slice_rw vs bpf_dynptr_slice_ro?
>
> Either flag or two kfuncs sound good to me.

Would it make sense to make bpf_dynptr_slice() as read-only variant,
and bpf_dynptr_slice_rw() for read/write? I think the common case is
read-only, right? And if users mistakenly use bpf_dynptr_slice() for
r/w case, they will get a verifier error when trying to write into the
returned pointer. While if we make bpf_dynptr_slice() as read-write,
users won't realize they are paying a performance penalty for
something that they don't actually need.

>
> > > Yes and No. bpf_skb_store_bytes is doing pull followed by memcpy,
> > > while xdp_store_bytes does scatter gather copy into frags.
> > > We should probably add similar copy to skb case to avoid allocations and pull.
> > > Then in case of:
> > >  if (p == buf) {
> > >       bpf_dynptr_write(dp, buf, 16);
> > >  }
> > >
> > > the write will guarantee to succeed for both xdp and skb and the user
> > > doesn't need to add error checking for alloc failures in case of skb.
> > >
> >
> > That seems like a nice guarantee, agreed.
>
> Just grepped through few projects that use skb_store_bytes.
> Everywhere it looks like:
> if (bpf_skb_store_byte(...))
>    return error;
>
> Not a pretty code to read.
> I should prioritize bpf_assert() work, so we can assert from inside of
> bpf_dynptr_write() eventually and remove all these IFs.
>
> > > > >
> > > > > > But I wonder if for simple cases when users are mostly sure that they
> > > > > > are going to access only header data directly we can have an option
> > > > > > for bpf_dynptr_from_skb() to specify what should be the behavior for
> > > > > > bpf_dynptr_slice():
> > > > > >
> > > > > >   - either return NULL for anything that crosses into frags (no
> > > > > > surprising perf penalty, but surprising NULLs);
> > > > > >   - do bpf_skb_pull_data() if bpf_dynptr_data() needs to point to data
> > > > > > beyond header (potential perf penalty, but on NULLs, if off+len is
> > > > > > within packet).
> > > > > >
> > > > > > And then bpf_dynptr_from_skb() can accept a flag specifying this
> > > > > > behavior and store it somewhere in struct bpf_dynptr.
> > > > >
> > > > > xdp does not have the bpf_skb_pull_data() equivalent, so xdp prog will still
> > > > > need the write back handling.
> > > > >
> > > >
> > > > Sure, unfortunately, can't have everything. I'm just thinking how to
> > > > make bpf_dynptr_data() generically usable. Think about some common BPF
> > > > routine that calculates hash for all bytes pointed to by dynptr,
> > > > regardless of underlying dynptr type; it can iterate in small chunks,
> > > > get memory slice, if possible, but fallback to generic
> > > > bpf_dynptr_read() if doesn't. This will work for skb, xdp, LOCAL,
> > > > RINGBUF, any other dynptr type.
> > >
> > > It looks to me that dynptr on top of skb, xdp, local can work as generic reader,
> > > but dynptr as a generic writer doesn't look possible.
> > > BPF_F_RECOMPUTE_CSUM and BPF_F_INVALIDATE_HASH are special to skb.
> > > There is also bpf_skb_change_proto and crazy complex bpf_skb_adjust_room.
> > > I don't think writing into skb vs xdp vs ringbuf are generalizable.
> > > The prog needs to do a ton more work to write into skb correctly.
> >
> > If that's the case, then yeah, bpf_dynptr_write() can just return
> > error for skb/xdp dynptrs?
>
> You mean to error when these skb only flags are present, but dynptr->type == xdp ?
> Yep. I don't see another option. My point was that dynptr doesn't quite work as an
> abstraction for writing into networking things.

agreed

> While libraries like: parse_http(&dynptr), compute_hash(&dynptr), find_string(&dynptr)
> can indeed be generic and work with raw bytes, skb, xdp as an input,
> which I think was on top of your wishlist for dynptr.

yep, it would be a great property
Alexei Starovoitov Feb. 2, 2023, 11:43 a.m. UTC | #31
On Wed, Feb 1, 2023 at 5:21 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Tue, Jan 31, 2023 at 4:40 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Jan 31, 2023 at 04:11:47PM -0800, Andrii Nakryiko wrote:
> > > >
> > > > When prog is just parsing the packet it doesn't need to finalize with bpf_dynptr_write.
> > > > The prog can always write into the pointer followed by if (p == buf) bpf_dynptr_write.
> > > > No need for rdonly flag, but extra copy is there in case of cloned which
> > > > could have been avoided with extra rd_only flag.
> > >
> > > Yep, given we are designing bpf_dynptr_slice for performance, extra
> > > copy on reads is unfortunate. ro/rw flag or have separate
> > > bpf_dynptr_slice_rw vs bpf_dynptr_slice_ro?
> >
> > Either flag or two kfuncs sound good to me.
>
> Would it make sense to make bpf_dynptr_slice() as read-only variant,
> and bpf_dynptr_slice_rw() for read/write? I think the common case is
> read-only, right? And if users mistakenly use bpf_dynptr_slice() for
> r/w case, they will get a verifier error when trying to write into the
> returned pointer. While if we make bpf_dynptr_slice() as read-write,
> users won't realize they are paying a performance penalty for
> something that they don't actually need.

Makes sense and it matches skb_header_pointer() usage in the kernel
which is read-only. Since there is no verifier the read-only-ness
is not enforced, but we can do it.

Looks like we've converged on bpf_dynptr_slice() and bpf_dynptr_slice_rw().
The question remains what to do with bpf_dynptr_data() backed by skb/xdp.
Should we return EINVAL to discourage its usage?
Of course, we can come up with sensible behavior for bpf_dynptr_data(),
but it will have quirks that will be not easy to document.
Even with extensive docs the users might be surprised by the behavior.
Andrii Nakryiko Feb. 3, 2023, 9:37 p.m. UTC | #32
On Thu, Feb 2, 2023 at 3:43 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Feb 1, 2023 at 5:21 PM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Tue, Jan 31, 2023 at 4:40 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Tue, Jan 31, 2023 at 04:11:47PM -0800, Andrii Nakryiko wrote:
> > > > >
> > > > > When prog is just parsing the packet it doesn't need to finalize with bpf_dynptr_write.
> > > > > The prog can always write into the pointer followed by if (p == buf) bpf_dynptr_write.
> > > > > No need for rdonly flag, but extra copy is there in case of cloned which
> > > > > could have been avoided with extra rd_only flag.
> > > >
> > > > Yep, given we are designing bpf_dynptr_slice for performance, extra
> > > > copy on reads is unfortunate. ro/rw flag or have separate
> > > > bpf_dynptr_slice_rw vs bpf_dynptr_slice_ro?
> > >
> > > Either flag or two kfuncs sound good to me.
> >
> > Would it make sense to make bpf_dynptr_slice() as read-only variant,
> > and bpf_dynptr_slice_rw() for read/write? I think the common case is
> > read-only, right? And if users mistakenly use bpf_dynptr_slice() for
> > r/w case, they will get a verifier error when trying to write into the
> > returned pointer. While if we make bpf_dynptr_slice() as read-write,
> > users won't realize they are paying a performance penalty for
> > something that they don't actually need.
>
> Makes sense and it matches skb_header_pointer() usage in the kernel
> which is read-only. Since there is no verifier the read-only-ness
> is not enforced, but we can do it.
>
> Looks like we've converged on bpf_dynptr_slice() and bpf_dynptr_slice_rw().
> The question remains what to do with bpf_dynptr_data() backed by skb/xdp.
> Should we return EINVAL to discourage its usage?
> Of course, we can come up with sensible behavior for bpf_dynptr_data(),
> but it will have quirks that will be not easy to document.
> Even with extensive docs the users might be surprised by the behavior.

I feel like having bpf_dynptr_data() working in the common case for
skb/xdp would be nice (e.g., so basically at least work in cases when
we don't need to pull).

But we've been discussing bpf_dynptr_slice() with Joanne today, and we
came to the conclusion that bpf_dynptr_slice()/bpf_dynptr_slice_rw()
should work for any kind of dynptr (LOCAL, RINGBUF, SKB, XDP). So
generic code that wants to work with any dynptr would be able to just
use bpf_dynptr_slice, even for LOCAL/RINGBUF, even though buffer won't
ever be filled for LOCAL/RINGBUF.

In application, though, if I know I'm working with LOCAL or RINGBUF
(or MALLOC, once we have it), I'd use bpf_dynptr_data() to fill out
fixed parts, of course. bpf_dynptr_slice() would be cumbersome for
such cases (especially if I have some huge fixed part that I *know* is
available in RINGBUF/MALLOC case).

With this setup we probably won't ever need bpf_dynptr_data_rdonly(),
because we can say to use bpf_dynptr_slice() for that (even with an
unnecessary buffer).
Alexei Starovoitov Feb. 8, 2023, 2:25 a.m. UTC | #33
On Fri, Feb 03, 2023 at 01:37:46PM -0800, Andrii Nakryiko wrote:
> On Thu, Feb 2, 2023 at 3:43 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Wed, Feb 1, 2023 at 5:21 PM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Tue, Jan 31, 2023 at 4:40 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Tue, Jan 31, 2023 at 04:11:47PM -0800, Andrii Nakryiko wrote:
> > > > > >
> > > > > > When prog is just parsing the packet it doesn't need to finalize with bpf_dynptr_write.
> > > > > > The prog can always write into the pointer followed by if (p == buf) bpf_dynptr_write.
> > > > > > No need for rdonly flag, but extra copy is there in case of cloned which
> > > > > > could have been avoided with extra rd_only flag.
> > > > >
> > > > > Yep, given we are designing bpf_dynptr_slice for performance, extra
> > > > > copy on reads is unfortunate. ro/rw flag or have separate
> > > > > bpf_dynptr_slice_rw vs bpf_dynptr_slice_ro?
> > > >
> > > > Either flag or two kfuncs sound good to me.
> > >
> > > Would it make sense to make bpf_dynptr_slice() as read-only variant,
> > > and bpf_dynptr_slice_rw() for read/write? I think the common case is
> > > read-only, right? And if users mistakenly use bpf_dynptr_slice() for
> > > r/w case, they will get a verifier error when trying to write into the
> > > returned pointer. While if we make bpf_dynptr_slice() as read-write,
> > > users won't realize they are paying a performance penalty for
> > > something that they don't actually need.
> >
> > Makes sense and it matches skb_header_pointer() usage in the kernel
> > which is read-only. Since there is no verifier the read-only-ness
> > is not enforced, but we can do it.
> >
> > Looks like we've converged on bpf_dynptr_slice() and bpf_dynptr_slice_rw().
> > The question remains what to do with bpf_dynptr_data() backed by skb/xdp.
> > Should we return EINVAL to discourage its usage?
> > Of course, we can come up with sensible behavior for bpf_dynptr_data(),
> > but it will have quirks that will be not easy to document.
> > Even with extensive docs the users might be surprised by the behavior.
> 
> I feel like having bpf_dynptr_data() working in the common case for
> skb/xdp would be nice (e.g., so basically at least work in cases when
> we don't need to pull).
> 
> But we've been discussing bpf_dynptr_slice() with Joanne today, and we
> came to the conclusion that bpf_dynptr_slice()/bpf_dynptr_slice_rw()
> should work for any kind of dynptr (LOCAL, RINGBUF, SKB, XDP). So
> generic code that wants to work with any dynptr would be able to just
> use bpf_dynptr_slice, even for LOCAL/RINGBUF, even though buffer won't
> ever be filled for LOCAL/RINGBUF.

great

> In application, though, if I know I'm working with LOCAL or RINGBUF
> (or MALLOC, once we have it), I'd use bpf_dynptr_data() to fill out
> fixed parts, of course. bpf_dynptr_slice() would be cumbersome for
> such cases (especially if I have some huge fixed part that I *know* is
> available in RINGBUF/MALLOC case).

bpf_dynptr_data() for local and ringbuf is fine, of course.
It already exists and has to continue working.
bpf_dynptr_data() for xdp is probably ok as well,
but bpf_dynptr_data() for skb is problematic.
data/data_end concept looked great back in 2016 when it was introduced
and lots of programs were written, but we underestimated the impact
of driver's copybreak on programs.
Network parsing progs consume headers one by one and would typically
be written as:
if (header > data_end)
   return DROP;
Some drivers copybreak fixed number of bytes. Others try to be smart
and copy only headers into linear part of skb.
The drivers also change. At one point we tried to upgrade the kernel
and suddenly bpf firewall started blocking valid traffic.
Turned out the driver copybreak heuristic was changed in that kernel.
The bpf prog was converted to use skb_load_bytes() and the fire was extinguished.
It was a hard lesson.
Others learned the danger of data/data_end the hard way as well.
Take a look at cloudflare's progs/test_cls_redirect.c.
It's a complicated combination of data/data_end and skb_load_bytes().
It's essentially implementing skb_header_pointer.
I wish we could use bpf_dynptr_slice only and remove data/data_end,
but we cannot, since it's uapi.
But we shouldn't repeat the same mistake. If we do bpf_dynptr_data()
that returns linear part people will be hitting the same fragility and
difficult to debug bugs.
bpf_dynptr_data() for XDP is ok-ish, since most of XDP is still
page-per-packet, but patches to split headers in HW are starting to appear.
So even for XDP data/data_end concept may bite us.
Hence my preference is to EINVAL in bpf_dynptr_data() at least for skb,
since bpf_dynptr_slice() is a strictly better alternative.
Joanne Koong Feb. 8, 2023, 8:13 p.m. UTC | #34
On Tue, Feb 7, 2023 at 6:25 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Feb 03, 2023 at 01:37:46PM -0800, Andrii Nakryiko wrote:
> > On Thu, Feb 2, 2023 at 3:43 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Wed, Feb 1, 2023 at 5:21 PM Andrii Nakryiko
> > > <andrii.nakryiko@gmail.com> wrote:
> > > >
> > > > On Tue, Jan 31, 2023 at 4:40 PM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Tue, Jan 31, 2023 at 04:11:47PM -0800, Andrii Nakryiko wrote:
> > > > > > >
> > > > > > > When prog is just parsing the packet it doesn't need to finalize with bpf_dynptr_write.
> > > > > > > The prog can always write into the pointer followed by if (p == buf) bpf_dynptr_write.
> > > > > > > No need for rdonly flag, but extra copy is there in case of cloned which
> > > > > > > could have been avoided with extra rd_only flag.
> > > > > >
> > > > > > Yep, given we are designing bpf_dynptr_slice for performance, extra
> > > > > > copy on reads is unfortunate. ro/rw flag or have separate
> > > > > > bpf_dynptr_slice_rw vs bpf_dynptr_slice_ro?
> > > > >
> > > > > Either flag or two kfuncs sound good to me.
> > > >
> > > > Would it make sense to make bpf_dynptr_slice() as read-only variant,
> > > > and bpf_dynptr_slice_rw() for read/write? I think the common case is
> > > > read-only, right? And if users mistakenly use bpf_dynptr_slice() for
> > > > r/w case, they will get a verifier error when trying to write into the
> > > > returned pointer. While if we make bpf_dynptr_slice() as read-write,
> > > > users won't realize they are paying a performance penalty for
> > > > something that they don't actually need.
> > >
> > > Makes sense and it matches skb_header_pointer() usage in the kernel
> > > which is read-only. Since there is no verifier the read-only-ness
> > > is not enforced, but we can do it.
> > >
> > > Looks like we've converged on bpf_dynptr_slice() and bpf_dynptr_slice_rw().
> > > The question remains what to do with bpf_dynptr_data() backed by skb/xdp.
> > > Should we return EINVAL to discourage its usage?
> > > Of course, we can come up with sensible behavior for bpf_dynptr_data(),
> > > but it will have quirks that will be not easy to document.
> > > Even with extensive docs the users might be surprised by the behavior.
> >
> > I feel like having bpf_dynptr_data() working in the common case for
> > skb/xdp would be nice (e.g., so basically at least work in cases when
> > we don't need to pull).
> >
> > But we've been discussing bpf_dynptr_slice() with Joanne today, and we
> > came to the conclusion that bpf_dynptr_slice()/bpf_dynptr_slice_rw()
> > should work for any kind of dynptr (LOCAL, RINGBUF, SKB, XDP). So
> > generic code that wants to work with any dynptr would be able to just
> > use bpf_dynptr_slice, even for LOCAL/RINGBUF, even though buffer won't
> > ever be filled for LOCAL/RINGBUF.
>
> great
>
> > In application, though, if I know I'm working with LOCAL or RINGBUF
> > (or MALLOC, once we have it), I'd use bpf_dynptr_data() to fill out
> > fixed parts, of course. bpf_dynptr_slice() would be cumbersome for
> > such cases (especially if I have some huge fixed part that I *know* is
> > available in RINGBUF/MALLOC case).
>
> bpf_dynptr_data() for local and ringbuf is fine, of course.
> It already exists and has to continue working.
> bpf_dynptr_data() for xdp is probably ok as well,
> but bpf_dynptr_data() for skb is problematic.
> data/data_end concept looked great back in 2016 when it was introduced
> and lots of programs were written, but we underestimated the impact
> of driver's copybreak on programs.
> Network parsing progs consume headers one by one and would typically
> be written as:
> if (header > data_end)
>    return DROP;
> Some drivers copybreak fixed number of bytes. Others try to be smart
> and copy only headers into linear part of skb.
> The drivers also change. At one point we tried to upgrade the kernel
> and suddenly bpf firewall started blocking valid traffic.
> Turned out the driver copybreak heuristic was changed in that kernel.
> The bpf prog was converted to use skb_load_bytes() and the fire was extinguished.
> It was a hard lesson.
> Others learned the danger of data/data_end the hard way as well.
> Take a look at cloudflare's progs/test_cls_redirect.c.
> It's a complicated combination of data/data_end and skb_load_bytes().
> It's essentially implementing skb_header_pointer.
> I wish we could use bpf_dynptr_slice only and remove data/data_end,
> but we cannot, since it's uapi.
> But we shouldn't repeat the same mistake. If we do bpf_dynptr_data()
> that returns linear part people will be hitting the same fragility and
> difficult to debug bugs.
> bpf_dynptr_data() for XDP is ok-ish, since most of XDP is still
> page-per-packet, but patches to split headers in HW are starting to appear.
> So even for XDP data/data_end concept may bite us.
> Hence my preference is to EINVAL in bpf_dynptr_data() at least for skb,
> since bpf_dynptr_slice() is a strictly better alternative.

This makes sense to me, I will have the next version of this patchset
return -EINVAL if bpf_dynptr_data() is used on a skb or xdp dynptr
Joanne Koong Feb. 8, 2023, 9:46 p.m. UTC | #35
On Tue, Jan 31, 2023 at 9:54 AM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Mon, Jan 30, 2023 at 9:36 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, Jan 30, 2023 at 04:44:12PM -0800, Joanne Koong wrote:
> > > On Sun, Jan 29, 2023 at 3:39 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Fri, Jan 27, 2023 at 11:17:01AM -0800, Joanne Koong wrote:
[...]
> > > > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > > > index 6da78b3d381e..ddb47126071a 100644
> > > > > --- a/net/core/filter.c
> > > > > +++ b/net/core/filter.c
> > > > > @@ -1684,8 +1684,8 @@ static inline void bpf_pull_mac_rcsum(struct sk_buff *skb)
> > > > >               skb_postpull_rcsum(skb, skb_mac_header(skb), skb->mac_len);
> > > > >  }
> > > > >
> > > > > -BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *, skb, u32, offset,
> > > > > -        const void *, from, u32, len, u64, flags)
> > > > > +int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from,
> > > > > +                       u32 len, u64 flags)
> > > >
> > > > This change is just to be able to call __bpf_skb_store_bytes() ?
> > > > If so, it's unnecessary.
> > > > See:
> > > > BPF_CALL_4(sk_reuseport_load_bytes,
> > > >            const struct sk_reuseport_kern *, reuse_kern, u32, offset,
> > > >            void *, to, u32, len)
> > > > {
> > > >         return ____bpf_skb_load_bytes(reuse_kern->skb, offset, to, len);
> > > > }
> > > >
> > >
> > > There was prior feedback [0] that using four underscores to call a
> > > helper function is confusing and makes it ungreppable
> >
> > There are plenty of ungreppable funcs in the kernel.
> > Try finding where folio_test_dirty() is defined.
> > mm subsystem is full of such 'features'.
> > Not friendly for casual kernel code reader, but useful.
> >
> > Since quadruple underscore is already used in the code base
> > I see no reason to sacrifice bpf_skb_load_bytes performance with extra call.
>
> I don't have a preference either way, I'll change it to use the
> quadruple underscore in the next version

I think we still need these extra __bpf_skb_store/load_bytes()
functions, because BPF_CALL_x static inlines the
bpf_skb_store/load_bytes helpers in net/core/filter.c, and we need to
call these bpf_skb_store/load_bytes helpers from another file
(kernel/bpf/helpers.c). I think the only other alternative is moving
the BPF_CALL_x declaration of bpf_skb_store/load bytes to
include/linux/filter.h, but I think having the extra
__bpf_skb_store/load_bytes() is cleaner.
Alexei Starovoitov Feb. 8, 2023, 11:22 p.m. UTC | #36
On Wed, Feb 8, 2023 at 1:47 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Tue, Jan 31, 2023 at 9:54 AM Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Mon, Jan 30, 2023 at 9:36 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Mon, Jan 30, 2023 at 04:44:12PM -0800, Joanne Koong wrote:
> > > > On Sun, Jan 29, 2023 at 3:39 PM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Fri, Jan 27, 2023 at 11:17:01AM -0800, Joanne Koong wrote:
> [...]
> > > > > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > > > > index 6da78b3d381e..ddb47126071a 100644
> > > > > > --- a/net/core/filter.c
> > > > > > +++ b/net/core/filter.c
> > > > > > @@ -1684,8 +1684,8 @@ static inline void bpf_pull_mac_rcsum(struct sk_buff *skb)
> > > > > >               skb_postpull_rcsum(skb, skb_mac_header(skb), skb->mac_len);
> > > > > >  }
> > > > > >
> > > > > > -BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *, skb, u32, offset,
> > > > > > -        const void *, from, u32, len, u64, flags)
> > > > > > +int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from,
> > > > > > +                       u32 len, u64 flags)
> > > > >
> > > > > This change is just to be able to call __bpf_skb_store_bytes() ?
> > > > > If so, it's unnecessary.
> > > > > See:
> > > > > BPF_CALL_4(sk_reuseport_load_bytes,
> > > > >            const struct sk_reuseport_kern *, reuse_kern, u32, offset,
> > > > >            void *, to, u32, len)
> > > > > {
> > > > >         return ____bpf_skb_load_bytes(reuse_kern->skb, offset, to, len);
> > > > > }
> > > > >
> > > >
> > > > There was prior feedback [0] that using four underscores to call a
> > > > helper function is confusing and makes it ungreppable
> > >
> > > There are plenty of ungreppable funcs in the kernel.
> > > Try finding where folio_test_dirty() is defined.
> > > mm subsystem is full of such 'features'.
> > > Not friendly for casual kernel code reader, but useful.
> > >
> > > Since quadruple underscore is already used in the code base
> > > I see no reason to sacrifice bpf_skb_load_bytes performance with extra call.
> >
> > I don't have a preference either way, I'll change it to use the
> > quadruple underscore in the next version
>
> I think we still need these extra __bpf_skb_store/load_bytes()
> functions, because BPF_CALL_x static inlines the
> bpf_skb_store/load_bytes helpers in net/core/filter.c, and we need to
> call these bpf_skb_store/load_bytes helpers from another file
> (kernel/bpf/helpers.c). I think the only other alternative is moving
> the BPF_CALL_x declaration of bpf_skb_store/load bytes to
> include/linux/filter.h, but I think having the extra
> __bpf_skb_store/load_bytes() is cleaner.

bpf_skb_load_bytes() is a performance critical function.
I'm worried about the cost of the extra call.
Will compiler be smart enough to inline __bpf_skb_load_bytes()
in both cases? Probably not if they're in different .c files.
Not sure how to solve it. Make it a static inline in skbuff.h ?
Andrii Nakryiko Feb. 9, 2023, 12:39 a.m. UTC | #37
On Wed, Feb 8, 2023 at 12:13 PM Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Tue, Feb 7, 2023 at 6:25 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Fri, Feb 03, 2023 at 01:37:46PM -0800, Andrii Nakryiko wrote:
> > > On Thu, Feb 2, 2023 at 3:43 AM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Wed, Feb 1, 2023 at 5:21 PM Andrii Nakryiko
> > > > <andrii.nakryiko@gmail.com> wrote:
> > > > >
> > > > > On Tue, Jan 31, 2023 at 4:40 PM Alexei Starovoitov
> > > > > <alexei.starovoitov@gmail.com> wrote:
> > > > > >
> > > > > > On Tue, Jan 31, 2023 at 04:11:47PM -0800, Andrii Nakryiko wrote:
> > > > > > > >
> > > > > > > > When prog is just parsing the packet it doesn't need to finalize with bpf_dynptr_write.
> > > > > > > > The prog can always write into the pointer followed by if (p == buf) bpf_dynptr_write.
> > > > > > > > No need for rdonly flag, but extra copy is there in case of cloned which
> > > > > > > > could have been avoided with extra rd_only flag.
> > > > > > >
> > > > > > > Yep, given we are designing bpf_dynptr_slice for performance, extra
> > > > > > > copy on reads is unfortunate. ro/rw flag or have separate
> > > > > > > bpf_dynptr_slice_rw vs bpf_dynptr_slice_ro?
> > > > > >
> > > > > > Either flag or two kfuncs sound good to me.
> > > > >
> > > > > Would it make sense to make bpf_dynptr_slice() as read-only variant,
> > > > > and bpf_dynptr_slice_rw() for read/write? I think the common case is
> > > > > read-only, right? And if users mistakenly use bpf_dynptr_slice() for
> > > > > r/w case, they will get a verifier error when trying to write into the
> > > > > returned pointer. While if we make bpf_dynptr_slice() as read-write,
> > > > > users won't realize they are paying a performance penalty for
> > > > > something that they don't actually need.
> > > >
> > > > Makes sense and it matches skb_header_pointer() usage in the kernel
> > > > which is read-only. Since there is no verifier the read-only-ness
> > > > is not enforced, but we can do it.
> > > >
> > > > Looks like we've converged on bpf_dynptr_slice() and bpf_dynptr_slice_rw().
> > > > The question remains what to do with bpf_dynptr_data() backed by skb/xdp.
> > > > Should we return EINVAL to discourage its usage?
> > > > Of course, we can come up with sensible behavior for bpf_dynptr_data(),
> > > > but it will have quirks that will be not easy to document.
> > > > Even with extensive docs the users might be surprised by the behavior.
> > >
> > > I feel like having bpf_dynptr_data() working in the common case for
> > > skb/xdp would be nice (e.g., so basically at least work in cases when
> > > we don't need to pull).
> > >
> > > But we've been discussing bpf_dynptr_slice() with Joanne today, and we
> > > came to the conclusion that bpf_dynptr_slice()/bpf_dynptr_slice_rw()
> > > should work for any kind of dynptr (LOCAL, RINGBUF, SKB, XDP). So
> > > generic code that wants to work with any dynptr would be able to just
> > > use bpf_dynptr_slice, even for LOCAL/RINGBUF, even though buffer won't
> > > ever be filled for LOCAL/RINGBUF.
> >
> > great
> >
> > > In application, though, if I know I'm working with LOCAL or RINGBUF
> > > (or MALLOC, once we have it), I'd use bpf_dynptr_data() to fill out
> > > fixed parts, of course. bpf_dynptr_slice() would be cumbersome for
> > > such cases (especially if I have some huge fixed part that I *know* is
> > > available in RINGBUF/MALLOC case).
> >
> > bpf_dynptr_data() for local and ringbuf is fine, of course.
> > It already exists and has to continue working.
> > bpf_dynptr_data() for xdp is probably ok as well,
> > but bpf_dynptr_data() for skb is problematic.
> > data/data_end concept looked great back in 2016 when it was introduced
> > and lots of programs were written, but we underestimated the impact
> > of driver's copybreak on programs.
> > Network parsing progs consume headers one by one and would typically
> > be written as:
> > if (header > data_end)
> >    return DROP;
> > Some drivers copybreak fixed number of bytes. Others try to be smart
> > and copy only headers into linear part of skb.
> > The drivers also change. At one point we tried to upgrade the kernel
> > and suddenly bpf firewall started blocking valid traffic.
> > Turned out the driver copybreak heuristic was changed in that kernel.
> > The bpf prog was converted to use skb_load_bytes() and the fire was extinguished.
> > It was a hard lesson.
> > Others learned the danger of data/data_end the hard way as well.
> > Take a look at cloudflare's progs/test_cls_redirect.c.
> > It's a complicated combination of data/data_end and skb_load_bytes().
> > It's essentially implementing skb_header_pointer.
> > I wish we could use bpf_dynptr_slice only and remove data/data_end,
> > but we cannot, since it's uapi.
> > But we shouldn't repeat the same mistake. If we do bpf_dynptr_data()
> > that returns linear part people will be hitting the same fragility and
> > difficult to debug bugs.
> > bpf_dynptr_data() for XDP is ok-ish, since most of XDP is still
> > page-per-packet, but patches to split headers in HW are starting to appear.
> > So even for XDP data/data_end concept may bite us.
> > Hence my preference is to EINVAL in bpf_dynptr_data() at least for skb,
> > since bpf_dynptr_slice() is a strictly better alternative.
>
> This makes sense to me, I will have the next version of this patchset
> return -EINVAL if bpf_dynptr_data() is used on a skb or xdp dynptr

+1, sounds reasonable to me as well
diff mbox series

Patch

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 14a0264fac57..1ac061b64582 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -575,11 +575,14 @@  enum bpf_type_flag {
 	/* MEM is tagged with rcu and memory access needs rcu_read_lock protection. */
 	MEM_RCU			= BIT(13 + BPF_BASE_TYPE_BITS),
 
+	/* DYNPTR points to sk_buff */
+	DYNPTR_TYPE_SKB		= BIT(14 + BPF_BASE_TYPE_BITS),
+
 	__BPF_TYPE_FLAG_MAX,
 	__BPF_TYPE_LAST_FLAG	= __BPF_TYPE_FLAG_MAX - 1,
 };
 
-#define DYNPTR_TYPE_FLAG_MASK	(DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF)
+#define DYNPTR_TYPE_FLAG_MASK	(DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF | DYNPTR_TYPE_SKB)
 
 /* Max number of base types. */
 #define BPF_BASE_TYPE_LIMIT	(1UL << BPF_BASE_TYPE_BITS)
@@ -1082,6 +1085,35 @@  static __always_inline __nocfi unsigned int bpf_dispatcher_nop_func(
 	return bpf_func(ctx, insnsi);
 }
 
+/* the implementation of the opaque uapi struct bpf_dynptr */
+struct bpf_dynptr_kern {
+	void *data;
+	/* Size represents the number of usable bytes of dynptr data.
+	 * If for example the offset is at 4 for a local dynptr whose data is
+	 * of type u64, the number of usable bytes is 4.
+	 *
+	 * The upper 8 bits are reserved. It is as follows:
+	 * Bits 0 - 23 = size
+	 * Bits 24 - 30 = dynptr type
+	 * Bit 31 = whether dynptr is read-only
+	 */
+	u32 size;
+	u32 offset;
+} __aligned(8);
+
+enum bpf_dynptr_type {
+	BPF_DYNPTR_TYPE_INVALID,
+	/* Points to memory that is local to the bpf program */
+	BPF_DYNPTR_TYPE_LOCAL,
+	/* Underlying data is a ringbuf record */
+	BPF_DYNPTR_TYPE_RINGBUF,
+	/* Underlying data is a sk_buff */
+	BPF_DYNPTR_TYPE_SKB,
+};
+
+int bpf_dynptr_check_size(u32 size);
+u32 bpf_dynptr_get_size(const struct bpf_dynptr_kern *ptr);
+
 #ifdef CONFIG_BPF_JIT
 int bpf_trampoline_link_prog(struct bpf_tramp_link *link, struct bpf_trampoline *tr);
 int bpf_trampoline_unlink_prog(struct bpf_tramp_link *link, struct bpf_trampoline *tr);
@@ -2216,6 +2248,11 @@  static inline bool has_current_bpf_ctx(void)
 }
 
 void notrace bpf_prog_inc_misses_counter(struct bpf_prog *prog);
+
+void bpf_dynptr_init(struct bpf_dynptr_kern *ptr, void *data,
+		     enum bpf_dynptr_type type, u32 offset, u32 size);
+void bpf_dynptr_set_null(struct bpf_dynptr_kern *ptr);
+void bpf_dynptr_set_rdonly(struct bpf_dynptr_kern *ptr);
 #else /* !CONFIG_BPF_SYSCALL */
 static inline struct bpf_prog *bpf_prog_get(u32 ufd)
 {
@@ -2445,6 +2482,19 @@  static inline void bpf_prog_inc_misses_counter(struct bpf_prog *prog)
 static inline void bpf_cgrp_storage_free(struct cgroup *cgroup)
 {
 }
+
+static inline void bpf_dynptr_init(struct bpf_dynptr_kern *ptr, void *data,
+				   enum bpf_dynptr_type type, u32 offset, u32 size)
+{
+}
+
+static inline void bpf_dynptr_set_null(struct bpf_dynptr_kern *ptr)
+{
+}
+
+static inline void bpf_dynptr_set_rdonly(struct bpf_dynptr_kern *ptr)
+{
+}
 #endif /* CONFIG_BPF_SYSCALL */
 
 void __bpf_free_used_btfs(struct bpf_prog_aux *aux,
@@ -2863,36 +2913,6 @@  int bpf_bprintf_prepare(char *fmt, u32 fmt_size, const u64 *raw_args,
 			u32 num_args, struct bpf_bprintf_data *data);
 void bpf_bprintf_cleanup(struct bpf_bprintf_data *data);
 
-/* the implementation of the opaque uapi struct bpf_dynptr */
-struct bpf_dynptr_kern {
-	void *data;
-	/* Size represents the number of usable bytes of dynptr data.
-	 * If for example the offset is at 4 for a local dynptr whose data is
-	 * of type u64, the number of usable bytes is 4.
-	 *
-	 * The upper 8 bits are reserved. It is as follows:
-	 * Bits 0 - 23 = size
-	 * Bits 24 - 30 = dynptr type
-	 * Bit 31 = whether dynptr is read-only
-	 */
-	u32 size;
-	u32 offset;
-} __aligned(8);
-
-enum bpf_dynptr_type {
-	BPF_DYNPTR_TYPE_INVALID,
-	/* Points to memory that is local to the bpf program */
-	BPF_DYNPTR_TYPE_LOCAL,
-	/* Underlying data is a kernel-produced ringbuf record */
-	BPF_DYNPTR_TYPE_RINGBUF,
-};
-
-void bpf_dynptr_init(struct bpf_dynptr_kern *ptr, void *data,
-		     enum bpf_dynptr_type type, u32 offset, u32 size);
-void bpf_dynptr_set_null(struct bpf_dynptr_kern *ptr);
-int bpf_dynptr_check_size(u32 size);
-u32 bpf_dynptr_get_size(const struct bpf_dynptr_kern *ptr);
-
 #ifdef CONFIG_BPF_LSM
 void bpf_cgroup_atype_get(u32 attach_btf_id, int cgroup_atype);
 void bpf_cgroup_atype_put(int cgroup_atype);
diff --git a/include/linux/filter.h b/include/linux/filter.h
index ccc4a4a58c72..c87d13954d89 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1541,4 +1541,22 @@  static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u64 index
 	return XDP_REDIRECT;
 }
 
+#ifdef CONFIG_NET
+int __bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void *to, u32 len);
+int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from,
+			  u32 len, u64 flags);
+#else /* CONFIG_NET */
+static inline int __bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset,
+				       void *to, u32 len)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset,
+					const void *from, u32 len, u64 flags)
+{
+	return -EOPNOTSUPP;
+}
+#endif /* CONFIG_NET */
+
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ba0f0cfb5e42..f6910392d339 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5320,22 +5320,45 @@  union bpf_attr {
  *	Description
  *		Write *len* bytes from *src* into *dst*, starting from *offset*
  *		into *dst*.
- *		*flags* is currently unused.
+ *
+ *		*flags* must be 0 except for skb-type dynptrs.
+ *
+ *		For skb-type dynptrs:
+ *		    *  All data slices of the dynptr are automatically
+ *		       invalidated after **bpf_dynptr_write**\ (). If you wish to
+ *		       avoid this, please perform the write using direct data slices
+ *		       instead.
+ *
+ *		    *  For *flags*, please see the flags accepted by
+ *		       **bpf_skb_store_bytes**\ ().
  *	Return
  *		0 on success, -E2BIG if *offset* + *len* exceeds the length
  *		of *dst*'s data, -EINVAL if *dst* is an invalid dynptr or if *dst*
- *		is a read-only dynptr or if *flags* is not 0.
+ *		is a read-only dynptr or if *flags* is not correct. For skb-type dynptrs,
+ *		other errors correspond to errors returned by **bpf_skb_store_bytes**\ ().
  *
  * void *bpf_dynptr_data(const struct bpf_dynptr *ptr, u32 offset, u32 len)
  *	Description
  *		Get a pointer to the underlying dynptr data.
  *
  *		*len* must be a statically known value. The returned data slice
- *		is invalidated whenever the dynptr is invalidated.
- *	Return
- *		Pointer to the underlying dynptr data, NULL if the dynptr is
- *		read-only, if the dynptr is invalid, or if the offset and length
- *		is out of bounds.
+ *		is invalidated whenever the dynptr is invalidated. Please note
+ *		that if the dynptr is read-only, then the returned data slice will
+ *		be read-only.
+ *
+ *		For skb-type dynptrs:
+ *		    * If *offset* + *len* extends into the skb's paged buffers,
+ *		      the user should manually pull the skb with **bpf_skb_pull_data**\ ()
+ *		      and try again.
+ *
+ *		    * The data slice is automatically invalidated anytime
+ *		      **bpf_dynptr_write**\ () or a helper call that changes
+ *		      the underlying packet buffer (eg **bpf_skb_pull_data**\ ())
+ *		      is called.
+ *	Return
+ *		Pointer to the underlying dynptr data, NULL if the dynptr is invalid,
+ *		or if the offset and length is out of bounds or in a paged buffer for
+ *		skb-type dynptrs.
  *
  * s64 bpf_tcp_raw_gen_syncookie_ipv4(struct iphdr *iph, struct tcphdr *th, u32 th_len)
  *	Description
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index b4da17688c65..35d0780f2eb9 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -207,6 +207,11 @@  enum btf_kfunc_hook {
 	BTF_KFUNC_HOOK_TRACING,
 	BTF_KFUNC_HOOK_SYSCALL,
 	BTF_KFUNC_HOOK_FMODRET,
+	BTF_KFUNC_HOOK_CGROUP_SKB,
+	BTF_KFUNC_HOOK_SCHED_ACT,
+	BTF_KFUNC_HOOK_SK_SKB,
+	BTF_KFUNC_HOOK_SOCKET_FILTER,
+	BTF_KFUNC_HOOK_LWT,
 	BTF_KFUNC_HOOK_MAX,
 };
 
@@ -7609,6 +7614,19 @@  static int bpf_prog_type_to_kfunc_hook(enum bpf_prog_type prog_type)
 		return BTF_KFUNC_HOOK_TRACING;
 	case BPF_PROG_TYPE_SYSCALL:
 		return BTF_KFUNC_HOOK_SYSCALL;
+	case BPF_PROG_TYPE_CGROUP_SKB:
+		return BTF_KFUNC_HOOK_CGROUP_SKB;
+	case BPF_PROG_TYPE_SCHED_ACT:
+		return BTF_KFUNC_HOOK_SCHED_ACT;
+	case BPF_PROG_TYPE_SK_SKB:
+		return BTF_KFUNC_HOOK_SK_SKB;
+	case BPF_PROG_TYPE_SOCKET_FILTER:
+		return BTF_KFUNC_HOOK_SOCKET_FILTER;
+	case BPF_PROG_TYPE_LWT_OUT:
+	case BPF_PROG_TYPE_LWT_IN:
+	case BPF_PROG_TYPE_LWT_XMIT:
+	case BPF_PROG_TYPE_LWT_SEG6LOCAL:
+		return BTF_KFUNC_HOOK_LWT;
 	default:
 		return BTF_KFUNC_HOOK_MAX;
 	}
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 458db2db2f81..a79d522b3a26 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1420,11 +1420,21 @@  static bool bpf_dynptr_is_rdonly(const struct bpf_dynptr_kern *ptr)
 	return ptr->size & DYNPTR_RDONLY_BIT;
 }
 
+void bpf_dynptr_set_rdonly(struct bpf_dynptr_kern *ptr)
+{
+	ptr->size |= DYNPTR_RDONLY_BIT;
+}
+
 static void bpf_dynptr_set_type(struct bpf_dynptr_kern *ptr, enum bpf_dynptr_type type)
 {
 	ptr->size |= type << DYNPTR_TYPE_SHIFT;
 }
 
+static enum bpf_dynptr_type bpf_dynptr_get_type(const struct bpf_dynptr_kern *ptr)
+{
+	return (ptr->size & ~(DYNPTR_RDONLY_BIT)) >> DYNPTR_TYPE_SHIFT;
+}
+
 u32 bpf_dynptr_get_size(const struct bpf_dynptr_kern *ptr)
 {
 	return ptr->size & DYNPTR_SIZE_MASK;
@@ -1497,6 +1507,7 @@  static const struct bpf_func_proto bpf_dynptr_from_mem_proto = {
 BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len, const struct bpf_dynptr_kern *, src,
 	   u32, offset, u64, flags)
 {
+	enum bpf_dynptr_type type;
 	int err;
 
 	if (!src->data || flags)
@@ -1506,13 +1517,23 @@  BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len, const struct bpf_dynptr_kern
 	if (err)
 		return err;
 
-	/* Source and destination may possibly overlap, hence use memmove to
-	 * copy the data. E.g. bpf_dynptr_from_mem may create two dynptr
-	 * pointing to overlapping PTR_TO_MAP_VALUE regions.
-	 */
-	memmove(dst, src->data + src->offset + offset, len);
+	type = bpf_dynptr_get_type(src);
 
-	return 0;
+	switch (type) {
+	case BPF_DYNPTR_TYPE_LOCAL:
+	case BPF_DYNPTR_TYPE_RINGBUF:
+		/* Source and destination may possibly overlap, hence use memmove to
+		 * copy the data. E.g. bpf_dynptr_from_mem may create two dynptr
+		 * pointing to overlapping PTR_TO_MAP_VALUE regions.
+		 */
+		memmove(dst, src->data + src->offset + offset, len);
+		return 0;
+	case BPF_DYNPTR_TYPE_SKB:
+		return __bpf_skb_load_bytes(src->data, src->offset + offset, dst, len);
+	default:
+		WARN_ONCE(true, "bpf_dynptr_read: unknown dynptr type %d\n", type);
+		return -EFAULT;
+	}
 }
 
 static const struct bpf_func_proto bpf_dynptr_read_proto = {
@@ -1529,22 +1550,36 @@  static const struct bpf_func_proto bpf_dynptr_read_proto = {
 BPF_CALL_5(bpf_dynptr_write, const struct bpf_dynptr_kern *, dst, u32, offset, void *, src,
 	   u32, len, u64, flags)
 {
+	enum bpf_dynptr_type type;
 	int err;
 
-	if (!dst->data || flags || bpf_dynptr_is_rdonly(dst))
+	if (!dst->data || bpf_dynptr_is_rdonly(dst))
 		return -EINVAL;
 
 	err = bpf_dynptr_check_off_len(dst, offset, len);
 	if (err)
 		return err;
 
-	/* Source and destination may possibly overlap, hence use memmove to
-	 * copy the data. E.g. bpf_dynptr_from_mem may create two dynptr
-	 * pointing to overlapping PTR_TO_MAP_VALUE regions.
-	 */
-	memmove(dst->data + dst->offset + offset, src, len);
+	type = bpf_dynptr_get_type(dst);
 
-	return 0;
+	switch (type) {
+	case BPF_DYNPTR_TYPE_LOCAL:
+	case BPF_DYNPTR_TYPE_RINGBUF:
+		if (flags)
+			return -EINVAL;
+		/* Source and destination may possibly overlap, hence use memmove to
+		 * copy the data. E.g. bpf_dynptr_from_mem may create two dynptr
+		 * pointing to overlapping PTR_TO_MAP_VALUE regions.
+		 */
+		memmove(dst->data + dst->offset + offset, src, len);
+		return 0;
+	case BPF_DYNPTR_TYPE_SKB:
+		return __bpf_skb_store_bytes(dst->data, dst->offset + offset, src, len,
+					     flags);
+	default:
+		WARN_ONCE(true, "bpf_dynptr_write: unknown dynptr type %d\n", type);
+		return -EFAULT;
+	}
 }
 
 static const struct bpf_func_proto bpf_dynptr_write_proto = {
@@ -1560,6 +1595,8 @@  static const struct bpf_func_proto bpf_dynptr_write_proto = {
 
 BPF_CALL_3(bpf_dynptr_data, const struct bpf_dynptr_kern *, ptr, u32, offset, u32, len)
 {
+	enum bpf_dynptr_type type;
+	void *data;
 	int err;
 
 	if (!ptr->data)
@@ -1569,10 +1606,36 @@  BPF_CALL_3(bpf_dynptr_data, const struct bpf_dynptr_kern *, ptr, u32, offset, u3
 	if (err)
 		return 0;
 
-	if (bpf_dynptr_is_rdonly(ptr))
-		return 0;
+	type = bpf_dynptr_get_type(ptr);
+
+	switch (type) {
+	case BPF_DYNPTR_TYPE_LOCAL:
+	case BPF_DYNPTR_TYPE_RINGBUF:
+		if (bpf_dynptr_is_rdonly(ptr))
+			return 0;
+
+		data = ptr->data;
+		break;
+	case BPF_DYNPTR_TYPE_SKB:
+	{
+		struct sk_buff *skb = ptr->data;
 
-	return (unsigned long)(ptr->data + ptr->offset + offset);
+		/* if the data is paged, the caller needs to pull it first */
+		if (ptr->offset + offset + len > skb_headlen(skb))
+			return 0;
+
+		/* Depending on the prog type, the data slice will be either
+		 * read-writable or read-only. The verifier will enforce that
+		 * any writes to read-only data slices are rejected
+		 */
+		data = skb->data;
+		break;
+	}
+	default:
+		WARN_ONCE(true, "bpf_dynptr_data: unknown dynptr type %d\n", type);
+		return 0;
+	}
+	return (unsigned long)(data + ptr->offset + offset);
 }
 
 static const struct bpf_func_proto bpf_dynptr_data_proto = {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 853ab671be0b..3b022abc34e3 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -741,6 +741,8 @@  static enum bpf_dynptr_type arg_to_dynptr_type(enum bpf_arg_type arg_type)
 		return BPF_DYNPTR_TYPE_LOCAL;
 	case DYNPTR_TYPE_RINGBUF:
 		return BPF_DYNPTR_TYPE_RINGBUF;
+	case DYNPTR_TYPE_SKB:
+		return BPF_DYNPTR_TYPE_SKB;
 	default:
 		return BPF_DYNPTR_TYPE_INVALID;
 	}
@@ -1625,6 +1627,12 @@  static bool reg_is_pkt_pointer_any(const struct bpf_reg_state *reg)
 	       reg->type == PTR_TO_PACKET_END;
 }
 
+static bool reg_is_dynptr_slice_pkt(const struct bpf_reg_state *reg)
+{
+	return base_type(reg->type) == PTR_TO_MEM &&
+		reg->type & DYNPTR_TYPE_SKB;
+}
+
 /* Unmodified PTR_TO_PACKET[_META,_END] register from ctx access. */
 static bool reg_is_init_pkt_pointer(const struct bpf_reg_state *reg,
 				    enum bpf_reg_type which)
@@ -6148,7 +6156,7 @@  static int process_kptr_func(struct bpf_verifier_env *env, int regno,
  * type, and declare it as 'const struct bpf_dynptr *' in their prototype.
  */
 int process_dynptr_func(struct bpf_verifier_env *env, int regno, int insn_idx,
-			enum bpf_arg_type arg_type)
+			enum bpf_arg_type arg_type, int func_id)
 {
 	struct bpf_reg_state *regs = cur_regs(env), *reg = &regs[regno];
 	int err;
@@ -6233,6 +6241,9 @@  int process_dynptr_func(struct bpf_verifier_env *env, int regno, int insn_idx,
 			case DYNPTR_TYPE_RINGBUF:
 				err_extra = "ringbuf";
 				break;
+			case DYNPTR_TYPE_SKB:
+				err_extra = "skb ";
+				break;
 			default:
 				err_extra = "<unknown>";
 				break;
@@ -6581,6 +6592,28 @@  int check_func_arg_reg_off(struct bpf_verifier_env *env,
 	}
 }
 
+static struct bpf_reg_state *get_dynptr_arg_reg(struct bpf_verifier_env *env,
+						const struct bpf_func_proto *fn,
+						struct bpf_reg_state *regs)
+{
+	struct bpf_reg_state *state = NULL;
+	int i;
+
+	for (i = 0; i < MAX_BPF_FUNC_REG_ARGS; i++)
+		if (arg_type_is_dynptr(fn->arg_type[i])) {
+			if (state) {
+				verbose(env, "verifier internal error: multiple dynptr args\n");
+				return NULL;
+			}
+			state = &regs[BPF_REG_1 + i];
+		}
+
+	if (!state)
+		verbose(env, "verifier internal error: no dynptr arg found\n");
+
+	return state;
+}
+
 static int dynptr_id(struct bpf_verifier_env *env, struct bpf_reg_state *reg)
 {
 	struct bpf_func_state *state = func(env, reg);
@@ -6607,6 +6640,24 @@  static int dynptr_ref_obj_id(struct bpf_verifier_env *env, struct bpf_reg_state
 	return state->stack[spi].spilled_ptr.ref_obj_id;
 }
 
+static enum bpf_dynptr_type dynptr_get_type(struct bpf_verifier_env *env,
+					    struct bpf_reg_state *reg)
+{
+	struct bpf_func_state *state = func(env, reg);
+	int spi;
+
+	if (reg->type == CONST_PTR_TO_DYNPTR)
+		return reg->dynptr.type;
+
+	spi = __get_spi(reg->off);
+	if (spi < 0) {
+		verbose(env, "verifier internal error: invalid spi when querying dynptr type\n");
+		return BPF_DYNPTR_TYPE_INVALID;
+	}
+
+	return state->stack[spi].spilled_ptr.dynptr.type;
+}
+
 static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
 			  struct bpf_call_arg_meta *meta,
 			  const struct bpf_func_proto *fn,
@@ -6819,7 +6870,7 @@  static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
 		err = check_mem_size_reg(env, reg, regno, true, meta);
 		break;
 	case ARG_PTR_TO_DYNPTR:
-		err = process_dynptr_func(env, regno, insn_idx, arg_type);
+		err = process_dynptr_func(env, regno, insn_idx, arg_type, meta->func_id);
 		if (err)
 			return err;
 		break;
@@ -7267,6 +7318,9 @@  static int check_func_proto(const struct bpf_func_proto *fn, int func_id)
 
 /* Packet data might have moved, any old PTR_TO_PACKET[_META,_END]
  * are now invalid, so turn them into unknown SCALAR_VALUE.
+ *
+ * This also applies to dynptr slices belonging to skb dynptrs,
+ * since these slices point to packet data.
  */
 static void clear_all_pkt_pointers(struct bpf_verifier_env *env)
 {
@@ -7274,7 +7328,7 @@  static void clear_all_pkt_pointers(struct bpf_verifier_env *env)
 	struct bpf_reg_state *reg;
 
 	bpf_for_each_reg_in_vstate(env->cur_state, state, reg, ({
-		if (reg_is_pkt_pointer_any(reg))
+		if (reg_is_pkt_pointer_any(reg) || reg_is_dynptr_slice_pkt(reg))
 			__mark_reg_unknown(env, reg);
 	}));
 }
@@ -7958,6 +8012,7 @@  static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 			     int *insn_idx_p)
 {
 	enum bpf_prog_type prog_type = resolve_prog_type(env->prog);
+	enum bpf_dynptr_type dynptr_type = BPF_DYNPTR_TYPE_INVALID;
 	const struct bpf_func_proto *fn = NULL;
 	enum bpf_return_type ret_type;
 	enum bpf_type_flag ret_flag;
@@ -8140,43 +8195,61 @@  static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 		}
 		break;
 	case BPF_FUNC_dynptr_data:
-		for (i = 0; i < MAX_BPF_FUNC_REG_ARGS; i++) {
-			if (arg_type_is_dynptr(fn->arg_type[i])) {
-				struct bpf_reg_state *reg = &regs[BPF_REG_1 + i];
-				int id, ref_obj_id;
-
-				if (meta.dynptr_id) {
-					verbose(env, "verifier internal error: meta.dynptr_id already set\n");
-					return -EFAULT;
-				}
+	{
+		struct bpf_reg_state *reg;
+		int id, ref_obj_id;
 
-				if (meta.ref_obj_id) {
-					verbose(env, "verifier internal error: meta.ref_obj_id already set\n");
-					return -EFAULT;
-				}
+		reg = get_dynptr_arg_reg(env, fn, regs);
+		if (!reg)
+			return -EFAULT;
 
-				id = dynptr_id(env, reg);
-				if (id < 0) {
-					verbose(env, "verifier internal error: failed to obtain dynptr id\n");
-					return id;
-				}
+		if (meta.dynptr_id) {
+			verbose(env, "verifier internal error: meta.dynptr_id already set\n");
+			return -EFAULT;
+		}
+		if (meta.ref_obj_id) {
+			verbose(env, "verifier internal error: meta.ref_obj_id already set\n");
+			return -EFAULT;
+		}
 
-				ref_obj_id = dynptr_ref_obj_id(env, reg);
-				if (ref_obj_id < 0) {
-					verbose(env, "verifier internal error: failed to obtain dynptr ref_obj_id\n");
-					return ref_obj_id;
-				}
+		id = dynptr_id(env, reg);
+		if (id < 0) {
+			verbose(env, "verifier internal error: failed to obtain dynptr id\n");
+			return id;
+		}
 
-				meta.dynptr_id = id;
-				meta.ref_obj_id = ref_obj_id;
-				break;
-			}
+		ref_obj_id = dynptr_ref_obj_id(env, reg);
+		if (ref_obj_id < 0) {
+			verbose(env, "verifier internal error: failed to obtain dynptr ref_obj_id\n");
+			return ref_obj_id;
 		}
-		if (i == MAX_BPF_FUNC_REG_ARGS) {
-			verbose(env, "verifier internal error: no dynptr in bpf_dynptr_data()\n");
+
+		meta.dynptr_id = id;
+		meta.ref_obj_id = ref_obj_id;
+
+		dynptr_type = dynptr_get_type(env, reg);
+		if (dynptr_type == BPF_DYNPTR_TYPE_INVALID)
 			return -EFAULT;
-		}
+
 		break;
+	}
+	case BPF_FUNC_dynptr_write:
+	{
+		struct bpf_reg_state *reg;
+
+		reg = get_dynptr_arg_reg(env, fn, regs);
+		if (!reg)
+			return -EFAULT;
+
+		dynptr_type = dynptr_get_type(env, reg);
+		if (dynptr_type == BPF_DYNPTR_TYPE_INVALID)
+			return -EFAULT;
+
+		if (dynptr_type == BPF_DYNPTR_TYPE_SKB)
+			changes_data = true;
+
+		break;
+	}
 	case BPF_FUNC_user_ringbuf_drain:
 		err = __check_func_call(env, insn, insn_idx_p, meta.subprogno,
 					set_user_ringbuf_callback_state);
@@ -8243,6 +8316,28 @@  static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 		mark_reg_known_zero(env, regs, BPF_REG_0);
 		regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
 		regs[BPF_REG_0].mem_size = meta.mem_size;
+		if (func_id == BPF_FUNC_dynptr_data &&
+		    dynptr_type == BPF_DYNPTR_TYPE_SKB) {
+			bool seen_direct_write = env->seen_direct_write;
+
+			regs[BPF_REG_0].type |= DYNPTR_TYPE_SKB;
+			if (!may_access_direct_pkt_data(env, NULL, BPF_WRITE))
+				regs[BPF_REG_0].type |= MEM_RDONLY;
+			else
+				/*
+				 * Calling may_access_direct_pkt_data() will set
+				 * env->seen_direct_write to true if the skb is
+				 * writable. As an optimization, we can ignore
+				 * setting env->seen_direct_write.
+				 *
+				 * env->seen_direct_write is used by skb
+				 * programs to determine whether the skb's page
+				 * buffers should be cloned. Since data slice
+				 * writes would only be to the head, we can skip
+				 * this.
+				 */
+				env->seen_direct_write = seen_direct_write;
+		}
 		break;
 	case RET_PTR_TO_MEM_OR_BTF_ID:
 	{
@@ -8649,6 +8744,7 @@  enum special_kfunc_type {
 	KF_bpf_list_pop_back,
 	KF_bpf_cast_to_kern_ctx,
 	KF_bpf_rdonly_cast,
+	KF_bpf_dynptr_from_skb,
 	KF_bpf_rcu_read_lock,
 	KF_bpf_rcu_read_unlock,
 };
@@ -8662,6 +8758,7 @@  BTF_ID(func, bpf_list_pop_front)
 BTF_ID(func, bpf_list_pop_back)
 BTF_ID(func, bpf_cast_to_kern_ctx)
 BTF_ID(func, bpf_rdonly_cast)
+BTF_ID(func, bpf_dynptr_from_skb)
 BTF_SET_END(special_kfunc_set)
 
 BTF_ID_LIST(special_kfunc_list)
@@ -8673,6 +8770,7 @@  BTF_ID(func, bpf_list_pop_front)
 BTF_ID(func, bpf_list_pop_back)
 BTF_ID(func, bpf_cast_to_kern_ctx)
 BTF_ID(func, bpf_rdonly_cast)
+BTF_ID(func, bpf_dynptr_from_skb)
 BTF_ID(func, bpf_rcu_read_lock)
 BTF_ID(func, bpf_rcu_read_unlock)
 
@@ -9263,17 +9361,26 @@  static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
 				return ret;
 			break;
 		case KF_ARG_PTR_TO_DYNPTR:
+		{
+			enum bpf_arg_type dynptr_arg_type = ARG_PTR_TO_DYNPTR;
+
 			if (reg->type != PTR_TO_STACK &&
 			    reg->type != CONST_PTR_TO_DYNPTR) {
 				verbose(env, "arg#%d expected pointer to stack or dynptr_ptr\n", i);
 				return -EINVAL;
 			}
 
-			ret = process_dynptr_func(env, regno, insn_idx,
-						  ARG_PTR_TO_DYNPTR | MEM_RDONLY);
+			if (meta->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb])
+				dynptr_arg_type |= MEM_UNINIT | DYNPTR_TYPE_SKB;
+			else
+				dynptr_arg_type |= MEM_RDONLY;
+
+			ret = process_dynptr_func(env, regno, insn_idx, dynptr_arg_type,
+						  meta->func_id);
 			if (ret < 0)
 				return ret;
 			break;
+		}
 		case KF_ARG_PTR_TO_LIST_HEAD:
 			if (reg->type != PTR_TO_MAP_VALUE &&
 			    reg->type != (PTR_TO_BTF_ID | MEM_ALLOC)) {
@@ -15857,6 +15964,14 @@  static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 		   desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
 		insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
 		*cnt = 1;
+	} else if (desc->func_id == special_kfunc_list[KF_bpf_dynptr_from_skb]) {
+		bool is_rdonly = !may_access_direct_pkt_data(env, NULL, BPF_WRITE);
+		struct bpf_insn addr[2] = { BPF_LD_IMM64(BPF_REG_4, is_rdonly) };
+
+		insn_buf[0] = addr[0];
+		insn_buf[1] = addr[1];
+		insn_buf[2] = *insn;
+		*cnt = 3;
 	}
 	return 0;
 }
diff --git a/net/core/filter.c b/net/core/filter.c
index 6da78b3d381e..ddb47126071a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1684,8 +1684,8 @@  static inline void bpf_pull_mac_rcsum(struct sk_buff *skb)
 		skb_postpull_rcsum(skb, skb_mac_header(skb), skb->mac_len);
 }
 
-BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *, skb, u32, offset,
-	   const void *, from, u32, len, u64, flags)
+int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from,
+			  u32 len, u64 flags)
 {
 	void *ptr;
 
@@ -1710,6 +1710,12 @@  BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *, skb, u32, offset,
 	return 0;
 }
 
+BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *, skb, u32, offset,
+	   const void *, from, u32, len, u64, flags)
+{
+	return __bpf_skb_store_bytes(skb, offset, from, len, flags);
+}
+
 static const struct bpf_func_proto bpf_skb_store_bytes_proto = {
 	.func		= bpf_skb_store_bytes,
 	.gpl_only	= false,
@@ -1721,8 +1727,7 @@  static const struct bpf_func_proto bpf_skb_store_bytes_proto = {
 	.arg5_type	= ARG_ANYTHING,
 };
 
-BPF_CALL_4(bpf_skb_load_bytes, const struct sk_buff *, skb, u32, offset,
-	   void *, to, u32, len)
+int __bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void *to, u32 len)
 {
 	void *ptr;
 
@@ -1741,6 +1746,12 @@  BPF_CALL_4(bpf_skb_load_bytes, const struct sk_buff *, skb, u32, offset,
 	return -EFAULT;
 }
 
+BPF_CALL_4(bpf_skb_load_bytes, const struct sk_buff *, skb, u32, offset,
+	   void *, to, u32, len)
+{
+	return __bpf_skb_load_bytes(skb, offset, to, len);
+}
+
 static const struct bpf_func_proto bpf_skb_load_bytes_proto = {
 	.func		= bpf_skb_load_bytes,
 	.gpl_only	= false,
@@ -1852,6 +1863,22 @@  static const struct bpf_func_proto bpf_skb_pull_data_proto = {
 	.arg2_type	= ARG_ANYTHING,
 };
 
+int bpf_dynptr_from_skb(struct sk_buff *skb, u64 flags,
+			struct bpf_dynptr_kern *ptr, int is_rdonly)
+{
+	if (flags) {
+		bpf_dynptr_set_null(ptr);
+		return -EINVAL;
+	}
+
+	bpf_dynptr_init(ptr, skb, BPF_DYNPTR_TYPE_SKB, 0, skb->len);
+
+	if (is_rdonly)
+		bpf_dynptr_set_rdonly(ptr);
+
+	return 0;
+}
+
 BPF_CALL_1(bpf_sk_fullsock, struct sock *, sk)
 {
 	return sk_fullsock(sk) ? (unsigned long)sk : (unsigned long)NULL;
@@ -11607,3 +11634,28 @@  bpf_sk_base_func_proto(enum bpf_func_id func_id)
 
 	return func;
 }
+
+BTF_SET8_START(bpf_kfunc_check_set_skb)
+BTF_ID_FLAGS(func, bpf_dynptr_from_skb)
+BTF_SET8_END(bpf_kfunc_check_set_skb)
+
+static const struct btf_kfunc_id_set bpf_kfunc_set_skb = {
+	.owner = THIS_MODULE,
+	.set = &bpf_kfunc_check_set_skb,
+};
+
+static int __init bpf_kfunc_init(void)
+{
+	int ret;
+
+	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_CLS, &bpf_kfunc_set_skb);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SCHED_ACT, &bpf_kfunc_set_skb);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SK_SKB, &bpf_kfunc_set_skb);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_SOCKET_FILTER, &bpf_kfunc_set_skb);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_CGROUP_SKB, &bpf_kfunc_set_skb);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_LWT_OUT, &bpf_kfunc_set_skb);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_LWT_IN, &bpf_kfunc_set_skb);
+	ret = ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_LWT_XMIT, &bpf_kfunc_set_skb);
+	return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_LWT_SEG6LOCAL, &bpf_kfunc_set_skb);
+}
+late_initcall(bpf_kfunc_init);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 7f024ac22edd..6b58e5a75fc5 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5320,22 +5320,45 @@  union bpf_attr {
  *	Description
  *		Write *len* bytes from *src* into *dst*, starting from *offset*
  *		into *dst*.
- *		*flags* is currently unused.
+ *
+ *		*flags* must be 0 except for skb-type dynptrs.
+ *
+ *		For skb-type dynptrs:
+ *		    *  All data slices of the dynptr are automatically
+ *		       invalidated after **bpf_dynptr_write**\ (). If you wish to
+ *		       avoid this, please perform the write using direct data slices
+ *		       instead.
+ *
+ *		    *  For *flags*, please see the flags accepted by
+ *		       **bpf_skb_store_bytes**\ ().
  *	Return
  *		0 on success, -E2BIG if *offset* + *len* exceeds the length
  *		of *dst*'s data, -EINVAL if *dst* is an invalid dynptr or if *dst*
- *		is a read-only dynptr or if *flags* is not 0.
+ *		is a read-only dynptr or if *flags* is not correct. For skb-type dynptrs,
+ *		other errors correspond to errors returned by **bpf_skb_store_bytes**\ ().
  *
  * void *bpf_dynptr_data(const struct bpf_dynptr *ptr, u32 offset, u32 len)
  *	Description
  *		Get a pointer to the underlying dynptr data.
  *
  *		*len* must be a statically known value. The returned data slice
- *		is invalidated whenever the dynptr is invalidated.
- *	Return
- *		Pointer to the underlying dynptr data, NULL if the dynptr is
- *		read-only, if the dynptr is invalid, or if the offset and length
- *		is out of bounds.
+ *		is invalidated whenever the dynptr is invalidated. Please note
+ *		that if the dynptr is read-only, then the returned data slice will
+ *		be read-only.
+ *
+ *		For skb-type dynptrs:
+ *		    * If *offset* + *len* extends into the skb's paged buffers,
+ *		      the user should manually pull the skb with **bpf_skb_pull_data**\ ()
+ *		      and try again.
+ *
+ *		    * The data slice is automatically invalidated anytime
+ *		      **bpf_dynptr_write**\ () or a helper call that changes
+ *		      the underlying packet buffer (eg **bpf_skb_pull_data**\ ())
+ *		      is called.
+ *	Return
+ *		Pointer to the underlying dynptr data, NULL if the dynptr is invalid,
+ *		or if the offset and length is out of bounds or in a paged buffer for
+ *		skb-type dynptrs.
  *
  * s64 bpf_tcp_raw_gen_syncookie_ipv4(struct iphdr *iph, struct tcphdr *th, u32 th_len)
  *	Description