[bpf-next,v4,2/3] bpf: Add xdp dynptrs

Message ID: 20220822235649.2218031-3-joannelkoong@gmail.com (mailing list archive)
State: Superseded
Delegated to: BPF
Series: Add skb + xdp dynptrs

Checks

Context Check Description
bpf/vmtest-bpf-next-PR fail PR summary
bpf/vmtest-bpf-next-VM_Test-4 success Logs for llvm-toolchain
bpf/vmtest-bpf-next-VM_Test-5 success Logs for set-matrix
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 105798 this patch: 105798
netdev/cc_maintainers warning 12 maintainers not CCed: john.fastabend@gmail.com song@kernel.org sdf@google.com hawk@kernel.org martin.lau@linux.dev davem@davemloft.net edumazet@google.com kpsingh@kernel.org jolsa@kernel.org pabeni@redhat.com haoluo@google.com yhs@fb.com
netdev/build_clang success Errors and warnings before: 173 this patch: 173
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 107672 this patch: 107672
netdev/checkpatch warning WARNING: line length of 81 exceeds 80 columns WARNING: line length of 82 exceeds 80 columns WARNING: line length of 86 exceeds 80 columns WARNING: line length of 87 exceeds 80 columns WARNING: line length of 88 exceeds 80 columns WARNING: line length of 92 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-1 success Logs for Kernel LATEST on ubuntu-latest with gcc
bpf/vmtest-bpf-next-VM_Test-2 success Logs for Kernel LATEST on ubuntu-latest with llvm-16
bpf/vmtest-bpf-next-VM_Test-3 fail Logs for Kernel LATEST on z15 with gcc

Commit Message

Joanne Koong Aug. 22, 2022, 11:56 p.m. UTC
Add xdp dynptrs, which are dynptrs whose underlying pointer points
to an xdp_buff. The dynptr acts on xdp data. xdp dynptrs have two main
benefits. One is that they allow operations on sizes that are not
statically known at compile-time (e.g. variable-sized accesses).
Another is that parsing packet data through dynptrs (instead of
through direct access of xdp->data and xdp->data_end) can be more
ergonomic and less brittle (e.g. no manual bounds checking against
data_end is needed).
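
A rough sketch of the ergonomics (a hypothetical program, not taken
from the selftests; assumes only the helpers added in this series):

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int parse_eth(struct xdp_md *ctx)
{
	struct bpf_dynptr ptr;
	struct ethhdr *eth;

	if (bpf_dynptr_from_xdp(ctx, 0, &ptr))
		return XDP_DROP;

	/* no manual xdp->data / xdp->data_end bounds check needed,
	 * only a NULL check on the returned slice
	 */
	eth = bpf_dynptr_data(&ptr, 0, sizeof(*eth));
	if (!eth)
		return XDP_DROP;

	return eth->h_proto == bpf_htons(ETH_P_IP) ? XDP_PASS : XDP_DROP;
}

char _license[] SEC("license") = "GPL";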

Reads and writes on the dynptr can access data in fragments and can
also cross fragment boundaries. For data slices, direct access to
data within a single fragment is permitted, but access across fragments
is not. The returned data slice is reg type PTR_TO_PACKET | PTR_MAYBE_NULL.
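
Concretely (sketch, with 'ptr' an xdp dynptr and 'off' a packet offset):

	__u8 buf[16], *data;

	/* read/write work even if [off, off + 16) crosses a frag
	 * boundary
	 */
	bpf_dynptr_read(buf, sizeof(buf), &ptr, off, 0);
	bpf_dynptr_write(&ptr, off, buf, sizeof(buf), 0);

	/* the slice is NULL if [off, off + 16) spans fragments */
	data = bpf_dynptr_data(&ptr, off, 16);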

Any helper call that changes the underlying packet buffer (e.g.
bpf_xdp_adjust_head) invalidates any data slices of the associated
dynptr. Whenever such a helper call is made, the verifier marks any
PTR_TO_PACKET reg type (which includes xdp dynptr slices since they are
PTR_TO_PACKETs) as unknown. The call chain for this is
check_helper_call() -> clear_all_pkt_pointers() ->
__clear_all_pkt_pointers() -> mark_reg_unknown()
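
For example, the verifier rejects a sequence like this (sketch):

	__u8 *p = bpf_dynptr_data(&ptr, 0, 8);

	if (!p)
		return XDP_DROP;
	bpf_xdp_adjust_head(ctx, 8);	/* changes the packet buffer */
	return p[0];			/* rejected: p was invalidated */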

For examples of how xdp dynptrs can be used, please see the attached
selftests.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
 include/linux/bpf.h            |  6 ++++-
 include/linux/filter.h         |  3 +++
 include/uapi/linux/bpf.h       | 25 +++++++++++++++---
 kernel/bpf/helpers.c           | 14 ++++++++++-
 kernel/bpf/verifier.c          |  8 +++++-
 net/core/filter.c              | 46 +++++++++++++++++++++++++++++-----
 tools/include/uapi/linux/bpf.h | 25 +++++++++++++++---
 7 files changed, 112 insertions(+), 15 deletions(-)

Comments

Kumar Kartikeya Dwivedi Aug. 23, 2022, 2:30 a.m. UTC | #1
+Cc XDP folks

On Tue, 23 Aug 2022 at 02:12, Joanne Koong <joannelkoong@gmail.com> wrote:
>
> Add xdp dynptrs, which are dynptrs whose underlying pointer points
> to an xdp_buff. The dynptr acts on xdp data. xdp dynptrs have two main
> benefits. One is that they allow operations on sizes that are not
> statically known at compile-time (e.g. variable-sized accesses).
> Another is that parsing packet data through dynptrs (instead of
> through direct access of xdp->data and xdp->data_end) can be more
> ergonomic and less brittle (e.g. no manual bounds checking against
> data_end is needed).
>
> Reads and writes on the dynptr can access data in fragments and can
> also cross fragment boundaries. For data slices, direct access to

It's a bit awkward to have such a difference between xdp and skb
dynptr's read/write. I understand why it is the way it is, but it
still doesn't feel right. I'm not sure if we can reconcile the
differences, but it makes writing common code for both xdp and tc
harder as it needs to be aware of the differences (and then the flags
for dynptr_write would differ too). So we're 90% there but not the
whole way...

> data within a single fragment is permitted, but access across fragments
> is not. The returned data slice is reg type PTR_TO_PACKET | PTR_MAYBE_NULL.
>
> Any helper call that changes the underlying packet buffer (e.g.
> bpf_xdp_adjust_head) invalidates any data slices of the associated
> dynptr. Whenever such a helper call is made, the verifier marks any
> PTR_TO_PACKET reg type (which includes xdp dynptr slices since they are
> PTR_TO_PACKETs) as unknown. The call chain for this is
> check_helper_call() -> clear_all_pkt_pointers() ->
> __clear_all_pkt_pointers() -> mark_reg_unknown()
>
> For examples of how xdp dynptrs can be used, please see the attached
> selftests.
>
> Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
> ---
>  include/linux/bpf.h            |  6 ++++-
>  include/linux/filter.h         |  3 +++
>  include/uapi/linux/bpf.h       | 25 +++++++++++++++---
>  kernel/bpf/helpers.c           | 14 ++++++++++-
>  kernel/bpf/verifier.c          |  8 +++++-
>  net/core/filter.c              | 46 +++++++++++++++++++++++++++++-----
>  tools/include/uapi/linux/bpf.h | 25 +++++++++++++++---
>  7 files changed, 112 insertions(+), 15 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 30615d1a0c13..455a215b6c57 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -410,11 +410,15 @@ enum bpf_type_flag {
>         /* DYNPTR points to sk_buff */
>         DYNPTR_TYPE_SKB         = BIT(11 + BPF_BASE_TYPE_BITS),
>
> +       /* DYNPTR points to xdp_buff */
> +       DYNPTR_TYPE_XDP         = BIT(12 + BPF_BASE_TYPE_BITS),
> +
>         __BPF_TYPE_FLAG_MAX,
>         __BPF_TYPE_LAST_FLAG    = __BPF_TYPE_FLAG_MAX - 1,
>  };
>
> -#define DYNPTR_TYPE_FLAG_MASK  (DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF | DYNPTR_TYPE_SKB)
> +#define DYNPTR_TYPE_FLAG_MASK  (DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF | DYNPTR_TYPE_SKB \
> +                                | DYNPTR_TYPE_XDP)
>
>  /* Max number of base types. */
>  #define BPF_BASE_TYPE_LIMIT    (1UL << BPF_BASE_TYPE_BITS)
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 649063d9cbfd..80f030239877 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -1535,5 +1535,8 @@ static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifind
>  int __bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void *to, u32 len);
>  int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from,
>                           u32 len, u64 flags);
> +int __bpf_xdp_load_bytes(struct xdp_buff *xdp, u32 offset, void *buf, u32 len);
> +int __bpf_xdp_store_bytes(struct xdp_buff *xdp, u32 offset, void *buf, u32 len);
> +void *bpf_xdp_pointer(struct xdp_buff *xdp, u32 offset, u32 len);
>
>  #endif /* __LINUX_FILTER_H__ */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 320e6b95d95c..9feea29eebcd 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -5283,13 +5283,18 @@ union bpf_attr {
>   *                   and try again.
>   *
>   *                 * The data slice is automatically invalidated anytime
> - *                   **bpf_dynptr_write**\ () or a helper call that changes
> - *                   the underlying packet buffer (eg **bpf_skb_pull_data**\ ())
> + *                   **bpf_dynptr_write**\ () is called.
> + *
> + *             For skb-type and xdp-type dynptrs:
> + *                 * The data slice is automatically invalidated anytime a
> + *                   helper call that changes the underlying packet buffer
> + *                   (eg **bpf_skb_pull_data**\ (), **bpf_xdp_adjust_head**\ ())
>   *                   is called.
>   *     Return
>   *             Pointer to the underlying dynptr data, NULL if the dynptr is
>   *             read-only, if the dynptr is invalid, or if the offset and length
> - *             is out of bounds or in a paged buffer for skb-type dynptrs.
> + *             is out of bounds or in a paged buffer for skb-type dynptrs or
> + *             across fragments for xdp-type dynptrs.
>   *
>   * s64 bpf_tcp_raw_gen_syncookie_ipv4(struct iphdr *iph, struct tcphdr *th, u32 th_len)
>   *     Description
> @@ -5388,6 +5393,19 @@ union bpf_attr {
>   *             *flags* is currently unused, it must be 0 for now.
>   *     Return
>   *             0 on success or -EINVAL if flags is not 0.
> + *
> + * long bpf_dynptr_from_xdp(struct xdp_buff *xdp_md, u64 flags, struct bpf_dynptr *ptr)
> + *     Description
> + *             Get a dynptr to the data in *xdp_md*. *xdp_md* must be the BPF program
> + *             context.
> + *
> + *             Calls that change the *xdp_md*'s underlying packet buffer
> + *             (eg **bpf_xdp_adjust_head**\ ()) do not invalidate the dynptr, but
> + *             they do invalidate any data slices associated with the dynptr.
> + *
> + *             *flags* is currently unused, it must be 0 for now.
> + *     Return
> + *             0 on success, -EINVAL if flags is not 0.
>   */
>  #define __BPF_FUNC_MAPPER(FN)          \
>         FN(unspec),                     \
> @@ -5600,6 +5618,7 @@ union bpf_attr {
>         FN(tcp_raw_check_syncookie_ipv6),       \
>         FN(ktime_get_tai_ns),           \
>         FN(dynptr_from_skb),            \
> +       FN(dynptr_from_xdp),            \
>         /* */
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 471a01a9b6ae..2b9dc4c6de04 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -1541,6 +1541,8 @@ BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len, struct bpf_dynptr_kern *, src
>                 return 0;
>         case BPF_DYNPTR_TYPE_SKB:
>                 return __bpf_skb_load_bytes(src->data, src->offset + offset, dst, len);
> +       case BPF_DYNPTR_TYPE_XDP:
> +               return __bpf_xdp_load_bytes(src->data, src->offset + offset, dst, len);
>         default:
>                 WARN(true, "bpf_dynptr_read: unknown dynptr type %d\n", type);
>                 return -EFAULT;
> @@ -1583,6 +1585,10 @@ BPF_CALL_5(bpf_dynptr_write, struct bpf_dynptr_kern *, dst, u32, offset, void *,
>         case BPF_DYNPTR_TYPE_SKB:
>                 return __bpf_skb_store_bytes(dst->data, dst->offset + offset, src, len,
>                                              flags);
> +       case BPF_DYNPTR_TYPE_XDP:
> +               if (flags)
> +                       return -EINVAL;
> +               return __bpf_xdp_store_bytes(dst->data, dst->offset + offset, src, len);
>         default:
>                 WARN(true, "bpf_dynptr_write: unknown dynptr type %d\n", type);
>                 return -EFAULT;
> @@ -1616,7 +1622,7 @@ BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len
>
>         type = bpf_dynptr_get_type(ptr);
>
> -       /* Only skb dynptrs can get read-only data slices, because the
> +       /* Only skb and xdp dynptrs can get read-only data slices, because the
>          * verifier enforces PTR_TO_PACKET accesses
>          */
>         is_rdonly = bpf_dynptr_is_rdonly(ptr);
> @@ -1640,6 +1646,12 @@ BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len
>                 data = skb->data;
>                 break;
>         }
> +       case BPF_DYNPTR_TYPE_XDP:
> +               /* if the requested data is across fragments, then it cannot
> +                * be accessed directly - bpf_xdp_pointer will return NULL
> +                */
> +               return (unsigned long)bpf_xdp_pointer(ptr->data,
> +                                                     ptr->offset + offset, len);
>         default:
>                 WARN(true, "bpf_dynptr_data: unknown dynptr type %d\n", type);
>                 return 0;
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 1ea295f47525..d33648eee188 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -686,6 +686,8 @@ static enum bpf_dynptr_type arg_to_dynptr_type(enum bpf_arg_type arg_type)
>                 return BPF_DYNPTR_TYPE_RINGBUF;
>         case DYNPTR_TYPE_SKB:
>                 return BPF_DYNPTR_TYPE_SKB;
> +       case DYNPTR_TYPE_XDP:
> +               return BPF_DYNPTR_TYPE_XDP;
>         default:
>                 return BPF_DYNPTR_TYPE_INVALID;
>         }
> @@ -6078,6 +6080,9 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
>                         case DYNPTR_TYPE_SKB:
>                                 err_extra = "skb ";
>                                 break;
> +                       case DYNPTR_TYPE_XDP:
> +                               err_extra = "xdp ";
> +                               break;
>                         }
>
>                         verbose(env, "Expected an initialized %sdynptr as arg #%d\n",
> @@ -7439,7 +7444,8 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
>                 mark_reg_known_zero(env, regs, BPF_REG_0);
>
>                 if (func_id == BPF_FUNC_dynptr_data &&
> -                   dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> +                   (dynptr_type == BPF_DYNPTR_TYPE_SKB ||
> +                    dynptr_type == BPF_DYNPTR_TYPE_XDP)) {
>                         regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
>                         regs[BPF_REG_0].range = meta.mem_size;

It doesn't seem like this is safe. Since PTR_TO_PACKET's range can be
modified by comparisons with packet pointers loaded from the xdp/skb
ctx, how do we distinguish e.g. between a pkt slice obtained from some
frag in a multi-buff XDP vs pkt pointer from a linear area?

Someone can compare data_meta from ctx with PTR_TO_PACKET from
bpf_dynptr_data on xdp dynptr (which might be pointing to an xdp mb
frag). While MAX_PACKET_OFF is 0xffff, it can still be used to do OOB
access for the linear area. reg_is_init_pkt_pointer will return true
as modified range is not considered for it. Same kind of issues when
doing comparison with data_end from ctx (though maybe you won't be
able to do incorrect data access at runtime using that).
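
Something like this pattern is what I mean (sketch):

	__u8 *p = bpf_dynptr_data(&ptr, off, 1); /* may be in a frag */
	void *data_end = (void *)(long)ctx->data_end;
	__u8 v;

	if (!p)
		return XDP_DROP;
	/* the comparison grows p's verified range, but p may not lie
	 * inside [data, data_end) at all -> OOB access at runtime
	 */
	if ((void *)p + 64 <= data_end)
		v = p[63];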

I had a pkt_uid field in my patch [0] which disallowed comparisons
among bpf_packet_pointer slices. Each call assigned a fresh pkt_uid,
and that disabled comparisons for them. reg->id is used for var_off
range propagation so it cannot be reused.

Coming back to this: What we really want here is a PTR_TO_MEM with a
mem_size, so maybe you should go that route instead of PTR_TO_PACKET
(and add a type tag to maybe pretty print it also as a packet pointer
in verifier log), or add some way to distinguish slice vs non-slice
pkt pointers like I did in my patch. You might also want to add some
tests for this corner case (there are some later in [0] if you want to
reuse them).

So TBH, I kinda dislike my own solution in [0] :). The complexity does
not seem worth it. The pkt_uid distinction is more useful (and
actually would be needed) in Toke's xdp queueing series, where in a
dequeue program you have multiple xdp_mds and want scoped slice
invalidations (i.e. adjust_head on one xdp_md doesn't invalidate
slices of some other xdp_md). Here we can just get away with normal
PTR_TO_MEM.

... Or just let me know if you handle this correctly already, or if
this won't be an actual problem :).

  [0]: https://lore.kernel.org/bpf/20220306234311.452206-3-memxor@gmail.com
Joanne Koong Aug. 23, 2022, 10:26 p.m. UTC | #2
On Mon, Aug 22, 2022 at 7:31 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> +Cc XDP folks
>
> On Tue, 23 Aug 2022 at 02:12, Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > [...]
>
> It's a bit awkward to have such a difference between xdp and skb
> dynptr's read/write. I understand why it is the way it is, but it
> still doesn't feel right. I'm not sure if we can reconcile the
> differences, but it makes writing common code for both xdp and tc
> harder as it needs to be aware of the differences (and then the flags
> for dynptr_write would differ too). So we're 90% there but not the
> whole way...

Yeah, it'd be great if the behavior for skb/xdp progs could be the
same, but I'm not seeing a better solution here (unless we invalidate
data slices on writes in xdp progs, just to make it match more :P).

Regarding having 2 different interfaces bpf_dynptr_from_{skb/xdp}, I'm
not convinced this is much of a problem - xdp and skb programs already
have different interfaces for doing things (eg
bpf_{skb/xdp}_{store/load}_bytes).

>
> > [...]
> >
> > @@ -7439,7 +7444,8 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
> >                 mark_reg_known_zero(env, regs, BPF_REG_0);
> >
> >                 if (func_id == BPF_FUNC_dynptr_data &&
> > -                   dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> > +                   (dynptr_type == BPF_DYNPTR_TYPE_SKB ||
> > +                    dynptr_type == BPF_DYNPTR_TYPE_XDP)) {
> >                         regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
> >                         regs[BPF_REG_0].range = meta.mem_size;
>
> It doesn't seem like this is safe. Since PTR_TO_PACKET's range can be
> modified by comparisons with packet pointers loaded from the xdp/skb
> ctx, how do we distinguish e.g. between a pkt slice obtained from some
> frag in a multi-buff XDP vs pkt pointer from a linear area?
>
> Someone can compare data_meta from ctx with PTR_TO_PACKET from
> bpf_dynptr_data on xdp dynptr (which might be pointing to an xdp mb
> frag). While MAX_PACKET_OFF is 0xffff, it can still be used to do OOB
> access for the linear area. reg_is_init_pkt_pointer will return true
> as modified range is not considered for it. Same kind of issues when
> doing comparison with data_end from ctx (though maybe you won't be
> able to do incorrect data access at runtime using that).
>
> I had a pkt_uid field in my patch [0] which disallowed comparisons
> among bpf_packet_pointer slices. Each call assigned a fresh pkt_uid,
> and that disabled comparisons for them. reg->id is used for var_off
> range propagation so it cannot be reused.
>
> Coming back to this: What we really want here is a PTR_TO_MEM with a
> mem_size, so maybe you should go that route instead of PTR_TO_PACKET
> (and add a type tag to maybe pretty print it also as a packet pointer
> in verifier log), or add some way to distinguish slice vs non-slice
> pkt pointers like I did in my patch. You might also want to add some
> tests for this corner case (there are some later in [0] if you want to
> reuse them).
>
> So TBH, I kinda dislike my own solution in [0] :). The complexity does
> not seem worth it. The pkt_uid distinction is more useful (and
> actually would be needed) in Toke's xdp queueing series, where in a
> dequeue program you have multiple xdp_mds and want scoped slice
> invalidations (i.e. adjust_head on one xdp_md doesn't invalidate
> slices of some other xdp_md). Here we can just get away with normal
> PTR_TO_MEM.
>
> ... Or just let me know if you handle this correctly already, or if
> this won't be an actual problem :).

Ooh interesting, I hadn't previously taken a look at
try_match_pkt_pointers(), thanks for mentioning it :)

The cleanest solution to me is to add the flag "DYNPTR_TYPE_{SKB/XDP}"
to PTR_TO_PACKET and change reg_is_init_pkt_pointer() to return false
if the DYNPTR_TYPE_{SKB/XDP} flag is present. I prefer this over
returning PTR_TO_MEM because it seems more robust (eg if in the future
we reject x behavior on the packet data reg types, this will
automatically apply to the data slices), and because it'll keep the
logic more efficient/simpler for the case when the pkt pointer has to
be cleared after any helper that changes pkt data is called (aka the
case where the data slice gets invalidated).

What are your thoughts?
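
Roughly (an untested sketch of the idea, on top of this patch):

static bool reg_is_init_pkt_pointer(const struct bpf_reg_state *reg,
				    enum bpf_reg_type which)
{
	/* dynptr slices carry DYNPTR_TYPE_{SKB,XDP} in reg->type,
	 * so they never feed pkt pointer range propagation
	 */
	if (reg->type & (DYNPTR_TYPE_SKB | DYNPTR_TYPE_XDP))
		return false;

	return reg->type == which &&
	       reg->id == 0 &&
	       reg->off == 0 &&
	       tnum_equals_const(reg->var_off, 0);
}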

>
>   [0]: https://lore.kernel.org/bpf/20220306234311.452206-3-memxor@gmail.com
kernel test robot Aug. 24, 2022, 1:15 a.m. UTC | #3
Hi Joanne,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Joanne-Koong/Add-skb-xdp-dynptrs/20220823-080022
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: microblaze-buildonly-randconfig-r003-20220823 (https://download.01.org/0day-ci/archive/20220824/202208240923.X2tgBh6Y-lkp@intel.com/config)
compiler: microblaze-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/6d832ed17cea3ede1cd48f885a73144d9c4d800a
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Joanne-Koong/Add-skb-xdp-dynptrs/20220823-080022
        git checkout 6d832ed17cea3ede1cd48f885a73144d9c4d800a
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=microblaze SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   microblaze-linux-ld: kernel/bpf/helpers.o: in function `____bpf_dynptr_read':
   kernel/bpf/helpers.c:1543: undefined reference to `__bpf_skb_load_bytes'
>> microblaze-linux-ld: kernel/bpf/helpers.c:1545: undefined reference to `__bpf_xdp_load_bytes'
   microblaze-linux-ld: kernel/bpf/helpers.o: in function `____bpf_dynptr_write':
   kernel/bpf/helpers.c:1586: undefined reference to `__bpf_skb_store_bytes'
>> microblaze-linux-ld: kernel/bpf/helpers.c:1591: undefined reference to `__bpf_xdp_store_bytes'
   microblaze-linux-ld: kernel/bpf/helpers.o: in function `____bpf_dynptr_data':
   kernel/bpf/helpers.c:1653: undefined reference to `bpf_xdp_pointer'
   pahole: .tmp_vmlinux.btf: Invalid argument
   .btf.vmlinux.bin.o: file not recognized: file format not recognized
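
The undefined references point at kernel/bpf/helpers.c being built in
a config where net/core/filter.c is not (e.g. no CONFIG_NET); one
plausible fix shape, sketched here as an assumption rather than what
the series necessarily ends up doing, is inline stubs in
include/linux/filter.h:

#ifndef CONFIG_NET
static inline int __bpf_xdp_load_bytes(struct xdp_buff *xdp, u32 offset,
				       void *buf, u32 len)
{
	return -EOPNOTSUPP;
}

static inline int __bpf_xdp_store_bytes(struct xdp_buff *xdp, u32 offset,
					void *buf, u32 len)
{
	return -EOPNOTSUPP;
}
#endif /* CONFIG_NET */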
Toke Høiland-Jørgensen Aug. 24, 2022, 10:39 a.m. UTC | #4
Joanne Koong <joannelkoong@gmail.com> writes:

> On Mon, Aug 22, 2022 at 7:31 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
>>
>> +Cc XDP folks
>>
>> On Tue, 23 Aug 2022 at 02:12, Joanne Koong <joannelkoong@gmail.com> wrote:
>> > [...]
>>
>> It's a bit awkward to have such a difference between xdp and skb
>> dynptr's read/write. I understand why it is the way it is, but it
>> still doesn't feel right. I'm not sure if we can reconcile the
>> differences, but it makes writing common code for both xdp and tc
>> harder as it needs to be aware of the differences (and then the flags
>> for dynptr_write would differ too). So we're 90% there but not the
>> whole way...
>
> Yeah, it'd be great if the behavior for skb/xdp progs could be the
> same, but I'm not seeing a better solution here (unless we invalidate
> data slices on writes in xdp progs, just to make it match more :P).
>
> Regarding having 2 different interfaces bpf_dynptr_from_{skb/xdp}, I'm
> not convinced this is much of a problem - xdp and skb programs already
> have different interfaces for doing things (eg
> bpf_{skb/xdp}_{store/load}_bytes).

This is true, but it's quite possible to paper over these differences
and write BPF code that works for both TC and XDP. Subtle semantic
differences in otherwise identical functions makes this harder.

Today you can write a function like:

static inline int parse_pkt(void *data, void* data_end)
{
        /* parse data */
}

And call it like:

SEC("xdp")
int parse_xdp(struct xdp_md *ctx)
{
        return parse_pkt(ctx->data, ctx->data_end);
}

SEC("tc")
int parse_tc(struct __sk_buff *skb)
{
        return parse_pkt(skb->data, skb->data_end);
}


IMO the goal should be to be able to do the equivalent for dynptrs, like:

static inline int parse_pkt(struct bpf_dynptr *ptr)
{
        __u64 *data;

        data = bpf_dynptr_data(ptr, 0, sizeof(*data));
        if (!data)
                return 0;
        /* parse data */
}

SEC("xdp")
int parse_xdp(struct xdp_md *ctx)
{
        struct bpf_dynptr ptr;

        bpf_dynptr_from_xdp(ctx, 0, &ptr);
        return parse_pkt(&ptr);
}

SEC("tc")
int parse_tc(struct __sk_buff *skb)
{
        struct bpf_dynptr ptr;

        bpf_dynptr_from_skb(skb, 0, &ptr);
        return parse_pkt(&ptr);
}


If the dynptr-based parse_pkt() function has to take special care to
figure out where the dynptr comes from, it makes it a lot more difficult
to write reusable packet parsing functions. So I'd be in favour of
restricting the dynptr interface to the lowest common denominator of the
skb and xdp interfaces even if that makes things slightly more awkward
in the specialised cases...

-Toke
Joanne Koong Aug. 24, 2022, 6:10 p.m. UTC | #5
On Wed, Aug 24, 2022 at 3:39 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Joanne Koong <joannelkoong@gmail.com> writes:
>
> > On Mon, Aug 22, 2022 at 7:31 PM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> >>
> >> +Cc XDP folks
> >>
> >> On Tue, 23 Aug 2022 at 02:12, Joanne Koong <joannelkoong@gmail.com> wrote:
> >> > [...]
> >>
> >> It's a bit awkward to have such a difference between xdp and skb
> >> dynptr's read/write. I understand why it is the way it is, but it
> >> still doesn't feel right. I'm not sure if we can reconcile the
> >> differences, but it makes writing common code for both xdp and tc
> >> harder as it needs to be aware of the differences (and then the flags
> >> for dynptr_write would differ too). So we're 90% there but not the
> >> whole way...
> >
> > Yeah, it'd be great if the behavior for skb/xdp progs could be the
> > same, but I'm not seeing a better solution here (unless we invalidate
> > data slices on writes in xdp progs, just to make it match more :P).
> >
> > Regarding having 2 different interfaces bpf_dynptr_from_{skb/xdp}, I'm
> > not convinced this is much of a problem - xdp and skb programs already
> > have different interfaces for doing things (eg
> > bpf_{skb/xdp}_{store/load}_bytes).
>
> This is true, but it's quite possible to paper over these differences
> and write BPF code that works for both TC and XDP. Subtle semantic
> differences in otherwise identical functions makes this harder.
>
> Today you can write a function like:
>
> static inline int parse_pkt(void *data, void* data_end)
> {
>         /* parse data */
> }
>
> And call it like:
>
> SEC("xdp")
> int parse_xdp(struct xdp_md *ctx)
> {
>         return parse_pkt(ctx->data, ctx->data_end);
> }
>
> SEC("tc")
> int parse_tc(struct __sk_buff *skb)
> {
>         return parse_pkt(skb->data, skb->data_end);
> }
>
>
> IMO the goal should be to be able to do the equivalent for dynptrs, like:
>
> static inline int parse_pkt(struct bpf_dynptr *ptr)
> {
>         __u64 *data;
>
>         data = bpf_dynptr_data(ptr, 0, sizeof(*data));
>         if (!data)
>                 return 0;
>         /* parse data */
> }
>
> SEC("xdp")
> int parse_xdp(struct xdp_md *ctx)
> {
>         struct bpf_dynptr ptr;
>
>         bpf_dynptr_from_xdp(ctx, 0, &ptr);
>         return parse_pkt(&ptr);
> }
>
> SEC("tc")
> int parse_tc(struct __sk_buff *skb)
> {
>         struct bpf_dynptr ptr;
>
>         bpf_dynptr_from_skb(skb, 0, &ptr);
>         return parse_pkt(&ptr);
> }
>

To clarify, this is already possible when using data slices, since the
behavior for data slices is equivalent between xdp and tc programs for
non-fragmented accesses. From looking through the selftests, I
anticipate that data slices will be the main way programs interact
with dynptrs. For the cases where the program may write into frags,
then bpf_dynptr_write will be needed (which is where functionality
between xdp and tc start differing) - today, we're not able to write
common code that writes into the frags since tc uses
bpf_skb_store_bytes and xdp uses bpf_xdp_store_bytes.
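
That is, today the frag-writing path has to be specialized per
program type, something like (sketch):

	/* tc: */
	if (bpf_skb_store_bytes(skb, off, buf, len, 0))
		return TC_ACT_SHOT;

	/* xdp: */
	if (bpf_xdp_store_bytes(ctx, off, buf, len))
		return XDP_DROP;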

I'm more and more liking the idea of limiting xdp to match the
constraints of skb given that you, Kumar, and Jakub have all
mentioned that portability between xdp and skb would be useful for
users :)

What are your thoughts on this API:

1) bpf_dynptr_data()

Before:
  for skb-type progs:
      - data slices in fragments are not supported

  for xdp-type progs:
      - data slices in fragments are supported as long as the slice is
in a contiguous frag (eg not across frags)

Now:
  for skb + xdp type progs:
      - data slices in fragments are not supported


2)  bpf_dynptr_write()

Before:
  for skb-type progs:
     - all data slices are invalidated after a write

  for xdp-type progs:
     - nothing

Now:
  for skb + xdp type progs:
     - all data slices are invalidated after a write
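
With that, the same invalidation rule would hold in both program
types, e.g. (sketch):

	__u8 *p = bpf_dynptr_data(&ptr, 0, 4);
	__u8 tmp[4] = {};

	if (!p)
		return 0;
	bpf_dynptr_write(&ptr, 0, tmp, sizeof(tmp), 0);
	/* any further use of p is rejected by the verifier, for both
	 * skb and xdp dynptrs
	 */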

This will unite the functionality for skb and xdp programs across
bpf_dynptr_data, bpf_dynptr_write, and bpf_dynptr_read. As for whether
we should unite bpf_dynptr_from_skb and bpf_dynptr_from_xdp into one
common bpf_dynptr_from_packet as Jakub brought up in [0], I'm leaning
towards no because 1) if in the future there's some irreconcilable
aspect between skb and xdp that gets added, that'll be hard to support
since the expectation is that there is just one overall "packet
dynptr" 2) the "packet dynptr" view is not completely accurate (eg
bpf_dynptr_write accepts flags from skb progs and not xdp progs) 3)
this adds some additional hardcoding in the verifier since there's no
organic mapping between prog type -> prog ctx


[0] https://lore.kernel.org/bpf/20220726184706.954822-1-joannelkoong@gmail.com/T/#m1438f89152b1d0e539fe60a9376482bbc9de7b6e

>
> If the dynptr-based parse_pkt() function has to take special care to
> figure out where the dynptr comes from, it makes it a lot more difficult
> to write reusable packet parsing functions. So I'd be in favour of
> restricting the dynptr interface to the lowest common denominator of the
> skb and xdp interfaces even if that makes things slightly more awkward
> in the specialised cases...
>
> -Toke
>
Kumar Kartikeya Dwivedi Aug. 24, 2022, 9:10 p.m. UTC | #6
On Wed, 24 Aug 2022 at 00:27, Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Mon, Aug 22, 2022 at 7:31 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> > > [...]
> > >                 if (func_id == BPF_FUNC_dynptr_data &&
> > > -                   dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> > > +                   (dynptr_type == BPF_DYNPTR_TYPE_SKB ||
> > > +                    dynptr_type == BPF_DYNPTR_TYPE_XDP)) {
> > >                         regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
> > >                         regs[BPF_REG_0].range = meta.mem_size;
> >
> > It doesn't seem like this is safe. Since PTR_TO_PACKET's range can be
> > modified by comparisons with packet pointers loaded from the xdp/skb
> > ctx, how do we distinguish e.g. between a pkt slice obtained from some
> > frag in a multi-buff XDP vs pkt pointer from a linear area?
> >
> > Someone can compare data_meta from ctx with PTR_TO_PACKET from
> > bpf_dynptr_data on xdp dynptr (which might be pointing to an xdp mb
> > frag). While MAX_PACKET_OFF is 0xffff, it can still be used to do OOB
> > access for the linear area. reg_is_init_pkt_pointer will return true
> > as modified range is not considered for it. Same kind of issues when
> > doing comparison with data_end from ctx (though maybe you won't be
> > able to do incorrect data access at runtime using that).
> >
> > I had a pkt_uid field in my patch [0] which disallowed comparisons
> > among bpf_packet_pointer slices. Each call assigned a fresh pkt_uid,
> > and that disabled comparisons for them. reg->id is used for var_off
> > range propagation so it cannot be reused.
> >
> > Coming back to this: What we really want here is a PTR_TO_MEM with a
> > mem_size, so maybe you should go that route instead of PTR_TO_PACKET
> > (and add a type tag to maybe pretty print it also as a packet pointer
> > in verifier log), or add some way to distinguish slice vs non-slice
> > pkt pointers like I did in my patch. You might also want to add some
> > tests for this corner case (there are some later in [0] if you want to
> > reuse them).
> >
> > So TBH, I kinda dislike my own solution in [0] :). The complexity does
> > not seem worth it. The pkt_uid distinction is more useful (and
> > actually would be needed) in Toke's xdp queueing series, where in a
> > dequeue program you have multiple xdp_mds and want scoped slice
> > invalidations (i.e. adjust_head on one xdp_md doesn't invalidate
> > slices of some other xdp_md). Here we can just get away with normal
> > PTR_TO_MEM.
> >
> > ... Or just let me know if you handle this correctly already, or if
> > this won't be an actual problem :).
>
> Ooh interesting, I hadn't previously taken a look at
> try_match_pkt_pointers(), thanks for mentioning it :)
>
> The cleanest solution to me is to add the flag "DYNPTR_TYPE_{SKB/XDP}"
> to PTR_TO_PACKET and change reg_is_init_pkt_pointer() to return false
> if the DYNPTR_TYPE_{SKB/XDP} flag is present. I prefer this over
> returning PTR_TO_MEM because it seems more robust (eg if in the future
> we reject x behavior on the packet data reg types, this will
> automatically apply to the data slices), and because it'll keep the
> logic more efficient/simpler for the case when the pkt pointer has to
> be cleared after any helper that changes pkt data is called (aka the
> case where the data slice gets invalidated).
>
> What are your thoughts?
>

Thinking more deeply about it, probably not, we need more work here. I
remember _now_ why I chose the pkt_uid approach (and this tells us my
commit log lacks all the details about the motivation :( ).

Consider how equivalence checking for packet pointers works in
regsafe(). It checks the type, then whether old range > cur range,
then offs, etc.

The problem is, while we now don't prune on access to ptr_to_pkt vs
ptr_to_pkt | dynptr_pkt types in same reg (since type differs we
return false), we still prune if old range of ptr_to_pkt | dynptr_pkt
> cur range of ptr_to_pkt | dynptr_pkt. Both could be pointing into
separate frags, so this assumption would be incorrect. I would be able
to trick the verifier into accessing data beyond the length of a
different frag, by first making sure one line of exploration is
verified, and then changing the register in another branch reaching
the same branch target. Helpers can take packet pointers so the access
can become a pruning point. It would think the rest of the stuff is
safe, while they are not equivalent at all. It is ok if they are bit
by bit equivalent (same type, range, off, etc.).

If you start returning false whenever you see this type tag set, it
will become too conservative (it considers reg copies of the same
dynptr_data lookup as distinct). So you need some kind of id assigned
during dynptr_data lookup to distinguish them.
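
For reference, the shape of the pkt_uid approach from [0], as a
sketch (pkt_uid is a field that [0] adds to bpf_reg_state; it is not
in this patch):

	/* on each bpf_dynptr_data() call in check_helper_call(): */
	regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
	regs[BPF_REG_0].range = meta.mem_size;
	regs[BPF_REG_0].pkt_uid = ++env->id_gen; /* fresh per call */

	/* in regsafe(), slices from different calls never match: */
	if (rold->pkt_uid != rcur->pkt_uid)
		return false;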
Kumar Kartikeya Dwivedi Aug. 24, 2022, 11:04 p.m. UTC | #7
On Wed, 24 Aug 2022 at 20:11, Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Wed, Aug 24, 2022 at 3:39 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >
> > Joanne Koong <joannelkoong@gmail.com> writes:
> > >> [...]
> > >> It's a bit awkward to have such a difference between xdp and skb
> > >> dynptr's read/write. I understand why it is the way it is, but it
> > >> still doesn't feel right. I'm not sure if we can reconcile the
> > >> differences, but it makes writing common code for both xdp and tc
> > >> harder as it needs to be aware of the differences (and then the flags
> > >> for dynptr_write would differ too). So we're 90% there but not the
> > >> whole way...
> > >
> > > Yeah, it'd be great if the behavior for skb/xdp progs could be the
> > > same, but I'm not seeing a better solution here (unless we invalidate
> > > data slices on writes in xdp progs, just to make it match more :P).
> > >
> > > Regarding having 2 different interfaces bpf_dynptr_from_{skb/xdp}, I'm
> > > not convinced this is much of a problem - xdp and skb programs already
> > > have different interfaces for doing things (eg
> > > bpf_{skb/xdp}_{store/load}_bytes).
> >
> > This is true, but it's quite possible to paper over these differences
> > and write BPF code that works for both TC and XDP. Subtle semantic
> > differences in otherwise identical functions makes this harder.
> >
> > [...]
>
> To clarify, this is already possible when using data slices, since the
> behavior for data slices is equivalent between xdp and tc programs for
> non-fragmented accesses. From looking through the selftests, I
> anticipate that data slices will be the main way programs interact
> with dynptrs. For the cases where the program may write into frags,
> then bpf_dynptr_write will be needed (which is where functionality
> between xdp and tc start differing) - today, we're not able to write
> common code that writes into the frags since tc uses
> bpf_skb_store_bytes and xdp uses bpf_xdp_store_bytes.

Yes, we cannot write code that just swaps those two calls right now.
The key difference is, the two above are separate functions already
and have different semantics. Here in this set you have the same
function, but with different semantics depending on ctx, which is the
point of contention.

>
> I'm more and more liking the idea of limiting xdp to match the
> constraints of skb given that you, Kumar, and Jakub have all
> mentioned that portability between xdp and skb would be useful for
> users :)
>
> What are your thoughts on this API:
>
> 1) bpf_dynptr_data()
>
> Before:
>   for skb-type progs:
>       - data slices in fragments are not supported
>
>   for xdp-type progs:
>       - data slices in fragments are supported as long as the slice is
> in a contiguous frag (eg not across frags)
>
> Now:
>   for skb + xdp type progs:
>       - data slices in fragments are not supported
>
>
> 2)  bpf_dynptr_write()
>
> Before:
>   for skb-type progs:
>      - all data slices are invalidated after a write
>
>   for xdp-type progs:
>      - nothing
>
> Now:
>   for skb + xdp type progs:
>      - all data slices are invalidated after a write
>

There is also the other option: failing to write until you pull skb,
which looks a lot better to me if we are adding this helper anyway...
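
That is, for skb progs a write into a frag would only succeed after
something like (sketch):

	/* linearize the first off + len bytes, then the write cannot
	 * land in a frag
	 */
	if (bpf_skb_pull_data(skb, off + len))
		return TC_ACT_SHOT;
	bpf_dynptr_write(&ptr, off, buf, len, 0);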

> This will unite the functionality for skb and xdp programs across
> bpf_dynptr_data, bpf_dynptr_write, and bpf_dynptr_read. As for whether
> we should unite bpf_dynptr_from_skb and bpf_dynptr_from_xdp into one
> common bpf_dynptr_from_packet as Jakub brought up in [0], I'm leaning
> towards no because 1) if in the future there's some irreconcilable
> aspect between skb and xdp that gets added, that'll be hard to support
> since the expectation is that there is just one overall "packet
> dynptr" 2) the "packet dynptr" view is not completely accurate (eg
> bpf_dynptr_write accepts flags from skb progs and not xdp progs) 3)
> this adds some additional hardcoding in the verifier since there's no
> organic mapping between prog type -> prog ctx
>

There is, e.g. see how btf_get_prog_ctx_type is doing it (unless I
misunderstood what you meant).

I also had a different tangential question (unrelated to
'reconciliation'). Let's forget about that for a moment. I was
listening to the talk here [0]. At 12:00, I can hear someone
mentioning that you'd have a dynptr for each segment/frag.

Right now, you have a skb/xdp dynptr, which is representing
discontiguous memory, i.e. you represent all linear page + fragments
using one dynptr. This seems a bit inconsistent with the idea of a
dynptr, since it's conceptually a slice of variable length memory.
dynptr_data gives you a constant length slice into dynptr's variable
length memory (which the verifier can track bounds of statically).

So thinking about the idea in [0], one way would be to change
from_xdp/from_skb helpers to take an index (into the frags array), and
return dynptr for each frag. Or determine frag from offset + len. For
the skb case it might require kmap_atomic to access the frag dynptr,
which might need a paired kunmap_atomic call as well, and it may not
return dynptr for frags in cloned skbs, but for XDP multi-buff all
that doesn't count.
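
Something like (helper name invented purely for illustration):

	struct bpf_dynptr frag_ptr;

	/* hypothetical: a dynptr over frag i only, instead of one
	 * dynptr spanning the linear area plus all frags
	 */
	if (bpf_dynptr_from_xdp_frag(ctx, i, 0, &frag_ptr))
		return XDP_DROP;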

Was this approach/line of thinking considered and/or rejected? What
were the reasons? Just trying to understand the background here.

What I'm wondering is that we already have helpers to do reads and
writes that you are wrapping in dynptr interfaces, but what we're
missing is a way to get direct access to frags (or the 'next' ctx,
essentially). When we know we can avoid pulls, it might be cheaper to
access the skb this way, right? Especially for a case where just a
part of the header lies in the next frag?

But I'm no expert here, so I might be missing some subtleties.

  [0]: https://www.youtube.com/watch?v=EZ5PDrXvs7M
>
> [0] https://lore.kernel.org/bpf/20220726184706.954822-1-joannelkoong@gmail.com/T/#m1438f89152b1d0e539fe60a9376482bbc9de7b6e
>
> >
> > If the dynptr-based parse_pkt() function has to take special care to
> > figure out where the dynptr comes from, it makes it a lot more difficult
> > to write reusable packet parsing functions. So I'd be in favour of
> > restricting the dynptr interface to the lowest common denominator of the
> > skb and xdp interfaces even if that makes things slightly more awkward
> > in the specialised cases...
> >
> > -Toke
> >
Joanne Koong Aug. 25, 2022, 8:14 p.m. UTC | #8
On Wed, Aug 24, 2022 at 4:04 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Wed, 24 Aug 2022 at 20:11, Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Wed, Aug 24, 2022 at 3:39 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > >
> > > Joanne Koong <joannelkoong@gmail.com> writes:
> > > >> [...]
> > > >> It's a bit awkward to have such a difference between xdp and skb
> > > >> dynptr's read/write. I understand why it is the way it is, but it
> > > >> still doesn't feel right. I'm not sure if we can reconcile the
> > > >> differences, but it makes writing common code for both xdp and tc
> > > >> harder as it needs to be aware of the differences (and then the flags
> > > >> for dynptr_write would differ too). So we're 90% there but not the
> > > >> whole way...
> > > >
> > > > Yeah, it'd be great if the behavior for skb/xdp progs could be the
> > > > same, but I'm not seeing a better solution here (unless we invalidate
> > > > data slices on writes in xdp progs, just to make it match more :P).
> > > >
> > > > Regarding having 2 different interfaces bpf_dynptr_from_{skb/xdp}, I'm
> > > > not convinced this is much of a problem - xdp and skb programs already
> > > > have different interfaces for doing things (eg
> > > > bpf_{skb/xdp}_{store/load}_bytes).
> > >
> > > This is true, but it's quite possible to paper over these differences
> > > and write BPF code that works for both TC and XDP. Subtle semantic
> > > differences in otherwise identical functions makes this harder.
> > >
> > > Today you can write a function like:
> > >
> > > static inline int parse_pkt(void *data, void* data_end)
> > > {
> > >         /* parse data */
> > > }
> > >
> > > And call it like:
> > >
> > > SEC("xdp")
> > > int parse_xdp(struct xdp_md *ctx)
> > > {
> > >         return parse_pkt(ctx->data, ctx->data_end);
> > > }
> > >
> > > SEC("tc")
> > > int parse_tc(struct __sk_buff *skb)
> > > {
> > >         return parse_pkt(skb->data, skb->data_end);
> > > }
> > >
> > >
> > > IMO the goal should be to be able to do the equivalent for dynptrs, like:
> > >
> > > static inline int parse_pkt(struct bpf_dynptr *ptr)
> > > {
> > >         __u64 *data;
> > >
> > >         data = bpf_dynptr_data(ptr, 0, sizeof(*data));
> > >         if (!data)
> > >                 return 0;
> > >         /* parse data */
> > > }
> > >
> > > SEC("xdp")
> > > int parse_xdp(struct xdp_md *ctx)
> > > {
> > >         struct bpf_dynptr ptr;
> > >
> > >         bpf_dynptr_from_xdp(ctx, 0, &ptr);
> > >         return parse_pkt(&ptr);
> > > }
> > >
> > > SEC("tc")
> > > int parse_tc(struct __sk_buff *skb)
> > > {
> > >         struct bpf_dynptr ptr;
> > >
> > >         bpf_dynptr_from_skb(skb, 0, &ptr);
> > >         return parse_pkt(&ptr);
> > > }
> > >
> >
> > To clarify, this is already possible when using data slices, since the
> > behavior for data slices is equivalent between xdp and tc programs for
> > non-fragmented accesses. From looking through the selftests, I
> > anticipate that data slices will be the main way programs interact
> > with dynptrs. For the cases where the program may write into frags,
> > then bpf_dynptr_write will be needed (which is where functionality
> > between xdp and tc starts differing) - today, we're not able to write
> > common code that writes into the frags since tc uses
> > bpf_skb_store_bytes and xdp uses bpf_xdp_store_bytes.
>
> Yes, we cannot write code that just swaps those two calls right now.
> The key difference is, the two above are separate functions already
> and have different semantics. Here in this set you have the same
> function, but with different semantics depending on ctx, which is the
> point of contention.
>
> >
> > I'm more and more liking the idea of limiting xdp to match the
> > constraints of skb given that both you, Kumar, and Jakub have
> > mentioned that portability between xdp and skb would be useful for
> > users :)
> >
> > What are your thoughts on this API:
> >
> > 1) bpf_dynptr_data()
> >
> > Before:
> >   for skb-type progs:
> >       - data slices in fragments is not supported
> >
> >   for xdp-type progs:
> >       - data slices in fragments is supported as long as it is in a
> > contiguous frag (eg not across frags)
> >
> > Now:
> >   for skb + xdp type progs:
> >       - data slices in fragments is not supported
> >
> >
> > 2)  bpf_dynptr_write()
> >
> > Before:
> >   for skb-type progs:
> >      - all data slices are invalidated after a write
> >
> >   for xdp-type progs:
> >      - nothing
> >
> > Now:
> >   for skb + xdp type progs:
> >      - all data slices are invalidated after a write
> >
>
> There is also the other option: failing to write until you pull skb,
> which looks a lot better to me if we are adding this helper anyway...

This was also previously discussed in [0].

After using skb dynptrs for the test_cls_redirect_dynptr.c selftest
[1], I'm convinced that allowing bpf_dynptr_write() to write into frags
is the correct way to go. There are instances where the data is
probably in the head with a slight off-chance that it's in a fragment -
being able to call bpf_dynptr_write makes it super easy, whereas the
alternative is either 1) always pulling, which in most cases will be
unnecessary, or 2) adding special-casing for the case where the
bpf_dynptr_write fails. As well, failing to write until the skb is
pulled will hurt being able to write common code that both xdp and skb
progs can use.
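
As a rough sketch (using the helper signatures proposed in this
series), the common case then just becomes:

SEC("tc")
int set_field(struct __sk_buff *skb)
{
        struct bpf_dynptr ptr;
        __u16 val = 0;

        bpf_dynptr_from_skb(skb, 0, &ptr);
        /* succeeds whether the (arbitrary) offset 60 lands in the
         * head or in a frag - no pull, no special-casing needed
         */
        if (bpf_dynptr_write(&ptr, 60, &val, sizeof(val), 0))
                return TC_ACT_SHOT;
        return TC_ACT_OK;
}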


[0] https://lore.kernel.org/bpf/20220726184706.954822-1-joannelkoong@gmail.com/T/#md6c17d9916f5937a9ae9dfca11e815e4b89009fb

[1] https://lore.kernel.org/bpf/20220822235649.2218031-4-joannelkoong@gmail.com/

>
> > This will unite the functionality for skb and xdp programs across
> > bpf_dynptr_data, bpf_dynptr_write, and bpf_dynptr_read. As for whether
> > we should unite bpf_dynptr_from_skb and bpf_dynptr_from_xdp into one
> > common bpf_dynptr_from_packet as Jakub brought in [0], I'm leaning
> > towards no because 1) if in the future there's some irreconcilable
> > aspect between skb and xdp that gets added, that'll be hard to support
> > since the expectation is that there is just one overall "packet
> > dynptr" 2) the "packet dynptr" view is not completely accurate (eg
> > bpf_dynptr_write accepts flags from skb progs and not xdp progs) 3)
> > this adds some additional hardcoding in the verifier since there's no
> > organic mapping between prog type -> prog ctx
> >
>
> There is, e.g. see how btf_get_prog_ctx_type is doing it (unless I
> misunderstood what you meant).

Cool, that should be pretty straightforward then; from what it looks
like, btf_get_prog_ctx_type() + 1 will give the btf id, and from that
we can use btf_type_by_id() to get the corresponding struct btf_type,
then use btf_type->name_off to check whether the name corresponds to
"__sk_buff" or "xdp_md".

>
> I also had a different tangential question (unrelated to
> 'reconciliation'). Let's forget about that for a moment. I was
> listening to the talk here [0]. At 12:00, I can hear someone
> mentioning that you'd have a dynptr for each segment/frag.
>
> Right now, you have a skb/xdp dynptr, which is representing
> discontiguous memory, i.e. you represent all linear page + fragments
> using one dynptr. This seems a bit inconsistent with the idea of a
> dynptr, since it's conceptually a slice of variable length memory.
> dynptr_data gives you a constant length slice into dynptr's variable
> length memory (which the verifier can track bounds of statically).
>
> So thinking about the idea in [0], one way would be to change
> from_xdp/from_skb helpers to take an index (into the frags array), and
> return dynptr for each frag. Or determine frag from offset + len. For
> the skb case it might require kmap_atomic to access the frag dynptr,
> which might need a paired kunmap_atomic call as well, and it may not
> return dynptr for frags in cloned skbs, but for XDP multi-buff all
> that doesn't count.
>
> Was this approach/line of thinking considered and/or rejected? What
> were the reasons? Just trying to understand the background here.
>
> What I'm wondering is that we already have helpers to do reads and
> writes that you are wrapping over in dynptr interfaces, but what we're
> missing is to be able to get direct access to frags (or 'next' ctx,
> essentially). When we know we can avoid pulls, it might be cheaper to
> access the skb this way, right? Especially for a case where
> just a part of the header lies in the next frag?

Imo, having a separate dynptr for each frag is confusing for users. My
intuition is that in the majority of cases, the user doesn't care how
it's laid out at the frag level, they just want to write some data at
some offset and length. If every frag is a separate dynptr, then the
user needs to be aware whether they're crossing frag (and thus dynptr)
boundaries. In cases where the user doesn't know if the data is across
head/frag or across frags, their prog code will need to be more
complex to handle these different scenarios. There's the additional
issue as well that this doesn't work for cloned skb frags - now users
need to know whether the skb is cloned or uncloned to know whether to
pull first. Imo it makes the interface more confusing. I'm still not
sure what is gained by having separate dynptrs for separate frags.

If the goal is to avoid pulls, then we're able to do that without
having separate dynptrs. If a write is to an uncloned frag, then
bpf_dynptr_write() can kmap, write the data, and kunmap (though it's
still unclear as to whether this should be done, since maybe it's
faster to pull and do large numbers of writes instead of multiple
kmap/kunmap calls).
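
Roughly, for the uncloned frag case the helper could internally do
something like (sketch; skb_store_bits() already does something
similar when it walks frags):

void *addr = kmap_atomic(skb_frag_page(frag));

memcpy(addr + skb_frag_off(frag) + off_in_frag, src, len);
kunmap_atomic(addr);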

>
> But I'm no expert here, so I might be missing some subtleties.
>
>   [0]: https://www.youtube.com/watch?v=EZ5PDrXvs7M
>
>
>
>
>
>
>
>
>
> >
> > [0] https://lore.kernel.org/bpf/20220726184706.954822-1-joannelkoong@gmail.com/T/#m1438f89152b1d0e539fe60a9376482bbc9de7b6e
> >
> > >
> > > If the dynptr-based parse_pkt() function has to take special care to
> > > figure out where the dynptr comes from, it makes it a lot more difficult
> > > to write reusable packet parsing functions. So I'd be in favour of
> > > restricting the dynptr interface to the lowest common denominator of the
> > > skb and xdp interfaces even if that makes things slightly more awkward
> > > in the specialised cases...
> > >
> > > -Toke
> > >
Joanne Koong Aug. 25, 2022, 8:39 p.m. UTC | #9
On Wed, Aug 24, 2022 at 2:11 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Wed, 24 Aug 2022 at 00:27, Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Mon, Aug 22, 2022 at 7:31 PM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> > > > [...]
> > > >                 if (func_id == BPF_FUNC_dynptr_data &&
> > > > -                   dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> > > > +                   (dynptr_type == BPF_DYNPTR_TYPE_SKB ||
> > > > +                    dynptr_type == BPF_DYNPTR_TYPE_XDP)) {
> > > >                         regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
> > > >                         regs[BPF_REG_0].range = meta.mem_size;
> > >
> > > It doesn't seem like this is safe. Since PTR_TO_PACKET's range can be
> > > modified by comparisons with packet pointers loaded from the xdp/skb
> > > ctx, how do we distinguish e.g. between a pkt slice obtained from some
> > > frag in a multi-buff XDP vs pkt pointer from a linear area?
> > >
> > > Someone can compare data_meta from ctx with PTR_TO_PACKET from
> > > bpf_dynptr_data on xdp dynptr (which might be pointing to a xdp mb
> > > frag). While MAX_PACKET_OFF is 0xffff, it can still be used to do OOB
> > > access for the linear area. reg_is_init_pkt_pointer will return true
> > > as modified range is not considered for it. Same kind of issues when
> > > doing comparison with data_end from ctx (though maybe you won't be
> > > able to do incorrect data access at runtime using that).
> > >
> > > I had a pkt_uid field in my patch [0] which disallowed comparisons
> > > among bpf_packet_pointer slices. Each call assigned a fresh pkt_uid,
> > > and that disabled comparisons for them. reg->id is used for var_off
> > > range propagation so it cannot be reused.
> > >
> > > Coming back to this: What we really want here is a PTR_TO_MEM with a
> > > mem_size, so maybe you should go that route instead of PTR_TO_PACKET
> > > (and add a type tag to maybe pretty print it also as a packet pointer
> > > in verifier log), or add some way to distinguish slice vs non-slice
> > > pkt pointers like I did in my patch. You might also want to add some
> > > tests for this corner case (there are some later in [0] if you want to
> > > reuse them).
> > >
> > > So TBH, I kinda dislike my own solution in [0] :). The complexity does
> > > not seem worth it. The pkt_uid distinction is more useful (and
> > > actually would be needed) in Toke's xdp queueing series, where in a
> > > dequeue program you have multiple xdp_mds and want scoped slice
> > > invalidations (i.e. adjust_head on one xdp_md doesn't invalidate
> > > slices of some other xdp_md). Here we can just get away with normal
> > > PTR_TO_MEM.
> > >
> > > ... Or just let me know if you handle this correctly already, or if
> > > this won't be an actual problem :).
> >
> > Ooh interesting, I hadn't previously taken a look at
> > try_match_pkt_pointers(), thanks for mentioning it :)
> >
> > The cleanest solution to me is to add the flag "DYNPTR_TYPE_{SKB/XDP}"
> > to PTR_TO_PACKET and change reg_is_init_pkt_pointer() to return false
> > if the DYNPTR_TYPE_{SKB/XDP} flag is present. I prefer this over
> > returning PTR_TO_MEM because it seems more robust (eg if in the future
> > we reject x behavior on the packet data reg types, this will
> > automatically apply to the data slices), and because it'll keep the
> > logic more efficient/simpler for the case when the pkt pointer has to
> > be cleared after any helper that changes pkt data is called (aka the
> > case where the data slice gets invalidated).
> >
> > What are your thoughts?
> >
>
> Thinking more deeply about it, probably not, we need more work here. I
> remember _now_ why I chose the pkt_uid approach (and this tells us my
> commit log lacks all the details about the motivation :( ).
>
> Consider how equivalency checking for packet pointers works in
> regsafe. It is checking type, then if old range > cur range, then
> offs, etc.
>
> The problem is, while we now don't prune on access to ptr_to_pkt vs
> ptr_to_pkt | dynptr_pkt types in same reg (since type differs we
> return false), we still prune if old range of ptr_to_pkt | dynptr_pkt
> > cur range of ptr_to_pkt | dynptr_pkt. Both could be pointing into
> separate frags, so this assumption would be incorrect. I would be able
> to trick the verifier into accessing data beyond the length of a
> different frag, by first making sure one line of exploration is
> verified, and then changing the register in another branch reaching
> the same branch target. Helpers can take packet pointers so the access
> can become a pruning point. It would think the rest of the stuff is
> safe, while they are not equivalent at all. It is ok if they are bit
> by bit equivalent (same type, range, off, etc.).

Thanks for the explanation. To clarify, if old range of ptr_to_pkt >
cur range of ptr_to_pkt, what gets pruned? Is it access to cur range
of ptr_to_pkt since if old range > cur range, then if old range is
acceptable cur range must definitely be acceptable?

>
> If you start returning false whenever you see this type tag set, it
> will become too conservative (it considers reg copies of the same
> dynptr_data lookup as distinct). So you need some kind of id assigned
> during dynptr_data lookup to distinguish them.

What about if the dynptr_pkt type tag is set, then we compare the
ranges as well? If the ranges are the same, then we return true, else
false. Does that work?
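
Something like this in regsafe() (sketch, reusing the proposed
DYNPTR_TYPE_{SKB/XDP} tag on PTR_TO_PACKET):

if (old->type & (DYNPTR_TYPE_SKB | DYNPTR_TYPE_XDP))
        return old->range == cur->range;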
Andrii Nakryiko Aug. 25, 2022, 9:53 p.m. UTC | #10
On Wed, Aug 24, 2022 at 4:04 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Wed, 24 Aug 2022 at 20:11, Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Wed, Aug 24, 2022 at 3:39 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > >
> > > Joanne Koong <joannelkoong@gmail.com> writes:
> > > >> [...]
> > > >> It's a bit awkward to have such a difference between xdp and skb
> > > >> dynptr's read/write. I understand why it is the way it is, but it
> > > >> still doesn't feel right. I'm not sure if we can reconcile the
> > > >> differences, but it makes writing common code for both xdp and tc
> > > >> harder as it needs to be aware of the differences (and then the flags
> > > >> for dynptr_write would differ too). So we're 90% there but not the
> > > >> whole way...
> > > >
> > > > Yeah, it'd be great if the behavior for skb/xdp progs could be the
> > > > same, but I'm not seeing a better solution here (unless we invalidate
> > > > data slices on writes in xdp progs, just to make it match more :P).
> > > >
> > > > Regarding having 2 different interfaces bpf_dynptr_from_{skb/xdp}, I'm
> > > > not convinced this is much of a problem - xdp and skb programs already
> > > > have different interfaces for doing things (eg
> > > > bpf_{skb/xdp}_{store/load}_bytes).
> > >
> > > This is true, but it's quite possible to paper over these differences
> > > and write BPF code that works for both TC and XDP. Subtle semantic
> > > differences in otherwise identical functions makes this harder.
> > >
> > > Today you can write a function like:
> > >
> > > static inline int parse_pkt(void *data, void* data_end)
> > > {
> > >         /* parse data */
> > > }
> > >
> > > And call it like:
> > >
> > > SEC("xdp")
> > > int parse_xdp(struct xdp_md *ctx)
> > > {
> > >         return parse_pkt(ctx->data, ctx->data_end);
> > > }
> > >
> > > SEC("tc")
> > > int parse_tc(struct __sk_buff *skb)
> > > {
> > >         return parse_pkt(skb->data, skb->data_end);
> > > }
> > >
> > >
> > > IMO the goal should be to be able to do the equivalent for dynptrs, like:
> > >
> > > static inline int parse_pkt(struct bpf_dynptr *ptr)
> > > {
> > >         __u64 *data;
> > >
> > >         data = bpf_dynptr_data(ptr, 0, sizeof(*data));
> > >         if (!data)
> > >                 return 0;
> > >         /* parse data */
> > > }
> > >
> > > SEC("xdp")
> > > int parse_xdp(struct xdp_md *ctx)
> > > {
> > >         struct bpf_dynptr ptr;
> > >
> > >         bpf_dynptr_from_xdp(ctx, 0, &ptr);
> > >         return parse_pkt(&ptr);
> > > }
> > >
> > > SEC("tc")
> > > int parse_tc(struct __sk_buff *skb)
> > > {
> > >         struct bpf_dynptr ptr;
> > >
> > >         bpf_dynptr_from_skb(skb, 0, &ptr);
> > >         return parse_pkt(&ptr);
> > > }
> > >
> >
> > To clarify, this is already possible when using data slices, since the
> > behavior for data slices is equivalent between xdp and tc programs for
> > non-fragmented accesses. From looking through the selftests, I
> > anticipate that data slices will be the main way programs interact
> > with dynptrs. For the cases where the program may write into frags,
> > then bpf_dynptr_write will be needed (which is where functionality
> > between xdp and tc starts differing) - today, we're not able to write
> > common code that writes into the frags since tc uses
> > bpf_skb_store_bytes and xdp uses bpf_xdp_store_bytes.
>
> Yes, we cannot write code that just swaps those two calls right now.
> The key difference is, the two above are separate functions already
> and have different semantics. Here in this set you have the same
> function, but with different semantics depending on ctx, which is the
> point of contention.

bpf_dynptr_{read,write}() are effectively virtual methods of dynptr,
and we have a few different implementations of dynptr depending on what
data they are wrapping, so having different semantics depending on ctx
makes sense if they are dictated by good reasons (like good
performance for skb and xdp).

>
> >
> > I'm more and more liking the idea of limiting xdp to match the
> > constraints of skb given that both you, Kumar, and Jakub have
> > mentioned that portability between xdp and skb would be useful for
> > users :)
> >
> > What are your thoughts on this API:
> >
> > 1) bpf_dynptr_data()
> >
> > Before:
> >   for skb-type progs:
> >       - data slices in fragments is not supported
> >
> >   for xdp-type progs:
> >       - data slices in fragments is supported as long as it is in a
> > contiguous frag (eg not across frags)
> >
> > Now:
> >   for skb + xdp type progs:
> >       - data slices in fragments is not supported
> >
> >
> > 2)  bpf_dynptr_write()
> >
> > Before:
> >   for skb-type progs:
> >      - all data slices are invalidated after a write
> >
> >   for xdp-type progs:
> >      - nothing
> >
> > Now:
> >   for skb + xdp type progs:
> >      - all data slices are invalidated after a write
> >
>
> There is also the other option: failing to write until you pull skb,
> which looks a lot better to me if we are adding this helper anyway...

Wouldn't this kill performance for typical cases when writes don't go
into frags?

>
> > This will unite the functionality for skb and xdp programs across
> > bpf_dynptr_data, bpf_dynptr_write, and bpf_dynptr_read. As for whether
> > we should unite bpf_dynptr_from_skb and bpf_dynptr_from_xdp into one
> > common bpf_dynptr_from_packet as Jakub brought in [0], I'm leaning
> > towards no because 1) if in the future there's some irreconcilable
> > aspect between skb and xdp that gets added, that'll be hard to support
> > since the expectation is that there is just one overall "packet
> > dynptr" 2) the "packet dynptr" view is not completely accurate (eg
> > bpf_dynptr_write accepts flags from skb progs and not xdp progs) 3)
> > this adds some additional hardcoding in the verifier since there's no
> > organic mapping between prog type -> prog ctx
> >
>
> There is, e.g. see how btf_get_prog_ctx_type is doing it (unless I
> misunderstood what you meant).
>
> I also had a different tangential question (unrelated to
> 'reconciliation'). Let's forget about that for a moment. I was
> listening to the talk here [0]. At 12:00, I can hear someone
> mentioning that you'd have a dynptr for each segment/frag.

One of those people was me :) I think what you are referring to was an
idea that for frags we might add a special iterator function that will
call a callback for each linear frag. And in such a case the linear
frag will be presented to the callback as a dynptr (just as an
interface to a statically unknown memory region; it will be
DYNPTR_LOCAL at that point, probably).
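
So, hypothetically, something with this shape (all names made up):

static long per_chunk_cb(struct bpf_dynptr *chunk /* DYNPTR_LOCAL */, void *ctx)
{
        /* parse this contiguous chunk */
        return 0;
}

/* in the prog: */
bpf_for_each_pkt_chunk(skb, per_chunk_cb, &cb_ctx, 0);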

It's different from DYNPTR_SKB that Joanne is adding here, though.

>
> Right now, you have a skb/xdp dynptr, which is representing
> discontiguous memory, i.e. you represent all linear page + fragments
> using one dynptr. This seems a bit inconsistent with the idea of a
> dynptr, since it's conceptually a slice of variable length memory.
> dynptr_data gives you a constant length slice into dynptr's variable
> length memory (which the verifier can track bounds of statically).

All the data in skb is conceptually a slice of variable length memory
with no gaps in between. It's just physically discontiguous, but there
is still linear offset addressing (conceptually). So I think it
makes sense to represent skb data as a whole as one SKB dynptr. There
are technical restrictions on direct memory access through
bpf_dynptr_data(), inevitably.


>
> So thinking about the idea in [0], one way would be to change
> from_xdp/from_skb helpers to take an index (into the frags array), and
> return dynptr for each frag. Or determine frag from offset + len. For
> the skb case it might require kmap_atomic to access the frag dynptr,
> which might need a paired kunmap_atomic call as well, and it may not
> return dynptr for frags in cloned skbs, but for XDP multi-buff all
> that doesn't count.

We could do something like bpf_dynptr_from_skb_frag(skb, N) and get
access to frag #N (where we can special case 0 being linear skb data),
but that would be DYNPTR_LOCAL at that point, probably? And it's a
separate extension in addition to DYNPTR_SKB. Having generic
bpf_dynptr_{read,write}() working over entire skb range is very
useful, while this per-frag dynptr would prevent this. So let's keep
those two separate.

>
> Was this approach/line of thinking considered and/or rejected? What
> were the reasons? Just trying to understand the background here.
>
> What I'm wondering is that we already have helpers to do reads and
> writes that you are wrapping over in dynptr interfaces, but what we're
> missing is to be able to get direct access to frags (or 'next' ctx,
> essentially). When we know we can avoid pulls, it might be cheaper to
> access the skb this way, right? Especially for a case where
> just a part of the header lies in the next frag?
>
> But I'm no expert here, so I might be missing some subtleties.
>
>   [0]: https://www.youtube.com/watch?v=EZ5PDrXvs7M
>
>
>
>
>
>
>
>
>
> >
> > [0] https://lore.kernel.org/bpf/20220726184706.954822-1-joannelkoong@gmail.com/T/#m1438f89152b1d0e539fe60a9376482bbc9de7b6e
> >
> > >
> > > If the dynptr-based parse_pkt() function has to take special care to
> > > figure out where the dynptr comes from, it makes it a lot more difficult
> > > to write reusable packet parsing functions. So I'd be in favour of
> > > restricting the dynptr interface to the lowest common denominator of the
> > > skb and xdp interfaces even if that makes things slightly more awkward
> > > in the specialised cases...
> > >
> > > -Toke
> > >
Kumar Kartikeya Dwivedi Aug. 25, 2022, 11:18 p.m. UTC | #11
On Thu, 25 Aug 2022 at 22:39, Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Wed, Aug 24, 2022 at 2:11 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Wed, 24 Aug 2022 at 00:27, Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > On Mon, Aug 22, 2022 at 7:31 PM Kumar Kartikeya Dwivedi
> > > <memxor@gmail.com> wrote:
> > > > > [...]
> > > > >                 if (func_id == BPF_FUNC_dynptr_data &&
> > > > > -                   dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> > > > > +                   (dynptr_type == BPF_DYNPTR_TYPE_SKB ||
> > > > > +                    dynptr_type == BPF_DYNPTR_TYPE_XDP)) {
> > > > >                         regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
> > > > >                         regs[BPF_REG_0].range = meta.mem_size;
> > > >
> > > > It doesn't seem like this is safe. Since PTR_TO_PACKET's range can be
> > > > modified by comparisons with packet pointers loaded from the xdp/skb
> > > > ctx, how do we distinguish e.g. between a pkt slice obtained from some
> > > > frag in a multi-buff XDP vs pkt pointer from a linear area?
> > > >
> > > > Someone can compare data_meta from ctx with PTR_TO_PACKET from
> > > > bpf_dynptr_data on xdp dynptr (which might be pointing to a xdp mb
> > > > frag). While MAX_PACKET_OFF is 0xffff, it can still be used to do OOB
> > > > access for the linear area. reg_is_init_pkt_pointer will return true
> > > > as modified range is not considered for it. Same kind of issues when
> > > > doing comparison with data_end from ctx (though maybe you won't be
> > > > able to do incorrect data access at runtime using that).
> > > >
> > > > I had a pkt_uid field in my patch [0] which disallowed comparisons
> > > > among bpf_packet_pointer slices. Each call assigned a fresh pkt_uid,
> > > > and that disabled comparisons for them. reg->id is used for var_off
> > > > range propagation so it cannot be reused.
> > > >
> > > > Coming back to this: What we really want here is a PTR_TO_MEM with a
> > > > mem_size, so maybe you should go that route instead of PTR_TO_PACKET
> > > > (and add a type tag to maybe pretty print it also as a packet pointer
> > > > in verifier log), or add some way to distinguish slice vs non-slice
> > > > pkt pointers like I did in my patch. You might also want to add some
> > > > tests for this corner case (there are some later in [0] if you want to
> > > > reuse them).
> > > >
> > > > So TBH, I kinda dislike my own solution in [0] :). The complexity does
> > > > not seem worth it. The pkt_uid distinction is more useful (and
> > > > actually would be needed) in Toke's xdp queueing series, where in a
> > > > dequeue program you have multiple xdp_mds and want scoped slice
> > > > invalidations (i.e. adjust_head on one xdp_md doesn't invalidate
> > > > slices of some other xdp_md). Here we can just get away with normal
> > > > PTR_TO_MEM.
> > > >
> > > > ... Or just let me know if you handle this correctly already, or if
> > > > this won't be an actual problem :).
> > >
> > > Ooh interesting, I hadn't previously taken a look at
> > > try_match_pkt_pointers(), thanks for mentioning it :)
> > >
> > > The cleanest solution to me is to add the flag "DYNPTR_TYPE_{SKB/XDP}"
> > > to PTR_TO_PACKET and change reg_is_init_pkt_pointer() to return false
> > > if the DYNPTR_TYPE_{SKB/XDP} flag is present. I prefer this over
> > > returning PTR_TO_MEM because it seems more robust (eg if in the future
> > > we reject x behavior on the packet data reg types, this will
> > > automatically apply to the data slices), and because it'll keep the
> > > logic more efficient/simpler for the case when the pkt pointer has to
> > > be cleared after any helper that changes pkt data is called (aka the
> > > case where the data slice gets invalidated).
> > >
> > > What are your thoughts?
> > >
> >
> > Thinking more deeply about it, probably not, we need more work here. I
> > remember _now_ why I chose the pkt_uid approach (and this tells us my
> > commit log lacks all the details about the motivation :( ).
> >
> > Consider how equivalency checking for packet pointers works in
> > regsafe. It is checking type, then if old range > cur range, then
> > offs, etc.
> >
> > The problem is, while we now don't prune on access to ptr_to_pkt vs
> > ptr_to_pkt | dynptr_pkt types in same reg (since type differs we
> > return false), we still prune if old range of ptr_to_pkt | dynptr_pkt
> > > cur range of ptr_to_pkt | dynptr_pkt. Both could be pointing into
> > separate frags, so this assumption would be incorrect. I would be able
> > to trick the verifier into accessing data beyond the length of a
> > different frag, by first making sure one line of exploration is
> > verified, and then changing the register in another branch reaching
> > the same branch target. Helpers can take packet pointers so the access
> > can become a pruning point. It would think the rest of the stuff is
> > safe, while they are not equivalent at all. It is ok if they are bit
> > by bit equivalent (same type, range, off, etc.).
>
> Thanks for the explanation. To clarify, if old range of ptr_to_pkt >
> cur range of ptr_to_pkt, what gets pruned? Is it access to cur range
> of ptr_to_pkt since if old range > cur range, then if old range is
> acceptable cur range must definitely be acceptable?

No, my description was bad :).
We return false when old_range > cur_range, i.e. the path is
considered safe and not explored again when old_range <= cur_range
(pruning), otherwise we continue verifying.
Consider if it was doing pkt[cur_range + 1] access in the path we are
about to explore again (already verified for old_range). That is <=
old_range, but > cur_range, so it would be problematic if we pruned
the search when old_range > cur_range.
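
The check in regsafe() for packet pointers is roughly:

/* If the old range was bigger, accesses verified against it may be
 * out of bounds for the current range, so the states are not
 * equivalent - keep verifying.
 */
if (old->range > cur->range)
        return false;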

So maybe it won't be a problem here, and just the current range checks
for pkt pointer slices is fine even if they belong to different frags.
I didn't craft any test case when writing my previous reply.
Especially since you will disable comparisons, one cannot relearn
range again using var_off + comparison, which closes another loophole.

It just seems simpler to me to be a bit more conservative, since it is
only an optimization. There might be some corner case lurking we can't
think of right now. But I leave the judgement up to you if you can
reason about it. In either case it would be good to include some
comments in the commit log about all this.

Meanwhile, looking at the current code, I'm more inclined to suggest
PTR_TO_MEM (and handle invalidation specially), but again, I will
leave it up to you to decide.

When we do += var_off to a pkt reg, its range is reset to zero,
compared to PTR_TO_MEM, where off + var_off (smin/umax) is used to
check against the actual size for an access, which is a bit more
flexible. The reason to reset the range is that it will be relearned
using comparisons and transferred to copies (reg->id is assigned for
each += var_off). That doesn't apply to slice pointers (essentially
the only reason to keep them as pkt pointers is being able to pick
them for invalidation); we try to disable the rest of the pkt pointer
magic in the verifier anyway.

pkt_reg->umax_value influences the prog->aux->max_pkt_offset (and iiuc
it can reach that point with range == 0 after += var_off, and
zero_size_allowed == true), only seems to be used by netronome's ebpf
offload for now, but still a bit confusing if slice pkt pointers cause
this to change.

>
> >
> > If you start returning false whenever you see this type tag set, it
> > will become too conservative (it considers reg copies of the same
> > dynptr_data lookup as distinct). So you need some kind of id assigned
> > during dynptr_data lookup to distinguish them.
>
> What about if the dynptr_pkt type tag is set, then we compare the
> ranges as well? If the ranges are the same, then we return true, else
> false. Does that work?

Seems like it, and true part is already covered by memcmp at the start
of the function, I think.
Martin KaFai Lau Aug. 26, 2022, 6:37 a.m. UTC | #12
On Thu, Aug 25, 2022 at 01:04:16AM +0200, Kumar Kartikeya Dwivedi wrote:
> On Wed, 24 Aug 2022 at 20:11, Joanne Koong <joannelkoong@gmail.com> wrote:
> > I'm more and more liking the idea of limiting xdp to match the
> > constraints of skb given that both you, Kumar, and Jakub have
> > mentioned that portability between xdp and skb would be useful for
> > users :)
> >
> > What are your thoughts on this API:
> >
> > 1) bpf_dynptr_data()
> >
> > Before:
> >   for skb-type progs:
> >       - data slices in fragments is not supported
> >
> >   for xdp-type progs:
> >       - data slices in fragments is supported as long as it is in a
> > contiguous frag (eg not across frags)
> >
> > Now:
> >   for skb + xdp type progs:
> >       - data slices in fragments is not supported
I don't think it is necessary (or helps) to restrict the xdp slice from
getting a fragment.  In any case, the xdp prog has to deal with the case
that bpf_dynptr_data(xdp_dynptr, offset, len) is across two fragments.
Although unlikely, it still needs to handle it (dynptr_data returns NULL)
properly by using bpf_dynptr_read(), the same way that the skb case
also needs to handle dynptr_data returning NULL.
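
e.g. given an initialized dynptr 'ptr' (sketch):

struct ethhdr *eth, eth_copy;

eth = bpf_dynptr_data(&ptr, 0, sizeof(*eth));
if (!eth) {
        /* the range crosses a frag boundary, fall back to a copy */
        if (bpf_dynptr_read(&eth_copy, sizeof(eth_copy), &ptr, 0, 0))
                return XDP_DROP;
        eth = &eth_copy;
}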

Something like Kumar's sample in [0] should work for both
xdp and skb dynptr but replace the helpers with
bpf_dynptr_{data,read,write}().

[0]: https://lore.kernel.org/bpf/20220726184706.954822-1-joannelkoong@gmail.com/T/#mf082a11403bc76fa56fde4de79a25c660680285c

> >
> >
> > 2)  bpf_dynptr_write()
> >
> > Before:
> >   for skb-type progs:
> >      - all data slices are invalidated after a write
> >
> >   for xdp-type progs:
> >      - nothing
> >
> > Now:
> >   for skb + xdp type progs:
> >      - all data slices are invalidated after a write
I wonder if the 'Before' behavior can be kept as is.

The bpf prog that runs in both xdp and bpf should be
the one always expecting the data-slice will be invalidated and
it has to call the bpf_dynptr_data() again after writing.
Yes, it is unnecessary for xdp but the bpf prog needs to do the
same anyway if the verifier was the one enforcing it for
both skb and xdp dynptr type.

If the bpf prog is written for xdp alone, then it has
no need to re-get the bpf_dynptr_data() after writing.
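
i.e. a portable prog would always just do (sketch):

__u8 *data;

bpf_dynptr_write(&ptr, off, &val, sizeof(val), 0);
/* assume the write invalidated any slice, so re-get it */
data = bpf_dynptr_data(&ptr, off, len);
if (!data)
        return XDP_DROP; /* or the prog's error path */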

> >
> 
> There is also the other option: failing to write until you pull skb,
> which looks a lot better to me if we are adding this helper anyway...
Martin KaFai Lau Aug. 26, 2022, 6:50 a.m. UTC | #13
On Thu, Aug 25, 2022 at 11:37:08PM -0700, Martin KaFai Lau wrote:
> On Thu, Aug 25, 2022 at 01:04:16AM +0200, Kumar Kartikeya Dwivedi wrote:
> > On Wed, 24 Aug 2022 at 20:11, Joanne Koong <joannelkoong@gmail.com> wrote:
> > > I'm more and more liking the idea of limiting xdp to match the
> > > constraints of skb given that both you, Kumar, and Jakub have
> > > mentioned that portability between xdp and skb would be useful for
> > > users :)
> > >
> > > What are your thoughts on this API:
> > >
> > > 1) bpf_dynptr_data()
> > >
> > > Before:
> > >   for skb-type progs:
> > >       - data slices in fragments is not supported
> > >
> > >   for xdp-type progs:
> > >       - data slices in fragments is supported as long as it is in a
> > > contiguous frag (eg not across frags)
> > >
> > > Now:
> > >   for skb + xdp type progs:
> > >       - data slices in fragments is not supported
> I don't think it is necessary (or helps) to restrict the xdp slice from getting
> a fragment.  In any case, the xdp prog has to deal with the case
> that bpf_dynptr_data(xdp_dynptr, offset, len) is across two fragments.
> Although unlikely, it still needs to handle it (dynptr_data returns NULL)
> properly by using bpf_dynptr_read(), the same way that the skb case
> also needs to handle dynptr_data returning NULL.
> 
> Something like Kumar's sample in [0] should work for both
> xdp and skb dynptr but replace the helpers with
> bpf_dynptr_{data,read,write}().
> 
> [0]: https://lore.kernel.org/bpf/20220726184706.954822-1-joannelkoong@gmail.com/T/#mf082a11403bc76fa56fde4de79a25c660680285c
> 
> > >
> > >
> > > 2)  bpf_dynptr_write()
> > >
> > > Before:
> > >   for skb-type progs:
> > >      - all data slices are invalidated after a write
> > >
> > >   for xdp-type progs:
> > >      - nothing
> > >
> > > Now:
> > >   for skb + xdp type progs:
> > >      - all data slices are invalidated after a write
> I wonder if the 'Before' behavior can be kept as is.
> 
> The bpf prog that runs in both xdp and bpf should be
typo: both xdp and *skb

> the one always expecting the data-slice will be invalidated and
> it has to call the bpf_dynptr_data() again after writing.
> Yes, it is unnecessary for xdp but the bpf prog needs to do the
> same anyway if the verifier was the one enforcing it for
> both skb and xdp dynptr type.
> 
> If the bpf prog is written for xdp alone, then it has
> no need to re-get the bpf_dynptr_data() after writing.
> 
> > >
> > 
> > There is also the other option: failing to write until you pull skb,
> > which looks a lot better to me if we are adding this helper anyway...
Joanne Koong Aug. 26, 2022, 6:23 p.m. UTC | #14
On Thu, Aug 25, 2022 at 4:19 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Thu, 25 Aug 2022 at 22:39, Joanne Koong <joannelkoong@gmail.com> wrote:
> >
> > On Wed, Aug 24, 2022 at 2:11 PM Kumar Kartikeya Dwivedi
> > <memxor@gmail.com> wrote:
> > >
> > > On Wed, 24 Aug 2022 at 00:27, Joanne Koong <joannelkoong@gmail.com> wrote:
> > > >
> > > > On Mon, Aug 22, 2022 at 7:31 PM Kumar Kartikeya Dwivedi
> > > > <memxor@gmail.com> wrote:
> > > > > > [...]
> > > > > >                 if (func_id == BPF_FUNC_dynptr_data &&
> > > > > > -                   dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> > > > > > +                   (dynptr_type == BPF_DYNPTR_TYPE_SKB ||
> > > > > > +                    dynptr_type == BPF_DYNPTR_TYPE_XDP)) {
> > > > > >                         regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
> > > > > >                         regs[BPF_REG_0].range = meta.mem_size;
> > > > >
> > > > > It doesn't seem like this is safe. Since PTR_TO_PACKET's range can be
> > > > > modified by comparisons with packet pointers loaded from the xdp/skb
> > > > > ctx, how do we distinguish e.g. between a pkt slice obtained from some
> > > > > frag in a multi-buff XDP vs pkt pointer from a linear area?
> > > > >
> > > > > Someone can compare data_meta from ctx with PTR_TO_PACKET from
> > > > > bpf_dynptr_data on xdp dynptr (which might be pointing to a xdp mb
> > > > > frag). While MAX_PACKET_OFF is 0xffff, it can still be used to do OOB
> > > > > access for the linear area. reg_is_init_pkt_pointer will return true
> > > > > as modified range is not considered for it. Same kind of issues when
> > > > > doing comparison with data_end from ctx (though maybe you won't be
> > > > > able to do incorrect data access at runtime using that).
> > > > >
> > > > > I had a pkt_uid field in my patch [0] which disallowed comparisons
> > > > > among bpf_packet_pointer slices. Each call assigned a fresh pkt_uid,
> > > > > and that disabled comparisons for them. reg->id is used for var_off
> > > > > range propagation so it cannot be reused.
> > > > >
> > > > > Coming back to this: What we really want here is a PTR_TO_MEM with a
> > > > > mem_size, so maybe you should go that route instead of PTR_TO_PACKET
> > > > > (and add a type tag to maybe pretty print it also as a packet pointer
> > > > > in verifier log), or add some way to distinguish slice vs non-slice
> > > > > pkt pointers like I did in my patch. You might also want to add some
> > > > > tests for this corner case (there are some later in [0] if you want to
> > > > > reuse them).
> > > > >
> > > > > So TBH, I kinda dislike my own solution in [0] :). The complexity does
> > > > > not seem worth it. The pkt_uid distinction is more useful (and
> > > > > actually would be needed) in Toke's xdp queueing series, where in a
> > > > > dequeue program you have multiple xdp_mds and want scoped slice
> > > > > invalidations (i.e. adjust_head on one xdp_md doesn't invalidate
> > > > > slices of some other xdp_md). Here we can just get away with normal
> > > > > PTR_TO_MEM.
> > > > >
> > > > > ... Or just let me know if you handle this correctly already, or if
> > > > > this won't be an actual problem :).
> > > >
> > > > Ooh interesting, I hadn't previously taken a look at
> > > > try_match_pkt_pointers(), thanks for mentioning it :)
> > > >
> > > > The cleanest solution to me is to add the flag "DYNPTR_TYPE_{SKB/XDP}"
> > > > to PTR_TO_PACKET and change reg_is_init_pkt_pointer() to return false
> > > > if the DYNPTR_TYPE_{SKB/XDP} flag is present. I prefer this over
> > > > returning PTR_TO_MEM because it seems more robust (eg if in the future
> > > > we reject x behavior on the packet data reg types, this will
> > > > automatically apply to the data slices), and because it'll keep the
> > > > logic more efficient/simpler for the case when the pkt pointer has to
> > > > be cleared after any helper that changes pkt data is called (aka the
> > > > case where the data slice gets invalidated).
> > > >
> > > > What are your thoughts?
> > > >
> > >
> > > Thinking more deeply about it, probably not, we need more work here. I
> > > remember _now_ why I chose the pkt_uid approach (and this tells us my
> > > commit log lacks all the details about the motivation :( ).
> > >
> > > Consider how equivalency checking for packet pointers works in
> > > regsafe. It is checking type, then if old range > cur range, then
> > > offs, etc.
> > >
> > > The problem is, while we now don't prune on access to ptr_to_pkt vs
> > > ptr_to_pkt | dynptr_pkt types in same reg (since type differs we
> > > return false), we still prune if old range of ptr_to_pkt | dynptr_pkt
> > > > cur range of ptr_to_pkt | dynptr_pkt. Both could be pointing into
> > > separate frags, so this assumption would be incorrect. I would be able
> > > to trick the verifier into accessing data beyond the length of a
> > > different frag, by first making sure one line of exploration is
> > > verified, and then changing the register in another branch reaching
> > > the same branch target. Helpers can take packet pointers so the access
> > > can become a pruning point. It would think the rest of the stuff is
> > > safe, while they are not equivalent at all. It is ok if they are bit
> > > by bit equivalent (same type, range, off, etc.).
> >
> > Thanks for the explanation. To clarify, if old range of ptr_to_pkt >
> > cur range of ptr_to_pkt, what gets pruned? Is it access to cur range
> > of ptr_to_pkt since if old range > cur range, then if old range is
> > acceptable cur range must definitely be acceptable?
>
> No, my description was bad :).
> We return false when old_range > cur_range, i.e. the path is
> considered safe and not explored again when old_range <= cur_range
> (pruning), otherwise we continue verifying.
> Consider if it was doing pkt[cur_range + 1] access in the path we are
> about to explore again (already verified for old_range). That is <=
> old_range, but > cur_range, so it would be problematic if we pruned
> search for old_range > cur_range.

Does "old_range" here refer to the range that was already previously
verified as safe by the verifier? And "cur_range" is the new range
that we are trying to figure out is safe or not?

When you say "we return false when old_range > cur_range", what
function are we returning false from?

>
> So maybe it won't be a problem here, and just the current range checks
> for pkt pointer slices is fine even if they belong to different frags.
> I didn't craft any test case when writing my previous reply.
> Especially since you will disable comparisons, one cannot relearn
> range again using var_off + comparison, which closes another loophole.
>
> It just seems simpler to me to be a bit more conservative, since it is
> only an optimization. There might be some corner case lurking we can't
> think of right now. But I leave the judgement up to you if you can
> reason about it. In either case it would be good to include some
> comments in the commit log about all this.
>
> Meanwhile, looking at the current code, I'm more inclined to suggest
> PTR_TO_MEM (and handle invalidation specially), but again, I will
> leave it up to you to decide.
>
> When we do += var_off to a pkt reg, its range is reset to zero,
> compared to PTR_TO_MEM, where off + var_off (smin/umax) is used to
> check against the actual size for an access, which is a bit more
> flexible. The reason to reset range is that it will be relearned using
> comparisons and transferred to copies (reg->id is assigned for each +=

Can you direct me to the function where this relearning happens? thanks!

> var_off), which doesn't apply to slice pointers (essentially the only
> reason to keep them is being able to pick them for invalidation), we
> try to disable the rest of the pkt pointer magic in the verifier,
> anyway.
>
> pkt_reg->umax_value influences the prog->aux->max_pkt_offset (and iiuc
> it can reach that point with range == 0 after += var_off, and
> zero_size_allowed == true), only seems to be used by netronome's ebpf
> offload for now, but still a bit confusing if slice pkt pointers cause
> this to change.

My major takeaway from this discussion is that there's a lot of extra
subtleties when reg is PTR_TO_PACKET :) I'm going to delve deeper into
the source code, but from a glance, I think you're right that just
assigning PTR_TO_MEM for the data slice will probably make things a
lot more straightforward. thanks for the discussion!

>
> >
> > >
> > > If you start returning false whenever you see this type tag set, it
> > > will become too conservative (it considers reg copies of the same
> > > dynptr_data lookup as distinct). So you need some kind of id assigned
> > > during dynptr_data lookup to distinguish them.
> >
> > What about if the dynptr_pkt type tag is set, then we compare the
> > ranges as well? If the ranges are the same, then we return true, else
> > false. Does that work?
>
> Seems like it, and true part is already covered by memcmp at the start
> of the function, I think.
Kumar Kartikeya Dwivedi Aug. 26, 2022, 6:31 p.m. UTC | #15
On Fri, 26 Aug 2022 at 20:23, Joanne Koong <joannelkoong@gmail.com> wrote:
>
> On Thu, Aug 25, 2022 at 4:19 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Thu, 25 Aug 2022 at 22:39, Joanne Koong <joannelkoong@gmail.com> wrote:
> > >
> > > On Wed, Aug 24, 2022 at 2:11 PM Kumar Kartikeya Dwivedi
> > > <memxor@gmail.com> wrote:
> > > >
> > > > On Wed, 24 Aug 2022 at 00:27, Joanne Koong <joannelkoong@gmail.com> wrote:
> > > > >
> > > > > On Mon, Aug 22, 2022 at 7:31 PM Kumar Kartikeya Dwivedi
> > > > > <memxor@gmail.com> wrote:
> > > > > > > [...]
> > > > > > >                 if (func_id == BPF_FUNC_dynptr_data &&
> > > > > > > -                   dynptr_type == BPF_DYNPTR_TYPE_SKB) {
> > > > > > > +                   (dynptr_type == BPF_DYNPTR_TYPE_SKB ||
> > > > > > > +                    dynptr_type == BPF_DYNPTR_TYPE_XDP)) {
> > > > > > >                         regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
> > > > > > >                         regs[BPF_REG_0].range = meta.mem_size;
> > > > > >
> > > > > > It doesn't seem like this is safe. Since PTR_TO_PACKET's range can be
> > > > > > modified by comparisons with packet pointers loaded from the xdp/skb
> > > > > > ctx, how do we distinguish e.g. between a pkt slice obtained from some
> > > > > > frag in a multi-buff XDP vs pkt pointer from a linear area?
> > > > > >
> > > > > > Someone can compare data_meta from ctx with PTR_TO_PACKET from
> > > > > > bpf_dynptr_data on xdp dynptr (which might be pointing to a xdp mb
> > > > > > frag). While MAX_PACKET_OFF is 0xffff, it can still be used to do OOB
> > > > > > access for the linear area. reg_is_init_pkt_pointer will return true
> > > > > > as modified range is not considered for it. Same kind of issues when
> > > > > > doing comparison with data_end from ctx (though maybe you won't be
> > > > > > able to do incorrect data access at runtime using that).
> > > > > >
> > > > > > I had a pkt_uid field in my patch [0] which disallowed comparisons
> > > > > > among bpf_packet_pointer slices. Each call assigned a fresh pkt_uid,
> > > > > > and that disabled comparisons for them. reg->id is used for var_off
> > > > > > range propagation so it cannot be reused.
> > > > > >
> > > > > > Coming back to this: What we really want here is a PTR_TO_MEM with a
> > > > > > mem_size, so maybe you should go that route instead of PTR_TO_PACKET
> > > > > > (and add a type tag to maybe pretty print it also as a packet pointer
> > > > > > in verifier log), or add some way to distinguish slice vs non-slice
> > > > > > pkt pointers like I did in my patch. You might also want to add some
> > > > > > tests for this corner case (there are some later in [0] if you want to
> > > > > > reuse them).
> > > > > >
> > > > > > So TBH, I kinda dislike my own solution in [0] :). The complexity does
> > > > > > not seem worth it. The pkt_uid distinction is more useful (and
> > > > > > actually would be needed) in Toke's xdp queueing series, where in a
> > > > > > dequeue program you have multiple xdp_mds and want scoped slice
> > > > > > invalidations (i.e. adjust_head on one xdp_md doesn't invalidate
> > > > > > slices of some other xdp_md). Here we can just get away with normal
> > > > > > PTR_TO_MEM.
> > > > > >
> > > > > > ... Or just let me know if you handle this correctly already, or if
> > > > > > this won't be an actual problem :).
> > > > >
> > > > > Ooh interesting, I hadn't previously taken a look at
> > > > > try_match_pkt_pointers(), thanks for mentioning it :)
> > > > >
> > > > > The cleanest solution to me is to add the flag "DYNPTR_TYPE_{SKB/XDP}"
> > > > > to PTR_TO_PACKET and change reg_is_init_pkt_pointer() to return false
> > > > > if the DYNPTR_TYPE_{SKB/XDP} flag is present. I prefer this over
> > > > > returning PTR_TO_MEM because it seems more robust (eg if in the future
> > > > > we reject x behavior on the packet data reg types, this will
> > > > > automatically apply to the data slices), and because it'll keep the
> > > > > logic more efficient/simpler for the case when the pkt pointer has to
> > > > > be cleared after any helper that changes pkt data is called (aka the
> > > > > case where the data slice gets invalidated).
> > > > >
> > > > > What are your thoughts?
> > > > >
> > > >
> > > > Thinking more deeply about it, probably not, we need more work here. I
> > > > remember _now_ why I chose the pkt_uid approach (and this tells us my
> > > > commit log lacks all the details about the motivation :( ).
> > > >
> > > > Consider how equivalency checking for packet pointers works in
> > > > regsafe. It is checking type, then if old range > cur range, then
> > > > offs, etc.
> > > >
> > > > The problem is, while we now don't prune on access to ptr_to_pkt vs
> > > > ptr_to_pkt | dynptr_pkt types in same reg (since type differs we
> > > > return false), we still prune if old range of ptr_to_pkt | dynptr_pkt
> > > > > cur range of ptr_to_pkt | dynptr_pkt. Both could be pointing into
> > > > separate frags, so this assumption would be incorrect. I would be able
> > > > to trick the verifier into accessing data beyond the length of a
> > > > different frag, by first making sure one line of exploration is
> > > > verified, and then changing the register in another branch reaching
> > > > the same branch target. Helpers can take packet pointers so the access
> > > > can become a pruning point. It would think the rest of the stuff is
> > > > safe, while they are not equivalent at all. It is ok if they are bit
> > > > by bit equivalent (same type, range, off, etc.).
> > >
> > > Thanks for the explanation. To clarify, if old range of ptr_to_pkt >
> > > cur range of ptr_to_pkt, what gets pruned? Is it access to cur range
> > > of ptr_to_pkt since if old range > cur range, then if old range is
> > > acceptable cur range must definitely be acceptable?
> >
> > No, my description was bad :).
> > We return false when old_range > cur_range, i.e. the path is
> > considered safe and not explored again when old_range <= cur_range
> > (pruning), otherwise we continue verifying.
> > Consider if it was doing pkt[cur_range + 1] access in the path we are
> > about to explore again (already verified for old_range). That is <=
> > old_range, but > cur_range, so it would be problematic if we pruned
> > search for old_range > cur_range.
>
> Does "old_range" here refer to the range that was already previously
> verified as safe by the verifier? And "cur_range" is the new range
> that we are trying to figure out is safe or not?

Yes, it's the range of the same reg in the already safe verified state
from the point we are about to explore again (and considering
'pruning' the search if the current state would satisfy the safety
properties as well).

>
> When you say "we return false when old_range > cur_range", what
> function are we returning false from?
>

We return false from the regsafe function, i.e. it isn't safe if the
old range was greater than the current range, otherwise we match other
stuff before we return true (like off being equal, and var_off of old
being within cur's).

> >
> > So maybe it won't be a problem here, and just the current range checks
> > for pkt pointer slices is fine even if they belong to different frags.
> > I didn't craft any test case when writing my previous reply.
> > Especially since you will disable comparisons, one cannot relearn
> > range again using var_off + comparison, which closes another loophole.
> >
> > It just seems simpler to me to be a bit more conservative, since it is
> > only an optimization. There might be some corner case lurking we can't
> > think of right now. But I leave the judgement up to you if you can
> > reason about it. In either case it would be good to include some
> > comments in the commit log about all this.
> >
> > Meanwhile, looking at the current code, I'm more inclined to suggest
> > PTR_TO_MEM (and handle invalidation specially), but again, I will
> > leave it up to you to decide.
> >
> > When we do += var_off to a pkt reg, its range is reset to zero,
> > compared to PTR_TO_MEM, where off + var_off (smin/umax) is used to
> > check against the actual size for an access, which is a bit more
> > flexible. The reason to reset range is that it will be relearned using
> > comparisons and transferred to copies (reg->id is assigned for each +=
>
> Can you direct me to the function where this relearning happens? thanks!
>

See the code around the comment
/* something was added to pkt_ptr, set range to zero */
in adjust_ptr_min_max_vals, where it memsets the range to 0, and then
see how find_good_pkt_pointers also takes into account
reg->umax_value + off before setting reg->range after your next
comparison (because at runtime the variable offset is already added to
the pointer).
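
The relearning itself is the usual pattern:

data += var;              /* reg->range reset to 0 here */
if (data + 4 > data_end)  /* find_good_pkt_pointers() runs here */
        return XDP_DROP;
/* reg->range is now 4 for 'data' and its copies (same reg->id) */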

> > var_off), which doesn't apply to slice pointers (essentially the only
> > reason to keep them is being able to pick them for invalidation), we
> > try to disable the rest of the pkt pointer magic in the verifier,
> > anyway.
> >
> > pkt_reg->umax_value influences the prog->aux->max_pkt_offset (and iiuc
> > it can reach that point with range == 0 after += var_off, and
> > zero_size_allowed == true), only seems to be used by netronome's ebpf
> > offload for now, but still a bit confusing if slice pkt pointers cause
> > this to change.
>
> My major takeaway from this discussion is that there's a lot of extra
> subtleties when reg is PTR_TO_PACKET :) I'm going to delve deeper into
> the source code, but from a glance, I think you're right that just
> assigning PTR_TO_MEM for the data slice will probably make things a
> lot more straightforward. thanks for the discussion!

Exactly. I think this is an internal verifier detail so we can
definitely go back and change this later, but IMO we'd avoid a lot of
headache if we just choose PTR_TO_MEM, since PTR_TO_PACKET has a lot
of special semantics.

>
> >
> > >
> > > >
> > > > If you start returning false whenever you see this type tag set, it
> > > > will become too conservative (it considers reg copies of the same
> > > > dynptr_data lookup as distinct). So you need some kind of id assigned
> > > > during dynptr_data lookup to distinguish them.
> > >
> > > What about if the dynptr_pkt type tag is set, then we compare the
> > > ranges as well? If the ranges are the same, then we return true, else
> > > false. Does that work?
> >
> > Seems like it, and true part is already covered by memcmp at the start
> > of the function, I think.
Kumar Kartikeya Dwivedi Aug. 26, 2022, 7:09 p.m. UTC | #16
On Fri, 26 Aug 2022 at 08:37, Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Thu, Aug 25, 2022 at 01:04:16AM +0200, Kumar Kartikeya Dwivedi wrote:
> > On Wed, 24 Aug 2022 at 20:11, Joanne Koong <joannelkoong@gmail.com> wrote:
> > > I'm more and more liking the idea of limiting xdp to match the
> > > constraints of skb given that both you, Kumar, and Jakub have
> > > mentioned that portability between xdp and skb would be useful for
> > > users :)
> > >
> > > What are your thoughts on this API:
> > >
> > > 1) bpf_dynptr_data()
> > >
> > > Before:
> > >   for skb-type progs:
> > >       - data slices in fragments is not supported
> > >
> > >   for xdp-type progs:
> > >       - data slices in fragments is supported as long as it is in a
> > > contiguous frag (eg not across frags)
> > >
> > > Now:
> > >   for skb + xdp type progs:
> > >       - data slices in fragments is not supported
> I don't think it is necessary (or helpful) to restrict an xdp slice from
> getting a fragment.  In any case, the xdp prog has to deal with the case
> that bpf_dynptr_data(xdp_dynptr, offset, len) spans two fragments.
> Although unlikely, it still needs to handle that (dynptr_data returns
> NULL) properly by falling back to bpf_dynptr_read().  Likewise, the skb
> case also needs to handle dynptr_data returning NULL.
>
> Something like Kumar's sample in [0] should work for both
> xdp and skb dynptr but replace the helpers with
> bpf_dynptr_{data,read,write}().
>
> [0]: https://lore.kernel.org/bpf/20220726184706.954822-1-joannelkoong@gmail.com/T/#mf082a11403bc76fa56fde4de79a25c660680285c
>
> > >
> > >
> > > 2)  bpf_dynptr_write()
> > >
> > > Before:
> > >   for skb-type progs:
> > >      - all data slices are invalidated after a write
> > >
> > >   for xdp-type progs:
> > >      - nothing
> > >
> > > Now:
> > >   for skb + xdp type progs:
> > >      - all data slices are invalidated after a write
> I wonder if the 'Before' behavior can be kept as is.
>
> A bpf prog that runs on both xdp and skb should be the one that always
> expects the data slice to be invalidated, and it has to call
> bpf_dynptr_data() again after writing.  Yes, it is unnecessary for xdp,
> but the bpf prog would need to do the same anyway if the verifier were
> the one enforcing it for both the skb and xdp dynptr types.
>
> If the bpf prog is written for xdp alone, then it has no need to
> re-fetch bpf_dynptr_data() after writing.
>

Yeah, compared to the alternative, I guess it's better the way it is
right now. It doesn't seem possible to meaningfully hide the differences
without penalizing either case. It also wouldn't be good to make it
less useful for XDP, since the whole point of this and the previous
effort was exposing bpf_xdp_pointer to the user.
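
For reference, the pattern Martin points to in [0] (try bpf_dynptr_data()
first, fall back to bpf_dynptr_read() when it returns NULL) would look
something like this with the helpers in this series. This is a sketch
only: it assumes helper declarations generated from the patched UAPI
header, and the ethernet parsing is just illustrative.

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int xdp_parse_eth(struct xdp_md *ctx)
{
	struct ethhdr *eth, eth_copy;
	struct bpf_dynptr ptr;

	if (bpf_dynptr_from_xdp(ctx, 0, &ptr))
		return XDP_DROP;

	eth = bpf_dynptr_data(&ptr, 0, sizeof(*eth));
	if (!eth) {
		/* the requested bytes span fragments (or the dynptr is
		 * invalid/read-only), so copy them out instead
		 */
		if (bpf_dynptr_read(&eth_copy, sizeof(eth_copy), &ptr, 0, 0))
			return XDP_DROP;
		eth = &eth_copy;
	}

	/* the same code works unchanged with an skb dynptr */
	return eth->h_proto == bpf_htons(ETH_P_IP) ? XDP_PASS : XDP_DROP;
}

char _license[] SEC("license") = "GPL";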
Joanne Koong Aug. 26, 2022, 8:47 p.m. UTC | #17
On Fri, Aug 26, 2022 at 12:09 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> [...]
>
> Yeah, compared to the alternative, I guess it's better how it is right
> now. It doesn't seem possible to meaningfully hide the differences
> without penalizing either case. It also wouldn't be good to make it
> less useful for XDP, since the whole point of this and the previous
> effort was exposing bpf_xdp_pointer to the user.

I'll keep it as is for v5.
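
So the portable expectation stays as in this version: treat every
bpf_dynptr_write() as invalidating any outstanding data slices and
re-fetch them afterwards. A sketch of that pattern (the offset here is
hypothetical, and the helper signatures are the ones added by this
series):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_write_then_reslice(struct xdp_md *ctx)
{
	__u8 zeros[2] = {};
	struct bpf_dynptr ptr;
	__u8 *slice;

	if (bpf_dynptr_from_xdp(ctx, 0, &ptr))
		return XDP_DROP;

	slice = bpf_dynptr_data(&ptr, 16, 2);	/* hypothetical offset */
	if (!slice)
		return XDP_DROP;

	if (bpf_dynptr_write(&ptr, 16, zeros, sizeof(zeros), 0))
		return XDP_DROP;

	/* for skb dynptrs the write invalidates 'slice'; re-fetching keeps
	 * the same source verifiable as both an skb and an xdp program
	 */
	slice = bpf_dynptr_data(&ptr, 16, 2);
	if (!slice)
		return XDP_DROP;

	return slice[0] ? XDP_DROP : XDP_PASS;
}

char _license[] SEC("license") = "GPL";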
diff mbox series

Patch

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 30615d1a0c13..455a215b6c57 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -410,11 +410,15 @@  enum bpf_type_flag {
 	/* DYNPTR points to sk_buff */
 	DYNPTR_TYPE_SKB		= BIT(11 + BPF_BASE_TYPE_BITS),
 
+	/* DYNPTR points to xdp_buff */
+	DYNPTR_TYPE_XDP		= BIT(12 + BPF_BASE_TYPE_BITS),
+
 	__BPF_TYPE_FLAG_MAX,
 	__BPF_TYPE_LAST_FLAG	= __BPF_TYPE_FLAG_MAX - 1,
 };
 
-#define DYNPTR_TYPE_FLAG_MASK	(DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF | DYNPTR_TYPE_SKB)
+#define DYNPTR_TYPE_FLAG_MASK	(DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF | DYNPTR_TYPE_SKB \
+				 | DYNPTR_TYPE_XDP)
 
 /* Max number of base types. */
 #define BPF_BASE_TYPE_LIMIT	(1UL << BPF_BASE_TYPE_BITS)
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 649063d9cbfd..80f030239877 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1535,5 +1535,8 @@  static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifind
 int __bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void *to, u32 len);
 int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from,
 			  u32 len, u64 flags);
+int __bpf_xdp_load_bytes(struct xdp_buff *xdp, u32 offset, void *buf, u32 len);
+int __bpf_xdp_store_bytes(struct xdp_buff *xdp, u32 offset, void *buf, u32 len);
+void *bpf_xdp_pointer(struct xdp_buff *xdp, u32 offset, u32 len);
 
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 320e6b95d95c..9feea29eebcd 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5283,13 +5283,18 @@  union bpf_attr {
  *		      and try again.
  *
  *		    * The data slice is automatically invalidated anytime
- *		      **bpf_dynptr_write**\ () or a helper call that changes
- *		      the underlying packet buffer (eg **bpf_skb_pull_data**\ ())
+ *		      **bpf_dynptr_write**\ () is called.
+ *
+ *		For skb-type and xdp-type dynptrs:
+ *		    * The data slice is automatically invalidated anytime a
+ *		      helper call that changes the underlying packet buffer
+ *		      (eg **bpf_skb_pull_data**\ (), **bpf_xdp_adjust_head**\ ())
  *		      is called.
  *	Return
  *		Pointer to the underlying dynptr data, NULL if the dynptr is
  *		read-only, if the dynptr is invalid, or if the offset and length
- *		is out of bounds or in a paged buffer for skb-type dynptrs.
+ *		is out of bounds or in a paged buffer for skb-type dynptrs or
+ *		across fragments for xdp-type dynptrs.
  *
  * s64 bpf_tcp_raw_gen_syncookie_ipv4(struct iphdr *iph, struct tcphdr *th, u32 th_len)
  *	Description
@@ -5388,6 +5393,19 @@  union bpf_attr {
  *		*flags* is currently unused, it must be 0 for now.
  *	Return
  *		0 on success or -EINVAL if flags is not 0.
+ *
+ * long bpf_dynptr_from_xdp(struct xdp_buff *xdp_md, u64 flags, struct bpf_dynptr *ptr)
+ *	Description
+ *		Get a dynptr to the data in *xdp_md*. *xdp_md* must be the BPF program
+ *		context.
+ *
+ *		Calls that change the *xdp_md*'s underlying packet buffer
+ *		(eg **bpf_xdp_adjust_head**\ ()) do not invalidate the dynptr, but
+ *		they do invalidate any data slices associated with the dynptr.
+ *
+ *		*flags* is currently unused, it must be 0 for now.
+ *	Return
+ *		0 on success, -EINVAL if flags is not 0.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5600,6 +5618,7 @@  union bpf_attr {
 	FN(tcp_raw_check_syncookie_ipv6),	\
 	FN(ktime_get_tai_ns),		\
 	FN(dynptr_from_skb),		\
+	FN(dynptr_from_xdp),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 471a01a9b6ae..2b9dc4c6de04 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1541,6 +1541,8 @@  BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len, struct bpf_dynptr_kern *, src
 		return 0;
 	case BPF_DYNPTR_TYPE_SKB:
 		return __bpf_skb_load_bytes(src->data, src->offset + offset, dst, len);
+	case BPF_DYNPTR_TYPE_XDP:
+		return __bpf_xdp_load_bytes(src->data, src->offset + offset, dst, len);
 	default:
 		WARN(true, "bpf_dynptr_read: unknown dynptr type %d\n", type);
 		return -EFAULT;
@@ -1583,6 +1585,10 @@  BPF_CALL_5(bpf_dynptr_write, struct bpf_dynptr_kern *, dst, u32, offset, void *,
 	case BPF_DYNPTR_TYPE_SKB:
 		return __bpf_skb_store_bytes(dst->data, dst->offset + offset, src, len,
 					     flags);
+	case BPF_DYNPTR_TYPE_XDP:
+		if (flags)
+			return -EINVAL;
+		return __bpf_xdp_store_bytes(dst->data, dst->offset + offset, src, len);
 	default:
 		WARN(true, "bpf_dynptr_write: unknown dynptr type %d\n", type);
 		return -EFAULT;
@@ -1616,7 +1622,7 @@  BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len
 
 	type = bpf_dynptr_get_type(ptr);
 
-	/* Only skb dynptrs can get read-only data slices, because the
+	/* Only skb and xdp dynptrs can get read-only data slices, because the
 	 * verifier enforces PTR_TO_PACKET accesses
 	 */
 	is_rdonly = bpf_dynptr_is_rdonly(ptr);
@@ -1640,6 +1646,12 @@  BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len
 		data = skb->data;
 		break;
 	}
+	case BPF_DYNPTR_TYPE_XDP:
+		/* if the requested data is across fragments, then it cannot
+		 * be accessed directly - bpf_xdp_pointer will return NULL
+		 */
+		return (unsigned long)bpf_xdp_pointer(ptr->data,
+						      ptr->offset + offset, len);
 	default:
 		WARN(true, "bpf_dynptr_data: unknown dynptr type %d\n", type);
 		return 0;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1ea295f47525..d33648eee188 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -686,6 +686,8 @@  static enum bpf_dynptr_type arg_to_dynptr_type(enum bpf_arg_type arg_type)
 		return BPF_DYNPTR_TYPE_RINGBUF;
 	case DYNPTR_TYPE_SKB:
 		return BPF_DYNPTR_TYPE_SKB;
+	case DYNPTR_TYPE_XDP:
+		return BPF_DYNPTR_TYPE_XDP;
 	default:
 		return BPF_DYNPTR_TYPE_INVALID;
 	}
@@ -6078,6 +6080,9 @@  static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
 			case DYNPTR_TYPE_SKB:
 				err_extra = "skb ";
 				break;
+			case DYNPTR_TYPE_XDP:
+				err_extra = "xdp ";
+				break;
 			}
 
 			verbose(env, "Expected an initialized %sdynptr as arg #%d\n",
@@ -7439,7 +7444,8 @@  static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 		mark_reg_known_zero(env, regs, BPF_REG_0);
 
 		if (func_id == BPF_FUNC_dynptr_data &&
-		    dynptr_type == BPF_DYNPTR_TYPE_SKB) {
+		    (dynptr_type == BPF_DYNPTR_TYPE_SKB ||
+		     dynptr_type == BPF_DYNPTR_TYPE_XDP)) {
 			regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
 			regs[BPF_REG_0].range = meta.mem_size;
 		} else {
diff --git a/net/core/filter.c b/net/core/filter.c
index 5b204b42fb3e..54fbe8f511db 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3825,7 +3825,29 @@  static const struct bpf_func_proto sk_skb_change_head_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
-BPF_CALL_1(bpf_xdp_get_buff_len, struct  xdp_buff*, xdp)
+BPF_CALL_3(bpf_dynptr_from_xdp, struct xdp_buff*, xdp, u64, flags,
+	   struct bpf_dynptr_kern *, ptr)
+{
+	if (flags) {
+		bpf_dynptr_set_null(ptr);
+		return -EINVAL;
+	}
+
+	bpf_dynptr_init(ptr, xdp, BPF_DYNPTR_TYPE_XDP, 0, xdp_get_buff_len(xdp));
+
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_dynptr_from_xdp_proto = {
+	.func		= bpf_dynptr_from_xdp,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_PTR_TO_DYNPTR | DYNPTR_TYPE_XDP | MEM_UNINIT,
+};
+
+BPF_CALL_1(bpf_xdp_get_buff_len, struct xdp_buff*, xdp)
 {
 	return xdp_get_buff_len(xdp);
 }
@@ -3927,7 +3949,7 @@  static void bpf_xdp_copy_buf(struct xdp_buff *xdp, unsigned long off,
 	}
 }
 
-static void *bpf_xdp_pointer(struct xdp_buff *xdp, u32 offset, u32 len)
+void *bpf_xdp_pointer(struct xdp_buff *xdp, u32 offset, u32 len)
 {
 	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
 	u32 size = xdp->data_end - xdp->data;
@@ -3958,8 +3980,7 @@  static void *bpf_xdp_pointer(struct xdp_buff *xdp, u32 offset, u32 len)
 	return offset + len <= size ? addr + offset : NULL;
 }
 
-BPF_CALL_4(bpf_xdp_load_bytes, struct xdp_buff *, xdp, u32, offset,
-	   void *, buf, u32, len)
+int __bpf_xdp_load_bytes(struct xdp_buff *xdp, u32 offset, void *buf, u32 len)
 {
 	void *ptr;
 
@@ -3975,6 +3996,12 @@  BPF_CALL_4(bpf_xdp_load_bytes, struct xdp_buff *, xdp, u32, offset,
 	return 0;
 }
 
+BPF_CALL_4(bpf_xdp_load_bytes, struct xdp_buff *, xdp, u32, offset,
+	   void *, buf, u32, len)
+{
+	return __bpf_xdp_load_bytes(xdp, offset, buf, len);
+}
+
 static const struct bpf_func_proto bpf_xdp_load_bytes_proto = {
 	.func		= bpf_xdp_load_bytes,
 	.gpl_only	= false,
@@ -3985,8 +4012,7 @@  static const struct bpf_func_proto bpf_xdp_load_bytes_proto = {
 	.arg4_type	= ARG_CONST_SIZE,
 };
 
-BPF_CALL_4(bpf_xdp_store_bytes, struct xdp_buff *, xdp, u32, offset,
-	   void *, buf, u32, len)
+int __bpf_xdp_store_bytes(struct xdp_buff *xdp, u32 offset, void *buf, u32 len)
 {
 	void *ptr;
 
@@ -4002,6 +4028,12 @@  BPF_CALL_4(bpf_xdp_store_bytes, struct xdp_buff *, xdp, u32, offset,
 	return 0;
 }
 
+BPF_CALL_4(bpf_xdp_store_bytes, struct xdp_buff *, xdp, u32, offset,
+	   void *, buf, u32, len)
+{
+	return __bpf_xdp_store_bytes(xdp, offset, buf, len);
+}
+
 static const struct bpf_func_proto bpf_xdp_store_bytes_proto = {
 	.func		= bpf_xdp_store_bytes,
 	.gpl_only	= false,
@@ -8009,6 +8041,8 @@  xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_tcp_raw_check_syncookie_ipv6_proto;
 #endif
 #endif
+	case BPF_FUNC_dynptr_from_xdp:
+		return &bpf_dynptr_from_xdp_proto;
 	default:
 		return bpf_sk_base_func_proto(func_id);
 	}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 3f1800a2b77c..0d5b0117db2a 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5283,13 +5283,18 @@  union bpf_attr {
  *		      and try again.
  *
  *		    * The data slice is automatically invalidated anytime
- *		      **bpf_dynptr_write**\ () or a helper call that changes
- *		      the underlying packet buffer (eg **bpf_skb_pull_data**\ ())
+ *		      **bpf_dynptr_write**\ () is called.
+ *
+ *		For skb-type and xdp-type dynptrs:
+ *		    * The data slice is automatically invalidated anytime a
+ *		      helper call that changes the underlying packet buffer
+ *		      (eg **bpf_skb_pull_data**\ (), **bpf_xdp_adjust_head**\ ())
  *		      is called.
  *	Return
  *		Pointer to the underlying dynptr data, NULL if the dynptr is
  *		read-only, if the dynptr is invalid, or if the offset and length
- *		is out of bounds or in a paged buffer for skb-type dynptrs.
+ *		is out of bounds or in a paged buffer for skb-type dynptrs or
+ *		across fragments for xdp-type dynptrs.
  *
  * s64 bpf_tcp_raw_gen_syncookie_ipv4(struct iphdr *iph, struct tcphdr *th, u32 th_len)
  *	Description
@@ -5388,6 +5393,19 @@  union bpf_attr {
  *		*flags* is currently unused, it must be 0 for now.
  *	Return
  *		0 on success or -EINVAL if flags is not 0.
+ *
+ * long bpf_dynptr_from_xdp(struct xdp_buff *xdp_md, u64 flags, struct bpf_dynptr *ptr)
+ *	Description
+ *		Get a dynptr to the data in *xdp_md*. *xdp_md* must be the BPF program
+ *		context.
+ *
+ *		Calls that change the *xdp_md*'s underlying packet buffer
+ *		(eg **bpf_xdp_adjust_head**\ ()) do not invalidate the dynptr, but
+ *		they do invalidate any data slices associated with the dynptr.
+ *
+ *		*flags* is currently unused, it must be 0 for now.
+ *	Return
+ *		0 on success, -EINVAL if flags is not 0.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5600,6 +5618,7 @@  union bpf_attr {
 	FN(tcp_raw_check_syncookie_ipv6),	\
 	FN(ktime_get_tai_ns),		\
 	FN(dynptr_from_skb),		\
+	FN(dynptr_from_xdp),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper