From patchwork Wed Aug 31 18:32:22 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Joanne Koong X-Patchwork-Id: 12961289 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id C2E73C0502A for ; Wed, 31 Aug 2022 18:39:51 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233103AbiHaSju (ORCPT ); Wed, 31 Aug 2022 14:39:50 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:59182 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233098AbiHaSjQ (ORCPT ); Wed, 31 Aug 2022 14:39:16 -0400 Received: from 69-171-232-181.mail-mxout.facebook.com (69-171-232-181.mail-mxout.facebook.com [69.171.232.181]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8A30ADDABD for ; Wed, 31 Aug 2022 11:36:51 -0700 (PDT) Received: by devbig010.atn6.facebook.com (Postfix, from userid 115148) id C89CD1127DA65; Wed, 31 Aug 2022 11:36:37 -0700 (PDT) From: Joanne Koong To: bpf@vger.kernel.org Cc: andrii@kernel.org, daniel@iogearbox.net, ast@kernel.org, kafai@fb.com, memxor@gmail.com, toke@redhat.com, kuba@kernel.org, netdev@vger.kernel.org, Joanne Koong Subject: [PATCH bpf-next v5 1/3] bpf: Add skb dynptrs Date: Wed, 31 Aug 2022 11:32:22 -0700 Message-Id: <20220831183224.3754305-2-joannelkoong@gmail.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20220831183224.3754305-1-joannelkoong@gmail.com> References: <20220831183224.3754305-1-joannelkoong@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net Add skb dynptrs, which are dynptrs whose underlying pointer points to a skb. The dynptr acts on skb data. skb dynptrs have two main benefits. One is that they allow operations on sizes that are not statically known at compile-time (eg variable-sized accesses). Another is that parsing the packet data through dynptrs (instead of through direct access of skb->data and skb->data_end) can be more ergonomic and less brittle (eg does not need manual if checking for being within bounds of data_end). For bpf prog types that don't support writes on skb data, the dynptr is read-only (bpf_dynptr_write() will return an error and bpf_dynptr_data() will return NULL; for a read-only data slice, there will be a separate API bpf_dynptr_data_rdonly(), which will be added in the near future). For reads and writes through the bpf_dynptr_read() and bpf_dynptr_write() interfaces, reading and writing from/to data in the head as well as from/to non-linear paged buffers is supported. For data slices (through the bpf_dynptr_data() interface), if the data is in a paged buffer, the user must first call bpf_skb_pull_data() to pull the data into the linear portion. Any bpf_dynptr_write() automatically invalidates any prior data slices to the skb dynptr. This is because a bpf_dynptr_write() may be writing to data in a paged buffer, so it will need to pull the buffer first into the head. The reason it needs to be pulled instead of writing directly to the paged buffers is because they may be cloned (only the head of the skb is by default uncloned). As such, any bpf_dynptr_write() will automatically have its prior data slices invalidated, even if the write is to data in the skb head (the verifier has no way of differentiating whether the write is to the head or paged buffers during program load time). Please note as well that any other helper calls that change the underlying packet buffer (eg bpf_skb_pull_data()) invalidates any data slices of the skb dynptr as well. The stack trace for this is check_helper_call() -> clear_all_pkt_pointers() -> __clear_all_pkt_pointers() -> mark_reg_unknown(). For examples of how skb dynptrs can be used, please see the attached selftests. Signed-off-by: Joanne Koong Reported-by: kernel test robot Reported-by: kernel test robot Reported-by: kernel test robot --- include/linux/bpf.h | 82 +++++++++++++++++++------------ include/linux/filter.h | 17 +++++++ include/uapi/linux/bpf.h | 43 +++++++++++++++-- kernel/bpf/helpers.c | 71 ++++++++++++++++++++++++--- kernel/bpf/verifier.c | 88 +++++++++++++++++++++++++++------- net/core/filter.c | 72 ++++++++++++++++++++++++++-- tools/include/uapi/linux/bpf.h | 43 +++++++++++++++-- 7 files changed, 352 insertions(+), 64 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 9c1674973e03..26ad1422a157 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -408,11 +408,14 @@ enum bpf_type_flag { /* Size is known at compile time. */ MEM_FIXED_SIZE = BIT(10 + BPF_BASE_TYPE_BITS), + /* DYNPTR points to sk_buff */ + DYNPTR_TYPE_SKB = BIT(11 + BPF_BASE_TYPE_BITS), + __BPF_TYPE_FLAG_MAX, __BPF_TYPE_LAST_FLAG = __BPF_TYPE_FLAG_MAX - 1, }; -#define DYNPTR_TYPE_FLAG_MASK (DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF) +#define DYNPTR_TYPE_FLAG_MASK (DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF | DYNPTR_TYPE_SKB) /* Max number of base types. */ #define BPF_BASE_TYPE_LIMIT (1UL << BPF_BASE_TYPE_BITS) @@ -904,6 +907,36 @@ static __always_inline __nocfi unsigned int bpf_dispatcher_nop_func( return bpf_func(ctx, insnsi); } +/* the implementation of the opaque uapi struct bpf_dynptr */ +struct bpf_dynptr_kern { + void *data; + /* Size represents the number of usable bytes of dynptr data. + * If for example the offset is at 4 for a local dynptr whose data is + * of type u64, the number of usable bytes is 4. + * + * The upper 8 bits are reserved. It is as follows: + * Bits 0 - 23 = size + * Bits 24 - 30 = dynptr type + * Bit 31 = whether dynptr is read-only + */ + u32 size; + u32 offset; +} __aligned(8); + +enum bpf_dynptr_type { + BPF_DYNPTR_TYPE_INVALID, + /* Points to memory that is local to the bpf program */ + BPF_DYNPTR_TYPE_LOCAL, + /* Underlying data is a ringbuf record */ + BPF_DYNPTR_TYPE_RINGBUF, + /* Underlying data is a sk_buff */ + BPF_DYNPTR_TYPE_SKB, + /* Underlying data is a xdp_buff */ + BPF_DYNPTR_TYPE_XDP, +}; + +int bpf_dynptr_check_size(u32 size); + #ifdef CONFIG_BPF_JIT int bpf_trampoline_link_prog(struct bpf_tramp_link *link, struct bpf_trampoline *tr); int bpf_trampoline_unlink_prog(struct bpf_tramp_link *link, struct bpf_trampoline *tr); @@ -1983,6 +2016,11 @@ static inline bool has_current_bpf_ctx(void) { return !!current->bpf_ctx; } + +void bpf_dynptr_init(struct bpf_dynptr_kern *ptr, void *data, + enum bpf_dynptr_type type, u32 offset, u32 size); +void bpf_dynptr_set_null(struct bpf_dynptr_kern *ptr); +void bpf_dynptr_set_rdonly(struct bpf_dynptr_kern *ptr); #else /* !CONFIG_BPF_SYSCALL */ static inline struct bpf_prog *bpf_prog_get(u32 ufd) { @@ -2196,6 +2234,19 @@ static inline bool has_current_bpf_ctx(void) { return false; } + +static inline void bpf_dynptr_init(struct bpf_dynptr_kern *ptr, void *data, + enum bpf_dynptr_type type, u32 offset, u32 size) +{ +} + +static inline void bpf_dynptr_set_null(struct bpf_dynptr_kern *ptr) +{ +} + +static inline void bpf_dynptr_set_rdonly(struct bpf_dynptr_kern *ptr) +{ +} #endif /* CONFIG_BPF_SYSCALL */ void __bpf_free_used_btfs(struct bpf_prog_aux *aux, @@ -2557,35 +2608,6 @@ int bpf_bprintf_prepare(char *fmt, u32 fmt_size, const u64 *raw_args, u32 **bin_buf, u32 num_args); void bpf_bprintf_cleanup(void); -/* the implementation of the opaque uapi struct bpf_dynptr */ -struct bpf_dynptr_kern { - void *data; - /* Size represents the number of usable bytes of dynptr data. - * If for example the offset is at 4 for a local dynptr whose data is - * of type u64, the number of usable bytes is 4. - * - * The upper 8 bits are reserved. It is as follows: - * Bits 0 - 23 = size - * Bits 24 - 30 = dynptr type - * Bit 31 = whether dynptr is read-only - */ - u32 size; - u32 offset; -} __aligned(8); - -enum bpf_dynptr_type { - BPF_DYNPTR_TYPE_INVALID, - /* Points to memory that is local to the bpf program */ - BPF_DYNPTR_TYPE_LOCAL, - /* Underlying data is a ringbuf record */ - BPF_DYNPTR_TYPE_RINGBUF, -}; - -void bpf_dynptr_init(struct bpf_dynptr_kern *ptr, void *data, - enum bpf_dynptr_type type, u32 offset, u32 size); -void bpf_dynptr_set_null(struct bpf_dynptr_kern *ptr); -int bpf_dynptr_check_size(u32 size); - #ifdef CONFIG_BPF_LSM void bpf_cgroup_atype_get(u32 attach_btf_id, int cgroup_atype); void bpf_cgroup_atype_put(int cgroup_atype); diff --git a/include/linux/filter.h b/include/linux/filter.h index a5f21dc3c432..a3a415344001 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -1532,4 +1532,21 @@ static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifind return XDP_REDIRECT; } +#ifdef CONFIG_NET +int __bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void *to, u32 len); +int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from, + u32 len, u64 flags); +#else /* CONFIG_NET */ +int __bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void *to, u32 len) +{ + return -EOPNOTSUPP; +} + +int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from, + u32 len, u64 flags) +{ + return -EOPNOTSUPP; +} +#endif /* CONFIG_NET */ + #endif /* __LINUX_FILTER_H__ */ diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 962960a98835..b4e5f7d81a20 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -5284,22 +5284,44 @@ union bpf_attr { * Description * Write *len* bytes from *src* into *dst*, starting from *offset* * into *dst*. - * *flags* is currently unused. + * + * *flags* must be 0 except for skb-type dynptrs. + * + * For skb-type dynptrs: + * * All data slices of the dynptr are automatically + * invalidated after **bpf_dynptr_write**\ (). If you wish to + * avoid this, please perform the write using direct data slices + * instead. + * + * * For *flags*, please see the flags accepted by + * **bpf_skb_store_bytes**\ (). * Return * 0 on success, -E2BIG if *offset* + *len* exceeds the length * of *dst*'s data, -EINVAL if *dst* is an invalid dynptr or if *dst* - * is a read-only dynptr or if *flags* is not 0. + * is a read-only dynptr or if *flags* is not correct. For skb-type dynptrs, + * other errors correspond to errors returned by **bpf_skb_store_bytes**\ (). * * void *bpf_dynptr_data(struct bpf_dynptr *ptr, u32 offset, u32 len) * Description * Get a pointer to the underlying dynptr data. * * *len* must be a statically known value. The returned data slice - * is invalidated whenever the dynptr is invalidated. + * is invalidated whenever the dynptr is invalidated. If the dynptr + * is read-only, this will return NULL. + * + * For skb-type dynptrs: + * * If *offset* + *len* extends into the skb's paged buffers, + * the user should manually pull the skb with **bpf_skb_pull_data**\ () + * and try again. + * + * * The data slice is automatically invalidated anytime + * **bpf_dynptr_write**\ () or a helper call that changes + * the underlying packet buffer (eg **bpf_skb_pull_data**\ ()) + * is called. * Return * Pointer to the underlying dynptr data, NULL if the dynptr is * read-only, if the dynptr is invalid, or if the offset and length - * is out of bounds. + * is out of bounds or in a paged buffer for skb-type dynptrs. * * s64 bpf_tcp_raw_gen_syncookie_ipv4(struct iphdr *iph, struct tcphdr *th, u32 th_len) * Description @@ -5386,6 +5408,18 @@ union bpf_attr { * Return * Current *ktime*. * + * long bpf_dynptr_from_skb(struct sk_buff *skb, u64 flags, struct bpf_dynptr *ptr) + * Description + * Get a dynptr to the data in *skb*. *skb* must be the BPF program + * context. Depending on program type, the dynptr may be read-only. + * + * Calls that change the *skb*'s underlying packet buffer + * (eg **bpf_skb_pull_data**\ ()) do not invalidate the dynptr, but + * they do invalidate any data slices associated with the dynptr. + * + * *flags* is currently unused, it must be 0 for now. + * Return + * 0 on success or -EINVAL if flags is not 0. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -5597,6 +5631,7 @@ union bpf_attr { FN(tcp_raw_check_syncookie_ipv4), \ FN(tcp_raw_check_syncookie_ipv6), \ FN(ktime_get_tai_ns), \ + FN(dynptr_from_skb), \ /* */ /* integer value in 'imm' field of BPF_CALL instruction selects which helper diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index fc08035f14ed..98fcb4704a9e 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -1403,11 +1403,21 @@ static bool bpf_dynptr_is_rdonly(struct bpf_dynptr_kern *ptr) return ptr->size & DYNPTR_RDONLY_BIT; } +void bpf_dynptr_set_rdonly(struct bpf_dynptr_kern *ptr) +{ + ptr->size |= DYNPTR_RDONLY_BIT; +} + static void bpf_dynptr_set_type(struct bpf_dynptr_kern *ptr, enum bpf_dynptr_type type) { ptr->size |= type << DYNPTR_TYPE_SHIFT; } +static enum bpf_dynptr_type bpf_dynptr_get_type(const struct bpf_dynptr_kern *ptr) +{ + return (ptr->size & ~(DYNPTR_RDONLY_BIT)) >> DYNPTR_TYPE_SHIFT; +} + static u32 bpf_dynptr_get_size(struct bpf_dynptr_kern *ptr) { return ptr->size & DYNPTR_SIZE_MASK; @@ -1478,6 +1488,7 @@ static const struct bpf_func_proto bpf_dynptr_from_mem_proto = { BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len, struct bpf_dynptr_kern *, src, u32, offset, u64, flags) { + enum bpf_dynptr_type type; int err; if (!src->data || flags) @@ -1487,9 +1498,19 @@ BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len, struct bpf_dynptr_kern *, src if (err) return err; - memcpy(dst, src->data + src->offset + offset, len); + type = bpf_dynptr_get_type(src); - return 0; + switch (type) { + case BPF_DYNPTR_TYPE_LOCAL: + case BPF_DYNPTR_TYPE_RINGBUF: + memcpy(dst, src->data + src->offset + offset, len); + return 0; + case BPF_DYNPTR_TYPE_SKB: + return __bpf_skb_load_bytes(src->data, src->offset + offset, dst, len); + default: + WARN_ONCE(true, "bpf_dynptr_read: unknown dynptr type %d\n", type); + return -EFAULT; + } } static const struct bpf_func_proto bpf_dynptr_read_proto = { @@ -1506,18 +1527,32 @@ static const struct bpf_func_proto bpf_dynptr_read_proto = { BPF_CALL_5(bpf_dynptr_write, struct bpf_dynptr_kern *, dst, u32, offset, void *, src, u32, len, u64, flags) { + enum bpf_dynptr_type type; int err; - if (!dst->data || flags || bpf_dynptr_is_rdonly(dst)) + if (!dst->data || bpf_dynptr_is_rdonly(dst)) return -EINVAL; err = bpf_dynptr_check_off_len(dst, offset, len); if (err) return err; - memcpy(dst->data + dst->offset + offset, src, len); + type = bpf_dynptr_get_type(dst); - return 0; + switch (type) { + case BPF_DYNPTR_TYPE_LOCAL: + case BPF_DYNPTR_TYPE_RINGBUF: + if (flags) + return -EINVAL; + memcpy(dst->data + dst->offset + offset, src, len); + return 0; + case BPF_DYNPTR_TYPE_SKB: + return __bpf_skb_store_bytes(dst->data, dst->offset + offset, src, len, + flags); + default: + WARN_ONCE(true, "bpf_dynptr_write: unknown dynptr type %d\n", type); + return -EFAULT; + } } static const struct bpf_func_proto bpf_dynptr_write_proto = { @@ -1533,6 +1568,8 @@ static const struct bpf_func_proto bpf_dynptr_write_proto = { BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len) { + enum bpf_dynptr_type type; + void *data; int err; if (!ptr->data) @@ -1545,7 +1582,29 @@ BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len if (bpf_dynptr_is_rdonly(ptr)) return 0; - return (unsigned long)(ptr->data + ptr->offset + offset); + type = bpf_dynptr_get_type(ptr); + + switch (type) { + case BPF_DYNPTR_TYPE_LOCAL: + case BPF_DYNPTR_TYPE_RINGBUF: + data = ptr->data; + break; + case BPF_DYNPTR_TYPE_SKB: + { + struct sk_buff *skb = ptr->data; + + /* if the data is paged, the caller needs to pull it first */ + if (ptr->offset + offset + len > skb->len - skb->data_len) + return 0; + + data = skb->data; + break; + } + default: + WARN_ONCE(true, "bpf_dynptr_data: unknown dynptr type %d\n", type); + return 0; + } + return (unsigned long)(data + ptr->offset + offset); } static const struct bpf_func_proto bpf_dynptr_data_proto = { diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 0194a36d0b36..7f67fac66735 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -684,6 +684,8 @@ static enum bpf_dynptr_type arg_to_dynptr_type(enum bpf_arg_type arg_type) return BPF_DYNPTR_TYPE_LOCAL; case DYNPTR_TYPE_RINGBUF: return BPF_DYNPTR_TYPE_RINGBUF; + case DYNPTR_TYPE_SKB: + return BPF_DYNPTR_TYPE_SKB; default: return BPF_DYNPTR_TYPE_INVALID; } @@ -1406,6 +1408,12 @@ static bool reg_is_pkt_pointer_any(const struct bpf_reg_state *reg) reg->type == PTR_TO_PACKET_END; } +static bool reg_is_dynptr_slice_pkt(const struct bpf_reg_state *reg) +{ + return base_type(reg->type) == PTR_TO_MEM && + reg->type & DYNPTR_TYPE_SKB; +} + /* Unmodified PTR_TO_PACKET[_META,_END] register from ctx access. */ static bool reg_is_init_pkt_pointer(const struct bpf_reg_state *reg, enum bpf_reg_type which) @@ -5830,12 +5838,29 @@ int check_func_arg_reg_off(struct bpf_verifier_env *env, return __check_ptr_off_reg(env, reg, regno, fixed_off_ok); } -static u32 stack_slot_get_id(struct bpf_verifier_env *env, struct bpf_reg_state *reg) +static struct bpf_reg_state *get_dynptr_arg_reg(const struct bpf_func_proto *fn, + struct bpf_reg_state *regs) +{ + int i; + + for (i = 0; i < MAX_BPF_FUNC_REG_ARGS; i++) + if (arg_type_is_dynptr(fn->arg_type[i])) + return ®s[BPF_REG_1 + i]; + + return NULL; +} + +static enum bpf_dynptr_type stack_slot_get_dynptr_info(struct bpf_verifier_env *env, + struct bpf_reg_state *reg, + int *ref_obj_id) { struct bpf_func_state *state = func(env, reg); int spi = get_spi(reg->off); - return state->stack[spi].spilled_ptr.id; + if (ref_obj_id) + *ref_obj_id = state->stack[spi].spilled_ptr.id; + + return state->stack[spi].spilled_ptr.dynptr.type; } static int check_func_arg(struct bpf_verifier_env *env, u32 arg, @@ -6060,6 +6085,9 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg, case DYNPTR_TYPE_RINGBUF: err_extra = "ringbuf "; break; + case DYNPTR_TYPE_SKB: + err_extra = "skb "; + break; default: break; } @@ -6490,6 +6518,9 @@ static int check_func_proto(const struct bpf_func_proto *fn, int func_id) /* Packet data might have moved, any old PTR_TO_PACKET[_META,_END] * are now invalid, so turn them into unknown SCALAR_VALUE. + * + * This applies to dynptr slices belonging to skb dynptrs, + * since these slices point to packet data. */ static void __clear_all_pkt_pointers(struct bpf_verifier_env *env, struct bpf_func_state *state) @@ -6498,13 +6529,15 @@ static void __clear_all_pkt_pointers(struct bpf_verifier_env *env, int i; for (i = 0; i < MAX_BPF_REG; i++) - if (reg_is_pkt_pointer_any(®s[i])) + if (reg_is_pkt_pointer_any(®s[i]) || + reg_is_dynptr_slice_pkt(®s[i])) mark_reg_unknown(env, regs, i); bpf_for_each_spilled_reg(i, state, reg) { if (!reg) continue; - if (reg_is_pkt_pointer_any(reg)) + if (reg_is_pkt_pointer_any(reg) || + reg_is_dynptr_slice_pkt(®s[i])) __mark_reg_unknown(env, reg); } } @@ -7166,6 +7199,7 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn int *insn_idx_p) { enum bpf_prog_type prog_type = resolve_prog_type(env->prog); + enum bpf_dynptr_type dynptr_type = BPF_DYNPTR_TYPE_INVALID; const struct bpf_func_proto *fn = NULL; enum bpf_return_type ret_type; enum bpf_type_flag ret_flag; @@ -7339,23 +7373,42 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn } break; case BPF_FUNC_dynptr_data: - for (i = 0; i < MAX_BPF_FUNC_REG_ARGS; i++) { - if (arg_type_is_dynptr(fn->arg_type[i])) { - if (meta.ref_obj_id) { - verbose(env, "verifier internal error: meta.ref_obj_id already set\n"); - return -EFAULT; - } - /* Find the id of the dynptr we're tracking the reference of */ - meta.ref_obj_id = stack_slot_get_id(env, ®s[BPF_REG_1 + i]); - break; - } - } - if (i == MAX_BPF_FUNC_REG_ARGS) { + { + struct bpf_reg_state *reg; + + reg = get_dynptr_arg_reg(fn, regs); + if (!reg) { verbose(env, "verifier internal error: no dynptr in bpf_dynptr_data()\n"); return -EFAULT; } + + if (meta.ref_obj_id) { + verbose(env, "verifier internal error: meta.ref_obj_id already set\n"); + return -EFAULT; + } + + dynptr_type = stack_slot_get_dynptr_info(env, reg, &meta.ref_obj_id); break; } + case BPF_FUNC_dynptr_write: + { + struct bpf_reg_state *reg; + + reg = get_dynptr_arg_reg(fn, regs); + if (!reg) { + verbose(env, "verifier internal error: no dynptr in bpf_dynptr_write()\n"); + return -EFAULT; + } + + /* bpf_dynptr_write() for skb-type dynptrs may pull the skb, so we must + * invalidate all data slices associated with it + */ + if (stack_slot_get_dynptr_info(env, reg, NULL) == BPF_DYNPTR_TYPE_SKB) + changes_data = true; + + break; + } + } if (err) return err; @@ -7417,6 +7470,9 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn mark_reg_known_zero(env, regs, BPF_REG_0); regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag; regs[BPF_REG_0].mem_size = meta.mem_size; + if (func_id == BPF_FUNC_dynptr_data && + dynptr_type == BPF_DYNPTR_TYPE_SKB) + regs[BPF_REG_0].type |= DYNPTR_TYPE_SKB; break; case RET_PTR_TO_MEM_OR_BTF_ID: { diff --git a/net/core/filter.c b/net/core/filter.c index 63e25d8ce501..0e2238516d8b 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -1681,8 +1681,8 @@ static inline void bpf_pull_mac_rcsum(struct sk_buff *skb) skb_postpull_rcsum(skb, skb_mac_header(skb), skb->mac_len); } -BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *, skb, u32, offset, - const void *, from, u32, len, u64, flags) +int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from, + u32 len, u64 flags) { void *ptr; @@ -1707,6 +1707,12 @@ BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *, skb, u32, offset, return 0; } +BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *, skb, u32, offset, + const void *, from, u32, len, u64, flags) +{ + return __bpf_skb_store_bytes(skb, offset, from, len, flags); +} + static const struct bpf_func_proto bpf_skb_store_bytes_proto = { .func = bpf_skb_store_bytes, .gpl_only = false, @@ -1718,8 +1724,7 @@ static const struct bpf_func_proto bpf_skb_store_bytes_proto = { .arg5_type = ARG_ANYTHING, }; -BPF_CALL_4(bpf_skb_load_bytes, const struct sk_buff *, skb, u32, offset, - void *, to, u32, len) +int __bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void *to, u32 len) { void *ptr; @@ -1738,6 +1743,12 @@ BPF_CALL_4(bpf_skb_load_bytes, const struct sk_buff *, skb, u32, offset, return -EFAULT; } +BPF_CALL_4(bpf_skb_load_bytes, const struct sk_buff *, skb, u32, offset, + void *, to, u32, len) +{ + return __bpf_skb_load_bytes(skb, offset, to, len); +} + static const struct bpf_func_proto bpf_skb_load_bytes_proto = { .func = bpf_skb_load_bytes, .gpl_only = false, @@ -1849,6 +1860,51 @@ static const struct bpf_func_proto bpf_skb_pull_data_proto = { .arg2_type = ARG_ANYTHING, }; +BPF_CALL_3(bpf_dynptr_from_skb_rdwr, struct sk_buff *, skb, u64, flags, + struct bpf_dynptr_kern *, ptr) +{ + if (flags) { + bpf_dynptr_set_null(ptr); + return -EINVAL; + } + + bpf_dynptr_init(ptr, skb, BPF_DYNPTR_TYPE_SKB, 0, skb->len); + + return 0; +} + +static const struct bpf_func_proto bpf_dynptr_from_skb_rdwr_proto = { + .func = bpf_dynptr_from_skb_rdwr, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, + .arg3_type = ARG_PTR_TO_DYNPTR | DYNPTR_TYPE_SKB | MEM_UNINIT, +}; + +BPF_CALL_3(bpf_dynptr_from_skb_rdonly, struct sk_buff *, skb, u64, flags, + struct bpf_dynptr_kern *, ptr) +{ + int err; + + err = ____bpf_dynptr_from_skb_rdwr(skb, flags, ptr); + if (err) + return err; + + bpf_dynptr_set_rdonly(ptr); + + return 0; +} + +static const struct bpf_func_proto bpf_dynptr_from_skb_rdonly_proto = { + .func = bpf_dynptr_from_skb_rdonly, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, + .arg3_type = ARG_PTR_TO_DYNPTR | DYNPTR_TYPE_SKB | MEM_UNINIT, +}; + BPF_CALL_1(bpf_sk_fullsock, struct sock *, sk) { return sk_fullsock(sk) ? (unsigned long)sk : (unsigned long)NULL; @@ -7704,6 +7760,8 @@ sk_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_get_socket_uid_proto; case BPF_FUNC_perf_event_output: return &bpf_skb_event_output_proto; + case BPF_FUNC_dynptr_from_skb: + return &bpf_dynptr_from_skb_rdonly_proto; default: return bpf_sk_base_func_proto(func_id); } @@ -7891,6 +7949,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_tcp_raw_check_syncookie_ipv6_proto; #endif #endif + case BPF_FUNC_dynptr_from_skb: + return &bpf_dynptr_from_skb_rdwr_proto; default: return bpf_sk_base_func_proto(func_id); } @@ -8090,6 +8150,8 @@ sk_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) case BPF_FUNC_skc_lookup_tcp: return &bpf_skc_lookup_tcp_proto; #endif + case BPF_FUNC_dynptr_from_skb: + return &bpf_dynptr_from_skb_rdwr_proto; default: return bpf_sk_base_func_proto(func_id); } @@ -8128,6 +8190,8 @@ lwt_out_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_get_smp_processor_id_proto; case BPF_FUNC_skb_under_cgroup: return &bpf_skb_under_cgroup_proto; + case BPF_FUNC_dynptr_from_skb: + return &bpf_dynptr_from_skb_rdonly_proto; default: return bpf_sk_base_func_proto(func_id); } diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index f4ba82a1eace..52e891b00a35 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -5284,22 +5284,44 @@ union bpf_attr { * Description * Write *len* bytes from *src* into *dst*, starting from *offset* * into *dst*. - * *flags* is currently unused. + * + * *flags* must be 0 except for skb-type dynptrs. + * + * For skb-type dynptrs: + * * All data slices of the dynptr are automatically + * invalidated after **bpf_dynptr_write**\ (). If you wish to + * avoid this, please perform the write using direct data slices + * instead. + * + * * For *flags*, please see the flags accepted by + * **bpf_skb_store_bytes**\ (). * Return * 0 on success, -E2BIG if *offset* + *len* exceeds the length * of *dst*'s data, -EINVAL if *dst* is an invalid dynptr or if *dst* - * is a read-only dynptr or if *flags* is not 0. + * is a read-only dynptr or if *flags* is not correct. For skb-type dynptrs, + * other errors correspond to errors returned by **bpf_skb_store_bytes**\ (). * * void *bpf_dynptr_data(struct bpf_dynptr *ptr, u32 offset, u32 len) * Description * Get a pointer to the underlying dynptr data. * * *len* must be a statically known value. The returned data slice - * is invalidated whenever the dynptr is invalidated. + * is invalidated whenever the dynptr is invalidated. If the dynptr + * is read-only, this will return NULL. + * + * For skb-type dynptrs: + * * If *offset* + *len* extends into the skb's paged buffers, + * the user should manually pull the skb with **bpf_skb_pull_data**\ () + * and try again. + * + * * The data slice is automatically invalidated anytime + * **bpf_dynptr_write**\ () or a helper call that changes + * the underlying packet buffer (eg **bpf_skb_pull_data**\ ()) + * is called. * Return * Pointer to the underlying dynptr data, NULL if the dynptr is * read-only, if the dynptr is invalid, or if the offset and length - * is out of bounds. + * is out of bounds or in a paged buffer for skb-type dynptrs. * * s64 bpf_tcp_raw_gen_syncookie_ipv4(struct iphdr *iph, struct tcphdr *th, u32 th_len) * Description @@ -5386,6 +5408,18 @@ union bpf_attr { * Return * Current *ktime*. * + * long bpf_dynptr_from_skb(struct sk_buff *skb, u64 flags, struct bpf_dynptr *ptr) + * Description + * Get a dynptr to the data in *skb*. *skb* must be the BPF program + * context. Depending on program type, the dynptr may be read-only. + * + * Calls that change the *skb*'s underlying packet buffer + * (eg **bpf_skb_pull_data**\ ()) do not invalidate the dynptr, but + * they do invalidate any data slices associated with the dynptr. + * + * *flags* is currently unused, it must be 0 for now. + * Return + * 0 on success or -EINVAL if flags is not 0. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -5597,6 +5631,7 @@ union bpf_attr { FN(tcp_raw_check_syncookie_ipv4), \ FN(tcp_raw_check_syncookie_ipv6), \ FN(ktime_get_tai_ns), \ + FN(dynptr_from_skb), \ /* */ /* integer value in 'imm' field of BPF_CALL instruction selects which helper From patchwork Wed Aug 31 18:32:23 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Joanne Koong X-Patchwork-Id: 12961288 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D9840ECAAD1 for ; Wed, 31 Aug 2022 18:39:50 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233018AbiHaSjt (ORCPT ); Wed, 31 Aug 2022 14:39:49 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:59148 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232885AbiHaSjQ (ORCPT ); Wed, 31 Aug 2022 14:39:16 -0400 Received: from 66-220-155-178.mail-mxout.facebook.com (66-220-155-178.mail-mxout.facebook.com [66.220.155.178]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5382EDEA71 for ; Wed, 31 Aug 2022 11:36:52 -0700 (PDT) Received: by devbig010.atn6.facebook.com (Postfix, from userid 115148) id CCC321127DA72; Wed, 31 Aug 2022 11:36:38 -0700 (PDT) From: Joanne Koong To: bpf@vger.kernel.org Cc: andrii@kernel.org, daniel@iogearbox.net, ast@kernel.org, kafai@fb.com, memxor@gmail.com, toke@redhat.com, kuba@kernel.org, netdev@vger.kernel.org, Joanne Koong Subject: [PATCH bpf-next v5 2/3] bpf: Add xdp dynptrs Date: Wed, 31 Aug 2022 11:32:23 -0700 Message-Id: <20220831183224.3754305-3-joannelkoong@gmail.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20220831183224.3754305-1-joannelkoong@gmail.com> References: <20220831183224.3754305-1-joannelkoong@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net Add xdp dynptrs, which are dynptrs whose underlying pointer points to a xdp_buff. The dynptr acts on xdp data. xdp dynptrs have two main benefits. One is that they allow operations on sizes that are not statically known at compile-time (eg variable-sized accesses). Another is that parsing the packet data through dynptrs (instead of through direct access of xdp->data and xdp->data_end) can be more ergonomic and less brittle (eg does not need manual if checking for being within bounds of data_end). For reads and writes on the dynptr, this includes reading/writing from/to and across fragments. For data slices, direct access to data in fragments is also permitted, but access across fragments is not. Any helper calls that change the underlying packet buffer (eg bpf_xdp_adjust_head) invalidates any data slices of the associated dynptr. The stack trace for this is check_helper_call() -> clear_all_pkt_pointers() -> __clear_all_pkt_pointers() -> mark_reg_unknown(). For examples of how xdp dynptrs can be used, please see the attached selftests. Signed-off-by: Joanne Koong Reported-by: kernel test robot --- include/linux/bpf.h | 6 ++++- include/linux/filter.h | 18 +++++++++++++ include/uapi/linux/bpf.h | 25 +++++++++++++++--- kernel/bpf/helpers.c | 12 +++++++++ kernel/bpf/verifier.c | 18 +++++++++---- net/core/filter.c | 46 +++++++++++++++++++++++++++++----- tools/include/uapi/linux/bpf.h | 25 +++++++++++++++--- 7 files changed, 132 insertions(+), 18 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 26ad1422a157..f07d127f559f 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -411,11 +411,15 @@ enum bpf_type_flag { /* DYNPTR points to sk_buff */ DYNPTR_TYPE_SKB = BIT(11 + BPF_BASE_TYPE_BITS), + /* DYNPTR points to xdp_buff */ + DYNPTR_TYPE_XDP = BIT(12 + BPF_BASE_TYPE_BITS), + __BPF_TYPE_FLAG_MAX, __BPF_TYPE_LAST_FLAG = __BPF_TYPE_FLAG_MAX - 1, }; -#define DYNPTR_TYPE_FLAG_MASK (DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF | DYNPTR_TYPE_SKB) +#define DYNPTR_TYPE_FLAG_MASK (DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF | DYNPTR_TYPE_SKB \ + | DYNPTR_TYPE_XDP) /* Max number of base types. */ #define BPF_BASE_TYPE_LIMIT (1UL << BPF_BASE_TYPE_BITS) diff --git a/include/linux/filter.h b/include/linux/filter.h index a3a415344001..09311cb4375d 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -1536,6 +1536,9 @@ static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifind int __bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void *to, u32 len); int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from, u32 len, u64 flags); +int __bpf_xdp_load_bytes(struct xdp_buff *xdp, u32 offset, void *buf, u32 len); +int __bpf_xdp_store_bytes(struct xdp_buff *xdp, u32 offset, void *buf, u32 len); +void *bpf_xdp_pointer(struct xdp_buff *xdp, u32 offset, u32 len); #else /* CONFIG_NET */ int __bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void *to, u32 len) { @@ -1547,6 +1550,21 @@ int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from, { return -EOPNOTSUPP; } + +int __bpf_xdp_load_bytes(struct xdp_buff *xdp, u32 offset, void *buf, u32 len) +{ + return -EOPNOTSUPP; +} + +int __bpf_xdp_store_bytes(struct xdp_buff *xdp, u32 offset, void *buf, u32 len) +{ + return -EOPNOTSUPP; +} + +void *bpf_xdp_pointer(struct xdp_buff *xdp, u32 offset, u32 len) +{ + return NULL; +} #endif /* CONFIG_NET */ #endif /* __LINUX_FILTER_H__ */ diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index b4e5f7d81a20..992f7565a41e 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -5315,13 +5315,18 @@ union bpf_attr { * and try again. * * * The data slice is automatically invalidated anytime - * **bpf_dynptr_write**\ () or a helper call that changes - * the underlying packet buffer (eg **bpf_skb_pull_data**\ ()) + * **bpf_dynptr_write**\ () is called. + * + * For skb-type and xdp-type dynptrs: + * * The data slice is automatically invalidated anytime a + * helper call that changes the underlying packet buffer + * (eg **bpf_skb_pull_data**\ (), **bpf_xdp_adjust_head**\ ()) * is called. * Return * Pointer to the underlying dynptr data, NULL if the dynptr is * read-only, if the dynptr is invalid, or if the offset and length - * is out of bounds or in a paged buffer for skb-type dynptrs. + * is out of bounds or in a paged buffer for skb-type dynptrs or + * across fragments for xdp-type dynptrs. * * s64 bpf_tcp_raw_gen_syncookie_ipv4(struct iphdr *iph, struct tcphdr *th, u32 th_len) * Description @@ -5420,6 +5425,19 @@ union bpf_attr { * *flags* is currently unused, it must be 0 for now. * Return * 0 on success or -EINVAL if flags is not 0. + * + * long bpf_dynptr_from_xdp(struct xdp_buff *xdp_md, u64 flags, struct bpf_dynptr *ptr) + * Description + * Get a dynptr to the data in *xdp_md*. *xdp_md* must be the BPF program + * context. + * + * Calls that change the *xdp_md*'s underlying packet buffer + * (eg **bpf_xdp_adjust_head**\ ()) do not invalidate the dynptr, but + * they do invalidate any data slices associated with the dynptr. + * + * *flags* is currently unused, it must be 0 for now. + * Return + * 0 on success, -EINVAL if flags is not 0. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -5632,6 +5650,7 @@ union bpf_attr { FN(tcp_raw_check_syncookie_ipv6), \ FN(ktime_get_tai_ns), \ FN(dynptr_from_skb), \ + FN(dynptr_from_xdp), \ /* */ /* integer value in 'imm' field of BPF_CALL instruction selects which helper diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index 98fcb4704a9e..befafae34a63 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -1507,6 +1507,8 @@ BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len, struct bpf_dynptr_kern *, src return 0; case BPF_DYNPTR_TYPE_SKB: return __bpf_skb_load_bytes(src->data, src->offset + offset, dst, len); + case BPF_DYNPTR_TYPE_XDP: + return __bpf_xdp_load_bytes(src->data, src->offset + offset, dst, len); default: WARN_ONCE(true, "bpf_dynptr_read: unknown dynptr type %d\n", type); return -EFAULT; @@ -1549,6 +1551,10 @@ BPF_CALL_5(bpf_dynptr_write, struct bpf_dynptr_kern *, dst, u32, offset, void *, case BPF_DYNPTR_TYPE_SKB: return __bpf_skb_store_bytes(dst->data, dst->offset + offset, src, len, flags); + case BPF_DYNPTR_TYPE_XDP: + if (flags) + return -EINVAL; + return __bpf_xdp_store_bytes(dst->data, dst->offset + offset, src, len); default: WARN_ONCE(true, "bpf_dynptr_write: unknown dynptr type %d\n", type); return -EFAULT; @@ -1600,6 +1606,12 @@ BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len data = skb->data; break; } + case BPF_DYNPTR_TYPE_XDP: + /* if the requested data in across fragments, then it cannot + * be accessed directly - bpf_xdp_pointer will return NULL + */ + return (unsigned long)bpf_xdp_pointer(ptr->data, + ptr->offset + offset, len); default: WARN_ONCE(true, "bpf_dynptr_data: unknown dynptr type %d\n", type); return 0; diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 7f67fac66735..f5ff0f26b7cc 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -686,6 +686,8 @@ static enum bpf_dynptr_type arg_to_dynptr_type(enum bpf_arg_type arg_type) return BPF_DYNPTR_TYPE_RINGBUF; case DYNPTR_TYPE_SKB: return BPF_DYNPTR_TYPE_SKB; + case DYNPTR_TYPE_XDP: + return BPF_DYNPTR_TYPE_XDP; default: return BPF_DYNPTR_TYPE_INVALID; } @@ -1411,7 +1413,7 @@ static bool reg_is_pkt_pointer_any(const struct bpf_reg_state *reg) static bool reg_is_dynptr_slice_pkt(const struct bpf_reg_state *reg) { return base_type(reg->type) == PTR_TO_MEM && - reg->type & DYNPTR_TYPE_SKB; + (reg->type & DYNPTR_TYPE_SKB || reg->type & DYNPTR_TYPE_XDP); } /* Unmodified PTR_TO_PACKET[_META,_END] register from ctx access. */ @@ -6088,6 +6090,9 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg, case DYNPTR_TYPE_SKB: err_extra = "skb "; break; + case DYNPTR_TYPE_XDP: + err_extra = "xdp "; + break; default: break; } @@ -6519,7 +6524,7 @@ static int check_func_proto(const struct bpf_func_proto *fn, int func_id) /* Packet data might have moved, any old PTR_TO_PACKET[_META,_END] * are now invalid, so turn them into unknown SCALAR_VALUE. * - * This applies to dynptr slices belonging to skb dynptrs, + * This applies to dynptr slices belonging to skb or xdp dynptrs, * since these slices point to packet data. */ static void __clear_all_pkt_pointers(struct bpf_verifier_env *env, @@ -7470,9 +7475,12 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn mark_reg_known_zero(env, regs, BPF_REG_0); regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag; regs[BPF_REG_0].mem_size = meta.mem_size; - if (func_id == BPF_FUNC_dynptr_data && - dynptr_type == BPF_DYNPTR_TYPE_SKB) - regs[BPF_REG_0].type |= DYNPTR_TYPE_SKB; + if (func_id == BPF_FUNC_dynptr_data) { + if (dynptr_type == BPF_DYNPTR_TYPE_SKB) + regs[BPF_REG_0].type |= DYNPTR_TYPE_SKB; + else if (dynptr_type == BPF_DYNPTR_TYPE_XDP) + regs[BPF_REG_0].type |= DYNPTR_TYPE_XDP; + } break; case RET_PTR_TO_MEM_OR_BTF_ID: { diff --git a/net/core/filter.c b/net/core/filter.c index 0e2238516d8b..cca8c7ab2829 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -3844,7 +3844,29 @@ static const struct bpf_func_proto sk_skb_change_head_proto = { .arg3_type = ARG_ANYTHING, }; -BPF_CALL_1(bpf_xdp_get_buff_len, struct xdp_buff*, xdp) +BPF_CALL_3(bpf_dynptr_from_xdp, struct xdp_buff*, xdp, u64, flags, + struct bpf_dynptr_kern *, ptr) +{ + if (flags) { + bpf_dynptr_set_null(ptr); + return -EINVAL; + } + + bpf_dynptr_init(ptr, xdp, BPF_DYNPTR_TYPE_XDP, 0, xdp_get_buff_len(xdp)); + + return 0; +} + +static const struct bpf_func_proto bpf_dynptr_from_xdp_proto = { + .func = bpf_dynptr_from_xdp, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, + .arg3_type = ARG_PTR_TO_DYNPTR | DYNPTR_TYPE_XDP | MEM_UNINIT, +}; + +BPF_CALL_1(bpf_xdp_get_buff_len, struct xdp_buff*, xdp) { return xdp_get_buff_len(xdp); } @@ -3946,7 +3968,7 @@ static void bpf_xdp_copy_buf(struct xdp_buff *xdp, unsigned long off, } } -static void *bpf_xdp_pointer(struct xdp_buff *xdp, u32 offset, u32 len) +void *bpf_xdp_pointer(struct xdp_buff *xdp, u32 offset, u32 len) { struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp); u32 size = xdp->data_end - xdp->data; @@ -3977,8 +3999,7 @@ static void *bpf_xdp_pointer(struct xdp_buff *xdp, u32 offset, u32 len) return offset + len <= size ? addr + offset : NULL; } -BPF_CALL_4(bpf_xdp_load_bytes, struct xdp_buff *, xdp, u32, offset, - void *, buf, u32, len) +int __bpf_xdp_load_bytes(struct xdp_buff *xdp, u32 offset, void *buf, u32 len) { void *ptr; @@ -3994,6 +4015,12 @@ BPF_CALL_4(bpf_xdp_load_bytes, struct xdp_buff *, xdp, u32, offset, return 0; } +BPF_CALL_4(bpf_xdp_load_bytes, struct xdp_buff *, xdp, u32, offset, + void *, buf, u32, len) +{ + return __bpf_xdp_load_bytes(xdp, offset, buf, len); +} + static const struct bpf_func_proto bpf_xdp_load_bytes_proto = { .func = bpf_xdp_load_bytes, .gpl_only = false, @@ -4004,8 +4031,7 @@ static const struct bpf_func_proto bpf_xdp_load_bytes_proto = { .arg4_type = ARG_CONST_SIZE, }; -BPF_CALL_4(bpf_xdp_store_bytes, struct xdp_buff *, xdp, u32, offset, - void *, buf, u32, len) +int __bpf_xdp_store_bytes(struct xdp_buff *xdp, u32 offset, void *buf, u32 len) { void *ptr; @@ -4021,6 +4047,12 @@ BPF_CALL_4(bpf_xdp_store_bytes, struct xdp_buff *, xdp, u32, offset, return 0; } +BPF_CALL_4(bpf_xdp_store_bytes, struct xdp_buff *, xdp, u32, offset, + void *, buf, u32, len) +{ + return __bpf_xdp_store_bytes(xdp, offset, buf, len); +} + static const struct bpf_func_proto bpf_xdp_store_bytes_proto = { .func = bpf_xdp_store_bytes, .gpl_only = false, @@ -8010,6 +8042,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_tcp_raw_check_syncookie_ipv6_proto; #endif #endif + case BPF_FUNC_dynptr_from_xdp: + return &bpf_dynptr_from_xdp_proto; default: return bpf_sk_base_func_proto(func_id); } diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 52e891b00a35..b22ffbb6f382 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -5315,13 +5315,18 @@ union bpf_attr { * and try again. * * * The data slice is automatically invalidated anytime - * **bpf_dynptr_write**\ () or a helper call that changes - * the underlying packet buffer (eg **bpf_skb_pull_data**\ ()) + * **bpf_dynptr_write**\ () is called. + * + * For skb-type and xdp-type dynptrs: + * * The data slice is automatically invalidated anytime a + * helper call that changes the underlying packet buffer + * (eg **bpf_skb_pull_data**\ (), **bpf_xdp_adjust_head**\ ()) * is called. * Return * Pointer to the underlying dynptr data, NULL if the dynptr is * read-only, if the dynptr is invalid, or if the offset and length - * is out of bounds or in a paged buffer for skb-type dynptrs. + * is out of bounds or in a paged buffer for skb-type dynptrs or + * across fragments for xdp-type dynptrs. * * s64 bpf_tcp_raw_gen_syncookie_ipv4(struct iphdr *iph, struct tcphdr *th, u32 th_len) * Description @@ -5420,6 +5425,19 @@ union bpf_attr { * *flags* is currently unused, it must be 0 for now. * Return * 0 on success or -EINVAL if flags is not 0. + * + * long bpf_dynptr_from_xdp(struct xdp_buff *xdp_md, u64 flags, struct bpf_dynptr *ptr) + * Description + * Get a dynptr to the data in *xdp_md*. *xdp_md* must be the BPF program + * context. + * + * Calls that change the *xdp_md*'s underlying packet buffer + * (eg **bpf_xdp_adjust_head**\ ()) do not invalidate the dynptr, but + * they do invalidate any data slices associated with the dynptr. + * + * *flags* is currently unused, it must be 0 for now. + * Return + * 0 on success, -EINVAL if flags is not 0. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -5632,6 +5650,7 @@ union bpf_attr { FN(tcp_raw_check_syncookie_ipv6), \ FN(ktime_get_tai_ns), \ FN(dynptr_from_skb), \ + FN(dynptr_from_xdp), \ /* */ /* integer value in 'imm' field of BPF_CALL instruction selects which helper From patchwork Wed Aug 31 18:32:24 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Joanne Koong X-Patchwork-Id: 12961290 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id C34EEECAAD1 for ; Wed, 31 Aug 2022 18:40:01 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233037AbiHaSkA (ORCPT ); Wed, 31 Aug 2022 14:40:00 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55824 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233118AbiHaSjU (ORCPT ); Wed, 31 Aug 2022 14:39:20 -0400 Received: from 66-220-155-178.mail-mxout.facebook.com (66-220-155-178.mail-mxout.facebook.com [66.220.155.178]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 7C3D3CC33B for ; Wed, 31 Aug 2022 11:36:55 -0700 (PDT) Received: by devbig010.atn6.facebook.com (Postfix, from userid 115148) id C0C0E1127DA7D; Wed, 31 Aug 2022 11:36:40 -0700 (PDT) From: Joanne Koong To: bpf@vger.kernel.org Cc: andrii@kernel.org, daniel@iogearbox.net, ast@kernel.org, kafai@fb.com, memxor@gmail.com, toke@redhat.com, kuba@kernel.org, netdev@vger.kernel.org, Joanne Koong Subject: [PATCH bpf-next v5 3/3] selftests/bpf: tests for using dynptrs to parse skb and xdp buffers Date: Wed, 31 Aug 2022 11:32:24 -0700 Message-Id: <20220831183224.3754305-4-joannelkoong@gmail.com> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20220831183224.3754305-1-joannelkoong@gmail.com> References: <20220831183224.3754305-1-joannelkoong@gmail.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org X-Patchwork-Delegate: bpf@iogearbox.net Test skb and xdp dynptr functionality in the following ways: 1) progs/test_cls_redirect_dynptr.c * Rewrite "progs/test_cls_redirect.c" test to use dynptrs to parse skb data * This is a great example of how dynptrs can be used to simplify a lot of the parsing logic for non-statically known values, and speed up execution times. When measuring the user + system time between the original version vs. using dynptrs, and averaging the time for 10 runs (using "time ./test_progs -t cls_redirect"), there was a 2x speed-up: original version: 0.053 sec with dynptrs: 0.025 sec 2) progs/test_xdp_dynptr.c * Rewrite "progs/test_xdp.c" test to use dynptrs to parse xdp data There were no noticeable diferences in user + system time between the original version vs. using dynptrs. Averaging the time for 10 runs (run using "time ./test_progs -t xdp_attach"): original version: 0.011 sec with dynptrs: 0.011 sec 3) progs/test_l4lb_noinline_dynptr.c * Rewrite "progs/test_l4lb_noinline.c" test to use dynptrs to parse skb data There were no noticeable diferences in user + system time between the original version vs. using dynptrs. Averaging the time for 10 runs (run using "time ./test_progs -t l4lb_all"): original version: 0.0502 sec with dynptrs: 0.055 sec For number of processed verifier instructions: original version: 6284 insns with dynptrs: 2538 insns 4) progs/test_parse_tcp_hdr_opt_dynptr.c * Add sample code for parsing tcp hdr opt lookup using dynptrs. This logic is lifted from a real-world use case of packet parsing in katran [0], a layer 4 load balancer. The original version "progs/test_parse_tcp_hdr_opt.c" (not using dynptrs) is included here as well, for comparison. There were no noticeable diferences in user + system time between the original version vs. using dynptrs. Averaging the time for 10 runs (run using "time ./test_progs -t parse_tcp_hdr_opt"): original version: 0.007 sec with dynptrs: 0.008 sec 5) progs/dynptr_success.c * Add test case "test_skb_readonly" for testing attempts at writes / data slices on a prog type with read-only skb ctx. 6) progs/dynptr_fail.c * Add test cases "skb_invalid_data_slice{1,2}" and "xdp_invalid_data_slice" for testing that helpers that modify the underlying packet buffer automatically invalidate the associated data slice. * Add test cases "skb_invalid_ctx" and "xdp_invalid_ctx" for testing that prog types that do not support bpf_dynptr_from_skb/xdp don't have access to the API. [0] https://github.com/facebookincubator/katran/blob/main/katran/lib/bpf/pckt_parsing.h Signed-off-by: Joanne Koong Acked-by: Andrii Nakryiko --- .../selftests/bpf/prog_tests/cls_redirect.c | 25 + .../testing/selftests/bpf/prog_tests/dynptr.c | 73 +- .../selftests/bpf/prog_tests/l4lb_all.c | 2 + .../bpf/prog_tests/parse_tcp_hdr_opt.c | 93 ++ .../selftests/bpf/prog_tests/xdp_attach.c | 11 +- .../testing/selftests/bpf/progs/dynptr_fail.c | 92 ++ .../selftests/bpf/progs/dynptr_success.c | 32 + .../bpf/progs/test_cls_redirect_dynptr.c | 968 ++++++++++++++++++ .../bpf/progs/test_l4lb_noinline_dynptr.c | 469 +++++++++ .../bpf/progs/test_parse_tcp_hdr_opt.c | 119 +++ .../bpf/progs/test_parse_tcp_hdr_opt_dynptr.c | 110 ++ .../selftests/bpf/progs/test_xdp_dynptr.c | 235 +++++ .../selftests/bpf/test_tcp_hdr_options.h | 1 + 13 files changed, 2214 insertions(+), 16 deletions(-) create mode 100644 tools/testing/selftests/bpf/prog_tests/parse_tcp_hdr_opt.c create mode 100644 tools/testing/selftests/bpf/progs/test_cls_redirect_dynptr.c create mode 100644 tools/testing/selftests/bpf/progs/test_l4lb_noinline_dynptr.c create mode 100644 tools/testing/selftests/bpf/progs/test_parse_tcp_hdr_opt.c create mode 100644 tools/testing/selftests/bpf/progs/test_parse_tcp_hdr_opt_dynptr.c create mode 100644 tools/testing/selftests/bpf/progs/test_xdp_dynptr.c diff --git a/tools/testing/selftests/bpf/prog_tests/cls_redirect.c b/tools/testing/selftests/bpf/prog_tests/cls_redirect.c index 224f016b0a53..2a55f717fc07 100644 --- a/tools/testing/selftests/bpf/prog_tests/cls_redirect.c +++ b/tools/testing/selftests/bpf/prog_tests/cls_redirect.c @@ -13,6 +13,7 @@ #include "progs/test_cls_redirect.h" #include "test_cls_redirect.skel.h" +#include "test_cls_redirect_dynptr.skel.h" #include "test_cls_redirect_subprogs.skel.h" #define ENCAP_IP INADDR_LOOPBACK @@ -446,6 +447,28 @@ static void test_cls_redirect_common(struct bpf_program *prog) close_fds((int *)conns, sizeof(conns) / sizeof(conns[0][0])); } +static void test_cls_redirect_dynptr(void) +{ + struct test_cls_redirect_dynptr *skel; + int err; + + skel = test_cls_redirect_dynptr__open(); + if (!ASSERT_OK_PTR(skel, "skel_open")) + return; + + skel->rodata->ENCAPSULATION_IP = htonl(ENCAP_IP); + skel->rodata->ENCAPSULATION_PORT = htons(ENCAP_PORT); + + err = test_cls_redirect_dynptr__load(skel); + if (!ASSERT_OK(err, "skel_load")) + goto cleanup; + + test_cls_redirect_common(skel->progs.cls_redirect); + +cleanup: + test_cls_redirect_dynptr__destroy(skel); +} + static void test_cls_redirect_inlined(void) { struct test_cls_redirect *skel; @@ -496,4 +519,6 @@ void test_cls_redirect(void) test_cls_redirect_inlined(); if (test__start_subtest("cls_redirect_subprogs")) test_cls_redirect_subprogs(); + if (test__start_subtest("cls_redirect_dynptr")) + test_cls_redirect_dynptr(); } diff --git a/tools/testing/selftests/bpf/prog_tests/dynptr.c b/tools/testing/selftests/bpf/prog_tests/dynptr.c index bcf80b9f7c27..f3bdec4a595c 100644 --- a/tools/testing/selftests/bpf/prog_tests/dynptr.c +++ b/tools/testing/selftests/bpf/prog_tests/dynptr.c @@ -2,17 +2,26 @@ /* Copyright (c) 2022 Facebook */ #include +#include #include "dynptr_fail.skel.h" #include "dynptr_success.skel.h" static size_t log_buf_sz = 1048576; /* 1 MB */ static char obj_log_buf[1048576]; +enum test_setup_type { + /* no set up is required. the prog will just be loaded */ + SETUP_NONE, + SETUP_SYSCALL_SLEEP, + SETUP_SKB_PROG, +}; + static struct { const char *prog_name; const char *expected_err_msg; + enum test_setup_type type; } dynptr_tests[] = { - /* failure cases */ + /* these cases should trigger a verifier error */ {"ringbuf_missing_release1", "Unreleased reference id=1"}, {"ringbuf_missing_release2", "Unreleased reference id=2"}, {"ringbuf_missing_release_callback", "Unreleased reference id"}, @@ -42,11 +51,17 @@ static struct { {"release_twice_callback", "arg 1 is an unacquired reference"}, {"dynptr_from_mem_invalid_api", "Unsupported reg type fp for bpf_dynptr_from_mem data"}, - - /* success cases */ - {"test_read_write", NULL}, - {"test_data_slice", NULL}, - {"test_ringbuf", NULL}, + {"skb_invalid_data_slice1", "invalid mem access 'scalar'"}, + {"skb_invalid_data_slice2", "invalid mem access 'scalar'"}, + {"xdp_invalid_data_slice", "invalid mem access 'scalar'"}, + {"skb_invalid_ctx", "unknown func bpf_dynptr_from_skb"}, + {"xdp_invalid_ctx", "unknown func bpf_dynptr_from_xdp"}, + + /* these tests should be run and should succeed */ + {"test_read_write", NULL, SETUP_SYSCALL_SLEEP}, + {"test_data_slice", NULL, SETUP_SYSCALL_SLEEP}, + {"test_ringbuf", NULL, SETUP_SYSCALL_SLEEP}, + {"test_skb_readonly", NULL, SETUP_SKB_PROG}, }; static void verify_fail(const char *prog_name, const char *expected_err_msg) @@ -85,7 +100,7 @@ static void verify_fail(const char *prog_name, const char *expected_err_msg) dynptr_fail__destroy(skel); } -static void verify_success(const char *prog_name) +static void run_test(const char *prog_name, enum test_setup_type setup_type) { struct dynptr_success *skel; struct bpf_program *prog; @@ -107,15 +122,45 @@ static void verify_success(const char *prog_name) if (!ASSERT_OK_PTR(prog, "bpf_object__find_program_by_name")) goto cleanup; - link = bpf_program__attach(prog); - if (!ASSERT_OK_PTR(link, "bpf_program__attach")) - goto cleanup; + switch (setup_type) { + case SETUP_SYSCALL_SLEEP: + link = bpf_program__attach(prog); + if (!ASSERT_OK_PTR(link, "bpf_program__attach")) + goto cleanup; - usleep(1); + usleep(1); - ASSERT_EQ(skel->bss->err, 0, "err"); + bpf_link__destroy(link); + break; + case SETUP_SKB_PROG: + { + int prog_fd, err; + char buf[64]; + + LIBBPF_OPTS(bpf_test_run_opts, topts, + .data_in = &pkt_v4, + .data_size_in = sizeof(pkt_v4), + .data_out = buf, + .data_size_out = sizeof(buf), + .repeat = 1, + ); + + prog_fd = bpf_program__fd(prog); + if (!ASSERT_GE(prog_fd, 0, "prog_fd")) + goto cleanup; - bpf_link__destroy(link); + err = bpf_prog_test_run_opts(prog_fd, &topts); + + if (!ASSERT_OK(err, "test_run")) + goto cleanup; + + break; + } + case SETUP_NONE: + ASSERT_EQ(0, 1, "internal error: SETUP_NONE unimplemented"); + } + + ASSERT_EQ(skel->bss->err, 0, "err"); cleanup: dynptr_success__destroy(skel); @@ -133,6 +178,6 @@ void test_dynptr(void) verify_fail(dynptr_tests[i].prog_name, dynptr_tests[i].expected_err_msg); else - verify_success(dynptr_tests[i].prog_name); + run_test(dynptr_tests[i].prog_name, dynptr_tests[i].type); } } diff --git a/tools/testing/selftests/bpf/prog_tests/l4lb_all.c b/tools/testing/selftests/bpf/prog_tests/l4lb_all.c index 55f733ff4109..94079c89f2e9 100644 --- a/tools/testing/selftests/bpf/prog_tests/l4lb_all.c +++ b/tools/testing/selftests/bpf/prog_tests/l4lb_all.c @@ -93,4 +93,6 @@ void test_l4lb_all(void) test_l4lb("test_l4lb.o"); if (test__start_subtest("l4lb_noinline")) test_l4lb("test_l4lb_noinline.o"); + if (test__start_subtest("l4lb_noinline_dynptr")) + test_l4lb("test_l4lb_noinline_dynptr.o"); } diff --git a/tools/testing/selftests/bpf/prog_tests/parse_tcp_hdr_opt.c b/tools/testing/selftests/bpf/prog_tests/parse_tcp_hdr_opt.c new file mode 100644 index 000000000000..ab042fdcfe55 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/parse_tcp_hdr_opt.c @@ -0,0 +1,93 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include +#include +#include "test_parse_tcp_hdr_opt.skel.h" +#include "test_parse_tcp_hdr_opt_dynptr.skel.h" +#include "test_tcp_hdr_options.h" + +struct test_pkt { + struct ipv6_packet pk6_v6; + u8 options[16]; +} __packed; + +struct test_pkt pkt = { + .pk6_v6.eth.h_proto = __bpf_constant_htons(ETH_P_IPV6), + .pk6_v6.iph.nexthdr = IPPROTO_TCP, + .pk6_v6.iph.payload_len = __bpf_constant_htons(MAGIC_BYTES), + .pk6_v6.tcp.urg_ptr = 123, + .pk6_v6.tcp.doff = 9, /* 16 bytes of options */ + + .options = { + TCPOPT_MSS, 4, 0x05, 0xB4, TCPOPT_NOP, TCPOPT_NOP, + 0, 6, 0, 0, 0, 9, TCPOPT_EOL + }, +}; + +static void test_parse_opt(void) +{ + struct test_parse_tcp_hdr_opt *skel; + struct bpf_program *prog; + char buf[128]; + int err; + + LIBBPF_OPTS(bpf_test_run_opts, topts, + .data_in = &pkt, + .data_size_in = sizeof(pkt), + .data_out = buf, + .data_size_out = sizeof(buf), + .repeat = 3, + ); + + skel = test_parse_tcp_hdr_opt__open_and_load(); + if (!ASSERT_OK_PTR(skel, "skel_open_and_load")) + return; + + pkt.options[6] = skel->rodata->tcp_hdr_opt_kind_tpr; + prog = skel->progs.xdp_ingress_v6; + + err = bpf_prog_test_run_opts(bpf_program__fd(prog), &topts); + ASSERT_OK(err, "ipv6 test_run"); + ASSERT_EQ(topts.retval, XDP_PASS, "ipv6 test_run retval"); + ASSERT_EQ(skel->bss->server_id, 0x9000000, "server id"); + + test_parse_tcp_hdr_opt__destroy(skel); +} + +static void test_parse_opt_dynptr(void) +{ + struct test_parse_tcp_hdr_opt_dynptr *skel; + struct bpf_program *prog; + char buf[128]; + int err; + + LIBBPF_OPTS(bpf_test_run_opts, topts, + .data_in = &pkt, + .data_size_in = sizeof(pkt), + .data_out = buf, + .data_size_out = sizeof(buf), + .repeat = 3, + ); + + skel = test_parse_tcp_hdr_opt_dynptr__open_and_load(); + if (!ASSERT_OK_PTR(skel, "skel_open_and_load")) + return; + + pkt.options[6] = skel->rodata->tcp_hdr_opt_kind_tpr; + prog = skel->progs.xdp_ingress_v6; + + err = bpf_prog_test_run_opts(bpf_program__fd(prog), &topts); + ASSERT_OK(err, "ipv6 test_run"); + ASSERT_EQ(topts.retval, XDP_PASS, "ipv6 test_run retval"); + ASSERT_EQ(skel->bss->server_id, 0x9000000, "server id"); + + test_parse_tcp_hdr_opt_dynptr__destroy(skel); +} + +void test_parse_tcp_hdr_opt(void) +{ + if (test__start_subtest("parse_tcp_hdr_opt")) + test_parse_opt(); + if (test__start_subtest("parse_tcp_hdr_opt_dynptr")) + test_parse_opt_dynptr(); +} diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_attach.c b/tools/testing/selftests/bpf/prog_tests/xdp_attach.c index 62aa3edda5e6..5fb5f03022d3 100644 --- a/tools/testing/selftests/bpf/prog_tests/xdp_attach.c +++ b/tools/testing/selftests/bpf/prog_tests/xdp_attach.c @@ -4,11 +4,10 @@ #define IFINDEX_LO 1 #define XDP_FLAGS_REPLACE (1U << 4) -void serial_test_xdp_attach(void) +static void serial_test_xdp_attach(const char *file) { __u32 duration = 0, id1, id2, id0 = 0, len; struct bpf_object *obj1, *obj2, *obj3; - const char *file = "./test_xdp.o"; struct bpf_prog_info info = {}; int err, fd1, fd2, fd3; LIBBPF_OPTS(bpf_xdp_attach_opts, opts); @@ -85,3 +84,11 @@ void serial_test_xdp_attach(void) out_1: bpf_object__close(obj1); } + +void test_xdp_attach(void) +{ + if (test__start_subtest("xdp_attach")) + serial_test_xdp_attach("./test_xdp.o"); + if (test__start_subtest("xdp_attach_dynptr")) + serial_test_xdp_attach("./test_xdp_dynptr.o"); +} diff --git a/tools/testing/selftests/bpf/progs/dynptr_fail.c b/tools/testing/selftests/bpf/progs/dynptr_fail.c index b0f08ff024fb..2da6703d0caf 100644 --- a/tools/testing/selftests/bpf/progs/dynptr_fail.c +++ b/tools/testing/selftests/bpf/progs/dynptr_fail.c @@ -5,6 +5,7 @@ #include #include #include +#include #include "bpf_misc.h" char _license[] SEC("license") = "GPL"; @@ -622,3 +623,94 @@ int dynptr_from_mem_invalid_api(void *ctx) return 0; } + +/* The data slice is invalidated whenever a helper changes packet data */ +SEC("?tc") +int skb_invalid_data_slice1(struct __sk_buff *skb) +{ + struct bpf_dynptr ptr; + struct ethhdr *hdr; + + bpf_dynptr_from_skb(skb, 0, &ptr); + hdr = bpf_dynptr_data(&ptr, 0, sizeof(*hdr)); + + if (bpf_skb_pull_data(skb, skb->len)) + return SK_DROP; + + if (!hdr) + return SK_DROP; + + /* this should fail */ + hdr->h_proto = 1; + + return SK_PASS; +} + +/* The data slice is invalidated whenever bpf_dynptr_write() is called */ +SEC("?tc") +int skb_invalid_data_slice2(struct __sk_buff *skb) +{ + char write_data[64] = "hello there, world!!"; + struct bpf_dynptr ptr; + struct ethhdr *hdr; + + bpf_dynptr_from_skb(skb, 0, &ptr); + hdr = bpf_dynptr_data(&ptr, 0, sizeof(*hdr)); + + bpf_dynptr_write(&ptr, 0, write_data, sizeof(write_data), 0); + + if (!hdr) + return SK_DROP; + + /* this should fail */ + hdr->h_proto = 1; + + return SK_PASS; +} + +/* The data slice is invalidated whenever a helper changes packet data */ +SEC("?xdp") +int xdp_invalid_data_slice(struct xdp_md *xdp) +{ + struct bpf_dynptr ptr; + struct ethhdr *hdr; + + bpf_dynptr_from_xdp(xdp, 0, &ptr); + hdr = bpf_dynptr_data(&ptr, 0, sizeof(*hdr)); + if (!hdr) + return SK_DROP; + + hdr->h_proto = 9; + + if (bpf_xdp_adjust_head(xdp, 0 - (int)sizeof(*hdr))) + return XDP_DROP; + + /* this should fail */ + hdr->h_proto = 1; + + return XDP_PASS; +} + +/* Only supported prog type can create skb-type dynptrs */ +SEC("?raw_tp") +int skb_invalid_ctx(void *ctx) +{ + struct bpf_dynptr ptr; + + /* this should fail */ + bpf_dynptr_from_skb(ctx, 0, &ptr); + + return 0; +} + +/* Only supported prog type can create xdp-type dynptrs */ +SEC("?raw_tp") +int xdp_invalid_ctx(void *ctx) +{ + struct bpf_dynptr ptr; + + /* this should fail */ + bpf_dynptr_from_xdp(ctx, 0, &ptr); + + return 0; +} diff --git a/tools/testing/selftests/bpf/progs/dynptr_success.c b/tools/testing/selftests/bpf/progs/dynptr_success.c index a3a6103c8569..8ed3ef201284 100644 --- a/tools/testing/selftests/bpf/progs/dynptr_success.c +++ b/tools/testing/selftests/bpf/progs/dynptr_success.c @@ -162,3 +162,35 @@ int test_ringbuf(void *ctx) bpf_ringbuf_discard_dynptr(&ptr, 0); return 0; } + +SEC("cgroup_skb/egress") +int test_skb_readonly(struct __sk_buff *skb) +{ + __u8 write_data[2] = {1, 2}; + struct bpf_dynptr ptr; + __u64 *data; + int ret; + + if (bpf_dynptr_from_skb(skb, 0, &ptr)) { + err = 1; + return 0; + } + + data = bpf_dynptr_data(&ptr, 0, sizeof(*data)); + /* since cgroup skbs are read only, bpf_dynptr_data + * should always return NULL + */ + if (data) { + err = 2; + return 0; + } + + /* since cgroup skbs are read only, writes should fail */ + ret = bpf_dynptr_write(&ptr, 0, write_data, sizeof(write_data), 0); + if (ret != -EINVAL) { + err = 3; + return 0; + } + + return 0; +} diff --git a/tools/testing/selftests/bpf/progs/test_cls_redirect_dynptr.c b/tools/testing/selftests/bpf/progs/test_cls_redirect_dynptr.c new file mode 100644 index 000000000000..40e5e7ef0228 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_cls_redirect_dynptr.c @@ -0,0 +1,968 @@ +// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause +// Copyright (c) 2019, 2020 Cloudflare + +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#include "test_cls_redirect.h" + +#define offsetofend(TYPE, MEMBER) \ + (offsetof(TYPE, MEMBER) + sizeof((((TYPE *)0)->MEMBER))) + +#define IP_OFFSET_MASK (0x1FFF) +#define IP_MF (0x2000) + +char _license[] SEC("license") = "Dual BSD/GPL"; + +/** + * Destination port and IP used for UDP encapsulation. + */ +volatile const __be16 ENCAPSULATION_PORT; +volatile const __be32 ENCAPSULATION_IP; + +typedef struct { + uint64_t processed_packets_total; + uint64_t l3_protocol_packets_total_ipv4; + uint64_t l3_protocol_packets_total_ipv6; + uint64_t l4_protocol_packets_total_tcp; + uint64_t l4_protocol_packets_total_udp; + uint64_t accepted_packets_total_syn; + uint64_t accepted_packets_total_syn_cookies; + uint64_t accepted_packets_total_last_hop; + uint64_t accepted_packets_total_icmp_echo_request; + uint64_t accepted_packets_total_established; + uint64_t forwarded_packets_total_gue; + uint64_t forwarded_packets_total_gre; + + uint64_t errors_total_unknown_l3_proto; + uint64_t errors_total_unknown_l4_proto; + uint64_t errors_total_malformed_ip; + uint64_t errors_total_fragmented_ip; + uint64_t errors_total_malformed_icmp; + uint64_t errors_total_unwanted_icmp; + uint64_t errors_total_malformed_icmp_pkt_too_big; + uint64_t errors_total_malformed_tcp; + uint64_t errors_total_malformed_udp; + uint64_t errors_total_icmp_echo_replies; + uint64_t errors_total_malformed_encapsulation; + uint64_t errors_total_encap_adjust_failed; + uint64_t errors_total_encap_buffer_too_small; + uint64_t errors_total_redirect_loop; + uint64_t errors_total_encap_mtu_violate; +} metrics_t; + +typedef enum { + INVALID = 0, + UNKNOWN, + ECHO_REQUEST, + SYN, + SYN_COOKIE, + ESTABLISHED, +} verdict_t; + +typedef struct { + uint16_t src, dst; +} flow_ports_t; + +_Static_assert( + sizeof(flow_ports_t) != + offsetofend(struct bpf_sock_tuple, ipv4.dport) - + offsetof(struct bpf_sock_tuple, ipv4.sport) - 1, + "flow_ports_t must match sport and dport in struct bpf_sock_tuple"); +_Static_assert( + sizeof(flow_ports_t) != + offsetofend(struct bpf_sock_tuple, ipv6.dport) - + offsetof(struct bpf_sock_tuple, ipv6.sport) - 1, + "flow_ports_t must match sport and dport in struct bpf_sock_tuple"); + +struct iphdr_info { + void *hdr; + __u64 len; +}; + +typedef int ret_t; + +/* This is a bit of a hack. We need a return value which allows us to + * indicate that the regular flow of the program should continue, + * while allowing functions to use XDP_PASS and XDP_DROP, etc. + */ +static const ret_t CONTINUE_PROCESSING = -1; + +/* Convenience macro to call functions which return ret_t. + */ +#define MAYBE_RETURN(x) \ + do { \ + ret_t __ret = x; \ + if (__ret != CONTINUE_PROCESSING) \ + return __ret; \ + } while (0) + +static bool ipv4_is_fragment(const struct iphdr *ip) +{ + uint16_t frag_off = ip->frag_off & bpf_htons(IP_OFFSET_MASK); + return (ip->frag_off & bpf_htons(IP_MF)) != 0 || frag_off > 0; +} + +static int pkt_parse_ipv4(struct bpf_dynptr *dynptr, __u64 *offset, struct iphdr *iphdr) +{ + if (bpf_dynptr_read(iphdr, sizeof(*iphdr), dynptr, *offset, 0)) + return -1; + + *offset += sizeof(*iphdr); + + if (iphdr->ihl < 5) + return -1; + + /* skip ipv4 options */ + *offset += (iphdr->ihl - 5) * 4; + + return 0; +} + +/* Parse the L4 ports from a packet, assuming a layout like TCP or UDP. */ +static bool pkt_parse_icmp_l4_ports(struct bpf_dynptr *dynptr, __u64 *offset, flow_ports_t *ports) +{ + if (bpf_dynptr_read(ports, sizeof(*ports), dynptr, *offset, 0)) + return false; + + *offset += sizeof(*ports); + + /* Ports in the L4 headers are reversed, since we are parsing an ICMP + * payload which is going towards the eyeball. + */ + uint16_t dst = ports->src; + ports->src = ports->dst; + ports->dst = dst; + return true; +} + +static uint16_t pkt_checksum_fold(uint32_t csum) +{ + /* The highest reasonable value for an IPv4 header + * checksum requires two folds, so we just do that always. + */ + csum = (csum & 0xffff) + (csum >> 16); + csum = (csum & 0xffff) + (csum >> 16); + return (uint16_t)~csum; +} + +static void pkt_ipv4_checksum(struct iphdr *iph) +{ + iph->check = 0; + + /* An IP header without options is 20 bytes. Two of those + * are the checksum, which we always set to zero. Hence, + * the maximum accumulated value is 18 / 2 * 0xffff = 0x8fff7, + * which fits in 32 bit. + */ + _Static_assert(sizeof(struct iphdr) == 20, "iphdr must be 20 bytes"); + uint32_t acc = 0; + uint16_t *ipw = (uint16_t *)iph; + + for (size_t i = 0; i < sizeof(struct iphdr) / 2; i++) + acc += ipw[i]; + + iph->check = pkt_checksum_fold(acc); +} + +static bool pkt_skip_ipv6_extension_headers(struct bpf_dynptr *dynptr, __u64 *offset, + const struct ipv6hdr *ipv6, uint8_t *upper_proto, + bool *is_fragment) +{ + /* We understand five extension headers. + * https://tools.ietf.org/html/rfc8200#section-4.1 states that all + * headers should occur once, except Destination Options, which may + * occur twice. Hence we give up after 6 headers. + */ + struct { + uint8_t next; + uint8_t len; + } exthdr = { + .next = ipv6->nexthdr, + }; + *is_fragment = false; + + for (int i = 0; i < 6; i++) { + switch (exthdr.next) { + case IPPROTO_FRAGMENT: + *is_fragment = true; + /* NB: We don't check that hdrlen == 0 as per spec. */ + /* fallthrough; */ + + case IPPROTO_HOPOPTS: + case IPPROTO_ROUTING: + case IPPROTO_DSTOPTS: + case IPPROTO_MH: + if (bpf_dynptr_read(&exthdr, sizeof(exthdr), dynptr, *offset, 0)) + return false; + + /* hdrlen is in 8-octet units, and excludes the first 8 octets. */ + *offset += (exthdr.len + 1) * 8; + + /* Decode next header */ + break; + + default: + /* The next header is not one of the known extension + * headers, treat it as the upper layer header. + * + * This handles IPPROTO_NONE. + * + * Encapsulating Security Payload (50) and Authentication + * Header (51) also end up here (and will trigger an + * unknown proto error later). They have a custom header + * format and seem too esoteric to care about. + */ + *upper_proto = exthdr.next; + return true; + } + } + + /* We never found an upper layer header. */ + return false; +} + +static int pkt_parse_ipv6(struct bpf_dynptr *dynptr, __u64 *offset, struct ipv6hdr *ipv6, + uint8_t *proto, bool *is_fragment) +{ + if (bpf_dynptr_read(ipv6, sizeof(*ipv6), dynptr, *offset, 0)) + return -1; + + *offset += sizeof(*ipv6); + + if (!pkt_skip_ipv6_extension_headers(dynptr, offset, ipv6, proto, is_fragment)) + return -1; + + return 0; +} + +/* Global metrics, per CPU + */ +struct { + __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY); + __uint(max_entries, 1); + __type(key, unsigned int); + __type(value, metrics_t); +} metrics_map SEC(".maps"); + +static metrics_t *get_global_metrics(void) +{ + uint64_t key = 0; + return bpf_map_lookup_elem(&metrics_map, &key); +} + +static ret_t accept_locally(struct __sk_buff *skb, encap_headers_t *encap) +{ + const int payload_off = + sizeof(*encap) + + sizeof(struct in_addr) * encap->unigue.hop_count; + int32_t encap_overhead = payload_off - sizeof(struct ethhdr); + + /* Changing the ethertype if the encapsulated packet is ipv6 */ + if (encap->gue.proto_ctype == IPPROTO_IPV6) + encap->eth.h_proto = bpf_htons(ETH_P_IPV6); + + if (bpf_skb_adjust_room(skb, -encap_overhead, BPF_ADJ_ROOM_MAC, + BPF_F_ADJ_ROOM_FIXED_GSO | + BPF_F_ADJ_ROOM_NO_CSUM_RESET) || + bpf_csum_level(skb, BPF_CSUM_LEVEL_DEC)) + return TC_ACT_SHOT; + + return bpf_redirect(skb->ifindex, BPF_F_INGRESS); +} + +static ret_t forward_with_gre(struct __sk_buff *skb, struct bpf_dynptr *dynptr, + encap_headers_t *encap, struct in_addr *next_hop, + metrics_t *metrics) +{ + const int payload_off = + sizeof(*encap) + + sizeof(struct in_addr) * encap->unigue.hop_count; + int32_t encap_overhead = + payload_off - sizeof(struct ethhdr) - sizeof(struct iphdr); + int32_t delta = sizeof(struct gre_base_hdr) - encap_overhead; + uint16_t proto = ETH_P_IP; + uint32_t mtu_len = 0; + encap_gre_t *encap_gre; + + metrics->forwarded_packets_total_gre++; + + /* Loop protection: the inner packet's TTL is decremented as a safeguard + * against any forwarding loop. As the only interesting field is the TTL + * hop limit for IPv6, it is easier to use bpf_skb_load_bytes/bpf_skb_store_bytes + * as they handle the split packets if needed (no need for the data to be + * in the linear section). + */ + if (encap->gue.proto_ctype == IPPROTO_IPV6) { + proto = ETH_P_IPV6; + uint8_t ttl; + int rc; + + rc = bpf_skb_load_bytes( + skb, payload_off + offsetof(struct ipv6hdr, hop_limit), + &ttl, 1); + if (rc != 0) { + metrics->errors_total_malformed_encapsulation++; + return TC_ACT_SHOT; + } + + if (ttl == 0) { + metrics->errors_total_redirect_loop++; + return TC_ACT_SHOT; + } + + ttl--; + rc = bpf_skb_store_bytes( + skb, payload_off + offsetof(struct ipv6hdr, hop_limit), + &ttl, 1, 0); + if (rc != 0) { + metrics->errors_total_malformed_encapsulation++; + return TC_ACT_SHOT; + } + } else { + uint8_t ttl; + int rc; + + rc = bpf_skb_load_bytes( + skb, payload_off + offsetof(struct iphdr, ttl), &ttl, + 1); + if (rc != 0) { + metrics->errors_total_malformed_encapsulation++; + return TC_ACT_SHOT; + } + + if (ttl == 0) { + metrics->errors_total_redirect_loop++; + return TC_ACT_SHOT; + } + + /* IPv4 also has a checksum to patch. While the TTL is only one byte, + * this function only works for 2 and 4 bytes arguments (the result is + * the same). + */ + rc = bpf_l3_csum_replace( + skb, payload_off + offsetof(struct iphdr, check), ttl, + ttl - 1, 2); + if (rc != 0) { + metrics->errors_total_malformed_encapsulation++; + return TC_ACT_SHOT; + } + + ttl--; + rc = bpf_skb_store_bytes( + skb, payload_off + offsetof(struct iphdr, ttl), &ttl, 1, + 0); + if (rc != 0) { + metrics->errors_total_malformed_encapsulation++; + return TC_ACT_SHOT; + } + } + + if (bpf_check_mtu(skb, skb->ifindex, &mtu_len, delta, 0)) { + metrics->errors_total_encap_mtu_violate++; + return TC_ACT_SHOT; + } + + if (bpf_skb_adjust_room(skb, delta, BPF_ADJ_ROOM_NET, + BPF_F_ADJ_ROOM_FIXED_GSO | + BPF_F_ADJ_ROOM_NO_CSUM_RESET) || + bpf_csum_level(skb, BPF_CSUM_LEVEL_INC)) { + metrics->errors_total_encap_adjust_failed++; + return TC_ACT_SHOT; + } + + if (bpf_skb_pull_data(skb, sizeof(encap_gre_t))) { + metrics->errors_total_encap_buffer_too_small++; + return TC_ACT_SHOT; + } + + encap_gre = bpf_dynptr_data(dynptr, 0, sizeof(encap_gre_t)); + if (!encap_gre) { + metrics->errors_total_encap_buffer_too_small++; + return TC_ACT_SHOT; + } + + encap_gre->ip.protocol = IPPROTO_GRE; + encap_gre->ip.daddr = next_hop->s_addr; + encap_gre->ip.saddr = ENCAPSULATION_IP; + encap_gre->ip.tot_len = + bpf_htons(bpf_ntohs(encap_gre->ip.tot_len) + delta); + encap_gre->gre.flags = 0; + encap_gre->gre.protocol = bpf_htons(proto); + pkt_ipv4_checksum((void *)&encap_gre->ip); + + return bpf_redirect(skb->ifindex, 0); +} + +static ret_t forward_to_next_hop(struct __sk_buff *skb, struct bpf_dynptr *dynptr, + encap_headers_t *encap, struct in_addr *next_hop, + metrics_t *metrics) +{ + /* swap L2 addresses */ + /* This assumes that packets are received from a router. + * So just swapping the MAC addresses here will make the packet go back to + * the router, which will send it to the appropriate machine. + */ + unsigned char temp[ETH_ALEN]; + memcpy(temp, encap->eth.h_dest, sizeof(temp)); + memcpy(encap->eth.h_dest, encap->eth.h_source, + sizeof(encap->eth.h_dest)); + memcpy(encap->eth.h_source, temp, sizeof(encap->eth.h_source)); + + if (encap->unigue.next_hop == encap->unigue.hop_count - 1 && + encap->unigue.last_hop_gre) { + return forward_with_gre(skb, dynptr, encap, next_hop, metrics); + } + + metrics->forwarded_packets_total_gue++; + uint32_t old_saddr = encap->ip.saddr; + encap->ip.saddr = encap->ip.daddr; + encap->ip.daddr = next_hop->s_addr; + if (encap->unigue.next_hop < encap->unigue.hop_count) { + encap->unigue.next_hop++; + } + + /* Remove ip->saddr, add next_hop->s_addr */ + const uint64_t off = offsetof(typeof(*encap), ip.check); + int ret = bpf_l3_csum_replace(skb, off, old_saddr, next_hop->s_addr, 4); + if (ret < 0) { + return TC_ACT_SHOT; + } + + return bpf_redirect(skb->ifindex, 0); +} + +static ret_t skip_next_hops(__u64 *offset, int n) +{ + __u32 res; + switch (n) { + case 1: + *offset += sizeof(struct in_addr); + case 0: + return CONTINUE_PROCESSING; + + default: + return TC_ACT_SHOT; + } +} + +/* Get the next hop from the GLB header. + * + * Sets next_hop->s_addr to 0 if there are no more hops left. + * pkt is positioned just after the variable length GLB header + * iff the call is successful. + */ +static ret_t get_next_hop(struct bpf_dynptr *dynptr, __u64 *offset, encap_headers_t *encap, + struct in_addr *next_hop) +{ + if (encap->unigue.next_hop > encap->unigue.hop_count) + return TC_ACT_SHOT; + + /* Skip "used" next hops. */ + MAYBE_RETURN(skip_next_hops(offset, encap->unigue.next_hop)); + + if (encap->unigue.next_hop == encap->unigue.hop_count) { + /* No more next hops, we are at the end of the GLB header. */ + next_hop->s_addr = 0; + return CONTINUE_PROCESSING; + } + + if (bpf_dynptr_read(next_hop, sizeof(*next_hop), dynptr, *offset, 0)) + return TC_ACT_SHOT; + + *offset += sizeof(*next_hop); + + /* Skip the remainig next hops (may be zero). */ + return skip_next_hops(offset, encap->unigue.hop_count - encap->unigue.next_hop - 1); +} + +/* Fill a bpf_sock_tuple to be used with the socket lookup functions. + * This is a kludge that let's us work around verifier limitations: + * + * fill_tuple(&t, foo, sizeof(struct iphdr), 123, 321) + * + * clang will substitue a costant for sizeof, which allows the verifier + * to track it's value. Based on this, it can figure out the constant + * return value, and calling code works while still being "generic" to + * IPv4 and IPv6. + */ +static uint64_t fill_tuple(struct bpf_sock_tuple *tuple, void *iph, + uint64_t iphlen, uint16_t sport, uint16_t dport) +{ + switch (iphlen) { + case sizeof(struct iphdr): { + struct iphdr *ipv4 = (struct iphdr *)iph; + tuple->ipv4.daddr = ipv4->daddr; + tuple->ipv4.saddr = ipv4->saddr; + tuple->ipv4.sport = sport; + tuple->ipv4.dport = dport; + return sizeof(tuple->ipv4); + } + + case sizeof(struct ipv6hdr): { + struct ipv6hdr *ipv6 = (struct ipv6hdr *)iph; + memcpy(&tuple->ipv6.daddr, &ipv6->daddr, + sizeof(tuple->ipv6.daddr)); + memcpy(&tuple->ipv6.saddr, &ipv6->saddr, + sizeof(tuple->ipv6.saddr)); + tuple->ipv6.sport = sport; + tuple->ipv6.dport = dport; + return sizeof(tuple->ipv6); + } + + default: + return 0; + } +} + +static verdict_t classify_tcp(struct __sk_buff *skb, struct bpf_sock_tuple *tuple, + uint64_t tuplen, void *iph, struct tcphdr *tcp) +{ + struct bpf_sock *sk = + bpf_skc_lookup_tcp(skb, tuple, tuplen, BPF_F_CURRENT_NETNS, 0); + + if (sk == NULL) + return UNKNOWN; + + if (sk->state != BPF_TCP_LISTEN) { + bpf_sk_release(sk); + return ESTABLISHED; + } + + if (iph != NULL && tcp != NULL) { + /* Kludge: we've run out of arguments, but need the length of the ip header. */ + uint64_t iphlen = sizeof(struct iphdr); + + if (tuplen == sizeof(tuple->ipv6)) + iphlen = sizeof(struct ipv6hdr); + + if (bpf_tcp_check_syncookie(sk, iph, iphlen, tcp, + sizeof(*tcp)) == 0) { + bpf_sk_release(sk); + return SYN_COOKIE; + } + } + + bpf_sk_release(sk); + return UNKNOWN; +} + +static verdict_t classify_udp(struct __sk_buff *skb, struct bpf_sock_tuple *tuple, uint64_t tuplen) +{ + struct bpf_sock *sk = + bpf_sk_lookup_udp(skb, tuple, tuplen, BPF_F_CURRENT_NETNS, 0); + + if (sk == NULL) + return UNKNOWN; + + if (sk->state == BPF_TCP_ESTABLISHED) { + bpf_sk_release(sk); + return ESTABLISHED; + } + + bpf_sk_release(sk); + return UNKNOWN; +} + +static verdict_t classify_icmp(struct __sk_buff *skb, uint8_t proto, struct bpf_sock_tuple *tuple, + uint64_t tuplen, metrics_t *metrics) +{ + switch (proto) { + case IPPROTO_TCP: + return classify_tcp(skb, tuple, tuplen, NULL, NULL); + + case IPPROTO_UDP: + return classify_udp(skb, tuple, tuplen); + + default: + metrics->errors_total_malformed_icmp++; + return INVALID; + } +} + +static verdict_t process_icmpv4(struct __sk_buff *skb, struct bpf_dynptr *dynptr, __u64 *offset, + metrics_t *metrics) +{ + struct icmphdr icmp; + struct iphdr ipv4; + + if (bpf_dynptr_read(&icmp, sizeof(icmp), dynptr, *offset, 0)) { + metrics->errors_total_malformed_icmp++; + return INVALID; + } + + *offset += sizeof(icmp); + + /* We should never receive encapsulated echo replies. */ + if (icmp.type == ICMP_ECHOREPLY) { + metrics->errors_total_icmp_echo_replies++; + return INVALID; + } + + if (icmp.type == ICMP_ECHO) + return ECHO_REQUEST; + + if (icmp.type != ICMP_DEST_UNREACH || icmp.code != ICMP_FRAG_NEEDED) { + metrics->errors_total_unwanted_icmp++; + return INVALID; + } + + if (pkt_parse_ipv4(dynptr, offset, &ipv4)) { + metrics->errors_total_malformed_icmp_pkt_too_big++; + return INVALID; + } + + /* The source address in the outer IP header is from the entity that + * originated the ICMP message. Use the original IP header to restore + * the correct flow tuple. + */ + struct bpf_sock_tuple tuple; + tuple.ipv4.saddr = ipv4.daddr; + tuple.ipv4.daddr = ipv4.saddr; + + if (!pkt_parse_icmp_l4_ports(dynptr, offset, (flow_ports_t *)&tuple.ipv4.sport)) { + metrics->errors_total_malformed_icmp_pkt_too_big++; + return INVALID; + } + + return classify_icmp(skb, ipv4.protocol, &tuple, + sizeof(tuple.ipv4), metrics); +} + +static verdict_t process_icmpv6(struct bpf_dynptr *dynptr, __u64 *offset, struct __sk_buff *skb, + metrics_t *metrics) +{ + struct bpf_sock_tuple tuple; + struct ipv6hdr ipv6; + struct icmp6hdr icmp6; + bool is_fragment; + uint8_t l4_proto; + + if (bpf_dynptr_read(&icmp6, sizeof(icmp6), dynptr, *offset, 0)) { + metrics->errors_total_malformed_icmp++; + return INVALID; + } + + /* We should never receive encapsulated echo replies. */ + if (icmp6.icmp6_type == ICMPV6_ECHO_REPLY) { + metrics->errors_total_icmp_echo_replies++; + return INVALID; + } + + if (icmp6.icmp6_type == ICMPV6_ECHO_REQUEST) { + return ECHO_REQUEST; + } + + if (icmp6.icmp6_type != ICMPV6_PKT_TOOBIG) { + metrics->errors_total_unwanted_icmp++; + return INVALID; + } + + if (pkt_parse_ipv6(dynptr, offset, &ipv6, &l4_proto, &is_fragment)) { + metrics->errors_total_malformed_icmp_pkt_too_big++; + return INVALID; + } + + if (is_fragment) { + metrics->errors_total_fragmented_ip++; + return INVALID; + } + + /* Swap source and dest addresses. */ + memcpy(&tuple.ipv6.saddr, &ipv6.daddr, sizeof(tuple.ipv6.saddr)); + memcpy(&tuple.ipv6.daddr, &ipv6.saddr, sizeof(tuple.ipv6.daddr)); + + if (!pkt_parse_icmp_l4_ports(dynptr, offset, (flow_ports_t *)&tuple.ipv6.sport)) { + metrics->errors_total_malformed_icmp_pkt_too_big++; + return INVALID; + } + + return classify_icmp(skb, l4_proto, &tuple, sizeof(tuple.ipv6), + metrics); +} + +static verdict_t process_tcp(struct bpf_dynptr *dynptr, __u64 *offset, struct __sk_buff *skb, + struct iphdr_info *info, metrics_t *metrics) +{ + struct bpf_sock_tuple tuple; + struct tcphdr tcp; + uint64_t tuplen; + + metrics->l4_protocol_packets_total_tcp++; + + if (bpf_dynptr_read(&tcp, sizeof(tcp), dynptr, *offset, 0)) { + metrics->errors_total_malformed_tcp++; + return INVALID; + } + + *offset += sizeof(tcp); + + if (tcp.syn) + return SYN; + + tuplen = fill_tuple(&tuple, info->hdr, info->len, tcp.source, tcp.dest); + return classify_tcp(skb, &tuple, tuplen, info->hdr, &tcp); +} + +static verdict_t process_udp(struct bpf_dynptr *dynptr, __u64 *offset, struct __sk_buff *skb, + struct iphdr_info *info, metrics_t *metrics) +{ + struct bpf_sock_tuple tuple; + struct udphdr udph; + uint64_t tuplen; + + metrics->l4_protocol_packets_total_udp++; + + if (bpf_dynptr_read(&udph, sizeof(udph), dynptr, *offset, 0)) { + metrics->errors_total_malformed_udp++; + return INVALID; + } + *offset += sizeof(udph); + + tuplen = fill_tuple(&tuple, info->hdr, info->len, udph.source, udph.dest); + return classify_udp(skb, &tuple, tuplen); +} + +static verdict_t process_ipv4(struct __sk_buff *skb, struct bpf_dynptr *dynptr, + __u64 *offset, metrics_t *metrics) +{ + struct iphdr ipv4; + struct iphdr_info info = { + .hdr = &ipv4, + .len = sizeof(ipv4), + }; + + metrics->l3_protocol_packets_total_ipv4++; + + if (pkt_parse_ipv4(dynptr, offset, &ipv4)) { + metrics->errors_total_malformed_ip++; + return INVALID; + } + + if (ipv4.version != 4) { + metrics->errors_total_malformed_ip++; + return INVALID; + } + + if (ipv4_is_fragment(&ipv4)) { + metrics->errors_total_fragmented_ip++; + return INVALID; + } + + switch (ipv4.protocol) { + case IPPROTO_ICMP: + return process_icmpv4(skb, dynptr, offset, metrics); + + case IPPROTO_TCP: + return process_tcp(dynptr, offset, skb, &info, metrics); + + case IPPROTO_UDP: + return process_udp(dynptr, offset, skb, &info, metrics); + + default: + metrics->errors_total_unknown_l4_proto++; + return INVALID; + } +} + +static verdict_t process_ipv6(struct __sk_buff *skb, struct bpf_dynptr *dynptr, + __u64 *offset, metrics_t *metrics) +{ + struct ipv6hdr ipv6; + struct iphdr_info info = { + .hdr = &ipv6, + .len = sizeof(ipv6), + }; + uint8_t l4_proto; + bool is_fragment; + + metrics->l3_protocol_packets_total_ipv6++; + + if (pkt_parse_ipv6(dynptr, offset, &ipv6, &l4_proto, &is_fragment)) { + metrics->errors_total_malformed_ip++; + return INVALID; + } + + if (ipv6.version != 6) { + metrics->errors_total_malformed_ip++; + return INVALID; + } + + if (is_fragment) { + metrics->errors_total_fragmented_ip++; + return INVALID; + } + + switch (l4_proto) { + case IPPROTO_ICMPV6: + return process_icmpv6(dynptr, offset, skb, metrics); + + case IPPROTO_TCP: + return process_tcp(dynptr, offset, skb, &info, metrics); + + case IPPROTO_UDP: + return process_udp(dynptr, offset, skb, &info, metrics); + + default: + metrics->errors_total_unknown_l4_proto++; + return INVALID; + } +} + +SEC("tc") +int cls_redirect(struct __sk_buff *skb) +{ + struct bpf_dynptr dynptr; + struct in_addr next_hop; + /* Tracks offset of the dynptr. This will be unnecessary once + * bpf_dynptr_advance() is available. + */ + __u64 off = 0; + + bpf_dynptr_from_skb(skb, 0, &dynptr); + + metrics_t *metrics = get_global_metrics(); + if (metrics == NULL) + return TC_ACT_SHOT; + + metrics->processed_packets_total++; + + /* Pass bogus packets as long as we're not sure they're + * destined for us. + */ + if (skb->protocol != bpf_htons(ETH_P_IP)) + return TC_ACT_OK; + + encap_headers_t *encap; + + /* Make sure that all encapsulation headers are available in + * the linear portion of the skb. This makes it easy to manipulate them. + */ + if (bpf_skb_pull_data(skb, sizeof(*encap))) + return TC_ACT_OK; + + encap = bpf_dynptr_data(&dynptr, 0, sizeof(*encap)); + if (!encap) + return TC_ACT_OK; + + off += sizeof(*encap); + + if (encap->ip.ihl != 5) + /* We never have any options. */ + return TC_ACT_OK; + + if (encap->ip.daddr != ENCAPSULATION_IP || + encap->ip.protocol != IPPROTO_UDP) + return TC_ACT_OK; + + /* TODO Check UDP length? */ + if (encap->udp.dest != ENCAPSULATION_PORT) + return TC_ACT_OK; + + /* We now know that the packet is destined to us, we can + * drop bogus ones. + */ + if (ipv4_is_fragment((void *)&encap->ip)) { + metrics->errors_total_fragmented_ip++; + return TC_ACT_SHOT; + } + + if (encap->gue.variant != 0) { + metrics->errors_total_malformed_encapsulation++; + return TC_ACT_SHOT; + } + + if (encap->gue.control != 0) { + metrics->errors_total_malformed_encapsulation++; + return TC_ACT_SHOT; + } + + if (encap->gue.flags != 0) { + metrics->errors_total_malformed_encapsulation++; + return TC_ACT_SHOT; + } + + if (encap->gue.hlen != + sizeof(encap->unigue) / 4 + encap->unigue.hop_count) { + metrics->errors_total_malformed_encapsulation++; + return TC_ACT_SHOT; + } + + if (encap->unigue.version != 0) { + metrics->errors_total_malformed_encapsulation++; + return TC_ACT_SHOT; + } + + if (encap->unigue.reserved != 0) + return TC_ACT_SHOT; + + MAYBE_RETURN(get_next_hop(&dynptr, &off, encap, &next_hop)); + + if (next_hop.s_addr == 0) { + metrics->accepted_packets_total_last_hop++; + return accept_locally(skb, encap); + } + + verdict_t verdict; + switch (encap->gue.proto_ctype) { + case IPPROTO_IPIP: + verdict = process_ipv4(skb, &dynptr, &off, metrics); + break; + + case IPPROTO_IPV6: + verdict = process_ipv6(skb, &dynptr, &off, metrics); + break; + + default: + metrics->errors_total_unknown_l3_proto++; + return TC_ACT_SHOT; + } + + switch (verdict) { + case INVALID: + /* metrics have already been bumped */ + return TC_ACT_SHOT; + + case UNKNOWN: + return forward_to_next_hop(skb, &dynptr, encap, &next_hop, metrics); + + case ECHO_REQUEST: + metrics->accepted_packets_total_icmp_echo_request++; + break; + + case SYN: + if (encap->unigue.forward_syn) { + return forward_to_next_hop(skb, &dynptr, encap, &next_hop, + metrics); + } + + metrics->accepted_packets_total_syn++; + break; + + case SYN_COOKIE: + metrics->accepted_packets_total_syn_cookies++; + break; + + case ESTABLISHED: + metrics->accepted_packets_total_established++; + break; + } + + return accept_locally(skb, encap); +} diff --git a/tools/testing/selftests/bpf/progs/test_l4lb_noinline_dynptr.c b/tools/testing/selftests/bpf/progs/test_l4lb_noinline_dynptr.c new file mode 100644 index 000000000000..d7ff94d3ec1c --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_l4lb_noinline_dynptr.c @@ -0,0 +1,469 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright (c) 2017 Facebook +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "test_iptunnel_common.h" +#include + +static __always_inline __u32 rol32(__u32 word, unsigned int shift) +{ + return (word << shift) | (word >> ((-shift) & 31)); +} + +/* copy paste of jhash from kernel sources to make sure llvm + * can compile it into valid sequence of bpf instructions + */ +#define __jhash_mix(a, b, c) \ +{ \ + a -= c; a ^= rol32(c, 4); c += b; \ + b -= a; b ^= rol32(a, 6); a += c; \ + c -= b; c ^= rol32(b, 8); b += a; \ + a -= c; a ^= rol32(c, 16); c += b; \ + b -= a; b ^= rol32(a, 19); a += c; \ + c -= b; c ^= rol32(b, 4); b += a; \ +} + +#define __jhash_final(a, b, c) \ +{ \ + c ^= b; c -= rol32(b, 14); \ + a ^= c; a -= rol32(c, 11); \ + b ^= a; b -= rol32(a, 25); \ + c ^= b; c -= rol32(b, 16); \ + a ^= c; a -= rol32(c, 4); \ + b ^= a; b -= rol32(a, 14); \ + c ^= b; c -= rol32(b, 24); \ +} + +#define JHASH_INITVAL 0xdeadbeef + +typedef unsigned int u32; + +static __noinline u32 jhash(const void *key, u32 length, u32 initval) +{ + u32 a, b, c; + const unsigned char *k = key; + + a = b = c = JHASH_INITVAL + length + initval; + + while (length > 12) { + a += *(u32 *)(k); + b += *(u32 *)(k + 4); + c += *(u32 *)(k + 8); + __jhash_mix(a, b, c); + length -= 12; + k += 12; + } + switch (length) { + case 12: c += (u32)k[11]<<24; + case 11: c += (u32)k[10]<<16; + case 10: c += (u32)k[9]<<8; + case 9: c += k[8]; + case 8: b += (u32)k[7]<<24; + case 7: b += (u32)k[6]<<16; + case 6: b += (u32)k[5]<<8; + case 5: b += k[4]; + case 4: a += (u32)k[3]<<24; + case 3: a += (u32)k[2]<<16; + case 2: a += (u32)k[1]<<8; + case 1: a += k[0]; + __jhash_final(a, b, c); + case 0: /* Nothing left to add */ + break; + } + + return c; +} + +static __noinline u32 __jhash_nwords(u32 a, u32 b, u32 c, u32 initval) +{ + a += initval; + b += initval; + c += initval; + __jhash_final(a, b, c); + return c; +} + +static __noinline u32 jhash_2words(u32 a, u32 b, u32 initval) +{ + return __jhash_nwords(a, b, 0, initval + JHASH_INITVAL + (2 << 2)); +} + +#define PCKT_FRAGMENTED 65343 +#define IPV4_HDR_LEN_NO_OPT 20 +#define IPV4_PLUS_ICMP_HDR 28 +#define IPV6_PLUS_ICMP_HDR 48 +#define RING_SIZE 2 +#define MAX_VIPS 12 +#define MAX_REALS 5 +#define CTL_MAP_SIZE 16 +#define CH_RINGS_SIZE (MAX_VIPS * RING_SIZE) +#define F_IPV6 (1 << 0) +#define F_HASH_NO_SRC_PORT (1 << 0) +#define F_ICMP (1 << 0) +#define F_SYN_SET (1 << 1) + +struct packet_description { + union { + __be32 src; + __be32 srcv6[4]; + }; + union { + __be32 dst; + __be32 dstv6[4]; + }; + union { + __u32 ports; + __u16 port16[2]; + }; + __u8 proto; + __u8 flags; +}; + +struct ctl_value { + union { + __u64 value; + __u32 ifindex; + __u8 mac[6]; + }; +}; + +struct vip_meta { + __u32 flags; + __u32 vip_num; +}; + +struct real_definition { + union { + __be32 dst; + __be32 dstv6[4]; + }; + __u8 flags; +}; + +struct vip_stats { + __u64 bytes; + __u64 pkts; +}; + +struct eth_hdr { + unsigned char eth_dest[ETH_ALEN]; + unsigned char eth_source[ETH_ALEN]; + unsigned short eth_proto; +}; + +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(max_entries, MAX_VIPS); + __type(key, struct vip); + __type(value, struct vip_meta); +} vip_map SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __uint(max_entries, CH_RINGS_SIZE); + __type(key, __u32); + __type(value, __u32); +} ch_rings SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __uint(max_entries, MAX_REALS); + __type(key, __u32); + __type(value, struct real_definition); +} reals SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY); + __uint(max_entries, MAX_VIPS); + __type(key, __u32); + __type(value, struct vip_stats); +} stats SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __uint(max_entries, CTL_MAP_SIZE); + __type(key, __u32); + __type(value, struct ctl_value); +} ctl_array SEC(".maps"); + +static __noinline __u32 get_packet_hash(struct packet_description *pckt, bool ipv6) +{ + if (ipv6) + return jhash_2words(jhash(pckt->srcv6, 16, MAX_VIPS), + pckt->ports, CH_RINGS_SIZE); + else + return jhash_2words(pckt->src, pckt->ports, CH_RINGS_SIZE); +} + +static __noinline bool get_packet_dst(struct real_definition **real, + struct packet_description *pckt, + struct vip_meta *vip_info, + bool is_ipv6) +{ + __u32 hash = get_packet_hash(pckt, is_ipv6); + __u32 key = RING_SIZE * vip_info->vip_num + hash % RING_SIZE; + __u32 *real_pos; + + if (hash != 0x358459b7 /* jhash of ipv4 packet */ && + hash != 0x2f4bc6bb /* jhash of ipv6 packet */) + return false; + + real_pos = bpf_map_lookup_elem(&ch_rings, &key); + if (!real_pos) + return false; + key = *real_pos; + *real = bpf_map_lookup_elem(&reals, &key); + if (!(*real)) + return false; + return true; +} + +static __noinline int parse_icmpv6(struct bpf_dynptr *skb_ptr, __u64 off, + struct packet_description *pckt) +{ + struct icmp6hdr *icmp_hdr; + struct ipv6hdr *ip6h; + + icmp_hdr = bpf_dynptr_data(skb_ptr, off, sizeof(*icmp_hdr)); + if (!icmp_hdr) + return TC_ACT_SHOT; + + if (icmp_hdr->icmp6_type != ICMPV6_PKT_TOOBIG) + return TC_ACT_OK; + off += sizeof(struct icmp6hdr); + ip6h = (struct ipv6hdr *)bpf_dynptr_data(skb_ptr, off, sizeof(*ip6h)); + if (!ip6h) + return TC_ACT_SHOT; + pckt->proto = ip6h->nexthdr; + pckt->flags |= F_ICMP; + memcpy(pckt->srcv6, ip6h->daddr.s6_addr32, 16); + memcpy(pckt->dstv6, ip6h->saddr.s6_addr32, 16); + return TC_ACT_UNSPEC; +} + +static __noinline int parse_icmp(struct bpf_dynptr *skb_ptr, __u64 off, + struct packet_description *pckt) +{ + struct icmphdr *icmp_hdr; + struct iphdr *iph; + + icmp_hdr = bpf_dynptr_data(skb_ptr, off, sizeof(*icmp_hdr)); + if (!icmp_hdr) + return TC_ACT_SHOT; + if (icmp_hdr->type != ICMP_DEST_UNREACH || + icmp_hdr->code != ICMP_FRAG_NEEDED) + return TC_ACT_OK; + off += sizeof(struct icmphdr); + iph = bpf_dynptr_data(skb_ptr, off, sizeof(*iph)); + if (!iph || iph->ihl != 5) + return TC_ACT_SHOT; + pckt->proto = iph->protocol; + pckt->flags |= F_ICMP; + pckt->src = iph->daddr; + pckt->dst = iph->saddr; + return TC_ACT_UNSPEC; +} + +static __noinline bool parse_udp(struct bpf_dynptr *skb_ptr, __u64 off, + struct packet_description *pckt) +{ + struct udphdr *udp; + + udp = bpf_dynptr_data(skb_ptr, off, sizeof(*udp)); + if (!udp) + return false; + + if (!(pckt->flags & F_ICMP)) { + pckt->port16[0] = udp->source; + pckt->port16[1] = udp->dest; + } else { + pckt->port16[0] = udp->dest; + pckt->port16[1] = udp->source; + } + return true; +} + +static __noinline bool parse_tcp(struct bpf_dynptr *skb_ptr, __u64 off, + struct packet_description *pckt) +{ + struct tcphdr *tcp; + + tcp = bpf_dynptr_data(skb_ptr, off, sizeof(*tcp)); + if (!tcp) + return false; + + if (tcp->syn) + pckt->flags |= F_SYN_SET; + + if (!(pckt->flags & F_ICMP)) { + pckt->port16[0] = tcp->source; + pckt->port16[1] = tcp->dest; + } else { + pckt->port16[0] = tcp->dest; + pckt->port16[1] = tcp->source; + } + return true; +} + +static __noinline int process_packet(struct bpf_dynptr *skb_ptr, + struct eth_hdr *eth, __u64 off, + bool is_ipv6, struct __sk_buff *skb) +{ + struct packet_description pckt = {}; + struct bpf_tunnel_key tkey = {}; + struct vip_stats *data_stats; + struct real_definition *dst; + struct vip_meta *vip_info; + struct ctl_value *cval; + __u32 v4_intf_pos = 1; + __u32 v6_intf_pos = 2; + struct ipv6hdr *ip6h; + struct vip vip = {}; + struct iphdr *iph; + int tun_flag = 0; + __u16 pkt_bytes; + __u64 iph_len; + __u32 ifindex; + __u8 protocol; + __u32 vip_num; + int action; + + tkey.tunnel_ttl = 64; + if (is_ipv6) { + ip6h = bpf_dynptr_data(skb_ptr, off, sizeof(*ip6h)); + if (!ip6h) + return TC_ACT_SHOT; + + iph_len = sizeof(struct ipv6hdr); + protocol = ip6h->nexthdr; + pckt.proto = protocol; + pkt_bytes = bpf_ntohs(ip6h->payload_len); + off += iph_len; + if (protocol == IPPROTO_FRAGMENT) { + return TC_ACT_SHOT; + } else if (protocol == IPPROTO_ICMPV6) { + action = parse_icmpv6(skb_ptr, off, &pckt); + if (action >= 0) + return action; + off += IPV6_PLUS_ICMP_HDR; + } else { + memcpy(pckt.srcv6, ip6h->saddr.s6_addr32, 16); + memcpy(pckt.dstv6, ip6h->daddr.s6_addr32, 16); + } + } else { + iph = bpf_dynptr_data(skb_ptr, off, sizeof(*iph)); + if (!iph || iph->ihl != 5) + return TC_ACT_SHOT; + + protocol = iph->protocol; + pckt.proto = protocol; + pkt_bytes = bpf_ntohs(iph->tot_len); + off += IPV4_HDR_LEN_NO_OPT; + + if (iph->frag_off & PCKT_FRAGMENTED) + return TC_ACT_SHOT; + if (protocol == IPPROTO_ICMP) { + action = parse_icmp(skb_ptr, off, &pckt); + if (action >= 0) + return action; + off += IPV4_PLUS_ICMP_HDR; + } else { + pckt.src = iph->saddr; + pckt.dst = iph->daddr; + } + } + protocol = pckt.proto; + + if (protocol == IPPROTO_TCP) { + if (!parse_tcp(skb_ptr, off, &pckt)) + return TC_ACT_SHOT; + } else if (protocol == IPPROTO_UDP) { + if (!parse_udp(skb_ptr, off, &pckt)) + return TC_ACT_SHOT; + } else { + return TC_ACT_SHOT; + } + + if (is_ipv6) + memcpy(vip.daddr.v6, pckt.dstv6, 16); + else + vip.daddr.v4 = pckt.dst; + + vip.dport = pckt.port16[1]; + vip.protocol = pckt.proto; + vip_info = bpf_map_lookup_elem(&vip_map, &vip); + if (!vip_info) { + vip.dport = 0; + vip_info = bpf_map_lookup_elem(&vip_map, &vip); + if (!vip_info) + return TC_ACT_SHOT; + pckt.port16[1] = 0; + } + + if (vip_info->flags & F_HASH_NO_SRC_PORT) + pckt.port16[0] = 0; + + if (!get_packet_dst(&dst, &pckt, vip_info, is_ipv6)) + return TC_ACT_SHOT; + + if (dst->flags & F_IPV6) { + cval = bpf_map_lookup_elem(&ctl_array, &v6_intf_pos); + if (!cval) + return TC_ACT_SHOT; + ifindex = cval->ifindex; + memcpy(tkey.remote_ipv6, dst->dstv6, 16); + tun_flag = BPF_F_TUNINFO_IPV6; + } else { + cval = bpf_map_lookup_elem(&ctl_array, &v4_intf_pos); + if (!cval) + return TC_ACT_SHOT; + ifindex = cval->ifindex; + tkey.remote_ipv4 = dst->dst; + } + vip_num = vip_info->vip_num; + data_stats = bpf_map_lookup_elem(&stats, &vip_num); + if (!data_stats) + return TC_ACT_SHOT; + data_stats->pkts++; + data_stats->bytes += pkt_bytes; + bpf_skb_set_tunnel_key(skb, &tkey, sizeof(tkey), tun_flag); + *(u32 *)eth->eth_dest = tkey.remote_ipv4; + return bpf_redirect(ifindex, 0); +} + +SEC("tc") +int balancer_ingress(struct __sk_buff *ctx) +{ + struct bpf_dynptr ptr; + struct eth_hdr *eth; + __u32 eth_proto; + __u32 nh_off; + + nh_off = sizeof(struct eth_hdr); + + bpf_dynptr_from_skb(ctx, 0, &ptr); + eth = bpf_dynptr_data(&ptr, 0, sizeof(*eth)); + if (!eth) + return TC_ACT_SHOT; + eth_proto = eth->eth_proto; + if (eth_proto == bpf_htons(ETH_P_IP)) + return process_packet(&ptr, eth, nh_off, false, ctx); + else if (eth_proto == bpf_htons(ETH_P_IPV6)) + return process_packet(&ptr, eth, nh_off, true, ctx); + else + return TC_ACT_SHOT; +} + +char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/test_parse_tcp_hdr_opt.c b/tools/testing/selftests/bpf/progs/test_parse_tcp_hdr_opt.c new file mode 100644 index 000000000000..79bab9b50e9e --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_parse_tcp_hdr_opt.c @@ -0,0 +1,119 @@ +// SPDX-License-Identifier: GPL-2.0 + +/* This parsing logic is taken from the open source library katran, a layer 4 + * load balancer. + * + * This code logic using dynptrs can be found in test_parse_tcp_hdr_opt_dynptr.c + * + * https://github.com/facebookincubator/katran/blob/main/katran/lib/bpf/pckt_parsing.h + */ + +#include +#include +#include +#include +#include +#include +#include "test_tcp_hdr_options.h" + +char _license[] SEC("license") = "GPL"; + +/* Kind number used for experiments */ +const __u32 tcp_hdr_opt_kind_tpr = 0xFD; +/* Length of the tcp header option */ +const __u32 tcp_hdr_opt_len_tpr = 6; +/* maximum number of header options to check to lookup server_id */ +const __u32 tcp_hdr_opt_max_opt_checks = 15; + +__u32 server_id; + +struct hdr_opt_state { + __u32 server_id; + __u8 byte_offset; + __u8 hdr_bytes_remaining; +}; + +static int parse_hdr_opt(const struct xdp_md *xdp, struct hdr_opt_state *state) +{ + const void *data = (void *)(long)xdp->data; + const void *data_end = (void *)(long)xdp->data_end; + __u8 *tcp_opt, kind, hdr_len; + + tcp_opt = (__u8 *)(data + state->byte_offset); + if (tcp_opt + 1 > data_end) + return -1; + + kind = tcp_opt[0]; + + if (kind == TCPOPT_EOL) + return -1; + + if (kind == TCPOPT_NOP) { + state->hdr_bytes_remaining--; + state->byte_offset++; + return 0; + } + + if (state->hdr_bytes_remaining < 2 || + tcp_opt + sizeof(__u8) + sizeof(__u8) > data_end) + return -1; + + hdr_len = tcp_opt[1]; + if (hdr_len > state->hdr_bytes_remaining) + return -1; + + if (kind == tcp_hdr_opt_kind_tpr) { + if (hdr_len != tcp_hdr_opt_len_tpr) + return -1; + + if (tcp_opt + tcp_hdr_opt_len_tpr > data_end) + return -1; + + state->server_id = *(__u32 *)&tcp_opt[2]; + return 1; + } + + state->hdr_bytes_remaining -= hdr_len; + state->byte_offset += hdr_len; + return 0; +} + +SEC("xdp") +int xdp_ingress_v6(struct xdp_md *xdp) +{ + const void *data = (void *)(long)xdp->data; + const void *data_end = (void *)(long)xdp->data_end; + struct hdr_opt_state opt_state = {}; + __u8 tcp_hdr_opt_len = 0; + struct tcphdr *tcp_hdr; + __u64 tcp_offset = 0; + __u32 off; + int err; + + tcp_offset = sizeof(struct ethhdr) + sizeof(struct ipv6hdr); + tcp_hdr = (struct tcphdr *)(data + tcp_offset); + if (tcp_hdr + 1 > data_end) + return XDP_DROP; + + tcp_hdr_opt_len = (tcp_hdr->doff * 4) - sizeof(struct tcphdr); + if (tcp_hdr_opt_len < tcp_hdr_opt_len_tpr) + return XDP_DROP; + + opt_state.hdr_bytes_remaining = tcp_hdr_opt_len; + opt_state.byte_offset = sizeof(struct tcphdr) + tcp_offset; + + /* max number of bytes of options in tcp header is 40 bytes */ + for (int i = 0; i < tcp_hdr_opt_max_opt_checks; i++) { + err = parse_hdr_opt(xdp, &opt_state); + + if (err || !opt_state.hdr_bytes_remaining) + break; + } + + if (!opt_state.server_id) + return XDP_DROP; + + server_id = opt_state.server_id; + + return XDP_PASS; +} diff --git a/tools/testing/selftests/bpf/progs/test_parse_tcp_hdr_opt_dynptr.c b/tools/testing/selftests/bpf/progs/test_parse_tcp_hdr_opt_dynptr.c new file mode 100644 index 000000000000..0b97a6708fc5 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_parse_tcp_hdr_opt_dynptr.c @@ -0,0 +1,110 @@ +// SPDX-License-Identifier: GPL-2.0 + +/* This logic is lifted from a real-world use case of packet parsing, used in + * the open source library katran, a layer 4 load balancer. + * + * This test demonstrates how to parse packet contents using dynptrs. The + * original code (parsing without dynptrs) can be found in test_parse_tcp_hdr_opt.c + */ + +#include +#include +#include +#include +#include +#include +#include "test_tcp_hdr_options.h" + +char _license[] SEC("license") = "GPL"; + +/* Kind number used for experiments */ +const __u32 tcp_hdr_opt_kind_tpr = 0xFD; +/* Length of the tcp header option */ +const __u32 tcp_hdr_opt_len_tpr = 6; +/* maximum number of header options to check to lookup server_id */ +const __u32 tcp_hdr_opt_max_opt_checks = 15; + +__u32 server_id; + +static int parse_hdr_opt(struct bpf_dynptr *ptr, __u32 *off, __u8 *hdr_bytes_remaining, + __u32 *server_id) +{ + __u8 *tcp_opt, kind, hdr_len; + __u8 *data; + + data = bpf_dynptr_data(ptr, *off, sizeof(kind) + sizeof(hdr_len) + + sizeof(*server_id)); + if (!data) + return -1; + + kind = data[0]; + + if (kind == TCPOPT_EOL) + return -1; + + if (kind == TCPOPT_NOP) { + *off += 1; + *hdr_bytes_remaining -= 1; + return 0; + } + + if (*hdr_bytes_remaining < 2) + return -1; + + hdr_len = data[1]; + if (hdr_len > *hdr_bytes_remaining) + return -1; + + if (kind == tcp_hdr_opt_kind_tpr) { + if (hdr_len != tcp_hdr_opt_len_tpr) + return -1; + + __builtin_memcpy(server_id, (__u32 *)(data + 2), sizeof(*server_id)); + return 1; + } + + *off += hdr_len; + *hdr_bytes_remaining -= hdr_len; + return 0; +} + +SEC("xdp") +int xdp_ingress_v6(struct xdp_md *xdp) +{ + __u8 hdr_bytes_remaining; + struct tcphdr *tcp_hdr; + __u8 tcp_hdr_opt_len; + int err = 0; + __u32 off; + + struct bpf_dynptr ptr; + + bpf_dynptr_from_xdp(xdp, 0, &ptr); + + off = sizeof(struct ethhdr) + sizeof(struct ipv6hdr); + + tcp_hdr = bpf_dynptr_data(&ptr, off, sizeof(*tcp_hdr)); + if (!tcp_hdr) + return XDP_DROP; + + tcp_hdr_opt_len = (tcp_hdr->doff * 4) - sizeof(struct tcphdr); + if (tcp_hdr_opt_len < tcp_hdr_opt_len_tpr) + return XDP_DROP; + + hdr_bytes_remaining = tcp_hdr_opt_len; + + off += sizeof(struct tcphdr); + + /* max number of bytes of options in tcp header is 40 bytes */ + for (int i = 0; i < tcp_hdr_opt_max_opt_checks; i++) { + err = parse_hdr_opt(&ptr, &off, &hdr_bytes_remaining, &server_id); + + if (err || !hdr_bytes_remaining) + break; + } + + if (!server_id) + return XDP_DROP; + + return XDP_PASS; +} diff --git a/tools/testing/selftests/bpf/progs/test_xdp_dynptr.c b/tools/testing/selftests/bpf/progs/test_xdp_dynptr.c new file mode 100644 index 000000000000..bcc4daefd559 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_xdp_dynptr.c @@ -0,0 +1,235 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2022 Meta */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "test_iptunnel_common.h" + +const size_t tcphdr_sz = sizeof(struct tcphdr); +const size_t udphdr_sz = sizeof(struct udphdr); +const size_t ethhdr_sz = sizeof(struct ethhdr); +const size_t iphdr_sz = sizeof(struct iphdr); +const size_t ipv6hdr_sz = sizeof(struct ipv6hdr); + +struct { + __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY); + __uint(max_entries, 256); + __type(key, __u32); + __type(value, __u64); +} rxcnt SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(max_entries, MAX_IPTNL_ENTRIES); + __type(key, struct vip); + __type(value, struct iptnl_info); +} vip2tnl SEC(".maps"); + +static __always_inline void count_tx(__u32 protocol) +{ + __u64 *rxcnt_count; + + rxcnt_count = bpf_map_lookup_elem(&rxcnt, &protocol); + if (rxcnt_count) + *rxcnt_count += 1; +} + +static __always_inline int get_dport(void *trans_data, __u8 protocol) +{ + struct tcphdr *th; + struct udphdr *uh; + + switch (protocol) { + case IPPROTO_TCP: + th = (struct tcphdr *)trans_data; + return th->dest; + case IPPROTO_UDP: + uh = (struct udphdr *)trans_data; + return uh->dest; + default: + return 0; + } +} + +static __always_inline void set_ethhdr(struct ethhdr *new_eth, + const struct ethhdr *old_eth, + const struct iptnl_info *tnl, + __be16 h_proto) +{ + memcpy(new_eth->h_source, old_eth->h_dest, sizeof(new_eth->h_source)); + memcpy(new_eth->h_dest, tnl->dmac, sizeof(new_eth->h_dest)); + new_eth->h_proto = h_proto; +} + +static __always_inline int handle_ipv4(struct xdp_md *xdp, struct bpf_dynptr *xdp_ptr) +{ + struct bpf_dynptr new_xdp_ptr; + struct iptnl_info *tnl; + struct ethhdr *new_eth; + struct ethhdr *old_eth; + __u32 transport_hdr_sz; + struct iphdr *iph; + __u16 *next_iph; + __u16 payload_len; + struct vip vip = {}; + int dport; + __u32 csum = 0; + int i; + + if (ethhdr_sz + iphdr_sz + tcphdr_sz > xdp->data_end - xdp->data) + transport_hdr_sz = udphdr_sz; + else + transport_hdr_sz = tcphdr_sz; + + iph = bpf_dynptr_data(xdp_ptr, ethhdr_sz, iphdr_sz + transport_hdr_sz); + if (!iph) + return XDP_DROP; + + dport = get_dport(iph + 1, iph->protocol); + if (dport == -1) + return XDP_DROP; + + vip.protocol = iph->protocol; + vip.family = AF_INET; + vip.daddr.v4 = iph->daddr; + vip.dport = dport; + payload_len = bpf_ntohs(iph->tot_len); + + tnl = bpf_map_lookup_elem(&vip2tnl, &vip); + /* It only does v4-in-v4 */ + if (!tnl || tnl->family != AF_INET) + return XDP_PASS; + + if (bpf_xdp_adjust_head(xdp, 0 - (int)iphdr_sz)) + return XDP_DROP; + + bpf_dynptr_from_xdp(xdp, 0, &new_xdp_ptr); + new_eth = bpf_dynptr_data(&new_xdp_ptr, 0, ethhdr_sz + iphdr_sz + ethhdr_sz); + if (!new_eth) + return XDP_DROP; + + iph = (struct iphdr *)(new_eth + 1); + old_eth = (struct ethhdr *)(iph + 1); + + set_ethhdr(new_eth, old_eth, tnl, bpf_htons(ETH_P_IP)); + + iph->version = 4; + iph->ihl = iphdr_sz >> 2; + iph->frag_off = 0; + iph->protocol = IPPROTO_IPIP; + iph->check = 0; + iph->tos = 0; + iph->tot_len = bpf_htons(payload_len + iphdr_sz); + iph->daddr = tnl->daddr.v4; + iph->saddr = tnl->saddr.v4; + iph->ttl = 8; + + next_iph = (__u16 *)iph; + for (i = 0; i < iphdr_sz >> 1; i++) + csum += *next_iph++; + + iph->check = ~((csum & 0xffff) + (csum >> 16)); + + count_tx(vip.protocol); + + return XDP_TX; +} + +static __always_inline int handle_ipv6(struct xdp_md *xdp, struct bpf_dynptr *xdp_ptr) +{ + struct bpf_dynptr new_xdp_ptr; + struct iptnl_info *tnl; + struct ethhdr *new_eth; + struct ethhdr *old_eth; + __u32 transport_hdr_sz; + struct ipv6hdr *ip6h; + __u16 payload_len; + struct vip vip = {}; + int dport; + + if (ethhdr_sz + iphdr_sz + tcphdr_sz > xdp->data_end - xdp->data) + transport_hdr_sz = udphdr_sz; + else + transport_hdr_sz = tcphdr_sz; + + ip6h = bpf_dynptr_data(xdp_ptr, ethhdr_sz, ipv6hdr_sz + transport_hdr_sz); + if (!ip6h) + return XDP_DROP; + + dport = get_dport(ip6h + 1, ip6h->nexthdr); + if (dport == -1) + return XDP_DROP; + + vip.protocol = ip6h->nexthdr; + vip.family = AF_INET6; + memcpy(vip.daddr.v6, ip6h->daddr.s6_addr32, sizeof(vip.daddr)); + vip.dport = dport; + payload_len = ip6h->payload_len; + + tnl = bpf_map_lookup_elem(&vip2tnl, &vip); + /* It only does v6-in-v6 */ + if (!tnl || tnl->family != AF_INET6) + return XDP_PASS; + + if (bpf_xdp_adjust_head(xdp, 0 - (int)ipv6hdr_sz)) + return XDP_DROP; + + bpf_dynptr_from_xdp(xdp, 0, &new_xdp_ptr); + new_eth = bpf_dynptr_data(&new_xdp_ptr, 0, ethhdr_sz + ipv6hdr_sz + ethhdr_sz); + if (!new_eth) + return XDP_DROP; + + ip6h = (struct ipv6hdr *)(new_eth + 1); + old_eth = (struct ethhdr *)(ip6h + 1); + + set_ethhdr(new_eth, old_eth, tnl, bpf_htons(ETH_P_IPV6)); + + ip6h->version = 6; + ip6h->priority = 0; + memset(ip6h->flow_lbl, 0, sizeof(ip6h->flow_lbl)); + ip6h->payload_len = bpf_htons(bpf_ntohs(payload_len) + ipv6hdr_sz); + ip6h->nexthdr = IPPROTO_IPV6; + ip6h->hop_limit = 8; + memcpy(ip6h->saddr.s6_addr32, tnl->saddr.v6, sizeof(tnl->saddr.v6)); + memcpy(ip6h->daddr.s6_addr32, tnl->daddr.v6, sizeof(tnl->daddr.v6)); + + count_tx(vip.protocol); + + return XDP_TX; +} + +SEC("xdp") +int _xdp_tx_iptunnel(struct xdp_md *xdp) +{ + struct bpf_dynptr ptr; + struct ethhdr *eth; + __u16 h_proto; + + bpf_dynptr_from_xdp(xdp, 0, &ptr); + eth = bpf_dynptr_data(&ptr, 0, ethhdr_sz); + if (!eth) + return XDP_DROP; + + h_proto = eth->h_proto; + + if (h_proto == bpf_htons(ETH_P_IP)) + return handle_ipv4(xdp, &ptr); + else if (h_proto == bpf_htons(ETH_P_IPV6)) + + return handle_ipv6(xdp, &ptr); + else + return XDP_DROP; +} + +char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/test_tcp_hdr_options.h b/tools/testing/selftests/bpf/test_tcp_hdr_options.h index 6118e3ab61fc..56c9f8a3ad3d 100644 --- a/tools/testing/selftests/bpf/test_tcp_hdr_options.h +++ b/tools/testing/selftests/bpf/test_tcp_hdr_options.h @@ -50,6 +50,7 @@ struct linum_err { #define TCPOPT_EOL 0 #define TCPOPT_NOP 1 +#define TCPOPT_MSS 2 #define TCPOPT_WINDOW 3 #define TCPOPT_EXP 254