From patchwork Wed Dec 18 02:44:17 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "D. Wythe" X-Patchwork-Id: 13912956 Received: from out30-98.freemail.mail.aliyun.com (out30-98.freemail.mail.aliyun.com [115.124.30.98]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 75D8513C81B; Wed, 18 Dec 2024 02:44:31 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=115.124.30.98 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1734489875; cv=none; b=SttUDIBX50rV5aZwe6CCWv9oVr5Qc6zEw1YBaVwCAymarcfAe2Xf8wMOsZn7vrnOL5ereTqSNKJxMtHiZS4e6LFOTITWCFOZTHN2ai1oBKopUSv1DnWOJRl31f5qa17QkDDiA0+PIHepZLSKEIQbibSs+Jh3mTiLjSi6VNBj9NU= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1734489875; c=relaxed/simple; bh=8Me0+ORdUwnQJJZusQP7gsBibbKOgKoUBm5L+YNIMj0=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=q5DcSQMydlEwUdCCo0moSzteLtRCzIVkESBqoB5vqjpbpNsPnx8RYH+puD/q2GebnelaZwv5bRKI/c+e8isfStE1s9r11ChhwBHPzqfUVSVUIu8aeR0uYQ15X9d+l/PlVD0a0En6QE0uyx5tQd+CQfnQDdHuXibnD4xjp9SBsZk= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.alibaba.com; spf=pass smtp.mailfrom=linux.alibaba.com; dkim=pass (1024-bit key) header.d=linux.alibaba.com header.i=@linux.alibaba.com header.b=QgkUQ4wR; arc=none smtp.client-ip=115.124.30.98 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.alibaba.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.alibaba.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.alibaba.com header.i=@linux.alibaba.com header.b="QgkUQ4wR" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.alibaba.com; 
s=default; t=1734489870; h=From:To:Subject:Date:Message-ID:MIME-Version; bh=GAH1Lz6yySHh0akD6UjVzqvmSG58TUkUU80VrEvWcyY=; b=QgkUQ4wR4yaxnjMldItkIUu/YQf1UST2NbfDV29GgC255tQygnkLpL1rXXuCCVIEpYZvYIRZ5LpGTfngFzW/bS9GyuC3m3CZ95XI2YC90CGO4ZMiKO/Kbs7wnwnZzIJOv5Pz5kipM/BFVcA6AquWnPKPGujMGvms6RaV61yDejY= Received: from j66a10360.sqa.eu95.tbsite.net(mailfrom:alibuda@linux.alibaba.com fp:SMTPD_---0WLko6dn_1734489867 cluster:ay36) by smtp.aliyun-inc.com; Wed, 18 Dec 2024 10:44:28 +0800 From: "D. Wythe" To: kgraul@linux.ibm.com, wenjia@linux.ibm.com, jaka@linux.ibm.com, ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org, martin.lau@linux.dev, pabeni@redhat.com, song@kernel.org, sdf@google.com, haoluo@google.com, yhs@fb.com, edumazet@google.com, john.fastabend@gmail.com, kpsingh@kernel.org, jolsa@kernel.org, guwen@linux.alibaba.com Cc: kuba@kernel.org, davem@davemloft.net, netdev@vger.kernel.org, linux-s390@vger.kernel.org, linux-rdma@vger.kernel.org, bpf@vger.kernel.org Subject: [PATCH bpf-next v3 1/5] bpf: export necessary symbols for modules with struct_ops Date: Wed, 18 Dec 2024 10:44:17 +0800 Message-ID: <20241218024422.23423-2-alibuda@linux.alibaba.com> X-Mailer: git-send-email 2.45.0 In-Reply-To: <20241218024422.23423-1-alibuda@linux.alibaba.com> References: <20241218024422.23423-1-alibuda@linux.alibaba.com> Precedence: bulk X-Mailing-List: linux-rdma@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Export three symbols that are needed to implement struct_ops in a tristate subsystem. bpf_struct_ops_get() and bpf_struct_ops_put() are used conditionally by the inline functions bpf_try_module_get() and bpf_module_put() to hold or release the refcnt of a struct_ops. bpf_obj_name_cpy() copies an object name from one buffer to another with validity checks. Signed-off-by: D.
Wythe --- kernel/bpf/bpf_struct_ops.c | 2 ++ kernel/bpf/syscall.c | 1 + 2 files changed, 3 insertions(+) diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c index 606efe32485a..00c212e0ad39 100644 --- a/kernel/bpf/bpf_struct_ops.c +++ b/kernel/bpf/bpf_struct_ops.c @@ -1119,6 +1119,7 @@ bool bpf_struct_ops_get(const void *kdata) map = __bpf_map_inc_not_zero(&st_map->map, false); return !IS_ERR(map); } +EXPORT_SYMBOL_GPL(bpf_struct_ops_get); void bpf_struct_ops_put(const void *kdata) { @@ -1130,6 +1131,7 @@ void bpf_struct_ops_put(const void *kdata) bpf_map_put(&st_map->map); } +EXPORT_SYMBOL_GPL(bpf_struct_ops_put); int bpf_struct_ops_supported(const struct bpf_struct_ops *st_ops, u32 moff) { diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 5684e8ce132d..62238ec989dc 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1167,6 +1167,7 @@ int bpf_obj_name_cpy(char *dst, const char *src, unsigned int size) return src - orig_src; } +EXPORT_SYMBOL_GPL(bpf_obj_name_cpy); int map_check_no_btf(const struct bpf_map *map, const struct btf *btf, From patchwork Wed Dec 18 02:44:18 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "D. 
Wythe" X-Patchwork-Id: 13912957 Received: from out30-112.freemail.mail.aliyun.com (out30-112.freemail.mail.aliyun.com [115.124.30.112]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 65FE61F5E6; Wed, 18 Dec 2024 02:44:31 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=115.124.30.112 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1734489876; cv=none; b=OyKIHVCF+o3/qn8hTMBYXK6S5pmcZX/BeEVo11jjmTTdsMti3t9+2XaTRkoDU7GcdgI0Q4DdI1d46k2FXUFd0ukZCMif4CBiIXAqtSTiNJ5xlFYO2PfcjDjXi5doxzk0+vnOQ3jizQJVdOylzKxwqS9LSap/bHS6DNPXTrz/nS4= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1734489876; c=relaxed/simple; bh=00MZDtEQc8cudXFab639g8G6jcxKxY4ipxJHDtrbfBg=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=seaEiI0TISLqPG9vY3yxe0PrH8bwDSekqPWNZH8OEfcFcjBjpNkqenEWrpIIRj5rb4v6MkH0NX2mqaS7qbLjpAs2ZccBUfcUeNejFKc1CQ6DL74j8fcNIcD3jOUTwrMRR1/J7JPLhlB/l7Sp9wtQyKJHnkN8gxK1HFNlSUrt7V0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.alibaba.com; spf=pass smtp.mailfrom=linux.alibaba.com; dkim=pass (1024-bit key) header.d=linux.alibaba.com header.i=@linux.alibaba.com header.b=joWYY3x2; arc=none smtp.client-ip=115.124.30.112 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.alibaba.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.alibaba.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.alibaba.com header.i=@linux.alibaba.com header.b="joWYY3x2" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.alibaba.com; s=default; t=1734489870; h=From:To:Subject:Date:Message-ID:MIME-Version; bh=pj3X3jxKZ7nW5+LEb6ZD7OgevxazCO0NZ28sw3x9NS4=; 
b=joWYY3x2iiOt303l+34hRg+4GQ75F2O76WA2agm8MfLeUnhSuWMQj+rEuVAPOzfHl+KYT9QCLObgS0/bnpI/M3OkJwgQznpS0P1SKgkKDBiB0C4d2qfL9ruDsrvU7Aj7Ax4p/8oLrl5GcuycPshcqMRhF0EyhJ9jaG/xtUlbNqM= Received: from j66a10360.sqa.eu95.tbsite.net(mailfrom:alibuda@linux.alibaba.com fp:SMTPD_---0WLko6eC_1734489868 cluster:ay36) by smtp.aliyun-inc.com; Wed, 18 Dec 2024 10:44:28 +0800 From: "D. Wythe" To: kgraul@linux.ibm.com, wenjia@linux.ibm.com, jaka@linux.ibm.com, ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org, martin.lau@linux.dev, pabeni@redhat.com, song@kernel.org, sdf@google.com, haoluo@google.com, yhs@fb.com, edumazet@google.com, john.fastabend@gmail.com, kpsingh@kernel.org, jolsa@kernel.org, guwen@linux.alibaba.com Cc: kuba@kernel.org, davem@davemloft.net, netdev@vger.kernel.org, linux-s390@vger.kernel.org, linux-rdma@vger.kernel.org, bpf@vger.kernel.org Subject: [PATCH bpf-next v3 2/5] net/smc: Introduce generic hook smc_ops Date: Wed, 18 Dec 2024 10:44:18 +0800 Message-ID: <20241218024422.23423-3-alibuda@linux.alibaba.com> X-Mailer: git-send-email 2.45.0 In-Reply-To: <20241218024422.23423-1-alibuda@linux.alibaba.com> References: <20241218024422.23423-1-alibuda@linux.alibaba.com> Precedence: bulk X-Mailing-List: linux-rdma@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 The introduction of IPPROTO_SMC enables eBPF programs to determine whether to use SMC based on the context of socket creation, such as the network namespace, PID, and comm name. As a subsequent enhancement, introduce a new generic hook that allows the decision on whether to use SMC to be made at runtime, based on, but not limited to, local/remote IP addresses or ports. In the future, more complex extensions to the protocol stack can be achieved by extending this ops. Signed-off-by: D.
Wythe --- include/net/netns/smc.h | 3 ++ include/net/smc.h | 51 ++++++++++++++++++++++ net/ipv4/tcp_output.c | 15 +++++-- net/smc/Kconfig | 12 ++++++ net/smc/Makefile | 1 + net/smc/smc_ops.c | 51 ++++++++++++++++++++++ net/smc/smc_ops.h | 29 +++++++++++++ net/smc/smc_sysctl.c | 95 +++++++++++++++++++++++++++++++++++++++++ 8 files changed, 253 insertions(+), 4 deletions(-) create mode 100644 net/smc/smc_ops.c create mode 100644 net/smc/smc_ops.h diff --git a/include/net/netns/smc.h b/include/net/netns/smc.h index fc752a50f91b..59d069f56b2d 100644 --- a/include/net/netns/smc.h +++ b/include/net/netns/smc.h @@ -17,6 +17,9 @@ struct netns_smc { #ifdef CONFIG_SYSCTL struct ctl_table_header *smc_hdr; #endif +#if IS_ENABLED(CONFIG_SMC_OPS) + struct smc_ops __rcu *ops; +#endif /* CONFIG_SMC_OPS */ unsigned int sysctl_autocorking_size; unsigned int sysctl_smcr_buf_type; int sysctl_smcr_testlink_time; diff --git a/include/net/smc.h b/include/net/smc.h index db84e4e35080..25c762aa96fc 100644 --- a/include/net/smc.h +++ b/include/net/smc.h @@ -18,6 +18,8 @@ #include "linux/ism.h" struct sock; +struct tcp_sock; +struct inet_request_sock; #define SMC_MAX_PNETID_LEN 16 /* Max. length of PNET id */ @@ -97,4 +99,53 @@ struct smcd_dev { u8 going_away : 1; }; +#define SMC_OPS_NAME_MAX 16 + +enum { + /* ops can be inherited from init_net */ + SMC_OPS_FLAG_INHERITABLE = 0x1, + + SMC_OPS_ALL_FLAGS = SMC_OPS_FLAG_INHERITABLE, +}; + +struct smc_ops { + /* private */ + + struct list_head list; + struct module *owner; + + /* public */ + + /* unique name */ + char name[SMC_OPS_NAME_MAX]; + int flags; + + /* Invoked before computing SMC option for SYN packets. + * We can control whether to set SMC options by returning various values. + * Return 0 to disable SMC, or return any other value to enable it. + */ + int (*set_option)(struct tcp_sock *tp); + + /* Invoked before setting up SMC options for SYN-ACK packets. + * We can control whether to respond with SMC options by returning various values.
+ * Return 0 to disable SMC, or return any other value to enable it. + */ + int (*set_option_cond)(const struct tcp_sock *tp, struct inet_request_sock *ireq); +}; + +#if IS_ENABLED(CONFIG_SMC_OPS) +#define smc_call_retops(init_val, sk, func, ...) ({ \ + typeof(init_val) __ret = (init_val); \ + struct smc_ops *ops; \ + rcu_read_lock(); \ + ops = READ_ONCE(sock_net(sk)->smc.ops); \ + if (ops && ops->func) \ + __ret = ops->func(__VA_ARGS__); \ + rcu_read_unlock(); \ + __ret; \ +}) +#else +#define smc_call_retops(init_val, ...) (init_val) +#endif /* CONFIG_SMC_OPS */ + #endif /* _SMC_H */ diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 5485a70b5fe5..7b402167fb4d 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -48,6 +48,7 @@ #include #include +#include /* Refresh clocks of a TCP socket, * ensuring monotically increasing values. @@ -759,14 +760,17 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp, mptcp_options_write(th, ptr, tp, opts); } -static void smc_set_option(const struct tcp_sock *tp, +static void smc_set_option(struct tcp_sock *tp, struct tcp_out_options *opts, unsigned int *remaining) { #if IS_ENABLED(CONFIG_SMC) if (static_branch_unlikely(&tcp_have_smc)) { if (tp->syn_smc) { - if (*remaining >= TCPOLEN_EXP_SMC_BASE_ALIGNED) { + tp->syn_smc = smc_call_retops(1, &tp->inet_conn.icsk_inet.sk, + set_option, tp); + /* re-check syn_smc */ + if (tp->syn_smc && *remaining >= TCPOLEN_EXP_SMC_BASE_ALIGNED) { opts->options |= OPTION_SMC; *remaining -= TCPOLEN_EXP_SMC_BASE_ALIGNED; } @@ -776,14 +780,17 @@ static void smc_set_option(const struct tcp_sock *tp, } static void smc_set_option_cond(const struct tcp_sock *tp, - const struct inet_request_sock *ireq, + struct inet_request_sock *ireq, struct tcp_out_options *opts, unsigned int *remaining) { #if IS_ENABLED(CONFIG_SMC) if (static_branch_unlikely(&tcp_have_smc)) { if (tp->syn_smc && ireq->smc_ok) { - if (*remaining >= TCPOLEN_EXP_SMC_BASE_ALIGNED) { + ireq->smc_ok 
= smc_call_retops(1, &tp->inet_conn.icsk_inet.sk, + set_option_cond, tp, ireq); + /* re-check smc_ok */ + if (ireq->smc_ok && *remaining >= TCPOLEN_EXP_SMC_BASE_ALIGNED) { opts->options |= OPTION_SMC; *remaining -= TCPOLEN_EXP_SMC_BASE_ALIGNED; } diff --git a/net/smc/Kconfig b/net/smc/Kconfig index ba5e6a2dd2fd..0ee16ec8dceb 100644 --- a/net/smc/Kconfig +++ b/net/smc/Kconfig @@ -33,3 +33,15 @@ config SMC_LO of architecture or hardware. if unsure, say N. + +config SMC_OPS + bool "Generic hook for SMC subsystem" + depends on SMC && BPF_SYSCALL + default n + help + SMC_OPS enables support for registering generic hooks via eBPF programs + for the SMC subsystem. eBPF programs offer much greater flexibility + in modifying the behavior of the SMC protocol stack compared + to a complete kernel-based approach. + + if unsure, say N. diff --git a/net/smc/Makefile b/net/smc/Makefile index 60f1c87d5212..5dd706b2927a 100644 --- a/net/smc/Makefile +++ b/net/smc/Makefile @@ -7,3 +7,4 @@ smc-y += smc_cdc.o smc_tx.o smc_rx.o smc_close.o smc_ism.o smc_netlink.o smc_sta smc-y += smc_tracepoint.o smc_inet.o smc-$(CONFIG_SYSCTL) += smc_sysctl.o smc-$(CONFIG_SMC_LO) += smc_loopback.o +smc-$(CONFIG_SMC_OPS) += smc_ops.o \ No newline at end of file diff --git a/net/smc/smc_ops.c b/net/smc/smc_ops.c new file mode 100644 index 000000000000..0fc19cadd760 --- /dev/null +++ b/net/smc/smc_ops.c @@ -0,0 +1,51 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Shared Memory Communications over RDMA (SMC-R) and RoCE + * + * Generic hook for SMC subsystem. + * + * Copyright IBM Corp. 2016 + * Copyright (c) 2024, Alibaba Inc. + * + * Author: D.
Wythe + */ + +#include "smc_ops.h" + +static DEFINE_SPINLOCK(smc_ops_list_lock); +static LIST_HEAD(smc_ops_list); + +static int smc_ops_reg(struct smc_ops *ops) +{ + int ret = 0; + + spin_lock(&smc_ops_list_lock); + /* already exists or duplicate name */ + if (smc_ops_find_by_name(ops->name)) + ret = -EEXIST; + else + list_add_tail_rcu(&ops->list, &smc_ops_list); + spin_unlock(&smc_ops_list_lock); + return ret; +} + +static void smc_ops_unreg(struct smc_ops *ops) +{ + spin_lock(&smc_ops_list_lock); + list_del_rcu(&ops->list); + spin_unlock(&smc_ops_list_lock); + + /* Wait for all readers to complete */ + synchronize_rcu(); +} + +struct smc_ops *smc_ops_find_by_name(const char *name) +{ + struct smc_ops *ops; + + list_for_each_entry_rcu(ops, &smc_ops_list, list) { + if (strcmp(ops->name, name) == 0) + return ops; + } + return NULL; +} diff --git a/net/smc/smc_ops.h b/net/smc/smc_ops.h new file mode 100644 index 000000000000..214f4c99efd4 --- /dev/null +++ b/net/smc/smc_ops.h @@ -0,0 +1,29 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Shared Memory Communications over RDMA (SMC-R) and RoCE + * + * Generic hook for SMC subsystem. + * + * Copyright IBM Corp. 2016 + * Copyright (c) 2024, Alibaba Inc. + * + * Author: D. Wythe + */ + +#ifndef __SMC_OPS +#define __SMC_OPS + +#include + +#if IS_ENABLED(CONFIG_SMC_OPS) +/* Find ops by the target name, which is required to be a C string. + * Return NULL if no such ops was found; otherwise, return a valid ops. + * + * Note: Caller MUST ensure it's was invoked under rcu_read_lock.
+ */ +struct smc_ops *smc_ops_find_by_name(const char *name); +#else +static inline struct smc_ops *smc_ops_find_by_name(const char *name) { return NULL; } +#endif /* CONFIG_SMC_OPS*/ + +#endif /* __SMC_OPS */ diff --git a/net/smc/smc_sysctl.c b/net/smc/smc_sysctl.c index 2fab6456f765..2aa3f19025f4 100644 --- a/net/smc/smc_sysctl.c +++ b/net/smc/smc_sysctl.c @@ -18,6 +18,7 @@ #include "smc_core.h" #include "smc_llc.h" #include "smc_sysctl.h" +#include "smc_ops.h" static int min_sndbuf = SMC_BUF_MIN_SIZE; static int min_rcvbuf = SMC_BUF_MIN_SIZE; @@ -30,6 +31,70 @@ static int links_per_lgr_max = SMC_LINKS_ADD_LNK_MAX; static int conns_per_lgr_min = SMC_CONN_PER_LGR_MIN; static int conns_per_lgr_max = SMC_CONN_PER_LGR_MAX; +#if IS_ENABLED(CONFIG_SMC_OPS) +static int smc_net_replace_smc_ops(struct net *net, const char *name) +{ + struct smc_ops *ops = NULL; + + rcu_read_lock(); + /* a null or empty name asks to clear the current ops */ + if (name && name[0]) { + ops = smc_ops_find_by_name(name); + if (!ops) { + rcu_read_unlock(); + return -EINVAL; + } + /* no change, just return */ + if (ops == rcu_dereference(net->smc.ops)) { + rcu_read_unlock(); + return 0; + } + } + if (!ops || bpf_try_module_get(ops, ops->owner)) { + /* xchg */ + ops = rcu_replace_pointer(net->smc.ops, ops, true); + /* release old ops */ + if (ops) + bpf_module_put(ops, ops->owner); + } else if (ops) { + rcu_read_unlock(); + return -EBUSY; + } + rcu_read_unlock(); + return 0; +} + +static int proc_smc_ops(const struct ctl_table *ctl, int write, + void *buffer, size_t *lenp, loff_t *ppos) +{ + struct net *net = container_of(ctl->data, struct net, + smc.ops); + char val[SMC_OPS_NAME_MAX]; + struct ctl_table tbl = { + .data = val, + .maxlen = SMC_OPS_NAME_MAX, + }; + struct smc_ops *ops; + int ret; + + rcu_read_lock(); + ops = rcu_dereference(net->smc.ops); + if (ops) + memcpy(val, ops->name, sizeof(ops->name)); + else + val[0] = '\0'; + rcu_read_unlock(); + + ret = proc_dostring(&tbl, write, buffer, lenp,
ppos); + if (ret) + return ret; + + if (write) + ret = smc_net_replace_smc_ops(net, val); + return ret; +} +#endif /* CONFIG_SMC_OPS */ + static struct ctl_table smc_table[] = { { .procname = "autocorking_size", @@ -99,6 +164,15 @@ static struct ctl_table smc_table[] = { .extra1 = SYSCTL_ZERO, .extra2 = SYSCTL_ONE, }, +#if IS_ENABLED(CONFIG_SMC_OPS) + { + .procname = "ops", + .data = &init_net.smc.ops, + .mode = 0644, + .maxlen = SMC_OPS_NAME_MAX, + .proc_handler = proc_smc_ops, + }, +#endif /* CONFIG_SMC_OPS */ }; int __net_init smc_sysctl_net_init(struct net *net) @@ -109,6 +183,20 @@ int __net_init smc_sysctl_net_init(struct net *net) table = smc_table; if (!net_eq(net, &init_net)) { int i; +#if IS_ENABLED(CONFIG_SMC_OPS) + struct smc_ops *ops; + + rcu_read_lock(); + ops = rcu_dereference(init_net.smc.ops); + if (ops && ops->flags & SMC_OPS_FLAG_INHERITABLE) { + if (!bpf_try_module_get(ops, ops->owner)) { + rcu_read_unlock(); + return -EBUSY; + } + rcu_assign_pointer(net->smc.ops, ops); + } + rcu_read_unlock(); +#endif /* CONFIG_SMC_OPS */ table = kmemdup(table, sizeof(smc_table), GFP_KERNEL); if (!table) @@ -139,6 +227,9 @@ int __net_init smc_sysctl_net_init(struct net *net) if (!net_eq(net, &init_net)) kfree(table); err_alloc: +#if IS_ENABLED(CONFIG_SMC_OPS) + smc_net_replace_smc_ops(net, NULL); +#endif /* CONFIG_SMC_OPS */ return -ENOMEM; } @@ -148,6 +239,10 @@ void __net_exit smc_sysctl_net_exit(struct net *net) table = net->smc.smc_hdr->ctl_table_arg; unregister_net_sysctl_table(net->smc.smc_hdr); +#if IS_ENABLED(CONFIG_SMC_OPS) + smc_net_replace_smc_ops(net, NULL); +#endif /* CONFIG_SMC_OPS */ + if (!net_eq(net, &init_net)) kfree(table); } From patchwork Wed Dec 18 02:44:19 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "D. 
Wythe" X-Patchwork-Id: 13912958 Received: from out30-112.freemail.mail.aliyun.com (out30-112.freemail.mail.aliyun.com [115.124.30.112]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4073112A177; Wed, 18 Dec 2024 02:44:32 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=115.124.30.112 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1734489876; cv=none; b=Oejw+ilPH/ohgYbmFHd8rAtT4GljAuyIGA+LCeJ7/6gbHjzjTDUh8J559OnBXVJ/gjdIBhJb0/JWxwYpaVFFccc+r1d6SJpeqG5AEDEYjW5VSLYru9ITGITBrFFcZAtb7zte3RDYHU/teAbm4FvUhLubjgWfCd0VUmxKR+KFRdg= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1734489876; c=relaxed/simple; bh=57OmKpCVaV9+6+Gxh8YTWvBSSvQ36c0/+xPMzPoYskQ=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=Vnmu1fALTunVB/pdtcJ+KIfIeatPiTJlhDH55MIMmMJo8k9aBiQYynQVIQA5fbv9gzgaBK4sMGPCDlrvnmF67vy+fERstsLkwKm4zfRtwx9JxqHgV7F3NzIyEWfvMLClNM467tYmtjqXK2MKcHq7k66Y0PDjOuPNLfH3xWhZ/i0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.alibaba.com; spf=pass smtp.mailfrom=linux.alibaba.com; dkim=pass (1024-bit key) header.d=linux.alibaba.com header.i=@linux.alibaba.com header.b=CaQ7gdoG; arc=none smtp.client-ip=115.124.30.112 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.alibaba.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.alibaba.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.alibaba.com header.i=@linux.alibaba.com header.b="CaQ7gdoG" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.alibaba.com; s=default; t=1734489871; h=From:To:Subject:Date:Message-ID:MIME-Version; bh=bVfMYczw64bHkwD9YBUFT8bRGuZT+R6Ml3/2tolKU4c=; 
b=CaQ7gdoGKfMlQBdwfb1LvZKSzJAK29cEri9FDAYcR0La2ak+P7pOJYZL2M9a0fFDz+UFk663AMjcQeRgm4FkTYTPUUbdEd5wTMgjMH8Yw9TWzW1EQheXla3Z3MUwfPJj8vqXQRV60Z2+ehH+AUnPgVfYS+GYndIMpEQ/KoDDVJQ= Received: from j66a10360.sqa.eu95.tbsite.net(mailfrom:alibuda@linux.alibaba.com fp:SMTPD_---0WLko6eT_1734489869 cluster:ay36) by smtp.aliyun-inc.com; Wed, 18 Dec 2024 10:44:29 +0800 From: "D. Wythe" To: kgraul@linux.ibm.com, wenjia@linux.ibm.com, jaka@linux.ibm.com, ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org, martin.lau@linux.dev, pabeni@redhat.com, song@kernel.org, sdf@google.com, haoluo@google.com, yhs@fb.com, edumazet@google.com, john.fastabend@gmail.com, kpsingh@kernel.org, jolsa@kernel.org, guwen@linux.alibaba.com Cc: kuba@kernel.org, davem@davemloft.net, netdev@vger.kernel.org, linux-s390@vger.kernel.org, linux-rdma@vger.kernel.org, bpf@vger.kernel.org Subject: [PATCH bpf-next v3 3/5] net/smc: bpf: register smc_ops info struct_ops Date: Wed, 18 Dec 2024 10:44:19 +0800 Message-ID: <20241218024422.23423-4-alibuda@linux.alibaba.com> X-Mailer: git-send-email 2.45.0 In-Reply-To: <20241218024422.23423-1-alibuda@linux.alibaba.com> References: <20241218024422.23423-1-alibuda@linux.alibaba.com> Precedence: bulk X-Mailing-List: linux-rdma@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Implement injection capability for SMC via struct_ops, so that users can provide their own smc_ops to modify the behavior of the SMC stack. Currently, users can write their own implementation to choose whether to use SMC before the TCP three-way handshake is completed. In the future, users can implement more complex functions on SMC by extending it. Signed-off-by: D.
Wythe --- net/smc/af_smc.c | 10 +++++ net/smc/smc_ops.c | 99 +++++++++++++++++++++++++++++++++++++++++++++++ net/smc/smc_ops.h | 2 + 3 files changed, 111 insertions(+) diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c index 9d76e902fd77..6adedae2986d 100644 --- a/net/smc/af_smc.c +++ b/net/smc/af_smc.c @@ -55,6 +55,7 @@ #include "smc_sysctl.h" #include "smc_loopback.h" #include "smc_inet.h" +#include "smc_ops.h" static DEFINE_MUTEX(smc_server_lgr_pending); /* serialize link group * creation on server @@ -3576,8 +3577,17 @@ static int __init smc_init(void) pr_err("%s: smc_inet_init fails with %d\n", __func__, rc); goto out_ulp; } + + rc = smc_bpf_struct_ops_init(); + if (rc) { + pr_err("%s: smc_bpf_struct_ops_init fails with %d\n", __func__, rc); + goto out_inet; + } + static_branch_enable(&tcp_have_smc); return 0; +out_inet: + smc_inet_exit(); out_ulp: tcp_unregister_ulp(&smc_ulp_ops); out_lo: diff --git a/net/smc/smc_ops.c b/net/smc/smc_ops.c index 0fc19cadd760..0f07652f4837 100644 --- a/net/smc/smc_ops.c +++ b/net/smc/smc_ops.c @@ -10,6 +10,10 @@ * Author: D. 
Wythe */ +#include +#include +#include + #include "smc_ops.h" static DEFINE_SPINLOCK(smc_ops_list_lock); @@ -49,3 +53,98 @@ struct smc_ops *smc_ops_find_by_name(const char *name) } return NULL; } + +static int __bpf_smc_stub_set_tcp_option(struct tcp_sock *tp) { return 1; } +static int __bpf_smc_stub_set_tcp_option_cond(const struct tcp_sock *tp, + struct inet_request_sock *ireq) +{ + return 1; +} + +static struct smc_ops __bpf_smc_bpf_ops = { + .set_option = __bpf_smc_stub_set_tcp_option, + .set_option_cond = __bpf_smc_stub_set_tcp_option_cond, +}; + +static int smc_bpf_ops_init(struct btf *btf) { return 0; } + +static int smc_bpf_ops_reg(void *kdata, struct bpf_link *link) +{ + return smc_ops_reg(kdata); +} + +static void smc_bpf_ops_unreg(void *kdata, struct bpf_link *link) +{ + smc_ops_unreg(kdata); +} + +static int smc_bpf_ops_init_member(const struct btf_type *t, + const struct btf_member *member, + void *kdata, const void *udata) +{ + const struct smc_ops *u_ops; + struct smc_ops *k_ops; + u32 moff; + + u_ops = (const struct smc_ops *)udata; + k_ops = (struct smc_ops *)kdata; + + moff = __btf_member_bit_offset(t, member) / 8; + switch (moff) { + case offsetof(struct smc_ops, name): + if (bpf_obj_name_cpy(k_ops->name, u_ops->name, + sizeof(u_ops->name)) <= 0) + return -EINVAL; + return 1; + case offsetof(struct smc_ops, flags): + if (u_ops->flags & ~SMC_OPS_ALL_FLAGS) + return -EINVAL; + k_ops->flags = u_ops->flags; + return 1; + default: + break; + } + + return 0; +} + +static int smc_bpf_ops_check_member(const struct btf_type *t, + const struct btf_member *member, + const struct bpf_prog *prog) +{ + u32 moff = __btf_member_bit_offset(t, member) / 8; + + switch (moff) { + case offsetof(struct smc_ops, name): + case offsetof(struct smc_ops, flags): + case offsetof(struct smc_ops, set_option): + case offsetof(struct smc_ops, set_option_cond): + break; + default: + return -EINVAL; + } + + return 0; +} + +static const struct bpf_verifier_ops smc_bpf_verifier_ops 
= { + .get_func_proto = bpf_base_func_proto, + .is_valid_access = bpf_tracing_btf_ctx_access, +}; + +static struct bpf_struct_ops bpf_smc_bpf_ops = { + .name = "smc_ops", + .init = smc_bpf_ops_init, + .reg = smc_bpf_ops_reg, + .unreg = smc_bpf_ops_unreg, + .cfi_stubs = &__bpf_smc_bpf_ops, + .verifier_ops = &smc_bpf_verifier_ops, + .init_member = smc_bpf_ops_init_member, + .check_member = smc_bpf_ops_check_member, + .owner = THIS_MODULE, +}; + +int smc_bpf_struct_ops_init(void) +{ + return register_bpf_struct_ops(&bpf_smc_bpf_ops, smc_ops); +} diff --git a/net/smc/smc_ops.h b/net/smc/smc_ops.h index 214f4c99efd4..f4e50eae13f6 100644 --- a/net/smc/smc_ops.h +++ b/net/smc/smc_ops.h @@ -22,8 +22,10 @@ * Note: Caller MUST ensure it's was invoked under rcu_read_lock. */ struct smc_ops *smc_ops_find_by_name(const char *name); +int smc_bpf_struct_ops_init(void); #else static inline struct smc_ops *smc_ops_find_by_name(const char *name) { return NULL; } +static inline int smc_bpf_struct_ops_init(void) { return 0; } #endif /* CONFIG_SMC_OPS*/ #endif /* __SMC_OPS */ From patchwork Wed Dec 18 02:44:20 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "D. 
Wythe" X-Patchwork-Id: 13912959 Received: from out30-133.freemail.mail.aliyun.com (out30-133.freemail.mail.aliyun.com [115.124.30.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3C446130A73; Wed, 18 Dec 2024 02:44:34 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=115.124.30.133 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1734489878; cv=none; b=BD9fS04B/8rHGBiET6gQ+9IHCwU7yy35FtGmYHT61bIAJMqB+s3EcwczrDtykwo+NN+D/KM1XnVA2ZcB7BWEoKKkq/msauKAJhpuzUbcOGbvulKmBTlq4QJS8GFMUfxLT8Q3G7B84w8/3FaX2tfaPMQSsMhhn65tpbb5388D84s= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1734489878; c=relaxed/simple; bh=mJZToL78iwJ+MCIKU4dt4auK5Nd4QsNRcDUFMLBHKB0=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=knlm0YxCWZXx2JpNK8slAWWcFoZUkL0aJ+9JomB7lzpL1DFMyiGPHERGspv6yPBmplBQuWHDY3sPuEr3gLfvpcUcN5VVPpxIPidghRhDS9e9bWGJC14Hh08ZJkzukfpUcAJE9s84dobqpO+9yzOQOpvOi58Y2ppgn8hVxvW8zOg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.alibaba.com; spf=pass smtp.mailfrom=linux.alibaba.com; dkim=pass (1024-bit key) header.d=linux.alibaba.com header.i=@linux.alibaba.com header.b=uIW0khIc; arc=none smtp.client-ip=115.124.30.133 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.alibaba.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.alibaba.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.alibaba.com header.i=@linux.alibaba.com header.b="uIW0khIc" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.alibaba.com; s=default; t=1734489872; h=From:To:Subject:Date:Message-ID:MIME-Version; bh=y1jg5AwJB+GR83pd0hp8N7m53aNa6Qo3Q03Y+OmDbzc=; 
b=uIW0khIczUWE/N3efbqkpXCKSwMkoXzLXB9z5qrmZk7ikm2kS/uuR3rJXmifDZ9h6Vp71yRSbPilRpy/bFwCMh4GwN/Ig3MXYyACT6R0CpK2LmJCbD8qteBUgXeN4fzuL4GabO1QXCkytsILM1VgGq+yBVulhx6sVupHZeMFTAw= Received: from j66a10360.sqa.eu95.tbsite.net(mailfrom:alibuda@linux.alibaba.com fp:SMTPD_---0WLko6em_1734489869 cluster:ay36) by smtp.aliyun-inc.com; Wed, 18 Dec 2024 10:44:30 +0800 From: "D. Wythe" To: kgraul@linux.ibm.com, wenjia@linux.ibm.com, jaka@linux.ibm.com, ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org, martin.lau@linux.dev, pabeni@redhat.com, song@kernel.org, sdf@google.com, haoluo@google.com, yhs@fb.com, edumazet@google.com, john.fastabend@gmail.com, kpsingh@kernel.org, jolsa@kernel.org, guwen@linux.alibaba.com Cc: kuba@kernel.org, davem@davemloft.net, netdev@vger.kernel.org, linux-s390@vger.kernel.org, linux-rdma@vger.kernel.org, bpf@vger.kernel.org Subject: [PATCH bpf-next v3 4/5] libbpf: fix error when st-prefix_ops and ops from different btf Date: Wed, 18 Dec 2024 10:44:20 +0800 Message-ID: <20241218024422.23423-5-alibuda@linux.alibaba.com> X-Mailer: git-send-email 2.45.0 In-Reply-To: <20241218024422.23423-1-alibuda@linux.alibaba.com> References: <20241218024422.23423-1-alibuda@linux.alibaba.com> Precedence: bulk X-Mailing-List: linux-rdma@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 When a struct_ops named xxx_ops is registered by a module but is used both in built-in modules and in the module itself, the btf_type of xxx_ops is present in btf_vmlinux rather than in btf_mod. As a result, the btf_type of bpf_struct_ops_xxx_ops and the btf_type of xxx_ops are not in the same btf.
Here are four possible cases:

+--------+-------------+-------------+-------------------------------------+
|        | st_ops_xxx  | xxx         |                                     |
+--------+-------------+-------------+-------------------------------------+
| case 0 | btf_vmlinux | btf_vmlinux | registered and used only in vmlinux |
+--------+-------------+-------------+-------------------------------------+
| case 1 | btf_vmlinux | btf_mod     | INVALID                             |
+--------+-------------+-------------+-------------------------------------+
| case 2 | btf_mod     | btf_vmlinux | registered in mod, used in both     |
|        |             |             | vmlinux and mod                     |
+--------+-------------+-------------+-------------------------------------+
| case 3 | btf_mod     | btf_mod     | registered and used only in mod     |
+--------+-------------+-------------+-------------------------------------+

At present, cases 0, 1, and 3 are identified correctly, because
st_ops_xxx is searched for in the same btf as xxx. To handle case 2
correctly without affecting the other cases, we cannot simply change
the search method for st_ops_xxx from find_btf_by_prefix_kind() to
find_ksym_btf_id(), because case 1 would then no longer be recognized.

To address this, if st_ops_xxx cannot be found in the btf that contains
xxx and mod_btf does not exist, search again with find_ksym_btf_id().

Fixes: 590a00888250 ("bpf: libbpf: Add STRUCT_OPS support")
Signed-off-by: D.
Wythe --- tools/lib/bpf/libbpf.c | 25 +++++++++++++++++-------- 1 file changed, 17 insertions(+), 8 deletions(-) diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 66173ddb5a2d..56bf74894110 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -1005,7 +1005,8 @@ find_struct_ops_kern_types(struct bpf_object *obj, const char *tname_raw, const struct btf_member *kern_data_member; struct btf *btf = NULL; __s32 kern_vtype_id, kern_type_id; - char tname[256]; + char tname[256], stname[256]; + int ret; __u32 i; snprintf(tname, sizeof(tname), "%.*s", @@ -1020,17 +1021,25 @@ find_struct_ops_kern_types(struct bpf_object *obj, const char *tname_raw, } kern_type = btf__type_by_id(btf, kern_type_id); + ret = snprintf(stname, sizeof(stname), "%s%s", STRUCT_OPS_VALUE_PREFIX, tname); + if (ret < 0 || ret >= sizeof(stname)) + return -ENAMETOOLONG; + /* Find the corresponding "map_value" type that will be used * in map_update(BPF_MAP_TYPE_STRUCT_OPS). For example, * find "struct bpf_struct_ops_tcp_congestion_ops" from the * btf_vmlinux. 
*/ - kern_vtype_id = find_btf_by_prefix_kind(btf, STRUCT_OPS_VALUE_PREFIX, - tname, BTF_KIND_STRUCT); + kern_vtype_id = btf__find_by_name_kind(btf, stname, BTF_KIND_STRUCT); if (kern_vtype_id < 0) { - pr_warn("struct_ops init_kern: struct %s%s is not found in kernel BTF\n", - STRUCT_OPS_VALUE_PREFIX, tname); - return kern_vtype_id; + if (kern_vtype_id == -ENOENT && !*mod_btf) + kern_vtype_id = find_ksym_btf_id(obj, stname, BTF_KIND_STRUCT, + &btf, mod_btf); + if (kern_vtype_id < 0) { + pr_warn("struct_ops init_kern: struct %s is not found in kernel BTF\n", + stname); + return kern_vtype_id; + } } kern_vtype = btf__type_by_id(btf, kern_vtype_id); @@ -1046,8 +1055,8 @@ find_struct_ops_kern_types(struct bpf_object *obj, const char *tname_raw, break; } if (i == btf_vlen(kern_vtype)) { - pr_warn("struct_ops init_kern: struct %s data is not found in struct %s%s\n", - tname, STRUCT_OPS_VALUE_PREFIX, tname); + pr_warn("struct_ops init_kern: struct %s data is not found in struct %s\n", + tname, stname); return -EINVAL; } From patchwork Wed Dec 18 02:44:21 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: "D. 
Wythe" X-Patchwork-Id: 13912961 Received: from out30-131.freemail.mail.aliyun.com (out30-131.freemail.mail.aliyun.com [115.124.30.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id EB80214D717; Wed, 18 Dec 2024 02:44:39 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=115.124.30.131 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1734489883; cv=none; b=YebyKfU7nMbpYXZWpn7ix8p76lBEgFakaRSSIjW+NOcrbSjkVzBcROP46qLYHSp+09hFRJyVX/KSKZBjf9Zm4XHuhttpZimDPLlV4CPEIR/ugGWNsgxL5MIi6Jeb/LcvSfij7aTooP3vhkQqRmL2uE02zxWB2GQylmwbar9/0/8= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1734489883; c=relaxed/simple; bh=CIWmWnOSDMoaSzgUkc26VEJVykuPzyPnA3+oGc1scqw=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=g0OwTtwy0fuC4KShpjL2tbL1AmESGvhtLYdJTXAY9DfoZzo9d/lYRtoFlMh8uj/uTQLkxhVfWqgky4EaTvvXR9ivhmc8wI0QPfTWgSa+0Y4HLwptbDc8H7AoVtnrFm0FOY01y6yAlQewjdPE+aRITSs+4t+xMtDd3HcTxfrShQQ= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.alibaba.com; spf=pass smtp.mailfrom=linux.alibaba.com; dkim=pass (1024-bit key) header.d=linux.alibaba.com header.i=@linux.alibaba.com header.b=gP26JJAb; arc=none smtp.client-ip=115.124.30.131 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.alibaba.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.alibaba.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.alibaba.com header.i=@linux.alibaba.com header.b="gP26JJAb" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.alibaba.com; s=default; t=1734489872; h=From:To:Subject:Date:Message-ID:MIME-Version:Content-Type; bh=txdurCVMxoFIxt/gwAK72ldZ4HK8Prlb2xtsGBFWgWo=; 
b=gP26JJAbcEObgsk5NGPtckST6xnI4OaEY1N1ZJKc1VUQFAVR2vxQ0wzlcIo+psIJVwqvRIk1vWwuVyBA2k/FI0P582g3qUMowo7GqSXp/wevVD22QmCqg1FHGHwNVdZlUltNd/OAJSAAgkvvYg2eJZ+wdBZh7q4BMJmYhBMQ4gs= Received: from j66a10360.sqa.eu95.tbsite.net(mailfrom:alibuda@linux.alibaba.com fp:SMTPD_---0WLko6eu_1734489870 cluster:ay36) by smtp.aliyun-inc.com; Wed, 18 Dec 2024 10:44:30 +0800 From: "D. Wythe" To: kgraul@linux.ibm.com, wenjia@linux.ibm.com, jaka@linux.ibm.com, ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org, martin.lau@linux.dev, pabeni@redhat.com, song@kernel.org, sdf@google.com, haoluo@google.com, yhs@fb.com, edumazet@google.com, john.fastabend@gmail.com, kpsingh@kernel.org, jolsa@kernel.org, guwen@linux.alibaba.com Cc: kuba@kernel.org, davem@davemloft.net, netdev@vger.kernel.org, linux-s390@vger.kernel.org, linux-rdma@vger.kernel.org, bpf@vger.kernel.org Subject: [PATCH bpf-next v3 5/5] bpf/selftests: add selftest for bpf_smc_ops Date: Wed, 18 Dec 2024 10:44:21 +0800 Message-ID: <20241218024422.23423-6-alibuda@linux.alibaba.com> X-Mailer: git-send-email 2.45.0 In-Reply-To: <20241218024422.23423-1-alibuda@linux.alibaba.com> References: <20241218024422.23423-1-alibuda@linux.alibaba.com> Precedence: bulk X-Mailing-List: linux-rdma@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0

This test introduces a tiny smc_ops that filters SMC connections based on IP pairs, and adds a realistic topology model to verify it. Because only SMC loopback can be used in the CI environment, an additional config option needs to be enabled.

Follow the steps below to run this test.

make -C tools/testing/selftests/bpf
cd tools/testing/selftests/bpf
sudo ./test_progs -t smc

The result shows:

Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: D. 
Wythe --- tools/testing/selftests/bpf/config | 4 + .../selftests/bpf/prog_tests/test_bpf_smc.c | 535 ++++++++++++++++++ tools/testing/selftests/bpf/progs/bpf_smc.c | 109 ++++ 3 files changed, 648 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/test_bpf_smc.c create mode 100644 tools/testing/selftests/bpf/progs/bpf_smc.c diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config index c378d5d07e02..79248eb04ec4 100644 --- a/tools/testing/selftests/bpf/config +++ b/tools/testing/selftests/bpf/config @@ -113,3 +113,7 @@ CONFIG_XDP_SOCKETS=y CONFIG_XFRM_INTERFACE=y CONFIG_TCP_CONG_DCTCP=y CONFIG_TCP_CONG_BBR=y +CONFIG_INFINIBAND=m +CONFIG_SMC=m +CONFIG_SMC_OPS=y +CONFIG_SMC_LO=y \ No newline at end of file diff --git a/tools/testing/selftests/bpf/prog_tests/test_bpf_smc.c b/tools/testing/selftests/bpf/prog_tests/test_bpf_smc.c new file mode 100644 index 000000000000..4b7c43badc06 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/test_bpf_smc.c @@ -0,0 +1,535 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include "bpf_smc.skel.h" + +#ifndef IPPROTO_SMC +#define IPPROTO_SMC 256 +#endif + +#define CLIENT_IP "127.0.0.1" +#define SERVER_IP "127.0.1.0" +#define SERVER_IP_VIA_RISK_PATH "127.0.2.0" + +#define SERVICE_1 11234 +#define SERVICE_2 22345 +#define SERVICE_3 33456 + +enum { + SMC_NETLINK_ADD_UEID = 10, + SMC_NETLINK_REMOVE_UEID +}; + +enum { + SMC_NLA_EID_TABLE_UNSPEC, + SMC_NLA_EID_TABLE_ENTRY, /* string */ +}; + +struct smc_strat_ip_key { + __u32 sip; + __u32 dip; +}; + +struct smc_strat_ip_value { + __u8 mode; +}; + +struct msgtemplate { + struct nlmsghdr n; + struct genlmsghdr g; + char buf[1024]; +}; + +#define GENLMSG_DATA(glh) ((void *)(NLMSG_DATA(glh) + GENL_HDRLEN)) +#define GENLMSG_PAYLOAD(glh) (NLMSG_PAYLOAD(glh, 0) - GENL_HDRLEN) +#define NLA_DATA(na) ((void *)((char *)(na) + NLA_HDRLEN)) +#define NLA_PAYLOAD(len) ((len) - NLA_HDRLEN) + +#define MAX_MSG_SIZE 1024 + +#define 
SMC_GENL_FAMILY_NAME "SMC_GEN_NETLINK" +#define SMC_BPFTEST_UEID "SMC-BPFTEST-UEID" + +static uint16_t smc_nl_family_id = -1; +static bool running = true; + +static int send_cmd(int fd, __u16 nlmsg_type, __u32 nlmsg_pid, __u16 nlmsg_flags, + __u8 genl_cmd, __u16 nla_type, + void *nla_data, int nla_len) +{ + struct nlattr *na; + struct sockaddr_nl nladdr; + int r, buflen; + char *buf; + + struct msgtemplate msg = {0}; + + msg.n.nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN); + msg.n.nlmsg_type = nlmsg_type; + msg.n.nlmsg_flags = nlmsg_flags; + msg.n.nlmsg_seq = 0; + msg.n.nlmsg_pid = nlmsg_pid; + msg.g.cmd = genl_cmd; + msg.g.version = 1; + na = (struct nlattr *) GENLMSG_DATA(&msg); + na->nla_type = nla_type; + na->nla_len = nla_len + 1 + NLA_HDRLEN; + memcpy(NLA_DATA(na), nla_data, nla_len); + msg.n.nlmsg_len += NLMSG_ALIGN(na->nla_len); + + buf = (char *) &msg; + buflen = msg.n.nlmsg_len; + memset(&nladdr, 0, sizeof(nladdr)); + nladdr.nl_family = AF_NETLINK; + + while ((r = sendto(fd, buf, buflen, 0, (struct sockaddr *) &nladdr, + sizeof(nladdr))) < buflen) { + if (r > 0) { + buf += r; + buflen -= r; + } else if (errno != EAGAIN) + return -1; + } + return 0; +} + +static bool load_smc_module(void) +{ + int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_SMC); + + if (!ASSERT_GE(fd, 0, "create ipproto_smc")) + return false; + close(fd); + return true; +} + +static bool create_netns(void) +{ + if (!ASSERT_OK(unshare(CLONE_NEWNET), "create netns")) + return false; + + if (!ASSERT_OK(system("ip addr add 127.0.1.0/8 dev lo"), "add server node")) + return false; + + if (!ASSERT_OK(system("ip addr add 127.0.2.0/8 dev lo"), "server via risk path")) + return false; + + if (!ASSERT_OK(system("ip link set dev lo up"), "bring up lo")) + return false; + + return true; +} + +static bool get_smc_nl_family_id(void) +{ + struct sockaddr_nl nl_src; + struct msgtemplate msg; + struct nlattr *nl; + int fd, ret; + pid_t pid; + + fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC); + if 
(!ASSERT_GT(fd, 0, "nl_family socket")) + return false; + + pid = getpid(); + + memset(&nl_src, 0, sizeof(nl_src)); + nl_src.nl_family = AF_NETLINK; + nl_src.nl_pid = pid; + + ret = bind(fd, (struct sockaddr *) &nl_src, sizeof(nl_src)); + if (!ASSERT_GE(ret, 0, "nl_family bind")) + goto fail; + + ret = send_cmd(fd, GENL_ID_CTRL, pid, + NLM_F_REQUEST, CTRL_CMD_GETFAMILY, + CTRL_ATTR_FAMILY_NAME, (void *)SMC_GENL_FAMILY_NAME, + strlen(SMC_GENL_FAMILY_NAME)); + if (!ASSERT_EQ(ret, 0, "nl_family query")) + goto fail; + + ret = recv(fd, &msg, sizeof(msg), 0); + if (!ASSERT_FALSE(msg.n.nlmsg_type == NLMSG_ERROR || (ret < 0) || + !NLMSG_OK((&msg.n), ret), "nl_family response")) + goto fail; + + nl = (struct nlattr *) GENLMSG_DATA(&msg); + nl = (struct nlattr *) ((char *) nl + NLA_ALIGN(nl->nla_len)); + if (!ASSERT_EQ(nl->nla_type, CTRL_ATTR_FAMILY_ID, "nl_family nla type")) + goto fail; + + smc_nl_family_id = *(uint16_t *) NLA_DATA(nl); + close(fd); + return true; +fail: + close(fd); + return false; +} + +static bool smc_ueid(int op) +{ + struct sockaddr_nl nl_src; + struct msgtemplate msg; + struct nlmsgerr *err; + char test_ueid[32]; + int fd, ret; + pid_t pid; + + /* UEID required */ + memset(test_ueid, '\x20', sizeof(test_ueid)); + memcpy(test_ueid, SMC_BPFTEST_UEID, strlen(SMC_BPFTEST_UEID)); + fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC); + if (!ASSERT_GT(fd, 0, "ueid socket")) + return false; + + pid = getpid(); + memset(&nl_src, 0, sizeof(nl_src)); + nl_src.nl_family = AF_NETLINK; + nl_src.nl_pid = pid; + + ret = bind(fd, (struct sockaddr *) &nl_src, sizeof(nl_src)); + if (!ASSERT_GE(ret, 0, "ueid bind")) + goto fail; + + ret = send_cmd(fd, smc_nl_family_id, pid, + NLM_F_REQUEST | NLM_F_ACK, op, SMC_NLA_EID_TABLE_ENTRY, + (void *)test_ueid, sizeof(test_ueid)); + if (!ASSERT_EQ(ret, 0, "ueid cmd")) + goto fail; + + ret = recv(fd, &msg, sizeof(msg), 0); + if (!ASSERT_FALSE((ret < 0) || !NLMSG_OK((&msg.n), ret), "ueid response")) + goto fail; + + if 
(msg.n.nlmsg_type == NLMSG_ERROR) { + err = NLMSG_DATA(&msg); + switch (op) { + case SMC_NETLINK_REMOVE_UEID: + if (!ASSERT_FALSE((err->error && err->error != -ENOENT), "ueid remove")) + goto fail; + break; + case SMC_NETLINK_ADD_UEID: + if (!ASSERT_EQ(err->error, 0, "ueid add")) + goto fail; + break; + default: + break; + } + } + close(fd); + return true; +fail: + close(fd); + return false; +} + +static bool setup_smc(void) +{ + /* the smc module must be loaded */ + if (!load_smc_module()) + return false; + + /* set up a new netns to avoid affecting other tests */ + if (!create_netns()) + return false; + + /* get smc nl id */ + if (!get_smc_nl_family_id()) + return false; + + /* clear and add ueid for bpftest */ + (void) smc_ueid(SMC_NETLINK_REMOVE_UEID); + /* smc-loopback requires a ueid */ + if (!smc_ueid(SMC_NETLINK_ADD_UEID)) + return false; + + return true; +} + +static void cleanup_smc(void) +{ + (void) smc_ueid(SMC_NETLINK_REMOVE_UEID); +} + +static pthread_t create_service(const char *ip, int port, void *(*handler) (void *)) +{ + struct sockaddr_in servaddr; + pthread_t th; + int server, rc; + + server = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP); + if (server < 0) + return (pthread_t)0; + + servaddr.sin_family = AF_INET; + servaddr.sin_port = htons(port); + servaddr.sin_addr.s_addr = inet_addr(ip); + + rc = bind(server, (struct sockaddr *)&servaddr, sizeof(servaddr)); + if (!ASSERT_EQ(rc, 0, "server bind")) + goto fail; + + rc = listen(server, 1024); + if (!ASSERT_EQ(rc, 0, "server listen")) + goto fail; + + rc = pthread_create(&th, NULL, handler, (void *)(intptr_t)server); + if (!ASSERT_EQ(rc, 0, "pthread_create")) + goto fail; + + return th; +fail: + close(server); + return (pthread_t)0; +} + +static bool set_sock_timeout(int fd, int timeout_sec) +{ + struct timeval timeout = { .tv_sec = timeout_sec, }; + int rc; + + rc = setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &timeout, sizeof(timeout)); + if (rc != 0) + return false; + + rc = setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, 
&timeout, sizeof(timeout)); + if (rc != 0) + return false; + + return true; +} + +static void req_once(const char *local, const char *remote, int port) +{ + struct sockaddr_in localaddr, servaddr; + int client, rc, dummy = 0; + + client = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP); + if (client < 0) + return; + + /* 1 sec timeout for rcv and snd(connect) */ + if (!ASSERT_TRUE(set_sock_timeout(client, 1), "client sockopt")) + goto fail; + + localaddr.sin_family = AF_INET; + localaddr.sin_port = htons(0); + localaddr.sin_addr.s_addr = inet_addr(local); + + rc = bind(client, (struct sockaddr *)&localaddr, sizeof(localaddr)); + if (!ASSERT_EQ(rc, 0, "client bind")) + goto fail; + + servaddr.sin_family = AF_INET; + servaddr.sin_port = htons(port); + servaddr.sin_addr.s_addr = inet_addr(remote); + + rc = connect(client, (struct sockaddr *)&servaddr, sizeof(servaddr)); + if (!ASSERT_EQ(rc, 0, "client connect")) + goto fail; + + rc = send(client, &dummy, sizeof(dummy), 0); + if (!ASSERT_EQ(rc, sizeof(dummy), "client query")) + goto fail; + + rc = recv(client, &dummy, sizeof(dummy), 0); + if (!ASSERT_EQ(rc, sizeof(dummy), "client response")) + goto fail; + + close(client); + return; +fail: + close(client); +} + +static void *service1(void *ctx) +{ + int fd = (int)(intptr_t)ctx; + int cli, rc, dummy; + + /* 1 sec for accept timeout */ + if (!set_sock_timeout(fd, 1)) + goto finish; + + while (running) { + cli = accept(fd, NULL, NULL); + if (cli < 0) + continue; + + if (!set_sock_timeout(cli, 1)) + goto skip; + + rc = recv(cli, &dummy, sizeof(dummy), 0); + if (rc != sizeof(dummy)) + goto skip; + + /* service1 sends a request to service2 */ + req_once(SERVER_IP, SERVER_IP, SERVICE_2); + + /* then echo dummy back to cli */ + rc = send(cli, &dummy, sizeof(dummy), 0); + if (rc != sizeof(dummy)) + goto skip; +skip: + close(cli); + } +finish: + close(fd); + return NULL; +} + +static void *service2(void *ctx) +{ + int fd = (int)(intptr_t)ctx; + int cli, rc, dummy; + + /* 1 sec for accept timeout */ + if (!set_sock_timeout(fd, 
1)) + goto finish; + + while (running) { + cli = accept(fd, NULL, NULL); + if (cli < 0) + continue; + + if (!set_sock_timeout(cli, 1)) + goto skip; + + rc = recv(cli, &dummy, sizeof(dummy), 0); + if (rc != sizeof(dummy)) + goto skip; + + /* then echo dummy back to cli */ + rc = send(cli, &dummy, sizeof(dummy), 0); + if (rc != sizeof(dummy)) + goto skip; +skip: + close(cli); + } +finish: + close(fd); + return NULL; +} + +static void *service3(void *ctx) +{ + return service2(ctx); +} + +static void block_link(int map_fd, const char *src, const char *dst) +{ + struct smc_strat_ip_value val = { .mode = /* block */ 0 }; + struct smc_strat_ip_key key = { + .sip = inet_addr(src), + .dip = inet_addr(dst), + }; + + bpf_map_update_elem(map_fd, &key, &val, BPF_ANY); +} + +/* + * This test describes a real-life service topology as follows: + * + * +-------------> service_1 + * link1 | | + * +--------------------> server | link 2 + * | | V + * | +-------------> service_2 + * | link 3 + * client -------------------> server_via_unsafe_path -> service_3 + * + * Among them, + * 1. link-1 is very suitable for using SMC. + * 2. link-2 is not suitable for using SMC, because the mode of this link is kind of + * short-link services. + * 3. link-3 is also not suitable for using SMC, because the RDMA link is unavailable and + * needs to go through a long timeout before it can fallback to TCP. + * + * To achieve this goal, we use a customized SMC ip strategy via smc_ops. 
+ */ +static void test_topo(void) +{ + pthread_t service_1, service_2, service_3; + struct bpf_smc *skel; + int rc, map_fd; + + skel = bpf_smc__open_and_load(); + if (!ASSERT_OK_PTR(skel, "bpf_smc__open_and_load")) + return; + + rc = bpf_smc__attach(skel); + if (!ASSERT_EQ(rc, 0, "bpf_smc__attach")) + goto fail; + + map_fd = bpf_map__fd(skel->maps.smc_strats_ip); + if (!ASSERT_GT(map_fd, 0, "bpf_map__fd")) + goto fail; + + /* Mock the process of transparent replacement, since we will modify protocol + * to ipproto_smc according to it via fmod_ret/update_socket_protocol. + */ + system("sysctl -w net.smc.ops=linkcheck"); + + /* Configure ip strat */ + block_link(map_fd, CLIENT_IP, SERVER_IP_VIA_RISK_PATH); + block_link(map_fd, SERVER_IP, SERVER_IP); + close(map_fd); + + /* Load service */ + service_1 = create_service(SERVER_IP, SERVICE_1, service1); + if (!ASSERT_NEQ(service_1, (pthread_t)0, "create service_1")) + goto fail; + + service_2 = create_service(SERVER_IP, SERVICE_2, service2); + if (!ASSERT_NEQ(service_2, (pthread_t)0, "create service_2")) { + running = false; + goto fail_service2; + } + + service_3 = create_service(SERVER_IP_VIA_RISK_PATH, SERVICE_3, service3); + if (!ASSERT_NEQ(service_3, (pthread_t)0, "create service_3")) { + running = false; + goto fail_service3; + } + + /* Run client */ + req_once(CLIENT_IP, SERVER_IP, SERVICE_1); + + ASSERT_EQ(skel->bss->smc_cnt, 2, "smc count"); + ASSERT_EQ(skel->bss->fallback_cnt, 1, "fallback count"); + + req_once(CLIENT_IP, SERVER_IP, SERVICE_2); + + ASSERT_EQ(skel->bss->smc_cnt, 3, "smc count"); + ASSERT_EQ(skel->bss->fallback_cnt, 1, "fallback count"); + + req_once(CLIENT_IP, SERVER_IP_VIA_RISK_PATH, SERVICE_3); + + ASSERT_EQ(skel->bss->smc_cnt, 4, "smc count"); + ASSERT_EQ(skel->bss->fallback_cnt, 2, "fallback count"); + + /* We have set a timeout of 1 second for each accept */ + running = false; + pthread_join(service_3, NULL); +fail_service3: + pthread_join(service_2, NULL); +fail_service2: + 
pthread_join(service_1, NULL); +fail: + bpf_smc__destroy(skel); +} + +void test_bpf_smc(void) +{ + if (!setup_smc()) { + printf("setup for smc test failed, test SKIP:\n"); + test__skip(); + return; + } + + if (test__start_subtest("topo")) + test_topo(); + + cleanup_smc(); +} diff --git a/tools/testing/selftests/bpf/progs/bpf_smc.c b/tools/testing/selftests/bpf/progs/bpf_smc.c new file mode 100644 index 000000000000..3ef99da662a1 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/bpf_smc.c @@ -0,0 +1,109 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include "vmlinux.h" + +#include +#include +#include "bpf_tracing_net.h" + +char _license[] SEC("license") = "GPL"; + +struct smc_sock { + struct sock sk; + struct smc_sock *listen_smc; + bool use_fallback; +} __attribute__((preserve_access_index)); + +int smc_cnt = 0; +int fallback_cnt = 0; + +SEC("fentry/smc_listen_work") +int BPF_PROG(bpf_smc_listen_work) +{ + smc_cnt++; + return 0; +} + +SEC("fentry/smc_switch_to_fallback") +int BPF_PROG(bpf_smc_switch_to_fallback, struct smc_sock *smc) +{ + /* only count from one side (client) */ + if (smc && !smc->listen_smc) + fallback_cnt++; + return 0; +} + +/* go with default value if no strat was found */ +bool default_ip_strat_value = true; + +struct smc_strat_ip_key { + __u32 sip; + __u32 dip; +}; + +struct smc_strat_ip_value { + __u8 mode; +}; + +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(key_size, sizeof(struct smc_strat_ip_key)); + __uint(value_size, sizeof(struct smc_strat_ip_value)); + __uint(max_entries, 128); + __uint(map_flags, BPF_F_NO_PREALLOC); +} smc_strats_ip SEC(".maps"); + +static bool smc_check_ip(__u32 src, __u32 dst) +{ + struct smc_strat_ip_value *value; + struct smc_strat_ip_key key = { + .sip = src, + .dip = dst, + }; + + value = bpf_map_lookup_elem(&smc_strats_ip, &key); + return value ? 
value->mode : default_ip_strat_value; +} + +SEC("fmod_ret/update_socket_protocol") +int BPF_PROG(smc_run, int family, int type, int protocol) +{ + struct task_struct *task; + + if (family != AF_INET && family != AF_INET6) + return protocol; + + if ((type & 0xf) != SOCK_STREAM) + return protocol; + + if (protocol != 0 && protocol != IPPROTO_TCP) + return protocol; + + task = bpf_get_current_task_btf(); + /* Prevent from affecting other tests */ + if (!task || !task->nsproxy->net_ns->smc.ops) + return protocol; + + return IPPROTO_SMC; +} + +SEC("struct_ops/bpf_smc_set_tcp_option_cond") +int BPF_PROG(bpf_smc_set_tcp_option_cond, const struct tcp_sock *tp, struct inet_request_sock *ireq) +{ + return smc_check_ip(ireq->req.__req_common.skc_daddr, + ireq->req.__req_common.skc_rcv_saddr); +} + +SEC("struct_ops/bpf_smc_set_tcp_option") +int BPF_PROG(bpf_smc_set_tcp_option, struct tcp_sock *tp) +{ + return smc_check_ip(tp->inet_conn.icsk_inet.sk.__sk_common.skc_rcv_saddr, + tp->inet_conn.icsk_inet.sk.__sk_common.skc_daddr); +} + +SEC(".struct_ops.link") +struct smc_ops linkcheck = { + .name = "linkcheck", + .set_option = (void *) bpf_smc_set_tcp_option, + .set_option_cond = (void *) bpf_smc_set_tcp_option_cond, +};
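
Note on patch 4/5: the lookup order described in its commit message can be modelled outside libbpf. What follows is a minimal sketch under stated assumptions, not libbpf code. struct toy_btf, toy_find(), toy_resolve() and toy_case() are hypothetical names invented for illustration, a BTF is reduced to a list of struct names, and only a single module BTF is modelled, whereas find_ksym_btf_id() in libbpf searches vmlinux and then the loaded modules.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a BTF object: just a list of struct names. */
struct toy_btf {
	const char *const *names;
	size_t cnt;
};

/* Return a 1-based id for name in btf, or -ENOENT if it is absent. */
static int toy_find(const struct toy_btf *btf, const char *name)
{
	assert(btf && name);
	for (size_t i = 0; i < btf->cnt; i++)
		if (strcmp(btf->names[i], name) == 0)
			return (int)i + 1;
	return -ENOENT;
}

/*
 * Resolve the value type stname for the ops type tname.
 * On success, *st_btf points at the BTF that holds stname.
 */
static int toy_resolve(const struct toy_btf *vmlinux, const struct toy_btf *mod,
		       const char *tname, const char *stname,
		       const struct toy_btf **st_btf)
{
	const struct toy_btf *btf = vmlinux;
	int in_mod = 0, id;

	/* Step 1: locate the ops type, vmlinux first, then the module. */
	if (toy_find(vmlinux, tname) < 0) {
		if (toy_find(mod, tname) < 0)
			return -ENOENT;
		btf = mod;
		in_mod = 1;
	}

	/* Step 2: look for the value type in the same BTF as the ops type. */
	id = toy_find(btf, stname);

	/*
	 * Step 3: retry elsewhere only when the ops type came from vmlinux
	 * (no module BTF was selected). This admits case 2 while case 1,
	 * where the ops type lives in the module, keeps failing.
	 */
	if (id == -ENOENT && !in_mod) {
		id = toy_find(mod, stname);
		btf = mod;
	}
	if (id < 0)
		return id;

	*st_btf = btf;
	return id;
}

/*
 * Build one of the four cases from the table and report where the value
 * type was resolved: 0 for vmlinux, 1 for the module, negative on error.
 */
int toy_case(int st_in_mod, int ops_in_mod)
{
	const char *v_names[2], *m_names[2];
	struct toy_btf vmlinux = { v_names, 0 }, mod = { m_names, 0 };
	const struct toy_btf *st_btf = NULL;
	int id;

	if (ops_in_mod)
		m_names[mod.cnt++] = "xxx";
	else
		v_names[vmlinux.cnt++] = "xxx";
	if (st_in_mod)
		m_names[mod.cnt++] = "bpf_struct_ops_xxx";
	else
		v_names[vmlinux.cnt++] = "bpf_struct_ops_xxx";

	id = toy_resolve(&vmlinux, &mod, "xxx", "bpf_struct_ops_xxx", &st_btf);
	if (id < 0)
		return id;
	return st_btf == &mod;
}
```

With these definitions, toy_case(1, 0) models case 2 and resolves the value type in the module BTF, while toy_case(0, 1) models case 1 and still fails with -ENOENT, which is the behaviour the commit message asks for.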