Message ID | 20211122134255.63347-1-tonylu@linux.alibaba.com (mailing list archive) |
---|---|
State | RFC |
Headers | show |
Series | [RFC,net-next] net/smc: Unbind buffer size from clcsock and make it tunable | expand |
On 22/11/2021 14:42, Tony Lu wrote: > SMC uses smc->sk.sk_{rcv|snd}buf to create buffer for send buffer or > RMB. And the values of buffer size inherits from clcsock. The clcsock is > a TCP sock which is initiated during SMC connection startup. > > The inherited buffer size doesn't fit SMC well. TCP provides two sysctl > knobs to tune r/w buffers, net.ipv4.tcp_{r|w}mem, and SMC use the default > value from TCP. The buffer size is tuned for TCP, but not fit SMC well > in some scenarios. For example, we need larger buffer of SMC for high > throughput applications, and smaller buffer of SMC for saving contiguous > memory. We need to adjust the buffer size apart from TCP and not to > disturb TCP. > > This unbinds buffer size which inherits from clcsock, and provides > sysctl knobs to adjust buffer size independently. These knobs can be > tuned with different values for different net namespaces for performance > and flexibility. > > Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> > Reviewed-by: Wen Gu <guwen@linux.alibaba.com> > --- To activate SMC for existing programs usually the smc_run command or the preload library (both from the smc-tools package) are used. This commit introduced support to set the send and recv window sizes using command line parameters or environment variables: https://github.com/ibm-s390-linux/smc-tools/commit/59bfb99c588746f7dca1b3c97fd88f3f7cbc975f Why another way to manipulate these sizes? Your solution would stop applications to set these values.
On Mon, Nov 22, 2021 at 04:08:37PM +0100, Karsten Graul wrote: > On 22/11/2021 14:42, Tony Lu wrote: > > SMC uses smc->sk.sk_{rcv|snd}buf to create buffer for send buffer or > > RMB. And the values of buffer size inherits from clcsock. The clcsock is > > a TCP sock which is initiated during SMC connection startup. > > > > The inherited buffer size doesn't fit SMC well. TCP provides two sysctl > > knobs to tune r/w buffers, net.ipv4.tcp_{r|w}mem, and SMC use the default > > value from TCP. The buffer size is tuned for TCP, but not fit SMC well > > in some scenarios. For example, we need larger buffer of SMC for high > > throughput applications, and smaller buffer of SMC for saving contiguous > > memory. We need to adjust the buffer size apart from TCP and not to > > disturb TCP. > > > > This unbinds buffer size which inherits from clcsock, and provides > > sysctl knobs to adjust buffer size independently. These knobs can be > > tuned with different values for different net namespaces for performance > > and flexibility. > > > > Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> > > Reviewed-by: Wen Gu <guwen@linux.alibaba.com> > > --- > > To activate SMC for existing programs usually the smc_run command or the > preload library (both from the smc-tools package) are used. > This commit introduced support to set the send and recv window sizes > using command line parameters or environment variables: > > https://github.com/ibm-s390-linux/smc-tools/commit/59bfb99c588746f7dca1b3c97fd88f3f7cbc975f Hi Graul, Thanks for your advice. We are using smc-tools, it is a very useful tool, and we also use smc_run or LD_PRELOAD to help our applications to replace with SMC from TCP. There are some differences to use SMC in our environment. The followings are our scenarios to use SMC: 1. Transparent acceleration This approach is widely used in our environment. The main idea of transparent acceleration is not to touch the applications. The applications are usually pre-compiled and pre-packaged containers or ECS, which means the binary and the binary needed environments, like glibc and other libraries are bundled as bootstrap containers. So it is hard to inject the smc_run or LD_PRELOAD into application's containers runtime. To solve this issue, we developed a set of patches to replace the AF_INET / SOCK_STREAM with AF_SMC / SMCPROTO_SMC{6} by configuration. So that we can control acceleration in kernel without any other changes in user-space, and won't break our application containers and publish workflow. These patches are still improving for upstream. 2. Use SMC explicitly This approach is very straightforward. Applications just create sockets using AF_SMC and SMCPROTO_SMC{6}, and SMC works fine. However, most of applications don't want to bind tightly to single SMC protocol. Most of them take into account compatibility, stability and flexibility. > Why another way to manipulate these sizes? > Your solution would stop applications to set these values. I didn't understand it clearly about stopping applications to set there values. IMHO, this RFC introduces two knobs for snd/rcvbuf. During the following stages, applications can set these values as expected. 1. SMC module or per-net-namespace initialized: sysctl_{w|r}mem_default initialized in smc_net_init() when current net namespace initialized. The default values of SMC inherit from TCP, and clcsock use TCP's configures. 2. create SMC socket: smc_create() is called, and smc->sk.sk_{snd|rcv}buf are initialized from per-netns earlier. There are no different from before. Except for changing the values of TCP after SMC initialized, users can change them with newly added two knobs. 3. applications call setsockopt() to modify SO_SNDBUF or SO_RCVBUF: After smc_create() creates socket, applications can use setsockopt() to change the snd/rcvbuf, which called sock_setsockopt() directly. If fallback happened, setsockopt() would change clcsock's values. In the end, we hope to provide a flexibility approach to change SMC's buffer size only and don't disturb others. Sysctl are considered as a better way to maintain and easy to use for users. Thanks, Tony Lu
On 23/11/2021 07:56, Tony Lu wrote: > To solve this issue, we developed a set of patches to replace the > AF_INET / SOCK_STREAM with AF_SMC / SMCPROTO_SMC{6} by configuration. > So that we can control acceleration in kernel without any other changes > in user-space, and won't break our application containers and publish > workflow. These patches are still improving for upstream. This sounds interesting. Will this also be namespace-based like the sysctls in the current patch? Will this future change integrate nicely with the current new sysctls? This might allow to activate smc for containers, selectively. Please send these changes in a patch series together when they are finished. We would like to review them as a whole to see how things play together. Thank you!
On Tue, Nov 23, 2021 at 10:33:07AM +0100, Karsten Graul wrote: > On 23/11/2021 07:56, Tony Lu wrote: > > To solve this issue, we developed a set of patches to replace the > > AF_INET / SOCK_STREAM with AF_SMC / SMCPROTO_SMC{6} by configuration. > > So that we can control acceleration in kernel without any other changes > > in user-space, and won't break our application containers and publish > > workflow. These patches are still improving for upstream. > > This sounds interesting. Will this also be namespace-based like the sysctls > in the current patch? Will this future change integrate nicely with the current > new sysctls? This might allow to activate smc for containers, selectively. > > Please send these changes in a patch series together when they are finished. > We would like to review them as a whole to see how things play together. > > Thank you! Hi Graul, I am glad to hear that. The container, which is isolated by net-namespace, is the minimal deployment unit in our environment. The transparent replacement facility is namespace-based, so that we can control the applications behaviors according to the dimensions of containers. The per-netns sysctl is the first step, control the applications' buffer, and not to disturb TCP connections in the same container. The container's memory is not as large as that of a physical machine, and containers might run different workload applications, so we use per-netns sysctl to adjust buffer. And it can be sent out separately. Then, we can allow to activate SMC for some containers with transparent replacement. These patches are improving, make sure the flexible enough for containers and applications scopes, and cohesion enough for upstream. I will send them out as the containers solutions. Thank for you advice. Thanks, Tony Lu
diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst new file mode 100644 index 000000000000..ba2be59a57dd --- /dev/null +++ b/Documentation/networking/smc-sysctl.rst @@ -0,0 +1,20 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========= +SMC Sysctl +========= + +/proc/sys/net/smc/* Variables +============================== + +wmem_default - INTEGER + Initial size of send buffer used by SMC sockets. + The default value inherits from net.ipv4.tcp_wmem[1]. + + Default: 16K + +rmem_default - INTEGER + Initial size of receive buffer (RMB) used by SMC sockets. + The default value inherits from net.ipv4.tcp_rmem[1]. + + Default: 131072 bytes. diff --git a/include/net/netns/smc.h b/include/net/netns/smc.h index ea8a9cf2619b..f948235e3156 100644 --- a/include/net/netns/smc.h +++ b/include/net/netns/smc.h @@ -12,5 +12,10 @@ struct netns_smc { /* protect fback_rsn */ struct mutex mutex_fback_rsn; struct smc_stats_rsn *fback_rsn; +#ifdef CONFIG_SYSCTL + struct ctl_table_header *smc_hdr; +#endif + int sysctl_wmem_default; + int sysctl_rmem_default; }; #endif diff --git a/net/smc/Makefile b/net/smc/Makefile index 196fb6f01b14..640af9a39f9c 100644 --- a/net/smc/Makefile +++ b/net/smc/Makefile @@ -4,4 +4,4 @@ obj-$(CONFIG_SMC) += smc.o obj-$(CONFIG_SMC_DIAG) += smc_diag.o smc-y := af_smc.o smc_pnet.o smc_ib.o smc_clc.o smc_core.o smc_wr.o smc_llc.o smc-y += smc_cdc.o smc_tx.o smc_rx.o smc_close.o smc_ism.o smc_netlink.o smc_stats.o -smc-y += smc_tracepoint.o +smc-y += smc_tracepoint.o smc_sysctl.o diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c index 5e93a5aa347e..543a6180be1d 100644 --- a/net/smc/af_smc.c +++ b/net/smc/af_smc.c @@ -51,6 +51,7 @@ #include "smc_close.h" #include "smc_stats.h" #include "smc_tracepoint.h" +#include "smc_sysctl.h" static DEFINE_MUTEX(smc_server_lgr_pending); /* serialize link group * creation on server @@ -2722,8 +2723,8 @@ static int smc_create(struct net *net, struct socket *sock, int protocol, sk_common_release(sk); goto out; } - smc->sk.sk_sndbuf = max(smc->clcsock->sk->sk_sndbuf, SMC_BUF_MIN_SIZE); - smc->sk.sk_rcvbuf = max(smc->clcsock->sk->sk_rcvbuf, SMC_BUF_MIN_SIZE); + smc->sk.sk_sndbuf = sock_net(sk)->smc.sysctl_wmem_default; + smc->sk.sk_rcvbuf = sock_net(sk)->smc.sysctl_rmem_default; out: return rc; @@ -2739,6 +2740,11 @@ unsigned int smc_net_id; static __net_init int smc_net_init(struct net *net) { + net->smc.sysctl_wmem_default = max(net->ipv4.sysctl_tcp_wmem[1], + SMC_BUF_MIN_SIZE); + net->smc.sysctl_rmem_default = max(net->ipv4.sysctl_tcp_rmem[1], + SMC_BUF_MIN_SIZE); + return smc_pnet_net_init(net); } @@ -2845,6 +2851,12 @@ static int __init smc_init(void) goto out_sock; } + rc = smc_sysctl_init(); + if (rc) { + pr_err("%s: sysctl fails with %d\n", __func__, rc); + goto out_sock; + } + static_branch_enable(&tcp_have_smc); return 0; @@ -2885,6 +2897,7 @@ static void __exit smc_exit(void) smc_clc_exit(); unregister_pernet_subsys(&smc_net_stat_ops); unregister_pernet_subsys(&smc_net_ops); + smc_sysctl_exit(); rcu_barrier(); } diff --git a/net/smc/smc_sysctl.c b/net/smc/smc_sysctl.c new file mode 100644 index 000000000000..6706fe1bd888 --- /dev/null +++ b/net/smc/smc_sysctl.c @@ -0,0 +1,81 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include <linux/module.h> +#include <linux/init.h> +#include <linux/sysctl.h> +#include <net/sock.h> +#include <net/net_namespace.h> + +#include "smc_core.h" + +static int min_sndbuf = SMC_BUF_MIN_SIZE; +static int min_rcvbuf = SMC_BUF_MIN_SIZE; + +static struct ctl_table smc_table[] = { + { + .procname = "wmem_default", + .data = &init_net.smc.sysctl_wmem_default, + .maxlen = sizeof(init_net.smc.sysctl_wmem_default), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = &min_sndbuf, + }, + { + .procname = "rmem_default", + .data = &init_net.smc.sysctl_rmem_default, + .maxlen = sizeof(init_net.smc.sysctl_rmem_default), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = &min_rcvbuf, + }, + { } +}; + +static __net_init int smc_sysctl_init_net(struct net *net) +{ + struct ctl_table *table; + + table = smc_table; + if (!net_eq(net, &init_net)) { + int i; + + table = kmemdup(table, sizeof(smc_table), GFP_KERNEL); + if (!table) + goto err_alloc; + + for (i = 0; i < ARRAY_SIZE(smc_table) - 1; i++) + table[i].data += (void *)net - (void *)&init_net; + } + + net->smc.smc_hdr = register_net_sysctl(net, "net/smc", table); + if (!net->smc.smc_hdr) + goto err_reg; + + return 0; + +err_reg: + if (!net_eq(net, &init_net)) + kfree(table); +err_alloc: + return -ENOMEM; +} + +static __net_exit void smc_sysctl_exit_net(struct net *net) +{ + unregister_net_sysctl_table(net->smc.smc_hdr); +} + +static struct pernet_operations smc_sysctl_ops __net_initdata = { + .init = smc_sysctl_init_net, + .exit = smc_sysctl_exit_net, +}; + +int __init smc_sysctl_init(void) +{ + return register_pernet_subsys(&smc_sysctl_ops); +} + +void smc_sysctl_exit(void) +{ + unregister_pernet_subsys(&smc_sysctl_ops); +} diff --git a/net/smc/smc_sysctl.h b/net/smc/smc_sysctl.h new file mode 100644 index 000000000000..c01c5de3a3ea --- /dev/null +++ b/net/smc/smc_sysctl.h @@ -0,0 +1,22 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#ifndef _SMC_SYSCTL_H +#define _SMC_SYSCTL_H + +#ifdef CONFIG_SYSCTL + +int smc_sysctl_init(void); +void smc_sysctl_exit(void); + +#else + +int smc_sysctl_init(void) +{ + return 0; +} + +void smc_sysctl_exit(void) { } + +#endif /* CONFIG_SYSCTL */ + +#endif /* _SMC_SYSCTL_H */