Message ID | 20211228134435.41774-1-tonylu@linux.alibaba.com (mailing list archive) |
---|---|
State | Accepted |
Commit | d7cd421da9da2cc7b4d25b8537f66db5c8331c40 |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [net-next] net/smc: Introduce TCP ULP support | expand |
On 28/12/2021 14:44, Tony Lu wrote: > This implements TCP ULP for SMC, helps applications to replace TCP with > SMC protocol in place. And we use it to implement transparent > replacement. > > This replaces original TCP sockets with SMC, reuse TCP as clcsock when > calling setsockopt with TCP_ULP option, and without any overhead. This looks very interesting. Can you provide a simple userspace example about how to use ULP with smc? And how do you make sure that the in-band CLC handshake doesn't interfere with the previous TCP traffic, is the idea to put some kind of protocol around it so both sides 'know' when the protocol ended and the CLC handshake starts?
On Thu, Dec 30, 2021 at 04:03:19PM +0100, Karsten Graul wrote: > On 28/12/2021 14:44, Tony Lu wrote: > > This implements TCP ULP for SMC, helps applications to replace TCP with > > SMC protocol in place. And we use it to implement transparent > > replacement. > > > > This replaces original TCP sockets with SMC, reuse TCP as clcsock when > > calling setsockopt with TCP_ULP option, and without any overhead. > > This looks very interesting. Can you provide a simple userspace example about > how to use ULP with smc? Here is a userspace C/S application: fd = socket(AF_INET, SOCK_STREAM, 0); addr.sin_family = AF_INET; addr.sin_addr.s_addr = htonl(INADDR_ANY); /* for server */ addr.sin_addr.s_addr = inet_addr("127.0.0.1"); /* for client */ addr.sin_port = htons(PORT); /* kernel will find and load smc module, init smc socket and replace * tcp with smc, use the "clean" tcp as clcsock. */ ret = setsockopt(fd, SOL_TCP, TCP_ULP, "smc", sizeof("smc")); if (ret) /* if ulp init failed, TCP progress can be continued */ printf("replace tcp with smc failed, use tcp"); /* After this, this tcp socket will behave as smc socket. If error * happened, this socket is still a normal tcp socket. * * We check tcp socket's state, so after bind(), connect() or listen(), * ulp setup will be failed. */ bind(fd, (struct sockaddr *)&addr, sizeof(addr)); /* calls smc_bind() */ connect(...); /* for client, smc_connect() */ listen(...); /* for server, smc_listen() */ accept(...); /* for server, smc_accept() */ This approach is not convenient to use, it is a possible usage in userspace. The more important scene is to work with BPF. Transparent replacement with BPF: BPF provides a series of attach points, like: - BPF_CGROUP_INET_SOCK_CREATE, /* calls in the end of inet_create() */ - BPF_CGROUP_INET4_BIND, /* calls in the end of inet_bind() */ - BPF_CGROUP_INET6_BIND, So that we can inject BPF programs into these points in userspace: SEC("cgroup/connect4") int replace_to_smc(struct bpf_sock_addr *addr) { int pid = bpf_get_current_pid_tgid() >> 32; long ret; /* use-defined rules/filters, such as pid, tcp src/dst address, etc...*/ if (pid != DESIRED_PID) return 0; <...> ret = bpf_setsockopt(addr, SOL_TCP, TCP_ULP, "smc", sizeof("smc")); if (ret) { bpf_printk("replace TCP with SMC error: %ld\n", ret); return 0; } return 0; } Then use libbpf to load it with attach type BPF_CGROUP_INET4_CONNECT. Everytime userspace appliations try to bind socket, it will run this BPF prog, check user-defined rule and determine to replace with SMC. Because this BPF is injected outside of user applications, so we can use BPF to implement flexible and non-intrusive transparent replacement. BPF helper bpf_setsockopt() limits the options to call, so TCP_ULP is not allowed now. I will send patches out to allow TCP_ULP option after this approach is merged, which is suggested by BPF's developer. Here is the link about BPF patch: https://lore.kernel.org/netdev/20211209090250.73927-1-tonylu@linux.alibaba.com/ > And how do you make sure that the in-band CLC handshake doesn't interfere with the > previous TCP traffic, is the idea to put some kind of protocol around it so both > sides 'know' when the protocol ended and the CLC handshake starts? Yes, we need a "clean" TCP socket to replace with SMC. To archive it, smc_ulp_init will check the state of TCP socket. First, we make sure that socket is a REALLY TCP sockets. if (tcp->type != SOCK_STREAM || sk->sk_protocol != IPPROTO_TCP || (sk->sk_family != AF_INET && sk->sk_family != AF_INET6)) Then check the state of socket, and makes sure this socket is a newly created userspace socket, not connects to others, no data transferred ever. if (tcp->state != SS_UNCONNECTED || !tcp->file || tcp->wq.fasync_list) Consider this, we don't need to multiplex this socket, clcsock handshaking is the first "user". This behaves likes LD_PRELOAD (smc_run), the difference is the location to replace, user-space or kernel-space. Setting this in an old socket (has traffic already) is more general than current state, and we need more methods to handle this like a protocol wrap it. I would improve this ability in the future. Currently, transparent replace it in create and bind stage can coverage most scenes. Thank you. Tony Lu
Hello: This patch was applied to netdev/net-next.git (master) by David S. Miller <davem@davemloft.net>: On Tue, 28 Dec 2021 21:44:36 +0800 you wrote: > This implements TCP ULP for SMC, helps applications to replace TCP with > SMC protocol in place. And we use it to implement transparent > replacement. > > This replaces original TCP sockets with SMC, reuse TCP as clcsock when > calling setsockopt with TCP_ULP option, and without any overhead. > > [...] Here is the summary with links: - [net-next] net/smc: Introduce TCP ULP support https://git.kernel.org/netdev/net-next/c/d7cd421da9da You are awesome, thank you!
On 31/12/2021 10:13, Tony Lu wrote: > On Thu, Dec 30, 2021 at 04:03:19PM +0100, Karsten Graul wrote: >> On 28/12/2021 14:44, Tony Lu wrote: >>> This implements TCP ULP for SMC, helps applications to replace TCP with >>> SMC protocol in place. And we use it to implement transparent >>> replacement. >>> >>> This replaces original TCP sockets with SMC, reuse TCP as clcsock when >>> calling setsockopt with TCP_ULP option, and without any overhead. >> >> This looks very interesting. Can you provide a simple userspace example about >> how to use ULP with smc? > > Here is a userspace C/S application: Thanks for the example, it was very helpful!
Half a year too late, but then it hadn't been posted on fsdevel. Which it really should have been, due to > + /* replace tcp socket to smc */ > + smcsock->file = tcp->file; > + smcsock->file->private_data = smcsock; > + smcsock->file->f_inode = SOCK_INODE(smcsock); /* replace inode when sock_close */ > + smcsock->file->f_path.dentry->d_inode = SOCK_INODE(smcsock); /* dput() in __fput */ > + tcp->file = NULL; this. It violates a bunch of rather fundamental assertions about the data structures you are playing with, and I'm not even going into the lifetime and refcounting issues. * ->d_inode of a busy positive dentry never changes while refcount of dentry remains positive. A lot of places in VFS rely upon that. * ->f_inode of a file never changes, period. * ->private_data of a struct file associated with a socket never changes; it can be accessed lockless, with no precautions beyond "make sure that refcount of struct file will remain positive". PS: more than one thread could be calling methods of that struct socket at the same time; what's to stop e.g. connect(2) on the same sucker (e.g. called on the same descriptor from a different thread that happens to share the same descriptor table) to be sitting there trying to lock the struct sock currently held locked by caller of tcp_set_ulp()?
On Thu, Aug 04, 2022 at 03:56:11AM +0100, Al Viro wrote: > Half a year too late, but then it hadn't been posted on fsdevel. > Which it really should have been, due to > > > + /* replace tcp socket to smc */ > > + smcsock->file = tcp->file; > > + smcsock->file->private_data = smcsock; > > + smcsock->file->f_inode = SOCK_INODE(smcsock); /* replace inode when sock_close */ > > + smcsock->file->f_path.dentry->d_inode = SOCK_INODE(smcsock); /* dput() in __fput */ > > + tcp->file = NULL; > > this. It violates a bunch of rather fundamental assertions about the > data structures you are playing with, and I'm not even going into the > lifetime and refcounting issues. > > * ->d_inode of a busy positive dentry never changes while refcount > of dentry remains positive. A lot of places in VFS rely upon that. > * ->f_inode of a file never changes, period. > * ->private_data of a struct file associated with a socket never > changes; it can be accessed lockless, with no precautions beyond "make sure > that refcount of struct file will remain positive". Consider, BTW, what it does to sockfd_lookup() users. We grab a reference to struct file, pick struct socket from its ->private_data, work with that sucker, then do sockfd_put(). Which does fput(sock->file). Guess what happens if sockfd_lookup() is given the descriptor of your TCP socket, just before that tcp->file = NULL? Right, fput(NULL) as soon as matching sockfd_put() is called. And the very first thing fput() does is this: if (atomic_long_dec_and_test(&file->f_count)) { And that's just one example - a *lot* of places both in VFS and in net/* rely upon these assertions. This is really not a workable approach.
On Thu, Aug 04, 2022 at 03:56:11AM +0100, Al Viro wrote: > Half a year too late, but then it hadn't been posted on fsdevel. > Which it really should have been, due to > > > + /* replace tcp socket to smc */ > > + smcsock->file = tcp->file; > > + smcsock->file->private_data = smcsock; > > + smcsock->file->f_inode = SOCK_INODE(smcsock); /* replace inode when sock_close */ > > + smcsock->file->f_path.dentry->d_inode = SOCK_INODE(smcsock); /* dput() in __fput */ > > + tcp->file = NULL; > > this. It violates a bunch of rather fundamental assertions about the > data structures you are playing with, and I'm not even going into the > lifetime and refcounting issues. > > * ->d_inode of a busy positive dentry never changes while refcount > of dentry remains positive. A lot of places in VFS rely upon that. > * ->f_inode of a file never changes, period. > * ->private_data of a struct file associated with a socket never > changes; it can be accessed lockless, with no precautions beyond "make sure > that refcount of struct file will remain positive". > > PS: more than one thread could be calling methods of that struct socket at the > same time; what's to stop e.g. connect(2) on the same sucker (e.g. called on > the same descriptor from a different thread that happens to share the same > descriptor table) to be sitting there trying to lock the struct sock currently > held locked by caller of tcp_set_ulp()? Sorry for the late reply. SMC ULP tries to make original TCP sockets behave like SMC. The original TCP sockets will belong to this new SMC socket, and it can only be accessed in kernel with struct socket in SMC. The SMC and TCP sockets are bonded together. So this patch replaces the file of TCP to SMC socket which is allocated in kernel. It is guaranteed that the TCP socket is always freed before the newly replaced SMC socket. There is an other approach to archive this by changing af_ops of sockets. I will fix it without breaking the assertions. Tony Lu
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c index ba9d1a8ebb4a..deb40307820d 100644 --- a/net/smc/af_smc.c +++ b/net/smc/af_smc.c @@ -2700,8 +2700,8 @@ static const struct proto_ops smc_sock_ops = { .splice_read = smc_splice_read, }; -static int smc_create(struct net *net, struct socket *sock, int protocol, - int kern) +static int __smc_create(struct net *net, struct socket *sock, int protocol, + int kern, struct socket *clcsock) { int family = (protocol == SMCPROTO_SMC6) ? PF_INET6 : PF_INET; struct smc_sock *smc; @@ -2726,12 +2726,19 @@ static int smc_create(struct net *net, struct socket *sock, int protocol, smc = smc_sk(sk); smc->use_fallback = false; /* assume rdma capability first */ smc->fallback_rsn = 0; - rc = sock_create_kern(net, family, SOCK_STREAM, IPPROTO_TCP, - &smc->clcsock); - if (rc) { - sk_common_release(sk); - goto out; + + rc = 0; + if (!clcsock) { + rc = sock_create_kern(net, family, SOCK_STREAM, IPPROTO_TCP, + &smc->clcsock); + if (rc) { + sk_common_release(sk); + goto out; + } + } else { + smc->clcsock = clcsock; } + smc->sk.sk_sndbuf = max(smc->clcsock->sk->sk_sndbuf, SMC_BUF_MIN_SIZE); smc->sk.sk_rcvbuf = max(smc->clcsock->sk->sk_rcvbuf, SMC_BUF_MIN_SIZE); @@ -2739,12 +2746,76 @@ static int smc_create(struct net *net, struct socket *sock, int protocol, return rc; } +static int smc_create(struct net *net, struct socket *sock, int protocol, + int kern) +{ + return __smc_create(net, sock, protocol, kern, NULL); +} + static const struct net_proto_family smc_sock_family_ops = { .family = PF_SMC, .owner = THIS_MODULE, .create = smc_create, }; +static int smc_ulp_init(struct sock *sk) +{ + struct socket *tcp = sk->sk_socket; + struct net *net = sock_net(sk); + struct socket *smcsock; + int protocol, ret; + + /* only TCP can be replaced */ + if (tcp->type != SOCK_STREAM || sk->sk_protocol != IPPROTO_TCP || + (sk->sk_family != AF_INET && sk->sk_family != AF_INET6)) + return -ESOCKTNOSUPPORT; + /* don't handle wq now */ + if (tcp->state != SS_UNCONNECTED || !tcp->file || tcp->wq.fasync_list) + return -ENOTCONN; + + if (sk->sk_family == AF_INET) + protocol = SMCPROTO_SMC; + else + protocol = SMCPROTO_SMC6; + + smcsock = sock_alloc(); + if (!smcsock) + return -ENFILE; + + smcsock->type = SOCK_STREAM; + __module_get(THIS_MODULE); /* tried in __tcp_ulp_find_autoload */ + ret = __smc_create(net, smcsock, protocol, 1, tcp); + if (ret) { + sock_release(smcsock); /* module_put() which ops won't be NULL */ + return ret; + } + + /* replace tcp socket to smc */ + smcsock->file = tcp->file; + smcsock->file->private_data = smcsock; + smcsock->file->f_inode = SOCK_INODE(smcsock); /* replace inode when sock_close */ + smcsock->file->f_path.dentry->d_inode = SOCK_INODE(smcsock); /* dput() in __fput */ + tcp->file = NULL; + + return ret; +} + +static void smc_ulp_clone(const struct request_sock *req, struct sock *newsk, + const gfp_t priority) +{ + struct inet_connection_sock *icsk = inet_csk(newsk); + + /* don't inherit ulp ops to child when listen */ + icsk->icsk_ulp_ops = NULL; +} + +static struct tcp_ulp_ops smc_ulp_ops __read_mostly = { + .name = "smc", + .owner = THIS_MODULE, + .init = smc_ulp_init, + .clone = smc_ulp_clone, +}; + unsigned int smc_net_id; static __net_init int smc_net_init(struct net *net) @@ -2855,6 +2926,12 @@ static int __init smc_init(void) goto out_sock; } + rc = tcp_register_ulp(&smc_ulp_ops); + if (rc) { + pr_err("%s: tcp_ulp_register fails with %d\n", __func__, rc); + goto out_sock; + } + static_branch_enable(&tcp_have_smc); return 0; @@ -2883,6 +2960,7 @@ static int __init smc_init(void) static void __exit smc_exit(void) { static_branch_disable(&tcp_have_smc); + tcp_unregister_ulp(&smc_ulp_ops); sock_unregister(PF_SMC); smc_core_exit(); smc_ib_unregister_client(); @@ -2905,3 +2983,4 @@ MODULE_AUTHOR("Ursula Braun <ubraun@linux.vnet.ibm.com>"); MODULE_DESCRIPTION("smc socket address family"); MODULE_LICENSE("GPL"); MODULE_ALIAS_NETPROTO(PF_SMC); +MODULE_ALIAS_TCP_ULP("smc");
This implements TCP ULP for SMC, helps applications to replace TCP with SMC protocol in place. And we use it to implement transparent replacement. This replaces original TCP sockets with SMC, reuse TCP as clcsock when calling setsockopt with TCP_ULP option, and without any overhead. To replace TCP sockets with SMC, there are two approaches: - use setsockopt() syscall with TCP_ULP option, if error, it would fallback to TCP. - use BPF prog with types BPF_CGROUP_INET_SOCK_CREATE or others to replace transparently. BPF hooks some points in create socket, bind and others, users can inject their BPF logics without modifying their applications, and choose which connections should be replaced with SMC by calling setsockopt() in BPF prog, based on rules, such as TCP tuples, PID, cgroup, etc... BPF doesn't support calling setsockopt with TCP_ULP now, I will send the patches after this accepted. Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> --- net/smc/af_smc.c | 93 ++++++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 86 insertions(+), 7 deletions(-)