diff mbox series

[net-next] net/smc: Introduce TCP ULP support

Message ID 20211228134435.41774-1-tonylu@linux.alibaba.com (mailing list archive)
State Not Applicable
Headers show
Series [net-next] net/smc: Introduce TCP ULP support | expand

Commit Message

Tony Lu Dec. 28, 2021, 1:44 p.m. UTC
This implements TCP ULP for SMC, helps applications to replace TCP with
SMC protocol in place. And we use it to implement transparent
replacement.

This replaces original TCP sockets with SMC, reuse TCP as clcsock when
calling setsockopt with TCP_ULP option, and without any overhead.

To replace TCP sockets with SMC, there are two approaches:

- use setsockopt() syscall with TCP_ULP option, if error, it would
  fallback to TCP.

- use BPF prog with types BPF_CGROUP_INET_SOCK_CREATE or others to
  replace transparently. BPF hooks some points in create socket, bind
  and others, users can inject their BPF logics without modifying their
  applications, and choose which connections should be replaced with SMC
  by calling setsockopt() in BPF prog, based on rules, such as TCP tuples,
  PID, cgroup, etc...

  BPF doesn't support calling setsockopt with TCP_ULP now, I will send the
  patches after this accepted.

Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
---
 net/smc/af_smc.c | 93 ++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 86 insertions(+), 7 deletions(-)

Comments

Karsten Graul Dec. 30, 2021, 3:03 p.m. UTC | #1
On 28/12/2021 14:44, Tony Lu wrote:
> This implements TCP ULP for SMC, helps applications to replace TCP with
> SMC protocol in place. And we use it to implement transparent
> replacement.
> 
> This replaces original TCP sockets with SMC, reuse TCP as clcsock when
> calling setsockopt with TCP_ULP option, and without any overhead.

This looks very interesting. Can you provide a simple userspace example about 
how to use ULP with smc?

And how do you make sure that the in-band CLC handshake doesn't interfere with the
previous TCP traffic, is the idea to put some kind of protocol around it so both
sides 'know' when the protocol ended and the CLC handshake starts?
Tony Lu Dec. 31, 2021, 9:13 a.m. UTC | #2
On Thu, Dec 30, 2021 at 04:03:19PM +0100, Karsten Graul wrote:
> On 28/12/2021 14:44, Tony Lu wrote:
> > This implements TCP ULP for SMC, helps applications to replace TCP with
> > SMC protocol in place. And we use it to implement transparent
> > replacement.
> > 
> > This replaces original TCP sockets with SMC, reuse TCP as clcsock when
> > calling setsockopt with TCP_ULP option, and without any overhead.
> 
> This looks very interesting. Can you provide a simple userspace example about 
> how to use ULP with smc?

Here is a userspace C/S application:

	fd = socket(AF_INET, SOCK_STREAM, 0);

	addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);      /* for server */
	addr.sin_addr.s_addr = inet_addr("127.0.0.1"); /* for client */
    addr.sin_port = htons(PORT);

	/* kernel will find and load smc module, init smc socket and replace
	 * tcp with smc, use the "clean" tcp as clcsock.
	 */
	ret = setsockopt(fd, SOL_TCP, TCP_ULP, "smc", sizeof("smc"));
	if (ret) /* if ulp init failed, TCP progress can be continued */
		printf("replace tcp with smc failed, use tcp");
	
	/* After this, this tcp socket will behave as smc socket. If error
	 * happened, this socket is still a normal tcp socket.
	 *
	 * We check tcp socket's state, so after bind(), connect() or listen(),
	 * ulp setup will be failed.
	 */
	bind(fd, (struct sockaddr *)&addr, sizeof(addr)); /* calls smc_bind() */
	connect(...); /* for client, smc_connect() */
	listen(...);  /* for server, smc_listen() */
	accept(...);  /* for server, smc_accept() */

This approach is not convenient to use, it is a possible usage in
userspace. The more important scene is to work with BPF.

Transparent replacement with BPF:
	
	BPF provides a series of attach points, like:
	- BPF_CGROUP_INET_SOCK_CREATE, /* calls in the end of inet_create() */
	- BPF_CGROUP_INET4_BIND,       /* calls in the end of inet_bind() */
	- BPF_CGROUP_INET6_BIND,

	So that we can inject BPF programs into these points in userspace:

	SEC("cgroup/connect4")
	int replace_to_smc(struct bpf_sock_addr *addr)
	{
		int pid = bpf_get_current_pid_tgid() >> 32;
		long ret;

		/* use-defined rules/filters, such as pid, tcp src/dst address, etc...*/
		if (pid != DESIRED_PID)
			return 0;

		<...>
	
		ret = bpf_setsockopt(addr, SOL_TCP, TCP_ULP, "smc", sizeof("smc"));
		if (ret) {
			bpf_printk("replace TCP with SMC error: %ld\n", ret);
			return 0;
		}
		return 0;
	}

	Then use libbpf to load it with attach type BPF_CGROUP_INET4_CONNECT.
	Everytime userspace appliations try to bind socket, it will run this
	BPF prog, check user-defined rule and determine to replace with
	SMC. Because this BPF is injected outside of user applications, so
	we can use BPF to implement flexible and non-intrusive transparent
	replacement.

	BPF helper bpf_setsockopt() limits the options to call, so TCP_ULP
	is not allowed now. I will send patches out to allow TCP_ULP option
	after this approach is merged, which is suggested by BPF's developer.
	Here is the link about BPF patch:

	https://lore.kernel.org/netdev/20211209090250.73927-1-tonylu@linux.alibaba.com/

> And how do you make sure that the in-band CLC handshake doesn't interfere with the
> previous TCP traffic, is the idea to put some kind of protocol around it so both
> sides 'know' when the protocol ended and the CLC handshake starts?

Yes, we need a "clean" TCP socket to replace with SMC. To archive it,
smc_ulp_init will check the state of TCP socket.

First, we make sure that socket is a REALLY TCP sockets.

	if (tcp->type != SOCK_STREAM || sk->sk_protocol != IPPROTO_TCP ||
		(sk->sk_family != AF_INET && sk->sk_family != AF_INET6))

Then check the state of socket, and makes sure this socket is a newly
created userspace socket, not connects to others, no data transferred
ever.

	if (tcp->state != SS_UNCONNECTED || !tcp->file || tcp->wq.fasync_list)

Consider this, we don't need to multiplex this socket, clcsock
handshaking is the first "user". This behaves likes LD_PRELOAD (smc_run),
the difference is the location to replace, user-space or kernel-space.

Setting this in an old socket (has traffic already) is more general than
current state, and we need more methods to handle this like a protocol
wrap it. I would improve this ability in the future. Currently,
transparent replace it in create and bind stage can coverage most
scenes.

Thank you.
Tony Lu
patchwork-bot+netdevbpf@kernel.org Jan. 2, 2022, 12:20 p.m. UTC | #3
Hello:

This patch was applied to netdev/net-next.git (master)
by David S. Miller <davem@davemloft.net>:

On Tue, 28 Dec 2021 21:44:36 +0800 you wrote:
> This implements TCP ULP for SMC, helps applications to replace TCP with
> SMC protocol in place. And we use it to implement transparent
> replacement.
> 
> This replaces original TCP sockets with SMC, reuse TCP as clcsock when
> calling setsockopt with TCP_ULP option, and without any overhead.
> 
> [...]

Here is the summary with links:
  - [net-next] net/smc: Introduce TCP ULP support
    https://git.kernel.org/netdev/net-next/c/d7cd421da9da

You are awesome, thank you!
Karsten Graul Jan. 3, 2022, 9:09 a.m. UTC | #4
On 31/12/2021 10:13, Tony Lu wrote:
> On Thu, Dec 30, 2021 at 04:03:19PM +0100, Karsten Graul wrote:
>> On 28/12/2021 14:44, Tony Lu wrote:
>>> This implements TCP ULP for SMC, helps applications to replace TCP with
>>> SMC protocol in place. And we use it to implement transparent
>>> replacement.
>>>
>>> This replaces original TCP sockets with SMC, reuse TCP as clcsock when
>>> calling setsockopt with TCP_ULP option, and without any overhead.
>>
>> This looks very interesting. Can you provide a simple userspace example about 
>> how to use ULP with smc?
> 
> Here is a userspace C/S application:

Thanks for the example, it was very helpful!
Al Viro Aug. 4, 2022, 2:56 a.m. UTC | #5
Half a year too late, but then it hadn't been posted on fsdevel.
Which it really should have been, due to

> +	/* replace tcp socket to smc */
> +	smcsock->file = tcp->file;
> +	smcsock->file->private_data = smcsock;
> +	smcsock->file->f_inode = SOCK_INODE(smcsock); /* replace inode when sock_close */
> +	smcsock->file->f_path.dentry->d_inode = SOCK_INODE(smcsock); /* dput() in __fput */
> +	tcp->file = NULL;

this.  It violates a bunch of rather fundamental assertions about the
data structures you are playing with, and I'm not even going into the
lifetime and refcounting issues.

	* ->d_inode of a busy positive dentry never changes while refcount
of dentry remains positive.  A lot of places in VFS rely upon that.
	* ->f_inode of a file never changes, period.
	* ->private_data of a struct file associated with a socket never
changes; it can be accessed lockless, with no precautions beyond "make sure
that refcount of struct file will remain positive".

PS: more than one thread could be calling methods of that struct socket at the
same time; what's to stop e.g. connect(2) on the same sucker (e.g. called on
the same descriptor from a different thread that happens to share the same
descriptor table) to be sitting there trying to lock the struct sock currently
held locked by caller of tcp_set_ulp()?
Al Viro Aug. 4, 2022, 3:48 a.m. UTC | #6
On Thu, Aug 04, 2022 at 03:56:11AM +0100, Al Viro wrote:
> 	Half a year too late, but then it hadn't been posted on fsdevel.
> Which it really should have been, due to
> 
> > +	/* replace tcp socket to smc */
> > +	smcsock->file = tcp->file;
> > +	smcsock->file->private_data = smcsock;
> > +	smcsock->file->f_inode = SOCK_INODE(smcsock); /* replace inode when sock_close */
> > +	smcsock->file->f_path.dentry->d_inode = SOCK_INODE(smcsock); /* dput() in __fput */
> > +	tcp->file = NULL;
> 
> this.  It violates a bunch of rather fundamental assertions about the
> data structures you are playing with, and I'm not even going into the
> lifetime and refcounting issues.
> 
> 	* ->d_inode of a busy positive dentry never changes while refcount
> of dentry remains positive.  A lot of places in VFS rely upon that.
> 	* ->f_inode of a file never changes, period.
> 	* ->private_data of a struct file associated with a socket never
> changes; it can be accessed lockless, with no precautions beyond "make sure
> that refcount of struct file will remain positive".

Consider, BTW, what it does to sockfd_lookup() users.  We grab a reference
to struct file, pick struct socket from its ->private_data, work with that
sucker, then do sockfd_put().  Which does fput(sock->file).

Guess what happens if sockfd_lookup() is given the descriptor of your
TCP socket, just before that tcp->file = NULL?  Right, fput(NULL) as
soon as matching sockfd_put() is called.  And the very first thing fput()
does is this:
        if (atomic_long_dec_and_test(&file->f_count)) {

And that's just one example - a *lot* of places both in VFS and in
net/* rely upon these assertions.  This is really not a workable approach.
Tony Lu Aug. 11, 2022, 9:29 a.m. UTC | #7
On Thu, Aug 04, 2022 at 03:56:11AM +0100, Al Viro wrote:
> 	Half a year too late, but then it hadn't been posted on fsdevel.
> Which it really should have been, due to
> 
> > +	/* replace tcp socket to smc */
> > +	smcsock->file = tcp->file;
> > +	smcsock->file->private_data = smcsock;
> > +	smcsock->file->f_inode = SOCK_INODE(smcsock); /* replace inode when sock_close */
> > +	smcsock->file->f_path.dentry->d_inode = SOCK_INODE(smcsock); /* dput() in __fput */
> > +	tcp->file = NULL;
> 
> this.  It violates a bunch of rather fundamental assertions about the
> data structures you are playing with, and I'm not even going into the
> lifetime and refcounting issues.
> 
> 	* ->d_inode of a busy positive dentry never changes while refcount
> of dentry remains positive.  A lot of places in VFS rely upon that.
> 	* ->f_inode of a file never changes, period.
> 	* ->private_data of a struct file associated with a socket never
> changes; it can be accessed lockless, with no precautions beyond "make sure
> that refcount of struct file will remain positive".
>
> PS: more than one thread could be calling methods of that struct socket at the
> same time; what's to stop e.g. connect(2) on the same sucker (e.g. called on
> the same descriptor from a different thread that happens to share the same
> descriptor table) to be sitting there trying to lock the struct sock currently
> held locked by caller of tcp_set_ulp()?

Sorry for the late reply.

SMC ULP tries to make original TCP sockets behave like SMC. The original
TCP sockets will belong to this new SMC socket, and it can only be
accessed in kernel with struct socket in SMC. The SMC and TCP sockets are
bonded together.

So this patch replaces the file of TCP to SMC socket which is allocated
in kernel. It is guaranteed that the TCP socket is always freed before
the newly replaced SMC socket.

There is an other approach to archive this by changing af_ops of sockets.
I will fix it without breaking the assertions.

Tony Lu
diff mbox series

Patch

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index ba9d1a8ebb4a..deb40307820d 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -2700,8 +2700,8 @@  static const struct proto_ops smc_sock_ops = {
 	.splice_read	= smc_splice_read,
 };
 
-static int smc_create(struct net *net, struct socket *sock, int protocol,
-		      int kern)
+static int __smc_create(struct net *net, struct socket *sock, int protocol,
+			int kern, struct socket *clcsock)
 {
 	int family = (protocol == SMCPROTO_SMC6) ? PF_INET6 : PF_INET;
 	struct smc_sock *smc;
@@ -2726,12 +2726,19 @@  static int smc_create(struct net *net, struct socket *sock, int protocol,
 	smc = smc_sk(sk);
 	smc->use_fallback = false; /* assume rdma capability first */
 	smc->fallback_rsn = 0;
-	rc = sock_create_kern(net, family, SOCK_STREAM, IPPROTO_TCP,
-			      &smc->clcsock);
-	if (rc) {
-		sk_common_release(sk);
-		goto out;
+
+	rc = 0;
+	if (!clcsock) {
+		rc = sock_create_kern(net, family, SOCK_STREAM, IPPROTO_TCP,
+				      &smc->clcsock);
+		if (rc) {
+			sk_common_release(sk);
+			goto out;
+		}
+	} else {
+		smc->clcsock = clcsock;
 	}
+
 	smc->sk.sk_sndbuf = max(smc->clcsock->sk->sk_sndbuf, SMC_BUF_MIN_SIZE);
 	smc->sk.sk_rcvbuf = max(smc->clcsock->sk->sk_rcvbuf, SMC_BUF_MIN_SIZE);
 
@@ -2739,12 +2746,76 @@  static int smc_create(struct net *net, struct socket *sock, int protocol,
 	return rc;
 }
 
+static int smc_create(struct net *net, struct socket *sock, int protocol,
+		      int kern)
+{
+	return __smc_create(net, sock, protocol, kern, NULL);
+}
+
 static const struct net_proto_family smc_sock_family_ops = {
 	.family	= PF_SMC,
 	.owner	= THIS_MODULE,
 	.create	= smc_create,
 };
 
+static int smc_ulp_init(struct sock *sk)
+{
+	struct socket *tcp = sk->sk_socket;
+	struct net *net = sock_net(sk);
+	struct socket *smcsock;
+	int protocol, ret;
+
+	/* only TCP can be replaced */
+	if (tcp->type != SOCK_STREAM || sk->sk_protocol != IPPROTO_TCP ||
+	    (sk->sk_family != AF_INET && sk->sk_family != AF_INET6))
+		return -ESOCKTNOSUPPORT;
+	/* don't handle wq now */
+	if (tcp->state != SS_UNCONNECTED || !tcp->file || tcp->wq.fasync_list)
+		return -ENOTCONN;
+
+	if (sk->sk_family == AF_INET)
+		protocol = SMCPROTO_SMC;
+	else
+		protocol = SMCPROTO_SMC6;
+
+	smcsock = sock_alloc();
+	if (!smcsock)
+		return -ENFILE;
+
+	smcsock->type = SOCK_STREAM;
+	__module_get(THIS_MODULE); /* tried in __tcp_ulp_find_autoload */
+	ret = __smc_create(net, smcsock, protocol, 1, tcp);
+	if (ret) {
+		sock_release(smcsock); /* module_put() which ops won't be NULL */
+		return ret;
+	}
+
+	/* replace tcp socket to smc */
+	smcsock->file = tcp->file;
+	smcsock->file->private_data = smcsock;
+	smcsock->file->f_inode = SOCK_INODE(smcsock); /* replace inode when sock_close */
+	smcsock->file->f_path.dentry->d_inode = SOCK_INODE(smcsock); /* dput() in __fput */
+	tcp->file = NULL;
+
+	return ret;
+}
+
+static void smc_ulp_clone(const struct request_sock *req, struct sock *newsk,
+			  const gfp_t priority)
+{
+	struct inet_connection_sock *icsk = inet_csk(newsk);
+
+	/* don't inherit ulp ops to child when listen */
+	icsk->icsk_ulp_ops = NULL;
+}
+
+static struct tcp_ulp_ops smc_ulp_ops __read_mostly = {
+	.name		= "smc",
+	.owner		= THIS_MODULE,
+	.init		= smc_ulp_init,
+	.clone		= smc_ulp_clone,
+};
+
 unsigned int smc_net_id;
 
 static __net_init int smc_net_init(struct net *net)
@@ -2855,6 +2926,12 @@  static int __init smc_init(void)
 		goto out_sock;
 	}
 
+	rc = tcp_register_ulp(&smc_ulp_ops);
+	if (rc) {
+		pr_err("%s: tcp_ulp_register fails with %d\n", __func__, rc);
+		goto out_sock;
+	}
+
 	static_branch_enable(&tcp_have_smc);
 	return 0;
 
@@ -2883,6 +2960,7 @@  static int __init smc_init(void)
 static void __exit smc_exit(void)
 {
 	static_branch_disable(&tcp_have_smc);
+	tcp_unregister_ulp(&smc_ulp_ops);
 	sock_unregister(PF_SMC);
 	smc_core_exit();
 	smc_ib_unregister_client();
@@ -2905,3 +2983,4 @@  MODULE_AUTHOR("Ursula Braun <ubraun@linux.vnet.ibm.com>");
 MODULE_DESCRIPTION("smc socket address family");
 MODULE_LICENSE("GPL");
 MODULE_ALIAS_NETPROTO(PF_SMC);
+MODULE_ALIAS_TCP_ULP("smc");