
[RFC,net-next] net/smc:introduce 1RTT to SMC

Message ID 1653375127-130233-1-git-send-email-alibuda@linux.alibaba.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next, async
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit fail Errors and warnings before: 0 this patch: 2
netdev/cc_maintainers warning 2 maintainers not CCed: pabeni@redhat.com edumazet@google.com
netdev/build_clang fail Errors and warnings before: 0 this patch: 2
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn fail Errors and warnings before: 0 this patch: 2
netdev/checkpatch warning WARNING: line length of 83 exceeds 80 columns WARNING: line length of 87 exceeds 80 columns WARNING: line length of 88 exceeds 80 columns WARNING: line length of 89 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

D. Wythe May 24, 2022, 6:52 a.m. UTC
From: "D. Wythe" <alibuda@linux.alibaba.com>

Hi Karsten,

We are promoting SMC-R in cloud computing. Because of the nature of
business on the cloud, the scale and the types of customer applications
are unpredictable, and as a participant in SMC-R we hope that it can
cover more application scenarios. Many connection problems have been
exposed along the way. There are two main issues: establishing a single
connection takes longer than with TCP, and the degree of concurrency is
low when many connections are processed at once. This patch set mainly
addresses the first issue; follow-up work on the second issue will be
posted later.

In terms of the communication process, under the current implementation
a TCP three-way handshake needs only 1 RTT, while SMC-R currently needs
4 RTTs: 2 RTTs over IP (TCP handshake, SMC proposal & accept) and 2 RTTs
over IB (two RKEY exchanges). This is currently the most influential
factor in connection establishment time.

We have noticed that a single network interface card is mainstream on
the cloud, due to the lower deployment cost and the cloud's own disaster
recovery support. In addition, with RoCE LAG we no longer need to handle
multiple RDMA network interface cards ourselves, just as with NIC
bonding. At Alibaba, RoCE LAG is widely used for RDMA.

In that case SMC-R has only a single link, so the RKEY LLC messages,
whose purpose is to exchange information across all links, are no longer
needed: the SMC proposal & accept has already completed the exchange of
all the information needed. We therefore think the RKEY exchange can be
removed in that case, which saves 2 RTTs over IB. We call this SMC-R
2-RTT.

In addition, we can use TCP Fast Open to carry the SMC proposal data in
the TCP SYN message, which removes the time SMC spends waiting for the
TCP connection to be established. This saves another 1 RTT over IP.
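
For readers unfamiliar with TCP Fast Open, here is a minimal user-space
sketch of the client side. It is not the kernel code from this patch,
which sets MSG_FASTOPEN on the kernel_sendmsg() of the CLC proposal; it
only shows the analogous sendto() usage. The helper names tfo_make_dst
and tfo_send_first are hypothetical.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef MSG_FASTOPEN
#define MSG_FASTOPEN 0x20000000	/* Linux value, for older libc headers */
#endif

/* Fill an IPv4 destination; returns 0 on success, -1 on a bad address.
 * (Hypothetical helper, for illustration only.)
 */
int tfo_make_dst(struct sockaddr_in *dst, const char *ip, unsigned short port)
{
	memset(dst, 0, sizeof(*dst));
	dst->sin_family = AF_INET;
	dst->sin_port = htons(port);
	return inet_pton(AF_INET, ip, &dst->sin_addr) == 1 ? 0 : -1;
}

/* Start the connection and queue the first payload in one call.
 * There is no separate connect(): sendto() with MSG_FASTOPEN on an
 * unconnected TCP socket initiates the handshake itself.
 */
ssize_t tfo_send_first(int fd, const struct sockaddr_in *dst,
		       const void *buf, size_t len)
{
	return sendto(fd, buf, len, MSG_FASTOPEN,
		      (const struct sockaddr *)dst, sizeof(*dst));
}
```

Note that the payload travels in the SYN only once the client holds a
Fast Open cookie for the server; the first connection to a new server
still performs a regular handshake. The client side must also be
enabled in net.ipv4.tcp_fastopen (bit 0x1); the value 3 used in
Server.sh enables both client and server.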

Based on these two points, in this scenario we can compress the
communication process of SMC-R into 1 RTT over IP, so that in theory
connection establishment takes about as long as for TCP. We call this
SMC-R 1-RTT. Of course, the actual results also depend on the
implementation.
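
The round-trip accounting above can be sketched as follows. This is an
illustration of the arithmetic only, not code from the patch; the
function name smcr_setup_rtts and the enum constants are hypothetical.

```c
#include <stdbool.h>

/* Round trips per handshake phase, as counted in the text above. */
enum {
	RTT_TCP_HANDSHAKE = 1,	/* over IP */
	RTT_CLC_PROPOSAL  = 1,	/* over IP: SMC proposal & accept */
	RTT_RKEY_EXCHANGE = 2,	/* over IB: two RKEY exchanges */
};

/* Total round trips before the SMC-R connection is usable. */
int smcr_setup_rtts(bool skip_rkey_exchange, bool proposal_in_syn)
{
	int rtts = RTT_TCP_HANDSHAKE + RTT_CLC_PROPOSAL;

	if (proposal_in_syn)		/* TCP Fast Open: proposal rides the SYN */
		rtts -= 1;
	if (!skip_rkey_exchange)	/* more than one link: keep the RKEY flow */
		rtts += RTT_RKEY_EXCHANGE;
	return rtts;
}
```

With both flags false this gives the current 4 RTTs; skipping the RKEY
exchange gives SMC-R 2-RTT, and additionally carrying the proposal in
the SYN gives SMC-R 1-RTT.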

In our test environment we ran two VMs on the same host for wrk/nginx
tests, using scripts similar to the following:

Client.sh

conn=$1
thread=$2

wrk -H 'Connection: Close' -c ${conn} -t ${thread} -d 10

Server.sh

sysctl -w net.ipv4.tcp_fastopen=3
smc_run nginx

The statistics show:

+-----------+---------------+---------------+----------------+--------------+----------------+---------------+---------------+
|type|args  |   -c1 -t1     |   -c2 -t1     |   -c5 -t1      |  -c10 -t1    |   -c200 -t1    |  -c200 -t4    |  -c2000 -t8   |
+-----------+---------------+---------------+----------------+--------------+----------------+---------------+---------------+
|net-next   |   4188.5qps   |   5942.04qps  |   7621.81qps   |  7678.62qps  |   8204.94qps   |  8457.57qps   |  5687.60qps   |
+-----------+---------------+---------------+----------------+--------------+----------------+---------------+---------------+
|SMC-2RTT   |   4730.17qps  |   7394.85qps  |   11532.78qps  |  12016.22qps |   11520.81qps  |  11391.36qps  |  10364.41qps  |
+-----------+---------------+---------------+----------------+--------------+----------------+---------------+---------------+
|SMC-1RTT   |   5702.77qps  |   9645.18qps  |   11899.20qps  |  12005.16qps |   11536.67qps  |  11420.87qps  |  10392.4qps   |
+-----------+---------------+---------------+----------------+--------------+----------------+---------------+---------------+
|TCP        |   6415.74qps  |   11034.10qps |   16716.21qps  |  22217.06qps |   35926.74qps  |  117460.qps   |  120291.16qps |
+-----------+---------------+---------------+----------------+--------------+----------------+---------------+---------------+

It can clearly be seen that:

1. In the one-at-a-time short-connection scenario ( -c1 -t1 ), SMC-R
after optimization reaches 88% of TCP. Many implementation details can
still be optimized; we hope to bring SMC to 90% of TCP in this
scenario.

2. The problem is severe in the multi-threaded, multi-connection
scenarios: in the worst case SMC reaches only about 10% of TCP.
Although SMC-1RTT brings some improvement here, the bottleneck clearly
lies elsewhere. We are prototyping a solution and hope to reach 60% of
TCP in these scenarios; SMC-1RTT is an important prerequisite, because
it sets the upper limit for the subsequent optimization.
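
The percentages can be re-derived from the table with a one-line
helper. This is only a convenience sketch; the name pct_of is
hypothetical.

```c
/* Throughput of a variant as a percentage of a reference value. */
double pct_of(double qps, double reference_qps)
{
	return 100.0 * qps / reference_qps;
}
```

For example, pct_of(5702.77, 6415.74) gives roughly 88.9 for SMC-1RTT
against TCP at -c1 -t1, and pct_of(10392.4, 120291.16) gives roughly
8.6 at -c2000 -t8.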

This patch set is only a simple prototype, just enough to show that
SMC-1RTT can work.

We sincerely look forward to your comments; please let us know if you
have any suggestions.

Thanks.

Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
 net/smc/af_smc.c   | 72 ++++++++++++++++++++++++++++++++++++++++++------------
 net/smc/smc.h      |  8 ++++++
 net/smc/smc_clc.c  | 32 ++++++++++++++++++++----
 net/smc/smc_core.c |  2 ++
 net/smc/smc_pnet.c |  4 +--
 net/smc/smc_pnet.h |  3 +++
 6 files changed, 98 insertions(+), 23 deletions(-)

Comments

Tony Lu May 24, 2022, 7:49 a.m. UTC | #1
On Tue, May 24, 2022 at 02:52:07PM +0800, D. Wythe wrote:
> From: "D. Wythe" <alibuda@linux.alibaba.com>
> 
> Hi Karsten,
> 
> We are promoting SMC-R to the field of cloud computing, dues to the
> particularity of business on the cloud, the scale and the types of
> customer applications are unpredictable. As a participant of SMC-R, we
> also hope that SMC-R can cover more application scenarios. Therefore,
> many connection problems are exposed during this time. There are two
> main issue, one is that the establishment of a single connection takes
> longer than that of the TCP, another is that the degree of concurrency
> is low under multi-connection processing. This patch set is mainly
> optimized for the first issue, and the follow-up of the second issue
> will be synchronized in the future.
> 
> In terms of communication process, under current implement, a TCP
> three-way handshake only needs 1-RTT time, while SMC-R currently
> requires 4-RTT times, including 2-RTT over IP(TCP handshake, SMC
> proposal & accept ) and 2-RTT over IB ( two times RKEY exchange), which
> is most influential factor affecting connection established time at the
> moment.
> 
> We have noticed that single network interface card is mainstream on the
> cloud, dues to the advantages of cloud deployment costs and the cloud's
> own disaster recovery support. On the other hand, the emergence of RoCE
> LAG technology makes us no longer need to deal with multiple RDMA
> network interface cards by ourselves,  just like NIC bonding does. In
> Alibaba, Roce LAG is widely used for RDMA.

I think it is an interesting question whether we need SMC-level link
redundancy. I agree that RoCE LAG and the RDMA offerings of cloud
vendors handle redundancy and failover in the lower layer, transparently
for SMC.

So, moving on: if an RDMA device has redundancy capability, we could
make SMC simpler by providing an option to user space, or by deciding
based on the device capability (if such a flag exists). This lets the
underlying layer ensure the reliability of the link group.

As RFC 7609 mentions, extra work is needed for reliability, namely
adding a link. This should be optional if the device itself provides
redundancy, which makes the link group simpler and faster (the
so-called SMC-2RTT in this RFC).

I also notice that RFC 7609 was released in August 2015, which predates
RoCE LAG. RoCE LAG has been provided since ConnectX-3/ConnectX-3 Pro in
kernel 4.0 and became available in 2017. The same goes for cloud
vendors' RDMA adapters, such as the Alibaba Elastic RDMA adapter in [1].

Given that, I propose making the second link optional for newly created
link groups. Also, if possible, RFC 7609 could be updated or extended
for this present-day case.

Looking forward to your replies, Karsten, D. Wythe and folks.

[1] https://lore.kernel.org/linux-rdma/20220523075528.35017-1-chengyou@linux.alibaba.com/

Thanks,
Tony Lu
 
> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
> index 1a556f4..bf646d1 100644
> --- a/net/smc/af_smc.c
> +++ b/net/smc/af_smc.c
> @@ -492,7 +492,7 @@ static int smcr_lgr_reg_rmbs(struct smc_link *link,
>  			     struct smc_buf_desc *rmb_desc)
>  {
>  	struct smc_link_group *lgr = link->lgr;
> -	int i, rc = 0;
> +	int i, lnk = 0, rc = 0;
>  
>  	rc = smc_llc_flow_initiate(lgr, SMC_LLC_FLOW_RKEY);
>  	if (rc)
> @@ -507,14 +507,20 @@ static int smcr_lgr_reg_rmbs(struct smc_link *link,
>  		rc = smcr_link_reg_rmb(&lgr->lnk[i], rmb_desc);
>  		if (rc)
>  			goto out;
> +		/* available link count inc */
> +		lnk++;
>  	}
>  
> -	/* exchange confirm_rkey msg with peer */
> -	rc = smc_llc_do_confirm_rkey(link, rmb_desc);
> -	if (rc) {
> -		rc = -EFAULT;
> -		goto out;
> +	/* do not exchange confirm_rkey msg since there are only one link */
> +	if (lnk > 1) {
> +		/* exchange confirm_rkey msg with peer */
> +		rc = smc_llc_do_confirm_rkey(link, rmb_desc);
> +		if (rc) {
> +			rc = -EFAULT;
> +			goto out;
> +		}
>  	}
> +
>  	rmb_desc->is_conf_rkey = true;
>  out:
>  	mutex_unlock(&lgr->llc_conf_mutex);
> @@ -932,6 +938,31 @@ static int smc_find_rdma_device(struct smc_sock *smc, struct smc_init_info *ini)
>  	return 0;
>  }
>  
> +/* just prototype code
> + * since tcp connect has not happen, using route to perform smc_pnet_find_roce_by_pnetid
> + */
> +static int smc_find_rdma_device_with_dst(struct smc_sock *smc, struct smc_init_info *ini)
> +{
> +	struct sock *tsk = smc->clcsock->sk;
> +	struct rtable *rt;
> +
> +	rt = ip_route_output(sock_net(tsk), smc->remote_address.v4.sin_addr.s_addr,
> +			     0, 0, 0);
> +
> +	if (IS_ERR(rt))
> +		return -ECONNRESET;
> +
> +	smc_pnet_find_roce_by_pnetid(rt->dst.dev, ini);
> +	__builtin_prefetch(&ini->ib_dev->mac[ini->ib_port - 1]);
> +
> +	if (!ini->check_smcrv2 && !ini->ib_dev)
> +		return SMC_CLC_DECL_NOSMCRDEV;
> +	if (ini->check_smcrv2 && !ini->smcrv2.ib_dev_v2)
> +		return SMC_CLC_DECL_NOSMCRDEV;
> +
> +	return 0;
> +}
> +
>  /* check if there is an ISM device available for this connection. */
>  /* called for connect and listen */
>  static int smc_find_ism_device(struct smc_sock *smc, struct smc_init_info *ini)
> @@ -1019,13 +1050,17 @@ static int smc_find_proposal_devices(struct smc_sock *smc,
>  
>  	/* check if there is an rdma device available */
>  	if (!(ini->smcr_version & SMC_V1) ||
> -	    smc_find_rdma_device(smc, ini))
> +	    smc_find_rdma_device_with_dst(smc, ini))
>  		ini->smcr_version &= ~SMC_V1;
>  	/* else RDMA is supported for this connection */
>  
>  	ini->smc_type_v1 = smc_indicated_type(ini->smcd_version & SMC_V1,
>  					      ini->smcr_version & SMC_V1);
>  
> +	/* just prototype, do this for simple */
> +	ini->smc_type_v2 = SMC_TYPE_N;
> +	return rc;
> +
>  	/* check if there is an ism v2 device available */
>  	if (!(ini->smcd_version & SMC_V2) ||
>  	    !smc_ism_is_v2_capable() ||
> @@ -1492,11 +1527,7 @@ static void smc_connect_work(struct work_struct *work)
>  		smc->sk.sk_err = smc->clcsock->sk->sk_err;
>  	} else if ((1 << smc->clcsock->sk->sk_state) &
>  					(TCPF_SYN_SENT | TCPF_SYN_RECV)) {
> -		rc = sk_stream_wait_connect(smc->clcsock->sk, &timeo);
> -		if ((rc == -EPIPE) &&
> -		    ((1 << smc->clcsock->sk->sk_state) &
> -					(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)))
> -			rc = 0;
> +		rc = 0;
>  	}
>  	release_sock(smc->clcsock->sk);
>  	lock_sock(&smc->sk);
> @@ -1580,9 +1611,10 @@ static int smc_connect(struct socket *sock, struct sockaddr *addr,
>  		rc = -EALREADY;
>  		goto out;
>  	}
> -	rc = kernel_connect(smc->clcsock, addr, alen, flags);
> -	if (rc && rc != -EINPROGRESS)
> -		goto out;
> +
> +	/* copy remote address backup */
> +	memcpy(&smc->remote_address.ss, addr, alen);
> +	rc = -EINPROGRESS;
>  
>  	if (smc->use_fallback) {
>  		sock->state = rc ? SS_CONNECTING : SS_CONNECTED;
> @@ -2452,9 +2484,17 @@ static int smc_listen(struct socket *sock, int backlog)
>  {
>  	struct sock *sk = sock->sk;
>  	struct smc_sock *smc;
> -	int rc;
> +	int rc, val;
>  
>  	smc = smc_sk(sk);
> +
> +	/* enable server clcsock tcp fastopen.
> +	 * just a proto type code, magic number 5 for no reason
> +	 */
> +	val = 5;
> +	smc->clcsock->ops->setsockopt(smc->clcsock, SOL_TCP,
> +				      TCP_FASTOPEN, KERNEL_SOCKPTR(&val), sizeof(val));
> +
>  	lock_sock(sk);
>  
>  	rc = -EINVAL;
> diff --git a/net/smc/smc.h b/net/smc/smc.h
> index 5ed765e..ef18894 100644
> --- a/net/smc/smc.h
> +++ b/net/smc/smc.h
> @@ -261,6 +261,14 @@ struct smc_sock {				/* smc sock container */
>  	int			fallback_rsn;	/* reason for fallback */
>  	u32			peer_diagnosis; /* decline reason from peer */
>  	atomic_t                queued_smc_hs;  /* queued smc handshakes */
> +
> +	union {
> +		struct sockaddr		addr;
> +		struct sockaddr_in	v4;
> +		struct sockaddr_in6	v6;
> +		struct sockaddr_storage ss;
> +	} remote_address;
> +
>  	struct inet_connection_sock_af_ops		af_ops;
>  	const struct inet_connection_sock_af_ops	*ori_af_ops;
>  						/* original af ops */
> diff --git a/net/smc/smc_clc.c b/net/smc/smc_clc.c
> index f9f3f59..f944c67 100644
> --- a/net/smc/smc_clc.c
> +++ b/net/smc/smc_clc.c
> @@ -20,6 +20,7 @@
>  #include <net/addrconf.h>
>  #include <net/sock.h>
>  #include <net/tcp.h>
> +#include <net/route.h>
>  
>  #include "smc.h"
>  #include "smc_core.h"
> @@ -486,8 +487,7 @@ static int smc_clc_prfx_set4_rcu(struct dst_entry *dst, __be32 ipv4,
>  		return -ENODEV;
>  
>  	in_dev_for_each_ifa_rcu(ifa, in_dev) {
> -		if (!inet_ifa_match(ipv4, ifa))
> -			continue;
> +		/* delete this for simple, just prototype code*/
>  		prop->prefix_len = inet_mask_len(ifa->ifa_mask);
>  		prop->outgoing_subnet = ifa->ifa_address & ifa->ifa_mask;
>  		/* prop->ipv6_prefixes_cnt = 0; already done by memset before */
> @@ -528,10 +528,10 @@ static int smc_clc_prfx_set6_rcu(struct dst_entry *dst,
>  
>  /* retrieve and set prefixes in CLC proposal msg */
>  static int smc_clc_prfx_set(struct socket *clcsock,
> +			    struct dst_entry *dst,
>  			    struct smc_clc_msg_proposal_prefix *prop,
>  			    struct smc_clc_ipv6_prefix *ipv6_prfx)
>  {
> -	struct dst_entry *dst = sk_dst_get(clcsock->sk);
>  	struct sockaddr_storage addrs;
>  	struct sockaddr_in6 *addr6;
>  	struct sockaddr_in *addr;
> @@ -802,7 +802,8 @@ int smc_clc_send_decline(struct smc_sock *smc, u32 peer_diag_info, u8 version)
>  }
>  
>  /* send CLC PROPOSAL message across internal TCP socket */
> -int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini)
> +int smc_clc_send_proposal_with_nexthop(struct smc_sock *smc,
> +				       struct dst_entry *dst, struct smc_init_info *ini)
>  {
>  	struct smc_clc_smcd_v2_extension *smcd_v2_ext;
>  	struct smc_clc_msg_proposal_prefix *pclc_prfx;
> @@ -838,7 +839,7 @@ int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini)
>  
>  	/* retrieve ip prefixes for CLC proposal msg */
>  	if (ini->smc_type_v1 != SMC_TYPE_N) {
> -		rc = smc_clc_prfx_set(smc->clcsock, pclc_prfx, ipv6_prfx);
> +		rc = smc_clc_prfx_set(smc->clcsock, dst, pclc_prfx, ipv6_prfx);
>  		if (rc) {
>  			if (ini->smc_type_v2 == SMC_TYPE_N) {
>  				kfree(pclc);
> @@ -961,6 +962,11 @@ int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini)
>  	}
>  	vec[i].iov_base = trl;
>  	vec[i++].iov_len = sizeof(*trl);
> +
> +	msg.msg_flags	|= MSG_FASTOPEN;
> +	msg.msg_name	= &smc->remote_address.addr;
> +	msg.msg_namelen = sizeof(struct sockaddr_in);
> +
>  	/* due to the few bytes needed for clc-handshake this cannot block */
>  	len = kernel_sendmsg(smc->clcsock, &msg, vec, i, plen);
>  	if (len < 0) {
> @@ -975,6 +981,22 @@ int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini)
>  	return reason_code;
>  }
>  
> +int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini)
> +{
> +	struct sock *tsk = smc->clcsock->sk;
> +	struct rtable *rt;
> +	int rc;
> +
> +	rt = ip_route_output(sock_net(tsk), smc->remote_address.v4.sin_addr.s_addr,
> +			     0, 0, 0);
> +
> +	if (IS_ERR(rt))
> +		return -ECONNRESET;
> +
> +	rc = smc_clc_send_proposal_with_nexthop(smc, &rt->dst, ini);
> +	return rc;
> +}
> +
>  /* build and send CLC CONFIRM / ACCEPT message */
>  static int smc_clc_send_confirm_accept(struct smc_sock *smc,
>  				       struct smc_clc_msg_accept_confirm_v2 *clc_v2,
> diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
> index f40f6ed..ef5e5411 100644
> --- a/net/smc/smc_core.c
> +++ b/net/smc/smc_core.c
> @@ -1765,6 +1765,8 @@ int smc_vlan_by_tcpsk(struct socket *clcsock, struct smc_init_info *ini)
>  	int rc = 0;
>  
>  	ini->vlan_id = 0;
> +	/* just for simple , prototype code */
> +	return 0;
>  	if (!dst) {
>  		rc = -ENOTCONN;
>  		goto out;
> diff --git a/net/smc/smc_pnet.c b/net/smc/smc_pnet.c
> index 7055ed1..6aa3304 100644
> --- a/net/smc/smc_pnet.c
> +++ b/net/smc/smc_pnet.c
> @@ -1064,8 +1064,8 @@ static void smc_pnet_find_rdma_dev(struct net_device *netdev,
>   * If nothing found, check pnetid table.
>   * If nothing found, try to use handshake device
>   */
> -static void smc_pnet_find_roce_by_pnetid(struct net_device *ndev,
> -					 struct smc_init_info *ini)
> +void smc_pnet_find_roce_by_pnetid(struct net_device *ndev,
> +				  struct smc_init_info *ini)
>  {
>  	u8 ndev_pnetid[SMC_MAX_PNETID_LEN];
>  	struct net *net;
> diff --git a/net/smc/smc_pnet.h b/net/smc/smc_pnet.h
> index 80a88ee..2ffaf22 100644
> --- a/net/smc/smc_pnet.h
> +++ b/net/smc/smc_pnet.h
> @@ -67,4 +67,7 @@ void smc_pnet_find_alt_roce(struct smc_link_group *lgr,
>  			    struct smc_ib_device *known_dev);
>  bool smc_pnet_is_ndev_pnetid(struct net *net, u8 *pnetid);
>  bool smc_pnet_is_pnetid_set(u8 *pnetid);
> +
> +void smc_pnet_find_roce_by_pnetid(struct net_device *ndev,
> +				  struct smc_init_info *ini);
>  #endif
> -- 
> 1.8.3.1
Alexandra Winter May 25, 2022, 1:42 p.m. UTC | #2
On 24.05.22 09:49, Tony Lu wrote:
> On Tue, May 24, 2022 at 02:52:07PM +0800, D. Wythe wrote:
>> From: "D. Wythe" <alibuda@linux.alibaba.com>
>>
>> Hi Karsten,
>>
>> We are promoting SMC-R to the field of cloud computing, dues to the
>> particularity of business on the cloud, the scale and the types of
>> customer applications are unpredictable. As a participant of SMC-R, we
>> also hope that SMC-R can cover more application scenarios. Therefore,
>> many connection problems are exposed during this time. There are two
>> main issue, one is that the establishment of a single connection takes
>> longer than that of the TCP, another is that the degree of concurrency
>> is low under multi-connection processing. This patch set is mainly
>> optimized for the first issue, and the follow-up of the second issue
>> will be synchronized in the future.
>>
>> In terms of communication process, under current implement, a TCP
>> three-way handshake only needs 1-RTT time, while SMC-R currently
>> requires 4-RTT times, including 2-RTT over IP(TCP handshake, SMC
>> proposal & accept ) and 2-RTT over IB ( two times RKEY exchange), which
>> is most influential factor affecting connection established time at the
>> moment.
>>
>> We have noticed that single network interface card is mainstream on the
>> cloud, dues to the advantages of cloud deployment costs and the cloud's
>> own disaster recovery support. On the other hand, the emergence of RoCE
>> LAG technology makes us no longer need to deal with multiple RDMA
>> network interface cards by ourselves,  just like NIC bonding does. In
>> Alibaba, Roce LAG is widely used for RDMA.
> 
> I think this is an interesting topic whether we need SMC-level link
> redundancy. I agreed with that RoCE LAG and RDMA in cloud vendors handle
> redundancy and failover in the lower layer, and do it transparently for
> SMC.
> 
> So let's move on, if a RDMA device has redundancy ability, we could make
> SMC simpler by give an option for user-space or based on the device
> capability (if we have this flag). This allows under layer to ensure the
> reliability of link group.
> 
> As RFC 7609 mentioned, we should do some extra work for reliability to
> add link. It should be an optional work if the device have capability
> for redundancy, and make link group simpler and faster (for the
> so-called SMC-2RTT in this RFC).
> 
> I also notice that RFC 7609 is released on August 2015, which is earlier
> than RoCE LAG. RoCE LAG is provided after ConnectX-3/ConnectX-3 Pro in
> kernel 4.0, and is available in 2017. And cloud vendors' RDMA adapters,
> such as Alibaba Elastic RDMA adapter in [1].
> 
> Given that, I propose whether the second link can be used as an option
> in newly created link group. Also, if it is possible, RFC 7609 can be
> updated or extend it for this nowadays case.
> 
> Looking forward for your message, Karsten, D. Wythe and folks.
> 
> [1] https://lore.kernel.org/linux-rdma/20220523075528.35017-1-chengyou@linux.alibaba.com/
> 
> Thanks,
> Tony Lu
>  
Thank you D. Wythe for your proposals, the prototype and measurements.
They sound quite promising to us.

We need to carefully evaluate them and make sure everything is compatible
with the existing implementations of SMC-D and SMC-R v1 and v2. In the
typical s390 environment RoCE LAG is probably not good enough, as the card
is still a single point of failure. So your ideas need to be compatible
with link redundancy. We also need to consider that the extension of the
protocol does not block other desirable extensions.

Your prototype is very helpful for the understanding. Before submitting any
code patches to net-next, we should agree on the details of the protocol
extension. Maybe you could formulate your proposal in plain text, so we can
discuss it here? 

We also need to inform you that several public holidays are upcoming in the
next weeks and several of our team will be out for summer vacation, so please
allow for longer response times.

Kind regards
Alexandra Winter

--------8<  snip  >8--------
D. Wythe May 26, 2022, 3:47 a.m. UTC | #3
On 2022/5/25 9:42 PM, Alexandra Winter wrote:

> Thank you D. Wythe for your proposals, the prototype and measurements.
> They sound quite promising to us.
>
> We need to carefully evaluate them and make sure everything is compatible
> with the existing implementations of SMC-D and SMC-R v1 and v2. In the
> typical s390 environment RoCE LAG is probably not good enough, as the card
> is still a single point of failure. So your ideas need to be compatible
> with link redundancy. We also need to consider that the extension of the
> protocol does not block other desirable extensions.
> 
> Your prototype is very helpful for the understanding. Before submitting any
> code patches to net-next, we should agree on the details of the protocol
> extension. Maybe you could formulate your proposal in plain text, so we can
> discuss it here?

I am very pleased to hear that your team is interested in this
proposal, and thanks a lot for your advice. We really appreciate your
point of view about compatibility. In fact, we are working on some
written drafts in which compatibility is quite an important part, and
they will be shared here soon.

> 
> We also need to inform you that several public holidays are upcoming in the
> next weeks and several of our team will be out for summer vacation, so please
> allow for longer response times.

Thanks for letting us know, that's totally okay with us. May your holidays
be full of warmth and cheer.


> Kind regards
> Alexandra Winter
> 


Thanks.
D. Wythe
Tony Lu May 26, 2022, 7:04 a.m. UTC | #4
On Wed, May 25, 2022 at 03:42:28PM +0200, Alexandra Winter wrote:
> 
> 
> On 24.05.22 09:49, Tony Lu wrote:
> > On Tue, May 24, 2022 at 02:52:07PM +0800, D. Wythe wrote:
> >> From: "D. Wythe" <alibuda@linux.alibaba.com>
> >>
> >> Hi Karsten,
> >>
> >> We are promoting SMC-R to the field of cloud computing, dues to the
> >> particularity of business on the cloud, the scale and the types of
> >> customer applications are unpredictable. As a participant of SMC-R, we
> >> also hope that SMC-R can cover more application scenarios. Therefore,
> >> many connection problems are exposed during this time. There are two
> >> main issue, one is that the establishment of a single connection takes
> >> longer than that of the TCP, another is that the degree of concurrency
> >> is low under multi-connection processing. This patch set is mainly
> >> optimized for the first issue, and the follow-up of the second issue
> >> will be synchronized in the future.
> >>
> >> In terms of communication process, under current implement, a TCP
> >> three-way handshake only needs 1-RTT time, while SMC-R currently
> >> requires 4-RTT times, including 2-RTT over IP(TCP handshake, SMC
> >> proposal & accept ) and 2-RTT over IB ( two times RKEY exchange), which
> >> is most influential factor affecting connection established time at the
> >> moment.
> >>
> >> We have noticed that single network interface card is mainstream on the
> >> cloud, dues to the advantages of cloud deployment costs and the cloud's
> >> own disaster recovery support. On the other hand, the emergence of RoCE
> >> LAG technology makes us no longer need to deal with multiple RDMA
> >> network interface cards by ourselves,  just like NIC bonding does. In
> >> Alibaba, Roce LAG is widely used for RDMA.
> > 
> > I think this is an interesting topic whether we need SMC-level link
> > redundancy. I agreed with that RoCE LAG and RDMA in cloud vendors handle
> > redundancy and failover in the lower layer, and do it transparently for
> > SMC.
> > 
> > So let's move on, if a RDMA device has redundancy ability, we could make
> > SMC simpler by give an option for user-space or based on the device
> > capability (if we have this flag). This allows under layer to ensure the
> > reliability of link group.
> > 
> > As RFC 7609 mentioned, we should do some extra work for reliability to
> > add link. It should be an optional work if the device have capability
> > for redundancy, and make link group simpler and faster (for the
> > so-called SMC-2RTT in this RFC).
> > 
> > I also notice that RFC 7609 is released on August 2015, which is earlier
> > than RoCE LAG. RoCE LAG is provided after ConnectX-3/ConnectX-3 Pro in
> > kernel 4.0, and is available in 2017. And cloud vendors' RDMA adapters,
> > such as Alibaba Elastic RDMA adapter in [1].
> > 
> > Given that, I propose whether the second link can be used as an option
> > in newly created link group. Also, if it is possible, RFC 7609 can be
> > updated or extend it for this nowadays case.
> > 
> > Looking forward for your message, Karsten, D. Wythe and folks.
> > 
> > [1] https://lore.kernel.org/linux-rdma/20220523075528.35017-1-chengyou@linux.alibaba.com/
> > 
> > Thanks,
> > Tony Lu
> >  
> Thank you D. Wythe for your proposals, the prototype and measurements.
> They sound quite promising to us.
> 
> We need to carefully evaluate them and make sure everything is compatible
> with the existing implementations of SMC-D and SMC-R v1 and v2. In the
> typical s390 environment ROCE LAG is propably not good enough, as the card
> is still a single point of failure. So your ideas need to be compatible
> with link redundancy. We also need to consider that the extension of the
> protocol does not block other desirable extensions.
> 
> Your prototype is very helpful for the understanding. Before submitting any
> code patches to net-next, we should agree on the details of the protocol
> extension. Maybe you could formulate your proposal in plain text, so we can
> discuss it here? 
> 
> We also need to inform you that several public holidays are upcoming in the
> next weeks and several of our team will be out for summer vacation, so please
> allow for longer response times.
> 
> Kind regards
> Alexandra Winter
> 
We are glad to hear this. It gives us a lot of confidence to keep
working on it, thank you.

Cheers,
Tony Lu
D. Wythe June 1, 2022, 6:33 a.m. UTC | #5
On 2022/5/25 9:42 PM, Alexandra Winter wrote:

> We need to carefully evaluate them and make sure everything is compatible
> with the existing implementations of SMC-D and SMC-R v1 and v2. In the
> typical s390 environment ROCE LAG is propably not good enough, as the card
> is still a single point of failure. So your ideas need to be compatible
> with link redundancy. We also need to consider that the extension of the
> protocol does not block other desirable extensions.
> 
> Your prototype is very helpful for the understanding. Before submitting any
> code patches to net-next, we should agree on the details of the protocol
> extension. Maybe you could formulate your proposal in plain text, so we can
> discuss it here?
> 
> We also need to inform you that several public holidays are upcoming in the
> next weeks and several of our team will be out for summer vacation, so please
> allow for longer response times.
> 
> Kind regards
> Alexandra Winter
> 

Hi all,

In order to achieve single-link compatibility, we must
complete at least one negotiation. We wish to provide
higher scalability while meeting this requirement. There are
a few ways to reach this.

1. Use the available reserved bits. According to
the SMC v2 protocol, at least 28 reserved octets are available
in the PROPOSAL MESSAGE and at least 10 reserved octets in
the ACCEPT MESSAGE. We can define part of them as a feature
area that works like a bitmap. Considering subsequent
scalability, we MAY use at least 2 reserved octets, which can support
negotiation of at least 16 features.

2. Unify all the areas named extension in the current
SMC v2 protocol spec, without reinterpreting any existing field
or changing any field offset, including 'PROPOSAL V1 IP Subnet Extension',
'PROPOSAL V2 Extension', 'PROPOSAL SMC-DV2 EXTENSION', etc., and provide
the ability to grow dynamically as needs expand. This scheme will use
at least 10 reserved octets in the PROPOSAL MESSAGE and at least 4
reserved octets in the ACCEPT MESSAGE and CONFIRM MESSAGE. Fortunately, we
only need to use reserved fields, and the current reserved fields are
sufficient. We can then easily add a new extension named SINGLE
LINK. Limited by space, the details will be elaborated after the scheme
is finalized.

But no matter what scheme is finalized, the workflow should be similar to:

Allow Single-link:

client							    server
	proposal with Single-link feature bit or extension
		-------->

	accept with Single-link feature bit or extension
		<--------
		
		confirm
		-------->


Deny or not recognized:

client							     server
	proposal with Single-link feature bit or extension
		-------->

		rkey confirm
		<------
		------>

	accept without Single-link feature bit or extension
		<------

		rkey confirm
		------->
		<------
		
		confirm
		------->


Look forward to your advice and comments.

Thanks.
Tony Lu June 1, 2022, 9:24 a.m. UTC | #6
On Wed, Jun 01, 2022 at 02:33:09PM +0800, D. Wythe wrote:
> 
> On 2022/5/25 9:42 PM, Alexandra Winter wrote:
> 
> > We need to carefully evaluate them and make sure everything is compatible
> > with the existing implementations of SMC-D and SMC-R v1 and v2. In the
> > typical s390 environment ROCE LAG is propably not good enough, as the card
> > is still a single point of failure. So your ideas need to be compatible
> > with link redundancy. We also need to consider that the extension of the
> > protocol does not block other desirable extensions.
> > 
> > Your prototype is very helpful for the understanding. Before submitting any
> > code patches to net-next, we should agree on the details of the protocol
> > extension. Maybe you could formulate your proposal in plain text, so we can
> > discuss it here?
> > 
> > We also need to inform you that several public holidays are upcoming in the
> > next weeks and several of our team will be out for summer vacation, so please
> > allow for longer response times.
> > 
> > Kind regards
> > Alexandra Winter
> > 
> 
> Hi alls,
> 
> In order to achieve signle-link compatibility, we must
> complete at least once negotiation. We wish to provide
> higher scalability while meeting this feature. There are
> few ways to reach this.
> 
> 1. Use the available reserved bits. According to
> the SMC v2 protocol, there are at least 28 reserved octets
> in PROPOSAL MESSAGE and at least 10 reserved octets in
> ACCEPT MESSAGE are available. We can define an area in which
> as a feature area, works like bitmap. Considering the subsequent
> scalability, we MAY use at least 2 reserved ctets, which can support
> negotiation of at least 16 features.
> 
> 2. Unify all the areas named extension in current
> SMC v2 protocol spec without reinterpreting any existing field
> and field offset changes, including 'PROPOSAL V1 IP Subnet Extension',
> 'PROPOSAL V2 Extension', 'PROPOSAL SMC-DV2 EXTENSION' .etc. And provides
> the ability to grow dynamically as needs expand. This scheme will use
> at least 10 reserved octets in the PROPOSAL MESSAGE and at least 4 reserved
> octets in ACCEPT MESSAGE and CONFIRM MESSAGE. Fortunately, we only need to
> use reserved fields, and the current reserved fields are sufficient. And
> then we can easily add a new extension named SIGNLE LINK. Limited by space,
> the details will be elaborated after the scheme is finalized.

After reading this and the latest version of the protocol, I agree with
the idea of providing a more flexible extension facility. It is also a
good chance for us to sit here and talk about the protocol extension.

There are some potential scenarios that need flexible extensions in my
mind:
- support for other protocols, such as iWARP / IB, or a new protocol version,
- dozens of feature flags in the future, like this proposal. As new
  features accumulate, they could overflow the bitmap.

Actually, this extension facility is very similar to TCP options.

So what is your opinion on this? If there are existing approaches for
future extensions, maybe this can be folded into them. Or we can start
a discussion about this as this mail mentioned.

Also, I am wondering if there is a plan to update RFC 7609 to add the
latest v2 support?

Thanks,
Tony Lu
Alexandra Winter June 1, 2022, 11:35 a.m. UTC | #7
On 01.06.22 11:24, Tony Lu wrote:
> On Wed, Jun 01, 2022 at 02:33:09PM +0800, D. Wythe wrote:
>>
>> On 2022/5/25 9:42 PM, Alexandra Winter wrote:
>>
>>> We need to carefully evaluate them and make sure everything is compatible
>>> with the existing implementations of SMC-D and SMC-R v1 and v2. In the
>>> typical s390 environment ROCE LAG is propably not good enough, as the card
>>> is still a single point of failure. So your ideas need to be compatible
>>> with link redundancy. We also need to consider that the extension of the
>>> protocol does not block other desirable extensions.
>>>
>>> Your prototype is very helpful for the understanding. Before submitting any
>>> code patches to net-next, we should agree on the details of the protocol
>>> extension. Maybe you could formulate your proposal in plain text, so we can
>>> discuss it here?
>>>
>>> We also need to inform you that several public holidays are upcoming in the
>>> next weeks and several of our team will be out for summer vacation, so please
>>> allow for longer response times.
>>>
>>> Kind regards
>>> Alexandra Winter
>>>
>>
>> Hi alls,
>>
>> In order to achieve signle-link compatibility, we must
>> complete at least once negotiation. We wish to provide
>> higher scalability while meeting this feature. There are
>> few ways to reach this.
>>
>> 1. Use the available reserved bits. According to
>> the SMC v2 protocol, there are at least 28 reserved octets
>> in PROPOSAL MESSAGE and at least 10 reserved octets in
>> ACCEPT MESSAGE are available. We can define an area in which
>> as a feature area, works like bitmap. Considering the subsequent
>> scalability, we MAY use at least 2 reserved ctets, which can support
>> negotiation of at least 16 features.
>>
>> 2. Unify all the areas named extension in current
>> SMC v2 protocol spec without reinterpreting any existing field
>> and field offset changes, including 'PROPOSAL V1 IP Subnet Extension',
>> 'PROPOSAL V2 Extension', 'PROPOSAL SMC-DV2 EXTENSION' .etc. And provides
>> the ability to grow dynamically as needs expand. This scheme will use
>> at least 10 reserved octets in the PROPOSAL MESSAGE and at least 4 reserved
>> octets in ACCEPT MESSAGE and CONFIRM MESSAGE. Fortunately, we only need to
>> use reserved fields, and the current reserved fields are sufficient. And
>> then we can easily add a new extension named SIGNLE LINK. Limited by space,
>> the details will be elaborated after the scheme is finalized.
> 
> After reading this and latest version of protocol, I agree with that the
> idea to provide a more flexible extension facilities. And, it's a good
> chance for us to set here talking about the protocol extension.
> 
> There are some potential scenarios that need flexible extensions in my
> mind:
> - other protocols support, such as iWARP / IB or new version protocol,
> - dozens of feature flags in the future, like this proposal. With the
>   growth of new feature, it could overflow bitmap.
> 
> Actually, this extension facilities are very similar to TCP options.
> 
> So what about your opinions about the solution of this? If there are
> some existed approaches for the future extensions, maybe this can get
> involved in it. Or we can start a discuss about this as this mail
> mentioned.
> 
> Also, I am wondering if there is plan to update the RFC7609, add the
> latest v2 support?
> 
> Thanks,
> Tony Lu

We have asked the SMC protocol owners for their opinion on using the
reserved fields for new options in particular, and on where and how to
discuss this in general (including where to document the versions).
Please allow some time for us to come back to you.

Kind regards
Alexandra
D. Wythe June 2, 2022, 3:26 a.m. UTC | #8
On Wed, Jun 01, 2022 at 01:35:52PM +0200, Alexandra Winter wrote:
> 
> 
> On 01.06.22 11:24, Tony Lu wrote:
> > On Wed, Jun 01, 2022 at 02:33:09PM +0800, D. Wythe wrote:
> >>
> >> On 2022/5/25 9:42 PM, Alexandra Winter wrote:
> >>
> >>> We need to carefully evaluate them and make sure everything is compatible
> >>> with the existing implementations of SMC-D and SMC-R v1 and v2. In the
> >>> typical s390 environment ROCE LAG is propably not good enough, as the card
> >>> is still a single point of failure. So your ideas need to be compatible
> >>> with link redundancy. We also need to consider that the extension of the
> >>> protocol does not block other desirable extensions.
> >>>
> >>> Your prototype is very helpful for the understanding. Before submitting any
> >>> code patches to net-next, we should agree on the details of the protocol
> >>> extension. Maybe you could formulate your proposal in plain text, so we can
> >>> discuss it here?
> >>>
> >>> We also need to inform you that several public holidays are upcoming in the
> >>> next weeks and several of our team will be out for summer vacation, so please
> >>> allow for longer response times.
> >>>
> >>> Kind regards
> >>> Alexandra Winter
> >>>
> >>
> >> Hi alls,
> >>
> >> In order to achieve signle-link compatibility, we must
> >> complete at least once negotiation. We wish to provide
> >> higher scalability while meeting this feature. There are
> >> few ways to reach this.
> >>
> >> 1. Use the available reserved bits. According to
> >> the SMC v2 protocol, there are at least 28 reserved octets
> >> in PROPOSAL MESSAGE and at least 10 reserved octets in
> >> ACCEPT MESSAGE are available. We can define an area in which
> >> as a feature area, works like bitmap. Considering the subsequent
> >> scalability, we MAY use at least 2 reserved ctets, which can support
> >> negotiation of at least 16 features.
> >>
> >> 2. Unify all the areas named extension in current
> >> SMC v2 protocol spec without reinterpreting any existing field
> >> and field offset changes, including 'PROPOSAL V1 IP Subnet Extension',
> >> 'PROPOSAL V2 Extension', 'PROPOSAL SMC-DV2 EXTENSION' .etc. And provides
> >> the ability to grow dynamically as needs expand. This scheme will use
> >> at least 10 reserved octets in the PROPOSAL MESSAGE and at least 4 reserved
> >> octets in ACCEPT MESSAGE and CONFIRM MESSAGE. Fortunately, we only need to
> >> use reserved fields, and the current reserved fields are sufficient. And
> >> then we can easily add a new extension named SIGNLE LINK. Limited by space,
> >> the details will be elaborated after the scheme is finalized.
> > 
> > After reading this and latest version of protocol, I agree with that the
> > idea to provide a more flexible extension facilities. And, it's a good
> > chance for us to set here talking about the protocol extension.
> > 
> > There are some potential scenarios that need flexible extensions in my
> > mind:
> > - other protocols support, such as iWARP / IB or new version protocol,
> > - dozens of feature flags in the future, like this proposal. With the
> >   growth of new feature, it could overflow bitmap.
> > 
> > Actually, this extension facilities are very similar to TCP options.
> > 
> > So what about your opinions about the solution of this? If there are
> > some existed approaches for the future extensions, maybe this can get
> > involved in it. Or we can start a discuss about this as this mail
> > mentioned.
> > 
> > Also, I am wondering if there is plan to update the RFC7609, add the
> > latest v2 support?
> > 
> > Thanks,
> > Tony Lu
> 
> We have asked the SMC protocol owners about their opinion about using the
> reserved fields for new options in particular, and about where and how to
> discuss this in general. (including where to document the versions).
> Please allow some time for us to come back to you.
> 
> Kind regards
> Alexandra

Thank you for the information. Before we officially push the document update,
if you have any suggestions for the two schemes we mentioned above,
or a preference for one of them, please keep us informed.

Best wishes.
D. Wythe
D. Wythe June 16, 2022, 1:49 p.m. UTC | #9
On 2022/6/1 2:33 PM, D. Wythe wrote:
> 
> On 2022/5/25 9:42 PM, Alexandra Winter wrote:
> 
>> We need to carefully evaluate them and make sure everything is compatible
>> with the existing implementations of SMC-D and SMC-R v1 and v2. In the
>> typical s390 environment ROCE LAG is propably not good enough, as the card
>> is still a single point of failure. So your ideas need to be compatible
>> with link redundancy. We also need to consider that the extension of the
>> protocol does not block other desirable extensions.
>>
>> Your prototype is very helpful for the understanding. Before submitting any
>> code patches to net-next, we should agree on the details of the protocol
>> extension. Maybe you could formulate your proposal in plain text, so we can
>> discuss it here?
>>
>> We also need to inform you that several public holidays are upcoming in the
>> next weeks and several of our team will be out for summer vacation, so please
>> allow for longer response times.
>>
>> Kind regards
>> Alexandra Winter
>>
> 
> Hi alls,
> 
> In order to achieve signle-link compatibility, we must
> complete at least once negotiation. We wish to provide
> higher scalability while meeting this feature. There are
> few ways to reach this.
> 
> 1. Use the available reserved bits. According to
> the SMC v2 protocol, there are at least 28 reserved octets
> in PROPOSAL MESSAGE and at least 10 reserved octets in
> ACCEPT MESSAGE are available. We can define an area in which
> as a feature area, works like bitmap. Considering the subsequent scalability, we MAY use at least 2 reserved ctets, which can support negotiation of at least 16 features.
> 
> 2. Unify all the areas named extension in current
> SMC v2 protocol spec without reinterpreting any existing field
> and field offset changes, including 'PROPOSAL V1 IP Subnet Extension',
> 'PROPOSAL V2 Extension', 'PROPOSAL SMC-DV2 EXTENSION' .etc. And provides
> the ability to grow dynamically as needs expand. This scheme will use
> at least 10 reserved octets in the PROPOSAL MESSAGE and at least 4 reserved octets in ACCEPT MESSAGE and CONFIRM MESSAGE. Fortunately, we only need to use reserved fields, and the current reserved fields are sufficient. And then we can easily add a new extension named SIGNLE LINK. Limited by space, the details will be elaborated after the scheme is finalized.
> 
> But no matter what scheme is finalized, the workflow should be similar to:
> 
> Allow Single-link:
> 
> client                                server
>      proposal with Single-link feature bit or extension
>          -------->
> 
>      accept with Single-link feature bit extension
>          <--------
> 
>          confirm
>          -------->
> 
> 
> Deny or not recognized:
> 
> client                                 server
>      proposal with Single-link feature bit or extension
>          -------->
> 
>          rkey confirm
>          <------
>          ------>
> 
>      accept without Single-link feature bit or extension
>          <------
> 
>          rkey confirm
>          ------->
>          <------
> 
>          confirm
>          ------->
> 
> 
> Look forward to your advice and comments.
> 
> Thanks.

Hi all,

Building on the previous proposal, if we can put the application data in the
PROPOSAL message, we can achieve SMC 0-RTT. The process should be similar to
the following:

client									server
	PROPOSAL MESSAGE
		with first contact
		with 0RTT query extension
		-------->

	ACCEPT MESSAGE
			with(or without)
			0RTT response extension
		<--------

	CONFIRM MESSAGE
		-------->

client									server
	PROPOSAL MESSAGE
		without	first contact
		with 0RTT Data
		-------->

	ACCEPT MESSAGE
		<---------

	CONFIRM MESSAGE
		-------->

If so, using reserved bits to exchange features is not enough. We have a simple
design that stays compatible with legacy extensions and supports future extensions.

This draft tries to unify all the areas named extension in the current
SMC v2 protocol spec, including 'PROPOSAL V1 IP Subnet Extension',
'PROPOSAL V2 Extension', 'PROPOSAL SMC-DV2 EXTENSION',
and 'First Contact Extension'.

This draft makes many compromises in order to achieve compatibility.
I believe there must be better ways. Let me get the ball rolling, and please let
me know if you have any suggestions or better ideas. The draft design
is as follows:

SMC V2 CLC PROPOSAL MESSAGE:

+------+-------+------------------------------------------------------------+
|0     |50     |Not changed                                                 |
+------+-------+------------------------------------------------------------+
|50    |2      |SMC Version 2 Extension Offset(applicable when SMC V2)      |
+------+-------+------------------------------------------------------------+
|52    |19     |Reserved for growth                                         |
+------+-------+------------------------------------------------------------+
|71    |*      |Extension Area  (reserved before)                           |
+------+-------+------------------------------------------------------------+
|71    |2      |number of Extensions  (reserved before)                     |
+------+-------+------------------------------------------------------------+
|73    |7      |V1 IP Subnet Extension Header (when applicable)             |
+------+-------+------------------------------------------------------------+
|73    |7      |Padding Extension (when V1 IP Subnet Extension not present) |
+------+-------+------------------------------------------------------------+
|80    |*      |V1 IP Subnet Extension Payload (when applicable)            |
+------+-------+------------------------------------------------------------+
|      |       |V2 Extension (when applicable)                              |
+------+-------+------------------------------------------------------------+
|      |       |other available Extension (when applicable)                 |
+------+-------+------------------------------------------------------------+
|*     |4      |Eye catcher ‘SMCR’ (EBCDIC) message end                     |
+------+-------+------------------------------------------------------------+

Notes:

     1. In the current implementation, the server reads the proposal message
with a fixed length; areas beyond that length are silently ignored, and the
server gives up checking the eye catcher. Therefore, it is safe to extend the
message from the tail.

     2. (reserved before) means that the area used to be reserved.

     3. None of the existing fields have their offsets changed
within the PROPOSAL message.
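
As a quick sanity check of the table above, the listed offsets and lengths
can be written down as constants and checked for contiguity at compile time.
This is only a sketch: the names are made up here, and the numbers are
copied from the draft table, not from any existing kernel header. The
extension count at offset 71 is taken to be the first two octets of the
Extension Area, as the table suggests.

```c
#include <assert.h>

/* Offsets and lengths copied from the draft table; names are hypothetical. */
enum {
	PROP_V2_EXT_OFFSET_OFF = 50, PROP_V2_EXT_OFFSET_LEN = 2,
	PROP_GROWTH_OFF        = 52, PROP_GROWTH_LEN        = 19,
	PROP_EXT_COUNT_OFF     = 71, PROP_EXT_COUNT_LEN     = 2,
	PROP_EXT_HDR_OFF       = 73, PROP_EXT_HDR_LEN       = 7,
	PROP_V1_PAYLOAD_OFF    = 80,
};

/* First octet after a field that starts at off and spans len octets. */
static int prop_field_end(int off, int len)
{
	return off + len;
}

/* Each field must start exactly where the previous one ends. */
static_assert(PROP_V2_EXT_OFFSET_OFF + PROP_V2_EXT_OFFSET_LEN == PROP_GROWTH_OFF,
	      "reserved area follows the V2 extension offset");
static_assert(PROP_GROWTH_OFF + PROP_GROWTH_LEN == PROP_EXT_COUNT_OFF,
	      "extension area follows the reserved area");
static_assert(PROP_EXT_COUNT_OFF + PROP_EXT_COUNT_LEN == PROP_EXT_HDR_OFF,
	      "first extension header follows the extension count");
static_assert(PROP_EXT_HDR_OFF + PROP_EXT_HDR_LEN == PROP_V1_PAYLOAD_OFF,
	      "V1 IP subnet payload keeps its old offset of 80");
```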


Extension Areas Format:

+------+-------+-----------------------+
|0     |*      |Extensions Area        |
+------+-------+-----------------------+
|0     |2      |Number of Extensions   |
+------+-------+-----------------------+
|2     |*      |Extensions             |
+------+-------+-----------------------+
|      |       |End of Extensions Area |
+------+-------+-----------------------+

notes:

     1. All extensions within the extension areas should be contiguous.


Extension Format:

+------+-------+----------------------------------------+
|0     |*      |Extension                               |
+------+-------+----------------------------------------+
|0     |6+     |Extension header                        |
+------+-------+----------------------------------------+
|0     |4      |reserved                                |
+------+-------+----------------------------------------+
|2     |*      |Extension Type (variable length)        |
+------+-------+----------------------------------------+
|*     |*      |Payload Length (variable length)        |
+------+-------+----------------------------------------+
|*     |*      |payload                                 |
+------+-------+----------------------------------------+

notes:

     1. This scheme was specially designed to be compatible with
'PROPOSAL V2 Extension', since it is the only extension with no
reserved octets ahead of it.

     2. Another special case is 'PROPOSAL SMC-DV2 EXTENSION'. It also
has no reserved octets ahead of it, but it can be treated as an
optional part of 'PROPOSAL V2 Extension'.

     3. To be compatible with 'PROPOSAL V2 Extension', there are only
2 reserved octets left to place the type and length fields. With one octet
per field, there can be at most 255 extension types and
a maximum length of 255. For better scalability, the type and length
fields are encoded as variable-length integers.

variable length integer encoding:

+--------------+-------+---------------+--------+
|first bit     |octet  |Usable Bits    |Range   |
+--------------+-------+---------------+--------+
|0             |1      |7              |0-127   |
+--------------+-------+---------------+--------+
|1             |2      |15             |0-32767 |
+--------------+-------+---------------+--------+

notes:

     1. This design introduces some complexity and we can totally give it
up if we do not need more than 255 extensions at all.
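
To make the encoding table concrete, here is a minimal sketch of an encoder
and decoder for it. The function names are hypothetical, and the draft does
not state a byte order for the two-octet form; this sketch assumes the high
bits come first (network order).

```c
#include <stddef.h>
#include <stdint.h>

#define SMC_VARINT_MAX 0x7fffu	/* 15 usable bits, see the table above */

/*
 * Encode v into out. Returns the number of octets written (1 or 2),
 * or 0 if v is out of range or the buffer is too small.
 */
static size_t smc_varint_encode(uint16_t v, uint8_t *out, size_t cap)
{
	if (v <= 0x7f) {
		if (cap < 1)
			return 0;
		out[0] = (uint8_t)v;		/* first bit 0: one octet */
		return 1;
	}
	if (v > SMC_VARINT_MAX || cap < 2)
		return 0;
	out[0] = (uint8_t)(0x80 | (v >> 8));	/* first bit 1: two octets */
	out[1] = (uint8_t)(v & 0xff);
	return 2;
}

/*
 * Decode one integer from in. Returns the number of octets consumed
 * (1 or 2), or 0 if the input is truncated.
 */
static size_t smc_varint_decode(const uint8_t *in, size_t len, uint16_t *v)
{
	if (len < 1)
		return 0;
	if (!(in[0] & 0x80)) {
		*v = in[0];
		return 1;
	}
	if (len < 2)
		return 0;
	*v = (uint16_t)(((in[0] & 0x7f) << 8) | in[1]);
	return 2;
}
```

Values up to 127 cost a single octet, so type and length together still fit
in the 2 octets left in 'PROPOSAL V2 Extension' as long as both stay small.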

V1 IP Subnet Extension Format:

+------+-------+-------------------------------------------------+
|0     |7      |Extension Header                                 |
+------+-------+-------------------------------------------------+
|0     |4      |Reserved                                         |
+------+-------+-------------------------------------------------+
|4     |1      |Extension type(0x2)                              |
+------+-------+-------------------------------------------------+
|5     |2      |payload length                                   |
+------+-------+-------------------------------------------------+
|7     |*      |V1 IP Subnet Extension Payload                   |
+------+-------+-------------------------------------------------+
|7     |5      |Client IPv4 Subnet Mask (IPv4 only)              |
+------+-------+-------------------------------------------------+
|7     |4      |Subnet Mask                                      |
+------+-------+-------------------------------------------------+
|9     |2      |Reserved                                         |
+------+-------+-------------------------------------------------+
|11    |*      |Client IPv6 Prefix Array (zero for IPv4)         |
+------+-------+-------------------------------------------------+
|11    |1      |Number of IPv6 Prefixes in Prefix array (1 - 8)  |
+------+-------+-------------------------------------------------+
|12    |*      |Prefix Array, variable length array              |
+------+-------+-------------------------------------------------+

notes:

     1. The new V1 IP Subnet Extension borrows 7 octets from the
reserved fields immediately preceding it to form a complete extension.

     2. none of the existing fields have their offsets changed
within the PROPOSAL message.

Padding Extension Format:

+------+-------+-------------------------------+
|0     |2      |Reserved                       |
+------+-------+-------------------------------+
|2     |1      |Extension type(0x0)            |
+------+-------+-------------------------------+
|3     |*      |Payload length                 |
+------+-------+-------------------------------+
|*     |*      |Padding (fill with 0x0)        |
+------+-------+-------------------------------+

notes:

     1. Padding Extension is used to fill reserved areas that
have not been used yet. It doesn't mean anything, and can be replaced
in the future.

SMCv2 EXTENSION Format:

+------+-------+------------------------------------------------------------+
|0     |8      |SMCv2 Extension - Client Options Area (SMCRv2 & SMCDv2)     |
+------+-------+------------------------------------------------------------+
|0     |8      |SMCv2 Extension - Client Options Area Header                |
+------+-------+------------------------------------------------------------+
|0     |1      |EID Number                                                  |
+------+-------+------------------------------------------------------------+
|1     |1      |ISMv2 GID Number                                            |
+------+-------+------------------------------------------------------------+
|2     |1      |Flag 1 (bit 8) - Reserved                                   |
+------+-------+------------------------------------------------------------+
|3     |1      |Flag 2 (bit 8)                                              |
+------+-------+------------------------------------------------------------+
|4     |2      |Extension Header (reserved before)                          |
+------+-------+------------------------------------------------------------+
|4     |1      |Extension type(0x3)                                         |
+------+-------+------------------------------------------------------------+
|5     |1      |payload length (range 0-127)                                |
+------+-------+------------------------------------------------------------+
|6     |2      |SMCDv2 Extension Offset (if present)                        |
+------+-------+------------------------------------------------------------+
|8     |16     |RoCEv2 GID (IPv4 or IPv6 address)                           |
+------+-------+------------------------------------------------------------+
|8     |16     |RoCEv2 GID IPv6 address (when IPv6)                         |
+------+-------+------------------------------------------------------------+
|8     |12     |RoCEv2 GID IPv4 reserved (when IPv4)                        |
+------+-------+------------------------------------------------------------+
|20    |4      |RoCEv2 GID IPv4 address (right aligned)                     |
+------+-------+------------------------------------------------------------+
|24    |9      |Reserved                                                    |
+------+-------+------------------------------------------------------------+
|33    |7      |Continuation extension (reserved before)                    |
+------+-------+------------------------------------------------------------+
|33    |4      |Reserved                                                    |
+------+-------+------------------------------------------------------------+
|37    |1      |Extension type(0x1) (reserved before)                       |
+------+-------+------------------------------------------------------------+
|38    |2      |Payload length (reserved before)                            |
+------+-------+------------------------------------------------------------+
|40    |*      |EID Array Area – variable length (32 bytes * EID Number)    |
+------+-------+------------------------------------------------------------+
|*     |*      |SMCDv2 optional area (formerly called SMCDv2 extension)     |
+------+-------+------------------------------------------------------------+

notes:

     1. The new V2 EXTENSION uses several reserved octets to form a complete
extension. Note that none of the existing fields have their offsets changed
within the PROPOSAL message.

     2. The size of the SMCv2 EXTENSION plus the maximum size of the EID Array
Area is much larger than the highest number that one octet can represent. To
stay compatible with the 'legacy V2 Extension', only 2 reserved octets are left
for the type and length fields. Therefore, we use a Continuation Extension to
describe the remaining content.

Continuation Extension Format:

+------+-------+-------------------------------+
|0     |4      |Reserved                       |
+------+-------+-------------------------------+
|4     |1      |Extension type(0x1)            |
+------+-------+-------------------------------+
|5     |*      |payload length                 |
+------+-------+-------------------------------+
|*     |*      |Continuation data              |
+------+-------+-------------------------------+

notes:

     1. Indicates that the content of this extension is a continuation of
the content of the previous extension.

     2. It exists to stay compatible with some existing extensions,
where the reserved bytes that can be used are not enough to represent
the extension's maximum length.
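
To make the length problem concrete, here is a small C sketch that writes
the 7-octet continuation header placed in front of the EID Array Area
(offsets 33-39 of the SMCv2 EXTENSION table). The function name is
hypothetical, and encoding the two-octet payload length in network byte
order is an assumption, since the tables do not state the byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SMC_EXT_TYPE_CONT 0x1	/* extension type from the table above */
#define SMC_CONT_HDR_LEN  7	/* 4 reserved + 1 type + 2 length */
#define SMC_EID_LEN       32	/* octets per EID entry */

/* Even the largest one-octet EID count fits in a two-octet length. */
static_assert(255 * SMC_EID_LEN <= UINT16_MAX,
	      "EID array length fits in two octets");

/* Hypothetical helper: fill the continuation header for `eid_cnt` EIDs
 * and return the payload length that was encoded.
 */
uint16_t smc_cont_hdr_for_eids(uint8_t hdr[SMC_CONT_HDR_LEN], uint8_t eid_cnt)
{
	uint16_t payload = (uint16_t)(eid_cnt * SMC_EID_LEN);

	memset(hdr, 0, SMC_CONT_HDR_LEN);	/* reserved octets */
	hdr[4] = SMC_EXT_TYPE_CONT;
	hdr[5] = (uint8_t)(payload >> 8);	/* ASSUMED: big-endian length */
	hdr[6] = (uint8_t)(payload & 0xff);
	return payload;
}
```

For instance, eight EIDs already need 256 octets, which no longer fits in
the single length octet of the SMCv2 extension header.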


CLC ACCEPT MESSAGE (SMC-DV2 FORMAT) / CLC CONFIRM MESSAGE (SMC-Dv2 FORMAT)

+------+-------+---------------------------------------------------+
|34    |32     |EID (Negotiated Common EID selected by the server) |
+------+-------+---------------------------------------------------+
|66    |4      |Reserved                                           |
+------+-------+---------------------------------------------------+
|70    |*      |Extensions Area                                    |
+------+-------+---------------------------------------------------+
|70    |2      |number of Extensions  (reserved before)            |
+------+-------+---------------------------------------------------+
|72    |38     |First Contact Extension -                          |
|      |       |only present when first contact flag is on         |
+------+-------+---------------------------------------------------+
|72    |6      |First Contact Extension Header                     |
+------+-------+---------------------------------------------------+
|72    |2      |Reserved                                           |
+------+-------+---------------------------------------------------+
|74    |4      |FCE Header                                         |
+------+-------+---------------------------------------------------+
|74    |1      |FCE Header - reserved                              |
+------+-------+---------------------------------------------------+
|75    |1      |FCE Header Flag 1 (bit 8)                          |
+------+-------+---------------------------------------------------+
|76    |1      |Extension type (0x4) (reserved before)             |
+------+-------+---------------------------------------------------+
|77    |1      |Payload length (0x20) (reserved before)            |
+------+-------+---------------------------------------------------+
|78    |32     |FCE Peer Host Name                                 |
+------+-------+---------------------------------------------------+
|110   |*      |other available Extension (when applicable)        |
+------+-------+---------------------------------------------------+
|*     |4      |Eye catcher ‘SMCD’ (EBCDIC) message end            |
+------+-------+---------------------------------------------------+

CLC ACCEPT MESSAGE (SMC-RV2 FORMAT)

+------+-------+---------------------------------------------------+
|64    |32     |EID (Negotiated EID selected by server)            |
+------+-------+---------------------------------------------------+
|96    |4      |Reserved                                           |
+------+-------+---------------------------------------------------+
|100   |*      |Extension Area                                     |
+------+-------+---------------------------------------------------+
|100   |2      |number of Extension  (reserved before)             |
+------+-------+---------------------------------------------------+
|102   |38     |First Contact Extension -                          |
|      |       |only present when first contact flag is on         |
+------+-------+---------------------------------------------------+
|102   |6      |First Contact Extension Header                     |
+------+-------+---------------------------------------------------+
|102   |2      |reserved                                           |
+------+-------+---------------------------------------------------+
|104   |4      |FCE Header                                         |
+------+-------+---------------------------------------------------+
|104   |1      |FCE Header - reserved                              |
+------+-------+---------------------------------------------------+
|105   |1      |FCE Header Flag 1 (bit 8)                          |
+------+-------+---------------------------------------------------+
|106   |1      |Extension type (0x4) (reserved before)             |
+------+-------+---------------------------------------------------+
|107   |1      |Payload length (0x20) (reserved before)            |
+------+-------+---------------------------------------------------+
|108   |32     |FCE Peer Host Name                                 |
+------+-------+---------------------------------------------------+
|140   |16     |Padding Extension (reserved before)                |
+------+-------+---------------------------------------------------+
|156   |*      |other available Extension (when applicable)        |
+------+-------+---------------------------------------------------+
|*     |4      |Eye catcher ‘SMCR’ (EBCDIC) message end            |
+------+-------+---------------------------------------------------+

notes:

     1. none of the existing fields have their offsets changed
within the message.


First Contact Extension Format:

+------+-------+----------------------------------------------------------------+
|0     |6      |First Contact Extension Header                                  |
+------+-------+----------------------------------------------------------------+
|0     |2      |Reserved                                                        |
+------+-------+----------------------------------------------------------------+
|2     |1      |FCE Header Flag 0                                               |
+------+-------+----------------------------------------------------------------+
|3     |1      |FCE Header Flag 1                                               |
+------+-------+----------------------------------------------------------------+
|4     |1      |Extension type (0x4) (reserved before)                          |
+------+-------+----------------------------------------------------------------+
|5     |1      |Payload length (0x20) (reserved before)                         |
+------+-------+----------------------------------------------------------------+
|6     |32     |FCE Peer Host Name (ASCII character - padded with ASCII blanks) |
+------+-------+----------------------------------------------------------------+

notes:

     1. The new First Contact Extension borrows 2 octets from the
reserved fields immediately preceding it to form a complete extension.
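
The First Contact Extension layout above could be mirrored in C roughly as
follows. This is only a sketch: the struct, macro, and function names are
made up for illustration and are not taken from net/smc. The static
assertions check the offsets and the 38-octet total given in the tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SMC_EXT_TYPE_FCE 0x4	/* extension type from the table above */
#define SMC_FCE_HOST_LEN 32	/* payload length 0x20 */

/* Hypothetical mirror of the First Contact Extension table. */
struct smc_fce_sketch {
	uint8_t reserved[2];
	uint8_t flag0;
	uint8_t flag1;
	uint8_t ext_type;	/* was reserved before */
	uint8_t payload_len;	/* was reserved before */
	uint8_t hostname[SMC_FCE_HOST_LEN];
};

static_assert(offsetof(struct smc_fce_sketch, ext_type) == 4,
	      "type at offset 4");
static_assert(offsetof(struct smc_fce_sketch, hostname) == 6,
	      "peer host name at offset 6");
static_assert(sizeof(struct smc_fce_sketch) == 38,
	      "extension is 38 octets");

/* Fill type, length, and the blank-padded ASCII peer host name. */
void smc_fce_init(struct smc_fce_sketch *fce, const char *host)
{
	size_t n = strlen(host);

	if (n > SMC_FCE_HOST_LEN)
		n = SMC_FCE_HOST_LEN;	/* truncate over-long names */
	memset(fce, 0, sizeof(*fce));
	fce->ext_type = SMC_EXT_TYPE_FCE;
	fce->payload_len = SMC_FCE_HOST_LEN;
	memset(fce->hostname, ' ', SMC_FCE_HOST_LEN);
	memcpy(fce->hostname, host, n);
}
```

Because every member is a single octet or an octet array, the compiler has
no reason to insert padding, so the struct matches the wire layout without
a packed attribute.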


CLC CONFIRM MESSAGE (SMC-RV2 FORMAT)

+------+-------+---------------------------------------------------+
|64    |32     |EID (Negotiated EID selected by server)            |
+------+-------+---------------------------------------------------+
|96    |4      |Reserved                                           |
+------+-------+---------------------------------------------------+
|100   |*      |Extension Area                                     |
+------+-------+---------------------------------------------------+
|100   |2      |number of Extension  (reserved before)             |
+------+-------+---------------------------------------------------+
|102   |38     |First Contact Extension -                          |
|      |       |only present when first contact flag is on         |
+------+-------+---------------------------------------------------+
|102   |6      |First Contact Extension Header                     |
+------+-------+---------------------------------------------------+
|102   |2      |reserved                                           |
+------+-------+---------------------------------------------------+
|104   |4      |FCE Header                                         |
+------+-------+---------------------------------------------------+
|104   |1      |FCE Header - reserved                              |
+------+-------+---------------------------------------------------+
|105   |1      |FCE Header Flag 1 (bit 8)                          |
+------+-------+---------------------------------------------------+
|106   |1      |Extension type (0x4)  (reserved before)            |
+------+-------+---------------------------------------------------+
|107   |1      |Payload length (0x20)                              |
+------+-------+---------------------------------------------------+
|108   |32     |FCE Peer Host Name                                 |
+------+-------+---------------------------------------------------+
|140   |9      |PADDING extension (reserved before)                |
+------+-------+---------------------------------------------------+
|149   |*      |Client RoCEv2 GID Extension                        |
+------+-------+---------------------------------------------------+
|149   |7      |Client RoCEv2 GID Extension Header(reserved before)|
+------+-------+---------------------------------------------------+
|156   |*      |FCE Client RoCEv2 GID List                         |
+------+-------+---------------------------------------------------+
|*     |*      |other available Extension (when applicable)        |
+------+-------+---------------------------------------------------+
|*     |4      |Eye catcher ‘SMCR’ (EBCDIC) message end            |
+------+-------+---------------------------------------------------+

notes:

     1. The Client RoCEv2 GID list was once part of the First Contact Extension;
it is now a standalone extension.

Client RoCEv2 GID Extension Format:

+------+-------+---------------------------------------------------+
|0     |*      |Client RoCEv2 GID                                  |
+------+-------+---------------------------------------------------+
|0     |4      |Reserved                                           |
+------+-------+---------------------------------------------------+
|4     |1      |Extension type(0x5)                                |
+------+-------+---------------------------------------------------+
|5     |2      |Payload length                                     |
+------+-------+---------------------------------------------------+
|7     |*      |Client RoCEv2 GID List                             |
+------+-------+---------------------------------------------------+
|7     |4      |GID List Header                                    |
+------+-------+---------------------------------------------------+
|7     |1      |GID List No of Entries (1 - 8)                     |
+------+-------+---------------------------------------------------+
|8     |3      |Reserved                                           |
+------+-------+---------------------------------------------------+
|11    |*      |GID List Array Area                                |
+------+-------+---------------------------------------------------+
|11    |16     |GID List Entry - RoCEv2 IP address (IPv4 or IPv6)  |
+------+-------+---------------------------------------------------+
|*     |*      |End of Client GID List                             |
+------+-------+---------------------------------------------------+

notes:

     1. The new Client RoCEv2 GID Extension borrows 7 octets from the
reserved fields immediately preceding it to form a complete extension.

     2. None of the existing fields have their offsets changed
within the CONFIRM message.
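
The size of a Client RoCEv2 GID Extension follows directly from the table:
a 7-octet extension header, a 4-octet GID list header, and 16 octets per
entry. A minimal C sketch of that arithmetic is shown below; the names are
hypothetical, and it assumes that the "End of Client GID List" row adds no
octets of its own.

```c
#include <assert.h>
#include <stddef.h>

#define SMC_EXT_TYPE_CLIENT_GID 0x5 /* extension type from the table above */
#define SMC_GID_EXT_HDR_LEN     7   /* 4 reserved + 1 type + 2 length */
#define SMC_GID_LIST_HDR_LEN    4   /* 1 entry count + 3 reserved */
#define SMC_GID_LEN             16  /* one RoCEv2 GID (IPv4 or IPv6) */
#define SMC_GID_MAX_ENTRIES     8   /* entry count range is 1 - 8 */

static_assert(SMC_GID_EXT_HDR_LEN + SMC_GID_LIST_HDR_LEN == 11,
	      "GID List Array Area starts at offset 11");

/* Hypothetical helper: total octets used by the extension for `entries`
 * GIDs, or 0 if the entry count is outside the 1 - 8 range.
 */
size_t smc_gid_ext_size(unsigned int entries)
{
	if (entries < 1 || entries > SMC_GID_MAX_ENTRIES)
		return 0;
	return SMC_GID_EXT_HDR_LEN + SMC_GID_LIST_HDR_LEN +
	       (size_t)entries * SMC_GID_LEN;
}
```

With one entry the extension occupies 27 octets, and with the maximum of
eight entries it occupies 139 octets.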
Alexandra Winter June 23, 2022, 11:59 a.m. UTC | #10
On 16.06.22 15:49, D. Wythe wrote:
> 
> 
> On 2022/6/1 2:33 PM, D. Wythe wrote:
>>
>> On 2022/5/25 9:42 PM, Alexandra Winter wrote:
>>
>>> We need to carefully evaluate them and make sure everything is compatible
>>> with the existing implementations of SMC-D and SMC-R v1 and v2. In the
>>> typical s390 environment ROCE LAG is probably not good enough, as the card
>>> is still a single point of failure. So your ideas need to be compatible
>>> with link redundancy. We also need to consider that the extension of the
>>> protocol does not block other desirable extensions.
>>>
>>> Your prototype is very helpful for the understanding. Before submitting any
>>> code patches to net-next, we should agree on the details of the protocol
>>> extension. Maybe you could formulate your proposal in plain text, so we can
>>> discuss it here?
>>>
>>> We also need to inform you that several public holidays are upcoming in the
>>> next weeks and several of our team will be out for summer vacation, so please
>>> allow for longer response times.
>>>
>>> Kind regards
>>> Alexandra Winter
>>>
>>
>> Hi all,
>>
>> In order to achieve single-link compatibility, we must
>> complete at least one negotiation. We wish to provide
>> higher scalability while meeting this requirement. There are
>> a few ways to achieve this.
>>
>> 1. Use the available reserved bits. According to
>> the SMC v2 protocol, at least 28 reserved octets
>> in the PROPOSAL MESSAGE and at least 10 reserved octets in
>> the ACCEPT MESSAGE are available. We can define a feature
>> area within them that works like a bitmap. Considering future
>> scalability, we MAY use at least 2 reserved octets, which can
>> support negotiation of at least 16 features.
>>
>> 2. Unify all the areas named extension in the current
>> SMC v2 protocol spec without reinterpreting any existing field
>> or changing any field offset, including 'PROPOSAL V1 IP Subnet Extension',
>> 'PROPOSAL V2 Extension', 'PROPOSAL SMC-DV2 EXTENSION', etc., and provide
>> the ability to grow dynamically as needs expand. This scheme will use
>> at least 10 reserved octets in the PROPOSAL MESSAGE and at least 4
>> reserved octets in the ACCEPT MESSAGE and CONFIRM MESSAGE. Fortunately,
>> we only need to use reserved fields, and the current reserved fields
>> are sufficient. We can then easily add a new extension named
>> SINGLE LINK. For reasons of space, the details will be elaborated
>> after the scheme is finalized.
>>
[...]
>>
>>
>> Look forward to your advice and comments.
>>
>> Thanks.
> 
> Hi all,
> 
> Building on the previous proposal, if we can put the application data in the
> PROPOSAL message, we can achieve SMC 0-RTT. The process should be similar to the following:
> 
[...]

Thank you, D. Wythe, for the detailed proposal. I have forwarded it to the protocol owner
and we are currently reviewing it.
We may contact you and Tony Lu directly to discuss the details, if that is OK with you.

Kind regards
Alexandra Winter
D. Wythe June 23, 2022, 12:50 p.m. UTC | #11
On 2022/6/23 7:59 PM, Alexandra Winter wrote:
> 
> 
> On 16.06.22 15:49, D. Wythe wrote:
>>
>>
>> On 2022/6/1 2:33 PM, D. Wythe wrote:
>>>
>>> On 2022/5/25 9:42 PM, Alexandra Winter wrote:
>>>
>>>> We need to carefully evaluate them and make sure everything is compatible
>>>> with the existing implementations of SMC-D and SMC-R v1 and v2. In the
>>>> typical s390 environment ROCE LAG is probably not good enough, as the card
>>>> is still a single point of failure. So your ideas need to be compatible
>>>> with link redundancy. We also need to consider that the extension of the
>>>> protocol does not block other desirable extensions.
>>>>
>>>> Your prototype is very helpful for the understanding. Before submitting any
>>>> code patches to net-next, we should agree on the details of the protocol
>>>> extension. Maybe you could formulate your proposal in plain text, so we can
>>>> discuss it here?
>>>>
>>>> We also need to inform you that several public holidays are upcoming in the
>>>> next weeks and several of our team will be out for summer vacation, so please
>>>> allow for longer response times.
>>>>
>>>> Kind regards
>>>> Alexandra Winter
>>>>
>>>
>>> Hi all,
>>>
>>> In order to achieve single-link compatibility, we must
>>> complete at least one negotiation. We wish to provide
>>> higher scalability while meeting this requirement. There are
>>> a few ways to achieve this.
>>>
>>> 1. Use the available reserved bits. According to
>>> the SMC v2 protocol, at least 28 reserved octets
>>> in the PROPOSAL MESSAGE and at least 10 reserved octets in
>>> the ACCEPT MESSAGE are available. We can define a feature
>>> area within them that works like a bitmap. Considering future
>>> scalability, we MAY use at least 2 reserved octets, which can
>>> support negotiation of at least 16 features.
>>>
>>> 2. Unify all the areas named extension in the current
>>> SMC v2 protocol spec without reinterpreting any existing field
>>> or changing any field offset, including 'PROPOSAL V1 IP Subnet Extension',
>>> 'PROPOSAL V2 Extension', 'PROPOSAL SMC-DV2 EXTENSION', etc., and provide
>>> the ability to grow dynamically as needs expand. This scheme will use
>>> at least 10 reserved octets in the PROPOSAL MESSAGE and at least 4
>>> reserved octets in the ACCEPT MESSAGE and CONFIRM MESSAGE. Fortunately,
>>> we only need to use reserved fields, and the current reserved fields
>>> are sufficient. We can then easily add a new extension named
>>> SINGLE LINK. For reasons of space, the details will be elaborated
>>> after the scheme is finalized.
>>>
> [...]
>>>
>>>
>>> Look forward to your advice and comments.
>>>
>>> Thanks.
>>
>> Hi all,
>>
>> Building on the previous proposal, if we can put the application data in the
>> PROPOSAL message, we can achieve SMC 0-RTT. The process should be similar to the following:
>>
> [...]
> 
> Thank you, D. Wythe, for the detailed proposal. I have forwarded it to the protocol owner
> and we are currently reviewing it.
> We may contact you and Tony Lu directly to discuss the details, if that is OK with you.
> 
> Kind regards
> Alexandra Winter
> 
> 
> 
> 

Thanks a lot for your support. That sounds good to us, and we are totally okay with that.

Best Wishes.
D. Wythe

Patch

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 1a556f4..bf646d1 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -492,7 +492,7 @@  static int smcr_lgr_reg_rmbs(struct smc_link *link,
 			     struct smc_buf_desc *rmb_desc)
 {
 	struct smc_link_group *lgr = link->lgr;
-	int i, rc = 0;
+	int i, lnk = 0, rc = 0;
 
 	rc = smc_llc_flow_initiate(lgr, SMC_LLC_FLOW_RKEY);
 	if (rc)
@@ -507,14 +507,20 @@  static int smcr_lgr_reg_rmbs(struct smc_link *link,
 		rc = smcr_link_reg_rmb(&lgr->lnk[i], rmb_desc);
 		if (rc)
 			goto out;
+		/* available link count inc */
+		lnk++;
 	}
 
-	/* exchange confirm_rkey msg with peer */
-	rc = smc_llc_do_confirm_rkey(link, rmb_desc);
-	if (rc) {
-		rc = -EFAULT;
-		goto out;
+	/* skip the confirm_rkey exchange when there is only one link */
+	if (lnk > 1) {
+		/* exchange confirm_rkey msg with peer */
+		rc = smc_llc_do_confirm_rkey(link, rmb_desc);
+		if (rc) {
+			rc = -EFAULT;
+			goto out;
+		}
 	}
+
 	rmb_desc->is_conf_rkey = true;
 out:
 	mutex_unlock(&lgr->llc_conf_mutex);
@@ -932,6 +938,31 @@  static int smc_find_rdma_device(struct smc_sock *smc, struct smc_init_info *ini)
 	return 0;
 }
 
+/* prototype code only:
+ * TCP connect has not happened yet, so look up the route for smc_pnet_find_roce_by_pnetid
+ */
+static int smc_find_rdma_device_with_dst(struct smc_sock *smc, struct smc_init_info *ini)
+{
+	struct sock *tsk = smc->clcsock->sk;
+	struct rtable *rt;
+
+	rt = ip_route_output(sock_net(tsk), smc->remote_address.v4.sin_addr.s_addr,
+			     0, 0, 0);
+
+	if (IS_ERR(rt))
+		return -ECONNRESET;
+
+	smc_pnet_find_roce_by_pnetid(rt->dst.dev, ini);
+	__builtin_prefetch(&ini->ib_dev->mac[ini->ib_port - 1]);
+
+	if (!ini->check_smcrv2 && !ini->ib_dev)
+		return SMC_CLC_DECL_NOSMCRDEV;
+	if (ini->check_smcrv2 && !ini->smcrv2.ib_dev_v2)
+		return SMC_CLC_DECL_NOSMCRDEV;
+
+	return 0;
+}
+
 /* check if there is an ISM device available for this connection. */
 /* called for connect and listen */
 static int smc_find_ism_device(struct smc_sock *smc, struct smc_init_info *ini)
@@ -1019,13 +1050,17 @@  static int smc_find_proposal_devices(struct smc_sock *smc,
 
 	/* check if there is an rdma device available */
 	if (!(ini->smcr_version & SMC_V1) ||
-	    smc_find_rdma_device(smc, ini))
+	    smc_find_rdma_device_with_dst(smc, ini))
 		ini->smcr_version &= ~SMC_V1;
 	/* else RDMA is supported for this connection */
 
 	ini->smc_type_v1 = smc_indicated_type(ini->smcd_version & SMC_V1,
 					      ini->smcr_version & SMC_V1);
 
+	/* prototype only: skip the v2 device checks for simplicity */
+	ini->smc_type_v2 = SMC_TYPE_N;
+	return rc;
+
 	/* check if there is an ism v2 device available */
 	if (!(ini->smcd_version & SMC_V2) ||
 	    !smc_ism_is_v2_capable() ||
@@ -1492,11 +1527,7 @@  static void smc_connect_work(struct work_struct *work)
 		smc->sk.sk_err = smc->clcsock->sk->sk_err;
 	} else if ((1 << smc->clcsock->sk->sk_state) &
 					(TCPF_SYN_SENT | TCPF_SYN_RECV)) {
-		rc = sk_stream_wait_connect(smc->clcsock->sk, &timeo);
-		if ((rc == -EPIPE) &&
-		    ((1 << smc->clcsock->sk->sk_state) &
-					(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)))
-			rc = 0;
+		rc = 0;
 	}
 	release_sock(smc->clcsock->sk);
 	lock_sock(&smc->sk);
@@ -1580,9 +1611,10 @@  static int smc_connect(struct socket *sock, struct sockaddr *addr,
 		rc = -EALREADY;
 		goto out;
 	}
-	rc = kernel_connect(smc->clcsock, addr, alen, flags);
-	if (rc && rc != -EINPROGRESS)
-		goto out;
+
+	/* keep a copy of the remote address */
+	memcpy(&smc->remote_address.ss, addr, alen);
+	rc = -EINPROGRESS;
 
 	if (smc->use_fallback) {
 		sock->state = rc ? SS_CONNECTING : SS_CONNECTED;
@@ -2452,9 +2484,17 @@  static int smc_listen(struct socket *sock, int backlog)
 {
 	struct sock *sk = sock->sk;
 	struct smc_sock *smc;
-	int rc;
+	int rc, val;
 
 	smc = smc_sk(sk);
+
+	/* enable TCP fastopen on the server clcsock.
+	 * prototype code only; the queue length 5 is arbitrary
+	 */
+	val = 5;
+	smc->clcsock->ops->setsockopt(smc->clcsock, SOL_TCP,
+				      TCP_FASTOPEN, KERNEL_SOCKPTR(&val), sizeof(val));
+
 	lock_sock(sk);
 
 	rc = -EINVAL;
diff --git a/net/smc/smc.h b/net/smc/smc.h
index 5ed765e..ef18894 100644
--- a/net/smc/smc.h
+++ b/net/smc/smc.h
@@ -261,6 +261,14 @@  struct smc_sock {				/* smc sock container */
 	int			fallback_rsn;	/* reason for fallback */
 	u32			peer_diagnosis; /* decline reason from peer */
 	atomic_t                queued_smc_hs;  /* queued smc handshakes */
+
+	union {
+		struct sockaddr		addr;
+		struct sockaddr_in	v4;
+		struct sockaddr_in6	v6;
+		struct sockaddr_storage ss;
+	} remote_address;
+
 	struct inet_connection_sock_af_ops		af_ops;
 	const struct inet_connection_sock_af_ops	*ori_af_ops;
 						/* original af ops */
diff --git a/net/smc/smc_clc.c b/net/smc/smc_clc.c
index f9f3f59..f944c67 100644
--- a/net/smc/smc_clc.c
+++ b/net/smc/smc_clc.c
@@ -20,6 +20,7 @@ 
 #include <net/addrconf.h>
 #include <net/sock.h>
 #include <net/tcp.h>
+#include <net/route.h>
 
 #include "smc.h"
 #include "smc_core.h"
@@ -486,8 +487,7 @@  static int smc_clc_prfx_set4_rcu(struct dst_entry *dst, __be32 ipv4,
 		return -ENODEV;
 
 	in_dev_for_each_ifa_rcu(ifa, in_dev) {
-		if (!inet_ifa_match(ipv4, ifa))
-			continue;
+		/* prototype only: address match check removed for simplicity */
 		prop->prefix_len = inet_mask_len(ifa->ifa_mask);
 		prop->outgoing_subnet = ifa->ifa_address & ifa->ifa_mask;
 		/* prop->ipv6_prefixes_cnt = 0; already done by memset before */
@@ -528,10 +528,10 @@  static int smc_clc_prfx_set6_rcu(struct dst_entry *dst,
 
 /* retrieve and set prefixes in CLC proposal msg */
 static int smc_clc_prfx_set(struct socket *clcsock,
+			    struct dst_entry *dst,
 			    struct smc_clc_msg_proposal_prefix *prop,
 			    struct smc_clc_ipv6_prefix *ipv6_prfx)
 {
-	struct dst_entry *dst = sk_dst_get(clcsock->sk);
 	struct sockaddr_storage addrs;
 	struct sockaddr_in6 *addr6;
 	struct sockaddr_in *addr;
@@ -802,7 +802,8 @@  int smc_clc_send_decline(struct smc_sock *smc, u32 peer_diag_info, u8 version)
 }
 
 /* send CLC PROPOSAL message across internal TCP socket */
-int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini)
+int smc_clc_send_proposal_with_nexthop(struct smc_sock *smc,
+				       struct dst_entry *dst, struct smc_init_info *ini)
 {
 	struct smc_clc_smcd_v2_extension *smcd_v2_ext;
 	struct smc_clc_msg_proposal_prefix *pclc_prfx;
@@ -838,7 +839,7 @@  int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini)
 
 	/* retrieve ip prefixes for CLC proposal msg */
 	if (ini->smc_type_v1 != SMC_TYPE_N) {
-		rc = smc_clc_prfx_set(smc->clcsock, pclc_prfx, ipv6_prfx);
+		rc = smc_clc_prfx_set(smc->clcsock, dst, pclc_prfx, ipv6_prfx);
 		if (rc) {
 			if (ini->smc_type_v2 == SMC_TYPE_N) {
 				kfree(pclc);
@@ -961,6 +962,11 @@  int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini)
 	}
 	vec[i].iov_base = trl;
 	vec[i++].iov_len = sizeof(*trl);
+
+	msg.msg_flags	|= MSG_FASTOPEN;
+	msg.msg_name	= &smc->remote_address.addr;
+	msg.msg_namelen = sizeof(struct sockaddr_in);
+
 	/* due to the few bytes needed for clc-handshake this cannot block */
 	len = kernel_sendmsg(smc->clcsock, &msg, vec, i, plen);
 	if (len < 0) {
@@ -975,6 +981,22 @@  int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini)
 	return reason_code;
 }
 
+int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini)
+{
+	struct sock *tsk = smc->clcsock->sk;
+	struct rtable *rt;
+	int rc;
+
+	rt = ip_route_output(sock_net(tsk), smc->remote_address.v4.sin_addr.s_addr,
+			     0, 0, 0);
+
+	if (IS_ERR(rt))
+		return -ECONNRESET;
+
+	rc = smc_clc_send_proposal_with_nexthop(smc, &rt->dst, ini);
+	return rc;
+}
+
 /* build and send CLC CONFIRM / ACCEPT message */
 static int smc_clc_send_confirm_accept(struct smc_sock *smc,
 				       struct smc_clc_msg_accept_confirm_v2 *clc_v2,
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index f40f6ed..ef5e5411 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -1765,6 +1765,8 @@  int smc_vlan_by_tcpsk(struct socket *clcsock, struct smc_init_info *ini)
 	int rc = 0;
 
 	ini->vlan_id = 0;
+	/* prototype only: skip the VLAN lookup for simplicity */
+	return 0;
 	if (!dst) {
 		rc = -ENOTCONN;
 		goto out;
diff --git a/net/smc/smc_pnet.c b/net/smc/smc_pnet.c
index 7055ed1..6aa3304 100644
--- a/net/smc/smc_pnet.c
+++ b/net/smc/smc_pnet.c
@@ -1064,8 +1064,8 @@  static void smc_pnet_find_rdma_dev(struct net_device *netdev,
  * If nothing found, check pnetid table.
  * If nothing found, try to use handshake device
  */
-static void smc_pnet_find_roce_by_pnetid(struct net_device *ndev,
-					 struct smc_init_info *ini)
+void smc_pnet_find_roce_by_pnetid(struct net_device *ndev,
+				  struct smc_init_info *ini)
 {
 	u8 ndev_pnetid[SMC_MAX_PNETID_LEN];
 	struct net *net;
diff --git a/net/smc/smc_pnet.h b/net/smc/smc_pnet.h
index 80a88ee..2ffaf22 100644
--- a/net/smc/smc_pnet.h
+++ b/net/smc/smc_pnet.h
@@ -67,4 +67,7 @@  void smc_pnet_find_alt_roce(struct smc_link_group *lgr,
 			    struct smc_ib_device *known_dev);
 bool smc_pnet_is_ndev_pnetid(struct net *net, u8 *pnetid);
 bool smc_pnet_is_pnetid_set(u8 *pnetid);
+
+void smc_pnet_find_roce_by_pnetid(struct net_device *ndev,
+				  struct smc_init_info *ini);
 #endif