Message ID | 1653375127-130233-1-git-send-email-alibuda@linux.alibaba.com (mailing list archive) |
---|---|
State | Not Applicable |
Headers | show |
Series | [RFC,net-next] net/smc:introduce 1RTT to SMC | expand |
On Tue, May 24, 2022 at 02:52:07PM +0800, D. Wythe wrote: > From: "D. Wythe" <alibuda@linux.alibaba.com> > > Hi Karsten, > > We are promoting SMC-R to the field of cloud computing, dues to the > particularity of business on the cloud, the scale and the types of > customer applications are unpredictable. As a participant of SMC-R, we > also hope that SMC-R can cover more application scenarios. Therefore, > many connection problems are exposed during this time. There are two > main issue, one is that the establishment of a single connection takes > longer than that of the TCP, another is that the degree of concurrency > is low under multi-connection processing. This patch set is mainly > optimized for the first issue, and the follow-up of the second issue > will be synchronized in the future. > > In terms of communication process, under current implement, a TCP > three-way handshake only needs 1-RTT time, while SMC-R currently > requires 4-RTT times, including 2-RTT over IP(TCP handshake, SMC > proposal & accept ) and 2-RTT over IB ( two times RKEY exchange), which > is most influential factor affecting connection established time at the > moment. > > We have noticed that single network interface card is mainstream on the > cloud, dues to the advantages of cloud deployment costs and the cloud's > own disaster recovery support. On the other hand, the emergence of RoCE > LAG technology makes us no longer need to deal with multiple RDMA > network interface cards by ourselves, just like NIC bonding does. In > Alibaba, Roce LAG is widely used for RDMA. I think this is an interesting topic whether we need SMC-level link redundancy. I agreed with that RoCE LAG and RDMA in cloud vendors handle redundancy and failover in the lower layer, and do it transparently for SMC. So let's move on, if a RDMA device has redundancy ability, we could make SMC simpler by give an option for user-space or based on the device capability (if we have this flag). This allows under layer to ensure the reliability of link group. As RFC 7609 mentioned, we should do some extra work for reliability to add link. It should be an optional work if the device have capability for redundancy, and make link group simpler and faster (for the so-called SMC-2RTT in this RFC). I also notice that RFC 7609 is released on August 2015, which is earlier than RoCE LAG. RoCE LAG is provided after ConnectX-3/ConnectX-3 Pro in kernel 4.0, and is available in 2017. And cloud vendors' RDMA adapters, such as Alibaba Elastic RDMA adapter in [1]. Given that, I propose whether the second link can be used as an option in newly created link group. Also, if it is possible, RFC 7609 can be updated or extend it for this nowadays case. Looking forward for your message, Karsten, D. Wythe and folks. [1] https://lore.kernel.org/linux-rdma/20220523075528.35017-1-chengyou@linux.alibaba.com/ Thanks, Tony Lu > In that case, SMC-R have only one single link, if so, the RKEY LLC > messages that to perform information exchange in all links are no longer > needed, the SMC Proposal & accept has already complete the exchange of > all information needed. So we think that we can remove the RKEY exchange > in that case, which will save us 2-RTT over IB. We call it as SMC-R 2-RTT. > > On the other hand, we can use TCP fast open, carry the SMC proposal data > by TCP SYN message, reduce the time that the SMC waits for the TCP > connection to be established. This will save us another 1-RTT over IP. > > Based on the above two viewpoints, in this scenario, we can compress the > communication process of SMC-R into 1-RTT over IP, so that we can > theoretically obtain a time close to that of TCP connection > establishment. We call it as SMC-R 1-RTT. Of course, the specific results > will also be affected by the implementation. > > In our test environment, we host two VMs on the same host for wrk/nginx > tests, used a script similar to the following to performing test: > > Client.sh > > conn=$1 > thread=$2 > > wrk -H ‘Connection: Close’ -c ${conn} -t ${thread} -d 10 > > Server.sh > > sysctl -w net.ipv4.tcp_fastopen=3 > smc_run nginx > > Statistic shows that: > > +-----------+---------------+---------------+----------------+--------------+----------------+---------------+---------------+ > |type|args | -c1 -t1 | -c2 -t1 | -c5 -t1 | -c10 -t1 | -c200 -t1 | -c200 -t4 | -c2000 -t8 | > +-----------+---------------+---------------+----------------+--------------+----------------+---------------+---------------+ > |next-next | 4188.5qps | 5942.04qps | 7621.81qps | 7678.62qps | 8204.94qps | 8457.57qps | 5687.60qps | > +-----------+---------------+---------------+----------------+--------------+----------------+---------------+---------------+ > |SMC-2RTT | 4730.17qps | 7394.85qps | 11532.78qps | 12016.22qps | 11520.81qps | 11391.36qps | 10364.41qps | > +-----------+---------------+---------------+----------------+--------------+----------------+---------------+---------------+ > |SMC-1RTT | 5702.77qps | 9645.18qps | 11899.20qps | 12005.16qps | 11536.67qps | 11420.87qps | 10392.4qps | > +-----------+---------------+---------------+----------------+--------------+----------------+---------------+---------------+- > |TCP | 6415.74qps | 11034.10qps | 16716.21qps | 22217.06qps | 35926.74qps | 117460.qps | 120291.16qps | > +-----------+---------------+---------------+----------------+--------------+----------------+---------------+---------------+ > > It can clearly be seen that: > > 1. In step by step short-link scenarios ( -c1 -t1 ), SMC-R after > optimization can reach 88% of TCP. There are still many implementation > details that can be optimized, we hope to optimize the performance of > SMC in this scenario to 90% of TCP. > > 2. The problem is very serious in the scenario of multi-threading and > multi-connection, the worst case is only 10% of TCP. Even though the > SMC-1RTT has certain optimizations for this scenario, it is clear that > the bottleneck is not here. We are doing some prototyping to solve this, > we hope to reach 60% of TCP in multi-threading and multi-connection > scenarios, and SMC-1RTT is the important prerequisite for upper limit of > subsequent optimization. > > In this patch set, we had only completed a simple prototype, only make > sure SMC-1RTT can works. > > Sincerely, we are looking forward for you comments, please > let us know if you have any suggestions. > > Thanks. > > Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> > --- > net/smc/af_smc.c | 72 ++++++++++++++++++++++++++++++++++++++++++------------ > net/smc/smc.h | 8 ++++++ > net/smc/smc_clc.c | 32 ++++++++++++++++++++---- > net/smc/smc_core.c | 2 ++ > net/smc/smc_pnet.c | 4 +-- > net/smc/smc_pnet.h | 3 +++ > 6 files changed, 98 insertions(+), 23 deletions(-) > > diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c > index 1a556f4..bf646d1 100644 > --- a/net/smc/af_smc.c > +++ b/net/smc/af_smc.c > @@ -492,7 +492,7 @@ static int smcr_lgr_reg_rmbs(struct smc_link *link, > struct smc_buf_desc *rmb_desc) > { > struct smc_link_group *lgr = link->lgr; > - int i, rc = 0; > + int i, lnk = 0, rc = 0; > > rc = smc_llc_flow_initiate(lgr, SMC_LLC_FLOW_RKEY); > if (rc) > @@ -507,14 +507,20 @@ static int smcr_lgr_reg_rmbs(struct smc_link *link, > rc = smcr_link_reg_rmb(&lgr->lnk[i], rmb_desc); > if (rc) > goto out; > + /* available link count inc */ > + lnk++; > } > > - /* exchange confirm_rkey msg with peer */ > - rc = smc_llc_do_confirm_rkey(link, rmb_desc); > - if (rc) { > - rc = -EFAULT; > - goto out; > + /* do not exchange confirm_rkey msg since there are only one link */ > + if (lnk > 1) { > + /* exchange confirm_rkey msg with peer */ > + rc = smc_llc_do_confirm_rkey(link, rmb_desc); > + if (rc) { > + rc = -EFAULT; > + goto out; > + } > } > + > rmb_desc->is_conf_rkey = true; > out: > mutex_unlock(&lgr->llc_conf_mutex); > @@ -932,6 +938,31 @@ static int smc_find_rdma_device(struct smc_sock *smc, struct smc_init_info *ini) > return 0; > } > > +/* just prototype code > + * since tcp connect has not happen, using route to perform smc_pnet_find_roce_by_pnetid > + */ > +static int smc_find_rdma_device_with_dst(struct smc_sock *smc, struct smc_init_info *ini) > +{ > + struct sock *tsk = smc->clcsock->sk; > + struct rtable *rt; > + > + rt = ip_route_output(sock_net(tsk), smc->remote_address.v4.sin_addr.s_addr, > + 0, 0, 0); > + > + if (IS_ERR(rt)) > + return -ECONNRESET; > + > + smc_pnet_find_roce_by_pnetid(rt->dst.dev, ini); > + __builtin_prefetch(&ini->ib_dev->mac[ini->ib_port - 1]); > + > + if (!ini->check_smcrv2 && !ini->ib_dev) > + return SMC_CLC_DECL_NOSMCRDEV; > + if (ini->check_smcrv2 && !ini->smcrv2.ib_dev_v2) > + return SMC_CLC_DECL_NOSMCRDEV; > + > + return 0; > +} > + > /* check if there is an ISM device available for this connection. */ > /* called for connect and listen */ > static int smc_find_ism_device(struct smc_sock *smc, struct smc_init_info *ini) > @@ -1019,13 +1050,17 @@ static int smc_find_proposal_devices(struct smc_sock *smc, > > /* check if there is an rdma device available */ > if (!(ini->smcr_version & SMC_V1) || > - smc_find_rdma_device(smc, ini)) > + smc_find_rdma_device_with_dst(smc, ini)) > ini->smcr_version &= ~SMC_V1; > /* else RDMA is supported for this connection */ > > ini->smc_type_v1 = smc_indicated_type(ini->smcd_version & SMC_V1, > ini->smcr_version & SMC_V1); > > + /* just prototype, do this for simple */ > + ini->smc_type_v2 = SMC_TYPE_N; > + return rc; > + > /* check if there is an ism v2 device available */ > if (!(ini->smcd_version & SMC_V2) || > !smc_ism_is_v2_capable() || > @@ -1492,11 +1527,7 @@ static void smc_connect_work(struct work_struct *work) > smc->sk.sk_err = smc->clcsock->sk->sk_err; > } else if ((1 << smc->clcsock->sk->sk_state) & > (TCPF_SYN_SENT | TCPF_SYN_RECV)) { > - rc = sk_stream_wait_connect(smc->clcsock->sk, &timeo); > - if ((rc == -EPIPE) && > - ((1 << smc->clcsock->sk->sk_state) & > - (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT))) > - rc = 0; > + rc = 0; > } > release_sock(smc->clcsock->sk); > lock_sock(&smc->sk); > @@ -1580,9 +1611,10 @@ static int smc_connect(struct socket *sock, struct sockaddr *addr, > rc = -EALREADY; > goto out; > } > - rc = kernel_connect(smc->clcsock, addr, alen, flags); > - if (rc && rc != -EINPROGRESS) > - goto out; > + > + /* copy remote address backup */ > + memcpy(&smc->remote_address.ss, addr, alen); > + rc = -EINPROGRESS; > > if (smc->use_fallback) { > sock->state = rc ? SS_CONNECTING : SS_CONNECTED; > @@ -2452,9 +2484,17 @@ static int smc_listen(struct socket *sock, int backlog) > { > struct sock *sk = sock->sk; > struct smc_sock *smc; > - int rc; > + int rc, val; > > smc = smc_sk(sk); > + > + /* enable server clcsock tcp fastopen. > + * just a proto type code, magic number 5 for no reason > + */ > + val = 5; > + smc->clcsock->ops->setsockopt(smc->clcsock, SOL_TCP, > + TCP_FASTOPEN, KERNEL_SOCKPTR(&val), sizeof(val)); > + > lock_sock(sk); > > rc = -EINVAL; > diff --git a/net/smc/smc.h b/net/smc/smc.h > index 5ed765e..ef18894 100644 > --- a/net/smc/smc.h > +++ b/net/smc/smc.h > @@ -261,6 +261,14 @@ struct smc_sock { /* smc sock container */ > int fallback_rsn; /* reason for fallback */ > u32 peer_diagnosis; /* decline reason from peer */ > atomic_t queued_smc_hs; /* queued smc handshakes */ > + > + union { > + struct sockaddr addr; > + struct sockaddr_in v4; > + struct sockaddr_in6 v6; > + struct sockaddr_storage ss; > + } remote_address; > + > struct inet_connection_sock_af_ops af_ops; > const struct inet_connection_sock_af_ops *ori_af_ops; > /* original af ops */ > diff --git a/net/smc/smc_clc.c b/net/smc/smc_clc.c > index f9f3f59..f944c67 100644 > --- a/net/smc/smc_clc.c > +++ b/net/smc/smc_clc.c > @@ -20,6 +20,7 @@ > #include <net/addrconf.h> > #include <net/sock.h> > #include <net/tcp.h> > +#include <net/route.h> > > #include "smc.h" > #include "smc_core.h" > @@ -486,8 +487,7 @@ static int smc_clc_prfx_set4_rcu(struct dst_entry *dst, __be32 ipv4, > return -ENODEV; > > in_dev_for_each_ifa_rcu(ifa, in_dev) { > - if (!inet_ifa_match(ipv4, ifa)) > - continue; > + /* delete this for simple, just prototype code*/ > prop->prefix_len = inet_mask_len(ifa->ifa_mask); > prop->outgoing_subnet = ifa->ifa_address & ifa->ifa_mask; > /* prop->ipv6_prefixes_cnt = 0; already done by memset before */ > @@ -528,10 +528,10 @@ static int smc_clc_prfx_set6_rcu(struct dst_entry *dst, > > /* retrieve and set prefixes in CLC proposal msg */ > static int smc_clc_prfx_set(struct socket *clcsock, > + struct dst_entry *dst, > struct smc_clc_msg_proposal_prefix *prop, > struct smc_clc_ipv6_prefix *ipv6_prfx) > { > - struct dst_entry *dst = sk_dst_get(clcsock->sk); > struct sockaddr_storage addrs; > struct sockaddr_in6 *addr6; > struct sockaddr_in *addr; > @@ -802,7 +802,8 @@ int smc_clc_send_decline(struct smc_sock *smc, u32 peer_diag_info, u8 version) > } > > /* send CLC PROPOSAL message across internal TCP socket */ > -int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini) > +int smc_clc_send_proposal_with_nexthop(struct smc_sock *smc, > + struct dst_entry *dst, struct smc_init_info *ini) > { > struct smc_clc_smcd_v2_extension *smcd_v2_ext; > struct smc_clc_msg_proposal_prefix *pclc_prfx; > @@ -838,7 +839,7 @@ int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini) > > /* retrieve ip prefixes for CLC proposal msg */ > if (ini->smc_type_v1 != SMC_TYPE_N) { > - rc = smc_clc_prfx_set(smc->clcsock, pclc_prfx, ipv6_prfx); > + rc = smc_clc_prfx_set(smc->clcsock, dst, pclc_prfx, ipv6_prfx); > if (rc) { > if (ini->smc_type_v2 == SMC_TYPE_N) { > kfree(pclc); > @@ -961,6 +962,11 @@ int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini) > } > vec[i].iov_base = trl; > vec[i++].iov_len = sizeof(*trl); > + > + msg.msg_flags |= MSG_FASTOPEN; > + msg.msg_name = &smc->remote_address.addr; > + msg.msg_namelen = sizeof(struct sockaddr_in); > + > /* due to the few bytes needed for clc-handshake this cannot block */ > len = kernel_sendmsg(smc->clcsock, &msg, vec, i, plen); > if (len < 0) { > @@ -975,6 +981,22 @@ int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini) > return reason_code; > } > > +int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini) > +{ > + struct sock *tsk = smc->clcsock->sk; > + struct rtable *rt; > + int rc; > + > + rt = ip_route_output(sock_net(tsk), smc->remote_address.v4.sin_addr.s_addr, > + 0, 0, 0); > + > + if (IS_ERR(rt)) > + return -ECONNRESET; > + > + rc = smc_clc_send_proposal_with_nexthop(smc, &rt->dst, ini); > + return rc; > +} > + > /* build and send CLC CONFIRM / ACCEPT message */ > static int smc_clc_send_confirm_accept(struct smc_sock *smc, > struct smc_clc_msg_accept_confirm_v2 *clc_v2, > diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c > index f40f6ed..ef5e5411 100644 > --- a/net/smc/smc_core.c > +++ b/net/smc/smc_core.c > @@ -1765,6 +1765,8 @@ int smc_vlan_by_tcpsk(struct socket *clcsock, struct smc_init_info *ini) > int rc = 0; > > ini->vlan_id = 0; > + /* just for simple , prototype code */ > + return 0; > if (!dst) { > rc = -ENOTCONN; > goto out; > diff --git a/net/smc/smc_pnet.c b/net/smc/smc_pnet.c > index 7055ed1..6aa3304 100644 > --- a/net/smc/smc_pnet.c > +++ b/net/smc/smc_pnet.c > @@ -1064,8 +1064,8 @@ static void smc_pnet_find_rdma_dev(struct net_device *netdev, > * If nothing found, check pnetid table. > * If nothing found, try to use handshake device > */ > -static void smc_pnet_find_roce_by_pnetid(struct net_device *ndev, > - struct smc_init_info *ini) > +void smc_pnet_find_roce_by_pnetid(struct net_device *ndev, > + struct smc_init_info *ini) > { > u8 ndev_pnetid[SMC_MAX_PNETID_LEN]; > struct net *net; > diff --git a/net/smc/smc_pnet.h b/net/smc/smc_pnet.h > index 80a88ee..2ffaf22 100644 > --- a/net/smc/smc_pnet.h > +++ b/net/smc/smc_pnet.h > @@ -67,4 +67,7 @@ void smc_pnet_find_alt_roce(struct smc_link_group *lgr, > struct smc_ib_device *known_dev); > bool smc_pnet_is_ndev_pnetid(struct net *net, u8 *pnetid); > bool smc_pnet_is_pnetid_set(u8 *pnetid); > + > +void smc_pnet_find_roce_by_pnetid(struct net_device *ndev, > + struct smc_init_info *ini); > #endif > -- > 1.8.3.1
On 24.05.22 09:49, Tony Lu wrote: > On Tue, May 24, 2022 at 02:52:07PM +0800, D. Wythe wrote: >> From: "D. Wythe" <alibuda@linux.alibaba.com> >> >> Hi Karsten, >> >> We are promoting SMC-R to the field of cloud computing, dues to the >> particularity of business on the cloud, the scale and the types of >> customer applications are unpredictable. As a participant of SMC-R, we >> also hope that SMC-R can cover more application scenarios. Therefore, >> many connection problems are exposed during this time. There are two >> main issue, one is that the establishment of a single connection takes >> longer than that of the TCP, another is that the degree of concurrency >> is low under multi-connection processing. This patch set is mainly >> optimized for the first issue, and the follow-up of the second issue >> will be synchronized in the future. >> >> In terms of communication process, under current implement, a TCP >> three-way handshake only needs 1-RTT time, while SMC-R currently >> requires 4-RTT times, including 2-RTT over IP(TCP handshake, SMC >> proposal & accept ) and 2-RTT over IB ( two times RKEY exchange), which >> is most influential factor affecting connection established time at the >> moment. >> >> We have noticed that single network interface card is mainstream on the >> cloud, dues to the advantages of cloud deployment costs and the cloud's >> own disaster recovery support. On the other hand, the emergence of RoCE >> LAG technology makes us no longer need to deal with multiple RDMA >> network interface cards by ourselves, just like NIC bonding does. In >> Alibaba, Roce LAG is widely used for RDMA. > > I think this is an interesting topic whether we need SMC-level link > redundancy. I agreed with that RoCE LAG and RDMA in cloud vendors handle > redundancy and failover in the lower layer, and do it transparently for > SMC. > > So let's move on, if a RDMA device has redundancy ability, we could make > SMC simpler by give an option for user-space or based on the device > capability (if we have this flag). This allows under layer to ensure the > reliability of link group. > > As RFC 7609 mentioned, we should do some extra work for reliability to > add link. It should be an optional work if the device have capability > for redundancy, and make link group simpler and faster (for the > so-called SMC-2RTT in this RFC). > > I also notice that RFC 7609 is released on August 2015, which is earlier > than RoCE LAG. RoCE LAG is provided after ConnectX-3/ConnectX-3 Pro in > kernel 4.0, and is available in 2017. And cloud vendors' RDMA adapters, > such as Alibaba Elastic RDMA adapter in [1]. > > Given that, I propose whether the second link can be used as an option > in newly created link group. Also, if it is possible, RFC 7609 can be > updated or extend it for this nowadays case. > > Looking forward for your message, Karsten, D. Wythe and folks. > > [1] https://lore.kernel.org/linux-rdma/20220523075528.35017-1-chengyou@linux.alibaba.com/ > > Thanks, > Tony Lu > Thank you D. Wythe for your proposals, the prototype and measurements. They sound quite promising to us. We need to carefully evaluate them and make sure everything is compatible with the existing implementations of SMC-D and SMC-R v1 and v2. In the typical s390 environment ROCE LAG is propably not good enough, as the card is still a single point of failure. So your ideas need to be compatible with link redundancy. We also need to consider that the extension of the protocol does not block other desirable extensions. Your prototype is very helpful for the understanding. Before submitting any code patches to net-next, we should agree on the details of the protocol extension. Maybe you could formulate your proposal in plain text, so we can discuss it here? We also need to inform you that several public holidays are upcoming in the next weeks and several of our team will be out for summer vacation, so please allow for longer response times. Kind regards Alexandra Winter >> In that case, SMC-R have only one single link, if so, the RKEY LLC >> messages that to perform information exchange in all links are no longer >> needed, the SMC Proposal & accept has already complete the exchange of >> all information needed. So we think that we can remove the RKEY exchange >> in that case, which will save us 2-RTT over IB. We call it as SMC-R 2-RTT. >> >> On the other hand, we can use TCP fast open, carry the SMC proposal data >> by TCP SYN message, reduce the time that the SMC waits for the TCP >> connection to be established. This will save us another 1-RTT over IP. >> >> Based on the above two viewpoints, in this scenario, we can compress the >> communication process of SMC-R into 1-RTT over IP, so that we can >> theoretically obtain a time close to that of TCP connection >> establishment. We call it as SMC-R 1-RTT. Of course, the specific results >> will also be affected by the implementation. >> >> In our test environment, we host two VMs on the same host for wrk/nginx >> tests, used a script similar to the following to performing test: >> >> Client.sh >> >> conn=$1 >> thread=$2 >> >> wrk -H ‘Connection: Close’ -c ${conn} -t ${thread} -d 10 >> >> Server.sh >> >> sysctl -w net.ipv4.tcp_fastopen=3 >> smc_run nginx >> >> Statistic shows that: >> >> +-----------+---------------+---------------+----------------+--------------+----------------+---------------+---------------+ >> |type|args | -c1 -t1 | -c2 -t1 | -c5 -t1 | -c10 -t1 | -c200 -t1 | -c200 -t4 | -c2000 -t8 | >> +-----------+---------------+---------------+----------------+--------------+----------------+---------------+---------------+ >> |next-next | 4188.5qps | 5942.04qps | 7621.81qps | 7678.62qps | 8204.94qps | 8457.57qps | 5687.60qps | >> +-----------+---------------+---------------+----------------+--------------+----------------+---------------+---------------+ >> |SMC-2RTT | 4730.17qps | 7394.85qps | 11532.78qps | 12016.22qps | 11520.81qps | 11391.36qps | 10364.41qps | >> +-----------+---------------+---------------+----------------+--------------+----------------+---------------+---------------+ >> |SMC-1RTT | 5702.77qps | 9645.18qps | 11899.20qps | 12005.16qps | 11536.67qps | 11420.87qps | 10392.4qps | >> +-----------+---------------+---------------+----------------+--------------+----------------+---------------+---------------+- >> |TCP | 6415.74qps | 11034.10qps | 16716.21qps | 22217.06qps | 35926.74qps | 117460.qps | 120291.16qps | >> +-----------+---------------+---------------+----------------+--------------+----------------+---------------+---------------+ >> >> It can clearly be seen that: >> >> 1. In step by step short-link scenarios ( -c1 -t1 ), SMC-R after >> optimization can reach 88% of TCP. There are still many implementation >> details that can be optimized, we hope to optimize the performance of >> SMC in this scenario to 90% of TCP. >> >> 2. The problem is very serious in the scenario of multi-threading and >> multi-connection, the worst case is only 10% of TCP. Even though the >> SMC-1RTT has certain optimizations for this scenario, it is clear that >> the bottleneck is not here. We are doing some prototyping to solve this, >> we hope to reach 60% of TCP in multi-threading and multi-connection >> scenarios, and SMC-1RTT is the important prerequisite for upper limit of >> subsequent optimization. >> >> In this patch set, we had only completed a simple prototype, only make >> sure SMC-1RTT can works. >> >> Sincerely, we are looking forward for you comments, please >> let us know if you have any suggestions. >> >> Thanks. >> >> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> >> --- --------8< snip >8--------
在 2022/5/25 下午9:42, Alexandra Winter 写道: > Thank you D. Wythe for your proposals, the prototype and measurements. > They sound quite promising to us. > > We need to carefully evaluate them and make sure everything is compatible > with the existing implementations of SMC-D and SMC-R v1 and v2. In the > typical s390 environment ROCE LAG is propably not good enough, as the card > is still a single point of failure. So your ideas need to be compatible > with link redundancy. We also need to consider that the extension of the > protocol does not block other desirable extensions. > > Your prototype is very helpful for the understanding. Before submitting any > code patches to net-next, we should agree on the details of the protocol > extension. Maybe you could formulate your proposal in plain text, so we can > discuss it here? I am very pleased to hear that your team have interest in this proposals, and thanks a lot for your advise. We really appreciate your point of view about compatibility, In fact, we are working on some written drafts which compatibility is quite a important part, and will be shared here soon. > > We also need to inform you that several public holidays are upcoming in the > next weeks and several of our team will be out for summer vacation, so please > allow for longer response times. Thanks for your informing, that's totaly okay to us. May your holidays be full of warmth and cheer. > Kind regards > Alexandra Winter > D. Wyther Thanks.
On Wed, May 25, 2022 at 03:42:28PM +0200, Alexandra Winter wrote: > > > On 24.05.22 09:49, Tony Lu wrote: > > On Tue, May 24, 2022 at 02:52:07PM +0800, D. Wythe wrote: > >> From: "D. Wythe" <alibuda@linux.alibaba.com> > >> > >> Hi Karsten, > >> > >> We are promoting SMC-R to the field of cloud computing, dues to the > >> particularity of business on the cloud, the scale and the types of > >> customer applications are unpredictable. As a participant of SMC-R, we > >> also hope that SMC-R can cover more application scenarios. Therefore, > >> many connection problems are exposed during this time. There are two > >> main issue, one is that the establishment of a single connection takes > >> longer than that of the TCP, another is that the degree of concurrency > >> is low under multi-connection processing. This patch set is mainly > >> optimized for the first issue, and the follow-up of the second issue > >> will be synchronized in the future. > >> > >> In terms of communication process, under current implement, a TCP > >> three-way handshake only needs 1-RTT time, while SMC-R currently > >> requires 4-RTT times, including 2-RTT over IP(TCP handshake, SMC > >> proposal & accept ) and 2-RTT over IB ( two times RKEY exchange), which > >> is most influential factor affecting connection established time at the > >> moment. > >> > >> We have noticed that single network interface card is mainstream on the > >> cloud, dues to the advantages of cloud deployment costs and the cloud's > >> own disaster recovery support. On the other hand, the emergence of RoCE > >> LAG technology makes us no longer need to deal with multiple RDMA > >> network interface cards by ourselves, just like NIC bonding does. In > >> Alibaba, Roce LAG is widely used for RDMA. > > > > I think this is an interesting topic whether we need SMC-level link > > redundancy. I agreed with that RoCE LAG and RDMA in cloud vendors handle > > redundancy and failover in the lower layer, and do it transparently for > > SMC. > > > > So let's move on, if a RDMA device has redundancy ability, we could make > > SMC simpler by give an option for user-space or based on the device > > capability (if we have this flag). This allows under layer to ensure the > > reliability of link group. > > > > As RFC 7609 mentioned, we should do some extra work for reliability to > > add link. It should be an optional work if the device have capability > > for redundancy, and make link group simpler and faster (for the > > so-called SMC-2RTT in this RFC). > > > > I also notice that RFC 7609 is released on August 2015, which is earlier > > than RoCE LAG. RoCE LAG is provided after ConnectX-3/ConnectX-3 Pro in > > kernel 4.0, and is available in 2017. And cloud vendors' RDMA adapters, > > such as Alibaba Elastic RDMA adapter in [1]. > > > > Given that, I propose whether the second link can be used as an option > > in newly created link group. Also, if it is possible, RFC 7609 can be > > updated or extend it for this nowadays case. > > > > Looking forward for your message, Karsten, D. Wythe and folks. > > > > [1] https://lore.kernel.org/linux-rdma/20220523075528.35017-1-chengyou@linux.alibaba.com/ > > > > Thanks, > > Tony Lu > > > Thank you D. Wythe for your proposals, the prototype and measurements. > They sound quite promising to us. > > We need to carefully evaluate them and make sure everything is compatible > with the existing implementations of SMC-D and SMC-R v1 and v2. In the > typical s390 environment ROCE LAG is propably not good enough, as the card > is still a single point of failure. So your ideas need to be compatible > with link redundancy. We also need to consider that the extension of the > protocol does not block other desirable extensions. > > Your prototype is very helpful for the understanding. Before submitting any > code patches to net-next, we should agree on the details of the protocol > extension. Maybe you could formulate your proposal in plain text, so we can > discuss it here? > > We also need to inform you that several public holidays are upcoming in the > next weeks and several of our team will be out for summer vacation, so please > allow for longer response times. > > Kind regards > Alexandra Winter > It's glad to hear this. This gave us a lot of confidence to insist on it, thank you. Cheers, Tony Lu
在 2022/5/25 下午9:42, Alexandra Winter 写道: > We need to carefully evaluate them and make sure everything is compatible > with the existing implementations of SMC-D and SMC-R v1 and v2. In the > typical s390 environment ROCE LAG is propably not good enough, as the card > is still a single point of failure. So your ideas need to be compatible > with link redundancy. We also need to consider that the extension of the > protocol does not block other desirable extensions. > > Your prototype is very helpful for the understanding. Before submitting any > code patches to net-next, we should agree on the details of the protocol > extension. Maybe you could formulate your proposal in plain text, so we can > discuss it here? > > We also need to inform you that several public holidays are upcoming in the > next weeks and several of our team will be out for summer vacation, so please > allow for longer response times. > > Kind regards > Alexandra Winter > Hi alls, In order to achieve signle-link compatibility, we must complete at least once negotiation. We wish to provide higher scalability while meeting this feature. There are few ways to reach this. 1. Use the available reserved bits. According to the SMC v2 protocol, there are at least 28 reserved octets in PROPOSAL MESSAGE and at least 10 reserved octets in ACCEPT MESSAGE are available. We can define an area in which as a feature area, works like bitmap. Considering the subsequent scalability, we MAY use at least 2 reserved ctets, which can support negotiation of at least 16 features. 2. Unify all the areas named extension in current SMC v2 protocol spec without reinterpreting any existing field and field offset changes, including 'PROPOSAL V1 IP Subnet Extension', 'PROPOSAL V2 Extension', 'PROPOSAL SMC-DV2 EXTENSION' .etc. And provides the ability to grow dynamically as needs expand. This scheme will use at least 10 reserved octets in the PROPOSAL MESSAGE and at least 4 reserved octets in ACCEPT MESSAGE and CONFIRM MESSAGE. Fortunately, we only need to use reserved fields, and the current reserved fields are sufficient. And then we can easily add a new extension named SIGNLE LINK. Limited by space, the details will be elaborated after the scheme is finalized. But no matter what scheme is finalized, the workflow should be similar to: Allow Single-link: client server proposal with Single-link feature bit or extension --------> accept with Single-link feature bit extension <-------- confirm --------> Deny or not recognized: client server proposal with Single-link feature bit or extension --------> rkey confirm <------ ------> accept without Single-link feature bit or extension <------ rkey confirm -------> <------ confirm -------> Look forward to your advice and comments. Thanks.
On Wed, Jun 01, 2022 at 02:33:09PM +0800, D. Wythe wrote: > > 在 2022/5/25 下午9:42, Alexandra Winter 写道: > > > We need to carefully evaluate them and make sure everything is compatible > > with the existing implementations of SMC-D and SMC-R v1 and v2. In the > > typical s390 environment ROCE LAG is propably not good enough, as the card > > is still a single point of failure. So your ideas need to be compatible > > with link redundancy. We also need to consider that the extension of the > > protocol does not block other desirable extensions. > > > > Your prototype is very helpful for the understanding. Before submitting any > > code patches to net-next, we should agree on the details of the protocol > > extension. Maybe you could formulate your proposal in plain text, so we can > > discuss it here? > > > > We also need to inform you that several public holidays are upcoming in the > > next weeks and several of our team will be out for summer vacation, so please > > allow for longer response times. > > > > Kind regards > > Alexandra Winter > > > > Hi alls, > > In order to achieve signle-link compatibility, we must > complete at least once negotiation. We wish to provide > higher scalability while meeting this feature. There are > few ways to reach this. > > 1. Use the available reserved bits. According to > the SMC v2 protocol, there are at least 28 reserved octets > in PROPOSAL MESSAGE and at least 10 reserved octets in > ACCEPT MESSAGE are available. We can define an area in which > as a feature area, works like bitmap. Considering the subsequent > scalability, we MAY use at least 2 reserved ctets, which can support > negotiation of at least 16 features. > > 2. Unify all the areas named extension in current > SMC v2 protocol spec without reinterpreting any existing field > and field offset changes, including 'PROPOSAL V1 IP Subnet Extension', > 'PROPOSAL V2 Extension', 'PROPOSAL SMC-DV2 EXTENSION' .etc. And provides > the ability to grow dynamically as needs expand. This scheme will use > at least 10 reserved octets in the PROPOSAL MESSAGE and at least 4 reserved > octets in ACCEPT MESSAGE and CONFIRM MESSAGE. Fortunately, we only need to > use reserved fields, and the current reserved fields are sufficient. And > then we can easily add a new extension named SIGNLE LINK. Limited by space, > the details will be elaborated after the scheme is finalized. After reading this and latest version of protocol, I agree with that the idea to provide a more flexible extension facilities. And, it's a good chance for us to set here talking about the protocol extension. There are some potential scenarios that need flexible extensions in my mind: - other protocols support, such as iWARP / IB or new version protocol, - dozens of feature flags in the future, like this proposal. With the growth of new feature, it could overflow bitmap. Actually, this extension facilities are very similar to TCP options. So what about your opinions about the solution of this? If there are some existed approaches for the future extensions, maybe this can get involved in it. Or we can start a discuss about this as this mail mentioned. Also, I am wondering if there is plan to update the RFC7609, add the latest v2 support? Thanks, Tony Lu
On 01.06.22 11:24, Tony Lu wrote: > On Wed, Jun 01, 2022 at 02:33:09PM +0800, D. Wythe wrote: >> >> 在 2022/5/25 下午9:42, Alexandra Winter 写道: >> >>> We need to carefully evaluate them and make sure everything is compatible >>> with the existing implementations of SMC-D and SMC-R v1 and v2. In the >>> typical s390 environment ROCE LAG is propably not good enough, as the card >>> is still a single point of failure. So your ideas need to be compatible >>> with link redundancy. We also need to consider that the extension of the >>> protocol does not block other desirable extensions. >>> >>> Your prototype is very helpful for the understanding. Before submitting any >>> code patches to net-next, we should agree on the details of the protocol >>> extension. Maybe you could formulate your proposal in plain text, so we can >>> discuss it here? >>> >>> We also need to inform you that several public holidays are upcoming in the >>> next weeks and several of our team will be out for summer vacation, so please >>> allow for longer response times. >>> >>> Kind regards >>> Alexandra Winter >>> >> >> Hi alls, >> >> In order to achieve signle-link compatibility, we must >> complete at least once negotiation. We wish to provide >> higher scalability while meeting this feature. There are >> few ways to reach this. >> >> 1. Use the available reserved bits. According to >> the SMC v2 protocol, there are at least 28 reserved octets >> in PROPOSAL MESSAGE and at least 10 reserved octets in >> ACCEPT MESSAGE are available. We can define an area in which >> as a feature area, works like bitmap. Considering the subsequent >> scalability, we MAY use at least 2 reserved ctets, which can support >> negotiation of at least 16 features. >> >> 2. Unify all the areas named extension in current >> SMC v2 protocol spec without reinterpreting any existing field >> and field offset changes, including 'PROPOSAL V1 IP Subnet Extension', >> 'PROPOSAL V2 Extension', 'PROPOSAL SMC-DV2 EXTENSION' .etc. And provides >> the ability to grow dynamically as needs expand. This scheme will use >> at least 10 reserved octets in the PROPOSAL MESSAGE and at least 4 reserved >> octets in ACCEPT MESSAGE and CONFIRM MESSAGE. Fortunately, we only need to >> use reserved fields, and the current reserved fields are sufficient. And >> then we can easily add a new extension named SIGNLE LINK. Limited by space, >> the details will be elaborated after the scheme is finalized. > > After reading this and latest version of protocol, I agree with that the > idea to provide a more flexible extension facilities. And, it's a good > chance for us to set here talking about the protocol extension. > > There are some potential scenarios that need flexible extensions in my > mind: > - other protocols support, such as iWARP / IB or new version protocol, > - dozens of feature flags in the future, like this proposal. With the > growth of new feature, it could overflow bitmap. > > Actually, this extension facilities are very similar to TCP options. > > So what about your opinions about the solution of this? If there are > some existed approaches for the future extensions, maybe this can get > involved in it. Or we can start a discuss about this as this mail > mentioned. > > Also, I am wondering if there is plan to update the RFC7609, add the > latest v2 support? > > Thanks, > Tony Lu We have asked the SMC protocol owners about their opinion about using the reserved fields for new options in particular, and about where and how to discuss this in general. (including where to document the versions). Please allow some time for us to come back to you. Kind regards Alexandra
On Wed, Jun 01, 2022 at 01:35:52PM +0200, Alexandra Winter wrote: > > > On 01.06.22 11:24, Tony Lu wrote: > > On Wed, Jun 01, 2022 at 02:33:09PM +0800, D. Wythe wrote: > >> > >> 在 2022/5/25 下午9:42, Alexandra Winter 写道: > >> > >>> We need to carefully evaluate them and make sure everything is compatible > >>> with the existing implementations of SMC-D and SMC-R v1 and v2. In the > >>> typical s390 environment ROCE LAG is propably not good enough, as the card > >>> is still a single point of failure. So your ideas need to be compatible > >>> with link redundancy. We also need to consider that the extension of the > >>> protocol does not block other desirable extensions. > >>> > >>> Your prototype is very helpful for the understanding. Before submitting any > >>> code patches to net-next, we should agree on the details of the protocol > >>> extension. Maybe you could formulate your proposal in plain text, so we can > >>> discuss it here? > >>> > >>> We also need to inform you that several public holidays are upcoming in the > >>> next weeks and several of our team will be out for summer vacation, so please > >>> allow for longer response times. > >>> > >>> Kind regards > >>> Alexandra Winter > >>> > >> > >> Hi alls, > >> > >> In order to achieve signle-link compatibility, we must > >> complete at least once negotiation. We wish to provide > >> higher scalability while meeting this feature. There are > >> few ways to reach this. > >> > >> 1. Use the available reserved bits. According to > >> the SMC v2 protocol, there are at least 28 reserved octets > >> in PROPOSAL MESSAGE and at least 10 reserved octets in > >> ACCEPT MESSAGE are available. We can define an area in which > >> as a feature area, works like bitmap. Considering the subsequent > >> scalability, we MAY use at least 2 reserved ctets, which can support > >> negotiation of at least 16 features. > >> > >> 2. Unify all the areas named extension in current > >> SMC v2 protocol spec without reinterpreting any existing field > >> and field offset changes, including 'PROPOSAL V1 IP Subnet Extension', > >> 'PROPOSAL V2 Extension', 'PROPOSAL SMC-DV2 EXTENSION' .etc. And provides > >> the ability to grow dynamically as needs expand. This scheme will use > >> at least 10 reserved octets in the PROPOSAL MESSAGE and at least 4 reserved > >> octets in ACCEPT MESSAGE and CONFIRM MESSAGE. Fortunately, we only need to > >> use reserved fields, and the current reserved fields are sufficient. And > >> then we can easily add a new extension named SIGNLE LINK. Limited by space, > >> the details will be elaborated after the scheme is finalized. > > > > After reading this and latest version of protocol, I agree with that the > > idea to provide a more flexible extension facilities. And, it's a good > > chance for us to set here talking about the protocol extension. > > > > There are some potential scenarios that need flexible extensions in my > > mind: > > - other protocols support, such as iWARP / IB or new version protocol, > > - dozens of feature flags in the future, like this proposal. With the > > growth of new feature, it could overflow bitmap. > > > > Actually, this extension facilities are very similar to TCP options. > > > > So what about your opinions about the solution of this? If there are > > some existed approaches for the future extensions, maybe this can get > > involved in it. Or we can start a discuss about this as this mail > > mentioned. > > > > Also, I am wondering if there is plan to update the RFC7609, add the > > latest v2 support? > > > > Thanks, > > Tony Lu > > We have asked the SMC protocol owners about their opinion about using the > reserved fields for new options in particular, and about where and how to > discuss this in general. (including where to document the versions). > Please allow some time for us to come back to you. > > Kind regards > Alexandra Thank you for the information. Before we officially push the document update, if you had any suggestions for the two schemes we are mentioned above, or which one you prefer, please keep us informed. Best wishes. D. Wyther
On 2022/6/1 下午2:33, D. Wythe wrote: > > 在 2022/5/25 下午9:42, Alexandra Winter 写道: > >> We need to carefully evaluate them and make sure everything is compatible >> with the existing implementations of SMC-D and SMC-R v1 and v2. In the >> typical s390 environment ROCE LAG is propably not good enough, as the card >> is still a single point of failure. So your ideas need to be compatible >> with link redundancy. We also need to consider that the extension of the >> protocol does not block other desirable extensions. >> >> Your prototype is very helpful for the understanding. Before submitting any >> code patches to net-next, we should agree on the details of the protocol >> extension. Maybe you could formulate your proposal in plain text, so we can >> discuss it here? >> >> We also need to inform you that several public holidays are upcoming in the >> next weeks and several of our team will be out for summer vacation, so please >> allow for longer response times. >> >> Kind regards >> Alexandra Winter >> > > Hi alls, > > In order to achieve signle-link compatibility, we must > complete at least once negotiation. We wish to provide > higher scalability while meeting this feature. There are > few ways to reach this. > > 1. Use the available reserved bits. According to > the SMC v2 protocol, there are at least 28 reserved octets > in PROPOSAL MESSAGE and at least 10 reserved octets in > ACCEPT MESSAGE are available. We can define an area in which > as a feature area, works like bitmap. Considering the subsequent scalability, we MAY use at least 2 reserved ctets, which can support negotiation of at least 16 features. > > 2. Unify all the areas named extension in current > SMC v2 protocol spec without reinterpreting any existing field > and field offset changes, including 'PROPOSAL V1 IP Subnet Extension', > 'PROPOSAL V2 Extension', 'PROPOSAL SMC-DV2 EXTENSION' .etc. And provides > the ability to grow dynamically as needs expand. This scheme will use > at least 10 reserved octets in the PROPOSAL MESSAGE and at least 4 reserved octets in ACCEPT MESSAGE and CONFIRM MESSAGE. Fortunately, we only need to use reserved fields, and the current reserved fields are sufficient. And then we can easily add a new extension named SIGNLE LINK. Limited by space, the details will be elaborated after the scheme is finalized. > > But no matter what scheme is finalized, the workflow should be similar to: > > Allow Single-link: > > client server > proposal with Single-link feature bit or extension > --------> > > accept with Single-link feature bit extension > <-------- > > confirm > --------> > > > Deny or not recognized: > > client server > proposal with Single-link feature bit or extension > --------> > > rkey confirm > <------ > ------> > > accept without Single-link feature bit or extension > <------ > > rkey confirm > -------> > <------ > > confirm > -------> > > > Look forward to your advice and comments. > > Thanks. Hi all, On the basis of previous,If we can put the application data over the PROPOSAL message, we can achieve SMC 0-RTT. Its process should be similar to the following: client server PROPOSAL MESSAGE with first contact with 0RTT query extension --------> ACCEPT MESSAGE with(or without) 0RTT response extension <-------- CONFIRM MESSAGE --------> client server PROPOSAL MESSAGE without first contact with ORTT Data --------> ACCEPT MESSAGE <--------- CONFIRM MESSAGE --------> If so, using reserved bit to exchange feature are not enough. We have a simple design to perform compatibility with legacy extensions and support future extensions. This draft try to unify all the areas named extension in current SMC v2 protocol spec, includes 'PROPOSAL V1 IP Subnet Extension', 'PROPOSAL V2 Extension', 'PROPOSAL SMC-DV2 EXTENSION', and 'First Contact Extension'. This draft does lots of compromise designs in order to achieve compatibility. I believe there must have better ways. Let me get the ball rolling. And please let me know if you have any suggestions or better ideas. This draft of the design is as follows: SMC V2 CLC PROPOSAL MESSAGE: +------+-------+------------------------------------------------------------+ |0 50 |NOT changed | +------+-------+------------------------------------------------------------+ |50 |2 |SMC Version 2 Extension Offset(applicable when SMC V2) | +------+-------+------------------------------------------------------------+ |52 |19 |Reserved for growth | +------+-------+------------------------------------------------------------+ |71 |* |Extension Area (reserved before) | +------+-------+------------------------------------------------------------+ |71 |2 |number of Extensions (reserved before) | +------+-------+------------------------------------------------------------+ |73 |7 |V1 IP Subnet Extension Header (when applicable) | +------+-------+------------------------------------------------------------+ |73 |7 |Padding Extension (when V1 IP Subnet Extension not present) | +------+-------+------------------------------------------------------------+ |80 |* |V1 IP Subnet Extension Payload (when applicable) | +------+-------+------------------------------------------------------------+ | | |V2 Extension (when applicable) | +------+-------+------------------------------------------------------------+ | | |other available Extension (when applicable) | +------+-------+------------------------------------------------------------+ |* |4 |Eye catcher ‘SMCR’ (EBCDIC) message end | +------+-------+------------------------------------------------------------+ Notes: 1. In the current implementation, server read the proposal message with fixed length, areas beyond the length will be silently ignored, and server will give up to check eye catcher. Therefore, It's safe to extend the message from the tail. 2. (reserved before) means that the areas used to be reserved. 3. none of the existing fields have their offsets changed within the PROPOSAL message. Extension Areas Format: +------+-------+-----------------------+ |0 |* |Extensions Area | +------+-------+-----------------------+ |0 |2 |Number of Extensions | +------+-------+-----------------------+ |2 |* |Extensions | +------+-------+-----------------------+ | | |End of Extensions Area | +------+-------+-----------------------+ notes: 1. All extensions within the extension areas should be contiguous. Extension Format: +------+-------+----------------------------------------+ |0 |* |Extension | +------+-------+----------------------------------------+ |0 |6+ |Extension header | +------+-------+----------------------------------------+ |0 |4 |reserved | +------+-------+----------------------------------------+ |2 |* |Extension Type (variable length) | +------+-------+----------------------------------------+ |* |* |Payload Length (variable length) | +------+-------+----------------------------------------+ |* |* |payload | +------+-------+----------------------------------------+ notes: 1. This scheme was specially designed to be compatible with 'PROPOSAL V2 Extension', since it is the only extension with no reserved octets ahead of it. 2. Another special case is 'PROPOSAL SMC-DV2 EXTENSION', it's also has no reserved octets ahead of it, but it can be treats as an optional part of 'PROPOSAL V2 Extension'. 3. To be compatible with 'PROPOSAL V2 Extension', there are only 2 reserved octets left to place type and length fields. If octet per each fileds, there can be only a maximum of 255 extension types and a maximum length of 255. For better scalability, the type and length fields are encoded as variable length integer. variable length integer encoding: +--------------+-------+---------------+--------+ |first bit |octet |Usable Bits |Range | +--------------+-------+---------------+--------+ |0 |1 |7 |0-127 | +--------------+-------+---------------+--------+ |1 |2 |15 |0-32767 | +--------------+-------+---------------+--------+ notes; 1. This design introduces some complexity and we can totally give it up if we do not need more than 255 extensions at all. V1 IP Subnet Extension Format: +------+-------+-------------------------------------------------+ |0 |7 |Extension Header | +------+-------+-------------------------------------------------+ |0 |4 |Reserved | +------+-------+-------------------------------------------------+ |4 |1 |Extension type(0x2) | +------+-------+-------------------------------------------------+ |5 |2 |payload length | +------+-------+-------------------------------------------------+ |7 |* |V1 IP Subnet Extension Payload | +------+-------+-------------------------------------------------+ |7 |5 |Client IPv4 Subnet Mask (IPv4 only) | +------+-------+-------------------------------------------------+ |7 |4 |Subnet Mask | +------+-------+-------------------------------------------------+ |9 |2 |Reserved | +------+-------+-------------------------------------------------+ |11 |* |Client IPv6 Prefix Array (zero for IPv4) | +------+-------+-------------------------------------------------+ |11 |1 |Number of IPv6 Prefixes in Prefix array (1 - 8) | +------+-------+-------------------------------------------------+ |12 |* |Prefix Array, variable length array | +------+-------+-------------------------------------------------+ notes: 1. newly V1 IP Subnet Extension borrows 7 octets from the reserved fields in the upper near part to form a completed extension. 2. none of the existing fields have their offsets changed within the PROPOSAL message. Padding Extension Format: +------+-------+-------------------------------+ |0 |2 |Reserved | +------+-------+-------------------------------+ |2 |1 |Extension type(0x0) | +------+-------+-------------------------------+ |3 |* |Payload length | +------+-------+-------------------------------+ |* |* |Padding (fill with 0x0) | +------+-------+-------------------------------+ notes: 1. Padding Extension is used to fill reserved areas that have not been used yet. It doesn't mean anything, and can be replaced in the future. SMCv2 EXTENSION Format: +------+-------+------------------------------------------------------------+ |0 |8 |SMCv2 Extension - Client Options Area (SMCRv2 & SMCDv2) | +------+-------+------------------------------------------------------------+ |0 |8 |SMCv2 Extension - Client Options Area Header | +------+-------+------------------------------------------------------------+ |0 |1 |EID Number | +------+-------+------------------------------------------------------------+ |1 |1 |ISMv2 GID Number | +------+-------+------------------------------------------------------------+ |2 |1 |Flag 1 (bit 8) - Reserved | +------+-------+------------------------------------------------------------+ |3 |1 |Flag 2 (bit 8) | +------+-------+------------------------------------------------------------+ |4 |2 |Extension Header (reserved before) | +------+-------+------------------------------------------------------------+ |4 |1 |Extension type(0x3) | +------+-------+------------------------------------------------------------+ |5 |1 |payload length (range 0-127) | +------+-------+------------------------------------------------------------+ |6 |2 |SMCDv2 Extension Offset (if present) | +------+-------+------------------------------------------------------------+ |8 |16 |RoCEv2 GID (IPv4 or IPv6 address) | +------+-------+------------------------------------------------------------+ |8 |16 |RoCEv2 GID IPv6 address (when IPv6) | +------+-------+------------------------------------------------------------+ |8 |12 |RoCEv2 GID IPv4 reserved (when IPv4) | +------+-------+------------------------------------------------------------+ |20 |4 |RoCEv2 GID IPv4 address (right aligned) | +------+-------+------------------------------------------------------------+ |24 |9 |Reserved | +------+-------+------------------------------------------------------------+ |33 |7 |Continuation extension (reserved before) | +------+-------+------------------------------------------------------------+ |33 |4 |Reserved | +------+-------+------------------------------------------------------------+ |37 |1 |Extension type(0x1) (reserved before) | +------+-------+------------------------------------------------------------+ |38 |2 |Payload length (reserved before) | +------+-------+------------------------------------------------------------+ |40 |* |EID Array Area – variable length (32 bytes * EID Number) | +------+-------+------------------------------------------------------------+ |* |* |SMCDv2 optional area (used to called SMCDv2 extension) | +------+-------+------------------------------------------------------------+ notes: 1. newly V2 EXTENSION use several reserved octets to form a completed extension. Note that none of the existing fields have their offsets changed within the PROPOSAL message. 2. the size of SMCv2 EXTENSION plus maximum size of EID Array Area is much bigger than the highest number that one octet can represent. To be compatible with 'legacy V2 Extension', there are only 2 reserved octets left to place type and length fields. therefore, we use Continuation Extension to solve it. Continuation Extension Format: +------+-------+-------------------------------+ |0 |4 |Reserved | +------+-------+-------------------------------+ |4 |1 |Extension type(0x1) | +------+-------+-------------------------------+ |5 |* |payload length | +------+-------+-------------------------------+ |* |* |Continuation data | +------+-------+-------------------------------+ notes: 1. Indicate that the content of this extension is continuation of the content of its previous extension. 2. In order to be compatible with some existing extensions, when the reserved bytes that can be used are not enough to represent its maximum length. CLC ACCEPT MESSAGE (SMC-DV2 FORMAT) / CLC CONFIRM MESSAGE (SMC-Dv2 FORMAT) +------+-------+---------------------------------------------------+ |34 |32 |EID (Negotiated Common EID selected by the server) | +------+-------+---------------------------------------------------+ |66 |4 |Reserved | +------+-------+---------------------------------------------------+ |70 |* |Extensions Area | +------+-------+---------------------------------------------------+ |70 |2 |number of Extensions (reserved before) | +------+-------+---------------------------------------------------+ |72 |38 |First Contact Extension - | | | |only present when first contact flag is on | +------+-------+---------------------------------------------------+ |72 |6 |First Contact Extension Header | +------+-------+---------------------------------------------------+ |72 |2 |Reserved | +------+-------+---------------------------------------------------+ |74 |4 |FCE Header | +------+-------+---------------------------------------------------+ |74 |1 |FCE Header - reserved | +------+-------+---------------------------------------------------+ |75 |1 |FCE Header Flag 1 (bit 8) | +------+-------+---------------------------------------------------+ |76 |1 |Extension type (0x4) (reserved before) | +------+-------+---------------------------------------------------+ |77 |1 |Payload length (0x20) (reserved before) | +------+-------+---------------------------------------------------+ |78 |32 |FCE Peer Host Name | +------+-------+---------------------------------------------------+ |110 |* |other available Extension (when applicable) | +------+-------+---------------------------------------------------+ |* |4 |Eye catcher ‘SMCD’ (EBCDIC) message end | +------+-------+---------------------------------------------------+ CLC ACCEPT MESSAGE (SMC-RV2 FORMAT) +------+-------+---------------------------------------------------+ |64 |32 |EID (Negotiated EID selected by server) | +------+-------+---------------------------------------------------+ |96 |4 |Reserved + +------+-------+---------------------------------------------------+ |100 |* |Extension Area | +------+-------+---------------------------------------------------+ |100 |2 |number of Extension (reserved before) | +------+-------+---------------------------------------------------+ |102 |38 |First Contact Extension - | | | |only present when first contact flag is on | +------+-------+---------------------------------------------------+ |102 |6 |First Contact Extension Header | +------+-------+---------------------------------------------------+ |102 |2 |reserved | +------+-------+---------------------------------------------------+ |104 |4 |FCE Header | +------+-------+---------------------------------------------------+ |104 |1 |FCE Header - reserved | +------+-------+---------------------------------------------------+ |105 |1 |FCE Header Flag 1 (bit 8) | +------+-------+---------------------------------------------------+ |106 |1 |Extension type (0x4) (reserved before) | +------+-------+---------------------------------------------------+ |107 |1 |Payload length (0x20) (reserved before) | +------+-------+---------------------------------------------------+ |108 |32 |FCE Peer Host Name | +------+-------+---------------------------------------------------+ |140 |16 |Padding Extension (reserved before) | +------+-------+---------------------------------------------------+ |156 |* |other available Extension (when applicable) | +------+-------+---------------------------------------------------+ |* |4 |Eye catcher ‘SMCR’ (EBCDIC) message end | +------+-------+---------------------------------------------------+ notes: 1. none of the existing fields have their offsets changed within the message. First Contact Extension Format: +------+-------+----------------------------------------------------------------+ |0 |6 |First Contact Extension Header | +------+-------+----------------------------------------------------------------+ |0 |2 |Reserved | +------+-------+----------------------------------------------------------------+ |2 |1 |FCE Header Flag 0 | +------+-------+----------------------------------------------------------------+ |3 |1 |FCE Header Flag 1 | +------+-------+----------------------------------------------------------------+ |4 |1 |Extension type (0x4) (reserved before) | +------+-------+----------------------------------------------------------------+ |5 |1 |Payload length (0x20) (reserved before) | +------+-------+----------------------------------------------------------------+ |6 |32 |FCE Peer Host Name (ASCII character - padded with ASCII blanks) | +------+-------+----------------------------------------------------------------+ notes: 1. newly First Contact Extension borrows 2 octets from the reserved fields in the upper near part to form a completed extension. CLC CONFIRM MESSAGE (SMC-RV2 FORMAT) +------+-------+---------------------------------------------------+ |64 |32 |EID (Negotiated EID selected by server) | +------+-------+---------------------------------------------------+ |96 |4 |Reserved | +------+-------+---------------------------------------------------+ |100 |* |Extension Area | +------+-------+---------------------------------------------------+ |100 |2 |number of Extension (reserved before) | +------+-------+---------------------------------------------------+ |102 |38 |First Contact Extension - | | | only present when first contact flag is on | +------+-------+---------------------------------------------------+ |102 |6 |First Contact Extension Header | +------+-------+---------------------------------------------------+ |102 |2 |reserved | +------+-------+---------------------------------------------------+ |104 |4 |FCE Header | +------+-------+---------------------------------------------------+ |104 |1 |FCE Header - reserved | +------+-------+---------------------------------------------------+ |105 |1 |FCE Header Flag 1 (bit 8) | +------+-------+---------------------------------------------------+ |106 |1 |Extension type (0x4) (reserved before) | +------+-------+---------------------------------------------------+ |107 |1 |Payload length (0x20) | +------+-------+---------------------------------------------------+ |108 |32 |FCE Peer Host Name | +------+-------+---------------------------------------------------+ |140 |9 |PADDING extension (reserved before) | +------+-------+---------------------------------------------------+ |149 |* |Client RoCEv2 GID Extension | +------+-------+---------------------------------------------------+ |149 |7 |Client RoCEv2 GID Extension Header(reserved before)| +------+-------+---------------------------------------------------+ |156 |* |FCE Client RoCEv2 GID List | +------+-------+---------------------------------------------------+ |* |* |other available Extension (when applicable) | +------+-------+---------------------------------------------------+ |* |4 |Eye catcher ‘SMCR’ (EBCDIC) message end | +------+-------+---------------------------------------------------+ notes: 1. Client RoCEv2 GID was once part of the First Contact Extension, and now it's standalone extension. Client RoCEv2 GID Extension Format: +-+----+-------+---------------------------------------------------+ |0 |* |Client RoCEv2 GID | +------+-------+---------------------------------------------------+ |0 |4 |Reserved | +------+-------+---------------------------------------------------+ |4 |1 |Extension type(0x5) | +------+-------+---------------------------------------------------+ |5 |2 |Payload length | +------+-------+---------------------------------------------------+ |7 |* |Client RoCEv2 GID List | +------+-------+---------------------------------------------------+ |7 |4 |GID List Header | +------+-------+---------------------------------------------------+ |7 |1 |GID List No of Entries (1 - 8) | +------+-------+---------------------------------------------------+ |8 |3 |Reserved | +------+-------+---------------------------------------------------+ |11 |* |GID List Array Area | +------+-------+---------------------------------------------------+ |11 |16 |GID List Entry - RoCEv2 IP address (IPv4 or IPv6) | +------+-------+---------------------------------------------------+ +* |* |End of Client GID List | +------+-------+---------------------------------------------------+ notes: 1. newly Client RoCEv2 GID Extension borrows 7 octets from the reserved fields in the upper near part to form a completed extension. 2. none of the existing fields have their offsets changed within the CONFIRM message.
On 16.06.22 15:49, D. Wythe wrote: > > > On 2022/6/1 下午2:33, D. Wythe wrote: >> >> 在 2022/5/25 下午9:42, Alexandra Winter 写道: >> >>> We need to carefully evaluate them and make sure everything is compatible >>> with the existing implementations of SMC-D and SMC-R v1 and v2. In the >>> typical s390 environment ROCE LAG is propably not good enough, as the card >>> is still a single point of failure. So your ideas need to be compatible >>> with link redundancy. We also need to consider that the extension of the >>> protocol does not block other desirable extensions. >>> >>> Your prototype is very helpful for the understanding. Before submitting any >>> code patches to net-next, we should agree on the details of the protocol >>> extension. Maybe you could formulate your proposal in plain text, so we can >>> discuss it here? >>> >>> We also need to inform you that several public holidays are upcoming in the >>> next weeks and several of our team will be out for summer vacation, so please >>> allow for longer response times. >>> >>> Kind regards >>> Alexandra Winter >>> >> >> Hi alls, >> >> In order to achieve signle-link compatibility, we must >> complete at least once negotiation. We wish to provide >> higher scalability while meeting this feature. There are >> few ways to reach this. >> >> 1. Use the available reserved bits. According to >> the SMC v2 protocol, there are at least 28 reserved octets >> in PROPOSAL MESSAGE and at least 10 reserved octets in >> ACCEPT MESSAGE are available. We can define an area in which >> as a feature area, works like bitmap. Considering the subsequent scalability, we MAY use at least 2 reserved ctets, which can support negotiation of at least 16 features. >> >> 2. Unify all the areas named extension in current >> SMC v2 protocol spec without reinterpreting any existing field >> and field offset changes, including 'PROPOSAL V1 IP Subnet Extension', >> 'PROPOSAL V2 Extension', 'PROPOSAL SMC-DV2 EXTENSION' .etc. And provides >> the ability to grow dynamically as needs expand. This scheme will use >> at least 10 reserved octets in the PROPOSAL MESSAGE and at least 4 reserved octets in ACCEPT MESSAGE and CONFIRM MESSAGE. Fortunately, we only need to use reserved fields, and the current reserved fields are sufficient. And then we can easily add a new extension named SIGNLE LINK. Limited by space, the details will be elaborated after the scheme is finalized. >> [...] >> >> >> Look forward to your advice and comments. >> >> Thanks. > > Hi all, > > On the basis of previous,If we can put the application data over the PROPOSAL message, > we can achieve SMC 0-RTT. Its process should be similar to the following: > [...] Thank you D. Wythe for the detailed proposal, I have forwarded it to the protocol owner and we are currently reviewing it. We may contact you and Tony Lu directly to discuss the details, if that is ok for you. Kind regards Alexandra Winter
On 2022/6/23 下午7:59, Alexandra Winter wrote: > > > On 16.06.22 15:49, D. Wythe wrote: >> >> >> On 2022/6/1 下午2:33, D. Wythe wrote: >>> >>> 在 2022/5/25 下午9:42, Alexandra Winter 写道: >>> >>>> We need to carefully evaluate them and make sure everything is compatible >>>> with the existing implementations of SMC-D and SMC-R v1 and v2. In the >>>> typical s390 environment ROCE LAG is propably not good enough, as the card >>>> is still a single point of failure. So your ideas need to be compatible >>>> with link redundancy. We also need to consider that the extension of the >>>> protocol does not block other desirable extensions. >>>> >>>> Your prototype is very helpful for the understanding. Before submitting any >>>> code patches to net-next, we should agree on the details of the protocol >>>> extension. Maybe you could formulate your proposal in plain text, so we can >>>> discuss it here? >>>> >>>> We also need to inform you that several public holidays are upcoming in the >>>> next weeks and several of our team will be out for summer vacation, so please >>>> allow for longer response times. >>>> >>>> Kind regards >>>> Alexandra Winter >>>> >>> >>> Hi alls, >>> >>> In order to achieve signle-link compatibility, we must >>> complete at least once negotiation. We wish to provide >>> higher scalability while meeting this feature. There are >>> few ways to reach this. >>> >>> 1. Use the available reserved bits. According to >>> the SMC v2 protocol, there are at least 28 reserved octets >>> in PROPOSAL MESSAGE and at least 10 reserved octets in >>> ACCEPT MESSAGE are available. We can define an area in which >>> as a feature area, works like bitmap. Considering the subsequent scalability, we MAY use at least 2 reserved ctets, which can support negotiation of at least 16 features. >>> >>> 2. Unify all the areas named extension in current >>> SMC v2 protocol spec without reinterpreting any existing field >>> and field offset changes, including 'PROPOSAL V1 IP Subnet Extension', >>> 'PROPOSAL V2 Extension', 'PROPOSAL SMC-DV2 EXTENSION' .etc. And provides >>> the ability to grow dynamically as needs expand. This scheme will use >>> at least 10 reserved octets in the PROPOSAL MESSAGE and at least 4 reserved octets in ACCEPT MESSAGE and CONFIRM MESSAGE. Fortunately, we only need to use reserved fields, and the current reserved fields are sufficient. And then we can easily add a new extension named SIGNLE LINK. Limited by space, the details will be elaborated after the scheme is finalized. >>> > [...] >>> >>> >>> Look forward to your advice and comments. >>> >>> Thanks. >> >> Hi all, >> >> On the basis of previous,If we can put the application data over the PROPOSAL message, >> we can achieve SMC 0-RTT. Its process should be similar to the following: >> > [...] > > Thank you D. Wythe for the detailed proposal, I have forwarded it to the protocol owner > and we are currently reviewing it. > We may contact you and Tony Lu directly to discuss the details, if that is ok for you. > > Kind regards > Alexandra Winter > > > > Thanks a lot for your support, it seems good to us. We are totally okay with that. Best Wishes. D. Wythe
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c index 1a556f4..bf646d1 100644 --- a/net/smc/af_smc.c +++ b/net/smc/af_smc.c @@ -492,7 +492,7 @@ static int smcr_lgr_reg_rmbs(struct smc_link *link, struct smc_buf_desc *rmb_desc) { struct smc_link_group *lgr = link->lgr; - int i, rc = 0; + int i, lnk = 0, rc = 0; rc = smc_llc_flow_initiate(lgr, SMC_LLC_FLOW_RKEY); if (rc) @@ -507,14 +507,20 @@ static int smcr_lgr_reg_rmbs(struct smc_link *link, rc = smcr_link_reg_rmb(&lgr->lnk[i], rmb_desc); if (rc) goto out; + /* available link count inc */ + lnk++; } - /* exchange confirm_rkey msg with peer */ - rc = smc_llc_do_confirm_rkey(link, rmb_desc); - if (rc) { - rc = -EFAULT; - goto out; + /* do not exchange confirm_rkey msg since there are only one link */ + if (lnk > 1) { + /* exchange confirm_rkey msg with peer */ + rc = smc_llc_do_confirm_rkey(link, rmb_desc); + if (rc) { + rc = -EFAULT; + goto out; + } } + rmb_desc->is_conf_rkey = true; out: mutex_unlock(&lgr->llc_conf_mutex); @@ -932,6 +938,31 @@ static int smc_find_rdma_device(struct smc_sock *smc, struct smc_init_info *ini) return 0; } +/* just prototype code + * since tcp connect has not happen, using route to perform smc_pnet_find_roce_by_pnetid + */ +static int smc_find_rdma_device_with_dst(struct smc_sock *smc, struct smc_init_info *ini) +{ + struct sock *tsk = smc->clcsock->sk; + struct rtable *rt; + + rt = ip_route_output(sock_net(tsk), smc->remote_address.v4.sin_addr.s_addr, + 0, 0, 0); + + if (IS_ERR(rt)) + return -ECONNRESET; + + smc_pnet_find_roce_by_pnetid(rt->dst.dev, ini); + __builtin_prefetch(&ini->ib_dev->mac[ini->ib_port - 1]); + + if (!ini->check_smcrv2 && !ini->ib_dev) + return SMC_CLC_DECL_NOSMCRDEV; + if (ini->check_smcrv2 && !ini->smcrv2.ib_dev_v2) + return SMC_CLC_DECL_NOSMCRDEV; + + return 0; +} + /* check if there is an ISM device available for this connection. */ /* called for connect and listen */ static int smc_find_ism_device(struct smc_sock *smc, struct smc_init_info *ini) @@ -1019,13 +1050,17 @@ static int smc_find_proposal_devices(struct smc_sock *smc, /* check if there is an rdma device available */ if (!(ini->smcr_version & SMC_V1) || - smc_find_rdma_device(smc, ini)) + smc_find_rdma_device_with_dst(smc, ini)) ini->smcr_version &= ~SMC_V1; /* else RDMA is supported for this connection */ ini->smc_type_v1 = smc_indicated_type(ini->smcd_version & SMC_V1, ini->smcr_version & SMC_V1); + /* just prototype, do this for simple */ + ini->smc_type_v2 = SMC_TYPE_N; + return rc; + /* check if there is an ism v2 device available */ if (!(ini->smcd_version & SMC_V2) || !smc_ism_is_v2_capable() || @@ -1492,11 +1527,7 @@ static void smc_connect_work(struct work_struct *work) smc->sk.sk_err = smc->clcsock->sk->sk_err; } else if ((1 << smc->clcsock->sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) { - rc = sk_stream_wait_connect(smc->clcsock->sk, &timeo); - if ((rc == -EPIPE) && - ((1 << smc->clcsock->sk->sk_state) & - (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT))) - rc = 0; + rc = 0; } release_sock(smc->clcsock->sk); lock_sock(&smc->sk); @@ -1580,9 +1611,10 @@ static int smc_connect(struct socket *sock, struct sockaddr *addr, rc = -EALREADY; goto out; } - rc = kernel_connect(smc->clcsock, addr, alen, flags); - if (rc && rc != -EINPROGRESS) - goto out; + + /* copy remote address backup */ + memcpy(&smc->remote_address.ss, addr, alen); + rc = -EINPROGRESS; if (smc->use_fallback) { sock->state = rc ? SS_CONNECTING : SS_CONNECTED; @@ -2452,9 +2484,17 @@ static int smc_listen(struct socket *sock, int backlog) { struct sock *sk = sock->sk; struct smc_sock *smc; - int rc; + int rc, val; smc = smc_sk(sk); + + /* enable server clcsock tcp fastopen. + * just a proto type code, magic number 5 for no reason + */ + val = 5; + smc->clcsock->ops->setsockopt(smc->clcsock, SOL_TCP, + TCP_FASTOPEN, KERNEL_SOCKPTR(&val), sizeof(val)); + lock_sock(sk); rc = -EINVAL; diff --git a/net/smc/smc.h b/net/smc/smc.h index 5ed765e..ef18894 100644 --- a/net/smc/smc.h +++ b/net/smc/smc.h @@ -261,6 +261,14 @@ struct smc_sock { /* smc sock container */ int fallback_rsn; /* reason for fallback */ u32 peer_diagnosis; /* decline reason from peer */ atomic_t queued_smc_hs; /* queued smc handshakes */ + + union { + struct sockaddr addr; + struct sockaddr_in v4; + struct sockaddr_in6 v6; + struct sockaddr_storage ss; + } remote_address; + struct inet_connection_sock_af_ops af_ops; const struct inet_connection_sock_af_ops *ori_af_ops; /* original af ops */ diff --git a/net/smc/smc_clc.c b/net/smc/smc_clc.c index f9f3f59..f944c67 100644 --- a/net/smc/smc_clc.c +++ b/net/smc/smc_clc.c @@ -20,6 +20,7 @@ #include <net/addrconf.h> #include <net/sock.h> #include <net/tcp.h> +#include <net/route.h> #include "smc.h" #include "smc_core.h" @@ -486,8 +487,7 @@ static int smc_clc_prfx_set4_rcu(struct dst_entry *dst, __be32 ipv4, return -ENODEV; in_dev_for_each_ifa_rcu(ifa, in_dev) { - if (!inet_ifa_match(ipv4, ifa)) - continue; + /* delete this for simple, just prototype code*/ prop->prefix_len = inet_mask_len(ifa->ifa_mask); prop->outgoing_subnet = ifa->ifa_address & ifa->ifa_mask; /* prop->ipv6_prefixes_cnt = 0; already done by memset before */ @@ -528,10 +528,10 @@ static int smc_clc_prfx_set6_rcu(struct dst_entry *dst, /* retrieve and set prefixes in CLC proposal msg */ static int smc_clc_prfx_set(struct socket *clcsock, + struct dst_entry *dst, struct smc_clc_msg_proposal_prefix *prop, struct smc_clc_ipv6_prefix *ipv6_prfx) { - struct dst_entry *dst = sk_dst_get(clcsock->sk); struct sockaddr_storage addrs; struct sockaddr_in6 *addr6; struct sockaddr_in *addr; @@ -802,7 +802,8 @@ int smc_clc_send_decline(struct smc_sock *smc, u32 peer_diag_info, u8 version) } /* send CLC PROPOSAL message across internal TCP socket */ -int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini) +int smc_clc_send_proposal_with_nexthop(struct smc_sock *smc, + struct dst_entry *dst, struct smc_init_info *ini) { struct smc_clc_smcd_v2_extension *smcd_v2_ext; struct smc_clc_msg_proposal_prefix *pclc_prfx; @@ -838,7 +839,7 @@ int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini) /* retrieve ip prefixes for CLC proposal msg */ if (ini->smc_type_v1 != SMC_TYPE_N) { - rc = smc_clc_prfx_set(smc->clcsock, pclc_prfx, ipv6_prfx); + rc = smc_clc_prfx_set(smc->clcsock, dst, pclc_prfx, ipv6_prfx); if (rc) { if (ini->smc_type_v2 == SMC_TYPE_N) { kfree(pclc); @@ -961,6 +962,11 @@ int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini) } vec[i].iov_base = trl; vec[i++].iov_len = sizeof(*trl); + + msg.msg_flags |= MSG_FASTOPEN; + msg.msg_name = &smc->remote_address.addr; + msg.msg_namelen = sizeof(struct sockaddr_in); + /* due to the few bytes needed for clc-handshake this cannot block */ len = kernel_sendmsg(smc->clcsock, &msg, vec, i, plen); if (len < 0) { @@ -975,6 +981,22 @@ int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini) return reason_code; } +int smc_clc_send_proposal(struct smc_sock *smc, struct smc_init_info *ini) +{ + struct sock *tsk = smc->clcsock->sk; + struct rtable *rt; + int rc; + + rt = ip_route_output(sock_net(tsk), smc->remote_address.v4.sin_addr.s_addr, + 0, 0, 0); + + if (IS_ERR(rt)) + return -ECONNRESET; + + rc = smc_clc_send_proposal_with_nexthop(smc, &rt->dst, ini); + return rc; +} + /* build and send CLC CONFIRM / ACCEPT message */ static int smc_clc_send_confirm_accept(struct smc_sock *smc, struct smc_clc_msg_accept_confirm_v2 *clc_v2, diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c index f40f6ed..ef5e5411 100644 --- a/net/smc/smc_core.c +++ b/net/smc/smc_core.c @@ -1765,6 +1765,8 @@ int smc_vlan_by_tcpsk(struct socket *clcsock, struct smc_init_info *ini) int rc = 0; ini->vlan_id = 0; + /* just for simple , prototype code */ + return 0; if (!dst) { rc = -ENOTCONN; goto out; diff --git a/net/smc/smc_pnet.c b/net/smc/smc_pnet.c index 7055ed1..6aa3304 100644 --- a/net/smc/smc_pnet.c +++ b/net/smc/smc_pnet.c @@ -1064,8 +1064,8 @@ static void smc_pnet_find_rdma_dev(struct net_device *netdev, * If nothing found, check pnetid table. * If nothing found, try to use handshake device */ -static void smc_pnet_find_roce_by_pnetid(struct net_device *ndev, - struct smc_init_info *ini) +void smc_pnet_find_roce_by_pnetid(struct net_device *ndev, + struct smc_init_info *ini) { u8 ndev_pnetid[SMC_MAX_PNETID_LEN]; struct net *net; diff --git a/net/smc/smc_pnet.h b/net/smc/smc_pnet.h index 80a88ee..2ffaf22 100644 --- a/net/smc/smc_pnet.h +++ b/net/smc/smc_pnet.h @@ -67,4 +67,7 @@ void smc_pnet_find_alt_roce(struct smc_link_group *lgr, struct smc_ib_device *known_dev); bool smc_pnet_is_ndev_pnetid(struct net *net, u8 *pnetid); bool smc_pnet_is_pnetid_set(u8 *pnetid); + +void smc_pnet_find_roce_by_pnetid(struct net_device *ndev, + struct smc_init_info *ini); #endif