
[RFC,net] net/smc: Ensure the active closing peer first closes clcsock

Message ID 20211116033011.16658-1-tonylu@linux.alibaba.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Series [RFC,net] net/smc: Ensure the active closing peer first closes clcsock

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net
netdev/fixes_present success Fixes tag present in non-next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers fail 1 blamed authors not CCed: ubraun@linux.vnet.ibm.com; 1 maintainers not CCed: ubraun@linux.vnet.ibm.com
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch warning WARNING: line length of 83 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Tony Lu Nov. 16, 2021, 3:30 a.m. UTC
We found an issue when replacing TCP with SMC. When the actively closing
peer calls close() in userspace, its clcsock doesn't go through the TCP
active close process; instead, the passively closing peer closes its
clcsock first and ends up in the TIME_WAIT state. This doesn't match the
behavior we expect. RFC7609 gives no clear description of the order in
which the clcsocks should be closed during the close process.

Consider a server application that listens for and accepts connections,
and a client that connects to it. When the client actively closes the
socket, the corresponding connection on the server enters the TIME_WAIT
state, which means the address stays occupied and can't be reused until
TIME_WAIT expires. Restarting the server then fails with EADDRINUSE.
Although SO_REUSEADDR can work around this, not all applications use it.
It is very common to restart a server process for various reasons in
production environments, and waiting for TIME_WAIT to expire leaves the
service unavailable for a long time.
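
For reference, the SO_REUSEADDR workaround is a single setsockopt() call
before bind(); a minimal illustrative userspace snippet (error handling
omitted):

  #include <sys/socket.h>
  #include <netinet/in.h>

  static int make_listen_fd(void)
  {
          int one = 1;
          int fd = socket(AF_INET, SOCK_STREAM, 0);

          /* allow bind() to reuse an address that still has a
           * connection in TIME_WAIT
           */
          setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
          return fd; /* bind()/listen() as usual afterwards */
  }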

Here is a simple example to reproduce this issue. Run these commands:
  server: smc_run ./sockperf server --tcp -p 12345
  client: smc_run ./sockperf pp --tcp -m 14 -i ${SERVER_IP} -t 10 -p 12345

After the client benchmark finishes, a TIME_WAIT connection shows up on
the server, not the client, and occupies the address:
  tcp        0      0 100.200.30.40:12345      100.200.30.41:53010      TIME_WAIT   -

If we restart the server's sockperf, it fails with:
  sockperf: ERROR: [fd=3] Can`t bind socket, IP to bind: 0.0.0.0:12345
   (errno=98 Address already in use)

Here is the close process for PeerConnAbort. This issue can be observed
with both PeerConnAbort and PeerConnClosed.

Client                                                |  Server
close() // client actively close                      |
  smc_release()                                       |
      smc_close_active() // SMC_PEERCLOSEWAIT1        |
          smc_close_final() // abort or closed = 1    |
              smc_cdc_get_slot_and_msg_send()         |
          [ A ]                                       |
                                                      |smc_cdc_msg_recv_action() // SMC_ACTIVE
                                                      |  queue_work(smc_close_wq, &conn->close_work)
                                                      |    smc_close_passive_work() // SMC_PROCESSABORT or SMC_APPCLOSEWAIT1
                                                      |      smc_close_passive_abort_received() // only in abort
                                                      |
                                                      |close() // server recv zero, close
                                                      |  smc_release() // SMC_PROCESSABORT or SMC_APPCLOSEWAIT1
                                                      |    smc_close_active()
                                                      |      smc_close_abort() or smc_close_final() // SMC_CLOSED
                                                      |        smc_cdc_get_slot_and_msg_send() // abort or closed = 1
smc_cdc_msg_recv_action()                             |    smc_clcsock_release()
  queue_work(smc_close_wq, &conn->close_work)         |      sock_release(tcp) // actively close clc, enter TIME_WAIT
    smc_close_passive_work() // SMC_PEERCLOSEWAIT1    |    smc_conn_free()
      smc_close_passive_abort_received() // SMC_CLOSED|
      smc_conn_free()                                 |
      smc_clcsock_release()                           |
        sock_release(tcp) // passive close clc        |

To solve this issue, the clcsock can be shut down before the passively
closing peer closes it, so that the TCP active close is performed first.
RFC7609 says the main idea of the termination flows is to terminate the
normal TCP connection in the end [1]. But it is not possible to release
the clcsock before the server calls sock_release(tcp), so shut down the
clcsock at [ A ], which is the only point before the server closes it.

Link: https://datatracker.ietf.org/doc/html/rfc7609#section-4.8.1 [1]
Fixes: b38d732477e4 ("smc: socket closing and linkgroup cleanup")
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
---
 net/smc/smc_close.c | 6 ++++++
 1 file changed, 6 insertions(+)

Comments

Karsten Graul Nov. 17, 2021, 4:19 p.m. UTC | #1
On 16/11/2021 04:30, Tony Lu wrote:
> We found an issue when replacing TCP with SMC. When the actively closing
> peer calls close() in userspace, its clcsock doesn't go through the TCP
> active close process; instead, the passively closing peer closes its
> clcsock first and ends up in the TIME_WAIT state. This doesn't match the
> behavior we expect. RFC7609 gives no clear description of the order in
> which the clcsocks should be closed during the close process.

Thanks for your detailed description, it helped me to understand the problem.
Your point is that SMC sockets should show the same behavior as TCP sockets
in this situation: the side that actively closed the socket should get into
TIME_WAIT state, and not the passive side. I agree with this.
Your idea to fix it looks like a good solution to me. But I need to do more
testing to make sure that other SMC implementations (not Linux) work as
expected with this change. For example, Linux does not actively monitor the
clcsocket state, but if another implementation did this, it could happen
that the SMC socket is already closed when the clcsocket shutdown arrives,
and pending data transfers are aborted.

I will respond to your RFC when I have finished my testing.

Thank you.
Karsten Graul Nov. 22, 2021, 4:47 p.m. UTC | #2
On 17/11/2021 17:19, Karsten Graul wrote:
> On 16/11/2021 04:30, Tony Lu wrote:
>> We found an issue when replacing TCP with SMC. When the actively closing
>> peer calls close() in userspace, its clcsock doesn't go through the TCP
>> active close process; instead, the passively closing peer closes its
>> clcsock first and ends up in the TIME_WAIT state. This doesn't match the
>> behavior we expect. RFC7609 gives no clear description of the order in
>> which the clcsocks should be closed during the close process.
> 
> Thanks for your detailed description, it helped me to understand the problem.
> Your point is that SMC sockets should show the same behavior as TCP sockets
> in this situation: the side that actively closed the socket should get into
> TIME_WAIT state, and not the passive side. I agree with this.
> Your idea to fix it looks like a good solution to me. But I need to do more
> testing to make sure that other SMC implementations (not Linux) work as
> expected with this change. For example, Linux does not actively monitor the
> clcsocket state, but if another implementation did this, it could happen
> that the SMC socket is already closed when the clcsocket shutdown arrives,
> and pending data transfers are aborted.
>
> I will respond to your RFC when I have finished my testing.
> 
> Thank you.
> 

Testing and discussions are finished, the patch looks good.
Can you please send your change as a patch to the mailing list?
Tony Lu Nov. 23, 2021, 3:03 a.m. UTC | #3
On Mon, Nov 22, 2021 at 05:47:43PM +0100, Karsten Graul wrote:
> On 17/11/2021 17:19, Karsten Graul wrote:
> > On 16/11/2021 04:30, Tony Lu wrote:
> >> We found an issue when replacing TCP with SMC. When the actively closing
> >> peer calls close() in userspace, its clcsock doesn't go through the TCP
> >> active close process; instead, the passively closing peer closes its
> >> clcsock first and ends up in the TIME_WAIT state. This doesn't match the
> >> behavior we expect. RFC7609 gives no clear description of the order in
> >> which the clcsocks should be closed during the close process.
> > 
> > Thanks for your detailed description, it helped me to understand the problem.
> > Your point is that SMC sockets should show the same behavior as TCP sockets
> > in this situation: the side that actively closed the socket should get into
> > TIME_WAIT state, and not the passive side. I agree with this.
> > Your idea to fix it looks like a good solution to me. But I need to do more
> > testing to make sure that other SMC implementations (not Linux) work as
> > expected with this change. For example, Linux does not actively monitor the
> > clcsocket state, but if another implementation did this, it could happen
> > that the SMC socket is already closed when the clcsocket shutdown arrives,
> > and pending data transfers are aborted.
> >
> > I will respond to your RFC when I have finished my testing.
> > 
> > Thank you.
> > 
> 
> Testing and discussions are finished, the patch looks good.
> Can you please send your change as a patch to the mailing list?

Thanks for your advice and testing. I will send it soon.

Cheers,
Tony Lu
Karsten Graul Nov. 23, 2021, 9:26 a.m. UTC | #4
On 16/11/2021 04:30, Tony Lu wrote:
> diff --git a/net/smc/smc_close.c b/net/smc/smc_close.c
> index 0f9ffba07d26..04620b53b74a 100644
> --- a/net/smc/smc_close.c
> +++ b/net/smc/smc_close.c
> @@ -228,6 +228,12 @@ int smc_close_active(struct smc_sock *smc)
>  			/* send close request */
>  			rc = smc_close_final(conn);
>  			sk->sk_state = SMC_PEERCLOSEWAIT1;
> +
> +			/* actively shutdown clcsock before peer close it,
> +			 * prevent peer from entering TIME_WAIT state.
> +			 */
> +			if (smc->clcsock && smc->clcsock->sk)
> +				rc = kernel_sock_shutdown(smc->clcsock, SHUT_RDWR);
>  		} else {

While integrating this patch, I stumbled over the overwritten rc, which was
already set to the return value from smc_close_final().
Is the rc from kernel_sock_shutdown() even important for the result of this
function? How do you think this should be handled?
Tony Lu Nov. 24, 2021, 8:57 a.m. UTC | #5
On Tue, Nov 23, 2021 at 10:26:21AM +0100, Karsten Graul wrote:
> On 16/11/2021 04:30, Tony Lu wrote:
> > diff --git a/net/smc/smc_close.c b/net/smc/smc_close.c
> > index 0f9ffba07d26..04620b53b74a 100644
> > --- a/net/smc/smc_close.c
> > +++ b/net/smc/smc_close.c
> > @@ -228,6 +228,12 @@ int smc_close_active(struct smc_sock *smc)
> >  			/* send close request */
> >  			rc = smc_close_final(conn);
> >  			sk->sk_state = SMC_PEERCLOSEWAIT1;
> > +
> > +			/* actively shutdown clcsock before peer close it,
> > +			 * prevent peer from entering TIME_WAIT state.
> > +			 */
> > +			if (smc->clcsock && smc->clcsock->sk)
> > +				rc = kernel_sock_shutdown(smc->clcsock, SHUT_RDWR);
> >  		} else {
> 
> While integrating this patch, I stumbled over the overwritten rc, which was
> already set to the return value from smc_close_final().
> Is the rc from kernel_sock_shutdown() even important for the result of this
> function? How do you think this should be handled?

Hi Graul,

I have investigated the cases in which smc_close_final() returns an error:

1. return -EPIPE
  * conn->killed
  * !conn->lgr || (conn->lgr->is_smcd && conn->lgr->peer_shutdown)

2. return -ENOLINK
  * !smc_link_usable(link)
  * the conn's link has changed while getting a free WR slot

3. return -EBUSY
  * smc_wr_tx_get_free_slot_index has no available slot

The return code -EBUSY is important so that user space can call close()
again.

-ENOLINK and -EPIPE mean there is no chance to tell the peer to perform
the close process. Applications should know this. The clcsock will be
released in the end.


And the callers of the upper-level function smc_close_active():

1. __smc_release() doesn't handle rc and returns it directly to the
user space that called close().

2. smc_shutdown() returns rc to its caller, together with the result of
kernel_sock_shutdown().

IMHO, given that, it is better not to ignore the rc of smc_close_final()
and to move kernel_sock_shutdown() to __smc_release(), because
smc_shutdown() also calls kernel_sock_shutdown() after smc_close_active()
and smc_close_shutdown_write(), and the socket then enters
SMC_PEERCLOSEWAIT1. There is no need to call it twice with SHUT_WR and
SHUT_RDWR.

Here is the proposed code for __smc_release() in af_smc.c:


static int __smc_release(struct smc_sock *smc)
{
        struct sock *sk = &smc->sk;
        int rc = 0, rc1 = 0;

        if (!smc->use_fallback) {
                rc = smc_close_active(smc);

                /* make sure kernel_sock_shutdown() is not called twice when
                 * smc_shutdown() was already called with SHUT_WR or
                 * SHUT_RDWR, which performs the TCP close
                 */
                if (smc->clcsock && (!sk->sk_shutdown ||
                    (sk->sk_shutdown & RCV_SHUTDOWN)) &&
                    sk->sk_state == SMC_PEERCLOSEWAIT1) {
                        rc1 = kernel_sock_shutdown(smc->clcsock, SHUT_RDWR);
                        rc = rc ? rc : rc1;
                }

                sock_set_flag(sk, SOCK_DEAD);
                sk->sk_shutdown |= SHUTDOWN_MASK;
        } else {

// code ignored

Thanks for pointing it out; this makes the patch more complete.

Tony Lu
Karsten Graul Nov. 24, 2021, 10:08 a.m. UTC | #6
On 24/11/2021 09:57, Tony Lu wrote:
> IMHO, given that, it is better not to ignore the rc of smc_close_final()
> and to move kernel_sock_shutdown() to __smc_release(), because
> smc_shutdown() also calls kernel_sock_shutdown() after smc_close_active()
> and smc_close_shutdown_write(), and the socket then enters
> SMC_PEERCLOSEWAIT1. There is no need to call it twice with SHUT_WR and
> SHUT_RDWR.

Since the idea is to shut down the socket before the remote peer shuts it
down first, are you sure that this shutdown in smc_release() is not too late?
Is it certain that smc_release() is called in time for this processing?

Maybe it's better to keep the shutdown in smc_close_active() and to use an rc1
just as shown in your proposal, and return either the rc of smc_close_final()
or the rc of kernel_sock_shutdown().
I see the possibility of calling shutdown twice for the clcsocket, but is it
harmful enough to justify checking for it beforehand in smc_shutdown()? I
expect TCP to handle this already.
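
A minimal sketch of that variant (illustrative only, not the final patch;
it assumes an extra rc1 local declared in smc_close_active() and sits in
the same branch shown in the RFC patch above):

  			/* send close request */
  			rc = smc_close_final(conn);
  			sk->sk_state = SMC_PEERCLOSEWAIT1;

  			/* actively shutdown clcsock before the peer closes
  			 * it, to prevent the peer from entering TIME_WAIT;
  			 * keep both return codes and report the first error
  			 */
  			if (smc->clcsock && smc->clcsock->sk) {
  				rc1 = kernel_sock_shutdown(smc->clcsock,
  							   SHUT_RDWR);
  				rc = rc ? rc : rc1;
  			}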
Tony Lu Nov. 24, 2021, 11:26 a.m. UTC | #7
On Wed, Nov 24, 2021 at 11:08:23AM +0100, Karsten Graul wrote:
> On 24/11/2021 09:57, Tony Lu wrote:
> > IMHO, given that, it is better not to ignore the rc of smc_close_final()
> > and to move kernel_sock_shutdown() to __smc_release(), because
> > smc_shutdown() also calls kernel_sock_shutdown() after smc_close_active()
> > and smc_close_shutdown_write(), and the socket then enters
> > SMC_PEERCLOSEWAIT1. There is no need to call it twice with SHUT_WR and
> > SHUT_RDWR.
> 
> Since the idea is to shut down the socket before the remote peer shuts it
> down first, are you sure that this shutdown in smc_release() is not too late?

Hi Graul,

Yes, I have tested this idea; it is sometimes too late, so it doesn't fix
the issue.

> Is it certain that smc_release() is called in time for this processing?
> 
> Maybe it's better to keep the shutdown in smc_close_active() and to use an rc1
> just as shown in your proposal, and return either the rc of smc_close_final()
> or the rc of kernel_sock_shutdown().

Yep, I am testing this approach in my environment. I am going to keep both
return codes and return the available one.

> I see the possibility of calling shutdown twice for the clcsocket, but is it
> harmful enough to justify checking for it beforehand in smc_shutdown()? I
> expect TCP to handle this already.

TCP could handle this already, but it doesn't make much sense to call it
twice. In smc_shutdown(), we can check sk_shutdown before calling
kernel_sock_shutdown(), which slightly speeds up the release process.
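
A rough sketch of that check (hypothetical; it assumes the locals that
already exist in smc_shutdown(), i.e. sk, smc, how and rc1, and the final
patch may look different):

  	/* skip the extra clcsock shutdown if smc_close_active() already
  	 * shut the clcsock down with SHUT_RDWR, i.e. both shutdown bits
  	 * are set on the SMC socket
  	 */
  	if ((sk->sk_shutdown & SHUTDOWN_MASK) != SHUTDOWN_MASK)
  		rc1 = kernel_sock_shutdown(smc->clcsock, how);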

I will send this soon, thanks for your advice.

Tony Lu

Patch

diff --git a/net/smc/smc_close.c b/net/smc/smc_close.c
index 0f9ffba07d26..04620b53b74a 100644
--- a/net/smc/smc_close.c
+++ b/net/smc/smc_close.c
@@ -228,6 +228,12 @@  int smc_close_active(struct smc_sock *smc)
 			/* send close request */
 			rc = smc_close_final(conn);
 			sk->sk_state = SMC_PEERCLOSEWAIT1;
+
+			/* actively shutdown clcsock before peer close it,
+			 * prevent peer from entering TIME_WAIT state.
+			 */
+			if (smc->clcsock && smc->clcsock->sk)
+				rc = kernel_sock_shutdown(smc->clcsock, SHUT_RDWR);
 		} else {
 			/* peer event has changed the state */
 			goto again;