diff mbox series

[net,4/4] net/smc: Fix wq mismatch issue caused by smc fallback

Message ID 20211027085208.16048-5-tonylu@linux.alibaba.com (mailing list archive)
State Superseded
Headers show
Series Fixes for SMC | expand

Commit Message

Tony Lu Oct. 27, 2021, 8:52 a.m. UTC
From: Wen Gu <guwen@linux.alibaba.com>

A socket_wq mismatch issue may occur because of fallback.

When use SMC to replace TCP, applications add an epoll entry into SMC
socket's wq, but kernel uses clcsock's wq instead of SMC socket's wq
once fallback occurs, which means the application's epoll fd dosen't
work anymore.

For example:
server: nginx -g 'daemon off;'

client: smc_run wrk -c 1 -t 1 -d 5 http://11.200.15.93/index.html

  Running 5s test @ http://11.200.15.93/index.html
    1 threads and 1 connections
    Thread Stats   Avg      Stdev     Max   +/- Stdev
      Latency     0.00us    0.00us   0.00us    -nan%
      Req/Sec     0.00      0.00     0.00      -nan%
    0 requests in 5.00s, 0.00B read
  Requests/sec:      0.00
  Transfer/sec:       0.00B

This patch fixes this issue by using clcsock's wq regardless of
whether fallback occurs.

Reported-by: Jacob Qi <jacob.qi@linux.alibaba.com>
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
---
 net/smc/af_smc.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

Comments

Karsten Graul Oct. 28, 2021, 2:16 p.m. UTC | #1
On 27/10/2021 10:52, Tony Lu wrote:
> From: Wen Gu <guwen@linux.alibaba.com>
> 
> A socket_wq mismatch issue may occur because of fallback.
> 
> When use SMC to replace TCP, applications add an epoll entry into SMC
> socket's wq, but kernel uses clcsock's wq instead of SMC socket's wq
> once fallback occurs, which means the application's epoll fd dosen't
> work anymore.
> 
> For example:
> server: nginx -g 'daemon off;'
> 
> client: smc_run wrk -c 1 -t 1 -d 5 http://11.200.15.93/index.html
> 
>   Running 5s test @ http://11.200.15.93/index.html
>     1 threads and 1 connections
>     Thread Stats   Avg      Stdev     Max   +/- Stdev
>       Latency     0.00us    0.00us   0.00us    -nan%
>       Req/Sec     0.00      0.00     0.00      -nan%
>     0 requests in 5.00s, 0.00B read
>   Requests/sec:      0.00
>   Transfer/sec:       0.00B
> 
> This patch fixes this issue by using clcsock's wq regardless of
> whether fallback occurs.
> 

I need to spend some more time testing and thinking about this fix.
Thank you.
Karsten Graul Oct. 29, 2021, 9:40 a.m. UTC | #2
On 27/10/2021 10:52, Tony Lu wrote:
> From: Wen Gu <guwen@linux.alibaba.com>
> 
> A socket_wq mismatch issue may occur because of fallback.
> 
> When use SMC to replace TCP, applications add an epoll entry into SMC
> socket's wq, but kernel uses clcsock's wq instead of SMC socket's wq
> once fallback occurs, which means the application's epoll fd dosen't
> work anymore.

I am not sure if I understand this fix completely, please explain your intentions
for the changes in more detail.

What I see so far:
- smc_create() swaps the sk->sk_wq of the clcsocket and the new SMC socket
  - sets clcsocket sk->sk_wq to smcsocket->wq (why?)
  - sets smcsocket sk->sk_wq to clcsocket->wq (why?)
- smc_switch_to_fallback() resets the clcsock sk->sk_wq to clcsocket->wq
- smc_accept() sets smcsocket sk->sk_wq to clcsocket->wq when it is NOT fallback
  - but this was already done before in smc_create() ??
- smc_poll() now always uses clcsocket->wq for the call to sock_poll_wait()

In smc_poll() the comment says that now clcsocket->wq is used for poll, whats
the relation between socket->wq and socket->sk->sk_wq here?
Wen Gu Nov. 1, 2021, 6:15 a.m. UTC | #3
On 2021/10/29 5:40, Karsten Graul wrote:
> On 27/10/2021 10:52, Tony Lu wrote:
>> From: Wen Gu <guwen@linux.alibaba.com>
>>
>> A socket_wq mismatch issue may occur because of fallback.
>>
>> When use SMC to replace TCP, applications add an epoll entry into SMC
>> socket's wq, but kernel uses clcsock's wq instead of SMC socket's wq
>> once fallback occurs, which means the application's epoll fd dosen't
>> work anymore.
> 
> I am not sure if I understand this fix completely, please explain your intentions
> for the changes in more detail.
> 
> What I see so far:
> - smc_create() swaps the sk->sk_wq of the clcsocket and the new SMC socket
>    - sets clcsocket sk->sk_wq to smcsocket->wq (why?)
>    - sets smcsocket sk->sk_wq to clcsocket->wq (why?)
> - smc_switch_to_fallback() resets the clcsock sk->sk_wq to clcsocket->wq
> - smc_accept() sets smcsocket sk->sk_wq to clcsocket->wq when it is NOT fallback
>    - but this was already done before in smc_create() ??
> - smc_poll() now always uses clcsocket->wq for the call to sock_poll_wait()
> 
> In smc_poll() the comment says that now clcsocket->wq is used for poll, whats
> the relation between socket->wq and socket->sk->sk_wq here?
> 

Thanks for your reply.

Before explaining my intentions, I thought it would be better to 
describe the issue I encountered first:

In nginx/wrk tests, when nginx uses TCP and wrk uses SMC to replace TCP, 
wrk should fall back to TCP and get correct results theoretically, But 
in fact it only got all zeros.

For example:
server: nginx -g 'daemon off;'

client: smc_run wrk -c 1 -t 1 -d 5 http://11.200.15.93/index.html

   Running 5s test @ http://11.200.15.93/index.html
     1 threads and 1 connections
     Thread Stats   Avg      Stdev     Max   ± Stdev
         Latency     0.00us    0.00us   0.00us    -nan%
         Req/Sec     0.00      0.00     0.00      -nan%
        0 requests in 5.00s, 0.00B read
     Requests/sec:      0.00
     Transfer/sec:       0.00B

The reason for this result is that:

1) Before fallback: wrk uses epoll_insert() to associate an epoll fd 
with a smc socket, adding eppoll_entry into smc socket->wq, allowing 
itself to be notified when events such as EPOLL_OUT/EPOLL_IN occur on 
the smc socket.

2) After fallback: smc_switch_to_fallback() set fd->private_data as 
clcsocket. wrk starts to use TCP stack and kernel calls tcp_data_ready() 
when receving data, which uses clcsocket sk->sk_wq (equal to 
clcsocket->wq). As a result, the epoll fd hold by wrk in 1) can't 
receive epoll events and wrk stops transmitting data.

So the root cause of the issue is that wrk's eppoll_entry always stay in 
smcsocket->wq, but kernel turns to wake up 
clcsocket->sk->sk_wq(clcsocket->wq) after fallback, which makes wrk's 
epoll fd unusable.

[before fallback]
    application
         ^
         |
       (poll)
         |
         v
   smc socket->wq            clcsocket->wq
(has eppoll_entry)                .
         .                         .
         .                         .
         .                         .
         .                         .
smc socket->sk->sk_wq    clcsocket->sk->sk_wq
         ^                         ^
         |                         |
         | (data)                  |(tcp handshake in rendezvous)
         v                         v
|-------------------------------------------|
|                                           |
| sk_data_ready / sk_write_space ..         |
|                                           |
|-------------------------------------------|

[after fallback]
    application--------------------|
                         (cant poll anything)
                                   |
                                   |
                                   v
   smc socket->wq            clcsocket->wq
(has eppoll_entry)                .
         .                         .
         .                         .
         .                         .
         .                         .
smc socket->sk->sk_wq    clcsocket->sk->sk_wq
                                   ^
                                   |
                                   |(data)
                                   v
|-------------------------------------------|
|                                           |
| sk_data_ready / sk_write_space ..         |
|                                           |
|-------------------------------------------|


This patch's main idea is that since wait queue is switched from smc 
socket->wq to clcsocket->wq after fallback, making eppoll_entry in smc 
socket->wq unusable, maybe we can try to put eppoll_entry into 
clcsocket->wq at the beginning, set smc socket sk->sk_wq to 
clcsocket->wq before fallback, and reset clcsocket sk->sk_wq to 
clcsocket->wq after fallback. So that no matter whether fallback occurs, 
the kernel always wakes up clcsocket->wq to notify applications.


Therefore, the main modifications of this patch are as follows:

1) smc_poll() always uses clcsocket->wq for the call to sock_poll_wait()

After this modification, the user application's eppoll_entry will be 
added into clcsocket->wq at the beginning instead of smc socket->wq, 
allowing application's epoll fd to receive events such as EPOLL_IN even 
when a fallback occurs.

2) Swap the sk->sk_wq of the clcsocket and the new SMC socket in 
smc_create()

Sets smcsocket sk->sk_wq to clcsocket->wq. If SMC is used and NOT 
fallback, the sk_data_ready() will wake up clcsocket->wq, and user 
application's epoll fd will receive epoll events.

Sets clcsocket sk->sk_wq to smcsocket->wq. Since clcsocket->wq is 
occupied by smcsocket sk->sk_wq, clcsocket sk->sk_wq have to use another 
wait queue (smc socket->wq) during TCP handshake in rendezvous.

3) smc_switch_to_fallback() resets the clcsocket sk->sk_wq to clcsocket->wq

Once a fallback occurs, user application will start using clcsocket. At 
this point, clcsocket sk->sk_wq should be reseted to clcsocket->wq 
because eppoll_entry is in clcsocket->wq.

4) smc_accept() sets smcsocket sk->sk_wq to clcsocket->wq when it is NOT 
fallback

The reason for doing this on the listening side is simmilar to 
smc_create(): using clcsocket->wq in concert with smc_poll() when it is 
NOT fallback. For your query 'this was already done before in 
smc_create() ? ', the new smc socket here in smc_accept() seems be 
different from the smc socket created in smc_create().

So after the fix:

[before fallback]
    application--------------------|
                                (poll)
                                   |
                                   |
                                   v
   smc socket->wq            clcsocket->wq
          .                (has eppoll_entry)
             `   .                  .
                     `  .  .   `
                 .    `    `   .
            `                      `
smc socket->sk->sk_wq    clcsocket->sk->sk_wq
         ^                         ^
         |                         |
         |(data)                   |(tcp handshake)
         v                         v
|-------------------------------------------|
|                                           |
| sk_data_ready / sk_write_space ..         |
|                                           |
|-------------------------------------------|

[after fallback]
    application--------------------|
                                (poll)
                                   |
                                   |
                                   v
   smc socket->wq            clcsocket->wq
         .                 (has eppoll_entry)
             `   .                  .
                     `   .          .
                            `   .   .
                                    `
smc socket->sk->sk_wq    clcsocket->sk->sk_wq
                                    ^
                                    |(data)
                                    v
|-------------------------------------------|
|                                           |
| sk_data_ready / sk_write_space ..         |
|                                           |
|-------------------------------------------|


Cheers,
Wen Gu
Karsten Graul Nov. 2, 2021, 9:25 a.m. UTC | #4
On 01/11/2021 07:15, Wen Gu wrote:
> Before explaining my intentions, I thought it would be better to describe the issue I encountered first:
> 
> In nginx/wrk tests, when nginx uses TCP and wrk uses SMC to replace TCP, wrk should fall back to TCP and get correct results theoretically, But in fact it only got all zeros.

Thank you for the very detailed description, I now understand the situation.

The fix is not obvious and not easy to understand for the reader of the code,
did you think about a fix that uses own sk_data_ready / sk_write_space
implementations on the SMC socket to forward the call to the clcsock in the 
fallback situation?

I.e. we already have smc_tx_write_space(), and there is smc_clcsock_data_ready()
which is right now only used for the listening socket case.

If this works this would be a much cleaner and more understandable way to fix this issue.
Wen Gu Nov. 3, 2021, 8:56 a.m. UTC | #5
On 2021/11/2 下午5:25, Karsten Graul wrote:
> On 01/11/2021 07:15, Wen Gu wrote:
>> Before explaining my intentions, I thought it would be better to describe the issue I encountered first:
>>
>> In nginx/wrk tests, when nginx uses TCP and wrk uses SMC to replace TCP, wrk should fall back to TCP and get correct results theoretically, But in fact it only got all zeros.
> 
> Thank you for the very detailed description, I now understand the situation.
> 
> The fix is not obvious and not easy to understand for the reader of the code,
> did you think about a fix that uses own sk_data_ready / sk_write_space
> implementations on the SMC socket to forward the call to the clcsock in the
> fallback situation?
> 
> I.e. we already have smc_tx_write_space(), and there is smc_clcsock_data_ready()
> which is right now only used for the listening socket case.
> 
> If this works this would be a much cleaner and more understandable way to fix this issue.
> 

Thanks for your suggestions, they are very helpful.

I will try these ways and send a new patch later.

cheers,
Wen Gu
Wen Gu Nov. 4, 2021, 4:39 a.m. UTC | #6
On 2021/11/2 5:25 pm, Karsten Graul wrote:
> On 01/11/2021 07:15, Wen Gu wrote:
>> Before explaining my intentions, I thought it would be better to describe the issue I encountered first:
>>
>> In nginx/wrk tests, when nginx uses TCP and wrk uses SMC to replace TCP, wrk should fall back to TCP and get correct results theoretically, But in fact it only got all zeros.
> 
> Thank you for the very detailed description, I now understand the situation.
> 
> The fix is not obvious and not easy to understand for the reader of the code,
> did you think about a fix that uses own sk_data_ready / sk_write_space
> implementations on the SMC socket to forward the call to the clcsock in the
> fallback situation?
> 
> I.e. we already have smc_tx_write_space(), and there is smc_clcsock_data_ready()
> which is right now only used for the listening socket case.
> 
> If this works this would be a much cleaner and more understandable way to fix this issue.
> 

Thanks for your suggestions about implementing SMC own sk_data_ready / 
sk_write_space and forwarding call to clcsock. It's a great idea. But I 
found some difficulties here in implementation process:

In my humble opinion, SMC own sk_write_space implementation should be 
called by clcsk->sk_write_space and complete the following steps:

1) Get smc_sock through clcsk->sk_user_data, like what did in 
smc_clcsock_data_ready().

2) Forward call to original clcsk->sk_write_space, it MIGHT wake up 
clcsk->sk_wq, depending on whether certain conditions are met.

3) Wake up smc sk->sk_wq to nodify application if clcsk->sk_write_space 
acctually wakes up clcsk->sk_wq.

In step 3), it seems a bit troublesome for SMC to know whether 
clcsk->sk_write_space acctually wake up clcsk->sk_wq, which is a black 
box to SMC.

There might be a feasible way that add a wait_queue_head_t to 
clcsk->sk_wq and report to SMC when clcsk->sk_wq is waked up. Then SMC 
can report to application by waking up smc sk->sk_wq. But that seems to 
be complex and redundancy.

I'm looking forward to hear your opinion about it. Thank you!


cheers,
Wen Gu
Karsten Graul Nov. 6, 2021, 12:51 p.m. UTC | #7
On 04/11/2021 05:39, Wen Gu wrote:
> Thanks for your suggestions about implementing SMC own sk_data_ready / sk_write_space and forwarding call to clcsock. It's a great idea. But I found some difficulties here in implementation process:
> 
> In my humble opinion, SMC own sk_write_space implementation should be called by clcsk->sk_write_space and complete the following steps:
> 
> 1) Get smc_sock through clcsk->sk_user_data, like what did in smc_clcsock_data_ready().
> 
> 2) Forward call to original clcsk->sk_write_space, it MIGHT wake up clcsk->sk_wq, depending on whether certain conditions are met.
> 
> 3) Wake up smc sk->sk_wq to nodify application if clcsk->sk_write_space acctually wakes up clcsk->sk_wq.
> 
> In step 3), it seems a bit troublesome for SMC to know whether clcsk->sk_write_space acctually wake up clcsk->sk_wq, which is a black box to SMC.
> 
> There might be a feasible way that add a wait_queue_head_t to clcsk->sk_wq and report to SMC when clcsk->sk_wq is waked up. Then SMC can report to application by waking up smc sk->sk_wq. But that seems to be complex and redundancy.

Hmm so when more certain conditions have to be met in (2) to 
actually wake up clcsk->sk_wq then this might not be the right
way to go...
So when there are no better ways I would vote for your initial patch.
But please add the complete description about how this is intended to 
work to this patch to allow readers to understand the idea behind it.

Thank you.
Wen Gu Nov. 10, 2021, 12:50 p.m. UTC | #8
On 2021/11/6 8:51 pm, Karsten Graul wrote:
> On 04/11/2021 05:39, Wen Gu wrote:
>> Thanks for your suggestions about implementing SMC own sk_data_ready / sk_write_space and forwarding call to clcsock. It's a great idea. But I found some difficulties here in implementation process:
>>
>> In my humble opinion, SMC own sk_write_space implementation should be called by clcsk->sk_write_space and complete the following steps:
>>
>> 1) Get smc_sock through clcsk->sk_user_data, like what did in smc_clcsock_data_ready().
>>
>> 2) Forward call to original clcsk->sk_write_space, it MIGHT wake up clcsk->sk_wq, depending on whether certain conditions are met.
>>
>> 3) Wake up smc sk->sk_wq to nodify application if clcsk->sk_write_space acctually wakes up clcsk->sk_wq.
>>
>> In step 3), it seems a bit troublesome for SMC to know whether clcsk->sk_write_space acctually wake up clcsk->sk_wq, which is a black box to SMC.
>>
>> There might be a feasible way that add a wait_queue_head_t to clcsk->sk_wq and report to SMC when clcsk->sk_wq is waked up. Then SMC can report to application by waking up smc sk->sk_wq. But that seems to be complex and redundancy.
> 
> Hmm so when more certain conditions have to be met in (2) to
> actually wake up clcsk->sk_wq then this might not be the right
> way to go...
> So when there are no better ways I would vote for your initial patch.
> But please add the complete description about how this is intended to
> work to this patch to allow readers to understand the idea behind it.
> 
> Thank you.
> 

Thanks for your suggestions.

Unfortunately, I found a defect about fasync_list in my initial patch. 
Swapping sk->sk_wq of smc socket and clcsocket seems also an imperfect 
way to fix the issue. I will describe this in the next new mail.

In addition, I will provide another RFC patch which is aimed to fix the 
same issue. It will be also included in the next new mail.

Thank you.

Cheers,
Wen Gu
diff mbox series

Patch

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 78b663dbfa1f..3b7ec0abff52 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -546,6 +546,10 @@  static void smc_switch_to_fallback(struct smc_sock *smc, int reason_code)
 {
 	smc->use_fallback = true;
 	smc->fallback_rsn = reason_code;
+
+	/* clcsock's sock uses back to clcsock->wq, see also smc_create() */
+	rcu_assign_pointer(smc->clcsock->sk->sk_wq, &smc->clcsock->wq);
+
 	smc_stat_fallback(smc);
 	if (smc->sk.sk_socket && smc->sk.sk_socket->file) {
 		smc->clcsock->file = smc->sk.sk_socket->file;
@@ -1972,6 +1976,10 @@  static int smc_accept(struct socket *sock, struct socket *new_sock,
 	if (rc)
 		goto out;
 
+	/* new smc sock uses clcsock's wq. see also smc_create() */
+	if (!smc_sk(nsk)->use_fallback)
+		rcu_assign_pointer(nsk->sk_wq, &smc_sk(nsk)->clcsock->wq);
+
 	if (lsmc->sockopt_defer_accept && !(flags & O_NONBLOCK)) {
 		/* wait till data arrives on the socket */
 		timeo = msecs_to_jiffies(lsmc->sockopt_defer_accept *
@@ -2108,6 +2116,9 @@  static __poll_t smc_poll(struct file *file, struct socket *sock,
 		mask = smc->clcsock->ops->poll(file, smc->clcsock, wait);
 		sk->sk_err = smc->clcsock->sk->sk_err;
 	} else {
+		/* use clcsock->wq in sock_poll_wait(), see also smc_create() */
+		sock = smc->clcsock;
+
 		if (sk->sk_state != SMC_CLOSED)
 			sock_poll_wait(file, sock, wait);
 		if (sk->sk_err)
@@ -2505,6 +2516,10 @@  static int smc_create(struct net *net, struct socket *sock, int protocol,
 	smc->sk.sk_sndbuf = max(smc->clcsock->sk->sk_sndbuf, SMC_BUF_MIN_SIZE);
 	smc->sk.sk_rcvbuf = max(smc->clcsock->sk->sk_rcvbuf, SMC_BUF_MIN_SIZE);
 
+	/* In case smc fallbacks to tcp, smc's sock will use clcsock's wq in advance */
+	rcu_assign_pointer(sk->sk_wq, &smc->clcsock->wq);
+	rcu_assign_pointer(smc->clcsock->sk->sk_wq, &smc->sk.sk_socket->wq);
+
 out:
 	return rc;
 }