
[net-next,v2] net/smc: Reduce overflow of smc clcsock listen queue

Message ID 1641301961-59331-1-git-send-email-alibuda@linux.alibaba.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Series [net-next,v2] net/smc: Reduce overflow of smc clcsock listen queue

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers success CCed 5 of 5 maintainers
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 52 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

D. Wythe Jan. 4, 2022, 1:12 p.m. UTC
From: "D. Wythe" <alibuda@linux.alibaba.com>

In an nginx/wrk multithreaded benchmark with 10K connections, the
backend TCP connections are established very slowly, and many TCP
connections stay in the SYN_SENT state.

Server: smc_run nginx

Client: smc_run wrk -c 10000 -t 4 http://server

The socket states on the client host (wrk) look like this:

ss -t  | wc -l
10000

ss -t  | grep "SYN-SENT"  | wc -l
6248

While the socket states on the server host (nginx) look like this:

ss -t  | wc -l
3752

Furthermore, netstat on the server host shows:
    145042 times the listen queue of a socket overflowed
    145042 SYNs to LISTEN sockets dropped

This issue is caused by smc_listen_work(): smc_tcp_listen_work() shares
the same workqueue (smc_hs_wq) with smc_listen_work(), and
smc_listen_work() blocks waiting for the SMC connection to be
established, which in turn blocks the accept() from the TCP listen queue.

This patch creates an independent workqueue (smc_tcp_ls_wq) for
smc_tcp_listen_work(), separating it from smc_listen_work(), which is
quite acceptable considering that smc_tcp_listen_work() runs very fast.

After this patch, the SMC 10K-connection benchmark in my case is 5
times faster than before.

Before patch :

smc_run  ./wrk -c 10000 -t 3 -d 20  http://server
  3 threads and 10000 connections
  143300 requests in 20.04s, 94.29MB read
Requests/sec:   7150.33
Transfer/sec:      4.70MB

After patch:

smc_run  ./wrk -c 10000 -t 3 -d 20  http://server
  3 threads and 10000 connections
  902091 requests in 21.99s, 593.56MB read
Requests/sec:  41017.52
Transfer/sec:     26.99MB

Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
---
changelog:
v2: code format
---
 net/smc/af_smc.c | 13 +++++++++++--
 net/smc/smc.h    |  1 +
 2 files changed, 12 insertions(+), 2 deletions(-)

Comments

Karsten Graul Jan. 4, 2022, 1:45 p.m. UTC | #1
On 04/01/2022 14:12, D. Wythe wrote:
> From: "D. Wythe" <alibuda@linux.alibaba.com>
> 
> In nginx/wrk multithread and 10K connections benchmark, the
> backend TCP connection established very slowly, and lots of TCP
> connections stay in SYN_SENT state.

I see what you are trying to solve here.
So what happens with your patch now is that we accept way more connections
in advance and queue them up for the SMC connection handshake worker.
The connection handshake worker itself will not run faster with this change, so overall
it should take the same time to establish all connections.
What you solve is that when 10k connections are started at the same time, some of them
will be dropped due to TCP 3-way handshake timeouts. Your patch avoids that, but one can now flood
the stack with an almost infinite number of dangling sockets waiting for the SMC handshake, maybe even
causing OOM conditions.

What should be respected with such a change is the backlog parameter for the listen socket,
i.e. how many backlog connections are requested by the user space application.
There is no such handling of the backlog right now, and due to the 'braking' workers we avoided
flooding the kernel with too many dangling connections. With your change there should be a way to limit
this kind of connection in some way.
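
For illustration only, a minimal sketch of what respecting the listen
backlog could look like; the queued_smc_hs counter is an assumption and
is not part of the posted patch:

/* Hypothetical sketch, not part of the posted patch: refuse to queue
 * more SMC handshakes than the backlog requested via listen().
 * 'queued_smc_hs' is an assumed per-listen-socket atomic counter that
 * would be incremented before queueing smc_listen_work() and
 * decremented when the handshake finishes or fails.
 */
static bool smc_hs_backlog_full(struct smc_sock *lsmc)
{
	/* sk_max_ack_backlog holds the value passed to listen() */
	return atomic_read(&lsmc->queued_smc_hs) >=
	       lsmc->sk.sk_max_ack_backlog;
}

smc_tcp_listen_work() could then drop (or, as discussed further down in
this thread, fall back) a new connection whenever this returns true.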
D. Wythe Jan. 4, 2022, 4:17 p.m. UTC | #2
It seems my last mail was rejected for some reason, so I am resending
it to confirm. Sorry to bother you if you have already seen it.

Thanks.
D. Wythe Jan. 5, 2022, 4:40 a.m. UTC | #3
Hi, 

Since we are trying to use the backlog parameter to limit SMC dangling
connections, it seems no different from increasing the backlog
parameter for the TCP listen socket; the user space application can
simply avoid the 10K connections problem that way.

If so, this patch looks redundant to me. I look forward to your advice.

Thanks.

On Tue, Jan 04, 2022 at 02:45:35PM +0100, Karsten Graul wrote:
> On 04/01/2022 14:12, D. Wythe wrote:
> > From: "D. Wythe" <alibuda@linux.alibaba.com>
> > 
> > In nginx/wrk multithread and 10K connections benchmark, the
> > backend TCP connection established very slowly, and lots of TCP
> > connections stay in SYN_SENT state.
> 
> I see what you are trying to solve here.
> So what happens with your patch now is that we are accepting way more connections
> in advance and queue them up for the SMC connection handshake worker.
> The connection handshake worker itself will not run faster with this change, so overall
> it should be the same time that is needed to establish all connections.
> What you solve is that when 10k connections are started at the same time, some of them
> will be dropped due to tcp 3-way handshake timeouts. Your patch avoids that but one can now flood
> the stack with an ~infinite amount of dangling sockets waiting for the SMC handshake, maybe even 
> causing oom conditions.
> 
> What should be respected with such a change would be the backlog parameter for the listen socket,
> i.e. how many backlog connections are requested by the user space application?
> There is no such handling of backlog right now, and due to the 'braking' workers we avoided
> to flood the kernel with too many dangling connections. With your change there should be a way to limit
> this ind of connections in some way.
Tony Lu Jan. 5, 2022, 8:28 a.m. UTC | #4
On Wed, Jan 05, 2022 at 12:40:49PM +0800, D. Wythe wrote:
> Hi, 
> 
> Since we are trying to use the backlog parameter to limit smc dangling
> connections, it's seems there's no difference from increasing the
> backlog parameter for the TCP listen socket, user space Application can
> simply avoid the 10K connections problem through that.
> 
> If so, this patch looks redundant to me. Look forward to your advise.
> 
> Thanks.

Hi D. Wythe,

IMHO, this patch is still useful for connection-flood scenarios.
Karsten is concerned about safety in the flood case; we should limit
the number of TCP sockets for safety.

Imagine these two scenarios:
- General socket setup: with this patch, setting up the TCP socket is
  faster, but we still need to wait for the rest of the handshake
  (RDMA...). Meanwhile, there are lots of dangling sockets, which may
  cause a kernel OOM. If the peer is not connecting for the first
  time, the rest of the handshake is faster, and this patch should be
  useful.
- Always falling back to TCP: the rest of the handshake failed, so we
  need to fall back, and this patch should make the fallback faster.

Whichever scenario we hit, it would be better to limit the number of
'half-connected' sockets, like the backlog in TCP. A possible approach
is to use sysctl to limit the global or per-net-namespace max backlog.
Currently, SMC doesn't support sysctl yet. I sent an incomplete sysctl
patch before; maybe it could help you. If so, I will fix that patch and
resend it.
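
As a rough illustration of that idea, and only as a sketch, a global
limit could be exposed under net/smc; the knob name and variable below
are assumptions, since SMC has no sysctl support at this point:

/* Hypothetical sketch: a sysctl limiting half-connected (dangling)
 * SMC sockets. Neither 'sysctl_smc_max_dangling_conns' nor a net/smc
 * sysctl table exists in the kernel today.
 */
static int sysctl_smc_max_dangling_conns __read_mostly = 1024;

static struct ctl_table smc_sysctl_limit_table[] = {
	{
		.procname	= "max_dangling_conns",
		.data		= &sysctl_smc_max_dangling_conns,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec_minmax,
		.extra1		= SYSCTL_ZERO,
	},
	{ }
};

/* would be registered once from smc_init(), e.g.:
 *	register_net_sysctl(&init_net, "net/smc", smc_sysctl_limit_table);
 */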

Thanks,
Tony Lu
 
> On Tue, Jan 04, 2022 at 02:45:35PM +0100, Karsten Graul wrote:
> > On 04/01/2022 14:12, D. Wythe wrote:
> > > From: "D. Wythe" <alibuda@linux.alibaba.com>
> > > 
> > > In nginx/wrk multithread and 10K connections benchmark, the
> > > backend TCP connection established very slowly, and lots of TCP
> > > connections stay in SYN_SENT state.
> > 
> > I see what you are trying to solve here.
> > So what happens with your patch now is that we are accepting way more connections
> > in advance and queue them up for the SMC connection handshake worker.
> > The connection handshake worker itself will not run faster with this change, so overall
> > it should be the same time that is needed to establish all connections.
> > What you solve is that when 10k connections are started at the same time, some of them
> > will be dropped due to tcp 3-way handshake timeouts. Your patch avoids that but one can now flood
> > the stack with an ~infinite amount of dangling sockets waiting for the SMC handshake, maybe even 
> > causing oom conditions.
> > 
> > What should be respected with such a change would be the backlog parameter for the listen socket,
> > i.e. how many backlog connections are requested by the user space application?
> > There is no such handling of backlog right now, and due to the 'braking' workers we avoided
> > to flood the kernel with too many dangling connections. With your change there should be a way to limit
> > this ind of connections in some way.
Dust Li Jan. 5, 2022, 8:57 a.m. UTC | #5
On Wed, Jan 05, 2022 at 12:40:49PM +0800, D. Wythe wrote:
>Hi, 
>
>Since we are trying to use the backlog parameter to limit smc dangling
>connections, it's seems there's no difference from increasing the
>backlog parameter for the TCP listen socket, user space Application can
>simply avoid the 10K connections problem through that.
>
>If so, this patch looks redundant to me. Look forward to your advise.

I think increasing the backlog in the userspace application is not a good idea.

AFAIU, SMC tries to behave the same as TCP in the socket layer, and
asking the application to increase the backlog breaks this principle.

In the TCP case, the backlog usually doesn't overflow if the application
calls accept() fast enough.
For SMC, calling accept() fast enough should likewise keep the backlog
of the CLC socket from overflowing. But it doesn't, because smc_hs_wq is
busy and hence TCP drops the SYN. From the application's point of view,
it is fast enough, but the kernel doesn't give it the chance. I think
this behaves differently from TCP.

I'm thinking maybe we can actively fall back to TCP in this case? Not
sure if this is a good idea.
Karsten Graul Jan. 5, 2022, 1:17 p.m. UTC | #6
On 05/01/2022 09:57, dust.li wrote:
> On Wed, Jan 05, 2022 at 12:40:49PM +0800, D. Wythe wrote:
> I'm thinking maybe we can actively fall back to TCP in this case ? Not
> sure if this is a good idea.

I think it's a good decision to switch new connections to the TCP fallback when the
current queue of connections waiting for an SMC handshake is too large.
With this, the application is able to accept all incoming connections and they are not
dropped. The only thing that is different compared to TCP is that the order of the
accepted connections changes: connections that came in later might reach the user space
application earlier than connections that are still running the SMC handshake processing.
But I think that is semantically okay.
D. Wythe Jan. 5, 2022, 3:06 p.m. UTC | #7
LGTM. Falling back makes the restrictions on SMC dangling
connections more meaningful to me than dropping them.

Overall, I see two scenarios:

1. Drop the overflow connections, limited by the userspace
application's accept().

2. Fall back the overflow connections, limited by the heavy processing
of the current SMC handshake. (We could also control this behavior
through sysctl.)

I'll follow that advice to improve my patch; more advice will be highly
appreciated.

Thanks all. 


On Wed, Jan 05, 2022 at 02:17:41PM +0100, Karsten Graul wrote:
> On 05/01/2022 09:57, dust.li wrote:
> > On Wed, Jan 05, 2022 at 12:40:49PM +0800, D. Wythe wrote:
> > I'm thinking maybe we can actively fall back to TCP in this case ? Not
> > sure if this is a good idea.
> 
> I think its a good decision to switch new connections to use the TCP fallback when the
> current queue of connections waiting for a SMC handshake is too large.
> With this the application is able to accept all incoming connections and they are not
> dropped. The only thing that is be different compared to TCP is that the order of the
> accepted connections is changed, connections that came in later might reach the user space 
> application earlier than connections that still run the SMC hand shake processing. 
> But I think that is semantically okay.
Karsten Graul Jan. 5, 2022, 7:13 p.m. UTC | #8
On 05/01/2022 16:06, D. Wythe wrote:
> LGTM. Fallback makes the restrictions on SMC dangling
> connections more meaningful to me, compared to dropping them.
> 
> Overall, i see there are two scenario.
> 
> 1. Drop the overflow connections limited by userspace application
> accept.
> 
> 2. Fallback the overflow connections limited by the heavy process of
> current SMC handshake. ( We can also control its behavior through
> sysctl.)
> 

I vote for (2), which makes the behavior, from the user space application's point of view, more like TCP.

One comment on sysctl: our current approach is to add new switches to the existing
netlink interface, which can be used with the smc-tools package (or your own implementations, of course).
Is this prerequisite problematic in your environment?
We tried to avoid more sysctls, and the netlink interface keeps us more flexible.
D. Wythe Jan. 6, 2022, 3:51 a.m. UTC | #9
One problem with the fallback scenario is that the server must actively
send a decline message to the client and wait for the CLC proposal
message that the client may have already sent; otherwise the SMC
handshake message may be read by the user space application, which will
also lead to OOM conditions caused by an unbounded number of dangling
sockets.

In that case, we have to put restrictions on connections that are in
the middle of falling back, which makes things more complicated.

Any advise will be highly appreciated.

Thanks.


On 2022/1/5 11:06 PM, D. Wythe wrote:
> LGTM. Fallback makes the restrictions on SMC dangling
> connections more meaningful to me, compared to dropping them.
> 
> Overall, i see there are two scenario.
> 
> 1. Drop the overflow connections limited by userspace application
> accept.
> 
> 2. Fallback the overflow connections limited by the heavy process of
> current SMC handshake. ( We can also control its behavior through
> sysctl.)
> 
> I'll follow those advise to improve my patch, more advise will be highly
> appreciated.
> 
> Thanks all.
> 
> 
> On Wed, Jan 05, 2022 at 02:17:41PM +0100, Karsten Graul wrote:
>> On 05/01/2022 09:57, dust.li wrote:
>>> On Wed, Jan 05, 2022 at 12:40:49PM +0800, D. Wythe wrote:
>>> I'm thinking maybe we can actively fall back to TCP in this case ? Not
>>> sure if this is a good idea.
>>
>> I think its a good decision to switch new connections to use the TCP fallback when the
>> current queue of connections waiting for a SMC handshake is too large.
>> With this the application is able to accept all incoming connections and they are not
>> dropped. The only thing that is be different compared to TCP is that the order of the
>> accepted connections is changed, connections that came in later might reach the user space
>> application earlier than connections that still run the SMC hand shake processing.
>> But I think that is semantically okay.
Tony Lu Jan. 6, 2022, 7:05 a.m. UTC | #10
On Wed, Jan 05, 2022 at 08:13:23PM +0100, Karsten Graul wrote:
> On 05/01/2022 16:06, D. Wythe wrote:
> > LGTM. Fallback makes the restrictions on SMC dangling
> > connections more meaningful to me, compared to dropping them.
> > 
> > Overall, i see there are two scenario.
> > 
> > 1. Drop the overflow connections limited by userspace application
> > accept.
> > 
> > 2. Fallback the overflow connections limited by the heavy process of
> > current SMC handshake. ( We can also control its behavior through
> > sysctl.)
> > 
> 
> I vote for (2) which makes the behavior from user space applications point of view more like TCP.

Falling back when SMC reaches its own limit is a good idea. I'm curious
whether the fallback reason is suitable; this is more like a
non-negative issue. Currently, SMC falls back for negative issues, such
as a resource not being available or an internal error. This issue
doesn't look like a negative reason.

And I'm not sure about mixing normal and fallback connections at the
same time while no error has happened and no hard limit is reached;
is that easy for users to reason about? It might lead to user
misunderstanding. Perhaps a parameter from userspace should control
this limit and the behaviour (drop or fallback).
 
> One comment to sysctl: our current approach is to add new switches to the existing 
> netlink interface which can be used with the smc-tools package (or own implementations of course). 
> Is this prereq problematic in your environment? 
> We tried to avoid more sysctls and the netlink interface keeps use more flexible.

I agree with you that using netlink is more flexible. There are some
differences in our environment when using netlink to control the
behavior of SMC.

Compared with netlink, sysctl is:
- Easy to use on clusters. Applications that want to use SMC don't need
  to deploy additional tools or develop extra netlink logic, especially
  for thousands of machines or containers. Going forward with SMC, we
  have to make sure the package or logic is compatible with the current
  kernel, whereas sysctl API compatibility is easy to discover.

- Config templates and default maintenance. We are using /etc/sysctl.conf to
  make sure the system configuration is up to date, such as pre-tuned SMC
  config parameters, so that we can change these default values on boot
  and generate lots of machines based on this machine template. Userspace
  netlink tools don't suit this; for IP-related config, for example, we
  need an additional NetworkManager or netctl to do it.

- TCP-like sysctl entries. TCP provides lots of sysctls to configure
  itself; sometimes they are hard to use and understand. However, they
  are accepted by most users and systems. Maybe we could use sysctl for
  the items that change frequently and are easy to change, and netlink
  for the complex items.

We are glad to contribute to smc-tools. Using both netlink and sysctl,
I think, is the more suitable choice.

Thanks,
Tony Lu
Karsten Graul Jan. 6, 2022, 9:54 a.m. UTC | #11
On 06/01/2022 04:51, D. Wythe wrote:
> 
> One problem for the fallback scenario is that server must actively send decline message to client and wait for the clc proposal message that client may already sent, otherwise the message of SMC handshake may be read by user space application, which will also lead to OOM conditions caused by infinite amount of dangling sockets.
> 
> In that case, we have to make restrictions on 'SMC fallback ing', which makes things more complicated.

That's a good point, I need to think about this ...
Karsten Graul Jan. 13, 2022, 8:07 a.m. UTC | #12
On 06/01/2022 08:05, Tony Lu wrote:
> On Wed, Jan 05, 2022 at 08:13:23PM +0100, Karsten Graul wrote:
>> On 05/01/2022 16:06, D. Wythe wrote:
>>> LGTM. Fallback makes the restrictions on SMC dangling
>>> connections more meaningful to me, compared to dropping them.
>>>
>>> Overall, i see there are two scenario.
>>>
>>> 1. Drop the overflow connections limited by userspace application
>>> accept.
>>>
>>> 2. Fallback the overflow connections limited by the heavy process of
>>> current SMC handshake. ( We can also control its behavior through
>>> sysctl.)
>>>
>>
>> I vote for (2) which makes the behavior from user space applications point of view more like TCP.
> Fallback when smc reaches itself limit is a good idea. I'm curious
> whether the fallback reason is suitable, it more like a non-negative
> issue. Currently, smc fallback for negative issues, such as resource not
> available or internal error. This issue doesn't like a non-negative
> reason.

SMC falls back when the SMC processing cannot be completed, e.g. due to 
resource constraints like memory. For me the time/duration constraint is
also a good reason to fall back to TCP.

> 
> And I have no idea about to mix the normal and fallback connections at
> same time, meanwhile there is no error happened or hard limit reaches,
> is a easy to maintain for users? Maybe let users misunderstanding, a
> parameter from userspace control this limit, and the behaviour (drop or
> fallback).

I'm thinking of the following approach: the default maximum number of active workers in a
work queue is defined by WQ_MAX_ACTIVE (512). When this limit is hit, we
have slightly fewer than 512 parallel SMC handshakes running at the moment,
and new workers would be enqueued without becoming active.
In that case (max active workers reached) I would tend to fall back new connections
to TCP. We would end up with fewer connections using SMC, but for the user space
applications there would be nearly no change compared to TCP (no dropped TCP connection
attempts, no need to reconnect).
Imho, most users will never run into this problem, so I think it's fine to behave like this.
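
Just to make the idea concrete, a rough sketch (not actual kernel code)
of what the check could look like when the handshake worker is queued;
the smc_hs_active counter and smc_listen_fallback() helper are
assumptions:

/* Hypothetical sketch of the idea above: count queued handshake
 * workers and fall back to TCP once the workqueue would stop running
 * them in parallel (WQ_MAX_ACTIVE). 'smc_hs_active' and
 * smc_listen_fallback() do not exist; they are assumed helpers.
 */
static atomic_t smc_hs_active = ATOMIC_INIT(0);

static void smc_queue_listen_work(struct smc_sock *new_smc)
{
	if (atomic_inc_return(&smc_hs_active) >= WQ_MAX_ACTIVE) {
		/* handshake workers saturated: accept as plain TCP */
		atomic_dec(&smc_hs_active);
		smc_listen_fallback(new_smc);	/* assumed helper */
		return;
	}
	queue_work(smc_hs_wq, &new_smc->smc_listen_work);
	/* smc_listen_work() would decrement the counter when done */
}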

As far as I understand you, you still see a good reason for having another behavior
implemented in parallel (controllable by the user) which enqueues all incoming connections
like in your patch proposal? But how do we deal with the out-of-memory problems that might
happen with that?

>  
>> One comment to sysctl: our current approach is to add new switches to the existing 
>> netlink interface which can be used with the smc-tools package (or own implementations of course). 
>> Is this prereq problematic in your environment? 
>> We tried to avoid more sysctls and the netlink interface keeps use more flexible.
> 
> I agree with you about using netlink is more flexible. There are
> something different in our environment to use netlink to control the
> behaves of smc.
> 
> Compared with netlink, sysctl is:
> - easy to use on clusters. Applications who want to use smc, don't need
>   to deploy additional tools or developing another netlink logic,
>   especially for thousands of machines or containers. With smc forward,
>   we should make sure the package or logic is compatible with current
>   kernel, but sysctl's API compatible is easy to discover.
> 
> - config template and default maintain. We are using /etc/sysctl.conf to
>   make sure the systeml configures update to date, such as pre-tuned smc
>   config parameters. So that we can change this default values on boot,
>   and generate lots of machines base on this machine template. Userspace
>   netlink tools doesn't suit for it, for example ip related config, we
>   need additional NetworkManager or netctl to do this.
> 
> - TCP-like sysctl entries. TCP provides lots of sysctl to configure
>   itself, somethings it is hard to use and understand. However, it is
>   accepted by most of users and system. Maybe we could use sysctl for
>   the item that frequently and easy to change, netlink for the complex
>   item.
> 
> We are gold to contribute to smc-tools. Use netlink and sysctl both
> time, I think, is a more suitable choice.

Let's decide that when you have a specific control that you want to implement.
I want to have a very good reason to introduce another interface into the SMC module,
making the code more complex and all of that. The decision for the netlink interface
was also made because we have the impression that this is the NEW way to go, and
since we had no interface before, we started with the most modern way to implement it.

TCP et al. have a history with sysfs, so that's why it is still there.
But I might be wrong on that...
Jakub Kicinski Jan. 13, 2022, 6:50 p.m. UTC | #13
On Thu, 13 Jan 2022 09:07:51 +0100 Karsten Graul wrote:
> Lets decide that when you have a specific control that you want to implement. 
> I want to have a very good to introduce another interface into the SMC module,
> making the code more complex and all of that. The decision for the netlink interface 
> was also done because we have the impression that this is the NEW way to go, and
> since we had no interface before we started with the most modern way to implement it.
> 
> TCP et al have a history with sysfs, so thats why it is still there. 
> But I might be wrong on that...

To the best of my knowledge you are correct.
Tony Lu Jan. 20, 2022, 1:39 p.m. UTC | #14
On Thu, Jan 13, 2022 at 09:07:51AM +0100, Karsten Graul wrote:
> On 06/01/2022 08:05, Tony Lu wrote:
> 
> I think of the following approach: the default maximum of active workers in a
> work queue is defined by WQ_MAX_ACTIVE (512). when this limit is hit then we
> have slightly lesser than 512 parallel SMC handshakes running at the moment,
> and new workers would be enqueued without to become active.
> In that case (max active workers reached) I would tend to fallback new connections
> to TCP. We would end up with lesser connections using SMC, but for the user space
> applications there would be nearly no change compared to TCP (no dropped TCP connection
> attempts, no need to reconnect).
> Imho, most users will never run into this problem, so I think its fine to behave like this.

This makes sense to me, thanks.

> 
> As far as I understand you, you still see a good reason in having another behavior 
> implemented in parallel (controllable by user) which enqueues all incoming connections
> like in your patch proposal? But how to deal with the out-of-memory problems that might 
> happen with that?

There is a possible scenario where the user only wants to use the SMC protocol, such
as a performance benchmark, or explicitly specifies the SMC protocol; they can
afford the lower speed of incoming connection creation, but enjoy the
higher QPS after creation.

> Lets decide that when you have a specific control that you want to implement. 
> I want to have a very good to introduce another interface into the SMC module,
> making the code more complex and all of that. The decision for the netlink interface 
> was also done because we have the impression that this is the NEW way to go, and
> since we had no interface before we started with the most modern way to implement it.
> 
> TCP et al have a history with sysfs, so thats why it is still there. 
> But I might be wrong on that...

Thanks for the information; I didn't know about the decision on the new
control interface, and I understand your reasoning about it.
We are glad to contribute the knobs to smc_netlink.c in the next patches.

There is something I want to discuss here about persistent
configuration: we need to store the new config in the system and make
sure that it can be loaded correctly after boot. A possible solution is
to extend smc-tools for the new config and work with systemd for
auto-loading. If that works, we are glad to contribute these to smc-tools.

Thank you.
Tony Lu
Stefan Raspl Jan. 20, 2022, 4 p.m. UTC | #15
On 1/20/22 14:39, Tony Lu wrote:
> On Thu, Jan 13, 2022 at 09:07:51AM +0100, Karsten Graul wrote:
>> On 06/01/2022 08:05, Tony Lu wrote:
>>
>> I think of the following approach: the default maximum of active workers in a
>> work queue is defined by WQ_MAX_ACTIVE (512). when this limit is hit then we
>> have slightly lesser than 512 parallel SMC handshakes running at the moment,
>> and new workers would be enqueued without to become active.
>> In that case (max active workers reached) I would tend to fallback new connections
>> to TCP. We would end up with lesser connections using SMC, but for the user space
>> applications there would be nearly no change compared to TCP (no dropped TCP connection
>> attempts, no need to reconnect).
>> Imho, most users will never run into this problem, so I think its fine to behave like this.
> 
> This makes sense to me, thanks.
> 
>>
>> As far as I understand you, you still see a good reason in having another behavior
>> implemented in parallel (controllable by user) which enqueues all incoming connections
>> like in your patch proposal? But how to deal with the out-of-memory problems that might
>> happen with that?
> 
> There is a possible scene, when the user only wants to use SMC protocol, such
> as performance benchmark, or explicitly specify SMC protocol, they can
> afford the lower speed of incoming connection creation, but enjoy the
> higher QPS after creation.
> 
>> Lets decide that when you have a specific control that you want to implement.
>> I want to have a very good to introduce another interface into the SMC module,
>> making the code more complex and all of that. The decision for the netlink interface
>> was also done because we have the impression that this is the NEW way to go, and
>> since we had no interface before we started with the most modern way to implement it.
>>
>> TCP et al have a history with sysfs, so thats why it is still there.
>> But I might be wrong on that...
> 
> Thanks for the information that I don't know about the decision for new
> control interface. I am understanding your decision about the interface.
> We are glad to contribute the knobs to smc_netlink.c in the next patches.
> 
> There is something I want to discuss here about the persistent
> configuration, we need to store new config in system, and make sure that
> it could be loaded correctly after boot up. A possible solution is to
> extend smc-tools for new config, and work with systemd for auto-loading.
> If it works, we are glad to contribute these to smc-tools.

I'd definitely be open to looking into patches for smc-tools that extend it to
configure SMC properties, and that provide the capability to read (and apply) a
config from a file! We can discuss what you'd imagine as an interface before you
implement it, too.

Ciao,
Stefan
Tony Lu Jan. 21, 2022, 2:47 a.m. UTC | #16
On Thu, Jan 20, 2022 at 05:00:18PM +0100, Stefan Raspl wrote:
> I'd be definitely open to look into patches for smc-tools that extend it to
> configure SMC properties, and that provide the capability to read (and
> apply) a config from a file! We can discuss what you'd imagine as an
> interface before you implement it, too.

Thanks for your reply. I will complete my detailed proposal, and send it
out later.
Dust Li Feb. 16, 2022, 11:46 a.m. UTC | #17
On Thu, Jan 13, 2022 at 09:07:51AM +0100, Karsten Graul wrote:

>>> One comment to sysctl: our current approach is to add new switches to the existing 
>>> netlink interface which can be used with the smc-tools package (or own implementations of course). 
>>> Is this prereq problematic in your environment? 
>>> We tried to avoid more sysctls and the netlink interface keeps use more flexible.
>> 
>> I agree with you about using netlink is more flexible. There are
>> something different in our environment to use netlink to control the
>> behaves of smc.
>> 
>> Compared with netlink, sysctl is:
>> - easy to use on clusters. Applications who want to use smc, don't need
>>   to deploy additional tools or developing another netlink logic,
>>   especially for thousands of machines or containers. With smc forward,
>>   we should make sure the package or logic is compatible with current
>>   kernel, but sysctl's API compatible is easy to discover.
>> 
>> - config template and default maintain. We are using /etc/sysctl.conf to
>>   make sure the systeml configures update to date, such as pre-tuned smc
>>   config parameters. So that we can change this default values on boot,
>>   and generate lots of machines base on this machine template. Userspace
>>   netlink tools doesn't suit for it, for example ip related config, we
>>   need additional NetworkManager or netctl to do this.
>> 
>> - TCP-like sysctl entries. TCP provides lots of sysctl to configure
>>   itself, somethings it is hard to use and understand. However, it is
>>   accepted by most of users and system. Maybe we could use sysctl for
>>   the item that frequently and easy to change, netlink for the complex
>>   item.
>> 
>> We are gold to contribute to smc-tools. Use netlink and sysctl both
>> time, I think, is a more suitable choice.
>
>Lets decide that when you have a specific control that you want to implement. 
>I want to have a very good to introduce another interface into the SMC module,
>making the code more complex and all of that. The decision for the netlink interface 
>was also done because we have the impression that this is the NEW way to go, and
>since we had no interface before we started with the most modern way to implement it.
>
>TCP et al have a history with sysfs, so thats why it is still there. 
>But I might be wrong on that...

Sorry to bother you with this topic again...

When implementing SMC autocork, I'd like to add a switch to enable or
disable SMC autocork, just like TCP does. But I've run into some
problems which I think might be relevant to this topic.

My requirements for the switch are as follows (see the sketch after
this list):
1. It can be set dynamically by a userspace tool.
2. It needs to be namespacified so different containers can have their
own value.
3. It should be able to be configured to default values using a
configuration file, so that every time a container starts, those
values can be set properly.
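
For requirements 2 and 3, a per-netns sysctl in the style of TCP would
be one way to do it. The sketch below is purely hypothetical: neither
the netns_smc fields nor the 'autocorking' knob exist, and a
netlink-based solution may still be preferred.

/* Hypothetical sketch only: a namespacified autocork switch.
 * 'net->smc.sysctl_autocorking' and 'net->smc.sysctl_hdr' are assumed
 * fields; they are not in the kernel.
 */
static struct ctl_table smc_sysctl_table[] = {
	{
		.procname	= "autocorking",
		.data		= &init_net.smc.sysctl_autocorking,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec,
	},
	{ }
};

static int __net_init smc_sysctl_net_init(struct net *net)
{
	struct ctl_table *table = smc_sysctl_table;

	if (!net_eq(net, &init_net)) {
		/* per-netns copy so each container gets its own value */
		table = kmemdup(table, sizeof(smc_sysctl_table), GFP_KERNEL);
		if (!table)
			return -ENOMEM;
		table[0].data = &net->smc.sysctl_autocorking;
	}
	net->smc.sysctl_autocorking = 1;	/* default: enabled */
	net->smc.sysctl_hdr = register_net_sysctl(net, "net/smc", table);
	if (!net->smc.sysctl_hdr) {
		if (!net_eq(net, &init_net))
			kfree(table);
		return -ENOMEM;
	}
	return 0;
}
/* registered via register_pernet_subsys() together with an exit
 * handler that unregisters the header and frees the per-netns table
 */

Requirement 3 would then come for free: a container's /etc/sysctl.conf
(or the runtime's sysctls option) applies the defaults at start.
Whether sysctl or netlink is the right interface is of course the open
question above.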


I noticed we recently had a patch ("net/smc: Add global configure for
handshake limitation by netlink") which did something similar. I tried
the same approach but found it might not be very elegant:
1. I need to copy most of the code (enable/disable/dump) for autocork,
   which is quite redundant. Maybe we should add some common wrappers?
2. I need to add a new enumeration, and what if years later we find
   this function is no longer needed? Deleting it may cause ABI
   compatibility issues.
3. Finally, how can we implement requirement #3? It is really needed
   in the K8s container environment.

Any suggestions or comments are really welcome.

Thanks!

Patch

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 0bb614e..08722c0 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -62,6 +62,7 @@ 
 						 * creation on client
 						 */
 
+struct workqueue_struct	*smc_tcp_ls_wq;	/* wq for tcp listen work */
 struct workqueue_struct	*smc_hs_wq;	/* wq for handshake work */
 struct workqueue_struct	*smc_close_wq;	/* wq for close work */
 
@@ -1872,7 +1873,7 @@  static void smc_clcsock_data_ready(struct sock *listen_clcsock)
 	lsmc->clcsk_data_ready(listen_clcsock);
 	if (lsmc->sk.sk_state == SMC_LISTEN) {
 		sock_hold(&lsmc->sk); /* sock_put in smc_tcp_listen_work() */
-		if (!queue_work(smc_hs_wq, &lsmc->tcp_listen_work))
+		if (!queue_work(smc_tcp_ls_wq, &lsmc->tcp_listen_work))
 			sock_put(&lsmc->sk);
 	}
 }
@@ -2610,9 +2611,14 @@  static int __init smc_init(void)
 		goto out_nl;
 
 	rc = -ENOMEM;
+
+	smc_tcp_ls_wq = alloc_workqueue("smc_tcp_ls_wq", 0, 0);
+	if (!smc_tcp_ls_wq)
+		goto out_pnet;
+
 	smc_hs_wq = alloc_workqueue("smc_hs_wq", 0, 0);
 	if (!smc_hs_wq)
-		goto out_pnet;
+		goto out_alloc_tcp_ls_wq;
 
 	smc_close_wq = alloc_workqueue("smc_close_wq", 0, 0);
 	if (!smc_close_wq)
@@ -2709,6 +2715,8 @@  static int __init smc_init(void)
 	destroy_workqueue(smc_close_wq);
 out_alloc_hs_wq:
 	destroy_workqueue(smc_hs_wq);
+out_alloc_tcp_ls_wq:
+	destroy_workqueue(smc_tcp_ls_wq);
 out_pnet:
 	smc_pnet_exit();
 out_nl:
@@ -2728,6 +2736,7 @@  static void __exit smc_exit(void)
 	smc_core_exit();
 	smc_ib_unregister_client();
 	destroy_workqueue(smc_close_wq);
+	destroy_workqueue(smc_tcp_ls_wq);
 	destroy_workqueue(smc_hs_wq);
 	proto_unregister(&smc_proto6);
 	proto_unregister(&smc_proto);
diff --git a/net/smc/smc.h b/net/smc/smc.h
index b1d6625..18fa803 100644
--- a/net/smc/smc.h
+++ b/net/smc/smc.h
@@ -256,6 +256,7 @@  static inline struct smc_sock *smc_sk(const struct sock *sk)
 	return (struct smc_sock *)sk;
 }
 
+extern struct workqueue_struct	*smc_tcp_ls_wq;	/* wq for tcp listen work */
 extern struct workqueue_struct	*smc_hs_wq;	/* wq for handshake work */
 extern struct workqueue_struct	*smc_close_wq;	/* wq for close work */