[net-next,v5,0/3] Introduce IPPROTO_SMC

Message ID 1717061440-59937-1-git-send-email-alibuda@linux.alibaba.com (mailing list archive)

Message

D. Wythe May 30, 2024, 9:30 a.m. UTC
From: "D. Wythe" <alibuda@linux.alibaba.com>

This patch series allows creating SMC sockets via AF_INET and AF_INET6,
as in the following code:

/* create v4 smc sock */
v4 = socket(AF_INET, SOCK_STREAM, IPPROTO_SMC);

/* create v6 smc sock */
v6 = socket(AF_INET6, SOCK_STREAM, IPPROTO_SMC);
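
As a rough illustration of the calls above, the following sketch wraps them in a helper that falls back to plain TCP when the kernel does not accept IPPROTO_SMC. The helper name smc_or_tcp_socket() is made up for this example, and the value 256 for IPPROTO_SMC is an assumption used only when the system headers do not define it yet.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Assumption: 256 is the protocol number for IPPROTO_SMC. Prefer the
 * definition from the system headers whenever they provide one. */
#ifndef IPPROTO_SMC
#define IPPROTO_SMC 256
#endif

/*
 * Hypothetical helper: try to create an SMC stream socket for the given
 * family (AF_INET or AF_INET6). If that fails, for example because the
 * running kernel does not know the protocol, retry with plain TCP.
 * Returns a file descriptor, or -1 with errno set by the last socket() call.
 */
int smc_or_tcp_socket(int family)
{
    int fd = socket(family, SOCK_STREAM, IPPROTO_SMC);

    if (fd < 0)
        fd = socket(family, SOCK_STREAM, IPPROTO_TCP);
    return fd;
}
```

Calling smc_or_tcp_socket(AF_INET6) gives the IPv6 variant; the returned descriptor is then used with connect(), bind() and so on like any other stream socket.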

There are several reasons why we believe this is appropriate:

1. SMC sockets actually use IPv4 (AF_INET) or IPv6 (AF_INET6)
addresses. There is no AF_SMC address at all.

2. Creating SMC sockets in the AF_INET(6) path allows us to reuse
the infrastructure of that path, such as the common eBPF hooks.
Otherwise, SMC would have to implement it again in the AF_SMC path.
With those hooks we can, for example:
  1. Replace IPPROTO_TCP with IPPROTO_SMC in the socket() syscall
     initiated by the user, without the use of LD_PRELOAD.
  2. Select whether immediate fallback is required based on the peer's
     port/IP before connect().

A very significant result is that we can now use eBPF to implement smc_run
instead of LD_PRELOAD, which is completely ineffective for statically
linked binaries.
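
To make the replacement idea concrete, here is a minimal userspace sketch of the decision such an eBPF program would have to make: given the three socket() arguments, return the protocol that should actually be used. The function name smc_pick_protocol(), the mask name SOCK_TYPE_BITS and the fallback value 256 for IPPROTO_SMC are assumptions made for this example. In the kernel, logic like this could for instance live in an fmod_ret BPF program attached to update_socket_protocol(), the hook exercised by the kernel's MPTCP selftest; this thread does not say which hook smc_run.bpf uses.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Assumption: fallback value used only if the headers lack IPPROTO_SMC. */
#ifndef IPPROTO_SMC
#define IPPROTO_SMC 256
#endif

/* On Linux the low four bits of the socket() type argument hold the socket
 * type; higher bits carry flags such as SOCK_NONBLOCK and SOCK_CLOEXEC. */
#define SOCK_TYPE_BITS 0xf

/*
 * Hypothetical decision function: return IPPROTO_SMC for IPv4/IPv6 stream
 * sockets that asked for TCP (explicitly or via protocol 0), and leave
 * every other request unchanged.
 */
int smc_pick_protocol(int family, int type, int protocol)
{
    if (family != AF_INET && family != AF_INET6)
        return protocol;
    if ((type & SOCK_TYPE_BITS) != SOCK_STREAM)
        return protocol;
    if (protocol != 0 && protocol != IPPROTO_TCP)
        return protocol;
    return IPPROTO_SMC;
}
```

Because the rewrite happens on the kernel side of the socket() syscall, it does not depend on the dynamic linker, which is why it also covers statically linked programs.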

Another potential benefit is that we are attempting to optimize the
performance of fallback sockets, where merging sockets is an important part,
and this relies on creating SMC sockets under the AF_INET path.
(More information:
https://lore.kernel.org/netdev/1699442703-25015-1-git-send-email-alibuda@linux.alibaba.com/T/)

v1 -> v2:

- Code formatting, mainly alignment and comment fixes.
- Move inet_smc proto ops to inet_smc.c, to avoid af_smc.c becoming too bulky.
- Fix the issue where refactoring affected the initialization order.
- Fix a compile warning (unused out_inet_prot) when CONFIG_IPV6 is not set.

v2 -> v3:

- Add Alibaba's copyright information to the new files

v3 -> v4:

- Fix some spelling errors
- Align the function naming style, changing smc_sock_init() to smc_sk_init()
- Reverse the order of the conditional checks on clcsock to make the code more intuitive

v4 -> v5:

- Fix some spelling errors
- Add the comment "/* CONFIG_IPV6 */" after the final #endif directive.
- Rename inet_smc.h and inet_smc.c to smc_inet.h and smc_inet.c
- Encapsulate the initialization and destruction of inet_smc in smc_inet.c,
  rather than implementing it directly in af_smc.c.
- Remove unneeded header includes from smc_inet.h
- Make smc_inet_prot_xxx and smc_inet_sock_init() static, since they are
  only used in smc_inet.c

D. Wythe (3):
  net/smc: refactoring initialization of smc sock
  net/smc: expose smc proto operations
  net/smc: Introduce IPPROTO_SMC

 include/uapi/linux/in.h |   2 +
 net/smc/Makefile        |   2 +-
 net/smc/af_smc.c        | 162 +++++++++++++++++++++++++--------------------
 net/smc/smc.h           |  38 +++++++++++
 net/smc/smc_inet.c      | 170 ++++++++++++++++++++++++++++++++++++++++++++++++
 net/smc/smc_inet.h      |  22 +++++++
 6 files changed, 325 insertions(+), 71 deletions(-)
 create mode 100644 net/smc/smc_inet.c
 create mode 100644 net/smc/smc_inet.h

Comments

D. Wythe May 30, 2024, 10:14 a.m. UTC | #1
Everyone is welcome to try out the eBPF-based version of smc_run during
testing. I have added a separate command called smc_run.bpf;
it is equivalent to the normal smc_run but uses IPPROTO_SMC via eBPF.

You can obtain the code and more info from: 
https://github.com/D-Wythe/smc-tools

Usage:

smc_run.bpf
An eBPF implementation of smc_run based on IPPROTO_SMC:

1. Supports transparent replacement based on a command (just like smc_run).
2. Supports transparent replacement based on pid configuration, and
supports the inheritance of this capability between parent and child
processes.
3. Supports transparent replacement based on per-netns configuration.

smc_run.bpf COMMAND

1. Equivalent to smc_run but with IPPROTO_SMC via eBPF

smc_run.bpf -p pid

  1. Add the process with the target pid to the map. Afterward, all socket()
calls of the process and its descendant processes will have IPPROTO_TCP
replaced with IPPROTO_SMC.
  2. The mapping is deleted automatically when the process exits.
  3. Specifically, COMMAND mode actually works as follows:

     smc_run.bpf -p $$
     COMMAND
     exit

smc_run.bpf -n 1

  1. Replace IPPROTO_TCP with IPPROTO_SMC in all socket() calls of the
current netns.
  2. Turn it off with smc_run.bpf -n 0
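
The pid mode described above can be pictured with a small userspace model: a set of tracked pids, a fork handler that lets children inherit membership, and an exit handler that drops the entry. This is only a sketch of the described behaviour, not the smc_run.bpf implementation; all names (pid_track(), on_fork(), on_exit_pid(), MAX_TRACKED) are hypothetical, and the real tool keeps this state in an eBPF map rather than in a C array.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

#define MAX_TRACKED 64 /* arbitrary capacity for this model */

static pid_t tracked[MAX_TRACKED];
static size_t tracked_count;

/* True if socket() calls of this pid should be switched to IPPROTO_SMC. */
bool pid_tracked(pid_t pid)
{
    for (size_t i = 0; i < tracked_count; i++)
        if (tracked[i] == pid)
            return true;
    return false;
}

/* Models "smc_run.bpf -p pid": add the pid; returns false if the set is full. */
bool pid_track(pid_t pid)
{
    if (pid_tracked(pid))
        return true;
    if (tracked_count == MAX_TRACKED)
        return false;
    tracked[tracked_count++] = pid;
    return true;
}

/* A child inherits the replacement only if its parent is tracked. */
void on_fork(pid_t parent, pid_t child)
{
    if (pid_tracked(parent))
        pid_track(child);
}

/* The mapping is removed when the process exits. */
void on_exit_pid(pid_t pid)
{
    for (size_t i = 0; i < tracked_count; i++) {
        if (tracked[i] == pid) {
            tracked[i] = tracked[--tracked_count];
            return;
        }
    }
}
```

In this model a child stays tracked after its parent exits; whether smc_run.bpf behaves the same way is an assumption.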
Wenjia Zhang May 31, 2024, 8:06 a.m. UTC | #2
Hi D. Wythe,

Thank you for the info and description! The code generally looks good to
me, though there are still some details I need to check again. I'd also like
to give smc_run.bpf a try, and maybe let you know next week whether it works
for me.

Thanks,
Wenjia
D. Wythe June 3, 2024, 3:01 a.m. UTC | #3
Hi Wenjia,

That's fine with us. If there are any issues with using
smc_run.bpf, please let me know.

Best wishes,
D. Wythe
Niklas Schnelle June 3, 2024, 7:48 a.m. UTC | #4
Hi D. Wythe,

I gave this series plus your smc_run.bpf and SMC_LO based SMC-D a test
run on my Ryzen 3900X workstation and I have to say I'm quite
impressed. I first tried the SMC_LO feature as merged in v6.10-rc1 with
the classic LD_PRELOAD based smc_run and iperf3, and qperf …
tcp_bw/tcp_lat both with normal localhost and between docker
containers. For this to work I of course had to initially set my UEID,
because x86_64, unlike s390x, doesn't get an SEID set. I used the following
script for this.


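#!/bin/sh
# Derive a UEID from the machine ID: upper-case it, keep the first 27
# characters, and prefix them with "MID-".
machine_id_upper=$(tr '[:lower:]' '[:upper:]' < /etc/machine-id)
machine_id_suffix=$(echo "$machine_id_upper" | head -c 27)
ueid="MID-$machine_id_suffix"
# Add the resulting UEID with smcd.
smcd ueid add "$ueid"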


The performance is pretty impressive:
* iperf3 with 12 parallel connections (matching core count) results in
  ~152 Gbit/s on normal loopback and ~312 Gbit/s with SMC_LO.
* qperf … tcp_bw (single thread) results in ~46 Gbit/s on normal loopback
  and ~58 Gbit/s with SMC_LO
* qperf … tcp_lat latency test results in 5-9 us with normal loopback
  and around 3-4 us with SMC_LO

Then I applied this series on top of v6.10-rc1 and tried it with your
smc_run.bpf. The performance is of course in line with the above, but
thanks to being able to enable SMC on a per-netns basis I was able to
try a few more things. First I tried just enabling it in my default
netns and verified that after restarting sshd new ssh connections to
localhost used SMC-D through SMC_LO. Then I started Chrome and
confirmed that its TCP connections also registered with SMC and
successfully fell back to TCP mode. I had no trouble with normal
browsing, though I guess Google stuff in particular often uses HTTP/3 and so
isn't affected. Still nice to see I didn't get breakage.

Secondly I tried smc_run.bpf with docker containers using the following
trick:

docker inspect --format '{{.State.Pid}}' <my_container_name>
34651
nsenter -t 34651 -n smc_run.bpf -n 1

Sadly this only works for commands started in the container after
loading the BPF. So I wonder if you know of a good way to either
automatically execute smc_run.bpf on container start or maybe use it on
the docker daemon such that all namespaces created by docker get the
IPPROTO_SMC override. I'd then definitely consider using SMC-D with
SMC_LO between my home lab containers even if just for bragging rights
;-)
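
For readers unfamiliar with the nsenter step in the trick above, the following sketch shows roughly what "nsenter -t PID -n COMMAND" amounts to: open the target process's network namespace file, join it with setns(), then exec the command. The function names netns_path() and exec_in_netns() are invented for this example; it assumes Linux and sufficient privileges (typically CAP_SYS_ADMIN).

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Write "/proc/<pid>/ns/net" into buf. Returns 0, or -1 if it does not fit. */
int netns_path(pid_t pid, char *buf, size_t len)
{
    int n = snprintf(buf, len, "/proc/%ld/ns/net", (long)pid);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Join the network namespace of `pid` and exec argv[0] with argv.
 * Returns -1 on failure; does not return on success.
 */
int exec_in_netns(pid_t pid, char *const argv[])
{
    char path[64];
    int fd;

    if (netns_path(pid, path, sizeof(path)) < 0)
        return -1;
    fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;
    if (setns(fd, CLONE_NEWNET) < 0) {
        close(fd);
        return -1;
    }
    close(fd);
    execvp(argv[0], argv);
    return -1; /* only reached if execvp() failed */
}
```

With the pid from the docker inspect output, exec_in_netns(34651, argv) with argv set to {"smc_run.bpf", "-n", "1", NULL} would correspond to the nsenter line.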

Feel free to add for the IPPROTO_SMC series:

Tested-by: Niklas Schnelle <schnelle@linux.ibm.com>

Thanks,
Niklas
D. Wythe June 3, 2024, 3:07 p.m. UTC | #5
Hi Niklas,

Thanks very much for your testing.

Regarding your question, have you tried starting the container
using 'smc_run.bpf docker'?

smc_run.bpf allows the replacement capability to be inherited by
descendant processes. This might meet your needs.
However, note that the scope would then no longer be limited
to the netns.

If you don't want to replace the docker command and would like to keep
the per-netns scope, there are indeed some tricky ways. For example,
we could check the current process name when a new netns is created to
decide whether to add it to the eBPF map,
but I don't think it would be appropriate to include this in smc_run.bpf.

Best wishes,
D. Wythe
Niklas Schnelle June 4, 2024, 11:32 a.m. UTC | #6
I'll have to try this. I'm guessing that for docker, smc_run.bpf would
have to be applied to the docker daemon and not the individual docker
commands. For podman, on the other hand, the individual command might
work as there is no central daemon. And as you said, bpf should allow us
to add other policies in the future.

Thanks,
Niklas