[net-next,v7,00/11] net/smc: SMC intra-OS shortcut with loopback-ism

Message ID 20240428060738.60843-1-guwen@linux.alibaba.com (mailing list archive)

Message

Wen Gu April 28, 2024, 6:07 a.m. UTC
This patch set is the second part of the new version of [1] (the first part
can be found at [2]). The changes in this version are listed at the end.

- Background

SMC-D is currently used on IBM Z with the ISM function to optimize network
interconnect for intra-CPC communication. Inspired by this, we make SMC-D
available on non-s390 architectures through a software-implemented Emulated-ISM
device, the loopback-ism device introduced here, to accelerate inter-process or
inter-container communication within the same OS instance.

- Design

This patch set includes 3 parts:

 - Patch #1: preparatory work for loopback-ism.
 - Patch #2-#7: implement the loopback-ism device and adapt SMC-D to it.
   loopback-ism currently serves only SMC and exposes no userspace interfaces.
 - Patch #8-#11: memory copy optimization for the intra-OS scenario.

The loopback-ism device is designed as an ISMv2 device and is not limited to
a specific net namespace. Both ends of an inter-process connection (1/1' in the
diagram below) or an inter-container connection (2/2' in the diagram below) can
find the same available loopback-ism device and choose it during the CLC
handshake.

 Container 1 (ns1)                              Container 2 (ns2)
 +-----------------------------------------+    +-------------------------+
 | +-------+      +-------+      +-------+ |    |        +-------+        |
 | | App A |      | App B |      | App C | |    |        | App D |<-+     |
 | +-------+      +---^---+      +-------+ |    |        +-------+  |(2') |
 |     |127.0.0.1 (1')|             |192.168.0.11       192.168.0.12|     |
 |  (1)|   +--------+ | +--------+  |(2)   |    | +--------+   +--------+ |
 |     `-->|   lo   |-` |  eth0  |<-`      |    | |   lo   |   |  eth0  | |
 +---------+--|---^-+---+-----|--+---------+    +-+--------+---+-^------+-+
              |   |           |                                  |
 Kernel       |   |           |                                  |
 +----+-------v---+-----------v----------------------------------+---+----+
 |    |                            TCP                               |    |
 |    |                                                              |    |
 |    +--------------------------------------------------------------+    |
 |                                                                        |
 |                           +--------------+                             |
 |                           | smc loopback |                             |
 +---------------------------+--------------+-----------------------------+

The loopback-ism device creates DMBs (shared memory) for each connection peer.
Since data transfer occurs within the same kernel, the sndbuf of each peer
is only a descriptor that points to the same memory region as the peer's DMB,
so the data copy from sndbuf to peer DMB can be avoided in the loopback-ism case.

 Container 1 (ns1)                              Container 2 (ns2)
 +-----------------------------------------+    +-------------------------+
 | +-------+                               |    |        +-------+        |
 | | App C |-----+                         |    |        | App D |        |
 | +-------+     |                         |    |        +-^-----+        |
 |               |                         |    |          |              |
 |           (2) |                         |    |     (2') |              |
 |               |                         |    |          |              |
 +---------------|-------------------------+    +----------|--------------+
                 |                                         |
 Kernel          |                                         |
 +---------------|-----------------------------------------|--------------+
 | +--------+ +--v-----+                           +--------+ +--------+  |
 | |dmb_desc| |snd_desc|                           |dmb_desc| |snd_desc|  |
 | +-----|--+ +--|-----+                           +-----|--+ +--------+  |
 | +-----|--+    |                                 +-----|--+             |
 | | DMB C  |    +---------------------------------| DMB D  |             |
 | +--------+                                      +--------+             |
 |                                                                        |
 |                           +--------------+                             |
 |                           | smc loopback |                             |
 +---------------------------+--------------+-----------------------------+
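
The buffer sharing shown above can be illustrated with a small userspace
sketch. It is not the kernel implementation: struct dmb, struct buf_desc,
sndbuf_attach() and sndbuf_write() are hypothetical names chosen for
illustration, and plain heap memory stands in for a DMB. It only shows the
idea that once the sender's sndbuf is a descriptor over the peer's DMB, data
written through the sndbuf is already in the peer DMB and no second copy is
needed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a DMB: a zeroed region owned by the receiver. */
struct dmb {
    char *mem;
    size_t len;
};

/* Hypothetical buffer descriptor: holds a pointer but owns no memory. */
struct buf_desc {
    char *cpu_addr;   /* where writes through this descriptor land */
    size_t len;
};

/* Allocate a zeroed DMB of the given size; returns 0 on success. */
static int dmb_create(struct dmb *d, size_t len)
{
    d->mem = calloc(1, len);
    d->len = d->mem ? len : 0;
    return d->mem ? 0 : -1;
}

static void dmb_destroy(struct dmb *d)
{
    free(d->mem);
    d->mem = NULL;
    d->len = 0;
}

/* Make the sender's sndbuf a pure descriptor over the peer's DMB. */
static void sndbuf_attach(struct buf_desc *snd, const struct dmb *peer)
{
    assert(peer->mem != NULL);
    snd->cpu_addr = peer->mem;
    snd->len = peer->len;
}

/*
 * Write n bytes at offset off through the sndbuf descriptor. Because the
 * descriptor aliases the peer DMB, this single copy is the only one.
 * Returns the number of bytes written, or -1 if the range does not fit.
 */
static long sndbuf_write(struct buf_desc *snd, size_t off,
                         const void *data, size_t n)
{
    if (off > snd->len || n > snd->len - off)
        return -1;
    memcpy(snd->cpu_addr + off, data, n);
    return (long)n;
}
```

A caller would create the receiver's DMB with dmb_create(), attach the
sender's descriptor with sndbuf_attach(), and then read what the sender wrote
directly from the DMB memory.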

- Benchmark Test

 * Test environments:
      - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
      - SMC sndbuf/DMB size 1MB.

 * Test object:
      - TCP: run on TCP loopback.
      - SMC lo: run on SMC loopback-ism.

1. ipc-benchmark (see [3])

 - ./<foo> -c 1000000 -s 100

                            TCP                  SMC-lo
Message rate (msg/s)      84991                  151293(+78.01%)

2. sockperf

 - serv: <smc_run> sockperf sr --tcp
 - clnt: <smc_run> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30

                            TCP                  SMC-lo
Bandwidth(MBps)        5033.569                7987.732(+58.69%)
Latency(us)               5.986                   3.398(-43.23%)

3. nginx/wrk

 - serv: <smc_run> nginx
 - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80

                           TCP                   SMC-lo
Requests/s           187951.76                267107.90(+42.12%)

4. redis-benchmark

 - serv: <smc_run> redis-server
 - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024

                           TCP                   SMC-lo
GET(Requests/s)       86132.64                118133.49(+37.15%)
SET(Requests/s)       87374.40                122887.86(+40.65%)
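
The percentages in parentheses above are consistent with the relative change
of SMC-lo against the TCP baseline, (smc - tcp) / tcp * 100. A minimal sketch
for recomputing them follows; rel_change_pct() and close_to() are hypothetical
helper names, not part of any benchmark tool.

```c
#include <assert.h>

/* Relative change of measured against baseline, in percent. */
static double rel_change_pct(double baseline, double measured)
{
    assert(baseline != 0.0);
    return (measured - baseline) / baseline * 100.0;
}

/* Return 1 if a and b differ by at most tol (absolute difference). */
static int close_to(double a, double b, double tol)
{
    double d = a - b;

    if (d < 0)
        d = -d;
    return d <= tol;
}
```

For example, rel_change_pct(84991, 151293) gives the ipc-benchmark message
rate gain, and rel_change_pct(5.986, 3.398) gives the (negative) sockperf
latency change.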


Change log:
v7->v6
- Patch #2: minor: remove unnecessary 'return' of inline smc_loopback_exit().
- Patch #10: minor: directly return 0 instead of 'rc' in smcd_cdc_msg_send().
- all: collect the Reviewed-by tags.

v6->RFC v5
Link: https://lore.kernel.org/netdev/20240414040304.54255-1-guwen@linux.alibaba.com/
- Patch #2: make the use of CONFIG_SMC_LO cleaner.
- Patch #5: mark some smcd_ops that loopback-ism doesn't support as
  optional and check for the support when they are called.
- Patch #7: keep loopback-ism at the beginning of the SMC-D device list.
- Some expression changes in commit logs and comments.
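
The "optional ops" idea from Patch #5 can be sketched in plain C as follows.
This is only an illustration of the pattern (leave the function pointer NULL
and check it before calling), not the actual smcd_ops code: struct demo_ops,
struct demo_dev and all function names are hypothetical, and returning
-EOPNOTSUPP for an unsupported operation is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct demo_dev;

/* Hypothetical ops table: move_data is mandatory, add_vlan_id is optional. */
struct demo_ops {
    int (*move_data)(struct demo_dev *dev, const void *data, size_t len);
    int (*add_vlan_id)(struct demo_dev *dev, unsigned int vlan_id);
};

struct demo_dev {
    const struct demo_ops *ops;
    size_t bytes_moved;
};

static int demo_move_data(struct demo_dev *dev, const void *data, size_t len)
{
    (void)data;
    dev->bytes_moved += len;
    return 0;
}

static int demo_add_vlan_id(struct demo_dev *dev, unsigned int vlan_id)
{
    (void)dev;
    (void)vlan_id;
    return 0;
}

/* A software device implements only the mandatory op. */
static const struct demo_ops sw_ops = {
    .move_data = demo_move_data,
};

/* A hardware-like device implements the optional op as well. */
static const struct demo_ops hw_ops = {
    .move_data = demo_move_data,
    .add_vlan_id = demo_add_vlan_id,
};

/* Mandatory op: called unconditionally. */
static int dev_move_data(struct demo_dev *dev, const void *data, size_t len)
{
    assert(dev && dev->ops && dev->ops->move_data);
    return dev->ops->move_data(dev, data, len);
}

/* Optional op: check for support at the call site before dereferencing. */
static int dev_add_vlan_id(struct demo_dev *dev, unsigned int vlan_id)
{
    assert(dev && dev->ops);
    if (!dev->ops->add_vlan_id)
        return -EOPNOTSUPP;
    return dev->ops->add_vlan_id(dev, vlan_id);
}
```

With this layout a device that lacks an operation needs no stub function, and
callers get a uniform error code instead of a NULL pointer dereference.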

RFC v5->RFC v4:
Link: https://lore.kernel.org/netdev/20240324135522.108564-1-guwen@linux.alibaba.com/
- Patch #2: minor changes in description of config SMC_LO and comments.
- Patch #10: minor changes in comments and if(smc_ism_support_dmb_nocopy())
  check in smcd_cdc_msg_send().
- Patch #3: change smc_lo_generate_id() to smc_lo_generate_ids() and SMC_LO_CHID
  to SMC_LO_RESERVED_CHID.
- Patch #5: memcpy while holding the ldev->dmb_ht_lock.
- Some expression changes in commit logs.

RFC v4->v3:
Link: https://lore.kernel.org/netdev/20240317100545.96663-1-guwen@linux.alibaba.com/
- The merge window of v6.9 is open, so post this series as an RFC.
- Patch #6: since some information fed back by smc_nl_handle_smcd_dev() does
  not apply to Emulated-ISM (including loopback-ism here), loopback-ism is
  not exposed through smc netlink for the time being. We may refactor this
  part when the smc netlink interface is updated.

v3->v2:
Link: https://lore.kernel.org/netdev/20240312142743.41406-1-guwen@linux.alibaba.com/
- Patch #11: use tasklet_schedule(&conn->rx_tsklet) instead of smcd_cdc_rx_handler()
  to avoid possible recursive locking of conn->send_lock and use {read|write}_lock_bh()
  to acquire dmb_ht_lock.

v2->v1:
Link: https://lore.kernel.org/netdev/20240307095536.29648-1-guwen@linux.alibaba.com/
- All the patches: changed the term virtual-ISM to Emulated-ISM as defined by SMCv2.1.
- Patch #3: optimized the description of the SMC_LO config. Avoid exposing
  loopback-ism to sysfs and remove all the knobs until their future definition
  is clear.
- Patch #3: try to make lockdep happy by using read_lock_bh() in smc_lo_move_data().
- Patch #6: use physically contiguous DMB buffers by default.
- Patch #11: enable DMB no-copy for loopback-ism by default and free the DMB in
  unregister_dmb or detach_dmb when dmb_node->refcnt reaches 0, instead of using
  wait_event to keep waiting in unregister_dmb.
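
The refcount-based release described for Patch #11 can be sketched in
userspace C as below. This is an illustration under stated assumptions, not
the loopback-ism code: struct dmb_node here is a simplified hypothetical type,
C11 atomics stand in for the kernel's refcount helpers, and dmb_register(),
dmb_attach(), dmb_detach() and dmb_unregister() are illustrative names. The
point is that whichever of unregister or detach drops the last reference
frees the buffer, so unregister never has to block.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical, simplified DMB bookkeeping node. */
struct dmb_node {
    atomic_int refcnt;
    void *mem;
};

/* Register a DMB: the owner holds the initial reference. */
static struct dmb_node *dmb_register(size_t len)
{
    struct dmb_node *n = malloc(sizeof(*n));

    if (!n)
        return NULL;
    n->mem = calloc(1, len);
    if (!n->mem) {
        free(n);
        return NULL;
    }
    atomic_init(&n->refcnt, 1);
    return n;
}

/* A peer attaching its sndbuf takes an extra reference. */
static void dmb_attach(struct dmb_node *n)
{
    int old = atomic_fetch_add(&n->refcnt, 1);

    assert(old > 0);
    (void)old;
}

/* Drop one reference; free on the last one. Returns 1 if freed. */
static int dmb_put(struct dmb_node *n)
{
    if (atomic_fetch_sub(&n->refcnt, 1) != 1)
        return 0;
    free(n->mem);
    free(n);
    return 1;
}

/* Both paths just drop a reference; neither waits for the other. */
static int dmb_unregister(struct dmb_node *n) { return dmb_put(n); }
static int dmb_detach(struct dmb_node *n) { return dmb_put(n); }
```

The order of dmb_unregister() and dmb_detach() does not matter: the second
call, whichever it is, releases the memory.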

v1->RFC:
Link: https://lore.kernel.org/netdev/20240111120036.109903-1-guwen@linux.alibaba.com/
- Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics:
  /sys/devices/virtual/smc/loopback-ism/xfer_bytes
- Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports
  merging sndbuf with peer DMB.
- Patch #13 & #14: introduce loopback-ism device controls for the DMB memory
  type and for whether to merge sndbuf and DMB. They can be set respectively by:
  /sys/devices/virtual/smc/loopback-ism/dmb_type
  /sys/devices/virtual/smc/loopback-ism/dmb_copy
  The motivation for these two controls is that a performance bottleneck was
  found when a vzalloced DMB is used, sndbuf is merged with the DMB, there are
  many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused
  by lock contention on vmap_area_lock [5], which is involved in memcpy_from_msg()
  or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the
  vmap lock contention [6]. It has significant effects, but using virtual memory
  still has additional overhead compared to using physical memory.
  So this new version provides the dmb_type and dmb_copy controls to suit
  different scenarios.
- Some minor changes and comments improvements.

RFC->old version([1]):
Link: https://lore.kernel.org/netdev/1702214654-32069-1-git-send-email-guwen@linux.alibaba.com/
- Patch #1: improve the loopback-ism dump, it shows as follows now:
  # smcd d
  FID  Type  PCI-ID        PCHID  InUse  #LGs  PNET-ID
  0000 0     loopback-ism  ffff   No        0
- Patch #3: introduce the smc_ism_set_v2_capable() helper and set
  smc_ism_v2_capable when ISMv2 or virtual ISM is registered,
  regardless of whether there is already a device in smcd device list.
- Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/.
- Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active
  to activate or deactivate the loopback-ism.
- Patch #9: introduce the statistics of loopback-ism by
  /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_bytes|dmbs_cnt}.
- Some minor changes and comments improvements.

[1] https://lore.kernel.org/netdev/1695568613-125057-1-git-send-email-guwen@linux.alibaba.com/
[2] https://lore.kernel.org/netdev/20231219142616.80697-1-guwen@linux.alibaba.com/
[3] https://github.com/goldsborough/ipc-bench
[4] https://lore.kernel.org/all/3189e342-c38f-6076-b730-19a6efd732a5@linux.alibaba.com/
[5] https://lore.kernel.org/all/238e63cd-e0e8-4fbf-852f-bc4d5bc35d5a@linux.alibaba.com/
[6] https://lore.kernel.org/all/20240102184633.748113-1-urezki@gmail.com/


Wen Gu (11):
  net/smc: decouple ism_client from SMC-D DMB registration
  net/smc: introduce loopback-ism for SMC intra-OS shortcut
  net/smc: implement ID-related operations of loopback-ism
  net/smc: implement DMB-related operations of loopback-ism
  net/smc: mark optional smcd_ops and check for support when called
  net/smc: ignore loopback-ism when dumping SMC-D devices
  net/smc: register loopback-ism into SMC-D device list
  net/smc: add operations to merge sndbuf with peer DMB
  net/smc: {at|de}tach sndbuf to peer DMB if supported
  net/smc: adapt cursor update when sndbuf and peer DMB are merged
  net/smc: implement DMB-merged operations of loopback-ism

 drivers/s390/net/ism_drv.c |   2 +-
 include/net/smc.h          |  21 +-
 net/smc/Kconfig            |  13 ++
 net/smc/Makefile           |   1 +
 net/smc/af_smc.c           |  28 ++-
 net/smc/smc_cdc.c          |  36 +++-
 net/smc/smc_core.c         |  61 +++++-
 net/smc/smc_core.h         |   1 +
 net/smc/smc_ism.c          |  88 ++++++--
 net/smc/smc_ism.h          |  10 +
 net/smc/smc_loopback.c     | 427 +++++++++++++++++++++++++++++++++++++
 net/smc/smc_loopback.h     |  61 ++++++
 12 files changed, 721 insertions(+), 28 deletions(-)
 create mode 100644 net/smc/smc_loopback.c
 create mode 100644 net/smc/smc_loopback.h

Comments

Cong Wang April 28, 2024, 3:49 p.m. UTC | #1
On Sun, Apr 28, 2024 at 02:07:27PM +0800, Wen Gu wrote:
> This patch set acts as the second part of the new version of [1] (The first
> part can be referred from [2]), the updated things of this version are listed
> at the end.
> 
> - Background
> 
> SMC-D is now used in IBM z with ISM function to optimize network interconnect
> for intra-CPC communications. Inspired by this, we try to make SMC-D available
> on the non-s390 architecture through a software-implemented Emulated-ISM device,
> that is the loopback-ism device here, to accelerate inter-process or
> inter-containers communication within the same OS instance.

Just FYI:

Cilium has implemented this kind of shortcut with sockmap and sockops.
In fact, for intra-OS case, it is _very_ simple. The core code is less
than 50 lines. Please take a look here:
https://github.com/cilium/cilium/blob/v1.11.4/bpf/sockops/bpf_sockops.c

Like I mentioned in my LSF/MM/BPF proposal, we plan to implement
similar eBPF things for the inter-OS (aka VM) case.

More importantly, even LD_PRELOAD is not needed for this eBPF approach.
:)

Thanks.
patchwork-bot+netdevbpf@kernel.org April 30, 2024, 11:40 a.m. UTC | #2
Hello:

This series was applied to netdev/net-next.git (main)
by Paolo Abeni <pabeni@redhat.com>:

On Sun, 28 Apr 2024 14:07:27 +0800 you wrote:
> This patch set acts as the second part of the new version of [1] (The first
> part can be referred from [2]), the updated things of this version are listed
> at the end.
> 
> - Background
> 
> SMC-D is now used in IBM z with ISM function to optimize network interconnect
> for intra-CPC communications. Inspired by this, we try to make SMC-D available
> on the non-s390 architecture through a software-implemented Emulated-ISM device,
> that is the loopback-ism device here, to accelerate inter-process or
> inter-containers communication within the same OS instance.
> 
> [...]

Here is the summary with links:
  - [net-next,v7,01/11] net/smc: decouple ism_client from SMC-D DMB registration
    https://git.kernel.org/netdev/net-next/c/784c46f5467c
  - [net-next,v7,02/11] net/smc: introduce loopback-ism for SMC intra-OS shortcut
    https://git.kernel.org/netdev/net-next/c/46ac64419ded
  - [net-next,v7,03/11] net/smc: implement ID-related operations of loopback-ism
    https://git.kernel.org/netdev/net-next/c/45783ee85bf3
  - [net-next,v7,04/11] net/smc: implement DMB-related operations of loopback-ism
    https://git.kernel.org/netdev/net-next/c/f7a22071dbf3
  - [net-next,v7,05/11] net/smc: mark optional smcd_ops and check for support when called
    https://git.kernel.org/netdev/net-next/c/d1d8d0b6c7c6
  - [net-next,v7,06/11] net/smc: ignore loopback-ism when dumping SMC-D devices
    https://git.kernel.org/netdev/net-next/c/c8df2d449f64
  - [net-next,v7,07/11] net/smc: register loopback-ism into SMC-D device list
    https://git.kernel.org/netdev/net-next/c/04791343d858
  - [net-next,v7,08/11] net/smc: add operations to merge sndbuf with peer DMB
    https://git.kernel.org/netdev/net-next/c/439888826858
  - [net-next,v7,09/11] net/smc: {at|de}tach sndbuf to peer DMB if supported
    https://git.kernel.org/netdev/net-next/c/ae2be35cbed2
  - [net-next,v7,10/11] net/smc: adapt cursor update when sndbuf and peer DMB are merged
    https://git.kernel.org/netdev/net-next/c/cc0ab806fc52
  - [net-next,v7,11/11] net/smc: implement DMB-merged operations of loopback-ism
    https://git.kernel.org/netdev/net-next/c/c3a910f2380f

You are awesome, thank you!
Wen Gu May 7, 2024, 2:34 p.m. UTC | #3
On 2024/4/28 23:49, Cong Wang wrote:
> On Sun, Apr 28, 2024 at 02:07:27PM +0800, Wen Gu wrote:
>> This patch set acts as the second part of the new version of [1] (The first
>> part can be referred from [2]), the updated things of this version are listed
>> at the end.
>>
>> - Background
>>
>> SMC-D is now used in IBM z with ISM function to optimize network interconnect
>> for intra-CPC communications. Inspired by this, we try to make SMC-D available
>> on the non-s390 architecture through a software-implemented Emulated-ISM device,
>> that is the loopback-ism device here, to accelerate inter-process or
>> inter-containers communication within the same OS instance.
> 
> Just FYI:
> 
> Cilium has implemented this kind of shortcut with sockmap and sockops.
> In fact, for intra-OS case, it is _very_ simple. The core code is less
> than 50 lines. Please take a look here:
> https://github.com/cilium/cilium/blob/v1.11.4/bpf/sockops/bpf_sockops.c
> 
> Like I mentioned in my LSF/MM/BPF proposal, we plan to implement
> similiar eBPF things for inter-OS (aka VM) case.
> 
> More importantly, even LD_PRELOAD is not needed for this eBPF approach.
> :)
> 
> Thanks.

Hi, Cong. Thank you very much for the information. I have looked at sockmap
before, and from my perspective smcd loopback and sockmap each have their own
pros and cons.

The advantage of smcd loopback is that it uses a standard process defined by
RFC 7609 for negotiation. The CLC handshake helps smc correctly determine
whether the TCP connection should be upgraded no matter what middleware the
connection passes through, e.g. NAT. So we don't need to spend extra effort
checking whether the connection should be shortcut, unlike the various policy
checks done by bpf_sock_ops_ipv4() in sockmap. And since the handshake
automatically selects different underlay devices for different scenarios
(loopback-ism for intra-OS, ISM for inter-VM on IBM Z and RDMA for inter-VM
across different hosts), various scenarios can be covered by one smc protocol
stack.

The disadvantage of smcd loopback is also related to the CLC handshake: the
extra handshake round may cause smc to perform worse than TCP in short-lived
connection scenarios. So we mainly use smc upgrade in long-lived connection
scenarios and are exploring IPPROTO_SMC[1] to provide lossless fallback in
adverse cases.

And we are also working on upgrade methods other than LD_PRELOAD, e.g. using
an eBPF hook[2] with IPPROTO_SMC, to improve usability.
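
For reference, an application can also opt in to SMC explicitly, without
LD_PRELOAD, by creating an AF_SMC socket itself. A minimal sketch, assuming a
Linux system: open_stream_socket() is a hypothetical helper, AF_SMC is defined
locally in case the libc headers lack it, and the fallback to AF_INET here is
done by the application when the socket cannot be created at all (for example
when the smc module is not available). It is separate from the CLC-based
fallback to TCP discussed above.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef AF_SMC
#define AF_SMC 43   /* address family number used by Linux for SMC */
#endif

/*
 * Try to create an SMC stream socket and fall back to plain TCP if the
 * kernel refuses. On success returns the fd and stores the family that
 * was used in *family_used; on failure returns -1 and stores -1.
 */
static int open_stream_socket(int *family_used)
{
    int fd;

    /* Protocol 0 selects SMC over an IPv4 TCP connection. */
    fd = socket(AF_SMC, SOCK_STREAM, 0);
    if (fd >= 0) {
        *family_used = AF_SMC;
        return fd;
    }

    fd = socket(AF_INET, SOCK_STREAM, 0);
    *family_used = (fd >= 0) ? AF_INET : -1;
    return fd;
}
```

The returned descriptor is used with the normal connect()/bind()/listen()
calls and struct sockaddr_in addresses, so the rest of the application code
stays unchanged.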

[1] https://lore.kernel.org/netdev/1708412505-34470-1-git-send-email-alibuda@linux.alibaba.com/
[2] https://lore.kernel.org/all/ac84be00f97072a46f8a72b4e2be46cbb7fa5053.1692147782.git.geliang.tang@suse.com/

Thanks!
Cong Wang May 7, 2024, 4:10 p.m. UTC | #4
On Tue, May 07, 2024 at 10:34:09PM +0800, Wen Gu wrote:
> 
> 
> On 2024/4/28 23:49, Cong Wang wrote:
> > On Sun, Apr 28, 2024 at 02:07:27PM +0800, Wen Gu wrote:
> > > This patch set acts as the second part of the new version of [1] (The first
> > > part can be referred from [2]), the updated things of this version are listed
> > > at the end.
> > > 
> > > - Background
> > > 
> > > SMC-D is now used in IBM z with ISM function to optimize network interconnect
> > > for intra-CPC communications. Inspired by this, we try to make SMC-D available
> > > on the non-s390 architecture through a software-implemented Emulated-ISM device,
> > > that is the loopback-ism device here, to accelerate inter-process or
> > > inter-containers communication within the same OS instance.
> > 
> > Just FYI:
> > 
> > Cilium has implemented this kind of shortcut with sockmap and sockops.
> > In fact, for intra-OS case, it is _very_ simple. The core code is less
> > than 50 lines. Please take a look here:
> > https://github.com/cilium/cilium/blob/v1.11.4/bpf/sockops/bpf_sockops.c
> > 
> > Like I mentioned in my LSF/MM/BPF proposal, we plan to implement
> > similiar eBPF things for inter-OS (aka VM) case.
> > 
> > More importantly, even LD_PRELOAD is not needed for this eBPF approach.
> > :)
> > 
> > Thanks.
> 
> Hi, Cong. Thank you very much for the information. I learned about sockmap
> before and from my perspective smcd loopback and sockmap each have their own
> pros and cons.
> 
> The pros of smcd loopback is that it uses a standard process that defined
> by RFC-7609 for negotiation, this CLC handshake helps smc correctly determine
> whether the tcp connection should be upgraded no matter what middleware the
> connection passes, e.g. through NAT. So we don't need to pay extra effort to
> check whether the connection should be shortcut, unlike checking various policy
> by bpf_sock_ops_ipv4() in sockmap. And since the handshake automatically select
> different underlay devices for different scenarios (loopback-ism in intra-OS,
> ISM in inter-VM of IBM z and RDMA in inter-VM of different hosts), various
> scenarios can be covered through one smc protocol stack.
> 
> The cons of smcd loopback is also related to the CLC handshake, one more round
> handshake may cause smc to perform worse than TCP in short-lived connection
> scenarios. So we basically use smc upgrade in long-lived connection scenarios
> and are exploring IPPROTO_SMC[1] to provide lossless fallback under adverse cases.

You don't have to bother with RFCs, since you could define your own TCP
options. And the eBPF approach could also use TCP options whenever
needed. Cilium probably does not use them only because the intra-OS case
is too simple to bother with TCP options, as everything can be shared via a
shared sockmap.

In reality, the setup is not that complex. In many cases we already know
whether we have VM or container (or mixed) setup before we develop (as
a part of requirement gathering). And they rarely change.

Taking one step back, the discovery of VM or container or loopback cases
could be done via TCP options too, to deal with complex cases like
KataContainer. There is no reason to bother with RFCs, except maybe in the
RDMA case.

In fact, this is an advantage to me. We don't need to argue with anyone
on our own TCP option or eBPF code, we don't even have to share our own
eBPF code here.

> 
> And we are also working on other upgrade ways than LD_PRELOAD, e.g. using eBPF
> hook[2] with IPPROTO_SMC, to enhance the usability.

That is wrong IMHO, because basically it just overwrites kernel modules
with eBPF, not how eBPF is supposed to be used. IOW, you could not use
it at all without SMC/MPTCP modules.

BTW, this approach does not work for kernel sockets, because you only
hook __sys_socket().

Of course, sockmap and sockops could be used independently for
any other purposes. I hope you can now see the flexibility of eBPF
over kernel modules.

Thanks.
Wen Gu May 8, 2024, 3:48 a.m. UTC | #5
On 2024/5/8 00:10, Cong Wang wrote:
> On Tue, May 07, 2024 at 10:34:09PM +0800, Wen Gu wrote:
>>
>>
>> On 2024/4/28 23:49, Cong Wang wrote:
>>> On Sun, Apr 28, 2024 at 02:07:27PM +0800, Wen Gu wrote:
>>>> This patch set acts as the second part of the new version of [1] (The first
>>>> part can be referred from [2]), the updated things of this version are listed
>>>> at the end.
>>>>
>>>> - Background
>>>>
>>>> SMC-D is now used in IBM z with ISM function to optimize network interconnect
>>>> for intra-CPC communications. Inspired by this, we try to make SMC-D available
>>>> on the non-s390 architecture through a software-implemented Emulated-ISM device,
>>>> that is the loopback-ism device here, to accelerate inter-process or
>>>> inter-containers communication within the same OS instance.
>>>
>>> Just FYI:
>>>
>>> Cilium has implemented this kind of shortcut with sockmap and sockops.
>>> In fact, for intra-OS case, it is _very_ simple. The core code is less
>>> than 50 lines. Please take a look here:
>>> https://github.com/cilium/cilium/blob/v1.11.4/bpf/sockops/bpf_sockops.c
>>>
>>> Like I mentioned in my LSF/MM/BPF proposal, we plan to implement
>>> similiar eBPF things for inter-OS (aka VM) case.
>>>
>>> More importantly, even LD_PRELOAD is not needed for this eBPF approach.
>>> :)
>>>
>>> Thanks.
>>
>> Hi, Cong. Thank you very much for the information. I learned about sockmap
>> before and from my perspective smcd loopback and sockmap each have their own
>> pros and cons.
>>
>> The pros of smcd loopback is that it uses a standard process that defined
>> by RFC-7609 for negotiation, this CLC handshake helps smc correctly determine
>> whether the tcp connection should be upgraded no matter what middleware the
>> connection passes, e.g. through NAT. So we don't need to pay extra effort to
>> check whether the connection should be shortcut, unlike checking various policy
>> by bpf_sock_ops_ipv4() in sockmap. And since the handshake automatically select
>> different underlay devices for different scenarios (loopback-ism in intra-OS,
>> ISM in inter-VM of IBM z and RDMA in inter-VM of different hosts), various
>> scenarios can be covered through one smc protocol stack.
>>
>> The cons of smcd loopback is also related to the CLC handshake, one more round
>> handshake may cause smc to perform worse than TCP in short-lived connection
>> scenarios. So we basically use smc upgrade in long-lived connection scenarios
>> and are exploring IPPROTO_SMC[1] to provide lossless fallback under adverse cases.
> 
> You don't have to bother RFC's, since you could define your own TCP
> options. And, the eBPF approach could also use TCP options whenver
> needed. Cilium probably does not use them only because for intra-OS case
> it is too simple to bother TCP options, as everything can be shared via a
> shared socketmap.
> 
> In reality, the setup is not that complex. In many cases we already know
> whether we have VM or container (or mixed) setup before we develop (as
> a part of requirement gathering). And they rarely change.
> 
> Taking one step back, the discovery of VM or container or loopback cases
> could be done via TCP options too, to deal with complex cases like
> KataContainer. There is no reason to bother RFC's, maybe except the RDMA
> case.
> 
> In fact, this is an advantage to me. We don't need to argue with anyone
> on our own TCP option or eBPF code, we don't even have to share our own
> eBPF code here.
> 

A private TCP option could become a historical burden and a compatibility risk,
so IMHO it doesn't work, at least for SMC. Besides, the smc handshake process
I mentioned is not just about the TCP option; that is only the first step.
There is also a 3-way CLC handshake to choose the right underlay for different
cases or safely fall back to TCP, so users don't have to care. Lastly, the
process was designed for different cases, and smcd loopback is one part of the
whole picture. So the simplicity of the intra-OS case does not mean the existing
handshake is meaningless, nor that smcd loopback should give up the existing
process and follow sockmap.

>>
>> And we are also working on other upgrade ways than LD_PRELOAD, e.g. using eBPF
>> hook[2] with IPPROTO_SMC, to enhance the usability.
> 
> That is wrong IMHO, because basically it just overwrites kernel modules
> with eBPF, not how eBPF is supposed to be used. IOW, you could not use
> it at all without SMC/MPTCP modules.
> 
Yes, it is expected to be used with the SMC/MPTCP modules.

> BTW, this approach does not work for kernel sockets, because you only
> hook __sys_socket().
> 
In fact the purpose of this is mainly to transparently upgrade applications'
TCP sockets, so kernel sockets are not the target.

> Of course, for sockmap or sockops, they could be used independently for
> any other purposes. I hope now you could see the flexiblities of eBPF
> over kernel modules.
> 
Yes, I agree with the advantages of the eBPF way, like the flexibility you
mentioned. As I said above, from my perspective they both have their own pros
and cons.

> Thanks.

Thanks!
Tony Lu May 8, 2024, 6:39 a.m. UTC | #6
On Tue, May 07, 2024 at 09:10:41AM -0700, Cong Wang wrote:
> On Tue, May 07, 2024 at 10:34:09PM +0800, Wen Gu wrote:
> > 
> > 
> > On 2024/4/28 23:49, Cong Wang wrote:
> > > On Sun, Apr 28, 2024 at 02:07:27PM +0800, Wen Gu wrote:
> > > > This patch set acts as the second part of the new version of [1] (The first
> > > > part can be referred from [2]), the updated things of this version are listed
> > > > at the end.
> > > > 
> > > > - Background
> > > > 
> > > > SMC-D is now used in IBM z with ISM function to optimize network interconnect
> > > > for intra-CPC communications. Inspired by this, we try to make SMC-D available
> > > > on the non-s390 architecture through a software-implemented Emulated-ISM device,
> > > > that is the loopback-ism device here, to accelerate inter-process or
> > > > inter-containers communication within the same OS instance.
> > > 
> > > Just FYI:
> > > 
> > > Cilium has implemented this kind of shortcut with sockmap and sockops.
> > > In fact, for intra-OS case, it is _very_ simple. The core code is less
> > > than 50 lines. Please take a look here:
> > > https://github.com/cilium/cilium/blob/v1.11.4/bpf/sockops/bpf_sockops.c
> > > 
> > > Like I mentioned in my LSF/MM/BPF proposal, we plan to implement
> > > similiar eBPF things for inter-OS (aka VM) case.
> > > 
> > > More importantly, even LD_PRELOAD is not needed for this eBPF approach.
> > > :)
> > > 
> > > Thanks.
> > 
> > Hi, Cong. Thank you very much for the information. I learned about sockmap
> > before and from my perspective smcd loopback and sockmap each have their own
> > pros and cons.
> > 
> > The pros of smcd loopback is that it uses a standard process that defined
> > by RFC-7609 for negotiation, this CLC handshake helps smc correctly determine
> > whether the tcp connection should be upgraded no matter what middleware the
> > connection passes, e.g. through NAT. So we don't need to pay extra effort to
> > check whether the connection should be shortcut, unlike checking various policy
> > by bpf_sock_ops_ipv4() in sockmap. And since the handshake automatically select
> > different underlay devices for different scenarios (loopback-ism in intra-OS,
> > ISM in inter-VM of IBM z and RDMA in inter-VM of different hosts), various
> > scenarios can be covered through one smc protocol stack.
> > 
> > The cons of smcd loopback is also related to the CLC handshake, one more round
> > handshake may cause smc to perform worse than TCP in short-lived connection
> > scenarios. So we basically use smc upgrade in long-lived connection scenarios
> > and are exploring IPPROTO_SMC[1] to provide lossless fallback under adverse cases.
> 
> You don't have to bother RFC's, since you could define your own TCP
> options. And, the eBPF approach could also use TCP options whenver
> needed. Cilium probably does not use them only because for intra-OS case
> it is too simple to bother TCP options, as everything can be shared via a
> shared socketmap.

You can define and use any private TCP options, but that is not the right
way for an inter-host network protocol, especially across different subnets,
architectures or OSes (Linux, z/OS and so on). As is essential for
communication between any two parties, everyone needs to abide by the same
standards. I also have to admit that SMC is a standard protocol, and we need to
extend the protocol spec through standard and appropriate methods, such as an
IETF RFC and protocol white paper. If it is only a temporary acceleration
solution used on a small scale, this restriction is not required.

> 
> In reality, the setup is not that complex. In many cases we already know
> whether we have VM or container (or mixed) setup before we develop (as
> a part of requirement gathering). And they rarely change.

To run SMC in today's cloud infrastructure, the TCP handshake is not the most
efficient but the most practical way, because we can't touch the infrastructure
under the VM and don't even know what kind of environment it is.

> 
> Taking one step back, the discovery of VM or container or loopback cases
> could be done via TCP options too, to deal with complex cases like
> KataContainer. There is no reason to bother RFC's, maybe except the RDMA
> case.
> 
> In fact, this is an advantage to me. We don't need to argue with anyone
> on our own TCP option or eBPF code, we don't even have to share our own
> eBPF code here.

Actually I am looking forward to learning about the whole eBPF approach. Both
SMC and eBPF are trying to solve this issue from their own point of view. I
don't think one must be replaced by the other for now. Defining these
inter/intra-OS/host scenarios and attracting interest is more important, I
think, than seeking a one-size-fits-all solution at present.

> 
> > 
> > And we are also working on other upgrade ways than LD_PRELOAD, e.g. using eBPF
> > hook[2] with IPPROTO_SMC, to enhance the usability.
> 
> That is wrong IMHO, because basically it just overwrites kernel modules
> with eBPF, not how eBPF is supposed to be used. IOW, you could not use
> it at all without SMC/MPTCP modules.

The eBPF hooks are considered a part of the SMC module. They should not
be used outside of the SMC module, for now at least.

> 
> BTW, this approach does not work for kernel sockets, because you only
> hook __sys_socket().

Yep, SMC should not replace kernel sockets for now. In our scenario,
almost all applications suitable for SMC run only in user space.

Thanks,
Tony Lu

> 
> Of course, for sockmap or sockops, they could be used independently for
> any other purposes. I hope now you could see the flexiblities of eBPF
> over kernel modules.
> 
> Thanks.