mbox series

[net-next,00/15] net/smc: implement loopback-ism used by SMC-D

Message ID 20240111120036.109903-1-guwen@linux.alibaba.com (mailing list archive)
Headers show
Series net/smc: implement loopback-ism used by SMC-D | expand

Message

Wen Gu Jan. 11, 2024, noon UTC
This patch set acts as the second part of the new version of [1] (The first
part can be referred from [2]), the updated things of this version are listed
at the end.

# Background

SMC-D is now used in IBM z with ISM function to optimize network interconnect
for intra-CPC communications. Inspired by this, we try to make SMC-D available
on the non-s390 architecture through a software-implemented virtual ISM device,
that is the loopback-ism device here, to accelerate inter-process or
inter-containers communication within the same OS instance.

# Design

This patch set includes 3 parts:

 - Patch #1-#2: some prepare work for loopback-ism.
 - Patch #3-#9: implement loopback-ism device.
 - Patch #10-#15: memory copy optimization for loopback scenario.

The loopback-ism device is designed as a ISMv2 device and not be limited to
a specific net namespace, ends of both inter-process connection (1/1' in diagram
below) or inter-container connection (2/2' in diagram below) can find the same
available loopback-ism and choose it during the CLC handshake.

 Container 1 (ns1)                              Container 2 (ns2)
 +-----------------------------------------+    +-------------------------+
 | +-------+      +-------+      +-------+ |    |        +-------+        |
 | | App A |      | App B |      | App C | |    |        | App D |<-+     |
 | +-------+      +---^---+      +-------+ |    |        +-------+  |(2') |
 |     |127.0.0.1 (1')|             |192.168.0.11       192.168.0.12|     |
 |  (1)|   +--------+ | +--------+  |(2)   |    | +--------+   +--------+ |
 |     `-->|   lo   |-` |  eth0  |<-`      |    | |   lo   |   |  eth0  | |
 +---------+--|---^-+---+-----|--+---------+    +-+--------+---+-^------+-+
              |   |           |                                  |
 Kernel       |   |           |                                  |
 +----+-------v---+-----------v----------------------------------+---+----+
 |    |                            TCP                               |    |
 |    |                                                              |    |
 |    +--------------------------------------------------------------+    |
 |                                                                        |
 |                           +--------------+                             |
 |                           | smc loopback |                             |
 +---------------------------+--------------+-----------------------------+

loopback-ism device creates DMBs (shared memory) for each connection peer.
Since data transfer occurs within the same kernel, the sndbuf of each peer
is only a descriptor and point to the same memory region as peer DMB, so that
the data copy from sndbuf to peer DMB can be avoided in loopback-ism case.

 Container 1 (ns1)                              Container 2 (ns2)
 +-----------------------------------------+    +-------------------------+
 | +-------+                               |    |        +-------+        |
 | | App C |-----+                         |    |        | App D |        |
 | +-------+     |                         |    |        +-^-----+        |
 |               |                         |    |          |              |
 |           (2) |                         |    |     (2') |              |
 |               |                         |    |          |              |
 +---------------|-------------------------+    +----------|--------------+
                 |                                         |
 Kernel          |                                         |
 +---------------|-----------------------------------------|--------------+
 | +--------+ +--v-----+                           +--------+ +--------+  |
 | |dmb_desc| |snd_desc|                           |dmb_desc| |snd_desc|  |
 | +-----|--+ +--|-----+                           +-----|--+ +--------+  |
 | +-----|--+    |                                 +-----|--+             |
 | | DMB C  |    +---------------------------------| DMB D  |             |
 | +--------+                                      +--------+             |
 |                                                                        |
 |                           +--------------+                             |
 |                           | smc loopback |                             |
 +---------------------------+--------------+-----------------------------+

# Benchmark Test

 * Test environments:
      - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
      - SMC sndbuf/DMB size 1MB.
      - /sys/devices/virtual/smc/loopback-ism/dmb_copy is set to default 0,
        which means sndbuf and DMB are merged and no data copied between them.
      - /sys/devices/virtual/smc/loopback-ism/dmb_type is set to default 0,
        which means DMB is physically contiguous buffer.

 * Test object:
      - TCP: run on TCP loopback.
      - SMC lo: run on SMC loopback device.

1. ipc-benchmark (see [3])

 - ./<foo> -c 1000000 -s 100

                            TCP                  SMC-lo
Message
rate (msg/s)              80636                  149515(+85.42%)

2. sockperf

 - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
 - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30

                            TCP                  SMC-lo
Bandwidth(MBps)         4909.36                 8197.57(+66.98%)
Latency(us)               6.098                   3.383(-44.52%)

3. nginx/wrk

 - serv: <smc_run> nginx
 - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80

                           TCP                   SMC-lo
Requests/s           181685.74                246447.77(+35.65%)

4. redis-benchmark

 - serv: <smc_run> redis-server
 - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024

                           TCP                   SMC-lo
GET(Requests/s)       85855.34                118553.64(+38.09%)
SET(Requests/s)       86824.40                125944.58(+45.06%)


Change log:

v1->RFC:
- Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics:
  /sys/devices/virtual/smc/loopback-ism/xfer_bytes
- Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports
  merging sndbuf with peer DMB.
- Patch #13 & #14: introduce loopback-ism device control of DMB memory type and
  control of whether to merge sndbuf and DMB. They can be respectively set by:
  /sys/devices/virtual/smc/loopback-ism/dmb_type
  /sys/devices/virtual/smc/loopback-ism/dmb_copy
  The motivation for these two control is that a performance bottleneck was
  found when using vzalloced DMB and sndbuf is merged with DMB, and there are
  many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused
  by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg()
  or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the
  vmap lock contention [6]. It has significant effects, but using virtual memory
  still has additional overhead compared to using physical memory.
  So this new version provides controls of dmb_type and dmb_copy to suit
  different scenarios.
- Some minor changes and comments improvements.

RFC->old version([1]):
Link: https://lore.kernel.org/netdev/1702214654-32069-1-git-send-email-guwen@linux.alibaba.com/
- Patch #1: improve the loopback-ism dump, it shows as follows now:
  # smcd d
  FID  Type  PCI-ID        PCHID  InUse  #LGs  PNET-ID
  0000 0     loopback-ism  ffff   No        0
- Patch #3: introduce the smc_ism_set_v2_capable() helper and set
  smc_ism_v2_capable when ISMv2 or virtual ISM is registered,
  regardless of whether there is already a device in smcd device list.
- Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/.
- Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active
  to activate or deactivate the loopback-ism.
- Patch #9: introduce the statistics of loopback-ism by
  /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_tytes|dmbs_cnt}.
- Some minor changes and comments improvements.

[1] https://lore.kernel.org/netdev/1695568613-125057-1-git-send-email-guwen@linux.alibaba.com/
[2] https://lore.kernel.org/netdev/20231219142616.80697-1-guwen@linux.alibaba.com/
[3] https://github.com/goldsborough/ipc-bench
[4] https://lore.kernel.org/all/3189e342-c38f-6076-b730-19a6efd732a5@linux.alibaba.com/
[5] https://lore.kernel.org/all/238e63cd-e0e8-4fbf-852f-bc4d5bc35d5a@linux.alibaba.com/
[6] https://lore.kernel.org/all/20240102184633.748113-1-urezki@gmail.com/

Wen Gu (15):
  net/smc: improve SMC-D device dump for virtual ISM
  net/smc: decouple specialized struct from SMC-D DMB registration
  net/smc: introduce virtual ISM device loopback-ism
  net/smc: implement ID-related operations of loopback-ism
  net/smc: implement some unsupported operations of loopback-ism
  net/smc: implement DMB-related operations of loopback-ism
  net/smc: register loopback-ism into SMC-D device list
  net/smc: introduce loopback-ism runtime switch
  net/smc: introduce loopback-ism statistics attributes
  net/smc: add operations to merge sndbuf with peer DMB
  net/smc: attach or detach ghost sndbuf to peer DMB
  net/smc: adapt cursor update when sndbuf and peer DMB are merged
  net/smc: introduce loopback-ism DMB type control
  net/smc: introduce loopback-ism DMB data copy control
  net/smc: implement DMB-merged operations of loopback-ism

 drivers/s390/net/ism_drv.c |   2 +-
 include/net/smc.h          |   7 +-
 net/smc/Kconfig            |  13 +
 net/smc/Makefile           |   2 +-
 net/smc/af_smc.c           |  28 +-
 net/smc/smc_cdc.c          |  58 ++-
 net/smc/smc_cdc.h          |   1 +
 net/smc/smc_core.c         |  61 +++-
 net/smc/smc_core.h         |   1 +
 net/smc/smc_ism.c          |  71 +++-
 net/smc/smc_ism.h          |   5 +
 net/smc/smc_loopback.c     | 718 +++++++++++++++++++++++++++++++++++++
 net/smc/smc_loopback.h     |  88 +++++
 13 files changed, 1026 insertions(+), 29 deletions(-)
 create mode 100644 net/smc/smc_loopback.c
 create mode 100644 net/smc/smc_loopback.h

Comments

Simon Horman Jan. 11, 2024, 1:36 p.m. UTC | #1
On Thu, Jan 11, 2024 at 08:00:21PM +0800, Wen Gu wrote:
> This patch set acts as the second part of the new version of [1] (The
> first part can be referred from [2]), the updated things of this version
> are listed at the end.

...

Hi Wen Gu,

unfortunately net-next is currently closed.

[adapted from text by Jakub]

## Form letter - net-next-closed

The merge window for v6.8 has begun and therefore net-next is closed
for new drivers, features, code refactoring and optimizations.
We are currently accepting bug fixes only.

Please repost when net-next reopens on or after 21st January.

RFC patches sent for review only are obviously welcome at any time.

See: https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#development-cycle
--
pw-bot: defer
Jiri Pirko Jan. 11, 2024, 2:50 p.m. UTC | #2
Thu, Jan 11, 2024 at 01:00:21PM CET, guwen@linux.alibaba.com wrote:
>This patch set acts as the second part of the new version of [1] (The first
>part can be referred from [2]), the updated things of this version are listed
>at the end.
>
># Background
>
>SMC-D is now used in IBM z with ISM function to optimize network interconnect
>for intra-CPC communications. Inspired by this, we try to make SMC-D available

Care to provide more details about what ISM and intra-CPC is and what it
it good for?


>on the non-s390 architecture through a software-implemented virtual ISM device,
>that is the loopback-ism device here, to accelerate inter-process or

I see no such device. Is it a netdevice?

If it is "software-implemented", why is it part of smc driver and not
separate soft-device driver? If there is some smc specific code, I guess
there should be some level of separation. Can't this be implemented by
other devices too?



>inter-containers communication within the same OS instance.
>
># Design
>
>This patch set includes 3 parts:
>
> - Patch #1-#2: some prepare work for loopback-ism.
> - Patch #3-#9: implement loopback-ism device.
> - Patch #10-#15: memory copy optimization for loopback scenario.
>
>The loopback-ism device is designed as a ISMv2 device and not be limited to
>a specific net namespace, ends of both inter-process connection (1/1' in diagram
>below) or inter-container connection (2/2' in diagram below) can find the same
>available loopback-ism and choose it during the CLC handshake.
>
> Container 1 (ns1)                              Container 2 (ns2)
> +-----------------------------------------+    +-------------------------+
> | +-------+      +-------+      +-------+ |    |        +-------+        |
> | | App A |      | App B |      | App C | |    |        | App D |<-+     |
> | +-------+      +---^---+      +-------+ |    |        +-------+  |(2') |
> |     |127.0.0.1 (1')|             |192.168.0.11       192.168.0.12|     |
> |  (1)|   +--------+ | +--------+  |(2)   |    | +--------+   +--------+ |
> |     `-->|   lo   |-` |  eth0  |<-`      |    | |   lo   |   |  eth0  | |
> +---------+--|---^-+---+-----|--+---------+    +-+--------+---+-^------+-+
>              |   |           |                                  |
> Kernel       |   |           |                                  |
> +----+-------v---+-----------v----------------------------------+---+----+
> |    |                            TCP                               |    |
> |    |                                                              |    |
> |    +--------------------------------------------------------------+    |
> |                                                                        |
> |                           +--------------+                             |
> |                           | smc loopback |                             |
> +---------------------------+--------------+-----------------------------+
>
>loopback-ism device creates DMBs (shared memory) for each connection peer.
>Since data transfer occurs within the same kernel, the sndbuf of each peer
>is only a descriptor and point to the same memory region as peer DMB, so that
>the data copy from sndbuf to peer DMB can be avoided in loopback-ism case.
>
> Container 1 (ns1)                              Container 2 (ns2)
> +-----------------------------------------+    +-------------------------+
> | +-------+                               |    |        +-------+        |
> | | App C |-----+                         |    |        | App D |        |
> | +-------+     |                         |    |        +-^-----+        |
> |               |                         |    |          |              |
> |           (2) |                         |    |     (2') |              |
> |               |                         |    |          |              |
> +---------------|-------------------------+    +----------|--------------+
>                 |                                         |
> Kernel          |                                         |
> +---------------|-----------------------------------------|--------------+
> | +--------+ +--v-----+                           +--------+ +--------+  |
> | |dmb_desc| |snd_desc|                           |dmb_desc| |snd_desc|  |
> | +-----|--+ +--|-----+                           +-----|--+ +--------+  |
> | +-----|--+    |                                 +-----|--+             |
> | | DMB C  |    +---------------------------------| DMB D  |             |
> | +--------+                                      +--------+             |
> |                                                                        |
> |                           +--------------+                             |
> |                           | smc loopback |                             |
> +---------------------------+--------------+-----------------------------+
>
># Benchmark Test
>
> * Test environments:
>      - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
>      - SMC sndbuf/DMB size 1MB.
>      - /sys/devices/virtual/smc/loopback-ism/dmb_copy is set to default 0,
>        which means sndbuf and DMB are merged and no data copied between them.
>      - /sys/devices/virtual/smc/loopback-ism/dmb_type is set to default 0,

Exposing any configuration knobs and statistics over sysfs for
softdevices does not look correct at all :/ Could you please avoid
sysfs?


>        which means DMB is physically contiguous buffer.
>
> * Test object:
>      - TCP: run on TCP loopback.
>      - SMC lo: run on SMC loopback device.
>
>1. ipc-benchmark (see [3])
>
> - ./<foo> -c 1000000 -s 100
>
>                            TCP                  SMC-lo
>Message
>rate (msg/s)              80636                  149515(+85.42%)
>
>2. sockperf
>
> - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
> - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
>
>                            TCP                  SMC-lo
>Bandwidth(MBps)         4909.36                 8197.57(+66.98%)
>Latency(us)               6.098                   3.383(-44.52%)
>
>3. nginx/wrk
>
> - serv: <smc_run> nginx
> - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80
>
>                           TCP                   SMC-lo
>Requests/s           181685.74                246447.77(+35.65%)
>
>4. redis-benchmark
>
> - serv: <smc_run> redis-server
> - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024
>
>                           TCP                   SMC-lo
>GET(Requests/s)       85855.34                118553.64(+38.09%)
>SET(Requests/s)       86824.40                125944.58(+45.06%)
>
>
>Change log:
>
>v1->RFC:
>- Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics:
>  /sys/devices/virtual/smc/loopback-ism/xfer_bytes
>- Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports
>  merging sndbuf with peer DMB.
>- Patch #13 & #14: introduce loopback-ism device control of DMB memory type and
>  control of whether to merge sndbuf and DMB. They can be respectively set by:
>  /sys/devices/virtual/smc/loopback-ism/dmb_type
>  /sys/devices/virtual/smc/loopback-ism/dmb_copy
>  The motivation for these two control is that a performance bottleneck was
>  found when using vzalloced DMB and sndbuf is merged with DMB, and there are
>  many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused
>  by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg()
>  or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the
>  vmap lock contention [6]. It has significant effects, but using virtual memory
>  still has additional overhead compared to using physical memory.
>  So this new version provides controls of dmb_type and dmb_copy to suit
>  different scenarios.
>- Some minor changes and comments improvements.
>
>RFC->old version([1]):
>Link: https://lore.kernel.org/netdev/1702214654-32069-1-git-send-email-guwen@linux.alibaba.com/
>- Patch #1: improve the loopback-ism dump, it shows as follows now:
>  # smcd d
>  FID  Type  PCI-ID        PCHID  InUse  #LGs  PNET-ID
>  0000 0     loopback-ism  ffff   No        0
>- Patch #3: introduce the smc_ism_set_v2_capable() helper and set
>  smc_ism_v2_capable when ISMv2 or virtual ISM is registered,
>  regardless of whether there is already a device in smcd device list.
>- Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/.
>- Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active
>  to activate or deactivate the loopback-ism.
>- Patch #9: introduce the statistics of loopback-ism by
>  /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_tytes|dmbs_cnt}.
>- Some minor changes and comments improvements.
>
>[1] https://lore.kernel.org/netdev/1695568613-125057-1-git-send-email-guwen@linux.alibaba.com/
>[2] https://lore.kernel.org/netdev/20231219142616.80697-1-guwen@linux.alibaba.com/
>[3] https://github.com/goldsborough/ipc-bench
>[4] https://lore.kernel.org/all/3189e342-c38f-6076-b730-19a6efd732a5@linux.alibaba.com/
>[5] https://lore.kernel.org/all/238e63cd-e0e8-4fbf-852f-bc4d5bc35d5a@linux.alibaba.com/
>[6] https://lore.kernel.org/all/20240102184633.748113-1-urezki@gmail.com/
>
>Wen Gu (15):
>  net/smc: improve SMC-D device dump for virtual ISM
>  net/smc: decouple specialized struct from SMC-D DMB registration
>  net/smc: introduce virtual ISM device loopback-ism
>  net/smc: implement ID-related operations of loopback-ism
>  net/smc: implement some unsupported operations of loopback-ism
>  net/smc: implement DMB-related operations of loopback-ism
>  net/smc: register loopback-ism into SMC-D device list
>  net/smc: introduce loopback-ism runtime switch
>  net/smc: introduce loopback-ism statistics attributes
>  net/smc: add operations to merge sndbuf with peer DMB
>  net/smc: attach or detach ghost sndbuf to peer DMB
>  net/smc: adapt cursor update when sndbuf and peer DMB are merged
>  net/smc: introduce loopback-ism DMB type control
>  net/smc: introduce loopback-ism DMB data copy control
>  net/smc: implement DMB-merged operations of loopback-ism
>
> drivers/s390/net/ism_drv.c |   2 +-
> include/net/smc.h          |   7 +-
> net/smc/Kconfig            |  13 +
> net/smc/Makefile           |   2 +-
> net/smc/af_smc.c           |  28 +-
> net/smc/smc_cdc.c          |  58 ++-
> net/smc/smc_cdc.h          |   1 +
> net/smc/smc_core.c         |  61 +++-
> net/smc/smc_core.h         |   1 +
> net/smc/smc_ism.c          |  71 +++-
> net/smc/smc_ism.h          |   5 +
> net/smc/smc_loopback.c     | 718 +++++++++++++++++++++++++++++++++++++
> net/smc/smc_loopback.h     |  88 +++++
> 13 files changed, 1026 insertions(+), 29 deletions(-)
> create mode 100644 net/smc/smc_loopback.c
> create mode 100644 net/smc/smc_loopback.h
>
>-- 
>2.32.0.3.g01195cf9f
>
>
Wen Gu Jan. 12, 2024, 2:54 a.m. UTC | #3
On 2024/1/11 21:36, Simon Horman wrote:
> On Thu, Jan 11, 2024 at 08:00:21PM +0800, Wen Gu wrote:
>> This patch set acts as the second part of the new version of [1] (The
>> first part can be referred from [2]), the updated things of this version
>> are listed at the end.
> 
> ...
> 
> Hi Wen Gu,
> 
> unfortunately net-next is currently closed.
> 
> [adapted from text by Jakub]
> 
> ## Form letter - net-next-closed
> 
> The merge window for v6.8 has begun and therefore net-next is closed
> for new drivers, features, code refactoring and optimizations.
> We are currently accepting bug fixes only.
> 
> Please repost when net-next reopens on or after 21st January.
> 
> RFC patches sent for review only are obviously welcome at any time.
> 
> See: https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#development-cycle
> --
> pw-bot: defer

Thank you for notifying, Simon. I will follow the development-cycle. Thanks again.
Wen Gu Jan. 12, 2024, 8:29 a.m. UTC | #4
On 2024/1/11 22:50, Jiri Pirko wrote:
> Thu, Jan 11, 2024 at 01:00:21PM CET, guwen@linux.alibaba.com wrote:
>> This patch set acts as the second part of the new version of [1] (The first
>> part can be referred from [2]), the updated things of this version are listed
>> at the end.
>>
>> # Background
>>
>> SMC-D is now used in IBM z with ISM function to optimize network interconnect
>> for intra-CPC communications. Inspired by this, we try to make SMC-D available
> 
> Care to provide more details about what ISM and intra-CPC is and what it
> it good for?
> 

Hi Jiri,

Sure,

ISM (IBM System Z Internal Shared Memory) is a technology that provides the
internal communications capability required for SMC-D. It is a virtual PCI
network adapter that enables direct access to shared virtual memory providing
a highly optimized network interconnect for IBM Z intra-CPC communications.
(It can be found in https://www.ibm.com/docs/en/zos/3.1.0?topic=communications-shared-memory-reference-information
and https://www.ibm.com/docs/en/zos/3.1.0?topic=dv2-ismv2)

CPC (Central processor complex) is an IBM mainframe term to refer to the physical
collection of hardware that includes main storage, one or more central processors,
timers, and channels.
(It can be found in https://www.ibm.com/docs/en/zos-basic-skills?topic=concepts-mainframe-hardware-terminology
and https://www.ibm.com/docs/en/ztpf/2023?topic=support-central-processor-complex-cpc)

SMC (Shared Memory Communications) is a network protocol that allows two SMC
capable peers to communicate using memory that each peer allocates and manages
for their partner’s use. It has two forms:

- SMC over Remote Direct Memory Access (SMC-R)

   It is an open protocol that was initially introduced in z/OS V2R1 on the IBM zEC12.
   SMC-R is defined in an informational RFC entitled IBM’s Shared Memory Communications
   over RDMA (https://tools.ietf.org/html/rfc7609).

- SMC - Direct Memory Access (SMC-D)

   It is a variation of SMC-R. SMC-D is closely related to SMC-R but is based on the
   Internal Shared Memory (ISM) capabilities introduced with the IBM z13™ (z13) hardware
   model.

(SMC protocol can be found in 
https://www.ibm.com/support/pages/system/files/inline-files/IBM%20Shared%20Memory%20Communications%20Version%202.1_0.pdf)

So with ISM function, SMC-D can be used to improves throughput, lowers latency and cost,
and maintains existing functions of communications within CPC.

> 
>> on the non-s390 architecture through a software-implemented virtual ISM device,
>> that is the loopback-ism device here, to accelerate inter-process or
> 
> I see no such device. Is it a netdevice?

Currently, SMC-D depends on ISM and is only available on IBM Z systems. Now we try
to make SMC-D available on other system architectures other than s390 and Z system.
So 'virtual ISM' is proposed and acts as original firmware ISM on Z system.
(The virtual ISM supports can be found in https://lore.kernel.org/netdev/20231219142616.80697-1-guwen@linux.alibaba.com/)

The loopback-ism is the first virtual ISM. It does not rely on a specific architecture
or hardware, and provides functions that ISM should have (like a dummy device). It is
designed to be used by SMC-D when communication occurs within OS instance.

It is not a typical network device, since it primarily provides exact functions
defined by SMC-D device operations(struct smcd_ops), e.g. provides and manages the
shared memory (term used is DMB in SMC, Direct Memory Buffer).

It can't be found now since it is introduced by this patchset.

> 
> If it is "software-implemented", why is it part of smc driver and not
> separate soft-device driver? If there is some smc specific code, I guess
> there should be some level of separation. Can't this be implemented by
> other devices too?
> 

loopback-ism is designed to specifically used by SMC-D (like s390 ISM), to
serves as a easy-available ISM for community to test SMC-D and to accelerate
intra-OS communication (see benchmark test). So the code is under net/smc.

> 
> 
>> inter-containers communication within the same OS instance.
>>
>> # Design
>>
>> This patch set includes 3 parts:
>>
>> - Patch #1-#2: some prepare work for loopback-ism.
>> - Patch #3-#9: implement loopback-ism device.
>> - Patch #10-#15: memory copy optimization for loopback scenario.
>>
>> The loopback-ism device is designed as a ISMv2 device and not be limited to
>> a specific net namespace, ends of both inter-process connection (1/1' in diagram
>> below) or inter-container connection (2/2' in diagram below) can find the same
>> available loopback-ism and choose it during the CLC handshake.
>>
>> Container 1 (ns1)                              Container 2 (ns2)
>> +-----------------------------------------+    +-------------------------+
>> | +-------+      +-------+      +-------+ |    |        +-------+        |
>> | | App A |      | App B |      | App C | |    |        | App D |<-+     |
>> | +-------+      +---^---+      +-------+ |    |        +-------+  |(2') |
>> |     |127.0.0.1 (1')|             |192.168.0.11       192.168.0.12|     |
>> |  (1)|   +--------+ | +--------+  |(2)   |    | +--------+   +--------+ |
>> |     `-->|   lo   |-` |  eth0  |<-`      |    | |   lo   |   |  eth0  | |
>> +---------+--|---^-+---+-----|--+---------+    +-+--------+---+-^------+-+
>>               |   |           |                                  |
>> Kernel       |   |           |                                  |
>> +----+-------v---+-----------v----------------------------------+---+----+
>> |    |                            TCP                               |    |
>> |    |                                                              |    |
>> |    +--------------------------------------------------------------+    |
>> |                                                                        |
>> |                           +--------------+                             |
>> |                           | smc loopback |                             |
>> +---------------------------+--------------+-----------------------------+
>>
>> loopback-ism device creates DMBs (shared memory) for each connection peer.
>> Since data transfer occurs within the same kernel, the sndbuf of each peer
>> is only a descriptor and point to the same memory region as peer DMB, so that
>> the data copy from sndbuf to peer DMB can be avoided in loopback-ism case.
>>
>> Container 1 (ns1)                              Container 2 (ns2)
>> +-----------------------------------------+    +-------------------------+
>> | +-------+                               |    |        +-------+        |
>> | | App C |-----+                         |    |        | App D |        |
>> | +-------+     |                         |    |        +-^-----+        |
>> |               |                         |    |          |              |
>> |           (2) |                         |    |     (2') |              |
>> |               |                         |    |          |              |
>> +---------------|-------------------------+    +----------|--------------+
>>                  |                                         |
>> Kernel          |                                         |
>> +---------------|-----------------------------------------|--------------+
>> | +--------+ +--v-----+                           +--------+ +--------+  |
>> | |dmb_desc| |snd_desc|                           |dmb_desc| |snd_desc|  |
>> | +-----|--+ +--|-----+                           +-----|--+ +--------+  |
>> | +-----|--+    |                                 +-----|--+             |
>> | | DMB C  |    +---------------------------------| DMB D  |             |
>> | +--------+                                      +--------+             |
>> |                                                                        |
>> |                           +--------------+                             |
>> |                           | smc loopback |                             |
>> +---------------------------+--------------+-----------------------------+
>>
>> # Benchmark Test
>>
>> * Test environments:
>>       - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
>>       - SMC sndbuf/DMB size 1MB.
>>       - /sys/devices/virtual/smc/loopback-ism/dmb_copy is set to default 0,
>>         which means sndbuf and DMB are merged and no data copied between them.
>>       - /sys/devices/virtual/smc/loopback-ism/dmb_type is set to default 0,
> 
> Exposing any configuration knobs and statistics over sysfs for
> softdevices does not look correct at all :/ Could you please avoid
> sysfs?
> 

In previous reviews and calls, we think loopback-ism needs to be more
like a device and be visible under /sys/devices.

Would you mind explaining why using sysfs for loopback-ism is not correct?
since I saw some other configurations or statistics exists under /sys/devices,
e.g. /sys/devices/virtual/net/lo. Thank you!



Thanks again,
Wen Gu

> 
>>         which means DMB is physically contiguous buffer.
>>
>> * Test object:
>>       - TCP: run on TCP loopback.
>>       - SMC lo: run on SMC loopback device.
>>
>> 1. ipc-benchmark (see [3])
>>
>> - ./<foo> -c 1000000 -s 100
>>
>>                             TCP                  SMC-lo
>> Message
>> rate (msg/s)              80636                  149515(+85.42%)
>>
>> 2. sockperf
>>
>> - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
>> - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
>>
>>                             TCP                  SMC-lo
>> Bandwidth(MBps)         4909.36                 8197.57(+66.98%)
>> Latency(us)               6.098                   3.383(-44.52%)
>>
>> 3. nginx/wrk
>>
>> - serv: <smc_run> nginx
>> - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80
>>
>>                            TCP                   SMC-lo
>> Requests/s           181685.74                246447.77(+35.65%)
>>
>> 4. redis-benchmark
>>
>> - serv: <smc_run> redis-server
>> - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024
>>
>>                            TCP                   SMC-lo
>> GET(Requests/s)       85855.34                118553.64(+38.09%)
>> SET(Requests/s)       86824.40                125944.58(+45.06%)
>>
>>
>> Change log:
>>
>> v1->RFC:
>> - Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics:
>>   /sys/devices/virtual/smc/loopback-ism/xfer_bytes
>> - Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports
>>   merging sndbuf with peer DMB.
>> - Patch #13 & #14: introduce loopback-ism device control of DMB memory type and
>>   control of whether to merge sndbuf and DMB. They can be respectively set by:
>>   /sys/devices/virtual/smc/loopback-ism/dmb_type
>>   /sys/devices/virtual/smc/loopback-ism/dmb_copy
>>   The motivation for these two control is that a performance bottleneck was
>>   found when using vzalloced DMB and sndbuf is merged with DMB, and there are
>>   many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused
>>   by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg()
>>   or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the
>>   vmap lock contention [6]. It has significant effects, but using virtual memory
>>   still has additional overhead compared to using physical memory.
>>   So this new version provides controls of dmb_type and dmb_copy to suit
>>   different scenarios.
>> - Some minor changes and comments improvements.
>>
>> RFC->old version([1]):
>> Link: https://lore.kernel.org/netdev/1702214654-32069-1-git-send-email-guwen@linux.alibaba.com/
>> - Patch #1: improve the loopback-ism dump, it shows as follows now:
>>   # smcd d
>>   FID  Type  PCI-ID        PCHID  InUse  #LGs  PNET-ID
>>   0000 0     loopback-ism  ffff   No        0
>> - Patch #3: introduce the smc_ism_set_v2_capable() helper and set
>>   smc_ism_v2_capable when ISMv2 or virtual ISM is registered,
>>   regardless of whether there is already a device in smcd device list.
>> - Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/.
>> - Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active
>>   to activate or deactivate the loopback-ism.
>> - Patch #9: introduce the statistics of loopback-ism by
>>   /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_tytes|dmbs_cnt}.
>> - Some minor changes and comments improvements.
>>
>> [1] https://lore.kernel.org/netdev/1695568613-125057-1-git-send-email-guwen@linux.alibaba.com/
>> [2] https://lore.kernel.org/netdev/20231219142616.80697-1-guwen@linux.alibaba.com/
>> [3] https://github.com/goldsborough/ipc-bench
>> [4] https://lore.kernel.org/all/3189e342-c38f-6076-b730-19a6efd732a5@linux.alibaba.com/
>> [5] https://lore.kernel.org/all/238e63cd-e0e8-4fbf-852f-bc4d5bc35d5a@linux.alibaba.com/
>> [6] https://lore.kernel.org/all/20240102184633.748113-1-urezki@gmail.com/
>>
>> Wen Gu (15):
>>   net/smc: improve SMC-D device dump for virtual ISM
>>   net/smc: decouple specialized struct from SMC-D DMB registration
>>   net/smc: introduce virtual ISM device loopback-ism
>>   net/smc: implement ID-related operations of loopback-ism
>>   net/smc: implement some unsupported operations of loopback-ism
>>   net/smc: implement DMB-related operations of loopback-ism
>>   net/smc: register loopback-ism into SMC-D device list
>>   net/smc: introduce loopback-ism runtime switch
>>   net/smc: introduce loopback-ism statistics attributes
>>   net/smc: add operations to merge sndbuf with peer DMB
>>   net/smc: attach or detach ghost sndbuf to peer DMB
>>   net/smc: adapt cursor update when sndbuf and peer DMB are merged
>>   net/smc: introduce loopback-ism DMB type control
>>   net/smc: introduce loopback-ism DMB data copy control
>>   net/smc: implement DMB-merged operations of loopback-ism
>>
>> drivers/s390/net/ism_drv.c |   2 +-
>> include/net/smc.h          |   7 +-
>> net/smc/Kconfig            |  13 +
>> net/smc/Makefile           |   2 +-
>> net/smc/af_smc.c           |  28 +-
>> net/smc/smc_cdc.c          |  58 ++-
>> net/smc/smc_cdc.h          |   1 +
>> net/smc/smc_core.c         |  61 +++-
>> net/smc/smc_core.h         |   1 +
>> net/smc/smc_ism.c          |  71 +++-
>> net/smc/smc_ism.h          |   5 +
>> net/smc/smc_loopback.c     | 718 +++++++++++++++++++++++++++++++++++++
>> net/smc/smc_loopback.h     |  88 +++++
>> 13 files changed, 1026 insertions(+), 29 deletions(-)
>> create mode 100644 net/smc/smc_loopback.c
>> create mode 100644 net/smc/smc_loopback.h
>>
>> -- 
>> 2.32.0.3.g01195cf9f
>>
>>
Jiri Pirko Jan. 12, 2024, 9:10 a.m. UTC | #5
Fri, Jan 12, 2024 at 09:29:35AM CET, guwen@linux.alibaba.com wrote:
>
>
>On 2024/1/11 22:50, Jiri Pirko wrote:
>> Thu, Jan 11, 2024 at 01:00:21PM CET, guwen@linux.alibaba.com wrote:
>> > This patch set acts as the second part of the new version of [1] (The first
>> > part can be referred from [2]), the updated things of this version are listed
>> > at the end.
>> > 
>> > # Background
>> > 
>> > SMC-D is now used in IBM z with ISM function to optimize network interconnect
>> > for intra-CPC communications. Inspired by this, we try to make SMC-D available
>> 
>> Care to provide more details about what ISM and intra-CPC is and what it
>> it good for?
>> 
>
>Hi Jiri,
>
>Sure,
>
>ISM (IBM System Z Internal Shared Memory) is a technology that provides the
>internal communications capability required for SMC-D. It is a virtual PCI
>network adapter that enables direct access to shared virtual memory providing
>a highly optimized network interconnect for IBM Z intra-CPC communications.
>(It can be found in https://www.ibm.com/docs/en/zos/3.1.0?topic=communications-shared-memory-reference-information
>and https://www.ibm.com/docs/en/zos/3.1.0?topic=dv2-ismv2)
>
>CPC (Central processor complex) is an IBM mainframe term to refer to the physical
>collection of hardware that includes main storage, one or more central processors,
>timers, and channels.
>(It can be found in https://www.ibm.com/docs/en/zos-basic-skills?topic=concepts-mainframe-hardware-terminology
>and https://www.ibm.com/docs/en/ztpf/2023?topic=support-central-processor-complex-cpc)
>
>SMC (Shared Memory Communications) is a network protocol that allows two SMC
>capable peers to communicate using memory that each peer allocates and manages
>for their partner’s use. It has two forms:
>
>- SMC over Remote Direct Memory Access (SMC-R)
>
>  It is an open protocol that was initially introduced in z/OS V2R1 on the IBM zEC12.
>  SMC-R is defined in an informational RFC entitled IBM’s Shared Memory Communications
>  over RDMA (https://tools.ietf.org/html/rfc7609).
>
>- SMC - Direct Memory Access (SMC-D)
>
>  It is a variation of SMC-R. SMC-D is closely related to SMC-R but is based on the
>  Internal Shared Memory (ISM) capabilities introduced with the IBM z13™ (z13) hardware
>  model.
>
>(SMC protocol can be found in https://www.ibm.com/support/pages/system/files/inline-files/IBM%20Shared%20Memory%20Communications%20Version%202.1_0.pdf)
>
>So with ISM function, SMC-D can be used to improves throughput, lowers latency and cost,
>and maintains existing functions of communications within CPC.
>
>> 
>> > on the non-s390 architecture through a software-implemented virtual ISM device,
>> > that is the loopback-ism device here, to accelerate inter-process or
>> 
>> I see no such device. Is it a netdevice?
>
>Currently, SMC-D depends on ISM and is only available on IBM Z systems. Now we try
>to make SMC-D available on other system architectures other than s390 and Z system.
>So 'virtual ISM' is proposed and acts as original firmware ISM on Z system.
>(The virtual ISM supports can be found in https://lore.kernel.org/netdev/20231219142616.80697-1-guwen@linux.alibaba.com/)
>
>The loopback-ism is the first virtual ISM. It does not rely on a specific architecture
>or hardware, and provides functions that ISM should have (like a dummy device). It is
>designed to be used by SMC-D when communication occurs within OS instance.
>
>It is not a typical network device, since it primarily provides exact functions
>defined by SMC-D device operations(struct smcd_ops), e.g. provides and manages the
>shared memory (term used is DMB in SMC, Direct Memory Buffer).
>
>It can't be found now since it is introduced by this patchset.
>
>> 
>> If it is "software-implemented", why is it part of smc driver and not
>> separate soft-device driver? If there is some smc specific code, I guess
>> there should be some level of separation. Can't this be implemented by
>> other devices too?
>> 
>
>loopback-ism is designed to specifically used by SMC-D (like s390 ISM), to
>serves as a easy-available ISM for community to test SMC-D and to accelerate
>intra-OS communication (see benchmark test). So the code is under net/smc.

Got it.


>
>> 
>> 
>> > inter-containers communication within the same OS instance.
>> > 
>> > # Design
>> > 
>> > This patch set includes 3 parts:
>> > 
>> > - Patch #1-#2: some prepare work for loopback-ism.
>> > - Patch #3-#9: implement loopback-ism device.
>> > - Patch #10-#15: memory copy optimization for loopback scenario.
>> > 
>> > The loopback-ism device is designed as a ISMv2 device and not be limited to
>> > a specific net namespace, ends of both inter-process connection (1/1' in diagram
>> > below) or inter-container connection (2/2' in diagram below) can find the same
>> > available loopback-ism and choose it during the CLC handshake.
>> > 
>> > Container 1 (ns1)                              Container 2 (ns2)
>> > +-----------------------------------------+    +-------------------------+
>> > | +-------+      +-------+      +-------+ |    |        +-------+        |
>> > | | App A |      | App B |      | App C | |    |        | App D |<-+     |
>> > | +-------+      +---^---+      +-------+ |    |        +-------+  |(2') |
>> > |     |127.0.0.1 (1')|             |192.168.0.11       192.168.0.12|     |
>> > |  (1)|   +--------+ | +--------+  |(2)   |    | +--------+   +--------+ |
>> > |     `-->|   lo   |-` |  eth0  |<-`      |    | |   lo   |   |  eth0  | |
>> > +---------+--|---^-+---+-----|--+---------+    +-+--------+---+-^------+-+
>> >               |   |           |                                  |
>> > Kernel       |   |           |                                  |
>> > +----+-------v---+-----------v----------------------------------+---+----+
>> > |    |                            TCP                               |    |
>> > |    |                                                              |    |
>> > |    +--------------------------------------------------------------+    |
>> > |                                                                        |
>> > |                           +--------------+                             |
>> > |                           | smc loopback |                             |
>> > +---------------------------+--------------+-----------------------------+
>> > 
>> > loopback-ism device creates DMBs (shared memory) for each connection peer.
>> > Since data transfer occurs within the same kernel, the sndbuf of each peer
>> > is only a descriptor and point to the same memory region as peer DMB, so that
>> > the data copy from sndbuf to peer DMB can be avoided in loopback-ism case.
>> > 
>> > Container 1 (ns1)                              Container 2 (ns2)
>> > +-----------------------------------------+    +-------------------------+
>> > | +-------+                               |    |        +-------+        |
>> > | | App C |-----+                         |    |        | App D |        |
>> > | +-------+     |                         |    |        +-^-----+        |
>> > |               |                         |    |          |              |
>> > |           (2) |                         |    |     (2') |              |
>> > |               |                         |    |          |              |
>> > +---------------|-------------------------+    +----------|--------------+
>> >                  |                                         |
>> > Kernel          |                                         |
>> > +---------------|-----------------------------------------|--------------+
>> > | +--------+ +--v-----+                           +--------+ +--------+  |
>> > | |dmb_desc| |snd_desc|                           |dmb_desc| |snd_desc|  |
>> > | +-----|--+ +--|-----+                           +-----|--+ +--------+  |
>> > | +-----|--+    |                                 +-----|--+             |
>> > | | DMB C  |    +---------------------------------| DMB D  |             |
>> > | +--------+                                      +--------+             |
>> > |                                                                        |
>> > |                           +--------------+                             |
>> > |                           | smc loopback |                             |
>> > +---------------------------+--------------+-----------------------------+
>> > 
>> > # Benchmark Test
>> > 
>> > * Test environments:
>> >       - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
>> >       - SMC sndbuf/DMB size 1MB.
>> >       - /sys/devices/virtual/smc/loopback-ism/dmb_copy is set to default 0,
>> >         which means sndbuf and DMB are merged and no data copied between them.
>> >       - /sys/devices/virtual/smc/loopback-ism/dmb_type is set to default 0,
>> 
>> Exposing any configuration knobs and statistics over sysfs for
>> softdevices does not look correct at all :/ Could you please avoid
>> sysfs?
>> 
>
>In previous reviews and calls, we think loopback-ism needs to be more
>like a device and be visible under /sys/devices.
>
>Would you mind explaining why using sysfs for loopback-ism is not correct?
>since I saw some other configurations or statistics exists under /sys/devices,
>e.g. /sys/devices/virtual/net/lo. Thank you!

You have smc_netlink.c exposing clear netlink api for the subsystem.
Can't you extend it to contain the configuration knobs and expose stats
instead of sysfs?


>
>
>
>Thanks again,
>Wen Gu
>
>> 
>> >         which means DMB is physically contiguous buffer.
>> > 
>> > * Test object:
>> >       - TCP: run on TCP loopback.
>> >       - SMC lo: run on SMC loopback device.
>> > 
>> > 1. ipc-benchmark (see [3])
>> > 
>> > - ./<foo> -c 1000000 -s 100
>> > 
>> >                             TCP                  SMC-lo
>> > Message
>> > rate (msg/s)              80636                  149515(+85.42%)
>> > 
>> > 2. sockperf
>> > 
>> > - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
>> > - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
>> > 
>> >                             TCP                  SMC-lo
>> > Bandwidth(MBps)         4909.36                 8197.57(+66.98%)
>> > Latency(us)               6.098                   3.383(-44.52%)
>> > 
>> > 3. nginx/wrk
>> > 
>> > - serv: <smc_run> nginx
>> > - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80
>> > 
>> >                            TCP                   SMC-lo
>> > Requests/s           181685.74                246447.77(+35.65%)
>> > 
>> > 4. redis-benchmark
>> > 
>> > - serv: <smc_run> redis-server
>> > - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024
>> > 
>> >                            TCP                   SMC-lo
>> > GET(Requests/s)       85855.34                118553.64(+38.09%)
>> > SET(Requests/s)       86824.40                125944.58(+45.06%)
>> > 
>> > 
>> > Change log:
>> > 
>> > v1->RFC:
>> > - Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics:
>> >   /sys/devices/virtual/smc/loopback-ism/xfer_bytes
>> > - Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports
>> >   merging sndbuf with peer DMB.
>> > - Patch #13 & #14: introduce loopback-ism device control of DMB memory type and
>> >   control of whether to merge sndbuf and DMB. They can be respectively set by:
>> >   /sys/devices/virtual/smc/loopback-ism/dmb_type
>> >   /sys/devices/virtual/smc/loopback-ism/dmb_copy
>> >   The motivation for these two control is that a performance bottleneck was
>> >   found when using vzalloced DMB and sndbuf is merged with DMB, and there are
>> >   many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused
>> >   by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg()
>> >   or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the
>> >   vmap lock contention [6]. It has significant effects, but using virtual memory
>> >   still has additional overhead compared to using physical memory.
>> >   So this new version provides controls of dmb_type and dmb_copy to suit
>> >   different scenarios.
>> > - Some minor changes and comments improvements.
>> > 
>> > RFC->old version([1]):
>> > Link: https://lore.kernel.org/netdev/1702214654-32069-1-git-send-email-guwen@linux.alibaba.com/
>> > - Patch #1: improve the loopback-ism dump, it shows as follows now:
>> >   # smcd d
>> >   FID  Type  PCI-ID        PCHID  InUse  #LGs  PNET-ID
>> >   0000 0     loopback-ism  ffff   No        0
>> > - Patch #3: introduce the smc_ism_set_v2_capable() helper and set
>> >   smc_ism_v2_capable when ISMv2 or virtual ISM is registered,
>> >   regardless of whether there is already a device in smcd device list.
>> > - Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/.
>> > - Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active
>> >   to activate or deactivate the loopback-ism.
>> > - Patch #9: introduce the statistics of loopback-ism by
>> >   /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_tytes|dmbs_cnt}.
>> > - Some minor changes and comments improvements.
>> > 
>> > [1] https://lore.kernel.org/netdev/1695568613-125057-1-git-send-email-guwen@linux.alibaba.com/
>> > [2] https://lore.kernel.org/netdev/20231219142616.80697-1-guwen@linux.alibaba.com/
>> > [3] https://github.com/goldsborough/ipc-bench
>> > [4] https://lore.kernel.org/all/3189e342-c38f-6076-b730-19a6efd732a5@linux.alibaba.com/
>> > [5] https://lore.kernel.org/all/238e63cd-e0e8-4fbf-852f-bc4d5bc35d5a@linux.alibaba.com/
>> > [6] https://lore.kernel.org/all/20240102184633.748113-1-urezki@gmail.com/
>> > 
>> > Wen Gu (15):
>> >   net/smc: improve SMC-D device dump for virtual ISM
>> >   net/smc: decouple specialized struct from SMC-D DMB registration
>> >   net/smc: introduce virtual ISM device loopback-ism
>> >   net/smc: implement ID-related operations of loopback-ism
>> >   net/smc: implement some unsupported operations of loopback-ism
>> >   net/smc: implement DMB-related operations of loopback-ism
>> >   net/smc: register loopback-ism into SMC-D device list
>> >   net/smc: introduce loopback-ism runtime switch
>> >   net/smc: introduce loopback-ism statistics attributes
>> >   net/smc: add operations to merge sndbuf with peer DMB
>> >   net/smc: attach or detach ghost sndbuf to peer DMB
>> >   net/smc: adapt cursor update when sndbuf and peer DMB are merged
>> >   net/smc: introduce loopback-ism DMB type control
>> >   net/smc: introduce loopback-ism DMB data copy control
>> >   net/smc: implement DMB-merged operations of loopback-ism
>> > 
>> > drivers/s390/net/ism_drv.c |   2 +-
>> > include/net/smc.h          |   7 +-
>> > net/smc/Kconfig            |  13 +
>> > net/smc/Makefile           |   2 +-
>> > net/smc/af_smc.c           |  28 +-
>> > net/smc/smc_cdc.c          |  58 ++-
>> > net/smc/smc_cdc.h          |   1 +
>> > net/smc/smc_core.c         |  61 +++-
>> > net/smc/smc_core.h         |   1 +
>> > net/smc/smc_ism.c          |  71 +++-
>> > net/smc/smc_ism.h          |   5 +
>> > net/smc/smc_loopback.c     | 718 +++++++++++++++++++++++++++++++++++++
>> > net/smc/smc_loopback.h     |  88 +++++
>> > 13 files changed, 1026 insertions(+), 29 deletions(-)
>> > create mode 100644 net/smc/smc_loopback.c
>> > create mode 100644 net/smc/smc_loopback.h
>> > 
>> > -- 
>> > 2.32.0.3.g01195cf9f
>> > 
>> >
Wen Gu Jan. 12, 2024, 12:32 p.m. UTC | #6
On 2024/1/12 17:10, Jiri Pirko wrote:
> Fri, Jan 12, 2024 at 09:29:35AM CET, guwen@linux.alibaba.com wrote:
>>
>>

<...>

>>>> inter-containers communication within the same OS instance.
>>>>
>>>> # Design
>>>>
>>>> This patch set includes 3 parts:
>>>>
>>>> - Patch #1-#2: some prepare work for loopback-ism.
>>>> - Patch #3-#9: implement loopback-ism device.
>>>> - Patch #10-#15: memory copy optimization for loopback scenario.
>>>>
>>>> The loopback-ism device is designed as a ISMv2 device and not be limited to
>>>> a specific net namespace, ends of both inter-process connection (1/1' in diagram
>>>> below) or inter-container connection (2/2' in diagram below) can find the same
>>>> available loopback-ism and choose it during the CLC handshake.
>>>>
>>>> Container 1 (ns1)                              Container 2 (ns2)
>>>> +-----------------------------------------+    +-------------------------+
>>>> | +-------+      +-------+      +-------+ |    |        +-------+        |
>>>> | | App A |      | App B |      | App C | |    |        | App D |<-+     |
>>>> | +-------+      +---^---+      +-------+ |    |        +-------+  |(2') |
>>>> |     |127.0.0.1 (1')|             |192.168.0.11       192.168.0.12|     |
>>>> |  (1)|   +--------+ | +--------+  |(2)   |    | +--------+   +--------+ |
>>>> |     `-->|   lo   |-` |  eth0  |<-`      |    | |   lo   |   |  eth0  | |
>>>> +---------+--|---^-+---+-----|--+---------+    +-+--------+---+-^------+-+
>>>>                |   |           |                                  |
>>>> Kernel       |   |           |                                  |
>>>> +----+-------v---+-----------v----------------------------------+---+----+
>>>> |    |                            TCP                               |    |
>>>> |    |                                                              |    |
>>>> |    +--------------------------------------------------------------+    |
>>>> |                                                                        |
>>>> |                           +--------------+                             |
>>>> |                           | smc loopback |                             |
>>>> +---------------------------+--------------+-----------------------------+
>>>>
>>>> loopback-ism device creates DMBs (shared memory) for each connection peer.
>>>> Since data transfer occurs within the same kernel, the sndbuf of each peer
>>>> is only a descriptor and point to the same memory region as peer DMB, so that
>>>> the data copy from sndbuf to peer DMB can be avoided in loopback-ism case.
>>>>
>>>> Container 1 (ns1)                              Container 2 (ns2)
>>>> +-----------------------------------------+    +-------------------------+
>>>> | +-------+                               |    |        +-------+        |
>>>> | | App C |-----+                         |    |        | App D |        |
>>>> | +-------+     |                         |    |        +-^-----+        |
>>>> |               |                         |    |          |              |
>>>> |           (2) |                         |    |     (2') |              |
>>>> |               |                         |    |          |              |
>>>> +---------------|-------------------------+    +----------|--------------+
>>>>                   |                                         |
>>>> Kernel          |                                         |
>>>> +---------------|-----------------------------------------|--------------+
>>>> | +--------+ +--v-----+                           +--------+ +--------+  |
>>>> | |dmb_desc| |snd_desc|                           |dmb_desc| |snd_desc|  |
>>>> | +-----|--+ +--|-----+                           +-----|--+ +--------+  |
>>>> | +-----|--+    |                                 +-----|--+             |
>>>> | | DMB C  |    +---------------------------------| DMB D  |             |
>>>> | +--------+                                      +--------+             |
>>>> |                                                                        |
>>>> |                           +--------------+                             |
>>>> |                           | smc loopback |                             |
>>>> +---------------------------+--------------+-----------------------------+
>>>>
>>>> # Benchmark Test
>>>>
>>>> * Test environments:
>>>>        - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
>>>>        - SMC sndbuf/DMB size 1MB.
>>>>        - /sys/devices/virtual/smc/loopback-ism/dmb_copy is set to default 0,
>>>>          which means sndbuf and DMB are merged and no data copied between them.
>>>>        - /sys/devices/virtual/smc/loopback-ism/dmb_type is set to default 0,
>>>
>>> Exposing any configuration knobs and statistics over sysfs for
>>> softdevices does not look correct at all :/ Could you please avoid
>>> sysfs?
>>>
>>
>> In previous reviews and calls, we think loopback-ism needs to be more
>> like a device and be visible under /sys/devices.
>>
>> Would you mind explaining why using sysfs for loopback-ism is not correct?
>> since I saw some other configurations or statistics exists under /sys/devices,
>> e.g. /sys/devices/virtual/net/lo. Thank you!
> 
> You have smc_netlink.c exposing clear netlink api for the subsystem.
> Can't you extend it to contain the configuration knobs and expose stats
> instead of sysfs?
> 

Thank you for the suggestion. I've also considered this approach.

But I didn't choose to extend the smc netlink because for now smc netlink
are used for SMC protocol related attributes, for example:

SMC_NETLINK_GET_SYS_INFO:     SMC version, release, v2-capable..
SMC_NETLINK_GET_LGR_SMC{R|D}: SMC-{R|D} link group inform (lgr id, lgr conn num, lgr role..)
SMC_NETLINK_GET_LINK_SMCR:    SMC-R link inform (link id, link state, conn cnt..)
SMC_NETLINK_GET_DEV_SMCD:     SMC-D device generic inform (user cnt, pci_fid, pci_chid, pci_vendor..)
SMC_NETLINK_GET_DEV_SMCR:     SMC-R device generic inform (dev name, port pnet_id, port valid, port state..)
SMC_NETLINK_GET_STATS:        SMC generic stats (RMB cnt, Tx size, Rx size, RMB size...)

And the knobs and stats in this patchset are loopback-ism device specific
attributes, for example:

active:        loopback-ism runtime switch
dmb_type:      type of DMB provided by loopback-ism
dmb_copy:      support for DMB merge of loopback-ism
xfer_bytes:    data transferred by loopback-ism
dmbs_cnt:      DMB num provided by loopback-ism

The layer will be:

                +--------------------------------------+
                |                                      |
                |             SMC protocol             |
                |  (attrs by netlink in smc_netlink.c) |
                |                                      |
                +--------------------------------------+
              ------------------smcd_ops------------------
      +---------------+  +---------------------+  +--------------+
      | loopback-ism  |  |  s390 firmware ISM  |  | Possible     |
      +---------------+  |                     |  | other        |
      (attrs by sysfs    |                     |  | virtual ISM  |
     in smc_loopback.c)  |                     |  |              |
                         |                     |  |              |
                         +---------------------+  +--------------+

So I choose to use current way to provide this lower layer loopback-ism
device's attributes, restrict loopback-ism specific code to smc_loopback.c
and try to make a clear layer architecture.

Thanks,
Wen Gu
> 
>>
>>
>>
>> Thanks again,
>> Wen Gu
>>
>>>
>>>>          which means DMB is physically contiguous buffer.
>>>>
>>>> * Test object:
>>>>        - TCP: run on TCP loopback.
>>>>        - SMC lo: run on SMC loopback device.
>>>>
>>>> 1. ipc-benchmark (see [3])
>>>>
>>>> - ./<foo> -c 1000000 -s 100
>>>>
>>>>                              TCP                  SMC-lo
>>>> Message
>>>> rate (msg/s)              80636                  149515(+85.42%)
>>>>
>>>> 2. sockperf
>>>>
>>>> - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
>>>> - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
>>>>
>>>>                              TCP                  SMC-lo
>>>> Bandwidth(MBps)         4909.36                 8197.57(+66.98%)
>>>> Latency(us)               6.098                   3.383(-44.52%)
>>>>
>>>> 3. nginx/wrk
>>>>
>>>> - serv: <smc_run> nginx
>>>> - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80
>>>>
>>>>                             TCP                   SMC-lo
>>>> Requests/s           181685.74                246447.77(+35.65%)
>>>>
>>>> 4. redis-benchmark
>>>>
>>>> - serv: <smc_run> redis-server
>>>> - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024
>>>>
>>>>                             TCP                   SMC-lo
>>>> GET(Requests/s)       85855.34                118553.64(+38.09%)
>>>> SET(Requests/s)       86824.40                125944.58(+45.06%)
>>>>
>>>>
>>>> Change log:
>>>>
>>>> v1->RFC:
>>>> - Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics:
>>>>    /sys/devices/virtual/smc/loopback-ism/xfer_bytes
>>>> - Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports
>>>>    merging sndbuf with peer DMB.
>>>> - Patch #13 & #14: introduce loopback-ism device control of DMB memory type and
>>>>    control of whether to merge sndbuf and DMB. They can be respectively set by:
>>>>    /sys/devices/virtual/smc/loopback-ism/dmb_type
>>>>    /sys/devices/virtual/smc/loopback-ism/dmb_copy
>>>>    The motivation for these two control is that a performance bottleneck was
>>>>    found when using vzalloced DMB and sndbuf is merged with DMB, and there are
>>>>    many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused
>>>>    by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg()
>>>>    or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the
>>>>    vmap lock contention [6]. It has significant effects, but using virtual memory
>>>>    still has additional overhead compared to using physical memory.
>>>>    So this new version provides controls of dmb_type and dmb_copy to suit
>>>>    different scenarios.
>>>> - Some minor changes and comments improvements.
>>>>
>>>> RFC->old version([1]):
>>>> Link: https://lore.kernel.org/netdev/1702214654-32069-1-git-send-email-guwen@linux.alibaba.com/
>>>> - Patch #1: improve the loopback-ism dump, it shows as follows now:
>>>>    # smcd d
>>>>    FID  Type  PCI-ID        PCHID  InUse  #LGs  PNET-ID
>>>>    0000 0     loopback-ism  ffff   No        0
>>>> - Patch #3: introduce the smc_ism_set_v2_capable() helper and set
>>>>    smc_ism_v2_capable when ISMv2 or virtual ISM is registered,
>>>>    regardless of whether there is already a device in smcd device list.
>>>> - Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/.
>>>> - Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active
>>>>    to activate or deactivate the loopback-ism.
>>>> - Patch #9: introduce the statistics of loopback-ism by
>>>>    /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_tytes|dmbs_cnt}.
>>>> - Some minor changes and comments improvements.
>>>>
>>>> [1] https://lore.kernel.org/netdev/1695568613-125057-1-git-send-email-guwen@linux.alibaba.com/
>>>> [2] https://lore.kernel.org/netdev/20231219142616.80697-1-guwen@linux.alibaba.com/
>>>> [3] https://github.com/goldsborough/ipc-bench
>>>> [4] https://lore.kernel.org/all/3189e342-c38f-6076-b730-19a6efd732a5@linux.alibaba.com/
>>>> [5] https://lore.kernel.org/all/238e63cd-e0e8-4fbf-852f-bc4d5bc35d5a@linux.alibaba.com/
>>>> [6] https://lore.kernel.org/all/20240102184633.748113-1-urezki@gmail.com/
>>>>
>>>> Wen Gu (15):
>>>>    net/smc: improve SMC-D device dump for virtual ISM
>>>>    net/smc: decouple specialized struct from SMC-D DMB registration
>>>>    net/smc: introduce virtual ISM device loopback-ism
>>>>    net/smc: implement ID-related operations of loopback-ism
>>>>    net/smc: implement some unsupported operations of loopback-ism
>>>>    net/smc: implement DMB-related operations of loopback-ism
>>>>    net/smc: register loopback-ism into SMC-D device list
>>>>    net/smc: introduce loopback-ism runtime switch
>>>>    net/smc: introduce loopback-ism statistics attributes
>>>>    net/smc: add operations to merge sndbuf with peer DMB
>>>>    net/smc: attach or detach ghost sndbuf to peer DMB
>>>>    net/smc: adapt cursor update when sndbuf and peer DMB are merged
>>>>    net/smc: introduce loopback-ism DMB type control
>>>>    net/smc: introduce loopback-ism DMB data copy control
>>>>    net/smc: implement DMB-merged operations of loopback-ism
>>>>
>>>> drivers/s390/net/ism_drv.c |   2 +-
>>>> include/net/smc.h          |   7 +-
>>>> net/smc/Kconfig            |  13 +
>>>> net/smc/Makefile           |   2 +-
>>>> net/smc/af_smc.c           |  28 +-
>>>> net/smc/smc_cdc.c          |  58 ++-
>>>> net/smc/smc_cdc.h          |   1 +
>>>> net/smc/smc_core.c         |  61 +++-
>>>> net/smc/smc_core.h         |   1 +
>>>> net/smc/smc_ism.c          |  71 +++-
>>>> net/smc/smc_ism.h          |   5 +
>>>> net/smc/smc_loopback.c     | 718 +++++++++++++++++++++++++++++++++++++
>>>> net/smc/smc_loopback.h     |  88 +++++
>>>> 13 files changed, 1026 insertions(+), 29 deletions(-)
>>>> create mode 100644 net/smc/smc_loopback.c
>>>> create mode 100644 net/smc/smc_loopback.h
>>>>
>>>> -- 
>>>> 2.32.0.3.g01195cf9f
>>>>
>>>>
Jiri Pirko Jan. 12, 2024, 3:50 p.m. UTC | #7
Fri, Jan 12, 2024 at 01:32:14PM CET, guwen@linux.alibaba.com wrote:
>
>
>On 2024/1/12 17:10, Jiri Pirko wrote:
>> Fri, Jan 12, 2024 at 09:29:35AM CET, guwen@linux.alibaba.com wrote:
>> > 
>> > 
>
><...>
>
>> > > > inter-containers communication within the same OS instance.
>> > > > 
>> > > > # Design
>> > > > 
>> > > > This patch set includes 3 parts:
>> > > > 
>> > > > - Patch #1-#2: some prepare work for loopback-ism.
>> > > > - Patch #3-#9: implement loopback-ism device.
>> > > > - Patch #10-#15: memory copy optimization for loopback scenario.
>> > > > 
>> > > > The loopback-ism device is designed as a ISMv2 device and not be limited to
>> > > > a specific net namespace, ends of both inter-process connection (1/1' in diagram
>> > > > below) or inter-container connection (2/2' in diagram below) can find the same
>> > > > available loopback-ism and choose it during the CLC handshake.
>> > > > 
>> > > > Container 1 (ns1)                              Container 2 (ns2)
>> > > > +-----------------------------------------+    +-------------------------+
>> > > > | +-------+      +-------+      +-------+ |    |        +-------+        |
>> > > > | | App A |      | App B |      | App C | |    |        | App D |<-+     |
>> > > > | +-------+      +---^---+      +-------+ |    |        +-------+  |(2') |
>> > > > |     |127.0.0.1 (1')|             |192.168.0.11       192.168.0.12|     |
>> > > > |  (1)|   +--------+ | +--------+  |(2)   |    | +--------+   +--------+ |
>> > > > |     `-->|   lo   |-` |  eth0  |<-`      |    | |   lo   |   |  eth0  | |
>> > > > +---------+--|---^-+---+-----|--+---------+    +-+--------+---+-^------+-+
>> > > >                |   |           |                                  |
>> > > > Kernel       |   |           |                                  |
>> > > > +----+-------v---+-----------v----------------------------------+---+----+
>> > > > |    |                            TCP                               |    |
>> > > > |    |                                                              |    |
>> > > > |    +--------------------------------------------------------------+    |
>> > > > |                                                                        |
>> > > > |                           +--------------+                             |
>> > > > |                           | smc loopback |                             |
>> > > > +---------------------------+--------------+-----------------------------+
>> > > > 
>> > > > loopback-ism device creates DMBs (shared memory) for each connection peer.
>> > > > Since data transfer occurs within the same kernel, the sndbuf of each peer
>> > > > is only a descriptor and point to the same memory region as peer DMB, so that
>> > > > the data copy from sndbuf to peer DMB can be avoided in loopback-ism case.
>> > > > 
>> > > > Container 1 (ns1)                              Container 2 (ns2)
>> > > > +-----------------------------------------+    +-------------------------+
>> > > > | +-------+                               |    |        +-------+        |
>> > > > | | App C |-----+                         |    |        | App D |        |
>> > > > | +-------+     |                         |    |        +-^-----+        |
>> > > > |               |                         |    |          |              |
>> > > > |           (2) |                         |    |     (2') |              |
>> > > > |               |                         |    |          |              |
>> > > > +---------------|-------------------------+    +----------|--------------+
>> > > >                   |                                         |
>> > > > Kernel          |                                         |
>> > > > +---------------|-----------------------------------------|--------------+
>> > > > | +--------+ +--v-----+                           +--------+ +--------+  |
>> > > > | |dmb_desc| |snd_desc|                           |dmb_desc| |snd_desc|  |
>> > > > | +-----|--+ +--|-----+                           +-----|--+ +--------+  |
>> > > > | +-----|--+    |                                 +-----|--+             |
>> > > > | | DMB C  |    +---------------------------------| DMB D  |             |
>> > > > | +--------+                                      +--------+             |
>> > > > |                                                                        |
>> > > > |                           +--------------+                             |
>> > > > |                           | smc loopback |                             |
>> > > > +---------------------------+--------------+-----------------------------+
>> > > > 
>> > > > # Benchmark Test
>> > > > 
>> > > > * Test environments:
>> > > >        - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
>> > > >        - SMC sndbuf/DMB size 1MB.
>> > > >        - /sys/devices/virtual/smc/loopback-ism/dmb_copy is set to default 0,
>> > > >          which means sndbuf and DMB are merged and no data copied between them.
>> > > >        - /sys/devices/virtual/smc/loopback-ism/dmb_type is set to default 0,
>> > > 
>> > > Exposing any configuration knobs and statistics over sysfs for
>> > > softdevices does not look correct at all :/ Could you please avoid
>> > > sysfs?
>> > > 
>> > 
>> > In previous reviews and calls, we think loopback-ism needs to be more
>> > like a device and be visible under /sys/devices.
>> > 
>> > Would you mind explaining why using sysfs for loopback-ism is not correct?
>> > since I saw some other configurations or statistics exists under /sys/devices,
>> > e.g. /sys/devices/virtual/net/lo. Thank you!
>> 
>> You have smc_netlink.c exposing clear netlink api for the subsystem.
>> Can't you extend it to contain the configuration knobs and expose stats
>> instead of sysfs?
>> 
>
>Thank you for the suggestion. I've also considered this approach.
>
>But I didn't choose to extend the smc netlink because for now smc netlink
>are used for SMC protocol related attributes, for example:
>
>SMC_NETLINK_GET_SYS_INFO:     SMC version, release, v2-capable..
>SMC_NETLINK_GET_LGR_SMC{R|D}: SMC-{R|D} link group inform (lgr id, lgr conn num, lgr role..)
>SMC_NETLINK_GET_LINK_SMCR:    SMC-R link inform (link id, link state, conn cnt..)
>SMC_NETLINK_GET_DEV_SMCD:     SMC-D device generic inform (user cnt, pci_fid, pci_chid, pci_vendor..)
>SMC_NETLINK_GET_DEV_SMCR:     SMC-R device generic inform (dev name, port pnet_id, port valid, port state..)
>SMC_NETLINK_GET_STATS:        SMC generic stats (RMB cnt, Tx size, Rx size, RMB size...)
>
>And the knobs and stats in this patchset are loopback-ism device specific
>attributes, for example:
>
>active:        loopback-ism runtime switch
>dmb_type:      type of DMB provided by loopback-ism
>dmb_copy:      support for DMB merge of loopback-ism
>xfer_bytes:    data transferred by loopback-ism
>dmbs_cnt:      DMB num provided by loopback-ism
>
>The layer will be:
>
>               +--------------------------------------+
>               |                                      |
>               |             SMC protocol             |
>               |  (attrs by netlink in smc_netlink.c) |
>               |                                      |
>               +--------------------------------------+
>             ------------------smcd_ops------------------
>     +---------------+  +---------------------+  +--------------+
>     | loopback-ism  |  |  s390 firmware ISM  |  | Possible     |
>     +---------------+  |                     |  | other        |
>     (attrs by sysfs    |                     |  | virtual ISM  |
>    in smc_loopback.c)  |                     |  |              |
>                        |                     |  |              |
>                        +---------------------+  +--------------+

So nest it:
SMC_NETLINK_BACKEND_GET_INFO
SMC_NETLINK_BACKEND_GET_STATS
?
I mean, isn't it better to have the backend knobs and stats in one place
under same netlink commands and attributes than random sysfs path ?



>
>So I choose to use current way to provide this lower layer loopback-ism
>device's attributes, restrict loopback-ism specific code to smc_loopback.c
>and try to make a clear layer architecture.
>
>Thanks,
>Wen Gu
>> 
>> > 
>> > 
>> > 
>> > Thanks again,
>> > Wen Gu
>> > 
>> > > 
>> > > >          which means DMB is physically contiguous buffer.
>> > > > 
>> > > > * Test object:
>> > > >        - TCP: run on TCP loopback.
>> > > >        - SMC lo: run on SMC loopback device.
>> > > > 
>> > > > 1. ipc-benchmark (see [3])
>> > > > 
>> > > > - ./<foo> -c 1000000 -s 100
>> > > > 
>> > > >                              TCP                  SMC-lo
>> > > > Message
>> > > > rate (msg/s)              80636                  149515(+85.42%)
>> > > > 
>> > > > 2. sockperf
>> > > > 
>> > > > - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
>> > > > - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
>> > > > 
>> > > >                              TCP                  SMC-lo
>> > > > Bandwidth(MBps)         4909.36                 8197.57(+66.98%)
>> > > > Latency(us)               6.098                   3.383(-44.52%)
>> > > > 
>> > > > 3. nginx/wrk
>> > > > 
>> > > > - serv: <smc_run> nginx
>> > > > - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80
>> > > > 
>> > > >                             TCP                   SMC-lo
>> > > > Requests/s           181685.74                246447.77(+35.65%)
>> > > > 
>> > > > 4. redis-benchmark
>> > > > 
>> > > > - serv: <smc_run> redis-server
>> > > > - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024
>> > > > 
>> > > >                             TCP                   SMC-lo
>> > > > GET(Requests/s)       85855.34                118553.64(+38.09%)
>> > > > SET(Requests/s)       86824.40                125944.58(+45.06%)
>> > > > 
>> > > > 
>> > > > Change log:
>> > > > 
>> > > > v1->RFC:
>> > > > - Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics:
>> > > >    /sys/devices/virtual/smc/loopback-ism/xfer_bytes
>> > > > - Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports
>> > > >    merging sndbuf with peer DMB.
>> > > > - Patch #13 & #14: introduce loopback-ism device control of DMB memory type and
>> > > >    control of whether to merge sndbuf and DMB. They can be respectively set by:
>> > > >    /sys/devices/virtual/smc/loopback-ism/dmb_type
>> > > >    /sys/devices/virtual/smc/loopback-ism/dmb_copy
>> > > >    The motivation for these two control is that a performance bottleneck was
>> > > >    found when using vzalloced DMB and sndbuf is merged with DMB, and there are
>> > > >    many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused
>> > > >    by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg()
>> > > >    or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the
>> > > >    vmap lock contention [6]. It has significant effects, but using virtual memory
>> > > >    still has additional overhead compared to using physical memory.
>> > > >    So this new version provides controls of dmb_type and dmb_copy to suit
>> > > >    different scenarios.
>> > > > - Some minor changes and comments improvements.
>> > > > 
>> > > > RFC->old version([1]):
>> > > > Link: https://lore.kernel.org/netdev/1702214654-32069-1-git-send-email-guwen@linux.alibaba.com/
>> > > > - Patch #1: improve the loopback-ism dump, it shows as follows now:
>> > > >    # smcd d
>> > > >    FID  Type  PCI-ID        PCHID  InUse  #LGs  PNET-ID
>> > > >    0000 0     loopback-ism  ffff   No        0
>> > > > - Patch #3: introduce the smc_ism_set_v2_capable() helper and set
>> > > >    smc_ism_v2_capable when ISMv2 or virtual ISM is registered,
>> > > >    regardless of whether there is already a device in smcd device list.
>> > > > - Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/.
>> > > > - Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active
>> > > >    to activate or deactivate the loopback-ism.
>> > > > - Patch #9: introduce the statistics of loopback-ism by
>> > > >    /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_tytes|dmbs_cnt}.
>> > > > - Some minor changes and comments improvements.
>> > > > 
>> > > > [1] https://lore.kernel.org/netdev/1695568613-125057-1-git-send-email-guwen@linux.alibaba.com/
>> > > > [2] https://lore.kernel.org/netdev/20231219142616.80697-1-guwen@linux.alibaba.com/
>> > > > [3] https://github.com/goldsborough/ipc-bench
>> > > > [4] https://lore.kernel.org/all/3189e342-c38f-6076-b730-19a6efd732a5@linux.alibaba.com/
>> > > > [5] https://lore.kernel.org/all/238e63cd-e0e8-4fbf-852f-bc4d5bc35d5a@linux.alibaba.com/
>> > > > [6] https://lore.kernel.org/all/20240102184633.748113-1-urezki@gmail.com/
>> > > > 
>> > > > Wen Gu (15):
>> > > >    net/smc: improve SMC-D device dump for virtual ISM
>> > > >    net/smc: decouple specialized struct from SMC-D DMB registration
>> > > >    net/smc: introduce virtual ISM device loopback-ism
>> > > >    net/smc: implement ID-related operations of loopback-ism
>> > > >    net/smc: implement some unsupported operations of loopback-ism
>> > > >    net/smc: implement DMB-related operations of loopback-ism
>> > > >    net/smc: register loopback-ism into SMC-D device list
>> > > >    net/smc: introduce loopback-ism runtime switch
>> > > >    net/smc: introduce loopback-ism statistics attributes
>> > > >    net/smc: add operations to merge sndbuf with peer DMB
>> > > >    net/smc: attach or detach ghost sndbuf to peer DMB
>> > > >    net/smc: adapt cursor update when sndbuf and peer DMB are merged
>> > > >    net/smc: introduce loopback-ism DMB type control
>> > > >    net/smc: introduce loopback-ism DMB data copy control
>> > > >    net/smc: implement DMB-merged operations of loopback-ism
>> > > > 
>> > > > drivers/s390/net/ism_drv.c |   2 +-
>> > > > include/net/smc.h          |   7 +-
>> > > > net/smc/Kconfig            |  13 +
>> > > > net/smc/Makefile           |   2 +-
>> > > > net/smc/af_smc.c           |  28 +-
>> > > > net/smc/smc_cdc.c          |  58 ++-
>> > > > net/smc/smc_cdc.h          |   1 +
>> > > > net/smc/smc_core.c         |  61 +++-
>> > > > net/smc/smc_core.h         |   1 +
>> > > > net/smc/smc_ism.c          |  71 +++-
>> > > > net/smc/smc_ism.h          |   5 +
>> > > > net/smc/smc_loopback.c     | 718 +++++++++++++++++++++++++++++++++++++
>> > > > net/smc/smc_loopback.h     |  88 +++++
>> > > > 13 files changed, 1026 insertions(+), 29 deletions(-)
>> > > > create mode 100644 net/smc/smc_loopback.c
>> > > > create mode 100644 net/smc/smc_loopback.h
>> > > > 
>> > > > -- 
>> > > > 2.32.0.3.g01195cf9f
>> > > > 
>> > > >
Wen Gu Jan. 13, 2024, 9:22 a.m. UTC | #8
On 2024/1/12 23:50, Jiri Pirko wrote:
> Fri, Jan 12, 2024 at 01:32:14PM CET, guwen@linux.alibaba.com wrote:
>>
>>
>> On 2024/1/12 17:10, Jiri Pirko wrote:
>>> Fri, Jan 12, 2024 at 09:29:35AM CET, guwen@linux.alibaba.com wrote:
>>>>

<...>

>>>>>> # Benchmark Test
>>>>>>
>>>>>> * Test environments:
>>>>>>         - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
>>>>>>         - SMC sndbuf/DMB size 1MB.
>>>>>>         - /sys/devices/virtual/smc/loopback-ism/dmb_copy is set to default 0,
>>>>>>           which means sndbuf and DMB are merged and no data copied between them.
>>>>>>         - /sys/devices/virtual/smc/loopback-ism/dmb_type is set to default 0,
>>>>>
>>>>> Exposing any configuration knobs and statistics over sysfs for
>>>>> softdevices does not look correct at all :/ Could you please avoid
>>>>> sysfs?
>>>>>
>>>>
>>>> In previous reviews and calls, we think loopback-ism needs to be more
>>>> like a device and be visible under /sys/devices.
>>>>
>>>> Would you mind explaining why using sysfs for loopback-ism is not correct?
>>>> since I saw some other configurations or statistics exists under /sys/devices,
>>>> e.g. /sys/devices/virtual/net/lo. Thank you!
>>>
>>> You have smc_netlink.c exposing clear netlink api for the subsystem.
>>> Can't you extend it to contain the configuration knobs and expose stats
>>> instead of sysfs?
>>>
>>
>> Thank you for the suggestion. I've also considered this approach.
>>
>> But I didn't choose to extend the smc netlink because for now smc netlink
>> are used for SMC protocol related attributes, for example:
>>
>> SMC_NETLINK_GET_SYS_INFO:     SMC version, release, v2-capable..
>> SMC_NETLINK_GET_LGR_SMC{R|D}: SMC-{R|D} link group inform (lgr id, lgr conn num, lgr role..)
>> SMC_NETLINK_GET_LINK_SMCR:    SMC-R link inform (link id, link state, conn cnt..)
>> SMC_NETLINK_GET_DEV_SMCD:     SMC-D device generic inform (user cnt, pci_fid, pci_chid, pci_vendor..)
>> SMC_NETLINK_GET_DEV_SMCR:     SMC-R device generic inform (dev name, port pnet_id, port valid, port state..)
>> SMC_NETLINK_GET_STATS:        SMC generic stats (RMB cnt, Tx size, Rx size, RMB size...)
>>
>> And the knobs and stats in this patchset are loopback-ism device specific
>> attributes, for example:
>>
>> active:        loopback-ism runtime switch
>> dmb_type:      type of DMB provided by loopback-ism
>> dmb_copy:      support for DMB merge of loopback-ism
>> xfer_bytes:    data transferred by loopback-ism
>> dmbs_cnt:      DMB num provided by loopback-ism
>>
>> The layer will be:
>>
>>                +--------------------------------------+
>>                |                                      |
>>                |             SMC protocol             |
>>                |  (attrs by netlink in smc_netlink.c) |
>>                |                                      |
>>                +--------------------------------------+
>>              ------------------smcd_ops------------------
>>      +---------------+  +---------------------+  +--------------+
>>      | loopback-ism  |  |  s390 firmware ISM  |  | Possible     |
>>      +---------------+  |                     |  | other        |
>>      (attrs by sysfs    |                     |  | virtual ISM  |
>>     in smc_loopback.c)  |                     |  |              |
>>                         |                     |  |              |
>>                         +---------------------+  +--------------+
> 
> So nest it:
> SMC_NETLINK_BACKEND_GET_INFO
> SMC_NETLINK_BACKEND_GET_STATS
> ?
> I mean, isn't it better to have the backend knobs and stats in one place
> under same netlink commands and attributes than random sysfs path ?
> 
Thank you for suggestion.

I think it is not about nesting or gathering knobs and stats. It is
about not coupling underlying device details to upper layer SMC stack.

 From SMC perspective, it cares about the abstract operations defined
by smcd_ops, regardless of which underlying devices provide these
functions and how they provide. So IMO the details or configurations
of underlying devices shouldn't be involved in SMC.

Besides, the knobs and stats here are specific for loopback-ism device,
they include runtime switch, buffer type choice and mode choice of
loopback-ism (basically they won't change after being set once). The
other kinds of devices used by SMC-D, e.g. s390 firmware ISM or other
virtual ISMs have no similar things.

So I prefer to keep the current solution instead of expanding upper
layer SMC netlink.

Thanks,
Wen Gu

> 
> 
>>
>> So I choose to use current way to provide this lower layer loopback-ism
>> device's attributes, restrict loopback-ism specific code to smc_loopback.c
>> and try to make a clear layer architecture.
>>
>> Thanks,
>> Wen Gu
>>>
>>>>
>>>>
>>>>
>>>> Thanks again,
>>>> Wen Gu
>>>>
>>>>>
>>>>>>           which means DMB is physically contiguous buffer.
>>>>>>
>>>>>> * Test object:
>>>>>>         - TCP: run on TCP loopback.
>>>>>>         - SMC lo: run on SMC loopback device.
>>>>>>
>>>>>> 1. ipc-benchmark (see [3])
>>>>>>
>>>>>> - ./<foo> -c 1000000 -s 100
>>>>>>
>>>>>>                               TCP                  SMC-lo
>>>>>> Message
>>>>>> rate (msg/s)              80636                  149515(+85.42%)
>>>>>>
>>>>>> 2. sockperf
>>>>>>
>>>>>> - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
>>>>>> - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
>>>>>>
>>>>>>                               TCP                  SMC-lo
>>>>>> Bandwidth(MBps)         4909.36                 8197.57(+66.98%)
>>>>>> Latency(us)               6.098                   3.383(-44.52%)
>>>>>>
>>>>>> 3. nginx/wrk
>>>>>>
>>>>>> - serv: <smc_run> nginx
>>>>>> - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80
>>>>>>
>>>>>>                              TCP                   SMC-lo
>>>>>> Requests/s           181685.74                246447.77(+35.65%)
>>>>>>
>>>>>> 4. redis-benchmark
>>>>>>
>>>>>> - serv: <smc_run> redis-server
>>>>>> - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024
>>>>>>
>>>>>>                              TCP                   SMC-lo
>>>>>> GET(Requests/s)       85855.34                118553.64(+38.09%)
>>>>>> SET(Requests/s)       86824.40                125944.58(+45.06%)
>>>>>>
>>>>>>
>>>>>> Change log:
>>>>>>
>>>>>> v1->RFC:
>>>>>> - Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics:
>>>>>>     /sys/devices/virtual/smc/loopback-ism/xfer_bytes
>>>>>> - Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports
>>>>>>     merging sndbuf with peer DMB.
>>>>>> - Patch #13 & #14: introduce loopback-ism device control of DMB memory type and
>>>>>>     control of whether to merge sndbuf and DMB. They can be respectively set by:
>>>>>>     /sys/devices/virtual/smc/loopback-ism/dmb_type
>>>>>>     /sys/devices/virtual/smc/loopback-ism/dmb_copy
>>>>>>     The motivation for these two control is that a performance bottleneck was
>>>>>>     found when using vzalloced DMB and sndbuf is merged with DMB, and there are
>>>>>>     many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused
>>>>>>     by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg()
>>>>>>     or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the
>>>>>>     vmap lock contention [6]. It has significant effects, but using virtual memory
>>>>>>     still has additional overhead compared to using physical memory.
>>>>>>     So this new version provides controls of dmb_type and dmb_copy to suit
>>>>>>     different scenarios.
>>>>>> - Some minor changes and comments improvements.
>>>>>>
>>>>>> RFC->old version([1]):
>>>>>> Link: https://lore.kernel.org/netdev/1702214654-32069-1-git-send-email-guwen@linux.alibaba.com/
>>>>>> - Patch #1: improve the loopback-ism dump, it shows as follows now:
>>>>>>     # smcd d
>>>>>>     FID  Type  PCI-ID        PCHID  InUse  #LGs  PNET-ID
>>>>>>     0000 0     loopback-ism  ffff   No        0
>>>>>> - Patch #3: introduce the smc_ism_set_v2_capable() helper and set
>>>>>>     smc_ism_v2_capable when ISMv2 or virtual ISM is registered,
>>>>>>     regardless of whether there is already a device in smcd device list.
>>>>>> - Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/.
>>>>>> - Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active
>>>>>>     to activate or deactivate the loopback-ism.
>>>>>> - Patch #9: introduce the statistics of loopback-ism by
>>>>>>     /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_tytes|dmbs_cnt}.
>>>>>> - Some minor changes and comments improvements.
>>>>>>
>>>>>> [1] https://lore.kernel.org/netdev/1695568613-125057-1-git-send-email-guwen@linux.alibaba.com/
>>>>>> [2] https://lore.kernel.org/netdev/20231219142616.80697-1-guwen@linux.alibaba.com/
>>>>>> [3] https://github.com/goldsborough/ipc-bench
>>>>>> [4] https://lore.kernel.org/all/3189e342-c38f-6076-b730-19a6efd732a5@linux.alibaba.com/
>>>>>> [5] https://lore.kernel.org/all/238e63cd-e0e8-4fbf-852f-bc4d5bc35d5a@linux.alibaba.com/
>>>>>> [6] https://lore.kernel.org/all/20240102184633.748113-1-urezki@gmail.com/
>>>>>>
>>>>>> Wen Gu (15):
>>>>>>     net/smc: improve SMC-D device dump for virtual ISM
>>>>>>     net/smc: decouple specialized struct from SMC-D DMB registration
>>>>>>     net/smc: introduce virtual ISM device loopback-ism
>>>>>>     net/smc: implement ID-related operations of loopback-ism
>>>>>>     net/smc: implement some unsupported operations of loopback-ism
>>>>>>     net/smc: implement DMB-related operations of loopback-ism
>>>>>>     net/smc: register loopback-ism into SMC-D device list
>>>>>>     net/smc: introduce loopback-ism runtime switch
>>>>>>     net/smc: introduce loopback-ism statistics attributes
>>>>>>     net/smc: add operations to merge sndbuf with peer DMB
>>>>>>     net/smc: attach or detach ghost sndbuf to peer DMB
>>>>>>     net/smc: adapt cursor update when sndbuf and peer DMB are merged
>>>>>>     net/smc: introduce loopback-ism DMB type control
>>>>>>     net/smc: introduce loopback-ism DMB data copy control
>>>>>>     net/smc: implement DMB-merged operations of loopback-ism
>>>>>>
>>>>>> drivers/s390/net/ism_drv.c |   2 +-
>>>>>> include/net/smc.h          |   7 +-
>>>>>> net/smc/Kconfig            |  13 +
>>>>>> net/smc/Makefile           |   2 +-
>>>>>> net/smc/af_smc.c           |  28 +-
>>>>>> net/smc/smc_cdc.c          |  58 ++-
>>>>>> net/smc/smc_cdc.h          |   1 +
>>>>>> net/smc/smc_core.c         |  61 +++-
>>>>>> net/smc/smc_core.h         |   1 +
>>>>>> net/smc/smc_ism.c          |  71 +++-
>>>>>> net/smc/smc_ism.h          |   5 +
>>>>>> net/smc/smc_loopback.c     | 718 +++++++++++++++++++++++++++++++++++++
>>>>>> net/smc/smc_loopback.h     |  88 +++++
>>>>>> 13 files changed, 1026 insertions(+), 29 deletions(-)
>>>>>> create mode 100644 net/smc/smc_loopback.c
>>>>>> create mode 100644 net/smc/smc_loopback.h
>>>>>>
>>>>>> -- 
>>>>>> 2.32.0.3.g01195cf9f
>>>>>>
>>>>>>
Jiri Pirko Jan. 15, 2024, 2:11 p.m. UTC | #9
Sat, Jan 13, 2024 at 10:22:15AM CET, guwen@linux.alibaba.com wrote:
>
>
>On 2024/1/12 23:50, Jiri Pirko wrote:
>> Fri, Jan 12, 2024 at 01:32:14PM CET, guwen@linux.alibaba.com wrote:
>> > 
>> > 
>> > On 2024/1/12 17:10, Jiri Pirko wrote:
>> > > Fri, Jan 12, 2024 at 09:29:35AM CET, guwen@linux.alibaba.com wrote:
>> > > > 
>
><...>
>
>> > > > > > # Benchmark Test
>> > > > > > 
>> > > > > > * Test environments:
>> > > > > >         - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
>> > > > > >         - SMC sndbuf/DMB size 1MB.
>> > > > > >         - /sys/devices/virtual/smc/loopback-ism/dmb_copy is set to default 0,
>> > > > > >           which means sndbuf and DMB are merged and no data copied between them.
>> > > > > >         - /sys/devices/virtual/smc/loopback-ism/dmb_type is set to default 0,
>> > > > > 
>> > > > > Exposing any configuration knobs and statistics over sysfs for
>> > > > > softdevices does not look correct at all :/ Could you please avoid
>> > > > > sysfs?
>> > > > > 
>> > > > 
>> > > > In previous reviews and calls, we think loopback-ism needs to be more
>> > > > like a device and be visible under /sys/devices.
>> > > > 
>> > > > Would you mind explaining why using sysfs for loopback-ism is not correct?
>> > > > since I saw some other configurations or statistics exists under /sys/devices,
>> > > > e.g. /sys/devices/virtual/net/lo. Thank you!
>> > > 
>> > > You have smc_netlink.c exposing clear netlink api for the subsystem.
>> > > Can't you extend it to contain the configuration knobs and expose stats
>> > > instead of sysfs?
>> > > 
>> > 
>> > Thank you for the suggestion. I've also considered this approach.
>> > 
>> > But I didn't choose to extend the smc netlink because for now smc netlink
>> > are used for SMC protocol related attributes, for example:
>> > 
>> > SMC_NETLINK_GET_SYS_INFO:     SMC version, release, v2-capable..
>> > SMC_NETLINK_GET_LGR_SMC{R|D}: SMC-{R|D} link group inform (lgr id, lgr conn num, lgr role..)
>> > SMC_NETLINK_GET_LINK_SMCR:    SMC-R link inform (link id, link state, conn cnt..)
>> > SMC_NETLINK_GET_DEV_SMCD:     SMC-D device generic inform (user cnt, pci_fid, pci_chid, pci_vendor..)
>> > SMC_NETLINK_GET_DEV_SMCR:     SMC-R device generic inform (dev name, port pnet_id, port valid, port state..)
>> > SMC_NETLINK_GET_STATS:        SMC generic stats (RMB cnt, Tx size, Rx size, RMB size...)
>> > 
>> > And the knobs and stats in this patchset are loopback-ism device specific
>> > attributes, for example:
>> > 
>> > active:        loopback-ism runtime switch
>> > dmb_type:      type of DMB provided by loopback-ism
>> > dmb_copy:      support for DMB merge of loopback-ism
>> > xfer_bytes:    data transferred by loopback-ism
>> > dmbs_cnt:      DMB num provided by loopback-ism
>> > 
>> > The layer will be:
>> > 
>> >                +--------------------------------------+
>> >                |                                      |
>> >                |             SMC protocol             |
>> >                |  (attrs by netlink in smc_netlink.c) |
>> >                |                                      |
>> >                +--------------------------------------+
>> >              ------------------smcd_ops------------------
>> >      +---------------+  +---------------------+  +--------------+
>> >      | loopback-ism  |  |  s390 firmware ISM  |  | Possible     |
>> >      +---------------+  |                     |  | other        |
>> >      (attrs by sysfs    |                     |  | virtual ISM  |
>> >     in smc_loopback.c)  |                     |  |              |
>> >                         |                     |  |              |
>> >                         +---------------------+  +--------------+
>> 
>> So nest it:
>> SMC_NETLINK_BACKEND_GET_INFO
>> SMC_NETLINK_BACKEND_GET_STATS
>> ?
>> I mean, isn't it better to have the backend knobs and stats in one place
>> under same netlink commands and attributes than random sysfs path ?
>> 
>Thank you for suggestion.
>
>I think it is not about nesting or gathering knobs and stats. It is
>about not coupling underlying device details to upper layer SMC stack.
>
>From SMC perspective, it cares about the abstract operations defined
>by smcd_ops, regardless of which underlying devices provide these
>functions and how they provide. So IMO the details or configurations
>of underlying devices shouldn't be involved in SMC.

So you rather keep the device configuration and info exposed over random
sysfs files? Sorry, that makes not sense to me.


>
>Besides, the knobs and stats here are specific for loopback-ism device,
>they include runtime switch, buffer type choice and mode choice of
>loopback-ism (basically they won't change after being set once). The
>other kinds of devices used by SMC-D, e.g. s390 firmware ISM or other
>virtual ISMs have no similar things.

Okay, it is normal that different drivers implement different parts of
UAPI. No problem.


>
>So I prefer to keep the current solution instead of expanding upper
>layer SMC netlink.

Makes no sense to me. This is UAPI from 20 years ago. Is this a time
machine?


>
>Thanks,
>Wen Gu
>
>> 
>> 
>> > 
>> > So I choose to use current way to provide this lower layer loopback-ism
>> > device's attributes, restrict loopback-ism specific code to smc_loopback.c
>> > and try to make a clear layer architecture.
>> > 
>> > Thanks,
>> > Wen Gu
>> > > 
>> > > > 
>> > > > 
>> > > > 
>> > > > Thanks again,
>> > > > Wen Gu
>> > > > 
>> > > > > 
>> > > > > >           which means DMB is physically contiguous buffer.
>> > > > > > 
>> > > > > > * Test object:
>> > > > > >         - TCP: run on TCP loopback.
>> > > > > >         - SMC lo: run on SMC loopback device.
>> > > > > > 
>> > > > > > 1. ipc-benchmark (see [3])
>> > > > > > 
>> > > > > > - ./<foo> -c 1000000 -s 100
>> > > > > > 
>> > > > > >                               TCP                  SMC-lo
>> > > > > > Message
>> > > > > > rate (msg/s)              80636                  149515(+85.42%)
>> > > > > > 
>> > > > > > 2. sockperf
>> > > > > > 
>> > > > > > - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
>> > > > > > - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
>> > > > > > 
>> > > > > >                               TCP                  SMC-lo
>> > > > > > Bandwidth(MBps)         4909.36                 8197.57(+66.98%)
>> > > > > > Latency(us)               6.098                   3.383(-44.52%)
>> > > > > > 
>> > > > > > 3. nginx/wrk
>> > > > > > 
>> > > > > > - serv: <smc_run> nginx
>> > > > > > - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80
>> > > > > > 
>> > > > > >                              TCP                   SMC-lo
>> > > > > > Requests/s           181685.74                246447.77(+35.65%)
>> > > > > > 
>> > > > > > 4. redis-benchmark
>> > > > > > 
>> > > > > > - serv: <smc_run> redis-server
>> > > > > > - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024
>> > > > > > 
>> > > > > >                              TCP                   SMC-lo
>> > > > > > GET(Requests/s)       85855.34                118553.64(+38.09%)
>> > > > > > SET(Requests/s)       86824.40                125944.58(+45.06%)
>> > > > > > 
>> > > > > > 
>> > > > > > Change log:
>> > > > > > 
>> > > > > > v1->RFC:
>> > > > > > - Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics:
>> > > > > >     /sys/devices/virtual/smc/loopback-ism/xfer_bytes
>> > > > > > - Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports
>> > > > > >     merging sndbuf with peer DMB.
>> > > > > > - Patch #13 & #14: introduce loopback-ism device control of DMB memory type and
>> > > > > >     control of whether to merge sndbuf and DMB. They can be respectively set by:
>> > > > > >     /sys/devices/virtual/smc/loopback-ism/dmb_type
>> > > > > >     /sys/devices/virtual/smc/loopback-ism/dmb_copy
>> > > > > >     The motivation for these two control is that a performance bottleneck was
>> > > > > >     found when using vzalloced DMB and sndbuf is merged with DMB, and there are
>> > > > > >     many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused
>> > > > > >     by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg()
>> > > > > >     or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the
>> > > > > >     vmap lock contention [6]. It has significant effects, but using virtual memory
>> > > > > >     still has additional overhead compared to using physical memory.
>> > > > > >     So this new version provides controls of dmb_type and dmb_copy to suit
>> > > > > >     different scenarios.
>> > > > > > - Some minor changes and comments improvements.
>> > > > > > 
>> > > > > > RFC->old version([1]):
>> > > > > > Link: https://lore.kernel.org/netdev/1702214654-32069-1-git-send-email-guwen@linux.alibaba.com/
>> > > > > > - Patch #1: improve the loopback-ism dump, it shows as follows now:
>> > > > > >     # smcd d
>> > > > > >     FID  Type  PCI-ID        PCHID  InUse  #LGs  PNET-ID
>> > > > > >     0000 0     loopback-ism  ffff   No        0
>> > > > > > - Patch #3: introduce the smc_ism_set_v2_capable() helper and set
>> > > > > >     smc_ism_v2_capable when ISMv2 or virtual ISM is registered,
>> > > > > >     regardless of whether there is already a device in smcd device list.
>> > > > > > - Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/.
>> > > > > > - Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active
>> > > > > >     to activate or deactivate the loopback-ism.
>> > > > > > - Patch #9: introduce the statistics of loopback-ism by
>> > > > > >     /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_tytes|dmbs_cnt}.
>> > > > > > - Some minor changes and comments improvements.
>> > > > > > 
>> > > > > > [1] https://lore.kernel.org/netdev/1695568613-125057-1-git-send-email-guwen@linux.alibaba.com/
>> > > > > > [2] https://lore.kernel.org/netdev/20231219142616.80697-1-guwen@linux.alibaba.com/
>> > > > > > [3] https://github.com/goldsborough/ipc-bench
>> > > > > > [4] https://lore.kernel.org/all/3189e342-c38f-6076-b730-19a6efd732a5@linux.alibaba.com/
>> > > > > > [5] https://lore.kernel.org/all/238e63cd-e0e8-4fbf-852f-bc4d5bc35d5a@linux.alibaba.com/
>> > > > > > [6] https://lore.kernel.org/all/20240102184633.748113-1-urezki@gmail.com/
>> > > > > > 
>> > > > > > Wen Gu (15):
>> > > > > >     net/smc: improve SMC-D device dump for virtual ISM
>> > > > > >     net/smc: decouple specialized struct from SMC-D DMB registration
>> > > > > >     net/smc: introduce virtual ISM device loopback-ism
>> > > > > >     net/smc: implement ID-related operations of loopback-ism
>> > > > > >     net/smc: implement some unsupported operations of loopback-ism
>> > > > > >     net/smc: implement DMB-related operations of loopback-ism
>> > > > > >     net/smc: register loopback-ism into SMC-D device list
>> > > > > >     net/smc: introduce loopback-ism runtime switch
>> > > > > >     net/smc: introduce loopback-ism statistics attributes
>> > > > > >     net/smc: add operations to merge sndbuf with peer DMB
>> > > > > >     net/smc: attach or detach ghost sndbuf to peer DMB
>> > > > > >     net/smc: adapt cursor update when sndbuf and peer DMB are merged
>> > > > > >     net/smc: introduce loopback-ism DMB type control
>> > > > > >     net/smc: introduce loopback-ism DMB data copy control
>> > > > > >     net/smc: implement DMB-merged operations of loopback-ism
>> > > > > > 
>> > > > > > drivers/s390/net/ism_drv.c |   2 +-
>> > > > > > include/net/smc.h          |   7 +-
>> > > > > > net/smc/Kconfig            |  13 +
>> > > > > > net/smc/Makefile           |   2 +-
>> > > > > > net/smc/af_smc.c           |  28 +-
>> > > > > > net/smc/smc_cdc.c          |  58 ++-
>> > > > > > net/smc/smc_cdc.h          |   1 +
>> > > > > > net/smc/smc_core.c         |  61 +++-
>> > > > > > net/smc/smc_core.h         |   1 +
>> > > > > > net/smc/smc_ism.c          |  71 +++-
>> > > > > > net/smc/smc_ism.h          |   5 +
>> > > > > > net/smc/smc_loopback.c     | 718 +++++++++++++++++++++++++++++++++++++
>> > > > > > net/smc/smc_loopback.h     |  88 +++++
>> > > > > > 13 files changed, 1026 insertions(+), 29 deletions(-)
>> > > > > > create mode 100644 net/smc/smc_loopback.c
>> > > > > > create mode 100644 net/smc/smc_loopback.h
>> > > > > > 
>> > > > > > -- 
>> > > > > > 2.32.0.3.g01195cf9f
>> > > > > > 
>> > > > > >
Wen Gu Jan. 18, 2024, 8:27 a.m. UTC | #10
On 2024/1/11 20:00, Wen Gu wrote:
> This patch set acts as the second part of the new version of [1] (The first
> part can be referred from [2]), the updated things of this version are listed
> at the end.
> 

Hi Wenjia and Jan, I would appreciate any thoughts or comments you might have
on this series. Thank you very much!

> 
> # Design
> 
> This patch set includes 3 parts:
> 
>   - Patch #1-#2: some prepare work for loopback-ism.
>   - Patch #3-#9: implement loopback-ism device.
>   - Patch #10-#15: memory copy optimization for loopback scenario.
> 
> The loopback-ism device is designed as a ISMv2 device and not be limited to
> a specific net namespace, ends of both inter-process connection (1/1' in diagram
> below) or inter-container connection (2/2' in diagram below) can find the same
> available loopback-ism and choose it during the CLC handshake.
> 
>   Container 1 (ns1)                              Container 2 (ns2)
>   +-----------------------------------------+    +-------------------------+
>   | +-------+      +-------+      +-------+ |    |        +-------+        |
>   | | App A |      | App B |      | App C | |    |        | App D |<-+     |
>   | +-------+      +---^---+      +-------+ |    |        +-------+  |(2') |
>   |     |127.0.0.1 (1')|             |192.168.0.11       192.168.0.12|     |
>   |  (1)|   +--------+ | +--------+  |(2)   |    | +--------+   +--------+ |
>   |     `-->|   lo   |-` |  eth0  |<-`      |    | |   lo   |   |  eth0  | |
>   +---------+--|---^-+---+-----|--+---------+    +-+--------+---+-^------+-+
>                |   |           |                                  |
>   Kernel       |   |           |                                  |
>   +----+-------v---+-----------v----------------------------------+---+----+
>   |    |                            TCP                               |    |
>   |    |                                                              |    |
>   |    +--------------------------------------------------------------+    |
>   |                                                                        |
>   |                           +--------------+                             |
>   |                           | smc loopback |                             |
>   +---------------------------+--------------+-----------------------------+
> 
> loopback-ism device creates DMBs (shared memory) for each connection peer.
> Since data transfer occurs within the same kernel, the sndbuf of each peer
> is only a descriptor and point to the same memory region as peer DMB, so that
> the data copy from sndbuf to peer DMB can be avoided in loopback-ism case.
> 
>   Container 1 (ns1)                              Container 2 (ns2)
>   +-----------------------------------------+    +-------------------------+
>   | +-------+                               |    |        +-------+        |
>   | | App C |-----+                         |    |        | App D |        |
>   | +-------+     |                         |    |        +-^-----+        |
>   |               |                         |    |          |              |
>   |           (2) |                         |    |     (2') |              |
>   |               |                         |    |          |              |
>   +---------------|-------------------------+    +----------|--------------+
>                   |                                         |
>   Kernel          |                                         |
>   +---------------|-----------------------------------------|--------------+
>   | +--------+ +--v-----+                           +--------+ +--------+  |
>   | |dmb_desc| |snd_desc|                           |dmb_desc| |snd_desc|  |
>   | +-----|--+ +--|-----+                           +-----|--+ +--------+  |
>   | +-----|--+    |                                 +-----|--+             |
>   | | DMB C  |    +---------------------------------| DMB D  |             |
>   | +--------+                                      +--------+             |
>   |                                                                        |
>   |                           +--------------+                             |
>   |                           | smc loopback |                             |
>   +---------------------------+--------------+-----------------------------+
> 
> # Benchmark Test
> 
>   * Test environments:
>        - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
>        - SMC sndbuf/DMB size 1MB.
>        - /sys/devices/virtual/smc/loopback-ism/dmb_copy is set to default 0,
>          which means sndbuf and DMB are merged and no data copied between them.
>        - /sys/devices/virtual/smc/loopback-ism/dmb_type is set to default 0,
>          which means DMB is physically contiguous buffer.
> 
>   * Test object:
>        - TCP: run on TCP loopback.
>        - SMC lo: run on SMC loopback device.
> 
> 1. ipc-benchmark (see [3])
> 
>   - ./<foo> -c 1000000 -s 100
> 
>                              TCP                  SMC-lo
> Message
> rate (msg/s)              80636                  149515(+85.42%)
> 
> 2. sockperf
> 
>   - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
>   - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
> 
>                              TCP                  SMC-lo
> Bandwidth(MBps)         4909.36                 8197.57(+66.98%)
> Latency(us)               6.098                   3.383(-44.52%)
> 
> 3. nginx/wrk
> 
>   - serv: <smc_run> nginx
>   - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80
> 
>                             TCP                   SMC-lo
> Requests/s           181685.74                246447.77(+35.65%)
> 
> 4. redis-benchmark
> 
>   - serv: <smc_run> redis-server
>   - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024
> 
>                             TCP                   SMC-lo
> GET(Requests/s)       85855.34                118553.64(+38.09%)
> SET(Requests/s)       86824.40                125944.58(+45.06%)
> 
> 
> Change log:
> 
> v1->RFC:
> - Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics:
>    /sys/devices/virtual/smc/loopback-ism/xfer_bytes
> - Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports
>    merging sndbuf with peer DMB.
> - Patch #13 & #14: introduce loopback-ism device control of DMB memory type and
>    control of whether to merge sndbuf and DMB. They can be respectively set by:
>    /sys/devices/virtual/smc/loopback-ism/dmb_type
>    /sys/devices/virtual/smc/loopback-ism/dmb_copy
>    The motivation for these two control is that a performance bottleneck was
>    found when using vzalloced DMB and sndbuf is merged with DMB, and there are
>    many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused
>    by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg()
>    or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the
>    vmap lock contention [6]. It has significant effects, but using virtual memory
>    still has additional overhead compared to using physical memory.
>    So this new version provides controls of dmb_type and dmb_copy to suit
>    different scenarios.
> - Some minor changes and comments improvements.
> 
> RFC->old version([1]):
> Link: https://lore.kernel.org/netdev/1702214654-32069-1-git-send-email-guwen@linux.alibaba.com/
> - Patch #1: improve the loopback-ism dump, it shows as follows now:
>    # smcd d
>    FID  Type  PCI-ID        PCHID  InUse  #LGs  PNET-ID
>    0000 0     loopback-ism  ffff   No        0
> - Patch #3: introduce the smc_ism_set_v2_capable() helper and set
>    smc_ism_v2_capable when ISMv2 or virtual ISM is registered,
>    regardless of whether there is already a device in smcd device list.
> - Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/.
> - Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active
>    to activate or deactivate the loopback-ism.
> - Patch #9: introduce the statistics of loopback-ism by
>    /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_tytes|dmbs_cnt}.
> - Some minor changes and comments improvements.
> 
> [1] https://lore.kernel.org/netdev/1695568613-125057-1-git-send-email-guwen@linux.alibaba.com/
> [2] https://lore.kernel.org/netdev/20231219142616.80697-1-guwen@linux.alibaba.com/
> [3] https://github.com/goldsborough/ipc-bench
> [4] https://lore.kernel.org/all/3189e342-c38f-6076-b730-19a6efd732a5@linux.alibaba.com/
> [5] https://lore.kernel.org/all/238e63cd-e0e8-4fbf-852f-bc4d5bc35d5a@linux.alibaba.com/
> [6] https://lore.kernel.org/all/20240102184633.748113-1-urezki@gmail.com/
> 
> Wen Gu (15):
>    net/smc: improve SMC-D device dump for virtual ISM
>    net/smc: decouple specialized struct from SMC-D DMB registration
>    net/smc: introduce virtual ISM device loopback-ism
>    net/smc: implement ID-related operations of loopback-ism
>    net/smc: implement some unsupported operations of loopback-ism
>    net/smc: implement DMB-related operations of loopback-ism
>    net/smc: register loopback-ism into SMC-D device list
>    net/smc: introduce loopback-ism runtime switch
>    net/smc: introduce loopback-ism statistics attributes
>    net/smc: add operations to merge sndbuf with peer DMB
>    net/smc: attach or detach ghost sndbuf to peer DMB
>    net/smc: adapt cursor update when sndbuf and peer DMB are merged
>    net/smc: introduce loopback-ism DMB type control
>    net/smc: introduce loopback-ism DMB data copy control
>    net/smc: implement DMB-merged operations of loopback-ism
> 
>   drivers/s390/net/ism_drv.c |   2 +-
>   include/net/smc.h          |   7 +-
>   net/smc/Kconfig            |  13 +
>   net/smc/Makefile           |   2 +-
>   net/smc/af_smc.c           |  28 +-
>   net/smc/smc_cdc.c          |  58 ++-
>   net/smc/smc_cdc.h          |   1 +
>   net/smc/smc_core.c         |  61 +++-
>   net/smc/smc_core.h         |   1 +
>   net/smc/smc_ism.c          |  71 +++-
>   net/smc/smc_ism.h          |   5 +
>   net/smc/smc_loopback.c     | 718 +++++++++++++++++++++++++++++++++++++
>   net/smc/smc_loopback.h     |  88 +++++
>   13 files changed, 1026 insertions(+), 29 deletions(-)
>   create mode 100644 net/smc/smc_loopback.c
>   create mode 100644 net/smc/smc_loopback.h
>
Wenjia Zhang Jan. 18, 2024, 1:59 p.m. UTC | #11
On 18.01.24 09:27, Wen Gu wrote:
> 
> 
> On 2024/1/11 20:00, Wen Gu wrote:
>> This patch set acts as the second part of the new version of [1] (The 
>> first
>> part can be referred from [2]), the updated things of this version are 
>> listed
>> at the end.
>>
> 
> Hi Wenjia and Jan, I would appreciate any thoughts or comments you might 
> have
> on this series. Thank you very much!
> 
Hi Wen,

I'm still in the middle of the proto type on IPPROTO_SMC and other 
issues, so that I need more time to review this patch series.

Thank you for your patience!
Wenjia
Wen Gu Jan. 19, 2024, 1:46 a.m. UTC | #12
On 2024/1/18 21:59, Wenjia Zhang wrote:
> 
> 
> On 18.01.24 09:27, Wen Gu wrote:
>>
>>
>> On 2024/1/11 20:00, Wen Gu wrote:
>>> This patch set acts as the second part of the new version of [1] (The first
>>> part can be referred from [2]), the updated things of this version are listed
>>> at the end.
>>>
>>
>> Hi Wenjia and Jan, I would appreciate any thoughts or comments you might have
>> on this series. Thank you very much!
>>
> Hi Wen,
> 
> I'm still in the middle of the proto type on IPPROTO_SMC and other issues, so that I need more time to review this patch 
> series.
> 
> Thank you for your patience!
> Wenjia

Understood. Thank you! Wenjia.

Best regards,
Wen Gu
Alexandra Winter Jan. 23, 2024, 2:03 p.m. UTC | #13
On 19.01.24 02:46, Wen Gu wrote:
> 
> 
> On 2024/1/18 21:59, Wenjia Zhang wrote:
>>
>>
>> On 18.01.24 09:27, Wen Gu wrote:
>>>
>>>
>>> On 2024/1/11 20:00, Wen Gu wrote:
>>>> This patch set acts as the second part of the new version of [1] (The first
>>>> part can be referred from [2]), the updated things of this version are listed
>>>> at the end.
>>>>
>>>
>>> Hi Wenjia and Jan, I would appreciate any thoughts or comments you might have
>>> on this series. Thank you very much!
>>>
>> Hi Wen,
>>
>> I'm still in the middle of the proto type on IPPROTO_SMC and other issues, so that I need more time to review this patch series.
>>
>> Thank you for your patience!
>> Wenjia
> 
> Understood. Thank you! Wenjia.
> 
> Best regards,
> Wen Gu
> 

Hello Wen Gu and others,

I just wanted to let you know that unfortunately both Wenjia and Jan have called in sick and we don't know
when they will be back at work. 
So I'm sorry but there may be mroe delays in the review of this patchset.

Kind regards
Alexandra Winter
Wen Gu Jan. 24, 2024, 6:33 a.m. UTC | #14
On 2024/1/23 22:03, Alexandra Winter wrote:
> Hello Wen Gu and others,
> 
> I just wanted to let you know that unfortunately both Wenjia and Jan have called in sick and we don't know
> when they will be back at work.
> So I'm sorry but there may be mroe delays in the review of this patchset.
> 
> Kind regards
> Alexandra Winter

Hi Alexandra,

Thank you for the update. Health comes first. Wishing Wenjia and Jan
both make a swift recovery.

Best regards,
Wen Gu
Wen Gu Feb. 5, 2024, 10:05 a.m. UTC | #15
On 2024/1/24 14:33, Wen Gu wrote:
> 
> 
> On 2024/1/23 22:03, Alexandra Winter wrote:
>> Hello Wen Gu and others,
>>
>> I just wanted to let you know that unfortunately both Wenjia and Jan have called in sick and we don't know
>> when they will be back at work.
>> So I'm sorry but there may be mroe delays in the review of this patchset.
>>
>> Kind regards
>> Alexandra Winter
> 
> Hi Alexandra,
> 
> Thank you for the update. Health comes first. Wishing Wenjia and Jan
> both make a swift recovery.
> 
> Best regards,
> Wen Gu

Hi, Wenjia and Jan

I would like to ask if I should wait for the review of this version
or send a v2 (with some minor changes) ?

Thanks!
Alexandra Winter Feb. 6, 2024, 12:18 p.m. UTC | #16
On 11.01.24 13:00, Wen Gu wrote:
> This patch set acts as the second part of the new version of [1] (The first
> part can be referred from [2]), the updated things of this version are listed
> at the end.
> 
> # Background
> 
> SMC-D is now used in IBM z with ISM function to optimize network interconnect
> for intra-CPC communications. Inspired by this, we try to make SMC-D available
> on the non-s390 architecture through a software-implemented virtual ISM device,
> that is the loopback-ism device here, to accelerate inter-process or
> inter-containers communication within the same OS instance.


Hello Wen Gu,

thank you very much for this patchset. I have been looking at it a bit.
I installed in on a testserver, but did not yet excercise the loopback-ism device.
I want to continue investigations, but daily work interferes, so I thought I
send you some comments now. So this is not at all a code review, but some 
thoughts and observations about the general concept.


In [1] there was a discussion about an abstraction layer between smc-d and the
ism devices. 
I am not sure what you are proposing now, is it an smc-d feature or independent of smc?
In 3/15 you say it is part of the SMC module, but then it has its own device entry.
Didn't you want to use it for other things as well? Or is it an SMC-D only feature?
Is it a device (Config help: "kind of virtual device")? Or an SMC-D feature?

Will we have a class of ism devices (s390 ism, ism-loopback, virtio-ism)
That share common properties (internal API?)
and smc-d will work with any of those?
But they all can exist without smc ?! BTW: This is what we want for s390-ism.
The client-registration interface [2] is currently the way to achieve this. 
But maybe we need a more general concept?

Maybe first a preparation patchset that introduces a class/ism
Together with an improved API?
In case you want to use ISM devices for other purposes as well..
But then the whole picture of ism-loopback in one patchset (RFC?) 
has its benefits as well.


Other points that I noticed:

Naming: smc loopback, ism-loopback, loopback-ism ?

config: why not tristate? Why under net/smc?

/sys/devices/virtual/smc  does not initially show up in my installation!!!
root@t35lp50:/sys/devices/virtual/> ls
3270/  bdi/  block/  graphics/  iommu/  mem/  memory_tiering/  misc/  net/  tty/  vc/  vtconsole/  workqueue/
root@t35lp50:/sys/devices/virtual/> ls smc/loopback-ism
active  dmb_copy  dmbs_cnt  dmb_type  subsystem@  uevent  xfer_bytes
root@t35lp50:/sys/devices/virtual/> ls
3270/  bdi/  block/  graphics/  iommu/  mem/  memory_tiering/  misc/  net/  smc/  tty/  vc/  vtconsole/  workqueue/
Is that normal behaviour?

You introduced a class/smc
Maybe class/ism would be better?
The other smc interfaces do not show up in class/smc!! Not so good

Why doesn't it show in smc_rnics?
(Maybe some deficiency of smc_rnics?)

But then it shows in other smc-tools:
root@t35lp50:/sys/> smcd device
FID  Type  PCI-ID        PCHID  InUse  #LGs  PNET-ID
0000 0     loopback-ism  ffff   No        0
0029 ISM   0000:00:00.0  07c1   No        0  NET1
Nice!

Kind regards
Sandy


[1] https://lore.kernel.org/lkml/e3819550-7b10-4f9c-7347-dcf1f97b8e6b@linux.alibaba.com/
[2] 89e7d2ba61b7 ("net/ism: Add new API for client registration")
Wenjia Zhang Feb. 7, 2024, 9:08 a.m. UTC | #17
On 05.02.24 11:05, Wen Gu wrote:
> 
> On 2024/1/24 14:33, Wen Gu wrote:
>>
>>
>> On 2024/1/23 22:03, Alexandra Winter wrote:
>>> Hello Wen Gu and others,
>>>
>>> I just wanted to let you know that unfortunately both Wenjia and Jan 
>>> have called in sick and we don't know
>>> when they will be back at work.
>>> So I'm sorry but there may be mroe delays in the review of this 
>>> patchset.
>>>
>>> Kind regards
>>> Alexandra Winter
>>
>> Hi Alexandra,
>>
>> Thank you for the update. Health comes first. Wishing Wenjia and Jan
>> both make a swift recovery.
>>
>> Best regards,
>> Wen Gu
> 
> Hi, Wenjia and Jan
> 
> I would like to ask if I should wait for the review of this version
> or send a v2 (with some minor changes) ?
> 
> Thanks!

Hi Wen,

Finally I can carve out some time on this patches, the review is still 
ongoing. I'll send my comments out, as soon as I finish all of them.

Thank you for the patience!
Wenjia
Wen Gu Feb. 8, 2024, 4:12 p.m. UTC | #18
On 2024/2/6 20:18, Alexandra Winter wrote:
> 
> 
> On 11.01.24 13:00, Wen Gu wrote:
>> This patch set acts as the second part of the new version of [1] (The first
>> part can be referred from [2]), the updated things of this version are listed
>> at the end.
>>
>> # Background
>>
>> SMC-D is now used in IBM z with ISM function to optimize network interconnect
>> for intra-CPC communications. Inspired by this, we try to make SMC-D available
>> on the non-s390 architecture through a software-implemented virtual ISM device,
>> that is the loopback-ism device here, to accelerate inter-process or
>> inter-containers communication within the same OS instance.
> 
> 
> Hello Wen Gu,
> 
> thank you very much for this patchset. I have been looking at it a bit.
> I installed in on a testserver, but did not yet excercise the loopback-ism device.
> I want to continue investigations, but daily work interferes, so I thought I
> send you some comments now. So this is not at all a code review, but some
> thoughts and observations about the general concept.
> 

Hi Sandy and Wenjia,

Thank you very much for your feedback!
I am working on the detailed replies. As we are on holiday for Chinese New Year,
the progress may be slower. But please feel free to leave any other comments and
feedback, thank you!

Best regards!
Wen Gu

> 
> In [1] there was a discussion about an abstraction layer between smc-d and the
> ism devices.
> I am not sure what you are proposing now, is it an smc-d feature or independent of smc?
> In 3/15 you say it is part of the SMC module, but then it has its own device entry.
> Didn't you want to use it for other things as well? Or is it an SMC-D only feature?
> Is it a device (Config help: "kind of virtual device")? Or an SMC-D feature?
> 
> Will we have a class of ism devices (s390 ism, ism-loopback, virtio-ism)
> That share common properties (internal API?)
> and smc-d will work with any of those?
> But they all can exist without smc ?! BTW: This is what we want for s390-ism.
> The client-registration interface [2] is currently the way to achieve this.
> But maybe we need a more general concept?
> 
> Maybe first a preparation patchset that introduces a class/ism
> Together with an improved API?
> In case you want to use ISM devices for other purposes as well..
> But then the whole picture of ism-loopback in one patchset (RFC?)
> has its benefits as well.
> 
> 
> Other points that I noticed:
> 
> Naming: smc loopback, ism-loopback, loopback-ism ?
> 
> config: why not tristate? Why under net/smc?
> 
> /sys/devices/virtual/smc  does not initially show up in my installation!!!
> root@t35lp50:/sys/devices/virtual/> ls
> 3270/  bdi/  block/  graphics/  iommu/  mem/  memory_tiering/  misc/  net/  tty/  vc/  vtconsole/  workqueue/
> root@t35lp50:/sys/devices/virtual/> ls smc/loopback-ism
> active  dmb_copy  dmbs_cnt  dmb_type  subsystem@  uevent  xfer_bytes
> root@t35lp50:/sys/devices/virtual/> ls
> 3270/  bdi/  block/  graphics/  iommu/  mem/  memory_tiering/  misc/  net/  smc/  tty/  vc/  vtconsole/  workqueue/
> Is that normal behaviour?
> 
> You introduced a class/smc
> Maybe class/ism would be better?
> The other smc interfaces do not show up in class/smc!! Not so good
> 
> Why doesn't it show in smc_rnics?
> (Maybe some deficiency of smc_rnics?)
> 
> But then it shows in other smc-tools:
> root@t35lp50:/sys/> smcd device
> FID  Type  PCI-ID        PCHID  InUse  #LGs  PNET-ID
> 0000 0     loopback-ism  ffff   No        0
> 0029 ISM   0000:00:00.0  07c1   No        0  NET1
> Nice!
> 
> Kind regards
> Sandy
> 
> 
> [1] https://lore.kernel.org/lkml/e3819550-7b10-4f9c-7347-dcf1f97b8e6b@linux.alibaba.com/
> [2] 89e7d2ba61b7 ("net/ism: Add new API for client registration")
Wenjia Zhang Feb. 16, 2024, 2:09 p.m. UTC | #19
On 11.01.24 13:00, Wen Gu wrote:
> This patch set acts as the second part of the new version of [1] (The first
> part can be referred from [2]), the updated things of this version are listed
> at the end.
> 
> # Background
> 
> SMC-D is now used in IBM z with ISM function to optimize network interconnect
> for intra-CPC communications. Inspired by this, we try to make SMC-D available
> on the non-s390 architecture through a software-implemented virtual ISM device,
> that is the loopback-ism device here, to accelerate inter-process or
> inter-containers communication within the same OS instance.
> 
> # Design
> 
> This patch set includes 3 parts:
> 
>   - Patch #1-#2: some prepare work for loopback-ism.
>   - Patch #3-#9: implement loopback-ism device.
>   - Patch #10-#15: memory copy optimization for loopback scenario.
> 
> The loopback-ism device is designed as a ISMv2 device and not be limited to
> a specific net namespace, ends of both inter-process connection (1/1' in diagram
> below) or inter-container connection (2/2' in diagram below) can find the same
> available loopback-ism and choose it during the CLC handshake.
> 
>   Container 1 (ns1)                              Container 2 (ns2)
>   +-----------------------------------------+    +-------------------------+
>   | +-------+      +-------+      +-------+ |    |        +-------+        |
>   | | App A |      | App B |      | App C | |    |        | App D |<-+     |
>   | +-------+      +---^---+      +-------+ |    |        +-------+  |(2') |
>   |     |127.0.0.1 (1')|             |192.168.0.11       192.168.0.12|     |
>   |  (1)|   +--------+ | +--------+  |(2)   |    | +--------+   +--------+ |
>   |     `-->|   lo   |-` |  eth0  |<-`      |    | |   lo   |   |  eth0  | |
>   +---------+--|---^-+---+-----|--+---------+    +-+--------+---+-^------+-+
>                |   |           |                                  |
>   Kernel       |   |           |                                  |
>   +----+-------v---+-----------v----------------------------------+---+----+
>   |    |                            TCP                               |    |
>   |    |                                                              |    |
>   |    +--------------------------------------------------------------+    |
>   |                                                                        |
>   |                           +--------------+                             |
>   |                           | smc loopback |                             |
>   +---------------------------+--------------+-----------------------------+
> 
> loopback-ism device creates DMBs (shared memory) for each connection peer.
> Since data transfer occurs within the same kernel, the sndbuf of each peer
> is only a descriptor and point to the same memory region as peer DMB, so that
> the data copy from sndbuf to peer DMB can be avoided in loopback-ism case.
> 
>   Container 1 (ns1)                              Container 2 (ns2)
>   +-----------------------------------------+    +-------------------------+
>   | +-------+                               |    |        +-------+        |
>   | | App C |-----+                         |    |        | App D |        |
>   | +-------+     |                         |    |        +-^-----+        |
>   |               |                         |    |          |              |
>   |           (2) |                         |    |     (2') |              |
>   |               |                         |    |          |              |
>   +---------------|-------------------------+    +----------|--------------+
>                   |                                         |
>   Kernel          |                                         |
>   +---------------|-----------------------------------------|--------------+
>   | +--------+ +--v-----+                           +--------+ +--------+  |
>   | |dmb_desc| |snd_desc|                           |dmb_desc| |snd_desc|  |
>   | +-----|--+ +--|-----+                           +-----|--+ +--------+  |
>   | +-----|--+    |                                 +-----|--+             |
>   | | DMB C  |    +---------------------------------| DMB D  |             |
>   | +--------+                                      +--------+             |
>   |                                                                        |
>   |                           +--------------+                             |
>   |                           | smc loopback |                             |
>   +---------------------------+--------------+-----------------------------+
> 
> # Benchmark Test
> 
>   * Test environments:
>        - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
>        - SMC sndbuf/DMB size 1MB.
>        - /sys/devices/virtual/smc/loopback-ism/dmb_copy is set to default 0,
>          which means sndbuf and DMB are merged and no data copied between them.
>        - /sys/devices/virtual/smc/loopback-ism/dmb_type is set to default 0,
>          which means DMB is physically contiguous buffer.
> 
>   * Test object:
>        - TCP: run on TCP loopback.
>        - SMC lo: run on SMC loopback device.
> 
> 1. ipc-benchmark (see [3])
> 
>   - ./<foo> -c 1000000 -s 100
> 
>                              TCP                  SMC-lo
> Message
> rate (msg/s)              80636                  149515(+85.42%)
> 
> 2. sockperf
> 
>   - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
>   - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30
> 
>                              TCP                  SMC-lo
> Bandwidth(MBps)         4909.36                 8197.57(+66.98%)
> Latency(us)               6.098                   3.383(-44.52%)
> 
> 3. nginx/wrk
> 
>   - serv: <smc_run> nginx
>   - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80
> 
>                             TCP                   SMC-lo
> Requests/s           181685.74                246447.77(+35.65%)
> 
> 4. redis-benchmark
> 
>   - serv: <smc_run> redis-server
>   - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024
> 
>                             TCP                   SMC-lo
> GET(Requests/s)       85855.34                118553.64(+38.09%)
> SET(Requests/s)       86824.40                125944.58(+45.06%)
> 
> 
> Change log:
> 
> v1->RFC:
> - Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics:
>    /sys/devices/virtual/smc/loopback-ism/xfer_bytes
> - Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports
>    merging sndbuf with peer DMB.
> - Patch #13 & #14: introduce loopback-ism device control of DMB memory type and
>    control of whether to merge sndbuf and DMB. They can be respectively set by:
>    /sys/devices/virtual/smc/loopback-ism/dmb_type
>    /sys/devices/virtual/smc/loopback-ism/dmb_copy
>    The motivation for these two control is that a performance bottleneck was
>    found when using vzalloced DMB and sndbuf is merged with DMB, and there are
>    many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused
>    by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg()
>    or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the
>    vmap lock contention [6]. It has significant effects, but using virtual memory
>    still has additional overhead compared to using physical memory.
>    So this new version provides controls of dmb_type and dmb_copy to suit
>    different scenarios.
> - Some minor changes and comments improvements.
> 
> RFC->old version([1]):
> Link: https://lore.kernel.org/netdev/1702214654-32069-1-git-send-email-guwen@linux.alibaba.com/
> - Patch #1: improve the loopback-ism dump, it shows as follows now:
>    # smcd d
>    FID  Type  PCI-ID        PCHID  InUse  #LGs  PNET-ID
>    0000 0     loopback-ism  ffff   No        0
> - Patch #3: introduce the smc_ism_set_v2_capable() helper and set
>    smc_ism_v2_capable when ISMv2 or virtual ISM is registered,
>    regardless of whether there is already a device in smcd device list.
> - Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/.
> - Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active
>    to activate or deactivate the loopback-ism.
> - Patch #9: introduce the statistics of loopback-ism by
>    /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_tytes|dmbs_cnt}.
> - Some minor changes and comments improvements.
> 
> [1] https://lore.kernel.org/netdev/1695568613-125057-1-git-send-email-guwen@linux.alibaba.com/
> [2] https://lore.kernel.org/netdev/20231219142616.80697-1-guwen@linux.alibaba.com/
> [3] https://github.com/goldsborough/ipc-bench
> [4] https://lore.kernel.org/all/3189e342-c38f-6076-b730-19a6efd732a5@linux.alibaba.com/
> [5] https://lore.kernel.org/all/238e63cd-e0e8-4fbf-852f-bc4d5bc35d5a@linux.alibaba.com/
> [6] https://lore.kernel.org/all/20240102184633.748113-1-urezki@gmail.com/
> 
> Wen Gu (15):
>    net/smc: improve SMC-D device dump for virtual ISM
>    net/smc: decouple specialized struct from SMC-D DMB registration
>    net/smc: introduce virtual ISM device loopback-ism
>    net/smc: implement ID-related operations of loopback-ism
>    net/smc: implement some unsupported operations of loopback-ism
>    net/smc: implement DMB-related operations of loopback-ism
>    net/smc: register loopback-ism into SMC-D device list
>    net/smc: introduce loopback-ism runtime switch
>    net/smc: introduce loopback-ism statistics attributes
>    net/smc: add operations to merge sndbuf with peer DMB
>    net/smc: attach or detach ghost sndbuf to peer DMB
>    net/smc: adapt cursor update when sndbuf and peer DMB are merged
>    net/smc: introduce loopback-ism DMB type control
>    net/smc: introduce loopback-ism DMB data copy control
>    net/smc: implement DMB-merged operations of loopback-ism
> 
>   drivers/s390/net/ism_drv.c |   2 +-
>   include/net/smc.h          |   7 +-
>   net/smc/Kconfig            |  13 +
>   net/smc/Makefile           |   2 +-
>   net/smc/af_smc.c           |  28 +-
>   net/smc/smc_cdc.c          |  58 ++-
>   net/smc/smc_cdc.h          |   1 +
>   net/smc/smc_core.c         |  61 +++-
>   net/smc/smc_core.h         |   1 +
>   net/smc/smc_ism.c          |  71 +++-
>   net/smc/smc_ism.h          |   5 +
>   net/smc/smc_loopback.c     | 718 +++++++++++++++++++++++++++++++++++++
>   net/smc/smc_loopback.h     |  88 +++++
>   13 files changed, 1026 insertions(+), 29 deletions(-)
>   create mode 100644 net/smc/smc_loopback.c
>   create mode 100644 net/smc/smc_loopback.h
> 
Hi Wen,

Thank you for the patience again!

You can find the comments under the corresponding patches respectively.
About the file hierarchy in sysfs and the names, we still have some 
thoughts. We need to investigate a bit more time on it.

Thanks,
Gerd & Wenjia
Wen Gu Feb. 19, 2024, 2:04 p.m. UTC | #20
On 2024/2/6 20:18, Alexandra Winter wrote:
> 
> 
> On 11.01.24 13:00, Wen Gu wrote:
>> This patch set acts as the second part of the new version of [1] (The first
>> part can be referred from [2]), the updated things of this version are listed
>> at the end.
>>
>> # Background
>>
>> SMC-D is now used in IBM z with ISM function to optimize network interconnect
>> for intra-CPC communications. Inspired by this, we try to make SMC-D available
>> on the non-s390 architecture through a software-implemented virtual ISM device,
>> that is the loopback-ism device here, to accelerate inter-process or
>> inter-containers communication within the same OS instance.
> 
> 
> Hello Wen Gu,
> 
> thank you very much for this patchset. I have been looking at it a bit.
> I installed in on a testserver, but did not yet excercise the loopback-ism device.
> I want to continue investigations, but daily work interferes, so I thought I
> send you some comments now. So this is not at all a code review, but some
> thoughts and observations about the general concept.
> 

Thank you very much, Sandy.

> 
> In [1] there was a discussion about an abstraction layer between smc-d and the
> ism devices.
> I am not sure what you are proposing now, is it an smc-d feature or independent of smc?
> In 3/15 you say it is part of the SMC module, but then it has its own device entry.
> Didn't you want to use it for other things as well? Or is it an SMC-D only feature?
> Is it a device (Config help: "kind of virtual device")? Or an SMC-D feature?
> 

This patchset aims to propose an SMC feature, which is SMC-D loopback. The main work
to achieve this feature is to implement an Emulated-ISM, which is loopback-ism. The
loopback-ism is a 'built-in' dummy device of SMC and only serves SMC.

SMC-D protocol + 'built-in dummy device' (loopback-ism device) = SMC-D loopback feature.

To provide the runtime switch and statistics of loopback-ism, I need to find a sysfs
entry for it, since it doesn't belong to any class (e.g. pci_bus), I created an 'smc'
entry under /sys/devices/virtual/ and put loopback-ism under it.

The other SMC devices, such as RoCE, s390 ISM, virtio-ism will be in their own sysfs
entry, not under the /sys/devices/*virtual*/smc/.

The Config help is somewhat inaccurate. To be more precise, the SMC_LO config is used to
configure whether to enable this built-in dummy device for intra-OS communication.

> Will we have a class of ism devices (s390 ism, ism-loopback, virtio-ism)
> That share common properties (internal API?)
> and smc-d will work with any of those? > But they all can exist without smc ?! BTW: This is what we want for s390-ism.
> The client-registration interface [2] is currently the way to achieve this.
> But maybe we need a more general concept?
> 

I didn't mean to create a class to cover all the ISM devices. It is only for
loopback-ism. Because loopback-ism can not be classified, so I create an entry
under /sys/devices/virtual/.

> Maybe first a preparation patchset that introduces a class/ism
> Together with an improved API?
> In case you want to use ISM devices for other purposes as well..
> But then the whole picture of ism-loopback in one patchset (RFC?)
> has its benefits as well.
> 

Sorry for causing, I didn't mean to create a class to cover all the ISM devices.
They should be in their own sysfs entries (e.g. pci_bus), since they will be used
out of SMC. Only loopback-ism belongs only to SMC.

> 
> Other points that I noticed:
> 
> Naming: smc loopback, ism-loopback, loopback-ism ?
> 
> config: why not tristate? Why under net/smc?
> 

'SMC-D loopback' or 'SMC loopback' is used to indicate the feature or capability.
'loopback-ism' is the emulated-ISM device that 'SMC/SMC-D loopback' used.
('ism-loopback' doesn't seem to appear in my patchset)
If we all agree with these, I will check all the terms in the patch and unify them.

SMC_LO is used to configure whether SMC is allowed to use loopback-ism (CONFIG_SMC_LO),
it acts as a check in the code, so I defined it as a bool.
And loopback-ism only serves SMC-D loopback, as a feature of SMC, so the implementation
(net/smc/smc_loopback.{c|h}) is under net/smc.

> /sys/devices/virtual/smc  does not initially show up in my installation!!!
> root@t35lp50:/sys/devices/virtual/> ls
> 3270/  bdi/  block/  graphics/  iommu/  mem/  memory_tiering/  misc/  net/  tty/  vc/  vtconsole/  workqueue/
> root@t35lp50:/sys/devices/virtual/> ls smc/loopback-ism
> active  dmb_copy  dmbs_cnt  dmb_type  subsystem@  uevent  xfer_bytes
> root@t35lp50:/sys/devices/virtual/> ls
> 3270/  bdi/  block/  graphics/  iommu/  mem/  memory_tiering/  misc/  net/  smc/  tty/  vc/  vtconsole/  workqueue/
> Is that normal behaviour?
> 

/sys/devices/virtual/smc is created after SMC module initialization.
During the SMC module initialization, smc_loopback_init() is called, and the
/sys/devices/virtual/smc entry is created.

> You introduced a class/smc
> Maybe class/ism would be better?
> The other smc interfaces do not show up in class/smc!! Not so good
> 

Sorry for causing, I didn't mean to create a class to cover all the ISM devices.
They should be in their own sysfs entries (e.g. pci_bus), since they can be used
out of SMC. But loopback-ism is a SMC 'built-in' dummy device, it belongs only
to SMC and can't be classified to other entries.


> Why doesn't it show in smc_rnics?
> (Maybe some deficiency of smc_rnics?)
> 
smc_rnics can't be used on the arch other than s390.

# ./smc_rnics  -a
Error: s390/s390x supported only


> But then it shows in other smc-tools:
> root@t35lp50:/sys/> smcd device
> FID  Type  PCI-ID        PCHID  InUse  #LGs  PNET-ID
> 0000 0     loopback-ism  ffff   No        0
> 0029 ISM   0000:00:00.0  07c1   No        0  NET1
> Nice!
> 

Yes, this is did on patch 01/15.

Best regards,
Wen Gu

> Kind regards
> Sandy
> 
> 
> [1] https://lore.kernel.org/lkml/e3819550-7b10-4f9c-7347-dcf1f97b8e6b@linux.alibaba.com/
> [2] 89e7d2ba61b7 ("net/ism: Add new API for client registration")
Wen Gu Feb. 20, 2024, 3:52 a.m. UTC | #21
On 2024/2/16 22:09, Wenjia Zhang wrote:
> 
> 
> On 11.01.24 13:00, Wen Gu wrote:
>> This patch set acts as the second part of the new version of [1] (The first
>> part can be referred from [2]), the updated things of this version are listed
>> at the end.
>>
>> # Background
>>
>> SMC-D is now used in IBM z with ISM function to optimize network interconnect
>> for intra-CPC communications. Inspired by this, we try to make SMC-D available
>> on the non-s390 architecture through a software-implemented virtual ISM device,
>> that is the loopback-ism device here, to accelerate inter-process or
>> inter-containers communication within the same OS instance.
>>
>> # Design
>>
>> This patch set includes 3 parts:
>>
>>   - Patch #1-#2: some prepare work for loopback-ism.
>>   - Patch #3-#9: implement loopback-ism device.
>>   - Patch #10-#15: memory copy optimization for loopback scenario.
>>
>> The loopback-ism device is designed as a ISMv2 device and not be limited to
>> a specific net namespace, ends of both inter-process connection (1/1' in diagram
>> below) or inter-container connection (2/2' in diagram below) can find the same
>> available loopback-ism and choose it during the CLC handshake.
>>
>>   Container 1 (ns1)                              Container 2 (ns2)
>>   +-----------------------------------------+    +-------------------------+
>>   | +-------+      +-------+      +-------+ |    |        +-------+        |
>>   | | App A |      | App B |      | App C | |    |        | App D |<-+     |
>>   | +-------+      +---^---+      +-------+ |    |        +-------+  |(2') |
>>   |     |127.0.0.1 (1')|             |192.168.0.11       192.168.0.12|     |
>>   |  (1)|   +--------+ | +--------+  |(2)   |    | +--------+   +--------+ |
>>   |     `-->|   lo   |-` |  eth0  |<-`      |    | |   lo   |   |  eth0  | |
>>   +---------+--|---^-+---+-----|--+---------+    +-+--------+---+-^------+-+
>>                |   |           |                                  |
>>   Kernel       |   |           |                                  |
>>   +----+-------v---+-----------v----------------------------------+---+----+
>>   |    |                            TCP                               |    |
>>   |    |                                                              |    |
>>   |    +--------------------------------------------------------------+    |
>>   |                                                                        |
>>   |                           +--------------+                             |
>>   |                           | smc loopback |                             |
>>   +---------------------------+--------------+-----------------------------+
>>
>> loopback-ism device creates DMBs (shared memory) for each connection peer.
>> Since data transfer occurs within the same kernel, the sndbuf of each peer
>> is only a descriptor and point to the same memory region as peer DMB, so that
>> the data copy from sndbuf to peer DMB can be avoided in loopback-ism case.
>>
>>   Container 1 (ns1)                              Container 2 (ns2)
>>   +-----------------------------------------+    +-------------------------+
>>   | +-------+                               |    |        +-------+        |
>>   | | App C |-----+                         |    |        | App D |        |
>>   | +-------+     |                         |    |        +-^-----+        |
>>   |               |                         |    |          |              |
>>   |           (2) |                         |    |     (2') |              |
>>   |               |                         |    |          |              |
>>   +---------------|-------------------------+    +----------|--------------+
>>                   |                                         |
>>   Kernel          |                                         |
>>   +---------------|-----------------------------------------|--------------+
>>   | +--------+ +--v-----+                           +--------+ +--------+  |
>>   | |dmb_desc| |snd_desc|                           |dmb_desc| |snd_desc|  |
>>   | +-----|--+ +--|-----+                           +-----|--+ +--------+  |
>>   | +-----|--+    |                                 +-----|--+             |
>>   | | DMB C  |    +---------------------------------| DMB D  |             |
>>   | +--------+                                      +--------+             |
>>   |                                                                        |
>>   |                           +--------------+                             |
>>   |                           | smc loopback |                             |
>>   +---------------------------+--------------+-----------------------------+
>>
>> # Benchmark Test
>>
>>   * Test environments:
>>        - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem.
>>        - SMC sndbuf/DMB size 1MB.
>>        - /sys/devices/virtual/smc/loopback-ism/dmb_copy is set to default 0,
>>          which means sndbuf and DMB are merged and no data copied between them.
>>        - /sys/devices/virtual/smc/loopback-ism/dmb_type is set to default 0,
>>          which means DMB is physically contiguous buffer.
>>
>>   * Test object:
>>        - TCP: run on TCP loopback.
>>        - SMC lo: run on SMC loopback device.
>>
>> 1. ipc-benchmark (see [3])
>>
>>   - ./<foo> -c 1000000 -s 100
>>
>>                              TCP                  SMC-lo
>> Message
>> rate (msg/s)              80636                  149515(+85.42%)
>>
>> 2. sockperf
>>
>>   - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp
>>   - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 
>> -t 30
>>
>>                              TCP                  SMC-lo
>> Bandwidth(MBps)         4909.36                 8197.57(+66.98%)
>> Latency(us)               6.098                   3.383(-44.52%)
>>
>> 3. nginx/wrk
>>
>>   - serv: <smc_run> nginx
>>   - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80
>>
>>                             TCP                   SMC-lo
>> Requests/s           181685.74                246447.77(+35.65%)
>>
>> 4. redis-benchmark
>>
>>   - serv: <smc_run> redis-server
>>   - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024
>>
>>                             TCP                   SMC-lo
>> GET(Requests/s)       85855.34                118553.64(+38.09%)
>> SET(Requests/s)       86824.40                125944.58(+45.06%)
>>
>>
>> Change log:
>>
>> v1->RFC:
>> - Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics:
>>    /sys/devices/virtual/smc/loopback-ism/xfer_bytes
>> - Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports
>>    merging sndbuf with peer DMB.
>> - Patch #13 & #14: introduce loopback-ism device control of DMB memory type and
>>    control of whether to merge sndbuf and DMB. They can be respectively set by:
>>    /sys/devices/virtual/smc/loopback-ism/dmb_type
>>    /sys/devices/virtual/smc/loopback-ism/dmb_copy
>>    The motivation for these two control is that a performance bottleneck was
>>    found when using vzalloced DMB and sndbuf is merged with DMB, and there are
>>    many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused
>>    by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg()
>>    or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the
>>    vmap lock contention [6]. It has significant effects, but using virtual memory
>>    still has additional overhead compared to using physical memory.
>>    So this new version provides controls of dmb_type and dmb_copy to suit
>>    different scenarios.
>> - Some minor changes and comments improvements.
>>
>> RFC->old version([1]):
>> Link: https://lore.kernel.org/netdev/1702214654-32069-1-git-send-email-guwen@linux.alibaba.com/
>> - Patch #1: improve the loopback-ism dump, it shows as follows now:
>>    # smcd d
>>    FID  Type  PCI-ID        PCHID  InUse  #LGs  PNET-ID
>>    0000 0     loopback-ism  ffff   No        0
>> - Patch #3: introduce the smc_ism_set_v2_capable() helper and set
>>    smc_ism_v2_capable when ISMv2 or virtual ISM is registered,
>>    regardless of whether there is already a device in smcd device list.
>> - Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/.
>> - Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active
>>    to activate or deactivate the loopback-ism.
>> - Patch #9: introduce the statistics of loopback-ism by
>>    /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_tytes|dmbs_cnt}.
>> - Some minor changes and comments improvements.
>>
>> [1] https://lore.kernel.org/netdev/1695568613-125057-1-git-send-email-guwen@linux.alibaba.com/
>> [2] https://lore.kernel.org/netdev/20231219142616.80697-1-guwen@linux.alibaba.com/
>> [3] https://github.com/goldsborough/ipc-bench
>> [4] https://lore.kernel.org/all/3189e342-c38f-6076-b730-19a6efd732a5@linux.alibaba.com/
>> [5] https://lore.kernel.org/all/238e63cd-e0e8-4fbf-852f-bc4d5bc35d5a@linux.alibaba.com/
>> [6] https://lore.kernel.org/all/20240102184633.748113-1-urezki@gmail.com/
>>
>> Wen Gu (15):
>>    net/smc: improve SMC-D device dump for virtual ISM
>>    net/smc: decouple specialized struct from SMC-D DMB registration
>>    net/smc: introduce virtual ISM device loopback-ism
>>    net/smc: implement ID-related operations of loopback-ism
>>    net/smc: implement some unsupported operations of loopback-ism
>>    net/smc: implement DMB-related operations of loopback-ism
>>    net/smc: register loopback-ism into SMC-D device list
>>    net/smc: introduce loopback-ism runtime switch
>>    net/smc: introduce loopback-ism statistics attributes
>>    net/smc: add operations to merge sndbuf with peer DMB
>>    net/smc: attach or detach ghost sndbuf to peer DMB
>>    net/smc: adapt cursor update when sndbuf and peer DMB are merged
>>    net/smc: introduce loopback-ism DMB type control
>>    net/smc: introduce loopback-ism DMB data copy control
>>    net/smc: implement DMB-merged operations of loopback-ism
>>
>>   drivers/s390/net/ism_drv.c |   2 +-
>>   include/net/smc.h          |   7 +-
>>   net/smc/Kconfig            |  13 +
>>   net/smc/Makefile           |   2 +-
>>   net/smc/af_smc.c           |  28 +-
>>   net/smc/smc_cdc.c          |  58 ++-
>>   net/smc/smc_cdc.h          |   1 +
>>   net/smc/smc_core.c         |  61 +++-
>>   net/smc/smc_core.h         |   1 +
>>   net/smc/smc_ism.c          |  71 +++-
>>   net/smc/smc_ism.h          |   5 +
>>   net/smc/smc_loopback.c     | 718 +++++++++++++++++++++++++++++++++++++
>>   net/smc/smc_loopback.h     |  88 +++++
>>   13 files changed, 1026 insertions(+), 29 deletions(-)
>>   create mode 100644 net/smc/smc_loopback.c
>>   create mode 100644 net/smc/smc_loopback.h
>>
> Hi Wen,
> 
> Thank you for the patience again!
> 
> You can find the comments under the corresponding patches respectively.
> About the file hierarchy in sysfs and the names, we still have some thoughts. We need to investigate a bit more time on it.
> 

Hi Wenjia and Gerd,

Thank you very much!

I answered each comment you left. You can find my thoughts about sysfs and
knobs there. Looking forward to your further reply. Thanks!

Best regards,
Wen Gu

> Thanks,
> Gerd & Wenjia