Message ID | 20240111120036.109903-1-guwen@linux.alibaba.com (mailing list archive) |
---|---|
Headers | show |
Series | net/smc: implement loopback-ism used by SMC-D | expand |
On Thu, Jan 11, 2024 at 08:00:21PM +0800, Wen Gu wrote: > This patch set acts as the second part of the new version of [1] (The > first part can be referred from [2]), the updated things of this version > are listed at the end. ... Hi Wen Gu, unfortunately net-next is currently closed. [adapted from text by Jakub] ## Form letter - net-next-closed The merge window for v6.8 has begun and therefore net-next is closed for new drivers, features, code refactoring and optimizations. We are currently accepting bug fixes only. Please repost when net-next reopens on or after 21st January. RFC patches sent for review only are obviously welcome at any time. See: https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#development-cycle -- pw-bot: defer
Thu, Jan 11, 2024 at 01:00:21PM CET, guwen@linux.alibaba.com wrote: >This patch set acts as the second part of the new version of [1] (The first >part can be referred from [2]), the updated things of this version are listed >at the end. > ># Background > >SMC-D is now used in IBM z with ISM function to optimize network interconnect >for intra-CPC communications. Inspired by this, we try to make SMC-D available Care to provide more details about what ISM and intra-CPC is and what it it good for? >on the non-s390 architecture through a software-implemented virtual ISM device, >that is the loopback-ism device here, to accelerate inter-process or I see no such device. Is it a netdevice? If it is "software-implemented", why is it part of smc driver and not separate soft-device driver? If there is some smc specific code, I guess there should be some level of separation. Can't this be implemented by other devices too? >inter-containers communication within the same OS instance. > ># Design > >This patch set includes 3 parts: > > - Patch #1-#2: some prepare work for loopback-ism. > - Patch #3-#9: implement loopback-ism device. > - Patch #10-#15: memory copy optimization for loopback scenario. > >The loopback-ism device is designed as a ISMv2 device and not be limited to >a specific net namespace, ends of both inter-process connection (1/1' in diagram >below) or inter-container connection (2/2' in diagram below) can find the same >available loopback-ism and choose it during the CLC handshake. > > Container 1 (ns1) Container 2 (ns2) > +-----------------------------------------+ +-------------------------+ > | +-------+ +-------+ +-------+ | | +-------+ | > | | App A | | App B | | App C | | | | App D |<-+ | > | +-------+ +---^---+ +-------+ | | +-------+ |(2') | > | |127.0.0.1 (1')| |192.168.0.11 192.168.0.12| | > | (1)| +--------+ | +--------+ |(2) | | +--------+ +--------+ | > | `-->| lo |-` | eth0 |<-` | | | lo | | eth0 | | > +---------+--|---^-+---+-----|--+---------+ +-+--------+---+-^------+-+ > | | | | > Kernel | | | | > +----+-------v---+-----------v----------------------------------+---+----+ > | | TCP | | > | | | | > | +--------------------------------------------------------------+ | > | | > | +--------------+ | > | | smc loopback | | > +---------------------------+--------------+-----------------------------+ > >loopback-ism device creates DMBs (shared memory) for each connection peer. >Since data transfer occurs within the same kernel, the sndbuf of each peer >is only a descriptor and point to the same memory region as peer DMB, so that >the data copy from sndbuf to peer DMB can be avoided in loopback-ism case. > > Container 1 (ns1) Container 2 (ns2) > +-----------------------------------------+ +-------------------------+ > | +-------+ | | +-------+ | > | | App C |-----+ | | | App D | | > | +-------+ | | | +-^-----+ | > | | | | | | > | (2) | | | (2') | | > | | | | | | > +---------------|-------------------------+ +----------|--------------+ > | | > Kernel | | > +---------------|-----------------------------------------|--------------+ > | +--------+ +--v-----+ +--------+ +--------+ | > | |dmb_desc| |snd_desc| |dmb_desc| |snd_desc| | > | +-----|--+ +--|-----+ +-----|--+ +--------+ | > | +-----|--+ | +-----|--+ | > | | DMB C | +---------------------------------| DMB D | | > | +--------+ +--------+ | > | | > | +--------------+ | > | | smc loopback | | > +---------------------------+--------------+-----------------------------+ > ># Benchmark Test > > * Test environments: > - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem. > - SMC sndbuf/DMB size 1MB. > - /sys/devices/virtual/smc/loopback-ism/dmb_copy is set to default 0, > which means sndbuf and DMB are merged and no data copied between them. > - /sys/devices/virtual/smc/loopback-ism/dmb_type is set to default 0, Exposing any configuration knobs and statistics over sysfs for softdevices does not look correct at all :/ Could you please avoid sysfs? > which means DMB is physically contiguous buffer. > > * Test object: > - TCP: run on TCP loopback. > - SMC lo: run on SMC loopback device. > >1. ipc-benchmark (see [3]) > > - ./<foo> -c 1000000 -s 100 > > TCP SMC-lo >Message >rate (msg/s) 80636 149515(+85.42%) > >2. sockperf > > - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp > - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30 > > TCP SMC-lo >Bandwidth(MBps) 4909.36 8197.57(+66.98%) >Latency(us) 6.098 3.383(-44.52%) > >3. nginx/wrk > > - serv: <smc_run> nginx > - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80 > > TCP SMC-lo >Requests/s 181685.74 246447.77(+35.65%) > >4. redis-benchmark > > - serv: <smc_run> redis-server > - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024 > > TCP SMC-lo >GET(Requests/s) 85855.34 118553.64(+38.09%) >SET(Requests/s) 86824.40 125944.58(+45.06%) > > >Change log: > >v1->RFC: >- Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics: > /sys/devices/virtual/smc/loopback-ism/xfer_bytes >- Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports > merging sndbuf with peer DMB. >- Patch #13 & #14: introduce loopback-ism device control of DMB memory type and > control of whether to merge sndbuf and DMB. They can be respectively set by: > /sys/devices/virtual/smc/loopback-ism/dmb_type > /sys/devices/virtual/smc/loopback-ism/dmb_copy > The motivation for these two control is that a performance bottleneck was > found when using vzalloced DMB and sndbuf is merged with DMB, and there are > many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused > by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg() > or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the > vmap lock contention [6]. It has significant effects, but using virtual memory > still has additional overhead compared to using physical memory. > So this new version provides controls of dmb_type and dmb_copy to suit > different scenarios. >- Some minor changes and comments improvements. > >RFC->old version([1]): >Link: https://lore.kernel.org/netdev/1702214654-32069-1-git-send-email-guwen@linux.alibaba.com/ >- Patch #1: improve the loopback-ism dump, it shows as follows now: > # smcd d > FID Type PCI-ID PCHID InUse #LGs PNET-ID > 0000 0 loopback-ism ffff No 0 >- Patch #3: introduce the smc_ism_set_v2_capable() helper and set > smc_ism_v2_capable when ISMv2 or virtual ISM is registered, > regardless of whether there is already a device in smcd device list. >- Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/. >- Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active > to activate or deactivate the loopback-ism. >- Patch #9: introduce the statistics of loopback-ism by > /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_tytes|dmbs_cnt}. >- Some minor changes and comments improvements. > >[1] https://lore.kernel.org/netdev/1695568613-125057-1-git-send-email-guwen@linux.alibaba.com/ >[2] https://lore.kernel.org/netdev/20231219142616.80697-1-guwen@linux.alibaba.com/ >[3] https://github.com/goldsborough/ipc-bench >[4] https://lore.kernel.org/all/3189e342-c38f-6076-b730-19a6efd732a5@linux.alibaba.com/ >[5] https://lore.kernel.org/all/238e63cd-e0e8-4fbf-852f-bc4d5bc35d5a@linux.alibaba.com/ >[6] https://lore.kernel.org/all/20240102184633.748113-1-urezki@gmail.com/ > >Wen Gu (15): > net/smc: improve SMC-D device dump for virtual ISM > net/smc: decouple specialized struct from SMC-D DMB registration > net/smc: introduce virtual ISM device loopback-ism > net/smc: implement ID-related operations of loopback-ism > net/smc: implement some unsupported operations of loopback-ism > net/smc: implement DMB-related operations of loopback-ism > net/smc: register loopback-ism into SMC-D device list > net/smc: introduce loopback-ism runtime switch > net/smc: introduce loopback-ism statistics attributes > net/smc: add operations to merge sndbuf with peer DMB > net/smc: attach or detach ghost sndbuf to peer DMB > net/smc: adapt cursor update when sndbuf and peer DMB are merged > net/smc: introduce loopback-ism DMB type control > net/smc: introduce loopback-ism DMB data copy control > net/smc: implement DMB-merged operations of loopback-ism > > drivers/s390/net/ism_drv.c | 2 +- > include/net/smc.h | 7 +- > net/smc/Kconfig | 13 + > net/smc/Makefile | 2 +- > net/smc/af_smc.c | 28 +- > net/smc/smc_cdc.c | 58 ++- > net/smc/smc_cdc.h | 1 + > net/smc/smc_core.c | 61 +++- > net/smc/smc_core.h | 1 + > net/smc/smc_ism.c | 71 +++- > net/smc/smc_ism.h | 5 + > net/smc/smc_loopback.c | 718 +++++++++++++++++++++++++++++++++++++ > net/smc/smc_loopback.h | 88 +++++ > 13 files changed, 1026 insertions(+), 29 deletions(-) > create mode 100644 net/smc/smc_loopback.c > create mode 100644 net/smc/smc_loopback.h > >-- >2.32.0.3.g01195cf9f > >
On 2024/1/11 21:36, Simon Horman wrote: > On Thu, Jan 11, 2024 at 08:00:21PM +0800, Wen Gu wrote: >> This patch set acts as the second part of the new version of [1] (The >> first part can be referred from [2]), the updated things of this version >> are listed at the end. > > ... > > Hi Wen Gu, > > unfortunately net-next is currently closed. > > [adapted from text by Jakub] > > ## Form letter - net-next-closed > > The merge window for v6.8 has begun and therefore net-next is closed > for new drivers, features, code refactoring and optimizations. > We are currently accepting bug fixes only. > > Please repost when net-next reopens on or after 21st January. > > RFC patches sent for review only are obviously welcome at any time. > > See: https://www.kernel.org/doc/html/next/process/maintainer-netdev.html#development-cycle > -- > pw-bot: defer Thank you for notifying, Simon. I will follow the development-cycle. Thanks again.
On 2024/1/11 22:50, Jiri Pirko wrote: > Thu, Jan 11, 2024 at 01:00:21PM CET, guwen@linux.alibaba.com wrote: >> This patch set acts as the second part of the new version of [1] (The first >> part can be referred from [2]), the updated things of this version are listed >> at the end. >> >> # Background >> >> SMC-D is now used in IBM z with ISM function to optimize network interconnect >> for intra-CPC communications. Inspired by this, we try to make SMC-D available > > Care to provide more details about what ISM and intra-CPC is and what it > it good for? > Hi Jiri, Sure, ISM (IBM System Z Internal Shared Memory) is a technology that provides the internal communications capability required for SMC-D. It is a virtual PCI network adapter that enables direct access to shared virtual memory providing a highly optimized network interconnect for IBM Z intra-CPC communications. (It can be found in https://www.ibm.com/docs/en/zos/3.1.0?topic=communications-shared-memory-reference-information and https://www.ibm.com/docs/en/zos/3.1.0?topic=dv2-ismv2) CPC (Central processor complex) is an IBM mainframe term to refer to the physical collection of hardware that includes main storage, one or more central processors, timers, and channels. (It can be found in https://www.ibm.com/docs/en/zos-basic-skills?topic=concepts-mainframe-hardware-terminology and https://www.ibm.com/docs/en/ztpf/2023?topic=support-central-processor-complex-cpc) SMC (Shared Memory Communications) is a network protocol that allows two SMC capable peers to communicate using memory that each peer allocates and manages for their partner’s use. It has two forms: - SMC over Remote Direct Memory Access (SMC-R) It is an open protocol that was initially introduced in z/OS V2R1 on the IBM zEC12. SMC-R is defined in an informational RFC entitled IBM’s Shared Memory Communications over RDMA (https://tools.ietf.org/html/rfc7609). - SMC - Direct Memory Access (SMC-D) It is a variation of SMC-R. SMC-D is closely related to SMC-R but is based on the Internal Shared Memory (ISM) capabilities introduced with the IBM z13™ (z13) hardware model. (SMC protocol can be found in https://www.ibm.com/support/pages/system/files/inline-files/IBM%20Shared%20Memory%20Communications%20Version%202.1_0.pdf) So with ISM function, SMC-D can be used to improves throughput, lowers latency and cost, and maintains existing functions of communications within CPC. > >> on the non-s390 architecture through a software-implemented virtual ISM device, >> that is the loopback-ism device here, to accelerate inter-process or > > I see no such device. Is it a netdevice? Currently, SMC-D depends on ISM and is only available on IBM Z systems. Now we try to make SMC-D available on other system architectures other than s390 and Z system. So 'virtual ISM' is proposed and acts as original firmware ISM on Z system. (The virtual ISM supports can be found in https://lore.kernel.org/netdev/20231219142616.80697-1-guwen@linux.alibaba.com/) The loopback-ism is the first virtual ISM. It does not rely on a specific architecture or hardware, and provides functions that ISM should have (like a dummy device). It is designed to be used by SMC-D when communication occurs within OS instance. It is not a typical network device, since it primarily provides exact functions defined by SMC-D device operations(struct smcd_ops), e.g. provides and manages the shared memory (term used is DMB in SMC, Direct Memory Buffer). It can't be found now since it is introduced by this patchset. > > If it is "software-implemented", why is it part of smc driver and not > separate soft-device driver? If there is some smc specific code, I guess > there should be some level of separation. Can't this be implemented by > other devices too? > loopback-ism is designed to specifically used by SMC-D (like s390 ISM), to serves as a easy-available ISM for community to test SMC-D and to accelerate intra-OS communication (see benchmark test). So the code is under net/smc. > > >> inter-containers communication within the same OS instance. >> >> # Design >> >> This patch set includes 3 parts: >> >> - Patch #1-#2: some prepare work for loopback-ism. >> - Patch #3-#9: implement loopback-ism device. >> - Patch #10-#15: memory copy optimization for loopback scenario. >> >> The loopback-ism device is designed as a ISMv2 device and not be limited to >> a specific net namespace, ends of both inter-process connection (1/1' in diagram >> below) or inter-container connection (2/2' in diagram below) can find the same >> available loopback-ism and choose it during the CLC handshake. >> >> Container 1 (ns1) Container 2 (ns2) >> +-----------------------------------------+ +-------------------------+ >> | +-------+ +-------+ +-------+ | | +-------+ | >> | | App A | | App B | | App C | | | | App D |<-+ | >> | +-------+ +---^---+ +-------+ | | +-------+ |(2') | >> | |127.0.0.1 (1')| |192.168.0.11 192.168.0.12| | >> | (1)| +--------+ | +--------+ |(2) | | +--------+ +--------+ | >> | `-->| lo |-` | eth0 |<-` | | | lo | | eth0 | | >> +---------+--|---^-+---+-----|--+---------+ +-+--------+---+-^------+-+ >> | | | | >> Kernel | | | | >> +----+-------v---+-----------v----------------------------------+---+----+ >> | | TCP | | >> | | | | >> | +--------------------------------------------------------------+ | >> | | >> | +--------------+ | >> | | smc loopback | | >> +---------------------------+--------------+-----------------------------+ >> >> loopback-ism device creates DMBs (shared memory) for each connection peer. >> Since data transfer occurs within the same kernel, the sndbuf of each peer >> is only a descriptor and point to the same memory region as peer DMB, so that >> the data copy from sndbuf to peer DMB can be avoided in loopback-ism case. >> >> Container 1 (ns1) Container 2 (ns2) >> +-----------------------------------------+ +-------------------------+ >> | +-------+ | | +-------+ | >> | | App C |-----+ | | | App D | | >> | +-------+ | | | +-^-----+ | >> | | | | | | >> | (2) | | | (2') | | >> | | | | | | >> +---------------|-------------------------+ +----------|--------------+ >> | | >> Kernel | | >> +---------------|-----------------------------------------|--------------+ >> | +--------+ +--v-----+ +--------+ +--------+ | >> | |dmb_desc| |snd_desc| |dmb_desc| |snd_desc| | >> | +-----|--+ +--|-----+ +-----|--+ +--------+ | >> | +-----|--+ | +-----|--+ | >> | | DMB C | +---------------------------------| DMB D | | >> | +--------+ +--------+ | >> | | >> | +--------------+ | >> | | smc loopback | | >> +---------------------------+--------------+-----------------------------+ >> >> # Benchmark Test >> >> * Test environments: >> - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem. >> - SMC sndbuf/DMB size 1MB. >> - /sys/devices/virtual/smc/loopback-ism/dmb_copy is set to default 0, >> which means sndbuf and DMB are merged and no data copied between them. >> - /sys/devices/virtual/smc/loopback-ism/dmb_type is set to default 0, > > Exposing any configuration knobs and statistics over sysfs for > softdevices does not look correct at all :/ Could you please avoid > sysfs? > In previous reviews and calls, we think loopback-ism needs to be more like a device and be visible under /sys/devices. Would you mind explaining why using sysfs for loopback-ism is not correct? since I saw some other configurations or statistics exists under /sys/devices, e.g. /sys/devices/virtual/net/lo. Thank you! Thanks again, Wen Gu > >> which means DMB is physically contiguous buffer. >> >> * Test object: >> - TCP: run on TCP loopback. >> - SMC lo: run on SMC loopback device. >> >> 1. ipc-benchmark (see [3]) >> >> - ./<foo> -c 1000000 -s 100 >> >> TCP SMC-lo >> Message >> rate (msg/s) 80636 149515(+85.42%) >> >> 2. sockperf >> >> - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp >> - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30 >> >> TCP SMC-lo >> Bandwidth(MBps) 4909.36 8197.57(+66.98%) >> Latency(us) 6.098 3.383(-44.52%) >> >> 3. nginx/wrk >> >> - serv: <smc_run> nginx >> - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80 >> >> TCP SMC-lo >> Requests/s 181685.74 246447.77(+35.65%) >> >> 4. redis-benchmark >> >> - serv: <smc_run> redis-server >> - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024 >> >> TCP SMC-lo >> GET(Requests/s) 85855.34 118553.64(+38.09%) >> SET(Requests/s) 86824.40 125944.58(+45.06%) >> >> >> Change log: >> >> v1->RFC: >> - Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics: >> /sys/devices/virtual/smc/loopback-ism/xfer_bytes >> - Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports >> merging sndbuf with peer DMB. >> - Patch #13 & #14: introduce loopback-ism device control of DMB memory type and >> control of whether to merge sndbuf and DMB. They can be respectively set by: >> /sys/devices/virtual/smc/loopback-ism/dmb_type >> /sys/devices/virtual/smc/loopback-ism/dmb_copy >> The motivation for these two control is that a performance bottleneck was >> found when using vzalloced DMB and sndbuf is merged with DMB, and there are >> many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused >> by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg() >> or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the >> vmap lock contention [6]. It has significant effects, but using virtual memory >> still has additional overhead compared to using physical memory. >> So this new version provides controls of dmb_type and dmb_copy to suit >> different scenarios. >> - Some minor changes and comments improvements. >> >> RFC->old version([1]): >> Link: https://lore.kernel.org/netdev/1702214654-32069-1-git-send-email-guwen@linux.alibaba.com/ >> - Patch #1: improve the loopback-ism dump, it shows as follows now: >> # smcd d >> FID Type PCI-ID PCHID InUse #LGs PNET-ID >> 0000 0 loopback-ism ffff No 0 >> - Patch #3: introduce the smc_ism_set_v2_capable() helper and set >> smc_ism_v2_capable when ISMv2 or virtual ISM is registered, >> regardless of whether there is already a device in smcd device list. >> - Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/. >> - Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active >> to activate or deactivate the loopback-ism. >> - Patch #9: introduce the statistics of loopback-ism by >> /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_tytes|dmbs_cnt}. >> - Some minor changes and comments improvements. >> >> [1] https://lore.kernel.org/netdev/1695568613-125057-1-git-send-email-guwen@linux.alibaba.com/ >> [2] https://lore.kernel.org/netdev/20231219142616.80697-1-guwen@linux.alibaba.com/ >> [3] https://github.com/goldsborough/ipc-bench >> [4] https://lore.kernel.org/all/3189e342-c38f-6076-b730-19a6efd732a5@linux.alibaba.com/ >> [5] https://lore.kernel.org/all/238e63cd-e0e8-4fbf-852f-bc4d5bc35d5a@linux.alibaba.com/ >> [6] https://lore.kernel.org/all/20240102184633.748113-1-urezki@gmail.com/ >> >> Wen Gu (15): >> net/smc: improve SMC-D device dump for virtual ISM >> net/smc: decouple specialized struct from SMC-D DMB registration >> net/smc: introduce virtual ISM device loopback-ism >> net/smc: implement ID-related operations of loopback-ism >> net/smc: implement some unsupported operations of loopback-ism >> net/smc: implement DMB-related operations of loopback-ism >> net/smc: register loopback-ism into SMC-D device list >> net/smc: introduce loopback-ism runtime switch >> net/smc: introduce loopback-ism statistics attributes >> net/smc: add operations to merge sndbuf with peer DMB >> net/smc: attach or detach ghost sndbuf to peer DMB >> net/smc: adapt cursor update when sndbuf and peer DMB are merged >> net/smc: introduce loopback-ism DMB type control >> net/smc: introduce loopback-ism DMB data copy control >> net/smc: implement DMB-merged operations of loopback-ism >> >> drivers/s390/net/ism_drv.c | 2 +- >> include/net/smc.h | 7 +- >> net/smc/Kconfig | 13 + >> net/smc/Makefile | 2 +- >> net/smc/af_smc.c | 28 +- >> net/smc/smc_cdc.c | 58 ++- >> net/smc/smc_cdc.h | 1 + >> net/smc/smc_core.c | 61 +++- >> net/smc/smc_core.h | 1 + >> net/smc/smc_ism.c | 71 +++- >> net/smc/smc_ism.h | 5 + >> net/smc/smc_loopback.c | 718 +++++++++++++++++++++++++++++++++++++ >> net/smc/smc_loopback.h | 88 +++++ >> 13 files changed, 1026 insertions(+), 29 deletions(-) >> create mode 100644 net/smc/smc_loopback.c >> create mode 100644 net/smc/smc_loopback.h >> >> -- >> 2.32.0.3.g01195cf9f >> >>
Fri, Jan 12, 2024 at 09:29:35AM CET, guwen@linux.alibaba.com wrote: > > >On 2024/1/11 22:50, Jiri Pirko wrote: >> Thu, Jan 11, 2024 at 01:00:21PM CET, guwen@linux.alibaba.com wrote: >> > This patch set acts as the second part of the new version of [1] (The first >> > part can be referred from [2]), the updated things of this version are listed >> > at the end. >> > >> > # Background >> > >> > SMC-D is now used in IBM z with ISM function to optimize network interconnect >> > for intra-CPC communications. Inspired by this, we try to make SMC-D available >> >> Care to provide more details about what ISM and intra-CPC is and what it >> it good for? >> > >Hi Jiri, > >Sure, > >ISM (IBM System Z Internal Shared Memory) is a technology that provides the >internal communications capability required for SMC-D. It is a virtual PCI >network adapter that enables direct access to shared virtual memory providing >a highly optimized network interconnect for IBM Z intra-CPC communications. >(It can be found in https://www.ibm.com/docs/en/zos/3.1.0?topic=communications-shared-memory-reference-information >and https://www.ibm.com/docs/en/zos/3.1.0?topic=dv2-ismv2) > >CPC (Central processor complex) is an IBM mainframe term to refer to the physical >collection of hardware that includes main storage, one or more central processors, >timers, and channels. >(It can be found in https://www.ibm.com/docs/en/zos-basic-skills?topic=concepts-mainframe-hardware-terminology >and https://www.ibm.com/docs/en/ztpf/2023?topic=support-central-processor-complex-cpc) > >SMC (Shared Memory Communications) is a network protocol that allows two SMC >capable peers to communicate using memory that each peer allocates and manages >for their partner’s use. It has two forms: > >- SMC over Remote Direct Memory Access (SMC-R) > > It is an open protocol that was initially introduced in z/OS V2R1 on the IBM zEC12. > SMC-R is defined in an informational RFC entitled IBM’s Shared Memory Communications > over RDMA (https://tools.ietf.org/html/rfc7609). > >- SMC - Direct Memory Access (SMC-D) > > It is a variation of SMC-R. SMC-D is closely related to SMC-R but is based on the > Internal Shared Memory (ISM) capabilities introduced with the IBM z13™ (z13) hardware > model. > >(SMC protocol can be found in https://www.ibm.com/support/pages/system/files/inline-files/IBM%20Shared%20Memory%20Communications%20Version%202.1_0.pdf) > >So with ISM function, SMC-D can be used to improves throughput, lowers latency and cost, >and maintains existing functions of communications within CPC. > >> >> > on the non-s390 architecture through a software-implemented virtual ISM device, >> > that is the loopback-ism device here, to accelerate inter-process or >> >> I see no such device. Is it a netdevice? > >Currently, SMC-D depends on ISM and is only available on IBM Z systems. Now we try >to make SMC-D available on other system architectures other than s390 and Z system. >So 'virtual ISM' is proposed and acts as original firmware ISM on Z system. >(The virtual ISM supports can be found in https://lore.kernel.org/netdev/20231219142616.80697-1-guwen@linux.alibaba.com/) > >The loopback-ism is the first virtual ISM. It does not rely on a specific architecture >or hardware, and provides functions that ISM should have (like a dummy device). It is >designed to be used by SMC-D when communication occurs within OS instance. > >It is not a typical network device, since it primarily provides exact functions >defined by SMC-D device operations(struct smcd_ops), e.g. provides and manages the >shared memory (term used is DMB in SMC, Direct Memory Buffer). > >It can't be found now since it is introduced by this patchset. > >> >> If it is "software-implemented", why is it part of smc driver and not >> separate soft-device driver? If there is some smc specific code, I guess >> there should be some level of separation. Can't this be implemented by >> other devices too? >> > >loopback-ism is designed to specifically used by SMC-D (like s390 ISM), to >serves as a easy-available ISM for community to test SMC-D and to accelerate >intra-OS communication (see benchmark test). So the code is under net/smc. Got it. > >> >> >> > inter-containers communication within the same OS instance. >> > >> > # Design >> > >> > This patch set includes 3 parts: >> > >> > - Patch #1-#2: some prepare work for loopback-ism. >> > - Patch #3-#9: implement loopback-ism device. >> > - Patch #10-#15: memory copy optimization for loopback scenario. >> > >> > The loopback-ism device is designed as a ISMv2 device and not be limited to >> > a specific net namespace, ends of both inter-process connection (1/1' in diagram >> > below) or inter-container connection (2/2' in diagram below) can find the same >> > available loopback-ism and choose it during the CLC handshake. >> > >> > Container 1 (ns1) Container 2 (ns2) >> > +-----------------------------------------+ +-------------------------+ >> > | +-------+ +-------+ +-------+ | | +-------+ | >> > | | App A | | App B | | App C | | | | App D |<-+ | >> > | +-------+ +---^---+ +-------+ | | +-------+ |(2') | >> > | |127.0.0.1 (1')| |192.168.0.11 192.168.0.12| | >> > | (1)| +--------+ | +--------+ |(2) | | +--------+ +--------+ | >> > | `-->| lo |-` | eth0 |<-` | | | lo | | eth0 | | >> > +---------+--|---^-+---+-----|--+---------+ +-+--------+---+-^------+-+ >> > | | | | >> > Kernel | | | | >> > +----+-------v---+-----------v----------------------------------+---+----+ >> > | | TCP | | >> > | | | | >> > | +--------------------------------------------------------------+ | >> > | | >> > | +--------------+ | >> > | | smc loopback | | >> > +---------------------------+--------------+-----------------------------+ >> > >> > loopback-ism device creates DMBs (shared memory) for each connection peer. >> > Since data transfer occurs within the same kernel, the sndbuf of each peer >> > is only a descriptor and point to the same memory region as peer DMB, so that >> > the data copy from sndbuf to peer DMB can be avoided in loopback-ism case. >> > >> > Container 1 (ns1) Container 2 (ns2) >> > +-----------------------------------------+ +-------------------------+ >> > | +-------+ | | +-------+ | >> > | | App C |-----+ | | | App D | | >> > | +-------+ | | | +-^-----+ | >> > | | | | | | >> > | (2) | | | (2') | | >> > | | | | | | >> > +---------------|-------------------------+ +----------|--------------+ >> > | | >> > Kernel | | >> > +---------------|-----------------------------------------|--------------+ >> > | +--------+ +--v-----+ +--------+ +--------+ | >> > | |dmb_desc| |snd_desc| |dmb_desc| |snd_desc| | >> > | +-----|--+ +--|-----+ +-----|--+ +--------+ | >> > | +-----|--+ | +-----|--+ | >> > | | DMB C | +---------------------------------| DMB D | | >> > | +--------+ +--------+ | >> > | | >> > | +--------------+ | >> > | | smc loopback | | >> > +---------------------------+--------------+-----------------------------+ >> > >> > # Benchmark Test >> > >> > * Test environments: >> > - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem. >> > - SMC sndbuf/DMB size 1MB. >> > - /sys/devices/virtual/smc/loopback-ism/dmb_copy is set to default 0, >> > which means sndbuf and DMB are merged and no data copied between them. >> > - /sys/devices/virtual/smc/loopback-ism/dmb_type is set to default 0, >> >> Exposing any configuration knobs and statistics over sysfs for >> softdevices does not look correct at all :/ Could you please avoid >> sysfs? >> > >In previous reviews and calls, we think loopback-ism needs to be more >like a device and be visible under /sys/devices. > >Would you mind explaining why using sysfs for loopback-ism is not correct? >since I saw some other configurations or statistics exists under /sys/devices, >e.g. /sys/devices/virtual/net/lo. Thank you! You have smc_netlink.c exposing clear netlink api for the subsystem. Can't you extend it to contain the configuration knobs and expose stats instead of sysfs? > > > >Thanks again, >Wen Gu > >> >> > which means DMB is physically contiguous buffer. >> > >> > * Test object: >> > - TCP: run on TCP loopback. >> > - SMC lo: run on SMC loopback device. >> > >> > 1. ipc-benchmark (see [3]) >> > >> > - ./<foo> -c 1000000 -s 100 >> > >> > TCP SMC-lo >> > Message >> > rate (msg/s) 80636 149515(+85.42%) >> > >> > 2. sockperf >> > >> > - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp >> > - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30 >> > >> > TCP SMC-lo >> > Bandwidth(MBps) 4909.36 8197.57(+66.98%) >> > Latency(us) 6.098 3.383(-44.52%) >> > >> > 3. nginx/wrk >> > >> > - serv: <smc_run> nginx >> > - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80 >> > >> > TCP SMC-lo >> > Requests/s 181685.74 246447.77(+35.65%) >> > >> > 4. redis-benchmark >> > >> > - serv: <smc_run> redis-server >> > - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024 >> > >> > TCP SMC-lo >> > GET(Requests/s) 85855.34 118553.64(+38.09%) >> > SET(Requests/s) 86824.40 125944.58(+45.06%) >> > >> > >> > Change log: >> > >> > v1->RFC: >> > - Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics: >> > /sys/devices/virtual/smc/loopback-ism/xfer_bytes >> > - Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports >> > merging sndbuf with peer DMB. >> > - Patch #13 & #14: introduce loopback-ism device control of DMB memory type and >> > control of whether to merge sndbuf and DMB. They can be respectively set by: >> > /sys/devices/virtual/smc/loopback-ism/dmb_type >> > /sys/devices/virtual/smc/loopback-ism/dmb_copy >> > The motivation for these two control is that a performance bottleneck was >> > found when using vzalloced DMB and sndbuf is merged with DMB, and there are >> > many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused >> > by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg() >> > or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the >> > vmap lock contention [6]. It has significant effects, but using virtual memory >> > still has additional overhead compared to using physical memory. >> > So this new version provides controls of dmb_type and dmb_copy to suit >> > different scenarios. >> > - Some minor changes and comments improvements. >> > >> > RFC->old version([1]): >> > Link: https://lore.kernel.org/netdev/1702214654-32069-1-git-send-email-guwen@linux.alibaba.com/ >> > - Patch #1: improve the loopback-ism dump, it shows as follows now: >> > # smcd d >> > FID Type PCI-ID PCHID InUse #LGs PNET-ID >> > 0000 0 loopback-ism ffff No 0 >> > - Patch #3: introduce the smc_ism_set_v2_capable() helper and set >> > smc_ism_v2_capable when ISMv2 or virtual ISM is registered, >> > regardless of whether there is already a device in smcd device list. >> > - Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/. >> > - Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active >> > to activate or deactivate the loopback-ism. >> > - Patch #9: introduce the statistics of loopback-ism by >> > /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_tytes|dmbs_cnt}. >> > - Some minor changes and comments improvements. >> > >> > [1] https://lore.kernel.org/netdev/1695568613-125057-1-git-send-email-guwen@linux.alibaba.com/ >> > [2] https://lore.kernel.org/netdev/20231219142616.80697-1-guwen@linux.alibaba.com/ >> > [3] https://github.com/goldsborough/ipc-bench >> > [4] https://lore.kernel.org/all/3189e342-c38f-6076-b730-19a6efd732a5@linux.alibaba.com/ >> > [5] https://lore.kernel.org/all/238e63cd-e0e8-4fbf-852f-bc4d5bc35d5a@linux.alibaba.com/ >> > [6] https://lore.kernel.org/all/20240102184633.748113-1-urezki@gmail.com/ >> > >> > Wen Gu (15): >> > net/smc: improve SMC-D device dump for virtual ISM >> > net/smc: decouple specialized struct from SMC-D DMB registration >> > net/smc: introduce virtual ISM device loopback-ism >> > net/smc: implement ID-related operations of loopback-ism >> > net/smc: implement some unsupported operations of loopback-ism >> > net/smc: implement DMB-related operations of loopback-ism >> > net/smc: register loopback-ism into SMC-D device list >> > net/smc: introduce loopback-ism runtime switch >> > net/smc: introduce loopback-ism statistics attributes >> > net/smc: add operations to merge sndbuf with peer DMB >> > net/smc: attach or detach ghost sndbuf to peer DMB >> > net/smc: adapt cursor update when sndbuf and peer DMB are merged >> > net/smc: introduce loopback-ism DMB type control >> > net/smc: introduce loopback-ism DMB data copy control >> > net/smc: implement DMB-merged operations of loopback-ism >> > >> > drivers/s390/net/ism_drv.c | 2 +- >> > include/net/smc.h | 7 +- >> > net/smc/Kconfig | 13 + >> > net/smc/Makefile | 2 +- >> > net/smc/af_smc.c | 28 +- >> > net/smc/smc_cdc.c | 58 ++- >> > net/smc/smc_cdc.h | 1 + >> > net/smc/smc_core.c | 61 +++- >> > net/smc/smc_core.h | 1 + >> > net/smc/smc_ism.c | 71 +++- >> > net/smc/smc_ism.h | 5 + >> > net/smc/smc_loopback.c | 718 +++++++++++++++++++++++++++++++++++++ >> > net/smc/smc_loopback.h | 88 +++++ >> > 13 files changed, 1026 insertions(+), 29 deletions(-) >> > create mode 100644 net/smc/smc_loopback.c >> > create mode 100644 net/smc/smc_loopback.h >> > >> > -- >> > 2.32.0.3.g01195cf9f >> > >> >
On 2024/1/12 17:10, Jiri Pirko wrote: > Fri, Jan 12, 2024 at 09:29:35AM CET, guwen@linux.alibaba.com wrote: >> >> <...> >>>> inter-containers communication within the same OS instance. >>>> >>>> # Design >>>> >>>> This patch set includes 3 parts: >>>> >>>> - Patch #1-#2: some prepare work for loopback-ism. >>>> - Patch #3-#9: implement loopback-ism device. >>>> - Patch #10-#15: memory copy optimization for loopback scenario. >>>> >>>> The loopback-ism device is designed as a ISMv2 device and not be limited to >>>> a specific net namespace, ends of both inter-process connection (1/1' in diagram >>>> below) or inter-container connection (2/2' in diagram below) can find the same >>>> available loopback-ism and choose it during the CLC handshake. >>>> >>>> Container 1 (ns1) Container 2 (ns2) >>>> +-----------------------------------------+ +-------------------------+ >>>> | +-------+ +-------+ +-------+ | | +-------+ | >>>> | | App A | | App B | | App C | | | | App D |<-+ | >>>> | +-------+ +---^---+ +-------+ | | +-------+ |(2') | >>>> | |127.0.0.1 (1')| |192.168.0.11 192.168.0.12| | >>>> | (1)| +--------+ | +--------+ |(2) | | +--------+ +--------+ | >>>> | `-->| lo |-` | eth0 |<-` | | | lo | | eth0 | | >>>> +---------+--|---^-+---+-----|--+---------+ +-+--------+---+-^------+-+ >>>> | | | | >>>> Kernel | | | | >>>> +----+-------v---+-----------v----------------------------------+---+----+ >>>> | | TCP | | >>>> | | | | >>>> | +--------------------------------------------------------------+ | >>>> | | >>>> | +--------------+ | >>>> | | smc loopback | | >>>> +---------------------------+--------------+-----------------------------+ >>>> >>>> loopback-ism device creates DMBs (shared memory) for each connection peer. >>>> Since data transfer occurs within the same kernel, the sndbuf of each peer >>>> is only a descriptor and point to the same memory region as peer DMB, so that >>>> the data copy from sndbuf to peer DMB can be avoided in loopback-ism case. >>>> >>>> Container 1 (ns1) Container 2 (ns2) >>>> +-----------------------------------------+ +-------------------------+ >>>> | +-------+ | | +-------+ | >>>> | | App C |-----+ | | | App D | | >>>> | +-------+ | | | +-^-----+ | >>>> | | | | | | >>>> | (2) | | | (2') | | >>>> | | | | | | >>>> +---------------|-------------------------+ +----------|--------------+ >>>> | | >>>> Kernel | | >>>> +---------------|-----------------------------------------|--------------+ >>>> | +--------+ +--v-----+ +--------+ +--------+ | >>>> | |dmb_desc| |snd_desc| |dmb_desc| |snd_desc| | >>>> | +-----|--+ +--|-----+ +-----|--+ +--------+ | >>>> | +-----|--+ | +-----|--+ | >>>> | | DMB C | +---------------------------------| DMB D | | >>>> | +--------+ +--------+ | >>>> | | >>>> | +--------------+ | >>>> | | smc loopback | | >>>> +---------------------------+--------------+-----------------------------+ >>>> >>>> # Benchmark Test >>>> >>>> * Test environments: >>>> - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem. >>>> - SMC sndbuf/DMB size 1MB. >>>> - /sys/devices/virtual/smc/loopback-ism/dmb_copy is set to default 0, >>>> which means sndbuf and DMB are merged and no data copied between them. >>>> - /sys/devices/virtual/smc/loopback-ism/dmb_type is set to default 0, >>> >>> Exposing any configuration knobs and statistics over sysfs for >>> softdevices does not look correct at all :/ Could you please avoid >>> sysfs? >>> >> >> In previous reviews and calls, we think loopback-ism needs to be more >> like a device and be visible under /sys/devices. >> >> Would you mind explaining why using sysfs for loopback-ism is not correct? >> since I saw some other configurations or statistics exists under /sys/devices, >> e.g. /sys/devices/virtual/net/lo. Thank you! > > You have smc_netlink.c exposing clear netlink api for the subsystem. > Can't you extend it to contain the configuration knobs and expose stats > instead of sysfs? > Thank you for the suggestion. I've also considered this approach. But I didn't choose to extend the smc netlink because for now smc netlink are used for SMC protocol related attributes, for example: SMC_NETLINK_GET_SYS_INFO: SMC version, release, v2-capable.. SMC_NETLINK_GET_LGR_SMC{R|D}: SMC-{R|D} link group inform (lgr id, lgr conn num, lgr role..) SMC_NETLINK_GET_LINK_SMCR: SMC-R link inform (link id, link state, conn cnt..) SMC_NETLINK_GET_DEV_SMCD: SMC-D device generic inform (user cnt, pci_fid, pci_chid, pci_vendor..) SMC_NETLINK_GET_DEV_SMCR: SMC-R device generic inform (dev name, port pnet_id, port valid, port state..) SMC_NETLINK_GET_STATS: SMC generic stats (RMB cnt, Tx size, Rx size, RMB size...) And the knobs and stats in this patchset are loopback-ism device specific attributes, for example: active: loopback-ism runtime switch dmb_type: type of DMB provided by loopback-ism dmb_copy: support for DMB merge of loopback-ism xfer_bytes: data transferred by loopback-ism dmbs_cnt: DMB num provided by loopback-ism The layer will be: +--------------------------------------+ | | | SMC protocol | | (attrs by netlink in smc_netlink.c) | | | +--------------------------------------+ ------------------smcd_ops------------------ +---------------+ +---------------------+ +--------------+ | loopback-ism | | s390 firmware ISM | | Possible | +---------------+ | | | other | (attrs by sysfs | | | virtual ISM | in smc_loopback.c) | | | | | | | | +---------------------+ +--------------+ So I choose to use current way to provide this lower layer loopback-ism device's attributes, restrict loopback-ism specific code to smc_loopback.c and try to make a clear layer architecture. Thanks, Wen Gu > >> >> >> >> Thanks again, >> Wen Gu >> >>> >>>> which means DMB is physically contiguous buffer. >>>> >>>> * Test object: >>>> - TCP: run on TCP loopback. >>>> - SMC lo: run on SMC loopback device. >>>> >>>> 1. ipc-benchmark (see [3]) >>>> >>>> - ./<foo> -c 1000000 -s 100 >>>> >>>> TCP SMC-lo >>>> Message >>>> rate (msg/s) 80636 149515(+85.42%) >>>> >>>> 2. sockperf >>>> >>>> - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp >>>> - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30 >>>> >>>> TCP SMC-lo >>>> Bandwidth(MBps) 4909.36 8197.57(+66.98%) >>>> Latency(us) 6.098 3.383(-44.52%) >>>> >>>> 3. nginx/wrk >>>> >>>> - serv: <smc_run> nginx >>>> - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80 >>>> >>>> TCP SMC-lo >>>> Requests/s 181685.74 246447.77(+35.65%) >>>> >>>> 4. redis-benchmark >>>> >>>> - serv: <smc_run> redis-server >>>> - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024 >>>> >>>> TCP SMC-lo >>>> GET(Requests/s) 85855.34 118553.64(+38.09%) >>>> SET(Requests/s) 86824.40 125944.58(+45.06%) >>>> >>>> >>>> Change log: >>>> >>>> v1->RFC: >>>> - Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics: >>>> /sys/devices/virtual/smc/loopback-ism/xfer_bytes >>>> - Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports >>>> merging sndbuf with peer DMB. >>>> - Patch #13 & #14: introduce loopback-ism device control of DMB memory type and >>>> control of whether to merge sndbuf and DMB. They can be respectively set by: >>>> /sys/devices/virtual/smc/loopback-ism/dmb_type >>>> /sys/devices/virtual/smc/loopback-ism/dmb_copy >>>> The motivation for these two control is that a performance bottleneck was >>>> found when using vzalloced DMB and sndbuf is merged with DMB, and there are >>>> many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused >>>> by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg() >>>> or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the >>>> vmap lock contention [6]. It has significant effects, but using virtual memory >>>> still has additional overhead compared to using physical memory. >>>> So this new version provides controls of dmb_type and dmb_copy to suit >>>> different scenarios. >>>> - Some minor changes and comments improvements. >>>> >>>> RFC->old version([1]): >>>> Link: https://lore.kernel.org/netdev/1702214654-32069-1-git-send-email-guwen@linux.alibaba.com/ >>>> - Patch #1: improve the loopback-ism dump, it shows as follows now: >>>> # smcd d >>>> FID Type PCI-ID PCHID InUse #LGs PNET-ID >>>> 0000 0 loopback-ism ffff No 0 >>>> - Patch #3: introduce the smc_ism_set_v2_capable() helper and set >>>> smc_ism_v2_capable when ISMv2 or virtual ISM is registered, >>>> regardless of whether there is already a device in smcd device list. >>>> - Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/. >>>> - Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active >>>> to activate or deactivate the loopback-ism. >>>> - Patch #9: introduce the statistics of loopback-ism by >>>> /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_tytes|dmbs_cnt}. >>>> - Some minor changes and comments improvements. >>>> >>>> [1] https://lore.kernel.org/netdev/1695568613-125057-1-git-send-email-guwen@linux.alibaba.com/ >>>> [2] https://lore.kernel.org/netdev/20231219142616.80697-1-guwen@linux.alibaba.com/ >>>> [3] https://github.com/goldsborough/ipc-bench >>>> [4] https://lore.kernel.org/all/3189e342-c38f-6076-b730-19a6efd732a5@linux.alibaba.com/ >>>> [5] https://lore.kernel.org/all/238e63cd-e0e8-4fbf-852f-bc4d5bc35d5a@linux.alibaba.com/ >>>> [6] https://lore.kernel.org/all/20240102184633.748113-1-urezki@gmail.com/ >>>> >>>> Wen Gu (15): >>>> net/smc: improve SMC-D device dump for virtual ISM >>>> net/smc: decouple specialized struct from SMC-D DMB registration >>>> net/smc: introduce virtual ISM device loopback-ism >>>> net/smc: implement ID-related operations of loopback-ism >>>> net/smc: implement some unsupported operations of loopback-ism >>>> net/smc: implement DMB-related operations of loopback-ism >>>> net/smc: register loopback-ism into SMC-D device list >>>> net/smc: introduce loopback-ism runtime switch >>>> net/smc: introduce loopback-ism statistics attributes >>>> net/smc: add operations to merge sndbuf with peer DMB >>>> net/smc: attach or detach ghost sndbuf to peer DMB >>>> net/smc: adapt cursor update when sndbuf and peer DMB are merged >>>> net/smc: introduce loopback-ism DMB type control >>>> net/smc: introduce loopback-ism DMB data copy control >>>> net/smc: implement DMB-merged operations of loopback-ism >>>> >>>> drivers/s390/net/ism_drv.c | 2 +- >>>> include/net/smc.h | 7 +- >>>> net/smc/Kconfig | 13 + >>>> net/smc/Makefile | 2 +- >>>> net/smc/af_smc.c | 28 +- >>>> net/smc/smc_cdc.c | 58 ++- >>>> net/smc/smc_cdc.h | 1 + >>>> net/smc/smc_core.c | 61 +++- >>>> net/smc/smc_core.h | 1 + >>>> net/smc/smc_ism.c | 71 +++- >>>> net/smc/smc_ism.h | 5 + >>>> net/smc/smc_loopback.c | 718 +++++++++++++++++++++++++++++++++++++ >>>> net/smc/smc_loopback.h | 88 +++++ >>>> 13 files changed, 1026 insertions(+), 29 deletions(-) >>>> create mode 100644 net/smc/smc_loopback.c >>>> create mode 100644 net/smc/smc_loopback.h >>>> >>>> -- >>>> 2.32.0.3.g01195cf9f >>>> >>>>
Fri, Jan 12, 2024 at 01:32:14PM CET, guwen@linux.alibaba.com wrote: > > >On 2024/1/12 17:10, Jiri Pirko wrote: >> Fri, Jan 12, 2024 at 09:29:35AM CET, guwen@linux.alibaba.com wrote: >> > >> > > ><...> > >> > > > inter-containers communication within the same OS instance. >> > > > >> > > > # Design >> > > > >> > > > This patch set includes 3 parts: >> > > > >> > > > - Patch #1-#2: some prepare work for loopback-ism. >> > > > - Patch #3-#9: implement loopback-ism device. >> > > > - Patch #10-#15: memory copy optimization for loopback scenario. >> > > > >> > > > The loopback-ism device is designed as a ISMv2 device and not be limited to >> > > > a specific net namespace, ends of both inter-process connection (1/1' in diagram >> > > > below) or inter-container connection (2/2' in diagram below) can find the same >> > > > available loopback-ism and choose it during the CLC handshake. >> > > > >> > > > Container 1 (ns1) Container 2 (ns2) >> > > > +-----------------------------------------+ +-------------------------+ >> > > > | +-------+ +-------+ +-------+ | | +-------+ | >> > > > | | App A | | App B | | App C | | | | App D |<-+ | >> > > > | +-------+ +---^---+ +-------+ | | +-------+ |(2') | >> > > > | |127.0.0.1 (1')| |192.168.0.11 192.168.0.12| | >> > > > | (1)| +--------+ | +--------+ |(2) | | +--------+ +--------+ | >> > > > | `-->| lo |-` | eth0 |<-` | | | lo | | eth0 | | >> > > > +---------+--|---^-+---+-----|--+---------+ +-+--------+---+-^------+-+ >> > > > | | | | >> > > > Kernel | | | | >> > > > +----+-------v---+-----------v----------------------------------+---+----+ >> > > > | | TCP | | >> > > > | | | | >> > > > | +--------------------------------------------------------------+ | >> > > > | | >> > > > | +--------------+ | >> > > > | | smc loopback | | >> > > > +---------------------------+--------------+-----------------------------+ >> > > > >> > > > loopback-ism device creates DMBs (shared memory) for each connection peer. >> > > > Since data transfer occurs within the same kernel, the sndbuf of each peer >> > > > is only a descriptor and point to the same memory region as peer DMB, so that >> > > > the data copy from sndbuf to peer DMB can be avoided in loopback-ism case. >> > > > >> > > > Container 1 (ns1) Container 2 (ns2) >> > > > +-----------------------------------------+ +-------------------------+ >> > > > | +-------+ | | +-------+ | >> > > > | | App C |-----+ | | | App D | | >> > > > | +-------+ | | | +-^-----+ | >> > > > | | | | | | >> > > > | (2) | | | (2') | | >> > > > | | | | | | >> > > > +---------------|-------------------------+ +----------|--------------+ >> > > > | | >> > > > Kernel | | >> > > > +---------------|-----------------------------------------|--------------+ >> > > > | +--------+ +--v-----+ +--------+ +--------+ | >> > > > | |dmb_desc| |snd_desc| |dmb_desc| |snd_desc| | >> > > > | +-----|--+ +--|-----+ +-----|--+ +--------+ | >> > > > | +-----|--+ | +-----|--+ | >> > > > | | DMB C | +---------------------------------| DMB D | | >> > > > | +--------+ +--------+ | >> > > > | | >> > > > | +--------------+ | >> > > > | | smc loopback | | >> > > > +---------------------------+--------------+-----------------------------+ >> > > > >> > > > # Benchmark Test >> > > > >> > > > * Test environments: >> > > > - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem. >> > > > - SMC sndbuf/DMB size 1MB. >> > > > - /sys/devices/virtual/smc/loopback-ism/dmb_copy is set to default 0, >> > > > which means sndbuf and DMB are merged and no data copied between them. >> > > > - /sys/devices/virtual/smc/loopback-ism/dmb_type is set to default 0, >> > > >> > > Exposing any configuration knobs and statistics over sysfs for >> > > softdevices does not look correct at all :/ Could you please avoid >> > > sysfs? >> > > >> > >> > In previous reviews and calls, we think loopback-ism needs to be more >> > like a device and be visible under /sys/devices. >> > >> > Would you mind explaining why using sysfs for loopback-ism is not correct? >> > since I saw some other configurations or statistics exists under /sys/devices, >> > e.g. /sys/devices/virtual/net/lo. Thank you! >> >> You have smc_netlink.c exposing clear netlink api for the subsystem. >> Can't you extend it to contain the configuration knobs and expose stats >> instead of sysfs? >> > >Thank you for the suggestion. I've also considered this approach. > >But I didn't choose to extend the smc netlink because for now smc netlink >are used for SMC protocol related attributes, for example: > >SMC_NETLINK_GET_SYS_INFO: SMC version, release, v2-capable.. >SMC_NETLINK_GET_LGR_SMC{R|D}: SMC-{R|D} link group inform (lgr id, lgr conn num, lgr role..) >SMC_NETLINK_GET_LINK_SMCR: SMC-R link inform (link id, link state, conn cnt..) >SMC_NETLINK_GET_DEV_SMCD: SMC-D device generic inform (user cnt, pci_fid, pci_chid, pci_vendor..) >SMC_NETLINK_GET_DEV_SMCR: SMC-R device generic inform (dev name, port pnet_id, port valid, port state..) >SMC_NETLINK_GET_STATS: SMC generic stats (RMB cnt, Tx size, Rx size, RMB size...) > >And the knobs and stats in this patchset are loopback-ism device specific >attributes, for example: > >active: loopback-ism runtime switch >dmb_type: type of DMB provided by loopback-ism >dmb_copy: support for DMB merge of loopback-ism >xfer_bytes: data transferred by loopback-ism >dmbs_cnt: DMB num provided by loopback-ism > >The layer will be: > > +--------------------------------------+ > | | > | SMC protocol | > | (attrs by netlink in smc_netlink.c) | > | | > +--------------------------------------+ > ------------------smcd_ops------------------ > +---------------+ +---------------------+ +--------------+ > | loopback-ism | | s390 firmware ISM | | Possible | > +---------------+ | | | other | > (attrs by sysfs | | | virtual ISM | > in smc_loopback.c) | | | | > | | | | > +---------------------+ +--------------+ So nest it: SMC_NETLINK_BACKEND_GET_INFO SMC_NETLINK_BACKEND_GET_STATS ? I mean, isn't it better to have the backend knobs and stats in one place under same netlink commands and attributes than random sysfs path ? > >So I choose to use current way to provide this lower layer loopback-ism >device's attributes, restrict loopback-ism specific code to smc_loopback.c >and try to make a clear layer architecture. > >Thanks, >Wen Gu >> >> > >> > >> > >> > Thanks again, >> > Wen Gu >> > >> > > >> > > > which means DMB is physically contiguous buffer. >> > > > >> > > > * Test object: >> > > > - TCP: run on TCP loopback. >> > > > - SMC lo: run on SMC loopback device. >> > > > >> > > > 1. ipc-benchmark (see [3]) >> > > > >> > > > - ./<foo> -c 1000000 -s 100 >> > > > >> > > > TCP SMC-lo >> > > > Message >> > > > rate (msg/s) 80636 149515(+85.42%) >> > > > >> > > > 2. sockperf >> > > > >> > > > - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp >> > > > - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30 >> > > > >> > > > TCP SMC-lo >> > > > Bandwidth(MBps) 4909.36 8197.57(+66.98%) >> > > > Latency(us) 6.098 3.383(-44.52%) >> > > > >> > > > 3. nginx/wrk >> > > > >> > > > - serv: <smc_run> nginx >> > > > - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80 >> > > > >> > > > TCP SMC-lo >> > > > Requests/s 181685.74 246447.77(+35.65%) >> > > > >> > > > 4. redis-benchmark >> > > > >> > > > - serv: <smc_run> redis-server >> > > > - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024 >> > > > >> > > > TCP SMC-lo >> > > > GET(Requests/s) 85855.34 118553.64(+38.09%) >> > > > SET(Requests/s) 86824.40 125944.58(+45.06%) >> > > > >> > > > >> > > > Change log: >> > > > >> > > > v1->RFC: >> > > > - Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics: >> > > > /sys/devices/virtual/smc/loopback-ism/xfer_bytes >> > > > - Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports >> > > > merging sndbuf with peer DMB. >> > > > - Patch #13 & #14: introduce loopback-ism device control of DMB memory type and >> > > > control of whether to merge sndbuf and DMB. They can be respectively set by: >> > > > /sys/devices/virtual/smc/loopback-ism/dmb_type >> > > > /sys/devices/virtual/smc/loopback-ism/dmb_copy >> > > > The motivation for these two control is that a performance bottleneck was >> > > > found when using vzalloced DMB and sndbuf is merged with DMB, and there are >> > > > many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused >> > > > by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg() >> > > > or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the >> > > > vmap lock contention [6]. It has significant effects, but using virtual memory >> > > > still has additional overhead compared to using physical memory. >> > > > So this new version provides controls of dmb_type and dmb_copy to suit >> > > > different scenarios. >> > > > - Some minor changes and comments improvements. >> > > > >> > > > RFC->old version([1]): >> > > > Link: https://lore.kernel.org/netdev/1702214654-32069-1-git-send-email-guwen@linux.alibaba.com/ >> > > > - Patch #1: improve the loopback-ism dump, it shows as follows now: >> > > > # smcd d >> > > > FID Type PCI-ID PCHID InUse #LGs PNET-ID >> > > > 0000 0 loopback-ism ffff No 0 >> > > > - Patch #3: introduce the smc_ism_set_v2_capable() helper and set >> > > > smc_ism_v2_capable when ISMv2 or virtual ISM is registered, >> > > > regardless of whether there is already a device in smcd device list. >> > > > - Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/. >> > > > - Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active >> > > > to activate or deactivate the loopback-ism. >> > > > - Patch #9: introduce the statistics of loopback-ism by >> > > > /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_tytes|dmbs_cnt}. >> > > > - Some minor changes and comments improvements. >> > > > >> > > > [1] https://lore.kernel.org/netdev/1695568613-125057-1-git-send-email-guwen@linux.alibaba.com/ >> > > > [2] https://lore.kernel.org/netdev/20231219142616.80697-1-guwen@linux.alibaba.com/ >> > > > [3] https://github.com/goldsborough/ipc-bench >> > > > [4] https://lore.kernel.org/all/3189e342-c38f-6076-b730-19a6efd732a5@linux.alibaba.com/ >> > > > [5] https://lore.kernel.org/all/238e63cd-e0e8-4fbf-852f-bc4d5bc35d5a@linux.alibaba.com/ >> > > > [6] https://lore.kernel.org/all/20240102184633.748113-1-urezki@gmail.com/ >> > > > >> > > > Wen Gu (15): >> > > > net/smc: improve SMC-D device dump for virtual ISM >> > > > net/smc: decouple specialized struct from SMC-D DMB registration >> > > > net/smc: introduce virtual ISM device loopback-ism >> > > > net/smc: implement ID-related operations of loopback-ism >> > > > net/smc: implement some unsupported operations of loopback-ism >> > > > net/smc: implement DMB-related operations of loopback-ism >> > > > net/smc: register loopback-ism into SMC-D device list >> > > > net/smc: introduce loopback-ism runtime switch >> > > > net/smc: introduce loopback-ism statistics attributes >> > > > net/smc: add operations to merge sndbuf with peer DMB >> > > > net/smc: attach or detach ghost sndbuf to peer DMB >> > > > net/smc: adapt cursor update when sndbuf and peer DMB are merged >> > > > net/smc: introduce loopback-ism DMB type control >> > > > net/smc: introduce loopback-ism DMB data copy control >> > > > net/smc: implement DMB-merged operations of loopback-ism >> > > > >> > > > drivers/s390/net/ism_drv.c | 2 +- >> > > > include/net/smc.h | 7 +- >> > > > net/smc/Kconfig | 13 + >> > > > net/smc/Makefile | 2 +- >> > > > net/smc/af_smc.c | 28 +- >> > > > net/smc/smc_cdc.c | 58 ++- >> > > > net/smc/smc_cdc.h | 1 + >> > > > net/smc/smc_core.c | 61 +++- >> > > > net/smc/smc_core.h | 1 + >> > > > net/smc/smc_ism.c | 71 +++- >> > > > net/smc/smc_ism.h | 5 + >> > > > net/smc/smc_loopback.c | 718 +++++++++++++++++++++++++++++++++++++ >> > > > net/smc/smc_loopback.h | 88 +++++ >> > > > 13 files changed, 1026 insertions(+), 29 deletions(-) >> > > > create mode 100644 net/smc/smc_loopback.c >> > > > create mode 100644 net/smc/smc_loopback.h >> > > > >> > > > -- >> > > > 2.32.0.3.g01195cf9f >> > > > >> > > >
On 2024/1/12 23:50, Jiri Pirko wrote: > Fri, Jan 12, 2024 at 01:32:14PM CET, guwen@linux.alibaba.com wrote: >> >> >> On 2024/1/12 17:10, Jiri Pirko wrote: >>> Fri, Jan 12, 2024 at 09:29:35AM CET, guwen@linux.alibaba.com wrote: >>>> <...> >>>>>> # Benchmark Test >>>>>> >>>>>> * Test environments: >>>>>> - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem. >>>>>> - SMC sndbuf/DMB size 1MB. >>>>>> - /sys/devices/virtual/smc/loopback-ism/dmb_copy is set to default 0, >>>>>> which means sndbuf and DMB are merged and no data copied between them. >>>>>> - /sys/devices/virtual/smc/loopback-ism/dmb_type is set to default 0, >>>>> >>>>> Exposing any configuration knobs and statistics over sysfs for >>>>> softdevices does not look correct at all :/ Could you please avoid >>>>> sysfs? >>>>> >>>> >>>> In previous reviews and calls, we think loopback-ism needs to be more >>>> like a device and be visible under /sys/devices. >>>> >>>> Would you mind explaining why using sysfs for loopback-ism is not correct? >>>> since I saw some other configurations or statistics exists under /sys/devices, >>>> e.g. /sys/devices/virtual/net/lo. Thank you! >>> >>> You have smc_netlink.c exposing clear netlink api for the subsystem. >>> Can't you extend it to contain the configuration knobs and expose stats >>> instead of sysfs? >>> >> >> Thank you for the suggestion. I've also considered this approach. >> >> But I didn't choose to extend the smc netlink because for now smc netlink >> are used for SMC protocol related attributes, for example: >> >> SMC_NETLINK_GET_SYS_INFO: SMC version, release, v2-capable.. >> SMC_NETLINK_GET_LGR_SMC{R|D}: SMC-{R|D} link group inform (lgr id, lgr conn num, lgr role..) >> SMC_NETLINK_GET_LINK_SMCR: SMC-R link inform (link id, link state, conn cnt..) >> SMC_NETLINK_GET_DEV_SMCD: SMC-D device generic inform (user cnt, pci_fid, pci_chid, pci_vendor..) >> SMC_NETLINK_GET_DEV_SMCR: SMC-R device generic inform (dev name, port pnet_id, port valid, port state..) >> SMC_NETLINK_GET_STATS: SMC generic stats (RMB cnt, Tx size, Rx size, RMB size...) >> >> And the knobs and stats in this patchset are loopback-ism device specific >> attributes, for example: >> >> active: loopback-ism runtime switch >> dmb_type: type of DMB provided by loopback-ism >> dmb_copy: support for DMB merge of loopback-ism >> xfer_bytes: data transferred by loopback-ism >> dmbs_cnt: DMB num provided by loopback-ism >> >> The layer will be: >> >> +--------------------------------------+ >> | | >> | SMC protocol | >> | (attrs by netlink in smc_netlink.c) | >> | | >> +--------------------------------------+ >> ------------------smcd_ops------------------ >> +---------------+ +---------------------+ +--------------+ >> | loopback-ism | | s390 firmware ISM | | Possible | >> +---------------+ | | | other | >> (attrs by sysfs | | | virtual ISM | >> in smc_loopback.c) | | | | >> | | | | >> +---------------------+ +--------------+ > > So nest it: > SMC_NETLINK_BACKEND_GET_INFO > SMC_NETLINK_BACKEND_GET_STATS > ? > I mean, isn't it better to have the backend knobs and stats in one place > under same netlink commands and attributes than random sysfs path ? > Thank you for suggestion. I think it is not about nesting or gathering knobs and stats. It is about not coupling underlying device details to upper layer SMC stack. From SMC perspective, it cares about the abstract operations defined by smcd_ops, regardless of which underlying devices provide these functions and how they provide. So IMO the details or configurations of underlying devices shouldn't be involved in SMC. Besides, the knobs and stats here are specific for loopback-ism device, they include runtime switch, buffer type choice and mode choice of loopback-ism (basically they won't change after being set once). The other kinds of devices used by SMC-D, e.g. s390 firmware ISM or other virtual ISMs have no similar things. So I prefer to keep the current solution instead of expanding upper layer SMC netlink. Thanks, Wen Gu > > >> >> So I choose to use current way to provide this lower layer loopback-ism >> device's attributes, restrict loopback-ism specific code to smc_loopback.c >> and try to make a clear layer architecture. >> >> Thanks, >> Wen Gu >>> >>>> >>>> >>>> >>>> Thanks again, >>>> Wen Gu >>>> >>>>> >>>>>> which means DMB is physically contiguous buffer. >>>>>> >>>>>> * Test object: >>>>>> - TCP: run on TCP loopback. >>>>>> - SMC lo: run on SMC loopback device. >>>>>> >>>>>> 1. ipc-benchmark (see [3]) >>>>>> >>>>>> - ./<foo> -c 1000000 -s 100 >>>>>> >>>>>> TCP SMC-lo >>>>>> Message >>>>>> rate (msg/s) 80636 149515(+85.42%) >>>>>> >>>>>> 2. sockperf >>>>>> >>>>>> - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp >>>>>> - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30 >>>>>> >>>>>> TCP SMC-lo >>>>>> Bandwidth(MBps) 4909.36 8197.57(+66.98%) >>>>>> Latency(us) 6.098 3.383(-44.52%) >>>>>> >>>>>> 3. nginx/wrk >>>>>> >>>>>> - serv: <smc_run> nginx >>>>>> - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80 >>>>>> >>>>>> TCP SMC-lo >>>>>> Requests/s 181685.74 246447.77(+35.65%) >>>>>> >>>>>> 4. redis-benchmark >>>>>> >>>>>> - serv: <smc_run> redis-server >>>>>> - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024 >>>>>> >>>>>> TCP SMC-lo >>>>>> GET(Requests/s) 85855.34 118553.64(+38.09%) >>>>>> SET(Requests/s) 86824.40 125944.58(+45.06%) >>>>>> >>>>>> >>>>>> Change log: >>>>>> >>>>>> v1->RFC: >>>>>> - Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics: >>>>>> /sys/devices/virtual/smc/loopback-ism/xfer_bytes >>>>>> - Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports >>>>>> merging sndbuf with peer DMB. >>>>>> - Patch #13 & #14: introduce loopback-ism device control of DMB memory type and >>>>>> control of whether to merge sndbuf and DMB. They can be respectively set by: >>>>>> /sys/devices/virtual/smc/loopback-ism/dmb_type >>>>>> /sys/devices/virtual/smc/loopback-ism/dmb_copy >>>>>> The motivation for these two control is that a performance bottleneck was >>>>>> found when using vzalloced DMB and sndbuf is merged with DMB, and there are >>>>>> many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused >>>>>> by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg() >>>>>> or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the >>>>>> vmap lock contention [6]. It has significant effects, but using virtual memory >>>>>> still has additional overhead compared to using physical memory. >>>>>> So this new version provides controls of dmb_type and dmb_copy to suit >>>>>> different scenarios. >>>>>> - Some minor changes and comments improvements. >>>>>> >>>>>> RFC->old version([1]): >>>>>> Link: https://lore.kernel.org/netdev/1702214654-32069-1-git-send-email-guwen@linux.alibaba.com/ >>>>>> - Patch #1: improve the loopback-ism dump, it shows as follows now: >>>>>> # smcd d >>>>>> FID Type PCI-ID PCHID InUse #LGs PNET-ID >>>>>> 0000 0 loopback-ism ffff No 0 >>>>>> - Patch #3: introduce the smc_ism_set_v2_capable() helper and set >>>>>> smc_ism_v2_capable when ISMv2 or virtual ISM is registered, >>>>>> regardless of whether there is already a device in smcd device list. >>>>>> - Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/. >>>>>> - Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active >>>>>> to activate or deactivate the loopback-ism. >>>>>> - Patch #9: introduce the statistics of loopback-ism by >>>>>> /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_tytes|dmbs_cnt}. >>>>>> - Some minor changes and comments improvements. >>>>>> >>>>>> [1] https://lore.kernel.org/netdev/1695568613-125057-1-git-send-email-guwen@linux.alibaba.com/ >>>>>> [2] https://lore.kernel.org/netdev/20231219142616.80697-1-guwen@linux.alibaba.com/ >>>>>> [3] https://github.com/goldsborough/ipc-bench >>>>>> [4] https://lore.kernel.org/all/3189e342-c38f-6076-b730-19a6efd732a5@linux.alibaba.com/ >>>>>> [5] https://lore.kernel.org/all/238e63cd-e0e8-4fbf-852f-bc4d5bc35d5a@linux.alibaba.com/ >>>>>> [6] https://lore.kernel.org/all/20240102184633.748113-1-urezki@gmail.com/ >>>>>> >>>>>> Wen Gu (15): >>>>>> net/smc: improve SMC-D device dump for virtual ISM >>>>>> net/smc: decouple specialized struct from SMC-D DMB registration >>>>>> net/smc: introduce virtual ISM device loopback-ism >>>>>> net/smc: implement ID-related operations of loopback-ism >>>>>> net/smc: implement some unsupported operations of loopback-ism >>>>>> net/smc: implement DMB-related operations of loopback-ism >>>>>> net/smc: register loopback-ism into SMC-D device list >>>>>> net/smc: introduce loopback-ism runtime switch >>>>>> net/smc: introduce loopback-ism statistics attributes >>>>>> net/smc: add operations to merge sndbuf with peer DMB >>>>>> net/smc: attach or detach ghost sndbuf to peer DMB >>>>>> net/smc: adapt cursor update when sndbuf and peer DMB are merged >>>>>> net/smc: introduce loopback-ism DMB type control >>>>>> net/smc: introduce loopback-ism DMB data copy control >>>>>> net/smc: implement DMB-merged operations of loopback-ism >>>>>> >>>>>> drivers/s390/net/ism_drv.c | 2 +- >>>>>> include/net/smc.h | 7 +- >>>>>> net/smc/Kconfig | 13 + >>>>>> net/smc/Makefile | 2 +- >>>>>> net/smc/af_smc.c | 28 +- >>>>>> net/smc/smc_cdc.c | 58 ++- >>>>>> net/smc/smc_cdc.h | 1 + >>>>>> net/smc/smc_core.c | 61 +++- >>>>>> net/smc/smc_core.h | 1 + >>>>>> net/smc/smc_ism.c | 71 +++- >>>>>> net/smc/smc_ism.h | 5 + >>>>>> net/smc/smc_loopback.c | 718 +++++++++++++++++++++++++++++++++++++ >>>>>> net/smc/smc_loopback.h | 88 +++++ >>>>>> 13 files changed, 1026 insertions(+), 29 deletions(-) >>>>>> create mode 100644 net/smc/smc_loopback.c >>>>>> create mode 100644 net/smc/smc_loopback.h >>>>>> >>>>>> -- >>>>>> 2.32.0.3.g01195cf9f >>>>>> >>>>>>
Sat, Jan 13, 2024 at 10:22:15AM CET, guwen@linux.alibaba.com wrote: > > >On 2024/1/12 23:50, Jiri Pirko wrote: >> Fri, Jan 12, 2024 at 01:32:14PM CET, guwen@linux.alibaba.com wrote: >> > >> > >> > On 2024/1/12 17:10, Jiri Pirko wrote: >> > > Fri, Jan 12, 2024 at 09:29:35AM CET, guwen@linux.alibaba.com wrote: >> > > > > ><...> > >> > > > > > # Benchmark Test >> > > > > > >> > > > > > * Test environments: >> > > > > > - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem. >> > > > > > - SMC sndbuf/DMB size 1MB. >> > > > > > - /sys/devices/virtual/smc/loopback-ism/dmb_copy is set to default 0, >> > > > > > which means sndbuf and DMB are merged and no data copied between them. >> > > > > > - /sys/devices/virtual/smc/loopback-ism/dmb_type is set to default 0, >> > > > > >> > > > > Exposing any configuration knobs and statistics over sysfs for >> > > > > softdevices does not look correct at all :/ Could you please avoid >> > > > > sysfs? >> > > > > >> > > > >> > > > In previous reviews and calls, we think loopback-ism needs to be more >> > > > like a device and be visible under /sys/devices. >> > > > >> > > > Would you mind explaining why using sysfs for loopback-ism is not correct? >> > > > since I saw some other configurations or statistics exists under /sys/devices, >> > > > e.g. /sys/devices/virtual/net/lo. Thank you! >> > > >> > > You have smc_netlink.c exposing clear netlink api for the subsystem. >> > > Can't you extend it to contain the configuration knobs and expose stats >> > > instead of sysfs? >> > > >> > >> > Thank you for the suggestion. I've also considered this approach. >> > >> > But I didn't choose to extend the smc netlink because for now smc netlink >> > are used for SMC protocol related attributes, for example: >> > >> > SMC_NETLINK_GET_SYS_INFO: SMC version, release, v2-capable.. >> > SMC_NETLINK_GET_LGR_SMC{R|D}: SMC-{R|D} link group inform (lgr id, lgr conn num, lgr role..) >> > SMC_NETLINK_GET_LINK_SMCR: SMC-R link inform (link id, link state, conn cnt..) >> > SMC_NETLINK_GET_DEV_SMCD: SMC-D device generic inform (user cnt, pci_fid, pci_chid, pci_vendor..) >> > SMC_NETLINK_GET_DEV_SMCR: SMC-R device generic inform (dev name, port pnet_id, port valid, port state..) >> > SMC_NETLINK_GET_STATS: SMC generic stats (RMB cnt, Tx size, Rx size, RMB size...) >> > >> > And the knobs and stats in this patchset are loopback-ism device specific >> > attributes, for example: >> > >> > active: loopback-ism runtime switch >> > dmb_type: type of DMB provided by loopback-ism >> > dmb_copy: support for DMB merge of loopback-ism >> > xfer_bytes: data transferred by loopback-ism >> > dmbs_cnt: DMB num provided by loopback-ism >> > >> > The layer will be: >> > >> > +--------------------------------------+ >> > | | >> > | SMC protocol | >> > | (attrs by netlink in smc_netlink.c) | >> > | | >> > +--------------------------------------+ >> > ------------------smcd_ops------------------ >> > +---------------+ +---------------------+ +--------------+ >> > | loopback-ism | | s390 firmware ISM | | Possible | >> > +---------------+ | | | other | >> > (attrs by sysfs | | | virtual ISM | >> > in smc_loopback.c) | | | | >> > | | | | >> > +---------------------+ +--------------+ >> >> So nest it: >> SMC_NETLINK_BACKEND_GET_INFO >> SMC_NETLINK_BACKEND_GET_STATS >> ? >> I mean, isn't it better to have the backend knobs and stats in one place >> under same netlink commands and attributes than random sysfs path ? >> >Thank you for suggestion. > >I think it is not about nesting or gathering knobs and stats. It is >about not coupling underlying device details to upper layer SMC stack. > >From SMC perspective, it cares about the abstract operations defined >by smcd_ops, regardless of which underlying devices provide these >functions and how they provide. So IMO the details or configurations >of underlying devices shouldn't be involved in SMC. So you rather keep the device configuration and info exposed over random sysfs files? Sorry, that makes not sense to me. > >Besides, the knobs and stats here are specific for loopback-ism device, >they include runtime switch, buffer type choice and mode choice of >loopback-ism (basically they won't change after being set once). The >other kinds of devices used by SMC-D, e.g. s390 firmware ISM or other >virtual ISMs have no similar things. Okay, it is normal that different drivers implement different parts of UAPI. No problem. > >So I prefer to keep the current solution instead of expanding upper >layer SMC netlink. Makes no sense to me. This is UAPI from 20 years ago. Is this a time machine? > >Thanks, >Wen Gu > >> >> >> > >> > So I choose to use current way to provide this lower layer loopback-ism >> > device's attributes, restrict loopback-ism specific code to smc_loopback.c >> > and try to make a clear layer architecture. >> > >> > Thanks, >> > Wen Gu >> > > >> > > > >> > > > >> > > > >> > > > Thanks again, >> > > > Wen Gu >> > > > >> > > > > >> > > > > > which means DMB is physically contiguous buffer. >> > > > > > >> > > > > > * Test object: >> > > > > > - TCP: run on TCP loopback. >> > > > > > - SMC lo: run on SMC loopback device. >> > > > > > >> > > > > > 1. ipc-benchmark (see [3]) >> > > > > > >> > > > > > - ./<foo> -c 1000000 -s 100 >> > > > > > >> > > > > > TCP SMC-lo >> > > > > > Message >> > > > > > rate (msg/s) 80636 149515(+85.42%) >> > > > > > >> > > > > > 2. sockperf >> > > > > > >> > > > > > - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp >> > > > > > - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30 >> > > > > > >> > > > > > TCP SMC-lo >> > > > > > Bandwidth(MBps) 4909.36 8197.57(+66.98%) >> > > > > > Latency(us) 6.098 3.383(-44.52%) >> > > > > > >> > > > > > 3. nginx/wrk >> > > > > > >> > > > > > - serv: <smc_run> nginx >> > > > > > - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80 >> > > > > > >> > > > > > TCP SMC-lo >> > > > > > Requests/s 181685.74 246447.77(+35.65%) >> > > > > > >> > > > > > 4. redis-benchmark >> > > > > > >> > > > > > - serv: <smc_run> redis-server >> > > > > > - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024 >> > > > > > >> > > > > > TCP SMC-lo >> > > > > > GET(Requests/s) 85855.34 118553.64(+38.09%) >> > > > > > SET(Requests/s) 86824.40 125944.58(+45.06%) >> > > > > > >> > > > > > >> > > > > > Change log: >> > > > > > >> > > > > > v1->RFC: >> > > > > > - Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics: >> > > > > > /sys/devices/virtual/smc/loopback-ism/xfer_bytes >> > > > > > - Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports >> > > > > > merging sndbuf with peer DMB. >> > > > > > - Patch #13 & #14: introduce loopback-ism device control of DMB memory type and >> > > > > > control of whether to merge sndbuf and DMB. They can be respectively set by: >> > > > > > /sys/devices/virtual/smc/loopback-ism/dmb_type >> > > > > > /sys/devices/virtual/smc/loopback-ism/dmb_copy >> > > > > > The motivation for these two control is that a performance bottleneck was >> > > > > > found when using vzalloced DMB and sndbuf is merged with DMB, and there are >> > > > > > many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused >> > > > > > by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg() >> > > > > > or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the >> > > > > > vmap lock contention [6]. It has significant effects, but using virtual memory >> > > > > > still has additional overhead compared to using physical memory. >> > > > > > So this new version provides controls of dmb_type and dmb_copy to suit >> > > > > > different scenarios. >> > > > > > - Some minor changes and comments improvements. >> > > > > > >> > > > > > RFC->old version([1]): >> > > > > > Link: https://lore.kernel.org/netdev/1702214654-32069-1-git-send-email-guwen@linux.alibaba.com/ >> > > > > > - Patch #1: improve the loopback-ism dump, it shows as follows now: >> > > > > > # smcd d >> > > > > > FID Type PCI-ID PCHID InUse #LGs PNET-ID >> > > > > > 0000 0 loopback-ism ffff No 0 >> > > > > > - Patch #3: introduce the smc_ism_set_v2_capable() helper and set >> > > > > > smc_ism_v2_capable when ISMv2 or virtual ISM is registered, >> > > > > > regardless of whether there is already a device in smcd device list. >> > > > > > - Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/. >> > > > > > - Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active >> > > > > > to activate or deactivate the loopback-ism. >> > > > > > - Patch #9: introduce the statistics of loopback-ism by >> > > > > > /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_tytes|dmbs_cnt}. >> > > > > > - Some minor changes and comments improvements. >> > > > > > >> > > > > > [1] https://lore.kernel.org/netdev/1695568613-125057-1-git-send-email-guwen@linux.alibaba.com/ >> > > > > > [2] https://lore.kernel.org/netdev/20231219142616.80697-1-guwen@linux.alibaba.com/ >> > > > > > [3] https://github.com/goldsborough/ipc-bench >> > > > > > [4] https://lore.kernel.org/all/3189e342-c38f-6076-b730-19a6efd732a5@linux.alibaba.com/ >> > > > > > [5] https://lore.kernel.org/all/238e63cd-e0e8-4fbf-852f-bc4d5bc35d5a@linux.alibaba.com/ >> > > > > > [6] https://lore.kernel.org/all/20240102184633.748113-1-urezki@gmail.com/ >> > > > > > >> > > > > > Wen Gu (15): >> > > > > > net/smc: improve SMC-D device dump for virtual ISM >> > > > > > net/smc: decouple specialized struct from SMC-D DMB registration >> > > > > > net/smc: introduce virtual ISM device loopback-ism >> > > > > > net/smc: implement ID-related operations of loopback-ism >> > > > > > net/smc: implement some unsupported operations of loopback-ism >> > > > > > net/smc: implement DMB-related operations of loopback-ism >> > > > > > net/smc: register loopback-ism into SMC-D device list >> > > > > > net/smc: introduce loopback-ism runtime switch >> > > > > > net/smc: introduce loopback-ism statistics attributes >> > > > > > net/smc: add operations to merge sndbuf with peer DMB >> > > > > > net/smc: attach or detach ghost sndbuf to peer DMB >> > > > > > net/smc: adapt cursor update when sndbuf and peer DMB are merged >> > > > > > net/smc: introduce loopback-ism DMB type control >> > > > > > net/smc: introduce loopback-ism DMB data copy control >> > > > > > net/smc: implement DMB-merged operations of loopback-ism >> > > > > > >> > > > > > drivers/s390/net/ism_drv.c | 2 +- >> > > > > > include/net/smc.h | 7 +- >> > > > > > net/smc/Kconfig | 13 + >> > > > > > net/smc/Makefile | 2 +- >> > > > > > net/smc/af_smc.c | 28 +- >> > > > > > net/smc/smc_cdc.c | 58 ++- >> > > > > > net/smc/smc_cdc.h | 1 + >> > > > > > net/smc/smc_core.c | 61 +++- >> > > > > > net/smc/smc_core.h | 1 + >> > > > > > net/smc/smc_ism.c | 71 +++- >> > > > > > net/smc/smc_ism.h | 5 + >> > > > > > net/smc/smc_loopback.c | 718 +++++++++++++++++++++++++++++++++++++ >> > > > > > net/smc/smc_loopback.h | 88 +++++ >> > > > > > 13 files changed, 1026 insertions(+), 29 deletions(-) >> > > > > > create mode 100644 net/smc/smc_loopback.c >> > > > > > create mode 100644 net/smc/smc_loopback.h >> > > > > > >> > > > > > -- >> > > > > > 2.32.0.3.g01195cf9f >> > > > > > >> > > > > >
On 2024/1/11 20:00, Wen Gu wrote: > This patch set acts as the second part of the new version of [1] (The first > part can be referred from [2]), the updated things of this version are listed > at the end. > Hi Wenjia and Jan, I would appreciate any thoughts or comments you might have on this series. Thank you very much! > > # Design > > This patch set includes 3 parts: > > - Patch #1-#2: some prepare work for loopback-ism. > - Patch #3-#9: implement loopback-ism device. > - Patch #10-#15: memory copy optimization for loopback scenario. > > The loopback-ism device is designed as a ISMv2 device and not be limited to > a specific net namespace, ends of both inter-process connection (1/1' in diagram > below) or inter-container connection (2/2' in diagram below) can find the same > available loopback-ism and choose it during the CLC handshake. > > Container 1 (ns1) Container 2 (ns2) > +-----------------------------------------+ +-------------------------+ > | +-------+ +-------+ +-------+ | | +-------+ | > | | App A | | App B | | App C | | | | App D |<-+ | > | +-------+ +---^---+ +-------+ | | +-------+ |(2') | > | |127.0.0.1 (1')| |192.168.0.11 192.168.0.12| | > | (1)| +--------+ | +--------+ |(2) | | +--------+ +--------+ | > | `-->| lo |-` | eth0 |<-` | | | lo | | eth0 | | > +---------+--|---^-+---+-----|--+---------+ +-+--------+---+-^------+-+ > | | | | > Kernel | | | | > +----+-------v---+-----------v----------------------------------+---+----+ > | | TCP | | > | | | | > | +--------------------------------------------------------------+ | > | | > | +--------------+ | > | | smc loopback | | > +---------------------------+--------------+-----------------------------+ > > loopback-ism device creates DMBs (shared memory) for each connection peer. > Since data transfer occurs within the same kernel, the sndbuf of each peer > is only a descriptor and point to the same memory region as peer DMB, so that > the data copy from sndbuf to peer DMB can be avoided in loopback-ism case. > > Container 1 (ns1) Container 2 (ns2) > +-----------------------------------------+ +-------------------------+ > | +-------+ | | +-------+ | > | | App C |-----+ | | | App D | | > | +-------+ | | | +-^-----+ | > | | | | | | > | (2) | | | (2') | | > | | | | | | > +---------------|-------------------------+ +----------|--------------+ > | | > Kernel | | > +---------------|-----------------------------------------|--------------+ > | +--------+ +--v-----+ +--------+ +--------+ | > | |dmb_desc| |snd_desc| |dmb_desc| |snd_desc| | > | +-----|--+ +--|-----+ +-----|--+ +--------+ | > | +-----|--+ | +-----|--+ | > | | DMB C | +---------------------------------| DMB D | | > | +--------+ +--------+ | > | | > | +--------------+ | > | | smc loopback | | > +---------------------------+--------------+-----------------------------+ > > # Benchmark Test > > * Test environments: > - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem. > - SMC sndbuf/DMB size 1MB. > - /sys/devices/virtual/smc/loopback-ism/dmb_copy is set to default 0, > which means sndbuf and DMB are merged and no data copied between them. > - /sys/devices/virtual/smc/loopback-ism/dmb_type is set to default 0, > which means DMB is physically contiguous buffer. > > * Test object: > - TCP: run on TCP loopback. > - SMC lo: run on SMC loopback device. > > 1. ipc-benchmark (see [3]) > > - ./<foo> -c 1000000 -s 100 > > TCP SMC-lo > Message > rate (msg/s) 80636 149515(+85.42%) > > 2. sockperf > > - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp > - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30 > > TCP SMC-lo > Bandwidth(MBps) 4909.36 8197.57(+66.98%) > Latency(us) 6.098 3.383(-44.52%) > > 3. nginx/wrk > > - serv: <smc_run> nginx > - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80 > > TCP SMC-lo > Requests/s 181685.74 246447.77(+35.65%) > > 4. redis-benchmark > > - serv: <smc_run> redis-server > - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024 > > TCP SMC-lo > GET(Requests/s) 85855.34 118553.64(+38.09%) > SET(Requests/s) 86824.40 125944.58(+45.06%) > > > Change log: > > v1->RFC: > - Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics: > /sys/devices/virtual/smc/loopback-ism/xfer_bytes > - Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports > merging sndbuf with peer DMB. > - Patch #13 & #14: introduce loopback-ism device control of DMB memory type and > control of whether to merge sndbuf and DMB. They can be respectively set by: > /sys/devices/virtual/smc/loopback-ism/dmb_type > /sys/devices/virtual/smc/loopback-ism/dmb_copy > The motivation for these two control is that a performance bottleneck was > found when using vzalloced DMB and sndbuf is merged with DMB, and there are > many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused > by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg() > or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the > vmap lock contention [6]. It has significant effects, but using virtual memory > still has additional overhead compared to using physical memory. > So this new version provides controls of dmb_type and dmb_copy to suit > different scenarios. > - Some minor changes and comments improvements. > > RFC->old version([1]): > Link: https://lore.kernel.org/netdev/1702214654-32069-1-git-send-email-guwen@linux.alibaba.com/ > - Patch #1: improve the loopback-ism dump, it shows as follows now: > # smcd d > FID Type PCI-ID PCHID InUse #LGs PNET-ID > 0000 0 loopback-ism ffff No 0 > - Patch #3: introduce the smc_ism_set_v2_capable() helper and set > smc_ism_v2_capable when ISMv2 or virtual ISM is registered, > regardless of whether there is already a device in smcd device list. > - Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/. > - Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active > to activate or deactivate the loopback-ism. > - Patch #9: introduce the statistics of loopback-ism by > /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_tytes|dmbs_cnt}. > - Some minor changes and comments improvements. > > [1] https://lore.kernel.org/netdev/1695568613-125057-1-git-send-email-guwen@linux.alibaba.com/ > [2] https://lore.kernel.org/netdev/20231219142616.80697-1-guwen@linux.alibaba.com/ > [3] https://github.com/goldsborough/ipc-bench > [4] https://lore.kernel.org/all/3189e342-c38f-6076-b730-19a6efd732a5@linux.alibaba.com/ > [5] https://lore.kernel.org/all/238e63cd-e0e8-4fbf-852f-bc4d5bc35d5a@linux.alibaba.com/ > [6] https://lore.kernel.org/all/20240102184633.748113-1-urezki@gmail.com/ > > Wen Gu (15): > net/smc: improve SMC-D device dump for virtual ISM > net/smc: decouple specialized struct from SMC-D DMB registration > net/smc: introduce virtual ISM device loopback-ism > net/smc: implement ID-related operations of loopback-ism > net/smc: implement some unsupported operations of loopback-ism > net/smc: implement DMB-related operations of loopback-ism > net/smc: register loopback-ism into SMC-D device list > net/smc: introduce loopback-ism runtime switch > net/smc: introduce loopback-ism statistics attributes > net/smc: add operations to merge sndbuf with peer DMB > net/smc: attach or detach ghost sndbuf to peer DMB > net/smc: adapt cursor update when sndbuf and peer DMB are merged > net/smc: introduce loopback-ism DMB type control > net/smc: introduce loopback-ism DMB data copy control > net/smc: implement DMB-merged operations of loopback-ism > > drivers/s390/net/ism_drv.c | 2 +- > include/net/smc.h | 7 +- > net/smc/Kconfig | 13 + > net/smc/Makefile | 2 +- > net/smc/af_smc.c | 28 +- > net/smc/smc_cdc.c | 58 ++- > net/smc/smc_cdc.h | 1 + > net/smc/smc_core.c | 61 +++- > net/smc/smc_core.h | 1 + > net/smc/smc_ism.c | 71 +++- > net/smc/smc_ism.h | 5 + > net/smc/smc_loopback.c | 718 +++++++++++++++++++++++++++++++++++++ > net/smc/smc_loopback.h | 88 +++++ > 13 files changed, 1026 insertions(+), 29 deletions(-) > create mode 100644 net/smc/smc_loopback.c > create mode 100644 net/smc/smc_loopback.h >
On 18.01.24 09:27, Wen Gu wrote: > > > On 2024/1/11 20:00, Wen Gu wrote: >> This patch set acts as the second part of the new version of [1] (The >> first >> part can be referred from [2]), the updated things of this version are >> listed >> at the end. >> > > Hi Wenjia and Jan, I would appreciate any thoughts or comments you might > have > on this series. Thank you very much! > Hi Wen, I'm still in the middle of the proto type on IPPROTO_SMC and other issues, so that I need more time to review this patch series. Thank you for your patience! Wenjia
On 2024/1/18 21:59, Wenjia Zhang wrote: > > > On 18.01.24 09:27, Wen Gu wrote: >> >> >> On 2024/1/11 20:00, Wen Gu wrote: >>> This patch set acts as the second part of the new version of [1] (The first >>> part can be referred from [2]), the updated things of this version are listed >>> at the end. >>> >> >> Hi Wenjia and Jan, I would appreciate any thoughts or comments you might have >> on this series. Thank you very much! >> > Hi Wen, > > I'm still in the middle of the proto type on IPPROTO_SMC and other issues, so that I need more time to review this patch > series. > > Thank you for your patience! > Wenjia Understood. Thank you! Wenjia. Best regards, Wen Gu
On 19.01.24 02:46, Wen Gu wrote: > > > On 2024/1/18 21:59, Wenjia Zhang wrote: >> >> >> On 18.01.24 09:27, Wen Gu wrote: >>> >>> >>> On 2024/1/11 20:00, Wen Gu wrote: >>>> This patch set acts as the second part of the new version of [1] (The first >>>> part can be referred from [2]), the updated things of this version are listed >>>> at the end. >>>> >>> >>> Hi Wenjia and Jan, I would appreciate any thoughts or comments you might have >>> on this series. Thank you very much! >>> >> Hi Wen, >> >> I'm still in the middle of the proto type on IPPROTO_SMC and other issues, so that I need more time to review this patch series. >> >> Thank you for your patience! >> Wenjia > > Understood. Thank you! Wenjia. > > Best regards, > Wen Gu > Hello Wen Gu and others, I just wanted to let you know that unfortunately both Wenjia and Jan have called in sick and we don't know when they will be back at work. So I'm sorry but there may be mroe delays in the review of this patchset. Kind regards Alexandra Winter
On 2024/1/23 22:03, Alexandra Winter wrote: > Hello Wen Gu and others, > > I just wanted to let you know that unfortunately both Wenjia and Jan have called in sick and we don't know > when they will be back at work. > So I'm sorry but there may be mroe delays in the review of this patchset. > > Kind regards > Alexandra Winter Hi Alexandra, Thank you for the update. Health comes first. Wishing Wenjia and Jan both make a swift recovery. Best regards, Wen Gu
On 2024/1/24 14:33, Wen Gu wrote: > > > On 2024/1/23 22:03, Alexandra Winter wrote: >> Hello Wen Gu and others, >> >> I just wanted to let you know that unfortunately both Wenjia and Jan have called in sick and we don't know >> when they will be back at work. >> So I'm sorry but there may be mroe delays in the review of this patchset. >> >> Kind regards >> Alexandra Winter > > Hi Alexandra, > > Thank you for the update. Health comes first. Wishing Wenjia and Jan > both make a swift recovery. > > Best regards, > Wen Gu Hi, Wenjia and Jan I would like to ask if I should wait for the review of this version or send a v2 (with some minor changes) ? Thanks!
On 11.01.24 13:00, Wen Gu wrote: > This patch set acts as the second part of the new version of [1] (The first > part can be referred from [2]), the updated things of this version are listed > at the end. > > # Background > > SMC-D is now used in IBM z with ISM function to optimize network interconnect > for intra-CPC communications. Inspired by this, we try to make SMC-D available > on the non-s390 architecture through a software-implemented virtual ISM device, > that is the loopback-ism device here, to accelerate inter-process or > inter-containers communication within the same OS instance. Hello Wen Gu, thank you very much for this patchset. I have been looking at it a bit. I installed in on a testserver, but did not yet excercise the loopback-ism device. I want to continue investigations, but daily work interferes, so I thought I send you some comments now. So this is not at all a code review, but some thoughts and observations about the general concept. In [1] there was a discussion about an abstraction layer between smc-d and the ism devices. I am not sure what you are proposing now, is it an smc-d feature or independent of smc? In 3/15 you say it is part of the SMC module, but then it has its own device entry. Didn't you want to use it for other things as well? Or is it an SMC-D only feature? Is it a device (Config help: "kind of virtual device")? Or an SMC-D feature? Will we have a class of ism devices (s390 ism, ism-loopback, virtio-ism) That share common properties (internal API?) and smc-d will work with any of those? But they all can exist without smc ?! BTW: This is what we want for s390-ism. The client-registration interface [2] is currently the way to achieve this. But maybe we need a more general concept? Maybe first a preparation patchset that introduces a class/ism Together with an improved API? In case you want to use ISM devices for other purposes as well.. But then the whole picture of ism-loopback in one patchset (RFC?) has its benefits as well. Other points that I noticed: Naming: smc loopback, ism-loopback, loopback-ism ? config: why not tristate? Why under net/smc? /sys/devices/virtual/smc does not initially show up in my installation!!! root@t35lp50:/sys/devices/virtual/> ls 3270/ bdi/ block/ graphics/ iommu/ mem/ memory_tiering/ misc/ net/ tty/ vc/ vtconsole/ workqueue/ root@t35lp50:/sys/devices/virtual/> ls smc/loopback-ism active dmb_copy dmbs_cnt dmb_type subsystem@ uevent xfer_bytes root@t35lp50:/sys/devices/virtual/> ls 3270/ bdi/ block/ graphics/ iommu/ mem/ memory_tiering/ misc/ net/ smc/ tty/ vc/ vtconsole/ workqueue/ Is that normal behaviour? You introduced a class/smc Maybe class/ism would be better? The other smc interfaces do not show up in class/smc!! Not so good Why doesn't it show in smc_rnics? (Maybe some deficiency of smc_rnics?) But then it shows in other smc-tools: root@t35lp50:/sys/> smcd device FID Type PCI-ID PCHID InUse #LGs PNET-ID 0000 0 loopback-ism ffff No 0 0029 ISM 0000:00:00.0 07c1 No 0 NET1 Nice! Kind regards Sandy [1] https://lore.kernel.org/lkml/e3819550-7b10-4f9c-7347-dcf1f97b8e6b@linux.alibaba.com/ [2] 89e7d2ba61b7 ("net/ism: Add new API for client registration")
On 05.02.24 11:05, Wen Gu wrote: > > On 2024/1/24 14:33, Wen Gu wrote: >> >> >> On 2024/1/23 22:03, Alexandra Winter wrote: >>> Hello Wen Gu and others, >>> >>> I just wanted to let you know that unfortunately both Wenjia and Jan >>> have called in sick and we don't know >>> when they will be back at work. >>> So I'm sorry but there may be mroe delays in the review of this >>> patchset. >>> >>> Kind regards >>> Alexandra Winter >> >> Hi Alexandra, >> >> Thank you for the update. Health comes first. Wishing Wenjia and Jan >> both make a swift recovery. >> >> Best regards, >> Wen Gu > > Hi, Wenjia and Jan > > I would like to ask if I should wait for the review of this version > or send a v2 (with some minor changes) ? > > Thanks! Hi Wen, Finally I can carve out some time on this patches, the review is still ongoing. I'll send my comments out, as soon as I finish all of them. Thank you for the patience! Wenjia
On 2024/2/6 20:18, Alexandra Winter wrote: > > > On 11.01.24 13:00, Wen Gu wrote: >> This patch set acts as the second part of the new version of [1] (The first >> part can be referred from [2]), the updated things of this version are listed >> at the end. >> >> # Background >> >> SMC-D is now used in IBM z with ISM function to optimize network interconnect >> for intra-CPC communications. Inspired by this, we try to make SMC-D available >> on the non-s390 architecture through a software-implemented virtual ISM device, >> that is the loopback-ism device here, to accelerate inter-process or >> inter-containers communication within the same OS instance. > > > Hello Wen Gu, > > thank you very much for this patchset. I have been looking at it a bit. > I installed in on a testserver, but did not yet excercise the loopback-ism device. > I want to continue investigations, but daily work interferes, so I thought I > send you some comments now. So this is not at all a code review, but some > thoughts and observations about the general concept. > Hi Sandy and Wenjia, Thank you very much for your feedback! I am working on the detailed replies. As we are on holiday for Chinese New Year, the progress may be slower. But please feel free to leave any other comments and feedback, thank you! Best regards! Wen Gu > > In [1] there was a discussion about an abstraction layer between smc-d and the > ism devices. > I am not sure what you are proposing now, is it an smc-d feature or independent of smc? > In 3/15 you say it is part of the SMC module, but then it has its own device entry. > Didn't you want to use it for other things as well? Or is it an SMC-D only feature? > Is it a device (Config help: "kind of virtual device")? Or an SMC-D feature? > > Will we have a class of ism devices (s390 ism, ism-loopback, virtio-ism) > That share common properties (internal API?) > and smc-d will work with any of those? > But they all can exist without smc ?! BTW: This is what we want for s390-ism. > The client-registration interface [2] is currently the way to achieve this. > But maybe we need a more general concept? > > Maybe first a preparation patchset that introduces a class/ism > Together with an improved API? > In case you want to use ISM devices for other purposes as well.. > But then the whole picture of ism-loopback in one patchset (RFC?) > has its benefits as well. > > > Other points that I noticed: > > Naming: smc loopback, ism-loopback, loopback-ism ? > > config: why not tristate? Why under net/smc? > > /sys/devices/virtual/smc does not initially show up in my installation!!! > root@t35lp50:/sys/devices/virtual/> ls > 3270/ bdi/ block/ graphics/ iommu/ mem/ memory_tiering/ misc/ net/ tty/ vc/ vtconsole/ workqueue/ > root@t35lp50:/sys/devices/virtual/> ls smc/loopback-ism > active dmb_copy dmbs_cnt dmb_type subsystem@ uevent xfer_bytes > root@t35lp50:/sys/devices/virtual/> ls > 3270/ bdi/ block/ graphics/ iommu/ mem/ memory_tiering/ misc/ net/ smc/ tty/ vc/ vtconsole/ workqueue/ > Is that normal behaviour? > > You introduced a class/smc > Maybe class/ism would be better? > The other smc interfaces do not show up in class/smc!! Not so good > > Why doesn't it show in smc_rnics? > (Maybe some deficiency of smc_rnics?) > > But then it shows in other smc-tools: > root@t35lp50:/sys/> smcd device > FID Type PCI-ID PCHID InUse #LGs PNET-ID > 0000 0 loopback-ism ffff No 0 > 0029 ISM 0000:00:00.0 07c1 No 0 NET1 > Nice! > > Kind regards > Sandy > > > [1] https://lore.kernel.org/lkml/e3819550-7b10-4f9c-7347-dcf1f97b8e6b@linux.alibaba.com/ > [2] 89e7d2ba61b7 ("net/ism: Add new API for client registration")
On 11.01.24 13:00, Wen Gu wrote: > This patch set acts as the second part of the new version of [1] (The first > part can be referred from [2]), the updated things of this version are listed > at the end. > > # Background > > SMC-D is now used in IBM z with ISM function to optimize network interconnect > for intra-CPC communications. Inspired by this, we try to make SMC-D available > on the non-s390 architecture through a software-implemented virtual ISM device, > that is the loopback-ism device here, to accelerate inter-process or > inter-containers communication within the same OS instance. > > # Design > > This patch set includes 3 parts: > > - Patch #1-#2: some prepare work for loopback-ism. > - Patch #3-#9: implement loopback-ism device. > - Patch #10-#15: memory copy optimization for loopback scenario. > > The loopback-ism device is designed as a ISMv2 device and not be limited to > a specific net namespace, ends of both inter-process connection (1/1' in diagram > below) or inter-container connection (2/2' in diagram below) can find the same > available loopback-ism and choose it during the CLC handshake. > > Container 1 (ns1) Container 2 (ns2) > +-----------------------------------------+ +-------------------------+ > | +-------+ +-------+ +-------+ | | +-------+ | > | | App A | | App B | | App C | | | | App D |<-+ | > | +-------+ +---^---+ +-------+ | | +-------+ |(2') | > | |127.0.0.1 (1')| |192.168.0.11 192.168.0.12| | > | (1)| +--------+ | +--------+ |(2) | | +--------+ +--------+ | > | `-->| lo |-` | eth0 |<-` | | | lo | | eth0 | | > +---------+--|---^-+---+-----|--+---------+ +-+--------+---+-^------+-+ > | | | | > Kernel | | | | > +----+-------v---+-----------v----------------------------------+---+----+ > | | TCP | | > | | | | > | +--------------------------------------------------------------+ | > | | > | +--------------+ | > | | smc loopback | | > +---------------------------+--------------+-----------------------------+ > > loopback-ism device creates DMBs (shared memory) for each connection peer. > Since data transfer occurs within the same kernel, the sndbuf of each peer > is only a descriptor and point to the same memory region as peer DMB, so that > the data copy from sndbuf to peer DMB can be avoided in loopback-ism case. > > Container 1 (ns1) Container 2 (ns2) > +-----------------------------------------+ +-------------------------+ > | +-------+ | | +-------+ | > | | App C |-----+ | | | App D | | > | +-------+ | | | +-^-----+ | > | | | | | | > | (2) | | | (2') | | > | | | | | | > +---------------|-------------------------+ +----------|--------------+ > | | > Kernel | | > +---------------|-----------------------------------------|--------------+ > | +--------+ +--v-----+ +--------+ +--------+ | > | |dmb_desc| |snd_desc| |dmb_desc| |snd_desc| | > | +-----|--+ +--|-----+ +-----|--+ +--------+ | > | +-----|--+ | +-----|--+ | > | | DMB C | +---------------------------------| DMB D | | > | +--------+ +--------+ | > | | > | +--------------+ | > | | smc loopback | | > +---------------------------+--------------+-----------------------------+ > > # Benchmark Test > > * Test environments: > - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem. > - SMC sndbuf/DMB size 1MB. > - /sys/devices/virtual/smc/loopback-ism/dmb_copy is set to default 0, > which means sndbuf and DMB are merged and no data copied between them. > - /sys/devices/virtual/smc/loopback-ism/dmb_type is set to default 0, > which means DMB is physically contiguous buffer. > > * Test object: > - TCP: run on TCP loopback. > - SMC lo: run on SMC loopback device. > > 1. ipc-benchmark (see [3]) > > - ./<foo> -c 1000000 -s 100 > > TCP SMC-lo > Message > rate (msg/s) 80636 149515(+85.42%) > > 2. sockperf > > - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp > - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 -t 30 > > TCP SMC-lo > Bandwidth(MBps) 4909.36 8197.57(+66.98%) > Latency(us) 6.098 3.383(-44.52%) > > 3. nginx/wrk > > - serv: <smc_run> nginx > - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80 > > TCP SMC-lo > Requests/s 181685.74 246447.77(+35.65%) > > 4. redis-benchmark > > - serv: <smc_run> redis-server > - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024 > > TCP SMC-lo > GET(Requests/s) 85855.34 118553.64(+38.09%) > SET(Requests/s) 86824.40 125944.58(+45.06%) > > > Change log: > > v1->RFC: > - Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics: > /sys/devices/virtual/smc/loopback-ism/xfer_bytes > - Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports > merging sndbuf with peer DMB. > - Patch #13 & #14: introduce loopback-ism device control of DMB memory type and > control of whether to merge sndbuf and DMB. They can be respectively set by: > /sys/devices/virtual/smc/loopback-ism/dmb_type > /sys/devices/virtual/smc/loopback-ism/dmb_copy > The motivation for these two control is that a performance bottleneck was > found when using vzalloced DMB and sndbuf is merged with DMB, and there are > many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused > by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg() > or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the > vmap lock contention [6]. It has significant effects, but using virtual memory > still has additional overhead compared to using physical memory. > So this new version provides controls of dmb_type and dmb_copy to suit > different scenarios. > - Some minor changes and comments improvements. > > RFC->old version([1]): > Link: https://lore.kernel.org/netdev/1702214654-32069-1-git-send-email-guwen@linux.alibaba.com/ > - Patch #1: improve the loopback-ism dump, it shows as follows now: > # smcd d > FID Type PCI-ID PCHID InUse #LGs PNET-ID > 0000 0 loopback-ism ffff No 0 > - Patch #3: introduce the smc_ism_set_v2_capable() helper and set > smc_ism_v2_capable when ISMv2 or virtual ISM is registered, > regardless of whether there is already a device in smcd device list. > - Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/. > - Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active > to activate or deactivate the loopback-ism. > - Patch #9: introduce the statistics of loopback-ism by > /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_tytes|dmbs_cnt}. > - Some minor changes and comments improvements. > > [1] https://lore.kernel.org/netdev/1695568613-125057-1-git-send-email-guwen@linux.alibaba.com/ > [2] https://lore.kernel.org/netdev/20231219142616.80697-1-guwen@linux.alibaba.com/ > [3] https://github.com/goldsborough/ipc-bench > [4] https://lore.kernel.org/all/3189e342-c38f-6076-b730-19a6efd732a5@linux.alibaba.com/ > [5] https://lore.kernel.org/all/238e63cd-e0e8-4fbf-852f-bc4d5bc35d5a@linux.alibaba.com/ > [6] https://lore.kernel.org/all/20240102184633.748113-1-urezki@gmail.com/ > > Wen Gu (15): > net/smc: improve SMC-D device dump for virtual ISM > net/smc: decouple specialized struct from SMC-D DMB registration > net/smc: introduce virtual ISM device loopback-ism > net/smc: implement ID-related operations of loopback-ism > net/smc: implement some unsupported operations of loopback-ism > net/smc: implement DMB-related operations of loopback-ism > net/smc: register loopback-ism into SMC-D device list > net/smc: introduce loopback-ism runtime switch > net/smc: introduce loopback-ism statistics attributes > net/smc: add operations to merge sndbuf with peer DMB > net/smc: attach or detach ghost sndbuf to peer DMB > net/smc: adapt cursor update when sndbuf and peer DMB are merged > net/smc: introduce loopback-ism DMB type control > net/smc: introduce loopback-ism DMB data copy control > net/smc: implement DMB-merged operations of loopback-ism > > drivers/s390/net/ism_drv.c | 2 +- > include/net/smc.h | 7 +- > net/smc/Kconfig | 13 + > net/smc/Makefile | 2 +- > net/smc/af_smc.c | 28 +- > net/smc/smc_cdc.c | 58 ++- > net/smc/smc_cdc.h | 1 + > net/smc/smc_core.c | 61 +++- > net/smc/smc_core.h | 1 + > net/smc/smc_ism.c | 71 +++- > net/smc/smc_ism.h | 5 + > net/smc/smc_loopback.c | 718 +++++++++++++++++++++++++++++++++++++ > net/smc/smc_loopback.h | 88 +++++ > 13 files changed, 1026 insertions(+), 29 deletions(-) > create mode 100644 net/smc/smc_loopback.c > create mode 100644 net/smc/smc_loopback.h > Hi Wen, Thank you for the patience again! You can find the comments under the corresponding patches respectively. About the file hierarchy in sysfs and the names, we still have some thoughts. We need to investigate a bit more time on it. Thanks, Gerd & Wenjia
On 2024/2/6 20:18, Alexandra Winter wrote: > > > On 11.01.24 13:00, Wen Gu wrote: >> This patch set acts as the second part of the new version of [1] (The first >> part can be referred from [2]), the updated things of this version are listed >> at the end. >> >> # Background >> >> SMC-D is now used in IBM z with ISM function to optimize network interconnect >> for intra-CPC communications. Inspired by this, we try to make SMC-D available >> on the non-s390 architecture through a software-implemented virtual ISM device, >> that is the loopback-ism device here, to accelerate inter-process or >> inter-containers communication within the same OS instance. > > > Hello Wen Gu, > > thank you very much for this patchset. I have been looking at it a bit. > I installed in on a testserver, but did not yet excercise the loopback-ism device. > I want to continue investigations, but daily work interferes, so I thought I > send you some comments now. So this is not at all a code review, but some > thoughts and observations about the general concept. > Thank you very much, Sandy. > > In [1] there was a discussion about an abstraction layer between smc-d and the > ism devices. > I am not sure what you are proposing now, is it an smc-d feature or independent of smc? > In 3/15 you say it is part of the SMC module, but then it has its own device entry. > Didn't you want to use it for other things as well? Or is it an SMC-D only feature? > Is it a device (Config help: "kind of virtual device")? Or an SMC-D feature? > This patchset aims to propose an SMC feature, which is SMC-D loopback. The main work to achieve this feature is to implement an Emulated-ISM, which is loopback-ism. The loopback-ism is a 'built-in' dummy device of SMC and only serves SMC. SMC-D protocol + 'built-in dummy device' (loopback-ism device) = SMC-D loopback feature. To provide the runtime switch and statistics of loopback-ism, I need to find a sysfs entry for it, since it doesn't belong to any class (e.g. pci_bus), I created an 'smc' entry under /sys/devices/virtual/ and put loopback-ism under it. The other SMC devices, such as RoCE, s390 ISM, virtio-ism will be in their own sysfs entry, not under the /sys/devices/*virtual*/smc/. The Config help is somewhat inaccurate. To be more precise, the SMC_LO config is used to configure whether to enable this built-in dummy device for intra-OS communication. > Will we have a class of ism devices (s390 ism, ism-loopback, virtio-ism) > That share common properties (internal API?) > and smc-d will work with any of those? > But they all can exist without smc ?! BTW: This is what we want for s390-ism. > The client-registration interface [2] is currently the way to achieve this. > But maybe we need a more general concept? > I didn't mean to create a class to cover all the ISM devices. It is only for loopback-ism. Because loopback-ism can not be classified, so I create an entry under /sys/devices/virtual/. > Maybe first a preparation patchset that introduces a class/ism > Together with an improved API? > In case you want to use ISM devices for other purposes as well.. > But then the whole picture of ism-loopback in one patchset (RFC?) > has its benefits as well. > Sorry for causing, I didn't mean to create a class to cover all the ISM devices. They should be in their own sysfs entries (e.g. pci_bus), since they will be used out of SMC. Only loopback-ism belongs only to SMC. > > Other points that I noticed: > > Naming: smc loopback, ism-loopback, loopback-ism ? > > config: why not tristate? Why under net/smc? > 'SMC-D loopback' or 'SMC loopback' is used to indicate the feature or capability. 'loopback-ism' is the emulated-ISM device that 'SMC/SMC-D loopback' used. ('ism-loopback' doesn't seem to appear in my patchset) If we all agree with these, I will check all the terms in the patch and unify them. SMC_LO is used to configure whether SMC is allowed to use loopback-ism (CONFIG_SMC_LO), it acts as a check in the code, so I defined it as a bool. And loopback-ism only serves SMC-D loopback, as a feature of SMC, so the implementation (net/smc/smc_loopback.{c|h}) is under net/smc. > /sys/devices/virtual/smc does not initially show up in my installation!!! > root@t35lp50:/sys/devices/virtual/> ls > 3270/ bdi/ block/ graphics/ iommu/ mem/ memory_tiering/ misc/ net/ tty/ vc/ vtconsole/ workqueue/ > root@t35lp50:/sys/devices/virtual/> ls smc/loopback-ism > active dmb_copy dmbs_cnt dmb_type subsystem@ uevent xfer_bytes > root@t35lp50:/sys/devices/virtual/> ls > 3270/ bdi/ block/ graphics/ iommu/ mem/ memory_tiering/ misc/ net/ smc/ tty/ vc/ vtconsole/ workqueue/ > Is that normal behaviour? > /sys/devices/virtual/smc is created after SMC module initialization. During the SMC module initialization, smc_loopback_init() is called, and the /sys/devices/virtual/smc entry is created. > You introduced a class/smc > Maybe class/ism would be better? > The other smc interfaces do not show up in class/smc!! Not so good > Sorry for causing, I didn't mean to create a class to cover all the ISM devices. They should be in their own sysfs entries (e.g. pci_bus), since they can be used out of SMC. But loopback-ism is a SMC 'built-in' dummy device, it belongs only to SMC and can't be classified to other entries. > Why doesn't it show in smc_rnics? > (Maybe some deficiency of smc_rnics?) > smc_rnics can't be used on the arch other than s390. # ./smc_rnics -a Error: s390/s390x supported only > But then it shows in other smc-tools: > root@t35lp50:/sys/> smcd device > FID Type PCI-ID PCHID InUse #LGs PNET-ID > 0000 0 loopback-ism ffff No 0 > 0029 ISM 0000:00:00.0 07c1 No 0 NET1 > Nice! > Yes, this is did on patch 01/15. Best regards, Wen Gu > Kind regards > Sandy > > > [1] https://lore.kernel.org/lkml/e3819550-7b10-4f9c-7347-dcf1f97b8e6b@linux.alibaba.com/ > [2] 89e7d2ba61b7 ("net/ism: Add new API for client registration")
On 2024/2/16 22:09, Wenjia Zhang wrote: > > > On 11.01.24 13:00, Wen Gu wrote: >> This patch set acts as the second part of the new version of [1] (The first >> part can be referred from [2]), the updated things of this version are listed >> at the end. >> >> # Background >> >> SMC-D is now used in IBM z with ISM function to optimize network interconnect >> for intra-CPC communications. Inspired by this, we try to make SMC-D available >> on the non-s390 architecture through a software-implemented virtual ISM device, >> that is the loopback-ism device here, to accelerate inter-process or >> inter-containers communication within the same OS instance. >> >> # Design >> >> This patch set includes 3 parts: >> >> - Patch #1-#2: some prepare work for loopback-ism. >> - Patch #3-#9: implement loopback-ism device. >> - Patch #10-#15: memory copy optimization for loopback scenario. >> >> The loopback-ism device is designed as a ISMv2 device and not be limited to >> a specific net namespace, ends of both inter-process connection (1/1' in diagram >> below) or inter-container connection (2/2' in diagram below) can find the same >> available loopback-ism and choose it during the CLC handshake. >> >> Container 1 (ns1) Container 2 (ns2) >> +-----------------------------------------+ +-------------------------+ >> | +-------+ +-------+ +-------+ | | +-------+ | >> | | App A | | App B | | App C | | | | App D |<-+ | >> | +-------+ +---^---+ +-------+ | | +-------+ |(2') | >> | |127.0.0.1 (1')| |192.168.0.11 192.168.0.12| | >> | (1)| +--------+ | +--------+ |(2) | | +--------+ +--------+ | >> | `-->| lo |-` | eth0 |<-` | | | lo | | eth0 | | >> +---------+--|---^-+---+-----|--+---------+ +-+--------+---+-^------+-+ >> | | | | >> Kernel | | | | >> +----+-------v---+-----------v----------------------------------+---+----+ >> | | TCP | | >> | | | | >> | +--------------------------------------------------------------+ | >> | | >> | +--------------+ | >> | | smc loopback | | >> +---------------------------+--------------+-----------------------------+ >> >> loopback-ism device creates DMBs (shared memory) for each connection peer. >> Since data transfer occurs within the same kernel, the sndbuf of each peer >> is only a descriptor and point to the same memory region as peer DMB, so that >> the data copy from sndbuf to peer DMB can be avoided in loopback-ism case. >> >> Container 1 (ns1) Container 2 (ns2) >> +-----------------------------------------+ +-------------------------+ >> | +-------+ | | +-------+ | >> | | App C |-----+ | | | App D | | >> | +-------+ | | | +-^-----+ | >> | | | | | | >> | (2) | | | (2') | | >> | | | | | | >> +---------------|-------------------------+ +----------|--------------+ >> | | >> Kernel | | >> +---------------|-----------------------------------------|--------------+ >> | +--------+ +--v-----+ +--------+ +--------+ | >> | |dmb_desc| |snd_desc| |dmb_desc| |snd_desc| | >> | +-----|--+ +--|-----+ +-----|--+ +--------+ | >> | +-----|--+ | +-----|--+ | >> | | DMB C | +---------------------------------| DMB D | | >> | +--------+ +--------+ | >> | | >> | +--------------+ | >> | | smc loopback | | >> +---------------------------+--------------+-----------------------------+ >> >> # Benchmark Test >> >> * Test environments: >> - VM with Intel Xeon Platinum 8 core 2.50GHz, 16 GiB mem. >> - SMC sndbuf/DMB size 1MB. >> - /sys/devices/virtual/smc/loopback-ism/dmb_copy is set to default 0, >> which means sndbuf and DMB are merged and no data copied between them. >> - /sys/devices/virtual/smc/loopback-ism/dmb_type is set to default 0, >> which means DMB is physically contiguous buffer. >> >> * Test object: >> - TCP: run on TCP loopback. >> - SMC lo: run on SMC loopback device. >> >> 1. ipc-benchmark (see [3]) >> >> - ./<foo> -c 1000000 -s 100 >> >> TCP SMC-lo >> Message >> rate (msg/s) 80636 149515(+85.42%) >> >> 2. sockperf >> >> - serv: <smc_run> taskset -c <cpu> sockperf sr --tcp >> - clnt: <smc_run> taskset -c <cpu> sockperf { tp | pp } --tcp --msg-size={ 64000 for tp | 14 for pp } -i 127.0.0.1 >> -t 30 >> >> TCP SMC-lo >> Bandwidth(MBps) 4909.36 8197.57(+66.98%) >> Latency(us) 6.098 3.383(-44.52%) >> >> 3. nginx/wrk >> >> - serv: <smc_run> nginx >> - clnt: <smc_run> wrk -t 8 -c 1000 -d 30 http://127.0.0.1:80 >> >> TCP SMC-lo >> Requests/s 181685.74 246447.77(+35.65%) >> >> 4. redis-benchmark >> >> - serv: <smc_run> redis-server >> - clnt: <smc_run> redis-benchmark -h 127.0.0.1 -q -t set,get -n 400000 -c 200 -d 1024 >> >> TCP SMC-lo >> GET(Requests/s) 85855.34 118553.64(+38.09%) >> SET(Requests/s) 86824.40 125944.58(+45.06%) >> >> >> Change log: >> >> v1->RFC: >> - Patch #9: merge rx_bytes and tx_bytes as xfer_bytes statistics: >> /sys/devices/virtual/smc/loopback-ism/xfer_bytes >> - Patch #10: add support_dmb_nocopy operation to check if SMC-D device supports >> merging sndbuf with peer DMB. >> - Patch #13 & #14: introduce loopback-ism device control of DMB memory type and >> control of whether to merge sndbuf and DMB. They can be respectively set by: >> /sys/devices/virtual/smc/loopback-ism/dmb_type >> /sys/devices/virtual/smc/loopback-ism/dmb_copy >> The motivation for these two control is that a performance bottleneck was >> found when using vzalloced DMB and sndbuf is merged with DMB, and there are >> many CPUs and CONFIG_HARDENED_USERCOPY is set [4]. The bottleneck is caused >> by the lock contention in vmap_area_lock [5] which is involved in memcpy_from_msg() >> or memcpy_to_msg(). Currently, Uladzislau Rezki is working on mitigating the >> vmap lock contention [6]. It has significant effects, but using virtual memory >> still has additional overhead compared to using physical memory. >> So this new version provides controls of dmb_type and dmb_copy to suit >> different scenarios. >> - Some minor changes and comments improvements. >> >> RFC->old version([1]): >> Link: https://lore.kernel.org/netdev/1702214654-32069-1-git-send-email-guwen@linux.alibaba.com/ >> - Patch #1: improve the loopback-ism dump, it shows as follows now: >> # smcd d >> FID Type PCI-ID PCHID InUse #LGs PNET-ID >> 0000 0 loopback-ism ffff No 0 >> - Patch #3: introduce the smc_ism_set_v2_capable() helper and set >> smc_ism_v2_capable when ISMv2 or virtual ISM is registered, >> regardless of whether there is already a device in smcd device list. >> - Patch #3: loopback-ism will be added into /sys/devices/virtual/smc/loopback-ism/. >> - Patch #8: introduce the runtime switch /sys/devices/virtual/smc/loopback-ism/active >> to activate or deactivate the loopback-ism. >> - Patch #9: introduce the statistics of loopback-ism by >> /sys/devices/virtual/smc/loopback-ism/{{tx|rx}_tytes|dmbs_cnt}. >> - Some minor changes and comments improvements. >> >> [1] https://lore.kernel.org/netdev/1695568613-125057-1-git-send-email-guwen@linux.alibaba.com/ >> [2] https://lore.kernel.org/netdev/20231219142616.80697-1-guwen@linux.alibaba.com/ >> [3] https://github.com/goldsborough/ipc-bench >> [4] https://lore.kernel.org/all/3189e342-c38f-6076-b730-19a6efd732a5@linux.alibaba.com/ >> [5] https://lore.kernel.org/all/238e63cd-e0e8-4fbf-852f-bc4d5bc35d5a@linux.alibaba.com/ >> [6] https://lore.kernel.org/all/20240102184633.748113-1-urezki@gmail.com/ >> >> Wen Gu (15): >> net/smc: improve SMC-D device dump for virtual ISM >> net/smc: decouple specialized struct from SMC-D DMB registration >> net/smc: introduce virtual ISM device loopback-ism >> net/smc: implement ID-related operations of loopback-ism >> net/smc: implement some unsupported operations of loopback-ism >> net/smc: implement DMB-related operations of loopback-ism >> net/smc: register loopback-ism into SMC-D device list >> net/smc: introduce loopback-ism runtime switch >> net/smc: introduce loopback-ism statistics attributes >> net/smc: add operations to merge sndbuf with peer DMB >> net/smc: attach or detach ghost sndbuf to peer DMB >> net/smc: adapt cursor update when sndbuf and peer DMB are merged >> net/smc: introduce loopback-ism DMB type control >> net/smc: introduce loopback-ism DMB data copy control >> net/smc: implement DMB-merged operations of loopback-ism >> >> drivers/s390/net/ism_drv.c | 2 +- >> include/net/smc.h | 7 +- >> net/smc/Kconfig | 13 + >> net/smc/Makefile | 2 +- >> net/smc/af_smc.c | 28 +- >> net/smc/smc_cdc.c | 58 ++- >> net/smc/smc_cdc.h | 1 + >> net/smc/smc_core.c | 61 +++- >> net/smc/smc_core.h | 1 + >> net/smc/smc_ism.c | 71 +++- >> net/smc/smc_ism.h | 5 + >> net/smc/smc_loopback.c | 718 +++++++++++++++++++++++++++++++++++++ >> net/smc/smc_loopback.h | 88 +++++ >> 13 files changed, 1026 insertions(+), 29 deletions(-) >> create mode 100644 net/smc/smc_loopback.c >> create mode 100644 net/smc/smc_loopback.h >> > Hi Wen, > > Thank you for the patience again! > > You can find the comments under the corresponding patches respectively. > About the file hierarchy in sysfs and the names, we still have some thoughts. We need to investigate a bit more time on it. > Hi Wenjia and Gerd, Thank you very much! I answered each comment you left. You can find my thoughts about sysfs and knobs there. Looking forward to your further reply. Thanks! Best regards, Wen Gu > Thanks, > Gerd & Wenjia