[v25,00/20] nvme-tcp receive offloads

Message ID: 20240529160053.111531-1-aaptel@nvidia.com

Aurelien Aptel May 29, 2024, 4 p.m. UTC
Hi,

The next iteration of our nvme-tcp receive offload series, rebased on top of today's
net-next 782471db6c72 ("Merge branch 'xilinx-clock-support'").

Previous submission (v24):
https://lore.kernel.org/netdev/253a5l3qg4k.fsf@nvidia.com/T/

The changes are also available through git:
Repo: https://github.com/aaptel/linux.git branch nvme-rx-offload-v25
Web: https://github.com/aaptel/linux/tree/nvme-rx-offload-v25

The NVMe-TCP offload was presented at netdev 0x16 (video available):
- https://netdevconf.info/0x16/session.html?NVMeTCP-Offload-%E2%80%93-Implementation-and-Performance-Gains
- https://youtu.be/W74TR-SNgi4

From: Aurelien Aptel <aaptel@nvidia.com>
From: Shai Malin <smalin@nvidia.com>
From: Boris Pismenny <borisp@nvidia.com>
From: Or Gerlitz <ogerlitz@nvidia.com>
From: Yoray Zack <yorayz@nvidia.com>
From: Max Gurtovoy <mgurtovoy@nvidia.com>

=========================================

This series adds support for NVMe-TCP receive offloads. The method here
does not mandate offloading the network stack to the device.
Instead, the offloads work together with TCP to offload:
1. the copy from SKB to the block layer buffers.
2. CRC calculation and verification for received PDUs.

The series implements these as a generic offload infrastructure for storage
protocols, which we call TCP Direct Data Placement and TCP Offload CRC
respectively. We use this infrastructure to implement NVMe-TCP offload for
copy and CRC.
Future implementations can reuse the same infrastructure for other protocols
such as iSCSI.
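
To give a feel for the shape of the infrastructure, here is a rough
sketch of the driver-facing operations; the member names and signatures
below are illustrative, and the authoritative definitions live in
include/net/ulp_ddp.h in this series:

    /* Illustrative sketch only -- see include/net/ulp_ddp.h for the
     * actual struct. A NIC driver that supports ULP DDP provides
     * these hooks through its net_device_ops.
     */
    struct ulp_ddp_dev_ops_sketch {
            /* report HW limits (e.g. max sectors/segments) for a socket */
            int  (*limits)(struct net_device *netdev, struct sock *sk,
                           struct ulp_ddp_limits *limits);
            /* enable/disable the offload on a connected socket */
            int  (*sk_add)(struct net_device *netdev, struct sock *sk,
                           struct ulp_ddp_config *config);
            void (*sk_del)(struct net_device *netdev, struct sock *sk);
            /* map/unmap one IO's buffers to a protocol tag (NVMe CID) */
            int  (*setup)(struct net_device *netdev, struct sock *sk,
                          struct ulp_ddp_io *io);
            void (*teardown)(struct net_device *netdev, struct sock *sk,
                             struct ulp_ddp_io *io, void *ddp_ctx);
            /* confirm a PDU-header location guessed by the HW */
            void (*resync)(struct net_device *netdev, struct sock *sk,
                           u32 seq);
    };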

Note:
These offloads are similar in nature to the packet-based NIC TLS offloads,
which are already upstream (see net/tls/tls_device.c).
You can read more about TLS offload here:
https://www.kernel.org/doc/html/latest/networking/tls-offload.html

Queue Level
===========
The offload for IO queues is initialized after the NVMe-TCP handshake
completes, by calling `nvme_tcp_offload_socket` with the TCP socket of
the nvme_tcp_queue. This operation configures all the relevant contexts
in hardware. If it fails, the IO queue proceeds as usual with no offload.
If it succeeds, `nvme_tcp_setup_ddp` and `nvme_tcp_teardown_ddp` may be
called to perform the copy offload, and CRC offload will be used.
This initialization does not change the normal operation of NVMe-TCP in
any way besides adding the option to call the above-mentioned NDO
operations.
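
Schematically, the IO queue start path then looks roughly like this
("ddp_offloaded" is an illustrative field name, error handling trimmed):

    /* Sketch only: a failure to offload is not an error; the queue
     * simply runs in plain software mode.
     */
    static int nvme_tcp_start_queue_sketch(struct nvme_tcp_queue *queue)
    {
            queue->ddp_offloaded = !nvme_tcp_offload_socket(queue);
            return 0;
    }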

For the admin queue, NVMe-TCP does not initialize the offload.
Instead, NVMe-TCP calls the driver to configure limits for the controller,
such as max_hw_sectors and max_segments; these must be limited to
accommodate potential HW resource limits, and to improve performance.

If an error occurs and the IO queue must be closed or reconnected, the
offload is torn down and initialized again. Additionally, we handle netdev
down events via the existing error recovery flow.

IO Level
========
The NVMe-TCP layer calls the NIC driver to map block layer buffers to a
command ID (CID) using `nvme_tcp_setup_ddp` before sending the read request.
When the response is received, the NIC HW writes the PDU payload directly
into the designated buffer and builds an SKB that points into the
destination buffer. This SKB represents the entire packet received on the
wire, but it points to the block layer buffers. When NVMe-TCP attempts to
copy data from this SKB to the block layer buffer, it can skip the copy
with a check in the copying function: if (src == dst) -> skip copy.
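
Conceptually, the iov_iter change (patch 3) reduces to the following;
the real code lives in lib/iov_iter.c and is compiled out when the
offload is not configured:

    /* When the NIC already placed the payload, the skb fragment and
     * the block layer buffer are the same memory, so the copy becomes
     * a no-op.
     */
    static void ddp_copy(void *dst, const void *src, size_t len)
    {
            if (dst != src)
                    memcpy(dst, src, len);
    }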

Finally, when the PDU has been processed to completion, the NVMe-TCP layer
releases the NIC HW context by calling `nvme_tcp_teardown_ddp` which
asynchronously unmaps the buffers from NIC HW.

The NIC must release its mapping between command IDs and the target buffers.
This mapping is released when NVMe-TCP calls the NIC
driver (`nvme_tcp_teardown_ddp`).
As completing IOs is performance critical, we introduce asynchronous
completions for NVMe-TCP, i.e. NVMe-TCP calls the NIC, which will later
call NVMe-TCP to complete the IO (`nvme_tcp_ddp_teardown_done`).

At the IO level, in order to use the offload only when a clear
performance improvement is expected, the offload is used only for IOs
that are bigger than io_threshold.
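
Putting the IO-level pieces together, the read setup path behaves
roughly as follows; `ddp_io_threshold` here stands in for the
io_threshold knob, and the field and helper names are approximations of
the actual patch:

    /* Sketch: offload only reads bigger than the threshold; a setup
     * failure simply means this IO uses the regular software path.
     */
    static void nvme_tcp_try_setup_ddp(struct nvme_tcp_queue *queue,
                                       struct request *rq)
    {
            if (!queue->ddp_offloaded ||
                blk_rq_payload_bytes(rq) <= ddp_io_threshold)
                    return;                 /* small IO: keep SW receive */

            nvme_tcp_setup_ddp(queue, rq);  /* map buffers to the CID */
    }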

SKB
===
The DDP (zero-copy) and CRC offloads require two additional bits in the SKB.
The ddp bit prevents condensing of SKBs that are targeted
for zero-copy. The crc bit prevents GRO from coalescing SKBs with
different offload values. This bit is similar in concept to the
"decrypted" bit.

After offload is initialized, we use the SKB's crc bit to indicate that
"there was no problem with the verification of all CRC fields in this
packet's payload". The bit is set to zero if there was an error, or if the
HW skipped the offload for some reason. If *any* SKB in a PDU has
(crc != 1), then the ULP must compute and check the CRC itself. We perform
this check, and the accompanying software fallback, at the end of
processing a received PDU.
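
In sketch form (the ddgst_valid name comes from the changelog below,
the rest is approximate):

    /* Each received skb ANDs its crc bit into a per-PDU flag... */
    static void nvme_tcp_note_skb_crc(struct nvme_tcp_queue *queue,
                                      const struct sk_buff *skb)
    {
            queue->ddgst_valid &= skb_is_ulp_crc(skb);
    }

    /* ...and at the end of the PDU, software recomputes the digest
     * only if the HW did not verify every part of the payload.
     */
    static void nvme_tcp_pdu_done_sketch(struct nvme_tcp_queue *queue)
    {
            if (!queue->ddgst_valid)
                    nvme_tcp_ddgst_recalc(queue);   /* SW fallback */
    }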

Resynchronization flow
======================
The resynchronization flow is performed to reset the hardware tracking of
NVMe-TCP PDUs within the TCP stream. The flow consists of a request from
the hardware, proxied by the driver, regarding a possible location of a
PDU header, followed by a response from the NVMe-TCP driver.

This flow is rare, and it should happen only after packet loss or
reordering events that involve NVMe-TCP PDU headers.
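
In rough pseudocode (resync_tcp_seq appears in the changelog below;
nvme_tcp_resync_confirm is a hypothetical helper standing in for the
driver notification):

    /* The driver records the HW's guess of the TCP sequence number of
     * a PDU header; when nvme-tcp actually parses a PDU header at that
     * sequence, it confirms the guess so the NIC can resume tracking.
     */
    static void nvme_tcp_resync_response_sketch(struct nvme_tcp_queue *queue,
                                                u32 pdu_seq)
    {
            if (queue->resync_tcp_seq == pdu_seq)
                    nvme_tcp_resync_confirm(queue, pdu_seq); /* notify NIC */
    }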

CID Mapping
===========
ConnectX-7 assumes linear CIDs (0...N-1 for a queue of size N), whereas the
Linux NVMe driver uses part of the 16-bit CCID as a generation counter.
To address this, we use the existing quirk in the NVMe layer for HW
drivers that advertise that they do not support the full 16-bit CCID range.
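
For reference, upstream nvme composes the 16-bit command id roughly as
follows (see nvme_cid() in drivers/nvme/host/nvme.h); with the quirk
set, the generation bits are dropped and the CID is simply the blk-mq
tag, linear in 0..N-1 as the HW expects:

    /* Approximate upstream composition of an NVMe command id:
     *
     *     cid = ((genctr & 0xf) << 12) | rq->tag;   // default
     *     cid = rq->tag;                            // with the quirk
     */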

Enablement on ConnectX-7
========================
By default, NVMeTCP offload is disabled in the mlx5 driver and in the
nvme-tcp host. To enable it:

        # Disable CQE compression (specific for ConnectX)
        ethtool --set-priv-flags <device> rx_cqe_compress off

        # Enable the ULP-DDP
        ./tools/net/ynl/cli.py \
            --spec Documentation/netlink/specs/ulp_ddp.yaml --do caps-set \
            --json '{"ifindex": <device index>, "wanted": 3, "wanted_mask": 3}'

        # Enable ULP offload in nvme-tcp
        modprobe nvme-tcp ddp_offload=1

Following the device ULP-DDP enablement, all the IO queues/sockets which are
running on the device are offloaded.

Performance
===========
With this implementation, using the ConnectX-7 NIC, we were able to
demonstrate the following performance improvements:

Without data digest:
For  64K queued read IOs – up to 32% improvement in the BW/IOPS (111 Gbps vs. 84 Gbps).
For 512K queued read IOs – up to 55% improvement in the BW/IOPS (148 Gbps vs. 98 Gbps).

With data digest:
For  64K queued read IOs – up to 107% improvement in the BW/IOPS (111 Gbps vs. 53 Gbps).
For 512K queued read IOs – up to 138% improvement in the BW/IOPS (146 Gbps vs. 61 Gbps).

With small IOs we do not expect the offload to show a performance gain.

The test configuration:
- fio command: qd=128, jobs=8.
- Server: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz, 160 cores.

Patches
=======
Patch 1:  Introduce the infrastructure for all ULP DDP and ULP DDP CRC offloads.
Patch 2:  Add netlink family to manage ULP DDP capabilities & stats.
Patch 3:  The iov_iter change to skip copy if (src == dst).
Patch 4:  Export the get_netdev_for_sock function from TLS to a generic location.
Patch 5:  NVMe-TCP changes to call NIC driver on queue init/teardown and resync.
Patch 6:  NVMe-TCP changes to call NIC driver on IO operation
          setup/teardown, and support async completions.
Patch 7:  NVMe-TCP changes to support CRC offload on receive.
          Also, this patch moves CRC calculation to the end of the PDU
          in case the offload requires a software fallback.
Patch 8:  NVMe-TCP handling of netdev events: stop the offload if netdev is
          going down.
Patch 9:  Documentation of ULP DDP offloads.

The rest of the series is the mlx5 implementation of the offload.

Testing
=======
This series was tested on ConnectX-7 HW using various configurations
of IO sizes, queue depths, MTUs, and with both the SPDK and kernel NVMe-TCP
targets.

Compatibility
=============
* The offload works with bare metal or SR-IOV.
* The HW can support up to 64K connections per device (assuming no
  other HW accelerations are used). In this series, we introduce
  support for up to 4K connections, and we have plans to increase it.
* In the current HW implementation, the combination of NVMeTCP offload
  with TLS is not supported. If it is implemented in the future, the
  impact on the NVMe/TCP layer will be minimal.
* The ConnectX-7 NVMeTCP offload HW can support tunneling, but we
  don't see the need for this feature yet.
* NVMe poll queues are not in the scope of this series.

Future Work
===========
* NVMeTCP transmit offload.
* NVMeTCP offloads incremental features.

Changes since v24:
=================
- ulp_ddp.h: rename cfg->io_cpu to ->affinity_hint (Sagi).
- add compile-time optimization for the iov memcpy skip check (David).
- add rtnl_lock/unlock() around get_netdev_for_sock().
- fix vlan lookup in get_netdev_for_sock().
- fix NULL deref when netdev doesn't have ulp_ddp ops.
- use inline funcs for skb bits to remove ifdef (match tls code).

Changes since v23:
=================
- ulp_ddp.h: remove queue_id (Sagi).
- nvme-tcp: remove nvme_status, always set req->{result,status} (Sagi).
- nvme-tcp: rename label to ddgst_valid (Sagi).
- mlx5: remove newline from error messages (Jakub).

Changes since v22:
=================
- protect ->set_caps() with rtnl_lock().
- refactor of netdev GOING_DOWN event handler (Sagi).
- fix DDGST recalc for IOs under offload threshold.
- rebase against new mlx5 driver changes.

Changes since v21:
=================
- add netdevice_tracker to get_netdev_for_sock() (Jakub).
- remove redundant q->data_digest check (Max).

Changes since v20:
=================
- get caps&limits from nvme and remove query_limits() (Sagi).
- rename queue->ddp_status to queue->nvme_status and move out of ifdef (Sagi).
- call setup_ddp() during request setup (Sagi).
- remove null check in ddgst_recalc() (Sagi).
- remove local var in offload_socket() (Sagi).
- remove ifindex and hdr from netlink context data (Jiri).
- clean netlink notify handling and use nla_get_uint() (Jiri).
- normalize doc in ulp_ddp netlink spec (Jiri).

Changes since v19:
=================
- rebase against net-next.
- fix ulp_ddp_is_cap_active() error reported by the kernel test bot.

Changes since v18:
=================
- rebase against net-next.
- integrate with nvme-tcp tls.
- add const in parameter for skb_is_no_condense() and skb_is_ulp_crc().
- update documentation.

Changes since v17:
=================
- move capabilities from netdev to driver and add get_caps() op (Jiri).
- set stats by name explicitly, remove dump ops (Jiri).
- rename struct, functions, YAML attributes, reuse caps enum (Jiri).
- use uint instead of u64 in YAML spec (Jakub).

Changes since v16:
=================
- rebase against net-next
- minor whitespace changes
- updated CC list

Changes since v15:
=================
- add API func to get netdev & limits together (Sagi).
- add nvme_tcp_stop_admin_queue()
- hide config.io_cpu in the interface (Sagi).
- rename skb->ulp_ddp to skb->no_condense (David).

Changes since v14:
=================
- Added dumpit op for ULP_DDP_CMD_{GET,STATS} (Jakub).
- Remove redundant "-ddp-" from stat names.
- Fix checkpatch/sparse warnings.

Changes since v13:
=================
- Replace ethtool interface with a new netlink family (Jakub).
- Simplify and squash mlx5e refactoring changes.

Changes since v12:
=================
- Rebase on top of NVMe-TCP kTLS v10 patches.
- Add ULP DDP wrappers for common code and ref accounting (Sagi).
- Fold modparam and tls patches into control-path patch (Sagi).
- Take one netdev ref for the admin queue (Sagi).
- Simplify start_queue() logic (Sagi).
- Rename
  * modparam ulp_offload -> ddp_offload (Sagi).
  * queue->offload_xxx to queue->ddp_xxx (Sagi).
  * queue->resync_req -> resync_tcp_seq (Sagi).
- Use SECTOR_SHIFT (Sagi).
- Use nvme_cid(rq) (Sagi).
- Use sock->sk->sk_incoming_cpu instead of queue->io_cpu (Sagi).
- Move limits results to ctrl struct.
- Add missing ifdefs.
- Fix docs and reverse xmas tree (Simon).

Changes since v11:
=================
- Rebase on top of NVMe-TCP kTLS offload.
- Add tls support bit in struct ulp_ddp_limits.
- Simplify logic in NVMe-TCP queue init.
- Use new page pool in mlx5 driver.

Changes since v10:
=================
- Pass extack to drivers for better error reporting in the .set_caps
  callback (Jakub).
- netlink: use new callbacks, existing macros, padding, fix sizes,
  add notifications, update specs (Jakub).

Changes since v9:
=================
- Add missing crc checks in tcp_try_coalesce() (Paolo).
- Add missing ifdef guard for socket ops (Paolo).
- Remove verbose netlink format for statistics (Jakub).
- Use regular attributes for statistics (Jakub).
- Expose and document individual stats to uAPI (Jakub).
- Move ethtool ops for caps&stats to netdev_ops->ulp_ddp_ops (Jakub).

Changes since v8:
=================
- Make stats stringset global instead of per-device (Jakub).
- Remove per-queue stats (Jakub).
- Rename ETH_SS_ULP_DDP stringset to ETH_SS_ULP_DDP_CAPS.
- Update & fix kdoc comments.
- Use 80 columns limit for nvme code.

Changes since v7:
=================
- Remove ULP DDP netdev->feature bit (Jakub).
- Expose ULP DDP capabilities to userspace via ethtool netlink messages (Jakub).
- Move ULP DDP stats to dedicated stats group (Jakub).
- Add ethtool_ops operations for setting capabilities and getting stats (Jakub).
- Move ulp_ddp_netdev_ops into net_device_ops (Jakub).
- Use union for protocol-specific struct instances (Jakub).
- Fold netdev_sk_get_lowest_dev() into get_netdev_for_sock (Christoph).
- Rename memcpy skip patch to something more obvious (Christoph).
- Move capabilities from ulp_ddp.h to ulp_ddp_caps.h.
- Add "Compatibility" section on the cover letter (Sagi).

Changes since v6:
=================
- Moved IS_ULP_{DDP,CRC} macros to skb_is_ulp_{ddp,crc} inline functions (Jakub).
- Fix copyright notice (Leon).
- Added missing ifdef to allow build with MLX5_EN_TLS disabled.
- Fix space alignment, indent and long lines (max 99 columns).
- Add missing field documentation in ulp_ddp.h.

Changes since v5:
=================
- Limit the series to RX offloads.
- Added two separate skb indications to avoid wrong flushing of GRO
  when aggregating offloaded packets.
- Use accessor functions for skb->ddp and skb->crc (Eric D) bits.
- Add kernel-doc for get_netdev_for_sock (Christoph).
- Remove ddp_iter* routines and only modify _copy_to_iter (Al Viro, Christoph).
- Remove consume skb (Sagi).
- Add a knob in the ddp limits struct for the HW driver to advertise
  if they need the nvme-tcp driver to apply the generation counter
  quirk. Use this knob for the mlx5 CX7 offload.
- bugfix: use u8 flags instead of bool in mlx5e_nvmeotcp_queue->dgst.
- bugfix: use sg_dma_len(sgl) instead of sgl->length.
- bugfix: remove sgl leak in nvme_tcp_setup_ddp().
- bugfix: remove sgl leak when only using DDGST_RX offload.
- Add error check for dma_map_sg().
- Reduce #ifdef by using dummy macros/functions.
- Remove redundant netdev null check in nvme_tcp_pdu_last_send().
- Rename ULP_DDP_RESYNC_{REQ -> PENDING}.
- Add per-ulp limits struct (Sagi).
- Add ULP DDP capabilities querying (Sagi).
- Simplify RX DDGST logic (Sagi).
- Document resync flow better.
- Add ulp_offload param to nvme-tcp module to enable ULP offload (Sagi).
- Add a revert commit to reintroduce nvme_tcp_queue->queue_size.

Changes since v4:
=================
- Add transmit offload patches.
- Use one feature bit for both receive and transmit offload.

Changes since v3:
=================
- Use DDP_TCP ifdefs in iov_iter and skb iterators to minimize impact
  when compiled out (Christoph).
- Simplify netdev references and reduce the use of
  get_netdev_for_sock (Sagi).
- Avoid "static" in it's own line, move it one line down (Christoph)
- Pass (queue, skb, *offset) and retrieve the pdu_seq in
  nvme_tcp_resync_response (Sagi).
- Add missing assignment of offloading_netdev to null in offload_limits
  error case (Sagi).
- Set req->offloaded = false once -- the lifetime rules are:
  set to false on cmd_setup / set to true when ddp setup succeeds (Sagi).
- Replace pr_info_ratelimited with dev_info_ratelimited (Sagi).
- Add nvme_tcp_complete_request and invoke it from two similar call
  sites (Sagi).
- Introduce nvme_tcp_req_map_sg earlier in the series (Sagi).
- Add nvme_tcp_consume_skb and put into it a hunk from
  nvme_tcp_recv_data to handle copy with and without offload.

Changes since v2:
=================
- Use skb->ddp_crc for copy offload to avoid skb_condense.
- Default mellanox driver support to no (experimental feature).
- In iov_iter use non-ddp functions for kvec and iovec.
- Remove typecasting in NVMe-TCP.

Changes since v1:
=================
- Rework iov_iter copy skip if src==dst to be less intrusive (David Ahern).
- Add tcp-ddp documentation (David Ahern).
- Refactor mellanox driver patches into more patches (Saeed Mahameed).
- Avoid pointer casting (David Ahern).
- Rename NVMe-TCP offload flags (Shai Malin).
- Update cover-letter according to the above.

Changes since RFC v1:
=====================
- Split mlx5 driver patches to several commits.
- Fix NVMe-TCP handling of recovery flows. In particular, move queue
  offload init/teardown to the start/stop functions.

Signed-off-by: Or Gerlitz <ogerlitz@nvidia.com>


Aurelien Aptel (3):
  netlink: add new family to manage ULP_DDP enablement and stats
  net/tls,core: export get_netdev_for_sock
  net/mlx5e: NVMEoTCP, statistics

Ben Ben-Ishay (8):
  iov_iter: skip copy if src == dst for direct data placement
  net/mlx5: Add NVMEoTCP caps, HW bits, 128B CQE and enumerations
  net/mlx5e: NVMEoTCP, offload initialization
  net/mlx5e: NVMEoTCP, use KLM UMRs for buffer registration
  net/mlx5e: NVMEoTCP, queue init/teardown
  net/mlx5e: NVMEoTCP, ddp setup and resync
  net/mlx5e: NVMEoTCP, async ddp invalidation
  net/mlx5e: NVMEoTCP, data-path for DDP+DDGST offload

Boris Pismenny (4):
  net: Introduce direct data placement tcp offload
  nvme-tcp: Add DDP offload control path
  nvme-tcp: Add DDP data-path
  net/mlx5e: TCP flow steering for nvme-tcp acceleration

Or Gerlitz (3):
  nvme-tcp: Deal with netdevice DOWN events
  net/mlx5e: Rename from tls to transport static params
  net/mlx5e: Refactor ico sq polling to get budget

Yoray Zack (2):
  nvme-tcp: RX DDGST offload
  Documentation: add ULP DDP offload documentation

 Documentation/netlink/specs/ulp_ddp.yaml      |  172 +++
 Documentation/networking/index.rst            |    1 +
 Documentation/networking/ulp-ddp-offload.rst  |  372 ++++++
 .../net/ethernet/mellanox/mlx5/core/Kconfig   |   11 +
 .../net/ethernet/mellanox/mlx5/core/Makefile  |    3 +
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |    9 +
 .../net/ethernet/mellanox/mlx5/core/en/fs.h   |    4 +-
 .../ethernet/mellanox/mlx5/core/en/params.c   |   12 +-
 .../ethernet/mellanox/mlx5/core/en/params.h   |    3 +
 .../mellanox/mlx5/core/en/reporter_rx.c       |    4 +-
 .../ethernet/mellanox/mlx5/core/en/rx_res.c   |   28 +
 .../ethernet/mellanox/mlx5/core/en/rx_res.h   |    4 +
 .../net/ethernet/mellanox/mlx5/core/en/tir.c  |   15 +
 .../net/ethernet/mellanox/mlx5/core/en/tir.h  |    2 +
 .../net/ethernet/mellanox/mlx5/core/en/txrx.h |   34 +-
 .../mlx5/core/en_accel/common_utils.h         |   32 +
 .../mellanox/mlx5/core/en_accel/en_accel.h    |    3 +
 .../mellanox/mlx5/core/en_accel/fs_tcp.c      |   12 +-
 .../mellanox/mlx5/core/en_accel/fs_tcp.h      |    2 +-
 .../mellanox/mlx5/core/en_accel/ktls.c        |    2 +-
 .../mellanox/mlx5/core/en_accel/ktls_rx.c     |    6 +-
 .../mellanox/mlx5/core/en_accel/ktls_tx.c     |    8 +-
 .../mellanox/mlx5/core/en_accel/ktls_txrx.c   |   36 +-
 .../mellanox/mlx5/core/en_accel/ktls_utils.h  |   17 +-
 .../mellanox/mlx5/core/en_accel/nvmeotcp.c    | 1118 +++++++++++++++++
 .../mellanox/mlx5/core/en_accel/nvmeotcp.h    |  141 +++
 .../mlx5/core/en_accel/nvmeotcp_rxtx.c        |  355 ++++++
 .../mlx5/core/en_accel/nvmeotcp_rxtx.h        |   37 +
 .../mlx5/core/en_accel/nvmeotcp_stats.c       |   66 +
 .../mlx5/core/en_accel/nvmeotcp_utils.h       |   66 +
 .../ethernet/mellanox/mlx5/core/en_ethtool.c  |    6 +
 .../net/ethernet/mellanox/mlx5/core/en_fs.c   |    4 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |   29 +-
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |   71 +-
 .../ethernet/mellanox/mlx5/core/en_stats.h    |    8 +
 .../net/ethernet/mellanox/mlx5/core/en_txrx.c |    4 +-
 drivers/net/ethernet/mellanox/mlx5/core/fw.c  |    6 +
 .../net/ethernet/mellanox/mlx5/core/main.c    |    1 +
 drivers/nvme/host/tcp.c                       |  540 +++++++-
 include/linux/mlx5/device.h                   |   59 +-
 include/linux/mlx5/mlx5_ifc.h                 |   83 +-
 include/linux/mlx5/qp.h                       |    1 +
 include/linux/netdevice.h                     |   10 +-
 include/linux/skbuff.h                        |   41 +-
 include/net/inet_connection_sock.h            |    6 +
 include/net/ulp_ddp.h                         |  321 +++++
 include/uapi/linux/ulp_ddp.h                  |   61 +
 lib/iov_iter.c                                |    9 +-
 net/Kconfig                                   |   20 +
 net/core/Makefile                             |    1 +
 net/core/dev.c                                |   32 +-
 net/core/skbuff.c                             |    3 +-
 net/core/ulp_ddp.c                            |   51 +
 net/core/ulp_ddp_gen_nl.c                     |   75 ++
 net/core/ulp_ddp_gen_nl.h                     |   30 +
 net/core/ulp_ddp_nl.c                         |  344 +++++
 net/ipv4/tcp_input.c                          |    7 +
 net/ipv4/tcp_ipv4.c                           |    1 +
 net/ipv4/tcp_offload.c                        |    1 +
 net/tls/tls_device.c                          |   31 +-
 60 files changed, 4272 insertions(+), 159 deletions(-)
 create mode 100644 Documentation/netlink/specs/ulp_ddp.yaml
 create mode 100644 Documentation/networking/ulp-ddp-offload.rst
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/common_utils.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_stats.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_utils.h
 create mode 100644 include/net/ulp_ddp.h
 create mode 100644 include/uapi/linux/ulp_ddp.h
 create mode 100644 net/core/ulp_ddp.c
 create mode 100644 net/core/ulp_ddp_gen_nl.c
 create mode 100644 net/core/ulp_ddp_gen_nl.h
 create mode 100644 net/core/ulp_ddp_nl.c

Comments

Jakub Kicinski May 31, 2024, 1:39 a.m. UTC | #1
On Wed, 29 May 2024 16:00:33 +0000 Aurelien Aptel wrote:
> These offloads are similar in nature to the packet-based NIC TLS offloads,
> which are already upstream (see net/tls/tls_device.c).
> You can read more about TLS offload here:
> https://www.kernel.org/doc/html/latest/networking/tls-offload.html

Which I had to send a fix for blindly again today:
https://lore.kernel.org/all/20240530232607.82686-1-kuba@kernel.org/
because there is no software model and none of the vendors apparently
cares enough to rigorously test it.

I'm not joking with the continuous tests.
Christoph Hellwig May 31, 2024, 6:11 a.m. UTC | #2
FYI, I still absolutely detest this code.  I know people want to
avoid the page copy for NVMe over TCP (or any TCP based storage
protocols for that matter), but having these weird vendor-specific
hooks all the way up into the application protocol is just horrible.

IETF has standardized a generic data placement protocol, which is
part of iWarp.  Even if folks don't like RDMA it exists to solve
exactly these kinds of problems of data placement.   And if we can't
arse folks into standard data placement methods we at least need it
vendor independent and without hooks into the actual protocol
driver.
Sagi Grimberg June 3, 2024, 7:09 a.m. UTC | #3
On 31/05/2024 9:11, Christoph Hellwig wrote:
> FYI, I still absolutely detest this code.  I know people want to
> avoid the page copy for NVMe over TCP (or any TCP based storage
> protocols for that matter), but having these weird vendor-specific
> hooks all the way up into the application protocol is just horrible.

I hoped for a transparent ddp offload as well, but I don't see how this
is possible.

>
> IETF has standardized a generic data placement protocol, which is
> part of iWarp.  Even if folks don't like RDMA it exists to solve
> exactly these kinds of problems of data placement.

iWARP changes the wire protocol. Is your comment to just go make people
use iWARP instead of TCP, or to extend NVMe/TCP to natively support DDP?

I think that the former is limiting, and the latter is unclear.

From what I understand, the offload engine uses the NVMe command-id as
the rkey (or stag) for ddp purposes.

>    And if we can't
> arse folks into standard data placement methods we at least need it
> vendor independent and without hooks into the actual protocol
> driver.
>

That would be great, but what does a "vendor independent without hooks"
look like from your perspective? I'd love having this translate to
standard (and some new) socket operations, but I could not find a way
that this can be done given the current architecture.

Early on, I thought that enabling the queue offload could be modeled as
a setsockopt(), and nvme_tcp_setup_ddp() would be modeled as a new
recvmsg(MSG_DDP_BUFFER, iovec, tag), but where I got stuck was the whole
async teardown mechanism that the nic has. But if this is solvable, I
think such an interface is much better.
FWIW, I think that the benefit of this is worth having. I think that
the folks from NVIDIA are committed to supporting and evolving it.
Sagi Grimberg June 10, 2024, 10:07 a.m. UTC | #4
On 31/05/2024 4:39, Jakub Kicinski wrote:
> On Wed, 29 May 2024 16:00:33 +0000 Aurelien Aptel wrote:
>> These offloads are similar in nature to the packet-based NIC TLS offloads,
>> which are already upstream (see net/tls/tls_device.c).
>> You can read more about TLS offload here:
>> https://www.kernel.org/doc/html/latest/networking/tls-offload.html
> Which I had to send a fix for blindly again today:
> https://lore.kernel.org/all/20240530232607.82686-1-kuba@kernel.org/
> because there is no software model and none of the vendors apparently
> cares enough to rigorously test it.
>
> I'm not joking with the continuous tests.

Are there any plans to address the testing concerns here?
Christoph Hellwig June 10, 2024, 12:29 p.m. UTC | #5
On Mon, Jun 03, 2024 at 10:09:26AM +0300, Sagi Grimberg wrote:
>> IETF has standardized a generic data placement protocol, which is
>> part of iWarp.  Even if folks don't like RDMA it exists to solve
>> exactly these kinds of problems of data placement.
>
> iWARP changes the wire protocol.

Compared to plain NVMe over TCP that's a bit of an understatement :)

> Is your comment to just go make people
> use iWARP instead of TCP, or to extend NVMe/TCP to natively support DDP?

I don't know to be honest.  In many ways just using RDMA instead of
NVMe/TCP would solve all the problems this is trying to solve, but
there are enough big customers that have religious concerns about
the use of RDMA.

So if people want to use something that looks non-RDMA but has the
same benefits, we have to reinvent it quite similarly under a different
name.  Looking at DDP and what we can learn from it without bringing
the Verbs API along might be one way to do that.

Another would be to figure out what amount of similarity and what
amount of state we need in an on the wire protocol to have an
efficient header splitting in the NIC, either hard coded or even
better downloadable using something like eBPF.

> That would be great, but what does a "vendor independent without hooks"
> look like from your perspective? I'd love having this translate to
> standard (and some new) socket operations, but I could not find a way
> that this can be done given the current architecture.

Any amount of calls into NIC/offload drivers from NVMe is a nogo.
Aurelien Aptel June 10, 2024, 1:23 p.m. UTC | #6
> Are there any plans to address the testing concerns here?

Yes. We are still discussing it internally.
Sagi Grimberg June 10, 2024, 2:30 p.m. UTC | #7
On 10/06/2024 15:29, Christoph Hellwig wrote:
> On Mon, Jun 03, 2024 at 10:09:26AM +0300, Sagi Grimberg wrote:
>>> IETF has standardized a generic data placement protocol, which is
>>> part of iWarp.  Even if folks don't like RDMA it exists to solve
>>> exactly these kinds of problems of data placement.
>> iWARP changes the wire protocol.
> Compared to plain NVMe over TCP that's a bit of an understatement :)

Yes :) the comment was that people want to use NVMe/TCP, and adding
DDP awareness inspired by iWARP would change the existing NVMe/TCP
wire protocol.

This offload does not.

>
>> Is your comment to just go make people
>> use iWARP instead of TCP, or to extend NVMe/TCP to natively support DDP?
> I don't know to be honest.  In many ways just using RDMA instead of
> NVMe/TCP would solve all the problems this is trying to solve, but
> there are enough big customers that have religious concerns about
> the use of RDMA.
>
> So if people want to use something that looks non-RDMA but has the
> same benefits, we have to reinvent it quite similarly under a different
> name.  Looking at DDP and what we can learn from it without bringing
> the Verbs API along might be one way to do that.
>
> Another would be to figure out what amount of similarity and what
> amount of state we need in an on the wire protocol to have an
> efficient header splitting in the NIC, either hard coded or even
> better downloadable using something like eBPF.

From what I understand, this is what this offload is trying to do. It uses
the nvme command_id similar to how the read_stag is used in iwarp,
it tracks the NVMe/TCP pdus to split pdus from data transfers, and maps
the command_id to an internal MR for dma purposes.

What I think you don't like about this is the interface that the offload
exposes to the TCP ulp driver (nvme-tcp in our case)?

>
>> That would be great, but what does a "vendor independent without hooks"
>> look like from your perspective? I'd love having this translate to
>> standard (and some new) socket operations, but I could not find a way
>> that this can be done given the current architecture.
> Any amount of calls into NIC/offload drivers from NVMe is a nogo.
>

Not following you here...
*something* needs to program a buffer for DDP, *something* needs to
invalidate this buffer, *something* needs to declare a TCP stream as
DDP capable.

Unless what you're saying is that the interface needs to be generalized
to extend the standard socket operations (i.e.
[s|g]etsockopt/recvmsg/cmsghdr etc)?
Christoph Hellwig June 11, 2024, 6:41 a.m. UTC | #8
On Mon, Jun 10, 2024 at 05:30:34PM +0300, Sagi Grimberg wrote:
>> efficient header splitting in the NIC, either hard coded or even
>> better downloadable using something like eBPF.
>
> From what I understand, this is what this offload is trying to do. It uses
> the nvme command_id similar to how the read_stag is used in iwarp,
> it tracks the NVMe/TCP pdus to split pdus from data transfers, and maps
> the command_id to an internal MR for dma purposes.
>
> What I think you don't like about this is the interface that the offload
> exposes to the TCP ulp driver (nvme-tcp in our case)?

I don't see why a memory registration is needed at all.

The by far biggest painpoint when doing storage protocols (including
file systems) over IP based storage is the data copy on the receive
path because the payload is not aligned to a page boundary.

So we need to figure out a way that is as stateless as possible that
allows aligning the actual data payload on a page boundary in an
otherwise normal IP receive path.
Sagi Grimberg June 11, 2024, 11:01 a.m. UTC | #9
On 11/06/2024 9:41, Christoph Hellwig wrote:
> On Mon, Jun 10, 2024 at 05:30:34PM +0300, Sagi Grimberg wrote:
>>> efficient header splitting in the NIC, either hard coded or even
>>> better downloadable using something like eBPF.
>> From what I understand, this is what this offload is trying to do. It uses
>> the nvme command_id similar to how the read_stag is used in iwarp,
>> it tracks the NVMe/TCP pdus to split pdus from data transfers, and maps
>> the command_id to an internal MR for dma purposes.
>>
>> What I think you don't like about this is the interface that the offload
>> exposes to the TCP ulp driver (nvme-tcp in our case)?
> I don't see why a memory registration is needed at all.

I don't see how you can do it without memory registration.

>
> The by far biggest painpoint when doing storage protocols (including
> file systems) over IP based storage is the data copy on the receive
> path because the payload is not aligned to a page boundary.
>
> So we need to figure out a way that is as stateless as possible that
> allows aligning the actual data payload on a page boundary in an
> otherwise normal IP receive path.

But the device gets payload from the network and needs a buffer to
DMA to. In order to DMA to the "correct" buffer, it needs some sort of
pre-registration expressed with a tag that the device can infer by some
sort of stream inspection. The socket recv call from the ULP happens at
a later stage.

I am not sure I understand how the alignment assurance helps the NIC
DMA payload from the network to the "correct" buffer
(i.e. userspace doing an O_DIRECT read).
David Laight June 15, 2024, 9:34 p.m. UTC | #10
From: Christoph Hellwig
> Sent: 11 June 2024 07:42
> 
> On Mon, Jun 10, 2024 at 05:30:34PM +0300, Sagi Grimberg wrote:
> >> efficient header splitting in the NIC, either hard coded or even
> >> better downloadable using something like eBPF.
> >
> > From what I understand, this is what this offload is trying to do. It uses
> > the nvme command_id similar to how the read_stag is used in iwarp,
> > it tracks the NVMe/TCP pdus to split pdus from data transfers, and maps
> > the command_id to an internal MR for dma purposes.
> >
> > What I think you don't like about this is the interface that the offload
> > exposes to the TCP ulp driver (nvme-tcp in our case)?
> 
> I don't see why a memory registration is needed at all.
> 
> The by far biggest painpoint when doing storage protocols (including
> file systems) over IP based storage is the data copy on the receive
> path because the payload is not aligned to a page boundary.

How much does the copy cost anyway?
If the hardware has merged the segments then it should be a single copy.
On x86 (does anyone care about anything else :-) 'rep movsb' with a
cache-line aligned destination runs at 64 bytes/clock.
(The source alignment doesn't matter at all.)
I guess it loads the source data into the D-cache, the target is probably
required anyway - or you wouldn't be doing a read.

	David

> 
> So we need to figure out a way that is as stateless as possible that
> allows aligning the actual data payload on a page boundary in an
> otherwise normal IP receive path.

Aurelien Aptel June 26, 2024, 3:21 p.m. UTC | #11
Hi,

We have taken some time to review your documents and code and have had
several internal discussions regarding the CI topic. We truly appreciate
the benefits that a CI setup could bring. However, we believe that since
this feature primarily relies on nvme-tcp, it might achieve better
coverage and testing if integrated with blktest. Your design focuses on
the netdev layer, which we don't think is sufficient.

blktests/nvme is designed to test the entire nvme upstream
infrastructure including nvme-tcp that targets corner cases and bugs in
on-going development.  Chaitanya, Shinichiro, Daniel and other
developers are actively developing blktests and running these tests in
timely manner on latest branch in linux-nvme repo and for-next branch in
linux-block repo.

Again, we are open to provide NIC so that others can also test this
feature on upstream kernel on our NIC to facilitate easier testing
including distros, as long as they are testing this feature on upstream
kernel. In this way we don't have to replicate the nvme-block storage
stack infra/tools/tests in the framework that is focused on netdev
development and yet achieve good coverage, what do you think?

Thanks
Jakub Kicinski June 26, 2024, 3:50 p.m. UTC | #12
On Wed, 26 Jun 2024 18:21:54 +0300 Aurelien Aptel wrote:
> We have taken some time to review your documents and code and have had
> several internal discussions regarding the CI topic. We truly appreciate
> the benefits that a CI setup could bring. However, we believe that since
> this feature primarily relies on nvme-tcp, it might achieve better
> coverage and testing if integrated with blktests. Your design focuses on
> the netdev layer, which we don't think is sufficient.
> 
> blktests/nvme is designed to test the entire nvme upstream
> infrastructure, including nvme-tcp, targeting corner cases and bugs in
> ongoing development. Chaitanya, Shinichiro, Daniel and other
> developers are actively developing blktests and running these tests in
> a timely manner on the latest branch in the linux-nvme repo and the
> for-next branch in the linux-block repo.
> 
> Again, we are open to providing NICs so that others, including distros,
> can more easily test this feature on our NIC, as long as they are
> testing it on an upstream kernel. This way we don't have to replicate
> the nvme-block storage stack infra/tools/tests in a framework that is
> focused on netdev development, and we still achieve good coverage. What
> do you think?

I'm not sure we're on the same page. The ask is to run the tests on 
the netdev testing branch, at 12h cadence, and generate a simple JSON
file with results we can ingest into our reporting. Extra points to
reporting it to KCIDB. You mention "framework that is focused on
netdev", IDK what you mean.
Chaitanya Kulkarni June 26, 2024, 7:34 p.m. UTC | #13
Hi Jakub,

On 6/26/24 08:50, Jakub Kicinski wrote:
> On Wed, 26 Jun 2024 18:21:54 +0300 Aurelien Aptel wrote:
>> We have taken some time to review your documents and code and have had
>> several internal discussions regarding the CI topic. We truly appreciate
>> the benefits that a CI setup could bring. However, we believe that since
>> this feature primarily relies on nvme-tcp, it might achieve better
>> coverage and testing if integrated with blktests. Your design focuses on
>> the netdev layer, which we don't think is sufficient.
>>
>> blktests/nvme is designed to test the entire nvme upstream
>> infrastructure, including nvme-tcp, targeting corner cases and bugs in
>> ongoing development. Chaitanya, Shinichiro, Daniel and other
>> developers are actively developing blktests and running these tests in
>> a timely manner on the latest branch in the linux-nvme repo and the
>> for-next branch in the linux-block repo.
>>
>> Again, we are open to providing NICs so that others, including distros,
>> can more easily test this feature on our NIC, as long as they are
>> testing it on an upstream kernel. This way we don't have to replicate
>> the nvme-block storage stack infra/tools/tests in a framework that is
>> focused on netdev development, and we still achieve good coverage. What
>> do you think?
> I'm not sure we're on the same page. The ask is to run the tests on
> the netdev testing branch, at 12h cadence, and generate a simple JSON
> file with results we can ingest into our reporting. Extra points to
> reporting it to KCIDB. You mention "framework that is focused on
> netdev", IDK what you mean.

just to clarify, are you saying that you want us to:

1. Pull the latest changes from netdev current development branch
    every 12 hours.
2. Run blktests on the HEAD of that branch.
3. Gather those results into a JSON file.
4. Post it somewhere publicly accessible to you.

I didn't understand the "ingest into our reporting" part, can you
please clarify?

-ck
Jakub Kicinski June 26, 2024, 7:43 p.m. UTC | #14
On Wed, 26 Jun 2024 19:34:50 +0000 Chaitanya Kulkarni wrote:
> > I'm not sure we're on the same page. The ask is to run the tests on
> > the netdev testing branch, at 12h cadence, and generate a simple JSON
> > file with results we can ingest into our reporting. Extra points to
> > reporting it to KCIDB. You mention "framework that is focused on
> > netdev", IDK what you mean.  
> 
> just to clarify, are you saying that you want us to:
> 
> 1. Pull the latest changes from netdev current development branch
>     every 12 hours.

Small but important difference - testing branch, not development branch.
There may be malicious code in that branch, since it's not fully
reviewed.

> 2. Run blktests on the HEAD of that branch.

Or some subset of blktests, if the whole run takes too long.

> 3. Gather those results into a JSON file.
> 4. Post it somewhere publicly accessible to you.

Yes.

> I didn't understand the "ingest into our reporting" part, can you
> please clarify?

You just need to publish the JSON files with results, periodically
(publish = expose somewhere we can fetch from over HTTP).
The ingestion part is on our end.
Chaitanya Kulkarni June 26, 2024, 8:10 p.m. UTC | #15
Jakub,

On 6/26/24 12:43, Jakub Kicinski wrote:
> On Wed, 26 Jun 2024 19:34:50 +0000 Chaitanya Kulkarni wrote:
>>> I'm not sure we're on the same page. The ask is to run the tests on
>>> the netdev testing branch, at 12h cadence, and generate a simple JSON
>>> file with results we can ingest into our reporting. Extra points to
>>> reporting it to KCIDB. You mention "framework that is focused on
>>> netdev", IDK what you mean.
>> just to clarify, are you saying that you want us to:
>>
>> 1. Pull the latest changes from netdev current development branch
>>      every 12 hours.
> Small but important difference - testing branch, not development branch.
> There may be malicious code in that branch, since its not fully
> reviewed.
>

indeed, we will ask you for the right branch once we have things in place ...

>> 2. Run blktests on the HEAD of that branch.
> Or some subset of blktests, if the whole run takes too long.

we need to target the nvme-tcp blktests; I believe we can't skip any of
those tests. If needed, we can do some optimizations on the testing side
to speed things up.

>> 3. Gather those results into a JSON file.
>> 4. Post it somewhere publicly accessible to you.
> Yes.

sounds good ..

>> I didn't understand the "ingest into our reporting" part, can you
>> please clarify?
> You just need to publish the JSON files with results, periodically
> (publish = expose somewhere we can fetch from over HTTP).
> The ingestion part is on our end.

Thanks for the clarification, we will get back to you before mid next week.

We are actively building and running blktests, so we are committed to
testing.

My other question is: how can we make progress on getting this code
merged? (Assuming that we do all of the above, it will take some time
to build the infra.) IOW, can we get this code merged before we build
the above infrastructure?

If we can, which repo should it go through? nvme or netdev?

-ck
Jakub Kicinski June 26, 2024, 8:14 p.m. UTC | #16
On Wed, 26 Jun 2024 20:10:41 +0000 Chaitanya Kulkarni wrote:
> My other question is: how can we make progress on getting this code
> merged? (Assuming that we do all of the above, it will take some time
> to build the infra.) IOW, can we get this code merged before we build
> the above infrastructure?
> 
> If we can, which repo should it go through? nvme or netdev?

Let's wait for the testing.